Building imagegen test sets.
ESY wrote, "I have tried, on every imagegen since DALL-E that I got around to trying, "Eleven evil wizard schoolgirls in an archduke's library, dressed in red and black Asmodean schoolgirl uniforms, perched on armchairs and sofas" and none have drawn it well enough to use inside an online story."
Edwin Chen and Scott Alexander have discussed the following:
- A red cat sitting on top of a blue dog next to a purple lake, with a black pig flying in the sky
- A stained glass picture of a woman in a library with a raven on her shoulder with a key in its mouth
- An oil painting of a man in a factory looking at a cat wearing a top hat
- A digital art picture of a child riding a llama with a bell on its tail through a desert
- A 3D render of an astronaut in space holding a fox wearing lipstick
- Pixel art of a farmer in a cathedral holding a red basketball
These prompts test several constraints at once; it may be more useful to break them into smaller tests:
- Object-level mastery (aka table stakes): Can it draw a variety of individual nouns and actions convincingly?
- Object adjustments, color: Can the engine map colors to the correct objects? Can it draw objects with nonstandard colors?
- Object adjustments, gerunds: Can the engine map different actions to the component objects?
- Compositional: Can it apply a style when one is included in the prompt in natural language?
- Grammar: Can it correctly apply the directionality of verbs and prepositions? Can it distinguish subjects from direct objects?
- Grammar: Can it do so when these are chained? ("an X on top of a Y that is beside a Z...")
- Grammar: Can it handle "in" (and related prepositions) in its different senses, whether denoting the overall setting or denoting how something should be held or placed?
- Capacity: How many elements can you stack before it either breaks or begins to ignore some elements?
The prompts above, and ESY's, test many of these categories at once, and may ultimately be tests of capacity.
A proper test set could be generated along each of these lanes, with a body of sample nouns, verbs, and particles, and with accompanying objective questions that any MTurk worker could judge: "Is this cat wearing lipstick?"
It would adjust over time to target known areas of difficulty, such as livestock with the appropriate number of legs, or hands. Undoubtedly leading AI companies have extensive internal test sets with fidelity far beyond these gross categories.
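As a rough illustration of one lane, here is a minimal sketch; the word lists, prompt template, and judgment question are hypothetical placeholders rather than a real test set:

```python
import json

# Hypothetical sample vocabulary for the "object adjustments, color" lane;
# a real test set would draw on much larger word lists.
NOUNS = ["cat", "basketball", "llama", "raven", "armchair"]
COLORS = ["red", "purple", "grayish-blue", "black"]

def color_lane():
    """Yield prompt/question pairs testing whether colors map to the right objects."""
    for color in COLORS:
        for noun in NOUNS:
            yield {
                "lane": "color",
                "prompt": f"a {color} {noun}",
                # Objective yes/no question any MTurk reviewer could judge.
                "question": f"Is the {noun} in this image {color}?",
            }

if __name__ == "__main__":
    for case in color_lane():
        print(json.dumps(case))
```

The other lanes (gerunds, style, prepositions, capacity) would follow the same pattern: a small template per lane plus a yes/no question per generated image.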
See also
Decomposing Midjourney
Another approach to imagegen validation would be interpretability: fuzz the inputs, separate them into atomic components, and look for shifts in the outputs, decomposing prompts word by word.
The below is not a thorough investigation, just a quick proof of concept that might generalize to future work if scaled.
"shift subsider fusillade vouch"



Subsets
Deconstruct the prompt into subsets to identify common and unique elements, generating an image for each subset.
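Enumerating the subsets is mechanical; a minimal sketch in plain Python, which prints the subsets listed below:

```python
from itertools import combinations

PROMPT = "shift subsider fusillade vouch"

def proper_subsets(prompt: str):
    """Yield every non-empty proper subset of the prompt's words, largest first."""
    words = prompt.split()
    for size in range(len(words) - 1, 0, -1):
        for combo in combinations(words, size):
            yield " ".join(combo)

if __name__ == "__main__":
    for sub in proper_subsets(PROMPT):
        print(sub)  # each line is a prompt to feed back to the image model
```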
- subsider fusillade vouch
- shift fusillade vouch
- shift subsider vouch
- shift subsider fusillade
- shift subsider
- shift fusillade
- shift vouch
- subsider fusillade
- subsider vouch
- fusillade vouch
- shift
- subsider
- fusillade
- vouch

Conclusion
Some initial patterns emerge. Fusillade is associated with more violent imagery. Vouch seems to be associated with a young woman. Red and a grayish-blue are the dominant colors here, but it's not entirely clear where they come from; they show up across the subsets (perhaps most strongly with fusillade, which makes some sense; maybe there's cross-pollination from recent generations?). The machine seems to associate some of these random word combos with album titles ("fusillade vouch"), placing stylized words over art reminiscent of an album cover. I'm a little amazed Midjourney wrote these words as well as it did, since it generally struggles with text.
This type of decompositional analysis will likely have mixed results: some obvious findings, then a wall of incomprehensibility where it's difficult to deconstruct further. Elements with clear semantic implications will affect the image in expected ways. Abstract words or word combinations will unearth hallucinations with an alien consistency that is hard to analyze further.
Further research
Examination of a few hundred more examples may reveal other subtle trends. Generation would be easy to automate, albeit slow, but human analysis (searching for common elements) would be difficult to scale reliably, since it's not obvious what to look for. Asking Mechanical Turk reviewers (or image-to-text classifiers, as they mature) to tag images with elements, moods, colors, or styles could generate a dataset highlighting additional associations at scale.
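As a sketch of the classifier route: a zero-shot model such as CLIP could score each generated image against a candidate tag list. The model name, tag list, and file path below are illustrative assumptions, not part of this experiment.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate tags covering elements, moods, colors, and styles (illustrative).
TAGS = [
    "violent imagery",
    "a young woman",
    "album cover art",
    "a red color palette",
    "a grayish-blue color palette",
]

def tag_image(path: str, top_k: int = 3):
    """Return the top_k candidate tags for one generated image."""
    image = Image.open(path)
    inputs = processor(text=TAGS, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    return sorted(zip(TAGS, probs.tolist()), key=lambda p: p[1], reverse=True)[:top_k]

# e.g. tag_image("outputs/fusillade_vouch.png")  # hypothetical file path
```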
Generating a grid from the common sets of 100 nouns and 100 adjectives, or of noun-verb or adverb-verb pairs, could make it easy to track the patterns that emerge from the data.
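A minimal sketch of producing such a grid, with short hypothetical word lists standing in for the full 100-item sets:

```python
import csv
from itertools import product

# Hypothetical word lists standing in for the common 100-noun / 100-adjective sets.
ADJECTIVES = ["red", "wooden", "broken", "ancient"]
NOUNS = ["farmer", "cathedral", "llama", "fox"]

# One row per grid cell; reviewer or classifier tags can later be joined on
# (adjective, noun) to see which pairings a model handles well or poorly.
with open("prompt_grid.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["adjective", "noun", "prompt"])
    for adj, noun in product(ADJECTIVES, NOUNS):
        writer.writerow([adj, noun, f"a {adj} {noun}"])
```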
As imagegen quickly scales, there will be new opportunities to identify implicit features in new models, but the use of abstract terms will become more critical as the generations become more faithful to the inputs.