Building imagegen test sets.

ESY wrote, "I have tried, on every imagegen since DALL-E that I got around to trying, 'Eleven evil wizard schoolgirls in an archduke's library, dressed in red and black Asmodean schoolgirl uniforms, perched on armchairs and sofas' and none have drawn it well enough to use inside an online story."

Edwin Chen and Scott Alexander have discussed the following:

These examples test several constraints at once, where it may be more useful to break them into smaller tests. Like ESY's prompt, they may ultimately be a test of model capacity.

A proper test set could be generated along each of these lanes, drawing on a body of sample nouns, verbs, and particles, with accompanying objective questions that any Mechanical Turk worker could judge: "Is this cat wearing lipstick?" The set would adjust over time to track areas of difficulty, such as livestock with the appropriate number of legs, or hands. Leading AI companies undoubtedly have extensive internal test sets with fidelity far beyond these gross categories.
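As a rough illustration of the idea, here is a minimal sketch that crosses small word lists into prompts and attaches yes/no questions a reviewer could answer. The word lists, the prompt template, and the function names `build_cases` and `pass_rate` are all assumptions made for this example, not an existing tool.

```python
from itertools import product

# Hypothetical sample vocabulary; a real test set would draw on far larger lists.
SUBJECTS = ["cat", "horse"]
ACCESSORIES = ["lipstick", "a top hat"]


def build_cases(subjects, accessories):
    """Pair each generated prompt with yes/no questions a reviewer can answer."""
    cases = []
    for subject, accessory in product(subjects, accessories):
        cases.append({
            "prompt": f"A {subject} wearing {accessory}",
            "questions": [
                f"Is there a {subject} in the image?",
                f"Is the {subject} wearing {accessory}?",
            ],
        })
    return cases


def pass_rate(answers):
    """Fraction of reviewer answers that were 'yes' (True)."""
    return sum(answers) / len(answers) if answers else 0.0


cases = build_cases(SUBJECTS, ACCESSORIES)
for case in cases:
    print(case["prompt"], "->", case["questions"])
```

Tracking `pass_rate` per question type over many generations is one way the set could surface persistent areas of difficulty.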

See also

Decomposing Midjourney

Another approach to imagegen validation would be interpretability: fuzzing the inputs, separating them into atomic components, and decomposing prompts word by word to identify shifts in the outputs.

What follows is not a thorough investigation, just a quick proof of concept that might generalize to future work if scaled.

Begin with four random words.

"shift subsider fusillade vouch"

Subsets

Deconstruct the prompt into subsets to identify common and unique elements:

subsider fusillade vouch

shift fusillade vouch

shift subsider vouch

shift subsider fusillade

shift subsider

shift fusillade

shift vouch

subsider fusillade

subsider vouch

fusillade vouch

shift

subsider

fusillade

vouch
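The subset list above can be produced mechanically. This is a minimal sketch using Python's standard `itertools.combinations`; the function name `prompt_subsets` is an assumption for illustration, and the ordering within each size may differ from the listing above.

```python
from itertools import combinations


def prompt_subsets(prompt):
    """Return every non-empty, order-preserving subset of the prompt's words,
    largest subsets first."""
    words = prompt.split()
    subsets = []
    for size in range(len(words), 0, -1):
        for combo in combinations(words, size):
            subsets.append(" ".join(combo))
    return subsets


subsets = prompt_subsets("shift subsider fusillade vouch")
for s in subsets:
    print(s)
```

For a four-word prompt this yields 15 prompts (1 + 4 + 6 + 4), and the count grows as 2^n - 1, so longer prompts quickly become expensive to generate and review.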

Conclusion

Some initial patterns emerge. Fusillade is associated with more violent imagery. Vouch seems to be associated with a young woman. Red and a grayish-blue are the dominant colors here, but it's not entirely clear where they come from; they show up across the subsets (perhaps most strongly with fusillade, which makes some sense--maybe there's cross-pollination from recent generations?). The machine seems to associate some of these random word combos with album titles ("fusillade vouch"), placing stylized words over art reminiscent of an album cover. I'm a little amazed Midjourney wrote these words as well as it did, since it generally struggles with text.

This type of decompositional analysis will likely have mixed results--some obvious findings, and then a wall of incomprehensibility where it's difficult to deconstruct further. Elements with clear semantic implications will affect the image in expected ways. Abstract words or word combinations will unearth hallucinations with an alien consistency that are hard to analyze further.

Further research

Examination of a few hundred more examples may reveal other subtle trends. Generation would be easy to automate, albeit slow, but human analysis (searching for common elements) would be difficult to scale reliably, since it's not obvious what to look for. Asking Mechanical Turk reviewers (or image-to-text classifiers, as they mature) to tag images with elements, moods, colors, or styles could generate a dataset highlighting additional associations at scale.
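One way such a tagged dataset might be summarized is sketched below: for each prompt word, compute the share of its images that carry each tag. The record format, the function name `tag_rates_by_word`, and the sample records are all invented for illustration; the placeholder words and tags are not results from the experiment above.

```python
from collections import Counter, defaultdict


def tag_rates_by_word(records):
    """records: list of (prompt, tags) pairs.
    Returns {word: {tag: share of that word's images carrying the tag}}."""
    image_counts = Counter()
    tag_counts = defaultdict(Counter)
    for prompt, tags in records:
        for word in set(prompt.split()):
            image_counts[word] += 1
            tag_counts[word].update(set(tags))
    return {
        word: {tag: n / image_counts[word] for tag, n in tags.items()}
        for word, tags in tag_counts.items()
    }


# Made-up placeholder records, for illustration only.
records = [
    ("alpha beta", ["red", "text"]),
    ("alpha", ["red", "smoke"]),
    ("beta", ["portrait"]),
]
rates = tag_rates_by_word(records)
for word, tags in sorted(rates.items()):
    print(word, tags)
```

A tag whose rate is high for one word and low everywhere else is a candidate association worth a closer look; with few images per word, the rates would be too noisy to trust.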

Generating a grid from a set of 100 common nouns and 100 common adjectives, or from noun-verb or adverb-verb pairs, could make it easier to track patterns that emerge from the data.
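A minimal sketch of such a grid follows, assuming short placeholder word lists in place of the 100-word sets. The names `build_grid`, `row`, and `column` are hypothetical helpers for this example.

```python
from itertools import product

# Placeholder lists; the real grid would use roughly 100 words on each axis.
ADJECTIVES = ["red", "ancient"]
NOUNS = ["cat", "library", "wizard"]


def build_grid(adjectives, nouns):
    """Return {(adjective, noun): prompt} for every pairing."""
    return {(adj, noun): f"{adj} {noun}" for adj, noun in product(adjectives, nouns)}


def row(grid, adjective):
    """All prompts sharing one adjective, to compare what the adjective contributes."""
    return [prompt for (adj, _), prompt in grid.items() if adj == adjective]


def column(grid, noun):
    """All prompts sharing one noun, to compare what the noun contributes."""
    return [prompt for (_, n), prompt in grid.items() if n == noun]


grid = build_grid(ADJECTIVES, NOUNS)
print(row(grid, "red"))
print(column(grid, "cat"))
```

Reading along a row or column holds one word fixed, which makes it easier to see which visual elements travel with that word. A full 100 by 100 grid is 10,000 prompts, so generation time is the main cost.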

As imagegen quickly scales, there will be new opportunities to identify implicit features in new models, but the use of abstract terms will become more critical as the generations become more faithful to the inputs.