subsider fusillade vouch
shift fusillade vouch
shift subsider vouch
shift subsider fusillade
shift subsider
shift fusillade
shift vouch
subsider fusillade
subsider vouch
fusillade vouch
shift
subsider
fusillade
vouch
ESY wrote, "I have tried, on every imagegen since DALL-E that I got around to trying, 'Eleven evil wizard schoolgirls in an archduke's library, dressed in red and black Asmodean schoolgirl uniforms, perched on armchairs and sofas' and none have drawn it well enough to use inside an online story."
Edwin Chen and Scott Alexander have discussed the following:
A proper test set could be generated along each of these lanes, with a body of sample nouns, verbs, and particles, and accompanying objective questions that any Mechanical Turk worker could judge: "Is this cat wearing lipstick?" It would adjust over time to track areas of difficulty, such as livestock with the appropriate number of legs, or hands. Leading AI companies undoubtedly have extensive internal test sets with fidelity far beyond these gross categories.
Another approach to imagegen validation would be interpretability: examining the outputs from fuzzed inputs, and decomposing prompts word by word into atomic components to identify what shifts the response.
What follows is not a thorough investigation, just a quick proof of concept that might generalize to future work if scaled.
"shift subsider fusillade vouch"
Deconstruct into subsets to identify common and unique elements
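The subset list for a prompt like this is easy to generate mechanically. Here is a minimal sketch, assuming the prompt is split on whitespace; the function name `prompt_subsets` is my own, and sending each subset to an image generator is left out since that depends on whatever tool you use.

```python
from itertools import combinations


def prompt_subsets(prompt, include_full=False):
    """Return every non-empty word subset of a prompt, largest first.

    Word order within each subset follows the original prompt.
    The full prompt itself is skipped unless include_full is True.
    """
    words = prompt.split()
    max_size = len(words) if include_full else len(words) - 1
    subsets = []
    for size in range(max_size, 0, -1):
        for combo in combinations(words, size):
            subsets.append(" ".join(combo))
    return subsets


subsets = prompt_subsets("shift subsider fusillade vouch")
for s in subsets:
    print(s)
```

For a four-word prompt this gives four triples, six pairs, and four single words. The count grows as 2^n, so longer prompts would need sampling rather than full enumeration.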
Some initial patterns emerge. Fusillade is associated with more violent imagery. Vouch seems to be associated with a young woman. Red and a grayish-blue are the dominant colors here, but it's not entirely clear where they come from, since they show up across the subsets (perhaps most strongly with fusillade, which makes some sense--maybe there's cross-pollination from recent generations?). The machine seems to associate some of these random word combos with album titles ("fusillade vouch"), placing stylized words over art reminiscent of an album cover. I'm a little amazed Midjourney wrote these words as well as it did, since it generally struggles with text.
This type of decompositional analysis will likely have mixed results--some obvious findings, and then a wall of incomprehensibility where it's difficult to deconstruct further. Elements with clear semantic implications will affect the image in expected ways. Abstract words or word combinations will unearth hallucinations with an alien consistency that are hard to analyze further.
Examination of a few hundred more examples may reveal other subtle trends. Generation would be easy to automate, albeit slow, but human analysis (searching for common elements) would be difficult to scale reliably, since it's not obvious what to look for. Asking Mechanical Turk reviewers (or image-to-text classifiers, as they mature) to tag images with elements, moods, colors, or styles could generate a dataset highlighting additional associations at scale.
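Once images are tagged, one simple way to surface associations is to compare how often a tag appears when a word is in the prompt versus when it isn't. This is a rough sketch under my own assumptions: the data format (prompt mapped to a set of tags), the function name `tag_lift`, and the rate-difference score are all illustrative choices, and the tags in the example are made-up placeholders, not real review results.

```python
def tag_lift(tagged):
    """Score each (word, tag) pair by a simple rate difference.

    tagged maps a prompt string to the set of tags reviewers gave its image.
    The score is the share of prompts containing the word that carry the tag,
    minus the share of prompts without the word that carry it.
    """
    words = {w for prompt in tagged for w in prompt.split()}
    tags = {t for ts in tagged.values() for t in ts}
    lift = {}
    for w in words:
        with_w = [ts for p, ts in tagged.items() if w in p.split()]
        without_w = [ts for p, ts in tagged.items() if w not in p.split()]
        for t in tags:
            rate_with = sum(t in ts for ts in with_w) / len(with_w)
            rate_without = (
                sum(t in ts for ts in without_w) / len(without_w)
                if without_w
                else 0.0
            )
            lift[(w, t)] = rate_with - rate_without
    return lift


# Placeholder tags for illustration only (not actual observations).
tagged = {
    "fusillade": {"tag_a"},
    "vouch": {"tag_b"},
    "fusillade vouch": {"tag_a", "tag_c"},
    "shift": {"tag_d"},
}
lift = tag_lift(tagged)
print(lift[("fusillade", "tag_a")])
```

A score near 1 means the tag shows up almost only when the word is present, and a score near 0 means the word makes little difference. With only a handful of images per subset the scores are noisy, so this is more of a way to decide where to look than a finding in itself.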
Generating a grid from common sets of 100 nouns and 100 adjectives (or of noun-verb or adverb-verb pairs) could make it easy to track patterns that emerge from the data.
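Building the prompt grid itself is trivial. A small sketch, with tiny placeholder word lists standing in for the 100-word sets (the words and the name `prompt_grid` are my own, chosen only for illustration):

```python
def prompt_grid(adjectives, nouns):
    """Return a 2D grid of prompts: one row per adjective, one column per noun."""
    return [[f"{adj} {noun}" for noun in nouns] for adj in adjectives]


# Placeholder word lists; a real run would use ~100 of each.
adjectives = ["red", "ancient"]
nouns = ["cat", "library", "wizard"]

grid = prompt_grid(adjectives, nouns)
for row in grid:
    print(" | ".join(row))
```

Keeping the row and column structure means the generated images can be laid out in the same grid, so a pattern tied to one adjective shows up as a consistent row and a pattern tied to one noun as a consistent column.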
As imagegen quickly scales, there will be new opportunities to identify implicit features in new models, but the use of abstract terms will become more critical as the generations become more faithful to the inputs.