>>
|
No. 2927
[Edit]
>>2926
>Are you sure those aren't incredibly similar to some "base image"?
This is a good question. Here's some rough thoughts based on whatever I know (not much, maybe someone here who's doing active research in the field can give a better response)
First thing we need to do is to make sure the outputs we're referring to are actually the result of stable diffusion used in the manner of generating outputs conditioned on a given prompt, instead of alternate uses like img2img combined with embeddings. By this what I mean is that someone could theoretically create a pipeline that does the following:
* Create text embeddings for a given booru database, essentially allowing you to find the closest booru image matching a given set of tags (most booru images are already tagged, but this would allow you to "expand" the tag set by inferring tags)
* On a given input prompt, lookup the closest image in the booru from the text embeddings
* Use stable diffusion in an img2img transformation way, where you provide BOTH the original prompt as well as the booru image, to get the final result.
Many of the models that people are playing with are closed-source accessible only via web api, so there's no way to know what exactly they are doing. Some savvy businessperson who wanted to capitalize on this could well do the above and their output would consistently be better than competitors because they're effectively "cheating", although the general audience likely doesn't care because it does do what they want.
Now even if we're using stable diffusion conditioned ONLY on input prompt, there's a question of to what extent it has overfit/memorized its training data. A naive information theoretic argument would be that the model is only a few gigs but the training set is several magnitudes larger, so it couldn't have memorized a substantial chunk. This would be flawed though, for two reasons:
1) The model is acting as an information compressor, so that it can capture defining characteristics of the input distribution without actual copying it verbatim. In some sense this is what you want, but it's a balance of being able to learn general characteristics without overfitting. It's of note that stable difufsion can actually be used in this manner for compression [1]. This is not an issue for real-world objects, because in the real world an apple is an apple, and is pretty much fungible, so only the composition really matters, and indeed we've seen stable diffusion can do novel composition pretty well. For 2D girls though it's more style that matters, and the shapes are basically identical so it may well have endedup overfitting on style instead of generlizing it (that is, is there some space of "all possible styles" that is captured in the latent space of the model and can be sampled from?) We should be able to see how close a generated is to an image in the training set by just computing similarity metric. It's known that images over-represented in the training set (i.e. non-deduped training data) will allow the model to reproduce images identically).
2) Realted to the above, since we are prompt-engineering and cherry-picking prompts, is it possible that there are essentially a fixed number of "good" images that the model then transforms compositionally (hair color, dress, background, accessories) and in the process of cherry-picking and rejecting poor outputs we are essentially driving towards these valleys? That is, since not every output of the model is usable, even if in aggregate that the outputs are novel, we have to see what fraction of usable outputs are novel.
I can't find any papers trying to match up the outputs of stable diffusion back to the original training set, and even if they existed they'd probably use the LAION set the original stable diffusion was trained on instead of the booru specific one.
To summarize, for models trained on real-world data, we know that it does novel compositions (because "a squirrel riding an octopus" isn't remotely part of the training set) and style is irrelevant, so it's effectively hard to overfit. For 2d anime input sets, composition is not relevant, but style is, and it may well have ended up overfitting on style.
As you mentioned, image editing seems to be the real immediate killer feature though. Inpainting, outpainting, and stylistic or compositional manipulations can be done with ease now ("Give me an image like X but with black panties and red hair", "image like X but holding a coffee cup"). This may also means that artists with distinctive styles will be appreciated, since mediocre artists with an "average" style won't be much distinguishable from the output of the model.
One thing I'd like to see someone do is fix a given prompt and then iterate over random seeds, generating a set of possible outputs for a given prompt. This should be a quick way to see the breadth of styles the model can output.
[1] https://medium.com/stable-diffusion-based-image-compresssion-6f1f0a399202
|