A few examples of AI-generated images I created from text prompts
Essentially, these are models that can generate an image from a text prompt. The most famous open-source ones are VQGAN+CLIP, CLIP-Guided Diffusion, and DALL-E Mini.
The main ingredient for all the text-to-image generation AI models is a dataset of hundreds of millions of “image-and-text pairs”: images with labels describing what they are.
Those hundreds of millions of image-and-text pairs are then used to train neural networks (such as CLIP or DALL-E) that “learn” features and the connections between the text and the images without a human telling them what is what. For example, a model can learn that there’s a dog, or a grass field, or a red ball, or even a dog’s mouth, just by having enough examples of different things in the dataset and “learning” what those are.
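The “learning without a human telling them what is what” part can be sketched as a contrastive objective: embeddings of matching image–text pairs should score higher than mismatched ones from the same batch. A minimal NumPy sketch (the function name, shapes, and temperature value are illustrative, not CLIP’s actual code):

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style contrastive loss: matching image/text pairs (the
    diagonal of the similarity matrix) should score higher than all
    mismatched pairs in the batch."""
    # L2-normalize so the dot product becomes cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature   # (batch, batch)
    # softmax cross-entropy with the diagonal (the true pairs) as targets
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = probs / probs.sum(axis=1, keepdims=True)
    return -np.mean(np.log(np.diag(probs)))
```

After training on enough pairs, embeddings of a photo of a dog and the caption “a dog” end up close together in the shared space, which is what lets the model be reused as a text-image similarity scorer.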
Once these models are trained, they can then be used - either directly (as in DALL-E-like models) or indirectly (as in CLIP-guided models) - as guidance to generate new images from a text. The idea here is that the model is intertwined with other models that are trained to be good at image generation (such as VQ-VAE, VQGAN, or Guided Diffusion), and its learned measure of how well an image matches a text (used as a loss, or error, function) guides those models to generate an image that satisfies that text.
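The guidance idea can be sketched as an optimization loop: keep adjusting the generator’s latent input so that the image embedding of the output moves closer to the text embedding. Everything below - `generate`, `embed_image`, the finite-difference gradients - is a toy stand-in for the real VQGAN/CLIP pipeline, which uses automatic differentiation (e.g. PyTorch) instead:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity - the stand-in for CLIP's matching score."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def clip_guided_generate(z, generate, embed_image, text_emb,
                         steps=100, lr=0.2, eps=1e-4):
    """Toy CLIP-guided loop: repeatedly nudge the latent z so that
    embed_image(generate(z)) moves closer to text_emb."""
    for _ in range(steps):
        base = cosine_sim(embed_image(generate(z)), text_emb)
        grad = np.zeros_like(z)
        for i in range(z.size):  # finite-difference gradient per latent dim
            zp = z.copy()
            zp[i] += eps
            grad[i] = (cosine_sim(embed_image(generate(zp)), text_emb) - base) / eps
        z = z + lr * grad        # ascend the text-image similarity
    return z
```

In the real pipelines, `generate` is a frozen image generator (VQGAN or a diffusion model), `embed_image` is CLIP’s image encoder, and the gradient of the similarity flows back through both to update the latent.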
Check out the excellent article The Weird and Wonderful World of AI Art, which goes into way more detail, but in summary:
OpenAI’s DALL-E started the current trend in January 2021. They haven’t released their pre-trained models, but they did release CLIP (a model that can score how well an image matches a text, as described above). With that, Ryan Murdock (@advadnoun) started the trend of hooking CLIP up with image generation models, with Big Sleep + CLIP. After that, Katherine Crowson (@rivershavewings) hooked CLIP up with the VQGAN image generation neural network, starting a text-to-image Cambrian explosion. A few months later she did the same with CLIP + a model called Guided Diffusion, which dramatically increased the quality of the generations. In 2022 the process only accelerated: the number of models released grew so much that I created a newsletter to keep up with it. The highlights of 2022 so far are definitely the release of Latent Diffusion models, as well as DALL-E 2.
Vice/Motherboard featured this scene in a July 2021 piece called "AI-Generated Art Scene Explodes as Hackers Create Groundbreaking New Tools"
Other prominent people in this field: @jbusted1, @nshepperd1, @BoneAmputee, dribnet, @danielrussruss, @bakztfuture. I also recommend the following Discords: our own Multimodal Art, EleutherAI, and LAION
The AIAIArt course (technical): The AI art open source course by Jonathan Whitaker
Understanding Dall-E: Two minute papers video, Yannic Kilcher video, original blog post
Understanding CLIP: Yannic Kilcher video, original blog post
Understanding CLIP text-to-image generation: Artificial Images video
Understanding VQGAN+CLIP: Adafruit blog post, Bestiario del Hypogripho
Understanding Diffusion models: Non-technical video (with technical bits), in-depth video explanation, in-depth step-by-step implementation
Understanding DALL-E 2: Two Minute Papers, DALL-E 2 first look
The next 10 years of Multimodal art by Bakz T. Future
We just released MindsEye beta, a GUI for running multiple multimodal art models. Check it out here
Besides that, check out this list of tools and other resources you can run on your own, or pick the ones that fit your use case:
Check out MindsEye!