welcome to multimodal.art

A comprehensive guide to understanding the multimodal AI art scene and creating your own text-to-image (and other) pieces

A few examples of AI-generated images I created from text prompts

A ritual to bring a Tamagotchi back to life (CLIP Guided Diffusion)
Neuromancer in Ukiyo-e style (VQGAN+CLIP)
The inauguration of a wormhole between Shanghai and New York (VQGAN+CLIP)
A mecha-robot in a favela by James Gurney (CLIP Guided Diffusion)

(more here, and check out our portfolio for my participation in real-world AI art exhibitions)

Index

What is it (non-technical, not in depth)

Essentially, these are models that can generate an image from a text prompt. The most famous open-source ones are VQGAN+CLIP, CLIP Guided Diffusion, and Dall-E Mini.

The main ingredient for all of these text-to-image generation models is a dataset of hundreds of millions of “image-and-text pairs”, that is, images with labels describing what they depict.

Example of a text-image pair

Those hundreds of millions of image-and-text pairs are then used to train neural networks (such as CLIP or DALL-E) that “learn” visual features and the connections between the text and the images without a human telling them what is what. For example, a model can learn what a dog, a grass field, a red ball, or even a dog’s mouth looks like just by seeing enough examples of those things in the dataset.

Once these models are trained, they can be used - either directly (as in Dall-E-like models) or indirectly (as in CLIP-guided models) - as guidance to generate new images from a text prompt. In the CLIP-guided case, CLIP is paired with another model trained to be good at image generation (such as VQ-VAE, VQGAN or Guided Diffusion), and its learned measure of how well a text matches an image (used as a loss or error function) steers that generator toward an image that satisfies the text.
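To make the guidance idea more concrete, here is a minimal, hypothetical sketch in PyTorch using OpenAI's CLIP package. It is not the code of any of the notebooks linked below: instead of steering a real generator like VQGAN or a diffusion model, it optimizes raw pixels directly so that CLIP's similarity between the image and the prompt goes up, which is the same loss-as-guidance trick in its simplest form.

```python
# Illustrative sketch of CLIP guidance (assumes torch and the openai/CLIP
# package are installed). Real pipelines steer a generator's latent codes;
# here we steer raw pixels just to show the loss-as-guidance idea.
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()  # keep everything in fp32 for simplicity

prompt = "a mecha-robot in a favela"  # any text prompt
with torch.no_grad():
    text_feat = model.encode_text(clip.tokenize([prompt]).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

# Start from random noise; in VQGAN+CLIP this would be the generator's latent.
image = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
opt = torch.optim.Adam([image], lr=0.05)

for step in range(200):
    opt.zero_grad()
    img_feat = model.encode_image(image)  # a real pipeline would also apply CLIP's preprocessing
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    # The "loss" is just the negative cosine similarity between image and text.
    loss = -(img_feat * text_feat).sum()
    loss.backward()
    opt.step()
    with torch.no_grad():
        image.clamp_(0, 1)  # keep pixel values in a valid range
```

Actual notebooks such as VQGAN+CLIP or CLIP Guided Diffusion apply the same gradient signal to the generator's latent variables rather than to pixels, and add tricks such as random crops and augmentations to make the results look good.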

Short history of how we got here

Check out the excellent article The Weird and Wonderful World of AI Art that goes into way more detail, but in summary:
OpenAI’s Dall-E started the current trend in January 2021. OpenAI never released the pre-trained model, but it did release CLIP (a model that can score how well an image matches a text, as described above). With that, Ryan Murdock (@advadnoun) started the trend of hooking CLIP up to image generation models, pairing it with BigGAN in The Big Sleep. After that, Katherine Crowson (@rivershavewings) hooked CLIP up to the VQGAN image generation network, starting a text-to-image Cambrian explosion. A few months later she did the same with CLIP + a model called Guided Diffusion, which dramatically increased the quality of the generations. In 2022 the process only accelerated: the number of models being released grew quickly, and I created a newsletter to keep up with it. The highlights of 2022 so far are definitely the release of Latent Diffusion models, as well as Dall-E 2.

Vice/Motherboard featured this scene in a July 2021 piece called "AI-Generated Art Scene Explodes as Hackers Create Groundbreaking New Tools"

Other prominent people in this field: @jbusted1, @nshepperd1, @BoneAmputee, dribnet, @danielrussruss, @bakztfuture. I also recommend the following Discord servers: our own Multimodal Art, EleutherAI and LAION

What is it (in depth, curated content)

The AIAIArt course (technical): The AI art open source course by Jonathan Whitaker
Understanding Dall-E: Two minute papers video, Yannic Kilcher video, original blog post
Understanding CLIP: Yannic Kilcher video, original blog post
Understanding CLIP text-to-image generation: Artificial Images video
Understanding VQGAN+CLIP: Adafruit blogpost, Bestiario del Hypogripho
Understanding Diffusion models: Non-technical video with technical bits, In depth video explanation, In depth step by step implementation
Understanding DALL-E 2: Two Minute Papers video, Dall-E 2 first look
The next 10 years of Multimodal art by Bakz T. Future

I want to see examples of what it can do

A few examples created by me. For more, follow me on Twitter or Instagram.

A giant insect protecting the city of Lagos (CLIP Guided Diffusion)
A cute monster bathing in an açai bowl (CLIP Guided Diffusion)
A pão de queijo food cart with a Japanese castle in the background by James Gurney (CLIP Guided Diffusion)
A mecha robot celebrating Diwali by James Gurney (CLIP Guided Diffusion)
A cute monster taking a shower in a bathtub trending on artstation (CLIP Guided Diffusion)
Prison Shrimp Night Fight trending on Artstation (CLIP Guided Diffusion)
A renaissance painting of eyeballs (CLIP Guided Diffusion)
Two people silhouettes looking at artificial intelligence art in a gallery (CLIP Guided Diffusion)
A cute seahorse amigurumi (CLIP Guided Diffusion)
A landscape resembling the Black Lotus Magic The Gathering card (CLIP Guided Diffusion)
A Shakira chicken dancing (CLIP Guided Diffusion)
A giant chicken in an Austrian supermarket by James Gurney (CLIP Guided Diffusion)
A surrealist sculpture of a GameBoy (CLIP Guided Diffusion)
The biggest baile funk party in Times Square (VQGAN+CLIP)
Do not rinse raw chicken before cooking says the FDA (VQGAN+CLIP)
Mark Zuckerberg regretting having created Facebook, oil on canvas (VQGAN+CLIP)
Elon Musk saying his final words before his exile in a Jupiter moon, oil on canvas (VQGAN+CLIP)
Jeff Bezos apologizes to former employees before going to jail, oil on canvas (VQGAN+CLIP)
Drinking the Milky Way galaxy from a milk bottle
A couple spending their first Universal Basic Income payment on a fully automated lab grown meat restaurant
The online advertisement bubble burst crisis, oil on canvas

Some prominent AI art/model creators: @rivershavewings, @advadnoun, @images_ai, @jbusted1, @nshepperd1, @BoneAmputee, @dribnet, @danielrussruss

I want to play with it myself

We just released MindsEye beta, a GUI for running multiple multimodal art models. Check it out here.

Besides that, check out this list of tools and other resources you can run on your own, or follow the pointers below based on your use case:

I have a powerful GPU, I know a bit of coding, and I want to run these models on my local machine

Run VQGAN+CLIP locally
Run CLIP Guided Diffusion locally
Run Dall-E mini locally
Run many models at once with Vision of Chaos (Windows only)

I know how to use a Google Colab (or I am willing to learn)

VQGAN+CLIP: original notebook, with pooling trick, MSE regularized
Guided Diffusion: 512x512px original, Disco Diffusion
Latent Diffusion: notebook by us

I'm not willing to learn how to use a Colab; I just want a website where I can type the text and get the image out

Check out MindsEye!