Multimodal AI art updates that you may have missed during the Stable Diffusion launch (August update)

Follow on Twitter, come hang on Discord, consider supporting on Patreon

With the Stable Diffusion hype catching everyone's attention across the month of August (including mine 😅), here is an update on everything else that happened in multimodal AI art this month. New models, new techniques, new modalities - check it out! For the 23+ applications of Stable Diffusion within 1 week of launch, come here.

New Text-to-image models:

- ERNIE-ViLG text-to-image model released (GitHub, Spaces demo)

by Baidu
A high-quality text-to-image model by the Chinese company Baidu is out - the code and an interactive demo have been released. The weights have not been released yet.

Text-to-image editing updates:

- Textual Inversion concept learning released (GitHub, Project Page, Paper)

Rinon Gal et. al. - Tel Aviv University, NVidia
A technique that works on any diffusion model - teach a model a new concept with 3-5 images and have it generate that concept in novel contexts. It has already been implemented in the Stable Diffusion WebUI and is also coming to the diffusers library (see the sketch below).
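The diffusers integration is not out yet, but once it lands, using a learned concept could look roughly like the sketch below - the `load_textual_inversion` call and the `sd-concepts-library/cat-toy` embedding are assumptions for illustration, not something confirmed in this post.

```python
# A rough sketch of using a Textual Inversion concept with diffusers (assumed API).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# Load an embedding trained with Textual Inversion; it registers a new placeholder token.
pipe.load_textual_inversion("sd-concepts-library/cat-toy")

# The learned token can now be dropped into any prompt like a normal word.
image = pipe("a <cat-toy> riding a bicycle in Paris, watercolor").images[0]
image.save("cat_toy_in_paris.png")
```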

- DreamBooth concept learning announced (Paper, Project Page)

by Google AI
A technique very similar to Textual Inversion in how it is used, but it teaches the concept via an efficient fine-tuning of the model. It was built on top of the closed-source Google Imagen, but in theory it can also work on any diffusion model, including Stable Diffusion; no code or replication has been released yet.

- Prompt-to-prompt image editing announced (Paper, Project Page)

by Google AI
A technique to iterate on creations with just text. You use a text2image system, but want to adjust something? You just swap that part of the text and the image stays coherent. Similar to "using the same seed" for Stable Diffusion, but more advanced (see the sketch below). Like DreamBooth, it was built on top of the closed-source Imagen; in theory it can be applied to Stable Diffusion, but there is no code or replication so far.
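Since no Prompt-to-Prompt code exists yet, the sketch below only shows the simpler "same seed, swap a word" trick the item compares it to, using Stable Diffusion through diffusers. Prompt-to-Prompt itself goes further by reusing the cross-attention maps between generations.

```python
# Not Prompt-to-Prompt: just the "same seed, swap a word" baseline with diffusers.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

def generate(prompt: str, seed: int = 42):
    # Re-using the same seed keeps the overall composition roughly stable across prompt edits.
    generator = torch.Generator("cuda").manual_seed(seed)
    return pipe(prompt, generator=generator).images[0]

generate("a photo of a cat sitting on a wooden bench").save("cat.png")
generate("a photo of a dog sitting on a wooden bench").save("dog.png")
```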

- Text2Live CLIP to guide image edition released (GitHub, Paper, Static Spaces)

by Omer Bar Tal et. al. - Weizmann Institute of Science
We have reported on Text2Live before, but now the code is out! Upload any image, tell it what on the image you want modified and what you want it to become, and the model runs a test-time optimization (~10-15 min on a decent GPU) using CLIP to generate what you asked for.
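Text2Live optimizes an edit layer over the image with several CLIP losses; the snippet below is not the Text2Live code, just a minimal sketch of the core "optimize an image against a CLIP text target" idea, assuming the `openai/clip-vit-base-patch32` checkpoint from Hugging Face transformers and a toy random starting image.

```python
# Minimal sketch of CLIP-guided image optimization (illustrative, not Text2Live itself).
import torch
from transformers import CLIPModel, CLIPTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
for p in model.parameters():
    p.requires_grad_(False)  # only the image pixels get optimized

# Target text describing the desired edit.
tokens = tokenizer(["a cake covered in chocolate"], return_tensors="pt", padding=True).to(device)
with torch.no_grad():
    text_emb = model.get_text_features(**tokens)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# In practice this would be the uploaded image; here a random tensor in [0, 1].
image = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
std = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

opt = torch.optim.Adam([image], lr=0.05)
for step in range(200):
    img_emb = model.get_image_features(pixel_values=(image - mean) / std)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    loss = 1 - (img_emb * text_emb).sum()  # push image embedding toward the text embedding
    opt.zero_grad()
    loss.backward()
    opt.step()
    image.data.clamp_(0, 1)
```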

Audio generation updates:

- Dance Diffusion pre-release (GitHub, Colab for running, Colab for fine-tuning)

by Zach Evans and the Harmonai community
Synthetic music by AI is clearly a next step - and it is already here. While one can't yet ask for a "Brazilian baile funk in the style of Sebastian Bach" (which already exists, by the way), the pre-release of the Dance Diffusion Colab is out. It has pre-trained models for unconditional audio generation, tricks like audio style transfer and init_audio (see the sketch below), as well as fine-tuning on your own tunes. Soon you'll be able to tune into a nice open Discord community for music generation!
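As a rough illustration of what init_audio means - starting the reverse diffusion from a partially noised version of an existing clip instead of pure noise, so the output keeps its broad structure - here is a minimal sketch. This is not the Dance Diffusion API; the cosine schedule and the input file name are assumptions for illustration.

```python
# Illustrative forward-diffusion step for the "init_audio" idea (not Dance Diffusion code).
import torch
import torchaudio

waveform, sr = torchaudio.load("my_tune.wav")  # hypothetical input clip
t = 0.5                                        # how far into the noise schedule to start (0..1)
alpha_bar = torch.cos(torch.tensor(t) * torch.pi / 2) ** 2  # simple cosine schedule, assumed
noise = torch.randn_like(waveform)
init_audio = alpha_bar.sqrt() * waveform + (1 - alpha_bar).sqrt() * noise
# `init_audio` would then be handed to the diffusion sampler in place of pure noise.
```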

CLIP and CLIP-like models:

- CLIP-like - CLIP finetuning released (GitHub)

Want to teach a CLIP model new concepts? Try the ylaxor notebook!
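This is not the ylaxor notebook itself - just a minimal sketch of the same idea using the Hugging Face transformers CLIPModel and its built-in contrastive loss. The file names and the "floofargh" concept are placeholders.

```python
# Minimal CLIP fine-tuning sketch with transformers' built-in contrastive loss.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

# Toy (image path, caption) pairs for the new concept - replace with your own data.
pairs = [("my_concept_1.jpg", "a photo of a floofargh"),
         ("my_concept_2.jpg", "a floofargh on a table")]

model.train()
for epoch in range(3):
    images = [Image.open(path).convert("RGB") for path, _ in pairs]
    texts = [caption for _, caption in pairs]
    batch = processor(text=texts, images=images, return_tensors="pt", padding=True).to(device)
    outputs = model(**batch, return_loss=True)  # symmetric image-text contrastive loss
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```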

Image-to-3D models:

- ICON - Image to 3D human model released (Paper, GitHub, Spaces)

by Yuliang Xiu et. al.
Upload an image that contains a human. The model will identify the human and generate a 3D model of that human in their pose - and wearing clothes, which is apparently very hard for AI to do.