This week in multimodal AI art (31/May - 06/Jun)

Follow on Twitter, come hang on Discord, consider supporting on Patreon. Try MindsEye, our image generation GUI

Text-to-Image synthesizers:

- Composable-Diffusion approach released (Paper, GitHub, Colab)

by UIUC and MIT
Composable-Diffusion is a technique that lets diffusion models accept multiple prompts at once and compose them together. The authors applied it to OpenAI's GLIDE, but the technique could be used more broadly. We plan to add it to Majesty Diffusion and to the vanilla Latent Diffusion Colab and Spaces.
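For the curious, here is a minimal sketch of the composition idea: at every denoising step, the per-prompt noise predictions are combined relative to the unconditional prediction. The function and variable names are illustrative placeholders, not the authors' released API.

```python
import torch

def composed_noise_prediction(model, x_t, t, prompt_conds, weights):
    """Conjunction ("AND") of several prompts: add each prompt's guidance
    direction on top of the unconditional prediction. Illustrative sketch only."""
    eps_uncond = model(x_t, t, cond=None)            # unconditional prediction
    eps = eps_uncond.clone()
    for cond, w in zip(prompt_conds, weights):
        eps_cond = model(x_t, t, cond=cond)          # prediction for one prompt
        eps = eps + w * (eps_cond - eps_uncond)      # add its weighted guidance term
    return eps                                       # feed into the usual sampler step
```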

- Pixel Art Diffusion released (Colab)

by Kali Yuga
AI artist Kali Yuga used this amazing fine-tuning code to create diffusion-based pixel art models and released them on top of the Disco Diffusion code base. The results look stunning.

- Latent Diffusion fine-tuning code (GitHub)

by LAION AI
LAION has released a cool and handy tool for fine-tuning Latent Diffusion models on your own dataset. To showcase its powers, they released two fine-tuned models with Replicate demos for you to play with: ongo (fine-tuned on paintings) and erlich (fine-tuned to generate logos).

- DiVAE (Photorealistic Images Synthesis with Denoising Diffusion Decoder) paper released (Paper)

by Peking University and Microsoft
The DiVAE approach combines elements from variational autoencoders and diffusion models to come up with a new way to synthesize images. It has been trained both to reconstruct images (like VQGAN or diffusion models) and to generate images from text. It is not state-of-the-art, but it introduces interesting novelties, and the more models the merrier!

- Text2Human: Text-Driven Controllable Human Image Generation (Project website, Paper, GitHub, Spaces)

by Yuming Jiang/MMLab
Generate realistic-looking humans from text descriptions.

Explainability updates:

- No Token Left Behind - Explainability-Aided Image Classification and Generation (GitHub, Paper)

by Apple
Apple released a paper and GitHub repository for image generation based on StyleGAN, VQGAN and CLIP. Apple did not join the DALLE2/Imagen race of new state-of-the-art approaches coming from big tech; instead, it released three Colab notebooks for image generation and editing that focus on explaining what the models are doing.

New CLIP and CLIP-like models:

- Multi-lingual CLIP (GitHub)

by FreddeFrallan
Multi-lingual CLIP is a very interesting approach that creates OpenAI CLIP text encoders for any language - effectively a way to make CLIP multilingual without retraining the whole model on a new dataset for every language.
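Here is a rough sketch of the teacher-student trick behind it, as we understand it: a pre-trained multilingual text encoder is trained to reproduce the frozen OpenAI CLIP text encoder's embeddings on translated caption pairs. All names below (student, clip_text_encoder, pairs) are illustrative, not the repository's actual API.

```python
import torch
import torch.nn.functional as F

def alignment_step(student, clip_text_encoder, optimizer, pairs):
    """pairs: list of (caption_in_target_language, same_caption_in_english)."""
    target_lang, english = zip(*pairs)
    with torch.no_grad():                           # the CLIP teacher stays frozen
        targets = clip_text_encoder(list(english))  # CLIP text embeddings
    preds = student(list(target_lang))              # multilingual student embeddings
    loss = F.mse_loss(preds, targets)               # pull the student onto the teacher
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```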

- LAION 400M trained OpenCLIP ViT-L/14 (GitHub)

by mlfoundations
The biggest OpenCLIP model trained yet: a ViT-L/14 trained on LAION400M. It didn't surpass OpenAI's ViT-L/14 but got very close, yet again showing the power of open source. We have already hooked it into Majesty Diffusion to guide your generations. [Image: LAION ViT-L/14 zero-shot results]
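If you want to try it yourself, loading it through the open_clip library looks roughly like this; the exact `pretrained` tag for the LAION-400M ViT-L/14 weights is our assumption, so double-check the repository's README.

```python
import torch
import open_clip

# The pretrained tag below is assumed; check the open_clip README for the exact name.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="laion400m_e32"
)

text = open_clip.tokenize(["a pixel art castle", "a photo of a cat"])
with torch.no_grad():
    text_features = model.encode_text(text)
    text_features /= text_features.norm(dim=-1, keepdim=True)  # unit-normalize
print(text_features.shape)  # (2, 768) embeddings for ViT-L/14
```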

- UniCL - Unified text, image and label space (GitHub, Paper, HF Space)

by Microsoft
A new CLIP-like approach that, besides training on image-text pairs, also trains on labeled images, promising to improve image-text learning through label augmentation. Code and models are released and the results look promising, but the pre-trained models are currently smaller than OpenAI's or OpenCLIP's, so it remains to be seen whether the approach scales. Good to keep an eye on.
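The core trick, as we read the paper, is that the contrastive targets are built from labels: any image-text pair in the batch that shares a label counts as a positive, not just the matched diagonal pair. The code below is an illustrative sketch, not the released implementation.

```python
import torch
import torch.nn.functional as F

def unicl_loss(image_emb, text_emb, labels, temperature=0.07):
    """labels: (batch,) ints; web image-text pairs can simply get unique labels."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    positives = (labels.view(-1, 1) == labels.view(1, -1)).float()   # same label -> positive
    targets = positives / positives.sum(dim=1, keepdim=True)         # spread mass over positives
    loss_i2t = (-targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t2i = (-targets * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
    return (loss_i2t + loss_t2i) / 2
```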

- CyCLIP - Geometrically consistent CLIP embeddings (GitHub, Paper)

A new approach that makes CLIP's embeddings geometrically consistent. Code and models are released. As with UniCL, it looks promising, but the pre-trained models are currently much smaller than OpenAI's or OpenCLIP's, so it is hard to assess whether the improvements scale. Also worth keeping an eye on (hey, that's what we're here for).
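Here is a sketch of the two consistency regularizers as we read the paper: a cross-modal term asking sim(I_i, T_j) to match sim(I_j, T_i), and an in-modal term asking image-image geometry to match text-text geometry. Both would be added, with some weighting, on top of the usual CLIP contrastive loss; names and details are illustrative, not the authors' exact code.

```python
import torch
import torch.nn.functional as F

def cyclip_regularizers(image_emb, text_emb):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    i2t = image_emb @ text_emb.t()     # cross-modal similarity matrix
    i2i = image_emb @ image_emb.t()    # image-image similarities
    t2t = text_emb @ text_emb.t()      # text-text similarities
    cross_modal = ((i2t - i2t.t()) ** 2).mean()  # sim(I_i,T_j) should equal sim(I_j,T_i)
    in_modal = ((i2i - t2t) ** 2).mean()         # both modalities should share one geometry
    return cross_modal, in_modal
```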

Learning AI Art:

- AIAIArt course (GitHub, Discord)

AIAIArt is a free and open-source AI art course by John Whitaker. Live classes run on Twitch for the next few Saturdays at 4 PM UTC. All previous classes are recorded and available as Google Colabs at the GitHub link.