This week in multimodal AI art (02/04 - 08/04)

And I thought last week couldn't be topped, with RQ-VAE, GLID-3, LAION-5B, StyleGAN XL, Make-a-Scene, CLIPMatrix and more. I couldn't have been MORE wrong! What can I say about this insane pace of multimodal AI art releases? Enjoy the ride, I guess.

New text-to-image synthesizers:

- Dall-E 2 (blog post and paper only)

by OpenAI

The newest iteration of Dall-E is out. With results that stunned the community, the new model takes the CLIP-guided diffusion approach to the next level. No code or model has been released; there's a waiting list to try their closed demo.

- Latent Diffusion LAION-400M (code+model, Colab, Spaces)

by CompVis (model), LAION (dataset), us (Colab and Hugging Face Spaces)
CompVis released a LAION-400M-trained model for its Dec/2021 "Latent Diffusion" implementation. It is a breakthrough: its major feat is its text synthesis (rendering written text inside images), beating even Dall-E 2 at that, and it also samples images very fast. It sometimes struggles to interpret the prompt, performing worse than VQGAN-CLIP and Guided Diffusion methods. We have released a Google Colab and a Hugging Face Space for it! Play with it yourself and share your results with us @multimodalart
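
If you're curious what "latent diffusion" actually means, here is a minimal sketch of the idea as a mental model: the denoising loop runs in a small VAE latent space and only the final latent gets decoded to pixels, which is a big part of why sampling is so fast. Every module, shape and update rule below is an illustrative placeholder, not the CompVis code or API.

```python
# Conceptual sketch of latent diffusion sampling (NOT the CompVis implementation):
# the denoiser works on a compact latent tensor, and a VAE decoder maps the final
# latent back to pixel space in a single pass.
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Stand-in for the U-Net that predicts noise in latent space."""
    def __init__(self, channels=4):
        super().__init__()
        self.net = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, z, t, text_emb):
        # A real model conditions on the timestep t and the prompt embedding;
        # both are ignored here to keep the sketch self-contained.
        return self.net(z)

class ToyDecoder(nn.Module):
    """Stand-in for the VAE decoder mapping 4x64x64 latents to 3x512x512 images."""
    def __init__(self):
        super().__init__()
        self.net = nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)

    def forward(self, z):
        return self.net(z)

denoiser, decoder = ToyDenoiser(), ToyDecoder()
text_emb = torch.randn(1, 512)       # placeholder prompt embedding
z = torch.randn(1, 4, 64, 64)        # start from pure noise in the latent space

with torch.no_grad():
    for t in reversed(range(50)):            # e.g. 50 denoising steps
        eps = denoiser(z, t, text_emb)       # predicted noise
        z = z - 0.02 * eps                   # crude update; real samplers follow the DDPM/DDIM equations
    image = decoder(z)                       # one single decode from latent space to pixels

print(image.shape)                           # torch.Size([1, 3, 512, 512])
```

A 4×64×64 latent holds about 16k values versus roughly 786k for a 512×512 RGB image, so every denoising step is far cheaper than it would be in pixel space.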

- KNN Diffusion (paper only)

by Meta AI
This paper presents a novel approach to text-to-image generation that can be trained with images only (no paired captions), using k-nearest-neighbour (kNN) retrieval to provide samples that match the caption.
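
The retrieval part is easy to picture: embed the caption, look up its nearest neighbours in a bank of image embeddings living in the same space (CLIP-style), and hand those neighbours to the generator as conditioning. Below is a rough sketch of that retrieval step only; the embeddings and the "encoder" are random placeholders, not the paper's model.

```python
# Hedged sketch of the retrieval idea: find the k image embeddings closest to the
# caption embedding and use them as conditioning. Everything here is a random
# placeholder standing in for a real shared text/image embedding space.
import numpy as np

rng = np.random.default_rng(0)
image_bank = rng.standard_normal((10_000, 512)).astype(np.float32)  # precomputed image embeddings
image_bank /= np.linalg.norm(image_bank, axis=1, keepdims=True)

def embed_caption(caption: str) -> np.ndarray:
    """Placeholder for a real text encoder (e.g. CLIP's)."""
    vec = rng.standard_normal(512).astype(np.float32)
    return vec / np.linalg.norm(vec)

def retrieve_neighbours(caption: str, k: int = 8) -> np.ndarray:
    """Return the k image embeddings most similar (by cosine) to the caption."""
    query = embed_caption(caption)
    scores = image_bank @ query              # cosine similarity (all vectors are unit norm)
    top_k = np.argsort(-scores)[:k]
    return image_bank[top_k]                 # these would condition the generator

neighbours = retrieve_neighbours("an oil painting of a lighthouse", k=8)
print(neighbours.shape)                      # (8, 512)
```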

- CLOOB Conditioned Latent Diffusion YFCC CFG (code, Colab)

by John David Pressman
CLOOB is a CLIP variant that outperforms CLIP in some cases, and it has the added advantage of being able to be fine-tuned on images without captions. The newly released model is a classifier-free guidance model trained on YFCC100M, and the code makes it easy to fine-tune it on different images.
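
"Classifier-free guidance" sounds fancier than it is: the denoiser is run twice per step, once with the conditioning (here the CLOOB prompt embedding) and once without, and the two noise predictions are extrapolated by a guidance scale. A minimal sketch of just that blend, with placeholder names rather than the released code:

```python
# Minimal sketch of classifier-free guidance (CFG), not the released code.
import torch

def cfg_noise_prediction(denoiser, z, t, cond, guidance_scale=3.0):
    """Blend conditional and unconditional noise predictions; scale=1.0 disables guidance."""
    eps_cond = denoiser(z, t, cond)      # conditioned on the prompt embedding
    eps_uncond = denoiser(z, t, None)    # conditioning dropped, as it sometimes is during training
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy usage with a stand-in denoiser that ignores its inputs:
toy_denoiser = lambda z, t, cond: torch.zeros_like(z)
z = torch.randn(1, 4, 32, 32)
eps = cfg_noise_prediction(toy_denoiser, z, t=10, cond=torch.randn(1, 512))
print(eps.shape)                         # torch.Size([1, 4, 32, 32])
```

Higher guidance scales push samples harder towards the prompt at the cost of diversity; a scale of 1.0 reduces to the plain conditional model.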

New text-to-video synthesizers:

- Video Diffusion (paper and project page only)

[div class="small]by Google[/div] A text-to-video model based on a new "video diffusion approach" that can generate short video snippets.

- Text2LIVE (paper and project page only)

by NVIDIA Research
A CLIP-guided image and video editor: it takes an image or a video and outputs the same content with the edits requested in the text prompt. No model or code has been released yet.
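
As a rough mental model of what "CLIP-guided" editing means here: an edit layer is optimised so that the edited image moves towards the target text in a joint image/text embedding space while staying close to the original. The sketch below is only a caricature of that objective, with a frozen random projection standing in for CLIP; it is not Text2LIVE's actual method or losses.

```python
# Caricature of CLIP-guided editing: optimise a residual "edit" over the image so
# the result aligns better with a target text embedding while the edit stays small.
# The encoder below is a frozen random projection, NOT CLIP, and the loss is a toy.
import torch

torch.manual_seed(0)
image = torch.rand(3 * 64 * 64)                    # flattened source image (placeholder)
text_embedding = torch.randn(512)                  # placeholder embedding of e.g. "make it autumn"
image_encoder = torch.nn.Linear(3 * 64 * 64, 512)  # stand-in for CLIP's image encoder
for p in image_encoder.parameters():
    p.requires_grad_(False)                        # the encoder stays frozen; only the edit is optimised

edit = torch.zeros_like(image, requires_grad=True) # the edit layer we optimise
optimizer = torch.optim.Adam([edit], lr=1e-2)

for step in range(200):
    edited = (image + edit).clamp(0, 1)
    similarity = torch.cosine_similarity(image_encoder(edited), text_embedding, dim=0)
    loss = -similarity + 0.1 * edit.abs().mean()   # follow the text, but keep the edit small
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final loss: {loss.item():.3f}")
```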

- TATS: Time-Agnostic VQGAN and Time-Sensitive Transformer to generate long videos (paper, project page, code)

by the University of Maryland, Meta AI and Georgia Tech
A model trained on 16-frame video clips that can nevertheless produce coherent videos thousands of frames long, conditioned on a class label, text or audio.
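
The interesting trick is how a model trained on 16-frame clips gets to thousands of frames: generation can proceed autoregressively with a sliding window over the most recent frame tokens (as I understand it, that is what TATS's transformer does), so the output length isn't tied to the training clip length. The toy below shows just that windowing logic with fake frame tokens instead of TATS's VQGAN codes.

```python
# Toy illustration of sliding-window generation: the "model" only ever sees the
# most recent WINDOW frames, so the video can grow far beyond the 16-frame clips
# it was trained on. Frame tokens here are random ints, not real VQGAN codes.
import random

WINDOW = 16            # frames seen during training
TARGET_LENGTH = 1024   # frames we want in the final video

def predict_next_frame(context):
    """Placeholder for the transformer's next-frame prediction."""
    return random.randrange(1024)           # a fake discrete frame token

frames = [predict_next_frame([])]
while len(frames) < TARGET_LENGTH:
    context = frames[-(WINDOW - 1):]        # keep only the trailing window as context
    frames.append(predict_next_frame(context))

print(len(frames))                          # 1024
```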