This week in multimodal AI art (25/03 - 01/04)

What an amazing week for multimodal AI art! Plenty of completely new models, new CLIP variants and a giant dataset! I will let the sheer amount of news speak for itself.

New text-to-image synthesizers:

- RQ-VAE-Transformer (paper+code+model)

by Kakao Brain

A new architecture that does not use diffusion; a spiritual follow-up to VQGAN. Super fast, but also super VRAM-hungry.
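
For context, the core trick here is residual quantization: instead of assigning each feature vector a single codebook entry like VQGAN does, the vector is quantized in several rounds, each round encoding the leftover residual, and a transformer then models the resulting code stacks. Below is a minimal, generic PyTorch sketch of that idea; names and shapes are illustrative, not the Kakao Brain implementation.

```python
# Minimal residual-quantization sketch (illustrative, not the RQ-VAE-Transformer code).
import torch

def residual_quantize(z, codebook, depth=4):
    """Approximate each feature vector in z as a sum of `depth` codebook entries."""
    residual, approx, codes = z.clone(), torch.zeros_like(z), []
    for _ in range(depth):
        idx = torch.cdist(residual, codebook).argmin(dim=-1)  # nearest entry per vector
        quantized = codebook[idx]
        codes.append(idx)
        approx = approx + quantized
        residual = residual - quantized   # the next round encodes what is still missing
    return codes, approx                  # the transformer part models these code stacks

# Tiny usage example with random data
codebook = torch.randn(512, 64)           # 512 codebook entries of dimension 64
z = torch.randn(10, 64)                   # 10 feature vectors from an encoder
codes, z_hat = residual_quantize(z, codebook)
```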

- GLID-3 (code+model and colab)

by Jack000

A combination of GLIDE + CLIP + Latent Diffusion. A mid-training checkpoint has been released that, despite being small, is a powerful model for photorealistic generation.
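
To make that combination a bit more concrete, here is a rough sketch of text-conditioned sampling in a latent-diffusion setup. Every name (`clip_text_encoder`, `denoiser`, `vae_decoder`) is a hypothetical placeholder rather than GLID-3's actual API, and the noise schedule is a toy one.

```python
# Rough latent-diffusion sampling sketch (hypothetical interfaces, not the GLID-3 code).
import torch

@torch.no_grad()
def sample(prompt, clip_text_encoder, denoiser, vae_decoder,
           steps=50, latent_shape=(1, 4, 32, 32)):
    text_embed = clip_text_encoder(prompt)              # GLIDE-style text conditioning via CLIP
    alpha_bars = torch.linspace(1e-4, 0.9999, steps)    # toy noise schedule, noisy -> clean
    z = torch.randn(latent_shape)                       # diffusion runs in the VAE's latent space
    for i in range(steps):
        a_t = alpha_bars[i]
        a_next = alpha_bars[i + 1] if i + 1 < steps else torch.tensor(1.0)
        eps = denoiser(z, i, text_embed)                 # predict the noise at this step
        z0 = (z - (1 - a_t).sqrt() * eps) / a_t.sqrt()   # estimate of the clean latent
        z = a_next.sqrt() * z0 + (1 - a_next).sqrt() * eps  # deterministic DDIM-style update
    return vae_decoder(z)                                # decode the latent back to pixels
```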

- StyleGAN XL (all purpose StyleGAN - paper+code+model and colab)

by Autonomous Vision
StyleGANs are known to generate very good and realistic results, but those results are usually restricted to a single domain (websites like "This Person Doesn't Exist" use a StyleGAN model). StyleGAN XL comes to change that with an all-purpose, not domain-specific, StyleGAN. First-hand testing and results are coming soon on Twitter and will be attached here.

- Make-a-scene (paper only)

by Meta AI (yes, that's Facebook)
A new model that introduces semantic segmentation and human priors into text-to-image generation. It looks promising, but nothing besides the paper has been released so far. I asked Meta's research director whether they would make it open source. This was the answer:

- Disco Diffusion upgrade to v5.1 (notebook)

by gandamu and somnai_dreams
The beloved CLIP Guided Diffusion notebook got an upgrade: 3D animations can go Turbo and init_videos work perfectly! You can also try Disco Diffusion on MindsEye, though animation mode isn't available there yet.

- CLOOB Guided Diffusion (code+models and notebook)

by Katherine Crowson and John David Pressman
A diffusion model guided by a new, still-mid-training CLOOB model. You can test it out and get a glimpse of CLOOB's potential to sit in the driver's seat instead of CLIP.
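
The "driver's seat" here is the guidance function: at every diffusion step, an image-text model scores how well the current sample matches the prompt, and the gradient of that score nudges the sample. A minimal sketch is below (the `perceptor` interface is a hypothetical stand-in, not the actual notebook code); swapping CLIP for CLOOB just means swapping the perceptor.

```python
# Minimal perceptor-guidance sketch (illustrative, not the actual notebook code).
import torch
import torch.nn.functional as F

def cond_fn(x, text_embed, perceptor, guidance_scale=500.0):
    """Gradient that pulls the current sample x towards the text prompt."""
    with torch.enable_grad():
        x = x.detach().requires_grad_()
        image_embed = perceptor.encode_image(x)                     # CLIP or CLOOB play the same role
        sim = F.cosine_similarity(image_embed, text_embed, dim=-1)  # how well x matches the prompt
        grad = torch.autograd.grad(sim.sum() * guidance_scale, x)[0]
    return grad  # added to the denoiser's predicted mean at each sampling step
```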

New text-to-3D synthesizers:

- CLIPMatrix (paper and colab)

by @NJetchev

A mind-blowing CLIP-guided 3D mesh deformation and stylization method

New CLIP and CLIP-like models:

- ViT-B/16 CLOOB (Github)

by Katherine Crowson
Two checkpoints from a CLOOB training run on LAION400M that is still ongoing

- KELIP - Korean+English CLIP (Github)

by NAVER/VISION
A bilingual CLIP model trained on both English and Korean data. The dataset it was trained on is bigger than the original CLIP's; results in English are comparable, while results in Korean are a bit poorer. However, the multilingual and multicultural aspect makes it an exciting model to try out.

- LAION CLIP ViT-B/32 (Github)

by LAION
A CLIP ViT-B/32 model trained from scratch on LAION400M. It performs similarly to OpenAI's CLIP but may have a different knowledge base. It will be integrated into MindsEye's models very soon.
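
If you want to play with the checkpoint directly, it should be loadable through LAION's OpenCLIP library; the model and pretrained tag names below are my best guess at the repo's naming and may need adjusting.

```python
# Loading the LAION400M ViT-B/32 checkpoint via OpenCLIP (pip install open_clip_torch).
# The 'laion400m_e32' pretrained tag is an assumption; check the repo for the exact names.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32-quickgelu', pretrained='laion400m_e32')
tokens = open_clip.tokenize(["an oil painting of a lighthouse at night"])

with torch.no_grad():
    text_features = model.encode_text(tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)  # normalized for cosine similarity
```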

- CLOOB YFCC CFG (Github)

by John David Pressman
Training code for a classifier-free guidance model based on CLOOB (look, mom, without CLIP!). Initial training checkpoints for YFCC (Yahoo Flickr Creative Commons, an open image-text dataset) have been released. Pre-trained models coming soon!
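
For reference, classifier-free guidance needs no external perceptor at sampling time: the diffusion model is queried once with the conditioning and once without, and the two noise predictions are extrapolated. A minimal sketch of that combination step (illustrative only, not the repo's code):

```python
# Minimal classifier-free guidance step (illustrative, not the actual repo code).
def cfg_eps(denoiser, x, t, cond_embed, uncond_embed, guidance_scale=3.0):
    """Blend conditional and unconditional noise predictions."""
    eps_cond = denoiser(x, t, cond_embed)      # prediction with the conditioning (e.g. a CLOOB embedding)
    eps_uncond = denoiser(x, t, uncond_embed)  # prediction with the conditioning dropped
    # extrapolate towards the direction the conditioning suggests
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```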

New Dataset:

- LAION 5B (Release announcement)

by LAION
A giant image-text dataset with 5 BILLION text-image pairs: 2 billion in English and the rest in many other languages. It builds on the 400M text-image pairs released in late 2021. For reference:

  • CLIP, the 'driver' of most AI art models, was trained on 400M text-image pairs (12.5x smaller) on a non-open dataset by @OpenAI
  • The biggest open dataset before LAION started its activities had ~15M text-image pairs (333x smaller); see the quick check of these ratios below
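
A quick check of those ratios:

```python
# Quick check of the size ratios quoted above.
laion_5b = 5_000_000_000
clip_training_set = 400_000_000
biggest_open_before = 15_000_000

print(laion_5b / clip_training_set)           # 12.5 -> "12.5x smaller"
print(round(laion_5b / biggest_open_before))  # 333  -> "333x smaller"
```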

Big datasets like this can be used to train:

  • Better image-text classifiers (like CLIP and CLOOB)
  • Better image synthesizers (like VQGAN, Diffusion)
  • Better combinations of the two (like DALL-E, GLIDE, CogView)