Multimodal Models

Foundation models that combine language with images, audio, video, or other signals.

Flamingo: Few-Shot Learning for Images, Video, and Text

Flamingo connects pretrained vision encoders with large language models so new multimodal tasks can be handled from a few interleaved image-and-text examples in the prompt.

Text-to-Image · Google Research

Imagen: Why Text Understanding Matters for Image Generation

Imagen showed that stronger language encoders can materially improve text-to-image diffusion models, especially for prompt alignment and photorealism.

Speech Recognition · OpenAI

Whisper: Speech Recognition Trained on Web-Scale Weak Supervision

Whisper showed that large, diverse, weakly supervised audio data can produce robust multilingual speech recognition and translation models.

Text-to-Image · OpenAI

DALL·E 2: Text-to-Image Generation Through CLIP Latents

DALL·E 2 splits text-to-image generation into a prior that predicts a CLIP image embedding and a decoder that turns that embedding into an image.

Multimodal Models · OpenAI

CLIP: Computer Vision Learns to Read Natural Language

CLIP trains image and text encoders on 400 million internet image-text pairs, making natural language a flexible interface for zero-shot visual recognition.
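As a rough illustration of how natural language becomes the interface for zero-shot recognition, the sketch below scores one image embedding against several text-prompt embeddings by cosine similarity and takes a softmax over the scaled scores. The embeddings are hand-written toy vectors standing in for encoder outputs, and `zero_shot_classify`, `l2_normalize`, and the `logit_scale` value of 100 are assumptions made for this example, not CLIP's actual code or API.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_emb, text_embs, labels, logit_scale=100.0):
    """Pick the label whose text embedding is closest to the image embedding.

    logit_scale is an assumed temperature for this sketch; a trained model
    would supply its own learned value.
    """
    img = l2_normalize(image_emb)
    sims = [
        sum(a * b for a, b in zip(img, l2_normalize(t)))
        for t in text_embs
    ]
    probs = softmax([logit_scale * s for s in sims])
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs


# Toy stand-ins for encoder outputs (hypothetical values, not real embeddings).
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_emb = [0.8, 0.3, 0.1]

prediction, probs = zero_shot_classify(image_emb, text_embs, labels)
print(prediction)
```

Changing the set of classes only requires writing new prompts and embedding them; no classifier head is retrained, which is the sense in which language acts as the interface.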

Long Context · Google DeepMind

Gemini 1.5: The Long-Context Bet Becomes a Product-Scale Model

Gemini 1.5 made million-token multimodal context feel less like a demo trick and more like a practical interface for long documents, video, audio, and code.

Multimodal Models · OpenAI

GPT-4: The Report That Made Frontier Models Feel Measurable

The GPT-4 technical report was less a full recipe than a measurement document: it presented a multimodal Transformer whose benchmark performance, scaling predictability, and post-training alignment reset expectations for frontier AI.