Multimodal Models · Google DeepMind
Flamingo: Few-Shot Learning for Images, Video, and Text
Flamingo connects pretrained vision encoders with large language models so multimodal tasks can be handled with a few interleaved examples.
Topics
Foundation models that combine language with images, audio, video, or other signals.
Text-to-Image · Google Research
Imagen showed that stronger language encoders can materially improve text-to-image diffusion models, especially for prompt alignment and photorealism.
Whisper showed that large, diverse, weakly supervised audio data can produce robust multilingual speech recognition and translation models.
DALL·E 2 splits text-to-image generation into a prior that predicts a CLIP image embedding and a decoder that turns that embedding into an image.
CLIP trains image and text encoders on 400 million internet image-text pairs, making natural language a flexible interface for zero-shot visual recognition.
Long Context · Google DeepMind
Gemini 1.5 turned million-token multimodal context from a demo into a practical interface for long documents, video, audio, and code.
The GPT-4 technical report was less a full recipe than a measurement document: it describes a multimodal Transformer whose benchmark performance, scaling predictability, and post-training alignment reset expectations for frontier AI.