Self-Supervised Learning · Meta AI
DINOv2: General Visual Features Without Labels
DINOv2 trains self-supervised vision models on curated large-scale data to produce robust features usable across many downstream tasks.
Topics
Large visual representation models that transfer across recognition, localization, and perception tasks.
Vision foundation models turn images and video into reusable representations instead of one-off task models. The core shift is from training a classifier or detector for a narrow label set to training a visual backbone that can transfer across recognition, segmentation, dense prediction, retrieval, and multimodal reasoning.
The papers in this topic show three complementary routes. ViT imports the Transformer token interface into images. DINOv2 emphasizes self-supervised features and curated data. Segment Anything reframes segmentation as a promptable primitive. SAM 2 extends that interaction pattern into video. Together they explain why visual AI is moving from benchmark-specific models toward general perception infrastructure.
Self-Supervised Learning · Meta AI
DINOv2 trains self-supervised vision models on curated large-scale data to produce robust features usable across many downstream tasks.
SAM reframed image segmentation as a promptable foundation-model task, backed by a large model and the SA-1B mask dataset.
Vision Foundation Models · Google Research
ViT showed that a standard Transformer can compete in image recognition when images are split into patches and trained at sufficient scale.
Vision Foundation Models · Google Research
ViT showed that a standard Transformer can compete in image recognition when images are split into patches and trained at sufficient scale.
CLIP trains image and text encoders on 400 million internet image-text pairs, making natural language a flexible interface for zero-shot visual recognition.
Multimodal Models · Google DeepMind
Flamingo connects pretrained vision encoders with large language models so multimodal tasks can be handled with a few interleaved examples.
SAM reframed image segmentation as a promptable foundation-model task, backed by a large model and the SA-1B mask dataset.
Self-Supervised Learning · Meta AI
DINOv2 trains self-supervised vision models on curated large-scale data to produce robust features usable across many downstream tasks.
Multimodal Models · Google DeepMind
Flamingo connects pretrained vision encoders with large language models so multimodal tasks can be handled with a few interleaved examples.
SAM reframed image segmentation as a promptable foundation-model task, backed by a large model and the SA-1B mask dataset.
Vision Foundation Models · Google Research
ViT showed that a standard Transformer can compete in image recognition when images are split into patches and trained at sufficient scale.
CLIP trains image and text encoders on 400 million internet image-text pairs, making natural language a flexible interface for zero-shot visual recognition.
SAM 2 extends promptable segmentation from still images to real-time video by adding streaming memory and a data engine built around user interaction.
Self-Supervised Learning · Meta AI
DINOv2 trains self-supervised vision models on curated large-scale data to produce robust features usable across many downstream tasks.
Multimodal Models · Google DeepMind
Flamingo connects pretrained vision encoders with large language models so multimodal tasks can be handled with a few interleaved examples.
SAM reframed image segmentation as a promptable foundation-model task, backed by a large model and the SA-1B mask dataset.
Vision Foundation Models · Google Research
ViT showed that a standard Transformer can compete in image recognition when images are split into patches and trained at sufficient scale.
CLIP trains image and text encoders on 400 million internet image-text pairs, making natural language a flexible interface for zero-shot visual recognition.
SAM 2 extends promptable segmentation from still images to real-time video by adding streaming memory and a data engine built around user interaction.