Vision Foundation Models

Start here

DINOv2: General Visual Features Without Labels

DINOv2 trains self-supervised vision models on curated large-scale data to produce robust features usable across many downstream tasks.

Segmentation · Meta AI

Segment Anything: Promptable Segmentation at Web Scale

SAM reframed image segmentation as a promptable foundation-model task, backed by a large model and the SA-1B mask dataset.

Vision Foundation Models · Google Research

Vision Transformer: Treating Image Patches Like Tokens

ViT showed that a standard Transformer can compete in image recognition when images are split into patches and trained at sufficient scale.

Foundational papers

Vision Foundation Models · Google Research

Vision Transformer: Treating Image Patches Like Tokens

ViT showed that a standard Transformer can compete in image recognition when images are split into patches and trained at sufficient scale.

Multimodal Models · OpenAI

CLIP: Computer Vision Learns to Read Natural Language

CLIP trains image and text encoders on 400 million internet image-text pairs, making natural language a flexible interface for zero-shot visual recognition.

Multimodal Models · Google DeepMind

Flamingo: Few-Shot Learning for Images, Video, and Text

Flamingo connects pretrained vision encoders with large language models so multimodal tasks can be handled with a few interleaved examples.

Segmentation · Meta AI

Segment Anything: Promptable Segmentation at Web Scale

SAM reframed image segmentation as a promptable foundation-model task, backed by a large model and the SA-1B mask dataset.

Recent papers

Self-Supervised Learning · Meta AI

DINOv2: General Visual Features Without Labels

Flamingo: Few-Shot Learning for Images, Video, and Text

Segment Anything: Promptable Segmentation at Web Scale

Vision Transformer: Treating Image Patches Like Tokens

CLIP: Computer Vision Learns to Read Natural Language

SAM 2: Segment Anything Moves From Images Into Video
