Vision Foundation Models

Large visual representation models that transfer across recognition, localization, and perception tasks.

Vision foundation models turn images and video into reusable representations, replacing one-off task models. The core shift is from training a classifier or detector for a narrow label set to training a visual backbone that transfers across recognition, segmentation, dense prediction, retrieval, and multimodal reasoning.

The papers in this topic show three complementary routes. ViT brings the Transformer token interface to images by treating image patches as tokens. DINOv2 emphasizes self-supervised features and curated data. Segment Anything reframes segmentation as a promptable primitive, and SAM 2 extends that interaction pattern into video. Together they explain why visual AI is moving from benchmark-specific models toward general perception infrastructure.
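
To make the "token interface" concrete, here is a minimal sketch of how an image can be cut into patch tokens, the input format a ViT-style model consumes. It is an illustration under stated assumptions, not the ViT implementation: `patchify` is a hypothetical helper name, the image is a plain nested list of shape H x W x C, and the learned linear projection, class token, and position embeddings that a real model adds afterward are omitted.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened patch tokens.

    Hypothetical helper for illustration; not taken from any library.
    Patches are emitted in row-major order (left to right, top to bottom).
    """
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")

    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            token = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    # Append the C channel values of this pixel.
                    token.extend(image[row][col])
            tokens.append(token)
    return tokens


# Toy 224 x 224 RGB image filled with zeros (an assumed input size).
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
tokens = patchify(image, patch_size=16)
print(len(tokens), len(tokens[0]))  # 196 768
```

With 16 x 16 patches, a 224 x 224 RGB image becomes a sequence of 196 tokens, each holding 16 * 16 * 3 = 768 raw values. From that point on, the model can process the image much like a sentence of 196 words.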

Start here

Foundational papers

Recent papers