Segmentation · Vision Foundation Models
SAM 2: Segment Anything Moves From Images Into Video
SAM 2 extends promptable segmentation from still images to real-time video by adding streaming memory and a data engine built around user interaction.
What problem it solves
The original Segment Anything model made image segmentation feel interactive and general. Video was still harder: objects move, disappear, reappear, change scale, and get occluded. A frame-by-frame image model can segment each image, but it does not naturally remember which object the user meant. SAM 2 targets the more useful task: prompt an object once, then keep segmenting it across time.
The core method
SAM 2 keeps the promptable interface but adds a transformer architecture with streaming memory. That memory lets the model carry object information forward during real-time video processing instead of treating every frame as a fresh image. Meta also builds a data engine in which user interaction improves both the model and the dataset, producing a large video segmentation corpus (released as SA-V) for training and evaluation.
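The streaming idea can be illustrated with a toy sketch. This is not SAM 2's implementation: the names `StreamingMemory`, `attend`, and `process_stream` are hypothetical, frame "features" are plain lists of floats, and the memory layout (prompted frames kept, recent frames in a bounded first-in-first-out queue) is an assumption made for illustration. It only shows the control flow: each frame is conditioned on what is already in memory, then written to memory itself.

```python
from collections import deque
import math


class StreamingMemory:
    """Toy memory bank (hypothetical): prompted frames are kept, recent frames are a bounded FIFO."""

    def __init__(self, max_recent=3):
        self.prompted = []                      # features from frames the user prompted
        self.recent = deque(maxlen=max_recent)  # appending beyond maxlen evicts the oldest entry

    def add(self, feature, prompted=False):
        if prompted:
            self.prompted.append(feature)
        else:
            self.recent.append(feature)

    def entries(self):
        return self.prompted + list(self.recent)


def attend(query, memory_entries):
    """Softmax-weighted average of memory entries, scored by dot product with the query."""
    if not memory_entries:
        return list(query)  # no memory yet: the frame stands alone
    scores = [sum(q * m for q, m in zip(query, entry)) for entry in memory_entries]
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w * entry[i] for w, entry in zip(weights, memory_entries)) / total
        for i in range(len(query))
    ]


def process_stream(frame_features, prompted_frames, max_recent=3):
    """Process frames in order, conditioning each on memory before storing it."""
    memory = StreamingMemory(max_recent=max_recent)
    outputs = []
    for index, feature in enumerate(frame_features):
        outputs.append(attend(feature, memory.entries()))
        memory.add(feature, prompted=index in prompted_frames)
    return outputs, memory


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
    outputs, memory = process_stream(frames, prompted_frames={0}, max_recent=2)
    print(len(memory.entries()))  # one prompted frame plus the two most recent frames
```

Because frames are consumed one at a time and the queue is bounded, memory use stays constant however long the video runs, which is what makes the streaming formulation compatible with real-time processing.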
Key results
The paper reports better video segmentation accuracy while using three times fewer user interactions than prior approaches. On image segmentation, SAM 2 is more accurate and six times faster than the first SAM. Meta releases the main model, dataset, training code, and demo, which matters because segmentation tools become much more valuable when they can be adapted and inspected by the community.
Why it matters
Segmentation is an enabling layer for editing, robotics, augmented reality, medical annotation, scientific imaging, and dataset creation. SAM 2 turns a strong image foundation model into a temporal perception tool. That makes it closer to how downstream systems actually see the world: as a stream, not a stack of unrelated pictures.
Limits and open questions
Streaming memory is powerful, but what to store, how long to keep it, and when to discard it becomes both a product and a research question. Long videos, heavy occlusion, scene cuts, and ambiguous prompts can still break tracking. Interactive segmentation also depends on how much correction users are willing to provide. The main question after SAM 2 is whether promptable video understanding can expand from masks to richer object state, actions, and physical relations.
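To make the memory-policy question concrete, here is a small hypothetical simulation. It is not a description of SAM 2's actual policy: `object_views_in_memory` is an illustrative name, frame 0 is assumed to be the prompted frame, and each frame is reduced to a single flag saying whether the object was visible. The sketch compares a pure sliding window with a window that also pins the prompted frame, and counts how many stored entries ever saw the object.

```python
from collections import deque


def object_views_in_memory(visibility, window, pin_prompted=False):
    """Per frame, count stored entries in which the object was visible.

    Frame 0 is assumed to be the prompted frame. With pin_prompted=True it is
    stored outside the FIFO queue and never evicted.
    """
    recent = deque(maxlen=window)  # bounded queue: appending evicts the oldest entry
    pinned = 0
    counts = []
    for index, visible in enumerate(visibility):
        counts.append(pinned + sum(recent))  # what could be consulted for this frame
        seen = 1 if visible else 0
        if index == 0 and pin_prompted:
            pinned = seen
        else:
            recent.append(seen)
    return counts


if __name__ == "__main__":
    # Object visible for two frames, occluded for four, then back in view.
    visibility = [True, True, False, False, False, False, True]
    print(object_views_in_memory(visibility, window=3))
    print(object_views_in_memory(visibility, window=3, pin_prompted=True))
```

In this toy setup, an occlusion longer than the window leaves the pure sliding window with no entry that ever saw the object by the time it reappears, while the pinned variant still holds the prompted frame. Real systems face the same trade-off with richer features and a fixed compute budget.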
One line: SAM 2 gives segmentation a memory.