Vision-Language-Action · Robotics

π0: One Model That Folds Laundry and Drives Seven Robots

TL;DR

A single vision-language-action model, trained on data from seven robot platforms, performs dexterous everyday tasks like folding laundry from plain language prompts.

What problem it solves

General-purpose robots have hit a wall: each new task tends to need its own dataset, its own model, and a lot of hand-tuning. That approach does not scale to the open-ended physical world. π0 asks whether one broadly trained model can learn many skills across many robot bodies and then adapt to new tasks with little data.

The core method

π0 attaches an action head to a pretrained vision-language model and trains it with flow matching, a generative technique that learns a velocity field carrying random noise to samples from the data, here continuous actions. Instead of emitting discrete tokens, the action head produces smooth, high-frequency motor commands, which is what dexterous manipulation needs. Images and language go in, continuous actions come out, all in one network.
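
To make flow matching concrete, here is a minimal NumPy sketch of its two halves: building a training target, and sampling by integrating a velocity field. It assumes the common straight-line convention (noise at tau = 0, data at tau = 1), which may differ from the paper's exact parameterization. The names `flow_matching_pair`, `flow_matching_loss`, `sample_actions`, and `oracle_velocity`, along with the 4 x 7 action chunk, are hypothetical and chosen only for illustration. The oracle stands in for a trained network that would also condition on images and language.

```python
import numpy as np


def flow_matching_pair(action, noise, tau):
    """Return the interpolated point x_tau and its target velocity.

    Assumed convention: x_tau = (1 - tau) * noise + tau * action, so the
    straight-line velocity from noise to action is (action - noise).
    """
    x_tau = (1.0 - tau) * noise + tau * action
    target_velocity = action - noise
    return x_tau, target_velocity


def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((predicted_velocity - target_velocity) ** 2))


def sample_actions(velocity_fn, noise, num_steps=10):
    """Euler-integrate the velocity field from tau = 0 (noise) to tau = 1."""
    x = noise.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        tau = k / num_steps
        x = x + dt * velocity_fn(x, tau)
    return x


# Toy setup: one hypothetical action chunk of 4 timesteps x 7 joint commands.
rng = np.random.default_rng(0)
action_chunk = rng.uniform(-1.0, 1.0, size=(4, 7))
noise = rng.standard_normal(action_chunk.shape)


def oracle_velocity(x, tau):
    # Stand-in for a trained network: the exact velocity field when the
    # dataset contains only `action_chunk`.
    return (action_chunk - x) / (1.0 - tau)


sampled = sample_actions(oracle_velocity, noise, num_steps=10)
print(np.allclose(sampled, action_chunk))
```

In training, a network would predict the velocity at `x_tau` and be penalized with `flow_matching_loss`. At inference, the same network replaces `oracle_velocity`, and a small number of Euler steps turns noise into a full chunk of continuous actions.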

Key results

The model is pretrained on a large, diverse mixture covering seven distinct robot platforms and many tasks, then fine-tuned for harder skills. It performs real-world jobs such as folding laundry, bussing a table, and assembling a box, and it follows natural-language instructions to decide what to do.
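
Training one network on several robot bodies raises a practical question: the robots do not share an action dimension. This summary does not say how π0 reconciles them. One common approach, sketched here as an assumption, is to zero-pad every robot's action vector to a shared width and keep a mask of the real entries. The platform names and dimensions below are hypothetical.

```python
import numpy as np


def pad_actions(actions, target_dim):
    """Zero-pad the last axis to target_dim; return (padded, validity mask)."""
    actions = np.asarray(actions, dtype=np.float32)
    dim = actions.shape[-1]
    if dim > target_dim:
        raise ValueError(f"action dim {dim} exceeds target dim {target_dim}")
    # Pad only the last axis, on the right, with zeros.
    pad_width = [(0, 0)] * (actions.ndim - 1) + [(0, target_dim - dim)]
    padded = np.pad(actions, pad_width)
    mask = np.zeros(target_dim, dtype=bool)
    mask[:dim] = True  # True marks entries that belong to the real robot
    return padded, mask


# Hypothetical platforms and action dimensions, for illustration only.
platform_action_dims = {"single_arm": 7, "dual_arm": 14, "mobile_dual_arm": 16}
shared_dim = max(platform_action_dims.values())

# Two dummy action vectors per platform, padded and stacked into one batch.
batch = {name: np.ones((2, dim)) for name, dim in platform_action_dims.items()}
padded_batch = np.concatenate(
    [pad_actions(actions, shared_dim)[0] for actions in batch.values()], axis=0
)
print(padded_batch.shape)
```

The mask lets a loss or a downstream controller ignore the padded entries, so a 7-dimensional arm and a 16-dimensional mobile manipulator can share one batch and one output layer.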

Why it matters

This is a concrete step toward a single robot foundation model rather than one policy per task. The recipe (pretrain broadly, then adapt) mirrors what worked for language and vision and suggests that robot learning may be entering the same scaling regime.

Limits and open questions

Results are strongest on tasks close to the training mixture, and truly novel skills still need demonstrations. Reliability over long horizons and in rare edge cases remains an open problem, as does the question of how far the approach generalizes beyond the seven platforms it saw.

One line: the language-model playbook, pretrain broadly then adapt, finally reaches robot hands.