MolmoAct2: An Open Action Reasoning Stack for Real Robots

Quick answer

MolmoAct2 is an open “action reasoning” stack that grounds robot control in explicit 3D spatial reasoning instead of mapping pixels straight to motor commands. Out of the box on real-world DROID manipulation it reaches 87.1% success, 38.7 points ahead of the next-best model. Its embodied-reasoning backbone, Molmo2-ER, averages 63.8% across the embodied-reasoning benchmark suite, edging past GPT-5 and Gemini Robotics ER-1.5. Everything is released openly: the models, the OpenFAST action tokenizer, and three new datasets totaling 3.3M samples.

Why “action reasoning” instead of end-to-end VLA

Most vision-language-action (VLA) models learn a direct mapping from camera frames and an instruction to low-level actions. That works in-distribution but is brittle when the scene, embodiment, or task shifts, because the model never forms an explicit plan it can reuse. MolmoAct2’s bet is that a robot policy should first reason about space — where objects are in 3D, which regions matter, what depth the gripper must reach — and only then emit actions. The reasoning is the product, not a hidden activation.

This is why the team built Molmo2-ER (“ER” stands for embodied reasoning) as a dedicated spatial VLM rather than reusing a generic chat model. It is trained with a “specialize-then-rehearse” recipe on the 3.3M-sample corpus, so it sharpens spatial skills without forgetting general vision-language ability.

The three pieces that make it work

Molmo2-ER is the reasoning backbone — a VLM tuned for spatial and embodied questions (object location, reachability, depth, affordance). It improves on the Molmo2 baseline by 17 points on embodied reasoning and posts 99 of 1313 top results across the benchmark suite.

OpenFAST is an open-weight action tokenizer that compresses one second of robot trajectory into discrete tokens drawn from a 2048-token action vocabulary, trained on one million action sequences across five embodiments. It is what lets a language-model-style head “speak” actions.
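
To make the idea of a model "speaking" actions concrete, here is a minimal Python sketch of turning a continuous trajectory chunk into token ids and back. It is not OpenFAST's algorithm, which this article does not describe. It uses plain uniform binning and does no compression (one token per value). The function names, the [-1, 1] action range, and the example chunk are assumptions made for illustration; only the 2048-token vocabulary size comes from the text above.

```python
from typing import List, Sequence

VOCAB_SIZE = 2048  # vocabulary size quoted in the article


def encode(chunk: Sequence[Sequence[float]],
           low: float = -1.0, high: float = 1.0) -> List[int]:
    """Flatten a (timesteps x dims) chunk and map each value to a bin id.

    Assumption: actions are normalized to [low, high]; values outside are clipped.
    """
    tokens = []
    for step in chunk:
        for value in step:
            clipped = min(max(value, low), high)
            frac = (clipped - low) / (high - low)
            # frac == 1.0 would land on VOCAB_SIZE, so cap at the last bin.
            tokens.append(min(int(frac * VOCAB_SIZE), VOCAB_SIZE - 1))
    return tokens


def decode(tokens: Sequence[int], dims: int,
           low: float = -1.0, high: float = 1.0) -> List[List[float]]:
    """Map bin ids back to bin-center values and regroup into timesteps."""
    width = (high - low) / VOCAB_SIZE
    values = [low + (t + 0.5) * width for t in tokens]
    return [values[i:i + dims] for i in range(0, len(values), dims)]


# Hypothetical two-step, two-dimensional chunk.
example_chunk = [[0.0, 0.5], [-1.0, 1.0]]
example_tokens = encode(example_chunk)
example_restored = decode(example_tokens, dims=2)
```

The round trip loses at most half a bin width per value, which is the basic trade a discrete action vocabulary makes. A tokenizer that actually compresses a full second of trajectory into a short sequence needs an additional step that this sketch deliberately leaves out.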

MolmoAct2-Think is the adaptive-depth variant for deployment. Its trick is to re-predict depth tokens only for scene regions that actually change between timesteps, so it keeps geometric grounding while cutting the per-step latency that usually makes reasoning policies too slow for real robots.
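
The change-only idea can be sketched in a few lines of Python. This is an illustrative assumption about how such caching might look, not the paper's implementation. The patch representation, the mean-absolute-difference test, the 0.05 threshold, and the `predict_depth` callback are all hypothetical stand-ins.

```python
from typing import Callable, List, Sequence

Patch = Sequence[float]  # flattened pixel intensities for one image region


def changed_regions(prev: Sequence[Patch], curr: Sequence[Patch],
                    threshold: float = 0.05) -> List[int]:
    """Indices of non-empty patches whose mean absolute change exceeds threshold."""
    changed = []
    for i, (p, c) in enumerate(zip(prev, curr)):
        diff = sum(abs(a - b) for a, b in zip(p, c)) / len(c)
        if diff > threshold:
            changed.append(i)
    return changed


def update_depth_tokens(cached: Sequence[int],
                        prev: Sequence[Patch], curr: Sequence[Patch],
                        predict_depth: Callable[[Patch], int],
                        threshold: float = 0.05) -> List[int]:
    """Reuse cached depth tokens; call the predictor only on changed regions."""
    tokens = list(cached)
    for i in changed_regions(prev, curr, threshold):
        tokens[i] = predict_depth(curr[i])
    return tokens


# Hypothetical usage: three regions, only the middle one changes.
def toy_predictor(patch: Patch) -> int:
    """Stand-in for a learned depth head: buckets mean intensity."""
    return round(sum(patch) / len(patch) * 10)


prev_frame = [[0.1, 0.1], [0.5, 0.5], [0.9, 0.9]]
curr_frame = [[0.1, 0.1], [0.8, 0.8], [0.9, 0.9]]
new_tokens = update_depth_tokens([1, 5, 9], prev_frame, curr_frame, toy_predictor)
```

In this toy case the predictor runs once instead of three times. The per-step cost then scales with how much of the scene moved rather than with image size, which is the latency argument made above.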

Key results

Real-world DROID: 87.1% success, +38.7 points over the runner-up — the headline number, and the one that matters most because DROID is a real, diverse manipulation setting rather than simulation.
Embodied reasoning: Molmo2-ER averages 63.8% across the benchmark suite, surpassing GPT-5 and Gemini Robotics ER-1.5, with 99 of 1313 top results and +17 points over the Molmo2 baseline.
Out-of-box deployment: +10.6% over π₀.₅ on MolmoSpaces and +3.2% absolute on MolmoBot, averaged across tasks; on SO-100/101 it reaches 56.7% vs 45.3% for the baseline.
Fine-tuned LIBERO: MolmoAct2-Think averages 98.1% vs 96.9% for π₀.₅.
Data released: three datasets totaling 3.3M samples, including 34.5k bimanual demonstrations spanning 720 hours, plus DROID (74,604 episodes) and SO-100/101 (38,059 episodes, ~184 hours) collections.
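
For a sense of scale, the dataset figures above imply rough average episode lengths. The short Python calculation below uses only the numbers quoted in this article. It treats "34.5k" as 34,500 and "~184 hours" as 184, so the results are approximations, and the helper name is made up for this example.

```python
def mean_seconds(total_hours: float, episodes: int) -> float:
    """Average episode length in seconds, given total hours and episode count."""
    return total_hours * 3600 / episodes


# Figures as quoted in the article (rounded in the source, so results are rough).
bimanual_seconds = mean_seconds(720, 34_500)
so100_seconds = mean_seconds(184, 38_059)

print(f"bimanual: about {bimanual_seconds:.0f} s per demonstration")
print(f"SO-100/101: about {so100_seconds:.0f} s per episode")
```

That works out to roughly 75 seconds per bimanual demonstration and roughly 17 seconds per SO-100/101 episode, which suggests the bimanual tasks are considerably longer-horizon.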

Honest read

The DROID +38.7 gap is the eye-catching claim, but treat it as a deployment-setting result, not a universal one: the margin shrinks to about ten points or less on simulation benchmarks (+10.6% on MolmoSpaces, +3.2% on MolmoBot, about 1.2 points on fine-tuned LIBERO). That pattern is the interesting story: explicit spatial reasoning buys the most where distribution shift is worst, which is exactly real hardware. If you only read the simulation deltas you would underrate it; if you only read the headline you would overrate the simulation gains.
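
The gaps are easy to check from the figures quoted in this article. The Python snippet below recomputes them and also derives the runner-up's implied DROID score, assuming the +38.7 margin is in absolute percentage points. The variable names are chosen for this example.

```python
# Success rates in percent, as quoted in the article.
droid_success, droid_margin = 87.1, 38.7
libero = {"MolmoAct2-Think": 98.1, "pi0.5": 96.9}
so100 = {"MolmoAct2": 56.7, "baseline": 45.3}

# Implied runner-up score on DROID (assumes an absolute-point margin).
droid_runner_up = round(droid_success - droid_margin, 1)

# Absolute gaps on the benchmarks where both scores are given.
libero_gap = round(libero["MolmoAct2-Think"] - libero["pi0.5"], 1)
so100_gap = round(so100["MolmoAct2"] - so100["baseline"], 1)

print(f"DROID runner-up (implied): {droid_runner_up}%")
print(f"LIBERO gap: {libero_gap} points")
print(f"SO-100/101 gap: {so100_gap} points")
```

Under that assumption the runner-up sits near 48.4% on DROID. The LIBERO gap is 1.2 points and the SO-100/101 gap is 11.4 points, so the other real-hardware result is also a double-digit margin.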

Limits and open questions

The reasoning that makes MolmoAct2 robust is also its cost. Even with MolmoAct2-Think’s change-only depth re-prediction, an explicit-reasoning policy carries more inference overhead than a direct VLA, and the paper frames latency as a problem it mitigates, not one it eliminates. The strongest gains are concentrated in real-world manipulation; the simulation margins are modest, so the claim is “much better where it counts,” not “uniformly dominant.” OpenFAST is trained on five embodiments, leaving open how the tokenizer transfers to robots far outside that set. And as with any VLA, real-world success rates well below 100% mean these are research systems, not production-reliable controllers. The value here is an open, reproducible reasoning stack the field can build on, including released weights, the tokenizer, and the 3.3M-sample datasets.

FAQ

What is MolmoAct2 and who built it?

MolmoAct2 is an open vision-language-action stack from Ai2 (Allen Institute for AI) and collaborators that reasons explicitly about 3D space before producing robot actions, rather than mapping images directly to motor commands.

How does MolmoAct2 compare to Pi-0.5 (π₀.₅)?

MolmoAct2 beats π₀.₅ on the out-of-box comparisons reported here: +10.6% on MolmoSpaces and +3.2% absolute on MolmoBot. After fine-tuning, MolmoAct2-Think reaches 98.1% on LIBERO vs 96.9% for π₀.₅.

Is MolmoAct2’s reasoning model better than GPT-5?

On embodied reasoning, yes: Molmo2-ER averages 63.8% and surpasses both GPT-5 and Gemini Robotics ER-1.5, which are far larger general models, because it is specialized for spatial and embodied questions.

What is OpenFAST in MolmoAct2?

OpenFAST is MolmoAct2’s open-weight action tokenizer. It turns one second of robot trajectory into discrete tokens from a 2048-token vocabulary, trained on one million sequences across five embodiments, so a language-model head can predict actions as tokens.

Is MolmoAct2 open source?

Yes — the models, the OpenFAST tokenizer, and three datasets totaling 3.3M samples (including 720 hours of bimanual teleoperation) are released openly for reproduction.

One line: reason about the scene in 3D first, then act — and the payoff shows up most on real robots, where MolmoAct2 hits 87.1% on DROID. Read the original paper on arXiv.