Video2GUI: Mining 12M GUI Agent Trajectories From Internet Videos

Video2GUI filters 500M video metadata entries down to WildGUI, a dataset of 12M grounded GUI interaction trajectories across 1,500+ apps and sites. Pretraining Qwen2.5-VL and Mimo-VL on it lifts GUI benchmark scores by 5-20%.

Quick answer

Video2GUI is a fully automated pipeline that reads unlabeled Internet tutorial videos and writes out grounded GUI agent trajectories: sequences of screenshots paired with the click, type, and scroll actions that produced them. Run over 500 million video metadata entries, it builds WildGUI, a dataset of 12 million interaction trajectories spanning more than 1,500 applications and websites. Pretraining Qwen2.5-VL and Mimo-VL on WildGUI raises performance 5-20% across GUI grounding and action benchmarks, matching or beating state-of-the-art baselines. The point is not a new model architecture; it is a way to manufacture GUI training data at web scale without human annotators.

Why GUI agents are starved for data

A GUI agent has to look at a screen and decide where to click, what to type, and when to scroll. Teaching that requires trajectories: a screenshot, the action taken on it, and the resulting screen. Today those trajectories mostly come from human annotators driving apps by hand, which is slow, expensive, and stuck in a few domains — a handful of mobile apps or a fixed set of websites. That is why GUI agents generalize poorly the moment they meet an interface outside their training set. Video2GUI’s bet is that the data already exists: millions of screen-recorded software tutorials on the open web are, in effect, demonstrations of someone completing a task on a real interface. The hard part is converting raw video into clean, grounded action labels.
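To make the shape of a trajectory concrete, here is a minimal Python sketch of the screenshot, action, and resulting-screen triple described above. Every class and field name is hypothetical, chosen for illustration; the source does not describe WildGUI's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Action:
    """One user action; all field names here are hypothetical."""
    kind: str                                  # "click", "type", or "scroll"
    point: Optional[Tuple[int, int]] = None    # pixel (x, y), used for clicks
    text: Optional[str] = None                 # typed string, used for "type"


@dataclass
class Step:
    """A screenshot, the action taken on it, and the resulting screen."""
    screenshot_before: str    # path to the frame the action was taken on
    action: Action
    screenshot_after: str     # path to the frame showing the result


@dataclass
class Trajectory:
    app: str
    steps: List[Step] = field(default_factory=list)

    def is_chained(self) -> bool:
        """True if each step starts on the screen the previous step ended on."""
        return all(
            prev.screenshot_after == nxt.screenshot_before
            for prev, nxt in zip(self.steps, self.steps[1:])
        )


# Illustrative two-step trajectory with made-up file names.
demo = Trajectory(
    app="example-editor",
    steps=[
        Step("f0.png", Action("click", point=(120, 48)), "f1.png"),
        Step("f1.png", Action("type", text="hello"), "f2.png"),
    ],
)
```

The `is_chained` check is one simple sanity test a dataset like this could run: a step that does not begin where the previous one ended usually means a missed action in between.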

How Video2GUI turns video into trajectories

The pipeline is coarse-to-fine, and the filtering is the actual contribution. Starting from 500 million video metadata entries, it first cheaply discards anything that is not a screen-capture GUI tutorial, such as talking-head footage, gameplay, or slideshows. Surviving videos are then inspected frame by frame to find the moments where the interface changes because the user acted, and to reconstruct what that action was: a click at a location, a typed string, a scroll. The output is a structured trajectory in the same shape a GUI agent consumes at training time (screenshot in, grounded action out), but produced with no human labeling. The yield is the headline: 500 million metadata entries go in and 12 million usable trajectories covering over 1,500 distinct apps and sites come out, which is what gives the dataset its breadth.
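The two stages can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' method: the source does not say how the coarse filter or the change detection is implemented, so the keyword lists, thresholds, and function names below are all hypothetical stand-ins. The coarse stage is a title keyword match, and the fine stage is plain frame differencing on grayscale pixels.

```python
from typing import Dict, List

Frame = List[List[int]]  # grayscale frame as rows of 0-255 pixel values

# Hypothetical keyword lists; the source does not say how the coarse filter works.
TUTORIAL_HINTS = ("tutorial", "how to", "walkthrough")
REJECT_HINTS = ("gameplay", "vlog", "podcast")


def coarse_filter(metadata: Dict[str, str]) -> bool:
    """Cheap stage: keep a video only if its title looks like a GUI tutorial."""
    title = metadata.get("title", "").lower()
    if any(word in title for word in REJECT_HINTS):
        return False
    return any(word in title for word in TUTORIAL_HINTS)


def changed_fraction(a: Frame, b: Frame, pixel_tol: int = 16) -> float:
    """Fraction of pixels whose brightness differs by more than pixel_tol."""
    total = 0
    changed = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += 1
            if abs(pa - pb) > pixel_tol:
                changed += 1
    return changed / total if total else 0.0


def find_change_points(frames: List[Frame], min_fraction: float = 0.05) -> List[int]:
    """Fine stage: indices i where frame i differs noticeably from frame i-1."""
    return [
        i
        for i in range(1, len(frames))
        if changed_fraction(frames[i - 1], frames[i]) >= min_fraction
    ]


# Tiny synthetic check: a 4x4 screen where a "dialog" appears in frame 2.
blank = [[0] * 4 for _ in range(4)]
dialog = [[0, 0, 0, 0], [0, 255, 255, 0], [0, 255, 255, 0], [0, 0, 0, 0]]
frames = [blank, blank, dialog, dialog]

videos = [
    {"title": "How to export a PDF in ExampleApp"},
    {"title": "Epic gameplay highlights"},
]
kept = [v for v in videos if coarse_filter(v)]
change_points = find_change_points(frames)
```

The ordering is the design point: the metadata check touches only text and can run over hundreds of millions of entries, while the per-pixel comparison is reserved for the small fraction of videos that survive. Recovering which action caused each change (the click location or typed string) is the harder step and is not attempted here.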

Key results

  • WildGUI scale: 12 million grounded interaction trajectories, mined from 500 million video metadata entries, covering more than 1,500 applications and websites — far wider domain coverage than hand-annotated GUI datasets.
  • Pretraining gains: Pretraining Qwen2.5-VL and Mimo-VL on WildGUI yields consistent improvements of 5-20% across multiple GUI grounding and action benchmarks.
  • Competitiveness: The WildGUI-pretrained models match or surpass state-of-the-art performance on those benchmarks, despite the training signal coming entirely from automatically mined video rather than curated annotation.
  • Generality: The gain shows up on two different base models (Qwen2.5-VL and Mimo-VL), suggesting the data — not a single model’s quirks — is what carries the improvement.

Why this matters now

The bottleneck for GUI and computer-use agents in 2026 is not model capacity; it is grounded data that covers the long tail of real software. Video2GUI reframes that bottleneck as a mining problem against a corpus the field has largely ignored: the open web’s enormous stock of screen-recorded tutorials. If the pipeline and WildGUI dataset hold up once released, the interesting consequence is that anyone can grow GUI training data by pointing the filter at more video, rather than paying annotators per trajectory. That is a different scaling curve from the annotation-bound one most GUI agent work sits on today.

Limits and open questions

The 5-20% improvement is reported as a range across benchmarks, and the abstract does not pin down which benchmark gets 5% and which gets 20%, so the gains are real but uneven, and the weakest benchmarks may barely move. Mined-from-video data also inherits video’s biases: tutorials over-represent popular consumer apps and “happy path” workflows, and under-represent error states, enterprise tools, and the messy recovery behavior that makes an agent robust. Reconstructing the exact action from pixels is inherently noisy: a click location inferred from a frame transition can be off, and label noise at 12M scale is hard to audit. Finally, the results are pretraining gains on two vision-language backbones. Whether the same trajectories help end-to-end task success in live environments like full OS or browser control, not just on offline grounding and action benchmarks, is the question the abstract leaves open.

FAQ

What is Video2GUI?

Video2GUI is an automated framework that extracts grounded GUI interaction trajectories — screenshots paired with the click/type/scroll actions that caused them — directly from unlabeled Internet tutorial videos, with no human annotation.

What is the WildGUI dataset?

WildGUI is the dataset Video2GUI produces: 12 million GUI interaction trajectories mined from 500 million video metadata entries, spanning more than 1,500 applications and websites. The authors say they will release it along with the pipeline.

How much does Video2GUI improve GUI agents?

Pretraining Qwen2.5-VL and Mimo-VL on WildGUI gives consistent 5-20% gains across multiple GUI grounding and action benchmarks, matching or surpassing state-of-the-art performance.

How is Video2GUI different from hand-annotated GUI datasets?

Hand-annotated datasets are small, costly, and confined to narrow domains. Video2GUI mines data automatically from web video, so it covers 1,500+ apps and sites and scales with available video rather than annotation budget.

What are the main limitations of Video2GUI?

The 5-20% gain is uneven across benchmarks, mined trajectories carry video’s bias toward popular apps and happy-path flows, pixel-inferred action labels are noisy at 12M scale, and the gains shown are on offline benchmarks rather than live end-to-end task success.

One line: the data for GUI agents was already on the web as tutorial videos — Video2GUI just learned to read it. See the original paper on arXiv.