Topics

World Models

Generative models that simulate consistent, controllable environments over time.

Abstract 3D environment suggesting a simulated, navigable world

ABot-Earth 0.5: Generating 3D Cities From Satellite Images

ABot-Earth 0.5 uses satellite imagery to generate 3D Gaussian Splatting city scenes, reporting under 10 minutes per square kilometer and FID 16.1.

World Models · Independent Researcher

AnchorWorld: Egocentric World Simulation for Embodied AI

AnchorWorld: Egocentric World Simulation for Embodied AI turns egocentric world simulation into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

World Models · JD.com (Joy Future Academy)

Echo-Memory: Which Memory Lets a World Model Remember a Room?

When a camera revisits an old spot, block-wise state-space recurrence scored 69.0 open-domain VLM consistency vs 12.25 for the no-memory baseline; aggressive compression and spatial summaries mostly collapsed.

World Models · Independent Researcher

Function2Scene: 3D Indoor Layout from Functional Specs

Function2Scene: 3D Indoor Layout from Functional Specs turns functional 3D scene layout into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

World Models · University of Macau

PF-OPSD: When Should an MLLM Trust a World Model's Video?

PF-OPSD teaches a Qwen3.5-9B MLLM to decide when to simulate the future with a video world model, verify the rollout, and fold it into its answer, lifting accuracy +10.6 and +10.9 points on two new QA benchmarks.

Video Generation · HKUST

Echo-Infinity: Learnable Evolving Memory for 24-Hour Real-Time Video

Echo-Infinity is an autoregressive video model with a learnable evolving memory that compresses any-length history at constant cost, hitting 24-hour rollouts (over 1.3M frames) in real time at 18.5 FPS on an H100.

Diffusion Models · Tsinghua University

Causal Forcing++: Few-Step Autoregressive Diffusion for Real-Time Video

Causal Forcing++ distills bidirectional video diffusion into a 1-2 step frame-wise autoregressive generator at 14.1 FPS, halves first-frame latency, and cuts few-step training cost ~4x (11,600 to 2,900 A800 GPU hours).

Diffusion Models · Alibaba Qwen Team

MIGA: Train-Free Infinite-Frame Generation for Consistent Long Videos

MIGA turns a fixed-length video diffusion model into a 1000+-frame generator with no training and constant memory, hitting 97.82 overall on VBench with VideoCrafter2 — about 2.8 points over FIFO-Diffusion.

World Models · NVIDIA

Gamma-World: A Multi-Agent World Model That Scales Past Two Players

Gamma-World is NVIDIA's video world model for multiplayer simulation that runs at 24 FPS and generalizes from two to four players with no retraining, cutting Solaris's FVD roughly in half.

World Models · Microsoft Research

Mirage: Latent Spatial Memory Makes Video World Models 10x Faster

Mirage stores a video world model's 3D memory inside diffusion latent space instead of an RGB point cloud, hitting state-of-the-art WorldScore (70.36) while running 10.57x faster and using 55x less GPU memory.

Diffusion Models · NVIDIA

LongLive-2.0: NVFP4 4-bit Training and Inference for Long Video

LongLive-2.0 runs a 5B long-video model end to end in NVFP4 4-bit, hitting 45.7 FPS at 720p, 2.1x faster training and 1.84x faster inference, while VBench total drops only ~0.5 points from BF16.

Diffusion Models · University of Science and Technology of China

Stream-R1: Reliability-Perplexity Aware Reward Distillation Explained

Stream-R1 reweights DMD losses by video reward scores and per-region perplexity instead of treating signals equally. Its 1.3B streaming model hits 84.40 VBench at 23.1 FPS, beating its 14B teacher's 84.26 for free.

Diffusion Models · University of Science and Technology of China

Stream-T1: Test-Time Scaling for Streaming Video Generation

Stream-T1 adds test-time search to streaming video generation without retraining, lifting VideoAlign motion quality from 0.350 to 0.629 at 5s and cutting the drift that wrecks 30-second clips.

World Models · Fudan University

WBench: A Multi-turn Benchmark for Interactive Video World Models

WBench scores interactive video world models on five axes — quality, setting, interaction, consistency, physics — across 289 cases and 1,058 turns, and finds no single model wins on all five.