Institution

ByteDance

ByteDance's AI research arm (Seed), publishing work on multimodal generation, foundation models, and video.

TaskMem: Teaching a Video Agent What Is Worth Remembering

TaskMem trains a multimodal agent to write its own memory with RL, lifting streaming-video QA accuracy to 67.9% on VideoMME and 45.4% on EgoLife, gains of 6.3 and 7.0 points, respectively, over the Qwen3-VL-30B baseline.

Multimodal Models · ByteDance

Representation Forcing: Unified Multimodal Models Without a VAE

Representation Forcing drops the frozen VAE from unified multimodal models. RF-Pixel predicts visual representation tokens before pixels, scores 0.84 on GenEval, and lifts MMMU by 4.3 points over its VAE variant.

Speech Synthesis · ByteDance

SwanVoice: Zero-Shot Speech Synthesis for Long Monologue and Dialogue

SwanVoice is a zero-shot TTS system that generates an entire conversation of 1 to 4 speakers in one pass. It keeps voice, mood, and prosody consistent across turns, where turn-by-turn synthesis drifts, but its content accuracy lags.