SenseNova-U1: One Model for Multimodal Understanding and Generation
SenseNova-U1 puts image understanding and image generation in one network with shared attention. Its A3B variant hits 80.55 on MMMU and 0.91 on GenEval — a single model that reads and draws.
Quick answer
SenseNova-U1 is a single transformer that both understands images and generates them, instead of bolting a diffusion model onto a vision-language model. Its larger A3B variant (a 30B mixture-of-experts) scores 80.55 on MMMU and 91.59 on MMBench-EN for understanding, while both variants reach 0.91 overall on GenEval for text-to-image generation. The 8B dense variant scores 74.78 on MMMU and 82.10 on OCRBench. The point is not any single record; it is that one model performs competitively on both sides of a divide that most systems handle with two separate stacks.
Why “unified” is the hard part
Most multimodal systems that claim to “understand and generate” are really two models in a trench coat: a vision-language model for reading images and a separate diffusion decoder for drawing them, joined by a thin interface. SenseNova-U1’s bet is that understanding and generation are two views of the same process and should share one backbone. That is harder than it sounds — the two tasks pull weights in different directions, so naively merging them usually degrades both. SenseNova-U1 reaches competitive numbers on both without an external diffusion model, which is the result worth checking.
How NEO-unify works
The backbone is a native Mixture-of-Transformers (MoT) with a unified self-attention stream and RoPE across temporal and spatial axes. Two design choices stand out. First, generation is done in pixel space via flow matching with a velocity loss and an MLP decoder, skipping the usual VAE plus diffusion-head pipeline. Second, the architecture decouples parameters between the understanding and generation paths while still sharing attention, so the two objectives interfere less. Training uses a joint loss: cross-entropy for text tokens and a flow-matching velocity loss for image pixels, with classifier-free guidance weighted independently for text and image. The visual front end is a two-layer convolutional encoder whose strides (16, then 2) multiply out to 32x32-pixel patches.
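To make the joint objective concrete, here is a minimal sketch in plain Python. It is not the authors' code. It assumes the common straight-line path between noise and image (the article does not say which path the paper uses), and the names `interpolate`, `velocity_target`, `joint_loss`, and `image_weight` are hypothetical.

```python
import math
import random

def interpolate(noise, image, t):
    """Point on the straight path from noise (t=0) to image (t=1). Assumed path."""
    return [(1.0 - t) * n + t * x for n, x in zip(noise, image)]

def velocity_target(noise, image):
    """Velocity of the straight path: d(x_t)/dt = image - noise."""
    return [x - n for n, x in zip(noise, image)]

def velocity_loss(predicted, target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)

def cross_entropy(probs, label):
    """Negative log-likelihood of the correct next text token."""
    return -math.log(probs[label])

def joint_loss(text_probs, text_label, pred_velocity, target, image_weight=1.0):
    """Text cross-entropy plus weighted velocity loss (the weight is hypothetical)."""
    return cross_entropy(text_probs, text_label) + image_weight * velocity_loss(pred_velocity, target)

rng = random.Random(0)
image = [0.2, 0.8, 0.5, 0.1]                  # flattened toy "pixels"
noise = [rng.gauss(0.0, 1.0) for _ in image]  # Gaussian starting point
t = 0.3
x_t = interpolate(noise, image, t)            # what a model would receive as input
target = velocity_target(noise, image)        # what a model would be asked to predict

# A perfect velocity prediction leaves only the text term in the loss.
perfect = joint_loss([0.1, 0.7, 0.2], 1, target, target)
```

Under this assumed path, following the target velocity from `x_t` for the remaining time `1 - t` lands exactly on the image, which is why a velocity predictor can replace a denoising head.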
What the two variants are
There are two releases. SenseNova-U1-8B-MoT is dense: 8.2B parameters for understanding, 8.2B for generation, 42 layers, one expert. SenseNova-U1-A3B-MoT is the mixture-of-experts version: 30.0B understanding parameters with 128 experts (32 active), still 8.2B on the generation side, 48 layers. The naming is worth decoding — A3B refers to roughly 3B active parameters per token in the MoE understanding path, which is how a 30B model stays cheap to run. Total training runs about 3.75T tokens across six stages, from warmup through generation pretraining, unified mid-training, SFT, and post-training.
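The "cheap to run" claim is simple arithmetic, sketched below using only the figures quoted above. The "about 3B active" value is approximate, the helper name `active_fraction` is hypothetical, and the comparison covers the understanding path only.

```python
def active_fraction(active_b, total_b):
    """Share of parameters used per token (both arguments in billions)."""
    return active_b / total_b

# Figures from the text above; "about 3B active" is approximate.
A3B_TOTAL_B, A3B_ACTIVE_B = 30.0, 3.0
DENSE_TOTAL_B = 8.2  # a dense model uses all of its weights on every token

a3b_share = active_fraction(A3B_ACTIVE_B, A3B_TOTAL_B)
dense_share = active_fraction(DENSE_TOTAL_B, DENSE_TOTAL_B)
# Per-token understanding weights of A3B relative to the 8B dense model.
a3b_vs_dense = A3B_ACTIVE_B / DENSE_TOTAL_B

print(f"A3B uses about {a3b_share:.0%} of its understanding weights per token")
print(f"8B dense uses {dense_share:.0%}")
print(f"A3B-to-8B ratio of per-token understanding weights: {a3b_vs_dense:.2f}")
```

On these numbers the larger model activates fewer understanding parameters per token than the dense one, even though it stores several times as many.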
Key results
- MMMU (understanding): A3B-MoT scores 80.55; the 8B variant scores 74.78.
- MMBench-EN: 91.59 (A3B) and 90.25 (8B).
- OCRBench: 91.90 (A3B) and 82.10 (8B) — strong text-in-image reading.
- VSI-Bench (spatial): the 8B variant scores 62.66, ahead of A3B’s 56.90 — a rare case where the smaller model wins.
- GenEval (text-to-image): both variants hit 0.91 overall, with 1.00 on single-object and 0.96 on two-object.
- DPG-Bench: 88.14 (A3B) and 87.78 (8B).
- Text-rich generation: on CVTG-2K the 8B variant averages 0.940; on LongText-Bench it reaches 0.979 English / 0.962 Chinese.
- Text understanding: MMLU-Pro 84.04 (A3B) and 81.44 (8B); IFEval 92.39 (A3B).
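For a quick side-by-side, the sketch below puts the benchmarks that report a score for both variants into a small table and lists where the 8B model leads. The scores are copied from the list above; the function names are hypothetical.

```python
# Scores copied from the results listed above: (A3B-MoT, 8B-MoT).
SCORES = {
    "MMMU": (80.55, 74.78),
    "MMBench-EN": (91.59, 90.25),
    "OCRBench": (91.90, 82.10),
    "VSI-Bench": (56.90, 62.66),
    "DPG-Bench": (88.14, 87.78),
    "MMLU-Pro": (84.04, 81.44),
}

def gaps(scores):
    """Return A3B minus 8B for each benchmark, rounded to two decimals."""
    return {name: round(a3b - dense, 2) for name, (a3b, dense) in scores.items()}

def dense_wins(scores):
    """Benchmarks where the 8B dense variant outscores the A3B variant."""
    return [name for name, gap in gaps(scores).items() if gap < 0]

if __name__ == "__main__":
    for name, gap in gaps(SCORES).items():
        print(f"{name:<12} {gap:+.2f}")
    print("8B ahead on:", dense_wins(SCORES))
```

Only VSI-Bench comes out negative, with the 8B variant ahead by 5.76 points.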
Limits and open questions
The honest weak spots are documented in the paper. The pixel-space decoder produces grid artifacts, which the authors attribute to the final FFN and MLP head modeling each 32x32 patch independently — a direct cost of skipping the VAE. On GenEval’s attribute-binding metric the model sits slightly below 0.80, trailing specialized generators like OneCAT and Mogao. Spatial reasoning still lags larger reasoning-specialized models, which the authors frame as a deliberate trade-off favoring high-fidelity generation. And the most interesting open question is the VSI-Bench inversion: the 8B model beats the 30B A3B model on spatial intelligence, suggesting the MoE scaling does not help uniformly and that understanding and generation may still compete for capacity more than the “synergistic” framing implies.
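To picture where grid artifacts can appear, the sketch below computes the patch grid and the interior seams between neighbouring 32x32 patches. The 256-pixel image size and the function names are hypothetical; only the patch size comes from the text.

```python
PATCH = 32  # patch side in pixels, as described in the text

def patch_grid(height, width, patch=PATCH):
    """Number of patch rows and columns for an image whose sides divide evenly."""
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    return height // patch, width // patch

def seam_positions(size, patch=PATCH):
    """Interior pixel offsets where two neighbouring patches meet along one axis."""
    return list(range(patch, size, patch))

rows, cols = patch_grid(256, 256)  # 256 is a hypothetical image size
seams = seam_positions(256)
print(rows * cols, "patches; seams at", seams)
```

Because each patch is decoded on its own, any mismatch between neighbours shows up along these seam lines, which is what produces a regular grid pattern.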
FAQ
What is SenseNova-U1?
SenseNova-U1 is a unified multimodal model from the SenseNova team that performs both image understanding and image generation in a single transformer, using the NEO-unify architecture rather than pairing a vision-language model with a separate diffusion model.
How does SenseNova-U1 generate images without a diffusion model?
SenseNova-U1 generates in pixel space using flow matching with a velocity loss and an MLP decoder, skipping the usual VAE plus diffusion-head pipeline. The trade-off is visible grid artifacts from modeling 32x32 patches independently.
What is the difference between SenseNova-U1 8B and A3B?
The 8B-MoT is a dense model with 8.2B understanding parameters. The A3B-MoT is a 30B mixture-of-experts with 128 experts (32 active) for understanding. A3B is stronger on most understanding benchmarks, but the 8B model actually wins on VSI-Bench spatial reasoning (62.66 vs 56.90).
Is SenseNova-U1 good at text-to-image generation?
Yes. Both variants reach 0.91 overall on GenEval and around 88 on DPG-Bench, competitive with dedicated generators. Its strongest niche is text-rich images: the 8B variant averages 0.940 on CVTG-2K and reaches 0.979 in English on LongText-Bench. It is weakest on attribute binding, where its GenEval score sits slightly below 0.80.
One line: a single backbone that reads and draws, trading a little generation fidelity for not needing two models. Read the original paper on arXiv.