Mixtral of Experts: The 47B Sparse MoE That Runs Like a 13B Model

Quick answer

Mixtral 8x7B is a sparse Mixture-of-Experts language model that stores 47B total parameters but activates only about 13B per token, because a router picks just 2 of 8 expert feed-forward blocks at each layer. With that 13B compute budget it matches or beats Llama 2 70B and GPT-3.5 on every benchmark Mistral evaluated, and it ships — base and instruct — under a permissive Apache 2.0 license.

How sparse routing works

Mixtral keeps the Mistral 7B architecture but replaces each layer’s single feed-forward block with 8 of them, the “experts.” A small router network looks at every token’s hidden state and outputs a score for each of the 8 experts; only the top-2 are run, and their outputs are summed using weights from a softmax over those two scores. The other 6 experts sit idle for that token.
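The routing step described above can be sketched in a few lines of plain Python. This is a toy illustration, not Mistral's implementation: the gate weights, the two-dimensional token vector, and the "experts" (which just scale their input) are all hypothetical stand-ins chosen so the arithmetic is easy to follow.

```python
import math

NUM_EXPERTS = 8  # experts per layer, as described in the article
TOP_K = 2        # experts actually run for each token

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def router_logits(x, gate_weights):
    """One score per expert: dot product of the token vector with that expert's gate row."""
    return [sum(w * v for w, v in zip(row, x)) for row in gate_weights]

def route(logits, k=TOP_K):
    """Return [(expert_index, gate_weight), ...] for the k highest-scoring experts."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    gates = softmax([logits[i] for i in top])  # softmax over the selected scores only
    return list(zip(top, gates))

def moe_layer(x, gate_weights, experts):
    """Run only the chosen experts and sum their outputs, weighted by the gates."""
    out = [0.0] * len(x)
    for idx, gate in route(router_logits(x, gate_weights)):
        y = experts[idx](x)  # the other experts are never called for this token
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out

# Hypothetical toy experts: expert i multiplies its input by (i + 1).
def make_expert(i):
    return lambda x: [(i + 1) * v for v in x]

experts = [make_expert(i) for i in range(NUM_EXPERTS)]

# Hypothetical gate weights, one row per expert.
gate_weights = [
    [0.0, 0.0], [1.0, 0.5], [-1.0, 0.0], [0.1, 0.1],
    [0.0, 1.0], [0.0, 0.0], [-0.5, 0.0], [0.5, 0.25],
]

x = [1.0, 2.0]
chosen = route(router_logits(x, gate_weights))
output = moe_layer(x, gate_weights, experts)
print(chosen)  # experts 1 and 4 tie for the top score, so each gets weight 0.5
print(output)  # 0.5 * (2 * x) + 0.5 * (5 * x) = [3.5, 7.0]
```

In a real model this function runs once per token at every layer, each layer with its own gate weights, which is why a token's expert pair can change from layer to layer.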

The key detail is that selection is per token, per layer, not per prompt. The same sentence can route consecutive tokens to entirely different expert pairs, and a token’s experts at layer 5 say nothing about its experts at layer 20. The “8x7B” name is also misleading on both ends: only the feed-forward blocks are replicated while the attention layers are shared, so the total is 47B, not 56B; and because only 2 of the 8 feed-forward blocks run per layer, about 13B parameters are active per token rather than the full 47B.
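The 47B and 13B figures can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes the architecture hyperparameters from the published Mixtral 8x7B configuration (hidden size 4096, 32 layers, feed-forward size 14336, 32 query heads and 8 key/value heads of size 128, vocabulary 32000); none of these numbers appear in this article, so treat them as assumptions. It also assumes three weight matrices per expert and untied input and output embeddings.

```python
# Assumed hyperparameters (from the published Mixtral 8x7B config, not from this article).
DIM = 4096          # model hidden size
N_LAYERS = 32
HIDDEN_DIM = 14336  # feed-forward inner size of one expert
N_HEADS = 32        # query heads
N_KV_HEADS = 8      # key/value heads (grouped-query attention)
HEAD_DIM = 128
VOCAB = 32000
N_EXPERTS = 8
TOP_K = 2

def attention_params():
    q = DIM * N_HEADS * HEAD_DIM
    k = DIM * N_KV_HEADS * HEAD_DIM
    v = DIM * N_KV_HEADS * HEAD_DIM
    o = N_HEADS * HEAD_DIM * DIM
    return q + k + v + o

def expert_params():
    # Assumes a gated feed-forward block with three DIM x HIDDEN_DIM matrices.
    return 3 * DIM * HIDDEN_DIM

def shared_params_per_layer():
    router = DIM * N_EXPERTS  # one gate row per expert
    norms = 2 * DIM           # two normalization weight vectors per layer
    return attention_params() + router + norms

def model_params(experts_counted):
    per_layer = shared_params_per_layer() + experts_counted * expert_params()
    embeddings = 2 * VOCAB * DIM  # input embedding + output head, assumed untied
    final_norm = DIM
    return N_LAYERS * per_layer + embeddings + final_norm

total = model_params(N_EXPERTS)  # every expert stored
active = model_params(TOP_K)     # only the routed experts counted

print(f"total  ~ {total / 1e9:.1f}B")   # total  ~ 46.7B
print(f"active ~ {active / 1e9:.1f}B")  # active ~ 12.9B
```

Under these assumptions the experts account for most of the total, and everything else (attention, router, norms, embeddings) is counted once, which is why the sum lands near 47B instead of 8 x 7B = 56B.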

Why 47B params cost like 13B

This is the whole pitch. A dense 47B model would run every token through all 47B parameters. Mixtral runs each token through only the shared layers plus the 2 chosen experts per layer, so the per-token FLOPs are close to those of a ~13B dense model while the model’s capacity is that of a much larger network. You get much of the quality of scale at the inference compute of a mid-size model.
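To put rough numbers on that claim, a common rule of thumb is about 2 FLOPs per active parameter per token in a forward pass (one multiply and one add per weight). The sketch below applies that approximation to the article's rounded parameter counts; it ignores attention over the context and any routing overhead, so read the results as order-of-magnitude estimates only.

```python
FLOPS_PER_PARAM = 2  # rule-of-thumb assumption: one multiply + one add per weight

def forward_flops_per_token(active_params):
    """Approximate forward-pass FLOPs for one token."""
    return FLOPS_PER_PARAM * active_params

mixtral = forward_flops_per_token(13e9)      # ~13B active parameters
dense_47b = forward_flops_per_token(47e9)    # hypothetical dense model of the same size
llama2_70b = forward_flops_per_token(70e9)   # dense 70B model

print(f"Mixtral:   {mixtral / 1e9:.0f} GFLOPs per token")
print(f"Dense 47B: {dense_47b / 1e9:.0f} GFLOPs per token ({dense_47b / mixtral:.1f}x Mixtral)")
print(f"Dense 70B: {llama2_70b / 1e9:.0f} GFLOPs per token ({llama2_70b / mixtral:.1f}x Mixtral)")
```

The 70B-to-13B ratio of about 5.4 is where the "roughly 5x fewer active parameters" comparison with Llama 2 70B comes from.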

The honest catch is memory. Routing is dynamic, so you cannot know in advance which experts a request will need, and in practice all 47B parameters must be resident in VRAM. Mixtral is cheap in per-token compute but not in memory footprint: it needs the memory of a 47B model to run at the speed of a 13B one. That trade is great for a busy server batching many requests and bad for a single laptop.
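A quick estimate shows how large that memory bill is. The sketch below counts weight storage only, using the article's rounded 47B figure and standard bytes-per-parameter values for each precision; it deliberately leaves out the KV cache, activations, and quantization metadata, so real deployments need headroom beyond these numbers.

```python
TOTAL_PARAMS = 47e9  # all experts must be loaded, not just the active ones

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {
    "fp16/bf16": 2.0,
    "int8": 1.0,
    "4-bit": 0.5,
}

def weight_memory_gb(num_params, bytes_per_param):
    """Weight storage in gigabytes (1 GB = 1e9 bytes); excludes KV cache and activations."""
    return num_params * bytes_per_param / 1e9

footprints = {
    name: weight_memory_gb(TOTAL_PARAMS, size)
    for name, size in BYTES_PER_PARAM.items()
}

for name, gb in footprints.items():
    print(f"{name:>9}: ~{gb:.1f} GB of weights")
# fp16/bf16: ~94.0 GB of weights
#      int8: ~47.0 GB of weights
#     4-bit: ~23.5 GB of weights
```

The same calculation for only the ~13B active parameters would give about 26 GB at 16-bit precision, which is the gap between what the model computes with and what it has to keep loaded.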

Key results

Parameter math: 8 experts per layer, top-2 routed, 47B total parameters, ~13B active per token, 32k-token context.
vs Llama 2 70B: Mixtral matches or outperforms it across all benchmarks reported, while using roughly 5x fewer active parameters at inference.
Where the gap is largest: Mistral reports Mixtral “vastly outperforms” Llama 2 70B on mathematics, code generation, and multilingual benchmarks specifically.
vs GPT-3.5: the base model matches or beats GPT-3.5 across the evaluated suite.
Instruct model: Mixtral 8x7B Instruct surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B chat on human evaluation benchmarks.
License: both base and instruct are released under Apache 2.0 — commercial use, fine-tuning, and redistribution all permitted.

Why it matters now

Mixtral is the release that turned sparse MoE from a research-lab technique into something the open community could actually download and serve. Before it, the strong open models were dense (the Llama line), and MoE results lived mostly inside closed frontier labs. Mixtral handed everyone a concrete, permissively licensed model where the routing genuinely pays off, and it reset the price-performance frontier: GPT-3.5-class quality at roughly 13B inference cost. The Apache 2.0 license is doing as much work here as the architecture: it is what let MoE serving stacks, fine-tunes, and quantizations proliferate within weeks.

Limits and open questions

The compute win is real but the memory bill is not optional: you provision VRAM for 47B parameters to enjoy 13B-speed inference, which makes Mixtral awkward on single consumer GPUs and pushes most users toward quantization. The “experts” are also not specialists in any human-readable sense: the paper’s routing analysis found no obvious pattern of experts being assigned to topics or domains, so “expert” is an architectural label, not a semantic one, and you cannot steer the model by picking an expert. Top-2-of-8 is a fixed design choice rather than a derived optimum, and the benchmark wins are Mistral’s own evaluations against models from 2023, so the comparison is a snapshot, not a standing claim. None of this dents the core result, but it does mean “47B that runs like 13B” should always be read with “…if you can hold 47B in memory.”

FAQ

How many parameters does Mixtral 8x7B actually use per token?

About 13B active parameters per token, out of 47B total. The router selects 2 of the 8 experts at each layer, so most of the network is skipped for any given token even though all 47B must be loaded in memory.

Is Mixtral really better than Llama 2 70B?

On the benchmarks Mistral reported, Mixtral 8x7B matches or beats Llama 2 70B across the board — and outperforms it by a wide margin on math, code, and multilingual tasks — while using roughly 5x fewer active parameters at inference.

Why isn’t Mixtral 8x7B a 56B model?

Because the 8 experts only replace the feed-forward blocks; the attention layers are shared across all experts. Adding the shared parameters once instead of eight times brings the total to 47B, not 56B.

Can I use Mixtral commercially?

Yes. Both the base and the instruct-tuned Mixtral 8x7B are released under Apache 2.0, which permits commercial use, fine-tuning, and redistribution.

Do Mixtral’s experts specialize in topics like math or code?

Not in a human-readable way. The authors found the router does not assign experts to interpretable domains; routing is learned for performance, so an “expert” is not a topic specialist you can select on purpose.

One line: pick 2 experts of 8 per token and a 47B model thinks like a giant but bills like a 13B — as long as you can fit it in memory. Read the original paper on arXiv.