Mellum 2: A 12B MoE Code Model Running at 2.5B Compute

Quick answer

Mellum 2 is JetBrains’ open-weight code model with 12B total parameters but only 2.5B active per token, built as a 64-expert Mixture-of-Experts that routes 8 experts per token. It is pre-trained on roughly 10.6 trillion tokens, extended to a 128K context window, and JetBrains reports it stays competitive with dense open-weight baselines in the 4B-14B range while costing the per-token compute of a 2.5B dense model. Base, instruct, and thinking checkpoints all ship under Apache 2.0.

What Mellum 2 is built for

Mellum is JetBrains’ in-house code model line, and Mellum 2 targets the full loop of an IDE assistant rather than just autocomplete: code generation, editing, debugging, reasoning over a repository, tool use, and conversational programming help. That framing matters for who should care — this is a model designed by the company behind IntelliJ and PyCharm to sit inside developer tooling, not a general chat model that happens to write code.

The headline bet is efficiency. Most open code models in the usable-on-a-workstation range are dense, so every token pays for every parameter. Mellum 2 instead stores 12B parameters’ worth of knowledge but activates only about 2.5B of them per token. If the quality holds, that is the difference between a model you can serve cheaply at IDE latency and one you cannot.

How the architecture keeps compute low

The MoE layout is the core trick: 64 experts with 8 active per token gives the model a large knowledge capacity while holding per-token compute to about 2.5B active parameters. On top of that, Mellum 2 stacks several efficiency choices that compound:

Grouped-Query Attention with 4 KV heads shrinks the key-value cache, the part of inference memory that grows with every token of context.
Sliding Window Attention on 75% of layers means most layers only attend locally, so attention cost does not blow up across a 128K window.
A Multi-Token Prediction (MTP) head does double duty — it is an auxiliary objective during pre-training and a speculative-decoding draft head at inference, which can raise tokens-per-second without a separate draft model.
Layer-selective YaRN extends the context to 128K by rescaling positions on chosen layers rather than uniformly.
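
To see why the GQA and sliding-window choices above matter at 128K, the sketch below estimates KV-cache size for a single sequence. Only the 4 KV heads, the 75% sliding-window share, and the 128K context come from the text; the layer count, head dimension, window size, and the 32-head baseline are hypothetical values picked for illustration, not Mellum 2's actual configuration.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   swa_fraction=0.0, window=None, bytes_per_elem=2):
    """Approximate KV-cache size in bytes for one sequence.

    Full-attention layers cache every position; sliding-window layers
    cache at most `window` positions.
    """
    n_swa = round(n_layers * swa_fraction)   # layers using a sliding window
    n_full = n_layers - n_swa                # layers using full attention
    swa_len = seq_len if window is None else min(seq_len, window)
    per_token = 2 * n_kv_heads * head_dim * bytes_per_elem  # keys + values
    return per_token * (n_full * seq_len + n_swa * swa_len)


CONTEXT = 128 * 1024   # 128K tokens (from the article)
N_LAYERS = 32          # ASSUMPTION: hypothetical layer count
HEAD_DIM = 128         # ASSUMPTION: hypothetical head dimension
WINDOW = 4096          # ASSUMPTION: hypothetical sliding-window size

# Baseline: one KV head per query head (ASSUMPTION: 32 heads), full attention.
mha = kv_cache_bytes(CONTEXT, N_LAYERS, 32, HEAD_DIM)
# GQA with 4 KV heads, still full attention on every layer.
gqa = kv_cache_bytes(CONTEXT, N_LAYERS, 4, HEAD_DIM)
# GQA plus a sliding window on 75% of layers.
gqa_swa = kv_cache_bytes(CONTEXT, N_LAYERS, 4, HEAD_DIM,
                         swa_fraction=0.75, window=WINDOW)

for name, size in [("MHA", mha), ("GQA", gqa), ("GQA+SWA", gqa_swa)]:
    print(f"{name:8s} {size / 2**30:.2f} GiB")
```

Under these assumed dimensions, cutting KV heads shrinks the cache in proportion to the head count, and the sliding-window layers stop growing once the context exceeds the window. The absolute numbers depend entirely on the assumed values.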

None of these is novel on its own — GQA, sliding windows, MTP, and YaRN are all established. The contribution is the integration: a code-specialized MoE that combines them into something cheap to run and openly licensed.
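
For readers new to MoE, here is a minimal sketch of top-k routing for one token, using the 64-expert, 8-active figures from the text. Everything else is an assumption for illustration: the experts are toy scalar functions, and renormalizing with a softmax over the selected logits is one common convention, not a confirmed detail of Mellum 2's router.

```python
import math
import random

NUM_EXPERTS = 64  # from the article
TOP_K = 8         # from the article


def route(router_logits, k=TOP_K):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    ASSUMPTION: weights are a softmax over the selected logits only.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)  # subtract max for stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, router_logits, experts):
    """Weighted sum of the selected experts' outputs; other experts never run."""
    return sum(w * experts[i](x) for i, w in route(router_logits))


rng = random.Random(0)
router_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
# Toy experts: expert s just multiplies its input by s.
experts = [lambda x, scale=s: scale * x for s in range(NUM_EXPERTS)]

selected = route(router_logits)
output = moe_forward(1.0, router_logits, experts)
print(len(selected), "of", NUM_EXPERTS, "experts ran for this token")
```

The point of the design is visible in `moe_forward`: all 64 experts exist and must be stored, but only the 8 selected ones contribute compute for a given token.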

Training recipe

Mellum 2 is pre-trained on about 10.6 trillion tokens, then post-trained in two stages: supervised fine-tuning followed by RLVR (reinforcement learning with verifiable rewards). RLVR belongs to the same family of techniques that powers recent reasoning models: reward the model on outcomes that can be checked, which for code means things like passing tests or producing valid edits. That stage is what yields the separate “thinking” checkpoint alongside the base and instruct variants.
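
As a rough illustration of what a "verifiable reward" can look like for code, the sketch below scores a candidate solution 1.0 if it passes a set of assertions and 0.0 otherwise. The function name, the binary reward, and the toy task are all hypothetical; the source does not describe JetBrains' actual reward design, and a real pipeline would run candidates in a sandbox rather than calling `exec` directly.

```python
def verifiable_reward(candidate_source, test_source):
    """Return 1.0 if the candidate passes every assertion, else 0.0.

    ASSUMPTION: a binary pass/fail reward, for illustration only.
    WARNING: exec() on untrusted code is unsafe outside a sandbox.
    """
    namespace = {}
    try:
        exec(candidate_source, namespace)  # define the candidate's functions
        exec(test_source, namespace)       # run assertions against them
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all score zero.
        return 0.0
    return 1.0


tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
broken = "def add(a, b) return a + b\n"

print(verifiable_reward(good, tests))
print(verifiable_reward(bad, tests))
print(verifiable_reward(broken, tests))
```

The key property is that the reward comes from an executable check rather than a learned judge, which is what makes it "verifiable".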

Key results

2.5B active parameters per token out of 12B total — the model runs at the per-token compute of a 2.5B dense model.
JetBrains reports Mellum 2 is competitive with open-weight baselines in the 4B-14B range on software engineering tasks, which puts its comparison set at roughly 1.6x to 5.6x its active parameter count.
~10.6 trillion training tokens and a 128K context window, large enough to hold substantial repository context in one prompt.
64 experts, 8 active, with GQA (4 KV heads) and sliding-window attention on 75% of layers.
Three Apache 2.0 checkpoints released: base, instruct, and thinking.

Why it matters now

The interesting claim is not “another open code model” — it is that a sparse MoE can give you mid-size-dense quality at small-dense cost specifically for coding, with a permissive license. For teams that want to self-host an IDE assistant, per-token compute is the recurring bill, and 2.5B active is cheap to serve. Apache 2.0 also removes the licensing friction that limits some other open code models in commercial products.

Limits and open questions

The report frames the result as “competitive with 4B-14B baselines,” which is a careful phrase, not a claim of beating them — without the full benchmark tables it is hard to know how much of that range it actually matches versus trails, and on which tasks. MoE models are also harder to deploy than their active-parameter count suggests: all 12B parameters must be resident in memory even though only 2.5B fire per token, so the VRAM footprint is closer to a 12B dense model than a 2.5B one. The efficiency win is in compute and throughput, not memory. And as a technical report from a single vendor, the evaluation is self-reported; independent benchmarking on agentic SWE tasks (where code models increasingly live or die) is the test that matters. Verify the exact numbers against the original report before quoting them.
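
The memory point above can be made concrete with back-of-the-envelope arithmetic. The sketch counts weight storage only, using the 12B total and 2.5B active figures from the text; the bytes-per-parameter values are standard for the named formats, but which precision a given deployment uses is an assumption, and KV cache and activations are ignored.

```python
TOTAL_PARAMS = 12_000_000_000   # from the article
ACTIVE_PARAMS = 2_500_000_000   # from the article


def weight_bytes(n_params, bytes_per_param):
    """Bytes needed to hold n_params weights at the given precision."""
    return n_params * bytes_per_param


# ASSUMPTION: illustrative precisions; actual deployments may differ.
for label, bytes_per_param in [("bf16", 2), ("int8", 1)]:
    resident = weight_bytes(TOTAL_PARAMS, bytes_per_param)
    dense_equiv = weight_bytes(ACTIVE_PARAMS, bytes_per_param)
    print(f"{label}: {resident / 1e9:.0f} GB resident for the MoE, "
          f"versus {dense_equiv / 1e9:.1f} GB for a 2.5B dense model")

active_share = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"share of weights used per token: {active_share:.1%}")
```

In other words, the weights that must sit in memory scale with the 12B total, while the weights touched per token scale with the 2.5B active count.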

FAQ

What is Mellum 2 and who made it?

Mellum 2 is an open-weight code language model from JetBrains, the company behind IntelliJ IDEA and PyCharm. It is a 12B-parameter Mixture-of-Experts model with 2.5B active parameters per token, targeting code generation, editing, debugging, and IDE assistance.

How is Mellum 2 efficient if it has 12B parameters?

Mellum 2 uses a Mixture-of-Experts design with 64 experts and only 8 active per token, so each token is processed with about 2.5B active parameters. That gives it the per-token compute cost of a 2.5B dense model while storing the knowledge capacity of a 12B model.

Is Mellum 2 open source and what license does it use?

Yes. JetBrains released Mellum 2 under the Apache 2.0 license, with base, instruct, and thinking checkpoints all publicly available — a permissive license that allows commercial use.

What context length does Mellum 2 support?

Mellum 2 supports a 128K-token context window, extended via layer-selective YaRN, which is large enough to fit substantial repository context in a single prompt for code reasoning.

One line: a code-specialized 12B MoE that runs at 2.5B-dense compute and ships Apache 2.0. Read the original report on arXiv.