Qwen2.5 Explained: Alibaba's Open LLM Family, 0.5B to 72B

Quick answer

Qwen2.5 is Alibaba’s family of large language models released in open and hosted forms, sized from 0.5B to 72B parameters and pretrained on 18 trillion tokens — up from 7T in Qwen2. The open-weight flagship, Qwen2.5-72B-Instruct, matches or beats many open and proprietary models and stays competitive with Llama-3-405B-Instruct, a model roughly 5x its size. The hosted MoE variants Qwen2.5-Turbo and Qwen2.5-Plus target GPT-4o-mini and GPT-4o on cost-effectiveness.

What ships in the family

The headline is breadth, not a single model. Qwen2.5 ships base and instruction-tuned checkpoints at 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B, plus quantized versions, all as open weights. Two proprietary mixture-of-experts models, Qwen2.5-Turbo and Qwen2.5-Plus, are served only through Alibaba Cloud Model Studio. The small end matters as much as the flagship: a 0.5B model that you can actually run on a phone or an edge box is what makes a “family” useful, and Qwen2.5’s spread across seven dense sizes is wider than most open releases of its time.

This report is also a base layer. Alibaba used these models to train specialized derivatives — Qwen2.5-Math, Qwen2.5-Coder, the QwQ reasoning model, and multimodal variants — so the technical report is less a single product than the foundation a whole lineup is built on.

How it was trained

Two numbers carry the pretraining story. The dataset grew from 7T tokens in Qwen2 to 18T high-quality tokens, with heavier filtering and more coverage of expert knowledge, math, and code. That scale is the main lever behind the reported gains in common-sense, reasoning, and STEM ability.

Post-training is where the instruct models get their behavior. Alibaba ran supervised fine-tuning on over 1 million curated samples, then a multistage reinforcement learning process for human preference. The report singles out long-text generation, structured-data analysis, and instruction following as the capabilities post-training improved most — the practical skills that separate a usable assistant from a strong-on-benchmarks base model. On context, Qwen2.5-Turbo extends the usable window to 1 million tokens, well beyond the dense open models’ standard long-context range.

Key results

Flagship vs a 5x-larger model: Qwen2.5-72B-Instruct performs competitively with Llama-3-405B-Instruct despite having roughly one-fifth the parameters — the report’s strongest efficiency claim.
Hosted MoE economics: Qwen2.5-Turbo and Qwen2.5-Plus are positioned to undercut GPT-4o-mini and GPT-4o respectively on cost while staying competitive on quality.
Pretraining scale: 18T tokens, up 2.6x from Qwen2’s 7T, is the single biggest change driving broad benchmark gains across language understanding, reasoning, math, and coding.
Post-training volume: over 1M SFT samples plus multistage RL, with the largest improvements reported in long-text generation, structured output, and instruction following.
Long context: Qwen2.5-Turbo handles up to 1M tokens, a 1M-token-class window in a production-served model.

Why it matters now

Qwen2.5 is the release that made Alibaba’s Qwen line a default choice for open-weight builders, not a regional alternative. The combination of a wide size ladder, a permissively usable open flagship, and a top result against a model 5x larger is exactly what downstream teams need: pick the size that fits your hardware, fine-tune, and ship. The bigger signal is that the strongest specialized open models of the period — coder, math, reasoning — were trained on top of Qwen2.5, so its quality propagated through the open ecosystem far beyond this one report.

Limits and open questions

The honest caveats are about what a technical report can and cannot tell you. Benchmark wins, especially “competitive with a 5x-larger model,” depend on the exact eval suite and decoding settings; the report is a first-party document, so independent reproduction matters before treating those numbers as settled. The most attractive cost claims attach to Turbo and Plus, which are closed MoE models behind an API — the openness story is the dense 0.5B–72B line, and the proprietary variants are where the headline efficiency lives. Quantized checkpoints trade quality for footprint in ways the report does not fully quantify per size. And as a base-model report it says little about safety, refusal behavior, or multilingual robustness under adversarial use — areas you must evaluate yourself before deployment.

FAQ

What sizes does Qwen2.5 come in?

Qwen2.5 ships open-weight base and instruction-tuned models at 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B, with quantized versions. Two hosted MoE models, Qwen2.5-Turbo and Qwen2.5-Plus, are served only via Alibaba Cloud.

Is Qwen2.5 open source?

The dense 0.5B–72B base and instruct models are released as open weights you can download and run. Qwen2.5-Turbo and Qwen2.5-Plus are proprietary, available only through Alibaba Cloud Model Studio.

How does Qwen2.5-72B compare to Llama-3-405B?

Qwen2.5-72B-Instruct performs competitively with Llama-3-405B-Instruct despite being roughly 5x smaller, which is the report’s central efficiency result.

What changed from Qwen2 to Qwen2.5?

The main change is pretraining scale: 18 trillion tokens versus 7 trillion in Qwen2, plus over 1M supervised samples and multistage RL in post-training, improving reasoning, long-text generation, structured output, and instruction following.

What is Qwen2.5 used for besides chat?

It is the foundation for specialized Alibaba models including Qwen2.5-Math, Qwen2.5-Coder, the QwQ reasoning model, and multimodal variants, so its quality carries into a broader lineup.

One line: a full open ladder from 0.5B to 72B, trained on 18T tokens, where a 72B model holds its own against one 5x its size. Read the original paper on arXiv.