OCC-RAG: Small Models Built Only to Read Context Faithfully

OCC-RAG is a pair of 0.6B and 1.7B reasoning models trained to answer strictly from the given context and refuse when the answer isn't there — matching or beating general models 2-6x their size on multi-hop QA.

Quick answer

OCC-RAG is a pair of tiny language models, at 0.6B and 1.7B parameters, deliberately trained not to lean on memorized facts and to answer only from the documents you hand them. On multi-hop QA benchmarks (HotpotQA, MuSiQue, TAT-QA, ConFiQA, MuSiQue-Un) they match or beat general-purpose models 2 to 6 times their size, and they back each answer with reasoning traces that quote the source text verbatim. The training signal comes from a synthetic pipeline that generated over 3 million QA examples built around two behaviors most RAG models handle poorly: staying faithful to context and refusing to answer when the context doesn’t support an answer.

The “cognitive core” idea

The framing here is that a retrieval-augmented model doesn’t need to know much; it needs to read well. OCC stands for Optimal Cognitive Core: a family of small models that prioritize robust reasoning over parametric knowledge. The bet is that for grounded QA, parameters spent memorizing world facts are largely wasted, and can be actively harmful because they tempt the model to answer from memory instead of from the retrieved passage. So OCC-RAG is trained to treat its weights as a reasoning engine, not a fact store, and to lean entirely on the provided context.

That reframing is the genuinely interesting claim. Most “small RAG model” work just shrinks a general model and hopes retrieval covers the knowledge gap. OCC instead optimizes the small model for the job retrieval actually creates: read several documents, chain facts across them, cite where each fact came from, and say “not answerable” when the documents don’t contain it.

How the data is built

The core engineering is a pipeline that synthesizes multi-context, multi-hop QA examples at scale (over 3 million of them) with two properties deliberately built in:

  • Context faithfulness: answers must be derivable from the supplied passages, with literal quotes as supporting evidence, so the model learns to ground rather than recall.
  • Calibrated refusal: a portion of examples have no supportable answer, training the model to abstain instead of hallucinating a plausible-sounding one.
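
To make the two properties concrete, here is a minimal sketch of what one training record could look like. The `GroundedExample` class, its field names, and the toy passages are all hypothetical illustrations; the article does not describe the pipeline's actual schema, so treat this as one plausible layout rather than the authors' format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GroundedExample:
    """Hypothetical record layout; the real pipeline's schema may differ."""
    question: str
    passages: List[str]                                        # supplied context documents
    evidence_quotes: List[str] = field(default_factory=list)   # literal spans copied from passages
    answer: Optional[str] = None                               # None marks a refusal example

    @property
    def answerable(self) -> bool:
        return self.answer is not None

    def is_well_formed(self) -> bool:
        """Answerable examples need at least one quote, and every quote must
        occur verbatim in some passage; refusal examples carry no evidence."""
        if not self.answerable:
            return not self.evidence_quotes
        return bool(self.evidence_quotes) and all(
            any(quote in passage for passage in self.passages)
            for quote in self.evidence_quotes
        )


# Invented toy passages, used only for illustration.
passages = [
    "Ada Ruiz founded Acme Robotics in 2011.",
    "Acme Robotics is headquartered in Oslo.",
]

# Context faithfulness: a two-hop question answered from quoted evidence.
answerable_ex = GroundedExample(
    question="In which city is the company founded by Ada Ruiz headquartered?",
    passages=passages,
    evidence_quotes=[
        "Ada Ruiz founded Acme Robotics",
        "Acme Robotics is headquartered in Oslo",
    ],
    answer="Oslo",
)

# Calibrated refusal: the passages say nothing about a CEO, so the target is abstention.
refusal_ex = GroundedExample(
    question="Who is the current CEO of Acme Robotics?",
    passages=passages,
)

print(answerable_ex.is_well_formed(), refusal_ex.is_well_formed())
```

Modeling refusal as `answer=None` keeps both behaviors in a single record type, so answerable and unanswerable questions can share the same passages.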

Both the 0.6B and 1.7B models are trained to emit reasoning traces that include source citations (literal spans lifted from the context) so a reader can audit which sentence produced which claim. That citation-by-quote behavior is what makes the output checkable rather than just confident.
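
Because the citations are literal spans, auditing them can be as simple as a substring check. The sketch below assumes, purely for illustration, that a trace wraps each cited span in straight double quotes; the actual OCC-RAG trace format is not specified here, and `audit_trace` is a hypothetical helper, not part of any released tooling.

```python
import re
from typing import List, Optional, Tuple

# Assumed citation format: each cited span sits between straight double quotes.
QUOTE_PATTERN = re.compile(r'"([^"]+)"')


def audit_trace(trace: str, passages: List[str]) -> List[Tuple[str, Optional[int]]]:
    """Return (quote, passage_index) pairs. The index is None when the quote
    does not occur verbatim in any supplied passage."""
    results = []
    for quote in QUOTE_PATTERN.findall(trace):
        source = next((i for i, p in enumerate(passages) if quote in p), None)
        results.append((quote, source))
    return results


# Invented toy data: the third quoted span is not in the context.
passages = [
    "Ada Ruiz founded Acme Robotics in 2011.",
    "Acme Robotics is headquartered in Oslo.",
]
trace = (
    'The first passage says "Ada Ruiz founded Acme Robotics". '
    'The second says "Acme Robotics is headquartered in Oslo". '
    'It also says "Acme Robotics employs 900 people".'
)

audit = audit_trace(trace, passages)
unsupported = [quote for quote, source in audit if source is None]
print(unsupported)  # ['Acme Robotics employs 900 people']
```

A check like this only confirms that each quote exists in the context. It says nothing about whether the reasoning that connects the quotes is sound.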

Key results

  • OCC-RAG matches or exceeds general-purpose models 2-6x its size across HotpotQA, MuSiQue, TAT-QA, ConFiQA, and MuSiQue-Un — i.e. a 1.7B model competing with models in the ~4B-10B range on grounded QA.
  • The benchmark mix is chosen to stress the right things: HotpotQA and MuSiQue for multi-hop chaining, TAT-QA for table-and-text reasoning, and ConFiQA / MuSiQue-Un for counterfactual and unanswerable cases that punish models for answering from memory.
  • Training scale is the headline number on the data side: 3M+ synthetic QA examples purpose-built for faithfulness and refusal, rather than scraped web QA.
  • Outputs ship with literal source citations inside the reasoning trace, so faithfulness is observable per-answer, not just an aggregate metric.
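
As a quick sanity check on the "2-6x" claim in the first bullet, the arithmetic below multiplies each OCC-RAG size by the two ends of that range. Only the sizes and the multiplier range come from the article; the helper name is illustrative, and the result does not identify which baseline models were actually compared.

```python
OCC_SIZES_B = [0.6, 1.7]     # parameter counts in billions, as stated in the article
SIZE_MULTIPLIERS = (2, 6)    # the "2-6x" range quoted above


def comparable_range(size_b, multipliers=SIZE_MULTIPLIERS):
    """Return the (low, high) baseline sizes in billions implied by the range."""
    low, high = multipliers
    return round(size_b * low, 1), round(size_b * high, 1)


for size in OCC_SIZES_B:
    low_b, high_b = comparable_range(size)
    print(f"{size}B -> {low_b}B to {high_b}B")
# 0.6B -> 1.2B to 3.6B
# 1.7B -> 3.4B to 10.2B
```

So the "~4B-10B" figure for the 1.7B model is the 3.4B-10.2B span rounded toward common model sizes, and the same multipliers put the 0.6B model against baselines of roughly 1.2B-3.6B.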

Why this is worth attention

The practical pitch is cost. A 0.6B or 1.7B model that reads context faithfully can run on a CPU or a phone-class GPU, which makes per-query RAG cheap enough to put on-device or in front of high-volume workloads where a 7B+ model is too expensive. The refusal training is the part most production RAG systems are missing — a model that confidently answers unanswerable questions is worse than useless in legal, finance, or support settings, and OCC-RAG treats abstention as a first-class trained behavior rather than a prompt-time afterthought.
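
A rough back-of-the-envelope estimate shows why the size difference matters for deployment. The sketch counts weight storage only (parameters times bytes per parameter) and ignores the KV cache, activations, and runtime overhead, so real memory use will be higher. The precision options and the 7B comparison point are illustrative assumptions, not figures from the paper.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Weights-only footprint in decimal GB: billions of params x bytes per param."""
    return round(params_billion * BYTES_PER_PARAM[precision], 2)


for size in (0.6, 1.7, 7.0):  # the two OCC-RAG sizes plus a 7B reference point
    row = ", ".join(f"{p}: {weight_memory_gb(size, p)} GB" for p in BYTES_PER_PARAM)
    print(f"{size}B -> {row}")
# 0.6B -> fp16: 1.2 GB, int8: 0.6 GB, int4: 0.3 GB
# 1.7B -> fp16: 3.4 GB, int8: 1.7 GB, int4: 0.85 GB
# 7.0B -> fp16: 14.0 GB, int8: 7.0 GB, int4: 3.5 GB
```

At half precision the 1.7B model's weights take about a quarter of the space a 7B model needs, which is the gap that makes CPU or on-device serving plausible.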

Limits and open questions

The result is narrow by construction, and that cuts both ways:

  • These models are built to be useless without retrieval. Strip the context and a 0.6B “cognitive core” has very little parametric knowledge to fall back on, so quality depends entirely on the retriever upstream, which the paper doesn’t fix.
  • “Matches or exceeds models 2-6x their size” is a range, not a single audited number per benchmark, so the real gap depends heavily on which baseline and which dataset.
  • The 3M training examples are synthetic, which risks teaching the model the style of grounded answering on machine-generated text more than the messy retrieval failures of real corpora.
  • Faithfulness-by-citation is only as good as the citations: quoting a passage verbatim doesn’t guarantee the reasoning over those quotes is correct.

The honest read is that OCC-RAG is a strong efficiency story for grounded QA, not a general small model. Its whole value evaporates if you ask it to know things on its own.

FAQ

What is OCC-RAG and how is it different from a normal small LLM?

OCC-RAG is a 0.6B/1.7B model family trained to answer strictly from provided context and to refuse when the context doesn’t support an answer. Unlike a general small LLM, it intentionally minimizes reliance on memorized facts — it’s a reading-and-reasoning engine, not a knowledge store.

How small are the OCC-RAG models?

Two sizes: 0.6 billion and 1.7 billion parameters. The paper reports they match or beat general-purpose models 2 to 6 times larger on grounded multi-hop QA.

What benchmarks does OCC-RAG use?

HotpotQA and MuSiQue (multi-hop), TAT-QA (table-and-text), and ConFiQA plus MuSiQue-Un, which include counterfactual and unanswerable questions to test whether the model refuses instead of hallucinating.

Does OCC-RAG cite its sources?

Yes — it emits reasoning traces containing literal quotes from the supplied context, so each claim can be traced back to the sentence that supports it.

When should I not use OCC-RAG?

When you need the model to answer from its own knowledge without retrieval. By design these models hold little parametric knowledge, so without a good retriever supplying relevant context, they have nothing reliable to reason over.

One line: spend the parameters on reading, not remembering — and train refusal as a real skill. Read the original paper on arXiv.