InterleaveThinker: Planner-Critic Agents for Interleaved Image Generation

Quick answer

InterleaveThinker is a multi-agent wrapper for interleaved text-image generation. A Planner writes the sequence of image instructions up front, and a Critic checks each generated image, decides whether it succeeded, and writes refinement prompts when it fails. With FLUX.2-klein-9B, the system reaches 66.3 average on UEval and improves WISE from 0.47 to 0.73. With Qwen-Image-Edit, it reaches 67.2 on UEval and 30.0 on RISE.

What interleaved generation needs

Single-image generators are good at one image or one edit. Interleaved generation asks for a sequence: a tutorial, visual story, procedure, comic, or embodied manipulation plan where each image and text step must stay consistent with prior steps. Errors compound. A wrong object in step two can poison step five.

InterleaveThinker does not replace the generator. It turns generation into an agent loop. The Planner reduces visual over-reliance by predicting the global text-image plan before seeing generated images. The Critic then handles local correction after each step.

Training the planner and critic

Both Planner and Critic start from Qwen3-VL-8B-Instruct. The authors build Planner-SFT-80k, Critic-SFT-112k, and Critic-RL-13k. The RL stage uses GRPO with two rewards: an accuracy reward for judging success correctly and a step-wise reward for improving the next iteration.

The compute cost is visible. The paper reports about 50 hours on eight H800 GPUs for the pipeline training setup, and inference can involve up to five refinement iterations per step. This is a quality wrapper, not a cheap one-call generator.

The surprising detail is that the Planner predicts the global instruction sequence before observing generated images. That deliberately prevents the plan from being dragged around by early visual mistakes. The Critic then works locally, where visual feedback is useful: it checks the current image, decides whether the step succeeded, and rewrites the prompt if the generator drifted.

Key results

UEval: InterleaveThinker plus FLUX.2-klein-9B reaches 66.3 average, while Qwen-Image-Edit integration reaches 67.2.
Proprietary comparison: Nano Banana scores 66.0 on UEval and Nano Banana Pro scores 76.1, so InterleaveThinker is competitive with the former but not the latter.
CoMM: with FLUX.2-klein, the system reports 9.3/9.6 style consistency and 9.2/9.6 entity consistency across Task 3 and Task 4.
WISE: FLUX.2-klein improves from 0.47 to 0.73 with InterleaveThinker.
RISE: FLUX.2-klein improves from 13.3 to 28.9 overall, with a large temporal category jump from 7.1 to 36.5.

Why the planner-critic split matters

The ablation table is the clearest mechanism evidence. A raw FLUX.2-klein setup gets 18.2 average on UEval. A Qwen3-VL-8B baseline wrapper reaches 48.1. Planner-SFT lifts the score to 60.5, Full-SFT reaches 64.5, and Full-RL reaches 66.3. A one-agent variant falls to 54.5.

That supports the paper’s central claim: planning the sequence and critiquing generated outputs are different jobs. Merging them into one agent loses quality when the image generator is frozen.

For builders, the lesson is to treat interleaved generation as workflow control. A stronger base generator helps, but consistency across a sequence also needs state, step-level judgment, and a recovery path. InterleaveThinker supplies those pieces without retraining the image generator itself.

Limits and open questions

The method depends on judge quality. The Critic training uses Gemini 2.5 Pro scores in data processing and reward construction. That is a reasonable engineering choice, but it means the pipeline partially inherits the judge’s preferences.

The second limit is latency. A trajectory may involve many generator calls, and the paper explicitly caps refinement iterations. InterleaveThinker is best suited for high-value visual sequences where consistency matters more than instant generation.

The third limit is benchmark fit. UEval, CoMM, WISE, and RISE cover useful aspects of sequence quality, but they still compress human visual expectations into scores. Real users will notice layout taste, narrative pacing, and brand constraints that are not fully captured by the reported metrics.

There is also a frozen-generator ceiling. If the underlying image model does not know a concept or cannot render it reliably, planner and critic loops can refine prompts but cannot add missing visual competence to the generator.

FAQ

What is InterleaveThinker?

InterleaveThinker is a planner-critic agent pipeline that wraps existing image generators to produce interleaved text-image sequences, such as tutorials, stories, diagrams, and step-by-step visual plans.

How does InterleaveThinker improve FLUX.2-klein?

On UEval, FLUX.2-klein alone gets 18.2 average in the ablation, while the full InterleaveThinker RL pipeline reaches 66.3. On WISE, FLUX.2-klein improves from 0.47 to 0.73.

Is InterleaveThinker better than Nano Banana Pro?

No on UEval. InterleaveThinker with Qwen-Image-Edit reaches 67.2 average, while Nano Banana Pro reaches 76.1. It is closer to Nano Banana at 66.0 and ahead of open-source baselines in the table.

What is InterleaveThinker’s main cost?

Inference cost. The system can call the image generator multiple times per step, with planner and critic passes around those calls. It is designed for consistency, not minimum latency.

One line: InterleaveThinker is strongest as evidence that image generation quality can improve through agentic planning and critique around a frozen generator. Read the original paper on arXiv.