Crafter: A Multi-Agent Harness for Editable Scientific Figures

Crafter wraps an image model in five cooperating agents and scores 50.34 on PaperBanana-Bench vs 11.13 for the raw backbone — then CraftEditor turns the raster output into editable SVG you can actually fix.

Quick answer

Crafter is a five-agent harness that drives an image generator (default: Nano Banana 2) to produce publication-quality figures, and it scores 50.34 on PaperBanana-Bench versus 11.13 for the same backbone used alone — a ~39-point jump from coordination, not a bigger model. A second system, CraftEditor, then converts the raster figure into a coordinate-faithful SVG, scoring 8.04/10 on edit fidelity against 6.91 for the prior best editor. The whole point: treat a figure as discrete semantic components a team of agents negotiates over, rather than one shot from a text prompt.

The real problem: figures break the one-prompt model

Text-to-image models render a scientific figure as flat pixels. The moment a reviewer says “move the arrow, recolor that box, fix the typo in the legend,” you are stuck — you cannot locally edit pixels, and re-prompting regenerates the whole image with new mistakes. Prior automated systems also each handled one figure type under text-only input, so a tool that made a flowchart could not make a poster.

Crafter’s authors argue the bottleneck is not the image model’s raw quality but the lack of a coordination layer. A figure is structured: boxes, arrows, text, icons, each with a position and a relationship. A single forward pass has no place to hold that structure, revise it, and check it.

How Crafter’s five agents work

Crafter runs a loop of five specialized agents over a shared, typed specification rather than a single prompt:

  • Intent Reasoner reads the input (text, a sketch, a partial figure) and seeds an initial spec.
  • Plan Generator proposes several candidate visual framings instead of committing to one.
  • Critic returns per-dimension diagnostics — position, color, text, layout — as directives, not a single scalar score. This is the most important design choice: a scalar “7/10” tells the generator nothing actionable; “the legend overlaps the y-axis” does.
  • Specification Refiner writes typed edits back into the shared spec.
  • Convergence Judge decides whether to accept the figure, refine again, or revert a bad edit.
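To make the Critic and Specification Refiner roles concrete, here is a minimal sketch of directives being applied as typed edits. The `Component` and `Directive` types, their fields, and `apply_directives` are hypothetical names chosen for illustration; the article names the four diagnostic dimensions but does not describe Crafter's actual data layout.

```python
from dataclasses import dataclass, replace

# The four diagnostic dimensions named in the article.
DIMENSIONS = ("position", "color", "text", "layout")


@dataclass(frozen=True)
class Component:
    """One semantic element of the figure (hypothetical layout)."""
    kind: str
    x: float
    y: float
    color: str = "#000000"
    text: str = ""


@dataclass(frozen=True)
class Directive:
    """An actionable diagnostic, as opposed to a scalar score (hypothetical)."""
    dimension: str  # one of DIMENSIONS
    target: str     # key of the component in the spec
    message: str    # human-readable diagnosis
    changes: dict   # field name -> new value for the refiner to write


def apply_directives(spec, directives):
    """Return a new spec with every directive applied as a typed edit."""
    refined = dict(spec)  # shallow copy; components themselves are immutable
    for d in directives:
        if d.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {d.dimension}")
        refined[d.target] = replace(refined[d.target], **d.changes)
    return refined


spec = {"legend": Component(kind="text", x=0, y=10, text="Legend")}
directives = [
    Directive("position", "legend", "the legend overlaps the y-axis", {"x": 40}),
]
refined = apply_directives(spec, directives)
print(refined["legend"])
```

The original spec is left untouched, which is what makes it cheap for a later step to discard a bad edit and fall back to the previous version.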

Because the agents operate on a spec rather than on pixels, the same harness generalizes across figure types and input conditions with no architectural change — the differences live in the spec, not the code.
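The accept, refine, or revert decision can be sketched as a small control loop. This is an illustration under stated assumptions, not Crafter's implementation: `refine_until_converged`, the stub `critique`, `refine`, and `simple_judge` functions, and the issue-count heuristic are all hypothetical, and `critique` stands in for "render the spec, then critique the result". The Intent Reasoner and Plan Generator are left out; the loop starts from an already seeded spec.

```python
def refine_until_converged(spec, critique, refine, judge, max_rounds=5):
    """Keep the best spec so far; accept, refine again, or revert each candidate."""
    best, best_issues = spec, critique(spec)
    for _ in range(max_rounds):
        if not best_issues:
            break  # nothing left to fix
        candidate = refine(best, best_issues)
        candidate_issues = critique(candidate)
        action = judge(best_issues, candidate_issues)
        if action == "revert":
            continue  # discard the candidate, keep the previous spec
        best, best_issues = candidate, candidate_issues
        if action == "accept":
            break
    return best


# --- Stub agents (assumptions for the demo; real agents would call models) ---
def critique(spec):
    """Return (dimension, field, suggested value) directives."""
    issues = []
    if spec["legend_x"] < 40:
        issues.append(("position", "legend_x", 40))
    if spec["title"] != "Results":
        issues.append(("text", "title", "Results"))
    return issues


def refine(spec, issues):
    """Apply only the first directive per round, returning a new spec."""
    _, key, value = issues[0]
    return {**spec, key: value}


def simple_judge(before, after):
    """Accept when clean, revert when an edit made things worse, else refine."""
    if not after:
        return "accept"
    if len(after) > len(before):
        return "revert"
    return "refine"


final_spec = refine_until_converged(
    {"legend_x": 0, "title": "Reslts"}, critique, refine, simple_judge
)
print(final_spec)
```

Nothing in the loop depends on the figure type: a poster or an infographic would differ only in the spec passed in and in what the agents say about it.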

What CraftBench actually tests

CraftBench spans three figure types — academic figures, posters, and infographics — under four input conditions: text-to-image, mask completion, key-element composition, and sketch refinement. That 3x4 grid is the contribution most prior work skipped: it forces a system to handle “finish this half-drawn diagram” and “compose these given icons,” not just “draw from a sentence.”
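The grid can be enumerated directly, which is a handy way to check that an evaluation run covers every cell. The constant names are illustrative; only the category labels come from the description above.

```python
from itertools import product

FIGURE_TYPES = ("academic figure", "poster", "infographic")
INPUT_CONDITIONS = (
    "text-to-image",
    "mask completion",
    "key-element composition",
    "sketch refinement",
)

# Every (figure type, input condition) pair a system has to handle.
GRID = list(product(FIGURE_TYPES, INPUT_CONDITIONS))

for figure_type, condition in GRID:
    print(f"{figure_type:16s} x {condition}")
```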

Key results

  • PaperBanana-Bench: Crafter scores 50.34 overall, against 33.73 for the PaperBanana agentic baseline, 11.13 for the Nano Banana 2 backbone alone, and 1.37 for GPT-Image-2.
  • CraftBench: Crafter scores 50.20, versus 28.00 for PaperBanana, 22.40 for Nano Banana Pro standalone, and 19.90 for Nano Banana 2 standalone. Its lead over PaperBanana, the strongest agentic baseline, is 22.20 points here and 16.61 points on PaperBanana-Bench.
  • CraftEditor: 8.04/10 overall edit fidelity (three-VLM ensemble judging, 80 samples) versus 6.91 for AutoFigure-Edit and 3.69 for Edit-Banana. Per-axis it lands 8.34 on color, 8.10 on position, 8.07 on icons, 7.83 on arrows, and 7.61 on text — text and arrows are the weakest axes.
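The margins implied by these numbers are easy to recompute. The short script below uses only the scores reported above; the dictionary and function names are illustrative.

```python
PAPERBANANA_BENCH = {
    "Crafter": 50.34, "PaperBanana": 33.73,
    "Nano Banana 2": 11.13, "GPT-Image-2": 1.37,
}
CRAFTBENCH = {
    "Crafter": 50.20, "PaperBanana": 28.00,
    "Nano Banana Pro": 22.40, "Nano Banana 2": 19.90,
}
CRAFTEDITOR_AXES = {
    "color": 8.34, "position": 8.10, "icons": 8.07,
    "arrows": 7.83, "text": 7.61,
}


def leads(scores, system="Crafter"):
    """Point lead of `system` over every other entry, rounded to 2 decimals."""
    return {
        name: round(scores[system] - score, 2)
        for name, score in scores.items()
        if name != system
    }


paperbanana_leads = leads(PAPERBANANA_BENCH)
craftbench_leads = leads(CRAFTBENCH)

# Axes sorted from weakest to strongest.
weakest_axes = sorted(CRAFTEDITOR_AXES, key=CRAFTEDITOR_AXES.get)[:2]
axis_mean = round(sum(CRAFTEDITOR_AXES.values()) / len(CRAFTEDITOR_AXES), 2)

print(paperbanana_leads)
print(craftbench_leads)
print(weakest_axes, axis_mean)
```

The plain mean of the five per-axis scores comes out slightly below the reported 8.04 overall, so the overall figure is presumably aggregated some other way; the summary here does not say how.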

Why it matters now

The honest read: Crafter is evidence that for structured visual output, the harness is doing the heavy lifting, not the image model. The same Nano Banana 2 backbone goes from 11.13 to 50.34 purely by wrapping it in a critique-and-revise loop over a typed spec. That is a strong argument for the broader 2026 thesis that agent scaffolding around a frozen model can beat swapping in a bigger model — and it is directly useful, because researchers actually need to edit figures, not admire un-editable ones.

Limits and open questions

The benchmark scores top out near 50/100, so even the best system is closer to “usable draft” than “submission-ready” — these are not 90%+ numbers. The agents add latency and cost: five agents looping over a spec means many model calls per figure, and the paper does not foreground wall-clock or dollar cost. CraftEditor’s SVG conversion leans on SAM3 for grounding and is weakest exactly where scientific figures are most demanding — text (7.61) and arrows (7.83). The authors also note PDF text extraction is clean only for LaTeX-rendered PDFs; scanned or dense two-column papers may need manual extraction first. And every number here comes from VLM-as-judge scoring, which can reward figures that look right to a model over figures that are correct.

FAQ

What is Crafter and how does it generate scientific figures?

Crafter is a multi-agent harness that drives an image generator with five cooperating agents — Intent Reasoner, Plan Generator, Critic, Specification Refiner, and Convergence Judge — operating over a shared typed specification. It generates academic figures, posters, and infographics and scores 50.34 on PaperBanana-Bench versus 11.13 for the raw backbone.

How much does Crafter beat the base image model?

On PaperBanana-Bench, Crafter scores 50.34 against 11.13 for the same Nano Banana 2 backbone run alone — about a 39-point gain from the agent loop, with no change to the underlying image model.

What does CraftEditor do that a normal image editor cannot?

CraftEditor converts a raster figure into a coordinate-faithful editable SVG through extraction, grounding (via SAM3), and composition, so you can move, recolor, or retype individual elements. It scores 8.04/10 on edit fidelity versus 6.91 for the prior best editor.
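As a rough picture of the composition step only, the sketch below turns a list of already grounded elements into SVG with Python's standard library. The `Element` type, its fields, and `compose_svg` are hypothetical; extraction and SAM3 grounding are assumed to have happened upstream, and CraftEditor's real output format may differ.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

SVG_NS = "http://www.w3.org/2000/svg"


@dataclass
class Element:
    """A grounded figure element in pixel coordinates (hypothetical layout)."""
    kind: str          # "box", "text", or "arrow"
    x: float
    y: float
    w: float = 0       # width, or x-extent of an arrow
    h: float = 0       # height, or y-extent of an arrow
    color: str = "#000000"
    label: str = ""


def _n(value):
    """Format a number compactly, e.g. 10 -> '10', 10.5 -> '10.5'."""
    return f"{value:g}"


def compose_svg(elements, width, height):
    """Emit one addressable SVG node per element, preserving coordinates."""
    svg = ET.Element("svg", xmlns=SVG_NS, width=_n(width), height=_n(height),
                     viewBox=f"0 0 {_n(width)} {_n(height)}")
    for i, e in enumerate(elements):
        if e.kind == "box":
            ET.SubElement(svg, "rect", id=f"el{i}", x=_n(e.x), y=_n(e.y),
                          width=_n(e.w), height=_n(e.h), fill=e.color)
        elif e.kind == "text":
            node = ET.SubElement(svg, "text", id=f"el{i}", x=_n(e.x),
                                 y=_n(e.y), fill=e.color)
            node.text = e.label
        elif e.kind == "arrow":
            ET.SubElement(svg, "line", id=f"el{i}", x1=_n(e.x), y1=_n(e.y),
                          x2=_n(e.x + e.w), y2=_n(e.y + e.h), stroke=e.color)
        else:
            raise ValueError(f"unsupported element kind: {e.kind}")
    return ET.tostring(svg, encoding="unicode")


svg_text = compose_svg(
    [
        Element("box", 10, 10, 120, 40, color="#cce5ff"),
        Element("text", 20, 35, label="Encoder"),
        Element("arrow", 130, 30, 60, 0),
    ],
    width=300,
    height=80,
)
print(svg_text)
```

Because each element becomes its own node with an `id`, recoloring a box or retyping a label is a one-attribute change rather than a regeneration of the whole image.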

What models does Crafter use by default?

By default Crafter uses Claude Opus as the LLM, Gemini Pro as the VLM, and Nano Banana (Gemini image generation) as the image backbone, with an optional GPT-Image variant via Azure or OpenRouter.

Read the original paper on arXiv and the code on GitHub.