Reinforcement Learning
Training language models and agents from reward signals: RLHF, RLVR, GRPO, and related verifiable-reward methods that drive reasoning gains.
Text-to-Image · The Chinese University of Hong Kong
InterleaveThinker adds planner and critic agents around frozen image generators, reaching 66.3 to 67.2 average on UEval and lifting FLUX.2-klein WISE from 0.47 to 0.73.
Theorem Proving · MiniMax AI
MaxProof turns MiniMax-M3 into a generator, verifier, fixer, and ranker; with population-level test-time scaling it reports 35/42 on IMO 2025 and 36/42 on USAMO 2026.
AI Agents · University of Illinois Urbana-Champaign
Harness-1 is a 20B RL search agent that hands working memory to the environment, hitting 0.730 average curated recall and beating the next open subagent by +11.4 points.
Theorem Proving · Google Research
HOList is an environment and benchmark for machine learning of higher-order theorem proving, built on the HOL Light proof assistant; this entry covers the supporting evidence, method tradeoffs, and limits for practical use.
Reinforcement Learning · Tianjin University
When you RL-tune an LLM across math, code, QA, and writing in sequence, math drops from 66.49 to 57.66 even though gradients look orthogonal. A short math refresh pulls it back to 66.04 without wrecking the other three.
Long Context · Tsinghua University
Tsinghua's LongTraceRL mines distractors from real search-agent trajectories and adds entity-level rubric rewards, lifting a Qwen3-4B reasoner from 53.3 to 59.0 average across five long-context benchmarks (+5.7).
AI Agents · Lehigh University
OpenSkill lets agents build skills and their own verifiers from the open web, hitting 43.6% on SkillsBench (+8.9 over the best baseline) with zero target-task answers.
Reinforcement Learning · Tsinghua University
CHERRL injects four known judge biases to reliably reproduce reward hacking in rubric RL; an agent reading only training logs pinned the onset with an 11-step total interval error and missed none of six runs.
AI Agents · Xiamen University
SAAS uses self-aware RL to cut a Qwen2.5-7B search agent's average queries from 2.19 to 0.97 per question, while keeping accuracy near the best baseline (48.7% vs 49.8%).
Reinforcement Learning · University of Edinburgh
SCOPE co-evolves a task-writing Challenger and a retrieval Solver, judged by a frozen copy of the base model, lifting eight open-ended benchmarks by up to +10.4 points with zero curated prompts.
Agent Memory · ByteDance
TaskMem trains a multimodal agent to write its own memory with RL, lifting streaming-video QA accuracy to 67.9% on VideoMME and 45.4% on EgoLife, gains of 6.3 and 7.0 points over the Qwen3-VL-30B baseline.
Theorem Proving · DeepSeek
DeepSeek-Prover-V1.5 combines Lean feedback, reinforcement learning, and RMaxTS search, reaching 63.5% on miniF2F and 25.3% on ProofNet.
Alignment · OpenAI
PPO keeps policy-gradient RL stable with a clipped surrogate objective — almost as well-behaved as TRPO but far simpler — which made it the default RL engine behind RLHF for ChatGPT and InstructGPT.
AI Agents · Zhejiang University
SDAR adds a gated, token-level self-distillation signal from a skill-augmented teacher on top of GRPO, lifting multi-turn agents by up to +10.2 points on WebShop and +9.4 on ALFWorld for small Qwen models.
LLM Reasoning · DeepSeek
DeepSeek-R1 learns to reason through reinforcement learning on whether its answers are correct, with no human reasoning examples. It matches OpenAI o1 on AIME and MATH-500 and ships open MIT-licensed weights.