Peer review is the primary mechanism through which the research community filters and improves new scientific work before publication. The rapid growth of submissions at major AI conferences (now surpassing 10,000 at leading venues) has placed sustained pressure on peer review workflows. Meanwhile, recent advances in LLMs have spurred growing interest in using them to assist or complement peer review.
Despite these advances, prior work has highlighted notable shortcomings in existing LLM-based peer review frameworks: they produce routine, template-like critiques; accept authors' claimed novelty or limitations without thorough verification; and lack technical detail, actionable suggestions, and justification grounded in the paper. These limitations can be traced to the underutilization of two crucial sources: (1) Reviewer Guidelines and Rubrics: top-tier venues provide well-established guidelines; and (2) Context from Existing Work: assessing novelty inherently requires situating a paper relative to existing work.
We introduce ReviewBench, a benchmark that leverages reviewer rubrics in an explicit and systematic manner, and ReviewGrounder, a rubric-guided, tool-integrated, multi-agent framework for producing grounded, content-rich reviews. ReviewGrounder decomposes reviewing into collaborating agents: the drafter produces an initial draft, and subsequent grounding agents (Literature Searcher, Insight Miner, Result Analyzer) refine it using tools. Experiments show that ReviewGrounder with a Phi-4-14B-based drafter and GPT-OSS-120B-based grounding consistently outperforms baselines including GPT-4.1 and DeepSeek-R1-670B across rubric-based dimensions and human-aligned metrics.
Overview of the ReviewBench construction pipeline. For each paper, paper-specific rubrics are instantiated by an aggregated reference review, the submission PDF, and meta-rubrics.
Similarity-based metrics and LLM-as-a-Judge approaches used by prior studies either fail to capture fine-grained review competencies or rely on ambiguous evaluation criteria. We introduce ReviewBench, a benchmark built on DeepReview-13K that augments each paper $p$ and its human reviews $\mathsf{H}_p$ with: (1) an aggregated reference review $r^*_p$; and (2) a set of paper-specific rubrics $\mathsf{R}^{\text{paper}}_p$. By leveraging these alongside an evaluator $\mathcal{E}$, ReviewBench enables accurate, multi-faceted assessment.
We define eight paper-agnostic meta-rubrics: Core Contribution Accuracy, Results Interpretation, Comparative Analysis, Evidence-Based Critique, Critique Clarity, Completeness Coverage, Constructive Tone, and False or Contradictory Claims (pitfall). Each meta-rubric is instantiated into paper-specific rubrics $\mathsf{R}^{\text{paper}}_{p,i}$ using the reference review and paper content. The overall content score is $S(p,\hat{r}_p) = \sum_{i=1}^{8} s_{p,i}$.
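The aggregation above can be sketched in a few lines. This is a hypothetical illustration, not the paper's released code: the dimension names follow the eight meta-rubrics, with seven positive dimensions scored in {0, 1, 2} and the pitfall dimension in {-2, -1, 0}.

```python
# Sketch of the overall content score S(p, r) = sum of the eight per-dimension
# scores s_{p,i}. Per-dimension scores would come from the evaluator E;
# the names and score ranges follow the meta-rubrics described above.

POSITIVE_DIMS = [
    "core_contribution_accuracy", "results_interpretation",
    "comparative_analysis", "evidence_based_critique",
    "critique_clarity", "completeness_coverage", "constructive_tone",
]
PITFALL_DIM = "false_or_contradictory_claims"

def overall_score(dim_scores: dict) -> int:
    """Sum the eight per-dimension scores after range-checking each one."""
    for d in POSITIVE_DIMS:
        assert dim_scores[d] in (0, 1, 2), f"{d} out of range"
    assert dim_scores[PITFALL_DIM] in (-2, -1, 0), "pitfall score out of range"
    return sum(dim_scores[d] for d in POSITIVE_DIMS) + dim_scores[PITFALL_DIM]
```

With all positive dimensions at 2 and no pitfall penalty, the maximum overall score is 14.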
Overview of ReviewGrounder. (a) Review Drafter: Generates an initial draft based on the paper. (b) Multi-dimensional Grounding Agents: Literature Searcher retrieves and summarizes related work; Insight Miner verifies methodology and core contributions; Result Analyzer checks experimental results. (c) Review Aggregator: Synthesizes the draft and evidence into a coherent, accurate, and actionable review.
ReviewGrounder casts reviewing as a staged process that progressively refines an initial draft via targeted analysis, external evidence, and structured synthesis.
Stage I: Draft Review Generation. Given a paper $p$, a fine-tuned Drafter $\mathcal{P}$ generates an initial draft review $r^{(0)}$ that captures basic structure and stylistic conventions.
Stage II: Multi-dimensional Review Grounding. Three specialized agents collaboratively enrich the draft: Literature Searcher $\mathcal{S}$ situates the submission within contemporary literature via Semantic Scholar API and reranking; Insight Miner $\mathcal{M}$ consolidates conceptual understanding and refines method-focused critiques; Result Analyzer $\mathcal{A}$ strengthens empirical grounding by examining experiments, datasets, baselines, and quantitative evidence.
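To make the Literature Searcher concrete, here is a minimal sketch of retrieve-then-rerank over the Semantic Scholar Graph API. The endpoint and query parameters are the public API's; the token-overlap scorer is a deliberately simple stand-in for the learned reranker (e.g., OpenScholar-Reranker) the framework actually uses, and `top_k=10` mirrors the per-keyword setting studied later.

```python
# Illustrative Literature Searcher: fetch candidates for one extracted keyword
# from the Semantic Scholar Graph API, then rerank and keep the top k.
import json
import urllib.parse
import urllib.request

S2_SEARCH = "https://api.semanticscholar.org/graph/v1/paper/search"

def search_semantic_scholar(keyword: str, limit: int = 50) -> list:
    """Fetch candidate papers (title + abstract) for one keyword."""
    params = urllib.parse.urlencode(
        {"query": keyword, "limit": limit, "fields": "title,abstract"})
    with urllib.request.urlopen(f"{S2_SEARCH}?{params}", timeout=30) as resp:
        return json.load(resp).get("data", [])

def rerank(query: str, papers: list, top_k: int = 10) -> list:
    """Stand-in reranker: score candidates by token overlap with the query."""
    q_tokens = set(query.lower().split())
    def score(p: dict) -> int:
        text = f"{p.get('title') or ''} {p.get('abstract') or ''}".lower()
        return len(q_tokens & set(text.split()))
    return sorted(papers, key=score, reverse=True)[:top_k]
```

In practice a neural reranker would replace `score`, but the retrieve-then-rerank structure is the same.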
Stage III: Rubric-Guided Synthesis. The Aggregator $\mathcal{G}$ synthesizes the draft and grounded evidence $\mathsf{E}(p)$ with meta-rubrics $\mathsf{R}^{\text{meta}}$ to produce a coherent, accurate, and actionable final review. Paper-specific rubrics are not exposed at generation time, preventing evaluation leakage.
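The three stages compose straightforwardly. The sketch below shows the data flow only; each callable is a placeholder for an LLM-backed component, and note that only the meta-rubrics (never the paper-specific ones) are passed to the Aggregator.

```python
# Minimal sketch of the ReviewGrounder staged flow:
# Drafter -> grounding agents (Searcher, Miner, Analyzer) -> Aggregator.

def review_pipeline(paper, drafter, grounding_agents, aggregator, meta_rubrics):
    draft = drafter(paper)                        # Stage I: initial draft r^(0)
    evidence = {name: agent(paper, draft)         # Stage II: grounded evidence E(p)
                for name, agent in grounding_agents.items()}
    # Stage III: synthesize draft + evidence under the meta-rubrics.
    return aggregator(draft, evidence, meta_rubrics)
```

Keeping paper-specific rubrics out of this call chain is what prevents evaluation leakage.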
ReviewGrounder decomposes reviewing into collaborating agents with specialized capabilities
Qualitative comparison of reviews generated by ReviewGrounder and DeepReviewer-14B on the same paper
ReviewGrounder produces concise, evidence-grounded critiques with specific references (sections, equations, tables). DeepReviewer-14B tends to generate verbose, repetitive text that echoes prior reviewers without adding substantive, paper-specific insight.
We conduct evaluation on ReviewBench using two complementary metric families: (1) Rubric-based Evaluation: assesses textual quality across eight paper-specific rubric dimensions (Sec. 3.2); (2) Numeric-field Evaluation: measures predicted ratings with MSE/MAE and decisions with ACC/F1 (Sec. 3.3). Paper-specific rubrics and the evaluator are fixed across all methods; models access only the paper text at generation time.
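The numeric-field metrics are standard and can be stated precisely. A pure-Python sketch (in practice one would use `sklearn.metrics`); whether the paper reports binary or macro-averaged F1 is not specified here, so the binary accept/reject variant below is an assumption.

```python
# Numeric-field evaluation: rating error (MSE/MAE) and decision quality (ACC/F1).

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_f1(y_true, y_pred, positive="accept"):
    """F1 on accept/reject decisions, treating `positive` as the positive class."""
    tp = sum(t == p == positive for t, p in zip(y_true, y_pred))
    fp = sum(p == positive and t != positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```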
Baselines: (1) Foundation Models: Qwen3-32B, QWQ-32B, GPT-4o, GPT-4.1; (2) Agentic Frameworks: AI Scientist, AgentReview (instantiated with GPT-4o/GPT-4.1); (3) Fine-tuned Reviewers: CycleReviewer-8B/70B, DeepReviewer-7B/14B. ReviewGrounder uses Phi-4-14B as Drafter and GPT-OSS-120B for grounding agents.
Overall, ReviewGrounder consistently outperforms all baselines: it exceeds the best foundation model (Qwen3-32B) by 38%, AgentReview and AI Scientist (both with GPT-4o) by 121% and 193% respectively, and DeepReviewer-14B by 36%, while surpassing GPT-4o on every dimension (+135% overall).
| Method | Model | Core | Res. | Comp. | EBC | Clr. | Cov. | Tone | Contradict. | Overall | Δ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Foundation | Qwen3-32B | 1.70 | 0.76 | 0.58 | 0.14 | 1.61 | 1.15 | 2.00 | -0.15 | 7.80 | +38% |
| | QWQ-32B | 1.69 | 0.65 | 0.35 | 0.12 | 1.68 | 0.95 | 2.00 | -0.08 | 7.35 | +46% |
| | GPT-4o | 1.20 | 0.10 | 0.03 | 0.00 | 1.05 | 0.33 | 1.98 | -0.12 | 4.58 | +135% |
| | GPT-4.1 | 1.76 | 0.70 | 0.34 | 0.11 | 1.63 | 1.17 | 2.00 | -0.04 | 7.66 | +41% |
| AgentReview | GPT-4o | 1.13 | 0.16 | 0.11 | 0.13 | 1.34 | 0.59 | 2.00 | -0.16 | 4.87 | +121% |
| | GPT-4.1 | 1.03 | 0.13 | 0.12 | 0.00 | 1.41 | 0.63 | 1.98 | -0.16 | 4.96 | +117% |
| AI Scientist | GPT-4o | 0.85 | 0.00 | 0.02 | 0.00 | 0.67 | 0.18 | 1.76 | -0.19 | 3.68 | +193% |
| | GPT-4.1 | 1.67 | 0.48 | 0.36 | 0.08 | 1.56 | 1.13 | 1.94 | -0.09 | 7.09 | +52% |
| CycleReviewer | Llama-3.1-8B | 0.99 | 0.10 | 0.06 | 0.01 | 0.58 | 0.15 | 1.66 | -0.45 | 3.10 | +248% |
| | Llama-3.1-70B | 1.02 | 0.16 | 0.10 | 0.01 | 0.77 | 0.26 | 1.85 | -0.64 | 3.52 | +206% |
| DeepReviewer | Phi-4-7B | 1.42 | 0.45 | 0.33 | 0.13 | 1.37 | 1.06 | 1.94 | -0.40 | 6.32 | +70% |
| | Phi-4-14B | 1.63 | 0.65 | 0.50 | 0.35 | 1.68 | 1.29 | 1.99 | -0.19 | 7.90 | +36% |
| ReviewGrounder | Phi-4-14B | 1.85 | 1.41 | 0.91 | 1.48 | 1.92 | 1.33 | 2.00 | -0.12 | 10.77 | -- |

Higher scores indicate better performance. Contradict. is a pitfall dimension scored in {-2, -1, 0}; all other dimensions are scored in {0, 1, 2}. Core = Core Contribution Accuracy, Res. = Results Interpretation, Comp. = Comparative Analysis, EBC = Evidence-Based Critique, Clr. = Critique Clarity, Cov. = Completeness Coverage, Tone = Constructive Tone, Contradict. = False or Contradictory Claims.
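The Δ column appears to be ReviewGrounder's relative improvement over each baseline's Overall score; this reading is an inference from the table, sketched below.

```python
# How the Δ column appears to be computed: ReviewGrounder's Overall score
# relative to each baseline's Overall score, as a rounded percentage.

def delta_pct(reviewgrounder_overall: float, baseline_overall: float) -> int:
    return round((reviewgrounder_overall / baseline_overall - 1) * 100)
```

For example, 10.77 vs. Qwen3-32B's 7.80 gives the reported +38%.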
Compared with all baselines, ReviewGrounder achieves the lowest rating error (MSE: 1.1607, MAE: 0.8597) and the highest decision accuracy (ACC: 0.6809, F1: 0.6699). Relative to the strongest AI Scientist variant (Gemini-2.0-Flash-Thinking), it improves ACC by 8% and reduces MSE by roughly 63%.
| Method | Model | ACC ↑ | F1 ↑ | MSE ↓ | MAE ↓ |
|---|---|---|---|---|---|
| AgentReview | Claude-3-5-sonnet | 0.2826 | 0.2541 | 2.8406 | 1.2989 |
| | Gemini-2.0-Flash-Thinking | 0.4242 | 0.4242 | 2.6186 | 1.2170 |
| | DeepSeek-V3 | 0.3140 | 0.2506 | 1.9951 | 1.1017 |
| AI Scientist | GPT-o1 | 0.4167 | 0.4157 | 4.3072 | 1.7917 |
| | Claude-3-5-sonnet | 0.5579 | 0.4440 | 3.0992 | 1.3500 |
| | Gemini-2.0-Flash-Thinking | 0.6139 | 0.4808 | 3.9232 | 1.6470 |
| | DeepSeek-V3 | 0.4059 | 0.3988 | 4.8006 | 1.8403 |
| | DeepSeek-R1 | 0.4259 | 0.4161 | 4.7719 | 1.8099 |
| CycleReviewer | Llama-3.1-8B | 0.2354 | 0.3988 | 3.1324 | 1.3663 |
| | Llama-3.1-70B | 0.1545 | 0.4156 | 1.8440 | 1.0643 |
| DeepReviewer | Phi-4-7B | 0.6381 | 0.6068 | 1.4442 | 0.9416 |
| | Phi-4-14B | 0.6667 | 0.5204 | 1.3527 | 0.9041 |
| ReviewGrounder | Phi-4-14B | 0.6809 | 0.6699 | 1.1607 | 0.8597 |
Impact of Drafter Backbones. When trained on the same SFT data, Qwen3-4B outperforms Phi-4-7B but remains inferior to Phi-4-14B. Smaller Drafters (e.g., Qwen3-4B: 10.6418) still benefit substantially from grounding and aggregation.
Impact of Grounding Agents. Omitting any agent (Searcher, Miner, Analyzer) degrades performance relative to the full model (10.7699), underscoring the importance of each component.
| Drafter Phi-4-14B | Drafter Phi-4-7B | Drafter Qwen3-4B | Searcher | Miner | Analyzer | Overall |
|---|---|---|---|---|---|---|
| | | ✓ | ✓ | ✓ | ✓ | 10.6418 |
| | ✓ | | ✓ | ✓ | ✓ | 10.5928 |
| ✓ | | | | ✓ | ✓ | 10.6568 |
| ✓ | | | ✓ | | ✓ | 10.6526 |
| ✓ | | | ✓ | ✓ | | 10.0186 |
| ✓ | | | ✓ | ✓ | ✓ | 10.7699 |
Hyperparameter Study (Literature Searcher). Retrieving 10 reranked papers per keyword yields the highest overall score, and OpenScholar-Reranker significantly outperforms BAAI-BGE Base/Large.
Figure 3: Ablation on Literature Searcher configurations under rubric-based evaluation.
Defense Against Adversarial Attacks. With malicious instructions injected into input papers (500-sample subset): ReviewGrounder shows strong resilience (10.70 → 10.65, a drop of only 0.05), while DeepReviewer-14B drops from 7.70 to 7.30. Rubric-based evaluation prevents score inflation via instruction injection.
Figure 4: Comparison with baselines under normal and attack scenarios via rubric-based evaluation.
Human Evaluation. On 120 papers, expert raters (avg. 2,000 Google Scholar citations) closely align with the LLM evaluator: Pearson r = 0.8954, Spearman ρ = 0.7923, MAE = 0.0969, Pairwise Error = 0.1494.
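For reference, the agreement statistics above can be computed as follows. A pure-Python sketch; in practice `scipy.stats.pearsonr`/`spearmanr` are the standard tools, and the simple rank transform here ignores ties.

```python
# Agreement between human raters and the LLM evaluator:
# Pearson r on raw scores, Spearman rho as Pearson r on ranks.

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman_rho(x, y):
    def ranks(v):  # position of each value in sorted order (no tie handling)
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    return pearson_r(ranks(x), ranks(y))
```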
@inproceedings{reviewgrounder2026acl,
title={ReviewGrounder: Improving Review Substantiveness with Rubric-Guided, Tool-Integrated Agents},
author={Li, Zhuofeng and Lu, Yi and Zhang, Haoxiang and Zhang, Yu},
booktitle={Proceedings of the Association for Computational Linguistics (ACL)},
year={2026}
}