Peer review is the primary mechanism through which the research community filters and improves new scientific work before publication. The rapid growth of submissions at major AI conferences (now surpassing 10,000 at leading venues) has placed sustained pressure on peer review workflows. Meanwhile, recent advances in LLMs have spurred growing interest in using them to assist or complement peer review.
Despite these advances, prior work has highlighted notable shortcomings in existing LLM-based peer review frameworks: they produce routine, template-like critiques; accept authors' claimed novelty or limitations without thorough verification; and lack technical detail, actionable suggestions, and justification grounded in the paper. These limitations can be traced to the underutilization of two crucial sources: (1) Reviewer Guidelines and Rubrics: top-tier venues provide well-established guidelines; and (2) Context from Existing Work: assessing novelty inherently requires situating a paper relative to existing work.
We introduce ReviewBench, a benchmark that leverages reviewer rubrics in an explicit and systematic manner, and ReviewGrounder, a rubric-guided, tool-integrated, multi-agent framework for producing grounded, content-rich reviews. ReviewGrounder decomposes reviewing into collaborating agents: the drafter produces an initial draft, and subsequent grounding agents (Literature Searcher, Insight Miner, Result Analyzer) refine it using tools. Experiments show that ReviewGrounder with a Phi-4-14B-based drafter and GPT-OSS-120B-based grounding consistently outperforms baselines including GPT-4.1 and DeepSeek-R1-670B across rubric-based dimensions and human-aligned metrics.
Overview of the ReviewBench construction pipeline. For each paper, paper-specific rubrics are instantiated by an aggregated reference review, the submission PDF, and meta-rubrics.
Similarity-based metrics and LLM-as-a-Judge approaches used by prior studies either fail to capture fine-grained review competencies or rely on ambiguous evaluation criteria. We introduce ReviewBench, a benchmark built on DeepReview-13K that augments each paper $p$ and its human reviews $\mathsf{H}_p$ with: (1) an aggregated reference review $r^*_p$; and (2) a set of paper-specific rubrics $\mathsf{R}^{\text{paper}}_p$. By leveraging these alongside an evaluator $\mathcal{E}$, ReviewBench enables accurate, multi-faceted assessment.
We define eight paper-agnostic meta-rubrics: Core Contribution Accuracy, Results Interpretation, Comparative Analysis, Evidence-Based Critique, Critique Clarity, Completeness Coverage, Constructive Tone, and False or Contradictory Claims (pitfall). Each meta-rubric is instantiated into paper-specific rubrics $\mathsf{R}^{\text{paper}}_{p,i}$ using the reference review and paper content. The overall content score is $S(p,\hat{r}_p) = \sum_{i=1}^{8} s_{p,i}$.
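The aggregation above can be sketched in a few lines. This is a hypothetical illustration, not the paper's released code: the dimension names follow the eight meta-rubrics, with seven positive dimensions scored in {0, 1, 2} and the pitfall dimension in {-2, -1, 0}.

```python
# Sketch of the overall content score S(p, r) = sum of the eight per-dimension
# scores s_{p,i}. Per-dimension scores would come from the evaluator E;
# the names and score ranges follow the meta-rubrics described above.

POSITIVE_DIMS = [
    "core_contribution_accuracy", "results_interpretation",
    "comparative_analysis", "evidence_based_critique",
    "critique_clarity", "completeness_coverage", "constructive_tone",
]
PITFALL_DIM = "false_or_contradictory_claims"

def overall_score(dim_scores: dict) -> int:
    """Sum the eight per-dimension scores after range-checking each one."""
    for d in POSITIVE_DIMS:
        assert dim_scores[d] in (0, 1, 2), f"{d} out of range"
    assert dim_scores[PITFALL_DIM] in (-2, -1, 0), "pitfall score out of range"
    return sum(dim_scores[d] for d in POSITIVE_DIMS) + dim_scores[PITFALL_DIM]
```

With all positive dimensions at 2 and no pitfall penalty, the maximum overall score is 14.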
Overview of ReviewGrounder. (a) Review Drafter: Generates an initial draft based on the paper. (b) Multi-dimensional Grounding Agents: Literature Searcher retrieves and summarizes related work; Insight Miner verifies methodology and core contributions; Result Analyzer checks experimental results. (c) Review Aggregator: Synthesizes the draft and evidence into a coherent, accurate, and actionable review.
ReviewGrounder casts reviewing as a staged process that progressively refines an initial draft via targeted analysis, external evidence, and structured synthesis.
Stage I: Draft Review Generation. Given a paper $p$, a fine-tuned Drafter $\mathcal{P}$ generates an initial draft review $r^{(0)}$ that captures basic structure and stylistic conventions.
Stage II: Multi-dimensional Review Grounding. Three specialized agents collaboratively enrich the draft: Literature Searcher $\mathcal{S}$ situates the submission within contemporary literature via Semantic Scholar API and reranking; Insight Miner $\mathcal{M}$ consolidates conceptual understanding and refines method-focused critiques; Result Analyzer $\mathcal{A}$ strengthens empirical grounding by examining experiments, datasets, baselines, and quantitative evidence.
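To make the Literature Searcher concrete, here is a minimal sketch of retrieve-then-rerank over the Semantic Scholar Graph API. The endpoint and query parameters are the public API's; the token-overlap scorer is a deliberately simple stand-in for the learned reranker (e.g., OpenScholar-Reranker) the framework actually uses, and `top_k=10` mirrors the per-keyword setting studied later.

```python
# Illustrative Literature Searcher: fetch candidates for one extracted keyword
# from the Semantic Scholar Graph API, then rerank and keep the top k.
import json
import urllib.parse
import urllib.request

S2_SEARCH = "https://api.semanticscholar.org/graph/v1/paper/search"

def search_semantic_scholar(keyword: str, limit: int = 50) -> list:
    """Fetch candidate papers (title + abstract) for one keyword."""
    params = urllib.parse.urlencode(
        {"query": keyword, "limit": limit, "fields": "title,abstract"})
    with urllib.request.urlopen(f"{S2_SEARCH}?{params}", timeout=30) as resp:
        return json.load(resp).get("data", [])

def rerank(query: str, papers: list, top_k: int = 10) -> list:
    """Stand-in reranker: score candidates by token overlap with the query."""
    q_tokens = set(query.lower().split())
    def score(p: dict) -> int:
        text = f"{p.get('title') or ''} {p.get('abstract') or ''}".lower()
        return len(q_tokens & set(text.split()))
    return sorted(papers, key=score, reverse=True)[:top_k]
```

In practice a neural reranker would replace `score`, but the retrieve-then-rerank structure is the same.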
Stage III: Rubric-Guided Synthesis. The Aggregator $\mathcal{G}$ synthesizes the draft and grounded evidence $\mathsf{E}(p)$ with meta-rubrics $\mathsf{R}^{\text{meta}}$ to produce a coherent, accurate, and actionable final review. Paper-specific rubrics are not exposed at generation time, preventing evaluation leakage.
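The three stages compose straightforwardly. The sketch below shows the data flow only; each callable is a placeholder for an LLM-backed component, and note that only the meta-rubrics (never the paper-specific ones) are passed to the Aggregator.

```python
# Minimal sketch of the ReviewGrounder staged flow:
# Drafter -> grounding agents (Searcher, Miner, Analyzer) -> Aggregator.

def review_pipeline(paper, drafter, grounding_agents, aggregator, meta_rubrics):
    draft = drafter(paper)                        # Stage I: initial draft r^(0)
    evidence = {name: agent(paper, draft)         # Stage II: grounded evidence E(p)
                for name, agent in grounding_agents.items()}
    # Stage III: synthesize draft + evidence under the meta-rubrics.
    return aggregator(draft, evidence, meta_rubrics)
```

Keeping paper-specific rubrics out of this call chain is what prevents evaluation leakage.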
ReviewGrounder decomposes reviewing into collaborating agents with specialized capabilities
Qualitative comparison of reviews generated by ReviewGrounder and DeepReviewer-14B on the same paper
ReviewGrounder produces concise, evidence-grounded critiques with specific references (sections, equations, tables). DeepReviewer-14B tends to generate verbose, repetitive text that echoes prior reviewers without adding substantive, paper-specific insight.
We conduct evaluation on ReviewBench using two complementary metric families: (1) Rubric-based Evaluation: assesses textual quality across eight paper-specific rubric dimensions (Sec. 3.2); (2) Numeric-field Evaluation: measures predicted ratings with MSE/MAE and decisions with ACC/F1 (Sec. 3.3). Paper-specific rubrics and the evaluator are fixed across all methods; models access only the paper text at generation time.
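The numeric-field metrics are standard and can be stated precisely. A pure-Python sketch (in practice one would use `sklearn.metrics`); whether the paper reports binary or macro-averaged F1 is not specified here, so the binary accept/reject variant below is an assumption.

```python
# Numeric-field evaluation: rating error (MSE/MAE) and decision quality (ACC/F1).

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_f1(y_true, y_pred, positive="accept"):
    """F1 on accept/reject decisions, treating `positive` as the positive class."""
    tp = sum(t == p == positive for t, p in zip(y_true, y_pred))
    fp = sum(p == positive and t != positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```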
Baselines: (1) Foundation Models: Qwen3-32B, QWQ-32B, GPT-4o, GPT-4.1; (2) Agentic Frameworks: AI Scientist, AgentReview (instantiated with GPT-4o/GPT-4.1); (3) Fine-tuned Reviewers: CycleReviewer-8B/70B, DeepReviewer-7B/14B. ReviewGrounder uses Phi-4-14B as Drafter and GPT-OSS-120B for grounding agents.
Overall, ReviewGrounder consistently outperforms all baselines: it exceeds the best foundation model (Qwen3-32B) by 38%, AgentReview and AI Scientist (both with GPT-4o) by 121% and 193% respectively, and DeepReviewer-14B by 36%, while surpassing GPT-4o on every dimension (+135% overall).
| Method | Model | Core | Res. | Comp. | EBC | Clr. | Cov. | Tone | Contradict. | Overall | Δ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Foundation | Qwen3-32B | 1.70 | 0.76 | 0.58 | 0.14 | 1.61 | 1.15 | 2.00 | -0.15 | 7.80 | +38% |
| | QWQ-32B | 1.69 | 0.65 | 0.35 | 0.12 | 1.68 | 0.95 | 2.00 | -0.08 | 7.35 | +46% |
| | GPT-4o | 1.20 | 0.10 | 0.03 | 0.00 | 1.05 | 0.33 | 1.98 | -0.12 | 4.58 | +135% |
| | GPT-4.1 | 1.76 | 0.70 | 0.34 | 0.11 | 1.63 | 1.17 | 2.00 | -0.04 | 7.66 | +41% |
| AgentReview | GPT-4o | 1.13 | 0.16 | 0.11 | 0.13 | 1.34 | 0.59 | 2.00 | -0.16 | 4.87 | +121% |
| | GPT-4.1 | 1.03 | 0.13 | 0.12 | 0.00 | 1.41 | 0.63 | 1.98 | -0.16 | 4.96 | +117% |
| AI Scientist | GPT-4o | 0.85 | 0.00 | 0.02 | 0.00 | 0.67 | 0.18 | 1.76 | -0.19 | 3.68 | +193% |
| | GPT-4.1 | 1.67 | 0.48 | 0.36 | 0.08 | 1.56 | 1.13 | 1.94 | -0.09 | 7.09 | +52% |
| CycleReviewer | Llama-3.1-8B | 0.99 | 0.10 | 0.06 | 0.01 | 0.58 | 0.15 | 1.66 | -0.45 | 3.10 | +248% |
| | Llama-3.1-70B | 1.02 | 0.16 | 0.10 | 0.01 | 0.77 | 0.26 | 1.85 | -0.64 | 3.52 | +206% |
| DeepReviewer | Phi-4-7B | 1.42 | 0.45 | 0.33 | 0.13 | 1.37 | 1.06 | 1.94 | -0.40 | 6.32 | +70% |
| | Phi-4-14B | 1.63 | 0.65 | 0.50 | 0.35 | 1.68 | 1.29 | 1.99 | -0.19 | 7.90 | +36% |
| ReviewGrounder | Phi-4-14B | 1.85 | 1.41 | 0.91 | 1.48 | 1.92 | 1.33 | 2.00 | -0.12 | 10.77 | -- |

Higher scores indicate better performance. Contradict. is a pitfall dimension scored in {-2, -1, 0}; all other dimensions are scored in {0, 1, 2}. Core = Core Contribution Accuracy, Res. = Results Interpretation, Comp. = Comparative Analysis, EBC = Evidence-Based Critique, Clr. = Critique Clarity, Cov. = Completeness Coverage, Tone = Constructive Tone, Contradict. = False or Contradictory Claims.
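The Δ column appears to be ReviewGrounder's relative improvement over each baseline's Overall score; this reading is an inference from the table, sketched below.

```python
# How the Δ column appears to be computed: ReviewGrounder's Overall score
# relative to each baseline's Overall score, as a rounded percentage.

def delta_pct(reviewgrounder_overall: float, baseline_overall: float) -> int:
    return round((reviewgrounder_overall / baseline_overall - 1) * 100)
```

For example, 10.77 vs. Qwen3-32B's 7.80 gives the reported +38%.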
Compared with all baselines, ReviewGrounder achieves the lowest rating error (MSE: 1.1607, MAE: 0.8597) and the highest decision accuracy (ACC: 0.6809, F1: 0.6699). Relative to the strongest AI Scientist variant (Gemini-2.0-Flash-Thinking), it improves ACC by 8% and reduces MSE by roughly 63%.
| Method | Model | ACC ↑ | F1 ↑ | MSE ↓ | MAE ↓ |
|---|---|---|---|---|---|
| AgentReview | Claude-3-5-sonnet | 0.2826 | 0.2541 | 2.8406 | 1.2989 |
| | Gemini-2.0-Flash-Thinking | 0.4242 | 0.4242 | 2.6186 | 1.2170 |
| | DeepSeek-V3 | 0.3140 | 0.2506 | 1.9951 | 1.1017 |
| AI Scientist | GPT-o1 | 0.4167 | 0.4157 | 4.3072 | 1.7917 |
| | Claude-3-5-sonnet | 0.5579 | 0.4440 | 3.0992 | 1.3500 |
| | Gemini-2.0-Flash-Thinking | 0.6139 | 0.4808 | 3.9232 | 1.6470 |
| | DeepSeek-V3 | 0.4059 | 0.3988 | 4.8006 | 1.8403 |
| | DeepSeek-R1 | 0.4259 | 0.4161 | 4.7719 | 1.8099 |
| CycleReviewer | Llama-3.1-8B | 0.2354 | 0.3988 | 3.1324 | 1.3663 |
| | Llama-3.1-70B | 0.1545 | 0.4156 | 1.8440 | 1.0643 |
| DeepReviewer | Phi-4-7B | 0.6381 | 0.6068 | 1.4442 | 0.9416 |
| | Phi-4-14B | 0.6667 | 0.5204 | 1.3527 | 0.9041 |
| ReviewGrounder | Phi-4-14B | 0.6809 | 0.6699 | 1.1607 | 0.8597 |
Impact of Drafter Backbones. When trained on the same SFT data, Qwen3-4B outperforms Phi-4-7B but remains inferior to Phi-4-14B. Smaller Drafters (e.g., Qwen3-4B: 10.6418) still benefit substantially from grounding and aggregation.
Impact of Grounding Agents. Omitting any agent (Searcher, Miner, Analyzer) degrades performance relative to the full model (10.7699), underscoring the importance of each component.
| Drafter Phi-4-14B | Drafter Phi-4-7B | Drafter Qwen3-4B | Searcher | Miner | Analyzer | Overall |
|---|---|---|---|---|---|---|
| | | ✓ | ✓ | ✓ | ✓ | 10.6418 |
| | ✓ | | ✓ | ✓ | ✓ | 10.5928 |
| ✓ | | | | ✓ | ✓ | 10.6568 |
| ✓ | | | ✓ | | ✓ | 10.6526 |
| ✓ | | | ✓ | ✓ | | 10.0186 |
| ✓ | | | ✓ | ✓ | ✓ | 10.7699 |
Hyperparameter Study (Literature Searcher). Retrieving 10 reranked papers per keyword yields the highest overall score, and OpenScholar-Reranker significantly outperforms BAAI-BGE Base/Large.
Figure 3: Ablation on Literature Searcher configurations under rubric-based evaluation.
Defense Against Adversarial Attacks. With malicious instructions injected into input papers (500-sample subset): ReviewGrounder shows strong resilience (10.70 → 10.65, a drop of only 0.05), while DeepReviewer-14B drops from 7.70 to 7.30. Rubric-based evaluation prevents score inflation via instruction injection.
Figure 4: Comparison with baselines under normal and attack scenarios via rubric-based evaluation.
Human Evaluation. On 120 papers, expert raters (avg. 2,000 Google Scholar citations) closely align with the LLM evaluator: Pearson r = 0.8954, Spearman ρ = 0.7923, MAE = 0.0969, Pairwise Error = 0.1494.
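For reference, the agreement statistics above can be computed as follows. A pure-Python sketch; in practice `scipy.stats.pearsonr`/`spearmanr` are the standard tools, and the simple rank transform here ignores ties.

```python
# Agreement between human raters and the LLM evaluator:
# Pearson r on raw scores, Spearman rho as Pearson r on ranks.

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman_rho(x, y):
    def ranks(v):  # position of each value in sorted order (no tie handling)
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    return pearson_r(ranks(x), ranks(y))
```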
@inproceedings{reviewgrounder2026acl,
title={ReviewGrounder: Improving Review Substantiveness with Rubric-Guided, Tool-Integrated Agents},
author={Li, Zhuofeng and Lu, Yi and Zhang, Haoxiang and Zhang, Yu},
booktitle={Proceedings of the Association for Computational Linguistics (ACL)},
year={2026}
}