CMU Paper Reviewer

Expert Annotation Study

How 45 expert scientists evaluated AI-generated reviews of Nature-family papers.

In our study, 45 expert scientists meta-reviewed AI-generated reviews of Nature-family papers, judging each review item — from three frontier LLM agents and from the papers' official human reviewers — against the official human reviews for significance, correctness, and evidence.

Figure 1 · Motivation
Figure 1: motivation for the expert annotation study

Earlier work judged AI reviews with shallow proxies — overall score correlation and accept/reject agreement — which never reveal whether an AI's criticisms are actually useful or whether they overlap with what humans flagged. We instead had expert scientists judge every individual review item.

Table 9 · Platform comparison
Table 9: comparison of AI reviewer platforms on PeerReview Bench

On PeerReview Bench (78 papers), CMU Paper Reviewer reaches higher precision than the other public platforms — 95.5% vs 59.8% (Stanford Agentic Reviewer) and 57.6% (OpenAIReview) — while producing fewer review items (4.7 vs 11.1 and 18.6). Allowed up to 15 items, its best configuration reaches the highest F1 (58.6%).
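To make the precision and F1 figures above concrete, here is a minimal sketch of how such item-level metrics are commonly computed. It assumes precision is the share of generated review items judged valid and recall is the share of reference issues that the review covers; the benchmark's exact matching procedure is described in the paper and may differ. The function name and all counts are hypothetical.

```python
from __future__ import annotations


def precision_recall_f1(valid_items: int, total_items: int,
                        covered_refs: int, total_refs: int) -> tuple[float, float, float]:
    """Item-level precision, recall, and F1 (assumed definitions, see lead-in)."""
    # Precision: fraction of generated items that were judged valid.
    precision = valid_items / total_items if total_items else 0.0
    # Recall: fraction of reference issues that the review covered.
    recall = covered_refs / total_refs if total_refs else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for illustration only, not data from the study.
p, r, f1 = precision_recall_f1(valid_items=9, total_items=10,
                               covered_refs=4, total_refs=10)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Under these assumed definitions, a short review can score high precision while its F1 stays moderate, because covering few reference issues lowers recall.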

Table 5 · Reviews vs. humans
Table 5: win-rate of AI reviews against human references

Judged on the full review, GPT-5.2 matches or exceeds the top-rated human reviewer on 48.6% of papers, and the lowest-rated human on 73.4%. Claude Opus 4.5 (32.1%) and Gemini 3 Pro (30.3%) trail against the top human.
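A "matches or exceeds" rate of this kind can be sketched as follows, assuming each paper has one overall quality rating for the AI review and one for the human reference, and that ties count in the AI's favor. How the study actually aggregates expert judgments into a per-paper comparison is an assumption here; the function name and the ratings are hypothetical.

```python
def match_or_exceed_rate(ai_ratings, human_ratings):
    """Fraction of papers where the AI review's rating is >= the human's."""
    if len(ai_ratings) != len(human_ratings):
        raise ValueError("ratings must be paired per paper")
    if not ai_ratings:
        raise ValueError("need at least one paper")
    # A tie counts as a match, so the comparison is >= rather than >.
    wins = sum(a >= h for a, h in zip(ai_ratings, human_ratings))
    return wins / len(ai_ratings)


# Hypothetical per-paper ratings for illustration only, not data from the study.
ai = [3, 2, 4, 1, 5]
top_human = [3, 4, 4, 2, 4]
rate = match_or_exceed_rate(ai, top_human)
print(f"{rate:.1%}")
```

Counting ties as matches is why the metric reads "matches or exceeds" rather than a strict win rate.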

Table 3 · Significance vs. correctness
Reviewer             Correctness %   Significance (0–2)   Evidence %   Items
Top-Rated Human      92.3            1.39                 92.2         1,139
Lowest-Rated Human   79.1            1.30                 89.7         833
GPT-5.2              86.2            1.61                 97.1         442
Claude Opus 4.5      83.7            1.53                 96.5         475
Gemini 3 Pro         81.9            1.56                 89.5         460

Per review item, AI reviewers raise more significant issues than even the top-rated human (significance 1.5–1.6 vs 1.39, p<.001) but are factually correct slightly less often (82–86% vs 92%). Their evidence quality is on par with or above that of the human reviewers (89.5–97.1% vs 89.7–92.2%). Point estimates shown; 95% CIs and effect sizes are in the paper.

Figure 4 · Reviewer overlap
Figure 4: overlap between human and AI reviewers

AI reviewers overlap with one another far more than humans do. A panel of three AI reviewers covers most of the human reviewers' targets, but only about half of their specific criticisms — broadening coverage without fully reproducing human judgment.

Together: today's AI reviewers raise more significant issues and rival human reviewers on many papers, but their lower correctness and field-context gaps mean they are best used to augment, not replace, expert review. Read the full study in our paper.