CMU Paper Reviewer

Expert Annotation Study

How 45 expert scientists evaluated AI-generated reviews of Nature-family papers.

In our study, 45 expert scientists meta-reviewed both the AI-generated reviews and the official human reviews of 82 Nature-family papers, rating each criticism for correctness, significance, and sufficiency of evidence.

Figure 1 · Motivation
Figure 1: motivation for the expert annotation study

Earlier work judged AI reviews with shallow proxies, such as overall score correlation and accept/reject agreement, which never reveal whether an AI's criticisms are actually useful or whether they overlap with what humans flagged. We instead had expert scientists judge every individual review item.

Table 5 · Reviews vs. humans
Table 5: win-rate of AI reviews against human references

Judged on the full review, GPT-5.2 matches or exceeds the top-rated human reviewer on 48.6% of papers and the lowest-rated human reviewer on 73.4%. Against the top-rated human, Claude Opus 4.5 (32.1%) and Gemini 3 Pro (30.3%) trail GPT-5.2.
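As a rough illustration of how a "matches or exceeds" win rate can be computed, the sketch below assumes each paper has one expert score for the AI review and one for the human reference. The function name `win_rate`, the score scale, and the toy values are hypothetical and are not taken from the study.

```python
from typing import Sequence


def win_rate(ai_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Percentage of papers where the AI review scores at least as high as the human one."""
    if len(ai_scores) != len(human_scores):
        raise ValueError("need one score per paper for both reviewers")
    if not ai_scores:
        raise ValueError("need at least one paper")
    # A tie counts as a win, mirroring the "matches or exceeds" wording.
    wins = sum(1 for ai, human in zip(ai_scores, human_scores) if ai >= human)
    return 100.0 * wins / len(ai_scores)


# Toy scores for four papers (made up for illustration only).
ai = [3.0, 2.0, 4.0, 1.0]
top_human = [3.0, 3.0, 2.0, 2.0]
rate = win_rate(ai, top_human)
print(f"{rate:.1f}%")
```

Counting ties as wins is a design choice that follows the wording above; a strict "exceeds" rate would use `>` instead of `>=`.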

Table 3 · Significance vs. correctness
Reviewer             Correctness %   Significance (0–2)   Evidence %   Items
Top-Rated Human      92.3            1.39                 92.2         1,139
Lowest-Rated Human   79.1            1.30                 89.7         833
GPT-5.2              86.2            1.61                 97.1         442
Claude Opus 4.5      83.7            1.53                 96.5         475
Gemini 3 Pro         81.9            1.56                 89.5         460

Per review item, AI reviewers raise more significant issues than even the top-rated human (significance 1.5–1.6 vs 1.39, p<.001) but are factually correct slightly less often (82–86% vs 92%). Their evidence quality is on par with or above that of humans (89.5–97.1% vs 89.7–92.2%). Point estimates are shown; 95% CIs and effect sizes are in the paper.
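Per-item figures of this kind are simple aggregates over annotated review items. A minimal sketch follows, assuming each item carries a boolean correctness label, a 0–2 significance rating, and a boolean evidence label. The `Item` fields and the `summarize` helper are hypothetical, and the study's actual aggregation may differ (for example, in which items count toward the significance and evidence columns).

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Item:
    correct: bool              # expert judged the criticism factually correct
    significance: int          # expert rating on a 0-2 scale
    sufficient_evidence: bool  # expert judged the supporting evidence sufficient


def summarize(items: List[Item]) -> Dict[str, float]:
    """Aggregate item-level expert labels into one row of per-reviewer statistics."""
    if not items:
        raise ValueError("need at least one review item")
    n = len(items)
    return {
        "correctness_pct": 100.0 * sum(i.correct for i in items) / n,
        "mean_significance": mean(i.significance for i in items),
        "evidence_pct": 100.0 * sum(i.sufficient_evidence for i in items) / n,
        "items": n,
    }


# Toy annotations (made up for illustration only).
toy_items = [
    Item(True, 2, True),
    Item(True, 1, True),
    Item(False, 0, False),
    Item(True, 2, True),
]
row = summarize(toy_items)
print(row)
```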

Figure 4 · Reviewer overlap
Figure 4: overlap between human and AI reviewers

AI reviewers overlap with one another far more than humans do. A panel of three AI reviewers covers most of the human reviewers' targets, but only about half of their specific criticisms — broadening coverage without fully reproducing human judgment.
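The panel-coverage idea can be sketched as set arithmetic: pool what the AI reviewers raised and measure what share of the human reviewers' points falls inside that pool. The sketch assumes targets have already been normalized to shared labels, which in practice requires a matching step that is not shown here. The `panel_coverage` helper and all labels are hypothetical.

```python
from typing import Iterable, Set


def panel_coverage(human_targets: Set[str], panel: Iterable[Set[str]]) -> float:
    """Percentage of human-raised targets that at least one panel member also raised."""
    if not human_targets:
        raise ValueError("need at least one human target")
    # Union of everything any AI reviewer on the panel touched.
    pooled = set().union(*panel)
    return 100.0 * len(human_targets & pooled) / len(human_targets)


# Toy target labels (made up for illustration only).
human = {"methods", "figure 2", "statistics", "related work"}
ai_panel = [
    {"methods", "statistics"},
    {"methods", "figure 2"},
    {"statistics", "abstract"},
]
pct = panel_coverage(human, ai_panel)
print(f"{pct:.1f}%")
```

The same function applies at either granularity: passing sets of targets gives target-level coverage, while passing sets of matched criticisms gives the stricter criticism-level figure.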

Table 9 · Platform comparison
Table 9: comparison of AI reviewer platforms on PeerReview Bench

On PeerReview Bench (78 papers), CMU Paper Reviewer reaches higher precision than the other public platforms — 95.5% vs 59.8% (Stanford Agentic Reviewer) and 57.6% (OpenAIReview) — while producing fewer review items (4.7 vs 11.1 and 18.6). Allowed up to 15 items, its best configuration reaches the highest F1 (58.6%).
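Precision and F1 follow the standard definitions, and the sketch below shows why a reviewer that emits few items can score high on precision while F1 also depends on recall. How PeerReview Bench decides that a review item counts as a true positive is not described here, so the counts are hypothetical.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard precision, recall, and F1 from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy counts (made up): 5 items emitted, 4 of them valid, 4 reference issues missed.
p, r, f1 = precision_recall_f1(tp=4, fp=1, fn=4)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

In this toy case precision is 0.8 but recall is only 0.5, so F1 lands between them. Allowing more items per review can raise recall, at the risk of lowering precision.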

Together: today's AI reviewers raise more significant issues and rival human reviewers on many papers, but their lower correctness and field-context gaps mean they are best used to augment, not replace, expert review. Read the full study in our paper.