Expert Annotation Study
How 45 expert scientists evaluated AI-generated reviews of Nature-family papers.
In our study, 45 expert scientists meta-reviewed AI-generated reviews of Nature-family papers, judging each review item — from three frontier LLM agents and from the papers' official human reviewers — against the official human reviews for significance, correctness, and evidence.
Earlier work judged AI reviews with shallow proxies — overall score correlation and accept/reject agreement — which never reveal whether an AI's criticisms are actually useful or whether they overlap with what humans flagged. We instead had expert scientists judge every individual review item.
On PeerReview Bench (78 papers), CMU Paper Reviewer reaches higher precision than the other public platforms — 95.5% vs 59.8% (Stanford Agentic Reviewer) and 57.6% (OpenAIReview) — while producing fewer review items (4.7 vs 11.1 and 18.6). Allowed up to 15 items, its best configuration reaches the highest F1 (58.6%).
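The trade-off between precision and item count can be sketched as follows. This is a minimal illustration, not the benchmark's scoring code: it assumes precision is the share of a reviewer's items judged valid and recall is the share of reference issues the reviewer covers. The function name and all counts are hypothetical, and the exact definitions used on PeerReview Bench are in the paper.

```python
def precision_recall_f1(valid_items: int, total_items: int,
                        covered_refs: int, total_refs: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from item-level counts.

    Assumed definitions (not taken from the paper):
      precision = valid items / items produced
      recall    = reference issues covered / reference issues
    """
    precision = valid_items / total_items if total_items else 0.0
    recall = covered_refs / total_refs if total_refs else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: a terse reviewer vs. a more verbose one on the same paper.
terse = precision_recall_f1(valid_items=4, total_items=5, covered_refs=4, total_refs=12)
verbose = precision_recall_f1(valid_items=9, total_items=15, covered_refs=8, total_refs=12)
print(f"terse:   P={terse[0]:.2f} R={terse[1]:.2f} F1={terse[2]:.2f}")
print(f"verbose: P={verbose[0]:.2f} R={verbose[1]:.2f} F1={verbose[2]:.2f}")
```

In these made-up numbers the terse reviewer has the higher precision while the verbose one has the higher F1, which is why a platform can lead on precision at a low item budget and lead on F1 only when allowed more items.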
Judged on the full review, GPT-5.2 matches or exceeds the top-rated human reviewer on 48.6% of papers and the lowest-rated human reviewer on 73.4%. Against the top-rated human, Claude Opus 4.5 (32.1%) and Gemini 3 Pro (30.3%) trail.
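A "matches or exceeds" rate of this kind can be computed as a per-paper comparison. The sketch below assumes each reviewer has one full-review score per paper; how that score is derived from the expert judgments is not specified here, so the function name, the score scale, and the example data are all hypothetical.

```python
def match_or_exceed_rate(ai_scores: dict[str, float],
                         human_scores: dict[str, float]) -> float:
    """Fraction of shared papers on which the AI score is >= the human score."""
    papers = ai_scores.keys() & human_scores.keys()
    if not papers:
        raise ValueError("no papers in common")
    wins = sum(ai_scores[p] >= human_scores[p] for p in papers)
    return wins / len(papers)


# Hypothetical per-paper scores (paper id -> full-review score).
ai = {"p1": 1.6, "p2": 1.2, "p3": 1.4, "p4": 0.9}
top_human = {"p1": 1.5, "p2": 1.5, "p3": 1.4, "p4": 1.3}
print(match_or_exceed_rate(ai, top_human))  # 0.5: ahead on p1, tied on p3
```

Ties count in the AI's favor here, matching the "matches or exceeds" wording.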
| Reviewer | Correctness % | Significance (0–2) | Evidence % | Items |
|---|---|---|---|---|
| Top-Rated Human | 92.3 | 1.39 | 92.2 | 1,139 |
| Lowest-Rated Human | 79.1 | 1.30 | 89.7 | 833 |
| GPT-5.2 | 86.2 | 1.61 | 97.1 | 442 |
| Claude Opus 4.5 | 83.7 | 1.53 | 96.5 | 475 |
| Gemini 3 Pro | 81.9 | 1.56 | 89.5 | 460 |
Per review item, AI reviewers raise more significant issues than even the top-rated human (significance 1.5–1.6 vs 1.39, p<.001) but are factually correct slightly less often (82–86% vs 92%). Their evidence rates (89.5–97.1%) are on par with or above those of the human reviewers (89.7–92.2%). Point estimates shown; 95% CIs and effect sizes are in the paper.
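The per-item columns in the table can be read as simple aggregates over the expert annotations. The sketch below assumes each item carries a correct/incorrect judgment, a significance rating on the 0–2 scale, and an evidence flag, and that the table reports a percentage, a mean, and a percentage respectively. The `ItemAnnotation` record and the example items are hypothetical, not the study's data format.

```python
from dataclasses import dataclass


@dataclass
class ItemAnnotation:
    correct: bool        # expert judged the item factually correct
    significance: int    # assumed scale: 0, 1, or 2
    has_evidence: bool   # expert judged the item adequately supported


def summarize(items: list[ItemAnnotation]) -> dict[str, float]:
    """Aggregate item-level annotations into table-style summary columns."""
    if not items:
        raise ValueError("need at least one annotated item")
    n = len(items)
    return {
        "correctness_pct": 100 * sum(i.correct for i in items) / n,
        "mean_significance": sum(i.significance for i in items) / n,
        "evidence_pct": 100 * sum(i.has_evidence for i in items) / n,
        "items": n,
    }


# Four hypothetical annotated items from one reviewer.
example = [
    ItemAnnotation(True, 2, True),
    ItemAnnotation(True, 1, True),
    ItemAnnotation(False, 2, True),
    ItemAnnotation(True, 1, False),
]
print(summarize(example))
```

Because these are per-item averages, a reviewer who writes fewer items can score higher on significance without covering more of the paper.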
AI reviewers overlap with one another far more than humans do. A panel of three AI reviewers covers most of the human reviewers' targets, but only about half of their specific criticisms — broadening coverage without fully reproducing human judgment.
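Overlap and panel coverage can both be expressed as set operations, assuming each review item has already been mapped to an identifier for what it addresses. The sketch below uses made-up target labels; how the study matches AI items to human targets and criticisms is not described here, and both function names are hypothetical.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two reviewers: shared identifiers / all identifiers."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def panel_coverage(panel: list[set[str]], human: set[str]) -> float:
    """Share of the human reviewers' identifiers hit by at least one panel member."""
    if not human:
        raise ValueError("human set is empty")
    covered = set().union(*panel) & human
    return len(covered) / len(human)


# Hypothetical targets raised on one paper.
human_targets = {"fig2", "methods", "stats", "claims"}
ai_a = {"fig2", "stats"}
ai_b = {"stats", "methods"}
ai_c = {"stats", "fig2", "intro"}

print(jaccard(ai_a, ai_c))                                # 2 shared of 3 total
print(panel_coverage([ai_a, ai_b, ai_c], human_targets))  # 3 of 4 human targets
```

Running the same coverage function over finer-grained identifiers (specific criticisms instead of targets) would give the lower figure, since two reviewers can address the same target while making different points about it.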
Together: today's AI reviewers raise more significant issues per item and rival human reviewers on many papers, but their lower correctness and field-context gaps mean they are best used to augment, not replace, expert review. Read the full study in our paper.