Expert Annotation Study — CMU Paper Reviewer

Figure 1 · Motivation

Earlier work judged AI reviews with shallow proxies, such as overall score correlation and accept/reject agreement, which never reveal whether an AI's criticisms are actually useful or whether they overlap with what humans flagged. We instead had expert scientists judge every individual review item.

Table 5 · Reviews vs. humans

Table 5: win-rate of AI reviews against human references

Judged on the full review, GPT-5.2 matches or exceeds the top-rated human reviewer on 48.6% of papers and the lowest-rated human reviewer on 73.4%. Claude Opus 4.5 (32.1%) and Gemini 3 Pro (30.3%) match or exceed the top-rated human less often.
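As a rough illustration of how a "matches or exceeds" win-rate can be computed, the sketch below assumes each full review receives one numeric quality score per paper. That scoring scheme, the function name `win_rate`, and the toy numbers are all assumptions made for illustration; the study's actual judging protocol is described in the paper, not here.

```python
from typing import Sequence


def win_rate(ai_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Share of papers on which the AI review scores at least as high as the human one.

    Ties count as wins, mirroring the "matches or exceeds" wording.
    """
    if len(ai_scores) != len(human_scores):
        raise ValueError("expected one score pair per paper")
    if not ai_scores:
        raise ValueError("need at least one paper")
    wins = sum(a >= h for a, h in zip(ai_scores, human_scores))
    return wins / len(ai_scores)


# Made-up scores for four papers, not data from the study.
ai = [3.0, 2.0, 4.0, 1.0]
top_human = [3.0, 3.0, 2.0, 2.0]
print(f"{win_rate(ai, top_human):.1%}")
```

Counting ties as wins matters for the interpretation: a win-rate near 50% against the top-rated human would mean the AI review is at least as good on about half of the papers, not strictly better on half.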

Table 3 · Significance vs. correctness

Reviewer	Correctness %	Significance (0–2)	Evidence %	Items
Top-Rated Human	92.3	1.39	92.2	1,139
Lowest-Rated Human	79.1	1.30	89.7	833
GPT-5.2	86.2	1.61	97.1	442
Claude Opus 4.5	83.7	1.53	96.5	475
Gemini 3 Pro	81.9	1.56	89.5	460

Per review item, AI reviewers raise more significant issues than even the top-rated human (significance 1.53–1.61 vs. 1.39, p<.001) but are factually correct slightly less often (81.9–86.2% vs. 92.3%). Their evidence rates are on par with or above those of the human reviewers (89.5–97.1% vs. 89.7–92.2%). Point estimates are shown; 95% CIs and effect sizes are in the paper.
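The columns of Table 3 are per-reviewer aggregates over item-level expert judgments. A minimal sketch of that aggregation follows, assuming a hypothetical `ItemAnnotation` record with one boolean for correctness, one 0–2 significance rating, and one boolean for evidence. The schema, the names, and the choice to average significance over all items (rather than, say, only correct ones) are assumptions and not necessarily what the study did.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class ItemAnnotation:
    """One expert judgment of one review item (hypothetical schema)."""
    reviewer: str
    correct: bool
    significance: int   # 0, 1, or 2, matching the 0-2 scale in Table 3
    has_evidence: bool


def summarize(items: Iterable[ItemAnnotation]) -> Dict[str, Dict[str, float]]:
    """Aggregate item-level judgments into per-reviewer summary statistics."""
    grouped: Dict[str, List[ItemAnnotation]] = {}
    for item in items:
        grouped.setdefault(item.reviewer, []).append(item)

    summary: Dict[str, Dict[str, float]] = {}
    for reviewer, group in grouped.items():
        n = len(group)
        summary[reviewer] = {
            "correctness_pct": 100 * sum(i.correct for i in group) / n,
            "mean_significance": sum(i.significance for i in group) / n,
            "evidence_pct": 100 * sum(i.has_evidence for i in group) / n,
            "items": n,
        }
    return summary


# Made-up annotations, not data from the study.
toy = [
    ItemAnnotation("reviewer_a", True, 2, True),
    ItemAnnotation("reviewer_a", False, 1, True),
    ItemAnnotation("reviewer_b", True, 1, False),
    ItemAnnotation("reviewer_b", True, 1, True),
]
for name, stats in summarize(toy).items():
    print(name, stats)
```

In this toy data, `reviewer_a` raises more significant items but is correct less often than `reviewer_b`, which is the same kind of trade-off the table reports between AI and human reviewers.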

Figure 4 · Reviewer overlap

Figure 4: overlap between human and AI reviewers

AI reviewers overlap with one another far more than human reviewers do. A panel of three AI reviewers covers most of the human reviewers' targets but only about half of their specific criticisms, so it broadens coverage without fully reproducing human judgment.
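The difference between covering a target and reproducing a specific criticism can be sketched as two set-coverage computations over the pooled output of a panel. The example assumes review items have already been matched to shared labels; the label strings, the exact-match rule, and the name `panel_coverage` are hypothetical, and the study's own matching procedure is not shown here.

```python
from typing import Iterable, Set


def panel_coverage(human: Set[str], panel: Iterable[Set[str]]) -> float:
    """Fraction of human labels that at least one panel member also produced."""
    if not human:
        raise ValueError("need at least one human label")
    pooled: Set[str] = set().union(*panel)  # everything any panel member raised
    return len(human & pooled) / len(human)


# Made-up labels, not data from the study.
human_targets = {"baselines", "ablations", "dataset split", "lemma 2 proof"}
ai_targets = [
    {"baselines", "ablations"},
    {"ablations", "dataset split"},
    {"baselines"},
]

human_criticisms = {"c1", "c2", "c3", "c4"}
ai_criticisms = [{"c1"}, {"c1", "c2"}, {"c9"}]

print(f"targets covered:    {panel_coverage(human_targets, ai_targets):.0%}")
print(f"criticisms covered: {panel_coverage(human_criticisms, ai_criticisms):.0%}")
```

Because the AI reviewers in this toy panel repeat one another ("baselines" and "c1" each appear twice), adding members raises coverage more slowly than it would if their outputs were independent.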

Table 9 · Platform comparison

Table 9: comparison of AI reviewer platforms on PeerReview Bench

On PeerReview Bench (78 papers), CMU Paper Reviewer reaches higher precision than the other public platforms: 95.5% vs. 59.8% for Stanford Agentic Reviewer and 57.6% for OpenAIReview. It does so while producing fewer review items (4.7 vs. 11.1 and 18.6). When allowed up to 15 items, its best configuration reaches the highest F1 (58.6%).
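The trade-off between precision and the number of review items can be made concrete with the standard precision, recall, and F1 formulas. The sketch below uses the textbook definitions over counts of matched and unmatched items; how PeerReview Bench decides that an item counts as a match is not specified here, and the counts are invented for illustration.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard precision, recall, and F1 from true/false positive and false negative counts.

    tp: review items that match a reference issue
    fp: review items that match no reference issue
    fn: reference issues that no review item matches
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up counts for a paper with 10 reference issues, not data from the study.
short_review = precision_recall_f1(tp=4, fp=1, fn=6)   # 5 items, mostly on target
long_review = precision_recall_f1(tp=7, fp=5, fn=3)    # 12 items, more misses

for label, (p, r, f1) in [("short", short_review), ("long", long_review)]:
    print(f"{label}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

In this toy case the shorter review has higher precision but lower recall, and the longer one ends up with the higher F1. That is consistent with the general pattern that raising an item cap can improve F1 even when it costs some precision.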