Physician Disagreement in Healthcare AI Evals

When you ask two doctors to grade the same AI response, they disagree almost a quarter of the time. We wanted to know why.

Mar 2, 2026 · 6 min read

[Chart: Health Queries on ChatGPT (estimated weekly users, millions); series: Total WAU, Health users]

Sources: OpenAI "ChatGPT in Health" (Jan 2026). 1 in 4 weekly users ask health-related questions.

As of this writing, over 40 million people ask ChatGPT health questions every week 1, and 67% of the 1,000+ physicians surveyed said they use AI tools in their practice 2. Yet when you ask physicians to grade model responses, they disagree 22.5% of the time. Evaluating the trustworthiness of medical AI systems is becoming increasingly consequential.

Roy and I dug into this disagreement in OpenAI's HealthBench 3, where he previously served on the physician panel.

[Chart: Disagreement Rate by Theme (%); overall: 22.5%]

The Context Seeking theme leads the pack in disagreement.

Unlike other datasets, HealthBench has open-ended prompts and completions that closely mimic how people actually interact with AI for health-related questions.

We looked at ~60K physician grades across ~30K cases. Our full paper covers nine analysis phases; here's what we found.

Most are case-specific

We wanted to know where disagreement comes from. We started with a linear mixed model that splits label variance into three buckets: the physician's identity (anonymized IDs), the rubric criterion being applied, and everything else.
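A minimal sketch of this decomposition on simulated grades. The effect sizes and the method-of-moments shortcut below are mine for illustration (the actual analysis fits a mixed model on the real labels), but they reproduce the qualitative split:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_phys, n_crit, n_grades = 50, 40, 60_000

# Hypothetical effect sizes, chosen only to mimic the reported split
phys_eff = rng.normal(0, 0.15, n_phys)   # physician leniency (small)
crit_eff = rng.normal(0, 0.40, n_crit)   # criterion difficulty (moderate)

phys = rng.integers(0, n_phys, n_grades)
crit = rng.integers(0, n_crit, n_grades)
score = phys_eff[phys] + crit_eff[crit] + rng.normal(0, 0.9, n_grades)
df = pd.DataFrame({"phys": phys, "crit": crit, "score": score})

# Method-of-moments stand-in for the mixed-model ICC decomposition:
# the variance of group means approximates each random effect's variance
var_phys = df.groupby("phys")["score"].mean().var()
var_crit = df.groupby("crit")["score"].mean().var()
var_total = df["score"].var()
var_resid = var_total - var_phys - var_crit

shares = {k: v / var_total for k, v in
          [("physician", var_phys), ("criterion", var_crit), ("residual", var_resid)]}
print(shares)  # residual dominates, as in the real data
```

With these made-up effect sizes the shares land near 2% / 16% / 82%, matching the pattern reported below.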

[Chart: ICC Variance Decomposition; components: Physician, Rubric, Residual]

81.8% of variance is case-specific.

Physician identity explains only ~2.4%. The distribution below shows why: physicians cluster tightly around the mean pass rate. No individual systematically drove agreement or disagreement.

[Chart: Physician Leniency Distribution; mean: 0.766]

Physicians cluster tightly around the mean.

The rubric criterion matters more (~16% of variance), since some criteria are inherently harder to agree on. But the overwhelming share is case-specific.

This echoes a classic finding by Elstein et al.: how a physician performs on one case doesn't predict how they'll perform on the next 4. Norman et al. also found ~80% of error variance at the item level across 6,342 medical students 5. A strikingly similar pattern now appears in AI evals.

What gives?

So, 81.8% of variance is case-specific. But what's driving it? We tested every observable feature we could find:

  • Medical specialty: ANOVA picks up heterogeneity across 26 specialties but nothing survives pairwise correction. Many small differences, no outlier specialties
  • HealthBench metadata (theme, category, axis): reshuffles rubric variance but leaves the residual untouched
  • Normative rubric language: detectable but practically small
  • Surface features (e.g. word count, completion length, qualifier count): above chance (AUC = 0.58) but again too small to be useful
  • Embeddings: even less predictive than surface features. Agree and disagree centroids sit at ~0.9998 cosine similarity i.e. geometrically indistinguishable
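The centroid result is easy to reproduce in spirit. In the toy sketch below (simulated embeddings, not HealthBench data), embeddings share a large common component and disagreement labels are assigned case by case, independent of geometry. The per-class centroids then come out nearly identical, just as we observed:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5_000, 768

# Hypothetical embeddings: a large shared component plus case-specific noise;
# disagreement is assigned independently of the geometry
shared = rng.normal(size=d)
emb = shared + 0.3 * rng.normal(size=(n, d))
disagree = rng.integers(0, 2, n).astype(bool)

c_agree = emb[~disagree].mean(axis=0)
c_disagree = emb[disagree].mean(axis=0)
cos = c_agree @ c_disagree / (np.linalg.norm(c_agree) * np.linalg.norm(c_disagree))
print(f"{cos:.4f}")  # ~1.0: the centroids are geometrically indistinguishable
```

The point: a near-1 centroid similarity is exactly what you expect when disagreement is not concentrated in any region of embedding space.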
[Chart: Variance Partition with HealthBench Labels; components: Physician, Rubric, Residual]

Adding HealthBench metadata repartitions rubric variance but leaves the case-level residual untouched.

The disagreement isn't in any single component. It lives in the interaction between a specific completion and a specific rubric criterion. Even adding a physician × criterion interaction term to the mixed model doesn't help: it splits the criterion component in half but leaves the 81.8% residual untouched.

Agreement on the extremes

[Chart: Boundary Effect: Quality vs. Disagreement; quadratic fit vs. observed]

Disagreement peaks at borderline quality and drops for both clearly good and clearly bad outputs.

One pattern is clear. Disagreement follows an inverted-U with completion quality: physicians agree on clearly good and clearly bad outputs (shocker I know), but split on the borderline. A leave-one-out analysis confirms this is a genuine effect, not just a mechanical artifact. Intuitive, yes, but now properly quantified.
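There's a simple mechanical baseline behind the inverted-U, which is exactly why the leave-one-out check is needed. In a toy model where each physician independently passes a completion with probability q (its latent quality), two graders split with probability 2q(1−q), which peaks at q = 0.5:

```python
import numpy as np

q = np.linspace(0, 1, 101)   # latent pass probability (completion quality)
disagree = 2 * q * (1 - q)   # chance two independent graders split

peak = q[disagree.argmax()]
print(peak, disagree.max())  # 0.5 0.5
```

Any independent-grader model produces this shape; the leave-one-out analysis is what shows the observed effect isn't purely this artifact.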

The (ir)reducible dissociation

HealthBench's consensus dataset tags prompts with physician-validated uncertainty categories like reducible (missing context, ambiguous phrasing), irreducible (genuine medical ambiguity), or none. We checked whether these categories explain any of the disagreement.

[Chart: Disagreement by Uncertainty Category; none: n=2,730; irreducible: n=2,376 (OR=1.01); reducible: n=3,420 (OR=2.55)]

Reducible uncertainty more than doubles disagreement (OR = 2.55). Irreducible uncertainty has no effect (OR = 1.01, p = 0.90).

Reducible uncertainty more than doubles disagreement odds. Irreducible uncertainty (i.e. genuine medical ambiguity as judged by physicians) has no effect. Physicians don't disagree more on inherently ambiguous medical questions; they disagree when information is missing or the scenario is underspecified.
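For reference, an odds ratio here comes from a 2×2 table of uncertainty category against disagreement. The counts below are invented purely to show the arithmetic (the real n's are in the figure):

```python
def odds_ratio(dis_a, agr_a, dis_b, agr_b):
    """OR of disagreement in group A relative to group B,
    from disagree/agree counts in each group."""
    return (dis_a / agr_a) / (dis_b / agr_b)

# Hypothetical counts for illustration only:
# group A disagrees 400/1400 times, group B 200/1475 times
print(round(odds_ratio(400, 1000, 200, 1275), 2))  # 2.55
```

An OR of 1.01 (irreducible) means the disagreement odds are essentially unchanged; 2.55 (reducible) means they more than double.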

The disagreement is about information gaps, not medical complexity. Together with the quality boundary effect, these are the two clearest signals in the data. And though each explains ~3% of variance, they both point to concrete levers for improving evaluation.

What this means for benchmarks

HealthBench reports macro F1 = 0.709 for GPT-4.1 as a grader, which means physicians agree with the model about as much as they agree with each other. It's what you get when physicians themselves agree only 77.5% of the time.

LLM-as-judge systems inherit the same ceiling. When physician labels are collapsed to a single "correct" answer, case-level uncertainty gets treated as error 6. "Model got it wrong" becomes indistinguishable from "model agreed with the minority physician". That distinction matters: a majority vote shouldn't erase the case-level split, especially when there's genuine reducible uncertainty. Preserving the full label distribution 7 would let benchmarks keep it.
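One way to preserve the distribution rather than collapse it: keep the per-case mean physician label as a soft target and score verdicts against it. A minimal pandas sketch with made-up labels (the column names and the penalty metric are illustrative, not HealthBench's):

```python
import pandas as pd

# Hypothetical per-case physician labels (1 = criterion met)
grades = pd.DataFrame({
    "case":  ["a", "a", "a", "b", "b", "b"],
    "label": [1, 1, 0, 1, 0, 0],
})

# Majority vote collapses a 2-1 split to a hard label...
majority = grades.groupby("case")["label"].agg(lambda s: int(s.mean() > 0.5))

# ...while the soft label keeps the disagreement visible
soft = grades.groupby("case")["label"].mean()

# Score a grader's verdict by distance to the soft label, not the majority
model_verdict = {"a": 0, "b": 0}
penalty = {c: abs(model_verdict[c] - soft[c]) for c in soft.index}
print(majority.to_dict(), soft.to_dict(), penalty)
```

Under hard labels the model simply "got case a wrong"; under soft labels it only pays a partial penalty for siding with the minority physician, and contested cases stop looking identical to clear errors.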

Where do we go from here?

The bigger question is how much of the 81.8% is even explainable? We can't distinguish pattern noise (case-specific but systematic) from occasion noise (the same physician grading the same case differently on a different day) 8.

Jackson et al. found only 53% intra-observer agreement for atypia cases in pathology 9, suggesting occasion noise could account for 20-40% of the residual. The natural next step would be physician self-consistency testing: presenting the same case to the same physician twice. If a large chunk of the residual is stochastic judgment, no feature engineering will ever explain it.

But the two levers we identified are actionable: (1) close information gaps in evaluation scenarios with better prompts that give physicians sufficient context to agree; and (2) report benchmark results separately for consensus cases vs. contested borderline cases. The effect sizes are modest but they compound, and unlike the structural ceiling, they point to things we can actually change in how we evaluate AI.

Footnotes

  1. OpenAI, ChatGPT in Health, 2026.

  2. OffCall, AI Adoption Among Physicians: Survey of 1,000 Physicians, 2025.

  3. Arora et al., HealthBench: Evaluating Large Language Models Towards Improved Human Health, 2025.

  4. Elstein et al., Medical Problem Solving: An Analysis of Clinical Reasoning, Harvard University Press, 1978.

  5. Norman et al., "How Specific Is Case Specificity?," Medical Education, 2006.

  6. Sylolypavan et al., "The impact of inconsistent human annotations on AI driven clinical decision making," npj Digital Medicine, 2023.

  7. Plank, "The 'Problem' of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation," EMNLP, 2022.

  8. Kahneman et al., Noise: A Flaw in Human Judgment, Little, Brown Spark, 2021.

  9. Jackson et al., "Diagnostic Reproducibility: What Happens When the Same Pathologist Interprets the Same Breast Biopsy Specimen at Two Points in Time?," Annals of Surgical Oncology, 2017.