The Unknowable "Good"
Why It's Impossible to Measure Physician Quality
There’s an old joke among neurosurgeons: if you present a case to five surgeons, you’ll get seven different opinions.
Consider a patient with neck pain and a herniated disc. The clinical nuance is immense. There are nonoperative options: physical therapy, medications, steroid injections. There are surgical options ranging from limited decompressions to artificial disc replacement to multilevel fusion with rods and screws. Each choice involves tradeoffs that depend on anatomy, symptoms, risk tolerance, lifestyle, and values.
Then there is execution. Was the surgery performed efficiently and precisely, or slowly and sloppily?
A patient may undergo a fusion performed flawlessly from a technical standpoint and still endure a prolonged recovery or postoperative infection. Another patient may choose nonoperative care and continue to suffer, or experience spontaneous improvement. Outcomes vary even when decisions are reasonable.
Which raises the question patients and policymakers keep asking:
Who was the better doctor?
For decades, an entire quality-metric industrial complex has tried to answer this question. It relies on what is easiest to measure: readmission rates, length of stay, complication codes. These metrics are “risk-adjusted” using administrative data, but only by adjusting for variables that are themselves easily measured.
We are obsessed with finding a “ground truth” for physician quality.
But how would one actually recognize a good doctor?
Ask patients, and the answer is inherently subjective. Different patients want different things at different times. Most want evidence-based recommendations, but many do not. (As evidence: the healing crystal industry in the U.S. is worth over $1 billion annually.)
Ask bureaucrats, and “good” is defined by costs, compliance, and checkboxes.
Our current system relies on absolute judgment. We take a single physician and grade them against a fixed rubric devised by private contractors working for governments and insurers. Unsurprisingly, this results in measuring what is easy to measure rather than what matters. Clinical judgment, technical finesse, and empathy remain largely invisible.
It also turns out humans are terrible at absolute judgment.
This is well illustrated in an EconTalk discussion featuring Daisy Christodoulou. She uses soccer as an analogy. Was a particular tackle “too rough” and deserving of a yellow card? Even with slow-motion replay, experts frequently disagree when evaluating a single event in isolation.
A better approach is comparative judgment. Instead of asking whether a tackle crosses some abstract threshold, ask a simpler question: Which of these two tackles was rougher? Humans are remarkably good at that, and the same principle can apply to medicine.
Rather than grading a surgeon’s care of cervical spine disease as an “A–” or “B+,” we can compare two cases and ask: Who managed this case better? Repeat this process thousands of times, and a statistical pattern emerges. This approach respects the inherent nuance of medicine. It allows for stylistic differences while still punishing objectively bad care, which reliably appears as an outlier that loses nearly every comparison.
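To make the intuition concrete, here is a toy simulation in Python. The physician names and latent "quality" scores are invented for illustration only; this is a sketch of the aggregation idea, not a proposed scoring tool. Even though each individual judgment is noisy, thousands of pairwise comparisons expose the outlier who loses nearly every matchup.

```python
# Toy sketch: noisy pairwise judgments between hypothetical physicians.
# Quality scores are made up; the point is that aggregation reveals the outlier.
import random

random.seed(0)

# Hypothetical latent quality (assumed for illustration, not measured from anything).
quality = {"Dr A": 1.0, "Dr B": 0.9, "Dr C": 0.8, "Dr D": 0.2}  # Dr D is the outlier

wins = {name: 0 for name in quality}
comparisons = {name: 0 for name in quality}

for _ in range(5000):
    a, b = random.sample(list(quality), 2)
    # Each single judgment is noisy: the judge prefers the higher-quality case
    # only probabilistically, mirroring expert disagreement on any one case.
    p_a_wins = quality[a] / (quality[a] + quality[b])
    winner = a if random.random() < p_a_wins else b
    wins[winner] += 1
    comparisons[a] += 1
    comparisons[b] += 1

# Despite noisy individual judgments, win rates separate the outlier cleanly.
for name in sorted(quality, key=lambda n: -wins[n] / comparisons[n]):
    print(f"{name}: win rate {wins[name] / comparisons[name]:.2f}")
```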
The deeper problem is not that we have chosen the wrong metrics. It is that absolute physician quality does not exist as a stable quantity to be measured.
Absolute judgment requires a fixed standard: the same patient preferences, the same constraints, the same tradeoffs, and the same definition of success. Medicine offers none of these. Outcomes depend on anatomy, comorbidities, risk tolerance, timing, social support, and values, many of which are unobservable and irreducible to data.
Without a fixed reference frame, there is no such thing as an “A-level doctor” in the abstract. There are only doctors performing better or worse relative to other doctors facing similar problems. Any attempt to assign an absolute score is therefore not merely imprecise; it is conceptually incoherent.
This is not merely intuitive. It rests on nearly a century of statistical theory.
In 1927, psychologist Louis Thurstone proposed the Law of Comparative Judgment. He observed that human evaluation of a single object is noisy and unstable, influenced by mood, context, and bias. But when forced to choose between two objects, those errors tend to cancel out, producing far more reliable judgments.
These binary comparisons can then be analyzed using the Bradley–Terry model, which takes win–loss data and infers a latent “strength” or “quality” parameter for each participant. This is the same mathematics underlying Elo ratings in chess and modern matchmaking systems. We do not need a checklist to know Magnus Carlsen is good at chess. We know because he consistently beats nearly everyone he faces. The math simply formalizes that reality.
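The Bradley–Terry model assumes the probability that participant i beats participant j is p_i / (p_i + p_j), where p_i is i's latent strength. Below is a minimal sketch of the classic iterative maximum-likelihood fit, written in Python with made-up win counts and hypothetical names; it illustrates the technique, nothing more.

```python
# Minimal Bradley-Terry fit from a matrix of pairwise win counts.
# wins[i, j] = number of times participant i was judged better than j.
import numpy as np

wins = np.array([
    [0, 7, 8, 9],
    [3, 0, 6, 8],
    [2, 4, 0, 7],
    [1, 2, 3, 0],
], dtype=float)
names = ["Dr A", "Dr B", "Dr C", "Dr D"]  # hypothetical participants

n = wins + wins.T              # total comparisons between each pair
total_wins = wins.sum(axis=1)  # total wins for each participant
strength = np.ones(len(names))

for _ in range(200):
    # Classic Zermelo / minorization-maximization update for the MLE:
    # p_i <- W_i / sum_j [ n_ij / (p_i + p_j) ]
    denom = n / (strength[:, None] + strength[None, :])
    strength = total_wins / denom.sum(axis=1)
    strength /= strength.sum()  # fix the scale; only ratios are identified

for name, s in sorted(zip(names, strength), key=lambda t: -t[1]):
    print(f"{name}: strength {s:.3f}")
```

Note that only the ratios of strengths are identified, never an absolute level, which is exactly the point: the model tells you who consistently prevails over whom, not where anyone stands against a fixed external standard.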
Applied to medicine, this framework would not require us to define “quality” in the abstract. Quality would be the latent variable that best explains why one physician’s decisions and outcomes are consistently preferred over another’s by their peers.
This approach is not easy. Comparative judgment does not scale cleanly, and it cannot be administered by bureaucrats. That is precisely why the establishment will resist it. Perhaps one day AI will assist in this process, but that day has not yet arrived.
Good medicine is not a checklist. Quality is not a variable sitting in the electronic health record waiting to be mined. It is a consensus. We cannot measure doctors against a divine standard. We can only measure them against each other.
And that may be the closest thing to ground truth we will ever get.

