Technical Note: Spectrum Bias and Test Performance Metrics
A follow-up to "Beyond PPV: Why Likelihood Ratios Matter for AI-Driven Veterinary Diagnostics"
Several astute readers reached out after my recent post about diagnostic metrics to raise an important point I had simplified for clarity: the stability of sensitivity and specificity across populations. I appreciate these thoughtful responses—they highlight exactly the kind of nuanced thinking we need when evaluating diagnostic tools. Let me address this more completely.
The Pedagogical Simplification
In my original post, I stated that sensitivity and specificity are "inherent properties of the test" that remain constant across populations. This was indeed a simplification made for pedagogical purposes. The complete picture is more nuanced.
Spectrum Bias: When Test Performance Varies
Sensitivity and specificity can vary based on what epidemiologists call "spectrum bias" or "spectrum effects." This well-documented phenomenon occurs when test performance shifts across disease severities, clinical presentations, and patient populations, and it can even vary with prevalence itself through mechanisms such as threshold adjustments and verification bias.
Veterinary Examples
An AI tool for detecting cardiomegaly on radiographs might show 95% sensitivity in dogs with severe DCM and marked enlargement, but only 75% sensitivity in dogs with early DCM and mild changes. Specificity can vary too, depending on whether the comparison group includes dogs with non-cardiac causes of apparent enlargement like obesity, respiratory disease, or positioning artifacts.
Similarly, a test for canine hypothyroidism might perform differently in young dogs with congenital disease (classic presentation) versus geriatric dogs with concurrent illness (confounding factors), or across different breeds with varying baseline thyroid hormone levels.
Why My Original Argument Remains Sound
Despite this simplification, the core message about PPV's unreliability holds even more strongly once we account for spectrum bias.
The Scale of Variation Differs Dramatically
Spectrum effects typically shift sensitivity and specificity by 10-30%, whereas prevalence changes alone can swing PPV by 700-800%. That difference in magnitude is what matters for clinical decision-making.
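To see the scale difference in numbers, here is a minimal Python sketch; the 92% sensitivity, 90% specificity, and the three prevalence settings are illustrative assumptions, not figures from any particular AI tool:

```python
def ppv(sens, spec, prev):
    """Positive predictive value from sensitivity, specificity, and prevalence."""
    true_pos = sens * prev                # true positives per unit of population
    false_pos = (1 - spec) * (1 - prev)   # false positives per unit of population
    return true_pos / (true_pos + false_pos)

sens, spec = 0.92, 0.90  # assumed test characteristics, held fixed throughout

# Referral caseload, general practice, and screening-style prevalences
for prev in (0.50, 0.05, 0.01):
    print(f"prevalence {prev:.0%}: PPV = {ppv(sens, spec, prev):.1%}")

# prevalence 50%: PPV = 90.2%
# prevalence 5%: PPV = 32.6%
# prevalence 1%: PPV = 8.5%
```

The test characteristics never move in this sketch, yet PPV collapses roughly tenfold purely because prevalence changes, a far larger swing than the typical spectrum-driven shift in sensitivity or specificity.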
Predictability and Clinical Intuition
Spectrum effects follow understandable patterns—tests work better on obvious disease—and experienced clinicians intuitively adjust for these patterns. In contrast, PPV variations with prevalence often run counter to clinical intuition. A test that seems "highly predictive" in one setting can be nearly useless in another, which catches many practitioners off guard.
Likelihood Ratios: More Stable, Not Perfect
While likelihood ratios can also vary with spectrum effects, they show much smaller variations than PPV and change in predictable ways that align with clinical thinking. They remain more useful for clinical decision-making across settings and can be stratified by disease severity when needed, as I demonstrated with AI scoring ranges in the original post.
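As a quick illustration, here is a minimal sketch of the standard LR formulas, reusing the 95% and 75% severity-stratified sensitivities from the cardiomegaly example above and assuming a 90% specificity for both strata (that specificity is my assumption, not a figure from any real tool):

```python
def lr_positive(sens, spec):
    """Positive likelihood ratio: how much a positive result raises the disease odds."""
    return sens / (1 - spec)

def lr_negative(sens, spec):
    """Negative likelihood ratio: how much a negative result lowers the disease odds."""
    return (1 - sens) / spec

spec = 0.90  # assumed specificity, held constant across strata for illustration

# Severity-stratified sensitivities from the cardiomegaly example above
for stratum, sens in (("severe DCM", 0.95), ("early DCM", 0.75)):
    print(f"{stratum}: LR+ = {lr_positive(sens, spec):.1f}, LR- = {lr_negative(sens, spec):.2f}")

# severe DCM: LR+ = 9.5, LR- = 0.06
# early DCM: LR+ = 7.5, LR- = 0.28
```

Note that the positive LR shifts only modestly between strata, while the negative LR degrades more noticeably for mild disease, which is exactly why severity-stratified LRs are more informative than a single pooled figure.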
The Framework Still Works
Even accounting for spectrum bias, the diagnostic approach I outlined—estimating pre-test probability based on your specific patient, applying likelihood ratios (adjusted for severity if available), and calculating post-test probability—remains the most robust framework for diagnostic decision-making.
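Here is a concrete sketch of those three steps, with placeholder numbers (a 20% pre-test estimate and an LR+ of 8) standing in for whatever your patient and your tool's validation data actually give you:

```python
def post_test_probability(pre_test_prob, likelihood_ratio):
    """Bayes' theorem in odds form: pre-test probability -> post-test probability."""
    pre_odds = pre_test_prob / (1 - pre_test_prob)  # step 1: probability to odds
    post_odds = pre_odds * likelihood_ratio         # step 2: apply the likelihood ratio
    return post_odds / (1 + post_odds)              # step 3: odds back to probability

# Placeholder example: 20% pre-test probability, positive result with LR+ of 8
print(f"{post_test_probability(0.20, 8):.1%}")  # -> 66.7%
```

The same function handles a negative result: pass the negative LR instead, and the post-test probability falls rather than rises.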
Implications for AI Validation in Veterinary Medicine
This nuance actually reinforces several key points about AI validation.
What We Need from AI Companies
Transparent reporting of validation populations is essential, including disease severity distribution, clinical settings (emergency, specialty, general practice), and the species, breeds, and age ranges included. We need stratified performance metrics showing how the tool performs across different stages of disease, various clinical presentations, and multiple practice types. Companies should provide confidence intervals, not just point estimates, and implement real-world performance monitoring to detect spectrum effects in practice.
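To make the stratified-metrics-with-confidence-intervals request concrete, here is a rough sketch of the kind of per-stratum reporting a validation study could support; the counts and strata are hypothetical, and the Wilson interval is just one reasonable choice for proportions:

```python
from math import sqrt

def wilson_ci(successes, n, z=1.96):
    """Wilson score 95% confidence interval for a proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    margin = (z / denom) * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return centre - margin, centre + margin

# Hypothetical validation counts: (true positives, diseased cases) per stratum
strata = {"severe disease": (190, 200), "mild disease": (150, 200)}

for name, (tp, n) in strata.items():
    lo, hi = wilson_ci(tp, n)
    print(f"{name}: sensitivity {tp / n:.0%} (95% CI {lo:.0%}-{hi:.0%}, n={n})")

# severe disease: sensitivity 95% (95% CI 91%-97%, n=200)
# mild disease: sensitivity 75% (95% CI 69%-80%, n=200)
```

A single pooled sensitivity would hide exactly the gap this kind of table makes visible.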
Red Flags in AI Validation
Be wary of studies using only textbook cases or severe disease, validation in single institutions or practice types, mixing screening and diagnostic populations without stratification, or reporting single sensitivity/specificity values without context. These are signs of inadequate validation that doesn't account for real-world performance variation.
The Clinical Bottom Line
Understanding spectrum bias strengthens rather than weakens the case for careful diagnostic test interpretation:
No diagnostic test has truly fixed performance—all vary based on population and disease characteristics
PPV remains the least stable metric, varying both with prevalence AND spectrum effects
Likelihood ratios provide more stable guidance, though they too can vary
Clinical context matters more than any single metric
Validation transparency is essential for understanding when and how to use AI tools
A More Sophisticated Framework
Rather than viewing test characteristics as either "fixed" or "variable," we should think in terms of relative stability. Some metrics, such as likelihood ratios, are more stable than others, such as PPV. Performance changes follow predictable patterns that we can understand and account for. The best metrics align with clinical reasoning, and all metrics require consideration of the specific clinical scenario.
Closing Thoughts
I simplified the concept of "fixed" sensitivity and specificity to make a pedagogical point about PPV's extreme variability. But as this discussion shows, the reality provides even more reason to demand comprehensive validation data from AI companies, reject simplistic accuracy claims, embrace likelihood ratios as more stable (though not perfect) metrics, and always integrate test results with clinical judgment.
Thank you to the readers who prompted this clarification. In the complex world of diagnostic testing, these nuanced discussions help us all develop more sophisticated frameworks for evaluating and using emerging technologies. The goal isn't perfect metrics—it's better clinical decisions.