Why Veterinary Data Is Fundamentally Messy
What a 40-Year-Old Book Teaches Us About AI and Clinical Information
A framework from the dawn of medical informatics explains why some veterinary AI works brilliantly while the rest struggles—and why your data challenges aren’t going away.
In 1984, a UCSF dermatologist and medical informaticist named Marsden Blois published a book that should be required reading for anyone trying to understand why AI in medicine is so hard. “Information and Medicine: The Nature of Medical Descriptions” was written before the current AI revolution, before machine learning as we know it, before anyone had heard of large language models. Yet its core insights explain with remarkable precision why veterinary AI succeeds in some domains and fails spectacularly in others—and why the data integration problems I’ve been writing about aren’t merely technical challenges but fundamental properties of how clinical information works.
After 29 years in veterinary diagnostics, I’ve come to believe that Blois’s framework provides the missing theoretical foundation for understanding the challenges we face. It explains why DICOM-based imaging integration works seamlessly while practice management systems can’t agree on what to call a diagnosis. It explains why point-of-care coding has failed everywhere it’s been tried. And it suggests that modern large language models might represent a genuine breakthrough—not because they’re more powerful, but because they operate at a different level of clinical description than anything we’ve built before.
The Hierarchy of Medical Description
Blois’s central insight was deceptively simple: medical descriptions exist along a continuum from highly abstract and general to highly specific and concrete. He visualized this as a funnel or inverted pyramid, with vague descriptions at the wide top and precise measurements at the narrow bottom.
At the wide end, you have descriptions like “this patient isn’t doing well” or “something’s wrong with this cat.” At the narrow end, you have serum creatinine of 2.3 mg/dL, or a genetic mutation at a specific chromosomal location, or a radiographic finding with precise measurements.
The diagnostic process, Blois argued, is essentially a journey through this hierarchy—starting with vague, undifferentiated presentations at the wide end and progressively narrowing toward specific characterizations at the bottom.
Consider a typical veterinary case: A Golden Retriever presents with “lethargy and not eating.” That’s the wide end—a description that encompasses hundreds of possible conditions. Through history-taking, physical examination, and diagnostic testing, we progressively narrow: perhaps “geriatric large-breed dog with weight loss, polyuria, and mild azotemia” becomes “chronic kidney disease” becomes “CKD Stage 2 based on IRIS classification with specific creatinine and SDMA values.”
Each step down the funnel represents a more specific, more precise description of the patient’s condition. And critically for understanding AI, each level of the funnel requires fundamentally different kinds of reasoning.
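To make the funnel concrete, here's a minimal sketch in Python. The descriptions follow the Golden Retriever example above, but the differential counts attached to each level are invented for illustration, not clinical data.

```python
# Blois's funnel as a simple data structure: each diagnostic step pairs a
# description with a (hypothetical) count of conditions it could still cover.
# Descriptions follow the case above; the counts are illustrative only.

FUNNEL = [
    ("lethargy and not eating", 400),                                   # wide end
    ("geriatric large-breed dog, weight loss, polyuria, mild azotemia", 40),
    ("chronic kidney disease", 5),
    ("CKD Stage 2 (IRIS) with specific creatinine and SDMA values", 1), # narrow end
]

def is_narrowing(funnel):
    """True if every step strictly shrinks the differential list."""
    counts = [n for _, n in funnel]
    return all(a > b for a, b in zip(counts, counts[1:]))

print(is_narrowing(FUNNEL))  # → True
```

The point of the sketch is only that the data type changes as you descend: the wide end is free text that resists formalization, while the narrow end collapses to values a program can compare directly.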
Why This Matters for AI: The Two Zones
Blois was writing during the heyday of expert systems—rule-based AI programs like MYCIN (more on the lessons from MYCIN in my next post) that attempted to encode medical knowledge as “if-then” rules. He observed something that remains true today: these systems worked reasonably well at the narrow end of the funnel but struggled at the wide end.
The reason is straightforward. At the narrow end, you’re dealing with specific, well-defined data: laboratory values with numeric ranges, imaging measurements with standard protocols, genetic variants with definitive presence or absence. This kind of information is highly amenable to algorithmic processing. You can write rules for it. You can train classifiers on it. The boundaries are clear, the data is structured, and the relationships can be specified precisely.
At the wide end, everything is different. “Not eating” could mean the patient refused breakfast once or hasn’t eaten in three days. “Lethargy” is inherently subjective—does the owner mean the dog is sleeping more, or collapsed and unresponsive? The same clinical presentation might represent a minor upset or a life-threatening emergency, and distinguishing between them requires judgment that’s extraordinarily difficult to formalize.
Blois was skeptical that AI (and yes, as you’ll remember from my previous post, rule-based programs fall under the AI umbrella) would ever handle the wide end of the funnel effectively. He argued that the vagueness and ambiguity at this level weren’t bugs to be fixed but essential features of clinical reasoning. When a patient presents with undifferentiated symptoms, the clinician must work with incomplete, uncertain, and often conflicting information. This requires a kind of gestalt pattern recognition that seemed fundamentally different from what computers could achieve.
For decades, he was right.
The Veterinary Data Problem Through Blois’s Lens
Reading Blois’s framework, the veterinary data challenges I’ve been writing about suddenly snap into focus. They’re not random technical problems—they’re predictable consequences of how clinical information actually works.
Why DICOM Works
In my article on veterinary software interoperability, I noted that DICOM-based imaging integration is the one area where veterinary medicine has achieved genuine plug-and-play interoperability. Walk into almost any veterinary practice with digital radiography, and the X-ray machine from Vendor A talks seamlessly to the PACS system from Vendor B, which displays images perfectly in the practice management system from Vendor C.
Blois’s framework explains why. Imaging data lives at the narrow end of the funnel. Images have standardized formats. Acquisition parameters can be precisely specified. Anatomical regions have agreed-upon terminology. Even when interpretation is involved, it’s applied to well-defined visual data with consistent presentation.
DICOM succeeds because it operates entirely within the zone where algorithmic processing works—the domain of specific, structured, well-defined information.
Why Semantic Interoperability Fails
In that same article, I identified the semantic layer—agreeing on what things mean—as veterinary medicine’s greatest unsolved interoperability challenge. Every practice uses different terminology for the same conditions: “DM” versus “diabetes mellitus” versus “endocrine disorder.” This chaos makes multi-practice data analysis nearly impossible.
Blois’s framework reveals why this problem is so intractable: practices aren’t just using different words for the same thing. They’re recording information at different levels of the hierarchy.
When one veterinarian records “DM” and another records “endocrine disorder,” these aren’t synonyms that can be mapped to each other through simple translation. “Endocrine disorder” sits higher on the funnel—it’s more abstract, less specific, encompassing a broader range of conditions. “DM” is further down, more precise but still not as specific as “Type 1 diabetes mellitus with secondary ketoacidosis.”
The semantic chaos in veterinary data reflects the fundamental variability in where along the diagnostic funnel clinicians choose to record their observations. Some veterinarians record highly specific diagnoses when they’re confident. Others prefer broader categories that acknowledge uncertainty. Still others record the level of specificity that’s relevant for the clinical decision at hand.
This isn’t sloppy data entry—it’s the natural expression of clinical reasoning at different levels of certainty and different stages of the diagnostic process.
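A tiny sketch makes the argument concrete: a flat synonym table can resolve lexical variants like “DM,” but it cannot relate terms recorded at different levels of the funnel; that requires walking a concept hierarchy. The three-node tree below is invented for illustration, not a real SNOMED or ICD fragment.

```python
# Why flat synonym mapping fails across hierarchy levels: "endocrine
# disorder" and "DM" are not synonyms, they are broader and narrower
# descriptions. The tiny concept tree here is illustrative only.

PARENT = {
    "type 1 diabetes mellitus with ketoacidosis": "diabetes mellitus",
    "diabetes mellitus": "endocrine disorder",
    "endocrine disorder": None,  # top of this toy hierarchy
}

SYNONYMS = {"DM": "diabetes mellitus"}  # flat lexical mapping only

def normalize(term):
    """Resolve lexical variants; this is all a synonym table can do."""
    return SYNONYMS.get(term, term)

def relation(a, b):
    """Classify two recorded terms: equivalent, broader-than, or neither."""
    a, b = normalize(a), normalize(b)
    if a == b:
        return "equivalent"
    node = b
    while node is not None:          # walk up from b toward the root
        node = PARENT.get(node)
        if node == a:
            return "broader-than"    # a is the more abstract description
    return "not comparable by synonymy"

print(relation("DM", "DM"))                  # → equivalent
print(relation("endocrine disorder", "DM"))  # → broader-than
```

Only the hierarchy walk recovers that “endocrine disorder” sits above “DM” on the funnel; the synonym table alone would simply report no match.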
Why Point-of-Care Coding Fails
I’ve argued previously that forcing veterinarians to code diagnoses at the point of care is doomed to fail, and that the solution has to happen on the backend through intelligent translation systems. Blois’s framework provides the theoretical explanation for why this is true.
Clinical reasoning naturally starts at the wide end of the funnel and progressively refines. When a veterinarian first sees that lethargic Golden Retriever, they’re genuinely uncertain about the diagnosis. Forcing them to select a specific ICD or SNOMED code at that moment doesn’t just slow down workflow—it demands artificial precision that doesn’t match their actual clinical state.
Even worse, forcing early coding constrains clinical expressiveness. Consider a complex case: a 12-year-old Golden Retriever with lethargy, mild azotemia, and a heart murmur that wasn’t present six months ago. The veterinarian suspects early kidney disease but can’t rule out cardiac involvement, and the breed predisposition for both conditions makes the diagnostic picture unclear.
Standard coding systems force this nuanced clinical picture into rigid categories. Is this “chronic kidney disease” or “heart murmur” or “lethargy”? The coding system demands a choice, but the clinical reality is uncertainty and interconnected possibilities. The veterinarian ends up either oversimplifying the case to fit the codes or spending excessive time trying to find codes that capture the full clinical complexity.
This loss of expressiveness isn’t just inconvenient—it’s clinically dangerous. Rich clinical narratives that capture complexity and uncertainty get reduced to simplistic code combinations that miss the subtleties crucial for patient care.
The solution, as I’ve argued, is to preserve the full richness of clinical expression at whatever level the veterinarian naturally describes, then apply intelligent backend systems to map those descriptions to standardized codes for data sharing and analysis. Blois’s framework shows why this approach aligns with the fundamental nature of clinical information.
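A minimal sketch of that backend shape: the narrative is stored untouched, and candidate codes are attached alongside it rather than replacing it. The keyword table stands in for the LLM or NLP component, and the codes are invented placeholders, not real SNOMED or ICD identifiers.

```python
# Backend translation sketch: preserve the clinical narrative exactly as
# written, and attach candidate codes next to it. The keyword lookup is a
# stand-in for an LLM/NLP extractor; the codes are invented placeholders.

KEYWORD_TO_CODE = {
    "azotemia": "LAB-AZO",
    "heart murmur": "CARD-MUR",
    "lethargy": "SIGN-LETH",
}

def encode_backend(note: str) -> dict:
    """Return the untouched narrative plus any candidate codes found in it."""
    lowered = note.lower()
    codes = sorted(code for kw, code in KEYWORD_TO_CODE.items() if kw in lowered)
    return {"narrative": note, "candidate_codes": codes}

record = encode_backend(
    "12-year-old Golden Retriever, lethargy, mild azotemia, new heart murmur"
)
print(record["candidate_codes"])  # → ['CARD-MUR', 'LAB-AZO', 'SIGN-LETH']
```

The design choice worth noticing is that coding becomes additive: the rich narrative survives for clinicians, while the structured layer exists for data sharing and analysis, and the two never compete at the point of care.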
The LLM Revolution: Why This Time Might Be Different
Blois was skeptical that AI would ever handle the wide end of the diagnostic funnel. The rule-based expert systems of his era certainly couldn’t. They required precise inputs, explicit knowledge encoding, and clear decision boundaries—all characteristics of the narrow end.
But large language models represent something genuinely new: AI systems that operate natively in natural language at varying levels of specificity.
When you describe a case to ChatGPT or Claude—“I have a 12-year-old Golden Retriever with lethargy and decreased appetite, mild azotemia on bloodwork, and a new heart murmur”—the model doesn’t demand that you first encode this into structured categories. It can work with the same natural language that veterinarians use to reason about cases, at whatever level of specificity you provide.
This doesn’t mean LLMs have solved the problems Blois identified. As I discussed in my article on hallucinations, LLMs can generate confident-sounding but incorrect information, and they can’t reliably access real-time databases or verify facts against authoritative sources. The wide end of the funnel remains challenging precisely because the information is inherently uncertain and ambiguous.
But LLMs do offer something that previous AI approaches couldn’t: the ability to work with clinical information across the full range of Blois’s hierarchy. They can discuss vague presentations and specific diagnoses in the same conversation, moving up and down the funnel as the clinical picture develops. They can preserve the rich expressiveness of natural language that gets lost when we force everything into structured codes.
This capability has profound implications for the backend translation systems I’ve advocated. Instead of requiring explicit rules mapping every possible clinical term to standard codes, LLM-based systems can potentially understand clinical intent across terminology variations and levels of specificity. They might finally bridge the gap between how veterinarians naturally document cases and the structured data needed for interoperability and analysis.
Implications for Evaluating Veterinary AI
Blois’s framework also provides a practical lens for evaluating AI tools, complementing the evaluation frameworks I’ve previously discussed.
Ask: Where on the Hierarchy Does This AI Operate?
When evaluating any AI tool, the first question should be: what level of clinical description does this system work with?
Tools operating at the narrow end—imaging analysis, laboratory interpretation, specific diagnostic predictions—work with well-defined data at the bottom of Blois’s funnel. These are the domains where AI has historically succeeded, and we have established methodologies for evaluation: sensitivity, specificity, likelihood ratios, receiver operating characteristic curves.
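For completeness, here is how the narrow-end metrics named above fall out of a 2x2 confusion matrix. The counts are made up for illustration.

```python
# Narrow-end evaluation metrics from a 2x2 confusion matrix.
# tp/fp/fn/tn counts below are invented for illustration.

def narrow_end_metrics(tp, fp, fn, tn):
    sens = tp / (tp + fn)        # sensitivity: P(test+ | disease present)
    spec = tn / (tn + fp)        # specificity: P(test- | disease absent)
    lr_pos = sens / (1 - spec)   # positive likelihood ratio
    lr_neg = (1 - sens) / spec   # negative likelihood ratio
    return {"sensitivity": sens, "specificity": spec,
            "LR+": lr_pos, "LR-": lr_neg}

m = narrow_end_metrics(tp=90, fp=10, fn=10, tn=90)
print(m)  # sensitivity 0.9, specificity 0.9, LR+ ≈ 9, LR- ≈ 0.11
```

These formulas only make sense because the narrow end gives us a ground truth to count against; there is no equivalent 2x2 table for “something’s wrong with this cat.”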
Tools operating at the wide end—clinical decision support, differential diagnosis generators, natural language interfaces—face fundamentally different challenges. They must handle vague inputs, uncertain reasoning, and the full messiness of undifferentiated clinical presentations. Evaluation here is more complex: we need to assess not just accuracy on well-defined cases but robustness across the full range of clinical uncertainty.
Understand the Mismatch Problem
Many AI failures occur when systems designed for one level of the hierarchy encounter data from another level.
An imaging AI trained on clear pathological findings may fail when presented with subtle or ambiguous images. A rule-based diagnostic system expecting specific symptoms may produce nonsensical recommendations when given vague chief complaints. A documentation AI trained on well-structured notes may struggle with the shorthand and abbreviations that veterinarians actually use.
Before implementing any AI tool, verify that the validation data matches the level of clinical description you’ll actually provide. A system validated on clear-cut cases may not perform nearly as well in the ambiguous situations that constitute most real clinical practice.
Recognize the Translation Challenge
Any AI system that must translate between different levels of the hierarchy faces particular challenges. This includes:
Backend coding systems that map narrative documentation to structured codes
Clinical decision support that takes vague presentations and suggests specific diagnoses
Data integration platforms that normalize terminology across practices
These systems are attempting to bridge levels of Blois’s hierarchy—inherently more challenging than operating within a single level. Evaluate them with particular attention to how they handle uncertainty and ambiguity rather than just their performance on clear-cut cases.
The Path Forward
Understanding Blois’s framework doesn’t solve our problems, but it does help us approach them more realistically.
First, we should stop treating data inconsistency as a problem to be eliminated and start treating it as a fundamental property of clinical information that must be accommodated. Practices will always record information at different levels of specificity because they’re capturing different stages of clinical reasoning. Our systems need to handle this variability rather than demanding artificial uniformity.
Second, we should invest in backend translation systems that can work across Blois’s hierarchy—systems that preserve rich clinical narratives while extracting standardized codes for data sharing. Large language models may finally provide the capability to build such systems effectively, though significant development work remains.
Third, we should evaluate AI tools against realistic clinical scenarios that include the full range of uncertainty and ambiguity found in practice—not just the clear-cut cases where AI has historically excelled.
Finally, we should maintain appropriate humility about what AI can accomplish in clinical medicine. Blois’s core insight—that different levels of clinical description require fundamentally different kinds of reasoning—remains valid. Even as AI capabilities advance, the wide end of the diagnostic funnel will likely remain the domain where human clinical judgment is most essential.
Key Insights for Veterinary Practice
📊 Recognize the hierarchy in your own documentation. Notice when you’re recording vague observations versus specific diagnoses. Both are valid and necessary—they represent different stages of clinical reasoning. Systems that force premature specificity are working against the natural diagnostic process.
🔍 Match AI tools to appropriate levels. Imaging AI, laboratory interpretation, and specific diagnostic predictions operate at the narrow end of Blois’s hierarchy and can be evaluated with traditional accuracy metrics. Decision support and natural language tools operate higher on the hierarchy and require different evaluation approaches.
📋 Demand validation at realistic specificity levels. When evaluating AI tools, ask whether the validation data matches your actual practice. A system that performs brilliantly on textbook cases may struggle with the vague presentations and clinical uncertainty that constitute most real-world practice.
🔄 Support backend translation approaches. Rather than demanding that your staff code everything at entry, look for systems that can accept natural language documentation and apply intelligent coding on the backend. This approach aligns with how clinical reasoning actually works.
📝 Preserve clinical expressiveness. Resist pressure to sacrifice the richness of clinical narrative for the convenience of structured data entry. The nuances captured in natural language—uncertainty, interconnected possibilities, clinical reasoning—are often exactly the information that matters most for patient care.
🧠 Understand why semantic chaos persists. The terminology variations across practices aren’t just different words for the same thing—they often represent recording at different levels of clinical specificity. This is why simple synonym mapping doesn’t solve interoperability, and why more sophisticated translation approaches are needed.
⚖️ Maintain appropriate expectations. Even as AI advances, the wide end of the diagnostic funnel—where presentations are vague, uncertainty is high, and clinical judgment is most critical—will likely remain the domain where human expertise is irreplaceable. AI tools are most valuable when they complement rather than attempt to replace this judgment.
Blois died in 1988, four years after publishing “Information and Medicine.” He didn’t live to see either the AI winter that followed or the remarkable renaissance we’re experiencing today. But his insights about the fundamental nature of clinical information—written when computers were room-sized and the internet didn’t exist—remain remarkably relevant for anyone trying to make AI work in veterinary medicine.
The book is out of print but available through used book sellers and some academic libraries for those wanting to explore further.
How do Blois’s concepts resonate with your own experience of clinical reasoning? Have you encountered AI tools that work well at one level of clinical description but struggle at another? I’d love to hear your observations—they help shape how I think about these frameworks and their practical applications.





Brilliant framing of why data standardization has been such a nightmare in vet med. The funnel metaphor really clarifies something I've experienced but couldn't articulate: forcing precision too early just doesn't match how cases actually unfold. Ran into this constantly when trying to get my team to use structured codes mid-appointment; it always felt wrong, but now I see it's fighting against clinical cognition itself. Backend translation powered by LLMs might finally be the workaround we need.
This is a great framework to start understanding pitfalls in medical data coding, though I would argue it is fundamentally incomplete: Medical labels do not exist solely on a unidirectional continuum that progresses from vague, unstructured information to precise structured data; there is bi-directionality and rich compression inherent to diagnosis.
Consider a patient with a tumor that a pathologist calls a "melanoma" in their report. Seems like a simple, unambiguous label: it either is or isn't that type of tumor, right? The challenge is that under the microscope (or these days, a computer screen), a "melanoma" can have variable levels of pigmentation ranging from heavy to absent. It can differ in appearance from "epithelioid" to "spindyloid" to "round cell," or a combination! There are weird variants like "balloon cell" melanomas that don't fit the options above. One tumor might show far more "atypia" than another yet be considered lower grade based on a factor like mitotic rate (the number of dividing cells in a tumor). And two melanomas with cancer cells that appear morphologically identical can differ in aggressiveness based solely on anatomic location or on non-tumor features in the tissue section (stroma, vascular/lymphatic invasion, inflammatory infiltrate, etc.). Even if you used SNOMED or ICD coding to try to force a specific structured label (to reduce variation among terms like "melanoma" vs "malignant melanoma" vs "melanocytoma" vs "melanocytic neoplasia," etc.), any computer vision AI program trying to learn what "melanoma" is will face an uphill battle differentiating it from very similar-looking lesions that behave completely differently.
Some folks seem to think you can simply overcome this problem with "brute force" by throwing more and more data at it. I'm skeptical. An experienced pathologist doesn't rely only on visual pattern recognition; they draw on knowledge about cell biology, physiology, normal and abnormal anatomy, the impacts of drugs and radiation on tissues, infectious agents that cause dysplasia, and more. A pathologist might see a possible melanoma and say to themselves, "Hmm, here is my initial set of differentials; let's order a panel of immunohistochemistry including pancytokeratin, vimentin, MelanA, and S100." Sometimes you might need molecular tests besides IHC (like PARR clonality testing for lymphoma or PCR for mutated c-kit in mast cell tumors). Then you are presented with one or more sets of additional assay data that requires its own interpretation and fine discernment: real signal vs background, comparison to positive and negative control reactions, knowledge from the research literature about how well (or poorly) different assays perform in different situations, and so on. Sometimes the key to diagnosis might depend on clinical history or other lab/imaging data in the medical record. The combinatorial complexity is staggering when you really think about it!
My point is that even something that seems extremely precise and specific like a cancer diagnosis often encodes a wide richness that is more challenging than it appears at first glance. Is it an impossible problem to solve? Theoretically, no. But it is going to require a *LOT* more time, data, resourcefulness, and creativity than what I'm seeing many companies prepared to build.