Language Models in Veterinary Practice: What Good Evaluation Actually Looks Like
Understanding the Critical Difference Between Looking Right and Being Right in Veterinary AI
When you document that a patient is "vomiting," an AI might generate "experiencing emesis" in your SOAP note. Both are correct. But how do we teach a computer to know when "ADR" means "adverse drug reaction" in one context and "ain't doing right" in another?
The answer reveals why evaluating veterinary AI is surprisingly complex—and why the metrics most vendors cite might be dangerously misleading.
In my previous post about LLM hallucinations, I explained why these systems inevitably generate plausible-sounding but incorrect information. Today, we're tackling the flip side: how do we actually measure whether an LLM is performing well in veterinary practice?
After nearly three decades in veterinary diagnostics and extensive work with LLMs, I've learned that the evaluation metrics that sound most impressive are often the least meaningful for clinical practice. Just as positive predictive value (PPV) can mislead about a diagnostic test when disease prevalence shifts, traditional NLP metrics can make dangerous AI look deceptively good.
Why Traditional Metrics Fail Spectacularly
Imagine grading a student's essay by counting how many words match the answer key. That's essentially what traditional AI metrics do—and it's about as useful as you'd expect for evaluating clinical documentation.
The Dangerous Flip That Proves the Point
Let me show you why this matters with a real example. Consider an AI reviewing this case: "Labrador with elevated ALT levels indicating possible liver disease."
Three AI responses:
1. "Dog shows increased ALT suggesting hepatic dysfunction"
2. "Patient has high liver enzymes indicating potential liver pathology"
3. "Dog's ALT levels are significantly low, indicating healthy liver function"
The third response is clinically opposite—it could lead to missing serious liver disease. Yet when we score these with traditional metrics, something shocking happens:
The dangerous error scores almost twice as high as the correct answers! This isn't a quirk—it's a fundamental flaw in how these metrics work.
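To make the mechanism concrete, here is a minimal sketch of that scoring, using NLTK's sentence-level BLEU as a stand-in for "traditional metrics." The exact numbers will vary with tokenization and smoothing choices, but the point is the logic: the metric only counts overlapping words and phrases, so shared tokens like "ALT," "levels," "indicating," and "liver" reward the flipped answer.

```python
# Sketch: scoring the three responses against the reference with n-gram overlap
# (unigram + bigram BLEU via NLTK). Scores depend on tokenization and smoothing;
# the key point is that BLEU has no notion of clinical meaning.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "labrador with elevated alt levels indicating possible liver disease".split()

responses = {
    "correct #1": "dog shows increased alt suggesting hepatic dysfunction",
    "correct #2": "patient has high liver enzymes indicating potential liver pathology",
    "dangerous":  "dog's alt levels are significantly low, indicating healthy liver function",
}

smoother = SmoothingFunction().method1  # avoid zero scores when a bigram never matches

for label, text in responses.items():
    candidate = text.replace(",", "").split()
    # BLEU counts shared unigrams/bigrams with the reference; shared words like
    # "alt", "levels", "indicating", and "liver" boost the clinically opposite answer.
    score = sentence_bleu([reference], candidate, weights=(0.5, 0.5),
                          smoothing_function=smoother)
    print(f"{label:>11}: {score:.3f}")
```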
Understanding the Metrics (And Why They Mislead)
BLEU and ROUGE are traditional NLP metrics that have been used for decades to evaluate text generation in machine translation and summarization. While most veterinary AI vendors don't cite any metrics at all (red flag!), if they did use standard NLP evaluation, these would likely be the ones. They literally count matching words and phrases between AI output and reference examples. If the reference says "gave subcutaneous fluids" and the AI writes "administered SQ fluids," these metrics score it as wrong—even though any veterinarian knows they're identical. Research shows these metrics correlate very poorly with human medical judgment.1
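You can see the synonym problem directly with the open-source rouge-score package (a sketch, not how any particular vendor evaluates): the two phrasings describe the same treatment, but only the single shared token "fluids" counts toward the score.

```python
# Sketch: ROUGE penalizes clinically identical wording because it only
# matches surface tokens, not meaning.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="gave subcutaneous fluids",       # reference wording
    prediction="administered SQ fluids",     # clinically identical AI wording
)
for name, s in scores.items():
    print(f"{name}: precision={s.precision:.2f} recall={s.recall:.2f} f1={s.fmeasure:.2f}")
```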
BERTScore represents an improvement—it understands that "puppy" and "young dog" mean similar things, achieving about 62-80% correlation with human experts. But it still can't tell if a drug dose is appropriate for a cat versus a dog. It sees semantic similarity but misses clinical significance.2
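A quick sketch with the bert-score package shows the improvement and the limit: embedding similarity recognizes that "puppy" and "young dog" are close, but a similarity score says nothing about whether a dose or a lab interpretation is clinically appropriate. (The sentences below are illustrative, not from any real record.)

```python
# Sketch: BERTScore captures paraphrase similarity but not clinical correctness.
from bert_score import score

refs  = ["The puppy was given subcutaneous fluids for dehydration."]
cands = ["The young dog received SQ fluids to treat dehydration."]

# P, R, F1 are tensors with one entry per candidate/reference pair.
P, R, F1 = score(cands, refs, lang="en", verbose=False)
print(f"BERTScore F1: {F1.item():.3f}")  # high similarity despite different wording
```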
The Benchmark Trap: Many AI companies tout high scores on multiple-choice veterinary exams. But real clinical work isn't multiple choice. A system that aces the NAVLE might still generate dangerous free-text recommendations. We need evaluation that matches how AI is actually used, not how it performs on standardized tests.
Fortunately, three evaluation approaches have emerged that actually catch these dangerous errors before they reach your patients—and knowing which one fits your practice could be the difference between AI that saves time and AI that creates liability.