How to Evaluate AI Systems in Veterinary Medicine: A Framework for Every Type of Tool
Building on the Transparency Crisis: What Validation Should Actually Look Like
In my previous post, I outlined the veterinary AI transparency crisis—how most companies make bold performance claims without providing the validation data to support them. Demanding evidence is step one, but what should that validation actually look like?
The challenge isn't just getting validation data—it's understanding what kind of validation is appropriate for different AI systems. A diagnostic imaging AI requires entirely different evaluation approaches than an automated appointment scheduling system or a transcription tool. Yet most discussions about "AI accuracy" treat all AI systems as if they're identical.
After nearly three decades in veterinary diagnostics, I've learned that we need to categorize AI systems by their role in veterinary practice before we can evaluate them properly.
Here's a practical framework for evaluating any AI tool entering your practice, tailored to how these systems actually work and how you'll actually use them.
The Decision-Action Framework: Two Fundamentally Different AI Types
Before diving into evaluation frameworks, we need to distinguish between two fundamentally different types of AI systems.
Decision Support AI provides information to help humans make better decisions. Think diagnostic imaging analysis, risk prediction models, or differential diagnosis generators. The key question becomes: "How will this AI change my clinical decisions?" Evaluation must focus on clinical utility, decision impact, and integration with human judgment.
Automation AI performs tasks with minimal human oversight. Examples include automated transcription, appointment scheduling, inventory management, and routine data entry. Here the key question shifts to: "How reliably does this AI perform the intended task?" Evaluation focuses on task completion accuracy, efficiency gains, error rates, and workflow integration.
These categories require completely different evaluation approaches. Decision support AI must be evaluated based on how it influences clinical thinking, while automation AI must be evaluated based on task performance and reliability.
Why This Distinction Matters
The same "95% accuracy" claim means entirely different things for these two categories. For decision support AI, you need to know how that accuracy translates to better clinical decisions. For automation AI, you need to know how reliably it completes its designated tasks without human intervention.
Consider a hypothetical AI tool that detects heart murmurs with "90% accuracy." That number means completely different things depending on whether you're using it for wellness screening (deciding whether to pursue further cardiac workup), pre-anesthetic evaluation (assessing surgical risk), or emergency triage (prioritizing patient urgency). The evaluation framework must match the decision context where you'll actually use the tool.
Clinical Prediction Models: Decision Support AI Evaluation
For AI tools that help with diagnosis, prognosis, or treatment decisions, traditional accuracy metrics are insufficient and sometimes misleading.
The Prevalence Problem
Positive predictive value (PPV) changes dramatically with disease prevalence—a fundamental statistical reality that makes reported PPV nearly meaningless for clinical decision-making. An AI tool with "95% PPV" in a referral hospital might have 20% PPV in general practice—same tool, same accuracy, completely different clinical utility.
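To make this concrete, here's a minimal sketch in Python showing how the same test's PPV collapses as prevalence drops. The sensitivity, specificity, and prevalence values are purely illustrative:

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value via Bayes' theorem."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

# The same hypothetical test (90% sensitive, 93% specific) in two settings:
print(f"Referral hospital (50% prevalence): PPV = {ppv(0.90, 0.93, 0.50):.0%}")  # 93%
print(f"General practice (2% prevalence):   PPV = {ppv(0.90, 0.93, 0.02):.0%}")  # 21%
```

Nothing about the test changed between those two lines; only the patient population did.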
What to Demand Instead
Instead of relying on misleading PPV claims, focus on metrics that remain stable across populations:
Sensitivity and specificity: Inherent properties of the test itself; they don't change based on where you practice.
Likelihood ratios: How much a test result should change your clinical thinking, regardless of your patient population (see the sketch after this list).
Population validity: Whether the validation study actually matches your practice setting.
Clinical utility studies: Evidence that the AI changes decisions appropriately, not just that it produces accurate outputs.
Failure mode analysis: What happens when the system is wrong, which is critical information for managing clinical risk.
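As a quick preview of how likelihood ratios work, here's a hedged sketch, reusing the same illustrative test characteristics as the PPV example above, that converts a pre-test probability into a post-test probability via odds:

```python
def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Apply a likelihood ratio to a pre-test probability via odds."""
    pre_test_odds = pre_test_prob / (1 - pre_test_prob)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)

sensitivity, specificity = 0.90, 0.93          # illustrative values only
lr_positive = sensitivity / (1 - specificity)  # ~12.9: how much a positive result raises suspicion
lr_negative = (1 - sensitivity) / specificity  # ~0.11: how much a negative result lowers it

# The same positive result moves the needle very differently by setting:
for setting, pre_test in [("General practice", 0.02), ("Referral hospital", 0.50)]:
    post = post_test_probability(pre_test, lr_positive)
    print(f"{setting}: {pre_test:.0%} pre-test -> {post:.0%} post-test")
```

Those post-test probabilities match the PPVs above, and that's no coincidence: PPV is just the post-test probability after a positive result. The likelihood ratio framing simply makes the dependence on pre-test probability explicit instead of hiding it.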
Given the critical importance of understanding how prevalence affects diagnostic interpretation and why likelihood ratios provide a more reliable framework for clinical decision-making, I'll be dedicating an entire upcoming post to this topic.
Validation Requirements
The validation study should match your intended use. If you're considering an AI tool for routine screening, the validation should include routine cases, not just referred patients with obvious disease. Ask vendors: "What was the disease prevalence in your validation population, and how does that compare to my practice?"
Language Generation Models: Bridging Decision Support and Automation
Language generation AI can function as either decision support or automation, depending on the application.
Decision Support Applications include generating differential diagnoses, explaining complex conditions to clients, or summarizing case information. These should be evaluated like other decision support tools, focusing on clinical accuracy, appropriateness, and impact on decisions.
Automation Applications encompass generating routine discharge instructions, appointment confirmations, or basic client communications. These should be evaluated like other automation tools, emphasizing task completion accuracy, consistency, and reliability.
What Actually Matters for Both Categories
Traditional natural language processing metrics like BLEU or ROUGE scores are essentially useless for veterinary applications. These metrics were designed for translation tasks and measure similarity to reference texts—but there are multiple correct ways to express the same clinical information.
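A toy illustration of the problem, using bigram overlap as a crude stand-in for BLEU-style scoring (the drug and doses are invented for the example):

```python
def bigram_overlap(reference: str, candidate: str) -> float:
    """Fraction of the candidate's bigrams found in the reference,
    a rough proxy for BLEU-style n-gram precision."""
    def bigrams(text: str) -> set:
        words = text.lower().split()
        return set(zip(words, words[1:]))
    ref, cand = bigrams(reference), bigrams(candidate)
    return len(ref & cand) / len(cand) if cand else 0.0

reference  = "administer 2.5 mg of enalapril orally twice daily"
equivalent = "give 2.5 mg enalapril by mouth every 12 hours"     # same instruction
dangerous  = "administer 25 mg of enalapril orally twice daily"  # 10x overdose

print(bigram_overlap(reference, equivalent))  # ~0.12: clinically correct, scores terribly
print(bigram_overlap(reference, dangerous))   # ~0.71: clinically dangerous, scores well
```

The metric rewards surface similarity, so it punishes the safe paraphrase and barely flags the tenfold dosage error, which is exactly backwards for clinical use.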
Instead, focus on clinical accuracy (are the medical facts correct?), appropriateness (is the tone and content suitable for the intended audience?), safety (risk of harmful or misleading information), workflow integration (does it actually save time and improve quality?), and consistency (reproducible quality across different inputs).
Validation Requirements
Effective validation requires human expert evaluation protocols rather than automated metrics, fact-checking against veterinary literature, A/B testing in real practice settings, and long-term monitoring for drift and degradation.
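What might a human expert evaluation protocol look like in practice? Here's a minimal sketch. The dimensions mirror the criteria above, but the 1-5 scale, field names, and aggregation are my assumptions, not an established standard:

```python
from dataclasses import dataclass

@dataclass
class ExpertReview:
    """One expert's scores for one AI-generated output (hypothetical rubric)."""
    reviewer_id: str
    output_id: str
    clinical_accuracy: int  # 1-5: are the medical facts correct?
    appropriateness: int    # 1-5: are tone and content right for the audience?
    safety: int             # 1-5: free of harmful or misleading information?
    notes: str = ""

def mean_scores(reviews: list[ExpertReview]) -> dict[str, float]:
    """Average each dimension across multiple reviewers for a single output."""
    dims = ("clinical_accuracy", "appropriateness", "safety")
    return {d: sum(getattr(r, d) for r in reviews) / len(reviews) for d in dims}
```

Even a rubric this simple forces what automated metrics skip: multiple expert reviewers per output, scored on clinically meaningful dimensions, tracked over time so drift shows up in the numbers.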
Given the complexity and unique challenges of evaluating language models—especially the hallucination issues I discussed previously—I'll be dedicating an entire post to LLM evaluation frameworks in the coming weeks.
Imaging Models: Three Distinct Categories
Imaging AI falls into three distinct categories requiring different evaluation approaches.
Diagnostic imaging AI provides clinical predictions such as fracture detection or mass identification. These should be evaluated like clinical prediction models, focusing on sensitivity, specificity, and likelihood ratios. They require reader studies showing human-AI versus human-alone performance.
Image enhancement AI highlights regions of interest or improves image quality. Evaluation should focus on workflow integration and user acceptance, measuring time savings and diagnostic confidence while assessing consistency and reliability.
AI-only imaging systems provide fully automated diagnostic analysis without human radiologist review. These should be evaluated like clinical prediction models, using sensitivity, specificity, and likelihood ratios. They require extensive validation across diverse patient populations and image conditions, need clear protocols for when results should trigger human review, and must demonstrate performance equivalent to or better than human interpretation.
Key Questions
Critical considerations include whether the AI enhances or disrupts radiologist workflow, how combined human-AI performance compares to either approach alone, whether AI-only systems match human diagnostic accuracy, and what happens when the AI highlights irrelevant findings or misses critical ones.
Automation AI: Task Performance and Reliability
For AI systems designed to perform tasks with minimal human oversight—transcription, scheduling, data entry, routine communications—the evaluation framework shifts dramatically.
Key Evaluation Metrics
Focus on task completion accuracy (how often does the system successfully complete the intended task?), error detection (when the system fails, is it obvious to users?), reliability (consistent performance across different conditions and inputs), speed and efficiency (does it actually improve workflow?), and graceful failure handling (how does the system behave when it encounters unexpected situations?).
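The distinction between visible and silent failures deserves emphasis. A hypothetical monitoring sketch (the outcome labels and log format are assumptions) might track them separately:

```python
from collections import Counter

def summarize_outcomes(outcomes: list[str]) -> dict[str, float]:
    """Summarize logged task outcomes: 'success', 'failed_visibly'
    (a user noticed the failure), or 'failed_silently' (nobody did)."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return {
        "completion_rate": counts["success"] / total,
        "visible_failure_rate": counts["failed_visibly"] / total,
        # Silent failures are the dangerous ones for automation AI:
        "silent_failure_rate": counts["failed_silently"] / total,
    }

log = ["success"] * 95 + ["failed_visibly"] * 4 + ["failed_silently"]
print(summarize_outcomes(log))  # 95% completion, but one failure nobody caught
```

A system that fails loudly 5% of the time may be safer than one that fails silently 1% of the time.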
Examples by Category
Transcription Systems should be evaluated on word error rate (WER) weighted for medical significance, medical terminology accuracy, handling of unclear audio or multiple speakers, and integration with existing documentation systems.
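Standard WER treats "daily" and "methimazole" as equally important, which is clearly wrong for clinical documentation. Here's a rough sketch of a medically weighted WER; the term list, weights, and the difflib alignment shortcut are all illustrative simplifications:

```python
import difflib

CRITICAL_TERMS = {"mg", "ml", "kg", "insulin", "methimazole"}  # hypothetical list

def word_weight(word: str) -> float:
    """Errors on dosage units and drug names cost 5x a routine word."""
    return 5.0 if word.lower() in CRITICAL_TERMS else 1.0

def weighted_wer(reference: str, hypothesis: str) -> float:
    """Word error rate where mistakes on critical medical terms cost more.
    Uses difflib alignment as a simplification of true edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    error_cost = 0.0
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=ref, b=hyp).get_opcodes():
        if op == "replace":  # count the costlier side of the substitution
            error_cost += max(sum(word_weight(w) for w in ref[i1:i2]),
                              sum(word_weight(w) for w in hyp[j1:j2]))
        elif op == "delete":   # words the system dropped
            error_cost += sum(word_weight(w) for w in ref[i1:i2])
        elif op == "insert":   # words the system invented
            error_cost += sum(word_weight(w) for w in hyp[j1:j2])
    return error_cost / sum(word_weight(w) for w in ref)

# Dropping the drug name hurts far more than garbling a routine word:
print(weighted_wer("give 5 mg methimazole twice daily",
                   "give 5 mg twice daily"))  # ~0.36 weighted vs ~0.17 unweighted
```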
Practice Management Systems require assessment of scheduling accuracy and conflict resolution, inventory prediction accuracy, billing automation error rates, and client communication delivery and formatting.
Validation Requirements
Effective validation demands real-world testing in actual practice environments, long-term reliability monitoring, user acceptance and adoption rates, comparison to manual processes for speed and accuracy, and recovery protocols when automation fails.
The Universal Evaluation Framework: Five Critical Questions
Regardless of AI system type, always ask these five questions:
Decision Impact: How will this change what actions I take?
Context Validity: Was this validated in settings like mine?
Failure Modes: What happens when this system is wrong?
Monitoring: How will I know if performance degrades?
Integration: How does this fit into my existing workflow?
Red Flags for ALL AI Systems
Validation only in ideal conditions
No discussion of failure modes or edge cases
Lack of ongoing performance monitoring plans
Performance metrics that don't match intended use
Refusal to provide validation methodology
For AI Companies: The Path Forward
If you're developing veterinary AI tools, rigorous validation isn't just ethical—it's a competitive advantage. In a crowded market where most companies provide no validation data, transparent evidence immediately sets you apart.
Validation Best Practices
Multi-site validation: Test across different practice types and populations
Report appropriate metrics: Likelihood ratios for clinical tools, workflow metrics for operational tools
Document limitations: Be clear about when and where your tool should not be used
Plan post-market surveillance: Performance can degrade over time
Seek independent validation: Third-party studies carry more weight than internal testing
Key Insights for Veterinary Practice
🎯 Match evaluation to intended use: Screening tools need different validation than diagnostic confirmation tools—demand evaluation data that matches how you'll actually use the AI.
📊 Demand context-appropriate metrics: For clinical tools, insist on sensitivity, specificity, and likelihood ratios. For practice management tools, focus on workflow and business metrics.
🏥 Verify population validity: Ask whether the validation population matches your patient demographics, case mix, and practice setting.
⚠️ Understand failure modes: Every AI system fails sometimes—demand clear documentation of when and how failures occur.
🔍 Establish monitoring protocols: Set up systems to track AI performance in your practice over time—performance can degrade without obvious warning signs.
📋 Create evaluation checklists: Develop standardized evaluation processes for different AI tool categories to ensure consistent vendor assessment.
💼 Calculate true ROI: Factor validation quality into purchasing decisions—well-validated tools are more likely to deliver promised benefits.
🔄 Plan for integration: Consider how each AI tool fits into existing workflows and what training will be required for successful adoption.
📚 Document AI-assisted decisions: Establish protocols for documenting when and how AI tools influence clinical decisions for both medical records and quality improvement.
🚨 Start with pilot programs: When validation data is limited, implement AI tools on a trial basis with careful monitoring of real-world performance before full deployment.
Conclusion
The key insight driving this framework is understanding that different AI systems serve fundamentally different roles in veterinary practice. Decision-support AI exists to influence human judgment, while automation AI exists to perform tasks directly. These different roles require completely different evaluation approaches.
For decision-support AI, the critical question is "How will this change what I decide or do?" You need to understand how AI information fits into your clinical reasoning process and whether it improves decision-making quality.
For automation AI, the critical question is "Should this task be automated, and how will I monitor the automation?" You need to assess whether the task is appropriate for automation and establish proper oversight mechanisms.
Both types require rigorous validation, but the validation criteria, performance metrics, and monitoring approaches are fundamentally different. A diagnostic imaging AI needs sensitivity, specificity, and likelihood ratios. An automated scheduling system needs task completion rates, error handling protocols, and exception management procedures.
The common thread is understanding how each AI system fits into your existing workflows and whether it genuinely improves outcomes for your patients and practice. No AI tool—regardless of how sophisticated—should be deployed without clear evidence that it enhances rather than complicates veterinary care.
In upcoming posts, I'll be diving deep into two critical areas that deserve their own detailed analysis: likelihood ratios as the foundation for interpreting diagnostic AI (including practical frameworks for moving from pre-test to post-test probability), and comprehensive evaluation methodologies for language generation models.
After all, you already apply evidence-based thinking to every other aspect of veterinary practice. Why should AI be any different?
What AI tools are you currently evaluating for your practice? Reply and let me know what challenges you're facing with vendor validation claims—I'd love to hear about your experiences and may feature your questions in future deep-dive posts.
I'm trying something new here: listen to a short AI-generated podcast based on this post. If you do, please let me know what you think. Does it add anything?