Abstract
This study evaluated the accuracy, consistency, and clinical appropriateness of responses generated by large language models (LLMs) to frequently asked questions (FAQs) in veterinary dentistry client communication. Six common FAQs were identified based on guidance from the American Veterinary Dental College (AVDC) and submitted under standardized conditions to multiple LLMs, including ChatGPT-5, Gemini 2.5 Flash, Claude Sonnet 4.5, Perplexity, Qwen3-Max, and DeepSeek. Artificial intelligence (AI)-generated responses were compared with expert-reviewed reference answers prepared by 2 veterinarians with academic and clinical experience in small animal dentistry. Responses were independently evaluated by 2 expert and 2 novice assessors across 4 domains (main idea coverage, information quality, consistency with expert content, and presence of inconsistencies) using a 3-point Likert scale (Yes, Neutral, No). Inter-rater agreement between expert evaluators was assessed using Cohen's kappa, and between-model comparisons were performed using McNemar's exact test after dichotomization of ratings. Inter-rater agreement was substantial (κ = 0.68). ChatGPT-5 showed the highest alignment with the expert-reviewed reference content, followed by Claude Sonnet 4.5. Differences between expert and novice evaluations were most evident for questions related to anesthesia safety and anesthesia-free dental procedures. Clinically relevant inaccuracies were identified across several models, particularly regarding the requirement for general anesthesia with a protected airway. No statistically significant differences were detected between the primary model comparisons (P = 1.00). These findings indicate that LLMs may support client education in veterinary dentistry but require expert oversight to ensure clinical accuracy and patient safety.
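As context for the statistics reported above, the two tests can be sketched in pure Python: Cohen's kappa for inter-rater agreement and the two-sided exact McNemar test on discordant counts after dichotomization. The ratings below are illustrative toy data, not the study's data, and the function names are the author's own.

```python
from math import comb

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters scoring the same items."""
    n = len(r1)
    p_obs = sum(a == b for a, b in zip(r1, r2)) / n  # observed agreement
    cats = set(r1) | set(r2)
    # chance agreement from each rater's marginal category frequencies
    p_exp = sum((r1.count(c) / n) * (r2.count(c) / n) for c in cats)
    return (p_obs - p_exp) / (1 - p_exp)

def mcnemar_exact(b, c):
    """Two-sided exact McNemar test on the discordant pair counts b and c."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    p = 2 * sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(p, 1.0)

# Toy dichotomized ratings (Yes vs. not-Yes) for two raters
expert = ["Y", "Y", "N", "N"]
novice = ["Y", "N", "N", "N"]
print(round(cohens_kappa(expert, novice), 2))  # → 0.5
print(mcnemar_exact(0, 0))                     # → 1.0
```

With no discordant pairs between two models, the exact McNemar p-value is 1.0, which is consistent with the P = 1.00 reported for the primary model comparisons.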
