Abstract
Research Type:
Level 3 - Retrospective cohort study, Case-control study, Meta-analysis of Level 3 studies
Introduction/Purpose:
The integration of large language models (LLMs) into healthcare has raised critical questions about their role in patient education. While these models offer scalable content generation and real-time patient interaction, their ability to produce accessible and actionable patient education materials (PEMs) remains largely unexamined. Traditional PEMs are created through physician expertise and provide comprehensive content enhanced by clinical imagery and accessible handouts, but they require significant resources to maintain and update. This study provides the first comparative analysis of the readability and quality of physician-created PEMs versus those generated by leading LLMs for foot and ankle conditions.
Methods:
Twelve common foot and ankle conditions (Achilles tendon rupture, ankle sprain, calcaneal fracture, avulsion fracture, Achilles tendinitis, flatfoot deformity, ankle arthritis, bunion, hallux rigidus, metatarsalgia, plantar fasciitis, and peroneal tendinitis) were evaluated, comparing PEMs from established surgeon-created sources (American Orthopaedic Foot and Ankle Society [FootCareMD.org] and FootEducation.com) with materials generated by three LLMs (ChatGPT-4, Claude, and Gemini). LLM responses were generated using a standardized, patient-centered prompt. All materials were formatted identically to remove source identification and were assessed using comprehensive readability metrics (Flesch-Kincaid, Gunning Fog, and Simplified Measure of Gobbledygook [SMOG] indices), the DISCERN instrument for quality assessment, and the Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P) for understandability and actionability. Two fellowship-trained foot and ankle surgeons independently evaluated all materials in a blinded review, with discrepancies resolved through consensus. Independent-samples t-tests compared surgeon-created versus LLM-generated content, with Cohen's d calculated for effect sizes.
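For illustration only, the following is a minimal Python sketch of the type of analysis described above: a Flesch-Kincaid grade-level calculation, an independent-samples t-test, and Cohen's d. The syllable-counting heuristic and the example score lists are assumptions for demonstration, not the study's actual data or code.

```python
import re
import math
from scipy import stats

def count_syllables(word: str) -> int:
    """Rough vowel-group heuristic; dedicated readability tools use more careful rules."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59

def cohens_d(a, b) -> float:
    """Cohen's d for two independent samples using a pooled standard deviation."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    var_a = sum((x - mean_a) ** 2 for x in a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (nb - 1)
    pooled_sd = math.sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))
    return (mean_a - mean_b) / pooled_sd

# Hypothetical grade-level scores for LLM- vs. surgeon-created handouts (not study data).
llm_scores = [13.5, 12.9, 14.1, 13.0, 12.6]
surgeon_scores = [9.1, 8.7, 10.2, 9.8, 9.0]

t_stat, p_value = stats.ttest_ind(llm_scores, surgeon_scores)  # independent-samples t-test
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, d = {cohens_d(llm_scores, surgeon_scores):.2f}")
print(f"Example grade level: {flesch_kincaid_grade('Rest the foot. Apply ice for twenty minutes.'):.1f}")
```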
Results:
LLM-generated content required a significantly higher grade level for comprehension (13.27 ± 3.92 vs. 9.45 ± 2.55, p < 0.0001) and showed consistently lower readability scores, potentially limiting accessibility for individuals with lower educational attainment. Quality assessment with the DISCERN instrument showed surgeon-created materials were superior (33.59 ± 3.34 vs. 27.67 ± 0.95, p < 0.0001), particularly in providing practical guidance and updated treatment options. PEMAT-P analysis demonstrated that surgeon-created materials had better understandability (79.08 vs. 75.00, p = 0.0002) and actionability (54.17 vs. 46.67, p < 0.0001). Both LLM-generated and surgeon-created materials generally lacked scientific citations, with Gemini being the notable exception. Large effect sizes (Cohen's d ranging from 1.12 to 3.81) confirmed clinically meaningful differences, especially in practical instruction and clinical context.
Conclusion:
While LLMs offer promising capabilities for real-time patient interaction, they currently fall short of physician-created materials in the crucial areas of accessibility, actionability, and clinical context. These findings underscore the continued importance of comprehensive, evidence-based patient handouts. Future development should focus on improving readability, incorporating current treatment techniques, enhancing actionability, and integrating visual clinical elements to better serve diverse patient populations. Optimal patient education may be achieved through a hybrid approach in which LLMs supplement physician-created materials.
Comparison of Readability Metrics and Quality Assessment Scores between Surgeon-Created and Artificial Intelligence-Generated Patient Education Materials
