Abstract
Background
General-purpose large language models (LLMs) often lack specialized knowledge for cleft lip and palate (CLP), limiting their clinical utility. Accessible, accurate, and affordable AI tools could supplement existing educational resources for clinicians and patients.
Objective
To develop and evaluate CLP-generative pre-trained transformer (GPT), a parameter-efficient model fine-tuned for CLP, and to assess whether a smaller specialized model can achieve performance comparable to large generalist models for clinical deployment.
Method
We built a CLP dataset of 7815 validated Q&A pairs from medical dialogs, literature, and guidelines. Using Qwen2-7B-Instruct as the base, we applied low-rank adaptation for domain adaptation. Performance was assessed via physician- and patient-oriented subjective evaluations and an objective test of 135 specialized multiple-choice questions.
Result
CLP-GPT showed significant improvements over its baseline across all dimensions (p < 0.001). In physician evaluations, its accuracy surpassed Claude-3.5-Sonnet and Gemini-1.5-Pro (p < 0.05) and showed no significant differences from models like GPT-4o (p > 0.05). In patient evaluations, CLP-GPT performed comparably to leading large models across all dimensions (p > 0.05). On the objective test, it achieved the highest accuracy (71.85%), though differences among models were not statistically significant.
Conclusion
CLP-GPT, a parameter-efficient, domain-specific model, delivers performance comparable to other LLMs in CLP. This demonstrates that cost-effective, specialized models can achieve high performance without massive computational resources.
Introduction
Cleft lip and palate (CLP) is one of the most common congenital craniofacial anomalies, affecting approximately 1 in 700 newborns worldwide.1 The management of CLP is a lifelong process necessitating a multidisciplinary team approach involving surgery, orthodontics, speech therapy, and genetics.2,3 CLP serves as a strategic validation domain for specialized AI because it follows strict, internationally consensus-based clinical pathways, requires the integration of heterogeneous knowledge, and addresses digital health equity for underserved populations.4,5 Despite established clinical protocols, patients and primary care providers often face significant barriers in accessing specialized knowledge, while existing educational frameworks frequently suffer from outdated information or a lack of personalized interaction.6
The emergence of large language models (LLMs) offers a potential solution to these knowledge gaps.7,8,9 While prior AI research in CLP has predominantly focused on image-based diagnosis or surgical outcome prediction,10 general-purpose LLMs demonstrate strong capabilities in medical question-answering but are prone to “hallucinations” and incur high computational costs.11,12 Recent research suggests that smaller models, when fine-tuned on high-quality domain-specific data, can achieve performance comparable to larger models while being more efficient to deploy.13,14,15,16 Consequently, this study aims to develop CLP-generative pre-trained transformer (GPT) by fine-tuning a parameter-efficient model (Qwen2-7B) using low-rank adaptation (LoRA) on an expert-verified dataset.17,18 We hypothesize that this specialized model can offer a reliable, cost-effective alternative to leading generalist models for clinical decision support and patient education, as measured by standard accuracy and reliability metrics.
Methods
Study design and setting
We developed and evaluated CLP-GPT, an LLM specialized for CLP. The model is based on the Qwen2-7B-Instruct architecture, selected for its optimal balance between clinical performance and computational efficiency in moderately resourced medical settings. While smaller models exist, the 7B scale was prioritized to capture the multifaceted complexity of CLP-specific knowledge. For benchmarking, CLP-GPT was compared against Claude-3.5-Sonnet, GPT-4o, and Gemini-1.5-Pro. Specialized models like Med-PaLM 2 were excluded as they are not currently open-source or accessible via API for independent validation.
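Although the study's training scripts are not published, the parameter-efficient setup can be sketched with Hugging Face's peft library. In the minimal Python sketch below, the rank, scaling factor, dropout, and target modules are illustrative assumptions rather than the authors' reported configuration:

```python
# Minimal LoRA adaptation sketch for Qwen2-7B-Instruct using Hugging Face peft.
# Hyperparameters below are illustrative assumptions; the paper does not
# report its exact LoRA configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "Qwen/Qwen2-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto")

lora_config = LoraConfig(
    r=16,                 # rank of the low-rank update matrices (assumed)
    lora_alpha=32,        # scaling factor for the adapter output (assumed)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Only the adapter weights train; the 7B base weights stay frozen.
model.print_trainable_parameters()
```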
Data sources and curation
The dataset construction followed a rigorous multi-stage pipeline supervised by two attending physicians (X.H. and X.Z.), each with over 5 years of clinical experience in CLP (Figure 1). Initially, we extracted 1,119,884 Q&A pairs from public medical datasets, including HuatuoGPT, the Chinese Medical Dialogue Dataset, and cMedQA. Through keyword-based filtering and independent expert review, this pool was reduced to 3640 high-quality pairs. To address gaps in complex academic knowledge, we mined 608 peer-reviewed articles and nine clinical guidelines from PubMed and CNKI. We utilized the GPT-4 API to generate candidate Q&A pairs from these technical texts; after removing redundancies and ambiguous items, the final dataset comprised 7815 validated Q&A pairs.

Figure 1. The data curation and model development pipeline for CLP-GPT. The workflow comprises a multi-stage process: (1) initial extraction of over 1.1 million raw Q&A pairs from public medical datasets; (2) expert filtering to identify high-quality subsets; (3) literature expansion via the GPT-4 API using peer-reviewed articles and clinical guidelines; (4) human-in-the-loop manual verification by attending physicians; and (5) final refinement to eliminate redundancies, yielding the final validated dataset of 7815 Q&A pairs.
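The expert-review stages of this pipeline are manual, but the automated keyword filtering and deduplication steps can be illustrated in code. The sketch below is a plausible reconstruction; the keyword list, input file name, and JSON field names are assumptions:

```python
# Illustrative reconstruction of the keyword filtering and deduplication
# steps; keywords, file name, and field names are assumptions, and the
# subsequent expert review is manual (not shown).
import json

CLP_KEYWORDS = ["cleft lip", "cleft palate", "palatoplasty",
                "cheiloplasty", "velopharyngeal"]  # assumed keyword set

def is_clp_related(pair: dict) -> bool:
    """Keep a Q&A pair if any CLP keyword appears in question or answer."""
    text = (pair["question"] + " " + pair["answer"]).lower()
    return any(kw in text for kw in CLP_KEYWORDS)

def deduplicate(pairs: list[dict]) -> list[dict]:
    """Drop exact duplicates keyed on normalized question text."""
    seen, unique = set(), []
    for p in pairs:
        key = p["question"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique

with open("raw_medical_qa.jsonl") as f:  # hypothetical merged corpus
    raw = [json.loads(line) for line in f]

candidates = deduplicate([p for p in raw if is_clp_related(p)])
print(f"{len(raw)} raw pairs -> {len(candidates)} CLP candidates for review")
```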
Variables and measurement
Performance was assessed through subjective evaluations and an objective standardized test. For subjective evaluation, we designed 30 consultation questions categorized as physician-oriented (complex clinical decisions) or patient-oriented (common concerns) (Table 1). The physician-oriented responses were evaluated by a committee of nine experts, including clinical surgeons, dentists, and a biomedical scientist (PhD), ensuring accuracy in areas such as embryogenesis and molecular pathways. For patient-oriented questions, we implemented a dual-track evaluation in which three laypersons assessed Comprehensibility and Credibility, while the expert clinicians reviewed the same responses for Accuracy and Completeness. Regarding readability metrics, we deliberately excluded automated indices such as the Flesch–Kincaid Grade Level, Gunning Fog Index, and Coleman–Liau Index; these metrics are designed for English linguistic structures and do not accurately reflect the nuances of Chinese medical discourse. Instead, following the human-centered approach advocated by Gohari, we relied on human evaluation to capture medical clarity and to ensure that simplified language did not compromise factual integrity.19 For objective assessment, a standardized test of 135 multiple-choice questions was compiled from professional dental qualification examinations.
Table 1. Physician- and patient-oriented consultation questions.
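The paper does not publish its scoring harness for the objective test, but the procedure reduces to extracting each model's chosen option letter and computing an accuracy rate. A minimal Python sketch, in which the item format and the ask_model callable are assumptions:

```python
# Plausible scoring sketch for the 135-item objective test; the item format
# and the ask_model callable are assumptions, not the authors' harness.
import re

def extract_choice(response: str) -> str | None:
    """Pull the first standalone option letter (A-E) from a model response."""
    match = re.search(r"\b([A-E])\b", response.upper())
    return match.group(1) if match else None

def score_model(ask_model, questions: list[dict]) -> float:
    """questions: [{'prompt': ..., 'answer': 'A'}, ...]; returns accuracy."""
    correct = sum(
        extract_choice(ask_model(q["prompt"])) == q["answer"]
        for q in questions
    )
    # The reported 71.85% corresponds to 97 of 135 items answered correctly.
    return correct / len(questions)
```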
Bias and verification
To mitigate potential AI hallucinations, every AI-generated Q&A pair underwent strict one-to-one manual verification by the expert physicians. Subjective scoring was conducted on a 5-point Likert scale, ranging from 1 (Poor) to 5 (Excellent) (Table 2). Any discrepancies in scoring were resolved through a consensus-based arbitration model by senior experts to minimize individual subjective bias and ensure the reliability of the evaluation process.20,21
Table 2. Likert scale evaluation criteria.
Statistical methods
Statistical analyses were performed using R software (version 4.4.2). Subjective evaluation data are expressed as mean ± standard deviation (SD). Comparisons among multiple groups were conducted using the Kruskal–Wallis test, and where statistically significant differences were identified (p < 0.05), pairwise comparisons were performed using Dunn's post hoc test with Benjamini–Hochberg correction. Accuracy rates for the objective multiple-choice test were compared using the Chi-square test. All statistical significance was defined at a two-sided p-value < 0.05.
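For readers who wish to reproduce this workflow, the analysis can be expressed compactly. The study used R 4.4.2, so the Python translation below (using scipy, pandas, and scikit-posthocs, with hypothetical example scores and counts rather than the study's raw data) is an illustrative analogue, not the original analysis code:

```python
# Python analogue of the reported analysis pipeline (the study used R 4.4.2).
# scores_by_model holds hypothetical Likert ratings, not the study's data.
import pandas as pd
from scipy.stats import kruskal, chi2_contingency
import scikit_posthocs as sp

scores_by_model = {
    "CLP-GPT":        [5, 4, 5, 4, 4, 5],
    "GPT-4o":         [4, 5, 4, 4, 5, 4],
    "Gemini-1.5-Pro": [3, 4, 3, 4, 3, 4],
}

# Omnibus Kruskal-Wallis test across model groups.
H, p = kruskal(*scores_by_model.values())
print(f"Kruskal-Wallis H = {H:.2f}, p = {p:.3f}")

if p < 0.05:
    # Dunn's post hoc pairwise comparisons with Benjamini-Hochberg correction.
    long = pd.DataFrame(
        [(m, s) for m, vals in scores_by_model.items() for s in vals],
        columns=["model", "score"],
    )
    print(sp.posthoc_dunn(long, val_col="score", group_col="model",
                          p_adjust="fdr_bh"))

# Chi-square test on objective-test accuracy: one (correct, incorrect) row
# per model out of 135 items; the counts below are hypothetical.
counts = [[97, 38], [95, 40], [93, 42]]
chi2, p_mcq, dof, expected = chi2_contingency(counts)
print(f"Chi-square = {chi2:.2f}, p = {p_mcq:.2f}")
```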
Results
Participants and data sources
The evaluation involved five language models: CLP-GPT, Qwen2-7B-Instruct (baseline), Claude-3.5-Sonnet, GPT-4o, and Gemini-1.5-Pro. Subjective assessment was performed by a panel of nine experts and three laypersons. Objective assessment utilized 135 standardized questions derived from medical licensing examinations (Table 3).
Table 3. Types of multiple-choice questions.
Descriptive data
See Table 4.
Table 4. Descriptive statistics of different models (N = 30).
Note: Data are presented as Mean ± SD. Detailed statistics including Median and 95% CI are provided in Supplementary Table S1.
CLP: cleft lip and palate; SD: standard deviation; GPT: generative pre-trained transformer.
Main results
Doctor-side evaluation
In the accuracy dimension, CLP-GPT (Mean = 4.37, SD = 0.76) significantly outperformed Claude-3.5-Sonnet (p = 0.018) and Gemini-1.5-Pro (p < 0.001). While CLP-GPT achieved nominally higher scores than GPT-4o, the difference was not statistically significant (p > 0.05) (Figure 2).

Figure 2. (a) Heatmap for open-ended questions from healthcare providers; (b) heatmap for open-ended questions from patients; (c) accuracy rates on the multiple-choice questions.
Patient-side evaluation
For patient-oriented questions, the evaluation was split into two perspectives:
Layperson assessment (Readability): CLP-GPT achieved high scores in Credibility (4.66) and Comprehensibility (4.46), performing comparably to GPT-4o and Claude-3.5-Sonnet (p > 0.05).
Expert assessment (Medical Fact-Checking): In terms of Accuracy and Completeness for these patient questions, CLP-GPT also showed no significant difference compared to the leading large models (Table 5).
Table 5. Statistical comparison of CLP-GPT against other models across evaluation dimensions (N = 30).
CLP: cleft lip and palate; GPT: generative pre-trained transformer. Bold type indicates statistically significant differences (p < 0.05).
Objective testing
CLP-GPT achieved the highest accuracy (71.85%) on the 135 multiple-choice questions. However, Chi-square analysis indicated no statistically significant difference among the top models (p = 0.75), suggesting comparable competency in standardized testing.
Discussion
Principal findings and performance context
In this study, we successfully developed CLP-GPT, a domain-specific model that achieves performance parity with state-of-the-art general LLMs (GPT-4o, Claude-3.5-Sonnet) in the field of CLP. While CLP-GPT did not “far surpass” the massive general models on every metric, the result is striking given the disparity in model size (7B vs. estimated trillions of parameters). The “ceiling effect” observed in objective testing suggests that general models are already highly competent in standardized medical knowledge; CLP-GPT nevertheless demonstrates that a specialized model can match this competence at a fraction of the computational cost.
Clinical implications and digital equity
The primary value of CLP-GPT lies not just in raw accuracy, but in its potential to democratize access to specialized care. Firstly, regarding patient education, the model can serve as a 24/7 FAQ tool, generating personalized, readable, and medically accurate explanations for parents, thereby reducing anxiety and misinformation. Secondly, in terms of clinical support, particularly for primary care providers in non-specialized centers, CLP-GPT can act as a decision-support tool, offering guidance on referral timing and preoperative care based on standard guidelines. Furthermore, this approach helps bridge the digital divide. Unlike cloud-based giants requiring high-bandwidth internet and expensive subscriptions, CLP-GPT's lightweight architecture allows for local deployment on modest hardware. This is crucial for hospitals in resource-limited settings or rural areas, ensuring that digital health innovations benefit populations that are often left behind.
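To make the local-deployment claim concrete, the following sketch shows how a 7B model of this kind could be served on a single consumer GPU with 4-bit quantization. The checkpoint path and quantization settings are assumptions, since the CLP-GPT weights are available only on request rather than publicly archived:

```python
# Hypothetical local-inference sketch: load a fine-tuned 7B checkpoint with
# 4-bit quantization so it runs on modest hardware. The "./clp-gpt" path and
# quantization settings are assumptions, not the authors' released setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant = BitsAndBytesConfig(load_in_4bit=True,
                           bnb_4bit_compute_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("./clp-gpt")
model = AutoModelForCausalLM.from_pretrained(
    "./clp-gpt", quantization_config=quant, device_map="auto"
)

messages = [{"role": "user",
             "content": "When is palatoplasty typically scheduled for an infant?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```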
Limitations
To clearly define the scope of this study and guide future research, we acknowledge several important limitations. First, regarding data generalizability, the training and evaluation datasets were constructed primarily from Chinese-language sources. This restricts the model's immediate global application, as inherent linguistic and cultural biases from the dialogue corpora likely remain despite the incorporation of international literature. Consequently, substantial multilingual fine-tuning and cross-cultural validation are required.
Second, the evaluation design presents certain constraints. The clinical experts who participated in the dataset curation were also involved in the subjective evaluation, introducing a potential for confirmation bias. Future validation studies should employ independent, third-party evaluators who are blind to the dataset's construction process. Additionally, the subjective assessment relied on a curated set of 30 consultation questions; while representative, this sample size may be too small to robustly generalize the findings regarding the model's nuanced interactive performance. Furthermore, while we utilized a consensus-based arbitration model to resolve scoring disagreements, we did not formally calculate an inter-rater reliability coefficient (e.g. Cohen's kappa) prior to resolution.
Third, the study's methodological scope was intentionally focused on a single model scale (7-billion parameters). This prevents definitive claims about the entire spectrum of “small models,” and a comparative analysis including even smaller models is necessary. We also did not investigate hybrid approaches, such as combining CLP-GPT with retrieval-augmented generation, which represents a key area for future enhancement to mitigate hallucinations. Finally, the most significant limitation is the absence of real-world clinical validation. As all evaluations were conducted in a controlled, offline environment, the study's claims regarding practical clinical utility and safety remain preliminary. Future implementation research must address practical barriers such as EHR integration and patient data privacy.
Comparison with existing methods
This study contributes to the growing body of evidence, alongside models like Phi-2, that supports a paradigm shift towards smaller, specialized language models. Our primary contributions are threefold. First, regarding validation in a complex domain, we chose CLP—a field requiring deep, integrated knowledge from surgery, dentistry, and speech therapy—as a rigorous test case. Our results validate that a parameter-efficient model can effectively handle this complexity, a scenario where general-purpose models often lack the necessary depth. Second, we provide a replicable framework for low-resource specialization. By detailing the methodology for creating a high-quality, domain-specific dataset and fine-tuning an open-source model (Qwen2-7B-Instruct) using LoRA, we offer a cost-effective and accessible technical pathway for other researchers and institutions to develop specialized AI tools, particularly in resource-limited settings. Finally, we employed a more holistic evaluation framework. By moving beyond single-dimensional accuracy metrics and incorporating a dual-perspective (physician and patient) evaluation, our study offers a more comprehensive assessment of a model's practical clinical utility. This approach addresses the critical need for models to be not only factually correct but also credible and comprehensible to patients.
Conclusion
This study demonstrates that a parameter-efficient, domain-specific language model, CLP-GPT, can achieve a quality of clinical communication that is statistically comparable to leading large-scale models in many key aspects, while being significantly more resource-efficient. Although not definitively superior in all objective metrics, its strong performance within the specialized and complex domain of CLP highlights a viable and cost-effective pathway for developing targeted AI tools. This work provides a replicable technical framework that holds promise for bridging knowledge gaps and enhancing healthcare accessibility, particularly in primary care and resource-limited environments. It substantiates the practical value of shifting focus towards specialized AI solutions that can effectively augment clinical expertise and facilitate shared decision-making in medicine.
Declarations
Ethics approval and consent to participate
This study adhered to the ethical principles outlined in the Declaration of Helsinki and was reviewed and approved by the Human Research Ethics Committee of Xiamen Maternity and Child Health Care Hospital (Approval No. KY-2023-096-H01). Written informed consent was obtained from the two attending physicians who participated in this study. The study utilized three publicly available and anonymized datasets; the requirement for informed consent from the original data subjects was waived by the same committee because the data were publicly accessible and de-identified.
This study adhered to the ethical principles outlined in the Declaration of Helsinki. This study was reviewed and approved by the Human Research Ethics Committee of Xiamen Maternity and Child Health Care Hospital (Approval No. KY-2023-096-H01). Written informed consent was obtained from the two attending physicians who participated in this study. The study utilized three publicly available and anonymized datasets. The requirement for informed consent from the original data subjects was waived by the Human Research Ethics Committee of Xiamen Maternity and Child Health Care Hospital as the data was publicly accessible and de-identified.
Consent for publication
Not applicable.
Authors’ contributions
As the first author, XH coordinated the overall execution of the study, including formulation of the research protocol, development of the experimental methodology, design and optimization of the model architecture, and drafting of the manuscript. XZ and JW contributed equally as co-first authors, listed second and third, respectively: XZ provided expertise in medical content analysis, while JW undertook the fine-tuning of the model, encompassing parameter adjustments, optimization of model performance, and experimental validation. KZ and BT conducted data collection and question compilation, as well as screening, cleaning, and pre-processing of the large-scale corpus. JW and LC contributed to the development of the benchmarking programme, formulation of the experimental evaluation methods, and establishment of the evaluation system. HLY and GL, the corresponding authors, proposed the initial research concept, guided the research direction, oversaw overall project planning, and carried out critical revisions of the manuscript. All authors participated in discussions of the research findings, reviewed manuscript drafts, and collectively approved the final submitted version.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the Natural Science Foundation of Fujian Province project (2023J011610). The funding agency did not participate in the experimental design or the formation of conclusions. The views expressed in this article are those of the authors and may not represent the views of the funding agency.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Availability of data and materials
The curated datasets utilized for model training and validation, as well as all analytical outputs reported in this study, can be obtained from the corresponding author upon reasonable request. The three publicly available datasets analyzed during the current study are available at the following repositories: (1) the HuatuoGPT Clinical Dialogue Dataset, at https://huggingface.co/datasets/FreedomIntelligence/HuatuoGPT-sft-data-v1; (2) the Chinese Medical Dialogue Dataset, at https://huggingface.co/datasets/BillGPT/Chinese-medical-dialogue-data; and (3) the cMedQA dataset, on GitHub. The model weights and training code developed for this study are not publicly archived due to intellectual property restrictions and ethical considerations regarding the generation of unsupervised medical advice; however, they are available from the corresponding author upon reasonable request for academic research purposes.
Guarantor
Dr Guorong Lyu, Department of Ultrasound Medicine, The Second Affiliated Hospital of Fujian Medical University, Quanzhou, Fujian Province, and Dr Haolun Yan, Department of Pediatrics, Women and Children's Hospital, School of Medicine, Xiamen University, serve as co-guarantors for this work. They accept full responsibility for the integrity of the study, the accuracy of the data and analysis, and the decision to publish. Both had full access to all the data and will respond to any inquiries regarding the work.
Supplemental material
Supplemental material for this article is available online.
References