Sage Journals: Discover world-class research

Abstract

Background

General-purpose large language models (LLMs) often lack specialized knowledge for cleft lip and palate (CLP), limiting clinical utility. Accessible, accurate, and affordable AI tools could potentially supplement existing educational resources for clinicians and patients.

Objective

To develop and evaluate CLP-generative pre-trained transformer (GPT), a parameter-efficient model fine-tuned for CLP, and to assess whether a smaller specialized model can achieve performance comparable to large generalist models for clinical deployment.

Method

We built a CLP dataset of 7815 validated Q&A pairs from medical dialogs, literature, and guidelines. Using Qwen2-7B-Instruct as the base, we applied low-rank adaptation for domain adaptation. Performance was assessed via physician- and patient-oriented subjective evaluations and an objective test of 135 specialized multiple-choice questions.

Result

CLP-GPT showed significant improvements over its baseline across all dimensions (p < 0.001). In physician evaluations, its accuracy surpassed Claude-3.5-Sonnet and Gemini-1.5-Pro (p < 0.05) and showed no significant differences from models like GPT-4o (p > 0.05). In patient evaluations, CLP-GPT performed comparably to leading large models across all dimensions (p > 0.05). On the objective test, it achieved the highest accuracy (71.85%), though differences among models were not statistically significant.

Conclusion

CLP-GPT, a parameter-efficient, domain-specific model, delivers performance comparable to other LLMs in CLP. This demonstrates that cost-effective, specialized models can achieve high performance without massive computational resources.

Keywords

Cleft lip and palate large language model domain-specific model parameter-efficient fine-tuning clinical decision support digital health equity

Introduction

Cleft lip and palate (CLP) is one of the most common congenital craniofacial anomalies, affecting approximately 1 in 700 newborns worldwide.¹ The management of CLP is a lifelong process necessitating a multidisciplinary team approach involving surgery, orthodontics, speech therapy, and genetics.^2,3 CLP serves as a strategic validation domain for specialized AI because it follows strict, internationally consensus-based clinical pathways, requires the integration of heterogeneous knowledge, and addresses digital health equity for underserved populations.^4,5 Despite established clinical protocols, patients and primary care providers often face significant barriers in accessing specialized knowledge, while existing educational frameworks frequently suffer from outdated information or a lack of personalized interaction.⁶

The emergence of large language models (LLMs) offers a potential solution to these knowledge gaps.^7,8,9 While prior AI research in CLP has predominantly focused on image-based diagnosis or surgical outcome prediction,¹⁰ general-purpose LLMs demonstrate capabilities in medical question-answering but are prone to “hallucinations” and high computational costs.^11,12 Recent research suggests that smaller models, when fine-tuned on high-quality domain-specific data, can achieve performance comparable to larger models while being more efficient to deploy.^13,14,15,16 Consequently, this study aims to develop CLP-generative pre-trained transformer (GPT) by fine-tuning a parameter-efficient model (Qwen2-7B) using low-rank adaptation (LoRA) on an expert-verified dataset.^17,18 We hypothesize that this specialized model can offer a reliable, cost-effective alternative to leading generalist models for clinical decision support and patient education, as measured by standard accuracy and reliability metrics.

Methods

Study design and setting

We developed and evaluated CLP-GPT, a LLM specialized for CLP. The model is based on the Qwen2-7B-Instruct architecture, selected for its optimal balance between clinical performance and computational efficiency in moderately resourced medical settings. While smaller models exist, the 7B scale was prioritized to capture the multifaceted complexity of CLP-specific knowledge. For benchmarking, CLP-GPT was compared against Claude-3.5-Sonnet, GPT-4o, and Gemini-1.5-Pro. Specialized models like Med-PaLM 2 were excluded as they are not currently open-source or accessible via API for independent validation.

Data sources and curation

The dataset construction followed a rigorous multi-stage pipeline supervised by two attending physicians (X.H. and X.Z.) with over 5 years of clinical experience in CLP (Figure 1). Initially, we extracted 1,119,884 Q&A pairs from public medical datasets, including HuatuoGPT, the Chinese Medical Dialogue Dataset, and cMedQA. Through keyword-based filtering and independent expert review, this pool was reduced to 3640 high-quality pairs. To address gaps in complex academic knowledge, we mined 608 peer-reviewed articles and nine clinical guidelines from PubMed and CNKI. We utilized the GPT-4 API to generate candidate Q&A pairs from these technical texts, and after removing redundancies and ambiguous items, the final dataset comprised 7815 validated Q&A pairs (Figure 1).

Figure 1.

The data curation and model development pipeline for CLP-GPT. The workflow demonstrates a multi-stage process: (1) Initial extraction of over 1.1 million raw Q&A pairs from public medical datasets; (2) Expert filtering to identify high-quality subsets; (3) Literature expansion via GPT-4 API utilizing peer-reviewed articles and clinical guidelines; (4) Human-in-the-loop manual verification by attending physicians; and (5) Final refinement to eliminate redundancies, resulting in the final validated dataset of 7,815 Q&A pairs.

Variables and measurement

Performance was assessed through subjective evaluations and an objective standardized test. For subjective evaluation, we designed 30 consultation questions categorized into physician-oriented (complex clinical decisions) and patient-oriented (common concerns) (Table 1). The physician-oriented responses were evaluated by a committee of nine experts, including clinical surgeons, dentists, and a biomedical scientist (PhD), ensuring accuracy in areas such as embryogenesis and molecular pathways. For patient-oriented questions, we implemented a Dual-Track evaluation where three laypersons assessed Comprehensibility and Credibility, while the expert clinicians reviewed the same responses for Accuracy and Completeness. Regarding readability metrics, we deliberately excluded automated indices such as the Flesch–Kincaid Grade Level, Gunning Fog Index, or Coleman–Liau Index. These metrics are primarily designed for English linguistic structures and do not accurately reflect the nuances of Chinese medical discourse. Instead, following the human-centered approach advocated by Gohari, we utilized human evaluation to capture medical clarity and ensure that simplified language did not compromise factual integrity.¹⁹ For objective assessment, a standardized test of 135 multiple-choice questions was compiled from professional dental qualification examinations.

Table 1.

Doctor–patient bilateral issues.

NO	Patient-side issues	Doctor-side issues
1	What is the incidence rate of cleft lip and palate among newborns? How common is this condition?	What are the epidemiological trends in global cleft lip and palate incidence rates? Are there significant geographical and racial differences?
2	Is there a gender disparity in the occurrence of cleft lip and palate between males and females?	What roles do oxidative stress and epigenetic modifications play in the pathogenesis of cleft lip and palate? How do these factors affect gene expression?
3	What are the primary etiological factors of cleft lip and palate? Which lifestyle factors may increase the risk?	What are the regulatory relationships between key signaling pathways (such as Shh, Wnt, BMP) during lip and palate development?
4	Do environmental factors such as air pollution or psychological stress potentially increase the risk of cleft formation?	What are the molecular regulatory mechanisms of embryonic lip and palate fusion? What roles do cell adhesion, apoptosis, and extracellular matrix remodeling play?
5	During which gestational period does cleft lip and palate typically develop?	Using 3D imaging technology, what are the characteristic changes in facial soft tissue and skeletal structure in cleft lip and palate patients? How do these changes relate to clinical phenotypes?
6	What prenatal diagnostic methods are available to detect cleft lip and palate?	What are the currently used international classification systems for cleft lip and palate? How do these systems compare in terms of advantages and disadvantages?
7	What interventions are available following prenatal diagnosis of cleft lip and palate?	What is the potential of emerging prenatal diagnostic technologies (such as 3D/4D ultrasound, fetal MRI) in improving the diagnostic accuracy of cleft lip and palate? What advantages do they have compared to conventional 2D ultrasound?
8	What is the diagnostic accuracy of prenatal screening? What is the false-positive rate?	Besides physical examination, what other auxiliary examination methods can be used for cleft lip and palate diagnosis and evaluation? What are the applications and limitations of various methods?
9	What is the typical duration of the complete treatment protocol for cleft lip and palate?	Systematically evaluate the clinical evidence of 3D printing models in improving the precision of cleft lip and palate surgery. In which types of cleft repairs is it most valuable?
10	What are the indications and benefits of Pre-Surgical Infant Orthopedics (PSIO) and Presurgical Nasoalveolar Molding (PNAM)?	How should long-term follow-up studies be designed to compare different cleft lip and palate treatment options? What key outcome measures need to be considered?
11	When should orthodontic intervention begin for a child with cleft lip and palate? What is the complexity of treatment?	Complications in PSIO and PNAM treatment: What are the incidence rates, prevention strategies, and management methods for common complications?
12	What is the expected scarring after cleft repair surgery? What scar management options are available?	Long-term effects of early orthopedic intervention on jaw development: How to balance the benefits of early intervention with potential growth inhibition risks?
13	Besides plastic surgeons, which specialists comprise the multidisciplinary cleft care team?	Success rates and patient compliance with PSIO and PNAM: What are the key factors affecting treatment outcomes and compliance? How can compliance be improved?
14	What speech disorders are associated with cleft lip and palate? What speech therapy protocols are effective?	Comparison of staged versus single-stage cleft surgery long-term outcomes: How do these strategies differently affect facial growth, speech development, and psychosocial adaptation?
15	What are the success criteria for cleft repair surgery? What is the expected recovery timeline?	Complications of cleft lip and palate repair surgery: What are the incidence rates, risk factors, and prevention strategies for short-term and long-term complications?
16	Is there an association between cleft lip and palate and sleep apnea?	Orthodontic treatment and bone grafting coordination strategies: How to optimize orthodontic treatment timing and methods to complement bone grafting and improve overall treatment outcomes?
17	What are the systemic health implications and activity restrictions associated with cleft lip and palate?	Complications during orthodontic treatment in cleft patients: What are the incidence rates, risk factors, and prevention strategies for dental and periodontal complications?
18	What are the psychosocial impacts on families with children affected by cleft lip and palate? What support resources are available?	Pathophysiological mechanisms of Obstructive Sleep Apnea (OSA) in cleft patients: How do upper airway anatomical abnormalities lead to OSA?
19	What is the timeline for resuming normal feeding post-surgery? What are the dietary recommendations?	Screening and diagnostic strategies for cleft-related OSA: Which screening tools are most suitable for this population? What are the special considerations for polysomnography?
20	How does cleft lip and palate affect facial growth? When are secondary facial procedures indicated?	Special oral hygiene measures for cleft patients: Which preventive strategies are most effective? How can patient compliance be improved?
21	What is the optimal timing for orthodontic intervention in cases of dental anomalies associated with cleft lip and palate?	Diagnostic strategies for cleft-related syndromes: Which clinical features suggest potential syndromes? What is the role of genetic testing in diagnosis?
22	What psychosocial support strategies are effective in promoting self-esteem and social integration?	Effects of cleft lip and palate on immune function and infection risk: Are there specific immune function abnormalities? How can infection prevention strategies be optimized for these patients?
23	What is the typical number of surgical interventions required? What are the recommended intervals between procedures?	Long-term impact of cleft lip and palate on academic and career development: Which factors (such as speech disorders, appearance) most affect patients’ educational and career choices?
24	What rehabilitation milestones should be achieved before school entry?	Genetic susceptibility to cleft lip and palate: Which genes and genomic regions are most associated with non-syndromic cleft risk?
25	What are the components of treatment costs? What insurance coverage and financial assistance options are available?	Genetic counseling strategies for cleft family history: How to accurately assess and communicate recurrence risk? What factors affect risk assessment accuracy?
26	What is the role of genetic factors? How does family history affect risk assessment?	Application of genetic testing in cleft risk assessment: What is the clinical value of whole exome sequencing and targeted gene panels?
27	What lifestyle modifications during pregnancy may reduce the risk of cleft formation?	Development and validation of cleft risk prediction models: How to integrate genetic and environmental factors to build accurate risk prediction models?
28	Besides air pollution, what other environmental factors (water quality, chemical exposure, etc.) may be associated with cleft formation?	How to obtain the mid-sagittal plane view of the fetus in ultrasound?
29	How does the accuracy of cleft detection vary across different gestational stages?	How to obtain the palatal line in ultrasound?
30	What conditions necessitate multiple surgical interventions? What are the recommended intervals between procedures?	What is the palatal line?

Bias and verification

To mitigate potential AI hallucinations, every AI-generated Q&A pair underwent strict one-to-one manual verification by the expert physicians. Subjective scoring was conducted on a 5-point Likert scale, ranging from 1 (Poor) to 5 (Excellent) (Table 2). Any discrepancies in scoring were resolved through a consensus-based arbitration model by senior experts to minimize individual subjective bias and ensure the reliability of the evaluation process.^20,21

Table 2.

Likert scale evaluation criteria.

Dimension	1 point (Strongly disagree)	2 points (Disagree)	3 points (Neutral)	4 points (Agree)	5 points (Strongly agree)
Comprehensibility (Patient Assessment)	Cannot understand the response at all; completely unclear what it means	Need multiple readings to understand small portions of content	Can understand main points but some parts remain unclear	Most content is understandable with only a few unclear terms	Response is completely clear and easily understood
Credibility (Patient Assessment)	Response seems completely unreliable and made up	Major doubts about the truthfulness of the response	Partially credible but not fully convincing	Response seems fairly credible and convincing	Response appears highly professional and fully trustworthy
Completeness (Doctor Assessment)	Completely fails to address key points	Addresses only few points, missing crucial content	Covers main points but lacks comprehensiveness	Comprehensive with only minor omissions	Fully comprehensive, covering all points
Accuracy (Evaluated by Experts on Patient Questions)	Information completely incorrect, contradicts medical facts	Majority of information is inaccurate	Partially accurate with significant errors	Mostly accurate with minor deviations	Completely accurate, aligned with current medical consensus

Statistical methods

Statistical analyses were performed using R software (version 4.4.2). Subjective evaluation data are expressed as mean ± standard deviation (SD). Comparisons among multiple groups were conducted using the Kruskal–Wallis test, and where statistically significant differences were identified (p < 0.05), pairwise comparisons were performed using Dunn's post hoc test with Benjamini–Hochberg correction. Accuracy rates for the objective multiple-choice test were compared using the Chi-square test. All statistical significance was defined at a two-sided p-value < 0.05.

Result

Participants and data sources

The evaluation involved five language models: CLP-GPT, Qwen2-7B-Instruct (baseline), Claude-3.5-Sonnet, GPT-4o, and Gemini-1.5-Pro. Subjective assessment was performed by a panel of nine experts and three laypersons. Objective assessment utilized 135 standardized questions derived from medical licensing examinations (Table 3).

Table 3.

Types of single-choice questions.

Category	Proportion	Case presentation
Basic knowledge	31(23%)	Question: Which of the following statements is incorrect? Options: A. Failure of fusion between the ipsilateral maxillary prominence and the medial nasal prominence results in a unilateral cleft of the upper lip. B. Lack of fusion between the maxillary and lateral nasal prominences leads to an oblique facial cleft. C. An isolated cleft of the soft palate manifests exclusively as a midline defect because the palatal shelves fuse in an anteroposterior direction. D. Cleft lip and palate arise from multifactorial disruption of normal embryonic prominence development and fusion. E. Cleft lip and palate occur within the first 12 weeks of gestation, before fetal maturation is complete. Answer: C
Comprehensive treatment & rehabilitation	17(12.6%)	Question: For a newborn with a complete right-sided cleft lip and palate, the appropriate time for speech evaluation and speech therapy is: Options: A: 4 to 6 years of age B: 9 to 12 years of age, C: 13 to 18 years of age D: After 25 years of age E: Any time after surgery Answer: A
Diagnosis & manifestation	28(20.7%)	Question: Which of the following is NOT a feature of microform cleft lip? Options: A. Separation of the Cupid's bow peaks B. Presence of creases in the nasolabial fold C. Intact epidermis without a visible cleft D. Superficial cutaneous groove on the affected side E. Failure of fusion of the underlying orbicularis oris muscle Answer: B
Early Postoperative Management	14(10.4%)	Question: A 7-month-old boy presented to our hospital with congenital bilateral complete cleft lip and alveolus that had remained untreated since birth. Physical examination revealed a full-thickness cleft extending from the vermilion to the nasal floor bilaterally, marked collapse of both alar bases, and a partial alveolar cleft on the left side, whereas the palate was intact. Following surgical repair, if significant wound tension is observed, the lip-adhesive bow should be maintained for a minimum of Options: A. 7 days B. 10 days C. 14 days D. 1 month E. 6 months Answer: B
Prognosis& complication management	10(7.4%)	Question: Primary cause of fistula formation (recurrent cleft) after cleft palate repair, Options: A. Excessive tension during wound closure B. Infection C. Postoperative agitation or crying in the pediatric patient D. nsufficient soft and hard palate tissue E. Trauma Answer: A
Surgical treatment	35(25.9%)	Question: In cleft palate repair surgery, fracture of the pterygoid hamulus is performed to relieve the tension of which muscle? Options: A. Palatoglossus muscle B. veli palatini muscle C. Levator veli palatini muscle D.Superior pharyngeal constrictor muscle E. Palatopharyngeus muscle Answer:B
Total	135(100％)

Descriptive data

See Table 4.

Table 4.

Descriptive statistics of different models (N = 30).

	Mean ± SD
	Doctor-side		Patient-side
Model	Accuracy	Comprehensiveness	Credibility	Comprehensibility	Accuracy	Comprehensiveness
CLP-GPT	4.37 ± 0.76	4.07 ± 0.87	4.66 ± 0.53	4.46 ± 0.53	4.63 ± 0.56	4.57 ± 0.50
GPT-4o	3.87 ± 0.90	3.57 ± 0.86	4.44 ± 0.50	4.48 ± 0.66	4.63 ± 0.72	4.27 ± 0.83
Claude-3.5-Sonnet	3.77 ± 0.63	3.73 ± 0.87	4.63 ± 0.47	4.49 ± 0.49	4.63 ± 0.49	4.63 ± 0.49
Gemini-1.5-pro	3.30 ± 0.88	2.90 ± 0.71	4.00 ± 0.69	4.30 ± 0.62	4.00 ± 0.64	4.03 ± 0.85
Qwen2-7B-Instruct	2.70 ± 0.84	2.40 ± 0.67	2.40 ± 0.67	3.57 ± 0.63	3.43 ± 0.86	3.43 ± 0.86

Note: Data are presented as Mean ± SD. Detailed statistics including Median and 95% CI are provided in Supplementary Table S1.

CLP: cleft lip and palate; SD: standard deviation; GPT: generative pre-trained transformer.

Main results

Doctor-side evaluation: In the accuracy dimension, CLP-GPT (Mean = 4.37, SD = 0.76) significantly outperformed Claude-3.5-Sonnet (p = 0.018) and Gemini-1.5-Pro (p < 0.001). While CLP-GPT achieved higher nominal scores than GPT-4o, the difference was not statistically significant (p > 0.05) (Figure 2).

Figure 2.

(a) Heatmap of open-ended questions from healthcare providers, (b) heatmap of open-ended questions from patients, (c) accuracy rate of single-choice question.

Patient-side evaluation

For patient-oriented questions, the evaluation was split into two perspectives:

Layperson assessment (Readability): CLP-GPT achieved high scores in Credibility (4.66) and Comprehensibility (4.46), performing comparably to GPT-4o and Claude-3.5-Sonnet (p > 0.05).

Expert assessment (Medical Fact-Checking): In terms of Accuracy and Completeness for these patient questions, CLP-GPT also showed no significant difference compared to the leading large models (Table 5).

Table 5.

Statistical comparison of CLP-GPT against other models across evaluation dimensions (N = 30).

Evaluation types	Evaluation dimensions	CLP-GPT vs. different models
Evaluation types	Evaluation dimensions	Claude-3.5-Sonnet	GPT-4o	Gemini-1.5-pro	Qwen2-7B-Instruct
Doctor-side evaluation	Accuracy	−2.616 (p < 0.05)	−1.863 (p < 0.05)	−4.420 (p < 0.05)	6.848 (p < 0.05)
Doctor-side evaluation	Comprehensiveness	−1.161 (p = 0.137)	−1.637 (p = 0.064)	−4.446 (p < 0.05)	6.397 (p < 0.05)
Patient-side evaluation	Credibility	−0.200 (p = 0.421)	−1.399 (p = 0.101)	−3.735 (p < 0.05)	5.681 (p < 0.05)
	Comprehensibility	0.218 (p = 0.460)	0.373 (p = 0.443)	−0.826 (p = 0.292)	4.708 (p < 0.05)
	Accuracy	−0.088 (p = 0.465)	0.263 (p = 0.396)	−3.538 (p < 0.05)	5.519 (p < 0.05)
	Comprehensiveness	−0.117 (p = 0.453)	−1.211 (p = 0.113)	−3.618 (p < 0.05)	4.881 (p < 0.05)

CLP: cleft lip and palate; GPT: generative pre-trained transformer. Bold type indicates statistically significant differences (p < 0.05).

Objective testing

CLP-GPT achieved the highest accuracy (71.85%) on the 135 multiple-choice questions. However, Chi-square analysis indicated no statistically significant difference among the top models (p = 0.75), suggesting comparable competency in standardized testing.

Discussion

Principal findings and performance context

In this study, we successfully developed CLP-GPT, a domain-specific model that achieves performance parity with state-of-the-art general LLMs (GPT-4o, Claude-3.5) in the field of CLP. While CLP-GPT did not “far surpass” the massive general models in all metrics, this result is highly significant given the disparity in model size (7B vs. estimated trillions of parameters). The “ceiling effect” observed in objective testing suggests that general models are already highly competent in standardized medical knowledge; however, CLP-GPT proves that a specialized model can match this competence with a fraction of the computational cost.

Clinical implications and digital equity

The primary value of CLP-GPT lies not just in raw accuracy, but in its potential to democratize access to specialized care. Firstly, regarding patient education, the model can serve as a 24/7 FAQ tool, generating personalized, readable, and medically accurate explanations for parents, thereby reducing anxiety and misinformation. Secondly, in terms of clinical support, particularly for primary care providers in non-specialized centers, CLP-GPT can act as a decision-support tool, offering guidance on referral timing and preoperative care based on standard guidelines. Furthermore, this approach helps bridge the digital divide. Unlike cloud-based giants requiring high-bandwidth internet and expensive subscriptions, CLP-GPT's lightweight architecture allows for local deployment on modest hardware. This is crucial for hospitals in resource-limited settings or rural areas, ensuring that digital health innovations benefit populations that are often left behind.

Limitations

To clearly define the scope of this study and guide future research, we acknowledge several important limitations. First, regarding data generalizability, the training and evaluation datasets were constructed primarily from Chinese-language sources. This restricts the model's immediate global application, as inherent linguistic and cultural biases from the dialogue corpora likely remain despite the incorporation of international literature. Consequently, substantial multilingual fine-tuning and cross-cultural validation are required.

Second, the evaluation design presents certain constraints. The clinical experts who participated in the dataset curation were also involved in the subjective evaluation, introducing a potential for confirmation bias. Future validation studies should employ independent, third-party evaluators who are blind to the dataset's construction process. Additionally, the subjective assessment relied on a curated set of 30 consultation questions; while representative, this sample size may be too small to robustly generalize the findings regarding the model's nuanced interactive performance. Furthermore, while we utilized a consensus-based arbitration model to resolve scoring disagreements, we did not formally calculate an inter-rater reliability coefficient (e.g. Cohen's kappa) prior to resolution.

Third, the study's methodological scope was intentionally focused on a single model scale (7-billion parameters). This prevents definitive claims about the entire spectrum of “small models,” and a comparative analysis including even smaller models is necessary. We also did not investigate hybrid approaches, such as combining CLP-GPT with retrieval-augmented generation, which represents a key area for future enhancement to mitigate hallucinations. Finally, the most significant limitation is the absence of real-world clinical validation. As all evaluations were conducted in a controlled, offline environment, the study's claims regarding practical clinical utility and safety remain preliminary. Future implementation research must address practical barriers such as EHR integration and patient data privacy.

Comparison with existing methods

This study contributes to the growing body of evidence, alongside models like Phi-2, that supports a paradigm shift towards smaller, specialized language models. Our primary contributions are threefold. First, regarding validation in a complex domain, we chose CLP—a field requiring deep, integrated knowledge from surgery, dentistry, and speech therapy—as a rigorous test case. Our results validate that a parameter-efficient model can effectively handle this complexity, a scenario where general-purpose models often lack the necessary depth. Second, we provide a replicable framework for low-resource specialization. By detailing the methodology for creating a high-quality, domain-specific dataset and fine-tuning an open-source model (Qwen2-7B-Instruct) using LoRA, we offer a cost-effective and accessible technical pathway for other researchers and institutions to develop specialized AI tools, particularly in resource-limited settings. Finally, we employed a more holistic evaluation framework. By moving beyond single-dimensional accuracy metrics and incorporating a dual-perspective (physician and patient) evaluation, our study offers a more comprehensive assessment of a model's practical clinical utility. This approach addresses the critical need for models to be not only factually correct but also credible and comprehensible to patients.

Conclusion

This study demonstrates that a parameter-efficient, domain-specific language model, CLP-GPT, can achieve a quality of clinical communication that is statistically comparable to leading large-scale models in many key aspects, while being significantly more resource-efficient. Although not definitively superior in all objective metrics, its strong performance within the specialized and complex domain of CLP highlights a viable and cost-effective pathway for developing targeted AI tools. This work provides a replicable technical framework that holds promise for bridging knowledge gaps and enhancing healthcare accessibility, particularly in primary care and resource-limited environments. It substantiates the practical value of shifting focus towards specialized AI solutions that can effectively augment clinical expertise and facilitate shared decision-making in medicine.

Declarations ethics approval and consent to participate

This study adhered to the ethical principles outlined in the Declaration of Helsinki. This study was reviewed and approved by the Human Research Ethics Committee of Xiamen Maternity and Child Health Care Hospital (Approval No. KY-2023-096-H01). Written informed consent was obtained from the two attending physicians who participated in this study. The study utilized three publicly available and anonymized datasets. The requirement for informed consent from the original data subjects was waived by the Human Research Ethics Committee of Xiamen Maternity and Child Health Care Hospital as the data was publicly accessible and de-identified.

Supplemental Material

sj-docx-1-dhj-10.1177_20552076261416336 - Supplemental material for Cleft lip and palate-generative pre-trained transformer: A parameter-efficient, domain-specific language model for cleft lip and palate clinical applications

Supplemental material, sj-docx-1-dhj-10.1177_20552076261416336 for Cleft lip and palate-generative pre-trained transformer: A parameter-efficient, domain-specific language model for cleft lip and palate clinical applications by Xiaoqin He, Xiaohong Zhong, Jiaru Wang, Kaixuan Zhen, Jinzhun Wu, Boya Tian, Longbiao Chen, Haolun Yan and Guorong Lyu in DIGITAL HEALTH

Footnotes

ORCID iDs

Haolun Yan

Guorong Lyu

Consent for publication

Not applicable.

Authors’ contributions

As the first author, XH coordinated the overall execution of the study, including the formulation of the research protocol, development of experimental methodology, design and optimization of the model architecture, and drafting of the manuscript. XZ and JW both contributed equally as co-first authors, with XZ serving as the second co-first author and JW as the third co-first author. XZ provided expertise in medical content analysis while JW undertook the fine-tuning of the model, encompassing parameter adjustments, optimization of model performance, and experimental validation. KZ and BT conducted data collection and question compilation, as well as screening, cleaning, and pre-processing of the large-scale corpus. JW and LC contributed to the development of the benchmarking programme, formulation of experimental evaluation methods, and establishment of the evaluation system. HLY and GL, serving as the corresponding authors, proposed the initial research concept, guided the research direction, oversaw overall project planning, and carried out critical revisions of the manuscript. All authors participated in discussions of research findings, reviewed manuscript drafts, and collectively approved the final submitted version.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the Natural Science Foundation of Fujian Province project (2023J011610). The funding agency did not participate in the experimental design or the formation of conclusions. The views expressed in this article are those of the authors and may not represent the views of the funding agency.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Availability of data and materials

The curated datasets utilized for model training and validation, as well as all analytical outputs reported in this study, can be obtained from the corresponding author upon reasonable request. The three publicly available datasets analyzed during the current study are available at the following repositories. (1) The HuatuoGPT Clinical Dialogue Dataset is available at: https://huggingface.co/datasets/FreedomIntelligence/HuatuoGPT-sft-data-v1 (2)The Chinese Medical Dialogue Dataset is available at: https://huggingface.co/datasets/BillGPT/Chinese-medical-dialogue-data. (3)The cMedQA dataset is available on GitHub: . The model weights and training code developed for this study are not publicly archived due to intellectual property restrictions and ethical considerations regarding the generation of unsupervised medical advice; however, they are available from the corresponding author upon reasonable request for academic research purposes.

Guarantor

Dr Guorong Lyu, Department of Ultrasound Medicine, The Second Affiliated Hospital of Fujian Medical University, Quanzhou, Fujian Province, and Dr Haolun Yan, Department of Pediatrics, Women and Children's Hospital, School of Medicine, Xiamen University, serve as co-guarantors for this work. They accept full responsibility for the integrity of the study, the accuracy of the data and analysis, and the decision to publish. Both had full access to all the data and will respond to any inquiries regarding the work.

Supplemental material

Supplemental material for this article is available online.

References

Esmail

Abdo

Krentz

, et al. Centre-based statistics of cleft lip with/without alveolus and palate as well as cleft palate only patients in Aden, Yemen. J Craniomaxillofac Surg 2014; 42: 297–304.

Kuijpers-Jagtman

Kuijpers

. Cleft lip and palate: role of the orthodontist in the interdisciplinary management team. J Clin Orthod 2023: 128–149. DOI: 10.1002/9781119870081.ch7.

LaFrance

Weiss

Kazemi

, et al. Multidisciplinary teaming: enhancing collaboration through increased understanding. Behav Anal Pract 2019; 12: 709–726.

Chapman

Haby

Toma

, et al. Knowledge translation strategies for dissemination with a focus on healthcare recipients: an overview of systematic reviews. Implement Sci 2020; 15: 1–4.

Riera

de Oliveira Cruz Latorraca

Padovez

, et al. Strategies for communicating scientific evidence on healthcare to managers and the population: a scoping review. Health Res Policy Syst 2023; 21: 71.

Fang

Wang

. Enhancing patient education in cancer care: intelligent cancer patient education model for effective communication. Comput Biol Med 2024; 169: 107874.

Wang

Zhang

. Large language models in medical and healthcare fields: applications, advances, and challenges. Artif Intell Rev 2024; 57: 299.

Thirunavukarasu

Ting

Elangovan

, et al. Large language models in medicine. Nat Med 2023; 29: 1930–1940.

Bedi

Liu

Orr-Ewing

, et al. Testing and evaluation of health care applications of large language models: a systematic review. JAMA 2025; 333: 319–328.

10.

Huqh

MZU

Abdullah

Wong

, et al. Clinical applications of artificial intelligence and machine learning in children with cleft lip and palate—a systematic review. Int J Environ Res Public Health 2022; 19: 10860.

11.

Omiye

Gui

Rezaei

, et al. Large language models in medicine: the potentials and pitfalls: a narrative review. Ann Intern Med 2024; 177: 210–220.

12.

Alber

Yang

Alyakin

, et al. Medical large language models are vulnerable to data-poisoning attacks. Nat Med 2025; 31: 618–626.

13.

Javaheripi

Bubeck

Abdin

, et al. Phi-2: the surprising power of small language models. Microsoft Research Blog 2023; 1: 3.

14.

Anil

Dai

Firat

, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403. 2023.

15.

Anisuzzaman

Malins

Friedman

, et al. Fine-tuning large language models for specialized use cases. Mayo Clin Proc Digit Health 2025; 3: 100184.

16.

Tinn

Cheng

, et al. Fine-tuning large neural language models for biomedical natural language processing. Patterns 2023; 4: 100729.

17.

Shen

Wallis

, et al. Lora: low-rank adaptation of large language models. ICLR 2022; 1: 3.

18.

Team Q. Qwen2 technical report. arXiv preprint arXiv:2407.10671. 2024 December.

19.

Gohari

Baczynska

Weber

, et al. Online patient information on temporomandibular disorders provided by UK NHS hospitals: assessment and improvement of readability standards using AI-chatbots. Br J Oral Maxillofac Surg 2025. Advance online publication. DOI: 10.1016/j.bjoms.2025.08.008

20.

Joshi

Kale

Chandel

, et al. Likert Scale: explored and explained. Br J Appl Sci Technol 2015; 7: 396–403.

21.

Johnson

Goodman

Patrinely

, et al. Assessing the accuracy and reliability of AI-generated medical responses: an evaluation of the Chat-GPT model. Res Sq [Preprint]. 2023. DOI: 10.21203/rs.3.rs-2566942/v1

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.02 MB

Cleft lip and palate-generative pre-trained transformer: A parameter-efficient,domain-specific language model for cleft lip and palate clinical applications

Abstract

Background

Objective

Method

Result

Conclusion

Keywords

Introduction

Methods

Study design and setting

Data sources and curation

Variables and measurement

Bias and verification

Statistical methods

Result

Participants and data sources

Descriptive data

Main results

Patient-side evaluation

Objective testing

Discussion

Principal findings and performance context

Clinical implications and digital equity

Limitations

Comparison with existing methods

Conclusion

Declarations ethics approval and consent to participate

Supplemental Material

sj-docx-1-dhj-10.1177_20552076261416336 - Supplemental material for Cleft lip and palate-generative pre-trained transformer: A parameter-efficient, domain-specific language model for cleft lip and palate clinical applications

Footnotes

ORCID iDs

Consent for publication

Authors’ contributions

Funding

Declaration of conflicting interests

Availability of data and materials

Guarantor

Supplemental material

References

Supplementary Material