The Rise in Artificial Intelligence and Machine Learning Models to Screen for Cleft-Related Velopharyngeal Dysfunction: A Systematic Review

Abstract

Objective

To systematically review literature on the use of artificial intelligence (AI) and machine learning (ML) models for detecting velopharyngeal dysfunction (VPD) in patients with cleft palate.

Design

Systematic review conducted in accordance with PRISMA guidelines (PROSPERO CRD420251034524).

Setting

Studies were identified through searches of EMBASE, ProQuest, Google Scholar, and PubMed.

Participants

A total of 3967 participants contributed 92,323 training samples. Internal validation included 2331 controls and 2449 VPD cases, generating 81,143 validation samples. Ages ranged from 1 to 93 years.

Interventions

ML models were trained on speech features such as mel frequency cepstral coefficients (MFCCs) and constant Q cepstral coefficients (CQCCs) to classify or validate VPD-related speech outcomes.

Main Outcome Measure(s)

Reported performance metrics included accuracy, precision, recall, F1-score, sensitivity, specificity, and Pearson correlation coefficient (PCC). External validation was assessed when reported.

Results

Of 455 screened articles, 34 met the inclusion criteria. Support vector machines were the most commonly used models (16/34, 47.1%), followed by convolutional neural networks (6/34, 17.6%) and deep neural networks (2/34, 5.9%). Across studies reporting performance metrics, midpoint estimates yielded a mean accuracy of 82.9%, precision of 86.7%, F1-score of 0.88, sensitivity of 80.5%, specificity of 82.2%, and PCC of 0.58. Only 3 studies (3/34, 8.8%) performed external validation.

Conclusions

AI/ML models demonstrate promise for VPD detection with encouraging performance. Inconsistent reporting, reliance on engineered features, and limited external validation restrict generalizability. No clinically deployable model has yet been achieved.

Keywords

artificial intelligence, velopharyngeal dysfunction, cleft lip and palate

Introduction

Velopharyngeal dysfunction (VPD) severe enough to require corrective surgery can occur in up to 20% to 30% of patients after cleft palate repair.¹ When VPD is untreated or when VPD treatment is delayed, severe speech and psychosocial dysfunction can result.² The gold standard diagnosis of VPD is resource-intensive, requiring a specialized speech-language pathologist (SLP), often with adjunct testing such as videonasoendoscopy (VNE), nasometry, and/or magnetic resonance imaging (MRI).³ The absence of a universal standardized protocol makes these assessments highly experience-dependent, and access to them is further constrained by the global shortage of trained SLPs.⁴ These challenges are particularly pronounced in low- and middle-income countries (LMICs).^5,6

In recent years, artificial intelligence (AI) and machine learning (ML) have garnered increasing attention for their potential to address diagnostic challenges in speech disorders, including VPD.^5,7 ML models are particularly successful in pattern recognition, which can be extrapolated for image and sound processing.⁸ As such, ML models have been used to develop accessible, resource-efficient screening tools.⁹ Several studies have already examined their use in the identification of VPD^7,10,11 by training models to target hypernasality, formant distortions, nasal emissions, and articulation errors.⁸ If validated and deployed effectively, acoustic-based ML models could be used for VPD detection, particularly in areas where access to specialized SLPs is rare or impossible. The development of such screening tools could expedite referrals to SLPs, thereby mitigating the long-term functional and psychosocial impacts of untreated or poorly treated VPD.¹²

Over the last decade, there has been an exponential increase in the development of AI/ML models for use in healthcare.⁵ The field of cleft lip and palate care has followed suit, with multiple groups publishing on the use of AI/ML for the detection of cleft-related VPD.^5,10–41 Previous studies have primarily focused on the development and internal validation of AI/ML tools, leaving a gap in the literature regarding their generalizability, methodological rigor, and clinical deployment in real-world settings. The authors hypothesize that the transition of AI models for VPD detection to clinical deployment has been hindered by fundamental issues in study design, particularly in control group selection, external validation, and handling of acoustic variability. This systematic review aims to address this gap by critically evaluating these issues while answering the following question: In patients with cleft palate, do existing AI/ML models for VPD detection demonstrate sufficient external validation and generalizability to support clinical deployment, and what study design limitations may explain barriers to translation?

Methods

This study was exempt from Institutional Review Board (IRB) approval. The review was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 guidelines and was prospectively registered on PROSPERO (CRD420251034524). Methodological quality and risk of bias were assessed using the QUADAS-AI tool. No amendments were made to the protocol throughout the study.

Eligibility Criteria

Original research articles, including case–control studies, longitudinal observational studies, retrospective cohort studies, and cross-sectional studies, that employed AI or ML methods to develop models or algorithms for the detection of cleft-related VPD using speech samples were included in this study. Included studies were required to validate their models using speech data, regardless of whether performance metrics were reported, and to be published in English. No restrictions were placed on year of publication, patient age, or geographic location.

Studies were excluded if they focused on non-cleft-related causes of VPD, were prior systematic reviews or meta-analyses, were available only as abstracts without accessible full text, or constituted non-peer-reviewed material, including book chapters, magazine articles, blog posts, editorials, case reports, and case series. In addition, reference lists of excluded reviews were screened to identify any additional eligible studies.

Search Strategy

The studies included in this review were identified through searches of EMBASE, ProQuest, Google Scholar, and PubMed. The publishing period was unrestricted. The search strategy incorporated predefined keywords and Medical Subject Headings (MeSH) terms related to ML and velopharyngeal insufficiency (eg, “machine learning,” “deep learning,” “artificial intelligence,” “velopharyngeal insufficiency,” “velopharyngeal dysfunction,” and “hypernasality”), with full Boolean search queries detailed in Supplemental Digital Content 1. All databases were searched from April 16, 2025, through September 11, 2025.

Study Selection and Data Collection Process

The review process was conducted using Rayyan (Rayyan Systems Inc., Cambridge, MA, USA), a web-based systematic review management platform.⁴² Rayyan was used for duplicate removal and for the screening of titles and abstracts. All full-text screenings were conducted manually. A total of 427 abstracts were uploaded into Rayyan, after which 143 duplicate records were identified and removed by 1 reviewer (J.I.). Following duplicate removal, titles and abstracts were independently reviewed by 2 reviewers (J.I. and S.D.). Two hundred and forty-nine abstracts did not meet the inclusion criteria and were excluded. Any disagreements or uncertainties at this stage were advanced to full-text screening. Three systematic reviews were excluded at this stage; however, their reference lists were subsequently assessed for potentially eligible studies by the same 2 independent reviewers. This process yielded 28 additional articles, which were then subjected to full-text eligibility assessment. Both reviewers retrieved the full texts of potentially eligible papers and independently read them to determine final eligibility. Any discrepancies between the 2 reviewers were resolved through discussion and consensus, with consultation of senior authors (Z.Y. and M.E.P.).

Data Extraction

Two reviewers independently extracted predefined data elements from each eligible study using a standardized data collection form. Extracted variables included study characteristics (title, first author, year of publication, country of origin, income classification, language, and study design), participant demographics (age, sex, and whether multiple speech samples were obtained per participant), dataset composition (number of cases and controls, total sample size, and composition of training, validation, and test sets), characteristics of speech tasks, and technical information related to model development (ML model type, software used, extracted features, and recording methodology). Performance metrics (accuracy, precision, specificity, recall, F1-score, and Pearson correlation coefficient [PCC]) were also extracted. Discrepancies were resolved by consensus. Countries were assigned income classifications according to the World Bank income categories for 2024-2025.⁴³ When required data were not reported in a study, the item was recorded as not reported.
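For readers less familiar with the extracted performance metrics, the following minimal sketch shows how they are conventionally computed from confusion-matrix counts and from paired ratings. It is illustrative only: the function names and the example counts are hypothetical and are not drawn from any included study. Recall and sensitivity are the same quantity.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute standard screening metrics from confusion-matrix counts.

    tp/fn: VPD cases classified correctly/incorrectly.
    tn/fp: controls classified correctly/incorrectly.
    """
    nan = float("nan")
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else nan
    recall = tp / (tp + fn) if (tp + fn) else nan  # recall == sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else nan
    accuracy = (tp + tn) / total if total else nan
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else nan
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
# Hypothetical model scores versus clinician ratings.
r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
```

In this framing, PCC applies to studies that compare a continuous model output against perceptual ratings, whereas the remaining metrics apply to binary or categorical classification.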

Risk of Bias Assessment

Risk of bias and applicability were formally assessed in all 34 included studies using the Quality Assessment of Diagnostic Accuracy Studies–Artificial Intelligence (QUADAS-AI) tool.⁴⁴ QUADAS-AI is a structured instrument designed to evaluate the methodological quality of AI-based diagnostic accuracy studies across key domains, including patient selection, index test, reference standard, and flow and timing. Risk of bias judgments were categorized by 2 independent reviewers as low, high, or unclear in accordance with QUADAS-AI guidance. Detailed assessment results are provided in Supplemental Digital Content 2.

Data Synthesis

A descriptive analysis was conducted to summarize extracted study characteristics, participant demographics, dataset composition, speech task types, and AI model features. These data were tabulated and used to determine eligibility for inclusion in each synthesis. Extracted performance metrics were reviewed for completeness, and where necessary, missing or inconsistently reported summary statistics were handled by reporting available ranges or marking values as not reported. For each outcome, effect measures consisted of the reported performance metrics from individual studies. Given substantial heterogeneity in study design and reporting practices, performance results were summarized descriptively as ranges with reported standard deviations and confidence intervals when available. To facilitate comparison across studies, medians were calculated, and estimated means were derived using the midpoint of reported ranges. Model performance outcomes, including accuracy, precision, recall, F1-score, specificity, and PCC, were compiled and compared across studies using structured tables to visually display individual study results. Subgroup analyses were performed to explore heterogeneity, including comparisons by external versus internal validation, model type, study design, validation approach, and reported performance metrics.
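The midpoint-based summary described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' analysis code: the study labels and accuracy ranges are hypothetical, and taking the median over the same per-study midpoints is an assumption about how the medians were derived.

```python
from statistics import mean, median

# Hypothetical reported accuracy ranges (%), one entry per study.
# A study that reports a single value is stored as (value, value);
# None marks a metric recorded as not reported.
reported_accuracy = {
    "study_a": (80.0, 90.0),
    "study_b": (75.0, 75.0),
    "study_c": None,
    "study_d": (60.0, 95.0),
}


def summarize(reported):
    """Summarize per-study ranges using midpoint estimates."""
    ranges = [r for r in reported.values() if r is not None]
    midpoints = [(low + high) / 2 for low, high in ranges]
    if not midpoints:
        return {"n_reporting": 0, "n_not_reported": len(reported)}
    return {
        "n_reporting": len(midpoints),
        "n_not_reported": len(reported) - len(midpoints),
        "mean_of_midpoints": mean(midpoints),
        "median_of_midpoints": median(midpoints),
        "overall_range": (min(low for low, _ in ranges),
                          max(high for _, high in ranges)),
    }


summary = summarize(reported_accuracy)
```

A midpoint estimate weights every study equally regardless of sample size, which is one reason such pooled figures are best read as descriptive rather than as meta-analytic estimates.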

Results

Study Selection and Characteristics

The initial database search, which included articles from prior systematic reviews, yielded 455 articles. After screening titles and abstracts, 63 studies were selected for full-text review. Of these, 29 studies were excluded for the following reasons: 16 were abstract presentations only, 7 did not use ML or AI-based tools, 3 were systematic reviews, 2 did not utilize speech input, and 1 did not include patients with a cleft palate. Ultimately, 34 studies met the inclusion criteria (Figure 1).

Figure 1.

PRISMA flow sheet.
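The counts reported in the study selection process can be cross-checked with simple arithmetic. The sketch below uses only figures stated in the text; the variable names are illustrative, and reading the 455 total as the 427 database records plus the 28 articles drawn from review reference lists is an inference from those figures.

```python
# Counts as stated in the Methods and Results.
uploaded_to_rayyan = 427
duplicates_removed = 143
excluded_on_title_abstract = 249
added_from_review_reference_lists = 28
full_text_exclusions = {
    "abstract presentation only": 16,
    "no ML or AI-based tool": 7,
    "systematic review": 3,
    "no speech input": 2,
    "no patients with cleft palate": 1,
}

after_deduplication = uploaded_to_rayyan - duplicates_removed
passed_screening = after_deduplication - excluded_on_title_abstract
full_text_reviewed = passed_screening + added_from_review_reference_lists
included = full_text_reviewed - sum(full_text_exclusions.values())
total_identified = uploaded_to_rayyan + added_from_review_reference_lists

print(after_deduplication, passed_screening, full_text_reviewed, included)
```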

Included studies were published between 1996 and 2025. A positive trend in publication frequency was observed over time, with a notable spike during 2018-2019 and another recent rise in 2024-2025 (Figure 2). Of the 34 studies, 14 originated from high-income countries (HICs), 13 from upper-middle-income countries (UMICs), 8 from LMICs, and none from low-income countries (LICs) (Supplemental Digital Content 3).⁴³ There were 12 unique countries and 9 unique languages featured across studies. Most (23/34, 67.6%) were validation studies, 15% (5/34) were classification studies, 9% (3/34) were cross-validation studies, 6% (2/34) were prospective studies, and 3% (1/34) was a developmental study. Participant ages ranged from 1 to 93 years (Table 1).

Figure 2.

Trend in the number of publications on VPD detection using artificial intelligence.

Table 1.

Study Design and Overview of Included Studies.

Author	Title	Year of Publication	Country	Income Classification	Original Language	Study Design	Age Range (y)
Cairns et al.²¹	A Noninvasive Technique for Detecting Hypernasal Speech Using a Nonlinear Operator	1996	United States	HIC	English	Validation	-
Rah et al.²²	A noninvasive estimation of hypernasality using a linear predictive model	2001	South Korea	HIC	Korean	Validation	1-30
Pruthi et al.²³	Acoustic parameters for automatic detection of nasal manner	2004	United States	HIC	English	Classification	-
Mayr et al.²⁴	The use of automatic speech recognition showing the influence of nasality on speech intelligibility.	2010	Germany	HIC	German	Prospective	18-93
Rendon et al.¹⁴	Automatic detection of hypernasality in children	2011	Colombia	UMIC	Spanish	Classification	-
Akafi et al.²⁵	Detection of hypernasal speech in children with cleft palate	2012	Iran	UMIC	Persian	Validation	-
He et al.¹⁰	Automatic Evaluation of Hypernasality Based on a Cleft Palate Speech Database	2015	China	UMIC	Mandarin	Classification	5-12
Orozco-Arroyave, J.R et al.¹⁶	Characterization Methods for the Detection of Multiple Voice Disorders: Neurological, Functional, and Laryngeal Diseases	2015	Colombia, Germany, Czech Republic, Saudi Arabia	HIC	Spanish, German, Czech	Validation	5-15
Orozco-Arroyave, J.R et al.¹⁷	Automatic detection of hypernasal speech of children with cleft lip and palate from Spanish vowels and words using classical measures and nonlinear analysis	2016	Colombia	UMIC	Spanish	Validation	5-15
Golabbakhsh et al.¹⁵	Automatic identification of hypernasality in normal and cleft lip and palate patients with acoustic analysis of speech	2017	Iran	UMIC	Persian	Validation	4-28
Dubey et al.²⁶	Detection of hypernasality based on vowel space area	2018	India	LMIC	Kannada	Validation	7-12
Seaward et al.²⁷	Improving the accuracy of automated cleft speech evaluation	2018	United States	HIC	English	Validation	5-18
Kalita et al.²⁸	Intelligibility assessment of cleft lip and palate speech using Gaussian posteriograms based on joint spectro-temporal features	2018	India	LMIC	English	Validation	-
Dubey et al.²⁹	Pitch-adaptive front-end feature for hypernasality detection	2018	India	LMIC	Kannada	Validation	7-12
Wang et al.³⁰	Automatic Hypernasality Detection in Cleft Palate Speech Using CNN	2019	China	UMIC	Mandarin	Validation	5-12, 18-24
Dubey et al.²⁶	Detection and assessment of hypernasality in repaired cleft palate speech using vocal tract and residual features	2019	India	LMIC	Kannada	Validation	7-12
Dubey et al.³¹	Hypernasality Severity detection using Constant Q Cepstral Coefficients	2019	India	LMIC	Kannada	Validation	7-12
Wang et al.¹¹	HypernasalityNet: Deep recurrent neural network for automatic hypernasality detection	2019	China	UMIC	English	Validation	5-12
He et al.³²	Acoustic analysis and detection of pharyngeal fricative in cleft palate speech using correlation of signals in independent frequency bands and octave spectrum prominent peak	2020	China	UMIC	Chinese	Classification	-
Zhang et al.³³	Automatic hypernasality grade assessment in cleft palate speech based on the spectral envelope method	2020	China	UMIC	Mandarin	Validation	-
Dubey et al.³⁴	Sinusoidal model-based hypernasality detection in cleft palate speech using CVCV sequence	2020	India	LMIC	Kannada	Validation	7-12
Javid et al.³⁵	Single Frequency Filter Bank Based Long-Term Average Spectra for Hypernasality Detection and Assessment in Cleft Lip and Palate Speech	2020	India	LMIC	English	Cross-validation	5-13
Mathad et al.¹⁹	A deep learning algorithm for objective assessment of hypernasality in children with cleft palate	2021	United States	HIC	English	Prospective	6-9
Chen et al.¹³	Diagnose Parkinson's disease and cleft lip and palate using deep convolutional neural networks evolved by IP-based chimp optimization algorithm	2022	China and Iran	UMIC	Spanish	Validation	5-15
Song et al.¹²	Improving Hypernasality Estimation with Automatic Speech Recognition in Cleft Palate Speech	2022	China, United States	HIC	English, Mandarin	Validation	6-13
Zhang et al.³⁶	Automatic detection system for velopharyngeal insufficiency based on acoustic signals from Nasal and Oral channels	2023	China	UMIC	Mandarin	Classification	4-45
Ha et al.³⁷	Deep Learning-Based Diagnostic System for Velopharyngeal Insufficiency Based on Videofluoroscopy in Patients With Repaired Cleft Palates.	2023	South Korea	HIC	Korean	Validation	4-19.5
Lucas et al.⁵	Machine Learning for Automatic Detection of Velopharyngeal Dysfunction: A Preliminary Report	2024	United States	HIC	English	Validation	4-10
Sireesha et al.³⁸	Variational mode decomposition based features for detection of hypernasality in cleft palate speech	2024	India	LMIC	Kannada	Validation	7-12
Cornefjord et al.³⁹	Using Artificial Intelligence for Assessment of Velopharyngeal Competence in Children Born With Cleft Palate With or Without Cleft Lip	2024	Sweden	HIC	Swedish	Development	5 and 10
Korba et al.⁴⁰	Improved Laryngeal Pathology Detection Based on Bottleneck Convolutional Networks and MFCC	2024	Algeria	UMIC	Spanish	Cross-validation	-
Shirk et al.²⁰	Leveraging large language models for automated detection of velopharyngeal dysfunction in patients with cleft palate	2025	United States	HIC	English	Validation	-
Kothadia et al.⁴¹	Cross-lingual Evaluation of Hypernasality Using Wav2Vec2 Features	2025	United States and India	HIC and UMIC	English, Kannada	Cross-validation	6-12
Alter et al.¹⁸	From Support Vector Machines to Neural Networks: Advancing Automated Velopharyngeal Dysfunction Detection in Patients With Cleft Palate	2025	United States	HIC	English	Validation	5-18

Dataset Composition and Validation Strategies

The majority of studies (32/34, 94.1%) reported the number of cases and datasets used for model training and collected multiple recorded samples per participant for model training purposes. Across all studies, 3967 participants contributed 92,323 training samples. Most studies (31/34, 91.2%) fully reported the number of cases, controls, and datasets used for internal validation. In total, 2331 controls and 2449 VPD cases were used to generate 81,143 internal validation samples. Only 3 studies (3/34, 8.8%) conducted external validation after developing their models. Of these, just 2 (2/34, 5.9%) reported the number of samples used, with Lucas et al. documenting 113 external validation samples and Alter et al. documenting 600 external validation samples.^5,18 Of the 34 studies, 28 used institutional databases (28/34, 82.4%), 3 used public databases (3/34, 8.8%), and 3 used a combination of both institutional and public databases (3/34, 8.8%; Table 2).
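The distinction between internal and external validation can be made concrete with a short sketch. It is a hypothetical illustration rather than a description of any included study's pipeline: the site labels, participant identifiers, and 20% holdout fraction are all assumptions. Internal validation data are held out from the development site, external validation data come from a different site, and the internal split is made at the participant level because most studies recorded multiple samples per participant.

```python
import random


def split_by_participant(samples, holdout_fraction=0.2, seed=0):
    """Split samples so that no participant appears in both partitions."""
    participants = sorted({s["participant"] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(participants)
    n_holdout = max(1, round(len(participants) * holdout_fraction))
    holdout_ids = set(participants[:n_holdout])
    train = [s for s in samples if s["participant"] not in holdout_ids]
    holdout = [s for s in samples if s["participant"] in holdout_ids]
    return train, holdout


def partition(samples, development_site):
    """Return (training, internal validation, external validation) sets."""
    development = [s for s in samples if s["site"] == development_site]
    external = [s for s in samples if s["site"] != development_site]
    train, internal = split_by_participant(development)
    return train, internal, external


# Hypothetical data: 10 participants with 3 clips each at the development
# site, and 4 participants with 2 clips each at an independent site.
samples = [
    {"participant": f"A{i:02d}", "site": "site_a", "clip": j}
    for i in range(1, 11) for j in range(3)
] + [
    {"participant": f"B{i:02d}", "site": "site_b", "clip": j}
    for i in range(1, 5) for j in range(2)
]

train, internal, external = partition(samples, development_site="site_a")
```

Splitting by participant rather than by sample prevents clips from the same speaker from appearing in both the training and internal validation sets.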

Table 2.

Training, Internal Validation, and External Validation Information Across Machine Learning Models.

Author	Year of Publication	Collection Sounds	Repeat Samples (y/n)	Control Data Set	Test Data Set	Programmed Cases	Programmed Samples	Internal Validation Controls	Internal Validation Cases	Internal Validation Samples	External Validation Set	External Validation Sample Size (n)
Cairns et al.²¹	1996	Carrier phrase with CVC utterances (/i/, /A/ vowels, plosive and nasal consonant contexts)	y	Institutional	Institutional	11	-	11	11	-	N/A	N/A
Rah et al.²²	2001	prolonged vowel /a/ recorded at 50 kHz	n	-	Yonsei University Institute of Logopedics and Phoniatrics	-	-	-	24	24	N/A	N/A
Pruthi et al.²³	2004	Nasals (/m/, /n/, /ng/) and semivowels (/l/, /r/, /w/, /y/) from sentence corpus	y	TIMIT training database	TIMIT training database	1800	3600	-	-	504	N/A	N/A
Mayr et al.²⁴	2010	Full sentence passage: “Der Nordwind und die Sonne” (71 words, phonetically balanced)	y	In-house longitudinal recordings	In-house longitudinal recordings	20	59	-	20	59	N/A	N/A
Rendon et al.¹⁴	2011	Vowels only: /a/, /e/, /i/, /o/, /u/	y	In-house recordings (National University of Colombia and University of Antioquia)	In-house recordings (National University of Colombia and University of Antioquia)	20	200	20	20	200	N/A	N/A
Akafi et al.²⁵	2012	Sustained vowel /a/ extracted from /pamap/ disyllables	y	Locally recorded (institutional)	Locally recorded (institutional)	2	45	2	2	45	N/A	N/A
He et al.¹⁰	2015	Words; words include Mandarin initials (consonants) and finals (vowels)	y	Hospital of Stomatology, Sichuan University	Hospital of Stomatology, Sichuan University	90	7650	30	90	7560	N/A	N/A
Orozco-Arroyave, J.R et al.¹⁶	2015	Spanish vowels (/a/, /e/, /i/, /o/, /u/) recorded	y	Universidad Nacional de Colombia, Manizales	Universidad Nacional de Colombia, Manizales	130	1190	108	130	1190	N/A	N/A
Orozco-Arroyave, J.R et al.¹⁷	2016	Vowels: /a/, /e/, /i/, /o/, /u/; Words: /coco/, /gato/	y	GPRS (Grupo de Procesamiento y Reconocimiento de Señales), Universidad Nacional de Colombia	GPRS (Grupo de Procesamiento y Reconocimiento de Señales), Universidad Nacional de Colombia	65	357	54	65	357	N/A	N/A
Golabbakhsh et al.¹⁵	2017	Sentences and Fricative Consonants	y	Isfahan University of Medical Sciences	Isfahan University of Medical Sciences	15	180	15	15	180	N/A	N/A
Dubey et al.²⁶	2018	Sustained vowels (/a/, /i/, /u/) extracted from syllables in Mandarin	y	AIISH Mysore	AIISH Mysore	15	725	15	15	725	N/A	N/A
Seaward et al.²⁷	2018	Sentences (6 CAPS-A-AM sentences with target phonemes)	y	Clinic	Clinic	60	1440	-	13	78	N/A	N/A
Kalita et al.²⁸	2018	Sentences	y	Institutional	Institutional	-	-	14	12	26	N/A	N/A
Dubey et al.²⁹	2018	Vowels /a/, /i/, /u/ extracted from disyllables /papa/, /pipi/, /pupu/	y	AIISH Mysore	AIISH Mysore	30	2982	30	30	2982	N/A	N/A
Wang et al.³⁰	2019	Syllables and Vowels	y	Hospital of Stomatology, Sichuan University	Hospital of Stomatology, Sichuan University	111	25572	113	111	25572	N/A	N/A
Dubey et al.²⁶	2019	Vowels /a/, /i/, /u/ extracted from /papa/, /pipi/, /pupu/	y	All India Institute of Speech and Hearing (AIISH), Mysore	All India Institute of Speech and Hearing (AIISH), Mysore	30	600	30	30	600	N/A	N/A
Dubey et al.³¹	2019	High vowels /i/ and /u/ extracted from disyllabic words (/pipi/, /pupu/)	y	AIISH Mysore, India	AIISH Mysore, India	30	1054	15	30	1054	N/A	N/A
Wang et al.¹¹	2019	Vowels /a/, /i/, /u/ segmented from Mandarin syllables	y	Hospital of Stomatology, Sichuan University	Hospital of Stomatology, Sichuan University	72	14544	72	72	14544	N/A	N/A
He et al.³²	2020	Consonants (/c/, /ch/, /q/, /s/, /sh/, /x/) in 24 words	y	West China Hospital of Stomatology, Sichuan University	West China Hospital of Stomatology, Sichuan University	50	1208	50	50	1208	N/A	N/A
Zhang et al.³³	2020	Vowels (/a/, /i/, /u/) segmented from Mandarin syllables with varied consonants	y	West China Hospital of Stomatology, Sichuan University	West China Hospital of Stomatology, Sichuan University	60	4640	20	60	4640	N/A	N/A
Dubey et al.³⁴	2020	Words: /papa/, /pipi/, /pupu/ containing vowels /a/, /i/, /u/	y	In-house AIISH Mysore recordings	In-house AIISH Mysore recordings	30	900	24	24	3600	N/A	N/A
Javid et al.³⁵	2020	Sentences	y	New Mexico Cleft Palate Center (NMCPC)	New Mexico Cleft Palate Center (NMCPC)	73	1813	32	41	1813	N/A	N/A
Mathad et al.¹⁹	2021	Sentences; phoneme-level tagging (nasal/ oral consonants, vowels)	y	Americleft and New Mexico Cleft Palate Center	New Mexico Cleft Palate Center	60	1680	10	41	7-69 per patient	New Mexico Cleft Palate Center	N/A
Chen et al.¹³	2022	Spanish words: /bola/, /coco/, /chuzo/, /jugo/, /gato/, /papa/, /susi/, /mano/ (vowels embedded)	y	Universidad Nacional de Colombia (GPRS group)	Universidad Nacional de Colombia (GPRS group)	135	1190	58	135	1190	N/A	N/A
Song et al.¹²	2022	Sentences	y	CNH dataset (China); NMCPC (New Mexico Cleft Palate Center, USA)	CNH dataset (China); NMCPC (New Mexico Cleft Palate Center, USA)	350	-	264	350	-	N/A	N/A
Zhang et al.³⁶	2023	4 vowels: /a/, /e/, /i/, /u/; 9 consonants: /p/, /t/, /k/, /q/, /c/, /h/, /x/, /sh/, /f/	y	West China Hospital of Stomatology, Sichuan University	West China Hospital of Stomatology, Sichuan University	89	4860	46	89	4860	N/A	N/A
Ha et al.³⁷	2023	Sustained vowel /e/ during videofluoroscopy	y	Seoul National University Hospital, internal archive	Seoul National University Hospital, internal archive	96	714	618	96	714	N/A	N/A
Lucas et al.⁵	2024	Sentences prompted by a screener	y	CDC, Eastern Ontario Health Unit, Institutional	CDC, Eastern Ontario Health Unit, Institutional	55	110	55	55	110	Institutional	113
Sireesha et al.³⁸	2024	Words: /papa/, /pipi/; Vowels: /a/, /i/	y	AIISH Mysore	AIISH Mysore	30	600	24	24	480	N/A	N/A
Cornefjord et al.³⁹	2024	Swedish Articulation and Nasality Test (SVANTE)	y	SR Dataset (Swedish Registry)	SC dataset (Scandcleft Trials) IC dataset (Swedish Intercenter study)	162	141	308	399	-	N/A	N/A
Korba et al.⁴⁰	2024	Sustained vowel /a/ at normal pitch	y	HUPA Database	HUPA Database	239 (controls) 201 (cases)	440	239	201	440	N/A	N/A
Shirk et al.²⁰	2025	Not specified—full raw audio segments, 0.44 to 9.35 s long	n	CDC and Eastern Ontario Health Unit (public)	CDC and Eastern Ontario Health Unit (public)	67	129	14	14	28	N/A	N/A
Kothadia et al.⁴¹	2025	Sentences	y	Americleft NMCPC AIISH	Americleft NMCPC AIISH	149	5700	10	150	5700	N/A	N/A
Alter et al.¹⁸	2025	Words and phrases	y	Monroe Carell Jr. Children's Hospital	Monroe Carell Jr. Children's Hospital	60	8000	30	30	660	Institutional	600
					Totals (n)	3967	92323	2331	2449	81143	-	713

Model Types and Feature Extraction

The most commonly used classifier was the support vector machine (SVM), reported in 16 studies (16/34, 47.1%), followed by convolutional neural networks (CNNs) in 6 studies (6/34, 17.6%) and deep neural networks (DNNs) in 2 studies (2/34, 5.9%). The most commonly used speech features were mel frequency cepstral coefficients (MFCCs) (11/34, 32.4%), shimmer (4/34, 11.8%), and jitter (3/34, 8.8%). Other frequently used features included spectral entropy and constant Q cepstral coefficients (CQCCs). Other models and the corresponding speech features are detailed in Table 3.

Table 3.

Machine Learning Models, Software, and Features Across Studies.

Author	Year of Publication	ML Model	Speech Software	Recording	Features
Cairns et al.²¹	1996	Nonlinear Teager Energy Operator-based classification algorithm	-	DAT recorder, Sony TCD-D3, 48 kHz; microphone 8 inches from mouth	Teager Energy operator profile comparison between lowpass vs bandpass filtered signals
Rah et al.²²	2001	Signal Processing Method	CSL Model 4300B (Kay Elemetrics); analysis done with LP modeling and spectral comparison	50 kHz sampling rate, 8-bit resolution, microphone with LPF and downsampling to 12.5 kHz	LP Cepstrum; spectral distance between high-order (36-40) and low-order (10) LP models
Pruthi et al.²³	2004	Support Vector Machine (SVM)	SVMlight (classification); HTK toolkit (digit recognition)	TIMIT corpus recordings; 16 kHz sampling, phonetically transcribed, studio-quality audio	4 Acoustic Parameters (APs): onset/offset energy, energy ratio (0-320 Hz vs 320-5360 Hz), spectral peak frequency, envelope variance
Mayr et al.²⁴	2010	ASR with unigram-based statistical acoustic modeling (no syntactic knowledge)	PEAKS (Program for the Automatic Evaluation of All Kinds of Speech Disorders)	dnt Call4you headset, 16 kHz/16-bit, consistent setup across timepoints	Word Recognition Rate (WR) from ASR; % of correctly recognized words
Rendon et al.¹⁴	2011	Linear Bayes Classifier	-	-	Jitter, Shimmer, 11 MFCCs, spectral entropy, spectral moments, HNR, noise measurements (51 features total)
Akafi et al.²⁵	2012	Cepstral Distance	-	Shure Beta 54 mic, 44.1 kHz, 16-bit, SNR > 30 dB	Cepstrum from AR and ARMA models; Distance Index (DI) as classification metric
He et al.¹⁰	2015	k-Nearest Neighbors (KNN)	-	16,000 Hz sampling rate; specific mic setup not reported	Pitch, energy-amplified frequency bands, MFCC, cepstral-based features, short-time energy in sub-bands
Orozco-Arroyave, J.R et al.¹⁶	2015	Support Vector Machine (SVM)	-	44.1 kHz, 16-bit audio; studio-quality recordings; sustained vowel recordings only	4 groups: Noise measures, Stability/Periodicity, Spectral-Cepstral (MFCC, LPC, Formants), Nonlinear Dynamics
Orozco-Arroyave, J.R et al.¹⁷	2016	Soft-Margin Support Vector Machine (SM-SVM) with RBF kernel	Custom MATLAB-based; feature selection with PCA and SFFS	Omnidirectional mic, professional audio card, 44.1 kHz, 16-bit, quiet room	CAV: Jitter, Shimmer, HNR, CHNR, NNE, GNE, 11 MFCCs; NLD: CD, LLE, HE, LZC
Golabbakhsh et al.¹⁵	2017	Support Vector Machine (SVM)	-	Lavalier mic (AKG C417), Edirol R-44 recorder, 20 cm from mouth, 44.1 kHz sampling	Jitter, shimmer, Mel frequency cepstral coefficients (MFCC), bionic wavelet transform entropy, bionic wavelet transform energy
Dubey et al.²⁶	2018	Support Vector Machine (SVM)	Wavesurfer (for segmentation); analysis in MATLAB-like environment	44.1 kHz, 16-bit, sound-treated room with sound level meter microphone	Vowel Space Area (VSA) and Mel-Frequency Cepstral Coefficients (MFCC)
Seaward et al.²⁷	2018	Hidden Markov Model (HMM)-based speech recognition engine	MATLAB + Hidden Markov Toolkit (HTK, University of Cambridge)	RØDE VideoMic Pro; recordings segmented using Audacity	Discriminatory phonemes from CAPS-A-AM sentences; phoneme-level classification into: normal, VPD, or articulation error
Kalita et al.²⁸	2018	Gaussian posteriogram-based ASR model with LDA and MLLR adaptation	-	Studio microphone, clean environment, speaker-dependent alignment	MFCC (13D), delta and acceleration features, joint spectro-temporal features
Dubey et al.²⁹	2018	Support Vector Machine (SVM) with RBF Kernel	MATLAB; Wavesurfer used for manual annotation	Bruel & Kjaer sound level meter microphone, 44.1 kHz, 16-bit, sound-treated room	Pitch-adaptive MFCC (PAMFCC), compared to standard MFCC
Wang et al.³⁰	2019	Convolutional Neural Network (CNN)	-	22,050 Hz sampling; syllable-level recordings	Speech spectrogram (time-frequency representation)
Dubey et al.²⁶	2019	Support Vector Machine (SVM)	MATLAB and WaveSurfer (for segmentation and annotation)	Bruel & Kjaer microphone, 44.1 kHz, 16-bit, sound-treated room	VTC (vocal tract constriction), PSR (peak-to-sidelobe ratio), SMAC (spectral moment + cepstrum)
Dubey et al.³¹	2019	Support Vector Machine (SVM)	MATLAB	Bruel & Kjaer SLM mic, 44.1 kHz, 16-bit, in a sound-treated room	CQCC (Constant Q Cepstral Coefficients); compared with MFCC and formant features
Wang et al.¹¹	2019	LSTM-based Deep Recurrent Neural Network (DRNN)	TensorFlow (used for model); MATLAB used for feature preprocessing	Sennheiser wired microphone, 22.05 kHz, 16-bit, sound-treated room	Reflection coefficients, Power Spectrum Density (PSD), CRGE, Mel spectrum
He et al.³²	2020	Ensemble Learning	-	Studio recording, 44.1 kHz, 16-bit resolution	CSIFs (correlation in independent frequency bands), OSPP (octave spectrum prominent peak)
Zhang et al.³³	2020	Support Vector Machine (SVM)	-	Professional recording chamber, 22.05 kHz, 16-bit, Sennheiser wired microphone	Spectral envelope parameters derived from LP, SWLP, XLP
Dubey et al.³⁴	2020	Support Vector Machine (SVM) with RBF kernel	MATLAB, Wavesurfer (for annotation)	Bruel & Kjaer SLM microphone, 44.1 kHz, 16-bit, quiet room	NHA (Normalized Harmonics Amplitude), HAR (Harmonics Amplitude Ratio), PHF (Prominent Harmonic Frequency)
Javid et al.³⁵	2020	Support Vector Machine (SVM, polynomial kernel order 3)	MATLAB (custom implementations; link provided in paper)	NMCPC-CLP corpus; 8 kHz downsampled .wav utterances; each sentence segmented and rated by 5 listeners	Proposed: SFFB-LTAS (single frequency filter bank long-term average spectra, 2019-dim vectors); Baselines: MFCC, PLP, MGD, CQCC, MT-MFCC, PE-SFCC, plus prosody features (duration, intonation, AFB-LTAS)
Mathad et al.¹⁹	2021	Deep Neural Network (DNN)	Librosa (for feature extraction), TensorFlow/Keras (for deep learning model training)	- for hardware; sentence-level recordings; sampling at 16 kHz	Mel-frequency cepstral coefficients (MFCCs), forced alignment to classify NC, OC, NV, OV
Chen et al.¹³	2022	Deep Convolutional Neural Network (DCNN) optimized with IP-based Chimp Optimization Algorithm (IPChOA)	MATLAB (2019b)	16 kHz, 16-bit WAV files; no background noise or music	128-bin Mel-spectrograms (500 ms chunks with 250 ms shift; 32 ms window, 4 ms step)
Song et al.¹²	2022	CNN + Transformer encoder initialized via ASR transfer learning (fine-tuned encoder classifier)	Fairseq-S2T (ASR); SpecAugment used during ASR pretraining	16 kHz, 80-dim Mel filter-bank features (25 ms window, 10 ms shift)	80-dim log Mel filter-bank features
Zhang et al.³⁶	2023	Cross-attention residual Siamese network (CARS-Net) and Support Vector Machine (SVM)	-	Nasometer II 6450 (kayPENTAX); 2-channel simultaneous recording (oral & nasal); 11,025 Hz	RPFD (Relative Prominent Frequency Description), RFD (Relative Frequency Distribution), vowel spectrograms
Ha et al.³⁷	2023	Convolutional Neural Network (CNN)	PyTorch	Static lateral-view videofluoroscopy during /e/ phonation; 22.05 kHz sample assumed for phonation trigger	Image-based features from still-frame videofluoroscopy; no handcrafted features used
Lucas et al.⁵	2024	Support Vector Machine (SVM)	Python, LibROSA for feature extraction; Microsoft Excel for statistical evaluation	iPad recordings and online video/audio (public); 16 kHz WAV format post-processing	MFCC (Mel-Frequency Cepstral Coefficients)
Sireesha et al.³⁸	2024	Support Vector Machine (SVM) with RBF kernel	MATLAB; Wavesurfer (annotation)	Bruel & Kjaer microphone, 44.1 kHz, 16-bit, quiet room	Modal features (correlation, energy, central frequency, peak amplitude), VMD-MFCC (13D)
Cornefjord et al.³⁹	2024	Convolutional Neural Network (CNN); Pre-trained CNN (VGGish)	MATLAB (R2021a, R2022a) with Deep Learning and Audio Toolbox; PyTorch; Audacity; PyAnnote	Audio recordings at ages 5 or 10, .wav files, condenser microphones (Psytec Std61, Sennheiser MD 421-U-5)	Mel-spectrograms, harmonic ratio, spectral entropy, frame-wise segmentation (200 ms → 20 ms)
Korba et al.⁴⁰	2024	Convolutional Bottleneck Network (CBN); evaluated with SVM (RBF), Random Forest, XGBoost classifiers	Custom Python (DisVoice for glottal features; MFCC/perturbation feature extraction); Praat for jitter/shimmer/HNR	Spanish speakers; CSL 4300B equipment; condenser microphone; 50 kHz downsampled to 25 kHz; 3-s recordings in soundproof room	MFCC (static, delta, delta-delta); MFCC-CBN (bottleneck CNN-learned features); perturbation features (jitter, shimmer, HNR); glottal features (OQ, NAQ, H1-H2, HRF)
Shirk et al.²⁰	2025	Whisper (base/medium/large-v2) with 5-layer classifier head (fully connected)	Whisper via Hugging Face Transformers; PyTorch framework; LibROSA for baseline features	Public datasets; recordings resampled to 16 kHz WAV format, trimmed/padded to 30s	Raw audio embeddings from Whisper encoders (no handcrafted features); MFCCs for baselines
Kothadia et al.⁴¹	2025	Support Vector Regression (SVR, RBF kernel)	Wav2Vec2-large-xlsr-53 pre-trained model (HuggingFace/transformers), UMAP for visualization, MATLAB/Python toolkits for preprocessing	-	1024-dim contextual embeddings from wav2vec2 layers; best performance from 11th-12th layers
Alter et al.¹⁸	2025	Wav2Vec 2.0 (self-supervised feature extractor) with Transformer-based classifier head (Classifier)	Pretrained wav2vec 2.0 self-supervised neural network (wav2vec_big_960h.pt) with Transformer classifier	Smartphone/tablet recordings and sound-booth recordings; 16 kHz mono PCM audio	Self-supervised wav2vec 2.0 contextual embeddings (512-dimensional, 48 temporal frames per 0.5-s clip)

Model Performance

Model performance was assessed with a range of metrics across studies, including accuracy, precision, F1-score, sensitivity, specificity, and PCC. Most studies (30/34, 88.2%) did not report mean performance values and instead presented ranges of values, as summarized in Table 4. Four studies (4/34, 11.8%) did not report any performance metrics. Accuracy was the most commonly reported metric (24/34, 70.6%), with reported values ranging from 37.7% to 100%. Using the midpoint of reported ranges when necessary, the mean accuracy across these studies was approximately 82.9%. Precision was reported in 6 studies (6/34, 17.6%), with values ranging from 55.6% to 100%; the mean precision across studies, using midpoints, was 86.7%. The F1-score, reported in 7 studies (7/34, 20.6%), ranged from 0.50 to 1.0, with a median of 0.96 and a mean of 0.88. Sensitivity was reported in 16 studies (16/34, 47.1%), with values ranging from 32.3% to 100%; the median was 82.9%, and the mean was approximately 80.5%. Specificity was reported in 14 studies (14/34, 41.2%), ranging from 45% to 100%, with a median of 82.4% and a mean of 82.2%. PCC was reported in 3 studies (3/34, 8.8%), ranging from −0.13 to 0.87 (Table 4). The median of the 3 study midpoints was 0.71, with a mean of 0.58.
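The midpoint pooling described above can be illustrated with a minimal sketch. It assumes that each study is collapsed to the midpoint of its reported range and that studies are then pooled without weighting; the values are transcribed from the F1-score and PCC columns of Table 4, and the helper names (`midpoint`, `summarize`) are hypothetical rather than taken from the review's own analysis.

```python
from statistics import mean, median

# Reported values transcribed from Table 4.
# A tuple is a reported range (low, high); a bare number is a single estimate.
f1_reported = [(0.95, 0.98), 0.96, (0.86, 0.95), (0.50, 0.67), (0.60, 0.88), 0.97, 1.0]
pcc_reported = [(0.58, 0.84), (-0.13, 0.72), (0.58, 0.87)]


def midpoint(value):
    """Return the midpoint of a (low, high) range, or the value itself."""
    if isinstance(value, tuple):
        low, high = value
        return (low + high) / 2
    return float(value)


def summarize(reported):
    """Collapse each study to one number, then pool across studies (unweighted)."""
    points = [midpoint(v) for v in reported]
    return {"n": len(points), "median": median(points), "mean": mean(points)}


f1_summary = summarize(f1_reported)
pcc_summary = summarize(pcc_reported)
print(f1_summary)
print(pcc_summary)
```

Under these assumptions the F1-score midpoints give a median of 0.96 and a mean of about 0.875, and the PCC midpoints give a median of 0.71 and a mean of about 0.58. Because the pooling is unweighted, a study with a handful of participants counts as much as one with several hundred.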

Table 4.

Performance Metrics Across Included Studies.

Author	Year of Publication	ML Model	Accuracy	Precision	F1-Score	Sensitivity/Recall	Specificity	Pearson Correlation Coefficient (PCC)
Cairns et al.²¹	1996	Nonlinear Teager Energy Operator-based classification algorithm	93-94.7%	-	-	93-95%	93-95%	-
Rah et al.²²	2001	Signal Processing Method	-	-	-	-	-	0.58-0.84
Pruthi et al.²³	2004	Support Vector Machine (SVM) (Classifier)	89.53-95.8%	-	-	-	-	-
Mayr et al.²⁴	2010	ASR with unigram-based statistical acoustic modeling (no syntactic knowledge)	-	-	-	-	-	-
Rendon et al.¹⁴	2011	Linear Bayes Classifier (Classifier)	80-90%	-	-	-	-	-
Akafi et al.²⁵	2012	Cepstral Distance	86.7%	-	-	-	-	-
He et al.¹⁰	2015	k-Nearest Neighbors (KNN) (Classifier)	80.4%	-	-	-	-	-
Orozco-Arroyave, J.R et al.¹⁶	2015	Support Vector Machine (SVM) (Classifier)	71-99% (3-17)	-	-	62-100% (0-29)	64-100% (0-32)	-
Orozco-Arroyave, J.R et al.¹⁷	2016	Soft-Margin Support Vector Machine (SM-SVM) with RBF kernel (Classifier)	85.7-95.4% (2.2-10.7)	-	-	81.7-97.0% (3.8-19.04)	79.7-96.5% (4.8-21.0)	-
Golabbakhsh et al.¹⁵	2017	Support Vector Machine (SVM) (Classifier)	48-85.0%	-	-	50-82.0%	45-85.0%	-
Dubey et al.²⁶	2018	Support Vector Machine (SVM) (Classifier)	65.8-91.7%	-	-	68.8-95.9%	60.8-95.2%	-
Seaward et al.²⁷	2018	Hidden Markov Model (HMM)-based speech recognition engine	-	-	-	-	-	-
Kalita et al.²⁸	2018	Gaussian posteriogram-based ASR model with LDA and MLLR adaptation (Classifier)	-	-	-	-	-	-
Dubey et al.²⁹	2018	Support Vector Machine (SVM) with RBF Kernel (Classifier)	77.8-88.0%	-	-	78.6-88.1%	77.5-88.0%	-
Wang et al.³⁰	2019	Convolutional Neural Network (CNN) (Classifier)	-	-	0.95-0.98	94.8-97.4%	94.9-97.6%	-
Dubey et al.²⁶	2019	Support Vector Machine (SVM) (Classifier)	49.3-93.7% (0-4.0)	-	-	32.3-94.1% (1-7.6)	61.6-96.1% (0.5-7.8)	-
Dubey et al.³¹	2019	Support Vector Machine (SVM) (Classifier)	78-83%	-	-	-	-	-
Wang et al.¹¹	2019	LSTM-based Deep Recurrent Neural Network (DRNN) (Classifier)	87.9-93.4%	-	-	-	-	-
He et al.³²	2020	Ensemble Learning (Meta-Classifier)	-	-	-	-	-	-
Zhang et al.³³	2020	Support Vector Machine (SVM) (Classifier)	78.5-97.5%	-	-	-	-	-
Dubey et al.³⁴	2020	Support Vector Machine (SVM) with RBF kernel (Classifier)	55.5-86.4% (0.5-6.6)	-	-	45.2-86.4% (1.2-15.6)	62.1-91.3% (0.7-11.1)	-
Javid et al.³⁵	2020	Support Vector Machine (SVM, polynomial kernel order 3)	82.1-89%	-	-	-	-	-
Mathad et al.¹⁹	2021	Deep Neural Network (DNN) (Classifier)	-	-	-	-	-	−0.13-0.72
Chen et al.¹³	2022	Deep Convolutional Neural Network (DCNN) optimized with IP-based Chimp Optimization Algorithm (IPChOA) (Classifier)	96.37%	-	0.96	94.2%	94.25%	-
Song et al.¹²	2022	CNN + Transformer encoder initialized via ASR transfer learning (fine-tuned encoder classifier) (Classifier)	-	73.1-96.5% (0.2-0.6)	-	-	-	-
Zhang et al.³⁶	2023	Cross-attention residual Siamese network (CARS-Net) and Support Vector Machine (SVM)	80.2-93.7%	81.7-95.4%	0.86-0.95	89.4-96.1%	-	-
Ha et al.³⁷	2023	Convolutional Neural Network (CNN) (Classifier)	86.1-93.1%	-	0.50-0.67	45.8-58.3%	96.8-100%	-
Lucas et al.⁵	2024	Support Vector Machine (SVM) (Classifier)	-	100%	-	88.9% [95% CI: 78.44-95.41%]	66.0% [95% CI: 51.23-78.79%]	-
Sireesha et al.³⁸	2024	Support Vector Machine (SVM) with RBF kernel (Classifier)	62.8-89.4% (0.24-3.2)	55.6-91.8% (0.8-5.3)	-	46.1-88.1% (1.1-7.3)	66.0-92.3% (1.0-8.2)	-
Cornefjord et al.³⁹	2024	Convolutional Neural Network (CNN); Pre-trained CNN (VGGish)	37.7-57.1%	-	-	-	-	-
Korba et al.⁴⁰	2024	Convolutional Bottleneck Network (CBN); evaluated with SVM (RBF), Random Forest, XGBoost (Classifiers)	68.9-88.8% (1.3-5.8)	69.3-86.9%	0.60-0.88	60.7-92.5%	77.3-88.2%	-
Shirk et al.²⁰	2025	Whisper (base/medium/large-v2) with 5-layer classifier head (fully connected) (Classifier)	97%	-	0.97	-	-	-
Kothadia et al.⁴¹	2025	Support Vector Regression (SVR, RBF kernel) (Classifier)	-	-	-	-	-	0.58-0.87
Alter et al.¹⁸	2025	Wav2Vec 2.0 (self-supervised feature extractor) with Transformer-based classifier head (Classifier)	100%	100%	1.0	100%	-	-

Performance metrics are presented as ranges when multiple values were reported across experiments or models. Standard deviations (SD) are shown in parentheses, and confidence intervals (CI) are shown in brackets when provided by the original study. If no range is displayed, the metric was reported only as a single mean value in the source publication. A dash (-) indicates that the metric was not reported.

External Validation and Generalizability

Furthermore, an examination of the subset of studies that performed external validation in addition to internal validation provides insight into how well ML models may generalize to real-world clinical settings (Table 4). Of the 3 studies that included an external validation, each reported different performance metrics. One study reported results using PCC, another reported precision, sensitivity, and specificity, and the third reported accuracy, sensitivity, precision, and F1 score. The externally validated study that reported PCC (Mathad et al.) demonstrated a range of −0.13 to 0.72, indicating variable and in some cases inverse agreement with the reference standard.¹⁹ In contrast, 2 studies that reported PCC without external validation demonstrated higher and more consistently positive correlations, with ranges of 0.58-0.84 and 0.58-0.87, respectively. Comparison of midpoint values between these groups revealed a 59% relative reduction in PCC when external validation was performed, which is to be expected.
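The 59% figure can be reproduced with a short worked calculation. This is a sketch that assumes the comparison uses the midpoint of each reported PCC range and the unweighted mean of the two internally validated studies; the variable names are illustrative only.

```python
def midpoint(low, high):
    """Midpoint of a reported range."""
    return (low + high) / 2


# Externally validated study (Mathad et al.), PCC range from Table 4.
external = midpoint(-0.13, 0.72)

# The two studies reporting PCC without external validation.
internal_only = [midpoint(0.58, 0.84), midpoint(0.58, 0.87)]
internal_mean = sum(internal_only) / len(internal_only)

# Relative reduction of the external midpoint against the internal-only mean.
relative_reduction = (internal_mean - external) / internal_mean
print(f"external midpoint:  {external:.3f}")
print(f"internal-only mean: {internal_mean:.4f}")
print(f"relative reduction: {relative_reduction:.1%}")
```

With these inputs the external midpoint is 0.295, the internal-only mean is 0.7175, and the relative reduction is a little under 59%. The comparison rests on three studies in total, so it is descriptive rather than inferential.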

Among the remaining externally validated studies, sensitivity was reported as 88.9% (Lucas et al.) and 100% (Alter et al.).^5,18 Alter et al. acknowledged that the perfect sensitivity and specificity observed in their model were likely influenced by confounding variables and potential overfitting.¹⁸ When averaged together, these 2 externally validated sensitivity estimates were approximately 17% higher than the mean sensitivity reported by the remaining studies, further suggesting that reported performance may be inflated in the absence of robust external validation. The overall certainty of evidence was judged to be low for all outcomes due to substantial methodological heterogeneity, frequent high or unclear risk of bias on QUADAS-AI assessment, limited use of external validation, and inconsistent reporting of performance metrics across studies.

Synthesis of Findings Relative to the Research Question

Across included studies, AI/ML models demonstrated generally favorable diagnostic performance for the detection of VPD in patients with cleft palate, with mean accuracy, sensitivity, and specificity exceeding 80% among studies reporting these metrics. However, only 3 of 34 studies (8.8%) performed external validation, and reported performance metrics varied across these studies, with inconsistent agreement with reference standards.^5,18,19 PCC-based studies reported values from −0.13 to 0.72, while studies reporting accuracy, sensitivity, specificity, and F1 score showed high performance, with sensitivity ranging from 88.9% to 100%.^5,18,19

Several recurrent methodological characteristics relevant to barriers to clinical translation were identified. No included study utilized a control cohort consisting of patients with cleft palate without VPD. Most studies relied exclusively on internal validation datasets, with 31 of 34 studies (91.2%) lacking external validation. Substantial variability was observed in recording conditions, speech task selection, and feature extraction methods. Additionally, datasets frequently consisted of multiple recordings per participant rather than independent samples. Reporting of performance metrics was inconsistent, with 4 of 34 studies (11.8%) not reporting any performance outcomes.

These findings demonstrate that limited external validation, absence of appropriate control cohorts, heterogeneity in data acquisition and feature selection, non-independent sampling, and inconsistent reporting are common across studies.

Discussion

The results of this study indicate that current AI/ML models for VPD detection do not demonstrate sufficient generalizability or methodological rigor to support clinical deployment. Although reported diagnostic performance is generally favorable, external validation is rare, and performance is inconsistent when evaluated beyond development datasets. The absence of appropriate control cohorts, reliance on internal validation, heterogeneity in data acquisition and feature selection, non-independent sampling, and the lack of a unified performance metric framework were consistently observed across studies. These are critical barriers to translation, and addressing them is a prerequisite for clinical implementation of AI/ML models for VPD detection.

Over the last decade, ML algorithm capabilities have progressed remarkably due to the convergence of several key factors: the development of deep learning algorithms, a significant increase in available data, and a rise in computational power.⁴⁵ ML algorithms have woven their way into nearly all facets of healthcare, and cleft lip and palate care is no exception.^46–49 The surge in AI/ML research is also in part driven by the potential to extend care into rural and low-income areas.^46,47 For this potential to be realized, models must be intentionally designed for real-world constraints and rigorously tested in the environments where they will be deployed. Moreover, to effectively control for external confounders and ensure consistency in data collection, prospective studies are particularly valuable, as they allow data to be gathered under conditions that closely mirror the intended real-world use of the device.

ML models have been particularly attractive for the identification of VPD, given their ability to identify other forms of pathological speech.^13,50,51 VPD presents with characteristic speech distortions that are audible to the human ear, and it therefore stands to reason that pattern recognition algorithms would succeed in this space. This is evidenced by the recent spike in VPD-related AI/ML publications (Figure 2). Additionally, while some features of VPD are language-specific, most are language-agnostic and predominantly sound-based.⁵¹ This is evidenced by the use of similar modeling techniques in multiple different languages, along with the use of language-agnostic features such as jitter, shimmer, and spectral entropy (Tables 1-3).^14–17 Jitter captures small fluctuations in pitch (fundamental frequency), shimmer measures variations in loudness (amplitude), and spectral entropy quantifies how evenly or unevenly energy is distributed across frequencies, with higher values indicating greater vocal noise or breathiness. Considering this, it is theoretically possible that a single ML model could detect VPD in multiple languages. The belief in this concept is supported by the fact that studies have already originated from 12 countries using 9 different languages.^5,7,10 Additionally, 3 studies used multiple language inputs to train their models and further increase generalizability.^1–3 It should be clarified that the intention of VPD detection models is not to replace specialized SLPs, but rather to extend their reach in areas where access to their expertise is constrained, such as LMICs. However, as with any major technological advancement and the accompanying data surge, quality control is key to ensuring newly developed, complicated methodologies are sound. The results of this study validate this concern, as numerous models have been developed using various techniques, yet a clinically deployable tool has not yet been achieved.
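To make the three language-agnostic features concrete, the following is a minimal NumPy sketch of their textbook definitions. It assumes that the glottal periods and cycle peak amplitudes have already been extracted by a pitch tracker (the reviewed studies used tools such as Praat for this step), and the function names are hypothetical rather than drawn from any reviewed implementation.

```python
import numpy as np


def local_jitter(periods):
    """Mean absolute difference between consecutive glottal periods,
    divided by the mean period (dimensionless)."""
    periods = np.asarray(periods, dtype=float)
    return float(np.mean(np.abs(np.diff(periods))) / np.mean(periods))


def local_shimmer(peak_amplitudes):
    """The same definition applied to cycle-to-cycle peak amplitudes."""
    amps = np.asarray(peak_amplitudes, dtype=float)
    return float(np.mean(np.abs(np.diff(amps))) / np.mean(amps))


def spectral_entropy(signal):
    """Shannon entropy of the normalized power spectrum, scaled to [0, 1]."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    prob = power / power.sum()
    prob = prob[prob > 0]  # drop empty bins to avoid log(0)
    return float(-(prob * np.log2(prob)).sum() / np.log2(len(power)))


# Synthetic inputs standing in for real measurements.
steady_periods = [5.0, 5.0, 5.0, 5.0]  # ms, perfectly regular voicing
wobbly_periods = [5.0, 5.2, 4.9, 5.1]  # ms, slightly irregular voicing

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 200 * t)                  # energy in one frequency bin
noise = np.random.default_rng(0).normal(size=sr)    # energy spread across bins

print(local_jitter(steady_periods), local_jitter(wobbly_periods))
print(spectral_entropy(tone), spectral_entropy(noise))
```

The regular period sequence yields zero jitter and the irregular one a small positive value, while the pure tone has a much lower spectral entropy than white noise. None of these quantities depends on the words being spoken, which is why they transfer across languages.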

Studies reviewed demonstrated considerable variations in sample size concerning model testing, training, and validation. The lack of standardized sample size calculation is an inherent limitation to AI/ML model studies. While many theories exist, there is no standardized formula for how many “samples” are required to train a diagnostic model. This is complicated by the fact that a single patient can generate innumerable voice samples, and the number of voice samples ultimately depends on the segmentation of voice data. For example, studies that analyze vowels or phonemes can have tens of thousands of samples from only a handful of patients. Importantly, although the studies collected in this systematic review reported a vast number of training and validation numbers, these totals reflect multiple recordings obtained from the same participants rather than independent subjects. This distinction has important implications for model generalizability and the potential for overfitting, especially given the limited use of external validation. Model data augmentation techniques can also expand the quantity of sounds generated from a single patient. Understanding the contributions from each patient to the overall dataset is critical to ensure data balancing. A model that learns based on a dataset predominantly from a single patient is more likely to analyze patient-specific factors, therefore skewing model performance.
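One practical consequence of non-independent sampling is that training and test sets should be separated at the participant level rather than the sample level. The sketch below shows one way to do this using only the Python standard library; the function name, the 25% test fraction, and the toy dataset are assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_participant(sample_participants, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) so that no participant appears in both."""
    by_participant = defaultdict(list)
    for idx, pid in enumerate(sample_participants):
        by_participant[pid].append(idx)

    participants = sorted(by_participant)
    random.Random(seed).shuffle(participants)
    n_test = max(1, round(len(participants) * test_fraction))
    test_ids = set(participants[:n_test])

    train_idx, test_idx = [], []
    for pid, indices in by_participant.items():
        (test_idx if pid in test_ids else train_idx).extend(indices)
    return sorted(train_idx), sorted(test_idx)


# Hypothetical dataset: participant "P0" contributes 50 segments,
# seven other participants contribute 5 segments each.
sample_participants = ["P0"] * 50 + [f"P{i}" for i in range(1, 8) for _ in range(5)]

train_idx, test_idx = split_by_participant(sample_participants)
train_people = {sample_participants[i] for i in train_idx}
test_people = {sample_participants[i] for i in test_idx}
print(len(train_idx), len(test_idx), sorted(test_people))
```

A random split over the 85 individual segments would almost certainly place recordings from "P0" on both sides, allowing a model to score well by recognizing the speaker rather than the disorder. Counting segments per participant, as `by_participant` does, also exposes the kind of imbalance described above.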

The composition of case and control cohorts is as important as the quantity of samples in each. No publication reviewed in this study included patients with a history of cleft palate without VPD in the control cohort. This is a critical error that was brought to light in studies on the use of AI/ML to detect speech disorders in patients with Alzheimer's disease.⁵² Models trained to screen for a disease must be tested on those at risk for the disease. For example, a patient not at risk for VPD is highly unlikely to undergo VPD testing. There are obvious differences between the speech of patients with and without a cleft palate, but not all are pathological. In the absence of appropriate controls, models will be unable to determine which components of cleft palate speech are pathologic and which are not, and are therefore likely to identify all patients with a cleft palate as “abnormal.” This fundamental design flaw threatens the clinical validity of most existing models and is a key reason why model performance declines significantly with external validation. Conversely, external validation results that appear unusually strong should be interpreted cautiously, as they may indicate confounding or overfitting.

SVM, CNN, and DNN were the most frequently used classifiers, though researchers also explored a wide range of other architectures. This diversity in model architectures, often paired with various feature extraction methods such as MFCC and CQCC, indicates that no universally effective approach has been established. Although some models show promise, they often rely heavily on extensive preprocessing and manually engineered features.^7,20,53,54 While these engineered features are grounded in acoustic science, this reliance creates a significant bottleneck, as the manual effort limits real-world scalability and may fail to capture novel pathological speech characteristics. In contrast, deep learning models such as CNNs can automatically learn features from raw or minimally processed data, but this advantage comes at the cost of requiring larger datasets and often results in less interpretable models. This underscores the complexity of pathological speech and the critical role of SLPs in diagnosing VPD secondary to cleft palate, which is why they are in such high demand.
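For readers unfamiliar with what an engineered feature such as the MFCC involves, the sketch below walks through the standard pipeline (framing, windowing, power spectrum, mel filter bank, logarithm, discrete cosine transform) in plain NumPy. The parameter choices (16 kHz audio, 25 ms frames, 10 ms hop, 26 filters, 13 coefficients) are common defaults assumed here for illustration and do not reproduce the settings of any one reviewed study.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):        # rising slope
            fbank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):       # falling slope
            fbank[m - 1, k] = (right - k) / (right - center)
    return fbank


def mfcc(signal, sr=16000, frame_ms=25, hop_ms=10, n_fft=512,
         n_filters=26, n_coeffs=13):
    """Return an array of shape (n_frames, n_coeffs)."""
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft
    log_mel = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    # DCT-II written out explicitly to avoid a SciPy dependency.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1) / (2 * n_filters))
    return log_mel @ basis.T


sr = 16000
t = np.arange(sr) / sr
demo = np.sin(2 * np.pi * 220 * t)  # synthetic stand-in for a 1-s recording
features = mfcc(demo, sr=sr)
print(features.shape)
```

For a 1-s signal at these settings the result is 98 frames of 13 coefficients. Every constant in the function is a design decision made by the researcher, which illustrates why such pipelines are difficult to standardize across studies, whereas end-to-end models learn an equivalent representation from the data.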

In many reviewed publications, the settings in which voice samples are recorded are highly regulated. While this tactic is seemingly beneficial to standardize data, it can be detrimental to model development, especially concerning external validation. Acoustic models are highly sensitive to all components of a voice sample, including ambient noise, echoes, and other extraneous acoustic artifacts. When these non-vocal features are present in internal samples but absent in external validation samples, model development can be compromised.^55,56 Although such regulation may improve internal consistency, it also creates an artificial testing environment that fails to reflect real-world clinical conditions, thereby limiting model generalizability and hindering real-world deployment. The studies reviewed also varied considerably with regard to sound sample composition and feature selection for model processing. Voice samples ranged from vowel-only samples to full sentence analysis (Table 2). In general, for a model to perform robustly in real-world settings, training should be on full sentences with vowel or phoneme analysis used for fine-tuning.⁵⁷ The features analyzed in each study were also highly variable. Studies analyzed MFCCs, jitter, shimmer, and many more (Table 3). The complexity of differentiating non-pathologic cleft palate-related speech from VPD should mandate that model development evaluate as many features as possible.
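One simple way to probe sensitivity to recording conditions is to re-evaluate a trained model on copies of the clean recordings with noise mixed in at controlled signal-to-noise ratios (SNRs). The following sketch is an illustration of that idea and not a procedure reported by the reviewed studies; it assumes white Gaussian noise, which does not capture echoes or microphone characteristics, and uses a synthetic tone in place of a voice sample.

```python
import numpy as np


def add_noise_at_snr(clean, snr_db, rng):
    """Mix white Gaussian noise into `clean` at the requested SNR in dB.
    Returns the noisy signal and the noise that was added."""
    signal_power = np.mean(clean ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(scale=np.sqrt(noise_power), size=clean.shape)
    return clean + noise, noise


def measured_snr_db(clean, noise):
    """SNR actually achieved, computed from the two components."""
    return float(10 * np.log10(np.mean(clean ** 2) / np.mean(noise ** 2)))


rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
clean = 0.5 * np.sin(2 * np.pi * 180 * t)  # synthetic stand-in for a voiced segment

# One degraded copy per target SNR, from near-studio (30 dB) to very noisy (0 dB).
noisy_versions = {snr: add_noise_at_snr(clean, snr, rng) for snr in (30, 20, 10, 0)}
for snr, (noisy, noise) in noisy_versions.items():
    print(snr, round(measured_snr_db(clean, noise), 2))
```

Plotting a model's sensitivity and specificity against the SNR of the test copies gives a rough indication of how far performance depends on studio-quality input, although it is not a substitute for external validation on recordings collected in the intended setting.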

The surge in ML-focused research has led to a lack of standardized reporting across the literature, resulting in substantial gaps that impede meaningful comparisons of models and their efficacies.⁵⁸ Notably, performance metrics in this review varied widely among published studies, with 11.8% (4/34) failing to report performance measures altogether. The lack of consistent performance measures severely limited the ability to draw conclusions across studies. To enable robust cross-study comparisons, the adoption of standardized performance reporting metrics will be essential moving forward.⁵⁸ When reporting performance metrics for ML models designed to function as screening tools, best practice is to align with TRIPOD-AI and STARD-AI standards, which emphasize reporting both prediction-model performance and diagnostic accuracy. The TRIPOD-AI statement recommends presenting discrimination metrics such as the area under the receiver operating characteristic curve (AUROC) and the area under the precision–recall curve (AUPRC); threshold-based measures including sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and F1-score with 95% confidence intervals; calibration statistics such as calibration plots, slope, intercept, and Brier score; and decision-curve analyses.⁵⁹ The STARD-AI guideline similarly emphasizes reporting sensitivity, specificity, PPV, NPV, likelihood ratios, and AUROC, accompanied by cross-tabulations of the index test versus the reference standard and confidence intervals for all estimates.⁶⁰ While accuracy is often reported, it is not sufficient on its own in either framework, as it fails to capture performance in imbalanced datasets. The overarching theme of both reporting systems is that disease screening models must be evaluated not only by global measures (AUROC, AUPRC) but also by clinically interpretable outcomes, such as high sensitivity and NPV to rule out disease and specificity and PPV to capture the false-positive burden.^59,60 These tools reduce heterogeneity across the rapidly expanding body of ML models and ensure that meaningful comparisons can be made between studies. Therefore, adherence to these standardized reporting frameworks should be considered a mandatory requirement for future research.
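The threshold-based measures named above all derive from the same 2x2 cross-tabulation of the index test against the reference standard. The sketch below computes them, together with the Brier score, for a hypothetical imbalanced cohort; the counts are invented for illustration and do not come from any reviewed study.

```python
def screening_metrics(tp, fp, fn, tn):
    """Threshold-based metrics from a 2x2 table (index test vs reference standard)."""
    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp) if (tp + fp) else nan,
        "npv": tn / (tn + fn) if (tn + fn) else nan,
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "f1": 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else nan,
    }


def brier_score(probabilities, labels):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probabilities, labels)) / len(labels)


# Hypothetical cohort: 10 patients with VPD and 90 without.
# A degenerate model that labels every patient "no VPD":
always_negative = screening_metrics(tp=0, fp=0, fn=10, tn=90)
# A model that detects 9 of 10 cases at the cost of 18 false positives:
useful_screen = screening_metrics(tp=9, fp=18, fn=1, tn=72)

for name, result in (("always negative", always_negative), ("useful screen", useful_screen)):
    print(name, {k: round(v, 3) for k, v in result.items()})
```

In this example the degenerate model achieves 90% accuracy while detecting no cases, whereas the clinically useful screen has a lower accuracy of 81% with 90% sensitivity and 80% specificity. This is the sense in which accuracy alone fails to capture performance in imbalanced datasets.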

As with any systematic review, the quality of the data is limited by the quality of the studies meeting the inclusion criteria. Outcome metric reporting practices were inconsistent, with 11.8% (4/34) of studies omitting performance metrics altogether, hindering meaningful model comparisons. The distribution of work was concentrated in HICs and UMICs, limiting applicability to underrepresented populations, particularly those in LICs. Only 3 studies conducted external validation, and just 2 reported sample size, leaving a major gap in assessing generalizability. Most models (31/34, 91.2%) were tested only internally, risking overfitting to dataset-specific confounders such as ambient noise, recording echoes, and microphone characteristics. Although the results of these studies are largely theoretical, the successes demonstrated with small-scale external validation testing suggest that clinical deployment of such models is feasible. Future studies should focus on realistic case and control cohorts, rigorous external validation testing in varied recording environments, and the inclusion of populations who will ultimately benefit from the screening tool.

Conclusion

Over the last decade, there has been a surge of interest in developing AI/ML models to screen speech samples of patients with cleft palate for the presence or absence of VPD. A clinically deployable model could help expand the reach of specialized SLPs, particularly in areas where access to speech care is limited, globalizing VPD care. However, publications on this topic are highly heterogeneous in both design and performance metric reporting, many are vulnerable to acoustic confounding, and few have been externally validated. Despite these limitations, preliminary model data across 9 languages suggest that the development of a clinically deployable screening tool is feasible. Nevertheless, no study has achieved this goal, and substantial work is still required to translate these models into practical, real-world applications.

Future studies should clearly define and include appropriate control cohorts, particularly patients with cleft palate without VPD. Furthermore, external validation should be mandatory using prospectively collected, multi-center datasets with varied recording environments. Adoption of the TRIPOD-AI and STARD-AI frameworks is strongly recommended to promote standardized reporting in ML research, thereby enabling meaningful comparison of models and facilitating their translation into clinical practice. Finally, ML devices should be piloted and further developed in populations they are intended to serve, such as in low-resource settings.

Supplemental Material

sj-docx-1-cpc-10.1177_10556656261445320 - Supplemental material for The Rise in Artificial Intelligence and Machine Learning Models to Screen for Cleft-Related Velopharyngeal Dysfunction: A Systematic Review

Supplemental material, sj-docx-1-cpc-10.1177_10556656261445320 for The Rise in Artificial Intelligence and Machine Learning Models to Screen for Cleft-Related Velopharyngeal Dysfunction: A Systematic Review by Julia Isber, Weixin Liu, Bowen Qu, Shama Dufresne, Amy Stone, Maria E. Powell, Stephane A. Braun, Izabela A. Galdyn, Michael S. Golinko, Zhijun Yin and Matthew E. Pontell in The Cleft Palate Craniofacial Journal

Supplemental Material

sj-jpeg-2-cpc-10.1177_10556656261445320 - Supplemental material for The Rise in Artificial Intelligence and Machine Learning Models to Screen for Cleft-Related Velopharyngeal Dysfunction: A Systematic Review

Supplemental material, sj-jpeg-2-cpc-10.1177_10556656261445320 for The Rise in Artificial Intelligence and Machine Learning Models to Screen for Cleft-Related Velopharyngeal Dysfunction: A Systematic Review by Julia Isber, Weixin Liu, Bowen Qu, Shama Dufresne, Amy Stone, Maria E. Powell, Stephane A. Braun, Izabela A. Galdyn, Michael S. Golinko, Zhijun Yin and Matthew E. Pontell in The Cleft Palate Craniofacial Journal

Supplemental Material

sj-docx-3-cpc-10.1177_10556656261445320 - Supplemental material for The Rise in Artificial Intelligence and Machine Learning Models to Screen for Cleft-Related Velopharyngeal Dysfunction: A Systematic Review

Supplemental material, sj-docx-3-cpc-10.1177_10556656261445320 for The Rise in Artificial Intelligence and Machine Learning Models to Screen for Cleft-Related Velopharyngeal Dysfunction: A Systematic Review by Julia Isber, Weixin Liu, Bowen Qu, Shama Dufresne, Amy Stone, Maria E. Powell, Stephane A. Braun, Izabela A. Galdyn, Michael S. Golinko, Zhijun Yin and Matthew E. Pontell in The Cleft Palate Craniofacial Journal

Footnotes

ORCID iDs

Julia Isber

Weixin Liu

Stephane A. Braun

Matthew E. Pontell

Ethical Considerations

This systematic review analyzed data exclusively from previously published studies and did not involve any new human or animal participants. IRB approval and informed consent were therefore not required. All studies included in the review were assumed to have obtained appropriate ethical approval and participant consent as reported in their respective publications. The review was conducted in accordance with the PRISMA guidelines and the ethical standards of research integrity and transparency.

Author Contributions

All authors contributed substantially to the conception and design of the study, data collection, analysis, and interpretation. All authors were involved in drafting and revising the manuscript, approved the final version for publication, and agree to be accountable for all aspects of the work.

Consent to Participate

This systematic review analyzed data exclusively from previously published studies and did not involve any new human or animal participants.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Supplemental Material

Supplemental material for this article is available online.

