
Abstract

Objective

Voice disorders resulting from organic vocal cord lesions, whether benign or malignant, often lack reliable non-invasive diagnostic tools, which can lead to delays in treatment. This study aims to identify distinctive acoustic biomarkers and develop machine learning models for accurate classification of these lesions and prediction of malignancy. We investigated the acoustic characteristics of voice production in patients with organic vocal cord lesions, comparing benign and malignant cases, and evaluated the diagnostic potential of machine learning models in distinguishing between healthy and pathological voices.

Methods

A total of 157 participants were enrolled, including 127 patients with organic vocal cord lesions (109 benign, 18 malignant) and 30 healthy controls. Acoustic analysis was performed on vowel sounds, assessing vocal fold vibration parameters. Machine learning models (eXtreme Gradient Boosting (XGBoost), Light Gradient-Boosting Machine (LightGBM)) were trained to classify lesion types and predict malignancy. Receiver operating characteristic analysis identified key diagnostic parameters.

Results

Comparative analysis revealed 63 statistically significant differences in acoustic parameters between healthy and lesion-affected groups, with skewness and kurtosis being particularly discriminative. Six key parameters (/u/skew, /i/skew, /o/kurt, /o/shapefactor, /o/impulsefactor, and /a/peak2valley) demonstrated high diagnostic value in distinguishing benign from malignant lesions. The XGBoost model achieved the best performance in classifying vocal cord lesions (area under the curve (AUC) = 0.735), while LightGBM excelled in malignancy prediction (AUC = 0.924). Age and specific acoustic parameters were significant predictors in the models.

Conclusion

The integration of acoustic analysis with machine learning significantly enhances the diagnostic accuracy for vocal cord lesions, particularly in differentiating between benign and malignant cases. These findings underscore the potential of artificial intelligence (AI)-assisted voice analysis as a non-invasive tool for early detection and clinical decision-making. Further validation in larger cohorts is necessary to refine predictive algorithms for broader clinical application.

Keywords

Vocal cord lesions, acoustic analysis, objective evaluation, machine learning models, voice disorder diagnostics

Introduction

Voice production is a complex process that depends on the harmonious coordination of the respiratory system, phonatory system, and vocal tract.¹ The respiratory system, driven by the lungs, generates airflow that reaches the vocal folds at the glottis, causing them to vibrate. This sound is then amplified through the resonating system of the vocal tract before emerging from the mouth and nasal cavities.

Dysphonia, a common challenge for otolaryngologists, refers to abnormal voice production characterised by symptoms such as hoarseness, changes in voice quality, weakness or tremulousness, and fatigue.² The underlying causes of voice problems are multifaceted and include inflammation, non-physiological vocal usage, benign vocal cord lesions, and nerve damage affecting the larynx.³ Benign vocal fold lesions can be broadly categorised into epithelial lesions, such as papillomas, and lesions affecting Reinke's space (nodules, polyps, cysts, and oedema) or the arytenoid (granulomas).⁴ Additionally, there is a risk of malignant transformation from dysplastic laryngeal lesions to laryngeal cancer.⁵ Given the significant differences in treatment approaches for benign and malignant vocal fold lesions, a multidisciplinary approach is essential to accurately identify suspicious lesions.⁶

Otolaryngologists can conduct a perceptual assessment of voice by listening to sustained vowels or continuous speech production. However, this method relies heavily on the clinician's expertise and experience, making it susceptible to subjective judgments that may introduce variability and unreliability.⁷,⁸ Perceptual voice evaluation shows inter-clinician variability, with intraclass correlation coefficients (ICCs) of 0.38–0.59 for roughness and breathiness assessments.⁹

Acoustic analysis is a widely used technique in both clinical practice and research, providing valuable insights into the vibratory properties of the vocal folds and the overall health of the voice production mechanism.¹⁰ This method involves measuring voice signals to offer an objective and quantitative characterisation of voice quality.¹¹ By analysing acoustic parameters such as fundamental frequency, jitter, shimmer, and spectral tilt, clinicians can develop a comprehensive understanding of voice disorders and their impact on vocal function.¹² For example, changes in fundamental frequency may indicate the presence of nodules or polyps on the vocal folds, while alterations in jitter and shimmer may suggest oedema or other pathological changes.¹³ However, despite its potential benefits in detecting and managing vocal fold lesions, the lack of standardisation in the use of these parameters for voice assessment remains a challenge, hindering the accurate diagnosis and treatment of voice disorders.¹⁴ While acoustic parameters (e.g. jitter, shimmer) provide objective measures, their diagnostic utility is limited by overlapping values across different lesion types¹⁵ and lack of consensus on optimal parameter combinations.¹⁶ These limitations frequently lead to diagnostic uncertainty, particularly when differentiating benign lesions from early malignancies that may share similar acoustic features.¹⁷ Furthermore, there is a lack of consensus on the optimal combinations of acoustic parameters that would enhance diagnostic accuracy, leaving a gap in the literature regarding the most effective methodologies for voice disorder classification.¹⁸ In summary, while current methods for assessing voice disorders, including perceptual evaluation and acoustic analysis, provide valuable insights, they are hindered by subjectivity, variability, and a lack of standardisation, particularly in distinguishing between benign and malignant lesions.

To address these challenges, it is crucial to develop a more efficient and precise voice assessment system for otolaryngology patients by analysing recorded voice signals in conjunction with clinical data. The advancement of artificial intelligence (AI) has created numerous opportunities for voice assessment, including the identification of bio-indicators for diagnosis, classification, patient remote monitoring, and the enhancement of clinical practice.¹⁹ Studies employing machine learning algorithms on acoustic signals have shown promising results in diagnosing conditions such as depression, autism, and Alzheimer's disease.²⁰ The analysis of acoustic signals for the identification of vocal fold lesions, supported by machine learning models, presents a compelling approach to improving clinical diagnosis.²¹ The integration of explainable AI methodologies with established acoustic parameters could provide clinicians with more transparent and interpretable tools, yet this approach has not been extensively investigated in the context of voice disorders.

The present study integrated acoustic analysis with advanced machine learning algorithms. By systematically comparing the acoustic characteristics of normal voices with those affected by benign and malignant vocal cord lesions, we aim to identify distinctive acoustic biomarkers. Furthermore, we develop machine learning models, specifically eXtreme Gradient Boosting (XGBoost) and Light Gradient-Boosting Machine (LightGBM), to classify lesion types and predict malignancy. These algorithms have demonstrated promising results in various medical diagnostic applications due to their ability to handle complex data and provide accurate predictions.²² The SHapley Additive exPlanations (SHAP) framework is employed to provide interpretability and understand the contribution of each acoustic parameter to the model predictions.²³

Our study aims to overcome the limitations of existing diagnostic methods by combining acoustic analysis with machine learning. By providing a more objective, efficient, and interpretable tool for differentiating benign from malignant vocal cord lesions, we hope to support early detection and inform clinical decision-making. The main contributions of this work are as follows:

Comprehensive acoustic profiling: Systematic comparison of normal and pathological voices across benign and malignant conditions

AI-enhanced diagnostics: Development of machine learning tools combining explainable AI with conventional acoustic parameters

Clinical decision support: Creation of a framework that maintains interpretability while improving objectivity

This work is organised as follows: the literature section reviews the related literature; the methodology section presents the proposed approach; the results section reports the findings; the discussions section provides a discussion of the results; the limitations of the study section outlines the limitations of the study; and the final section concludes the paper and suggests future research directions.

Methods

Ethics statement

This study was approved by the Medical Ethics Committee of the First Affiliated Hospital of Chengdu Medical College (2022CYFYIRB-RA-Aug08). Written informed consent was obtained from all participants.

Participants

This retrospective diagnostic accuracy study analysed voice samples and medical records from patients who underwent laryngoscope examination at the First Affiliated Hospital of Chengdu Medical College between January 2021 and December 2023. The study design incorporated both comparative acoustic analysis of vocal parameters and development of machine learning models to evaluate their diagnostic performance for organic vocal cord lesions.

To facilitate an objective acoustic analysis of vocal cord lesions and normal vocal cords, participants were categorised into control, benign vocal cord lesion, and malignant vocal cord lesion groups. The inclusion criteria for the control group were as follows: age of ≥18 years, no relevant medical history or current dysphonia, and no structural or functional abnormalities in the larynx. The inclusion criteria for the malignant vocal cord group were as follows: diagnosis of glottic cancer via laryngoscopy and histopathology, with the tumour confined to the vocal cord (which may invade the anterior or posterior commissure) and with normal vocal cord movement; the ability to provide informed consent; and fluency in Mandarin. The inclusion criteria for the benign vocal cord lesion group were similar, namely a diagnosis of a vocal cord lesion via laryngoscopy and histopathology (or clinical examination in controls), the ability to provide informed consent, and fluency in Mandarin.

The exclusion criteria for both the benign and malignant vocal cord groups followed a similar pattern: treatment with medications that may induce voice changes, unwillingness or inability to provide informed consent, unconsciousness, severe cognitive impairment or psychiatric disorders affecting assessment, participation in other rehabilitation programmes, and an inability to accurately perform the study tasks.

Clinical data related to the patients were collected, including sex, age, smoking and alcohol consumption, disease side, body mass index, co-existing conditions such as pharyngitis and gastroesophageal reflux disease, and histopathological findings.

Sound acquisition and feature parameter extraction

The subjects were placed in a quiet voice examination room where ambient noise was kept below 45 dB. The XION GmbH DIVAS2.5 was used for acquisition. Before the formal test, the physician helped the subject put on the headset microphone, kept the front and bottom of the microphone at 45° to the subject, explained the voice acquisition method, put the subject at ease, and guided the subject to produce sustained vowels correctly and smoothly. Formal voice acquisition began only once the subject had mastered the correct pronunciation method. The test material consisted of the Chinese vowels /a/, /o/, /e/, /i/, /u/, and /ü/. Each vowel was sustained for at least three seconds at the subject's usual pitch and loudness. The most stable audio clips were retained.

The vowel selection (/a/, /o/, /e/, /i/, /u/, /ü/) was based on their distinct acoustic and articulatory characteristics, which are critical for comprehensive vocal assessment. The vowels /a/ and /o/ provide robust formant structures for spectral analysis, while /i/ and /u/ are essential for evaluating vocal tract configuration and tension. The inclusion of /e/ and /ü/ further enhances the detection of subtle pathological variations due to their intermediate articulatory positions. Each sound segment had a duration of approximately 3 s, with a total test duration averaging around 10 min per subject. The extracted acoustic parameters were obtained using a custom MATLAB script that performed wavelet transforms, Fourier functions, and acoustic parameter extraction. In total, 34 acoustic parameters were extracted: absomean, std, skew, kurt, max, min, peak2valley, rms, crestfactor, shapefactor, impulsefactor, marginfactor, energy, first_f0, middle_f0, last_f0, median_f0, mean_f0, f0variation, f0skew, f0kurt, max_f0, min_f0, range_f0, slope_start2max, slope_max2end, hnr, jitter, duration, hr_mean, hr_median, hr_std, hr_max, and hr_min (see Figure 1).
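Several of the parameter names above (rms, crestfactor, shapefactor, impulsefactor, peak2valley) match descriptors commonly used in time-domain signal analysis. The source does not give their definitions, so the sketch below uses the conventional ones as an assumption; it is not the authors' MATLAB script, and `time_domain_features` is a hypothetical helper name. marginfactor and energy are omitted because their definitions vary between toolkits.

```python
import math

def time_domain_features(x):
    """Conventional time-domain descriptors of a waveform (assumed definitions)."""
    n = len(x)
    abs_mean = sum(abs(v) for v in x) / n          # mean absolute amplitude ("absomean")
    rms = math.sqrt(sum(v * v for v in x) / n)     # root mean square amplitude
    peak = max(abs(v) for v in x)                  # largest absolute amplitude
    return {
        "absomean": abs_mean,
        "rms": rms,
        "max": max(x),
        "min": min(x),
        "peak2valley": max(x) - min(x),            # distance from highest to lowest sample
        "crestfactor": peak / rms,                 # how "peaky" the waveform is
        "shapefactor": rms / abs_mean,
        "impulsefactor": peak / abs_mean,
    }

# Toy signal: two samples at +0.5 followed by two at -0.5 (not study data)
features = time_domain_features([0.5, 0.5, -0.5, -0.5])
print(features)
```

For a flat-topped signal like this toy example, the crest, shape, and impulse factors all equal 1; sharper, more impulsive waveforms push them higher.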

Formula for sample skewness: \text{Sample Skewness} = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_{i} - \bar{x})^{3}}{\left(\frac{1}{n-1} \sum_{i=1}^{n} (x_{i} - \bar{x})^{2}\right)^{3/2}}

Formula for sample kurtosis: \text{Sample Kurtosis} = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_{i} - \bar{x})^{4}}{\left(\frac{1}{n} \sum_{i=1}^{n} (x_{i} - \bar{x})^{2}\right)^{2}} - 3
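The two formulas can be transcribed directly into code. The sketch below follows them term by term, including the n − 1 divisor in the skewness denominator and the n divisor in the kurtosis denominator. Applying them to raw waveform samples is an assumption, and the function names are illustrative.

```python
def sample_skewness(x):
    """Third central moment divided by the (n - 1)-based variance to the power 3/2."""
    n = len(x)
    mean = sum(x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    s2 = sum((v - mean) ** 2 for v in x) / (n - 1)
    return m3 / s2 ** 1.5

def sample_kurtosis(x):
    """Fourth central moment divided by the squared n-based variance, minus 3."""
    n = len(x)
    mean = sum(x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    return m4 / m2 ** 2 - 3

# Toy inputs (not study data): a symmetric sample and a two-valued sample
skew_symmetric = sample_skewness([-1.0, 0.0, 1.0])
kurt_two_valued = sample_kurtosis([-1.0, 1.0, -1.0, 1.0])
print(skew_symmetric, kurt_two_valued)
```

A symmetric sample has zero skewness, and a sample that only takes two equally frequent values has the lowest possible excess kurtosis of −2.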

Figure 1.

Comprehensive speech data processing and analysis pipeline for vocal cord disease detection.

Machine learning and visualisation analysis based on SHAP model

This study developed predictive models for vocal pathologies using a dataset split into training (75%) and validation (25%) sets, with three-fold cross-validation. To address the class imbalance between positive (benign) and negative (malignant) samples, the Synthetic Minority Over-sampling Technique (SMOTE) was used to augment the minority class (malignant). The following models were trained and evaluated: LightGBM, an efficient gradient-boosting decision tree framework; support vector machine (SVM); logistic regression (LR); random forest (RF); decision tree (DT); k-nearest neighbours (KNN); and XGBoost. Hyperparameter tuning was performed using the following search spaces:

Random Forest: n_estimators: [20], max_depth: [None, 10, 20], min_samples_split: [2, 5, 10], min_samples_leaf: [2, 4]
Decision Tree: max_depth: [None, 10, 20], min_samples_split: [2, 5, 10], min_samples_leaf: [2]
Logistic Regression: C: [0.1, 1, 10], penalty: ['l1', 'l2']
KNN: n_neighbors: [3, 5, 7], weights: ['uniform', 'distance'], algorithm: ['auto', 'ball_tree', 'kd_tree']
XGBoost: n_estimators: [40], max_depth: [3, 5, 7], learning_rate: [0.01, 0.1, 0.3]
LightGBM: n_estimators: [20], max_depth: [3, 5, 7], learning_rate: [0.01, 0.1, 0.3]
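To make the oversampling step concrete, the sketch below illustrates the core idea of SMOTE: each synthetic minority row is placed at a random point on the line segment between a minority row and one of its nearest minority neighbours. This is a simplified standard-library illustration, not the implementation used in the study; `smote_sketch`, the value of `k`, and the toy rows are all assumptions.

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base row, excluding the row itself
        others = [row for row in minority if row is not base]
        neighbours = sorted(others, key=lambda row: math.dist(base, row))[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to partner
        synthetic.append([b + gap * (p - b) for b, p in zip(base, partner)])
    return synthetic

# Hypothetical minority-class rows with two features each (not study data)
minority_rows = [[0.2, 7.0], [0.4, 8.0], [0.6, 9.5], [0.8, 11.0]]
new_rows = smote_sketch(minority_rows, n_new=5)
print(new_rows)
```

Because every synthetic row is a convex combination of two real rows, it always lies within the range spanned by the minority class. In common practice, oversampling is applied to the training portion only so that validation data stay untouched; the source does not state at which stage it was applied.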

These models were assessed using precision, accuracy, recall, F1 score, and area under the curve (AUC), with clinical utility evaluated through decision curve analysis (DCA). The goal was to distinguish normal speakers from those with vocal cord pathology, including benign and malignant lesions, providing a robust tool for vocal disorder diagnosis and management.
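The threshold-based metrics listed above all derive from the four cells of a confusion matrix. The sketch below shows the standard definitions; the counts are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all cases classified correctly
    precision = tp / (tp + fp)                   # share of predicted positives that are true
    recall = tp / (tp + fn)                      # share of true positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts (not taken from the study)
metrics = classification_metrics(tp=8, fp=2, fn=2, tn=8)
print(metrics)
```

Because F1 is determined entirely by precision and recall, a reported F1 value can be cross-checked against the other two columns of a results table.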

SHAP is a powerful framework used for interpreting and understanding predictions made by machine learning models.¹⁷ This study applied SHAP analysis to the model with the best classification performance to quantify the contribution of each feature to the prediction results, enabling feature contribution visualisation and model interpretation.¹⁸,¹⁹
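SHAP attributions are based on Shapley values, which average a feature's marginal contribution over all subsets of the other features. The sketch below computes exact Shapley values by brute-force enumeration for a toy additive scoring function. It illustrates the underlying definition only; it is not the SHAP library's algorithm for tree models, and the feature effects are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, features):
    """Exact Shapley values by enumerating every subset of the other features."""
    n = len(features)
    phi = {}
    for f in features:
        rest = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            for subset in combinations(rest, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                present = set(subset)
                # Marginal contribution of f when added to this subset
                total += weight * (value_fn(present | {f}) - value_fn(present))
        phi[f] = total
    return phi

# Hypothetical additive effects on a risk score (not study results)
effects = {"age": 0.30, "/i/skew": -0.20, "smoking": 0.10}

def toy_value(present):
    """Score of the toy model when only the features in `present` are known."""
    return sum(effects[name] for name in present)

phi = shapley_values(toy_value, list(effects))
print(phi)
```

For an additive model each Shapley value equals the feature's own effect, and the values always sum to the difference between the full prediction and the baseline, which is the property that makes SHAP summary plots readable as per-feature contributions.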

Statistical analysis

Continuous variables are presented as mean ± standard deviation, and categorical variables as frequencies. One-way analysis of variance (ANOVA) was used for group comparisons, and variables showing statistically significant differences were entered into the machine learning models. Acoustic parameters were further screened by receiver operating characteristic (ROC) analysis, and six were selected as model inputs. Statistical analysis was performed using Stata 18 and GraphPad Prism, and statistical significance was set at p < .05.
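The one-way ANOVA used for screening compares the variance between group means with the variance within groups. The sketch below computes the F statistic and its degrees of freedom for three toy groups; the numbers are illustrative and the function name is an assumption. The p-value would then be read from the F distribution with these degrees of freedom, which statistical packages such as Stata do internally.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean, weighted by group size
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean
    ss_within = sum(sum((v - m) ** 2 for v in g) for g, m in zip(groups, means))
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Toy values for three groups, e.g. control, benign, malignant (not study data)
f_stat, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_stat, df_b, df_w)
```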

Results

Demographics

To establish the baseline characteristics of the study population and identify potential demographic factors associated with vocal cord lesions, 157 participants were included in this study, comprising 127 patients with organic vocal cord lesions who underwent surgical treatment and 30 participants in the control group. Histopathological examination confirmed that the lesions were either benign (109 cases) or malignant (18 cases of laryngeal squamous cell carcinoma, Tis-T2 stage, with no cervical lymph node involvement or distant metastasis). The benign lesions included 77 cases of polyps, 15 cases of cysts, three cases of mycosis, five cases of granuloma, and nine cases of papilloma. The results are shown in Table 1. The malignant lesion group was significantly older than the benign lesion and control groups (p < .001). The proportion of males was higher in the malignant lesion group than in the benign lesion and control groups (p < .001). Significantly more patients in the malignant lesion group reported smoking and drinking alcohol than in the benign and control groups (p < .001).

Table 1.

Baseline characteristics of the participants.

Characteristic	Normophonic subjects (N = 30)	Subjects with benign lesion (N = 109)	Subjects with malignant lesion (N = 18)	p-value
	N (%)/mean ± std	N (%)/mean ± std	N (%)/mean ± std
Gender (male)	14 (46%)	46 (42%)	17 (94%)	< .001
Age	46.86 ± 10.85	50.49 ± 10.33	61.05 ± 6.67	< .001
BMI				< .025
Low body weight	3 (10%)	2 (2%)	1 (6%)
Normal	16 (53%)	57 (52%)	13 (72%)
Overweight	11 (37%)	32 (29%)	1 (6%)
Obesity	0	18 (17%)	3 (17%)
Smoking	7 (23%)	20 (18%)	13 (72%)	< .001
Drinking alcohol	8 (26%)	25 (23%)	16 (89%)	< .001
Side	-			< .001
Left		35 (32%)	9 (50%)
Right		28 (26%)	7 (39%)
Bilateral		46 (42%)	2 (11%)
Chronic pharyngitis	1 (3%)	48 (44%)	2 (11%)	< .001
Gastroesophageal reflux	1 (3%)	30 (27%)	1 (6%)	< .001

Objective analysis – parameters

To identify acoustic biomarkers that can differentiate between normal vocal cords, benign lesions, and malignant lesions, one-way analysis of variance was performed. It revealed significant differences between groups in 62 of the 264 parameters across six different vowels (Tables 2-1 to 2-5). Based on these results, we selected at least two sets of parameters with inter-group differences among the three groups for the box plots (Figure 2). Eighteen parameters were plotted, although box plots for only six parameters are shown in Figure 2. Further details are presented in Appendix 1. Comparative analysis revealed 63 statistically significant differences in acoustic parameters across six different vowels (/a/, /o/, /e/, /i/, /u/, /ü/) between the control, benign lesion, and malignant lesion groups (p < .05). Six key parameters (/u/skew, /i/skew, /o/kurt, /o/shapefactor, /o/impulsefactor, and /a/peak2valley) demonstrated high diagnostic value in distinguishing between benign and malignant lesions (p < .05).

Figure 2.

Significant differences among the normal, benign vocal cord lesion, and malignant vocal cord lesion groups.

Table 2-1.

Comparison of voice parameters in benign and malignant glottic lesions at different vowels.

Phonation alphabet	Group	N	Absomean		Std		Skew		Kurt		Max		Min		Peak2valley
Phonation alphabet	Group	N	Mean	Std	Mean	Std	Mean	Std	Mean	Std	Mean	Std	Mean	Std	Mean	Std
/a/	Control	30	0.06	0.29	0.12*	0.05*	1.41*	0.78*	17.64*	13.27*	0.8*	0.16*	−0.68*	0.17*	1.48*	0.32*
	Benign lesion	109	0.05	0.25	0.1*	0.05*	1.29*	0.94*	13.26*	10.88*	0.6*	0.29*	−0.47*	0.25*	1.07*	0.53*
	Malignant lesion	18	0.05	0.22	0.1*	0.05*	0.7*	0.9*	9.92*	6.23*	0.58*	0.34*	−0.51*	0.33*	1.08*	0.66*
/o/	Control	30	0.07	0.04	0.14	0.07	1.55*	0.8*	15.77*	9.72*	0.77*	0.19*	−0.56	0.18	1.33*	0.35*
	Benign lesion	109	0.07	0.3	0.13	0.06	1.04*	0.84*	9.32*	7.37*	0.61*	0.29*	−0.45	0.23	1.06*	0.51*
	Malignant lesion	18	0.06	0.3	0.1	0.05	0.57*	0.74*	9.56*	6.22*	0.55*	0.28*	−0.5	0.28	1.04*	0.54*
/e/	Control	30	0.07	0.04	0.13*	0.06*	1.49*	0.97*	13.4*	7.72*	0.75*	0.17*	−0.51	0.17	1.26*	0.32*
	Benign lesion	109	0.06	0.03	0.12*	0.06*	0.94*	0.87*	9.38*	7.77*	0.58*	0.28*	−0.41	0.2	0.99*	0.47*
	Malignant lesion	18	0.05	0.02	0.08*	0.04*	0.65*	0.86*	10.78*	8.16*	0.5*	0.28*	−0.41	0.27	0.91*	0.54*
/i/	Control	30	0.06*	0.29*	0.12*	0.05*	0.9*	0.48*	8.53	3.93	0.59*	0.2*	−0.47	0.19	1.07	0.37
	Benign lesion	109	0.07*	0.41*	0.12*	0.06*	0.53*	0.54*	6.34	5.22	0.49*	0.24*	−0.41	0.21	0.9	0.45
	Malignant lesion	18	0.05*	0.03*	0.09*	0.04*	0.27*	0.35*	6.66	4.18	0.43*	0.25*	−0.39	0.24	0.82	0.487
/u/	Control	30	0.07	0.36	0.14*	0.06*	1.62*	1.1*	13.39*	9.02*	0.71*	0.18*	−0.47	0.15	1.18*	0.31*
	Benign lesion	109	0.07	0.35	0.12*	0.06*	1.09*	0.89*	9.02*	7.56*	0.54*	0.29*	−0.39	0.22	0.93*	0.49*
	Malignant lesion	18	0.06	0.03	0.1*	0.05*	0.77*	0.76*	8.27*	4.88*	0.52*	0.29*	−0.4	0.25	0.93*	0.53*
/ü/	Control	30	0.06	0.03	0.12	0.05	0.97*	0.39*	9.43	4.48	0.59	0.21	−0.45	0.18	1.04	0.37
	Benign lesion	109	0.07	0.04	0.12	0.06	0.55*	0.50*	6.96	5.52	0.49	0.25	−0.41	0.23	0.89	0.47
	Malignant lesion	18	0.06	0.04	0.11	0.06	0.19*	0.48*	7.67	5.43	0.48	0.26	−0.43	0.25	0.92	0.5

Note: ANOVA, *p < .05.

Table 2-2.

Comparison of voice parameters in benign and malignant glottic lesions at different vowels.

Phonation alphabet	Group	N	rms		crestfactor		shapefactor		impulsefactor		marginfactor		energy		first_f0
Phonation alphabet	Group	N	Mean	Std	Mean	Std	Mean	Std	Mean	Std	Mean	Std	Mean	Std	Mean	Std
/a/	Control	30	0.12*	0.05*	7.15	2.39	2.33*	0.43*	17.45	9.15	571.42	1003.51	1189.83	895.87	186.67*	55.12*
	Benign lesion	109	0.1*	0.05*	6.46	2.59	2.04*	0.59*	14.44	10.13	480.1	738.38	934.5	867.51	252.83*	104.08*
	Malignant lesion	18	0.09*	0.05*	6.04	1.73	1.89*	0.5*	12.11	6.47	303.23	200.23	887.4	825.93	238.68*	116.35*
/o/	Control	30	0.14	0.07	6.54*	2.41*	2.29*	0.42*	15.79*	8.29*	533.97	707.84	1786.14	1696.96	218.29	87.52
	Benign lesion	109	0.13	0.06	4.93*	1.73*	1.89*	0.54*	10.13*	6.65*	261.1	554.17	1593.13	1293.58	222.58	101.83
	Malignant lesion	18	0.1	0.05	5.67*	1.81*	1.91*	0.55*	11.53*	6.66*	324.01	312.69	1078.09	882.43	184.63	91.61
/e/	Control	30	0.13*	0.06*	6.22*	1.77*	2.15	0.38	13.99	6.12	347.93	348.38	1594.14	1637.07	208.63	55.02
	Benign lesion	109	0.12*	0.06*	5.22*	2.08*	1.91	0.57	11	7.69	338.29	781.89	1382.64	1224.62	208.59	87.6
	Malignant lesion	18	0.08*	0.04*	6.34*	2.79*	1.92	0.59	13.29	9.31	502.51	716.43	708.5	594.69	177.05	69.07
/i/	Control	30	0.12*	0.05*	5.24*	1.42*	1.99*	0.26*	10.72	4.18	260.5	262.95	1270.04	1032.01	209.41	79.26
	Benign lesion	109	0.12*	0.06*	4.21*	1.66*	1.75*	0.52*	8.17	5.69	225.47	416.89	1474.95	1307.85	209.25	79.01
	Malignant lesion	18	0.09*	0.01*	4.85*	1.85*	1.76*	0.49*	9.25	5.66	264.69	255.85	808.83	7893.59	202.57	58.54
/u/	Control	30	0.14*	0.06*	5.71	2.36	2.12*	0.42*	12.98	8.44	382.44	633.86	1716.11	1298.64	192.92	62.47
	Benign lesion	109	0.12*	0.06*	4.8	2.02	1.88*	0.57*	10.02	7.13	298.89	496.38	1526.93	1332.36	207.06	82.6
	Malignant lesion	18	0.1*	0.05*	5.55	2.06	1.77*	0.45*	10.32	5.42	302.61	311.73	986.79	904.76	206.78	98.61
/ü/	Control	30	0.12	0.05	5.39*	1.45*	2.05*	0.31*	11.38	4.86	332.79	385.91	1376.41	1165.18	203.87	54.05
	Benign lesion	109	0.12	0.06	4.46*	1.93*	1.81*	0.53*	8.93	6.43	273.35	529.53	1558.06	1435.83	212.22	76.64
	Malignant lesion	18	0.11	0.06	5.2*	2.43*	1.77*	0.47*	9.81	6.00	326.61	444.31	1280.6	1327.08	201.48	86.72

Note: ANOVA, *p < .05.

Table 2-3.

Comparison of voice parameters in benign and malignant glottic lesions at different vowels.

Phonation alphabet	Group	N	middle_f0		last_f0		median_f0		mean_f0		f0variation		f0skew		f0kurt
Phonation alphabet	Group	N	Mean	Std	Mean	Std	Mean	Std	Mean	Std	Mean	Std	Mean	Std	Mean	Std
/a/	Control	30	159.81	63.46	179.95	77.97	170.83	43.77	175.99	28.91	58.7	15.15	0.3*	1.66*	6.46	7.36
	Benign lesion	109	175.01	65.22	156.93	84.98	178.83	52.44	180.99	43.78	63.16	26.13	0.72*	1.42*	6.95	10.73
	Malignant lesion	18	173.11	81.76	159.46	82.34	158.56	34.77	166.27	32.44	64.16	27.56	1.54*	1.98*	11.98	24.35
/o/	Control	30	180.07	61.59	143.97	78.59	171.53	44.4	169.52	33.65	53.81	13.69	−0.08	1.77	6.69	7.71
	Benign lesion	109	181	69.47	172.16	97.6	184.51	53.75	183.62	45.82	55.93	28.39	0.39	2.55	11.69	19.22
	Malignant lesion	18	177.93	76.58	155.45	96.6	158.72	31.45	166.52	32.33	63.15	36.36	0.63	1.07	5.46	5.07
/e/	Control	30	191.95	55.13	163.8	83.35	181.93	36.49	180.57	27.08	52.8	14.61	−0.44	1.46	5.6	3.38
	Benign lesion	109	183.45	65.25	164.52	96.96	187.97	49.91	182.84	44.53	53.19	26.2	0.25	2.05	8.72	15.54
	Malignant lesion	18	171.54	69.72	148.45	75.88	162.9	42.43	159.81	30.4	51.52	34.07	0.58	1.36	5.72	4.37
/i/	Control	30	164.05	59.96	148.55	84.98	178.78	44.73	173.16	33.06	53.47	15.44	−0.17	1.52	6.09	4.59
	Benign lesion	109	193.48	72	169.83	94.33	190.44	52.6	183.92	46.08	49.06	29.41	−0.16	1.65	5.88	6.22
	Malignant lesion	18	174.58	85.65	135.17	53.38	178.8	58.03	178.09	45.95	57.1	37.87	0.374	2.01	7.38	9.84
/u/	Control	30	191.01	41.35	163.48	72.69	180.30*	43.59*	175.72	34.51	52.19	13.41	−0.41	1.47	5.23	3.38
	Benign lesion	109	184.71	65.96	172.84	96.26	190.28*	53.95*	183.95	46.55	50.96	28.39	0.11	1.75	7.91	14.11
	Malignant lesion	18	181.65	87.26	159.68	100.12	158.09*	50.89*	16.28	37.5	59.46	35.9	0.45	1.17	4.31	2.24
/ü/	Control	30	185.14*	74.91*	148.45	76.17	180.06	45.53	173.87	36.6	56.06	16.44	−0.28	1.41	4.85	3.21
	Benign lesion	109	194.48*	66.61*	178.36	101.25	189.71	51.58	183.35	44.8	54.69	26.1	0.06	1.84	7.53	10.48
	Malignant lesion	18	148.03*	48.29*	163.32	80.55	165.47	46.34	165.81	25.96	62.97	37.84	0.18	1.22	4.1	2.25

Note: ANOVA, *p < .05.

Table 2-4.

Comparison of voice parameters in benign and malignant glottic lesions at different vowels.

Phonation alphabet	Group	N	max_f0		min_f0		range_f0		slope_start2max		slope_max2end		hnr
Phonation alphabet	Group	N	Mean	Std	Mean	Std	Mean	Std	Mean	Std	Mean	Std	Mean	Std
/a/	Control	30	338.03	44.03	54.23	4.81	283.8	43.39	0.56*	0.19*	2.44	1.54	1.1	4.3
	Benign lesion	109	352.97	65.26	68.33	43.38	283.63	86.56	0.72*	0.26*	2.96	1.68	0.8	6.12
	Malignant lesion	18	368.41	64.48	68.22	35.97	300.19	83.35	0.66*	0.3*	2.87	1.34	1.43	6.87
/o/	Control	30	325.83	69.86	56.91	12.46	268.92	70.7	0.68	0.24	3.07	1.81	−0.12	5.67
	Benign lesion	109	342.55	70.08	69.24	46.04	273.31	93.78	0.67	0.28	2.78	1.85	0.3	6.14
	Malignant lesion	18	335.54	101.45	67	26.8	268.53	116.59	0.6	0.29	2.83	1.85	1.27	6.67
/e/	Control	30	314.39	57.91	57.39	11.44	257	60.87	0.68	0.17	2.86	2.06	1.38	6.09
	Benign lesion	109	337.59	73.28	73.34	52.48	264.25	104.28	0.64	0.26	2.94	1.97	1.77	6.89
	Malignant lesion	18	307.97	102.1	72.59	36.44	235.28	133.19	0.63	0.27	2.66	1.91	1.83	7.56
/i/	Control	30	323.25	59.01	54.44*	5.94*	268.81	59.12	0.65	0.21	3.12	2.01	1.15	6.07
	Benign lesion	109	308.3	81.78	84.41*	64.83*	223.9	120.59	0.71	0.26	2.49	1.6	1.88	6.03
	Malignant lesion	18	316.18	101.65	65.13*	32.01*	251.05	128.01	0.69	0.23	2.98	2.1	2.54	6.7
/u/	Control	30	305.22	57.49	54.71	6.24	250.51	58.48	0.64	0.19	2.39	1.42	1.42	5.98
	Benign lesion	109	326.59	78.64	77.76	59.98	248.84	117.58	0.67	0.26	2.62	1.72	0.14	5.46
	Malignant lesion	18	322.91	105.71	67.44	36.9	255.47	136.44	0.68	0.27	2.78	1.92	0.49	5.87
/ü/	Control	30	309.1	53.31	54.63	7.4	254.47	53.19	0.67	0.19	2.63	1.35	−0.81	4.67
	Benign lesion	109	327.66	76.17	70.11	53.79	257.55	106.17	0.67	0.23	2.55	1.71	1.00	5.14
	Malignant lesion	18	317.11	106.9	67.97	37.84	249.29	133.27	0.68	0.26	2.63	2.14	2.18	8.2

Note: ANOVA, *p < .05.

Table 2-5.

Comparison of voice parameters in benign and malignant glottic lesions at different vowels.

Phonation alphabet	Group	N	jitter		duration		hr_mean		hr_median		hr_std		hr_max		hr_min
Phonation alphabet	Group	N	Mean	Std	Mean	Std	Mean	Std	Mean	Std	Mean	Std	Mean	Std	Mean	Std
/a/	Control	30	0	0	1.3*	0.25*	0.52	0.09	0.52*	0.15*	0.22	0.05	0.82	0.08	0.09	0.09
	Benign lesion	109	0	0	1.67*	0.65*	0.57	0.13	0.6*	0.18*	0.21	0.07	0.84	0.08	0.1	0.18
	Malignant lesion	18	0	0	1.74*	0.78*	0.53	0.11	0.53*	0.15*	0.17	0.07	0.83	0.08	0.12	0.17
/o/	Control	30	0*	0*	1.38*	0.29*	0.53*	0.1*	0.52*	0.19*	0.23*	0.04*	0.83	0.07	0.11	0.07
	Benign lesion	109	0*	0*	1.91*	0.84*	0.61*	0.14*	0.66*	0.19*	0.2*	0.08*	0.86	0.08	0.13	0.21
	Malignant lesion	18	0*	0*	1.96*	0.83*	0.55*	0.11*	0.55*	0.15*	0.17*	0.07*	0.83	0.06	0.14	0.2
/e/	Control	30	0	0	1.35*	0.29*	0.56	0.09	0.6	0.18	0.23*	0.04*	0.84	0.06	0.11	0.08
	Benign lesion	109	0	0	1.87*	0.84*	0.61	0.15	0.64	0.22	0.2*	0.08*	0.86	0.07	0.15	0.21
	Malignant lesion	18	0	0	1.82*	0.6*	0.55	0.15	0.54	0.18	0.15*	0.08*	0.82	0.08	0.19	0.21
/i/	Control	30	0	0	1.43	0.31	0.56*	0.08*	0.6	0.17	0.23*	0.05*	0.85	0.06	0.11	0.08
	Benign lesion	109	0	0	1.63	0.5	0.64*	0.16*	0.67	0.2	0.19*	0.1*	0.86	0.07	0.22	0.29
	Malignant lesion	18	0	0	1.66	0.52	0.59*	0.13*	0.6	0.17	0.16*	0.82*	0.83	0.06	0.2	0.23
/u/	Control	30	0	0	1.4*	0.29*	0.57*	0.09*	0.62	0.19	0.24*	0.04*	0.84	0.06	0.12*	0.07*
	Benign lesion	109	0	0	1.95*	0.96*	0.62*	0.16*	0.65	0.21	0.19*	0.09*	0.86	0.07	0.17*	0.26*
	Malignant lesion	18	0	0	1.77*	0.45*	0.53*	0.14*	0.53	0.17	0.17*	0.09*	0.82	0.08	0.17*	0.26*
/ü/	Control	30	0	0	1.51*	0.32*	0.55	0.09	0.58	0.2	0.24*	0.39*	0.84	0.64	0.09*	0.07*
	Benign lesion	109	0	0	2.08*	1.1*	0.62	0.15	0.66	0.2	0.2*	0.08*	0.86	0.7	0.16*	0.23*
	Malignant lesion	18	0	0	1.84*	0.64*	0.56	0.13	0.57	0.16	0.18*	0.09*	0.83	0.71	0.18*	0.26*

Note: ANOVA, *p < .05.

ROC curve analysis was used to evaluate the diagnostic performance of individual acoustic parameters in distinguishing between normal, benign, and malignant vocal cord conditions. The ROC curves of the single parameters that were statistically different (p < .05) in distinguishing normal-speaking subjects from those with organic vocal cord lesions were plotted. Only the top six parameters with the highest AUC are shown in Figures 3 and 4. Compared with the control group, /u/skew (AUC = 0.782, 95% CI [0.705–0.860]) had the highest AUC value, making it the strongest voice parameter to distinguish normal individuals from those with organic vocal cord lesions (Figure 3). When differentiating malignant from benign lesions, the strongest discriminant factors were /u/middle_f0 and /u/skew (Figure 4). The results of the ROC analysis for all differential parameters are presented in Appendix 2.
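The AUC of a single parameter can be read as the probability that a randomly chosen case scores higher than a randomly chosen control, with ties counted as one half. The sketch below computes it by direct pairwise comparison. The scores are hypothetical; because skewness tends to be lower in the lesion groups in Table 2-1, a parameter of that kind would be negated before use so that higher scores indicate disease.

```python
def auc_from_scores(case_scores, control_scores):
    """Fraction of (case, control) pairs ranked correctly; ties count one half."""
    wins = 0.0
    for c in case_scores:
        for h in control_scores:
            if c > h:
                wins += 1.0
            elif c == h:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

# Hypothetical negated skewness values (not study data)
cases = [-0.5, -1.0, -1.2]
controls = [-1.5, -0.9, -1.6]
auc = auc_from_scores(cases, controls)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two groups.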

Figure 3.

ROC curve of voice parameters to distinguish control versus vocal cord organic lesions.

Figure 4.

ROC curve of voice parameters to distinguish benign and malignant lesions of the vocal cords.

Machine learning prediction model for normal speaking subjects and patients with vocal cord organic lesions

Machine learning models were developed and evaluated for classifying vocal cord lesions and predicting malignancy based on acoustic parameters and demographic variables. A comprehensive analysis revealed that no single acoustic parameter could distinguish normal voices from benign or malignant vocal lesions. As a result, multivariate prediction models (XGBoost, LightGBM, Decision Tree, LR, RF, SVM, and KNN) were developed. The final model utilised six key acoustic features along with demographic variables (sex, age, smoking, and drinking). The XGBoost model demonstrated the best performance, with an AUC of 0.735, accuracy of 0.900, precision of 0.917, and recall of 0.971 (Table 3, Figure 5). The DCA curve showed that the XGBoost model provided a higher clinical net benefit than the all/none curves when the threshold probability was between 0.2 and 1 (Figure 6). Given that all patients had severe symptoms and sought treatment, the clinical net benefit of the overall intervention was similar to that of the model when the risk threshold probability was <0.2.
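Decision curve analysis compares strategies by their net benefit at a given threshold probability. The sketch below uses the standard definition, in which true positives are credited and false positives are penalised by the odds of the threshold; the "all" curve treats everyone as positive and the "none" curve has a net benefit of zero. The counts are hypothetical and the function names are assumptions.

```python
def net_benefit(tp, fp, n, threshold):
    """Net benefit = TP/n - (FP/n) * threshold / (1 - threshold)."""
    return tp / n - (fp / n) * threshold / (1 - threshold)

def net_benefit_treat_all(n_pos, n, threshold):
    """Net benefit when every subject is classified as positive."""
    return net_benefit(tp=n_pos, fp=n - n_pos, n=n, threshold=threshold)

# Hypothetical validation set: 40 subjects, 34 with lesions (not study data)
model_nb = net_benefit(tp=28, fp=1, n=40, threshold=0.8)
all_nb = net_benefit_treat_all(n_pos=34, n=40, threshold=0.8)
none_nb = 0.0  # classifying nobody as positive yields no benefit and no harm
print(model_nb, all_nb, none_nb)
```

At a high threshold each false positive is weighted heavily, so a model with few false positives can outperform the treat-all strategy even when disease prevalence is high.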

Figure 5.

ROC curves of different classifiers.

Figure 6.

DCA curve of XGBoost model.

Table 3.

The results of different classifiers for predicting organic lesions.

Model	Accuracy	Precision	Recall	F1	AUC
XGBoost	0.900	0.917	0.971	0.942	0.735
LightGBM	0.875	0.914	0.941	0.927	0.721
Decision tree	0.750	0.900	0.794	0.843	0.647
Logistic regression	0.850	0.889	0.941	0.914	0.637
Random Forest	0.825	0.886	0.911	0.896	0.623
SVM	0.850	0.867	0.971	0.917	0.569
KNN	0.800	0.861	0.912	0.886	0.540

LightGBM: Light Gradient-Boosting Machine; SVM: support vector machine; KNN: k-nearest neighbours; XGBoost: eXtreme Gradient Boosting.
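Because F1 is the harmonic mean of precision and recall, the F1 column of Table 3 can be cross-checked from the other two columns. The short Python sketch below does so using the precision and recall values printed in the table; the helper name `f1_score` is an assumption for illustration, and values recomputed from rounded inputs may differ from the published ones in the third decimal place.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Precision and recall as printed in Table 3.
table3 = {
    "XGBoost": (0.917, 0.971),
    "LightGBM": (0.914, 0.941),
    "Decision tree": (0.900, 0.794),
    "Logistic regression": (0.889, 0.941),
    "Random Forest": (0.886, 0.911),
    "SVM": (0.867, 0.971),
    "KNN": (0.861, 0.912),
}

recomputed = {name: round(f1_score(p, r), 3) for name, (p, r) in table3.items()}
for name, value in recomputed.items():
    print(f"{name}: F1 = {value}")
```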

For the best-performing XGBoost model, SHAP analysis was employed to quantify and visualise the contribution of each feature to the prediction. The model identified /i/skew and /u/skew as the most significant contributors: higher /i/skew decreased the probability of vocal cord organic lesions, while higher /u/skew, /a/peak2valley, and /o/kurt also had a negative impact, although to a lesser extent. Low values of gastroesophageal reflux disease, sex, drinking, and smoking habits were associated with negative SHAP values, indicating a lower probability of lesions, while higher values increased the probability (Figure 7).
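To make the sign convention of SHAP values concrete, the sketch below computes exact Shapley values by enumerating feature coalitions for a tiny hand-written scoring function. This is not the SHAP library and not the fitted XGBoost model; the toy model, its coefficients, the feature names, and the all-zero baseline are assumptions chosen only to mirror the directions of effect described above.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one instance.

    Features outside a coalition are replaced by their baseline value, which
    is one common (simplified) way to define the value of a coalition.
    """
    names = list(instance)
    n = len(names)

    def coalition_value(coalition):
        x = {k: (instance[k] if k in coalition else baseline[k]) for k in names}
        return predict(x)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_feature = coalition_value(set(subset) | {name})
                without_feature = coalition_value(set(subset))
                total += weight * (with_feature - without_feature)
        phi[name] = total
    return phi


# Hypothetical risk score, NOT the fitted model: higher skewness lowers the
# score and smoking raises it.
def toy_model(x):
    return 0.8 - 0.3 * x["i_skew"] - 0.1 * x["u_skew"] + 0.2 * x["smoking"]


instance = {"i_skew": 1.0, "u_skew": 0.5, "smoking": 1.0}
baseline = {"i_skew": 0.0, "u_skew": 0.0, "smoking": 0.0}
phi = shapley_values(toy_model, instance, baseline)
print({k: round(v, 3) for k, v in phi.items()})
```

A negative value means the feature pushed this subject's score below the baseline prediction, a positive value means it pushed the score above, and the values sum to the difference between the subject's prediction and the baseline prediction.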

Figure 7.

SHAP summary plot in control vs vocal cord organic lesions.

Machine learning prediction model for benign and malignant vocal cord lesions

A model was developed to identify malignancy in organic lesions using 13 predictors, including acoustic features and clinical variables. The LightGBM model demonstrated the best performance, with an AUC of 0.924, accuracy of 0.924, precision of 0.935, and recall of 0.906 (Table 4, Figure 8). The DCA curve indicated that the model's clinical net benefit was significantly higher than that of the intervention-for-all or no-intervention strategies when the risk threshold was between 0.1 and 1 (Figure 9), highlighting its value for clinical decision-making. SHAP analysis revealed that age, drinking, male sex, and smoking had a significant impact on the prediction, with drinking contributing more strongly than smoking (Figure 10).

Figure 8.

ROC curve for predicting benign and malignant vocal cord lesions in models.

Figure 9.

DCA curve of LightGBM model.

Figure 10.

SHAP summary plot in benign versus malignant vocal cord lesions.

Table 4.

The results of different classifiers for predicting benign and malignant vocal cord lesions.

Model	Accuracy	Precision	Recall	F1	AUC
LightGBM	0.924	0.935	0.906	0.920	0.924
KNN	0.910	0.861	0.967	0.912	0.911
Random Forest	0.910	0.964	0.844	0.900	0.907
XGBoost	0.893	0.903	0.876	0.889	0.894
SVM	0.879	0.929	0.812	0.867	0.878
Logistic regression	0.879	0.930	0.813	0.867	0.877
Decision tree	0.864	0.960	0.750	0.842	0.830

LightGBM: Light Gradient-Boosting Machine; SVM: support vector machine; KNN: k-nearest neighbours; XGBoost: eXtreme Gradient Boosting.

Discussion

Current diagnostic approaches for vocal cord lesions include clinical evaluations such as comprehensive interviews, perceptual voice assessments, laryngoscopy, aerodynamic testing, and laryngeal electromyography.²⁴ However, these methods can be labour-intensive and costly, which hinders early diagnosis and treatment. As a result, there is an urgent need for a simpler and more efficient method to support the preliminary diagnosis of vocal cord lesions.

Vocal cord lesions can have a profound impact on an individual's voice quality.²⁵ Objective acoustic analysis plays a crucial role in assessing these changes by analysing specific acoustic parameters that provide valuable insights into the vibratory patterns and closure dynamics of the vocal cords.²⁶ In this study, we analysed a cohort of 157 patients, comprising 127 individuals with confirmed organic vocal cord lesions (109 benign and 18 malignant) and an additional 30 patients presenting with various otorhinolaryngological conditions. Significant differences were observed in demographic variables such as age, gender, smoking, and drinking habits between the control and lesion groups. These differences suggest that demographic factors may play a role in the development of vocal cord lesions. For example, older age and male gender were more prevalent in the malignant lesion group, highlighting potential risk factors.

Voice samples were obtained from all participants to enable objective acoustic analysis. The primary aim was to identify distinctive acoustic biomarkers capable of differentiating between normal vocal cords, benign lesions, and malignant lesions. Comparative analysis revealed 63 statistically significant differences in acoustic parameters between healthy and lesion-affected groups. Six key parameters (/u/skew, /i/skew, /o/kurt, /o/shapefactor, /o/impulsefactor, and /a/peak2valley) demonstrated high diagnostic value. These acoustic parameters reflect the impact of vocal cord lesions on vocal fold vibration, closure dynamics, and sound quality. Skewness quantifies the asymmetry of vocal fold vibration waveforms. During normal vocal fold closure, the glottal wave exhibits an approximately symmetric triangular shape, and skewness values close to zero are observed clinically. When benign lesions (e.g. polyps, cysts) develop on the vocal folds, the uneven mass distribution results in asynchronous glottal closure during vibration. This causes the vibratory waveform to shift towards one side, thereby increasing the magnitude of the skewness (positive or negative skew). In cases of malignant tumours infiltrating the vocal fold muscle layer, vibrational symmetry is disrupted, potentially leading to more pronounced skewness abnormalities compared to benign conditions. In this study, the /u/ skewness achieved an AUC of 0.782 in distinguishing between normal and pathological voices. Compared with histopathological results, the skewness value showed a positive correlation with the degree of vocal fold mucosal wave asymmetry (r = 0.63, p < .01).
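As an illustration of what the six key parameters measure, the sketch below computes skewness, kurtosis, shape factor, impulse factor, and peak-to-valley for a short list of samples. The formulas are the usual signal-statistics definitions (for example, shape factor as RMS divided by mean absolute value); whether the study used exactly these definitions is an assumption, and the example signals are invented.

```python
import math


def shape_statistics(signal):
    """Distribution-shape descriptors of one waveform segment."""
    n = len(signal)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = sum(signal) / n
    m2 = sum((x - mean) ** 2 for x in signal) / n  # population variance
    m3 = sum((x - mean) ** 3 for x in signal) / n
    m4 = sum((x - mean) ** 4 for x in signal) / n
    if m2 == 0:
        raise ValueError("constant signal has undefined skewness and kurtosis")
    rms = math.sqrt(sum(x * x for x in signal) / n)
    mean_abs = sum(abs(x) for x in signal) / n
    return {
        "skew": m3 / m2 ** 1.5,          # 0 for a symmetric distribution
        "kurt": m4 / m2 ** 2,            # 3 for a Gaussian (non-excess form)
        "shapefactor": rms / mean_abs,   # RMS over mean absolute value
        "impulsefactor": max(abs(x) for x in signal) / mean_abs,
        "peak2valley": max(signal) - min(signal),
    }


# Invented sample values: one symmetric segment and one with a single sharp peak.
symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
skewed = [0.0, 0.0, 0.0, 0.0, 5.0]
stats_sym = shape_statistics(symmetric)
stats_skw = shape_statistics(skewed)
print(stats_sym["skew"], stats_skw["skew"])
```

The symmetric segment yields a skewness of zero, while the segment dominated by one sharp excursion yields a clearly positive skewness, a larger kurtosis, and a larger impulse factor.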

The application of machine learning algorithms to analyse voice samples offers a promising approach for classifying distinct vocal pathologies. The integration of these algorithms into the assessment of vocal cord lesions has the potential to significantly enhance both diagnostic accuracy and efficiency.²⁷ In this study, we implemented seven machine learning algorithms (XGBoost, KNN, LR, SVM, LightGBM, RF, and DT) to evaluate their performance in diagnosing vocal cord lesions.²⁸ To assess the effectiveness of these models, we employed a range of metrics, including accuracy, precision, recall, F1 score, and the AUC.²⁹ Our final model was designed to diagnose organic lesions in subjects with normal vocal characteristics using six key features identified for their predictive strength: /u/skew, /i/skew, /o/kurt, /o/shapefactor, /o/impulsefactor, and /a/peak2valley. We also incorporated demographic variables such as sex, age, smoking status, and alcohol consumption as predictors.³⁰

Our comparative analysis of machine learning models revealed distinct performance advantages: the XGBoost algorithm performed best in differentiating normal vocal cords from organic lesions (AUC = 0.735, accuracy = 0.900, precision = 0.917, recall = 0.971), while LightGBM achieved the strongest performance in malignancy prediction (AUC = 0.924, accuracy = 0.924, precision = 0.935, recall = 0.906). These results suggest LightGBM's particular clinical utility for early malignant lesion detection due to its robust predictive performance. This work contrasts with the OS-ELM approach employed by FT Al-Dhief et al. in speech pathology classification, highlighting how algorithm selection should be tailored to specific diagnostic objectives.³¹ The demonstrated efficacy of gradient boosting methods (XGBoost/LightGBM) for vocal lesion analysis, combined with their computational efficiency, positions them as promising tools for clinical implementation. Future research directions should investigate hybrid systems incorporating online learning capabilities (e.g. OS-ELM) with our validated acoustic feature extraction pipeline to enable real-time diagnostic applications.

Through SHAP value analysis, we obtained interpretability insights that quantify the relative contribution of each feature to the model's predictive outcomes. The SHAP values and plots revealed that the skew of /i/, skew of /u/, peak2valley of /a/, age, and shapefactor of /o/ were the most influential features in the XGBoost model for predicting vocal cord organic lesions. Conversely, smoking had a relatively minor contribution. Furthermore, SHAP plot analysis revealed that age had a significant impact on the predictions of the LightGBM model, with older age being strongly associated with a higher likelihood of malignant vocal cord lesions. Similarly, drinking, male sex, and smoking were identified as promoting factors.¹⁸ Nuha Qais Abdulmajeed proposed a deep learning method based on unique feature selection that significantly improves the recognition of various speech pathologies, whereas our research focuses on distinguishing between benign and malignant vocal cord lesions and enhancing model interpretability through SHAP values. Taken together, this suggests that integrating deep learning with explainability methods could further enhance the accuracy and clinical applicability of speech pathology recognition.³²

Our study explores the potential application of acoustic analysis combined with AI for diagnosing vocal cord lesions. This approach may offer an alternative pathway in otolaryngological diagnostics, potentially reducing dependence on invasive methods through non-invasive, AI-assisted solutions. The integration of machine learning models like XGBoost provides preliminary evidence supporting the feasibility of processing acoustic data for diagnostic purposes. Additionally, our use of explainable AI techniques (SHAP values) addresses interpretability considerations in medical AI, contributing to discussions about transparent healthcare technologies.

From a practical perspective, the methodology could potentially enable earlier lesion detection, which might improve clinical intervention timelines. The explainable components may offer clinicians insights into feature contributions, possibly supporting diagnostic confidence in clinical settings.

Limitations of this study

While this study demonstrates the promising application of acoustic analysis and machine learning in diagnosing vocal cord lesions, several limitations must be acknowledged:

The sample size of 157 participants, including only 18 malignant cases, may limit the generalisability of the findings, particularly for rare or aggressive lesions.

The single-center design of the study could introduce selection bias, necessitating future multicenter studies with diverse populations to validate the results.

Although 34 acoustic parameters were analysed under controlled recording conditions, natural speech variability (e.g. emotional state, fatigue) was not fully accounted for. Additionally, other potentially relevant features (e.g. nonlinear dynamic measures) were not included in the analysis.

Despite attempting to address the imbalance in the dataset using over-sampling techniques (e.g. SMOTE), the training of the machine learning models was ultimately not performed using the over-sampled data due to observed degradation in performance. This decision may have further impacted the generalisability of the models.
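For readers unfamiliar with the over-sampling technique mentioned above, the following sketch shows the core idea of SMOTE: each synthetic minority sample is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. It is a simplified illustration only; the function name `smote_like`, its parameters, and the two-feature toy data are assumptions and do not reproduce the study's pipeline or any library implementation.

```python
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by squared Euclidean distance.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(base, minority[j])),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


# Hypothetical two-feature minority class (e.g. age and a skewness value).
malignant = [[62.0, 1.2], [70.0, 1.5], [66.0, 0.9], [58.0, 1.1]]
new_points = smote_like(malignant, n_new=5)
print(len(new_points))
```

In practice the features would usually be scaled before distances are computed, and over-sampling should be applied to the training portion only so that synthetic points never leak into the evaluation data.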

Despite these limitations, the study highlights the potential of AI-assisted voice analysis as a non-invasive diagnostic tool for vocal cord lesions. Future research should address these constraints through expanded cohorts, multicenter validation, broader feature selection, and prospective clinical trials to fully realise its clinical potential.

Conclusion

This study theoretically advanced the field of voice disorder diagnostics by establishing a novel framework that integrates acoustic analysis with machine learning. The identification of 63 distinctive acoustic parameters between healthy and pathological voices provided a quantitative basis for understanding how vocal cord lesions alter phonatory dynamics. The SHAP analysis further unveiled the interpretability of AI models, demonstrating that features like age and acoustic skewness are key predictors of malignancy, which enriched the theoretical understanding of lesion classification. Practically, the developed XGBoost and LightGBM models offered a non-invasive, efficient alternative to traditional diagnostic methods, potentially reducing reliance on invasive laryngoscopy and improving early detection rates for vocal cord malignancies.

The research yielded tangible clinical benefits. First, the non-invasive nature of voice recording reduces patient discomfort and eliminates risks associated with invasive procedures. Second, the models’ high accuracy and rapid processing speed enable scalable screening in primary care settings, especially for populations with limited access to specialised ENT facilities. Third, the SHAP-based interpretability allows clinicians to validate AI outputs against patient demographics (e.g. age, smoking history) and acoustic features, enhancing diagnostic confidence.

Despite the promising results, this study was not without limitations. The sample size, particularly for malignant cases, was relatively small, which may limit the generalisability of the findings. Additionally, the single-center design might introduce selection bias. The analysis did not fully account for natural speech variability, and other potentially relevant features such as nonlinear dynamic measures were not included. Furthermore, while attempts were made to address class imbalance using SMOTE, the over-sampled data was not used in the final model training due to observed performance degradation.

Future research should aim to validate the findings in larger, multicenter studies with more diverse patient populations. Incorporating additional acoustic features, such as nonlinear dynamic measures, and accounting for natural speech variability could further enhance model performance. Additionally, hybrid systems combining online learning capabilities with the validated acoustic feature extraction pipeline could enable real-time diagnostic applications. Finally, prospective clinical trials are needed to fully realise the clinical potential of AI-assisted acoustic analysis in diagnosing vocal cord lesions.

Supplemental Material

sj-docx-1-dhj-10.1177_20552076251376264 - Supplemental material for Acoustic signatures of organic lesions and the role of artificial intelligence in voice disorder diagnostics

Supplemental material, sj-docx-1-dhj-10.1177_20552076251376264 for Acoustic signatures of organic lesions and the role of artificial intelligence in voice disorder diagnostics by Keyi Ma, Yi Wang, Yulin Zhou, Lan Chen, Tiecheng Zhang, Fan Xu and Xiaoli Peng in DIGITAL HEALTH

Supplemental Material

sj-docx-2-dhj-10.1177_20552076251376264 - Supplemental material for Acoustic signatures of organic lesions and the role of artificial intelligence in voice disorder diagnostics

Supplemental material, sj-docx-2-dhj-10.1177_20552076251376264 for Acoustic signatures of organic lesions and the role of artificial intelligence in voice disorder diagnostics by Keyi Ma, Yi Wang, Yulin Zhou, Lan Chen, Tiecheng Zhang, Fan Xu and Xiaoli Peng in DIGITAL HEALTH

Supplemental Material

sj-docx-3-dhj-10.1177_20552076251376264 - Supplemental material for Acoustic signatures of organic lesions and the role of artificial intelligence in voice disorder diagnostics

Supplemental material, sj-docx-3-dhj-10.1177_20552076251376264 for Acoustic signatures of organic lesions and the role of artificial intelligence in voice disorder diagnostics by Keyi Ma, Yi Wang, Yulin Zhou, Lan Chen, Tiecheng Zhang, Fan Xu and Xiaoli Peng in DIGITAL HEALTH

Footnotes

Acknowledgements

We thank the medical sound database of Chengdu Medical College (http://ama.cmc.edu.cn) for its valuable support and Chengdu Zhiju Data Technology Co. Ltd for its valuable expert suggestions.

ORCID iDs

Xiaoli Peng

Consent to participate

The participants provided their written informed consent to participate in this study.

Consent for publication

All authors consent to publication.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Key Discipline Project at the School of Public Health, Chengdu, School joint funding, Sichuan Applied Psychology Research Centre, The National Key R&D Programme of China, National Natural Science Foundation of China, Research Starting Foundation of High-level Talent Introduction of The First Affiliated Hospital of Chengdu Medical College (Grant Nos. 21, CSXL-24215, 2023YFE0108400, 82073833, 8216050478, and CYFY-GQ19).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Availability of data and material

Data are available upon request to the corresponding authors.

Code availability

Codes are available upon request to the corresponding authors.

Supplemental Material

Supplemental material for this article is available online.

References

1. Zhang. Impact of the paraglottic space on voice production in an MRI-based vocal fold model. J Voice 2023; 37: 633.e15–633.e23.

2. Payten, Weir, Madill. Investigating current clinical practice in assessment and diagnosis of voice disorders: a cross-sectional multidisciplinary global web survey. Int J Lang Commun Disord 2024; 59: 2786–805.

3. Yang, Zhuang. Explore the pattern of biomechanical alterations in vocal fold scar and its objective quantitative assessment method. J Voice 2024; S0892–1997: 00212–1.

4. Dealino MAS, Ueha, Otsuka, et al. A scoping review on postoperative voice therapy for benign vocal fold lesions. Auris Nasus Larynx 2025; 52: 336–341.

5. Bukovszky, Fodor, Tóth, et al. Malignant transformation and long-term outcome of oral and laryngeal leukoplakia. J Clin Med 2023; 12: 4255.

6. Liu, Lei, Wischhoff, et al. Acoustic character governing variation in normal, benign, and malignant voices. Folia Phoniatr Logop 2025; 77: 137–146.

7. Pereira AMS, Batista, Paiva, et al. Multiparametric acoustic predictors: a comprehensive model for auditory perception of overall severity of vocal deviation in Brazilian Portuguese speakers. J Voice 2025; S0892–1997: 00193–6.

8. Wang, Kang, Chang, et al. Voice change after adenotonsillectomy in children: a systematic review and meta-analysis. Laryngoscope 2024; 134: 2538–2550.

9. Finkelstein, Banai, Sella Weiss. Accuracy of auditory and auditory-visual voice quality perception by speech language pathologists and ear, nose, and throat specialists. J Voice 2025; S0892–1997: 00028–1.

10. Lee, Tanaka, Kato, et al. Analysis and categorization of various types of vocal distortion in rock, metal, pop styles, and throat singing observed by high-speed digital imaging. J Voice 2025; 8: 00424–7.

11. Zhang. Principal dimensions of voice production and their role in vocal expression. J Acoust Soc Am 2024; 156: 278–283.

12. Hamdan, Hosri, Yammine, et al. Voice evaluation of patients with chronic rhinosinusitis and nasal polyposis: a case series and review of the literature. Folia Phoniatr Logop 2025; 14: 1–8.

13. Meng, Zhang, Meng, et al. Correlation between detection results of pepsin in vocal fold polyp tissues and the postoperative efficacy. J Voice 2024; 38: 1200–1206.

14. Meinert, Milne-Ives, Lim, et al. Accuracy and safety of an autonomous artificial intelligence clinical assistant conducting telemedicine follow-up assessment for cataract surgery. EClinicalMedicine 2024; 73: 102692.

15. Jiang, Pan, Smereka, et al. Exploring the role of opera voice quality exercise in the voice therapy. J Voice 2024; 21: 00053–5.

16. Song, Wan, Wang, et al. Establishment of a multi-parameter evaluation model for risk of aspiration in dysphagia: a pilot study. Dysphagia 2023; 38: 406–414.

17. Azadnajafabad, Mohammadi, Afzal, et al. Clinical characteristics and voice handicap index assessment of common benign vocal fold lesions: a case-control study from a referral center. J Voice 2025; 18: 00019–0.

18. Abdulmajeed, Al-Khateeb, Mohammed. A review on voice pathology: taxonomy, diagnosis, medical procedures and detection techniques, open challenges, limitations, and recommendations for future directions. J Intell Syst 2022; 31: 855–875.

19. Liao, Zhang, et al. Research on automatic assessment of the severity of unilateral vocal cord paralysis based on mel-spectrogram and convolutional neural networks. Biomed Eng Online 2025; 24: 76.

20. Yang, El-Attar, Chaspari. Deconstructing demographic bias in speech-based machine learning models for digital health. Front Digit Health 2024; 6: 1351637.

21. Marchese, Sensoli, Campagnini, et al. Artificial intelligence for the recognition of benign lesions of vocal folds from audio recordings. Acta Otorhinolaryngol Ital 2023; 43: 317–323.

22. Kavak, Gündüz, Vural, et al. Artificial intelligence based diagnosis of sulcus: assessment of videostroboscopy via deep learning. Eur Arch Otorhinolaryngol 2024; 281: 6083–6091.

23. Ioannidis, Sarridou, Bampoulas, et al. Nutritional biomarker-guided prediction of postoperative pain outcomes in elderly patients using a Shapley additive explanations (SHAP)-informed XGBoost approach. Cureus 2025; 17: e85048.

24. Lau, Thyagarajan. Voice changes in Parkinson's disease: what are they telling us? J Clin Neurosci 2020; 72: 1–7.

25. Smeltzer, Stipancic, Toles. Minimal clinically important differences in CAPE-V auditory-perceptual ratings of voice quality. J Speech Lang Hear Res 2025; 68: 2275–2290.

26. Lee, Yeo, et al. Utility of evaluating symptom changes in the patient with patulous Eustachian tube through acoustic assessment: a case report. Med (Baltimore) 2025; 104: e43034.

27. Developing a smart system for binary classification of disordered voices using machine learning. Am J Otolaryngol 2025; 46: 104672.

28. Guo, Huang, et al. Diagnostic accuracy of deep learning-based algorithms in laryngoscopy: a systematic review and meta-analysis. Eur Arch Otorhinolaryngol 2025; 282: 351–360.

29. Tie, Zhu, et al. Multi-instance learning for vocal fold leukoplakia diagnosis using white light and narrow-band imaging: a multicenter study. Laryngoscope 2024; 134: 4321–4328.

30. Tsui, Tsao, Lin, et al. Demographic and symptomatic features of voice disorders and their potential application in classification using machine learning algorithms. Folia Phoniatr Logop 2018; 70: 174–182.

31. Al-Dhief, Baki, Abdul Latiff, et al. Voice pathology detection and classification by adopting online sequential extreme learning machine. IEEE Access 2021; 9: 77293–77306.

32. Abdulmajeed, Al-Khateeb, Mohammed. Voice pathology identification system using a deep learning approach based on unique feature selection sets. Expert Syst 2023; 42: e13327.
