Sage Journals: Discover world-class research

Abstract

Objective

With the increasing prevalence of Parkinson's disease (PD) and the development of PD-based acoustic recording databases, this study aims to evaluate the feasibility of using an ensemble-based machine learning (ML) approach to detect PD across diverse acoustic datasets.

Methods

We utilized three publicly available PD speech datasets—MIU (Sakar), UEX (Carrón), and UCI (Little)—to build ML models incorporating a hybrid preprocessing framework. This framework includes a scaling phase (using RobustScaler), a sampling phase (employing random oversampling (ROS), synthetic minority oversampling technique (SMOTE), and random undersampling (RUS)), and an ML classifier selection phase (featuring eXtreme gradient boosting (XGBoost) and adaptive boosting (AdaBoost)). Performance was evaluated using accuracy, precision, recall, and F1-score metrics. Additionally, we conducted SHAP (SHapley Additive exPlanations) analysis to identify the most significant PD-related acoustic features.

Results

The optimal combination of preprocessing and classification techniques varied across datasets. However, the highest classification performance was generally achieved using RobustScaler for scaling, a combination of ROS, SMOTE, and RUS for sampling, and XGBoost or AdaBoost for classification. The best-performing model on the MIU dataset achieved accuracy of 97.37%, precision of 96.07%, and F1-score of 96.57%. The UEX and UCI datasets achieved perfect classification with 100% accuracy, precision, and recall. SHAP analysis revealed that Mel-frequency cepstral coefficients were consistently among the most influential PD-related acoustic features.

Conclusions

Our findings confirm the feasibility of an ensemble-based approach for PD detection using acoustic recordings, highlighting the importance of dataset-specific preprocessing strategies. This study ranks impactful PD-related acoustic features, offering guidance for future voice-based PD screening tools.

Keywords

Parkinson's disease; acoustic datasets; machine learning; SHAP analysis; Mel-frequency cepstral coefficients

Introduction

Parkinson's disease (PD) is a prevalent neurodegenerative disorder affecting over 10 million individuals worldwide.¹ With the aging global population, the number of PD cases is expected to exceed 17 million by 2040.² This increasing prevalence places a significant burden on healthcare systems: the economic cost of PD in the United States alone surpassed $52 billion in 2017, a significant increase from $14.4 billion in 2010, and is projected to grow past $79 billion by 2037.^3,4 Early and accurate diagnosis is critical for improving patient outcomes, yet misdiagnosis rates remain high, reaching up to 26% in the United Kingdom.⁵ Due to symptom overlap with other neurological disorders such as essential tremor and multiple system atrophy,^6,7 PD is often confused with conditions like stroke or superficial siderosis.^8,9 Currently, postmortem neuropathological examination remains the gold standard for definitive diagnosis,¹⁰ highlighting the urgent need for accurate premortem diagnostic tools that enable timely interventions.

Recent advances in biomarker discovery,^11–13 respiratory dysfunction analysis,¹⁴ and high-field MRI^15,16 have contributed to improving PD diagnostics. However, many of these methods require expensive equipment and specialized expertise, limiting their widespread application. Moreover, they are typically employed only after clear motor symptoms have developed,¹⁷ missing the window for early intervention.^18,19 Recent studies have shown that voice-based assessments can effectively screen for cognitive and neurological disorders, supporting the use of speech as a scalable digital biomarker.^20,21 This further highlights the potential of speech analysis as a low-cost, noninvasive tool for early detection of PD.

Despite numerous studies applying machine learning (ML) to Parkinson's-related speech data, key challenges remain: limited dataset generalizability, inconsistent preprocessing pipelines, and high class imbalance often result in overfitted or biased models. Prior work typically focuses on a single dataset or uses narrow classifier types without interpretability.^22–24 Our study addresses this gap by proposing a generalized hybrid pipeline that combines robust scaling, multiple resampling techniques, and ensemble classifiers, validated across three heterogeneous datasets, with SHAP-based interpretation to ensure transparency.

Speech as a diagnostic biomarker for PD

Speech impairment is among the earliest detectable symptoms of PD, affecting up to 90% of patients during disease progression, with 78% of those in Hoehn and Yahr Stage 1–2—early stages—showing speech deficits.^25–28

These changes, including monotonic speech, reduced volume, and articulatory imprecision, arise from neuromuscular control deficits in the basal ganglia.²⁹ Unlike MRI-based diagnostics, speech analysis offers a noninvasive, cost-effective, and easily deployable method that can facilitate remote monitoring and large-scale screening.^30,31 To extract meaningful information from speech recordings, researchers analyze acoustic features such as:

Mel-frequency cepstral coefficients (MFCCs)—widely used in speech recognition and PD classification³²

Jitter and shimmer—measuring frequency and amplitude perturbations, often increased in PD speech³³

Pitch period entropy (PPE)—quantifying variability in vocal fold vibrations.³⁴

Several ML models have been applied to classify PD from speech data with promising results.^35–38 However, significant challenges remain:

Heterogeneity of datasets—Existing datasets differ in recording conditions, extracted features, feature extraction methods, and participant demographics, making cross-study comparisons difficult.³⁹

Class imbalance—PD datasets often contain disproportionately fewer healthy controls, leading to biased ML models.^40,41

Lack of standardized feature selection—No consensus exists on the optimal speech biomarkers for PD diagnosis, resulting in inconsistent methodologies across studies.⁴²

Objective of this study

To address these challenges, we propose a hybrid preprocessing and ensemble classification approach for PD speech analysis. Our method integrates:

Multiple data scaling techniques (RobustScaler, MinMaxScaler, Z-score standardization) to reduce dataset heterogeneity

Sampling strategies (synthetic minority oversampling technique (SMOTE), random oversampling (ROS), and random undersampling (RUS)) to resolve class imbalance

Ensemble learning models (eXtreme gradient boosting (XGBoost), adaptive boosting (AdaBoost), and Random Forest) to enhance classification performance across diverse datasets

We evaluate our framework using three publicly available PD speech datasets (MIU, UEX, and UCI), comparing the impact of scaling, resampling, and classification techniques on diagnostic accuracy. Furthermore, we employ SHAP analysis to identify key acoustic features influencing model predictions, contributing to future research on standardized PD speech biomarkers.

By establishing a generalizable ML pipeline, this study aims to create a simple and robust framework for addressing the complexities of diverse datasets—such as variations in extracted features, sample sizes, demographic compositions, and recording conditions—while maintaining consistent diagnostic performance.

Given the high rates of misdiagnosis and the scarcity of neurologists in many regions, there is an urgent need for scalable, noninvasive, and interpretable screening tools for early PD detection. Voice-based assessments, which can be performed remotely and require minimal equipment, offer a promising solution. This study aims to develop and validate a flexible ML pipeline for voice-based PD screening, providing clinical value in resource-limited settings and telemedicine applications.

Method

Flowchart

Our research utilized speech signal databases from PD patients collected from three distinct medical institutions. Our ensemble method can be split into two stages, preprocessing and classification: the data is scaled and rebalanced in the preprocessing stage, and an ML model is built in the classification stage. In the preprocessing stage, we employ data scaling techniques (Robust Scaler, Max Abs Scaler, Min-Max Scaler, and Standardization) to homogenize the data, and sampling techniques (ROS, SMOTE, Borderline SMOTE, adaptive synthetic sampling (ADASYN), RUS, cluster undersampling (CUS), cluster centroid undersampling (CCS), and NearMiss (NM) 1, 2, and 3) to address the data imbalance. Next, in the classification stage, the preprocessed data is fed into supervised ML algorithms, namely AdaBoost, XGBoost, gradient boosted decision trees (GBDT), K nearest neighbors (KNN), random subspace ensemble (RaSE), and Random Forest, to build a model. As part of our approach, the two phases of the preprocessing stage (scaling and sampling) and the model-building phase of the classification stage are serially mixed and matched into multiple learning pipelines, and an extensive comparison of the performance of these pathways is conducted. The flowchart below (Figure 1) summarizes our proposed method.

Figure 1.

Flowchart of our proposed method.
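The mix-and-match design described above can be pictured as a grid over scaling, sampling, and classifier options. The sketch below only enumerates the option names that appear in this paper ("None" and "Vanilla" stand for no scaling and no resampling); the exact grid the authors searched is not stated, so the resulting count is an illustrative assumption rather than a reported figure.

```python
from itertools import product

# Option names taken from the Method section; "None"/"Vanilla" mean that the
# scaling or sampling phase is skipped. The grid itself is an assumption.
SCALERS = ["None", "Robust", "Max abs", "Min-max", "Z-score"]
SAMPLERS = ["Vanilla", "ROS", "SMOTE", "BLSMOTE", "ADASYN",
            "RUS", "CUS", "CCS", "NearMiss-1", "NearMiss-2", "NearMiss-3"]
CLASSIFIERS = ["AdaBoost", "XGBoost", "GBDT", "KNN", "RaSE", "Random Forest"]


def enumerate_pipelines():
    """Return every (scaler, sampler, classifier) pathway as a tuple."""
    return list(product(SCALERS, SAMPLERS, CLASSIFIERS))


pipelines = enumerate_pipelines()
print(len(pipelines))   # 5 * 11 * 6 candidate pathways
print(pipelines[0])     # first pathway in the grid
```

Each tuple would correspond to one trained and evaluated pipeline, which is why the comparison grows quickly as options are added.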

Overview of selected PD voice datasets

In this study, we utilize three publicly available datasets compiled by Sakar,²² Carrón,²³ and Little,²⁴ which contain acoustic features related to PD. They were selected for their public availability and extensive documentation, providing a reliable framework for reproducibility and comparative analyses. Each dataset consists of voice recordings collected under different protocols, with variations in participant demographics, recording conditions, and extracted acoustic features. Table 1 provides a consolidated comparison of these datasets.

Table 1.

Comparative summary of PD voice datasets.

Feature	MIU dataset (Sakar et al., 2019)	UEX dataset (Carrón et al., 2021)	UCI dataset (Little et al., 2009)
Dataset source and collection protocol	Department of Neurology, Cerrahpaşa Faculty of Medicine, Istanbul University (MIU database). Participants sustained the phonation of the vowel /a/ for acoustic feature extraction.	University of Extremadura (UEX database). Participants sustained the vowel /a/ in a supervised environment to assess mobile-assisted voice analysis.	UCI machine learning repository. Participants produced sustained vowel /a/ phonations to evaluate the feasibility of dysphonia features for PD monitoring.
Sample size	252 (188 PD, 64 healthy)	60 (30 PD, 30 healthy)	31 (23 PD, 8 healthy)
Demographics	PD group: 107 males, 81 females; Healthy group: 23 males, 41 females; Age range (mean ± SD): PD 33–87 (65.1 ± 10.9), Healthy 41–82 (61.1 ± 8.9)	PD group: 24 males, 6 females; Healthy group: 26 males, 4 females; Age (mean ± SD): PD 70.27 ± 9.54, Healthy 67.33 ± 8.57	PD group: 16 males, 7 females; Healthy group: 3 males, 5 females; Age range: 46–85 years (mean ± SD: 65.8 ± 9.8); Disease duration: 0–28 years
Speech tests and recording conditions	Sustained phonation of /a/; recorded using a 44.1 kHz microphone	Sustained phonation of /a/ (3 times per subject); recorded in a supervised environment using a 44.1 kHz microphone	Sustained phonation of /a/; 195 voice recordings collected; recorded using a 44.1 kHz microphone
Number of features	754 features	34 features	23 features
Feature categories	Tunable Q-factor wavelet transform (TQWT): 432 features; Baseline features (jitter, shimmer, fundamental frequency, RPDE, DFA, PPE, etc.): 21 features; Time-frequency features (formant frequencies, bandwidth, etc.): 11 features; Mel-frequency cepstral coefficients (MFCCs): 84 features; Wavelet transform-based features (WT features related to F0): 182 features; Vocal fold features (glottis quotient, glottal to excitation, vocal fold excitation ratio, empirical mode decomposition, etc.): 22 features	1. Health status; 2. Sex; 3. Jitter; 4. Shimmer; 5. LZ-2; 6. CCP; 7. Hurst; 8. MFSW; 9. Shannon; 10. Permutation; 11. PPE; 12. FFMI; 13. FZCF; 14. GNE; 15. ZCR; 16. D2; 17. HNR; 18. RPDE; 19. GQ prc5–95; 20. GQ std cycle; 21–33. MFCCs (0–12)	MDVP:Fo(Hz), MDVP:Fhi(Hz), MDVP:Flo(Hz), MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, Jitter:DDP, MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ, Shimmer:DDA, NHR, HNR, RPDE, DFA, spread1, spread2, D2, PPE

Note: PPE: pitch period entropy; MDVP:APQ: multidimensional voice program amplitude perturbation quotient; PD: Parkinson’s disease.

Dataset origins and collection protocols: The three datasets were collected with a focus on sustained vowel phonation of /a/, which is commonly used in PD speech analysis. While all datasets used microphones with a sampling rate of 44.1 kHz, the UEX dataset specified a supervised recording environment, whereas the MIU and UCI datasets did not provide explicit supervision details.

Participant demographics: The MIU dataset is the largest, containing 252 subjects, while the UCI dataset is the smallest, with only 31 participants. The UEX dataset has a balanced distribution of 30 PD and 30 healthy participants, while the other two datasets have a higher proportion of PD cases. Age distributions also vary, with the MIU dataset covering a broader age range (33–87 years) compared to the UEX (51–87 years) and UCI (46–85 years) datasets. The UCI dataset uniquely reports disease duration, ranging from 0 to 28 years.

Speech analysis and acoustic features: The datasets contain a diverse set of acoustic features, with the MIU dataset being the most extensive (754 features) and the UCI dataset the smallest (23 features). These features include baseline measures like jitter, shimmer, and fundamental frequency, which reflect pitch and amplitude instability. Time-frequency features and MFCCs capture spectral and articulatory characteristics affected by PD-related motor dysfunction. More advanced representations, such as wavelet- and TQWT-based features, enable the detection of subtle nonstationary patterns in speech. Complexity and entropy-based measures (e.g. Lempel–Ziv complexity, PPE, DFA, Shannon entropy) provide insight into the irregularity and reduced dynamism typical of PD speech. Collectively, these features offer a multidimensional view of voice pathology, though they are distributed unevenly across datasets, as noted in the introduction.

Each dataset offers unique advantages for PD classification. The MIU dataset provides the most comprehensive feature set, making it ideal for deep feature analysis. The UEX dataset balances PD and control cases while incorporating entropy-based and complexity-related features. The UCI dataset, despite its smaller size, has been widely used in PD speech research due to its simplicity and effectiveness in assessing dysphonia-related features. These variations highlight the importance of selecting appropriate datasets based on research objectives, model complexity, and feature requirements.

Scaling techniques

The acoustic features used as model inputs vary substantially in scale across the PD datasets, so the data must be standardized. In this study, we selected four scaling methods, based on their ability to reduce dataset heterogeneity and handle outliers effectively, to overcome this challenge⁴³:

Robust Scaler, which scales each feature using its median and interquartile range.

Max Abs Scaler, which divides each feature by its maximum absolute value.

Min-Max Scaler, which scales features into a specified range (e.g. 0 to 1).

Standardization (Z-score normalization or Z-score standardization), which scales features to have a mean of 0 and a standard deviation of 1.
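The four scaling rules listed above can be written out for a single feature column. This is a minimal pure-Python sketch with made-up values (one deliberate outlier), not the implementation used in the study; the function names are hypothetical, quartiles are computed by linear interpolation, and the Z-score uses the population standard deviation, both of which are assumptions.

```python
import statistics


def robust_scale(xs):
    """Center on the median and divide by the interquartile range (IQR)."""
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - median) / (q3 - q1) for x in xs]


def max_abs_scale(xs):
    """Divide by the largest absolute value, so results lie in [-1, 1]."""
    largest = max(abs(x) for x in xs)
    return [x / largest for x in xs]


def min_max_scale(xs):
    """Map the minimum to 0 and the maximum to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def z_score(xs):
    """Subtract the mean and divide by the population standard deviation."""
    mean, std = statistics.fmean(xs), statistics.pstdev(xs)
    return [(x - mean) / std for x in xs]


# Hypothetical feature column with one outlier (100.0).
# Constant columns (zero IQR, range, or std) would need special handling.
feature = [1.0, 2.0, 3.0, 4.0, 100.0]
print(robust_scale(feature))
print(min_max_scale(feature))
```

In this toy column the outlier stretches the min-max range so that the four typical values are squeezed close to 0, whereas the median/IQR rule keeps them spread between -1 and 0.5, which illustrates why a robust scaler can help with outlier-heavy acoustic features.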

Sampling techniques

To balance the datasets, we employ a series of upsampling and downsampling techniques, each used exclusively: if one sampling technique is applied, no other upsampling or downsampling technique is used in the same pipeline. In upsampling, the minority class is enlarged with duplicated or synthesized samples until it matches the majority class. To do this, we utilize four different methods:

SMOTE, which synthesizes new samples from a minority-class sample and its k-nearest minority-class neighbors, and is applied repeatedly until balance is achieved.

ROS, which randomly chooses samples from the minority class and duplicates them until the classes in the dataset are balanced.

Borderline SMOTE, which oversamples the minority samples that lie closest to the boundary between the two classes.

ADASYN, which oversamples based on the learning difficulty of each minority sample: the harder a sample is to learn, the more synthetic samples are generated around it.
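To make the difference between duplicating and synthesizing concrete, the sketch below implements ROS and a simplified SMOTE-style interpolation in pure Python. The data, the function names, and the choice of k are hypothetical, and the SMOTE variant is a bare-bones illustration of the interpolation idea rather than the library implementation the authors may have used.

```python
import math
import random


def random_oversample(minority, target_size, rng):
    """ROS: duplicate randomly chosen minority samples up to target_size."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra


def smote_like(minority, target_size, k, rng):
    """SMOTE-style: place each synthetic sample on the segment between a
    minority sample and one of its k nearest minority-class neighbours."""
    synthetic = []
    while len(minority) + len(synthetic) < target_size:
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # relative position along the segment base -> mate
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return minority + synthetic


rng = random.Random(0)
# Hypothetical 2-feature minority class (4 samples); target of 10 samples.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
ros = random_oversample(minority, 10, rng)
smote = smote_like(minority, 10, k=2, rng=rng)
print(len(ros), len(smote))
```

ROS only ever returns exact copies of existing samples, while the SMOTE-style routine creates new points that lie between existing minority samples.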

In downsampling, majority-class samples are removed so that the majority class is reduced to the size of the minority class. To do this, we utilize a selection of five different methods:

Random subspace undersampling (RAS), in which a random subset of features is selected from the majority class to form a feature subspace.

RUS, in which samples are randomly selected from the majority class.

CUS, in which a clustering algorithm (e.g. k-means) is used to select sample clusters from the majority class.

CCS, in which a clustering algorithm (e.g. k-means) is used to select sample clusters from the majority class and a centroid (mean or median) of each cluster is calculated.

NM under-sampling variants (NM1, NM2, NM3), which select samples in the majority class that are most similar to samples in the minority class. Once samples are selected, they are removed, and this process is applied iteratively until balance is attained in the dataset.
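Of the downsampling methods above, RUS is the simplest to illustrate. The sketch below uses the MIU class counts from Table 1 (188 PD, 64 healthy) with placeholder sample identifiers; whether the authors resampled the full dataset or only a training split is not stated here, so the counts are purely illustrative.

```python
import random


def random_undersample(majority, target_size, rng):
    """RUS: keep a random subset of the majority class, without replacement."""
    return rng.sample(majority, target_size)


rng = random.Random(0)
# Class sizes follow the MIU row of Table 1; the IDs are hypothetical.
pd_ids = [f"pd_{i}" for i in range(188)]
healthy_ids = [f"hc_{i}" for i in range(64)]

pd_kept = random_undersample(pd_ids, len(healthy_ids), rng)
discarded = len(pd_ids) - len(pd_kept)
print(len(pd_kept), len(healthy_ids), discarded)
```

Balancing this way discards 124 of the 188 PD samples, which makes the trade-off of undersampling explicit: balance is gained at the cost of training data.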

These techniques were selected based on their effectiveness in handling biomedical class imbalance problems in prior literature,^44–47 ensuring our pipeline remains generalizable and robust across datasets of varying PD/control ratios.

ML algorithms

In our study, a diverse array of supervised ML classifiers was employed: AdaBoost, XGBoost, GBDT, KNN, RaSE, and Random Forest. Adaptive boosting (AdaBoost) combines multiple weak classifiers, weighting recurrently misclassified samples more heavily, to create a strong classifier. GBDT iteratively builds a stronger classifier by combining a series of weak learners in a sequential manner. XGBoost is an improved version of GBDT with added regularization, tree pruning, missing-value handling, and tuning options. KNN is a supervised learning classifier that assigns a class based on similarity to known data. RaSE combines many weak classifiers, where each classifier is trained in a randomly selected subspace. Random Forest is an ensemble method that constructs multiple decision trees and combines their predictions.
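The phrase "misclassifications are weighted more heavily" can be illustrated with one round of the classic discrete AdaBoost weight update. This is a textbook sketch with hypothetical labels and predictions; library implementations differ in detail, and the source does not state which variant was used.

```python
import math


def adaboost_reweight(weights, y_true, y_pred):
    """One round of the discrete AdaBoost update for labels in {+1, -1}.

    Returns (alpha, new_weights): alpha is the vote given to the weak
    learner, and new_weights are the renormalised sample weights.
    """
    # Weighted error of the weak learner (assumes 0 < err < 1).
    err = sum(w for w, y, h in zip(weights, y_true, y_pred) if y != h)
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Correct samples (y * h = +1) shrink; mistakes (y * h = -1) grow.
    raw = [w * math.exp(-alpha * y * h)
           for w, y, h in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return alpha, [w / total for w in raw]


y_true = [+1, +1, -1, -1, +1]
y_pred = [+1, +1, -1, -1, -1]  # the weak learner misses the last sample
weights = [0.2] * 5            # uniform starting weights
alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
print(round(alpha, 4), [round(w, 3) for w in new_weights])
```

After the update, the single misclassified sample carries half of the total weight, so the next weak learner is pushed to classify it correctly.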

By employing a diverse set of classifiers, this study explores multiple approaches to predictive modeling, evaluating the strengths and limitations of each. While more advanced ML and deep learning methods exist, our goal is to develop a simple, efficient approach that requires minimal tuning—facilitating real-time use and broader accessibility. Moreover, this simplicity enables rapid updates as new data becomes available, allowing the model to improve continuously without substantial delay. This flexibility is particularly important in PD research, where voice datasets are often heterogeneous, and patient characteristics can vary substantially. A comprehensive analysis of these models provides deeper insight into which techniques are most effective in PD detection with acoustic features. We prioritized ensemble methods such as XGBoost and AdaBoost due to their robustness on tabular and heterogeneous biomedical data.^48,49 These algorithms have demonstrated high accuracy and interpretability in PD voice classification, making them suitable for our pipeline. Additionally, models like GBDT, KNN, and RaSE were included for comparative analysis.

Performance evaluation

The performance of models was evaluated based on four widely used classification metrics: accuracy, precision, recall (sensitivity), and F1-score. These metrics are defined as follows:

Accuracy measures the proportion of all correct predictions among total predictions.

Precision indicates how many of the samples predicted as positive (PD) are truly positive.

Recall (or sensitivity) measures the ability of the model to detect actual PD cases.

F1-score is the harmonic mean of precision and recall, providing a balanced assessment in imbalanced datasets.

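\[ \text{Accuracy (Acc)} = \frac{T_p + T_n}{T_p + T_n + F_p + F_n} \]

\[ \text{Precision (Pre)} = \frac{T_p}{T_p + F_p} \]

\[ \text{Recall (Re)} = \frac{T_p}{T_p + F_n} \]

\[ \text{F1-score (F1)} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}} \]

Here T_p, T_n, F_p, and F_n denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.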
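As a worked example of these formulas, the sketch below plugs in the confusion-matrix counts reported for the MIU test set in the Results (83 true positives, 28 true negatives, 1 false positive, 2 false negatives), treating PD as the positive class. The function name is hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * recall * precision / (recall + precision)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Counts reported for the MIU test set (PD = positive class).
metrics = classification_metrics(tp=83, tn=28, fp=1, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

These are the positive-class values only; averaged figures across both classes are computed differently.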

These metrics were chosen to provide both statistical rigor and clinical relevance. In real-world medical applications, recall is particularly critical, as missing a PD diagnosis (false negative) could delay treatment and worsen patient outcomes. Precision is equally important to avoid false alarms and prevent unnecessary psychological burden or clinical referrals. F1-score balances both concerns and is especially useful when class distributions are unequal, as in many PD datasets. Finally, accuracy alone may be misleading in imbalanced data, and is therefore considered alongside the other metrics. This comprehensive evaluation framework allows for a more realistic assessment of how the model would perform in clinical screening or triage scenarios.

Results

Performance metrics for different scaling and sampling techniques

MIU database

This study investigates the application of ML techniques for PD classification using the MIU dataset. The focus lies on evaluating the impact of scaling and sampling methods on model performance, alongside an analysis of optimized threshold adjustments. The results are supported by visualizations such as confusion matrices and ROC curves (Figures 2 and 3) as well as detailed performance metrics (Tables 2 and 3).

Figure 2.

Parkinson's disease prediction on the MIU test dataset (a) confusion matrix (b) area under the ROC curve and optimal threshold.

Figure 3.

Performance evaluation using ROC and PR curves for Parkinson's disease prediction on the MIU test dataset (a) and (b): ROC and PR curves for XGBoost with different sampling methods (c) and (d): ROC and PR curves comparing different machine learning models.

Table 2.

Performance metrics for scaling and sampling techniques on MIU database.

Scaling techniques	Sampling techniques	Precision (%)	Recall (%)	F1-score (%)	Accuracy (%)	Best classifier
None	Vanilla	90.50	90.35	90.41	90.35	AdaBoost, XGBoost, GBDT
	ROS	93.68	92.98	93.18	92.98	XGBoost
	SMOTE	92.02	91.23	91.48	91.23	XGBoost
	BLSMOTE	89.29	89.47	89.35	89.47	AdaBoost
	ADASYN	91.50	90.35	90.71	90.35	XGBoost
	RUS	86.53	86.84	86.61	86.84	XGBoost
	CCS	76.20	70.18	67.94	70.18	AdaBoost
	NearMiss-1	82.21	80.70	79.47	80.70	XGBoost
	NearMiss-2	69.81	51.75	49.46	51.75	AdaBoost
	NearMiss-3	84.86	84.21	83.36	84.21	AdaBoost
Robust	Vanilla	91.45	90.35	90.69	90.35	AdaBoost
	ROS	96.07	97.10	96.57	97.37	XGBoost
	SMOTE	88.15	86.84	87.30	86.84	XGBoost
	BLSMOTE	92.98	92.98	92.83	92.98	AdaBoost
	ADASYN	90.26	90.35	90.29	90.35	AdaBoost
	RUS	86.35	85.96	85.33	85.96	AdaBoost
	CUS	87.78	87.72	87.25	87.72	XGBoost
	CCS	76.24	74.56	72.92	74.56	XGBoost
	NearMiss-1	71.78	69.30	67.20	69.30	XGBoost
	NearMiss-2	67.79	61.40	58.80	61.40	Random Forest
	NearMiss-3	84.85	85.09	84.85	85.09	AdaBoost
Max abs	Vanilla	92.52	90.35	90.99	90.35	XGBoost, GBDT
	ROS	89.33	88.60	88.88	88.60	AdaBoost
	SMOTE	89.47	89.47	89.47	89.47	XGBoost
	BLSMOTE	89.01	85.09	86.45	85.09	XGBoost
	ADASYN	88.94	84.21	85.87	84.21	AdaBoost, XGBoost
	RUS	80.74	81.58	80.78	81.58	XGBoost
	CUS	83.62	84.21	83.78	84.21	AdaBoost
	CCS	89.16	85.96	87.07	85.96	KNN
	NearMiss-1	73.95	73.68	71.56	73.68	KNN
	NearMiss-2	76.10	76.32	74.57	76.32	KNN
	NearMiss-3	79.77	80.70	79.96	80.70	XGBoost
Min-max	Vanilla	95.01	93.86	94.17	93.86	XGBoost, GBDT
	ROS	93.85	90.35	91.37	90.35	XGBoost
	SMOTE	95.22	89.47	91.13	89.47	XGBoost
	BLSMOTE	90.20	88.60	89.18	88.60	XGBoost
	ADASYN	87.75	86.84	87.22	86.84	XGBoost
	RUS	82.53	83.33	82.52	83.33	AdaBoost
	CUS	82.53	83.33	82.52	83.33	GBDT
	CCS	82.53	83.33	82.52	83.33	GBDT
	NearMiss-1	74.33	75.44	73.49	75.44	KNN
	NearMiss-2	78.69	78.95	77.27	78.95	XGBoost
	NearMiss-3	79.59	80.70	79.87	80.70	AdaBoost, XGBoost
Z-score	Vanilla	91.23	91.23	91.23	91.23	XGBoost, GBDT
	ROS	91.98	92.11	91.96	92.11	AdaBoost
	SMOTE	92.62	92.11	92.28	92.11	XGBoost
	BLSMOTE	88.33	88.60	88.38	88.60	XGBoost
	ADASYN	92.88	92.98	92.89	92.98	AdaBoost
	RUS	83.07	83.33	82.55	83.33	Random Forest
	CUS	85.81	83.33	82.17	83.33	AdaBoost, XGBoost
	CCS	90.05	85.96	87.26	85.96	KNN
	NearMiss-1	82.75	83.33	82.85	83.33	KNN
	NearMiss-2	82.82	83.33	82.69	83.33	GBDT
	NearMiss-3	86.50	86.84	86.46	86.84	XGBoost, KNN

Note: SMOTE: synthetic minority oversampling technique; BLSMOTE: Borderline SMOTE; ROS: random oversampling; RUS: random undersampling; GBDT: gradient boosted decision trees; AdaBoost: adaptive boosting; XGBoost: eXtreme gradient boosting; KNN: K nearest neighbors; CUS: cluster undersampling; CCS: cluster centroid undersampling; ADASYN: adaptive synthetic sampling.

Table 3.

Classification report for the MIU test dataset using an adjusted threshold (0.79).

Classification report
	Precision	Recall	F1-score	Support
0 (nonviable)	0.9333	0.9655	0.9492	29
1 (viable)	0.9881	0.9765	0.9822	85
Accuracy			0.9737	114
Macro avg	0.9607	0.9710	0.9657	114
Weighted avg	0.9742	0.9737	0.9738	114

The experimental results reveal that robust scaling combined with ROS achieved the highest accuracy of 97.37% when paired with the XGBoost classifier. This configuration also yielded a macro-averaged precision of 96.07%, recall of 97.10%, and F1-score of 96.57%, as shown in Table 3. By comparison, Z-score scaling with ADASYN sampling demonstrated lower performance, achieving an accuracy of 92.98%. Min-max scaling without any sampling (Vanilla) performed moderately well, with XGBoost and GBDT achieving a precision of 95.01% and an accuracy of 93.86%.
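The precision, recall, and F1 figures quoted above coincide with the macro-averaged row of Table 3. The sketch below shows how macro and support-weighted averages follow from per-class counts, using the MIU test-set confusion matrix (28 true negatives, 83 true positives, 1 false positive, 2 false negatives); the function and variable names are hypothetical.

```python
def per_class(tp, fp, fn):
    """Precision, recall, and F1 for one class treated as 'positive'."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Class 0: its 28 correct cases are "hits"; the 2 PD cases predicted as
# class 0 are its false positives, the 1 control predicted as PD its miss.
class0 = per_class(tp=28, fp=2, fn=1)
# Class 1 (PD): 83 hits, 1 false positive, 2 misses.
class1 = per_class(tp=83, fp=1, fn=2)
support = (29, 85)  # number of true samples per class

# Macro average: plain mean over classes, ignoring class size.
macro = [(a + b) / 2 for a, b in zip(class0, class1)]
# Weighted average: mean over classes, weighted by support.
weighted = [(support[0] * a + support[1] * b) / sum(support)
            for a, b in zip(class0, class1)]

print([round(v, 4) for v in macro])
print([round(v, 4) for v in weighted])
```

Because the macro average gives the smaller class the same weight as the larger one, it is the more conservative summary when the test set is imbalanced.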

The confusion matrix for the MIU test dataset (Figure 2(a)) illustrates that the model correctly classified 28 true negatives and 83 true positives, with only one false positive and two false negatives. This corresponds to a sensitivity of 97.65% and a specificity of 96.55%. The high classification accuracy is further corroborated by the ROC curve (Figure 2(b)), where the model achieved an area under the curve (AUC) of 0.99432 at an optimal threshold value of 0.79.
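The criterion used to pick the optimal threshold is not stated in this section. One common choice, shown here purely as an assumption, is to select the ROC point that maximizes Youden's J statistic (true-positive rate minus false-positive rate). The scores below are made up for illustration and are not data from the study.

```python
def youden_threshold(y_true, scores):
    """Return (threshold, tpr, fpr) maximising Youden's J = TPR - FPR.

    A sample is predicted positive when its score >= threshold.
    Ties keep the lowest threshold that reaches the maximum.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        tpr, fpr = tp / positives, fp / negatives
        if best is None or (tpr - fpr) > (best[1] - best[2]):
            best = (t, tpr, fpr)
    return best


# Hypothetical predicted PD probabilities (1 = PD, 0 = healthy control).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
scores = [0.10, 0.35, 0.60, 0.75, 0.55, 0.80, 0.85, 0.90, 0.95, 0.99]
threshold, tpr, fpr = youden_threshold(y_true, scores)
print(threshold, round(tpr, 3), fpr)
```

In this toy example the default cut-off of 0.5 would flag two controls as PD, whereas the selected threshold removes both false positives at the cost of one missed PD case, which is the kind of trade-off an adjusted threshold is meant to tune.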

When comparing sampling techniques, SMOTE paired with XGBoost delivered superior results, achieving an AUC of 0.9980, followed by ROS at 0.9943 and Vanilla at 0.9899 (Figure 3(a)). Precision-Recall curves further validate these findings, with SMOTE attaining an Average Precision (AP) of 0.9993 (Figure 3(b)), outperforming ROS (AP = 0.9980) and Vanilla (AP = 0.9966). Conversely, under-sampling methods such as Near-Miss variant 2 (NM2) performed poorly, with an AUC of only 0.8114 and AP = 0.9262.

A comparative analysis across ML models highlights XGBoost's superior performance, achieving an AUC of 0.9943 compared to GBDT (AUC = 0.9448) and Random Forest (AUC = 0.9323). KNN, however, struggled to capture complex patterns in the data, with an AUC of just 0.6582 (Figure 3(c)).

In conclusion, this study underscores the effectiveness of ensemble-based methods like XGBoost for PD classification tasks when paired with appropriate scaling and oversampling techniques such as ROS or SMOTE. These approaches not only address class imbalance but also enhance diagnostic accuracy, as evidenced by robust metrics and visual analyses.

UEX database

This study examines the application of ML techniques for classification using the UEX database. The research focuses on evaluating various scaling and sampling methods and their impact on model performance, along with threshold optimization for classification accuracy. The findings are supported by visualizations including an ROC curve and confusion matrix (Figure 4) as well as comprehensive performance metrics presented in Tables 4 and 5.

Table 4.

Performance metrics for scaling and sampling techniques on UEX database.

Scaling techniques	Sampling techniques	Precision (%)	Recall (%)	F1-score (%)	Accuracy (%)	Best classifier
None	Vanilla	91.11	88.89	88.89	88.89	XGBoost, GBDT, Random Forest
	ROS	91.11	88.89	88.89	88.89	XGBoost, Random Forest
	SMOTE	100.00	100.00	100.00	100.00	XGBoost
	BLSMOTE	91.11	88.89	88.89	88.89	XGBoost
	RUS	100.00	100.00	100.00	100.00	XGBoost, GBDT
	CCS	91.11	88.89	88.89	88.89	XGBoost
	NearMiss-1	100.00	100.00	100.00	100.00	XGBoost, Random Forest
	NearMiss-2	100.00	100.00	100.00	100.00	XGBoost
	NearMiss-3	91.67	88.89	89.18	88.89	RaSE
Robust	Vanilla	100.00	100.00	100.00	100.00	AdaBoost, RaSE
	ROS	100.00	100.00	100.00	100.00	AdaBoost
	SMOTE	91.67	88.89	89.18	88.89	AdaBoost, KNN, RaSE
	BLSMOTE	91.11	88.89	88.89	88.89	AdaBoost, GBDT, KNN
	RUS	91.67	88.89	89.18	88.89	AdaBoost, XGBoost
	CUS	100.00	100.00	100.00	100.00	RaSE
	CCS	91.67	88.89	89.18	88.89	AdaBoost, XGBoost, KNN
	NearMiss-1	100.00	100.00	100.00	100.00	AdaBoost, XGBoost
	NearMiss-2	100.00	100.00	100.00	100.00	AdaBoost
	NearMiss-3	100.00	100.00	100.00	100.00	AdaBoost
Max abs	Vanilla	92.59	88.89	89.57	88.89	RaSE
	ROS	100.00	100.00	100.00	100.00	RaSE
	SMOTE	77.78	77.78	77.78	77.78	GBDT, KNN
	BLSMOTE	90.74	88.89	88.60	88.89	AdaBoost
	RUS	100.00	100.00	100.00	100.00	RaSE
	CUS	90.74	88.89	88.60	88.89	AdaBoost, XGBoost, GBDT
	CCS	92.59	88.89	89.57	88.89	GBDT
	NearMiss-1	100.00	100.00	100.00	100.00	AdaBoost
	NearMiss-2	100.00	100.00	100.00	100.00	RaSE
	NearMiss-3	90.74	88.89	88.60	88.89	RaSE
Min-max	Vanilla	92.59	88.89	89.57	88.89	XGBoost, GBDT
	ROS	90.74	88.89	88.60	88.89	AdaBoost
	SMOTE	92.59	88.89	89.57	88.89	GBDT
	BLSMOTE	100.00	100.00	100.00	100.00	RaSE
	RUS	100.00	100.00	100.00	100.00	RaSE
	CUS	92.59	88.89	89.57	88.89	XGBoost, GBDT
	CCS	90.74	88.89	88.60	88.89	AdaBoost, GBDT, RaSE
	NearMiss-1	92.59	88.89	89.57	88.89	XGBoost, GBDT
	NearMiss-2	92.59	88.89	89.57	88.89	GBDT
	NearMiss-3	92.59	88.89	89.57	88.89	XGBoost, GBDT, RaSE
Z-score	Vanilla	85.19	77.78	77.22	77.78	AdaBoost
	ROS	83.33	66.67	66.67	66.67	AdaBoost, XGBoost
	SMOTE	85.19	77.78	77.22	77.78	AdaBoost
	BLSMOTE	83.33	66.67	66.67	66.67	AdaBoost, XGBoost, KNN, RaSE
	RUS	83.33	66.67	66.67	66.67	AdaBoost, XGBoost, GBDT, RaSE
	CUS	92.59	77.78	81.75	77.78	KNN
	CCS	83.33	66.67	66.67	66.67	AdaBoost, XGBoost, GBDT, RaSE
	NearMiss-1	83.33	66.67	66.67	66.67	AdaBoost, XGBoost, GBDT, RaSE
	NearMiss-2	77.78	77.78	77.78	77.78	RaSE
	NearMiss-3	77.78	77.78	77.78	77.78	AdaBoost

Note: SMOTE: synthetic minority oversampling technique; BLSMOTE: Borderline SMOTE; ROS: random oversampling; RUS: random undersampling; GBDT: gradient boosted decision trees; AdaBoost: adaptive boosting; XGBoost: eXtreme gradient boosting; RaSE: random subspace ensemble; KNN: K nearest neighbors; CUS: cluster undersampling; CCS: cluster centroid undersampling; ADASYN: adaptive synthetic sampling.

Table 5.

Classification report for the UEX test dataset using an adjusted threshold (0.52).

Classification report
	Precision	Recall	F1-score	Support
0 (nonviable)	1.0000	1.0000	1.0000	4
1 (viable)	1.0000	1.0000	1.0000	5
Accuracy			1.0000	9
Macro avg	1.0000	1.0000	1.0000	9
Weighted avg	1.0000	1.0000	1.0000	9

The experimental results from Table 4 demonstrate that multiple scaling-sampling combinations achieved perfect classification performance (100% precision, recall, F1-score, and accuracy). Notable among these are robust scaling with Vanilla and ROS sampling techniques, which attained perfect scores when paired with AdaBoost and RaSE classifiers. Similarly, no scaling (None) combined with SMOTE, RUS, NM1, and NM2 achieved 100% performance with XGBoost. Min-max scaling with BLSMOTE and RUS also reached perfect classification metrics when used with the RaSE classifier. Conversely, Z-score scaling consistently underperformed across all sampling techniques, with accuracy values predominantly between 66.67% and 77.78%.

The classification report in Table 5 confirms that implementing an adjusted threshold of 0.52 for the UEX test dataset resulted in flawless classification performance. This optimized threshold yielded perfect precision, recall, F1-score, and accuracy values of 1.0000 across both viable and nonviable classes.

The confusion matrix for the test dataset (Figure 4(a)) shows that the model correctly classified four true negatives (non-Parkinson's cases) and five true positives (Parkinson's cases), with no false positives or false negatives. This corresponds to a sensitivity of 100% and a specificity of 100%. The ROC curve (Figure 4(b)) supports this result, with an AUC of 1.0000, indicating ideal discrimination on this test split. The optimal classification threshold is 0.52, marked by the red dot on the curve.
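The sensitivity and specificity quoted above follow directly from the confusion-matrix counts. The short sketch below recomputes them from the reported counts (four true negatives, five true positives, no errors); the helper name `sensitivity_specificity` is an illustrative assumption.

```python
def sensitivity_specificity(tn, fp, fn, tp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# Counts reported for the UEX test split.
tn, fp, fn, tp = 4, 0, 0, 5
sens, spec = sensitivity_specificity(tn, fp, fn, tp)
accuracy = (tn + tp) / (tn + fp + fn + tp)

# With only nine test recordings, one misclassification shifts accuracy by 100/9 points.
accuracy_step = 100 / (tn + fp + fn + tp)
print(sens, spec, accuracy, round(accuracy_step, 2))
```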

Figure 4.

Parkinson's disease prediction on the UEX test dataset (a) confusion matrix (b) area under the ROC curve and optimal threshold.

Figure 5 showcases a comprehensive performance evaluation for PD prediction on the UEX test dataset using both ROC and PR curves. The analysis reveals several important findings about sampling techniques and model selection.

Figure 5.

Performance evaluation using ROC and PR curves for Parkinson's disease prediction on the UEX test dataset (a) and (b): ROC and PR curves for XGBoost with different sampling methods (c) and (d): ROC and PR curves comparing different machine learning models.

When comparing sampling techniques, multiple methods achieved optimal performance when paired with AdaBoost, with Vanilla, ROS, NM1, and RUS all attaining perfect AUC scores of 1.0000 (Figure 5(a)). Precision-Recall curves further validate these findings, with these methods achieving AP of 1.0000 (Figure 5(b)). Some sampling techniques demonstrated slightly reduced performance, such as SMOTE (SM) with an AUC of 0.9500 and Borderline SMOTE (BLSM) with an AUC of 0.8500. Similarly, CCS showed an AUC of 0.8500, and NM2 achieved an AUC of 0.9000 with corresponding AP = 0.9267.

A comparative analysis across ML models highlights the exceptional performance of several algorithms on this dataset. AdaBoost, XGBoost, GBDT, RaSE, and KNN all achieved perfect classification with AUC scores of 1.0000 (Figure 5(c)). These findings are reinforced by the PR curves, with all these models attaining AP scores of 1.0000 (Figure 5(d)). Random Forest showed slightly lower performance with an AUC of 0.9000 and AP of 0.9250, though still demonstrating strong predictive capability.

In conclusion, this study underscores the effectiveness of ensemble-based methods like AdaBoost and XGBoost for PD classification when paired with appropriate sampling techniques. Several sampling approaches, including Vanilla, ROS, NM1, and RUS, achieved perfect classification metrics, indicating their effectiveness in addressing the inherent challenges in medical diagnostic data. The consistent performance across multiple evaluation metrics demonstrates the robustness of these approaches for developing reliable diagnostic tools for PD.

UCI database

This section examines the application of ML techniques for PD classification using the UCI dataset. It evaluates the scaling and sampling methods and their impact on model performance, along with threshold optimization for classification accuracy. The findings are supported by a confusion matrix and ROC curve (Figure 6) and by the performance metrics presented in Tables 6 and 7.

Table 6.

Performance metrics for scaling and sampling techniques on UCI dataset.

Scaling technique	Sampling technique	Precision (%)	Recall (%)	F1-score (%)	Accuracy (%)	Classifier(s)
None	Vanilla	93.75	90.00	90.83	90.00	XGBoost, GBDT, RaSE, FOREST
	ROS	88.86	86.67	87.33	86.67	AdaBoost, XGBoost, RaSE, FOREST
	SMOTE	84.28	83.33	83.70	83.33	AdaBoost, RaSE, FOREST
	BLSMOTE	86.67	86.67	76.67	86.67	RaSE, FOREST
	ADASYN	88.86	86.67	87.33	86.67	RaSE
	RUS	90.68	90.00	90.22	90.00	RaSE, FOREST
	CCS	84.28	83.33	83.70	83.33	XGBoost, RaSE
	NearMiss-1	84.28	83.33	83.70	83.33	XGBoost, GBDT
	NearMiss-2	88.86	86.67	87.33	86.67	FOREST
	NearMiss-3	86.67	86.67	76.67	86.67	AdaBoost, RaSE, FOREST
Robust	Vanilla	95.00	93.33	93.66	93.33	FOREST
	ROS	90.68	90.00	90.22	90.00	XGBoost
	SMOTE	86.67	86.67	86.67	86.67	XGBoost, GBDT
	BLSMOTE	90.68	90.00	90.22	90.00	FOREST
	ADASYN	86.67	86.67	86.67	86.67	AdaBoost, XGBoost
	RUS	83.90	83.33	82.63	83.33	RaSE, FOREST
	CUS	96.82	96.67	96.61	96.67	AdaBoost
	CCS	76.44	76.67	75.68	76.67	AdaBoost, GBDT, KNN
	NearMiss-1	93.33	93.33	93.33	93.33	AdaBoost, GBDT
	NearMiss-2	86.67	86.67	86.67	86.67	XGBoost
	NearMiss-3	86.74	86.67	86.24	86.67	KNN
Max Abs	Vanilla	100.00	100.00	100.00	100.00	XGBoost, GBDT
	ROS	100.00	100.00	100.00	100.00	XGBoost
	SMOTE	96.81	96.67	96.57	96.67	XGBoost, RaSE
	BLSMOTE	96.81	96.67	96.57	96.67	AdaBoost, XGBoost
	ADASYN	97.22	96.67	96.78	96.67	AdaBoost, XGBoost
	RUS	96.81	96.67	96.57	96.67	XGBoost
	CUS	100.00	100.00	100.00	100.00	XGBoost
	CCS	96.81	96.67	96.57	96.67	XGBoost
	NearMiss-1	88.89	86.67	85.64	86.67	XGBoost
	NearMiss-2	86.81	83.33	81.85	83.33	XGBoost, KNN, FOREST
	NearMiss-3	89.72	90.00	89.72	90.00	KNN
Min-Max	Vanilla	95.00	93.33	93.66	93.33	XGBoost, GBDT, RaSE
	ROS	93.33	93.33	93.33	93.33	RaSE
	SMOTE	93.94	93.33	93.12	93.33	KNN, FOREST
	BLSMOTE	82.95	83.33	83.04	83.33	XGBoost, KNN
	ADASYN	89.89	90.00	89.82	90.00	XGBoost
	RUS	86.74	86.67	86.24	86.67	XGBoost, GBDT, RaSE, FOREST
	CUS	82.95	83.33	83.04	83.33	GBDT
	CCS	82.95	83.33	83.04	83.33	GBDT, RaSE
	NearMiss-1	93.33	93.33	93.33	93.33	XGBoost, FOREST
	NearMiss-2	77.20	73.33	71.80	73.33	XGBoost, RaSE
	NearMiss-3	97.08	96.67	96.74	96.67	XGBoost
Z-score	Vanilla	93.85	93.33	92.80	93.33	XGBoost, GBDT, KNN
	ROS	96.79	96.67	96.51	96.67	FOREST
	SMOTE	93.85	93.33	92.80	93.33	FOREST
	BLSMOTE	91.54	90.00	90.62	90.00	XGBoost, FOREST
	ADASYN	93.33	93.33	93.33	93.33	XGBoost
	RUS	88.72	86.67	85.00	86.67	XGBoost
	CUS	86.54	83.33	81.01	83.33	XGBoost
	CCS	91.15	90.00	88.94	90.00	XGBoost
	NearMiss-1	88.72	86.67	85.00	86.67	FOREST
	NearMiss-2	84.62	80.00	77.02	80.00	XGBoost, KNN
	NearMiss-3	86.54	83.33	81.01	83.33	XGBoost, KNN

Note: SMOTE: synthetic minority oversampling technique; ROS: random oversampling; RUS: random undersampling; GBDT: gradient boosted decision trees; AdaBoost: adaptive boosting; XGBoost: eXtreme gradient boosting; RaSE: random subspace ensemble; KNN: K nearest neighbors; FOREST: random forest; CUS: cluster undersampling; CCS: cluster centroid undersampling; ADASYN: adaptive synthetic sampling.

Table 7.

Classification report for the UCI test dataset using an adjusted threshold (0.91).

Classification report
	Precision	Recall	F1-score	Support
0 (nonviable)	1.0000	1.0000	1.0000	7
1 (viable)	1.0000	1.0000	1.0000	23
Accuracy			1.0000	30
Macro avg	1.0000	1.0000	1.0000	30
Weighted avg	1.0000	1.0000	1.0000	30

The experimental results from Table 6 demonstrate that several scaling-sampling combinations achieved perfect classification performance (100% precision, recall, F1-score, and accuracy). Notable among these are Max Abs scaling with Vanilla sampling, which attained perfect scores when paired with XGBoost and GBDT classifiers. Max Abs scaling also achieved 100% performance when combined with ROS and CUS sampling techniques using XGBoost. Robust scaling demonstrated strong results as well, with CUS sampling reaching 96.67% accuracy when used with AdaBoost. Conversely, CCS with Robust scaling produced the lowest performance among all combinations, achieving only 76.67% accuracy across AdaBoost, GBDT, and KNN classifiers.
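To make the difference between the compared scaling techniques concrete, the sketch below applies the four transformations to a single toy feature containing one outlier. It is a plain-Python approximation of the usual definitions (max-abs, min-max, z-score, and median/interquartile-range scaling), not the study's implementation; the function names and values are illustrative assumptions, and each function assumes the feature is not constant.

```python
import statistics

def max_abs_scale(values):
    """Divide by the largest absolute value; results lie in [-1, 1]."""
    m = max(abs(v) for v in values)
    return [v / m for v in values]

def min_max_scale(values):
    """Map the minimum to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score_scale(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mu) / sd for v in values]

def robust_scale(values):
    """Subtract the median and divide by the interquartile range."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - med) / (q3 - q1) for v in values]

feature = [1, 2, 3, 4, 100]  # made-up values; 100 acts as an outlier
for scale in (max_abs_scale, min_max_scale, z_score_scale, robust_scale):
    print(scale.__name__, [round(v, 3) for v in scale(feature)])
```

On this toy feature, max-abs and min-max scaling compress the four typical values into a narrow band near zero, whereas median/IQR scaling keeps them spread out and leaves only the outlier far from the rest.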

The classification report in Table 7 confirms that implementing an adjusted threshold of 0.91 for the UCI test dataset resulted in flawless classification performance. This optimized threshold yielded perfect precision, recall, F1-score, and accuracy values of 1.0000 across both classes (nonviable and viable).

The confusion matrix for the test dataset (Figure 6(a)) demonstrates that the model correctly classified seven true negatives (non-Parkinson's cases) and 23 true positives (Parkinson's cases), with no false positives or false negatives. This corresponds to a sensitivity of 100% and a specificity of 100%, indicating perfect classification performance. The ROC curve (Figure 6(b)) further supports this result, showing an AUC of 1.0000, signifying ideal model discrimination ability. The optimal threshold value for classification is identified as 0.91, marked by the red dot on the curve.
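The text does not state which criterion was used to pick the optimal threshold on the ROC curve. One common choice, assumed here purely for illustration, is Youden's J statistic (sensitivity + specificity - 1), which selects the ROC point farthest above the chance diagonal. The sketch below uses made-up labels and scores, not the study's predictions.

```python
def youden_threshold(y_true, scores):
    """Return (threshold, J) maximizing J = TPR - FPR, predicting positive when score >= threshold."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        j = tp / pos - fp / neg  # equals sensitivity + specificity - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

# Hypothetical predicted probabilities for three controls (0) and four PD cases (1).
y_true = [0, 0, 0, 1, 1, 1, 1]
scores = [0.10, 0.35, 0.60, 0.55, 0.70, 0.85, 0.95]
threshold, j_stat = youden_threshold(y_true, scores)
print(threshold, j_stat)
```

When the classes are perfectly separated, as reported for this test split, any threshold between the highest control score and the lowest PD score gives J = 1, so the selected value depends on where the scores happen to fall.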

Figure 6.

Parkinson's disease prediction on the UCI test dataset (a) confusion matrix (b) area under the ROC curve and optimal threshold.

Figure 7 showcases a comprehensive performance evaluation for PD prediction on the UCI test dataset using both ROC and PR curves. The analysis reveals several important findings about sampling techniques and model selection.

Figure 7.

Performance evaluation using ROC and PR curves for Parkinson's disease prediction on the UCI test dataset (a) and (b): ROC and PR curves for XGBoost with different sampling methods (c) and (d): ROC and PR curves comparing different machine learning models.

When comparing sampling techniques with XGBoost, several methods achieved optimal performance, with Vanilla, ROS, SMOTE (SM), ADASYN, NM2, and NM3 all attaining perfect AUC scores of 1.0000 (Figure 7(a)). Precision-Recall curves further validate these findings, with these methods achieving AP of 1.0000 (Figure 7(b)). Some sampling techniques demonstrated slightly reduced performance, such as BLSM with an AUC of 0.9876 and AP of 0.9963, and RUS with an AUC of 0.9752 and AP of 0.9923. Similarly, CCS showed an AUC of 0.9814 and AP of 0.9943, while NM1 achieved an AUC of 0.9876 with AP of 0.9963.

A comparative analysis across ML models highlights the exceptional performance of several algorithms on this dataset. AdaBoost, XGBoost, RaSE, and Random Forest all achieved perfect classification with AUC scores of 1.0000 (Figure 7(c)). These findings are reinforced by the PR curves, with all these models attaining AP scores of 1.0000 (Figure 7(d)). GBDT showed slightly lower performance with an AUC of 0.9938 and AP of 0.9982, while KNN demonstrated the lowest performance among the tested models with an AUC of 0.9845 and AP of 0.9944, though still exhibiting strong predictive capability.
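For readers less familiar with the two summary metrics used in Figures 5 and 7, the sketch below computes the area under the ROC curve (as the fraction of correctly ordered positive/negative pairs) and the average precision (as the mean precision at the rank of each positive) in plain Python. The scores are made up, and the AP helper assumes there are no tied scores; library implementations handle ties and other edge cases more carefully.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Mean of precision@k over the ranks k at which a positive is retrieved."""
    ranked = sorted(zip(scores, y_true), reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Hypothetical scores: one control (0.60) outranks one PD case (0.55).
y_true = [0, 0, 0, 1, 1, 1, 1]
scores = [0.10, 0.35, 0.60, 0.55, 0.70, 0.85, 0.95]
print(round(roc_auc(y_true, scores), 4), round(average_precision(y_true, scores), 4))
```

A single misordered pair is enough to pull both metrics below 1.0, which is why values such as 0.98 to 0.99 on a 30-sample test split correspond to only a handful of ranking errors.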

In conclusion, this study underscores the effectiveness of ensemble-based methods like AdaBoost, XGBoost, and Random Forest for PD classification when paired with appropriate sampling techniques. Several sampling approaches, including Vanilla, ROS, SM, ADASYN, NM2, and NM3, achieved perfect classification metrics when used with XGBoost, indicating their effectiveness in addressing the inherent challenges in medical diagnostic data. The consistent performance across multiple evaluation metrics demonstrates the robustness of these approaches for developing reliable diagnostic tools for PD.

Analysis of differential features of datasets on model output

A SHAP analysis^50,51 was performed, and the impact of the differential features on model output is shown in Figures 8–10. Panel (a) of each figure shows the top 20 features ranked by the magnitude of their impact on model output (represented by SHAP values), in descending order, with the most influential features at the top and the least influential at the bottom. Panel (b) of each figure lists the same features in the same order and shows how the feature value (high or low) affects model output. If many red (high-value) points lie to the right of zero for a feature, higher values of that feature make a PD prediction more likely. Conversely, if many blue (low-value) points lie to the left of zero for a feature, lower values of that feature make a PD prediction less likely. Moreover, the farther a point lies to the right or left of zero, the more strongly it pushes the model output toward or away from PD.
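To illustrate what a SHAP value represents, the sketch below computes exact Shapley values by brute force for a toy three-feature model. This is not how the SHAP library works on real models (it relies on efficient approximations and model-specific algorithms), and replacing "absent" features with a single baseline vector is a simplifying assumption; the model, inputs, and function names are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values of each feature for one prediction, relative to a baseline input."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their real value; the rest take the baseline value.
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(mixed)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi += weight * (value(set(subset) | {i}) - value(set(subset)))
        phis.append(phi)
    return phis

# Toy model with one interaction term (purely illustrative).
def toy_model(f):
    return 2 * f[0] - f[1] + f[0] * f[2]

x, baseline = [1.0, 2.0, 3.0], [0.0, 0.0, 0.0]
phis = shapley_values(toy_model, x, baseline)
print([round(p, 3) for p in phis])
```

The contributions sum to the difference between the prediction for `x` and the prediction for the baseline, and the interaction term is split evenly between the two features involved. Bar plots such as panel (a) typically rank features by the average magnitude of these per-sample contributions.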

Figure 8.

Summary of SHAP analysis of the MIU database. (a) Bar plot demonstrating the magnitude of impact of features on model output (b) distribution illustrating the impact of feature values on model output of differential features.

Figure 9.

Summary of SHAP analysis of the UEX dataset. (a) Bar plot demonstrating the magnitude of impact of features on model output (b) distribution illustrating the impact of feature values on model output of differential features.

Figure 10.

Summary of SHAP analysis of the UCI dataset. (a) Bar plot demonstrating the magnitude of impact of features on model output (b) distribution illustrating the impact of feature values on model output of differential features.

In Figure 8(a), the first six features, from std_delta_delta_log_energy (the standard deviation of the change in the delta-delta log energy across multiple frames of a speech signal) to tqwt_minValue_dec_11 (the minimum value of the voice signal after applying TQWT at level 11), played a relatively substantial role in model output. The last six features, from tqwt_kurtosisValue_dec_26 (kurtosis of the voice signal at TQWT decomposition level 26) to std_8th_delta_delta (standard deviation of the eighth delta-delta feature), had a markedly lower impact on model output; the features in between had a moderate impact. In Figure 8(b), the direction of each feature's effect on model output is relatively distinct: for each feature, the blue and red points are polarized toward opposite ends, with minimal mixing of the two colors or of any color in between.

In Figure 9(a), the first two features, cepstral peak prominence (CPP) and LZ-2 (Lempel-Ziv complexity), had a considerable influence on model output. The last 10 features, from MFCC8 to fundamental motor moment inconsistency, had a notably diminished impact on model output; the features in between had a moderate to low impact. In Figure 9(b), the points for each feature are sparse and the red and blue points are mixed, so it is difficult to tell whether the value of each feature is positively or negatively correlated with PD. To some extent, however, CPP appears to be negatively correlated with PD and LZ-2 positively correlated with PD.

In Figure 10(a), the first three features, PPE, multidimensional voice program (MDVP):Fhi(Hz) (maximum vocal fundamental frequency), and spread2 (frequency variation), contributed substantially to model output. The last 10 features, from MDVP:Shimmer(dB) (MDVP local shimmer in dB) to MDVP:amplitude perturbation quotient (APQ), played a far lesser role in model output; the features in between had a moderate to low impact. In Figure 10(b), apart from the first seven features, it is difficult to conclude whether feature values are positively or negatively correlated with PD.

Discussion

This study establishes the effectiveness of ensemble ML approaches combined with hybrid preprocessing techniques for PD detection using heterogeneous speech datasets. By integrating RobustScaler for feature normalization, dynamic resampling strategies (ROS/SMOTE/RUS), and optimized classifiers (XGBoost, AdaBoost), the framework achieved up to 97.37% accuracy on the MIU dataset and perfect classification (100%) on both the UEX and UCI datasets. SHAP analysis identified MFCCs as consistently influential biomarkers; moreover, our study suggests that the proposed approach may offer improved performance compared to existing methods, based on comparative evaluations. Critical limitations include dataset size constraints and reliance on sustained vowel phonation, highlighting the need for longitudinal multi-modal validation in future research. Among all classifiers tested, XGBoost consistently delivered the best performance, particularly when used with SMOTE or ROS, due to its ability to capture complex feature interactions in high-dimensional spaces. Conversely, AdaBoost performed optimally on smaller, balanced datasets like UEX, likely benefiting from its sequential emphasis on misclassified instances. These findings demonstrate the importance of aligning model complexity with dataset characteristics (see Table 8).

Table 8.

Dataset characteristics and model response.

Parameter	MIU	UEX	UCI
Feature count	754 (TQWT, MFCC, vocal fold dynamics, entropy-based)	34 (e.g. CPP, jitter, shimmer, MFCCs, LZ-2)	23 (e.g., MDVP:Fhi, MDVP:Flo, jitter, PPE, DFA)
Recordings	756	60	195
PD:control ⇒ ratio	564:192 ⇒ 2.94:1	30:30 ⇒ 1:1	147:48 ⇒ 3.06:1
Key features	TQWT, MFCCs, RPDE, PPE, DFA, glottal and nonlinear features	MFCCs, entropy measures (Shannon, LZ-2), CPP, HNR	Classic dysphonia features (Jitter, Shimmer, PPE)
Optimal classifier	XGBoost	AdaBoost	XGBoost
Regularization	L2 (λ = 0.1), early stopping	Stage-wise error correction	L2 (λ = 0.05), early stopping

Note: MFCC: Mel-frequency cepstral coefficients; AdaBoost: adaptive boosting; XGBoost: eXtreme gradient boosting; CPP: cepstral peak prominence; PPE: pitch period entropy.
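The class ratios in Table 8 can be checked with a few lines of arithmetic. The sketch below recomputes the PD:control ratio and the minority-class share from the counts given in the table; the variable names are illustrative.

```python
# (PD recordings, control recordings) as listed in Table 8.
datasets = {"MIU": (564, 192), "UEX": (30, 30), "UCI": (147, 48)}

summary = {}
for name, (pd_n, ctrl_n) in datasets.items():
    total = pd_n + ctrl_n
    summary[name] = {
        "total": total,
        "ratio": pd_n / ctrl_n,                      # PD:control ratio
        "minority_share": min(pd_n, ctrl_n) / total,  # fraction in the smaller class
    }
    print(name, total, f"{pd_n / ctrl_n:.2f}:1", f"{summary[name]['minority_share']:.1%}")
```

The totals reproduce the recording counts in the table (756, 60, and 195), and the minority class makes up roughly a quarter of the MIU and UCI recordings.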

Comparative analysis across heterogeneous datasets

The varying performance across the three datasets can be attributed to differences in sample distribution, feature dimensionality, and feature types. The MIU dataset, with its high-dimensional feature space (754 features), includes complex representations such as TQWT-based decompositions, MFCCs, and nonlinear dynamics (e.g. RPDE, PPE, DFA). These complex, high-variance features introduced nonlinear decision boundaries, requiring a robust model like XGBoost with L2 regularization (λ = 0.1) and early stopping (patience = 10) to control overfitting and maintain generalization.
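The early-stopping rule mentioned above can be sketched independently of any boosting library: training halts once the validation loss has failed to improve for a set number of consecutive rounds (the patience), and the best round is kept. The function name and the loss values below are illustrative assumptions, not outputs of the study's models.

```python
def early_stopping_round(val_losses, patience=10):
    """Return (best_round, best_loss), stopping after `patience` rounds without improvement."""
    best_loss, best_round = float("inf"), -1
    for rnd, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_round = loss, rnd
        elif rnd - best_round >= patience:
            break  # no improvement for `patience` rounds
    return best_round, best_loss

# Made-up validation losses: a late improvement appears at round 6.
losses = [0.90, 0.70, 0.60, 0.65, 0.66, 0.70, 0.58]
print(early_stopping_round(losses, patience=3))   # stops before the late improvement
print(early_stopping_round(losses, patience=10))  # long enough to reach round 6
```

A small patience limits overfitting and training time but can stop before a late improvement; a larger patience trades more boosting rounds for the chance to find it.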

In contrast, the UEX dataset offered a balanced PD:control ratio (1:1) and moderate feature complexity (34 features), including entropy-based metrics (e.g. Shannon entropy, LZC), CPP, and MFCCs. This configuration proved ideal for AdaBoost, which effectively leveraged its stage-wise additive correction to minimize classification errors. However, AdaBoost's sensitivity to noisy labels and its reliance on weak classifiers performing better than random guessing (ɛ < 0.5) highlight the importance of careful feature preprocessing and base estimator selection to ensure robust performance.
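The requirement that each weak classifier beat random guessing (ɛ < 0.5) can be seen from the weight that classic binary AdaBoost assigns to a weak learner, α = ½ ln((1 − ɛ)/ɛ). The sketch below evaluates this textbook formula; library implementations may use variants (for example, with a learning rate), so it should be read as an illustration rather than a description of the study's configuration.

```python
import math

def adaboost_alpha(error):
    """Weight of a weak learner with weighted error `error` (0 < error < 1) in classic AdaBoost."""
    return 0.5 * math.log((1 - error) / error)

for eps in (0.1, 0.3, 0.5, 0.7):
    print(eps, round(adaboost_alpha(eps), 4))
```

The weight is positive only when the error is below 0.5, falls to zero at exactly 0.5, and turns negative above it, which is why a weak learner no better than chance contributes nothing useful to the ensemble.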

The UCI dataset, while smaller and limited to 23 traditional dysphonia features (e.g. MDVP jitter, shimmer, and PPE), demonstrated sufficient class separation despite strong class imbalance (147 PD vs 48 controls, ratio 3.06:1). XGBoost performed well on this dataset with minimal L2 regularization (λ = 0.05), likely due to the high discriminative power of the features and consistent patterns in the majority class. However, class imbalance typically reduces minority class recall and biases the model toward the majority class; techniques such as SMOTE or class weighting could further enhance performance by addressing this imbalance.
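As a minimal sketch of the class-weighting option mentioned above, the code below derives inverse-frequency ("balanced") class weights from the UCI class counts, together with the negative-to-positive ratio that is commonly suggested as a starting value for XGBoost's `scale_pos_weight` parameter. Treating PD as the positive class is an assumption, and the study does not report using either setting.

```python
def balanced_class_weights(counts):
    """Inverse-frequency weights: n_samples / (n_classes * n_samples_in_class)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}

uci_counts = {"PD": 147, "control": 48}  # class counts from Table 8
weights = balanced_class_weights(uci_counts)

# Ratio of negative to positive samples, assuming PD is the positive class.
scale_pos_weight = uci_counts["control"] / uci_counts["PD"]

print({k: round(v, 3) for k, v in weights.items()}, round(scale_pos_weight, 3))
```

With these weights each class contributes the same total weight to the loss, so errors on the 48 control recordings count about three times as much as errors on PD recordings.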

These observations highlight the significant influence of dataset characteristics—such as feature complexity, class imbalance, preprocessing strategies, and feature interactions—on model selection and regularization choices. For instance, the time-series nature of MFCCs may interact differently with the axis-aligned splits of tree-based models compared to entropy-based features like Shannon entropy or LZC. A nuanced understanding of these interactions is essential for optimizing classification performance, as summarized in Table 8.

Comparison with existing literature

The proposed dynamic multiscale resampling framework demonstrates significant advancements in handling imbalanced biomedical datasets, as evidenced by comparative evaluations across three benchmark repositories (MIU, UEX, and UCI). The comparison is detailed in Table 9, highlighting the performance metrics of precision, recall, F1-score, and accuracy.

MIU database: Overcoming high-dimensional imbalance challenges

Table 9.

Comparison of the dynamic multiscale resampling machine learning framework with mainstream baseline and state-of-the-art models on three datasets.

Dataset	Study	Year	Features	Method	Precision (%)	Recall (%)	F1-score (%)	Accuracy (%)
MIU database	Sakar et al.⁴²	2019	754	SVM (RBF)	-	-	84.00	86.00
	Prasad et al.⁵²	2020	754	multi-ANOVA + XGBoost	-	-	96.00	94.70
	Raihan et al.⁵³	2021	754	R-D-P	-	-	89.00	81.00
	Barukab et al.⁵⁴	2022	754	AdaBoost	-	96.50	-	90.10
	Dao et al.⁵⁵	2022	754	LGBM	90.20	97.90	93.90	90.50
	Santos et al.⁵⁶	2023	754	XGBoost	84.00	78.00	81.00	80.09
	Ali et al.⁵⁷	2023	754	Random Forest	-	-	-	92.06
	Du et al.⁵⁸	2024	754	HKNN	-	-	-	95.23
	Hossain et al.⁵⁹	2024	754	LSVC + MLP (AdaBoost)	-	-	-	85.09
	Ours	2024	754	RobustScaler ROS XGBoost	96.07	97.10	96.57	97.37
UEX database	Carrón et al.⁴⁰	2021	34	Passive Aggressive	-	93.96	-	92.05
	Ours	2024	34	RobustScaler ROS AdaBoost	100	100	100	100
UCI dataset	Little et al.⁴¹	2008	23	Kernel-SVM	-	-	-	91.40
	Shaikn et al.⁶⁰	2014	23	Artificial Neural Network	-	85.42	-	90.77
	Al-Fatlawi et al.⁶¹	2016	23	DBN	-	-	-	94.00
	Pahuja et al.⁶²	2018	23	ANN	-	-	-	95.89
	Mathur et al.⁶³	2019	23	KNN + AdaBoost.M1	87.30	90.70	88.80	91.28
	Kadam et al.⁶⁴	2019	23	FESA-DNN	-	95.23	-	93.84
	Karapinar⁶⁵	2020	23	SVM	-	-	-	93.84
	Rasheed et al.⁶⁶	2020	23	BPVAM-PCA	-	-	-	97.50
	Alshammri et al.⁶⁷	2023	23	MLP	-	-	-	98.31
	Al-Nefaie et al.⁶⁸	2024	23	Random Forest	95.00	95.00	95.00	95.00
	Ours	2024	23	MaxAbsScaler ROS XGBoost	100	100	100	100

Note: AdaBoost: adaptive boosting; XGBoost: eXtreme gradient boosting; KNN: K nearest neighbors.

The MIU dataset (Sakar et al.) presents a 754-feature space for neurodegenerative disease monitoring. Our RobustScaler-ROS-XGBoost pipeline achieved 97.37% accuracy (96.57% F1-score), outperforming SVM-RBF implementations (86.00% accuracy),⁴² hybrid ANOVA-XGBoost models (94.70% accuracy),⁵² and recent LightGBM adaptations (90.50% accuracy).⁵⁵ Notably, Santos et al.'s XGBoost variant achieved only 80.09% accuracy,⁵⁶ while Ali et al.'s Random Forest reached 92.06%,⁵⁷ both potentially constrained by inadequate imbalance mitigation. The 2.67 percentage-point accuracy improvement over Prasad et al.'s⁵² prior XGBoost implementation (97.37% vs 94.70% in Table 9) suggests that our dynamic resampling helps preserve minority class patterns.

UEX database: Perfect classification through adaptive ensembling

For UEX's 34-feature dataset (Carrón et al.), our AdaBoost-based model achieved 100% precision, recall, and accuracy, a 7.95 percentage-point accuracy gain over Carrón et al.'s Passive Aggressive classifier (92.05%).⁴⁰ The results suggest that ROS effectively amplifies subtle biomarker signatures in small-sample contexts.

UCI dataset: Redefining low-dimensional benchmark performance

The proposed MaxAbsScaler-ROS-XGBoost configuration achieved perfect metrics (100%) on Little et al.'s 23-feature UCI dataset, surpassing ANN architectures (90.77% accuracy),⁶⁰ deep belief networks (94.00% accuracy),⁶¹ and a state-of-the-art Random Forest (95.00% accuracy).⁶⁸ Although Alshammri et al.'s MLP reached 98.31% accuracy,⁶⁷ the absence of precision and recall data prevents full clinical validation, a gap addressed by our complete metric reporting. The 2.50 percentage-point accuracy improvement over Rasheed et al.'s BPVAM-PCA (100% vs 97.50% in Table 9) suggests that scale-aware preprocessing can enhance conventional dimensionality reduction.⁶⁶
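The accuracy gaps discussed in this comparison can be recomputed from the accuracy column of Table 9. The sketch below does so for a few of the cited baselines, expressing each gap in percentage points; the dictionary layout and names are illustrative, and the accuracy values are copied from the table.

```python
# Accuracy (%) of the proposed pipelines, from Table 9.
ours = {"MIU": 97.37, "UEX": 100.0, "UCI": 100.0}

# Accuracy (%) of selected baselines, from Table 9.
baselines = {
    ("MIU", "Prasad et al. (multi-ANOVA + XGBoost)"): 94.70,
    ("MIU", "Ali et al. (Random Forest)"): 92.06,
    ("UEX", "Carrón et al. (Passive Aggressive)"): 92.05,
    ("UCI", "Rasheed et al. (BPVAM-PCA)"): 97.50,
    ("UCI", "Alshammri et al. (MLP)"): 98.31,
}

# Gain in percentage points = our accuracy minus the baseline accuracy.
gains = {key: round(ours[key[0]] - acc, 2) for key, acc in baselines.items()}
for (dataset, study), gain in gains.items():
    print(f"{dataset}: +{gain:.2f} points over {study}")
```

Reporting the gaps in percentage points avoids ambiguity with relative improvements, which would be computed as a fraction of the baseline accuracy.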

Differential acoustic features on machine learning outcomes

As shown in Figures 8–10, few parallels could be drawn across datasets regarding the impact of the same parameters. However, MFCC subtypes in the MIU and UEX datasets are, albeit marginally, among the top 10 of the 20 ranked features in terms of impact on model output. MFCCs capture the characteristics of the vocal tract, and since PD affects the surrounding muscles, leading to imprecise articulation, hypophonia, and monotony, MFCCs are likely able to pick up on these changes. This aligns with prior research emphasizing MFCCs as robust indicators for distinguishing PD from healthy controls because of their ability to capture the changes in spectral energy distribution associated with vocal impairments in PD.⁶⁹ While jitter, a measure of pitch irregularity, appears on the list, it ranks near the bottom in both the UCI and UEX datasets. This finding is consistent with studies suggesting that although jitter is a valuable feature in PD detection,⁷⁰ its diagnostic utility may be limited by its considerable variability across individuals⁷¹ and by natural pitch fluctuations in healthy controls. Notably, PPE is ranked first in the UCI dataset but near the bottom in the UEX dataset, which shows that the same parameter may not have the same impact across datasets, potentially because of differences in data collection protocols, population demographics, or disease severity among datasets.^70,71 Interestingly, within the top 20 features of Figures 8–10, there is still a distinct gap between the more relevant and less relevant features, which suggests an even greater disparity relative to features not ranked on the list. This further supports the notion that certain acoustic features are central to the diagnosis of PD in certain datasets, while others are not.

Our findings highlight two critical points: first, that acoustic feature relevance is highly dataset-dependent; and second, that certain features—such as MFCCs—show promise as robust markers for PD diagnosis. However, it is too early to generalize these conclusions without further validation across larger and more diverse datasets. Future research should focus on standardizing data collection methods and exploring feature interactions to better understand their clinical relevance. By narrowing down potential acoustic biomarkers through this analysis, our study provides a foundation for future investigations aimed at optimizing feature selection for PD diagnosis. Researchers are encouraged to build upon these findings by validating key features across independent datasets and integrating them into multimodal diagnostic frameworks.

Research limitations

The generalizability of our findings is constrained by several demographic and methodological factors. First, the analyzed datasets (MIU, UEX, and UCI) exhibit limited geographic and linguistic diversity, predominantly comprising European/North American cohorts with sustained vowel /a/ recordings. This narrow scope may not fully capture PD-related vocal variations across different language groups or regional accents, particularly in tonal languages where pitch variations play distinct linguistic roles. Second, inherent gender differences in baseline vocal parameters—such as fundamental frequency ranges (female: 165–255 Hz vs male: 85–155 Hz) and harmonic structures—were not systematically stratified, potentially confounding feature interpretation. For instance, higher fundamental frequencies in females could amplify jitter measurements independent of PD pathology, introducing classification bias in mixed-gender cohorts.

Moreover, critical clinical covariates remain unaddressed due to dataset limitations. The UCI dataset's reported disease duration (0–28 years) lacks standardized Hoehn-Yahr staging, while medication status (e.g. levodopa on/off states) is absent across all datasets. This omission is particularly consequential given that dopaminergic medications are known to transiently improve vocal intensity and pitch variation, potentially masking disease-specific acoustic signatures.

Conclusion

The contribution of this study is the substantiation of an ensemble-based approach for building a robust PD diagnostic ML model; moreover, the study provides insight into future applications based on dataset characteristics such as imbalance and the presence of outliers. We provide a concise approach to constructing an ML model from a previously unexplored PD acoustic recording database, which would otherwise demand grueling hours of training time, countless configuration adjustments, and extensive optimization efforts.

Our results are limited to the acoustic features of PD. We were not able to account for factors such as the stage of PD or whether patients were on medication, which may affect outcomes: once PD reaches a certain stage, some acoustic features could be more pronounced than others. Future studies could collect voice recordings stratified by age and/or PD stage, with and without medication, to construct a more comprehensive PD dataset.

In addition, our study provides an impact-based index of PD-related parameters, which distills the many parameters of each database into a list of key features that were salient for ML model performance. Although MFCCs were the only consistently relevant features, and only in two of the three databases, our results still establish a foundational reference to guide future researchers in selecting which parameters to prioritize in the collection and analysis of acoustic features.

Footnotes

Acknowledgments

We thank all individuals who participated in this study.

ORCID iD

Ping-Nan Chen

Author contributions

PNC, LCS, and CWT performed the experiments and wrote the manuscript. KFL and PNC provided the concept and experimental design of the study and reviewed the paper prior to submission. All authors discussed the results, analyzed the data, and commented on the manuscript. All authors have read and approved the submitted version.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The design and costs of collection, analysis and interpretation of data, and writing are funded by the Ministry of National Defense-Medical Affairs Bureau (Grant No. MND-MAB-D-113143).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability statement

All data generated or analyzed during this study are included in this published article and are also available online in references.^40–42

References

Balestrino

Schapira

AHV

. Parkinson disease. Eur J Neurol 2020; 27: 27–42.

Dorsey

Sherer

Okun

, et al. The emerging evidence of the Parkinson pandemic. J Parkinsons Dis 2018; 8: S3–S8.

Huse

Schulman

Orsini

, et al. Burden of illness in Parkinson's disease. Mov Disord 2005; 20: 1449–1454.

Yang

Hamilton

Kopil

, et al.

Current and projected future economic burden of Parkinson's disease in the U. S.

NPJ Parkinsons Dis 2020; 6: 15.

Parkinson’s UK. Poll finds a quarter of people with Parkinson’s are wrongly diagnosed, https://www.parkinsons.org.uk/news/poll-finds-quarter-people-parkinsons-are-wrongly-diagnosed (2025, accessed 8 April 2025).

Hustad

Aasly

. Clinical and imaging markers of prodromal Parkinson's disease. Front Neurol 2020; 11: 395.

Tolosa

Garrido

Scholz

, et al. Challenges in the diagnosis of Parkinson’s disease. Lancet Neurol 2021; 20: 385–397.

Thenganatt

Jankovic

. The relationship between essential tremor and Parkinson's disease. Parkinsonism Relat Disord 2016; 22: S162–S165.

Miki

Tsushima

Foti

, et al. Identification of multiple system atrophy mimicking Parkinson's disease or progressive supranuclear palsy. Brain 2021; 144: 1138–1151.

10. Signaevsky, Marami, Prastawa, et al. Antemortem detection of Parkinson's disease pathology in peripheral biopsies using artificial intelligence. Acta Neuropathol Commun 2022; 10: 21.

11. Atik, Stewart, Zhang. Alpha-Synuclein as a biomarker for Parkinson's disease. Brain Pathol 2016; 6: 410–418.

12. Cabral G, Schaan, Cavalcante, et al. Nuclear and mitochondrial genome, epigenome, and gut microbiome: emerging molecular biomarkers for Parkinson's disease. Int J Mol Sci 2021; 22: 9839.

13. Lin, Chiu, Chen, et al. Classifications of neurodegenerative disorders using a multiplex blood biomarkers-based machine learning model. Int J Mol Sci 2020; 21: 6914.

14. Torsney, Forsyth. Respiratory dysfunction in Parkinson's disease. J R Coll Physicians Edinb 2017; 47: 35–39.

15. Aludin, Schmill LP-A. MRI signs of Parkinson's disease and atypical parkinsonism. RoFo 2021; 193: 1403–1410.

16. Ryman, Poston. MRI biomarkers of motor and non-motor symptoms in Parkinson's disease. Parkinsonism Relat Disord 2020; 73: 85–93.

17. Rossi, Perez-Lloret, Merello. How much time is needed in clinical practice to reach a diagnosis of clinically established Parkinson's disease? Parkinsonism Relat Disord 2021; 92: 53–58.

18. Pagan. Improving outcomes through early diagnosis of Parkinson's disease. Am J Manag Care 2012; 18: S176–S182.

19. Murman. Early treatment of Parkinson's disease: opportunities for managed care. Am J Manag Care 2012; 18: S183–S188.

20. Rubaiat, Templeton, Schneider, et al. Exploring speech biosignatures for traumatic brain injury and neurodegeneration: pilot machine learning study. JMIR Neurotech 2025; 4: e64624.

21. De Silva, Madanian, Olsen, et al. Clinical decision support using speech signal analysis: systematic scoping review of neurological disorders. J Med Internet Res 2025; 27: e63004.

22. Brabenec, Mekyska, Galaz, et al. Speech disorders in Parkinson's disease: early diagnostics and effects of medication and brain stimulation. J Neural Transm 2017; 124: 303–334.

23. Lau, Thyagarajan. Voice changes in Parkinson's disease: what are they telling us? J Clin Neurosci 2020; 72: 1–7.

24. Holmes, Oates, Phyland, et al. Voice characteristics in the progression of Parkinson's disease. Int J Lang Commun Disord 2000; 35: 407–418.

25. Convey, Laukkanen, Ylinen, et al. Analysis of voice changes in early-stage Parkinson's disease with AVQI and ABI: a follow-up study. J Voice 2024: S0892-1997(24)00160-7.

26. Rohl, Gutierrez, Johari, et al. Speech dysfunction, cognition, and Parkinson's disease. Prog Brain Res 2022; 269: 153–173.

27. Hung, Wang, et al. Smartphone automated motor and speech analysis for early detection of Alzheimer's disease and Parkinson's disease: validation of TapTalk across 20 different devices. Alzheimers Dement (Amst) 2024; 16: e70025.

28. Mass JJL, de Vries, IntHout, et al. Effectiveness of remotely delivered speech therapy in persons with Parkinson's disease - a randomised controlled trial. EClin Med 2024; 76: 102823.

29. Reddy. PPINtonus: early detection of Parkinson’s disease using deep-learning tonal analysis. ArXiv 2024; abs/2406.02608.

30. Rusz, Tykalová, Novotný, et al. Automated speech analysis in early untreated Parkinson's disease: relation to gender and dopaminergic transporter imaging. Eur J Neurol 2022; 29: 81–90.

31. Quatra, Turco, Svendsen, et al. Exploiting foundation models and speech enhancement for Parkinson's disease detection from speech in real-world operative conditions. ArXiv 2024; abs/2406.16128.

32. Skibińska, Hosek. Computerized analysis of hypomimia and hypokinetic dysarthria for improved diagnosis of Parkinson's disease. Heliyon 2023; 9: e21175.

33. Suppa, Costantini, Asci, et al. Voice in Parkinson's disease: a machine learning study. Front Neurol 2022; 13: 831428.

34. Iyer, Kemp, Rahmatallah, et al. A machine learning method to process voice samples for identification of Parkinson's disease. Sci Rep 2023; 13: 20615.

35. Tougui, Jilbab, Mhamdi. Machine learning smart system for Parkinson disease classification using the voice as a biomarker. Healthc Inform Res 2022; 28: 210–221.

36. Zhao, Wang, Niu, et al. A triplet multimodel transfer learning network for speech disorder screening of Parkinson’s disease. Int J Intell Syst 2024; 1: 8890592.

37. Rahman, Rizvi, Khan, et al. Parkinson’s disease diagnosis in cepstral domain using MFCC and dimensionality reduction with SVM classifier. Mob Inf Syst 2021; 1: 8822069.

38. Thomas, Benlamin. An SVM based prediction of Parkinson’s disease using MFCC, end to end ASR and voice. Int J Adv Comput Sci Technol 2020; 9: 133–136.

39. Chen, Yao, et al. Dysphonic voice pattern analysis of patients in Parkinson's disease using minimum interclass probability risk feature selection and bagging ensemble learning methods. Comput Math Methods Med 2017; 2017: 4201984.

40. Carrón, Campos-Roca, Madruga, et al. A mobile-assisted voice condition analysis system for Parkinson’s disease: assessment of usability conditions. Biomed Eng Online 2021; 20: 114.

41. Little, McSharry, Hunter, et al. Suitability of dysphonia measurements for telemonitoring of Parkinson's disease. IEEE Trans Biomed Eng 2009; 56: 1015–1022.

42. Sakar, Serbes, Gunduz, et al. A comparative analysis of speech signal processing algorithms for Parkinson’s disease classification and the use of the tunable Q-factor wavelet transform. Appl Soft Comput 2019; 74: 255–263.

43. Amorim, Cavalcanti, Cruz. The choice of scaling technique matters for classification performance. Appl Soft Comput 2022; 133: 109924.

44. Liu, Zhou. Exploratory undersampling for class-imbalance learning. IEEE Trans Syst Man Cybern B Cybern 2009; 39: 539–550.

45. Batista GEAPA, Prati, Monard. A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 2004; 6: 20–29.

46. Yen, Lee. Cluster-based under-sampling approaches for imbalanced data distributions. Expert Syst Appl 2009; 36: 5718–5727.

47. Zhang, Mani. kNN approach to unbalanced data distributions: a case study involving information extraction. In: Workshop on learning from imbalanced datasets II, ICML; 2003 Jul; Washington, DC. pp.42–48.

48. Chen, Guestrin. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16). New York, NY: ACM; 2016. pp.785–794.

49. Bai, Garcia, Li S. ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), 2008 Jun; Hong Kong, China. IEEE; 2008. pp.1322–1328.

50. Lundberg, Lee. A unified approach to interpreting model predictions. In: Proceedings of Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA: ACM, 2017, pp.4765–4774.

51. Lundberg, Erion, Chen, et al. Explainable AI for trees: from local explanations to global understanding. ArXiv 2019; abs/1905.04610.

52. Prasad, Munasinghe, Seneviratne. A two-step framework for Parkinson’s disease classification: using multiple one-way ANOVA on speech features and decision trees. CEUR Workshop Proc 2020; 2884: 19–23.

53. Raihan, Zisad, Islam, et al. A belief rule base approach to support comparison of digital speech signal features for Parkinson’s disease diagnosis. In: Proceedings of the International Conference on Brain Informatics, Virtual Event, 2021, pp.388–400. Springer.

54. Barukab, Ahmad, Khan, et al. Analysis of Parkinson's disease using an imbalanced-speech dataset by employing decision tree ensemble methods. Diagnostics (Basel) 2022; 12: 3000.

55. Dao SVT, Tran, et al. An analysis of vocal features for Parkinson’s disease classification using evolutionary algorithms. Diagnostics (Basel) 2022; 12: 1980.

56. Santos. XGBoosting early detection: advancing Parkinson’s disease diagnosis through machine learning, https://doi.org/10.1101/2023.10.23.23297368 (2023, accessed 9 April 2023).

57. Ali, Salim, Saeed. Parkinson's disease detection using filter feature selection and a genetic algorithm with ensemble learning. Diagnostics (Basel) 2023; 13: 2816.

58. Shen, Chen. Parkinson’s disease detection by using machine learning method based on local classification on class boundary. Discov Appl Sci 2024; 6: 576.

59. Hossain, Amenta. Machine learning-based classification of Parkinson's disease patients using speech biomarkers. J Parkinsons Dis 2024; 14: 95–109.

60. Shaikh. A prototype of Parkinson’s and primary tumor diseases prediction using data mining techniques. Int J Eng Sci Invention 2014; 3: 23–28.

61. Al-Fatlawi, Jabardi, Ling. Efficient diagnosis system for Parkinson's disease using deep belief network. In: Proceedings of the 2016 IEEE Congress on Evolutionary Computation (CEC), Vancouver, Canada, 2016, pp.1324–1330. IEEE.

62. Pahuja, Nagabhushan. A comparative study of existing machine learning approaches for Parkinson’s disease detection. IETE J Res 2018; 67: 4–14.

63. Mathur, Pathak, Bandil. Parkinson disease prediction using machine learning algorithm. In: Proceedings of Emerging Trends in Expert Applications and Security. Advances in Intelligent Systems and Computing, Singapore, 2018, pp.357–363. Springer.

64. Kadam, Jadhav. Feature ensemble learning based on sparse autoencoders for diagnosis of Parkinson’s disease. Comput Commun Signal Process Adv Intell Syst Comput 2019; 810: 567–581.

65. Senturk Z. Early diagnosis of Parkinson's disease using machine learning algorithms. Med Hypotheses 2020; 138: 109603.

66. Rasheed, Hameed, Ajlouni, et al. Application of adaptive back-propagation neural networks for Parkinson's disease prediction. In: Proceedings of 2020 International Conference on Data Analytics for Business and Industry: Way Towards a Sustainable Economy (ICDABI), Sakheer, Bahrain, 2020, pp.1–5. IEEE.

67. Alshammri, Alharbi, et al. Machine learning approaches to identify Parkinson's disease using voice signal features. Front Artif Intell 2023; 6: 1084001.

68. Al-Nefaie, Aldhyani THH, Koundal. Developing system-based voice features for detecting Parkinson’s disease using machine learning algorithms. J Disease Recognit 2024; 3: 1–10.

69. Tracey, Volfson, Glass, et al. Towards interpretable speech biomarkers: exploring MFCCs. Sci Rep 2023; 13: 22787.

70. Momeni, Whitling, Jakobsson. Detecting Parkinson’s disease using voice recordings from mobile devices. In: Proceedings of the 32nd European Signal Processing Conference (EUSIPCO), Lyon, France, 2024, pp.1516–1520. IEEE.

71. Fadavi. Early recognition of Parkinson’s disease through acoustic analysis and machine learning. ArXiv 2024; abs/2407.16091.

Hybrid preprocessing and ensemble classification for enhanced detection of Parkinson's disease using multiple speech signal databases
