Application of machine learning to predict periodontal disease in US adults: A cross-sectional analysis of NHANES 2009

Abstract

Background: Periodontal disease (PD) is a primary contributor to tooth loss, which negatively affects oral functionality and quality of life. This research investigates the effectiveness of various machine learning (ML) classifiers in identifying PD among U.S. adults. Method: Nineteen features, selected based on prior literature and expert dentist input, were preprocessed using feature engineering techniques. Eleven ML classifiers, including basic and ensemble models, were evaluated to identify the best-performing model. Model interpretability was assessed using Shapley additive explanations and individual conditional expectation plots to determine key predictors of periodontitis. Results: The predictive performance of the ML classifiers was assessed using the area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, and specificity. The CatBoost classifier performed best in identifying PD, achieving an AUC of 84.5%, an accuracy of 75.8%, a precision of 75.8%, a sensitivity of 78.8%, and a specificity of 72.5%. Having an annual dentist visit and age emerged as the most influential variables. Conclusions: The ML models in this study showed robust predictive performance, which could be further improved by incorporating additional clinical parameters. The proposed models effectively identified individuals at high risk of PD.

Keywords

health information systems machine learning periodontitis health surveys United States

Introduction

Periodontal disease (PD), also known as gum disease,¹ is a widespread chronic inflammatory condition that affects the supporting structures of the teeth, including the gums, periodontal ligament, and alveolar bone. It is considered one of the most important and preventable oral health concerns globally.² The disease is characterized by the accumulation of bacterial plaque on the surfaces of the teeth, leading to localized inflammation and damage to the surrounding tissues.³ Approximately 42% of adults in the United States (US) aged 30 and older with natural teeth experienced periodontitis, with 7.8% suffering from severe periodontitis.^4,5 A total of 3.3% of the probed sites (equivalent to 9.1% of the teeth) had a probe depth of 4 mm or more, and 19.0% of the sites (equal to 37.1% of the teeth) showed a clinical attachment loss of 3 mm or more.^4,5 Severe periodontitis was most common among individuals aged 65 and older, Mexican Americans, non-Hispanic blacks, and smokers.⁶ If left untreated, PD can cause tooth loss⁷ and have a significant impact on the general health of an individual^8,9 and quality of life.¹⁰

Investigations of the frequency and effects of PD revealed that several individual factors are associated with compromised oral health. Men are more susceptible to severe PD compared to women.¹¹ Racial and ethnic minority populations, including Native Americans, Blacks, and Hispanics, exhibit higher rates of PD, untreated root caries, and tooth loss. Minority populations are also more susceptible to an increased frequency of oral cancer compared to non-Hispanic whites.¹² Dental disparities in periodontitis are influenced by various aspects of socioeconomic status, such as income, living conditions, education, and access to dental care.¹³ Periodontitis is influenced by lifestyle factors, including a poor diet, nutritional deficiencies, and inadequate dental hygiene practices.¹⁴ Elderly individuals are more likely to have periodontitis,¹⁵ as a result of reduced adherence to dental hygiene practices and limited access to adequate dental care. Periodontitis-related risk factors¹⁶ include host characteristics (age, gender) and social and behavioral factors such as socioeconomic status, smoking,¹⁷ and alcohol consumption,¹⁸ all of which can increase the likelihood and severity of PD.

PD can be prevented if its risk factors are identified early and mitigated.¹⁹ Early identification of these risk factors is therefore crucial for preventing the onset of PD and managing it. Farhadian et al.²⁰ used support vector machines (SVM) to develop decision support systems to improve the diagnosis of periodontitis. The development of decision support systems to identify risk factors associated with periodontitis is critical to its prevention and treatment, and to improving dental health outcomes. A recent review by Kierce et al.²¹ discussed the use of artificial intelligence (AI) to improve the management of PD. In the current study, the application of AI is directed towards improving patient health knowledge about PD. Through the use of extensive data sets and advanced algorithms, machine learning (ML) techniques improve the early diagnostic phase. The primary goal of this study is to enhance the performance of predicting PD diagnosis, thus facilitating the development of personalized treatment strategies.

This study used data from the National Health and Nutrition Examination Survey (NHANES), which provides extensive information on the health and nutritional status of the US population through data collected from interviews, physical examinations, and laboratory tests. NHANES incorporates direct clinical assessments collected by trained examiners following standardized protocols. Several of these assessments are not typically available in routine clinical practice, as they require specialized procedures and examiner calibration. NHANES has been widely used to predict various health conditions and their associated risk factors, including cardiovascular diseases,^22,23 cancer,^24,25 insomnia,²⁶ diabetes,^27,28 kidney-related diseases,^29,30 liver diseases,^31,32 osteoarthritis,^33,34 and dental-related diseases.^35–38 While numerous existing studies have conducted statistical analyses to understand the risks and contributing factors associated with PD,^38–42 there remains a significant gap in leveraging ML techniques to comprehensively predict PD and analyze its associated factors. To the best of our knowledge, no prior work has systematically investigated the potential of ML classifiers in this domain. Despite progress in preventive dental care, access remains uneven,⁴³ especially for low-income, minority, and rural populations due to poor insurance coverage, limited awareness, and geographic barriers such as dental deserts.⁴⁴ Periodontal disease remains a significant public health burden in the US, affecting over 42% of adults aged 30 and above.⁴⁵ Leveraging automated analytical approaches can facilitate the identification of key contributing factors, enabling early intervention and targeted public health strategies. In this study, our objective is to address this gap by evaluating the effectiveness of various ML classifiers in accurately identifying PD. Through this approach, we seek to provide deeper insight into the predictive capabilities of these models, while also exploring the importance of features that contribute to the prediction of PD. Although earlier studies have mainly focused on examining individual variables in isolation, this study adopts a more integrated analytical approach. In addition, most of the variables used in this study are readily available from routine clinical records and do not require specialized clinical measures.

Materials and methods

Data source

This study used the publicly available version of the 2009-2014 National Health and Nutrition Examination Survey (NHANES). Our study was exempt from Institutional Review Board (IRB) review under 45 CFR 46.101(b) since it involved the analysis of deidentified secondary data. The NHANES annual assessment involves in-depth interviews and clinical examinations to evaluate oral health. In addition, the dataset includes questionnaires that cover demographic information, socioeconomic status, diet habits, various physical and mental health conditions, laboratory components, and other pertinent factors.

Study population

The data cohort analyzed in this study comprised 10,714 adults aged 30 years or older who underwent a full-mouth periodontal examination.¹⁵ Participants were selected using a random probability sampling method designed to be representative of 143.8 million US adults. The regression analysis utilized the probability weight for each individual in the sample.⁸ Dental exams were performed by trained and calibrated dentists in mobile examination centers (MEC).⁸

Methodology

Figure 1 shows the proposed ML pipeline for predicting PD from NHANES data. Data preprocessing involved handling missing values, eliminating irrelevant or redundant features, and recoding essential features. This research focused on a subset of nineteen informative features. Feature engineering involved handling missing values through imputation and creating derived features. The preprocessed data were split into training (80%) and testing (20%) sets. Eleven ML classifiers were employed: logistic regression (LR), decision trees (DT), SVM, Random Forest (RF), Naïve Bayes (NB), k-nearest neighbors (KNN), neural networks (NN), and the ensemble classifiers GradientBoost, AdaBoost, XGBoost, and CatBoost. The training set was used for model development, while the test set was used to assess performance using metrics such as accuracy, precision, sensitivity, and specificity.

Figure 1.

ML pipeline for periodontal prediction.

Outcome definition

In this study, periodontitis was the primary oral health outcome variable. The outcome was represented as a binary variable, with responses coded as 'yes' (1) or 'no' (0) to indicate the presence or absence of periodontitis among participants during clinical examination. The CDC/AAP standard for population-based periodontitis surveillance, including clinical attachment loss (CAL) and periodontal pocket depth (PPD), was used for this classification.⁶ Periodontitis was classified as having two or more interproximal sites with PPD of 4 mm (not on the same tooth) or at least one interproximal site with PPD of 5 mm in addition to two or more interproximal sites with CAL of 3 mm (on different teeth).⁴⁶

Independent variables

The development of the proposed PD prediction tool included nineteen variables. Some of these parameters (see Table 1) were established as confounding variables from prior literature,^5,18,47–53 while others (see Table 2) were derived from expert dentist insights. These controlled parameters encompassed dental visit status (OHQ030) and demographic factors such as gender (RIAGENDR), ethnicity (RIDRETH1), and age (RIDAGEYR). Additionally, variables included smoking status (SMQ020, SMQ040), alcohol consumption (ALQ101), presence of sleep disorders (SLQ050), history of overnight hospitalization (HUQ071), general health condition (HUQ010), diabetes status (DIQ010), glycated hemoglobin level (LBXGH), presence of kidney issues (KIQ022), blood lead level (LBXBPB), hypertension status (BPQ020), body mass index (BMXBMI), food security category (FSDAD), educational attainment (DMDEDUC2), household income (INDFMPIR), and health insurance coverage (HIQ011). The study included participants aged between 30 and 80 years, considering age as a critical factor in PD classification, with older individuals being more susceptible to periodontitis.⁶

Table 1.

Summary of confounding variables and their value ranges.

Serial number	Variable name	Type	Mean	Standard deviation
1	Age (RIDAGEYR)	Continuous	52.01	14.30
2	Gender (RIAGENDR)	Categorical	1.5	0.5
3	Ethnicity (RIDRETH1)	Categorical	3.07	1.17
4	Education level (DMDEDUC2)	Categorical	3.49	1.29
5	Health insurance (HIQ011)	Categorical	1.22	0.42
6	Household income (INDFMPIR)	Continuous	2.42	1.74
7	Body mass index (BMXBMI)	Continuous	29.11	7.07
8	Smoking status	Categorical	1.62	0.78
9	Alcohol drinking (ALQ101)	Categorical	1.17	0.59
10	Hypertension status (BPQ020)	Categorical	1.63	0.56
11	Sleep disorder (SLQ050)	Categorical	1.75	0.44

Table 2.

Summary of expert-driven variables and their value ranges.

Serial number	Variable name	Type	Mean	Standard deviation
12	General health condition (HUQ010)	Categorical	2.73	1.03
13	Blood lead level (LBXBPB)	Continuous	1.30	1.78
14	Diabetes status (DIQ010)	Categorical	1.91	0.41
15	Weak/failing kidneys (KIQ022)	Categorical	1.98	0.30
16	Glycohemoglobin level (LBXGH)	Continuous	5.60	1.53
17	Food security (FSDAD)	Categorical	1.53	0.96
18	Last visit to a dentist (OHQ030)	Categorical	1.74	2.81
19	Overnight hospital patient in last year (HUQ071)	Categorical	1.89	0.33
20	PD measures (label)	Categorical	0.5118	0.4999

As people get older, their likelihood of having PD increases. This relationship is illustrated in Figure 2, which shows a distinct association between age and the risk of PD. Dental visits were classified as follows: less than 6 months (coded as 1), less than a year (coded as 2), less than 2 years (coded as 3), less than 3 years (coded as 4), less than 5 years (coded as 5), more than 5 years (coded as 6), and never visited a dentist (coded as 7). Gender was represented as male (coded as 0) or female (coded as 1). Race and ethnicity were grouped as follows: Mexican American (coded as 1), Other Hispanic (coded as 2), non-Hispanic White (coded as 3), non-Hispanic Black (coded as 4), and other races (coded as 5). Education level was divided into five categories: (1) less than 9th grade, (2) 9-12th grade, (3) high school graduate, (4) some college degree, and (5) college graduate. Family income was a continuous variable with values ranging from 0 to 5.00. Due to missing values in smoking status (SMQ020, SMQ040), the two items were consolidated into a single feature as follows: participants who had smoked fewer than 100 cigarettes in the past were classified as never smokers (coded as 1). Those who had smoked at least 100 cigarettes but do not currently smoke were classified as former smokers (coded as 2). Participants who had smoked at least 100 cigarettes and currently smoke every day or some days were categorized as current smokers (coded as 3).⁶ Obesity, calculated based on BMI, was a continuous variable with BMI values between 13.18 and 82.9. Food security was divided into four categories: full (coded as 1), marginal (coded as 2), low (coded as 3), and very low (coded as 4). The blood lead level (µg/dL) ranged from 0.18 to 43.52. Tables 1 and 2 present the variables used in this study along with their respective value ranges.
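The smoking recode described above can be sketched as a small function. This is an illustration rather than the study's code: the function name is hypothetical, and the response codes assumed here (SMQ020: 1 = yes, 2 = no; SMQ040: 1 = every day, 2 = some days, 3 = not at all) follow the public NHANES codebooks and should be checked against the cycle in use.

```python
from typing import Optional

# Assumed NHANES questionnaire codes (verify against the cycle's codebook):
#   SMQ020 "Smoked at least 100 cigarettes in life": 1 = yes, 2 = no
#   SMQ040 "Do you now smoke cigarettes": 1 = every day, 2 = some days, 3 = not at all
NEVER, FORMER, CURRENT = 1, 2, 3


def smoking_status(smq020: Optional[int], smq040: Optional[int]) -> Optional[int]:
    """Collapse SMQ020/SMQ040 into never (1), former (2), or current (3) smoker."""
    if smq020 == 2:
        return NEVER      # fewer than 100 cigarettes; SMQ040 is not needed
    if smq020 == 1 and smq040 == 3:
        return FORMER     # smoked 100+ cigarettes, does not smoke now
    if smq020 == 1 and smq040 in (1, 2):
        return CURRENT    # smoked 100+ cigarettes, smokes every day or some days
    return None           # refused / don't know / missing: handled elsewhere


# Hypothetical respondents: (SMQ020, SMQ040)
rows = [(2, None), (1, 3), (1, 1), (1, 2), (None, None)]
statuses = [smoking_status(a, b) for a, b in rows]
print(statuses)  # [1, 2, 3, 3, None]
```

Because never smokers are resolved from SMQ020 alone, a missing SMQ040 value for them does not propagate into the combined feature, which is one plausible reading of why the two items were merged.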

Figure 2.

Age versus periodontal disease.

Data preprocessing

The original NHANES dataset comprised 101,316 cases and 10,896 features. Cases with missing periodontal data were excluded to prepare the data for ML analysis, resulting in a reduced sample size of 10,714. Following a comprehensive and systematic review of the existing literature^38–42 on statistical analyses of risk factors for PD, supplemented by clinical expertise specific to this study, nineteen features were identified for the initial investigation. These features were selected based on their documented importance in previous research and their potential relevance in capturing patterns associated with PD.^54–56 By combining evidence-based insights from the literature with clinical expertise, we ensured that the selected features not only align with established knowledge but also hold practical significance for predictive modeling and analytical purposes. Redundant features providing similar information and variables that were likely outcomes of periodontitis were excluded from the NHANES dataset for the three cycles (2009–2014). Rows containing values marked as "refused" or "don't know" were removed from the dataset.

The preprocessed dataset had missing values, presenting a substantial challenge for ML classifiers. Careful management of missing data is crucial to obtaining reliable results. Imputation is a widely adopted technique for addressing missing values, entailing the substitution of missing data with estimated or calculated values.⁵⁷ In this study, the iterative imputer was employed. This method leverages the internal structure and relationships within the dataset to estimate and fill in missing values. It follows an iterative modeling approach, first initializing missing values with preliminary estimates. It then refines these values by fitting a multiple linear regression model that predicts each feature's missing values from the other features. This process continues until convergence, refining prior imputations in each iteration. The final imputed values are determined by averaging imputations across multiple iterations, effectively optimizing missing value estimates using the best available information. Iterative imputation was chosen for its ability to capture multivariate relationships with greater accuracy than simple methods, while offering a balance of interpretability and efficiency compared to complex deep learning (DL)-based techniques, making it well-suited for structured tabular healthcare data.
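The core loop of iterative imputation can be illustrated with a deliberately small sketch. It is not the imputer used in the study: it handles only two columns, uses simple rather than multiple linear regression, runs a fixed number of rounds, and returns the last round's values instead of an average. All names and the toy data are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def iterative_impute(col_a, col_b, n_iter=10):
    """Chained-regression imputation for two columns; None marks a missing value."""
    cols = [list(col_a), list(col_b)]
    missing = [[v is None for v in c] for c in cols]
    # Step 1: initialise every gap with the mean of that column's observed values.
    for c, miss in zip(cols, missing):
        fill = mean(v for v in c if v is not None)
        for i, is_missing in enumerate(miss):
            if is_missing:
                c[i] = fill
    # Step 2: regress each column on the other and refresh its gaps, repeatedly.
    for _ in range(n_iter):
        for target in (0, 1):
            predictor = 1 - target
            observed = [i for i, m in enumerate(missing[target]) if not m]
            a, b = fit_line([cols[predictor][i] for i in observed],
                            [cols[target][i] for i in observed])
            for i, is_missing in enumerate(missing[target]):
                if is_missing:
                    cols[target][i] = a + b * cols[predictor][i]
    return cols


# Toy data in which the second column is exactly twice the first.
x_filled, y_filled = iterative_impute([1, 2, 3, 4, None, 6], [2, 4, None, 8, 10, 12])
print(round(x_filled[4], 3), round(y_filled[2], 3))  # approaches 5 and 6
```

The mean-filled starting values are pulled towards the values implied by the linear relationship between the columns within a few rounds, which is the behaviour the paragraph above describes as refinement until convergence.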

Identifying the most relevant features can enhance the performance of the ML classifiers. Feature selection focuses the model on pertinent features, which reduces overfitting, improves interpretability, and accelerates training and inference, yielding more efficient and robust models that generalize better to new data. Initially, all columns with more than 50% missing (NaN) values were removed, leaving 408 features. The study then considered the top 50 features with the highest F-scores. Redundant variables conveying similar information, as well as variables that were highly influential because they are likely outcomes of periodontitis, such as the count of teeth lost (OHX*) and the reason for the dental visit, were excluded. Confounding variables identified in prior literature, along with insights from expert dentists, were integrated with the resulting selected features. This led to the inclusion of nineteen distinct features for analysis. To assess whether each variable was associated with the outcome (periodontitis coded as 1, non-periodontitis coded as 0), chi-square tests were conducted. Four variables (weak kidney, drinking, overnight hospitalization, and sleep disorder) had p-values above the significance threshold (p ≥ 0.05), indicating no significant association with the response variable and rendering them unsuitable for model training. Consequently, these variables were removed from the dataset. This exclusion helps improve the model's performance by eliminating irrelevant variables that do not contribute meaningfully to predictive accuracy. It also prevents overfitting and enhances model interpretability. Domain expertise further confirmed their minimal clinical relevance in the context of this study. A total of thirteen variables exhibited statistically significant relationships with the outcome variable, PD (p < 0.05). Additionally, the pairwise correlation matrix of the features revealed a strong correlation (≥ 0.5) between educational attainment and household income, indicating a positive linear relationship: as one variable increases, the other also tends to increase. To mitigate multicollinearity and reduce redundancy, educational attainment was removed from the dataset.
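The chi-square screen can be illustrated for the simplest case of a binary feature against the binary PD label. The sketch below is an assumption-laden example, not the study's analysis: the counts are invented, the function name is hypothetical, and features with more than two categories would need the general test with more degrees of freedom.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table (no continuity correction)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom the upper-tail probability has a closed form.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts: rows = feature yes/no, columns = PD yes/no.
stat_assoc, p_assoc = chi_square_2x2([[60, 40], [40, 60]])
stat_indep, p_indep = chi_square_2x2([[50, 50], [50, 50]])
print(stat_assoc, round(p_assoc, 4))  # 8.0 0.0047 -> keep the feature
print(stat_indep, p_indep)            # 0.0 1.0    -> candidate for removal
```

A feature whose table looks like the second example gives no evidence of association with the outcome at the 0.05 level, which is the criterion used above to drop the four variables.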

Statistical analysis

ML techniques were used to classify PD presence (coded as 1) or absence (coded as 0).⁵⁸ ML classifiers analyze the training data and recognize patterns and relationships between input features and desired outcomes. Throughout the training phase, the algorithm fine-tunes its parameters to minimize the disparity between its predicted periodontal outcomes and the actual target periodontal outcomes. Subsequently, these learned parameters are applied to predict outcomes for the test dataset. The performance of the ML classifiers is assessed through diverse metrics that compare predicted and actual periodontal outcomes on the test dataset, which is kept separate from the training phase. ML has a distinct advantage in processing large and complex datasets, especially those with complex correlations between features. ML can enhance clinical decision support by providing potential benefits for the precise diagnosis and prognosis of oral health conditions. An important implication of these findings is the formulation of personalized dental treatment plans.

A total of 10,714 cases were used to train and test the ML classifiers. Among them, 8571 cases (80% of 10,714) were randomly selected for training, while 2143 cases (20% of 10,714) were reserved for testing. To ensure a fair comparison, identical splits (generated with the same fixed random seed) were used for all ML classifiers. The ML classifiers in this study were developed using Python 3.7.0 (Python Software Foundation). Evaluation metrics such as accuracy, precision (macro), recall (macro), F1-score (macro), sensitivity, specificity, and AUC were calculated using the test dataset.
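A seeded split of this kind can be sketched with the standard library alone. The function name and the seed value 42 are assumptions for illustration; the study does not report which seed or splitting utility was used.

```python
import random


def train_test_split_indices(n_cases, test_fraction=0.2, seed=42):
    """Shuffle case indices with a fixed seed and cut them into train/test parts."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffle for every classifier
    n_test = round(n_cases * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(10714)
print(len(train_idx), len(test_idx))  # 8571 2143
```

Reusing the returned index lists for every classifier guarantees that differences in test metrics reflect the models rather than the partition.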

Results

The study included 10,714 participants, with 48.67% males and 51.6% females. The racial/ethnic distribution was 40.83% White, 22.02% African American, 21.50% Hispanic or Mexican American, and 15.65% others. Among the participants, 79.24% were below 65 years, while 20.76% were 65 years or older.

The ML classifiers demonstrated robust performance in classifying PD. Specifically, CatBoost, XGBoost, and RF achieved the highest accuracies of 75.8%, 74.4%, and 73.7%, respectively (see Table 3). GradientBoost, AdaBoost, LR, and SVM exhibited lower accuracies of 73.1%, 71.3%, 71.6%, and 70.4%, respectively. KNN, NB, and NN demonstrated the lowest performance among all tested algorithms, yet still achieved reasonable accuracies of 65.5%, 65.1%, and 63.5%, respectively.

Table 3.

Evaluation of each model's accuracy, AUC, precision (macro), recall (macro)/sensitivity, specificity, and F1-score (macro).

ML model	Accuracy	AUC	Precision	Recall/sensitivity	Specificity	F1-score
NN	0.635	0.679	0.638	0.702	0.562	0.634
NB	0.651	0.729	0.663	0.549	0.765	0.649
KNN	0.655	0.705	0.655	0.644	0.668	0.655
DT	0.677	0.677	0.677	0.680	0.674	0.677
SVM	0.704	0.764	0.704	0.740	0.665	0.703
LR	0.716	0.773	0.715	0.737	0.692	0.714
AdaBoost	0.713	0.777	0.713	0.721	0.705	0.713
GradientBoost	0.731	0.798	0.731	0.760	0.699	0.730
RF	0.737	0.815	0.736	0.758	0.714	0.735
XGBoost	0.744	0.823	0.744	0.766	0.721	0.743
CatBoost	0.758	0.845	0.758	0.788	0.725	0.757
CatBoost (five-fold)	0.770 ± 0.010	0.853 ± 0.004	0.767 ± 0.009	0.793 ± 0.012	0.746 ± 0.016	0.780 ± 0.007
CatBoost (19 features)	0.749	0.820	0.748	0.769	0.726	0.748

Figure 3 shows the confusion matrix for the CatBoost classifier, the best-performing model. The model made 2143 predictions on the test set. Of these, 739 instances were correctly classified as negative (true negatives). There were 281 false positives, where the model predicted PD for participants who did not have it, and 238 false negatives, where the model predicted no PD for participants who did have it. The model correctly identified 885 instances as positive (true positives).
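The headline CatBoost metrics in Table 3 can be recovered from these four counts. The sketch below uses the standard definitions; the variable names are illustrative, and macro averaging is taken as the unweighted mean over the positive and negative classes.

```python
# Confusion-matrix counts reported for CatBoost on the test set.
tn, fp, fn, tp = 739, 281, 238, 885

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)      # recall for the PD class
specificity = tn / (tn + fp)      # recall for the non-PD class

# Macro precision: unweighted mean of the per-class precisions.
precision_pos = tp / (tp + fp)
precision_neg = tn / (tn + fn)
precision_macro = (precision_pos + precision_neg) / 2

# Macro F1: unweighted mean of the per-class F1 scores.
f1_pos = 2 * tp / (2 * tp + fp + fn)
f1_neg = 2 * tn / (2 * tn + fn + fp)
f1_macro = (f1_pos + f1_neg) / 2

print(total)                      # 2143
print(round(accuracy, 3))         # 0.758
print(round(sensitivity, 3))      # 0.788
print(round(specificity, 3))      # 0.725
print(round(precision_macro, 3))  # 0.758
print(round(f1_macro, 3))         # 0.757
```

All five values agree with the CatBoost row of Table 3 to three decimal places.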

Figure 3.

Confusion matrix with evaluated statistics.

Figure 4 depicts the precision-recall curve, which shows the trade-off between precision and recall in a binary classification model. The primary objective of the model is to minimize false negatives when identifying potential dental patients, so that no potential patients are overlooked. The emphasis is therefore on maximizing recall, which measures the model's ability to identify all positive cases.

Figure 4.

Precision-recall curve.

By prioritizing recall, the model seeks to identify the maximum number of PD patients, reducing the chances of overlooking individuals requiring dental attention. False positives still matter, because referring too many healthy individuals would make screening impractical, but their cost is outweighed by the cost of false negatives, which could leave patients in need of dental care untreated. The model therefore emphasizes recall while keeping false positives at a level that preserves an effective and practical screening approach.
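The trade-off can be made concrete by moving the decision threshold on a small set of predicted probabilities. The labels, scores, and function name below are hypothetical and are not taken from the study's model.

```python
def precision_recall_at(threshold, y_true, y_score):
    """Precision and recall when scores >= threshold are labelled positive."""
    pairs = list(zip(y_true, y_score))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical PD labels and predicted probabilities for eight people.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.7, 0.6, 0.3, 0.8, 0.4, 0.2, 0.1]

default_pr = precision_recall_at(0.5, y_true, y_score)
lowered_pr = precision_recall_at(0.3, y_true, y_score)
print(default_pr)                                    # (0.75, 0.75)
print(round(lowered_pr[0], 3), lowered_pr[1])        # 0.667 1.0
```

Lowering the threshold from 0.5 to 0.3 catches the one PD case that was missed, at the price of one additional false referral, which is the kind of exchange a recall-oriented screening tool accepts.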

The Receiver Operating Characteristic (ROC) curves shown in Figure 5 compare the performance of the different ML classifiers. CatBoost exhibited the best ROC performance, with an AUC of 0.84. This can be attributed to its ability to efficiently handle multivariate, structured datasets like NHANES, which include both continuous and categorical variables. Its native support for categorical data reduces the need for extensive preprocessing while preserving feature relationships. Additionally, CatBoost's ordered boosting technique mitigates overfitting by preventing target leakage, which is particularly valuable when dealing with correlated clinical variables. In contrast, the Decision Tree classifier demonstrated the lowest performance, with an AUC of 0.68, likely due to its tendency to overfit and its limited capacity to capture complex feature interactions.
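The AUC has a direct probabilistic reading: it is the chance that a randomly chosen PD case receives a higher score than a randomly chosen non-PD case. A minimal sketch of that definition follows; the function name and toy scores are illustrative, and the pairwise loop is quadratic, so it is meant for explanation rather than for large test sets.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

Under this reading, an AUC of 0.84 means that in roughly 84 of 100 such random pairs the model scores the participant with PD higher.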

Figure 5.

Receiver operating characteristics AUC score for each model.

Discussion

Periodontitis is a chronic inflammatory condition affecting the supportive structures of the teeth, including the gums, periodontal ligament, and alveolar bone. This condition leads to a progressive deterioration of the supportive tissues, primarily caused by gum disease (gingivitis). The study centered on periodontitis due to its significant impact on oral health. However, periodontitis is preventable and treatable, with early intervention playing a crucial role in minimizing its consequences. Identification of periodontitis was carried out through oral examinations conducted by licensed and trained dental professionals as part of the NHANES study. Assessment of periodontitis involved evaluating the health of the gums, periodontal ligament, and alveolar bone. Various clinical parameters, such as probing depth, attachment loss, and bleeding on probing, were measured to diagnose the condition and assess its severity.

The increasing prevalence of PD presents a significant public health challenge. Leveraging ML techniques offers the opportunity to uncover factors associated with periodontitis, thereby facilitating improvements in oral health and subsequent enhancements in overall well-being. In this study, we applied ML approaches to a comprehensive dataset to detect periodontitis. Through the analysis of NHANES data, we applied various ML methods to identify the optimal model and the factors correlated with PD. Our results indicated that the CatBoost classifier achieved the highest accuracy in distinguishing between the presence and absence of PD. We also evaluated CatBoost on the nineteen initial features (see Table 3); its performance was lower than with the selected subset, which we attribute to the inclusion of irrelevant features. This underscores the importance of an effective preprocessing method, which successfully identified features for exclusion from the initial set. The selected subset of thirteen features provided superior performance, highlighting the significance of the proposed preprocessing approach. Moreover, this feature reduction minimizes the number of parameters that need to be documented, facilitating the development of real-time applications with fewer features while preserving the model's performance.

To assess the significance of features in identifying PD, feature importance scores (F-Scores) were derived from the CatBoost Classifier and depicted in Figure 6. Higher scores indicate a greater contribution to the accurate identification of PD. Interestingly, dental visit status, household income, and age have emerged as the three most influential variables. Following these, the ten top ranking characteristics include blood lead level, glycohemoglobin, obesity (BMI), demographic factors (sex, race/ethnicity), and smoking status. In addition, it could be observed that most of the filtered features had a minimal impact on the performance of the model.

Figure 6.

Variable importance measurement score based on F-Score for: (a) all nineteen features and (b) the selected thirteen features.

Several features consistently emerged as critical indicators of PD across different ML methods. These variables included dental visit status, age, household income, blood lead level, obesity, demographic factors, and smoking status. Age was identified as the most relevant predictive variable, aligning with existing evidence¹⁵ that highlights the increased risk of PD with advancing age. As individuals grow older, they are more prone to gum tissue inflammation and bone loss, contributing to the development of PD. Low income, a marker of socioeconomic status and potential financial barriers to dental care access, was also found to be associated with PD.⁵⁹ Lack of regular dental care can impede early diagnosis, prevention, and treatment of PD, leading to poorer oral health outcomes.

During our ML feature analysis, we observed a notable correlation between daily smoking and an elevated probability of developing PD. Individuals who smoke daily are at increased risk of developing PD compared to non-smokers, corroborating findings from recent studies.^17,60 Furthermore, our analysis revealed an association between blood lead levels and PD. Several recent statistical studies^61–63 have highlighted the link between blood lead levels and the risk of periodontitis. To further enhance the interpretability of the model, Shapley Additive Explanations (SHAP)⁶⁴ and Individual Conditional Expectation (ICE)⁶⁵ plots were used to provide insight into the predictive logic of the CatBoost model. Figure 7(a) and (b) show the ICE feature expectation plots, and Figure 8(a) and (b) show the SHAP feature importance scores. As observed in both figures, the elimination of six features had minimal impact on model performance, highlighting the effectiveness of the preprocessing techniques in identifying redundant features.
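The idea behind SHAP can be shown exactly on a model small enough to enumerate. In the sketch below, the risk function, its coefficients, and the feature values are invented for illustration and have no connection to the fitted CatBoost model; practical SHAP implementations use faster tree-specific algorithms and a background dataset rather than a single baseline row.

```python
from itertools import permutations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)
        previous = model(current)
        for i in order:
            current[i] = x[i]          # switch feature i from baseline to its actual value
            value = model(current)
            phi[i] += value - previous
            previous = value
    return [p / factorial(n) for p in phi]


def toy_risk(features):
    """Hypothetical risk score with an interaction term; NOT the paper's model."""
    age, smoker, years_since_visit = features
    return (0.01 * age + 0.2 * smoker + 0.05 * years_since_visit
            + 0.1 * smoker * years_since_visit)


person = [60, 1, 3]      # hypothetical participant
reference = [50, 0, 1]   # hypothetical baseline participant
phi = shapley_values(toy_risk, person, reference)
print([round(p, 3) for p in phi])  # [0.1, 0.4, 0.2]
```

The three attributions sum to the difference between the participant's score and the baseline score (0.7 here), and the interaction between smoking and time since the last visit is shared between those two features.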

Figure 7.

ICE feature expectation plots for: (a) the selected features and (b) the remaining six features.

Figure 8.

SHAP feature importance score for: (a) all nineteen features and (b) the selected thirteen features.

Experiments were conducted to optimize the CatBoost classifier using Particle Swarm Optimization (PSO)⁶⁶ across 100 iterations. The ranges explored and the final selected hyperparameter values are summarized in Table 4. The observed improvement in model performance (see Table 5) was marginal. This can be attributed to the inherently robust default hyperparameters of CatBoost, which are well-optimized for structured tabular data and frequently yield competitive results without extensive hyperparameter fine-tuning.

Table 4.

CatBoost hyperparameter search ranges and PSO-selected values.

Hyperparameter	Range	PSO selected value
learning_rate	0.0001–0.9	0.051
depth	1–15	8
l2_leaf_reg	1–10	2.786
bagging_temperature	0–10	5.429
subsample	0.5–1.0	0.800
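The PSO search over ranges like those in Table 4 can be sketched as follows. This is a minimal illustration under stated assumptions: only two of the five hyperparameters are searched, depth is treated as continuous, the inertia and acceleration constants (0.7, 1.5, 1.5), the swarm size, and the seed are arbitrary choices, and `toy_loss` is a made-up stand-in for the validation loss that a real run would obtain by training CatBoost.

```python
import random

# Search ranges for two of the hyperparameters listed in Table 4.
BOUNDS = {"learning_rate": (0.0001, 0.9), "depth": (1.0, 15.0)}


def toy_loss(params):
    """Hypothetical stand-in for a validation loss; a real run would train a model here."""
    lr, depth = params
    return (lr - 0.05) ** 2 + ((depth - 8.0) / 15.0) ** 2


def pso(loss, bounds, n_particles=20, n_iter=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Basic global-best particle swarm minimisation within box bounds."""
    rng = random.Random(seed)
    lo = [b[0] for b in bounds]
    hi = [b[1] for b in bounds]
    dim = len(bounds)
    pos = [[rng.uniform(lo[d], hi[d]) for d in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]               # each particle's personal best
    best_val = [loss(p) for p in pos]
    g = min(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]   # swarm-wide best
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                # Move, then clamp back into the search range.
                pos[i][d] = min(hi[d], max(lo[d], pos[i][d] + vel[i][d]))
            val = loss(pos[i])
            if val < best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val < g_val:
                    g_pos, g_val = pos[i][:], val
    return g_pos, g_val


best_params, best_loss = pso(toy_loss, list(BOUNDS.values()))
print(dict(zip(BOUNDS, best_params)), best_loss)
```

In a real tuning run each loss evaluation is a full model fit, so the swarm size and iteration count dominate the cost; integer hyperparameters such as depth would be rounded before training.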

Table 5.

Evaluation of each model's accuracy, AUC, precision (macro), recall (macro)/sensitivity, specificity, and F1-score (macro).

ML model	Accuracy	AUC	Precision	Recall/sensitivity	Specificity	F1-score
CatBoost	0.758	0.845	0.758	0.788	0.725	0.757
CatBoost (fine tuned)	0.769	0.852	0.769	0.807	0.726	0.767

The potential for AI to revolutionize oral health diagnosis and prognosis is immense. By seamlessly integrating ML algorithms into future applications, such as real-time clinical decision support tools, precision medicine in dental care can become a reality. These screening tools can be utilized in various settings, including general medical practices, clinics, social service centers, and online platforms. They can provide oral examination recommendations for those at high risk. Additionally, ML can provide valuable insights into identifying underlying medical conditions or lifestyle factors linked to PD. This information can be particularly beneficial for non-dental professionals who recognize at-risk patients and refer them to oral health experts for prompt intervention, evaluation, and prevention.

By 2034, it is anticipated that the older population will substantially increase, with approximately one in five individuals aged 65 years or older, as projected by the US Census Bureau’s 2017 National Population Projections.⁶⁷ With the aging demographic, the prevalence of PD and other oral health issues among older individuals is expected to rise. Leveraging ML techniques presents an opportunity to gain deeper insights into and address PD in elderly populations, offering a promising avenue for early intervention and enhancing oral health outcomes. Through this study, robust and accurate algorithms have been developed to classify periodontitis using ML. These algorithms have the potential to catalyze the development of automated, cost-effective tools for dental care and precision medicine, with far-reaching implications for the prevention and management of PD and other oral health conditions.

Limitations

Several limitations were encountered in this study. First, the prediction model was developed exclusively using data from NHANES, potentially restricting the generalizability of our results. Moreover, reliance on self-reported data for behavioral factors such as smoking and alcohol consumption may introduce reliability issues and could influence the accuracy of our prediction model. Additionally, the utilization of imputation methods to address missing data could impact the performance of the model.

Future studies

In future work, the proposed model will be applied to, and potentially retrained on, an expanded dataset. Other NN architectures will be investigated to improve the model’s performance, and additional risk factors will be incorporated to better capture the distribution and behavior of the input features. We plan to explore additional imputation methods^68,69 and conduct further sensitivity analyses to ensure the robustness of our findings. Although this study used features selected from the prior literature and expert dental insights, further refined through feature engineering, future research could extend this work through a comprehensive, data-driven study of feature selection.⁷⁰ Such studies may utilize all available features to identify optimal combinations and apply advanced methods to validate the robustness of the selection process. This study lays the groundwork for a decision support tool that assists healthcare practitioners in making well-informed decisions about the risk of PD development across various patient screening scenarios.

Conclusions

Periodontal disease is a widespread oral health concern, and ML techniques that assist in diagnostic decision-making and preventive interventions for this condition can yield substantial health benefits. Our initial investigation yielded promising results by leveraging NHANES data and ML techniques to predict the likelihood of PD. Such a model could give healthcare providers fresh insights into the risks associated with PD and the key factors contributing to its progression. The models developed in this study performed well in classifying the presence or absence of PD, as evidenced by their accuracy, sensitivity, specificity, precision, and AUC scores. The visualization of features facilitated the interpretability of the predicted outcomes. Notably, many identified features aligned with findings from recent studies, underscoring the clinical relevance of our approach. This knowledge can empower clinicians to adopt proactive, informed interventions for PD management.

Footnotes

ORCID iDs

Giang T. Vu

Veena Mayya

Author contributions

Giang T. Vu: Contributed to conception, design, data acquisition, and interpretation; drafted and critically revised the manuscript. Veena Mayya: Contributed to design, data acquisition, and interpretation; drafted and critically revised the manuscript. Babu Mandhidi: Contributed to data acquisition and interpretation; drafted and critically revised the manuscript. Christian King: Contributed to drafting and critically revised the manuscript. Bert B. Little: Contributed to drafting and critically revised the manuscript. Varadraj Gurupur: Contributed to drafting and critically revised the manuscript. Astha Singhal: Contributed to drafting and critically revised the manuscript.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The NHANES data that support the findings of the present study are publicly available on the CDC website.^13,15

References

1. Lang, Bartold. Periodontal health. J Periodontol 2018; 89(Suppl 1): S9–S16.

2. Peres, Macpherson LMD, Weyant, et al. Oral diseases: a global public health challenge. Lancet 2019; 394(10194): 249–260.

3. Papapanou, Sanz, Buduneli, et al. Periodontitis: consensus report of workgroup 2 of the 2017 world workshop on the classification of periodontal and peri-implant diseases and conditions. J Clin Periodontol 2018; 45(Suppl 20): S162–S170.

4. Eke, Dye, Wei, et al. Update on prevalence of periodontitis in adults in the United States: NHANES 2009 to 2012. J Periodontol 2015; 86(5): 611–622.

5. Eke, Dye, Wei, et al. Prevalence of periodontitis in adults in the United States: 2009 and 2010. J Dent Res 2012; 91(10): 914–920.

6. Eke, Thornton-Evans, Wei, et al. Periodontitis in US adults: national health and nutrition examination survey 2009–2014. J Am Dent Assoc 2018; 149(7): 576–588.

7. Wang, Meng, et al. Mean platelet volume as an inflammatory marker in patients with severe periodontitis. Platelets 2015; 26(1): 67–71.

8. Kandelman, Petersen, Ueda. Oral health, general health, and quality of life in older people. Spec Care Dent 2008; 28(6): 224–236.

9. Wong, Yap, Allen. Periodontal disease and quality of life: umbrella review of systematic reviews. J Periodontal Res 2021; 56(1): 1–17.

10. Paśnik-Chwalik, Konopka. Impact of periodontitis on the oral health impact profile: a systematic review and meta-analysis. Dent Med Probl 2020; 57(4): 423–431.

11. Shiau, Reynolds. Sex differences in destructive periodontal disease: a systematic review. J Periodontol 2010; 81(10): 1379–1389.

12. LeHew, Weatherspoon, Peterson, et al. The health system and policy implications of changing epidemiology for oral cavity and oropharyngeal cancers in the United States from 1995 to 2016. Epidemiol Rev 2017; 39(1): 132–147.

13. Borrell, Crawford. Socioeconomic position indicators and periodontitis: examining the evidence. Periodontology 2000 2012; 58(1): 69–83.

14. Verma, Reddy, Verma, et al. The impact of lifestyles on the periodontal health among 35–44 years old adult population in Lucknow district: a cross-sectional study. J Indian Assoc Public Health Dent 2022; 20(2): 147–152.

15. Huang, Dong. Prevalence of periodontal disease in middle-aged and elderly patients and its influencing factors. Am J Transl Res 2022; 14(8): 5677–5684.

16. Pham TAV, Kieu, Ngo LTQ. Risk factors of periodontal disease in Vietnamese patients. J Investig Clin Dent 2018; 9(1): e12272.

17. Duarte, Nogueira CFP, Silva, et al. Impact of smoking cessation on periodontal tissues. Int Dent J 2022; 72(1): 31–36.

18. Gay, Tran, Paquette. Alcohol intake and periodontitis in adults aged ≥30 years: NHANES 2009–2012. J Periodontol 2018; 89(6): 625–634.

19. Genco, Borgnakke. Risk factors for periodontal disease. Periodontology 2000 2013; 62(1): 59–94.

20. Farhadian, Shokouhi, Torkzaban. A decision support system based on support vector machine for diagnosis of periodontal disease. BMC Res Notes 2020; 13(1): 337.

21. Kierce, Kolts. Improving periodontal disease management with artificial intelligence. Compendium 2023; 44(6): e1–e4.

22. Irfan, Riggs, Koromia, et al. Smoking-associated electrocardiographic abnormalities predict cardiovascular mortality. Sci Rep 2024; 14(1): 31189.

23. Hei, Cai, Wang, et al. Association of the triglyceride-glucose index with cardiovascular mortality risk and competing risks in arthritis patients. Sci Rep 2024; 14(1): 31387.

24. Yang, Pan, Xia, et al. Effect of dietary probiotics intake on cancer mortality: a cohort study of NHANES 1999–2018. Sci Rep 2025; 15(1): 959.

25. Chen, Xiong, et al. Association of prophylactic low-dose aspirin use with all-cause and cause-specific mortality in cancer patients. Sci Rep 2024; 14(1): 25918.

26. Huang. Use of machine learning to identify risk factors for insomnia. PLoS One 2023; 18(4): e0282622.

27. Vangeepuram, Liu, Chiu, et al. Predicting youth diabetes risk using NHANES data and machine learning. Sci Rep 2021; 11(1): 11212.

28. Chen. Perspective from NHANES data: synergistic effects of visceral adiposity index and lipid accumulation products on diabetes risk. Sci Rep 2025; 15(1): 258.

29. Dai, Chang, Hou. Associations between the conicity index and kidney stone disease prevalence and mortality in American adults. Sci Rep 2025; 15(1): 902.

30. Liu, Jin, Hao, et al. Association between relative fat mass and kidney stones in American adults. Sci Rep 2024; 14(1): 27045.

31. Zhang, et al. Association between metabolic dysfunction associated steatotic liver disease and gallstones in the US population using propensity score matching. Sci Rep 2025; 15(1): 910.

32. Zeng, et al. Evaluating body roundness index and systemic immune inflammation index for mortality prediction in MAFLD patients. Sci Rep 2025; 15(1): 330.

33. Huang, Guo, Feng, et al. Comparative study on the association between types of physical activity, physical activity levels, and the incidence of osteoarthritis in adults: the NHANES 2007–2020. Sci Rep 2024; 14(1): 20574.

34. Jiang. Association between the composite dietary antioxidant index and all-cause mortality in individuals with osteoarthritis via NHANES data. Sci Rep 2024; 14(1): 30387.

35. Schuch, Furtado, Silva GFDS, et al. Fairness of machine learning algorithms for predicting foregone preventive dental care for adults. JAMA Netw Open 2023; 6(11): e2341625.

36. Chen, Zheng, Lan, et al. Development and validation of a new nomogram for self-reported OA based on machine learning: a cross-sectional study. Sci Rep 2025; 15(1): 827.

37. Zhang, Jin. New insights into the correlation between bone mineral density and dental caries in NHANES 2011–2016. Sci Rep 2024; 14(1): 29143.

38. Shakib, King, et al. Association between uncontrolled diabetes and periodontal disease in US adults: NHANES 2009–2014. Sci Rep 2023; 13(1): 16694.

39. Zhang, Lin, Chen, et al. Association of periodontitis with all-cause and cause-specific mortality among individuals with depression: a population-based study. Sci Rep 2024; 14(1): 21917.

40. Zhao, Cao, Zhang. Association between relative fat mass and periodontitis: results from NHANES 2009–2014. Sci Rep 2024; 14(1): 18251.

41. Brahmbhatt, Alqaderi, Chinipardaz. Association between severe periodontitis and cognitive decline in older adults. Life 2024; 14(12): 1589.

42. Liu, Zhang, et al. Association study of depressive symptoms and periodontitis in an obese population: analysis based on NHANES data from 2009 to 2014. PLoS One 2024; 19(12): e0315754.

43. Lowenstein, Singh, Papas. Addressing disparities in oral health access and outcomes for aging adults in the United States. Front Dent Med 2025; 6: 1522892.

44. Rahman, Blossom, Kawachi, et al. Dental clinic deserts in the US: spatial accessibility analysis. JAMA Netw Open 2024; 7(12): e2451625.

45. National Institute of Dental and Craniofacial Research (NIDCR). Prevalence of periodontal disease in adults (age 30 or older). 2021. https://www.nidcr.nih.gov/research/data-statistics/periodontal-disease/adults (accessed June 2025).

46. Eke, Page, Wei, et al. Update of the case definitions for population-based surveillance of periodontitis. J Periodontol 2012; 83(12): 1449–1454.

47. Ghassib, Batarseh, Wang, et al. Clustering by periodontitis-associated factors: a novel application to NHANES data. J Periodontol 2021; 92(8): 1136–1150.

48. Wang, Xiao, et al. A novel nomogram for predicting risk of hypertension in US adults with periodontitis: national health and nutrition examination survey (NHANES) 2009–2014. Medicine (United States) 2023; 102(51): e36659.

49. Yuan, Miao, Hou, et al. Association between sleep and periodontitis: NHANES 2009–2014 and Mendelian randomization study. Cranio J Craniomandib Sleep Pract 2024; 43: 986–995.

50. Liu, Xia, Gao, et al. Association between obesity and periodontitis in US adults: NHANES 2011–2014. Obes Facts 2024; 17(1): 47–58.

51. Song, Wang, Zheng, et al. Periodontitis prevalence and acceleration of biological aging: insights from NHANES 2009–2014 and Mendelian randomization study. J Periodontal Res 2024; 60: 350–360.

52. Liu, Yang, Liu. Interaction between tobacco smoke exposure and zinc intake and its effect on periodontitis: evidence from NHANES. Int Dent J 2024; 74(5): 978–986.

53. Liang, Liu, et al. Contribution of individual and cumulative social determinants of health underlying gender disparities in periodontitis in a representative US population: a cross-sectional NHANES study. J Clin Periodontol 2024; 51(5): 558–570.

54. Wang, Xiao, Zhang. A systematic comparison of machine learning algorithms to develop and validate prediction model to predict heart failure risk in middle-aged and elderly patients with periodontitis (NHANES 2009 to 2014). Medicine (United States) 2023; 102(34): e34878.

55. Tao, Feng, et al. Exploring the association between heavy metal exposure and periodontitis using interpretable machine learning models: NHANES 2009–2014. Hum Ecol Risk Assess 2025; 31: 1084–1099.

56. Wang, Tian, Bai, et al. Associations between thyroid function and periodontitis: a machine learning approach using NHANES. Int Dent J 2025; 75(5): 100921.

57. Liu, Yuan, et al. Handling missing values in healthcare data: a systematic review of deep learning-based imputation techniques. Artif Intell Med 2023; 142: 102587. https://www.sciencedirect.com/science/article/pii/S093336572300101X

58. Lakshmi, Dheeba. Digital decision making in dentistry: analysis and prediction of periodontitis using machine learning approach. INJGC 2022; 13(3). doi:10.47164/ijngc.v13i3.614.

59. Freeland-Graves, Babaei, Sachdev. Food insecurity and periodontal disease in low-income women (P04–043–19). Curr Dev Nutr 2019; 3: nzz051.P04–043–19. Nutrition 2019 Abstracts.

60. Chatzopoulos, Jiang, Marka, et al. Association between periodontitis extent, severity, and progression rate with systemic diseases and smoking: a retrospective study. J Personalized Med 2023; 13(5): 814. https://www.mdpi.com/2075-4426/13/5/814

61. Dye, Hirsch, Brody. The relationship between blood lead levels and periodontal bone loss in the United States, 1988–1994. Environ Health Perspect 2002; 110(10): 997–1002.

62. Tort, Choi, Kim, et al. Lead exposure may affect gingival health in children. BMC Oral Health 2018; 18(1): 79.

63. Huang, Yao, Yang, et al. Association between levels of blood trace minerals and periodontitis among United States adults. Front Nutr 2022; 9: 999836.

64. Lundberg, Lee. A unified approach to interpreting model predictions. In: NIPS, 2017, pp. 4766–4775.

65. Goldstein, Kapelner, Bleich, et al. Peeking inside the black box: visualizing statistical learning with plots of individual conditional expectation. J Comput Graph Stat 2015; 24(1): 44–65.

66. Yallabandi, Mayya, Jeganathan, et al. ICU patients’ pattern recognition and correlation identification of vital parameters using optimized machine learning models. Int J Electr Comput Eng Syst 2023; 14(9): 1003–1013.

67. US Census Bureau. An aging nation: projected number of children and older adults, 2020.

68. Ramalingam, Yadalam, Ramani, et al. Light gradient boosting-based prediction of quality of life among oral cancer-treated patients. BMC Oral Health 2024; 24(1): 349.

69. Patel, Tellez, et al. Developing and testing a prediction model for periodontal disease using machine learning and big electronic dental record data. Front Artif Intell 2022; 5: 979525.

70. Mayya, King, et al. Empirical study of feature selection methods in regression for large-scale healthcare data: a case study on estimating dental expenditures. IEEE Access 2024; 12: 153564–153579.
