Abstract
Background:
Hepatitis remains a major global health concern, leading to significant morbidity and mortality worldwide. Early identification of individuals at high risk is crucial for prevention and management.
Objective:
This study aimed to identify clinical and lifestyle variables for the early detection of individuals at risk of hepatitis by integrating machine learning with a cross-sectional study design.
Methods:
We analyzed 27,387 participants from the 2023 National Health Interview Survey, randomly divided into training (n = 16,431) and validation (n = 10,956) cohorts. Least absolute shrinkage and selection operator regression was applied to identify candidate predictors, followed by univariate and multivariable logistic regression to determine independent predictors. A nomogram was developed and evaluated using receiver operating characteristic curves, calibration plots, and decision curve analysis. In addition, positive predictive value, negative predictive value, and precision–recall analysis were used to evaluate model efficacy and accuracy.
Results:
Five independent predictors associated with hepatitis prevalence were identified: age, sex, hypertension, smoking status, and economic status.
Conclusions:
This cross-sectional, machine learning-based predictive modeling study identified key demographic and lifestyle factors associated with hepatitis risk and developed a clinically applicable risk prediction tool. The novelty of this study lies in illustrating the association between hepatitis risk and various epidemiologic patterns, including demographic, lifestyle, and health-related factors, which may facilitate precise early detection of individuals at risk of hepatitis.
Introduction
Hepatitis remains a major global public health concern, accounting for more than 1 million deaths annually and contributing substantially to the global burden of cirrhosis and hepatocellular carcinoma.1–3 Its development and progression are influenced not only by non-modifiable factors such as age, sex, and genetic susceptibility but also by modifiable risk factors including smoking, alcohol consumption, obesity, metabolic comorbidities, and socioeconomic disparities.4–8 Therefore, early identification of individuals at high risk of hepatitis and timely preventive strategies are critical to reducing disease burden and improving outcomes.
Traditional prevention and screening strategies, such as universal hepatitis B virus (HBV) vaccination and hepatitis C virus (HCV) antibody testing, have been widely implemented and have significantly reduced disease incidence. 6 However, these approaches may have limited applicability in heterogeneous populations, and barriers such as low awareness, restricted healthcare access, and socioeconomic inequalities continue to hinder effective prevention.9,10 In recent years, prediction models using large-scale population data have emerged, and nomograms have gained popularity in liver disease research owing to their intuitive visualization and individualized prediction capacity.11–13 Nonetheless, existing studies often focus on a limited set of clinical factors rather than on demographic heterogeneity and lifestyle factors, and many lack sufficient external validation.14–17 To enhance model reliability, researchers have increasingly combined least absolute shrinkage and selection operator (LASSO) regression with multivariable logistic regression for robust predictor identification.18,19 Moreover, adherence to reporting guidelines such as Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis improves transparency, reproducibility, and clinical translation. 20 For example, an independent investigation found that 13 lifestyle-related predictive factors can signal hepatitis C seroconversion in China. 21
In this study, we used the large-scale National Health Interview Survey (NHIS) dataset to construct a hepatitis risk prediction nomogram based on demographic, behavioral, and clinical variables selected by LASSO and logistic regression, thereby characterizing hepatitis risk at the demographic and lifestyle levels. We then evaluated the efficacy and accuracy of the model through multiple analytical pipelines; the included variables demonstrated favorable discrimination and calibration, providing a practical tool for individualized risk assessment and early intervention in hepatitis prevention. The study design and analytical workflow are shown in Figure 1.

Flowchart of study population configuration.
Materials and methods
Inclusion and exclusion criteria for participants
Data were obtained from the NHIS conducted by the Centers for Disease Control and Prevention (CDC). The NHIS study protocol was approved by the National Center for Health Statistics Disclosure Review Board, and written informed consent was obtained from all participants. First, we included all participants from the 2023 survey year and then applied the following exclusion criteria: (1) participants with missing covariate information and (2) participants with unclear hepatitis diagnosis. A total of 27,387 eligible participants were finally included.
Definition of study variable
Hepatitis was defined using the NHIS variable ACN.201_02.000, based on the question: “Have you EVER been told by a doctor or other health professional that you had Hepatitis?” Participants answering “Yes” were classified as the hepatitis group (n = 568), while those answering “No” were classified as the non-hepatitis group (n = 26,819).
Risk factors
According to the predefined inclusion and exclusion criteria, a total of 27,387 eligible participants were included in the analysis. To evaluate baseline differences between participants with and without hepatitis, participants were divided into hepatitis and non-hepatitis groups. Twelve covariates were assessed as baseline characteristics. The tableone R package (version 0.13.2; Kazuki Yoshida, available from the Comprehensive R Archive Network (CRAN), https://CRAN.R-project.org/package=tableone) was used to generate baseline tables, and Student’s t-test or chi-square test was applied as appropriate. Covariates were categorized into four domains: (1) sociodemographic characteristics (age, sex, race, education, marital status, economic status, and region), (2) health status (BMI, cholesterol, hypertension, and diabetes), (3) lifestyle behaviors (smoking status), and (4) healthcare-related factors. Detailed variable definitions are provided in Table 1.
Variables and definitions.
Outcome
This study aimed to estimate the probability of prevalent hepatitis among adults who completed NHIS interviews/examinations in 2023. Hepatitis status was operationalized as a binary endpoint at baseline, as specified in the definition of the study variable above. The interview and examination date served as the index date; no longitudinal follow-up or claims linkage was available, so time-to-event incidence and survival metrics were not computed. In the main analysis, the outcome variable was the presence (1) or absence (0) of hepatitis at baseline.
Nature of the study
This study is a cross-sectional, machine learning-based predictive modeling study that aims to identify key demographic and lifestyle factors associated with hepatitis risk and develop a clinically applicable risk prediction tool.
Statistical analysis
Baseline characteristics of the hepatitis study population
Before model construction, the tableone package was used to generate descriptive analyses of baseline characteristics, including demographic factors, lifestyle behaviors, and comorbidities, as well as to compare differences between groups. 22 Continuous variables were summarized as mean ± standard deviation (SD) for normally distributed data and as median with interquartile range for skewed data, while categorical variables were expressed as counts with percentages. Group differences were assessed according to baseline hepatitis status (yes vs no), both in the overall sample and across prespecified age strata, using Student’s t-test for normally distributed continuous variables, Wilcoxon rank-sum test for non-normal variables, and chi-square test for categorical variables. A two-sided p < 0.05 was considered statistically significant. This baseline analysis provided a comprehensive overview of the study population and served as a reference for subsequent machine learning and regression modeling.
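As an illustration of the categorical comparison described above, the following minimal Python sketch computes a Pearson chi-square test for a 2 × 2 table. It is not the tableone implementation used in this study; the function name chi_square_2x2 and the counts are hypothetical, and no continuity correction is applied (an assumption that may differ from the software defaults).

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    # Expected counts under independence: row total * column total / n
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 1 degree of freedom, the upper-tail p-value equals erfc(sqrt(stat / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts, NOT taken from the NHIS data:
# rows = hepatitis yes/no, columns = exposure yes/no
stat, p_value = chi_square_2x2(30, 70, 20, 80)
print(f"chi-square = {stat:.3f}, p = {p_value:.3f}")
```

The closed-form p-value applies only to tables with one degree of freedom; variables with more than two categories would require the general chi-square distribution.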
Machine learning for feature selection and model development in hepatitis risk prediction
In constructing the hepatitis risk prediction model, the glmnet package of R (version 4.1-8; Jerome Friedman, Trevor Hastie, Robert Tibshirani, Noah Simon, and Junyang Qian, available from CRAN at https://CRAN.R-project.org/package=glmnet) was applied to perform LASSO regression with 10-fold cross-validation. 23 The optimal penalty parameter (λ) was selected based on the minimum cross-validation error, and variables with non-zero coefficients at this λ were retained as candidate predictors. This machine learning approach effectively reduced dimensionality, avoided multicollinearity, and allowed identification of the most stable and representative predictors from the training cohort (60% of the overall sample; 6:4 split). LASSO provided automated feature selection and reduced subjective variable inclusion. These candidate variables were subsequently carried forward for logistic regression modeling.
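To make the selection mechanism concrete, the sketch below fits an L1-penalized logistic regression by proximal gradient descent in plain Python. It is a simplified stand-in for glmnet, not the algorithm glmnet uses, and it does not reproduce the 10-fold cross-validation; the data are synthetic, and the function names, learning rate, and λ values are assumptions chosen for illustration only.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def soft_threshold(value, threshold):
    # Proximal operator of the L1 penalty: shrink toward zero, clip at zero
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_logistic(X, y, lam, lr=0.1, n_iter=500):
    """L1-penalized logistic regression fitted by proximal gradient descent."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        # Residuals (predicted probability minus outcome) at the current fit
        resid = [sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) - yi
                 for row, yi in zip(X, y)]
        b -= lr * sum(resid) / n  # the intercept is not penalized
        for j in range(p):
            grad_j = sum(r * row[j] for r, row in zip(resid, X)) / n
            w[j] = soft_threshold(w[j] - lr * grad_j, lr * lam)
    return w, b

# Synthetic data (hypothetical): only feature 0 carries signal
rng = random.Random(0)
X, y = [], []
for _ in range(200):
    row = [rng.gauss(0, 1) for _ in range(3)]
    X.append(row)
    y.append(1 if row[0] + 0.5 * rng.gauss(0, 1) > 0 else 0)

w_small, _ = lasso_logistic(X, y, lam=0.01)   # weak penalty
w_large, _ = lasso_logistic(X, y, lam=10.0)   # strong penalty
selected = [j for j, wj in enumerate(w_small) if wj != 0.0]
print("retained feature indices at the weak penalty:", selected)
print("coefficients at the strong penalty:", w_large)
```

Variables whose coefficients remain non-zero at the chosen λ are the ones carried forward, which mirrors the retention rule described above; a sufficiently large penalty removes every predictor.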
Logistic regression for identification of independent predictors of hepatitis
Candidate variables selected by LASSO regression were first examined through univariate logistic regression to assess their association with hepatitis using the rms package of R (version 6.7-1; Frank E. Harrell Jr., available from CRAN at https://CRAN.R-project.org/package=rms). Variables with p < 0.05 were then entered into a multivariable logistic regression model to determine independent predictors. Adjusted odds ratios (ORs) with 95% confidence intervals (CIs) were calculated and reported. This step enabled confirmation of robust predictors and quantified their individual contributions to hepatitis risk.
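The relationship between a logistic regression coefficient and its reported OR can be sketched as follows. This is a generic Wald-type calculation, assuming a coefficient and standard error are available from the fitted model; the numeric inputs are hypothetical and are not estimates from this study.

```python
import math
from statistics import NormalDist

def odds_ratio_ci(beta, se, level=0.95):
    """Convert a log-odds coefficient and its standard error into an OR with a Wald CI."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# Hypothetical coefficient and standard error
or_, ci_low, ci_high = odds_ratio_ci(beta=0.693, se=0.20)
print(f"OR = {or_:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

An interval that excludes 1 corresponds to p < 0.05 for the Wald test of that coefficient.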
Nomogram construction and validation for hepatitis risk prediction
Independent predictors identified from the multivariable logistic regression model were incorporated into a nomogram to generate individualized risk scores. Each predictor contributed a point value proportional to its regression coefficient, and the sum of these points was mapped to the predicted probability of hepatitis. The nomogram was constructed using the rms package of R. 24 Model calibration was assessed using calibration plots with bootstrap resampling, generated by the calibrate function of the rms package, to evaluate the agreement between predicted probabilities and observed outcomes. 25 Clinical utility was examined through decision curve analysis (DCA), implemented with the rmda package of R (version 1.6; Marshall Brown, available from CRAN at https://CRAN.R-project.org/package=rmda), which quantified the net clinical benefit of using the nomogram across a range of threshold probabilities. 26 Discrimination of the model was evaluated using receiver operating characteristic (ROC) curves and the corresponding area under the curve (AUC) values, generated with the pROC package of R (version 1.18.5; Xavier Robin, Natacha Turck, Alexandre Hainard, Natalia Tiberti, Frédérique Lisacek, Jean-Charles Sanchez, and Markus Müller, available from CRAN at https://CRAN.R-project.org/package=pROC). 27 Model performance was assessed in both the training and validation cohorts to evaluate robustness and generalizability. We also used the pROC R package to draw the precision-recall (PR) curve to evaluate model performance. 27 This curve is particularly informative when classes are imbalanced. In addition, we used the caret package of R to generate the confusion matrix to further assess model performance. 24 This matrix provides a detailed breakdown of the results and enables the calculation of key indicators such as accuracy, precision, recall, and F1 score.
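The indicators derived from a confusion matrix can be written out explicitly. The sketch below is a plain Python illustration rather than the caret implementation; the function name and the cell counts are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision (PPV), recall (sensitivity), and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts, NOT taken from this study
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=130)
print(metrics)
```

With strongly imbalanced classes, accuracy can be high even when few cases are detected, which is why precision, recall, and F1 are reported alongside it.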
Results
Demographic and clinical characteristics of the study population
A total of 12 baseline demographic, lifestyle, and comorbidity variables were compared between participants with and without hepatitis. Eight characteristics differed significantly between the two groups (p < 0.05): age distribution, sex ratio, education level, economic status, smoking behavior, hypercholesterolemia, diabetes, and hypertension, suggesting that hepatitis prevalence is associated with both demographic and health-related factors. In contrast, race, marital status, region, and BMI did not differ significantly between the hepatitis and non-hepatitis groups (p ⩾ 0.05), suggesting that these variables may not be key determinants of hepatitis risk in this dataset. Detailed results are presented in Table 2, which provides an overview of baseline characteristics stratified by hepatitis status.
Baseline characteristics of participants.
No HEPEV: without hepatitis; HEPEV: with hepatitis; AGE_P: age; SEX: sex; EDUCP: education background; MARITAL: marital status; RATCAT: financial situation; SMKCIGST: smoking; REGION: region; CHLEV: high cholesterol; BMICAT: BMI; PREDIB: diabetes; HYPEV: high blood pressure.
Feature selection and development of a risk model for hepatitis
A total of 27,387 participants were randomly divided into a training cohort (n = 16,431) and a validation cohort (n = 10,956) at a 6:4 ratio for subsequent analyses. In the training cohort, a LASSO regression model was constructed from the predefined covariates to identify potential risk factors associated with hepatitis. Ten-fold cross-validation was performed to determine the optimal penalty parameter (λ = 0.00005). A total of 14 variables describing different characteristics of participants were entered into the LASSO model, and eight candidate predictors were retained at this λ value (Figure 2(a) and (b)). The LASSO model was used primarily as a standardized variable selection framework rather than for aggressive coefficient shrinkage. Given the modest number of predictors and limited multicollinearity, variables selected by LASSO were carried forward into conventional multivariable logistic regression, yielding results comparable to standard regression approaches. The LASSO results indicated that all eight predictors were associated with hepatitis, and they were therefore included for further analysis.
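The random partition can be sketched as a seeded shuffle of participant indices. This is only an illustration in Python: the seed is hypothetical, and the procedure actually used (for example, whether the split was stratified by hepatitis status) is not specified here, so the sketch uses the reported cohort sizes directly.

```python
import random

def split_indices(n_total, n_train, seed=2023):
    """Shuffle participant indices with a fixed seed and cut into two disjoint cohorts."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # seed value is an arbitrary assumption
    return indices[:n_train], indices[n_train:]

# Cohort sizes as reported: 16,431 training and 10,956 validation participants
train_idx, valid_idx = split_indices(n_total=27387, n_train=16431)
print(len(train_idx), len(valid_idx), round(len(train_idx) / 27387, 3))
```

Fixing the seed makes the partition reproducible, so that the same participants fall into the training and validation cohorts on every run.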

LASSO algorithm plots. (a) Illustration of the coefficient path plot of the LASSO regression. The x-axis represents λ, while the y-axis represents the absolute or standardized values of the coefficients. (b) Illustration of the changes in feature coefficients under different values of the regularization parameter (λ).
To further identify risk factors associated with hepatitis, univariate logistic regression analysis was applied to all samples in the training cohort using the candidate features derived from the LASSO regression. ORs with 95% confidence intervals (95% CIs) were calculated, and a forest plot was generated to display the eight candidate features (Figure 3, p < 0.05). The results showed that never smoking (SMKCIGST_A4) was associated with a markedly lower likelihood of hepatitis, whereas increasing age was strongly associated with a higher likelihood of hepatitis.

Forest plot of univariate logistic regression. The x-axis represents the odds ratio (OR). When OR = 1, the predictor has no effect on the probability of the event (neutral effect). When OR > 1, the predictor is positively associated with the event, indicating that its presence or increase raises the likelihood of the outcome. When OR < 1, the predictor is negatively associated with the event, indicating that its presence or increase reduces the likelihood of the outcome. The symbols “*”, “**”, “***”, and “****” denote statistical significance levels, corresponding to P < 0.05, P < 0.01, P < 0.001, and P < 0.0001, respectively.
Multivariable logistic regression analysis was then applied to the candidate features identified from the univariate logistic regression analysis, and ORs with 95% CIs were calculated. Five predictors were significantly associated with hepatitis: AGE_P (age), SEX (sex), HYPEV_A (hypertension), SMKSTAT2 (smoking status), and RATCAT (economic status; Figure 4 and Table 3; p < 0.05).

Forest plot of multivariable logistic regression. The x-axis represents the odds ratio (OR). When OR = 1, the predictor has no effect on the probability of the event (neutral effect). When OR > 1, the predictor is positively associated with the event, indicating that its presence or increase raises the likelihood of the outcome. When OR < 1, the predictor is negatively associated with the event, indicating that its presence or increase reduces the likelihood of the outcome. The symbols “*”, “**”, “***”, and “****” denote statistical significance levels, corresponding to P < 0.05, P < 0.01, P < 0.001, and P < 0.0001, respectively.
Variable importance analysis.
AGEP-A3: age between 46 and 65; AGEP-A4: age >65; HYPEV-A2: participants without high blood pressure; SEX-A2: female participants; SMKCIGST-A3: former smokers; SMKCIGST-A4: never smokers; RATCAT-A3: participants classified as having a decent economic income.
Development and validation of nomogram-based risk scores for hepatitis prediction
To further evaluate the overall predictive ability of the identified factors, a nomogram was constructed in the training cohort, based on predictors selected through LASSO regression and univariate/multivariable logistic regression (Figure 5). In the nomogram, each predictor is assigned a point value, and the sum of the individual scores yields the total points. The total points correspond to the estimated probability of developing hepatitis, with higher scores indicating a greater likelihood of hepatitis occurrence.

Nomogram in the training cohort. Nomogram lists the predictors included in the model, while the scales on the right represent the range of values for each predictor. The length of each line indicates the contribution of the predictor to disease risk. The β(X − m) terms represent the point values assigned to different levels of each predictor; total score indicates the sum of all individual scores; and Pr denotes the predicted probability of hepatitis.
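The point-assignment logic of a nomogram can be sketched in a few lines of Python. The coefficients, level labels, and intercept below are hypothetical and are not the fitted values of this model; the scaling convention (the predictor with the widest effect range spans 0 to 100 points) is a common choice and is assumed here rather than taken from the rms output.

```python
import math

# Hypothetical log-odds coefficients per level (reference level = 0.0)
coefs = {
    "age": {"18-45": 0.0, "46-65": 0.8, ">65": 1.2},
    "sex": {"male": 0.0, "female": -0.3},
    "smoking": {"current": 0.0, "former": -0.2, "never": -0.6},
}
intercept = -4.0  # hypothetical

def build_points(coefs):
    """Rescale coefficients so the widest-ranging predictor spans 0 to 100 points."""
    ranges = {k: max(v.values()) - min(v.values()) for k, v in coefs.items()}
    scale = 100.0 / max(ranges.values())  # points per unit of log-odds
    points = {k: {lvl: (b - min(v.values())) * scale for lvl, b in v.items()}
              for k, v in coefs.items()}
    return points, scale

def predicted_probability(total_points, coefs, intercept, scale):
    """Map total points back to the linear predictor, then to a probability."""
    baseline = intercept + sum(min(v.values()) for v in coefs.values())
    linear_predictor = baseline + total_points / scale
    return 1.0 / (1.0 + math.exp(-linear_predictor))

points, scale = build_points(coefs)
profile = {"age": ">65", "sex": "male", "smoking": "current"}
total = sum(points[k][lvl] for k, lvl in profile.items())
prob = predicted_probability(total, coefs, intercept, scale)
print(f"total points = {total:.1f}, predicted probability = {prob:.3f}")
```

Because the points are a linear rescaling of the log-odds, reading the probability from the total score gives the same result as evaluating the logistic model directly.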
Evaluation of calibration and predictive accuracy for hepatitis risk
To evaluate the predictive performance of the nomogram, multiple validation metrics were applied in both the training and validation cohorts. Calibration curves were generated to assess prediction accuracy (Figure 6(a) and (b)). The calibration plots showed that the predicted probabilities were in close agreement with the ideal diagonal line, indicating that the nomogram demonstrated good calibration and predictive accuracy in both cohorts.

Calibration curves of the nomogram in the training cohort and validation cohort. (a) Calibration in training cohort. (b) Calibration in validation cohort.
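The idea behind a calibration curve can be illustrated with a grouped comparison of predicted and observed risk. The Python sketch below is a simplified, binned version and does not implement the bootstrap-based procedure used for the figures; the function name, bin count, and data are hypothetical.

```python
def calibration_table(probs, outcomes, n_bins=10):
    """Group participants by predicted risk; compare mean prediction with observed rate."""
    pairs = sorted(zip(probs, outcomes))
    n = len(pairs)
    table = []
    for k in range(n_bins):
        chunk = pairs[k * n // n_bins:(k + 1) * n // n_bins]
        if not chunk:
            continue
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        observed = sum(o for _, o in chunk) / len(chunk)
        table.append((mean_pred, observed, len(chunk)))
    return table

# Hypothetical, perfectly calibrated toy data: 10% and 90% predicted risk
probs = [0.1] * 10 + [0.9] * 10
outcomes = [1] + [0] * 9 + [1] * 9 + [0]
table = calibration_table(probs, outcomes, n_bins=2)
for mean_pred, observed, size in table:
    print(f"predicted {mean_pred:.2f}  observed {observed:.2f}  n = {size}")
```

Points lying on the diagonal (mean predicted risk equal to the observed rate) indicate good calibration; systematic departures indicate over- or under-prediction.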
DCA for assessing the clinical value of the hepatitis nomogram
DCA was performed to assess the clinical value of the hepatitis nomogram. The results demonstrated that the net benefit of the nomogram was consistently higher than that of the “all” and “none” strategies in both the training and validation cohorts, indicating favorable clinical utility (Figure 7(a) and (b)).

Decision curve analysis (DCA) of the nomogram in the training cohort (left) and validation cohort (right). (a) DCA in training cohort. (b) DCA in validation cohort.
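The net benefit plotted in a decision curve follows a standard formula: true positives per participant minus false positives per participant weighted by the odds of the threshold probability. The Python sketch below illustrates this with hypothetical predictions; it is not the rmda implementation, and the function names are assumptions.

```python
def net_benefit(probs, outcomes, threshold):
    """Net benefit of intervening on everyone with predicted risk >= threshold."""
    n = len(outcomes)
    tp = sum(1 for p, o in zip(probs, outcomes) if p >= threshold and o == 1)
    fp = sum(1 for p, o in zip(probs, outcomes) if p >= threshold and o == 0)
    return tp / n - fp / n * threshold / (1 - threshold)

def net_benefit_treat_all(outcomes, threshold):
    """Net benefit of the 'all' strategy; the 'none' strategy is always 0."""
    prevalence = sum(outcomes) / len(outcomes)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

# Hypothetical predictions and outcomes, NOT taken from this study
probs = [0.9, 0.8, 0.3, 0.2, 0.1, 0.05, 0.05, 0.05, 0.05, 0.05]
outcomes = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
nb_model = net_benefit(probs, outcomes, threshold=0.25)
nb_all = net_benefit_treat_all(outcomes, threshold=0.25)
print(f"model: {nb_model:.3f}, treat all: {nb_all:.3f}, treat none: 0.000")
```

Repeating the calculation over a grid of thresholds yields the decision curve; a model is useful at a given threshold when its net benefit exceeds both reference strategies.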
ROC curve analysis of the nomogram for hepatitis risk prediction
ROC curves were generated to describe the discriminative performance of the model. The nomogram exhibited acceptable predictive performance in both the training cohort (AUC = 0.73) and the validation cohort (AUC = 0.70; Figure 8(a) and (b)).

ROC curves of the nomogram in the training cohort and validation cohort. (a) ROC analysis in training cohort. (b) ROC analysis in validation cohort.
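The AUC has a direct interpretation as the probability that a randomly chosen participant with hepatitis receives a higher predicted risk than a randomly chosen participant without it. The Python sketch below computes it by pairwise comparison; the scores are hypothetical, and the quadratic loop is meant for illustration rather than for a sample of this size.

```python
def auc_score(scores, labels):
    """AUC as the fraction of case/non-case pairs ranked correctly (ties count as 0.5)."""
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical predicted risks and outcomes
scores = [0.9, 0.4, 0.6, 0.3, 0.2]
labels = [1, 1, 0, 0, 0]
auc = auc_score(scores, labels)
print(f"AUC = {auc:.3f}")
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two groups.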
PR curve and confusion matrix analysis of the nomogram for hepatitis risk prediction
In the training set, the model demonstrated an area under the precision-recall curve (AUPRC) of 0.796 (Figure 9(a)). At the threshold determined by the Youden index, the sensitivity was 0.42 and the specificity was 0.97 (Figure 9(c)). The positive predictive value (PPV) was 0.25, whereas the negative predictive value (NPV) was 0.99, reflecting a high capacity to correctly classify low-risk subjects (Figure 9(c)). In the validation set, a similar performance profile was observed, with an AUPRC of 0.798, sensitivity of 0.43, and specificity of 0.98 (Figure 9(b) and (d)). The PPV was 0.28, and the NPV remained elevated at 0.99 (Figure 9(b) and (d)). The close alignment of performance indicators between the training and validation cohorts indicates stable model behavior and limited overfitting. Notably, the consistently high AUPRC values underscore the model’s robust discriminative ability for the minority class, even in the presence of substantial class imbalance. These results suggest that the model identifies meaningful and reproducible predictive signals that go beyond a simple prevalence-based baseline. Although the low disease prevalence inherently constrains the attainable PPV, the model exhibits considerable promise for risk stratification and initial screening applications. Subsequent research employing larger sample sizes and learning strategies specifically designed to address class imbalance could further improve predictive accuracy and clinical utility.

PR and confusion matrix in the training and validation cohorts. (a) PR analysis in training cohort. (b) PR analysis in validation cohort. (c) Confusion matrix in training cohort. (d) Confusion matrix in validation cohort.
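The dependence of PPV and NPV on prevalence can be made explicit with Bayes' rule. The Python sketch below uses the rounded training-cohort sensitivity and specificity reported above together with the whole-sample prevalence (568 of 27,387); using the whole-sample prevalence for the training cohort is an assumption, and because the inputs are rounded, only approximate agreement with the reported PPV and NPV should be expected.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """PPV and NPV from sensitivity, specificity, and prevalence via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

prevalence = 568 / 27387  # whole-sample prevalence, about 2.1%
ppv, npv = predictive_values(sensitivity=0.42, specificity=0.97,
                             prevalence=prevalence)
print(f"prevalence = {prevalence:.4f}, PPV = {ppv:.2f}, NPV = {npv:.2f}")
```

With these inputs the PPV is roughly 0.23 and the NPV roughly 0.99, which shows how a prevalence near 2% caps the attainable PPV even when specificity is high.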
Discussion
In this analysis of participants from the nationally representative NHIS, we developed and validated a nomogram to estimate the probability of prevalent hepatitis based on demographic, lifestyle, comorbidity, and socioeconomic variables. Five independent predictors, namely age, sex, hypertension, smoking status, and economic status, were consistently associated with hepatitis status and jointly enabled a practical risk scoring system with acceptable discrimination and calibration in both the training and validation cohorts (AUC 0.73 and 0.70, respectively). These findings are consistent with prior reports highlighting the role of demographic and social determinants in hepatitis epidemiology.1–4 Global evidence has also shown that viral hepatitis remains a major cause of morbidity and mortality despite elimination efforts. Age and sex patterns, as well as socioeconomic disparities, are well documented.4,5,19,28,29 Smoking and hypertension, while not classical risk factors for transmission, have been repeatedly associated with poorer outcomes and higher prevalence among hepatitis patients.18,20,30,31 The observed association between hypertension and hepatitis may be explained by shared biological and behavioral pathways. Chronic systemic inflammation and endothelial dysfunction, common in hypertension, can increase susceptibility to hepatic injury, while hypertension also clusters with adverse health behaviors such as smoking and physical inactivity. 32 From a psychosocial perspective, hypertension may further reflect broader metabolic vulnerability and chronic stress exposure that influence immune regulation and healthcare utilization. 33 In addition, economic status may act as a proxy for several proximal determinants of hepatitis risk, including access to healthcare, health literacy, psychosocial stress, and environmental exposures. These socioeconomic factors can influence both exposure risk and timely diagnosis, providing a plausible explanation for the predictive value of economic status in our model. 34
Our results reinforce the importance of integrating such variables into risk stratification. Importantly, the nomogram provides a simple tool for individualized risk assessment, which could help prioritize screening and prevention, particularly in resource-limited settings where socioeconomic barriers are pronounced.5,21,29,35 By highlighting modifiable factors such as smoking, our model also supports combined prevention strategies that may improve both hepatic and extrahepatic outcomes. However, the model may be more reflective of the likelihood of being diagnosed than of the true likelihood of infection. It should therefore be regarded as a risk stratification tool that incorporates biological, behavioral, and healthcare access factors rather than as a purely causal or etiological model.
Several limitations of this research should be noted. First, this study is cross-sectional, based on self-reported physician diagnosis, and does not differentiate HBV from HCV or include laboratory confirmation. Reliance on self-reported data may introduce biases or inaccuracies, as individuals may not accurately recall their diagnoses, or the diagnosis may not have been confirmed by laboratory tests. Future studies should use laboratory-confirmed diagnoses of hepatitis to improve the accuracy and reliability of the findings. Second, NHIS employs a complex, multistage survey design, and ignoring survey weights may affect the accuracy of population-level inferences. The primary goal of this study was to develop a risk prediction model rather than to estimate population parameters; therefore, while we recognize the importance of survey weighting for broader population estimates, our analysis focused on identifying relevant risk factors and predicting individual risk. Third, the lack of a formal sample size calculation and justification for the sample drawn from the NHIS database may reduce the reliability of our results, and the generalizability of the model to individuals should be evaluated further. Fourth, we used a random split for training and validation, and this approach does not provide temporal, geographic, or external cohort validation. As such, the validation cohort is not independent in a meaningful epidemiological sense, and the potential for optimism bias remains. We recognize that internal validation via a random split alone is insufficient to establish the robustness of clinical prediction models. Future studies should incorporate independent validation cohorts to better assess the generalizability and reliability of the model.
Conclusion
In conclusion, we identified five predictors (age, sex, hypertension, smoking status, and economic status) that were independently associated with hepatitis prevalence. Our findings can guide early detection of hepatitis risk based on demographic and lifestyle factors.
Footnotes
Acknowledgements
The authors sincerely thank the colleagues at the No. 1 People’s Hospital for their insightful discussions and technical assistance throughout the preparation of this article.
Author contributions
YT contributed to hypothesis generation, study design, data collection, statistical analysis, and manuscript preparation. WC contributed to study design and supervised manuscript preparation.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability statement
All data generated or analyzed during this study are included in this article.
