Association of lifestyle and demographic variables with hepatitis risk: Evidence from machine learning and cross-sectional study

Abstract

Background:

Hepatitis remains a major global health concern, leading to significant morbidity and mortality worldwide. Early identification of individuals at high risk is crucial for prevention and management.

Objective:

This study aimed to identify clinical and lifestyle variables for the early detection of individuals at risk of hepatitis by integrating machine learning with a cross-sectional design.

Methods:

We analyzed 27,387 participants from the 2023 National Health Interview Survey, randomly divided into training (n = 16,431) and validation (n = 10,956) cohorts. Least absolute shrinkage and selection operator regression was applied to identify candidate predictors, followed by univariate and multivariable logistic regression to determine independent predictors. A nomogram was developed and evaluated using receiver operating characteristic curves, calibration plots, and decision curve analysis. In addition, positive predictive value, negative predictive value, and precision–recall analysis were used to evaluate model efficacy and accuracy.

Results:

Five independent predictors of hepatitis prevalence were identified: age, sex, hypertension, smoking status, and economic status.

Conclusions:

This cross-sectional, machine learning-based predictive modeling study identified key demographic and lifestyle factors associated with hepatitis risk and developed a clinically applicable risk prediction tool. Its novelty lies in relating hepatitis risk to a range of epidemiologic patterns, including demographic, lifestyle, and health-related factors, which may facilitate precise early detection of individuals at risk of hepatitis.

Keywords

hepatitis risk; National Health Interview Survey; LASSO; nomogram; early detection

Introduction

Hepatitis remains a major global public health concern, accounting for more than 1 million deaths annually and contributing substantially to the global burden of cirrhosis and hepatocellular carcinoma.^1–3 Its development and progression are influenced not only by non-modifiable factors such as age, sex, and genetic susceptibility but also by modifiable risk factors including smoking, alcohol consumption, obesity, metabolic comorbidities, and socioeconomic disparities.^4–8 Therefore, early identification of individuals at high risk of hepatitis and timely preventive strategies are critical to reducing disease burden and improving outcomes.

Traditional prevention and screening strategies, such as universal hepatitis B virus (HBV) vaccination and hepatitis C virus (HCV) antibody testing, have been widely implemented and have significantly reduced disease incidence.⁶ However, these approaches may have limited applicability in heterogeneous populations, and barriers such as low awareness, restricted healthcare access, and socioeconomic inequalities continue to hinder effective prevention.^9,10 In recent years, prediction models using large-scale population data have emerged, and nomograms have gained popularity in liver disease research owing to their intuitive visualization and individualized prediction capacity.^11–13 Nonetheless, existing studies often focus on a limited set of clinical factors, give little attention to demographic heterogeneity and lifestyle factors, and lack sufficient external validation.^14–17 To enhance model reliability, researchers have increasingly combined least absolute shrinkage and selection operator (LASSO) regression with multivariable logistic regression for robust predictor identification.^18,19 Moreover, adherence to reporting guidelines such as Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD) improves transparency, reproducibility, and clinical translation.²⁰ For example, an independent investigation found that 13 lifestyle-related predictive factors could signal hepatitis C seroconversion in China.²¹

In this study, we used the large-scale National Health Interview Survey (NHIS) dataset to construct a hepatitis risk prediction nomogram based on demographic, behavioral, and clinical variables selected by LASSO and logistic regression, thereby characterizing hepatitis risk at the demographic and lifestyle levels. We then examined the performance of the selected variables with several evaluation methods; the resulting model showed favorable discrimination and calibration and provides a practical tool for individualized risk assessment and early intervention in hepatitis prevention. The study design and analytical workflow are shown in Figure 1.

Figure 1.

Flowchart of study population configuration.

Materials and methods

Inclusion and exclusion criteria for participants

Data were obtained from the NHIS conducted by the Centers for Disease Control and Prevention (CDC). The NHIS study protocol was approved by the National Center for Health Statistics Disclosure Review Board, and written informed consent was obtained from all participants. We first included all participants surveyed in 2023 and then applied the following exclusion criteria: (1) participants with missing covariate information and (2) participants with an unclear hepatitis diagnosis. A total of 27,387 eligible participants were included.

Definition of study variable

Hepatitis was defined using the NHIS variable ACN.201_02.000, based on the question: “Have you EVER been told by a doctor or other health professional that you had Hepatitis?” Participants answering “Yes” were classified as the hepatitis group (n = 568), while those answering “No” were classified as the non-hepatitis group (n = 26,819).

Risk factors

According to the predefined inclusion and exclusion criteria, a total of 27,387 eligible participants were included in the analysis. To evaluate baseline differences between participants with and without hepatitis, participants were divided into hepatitis and non-hepatitis groups. Twelve covariates were assessed as baseline characteristics. The tableone R package (version 0.13.2; Kazuki Yoshida, available from the Comprehensive R Archive Network (CRAN), https://CRAN.R-project.org/package=tableone) was used to generate baseline tables, and Student’s t-test or the chi-square test was applied as appropriate. Covariates were categorized into four domains: (1) sociodemographic characteristics (age, sex, race, education, marital status, economic status, and region), (2) health status (BMI, cholesterol, hypertension, and diabetes), (3) lifestyle behaviors (smoking status), and (4) healthcare-related factors. Detailed variable definitions are provided in Table 1.

Table 1.

Variables and definitions.

Variable	Variable definition
Age (AGE_P)	18–30, 31–45, 46–65, >65
Sex (SEX)	Male, female
Race (HISDETP)	White, Black, Asian
Education background (EDUCP)	Below high school education, high school graduate, above high school education
Marital status (R_MARITL)	Married, not married, unknown
Financial situation (Income/poverty ratio; RATCAT)	Poor, relatively poor, not poor
Smoking (SMKCIGST)	Smoke every day, smoke occasionally, had smoked before, never smoked
Region (REGION)	Northeast, midwest, south, west
BMI (BMICAT)	Underweight, healthy weight, overweight, obesity
High cholesterol (CHLEV)	Yes, no
Diabetes (PREDIB)	Yes, no
High blood pressure (HYPEV)	Yes, no

Outcome

This study aimed to estimate the probability of prevalent hepatitis among adults who completed NHIS interviews/examinations in 2023. Hepatitis status was operationalized as a binary endpoint at baseline, as specified in the section “Definition of study variable.” The interview and examination date served as the index date; no longitudinal follow-up or claims linkage was available, so time-to-event incidence and survival metrics were not computed. In the main analysis, the outcome variable was the presence (1) or absence (0) of hepatitis at baseline.

Nature of the study

Statistical analysis

Baseline characteristics of the hepatitis study population

Before model construction, the tableone package was used to generate descriptive analyses of baseline characteristics, including demographic factors, lifestyle behaviors, and comorbidities, as well as to compare differences between groups.²² Continuous variables were summarized as mean ± standard deviation (SD) for normally distributed data and as median with interquartile range for skewed data, while categorical variables were expressed as counts with percentages. Group differences were assessed according to baseline hepatitis status (yes vs no), both in the overall sample and across prespecified age strata, using Student’s t-test for normally distributed continuous variables, Wilcoxon rank-sum test for non-normal variables, and chi-square test for categorical variables. A two-sided p < 0.05 was considered statistically significant. This baseline analysis provided a comprehensive overview of the study population and served as a reference for subsequent machine learning and regression modeling.
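As an illustration of the categorical comparison described above, the following minimal sketch computes a Pearson chi-square statistic for a 2 × 2 table in plain Python. The counts are the sex-by-hepatitis counts reported in Table 2. The function name `chi_square_2x2` is hypothetical, and no continuity correction is applied, so the statistic may differ slightly from the default output of R.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Sex counts from Table 2: rows = male, female; columns = no hepatitis, hepatitis.
stat, p_value = chi_square_2x2(12338, 310, 14481, 258)
print(f"chi-square = {stat:.2f}, p = {p_value:.2g}")
```

Under these assumptions the p-value falls below 0.001, in line with the value reported for SEX in Table 2.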

Machine learning for feature selection and model development in hepatitis risk prediction

In constructing the hepatitis risk prediction model, the glmnet package of R (version 4.1-8; Jerome Friedman, Trevor Hastie, Robert Tibshirani, Noah Simon, and Junyang Qian, available from CRAN at https://CRAN.R-project.org/package=glmnet) was applied to perform LASSO regression with 10-fold cross-validation.²³ The optimal penalty parameter (λ) was selected based on the minimum cross-validation error, and variables with non-zero coefficients at this λ were retained as candidate predictors. This machine learning approach reduces dimensionality, mitigates multicollinearity, and allows identification of the most stable and representative predictors from the training cohort (60% of the overall sample; 6:4 split). LASSO thus provided automated feature selection and limited subjective variable inclusion. These candidate variables were subsequently carried forward for logistic regression modeling.
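To make the selection mechanism concrete: in coordinate-descent solvers for the LASSO, each coefficient update passes through a soft-thresholding step, which is what sets weak coefficients exactly to zero. The sketch below shows only that operator in plain Python. It is not the glmnet implementation, and the predictor names, coefficient values, and penalty are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values with |z| <= lam become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# Hypothetical unpenalized coefficients for three illustrative predictors.
raw = {"age": 0.90, "smoking": -0.45, "region": 0.03}
lam = 0.05  # hypothetical penalty; larger values zero out more coefficients

shrunk = {name: soft_threshold(beta, lam) for name, beta in raw.items()}
selected = [name for name, beta in shrunk.items() if beta != 0.0]
print(shrunk, selected)
```

With a very small penalty, few or no coefficients are zeroed, so the retained set approaches the full set of inputs.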

Logistic regression for identification of independent predictors of hepatitis

Candidate variables selected by LASSO regression were first examined through univariate logistic regression, using the rms package of R (version 6.7-1; Frank E. Harrell Jr., available from CRAN at https://CRAN.R-project.org/package=rms), to assess their association with hepatitis. Variables with p < 0.05 were then entered into a multivariable logistic regression model to determine independent predictors. Adjusted odds ratios (ORs) with 95% confidence intervals (CIs) were calculated and reported. This step enabled confirmation of robust predictors and quantified their individual contributions to hepatitis risk.
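An OR is the exponential of a logistic regression coefficient, OR = exp(β). The sketch below also computes a Wald-type 95% CI as exp(β ± 1.96 × SE). The source does not state which interval type was used, so the Wald form is an assumption, and the coefficient and standard error are hypothetical.

```python
import math

def odds_ratio_with_ci(beta, se, z=1.96):
    """Return (OR, lower, upper) for a logistic coefficient and its standard error."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# Hypothetical coefficient and standard error, for illustration only.
or_, lower, upper = odds_ratio_with_ci(beta=-0.30, se=0.11)
print(f"OR = {or_:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

An interval that excludes 1 corresponds to p < 0.05 under the same Wald approximation.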

Nomogram construction and validation for hepatitis risk prediction

Independent predictors identified from the multivariable logistic regression model were incorporated into a nomogram to generate individualized risk scores. Each predictor contributed a point value proportional to its regression coefficient, and the sum of these points was mapped to the predicted probability of hepatitis. The nomogram was constructed using the rms package of R.²⁴ Model calibration was assessed using calibration plots with bootstrap resampling, generated with the calibrate function of the rms package, to evaluate the agreement between predicted probabilities and observed outcomes.²⁵ Clinical utility was examined through decision curve analysis (DCA), implemented with the rmda package of R (version 1.6; Marshall Brown, available from CRAN at https://CRAN.R-project.org/package=rmda), which quantified the net clinical benefit of using the nomogram across a range of threshold probabilities.²⁶ Discrimination of the model was evaluated using receiver operating characteristic (ROC) curves and the corresponding area under the curve (AUC) values, generated with the pROC package of R (version 1.18.5; Xavier Robin, Natacha Turck, Alexandre Hainard, Natalia Tiberti, Frédérique Lisacek, Jean-Charles Sanchez, and Markus Müller, available from CRAN at https://CRAN.R-project.org/package=pROC).²⁷ Model performance was assessed in both the training and validation cohorts to examine robustness and generalizability. We also used the pROC package to draw the precision-recall (PR) curve,²⁷ which is particularly informative under class imbalance. Finally, we used the caret package of R to produce the confusion matrix,²⁴ which provides a detailed breakdown of the classification results and enables the calculation of key indicators such as accuracy, precision, recall, and F1 score.
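The indicators derived from a confusion matrix can be written out directly. The following is a minimal sketch in plain Python, not the caret implementation. The function name and the counts are hypothetical, and the function does not guard against empty cells.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # also called positive predictive value
    recall = tp / (tp + fn)      # also called sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts, not taken from the study.
metrics = classification_metrics(tp=40, fp=120, fn=60, tn=9780)
print(metrics)
```

In this hypothetical example accuracy is high while precision is low, which is why accuracy alone can mislead when the outcome is rare.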

Results

Demographic and clinical characteristics of the study population

Twelve baseline demographic, lifestyle, and comorbidity variables were compared between participants with and without hepatitis. Eight characteristics differed significantly between the two groups (p < 0.05): age distribution, sex ratio, education level, economic status, smoking behavior, hypercholesterolemia, diabetes, and hypertension, suggesting that hepatitis status is associated with both demographic and health-related factors. In contrast, race, marital status, region, and BMI did not differ significantly between the hepatitis and non-hepatitis groups (p ≥ 0.05), suggesting that these variables may not be key determinants of hepatitis risk in this dataset. Detailed results are presented in Table 2, which summarizes baseline characteristics stratified by hepatitis status.

Table 2.

Baseline characteristics of participants.

Variable name	Variable category	No HEPEV	HEPEV	p-Value
Number		26,819	568
AGE_P (%)	18–30	1535 (5.7)	6 (1.1)	<0.001
	31–45	8096 (30.2)	74 (13.0)
	46–65	7787 (29.0)	157 (27.6)
	>65	9401 (35.1)	331 (58.3)
SEX (%)	Male	12,338 (46.0)	310 (54.6)	<0.001
	Female	14,481 (54.0)	258 (45.4)
HISDETP (%)	White	2177 (8.1)	47 (8.3)	0.217
	Black	1735 (6.5)	47 (8.3)
	Asian	22,907 (85.4)	474 (83.5)
EDUCP (%)	Below high school education	2206 (8.2)	86 (15.1)	<0.001
	High school graduate	6774 (25.3)	150 (26.4)
	Above high school education	17,839 (66.5)	332 (58.5)
MARITAL (%)	Married	12,336 (46.0)	240 (42.3)	0.052
	Not married	1798 (6.7)	31 (5.5)
	Unknown	12,685 (47.3)	297 (52.3)
RATCAT (%)	Poor	2668 (9.9)	99 (17.4)	<0.001
	Relatively poor	4837 (18.0)	140 (24.6)
	Not poor	19,314 (72.0)	329 (57.9)
SMKCIGST (%)	Smoke every day	2147 (8.0)	101 (17.8)	<0.001
	Smoke occasionally	724 (2.7)	25 (4.4)
	Had smoked before	6782 (25.3)	219 (38.6)
	Never smoked	17,166 (64.0)	223 (39.3)
REGION (%)	Northeast	4124 (15.4)	78 (13.7)	0.441
	Midwest	5933 (22.1)	117 (20.6)
	South	9957 (37.1)	217 (38.2)
	West	6805 (25.4)	156 (27.5)
CHLEV (%)	Yes	8746 (32.6)	262 (46.1)	<0.001
	No	18,073 (67.4)	306 (53.9)
BMICAT (%)	Underweight	414 (1.5)	13 (2.3)	0.406
	Healthy weight	8183 (30.5)	181 (31.9)
	Overweight	9246 (34.5)	185 (32.6)
	Obesity	8976 (33.5)	189 (33.3)
PREDIB (%)	Yes	4714 (17.6)	161 (28.3)	<0.001
	No	22,105 (82.4)	407 (71.7)
HYPEV (%)	Yes	10,025 (37.4)	325 (57.2)	<0.001
	No	16,794 (62.6)	243 (42.8)

No HEPEV: without hepatitis; HEPEV: with hepatitis; AGE_P: age; SEX: sex; HISDETP: race; EDUCP: education background; MARITAL: marital status; RATCAT: financial situation; SMKCIGST: smoking; REGION: region; CHLEV: high cholesterol; BMICAT: BMI; PREDIB: diabetes; HYPEV: high blood pressure.
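The counts in Table 2 also allow crude (unadjusted) odds ratios to be checked by hand. The sketch below does this for HYPEV in plain Python, using a Woolf (log-based) 95% CI. The function name is hypothetical, and the interval method is an illustrative choice rather than one stated in the source.

```python
import math

def crude_odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Unadjusted odds ratio with a Woolf (log-based) 95% confidence interval."""
    odds_ratio = (exposed_cases * unexposed_noncases) / (unexposed_cases * exposed_noncases)
    se_log = math.sqrt(1 / exposed_cases + 1 / exposed_noncases
                       + 1 / unexposed_cases + 1 / unexposed_noncases)
    lower = math.exp(math.log(odds_ratio) - 1.96 * se_log)
    upper = math.exp(math.log(odds_ratio) + 1.96 * se_log)
    return odds_ratio, lower, upper

# HYPEV counts from Table 2: 325 (yes) and 243 (no) with hepatitis,
# 10,025 (yes) and 16,794 (no) without hepatitis.
odds_ratio, lower, upper = crude_odds_ratio(325, 10025, 243, 16794)
print(f"crude OR = {odds_ratio:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

This crude estimate is unadjusted and is therefore not directly comparable with the adjusted ORs from the multivariable model.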

Feature selection and development of a risk model for hepatitis

A total of 27,387 participants were randomly divided into a training cohort (n = 16,431) and a validation cohort (n = 10,956) at a 6:4 ratio for subsequent analyses. In the training cohort, a LASSO regression model was constructed from the predefined covariates to identify potential risk factors associated with hepatitis. Ten-fold cross-validation was performed to determine the optimal penalty parameter (λ = 0.00005). Fourteen variables describing participant characteristics were entered into the LASSO model, and eight candidate predictors were retained at this λ value (Figure 2(a) and (b)). The LASSO model was used primarily as a standardized variable selection framework rather than for aggressive coefficient shrinkage. Given the modest number of predictors and limited multicollinearity, variables selected by LASSO were carried forward into conventional multivariable logistic regression, yielding results comparable to standard regression approaches. The LASSO results indicated that all eight predictors were important features associated with hepatitis, and they were therefore included for further analysis.

Figure 2.

LASSO algorithm plots. (a) Illustration of the coefficient path plot of the LASSO regression. The x-axis represents λ, while the y-axis represents the absolute or standardized values of the coefficients. (b) Illustration of the changes in feature coefficients under different values of the regularization parameter (λ).
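The 6:4 random split can be sketched as follows in plain Python. The source does not say whether the split was stratified or which seed was used. The sketch assumes stratification by hepatitis status with per-class rounding down, which happens to reproduce the reported cohort sizes; the function name and seed are hypothetical.

```python
import random

def stratified_split(labels, train_tenths=6, seed=2023):
    """Split row indices within each outcome class, rounding each class down."""
    rng = random.Random(seed)
    train, validation = [], []
    for value in sorted(set(labels)):
        idx = [i for i, label in enumerate(labels) if label == value]
        rng.shuffle(idx)
        n_train = len(idx) * train_tenths // 10  # integer floor of 60%
        train.extend(idx[:n_train])
        validation.extend(idx[n_train:])
    return train, validation

# Outcome vector with the group sizes reported in the text (568 yes, 26,819 no).
labels = [1] * 568 + [0] * 26819
train_idx, validation_idx = stratified_split(labels)
print(len(train_idx), len(validation_idx))  # 16431 10956
```

Stratifying keeps the proportion of hepatitis cases nearly identical in both cohorts, which matters when the outcome is rare.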

To further identify risk factors associated with hepatitis, univariate logistic regression analysis was applied to all samples in the training cohort based on the candidate features derived from the LASSO regression. ORs with 95% confidence intervals (95% CIs) were calculated, and a forest plot was generated to display the eight candidate features (Figure 3, p < 0.05). The results showed that never smoking (SMKCIGST_A4) was associated with markedly lower odds of hepatitis, whereas increasing age was strongly associated with higher odds of hepatitis.

Figure 3.

Forest plot of univariate logistic regression. The x-axis represents the odds ratio (OR). When OR = 1, the predictor has no effect on the probability of the event (neutral effect). When OR > 1, the predictor is positively associated with the event, indicating that its presence or increase raises the likelihood of the outcome. When OR < 1, the predictor is negatively associated with the event, indicating that its presence or increase reduces the likelihood of the outcome. The symbols “*”, “**”, “***”, and “****” denote statistical significance levels, corresponding to P < 0.05, P < 0.01, P < 0.001, and P < 0.0001, respectively.

Multivariable logistic regression analysis was applied to the 14 candidate features identified from the univariate logistic regression analysis, and ORs with 95% CIs were calculated. Five predictors were found to be significantly associated with hepatitis: AGE_P (age), SEX (sex), HYPEV_A (hypertension), SMKCIGST (smoking status), and RATCAT (economic status; Figure 4 and Table 3; p < 0.05).

Figure 4.

Forest plot of multivariate logistic regression. The x-axis represents the odds ratio (OR). When OR = 1, the predictor has no effect on the probability of the event (neutral effect). When OR > 1, the predictor is positively associated with the event, indicating that its presence or increase raises the likelihood of the outcome. When OR < 1, the predictor is negatively associated with the event, indicating that its presence or increase reduces the likelihood of the outcome. The symbols “*”, “**”, “***”, and “****” denote statistical significance levels, corresponding to P < 0.05, P < 0.01, P < 0.001, and P < 0.0001, respectively.

Table 3.

Variable importance analysis.

Variables	p-Value	OR_mean
AGEP-A3	9.00e-03	6.558
AGEP-A4	1.02e-03	10.579
HYPEV-A2	3.30e-02	0.763
SEX-A2	8.22e-03	0.740
SMKCIGST-A3	7.76e-04	0.579
RATCAT-A3	1.37e-05	0.485
SMKCIGST-A4	7.11e-14	0.298

AGEP-A3: age between 46 and 65; AGEP-A4: age >65; HYPEV-A2: no high blood pressure; SEX-A2: female; SMKCIGST-A3: had smoked before; SMKCIGST-A4: never smoked; RATCAT-A3: not poor.
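The ORs in Table 3 can be turned into nomogram-style points by taking their logarithms and rescaling so that the predictor with the widest range of log-odds spans 0 to 100. The sketch below illustrates that general idea in plain Python; it is not the rms algorithm and will not reproduce Figure 5 exactly. Table 3 reports neither the intercept nor the non-significant levels, so only the listed levels are used, reference levels are assumed to have OR = 1, and predicted probabilities cannot be reconstructed. The function name is hypothetical.

```python
import math

# Adjusted ORs from Table 3 (rounded); reference levels have OR = 1 by definition.
odds_ratios = {
    "AGE_P": {"reference": 1.0, "A3": 6.558, "A4": 10.579},
    "HYPEV": {"reference": 1.0, "A2": 0.763},
    "SEX": {"reference": 1.0, "A2": 0.740},
    "SMKCIGST": {"reference": 1.0, "A3": 0.579, "A4": 0.298},
    "RATCAT": {"reference": 1.0, "A3": 0.485},
}

def nomogram_points(odds_ratios, max_points=100.0):
    """Rescale log-odds so the widest-ranging predictor spans 0 to max_points."""
    betas = {var: {lvl: math.log(o) for lvl, o in levels.items()}
             for var, levels in odds_ratios.items()}
    widest = max(max(b.values()) - min(b.values()) for b in betas.values())
    # Within each predictor, the lowest-risk level is assigned 0 points.
    return {var: {lvl: max_points * (beta - min(b.values())) / widest
                  for lvl, beta in b.items()}
            for var, b in betas.items()}

points = nomogram_points(odds_ratios)
for var, levels in points.items():
    print(var, {lvl: round(p, 1) for lvl, p in levels.items()})
```

In this sketch age over 65 receives the maximum of 100 points, and never smoking receives 0 points within the smoking scale.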

Development and validation of nomogram-based risk scores for hepatitis prediction

To further evaluate the overall predictive ability of the identified factors, a nomogram was constructed in the training cohort, based on predictors selected through LASSO regression and univariate/multivariable logistic regression (Figure 5). In the nomogram, each predictor is assigned a point value, and the sum of the individual scores yields the total points. The total points correspond to the estimated probability of developing hepatitis, with higher scores indicating a greater likelihood of hepatitis occurrence.

Figure 5.

Nomogram in the training cohort. Nomogram lists the predictors included in the model, while the scales on the right represent the range of values for each predictor. The length of each line indicates the contribution of the predictor to disease risk. The β(X − m) terms represent the point values assigned to different levels of each predictor; total score indicates the sum of all individual scores; and Pr denotes the predicted probability of hepatitis.

Evaluation of calibration and predictive accuracy for hepatitis risk

To evaluate the predictive performance of the nomogram, multiple validation metrics were applied in both the training and validation cohorts. Calibration curves were generated to assess prediction accuracy (Figure 6(a) and (b)). The calibration plots showed that the predicted probabilities were in close agreement with the ideal diagonal line, indicating that the nomogram demonstrated high calibration and predictive accuracy in both cohorts.

Figure 6.

Calibration curves of the nomogram in the training cohort and validation cohort. (a) Calibration in training cohort. (b) Calibration in validation cohort.

DCA for assessing the clinical value of the hepatitis nomogram

DCA was performed to assess the clinical value of the hepatitis nomogram. The net benefit of the nomogram was consistently higher than that of the “all” and “none” strategies in both the training and validation cohorts, indicating potential clinical utility (Figure 7(a) and (b)).

Figure 7.

Decision curve analysis (DCA) of the nomogram in the training cohort (left) and validation cohort (right). (a) DCA in training cohort. (b) DCA in validation cohort.
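The quantity plotted in a decision curve is the net benefit, which at a threshold probability pt equals TP/n − (FP/n) × pt/(1 − pt). The sketch below computes it for a model and for the “all” strategy in plain Python. It is not the rmda implementation; the function names and counts are hypothetical.

```python
def net_benefit(tp, fp, n, threshold):
    """Net benefit of treating model-positive individuals at a threshold probability."""
    weight = threshold / (1 - threshold)  # harm of a false positive relative to a true positive
    return tp / n - (fp / n) * weight

def net_benefit_treat_all(prevalence, threshold):
    """Net benefit when everyone is treated as high risk."""
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

# Hypothetical counts at a 5% threshold; the "none" strategy has net benefit 0 by definition.
n, tp, fp = 10000, 90, 600
model_nb = net_benefit(tp, fp, n, 0.05)
all_nb = net_benefit_treat_all(0.02, 0.05)
print(model_nb, all_nb)
```

When the threshold exceeds the prevalence, the “all” strategy has negative net benefit, so in that range a model only needs to outperform the “none” strategy.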

ROC curve analysis of the nomogram for hepatitis risk prediction

ROC curves were generated to describe the discriminative performance of the model. The nomogram showed acceptable discrimination in both the training cohort (AUC = 0.73) and the validation cohort (AUC = 0.70; Figure 8(a) and (b)).

Figure 8.

ROC curves of the nomogram in the training cohort and validation cohort. (a) ROC analysis in training cohort. (b) ROC analysis in validation cohort.
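The AUC can be read as the probability that a randomly chosen participant with hepatitis receives a higher predicted risk than a randomly chosen participant without it. The sketch below computes this directly by comparing all case-control pairs in plain Python. It is not the pROC implementation; the function name and scores are hypothetical.

```python
def auc_by_pairs(case_scores, control_scores):
    """AUC as the fraction of case-control pairs ranked correctly (ties count one half)."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

# Hypothetical predicted probabilities for two cases and three controls.
auc = auc_by_pairs([0.8, 0.4], [0.5, 0.3, 0.2])
print(round(auc, 3))
```

On this scale 0.5 corresponds to chance-level ranking and 1.0 to perfect separation.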

PR curve and confusion matrix analysis of the nomogram for hepatitis risk prediction

In the training set, the model demonstrated an area under the precision-recall curve (AUPRC) of 0.796 (Figure 9(a)). At the threshold determined by the Youden index, the sensitivity was 0.42 and the specificity was 0.97 (Figure 9(c)). The positive predictive value (PPV) was 0.25, whereas the negative predictive value (NPV) was 0.99, reflecting a high capacity to correctly classify low-risk subjects (Figure 9(c)). In the validation set, a similar performance profile was observed, with an AUPRC of 0.798, sensitivity of 0.43, and specificity of 0.98 (Figure 9(b) and (d)). The PPV was 0.28, and the NPV remained elevated at 0.99 (Figure 9(b) and (d)). The close alignment of performance indicators between the training and validation cohorts indicates stable model behavior and limited overfitting. Notably, the consistently high AUPRC values underscore the model’s robust discriminative ability for the minority class, even in the presence of substantial class imbalance. These results suggest that the model identifies meaningful and reproducible predictive signals that go beyond a simple prevalence-based baseline. Although the low disease prevalence inherently constrains the attainable PPV, the model exhibits considerable promise for risk stratification and initial screening applications. Subsequent research employing larger sample sizes and learning strategies specifically designed to address class imbalance could further improve predictive accuracy and clinical utility.

Figure 9.

PR and confusion matrix in the training and validation cohorts. (a) PR analysis in training cohort. (b) PR analysis in validation cohort. (c) Confusion matrix in training cohort. (d) Confusion matrix in validation cohort.
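The dependence of PPV on prevalence can be made explicit with Bayes' rule. The sketch below recomputes PPV and NPV in plain Python from the training-set sensitivity and specificity reported above. It uses the overall prevalence (568 of 27,387) as a stand-in for the cohort prevalence, which is an assumption, and the function name is hypothetical.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """PPV and NPV implied by sensitivity, specificity, and prevalence (Bayes' rule)."""
    tp = sensitivity * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    tn = specificity * (1 - prevalence)
    fn = (1 - sensitivity) * prevalence
    return tp / (tp + fp), tn / (tn + fn)

# Overall prevalence from the text; the per-cohort prevalence is assumed to be similar.
prevalence = 568 / 27387
ppv, npv = predictive_values(sensitivity=0.42, specificity=0.97, prevalence=prevalence)
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")
```

With these rounded inputs the sketch gives a PPV of about 0.23 and an NPV of about 0.99, close to the reported 0.25 and 0.99; the small gap is within what rounding of the reported sensitivity and specificity could produce.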

Discussion

In this analysis of participants in the nationally representative NHIS, we developed and validated a nomogram to estimate the probability of prevalent hepatitis based on demographic, lifestyle, comorbidity, and socioeconomic variables. Five independent predictors (age, sex, hypertension, smoking status, and economic status) were consistently associated with hepatitis status and jointly enabled a practical risk scoring system with acceptable discrimination and calibration in both the training and validation cohorts (AUC 0.73 and 0.70, respectively). These findings are consistent with prior reports highlighting the role of demographic and social determinants in hepatitis epidemiology.^1–4 Global evidence has also shown that viral hepatitis remains a major cause of morbidity and mortality despite elimination efforts. Age and sex patterns, as well as socioeconomic disparities, are well documented.^4,5,19,28,29 Smoking and hypertension, while not classical risk factors for transmission, have been repeatedly associated with poorer outcomes and higher prevalence among hepatitis patients.^18,20,30,31 The observed association between hypertension and hepatitis may be explained by shared biological and behavioral pathways. Chronic systemic inflammation and endothelial dysfunction, common in hypertension, can increase susceptibility to hepatic injury, while hypertension also clusters with adverse health behaviors such as smoking and physical inactivity.³² From a psychosocial perspective, hypertension may further reflect broader metabolic vulnerability and chronic stress exposure that influence immune regulation and healthcare utilization.³³ In addition, economic status may act as a proxy for several proximal determinants of hepatitis risk, including access to healthcare, health literacy, psychosocial stress, and environmental exposures. These socioeconomic factors can influence both exposure risk and timely diagnosis, providing a plausible explanation for the predictive value of economic status in our model.³⁴

Our results reinforce the importance of integrating such variables into risk stratification. Importantly, the nomogram provides a simple tool for individualized risk assessment, which could help prioritize screening and prevention, particularly in resource-limited settings where socioeconomic barriers are pronounced.^5,21,29,35 By highlighting modifiable factors such as smoking, our model also supports combined prevention strategies that may improve both hepatic and extrahepatic outcomes. However, the model may be more reflective of the likelihood of being diagnosed than of the true likelihood of infection. It is best regarded as a risk stratification tool that incorporates biological, behavioral, and healthcare access factors rather than as a purely causal or etiological model.

This study has several limitations. First, it is cross-sectional, relies on self-reported physician diagnosis, and does not differentiate HBV from HCV or include laboratory confirmation. Reliance on self-reported data may introduce bias or inaccuracy, as individuals may not accurately recall their diagnoses, or the diagnosis may not have been confirmed by laboratory tests. Future studies should use laboratory-confirmed diagnoses of hepatitis to improve the accuracy and reliability of the findings. Second, NHIS employs a complex, multistage survey design, and ignoring survey weights may affect the accuracy of population-level inferences. The primary goal of this study was to develop a risk prediction model rather than to estimate population parameters; therefore, while we recognize the importance of survey weighting for broader population estimates, our analysis focused on identifying relevant risk factors and predicting individual risk. The generalizability of the model to individuals should be evaluated further. Third, the lack of a sample size calculation and justification for the data drawn from the NHIS database may reduce the reliability of our results. Finally, we used a random split for training and validation, an approach that does not provide temporal, geographic, or external cohort validation. As such, the validation cohort is not independent in a meaningful epidemiological sense, and the potential for optimism bias remains. We recognize that internal validation via a random split alone is insufficient to establish the robustness of a clinical prediction model. Future studies should incorporate independent validation cohorts to better assess the generalizability and reliability of the model.

Conclusion

In conclusion, we identified five predictors (age, sex, hypertension, smoking status, and economic status) that were independently associated with hepatitis prevalence. Our findings can guide the early detection of individuals at risk of hepatitis based on demographic and lifestyle factors.

Footnotes

Acknowledgements

The authors sincerely thank the colleagues at the No. 1 People’s Hospital for their insightful discussions and technical assistance throughout the preparation of this article.

ORCID iD

Yuhang Tang

Consent to participate

All data used in this study can be obtained from the public NHIS database.

Author contributions

YT contributed to hypothesis generation, study design, data collection, statistical analysis, and manuscript preparation. WC contributed to study design and supervised manuscript preparation.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability statement

All data generated or analyzed during this study are included in this article.

References

GBD 2019 Hepatitis B Collaborators. Global, regional, and national burden of hepatitis B, 1990–2019: a systematic analysis for the Global Burden of Disease Study 2019. Lancet Gastroenterol Hepatol 2022; 7(9): 796–829.

Polaris Observatory HCV Collaborators. Global change in hepatitis C virus prevalence and cascade of care between 2015 and 2020: a modelling study. Lancet Gastroenterol Hepatol 2022; 7(5): 396–415.

Cooke

Flower

Cunningham

, et al. Progress towards elimination of viral hepatitis: a Lancet Gastroenterology & Hepatology Commission update. Lancet Gastroenterol Hepatol 2024; 9(4): 346–365.

Yang

Wang

, et al. The burden of hepatitis C virus in the world, China, India, and the United States. Infect Dis Poverty 2023; 12(1): 43.

Hall

Bradley

Barker

, et al. Estimating hepatitis C prevalence in the United States, 2017–2020. Hepatology 2025; 82(1): 137–149.

Cui

Wang

Zheng

, et al. Global reporting of progress towards elimination of viral hepatitis. J Hepatol 2023; 78(3): 523–534.

Chang

Chen

Lai

, et al. Universal hepatitis B vaccination in Taiwan and the incidence of hepatocellular carcinoma in children. N Engl J Med 1997; 336(26): 1855–1859.

8. Schweitzer, Horn, Mikolajczyk, et al. Estimations of worldwide prevalence of chronic hepatitis B virus infection: a systematic review of data published between 1965 and 2013. Lancet 2015; 386(10003): 1546–1555.

9. Terrault, Lok ASF, McMahon, et al. Update on prevention, diagnosis, and treatment of chronic hepatitis B: AASLD 2018 hepatitis B guidance. Hepatology 2018; 67(4): 1560–1599.

10. Ghany, Morgan; AASLD-IDSA Hepatitis C Guidance Panel. Hepatitis C guidance 2019 update: AASLD-IDSA recommendations for testing, managing, and treating hepatitis C virus infection. Hepatology 2020; 71(2): 686–721.

11. Blach, Terrault, Tacke, et al. Global progress report on access to hepatitis C treatment, 2014–2019. J Hepatol 2020; 72(5): 855–864.

12. Kim, Gadiparthi, et al. Changing trends in etiology-based annual mortality from chronic liver disease in the United States. Hepatology 2018; 67(2): 600–612.

13. Wang, Zhang, et al. Economic-related inequalities in hepatitis B virus infection: a decomposition analysis. BMC Infect Dis 2022; 22(1): 534.

14. Gnyawali, Lyn-Cook, Touré, et al. Epidemiologic and socioeconomic factors impacting hepatitis C treatment and management. Hepatol Med Policy 2022; 7: 3.

15. Alenzi, Matarneh, Alenzi, et al. Bridging the gap: addressing disparities in hepatitis C elimination. Hepatol Forum 2024; 5(3): 133–141.

16. Mariz, Braga, Albuquerque MFPM, et al. Occurrence of hepatitis B and C virus infection in different socioeconomic strata. Rev Soc Bras Med Trop 2024; 57: e0211.

17. Pereira LMMB, Martelli CMT, Merchán-Hamann, et al. Population-based multicentric survey of hepatitis B infection and risk factors in the Northeast and Central-West regions of Brazil. BMC Public Health 2009; 9: 362.

18. Rajewski, Małyszko. Hepatitis C infection as a risk factor for hypertension and cardiovascular diseases. High Blood Press Cardiovasc Prev 2022; 29(4): 339–347.

19. Kim, Weinberger, Chander, et al. Cigarette smoking in persons living with hepatitis C: The National Health and Nutrition Examination Survey (NHANES), 1999–2014. Am J Med 2018; 131(6): 669–676.

20. Yang, Zhao, Wang, et al. The association between hepatitis C virus infection status and hypertension among adults in the United States. BMC Public Health 2024; 24(1): 1250.

21. Yue, Jing, et al. Machine learning model for predicting hepatitis C seroconversion in methadone maintenance patients in China. BMJ Public Health 2025; 3(2): e002290.

22. Yoshida, Bartel. Tableone: create ‘Table 1’ to describe baseline characteristics. R package version 0.13.2. Journal of Statistical Software, 2022.

23. Friedman, Hastie, Tibshirani. Regularization paths for generalized linear models via coordinate descent. J Statist Softw 2010; 33(1): 1–22.

24. Harrell Jr. Rms: regression modeling strategies. R package version 6.5-0. Vanderbilt University, 2023.

25. Graffelman, Van Eeuwijk FA. Calibration of multivariate scatter plots for exploratory analysis of relations. Comput Statist Data Anal 2005; 48(4): 861–878.

26. Brown. Rmda: Risk Model Decision Analysis. R package version 1.6, 2018.

27. Robin, Turck, Hainard, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinform 2011; 12: 77.

28. King, Thompson, Shrestha, et al. Testing for hepatitis C virus infection among adults aged ⩾18 years—United States, 2013–2017. MMWR Morb Mortal Wkly Rep 2021; 70(41): 1441–1446.

29. Cohen, Ward, Gittleman, et al. Hepatitis C and cigarette smoking behavior: themes from focus groups. Health Promot Pract 2024; 25(5): 688–696.

30. Pericot-Valverde, Heo, Akiyama, et al. Factors and HCV treatment outcomes associated with cigarette smoking among people who inject drugs in opioid agonist treatment programs. Subst Use Misuse 2020; 55(13): 2130–2139.

31. Wang, et al. Association between hypertension and the prevalence of steatosis and fibrosis. Front Endocrinol (Lausanne) 2023; 14: 1200667.

32. Barnes RFW, Pandey, Sun, et al. Diabetes, hepatitis C and human immunodeficiency virus influence hypertension risk differently in cohorts of haemophilia patients, veterans and the general population. Haemophilia 2022; 28(6): e228–e236.

33. Parrilli, Manguso, Orsini, et al. Essential hypertension and chronic viral hepatitis. Dig Liver Dis 2007; 39(5): 466–472.

34. Woerdenbag, Kane, et al. Economic evaluations of hepatitis B vaccination for developing countries. Expert Rev Vaccines 2009; 8(7): 907–920.

35. Wang, Fan, Yin, et al. Global burden of hepatitis B attributable to modifiable risk factors from 1990 to 2019. Global Health 2023; 19(1): 30.