Abstract
Background
Few models exist that incorporate measures from an array of individual characteristics to predict the risk of COVID-19 infection in the general population. The aim was to develop a prognostic model for COVID-19 using readily obtainable clinical variables.
Methods
Over 74 weeks surveys were periodically administered to a cohort of 1381 participants previously uninfected with COVID-19 (June 2020 to December 2021). Candidate predictors of incident infection during follow-up included demographics, living situation, financial status, physical activity, health conditions, flu vaccination history, COVID-19 vaccine intention, work/employment status, and use of COVID-19 mitigation behaviors. The final logistic regression model was created using a penalized regression method known as the least absolute shrinkage and selection operator. Model performance was assessed by discrimination and calibration. Internal validation was performed via bootstrapping, and results were adjusted for overoptimism.
Results
Of the 1381 participants, 154 (11.2%) had an incident COVID-19 infection during the follow-up period. The final model included six variables: health insurance, race, household size, and the frequency of practicing three mitigation behaviors (working at home, avoiding high-risk situations, and using facemasks). The c-statistic of the final model was 0.631 (0.617 after bootstrapped optimism-correction). A calibration plot suggested that, in this sample, the model shows modest concordance with incident infection at the lowest levels of predicted risk.
Conclusion
This prognostic model can help identify which community-dwelling older adults are at the highest risk for incident COVID-19 infection and may inform medical provider counseling of their patients about the risk of incident COVID-19 infection.
Introduction
As of December 2022, it is estimated there have been over 650,000,000 COVID-19 infections from the SARS-CoV-2 virus worldwide and almost 100,000,000 in the USA (https://covid19.who.int/ accessed 29 December 2022). 1 Researchers worldwide have been collecting population-based data from individuals who are uninfected with COVID-19 along with infected people and tracking their status over time to assess COVID-19 prevalence and incidence rates. As COVID-19 variants develop, the dynamics of the infection process change and public health officials and clinicians are faced with estimating the risk for COVID-19 infection. 2
At the outset of the COVID-19 pandemic in early 2020, public health agencies developed and disseminated guidelines for reducing the risk of infection. Prior to and early in the distribution of COVID-19 vaccines in the United States in December 2020–January 2021, 3 the Centers for Disease Control and Prevention (CDC) issued guidelines comprising ten behavioral measures: masking, social distancing, working at home, staying home, avoiding crowds, washing hands, avoiding high-risk situations, avoiding restaurants, avoiding touching people, and wiping surfaces. 4 Globally, similar behavioral and nonpharmaceutical guidelines were studied and recommended by public health organizations 1,5,6 and other investigators. 7–9 Before vaccines were rolled out and became widely available in the United States and Europe, such behavioral and nonpharmaceutical interventions were largely found to be the most useful tools for attenuating the rate of COVID-19 infection. 10–16 Community studies of adherence to these interventions contributed to building an evidence base about the uptake of the nonvaccine guidelines. 9,12,17–27
Using extant databases (eg, formal registries, electronic health records, or various types of surveys), data of all types can be fed into programs that model the risk of a given outcome, such as hospitalization or death. 28,29 However, few models exist that incorporate measures from an array of individual characteristics, including clinical and sociodemographic features and use of CDC-recommended protective behaviors, among other items, to predict the risk of COVID-19 infection in the general, previously uninfected population. Ultimately, clinicians can translate such predictions into effective counseling of patients about their risk with respect to modifiable risk factors. To address this gap, we developed a prognostic model for incident COVID-19 infection among uninfected participants in a longitudinal, community-based cohort study.
Methods
Study Sample
All data from this study came from the Cabarrus County COVID-19 Prevalence and Immunity (C3PI) Study. The C3PI Study was a community COVID-19 surveillance study that enrolled 1410 individuals from the Measurement to Understand the Reclassification of Disease of Cabarrus/Kannapolis (MURDOCK) Study Community Registry and Biorepository longitudinal cohort 30,31 and was conducted in North Carolina by Duke University with funding from the North Carolina Department of Health and Human Services (NCDHHS). The design and methods of this study were published previously. 32 Briefly, each participant completed a baseline survey covering demographics, current health status, household features, lifestyle, and employment, as well as their perceptions of the COVID-19 pandemic, use of COVID-19 mitigation behaviors, and attitudes about COVID-19 vaccination. Follow-up after the baseline survey occurred on a biweekly basis for up to 74 weeks. Of all participants, 29 (2.1%) did not respond to most or all of the items on use of COVID-19 mitigation behaviors (see Measures) and were dropped from the analysis. This left 1381 participants (97.9%), aged 24 to 98 years, who were ultimately included in this analysis. Over the course of the follow-up period, there were 154 COVID-19 infections. COVID-19 infections were identified from C3PI Study-related testing in a COVID-19 testing subcohort of 300 individuals or from self-report of infection in the biweekly surveys.
Measures
The baseline survey comprised nearly 370 unique items querying a variety of dimensions before a vaccine for COVID-19 was available: demographics, living situation, financial status, physical activity, health conditions, flu vaccination history, COVID-19 vaccine intention, work/employment status, and use of COVID-19 mitigation behaviors from the CDC. 4
After a data reduction, data quality, and adjudication decision-making process, 34 separate items were included in the analysis (see Table 2).
Statistical Analysis
Descriptive statistics comparing participants with and without incident COVID-19 infections during the follow-up period were presented as means (SD) for continuous variables or frequencies for categorical variables. To develop our prediction model, we included the selected variables described above in a multivariable logistic regression 33 with any infection during the follow-up period as the outcome variable. Missing values were imputed using single-imputation maximum likelihood estimation. We then developed a parsimonious model using a penalized regression method known as the least absolute shrinkage and selection operator (LASSO), 34 specifying 500 bootstrap samples, and included in the final logistic model those predictors that were retained in more than 10% of the bootstrapped samples. 35 There are several methods for selecting a set of independent variables to develop the “best” regression model, but some, such as the family of methods known as stepwise regression, have several problems, including inflated R² values, invalid F and chi-square distributions, underestimated standard errors and overly narrow confidence intervals with resulting p-values that are too small, inflated parameter estimates, and exacerbated issues around multicollinearity. 33 LASSO selection arises from a constrained form of ordinary least squares regression in which the sum of the absolute values of the regression coefficients is constrained to be smaller than a specified parameter. 34 Using conventional methods, model performance was assessed through discrimination and calibration.
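The bootstrap selection step described above can be sketched in a few lines. The following is a minimal illustration only, not the SAS procedure used in the study: the data are synthetic, the penalty value `lam`, the proximal-gradient solver, and all function names are assumptions made for the example, and 20 resamples are drawn instead of 500 to keep the run short.

```python
import math
import random


def sigmoid(z):
    # Logistic function written to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    # Proximal operator of the L1 penalty: shrink toward zero, exact zero inside the band.
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso_logistic(X, y, lam, n_iter=100, step=0.5):
    """L1-penalized logistic regression by proximal gradient descent.

    The intercept is left unpenalized. Returns (intercept, coefficients).
    """
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        resid = [sigmoid(b0 + sum(b * x for b, x in zip(beta, row))) - yi
                 for row, yi in zip(X, y)]
        grad0 = sum(resid) / n
        grad = [sum(r * row[j] for r, row in zip(resid, X)) / n for j in range(p)]
        b0 -= step * grad0
        beta = [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
    return b0, beta


def bootstrap_selection_frequency(X, y, lam, n_boot, rng):
    """Share of bootstrap resamples in which each predictor keeps a nonzero coefficient."""
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        _, beta = fit_lasso_logistic([X[i] for i in idx], [y[i] for i in idx], lam)
        for j, b in enumerate(beta):
            if b != 0.0:
                counts[j] += 1
    return [c / n_boot for c in counts]


# Synthetic data (hypothetical): predictor 0 drives the outcome, predictors 1 and 2 are noise.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(150)]
y = [1 if rng.random() < sigmoid(-1.0 + 2.0 * row[0]) else 0 for row in X]

freq = bootstrap_selection_frequency(X, y, lam=0.1, n_boot=20, rng=rng)
retained = [j for j, f in enumerate(freq) if f > 0.10]  # the >10% retention rule
print("selection frequency:", freq)
print("retained predictors:", retained)
```

A low retention threshold such as 10% admits any predictor that survives the penalty in even a small share of resamples, so it tends to keep more variables than a stricter cutoff would.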
Discrimination of the logistic models refers to the ability of the model to separate individuals who develop an infection from those who do not. 36 Calibration refers to the graphical association between the observed risk of infection and predicted risk. This was visually assessed via a calibration plot with the x-axis displaying the predicted estimate from the model and the y-axis displaying the observed proportion of infection. 37 Internal validation was assessed by evaluating the c-statistic of results of bootstrapped samples. 33 Estimates from the final model were used to develop several hypothetical scenarios based on a participant's specific characteristics in order to illustrate the risk of infection. All statistical analyses were performed using SAS statistical software, version 9.4 (SAS Institute, Inc, Cary, NC).
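As a rough illustration of these three checks, the sketch below computes a c-statistic from pairwise concordance, groups predictions into bins for a calibration comparison, and applies a bootstrap optimism correction. It is a simplified stand-in rather than the SAS analysis: the data are synthetic, the unpenalized gradient-descent fit and all function names are assumptions, and only 20 resamples are used. For reference, the reported values imply an optimism of 0.631 − 0.617 = 0.014.

```python
import math
import random


def sigmoid(z):
    # Logistic function written to avoid overflow for large |z|.
    return 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))


def predict(w, row):
    # w = [intercept, slope_1, ..., slope_p]
    return sigmoid(w[0] + sum(b * x for b, x in zip(w[1:], row)))


def fit_logistic(X, y, n_iter=100, step=0.5):
    """Unpenalized logistic regression by gradient descent (illustrative fitter)."""
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(n_iter):
        resid = [predict(w, row) - yi for row, yi in zip(X, y)]
        w[0] -= step * sum(resid) / n
        for j in range(p):
            w[j + 1] -= step * sum(r * row[j] for r, row in zip(resid, X)) / n
    return w


def c_statistic(probs, outcomes):
    """Share of (infected, uninfected) pairs ranked correctly; ties count one half."""
    pos = [p for p, o in zip(probs, outcomes) if o == 1]
    neg = [p for p, o in zip(probs, outcomes) if o == 0]
    score = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return score / (len(pos) * len(neg))


def calibration_bins(probs, outcomes, n_bins=5):
    """(mean predicted risk, observed proportion) for equal-size groups ordered by risk."""
    pairs = sorted(zip(probs, outcomes))
    size = len(pairs) // n_bins
    rows = []
    for b in range(n_bins):
        chunk = pairs[b * size:] if b == n_bins - 1 else pairs[b * size:(b + 1) * size]
        rows.append((sum(p for p, _ in chunk) / len(chunk),
                     sum(o for _, o in chunk) / len(chunk)))
    return rows


def optimism_corrected_c(X, y, n_boot, rng):
    """Apparent c-statistic minus the average bootstrap optimism."""
    n = len(X)
    full = fit_logistic(X, y)
    apparent = c_statistic([predict(full, r) for r in X], y)
    optimism = 0.0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        wb = fit_logistic(Xb, yb)
        c_boot = c_statistic([predict(wb, r) for r in Xb], yb)  # on the resample
        c_orig = c_statistic([predict(wb, r) for r in X], y)    # on the original data
        optimism += (c_boot - c_orig) / n_boot
    return apparent, apparent - optimism


# Synthetic data (hypothetical): one informative predictor and one noise predictor.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
y = [1 if rng.random() < sigmoid(-1.0 + 1.5 * row[0]) else 0 for row in X]

apparent, corrected = optimism_corrected_c(X, y, n_boot=20, rng=rng)
model = fit_logistic(X, y)
bins = calibration_bins([predict(model, r) for r in X], y)
print("apparent c:", round(apparent, 3), "optimism-corrected c:", round(corrected, 3))
for mean_pred, observed in bins:
    print("predicted", round(mean_pred, 2), "observed", round(observed, 2))
```

Plotting the binned pairs with predicted risk on the x-axis and observed proportion on the y-axis gives a simple calibration plot; points near the diagonal indicate agreement.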
Ethics
Both the parent MURDOCK Study (Approval Number: Pro00011196) and Phase 1 and 2 of the C3PI Study (Approval Number: Pro00105703) were approved by the Duke Health Institutional Review Board. Participants provided electronic informed consent within REDCap® to participate in the C3PI Study.
Results
Baseline Characteristics
The baseline characteristics of the 1381 participants included in this study are summarized in Table 1. Of the total sample, 154 (11.2%) participants reported or tested positive for a COVID-19 infection.
Table 1. Baseline Characteristics of Participants by Infection Status During Follow-up.
For most of the measures of demographics, living situation, employment, health conditions, and attitudes and behaviors associated with the COVID-19 pandemic, the infected and noninfected groups had similar characteristics (Table 1). The group with an infection was on average about 2.1 years younger (p = .054), had a lower proportion of White participants (p = .005), had a higher proportion with Hispanic ethnicity (p = .03), reported more people living in their household (p = .05), and had a lower proportion with any health insurance coverage (p < .0001). Of the CDC-recommended COVID-19 mitigation behaviors, the infected group, compared with the not-infected group, reported lower average frequencies of using facemasks (p = .03), maintaining six-foot distances (p = .04), avoiding high-risk situations (p = .008), working at home (p = .002), and avoiding touching people (p = .04).
Modeling COVID-19 Infection Using LASSO
The candidate variables were analyzed for predicting COVID-19 infection using LASSO. The final model retained six variables (Table 2): any current health insurance coverage (yes or no); race (White, Black, or other); the number of people living in the household; and the frequency of practicing three CDC-recommended mitigation behaviors (working at home, avoiding high-risk situations, and using facemasks). The c-statistic for this model was 0.631. After bootstrap internal validation, the optimism-corrected estimate of the c-statistic was 0.617. A calibration plot is shown in Figure 1. Based on the parameter values from the final logistic model, we present estimates of the risk of infection for a set of hypothetical patients with different baseline characteristics in Table 3 and graphically in Figure 2.

Figure 1. LASSO calibration plot.

Figure 2. Predicted probability of infection using hypothetical scenarios from final model.
Table 2. Multivariable-adjusted prognostic factors assessed at baseline for incident COVID-19 infection during follow-up after penalized least absolute shrinkage and selection operator selection.
CDC-recommended mitigation behaviors were dichotomized as: 1, 2, or 3 = not frequently practicing the behavior; 4 or 5 = frequently practicing the behavior.
The referent group was White race.
Abbreviations: c-statistic, concordance statistic (or area under the receiver operating characteristic curve); CDC, Centers for Disease Control and Prevention.
Table 3. Hypothetical patient examples with the prediction model's calculated predicted probability of COVID-19 infection.
Discussion
Using the LASSO method in a community-based cohort we developed a prognostic model for incident COVID-19 infection during one-year follow-up surveillance after the baseline assessment. In our final model the effects of not having current insurance, identifying as non-White race, having four or more people in the household, and not frequently using a face mask significantly contributed to the prognostic information for incurring an incident COVID-19 infection, while not working from home fell just out of the range of statistical significance.
As noted above, the set of items captured in the baseline survey
Our final model included variables dichotomized as “yes/no,” which could be assessed using a short questionnaire in a clinician's office and potentially entered into a web- or computer-based calculator utility. For example, using the hypothetical patient examples in Table 3, the model estimates that a previously uninfected patient with all the risk factors has a 61% risk of an incident COVID-19 infection at some point in the next year. The patient with all the risk factors except “not frequently using face masks” has about a 50% risk, so, all other things being equal, not frequently using facemasks adds about 10–11 percentage points of risk in that case. Similar added-risk scenarios can help clinicians counsel their patients about these modifiable risk behaviors. The model had modest discrimination, with a c-statistic of 0.631 (optimism corrected to 0.617). 43 Rozenfeld et al. 43 developed a more discriminating risk factor model (c-statistic 0.78) for COVID-19 infection using a much larger sample of over 34,000, but did not include variables with modifiable risk factors. Among a large cohort of nursing home patients, Mehta et al. found 29 significant associations with incident COVID-19 infection, including non-White race and other demographic and facility-level factors, but did not construct a multivariable model or risk calculator.
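To make the calculator idea concrete, the sketch below shows how a logistic model converts yes/no risk factors into a predicted probability, and how an added-risk comparison follows from two profiles. The intercept and coefficients are hypothetical placeholders invented for illustration and are not the Table 2 estimates; they were chosen only so that the two profiles land near the 61% and 50% scenarios quoted above. The function and key names are likewise assumptions.

```python
import math

# HYPOTHETICAL values for illustration only; NOT the fitted estimates from Table 2.
HYPOTHETICAL_INTERCEPT = -3.0
HYPOTHETICAL_COEFS = {
    "no_health_insurance": 1.0,
    "non_white_race": 0.6,
    "household_of_four_or_more": 0.5,
    "infrequent_working_at_home": 0.4,
    "infrequent_avoiding_high_risk": 0.5,
    "infrequent_facemask_use": 0.45,
}


def frequently_practices(score):
    """Dichotomize a 1-5 frequency rating: 4 or 5 counts as frequent practice."""
    if score not in (1, 2, 3, 4, 5):
        raise ValueError("score must be an integer from 1 to 5")
    return score >= 4


def predicted_risk(intercept, coefs, risk_factors):
    """Logistic model: probability = 1 / (1 + exp(-(intercept + sum of active coefficients)))."""
    linear = intercept + sum(coefs[name] for name, present in risk_factors.items() if present)
    return 1.0 / (1.0 + math.exp(-linear))


def odds(p):
    return p / (1.0 - p)


# Profile 1: every risk factor present (facemask rating of 2 = infrequent use).
all_factors = {name: True for name in HYPOTHETICAL_COEFS}
all_factors["infrequent_facemask_use"] = not frequently_practices(2)

# Profile 2: identical, except facemasks are used frequently (rating of 5).
frequent_masking = dict(all_factors, infrequent_facemask_use=not frequently_practices(5))

risk_all = predicted_risk(HYPOTHETICAL_INTERCEPT, HYPOTHETICAL_COEFS, all_factors)
risk_masked = predicted_risk(HYPOTHETICAL_INTERCEPT, HYPOTHETICAL_COEFS, frequent_masking)
added_risk = risk_all - risk_masked

# Odds ratio implied by the two rounded probabilities quoted in the text (61% vs 50%).
implied_odds_ratio = odds(0.61) / odds(0.50)

print("risk with all factors:", round(risk_all, 3))
print("risk with frequent masking:", round(risk_masked, 3))
print("added risk (percentage points):", round(100 * added_risk, 1))
print("implied odds ratio:", round(implied_odds_ratio, 2))
```

Because the logistic function is nonlinear, the same coefficient adds a different number of percentage points depending on a patient's other characteristics, which is why the added risk is stated for one specific comparison.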
There are a few important limitations to the construction of this prognostic model. First, while the rate of incident infections was in the range reported at the time the database was active, the number of infections was not high enough to provide sufficient statistical power to accurately classify many individuals to their true infection status (see the calibration plot, Figure 1). Although the estimated infection risk below 25% appears to agree modestly with the observed infection rate, at higher estimated levels the agreement is low. With a larger database that offers enhanced discriminatory power, it is likely that the calibration between estimated and observed infection rates could be improved. Consequently, the model discrimination, as noted above, was modest. Also, in the LASSO procedure it is common to retain in the final model those variables that are present in at least 50% of the bootstrapped models, whereas we used 10% as the threshold, mostly for illustrative purposes. Further, most of the incident infections in our analysis were derived from self-report in follow-up surveys rather than from serology databases, which could have led to under- or over-identification of true infections, thus affecting power and potentially introducing error into the estimates of associations with infection. However, the surveys did collect the date of diagnosis and method of diagnosis to add stringency to this self-report question. Finally, during the follow-up period there was 15.9% early termination (5.8% in the infected group and 17.2% in the non-infected group), so there are likely incident infection cases that were missed from the final count.
Conclusion
In summary, the COVID-19 pandemic has persisted since early 2020, and as new variants develop, infections affect patient health status and contribute greatly to healthcare costs and utilization. Our model offers proof of concept that the risk of incident COVID-19 infection can be estimated using a limited number of predictor variables. We believe this prognostic information could provide a foundation for developing a tool that clinicians and patients can use to assess risk, guide clinical decisions, and inform the allocation of healthcare resources.
Supplemental Material
Supplemental material, sj-docx-1-hme-10.1177_23333928231154336 for COVID-19 Infection Risk Among Previously Uninfected Adults: Development of a Prognostic Model by Richard Sloane, Carl F Pieper, Richard Faldowski, Douglas Wixted, Coralei E Neighbors, Christopher W Woods and L Kristin Newby in Health Services Research and Managerial Epidemiology
Footnotes
Acknowledgments
The authors thank all participants in the C3PI Study for their involvement in and enthusiasm for the study. We also thank Dr Aaron Fleischauer, Career Epidemiology Field Officer, CDC; Jie Liu, Senior Analyst Programmer, Duke Clinical and Translational Science Institute; and Angie Wu, Senior Biostatistician, Department of Clinical Research, Cytel Inc., for their engagement and support throughout the C3PI Study.
Author Contributions
R.S. and C.F.P. contributed to the study conception and design. R.S. performed the data analysis and wrote the initial version of the manuscript. C.F.P. and R.F. provided statistical advice. C.E.N., D.W., and R.F. collaborated on data management and construction of the foundational analytic data file. D.W. and L.K.N. provided critical input on data items for the preliminary variable screening process. D.W., C.W.W., and L.K.N. provided material and administrative support. All authors contributed to the critical revision of the manuscript for important intellectual content.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
This work was funded by a gift from the David H. Murdock Institute for Business and Culture and is supported by Duke's NIH National Center for Advancing Translational Sciences (NCATS) Clinical and Translational Science Award (CTSA) (UL1TR002553). The C3PI Study was funded by research grants to Duke University from the NCDHHS and the Centers for Disease Control and Prevention (CDC), and the Duke Claude D. Pepper Older Americans Independence Center Grant (5P30AG028716-15).
Supplemental Material
Supplemental material for this article is available online.
