Abstract
Background:
Alzheimer’s disease (AD) and related dementia (ADRD) risk is affected by multiple dependent risk factors; however, there is no consensus about their relative impact in the development of these disorders.
Objective:
To rank the effects of potentially dependent risk factors and identify an optimal parsimonious set of measures for predicting AD/ADRD risk from a larger pool of potentially correlated predictors.
Methods:
We used diagnosis record, survey, and genetic data from the Health and Retirement Study to assess the relative predictive strength of AD/ADRD risk factors spanning several domains: comorbidities, demographics/socioeconomics, health-related behavior, genetics, and environmental exposure. A modified stepwise-AIC-best-subset blanket algorithm was then used to select an optimal set of predictors.
Results:
The final predictive model was reduced to 10 features for AD and 19 for ADRD; concordance statistics were about 0.85 for one-year and 0.70 for ten-year follow-up. Depression, arterial hypertension, traumatic brain injury, cerebrovascular diseases, and the APOE4 proxy SNP rs769449 had the strongest individual associations with AD/ADRD risk. AD/ADRD risk-related co-morbidities provide predictive power on par with key genetic vulnerabilities.
Conclusion:
Results confirm the consensus that circulatory diseases are the main comorbidities associated with AD/ADRD risk and show that clinical diagnosis records outperform comparable self-reported measures in predicting AD/ADRD risk. Model construction algorithms combined with modern data allows researchers to conserve power (especially in the study of disparities where disadvantaged groups are often grossly underrepresented) while accounting for a high proportion of AD/ADRD-risk-related population heterogeneity stemming from multiple domains.
Keywords
INTRODUCTION
Alzheimer’s disease (AD) is a progressive neurodegenerative disorder, which accounts for approximately 80% of all dementia cases diagnosed in the U.S. with associated healthcare costs approaching $215 billion a year [1]. The importance of the assessment of risk-related and protective factors associated with AD and related dementias (ADRD) becomes increasingly important as the population of individuals at risk for AD/ADRD grows both in size (e.g., population growth) and exposure times (e.g., gains in longevity). However, existing literature on AD/ADRD risk provides a highly varied picture spanning a wide range of study design, structure of utilized data, and population focus [2]. Multiple studies investigated the role of cardiovascular disease [3, 4], cancer [5], and diabetes mellitus [6] in AD/ADRD risk [7, 8], but results often remained inconsistent, likely due to the differences in methodological approaches across the studies, different target populations, and a focus on identifying individual risk factors without determining the role of multiple co-existing factors [9, 10]. More recently, the 2020 Lancet Commission on dementia prevention, identified twelve potentially modifiable risk factors suggested to increase dementia risk [11] across three groups based on the critical age region for modifying each risk factor. These were: 1) early life (education), 2) midlife (hypertension, obesity, hearing loss, traumatic brain injury (TBI), and alcohol misuse), and 3) later life (smoking, depression, physical inactivity, social isolation, diabetes, and air pollution) [11]. This framework can be extended with further modifiable or non-modifiable (genetics) risk factors; but the original 12 features clearly demonstrate the complexity of the multifactorial process of AD/ADRD onset.
Increased scope and availability of high-power longitudinal studies of aging as well as individual-level administrative/health records from health and health insurance providers, makes the incorporation of the multiple aspects of AD/ADRD development into a single, sufficiently powered model, increasingly feasible, especially with the growing trend in the incorporation of data on genetics and other biomarkers. However, the construction of such complex models brings its own challenges. Among them are the multiple pathways of creating an indicator/measure of a risk-related characteristic, and the multiple dependencies between AD/ADRD risk factors, which reflect real-life disease processes, but are a significant challenge for risk models, many of which require independence between included features. In this study, we will use data from the Health and Retirement Study (HRS) to assess the impacts and relative ranks of known AD/ADRD risk factors spanning several domains: comorbidities, demographics/socioeconomics, health-related behavior, genetics, and environmental exposure. All twelve factors identified by the 2020 Lancet Commission on dementia prevention were represented by at least one measure and expanded with additional modifiable and non-modifiable risk factors. The HRS is a complex study linkable to administrative health insurance records from the Medicare health insurance program and genetic biomarkers [9, 12] and has the necessary scope and power for the construction and testing of such a model. This study offers an approach for the selection of an optimal set of AD/ADRD predictors, that recognizes that many of these predictors demonstrate mutual dependencies or can be represented through alternative data features. We estimate the strength of the associations between AD/ADRD and risk-related/protective factors (collectively “predictors”) in univariable analyses, identify optimal predictors of AD/ADRD risk, create a multivariable predictive model, and study its predictive power.
METHODS
Data
The Health and Retirement Study is an ongoing study that has been fielded every other year since 1992. It collects data on a battery of demographic, socioeconomic, behavioral, and health-related characteristics and is nationally representative of U.S. older adults age 50+ [13]. The HRS provides linkages to administrative claims data from the Medicare program for eligible individuals (most of the U.S. population becomes eligible at age 65), which contains information on medical services paid by either Medicare Part A (facility-based services), Medicare Part B (professional services), or Medicare Part D (prescription drugs), as well as supplementary demographic and enrollment information. A subset of the HRS population has been genotyped (∼2.5 M single nucleotide polymorphisms (SNPs) genotyped on Illumina HumanOmni2.5 chipset).
Zip-code level data on the daily and annual estimates of the concentration of fine particulate matter (PM2.5), ozone (O3), and nitrogen dioxide (NO2) over the 2000–2016 period from Socioeconomic Data and Applications Center (SEDAC) [14, 15] were used to create indicators of adverse environmental conditions, specifically exposure to air pollutants.
Sample
At the time of this study, the HRS contained data on 37,319 respondents age 50+. Of these, 15,620 were genotyped and 12,872 had polygenic risk scores officially produced by the HRS. Within this latter group 8,603 were enrolled in the Medicare program. For inclusion in this study, we required respondents to be genotyped and have Medicare fee-for-service Parts A and B coverage during the 2000–2015 period. Using Medicare enrollment files, we identified the first and last month/year during which an individual was enrolled in a traditional Medicare fee-for-service plan with both Parts A and B coverage. These two points in time served as the bounds over which an individual was followed. Individuals with significant gaps (≥50% of total follow-up time for an individual) in fee-for-service coverage (such as due to non-payment of Part B premiums or membership in a non-fee-for-service Medicare plan, e.g., Medicare Advantage) were dropped. Finally, we required that the age at the start of the follow-up period be between 65–80 and that there be no history of AD or ADRD diagnoses prior to this baseline age. After restrictions, the final sample included 8,094 individuals for the AD sample and 8,084 for ADRD.
Outcomes
The onset of AD (International Classification of Disease 9th Edition [ICD-9]: 331.0) and ADRD [16] (see also Supplementary Table 1) based on individual diagnoses from health insurance claims records were the primary outcomes of this study. Disease ascertainment algorithms described in other publications [17] were used to identify the presence of AD and ADRD as well all other co-morbidities. Briefly, we required the presence of two distinct claims with a diagnosis of AD/ADRD within 90 days of each other with the earliest date in the pair designated as the date of onset. Once the date of onset was identified, the individual was considered diagnosed during the entire follow-up period. Sensitivity analysis was conducted with algorithms utilizing confirmation periods >90 days and treating death within the confirmation period as a confirmatory record.
Medicare diagnosis records-based predictors
We included the following AD/ADRD risk-related morbidities/morbidity groups based on existing literature: 1) cardiovascular diseases including myocardial infarction, angina pectoris, congestive heart failure, valvular heart disease, rheumatic heart disease, arterial hypertension, systemic hypotension, and cardiac arrhythmias [18 –20]; 2) cerebrovascular diseases including ischemic stroke, hemorrhagic stroke, and other cerebrovascular diseases [19, 21]; 3) chronic kidney disease [20, 22]; 4) chronic obstructive pulmonary disease (COPD) [23]; 5) diabetes mellitus [6]; 6) thyroid dysfunction [24]; 7) metabolic syndrome (obesity, dyslipidemia, hypertension, impaired insulin tolerance [6 , 26]); 8) obstructive sleep apnea [27, 28]; 9) depression [29]; and 10) TBI or head trauma [30]. To account for a possible link between liver function and metabolism of common medications (e.g., medications for cardiovascular diseases), which may contribute to the drug efficacy and adverse effects, we also evaluated liver diseases with hepatic failure. Finally, we included cancers because of evidence [31 –34] that there may be a trade-off between cancer and AD risk. The dates of diagnoses of these diseases were identified from Medicare administrative claims records selected based on ICD-9 codes shown in Supplementary Table 1. Disease onset was identified using the same ascertainment algorithm as for AD/ADRD. Cardiovascular and cerebrovascular disease events such as myocardial infarction and stroke reflect pathological processes in circulatory system and likely contribute to the future risks of AD/ADRD through this mechanism, therefore these conditions were treated similarly to chronic diseases.
HRS survey-based predictors
The HRS survey was used to generate (See Supplementary Table 1 for complete list and variable definitions) indicators of: self-reported disease status; activities of daily living (ADL) and instrumental ADL (IADL) limitations; scores on Center for Epidemiologic Studies Depression Scale (CESD); hearing and vision problems; health behaviors (current smoking status, hard drinking, no physical activity, and body mass index (BMI)), socioeconomics (social isolation index, education level, and low income), and demographics. All survey-based predictors were time dependent with values being drawn from the most recent wave prior to any age during an individual’s follow-up.
Genetic predictors
Genetic predictors were treated as time-invariant and included through several alternative specifications. First, we included candidate SNPs that were shown to be associated with AD/ADRD risk (Table 1 of [35]). Second, we used polygenic risk scores (PRS) provided by the HRS (version 4). These were constructed using effect sizes from the IGAP study [36] and SNPs measured for HRS individuals. Four different PRS specifications were assessed: with/without inclusion of APOE-related genes/SNPs and with/without the deselection of SNPs not associated with AD in GWAS at the p = 0.01 threshold. Third, we replicated the methodology used to generate the official HRS PRS, to generate 12 alternative PRS hypothesized to better reflect the specific mechanisms of AD/ADRD development. These were based on gene/SNP sets associated with a pathway potentially contributing to a specific mechanism of AD/ADRD development and known as AD hallmarks. Specifically, markers of neuro-inflammation, mitochondrial dysfunction, synapse loss, de-myelination, and neuronal death, in addition to the extracellular amyloid-β plaques, and intracellular neurofibrillary tangles of hyperphosphorylated tau protein— the two primary hallmarks of AD— were included. SNP sets for these hallmarks were specified using: 1) results of GWAS of AD using the HRS and other AD/ADRD-related datasets [2]; 2) SNPs located in ∼280 AD-related genes available from NHGRI-EBI Catalog of published GWAS (https://www.ebi.ac.uk/gwas/); 3) AD-related genes selected from other recent publications [37 –52]; and 4) gene sets representing pathways and processes that are thought to play role in AD and available from established online collections of the gene sets and pathways, such as MsigDB/GSEA, KEGG, REACTOME, BioCarta, GO, and others. We used available software for automated high-resolution computations of the polygenic risk scores for binary and quantitative traits (PRSice [53]). The list of specific genes to each PRS is shown in Supplementary Table 2. SNPs associated with APOE in dbSNP database were additionally unselected.
AD/ADRD hazard ratios evaluated in univariable analyses for the total sample of older adults; male, female and Black strata; and with the predictors entering at a 1-, 3-, and 5- years of lag
AD, Alzheimer’s disease; ADRD, Alzheimer’s disease and related dementia; HR, hazard ratio; CI, confidence interval; COPD, chronic obstructive pulmonary disease; SR, self-reported indicator; ADL, activities of daily living; CESD, Center for Epidemiologic Studies Depression Scale.
Measures of ambient air pollution
Adverse environmental exposures to PM2.5 (μg/m3), O3 (ppb), and NO2 (ppb) were derived from ensemble predictions of three machine-learning models that estimate daily concentrations at the centroids of 1 km×1 km grid cells across the U.S. over the 2000–2016 period. Model predictors included air monitoring data, satellite data, weather data, and chemical transport model simulations [14 , 55]. Zip code-level pollution concentrations were estimated by averaging the predictions of grid cells falling within zip code polygons or— for single-point zip codes— the value of the nearest grid cell. This data was then aggregated at the annual level and linked to the zip-code associated with each HRS individual for that year. The dataset includes approximately 31,000 zip codes with polygon representations and about 10,000 zip codes as single points.
Statistical analysis
The analysis included three steps. First, we investigated the associations of specific predictors with AD and ADRD risks in a series of univariable Cox models. Indicators at age α of a disease diagnosed at a diag was defined as p (a) = I (a > a diag ), and the latest measurement before a current age was used for HRS survey variables. Since the effect of a specific time-dependent predictor can be delayed, models with lagged indicators p (a) = I (a > a diag + a lag ) of a lag =1, 3, and 5 years were also estimated. All genetic indicators were used in these univariable analyses as time-independent predictors.
Second, we used multivariable models to identify the optimal combination of predictors with AD and ADRD risks. Predictors for the final multivariable models for AD and ADRD were selected applying the stepwise-AIC-best-subset blanket algorithm. Specifically, we first use stepwise selection to create a sequence of models from the null model (with no predictors) through the full model (with all predictors). Next, we identified the numbers of predictors, n, that correspond to the models with minimal Akaike and Bayesian information criteria (AIC and BIC) and identify the best models for each n. The best model is characterized by the set of n predictors with the best χ2-score test. Since BIC has a larger penalty term for additional parameters than the AIC, resulting in the BIC favoring more parsimonious models than the AIC [57], we started with the best model with the minimal BIC and considered models with sequentially higher n for as long as each added predictor remained significant. The resulting model was then tested to ensure stability across subpopulations, analyses with lag-periods, and alternative definitions for outcomes. In addition, mutual correlation between predictors was accounted for when selecting the initial set of predictors. This is especially important for selecting between claims-based and survey-based measures representing the same health condition as well as for genetic markers. In such cases, we used information criteria for separate models utilizing each of the alternative specifications in turn; this is equivalent to choosing an indicator with the better p-value from a model with both predictors included.
Third, the predictive powers of the multivariable models were evaluated using the concordance statistic and ROC curves. This statistic is a well-known c-index defined as the area under the ROC curve which is widely used in logistic regression models. The validation datasets were created at integer baseline ages, and all predictors were evaluated at the baseline and considered time independent. For survival models, ROC curves are time-sensitive, therefore we estimated statistics for different follow-up times time from one to 10 years. Two approaches for evaluation of the c-index introduced by Harrell et al. [58, 59] and Uno et al. [60] were used and compared. Standard error for the c-index was evaluated using bootstrapping with 100 resampling of the original sample.
RESULTS
The AD and ADRD analysis subsamples differed primarily by the size of the outcome group (375/8,094 for AD; 919/8,084 for ADRD). follow-up time (10.7 years for AD and 10.5 years for ADRD), and percent of individuals lost to mortality during follow-up. In all other aspects the samples were approximately identical (Supplementary Table 3). The age at baseline was 67.5 (4.0) for both samples. The sex makeup of the sample is consistent with that of the general population (57.4% female; 42.6% male for both samples). The race/ethnicity composition provides good statistical power for the White (84.1% for AD and 84.2% for ADRD) and Black (13.8% for AD and ADRD) American subgroups. Detailed summary statistics are presented in Supplementary Table 3.
Parameter estimates for non-genetic predictors used in the univariable Cox proportional hazard model are presented in Table 1 for AD and Supplementary Table 4 for ADRD. The results for ADRD for the entire sample are also shown in Table 1. Estimates are marked “n/a” for analyses with an insufficient number of cases in a specific subgroup. The patterns of the results are similar between AD and ADRD, though the magnitudes are significantly stronger for ADRD in several diagnosis-based disease indicators (heart failure, cardiac arrhythmias, ischemic stroke, other cerebrovascular diseases, and chronic kidney disease) and one self-reported disease indicator (stroke). The strongest associations with AD risk were detected for depression (Hazard Ratio [HR] = 3.9), hypertension (HR = 2.7), traumatic brain injury (HR = 2.9), and hemorrhagic stroke (HR = 2.9). These diseases were also among the strongest predictors for ADRD with HRs >2.8. Four additional strong predictors of ADRD were ischemic stroke (HR = 2.9), other cerebrovascular diseases (HR = 3.8), systemic hypotension (HR = 2.7), and heart failure (HR = 2.5).
For the majority of the predictors the associations with AD/ADRD were similar for both sexes, though for several predictors the magnitude was much stronger for females, e.g., hemorrhagic stroke (AD HR = 4.4 for females versus 1.2 for males) and arterial hypertension (HR = 2.9 versus 2.5). In contrast the association of traumatic brain injury was stronger for males (HR = 3.8 versus 2.6). The variability in estimates for the Black population was higher because of lower statistical power for this group; however, their patterns, represented by the rank of the associations, were similar to those observed in the entire sample, though several diseases demonstrated higher magnitude in the Black population: hemorrhagic stroke (HR = 8.0), systemic hypotension (HR = 3.6), and rheumatic heart disease (HR = 3.3). For almost all time-dependent predictors, magnitude declined as the lag period increased; however, this decline was small for most predictors. The HRs for models with a lag period of 1, 3, and 5 years were within the range of the confidence intervals of the estimates for the entire sample. A stable increase in the magnitude of the association with the length of the lag period was detected for valvular and rheumatic heart diseases.
The following associations were detected for non-disease-related predictors of AD/ADRD: 1) the number of instrumental ADLs, lack of physical activities, and Black race were the strongest predictors, 2) higher BMI was associated with reduced risk, 3) low income was a better predictor of AD/ADRD than low education or social isolation; 4) female sex was not significant, 5) vision and hearing problems demonstrated low predictive power, and 6) PM2.5 was the only significant predictor among the three measures of exposure to air pollutants. The results for non-disease predictors were very similar for AD and ADRD risk.
Table 2 and Supplementary Table 5 provide estimates for genetic predictors. These predictors are time-independent, so the estimates for lag-time models are identical to the main model and are not shown. The results show that associative effects of the HRS PRS are significant, with the strongest associations observed for the standard PRS calculated by HRS with the threshold p = 0.01. We also found the PRS constructed by us using the entire set of SNPs available in HRS data reproduces the estimate obtained for the respective HRS PRS. No significant associations were found among pathway-specific PRS from which 15 SNPs from the APOE region relevant to AD/ADRD were excluded. One minor exception was the pathway associated with inflammation for which the association at the level of significance was found. The SNPs rs769449 and rs2075650, in APOE and TOMM40 genes respectively, had the largest associations with AD/ADRD. Genetic associations with AD risk were much stronger than for ADRD.
AD/ADRD hazard ratios evaluated in univariable analysis for genetic indicators of risk
*Sets of genes included in these pathways are provided in Supplementary Table 2. **SNPs with insignificant effects were not shown. AD, Alzheimer’s disease; ADRD, Alzheimer’s disease and related dementia; HR, hazard ratio; CI, confidence interval; PRS, Polygenic risk score.
Sets of predictors for multivariable models of the risks of AD and ADRD were identified using the procedure described in Methods Section. The final models contained 10 predictors for AD and 19 for ADRD (Table 3). In the final models, time-dependent diagnosis-based measures of co-morbidity was the most powerful group of predictors. Comorbidity self-reports performed poorly and were not selected for the model due to the higher predictive power of diagnosis-based comorbidities. Genetic factors that were incorporated in the final model included rs769449 SNP in intron of APOE gene (this SNP is also a proxy for APOE4 in Whites [61]) and the standard HRS PRS that was constructed with the threshold p < 0.01 and for which APOE-related SNPs were not selected for the final set for multivariable modeling. This was expected because of the results of univariable analyses. Five diseases were identified as the primary predictors of AD: depression (HR = 2.5), arterial hypertension (HR = 2.1), TBI (HR = 1.6), rheumatic heart disease (HR = 1.6), and stroke (HR = 1.5). Three self-reported predictors (Black race, number of IADLs, BMI) were also retained. These chosen variables have similar HR as in univariable analyses. The model for ADRD contains additional 5 risk-related diseases (heart failure, systemic hypotension, chronic kidney disease, diabetes mellitus, and chronic liver disease). Three variables representing stoke (i.e., ischemic stroke, hemorrhagic stroke, and other cerebrovascular diseases) were identified by the variable selection algorithm in contrast to the case of AD when only one variable for stroke (i.e., any type) was identified as a predictor for multivariable modeling. The association with female sex was protective and significant for ADRD. Genetic variants were stronger predictors of AD than ADRD but rs769449 in the APOE region is still significant for ADRD. PM2.5 was selected as a significant predictor for ADRD model. Depression was the main AD predictor for females, traumatic brain injury––for males, and arterial hypertension––for Black population. These predictors and hemorrhagic stroke were strongest predictors of ADRD among these strata.
AD and ADRD hazard ratios evaluated in multivariable analyses for i) the overall population of older adults (All) with 95% confidence intervals (Lo.CI, Up.CI), sex-specific subpopulations (Male, Female), Black population, and for models with 1, 3, and 5 years of lag (Lag1, Lag3, and Lag5)
AD, Alzheimer’s disease; ADRD, Alzheimer’s disease and related dementia; HR, hazard ratio; CI, confidence interval; IADL, instrumental activities of daily living.
The predictive power of the multivariable models was evaluated using the concordance statistic (c-index). For a survival model, ROC curves are time-sensitive, i.e., we can have different ROC curves at different time points of survival, we compared follow-up times of one-, three-, five-, and ten-years. The estimates of the concordance (c-index) statistics for one- and ten-year follow-up are shown in Fig. 1. The results of sensitivity analysis using alternative algorithms for ascertainment of AD/ADRD onset are shown in Supplementary Figures 1 and 2. The quality for one-year predictions is high. For one year of follow-up the c-index can reach 0.85; this number falls to about 0.70 for ten-year forecasts. A minor decline in the c-index with higher ages at the start of the follow-up period was detected. The concordance statistics are similar for AD and ADRD.

C-index for AD and ADRD risk for one year (red) and ten years of follow-up (black).
DISCUSSION
The main objective of this study was to identify a minimal subset of AD/ADRD-related risk and protective factors (out of larger set of possibly correlated factors) that collectively influence these health outcomes. Four sources of information were linked together to produce a single study sample: 1) the presence of diseases based on diagnosis records from the Medicare program; 2) self-reported diseases and non-disease health measures, health-related behaviors, functional status, and socioeconomic status; 3) genetic markers including polygenic risk scores and known-to-date genetic risk factors for AD/ADRD, and 4) zip-code based measures of environmental exposure. The associations of predictors with AD/ADRD risks were evaluated in univariable analyses, and then the optimal set of predictors for multivariable models of AD/ADRD risk was selected using formal statistical criteria, ensuring the replicability of the process in future applications.
The constructed multivariable models of AD/ADRD risk improve our knowledge of how the risk factors present at ages 65+ predict the risks of AD/ADRD. The current status quo for the role of preventable risk factors of AD/ADRD was described by the 2020 Lancet Commission on dementia prevention [11] and summarized in the 2023 Alzheimer’s disease facts and figures [62]. In a life-course model, it has been estimated that modifying these factors can prevent or delay up to 40% of dementias by delaying the onset of cognitive impairment and dementia and improving the quality of life of patients with dementia and their caregivers [11]. Available information comes from different studies for varying subpopulations; therefore, we considered the extended list of disease-related predictors together with other factors in the same study. Such design allowed us to evaluate the rank of multiple risk factors in their ability to predict the risk of AD/ADRD and compare power of alternative data sources, e.g., health records and self-reports.
Methodologically, the paper is focused on the analyses of the effects of risk/protective factors that impact AD/ADRD risk focusing on identification of the multivariable model of the risk in AD/ADRD in which the most powerful factors from available sources of information are selected. We found that the effects of diagnosis record-based morbidity indicators were stronger than those based on self-reports, though they shared general effect patterns. A possible reason could be the ability to tailor a claims ascertainment algorithm (e.g., sets of ICD-9/10 codes utilized; type of medical encounter(s) assessed; diagnosis verification method etc.) to the exact disease being studied not present in self-reported data; shorter times between updates in an electronic health database vs. a survey cohort; and lack of recall-related errors. Both self-report and administrative data have well known methodological limitations [63, 64], and the accuracy of the resulting epidemiological estimates involve uncertainties which can further vary from disease to disease.
Three high-prevalent factors (hypertension, depression, and rs769449, proxy SNP for APOE4) had the highest effect on the risk of AD, this suggests central role of these factors in AD etiology. They may represent different pathways contributing to heterogeneity of AD or be connected through a common AD mechanism. For instance, all three factors may compromise blood–brain barrier integrity and increase chances of damaging factors (such as infection and air pollution) to reach the brain and trigger neuroinflammation in turn contributing to neurodegeneration. Current evidence supports the role of disruption of the blood–brain barrier as a common mediator of hypertension and depression-induced cognitive impairment, and APOE4 detrimental effects on the brain [65 –70]. This study also replicated protective effects on AD of rs157580 (G) and rs8106922 (G), SNPs in TOMM40 gene, that were previously reported by us in LOADFS and CHS data [61]. This suggests a possibility that simultaneous presence of these protective variants with APOE4, or its proxies, in an individual, might offset some or all negative effect of APOE4 on AD, which deserves further investigation.
Our results confirm the overall consensus that circulatory diseases are the main comorbidities contributing to AD/ADRD risk [71]: more than half of the diseases selected for the final model of AD/ADRD risk (3 of 5 for AD and 7 of 12 for ADRD) were cardio- or cerebrovascular diseases. Positive associations between the presence of disease and higher risk of AD/ADRD observed in our study also have been reported in other studies for heart [18 –20] and cerebrovascular diseases [19].. In our study, the effects of hemorrhagic stroke is much higher than the effect of ischemic stroke that is in agreement with earlier studies [21]. Cerebral microhemorrhage has been shown to be associated with Aβ deposition [72], with shared genetic mechanisms for both hemorrhagic stroke and dementia [73]. Neuropathological studies have suggested specific mechanisms (e.g., cerebrovascular lesions, atherosclerosis, and neurodegenerative changes) that often coexist with a clinical manifestation of dementia [74, 75]. These mechanisms generate a link between cardiovascular health and AD [76] through onset of diseases such as heart failure, coronary artery disease, atrial fibrillation, and vasculopathy. A cardiovascular health score based on clustering of behavior- and biological-related measures associated with the increased risk of cardiovascular diseases was constructed using data from Whitehall II cohort study. This score was associated with a risk of dementia [77], though individual risk factors in this sample can give insignificant associations when controlling for other individual risk factors [78].
Depression and TBI were selected for both AD and ADRD. Observed associations with higher risk of AD among patients with depression have been also reported in other studies [11, 29] with potential contribution of anti-depressant therapy to increased AD risk, shared genetic characteristics, and recognizing depression as one of the early symptoms of AD/ADRD. TBI also appeared in the list of 12 risk factors [11]. In that study TBI explained 3% of variance in AD risk and was set in the group of midlife predictors. TBI is usually caused by car, motorcycle, and bicycle injuries; military exposures; boxing, horse riding, and other recreational sports; firearms; and falls, [79], i.e., events with reasonable prevalence in midlife period. In our study TBI is measured by the presence of the condition in individuals independently when it occurred, however, the event should be sufficiently serious to make an individual have doctor visits with the respective ICD-9 code. There were no essential differences in the effects calculated with other lag periods, that emphasize the chronic nature of the measure of TBI we use. In our study the effect of TBI was much higher for males, potentially due to more severe or more frequent injuries in the male populations.
Other diseases selected for the model of ADRD risk were type 2 diabetes, chronic kidney and chronic liver diseases. Hazard ratios of diabetes on AD/ADRD type 2 diabetes mellitus [6 , 26] in other studies were similar to that observed in our study. Diabetes is one of 12 risk factors identified in recent study with only 1% of population attributable fraction [11]. The effects of any particular medication on the risk of AD/ADRD remains unclear, and intensive diabetic control does not seem to decrease the risk of dementia [11]. A possible mechanism of the effect of chronic liver disease could be related with hepatic failure resulting in insufficient detoxication of toxic metabolites including medications. Although cancer-related predictors were not selected for the final model (protective effect of non-specified cancer with p-value ≈0.06 was the most significant not included in the model for ADRD risk), there is some evidence on protective effects of cancer diagnosis on the risk of AD/ADRD. In literature evidence on the risk of dementia and AD among cancer patients varies, e.g., recent study demonstrated higher risk [80], though less recent studies resulted in lowed risks of AD/ADRD [81] or in insignificant estimates [32, 82]. Also, studies demonstrated the role of cancer treatment in the subsequent risk of AD/ADRD [83 –85].
Polygenic risk scores associated with specific pathways did not demonstrate strong effects in univariable analyses and were not selected into the final model. Most significant effect came from pathway associated with inflammation, represented by the Hallmark Inflammatory Response set composed of 200 genes and available from MSigDB [86] (Table 2, Supplementary Table 1). Effect size of PRS for this pathway was similar to that for the PRS made of all genotyped SNPs (minus APOE) in HRS data. We note that AD-related SNPs in APOE region on chr 19 were excluded from the definition of PRS. The final model included the standard HRS PRS that was created with the p-value threshold 0.01 and without any exclusion of specific SNPs. Specific SNPs that demonstrated effects in different studies were significant in univariable analyses in our study; however, only SNP rs769449 (proxy for APOE4) remained in the final model. This means that APOE-related genetic effects are detectable in our analyses in the multivariable models for AD/ADRD risks. Other SNPs, detected in different studies [35, 87], may work indirectly, through their effects on different risk factors presented in the final multivariable model.
The effect of IADL was strongest among non-disease predictors. The effect largely comes from two components that are partly symptoms of ADRD: inability to manage money and use the telephone. The effect of higher BMI as a continuous predictor is protective in all models. Its effect is more significant than binarized effects of obesity (BMI⩾30) or underweight (BMI⩽18). Physical activity had a strong protective effect in univariable analysis but was not selected in the final multivariable model likely because of its correlation with other more specific factors representing the current health status of an individual. Generally our results confirm the effects identified by the 2020 Lancet Commission on dementia prevention [11], though other strong predictors were also identified. Thus, the effects of higher income and higher number of education years are protective; low income retained in the final model for ADRD. The social isolation index and hearing/vision issues were significant in univariable analyses but not selected in the final model. With the exception of rheumatic heart disease, the incorporation of lag effects into the model generally decreases the effects. Comparison for the effects in sex-specific subpopulations show, that although women have higher AD prevalence and suffer greater symptoms, there is no consensus on whether this difference extends to incidence and/or mortality once the effects of other predictors have been controlled for [88 –92]. The significant effects of PM2.5 on AD/ADRD risk is in agreement with findings of prior studies [93 –95]; however, the fact that the effect remains significant in multivariable setting is among new results of our study.
Identification of predictors responsible for the increase or decrease the risks of AD/ADRD allows us to better understand the underlying mechanisms of AD/ADRD development, and also help further improve forecasting models of AD/ADRD-related outcomes. Modern health forecasting of AD/ADRD has been approached from a number of directions ranging from simple prevalence projections based on current prevalence trends and U.S. Census population forecasts [16 , 96–100]; to projections based on assumptions on future population dynamics of AD/ADRD outcomes [101] and multistate modeling of healthy and unhealthy states [102, 103]. These forecasts, however, do not involve broad information on risk factor effects. The forecasted prevalence of dementia attributable to risk factors were recently assessed using methods involving methods of population attributable fractions [11, 104]. Models with limited incorporation of morbidity patterns appeared recently [105, 106] but further incorporation of highly predictive factors into the model can essentially decrease the overall uncertainty of the resulting projections and a reduction of the practical applicability of the results.
We acknowledge several study limitations. The HRS provides a sample representative of older U.S. adults; however, its statistical power is still not sufficient for the analysis of some ethnic subgroups in a model involving rare outcomes/measures. Self-reported data is relatively easy to collect, however there are concerns about their accuracy [107, 108]. Administrative claims data does not share the above limitations and has been validated for the study of many diseases [109 –112]. However, an insurance claim is at best a proxy for the presence of a disease though in most cases this is a valid proxy a number of serious limitations exist. Differences in the utilization of healthcare services as well as race/ethnicity-specific coding patterns can lead to variation in the frequency of ICD diagnosis between race/ethnicity-specific groups. However, such data allows for more granular identification of disease onset while surveys are limited by the time periods inherent to their design (2 years in the HRS).
Our study confirmed that AD/ADRD risk is multifactorial in nature and that its variation is tied to multiple demographic, socioeconomic, and health-related risk factors. The size and detail of data available for research grows yearly. In many cases, this can lead to multiple measures of the same underlying construct to be available. Consequently, the identification of measures with relatively higher predictive power, and minimization of errors due to mutual correlation becomes an increasingly important question. We identified and ranked the association strengths of multiple possibly correlated risk/protective factors from multiple domains. We found that diagnosis-based disease indicators were superior in terms of predictive power of AD/ADRD risk than their self-reported counterparts. Depression and arterial hypertension were the largest contributors to AD risk; comparable to the additional risk associated with adverse APOE4 profiles. The effect of PRS on AD was significant for genes involved in the inflammatory pathway, which is in line with current views on potentially major roles of inflammation and infections in AD. Significant SNPs in the APOE region included both risk and protective factors, which warrants further investigation of the interactions between these SNPs, which might offset the negative role of APOE4 in AD risk.
Footnotes
ACKNOWLEDGMENTS
The Health and Retirement Study (HRS) data is sponsored by the National Institute on Aging (grant number U01-AG009740) and is conducted by the University of Michigan.
FUNDING
This study was supported by the National Institute on Aging (R01-AG066133, RF1-AG046860, R01-AG057801, R01AG076019, R01AG062623) and the Department of Defense (W81XWH-20-1-0253). The sponsors had no role in the design and conduct of this study.
CONFLICT OF INTEREST
The authors have no conflict of interest to report.
DATA AVAILABILITY
Restrictions apply to the availability of the data. The data used for the analyses is considered identifiable protected research data and access to it is regulated by HIPAA and CMS. Researchers wishing to obtain access to the data must have appropriate professional qualifications to conduct human subjects research, seek approval from IRB and CMS. Researchers wishing to reproduce our research, in whole or in part, will be able to contact the Principal Investigator of the project, Dr. Igor Akushevich, for possible access to the raw materials and the detailed archives of the programs used to prepare the data and perform the analysis.
