Abstract
A growing body of literature has examined the impact of neighborhood characteristics on Alzheimer’s disease (AD) dementia, yet the spatial variability and relative importance of the most influential factors remain underexplored. We compiled various widely recognized factors to examine spatial heterogeneity and associations with AD dementia prevalence via geographically weighted random forest (GWRF) approach. The GWRF outperformed conventional models with an out-of-bag R2 of 74.8% in predicting AD dementia prevalence and the lowest error (MAE = 0.34, RMSE = 0.45). Key findings showed that mobile homes were the most influential factor in 19.9% of U.S. counties, followed by NDVI (17.4%), physical inactivity (12.9%), households with no vehicle (11.3%), and particulate matter (10.4%), while other primary factors affecting <10% of U.S. counties. Findings highlight the need for county-specific interventions tailored to local risk factors. Policies should prioritize increasing affordable housing stability, expanding green spaces, improving transportation access, promoting physical activity, and reducing air pollution exposure.
Keywords
Introduction
Neighborhood environments where persons living with AD dementia reside can significantly impact the progression of the disease.1-5 For example, an age-friendly environment has been shown to be crucial for older adults with limited mobility who often spend more time in their immediate geographic surroundings. 6 Evidence suggests that higher neighborhood deprivation is linked to lower cognitive function and increased depressive symptoms.7,8 Mechanistically, long-term exposure to air pollution, specifically fine particulate matter (PM2.5), can trigger brain inflammation and increase the risk of cognitive decline and neuroinflammation.9-11 Conversely, greater exposure to green spaces can reduce stress and mental fatigue, promote physical activity, and lower the risk of cognitive impairment.12-16 Access to healthcare and community resources is crucial for managing the disease and alleviating caregiver burden. 17
Geospatial analysis and modeling can provide a robust framework for examining the impact of neighboring environments on the distribution of AD dementia.18,19 In the U.S., it has been used for the identification of spatial clusters of AD mortality rates, 20 spatiotemporal clusters of dementia mortality, 21 and spatial and spatiotemporal hotspots of dementia hospitalization rates. 22 However, only a limited number of geospatial modeling studies have been conducted to examine the local impact of neighborhood characteristics on AD dementia. Karway et al (2024) analyzed data from Medicare beneficiaries using geospatial methods to assess the impact of cardiometabolic diseases (CMDs), smoking, and educational attainment on the incidence of AD. They computed county-level AD incidence, adjusted for age, sex, and race using the spatial Bayesian latent Gaussian approach. They reported strong agreement between the spatial patterns of AD and CMD, a moderate association with smoking, and a weak inverse relationship with educational attainment. 23 Using the same approach, Alhasan et al (2024) investigated the link between neighborhood attributes and AD dementia incidence in South Carolina. They reported that a higher AD dementia incidence was associated with poverty, air pollution, rural settings, and limited access to nutritious food. 24
Additionally, a limited number of geospatial models have examined the impact of socioeconomic, environmental, and lifestyle characteristics on AD dementia distribution. One widely used model to assess the regional impact of these variables on various health outcomes is geographically weighted regression (GWR),25,26 which relies on several challenging assumptions, such as local linearity between outcome and predictors, 27 and is sensitive to outliers and multicollinearities. 28 In contrast, random forest (RF), a nonlinear and nonparametric decision tree-based model, is not based on statistical assumptions and is less prone to overfitting and outliers.29,30 Nevertheless, non-spatial machine learning models such as RF, Boosted Regression Trees or Neural Networks do not inherently account for spatial heterogeneity. Geographically weighted random forest (GWRF) is a recently developed spatial machine learning algorithm that can address these limitations. 31 Conceptually, like GWR, the GWRF is calibrated using RF instead of ordinary least squares (OLS) regression and can address linearity and spatial dependence and does not rely on distributional assumptions. 32 Moreover, while spatial machine learning models like Geographically Weighted Artificial Neural Networks can handle nonlinearity and spatial non-stationarity, GWRF offers more straightforward implementation and interpretation. 33
Spatial variants of RF have been previously applied to model COVID-19,
28
hypertension,
34
malaria,
35
asthma,
36
physical inactivity,
37
and type 2 diabetes.
32
However, to our knowledge, studies have yet to examine spatial nonlinear machine learning models to study the prevalence of AD dementia. The National Institute on Aging (NIA) Health Disparities Research Framework advocates for the use of multidisciplinary methodologies to achieve health equity in aging research.
38
This framework organizes a spectrum of spatial factors, including environmental, sociocultural, and behavioral elements, to address health disparities, identify intervention targets, and elucidate causal pathways to health outcomes in aging populations.
38
Here, we compile various socioeconomic, environmental, and lifestyle variables together with related health outcomes as predictors to address the following research questions: 1. How does the spatial variability of socioeconomic, environmental, and lifestyle factors influence AD dementia prevalence at the county level in the U.S.? 2. How does the performance of GWRF compare to widely used spatial (GWR) and nonspatial (RF, OLS) models in explaining or predicting AD dementia prevalence? 3. How do the most influential variables vary across counties, and how their regional importance is ranked?
Our study extends previous geospatial research on AD dementia by being the first to apply a spatial nonlinear machine learning model to analyze its regional prevalence across the U.S. The findings can help public health policymakers to develop more relevant interventions to reduce disparities of AD dementia prevalence in the U.S. rather than a one-size-fits-all policy that might not be efficient on a broader scale.
Methods
Data Collection and Preparation
The estimated prevalence of AD dementia, serving as the outcome variable, was sourced from Dhana et al (2020) at the county level across the continental U.S. 39 This data comprises county-specific estimates for adults aged 65 and older, adjusted on demographic characteristics such as age, sex, race/ethnicity, and education, and was computed from cognitive data from population-based studies and the National Center for Health Statistics.
Predictors Used in This Study.
Variables that describe vulnerable communities before, during, and after disasters were developed by the Center for Disease Control and Prevention (CDC) as the social vulnerability index (SVI). 41 The SVI is constructed using county-level variables obtained from the American Community Survey (ACS) 2016-2020. The estimated rates per county were included for the following SVI variables: below 150% poverty, unemployed, housing cost-burdened, no health insurance, mobile homes, occupied housing with more people than rooms, households with no vehicle, and households without a computer with internet access. More comprehensive descriptions of variables are available from the CDC/ATSDR SVI 2020 Documentation. 42 Additionally, limited access to healthy foods was sourced from County Health Rankings and Roadmaps. 43
To assess the impact of air pollution on AD dementia prevalence, annual average estimates of outdoor concentrations of air pollutants such as O3, CO, NO2, and PM2.5 were used. These data were estimated using a land use regression model and retrieved from the Center for Air, Climate, and Energy Solutions (CACES) through a partnership with the Environmental Protection Agency. 44 Moreover, remotely sensed normalized difference vegetation index (NDVI) was obtained from Moderate Resolution Imaging Spectroradiometer (MODIS) satellite imagery to represent vegetation density on the ground with a 500 m spatial resolution. NDVI values are calculated from Red (R) and Near Infrared (NIR) spectral bands as (NIR-R)/(NIR+R), ranging from −1 to +1, with higher values indicating denser vegetation cover. 45 Zonal statistics function in ArcGIS Pro 3.3 (ESRI, Redlands, CA) Spatial Analyst Extension was used to compute the average NDVI per county across the continental U.S.
The age-adjusted prevalence of lifestyle variables per county was extracted from the CDC PLACES dataset, derived from the Behavioral Risk Factor Surveillance System (BRFSS), using a validated method for small-area estimation.46,47 BRFSS is the primary system for health-related telephone surveys nationwide, gathering information from U.S. residents about their health behaviors and chronic conditions. 48 The lifestyle variables encompassed the age-adjusted prevalence rates of binge drinking, current smoking, and lack of leisure-time physical activity. Additionally, the age-adjusted prevalence rates for various health outcomes, including depression, obesity, cognitive disability, diagnosed diabetes, hearing disability, and high blood pressure, were sourced from the same repository.
All collected data were compiled at the county level across the continental U.S. and were joined to the administrative boundary shapefile of counties sourced from the U.S. Census Bureau’s TIGER/Line repository. 49 Counties with several missing variables were removed from the geodatabase, resulting in a final set of n = 3032 counties. The linear correlations between the predictors were examined, and the highly correlated variables (|r| > 0.8) and those with high multicollinearity (ie, variance inflation factor (VIF) > 7.5) were excluded. 50 The final compiled geodatabase was projected to the USA Contiguous Albers Equal Area Conic projection, and the coordinates of the centroid for each county were added to the dataset using the Calculate Geometry function in ArcGIS Pro 3.3 for further geospatial modeling.
Linear Models
OLS
The baseline OLS model was employed to investigate the relationship between AD dementia prevalence and the predictors. This model has several assumptions, such as spatial stationarity across the study area, linear relationships, independence of observations, normality, random distribution of errors, and is sensitive to outliers. However, spatial dependencies can violate independence and randomness assumptions in practice. 51 The mathematical descriptions of the OLS model have been provided in Supplemental File S1.
GWR
To address spatial heterogeneity, geographically weighted regression (GWR), an extension of OLS, was utilized. 52 Mapping the locally estimated coefficients can help reveal hidden spatial patterns and the local influence of predictors on AD dementia prevalence. 53 Nevertheless, GWR retains assumptions of local linearity and independence of observations and remains sensitive to outliers. 27 Additionally, local multicollinearity can lead to instability in the parameter estimates. 54 The mathematical descriptions of GWR model have been provided in Supplemental File S1.
Nonlinear Models
Random Forest (RF)
In contrast to linear models, RF and its variants do not depend on specific relationship assumptions for the variables, 29 making them well-suited for capturing nonlinear relationships and interactions between variables. 55 RF is a non-parametric machine learning algorithm used for both classification and regression tasks.29,56,57 We randomly divided the entire dataset into two portions: a training set (2/3 of the data) and an out-of-bag (OOB) set (1/3 of the data). Using the training set, numerous unpruned regression trees were constructed through bootstrap sampling (ie, random sampling with replacement) and the random selection of a subset of variables. 58 The prediction accuracy was computed for each tree, and the final output was the average prediction from all regression trees. 59
The OOB data, which was not used during the training process, were used for two primary purposes: (1) evaluating the goodness of fit and overall model accuracy using metrics such as mean square error (MSE), coefficient of determination (R2), and mean absolute error (MAE), 60 and (2) determining the relative importance of each variable. 28 To assess variable importance, the values of each variable in the OOB set were randomly permuted, and the OOB error was recalculated. The variable was deemed important if the OOB MSE error increases (%IncMSE) with the permuted values. The greater the increase, the more influential the variable is in predicting AD dementia prevalence. 31 A detailed description of the RF is provided elsewhere.29,58
Geographically Weighted Random Forest (GWRF)
Although RF addresses the limitations of linearity assumptions and susceptibility to outliers in GWR, it remains a global model and fails to capture spatial heterogeneity. 61 The GWRF has recently been developed to overcome the stationary assumption of RF and the limitations of the GWR model. 62 Similar to GWR, GWRF involves locally calibrated models rather than a single global model. GWRF addresses spatial heterogeneity by computing local RF using nearest neighbors for each location based on the spatial weight matrix. 63 An adaptive kernel was used to identify an optimal number of nearest neighbors due to various sizes of counties and uneven distribution of centroid of counties in the U.S. 31 Detailed discussion on GWRF can be found elsewhere. 62
To enhance GWRF interpretability, permutation feature importance was used to rank the local variable importance for each county based on the %IncMSE. Further, we mapped the local variable importance to better understand where, and to what extent each variable affects AD dementia prevalence geographically. 28 Moreover, we estimated and mapped the local residuals and local goodness-of-fit statistics for training and OOB samples to assess the model performance in different regions. Consistent with other studies,29,30,34 we used the OOB method to assess model accuracy instead of relying on a separate validation set. The OOB method allows for internal validation during the RF algorithm run by measuring prediction error through bootstrap aggregating. This approach facilitates the calculation of the OOB MSE and R2. While both OOB and cross-validation are valid techniques for model validation, OOB is computationally more efficient as it evaluates model performance during training.29,30
Hyperparameter Tuning
RF and GWRF are less susceptible to overfitting due to the creation of many slightly different regression trees.29,64 They are robust prediction models when the hyperparameters are appropriately tuned, especially with large datasets. 29 To mitigate overfitting, we conducted a randomized grid search to optimize key RF hyperparameters using a dataset of 3032 samples. The tuned parameters included the number of features considered at each split (mtry), the fraction of data used per tree (sample.fraction), tree depth (max.depth), the minimum required samples at a leaf node (min.node.size), and the total number of trees (num.trees).28,32 A 10-fold cross-validation method was used to determine the optimal hyper-parameters from all possible combinations. We also used these hyperparameters for GWRF. Moreover, the optimal bandwidth (maximum number of neighbors to calibrate local models) in GWRF was selected based on minimized OOB error. 62 To further reduce overfitting, we limited the number of variables, acknowledging the constraints of county-level data, where the number of available data points is limited. After an initial RF training and tuning phase, we removed features with low importance scores (<10%).
We applied the Getis-Ord Gi* statistics to assess the clustering of high (hotspots) or low (cold spots) residual values within the entire dataset.65,66 Mathematical descriptions of the Getis-Ord Gi* statistics have been provided in Supplemental File S1. All statistical analyses were conducted using randomForest, SpatialML, and CARET packages in R, with mapping performed in ArcGIS Pro 3.3.
Results
Preliminary descriptive statistics showed that the AD dementia prevalence (%) in the continental U.S. ranged from 5.6% in Lovington County, Texas, to 18.4% in Presidio County, Texas. The mean prevalence rate (adjusted for age, sex, race/ethnicity, and education) was 11.17%, with a standard deviation of 1.43%. The highest rates were predominantly found in southern states, including North Carolina, South Carolina, Georgia, (southern) Florida, Tennessee, Alabama, New Mexico, and (southern) Texas. The states with the lowest (mean) prevalence rates, respectively, were Utah (9.70%), Vermont (9.76%), and Idaho (9.87%), while those with the highest (mean) rates were Mississippi (13.10%), South Carolina (12.46%), and Louisiana (12.42%). Among the 24 initial predictors, four variables (ie, below 150% poverty, the age-adjusted prevalence of cognitive disability, diagnosed diabetes, and high blood pressure) were removed due to a high correlation with other variables or high VIF. After an initial RF training and tuning phase, we removed the variables with low importance scores (<10%). In Supplemental Table S1 and Figure S1, we present the correlation coefficients for the 13 final predictors, all of which are below the threshold of |0.8|, with VIFs under 7.5, indicating low multicollinearity.
Comparative Performance of the Models for AD Dementia Prevalence Rates in the US.
aNot applicable.
To avoid overfitting before fitting the RF and GWRF models, we tested various combinations of hyper-parameter values through 10-fold cross-validation. The optimal hyperparameters were based on the adaptive kernel, bandwidth = 216 nearest neighbors, ntree = 1000, mtry = 6, sample.fraction = 0.3, min.node.size = 9, and max.depth = 4. For comparison, these parameters were fixed for both the training and OOB datasets of RF and GWRF models. Based on the OOB accuracy assessment, GWRF demonstrated significantly improved performance, showing a 34.5% increase in R2 (Table 2). Additionally, due to the local implications and interpretations of GWRF, such as spatially varying relationships, local feature importance, flexible bandwidth selection, and integration with GIS for visualization and targeted decision-making, GWRF was used for further inferences.
According to permutation-based feature importance, the variable “Households with no Vehicle” was the most significant contributor (highest %IncMSE), followed by “Binge Drinking”, “Obesity”, “Physical Inactivity”, “Depression”, “PM2.5”, and “Mobile Homes”. However, among these 13 variables “No Health Insurance” was the least contributing variable to AD dementia prevalence. Figure 1 depicts the relative importance of variables to AD dementia prevalence. The relative importance of variables to AD dementia prevalence using permutation-based feature importance. A higher increase in MSE (%IncMSE) corresponds to higher importance.
The percentage of “Mobile homes” was the leading primary factor for the prevalence rate of AD dementia in most U.S. counties (n = 602, 19.9%). “NDVI” was the most influential factor in Central and Northern U.S. counties (n = 528, 17.4%), notably in Minnesota, Wisconsin, Missouri, and Oklahoma. “Physical inactivity” was the primary factor in south and southwestern counties (n = 391, 12.9%), predominantly in Arizona, New Mexico, Colorado, Utah, and southern Texas. “Households with no vehicle” was the main influential factor in (n = 342, 11.3%) of U.S. counties, particularly in most counties of Arkansas, Delaware, New Jersey and Illinois. Figure 2A illustrates the geospatial distribution of local primary factors of AD dementia prevalence across the U.S. along with the corresponding county counts as shown in Figure 2B. (A) Geospatial distribution of local primary factors associated with AD dementia prevalence across the U.S. (B) County-level counts for these factors.
The local effects of the top important variables were mapped, regardless of their primary status. ‘NDVI’ had the largest and most widespread impact in northern and central states, with a small cluster of counties in Texas, while having the lowest impact in the eastern and western regions. ‘Mobile Homes’ significantly influenced AD dementia prevalence in the northern and northeastern U.S., particularly in North Dakota, South Dakota, Pennsylvania, and New York, with the least impact in the south, northwest, and central regions. ‘Physical inactivity’ had the highest impact in the southern half of the U.S., notably in New Mexico, Arizona, Colorado, Louisiana, Mississippi, and South Carolina. ‘Depression’ most affected the southern half of the U.S., particularly California, Virgina and North Carolina and had the least impact on the northern states. ‘Obesity’ almost showed the same geographic pattern as ‘Physical Inactivity’. ‘PM 2.5’ had the most significant effects in the western half of the U.S., especially in northern Colorado, Wyoming, Arizona, and Nevada, and northeastern states. ‘Binge drinking’ and ‘Households with no vehicle’ exhibited nearly identical geographic patterns, with the highest impacts in southern states, particularly Alabama and Missouri, and the lowest impacts in the western and northern U.S. Figure 3 illustrates the local impact of the top variables on AD dementia prevalence in the U.S. The local impact of the top variables on AD dementia prevalence in the US.
The local R2 of the GWRF model in explaining the variance of AD dementia prevalence ranged from 0.04 to 0.84 (median = 0.43, standard deviation = 0.16). The model tended to perform better in most counties in the southern, and northeastern regions (local R2 > 0.7), while it underperformed in the central and northwestern U.S. (local R2 < 0.2). Figure 4A illustrates the local R2 of the GWRF model. The Getis-Ord Gi* analysis showed that the residuals of the GWRF model were mainly randomly distributed, with a few hotspots or cold spots indicating areas of overprediction or underprediction (Figure 4B). The geospatial distribution of (A) Local R2 of the GWRF model and (B) Getis-Ord Gi* hotspots/cold spots of residuals.
Discussion
Inspired by the NIA’s Health Disparities framework 38 and leveraging the foundational work by Livingston et al (2020), 40 we compiled a geodatabase of socioeconomic, environmental, and lifestyle variables along with other pertinent health outcomes at the county level in the continental U.S. Utilizing advanced spatial technologies such as GIS, remote sensing, spatial statistics, and machine learning model, we conducted a geospatial machine learning modeling of the distribution of AD dementia prevalence to better understand the local impacts of these variables in the continental U.S. The results of our study can guide interventions such as enhancing green spaces to improve cognitive health, reducing air pollution, and promoting physical activity through initiatives, for instance, for mobile home communities. The findings of this study can serve as the foundational baseline to support the development of cost-effective, place-based interventions and more refined policy development across the nation.
We sourced the AD dementia prevalence data from the study by Dhana et al (2023). 39 The advantage of using this data, as compared to conventional estimates relying on medical reports, claims, and national surveys, is its mitigation of significant underestimation, particularly among racial and ethnic minority groups. Conventional methods often miss cases due to discrepancies between data sources and systemic diagnostic biases. In contrast, the data used in this study incorporated findings from a diverse population-based study and adjusted for demographic variations (ie, age, sex, race/ethnicity, and education) at the county level, providing a more accurate assessment of the factors impacting AD dementia prevalence in the U.S. 39 African American and Hispanic older adults face a potentially 2-fold higher risk of AD dementia. 67 Therefore, the involvement of racial and ethnic minority groups in AD clinical research is emphasized.68,69
Geospatial modeling of AD dementia remains relatively sparse, with conventional geospatial linear models like GWR constrained by local linearity, sensitivity to outliers, and multicollinearity.27,28 Thus, our study adopted a nonlinear spatial approach to circumvent these limitations while incorporating variable interactions, resulting in superior performance to conventional linear models. Our findings confirmed previous research on other health outcomes using GWRF.36,70 Although the model exhibited proper performance in most counties, it showed low goodness-of-fit in central and northwest U.S. counties, indicating the necessity for integrating additional critical variables and enhancing data quality.
Recent studies have shown the influence of environmental factors on AD dementia, including NDVI71-73 and criteria air pollutants.74-76 The permutation-based feature importance analysis suggested that among environmental factors, NDVI was found to be the most important factor in 17.4% of U.S. counties. Counties most influenced by this factor lie in the northern and central regions. Numerous studies have linked NDVI, or neighborhood greenness, to improved cognitive function and MRI results. 12 A 13-year cohort study involving over 1.1 million older adults from 2001 to 2014 indicated that an increase of one interquartile range in surrounding greenness was associated with a 4%-5% decrease in premature mortality from all neurodegenerative diseases, including AD. 77 However, one study found that this association is only significant for those with low genetic risk. 78 The protective effect of NDVI likely due to its ability to improve air quality, increase physical activity, and reduce stress, 79 while also enhancing cortical thickness, reducing ventricular size, and promoting neuroprotection, all of which may help mitigate AD risk. 80 Although NDVI is among the most influential factors in local modeling, it is not highly ranked in the global model. This discrepancy likely stems from methodological differences between global and local models. Local models, trained exclusively on neighboring data, are more sensitive to spatially clustered factors, whereas these localized effects become diluted in the global model. This sensitivity to local patterns is a key advantage of spatial modeling, allowing for the detection of trends that may be obscured in a global analysis.
PM 2.5 emerged as the other important environmental variables and the primary factors in 10.4% of U.S. counties, notably in northeastern and western states. This concurs with previous studies associating air pollution with cognitive decline.81,82 PM 2.5, typically composed of heavy metals such as nickel, lead, cadmium, arsenic as well as organic carbon, can easily cross the blood-brain barrier and reach the central nervous system.83,84 They can induce systemic inflammation, cell death, and DNA damage in the human brain by producing reactive oxygen species, which are critical for developing AD dementia.85,86
Lack of leisure-time physical activity was identified as the most important modifiable lifestyle variable in explaining AD dementia prevalence. It was found as the primary variable in 12.9% of U.S. counties, particularly affecting New Mexico, Arizona, Colorado, Louisiana, and Mississippi. A review article of various study designs indicated that higher levels of physical activity are linked to a lower risk of late-life dementia. Physical activity can enlarge prefrontal and hippocampal brain areas, potentially mitigating memory impairments. 87 Possible reasons for the regional impact might include access to recreational facilities, cultural attitudes toward physical activity, and environmental factors that either promote or hinder an active lifestyle. For instance, a cross-sectional analysis of data from 50 884 women revealed that those residing in areas with the highest exposure to greenness were more inclined to participate in higher levels of physical activity (more than 67.1 MET hours per week) compared to those in areas with the lowest level of greenness. 88 ‘Binge drinking’ was another most important lifestyle variable that impacted AD dementia prevalence and the primary important variable in 9.5% of U.S. counties. A 25-year follow-up analysis among a cohort of 554 Finnish twins aged 65 years and older at the time of assessment revealed that midlife binge drinking was associated with a relative risk of 3.2 for dementia. 89 ‘Binge drinking’ can markedly worsen cognitive deficits. 90 A review study investigated how binge drinking is associated with the development of cognitive impairment among young adults in the UK. They found deficits in brain regions such as the frontal lobe, temporal lobe, and hippocampus. However, the studies lacked sufficient evidence to generalize the findings. 91
Among the socioeconomic factors, ‘Mobile homes’ and ‘Households with no vehicle’ were identified as the most significant variables impacting the prevalence of AD dementia. These factors were primary variables in 19.9% and 11.3% of U.S. counties, respectively. While the exact mechanisms linking ‘Mobile homes’ to AD dementia are unclear, possible explanations include the increased socioeconomic stress faced by residents due to financial instability, limited healthcare access, and increased exposure to air pollution, 92 —as mobile homes are typically found in low-income rural areas with limited access to interstate highways or public transit— 93 all of which can contribute to cognitive decline. Additionally, ‘Households with no vehicle’ can exacerbate isolation and restrict access to healthcare services and social activities, all of which are crucial for maintaining cognitive health, thereby increasing the prevalence of AD dementia.
While this study offers valuable insights, it does have its limitations. The data utilized in this study were primarily based on model-derived estimations of AD dementia prevalence and telephone survey responses from BRFSS, which are susceptible to recall and social desirability biases. The AD dementia prevalence data obtained from 39 focused on three major racial/ethnic groups (Black/African American, Hispanic, and White) and simply assumed that other races (ie, Asian American and American Indian/Alaska Native individuals) have a similar risk to White individuals. The AD dementia data may also be subject to selection biases, particularly due to the underrepresentation of racial/ethnic groups in clinical diagnoses, which could skew both prevalence estimates and diagnosis rates at the county level. Evidence of such bias is highlighted by Lennon et al (2022), 94 who used data from the National Alzheimer’s Coordinating Center and found that Black participants in AD research studies had 35% lower odds of receiving a dementia diagnosis at the initial visits than White participants. Additionally, due to the ecological fallacy, the conclusions cannot be extended to the sub-county or individual levels and should only be interpreted at the county level. Another limitation that might reduce the statistical power of the analysis is the absence of various key variables in certain states, notably Florida, and particularly Alaska and Hawaii, where each has a racial and ethnic minority group of approximately 40%. This led to the exclusion of these states from the analysis. While GWRF captures local spatial variations, its findings may not transfer well if key predictors, data distributions, or geographic characteristics differ. Differences in spatial scale can further affect its generalizability. Compared to linear models, GWRF only captures the relative contribution of each explanatory variable to AD dementia prevalence through variable importance, without assessing the actual effects of these variables. To enhance interpretability, future work could integrate spatial models with methods like Shapley Additive Explanations, which would provide a clearer understanding of the effects of explanatory variables. 95 Moreover, it should be acknowledged that the feature importance technique used in this study can be impacted by inter-feature correlation and multicollinearity, which can lead to misleading importance scores as permuting one feature may not accurately reflect its individual impact due to the correlated variables. Although we included a selection of well-recognized variables, future research should incorporate distinctions between rural and urban settings and consider the built environment and social isolation variables. Additionally, conducting research at finer spatial scales while maintaining a national scope is crucial for more targeted interventions. Subsequent research should involve spatial causal inference and mediation analyses to identify mediators and mechanisms.
Conclusions
In conclusion, our findings underscore the considerable spatial variability in the factors associated with AD dementia prevalence across U.S. counties. This suggests that local and regional governments should implement targeted interventions based on specific regional variables. For example, some areas may benefit from prioritizing improving green space and reducing air pollution, while others might focus on modifying lifestyle variables such as increasing physical activity and reducing binge drinking. Moreover, the fit of the local model highlights the urgent need for more comprehensive data collection and analysis to better understand how disparities impact AD dementia prevalence. Together, these insights can pave the way for more nuanced research and targeted interventions to mitigate the prevalence of AD dementia in the U.S.
Supplemental Material
Supplemental Material - Alzheimer’s Disease Dementia Prevalence in the United States: A County-Level Spatial Machine Learning Analysis
Supplemental Material for Alzheimer’s Disease Dementia Prevalence in the United States: A County-Level Spatial Machine Learning Analysis by Abolfazl Mollalo, PhD, George Grekousis, PhD, Hermes Florez, MD, PhD, Brian Neelon, PhD, Leslie A. Lenert, MD, and Alexander V. Alekseyenko, PhD in American Journal of Alzheimer’s Disease & Other Dementias®
Footnotes
Acknowledgments
The authors sincerely thank Dr. Andreana Benitez for her insightful feedback on an earlier version of this manuscript.
Statements and Declarations
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: AM and AVA are supported by the South Carolina SmartState Endowed Center for Environmental and Biomedical Panomics (CEABP); AVA is supported by South Carolina Cancer Disparities Research Center (SC CADRE) from NIH/NCI U54 CA210962.
Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
Data sharing is not applicable as no new data were generated or analysed during this study.
Supplemental Material
Supplemental material for this article is available online.
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
