Abstract
Background
A third-variable effect refers to the effect of a third-variable that explains an observed relationship between an exposure and an outcome. Depending on whether there is a causal relationship from the exposure to the third-variable, the third-variable is called a mediator or a confounder. Multilevel mediation analysis is used to differentiate third-variable effects in data with hierarchical structures.
Data Collection and Analysis
We developed a multilevel mediation analysis method to deal with time-to-event outcomes and implemented the method in the R package mlma.
Results
We found that the racial disparity in survival was mostly explained at the census tract level and partially explained at the individual level. The associations among variables were depicted.
Conclusion
The multilevel mediation analysis method can be used to differentiate mediation/confounding effects for factors originating from different levels. The method is implemented in the R package mlma.
Background and Introduction
Health disparities exist widely in the United States (US). One example lies in breast cancer outcomes. Owing to advanced screening methods that detect breast cancer at an early stage and to improved treatments, the overall death rate from breast cancer in the US has decreased in recent years. However, compared with White women, African American women diagnosed with breast cancer have higher recurrence and death rates despite a lower incidence rate.1–15 Understanding the factors that account for these disparities is imperative to inform regulations, interventions, and treatments to reduce them.
There is consensus that individual behaviors and physical and social environments collectively contribute to disparities in breast cancer outcomes. However, little is known about the relative contribution of behavioral factors (e.g., smoking status), or the relative contribution of any specific neighborhood or community context, to the disparities. This is due to the lack of comprehensive data sets that include both individual-level and environmental risk factors and, more importantly, the lack of a statistical modeling method that can differentiate the intermediate effects on different paths across multiple levels (e.g., environmental and individual risk factors) to explain the observed disparities.
Mediation analysis is used to differentiate a third-variable (e.g., mediator or confounder) effect that intermediates an observed relationship between an exposure variable and an outcome variable.15–25 In mediation analysis, besides the pathway that directly connects the exposure variable with the outcome, we explore the
For research exploring racial/ethnic disparities in breast cancer survival, risk factors are collected hierarchically at both the individual and the residential neighborhood levels. In such situations, since patients living in the same neighborhood cannot be considered independent of each other, mediation analysis based on generalized linear models, where all patients are assumed to be independent, cannot be used directly to fit relationships among variables. Multilevel or mixed-effect models are more appropriate since these models can account for dependencies among nested observations. Much research has been done on multilevel mediation analysis. Ref. 26 studied the bias introduced by using single-level models when data are hierarchical. Refs. 27–33 proposed mediation analysis methods for different types of multilevel models. In addition, Ref. 34 proposed using Bayesian mediation analysis to deal with hierarchical databases. Yu and Li 35 extended the definitions of third-variable effects (mediation or confounding effects) to multilevel data structures and proposed using multilevel additive models to fit variable relationships. Using their method, the effects from multilevel paths relating exposure(s) through third-variable(s) to outcome(s) can be estimated. However, all these methods deal with continuous or binary outcomes only. In this paper, we extend multilevel mediation analysis to deal with time-to-event outcomes. The extended method is implemented in the R package mlma.
Multilevel mediation analysis with time-to-event outcomes
In mediation analysis, besides the pathway that directly connects the predictor variable with the outcome (direct effect), we explore the
[Figure: Conceptual model for multilevel mediation analysis.]
Multilevel additive models on contributing third-variables
We first model the relationship between predictors and third-variables. To build relationships among variables, we use the multilevel additive model, a nonlinear regression method first proposed in Ref. 36. A multilevel additive model can deal with both nonlinear associations and cluster-specific heterogeneity. 37
Assume that we have
For level-2 third-variables,
For level-1 third-variables,
In Equations 1 and 2,
The multilevel additive proportional hazard model
In our proposed multilevel mediation analysis, the multilevel proportional hazard function is used to fit the relationship between a time-to-event outcome and all other variables (e.g., the predictors, third-variables, and covariates). A multilevel proportional hazard model has the following format
Assume there are
There can also be random slopes for risk factors. In the mediation analysis, a random slope for a risk factor means that the indirect effect through the risk factor can differ among groups. That is, the effect of the risk factor on the hazard rate is moderated by group. Here, we restrict the multilevel mediation analysis to random intercept models since the purpose is to account for correlations among subjects.
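As a point of reference, a generic random-intercept (shared frailty) proportional hazards model can be sketched as follows. The symbols used here ($h_{ij}$, $h_0$, $u_j$, $\mathbf{x}_{ij}$, $\boldsymbol{\beta}$, $\sigma_u^2$) and the normal distribution for the random intercept are illustrative assumptions and need not match the notation or distributional choice of the model used in this paper.

```latex
h_{ij}(t) = h_0(t)\,\exp\!\left(u_j + \mathbf{x}_{ij}^{\top}\boldsymbol{\beta}\right),
\qquad u_j \sim N\!\left(0, \sigma_u^2\right)
```

In this sketch, $h_{ij}(t)$ is the hazard for subject $i$ in group $j$, $h_0(t)$ is a baseline hazard shared by all subjects, $\mathbf{x}_{ij}$ collects the predictors, third-variables, and covariates, and $u_j$ is the group-specific random intercept that induces correlation among subjects in the same group. A random slope would add a group-specific coefficient on one of the components of $\mathbf{x}_{ij}$.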
Using the notations in the
Multilevel third-variable effects inference and interpretation
Based on the definitions of third-variable effects by Ref. 35, we derive the direct and indirect effects based on models (1), (2), and (4). In the following,
With the relationships among variables built by models (1), (2), and (4), the derived third-variable effects for level-1 exposure variable
A level-2 predictor can have both level-1 and level-2 third-variables. The derived third-variable effects for level-2 predictor
The interpretations of these third-variable effects are similar to those for the level-1 predictors. For each predictor-outcome pair, there is a set of total, direct, and indirect effects.
Finally, to calculate the variances of the estimated third-variable effects, two methods are used: (1) the Delta method based on the normal approximation of the estimates and (2) the nonparametric bootstrap method. Both methods are implemented in the R package mlma.
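The two variance approaches named above can be illustrated with a minimal Python sketch. It is not the package's implementation: the function names `delta_se_product` and `bootstrap_se`, the product-of-two-estimates form of the indirect effect, the assumption that the two estimates are uncorrelated, and all numbers are hypothetical choices made for illustration.

```python
import math
import random
import statistics


def delta_se_product(a, se_a, b, se_b):
    """First-order Delta-method standard error of the product a * b.

    Assumes the two estimates are uncorrelated (an assumption of this
    sketch); otherwise a covariance term would have to be added.
    """
    variance = (b ** 2) * (se_a ** 2) + (a ** 2) * (se_b ** 2)
    return math.sqrt(variance)


def bootstrap_se(units, statistic, n_boot=2000, seed=0):
    """Nonparametric bootstrap standard error of statistic(units).

    `units` is a list of resampling units. For clustered data, each unit
    could hold one whole cluster so that clusters are resampled intact.
    """
    rng = random.Random(seed)
    n = len(units)
    replicates = []
    for _ in range(n_boot):
        # Draw n units with replacement and recompute the statistic.
        resample = [units[rng.randrange(n)] for _ in range(n)]
        replicates.append(statistic(resample))
    return statistics.stdev(replicates)


# Hypothetical path estimates: exposure -> third-variable (a) and
# third-variable -> outcome (b), with their standard errors.
se_indirect = delta_se_product(a=2.0, se_a=0.1, b=3.0, se_b=0.2)

# Hypothetical data: bootstrap the standard error of a sample mean.
values = [float(v) for v in range(100)]
se_mean = bootstrap_se(values, statistics.fmean)
```

The Delta method is cheap but leans on the normal approximation, while the bootstrap makes fewer distributional assumptions at the cost of refitting the statistic many times.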
The R package mlma
The R package
In the second step, two tests are performed: (1) check the importance of potential third-variables in estimating outcome(s) when all transformed variables are used for the estimation and (2) test the association between each potential mediator/confounder and predictor(s). The function
Finally, the function
SEER data to explore the racial/ethnic disparity in breast cancer survival
We applied the proposed method to explore the racial/ethnic disparity in breast cancer survival, taking into account tumor characteristics, individual demographics, and census tract level residential environmental factors.
Data sources
For the individual-level dataset for breast cancer patients, we use data from the California population-based cancer registries of the Surveillance, Epidemiology, and End Results (SEER) program of the National Cancer Institute (NCI). Patients from the Alaska Native Tumor Registry are excluded because additional confidentiality constraints are in place: those data cannot be used in any analyses that involve census tracts. The SEER Program registries routinely collect patient-level data from medical records, including patient demographics, tumor characteristics, cancer stage at diagnosis, the first course of treatment, and follow-up results (vital status, date of the last contact, and cause of death). The residential census tracts for patients at diagnosis can provide important information for exploring area attributes in cancer research. However, because of patient privacy concerns, the publicly available research data usually do not include the geographic location of patient residential areas at a level more specific than the county. Ref. 40 developed a method to provide multiply imputed, synthetic census tracts as a supplement to cancer registry data. The synthetic census tract identifier has been shown to produce similar cancer statistics by census tract based socioeconomic variables. To evaluate the usefulness of cancer registry data with synthetic census tracts in preserving statistical validity for more complex analyses, and to explore ways of safely releasing confidential data, the NCI funded a validation project for researchers to propose useful analyses of cancer registry data with census tracts. Selected researchers develop analysis plans and write statistical programming code using the synthetic census tracts. The NCI then runs the researcher-provided code on the real census tract data behind the firewall and returns the real data results to the researchers after the results are cleared by the NCI disclosure avoidance review.
All analysis results presented hereafter in this paper are based on real cancer registry data and are cleared to be published.
To explore the racial/ethnic disparity in survival among breast cancer patients, this study includes all females diagnosed with primary invasive breast cancer between 2006 and 2017, excluding those diagnosed through autopsy or death certificate. Out of 237,167 cases, 77.95
For the census tract level environmental data, we downloaded variables from the California Healthy Places Index (HPI). The HPI was developed by the Public Health Alliance of Southern California (Alliance) in partnership with the Virginia Commonwealth University’s Center on Society and Health. In addition to the overall HPI score, the index also contains eight sub-scores in the areas of economic, education, housing, health care access, neighborhood, clean environment, transportation, and social factors. The measurements are standardized to percentiles so that census tracts in California are comparable to each other. Readers are referred to the website http://healthyplacesindex.org for details and data downloading. The NCI work group linked the HPI variables at the census tract level with each patient and performed the analysis for this study. The R code for all analyses is provided in Section 1 of the Supplementary Material.
Descriptive analysis
To explore factors contributing to racial/ethnic disparities in breast cancer mortality, we first applied the mediation analysis of Ref. 15 to identify important third-variables that may explain the observed disparities. Multiple additive regression trees (MART) were used to build relationships among variables. MART is a tree-based ensemble method of data mining. 41 In the mediation analysis, MART is used for exploratory and inference purposes. We benefit from the following properties of MART. First, MART can model nonlinear relationships between the dependent and independent variables. Second, owing to the hierarchical splitting scheme in regression trees, MART naturally captures multilevel data structure. Third, there are established tools for tree-based methods that help depict relationships among variables. 42 Fourth, MART can handle different types of outcomes. Yu et al. 15 used MART in mediation analysis to explore time-to-event outcomes. Here, we use their method to identify important third-variables at both the individual and environmental levels. The results are then used to guide the variable selection and transformation in the multilevel additive models (1) and (4).
We use the
[Figure: Mediation analysis results based on multiple additive regression trees.]
As a result, all eleven individual-level variables and two environmental-level variables are selected as potential third-variables. The individual-level variable, year of diagnosis, is included as a covariate. The two environmental factors are
We then check how the selected variables relate to the hazard rate and how their distributions differ across race/ethnicity groups. Based on the fitted graphs from MART, we decide how to transform the third-variables so that the transformed variables are roughly linearly related to the hazard rate. For example, Figure 3 shows the variable associations of
The multilevel mediation analysis
As a result of mediation analysis based on MART, we select both individual and environmental level risk factors. Continuous variables are transformed according to their relationship with the hazard rate and the categorical variables are binarized so that a
We input only one exposure variable: the race/ethnicity of the individual patient. There are three racial/ethnic groups; therefore, two dummy variables are created as the level-one exposures: one is 1 for Black patients and the other is 1 for patients of Other race/ethnicity. Since there are level-two (environmental level) third-variables, level-two exposure variables are created automatically.
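The exposure coding described above can be sketched in a few lines of Python. This is an illustrative assumption rather than the package's own code: the records, the tract identifiers, and the helper names `level1_dummies` and `level2_exposures` are hypothetical, and the level-two exposure is taken here to be the within-tract proportion of each group, which is consistent with the census tract level results being read in terms of the proportion of Black patients.

```python
from collections import defaultdict

# Hypothetical records: (census_tract_id, race_ethnicity).
patients = [
    ("tract_A", "White"),
    ("tract_A", "Black"),
    ("tract_A", "Black"),
    ("tract_A", "Other"),
    ("tract_B", "White"),
    ("tract_B", "Other"),
]


def level1_dummies(race):
    """Two indicator variables, with White as the reference group."""
    return {"black": int(race == "Black"), "other": int(race == "Other")}


def level2_exposures(records):
    """Tract-level exposures: within-tract mean of each indicator."""
    totals = defaultdict(lambda: {"black": 0, "other": 0, "n": 0})
    for tract, race in records:
        dummies = level1_dummies(race)
        totals[tract]["black"] += dummies["black"]
        totals[tract]["other"] += dummies["other"]
        totals[tract]["n"] += 1
    return {
        tract: {"black": t["black"] / t["n"], "other": t["other"] / t["n"]}
        for tract, t in totals.items()
    }


tract_exposures = level2_exposures(patients)
```

With White as the reference group, each level-one coefficient compares one group with White patients, and each level-two exposure describes the racial/ethnic composition of the tract.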
Estimation results with the multilevel mediation analysis. The effects are relative effects except for the total effects.
From Table 1, we see that on average, the hazard rate for individual Black patients is about 49% (= exp(0.396) − 1) higher than that for White patients. At the census tract level, a higher proportion of Black patients is also associated with a higher hazard rate (the confidence interval (0.353, 0.607) lies to the right of 0). The Other race/ethnicity group has an average hazard rate that is 65.84
All other estimates are in terms of the relative effect, which is defined as the estimated (in)direct effect divided by the total effect. A confidence interval that includes 0 means that the (in)direct effect is not significant after adjusting for other variables. A relative effect can be negative, which indicates that the estimated effect is in the opposite direction to the total effect.
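The definition of the relative effect can be made concrete with a small Python sketch. The function names and all numbers below are hypothetical and are not taken from Table 1; the sketch only assumes, for illustration, that the direct and indirect effects add up to the total effect.

```python
def relative_effects(total_effect, effects):
    """Divide each (in)direct effect by the total effect."""
    if total_effect == 0:
        raise ValueError("relative effects are undefined when the total effect is 0")
    return {name: value / total_effect for name, value in effects.items()}


def interval_excludes_zero(lower, upper):
    """True when a confidence interval lies entirely on one side of 0."""
    return lower > 0 or upper < 0


# Hypothetical effects on the log-hazard scale.
total = 0.40
components = {"direct": 0.26, "stage": 0.18, "age": -0.04}

relative = relative_effects(total, components)
# The negative value for "age" marks an effect that runs against the
# direction of the total effect.
```

When the components add up to the total effect, the relative effects sum to 1, so each can be read as the share of the total effect attributed to that path.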
Conclusions
We explain the results for Black versus White patients and Other race/ethnicity group versus White patients separately.
Comparing Black with White patients
Comparing Black with White patients at the individual level, the direct effect is estimated at 64.08
Age at diagnosis had a negative relative effect, contributing −10
The effects of other third-variables can be explained similarly. Figures similar to Figure 4 are provided for each potential third-variable in the online Supplementary Material.
For the census tract level variables, the estimated total effect is also positive, which implies that an increased proportion of Black breast cancer patients in the census tract is related to an increased hazard rate. The direct effect at the census tract level now becomes significantly negative, which means that after adjusting for all other factors, a higher proportion of Black patients is associated with a decreased mortality rate. An interesting factor is the ranking percentile of bachelor’s degrees. The variable was transformed using a B-spline with two degrees of freedom. Therefore, there are two coefficients fitted to the transformed variables.
Comparing patients of Other race/ethnicity with White patients
At both the individual and the environmental levels, breast cancer patients of Other race/ethnicity had a lower hazard rate than White patients. At the individual level, the average hazard rate for patients of Other race/ethnicity is 65.83
After adjusting for other factors, 77.88
Other factors explained all the differences in survival at the census tract level. The relative direct effect is only 1.07
Conclusions and future research
In this paper, we develop a multilevel mediation analysis method for time-to-event outcomes. Frailty models are used to account for correlations among patients living in the same residential environment. With the proposed method, data with a hierarchical structure can be analyzed, and third-variable effects (confounding or mediating effects) are differentiated across multiple levels. We also expand a previously developed R package, mlma, to implement the proposed method. The method is used to explain the racial/ethnic disparities in breast cancer survival, taking into account both individual-level and census tract level risk factors. We found that a large proportion of the racial/ethnic differences at the individual level remained unexplained. As a future research direction, we would like to collect genetic data and individual-level behavioral data (e.g., smoking and physical activity) among breast cancer patients to check whether those variables can help further explain the observed racial/ethnic disparities at the individual level. In comparison, most of the differences at the census tract level were explained. Both the average educational level (proportion of bachelor’s or higher degrees) and the life expectancy at birth played important roles in explaining the racial/ethnic disparities in breast cancer survival at the census tract level.
The multilevel mediation analysis method works well in this application. Since generalized linear regression models are used for the analysis and many risk factors are highly correlated with each other, we plan to deal with the potential collinearities in analysis by implementing regularized regression methods in the multilevel model fitting. We have successfully used elastic-net regularized regressions in the single level mediation method. 43 As a next step, we will further develop a regularized multilevel mediation analysis to deal with high dimensional and potentially highly correlated third-variables.
In addition, the frailty models used in this paper include random intercepts only. In future research, we will extend the multilevel mediation analysis method to handle random slopes, so that heterogeneous third-variable effects are allowed at higher levels.
Supplemental Material
Supplemental Material, sj-pdf-1-rmm-10.1177_26320843211061292 for Multilevel mediation analysis on time-to-event outcomes: Exploring racial/ethnic disparities in breast cancer survival in California by Qingzhao Yu, Mandi Yu, Joe Zou, Xiaocheng Wu, Scarlett L Gomez and Bin Li in Research Methods in Medicine & Health Sciences
Footnotes
Acknowledgments
We acknowledge the National Cancer Institute’s Surveillance, Epidemiology, and End Results Program for the research award (75N91020P00728) and for linking and providing data for this study. Part of this research was conducted with high performance computational resources provided by the Louisiana Optical Network Infrastructure.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was partially supported by the National Institute On Minority Health And Health Disparities of the National Institutes of Health under Award Number R15MD012387, and the National Institute of Environmental Health Sciences under the Award Number P42ES013648 and its administrative supplement P42ES013648-09S2.
