
Abstract

The Integrated Clinical and Environmental Exposures Service (ICEES) provides open regulatory-compliant access to clinical data, including electronic health record data, that have been integrated with environmental exposures data. While ICEES has been validated in the context of an asthma use case and several other use cases, the regulatory constraints on the ICEES open application programming interface (OpenAPI) result in data loss when using the service for multivariate analysis. In this study, we investigated the robustness of the ICEES OpenAPI through a comparative analysis, in which we applied a generalized linear model (GLM) to the OpenAPI data and the constraint-free source data to examine factors predictive of asthma exacerbations. Consistent with previous studies, we found that the main predictors identified by both analyses were sex, prednisone, race, obesity, and airborne particulate exposure. Comparison of GLM model fit revealed that data loss impacts model quality, but only with select interaction terms. We conclude that the ICEES OpenAPI supports multivariate analysis, albeit with potential data loss that users should be aware of.

Keywords

open clinical data, open application programming interface, generalized linear model, environmental exposures, asthma

Introduction

Electronic health record (EHR) data, while intended to serve an administrative function, offer a valuable resource for clinical and translational researchers. However, the regulatory and institutional restrictions placed on access to EHR data, while critical to protect patient privacy, often impose challenges to data dissemination for research purposes. To address such issues, several groups have developed approaches for openly exposing EHR data to support clinical and translational research, while preserving patient privacy and abiding by all federal and institutional regulations.^1,2 One recent contribution is the Integrated Clinical and Environmental Exposures Service (ICEES).^3–5 ICEES provides open regulatory-compliant access to EHR data and other sources of clinical data that have been integrated at the patient level with a variety of public exposure data. The data are accessible via an open application programming interface (OpenAPI). ICEES supports several use cases and cohorts and is equipped with basic tools for exploratory and bivariate analyses.

ICEES bivariate functionalities were recently extended to support multivariate analysis using the ICEES OpenAPI.⁶ The approach was designed to be applied on a query-by-query basis, and data loss occurs due to regulatory constraints on the OpenAPI. In particular, cohorts of 10 or fewer patients cannot be created or accessed due to privacy concerns. Our preliminary work suggested that data loss is minimal; however, we did not systematically quantify data loss in that pilot work. With a growing number of available feature variables and a theoretically endless number of user queries, it is important to determine the analytic bounds of multivariate analysis using the ICEES OpenAPI. A systematic comparison of the multivariate analytic capabilities achievable with data extracted from the ICEES OpenAPI versus the underlying, constraint-free source data is essential to quantify the impact of data loss, determine the analytic bounds of the open service, and inform multivariate model development.

In the present study, we systematically evaluate the robustness, validity, accuracy, and specificity of knowledge and assertions generated when applying a multivariate model to the ICEES OpenAPI. We first briefly describe the open approach that we developed to support multivariate analysis using the ICEES OpenAPI.⁶ We then perform a comparative analysis of data extracted using the ICEES OpenAPI versus the underlying, constraint-free source data and quantify data loss. We apply a multivariate model to both the data extracted from the ICEES OpenAPI and the underlying constraint-free source data, using a generalized linear model (GLM) to predict exacerbations of asthma, our driving use case, and we further investigate the presence of divergence in the GLM results. Finally, we discuss model robustness and the nuances of data loss, as well as methods for users to quantify data loss for queries submitted to the ICEES OpenAPI.

We focus on asthma as our driving use case, as asthma is exquisitely sensitive to environmental exposures. We include seven predictors in our GLM based on their demonstrated relationship to asthma in the literature. For instance, several groups^5–9,11,12 have identified female sex and/or race as key demographic factors associated with asthma exacerbations. Obesity also is an established risk factor for asthma exacerbations.^6,7,9 Moreover, exposure to high levels of airborne pollutants is a known trigger for asthma exacerbations.^3,5,6,8,9,13 Likewise, close proximity to a major roadway or highway and residential density have been suggested to contribute to asthma exacerbations.^8–10,14 Finally, we include prednisone as a predictor in our model because it is typically used to treat severe asthma¹⁵ and thus is an acceptable proxy for asthma exacerbations.^6,8,9

Methods and Statistical Plan

All study procedures were approved by the Institutional Review Board at the University of North Carolina at Chapel Hill.

Overview of Analytic Plan

ICEES was developed to support functionalities that allow users to dynamically define cohorts and explore univariate and bivariate relationships between feature variables.³ We focused on sex, race, prescriptions for prednisone, diagnoses of obesity, airborne particulate matter exposure, residential proximity to a major roadway or highway, and residential density as potential predictors of asthma exacerbations, the primary outcome measure. We considered the existing ICEES cohort of UNC Health patients with asthma or related conditions.³ Asthma exacerbations were defined as the annual number of emergency department (ED) or inpatient visits for respiratory issues in year 2010. Hereinafter, we use the term “post-API data” to refer to data extracted using the ICEES OpenAPI and “pre-API data” to refer to the underlying, constraint-free source dataset. We applied a GLM to both the pre-API data and the post-API data to systematically assess model robustness.

Pre-API Dataset

The ICEES asthma OpenAPI exposes integrated clinical and environmental data on ∼160,000 patients from UNC Health.³ We selected seven features to generate a multivariate table: sex; race; prescriptions for prednisone; diagnoses of obesity; airborne particulate matter exposure; residential proximity to a major roadway or highway; and residential density. We focused on outcomes in year 2010 by filtering out patients who did not have any observations in the EHR during that year, using the ICEES feature variable “Active_In_Year.” Specifically, setting “Year = 2010” and “Active_In_Year = 1” selects only those patients who were active during the study period of interest, in this case, year 2010. We further filtered out patients who were missing data on any of the feature variables of interest. The two filtering steps reduced the sample size for the pre-API dataset to N = 15,420 patients.
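The two filtering steps can be sketched in pandas as follows. This is a minimal illustration on a toy table, not the study code: the DataFrame and all of its values are invented, and only the column names (Year, Active_In_Year, Sex2, ObesityDx) are taken from the text.

```python
import pandas as pd

# Toy patient-level table; every value here is invented for illustration.
patients = pd.DataFrame({
    "Year": [2010, 2010, 2010, 2011],
    "Active_In_Year": [1, 0, 1, 1],
    "Sex2": ["Female", "Male", None, "Male"],
    "ObesityDx": [1, 0, 0, 1],
})

# Feature variables that must be non-missing (a subset, for brevity).
features = ["Sex2", "ObesityDx"]

# Step 1: keep patients who were active in the study year.
active = patients[(patients["Year"] == 2010) & (patients["Active_In_Year"] == 1)]

# Step 2: drop patients missing any feature variable of interest.
complete = active.dropna(subset=features)

print(len(patients), len(active), len(complete))
```

In this toy table, the first step removes the inactive patient and the patient from a different year, and the second step removes the patient with a missing Sex2 value.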

Post-API Dataset

The algorithm used to programmatically extract a multivariate dataset is described in detail in Fecho et al. 2021,⁶ and a detailed technical design and overview of the ICEES OpenAPI is provided in Fecho et al. 2019.³ Briefly, the algorithm must be applied to a specific use case question. In the example described here and in our prior work,⁶ the question was: are sex, race, prednisone use, obesity, airborne particulate matter exposure, major roadway or highway exposure, and residential density significant predictors of asthma exacerbations, either independently or by way of interaction? The ICEES dynamic cohort creation functionality is then applied to generate separate subcohorts for each level of the primary outcome measure, in this case, the annual number of ED or inpatient visits for respiratory issues. For each subcohort, the ICEES bivariate contingency functionality is applied to examine the relationship between two feature variables for each level of the primary outcome measure. The dynamic cohort creation functionality is applied again to generate subcohorts based on combinations of each level of the primary outcome measure and each bivariate relationship. This process is then continued until all desired feature variables have been incorporated. The result is a multivariate table, with rows representing each combination of the primary outcome measure level and the feature variable level and with sample sizes for each row (Figure 1). This table can then be expanded such that a row with a sample size of N = 20 can be transformed to 20 rows with a sample size of N = 1 each. Thus, the output of the algorithm is a multivariate table, where each column represents one of the seven above-mentioned feature variables, with contingencies maintained across feature variables, and where each row represents an individual patient/observation. 
We applied the same filtering approach as we did to the pre-API data, in that we filtered the data by selecting “Year = 2010” and “Active_In_Year = 1” and removed patients who were missing data on any of the selected feature variables. The final sample size for the post-API dataset was N = 14,586.
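The expansion step described above, in which a row with a sample size of N = 20 becomes 20 rows of N = 1, can be sketched in pandas. This is an assumed, simplified illustration: the aggregated table and its counts are invented, and only three of the feature columns are shown.

```python
import pandas as pd

# Toy aggregated table: one row per combination of outcome level and
# feature levels, with N giving the number of patients in that cell.
# Column names follow the text; the counts are invented.
aggregated = pd.DataFrame({
    "TotalEDInpatientVisits": [1, 2],
    "Sex2": ["Female", "Male"],
    "ObesityDx": [0, 1],
    "N": [20, 12],
})

# Repeat each row N times so that every row stands for one patient,
# then drop the count column.
expanded = (
    aggregated.loc[aggregated.index.repeat(aggregated["N"])]
    .drop(columns="N")
    .reset_index(drop=True)
)

print(len(expanded))
```

The expanded table has one row per patient/observation while preserving the contingencies across feature variables, which is the form expected by most model-fitting routines.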

Figure 1.

Excerpt from the ICEES eight-feature multivariate table (post-API data).

Quality Control and Summary Statistics

We first verified that feature names and feature levels were consistent between the pre- and post-API datasets. After the datasets passed the consistency check, we computed descriptive statistics. Specifically, we performed pandas inner and outer merges¹⁶ to compare the pre- and post-API contingency tables in terms of sample sizes at each feature variable level and overall. This step allowed us to quantify data loss and determine whether any feature variables were selectively impacted by the data loss inherent in the ICEES OpenAPI multivariate approach, which arises from regulatory constraints that prevent the creation of cohorts of ≤10 patients.
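A comparison of this kind might look like the following sketch. It is not the code used in the study: the two contingency tables are invented, only two feature columns are shown, and the use of the indicator option is one possible way to locate cells that are absent from the post-API table.

```python
import pandas as pd

keys = ["Sex2", "ObesityDx"]

# Toy contingency tables; counts are invented for illustration.
pre = pd.DataFrame({
    "Sex2": ["Female", "Female", "Male", "Male"],
    "ObesityDx": [0, 1, 0, 1],
    "N": [120, 40, 90, 8],
})
# The (Male, 1) cell is absent, as it would be for a cohort of <=10 patients.
post = pd.DataFrame({
    "Sex2": ["Female", "Female", "Male"],
    "ObesityDx": [0, 1, 0],
    "N": [120, 40, 90],
})

# Inner merge: cells present in both tables.
shared = pre.merge(post, on=keys, how="inner", suffixes=("_pre", "_post"))

# Outer merge: every cell from either table; indicator=True adds a
# "_merge" column recording where each cell was found.
both = pre.merge(post, on=keys, how="outer",
                 suffixes=("_pre", "_post"), indicator=True)

# Cells found only in the pre-API table were lost via the OpenAPI.
lost_cells = both[both["_merge"] == "left_only"]

# Row-level loss, treating missing counts as zero.
lost_rows = int((both["N_pre"].fillna(0) - both["N_post"].fillna(0)).sum())

print(len(shared), len(lost_cells), lost_rows)
```

Grouping the lost cells by a single feature column would then show whether the loss is concentrated in particular feature levels.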

GLM Application

After quantifying data loss, we used R software to develop and apply a GLM model¹⁷ to predict “TotalEDInpatientVisits” using the above-mentioned feature variables extracted from the pre- and post-API datasets. Given that the feature variable “TotalEDInpatientVisits” represents counts and is therefore ordinal and that the distribution was skewed towards lower “TotalEDInpatientVisits” (i.e. most patients do not visit an ED or inpatient clinic in any given year), we fit a negative binomial (NB) model to the data.¹⁸ NB models have been shown to be well suited for such skewed count distributions.¹⁹

Because we filtered by “Active_In_Year = 1,” “TotalEDInpatientVisits = 0” had fewer counts than “TotalEDInpatientVisits = 1,” which is generally inconsistent with the NB distribution. Hence, we filtered out patients/rows with “TotalEDInpatientVisits = 0” from the pre-API (N = 410) and post-API (N = 351) datasets in order to maintain the NB distribution. We then applied the GLM to the interactions and main effects. We applied an Analysis of Variance (ANOVA) to the results obtained from GLM model output, with α = 0.05.

We applied the Residual Diagnostics for Hierarchical (Multi-Level/Mixed) Regression Models DHARMa package from R¹⁸ to the model results in order to evaluate if the NB model accurately fit the data and accounted for overdispersion. The DHARMa tool uses a simulation-based approach to generate quantile residuals for generalized linear mixed models, including NB GLM.^18,19
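The simulation-based DHARMa diagnostics are not reproduced here. As a much cruder first-pass illustration of what overdispersion means, the sketch below compares the sample variance with the sample mean for a set of invented visit counts. A Poisson model assumes that the two are equal, so a ratio well above 1 is one common motivation for choosing an NB model instead.

```python
from statistics import mean, variance

# Invented visit counts, skewed toward low values.
visits = [1, 1, 1, 1, 1, 1, 2, 2, 3, 7]

m = mean(visits)
v = variance(visits)  # sample variance (n - 1 denominator)

# A ratio well above 1 suggests overdispersion relative to a Poisson
# model, which assumes that the variance equals the mean.
dispersion_ratio = v / m

print(m, v, dispersion_ratio)
```

This check ignores covariates entirely, so it is only a rough screen and not a substitute for residual diagnostics on the fitted model.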

After we obtained the ANOVA results for both datasets, we compared the coefficients, standard errors, and p values for one-way, two-way, and three-way interactions. If a coefficient differed between the pre- and post-API datasets, then one would expect divergence between the corresponding frequencies in the underlying multivariate datasets. To assess the degree of divergence between the model output for the pre-API and post-API datasets, a Kullback-Leibler test and a Chi Square test were applied to the coefficients.

Results

We first quantified overall data loss. After filtering for Active_In_Year = 1 and removing patients/rows with missing observations, sample sizes were N = 15,420 for the pre-API dataset and N = 14,586 for the post-API dataset, thus indicating a loss of 834 patients/rows.
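The overall loss can also be expressed as a fraction of the pre-API data. The short calculation below uses only the two sample sizes reported above; the variable names are illustrative.

```python
n_pre = 15_420   # pre-API sample size reported in the text
n_post = 14_586  # post-API sample size reported in the text

lost = n_pre - n_post          # patients/rows lost via the OpenAPI
loss_fraction = lost / n_pre   # share of the pre-API data that was lost

print(lost, f"{loss_fraction:.1%}")
```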

Having quantified data loss in the pre-API and post-API datasets, we then applied a GLM model to both datasets, focusing on the seven potential predictors of asthma exacerbations (Table 1). Of all possible combinations of the seven feature variables (one-, two-, and three-way interactions), seven were significant (p < 0.05) for both the pre- and post-API model fit. Five of those represented one-way interactions or main effects (Sex2, Prednisone, ObesityDx, Race, and MaxDailyPM2_5Exposure_StudyMax), one represented a two-way interaction (Sex2:ObesityDx), and one represented a three-way interaction (Race:Prednisone:ObesityDx). Ten interactions were significant only in the pre-API model fit, and six interactions were significant only in the post-API model fit.

Table 1.

ANOVA on GLM model summary for one-way, two-way, and three-way interactions.

	Pre-API					Post-API
	Df	Deviance	Resid. Df	Resid. Dev	Pr (>Chi)	Df	Deviance	Resid. Df	Resid. Dev	Pr (>Chi)
NULL	NA	NA	15,419	11,876.78	NA	NA	NA	14,585	7712.32	NA
Sex2	1	24.37	15,418	11,852.40	7.94E-07	1	16.948	14,584	7695.37	3.84E-05
Race	6	64.43	15,412	11,787.97	5.63E-12	5	143.553	14,579	7551.82	3.14E-29
Prednisone	1	694.64	15,411	11,093.33	4.39E-15	1	126.118	14,578	7425.70	2.90E-29
ObesityDx	1	210.67	15,410	10,882.67	9.83E-48	1	16.996	14,577	7408.70	3.75E-05
MaxDailyPM2_5Exposure_StudyMax	2	8.47	15,408	10,874.20	1.45E-02	2	43.989	14,575	7364.71	2.80E-10
RoadwayDistanceExposure2	5	10.04	15,403	10,864.16	7.41E-02	5	8.677	14,570	7356.04	1.23E-01
EstResidentialDensity	1	0.03	15,402	10,864.14	8.74E-01	1	0.263	14,569	7355.77	6.08E-01
Sex2:Race	5	4.50	15,397	10,859.64	4.80E-01	5	8.297	14,564	7347.48	1.41E-01
Sex2:Prednisone	1	1.17	15,396	10,858.47	2.79E-01	1	0.924	14,563	7346.55	3.36E-01
Sex2:ObesityDx	1	9.54	15,395	10,848.93	2.01E-03	1	6.295	14,562	7340.26	1.21E-02
Sex2:MaxDailyPM2_5Exposure_StudyMax	2	0.26	15,393	10,848.67	8.77E-01	2	0.095	14,560	7340.16	9.53E-01
Sex2:RoadwayDistanceExposure2	5	14.14	15,388	10,834.54	1.48E-02	5	2.680	14,555	7337.48	7.49E-01
Sex2:EstResidentialDensity	1	1.18	15,387	10,833.35	2.77E-01	1	1.979	14,554	7335.50	1.59E-01
Race:Prednisone	5	7.15	15,382	10,826.20	2.10E-01	2	19.676	14,552	7315.83	5.34E-05
Race:ObesityDx	6	14.64	15,376	10,811.56	2.32E-02	2	1.244	14,550	7314.58	5.37E-01
Race:MaxDailyPM2_5Exposure_StudyMax	6	4.58	15,370	10,806.98	5.99E-01	3	2.166	14,547	7312.42	5.39E-01
Race:RoadwayDistanceExposure2	25	33.81	15,345	10,773.17	1.12E-01	25	8.795	14,522	7303.62	9.99E-01
Race:EstResidentialDensity	5	4.82	15,340	10,768.34	4.38E-01	5	0.802	14,517	7302.82	9.77E-01
Prednisone:ObesityDx	1	2.41	15,339	10,765.93	1.21E-01	1	7.432	14,516	7295.39	6.41E-03
Prednisone:MaxDailyPM2_5Exposure_StudyMax	1	0.12	15,338	10,765.81	7.28E-01	1	9.155	14,515	7286.23	2.48E-03
Prednisone:RoadwayDistanceExposure2	5	8.36	15,333	10,757.45	1.37E-01	5	14.416	14,510	7271.82	1.32E-02
Prednisone:EstResidentialDensity	1	3.59	15,332	10,753.86	5.80E-02	1	0.192	14,509	7271.63	6.61E-01
ObesityDx:MaxDailyPM2_5Exposure_StudyMax	1	0.06	15,331	10,753.80	8.05E-01	1	1.083	14,508	7270.54	2.98E-01
ObesityDx:RoadwayDistanceExposure2	5	3.74	15,326	10,750.06	5.88E-01	5	1.985	14,503	7268.56	8.51E-01
ObesityDx:EstResidentialDensity	1	1.12	15,325	10,748.93	2.89E-01	1	0.642	14,502	7267.92	4.23E-01
MaxDailyPM2_5Exposure_StudyMax:RoadwayDistanceExposure2	5	4.48	15,320	10,744.45	4.83E-01	5	2.662	14,497	7265.25	7.52E-01
MaxDailyPM2_5Exposure_StudyMax:EstResidentialDensity	1	0.46	15,319	10,743.99	4.98E-01	1	0.053	14,496	7265.20	8.18E-01
RoadwayDistanceExposure2:EstResidentialDensity	5	6.02	15,314	10,737.98	3.05E-01	5	1.036	14,491	7264.17	9.60E-01
Sex2:Race:Prednisone	5	3.09	15,309	10,734.89	6.87E-01	2	0.004	14,489	7264.16	9.98E-01
Sex2:Race:ObesityDx	4	2.90	15,305	10,731.99	5.75E-01	2	6.607	14,487	7257.55	3.68E-02
Sex2:Race:MaxDailyPM2_5Exposure_StudyMax	5	2.42	15,300	10,729.57	7.88E-01	3	0.564	14,484	7256.99	9.05E-01
Sex2:Race:RoadwayDistanceExposure2	25	13.47	15,275	10,716.10	9.70E-01	25	5.025	14,459	7251.97	1.00E+00
Sex2:Race:EstResidentialDensity	5	3.00	15,270	10,713.11	7.01E-01	5	1.215	14,454	7250.75	9.43E-01
Sex2:Prednisone:ObesityDx	1	1.40	15,269	10,711.71	2.36E-01	1	4.748	14,453	7246.00	2.93E-02
Sex2:Prednisone:MaxDailyPM2_5Exposure_StudyMax	1	0.07	15,268	10,711.64	7.98E-01	1	0.249	14,452	7245.75	6.17E-01
Sex2:Prednisone:RoadwayDistanceExposure2	5	8.64	15,263	10,703.00	1.24E-01	5	2.349	14,447	7243.40	7.99E-01
Sex2:Prednisone:EstResidentialDensity	1	0.22	15,262	10,702.78	6.40E-01	1	0.000	14,446	7243.40	9.88E-01
Sex2:ObesityDx:MaxDailyPM2_5Exposure_StudyMax	1	0.02	15,261	10,702.76	8.79E-01	0	0.000	14,446	7243.40	NA
Sex2:ObesityDx:RoadwayDistanceExposure2	5	4.53	15,256	10,698.23	4.76E-01	5	2.044	14,441	7241.36	8.43E-01
Sex2:ObesityDx:EstResidentialDensity	1	0.63	15,255	10,697.60	4.29E-01	1	1.170	14,440	7240.19	2.79E-01
Sex2:MaxDailyPM2_5Exposure_StudyMax:RoadwayDistanceExposure2	5	3.09	15,250	10,694.51	6.86E-01	5	0.549	14,435	7239.64	9.90E-01
Sex2:MaxDailyPM2_5Exposure_StudyMax:EstResidentialDensity	1	1.57	15,249	10,692.94	2.10E-01	1	0.569	14,434	7239.07	4.51E-01
Sex2:RoadwayDistanceExposure2:EstResidentialDensity	5	13.62	15,244	10,679.33	1.82E-02	5	2.109	14,429	7236.96	8.34E-01
Race:Prednisone:ObesityDx	5	16.15	15,239	10,663.17	6.42E-03	1	5.729	14,428	7231.23	1.67E-02
Race:Prednisone:MaxDailyPM2_5Exposure_StudyMax	4	7.43	15,235	10,655.74	1.15E-01	0	0.000	14,428	7231.23	NA
Race:Prednisone:RoadwayDistanceExposure2	22	19.03	15,213	10,636.71	6.43E-01	10	4.127	14,418	7227.11	9.41E-01
Race:Prednisone:EstResidentialDensity	5	14.57	15,208	10,622.14	1.24E-02	2	0.032	14,416	7227.07	9.84E-01
Race:ObesityDx:MaxDailyPM2_5Exposure_StudyMax	3	7.82	15,205	10,614.32	5.00E-02	1	1.270	14,415	7225.80	2.60E-01
Race:ObesityDx:RoadwayDistanceExposure2	21	35.35	15,184	10,578.97	2.59E-02	10	2.717	14,405	7223.09	9.87E-01
Race:ObesityDx:EstResidentialDensity	5	14.71	15,179	10,564.26	1.17E-02	2	2.761	14,403	7220.33	2.52E-01
Race:MaxDailyPM2_5Exposure_StudyMax:RoadwayDistanceExposure2	16	9.99	15,163	10,554.27	8.67E-01	15	3.046	14,388	7217.28	1.00E+00
Race:MaxDailyPM2_5Exposure_StudyMax:EstResidentialDensity	3	0.80	15,160	10,553.46	8.48E-01	3	0.084	14,385	7217.20	9.94E-01
Race:RoadwayDistanceExposure2:EstResidentialDensity	25	32.42	15,135	10,521.05	1.46E-01	24	11.617	14,361	7205.58	9.84E-01
Prednisone:ObesityDx:MaxDailyPM2_5Exposure_StudyMax	1	0.12	15,134	10,520.93	7.31E-01	0	0.000	14,361	7205.58	NA
Prednisone:ObesityDx:RoadwayDistanceExposure2	5	20.76	15,129	10,500.17	8.98E-04	5	2.723	14,356	7202.86	7.43E-01
Prednisone:ObesityDx:EstResidentialDensity	1	0.01	15,128	10,500.16	9.34E-01	1	0.054	14,355	7202.80	8.16E-01
Prednisone:MaxDailyPM2_5Exposure_StudyMax:RoadwayDistanceExposure2	4	17.82	15,124	10,482.33	1.34E-03	4	0.332	14,351	7202.47	9.88E-01
Prednisone:MaxDailyPM2_5Exposure_StudyMax:EstResidentialDensity	1	0.16	15,123	10,482.18	6.90E-01	1	0.002	14,350	7202.47	9.66E-01
Prednisone:RoadwayDistanceExposure2:EstResidentialDensity	5	20.78	15,118	10,461.39	8.90E-04	5	5.514	14,345	7196.95	3.56E-01
ObesityDx:MaxDailyPM2_5Exposure_StudyMax:RoadwayDistanceExposure2	5	4.98	15,113	10,456.41	4.18E-01	5	0.383	14,340	7196.57	9.96E-01
ObesityDx:MaxDailyPM2_5Exposure_StudyMax:EstResidentialDensity	1	0.04	15,112	10,456.37	8.47E-01	1	0.038	14,339	7196.53	8.45E-01
ObesityDx:RoadwayDistanceExposure2:EstResidentialDensity	5	12.66	15,107	10,443.71	2.68E-02	5	3.786	14,334	7192.75	5.81E-01
MaxDailyPM2_5Exposure_StudyMax:RoadwayDistanceExposure2:EstResidentialDensity	5	2.67	15,102	10,441.05	7.51E-01	5	0.534	14,329	7192.21	9.91E-01

Notes: Significant results are in bold italicized font (α = 0.05).

Abbreviations: Df, degrees of freedom; Pr, probability; Resid. Df, residual degrees of freedom; Resid. Dev, residual deviance.

To better understand the model discrepancies and behavior, we examined interactions where only the pre-API model fit or only the post-API model fit was significant. A closer examination of the interaction model showed an imbalance in ObesityDx = 0 and ObesityDx = 1 across five of the seven racial categories (Asian, Native Hawaiian/Pacific Islander, American/Alaskan native, Other, Unknown), thus suggesting that this imbalance was likely the source of model divergence in the pre-API versus post-API model fits.

To balance the data, we collapsed Race to include only African Americans and Caucasians, comprising the majority of patients (85% in the pre-API dataset and 87% in the post-API dataset) and aligning with the approach we used previously.⁵ We then refit the GLM model. We plotted p values for one-, two-, and three-way interactions for the GLM model with all races included and for the GLM model with races restricted to include only African Americans and Caucasian (Figure 2).
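The imbalance check and the subsequent restriction to two racial categories can be sketched as follows. The data are invented, the minimum cell size is an arbitrary choice for the toy example, and the category labels are assumed spellings rather than the exact ICEES feature levels.

```python
import pandas as pd

# Toy expanded table (one row per patient); values are invented.
df = pd.DataFrame({
    "Race": ["African American"] * 6 + ["Caucasian"] * 6 + ["Asian"] * 3,
    "ObesityDx": [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0],
})

# Cross-tabulate Race against ObesityDx to expose sparse or empty cells.
counts = pd.crosstab(df["Race"], df["ObesityDx"])

# Flag categories in which either obesity level has few observations.
min_cell = 2  # arbitrary threshold for this toy example
sparse = counts[(counts < min_cell).any(axis=1)]

# Restrict the data to the two well-populated categories.
keep = ["African American", "Caucasian"]
restricted = df[df["Race"].isin(keep)]
share_retained = len(restricted) / len(df)

print(counts)
print(list(sparse.index), share_retained)
```

The retained share plays the same role as the 85% and 87% figures reported above: it shows how much of the data remains available for the refit.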

Figure 2.

Coefficient probabilities from one- (blue), two- (red), and three-way (grey) interactions in the GLM model for the pre-API dataset (PRE) plotted against the post-API dataset (POST) for (a) all races and (b) African Americans and Caucasians only. The significance of the model is shown by the different marker symbols. Squares denote that only the pre-API model fit is significant (Panel I); cross symbols denote that neither the pre- nor the post-API model fit is significant (Panel II); plus signs denote that only the post-API model fit is significant (Panel III); and dots denote that both the pre- and post-API model fits are significant (Panel IV). The significance level was set at α = 0.05. Chi = Chi Square; Pr = probability.
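The panel assignment described in the caption amounts to comparing each pair of p values against α. The sketch below applies that rule to four terms whose p values are copied from Table 1; the function name "panel" and the label strings are hypothetical and introduced only for illustration.

```python
ALPHA = 0.05

# (pre-API p value, post-API p value) for four terms taken from Table 1.
p_values = {
    "Sex2": (7.94e-07, 3.84e-05),
    "EstResidentialDensity": (8.74e-01, 6.08e-01),
    "Sex2:RoadwayDistanceExposure2": (1.48e-02, 7.49e-01),
    "Race:Prednisone": (2.10e-01, 5.34e-05),
}


def panel(p_pre, p_post, alpha=ALPHA):
    """Return the Figure 2 panel label for a pair of p values."""
    pre_sig = p_pre < alpha
    post_sig = p_post < alpha
    if pre_sig and post_sig:
        return "IV (both significant)"
    if pre_sig:
        return "I (pre-API only)"
    if post_sig:
        return "III (post-API only)"
    return "II (neither)"


for term, (p_pre, p_post) in p_values.items():
    print(term, "->", panel(p_pre, p_post))
```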

When we examined coefficients in the GLM model with African Americans and Caucasians only, we found that one-way interactions or main effects were significant for Sex2, Prednisone, ObesityDx, and MaxDailyPM2_5Exposure_StudyMax in both the pre-API and post-API datasets. Race was significant only for the pre-API model fit.

One two-way interaction was significant for both the pre-API and post-API model fit, Sex2:ObesityDx. None of the three-way interactions were significant after collapsing the race category. As with the model with all races included, ten interactions were significant only for the pre-API model fit, and six interactions were significant only for the post-API model fit.

We further examined model discrepancy by looking at frequencies for a second non-significant two-way interaction with Race, namely, Race:RoadwayDistanceExposure2. We selected these two variables because Race showed a significant main effect in the GLM, whereas RoadwayDistanceExposure2 did not. Thus, Race:RoadwayDistanceExposure2 was considered an appropriate comparison to further assess model discrepancy. We examined frequencies in the pre-API versus post-API datasets and focused on the balanced datasets with African Americans and Caucasians only (Figure 3). We found that the pre-API and post-API frequencies were not significantly different.

Figure 3.

Frequency scatter plot for Race:RoadwayDistanceExposure2 in pre- and post-API datasets.

A Kullback-Leibler test was then applied to the pre-API and post-API 2 × 2 table for Race:RoadwayDistanceExposure2. The test showed a divergence of 0.00011, with a Chi Square test probability of 0.99, indicating negligible divergence between the pre-API and post-API datasets for this interaction.
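For readers unfamiliar with the measure, a Kullback-Leibler divergence between two frequency tables can be computed as in the sketch below. This is a generic implementation under stated assumptions, not the code used in the study: the counts are invented, and the tables are flattened to vectors with cells in the same order.

```python
from math import log


def kl_divergence(p_counts, q_counts):
    """Kullback-Leibler divergence D(P || Q) between two count vectors.

    Counts are normalized to probabilities first. Cells with zero
    probability in P contribute nothing; Q must be positive wherever
    P is positive.
    """
    p_total = sum(p_counts)
    q_total = sum(q_counts)
    divergence = 0.0
    for p_n, q_n in zip(p_counts, q_counts):
        p = p_n / p_total
        q = q_n / q_total
        if p > 0:
            divergence += p * log(p / q)
    return divergence


# Flattened 2 x 2 frequency tables; the counts are invented.
pre_counts = [400, 350, 150, 100]
post_counts = [380, 335, 140, 95]

print(kl_divergence(pre_counts, post_counts))
print(kl_divergence(pre_counts, pre_counts))  # identical tables give 0.0
```

A divergence close to zero means that the post-API cell proportions closely track the pre-API proportions, even when the absolute counts differ.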

Discussion

We used the ICEES OpenAPI to generate a multivariate dataset and quantify data loss when compared to the constraint-free source data underlying the OpenAPI. We performed a comparative analysis of model robustness by applying a GLM model to the post-API versus pre-API datasets, using seven potential predictors of asthma exacerbations. Our comparative analysis included an examination of one-, two-, and three-way interactions for both pre-API and post-API model output. The purpose of the pre-API versus post-API comparative analysis was to: (i) evaluate the analytic impact of select feature variables on asthma exacerbations, as measured by annual ED or inpatient visits for respiratory issues; (ii) quantify and account for data loss; and (iii) determine if there is any divergence between pre-API and post-API model output.

Importantly, the ICEES OpenAPI prevents the creation of cohorts with ≤10 patients.³ In the event that a user attempts to create such a cohort, the user receives an error message. This regulatory constraint may result in data loss when accessing the ICEES OpenAPI to generate a multivariate table. Thus, the multivariate table generated via the ICEES OpenAPI will not necessarily be identical to the multivariate table generated from the underlying, constraint-free source data. Moreover, when using the ICEES OpenAPI to generate multivariate datasets, the impact of this regulatory constraint will be cumulative, as the programmatic approach that is used entails iterative cohort generation, using an increasing number of feature variables with each iteration.
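The cumulative effect can be illustrated with a small simulation. The sketch below is a simplification that applies the ≤10 threshold to an in-memory table rather than calling the ICEES OpenAPI; the records are invented and the helper name "loss_after" is hypothetical.

```python
from collections import Counter

MIN_COHORT = 10  # cohorts with <= 10 patients cannot be created

# Invented patient records as (Sex2, ObesityDx, Prednisone) tuples.
records = (
    [("Female", 0, 0)] * 30 + [("Female", 0, 1)] * 10
    + [("Female", 1, 0)] * 8 + [("Female", 1, 1)] * 4
    + [("Male", 0, 0)] * 25 + [("Male", 0, 1)] * 5
    + [("Male", 1, 0)] * 4 + [("Male", 1, 1)] * 3
)


def loss_after(n_features):
    """Patients lost when cohorts are defined on the first n_features."""
    counts = Counter(rec[:n_features] for rec in records)
    # Every patient in a cell at or below the threshold is unavailable.
    return sum(n for n in counts.values() if n <= MIN_COHORT)


for k in (1, 2, 3):
    print(k, loss_after(k))
```

In this toy example no patients are lost with one feature, a few are lost with two, and many more are lost with three, because each additional feature splits the cohorts into smaller cells.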

One way to understand the impact of data loss with the ICEES OpenAPI is to compare frequencies for the pre-API versus post-API datasets. After carefully assessing and comparing the pre-API and post-API datasets generated to answer a specific use-case question, we found that we lost between 5% and 6% of the pre-API data (834 of 15,420 patients/rows) when accessing the data via the ICEES OpenAPI. Fecho et al. 2021⁶ reported that data loss is generally <10% with the open multivariate approach, but with the potential for greater data loss, depending on the feature variables that are selected, their order, and several other considerations. A critical discussion on data loss can be found in Fecho et al. 2021.⁶

When we applied a GLM to the pre-API and post-API datasets, we found that the main predictors of asthma exacerbations, estimated from the GLM model, were sex, race, prednisone, obesity, and PM2.5 exposure. These predictors were significant for both the pre-API and post-API model output. These results are consistent with Fecho et al.⁶ and Lan et al.⁹ Two additional interactions were significant for both the pre-API and post-API model output: sex × obesity and race × prednisone × obesity. Ten interactions were significant only for the pre-API model output, and six interactions were significant only for the post-API model output.

After fitting the GLM model to both the pre-API and post-API datasets, we found divergence in model output, in terms of the interaction terms for the pre-API versus post-API datasets. A closer examination of the divergence suggested that it was due to the imbalance in the frequency of obesity across racial categories. We chose to collapse race to focus only on African Americans and Caucasians, which together accounted for the majority of patients, and then refit and reevaluated the GLM model. The GLM model fit applied to the pre-API and post-API datasets with balanced race data (i.e., only African Americans and Caucasians) revealed little divergence in model output, apart from minor divergence in the model’s interaction terms. Thus, while regulatory constraints imposed on the ICEES OpenAPI, coupled with inherent data imbalances in EHR data, indeed do affect data loss and multivariate model robustness, the general approach we describe herein yields valid results. To minimize the likelihood of spurious results, users of the ICEES OpenAPI are encouraged to investigate possible class imbalances when fitting models with higher order interactions.

Limitations

While our results suggest minimal impact of the ICEES open multivariate functionality on model robustness, our findings have limitations that should be considered when interpreting the results. First, our pre- versus post-API analysis demonstrated that the data loss that is inherent in the ICEES open multivariate approach did not impact the main effects identified by the GLM. However, we did find an impact on model robustness when comparing two-way and higher interactions in the pre- versus post-API model results, although the impact was fairly minimal. Second, we focused our systematic analysis on GLM results, but we did not apply other statistical models or machine learning algorithms. We plan to repeat our analysis using other models and approaches such as causal inference analysis. Finally, we conducted our analysis using data derived from a single, large ICEES cohort on patients with asthma. We therefore do not know how generalizable our results are, in terms of applicability to other cohorts and smaller datasets. We plan to repeat our analysis using datasets derived from additional ICEES cohorts, including cohorts of patients with primary ciliary dyskinesia, drug-induced liver injury, and Long COVID syndrome.

Conclusions

In summary, the results from our comparative analysis of ICEES pre-API versus post-API datasets reveal similar predictors of asthma exacerbations and consistency with our prior results. We further show that datasets generated via the ICEES OpenAPI are generally reliable for exploratory multivariate studies, albeit with a certain amount of data loss and a potential for data imbalance that users should be aware of. Of relevance, we recently developed a preliminary theoretical framework to estimate data loss when applying the ICEES multivariate functionality.²⁰ We are now refining the framework and considering approaches for providing users with case-by-case estimates of data loss and model robustness to inform multivariate model development. In addition, we are expanding the current use of the ICEES OpenAPI to investigate predictors using longitudinal data and analytic approaches such as machine learning models and causal network analysis. Finally, we note that ICEES is disease-agnostic and currently supports asthma and several other use cases or cohorts. We plan to implement the analytic method presented in this study in other ICEES use cases.

Footnotes

Acknowledgements

The authors wish to acknowledge Stanley C. Ahalt, Director of the Renaissance Computing Institute, for his support and advice on the work described herein; David B. Peden for his expertise on the asthma use case; Emily Pfaff and James Champion for their help with the patient data; and Sarav Arunachalam, Stephen A. Appold, Alejandro Valencia Arias, and Lisa Stillwell for their help with the environmental exposures data.

Author Contributions

PSh prepared the first draft of the manuscript and conducted the analyses described herein; all authors contributed to the study design, assisted with interpretation of the results, reviewed the first draft of the manuscript, and approved the final submission.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was supported by the National Center for Advancing Translational Sciences, National Institutes of Health (OT2TR003430, OT2TR003428, UL1TR002489, UL1TR002489-03S4, OT3TR002020).

Institutional Review Board Statement

The study procedures were approved by the Institutional Review Board at the University of North Carolina at Chapel Hill (protocol #16–2978).

Data Availability Statement

The ICEES asthma OpenAPI is freely available and can be accessed at https://icees-asthma.renci.org/apidocs [Permalink: https://perma.cc/7RWE-78JL]. The ICEES GitHub repository and software code can be found at: .

ORCID iD

Karamarie Fecho

References

1. Johnson, Pollard, Shen, et al. MIMIC-III, a freely accessible critical care database. Sci Data 2016; 3: 160035. doi: 10.1038/sdata.2016.35

2. Dumontier, Hripcsak, et al. Columbia open health data, clinical concept prevalence and co-occurrence from electronic health records. Sci Data 2018; 5: 180273. doi: 10.2196/31122

3. Fecho, Pfaff, et al. A novel approach for exposing and sharing clinical data: The translator integrated clinical and environmental exposures service. J Am Med Inform Assoc 2019; 26(10): 1064–1073. doi: 10.1093/jamia/ocz042

4. Pfaff, Champion, Bradford, et al. Fast healthcare interoperability resources (FHIR) as a meta model to integrate common data models: Development of a tool and quantitative validation study. JMIR Med Inform 2019; 7(4): e15199. doi: 10.2196/15199

5. Cox, Stillwell, et al. FHIR PIT: An open software application for spatiotemporal integration of clinical data and environmental exposures data. BMC Med Inform Decis Mak 2020; 20(1): 53. doi: 10.21203/rs.2.19633/v1

6. Fecho, Haaland, Krishnamurthy, et al. An approach for open multivariate analysis of integrated clinical and environmental exposures data. Inform Med Unlocked 2021; 26: 100733. doi: 10.1016/j.imu.2021.100733

7. Fecho, Ahalt, Arunachalam, Biomedical Data Translator Consortium, et al. Sex, obesity, diabetes, and exposure to particulate matter among patients with severe asthma: Scientific insights from a comparative analysis of open clinical data sources during a five-day hackathon. J Biomed Inform 2019; 100: 103325. Special Communication. doi: 10.1016/j.jbi.2019.103325

8. Fecho, Ahalt, Appold, et al. Development and application of an open tool for sharing and analyzing integrated clinical and environmental exposures data: Asthma use case. JMIR Form Res 2022; 6(4): e32357. doi: 10.2196/32357

9. Lan, Haaland, Krishnamurthy, et al. Open application of statistical and machine learning models to explore the impact of environmental exposures on health and disease: An asthma use case. Int J Environ Res Public Health 2021; 18(21): 11398. doi: 10.3390/ijerph182111398

10. Perez, Lurmann, Wilson, et al. Near-roadway pollution and childhood asthma: Implications for developing “win-win” compact urban development and clean vehicle strategies. Environ Health Perspect 2012; 120(11): 1619–1626. doi: 10.1289/ehp.1104785

11. Keet, McCormack, Pollack, et al. Neighborhood poverty, urban residence, race/ethnicity, and asthma: Rethinking the inner-city asthma epidemic. J Allergy Clin Immunol 2015; 135(3): 655–662. doi: 10.1016/j.jaci.2014.11.022

12. Greenblatt, Zhao, Henrickson, et al. Factors associated with exacerbations among adults with asthma according to electronic health record data. Asthma Res Pract 2019; 5: 1. doi: 10.1186/s40733-019-0048

13. Mirabelli, Vaidyanathan, Flanders, et al. Outdoor PM2.5, ambient air temperature, and asthma symptoms in the past 14 days among adults with active asthma. Environ Health Perspect 2016; 124(12): 1882–1890. doi: 10.1289/EHP92

14. Schurman, Bravo, Innes, et al. Toll-like receptor 4 pathway polymorphisms interact with pollution to influence asthma diagnosis and severity. Sci Rep 2018; 8: 12713.

15. Alangari. Corticosteroids in the treatment of acute asthma. Ann Thorac Med 2014; 9(4): 187–192. doi: 10.4103/1817-1737.140120

16. Pandas inner and outer merge functions. pandas via NumFOCUS, Inc., Austin, Texas. Available at: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html

17. Rindskopf. Generalized linear models. In: Cooper, Camic, Long, et al. (eds) APA Handbook of Research Methods in Psychology. Washington, DC: American Psychological Association. APA Publications, 3, pp. 191–206. doi: 10.1037/13621-009

18. Hartig. DHARMa: Residual diagnostics for hierarchical (Multi-Level/Mixed) regression Models. R package version 0.4.3. Institute for Statistics and Mathematics, Vienna University of Economics and Business, Vienna, Austria. Available at: https://CRAN.R-project.org/package=DHARMa

19. Lehman, Archer. Penalized negative binomial models for modeling an overdispersed count outcome with a high-dimensional predictor space: Application predicting micronuclei frequency. PLoS One 2019; 14(1): e0209923. doi: 10.1371/journal.pone.0209923

20. Schmitt, Fecho, Haaland, et al. A framework for estimating the bounds of contingency tables: Application to an open clinical service. RENCI technical report, Chapel Hill, NC, USA: Renaissance Computing Institute, University of North Carolina at Chapel Hill. Available at: https://renci.org/technical-reports/tr-22-01. doi: 10.7921/2261-7b16

Evaluating robustness of a generalized linear model when applied to electronic health record data accessed using an Open API