Abstract
Decision support techniques and online algorithms aim to help individuals predict costs and facilitate their choice of health insurance coverage. Self-reported health status (SHS), whereby patients rate their own health, could improve cost-prediction estimates without requiring individuals to share personal health information or know about undiagnosed conditions. We compared the predictive accuracy of several models: (1) SHS only, (2) a “basic” model adding health-related variables, and (3) a “full” model adding measures of healthcare access. The Medical Expenditure Panel Survey was used to predict 2015 health expenditures from 2014 data. Relative performance was assessed by comparing adjusted-R2 values and by reporting the predictive accuracy of the models for a new cohort (2015–2016 data). In the SHS-only model, those with better SHS were less likely to incur expenditures. However, after accounting for health variables, those with better SHS were more likely to incur expenses. In the full model, SHS was no longer predictive of incurring expenses. Variables indicating better access to care were associated with higher likelihood of spending and higher spending. The full model (
Highlights
Decision support tools can improve health insurance consumers’ knowledge and confidence; many online tools offer a degree of customization based upon demographic and health information.
This research asks whether such tools can be customized sufficiently to make accurate predictions about utilization and cost that could then guide specific plan selection in a given health insurance market.
Policies that rely heavily on health insurance consumers making individually optimal choices cannot assume that decision tools can accurately anticipate high costs.
Introduction
Over the past 40 years, health insurance has become an increasingly sophisticated product that plays the dual roles of financial protection and means of access to healthcare services. At the same time, publicly subsidized insurance has begun mimicking the private market as it relies on the notion of informed and savvy consumer shopping behavior to drive efficiency. Many non-elderly American adults are asked to select an insurance plan from an employer-sponsored or individual marketplace, and adults over the age of 65 can choose from an array of Medicare supplemental and Medicare Advantage plans to optimize coverage and access to care. Those anticipating significant care needs may prefer plans with more comprehensive coverage at a higher up-front cost, whereas those anticipating needing little care may prefer less insurance coverage, or a higher-deductible plan. 1 Once this high-level choice is made, more granular choices involving smaller variations in plan design must be weighed against small differences in premiums.
Prior research has found that individuals commonly make mistakes when faced with such decisions. In a study of Medicare beneficiaries selecting Part D plans, people systematically placed too much weight on premiums rather than out-of-pocket costs, resulting in suboptimal choices. 2 In a study of private employer-sponsored insurance selection, people commonly chose plans that left them with higher overall spending than other available options, and lower-income consumers fared worse; the authors found these effects to be driven by a lack of understanding of how health insurance works. 3
Moreover, future healthcare utilization and its associated costs are often unpredictable at the individual level, especially when one is faced with a new or worsening health condition, as a recent analysis of spending in the last year of life confirmed. 4 Yet anticipating future costs is essential to making the optimal choice for health coverage—especially for the decision about whether to enroll in a high-deductible plan. 5 Resources such as decision aids, which in general aim to inform and help individuals faced with complex or unfamiliar decisions, may include online algorithms that help individuals predict future utilization and associated costs. 6 However, to generate cost estimates, these tools often require individuals to provide demographics, socioeconomic information, and medical comorbidities or prescription drug lists. 7 For some people, these questions feel intrusive. 8 For others, who may have new or undiagnosed conditions—possibly due to prior lack of access to insurance—the tools’ results may be inaccurate and missing key information needed to anticipate future expenses.
The policy approach that relies on individuals optimizing their health insurance decisions may be strengthened if decision aids can improve accuracy while remaining simple enough for widespread use. One potential way to do this is via the concept of self-reported health status (SHS). 9 SHS, which asks patients to rate their own health on a scale, often from poor to excellent, is a straightforward method for assessing individuals’ physical symptoms, emotional well-being, and functional status. 10 SHS plays a significant role in predicting total healthcare spending.11–13 Some state-based tools even use it to suggest insurance plans. 7 SHS does not require individuals to share personal health information or know about potentially undiagnosed conditions; thus it could provide helpful input to tools estimating future healthcare spending.10,14 However, it is a subjective measure that may be context-dependent, and there is little research regarding its ability to predict individual utilization and costs.
A closely related question when predicting utilization and costs is the degree to which access to care impacts the estimates and the resulting decision aid output; recent studies suggest that differential access for racial minorities may lead to racial bias in risk prediction algorithms. 15 Using a dataset that specifically includes information on whether an individual has access to healthcare (in a variety of forms) can test whether and how access modifies cost estimates. This information may be helpful to understand the risk of using cost prediction tools in certain settings (such as Health Insurance Marketplaces) in which racial minorities disproportionately gain access to new coverage.
In this study, we therefore aimed to: (1) determine the degree of predictive accuracy that can be achieved by SHS alone; (2) compare these findings to SHS plus a full complement of typical health-related variables to predict future utilization and costs; and (3) examine how direct “healthcare access” variables, and additional indirect access variables such as socioeconomic characteristics, relate to predictions of utilization and spending.
Methods
Data
Data are from the 2014–2016 Medical Expenditure Panel Survey (MEPS), a set of large-scale surveys of families and individuals, their medical providers, and employers across the United States. MEPS is the most complete source of data on the cost and use of healthcare and health insurance coverage. 16 Two recent panels from the MEPS were analyzed: the model was estimated on MEPS Panel 19 (2014–2015) and assessed for accuracy using Panel 20 (2015–2016). Each unique panel contains information on about 16,000 respondents over a two-year period; through a restricted access agreement, we were able to supplement MEPS with other county and state data.
Predictor Variables
Our primary predictor of interest was SHS, for which individuals described their overall health in qualitative terms (poor, fair, good, very good, or excellent). We transformed these responses such that 0 represented poor health, 1 indicated fair health, 2 indicated good health, 3 indicated very good health, and 4 indicated excellent health. An analogous variable included in the MEPS survey, self-reported mental health status, was coded identically. We considered various model specifications for this variable, including categorical, linear, and log-linear. As model performance was similar, including the magnitude of predictive errors, we present the linear form as the easiest to interpret.
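The coding described above can be sketched as follows. This is a minimal illustration, not the authors' SAS code: the names `SHS_CODES`, `encode_shs`, and `one_hot_shs` are hypothetical, and the indicator function shows only one way the categorical specification mentioned in the text might be set up.

```python
# Hypothetical names; the 0-4 scale follows the coding described in the text.
SHS_CODES = {"poor": 0, "fair": 1, "good": 2, "very good": 3, "excellent": 4}

def encode_shs(response):
    """Map a qualitative SHS response to its 0-4 score (None if unrecognized)."""
    return SHS_CODES.get(response.strip().lower())

def one_hot_shs(score):
    """Categorical alternative: four indicators, with poor (0) as the reference."""
    return [1 if score == level else 0 for level in range(1, 5)]

print(encode_shs("Very good"))   # linear form: a single numeric regressor
print(one_hot_shs(3))            # categorical form: one indicator per level
```

The linear form imposes equal spacing between adjacent health categories, whereas the indicator form lets each category have its own effect; the text reports that the two performed similarly.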
Respondents’ health-related demographics included age and sex, as well as a variable for older age (greater than 64). Indicator variables for 11 chronic and acute conditions asked by MEPS (arthritis, asthma, back/joint pain, cancer, chronic bronchitis, diabetes, emphysema, high cholesterol, heart condition, hypertension, and stroke) were used separately and also combined into a count variable for total number of chronic conditions. A variable capturing the degree to which the respondent’s pain level over the past 4 weeks caused limitations in their ability to work was also included, with responses ranging from 1 (indicating “not at all”) to 5 (indicating “extremely”). Current smoking status was also used as a health-related indicator.
To measure access to care, we assessed respondents’ geographic, cost-related, and personal access. First, to analyze geographic access, we used Urban Influence Codes 17 to categorize counties as metropolitan, micropolitan (ie, rural counties containing small cities), and non-core (ie, rural counties that are non-micropolitan). We merged MEPS data with county-level indicators of healthcare supply from the Area Health Resources File (AHRF), 2015–16. 18 Counts of primary care providers and specialists served as proxies for availability of care. Second, to analyze cost-related access, we included Medicaid expansion status by year (with states that expanded during the study period coded as expansion states only in the appropriate years), as well as income level and source of health insurance coverage. We did not control for any finer details of insurance coverage because our goal was to determine the expected utilization for a given individual independent of any specific plan features. Finally, to analyze respondents’ personal access, we included a MEPS variable asking whether the respondent had a “usual source” of healthcare. Other variables in this category were self-reported mental health status, marital status, and racial and Hispanic identities. Those with mental health conditions are less likely to seek care.19,20 Marital status is often associated with increased utilization and better health, especially for men,21,22 whereas discrimination may contribute to racial disparities in access. 23
Primary Outcome: Health Expenditures
Annual expenditures were calculated for each respondent. Actual utilization and cost data are reported by individual households and their members, and this information is supplemented by data from their medical providers (doctors, hospitals, pharmacies, etc.) to characterize the individual’s total spending from all sources (private and public insurance, employer, patient, etc.). Expenditures were log-transformed for all positive values due to skewness in the distribution of the data.
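A minimal sketch of this outcome construction is shown below. It assumes a natural logarithm, which the text does not specify, and the function name `log_expenditure` is hypothetical. Zero expenditures are kept as a separate indicator rather than transformed, since the log of zero is undefined.

```python
import math

def log_expenditure(dollars):
    """Return (any_spending, log_spending); the log is None when spending is zero."""
    if dollars < 0:
        raise ValueError("expenditures cannot be negative")
    if dollars == 0:
        return 0, None
    # Natural log is an assumption; the source says only "log-transformed".
    return 1, math.log(dollars)

print(log_expenditure(0.0))
print(log_expenditure(2500.0))
```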
Analysis
We first described patient characteristics by SHS group. We then created sequential models for health expenditures. Year 2 expenditures were predicted based upon year 1 characteristics. In the derivation sample, the model used Panel 19’s year 1 data (2014) to predict year 2 expenditures (2015). The estimates that fit the relationship best were then applied to Panel 20 data (the validation sample), to predict 2016 expenditures based upon 2015 data. To assess model fit, the predictions were compared to actual 2016 expenditures in the second year of Panel 20. Because a large number of zero expenditures existed, we used a two-stage Heckman model, 24 in which stage 1 estimated the probability that expenses would be incurred at all, and stage 2 estimated the relationship between the explanatory variables and year 2 expenditures, conditional on year 2 expenditures being positive. To predict from the model, the stage 1 and 2 estimates from Panel 19 data were combined to produce an unconditional estimate of Panel 20 year 2 expenditures, given Panel 20 year 1 data. The SAS procedure QLIM, with the HECKIT option, was used to estimate a binary probit selection model with a second-stage OLS regression.
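One way the two stages might be combined into an unconditional prediction is sketched below. This is an illustration under stated assumptions, not the SAS QLIM implementation: all function and parameter names are hypothetical, the same covariate vector is used in both stages for simplicity, and the back-transformation from logs to dollars is the naive exponential, which ignores retransformation bias. The conditional mean uses the standard Heckman correction term, the inverse Mills ratio.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def dot(coefs, values):
    return sum(c * v for c, v in zip(coefs, values))

def predict_expenditure(x, gamma, beta, rho_sigma):
    """Combine a probit selection stage with a log-linear outcome stage.

    x          year 1 covariates, with a leading 1.0 for the intercept
    gamma      stage 1 (probit) coefficients
    beta       stage 2 (log-expenditure OLS) coefficients
    rho_sigma  stage 2 coefficient on the inverse Mills ratio
    """
    index = dot(gamma, x)
    p_any = norm_cdf(index)                       # stage 1: P(expenditure > 0)
    if p_any == 0.0:
        return 0.0                                # avoid dividing by zero below
    mills = norm_pdf(index) / p_any               # inverse Mills ratio
    log_mean = dot(beta, x) + rho_sigma * mills   # E[log y | y > 0, x]
    # Naive back-transformation: ignores retransformation (smearing) bias.
    return p_any * math.exp(log_mean)

# Hypothetical coefficients estimated on one panel, applied to a new respondent.
print(predict_expenditure([1.0, 2.0], [0.3, -0.1], [7.0, -0.2], 0.0))
```

In this sketch the coefficients would come from the derivation sample (Panel 19) and the covariates from the validation sample (Panel 20), mirroring the workflow described in the text.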
To gauge the value of adding variables to the predictive model, we compared (a) a bare-bones model in which SHS alone predicted expenditures; (b) a “basic” model in which SHS and the health-related variables described above were used; and (c) a “full” model which used all variables that were statistically significant, which specifically included several measures of healthcare access as described above. Because prior work found an interactive effect between age and SHS, 9 we tested various specifications. We assessed the relative performance of the three models both by comparing adjusted-R2 values and root mean squared errors (RMSEs) and by reporting on the predictive accuracy of the model using a new cohort of people. For the latter, we were particularly interested in how well the full model could predict extremely high costs compared to the performance of the basic model, given that the decision to purchase health insurance is often about mitigating the risk of unexpectedly high costs.
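The two in-sample fit measures named above can be computed as in the following sketch. The function names and toy data are hypothetical; the formulas are the standard definitions of root mean squared error and adjusted R-squared, where the adjustment penalizes each additional predictor.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error of the predictions."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def adjusted_r2(actual, predicted, n_predictors):
    """R-squared adjusted for the number of predictors (intercept excluded)."""
    n = len(actual)
    mean_actual = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)

# Toy data for illustration only.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
print(rmse(actual, predicted), adjusted_r2(actual, predicted, 1))
```

Because the adjustment grows with the number of predictors, a larger model raises adjusted R-squared only when its added variables explain enough additional variation to offset the penalty.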
All analyses were conducted using SAS Enterprise Guide Version 7.15 (Cary, NC). We considered a two-tailed
Results
Sample Characteristics
Selected Descriptive Characteristics of Weighted MEPS Data, 2014–2015.
In terms of socioeconomic and access indicators, people in excellent health were more educated, earned higher incomes, and were more likely to have private insurance, live in metropolitan areas, and live in a Medicaid expansion state. More than half (54.9%) of those reporting poor health had a high school diploma or less compared to only 31.8% of those reporting excellent health. Among those in poor health, only 7.2% were uninsured, while 53.0% were covered by a public source, compared to 11.7% of those in excellent health being uninsured and only 11.6% with public coverage. A greater share of respondents with poor, rather than excellent, health were from non-metropolitan areas, and especially from rural non-core counties. Similarly, a greater share of respondents who reported poor health (87.3%) vs. those who reported excellent health (69.0%) had a usual source of care. Finally, 60.7% of those in excellent health lived in a Medicaid expansion state, whereas only 56.8% of those in poor health lived in an expansion state.
Relationship Between Health and Expenditures
Coefficients of Basic vs. Full Models of Healthcare Spending.
Notes: Blank cells indicate that a variable was not included in the model. N.S. indicates that a variable was not statistically significant and was excluded. Significance levels are indicated by * (P<0.10), ** (P<0.05), and *** (P<0.01).
The full model, which added race and socioeconomic variables, as well as access variables, yielded an adjusted
Out-Of-Sample Model Performance
After applying the basic and full models to Panel 20 MEPS data, the full model performed slightly better at the upper tail of the cost distribution. Figure 1 displays the median and extreme (99%, 95%, and 5%) values of the differences between predicted expenditures and the actual expenditures observed in the data by SHS. For example, the median individual in fair health had expenditures that were $5.48 lower than the basic model predicts, and $3.71 lower than the full model predicts. At the upper tail, where a well-informed insurance choice becomes critical, the basic model underpredicted the 99th percentile costs by $886, while the full model underpredicted by $832. This difference of $54 was the largest difference across the five SHS groups. Moreover, the full model was actually less accurate than the basic model at the 99th percentile for those in poor health. Thus, while the full model performed slightly better, on average, than the basic model, neither model performed well at the upper tail of the distribution.
Figure 1. Predictive Accuracy (Actual − Predicted Expenditures) of Basic vs. Full Models Across Five Health Status Levels.
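The error summary described above, percentiles of actual minus predicted expenditures within each SHS group, could be tabulated along the following lines. This is a sketch with hypothetical names and a nearest-rank percentile definition, which is one of several common conventions and may differ from the one used for the figure. A positive error means the model underpredicted.

```python
import math
from collections import defaultdict

def percentile(values, pct):
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def error_summary(records, pcts=(5, 50, 95, 99)):
    """records: iterable of (shs_score, actual, predicted) tuples."""
    errors = defaultdict(list)
    for shs, actual, predicted in records:
        errors[shs].append(actual - predicted)   # positive = underprediction
    return {shs: {p: percentile(errs, p) for p in pcts}
            for shs, errs in errors.items()}

# Toy records for illustration only: one SHS group, predictions fixed at zero.
toy = [(1, float(i), 0.0) for i in range(1, 101)]
print(error_summary(toy))
```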
Discussion
This study found that SHS alone was not a strong predictor of medical expenditures on an individual level, despite strong correlations between better health and lower expenditures at the aggregate level. The addition of a small number of key health-related variables such as age, sex, and number of chronic conditions, however, led to significant improvements in the amount of variation explained by the model. The addition of access-related variables such as possession of health insurance and a usual source of healthcare, most of which were associated with higher expenditures, improved accuracy marginally. Neither model performed well in predicting the upper tail of the cost distribution where individuals incur high costs in a given year.
The basic model, which included variables directly related to health, explained significantly more variation in the data than the SHS-only model could. The model predicted future expenditures based upon previous data, performing well for a large majority of the sample. However, it did poorly at predicting expenditures at the highest end of the distribution, as did the SHS-only model.
After controlling for health needs through basic health-related variables, having high levels of access—geographic, cost-related, and personal—in a given year was associated with higher costs the following year. However, the “value added” in terms of modeling accuracy was quite modest, suggesting that a person’s basic health-related data and SHS stand in, to a large extent, for these other variables. In particular, the findings are consistent with the notion that SHS at a given point in time is already a product of the access a person has had in the past, and is likely very similar to their current access measures. Insurance status and type, income level, geographic proximity to care, and urban/rural status can impact healthcare access. 25 Barriers to access are sometimes present for racial and ethnic minorities, 26 and for those of lower educational attainment. 25 Because the dependent variable is in log form, it is not straightforward to interpret the coefficients themselves, but the signs and other relationships help illustrate the relative importance of the key health-related variables.
Among the access variables included as significant in the final model, we note that many relate directly to the ability to access healthcare (rural geography, residence in a Medicaid expansion state, and having health insurance, high income, and a usual source of care) in a logistical or economic sense, and that others relate only indirectly (race, educational status). We found that better health status was more strongly predictive of lower spending when we controlled for all geographic, cost-related, and personal access variables. The negative coefficient on non-white race (−0.493) reinforces extensive literature on lower access for racial minorities, since we are controlling for health need through SHS and number of health conditions. Estimates of the public and private insurance variables indicate, unsurprisingly, that both are associated with higher expenditures, compared to having no insurance.
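To illustrate how a coefficient on a log-scale outcome can be read, the sketch below converts the reported coefficient of −0.493 into an approximate percentage difference. It assumes a natural-log outcome and that the coefficient comes from the expenditure equation, conditional on positive spending; the text does not state either explicitly, and the function name is hypothetical.

```python
import math

def pct_change_from_log_coef(coef):
    """Percent change in the outcome implied by a one-unit change in a
    regressor when the outcome is modeled in natural-log form."""
    return (math.exp(coef) - 1.0) * 100.0

# Coefficient on non-white race as reported in the text.
print(round(pct_change_from_log_coef(-0.493), 1))   # prints -38.9
```

Under these assumptions, the coefficient corresponds to expenditures roughly 39% lower, holding the other variables in the model constant.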
Even if access variables had added significantly to the predictive power of the model, it is likely inappropriate to include them for purposes of guiding insurance plan selection if they stem from inequitable allocation of resources rather than underlying health need. The implication would be that utilization should be estimated under the assumption that such inequities will persist. Similarly, a model that controlled for more specific details of health insurance coverage, while potentially being more accurate, would risk, in a decision aid application, steering consumers to high-deductible health plans that could end up influencing them to avoid seeking necessary care.
Ultimately, we found that all three models are far from accurate when it comes to predicting very high outlier expenditures: the error can be large enough that paying a higher up-front premium to obtain a lower deductible would have been the better decision for the individual. Basing a specific insurance plan choice on such a prediction would likely lead to a suboptimal outcome for such individuals. The slight differences among the models were not meaningful in dollar terms, and much additional sensitive and private information was required to generate the full model estimates. It is possible that predicting particularly high costs may be improved with more complex modeling, for example, machine-learning techniques such as LASSO that test for higher-order relationships among variables, or by including variables capturing functional status, frailty, or other high-cost conditions with greater specificity than SHS. A study with a different model specification in which insurance choice data are examined to predict the chance of high, outlier spending, defined
Limitations
We included variables in the basic model which we judged that people would be willing to share in the context of health insurance decision aids (all were variables directly connected to health while avoiding questions on race, income, and other potentially sensitive subjects). However, validating the acceptability of this list of questions was beyond the scope of the current study.
We used the most recently available MEPS panels at the time the analyses were conducted in February 2020, but more recent data would now be available through a new restricted-access request. Our primary goal was to predict year 2 utilization based upon year 1 information, and we do not expect these relationships to differ greatly because of the age of the data.
Conclusion
SHS was not an entirely satisfactory predictor of individual medical expenditures, even when augmented with other variables likely to be available in a decision-aid context. The most significant variation in plan benefit designs within a given insurance market is the contrast between standard and high-deductible plans, and the errors in our model were of a magnitude that would sometimes produce errors in recommending one of these options over the other. Decision aids may be quite useful as health insurance literacy interventions, which should focus on communicating the
Policies that rely heavily on health insurance consumers making individually optimal choices cannot assume that decision tools can accurately anticipate high costs. Only 1% of Health Insurance Marketplace consumers failing to buy comprehensive coverage when they “should” have done represents more than 100,000 people making a costly mistake—or finding themselves in a situation in which they cannot afford needed care. This analysis calls into question the policy approach of relying heavily on a model in which savvy health insurance consumers assess risks and tradeoffs, weighing expected cost and other factors to arrive at an optimal choice. Insurance exists in part to protect against financial duress, that is, to help individuals with outlier costs, and it appears that no simple decision aid can contribute meaningfully to this assessment of outlier risk, even as it may educate consumers as to the nature of risks and tradeoffs. However, modern health insurance is also about facilitating access to care. More work should explore how shopping across plans which vary in their quality, networks, and other features (perhaps holding cost constant as a function of income) could improve individual- and societal-level health outcomes.
Footnotes
Declaration of conflicting interests
The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: Dr. Joynt Maddox serves on a health policy advisory committee for Centene Corporation, and previously did contract work for the US Department of Health and Human Services.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Publication of this work was supported by the National Cancer Institute of the National Institutes of Health (P50CA244431). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Author’s Note
The research in this paper was conducted at the CFACT Data Center, and the support of AHRQ is acknowledged. The results and conclusions in this paper are those of the authors and do not indicate concurrence by AHRQ or the Department of Health and Human Services.
