Abstract
Objectives:
Health-care performance comparisons across countries are gaining popularity. For such comparisons to be meaningful, the risk adjustment methodology plays a key role. However, comparisons may be complicated by the fact that not all participating countries are allowed to share their data across borders, meaning that only simple methods are easily used for the risk adjustment. In this study, we develop a pragmatic approach using patient-level register data from Finland, Hungary, Italy, Norway, and Sweden.
Methods:
Data on acute myocardial infarction patients were gathered from health-care registers in several countries. In addition to unadjusted estimates, we studied the effects of adjusting for age, gender, and a number of comorbidities. The stability of estimates for 90-day mortality and length of stay of the first hospital episode following diagnosis of acute myocardial infarction was studied graphically, using different choices of reference data. Logistic regression models were used for mortality, and negative binomial models were used for length of stay.
Results:
Results from the sensitivity analysis show that the various models of risk adjustment give similar results for the countries, with some exceptions for Hungary and Italy. Based on the results, in Finland and Hungary, the 90-day mortality after acute myocardial infarction is higher than in Italy, Norway, and Sweden.
Conclusion:
Health-care registers offer encouraging possibilities for performance measurement and enable the comparison of entire patient populations between countries. Risk adjustment methodology is affected by the availability of data, and thus, the building of risk adjustment methodology must be transparent, especially when doing multinational comparative research. In that case, even basic methods of risk adjustment may still be valuable.
Introduction
A major theme in health services research is to develop performance indicators and promote the use of benchmarking information for health policy. In developed countries, there are national benchmarking projects ongoing, but cross-country comparisons at the national level are very few. In particular, there is little information on health system performance between countries based on patient-level data. A central goal of the European Health Care Outcomes, Performance and Efficiency (EuroHOPE) project is to develop performance indicators and to evaluate the performance of European health-care systems in terms of outcomes, quality, use of resources, and costs. 1 Multinational patient-level studies of health system performance are hampered most by data availability and the lack of unique patient identifiers. 2 In the EuroHOPE partner countries, linkable patient-level administrative data for use of in- and outpatient hospital services and prescribed medicines, as well as data on mortality, are available for researchers.
One of the challenges when comparing health-care performance measures between countries is to adjust for differences in the patient mix. This is further complicated by the fact that detailed information on the patients may not be available, or variables may be defined very differently across countries. EuroHOPE aims to solve this problem by using individual-level register data available for everyone with a specified health problem. These data contain detailed information on variables that affect the health performance measures, such as disease-specific comorbidities, number of days in hospital, and medication use prior to the occurrence of the health problem studied.
This study gives a description of the data and methods used to compare health-care performance measures within EuroHOPE. We briefly describe the contents of the data. Then, we discuss the methodological aspects of risk adjustment with regard to how stable the risk-adjusted estimates are depending on what data are used as the reference when calculating them. In multinational studies, it is likely that the effects of risk adjusters differ between countries. However, because there are many variables in the models, it may be difficult to evaluate how much a difference in a single coefficient matters for the risk-adjusted value just by looking at, for example, interactions between countries and single risk adjusters. Changes in other coefficients might offset the effect a single coefficient would have on the risk-adjusted value. Also, in EuroHOPE, some countries (the Netherlands and Scotland) have data sharing restrictions, so the complete individual-level data cannot be pooled. We expect this to be a problem in other studies using detailed individual-level registry data as well, as several countries are reluctant to share their data abroad. This problem limits the choice of methods that can be used for analyzing the data, as most methods require all data to be pooled. Hence, the main aim of this article is to study whether one can get stable results from simple methods of risk adjustment even when the data are complex.
In this study, we used only acute myocardial infarction (AMI) patient data to illustrate the methodology. More comprehensive output will be presented in separate articles focusing on each condition included in EuroHOPE.
Methods
Data
A total of seven countries participated in the EuroHOPE project: Finland, Hungary, Italy (the city of Turin), the Netherlands, Norway, Scotland, and Sweden. In this article, we analyze data from all countries except Scotland and the Netherlands (see section “Statistical Methods” below). EuroHOPE applied an episode-based approach to analyzing the performance of countries and regions in the treatment of certain health problems, similar to the approach used earlier at the national level in Finland. 3 The patient populations studied in the project are very low-birth-weight infants and individuals suffering from AMI, cerebral infarction, hip fracture, and breast cancer.
Each country prepared a data file for each disease, following a disease-specific protocol of inclusion and exclusion criteria. 1 Data were collected from various health registers containing the relevant information with the widest possible coverage on the use of health services of these patients. This included cause-of-death registers; hospital inpatient registers containing length of stay (LOS), comorbidity, and treatment information; and prescription registers containing information on medication use. Data from different registers were merged using the unique patient identifiers available in each country. For AMI, cases were identified for the year 2007 in all countries except Norway, which used 2009 due to the unavailability of deterministically linkable hospital discharge, prescribed medication, and cause-of-death data before that year. For a more detailed description of the data and methods used in the EuroHOPE project, please see Häkkinen et al. 1
Each patient had a follow-up of 1 year beginning on the date when the episode started (index admission), excluding patients with AMI admissions in the 365 days before the index admission. In addition, the patients’ hospital discharge data and data on purchases of prescribed medicine were collected 1 year back as these were used in defining some of the risk adjustment variables.
Variables used in the risk adjustment include age at index admission, gender, disease-specific comorbidities, and the number of hospital days (for any cause) the year prior to the index admission. The comorbid diseases were specified for each disease group by clinical experts in the field. The comorbid diseases used in the study of AMI are presented in more detail in Häkkinen et al., 1 including the International Classification of Diseases (ICD; both ICD-10 and ICD-9) codes and the Anatomical Therapeutic Chemical (ATC) classification system codes used for identifying the selected comorbid diseases from the hospital discharge registers and the data on medicine purchases, respectively.
The performance measures were specifically tailored for each disease in EuroHOPE. Main measures used in AMI were all-cause mortality after 30 days, 90 days, and 1 year; LOS for the first hospital episode; all-cause LOS during the first year after the diagnosis; and disease-specific LOS during the first year after the diagnosis. The first hospital episode is defined as continuous hospital inpatient care (overnight stay at home between the hospital stays is allowed), truncated at 365 days if the LOS was longer. For the disease-specific LOS, only days spent in hospital with the particular disease as the main diagnosis are considered. A list of the performance measures and risk adjustment variables used in the study of AMI, with some descriptives, is given in Tables 1 and 2.
Descriptive statistics on background variables, length of stay, and 90-day mortality for AMI.
SD: standard deviation; AMI: acute myocardial infarction; LOS: length of stay.
Descriptive statistics on comorbidities used in risk adjustment of performance measures for AMI.
AMI: acute myocardial infarction; COPD: chronic obstructive pulmonary disease.
Comorbid disease not included in the restricted risk adjustment models (M2 and M3).
Statistical methods
Descriptive statistics of the outcomes and risk adjusters are compared, using measures such as proportions, means, and medians.
The first step of the risk adjustment was to construct a merged database from the countries which were allowed by their national data protection authorities to share data across borders. The countries included Finland, Norway, Sweden, Italy, and Hungary, and pooled data of these are called the reference database. As seen from Table 1, the reference database for AMI includes 59,135 patients in total.
For each response, three different risk-adjusted outputs were produced: adjusted for sex and age only (M1); adjusted for sex, age, LOS previous year, disease-specific comorbidities based on primary and secondary diagnoses the year prior to diagnosis (M2); and M3 identical to M2 except comorbidities were based on both primary and secondary diagnoses and medication purchases the year prior to diagnosis. The reason for using both models M2 and M3 is to compare the effect of a narrow and broad definition of comorbidities. Only comorbidities with prevalence of >1% in all countries based on the definition given for M3 were included as risk adjusters. As seen from Table 2, in the case of AMI, this excludes atherosclerosis, dementia, renal insufficiency, and alcoholism.
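The prevalence rule described above can be sketched in a few lines of Python. This is only an illustration: the prevalence values are invented and are not taken from Table 2, and the function name `select_risk_adjusters` is hypothetical.

```python
# Hypothetical prevalence (%) of comorbidities by country.
# All numbers are made up for illustration; they are NOT the Table 2 values.
prevalence = {
    "hypertension": {"Finland": 45.0, "Hungary": 62.0, "Italy": 41.0, "Norway": 36.0, "Sweden": 39.0},
    "diabetes": {"Finland": 18.0, "Hungary": 25.0, "Italy": 16.0, "Norway": 12.0, "Sweden": 14.0},
    "dementia": {"Finland": 3.1, "Hungary": 0.4, "Italy": 1.2, "Norway": 2.0, "Sweden": 1.8},
}


def select_risk_adjusters(prevalence, threshold=1.0):
    """Keep comorbidities whose prevalence exceeds the threshold in every country."""
    return sorted(
        name
        for name, by_country in prevalence.items()
        if all(p > threshold for p in by_country.values())
    )


selected = select_risk_adjusters(prevalence)
print(selected)  # ['diabetes', 'hypertension']
```

In this toy input, dementia is dropped because its prevalence falls below 1% in one country, even though it is above the threshold in the other four.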
Based on the experiences in the PERFormance, Effectiveness and Cost of Treatment episodes (PERFECT) project, 3 the observed/expected approach 4 was used, which roughly corresponds to indirect standardization. Logistic regression was used for the mortality outcomes, whereas negative binomial regression was used for the LOS measures.
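One common way to write an observed/expected calculation for a binary outcome is given below. It is a sketch in notation introduced here for illustration, not necessarily the exact formula used in EuroHOPE: $y_i$ is the outcome of patient $i$, $\mathbf{x}_i$ the vector of risk adjusters, $\hat{\beta}_0$ and $\hat{\boldsymbol{\beta}}$ the coefficients estimated from the reference database, and $\bar{y}_{\text{ref}}$ the overall outcome proportion in the reference database.

```latex
\hat{p}_i = \frac{1}{1 + \exp\left(-\left(\hat{\beta}_0 + \mathbf{x}_i^{\top}\hat{\boldsymbol{\beta}}\right)\right)},
\qquad
O_c = \sum_{i \in c} y_i,
\qquad
E_c = \sum_{i \in c} \hat{p}_i,
\qquad
\text{adjusted rate}_c = \frac{O_c}{E_c}\,\bar{y}_{\text{ref}}
```

For a count outcome such as LOS, the same ratio can be formed with $E_c$ taken as the sum of the predicted means from the count model.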
The regression coefficients used to produce the risk-adjusted estimates in each country were based on the reference database. To prevent the relatively large samples from Sweden and Hungary from contributing much more to the estimates than the smaller sample from Italy (representative only of the city of Turin), weighted regression was used to give equal weight to all five countries, as the effect of the risk adjusters might differ between countries. By comparing both the risk-adjusted estimates and the descriptive statistics on background variables and comorbidities in each country, one may get an indication as to why some countries perform better than others. It is also possible to present comparisons of regions between the countries, and an example of such output is given for AMI in Norway.
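The text does not state how the weights were constructed. A minimal sketch, assuming weights inversely proportional to each country's sample size, is shown below; the function name `equal_country_weights` and the toy sample sizes are hypothetical.

```python
from collections import Counter


def equal_country_weights(countries):
    """Return one weight per patient so that every country's weights sum to the same total.

    Assumption: weights are inversely proportional to the country's sample size.
    """
    counts = Counter(countries)
    n_total = len(countries)
    n_countries = len(counts)
    # Each country receives total weight n_total / n_countries,
    # shared equally among its patients.
    return [n_total / (n_countries * counts[c]) for c in countries]


# Toy example: one large, one medium, and one small sample.
countries = ["Sweden"] * 6 + ["Hungary"] * 3 + ["Italy"] * 1
weights = equal_country_weights(countries)

# Sum the weights within each country to check that they are balanced.
totals = {}
for c, w in zip(countries, weights):
    totals[c] = totals.get(c, 0.0) + w
print(totals)
```

Scaling the weights so that they sum to the total number of patients keeps the effective sample size on its original scale; per-patient weights of this kind could then be supplied to a weighted regression routine.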
A sensitivity analysis is presented to study the extent to which risk-adjusted mortality rates differ depending on whether the two countries with the highest unadjusted mortality (Hungary and Finland) or the three countries with the lowest mortality (Norway, Sweden, and Italy) are used as the reference database. This allows us to study whether results depend on the choice of reference data and, more importantly, whether interaction effects between country and risk adjusters seem to matter in practice. Normally, one would pool data from all countries to use as the reference, but as this is not possible, it is interesting to study the impact different choices of reference data have on the risk-adjusted measures. In order to illustrate this, we need access to all data used in the analysis, and Scotland and the Netherlands are hence omitted. The reference data in EuroHOPE will therefore only consist of those countries’ data that can be pooled. In case it is impossible to construct a merged database from all participating countries, the approach proposed here is a practical option to study the problem. The data were analyzed using Stata. 5
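The idea of the sensitivity analysis can be illustrated with a simplified sketch. Note the substitution: instead of the logistic regression used in the study, the example uses stratum-specific reference rates (classical indirect standardization by age group), and the reference countries are pooled without weighting. The country labels A, B, and C and all counts are invented.

```python
# Hypothetical data: (deaths, patients) per country and age group.
# Labels and numbers are made up for illustration only.
data = {
    "A": {"<70": (10, 200), "70+": (60, 300)},
    "B": {"<70": (30, 300), "70+": (50, 200)},
    "C": {"<70": (8, 100), "70+": (45, 300)},
}


def reference_rates(data, reference):
    """Stratum-specific death rates and the overall rate in the pooled reference countries."""
    rates = {}
    deaths_all = patients_all = 0
    strata = next(iter(data.values())).keys()
    for s in strata:
        d = sum(data[c][s][0] for c in reference)
        n = sum(data[c][s][1] for c in reference)
        rates[s] = d / n
        deaths_all += d
        patients_all += n
    return rates, deaths_all / patients_all


def adjusted_rate(data, country, reference):
    """Observed/expected ratio for one country, scaled by the overall reference rate."""
    rates, overall = reference_rates(data, reference)
    observed = sum(d for d, _ in data[country].values())
    expected = sum(n * rates[s] for s, (_, n) in data[country].items())
    return observed / expected * overall


# Repeat the adjustment with different choices of reference data.
for ref in (["A", "B", "C"], ["A", "B"], ["C"]):
    print(ref, {c: round(adjusted_rate(data, c, ref), 3) for c in data})
```

If the adjusted values for each country change little across the three reference choices, the differences in stratum effects between countries matter little in practice, which is the question the sensitivity analysis is meant to answer.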
Results
Examples of risk-adjusted results for AMI
The pooled coefficients from a logistic regression analysis with AMI 90-day mortality as the response and using the reference database are given in Table 3, left column. The age group 90+ is used as the reference category for age; for all other variables, the coefficients give the effect of scoring on each variable. Area under the curve (AUC) values are above 0.7 for all models M1–M3, with M3 showing the best performance. Most risk adjusters are significant in all models and for all three choices of reference data, but the coefficients can be quite different, as expected.
Coefficients with standard errors for 90-day mortality after AMI based on the reference database and two alternative databases.
M1: sex/age adjusted; M2: sex/age/comorbidity without medication adjusted; M3: sex/age/comorbidity with medication adjusted; AMI: acute myocardial infarction; COPD: chronic obstructive pulmonary disease; LOS: length of stay; AIC: Akaike information criterion; BIC: Bayesian information criterion; AUC: area under the curve.
Pseudo R2, AIC, BIC, and AUC values are included at the end for comparison of explanatory power between the models.
Significant at 5% level.
Significant at 1% level.
Significant at 0.1% level.
As seen from Table 1, the unadjusted 90-day mortality proportion varies from 10% in Norway to 21% in Hungary. Figure 1(a) shows the effect of risk adjustment on these proportions, using the pooled regression coefficients from the reference database. The risk adjustment changes the mortality proportions to a limited degree compared with the unadjusted ones, with the exception of Hungary. The explanation is that Hungary has younger patients, who according to the regression output in Table 3 are expected to have lower mortality; thus, adjusting for age and sex only causes the mortality proportion for Hungary to increase in M1. However, Hungary also has a considerably higher prevalence of comorbidities than the other countries, most of which have an increasing effect on mortality, so adjusting for these in M2 and M3 brings the mortality proportion closer to the unadjusted value.

Unadjusted and risk-adjusted 90-day mortality after AMI in five countries: (a) 90-day mortality proportions with 95% confidence intervals for each country, full data used as reference in adjustment; (b) 90-day mortality proportions for each country, data of Finland and Hungary used as reference in adjustment; (c) 90-day mortality proportions for each country, data of Norway, Sweden, and Italy used as reference in adjustment; and (d) regional 90-day mortality proportions in Norway.
To assess the effect of heterogeneity in regression coefficients between countries, Figure 1(b) shows the 90-day unadjusted and adjusted mortality proportions when Finland and Hungary are used as the reference data, whereas Figure 1(c) shows the corresponding proportions when Norway, Italy, and Sweden are used as the reference data. There are some differences between the two graphs, but perhaps surprisingly few. We see that M1 gives a higher estimate for the mortality in Hungary if the low-mortality countries Norway, Sweden, and Italy are used as the reference data instead of the full reference. As seen from Table 3, the age effects for M1 are more protective using the low-mortality country reference compared with the full reference. Hence, this influences Hungary with its young patient population. However, in M2 and M3, some of the comorbidities for which the prevalence is highest in Hungary get greater estimated effects using the low-mortality country reference compared with the full reference, moving the mortality estimates for Hungary to a lower level than the estimates from the full reference. A similar reasoning can be used for Italy and the observed lower mortality in model M3 when Hungary and Finland were used as the reference compared with the other reference data. Effects of comorbidities were smaller when using the Finland/Hungary reference data, so even though the prevalence of the comorbidities was higher in model M3 estimated with this reference data, the adjustment gives a smaller impact on the mortality estimates.
The same approach can also be used to illustrate regional differences in mortality within a country. An example for Norway is shown in Figure 1(d). From the point estimates, it is evident that there is heterogeneity in the 90-day mortality, although the confidence intervals are too wide to give any significant differences in most cases. The international focus is a key element when looking at regional differences also; otherwise, there would be little point in basing the risk adjustment on pooled regression coefficients over national coefficients.
Another example is to study the LOS of first hospital episode as a measure of performance. As shown in Table 1, the unadjusted averages vary from 8 days in Norway to 12 days in Finland. From the regression output shown for the full reference data in Table 4, it is evident that fewer risk adjusters reach significance, indicating poorer explanatory power than for 90-day mortality. There is, for instance, no clear age trend in the results. The pseudo R2 values are low.
Coefficients with standard errors for length of first hospital episode after AMI based on the reference database and two alternative databases.
M1: sex/age adjusted; M2: sex/age/comorbidity without medication adjusted; M3: sex/age/comorbidity with medication adjusted; AMI: acute myocardial infarction; COPD: chronic obstructive pulmonary disease; LOS: length of stay; AIC: Akaike information criterion; BIC: Bayesian information criterion.
Pseudo R2, AIC, and BIC values are included at the end for comparison of explanatory power between the models.
Significant at 5% level.
Significant at 1% level.
Significant at 0.1% level.
Looking at the graphs on average LOS of first hospital episode in Figure 2, one can see that when using the full reference database (Figure 2(a)), there is little difference between the unadjusted and risk-adjusted averages. Finally, a graph showing the regional variation in the LOS of first hospital episode in Norway is given in Figure 2(d).

Unadjusted and risk-adjusted length of stay after AMI in five countries: (a) average length of first hospital episode with 95% confidence intervals for each country, full data used as reference in adjustment; (b) average length of first hospital episode for each country, data of Finland and Hungary used as reference in adjustment; (c) average length of first hospital episode for each country, data of Norway, Sweden, and Italy used as reference in adjustment; and (d) regional average length of first hospital episode in Norway.
Discussion
There have been other recent examples of multinational comparisons of health-care quality outcomes with access to individual-level data.6,7 However, one major complication in EuroHOPE, which could be a general problem in any multinational study, is that not all countries have permission to share data across borders due to confidentiality restrictions. As data cannot be pooled, methods such as multilevel models and propensity score matching, among others, cannot readily be used.6,8,9 When it comes to model choice, certain compromises must be made in order for a study like this to be feasible. A model which shows a good fit in one country may not be equally applicable in another. But in order to perform the analysis, a single choice of model has to be made. Also, when the number of responses to be risk adjusted in a study is large, it becomes impractical to have different model choices for each response. Hence, not being able to pool all data poses several problems, as the methods used for finding the “best” model have to be simpler than the methods one would ideally wish for. Most covariates are categorized, even at the expense of less discriminating power. If a polynomial or spline were to be fitted for continuous covariates, it would have to be fitted to the data we are able to pool. Then one would have to impose exactly the same fit on the data not part of the pooling in order to get risk-adjusted estimates for those countries. We judged this to be a larger potential source of bias than using a simple categorization. For LOS, we also tried several alternative generalized linear models (GLM) including negative binomial, gamma, and inverse Gaussian models with log and identity link functions, but the negative binomial model showed the best fit.
The methods presented in this article are simple to use. In the risk adjustment, one implicitly assumes that the effects of the confounding factors on the response are similar in all countries, which is often not the case. We estimated weighted pooled regression coefficients to be used in the risk adjustment. The weighting would have been unnecessary if we believed that the effect of the risk adjusters would be exactly equal across all countries, as then it would not matter if some country contributed many more cases to the total data than others. Notwithstanding, the point estimates of the regression coefficients used in the risk adjustment would be the same. Equality of coefficients across countries can be checked by studying the interaction effects between the countries and the risk adjusters.10,11 In large register-based studies like this, many interactions will be statistically significant, and ignoring this in the risk adjustment would lead to the constant risk fallacy, 11 potentially causing the standard risk-adjusted estimates to be biased. Although not shown in the results, significant interactions were also found in EuroHOPE for the data we were able to pool. But again, this is only possible to study thoroughly if one can pool data from all countries included in the study. Also, the impact on the outcome of large differences in the effect of single risk adjusters between countries is difficult to ascertain, as there are many risk adjusters working together in the models. In any case, the problem is hard to solve, as to our knowledge there are no ready-made solutions if there are significant interactions. Different effects of risk adjusters may be due to differences in treatment practices, coding practices, or under-/over-reporting of comorbid diseases. However, the magnitude of the problem can be studied by comparing the risk-adjusted responses using different choices of reference database, as demonstrated in Figures 1 and 2.
These figures illustrate that the choice of reference data does not matter too much for AMI; hence, statistically significant interactions may not always be a problem in practice. Thus, even simple methods of risk adjustment may be useful if more advanced methods are difficult or impossible to use.
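The interaction check mentioned above can be written, for the mortality model, in the following general form. This is a sketch in notation introduced here for illustration, not a specification taken from the study: $d_{ic}$ indicates that patient $i$ belongs to country $c$ (with country 1 as the baseline among $K$ pooled countries), $\gamma_c$ is a country main effect, and $\boldsymbol{\delta}_c$ collects the country-specific deviations in the effects of the risk adjusters $\mathbf{x}_i$.

```latex
\operatorname{logit} P(y_i = 1) = \beta_0 + \mathbf{x}_i^{\top}\boldsymbol{\beta}
  + \sum_{c=2}^{K} d_{ic}\left(\gamma_c + \mathbf{x}_i^{\top}\boldsymbol{\delta}_c\right),
\qquad
H_0:\ \boldsymbol{\delta}_2 = \dots = \boldsymbol{\delta}_K = \mathbf{0}
```

Rejecting $H_0$ indicates that the effects of the risk adjusters differ between countries. With register-sized samples this is likely to happen even for small differences, which is why the practical impact is better judged from the risk-adjusted values themselves.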
Footnotes
Acknowledgements
The authors are grateful to two anonymous reviewers for their comments and constructive input. Any remaining errors and omissions are the authors’ responsibility.
Declaration of conflicting interests
The authors have declared that there is no conflict of interest.
