Introduction
A great advantage of randomized controlled trials (RCTs) over observational studies is their ability to balance both known and unknown confounders, and to remove the potential for other forms of bias. As such, large, well-balanced RCTs are considered the gold standard for estimating the causal effects of a treatment. In the Atrial Fibrillation Follow-up Investigation of Rhythm Management (AFFIRM) trial, a total of 4060 patients with atrial fibrillation (AF) were randomized to a rate-control (2027 patients) or rhythm-control (2033 patients) strategy [Van Gelder et al. 2002]. The results suggested that a rate-control strategy was associated with a trend towards decreased mortality, which was significant in those 65 years of age or older. Digoxin was one of the four rate-control drugs in AFFIRM. The European Heart Journal recently reported the results of two separate post hoc analyses of the AFFIRM trial data that attempted to retrospectively assess the causal effect of digoxin treatment on mortality [Gheorghiade et al. 2013; Whitbeck et al. 2013]. Since patients were originally randomized to strategies rather than to individual drugs, both research groups used propensity score (PS) analysis in an attempt to balance digoxin and non-digoxin users for case mix, and therefore for their likelihood of being prescribed such treatment. Although both groups used data from the same trial, their results conflicted: Whitbeck and colleagues concluded that digoxin use was associated with an increase in all-cause mortality (hazard ratio [HR] = 1.41, 95% confidence interval [CI] = 1.19–1.67, p = 0.001) [Whitbeck et al. 2013], while Gheorghiade and colleagues concluded that it was not (HR = 1.06, 95% CI = 0.83–1.37; p = 0.64) [Gheorghiade et al. 2013]. Although both studies used PS analysis, the approaches differed, raising the possibility that this was to blame for the discrepancy.
There were, however, other important aspects of study design and analysis that also differed between the two reports: in particular, the selection of participants and the approach used to define digoxin exposure (fixed versus time-varying). It is therefore important to consider the likely effects of each before assuming that the different PS analysis techniques were to ‘blame’.
Estimating causal effects using PS analysis
PS analysis was proposed as a method for obtaining unbiased estimates of the causal effect of an exposure in non-randomized studies, provided that all confounders are measured [Rosenbaum and Rubin, 1983]. It essentially comprises a two-stage regression approach, in which the first stage is a binary logistic regression used to create a predicted probability of treatment, the ‘propensity score’. The PS is then used either to individually match subjects in the treatment (digoxin/non-digoxin) groups, thereby ensuring an even balance in case mix between the groups before regressing the outcome (mortality) on the treatment, or simply as an additional covariate in the outcome model. PS analysis has a number of advantages over one-stage regression modelling with covariate adjustment. In particular, it allows for adequate matching on a large number of covariates that may affect treatment decisions: even if all covariates were binary, and only a moderate number (n) were used, matching on the covariates themselves would require the creation of 2^n strata (10 binary covariates alone would yield 1024 strata). If the PS analysis is successful, the mean values of each of the included PS covariates should be similar across groups. PS methods should therefore provide a better estimate of the true causal effect of the exposure of interest when a large number of treatment-related variables are measured. In particular, PS should reduce the potential for confounding by indication, an important source of bias in observational studies whereby those receiving treatment have worse outcomes not because of the treatment but because they were sicker and therefore required it. PS methods still have limitations, however, and even when good prognostic data are available and groups are well matched, bias such as that due to confounding by indication may not be removed completely [Bosco et al. 2010; Deeks et al. 2003], and indeed may still have been present here. PS analysis exists in a number of different forms.
The four most common approaches are: stratification on the PS, individual matching on the PS, weighting by the inverse of the PS, and inclusion of the PS as a covariate in the outcome model [Williamson et al. 2012].
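To make the two-stage structure concrete, the following is a minimal, purely illustrative sketch in Python: a logistic regression fitted by gradient descent supplies the PS, and a greedy 1:1 nearest-neighbour caliper match without replacement illustrates both the balancing step and how matched analyses can shed subjects. The data are simulated, and all names and settings (`propensity_scores`, `match_on_ps`, the caliper value) are hypothetical; neither research group's actual procedure is reproduced here.

```python
import numpy as np

def propensity_scores(X, t, lr=0.1, n_iter=2000):
    """Stage 1: logistic regression of treatment on baseline covariates,
    fitted by plain gradient ascent (a minimal stand-in for any
    logistic-regression routine)."""
    Xb = np.column_stack([np.ones(len(X)), X])  # add intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w += lr * Xb.T @ (t - p) / len(t)       # log-likelihood gradient step
    return 1.0 / (1.0 + np.exp(-Xb @ w))

def match_on_ps(ps, t, caliper=0.05):
    """Stage 2: greedy 1:1 nearest-neighbour matching without replacement.
    Treated subjects with no control within the caliper are dropped,
    which is how matched analyses can shrink the available sample."""
    treated = np.where(t == 1)[0]
    controls = list(np.where(t == 0)[0])
    pairs = []
    for i in treated:
        if not controls:
            break
        j = min(controls, key=lambda c: abs(ps[c] - ps[i]))
        if abs(ps[j] - ps[i]) <= caliper:
            pairs.append((i, j))
            controls.remove(j)                  # without replacement
    return pairs

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))                     # three baseline covariates
logit = 0.8 * X[:, 0] - 0.5 * X[:, 1]           # sicker patients more likely treated
t = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

ps = propensity_scores(X, t)
pairs = match_on_ps(ps, t)
ti, ci = zip(*pairs)

# Balance check: the covariate means should be closer in the matched sample
print(len(pairs), "matched pairs")
print("unmatched diff in X0:", X[t == 1, 0].mean() - X[t == 0, 0].mean())
print("matched diff in X0:  ", X[list(ti), 0].mean() - X[list(ci), 0].mean())
```

The final balance check mirrors the point made above: if the PS analysis is successful, the covariate means in the matched treated and control groups should be similar, whereas the unmatched groups differ because treatment assignment depended on the covariates.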
Differences between studies in the analytical approach
Leaving aside the study populations specified by the two studies, there are at least three ways in which the different analytical approaches may have influenced the findings.
Different forms of PS analysis
Two of the most common forms of PS analysis are PS covariate adjustment and PS matching, employed by Whitbeck and colleagues and by Gheorghiade and colleagues, respectively. Each controls for confounding in the same way that adjusting for a covariate in regression or matching on a covariate, respectively, would. Whilst matching is seen as an attractive choice, when performed without replacement it leaves open the strong possibility of losing subjects that cannot be adequately matched on their PS. As a result of matching, Gheorghiade and colleagues retained only 878 pairs (n = 1756) of subjects from the initial pool of 2706 subjects available after their exclusion criteria had been applied.
Different predictors of treatment in the PS estimation
The PS used by Whitbeck and colleagues was created using predictors of digoxin treatment that largely consisted of indicators of disease history and initial treatments, but did not include predictors such as age, sex and renal function, which were instead included along with the PS in the Cox regression for mortality. Gheorghiade and colleagues used a nonparsimonious multivariable logistic regression model in which 59 variables were included as covariates, mostly relating to demographics, disease and medication history, and New York Heart Association (NYHA) class. After matching subjects on the PS, their Cox model for mortality contained digoxin as the only covariate. In the sensitivity analysis common to both groups, Whitbeck and colleagues highlight (although perhaps unintentionally) the important effect of the overall covariate adjustment: their unadjusted Kaplan–Meier curves show a high degree of separation between groups (likelihood ratio test p = 0.0014) compared with the complete lack of an association after adjustment.
Classification of digoxin exposure throughout the trial (fixed versus time varying)
The classification of digoxin exposure as fixed throughout the analysis by Gheorghiade and colleagues more closely mirrors the intention-to-treat (ITT) analysis of an RCT than does the time-varying covariate approach employed by Whitbeck and colleagues, which resembles more of a per-protocol analysis. The latter also re-introduces the potential for confounding by indication, biasing the estimated effect towards harm if patients who were not on digoxin at baseline but later switched to digoxin treatment did so because of deterioration in their condition. PSs are calculated using information on baseline covariates, and the estimated effects of treatment on mortality therefore account only for differences between digoxin users and nonusers in these baseline covariates. Since subjects in an RCT are randomized at baseline, it is their baseline characteristics that will be well matched, rather than their characteristics at some later point in the trial.
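The distinction between the two exposure codings can be made concrete with a small sketch. Using hypothetical follow-up records (the patient data and function names below are invented for illustration, not drawn from AFFIRM), fixed coding freezes each patient's baseline exposure for the whole follow-up, whereas the counting-process coding used for time-varying Cox covariates splits a switching patient's follow-up into separate unexposed and exposed intervals:

```python
# Hypothetical follow-up records: (patient id, follow-up years, died,
# on digoxin at baseline, time of switching onto digoxin or None).
patients = [
    ("A", 5.0, False, True,  None),  # on digoxin throughout
    ("B", 4.0, True,  False, None),  # never on digoxin
    ("C", 3.0, True,  False, 1.5),   # started digoxin mid-trial
]

def fixed_exposure(patients):
    """Fixed (ITT-like) coding: one (id, start, stop, exposed, event) row
    per patient, with exposure frozen at its baseline value.  Patient C
    counts as unexposed for the whole of follow-up."""
    return [(pid, 0.0, end, base, died)
            for pid, end, died, base, switch in patients]

def time_varying_exposure(patients):
    """Counting-process coding for a time-varying Cox covariate: a patient
    who switches contributes one unexposed and one exposed interval, with
    the event attached only to the final interval."""
    rows = []
    for pid, end, died, base, switch in patients:
        if switch is None:
            rows.append((pid, 0.0, end, base, died))
        else:
            rows.append((pid, 0.0, switch, base, False))     # before switching
            rows.append((pid, switch, end, not base, died))  # after switching
    return rows

for row in time_varying_exposure(patients):
    print(row)
```

Under the fixed coding, patient C's death is attributed to the unexposed group; under the time-varying coding it falls in the exposed interval. This is precisely how switching driven by clinical deterioration can shift estimated harm onto the drug.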
Differences between studies in the underlying populations
Beyond these analytical aspects, there were also important differences in the subjects included in each post hoc analysis. Gheorghiade and colleagues included 2706 AFFIRM participants who were either receiving or not receiving digoxin at baseline. Excluded were 1354 AFFIRM participants who either had been using digoxin prior to randomization but then discontinued its use (n = 465) or had missing information on prior digoxin therapy at randomization (n = 889). After matching on the PS, a total of 878 pairs (n = 1756) remained in the primary analysis. In contrast, the study population for the primary analysis of Whitbeck and colleagues included all but two of the 4060 subjects in AFFIRM. Making sense of the real effect of the two different PS approaches on the overall results is therefore difficult. Fortunately, both research groups performed several sensitivity analyses, and these are of considerable help in exploring the cause of the different conclusions. Importantly, one of these employed the same definition of digoxin exposure in both studies (whether or not patients received digoxin during the 6 months prior to baseline) and included all trial patients (2153 receiving digoxin at baseline and 1905 not), although after PS matching Gheorghiade and colleagues retained only 1454 matched pairs (n = 2908) from this population. The adjusted results of the two groups were very similar: Whitbeck and coworkers reported a HR of 1.02 (95% CI 0.86–1.20; p = 0.83), while Gheorghiade and colleagues reported a HR of 0.97 (95% CI 0.81–1.18; p = 0.78). This was the only one of the four analyses reported by Whitbeck and colleagues that was statistically nonsignificant, in contrast to all of those by Gheorghiade and colleagues.
The similar results from both research groups for the only analysis in which essentially the same population of subjects was chosen suggest that the different approaches to the analyses may have had relatively minor effects on the estimated treatment effects.
Use of time-varying covariates and confounding by indication
In an additional sensitivity analysis of their primary analysis, using propensity adjustment rather than matching, Gheorghiade and colleagues obtained similar results, again suggesting that the results were most influenced either by the subjects excluded before matching or by the use of time-varying rather than fixed covariates. Some evidence that it was the latter comes from an earlier post hoc analysis, performed just 2 years after the initial AFFIRM trial results were reported, in which AFFIRM investigators assessed the ‘on-treatment’ effect of digoxin using time-varying covariate adjustment and observed an increased risk of mortality in both a full and a reduced cohort. In the main analysis, a total of 2796 patients were included, and digoxin use was independently associated with higher all-cause mortality (HR = 1.42; 95% CI 1.09–1.86; p = 0.0007) [Corley et al. 2004]; excluded were the approximately 25% of patients without echocardiographic data. Individual rate-control drugs (digoxin, beta-blockers and calcium channel blockers) were included as separate covariates in a Cox regression, while rhythm-control drugs were grouped together. Other covariates in the final parsimonious model were age, coronary artery disease, congestive heart failure, diabetes, stroke, smoking, left ventricular dysfunction, mitral regurgitation, sinus rhythm, warfarin use and rhythm-control drug use. Results were similar when patients without echocardiographic data were also included (HR = 1.50, 95% CI 1.18–1.89; p < 0.001).
Discussion
By excluding subjects who were treated with digoxin but later discontinued it, and those with missing information on digoxin as an initial therapy, Gheorghiade and colleagues studied the effects of consistent digoxin use as an initial therapy. Their findings cannot, however, be assumed to generalize to other patients, especially those who previously discontinued digoxin treatment. In addition, the reason for the missing baseline data (whether or not digoxin was an initial therapy) was related to the exposure of interest (digoxin use) rather than being missing at random, and exclusion of these subjects may therefore have biased the estimated treatment effect of digoxin. That said, overall mortality was only slightly higher in the subjects excluded from analysis (16.4%) than in those included (13.8%). On the other hand, the consistent classification of treatment throughout follow-up by Gheorghiade and colleagues, rather than the use of time-varying covariates as employed by Whitbeck and colleagues, seems the more logical approach if the aim is to follow the normal protocol for an RCT more closely and obtain an ITT estimate of the treatment effect.
With these considerations in mind, the divergent results and conclusions obtained by the two groups most likely reflect either the different underlying populations studied or the different treatment of exposure (fixed versus time-varying), rather than the different forms of PS technique pursued. The two editorials that accompanied the reports in the European Heart Journal also provided useful insight [Murphy, 2013; van Veldhuisen et al. 2013]. Van Veldhuisen and colleagues were generally more accepting of the evidence for increased risk with digoxin, and postulated a likely mechanism by which it may occur: the use of relatively high doses of digoxin in the elderly and frail AFFIRM study population may have led to toxic serum digoxin concentrations. They also noted that the effects of dose were not assessed in the trial, and that there is some evidence that low doses of digoxin may reduce risk in those with atrial fibrillation and heart failure, although this evidence also stemmed from a post hoc analysis [Ahmed et al. 2006]. The statistical editorial by Sabina Murphy generally favoured the conservative conclusion of no increased risk, pointing towards the potential for selection bias, towards confounding by indication through treating digoxin exposure as time-varying rather than fixed, and towards the lack of clarity regarding the covariates included in the creation of the PS by Whitbeck and colleagues.
PS analysis is known to reduce the potential for bias in the analysis of observational data, including selection bias and confounding by indication. That is not to say, however, that either of these is removed completely, or that they cannot be re-introduced by other aspects of the study design and analysis. Conclusions from post hoc analyses of RCTs therefore require careful consideration in terms of both their internal and external validity. Whilst certainly perplexing, opposing conclusions from analyses of the same trial data are just as likely to reflect differences in the chosen study populations or exposure classification as differences in analytical technique.
Footnotes
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Conflict of interest statement
The author has no conflicts of interest to declare.
