Abstract
Background
Multiple sclerosis (MS) comparative effectiveness research needs to go beyond average treatment effects (ATEs) and post hoc subgroup analyses.
Objective
This retrospective study assessed overall and patient-specific effects of dimethyl fumarate (DMF) versus teriflunomide (TERI) in patients with relapsing-remitting MS.
Methods
A novel precision medicine (PM) scoring approach leverages advanced machine learning methods and adjusts for imbalances in baseline characteristics between patients receiving different treatments. Using the German NeuroTransData registry, we implemented and internally validated different scoring systems to distinguish patient-specific effects of DMF relative to TERI based on annualized relapse rates, time to first relapse, and time to confirmed disease progression.
Results
Among 2791 patients, there was superior ATE of DMF versus TERI for the two relapse-related endpoints (p = 0.037 and 0.018). Low to moderate signals of treatment effect heterogeneity were detected according to individualized scores. An MS patient subgroup was identified for whom DMF was more effective than TERI (p = 0.013): older (45 versus 38 years), longer MS duration (110 versus 50 months), not newly diagnosed (74% versus 40%), and no prior glatiramer acetate usage (35% versus 5%).
Conclusion
The implemented approach can disentangle prognostic differences from treatment effect heterogeneity and provide unbiased patient-specific profiling of comparative effectiveness based on real-world data.
Keywords
Introduction
Multiple sclerosis (MS) is one of the most prevalent neurological autoimmune diseases and may cause severe disability. More than a dozen disease-modifying therapies (DMTs) are approved for reducing the risk of MS relapses, such as dimethyl fumarate (DMF) and teriflunomide (TERI), but there is currently no single solution that controls disease progression for all patients with MS. MS is heterogeneous in both disease course and patient responses, motivating the development of a personalized strategy. It is important to prescribe the best treatment for each individual to optimize the timing of treatment, better manage symptoms, and prevent future relapses and loss in functions and quality of life. To help clinicians make informed treatment decisions and to save patients from unwarranted physical and financial burden, precision medicine using rich medical data of patients with MS can serve as a promising guide. Precision medicine is a data-driven paradigm that recommends the most effective treatment for individual patients based on patient-specific clinical variables (i.e., treatment effect modifiers).
The objective of this analysis is to explore the potential treatment effect heterogeneity between two oral DMTs, DMF and TERI, in the German NeuroTransData (NTD) registry and to look for stratifications of the patient population for tailored treatment recommendation. Real-world evidence has several advantages over evidence derived from randomized controlled trials (RCTs), which are often underpowered for detecting patient-specific heterogeneity 1 : (1) acquisition of larger sample sizes; (2) more variety of patient characteristics; and (3) treatment administration representative of the real-world setting. A recently developed generalization of a novel precision medicine method used for RCTs has been applied (Supplemental Figure S1).2,3 The method stratifies patients with similar characteristics based on a scoring system, which measures the relative benefit of DMF versus TERI for each individual based on their baseline characteristics. The method is especially appealing for analyzing treatment effects in patients with MS because it: (1) adjusts for potential confounding; (2) constructs and internally validates the scoring systems; and (3) caters to ratio-based metrics of the treatment benefit, suitable for count and survival outcomes. This work extends the previous analysis of two phase 3 RCTs comparing DMF with placebo, 1 a clinical application of the original methodological work2,3 tailored to a clinical audience. It is also a companion analysis of a French MS registry analysis of comparative treatment effectiveness between DMF and fingolimod. 4 By applying this precision medicine approach to real-world data, we aim to provide guidance on the personalized selection between two MS therapeutic options and pave the road to personalized treatment decisions in MS.
Methods
The NTD network and NTD MS registry
Founded in 2008, the NTD is a physician network of approximately 66 practices and 133 members across Germany in the field of neurology and psychiatry, gathering healthcare data to improve physician–patient interaction and optimize the treatment of individual patients. 5 Information on demographics, medical history, patient-related outcomes, and clinical variables is recorded regularly via a web-based patient management platform 6 during outpatient office visits, with 3.7 visits per year on average. 7 A comprehensive manual and automated data control framework has been established to ensure high data quality. 8 As of the end of 2019, around 25,000 patients with MS were included and followed longitudinally over an average of 5 years in the NTD registry, representing around 10% of approximately 224,000 patients with MS in Germany. 9
Study population
The study included NTD records of 2791 patients with relapsing-remitting MS who received DMF or TERI between January 1, 2009, and July 1, 2018. Inclusion and exclusion criteria have been published. 7 Patient baseline characteristics were taken at the initiation of index therapy.
Effectiveness assessment endpoints
Annualized relapse rate (ARR), the average number of relapses for a group of patients per year, was the primary endpoint of this analysis. Secondary endpoints were time to first relapse since treatment initiation and time to 12-week confirmed disease progression (CDP) defined by Expanded Disability Status Scale (EDSS) scores. The time to first relapse or time to CDP may be right censored, and the censoring time was defined as time to loss to follow-up, treatment discontinuation, or data cut date, whichever occurred first.
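As a minimal sketch of the ARR definition above, the crude group-level rate can be computed as total relapses divided by total person-years of follow-up. The person-time convention, the function name, and the toy numbers are assumptions made for illustration; they are not taken from the registry, and the study itself modeled the ARR with negative binomial regression rather than this crude calculation.

```python
def annualized_relapse_rate(relapse_counts, follow_up_years):
    """Crude group-level ARR: total relapses / total person-years (assumed convention)."""
    if len(relapse_counts) != len(follow_up_years):
        raise ValueError("inputs must have the same length")
    total_years = sum(follow_up_years)
    if total_years <= 0:
        raise ValueError("total follow-up must be positive")
    return sum(relapse_counts) / total_years


# Hypothetical toy data: four patients, three relapses over eight person-years.
relapses = [0, 1, 2, 0]
years = [2.0, 1.5, 3.0, 1.5]
arr = annualized_relapse_rate(relapses, years)
print(f"ARR = {arr:.3f} relapses per patient-year")
```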
Baseline characteristics
Nineteen baseline characteristic variables were included both as potential confounders and potential treatment effect modifiers: age, sex, number of prior DMTs, number of months since MS diagnosis, newly diagnosed (yes if diagnosed <12 months ago or no prior DMTs; no otherwise), prior use of glatiramer acetate (GA) (yes/no), prior use of interferon (yes/no), number of relapses in 12 months and in 24 months before treatment initiation, EDSS total score, individual EDSS score (visual, ambulatory, brainstem, cerebellar, cerebral, pyramidal, sensory, sphincteric), and EuroQol-5D-5L visual analog scale (EQ5D5L VAS) score.
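The "newly diagnosed" indicator defined above is a simple two-condition rule, which can be sketched as follows. The function and argument names are hypothetical; only the rule itself (diagnosed less than 12 months ago, or no prior DMTs) comes from the text.

```python
def is_newly_diagnosed(months_since_diagnosis, n_prior_dmts):
    """Return True if diagnosed < 12 months ago OR no prior DMTs; False otherwise."""
    return months_since_diagnosis < 12 or n_prior_dmts == 0


# Hypothetical patients: (months since diagnosis, number of prior DMTs)
examples = [(6, 2), (50, 0), (50, 1), (12, 1)]
flags = [is_newly_diagnosed(m, n) for m, n in examples]
print(flags)
```

Note that under this rule a patient with a long disease duration still counts as newly diagnosed if they have never received a DMT.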
Statistical methods
General statistical considerations
Baseline patient characteristics were reported as mean (SD) or n (%) for continuous and categorical variables, respectively. Cohen's standardized mean or proportion differences (SMD) were reported to measure the difference in baseline characteristics between treatment groups. Absolute values of SMD > 0.1 were considered clinically relevant. Missing values were imputed by the corresponding observed sample mean.
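The two steps described above, mean imputation and the standardized mean or proportion difference, can be illustrated with a short sketch. The pooled-standard-deviation form of the SMD used here is a common convention and is an assumption, since the text does not spell out the exact formula; all names and toy values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def impute_mean(values):
    """Replace missing entries (None) with the observed sample mean."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def smd_continuous(x_a, x_b):
    """Standardized mean difference using the pooled SD of the two groups (assumed form)."""
    pooled_sd = sqrt((variance(x_a) + variance(x_b)) / 2)
    return (mean(x_a) - mean(x_b)) / pooled_sd


def smd_proportion(p_a, p_b):
    """Standardized difference between two proportions (assumed form)."""
    pooled_sd = sqrt((p_a * (1 - p_a) + p_b * (1 - p_b)) / 2)
    return (p_a - p_b) / pooled_sd


# Hypothetical ages, with one missing value in the first group.
age_dmf = impute_mean([38, None, 42, 40])
age_teri = [44, 46, 48, 46]
d_age = smd_continuous(age_dmf, age_teri)
imbalanced = abs(d_age) > 0.1  # threshold stated in the text
print(round(d_age, 3), imbalanced)
```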
Baseline variable selection
Three sets of potential confounders, one set for each endpoint, were identified from baseline variables that were associated with the endpoint at the significance level of 0.1 in a multivariable regression analysis, separately among patients receiving DMF or TERI. Negative binomial regression was used for the ARR, and Cox proportional hazards model was used for the two time-to-event endpoints. The selected informative variables were treated as potential confounders and effect modifiers in the calculation of average treatment effects (ATEs) as well as individualized treatment response (ITR) score.
Average treatment effects
The ATE between DMF and TERI was measured as an ARR ratio for number of relapses and restricted mean time lost (RMTL) ratio for the two time-to-event endpoints. To account for potential confounding, we estimated the ATE adjusting for baseline variables with a doubly robust procedure, which combined the propensity score (PS) model and the outcome regression model for protection against model misspecification and provided more precise statistical inferences on treatment effects. 3 Patients were weighted based on an inverse probability weighting scheme estimated from the PS model to balance their baseline characteristics between the two treatment groups (pages 3–4 in the Supplement). Both unadjusted and doubly robust estimators, together with their 95% confidence intervals, were reported with standard errors obtained from 400 bootstrap samples.
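To make the weighting and bootstrap steps more concrete, the sketch below computes an inverse-probability-weighted ARR ratio and a bootstrap standard error. It is deliberately simplified and is not the doubly robust estimator used in the study: the propensity scores are taken as given rather than estimated, there is no outcome regression model, and the weights are held fixed across bootstrap samples instead of being re-estimated. All names and data are hypothetical.

```python
import random
from statistics import stdev


def ipw_weights(treated, propensity):
    """Weight 1/e for DMF (treated=1) and 1/(1-e) for TERI (treated=0)."""
    return [1.0 / e if t == 1 else 1.0 / (1.0 - e)
            for t, e in zip(treated, propensity)]


def weighted_arr_ratio(treated, relapses, years, weights):
    """Ratio of weighted relapse rates, DMF over TERI."""
    num = {0: 0.0, 1: 0.0}
    den = {0: 0.0, 1: 0.0}
    for t, r, y, w in zip(treated, relapses, years, weights):
        num[t] += w * r
        den[t] += w * y
    return (num[1] / den[1]) / (num[0] / den[0])


def bootstrap_se(treated, relapses, years, weights, n_boot=400, seed=0):
    """SD of the ratio over bootstrap resamples; degenerate resamples are skipped."""
    rng = random.Random(seed)
    rows = list(zip(treated, relapses, years, weights))
    estimates = []
    for _ in range(n_boot):
        sample = [rows[rng.randrange(len(rows))] for _ in rows]
        try:
            estimates.append(weighted_arr_ratio(*zip(*sample)))
        except ZeroDivisionError:
            continue  # an arm was empty or had no relapses in this resample
    return stdev(estimates)


# Hypothetical toy data: 1 = DMF, 0 = TERI.
treated = [1, 1, 1, 1, 0, 0, 0, 0]
propensity = [0.8, 0.6, 0.5, 0.7, 0.4, 0.5, 0.2, 0.3]
relapses = [0, 1, 0, 1, 1, 2, 1, 0]
years = [2.0] * 8

w = ipw_weights(treated, propensity)
ratio = weighted_arr_ratio(treated, relapses, years, w)
se = bootstrap_se(treated, relapses, years, w)
print(round(ratio, 3), se > 0)
```

A ratio below 1 favors DMF. With only eight toy patients the bootstrap standard error is not meaningful; it is shown only to illustrate the mechanics.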
Individualized treatment response score and treatment effect heterogeneity assessment
Patient-specific treatment effects between DMF and TERI were measured with ITR scores, which were estimated by the aforementioned novel precision medicine approach 3 chosen for its robustness and adaptability to real-world data.
Constructed for each of the three endpoints separately, the estimated ITR score for patient i was a ratio comparing DMF with TERI.
We used cross-validated validation curves to evaluate the performance of ITR scores in detecting heterogeneous treatment effects. 3 Validation curves < 1 mean that patients on average benefited more from DMF than TERI, and patients benefited more from TERI on average if >1. The steeper the validation curves, the stronger the treatment effect heterogeneity captured by the ITR scores. To test whether the treatment effects between DMF and TERI varied among patients grouped according to ITR scores, we sorted and grouped patients in the validation set according to their ranked ITR scores into two subgroups, high DMF responders and equal responders, defined a priori based on the 60% versus 40% ITR distributions. We compared the univariate patient baseline data within subgroups and by treatment group to characterize the responder groups. In addition, three mutually exclusive subgroups of approximately equal sizes were identified and the ATE in each subgroup was estimated; the distributions of these estimates across 200 training and validation splits were summarized as boxplots. Two-sided p values < 0.05 were considered statistically significant throughout the study. All analyses were conducted using R version 4.0.5. 10 See the Technical Details on Methods section of the Supplement for further details.
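The construction of a validation curve can be sketched as follows: patients are ranked by ITR score, nested subgroups of the lowest-scoring patients are formed, and the observed DMF/TERI ARR ratio is computed in each. This sketch is an assumption-laden simplification, since it uses a crude unweighted ratio on a single data set, with no confounding adjustment and no cross-validation; the function names and toy data are hypothetical.

```python
from math import ceil


def crude_arr_ratio(rows):
    """rows: (treated, relapses, years). Crude ARR ratio, DMF (1) over TERI (0)."""
    rel = {0: 0.0, 1: 0.0}
    yrs = {0: 0.0, 1: 0.0}
    for t, r, y in rows:
        rel[t] += r
        yrs[t] += y
    return (rel[1] / yrs[1]) / (rel[0] / yrs[0])


def validation_curve(scores, treated, relapses, years, proportions):
    """Observed ARR ratio among the patients with the lowest ITR scores, per proportion."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    curve = []
    for p in proportions:
        k = max(1, ceil(p * len(order)))
        subset = [(treated[i], relapses[i], years[i]) for i in order[:k]]
        curve.append((p, crude_arr_ratio(subset)))
    return curve


# Hypothetical toy data, already sorted by score for readability.
scores = [-0.9, -0.7, -0.5, -0.3, 0.1, 0.2, 0.4, 0.6]
treated = [1, 0, 1, 0, 1, 0, 1, 0]
relapses = [0, 2, 1, 2, 1, 1, 1, 1]
years = [2.0] * 8

curve = validation_curve(scores, treated, relapses, years, [0.5, 1.0])
print(curve)
```

In this toy example the ratio rises from the lowest-scoring half to the full sample, which is the upward-sloping pattern that the text interprets as evidence of treatment effect heterogeneity.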
Results
The analysis included 2791 NTD patients with MS: 1741 (62%) received DMF and 1050 (38%) received TERI. The two treatment cohorts were moderately different. Younger patients with shorter disease duration, more relapses in the prior 24 months, higher EQ5D5L VAS scores, and lower EDSS pyramidal, sensory, and total scores at baseline tended to be treated with DMF (Table 1). This NTD cohort was generally older, had prior DMTs, and had longer MS duration than clinical trial cohorts, as the latter typically enroll patients with fewer complicating characteristics.
Patient baseline characteristics by treatment group.
DMF: dimethyl fumarate; DMT: disease-modifying therapy; EDSS: Expanded Disability Status Scale; EQ5D5L: EuroQol-5 Dimension 5-level version; GA: glatiramer acetate; SD: standard deviation; SMD: standardized mean or proportion difference; TERI: teriflunomide; VAS: visual analogue scale.
Data are reported as mean (SD) for continuous variables, and n (%) for categorical variables.
New diagnosis is defined as “yes” if one of the following conditions is satisfied: (1) having a time since diagnosis of < 12 months or (2) having no prior DMTs; and “no” otherwise.
Standardized mean or proportion difference (Cohen's d values): In general, a value <0.2 is considered acceptable, between 0.2 and 0.5 considered as a moderate difference, between 0.5 and 0.8 as a significant difference, and >0.8 as a major difference.
Number of relapses at 12 months
Average treatment effect
Average follow-up time was 2.17 years (SD 1.73) in the DMF group and 2.11 years (SD 1.72) in the TERI group. DMF patients had a lower unadjusted ARR than TERI patients (DMF 0.33, TERI 0.41). The ARR of patients receiving DMF was 23.7% lower than that of patients receiving TERI (Table 2). Nine baseline variables were identified and balanced after PS weighting (Supplemental Tables S1 and S2). The doubly robust adjusted ARR ratio remained close to the unadjusted ARR ratio (Table 2), suggesting that the ARR among DMF patients was significantly reduced compared with TERI patients even after adjusting for confounding effects.
Average treatment effect and treatment effect by subgroups for all three endpoints.
ARR: annualized relapse rate; CDP: confirmed disease progression; CI: confidence interval; DMF: dimethyl fumarate; DMT: disease-modifying therapy; EDSS: Expanded Disability Status Scale; EQ5D5L: EuroQol-5 Dimension 5-level version; GA: glatiramer acetate; ITR: individualized treatment response; RMTL: restricted mean time lost; TERI: teriflunomide; VAS: visual analogue scale.
p-Values < 0.05 are boldfaced, indicating that ARR ratio and RMTL ratio of time to first relapse are significantly different within the high DMF responder subgroup.
Adjusted for age, number of prior DMTs, months since diagnosis, number of relapses in the previous 24 months, prior GA, prior interferon, EDSS total score, EDSS cerebral score, and EDSS pyramidal score.
Adjusted for age, number of prior DMTs, months since diagnosis, number of relapses in the previous 12 months, number of relapses in the previous 24 months, prior GA, prior interferon, EDSS cerebellar score, and EDSS cerebral score.
Adjusted for number of prior DMTs, months since diagnosis, prior interferon, EQ5D5L VAS score, EDSS total score, EDSS ambulatory score, EDSS cerebellar score, EDSS sensory score, and EDSS sphincteric score.
Based on cross-validated ITR score estimated from contrast regression.
Performance of the ITR score and treatment effect heterogeneity
Figure 1 (top left) displays the validation curves showing the performance of the four ITR scoring methods. For each scoring method (colored line), a proportion of patients with the lowest estimated ITR scores was grouped together (X-axis) and the observed ARR ratio between DMF and TERI (Y-axis) of the subgroup was averaged over 200 cross-validations. Subgroups of patients with smaller proportions of the lowest estimated ITR scores (left part of the X-axis) tended to have larger observed treatment effect between DMF and TERI (ARR ratios further from 1). All methods except the boosting method (black) indicated treatment heterogeneity because they produced positive, steeper curves. Two regressions and contrast regression (green and blue) were relatively better at detecting treatment heterogeneity than Poisson regression (red). Boosting started to detect treatment heterogeneity when subgroup sizes were larger than 80% of all samples. More explorations of ARR ratios and subgroups can be found on pages 8–11 of the Supplement.

Observed ARR ratio of DMF/TERI (a surrogate for the treatment effect) in nested subgroups of patients ranked by increasing values of the estimated ITR score (high DMF responders), averaged over all validation sets in cross validation. The size of the subgroup of high DMF responders (proportion of patients with the lowest estimated ITR scores): X-axis; the observed ARR ratio of DMF/TERI (treatment effect of DMF relative to TERI): Y-axis. ITR score 1, boosting (black); ITR score 2, Poisson regression (red); ITR score 3, two regressions (green); ITR score 4, contrast regression (blue). ARR: annualized relapse rate; DMF: dimethyl fumarate; EDSS: Expanded Disability Status Scale; ITR: individualized treatment response; TERI: teriflunomide.
We tested whether the treatment effect between DMF and TERI varied among patients grouped according to ITR scores using the 60% and 40% proportion cut-off (Table 2). Among patients with the lowest 60% ITR scores estimated by contrast regression, whom we referred to as high DMF responders, the average cross-validated ARR ratio was well below 1, suggesting that DMF was associated with substantially reduced relapse rates compared with TERI in this group. For the remaining patients, who were referred to as equal responders to DMF versus TERI, our methods did not provide a specific recommendation on treatment selection. The relative benefit of DMF was greater among the high DMF responders than the equal responders, suggesting that the ITR score may have detected treatment effect heterogeneity in the ARR ratios. However, the difference between the two subgroups did not reach statistical significance (p = 0.12), and more evidence is needed to confirm this heterogeneity. Other proportion cut-offs (33%–67%) were also explored; the observed differences in treatment effects for these subgroups were similar and are summarized in Supplemental Table S3.
The weights of the baseline characteristics estimated by the contrast regression scoring method are listed in Table 3, and patient baseline characteristics are compared between the high DMF responder and equal responder subgroups in Table 4. High DMF responders tended to be older (45 versus 38 years of age), had more prior DMTs (1.2 versus 0.65), had a longer MS duration (110 versus 50 months), were not newly diagnosed (74% versus 40%), had no prior GA (35% versus 5%), and had lower EDSS cerebral scores (0.29 versus 0.72). Further comparison of patient characteristics between DMF and TERI within each responder group can be found in the Supplement (Supplemental Table S4).
Weights of the ITR score based on contrast regression for ARR between DMF and TERI.
ARR: annualized relapse rate; CI: confidence interval; DMF: dimethyl fumarate; DMT: disease-modifying therapy; EDSS: Expanded Disability Status Scale; GA: glatiramer acetate; ITR: individualized treatment response; TERI: teriflunomide.
95% CI and p value were calculated assuming that the true log ARR ratio was a linear combination of baseline variables.
The standardized weights were for the rescaled predictors with a standard deviation of 1 and used to gauge the relative importance of the predictor in the construction of the ITR score. ITR Score = −0.592 − 0.020 × age + 0.086 × number of prior DMTs − 0.003 × months since diagnosis + 0.722 × prior GA + 0.319 × prior interferon + 0.042 × number of relapses in the previous 24 months + 0.188 × EDSS score + 0.304 × EDSS cerebral score − 0.033 × EDSS pyramidal score.
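The ITR score formula above can be evaluated directly for a given patient, as in the following sketch. The coefficients are copied from the formula; everything else is an assumption made for illustration: the variable names, the coding of prior GA and prior interferon as 0/1, the reading of "EDSS score" as the EDSS total score, the use of unstandardized predictor values, and the example patient. Interpreting the exponentiated score as a model-implied ARR ratio follows the footnote stating that the log ARR ratio was assumed to be a linear combination of baseline variables, and should be read as a hedged interpretation only.

```python
from math import exp

# Coefficients as printed in the ITR score formula (contrast regression, ARR).
ITR_WEIGHTS = {
    "intercept": -0.592,
    "age": -0.020,
    "n_prior_dmts": 0.086,
    "months_since_diagnosis": -0.003,
    "prior_ga": 0.722,            # assumed coding: 1 = yes, 0 = no
    "prior_interferon": 0.319,    # assumed coding: 1 = yes, 0 = no
    "relapses_prev_24m": 0.042,
    "edss_total": 0.188,          # assumed to be the EDSS total score
    "edss_cerebral": 0.304,
    "edss_pyramidal": -0.033,
}


def itr_score(patient):
    """Linear ITR score; lower values correspond to a larger relative benefit of DMF."""
    score = ITR_WEIGHTS["intercept"]
    for name, weight in ITR_WEIGHTS.items():
        if name != "intercept":
            score += weight * patient[name]
    return score


# Hypothetical patient, not taken from the registry.
patient = {
    "age": 45, "n_prior_dmts": 1, "months_since_diagnosis": 110,
    "prior_ga": 0, "prior_interferon": 1, "relapses_prev_24m": 1,
    "edss_total": 2.0, "edss_cerebral": 0, "edss_pyramidal": 1,
}
score = itr_score(patient)
implied_ratio = exp(score)  # assumed interpretation: model-implied DMF/TERI ARR ratio
print(round(score, 3), round(implied_ratio, 2))
```

Because the age and months-since-diagnosis weights are negative and the prior GA weight is positive, older patients with longer disease duration and no prior GA receive lower scores, which matches the description of high DMF responders in the text.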
Patient characteristics by subgroups defined by the ITR score estimated from contrast regression for ARR ratio DMF versus TERI.
ARR: annualized relapse rate; DMF: dimethyl fumarate; DMT: disease-modifying therapy; EDSS: Expanded Disability Status Scale; EQ5D5L: EuroQol-5 Dimension 5-level version; GA: glatiramer acetate; ITR: individualized treatment response; SD: standard deviation; SMD: standardized mean or proportion difference; TERI: teriflunomide; VAS: visual analogue scale.
Data are reported as mean (SD) for continuous variables, and n (%) for categorical variables.
New diagnosis is defined as “yes” if one of the following conditions is satisfied: (1) having a time since diagnosis of <12 months or (2) having no prior DMTs; and “no” otherwise.
Standardized mean or proportion difference (Cohen's d values): a value <0.2 is considered acceptable, between 0.2 and 0.5 considered as a moderate difference, between 0.5 and 0.8 as a significant difference, and >0.8 as a major difference.
p-Values are from unpaired t-tests (or Wilcoxon rank sum test if non-normally distributed) for continuous variables and Pearson Chi-square (or exact test extensions in case of low frequency) for categorical variables.
Time to first relapse and time to CDP
Average treatment effect
The RMTL to first relapse in a 5-year follow-up period was lower for DMF than TERI (1.29 and 1.58, respectively). The RMTL of DMF patients was significantly lower than that of TERI patients for time to first relapse by 18.3% (p = 0.001) before adjustment and 25.7% (p = 0.018) after doubly robust adjustment of selected baseline variables (Table 2). The RMTL to CDP in a 5.1-year follow-up period was 0.921 for DMF and 0.996 for TERI. The RMTL of DMF patients was 7.5% and 5.5% lower than that of TERI patients for time to CDP before and after adjustment, respectively, with no statistical significance (p > 0.05, Table 2). All results suggested later onset of first relapse and CDP among DMF patients compared with TERI patients.
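For readers less familiar with the restricted mean time lost, the sketch below computes it as the area above a Kaplan–Meier survival curve up to a truncation time, and then checks the unadjusted percentage reductions quoted above from the reported RMTL values. The Kaplan–Meier routine, its name, and the toy data are assumptions for illustration; the study's adjusted estimates come from the doubly robust procedure and cannot be reproduced this way. The survival estimate is carried forward after the last observation, which is one common convention.

```python
def km_rmtl(times, events, tau):
    """RMTL up to tau = tau minus the area under the Kaplan-Meier curve on [0, tau].

    events: 1 = event observed (e.g., first relapse), 0 = right censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv, area, last_t = 1.0, 0.0, 0.0
    i = 0
    while i < len(data) and data[i][0] <= tau:
        t = data[i][0]
        d = c = 0  # events and censorings tied at time t
        while i < len(data) and data[i][0] == t:
            d += data[i][1]
            c += 1 - data[i][1]
            i += 1
        area += surv * (t - last_t)
        if d > 0:
            surv *= 1 - d / at_risk
        at_risk -= d + c
        last_t = t
    area += surv * (tau - last_t)
    return tau - area


# Hypothetical toy data: times in years, 5-year truncation.
toy_rmtl = km_rmtl([1, 2, 3, 4], [1, 0, 1, 0], tau=5)

# Unadjusted reductions implied by the reported (rounded) RMTL values.
pct_relapse = (1 - 1.29 / 1.58) * 100   # time to first relapse
pct_cdp = (1 - 0.921 / 0.996) * 100     # time to CDP
print(toy_rmtl, round(pct_relapse, 1), round(pct_cdp, 1))
```

The ratios computed from the rounded values give reductions of about 18% and 7.5%, in line with the unadjusted 18.3% and 7.5% reported in the text; the small discrepancy for time to first relapse is attributable to rounding of the inputs.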
Performance of the ITR score and treatment effect heterogeneity
For time to first relapse, there was moderate treatment effect heterogeneity as reflected by the positive slopes of all validation curves in Figure 1. In subgroups containing a smaller proportion of patients with the lowest ITR scores (left side of the X-axis), the ATEs measured by both the RMTL ratio and the hazard ratio (HR) tended to be further below 1, implying a stronger treatment effect among high DMF responders. For time to CDP, there was little treatment effect heterogeneity, as the validation curves in Figure 1 were relatively flat for both the RMTL ratio and the HR. The validation curves agreed with the heterogeneity test for both time-to-event endpoints as shown in Table 2. As with ARR, the boosting method identified smaller treatment effect heterogeneity compared with the other three scoring methods. See more information related to the two time-to-event endpoints in the Supplement.
The resulting ITR scores for ARR and RMTL ratios were highly correlated (estimated correlation coefficient = 0.82), corroborating each other in identifying high DMF responders. This high correlation suggested that the treatment that best reduced the ARR may also delay the onset of first relapse in the same patients.
Discussion
Previous related studies often focused on RCTs,1,11 lacked sufficient registry data for DMF and TERI, 12 or relied on traditional regression-based models to predict individual treatment response.12,13 To our knowledge, this is the first application of a rigorous approach for precision medicine generalized to real-world data between DMF and TERI, and one of the few attempts in MS to disentangle the prognostic role of baseline covariates from their possible role as treatment effect modifiers.1,14 We found a significant benefit of DMF relative to TERI in the entire NTD patient sample and detected a significant responder subgroup, among which DMF was substantially better than TERI with respect to ARR and time to first relapse (but not time to CDP). The factors affecting the relative benefit of DMF versus TERI included patients’ age, number of prior DMTs, months since diagnosis, prior GA, and EDSS cerebral score. Prior GA had the largest weight, and its positive sign suggested that the ARR ratio (DMF versus TERI) was lower for patients who had not previously received GA. Among the remaining patients, the effects of DMF versus TERI were statistically inconclusive and a better powered study is needed to draw further conclusions on the significance of both subgroups.
It is crucial to keep the treatment comparator and study population in mind when comparing or generalizing precision medicine conclusions (Supplemental Figure S6). Age and MS duration might seem to have opposite effects on the ARR ratio between DMF and TERI compared with the existing literature on RCTs such as the DEFINE/CONFIRM trials,15,16 where DMF responders were younger and more recently diagnosed, whereas the high DMF responders in the present analysis were older and had a longer MS duration. However, these findings are not in conflict for two main reasons: (1) the comparators were different (NTD compared DMF versus TERI and DEFINE/CONFIRM compared DMF versus placebo; Supplemental Figure S5); and (2) NTD patients were different from the DEFINE/CONFIRM patients at baseline in terms of age, MS duration, number of relapses in the prior year, prior treatment, and EDSS score. For example, NTD patients (mean age: 40 and 45 years, mean MS duration: 6 and 9 years, respectively, for DMF and TERI) tended to be relatively older with longer MS duration than DEFINE/CONFIRM patients (mean age < 40 years and mean MS duration < 6 years).
The study has several limitations. First, we assumed that all confounders were observed and included in the analysis, which may not be true. The presence of unmeasured confounders (e.g., MRI or cognition data for this study) can lead to poorly performing ITR scores and biased evaluation. Second, we considered ARR as a meaningful summary of a patient's response to treatment, although patients with the same ARR can have different lengths of treatment. A shorter treatment could imply that the patient experienced negative treatment effects and quickly switched to an alternative treatment, but this was not necessarily reflected in the ARR ratio. Third, the study presented a set of analyses as a proof-of-concept example without necessarily optimizing or justifying all analytical choices. Different PS, outcome regression, imputation, ITR scoring methods, and baseline variables could lead to different results. Fourth, it is not clear why prior GA was identified as an important treatment effect modifier. The mechanistic relationship between prior GA and relapses can be complicated, and further research is needed. Last, internal validation was applied to avoid over-optimistic ITR scores; external validation with a sufficient sample size was not possible due to data availability, but it would be needed to draw conclusions beyond the NTD cohort.
Conclusions
This study provides a solid basis for future research. Patients may switch from treatment to treatment according to their personal experience. A possible future research direction is a dynamic treatment strategy that guides patients to select the optimal treatment at baseline and make subsequent adjustments according to their clinical history.
Supplemental Material
sj-docx-1-mso-10.1177_20552173231194353 - Supplemental material for Overall and patient-specific comparative effectiveness of dimethyl fumarate versus teriflunomide: A novel approach to precision medicine applied to the German NeuroTrans Data Multiple Sclerosis Registry
Supplemental material, sj-docx-1-mso-10.1177_20552173231194353 for Overall and patient-specific comparative effectiveness of dimethyl fumarate versus teriflunomide: A novel approach to precision medicine applied to the German NeuroTrans Data Multiple Sclerosis Registry by Xiaotong Jiang, Gabrielle Simoneau, Mel Zuercher, Yanic Heer, Philip van Hoevell, Adrian Harrington, Wanda Castro-Borrero, Carl de Moor, Fabio Pellegrini and Lu Tian, Arnfin Bergmann, Stefan Braune in Multiple Sclerosis Journal – Experimental, Translational and Clinical
Footnotes
Acknowledgments
Cara Farrell, Excel Medical Affairs, copyedited and styled the manuscript per journal requirements. Biogen reviewed and provided feedback on the paper. The authors had full editorial control of the paper and provided their final approval of all content.
Author contributions
X Jiang interpreted data and drafted the manuscript. F Pellegrini and A Harrington designed the study, coordinated the analyses, interpreted the data, and revised the manuscript. L Tian and M Zuercher developed the statistical algorithms, analyzed and interpreted data, and revised the manuscript. G Simoneau, W Castro-Borrero, P van Hoevell, C de Moor, A Bergmann, Y Heer, and S Braune interpreted data. All authors provided final approval of the manuscript for submission.
Data availability
Declaration of conflicting interests
X Jiang, G Simoneau, A Harrington, W Castro-Borrero, C de Moor, and F Pellegrini are current or former employees of Biogen and hold stock/stock options in Biogen. L Tian received consulting fees from Biogen. M Zuercher, P van Hoevell, and Y Heer are employees of Rewoso AG, Zürich, Switzerland. A Bergmann has received consulting fees from advisory board, speaker, and other activities for NeuroTransData; project management and clinical studies for and travel expenses from Novartis and Servier. S Braune has received honoraria from Kassenaerztliche Vereinigung Bayern and health maintenance organizations for patient care; honoraria for consulting, project management, clinical studies, and lectures from Biogen, Eli Lilly, Merck, NeuroTransData, Novartis, Roche, and Thieme Verlag; honoraria and expense compensation as board member of NeuroTransData.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was sponsored by Biogen (Cambridge, MA, USA). Biogen funded editorial support in the development of this paper. The corresponding author had full access to all the data in the study and had final responsibility for the decision to submit for publication.
Supplemental material
Supplemental material for this article is available online.
