Abstract
As routinely collected longitudinal data becomes more available in many settings, policy makers are increasingly interested in the effect of time-varying treatments (sustained treatment strategies). In settings such as this, many commonly used statistical approaches for estimating treatment effects, such as g-methods, often adopt the ‘no unmeasured confounding’ assumption. Instrumental variable (IV) methods aim to reduce biases due to unmeasured confounding, but have received limited attention in settings with time-varying treatments. This paper extends and critically evaluates a commonly used IV estimating approach, Two Stage Least Squares (2SLS), for evaluating time-varying treatments. Using a simulation study, we found that, unlike standard 2SLS, the extended 2SLS performs relatively well across a wide range of circumstances, including certain model misspecifications. We illustrate the methods in an evaluation of treatment intensification for Type-2 Diabetes Mellitus, exploring the exogeneity in prescribing preferences to operationalise a time-varying instrument.
Introduction
As routinely collected data has become more available, there has been increasing interest in long-term causal effects in studies with time-varying treatments. For example, decision makers are interested in the effect of glycemic control strategies over a sustained period of time, for which evidence from randomised controlled studies is often limited. Time-varying confounding, where health-related factors impact both treatment and outcome after the first recorded time period, is a major challenge. Since time-varying confounders often act as mediators of the effect of previous treatment, standard regression adjustment for these variables blocks indirect effects of the treatment. To handle this, more sophisticated methods such as the g-methods 1 are needed. Such methods often assume no unmeasured confounding, which is likely to be implausible in practice. For example, the prescription of anti-diabetic medication at each time point may depend on a wide range of factors, such as co-morbidities and disease severity, not all of which may be measured.
A popular approach to deal with unmeasured confounding is to use Instrumental Variable (IV) analysis, which has been widely used across many disciplines such as genetics, economics and clinical research.2–4 IV exploits sources of exogenous variation that are strongly associated with treatment assignment, and affect the outcome only through the treatment. IV estimating approaches such as the Wald estimator and Two Stage Least Squares (2SLS) for evaluating time-fixed point treatments are well established. 5 However, these approaches have received little attention in the evaluation of time-varying treatments.
Recent research has sought to extend IV methodologies to time-varying settings.6,7 In this setting, a major challenge is to identify an instrument that remains valid and strong across different time periods. For example, genes have been considered for the evaluation of time-varying treatments in Mendelian Randomisation (MR).8–11 MR studies only considered time-fixed baseline IVs given the nature of genetic markers. An alternative strategy is to identify an IV that varies over time to instrument a sequential treatment assignment, which is then operationalised via
Two promising methods have recently been documented that can apply a time-varying IV to longitudinal data. The first is a novel inverse probability weighting procedure, 17 and the second is an application of
However,
The paper’s primary aim is to investigate the application of time-varying IVs using 2SLS, for which there is little established work. We then draw on multivariate 2SLS methods in MR settings4,18,19 and robust 2SLS methods in time-fixed settings,20–22 and detail a residual instrument 2SLS (RI2SLS) approach that accommodates a time-varying IV. We show this method can be related to recent advances in
Motivating example
Our paper is motivated by an analysis of the effectiveness of second line therapy for T2D on blood glucose levels. T2D is a progressive disease characterised by an impaired ability for pancreatic
Treatment involves prescribed medication to control and lower HbA1c levels. NICE guidelines in the UK recommend Metformin monotherapy as first line therapy. However, about 30%–50% of patients either fail to respond to first line treatment or find that monotherapy becomes less effective over time, so second line intensification is often necessary. Second line therapy supplements Metformin with a second oral anti-diabetic. Owing to insufficient evidence of a preferred second line therapy, NICE (2022) guidelines 24 leave the choice of treatment to clinicians and primary care practices. For this reason, second line therapy preferences can differ greatly between practices and GPs, and are subject to change over time. 25 Patients without high risk of cardio-vascular disease are most commonly assigned Sulfonylureas (SU) or DPP4-inhibitors (DPP4) as second line therapy. Because SU was one of the first intensifications available for T2D, GPs have historically had a strong preference for it. However, recent studies may have shifted preference towards DPP4.26,27
Study population and eligibility criteria
Our motivating example includes data from routinely collected primary care records in three East London clinical commissioning groups (CCGs) based in Tower Hamlets, Newham and City, and Hackney. Data on treatment and health related information was collected and recorded in intervals of 6 months from 2012 to 2018. This period coincided with a shift in medical preferences about DPP4 versus SU. 26
The median follow up for patients on second line T2D treatment is two and a half years (i.e. 5 time periods). We look to follow up patients for up to 2 years, or 4 periods. Time
Eligible patients were between 18 and 89 years of age, registered with a primary care practice, and initiated second line therapy after first line monotherapy failed. Patients were required to be on either SU or DPP4 at initiation of second line therapy, with complete relevant data available for the full follow up period. Patients who do not start on one of these two treatment regimes, leave the study before 3 follow up times, pause treatment on SU or DPP4, or begin a further intensification by taking both or another diabetic treatment, are censored from the study. Our initial data subset includes
Study design
Our treatment contrast compares two first intensification second line treatments. Treatment: Treatment intensification with DPP4. Control: Treatment intensification with SU.
Treatment is recorded at each 6 month interval, with the Treatment group denoted 1 and the Control group denoted 0.
Outcome is the recorded measure of HbA1c levels in mmol/mol, 2 years (4 time periods) after initiation of second line therapy. We are interested in the Average Treatment Effect (ATE) of sustained treatment with DPP4, compared to SU, over 18 months.
Sustained treatment may depend on time-varying prognostic measures, such as the trajectory of glycemic control, or on unmeasured patient characteristics. We are therefore motivated to perform the analysis using time-varying IV methodologies, taking as the IV a measure of physicians' prescription preference over time, denoted tendency to prescribe (TTP). Full details are in the methods section.
Baseline characteristics are presented in the Appendix A.4 (Table A.3). Data is available at baseline on age, gender and ethnicity, and over time on HbA1c levels, Body Mass Index, systolic blood pressure, blood lipid profiles, kidney function, and history of stroke and hypoglycaemic events. Co-prescription history of statins and beta blockers was also available. Patients on DPP4 have lower HbA1c levels at baseline and prior to second line therapy, with higher levels of Body Mass Index over 34. Notably, patients were majority non-White, with around 75% recorded as Black, South Asian, or Other Ethnicity.
Simulation results for the simple data setup, targeting the ATE, with 95% coverage based on 1000 bootstrapped samples.
Simulation results for the complex data setup, targeting the ATE with 95% coverage based on
Simulation results targeting the ATE, using RI2SLS for complex data setups with misspecified models for
ATE: average treatment effect; RI2SLS: residual instrument two stage least squares; RMSE: root mean square error.
Overview
Suppose
We describe the general time-varying data setup in the Directed Acyclic Graph (DAG) shown in Figure 1. We first followed previous works with time-dependent variables15,17 which considered data setups without the dashed directional arrows. We refer to this as the ‘Simple’ data setup. Informed by our case study and prior works 16 we also considered a more ‘Complex’ data setup which includes these dashed arrows, and which may pose additional challenges to time-varying IV analysis. In particular, the treatment and instrument are allowed to depend on past instrument and treatment history, respectively, as well as on time-varying confounders.

Directed acyclic graph (DAG) of data setup with
Taking our motivating example,
Define
As with most causal methods, we make the assumption of counterfactual consistency (observed outcome
IV1 (IV relevance): There exists a measurement of
IV2 (Conditional exchangeability):
IV3 (Exclusion restriction):
IV2 and IV3 are often formalised by the single assumption
IV4 (No current treatment interaction): The average causal effect of treatment at time
We also make the following assumption about the nature of the causal relationship between
where
Assumptions IV1 to IV3 are multivariate extensions of the standard IV assumptions, whilst IV4 is necessary for the interpretation of the 2SLS estimators as the ATE. 29
Of these assumptions only IV1 is empirically testable from the available data, with IV2-4 requiring careful consideration by practitioners. In a time-fixed setting, the Cragg-Donald
For IV2, checking the balance of measured confounders within IV groups can test for the necessity of measured variables to be included within
In time-fixed 2SLS literature the substantive model of interest is usually expressed as a Linear Structural Mean Model. 20 In time-varying settings, the substantive model is expressed via a Structural Nested Mean Model (SNMM). 35
We assume in this paper a SNMM of the form.
The parameters
Standard 2SLS
The standard 2SLS methodology in time-fixed situations 1 was generalised to the case of multiple treatments via Multivariate Mendelian Randomisation.11,18 In this paper, we have a just-identified situation, where we have as many instruments as time periods. Standard 2SLS can be generalised to the case of time-varying instruments as follows.
First stage models: Postulate and fit a series of first stage models for each
Obtain predicted values for
Second stage model: Postulate and fit a main effects regression model for
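As a rough illustration of the two stages above, the following Python sketch fits one linear first stage model per treatment period and then a main effects second stage model, both by OLS. The data generating mechanism, the function names (`ols_fitted`, `standard_2sls`, `simulate_simple`), the two-period design and all coefficient values are assumptions made for this example and are not the paper's simulation setup. The instruments depend only on their own past, loosely mirroring the ‘Simple’ data setup, and no auxiliary covariates are included.

```python
import numpy as np


def ols_fitted(X, y):
    """Fit OLS of y on X (an intercept is added here) and return fitted values."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return X1 @ coef


def standard_2sls(Z, A, Y):
    """Just-identified 2SLS with one instrument column per treatment period."""
    # First stage: regress each A_t on all instruments and keep the predictions.
    A_hat = np.column_stack([ols_fitted(Z, A[:, t]) for t in range(A.shape[1])])
    # Second stage: main effects regression of Y on the predicted treatments.
    X2 = np.column_stack([np.ones(len(Y)), A_hat])
    coef, *_ = np.linalg.lstsq(X2, Y, rcond=None)
    return coef[1:]  # drop the intercept


def simulate_simple(n, seed):
    """Hypothetical two-period data; U is an unmeasured confounder."""
    rng = np.random.default_rng(seed)
    U = rng.normal(size=n)
    Z0 = rng.normal(size=n)
    Z1 = 0.5 * Z0 + rng.normal(size=n)  # instrument depends on its own past only
    A0 = (Z0 + U + rng.normal(size=n) > 0).astype(float)
    A1 = (Z1 + 0.5 * A0 + U + rng.normal(size=n) > 0).astype(float)
    # Assumed true effects of A0 and A1 on Y: 1.0 and 2.0.
    Y = 1.0 * A0 + 2.0 * A1 + U + rng.normal(size=n)
    return np.column_stack([Z0, Z1]), np.column_stack([A0, A1]), Y


Z, A, Y = simulate_simple(n=50_000, seed=1)
print("2SLS estimates for (A0, A1):", standard_2sls(Z, A, Y))
```

In this toy setup the instruments are independent of the unmeasured confounder, so the estimates should lie close to the assumed values of 1.0 and 2.0 in large samples.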
Provided that 2SLS is fit using OLS in both stages, it can be shown that this method amounts to solving the following estimating equations.
Consistent estimation using 2SLS requires two conditions on the set of auxiliary variables
However in the complex setup
To remedy this problem we turn to robust 2SLS methodologies.20–22 To handle the complex data structure of Figure 1 using 2SLS, we consider the following modification.
Postulate a model for each
Fit this model and calculate predictions for
From this, define residuals
Now perform 2SLS, replacing
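The modification above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: each instrument is regressed by OLS on a measured history, the residuals replace the instruments, and 2SLS proceeds as usual. The toy data generating mechanism, the choice of history variables, the function names (`ols_fitted`, `two_sls`, `ri2sls`, `simulate_complex`) and every coefficient are hypothetical. In this setup the later instrument depends on past treatment and on a time-varying confounder `L1`, loosely mirroring the ‘Complex’ data setup.

```python
import numpy as np


def ols_fitted(X, y):
    """Fit OLS of y on X (an intercept is added here) and return fitted values."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return X1 @ coef


def two_sls(instruments, A, Y):
    """2SLS fitted by OLS in both stages, without auxiliary covariates."""
    A_hat = np.column_stack(
        [ols_fitted(instruments, A[:, t]) for t in range(A.shape[1])]
    )
    X2 = np.column_stack([np.ones(len(Y)), A_hat])
    coef, *_ = np.linalg.lstsq(X2, Y, rcond=None)
    return coef[1:]


def ri2sls(Z, histories, A, Y):
    """Residualise each instrument on its history, then run 2SLS on the residuals."""
    R = np.column_stack(
        [Z[:, t] - ols_fitted(histories[t], Z[:, t]) for t in range(Z.shape[1])]
    )
    return two_sls(R, A, Y)


def simulate_complex(n, seed):
    """Hypothetical two-period data; U is unmeasured, L0 and L1 are measured."""
    rng = np.random.default_rng(seed)
    U = rng.normal(size=n)
    L0 = rng.normal(size=n)
    Z0 = 0.5 * L0 + rng.normal(size=n)
    A0 = (1.5 * Z0 + 0.5 * L0 + 0.5 * U + rng.normal(size=n) > 0).astype(float)
    L1 = 0.5 * A0 + 0.5 * U + rng.normal(size=n)  # affected by past treatment
    Z1 = 0.3 * Z0 + 0.5 * A0 + 0.5 * L1 + rng.normal(size=n)
    A1 = (1.5 * Z1 + 0.5 * A0 + 0.5 * L1 + 0.5 * U
          + rng.normal(size=n) > 0).astype(float)
    Y = 1.0 * A0 + 2.0 * A1 + 1.0 * L1 + 0.5 * L0 + U + rng.normal(size=n)
    Z = np.column_stack([Z0, Z1])
    A = np.column_stack([A0, A1])
    # Measured history used to model each instrument (an assumption of this sketch).
    histories = [L0[:, None], np.column_stack([L0, Z0, A0, L1])]
    return Z, A, Y, histories


Z, A, Y, histories = simulate_complex(n=50_000, seed=2)
# Assumed targets: 1.0 + 1.0 * 0.5 = 1.5 for A0 (including the path via L1), 2.0 for A1.
print("RI2SLS estimates:", ri2sls(Z, histories, A, Y))
print("Standard 2SLS on the raw instruments:", two_sls(Z, A, Y))
```

Because the raw instruments in this toy example are correlated with `L0`, `L1` and past treatment, standard 2SLS on the raw instruments is not expected to be consistent here. The residualised instruments are constructed to be independent of those histories, which requires the instrument models to be correctly specified.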
This method may be viewed as an application of 2SLS, taking
Crucially, provided that
It follows that, provided that
Intuitively,
When fitted using OLS, RI2SLS amounts to solving the following estimating equations (see Appendix)
Provided that both the first and second stage models are fit using OLS, RI2SLS is equivalent to the
Calculation of the sample variance and its asymptotic properties for the estimators of
In RI2SLS, the only necessary conditions on the set
Recent work 20 demonstrated in the time-fixed case that controlling for baseline confounders between
RI2SLS in time-fixed settings was noted to be doubly robust, that is unbiased provided that either
There is one exception. If
This means we are typically reliant on the IV models
Simulations
Data generating mechanism
We simulate data according to Figure 1 with Unmeasured confounding The time-varying instrument Lastly
The true values for
Implementation
For each simulated scenario, we generate 1000 datasets. CIs are obtained via a percentile bootstrap method using 1000 bootstrapped datasets. 2SLS and RI2SLS are performed as in Section 3, with
In Appendix A.3 (Table A.2), we also report the performance of 2SLS and RI2SLS when the first stage model is a Probit model.
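The percentile bootstrap used for the confidence intervals in this section can be sketched generically as below. The function name `percentile_bootstrap_ci` and its arguments are assumptions for illustration. Any estimator that maps resampled arrays to an estimate could be passed in. The usage example uses a sample mean purely to keep the sketch short.

```python
import numpy as np


def percentile_bootstrap_ci(estimator, arrays, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap CI: resample rows with replacement, re-estimate,
    and take empirical quantiles of the bootstrap estimates."""
    rng = np.random.default_rng(seed)
    n = len(arrays[0])
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # row indices drawn with replacement
        draws.append(estimator(*[a[idx] for a in arrays]))
    draws = np.asarray(draws)
    alpha = 1.0 - level
    lower = np.quantile(draws, alpha / 2, axis=0)
    upper = np.quantile(draws, 1 - alpha / 2, axis=0)
    return lower, upper


# Usage with a deliberately simple estimator (the sample mean).
x = np.random.default_rng(3).normal(loc=5.0, scale=1.0, size=500)
lower, upper = percentile_bootstrap_ci(lambda v: v.mean(), [x], n_boot=500, seed=4)
print("95% percentile interval:", lower, upper)
```

All arrays are resampled with the same row indices, so instrument, treatment and outcome values for a given patient stay together in each bootstrap sample.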
To test our first objective, we set
Secondly, we set
Results
Tables 1 and 2 present the results of the simulations for the scenarios with no misspecification. For the simpler data set-up, 2SLS and RI2SLS perform identically. In this scenario values of the conditional
As IV strength and sample size decrease, the RMSE increases, as expected. The performance of the methods deteriorates slightly with weaker IV strength, although biases remain minimal (below 5%) and CI coverage remains near nominal levels (except in the scenario highlighted above).
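For readers who want to reproduce this kind of summary, the performance measures discussed here (bias, RMSE and CI coverage) can be computed from the per-dataset estimates and interval bounds roughly as follows. The function name `performance_summary` and the toy numbers are hypothetical and are not taken from the paper's tables.

```python
import numpy as np


def performance_summary(estimates, ci_lower, ci_upper, truth):
    """Summarise simulation performance for a single target parameter."""
    estimates = np.asarray(estimates, dtype=float)
    ci_lower = np.asarray(ci_lower, dtype=float)
    ci_upper = np.asarray(ci_upper, dtype=float)
    bias = estimates.mean() - truth
    covered = (ci_lower <= truth) & (truth <= ci_upper)
    return {
        "bias": bias,
        "relative_bias_pct": 100.0 * bias / truth,  # assumes truth != 0
        "rmse": np.sqrt(np.mean((estimates - truth) ** 2)),
        "coverage_pct": 100.0 * covered.mean(),
    }


# Toy usage with four simulated datasets and a true effect of 1.0.
summary = performance_summary(
    estimates=[0.9, 1.1, 1.0, 1.2],
    ci_lower=[0.8, 0.9, 1.05, 1.0],
    ci_upper=[1.0, 1.3, 1.2, 1.4],
    truth=1.0,
)
print(summary)
```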
In the complex scenario, 2SLS shows poor performance and unstable results, with coverage tending towards 0 in cases of heavy bias, or towards 100 in cases where estimates over the
In line with the proof shown in the Appendix, the simulation results confirmed the equivalence between
Table 3 shows the results using RI2SLS for the complex data setup when we add non-linear terms to the models. As expected, when all 3 models are misspecified, results are biased and less efficient in all cases, though the bias appears to be partly mitigated by a stronger instrument.
Results remain biased when the model for
We considered 2SLS and RI2SLS using Probit based first stage models in the Appendix A.3 (Table A.2), which performed poorly and unpredictably in all cases, similarly to 2SLS in the complex case. The implication is that the relation to
Case study
Instrument definition: GP prescribing preferences
Our instrument is a measure of GP prescribing preferences (TTP) for DPP4 over SU over time. There are 139 GPs in our data, with an average of 27 patients each, ranging from 1 to 98 patients. A recent paper 3 summarised well the various measures used to approximate GP preference via the proportion of prescriptions issued. Our data does not include specific dates, and hence we are unable to derive subject-specific measures of TTP. We instead consider GP specific measures of preference at each 6 month period, based on the definitions considered in prior works. 3
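To make the idea of a GP specific, period specific preference measure concrete, the sketch below computes the proportion of each GP's prescriptions in a given 6 month period that were for DPP4. This is only one plausible proportion-based measure and is not necessarily the definition used in this paper. The record layout and field names (`gp_id`, `period`, `dpp4`) are hypothetical.

```python
from collections import defaultdict


def gp_period_ttp(records):
    """Proportion of DPP4 prescriptions per (GP, period).

    records: iterable of dicts with keys 'gp_id', 'period' and 'dpp4'
    (1 if the patient is on DPP4 in that period, 0 if on SU).
    """
    counts = defaultdict(lambda: [0, 0])  # (gp, period) -> [n_dpp4, n_patients]
    for rec in records:
        key = (rec["gp_id"], rec["period"])
        counts[key][0] += rec["dpp4"]
        counts[key][1] += 1
    return {key: n_dpp4 / n for key, (n_dpp4, n) in counts.items()}


# Toy usage: GP "A" treats three patients in period 0 and one in period 1.
records = [
    {"gp_id": "A", "period": 0, "dpp4": 1},
    {"gp_id": "A", "period": 0, "dpp4": 0},
    {"gp_id": "A", "period": 0, "dpp4": 1},
    {"gp_id": "A", "period": 1, "dpp4": 0},
    {"gp_id": "B", "period": 0, "dpp4": 1},
]
print(gp_period_ttp(records))
```

With few patients per GP and period, such proportions are noisy, which is one reason alternative definitions may be worth considering.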
IV1: IV relevance
It is important in a practical context to investigate if
The first stage
IV3: Exclusion restriction
It cannot be determined from the available data if there is a direct association between GP preference and HbA1c levels at any future time. We anticipate that any effect of TTP on future HbA1c levels is likely via its effect on assigned treatment, though this may depend on how well treatment is measured. A possible pathway through which IV3 could be violated would be if GPs with a preference for DPP4 provided better quality of care, or possessed greater clinical capacity, in a way that might have led to greater improvement in HbA1c levels. Our discussions with clinical experts suggested that this is unlikely to be the case, as SU and DPP4 are both well regarded and easily available treatments.
IV2: Exchangeability
GP preference may be affected by past confounders and measurements of outcome. Dependence on past preference and treatment history is likely, but can be easily controlled. However, observing poor past performance of patients on one drug may change preference as a result. This may mean that the drift in prescribing preferences towards DPP4 over time depends on other health related measures such as HbA1c history.
It is unlikely for TTP to be independent of confounding at baseline, only to be dependent at later follow up times. As such, testing the balance of observed confounders at baseline offers an insight into what needs to be conditioned on to meet IV2. Table A.4 in Appendix A.5 shows correlations between
Estimating approaches
We perform 2SLS and RI2SLS to estimate the ATE of DPP4 versus SU on HbA1c levels at 18 months. We repeat these analyses using both
Firstly we use 2SLS with no adjustment for confounders as an illustration. We then perform 2SLS-L, adjusting for Median HbA1c levels prior to initiation, smoking status, calendar period and ethnicity at baseline in the first and second stage models. RI2SLS and RI2SLS-L are then performed, using no adjustment and adjustment for the above confounders respectively. For RI2SLS
Results
We present results for the ATE in Figure 2 (and Appendix A.6, Table A.5). All methods suggest a reduction in HbA1c levels with sustained treatment with DPP4 compared to SU over the 2 year period. The 95% CI in each case suggests that this effect is significant. Results with

Forest plot of the ATE of DPP4 versus SU on HbA1c levels at 18 months. The intervals represent the bounds of the 95% confidence interval. ATE: average treatment effect; SU: sulfonylureas.
RI2SLS reported slightly narrower 95% CIs than 2SLS (with or without adjustment). The RI2SLS-L method gave similar results to RI2SLS (Table A.5, Appendix A.6). This indicates that TTP depended predominantly on the history of TTP and baseline blood glucose levels, as regressions of TTP on confounding history also indicate. This implies that both 2SLS-L and RI2SLS were effective estimating approaches.
A visual representation of the definitions of

Plot of trends of

Plots showing characteristics of baseline HbA1c levels and Ethnicity, for data within specific percentiles of
This study explores the practical implementation of 2SLS methods in a fully time-varying setting in which confounders, instruments, and treatments all vary over time. We propose a 2SLS approach that uses residualised IVs to handle challenges in complex time-varying data structures, and assess its relative performance in a simulation study. We clarify how 2SLS methods relate to
This is the first study to investigate the statistical properties of 2SLS methods for incorporating time-varying IVs. We showed that the standard 2SLS can accommodate a time-varying IV, but only provided consistent estimates in simple time-varying settings where the IV depends only on its past history. We proposed a residualised 2SLS and showed, using theory and simulations, that this approach can attain the same performance as
This paper adds to the emerging literature exploring the use of time-varying IVs in several ways. Firstly, unlike the recently proposed inverse probability weighting method, 17 we showed that RI2SLS provides an appropriate estimating approach for time-varying IV analysis in settings with weaker IV strengths (e.g.
Secondly, we found that the doubly robust property of 2SLS extends to time-varying scenarios if the instrument depends only on baseline confounding, and consistent estimation is possible even when treatment and outcome depend on non-linear relationships. Thirdly, we illustrate the applicability of the conditional
Fourthly, we showed that when the first and second stage models were not fit with OLS, the equivalence between RI2SLS and
Lastly, we illustrated the modelling and operationalisation of medical prescribing preferences as a time-varying instrument. We find that sustained treatment with DPP4 over a two year period led to a significant reduction in HbA1c levels of around 3 mmol/mol, which is in line with previous studies.38–40 Inferences did not differ according to estimating approach, but the standard 2SLS overestimated the ATE and led to somewhat wider 95% CIs. A notable aspect of the study cohort was that it included 75% non-White patients, a cohort for which prevalence of diabetes is estimated to be two to four times higher in the UK. 41 Our findings provide, therefore, valuable evidence for the external validity of these studies to non-White populations.
Our work has some limitations. Firstly, while the correct specification of
However, as with most IV methods, if a main effect of a variable of
Secondly, our case study was limited by missing data and the nature of the treatment. We required patients to have at least three recorded phases and complete data on HbA1c levels, treatment, and confounders. This risks selection bias, as patients with more recorded follow-ups and complete data could be in worse health than those with less complete data.
We also excluded patients who took insulin or other diabetic treatments, preselecting patients who took either SU or DPP4. Recent relevant work 44 highlighted potential selection biases when using IV methods to compare two treatments (DPP4 and SU) when more than two treatments are available (e.g. SGLT2 inhibitors or insulin), if the propensity to give these alternative treatments differs between preference groups. An investigation into the effects of SU versus DPP4 on BMI in time-fixed settings suggested this could be handled by sensitivity analyses.44,45 Extending such sensitivity analyses to time-varying treatments would likely involve a sensitivity analysis of large dimensions, and is beyond the scope of this paper. However, this would be a worthwhile area of further work.
Patients who initiated second-line therapy with DPP4 or SU rarely switched between treatments, which meant that the IV was weakly associated with treatment assignment over time, and hence limited our ability to estimate time-specific treatment effects. As such, the case study did not enable us to demonstrate the wider strengths of the RI2SLS approach. An interesting extension to our work would be to more closely examine the relationship between treatment switching rate, and IV strength.
Thirdly, the IV strength of TTP after initiation has room for improvement. For example, with richer datasets, the GP’s last prescribed treatment may better capture TTP over time. We may also consider model-based TTP methods, such as the Abrahamowicz method (Bidulka et al. 37), to identify the period in which a GP switches preference. Near-far matching methodologies could identify pairs of GPs with the furthest possible difference in preference at initiation 46 to boost IV strength.
To conclude, RI2SLS provides a promising approach to perform time-varying IV analysis. It has good theoretical properties and can be performed using standard regression techniques. Identifying a strong time-varying IV remains a major barrier to its wider adoption.
Supplemental Material
sj-pdf-1-smm-10.1177_09622802251404064 - Supplemental material for Two stage least squares with time-varying instruments: An application to an evaluation of treatment intensification for type-2 diabetes
Supplemental material, sj-pdf-1-smm-10.1177_09622802251404064 for Two stage least squares with time-varying instruments: An application to an evaluation of treatment intensification for type-2 diabetes by Daniel Tompsett, Stijn Vansteelandt, Richard Grieve, John Robson and Manuel Gomes in Statistical Methods in Medical Research
Supplemental Material
sj-R-2-smm-10.1177_09622802251404064 - Supplemental material for Two stage least squares with time-varying instruments: An application to an evaluation of treatment intensification for type-2 diabetes
Supplemental material, sj-R-2-smm-10.1177_09622802251404064 for Two stage least squares with time-varying instruments: An application to an evaluation of treatment intensification for type-2 diabetes by Daniel Tompsett, Stijn Vansteelandt, Richard Grieve, John Robson and Manuel Gomes in Statistical Methods in Medical Research
Footnotes
Acknowledgments
The authors thank the Queen Mary University of London, Clinical Effectiveness Group and Barts Charity for access to the deidentified data and GPs in North East London for sharing deidentified patient data for research for patient benefit.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research is supported by the Medical Research Council, grant number MR/V020935/1.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability
The data that support the findings of this study are available from the QMUL Clinical Effectiveness Group but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of QMUL CEG.
Supplemental material
Supplemental material for this article is available online.
