Abstract

Neural network-based treatment effect estimation algorithms are well-known in the causal inference community. Many works propose designs and architectures and report performance metrics over benchmarking data sets, in a machine learning manner. Nevertheless, most authors focus solely on binary treatment scenarios. This is a limitation, as many real-world scenarios have a multivalued treatment nature (for instance, multiarmed clinical trials, or health technology assessment processes). In this work, a novel approach is presented, where a top-performing, neural network-based algorithm for binary treatment effect estimation is generalized to a multivalued treatment setting. This approach yields an estimator with desirable asymptotic properties that delivers very good results in a wide range of experiments. To the best of the authors’ knowledge, this work is opening ground for the benchmarking of neural network-based algorithms for multivalued treatment effect estimation.

Keywords

causal inference, multivalued treatment effect estimation, neural networks

1. Introduction

Machine learning and neural networks are becoming a common choice for performing causal analysis tasks (causal inference and causal discovery) due to their power and flexibility for modelling complex functions, especially when the dimensionality of the data is high (Hernán & Robins, 2020). Several authors have investigated specific network architectures, loss functions, regularization methods, etc., to tackle the task of inferring causal quantities using neural networks (Johansson et al., 2018; Nair et al., 2022; Yoon et al., 2018). The performance of those algorithms is benchmarked in the scientific literature using specific data sets and common metrics to achieve comparable results (Lin et al., 2019; Shimoni et al., 2018). These advancements are happening almost exclusively in binary treatment scenarios. Nevertheless, real-life applications often have multivalued treatments, for instance, multiarmed clinical trials, or health technology assessment (Lampe et al., 2009; Li et al., 2021) processes where the health technology or intervention being evaluated can take multiple values (Cattaneo et al., 2013; Li & Li, 2019). This highlights the need to explore neural network-based causal inference methods for multivalued treatments, both at the theoretical and empirical levels. In the present work, a top-performing, neural network-based, binary average treatment effect (ATE) estimation algorithm named Dragonnet (Kiriakidou & Diou, 2022; Shi et al., 2019) is selected, and its generalizability to a multivalued treatment setting is tested. To the best of the authors’ knowledge, Velasco et al. (2022) is one of the first attempts to establish a benchmark for this type of algorithm in the aforementioned setting. Other works found in the literature focus on specific data morphologies (Kaddour et al., 2021), do not use neural networks (Künzel et al., 2019), or can be considered meta-methods (Schwab et al., 2019).
In the remainder of this text, the problem statement, the network architecture and the associated mathematical formulations, a framework for experiments, the results obtained in different scenarios, and a critical evaluation of those results are presented. The code of the algorithm and the experiments is available at https://github.com/BorjaGIH/Hydranet.

2. Problem Statement

Let the treatment of interest be a discrete random variable $T \in \{0, \dots, k\}$ that can take $k + 1$ different values. Let the outcome be a continuous random variable $Y \in R$, and let the covariates (i.e., the variables affecting both the treatment and the outcome) be a random vector $X \in R^{j}$. Thus, the set of data points is $(Y_{i}, T_{i}, X_{i}), i \in \{1, \dots, N\}$, generated independently and identically. This set of data points constitutes the body of observational data. Let the causal effect of the treatment $t$ on the outcome $Y$ be $μ_{t} = E [Y | d o (T = t)]$, using Pearl’s do-calculus notation (Pearl et al., 2016), which denotes intervention. It can be shown that, if the data meet certain conditions, causal (interventional) quantities can be estimated based on observational data. Those conditions are known as the identifiability conditions: positivity, consistency, and “no hidden confounder” conditions. For a more detailed explanation, see Pearl et al. (2016). Under such conditions, $μ_{t} = E [E [Y | X, T = t]]$, where the outer expectation is taken over the covariates; this is a quantity that can be inferred from the body of observational data. For the rest of the section, it is assumed that the identifiability conditions hold.

Let the conditional outcome be defined as the expectation of the outcome given the treatment and the covariates, $Q (t, x) = E [Y | t, x]$. Based on $Q$, a simple estimator ${\hat{μ}}_{t}$ of $μ_{t}$ can be constructed as ${\hat{μ}}_{t} = (1 / N) \sum_{i} Q (t, x_{i})$. In the following, the goal will be to approximate $Q$. Let $\hat{Q}$ be an approximation of $Q$, and let $μ_{t}^{\hat{Q}} = (1 / N) \sum_{i} \hat{Q} (t, x_{i})$ be the estimator of $μ_{t}$ obtained by replacing $Q$ with its estimate $\hat{Q}$. Furthermore, the generalized propensity score (GPS; Cattaneo, 2010) is expressed as $G (x) = [g_{0} (x), g_{1} (x), \dots, g_{k} (x)] \in R^{k + 1}$, with $g_{t} (x) = P (T = t | x)$.

In a binary treatment setting, under the identifiability conditions, the ATE is one of the most common causal quantities of interest, and it is defined as $ψ = μ_{1} - μ_{0}$. Given an approximation $\hat{Q}$ of $Q$, $ψ$ can be easily estimated as $ψ^{\hat{Q}} = μ_{1}^{\hat{Q}} - μ_{0}^{\hat{Q}}$. In a multivalued treatment setting, a wider class of causal quantities of interest can be defined, and all the conditional outcomes must be computed together in order to obtain valid estimates of those quantities (Cattaneo, 2010). In this work, such quantities of interest are defined as the pair-wise average differences between the several treatments and a treatment considered the control (note that, in practice, the control treatment does not necessarily mean absence of treatment). Thus, a vector of ATEs $ψ \in R^{k}$, $ψ = [ψ_{1}, ψ_{2}, \dots, ψ_{k}]$, with $ψ_{t} = μ_{t} - μ_{0}$ is defined. These quantities can be approximated in the same fashion as above, the $t$-th element of the vector being $ψ_{t}^{\hat{Q}} = μ_{t}^{\hat{Q}} - μ_{0}^{\hat{Q}}$. Note that if the causal quantity of interest were $ψ_{i, j} = μ_{i} - μ_{j}$, it could easily be computed based on the previous definition, as $ψ_{i, j} = ψ_{i} - ψ_{j}$, due to the linearity of the expectation operator.
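The plug-in estimates described above reduce to column means of a prediction matrix. The following is a minimal sketch, assuming the predictions $\hat{Q} (t, x_{i})$ are stored in a hypothetical array `q_hat` of shape $(N, k + 1)$ with the control treatment in column 0; the function names are illustrative and are not taken from the Hydranet repository.

```python
import numpy as np


def plug_in_ate_vector(q_hat: np.ndarray) -> np.ndarray:
    """Return [psi_1, ..., psi_k] from an (N, k + 1) array of predictions.

    q_hat[i, t] is assumed to hold Q_hat(t, x_i); column 0 is the control.
    """
    mu_hat = q_hat.mean(axis=0)      # mu_t for t = 0..k
    return mu_hat[1:] - mu_hat[0]    # psi_t = mu_t - mu_0


def pairwise_effect(psi: np.ndarray, i: int, j: int) -> float:
    """psi_{i,j} = psi_i - psi_j, using psi_0 = 0 by definition."""
    padded = np.concatenate(([0.0], psi))
    return float(padded[i] - padded[j])


# Toy predictions for N = 3 units and k + 1 = 3 treatments.
q_hat = np.array([[1.0, 2.0, 4.0],
                  [0.0, 2.0, 3.0],
                  [2.0, 2.0, 5.0]])
psi = plug_in_ate_vector(q_hat)
```

In this toy example the column means are 1, 2 and 4, so the contrasts against the control are 1 and 3, and the contrast between treatments 2 and 1 is their difference.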

The subject of interest in this paper is the estimation of the vector of ATEs $ψ$. In the next section, the estimation method of Shi et al. (2019), which was designed to estimate the ATE in the binary case, is generalized to the estimation of $ψ$ in the multivalued treatment case.

3. From Dragonnet to Hydranet

Dragonnet is a high-capacity, end-to-end neural network architecture for estimating binary treatment effects (Shi et al., 2019). In this section, the variation of the architecture, mathematical formulations, and proofs for adapting Dragonnet to a multivalued treatment setting are presented. This adaptation is called Hydranet.

3.1. Architecture

The architecture of Hydranet can be seen in Figure 1. It consists of two parts: the representation part, formed by the input layer and two hidden layers, and the heads, of which there are $k + 2$. Out of those, $k + 1$ correspond to the conditional outcomes, and are formed by two more hidden layers plus the output layer. The remaining head corresponds to the GPS, $G (x) = [g_{0} (x), g_{1} (x), \dots, g_{k} (x)] \in R^{k + 1}$, with $g_{t} (x) = P (T = t | x)$, and consists of just the output layer. All layers are fully connected. Recall that the $t$-th element of the vector of ATEs is approximated as $ψ_{t}^{\hat{Q}} = (1 / N) \sum_{i} [\hat{Q} (t, x_{i}) - \hat{Q} (0, x_{i})]$.

Figure 1.

Hydranet Architecture, where $Z$ is the Representation Layer, and the $k + 2$ Heads Correspond to the $k + 1$ Potential Outcomes, $\hat{Q} (k, \cdot)$ , and the Generalized Propensity Score, $\hat{G} (\cdot)$ .
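The layer layout described above can be illustrated with a forward pass in plain NumPy. This is only a sketch of the topology: the layer width, the ReLU activations, the random initialisation, and the names `init_hydranet` and `forward` are assumptions made for illustration, and no training is performed.

```python
import numpy as np

rng = np.random.default_rng(0)


def dense(n_in, n_out):
    """Randomly initialised fully connected layer: (weights, bias)."""
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)


def relu(a):
    return np.maximum(a, 0.0)


def softmax(a):
    a = a - a.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)


def init_hydranet(n_features, k, width=16):
    """Shared representation, k + 1 outcome heads, and one GPS head."""
    return {
        "repr": [dense(n_features, width), dense(width, width)],
        "outcome_heads": [
            [dense(width, width), dense(width, width), dense(width, 1)]
            for _ in range(k + 1)
        ],
        "gps_head": dense(width, k + 1),
    }


def forward(params, x):
    # Representation part: two hidden layers shared by all heads.
    z = x
    for w, b in params["repr"]:
        z = relu(z @ w + b)
    # Outcome heads: two hidden layers plus a linear output each.
    q_cols = []
    for head in params["outcome_heads"]:
        h = z
        for w, b in head[:-1]:
            h = relu(h @ w + b)
        w, b = head[-1]
        q_cols.append(h @ w + b)
    q_hat = np.concatenate(q_cols, axis=1)  # (N, k + 1)
    # GPS head: a single softmax output layer on the representation.
    w, b = params["gps_head"]
    g_hat = softmax(z @ w + b)              # (N, k + 1)
    return q_hat, g_hat


x = rng.uniform(-1.0, 1.0, size=(8, 30))
params = init_hydranet(n_features=30, k=4)
q_hat, g_hat = forward(params, x)
```

Each row of `g_hat` is a probability vector over the $k + 1$ treatments, and each column of `q_hat` holds the predictions of one outcome head.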

The baseline objective function has the shape

$$\hat{R} (θ) = \frac{1}{N} \sum_{i} \left[ (Q^{nn} (t_{i}, x_{i}; θ) - y_{i})^{2} + α \, \text{CrossEntropy} (g_{t}^{nn} (x_{i}; θ), t_{i}) \right] \tag{1}$$

where the quadratic term relates to the errors of the potential outcomes’ heads and the cross entropy term relates to the errors of the propensity score’s head. The model parameters are

$$\hat{θ} = \underset{θ}{\arg\min} \, [\hat{R} (θ)] \tag{2}$$
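Equation (1) can be evaluated directly from the outputs of the heads. The sketch below assumes hypothetical arrays `q_hat` and `g_hat` of shape $(N, k + 1)$ holding the outcome and propensity predictions, integer treatments `t`, and takes the cross entropy of a one-hot label to be $- \log g_{t_{i}} (x_{i})$.

```python
import numpy as np


def baseline_loss(q_hat, g_hat, t, y, alpha=1.0):
    """Equation (1): squared error on the factual outcome head plus
    alpha times the cross entropy of the propensity head."""
    idx = np.arange(len(t))
    factual_q = q_hat[idx, t]                # Q(t_i, x_i)
    squared_error = (factual_q - y) ** 2
    cross_entropy = -np.log(g_hat[idx, t])   # -log g_{t_i}(x_i)
    return float(np.mean(squared_error + alpha * cross_entropy))


# Toy example with N = 2 units and two treatments.
q_hat = np.array([[1.0, 2.0], [3.0, 4.0]])
g_hat = np.array([[0.5, 0.5], [0.25, 0.75]])
t = np.array([0, 1])
y = np.array([1.0, 3.0])
loss = baseline_loss(q_hat, g_hat, t, y)
```

Only the head matching the observed treatment contributes to the squared error of each unit, which is why the counterfactual columns of `q_hat` do not appear in the loss.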

3.2. Targeted Regularization

Now, following the reasoning in Shi et al. (2019), we present targeted regularization. Targeted regularization is a modification of the objective function, obtained with the introduction of an extra parameter, $ϵ$ . In our setting, $ϵ$ is a vector in $R^{k}, ϵ = (ϵ_{1}, ϵ_{2}, \dots, ϵ_{k})$ , and the new objective function is

$$\bar{F} (θ, ϵ) = \hat{R} (θ) + β \frac{1}{N} \sum_{i} γ_{i} (y_{i}, t_{i}, x_{i}; θ, ϵ), \tag{3}$$

where

$$γ_{i} (y_{i}, t_{i}, x_{i}; θ, ϵ) = (y_{i} - {\bar{Q}}_{i} (θ, ϵ))^{2}, \tag{4}$$

and

$${\bar{Q}}_{i} (θ, ϵ) = Q^{nn} (t_{i}, x_{i}) + ϵ_{1} \left( \frac{I (T = 1)}{g_{1}^{nn} (x_{i})} - \frac{I (T = 0)}{g_{0}^{nn} (x_{i})} \right) + \dots + ϵ_{k} \left( \frac{I (T = k)}{g_{k}^{nn} (x_{i})} - \frac{I (T = 0)}{g_{0}^{nn} (x_{i})} \right) \tag{5}$$

with $I (T = t)$ the indicator function. The desired model parameters are

$$\hat{θ}, \hat{ϵ} = \underset{θ, ϵ}{\arg\min} \left[ \hat{R} (θ) + β \frac{1}{N} \sum_{i} γ_{i} (y_{i}, t_{i}, x_{i}; θ, ϵ) \right] \tag{6}$$
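The modified objective of equations (3) to (5) can be sketched numerically as follows. The arrays `q_hat` and `g_hat` (shape $(N, k + 1)$), the integer treatments `t`, and the function names are illustrative assumptions; the indicator $I (T = t)$ is evaluated at the observed treatment of each unit.

```python
import numpy as np


def perturbed_outcome(q_hat, g_hat, t, eps):
    """Equation (5): Q_bar_i = Q(t_i, x_i) + sum_s eps_s * H_{i,s}."""
    n, n_treat = g_hat.shape
    one_hot = np.eye(n_treat)[t]       # I(t_i = s)
    inv_weighted = one_hot / g_hat     # I(t_i = s) / g_s(x_i)
    # H[:, s-1] = I(t_i = s)/g_s(x_i) - I(t_i = 0)/g_0(x_i), for s = 1..k
    h = inv_weighted[:, 1:] - inv_weighted[:, [0]]
    return q_hat[np.arange(n), t] + h @ eps


def targeted_regularization_loss(q_hat, g_hat, t, y, eps, alpha=1.0, beta=1.0):
    """Equations (3)-(4): baseline risk plus beta * mean (y - Q_bar)^2."""
    idx = np.arange(len(t))
    base = np.mean((q_hat[idx, t] - y) ** 2 - alpha * np.log(g_hat[idx, t]))
    gamma = (y - perturbed_outcome(q_hat, g_hat, t, eps)) ** 2
    return float(base + beta * np.mean(gamma))


# Toy example: three units, three treatments, uniform propensities.
q_hat = np.array([[1.0, 2.0, 3.0]] * 3)
g_hat = np.full((3, 3), 1.0 / 3.0)
t = np.array([0, 1, 2])
y = np.array([1.0, 2.0, 3.0])
eps = np.array([0.1, -0.2])
q_bar = perturbed_outcome(q_hat, g_hat, t, eps)
```

With $ϵ = 0$ the perturbed outcome coincides with the factual prediction, so the extra term reduces to $β$ times the mean squared error of the outcome heads.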

What is the rationale behind this modification of the objective function? It lies in targeted maximum likelihood estimation (TMLE; Lendle, 2015) and in semiparametric estimation theory (SET; Kennedy, 2016). On the one hand, SET provides mathematical conditions which, if fulfilled, guarantee that the estimator $ψ$ has desirable properties (listed at the end of this section). On the other hand, TMLE is the method that introduces the idea of the perturbation parameter $ϵ$, which inspires targeted regularization. In the remainder of this section, we show that the minimization of the modified objective function automatically ensures the fulfillment of the conditions defined by SET, thus ensuring the desirable properties of the estimator $ψ$.

First, note that the aforementioned conditions are simply the nonparametric estimating equations, defined as

$$0 = \left[ \frac{1}{N} \sum_{i} φ_{i, 1}, \frac{1}{N} \sum_{i} φ_{i, 2}, \dots, \frac{1}{N} \sum_{i} φ_{i, k} \right] \tag{7}$$

where each $φ_{i, t}$ is the $t$-th influence curve, defined as

$$φ_{i, t} = Q^{nn} (t, x_{i}) - Q^{nn} (0, x_{i}) + \left( \frac{I (T = t)}{g_{t}^{nn} (x_{i})} - \frac{I (T = 0)}{g_{0}^{nn} (x_{i})} \right) (y_{i} - Q^{nn} (t, x_{i})) - ψ_{t} \tag{8}$$

forming the vector of efficient influence curves, $φ \in R^{k}$, $φ = [φ_{1}, φ_{2}, \dots, φ_{k}]$.

Then, recall that the goal is for the minimization of the modified objective function (3) to ensure the fulfillment of the nonparametric estimating equations (7). This is mathematically expressed as

$${\nabla \bar{F} |}_{\hat{ϵ}} = {\left[ \frac{\partial \bar{F}}{\partial ϵ_{1}}, \frac{\partial \bar{F}}{\partial ϵ_{2}}, \dots, \frac{\partial \bar{F}}{\partial ϵ_{k}} \right] \Big|}_{\hat{ϵ}} = \left[ \frac{β}{N} \sum_{i} φ_{i, 1}, \frac{β}{N} \sum_{i} φ_{i, 2}, \dots, \frac{β}{N} \sum_{i} φ_{i, k} \right] = 0 \tag{9}$$

and the proof of the equality can be found in the supplemental material. This warrants the aforementioned desirable properties of the estimator $ψ$, which are double robustness, fast convergence, and efficiency (Chernozhukov et al., 2017).
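The mechanism behind this result can be outlined informally, in the spirit of the binary-case argument of Shi et al. (2019). This is a sketch, not the proof given in the supplemental material: the shorthand $H_{i, t}$ is introduced here for the bracketed weight of equation (5), and $ψ_{t}$ is assumed to be the plug-in estimate computed from the perturbed outcome model. Since $\hat{R} (θ)$ does not depend on $ϵ$ and $\partial {\bar{Q}}_{i} / \partial ϵ_{t} = H_{i, t}$,

$$
\begin{aligned}
H_{i, t} & = \frac{I (T = t)}{g_{t}^{nn} (x_{i})} - \frac{I (T = 0)}{g_{0}^{nn} (x_{i})}, \\
\frac{\partial \bar{F}}{\partial ϵ_{t}} & = - \frac{2 β}{N} \sum_{i} H_{i, t} \, (y_{i} - {\bar{Q}}_{i} (θ, ϵ)), \\
{\frac{\partial \bar{F}}{\partial ϵ_{t}} \Big|}_{\hat{ϵ}} = 0 \; & \Longrightarrow \; \frac{1}{N} \sum_{i} H_{i, t} \, (y_{i} - {\bar{Q}}_{i} (\hat{θ}, \hat{ϵ})) = 0 .
\end{aligned}
$$

Under the stated assumption on $ψ_{t}$, the difference of predicted outcomes and $ψ_{t}$ cancel on average in the $t$-th influence curve, and the remaining weighted-residual term is the sum that the first-order condition sets to zero.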

4. Data and Metrics

Hydranet has been tested on two data sets, a fully synthetic one and a semisynthetic one. In the remainder of the text, they will be referred to as the synthetic dataset and the IHDP dataset, respectively. In order to generate these datasets, algorithms mimicking different data-generating processes (DGPs) have been designed and implemented. For the synthetic data set, the covariates, treatments, and outcomes have been synthetically generated, taking inspiration from Kaddour et al. (2021). For the IHDP data set, the covariates are taken from a study with real participants, while the treatments and outcomes are synthetically generated. Those real covariates were collected for a randomized controlled trial carried out in 1985 (Gross, 1993; Multisite, 1990), and are routinely used for benchmarking causal inference algorithms, usually following the configuration in Dorie et al. (2018). A similar strategy has been followed in the current work, with the DGP adapted to a multivalued treatment scenario. With both data sets, the number of treatments has been set to 5. In the remainder of this section, a more detailed explanation of the DGP of each dataset and its output is provided.

4.1. Synthetic Dataset DGP

For generating fully synthetic data, DGPs with tunable parameters of bias size $B$, degree of positivity $ρ$, dataset size $D$, and number of confounders $NC$ have been designed. The number of treatments has been set to 5. The potential covariates are vectors $x \in R^{30}$ with each value sampled from a uniform distribution $U (- 1, 1)$. The number of such vectors is equal to the data size parameter $D$, forming a matrix $X \in R^{D \times 30}$. The actual confounders, that is, the variables that participate in the determination of both the treatment and the outcome, are the first $NC$ elements of each covariate vector, thus forming a matrix $C \in R^{D \times NC}$. The treatment for each datapoint has been obtained in two steps. First, the confounder vector is squared element-wise and its elements are summed; a min–max scaler to the range $[0, 4]$ (for five treatments) is applied to the result, which is then rounded to the closest integer. Then, in order to fulfill the positivity condition, the final treatment value is drawn from a categorical distribution such that

$$p (t | c) = \begin{cases} ρ, & \text{if } t = m (c) \\ \frac{1 - ρ}{k - 1}, & \text{otherwise} \end{cases}$$

with $m (\cdot)$ the operation defined in the first step and $ρ$ the degree of positivity. Note that with this definition, a value of $ρ = 0.5$ would mean perfect overlap, treatment assigned at random, while a value of $ρ = 1$ would mean the violation of the positivity condition. Finally, for computing the potential outcomes, three outcome functions, $l_{a} (t, x)$, $l_{b} (t, x)$, and $l_{c} (t, x)$, have been defined, which map a combination of the covariates and the treatment to the output space. The outcome functions have the shape

$$\begin{aligned} l_{a} (t, x) & = 30 v_{0}^{T} x + 10 t^{2} v_{t}^{T} x + ϵ \\ l_{b} (t, x) & = 20 v_{0}^{T} x + 5 B t v_{t}^{T} x + ϵ \\ l_{c} (t, x) & = 10 v_{0}^{T} x + 5 \log (| B t v_{t}^{T} x |) + ϵ \end{aligned}$$

with $B$ the bias parameter, $v_{0}$ the baseline effect parameter, defined as $u_{0} / \| u_{0} \|$ with $\| \cdot \|$ the Euclidean norm and $u_{0} \sim U (0, 1)$ a randomly sampled vector ($u_{0} \in R^{30}$), and $ϵ \sim N (0, 1)$. Recall that a potential outcome, denoted $y^{t}$, is the outcome that a datapoint would have had, had it been treated with a particular treatment $t$. The matrix of potential outcomes $Y \in R^{D \times 5}$ is defined as

$$Y = [Y^{0}, Y^{1}, Y^{2}, Y^{3}, Y^{4}] = [l_{a} (\mathbf{0}, X)^{T}, l_{b} (\mathbf{1}, X)^{T}, l_{c} (\mathbf{2}, X)^{T}, l_{b} (\mathbf{3}, X)^{T}, l_{a} (\mathbf{4}, X)^{T}]$$

with $\mathbf{0} = (0, 0, \dots, 0) \in R^{D}$, $\mathbf{1} = (1, 1, \dots, 1) \in R^{D}$, etc.
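The two-step treatment assignment described above can be sketched as follows. This is an illustration rather than the code of the repository: the function name is hypothetical, and the probability mass $1 - ρ$ is assumed to be split evenly over the four treatments other than $m (c)$ so that each row of probabilities sums to one.

```python
import numpy as np


def assign_treatments(confounders, rho, n_treatments=5, rng=None):
    """Two-step treatment assignment of the synthetic DGP (sketch).

    Step 1: m(c) = round(min-max scaling of sum_j c_j^2 to [0, n_treatments - 1]).
    Step 2: keep m(c) with probability rho; otherwise pick one of the
    remaining treatments uniformly at random (assumed split).
    """
    rng = np.random.default_rng() if rng is None else rng
    score = (confounders ** 2).sum(axis=1)
    scaled = (score - score.min()) / (score.max() - score.min())
    m = np.rint(scaled * (n_treatments - 1)).astype(int)

    probs = np.full((len(m), n_treatments), (1.0 - rho) / (n_treatments - 1))
    probs[np.arange(len(m)), m] = rho
    # Inverse-CDF sampling from each row's categorical distribution.
    u = rng.random(len(m))
    t = (u[:, None] > probs.cumsum(axis=1)).sum(axis=1)
    return np.minimum(t, n_treatments - 1), m


rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(2000, 30))   # D = 2000 covariate vectors
c = x[:, :2]                                  # first NC = 2 columns
t, m = assign_treatments(c, rho=0.8, rng=rng)
```

The fraction of units whose sampled treatment equals $m (c)$ should be close to $ρ$, and setting $ρ = 1$ makes the assignment deterministic given the confounders.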

Several data sets have been generated under varying values of the four parameters of interest, bias size $B = [2, 5, 10, 30]$, degree of positivity $ρ = [0.6, 0.7, 0.8, 0.90, 0.95, 0.98]$, dataset size $D = [1000, 2000, 5000, 8000]$, and number of confounders $NC = [2, 5, 10, 18]$, varying one parameter at a time. When kept fixed, the values have been set to $B = 20$, $ρ = 0.8$, $D = 2000$, and $NC = 2$. The main text includes the experiments for varying values of bias size, positivity degree, and dataset size. Experiments with varying values of $NC$ are included in the supplemental material.

4.2. IHDP Dataset DGP

For generating the IHDP dataset, a similar strategy has been followed, but fixing $B = 10$, $ρ = 0.8$, $NC = 2$, with $D = 985$ being the size of the original IHDP covariate set. The treatment assignment function is based on two variables present in the set, mom ethnicity and weeks preterm. Treatment 0 is assigned to individuals with mom ethnicity equal to “black,” treatment 1 to individuals with mom ethnicity equal to “white,” treatment 2 to individuals with mom ethnicity equal to “hispanic,” treatment 3 to individuals with mom ethnicity equal to “hispanic” and weeks preterm greater than 6, and treatment 4 to individuals with mom ethnicity equal to “black” and weeks preterm smaller than 6. Note that this setting is entirely artificial and has no connection with any real-life situation. Then, the final treatment is sampled from a probability distribution as explained in subsection 4.1. The outcome functions are defined as

$$\begin{aligned} l_{1} (t, x) & = \exp (x β) + B \cdot \mathrm{MB} + t^{2} + ϵ \\ l_{2} (t, x) & = \log (| x β |) + B \cdot \mathrm{MW} \cdot t + ϵ \\ l_{3} (t, x) & = x β + B \cdot \mathrm{MH} + t^{2} + ϵ \\ l_{4} (t, x) & = \exp (x β) + B \cdot \mathrm{WP} + t + ϵ \\ l_{5} (t, x) & = \log (| x β |) + B \cdot \mathrm{WP} \cdot t + ϵ \end{aligned}$$

where $β$ is a vector of parameters, $B$ is the bias parameter, MB, MW, and MH are the components of the one-hot encoding of mom ethnicity, and WP is weeks preterm.

4.3. Metrics

For performance benchmarking purposes, the sum of errors of the vector of ATEs has been employed. This is computed as the sum of the absolute values of the differences of all estimated ATE components with respect to their true values, $E = \sum_{t = 1}^{k} | ψ_{t} - {\hat{ψ}}_{t} |$ . This choice allows us to have a single real number as a final result, making comparisons simpler. All values have been computed as averages across 20 data set realizations to increase the robustness of the results, and 95% confidence intervals have been computed with bootstrapping.
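The error metric and its confidence interval can be computed as in the sketch below. The source only states that bootstrapping was used; the percentile bootstrap over per-realization errors, the number of resamples, and the function names are assumptions made for illustration.

```python
import numpy as np


def ate_error(psi_true, psi_hat):
    """E = sum_t |psi_t - psi_hat_t| over the k non-control treatments."""
    return float(np.abs(np.asarray(psi_true) - np.asarray(psi_hat)).sum())


def bootstrap_ci(errors, n_boot=2000, level=0.95, rng=None):
    """Percentile bootstrap interval for the mean error across realizations."""
    rng = np.random.default_rng() if rng is None else rng
    errors = np.asarray(errors, dtype=float)
    samples = rng.choice(errors, size=(n_boot, len(errors)), replace=True)
    means = samples.mean(axis=1)
    lo, hi = np.quantile(means, [(1 - level) / 2, (1 + level) / 2])
    return float(errors.mean()), float(lo), float(hi)


# One realization: true and estimated ATE vectors for k = 4.
single_error = ate_error([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.0, 4.0])

# Hypothetical errors from 20 data set realizations.
rng = np.random.default_rng(0)
errors = rng.uniform(1.0, 3.0, size=20)
mean_error, lo, hi = bootstrap_ci(errors, rng=rng)
```

Summing absolute differences gives one number per realization, and the interval is then taken over the 20 per-realization values.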

5. Experiments and Results

In the case of binary treatment settings, there are de facto benchmarking data sets and metrics, that is, data sets and metrics that are widely used in the literature and thus serve for algorithmic performance comparison purposes. The IHDP data set and the metrics presented in Dorie et al. (2018) are an example of this. This is not the case in multivalued treatment settings, where comparators are scarce. Nevertheless, algorithms that can be considered comparable to Hydranet have been developed and implemented to benchmark its performance. Thus, in every experiment, the results of the following algorithms are included: (1) Naive, a naive estimator of the treatment effect that employs only the observable data, without controlling for confounders, and serves to visualize the impact of confounding; (2) B2BD, back-to-back Dragonnets, a strategy that uses four Dragonnets (with the same setup as in Shi et al., 2019), each one estimating one element of the vector of ATEs $ψ$; (3) Meta-learner, a meta-learner estimator (Künzel et al., 2019) that employs a gradient boosting machine model; and finally (4) Hydranet, both in its baseline form and with targeted regularization (which, for convenience, will be abbreviated as t-reg in figures and tables). Note that for the meta-learner, both T-learner and X-learner estimators have been tested, and the T-learner was finally selected due to its better performance. Hydranet performs well in all the tested scenarios and outperforms the comparators, both with in-sample (train set) data and with out-sample (test set) data, reaching low error values for different bias sizes, positivity degrees, dataset sizes, and numbers of confounders. The employed training scheme consists of a first stage with the Adam optimizer and a second stage with the stochastic gradient descent optimizer, with hyperparameters similar to Shi et al. (2019).
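For reference, an unadjusted estimator of the kind described in item (1) can be written in a few lines. The text does not give its formula, so the sketch below assumes it is the difference between the mean observed outcome of each treatment group and that of the control group; the function name is hypothetical.

```python
import numpy as np


def naive_ate_vector(y, t, n_treatments):
    """Unadjusted contrasts: mean outcome under t minus mean outcome under 0."""
    group_means = np.array([y[t == s].mean() for s in range(n_treatments)])
    return group_means[1:] - group_means[0]


# Toy data: two units per treatment group.
y = np.array([1.0, 3.0, 5.0, 7.0, 10.0, 20.0])
t = np.array([0, 0, 1, 1, 2, 2])
psi_naive = naive_ate_vector(y, t, n_treatments=3)
```

Because the groups are compared without conditioning on the covariates, any confounding in the treatment assignment is carried directly into these contrasts.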

5.1. Synthetic Data Experiments

Figure 2 and Table 1 show the error of the different algorithms for increasing values of the bias size. As expected, the error of the naive algorithm increases with the bias size, and the out-sample error is bigger than the in-sample error. The errors of the comparators also grow as the bias increases. Hydranet outperforms all the comparators, and is very robust to increases in the bias. It also shows a similar performance in-sample and out-sample, both for the baseline algorithm and for the targeted regularization version.

Figure 2.

Errors w.r.t. Bias Size. (a) Out-Sample and (b) In-Sample.

Table 1.

Errors of the Different Algorithms w.r.t. Bias Size.

	5		10		30
Bias	In-sample	Out-sample	In-sample	Out-sample	In-sample	Out-sample
Naive	28.61 $\pm$ 5.78	13.97 $\pm$ 2.77	35.37 $\pm$ 6.69	16.17 $\pm$ 3.55	52.31 $\pm$ 7.83	30.49 $\pm$ 7.17
B2BD base.	14.75 $\pm$ 3.78	9.73 $\pm$ 2.6	14.86 $\pm$ 2.64	11.14 $\pm$ 3.76	37.59 $\pm$ 8.6	18.58 $\pm$ 5.28
B2BD t-reg.	12.3 $\pm$ 4.12	12.3 $\pm$ 3.13	13.66 $\pm$ 4.05	13.66 $\pm$ 3.29	25.76 $\pm$ 10.11	25.76 $\pm$ 6.83
Meta-learner	15.91 $\pm$ 3.3	15.94 $\pm$ 3.43	15.3 $\pm$ 3.21	15.88 $\pm$ 3.15	29.98 $\pm$ 5.83	32.54 $\pm$ 6.44
Hydranet base.	1.37 $\pm$ 0.37	1.22 $\pm$ 0.31	1.87 $\pm$ 0.32	1.68 $\pm$ 0.26	2.65 $\pm$ 0.4	1.92 $\pm$ 0.31
Hydranet t-reg.	1.45 $\pm$ 0.37	1.45 $\pm$ 0.38	1.62 $\pm$ 0.26	1.62 $\pm$ 0.28	2.26 $\pm$ 0.65	2.26 $\pm$ 0.38

Figure 3 and Table 2 show the error of the different algorithms for increasing values of the degree of positivity $ρ$. Note that here $ρ$ has been expressed as a percentage. Again, as expected, all algorithms yield larger errors for larger values of $ρ$. Hydranet outperforms the comparators, both in-sample and out-sample, and both in its baseline form and with targeted regularization.

Figure 3.

Errors w.r.t. Degree of Positivity. (a) Out-Sample and (b) In-Sample.

Table 2.

Errors of the Different Algorithms w.r.t. Degree of Positivity.

	90		95		98
Positivity degree	In-sample	Out-sample	In-sample	Out-sample	In-sample	Out-sample
Naive	46.79 $\pm$ 10.97	29.92 $\pm$ 5.35	60.9 $\pm$ 14.71	26.39 $\pm$ 4.88	67.0 $\pm$ 13.77	33.91 $\pm$ 7.47
B2BD base.	22.02 $\pm$ 3.85	14.68 $\pm$ 2.66	32.62 $\pm$ 7.33	21.71 $\pm$ 4.14	31.87 $\pm$ 7.15	25.79 $\pm$ 5.02
B2BD t-reg.	23.49 $\pm$ 4.52	23.49 $\pm$ 5.87	23.7 $\pm$ 7.27	23.7 $\pm$ 4.9	25.08 $\pm$ 7.98	25.08 $\pm$ 4.06
Meta-learner	28.21 $\pm$ 5.17	30.48 $\pm$ 5.96	28.77 $\pm$ 6.67	31.26 $\pm$ 6.76	42.88 $\pm$ 7.3	44.43 $\pm$ 7.17
Hydranet base.	3.09 $\pm$ 0.53	2.54 $\pm$ 0.48	5.01 $\pm$ 1.35	4.69 $\pm$ 1.17	5.7 $\pm$ 1.57	4.88 $\pm$ 1.48
Hydranet t-reg.	2.91 $\pm$ 0.53	2.91 $\pm$ 0.53	4.93 $\pm$ 1.43	4.93 $\pm$ 1.13	6.88 $\pm$ 1.37	6.88 $\pm$ 1.46

Figure 4.

Errors w.r.t. Dataset Size. (a) In-Sample and (b) Out-Sample.

Figure 4 and Table 3 show the performance of the algorithms for varying dataset sizes. As expected, all algorithms reduce their error with bigger data sets, but Hydranet with targeted regularization outperforms them all, and shows a smaller error even for small dataset sizes, which supports its data efficiency. Recall that data efficiency was one of the desired properties ensured for our estimator, thanks to targeted regularization. It must be highlighted that in this experiment, the estimations of the baseline Hydranet have been plugged into a naive doubly-robust estimator, the augmented inverse probability of treatment weighted (A-IPTW) estimator. The resulting estimations show large error values, which illustrates the utility of Hydranet with targeted regularization for achieving double robustness (as shown in Section 3.2).

Table 3.

Errors of the Different Algorithms w.r.t. Dataset Size.

	2,000		5,000		8,000
Dataset size	In-sample	Out-sample	In-sample	Out-sample	In-sample	Out-sample
Naive	46.03 $\pm$ 8.36	19.59 $\pm$ 4.07	26.33 $\pm$ 4.61	12.75 $\pm$ 2.44	18.59 $\pm$ 3.83	9.86 $\pm$ 2.21
B2BD base.	21.23 $\pm$ 5.29	12.3 $\pm$ 2.92	16.59 $\pm$ 3.31	10.82 $\pm$ 2.44	12.32 $\pm$ 2.31	8.55 $\pm$ 1.86
Meta-learner	18.55 $\pm$ 3.82	16.95 $\pm$ 4.13	9.67 $\pm$ 2.06	12.97 $\pm$ 2.8	6.04 $\pm$ 1.43	7.35 $\pm$ 1.53
B2BD t-reg.	12.63 $\pm$ 5.19	12.63 $\pm$ 2.86	10.42 $\pm$ 3.33	10.42 $\pm$ 2.53	8.27 $\pm$ 2.33	8.27 $\pm$ 1.81
Hydranet base. (DR)	79.05 $\pm$ 13.41	29.15 $\pm$ 4.18	75.16 $\pm$ 17.76	28.6 $\pm$ 6.61	51.64 $\pm$ 9.8	32.43 $\pm$ 5.33
Hydranet t-reg.	1.94 $\pm$ 0.42	1.94 $\pm$ 0.32	0.99 $\pm$ 0.22	0.99 $\pm$ 0.2	0.97 $\pm$ 0.36	0.97 $\pm$ 0.34
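The plug-in A-IPTW estimator mentioned for the "Hydranet base. (DR)" row can be sketched as follows. This uses the standard augmented inverse-probability-weighted form for each treatment mean; whether the experiments used exactly this variant is an assumption, and the array and function names are illustrative.

```python
import numpy as np


def aiptw_ate_vector(q_hat, g_hat, t, y):
    """Augmented IPTW estimate of [psi_1, ..., psi_k].

    mu_s = mean_i [ Q(s, x_i) + I(t_i = s) / g_s(x_i) * (y_i - Q(s, x_i)) ]
    """
    n, n_treat = q_hat.shape
    one_hot = np.eye(n_treat)[t]        # I(t_i = s)
    residual = y[:, None] - q_hat       # y_i - Q(s, x_i)
    mu = (q_hat + one_hot / g_hat * residual).mean(axis=0)
    return mu[1:] - mu[0]


# Toy example with two treatments and constant propensities of 0.5.
q_hat = np.array([[1.0, 2.0], [3.0, 5.0]])
g_hat = np.full((2, 2), 0.5)
t = np.array([0, 1])
psi_exact_fit = aiptw_ate_vector(q_hat, g_hat, t, np.array([1.0, 5.0]))
psi_with_residual = aiptw_ate_vector(q_hat, g_hat, t, np.array([2.0, 5.0]))
```

When the outcome heads fit the factual outcomes exactly, the correction term vanishes and the estimate equals the plug-in contrast. The correction divides residuals by the estimated propensities, so propensities close to zero can inflate it considerably.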

5.2. IHDP Data Experiments

Table 4 shows the error of the different algorithms with the IHDP dataset. As with the synthetic data, Hydranet (both baseline and targeted regularization) outperforms the comparators. The targeted regularization algorithm has a slightly smaller error than the baseline algorithm. These results support the efficacy of Hydranet with semisynthetic data, showing its potential suitability for real-world scenarios.

Table 4.
Performance With IHDP Dataset.

	Out-sample	In-sample
Naive	14.81 $\pm$ 0.95	17.51 $\pm$ 2.03
B2BD base.	26.35 $\pm$ 2.46	26.73 $\pm$ 3.1
B2BD t-reg.	27.57 $\pm$ 2.42	26.05 $\pm$ 2.84
Meta-learner	13.53 $\pm$ 1.22	13.7 $\pm$ 1.2
Hydranet base.	3.22 $\pm$ 0.73	3.33 $\pm$ 0.83
Hydranet t-reg.	2.87 $\pm$ 0.57	2.91 $\pm$ 0.68

6. Discussion

In this work, a top-performing, neural network-based algorithm for ATE estimation has been generalized from a binary treatment setting to a five-valued treatment setting. Synthetic and semi-synthetic DGPs for algorithmic benchmarking purposes in multivalued settings have been developed and implemented, and comparator algorithms have been designed for evaluating the performance of Hydranet. It is shown that Hydranet performs well under different bias sizes and degrees of positivity, and both theoretical and empirical support is provided for equipping Hydranet with targeted regularization: the data efficiency of the algorithm is shown in the varying dataset size experiment, where attempting double robustness through a naive approach such as the plug-in A-IPTW estimator yields much larger errors. In addition, the algorithm also performs well with semi-synthetic data (the IHDP dataset), which suggests its potential value for real-world datasets. Note also that double robustness means that the estimator remains consistent if either the outcome model or the propensity model is correctly specified, which gives two chances of avoiding model misspecification instead of one (with respect to a simple estimator). This can be especially relevant in real-world scenarios where assumptions such as “no hidden confounder” can be harder to make.

The direct generalizability of neural network-based algorithms for ATE estimation from binary settings to $k$-valued treatment settings is a common claim in the literature, but this work shows that it has its own challenges and that the behavior of the algorithms in each particular scenario requires its own interpretation. As far as we know, this paper breaks new ground by proposing benchmarking results for neural network-based ATE estimation in multivalued treatment scenarios.

The main limitations of this work are twofold: on the one hand, only a five-valued treatment scenario has been tested. It is a line of future work to adapt the algorithm and perform experiments for $k$ -valued scenarios. We would expect an increasing complexity with the increase of $k$ , and probably some extra form of regularization would be needed, for instance, regarding the $ϵ$ parameter. On the other hand, competitor algorithms of Hydranet have been constructed ad hoc due to the scarcity of benchmarking data in the literature. In Schwab et al. (2019), some experiments are performed in multivalued treatment settings, with TARNet being the best method. TARNet was shown to be outperformed by Dragonnet in binary treatment settings in Shi et al. (2019), and thus, it is presumed that, as an extension of Dragonnet, Hydranet would also outperform TARNet in multivalued treatment scenarios. Nevertheless, this has not been tested empirically.

Supplemental Material

Supplemental material, sj-pdf-1-eai-10.1177_30504554251385053 for Hydranet: A Neural Network for the Estimation of Multi-valued Treatment Effects by Borja Velasco-Regulez and Jesus Cerquides in The European Journal on Artificial Intelligence

Footnotes

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by Doctorat Industrial funded by Generalitat de Catalunya (DI-2020-18) and by project CI-SUSTAIN funded by the Spanish Ministry of Science and Innovation (PID2019-104156GB-I00). Borja Velasco-Regúlez was a PhD Student of the doctoral program in Computer Science at the Universitat Autonoma de Barcelona.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Supplemental Material

Supplemental material for this article is available online.

References

Cattaneo

M. D.

(2010). Efficient semiparametric estimation of multi-valued treatment effects under ignorability. Journal of Econometrics, 155(2), 138–154. https://doi.org/10.1016/j.jeconom.2009.09.023

Cattaneo

M. D.

Drukker

D. M.

Holland

A. D.

(2013). Estimation of multivalued treatment effects under conditional independence. The Stata Journal: Promoting communications on statistics and Stata, 13(3), 407–450. https://doi.org/10.1177/1536867X1301300301

Chernozhukov

Chetverikov

Demirer

Duflo

Hansen

Newey

Robins

(2017). Double/Debiased machine learning for treatment and causal parameters. ArXiv:1608.00060 [econ, stat].

Dorie

Hill

Shalit

Scott

Cervone

(2018). Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition. ArXiv:1707.02641 [stat]. http://arxiv.org/abs/1707.02641.

Enhancing the outcomes of low-birth-weight, premature infants: A multisite, randomized trial. (1990). JAMA, 263(22), 3035–3042. https://doi.org/10.1001/jama.1990.03440220059030

Gross

R. T.

(1993). Infant health and development program (IHDP): Enhancing the outcomes of low birth weight, premature infants in the United States, 1985–1988. https://doi.org/10.3886/ICPSR09795.v1.

Hernán

M. A.

Robins

J. M.

(2020). Causal inference: What if. Chapman & Hall/CRC.

Johansson

F. D.

Shalit

Sontag

(2018). Learning representations for counterfactual inference. ArXiv:1605.03661 [cs, stat]. http://arxiv.org/abs/1605.03661.

Kaddour

Zhu

Liu

Kusner

M. J.

Silva

(2021). Causal effect inference for structured treatments. Advances in Neural Information Processing Systems, 34, 24841–24854.

10.

Kennedy

E. H.

(2016). Semiparametric theory and empirical processes in causal inference. In H. He, P. Wu, & D. G. Chen (Eds.), Statistical causal inferences and their applications in public health research. ICSA book series in statistics (pp. 141–167). Springer International Publishing. https://doi.org/10.1007/978-3-319-41259-7

11.

Kiriakidou

Diou

(2022). An improved neural network model for treatment effect estimation. ArXiv:2205.11106 [cs, stat]. http://arxiv.org/abs/2205.11106.

12.

Künzel

S. R.

Sekhon

J. S.

Bickel

P. J.

(2019). Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences of the United States of America, 116(10), 4156–4165. https://doi.org/10.1073/pnas.1804597116

Lampe, Mäkelä, Garrido, M. V., Anttila, Autti-Rämö, Hicks, N. J., Hofmann, Koivisto, Kunz, Kärki, et al. (2009). The HTA core model: A novel method for producing and reporting health technology assessments. International Journal of Technology Assessment in Health Care, 25(S2), 9–20. https://doi.org/10.1017/S0266462309990638

Lendle, S. D. (2015). Targeted minimum loss based estimation: Applications and extensions in causal inference and big data. PhD thesis, UC Berkeley. https://escholarship.org/uc/item/4cs716rc.

Li & Li (2019). Propensity score weighting for causal inference with multiple treatments. The Annals of Applied Statistics, 13(4), 2389–2415. https://doi.org/10.1214/19-AOAS1282

Li, Chen, Lai, Liang, Wang, Shi, Lin, Yao, Hu, & Ung, C. O. L. (2021). Integrating real-world evidence in the regulatory decision-making process: A systematic analysis of experiences in the US, EU, and China using a logic model. Frontiers in Medicine, 8, 669509. https://doi.org/10.3389/fmed.2021.669509

Lin, Merchant, Sarkar, S. K., & D’Amour (2019). Universal causal evaluation engine: An API for empirically evaluating causal inference models. Proceedings of Machine Learning Research, 104, 50–58. https://proceedings.mlr.press/v104/lin19a.html

Nair, Gurumoorthy, K. S., & Mandalapu (2022). Individual treatment effect estimation through controlled neural network training in two stages. ArXiv:2201.08559 [cs]. http://arxiv.org/abs/2201.08559.

Pearl, Glymour, & Jewell, N. P. (2016). Causal inference in statistics: A primer. John Wiley and Sons Ltd.

Schwab, Linhardt, & Karlen (2019). Perfect match: A simple method for learning representations for counterfactual inference with neural networks. ArXiv:1810.00656 [cs, stat]. http://arxiv.org/abs/1810.00656

Shi, Blei, D. M., & Veitch (2019). Adapting neural networks for the estimation of treatment effects. ArXiv:1906.02120 [cs, stat]. http://arxiv.org/abs/1906.02120.

Shimoni, Yanover, Karavani, & Goldschmnidt (2018). Benchmarking framework for performance-evaluation of causal inference analysis. ArXiv:1802.05046 [cs, stat]. http://arxiv.org/abs/1802.05046.

Velasco, Cerquides, & Arcos, J. L. (2022). Hydranet: A neural network for the estimation of multi-valued treatment effects. In NeurIPS 2022 Workshop on causality for real-world impact. https://openreview.net/forum?id=sJChORLuPHK.

Yoon, Jordon, & Van Der Schaar (2018). Ganite: Estimation of individualized treatment effects using generative adversarial nets. International Conference on Learning Representations, 2196.



Table 4. Performance With IHDP Dataset.

Method            Out-sample      In-sample
Naive             14.81 ± 0.95    17.51 ± 2.03
B2BD base.        26.35 ± 2.46    26.73 ± 3.1
B2BD t-reg.       27.57 ± 2.42    26.05 ± 2.84
Meta-learner      13.53 ± 1.22    13.7 ± 1.2
Hydranet base.    3.22 ± 0.73     3.33 ± 0.83
Hydranet t-reg.   2.87 ± 0.57     2.91 ± 0.68

Supplemental Material

sj-pdf-1-eai-10.1177_30504554251385053 - Supplemental material for Hydranet: A Neural Network for the Estimation of Multi-valued Treatment Effects
