Abstract
Estimating the causal treatment effects by subgroups is important in observational studies when the treatment effect heterogeneity is present. Existing propensity score methods rely on a correctly specified propensity score model. Model misspecification results in biased treatment effect estimation and covariate imbalance. We proposed a method for the propensity score analysis with controlled subgroup balance (G-SBPS) to achieve covariate mean balance in all subgroups. We further incorporated nonparametric kernel regression for the propensity scores and developed a kernelized G-SBPS (kG-SBPS) to improve the subgroup mean balance of covariate transformations in a rich functional class. This extension increased robustness to propensity score model misspecification. Extensive numerical studies showed that G-SBPS and kG-SBPS improve both subgroup covariate balance and subgroup treatment effect estimation, compared to existing approaches. For illustration, we applied G-SBPS and kG-SBPS to a dataset on right heart catheterization to estimate the subgroup average treatment effects on the hospital length of stay and a dataset on diabetes self-management training to estimate the subgroup average treatment effects for the treated on the hospitalization rate.
Introduction
An important goal of observational studies is to estimate the treatment effect. Naive comparison between treatment groups is subject to selection bias when covariates are unbalanced between treatment groups due to lack of randomization. Propensity score, the conditional probability of treatment assignment given the covariates, is widely used to adjust for covariate imbalance and remove selection bias through matching, 1 stratification, 2 regression, 3 and weighting. 4 Commonly used estimands of the treatment effect include the average treatment effect (ATE) and the average treatment effect for the treated (ATT). When there are heterogeneous treatment effects (HTEs), subgroups with different characteristics respond to the treatment differently. For example, a drug may have better efficacy on patients with certain genetic traits. The overall treatment effects that ignore the underlying heterogeneity, such as the ATE or ATT, do not provide sufficient granular information for scientific investigation and clinical practice. The HTE is common in biomedical, epidemiological, and social research. In this article, we study HTEs among pre-specified subgroups of scientific interest, and these subgroups are defined through covariates.
The subgroup HTEs can be estimated with or without modeling the outcome. Examples of the former approach include regression models stratified on subgroups, Bayesian additive regression trees,5,6 causal forest, 7 etc. However, their performance depends on a correctly specified outcome model. Additionally, there are benefits to being blinded from the outcome data when developing causal models. 8 In this article, we focus on the latter approach and study the causal subgroup analysis, an HTE estimation method that adjusts subgroup covariate imbalance in a propensity score analysis.9,10 Our proposed approach is built upon propensity score weighting, and the weights development does not incorporate any information about the outcomes.
The propensity score is a balancing score; that is, the treatment assignments and covariates are independent conditional on the propensity score. 11 Theoretically, the propensity score balances covariates in the overall population and any covariate-defined subgroups. We define
Nonparametric propensity score models, such as boosting, 13 random forest 16 and CBSR, 14 do not guarantee overall and subgroup balance.9,10,15,17 Although the nonparametric methods may reduce model misspecification and bias due to the higher flexibility than their parametric counterparts, the estimation may have more variability, a typical bias-variance trade-off phenomenon. This has been observed, for example, in the comparison between CBSR and CBPS. 17 This trade-off is amplified in the subgroup analysis due to the large number of subgroups under research and the limited sample size of each subgroup.
Developing methods that ensure both overall and subgroup covariate balance is essential when studying subgroup HTEs. For this purpose, Dong et al. 9 proposed the subgroup balancing propensity score (SBPS). SBPS selects among either parametric logistic regression models with covariate-by-subgroup interactions fitted to the overall sample or parametric logistic regression models fitted to the subgroup samples. However, this method cannot guarantee subgroup balance when the propensity score model is misspecified or when the sampling variability results in extreme inverse probability weights. The SBPS needs to examine up to
In this article, we propose the propensity score weighting analysis with controlled subgroup balance (G-SBPS), which optimizes both overall and subgroup balance simultaneously. The G-SBPS does not require mutually exclusive subgroups. We estimate the propensity scores by solving a system of equations that achieves the mean independence between the treatment indicator and the covariate terms in the propensity score model, which include covariates, subgroup indicators and their interactions. We show that the G-SBPS controls both overall and subgroup balance. To further improve the flexibility of propensity score models and reduce misspecification, we extend the G-SBPS to nonparametric estimation by using kernel principal component analysis (PCA). We propose a parameter tuning algorithm tailored for the subgroup analysis, which optimizes the subgroup covariate balance while controlling the overall balance. This kernelized G-SBPS (kG-SBPS) optimizes the overall and subgroup balance of the covariates and their transformations from a rich functional class. In simulations and two empirical data applications, both the G-SBPS and kG-SBPS demonstrated robustness to model misspecification compared to existing propensity score methods or subgroup propensity score analysis methods. The proposed methods are implemented in the statistical software R, and the code is available from GitHub (https://github.com/fiona19832008/GSBPS).
The rest of the article is organized as follows. Section 2 presents the model, the G-SBPS algorithm, the kG-SBPS algorithm, and the tuning algorithm. Section 3 evaluates the numerical performance of G-SBPS and kG-SBPS and compares them with other published methods in a simulation study. Section 4 presents two real data applications, one for the estimation of ATE and one for ATT. We conclude this article with a summary and discussion in Section 5. Some tables and figures are included in the online supplementary materials and numbered as Table S1, Figure S1, etc.
Methodology
Model set-up
We consider a sample of
The propensity score of subject
Parametric propensity score model with controlled subgroup balance (G-SBPS)
This subsection discusses G-SBPS as a parametric approach. We model the propensity score by the logit link as
We discuss ATE estimation first. Zhao et al. 14 constructed a special loss function for the CBSR method and demonstrated that the function can be maximized by solving similar score equations as those used in CBPS, 12 which guarantees covariate mean balance between the two treatment groups in the overall study population. To optimize both the overall and subgroup balance, we consider the following adaptation of that loss function:
The propensity scores are estimated by maximizing this loss function with respect to the modeling parameters
The loss function in (2) is concave with a global maximum. The Hessian matrix (the matrix of second-order partial and cross-partial derivatives) is:
Therefore, we estimate the coefficient
Next, we discuss the estimation procedure for the ATT. The procedure is similar to the ATE but with the following changes. The CBSR loss function, the estimating equation, and the Hessian matrix become
The proposed G-SBPS avoids some of the limitations of SBPS mentioned in the Introduction section. The parameter estimation of G-SBPS yields a global optimum, because it minimizes the loss due to the overall and subgroup covariate imbalance. In data analysis practice, we often observed that the G-SBPS achieved near-zero weighted mean differences in covariates between treatment groups within each subgroup, which is referred to as exact subgroup (covariate) balance in this article. G-SBPS builds on the CBPS method, which has been shown to achieve exact overall covariate balance. 12 Unlike CBPS, however, G-SBPS incorporates interactions between covariates and subgroup indicators, enabling it to achieve exact subgroup covariate balance. The SBPS stochastically searches through a large number of parametric propensity score models to find the one with the best overall and subgroup balance among the candidate models. The result of this process depends on the set of candidate models, and it may not produce the optimal solution or exact overall or subgroup balance.
The G-SBPS relies on a parametric propensity score model (1). It can be viewed as an application of the CBPS 12 to the subgroup propensity score analysis problems. It inherits some desirable properties of CBPS, such as good covariate balance, as well as some of its limitations, such as requiring correct specification of the propensity score model. In the next section, we further extend the G-SBPS to mitigate the effect of model misspecification.
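The link between maximizing a concave loss and obtaining subgroup balance can be illustrated with a small sketch. It is not the authors' R implementation, and the loss below is an assumption: it is a loss whose derivative with respect to the linear predictor is T/π − (1 − T)/(1 − π), the ATE covariate-balancing equation, chosen because equation (2) is not reproduced in this text. The design matrix (intercept, subgroup indicator, covariate, and their interaction), the simulated data, and all function names are hypothetical.

```python
import math
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def linear(beta, z):
    return sum(b * v for b, v in zip(beta, z))


def loss(beta, rows, treat):
    """Assumed loss: f - exp(-f) for treated units, -f - exp(f) for controls."""
    total = 0.0
    try:
        for z, t in zip(rows, treat):
            f = linear(beta, z)
            total += (f - math.exp(-f)) if t else (-f - math.exp(f))
    except OverflowError:
        return float("-inf")  # reject steps that blow up the exponent
    return total


def fit(rows, treat, iters=100, tol=1e-10):
    """Damped Newton ascent; the gradient is the ATE balancing equation."""
    p = len(rows[0])
    beta = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p
        neg_hess = [[0.0] * p for _ in range(p)]
        for z, t in zip(rows, treat):
            f = linear(beta, z)
            g = (1.0 + math.exp(-f)) if t else -(1.0 + math.exp(f))  # 1/pi or -1/(1-pi)
            h = math.exp(-f) if t else math.exp(f)
            for j in range(p):
                grad[j] += g * z[j]
                for k in range(p):
                    neg_hess[j][k] += h * z[j] * z[k]
        if max(abs(v) for v in grad) < tol:
            break
        step = solve(neg_hess, grad)
        base, s = loss(beta, rows, treat), 1.0
        while True:  # backtracking keeps the concave loss from decreasing
            cand = [b + s * d for b, d in zip(beta, step)]
            if s < 1e-10 or loss(cand, rows, treat) >= base - 1e-12 * (1.0 + abs(base)):
                break
            s /= 2.0
        beta = cand
    return beta


def simulate(n=400, seed=1):
    """Hypothetical data: two subgroups (g = 0, 1) and one covariate x."""
    rng = random.Random(seed)
    rows, treat = [], []
    for i in range(n):
        g, x = i % 2, rng.gauss(0.0, 1.0)
        p = 1.0 / (1.0 + math.exp(-(-0.2 + 0.6 * x + 0.4 * g - 0.5 * x * g)))
        rows.append([1.0, float(g), x, x * g])
        treat.append(1 if rng.random() < p else 0)
    return rows, treat


def subgroup_mean_diff(rows, treat, beta, g):
    """Weighted mean difference of x between arms within subgroup g."""
    num, den = [0.0, 0.0], [0.0, 0.0]
    for z, t in zip(rows, treat):
        if z[1] == g:
            pi = 1.0 / (1.0 + math.exp(-linear(beta, z)))
            w = 1.0 / pi if t else 1.0 / (1.0 - pi)
            num[t] += w * z[2]
            den[t] += w
    return num[1] / den[1] - num[0] / den[0]
```

Calling `fit(*simulate())` and then `subgroup_mean_diff` for each subgroup should return differences that are zero up to numerical error, because the interaction columns force the balancing equations to hold within each subgroup and not only overall.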
A misspecified propensity score model leads to covariate imbalance and bias in estimated treatment effects. 17 A commonly used practice to alleviate this problem is to tweak the logistic model by adding covariate transformations or interaction terms, until a satisfactory covariate balance is achieved. However, this process is ad hoc, lacks methodologically justified guidelines, and may not achieve good balance. Some methods that force overall covariate balance (e.g. CBPS 12 ) are subject to model misspecification and biased treatment effect estimation. 19
We propose to improve the flexibility of the propensity score model in G-SBPS through reproducing kernel Hilbert space (RKHS), which transforms the observed covariate vector into an
The existing subgroup propensity score methods, such as the SBPS, 9 model the propensity score parametrically. Although they produce improved subgroup balance, they suffer from propensity score model misspecification (shown in the simulation results below). Nonparametric methods not designed for subgroup analysis, such as boosting, 13 often give unsatisfactory subgroup balance, leading to biased subgroup treatment effect estimation.9,10 The kernelized G-SBPS is the first subgroup propensity score analysis method that aims at both goals: flexible nonparametric modeling and the overall and subgroup balance.
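To make the kernel PCA step more concrete, the following sketch computes a Gaussian Gram matrix, centers it in feature space, and extracts the scores on the leading kernel principal component by power iteration. It is a generic illustration of kernel PCA under stated assumptions, not the kG-SBPS algorithm itself: the function names, the bandwidth parameterization exp(−‖x − x′‖²/(2·bandwidth²)), and the use of a single component are all hypothetical choices.

```python
import math


def gaussian_gram(xs, bandwidth):
    """Gaussian kernel matrix K[i][j] = exp(-||xi - xj||^2 / (2 * bandwidth^2))."""
    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [[math.exp(-sqdist(a, b) / (2.0 * bandwidth ** 2)) for b in xs] for a in xs]


def center(k):
    """Double-center the symmetric Gram matrix (centering in feature space)."""
    n = len(k)
    row = [sum(r) / n for r in k]
    total = sum(row) / n
    return [[k[i][j] - row[i] - row[j] + total for j in range(n)] for i in range(n)]


def first_component_scores(kc, iters=500):
    """Scores on the leading kernel principal component via power iteration."""
    n = len(kc)
    v = [float(i + 1) for i in range(n)]  # arbitrary non-constant starting vector
    lam = 0.0
    for _ in range(iters):
        w = [sum(kc[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = math.sqrt(sum(x * x for x in w))  # converges to the top eigenvalue
        v = [x / lam for x in w]
    # Projection of each observation onto the component: sqrt(eigenvalue) * eigenvector.
    return [math.sqrt(lam) * x for x in v]
```

For example, with the one-dimensional points 0, 1, 2, 5, 6, 7 and a bandwidth of 2, the leading component separates the two clusters by sign. In a kernelized propensity score model, scores of this kind (for several components) would take the place of the raw covariates in the design matrix.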
When modeling the propensity score without using the outcome, the commonly used out-of-sample target for optimization, such as prediction error, does not always produce overall or subgroup balance, which may lead to suboptimal performance in treatment effect estimation. We propose to optimize the subgroup balance, while controlling the overall balance. The standardized difference (S/D) is used to measure covariate balance.22,23 The S/D is the absolute difference in weighted mean between treated and untreated groups, divided by the pooled standard deviation of the weighted data. 24 In addition to the S/D of covariates in the overall population, we optimize the S/D in the subgroups. The details of this tuning process are presented in the supplementary materials (Algorithm 1). We use this algorithm to tune the bandwidth
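The S/D described above can be computed along the following lines. This is a minimal sketch: the pooled standard deviation is assumed here to be the square root of the average of the two weighted within-arm variances, which is one common convention and may differ from the exact definition in reference 24. The function names are hypothetical.

```python
import math


def weighted_mean_var(values, weights):
    """Weighted mean and (non-bias-corrected) weighted variance."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, var


def standardized_difference(x, treat, weights):
    """Absolute weighted mean difference over the pooled weighted SD, in percent."""
    arms = {}
    for t in (0, 1):
        vals = [v for v, ti in zip(x, treat) if ti == t]
        wts = [w for w, ti in zip(weights, treat) if ti == t]
        arms[t] = weighted_mean_var(vals, wts)
    # Assumed pooling rule: average the two arm-specific weighted variances.
    pooled_sd = math.sqrt((arms[1][1] + arms[0][1]) / 2.0)
    return 100.0 * abs(arms[1][0] - arms[0][0]) / pooled_sd
```

A subgroup S/D is obtained by passing only the observations that belong to that subgroup, so overlapping subgroups need no special handling.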
Simulations
In this section, we compare the proposed G-SBPS (parametric method) and kG-SBPS (nonparametric method) with two popular propensity score methods and two representative subgroup propensity score analysis methods under various numerical settings.
Simulation design
Let (Correct PS model) The propensity score model is a logistic regression with the main effects of covariates and subgroup-specific intercept:
(Misspecified PS model) The propensity score model has additional interaction and nonlinear terms:
In practice, the data analyst usually just uses the main effect terms. With many covariates, there are numerous possible interactions and nonlinear terms, and it is difficult to determine which ones should be added to the model. In this simulation study, we call PS1 the “correct PS model” because our parametric data analysis uses this model, and call PS2 the “misspecified PS model” because the parametric analysis does not include any interaction or nonlinear terms. For both PS1 and PS2,
We consider two outcome models:
(Standard outcome model): The outcome model includes the main effects of all covariates in the propensity score model. The treatment effects
(Extended outcome model): The outcome model includes additional interactions and nonlinear transformations of the covariates. These additional terms are unknown to the data analyst and hence are not explicitly accounted for in the data analysis.
The residual is
The two propensity score models (PS1, PS2) and the two outcome models (OM1, OM2) produce four scenarios. We simulated data under each scenario, and evaluated the performance of the following six methods in the subgroup treatment effect estimation and overall and subgroup balance.
Logistic: the logistic regression analysis with the main effects of observed covariates and the subgroup indicator (
Logistic-S: separately fitted logistic models within each subgroup. Each model includes the main effects of covariates. This was implemented in
CBPS: the just-identified CBPS with the main effects of covariates and the subgroup indicator (
SBPS: implemented in the
G-SBPS: the proposed parametric G-SBPS method with the main effects of covariates.
kG-SBPS: the proposed nonparametric kG-SBPS method.
We studied both ATE and ATT estimation because these are widely used estimands. The treatment effect estimators were evaluated by percent bias and root mean squared error (RMSE) in each subgroup. The overall or subgroup covariate balance was quantified by the S/D of covariates in the overall population or the subgroups. In each simulation scenario, the results were aggregated from 500 Monte Carlo repetitions.
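The two evaluation metrics can be written compactly as follows. The text does not spell out the formulas, so this sketch assumes the usual definitions: percent bias is the average estimation error across Monte Carlo repetitions relative to the true effect, and RMSE is the square root of the average squared error. The function names are hypothetical.

```python
import math


def percent_bias(estimates, truth):
    """100 * (mean estimate - true effect) / true effect; truth must be nonzero."""
    return 100.0 * (sum(estimates) / len(estimates) - truth) / truth


def rmse(estimates, truth):
    """Root mean squared error of the estimates around the true effect."""
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))
```

In a study like the one described, each function would be applied to the 500 subgroup-specific estimates from one scenario, once per subgroup and method.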
First, we examine the overall balance. When the propensity score model was correctly specified (PS1), all methods under comparison had good overall balance in the sense that the S/Ds were generally less than
When the propensity score model was misspecified (PS2), the only method that maintained good performance was the kG-SBPS because it is the only nonparametric method. CBPS and G-SBPS maintained good balance in
Next, we study the subgroup balance and subgroup treatment effects. Figures 1 and 2 present the results from ATE estimation, and we comment on those results here. The results from ATT estimation are presented in Figures S3 and S4, and they lead to similar conclusions. Comparing Figures 1 and 2, we observe the expected result that all parametric methods performed better under the correct PS model (PS1). The subgroup S/Ds are usually higher than the corresponding overall S/Ds because the subgroups have smaller sample sizes. Nonetheless, the subgroup S/Ds are generally less than 10% under PS1. The Logistic-S and SBPS methods performed better than Logistic and CBPS in subgroup balance, because the former methods were designed for subgroup propensity score analysis. This is the opposite of the overall balance results in Figure S1, where the latter methods were better. The proposed G-SBPS and kG-SBPS had equivalent or better performance than all other methods in terms of subgroup balance.

Boxplots of the standardized differences (S/D; %) in the four subgroups when estimating the

Boxplots of the standardized differences (S/D; %) in the four subgroups when estimating the
Under the misspecified PS model (PS2), Logistic, Logistic-S, CBPS and SBPS have deteriorated subgroup balance performance in all covariates and their transformations, as expected. The G-SBPS still gives exact subgroup balance of
In addition, to evaluate the effect of small sample size on the performance of the proposed methods, we applied all methods to simulations with a correct PS model (PS1) and only 40 units in subgroup 2. We found that only G-SBPS achieves good subgroup balance for subgroup 2 if
Tables 1 and 2 show the estimation of subgroup ATEs under the correct or misspecified PS models. Lower % bias and smaller RMSE indicate better performance. The two methods that explicitly deal with subgroup balance (Logistic-S, SBPS) perform better than the ones that do not (Logistic, CBPS). This is consistent with the subgroup covariate balance results above. Compared with the other four methods (Logistic, Logistic-S, CBPS, SBPS), the G-SBPS has the best % bias and RMSE, due to its use of the globally optimal solution to the subgroup balance constraints (Section 2). The kG-SBPS has more variation than the G-SBPS, due to its nonparametric nature. Nonetheless, the kG-SBPS still has better % bias and RMSE than the other four methods.
Tables 1 and 2: The performance of various methods in the estimation of subgroup ATEs in the simulation.
The benefit of kG-SBPS is clearest under model misspecification, where it is the only method that performs well in all scenarios. Even under misspecification, the G-SBPS, unlike the other four methods, did not break down in every scenario: it still performed well under the standard outcome model. This is due to the theory by Hazlett, 25 which states that even when a propensity score model is misspecified, as long as it balances the linear additive terms in the outcome model (which is the case here with the standard outcome model), the misspecification does not cause bias in the treatment effect estimation. Of the three parametric subgroup propensity score analysis methods (Logistic-S, SBPS, G-SBPS), only our proposed method exploited this theoretical result, which produces the doubly robust-like performance shown in Table 2. This is because the optimal solution from the G-SBPS results in exact subgroup balance, while Logistic-S and SBPS do not have this guarantee.
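The argument above can be traced in a simplified worked example. It is an illustration under assumptions introduced here, not a restatement of Hazlett's general result: within a subgroup, the outcome is taken to be linear and additive in the covariates with a constant effect τ, the weights w_i depend only on (X_i, T_i) and are normalized to sum to one within each treatment arm, and the symbols α, γ, and ε_i are hypothetical notation.

```latex
% Assumed outcome model within a subgroup:
%   Y_i = \alpha + \tau T_i + X_i^\top \gamma + \varepsilon_i,
%   \quad E[\varepsilon_i \mid X_i, T_i] = 0 .
% Weighted difference in means with \sum_{T_i=1} w_i = \sum_{T_i=0} w_i = 1:
\begin{align*}
\hat{\tau}
  &= \sum_{i:\,T_i=1} w_i Y_i - \sum_{i:\,T_i=0} w_i Y_i \\
  &= \tau
   + \gamma^\top \Bigl( \sum_{i:\,T_i=1} w_i X_i - \sum_{i:\,T_i=0} w_i X_i \Bigr)
   + \Bigl( \sum_{i:\,T_i=1} w_i \varepsilon_i - \sum_{i:\,T_i=0} w_i \varepsilon_i \Bigr).
\end{align*}
% Exact subgroup mean balance makes the middle term zero, and the last term has
% conditional mean zero given (X, T), so E[\hat{\tau} \mid X, T] = \tau
% regardless of whether the propensity score model that produced w_i is correct.
```

The middle term is the weighted covariate imbalance. A method that drives it to zero in every subgroup removes this source of bias even when the fitted propensity scores are wrong, which matches the behavior reported for G-SBPS under the standard outcome model.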
The estimation of ATT is presented in the online supplementary materials (Tables S1 and S2). The results are similar, demonstrating better performance of the proposed G-SBPS and kG-SBPS methods over the four other existing methods (Logistic, Logistic-S, SBPS, CBPS). The G-SBPS has no bias under the standard outcome model but some bias under the extended outcome model. The kG-SBPS shows some bias in ATT estimation under both outcome models, although the bias is generally smaller than that of the other four existing methods. Increasing the subgroup sample size to 1000 per group reduces the bias of kG-SBPS in ATT estimation to slight or none, which suggests that kG-SBPS, as a nonparametric method, requires a larger sample size (Table S3).
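For reference, a subgroup weighting estimator of the kind evaluated here can be sketched as a normalized difference in weighted means. This is an assumption about the general form, not the authors' exact estimator: the weights are the standard inverse probability weights for the ATE (1/π for treated, 1/(1 − π) for controls) and for the ATT (1 for treated, π/(1 − π) for controls), and the function and argument names are hypothetical.

```python
def ipw_weights(treat, ps, estimand="ATE"):
    """ATE: 1/ps (treated), 1/(1-ps) (control). ATT: 1 (treated), ps/(1-ps) (control)."""
    if estimand == "ATE":
        return [1.0 / p if t else 1.0 / (1.0 - p) for t, p in zip(treat, ps)]
    if estimand == "ATT":
        return [1.0 if t else p / (1.0 - p) for t, p in zip(treat, ps)]
    raise ValueError("estimand must be 'ATE' or 'ATT'")


def subgroup_effect(y, treat, ps, in_subgroup, estimand="ATE"):
    """Normalized weighted mean difference among units flagged in_subgroup."""
    w = ipw_weights(treat, ps, estimand)
    sums = {0: [0.0, 0.0], 1: [0.0, 0.0]}  # per arm: [sum of w*y, sum of w]
    for yi, ti, wi, keep in zip(y, treat, w, in_subgroup):
        if keep:
            sums[ti][0] += wi * yi
            sums[ti][1] += wi
    return sums[1][0] / sums[1][1] - sums[0][0] / sums[0][1]
```

Because subgroup membership is passed as a boolean flag per unit, the same function applies to overlapping subgroups: a unit simply contributes to every subgroup it belongs to.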
Right heart catheterization (RHC) data
We applied the proposed methods, G-SBPS and kG-SBPS, to the right heart catheterization (RHC) data 26 to examine the ATE of RHC versus non-RHC on the length of hospital stay. The data set contains
For illustration, we studied the subgroup treatment effects of RHC in a non-overlapping subgroup scheme and an overlapping subgroup scheme (Table S4). In the former scheme, three non-overlapping subgroups (“3-subgroup scheme”) were defined from mean blood pressure (<80, 80–120, or >120 mmHg); each patient belongs to only one subgroup. In the latter scheme, six overlapping subgroups were defined (“6-subgroup scheme”), with three based on the mean blood pressure and another three based on the estimated probability of surviving 2 months. These six subgroups overlap because a patient can belong to a blood pressure subgroup and a survival probability subgroup simultaneously. Here the estimated 2-month survival probability was calculated using the SUPPORT prognostic model. 28 It is known that high blood pressure can induce cardiovascular damage, which may lead to a worse prognosis after RHC treatment. In addition, patients with lower estimated survival probability are usually sicker and may need a longer hospital stay after the RHC treatment. These are the motivations for studying those subgroups. The Logistic-S and SBPS methods do not work with overlapping subgroups. Therefore, we compared all six methods from the simulation among the three non-overlapping subgroups, but excluded Logistic-S and SBPS from the analysis with the overlapping subgroups.
All methods achieved overall covariate balance for the two subgroup schemes in the sense that the S/Ds were generally less than 5%, with CBPS and G-SBPS being the best (Figure S5). The subgroup balance results are in Figure 3. For the 3-subgroup scheme in Figure 3(a), all methods give reasonably good subgroup balance. The two subgroup analysis methods, Logistic-S and SBPS, performed better than Logistic and CBPS, but performed worse than the proposed methods, G-SBPS and kG-SBPS. For the 6-subgroup scheme in Figure 3(b), G-SBPS and kG-SBPS had notably better subgroup balance than Logistic and CBPS. On average, G-SBPS has slightly better subgroup balances than kG-SBPS, which may be attributed to the larger variation in kG-SBPS results, a typical bias-variance trade-off phenomenon between parametric versus nonparametric methods. These observations are consistent with the results of simulation studies.

Boxplots of the subgroup standardized differences (S/D) of all covariates in the RHC data analysis. Red line: 10% S/D; green line: 5% S/D.
Before propensity score adjustment, the length of hospital stay in the RHC group was on average 4.20, 6.28, 8.22, 3.54, 5.59, and 4.87 days longer than in the non-RHC group in subgroups 1 to 6, respectively. The estimated subgroup ATEs by various methods are reported in Table S5. There are considerable differences in the estimated ATEs across subgroups, but these differences are ignored in the usual propensity score analysis. This example highlights the importance of exploring subgroup treatment effects. The estimated subgroup ATEs are generally smaller after the propensity score adjustment, regardless of which method was used. Notably, the ATEs estimated by the four subgroup analysis methods are smaller than those from the general propensity score methods (Logistic and CBPS), which may suggest that the improved subgroup covariate balance reduced heterogeneity between the RHC and non-RHC groups and hence also reduced bias. The ATEs of subgroup 1 (low blood pressure, <80) and subgroup 2 (normal blood pressure, 80–120) are similar after propensity score adjustment, but they are smaller than the ATE of subgroup 3 (high blood pressure, >120). These results show that the RHC causes a longer hospital stay among patients with higher blood pressure. The RHC also causes a longer hospital stay in the subgroup with the highest mortality risk (subgroup 4), whose baseline health may be worse.
We applied G-SBPS and kG-SBPS to a second dataset to illustrate their performance in estimating the ATT. The ATT is best applied to situations where the treated group has a much smaller sample size than the control group. We used Texas Cancer Registry-Medicare linkage data for diabetes self-management training (DSMT) program among 5
We are interested in the ATTs in subgroups of various social statuses, which are described in Table S7. Patients eligible for dual Medicaid or living in the metropolitan area have easier access to medical care. We also speculate that patients living in rural areas or cities may have different lifestyles, which can contribute to the heterogeneity in the effect of DSMT training. Similarly, married and unmarried patients might respond to DSMT training differently due to differences in their way of life or personalities. There is overlap among the six subgroups. Again, Logistic-S and SBPS were only applied to the non-overlapping subgroups 1–4 (Table S7). All subgroups have relatively sufficient sample sizes except subgroup 4, which allows us to study the proposed methods with a small subgroup size.
For both overlapped and non-overlapped subgroups, global balances are attained by all methods, with less than

Boxplots of the subgroup standardized differences (S/D) of all covariates in the DSMT data analysis. Red line: 10% S/D; green line: 5% S/D.
Before propensity score adjustment, the average rates of 3-year hospitalization from the index year are
Subgroup causal effect estimation has wide application, but has received limited attention in the propensity score analysis field. Previously, we demonstrated that global covariate balance is not equivalent to having the propensity score’s balancing property when the fitted propensity score model is subject to misspecification. 15 Thus, propensity score methods that optimize the global balance, such as CBPS, may result in subgroup imbalance and biased subgroup treatment effect estimation. There are critical limitations in the current subgroup propensity score analysis methods. First, subgroup analysis methods, such as SBPS, are not applicable to overlapping subgroups. Second, while the SBPS can improve the subgroup balance (as shown in the numerical studies in this article), it suffers from suboptimal parameter estimation and may not ensure adequate subgroup balance. We propose the novel G-SBPS (parametric method) and kG-SBPS (nonparametric method) that achieve exact subgroup balance through globally optimal parameter estimation. Our numerical studies demonstrated that the proposed methods significantly outperform the existing methods in the literature.
G-SBPS shows a doubly robust-like property; that is, if the fitted model contains all the terms in either the true propensity score model or the true outcome model, the estimated treatment effect shows no bias and small RMSE. Our simulations demonstrate this (Tables 1, 2, S1, and S2). Being a nonparametric method, kG-SBPS requires large subgroup sample sizes for good performance (Table S3, Figures S7 and S8). However, kG-SBPS is more robust to model misspecification, especially when both the propensity score and outcome models are unknown (Tables 2, S2, and S3). While not pursued in this article, it is straightforward to augment the G-SBPS and kG-SBPS estimators with an outcome model for each treatment group to form a doubly robust estimator. Closed-form formulas are available (e.g. Lunceford and Davidian 4 ). Since the kG-SBPS reduces the misspecification of the propensity score model, this will help the bias control and efficiency improvement in the doubly robust estimation. The constraints used by the G-SBPS improve the finite sample balancing properties of the propensity score weights, for both the overall sample and subgroups. This is expected to improve the finite sample performance of the doubly robust estimation. Hence, it is necessary to conduct model diagnostics, particularly when the subgroup sample sizes are small. If the data analysts are confident that enough covariate terms have been included in the model, G-SBPS should perform the best because it has less variability than the kG-SBPS.
The kG-SBPS may depend on a kernel function to measure similarity in covariate space. In this work, we use the Gaussian kernel due to its smoothness, flexibility, and widespread success in nonparametric learning applications. While other kernels may behave differently, existing evidence suggests that performance differences are often modest when reasonable kernels are used. Thus, we expect kG-SBPS to be relatively robust to kernel choice. Furthermore, the bandwidth
Next, we discuss how to choose between G-SBPS and kG-SBPS in practice. This choice can be guided by considerations of model complexity, potential nonlinearities, and covariate balance diagnostics. G-SBPS relies on a parametric specification of the propensity score and achieves global covariate balance through direct optimization, making it efficient when the parametric model is approximately correct. In contrast, kG-SBPS employs a kernel-based approach, which provides greater flexibility to capture nonlinear relationships and interactions, but may reduce efficiency if the parametric model is adequate. In practical applications, if prior knowledge or exploratory analysis suggests that the relationships between covariates and treatment assignment are relatively simple, G-SBPS is generally preferred. Conversely, if complex or nonlinear relationships are suspected, kG-SBPS may offer improved balance. Ultimately, empirical covariate balance diagnostics after weighting provide critical guidance for method selection, ensuring that the chosen method adequately balances both main effects and desirable higher-order terms.
We have focused on propensity score-based methods in this article, and the propensity score model development does not use any information from the outcomes. In addition, since our method is based on inverse probability of treatment weighting, it is straightforward to incorporate outcome models and calculate a doubly robust estimator.4,19 The subgroup balancing can also be performed using “balancing weights,” that is searching for weights that directly balance covariates without explicitly using propensity scores. 30 The relationship and relative merits of model-based versus direct balancing methods are an ongoing debate,17,31 and a comparison of their performance in the context of subgroup causal analysis will be future work. Other topics of future interest include developing new model selection methods for the subgroup propensity score analysis, and extending G-SBPS and kG-SBPS to multiple treatment groups.
Supplemental Material
Supplemental material, sj-pdf-1-smm-10.1177_09622802251415157 for Parametric and nonparametric propensity score weighting analysis with subgroup covariate balance by Yan Li, Yong-Fang Kuo and Liang Li in Statistical Methods in Medical Research
Footnotes
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research is supported by NIH grants R01CA225646 and P30CA016672, and CPRIT grant RP210130.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Supplemental material
Supplemental material for this article is available online.
