
Abstract

Cluster randomized trials are widely used in healthcare research for the evaluation of intervention strategies. Beyond estimating the average treatment effect, it is often of interest to assess whether the treatment effect varies across subgroups. While conventional methods based on tests of interaction terms between treatment and covariates can be used to detect treatment effect heterogeneity in cluster randomized trials, they typically rely on parametric assumptions that may not hold in practice. Adapting existing permutation tests from individually randomized trials, however, requires conceptual clarification and modification due to the multiple possible interpretations of treatment effect heterogeneity in the cluster randomized trial context. In this work, we develop variations of permutation tests and clarify key causal definitions in order to assess treatment effect heterogeneity in cluster randomized trials. Our procedure enables investigators to simultaneously test for effect modification across a large number of covariates, while maintaining nominal type I error rates and reasonable power in simulation studies. In the Pain Program for Active Coping and Training (PPACT) study, the proposed methods are able to detect treatment effect heterogeneity that was not identified by conventional methods assessing treatment–covariate interactions.

Keywords

Cluster randomized trials, estimands, generalized additive mixed model, intracluster correlation coefficient, linear mixed models, permutation test

1. Introduction

Cluster randomized trials (CRTs) are conducted because the treatment naturally occurs at the group level, to limit the risk of within-cluster treatment contamination, and for logistical reasons.¹ In a CRT, outcome observations are often measured at the individual level, while randomization occurs at the cluster level. Because of this, individual observations within the same group generally have a positive intracluster correlation and it is recommended that statistical analyses of CRTs account for this correlation.² While the average treatment effect has been the main target of estimation and inference in many published CRTs and the corresponding methodological literature, interest is growing in investigating treatment effect heterogeneity across subpopulations. Such analyses can reveal crucial insights at both the individual and cluster levels, providing a more nuanced understanding of intervention efficacy and informing the design of tailored strategies.

In this article, we consider treatment effect heterogeneity that refers to nonrandom, explainable treatment effects which vary across individual or cluster subpopulations. These differences may arise due to biological mechanisms, differences in access to services, and potentially adverse effects, among others, and can occur both at the cluster and individual level. At the cluster level, for example, hospitals may vary in terms of their quality of care and physician experiences, leading to differential responses to treatment. On the individual level, a surgical intervention may lead to greater benefit in younger participants, while being less effective in older or more frail participants, due to the presence of comorbidities. Exploring such variation of treatment effects at the cluster- or individual-level in CRTs can provide finer-resolution evidence beyond an average treatment effect, and carries important policy implications for scaling up interventions in target populations.

A common practice for exploring treatment effect heterogeneity is to test for treatment-by-covariate interactions. In CRTs, this can be done by postulating a linear mixed model or a marginal model fitted by generalized estimating equations in order to incorporate the clustering effect.³ Either procedure facilitates statistical testing of treatment-by-covariate interaction terms based on asymptotic theory. For example, a likelihood ratio (LR) test, which compares the likelihood of a given mean model specification with the likelihood of a nested specification, can be used to assess the presence of interaction effects in a linear mixed model. When conducting a statistical test for one or more effect modifiers, one can either consider an omnibus test of all interactions or individual tests of each interaction separately. For individual tests, as with more general tests for subgroup differences, there is a high risk of false positive results due to multiplicity. It is generally recommended to use a priori knowledge to limit the number of subgroups considered.⁴ Once the subgroups are pre-specified, investigators may use a multiple testing procedure, such as a Bonferroni correction, to reduce the risk of false positives.⁵ However, both omnibus and Bonferroni-corrected model-based tests often lack power to effectively detect meaningful subgroup differences, especially when the interaction terms are modelled incorrectly.

Different versions of permutation tests have been developed and compared for CRTs, but with a strict focus on the average treatment effect. To conduct a permutation test, we generally shuffle parts of the data, consistent with the null hypothesis, and compute the proportion of re-samples that have a test statistic at least as large as that of the observed data. Such tests can better maintain the type I error rates in small samples and require fewer distributional assumptions than model-based tests. For exponential family outcomes, Braun and Feng⁶ derived uniformly and locally most powerful permutation test statistics for detecting a non-zero treatment effect. Wang and De Gruttola⁷ proposed efficient permutation tests for time-to-event outcomes and stepped wedge CRTs by using a weighted average of pair-specific treatment effect estimates, with the optimal choice of weights dependent on the intracluster correlation coefficient and degree of cluster size heterogeneity. Methods have also been developed to account for CRTs which employ constrained randomization, an allocation technique used for ensuring covariate balance.⁸ By way of estimation, Rabideau and Wang⁹ proposed a general method to construct permutation-based confidence intervals for the average treatment effect using individual-level data from a CRT. Watson et al.¹⁰ investigated permutation-based methods for CRTs in settings with multiple outcomes by adapting various p-value corrections, including Bonferroni, Holm, and Romano–Wolf adjustments. They found that the Romano–Wolf procedure controlled family-wise error rates with efficiency gains over other methods under certain dependence structures. Despite their potential, to the best of our knowledge, no permutation tests have been proposed for detecting treatment effect heterogeneity in CRTs.

Permutation tests generally rely on the exchangeability of observations under a specified null hypothesis. In individually randomized trials, permutation procedures for detecting treatment effect heterogeneity involve first removing the main effect of treatment from the outcome of interest to create exchangeable observations under the null. Then, the treatment indicators are permuted in accordance with the randomization scheme and the effect of treatment is added back into the outcome. In this context, Wang et al.¹¹ proposed a test statistic which compares the mean squared error of a model relating outcome, treatment indicators, and all the covariates to a series of similar treatment-specific models. They apply variable selection within the computation of each test statistic. Wolf et al.¹² used a Virtual Twins approach, extending the methods of Foster et al.¹³ They considered machine learning techniques, such as random forests or Super Learner, to estimate the conditional average treatment effect for each individual. For each permutation, test statistics are based on a single decision tree relating the conditional average treatment effects of the permutation distribution with candidate covariates. Rather than permuting the treatment indicators directly, Foster et al.⁴ permuted the residuals of a flexible model accounting for effect measure modification. Ding et al.¹⁴ proposed first constructing a confidence interval for the average treatment effect, repeating the permutation procedure point-wise over that interval and then taking the maximum p-value. This procedure guarantees test validity as long as the confidence interval is valid. Chung and Olivares¹⁵ propose a permutation test for heterogeneous treatment effects that addresses the challenge of an estimated nuisance parameter, employing a martingale transformation to stabilize the empirical process and ensure asymptotic control of type I error. 
Unlike Ding et al.¹⁴, who take a conservative pointwise approach by constructing a confidence interval around the nuisance parameter and using the maximum p-value across this interval, the method of Chung and Olivares¹⁵ directly neutralizes the impact of the nuisance parameter, resulting in a less conservative, more powerful test. These methods, however, are all designed for individually randomized trials, where the concept of treatment effect heterogeneity is typically narrowly defined at the individual level. In the context of a CRT, treatment effect heterogeneity can be defined at the cluster or individual level, with each having distinct implications. Neglecting this conceptual distinction may result in needlessly inferior policy and treatment recommendations.

In this article, we describe a new permutation test for detecting treatment effect heterogeneity in CRTs. The contribution of this work is two-fold. Firstly, we make explicit the various definitions of treatment effect heterogeneity in CRTs. This is crucial as, unlike in individually randomized trials, treatment effect heterogeneity can have different meanings in a CRT. For example, treatment may be more effective in females (individual level), or treatment may be more effective in clusters with a larger female population (cluster level). Ignoring this distinction may render the analysis less interpretable. We outline testing procedures for both cluster- and individual-level heterogeneity, with a specific focus on the latter, given that the former is a straightforward extension of Ding et al.¹⁴ Secondly, we enhance the methods proposed by Ding et al.¹⁴ by incorporating semiparametric modeling techniques to enhance the efficiency of adjusting for baseline covariates. We assess the finite-sample validity and empirical power of our proposed test under various scenarios in Section 3. In Section 4, we use the proposed method to identify treatment effect heterogeneity in the PPACT (Pain Program for Active Coping and Training) study. In Section 5, we discuss practical recommendations and future directions of our development.

2. Methods for detecting treatment effect heterogeneity in CRTs

2.1. Notation and null hypotheses

We define treatment effect variation using the potential outcomes framework. Let $Y_{k i}$ denote a continuous outcome for individual $i$ ( $i = 1, \dots, n_{k}$ ) nested within cluster $k$ ( $k = 1, \dots, K$ ), $W_{k} \in \{0, 1\}$ be a cluster-level treatment indicator for cluster $k$ that is randomly assigned at the cluster level, and $X_{k i} \in \mathbb{R}^{p}$ denote a group of predefined potential treatment effect modifiers, which may be at the individual or cluster level. We further let $Y_{k i} (a)$ be the potential outcome for individual $i$ in cluster $k$ had cluster $k$ been assigned to $a \in \{0, 1\}$ . We invoke the cluster-level Stable Unit Treatment Value Assumption, which assumes that there is only one version of treatment and there is no interference between clusters.¹⁶ Under this assumption, the observed outcome $Y_{k i}$ can be written as

$$Y_{k i} = W_{k} Y_{k i} (1) + (1 - W_{k}) Y_{k i} (0) \tag{1}$$

Here, when defining the observed outcome, it is clear that the source of the randomness of $Y_{k i}$ is the randomized treatment assignment. Additionally, we assume that the treatment assignment is random and exchangeable under the null hypothesis.

We define the individual-specific treatment effect as $Δ_{k i} = Y_{k i} (1) - Y_{k i} (0)$ . Within this framework, there are two natural null hypotheses of interest. The first considers a constant treatment effect on the individual level, taking the null hypothesis

$$H_{0} : Y_{k i} (1) = Y_{k i} (0) + Δ_{1}^{*} \quad \text{for all } k, i \tag{2}$$

where $Δ_{1}^{*}$ is often assumed to be a constant, leading to the sharp null hypothesis. The second null hypothesis of interest considers a constant treatment effect on the cluster level, or

$$H_{0} : \frac{1}{n_{k}} \sum_{i = 1}^{n_{k}} [Y_{k i} (1) - Y_{k i} (0)] = Δ_{2}^{*} \quad \text{for all } k \tag{3}$$

for some constant $Δ_{2}^{*}$. This is a sharp null hypothesis when each cluster is treated as the unit of inference, but can be viewed as a weak null hypothesis because it does not assume homogeneity of individual-specific treatment effects within clusters.

To clarify the differences between the two null hypotheses: the individual-level null hypothesis assumes that treatment effects are constant across all individuals. In contrast, the cluster-level null hypothesis focuses on the average treatment effect across clusters being constant. The individual-level hypothesis is sharper because it implies the cluster-level hypothesis, but not vice versa. In the analysis of CRTs, the choice of null hypothesis depends on the research objectives. For cluster-level interventions, such as implementing a new healthcare program across different hospitals, researchers may prioritize treatment effect variation between clusters. In this case, a cluster-level null hypothesis would be appropriate as a primary pursuit. On the other hand, when studying individual-level interventions that were cluster-randomized for practical reasons, the variation in treatment effects across individual patients may be of primary interest (and variation in treatment effects across clusters could still be of secondary interest), making an individual-level null hypothesis more suitable. It is worth noting that researchers are not limited to choosing between these approaches. Studies could examine both null hypotheses to investigate both between-cluster and between-individual heterogeneity, regardless of the level at which the intervention is delivered.

We may also consider various definitions of the finite-sample average treatment effect.¹⁷ The first definition considers the average treatment effect across the entire study population of individuals, or the individual-average treatment effect given by

$$Δ_{1} = \frac{1}{\sum_{k = 1}^{K} n_{k}} \sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} [Y_{k i} (1) - Y_{k i} (0)]$$

The second definition considers the average treatment effects across the study population of clusters, or the cluster-average treatment effect given by

$$Δ_{2} = \frac{1}{K} \sum_{k = 1}^{K} \frac{1}{n_{k}} \sum_{i = 1}^{n_{k}} [Y_{k i} (1) - Y_{k i} (0)] \tag{4}$$

Depending on the context, one or both of these estimands may be of interest in a CRT, and a detailed discussion on the interpretation of $Δ_{1}$ and $Δ_{2}$ can be found in Kahan et al.¹⁷ Under certain assumptions, the individual- and cluster-level estimands converge to the same value. Such an equivalence relies on the assumption of non-informative cluster size, or that each cluster’s size $n_{k}$ is random and is independent of the potential outcomes, treatment assignment, and baseline covariates. This assumption allows for unequal cluster sizes, whose randomness can be attributed to logistical differences across clusters, but is unrelated to potential outcomes. In this setting, $E [Δ_{1}] = E [Δ_{2}]$.¹⁸ The assumption of non-informative cluster size is violated if the cluster-specific average treatment effect depends on cluster size.
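To make the distinction between the two estimands concrete, the following minimal sketch evaluates both definitions on made-up potential outcomes. The two clusters, their sizes, and all numeric values are hypothetical and chosen only so that the smaller cluster has the larger effect, that is, so that cluster size is informative.

```python
# Hypothetical potential outcomes for two clusters; all numbers are made up.
# Each entry is the pair (Y(1), Y(0)) for one individual.
clusters = [
    [(3.0, 1.0)] * 2,  # small cluster: every individual effect equals 2.0
    [(1.5, 1.0)] * 8,  # large cluster: every individual effect equals 0.5
]


def individual_average_effect(clusters):
    """Delta_1: mean of Y(1) - Y(0) over all individuals in the study."""
    effects = [y1 - y0 for cluster in clusters for (y1, y0) in cluster]
    return sum(effects) / len(effects)


def cluster_average_effect(clusters):
    """Delta_2: mean over clusters of the within-cluster mean effect."""
    cluster_means = [
        sum(y1 - y0 for (y1, y0) in cluster) / len(cluster)
        for cluster in clusters
    ]
    return sum(cluster_means) / len(cluster_means)


delta_1 = individual_average_effect(clusters)  # (2 * 2.0 + 8 * 0.5) / 10
delta_2 = cluster_average_effect(clusters)     # (2.0 + 0.5) / 2
print(delta_1, delta_2)
```

Because the large cluster dominates the individual-level average but counts only once in the cluster-level average, the two quantities differ in this toy example; they would coincide if both clusters had the same size or the same mean effect.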

In general, the individual-level null hypothesis, described in (2), and the cluster-level null hypothesis, described in (3), are not equivalent. Any randomization test valid for the cluster-level null is also valid for the individual-level null, because the latter implies the former. The converse, however, does not hold. As such, careful consideration must be made to determine which testing procedures are appropriate. To test for constant treatment effects across clusters under the cluster-level null, we can apply the randomization inference framework of Ding et al.,¹⁴ treating clusters as the units of analysis. We can compute cluster-level means,

$${\bar{Y}}_{k} = \frac{1}{n_{k}} \sum_{i = 1}^{n_{k}} Y_{k i}$$

which serve as independent observations. Under the sharp null that assumes a known average treatment effect $Δ_{2}$, defined in (4), we can impute missing potential outcomes by shifting observed means by $Δ_{2}$, permuting treatment assignments $W_{k}$, and recomputing the test statistic. If $Δ_{2}$ is unknown, we estimate it from the observed data as ${\hat{Δ}}_{2}$, the observed difference in mean cluster outcomes, and either plug it in directly or compute the $p$-value by taking the supremum over a wide confidence interval for $Δ_{2}$. To improve precision, we can adjust for covariates by regressing ${\bar{Y}}_{k}$ on cluster-level covariate means and using residuals in place of raw outcomes in the permutation procedure. In the case of an individual-level hypothesis, however, careful consideration must be made as individual outcomes are correlated. The remainder of this work addresses a gap in the literature by focusing on the individual-level null hypothesis.
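The cluster-level procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `cluster_level_test`, the toy data, and the choice of test statistic (the absolute difference in the variance of cluster means between arms) are ours and are not prescribed by the text; the sketch uses the plug-in estimate and enumerates every assignment with the observed number of treated clusters, which is feasible only when $K$ is small.

```python
from itertools import combinations
from statistics import mean, pvariance


def cluster_level_test(y_by_cluster, w):
    """Plug-in permutation test on cluster means (illustrative sketch).

    y_by_cluster: list of lists of individual outcomes, one list per cluster.
    w: list of 0/1 cluster-level assignments in the same order.
    Returns the plug-in estimate of Delta_2 and the permutation p-value.
    """
    ybar = [mean(ys) for ys in y_by_cluster]  # cluster-level means
    n_clusters, n_treated = len(w), sum(w)

    def split(assign, values):
        treated = [v for a, v in zip(assign, values) if a == 1]
        control = [v for a, v in zip(assign, values) if a == 0]
        return treated, control

    def variance_gap(assign, values):
        # Assumed statistic: unequal spread across arms suggests heterogeneity.
        treated, control = split(assign, values)
        return abs(pvariance(treated) - pvariance(control))

    treated, control = split(w, ybar)
    delta2_hat = mean(treated) - mean(control)  # difference in mean cluster outcomes
    observed = variance_gap(w, ybar)

    # Cluster means under control, imputed by removing the shift from treated clusters.
    base = [v - delta2_hat * a for a, v in zip(w, ybar)]

    exceed = total = 0
    for treated_ids in combinations(range(n_clusters), n_treated):
        w_b = [1 if k in treated_ids else 0 for k in range(n_clusters)]
        ybar_b = [v + delta2_hat * a for a, v in zip(w_b, base)]
        # Small tolerance guards against floating-point ties.
        exceed += variance_gap(w_b, ybar_b) >= observed - 1e-12
        total += 1
    return delta2_hat, exceed / total


# Toy data: six clusters of two individuals each (values are made up).
y_by_cluster = [[1.0, 2.0], [2.0, 4.0], [3.0, 3.0],
                [5.0, 6.0], [4.0, 7.0], [9.0, 11.0]]
w = [0, 0, 0, 1, 1, 1]
delta2_hat, p_value = cluster_level_test(y_by_cluster, w)
print(delta2_hat, p_value)
```

Covariate adjustment would replace `ybar` with residuals from a regression of the cluster means on cluster-level covariate means, and the supremum approach would repeat the loop over a grid of candidate shifts instead of the single plug-in value.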

2.2. Permutation tests for detecting individual-level treatment effect heterogeneity

To test the null (2), we first present a general permutation testing framework for the idealized scenario when $Δ_{1}^{*} = Δ$ is known. This assumption will be relaxed in due course with an estimated value for $Δ$ . Under the sharp null hypothesis: $H_{0} : Y_{k i} (1) = Y_{k i} (0) + Δ$ for all $k$ and $i$ , we can impute all missing potential outcomes given the observed data. For the treated observations, the potential outcome under treatment is $Y_{k i}$ and the potential outcome under control is $Y_{k i} - Δ$ . For the control observations, the potential outcome under control is $Y_{k i}$ and the potential outcome under treatment is $Y_{k i} + Δ$ . Then a permutation test for testing treatment effect variation at the individual level proceeds with the following steps:

Calculate a specified test statistic based on the observed data, defined as $T = T (W, Y, X)$ , where $W = \{W_{1}, \dots, W_{K}\}$ is the collection of cluster-level treatment assignments, $Y = \{Y_{11}, \dots, Y_{1 n_{1}}, Y_{21}, \dots, Y_{K n_{K}}\}$ is the collection of all observed outcomes across all clusters, and $X \in \mathbb{R}^{n \times p}$ is the matrix of baseline covariates, with $n = \sum_{k = 1}^{K} n_{k}$ the total number of participants and $p$ the number of covariates. Covariates in $X$ may be used to improve testing power.

Enumerate all possible treatment assignments in accordance with the randomization scheme (accounting for appropriate restrictions at the design stage such as stratification or covariate-constrained randomization). Typically, there would be too many possible schemes to list so one can take a random sample by simulating a large number ( $B$ ) of possible schemes.

For $b = 1, \dots, B$ , let $W^{b} = \{W_{1}^{b}, \dots, W_{K}^{b}\}$ be a permuted treatment assignment. Then given the observed outcomes $Y_{k i}$ , observed treatment assignment $W_{k}$ , and the sharp null $H_{0} : Y_{k i} (1) = Y_{k i} (0) + Δ$ , compute first the transformed outcomes ${\tilde{Y}}_{k i}^{b} = Y_{k i} - Δ W_{k} + Δ W_{k}^{b}$ based on the permuted treatment assignment for all $k, i$ . Then based on the transformed outcomes ${\tilde{Y}}^{b} = \{{\tilde{Y}}_{11}^{b}, \dots, {\tilde{Y}}_{1 n_{1}}^{b}, {\tilde{Y}}_{21}^{b}, \dots, {\tilde{Y}}_{K n_{K}}^{b}\}$ , compute the test statistic $T^{b} = T (W^{b}, {\tilde{Y}}^{b}, X)$ .

Compare the observed test statistic $T$ with its null distribution under permutation, and obtain the p-value

$$p (Δ) = \Pr (T^{b} \geq T) \tag{5}$$
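The four steps above can be sketched in a few lines for the idealized case of a known $Δ$. This is a minimal illustration rather than a reference implementation: the function name `permutation_p_value`, the toy data, and the toy statistic `variance_gap` are assumptions introduced here, covariates are omitted from the statistic, and the probability in (5) is approximated by the proportion of $B$ randomly sampled assignments (a simple shuffle of cluster labels, which assumes unrestricted randomization with a fixed number of treated clusters).

```python
import random
from statistics import pvariance


def permutation_p_value(y, cluster, w, delta, statistic, n_perm=500, seed=0):
    """Monte Carlo permutation p-value under Y(1) = Y(0) + delta.

    y: individual outcomes; cluster: cluster id of each individual;
    w: dict mapping cluster id to its observed 0/1 assignment;
    statistic: function (w_per_individual, outcomes) -> float.
    """
    rng = random.Random(seed)
    ids = sorted(w)
    labels = [w[c] for c in ids]
    observed = statistic([w[c] for c in cluster], y)

    exceed = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)  # permute assignments at the cluster level
        w_b = dict(zip(ids, shuffled))
        # Transformed outcomes: Y - delta * W + delta * W^b
        y_b = [yi - delta * w[c] + delta * w_b[c] for yi, c in zip(y, cluster)]
        t_b = statistic([w_b[c] for c in cluster], y_b)
        exceed += t_b >= observed
    return exceed / n_perm


def variance_gap(w_ind, y):
    """Toy statistic: absolute difference in outcome variance between arms."""
    treated = [yi for wi, yi in zip(w_ind, y) if wi == 1]
    control = [yi for wi, yi in zip(w_ind, y) if wi == 0]
    return abs(pvariance(treated) - pvariance(control))


# Toy CRT: 6 clusters of 5 individuals with a constant effect of 1 (made up).
data_rng = random.Random(1)
cluster = [k for k in range(6) for _ in range(5)]
w = {k: k % 2 for k in range(6)}
y = [data_rng.gauss(0, 1) + 1.0 * w[c] for c in cluster]

p_value = permutation_p_value(y, cluster, w, delta=1.0,
                              statistic=variance_gap, n_perm=200)
print(p_value)
```

Permuting whole clusters rather than individuals is what preserves the within-cluster correlation in every re-sample; a stratified or covariate-constrained design would require replacing the shuffle with draws from the restricted assignment set.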

When $Δ$ is known, the above test is exact because it is guaranteed to maintain the desired type I error rate (generally specified at 5%). In practice, however, $Δ$ is usually unknown and needs to be estimated to carry out the test. One natural approach is to use a consistent estimator of $Δ$ and simply plug in this value as an estimate for $Δ$ throughout the above procedures. In permutation tests for both group and individually randomized trials, the effects of replacing nuisance parameters with their consistent estimators have been studied previously.⁶,¹⁹ Namely, these works conclude that, under regularity conditions, such replacements will not negatively affect the validity or power of the permutation test. In our setting, however, $Δ$ is not a classical nuisance parameter; its correct estimation is crucial to establishing test validity. To elaborate, the validity of the permutation test requires $(Y - Δ W, X) \perp\!\!\!\perp W$ , which rests on $Δ$ being correctly estimated. If the estimator of $Δ$ is biased, then the validity of the test is no longer guaranteed. In the analogous situation of detecting treatment effect heterogeneity using permutation tests in individually randomized trials, previous simulation studies have yielded mixed results. Wang et al.¹¹ showed via simulations that the type I error rates based on the true treatment effects compared with using their consistent estimators were similar, even with a small sample size. However, Ding et al. showed that this plug-in method yields incorrect type I error rates when the sampling distribution of $\hat{Δ}$ is highly skewed.¹⁴ Because of these mixed findings, we assess both the plug-in method and a more robust alternative method that uses a confidence interval approach.

Specifically, we operationalize the plug-in method in CRTs by replacing $Δ$ with a consistent estimator $\hat{Δ}$ and computing the p-value using the above procedure. This consistent estimator is given by the two-sample difference-in-means estimator, or equivalently, an independent generalized estimating equation estimator fitted to the individual-level data that targets the individual-average treatment effect.¹⁷ To address the potential issues with the plug-in method, we additionally consider an alternative method where we find the maximum p-value over a range of candidate values for $Δ$ . To ensure validity, one approach is to find the maximum p-value across all possible values of the nuisance parameter $Δ^{'} \in (- \infty, \infty)$ :

$$p_{\sup} = \sup_{Δ^{'}} p (Δ^{'}) \tag{6}$$

Computing the supremum over the entire real line, however, is computationally intractable and may lead to a loss in statistical power. Berger and Boos²⁰ propose a convenient fix to this, by instead maximizing over a $(1 - γ)$-level estimated confidence interval for $Δ$. This approach involves constructing a confidence interval around the estimated nuisance parameter and selecting the maximum p-value within this interval:

$$p_{\sup} = \sup_{Δ^{'} \in {CI}_{γ}} p (Δ^{'}) + γ \tag{7}$$

Given that ${CI}_{γ}$ is a valid $(1 - γ)$-level confidence interval for $Δ$, the type I error rate is guaranteed under the null hypothesis; a formal proof is provided in Ding et al.¹⁴ The $γ$ term accounts for the small chance that the true value of $Δ$ lies outside the interval, guaranteeing that the desired type I error rate is maintained. $γ$ is typically chosen to be very small (e.g. 0.0001) so that the resulting p-value remains both conservative and sufficiently powerful.
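A minimal sketch of the confidence interval method in (7) is given below. Several pieces are assumptions made for illustration: the interval is taken to be a Wald interval $\hat{Δ} \pm z_{1 - γ / 2} \, \mathrm{se}$ with the standard error supplied by the caller, the supremum is approximated by a maximum over a finite grid, and `toy_p` is a stand-in for the permutation p-value function $p (Δ^{'})$ that exists only to exercise the code.

```python
from statistics import NormalDist


def ci_max_p_value(p_value_at, delta_hat, se, gamma=0.0001, n_grid=21):
    """Approximate sup over CI_gamma of p(delta') and add gamma, as in (7).

    p_value_at: function mapping a candidate delta' to a permutation p-value.
    delta_hat, se: point estimate and its standard error (assumed given).
    The supremum is approximated on an evenly spaced grid of n_grid points.
    """
    if n_grid < 2:
        raise ValueError("n_grid must be at least 2")
    z = NormalDist().inv_cdf(1 - gamma / 2)  # two-sided normal critical value
    lower, upper = delta_hat - z * se, delta_hat + z * se
    step = (upper - lower) / (n_grid - 1)
    grid = [lower + step * j for j in range(n_grid)]
    p_max = max(p_value_at(d) for d in grid)
    return min(1.0, p_max + gamma)  # a p-value cannot exceed 1


def toy_p(delta):
    """Stand-in p-value curve, peaked at delta = 1 (not a real test)."""
    return max(0.0, 0.2 - 0.05 * abs(delta - 1.0))


p_sup = ci_max_p_value(toy_p, delta_hat=1.0, se=0.1)
print(p_sup)
```

A finer grid gives a closer approximation to the supremum at the cost of one full permutation run per grid point, so in practice the grid size trades computation against the risk of missing a narrow peak in $p (Δ^{'})$.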

In the CRT setting, as $\hat{Δ}$ is estimated via the independent generalized estimating equations, we propose constructing the confidence interval using the robust sandwich variance estimator.²¹ In individually randomized trials, Ding et al.¹⁴ noted that the behavior of the p-values as a function of $Δ^{'} \in {CI}_{γ}$ can be complex, depending on the specified test statistic and the value of $Δ^{'}$ . They observed that in their simulations, p-values generally tended to 0 or remained flat towards the tails of $Δ^{'}$ , but made no theoretical guarantees about such trends. We will explore these observations in the context of CRTs in the ensuing simulation studies.

Further, we clarify the difference between the sharp null hypothesis evaluated by our testing procedure and the conventional null hypothesis of constant average treatment effect across predefined subgroups. The latter tests the null hypothesis that the treatment effect is consistent across subgroups defined by one or more covariates. This null hypothesis requires pre-specifying covariates of interest and specifically targets the form of effect modification by the resulting subgroups. In contrast, our permutation-based test evaluates a sharper null hypothesis: it assesses whether the treatment effect is constant across all subgroups defined by any combination of covariates, regardless of whether they are observed or unobserved. That is, rather than conditioning on a specific set of covariates, our procedure tests whether treatment effect heterogeneity exists across all possible covariate combinations.

When the sharper null hypothesis (that the treatment effect is constant for all individuals) does not hold, it is still possible that the conventional null hypothesis (that the average treatment effect is constant across pre-specified subgroups) does hold. In such cases, the proposed test procedure, which is designed to test the sharper null, would not be appropriate for testing the conventional null, as it may lead to inflated type I error rates. Specifically, if the conventional null holds while the sharper null does not, the proposed test may still reject, detecting finer-scale variations in treatment effects that do not necessarily translate into differences in subgroup-level averages. Compared to the tests for the conventional null, the proposed procedure is more sensitive in detecting treatment effect heterogeneity because it is designed to identify any form of variation in treatment effects across individuals. It does not rely on pre-specified subgroup definitions but instead evaluates heterogeneity in a more flexible and global manner.

2.3. Proposed test statistics for permutation tests to detect treatment effect heterogeneity

There are many possible choices for the test statistic $T (\cdot)$ that measures treatment effect heterogeneity, and the choice may be guided by considerations such as optimal power against certain alternatives, robustness to deviations from assumptions, sensitivity to effect size, computational efficiency, ease of interpretation, flexibility in handling different types of data, and the ability to accommodate complex experimental designs. In the following sections we describe two such statistics, one based on comparing the marginal cumulative distribution functions (CDFs) of potential outcomes under the control and treatment conditions and another based on comparing these marginal CDFs while adjusting for baseline covariates. As baseline covariates often contain important information about the outcomes in CRTs,²² it is relevant to explore whether adjusting for baseline covariates can further improve the power for the permutation test that targets treatment effect variation.

2.3.1. Shifted Kolmogorov–Smirnov (SK–S) statistics

We first consider a test statistic that relies solely on the Stable Unit Treatment Value Assumption and exchangeability, without making any additional distributional assumptions about our observed data and potential outcomes. Let $N_{0}$ be the total number of individuals across clusters randomized into the control condition and $N_{1}$ be the total number of individuals across clusters randomized into the treatment condition. The empirical CDFs of the potential outcomes under the control and treatment conditions are defined as

$${\hat{F}}_{a} (y) = \frac{1}{N_{a}} \sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} I (W_{k} = a) I (Y_{k i} (a) \leq y) = \frac{1}{N_{a}} \sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} I (W_{k} = a) I (Y_{k i} \leq y)$$

for $a \in \{0, 1\}$, respectively. We can then consider the SK–S test statistic based on comparing these empirical CDFs:

$$T_{SKS} (W, Y, X) = \sup_{y \in Y} | {\hat{F}}_{0} (y) - {\hat{F}}_{1} (y - \hat{Δ}) | \tag{8}$$

where $\hat{Δ}$ is the estimated individual-average treatment effect based on the observed data. With this test statistic, one can proceed with the permutation test described in Section 2.2 with either the plug-in method or the confidence interval method.
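As a rough illustration of how a shifted K–S statistic can be computed, the sketch below removes the estimated shift from the treated outcomes and then takes the largest gap between the two empirical CDFs. The function names and toy values are assumptions made for this example. The alignment convention is also ours: the treated outcomes are moved down by $\hat{Δ}$ so that both arms share a common location under the constant-effect null, and $\hat{Δ}$ is the pooled difference in means across individuals.

```python
from bisect import bisect_right
from statistics import mean


def ecdf(sample):
    """Return the right-continuous empirical CDF of a sample as a function."""
    xs = sorted(sample)
    n = len(xs)
    return lambda value: bisect_right(xs, value) / n


def shifted_ks(w_ind, y):
    """Shifted K-S statistic for individual-level assignments and outcomes.

    w_ind: 0/1 treatment indicator per individual (inherited from the cluster).
    y: observed outcome per individual.
    """
    treated = [yi for wi, yi in zip(w_ind, y) if wi == 1]
    control = [yi for wi, yi in zip(w_ind, y) if wi == 0]
    delta_hat = mean(treated) - mean(control)   # difference-in-means estimate
    aligned = [t - delta_hat for t in treated]  # remove the estimated shift
    f_control, f_treated = ecdf(control), ecdf(aligned)
    # Both ECDFs are step functions, so the supremum is attained at a data point.
    points = control + aligned
    return max(abs(f_control(p) - f_treated(p)) for p in points)


# Constant shift of 5: the aligned distributions coincide exactly.
w_ind = [0, 0, 0, 0, 1, 1, 1, 1]
t_constant = shifted_ks(w_ind, [0.0, 1.0, 2.0, 3.0, 5.0, 6.0, 7.0, 8.0])

# Treated outcomes with a different shape: a gap remains after alignment.
t_varying = shifted_ks(w_ind, [0.0, 1.0, 2.0, 3.0, 4.0, 4.0, 6.0, 6.0])
print(t_constant, t_varying)
```

Because the statistic is recomputed on every permuted data set, including the re-estimation of the shift, its reference distribution reflects the same alignment step that is applied to the observed data.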

2.3.2. Shifted Kolmogorov–Smirnov statistics with baseline covariate adjustment (GK–S)

In order to improve testing power, we may allow the chosen test statistic to account for the relationship between covariates and outcome. As long as the covariates are predictive of the outcome, such an adjustment will often improve power to detect treatment effects by explaining at least some baseline variation.²²,²³ We conjecture that this observation may be generalized to testing for treatment effect heterogeneity in CRTs, and propose to integrate baseline covariate adjustment into the test statistics. Importantly, since the validity of the test relies only on cluster randomization and the correct estimation of the average treatment effect, adjusting for baseline covariates does not compromise the test validity. To allow for flexible functional forms of the baseline covariates, we create a generalized additive mixed model (GAMM)-adjusted K–S (GK–S) test statistic by comparing the CDFs of the residuals of a GAMM fit of $Y_{k i} (0)$ on $X_{k i}$ in all clusters.²⁴ In the case of normally distributed errors and an identity link function, we fit the following model using data from all clusters under the control condition, $Y_{k i} (0) = h (X_{k i}) + α_{k} + ϵ_{k i}$ , with $α_{k} \sim N (0, σ_{α}^{2})$ and $ϵ_{k i} \sim N (0, σ_{ϵ}^{2})$ , and

$$h (X_{k i}) = f_{1} (X_{k i 1}) + f_{2} (X_{k i 2}) + \dots + f_{p} (X_{k i p})$$

Here, the predictor is specified through an additive structure based on the $p$ components of $X_{k i}$ (possibly including higher-order terms and interactions), and $f_{1}, \dots, f_{p}$ are unknown smooth functions of the covariates. The estimation can be performed using spline smoothing with the gamm function in the mgcv R package; $h (X_{k i})$ can be considered as the prognostic score²⁵ that can explain away the variation in the observed outcome. Then for each individual in the CRT, we obtain the prognostic-score adjusted outcome as ${\hat{e}}_{k i} = Y_{k i} - \hat{h} (X_{k i})$, and construct the test statistic as

$$T_{GKS} (W, Y, X) = \sup_{y \in Y} | {\hat{F}}_{e_{0}} (y) - {\hat{F}}_{e_{1}} (y - \hat{Δ}) | \tag{9}$$

where ${\hat{F}}_{e_{1}} (y)$ and ${\hat{F}}_{e_{0}} (y)$ are the empirical CDFs of the covariate-adjusted outcomes ${\hat{e}}_{k i}$ for treatment and control groups, respectively, and $\hat{Δ}$ remains the estimated individual-average treatment effect, whose efficiency may be improved by incorporating baseline covariates $X$. In our simulation study and data analysis, we simplify this approach by fitting a generalized additive model (GAM) using only fixed effects and data from control units. Since the outcome data are Gaussian, the fixed effects from a GAM remain unbiased even if the random effects (i.e. the covariance structure) are misspecified, as established in the literature on linear mixed models for longitudinal data.²⁶ The permutation test can then be operationalized with either the plug-in method or the confidence interval method. We note that other semi- and non-parametric methods, such as random forests, can also be used for baseline covariate adjustment. We selected GAMMs because they can model complex, nonlinear relationships using smooth flexible functions and can conveniently incorporate random effects to account for the intracluster correlation in CRT settings. In what follows, we will examine the finite-sample performance characteristics of these versions of permutation tests in simulated CRTs and real data from a completed CRT.
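The covariate-adjusted statistic can be sketched along the same lines. To keep the example dependency-free, a straight-line least-squares fit on a single covariate stands in for the GAM; this substitution, the function names, and the toy data are all assumptions made for illustration and are not the model used in the paper. As in the simplified approach described above, the outcome model is fitted on control units only, residuals are then formed for every individual, and the treated residuals are moved down by the difference-in-means estimate before the empirical CDFs are compared.

```python
from bisect import bisect_right
from statistics import mean


def fit_line(x, y):
    """Least-squares line; a deliberately simple stand-in for a GAM fit."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return lambda value: y_bar + slope * (value - x_bar)


def ecdf(sample):
    """Return the right-continuous empirical CDF of a sample as a function."""
    xs = sorted(sample)
    n = len(xs)
    return lambda value: bisect_right(xs, value) / n


def adjusted_shifted_ks(w_ind, y, x):
    """Shifted K-S statistic computed on covariate-adjusted outcomes."""
    control_x = [xi for wi, xi in zip(w_ind, x) if wi == 0]
    control_y = [yi for wi, yi in zip(w_ind, y) if wi == 0]
    treated_y = [yi for wi, yi in zip(w_ind, y) if wi == 1]
    h_hat = fit_line(control_x, control_y)       # fitted on control units only
    delta_hat = mean(treated_y) - mean(control_y)
    resid = [yi - h_hat(xi) for yi, xi in zip(y, x)]
    e0 = [r for wi, r in zip(w_ind, resid) if wi == 0]
    e1 = [r - delta_hat for wi, r in zip(w_ind, resid) if wi == 1]
    f0, f1 = ecdf(e0), ecdf(e1)
    return max(abs(f0(p) - f1(p)) for p in e0 + e1)


# Toy data with outcome 2 * x under control (values are made up).
w_ind = [0, 0, 0, 0, 1, 1, 1, 1]
x = [0.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0]
t_constant = adjusted_shifted_ks(w_ind, [0.0, 2.0, 4.0, 6.0, 1.0, 3.0, 5.0, 7.0], x)
t_varying = adjusted_shifted_ks(w_ind, [0.0, 2.0, 4.0, 6.0, 1.0, 3.0, 5.0, 9.0], x)
print(t_constant, t_varying)
```

In the second toy data set one treated individual receives a larger effect; once the covariate has absorbed the baseline variation, that single departure produces a visible gap between the residual distributions, which is the intuition behind adjusting before comparing CDFs.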

3. Simulation study

We conducted a simulation study to assess the validity and empirical power of the proposed permutation testing frameworks under a range of scenarios.

3.1. Validity

We first investigate the empirical type I error rate for different permutation tests under the null hypothesis of no treatment effect heterogeneity. In order to do this, we use the following steps to generate synthetic CRTs for evaluation.

1. Generate a synthetic CRT with $n_{k} = n \in \{50, 100\}$ individuals per cluster and $K \in \{20, 100\}$ clusters. Randomly assign treatment, $W_{k}$, at the cluster level for each cluster $k \in \{1, \dots, K\}$ with equal allocation.

2. Simulate potential outcomes consistent with the constant treatment effect assumption that $Y_{k i} (1) = Y_{k i} (0) + Δ$ with $Δ = 1$. To do this, generate individual outcomes using the model-based formula $Y_{k i} = h (X_{k i}) + W_{k} + α_{k} + ϵ_{k i}$, where $X_{k i} \sim N (0, 1^{2})$ is an individual-level variable, with covariate ICC $ρ_{x} = 0.01$, which is unrelated to treatment effectiveness, $α_{k} \sim N (0, σ_{α}^{2})$ is a cluster intercept and $ϵ_{k i}$ is an individual-level heterogeneity term with variance $σ_{ϵ}^{2}$; the variance components are specified such that the outcome ICC is $ρ_{y | x} = 0.1$, which is within the commonly reported range of ICC values.²⁷ In our study, we consider both $ϵ_{k i} \sim N (0, {0.5}^{2})$ and $ϵ_{k i}$ with a log-normal distribution scaled and shifted to have mean 0 and standard deviation $0.5$; the latter scenario is used to assess the robustness of the permutation tests to skewed outcome distributions. We also consider three functional forms of the baseline covariate in the data generating model; that is, $h (x)$ is specified as $h (x) = 0$, $h (x) = 2 x$ or $h (x) = x + \cos (x)$, representing no covariate effect, linear covariate effect and nonlinear covariate effect, respectively.

3. For each choice of the test statistic and method to address the nuisance parameter $Δ$ (details below), calculate the observed test statistic and p-value under permutation.
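The data-generating steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' simulation code: the function name `simulate_null_crt` is hypothetical, the covariate ICC is induced by splitting the unit variance of $X_{k i}$ into a cluster-level and an individual-level part, and the log-normal errors are obtained by standardizing $\exp(Z)$ with $Z \sim N (0, 1)$ (the source does not state the log-normal parameters).

```python
import math
import random


def simulate_null_crt(K=20, n=50, delta=1.0, h=lambda x: 2 * x,
                      rho_x=0.01, rho_y=0.1, sd_eps=0.5,
                      lognormal=False, seed=0):
    """Return rows (cluster, arm, x, y) under a constant effect delta."""
    rng = random.Random(seed)
    # Outcome ICC = var_alpha / (var_alpha + var_eps), solved for sd_alpha.
    sd_alpha = math.sqrt(rho_y / (1 - rho_y)) * sd_eps
    # Equal allocation: half of the clusters treated, assigned at random.
    arms = [1] * (K // 2) + [0] * (K - K // 2)
    rng.shuffle(arms)
    rows = []
    for k in range(K):
        alpha = rng.gauss(0, sd_alpha)              # cluster intercept
        x_cluster = rng.gauss(0, math.sqrt(rho_x))  # shared covariate part
        for _ in range(n):
            x = x_cluster + rng.gauss(0, math.sqrt(1 - rho_x))
            if lognormal:
                # exp(Z) has mean e^{1/2} and variance (e - 1) * e
                # (assumed parameters); standardize, then rescale.
                z = math.exp(rng.gauss(0, 1))
                sd_z = math.sqrt((math.e - 1) * math.e)
                eps = sd_eps * (z - math.exp(0.5)) / sd_z
            else:
                eps = rng.gauss(0, sd_eps)
            y = h(x) + delta * arms[k] + alpha + eps
            rows.append((k, arms[k], x, y))
    return rows
```

Passing `h=lambda x: x + math.cos(x)` or `h=lambda x: 0.0` would give the other two covariate-effect scenarios.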

Specifically, we compare our three proposed methods with one existing method. The three proposed methods are permutation tests that use the true average treatment effect $Δ$ (PT-T), a plug-in estimator of $Δ$ (PT-PI), or a p-value optimized over a 99.999% confidence interval for $Δ$ (PT-CI). The PT-T method is an idealized approach available only in a simulation setting (the gold standard). The existing method considered is an LR test derived from a linear mixed model with treatment-by-covariate interaction effects (L-LRT). For the proposed PT-PI, we use the procedure described in Section 2.2, with $\hat{Δ}$ computed using the independent generalized estimating equation estimator. For the proposed PT-CI, we use the procedure described in Section 2.2, maximizing over a 99.999% confidence interval for $Δ$, constructed based on the robust sandwich variance estimator and normal approximation.²¹ For the L-LRT, we first fit a linear mixed model using the function lmer in the R package lme4 with $Y_{k i}$ as the outcome, a random cluster intercept, the main effects of both treatment, $W_{k}$, and the covariate, $X_{k i}$, and an interaction between $X_{k i}$ and $W_{k}$. We then use an LR test to assess the null hypothesis that the interaction term is equal to 0. We compute the empirical rejection rates at significance level $0.05$ for each method. For each data generating model, we simulate $1000$ CRTs, and the empirical type I error rate is summarized as the proportion of simulation iterations in which the observed p-value is less than $0.05$.
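To make the plug-in and confidence-interval variants concrete, the following Python sketch shows one way a cluster-level permutation test with a nuisance shift could be organized. It is a sketch under assumptions, not the procedure of Section 2.2: the function names are hypothetical, the difference in pooled means stands in for the independence estimating equation estimator, the `(1 + hits) / (1 + n_perm)` p-value convention is a common choice rather than one taken from the source, and the additive `gamma` term follows the usual construction for p-values maximized over a confidence interval.

```python
import random
from bisect import bisect_right


def shifted_ks(y0, y1, delta):
    """Sup-distance between the ECDF of y0 and the ECDF of y1 - delta."""
    a, b = sorted(y0), sorted(v - delta for v in y1)
    return max(abs(bisect_right(a, t) / len(a) - bisect_right(b, t) / len(b))
               for t in a + b)


def split(y, w):
    """Return (control outcomes, treated outcomes)."""
    return ([v for v, g in zip(y, w) if g == 0],
            [v for v, g in zip(y, w) if g == 1])


def statistic(y, w):
    """Shifted K-S statistic with the shift re-estimated from (y, w)."""
    y0, y1 = split(y, w)
    delta_hat = sum(y1) / len(y1) - sum(y0) / len(y0)
    return shifted_ks(y0, y1, delta_hat)


def perm_pvalue(y, cluster, w_cluster, delta, n_perm=200, seed=0):
    """Permutation p-value under H0: Y(1) = Y(0) + delta for everyone."""
    rng = random.Random(seed)
    w = [w_cluster[c] for c in cluster]
    y_ctrl = [v - delta * g for v, g in zip(y, w)]  # imputed Y(0)
    t_obs = statistic(y, w)
    ids = sorted(w_cluster)
    labels = [w_cluster[c] for c in ids]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)                 # re-randomize whole clusters
        arm = dict(zip(ids, labels))
        w_new = [arm[c] for c in cluster]
        y_new = [v + delta * g for v, g in zip(y_ctrl, w_new)]
        hits += statistic(y_new, w_new) >= t_obs
    return (1 + hits) / (1 + n_perm)


def pt_ci(y, cluster, w_cluster, low, high, gamma=1e-5, grid=5, **kwargs):
    """Largest p-value over a grid of shifts in [low, high], plus gamma."""
    deltas = [low + i * (high - low) / (grid - 1) for i in range(grid)]
    worst = max(perm_pvalue(y, cluster, w_cluster, d, **kwargs)
                for d in deltas)
    return min(1.0, worst + gamma)
```

In this sketch, the plug-in variant corresponds to calling `perm_pvalue` with the estimated shift, while `pt_ci` takes the confidence limits as inputs; with a 99.999% interval, `gamma` would be 0.00001.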

Table 1 shows the empirical type I error rates based on these simulations. Although the LR test relies on correct distributional assumptions, it maintains the correct size even when the error distribution is misspecified in our simulations. However, when $h (x)$ is non-linear, i.e. the conditional mean structure of the linear mixed model is misspecified, the type I error for the model-based LR test is inflated, with all but one of the type I error rates exceeding $0.064$ (the acceptable upper bound accounting for the Monte Carlo error over $1000$ iterations). For the permutation tests, we find that the empirical type I error rate can depend on the method for estimating the average treatment effect, $Δ$. Of note, the empirical type I error rate using the true $Δ$ is similar to that of the plug-in method, and both generally maintain the correct test size across the simulation scenarios we considered. This is in contrast to the observations in Ding et al.,¹⁴ who found that tests which use the plug-in average treatment effect estimator can be anti-conservative in individually randomized trials. The PT-CI method is generally conservative, with type I error rates below $0.036$ (the acceptable lower bound accounting for Monte Carlo error) in most scenarios, particularly when the number of clusters is limited ($K = 20$).
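The acceptable range quoted for the empirical type I error rate can be reproduced with a normal approximation to the binomial Monte Carlo error. The source does not state how the range was derived, so the short Python sketch below (with the hypothetical helper `monte_carlo_band`) is offered only as one calculation that yields the same bounds.

```python
import math


def monte_carlo_band(alpha=0.05, n_sim=1000, z=1.96):
    """Approximate 95% range for an empirical rejection rate whose true
    value is alpha, based on n_sim independent simulated trials."""
    se = math.sqrt(alpha * (1 - alpha) / n_sim)
    return alpha - z * se, alpha + z * se


low, high = monte_carlo_band()
print(round(low, 3), round(high, 3))  # 0.036 0.064
```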

Table 1.
Empirical type I error rate for proposed tests compared with the existing linear mixed model tests for testing the null hypothesis of no treatment effect heterogeneity at a 0.05 significance level for 1000 simulated CRTs with varying $n$ , $K$ , and distribution of $ϵ_{k i}$ .

				Type I error
$h (x)$	Distribution of $ϵ_{k i}$	$K$	$n$	PT-T	PT-PI	PT-CI	L-LRT
$0$	Normal	20	50	0.050	0.051	0.016	0.045
$0$	Normal	20	100	0.036	0.033	0.005	0.057
$0$	Normal	100	50	0.048	0.043	0.032	0.047
$0$	Normal	100	100	0.053	0.053	0.035	0.053
$0$	Log normal	20	50	0.046	0.055	0.014	0.056
$0$	Log normal	20	100	0.052	0.065	0.005	0.052
$0$	Log normal	100	50	0.044	0.045	0.031	0.056
$0$	Log normal	100	100	0.052	0.053	0.034	0.045
$2 x$	Normal	20	50	0.057	0.059	0.038	0.045
$2 x$	Normal	20	100	0.052	0.054	0.034	0.046
$2 x$	Normal	100	50	0.033	0.031	0.026	0.046
$2 x$	Normal	100	100	0.048	0.045	0.037	0.050
$2 x$	Log normal	20	50	0.041	0.039	0.022	0.052
$2 x$	Log normal	20	100	0.041	0.042	0.027	0.040
$2 x$	Log normal	100	50	0.055	0.055	0.044	0.051
$2 x$	Log normal	100	100	0.058	0.057	0.052	0.060
$x + cos (x)$	Normal	20	50	0.046	0.047	0.023	0.079
$x + cos (x)$	Normal	20	100	0.042	0.041	0.020	0.092
$x + cos (x)$	Normal	100	50	0.054	0.054	0.038	0.073
$x + cos (x)$	Normal	100	100	0.048	0.048	0.043	0.081
$x + cos (x)$	Log normal	20	50	0.048	0.051	0.020	0.061
$x + cos (x)$	Log normal	20	100	0.047	0.048	0.017	0.084
$x + cos (x)$	Log normal	100	50	0.047	0.044	0.031	0.084
$x + cos (x)$	Log normal	100	100	0.034	0.037	0.028	0.078

Data are generated from $Y_{k i} = h (X_{k i}) + W_{k} + α_{k} + ϵ_{k i}$ . The nominal type I error rate is 0.05, and the acceptable range for nominal type I error rate with 1000 replicates is (0.036, 0.064).

PT-T: Permutation test using the true average treatment effect; PT-PI: Permutation test using a plug-in estimator of the average treatment effect; PT-CI: Permutation test marginalizing over the confidence interval; L-LRT: Linear mixed model likelihood ratio test; $K$ : Number of clusters; $n$ : Constant cluster size.

3.2. Power

Next, we carry out simulations to compare the power to detect heterogeneous treatment effects for each of the above tests in CRTs. For each simulation iteration, we generate a synthetic CRT with $K \in \{20, 100\}$ clusters and $n_{k} = n = 50$ individuals per cluster. We then generate an individual-level effect modifier $X_{k i} \sim N (0, 1^{2})$ with covariate ICC $ρ_{x} = 0.01$. Then, based on $X_{k i}$, we generate the potential outcomes for each individual using the model-based formula $Y_{k i} (W_{k}) = 2 X_{k i} + (f (X_{k i}) + 1) W_{k} + α_{k} + ϵ_{k i}$ with $α_{k}$ independent of $ϵ_{k i}$, $α_{k} \sim N (0, σ_{α}^{2})$ such that the outcome ICC is $ρ_{y | x} = 0.1$, and $ϵ_{k i}$ either Normally distributed or $t_{10}$-distributed with mean 0 and standard deviation 0.1. We consider three plausible functional forms of $f (X_{k i})$: (1) a treatment effect which linearly increases with $X_{k i}$, (2) an oscillating function of $X_{k i}$, and (3) a parabolic treatment effect which is increasing for smaller $X_{k i}$ and decreasing for larger $X_{k i}$. The exact functional forms for each data generating model are described in Table 2.

Table 2.
Functional form of effect modification for simulation study. For all settings, the constant effect is $f (x) = 1$ .

Dimension of $X_{k i}$	Form	No. of Clusters	$f (x)$
1	Linear	20	$0.28 x$
1	Oscillating	20	$0.8 \cos (3 π^{- 1} x)$
1	Parabolic	20	$0.014 x + 0.25 x^{2}$
1	Linear	100	$0.168 x$
1	Oscillating	100	$0.48 \cos (3 π^{- 1} x)$
1	Parabolic	100	$0.0084 x + 0.15 x^{2}$
2	Linear	20	$0.196 x_{1} + 0.21 x_{2}$
2	Oscillating	20	$0.42 \cos (6 π^{- 1} x_{1})$
2	Parabolic	20	$0.01 x_{1} + 0.19 x_{1}^{2} + 0.10 x_{2}^{2}$
2	Linear	100	$0.112 x_{1} + 0.12 x_{2}$
2	Oscillating	100	$0.24 \cos (6 π^{- 1} x_{1})$
2	Parabolic	100	$0.006 x_{1} + 0.11 x_{1}^{2} + 0.06 x_{2}^{2}$
25	Linear	106	$0.011 age + 3.3 female$
25	Parabolic $^{a}$	106	$4 \cos (\frac{age - 60}{12})$
25	Interaction	106	$\begin{cases} 2, & \text{if } female = 1, depression/anxiety = 1 \\ - 2, & \text{if } female = 0, depression/anxiety = 1 \\ - 2, & \text{if } female = 1, depression/anxiety = 0 \\ 2, & \text{if } female = 0, depression/anxiety = 0 \end{cases}$

$^{a}$ The function $4 \cos ((age - 60) / 12)$ has a $24 π$ -year period, but near age 60, it appears parabolic because a small segment of a cosine function closely resembles a quadratic curve around its peak.
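The near-parabolic behavior noted in the footnote can be seen from a second-order Taylor expansion of the cosine around its peak at age 60. The expansion below is a standard approximation added for illustration and is not part of the original simulation specification.

```latex
4 \cos\left(\frac{age - 60}{12}\right)
  \approx 4 - 2 \left(\frac{age - 60}{12}\right)^{2}
  = 4 - \frac{(age - 60)^{2}}{72}
```

For ages close to 60, the effect modification therefore behaves like a downward-opening parabola with its maximum of 4 at age 60.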

We additionally consider multivariate scenarios with $p > 1$. First, we consider two effect modifiers, $X_{k i 1}$ and $X_{k i 2}$, both Normally distributed with mean 0, variance 1, and covariate ICC specified at $ρ_{x_{1}} = ρ_{x_{2}} = 0.01$. Based on $X_{k i 1}$ and $X_{k i 2}$, we generate outcomes for each individual using the model

$$Y_{k i} = 2 X_{k i 1} + 1.5 X_{k i 2} + (f (X_{k i 1}, X_{k i 2}) + 1) W_{k} + α_{k} + ϵ_{k i}$$

where $α_{k}$ is as described above and $ϵ_{k i} \sim N (0, {0.1}^{2})$. We consider three functional forms of $f (x_{1}, x_{2})$: (a) linearly increasing with both variables, (b) oscillating function of one variable, and (c) parabolic function of both variables.

Next, we consider simulation scenarios based on the real data example in Section 4, with the 25 candidate effect modifiers and 106 clusters of size 7 [2-12] (median [range]). Based on the observed values of the effect modifiers, $X_{k i}$, we generate outcomes for each individual using the model

$$Y_{k i} = 0.05 {age}_{k i} + f (X_{k i}) W_{k} + α_{k} + ϵ_{k i}$$

where $α_{k}$ is as described above and $ϵ_{k i} \sim N (0, 2^{2})$. Here we consider three non-null functional forms of $f (x)$: (a) linearly increasing with age and higher for women, (b) parabolic function of age, and (c) an interaction effect between sex and a diagnosis of depression or anxiety. Table 2 shows the exact functional forms for each data-generating model.

Finally, at the end of each simulation iteration, we calculate the test statistic and p-value for each of the three proposed approaches (PT-T, PT-PI, and PT-CI) and the existing L-LRT and NS4-LRT approaches. The smoothed likelihood ratio test (NS4-LRT) follows a similar approach to the L-LRT but incorporates a smoothing mechanism to improve stability and power in detecting treatment effect heterogeneity. Instead of modeling the covariates linearly, NS4-LRT represents all continuous covariates using natural splines with four degrees of freedom, allowing for flexible, nonlinear relationships with the outcome. The test compares a model that includes interactions between the treatment and the spline-transformed covariates with a model that only includes main effects. We consider two different test statistics (SK–S and GK–S) for each of PT-T, PT-PI, and PT-CI, totaling six different testing procedures. The specific procedures used for the permutation tests with the SK–S test statistic and the existing L-LRT approach are described in Section 3.1. For the permutation tests using the GK–S test statistic, we use the function gamm in the R package mgcv assuming Normally distributed error terms, a linear link function, and a thin plate regression spline to compute the prognostic-score adjusted outcomes.²⁸ For the confidence interval method, to potentially further improve the precision, we use the robust sandwich variance estimator based confidence interval from an independent generalized estimating equation fit adjusting for the nonlinear covariate effects using B-splines.

A total of $1000$ simulations are considered for each scenario. In each scenario, we compute the empirical power to detect treatment effect heterogeneity with a 0.05 significance level, defined as the proportion of simulation iterations in which the observed p-value is less than 0.05, for each method. The specific effects and degrees of treatment effect heterogeneity were chosen such that at least one method has power close to 80% in at least one of the scenarios. The linear effect modification trends correspond to a situation in which the LR test assumptions are met and we would expect this test to have high power. The oscillating and parabolic trends correspond to situations where the LR test assumptions are violated, and the flexible permutation testing methods may gain an advantage.

Tables 3 and 4 show the empirical type I error and power of the three proposed tests (PT-T, PT-PI, PT-CI) and the existing L-LRT and NS4-LRT to detect treatment effect heterogeneity for three different functional forms of effect measure modification at a significance level of 0.05. Table 3 includes simulation results for a single effect modifier. Type I error rates for all tests are well-controlled at the nominal 0.05 level, with the CI methods being slightly conservative as in Table 1. For Normally distributed errors, the model-based LR test always has the highest power when its assumptions are met, i.e. when the form of effect modification is linear. When its functional assumptions are not met, the power of the LR test becomes substantially lower than that of the GAMM-adjusted permutation test. As expected, the tests which use the GAMM-adjusted test statistics have similar or higher empirical power than their corresponding unadjusted counterparts, capitalizing on the information embedded in baseline covariates. Further, among the permutation tests, the confidence interval methods have slightly lower power than the plug-in and true average treatment effect methods. We note that the effect of covariate adjustment on the permutation test is more pronounced in the case of linear effect modification. Results were similar for non-Normally distributed errors, indicating that the model-based LR test is much more sensitive to misspecification of the conditional mean structure than to misspecification of the error distribution. The NS4-LRT, which incorporates spline-based effect modification modeling, achieves the highest power across all scenarios, including those where the true effect modification is nonlinear. Unlike the standard L-LRT, which suffers from power loss when functional forms are complex, the NS4-LRT maintains robust performance, effectively capturing complex, nonlinear effect modifications such as oscillatory and parabolic patterns.

Table 3.

Empirical power and type I error (T1E) for the three proposed tests (PT-T, PT-PI, PT-CI) compared with the existing L-LRT and NS4-LRT for testing the null hypothesis of no treatment effect heterogeneity at a 0.05 significance level for 1000 simulated CRTs with varying $K$ , and forms of effect modification.

		SK–S			GK–S			LRT
$K$	Form of $f (X_{k i})$	PT-T	PT-PI	PT-CI	PT-T	PT-PI	PT-CI	L	NS4
$ϵ_{k i} \sim N (0, {0.1}^{2})$
20	Constant (T1E)	0.045	0.047	0.029	0.036	0.058	0.030	0.049	0.041
20	Linear	0.357	0.359	0.290	0.892	0.926	0.878	1.000	1.000
20	Oscillating	0.795	0.794	0.747	0.996	0.999	0.995	0.176	1.000
20	Parabolic	0.658	0.661	0.601	0.975	0.980	0.963	0.281	1.000
100	Constant (T1E)	0.059	0.057	0.048	0.061	0.062	0.049	0.043	0.055
100	Linear	0.659	0.655	0.626	0.916	0.917	0.900	1.000	1.000
100	Oscillating	0.988	0.987	0.985	0.996	0.997	0.996	0.125	1.000
100	Parabolic	0.941	0.940	0.930	0.995	0.996	0.993	0.224	1.000
$3.5 ϵ_{k i} \sim t_{10}$
20	Constant (T1E)	0.045	0.045	0.029	0.026	0.050	0.016	0.051	0.042
20	Linear	0.327	0.322	0.278	0.935	0.960	0.922	1.000	1.000
20	Oscillating	0.770	0.771	0.728	0.993	0.996	0.992	0.191	1.000
20	Parabolic	0.636	0.631	0.578	0.987	0.988	0.980	0.294	1.000
100	Constant (T1E)	0.049	0.046	0.045	0.038	0.044	0.035	0.042	0.064
100	Linear	0.638	0.628	0.609	0.953	0.955	0.942	1.000	1.000
100	Oscillating	0.992	0.990	0.986	1.000	1.000	1.000	0.112	1.000
100	Parabolic	0.942	0.938	0.929	0.998	0.998	0.997	0.221	1.000

Data are generated from $Y_{k i} = 2 X_{k i} + (f (X_{k i}) + 1) W_{k} + α_{k} + ϵ_{k i}$ . For all scenarios, $ρ_{y | x} = 0.1$ .

SK–S: Shifted Kolmogorov–Smirnov test statistic; PT-T: Permutation test using the true average treatment effect; PT-PI: Permutation test using a plug-in estimator of the average treatment effect; PT-CI: Permutation test marginalizing over the confidence interval; GK–S: GAMM-adjusted Kolmogorov–Smirnov test statistic; L-LRT: Linear mixed model likelihood ratio test; NS4-LRT: Mixed model likelihood ratio test using splines; $K$ : Number of clusters.

In the presence of multiple effect modifiers (Table 4), type I error rates were well-controlled at the nominal 0.05 level for the permutation tests; however, when $p = 25$ , the LRT-based methods (L-LRT and NS4-LRT) exhibited inflated type I error rates, suggesting sensitivity to increasing degrees of freedom. The model-based LR tests have the highest power when the functional model form is correctly specified; however, when models are misspecified, their power is substantially lower than that of the GAMM-adjusted permutation test. As in the univariate case, adjusting for baseline covariates improves the power of the permutation tests in detecting treatment effect heterogeneity. The NS4-LRT, which incorporates splines for modeling effect modification, consistently achieves the highest power across most scenarios, particularly when the effect modification is nonlinear. However, when the true effect modification is an interaction between two variables, the NS4-LRT exhibits lower power, as this structure violates the assumed additivity of the interaction term in the model. This is expected, as the power of a test depends on the specific alternative hypothesis: when assumptions about this alternative hold, tests tailored to detect that specific form of interaction are generally more powerful than non-directional tests designed to test any deviation from the null.

4. An empirical application to the PPACT cluster randomized trial

The PPACT (Pain Program for Active Coping and Training) study was a parallel-arm CRT to evaluate non-pharmaceutical strategies for treatment of chronic pain (ClinicalTrials.gov: NCT02113592).²⁹ In response to the lack of research on such interventions for chronic pain, investigators aimed to compare cognitive behavioral therapy (CBT) embedded in primary care settings versus usual care for treating chronic pain among patients receiving long-term opioid therapy. The study was conducted from 2014 to 2016 at Kaiser Permanente health care systems in Georgia, Hawaii, and the Northwest. Participants included adults aged 18 and older with mixed chronic pain conditions receiving long-term opioid therapy. The primary outcome, self-reported pain impact as measured by the PEGS scale (pain intensity and interference with enjoyment of life, general activity, and sleep), was assessed quarterly over 12 months. For the purposes of this illustration, we focus on the embedded cross-sectional CRT comparing intervention to standard-of-care at 12 months.

At the 12-month visit, 816 patients in 106 clusters of primary care providers completed assessments. In the intervention arm, CBT clusters had a reduced mean PEGS score (5.52) compared with the standard-of-care arm (6.15), indicating a modest but sustained reduction in pain compared with usual care. A remaining question is whether summary measures are a sufficient representation of the effectiveness of CBT in the treatment of chronic pain. We use the proposed permutation test with 2000 Monte Carlo samples to assess the null hypothesis of no treatment effect heterogeneity and compare with the existing LR test based on a linear mixed model with prespecified effect modifiers. For the purposes of illustration, we consider all available baseline variables as potential effect modifiers and investigate whether there is treatment effect variation along any of these baseline covariates (or equivalently the existence of effect modification); see Table 5 for a complete list of baseline covariates.

Table 4.
Empirical power and type I error (T1E) for the three proposed tests (PT-T, PT-PI, PT-CI) compared with the existing L-LRT and NS4-LRT for testing the null hypothesis of no treatment effect heterogeneity at a 0.05 significance level for 1000 simulated CRTs with varying $p$ , $K$ , $n$ , and forms of effect modification.

				SK–S			GK–S			LRT
$p$	$K$	$n$ $^{*}$	Form of $f (X_{k i})$	PT-T	PT-PI	PT-CI	PT-T	PT-PI	PT-CI	L	NS4
2	20	50	Constant (T1E)	0.050	0.051	0.031	0.040	0.058	0.026	0.047	0.056
2	20	50	Linear	0.224	0.221	0.184	0.940	0.952	0.922	1.000	1.000
2	20	50	Oscillating	0.173	0.178	0.137	0.976	0.986	0.974	0.067	1.000
2	20	50	Parabolic	0.222	0.223	0.170	0.925	0.933	0.903	0.192	1.000
2	100	50	Constant (T1E)	0.044	0.041	0.033	0.044	0.050	0.039	0.051	0.056
2	100	50	Linear	0.417	0.408	0.373	0.884	0.885	0.869	1.000	1.000
2	100	50	Oscillating	0.325	0.318	0.293	0.946	0.950	0.939	0.050	1.000
2	100	50	Parabolic	0.378	0.381	0.344	0.911	0.914	0.898	0.131	1.000
25	106	7 [2-12]	Constant (T1E)	0.048	0.048	0.037	0.053	0.049	0.039	0.076	0.098
25	106	7 [2-12]	Linear	0.839	0.832	0.807	0.865	0.857	0.841	1.000	1.000
25	106	7 [2-12]	Parabolic	0.667	0.661	0.606	0.830	0.824	0.794	0.446	1.000
25	106	7 [2-12]	Interaction	0.902	0.899	0.877	0.968	0.966	0.957	0.474	0.432

$^{*}$ Cluster size represented as median [range] if nonconstant, constant value otherwise; SK–S: Shifted Kolmogorov–Smirnov test statistic; PT-T: Permutation test using the true average treatment effect; PT-PI: Permutation test using a plug-in estimator of the average treatment effect; PT-CI: Permutation test marginalizing over the confidence interval; GK–S: GAMM-adjusted Kolmogorov–Smirnov test statistic; L-LRT: Linear mixed model likelihood ratio test; NS4-LRT: Mixed model likelihood ratio test using splines; $K$ : Number of clusters.

Table 5.

p-Values assessing the presence of an interaction between given effect modifier and treatment for various linear mixed models.

	p-Values
Effect modifiers	Individual models	Combined model
Age, y	0.92	0.38
Sex	0.18	0.04
Receives disability benefits	0.47	0.60
Current smoker	0.61	0.51
BMI	0.60	0.58
Alcohol misuse	0.52	0.67
Drug misuse	0.22	0.28
Diabetes	0.85	0.62
Cardiovascular disorder	0.22	0.29
Hypertension	0.95	0.92
Chronic pulmonary disease	0.97	0.71
Anxiety or depression diagnosis	0.86	0.71
Back and/or neck pain diagnosis	0.63	0.75
General pain diagnosis	0.12	0.14
Limb/extremity pain, joint pain and/or arthritic disorders diagnosis	0.17	0.25
Neuropathy diagnosis	0.88	0.94
Abdominal and/or bowel pain diagnosis	0.66	0.99
Musculoskeletal chest pain diagnosis	0.82	0.34
Urogenital, pelvic and menstrual pain diagnosis	0.10	0.08
Headache diagnosis	0.29	0.19
Other painful condition diagnosis	0.21	0.36
Orofacial, ear, and/or temporomandibular disorder pain diagnosis	0.52	0.46
Fibromyalgia diagnosis	0.87	0.69
Average morphine milligram equivalents (MME) dose per day	0.67	0.10
Benzodiazepine dispensed in 6 months prior to randomization	0.62	0.62

For the individual models, a linear mixed model is fit for each row, with PEGS score as the outcome, a random cluster intercept, the main effects of both treatment and the effect modifier, and the interaction between the effect modifier and treatment. The p-value is from an LR test assessing if the interaction term is equal to 0. For the combined model, one linear mixed model is fit with PEGS score as the outcome, a random cluster intercept, the main effects of treatment and all 25 effect modifiers, and the interactions between each of the 25 effect modifiers and treatment. The p-values are from various LR tests assessing if each interaction term is equal to 0, separately.

As an exploratory analysis, we first fit separate linear mixed models to model the interaction effect between CBT and each of the 25 effect modifiers. Here, the p-values range from 0.10 to 0.97. We then fit one linear mixed model with all possible interactions and individually tested for each interaction within the combined model. Here, the p-values range from 0.04 to 0.99 (sex is the only effect modifier with $p < 0.05$ in either analysis). Such an analysis can only be exploratory as it is subject to the issue of multiple testing. For example, if an investigator uses a cutoff of 0.05, then the probability of at least one false positive with such a procedure is 0.92. Because of this, we caution against using this approach in practice.
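The inflation of the false positive probability under multiple testing can be illustrated with a short calculation. The sketch below assumes independent tests, which is an idealization; under that assumption, 50 tests at the 0.05 level (for instance, 25 interaction tests in each of the two analyses) give a value consistent with the 0.92 quoted above. The helper name `familywise_error` is hypothetical.

```python
def familywise_error(n_tests, alpha=0.05):
    """Probability of at least one false positive among n_tests
    independent tests, each carried out at level alpha."""
    return 1 - (1 - alpha) ** n_tests


print(round(familywise_error(50), 2))  # 0.92
print(round(familywise_error(25), 2))  # 0.72
```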

To formally test for treatment effect heterogeneity, we proceed with the proposed permutation tests. We consider the SK–S test statistic, which does not adjust for any covariates, and the GK–S test statistic, which adjusts for all the covariates in Table 5. The p-values for the permutation tests using the confidence interval method are 0.13 and 0.007 for the SK–S and GK–S statistics, respectively. The p-values for the permutation tests using the plug-in method are 0.03 and 0.005 for the SK–S and GK–S statistics, respectively. In sharp contrast, the p-values for the LR tests, which assess treatment effect variation by any of the effect modifiers using an omnibus test, are 0.59 (linear) and 0.41 (spline). Table 6 summarizes the results from each test. As expected, testing using the covariate-adjusted GK–S test statistic yields a smaller (and statistically significant at a 0.05 level) p-value compared to that of the SK–S test statistic, which emphasizes the importance of leveraging baseline covariates to improve precision when detecting treatment effect variation across individuals.

Table 6.

Results of motivating example for proposed permutation test optimized over a 99.999% confidence interval for the average treatment effect (PT-CI) or using a plug-in estimator for the average treatment effect (PT-PI) using two different test statistics (SK–S and GK–S) compared with LR tests based on mixed models with specified effect modifiers.

Test	Individual- or cluster-level null?	Need to specify effect modifiers?	Adjustment for baseline covariates?	p-Value
PT-PI (SK–S)	Individual	No	No	0.03
PT-CI (SK–S)	Individual	No	No	0.13
PT-PI (GK–S)	Individual	No	Yes	0.005
PT-CI (GK–S)	Individual	No	Yes	0.007
L-LRT	Individual	Yes	Yes	0.59
NS4-LRT	Individual	Yes	Yes	0.41
Ding-PI (SK–S)	Cluster	No	No	0.92
Ding-CI (SK–S)	Cluster	No	No	0.93
Ding-PI (RK–S)	Cluster	No	Yes	0.03
Ding-CI (RK–S)	Cluster	No	Yes	0.04

CI: Confidence interval; Ding: Method of Ding et al.¹⁴; GK–S: GAMM-adjusted Kolmogorov–Smirnov test statistic; L-LRT: Linear mixed model likelihood ratio test; NS4-LRT: Mixed model likelihood ratio test using splines; PI: Plug in; RK–S: Regression-adjusted Kolmogorov–Smirnov test statistic; PT: Permutation test; SK–S: Shifted Kolmogorov–Smirnov test statistic.

We additionally assess cluster-level treatment effect heterogeneity using the methods of Ding et al.¹⁴ For this permutation test, we consider the SK–S test statistic, which does not adjust for any covariates, and a linear regression-adjusted K–S test statistic (RK–S), which adjusts for all 25 available variables using a simple multiple linear regression. Instead of testing the individual-level null hypothesis, we test the cluster-level null hypothesis and use the cluster-level average treatment effect. We define the outcome as the mean PEGS score within each cluster and estimate the average treatment effect by computing the difference in mean cluster-level PEGS scores between the treatment and control groups. For the regression-adjusted test statistic, we use the residuals from a linear regression where the cluster mean is the outcome and the predictors are the cluster-level mean of each covariate. The p-values for the permutation tests using the confidence interval method are 0.93 and 0.04 for the SK–S and RK–S statistics, respectively. The p-values for the permutation tests using the plug-in method are 0.92 and 0.03 for the SK–S and RK–S statistics, respectively.
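The distinction between the cluster-level and individual-level average treatment effects can be made concrete with a small Python sketch. The function names and the toy data are illustrative assumptions, not values from the PPACT study; the point is only that averaging cluster means weights every cluster equally, whereas pooling individuals weights larger clusters more heavily.

```python
from collections import defaultdict
from statistics import mean


def cluster_means(values, cluster):
    """Average the individual values within each cluster."""
    grouped = defaultdict(list)
    for v, c in zip(values, cluster):
        grouped[c].append(v)
    return {c: mean(vs) for c, vs in grouped.items()}


def cluster_level_effect(y, cluster, w_cluster):
    """Difference in mean cluster-level outcomes between arms."""
    means = cluster_means(y, cluster)
    treated = [m for c, m in means.items() if w_cluster[c] == 1]
    control = [m for c, m in means.items() if w_cluster[c] == 0]
    return mean(treated) - mean(control)


def individual_level_effect(y, cluster, w_cluster):
    """Difference in pooled individual-level means between arms."""
    treated = [v for v, c in zip(y, cluster) if w_cluster[c] == 1]
    control = [v for v, c in zip(y, cluster) if w_cluster[c] == 0]
    return mean(treated) - mean(control)


# Toy data: three clusters of unequal size, one of them treated.
y = [1.0, 3.0, 2.0, 2.0, 2.0, 10.0]
cluster = [0, 0, 1, 1, 1, 2]
w_cluster = {0: 1, 1: 0, 2: 0}
print(cluster_level_effect(y, cluster, w_cluster))     # -4.0
print(individual_level_effect(y, cluster, w_cluster))  # -2.0
```

With unequal cluster sizes the two estimands can differ, which is why the cluster-level and individual-level null hypotheses are treated separately.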

In this setting, the LR tests did not detect any treatment effect heterogeneity based on 25 effect modifiers at the 0.05 level, whereas both cluster-level (only significant after adjustment for covariates) and individual-level permutation tests did detect the existence of treatment effect heterogeneity at the 0.05 level. Although permutation tests do not pinpoint the precise source of treatment effect heterogeneity, they highlight the need for further data-driven exploration of subgroups with enhanced treatment effects. To explore which covariates contribute most to treatment effect heterogeneity, we adopt a leave-one-out variant of the treatment effect variable importance measure (TE-VIM) as introduced by Hines et al.³⁰ In this approach, we begin by fitting a flexible spline model (as in NS4-LRT) that includes interaction terms between treatment and all baseline covariates. We then iteratively refit the model, each time removing a single treatment–covariate interaction. After each refit, we re-estimate the variation in the conditional average treatment effect (CATE). Comparing the CATE variation from the full model to that from the reduced models allows us to assess how much each interaction contributes to overall treatment effect heterogeneity. This leave-one-out TE-VIM procedure helps identify key effect modifiers and enhances interpretability of heterogeneity in personalized treatment effects. Figure 1 displays the percent change in the estimated variation in the treatment effect when each interaction term is removed. Larger deviations indicate variables that contribute more to treatment effect heterogeneity. The results suggest that BMI, age, general pain diagnosis, morphine dose per day, and sex have the largest influence on the estimated treatment effect variation, while the impact from other variables appears relatively smaller.
More broadly, flexible semi-parametric and non-parametric methods have been developed to identify subgroups and tailor treatment recommendations in individually randomized trials.^13,31,32 The extension of these methods to CRTs is an active area of research and will be considered in future work.
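The leave-one-out comparison can be illustrated with a short Python sketch. It makes simplifying assumptions that differ from the analysis reported here: a linear working model fitted by ordinary least squares replaces the natural spline mixed model, clustering is ignored, and the function names `cate_variance` and `loo_importance` are hypothetical.

```python
import numpy as np

def cate_variance(y, z, X, keep):
    """Fit y ~ 1 + X + z + z * X[:, keep] by least squares and return the
    sample variance of the fitted conditional average treatment effect."""
    n, p = X.shape
    interactions = X[:, keep] * z[:, None]
    design = np.column_stack([np.ones(n), X, z, interactions])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    main_effect = coef[p + 1]          # coefficient on z
    gamma = coef[p + 2:]               # coefficients on z * X[:, keep]
    cate = main_effect + X[:, keep] @ gamma
    return cate.var()

def loo_importance(y, z, X):
    """Percent change in CATE variance when each interaction is dropped."""
    p = X.shape[1]
    full = cate_variance(y, z, X, list(range(p)))
    change = {}
    for j in range(p):
        keep = [k for k in range(p) if k != j]
        reduced = cate_variance(y, z, X, keep)
        change[j] = 100.0 * (reduced - full) / full
    return change
```

A large negative value for covariate `j` indicates that most of the fitted effect variation disappears when its interaction with treatment is removed.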

Figure 1.

The percent change in the variation in treatment effect estimates when each interaction term is removed (leave-one-out treatment effect variable importance measure; LOO TE-VIM). Larger deviations indicate variables that contribute more to treatment effect heterogeneity.

5. Discussion

As interest grows in assessing differential treatment effectiveness in subgroups, we propose new permutation testing methods for detecting treatment effect heterogeneity in CRTs. Standard methods for testing for treatment effect heterogeneity in CRTs typically involve specifying effect modifiers and using linear mixed models or marginal models to perform a model-based test for the appropriate parametric interaction terms. However, these tests require a priori specification of effect modifiers, rely on parametric assumptions which may not be met in practice, and may have low power to detect non-linear effects. Our new methods improve upon this existing testing paradigm by circumventing the need for effect modifier specification and the need for specifying the functional forms of the interactions. In simulation studies, the proposed permutation test maintains desired type I error rates for various error distributions and baseline covariate effects. Compared with the model-based LR test, the GAMM-adjusted permutation test has higher empirical power when the functional form of effect modification is non-linear. Finally, we illustrate these methods using real data from the PPACT study.

In practice, when effect modifiers have an unknown form, we recommend using the proposed permutation test to detect treatment effect heterogeneity. The methods of Ding et al.¹⁴ are appropriate for a cluster-level null hypothesis and our new developments are appropriate for an individual-level null hypothesis. These permutation methods require marginalization over a range of the average treatment effect estimates. Marginalizing over a wide confidence interval for the average treatment effect will ensure test validity, while there are no theoretical guarantees for the plug-in method.¹¹ Consistent with the theory, Ding et al.¹⁴ found the plug-in method to have an inflated type I error rate in simulation studies. In our simulations for individual-level heterogeneity, however, the plug-in method maintains empirical validity and the confidence interval method is overly conservative. Furthermore, we recommend employing a GAMM to leverage baseline variables, which can improve statistical power for detecting heterogeneous treatment effects.³³
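The contrast between the two approaches can be made concrete with a small Python sketch of the confidence interval method, which follows the general recipe of maximizing the p-value over a confidence set for the nuisance parameter and adding the miscoverage level.²⁰ The function name `ci_method_pvalue`, the grid size, and the default `gamma = 0.001` are illustrative assumptions, and `pvalue_at` stands for any user-supplied routine that returns the permutation p-value at a fixed value of the average treatment effect.

```python
import numpy as np

def ci_method_pvalue(pvalue_at, tau_lower, tau_upper, gamma=0.001, n_grid=21):
    """Maximize the permutation p-value over a (1 - gamma) confidence
    interval for the average treatment effect, then add gamma."""
    grid = np.linspace(tau_lower, tau_upper, n_grid)
    p_max = max(pvalue_at(tau) for tau in grid)
    return min(1.0, p_max + gamma)   # a p-value cannot exceed 1
```

The plug-in method corresponds to evaluating `pvalue_at` only at the point estimate. Because the confidence interval method takes a maximum over the grid and adds `gamma`, it can never yield a smaller p-value than the plug-in method whenever the point estimate lies on the grid, which is consistent with its conservative behavior in the simulations.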

The proposed method provides an omnibus test for the null hypothesis that the treatment effect is constant across individual participants in a CRT and serves as a first step in searching for subgroup effects. If the overall test is statistically significant at a pre-specified level, a natural next step would be to identify subgroups with enhanced treatment effects to inform personalized interventions. To this end, model-robust methods for detecting enhanced treatment effects have been proposed in individually randomized trials,³⁰,³¹,³⁴ and may be worthwhile to generalize to accommodate the clustered data structure in CRTs. In particular, flexible exploratory procedures that assess the marginal contribution of each covariate to heterogeneity, such as the leave-one-out treatment effect variable importance measures used in our application, can help identify the most influential modifiers and guide future hypothesis generation.³⁰ We view such analyses as valuable complements to the proposed global testing framework, and formalization of these strategies for CRTs will be pursued in future work.

Although we have developed the permutation tests in the context of two-arm CRTs, they can be adapted to multi-arm CRTs. With three or more arms, the null must be carefully specified.³⁵ Potential options include simultaneously testing whether there is treatment effect heterogeneity between any arms, and pairwise hypotheses assessing heterogeneous effects of each treatment with respect to usual care. In both settings, multiple average treatment effect values may need to be estimated. This would necessitate the estimation of more nuisance parameters and may require more intensive computations to execute the permutation procedure. For testing treatment effect heterogeneity between each intervention arm and usual care separately, one possible solution is to consider mini sub-trials nested within the CRT and apply our methods directly. For testing treatment effect variation across all arms simultaneously, we can consider the trial in its entirety and construct a permutation distribution consistent with the given allocation scheme, with appropriate specifications of the test statistics. In this latter case, rejecting the null does not provide information about the source of the treatment effect heterogeneity, and additional analyses to identify arms that contribute to treatment effect variation would be necessary.
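The sub-trial idea for pairwise comparisons against usual care might be organized as in the following Python sketch. It is only a schematic: `pairwise_heterogeneity_tests` is a hypothetical helper, `test_fn` is a placeholder for any two-arm heterogeneity test that takes outcomes, a binary treatment indicator, and cluster labels, and no adjustment for multiple comparisons is shown.

```python
import numpy as np

def pairwise_heterogeneity_tests(y, arm, cluster, test_fn, control=0):
    """Apply a two-arm test to each intervention arm versus the control arm.

    y, arm, and cluster are individual-level arrays of equal length;
    test_fn(y_sub, z_sub, cluster_sub) is supplied by the user.
    """
    results = {}
    for a in np.unique(arm):
        if a == control:
            continue
        # Sub-trial: individuals in arm `a` or in the control arm only
        mask = (arm == a) | (arm == control)
        z_sub = (arm[mask] == a).astype(int)
        results[a.item()] = test_fn(y[mask], z_sub, cluster[mask])
    return results
```

Each sub-trial then requires its own estimate of the average treatment effect, which is the additional nuisance parameter burden noted above.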

It is possible to further extend the permutation testing methods to stepped wedge CRTs, where clusters are randomized into distinct treatment sequences and are repeatedly assessed over time.⁷ However, due to the staggered roll-out of treatment, the definition of treatment effect variation is naturally more complicated because the average treatment effects can vary across both exposure time and subpopulations defined by baseline covariates.³⁶ Maleyeff et al.³⁷ proposed permutation tests specifically for addressing exposure-time specific treatment effect heterogeneity in stepped wedge CRTs. However, investigators may be interested in testing for additional subgroup differences. As in multi-arm CRTs, careful consideration is needed when adapting our methods to stepped wedge CRTs. A full development of this testing framework will be pursued in a separate work.

Footnotes

Acknowledgments

Lara Maleyeff is supported by a trainee award from the Canadian Network for Statistical Training in Trials (CANSTAT) funded by Canadian Institutes of Health Research (CIHR) grant #262556. Research in this article was in part supported by the National Institute of Allergy and Infectious Diseases of the National Institutes of Health (NIH) T32 AI007358 and R01 AI136947, and a Patient-Centered Outcomes Research Institute Award® (PCORI® Award ME-2020C3-21072). The statements presented in this article are solely the responsibility of the authors and do not necessarily represent the views of the NIH, PCORI® or its Board of Governors or Methodology Committee.

Data availability

The R code to execute the proposed permutation test and to reproduce the results of this paper can be found at https://github.com/laramaleyeff1/permutation_teh_crt. The data that were used to illustrate the proposed approach in Section 4 are freely available at .

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Ethical approval

Not applicable.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Informed consent

Not applicable.

ORCID iDs

Lara Maleyeff

Fan Li

Sebastien Haneuse

Rui Wang

References

1. Murray, et al. Design and analysis of group-randomized trials, volume 29. Oxford University Press, USA, 1998. https://www.worldcat.org/search?q=no%3A37024813.
2. Turner, Gallis, et al. Review of recent methodological developments in group-randomized trials: Part 1—design. Am J Public Health 2017; 107: 907–915.
3. Yang, Starks, et al. Sample size requirements for detecting treatment effect heterogeneity in cluster randomized trials. Stat Med 2020; 39: 4218–4237.
4. Foster, Nan, Shen, et al. Permutation testing for treatment–covariate interactions and subgroup identification. Stat Biosci 2016; 8: 77–98.
5. Dunn. Multiple comparisons among means. J Am Stat Assoc 1961; 56: 52–64.
6. Braun, Feng. Optimal permutation tests for the analysis of group randomized trials. J Am Stat Assoc 2001; 96: 1424–1432.
7. Wang, De Gruttola. The use of permutation tests for the analysis of parallel and stepped-wedge cluster-randomized trials. Stat Med 2017; 36: 2831–2843.
8. Lokhnygina, Murray, et al. An evaluation of constrained randomization for the design and analysis of group-randomized trials. Stat Med 2016; 35: 1565–1579.
9. Rabideau, Wang. Randomization-based confidence intervals for cluster randomized trials. Biostatistics 2021; 22: 913–927.
10. Watson, Akinyemi, Hemming. Permutation-based multiple testing corrections for p-values and confidence intervals for cluster randomized trials. Stat Med 2023; 42: 3786–3803.
11. Wang, Schoenfeld, Hoeppner, et al. Detecting treatment–covariate interactions using permutation methods. Stat Med 2015; 34: 2035–2047.
12. Wolf, Koopmeiners, Vock. A permutation procedure to detect heterogeneous treatment effects in randomized clinical trials while controlling the type I error rate. Clin Trials 2022; 19: 512–521.
13. Foster, Taylor, Ruberg. Subgroup identification from randomized clinical trial data. Stat Med 2011; 30: 2867–2880.
14. Ding, Feller, Miratrix. Randomization inference for treatment effect variation. J R Stat Soc Ser B 2016; 78: 655–671.
15. Chung, Olivares. Permutation test for heterogeneous treatment effects with a nuisance parameter. J Econom 2021; 225: 148–174.
16. Imbens, Rubin. Causal inference in statistics, social, and biomedical sciences. New York, NY: Cambridge University Press, 2015.
17. Kahan, Copas, et al. Estimands in cluster-randomized trials: choosing analyses that answer the right question. Int J Epidemiol 2023; 52: 107–118.
18. Wang, Harhay, Small, et al. On the mixed-model analysis of covariance in cluster-randomized trials. arXiv preprint arXiv:2112.00832, 2021.
19. Weinberg, Lagakos. Asymptotic behavior of linear permutation tests under general alternatives, with application to test selection and study design. J Am Stat Assoc 2000; 95: 596–607.
20. Berger, Boos. P values maximized over a confidence set for the nuisance parameter. J Am Stat Assoc 1994; 89: 1012–1016.
21. Liang, Zeger. Longitudinal data analysis using generalized linear models. Biometrika 1986; 73: 13–22.
22. Wang, Park, Small, et al. Model-robust and efficient covariate adjustment for cluster-randomized experiments. J Am Stat Assoc 2023; 119: 2959–2971.
23. Ding. Model-assisted analyses of cluster-randomized experiments. J R Stat Soc Ser B 2021; 83: 994–1015.
24. Chen. Generalized additive mixed models. Commun Stat Theory Methods 2000; 29: 1257–1271.
25. Hansen. The prognostic analogue of the propensity score. Biometrika 2008; 95: 481–488.
26. Verbeke, Lesaffre. The effect of misspecifying the random-effects distribution in linear mixed models for longitudinal data. Comput Stat Data Anal 1997; 23: 541–556.
27. Murray, Blitstein. Methods to reduce the impact of intraclass correlation in group-randomized trials. Eval Rev 2003; 27: 79–103.
28. Wood. Thin plate regression splines. J R Stat Soc Ser B 2003; 65: 95–114.
29. DeBar, Mayhew, Benes, et al. A primary care-based cognitive behavioral therapy intervention for long-term opioid users with chronic pain: a randomized pragmatic trial. Ann Intern Med 2022; 175: 46–55.
30. Hines, Diaz-Ordaz, Vansteelandt. Variable importance measures for heterogeneous causal effects. arXiv preprint arXiv:2204.06030, 2022.
31. Fan, Song. Change-plane analysis for subgroup detection and sample size calculation. J Am Stat Assoc 2017; 112: 769–778.
32. Sivaganesan, Müller, Huang. Subgroup finding via Bayesian additive regression trees. Stat Med 2017; 36: 2391–2403.
33. Yang, Thomas, et al. Covariate adjustment in subgroup analyses of randomized clinical trials: a propensity score approach. Clin Trials 2021; 18: 570–581.
34. Lundberg, Lee. A unified approach to interpreting model predictions. In: Advances in neural information processing systems (NeurIPS), volume 30, pp.4765–4774.
35. Zhou, Turner, Simmons, et al. Constrained randomization and statistical inference for multi-arm parallel cluster randomized controlled trials. Stat Med 2022; 41: 1862–1883.
36. Chen, Tian, et al. Planning stepped wedge cluster randomized trials to detect treatment effect heterogeneity. Stat Med 2023; 43: 890–911.
37. Maleyeff, Haneuse, et al. Assessing exposure-time treatment effect heterogeneity in stepped-wedge cluster randomized trials. Biometrics 2023; 79: 2551–2564.

				Type I error
$h(x)$	Distribution of $\epsilon_{ki}$	$K$	$n$	PT-T	PT-PI	PT-CI	L-LRT
$0$	Normal	20	50	0.050	0.051	0.016	0.045
$0$	Normal	20	100	0.036	0.033	0.005	0.057
$0$	Normal	100	50	0.048	0.043	0.032	0.047
$0$	Normal	100	100	0.053	0.053	0.035	0.053
$0$	Log normal	20	50	0.046	0.055	0.014	0.056
$0$	Log normal	20	100	0.052	0.065	0.005	0.052
$0$	Log normal	100	50	0.044	0.045	0.031	0.056
$0$	Log normal	100	100	0.052	0.053	0.034	0.045
$2x$	Normal	20	50	0.057	0.059	0.038	0.045
$2x$	Normal	20	100	0.052	0.054	0.034	0.046
$2x$	Normal	100	50	0.033	0.031	0.026	0.046
$2x$	Normal	100	100	0.048	0.045	0.037	0.050
$2x$	Log normal	20	50	0.041	0.039	0.022	0.052
$2x$	Log normal	20	100	0.041	0.042	0.027	0.040
$2x$	Log normal	100	50	0.055	0.055	0.044	0.051
$2x$	Log normal	100	100	0.058	0.057	0.052	0.060
$x + \cos(x)$	Normal	20	50	0.046	0.047	0.023	0.079
$x + \cos(x)$	Normal	20	100	0.042	0.041	0.020	0.092
$x + \cos(x)$	Normal	100	50	0.054	0.054	0.038	0.073
$x + \cos(x)$	Normal	100	100	0.048	0.048	0.043	0.081
$x + \cos(x)$	Log normal	20	50	0.048	0.051	0.020	0.061
$x + \cos(x)$	Log normal	20	100	0.047	0.048	0.017	0.084
$x + \cos(x)$	Log normal	100	50	0.047	0.044	0.031	0.084
$x + \cos(x)$	Log normal	100	100	0.034	0.037	0.028	0.078

				SK–S			GK–S			LRT
$p$	$K$	$n^{*}$	Form of $f(X_{ki})$	PT-T	PT-PI	PT-CI	PT-T	PT-PI	PT-CI	L	NS4
2	20	50	Constant (T1E)	0.050	0.051	0.031	0.040	0.058	0.026	0.047	0.056
2	20	50	Linear	0.224	0.221	0.184	0.940	0.952	0.922	1.000	1.000
2	20	50	Oscillating	0.173	0.178	0.137	0.976	0.986	0.974	0.067	1.000
2	20	50	Parabolic	0.222	0.223	0.170	0.925	0.933	0.903	0.192	1.000
2	100	50	Constant (T1E)	0.044	0.041	0.033	0.044	0.050	0.039	0.051	0.056
2	100	50	Linear	0.417	0.408	0.373	0.884	0.885	0.869	1.000	1.000
2	100	50	Oscillating	0.325	0.318	0.293	0.946	0.950	0.939	0.050	1.000
2	100	50	Parabolic	0.378	0.381	0.344	0.911	0.914	0.898	0.131	1.000
25	106	7 [2-12]	Constant (T1E)	0.048	0.048	0.037	0.053	0.049	0.039	0.076	0.098
25	106	7 [2-12]	Linear	0.839	0.832	0.807	0.865	0.857	0.841	1.000	1.000
25	106	7 [2-12]	Parabolic	0.667	0.661	0.606	0.830	0.824	0.794	0.446	1.000
25	106	7 [2-12]	Interaction	0.902	0.899	0.877	0.968	0.966	0.957	0.474	0.432

Permutation tests for detecting treatment effect heterogeneity in cluster randomized trials
