Abstract

There are multiple possible cluster randomised trial designs that vary in when the clusters cross between control and intervention states, when observations are made within clusters, and how many observations are made at each time point. Identifying the most efficient study design is complex though, owing to the correlation between observations within clusters and over time. In this article, we present a review of statistical and computational methods for identifying optimal cluster randomised trial designs. We also adapt methods from the experimental design literature for experimental designs with correlated observations to the cluster trial context. We identify three broad classes of methods: using exact formulae for the treatment effect estimator variance for specific models to derive algorithms or weights for cluster sequences; generalised methods for estimating weights for experimental units; and, combinatorial optimisation algorithms to select an optimal subset of experimental units. We also discuss methods for rounding experimental weights, extensions to non-Gaussian models, and robust optimality. We present results from multiple cluster trial examples that compare the different methods, including determination of the optimal allocation of clusters across a set of cluster sequences and selecting the optimal number of single observations to make in each cluster-period for both Gaussian and non-Gaussian models, and including exchangeable and exponential decay covariance structures.

Keywords

Cluster randomised trial, optimal experimental design, generalised linear mixed model

1. Introduction

The cluster randomised trial is an increasingly popular experimental study design. It is used to evaluate interventions applied to groups of people, like classrooms, clinics, or villages, or when the outcome for one individual in the group depends on the outcomes for the others, as is the case with infectious diseases, for example.^1,2 The design of a cluster trial involves the specification of all aspects of the study, many of which are determined by practical, ethical, and contextual restrictions. However, one major aspect of cluster trial design that can be resolved, or at least supported, with statistical analysis is the sample size of individuals and clusters, when observations are captured from the clusters and individuals, and when each cluster receives the intervention(s).

From both an ethical and practical standpoint, minimising the number of individuals, clusters, or observations required to achieve an inferential goal is highly desirable. For cluster trials, inferences are almost always based on the variance of the estimator of the treatment effect. Designs that minimise the variance of a specific parameter, or combination of parameters, are said to be ‘c-optimal’. However, for any particular design problem, enumerating all the different possible designs and their associated variances to identify the c-optimal design is often impossible, given the number of possible variants. Therefore, we use algorithms that can quickly identify an efficient, or ‘optimal’, design. The correlation of outcomes within clusters, over time, and potentially within-individuals over time, makes the analysis of the efficiency of a study design more complex though, over and above individual-level studies with independent observations.^3–5

Recent methodological advances in the optimal experimental design literature make it possible to identify optimal designs in studies with correlated observations, and several recent studies have examined the problem specifically for certain types of cluster randomised trial. In this article, we review the literature on optimal cluster randomised trial designs, and we review and translate more general methods and algorithms from the broader literature to this context. We present results for different cluster trial design scenarios using a range of methods, both to illustrate the use of the different approaches and to identify optimal cluster trial designs for a range of contexts.

2. What is the optimal cluster trial problem?

There are multiple types of optimality in the experimental design literature, which are referenced using an ‘alphabet’ system of letters.^6,7 The primary objective of a cluster randomised trial is almost always to provide an estimate of the treatment effect of an intervention and an associated measure of uncertainty for one or more outcomes. Other parameters in the statistical model, such as the covariance parameters or intraclass correlation coefficient, are not of primary interest. For example, the predominant method used to justify the sample size within a particular study design is the power for a null hypothesis significance test of the treatment effect parameter.^3,4 Thus, efficiency and optimality in this setting relate to minimising the variance of the treatment effect estimator, which is c-optimality.

We now introduce concepts and notation to describe the methods to identify c-optimal cluster trial designs. Unless otherwise stated, time is modelled discretely where there are repeated measures. Approximations can be made to a continuous time model by finely discretising time within this framework; using a regular grid over a continuous space is a common strategy in optimal design work.⁸ We represent matrices using capital letters, for example $X$ . We use a subscript to denote submatrices or elements of a matrix, in particular, we notate $X_{A}$ as the rows of $X$ in set $A$ with all the columns, and $Σ_{A}$ as the sub-matrix of $Σ$ with rows and columns in $A$ . Lower case letters represent scalars and so $X_{i, j}$ indicates the element of $X$ in the $i$ th row and the $j$ th column.

We assume there are $N$ possible observations that could be made as part of the study. An observation consists of a single ‘design point’ that generates an outcome datum. Often observations are grouped into higher-level units around which the study is designed. For cluster trials with cross-sectional sampling each observation will be a unique individual grouped into cluster-periods and then clusters. For cohort designs, observations will be grouped within an individual trial participant and then in clusters. We refer to an experimental unit as the smallest indivisible set of observations for the design problem. For example, we may wish to choose which whole cluster sequences to include, so the cluster sequence is the experimental unit. Equally, we may select cluster-periods, if we do not need to include all time periods within a cluster, or indeed which specific observations. The set of all experimental units is the design space. To simplify matters, we assume each observation is in one and only one experimental unit. However, experimental units in the design space need not be unique. For example, in the absence of individual-level covariates, an observation made in a cluster at a given time will have identical values of the fixed effect parameters to all other observations in the same cluster-period.

The top left panel of Figure 1 (Design Space A) represents a cluster randomised trial design space. This design space includes the following restrictions:

i. No reversibility, that is, clusters can only cross from control to intervention states.
ii. There must be contemporaneous comparison in at least one time period, that is, a before-and-after design would not be permitted since it would not include a randomised comparison.

Each row indicates a cluster sequence, and each column a discrete time interval or period. Within each cell, there are a pre-specified number of observations. Each row in the diagram is illustrated only once, but there may be multiple repeats of the same row in the design space depending on the formulation of the problem. Where there are multiple instances of the same cluster sequence, the design space includes the most common types of cluster randomised trial design with repeated measures: a parallel design, in which a cluster receives the intervention in all periods or the control in all periods, or a stepped-wedge cluster randomised trial, where the intervention roll out is staggered such that all clusters start in the control condition and then one or more clusters receives the intervention in each time period until all clusters are in the intervention state. A ‘hybrid’ design consists of a mix of parallel and stepped-wedge cluster sequences, and a ‘staircase’ design includes only the cluster-periods on the diagonal. Figure 1 illustrates these designs. Given that this design space incorporates some of the most widely used cluster randomised trial designs, and that these restrictions reflect common real-world limitations, it is an obvious choice for many applications. However, more complex design spaces are required to allow for alternative designs like the cluster cross-over. Design Space B in Figure 1 is one such design space; it removes the no reversibility restriction. In these cases, the cross-over design is almost always the optimal design.⁹
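To make the structure of Design Space A concrete, the following minimal sketch enumerates its rows and checks restriction (ii) for a few designs. The function names `design_space_a` and `has_contemporaneous_comparison` are hypothetical and introduced only for illustration; the sketch assumes the design space contains the all-control row, the all-intervention row, and every one-way switch in between.

```python
def design_space_a(T):
    """All one-way sequences over T periods: 0 = control, 1 = intervention.

    Row s has its last s periods in the intervention state, so s = 0 is the
    all-control row and s = T is the all-intervention row (T + 1 rows in total).
    """
    return [[0] * (T - s) + [1] * s for s in range(T + 1)]


def has_contemporaneous_comparison(design):
    """True if at least one period contains both control and intervention clusters."""
    n_periods = len(design[0])
    return any(len({row[t] for row in design}) == 2 for t in range(n_periods))


T = 6
space = design_space_a(T)
parallel = [space[0], space[T]]      # all-control and all-intervention rows
stepped_wedge = space[1:T]           # rows that start in control and end in intervention
before_after = [space[3], space[3]]  # every cluster switches at the same time

for name, design in [("parallel", parallel), ("stepped-wedge", stepped_wedge),
                     ("before-and-after", before_after)]:
    print(name, has_contemporaneous_comparison(design))
```

Under these assumptions a design is simply a list of rows, possibly with repeats, drawn from `space`.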

Figure 1.
Examples of cluster trial design spaces and study designs for six time periods. Each row represents a cluster, or cluster sequence, and may be repeated more than once in the design space. Each cell is a cluster-period and contains one or more individual potential observations. Design Space A encodes a no reversibility assumption and includes contemporaneous comparisons. Design Space B allows for both addition and removal of the intervention over time. Parallel, stepped-wedge, hybrid, and staircase are all designs within both design spaces.

Figure 2.
Optimal study designs with 10 clusters and six time periods for different values of the ICC and CAC using a linear mixed model with EXC2 covariance structure with $m = 10$ individuals per cluster-period. ‘Combin’ are results from the combinatorial local search run 100 times and selecting the best design, ‘G-H’ are results using the method from Girling and Hemming, and ‘Weight’ are designs produced by estimating experimental unit weights. The number is the estimator variance from the design. ICC: intra-cluster correlation coefficient; CAC: cluster autocorrelation coefficient.

We base our analyses around a generalised linear mixed model (GLMM). For outcome vector $y$ :
$\begin{aligned} y & \sim F (μ, σ) \\ μ & = h^{- 1} (η) \\ η & = X β + Z u \\ u & \sim N (0, D) \end{aligned}$
(1)
where $F$ is a statistical distribution with mean $μ$ and scale parameter $σ$, and $h$ is a link function, so that $h^{- 1}$ maps the linear predictor $η$ to the mean. We assume that the distribution $F$ is in the exponential family. The matrix $X$ is the design matrix of fixed effects, the matrix $Z$ is the design matrix for the random effects $u$, and $D$ is the covariance matrix of the random effects. We discuss below the specification of $X$, $Z$, and $D$.

We assume that the matrices $X$ and $Z$ have $N$ rows and so contain all of the observations in the design space. There are $J$ experimental units, which we denote as $E_{j}$ for $j = 1, \dots, J$, where each $E_{j} \subset \{1, \dots, N\}$ contains a subset of the rows. We also denote the design space as $D := \{E_{j} : j = 1, \dots, J\}$ and a specific design as $d \subset D$. Our aim is to identify the ‘optimal’ design $d^{*}$ of size $m < J$ by selecting the most efficient set of $m$ experimental units from $D$.

Most experimental design criteria are based on the Fisher information matrix. For the GLMM above, the information matrix for the generalised least squares estimator, the best linear unbiased estimator, for a particular design is:
$M_{d} = X_{d}^{T} Σ_{d}^{- 1} X_{d}$
(2)
where $Σ$ is the covariance matrix of the observations $y$ . Then the c-optimal design criterion is:
$f (d) = \begin{cases} c^{T} M_{d}^{- 1} c & \text{if } M_{d} \text{ is positive definite} \\ \infty & \text{otherwise} \end{cases}$
(3)
where $c$ is a vector consisting of zeroes except for a one in the position of the treatment effect parameter. For some designs, such as if there were no observations in the treatment condition, $M_{d}$ would be singular rather than positive definite, and so we define the variance as infinite, that is, the design provides no information on the parameter. The formal design problem is then to find $d^{*} \subset D$ that minimises $f$ such that $| d^{*} | = m < J$, that is, $d^{*} = \arg\min_{d} f (d)$.

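As an illustration of criterion (3), the sketch below evaluates $c^{T} M_{d}^{- 1} c$ for a toy problem with two clusters, two periods, and one observation per cluster-period. Everything here is an assumption made for the example: the helper names `solve` and `c_optimality`, the covariance values, and the choice of period indicators plus a treatment indicator as the columns of $X$. A singular information matrix is reported as an infinite variance.

```python
from math import inf


def solve(A, B, tol=1e-12):
    """Solve A Z = B by Gauss-Jordan elimination; return None if A is singular."""
    n = len(A)
    aug = [[float(v) for v in A[i]] + [float(v) for v in B[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < tol:
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def c_optimality(X, Sigma, rows, c):
    """Return c' M_d^{-1} c for the design made of the given rows, or inf."""
    Xd = [X[i] for i in rows]
    Sd = [[Sigma[i][j] for j in rows] for i in rows]
    W = solve(Sd, Xd)                      # Sigma_d^{-1} X_d
    if W is None:
        return inf
    P = len(c)
    M = [[sum(Xd[i][a] * W[i][b] for i in range(len(rows))) for b in range(P)]
         for a in range(P)]                # M_d = X_d' Sigma_d^{-1} X_d
    z = solve(M, [[ci] for ci in c])       # M_d^{-1} c
    if z is None:
        return inf
    return sum(ci * zi[0] for ci, zi in zip(c, z))


# Rows: (cluster 1, t1), (cluster 1, t2), (cluster 2, t1), (cluster 2, t2).
# Columns of X: period 1 indicator, period 2 indicator, treatment indicator.
X = [[1, 0, 0], [0, 1, 1], [1, 0, 0], [0, 1, 0]]
Sigma = [[1.0, 0.05, 0.0, 0.0],
         [0.05, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.05],
         [0.0, 0.0, 0.05, 1.0]]
c = [0, 0, 1]

print(c_optimality(X, Sigma, [0, 1, 2, 3], c))  # all four observations
print(c_optimality(X, Sigma, [0, 2, 3], c))     # no treated observation
```

The pure-Python solver keeps the example self-contained; for realistic design spaces a numerical linear algebra library would be the natural choice.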
2.1. GLMM specifications for cluster randomised trials

Without loss of generality, we focus on models for cluster trials where individuals are cross-sectionally sampled in each cluster-period. Where relevant and also without loss of generality, we use $r$ to represent the number of observations per cluster-period. For a comprehensive discussion of different models relevant to cluster randomised trials, see Li et al.⁵

2.1.1. Covariance function

The observed outcome for an individual $i$ in cluster $k$ at time $t$ is specified as $y_{i k t}$ with linear predictor $η_{i k t} = x_{i k t} β + s_{i k t}$ , where $s_{i k t} = z_{i k t} u$ represents the random effects. The covariance function defines the entries of the covariance matrix $D$ . We define a covariance function as:

$\mathrm{Cov} (s_{i k t}, s_{i^{'} k^{'} t^{'}}) = g (Δ t, Δ k)$
(4)
where $Δ t = | t - t^{'} |$ and $Δ k = | k - k^{'} |$. We define the following covariance functions:

EXC1

Cluster exchangeable:
$g (Δ t, Δ k) = \begin{cases} τ^{2} & \text{if } Δ k = 0 \\ 0 & \text{otherwise} \end{cases}$

EXC2

Nested exchangeable:
$g (Δ t, Δ k) = \begin{cases} τ^{2} + ω^{2} & \text{if } Δ k = 0 \text{ and } Δ t = 0 \\ τ^{2} & \text{if } Δ k = 0 \text{ and } Δ t > 0 \\ 0 & \text{otherwise} \end{cases}$

AR1

Auto-regressive or exponential decay:
$g (Δ t, Δ k) = \begin{cases} τ^{2} λ^{Δ t} & \text{if } Δ k = 0 \\ 0 & \text{otherwise} \end{cases}$

In the cluster and nested exchangeable functions above, the parameter $τ^{2}$ represents the between-cluster variance and $ω^{2}$ is the within-cluster, between-period variance. For the auto-regressive function, $τ^{2}$ similarly represents the between-cluster variance, with $λ$ the auto-regressive parameter describing the rate of temporal decay.
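The three covariance functions translate directly into code. The sketch below is illustrative only: the function names and the helper `covariance_matrix`, which fills in the covariance of the random-effect terms for a list of (cluster, period) cells, are hypothetical, and the parameter values are arbitrary.

```python
def exc1(dt, dk, tau2):
    """Cluster exchangeable: constant covariance within a cluster."""
    return tau2 if dk == 0 else 0.0


def exc2(dt, dk, tau2, omega2):
    """Nested exchangeable: extra covariance within the same cluster-period."""
    if dk != 0:
        return 0.0
    return tau2 + omega2 if dt == 0 else tau2


def ar1(dt, dk, tau2, lam):
    """Auto-regressive: within-cluster covariance decays as lam ** dt."""
    return tau2 * lam ** dt if dk == 0 else 0.0


def covariance_matrix(cells, g):
    """Covariance of the random-effect terms for a list of (cluster, period) cells."""
    return [[g(abs(t - t2), abs(k - k2)) for (k2, t2) in cells] for (k, t) in cells]


cells = [(1, 1), (1, 2), (1, 3), (2, 1)]
for row in covariance_matrix(cells, lambda dt, dk: ar1(dt, dk, tau2=0.05, lam=0.8)):
    print(row)
```

Passing the covariance function as an argument means the same matrix-building code serves all three structures.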

For Gaussian-identity models, we use $σ^{2}$ to denote the observation-level variance. Gaussian-identity models are often re-parameterised in terms of other parameters, in particular:

ICC

Intra-class correlation coefficient. Equal to $ρ = \frac{τ^{2}}{τ^{2} + σ^{2}}$ for EXC1 and AR1 and $ρ = \frac{τ^{2} + ω^{2}}{τ^{2} + σ^{2} + ω^{2}}$ for EXC2. More precisely, this is the ‘within-period ICC’ for designs with repeated measures and EXC2 function.³

CAC

Cluster-autocorrelation coefficient. Equal to $r = \frac{τ^{2}}{τ^{2} + ω^{2}}$ for EXC2 and not defined for the other models.
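When a design is specified through the ICC and CAC, the variance components for the EXC2 structure can be recovered by inverting the two definitions above. The sketch below assumes the total variance $τ^{2} + ω^{2} + σ^{2}$ is fixed (at one by default); that normalisation and the function name `exc2_variances` are illustrative assumptions rather than part of the original text.

```python
def exc2_variances(icc, cac, total_variance=1.0):
    """Map (ICC, CAC) to (tau2, omega2, sigma2) for the EXC2 structure.

    Assumes tau2 + omega2 + sigma2 = total_variance (an illustrative choice).
    """
    cluster_part = icc * total_variance        # tau2 + omega2
    tau2 = cac * cluster_part
    omega2 = cluster_part - tau2
    sigma2 = total_variance - cluster_part
    return tau2, omega2, sigma2


tau2, omega2, sigma2 = exc2_variances(icc=0.05, cac=0.8)
print(tau2, omega2, sigma2)
# Round trip back to the two correlations
print((tau2 + omega2) / (tau2 + omega2 + sigma2), tau2 / (tau2 + omega2))
```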

2.1.2. Matrix X

The $N \times P$ matrix $X$ is a matrix of covariates. For cluster trials with repeated measures $X$ typically consists of an intercept, time period indicators for $T - 1$ time periods, and a treatment indicator. Equivalently, matrix $X$ may be specified without the intercept and with $T$ time period indicators. We use this specification for the examples in subsequent sections. For some trial designs, investigators may consider alternative specifications. For a parallel design, the treatment effect estimator from a model that does not adjust for time period is unbiased. However, such a specification would result in a biased treatment effect estimator where the intervention roll out is staggered over time. Thus, we assume that for most applications where staggered designs feature in the design space, adjustment for time is incorporated in the specification for $X$ . We may also consider adjusting for continuous functions of time, such as polynomials, if the time periods are an approximation to continuous time. For example, Hooper and Eldridge¹⁰ considered cubic and piecewise continuous polynomials.

In this discussion, we also assume there is a single treatment that enters the model as a dichotomous treatment indicator. More complex cluster trial designs may feature multiple arms and treatments,¹¹ including continuous treatments representing dose. We do not consider these designs here; however, the optimal design methods below can be extended to these cases.

3. Methods and previous literature

We divide the currently available methods for the optimal cluster trial design problem into three categories: (i) derivation of exact formulae for the treatment effect variance or precision for specific models and design spaces; (ii) general ‘multiplicative’ methods that derive weights to place on each unique experimental unit; and (iii) general combinatorial optimisation algorithms designed to select the optimum $m$ items from a discrete set of size $J$ .

3.1. Exact formulae

For simpler models one can derive explicit formulae for $f (d)$ . Given a statement of the variance or precision one can then either determine an algorithm to identify an optimal solution, or use it to calculate the variance for a wide range of designs and/or parameter values and compare numerically or graphically. Many such studies are based on the formula for the variance of the treatment effect estimator in the linear mixed model with EXC1 covariance given by Hussey and Hughes.¹²

Girling and Hemming⁹ provided perhaps the most notable study of this type for cluster trials. They derive a formula for the precision of the treatment effect estimator in a linear mixed model under covariance structure EXC2 along with individual-level cohort effects, although we drop the individual-level cohort effects for this summary. They consider the problem of determining in which periods to introduce the intervention into each of the $m$ clusters. Each cluster is observed in each of the $T$ time periods, and each cluster-period has $n$ observations. One can map this problem onto Design Space A in Figure 1, where each row is an experimental unit repeated $m$ times and the goal is to identify the optimal set of $m$ experimental units.

We can rewrite model (1) as a linear mixed model for individual $i$ in cluster $k$ at time $t$ as:

$y_{i k t} = J_{k t} δ + w_{t} γ_{t} + α_{k} + θ_{k t} + u_{i k t}$
(5)
where $J_{k t}$ is an indicator for whether cluster $k$ has the intervention at time $t$, $w_{t}$ is a time period indicator with time period parameters $γ_{t}$, $α_{k} \sim N (0, τ^{2})$ and $θ_{k t} \sim N (0, ω^{2})$ are the cluster and cluster-period random effect terms, and $u_{i k t} \sim N (0, σ^{2})$ is the error term. The fixed effect parameters are $β = [δ, γ_{1}, \dots, γ_{T}]^{T}$ and $δ$ is the treatment effect parameter. With discrete clusters and time periods, we can aggregate model (5) into a model for the cluster-period means:

${\bar{y}}_{k t} = J_{k t} δ + w_{t} γ_{t} + α_{k} + e_{k t}$
(6)
where ${\bar{y}}_{k t} = \frac{1}{n} \sum_{i = 1}^{n} y_{i k t}$ is the mean outcome for cluster $k$ in time period $t$ and $e_{k t} \sim N (0, ω^{2} + \frac{σ^{2}}{n})$ collects the cluster-period random effect and the averaged error term. Girling and Hemming, following work by Hussey and Hughes¹² and others, then show that the precision of the treatment effect estimator $\hat{δ}$ is given by:

$\begin{aligned} \frac{1}{f (d)} & = \mathrm{Var}^{- 1} (\hat{δ}) \\ & = \frac{m T}{(τ^{2} + ω^{2} + \frac{σ^{2}}{n}) (1 - \bar{ρ})} (a_{d} - b_{d} R) \end{aligned}$
(7)
where $\bar{ρ} = \frac{τ^{2}}{τ^{2} + ω^{2} + σ^{2} / n}$ is equivalent to the ICC at the cluster-period mean level and $R = \frac{T \bar{ρ}}{1 + (T - 1) \bar{ρ}}$ is the cluster mean correlation. The leading denominator is the variance of a cluster-period mean multiplied by $1 - \bar{ρ}$, which equals $ω^{2} + σ^{2} / n$. The coefficients $a_{d}$ and $b_{d}$ are determined by the study design:
$\begin{aligned} a_{d} & = \frac{1}{m T} \sum_{t = 1}^{T} \sum_{k = 1}^{m} (J_{k t} - {\bar{J}}_{\cdot t})^{2} \\ b_{d} & = \frac{1}{m} \sum_{k = 1}^{m} ({\bar{J}}_{k \cdot} - {\bar{J}}_{\cdot \cdot})^{2} \end{aligned}$

where the dot indicates the index over which the mean is taken.
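The precision formula is straightforward to evaluate for any treatment indicator matrix. The sketch below is a hedged illustration: it uses equation (7) with the variance of a cluster-period mean, $τ^{2} + ω^{2} + σ^{2} / n$, in the leading denominator and computes $b_{d}$ from the cluster means ${\bar{J}}_{k \cdot}$. The function name `gh_precision` and the parameter values are assumptions made for the example.

```python
def gh_precision(J, tau2, omega2, sigma2, n):
    """Precision of the treatment effect estimator for indicator matrix J.

    J is a list of m cluster rows, each a list of T zeros and ones.
    """
    m, T = len(J), len(J[0])
    period_mean = [sum(J[k][t] for k in range(m)) / m for t in range(T)]
    cluster_mean = [sum(J[k]) / T for k in range(m)]
    grand_mean = sum(cluster_mean) / m
    a_d = sum((J[k][t] - period_mean[t]) ** 2
              for k in range(m) for t in range(T)) / (m * T)
    b_d = sum((cm - grand_mean) ** 2 for cm in cluster_mean) / m
    s2 = omega2 + sigma2 / n                  # residual variance of a cluster-period mean
    rho_bar = tau2 / (tau2 + s2)              # ICC at the cluster-period mean level
    R = T * rho_bar / (1 + (T - 1) * rho_bar)  # cluster mean correlation
    return m * T * (a_d - b_d * R) / ((tau2 + s2) * (1 - rho_bar))


T = 6
# Ten clusters: two per stepped-wedge sequence, or five per parallel arm.
stepped_wedge = [[0] * (T - s) + [1] * s for s in range(1, T) for _ in range(2)]
parallel = [[0] * T] * 5 + [[1] * T] * 5

for name, J in [("stepped-wedge", stepped_wedge), ("parallel", parallel)]:
    precision = gh_precision(J, tau2=0.04, omega2=0.01, sigma2=0.95, n=10)
    print(name, 1 / precision)   # variance of the treatment effect estimator
```

Comparing the printed variances across candidate designs is the numerical or graphical comparison described at the start of this section.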

Girling and Hemming⁹ provided a method for using equation (7) to produce an optimal design under a no reversibility constraint. We assume the clusters are numbered such that a lower-numbered cluster has at least as many intervention periods as any higher-numbered cluster. We then map the cluster-period indexes to coordinates on a unit square $(j, t) \mapsto (x_{0 j}, x_{1 t})$, where $x_{0 j}, x_{1 t} \in [- 1 / 2, 1 / 2]$. All cluster-periods start in the control state, and then starting with cluster 1 in the $T$th period, one successively changes the cluster-period to an intervention state in the order of decreasing values of $R x_{1 t} - x_{0 j}$ until ${\bar{J}}_{\cdot}$ cluster-periods are included in the treated set. Examples of this method are provided in the article, which we reproduce in the examples section.

Lawrie et al.¹³ derive explicit formulae for the optimal proportion of clusters to allocate to each sequence (row) in the Design Space A to minimise $f (d)$ using a linear mixed model and the EXC1 covariance function. They show that the optimal proportion of clusters allocated to the $t$ th sequence in the stepped-wedge design space (see Figure 1) with $T - 1$ sequences, $ϕ_{t}$ is:

$\begin{aligned} ϕ_{1} & = ϕ_{T - 1} = \frac{1 + ρ (3 r - 1)}{2 (1 + ρ (r T - 1))} \\ ϕ_{t} & = \frac{r ρ}{1 + ρ (r T - 1)} \quad \text{for } t = 2, \dots, T - 2 \end{aligned}$
(8)

A similar analysis for EXC1 structure and a linear mixed model is given by Woertman et al.¹⁴

Zhan et al.¹⁵ extended Lawrie et al.’s analyses using Girling and Hemming’s work to identify more general ‘optimal unidirectional switch designs’ by extending the probability weights (8) to a larger design space with sequences incorporating exclusively control or intervention conditions, and with EXC1 covariance functions. Here, unidirectional switching means no reversibility, giving, for example Design Space A in Figure 1. The more general probability weights for the design space with $T + 1$ sequences are:

$\begin{aligned} ϕ_{0} & = ϕ_{T} = \frac{1 + ρ (r - 1)}{2 (1 + ρ (r T - 1))} \\ ϕ_{t} & = \frac{r ρ}{1 + ρ (r T - 1)} \quad \text{for } t = 1, \dots, T - 1 \end{aligned}$
(9)

Zhan et al. also extended this analysis to smaller design spaces including only a subset of the rows of Design Space A. We discuss below methods for rounding proportions to whole numbers of clusters.
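Both sets of weights can be computed in a few lines, which also gives a quick check that each set sums to one. This is a minimal sketch of equations (8) and (9) as stated above; the function names `lawrie_weights` and `zhan_weights` and the parameter values are illustrative assumptions.

```python
def lawrie_weights(rho, r, T):
    """Equation (8): weights for the T - 1 stepped-wedge sequences (EXC1)."""
    if T < 3:
        raise ValueError("this sketch assumes at least three periods")
    denom = 1 + rho * (r * T - 1)
    end = (1 + rho * (3 * r - 1)) / (2 * denom)   # first and last sequence
    mid = r * rho / denom                         # every sequence in between
    return [end] + [mid] * (T - 3) + [end]


def zhan_weights(rho, r, T):
    """Equation (9): weights for the T + 1 sequences with one-way switching (EXC1)."""
    denom = 1 + rho * (r * T - 1)
    end = (1 + rho * (r - 1)) / (2 * denom)       # all-control and all-intervention rows
    mid = r * rho / denom                         # every switching sequence
    return [end] + [mid] * (T - 1) + [end]


rho, r, T = 0.05, 10, 6
for name, weights in [("equation (8)", lawrie_weights(rho, r, T)),
                      ("equation (9)", zhan_weights(rho, r, T))]:
    print(name, [round(w, 4) for w in weights], "sum =", round(sum(weights), 12))
```

As the ICC grows, weight moves from the two end sequences towards the middle sequences in both formulae.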

There are several other studies that derive expressions for the treatment effect variance to identify efficient study designs. Hooper and Copas¹⁶ considered a linear mixed model with AR1 covariance for a cluster randomised trial with continuous recruitment. They consider a parallel study design with baseline measures and aim to identify when the intervention should be implemented in the intervention arm under different sample sizes and covariance parameters. They calculate the value of (3) for a large range of models and graphically compare the results. Copas and Hooper¹⁷ take a similar approach with a linear mixed model with EXC1 covariance with a parallel trial design. They aim to identify optimal sample sizes and the proportion of data to collect in baseline and endline periods. Moerbeek¹⁸ also uses an explicit criterion, although not strictly for c-optimality, as they aim to identify an optimal sample size within treatment and control groups subject to a budget constraint. They consider only a single time period, such that the treatment effect estimator is a difference in means. Lemme et al. also consider a similar cost–benefit optimisation approach for multicentre trials.¹⁹

Deriving explicit formulae for the variance or precision is appealing due to its relative simplicity. Identifying a c-optimal design does not require specialist tools and can be done using spreadsheet software. However, these methods are typically limited to specific models and designs, such as exchangeable covariance structures, linear models, and equal cluster-period sizes. The mathematical approach used to derive the precision formula does not carry over to more complex covariance structures or design spaces, nor to problems where the experimental unit is an observation or cluster-period. One can calculate the value of the c-optimality criterion directly for any design, as Hooper and Copas¹⁶ do. However, the number of designs one must calculate the variance for grows exponentially and prohibitively with the size of the design space. More general methods are required for these extended problems.

3.2. Multiplicative weighting methods

Determining probability weights for experimental units, as the studies cited above do explicitly,^15,13 is a useful strategy to simplify the optimal design problem. One can generalise this approach to tackle more complex models and design spaces. We place a probability measure $ϕ$ on $D$ so that our design is characterised by $ϕ := \{(E_{j}, ϕ_{j}) : j = 1, \dots, J\}$ where $ϕ_{j} \in [0, 1]$ are weights. The optimal design problem can then be re-stated as finding a design that minimises $f (ϕ)$.

3.2.1. Elfving’s theorem

Elfving’s theorem is a classic result in the theory of optimal designs.²⁰ The original formulation considered independent, identically distributed observations. Holland-Letz et al.²¹ and Sagnol²² generalised the theorem to the case where there is correlation within experimental units containing multiple observations, such as within a cluster, but not between experimental units, as there would be if the experimental unit was a cluster-period or observation. Elfving’s theorem provides a geometric characterisation of the c-optimal design problem. If the experimental units are uncorrelated, the information matrix in equation (2) can be rewritten as:

$M_{d} = \sum_{E_{j} \in d} X_{E_{j}}^{T} Σ_{E_{j}}^{- 1} X_{E_{j}}$
thus for the approximate design $ϕ$ we can write:
$M_{ϕ} = \sum_{k = 1}^{K} X_{E_{k}}^{T} Σ_{E_{k}}^{- 1} X_{E_{k}} ϕ_{k}$
(10)
which we can rewrite as:
$M_{ϕ} = \sum_{k = 1}^{K} F_{E_{k}}^{T} F_{E_{k}} ϕ_{k}$
(11)
where $F_{E_{k}} = L_{E_{k}, E_{k}}^{T} X_{E_{k}}$ and $L_{E_{k}, E_{k}}$ is a square root of $Σ_{E_{k}, E_{k}}^{- 1}$.

A ‘generalised Elfving set’ is:
$R = \mathrm{co} \{F_{E_{k}}^{T} ϵ_{k} : X_{E_{k}} \in X^{| E_{k} | \times P}; | | ϵ_{k} | | = 1; k = 1, \dots, K\}$
(12)
where $\mathrm{co}$ denotes the convex hull. This set leads us to a generalised Elfving theorem:

Theorem 1 (generalised Elfving theorem)

A design $ϕ := \{(E_{k}, ϕ_{k}) : k = 1, \dots, K\}$ is c-optimal if and only if there exist vectors $ϵ_{1}, \dots, ϵ_{K}$, where $| | ϵ_{k} | | = 1$, and a positive real scalar $π$ such that $π c = \sum_{k = 1}^{K} ϕ_{k} F_{E_{k}}^{T} ϵ_{k}$ is a boundary point of the set $R$.

For proof see Holland-Letz et al.²¹ and Sagnol.²²

Sagnol²² shows how the generalised Elfving theorem can be used to define a second-order cone program, which is a type of conic optimisation problem that can be solved with interior point methods. This program returns the optimal values of $ϕ_{1}, \dots, ϕ_{K}$. We provide functionality for the problems we consider in this article in the R package glmmrOptim, including this program. Other proposals exist for identifying the optimal weights, such as using a multiplicative algorithm based on an upper bound for the solution.²³

3.2.2. Mixed model weights

Girling (forthcoming) has proposed an algorithm for finding the optimum set of weights that can be applied to the case when experimental units are equivalent to cluster-periods. Since observations in this context are exchangeable within a cluster-period, when the weights are rounded to number of observations (see next section), the result is equivalent to when the experimental unit is a single observation. We consider the aggregated cluster-period model (6). The best linear unbiased estimator for the linear combination $b = c^{T} β$ can be written as

$\hat{b} = a^{T} y = {a^{'}}^{T} L^{T} y$
where $a = [a_{1, 1}, \dots, a_{1, T}, a_{2, 1}, \dots, a_{K, T}]$ is a vector of weights, with $a_{k, t}$ the estimation weight for cluster $k$ and time $t$, and $a = L a^{'}$. As before, $L$ is the Cholesky factor of $Σ^{- 1}$, so that $Σ^{- 1} = L L^{T}$, and $F = L^{T} X$. The estimator is unbiased if $F^{T} a^{'} = c$ and, by the Gauss–Markov theorem, has minimum variance for $a^{'} = F (F^{T} F)^{- 1} c$. So we have that $a = Σ^{- 1} X (X^{T} Σ^{- 1} X)^{- 1} c$, giving us the generalised least squares estimator. In the linear model case, we can write $Σ = \frac{σ^{2}}{N} Φ^{- 1} + Z D Z^{T}$, where $Φ$ is a diagonal matrix of weights with diagonal entries $ϕ_{k t}$ such that the number of observations in the cluster-period $(k, t)$ is $ϕ_{k t} N$, with $\sum_{k t} ϕ_{k t} = 1$. The variance of the estimator can then be written as:

$\begin{aligned} \mathrm{Var} (\hat{b}) & = a^{T} Σ a = \sum_{k = 1}^{K} a_{k}^{T} Σ_{k} a_{k} \\ & = \sum_{k = 1}^{K} \sum_{t = 1}^{T} \sum_{s = 1}^{T} a_{k t} a_{k s} g (| t - s |, 0) + \frac{σ^{2}}{N} \sum_{k = 1}^{K} \sum_{t = 1}^{T} \frac{a_{k t}^{2}}{ϕ_{k t}} \end{aligned}$
where $g (.)$ is the covariance function (4). Ignoring the first part of the final line, which is not determined by the cluster-period weights in each cell, the Cauchy–Schwarz inequality shows that:
$\frac{σ^{2}}{N} \sum_{k = 1}^{K} \sum_{t = 1}^{T} \frac{a_{k t}^{2}}{ϕ_{k t}} \geq \frac{σ^{2}}{N} {(\sum_{k = 1}^{K} \sum_{t = 1}^{T} | a_{k t} |)}^{2}$
which then gives us a lower bound on the variance when adding in the covariance terms. This inequality becomes an equality, and hence the minimal variance, when the weights are set as:
$ϕ_{k t} = \frac{| a_{k t} |}{\sum_{k t} | a_{k t} |}$
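The equality condition can be seen by writing out the Cauchy–Schwarz step. The following is a short worked sketch, assuming all $ϕ_{k t} > 0$ and using $\sum_{k t} ϕ_{k t} = 1$:

```latex
\begin{aligned}
\Big(\sum_{k,t} |a_{kt}|\Big)^{2}
  &= \Big(\sum_{k,t} \frac{|a_{kt}|}{\sqrt{\phi_{kt}}}\,\sqrt{\phi_{kt}}\Big)^{2}
  \le \Big(\sum_{k,t} \frac{a_{kt}^{2}}{\phi_{kt}}\Big)\Big(\sum_{k,t} \phi_{kt}\Big)
  = \sum_{k,t} \frac{a_{kt}^{2}}{\phi_{kt}}, \\
\text{with equality iff } \frac{|a_{kt}|}{\sqrt{\phi_{kt}}} \propto \sqrt{\phi_{kt}}
  &\iff \phi_{kt} = \frac{|a_{kt}|}{\sum_{k,t} |a_{kt}|}.
\end{aligned}
```

Because the estimation weights $a_{k t}$ themselves depend on $ϕ$ through $Σ$, the condition cannot be solved in one step, which is why it is applied iteratively.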

The above argument therefore suggests a simple algorithm to identify the cluster-period weights that minimise the variance, and hence are the c-optimal design, which is shown in Algorithm 1. We have implemented this algorithm in the R package glmmrOptim as the ‘Girling algorithm’. To ensure the algorithm terminates, we had to add additional steps to the algorithm described below. In particular, on each iteration of the algorithm we exclude cluster-periods where the weight is smaller than some lower bound ($10^{- 7}$) to avoid the weights continually shrinking on every iteration, and we exclude time periods from the linear predictor if the total weights for that period are zero. The algorithm applies only to the cases where the experimental units are cluster-periods or individual observations; however, one might sum the weights within larger experimental conditions for different contexts.

3.3. Rounding proportions of experimental units

Where a method produces an optimal design in terms of the proportion of experimental units of each type to include, we must use a rounding procedure to translate it into exact numbers. There are several methods for converting proportions to integer counts that sum to a given total. The problem was famously identified for converting popular vote totals in states into numbers of seats in the US House of Representatives; the solutions are named after their proposers.²⁴ Pukelsheim and Rieder,²⁵ following others,²⁶ argued that the procedure of John Quincy Adams is the most efficient method of rounding to an exact design. As Pukelsheim and Rieder note though, the design weights do not contain enough information to exactly identify an experimental design, and so multiple designs may be generated. However, this procedure is based on the assumption that a ‘fair’ allocation includes at least one experimental unit of each type. For many cluster trial design problems we do not require this restriction, for example, a parallel trial is optimal in some cases.⁹ In other cases though, there may be practical reasons to ensure staggering of the roll-out,^10,27 in which case this rounding scheme would be the most efficient. Hamilton’s rounding procedure is an alternative method. We initially assign $⌊ J ϕ_{j} ⌋$ clusters to each sequence (where $⌊ x ⌋$ is the floor of $x$), and then incrementally add clusters according to the largest remainder $J ϕ_{j} - ⌊ J ϕ_{j} ⌋$. In the later examples (and the implementation in the R package glmmrOptim) we use all rounding procedures and then select the design with the smallest value of $f (d)$, since evaluating the variance for the small number of possibly optimal designs does not bear a high computational cost.
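Hamilton’s largest-remainder procedure, as described above, can be sketched in a few lines. The function name `hamilton_round` and the example weights are hypothetical; ties between equal remainders are broken by position here, which is an arbitrary choice for the illustration.

```python
from math import floor


def hamilton_round(weights, total):
    """Round design weights to integer counts that sum to `total`.

    Each unit first receives the floor of its quota (weight * total); the
    units with the largest remainders then receive one extra each until
    the counts sum to `total`.
    """
    quotas = [w * total for w in weights]
    counts = [floor(q) for q in quotas]
    remaining = total - sum(counts)
    # Indices ordered by decreasing remainder (ties keep their original order).
    order = sorted(range(len(weights)),
                   key=lambda j: quotas[j] - counts[j], reverse=True)
    for j in order[:remaining]:
        counts[j] += 1
    return counts


print(hamilton_round([0.5, 0.3125, 0.1875], 10))
```

With weights that are not exactly representable in floating point, a quota that should be an integer can fall just below it, so a small tolerance before taking the floor may be worth adding in practice.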

While the solutions generated by different rounding schemes, and the algorithms discussed in the next section, may in fact be an exact optimal solution, they cannot guarantee such a result. In the results section we provide several examples where the results of the methods may disagree. The equivalence theorem²⁸ provides precise conditions to check whether a given design is indeed optimal. However, it requires knowledge of the optimal design. Girling and Hemming⁹ use an approach of comparing the relative efficiency of the design to that of a cluster cross-over, which is the most efficient if it is within the design space. Not all design spaces include the cluster cross-over design, and so the optimal design may not be known. Holland-Letz et al.²³ derived a lower bound for the relative efficiency of a given design in the context of a pharmacokinetic study with correlated observations.

3.4. Combinatorial optimisation algorithms

Watson and Pan²⁹ showed how the c-optimal design criterion in equation (3) is a ‘monotone supermodular function’, which means it is amenable to one of several combinatorial optimisation algorithms that are well-studied in the literature. A supermodular function is one for which, given a design $d \subset D$ and a smaller design $d^{'} \subseteq d$ , then $f (d \cup E) - f (d) \geq f (d^{'} \cup E) - f (d^{'})$ is true.³⁰ Intuitively, one can see this is the case for the design problems considered in this article since it states that the decline in variance from adding a new experimental unit $E$ is smaller for larger designs. The function is monotone decreasing if $d^{'} \subseteq d \to f (d^{'}) \geq f (d)$ , which means that the variance will be at least as large if you remove any observations. The advantage of these algorithms is that they allow identification of optimal designs in cases where there is correlation between experimental units, such as when the experimental units are cluster-periods or single observations.

The three algorithms relevant to supermodular function minimisation are the local search, the greedy search, and the reverse greedy search algorithm.^26,31,32,33 We exclude the greedy search algorithm here, as it starts from the empty set and successively adds observations. As we require a minimum of $P$ observations to ensure a positive semidefinite information matrix, the algorithm therefore performs poorly as Watson and Pan²⁹ show. The local and reverse greedy searches are shown in Algorithm box 2. These algorithms are also implemented in the R package glmmrOptim.

Finding the subset of size $m$ from the design space that minimises $f (d)$ is an NP-hard problem; however, much work has been produced from the 1970s onwards on computationally efficient methods of finding approximate solutions. In some cases, these algorithms give a ‘constant factor approximation’, that is, the worst case result has a provable bound on $f (d) / f (d^{*})$ where $d^{*}$ is the c-optimal design. For the design problem we consider in this article, only the local search has a constant factor approximation. However, we also include the reverse greedy search as it, or similar variants, have appeared in the literature for cluster trials.

The local search algorithm starts from a design of the desired size $m$ and then makes the swap of an experimental unit in the design with one not in the design that leads to the greatest reduction in the c-optimality criterion. Such swaps are made until no further value improving swaps are available. The worst possible design that this algorithm produces under a cardinality constraint (i.e. $| d | \leq J$ ) has a value no larger than $3 / 2$ times the true c-optimal design.³² This bound can be improved to $1 + 1 / e$ with certain extensions to the algorithm.³⁴
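The swap-based local search can be sketched as follows for the case where clusters are the experimental units, so that the information matrix is a sum of independent cluster contributions. This is a minimal sketch under stated assumptions rather than the glmmrOptim implementation: the helper names, the EXC2 parameterisation with total variance fixed at one, the seven-sequence design space (intervention starting in period `start`, with `start == T` meaning never treated), and the parameter values are all chosen for illustration.

```python
import numpy as np

def unit_information(T, start, icc, cac, r):
    """Contribution X_k^T V^{-1} X_k of one cluster to the information matrix."""
    t = np.arange(T)
    # Columns: treatment indicator, then T time period indicators.
    X = np.column_stack([(t >= start).astype(float), np.eye(T)])
    # Assumed EXC2 covariance of the cluster-period means (total variance 1).
    V = icc * cac * np.ones((T, T)) + (icc * (1 - cac) + (1 - icc) / r) * np.eye(T)
    return X.T @ np.linalg.solve(V, X)

def c_variance(info, design, c):
    """c-optimality criterion c^T M^{-1} c; infinite if M is singular."""
    M = sum(info[i] for i in design)
    if np.linalg.matrix_rank(M) < len(c):
        return np.inf
    return float(c @ np.linalg.solve(M, c))

def local_search(info, m, c, rng):
    """Start from a random design of size m and repeatedly make the best swap."""
    n = len(info)
    design = [int(i) for i in rng.choice(n, size=m, replace=False)]
    value = c_variance(info, design, c)
    while True:
        best_value, best_swap = value, None
        outside = [j for j in range(n) if j not in design]
        for pos in range(m):
            for j in outside:
                trial = design[:pos] + [j] + design[pos + 1:]
                v = c_variance(info, trial, c)
                if v < best_value - 1e-12:
                    best_value, best_swap = v, (pos, j)
        if best_swap is None:          # no improving swap remains
            return sorted(design), value
        design[best_swap[0]] = best_swap[1]
        value = best_value

T = 6
starts = [s for s in range(T + 1) for _ in range(5)]   # each sequence up to 5 times
info = [unit_information(T, s, icc=0.05, cac=0.8, r=10) for s in starts]
c = np.zeros(T + 1)
c[0] = 1.0                                             # target the treatment effect
rng = np.random.default_rng(1)
runs = [local_search(info, 10, c, rng) for _ in range(5)]
design, value = min(runs, key=lambda run: run[1])
print(sorted(starts[i] for i in design), value)
```

Running the search from several random starts and keeping the lowest-variance result mirrors the multiple-run strategy used for the local search in this article.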

The reverse greedy algorithm starts from the complete design space and successively removes the experimental unit whose removal results in the smallest increase in variance. Proofs of the constant factor approximation for the reverse greedy algorithm depend on the ‘steepness’ or ‘curvature’ of $f (.)$ , which depends on the value of $f (\emptyset)$ , where $\emptyset$ is the empty set, that is, a design with no observations. A reasonable choice for the variance of an estimator from a design with no observations is infinity, as we specify in (3). However, the resulting curvature of the function then means there is no constant factor approximation bound.^30,35 Alternatively, we could say $f (\emptyset)$ is undefined, and we would again lack a theoretical guarantee.
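A reverse greedy search can be sketched in the same setting, with clusters as experimental units. Again this is an illustrative sketch and not the glmmrOptim code: the helper names, the assumed EXC2 parameterisation with total variance one, and the design space of seven sequences each repeated five times are assumptions for this example.

```python
import numpy as np

def unit_information(T, start, icc, cac, r):
    """Contribution X_k^T V^{-1} X_k of one cluster to the information matrix."""
    t = np.arange(T)
    X = np.column_stack([(t >= start).astype(float), np.eye(T)])
    V = icc * cac * np.ones((T, T)) + (icc * (1 - cac) + (1 - icc) / r) * np.eye(T)
    return X.T @ np.linalg.solve(V, X)

def variance_from_info(M, c):
    """c^T M^{-1} c, or infinity if the information matrix is singular."""
    if np.linalg.matrix_rank(M) < len(c):
        return np.inf
    return float(c @ np.linalg.solve(M, c))

def reverse_greedy(info, m, c):
    """Start from every unit and drop the least informative one until m remain."""
    design = list(range(len(info)))
    M = sum(info)
    while len(design) > m:
        # Variance of the design that would remain after removing each unit.
        values = [variance_from_info(M - info[i], c) for i in design]
        k = int(np.argmin(values))
        M = M - info[design.pop(k)]
    return design, variance_from_info(M, c)

T = 6
starts = [s for s in range(T + 1) for _ in range(5)]
info = [unit_information(T, s, icc=0.05, cac=0.8, r=10) for s in starts]
c = np.zeros(T + 1)
c[0] = 1.0
full_value = variance_from_info(sum(info), c)
design, value = reverse_greedy(info, 10, c)
print(sorted(starts[i] for i in design), value, full_value)
```

The search is deterministic, so a single run suffices. Because the criterion is monotone, the variance of the reduced design cannot be smaller than that of the complete design space.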

Watson and Pan²⁹ investigated these algorithms for a range of study designs, including cluster randomised trials. They find that empirically the reverse greedy and local search algorithms provide similar performance in terms of the variance of the resulting design. The reverse greedy search is deterministic, while the local search starts from a random design, so Watson and Pan run the local search multiple times and select the best design. They also suggest several approaches to improve the computational efficiency of these algorithms.

Kasza and Forbes³⁶ used a reverse greedy approach to identify optimal designs. They describe the method as estimating the ‘information content’ of clusters or cluster-periods in a design space like Figure 1, where their measure of information is the marginal change in variance from removing the observations from the design. The results presented by Kasza and Forbes³⁶ are qualitatively similar to those using other methods and algorithms, such as those presented below.

Hooper et al.³⁷ examined the optimal cluster trial designs in the context of the linear mixed model with covariance function AR1. They consider a discrete approximation to a continuous time model with continuous recruitment and polynomial functions of time. The design space consists of individuals regularly spaced over a time interval within clusters; the individuals constitute the experimental unit. They aim to provide a set of illustrative optimal designs under different parameter values for the covariance function. The method used to identify these designs could also be described as a variant of the ‘reverse greedy’ algorithm. Each iteration of the algorithm is supplemented with a type of local search, although the swaps of experimental units that can be made are limited at each step to preserve a no reversibility restriction. The designs presented by Hooper and Eldridge¹⁰ are often qualitatively different from those presented here resulting from other methods. However, the design space they use includes a wide range of other designs, and their specification of $X$ does not include time period indicators, which may account for some of the differences.

3.4.1. Computational complexity

The computational complexity of the local and reverse greedy searches scales as $O (m^{4} r^{3} (J - m))$ and $O (J^{3} r^{3} (J - m))$ , respectively,²⁹ where $r$ is the number of observations in an experimental unit. These algorithms scale relatively poorly with the size of the design. However, the approach taken by Girling and Hemming⁹ discussed above suggests a way of improving the computational time of these algorithms when the experimental unit is a cluster or cluster-period. Equation (6) specifies a model for the cluster-period mean under covariance function EXC2. A similar model can be specified for the AR1 function with equal sized cluster-periods:

\begin{aligned} {\bar{y}}_{k t} & = J_{k t} δ + W_{t} τ_{t} + α_{k t} + e_{k t} \\ Cov (α_{k t}, α_{k t^{'}}) & = τ^{2} λ^{| t - t^{'} |} \\ Var (e_{k t}) & = \frac{σ^{2}}{r} \end{aligned}

(13)

The advantage of using a model for the cluster-periods when the experimental unit is the cluster-period is that it only requires a single swap or addition to change an experimental unit, as opposed to $r$ swaps or additions to the design.
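The following sketch illustrates why the cluster-period mean model can stand in for the individual-level model when cluster-periods are of equal size and covariates vary only at the cluster-period level. It builds the AR1 covariance of the cluster-period means for one cluster and checks numerically that the resulting information matrix matches the one computed from all $r T$ individual observations. The parameter values and the example sequence (intervention from the fourth period) are assumptions for illustration.

```python
import numpy as np

T, r = 6, 10
tau2, lam, sigma2 = 0.05, 0.8, 0.95        # assumed covariance parameters
t = np.arange(T)

# AR1 covariance of the cluster-period random effects: tau^2 * lambda^|t - t'|.
D = tau2 * lam ** np.abs(t[:, None] - t[None, :])

# Design matrix for the cluster-period means: treatment plus period indicators.
X_mean = np.column_stack([(t >= 3).astype(float), np.eye(T)])

# Cluster-period mean model: a T x T covariance matrix.
V_mean = D + (sigma2 / r) * np.eye(T)
info_mean = X_mean.T @ np.linalg.solve(V_mean, X_mean)

# Individual-level model: an (rT) x (rT) covariance matrix.
Z = np.kron(np.eye(T), np.ones((r, 1)))    # maps individuals to cluster-periods
Sigma_ind = Z @ D @ Z.T + sigma2 * np.eye(T * r)
X_ind = Z @ X_mean
info_ind = X_ind.T @ np.linalg.solve(Sigma_ind, X_ind)

print(np.allclose(info_mean, info_ind))
```

The mean model works with a $T \times T$ matrix per cluster rather than an $r T \times r T$ matrix, which is where the computational saving comes from.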

3.5. Non-Gaussian models

The multiplicative weighting, optimal mixed model weights, and combinatorial methods all require calculation of the covariance matrix $Σ$ and its inverse. For Gaussian models with identity link function $Σ = σ^{2} I + Z D Z^{T}$ , so it can be calculated exactly. For non-Gaussian models, such as Binomial or Poisson, generating $Σ$ can be computationally demanding. For non-linear models, an approximation to $Σ$ and hence to the information matrix $M$ , is typically used.³⁸ Breslow and Clayton³⁹ used the marginal quasilikelihood of the GLMM to propose the first-order approximation:

Σ \approx W^{- 1} + Z D Z^{T}

(14)

where $W$ is a diagonal matrix with entries $W_{i, i} = ((\frac{\partial μ}{\partial η})^{2} Var (y | u))$ , which are the GLM iterated weights.⁴⁰ Here, $W$ is evaluated at the marginal mean $X β$ . For the optimal mixed model weights algorithm we can generate $Σ = \frac{1}{N} W^{- 1} diag (ϕ^{- 1}) + Z D Z^{T}$ .

Zeger et al.⁴¹ suggested that when using the marginal quasilikelihood, approximations can be improved by ‘attenuating’ the linear predictor. For example, for the binomial-logit model one would use $μ_{i} = h^{- 1} (x_{i} β | a D z_{i}^{T} z_{i} + I |^{- 1 / 2})$ , where $a = 16 \sqrt{3} / 15 π$ . For other types of optimality this attenuation can improve the resulting designs,³⁸ however, for c-optimality there was little evidence of a difference in the designs considered by Watson and Pan.²⁹ Other information matrix approximations that may be relevant for non-Gaussian models include using the GEE working covariance matrix or higher-order approximations, however, these methods are either more restrictive or there is little evidence they improve the designs. The approximation also permits the use of cluster-period mean models, like (6) and (13), with heteroskedastic errors given by $Var (e_{j t}) = \frac{W_{j t, j t}}{r_{j t}}$ , where the $r_{j t}$ is the number of observations in cluster $j$ at time period $t$ , and $W_{j t, j t}$ the individual-level variance of an observation in that cluster-period. For the non-Gaussian examples we give below, we use equation (14) without attenuation.
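As a small illustration of the first-order approximation in equation (14) without attenuation, the sketch below builds $Σ \approx W^{- 1} + Z D Z^{T}$ for a binomial-logit model with a single random intercept. It assumes the standard GLM iterated weights for a Bernoulli outcome with logit link, $μ (1 - μ)$, evaluated at the marginal linear predictor; the function name and the toy design of four observations in one cluster are hypothetical.

```python
import numpy as np

def approx_covariance_logit(X, beta, Z, D):
    """First-order approximation W^{-1} + Z D Z^T for a binomial-logit GLMM."""
    eta = X @ beta                         # marginal linear predictor
    mu = 1.0 / (1.0 + np.exp(-eta))        # inverse logit
    w = mu * (1.0 - mu)                    # assumed GLM iterated weights (logit link)
    return np.diag(1.0 / w) + Z @ D @ Z.T

n = 4                                      # four individuals in one cluster
X = np.ones((n, 1))                        # intercept only
Z = np.ones((n, 1))                        # random cluster intercept
D = np.array([[0.16]])                     # tau^2 = 0.16

for base_rate in (0.05, 0.5):
    beta = np.array([np.log(base_rate / (1 - base_rate))])
    Sigma = approx_covariance_logit(X, beta, Z, D)
    print(base_rate, Sigma[0, 0], Sigma[0, 1])
```

The diagonal of the approximate covariance grows as the base rate moves away from 50%, which is one way to see why the linear predictor parameters affect the optimal design for non-Gaussian models.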

Moerbeek and Maas⁴² examined optimal designs for clustered studies with a binomial-logistic mixed model. They derive an approximation to the variance of the treatment effect parameter under the EXC1 covariance function using a linearisation approach with the marginal quasilikelihood. They specifically aim to identify the optimal number of individuals within a cluster in a cost–benefit framework.

3.6. Robust optimality

The methods to generate an optimal design have so far assumed the model parameters are known. However, a well known issue for optimal experimental design methodology is that a design that may be optimal for one set of parameters or model specification may perform poorly for another. Robust designs that are efficient across a range of specifications are therefore desirable. There are multiple possible criteria for modifying the c-optimal design criterion to account for multiple designs. For example, Girling and Hemming⁹ considered a minimax criterion in which they identify a design that maximises (minimises) the minimum (maximum) precision (variance) over all values of the correlation between cluster-period means. This results in a ‘hybrid’ trial design (see Figure 1). Van Breukelen and Candel⁴³ also considered a minimax criterion to identify a robust optimal cluster trial design when the ICC is unknown. Similarly to Moerbeek,⁴² they use a cost–benefit framework and examine the optimal design under a fixed budget.

As a robust optimality criterion, the maximin function is not necessarily generally applicable. For the combinatorial methods, we require that the objective function is supermodular to fit within the framework discussed above, and the maximum of a set of supermodular functions is not necessarily supermodular. As an alternative, we can use a ‘weighted average’. In particular, we assume there is a set of $L$ candidate models and we specify a prior probability for each model $p_{1}, \dots, p_{L}$ with the property $\sum_{l = 1}^{L} p_{l} = 1$ . Dette⁴⁴ and Lauter⁴⁵ propose the following generalisation of the c-optimality criterion:

f (d; A) = \sum_{l = 1}^{L} p_{l} \log (c_{l}^{T} M_{d, l}^{- 1} c_{l})

(15)

where $M_{d, l} = X_{d, (l)}^{T} Σ_{d, (l)}^{- 1} X_{d, (l)}$ represents the information matrix for design $d$ under the $l$th model. As well as the parameters varying between model specifications, the vectors $c_{l}$ and matrices $X_{(l)}$ and $Σ_{(l)}$ can vary between models, for example, there may be different specifications of time and covariance functions.

Dette⁴⁴ generalises the Elfving theorem for this robust criterion for models with uncorrelated observations. One can further generalise this theorem to the case where observations are correlated within experimental units following the results of Holland-Letz et al.²¹ and Sagnol.²² However, a specification for a program to solve this generalised problem using conic optimisation methods, extending the results of Sagnol in the single model case, is not currently available, and remains a topic for future research. An extension of the optimal mixed model weights method to robust optimal designs is similarly an open question.

Another robust c-optimality criterion is the weighted average:

f (d; A) = \sum_{l = 1}^{L} p_{l} c_{l}^{T} M_{d, l}^{- 1} c_{l}

(16)

Both this criterion and (15) can be used with the combinatorial search methods, since they are also supermodular and maintain the same theoretical guarantees. Following Dette,⁴⁴ we describe a design that minimises either of these criteria as being c-optimal for the class $A$ with respect to the prior $p$ .
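The two robust criteria can be evaluated for a candidate design as in the sketch below, which computes both the prior-weighted average of log variances, as in (15), and the prior-weighted average of variances, as in (16). The helper names, the three candidate EXC2 parameter sets, the equal prior, and the example ‘hybrid’ design are assumptions for illustration only.

```python
import numpy as np

def design_information(starts, T, icc, cac, r):
    """Information matrix of a design given each cluster's intervention start."""
    t = np.arange(T)
    # Assumed EXC2 covariance of cluster-period means (total variance 1).
    V = icc * cac * np.ones((T, T)) + (icc * (1 - cac) + (1 - icc) / r) * np.eye(T)
    M = np.zeros((T + 1, T + 1))
    for s in starts:
        X = np.column_stack([(t >= s).astype(float), np.eye(T)])
        M += X.T @ np.linalg.solve(V, X)
    return M

def robust_criteria(infos, cs, priors):
    """Return (weighted mean of log variances, weighted mean of variances)."""
    variances = np.array([c @ np.linalg.solve(M, c) for M, c in zip(infos, cs)])
    priors = np.asarray(priors, dtype=float)
    return float(priors @ np.log(variances)), float(priors @ variances)

T = 6
design = [0, 0, 0, 2, 3, 4, 5, 6, 6, 6]      # hypothetical hybrid design, 10 clusters
models = [(0.01, 0.8), (0.05, 0.8), (0.10, 0.5)]   # assumed (ICC, CAC) pairs
infos = [design_information(design, T, icc, cac, r=10) for icc, cac in models]
c = np.zeros(T + 1)
c[0] = 1.0
f_log, f_avg = robust_criteria(infos, [c] * len(models), [1 / 3] * len(models))
print(f_log, f_avg)
```

Either value could be passed to a combinatorial search in place of the single-model criterion. By Jensen's inequality the log-based criterion is never larger than the log of the weighted average, so the two can rank designs differently when variances differ greatly between models.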

3.7. Code examples

We have provided code samples and examples using the glmmrOptim package, including code to reproduce the figures in this article at https://samuel-watson.github.io/glmmr-web/other/optimal_examples/.

4. Results and examples

In this section, we provide a range of examples to illustrate the use of the methods and summarise results from several of the papers cited above. For the combinatorial algorithms, we use the reverse greedy algorithm. For multiplicative weighting methods, we select the best design from a variety of different rounding methods. Where applicable we also compare the results to those presented by Girling and Hemming.⁹

4.1. Clusters as experimental units

For the first set of examples, we consider Design Space A in Figure 1 with seven unique cluster sequences and six time periods. Our goal is to identify a design of $m = 10$ clusters. Each row is repeated up to five times in the design space, which is to say each sequence could be duplicated up to five times in the final design. Limiting the number of duplicate sequences to five, rather than 10, prevents the final design being, for example, a purely before and after design, while permitting parallel, stepped-wedge, and hybrid designs (although, we have not found a scenario where before-and-after design is optimal). Before and after designs may not be desirable as they lack any randomised comparison; treatment status will be correlated strongly with secular temporal trends, which is why they are unlikely to be optimal. We consider the linear mixed models given in equations (6) and (13) with EXC2 and AR1 covariance functions, respectively. The method proposed by Girling and Hemming is applicable in the EXC2 case (the scenario here is the same as that given in Figure 5 of Girling and Hemming⁹).

Figure 2 shows the results using the EXC2 covariance function with $m = 10$ individuals per cluster-period. The resulting designs for each set of covariance parameters are the same from each method, with only a couple of exceptions. However, the differences between the variances from the designs do not exceed 0.0001. In all cases, the design from the combinatorial method has the lowest variance. Figure 3 shows the results from the model with AR1 covariance function. As with EXC2, the designs are generally the same from both combinatorial and weighting methods, but where there is a difference, the combinatorial method produces a design with marginally lower variance. For both covariance functions, as the level of correlation within a cluster and between periods or the overall level of within cluster-period correlation gets higher, the degree of ‘staggering’ increases.

Figure 3.

Optimal study designs with 10 clusters and six time periods for different values of the ICC and autoregressive parameter $λ$ (‘lambda’) using a linear mixed model with AR1 covariance structure with $m = 10$ individuals per cluster-period. ‘Combin’ are results from the combinatorial local search run 100 times and selecting the best design and ‘Weight’ are designs produced by estimating experimental unit weights. The number is the estimator variance from the design. AR: auto-regressive; ICC: intra-cluster correlation coefficient.

The previous example assumes any design might be permissible within the design space. However, more restrictive design problems may be of interest given practical limitations on intervention roll out. As an example, we may require there to be only two trial arms within which all clusters receive the intervention at the same time. The question is then when each arm should receive the intervention (if at all). We can consider this problem as selecting two experimental units from Design Space A containing the seven experimental units in Figure 1, since the variance of this design is proportional to a design with $J$ clusters allocated 1:1 to each of the two sequences. Figure 4 shows the optimal two cluster sequences using combinatorial and weighting methods. The two methods agree for all parameter values with the AR1 covariance function, however, for the EXC2 function the weighting method produces designs with higher variance. For low values of the CAC or $λ$ and the ICC a parallel design is optimal. For higher values of these parameters, inclusion of baseline or endline observations in which both trial arms are in control or treatment states, respectively, is superior to a purely parallel design.
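Because the two-arm problem involves choosing only two of seven sequences, it is small enough to solve by exhaustive enumeration, which gives a useful check on any search algorithm. The sketch below evaluates the treatment effect variance for every pair of distinct sequences under an assumed EXC2 parameterisation (total variance fixed at one); the helper name and parameter values are hypothetical and the code is not taken from glmmrOptim.

```python
import numpy as np
from itertools import combinations

def pair_variance(starts, T, icc, cac, r):
    """Treatment effect variance for a design with one cluster per sequence."""
    t = np.arange(T)
    V = icc * cac * np.ones((T, T)) + (icc * (1 - cac) + (1 - icc) / r) * np.eye(T)
    M = np.zeros((T + 1, T + 1))
    for s in starts:
        # Intervention from period s onward; s == T means never treated.
        X = np.column_stack([(t >= s).astype(float), np.eye(T)])
        M += X.T @ np.linalg.solve(V, X)
    c = np.zeros(T + 1)
    c[0] = 1.0
    return float(c @ np.linalg.solve(M, c))

T = 6
results = {
    pair: pair_variance(pair, T, icc=0.05, cac=0.8, r=10)
    for pair in combinations(range(T + 1), 2)
}
best = min(results, key=results.get)
print(best, results[best])
```

Repeating the enumeration over a grid of covariance parameters would trace out how the best pair of sequences changes with the ICC and CAC.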

Figure 4.

Optimal study designs of two cluster sequences and six time periods for different values of the covariance parameters with the EXC2 and AR1 covariance functions. C = combinatorial local search. W = experimental unit weights. The number on each panel is the treatment effect estimator variance for the design. The rows are different values of the ICC. (a) EXC2 covariance function; (b) AR1 covariance function. AR: auto-regressive; ICC: intra-cluster correlation coefficient.

Figure 5.

Optimal study designs of 80 individuals with seven clusters and six time periods using a linear mixed model with EXC2 covariance structure with different values of the ICC (rows) and CAC (columns). Results from the combinatorial reverse greedy search (with up to 10 individuals per cluster-period) and optimal mixed model weights algorithms. The number for the left two columns is the estimator variance from the design. The number within each cell is the intervention status and the colour represents the number of observations (left two columns) or the weight (right two columns). ICC: intra-cluster correlation coefficient; CAC: cluster autocorrelation coefficient.

4.2. Single observations as experimental units

For the next examples, we specify a single observation as the experimental unit. The design space is as specified in Figure 1 with seven clusters and six time periods, and each cluster-period has 10 unique individuals who each contribute an observation. Using the combinatorial algorithms, our goal here is to select 80 observations of the 420 possible observations up to a maximum of 10 per cluster-period. The mixed model weights can also be calculated using Algorithm 1 for comparison. Figures 5 and 6 show the results for the EXC2 and AR1 covariance functions, respectively. In general, the levels of within cluster-period correlation (CAC or $λ$ ) appear to determine the optimal design, with higher levels resulting in greater number of observations placed along the main diagonal. Not all the designs are exactly symmetric, which may suggest the algorithm has not found the exactly optimal design.

Figure 6.

Optimal study designs of 80 individuals for different values of the ICC and $λ$ using a linear mixed model with AR1 covariance structure with different values of the ICC (rows) and autoregressive parameter $λ$ (columns). Results from the combinatorial reverse greedy search (with up to 10 individuals per cluster-period) and optimal mixed model weights algorithms. The number for the left two columns is the estimator variance from the design. The number within each cell is the intervention status and the colour represents the number of observations (left two columns) or the weight (right two columns). AR: auto-regressive; ICC: intra-cluster correlation coefficient.

4.3. Non-Gaussian models

For non-Gaussian models, we illustrate how the parameters $β$ affect the resulting optimal design. We consider the design problem given for the examples shown in Figures 5 and 6 with single observations as experimental units and Design Space A of Figure 1 with up to 10 individuals per cluster-period. We specify a binomial-logistic model. In all the examples, we use parameters $τ^{2} = 0.16$ and $ω^{2} = 0.04$ for EXC2 or $τ^{2} = 0.20$ and $λ = 0.8$ for AR1. The time period parameters are specified to give a control group mean outcome proportion of either 5%, 25%, or 50% and odds ratios for the six time periods of 0.8, 0.9, 1.0, 1.0, 1.1, and 1.2, respectively. The treatment effect is an odds ratio of either 0.5 or 1.5.

Figure 7 shows the optimal designs of 80 individuals for the binomial-logistic example using the combinatorial and optimal mixed model weight algorithms. When the base rate is low, the relative difference in individual-level variance between time periods is larger, and the resulting designs favour placing more observations in those later time periods. When the base rate is higher, the designs more closely resemble those from the linear model in Figures 5 and 6. The optimal weights suggest that when the base rate is low in this example, we should place all our efforts in the last periods; the combinatorial algorithms have specified a cap of 10 observations per cluster-period and so distribute the observations in the next-best cluster-periods.

Figure 7.

Optimal study designs of 80 individuals with 10 clusters and six time periods for different values of the base rate (rows) and intervention effect size (columns) with a binomial-logistic mixed model. Results from the combinatorial reverse greedy search (with up to 10 individuals per cluster-period) and optimal mixed model weights algorithms. The number for the left two columns is the estimator variance from the design. The number within each cell is the intervention status and the colour represents the number of observations (left two columns) or the weight (right two columns).

4.4. Robust optimal designs

To illustrate robust optimal designs, we consider the 18 models and parameter values represented by the panels of Figures 2 and 3. We assume that there is no prior knowledge of the likely values of the covariance parameters, nor the covariance function, and so assign equal prior weights to all 18 designs. We use the weighted average robust criterion (16), and run the local search algorithm 100 times, selecting the lowest variance design. The left panel of Figure 8 shows the resulting optimal design with respect to the equal weighting prior. Similarly to Girling and Hemming,⁹ the design is a ‘hybrid’ trial design with six of 10 clusters following a parallel trial design, and the remaining four a staggered implementation roll-out. We also identify a robust optimal design for individual experimental units with the 18 designs shown in Figures 5 and 6 using the same procedure. The resulting design is shown in the right-hand panel of Figure 8.

Figure 8.

Robust optimal study designs of 80 individuals with 10 clusters and six time periods with respect to a prior that weights each possibility from earlier examples equally. Results from the combinatorial local search run 100 times and selecting the best design. The left panel is for a design space with clusters as experimental units, and the right panel where individuals are experimental units. The numbers in the cells on the right panel show the intervention status.

5. Discussion and conclusions

5.1. Comparison of algorithms

The correlation between observations in a cluster randomised trial setting complicates identification of optimal study designs. Indeed, there have been relatively few studies on the topic of optimal cluster trial designs, particularly when compared with individual-level randomised controlled trials. However, recent methodological advances provide several approaches for approximating c-optimal designs with correlated observations.

We have discussed three different types of method within a general framework for cluster trials: using exact formulae for specific model specifications and design spaces, together with an algorithm or enumeration and evaluation of multiple relevant designs; determining weights to place on each experimental unit in a design space; and combinatorial algorithms for selecting an optimal subset of experimental units. These categories are not exhaustive and new methods may be developed using novel approaches. Each of the three types of method has its advantages and disadvantages. Minimising exact functions for the estimator variance would be preferable, but explicit formulae are only available in the simpler cases. Many authors (e.g. Girling and Hemming,⁹ Zhan et al.,¹⁵ and Lawrie et al.¹³) consider the linear mixed model with cluster and cluster-period exchangeable random effects, for example. The combinatorial algorithms produced the lowest variance design in all the examples we considered where we could compare methods, but were generally more computationally demanding, especially when one takes into account the suggestion to run the algorithm multiple times and select the best design. The optimal mixed model weights algorithm identifies the optimal weights for each cluster-period, although it may not produce an exact design when rounding the totals. The optimal mixed model weights algorithm is much faster to run than other generic algorithms. For the examples presented in Figures 5 and 6, the reverse greedy search took around 1 min, the local search 10 s, and the model weights 50 ms. In many circumstances, it is difficult or impractical to specify an exact number of individuals, and so weights would be sufficient, in which case the mixed model weights are likely the best choice given their efficiency. However, for more complex design problems, such as setting maximum or minimum number of observations in different cluster-periods, the combinatorial approaches may be required.

5.2. Small sample bias

A well recognised issue for cluster trials, and GLMMs in general, is that the generalised least squares estimator of the standard errors of $β$ in equation (2) exhibits a small sample bias. The standard errors for $\hat{β}$ are underestimated when the number of clusters is small (see, e.g. Leyrat et al.,⁴⁶ Kahan et al.,⁴⁷ and Watson et al.⁴⁸). All of the examples given in this article may well suffer from this issue. There are two reasons for the bias. First, the information matrix $M_{d}$ is estimated in practice by evaluating the covariance matrix at the estimated values of the covariance parameters. The GLS estimator (2) does not account for this additional variability from estimating the covariance parameters. Second, the estimator for the information matrix is itself a biased estimator of the variance. Kackar and Harville⁴⁹ described an approximation to the small sample variance of $\hat{β}$ for linear mixed models that accounts for the estimation of the covariance parameters and Kenward and Roger⁵⁰ extended this approximation to also account for the bias. One might consider therefore using these ‘corrected’ estimators in place of the generalised least squares information matrix in the optimality criterion. However, it is not clear whether this approach would perform well or not; both corrections are first-order approximations and can exhibit behaviour that may undermine the performance of the algorithms. For example, in exploratory testing we found it was possible, while using fixed covariance parameter values, for a smaller design to have a marginally smaller ‘corrected’ variance than a larger design. The algorithms produced similar, but not identical, ‘optimal’ designs using these corrected matrices though. Optimal designs with small sample corrections thus remain an important topic for future research in this area.

5.3. Usefulness of optimal designs

Optimal designs are not always practical. For example, many of the designs in Figures 5 to 8 where the experimental unit was the individual included cluster-periods with a single individual. It is very unlikely that this would ever be implementable in practice given the logistics of data collection within clusters such as hospitals, clinics, or schools. However, one can view these optimal designs as a benchmark against which to justify a chosen study design. Hooper¹⁰ suggests that there is a common misconception among cluster trial practitioners that the stepped-wedge design is more efficient than a parallel trial. The results of Girling and Hemming,⁹ which are replicated in Figure 2, and others show that this is not the case. The most efficient design depends on the covariance parameters, and in the case of a non-linear model, the parameters in the linear predictor too. Indeed, a useful heuristic that emerges from these results is that the less variable the cluster means over time, the more ‘variable’ the intervention should be (i.e. more staggered over time). Identifying an optimal design can help design a practicable trial that is more efficient than might otherwise be considered. Where individual-level experimental units are used, it can identify which cluster-periods to exclude entirely and which to place more effort into. Kasza et al.³⁶ proposed just such an approach based on a ‘reverse greedy’ type algorithm.

The framework we use to present these methods requires enumeration of all the unique experimental units. For more complex design problems the design space can then become very large. For example, Hooper et al.¹⁶ used a discrete approximation to a continuous time model, and aim to identify when a cluster should start and stop recruiting and when it should implement the intervention. There is a very large number of possible cluster sequences that would fit within this design space given the large number of time increments, even with the no reversibility and symmetric restrictions they use. Enumerating the complete design space and subjecting it to one of the algorithms above would likely be highly computationally demanding. Indeed, this issue raises the question of how one might approach cluster trial optimal design questions with continuous time. Other examples in the literature in which a treatment variable is potentially continuous have relied on selecting a small number of discrete possible values; the finer the discretisation the better the result.⁸ Extending this to a larger number of possible conditions, or treating time as truly continuous, thus remains a topic of future research. One potentially useful approach for continuous time models may be ‘particle swarm’ optimisation and other ‘nature inspired’ methods.⁵¹

5.4. Bayesian optimal designs

We have also not considered Bayesian optimal design. While Bayesian methods are relatively rarely used for the design and analysis of cluster randomised trials, there is a growing number of examples (e.g. at Work Wellbeing Programme Collaboration⁵²). Chaloner⁵³ provides a review of Bayesian optimal experimental design criteria. Bayesian optimal designs are based on maximising a utility function for the experiment. The resulting optimality criteria though are highly similar to their Frequentist counterparts, but they introduce the added complexity of needing to integrate over the prior distributions of the model parameters. There have been several methodological advances and new algorithms proposed for identifying Bayesian optimal experimental designs. For example, Overstall and Woods⁵⁴ provided perhaps the most general solution to this problem for non-linear Bayesian models. The algorithms in this article might also be used to find approximate solutions to Bayesian cluster trial design problems. For example, the robust criterion (16) could be translated to a Bayesian context where the weights are derived using a Riemann sum approximation to the integral over the prior distributions.²⁹ However, further research is required into Bayesian methods for the design and analysis of cluster randomised trials.

5.5. Conclusion

The final choice of study design for a cluster randomised trial results from the confluence of a range of practical, financial, and statistical considerations. However, there is an ethical obligation to try to minimise the sample size required to achieve a research objective. Methods to identify optimal or approximately optimal study designs therefore serve a useful purpose where there is flexibility in the roll out of an intervention. We have identified several methods relevant to cluster randomised trials, which can be used on a standard computer in a short amount of time. We would therefore suggest that examining the optimal trial design should be a step in the design of every cluster randomised trial.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported with funding from the Medical Research Council MR/V038591/1.

ORCID iDs

Samuel I Watson

Karla Hemming

References

1. Eldridge, Kerry. A practical guide to cluster randomised trials in health services research. Chichester, UK: John Wiley & Sons, Ltd, 2012. ISBN 9781119966241. DOI: 10.1002/9781119966241.

2. Murray. Design and analysis of group randomised trials. New York, NY: Oxford University Press Inc., 1998.

3. Hooper, Teerenstra, de Hoop, et al. Sample size calculation for stepped wedge and other longitudinal cluster randomised trials. Stat Med 2016; 35: 4718–4728.

4. Hemming, Kasza, Hooper, et al. A tutorial on sample size calculation for multiple-period cluster randomized parallel, cross-over and stepped-wedge trials using the Shiny CRT calculator. Int J Epidemiol 2020; 49: 979–995.

5. Hughes, Hemming, et al. Mixed-effects models for the design and analysis of stepped wedge cluster randomized trials: an overview. Stat Methods Med Res 2021; 30: 612–639.

6. Atkinson, Donev, Tobias. Optimum experimental design, with SAS. Clarendon, 2007. https://global.oup.com/academic/product/optimum-experimental-designs-9780198522546?cc=gb&lang=en&

7. Berger MPF, Wong. An introduction to optimal designs for social and biomedical research. Wiley, 2009. https://onlinelibrary.wiley.com/doi/book/10.1002/9780470746912

8. Yang, Biedermann, Tang. On optimal designs for nonlinear models: a general and efficient algorithm. J Am Stat Assoc 2013; 108: 1411–1420.

9. Girling, Hemming. Statistical efficiency and optimal design for stepped cluster studies under linear mixed effects models. Stat Med 2016; 35: 2149–2166.

10. Hooper, Eldridge. Cutting edge or blunt instrument: how to decide if a stepped wedge design is right for you. BMJ Qual Saf 2021; 30: 245–250.

11. Watson, Girling, Hemming. Design and analysis of three-arm parallel group randomised trials with small numbers of clusters. Stat Med 2021; 40: 1133–1146.

12. Hussey, Hughes. Design and analysis of stepped wedge cluster randomized trials. Contemp Clin Trials 2007; 28: 182–191.

13. Lawrie, Carlin, Forbes. Optimal stepped wedge designs. Stat Probab Lett 2015; 99: 210–214.

14. Woertman, de Hoop, Moerbeek, et al. Stepped wedge designs could reduce the required sample size in cluster randomized trials. J Clin Epidemiol 2013; 66: 752–758.

15. Zhan, de Bock, van den Heuvel. Optimal unidirectional switch designs. Stat Med 2018; 37: 3573–3588.

16. Hooper, Copas. Optimal design of cluster randomised trials with continuous recruitment and prospective baseline period. Clinical Trials 2021; 18: 147–157.

17. Copas, Hooper. Cluster randomised trials with different numbers of measurements at baseline and endline: sample size and optimal allocation. Clinical Trials 2020; 17: 69–76.

18. Moerbeek. Optimal designs for group randomized trials and group administered treatments with outcomes at the subject and group level. Stat Methods Med Res 2020; 29: 797–810.

19. Lemme, van Breukelen, Candel. Efficient treatment allocation in $2 \times 2$ multicenter trials when costs and variances are heterogeneous. Stat Med 2018; 37: 12–27.

20. Elfving. Optimum allocation in linear regression theory. The Annals of Mathematical Statistics 1952; 23: 255–262.

21. Holland-Letz, Dette, Pepelyshev. A geometric characterization of optimal designs for regression models with correlated observations. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2011; 73: 239–252.

22. Sagnol. Computing optimal designs of multiresponse experiments reduces to second-order cone programming. J Stat Plan Inference 2011; 141: 1684–1708.

23. Holland-Letz, Dette, Renard. Efficient algorithms for optimal designs with correlated observations in pharmacokinetics and dose-finding studies. Biometrics 2012; 68: 138–145.

24. Balinski, Young. Fair representation: meeting the ideal of one man, one vote. 2nd ed. Washington D.C.: Brookings Institution Press, 2002. ISBN 0-8157-0111-X.

25. Pukelsheim, Rieder. Efficient rounding of approximate designs. Biometrika 1992; 79: 763–770.

26. Fedorov. Theory of optimal experiments. New York: Academic Press, 1972.

27. Hemming, Haines, Chilton, et al. The stepped wedge cluster randomised trial: rationale, design, analysis, and reporting. BMJ 2015; 350: h391.

28. Pukelsheim. Optimal design of experiments. 2nd ed. Society for Industrial and Applied Mathematics, 2006. ISBN 978-0-89871-604-7. DOI: 10.1137/1.9780898719109.

29. Watson, Pan. Approximate c-Optimal Experimental Designs with Correlated Observations using Combinatorial Optimisation. Statistics and Computing 2022; http://arxiv.org/abs/2207.09183. (In press).

30. Sviridenko, Vondrák, Ward. Optimal approximation for submodular and supermodular optimization with bounded curvature. Mathematics of Operations Research 2017; 42: 1197–1218.

31. Wynn. The sequential generation of $D$-optimum experimental designs. The Annals of Mathematical Statistics 1970; 41: 1655–1664.

32. Fisher, Nemhauser, Wolsey. An analysis of approximations for maximizing submodular set functions—II. Math Program 1978; 15: 265–294.

33. Nemhauser, Wolsey. Best algorithms for approximating the maximum of a submodular set function. Mathematics of Operations Research 1978; 3: 177–188.

34. Filmus, Ward. Monotone submodular maximization over a matroid via non-oblivious local search. SIAM Journal on Computing 2014; 43: 514–542.

35. Il’ev. An approximation guarantee of the greedy descent algorithm for minimizing a supermodular set function. Discrete Appl Math 2001; 114: 131–146.

36. Kasza, Forbes. Information content of cluster-period cells in stepped wedge trials. Biometrics 2019; 75: 144–152.

37. Hooper, Kasza, Forbes. The hunt for efficient, incomplete designs for stepped wedge trials with continuous recruitment and continuous outcome measures. BMC Med Res Methodol 2020; 20: 279.

38. Waite, Woods. Designs for generalized linear models with random block effects via information matrix approximations. Biometrika 2015; 102: 677–693.

39. Breslow, Clayton. Approximate inference in generalized linear mixed models. J Am Stat Assoc 1993; 88: 9–25.

40. McCullagh, Nelder. Generalized linear models. 2nd ed. Routledge, 1989. https://www.routledge.com/Generalized-Linear-Models/McCullagh-Nelder/p/book/9780412317606

41. Zeger, Liang, Albert. Models for longitudinal data: a generalized estimating equation approach. Biometrics 1988; 44: 1049–1060.

42. Moerbeek, Maas CJM. Optimal experimental designs for multilevel logistic models with two binary predictors. Communications in Statistics - Theory and Methods 2005; 34: 1151–1167.

43. van Breukelen, Candel. Efficient design of cluster randomized and multicentre trials with unknown intraclass correlation. Stat Methods Med Res 2015; 24: 540–556.

44. Dette. Elfving’s theorem for $D$-optimality. Ann Stat 1993; 21: 753–766.

45. Läuter. Experimental design in a class of models. Mathematische Operationsforschung und Statistik 1974; 5: 379–398.

46. Leyrat, Morgan, Leurent, et al. Cluster randomized trials with a small number of clusters: which analyses should be used? Int J Epidemiol 2018; 47: 321–331.

47. Kahan, Forbes, Ali, et al. Increased risk of type I errors in cluster randomised trials with small or medium numbers of clusters: a review, reanalysis, and simulation study. Trials 2016; 17: 438.

48. Watson, Girling, Hemming. Design and analysis of three-arm parallel group randomised trials with small numbers of clusters. Stat Med 2021; 40: 1133–1146.

49. Kackar, Harville. Approximations for standard errors of estimators of fixed and random effects in mixed linear models. J Am Stat Assoc 1984; 79: 853–862.

50. Kenward, Roger. Small sample inference for fixed effects from restricted maximum likelihood. Biometrics 1997; 53: 983.

51. Chen, Wong. Particle swarm optimization for searching efficient experimental designs: a review. WIREs Comput Stat 2022; 14: e1578.

52. Thrive at Work Wellbeing Programme Collaboration. Evaluation of a policy intervention to promote the health and wellbeing of workers in small and medium sized enterprises – a cluster randomised controlled trial. BMC Public Health 2019; 19: 493.

53. Chaloner, Verdinelli. Bayesian experimental design: a review. Stat Sci 1995; 10: 273–304.

54. Overstall, Woods. Bayesian design of experiments using approximate coordinate exchange. Technometrics 2017; 59: 458–470.

Optimal study designs for cluster randomised trials: An overview of methods and results
