Abstract

Classical automated test assembly (ATA) methods assume fixed and known coefficients for the constraints and the objective function. This hypothesis is not true for the estimates of item response theory parameters, which are crucial elements in test assembly classical models. To account for uncertainty in ATA, we propose a chance-constrained version of the maximin ATA model, which allows maximizing the α-quantile of the sampling distribution of the test information function obtained by applying the bootstrap on the item parameter estimation. A heuristic inspired by the simulated annealing optimization technique is implemented to solve the ATA model. The validity of the proposed approach is empirically demonstrated by a simulation study. The applicability is proven by using the real responses to the Trends in International Mathematics and Science Study (TIMSS) 2015 science test.

Keywords

automated test assembly, uncertainty, chance-constrained, simulated annealing

In educational measurement, tests should be designed and developed providing evidence of fairness, reliability, and validity (American Educational Research Association et al., 2014). To meet these requirements, a test assembly process should be employed to perform an optimal selection of items from an item bank. In addition to producing test forms that conform to the content and psychometric specifications, a test assembly process ensures that the resulting ability measurements can be trusted and interpreted in a transparent way. Moreover, it can produce comparable measurements in operational settings, where various parallel versions of tests are needed. Furthermore, test assembly plays a crucial role in ability assessment as it lies at the basis of the entire test production process: from the earlier stages of item creation to the selection of items for building the test forms. In detail, the requirements of the final tests specified in the test assembly model not only determine the structure of the test forms but also define the composition of the item pool (Ariel & van der Linden, 2006), guiding the item writing process.

In the last decades, the simplified access to modern digital resources such as sophisticated item banking systems opened the possibility of improving the manual test assembly process through automated test assembly (ATA). The introduction of ATA dramatically improved the quality of the test forms and simplified the test assembly process, especially for large testing programs.

ATA differs from the manual process because the item selection is performed by optimizing mathematical models through specific software called solvers. Automation has brought many advantages over manual test assembly. First of all, a rigorous definition of test specifications reduces the need to repeat some phases of the test development. Secondly, ATA is the only way to find multiple optimal or near-optimal combinations of items starting from large item banks, despite the computational complexity of the task. Thus, ATA is fundamental to making measurements comparable while simultaneously reducing operational costs.

In ATA, mathematical optimization models such as 0–1 linear programming (LP) models (see van der Linden, 2005) are usually applied. These classical models use the item information functions (IIFs) as linear coefficients for the decision variables, which are kept fixed throughout the entire optimization process. However, it is well known that the IIFs are derived from the item parameters estimated within the item response theory (IRT) framework. Consequently, the IIFs should be considered uncertain inputs in the ATA models. Many papers (e.g., Mislevy et al., 1994; Patton et al., 2014; Tsutakawa & Johnson, 1990; Xie, 2019; Zhang et al., 2011; Zheng, 2016) discussed the consequences of uncertainty in item parameters on several aspects of educational measurement, such as the accuracy of ability estimation. However, relatively few studies focused on this issue in the ATA research field. In particular, De Jong et al. (2009), Veldkamp (2013), Veldkamp et al. (2013), Veldkamp and Paap (2017), and Veldkamp and Verschoor (2019) proposed robust alternatives to the classical optimization models. These papers focus on the assembly of single test forms only.

In this article, we propose incorporating the uncertainty in the optimization model for simultaneous multiple test assembly, which is the most applied and discussed ATA model in the literature (Ali & van Rijn, 2016; Debeer et al., 2017; van der Linden, 2005). In more detail, we suggest a test assembly model based on the chance-constrained (CC) approach (see Charnes & Cooper, 1959; Charnes et al., 1958), namely, the CCATA model, by which the $α$-quantile of the sampling distribution of the test information function (TIF) is maximized. The proposed model extends the classical maximin ATA model (van der Linden, 2005, pp. 69–70). The sampling distribution of the TIF is obtained by applying the bootstrap technique (Bradley & Tibshirani, 1993) during the estimation of item parameters, that is, the item calibration. In this way, we ensure that, independently of the calibration conditions, there is a high probability of attaining a given, possibly low, error in the ability estimation (or, equivalently, a high TIF). The main novelty of our model is that it takes into account the observed structure of the uncertainty of the item parameters and, in this light, produces optimal tests with the highest accuracy of the ability estimates. The validity of the proposal is assessed by comparing our method with other existing approaches in a simulation study.

For solving the CCATA model, we developed an algorithm based on simulated annealing (SA), a stochastic meta-heuristic (see Goffe, 1996). The added value of this technique is the possibility of handling large models, characterized by many optimization variables and constraints, as well as nonlinear functions. All the proposed algorithms have been coded in the open-source programming language Julia (Bezanson et al., 2017) and are free to use, as they do not rely on commercial software.

This article is organized as follows. First, the key elements of IRT and ATA are reviewed. The following section discusses the issues arising from uncertainty in IRT and ATA models. Subsequently, an introduction to the CC approach for solving optimization problems with uncertainty is provided. Then, a CC version of the maximin ATA model is proposed. The retrieval of the TIF empirical distribution and the development of a heuristic based on SA for solving the model are discussed in the same section. Afterward, the results of a simulation study are presented in order to compare our proposal to the existing approaches solved by the CPLEX 12.10.0 Optimizer (IBM, 2019). An application of our approach to real data taken from the 2015 TIMSS data is shown. Some concluding remarks and suggestions for applying the CCATA model end this article.

IRT and Test Assembly Models

In educational and psychological measurement, IRT modeling provides several methods to estimate the item parameters. When the goal is to produce test forms with the highest accuracy in ability estimation, IRT is a solid foundation for ATA methods because the Fisher information function, which is a key object in test assembly, is derived from the item parameter estimates. Given an IRT model, once the items have been calibrated, it is possible to evaluate how informative the test is at various ranges of the latent ability using the TIF, which is defined as the sum of the Fisher information of all the items in the test (equivalently, the inverse of the asymptotic variance of the maximum likelihood estimator of the ability $θ$). Hence, the TIF has two very favorable properties: additivity (i.e., linearity) over the test items and ease of interpretation. Formally, for a given test with $n$ items and ability $θ \in (-\infty, \infty)$, the TIF is equal to

$$\mathrm{TIF}(θ) = \sum_{i=1}^{n} I_{i}(θ), \tag{1}$$

where $I_{i}(θ)$ is the IIF for item $i$ computed at $θ$. Expressions for the IIFs can be easily derived within the framework of IRT. For example, if we assume binary response data, where the probability $P_{i}(θ)$ of item $i$ endorsement follows the two-parameter logistic (2PL) model, the IIF of item $i$ is equal to

$$I_{i}(θ) = a_{i}^{2} P_{i}(θ) (1 - P_{i}(θ)) = a_{i}^{2} \frac{\exp(a_{i} θ + b_{i})}{[1 + \exp(a_{i} θ + b_{i})]^{2}}. \tag{2}$$

The item parameters $a_{i}$ and $b_{i}$ represent the discrimination and the intercept for item $i$, respectively.¹
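As a minimal sketch of the two formulas above, the following Python functions evaluate the 2PL IIF in the slope-intercept parameterization and sum it over the items of a test. The function names and the example parameter values are illustrative assumptions and do not come from the authors' software.

```python
import math

def prob_2pl(theta, a, b):
    """Endorsement probability under the 2PL model in slope-intercept form."""
    return 1.0 / (1.0 + math.exp(-(a * theta + b)))

def iif_2pl(theta, a, b):
    """Item information function: a^2 * P * (1 - P)."""
    p = prob_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def tif(theta, items):
    """Test information function: sum of the IIFs of the items in the test."""
    return sum(iif_2pl(theta, a, b) for a, b in items)

# Hypothetical (a_i, b_i) pairs, chosen only for illustration.
example_items = [(1.0, 0.0), (2.0, 0.0), (0.8, -0.5)]
print(tif(0.0, example_items))
```

Because the TIF is a plain sum, adding or removing an item changes it by exactly that item's IIF, which is what makes it usable as a linear coefficient in the assembly models.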

From a general point of view, an ATA model is an optimization model consisting of an objective function to be maximized or minimized and a set of constraints to be satisfied. Specific objective functions may be related to psychometric features of the test, such as the maximization of the TIF at given cutoff scores, or to test content or other test requirements, such as the minimization of the total testing time. Examples of constraints include the test length, the restriction on the number of items of a certain type, test overlap, and so on. Altogether, they represent the test specifications, which should be defined in the standard form of Table 1 (van der Linden, 2005, p. 40), before being translated into an ATA model.

Table 1.

Standard Form of a Test Assembly Problem

Optimize	Objective function
subject to
	Constraint 1
	Constraint 2
	$⋮$
	Constraint J

Only one objective can be optimized at a time. If we have more than one function to optimize, some tricks can be applied to transform the objectives into constraints (Veldkamp, 1999), such as the maximin paradigm. On the other hand, there is no upper limit for the number of constraints, provided that the solver can handle the problem (Spaccapanico et al., 2020). If at least one combination of items that meets all the constraints does exist, then the set of these combinations is the feasible set; otherwise, if this set is empty, the model is said to be infeasible. The subset of the feasible set that optimizes the objective function represents the optimal feasible solution.

Tests can be assembled merely through the selection of appropriate items out of an item bank. One way to do so is to use mathematical programming techniques like 0–1 LP or mixed integer programming models and optimize them with commercial solvers, such as CPLEX (IBM, 2019) or Gurobi (Gurobi, 2018). Following the mentioned approaches, it is possible to assemble a set of tests that meet some (mostly linear) constraints maximizing their TIFs (see van der Linden, 2005). For example, given an item pool of size I, a commonly used objective for ATA models maximizes the TIFs of T tests at K ability points:

$$\text{maximize} \quad \sum_{i=1}^{I} I_{i}(θ_{kt}) x_{it}, \quad \forall t, k, \quad \text{(objective)} \tag{3}$$

with $t = 1, \dots, T$, and $k = 1, \dots, K$. $I_{i}(θ_{kt})$ is the IIF for item $i$ at abilities $θ_{kt}$, the set of ability points for which we want to control the shape of the TIFs, and $x_{it}$ is a decision variable taking value 1 if the item $i$ is assigned to test $t$ and 0 otherwise. Depending on the application scenario, the $K$ ability points may be chosen within a limited set of values around the mean of the population ability. A common choice is to maximize the TIF at $θ = 0$, which is generally the population’s average ability.

Since model (3) has $T \times K$ objectives, it cannot be solved without resorting to multiobjective programming methods (Deb et al., 2016). Therefore, the maximin paradigm is applied instead. Given an item pool of $I$ items, the maximin approach maximizes a lower bound $y$ of the TIFs, that is, it maximizes the minimum TIF among all the tests and ability points. The maximin ATA model is specified by the following objective and set of constraints:

$$\text{maximize} \quad y \quad \text{(objective)} \tag{4a}$$

subject to

$$\sum_{i=1}^{I} I_{i}(θ_{kt}) x_{it} \geq y, \quad \forall t, k, \tag{4b}$$

$$y \geq 0, \tag{4c}$$

where $y$ is the lower bound for the TIF, so that all the considered TIFs are equal to or higher than this value. In this way, the previous objectives are transformed into $T \times K$ constraints, and only one objective appears in the model.
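To make the maximin principle concrete, the sketch below enumerates every pair of non-overlapping tests of a fixed length from a tiny pool and keeps the pair whose smaller TIF is largest. It assumes a single ability point, two tests, and no further constraints. The function name and the IIF values are hypothetical, and brute-force enumeration is used only for illustration, since real instances require a solver or a heuristic.

```python
from itertools import combinations

def maximin_two_tests(iif, length):
    """Return the best lower bound y and the pair of disjoint tests attaining it."""
    items = range(len(iif))
    best_y, best_pair = float("-inf"), None
    for t1 in combinations(items, length):
        rest = [i for i in items if i not in t1]
        for t2 in combinations(rest, length):
            # y is the minimum TIF over the two tests (the maximin lower bound).
            y = min(sum(iif[i] for i in t1), sum(iif[i] for i in t2))
            if y > best_y:
                best_y, best_pair = y, (t1, t2)
    return best_y, best_pair

# Hypothetical IIF values at a single ability point.
iif_values = [0.9, 0.8, 0.5, 0.4, 0.3, 0.1]
y, tests = maximin_two_tests(iif_values, length=2)
print(y, tests)
```

In this toy pool, putting the two most informative items in the same test would leave the other test with a much lower TIF, so the maximin solution spreads the informative items across both forms.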

To describe the structure of the test forms, or to address security concerns, extra inequalities must often be added to the model. For instance, it may be required to specify a minimum number of items in a given category (e.g., content domain or item type) or to limit the item use among the test forms.

Uncertainty in Test Assembly

In the classical context of test assembly, the optimization models used for item selection do not consider the uncertainty of the estimates of item parameters (van der Linden, 2005). For example, the maximin ATA model is based on the TIF, which is the quantity that the optimization model seeks to maximize. The TIF is the sum of the IIFs of the items in the test form and depends on the item parameter estimates, which are generally considered fixed quantities. Nevertheless, ignoring the uncertainty derived from the estimation process may lead to several issues, such as the misinterpretation of the psychometric properties of the assembled test forms. When the calibration algorithm produces biased estimates for the item parameters, the IIFs are not accurate enough, and, consequently, the TIF of the assembled test might be underestimated or overestimated. Veldkamp et al. (2013) found that, for large uncertainties, the decrease of information in robust test assembly can reach 37%. As a consequence, the perceived accuracy of ability estimates may be compromised. To address the latter issue in particular, a good test assembly model would consider the variation of item parameter estimates in order to build test forms in a conservative manner, that is, it would produce tests with a maximum plausible lower bound of the TIF.

Several attempts to incorporate uncertainty in the test assembly models have been made, mostly by proposing robust approaches. Starting from the conservative approach of Soyster (1973), where the maximum level of uncertainty is considered for 0–1 LP optimization, De Jong et al. (2009) proposed a modified version, where one posterior standard deviation is subtracted from the estimated Fisher information to take the calibration error into account. This approach was also adopted in Veldkamp et al. (2013), where the consequences of ignoring uncertainty in item parameters are studied for ATA models. In addition, Veldkamp (2013) investigated the approach of Bertsimas and Sim (2003), who developed a robust method for LP models by including uncertainty only for some parameters in the assembly of linear test forms. More recently, Veldkamp and Paap (2017) proposed to include the uncertainty related to the violation of the assumption of local independence in ATA for testlets. Finally, Veldkamp and Verschoor (2019) discussed robust alternatives for both ATA and computerized adaptive testing.

The mentioned ATA robust approaches consider the standard error of the estimates and a protection level $Γ$ that indicates how many items in the model are assumed to be changed in order to affect the solution (Bertsimas & Sim, 2003). In this sense, the uncertainty is treated in a deterministic way, and, given $Γ$ , the solution is adjusted by adopting a highly conservative approach, as standard errors are the maximum expression of uncertainty of the estimates.

A reasonable solution to the mentioned problems appears to be the use of chance-constraints (or probabilistic constraints). In fact, they are among the first extensions proposed in the stochastic programming framework to deal with constraints in which some parameters are uncertain (Charnes & Cooper, 1963; Krokhmal et al., 2002).

Chance-Constrained Modeling

The CC approach (Charnes & Cooper, 1959; Charnes et al., 1958) is a method for optimization problems with uncertainty, where a conservative parameter $α$, the risk level, modulates the level of fulfillment of probabilistic constraints. CC modeling has been deeply explored in the financial field, especially in risk management and reliability applications. In this context, the decision-maker must select a combination of assets for building a portfolio by maximizing their utility function (see Chen, 1973; Freund, 1956; Rockafellar & Uryasev, 2000, 2001; Scott & Baker, 1972).

More recently, this problem was formulated in terms of percentiles of loss distributions, building on the theory of chance-constraints originally proposed by Charnes and Cooper (1959).

Probabilistic constraints include parameters assumed to be randomly distributed and subject to some predetermined threshold $α$, defined in the interval $[0, 1]$, controlling their fulfillment. By modifying $α$, it is possible to relax or tighten some constraints, modulating the level of the conservativeness of the model. To introduce the formal representation of a CC model, we start with the standard form of a mixed-integer optimization model:

$$\begin{aligned} \max_{x} \quad & f(x), \\ \text{subject to} \quad & g_{j}(x) \leq 0, \quad j = 1, \dots, J, \\ & x \in ℤ^{p} \times ℝ^{q}, \end{aligned} \tag{5}$$

where $f(\cdot)$ is the objective function to be optimized, $g_{j}(\cdot)$ is the function expressing constraint $j$, $J$ is the number of constraints, and $x$ is the vector of $p$ integer and $q$ continuous optimization variables. Both $f(\cdot)$ and $g_{j}(\cdot)$ are scalar functions.

The optimization domain is $D = \mathrm{dom}(f) \cap \bigcap_{j=1}^{J} \mathrm{dom}(g_{j})$, and the set $X = \{x : x \in D, g_{j}(x) \leq 0, \forall j\}$ is the feasible set, which means that a solution $x$ is feasible if it is in the optimization domain and it satisfies all the constraints. Thus, a CC reformulation of the optimization problem adds to model (5) the following set of $H$ probabilistic constraints:

$$ℙ[g_{h}(x, ξ) \leq 0] \geq 1 - α, \quad h = 1, \dots, H, \tag{6}$$

where $ξ$ is a vector of random variables, which represent the uncertain parameters. This formulation seeks a decision vector $x$ that maximizes the function $f(x)$ while satisfying the chance-constraints $g_{h}(x, ξ) \leq 0$ with probability at least equal to $(1 - α)$.

CC models represent a fully customizable robust approach to optimization. However, although they were proposed in the 1950s, they are still hard to solve. A major issue is the general nonconvexity of the probabilistic constraints: even when the original deterministic constraints $g_{h}(x, ξ)$ with nonrandom $ξ$ are convex, the respective chance-constraints may be nonconvex. Moreover, the chance-constraints are usually intractable because the quantiles of the random variables are difficult or impossible to compute (see Nemirovski & Shapiro, 2006) or involve nonconvex functions. Several methods of approximating the chance-constraints have been proposed in the literature (see Ahmed & Shapiro, 2008; Kataria et al., 2010; Margellos et al., 2014; Song et al., 2014; Tarim et al., 2006; Wang et al., 2011).

Chance-Constrained Automated Test Assembly

In order to develop a conservative approach that incorporates the uncertainty of item parameters into the ATA model, we propose a stochastic optimization approach for the maximin test assembly model based on the CC method. Under this approach, the TIF is not considered a fixed quantity but a random variable. As explained further on, the distribution of the TIF is retrieved by using the bootstrap technique. Whenever a maximin principle is applied, the CC model can be seen as a percentile optimization problem (Krokhmal et al., 2002). In fact, the probability in the inequality (6) is replaced by the $α$-quantile of the distribution function of $g_{h}(x, ξ)$, and this quantile is maximized. In our case, $ξ$ is the vector of the IIFs, and the function $g(\cdot)$ is the summation over items.

By considering the maximin model (4a), the constraints (4b) involved in the maximization of the TIF are replaced by the CC equivalents as follows:

$$ℙ\left[\sum_{i=1}^{I} I_{i}(θ_{kt}) x_{it} \geq y\right] \geq 1 - α, \quad \forall t, k, \tag{7}$$

where $t = 1, \dots, T$ indexes the tests to be assembled, and $θ_{kt}$ are the ability points at which the TIF of the test form $t$ must be maximized, with $k = 1, \dots, K$. Usually, these points are chosen within a limited set of values around the mean of the population’s ability. Commonly, $θ = 0$ is chosen, that is, the TIF is peaked at $θ = 0$, so that the expected standard error of ability estimates at this ability point is reduced. Finally, $α$ is a real-valued parameter defined in the interval $[0, 1]$. In the proposed approach, the chance-constraints are imposed independently of each other. We call model (7) the CC maximin ATA model, or briefly CCATA. Again, the key element of this model is that the information function is assumed to be random.

The CCATA model maximizes the expected precision of the assembled tests in estimating the latent trait values of the test-takers at the predetermined ability points, with a high confidence level if $α$ is chosen to be close to zero. In probabilistic terms, the constraints (4b) must be fulfilled with a probability of at least $(1 - α)$. By adjusting the confidence level $(1 - α)$, it is possible to relax or tighten the attainment of the chance-constraints to reflect a specific degree of conservativeness: a small $α$ means a high level of conservativeness, whereas a large $α$ means an almost complete relaxation of the constraints. The introduction of a confidence level is one of the most relevant novelties of the CCATA model compared to the robust approach proposed by Veldkamp (2013) and Veldkamp et al. (2013), who, instead, performed a worst-case optimization.

Once the chance-constraints have been defined, a method to compute the probability appearing in inequality (7) must be found. A possible solution is to make assumptions on the probability distribution of $ξ$, such as multivariate normality (Kim et al., 1990). Alternatively, Ahmed and Shapiro (2008) approximate the probability distribution using samples of the random variable of interest obtained by Monte Carlo simulation, a specific case of scenario generation,² where all the scenarios have the same probability of occurrence. We decided to use the Monte Carlo method because of its flexibility and adaptability to our problem.

The proposed CCATA model for ATA is based on the empirical distribution of the TIFs of the assembled tests. Therefore, our random variable is the TIF of a test form. This statistic depends on the uncertain IRT item parameter estimates, such as the discrimination and the intercept. There are different ways to retrieve the distribution function of the TIF: Given the standard errors of the estimates, the samples can be uniformly drawn from their confidence intervals as in the robust approach of Veldkamp (2013); otherwise, if a Bayesian estimation is carried out, the samples in the Markov chain can be used.

In this article, another approach is used: A bootstrap procedure is performed to resample the response data and obtain a batch of estimates for each item parameter (see “Empirical Measure of the TIF” section). At the end of this phase, the IIF for all the items in the pool is computed at predefined ability points using the bootstrapped samples. These quantities are then used in the CCATA model to compute the $α$-quantiles of the TIFs, and the model is optimized by looking for the combination of items that compose the test forms with the highest quantiles. A percentile optimization model would maximize a reasonable lower bound of the TIF: its $α$-quantile, approximated by the $αR$-th ranked value of the TIF computed on the $R$ bootstrap replications of the item parameter estimates. The following sections explain the details of the retrieval of the TIF empirical distribution function by the bootstrap and the heuristic proposed to solve the model.

Empirical Measure of the TIF

The test forms built using the CCATA model should have the maximum possible empirical $α$ -quantile of their TIFs. The optimality in this sense will ensure that the assembled tests are conservative in terms of accuracy of ability estimation (indeed, the TIF), taking into account the uncertainty in the item parameter estimates. A standard approach to extract the uncertainty could be to sample many plausible values of the item parameters from the confidence intervals built using the standard errors and, subsequently, compute the related IIFs at $θ$ target points. The latter may be an optimal starting point to assemble robust tests (see Veldkamp, 2013; Veldkamp et al., 2013), but it has its own downsides as a uniform interval of plausible values is assumed. Another attempt to account for the influence of sampling error in the Bayesian framework has been made by Yang et al. (2012). They proposed a multiple-imputation approach with the aim of better measuring the latent ability of a respondent.

Our approach is based on bootstrapping the calibration process. In particular, the observed vectors of responses coming from the full sample (one vector for each test-taker) are resampled with replacement R times, and the item parameters are estimated for each sample. In this way, it is possible to preserve the natural relationship of dependence between the items, and, given the ability targets, it is possible to compute their IIFs. After that, given a set of items, we can build a test form and compute its TIF for each of the R replications. The resulting sample constitutes the empirical distribution function of the TIF.
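The resampling step described above can be sketched as follows. Entire response vectors (one per test-taker) are drawn with replacement, so the dependence between items within a respondent is preserved, and a calibration routine is applied to each resample. The `calibrate` argument is a hypothetical stand-in for an IRT calibration routine. The toy `proportion_endorsed` function used here is not an IRT estimator and only makes the sketch runnable, and complete response data are assumed.

```python
import random

def bootstrap_estimates(responses, calibrate, n_reps, seed=0):
    """Resample respondents with replacement and calibrate each resample.

    responses: list of response vectors, one per test-taker.
    calibrate: callable mapping a resampled response matrix to estimates
               (hypothetical placeholder for an IRT calibration routine).
    """
    rng = random.Random(seed)
    n = len(responses)
    estimates = []
    for _ in range(n_reps):
        # Draw whole rows so the item responses of a test-taker stay together.
        resample = [responses[rng.randrange(n)] for _ in range(n)]
        estimates.append(calibrate(resample))
    return estimates

def proportion_endorsed(sample):
    """Toy stand-in for calibration: proportion of endorsements per item."""
    n_items = len(sample[0])
    return [sum(row[i] for row in sample) / len(sample) for i in range(n_items)]

toy_data = [[1, 0, 1], [0, 0, 1], [1, 1, 1], [0, 1, 0]]
replications = bootstrap_estimates(toy_data, proportion_endorsed, n_reps=5)
print(len(replications))
```

With a real calibration routine, each element of `replications` would hold one set of item parameter estimates, from which the IIFs at the ability targets can be computed.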

More formally, let $ξ_{1}, \dots, ξ_{R}$ be an independent and identically distributed sample of $R$ realizations of the random vector $ξ$, and let $\hat{F}_{R} := R^{-1} \sum_{r=1}^{R} Δ(ξ_{r})$ be the respective empirical measure. Here, $Δ(ξ)$ denotes the measure of mass one at point $ξ$, that is, $Δ(ξ_{r})(A) = 1$ if $ξ_{r} \in A$ and $0$ otherwise. Hence, $\hat{F}_{R}$ is a discrete measure assigning probability $1/R$ to each sample. In this way, we can approximate the probability on the left-hand side of inequality (7) by replacing the true cumulative distribution function of $ξ$ with $\hat{F}_{R}$.

The Approximated Model

The retrieved empirical distribution function of the TIF is now incorporated into the CCATA model in the following way. Let $1_{(-\infty, 0]}\{x\} : ℝ \to \{0, 1\}$ be the indicator function of $x$ in the interval $(-\infty, 0]$, that is,

$$1_{(-\infty, 0]}\{x\} = \begin{cases} 0, & \text{if } x > 0, \\ 1, & \text{if } x \leq 0. \end{cases} \tag{8}$$

Thus, given a specific chance-constraint $h$, a known set of optimization variables $x$, and samples $ξ_{1}, \dots, ξ_{R}$ of our random vector, we can rewrite

$$\begin{aligned} ℙ[g_{h}(x, ξ) \leq 0] &= E_{F}[1_{(-\infty, 0]}\{g_{h}(x, ξ)\}] \approx E_{\hat{F}_{R}}[1_{(-\infty, 0]}\{g_{h}(x, ξ)\}] \\ &= \frac{1}{R} \sum_{r=1}^{R} 1_{(-\infty, 0]}\{g_{h}(x, ξ_{r})\}. \end{aligned} \tag{9}$$

Equation (9) means that the probability in the chance-constraint is approximated by the fraction of the $R$ bootstrap samples in which $g_{h}(x, ξ_{r}) \leq 0$.
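The sample-average approximation in Equation (9) amounts to counting. A minimal sketch follows, in which the function names, the constraint `g_example`, and the sample values are hypothetical and serve only as illustration.

```python
def empirical_probability(g, x, samples):
    """Fraction of the samples xi_r for which g(x, xi_r) <= 0."""
    return sum(1 for xi in samples if g(x, xi) <= 0) / len(samples)

def satisfies_chance_constraint(g, x, samples, alpha):
    """Check the approximated chance-constraint at risk level alpha."""
    return empirical_probability(g, x, samples) >= 1.0 - alpha

def g_example(x, xi):
    """Hypothetical constraint: the weighted sum must reach 1.0, written as g <= 0."""
    return 1.0 - sum(c * v for c, v in zip(xi, x))

x_selected = [1, 1, 0]
xi_samples = [[0.6, 0.5, 0.9], [0.4, 0.5, 0.9], [0.7, 0.6, 0.1], [0.3, 0.9, 0.2]]
print(empirical_probability(g_example, x_selected, xi_samples))
```

In the example, three of the four samples satisfy the constraint, so the approximated chance-constraint holds for $α = 0.25$ but not for $α = 0.1$.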

Applying the same principle to the left-hand side of the chance-constraints in inequality (7), the CCATA model can be approximated by

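$$\begin{aligned} \max_{x} \quad & y \\ \text{subject to} \quad & \frac{1}{R} \sum_{r=1}^{R} 1_{[y, \infty)}\{\vec{I}_{r}(θ_{kt})^{\prime} x_{t}\} \geq 1 - α, \quad \forall t, k, \\ & g_{j}(x_{t}) \leq 0, \quad \forall j, t, \\ & x_{t} \in \{0, 1\}^{I}, \quad y \in ℝ^{+}, \quad \forall t, \end{aligned} \tag{10}$$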

where $\vec{I}_{r}(θ_{kt}) = (I_{1r}(θ_{kt}), \dots, I_{Ir}(θ_{kt}))$ is the vector of the IIFs of the $I$ items computed on the $r$-th bootstrap replication. Two issues characterize model (10): It is nonconvex because of the indicator function used in the chance-constraints (see Rockafellar & Uryasev, 2000, 2001, for the proofs), and commercial solvers do not handle the indicator function well. To overcome these problems, we propose to solve the model by the heuristic described in the following.
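For a fixed selection of items, the largest $y$ that satisfies the approximated chance-constraint in model (10) is an order statistic of the bootstrapped TIFs, which is why the model can be read as a quantile maximization. The sketch below computes this value for one test and one ability point by checking the constraint directly. The function names and the IIF replications are hypothetical, and the rank it returns may differ by one position from the $αR$-th ranked value mentioned in the text, depending on how ties and rounding are handled.

```python
def tif_replications(iif_reps, selected):
    """TIF of the selected items on each bootstrap replication."""
    return [sum(row[i] for i in selected) for row in iif_reps]

def best_lower_bound(tifs, alpha):
    """Largest y such that at least a (1 - alpha) share of the TIFs is >= y."""
    n = len(tifs)
    for y in sorted(tifs, reverse=True):
        if sum(1 for t in tifs if t >= y) / n >= 1.0 - alpha:
            return y
    return min(tifs)  # reached only if alpha lies outside [0, 1]

# Hypothetical IIFs: 4 bootstrap replications (rows) of 3 items (columns).
iif_reps = [
    [0.50, 0.20, 0.50],
    [0.25, 0.30, 0.50],
    [0.75, 0.10, 0.50],
    [0.50, 0.40, 0.25],
]
tifs = tif_replications(iif_reps, selected=(0, 2))
print(best_lower_bound(tifs, alpha=0.25))
```

A heuristic only needs this evaluation for each candidate test form, which sidesteps the indicator functions that make the model hard for general-purpose solvers.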

The Heuristic

Since a linear formulation cannot easily approximate the proposed CCATA model, a heuristic based on SA (Goffe, 1996) has been developed. This technique can handle large models and nonlinear functions. The theory of SA is derived from the physics of annealing substances. Briefly, we adapted the annealing process to our ATA model by replacing the random selection of a decision variable with the random selection of an item from the item bank. The decision variables are perturbed by adding the chosen item, removing it, or switching it with another available item. At each modification, the objective function is evaluated, and the solution is accepted in accordance with an exponential function based on a parameter called temperature. The higher the temperature, the higher the probability of accepting a worse solution. The temperature is decreased until only better solutions are accepted. The way the temperature is controlled is referred to as the cooling schedule. If there are no further improvements in the area (neighborhood) around the current solution, the process is stopped. At the end, a reannealing phase is triggered if the global stopping criteria have not been reached. In this phase, the best solution obtained is perturbed, the temperature is reset to its initial value, and another area is explored. Consequently, the SA algorithm explores more than one neighborhood of the solution space, which reduces the risk of being trapped in a local optimum. More information about the implementation of the SA algorithm can be found in Spaccapanico (2020) and in the pseudocode in the Appendix.
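The acceptance rule and the switch move can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation in ATA.jl: it handles a single test of fixed length, uses only the switch move, assumes a geometric cooling schedule, and omits reannealing. All parameter names and default values are hypothetical.

```python
import math
import random

def simulated_annealing(objective, pool_size, length, t0=1.0, cooling=0.9,
                        steps_per_temp=50, t_min=1e-3, seed=0):
    """Maximize objective(selection) over item subsets of a fixed length."""
    rng = random.Random(seed)
    current = rng.sample(range(pool_size), length)
    cur_val = objective(current)
    best, best_val = list(current), cur_val
    temp = t0
    while temp > t_min:
        for _ in range(steps_per_temp):
            # Switch move: swap one selected item with an unselected one.
            available = [i for i in range(pool_size) if i not in current]
            proposal = list(current)
            proposal[rng.randrange(length)] = rng.choice(available)
            delta = objective(proposal) - cur_val
            # Improvements are always accepted; worse solutions are accepted
            # with probability exp(delta / temp), which shrinks as temp drops.
            if delta >= 0 or rng.random() < math.exp(delta / temp):
                current, cur_val = proposal, cur_val + delta
                if cur_val > best_val:
                    best, best_val = list(current), cur_val
        temp *= cooling  # geometric cooling schedule (an assumption)
    return sorted(best), best_val

# Hypothetical objective: TIF at one ability point from fixed IIF values.
iif_at_zero = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7]
print(simulated_annealing(lambda s: sum(iif_at_zero[i] for i in s), 6, 3))
```

In the CCATA setting, the objective passed to such a routine would be the empirical $α$-quantile of the bootstrapped TIFs rather than a plain sum.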

Unfortunately, the SA algorithm cannot handle constraints directly, so they are incorporated into the objective function using the hinge function and Lagrangian relaxation, as in Stocking and Swanson (1993). Moreover, SA has the disadvantage that it struggles to find the feasible region of a problem. Thus, we decided to start our heuristic with a sequential fill-up phase: The worst performing test, both in terms of optimality and feasibility, is assigned the best item available in the item pool. After the selected item has been assigned, the process is repeated until all the tests have reached their maximum length, that is, they are all filled up. Once this first step is completed, we process the solution with the SA principle. The result of the heuristic is a set of solutions whose size equals the number of neighborhoods explored. Finally, the solution with the best objective function value is selected.
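One way to fold constraints into the objective with a hinge function is sketched below. Each constraint is written as $g_{j}(x) \leq 0$, so only positive values count as violations. The single penalty `weight` and the function names are assumptions made for illustration, and the weighting actually used in the heuristic may differ.

```python
def hinge(value):
    """Positive part of a constraint value: zero when the constraint holds."""
    return max(0.0, value)

def penalized_objective(y, constraint_values, weight):
    """Objective to maximize, reduced by the weighted total infeasibility.

    y: value of the original objective (e.g., the TIF lower bound).
    constraint_values: g_j(x) for each constraint, feasible when <= 0.
    weight: hypothetical penalty multiplier.
    """
    infeasibility = sum(hinge(g) for g in constraint_values)
    return y - weight * infeasibility

print(penalized_objective(5.0, [-1.0, 0.0], weight=10.0))  # feasible solution
print(penalized_objective(5.0, [2.0, -3.0], weight=10.0))  # one violation
```

With a sufficiently large weight, any infeasible solution scores worse than a feasible one, which steers the annealing process toward the feasible region without forbidding temporary violations.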

Simulation Study

The performance and advantages of the CCATA test assembly model (10) are investigated through a simulation study. Our specific scenario is the on-the-fly test assembly for individualized testing. In fact, we will focus on the average examinee with $θ = 0$, at which the TIF should be maximized. In this way, the estimation error of the population average ability is reduced. This setting allows us to evaluate the effects of using probabilistic methods in the field of ATA models and to control the conservativeness of the produced tests. To assess under which conditions our proposal is preferable, the true TIFs of the tests assembled by our CCATA model are compared to those obtained with four alternative models under several conditions. The alternative models are: the classical maximin model (classical, see Equation 4a), the mean minus 3 standard deviations model (3sd, Soyster, 1973), the mean minus 1 standard deviation model (1sd, De Jong et al., 2009), and the robust model (robust, Veldkamp et al., 2013). The mean and the standard deviations of the IIFs used in models 3sd and 1sd are computed on the bootstrap samples. For the robust model, the protection level $Γ$, which indicates how many items in the model should be changed in order to affect the solution, is set equal to $40$, following the suggestion in Veldkamp et al. (2013), so $41$ submodels are solved, and the solution that produced the highest objective is retained.
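The coefficients of the 1sd and 3sd models can be sketched as the per-item mean of the bootstrapped IIFs minus $k$ standard deviations. The function name is hypothetical, the sample standard deviation (denominator $R - 1$) is an assumption, and the sketch does not truncate negative values, which may or may not match the compared implementations.

```python
from statistics import mean, stdev

def conservative_iif(iif_reps, k):
    """Per-item mean minus k standard deviations over bootstrap replications.

    iif_reps: list of replications, each a list with one IIF per item.
    k: number of standard deviations to subtract (1 for 1sd, 3 for 3sd).
    """
    n_items = len(iif_reps[0])
    coefficients = []
    for i in range(n_items):
        column = [row[i] for row in iif_reps]
        coefficients.append(mean(column) - k * stdev(column))
    return coefficients

# Hypothetical IIFs: 2 replications of 2 items.
print(conservative_iif([[1.0, 2.0], [3.0, 2.0]], k=1))
```

Items whose IIFs vary strongly across replications are discounted more heavily, so these models favor items that are both informative and stably estimated.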

All the models are solved using the ATA.jl Julia package (Spaccapanico, 2021a). For the classical, 1sd, 3sd, and robust models, the CPLEX solver interfaced through JuMP.jl (http://www.juliaopt.org/JuMP.jl/0.18/) is chosen. On the other hand, the CCATA model is solved by our heuristic. The data needed for assembling the CC tests consist of the sample of the IIFs computed at $θ = 0$ for each item in the pool, namely, the vector $\vec{I}_{r}(0)$, for $r = 1, \dots, R$. These quantities are obtained by estimating the item parameters by bootstrapping, where the 2PL model is assumed. The true item parameters $a$ (discrimination) and $b$ (intercept) are sampled from the following distributions: $a \sim \mathrm{LN}(0, 0.25)$ and $b \sim N(0, 1)$.
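The data-generating step can be sketched as follows. The text does not state whether $0.25$ in $\mathrm{LN}(0, 0.25)$ is the standard deviation or the variance on the log scale, so the sketch exposes it as the parameter `log_sd` and treats it as a standard deviation by default (an assumption; use `log_sd=0.5` for a variance of $0.25$). Abilities are drawn from $N(0, 1)$, complete response data are generated, and the function names are hypothetical.

```python
import math
import random

def simulate_pool(n_items, rng, log_sd=0.25):
    """Draw true (a, b) pairs: a is log-normal, b is standard normal."""
    return [(rng.lognormvariate(0.0, log_sd), rng.gauss(0.0, 1.0))
            for _ in range(n_items)]

def simulate_responses(items, n_subjects, rng):
    """Generate binary responses under the 2PL model with theta ~ N(0, 1)."""
    data = []
    for _ in range(n_subjects):
        theta = rng.gauss(0.0, 1.0)
        row = []
        for a, b in items:
            p = 1.0 / (1.0 + math.exp(-(a * theta + b)))
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data

rng = random.Random(2021)
pool = simulate_pool(250, rng)
responses = simulate_responses(pool, 100, rng)
print(len(pool), len(responses), len(responses[0]))
```

An unbalanced design would additionally blank out, for each test-taker, the items that were not administered before calibration. That step is omitted here.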

The results are compared in terms of the true TIFs averaged across tests and replications. Other benchmarks used to compare the model performances are the relative bias and relative root mean square error (RMSE) between the true and observed TIFs. The true TIF is the reciprocal of the real expected ability estimation error; higher values indicate that the test will produce, on average, more accurate ability estimates. Moreover, by comparing the values of the relative biases and RMSEs, we can evaluate the accuracy and conservativeness of the models under the specified conditions. In particular, the sign of the bias indicates whether the observed TIF underestimates (negative values) or overestimates (positive values) the true one, and the closer the RMSE is to zero, the better the model estimates the true TIF. Conversely, high absolute values of the RMSE and bias indicate that the observed TIF does not reproduce the real expected ability estimation error of the test.

Simulation Design

The optimization has been performed on a personal computer with an AMD Ryzen 7 PRO 4750U processor and 16 GB of RAM. Two Julia packages have been used for the computational tasks: Psychometrics.jl for calibration and bootstrap (Spaccapanico, 2021b) and ATA.jl for the ATA models (Spaccapanico, 2021a). The steps addressed in the simulation study are described in the following:

A pool of $I = 250$ true items with contents: content_A = {type1, type2, type3}, content_B = {type4, type5, type6} is simulated.

For each replication $m = 1, \dots, M$, the responses of $N = 3{,}000$ subjects with $θ \sim N (0, 1)$ are generated. Then, the items are calibrated with the marginal maximum likelihood estimation approach under an unbalanced design of 500–1,000 responses per item. $M = 10$ replications are performed. To investigate the validity of the methods in multiple scenarios, we also implemented the cases $N = 1{,}200$ and $N = 6{,}000$, where each item gets 200–400 and 1,000–2,000 responses, respectively.

The items are recalibrated $R = 500$ times on $N^{*} = N$ respondents sampled with replacement (bootstrap). The ${\vec{I}}_{r} (0)$ for $r = 1, \dots, R$ are computed.

The test specifications (see Table 2) are added to the models, and the optimization hyperparameters are set as explained in the next paragraph.

For each combination of sample size and set of test specifications, the models classical, 3sd, 1sd, robust, and CCATA are solved. The CCATA model is solved both with $α = 0.01$ and $α = 0.05$ .
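The data-generation and resampling steps above can be sketched in Python as follows. This is only an illustration on a toy pool, not the procedure implemented in Psychometrics.jl: the unbalanced design is mimicked by giving each item to a random subset of respondents (an assumption), the slope-intercept 2PL form is assumed, and the calibration step is omitted. The function names are hypothetical.

```python
import math
import random

def simulate_responses(pool, n_subjects, min_resp, max_resp, rng):
    """Return one dict per subject mapping item index -> 0/1 response.

    Each item is administered to a random subset of subjects whose size
    lies in [min_resp, max_resp] (assumed form of the unbalanced design).
    """
    thetas = [rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
    data = [dict() for _ in range(n_subjects)]
    for i, (a, b) in enumerate(pool):
        n_resp = rng.randint(min_resp, max_resp)
        for s in rng.sample(range(n_subjects), n_resp):
            p = 1.0 / (1.0 + math.exp(-(a * thetas[s] + b)))
            data[s][i] = 1 if rng.random() < p else 0
    return data

def bootstrap_sample(data, rng):
    """Resample respondents with replacement, keeping the sample size."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

rng = random.Random(1)
toy_pool = [(1.0, 0.0), (0.8, -0.5), (1.2, 0.7)]  # toy (a, b) pairs
responses = simulate_responses(toy_pool, 300, 50, 100, rng)
replicate = bootstrap_sample(responses, rng)
counts = [sum(1 for row in responses if i in row) for i in range(len(toy_pool))]
print(counts, len(replicate))
```

In the study, each bootstrap replicate would then be recalibrated and the IIFs at $θ = 0$ recomputed, which gives one vector ${\vec{I}}_{r} (0)$ per replicate.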

Table 2.

Test Specifications

Case	T (number of tests)	Max Item Use
1	10	4
2	10	2
3	20	4
4	25	4

Case	Variable	Bounds (one interval per category)
All	Test length	[38, 40]
All	content_A	[6, 10], [9, 12], [18, 25]
All	content_B	[9, 12], [15, 19], [9, 12]
All	Maximum overlap between tests	11

Performing the bootstrap procedure on the item calibration and solving each ATA model is computationally intensive. In detail, each model requires about 500 seconds to approach its theoretical upper bound of the objective, and the bootstrap procedure takes about 6 to 7 hours, depending on the sample size.

Test Specifications

The mentioned models are solved under different settings, such as the number of test forms and confidence levels. The assembly is performed in a parallel framework, that is, the T tests must meet the same constraints. Two fictitious categorical variables, content_A and content_B, with three possible categories each, are simulated to constrain the tests to have certain content validity. The following specifications replicate realistic ATA applications, where feasibility is the main concern, along with the search for the optimal set of tests in terms of the TIF. The complete set of test specifications is summarized in Table 2.

For example, the constraints described for variable content_A require that tests have 6 to 10 items of the first category of the variable content_A, 9 to 12 items of the second category, and so forth. For classical, 3sd, 1sd, and robust models, different combinations of the specifications in Table 2 create four cases to be investigated in increasing order of complexity. For the CCATA model, eight cases are investigated (four cases for each $α$ level).

Moreover, the hyperparameters for the heuristic are chosen as follows. The starting temperature is equal to 0.1, so that the solver does not check solutions too far from the last explored neighborhood, and the geometric cooling parameter is set equal to 0.1. At the beginning of the optimization, we perform one fill-up phase that only takes into account the feasibility of the model. Then, we search for the best combination of items by randomly selecting one item in all the tests to be added, removed, or switched. A Lagrange multiplier equal to 0.1 is chosen to balance the model’s feasibility and optimality. A time limit is imposed as the termination criterion, and it is set equal to $500$ seconds. The same stopping criterion applies to the other models.
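To make the role of the temperature and of the cooling parameter concrete, the sketch below shows a geometric cooling schedule and the textbook Metropolis acceptance rule for a maximization problem. It is a generic simulated annealing fragment, assumed here for illustration; the acceptance rule actually used by the heuristic in ATA.jl may differ, and the names `accept` and `temperatures` are hypothetical.

```python
import math
import random

def accept(delta, temperature, rng):
    """Metropolis rule for maximization.

    delta is the change in the objective (new minus current). Improvements
    are always accepted; worse moves are accepted with probability
    exp(delta / temperature), which shrinks as the temperature decreases.
    """
    if delta >= 0:
        return True
    return rng.random() < math.exp(delta / temperature)

def temperatures(start=0.1, cooling=0.1, n_levels=4):
    """Geometric cooling: each level multiplies the temperature by `cooling`."""
    return [start * cooling ** k for k in range(n_levels)]

rng = random.Random(0)
for temp in temperatures():
    # Share of slightly worse moves (delta = -0.05) accepted at this level.
    accepted = sum(accept(-0.05, temp, rng) for _ in range(1000))
    print(temp, accepted / 1000)
```

With a starting temperature of 0.1, a move that lowers the objective by 0.05 is accepted with probability $e^{-0.5} \approx 0.61$ at the first level and almost never afterward, which keeps the search close to the last explored neighborhood.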

Results

In Table 3 and Figure 1, the mean of the true TIFs computed at $θ = 0$, $\overline{TIF^{\dagger}(0)}$, is reported. It is obtained by averaging the true $TIF_{tm}^{\dagger}(0)$ across the $t = 1, \dots, T$ tests and $m = 1, \dots, M$ replications as follows:

$$\overline{TIF^{\dagger}(0)} = M^{-1} T^{-1} \sum_{m = 1}^{M} \sum_{t = 1}^{T} TIF_{tm}^{\dagger}(0).$$

Table 3.

$\overline{TIF^{\dagger}(0)}$: True TIF at $θ = 0$ Averaged Across $T$ Tests and $M$ Replications

Case	CCATA ( $α = 0.01$ )	CCATA ( $α = 0.05$ )	Classical	3sd	1sd	Robust
$N = 1{,}200$
1	12.6274	12.7517	13.0457	9.8415	13.0052	12.9474
2	10.3974	10.4868	10.5739	9.4460	10.5651	10.5706
3	9.7330	10.2992	10.3712	9.3373	10.3727	—
4	9.4686	9.4603	8.9815	8.9878	8.9815	—
$N = 3{,}000$
1	13.4907	13.5446	13.5599	13.2208	13.5487	13.3787
2	10.6187	10.6286	10.6792	10.5404	10.6741	10.6781
3	10.5389	10.5506	10.3151	10.0580	10.6204	—
4	9.4715	9.4700	8.9815	8.9775	8.9815	—
$N = 6{,}000$
1	13.6271	13.3121	13.5403	13.4735	13.6565	13.6889
2	10.6664	10.5988	10.7347	10.7079	10.7310	10.7332
3	10.5452	10.3929	10.5362	10.5249	10.3805	—
4	9.4684	9.4516	8.9815	8.9815	8.9815	—

Note. The robust model could not be solved for Cases 3 and 4. TIF = test information function; CCATA = chance-constrained automated test assembly.

Figure 1.

True test information function averaged across tests and replications. Plots are grouped by Case = {1, 2, 3, 4} and by N = {1,200, 3,000, 6,000}.

Table 4 and Figure 2 show the results for the relative bias between the observed and true TIF, while Table 5 and Figure 3 show the corresponding relative RMSE. Relative measures are chosen to make the results of the different conditions comparable. The two indicators are obtained as follows. First, for each test $t$ and replication $m$, the observed TIF, $TIF_{tm}^{\dagger\dagger}(0)$, and the true TIF, $TIF_{tm}^{\dagger}(0)$, are computed; then, they are averaged over the $T$ tests, giving $\overline{TIF_{m}^{\dagger\dagger}(0)}$ and $\overline{TIF_{m}^{\dagger}(0)}$, respectively. Finally, bias and RMSE are computed as:

$$\text{Bias} = M^{-1} \sum_{m = 1}^{M} \frac{\overline{TIF_{m}^{\dagger\dagger}(0)} - \overline{TIF_{m}^{\dagger}(0)}}{\overline{TIF_{m}^{\dagger}(0)}},$$

$$\text{RMSE} = \frac{\sqrt{M^{-1} \sum_{m = 1}^{M} \left(\overline{TIF_{m}^{\dagger\dagger}(0)} - \overline{TIF_{m}^{\dagger}(0)}\right)^{2}}}{\overline{TIF^{\dagger}(0)}}.$$
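The two indicators can be computed as in the following Python sketch, which takes as input the per-replication means of the observed and true TIFs (already averaged over the $T$ tests). The function names and the toy numbers are illustrative assumptions, not values from the study.

```python
import math

def relative_bias(observed, true):
    """Mean over replications of (observed - true) / true."""
    m = len(true)
    return sum((o - t) / t for o, t in zip(observed, true)) / m

def relative_rmse(observed, true):
    """Root mean squared difference divided by the grand mean of the true TIF."""
    m = len(true)
    mse = sum((o - t) ** 2 for o, t in zip(observed, true)) / m
    grand_mean_true = sum(true) / m
    return math.sqrt(mse) / grand_mean_true

# Toy example with M = 2 replications (hypothetical numbers).
observed_means = [11.0, 9.0]
true_means = [10.0, 10.0]
print(relative_bias(observed_means, true_means))
print(relative_rmse(observed_means, true_means))
```

In this toy example the over- and underestimation cancel in the bias, while the relative RMSE still reports the typical size of the error. Because every replication has the same number of tests, the mean of the per-replication true means equals the grand mean over tests and replications.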

Table 4.

Relative Bias of the TIF

Case	CCATA ( $α = 0.01$ )	CCATA ( $α = 0.05$ )	Classical	3sd	1sd	Robust
$N = 1{,}200$
1	.1191	.1658	.2248	−.8973	−.1553	.2218
2	.0239	.0699	.1228	−.9298	−.2429	.1231
3	−.0182	.0604	.1187	−.9319	−.2462	—
4	−.0360	.0148	.0733	−.9368	−.2811	—
$N = 3{,}000$
1	−.0063	.0196	.0667	−.6081	−.1448	.0666
2	−.0412	−.0174	.0312	−.6725	−.1847	.0310
3	−.0474	−.0204	.0263	−.6819	−.1852	—
4	−.0752	−.0449	.0078	−.7004	−.2082	—
$N = 6{,}000$
1	−.0211	−.0032	.0378	−.4268	−.1103	.0374
2	−.0369	−.0219	.0191	−.4679	−.1344	.0192
3	−.0408	−.0246	.0177	−.4703	−.1379	—
4	−.0574	−.0349	.0064	−.4901	−.1494	—

Note. The robust model could not be solved for Cases 3 and 4. TIF = test information function; CCATA = chance-constrained automated test assembly.

Figure 2.

Relative bias of the test information function. Plots are grouped by Case = {1, 2, 3, 4} and by N = {1,200, 3,000, 6,000}.

Table 5.

Relative RMSE of the TIF

Case	CCATA ( $α = 0.01$ )	CCATA ( $α = 0.05$ )	Classical	3sd	1sd	Robust
$N = 1{,}200$
1	.1212	.1674	.2262	.8980	.1564	.2235
2	.0280	.0721	.1248	.9297	.2435	.1251
3	.0269	.0624	.1222	.9318	.2463	—
4	.0382	.0209	.0757	.9369	.2813	—
$N = 3{,}000$
1	.0207	.0253	.0693	.6083	.1455	.0689
2	.0443	.0235	.0356	.6725	.1853	.0353
3	.0507	.0260	.0343	.6817	.1857	—
4	.0769	.0475	.0183	.7005	.2086	—
$N = 6{,}000$
1	.0237	.0121	.0392	.4269	.1108	.0391
2	.0374	.0229	.0206	.4679	.1345	.0208
3	.0413	.0256	.0199	.4703	.1378	—
4	.0578	.0356	.0098	.4901	.1496	—

Note. The robust model could not be solved for Cases 3 and 4. RMSE = root mean square error; TIF = test information function; CCATA = chance-constrained automated test assembly.

Figure 3.

Relative RMSE of the test information function. Plots are grouped by Case = {1, 2, 3, 4} and by N = {1,200, 3,000, 6,000}.

Clearly, the observed TIFs are different for each model. For example, the observed TIF for a particular test under the CCATA model corresponds to the $α$-quantile of its empirical distribution function. In contrast, for the 3sd and 1sd models, the observed TIF is the sum of the means of the $R = 500$ IIF values obtained with the bootstrap, minus 3 or 1 bootstrap standard deviations, respectively. Finally, for the classical and robust ATA models, the observed TIF is the sum of the IIFs computed on the item parameters estimated on the full sample, following the classical approach.
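The three definitions of the observed TIF can be sketched in Python as follows. Several choices here are assumptions made for illustration: the bootstrap IIFs of the items selected for one test are stored as a list of rows (one row per replicate), the empirical quantile is taken as an order statistic, and the standard deviation in the 3sd and 1sd models is subtracted item by item. The function names are hypothetical.

```python
import math
import statistics

def empirical_quantile(values, alpha):
    """Lower empirical quantile: the ceil(alpha * R)-th order statistic."""
    ordered = sorted(values)
    k = max(1, math.ceil(alpha * len(ordered)))
    return ordered[k - 1]

def observed_tif_ccata(iif_samples, alpha):
    """alpha-quantile of the bootstrap distribution of the TIF.

    iif_samples[r][i] is the IIF of selected item i in bootstrap replicate r.
    """
    tifs = [sum(row) for row in iif_samples]
    return empirical_quantile(tifs, alpha)

def observed_tif_mean_minus_sd(iif_samples, k):
    """Sum over items of (bootstrap mean - k * bootstrap sd), k = 1 or 3."""
    n_items = len(iif_samples[0])
    total = 0.0
    for i in range(n_items):
        column = [row[i] for row in iif_samples]
        total += statistics.mean(column) - k * statistics.stdev(column)
    return total

def observed_tif_classical(full_sample_iifs):
    """Sum of the IIFs computed from the full-sample estimates."""
    return sum(full_sample_iifs)

# Toy data: 3 replicates, 2 selected items (hypothetical numbers).
samples = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
print(observed_tif_ccata(samples, 0.5))
print(observed_tif_mean_minus_sd(samples, 1))
print(observed_tif_classical([2.0, 3.0]))
```

The quantile is computed on the replicate-level sums, so it reflects the joint variability of the selected items, whereas the mean-minus-sd variant treats each item separately.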

As can be noticed from Tables 3 through 5, the robust model could not produce any solution in Cases 3 and 4. In these cases, the termination criterion of 500 seconds was reached before a feasible solution was found for all the submodels. For this reason, the robust approach turned out to be impractical for complex, that is, large-sized, ATA models. Specifically, large-sized ATA models are characterized by many optimization variables and constraints. This condition occurs especially when overlap constraints are imposed, because many auxiliary optimization variables are needed to linearize the model.
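A rough count shows how quickly the model grows with the number of tests. The sketch below assumes a common linearization of the overlap constraints that introduces one auxiliary binary variable per item and per pair of tests; the formulation actually built by ATA.jl and JuMP.jl may differ, so the figures are indicative only. The function name is hypothetical.

```python
from math import comb

def overlap_model_size(n_items, n_tests):
    """Return (decision variables, auxiliary variables).

    Assumes one binary decision variable per item and test, plus one
    auxiliary binary variable per item and unordered pair of tests.
    """
    decision_vars = n_items * n_tests
    aux_vars = n_items * comb(n_tests, 2)
    return decision_vars, aux_vars

# Pool size and numbers of tests taken from the simulation design (Table 2).
for case, n_tests in [(1, 10), (2, 10), (3, 20), (4, 25)]:
    print(case, overlap_model_size(250, n_tests))
```

Under this assumption, moving from 10 to 25 tests raises the number of auxiliary variables from 11,250 to 75,000, because it grows with the number of test pairs rather than with the number of tests.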

Looking at Table 3 and Figure 1, the results on the mean TIF are very similar for all the approaches. However, some patterns can be detected, as explained below. Lower values of the true TIFs are observed for the 3sd model mainly for the smallest sample size and for $N = 6{,}000$ in Case 1. The CCATA model never produces the worst results and outperforms the other approaches in Case 4, where the underlying ATA model is more constrained and has a higher number of decision variables. In particular, the configuration with $α = 0.05$ seems to behave slightly better than the configuration with $α = 0.01$. Overall, our approach appears stable and reliable, and the heuristic is able to find satisfactory solutions for our model.

Likewise, the relative biases and RMSEs shown in Tables 4 and 5 and depicted in Figures 2 and 3 are informative. As expected, the relative bias and RMSE tend to approach zero as the sample size increases for all the approaches. This behavior is more evident for the classical ATA model. Previous findings about the classical ATA maximin model are confirmed by this simulation. In detail, we observe that the mean TIF obtained with this method overestimates the mean true TIF in all the cases under inspection. The positive bias goes from $0.6\%$, if each item gets from 1,000 to 2,000 responses ($N = 6{,}000$), to $22.48\%$, if each item gets from 200 to 400 responses ($N = 1{,}200$). This aspect highlights the importance of using a more conservative ATA model in order to keep the results interpretable; otherwise, the expected measurement precision of the tests is overestimated as well. On the other hand, the 3sd and 1sd models produce very low estimates of the true TIFs because these approaches are too conservative. For example, the 3sd model always generates meaningless values, since its relative bias is never closer to zero than $-42\%$, even with the largest sample size. Moderately better results are obtained with the 1sd model, but its relative bias still never rises above $-11\%$. In general, the 3sd and 1sd models tend to underestimate the true TIF in all the cases. The robust model produces quite accurate results, very similar to those of the classical model, but given its complex structure, it requires too much time to be solved. Thus, the robust model can be applied to large-sized ATA models only if large investments in equipment or cloud computing are made. Instead, the CCATA model tends to produce a relative bias that is close to zero and mostly negative, together with an RMSE close to zero.
In particular, the CCATA model outperforms all the other approaches for $N = 1{,}200$ in all the cases in terms of bias and RMSE, turning out to be a powerful method when the sample size is small. The advantage of the CCATA solution compared with the other approaches is noticeable, especially for Cases 2 through 4. For $N = 3{,}000$, the CCATA method outperforms the other approaches for Cases 1–3 in terms of bias and RMSE. However, with this sample size, the results are quite similar, especially to those of the classical and robust methods. Finally, for $N = 6{,}000$, the CCATA solution outperforms the other approaches for Case 1. The results are again very similar to the classical and robust solutions.

The CCATA approach significantly improves the interpretation of the test’s expected precision, which can be expressed as “the tests have a $1 - α$ probability of having a mean TIF higher than…” denoting the level of conservativeness of the solution. Furthermore, for the CCATA model, the relative bias decreases when $α$ passes from 0.05 to 0.01, showing that the risk level of the solution is, as expected, positively correlated with $α$ and hence, customizable. In other words, we could say that, to increase the probability of having a true TIF higher than the observed one, that is, a lower risk level, $α$ should be decreased.

Application to Real Data

The data used in this application come from the 2015 TIMSS survey, a large-scale standardized student assessment conducted by the International Association for the Evaluation of Educational Achievement. Since 1995, this project has monitored mathematics and science achievement trends in 39 countries every 4 years, in the fourth and eighth grades and in the final year of secondary school. TIMSS 2015 was the sixth of such assessments. Further information regarding this study is available on the TIMSS 2015 Web page. We selected the Italian sample of Grade 8 students for the science test ($n = 4{,}479$). The choice of the subject was driven by the greater availability of science items compared with mathematics items. The original item pool was filtered by removing derived and polytomous items and retaining only original binary items. The final data set contains 234 items with the following categorical features:

four content domains (69 biology items, 57 chemistry items, 58 physics items, and 50 earth science items),

three cognitive domains (98 applying items, 88 knowing items, and 48 reasoning items), and

four topics (110 items with topic 1, 80 items with topic 2, 33 items with topic 3, and 11 items with topic 4).

Furthermore, a subset of these items is grouped into 27 units.

The design is unbalanced, as students are given only a subset of the items, so missing values appear in the response data. In particular, each item has from 611 to 663 responses. The item parameters were estimated according to the 2PL model. After the calibration, we performed a nonparametric bootstrap with $R = 500$ replications on the item parameters, and we computed the IIF at $θ = 0$ for all the items in the pool. The two already mentioned Julia packages Psychometrics.jl and ATA.jl were used for calibration, bootstrap, and test assembly tasks.

In the calibrated item pool, the discrimination parameter estimates range from 1e-05 to 4.708, with a mean of 0.920 and a median of 0.867. There are two items with the minimum allowed value of the discrimination estimate. On the other hand, the intercept estimates range from −4.340 to 4.546, with mean and median equal to 0.071 and 0.025, respectively.

The final matrix of the IIFs contains $234 \times 500$ values. Subsequently, we solved the CCATA model with the proposed approach, imposing the following test constraints, which were based on the features of the tests administered in TIMSS 2015. In detail, a set of $T = 14$ tests with lengths from $29$ to $31$ items is assembled. The friend sets mentioned above are included in the assembly as constraints. We required each test to have at least six items for each content domain (biology, chemistry, physics, and earth science), a minimum of eight items in each of the applying and knowing cognitive domains, and a minimum of seven items in the reasoning cognitive domain. The first and the second topics must each be present at least 10 times in each test form. Forms must contain at least two items on the third topic and one item on the fourth topic. Each item can be used in at most three test forms. The overlap must be at most $15$ items between adjacent forms and at most five items between forms at a distance equal to 2 (e.g., Forms 1 and 3 can have at most five items in common), and no overlap is allowed for pairs at a distance greater than 2. For the CCATA model, we chose $α = 0.05$ and a Lagrange multiplier equal to $0.01$. The latter choice is motivated by the high level of infeasibility of the model. We excluded from the assembly 11 items that had an IRT $b$ parameter higher than 3 or lower than −3. Removing items with extreme difficulty parameters helped the solver assemble tests with a TIF peaked at $θ = 0$.
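The item-use and overlap rules described above can be checked on a candidate solution with a short Python sketch such as the following. It assumes that the distance between two forms is the difference of their positions in the ordered list of forms (not a circular distance), which the text suggests but does not state explicitly. The function names and the toy forms are hypothetical.

```python
from collections import Counter

def max_allowed_overlap(distance):
    """Overlap bounds from the text: 15 if adjacent, 5 at distance 2, else 0."""
    if distance == 1:
        return 15
    if distance == 2:
        return 5
    return 0

def check_forms(forms, max_item_use=3):
    """forms: list of sets of item ids, ordered as Form 1, Form 2, ...

    Returns a list of violations; an empty list means all rules hold.
    """
    violations = []
    use = Counter(item for form in forms for item in form)
    for item, count in use.items():
        if count > max_item_use:
            violations.append(("item_use", item, count))
    for s in range(len(forms)):
        for t in range(s + 1, len(forms)):
            overlap = len(forms[s] & forms[t])
            if overlap > max_allowed_overlap(t - s):
                violations.append(("overlap", s + 1, t + 1, overlap))
    return violations

# Toy example with four short forms (hypothetical item ids).
good = [{1, 2, 3}, {3, 4, 5}, {5, 6, 7}, {7, 8, 9}]
bad = [{1, 2, 3}, {3, 4, 5}, {5, 6, 7}, {1, 8, 9}]
print(check_forms(good))
print(check_forms(bad))
```

In the second toy solution, item 1 appears in Forms 1 and 4, which are at a distance of 3, so the check reports an overlap violation.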

After we included all the specifications in the model, we ran the optimization algorithm, which implements our heuristic. We selected the same termination criteria as in the simulation study. Before the time limit was reached, the algorithm explored four neighborhoods: The first and the second neighborhoods were not feasible, while the third and fourth neighborhoods produced feasible tests with a minimum $0.05$ -quantile of the TIF equal to 4.55 and 4.84, respectively.

Thus, the best solution is produced within the last neighborhood, where the smallest $0.05$ -quantile among the tests is equal to 4.843. The assembled tests fulfill all the constraints, as shown in Table 6. Also, constraints on overlap and item use are satisfied.

Table 6.

TIMSS Data, Features of the Test Forms Assembled by the CCATA Model

Test (t)	1	2	3	4	5	6	7	8	9	10	11	12	13	14
Length	29	29	29	30	29	29	29	29	29	30	30	29	29	29
Content domain
Biology	9	6	7	6	10	10	10	10	7	7	9	8	9	10
Chemistry	6	6	8	9	6	7	6	6	6	8	9	8	7	6
Physics	8	9	8	6	7	6	7	7	8	8	6	7	7	7
Earth science	6	8	6	9	6	6	6	6	8	7	6	6	6	6
Cognitive domain
Applying	12	13	10	12	12	13	12	8	11	13	12	10	12	10
Knowing	9	8	12	11	9	9	10	12	11	9	11	11	9	11
Reasoning	8	8	7	7	8	7	7	9	7	8	7	8	8	8
Topic
1	11	10	11	11	11	10	12	15	16	15	17	13	10	15
2	10	12	10	10	10	10	10	10	10	10	10	10	13	10
3	7	6	6	7	6	8	6	3	2	4	2	2	2	3
4	1	1	2	2	2	1	1	1	1	1	1	4	4	1

Note. CCATA = chance-constrained automated test assembly; TIMSS = Trends in International Mathematics and Science Study.

The maximized $α$ -quantiles together with the TIF at $θ = 0$ computed on the sample are reported in Table 7. A graphical representation of the sampling distributions of the TIFs is shown in Figure 4.

Table 7.

Test Information Function of the Assembled Tests for TIMSS Data at $θ = 0$

Test (t)	$Q (T I F_{t} (0),0.05)$	$T I F_{t} (0)$
1	4.856	5.157
2	4.844	5.166
3	4.895	5.243
4	4.861	5.175
5	4.999	5.325
6	4.878	5.178
7	4.896	5.276
8	4.856	5.259
9	4.868	5.243
10	4.861	5.175
11	4.870	5.286
12	4.907	5.355
13	4.880	5.308
14	4.853	5.185

Note. TIF = test information function; TIMSS = Trends in International Mathematics and Science Study.

Figure 4.

Examples of test information functions (TIFs) of the Assembled Tests 1 and 2. TIF estimated on the full sample (solid black) against quantiles.

The resulting TIFs and quantiles do not differ considerably among the test forms, which suggests that the model reached a solution very close to the global optimum. However, the high complexity of the model and the low values of the IIFs at $θ = 0$ contributed to low TIFs. In particular, we found that the TIFs of the assembled tests have their peaks in the interval $θ > 0$ (Figure 4), suggesting that the item bank is more appropriate for measuring the ability of examinees who are more proficient than the Italian ones.

Analyzing the sampling distribution of the TIFs of the assembled tests illustrated in Figure 4, we can notice that the TIF computed on the full sample is consistently higher than the $0.05$-quantile. Thus, we could say that there is a low probability that Test 2 produces estimates of the ability of an examinee with a true $θ = 0$ with a standard error of measurement greater than $\sqrt{1 / 4.843} = 0.454$.
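The arithmetic behind this statement is the usual relation between the standard error of measurement and the TIF, $SEM(θ) = 1 / \sqrt{TIF(θ)}$. A minimal Python sketch, using the value quoted in the text and the two entries reported for Test 2 in Table 7 (the function name is hypothetical):

```python
import math

def sem_from_tif(tif):
    """Standard error of measurement implied by a TIF value."""
    return 1.0 / math.sqrt(tif)

sem_text = sem_from_tif(4.843)      # smallest 0.05-quantile quoted in the text
sem_quantile = sem_from_tif(4.844)  # Test 2, 0.05-quantile (Table 7)
sem_full = sem_from_tif(5.166)      # Test 2, full-sample TIF (Table 7)
print(round(sem_text, 3), round(sem_quantile, 3), round(sem_full, 3))
```

Since the quantile is lower than the full-sample TIF, the standard error it implies is larger, which is what makes it a conservative statement about the precision of the test.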

Concluding Remarks

In this work, a CC version of the maximin ATA model, namely CCATA, has been introduced. This new test assembly model is able to deal with uncertainty in item parameters affected by calibration errors, which, in practice, can be relevant especially for small sample sizes, where the classical approaches highly overestimate the true TIF. In particular, the proposed approach can take into account the structure of the uncertainty observed in the response data used in the calibration phase, with the aim of reducing the risk of misinterpreting the test accuracy in estimating the examinee’s ability. This goal is achieved by approximating the distribution function of the TIF using the bootstrapped replicates of the item parameter estimates. The new model reformulates the classical maximin ATA model as a percentile optimization problem, a subcategory of CC models. To deal with the nonlinear formulation of the proposed CCATA model, we developed a heuristic based on the SA principle for finding the optimal conservative tests. In this way, unlike with classical and robust optimization techniques, it is also possible to handle large-sized models.

The results of a simulation study in the context of on-the-fly assembly for individualized testing show that the CCATA model, together with our heuristic, maximizes an adjustable conservative version of the TIF, that is, its $α$-quantile, where $α$ can be arbitrarily chosen by the test assembler. In particular, it has been empirically shown that these quantiles are lower bounds to the true TIF for small values of $α$, such as $0.05$ or $0.01$. Thus, using the sampling distribution function of the TIF along with the CC formulation gives a better idea of the accuracy of the tests in estimating future abilities and reduces the potential side effects of calibration errors. In contrast, with alternative methods, the observed TIF is often higher or excessively lower than the true one, which leads to dangerous misinterpretations. An application to real data from the TIMSS survey demonstrated that our approach is replicable in real-world situations.

The results are encouraging, especially for complex and large-sized ATA models and for small sample sizes. A further contribution to the ATA research field is the development of two open-source Julia packages Psychometrics.jl and ATA.jl (Spaccapanico, 2021a, 2021b), which do not rely on commercial solvers and can be used free of charge. However, further studies are needed to consider different test constraints and more Monte Carlo replicates. Moreover, unlike other robust ATA models, the CCATA model requires the availability of response data for the application of the bootstrap technique. To perform the study under these conditions, it would be useful to reduce the computational effort required for item calibration.

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Giada Spaccapanico Proietti

References

Ahmed & Shapiro (2008). Solving chance-constrained stochastic programs via sampling and integer programming. In 2008 tutorials in operations research: State-of-the-art decision-making tools in the information-intensive age (pp. 261–269). educ.1080.0048 (informs.org).

Ali, U. S., & van Rijn, P. W. (2016). An evaluation of different statistical targets for assembling parallel forms in item response theory. Applied Psychological Measurement, 40(3), 163–179.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.

Ariel, van der Linden, W. J., & Veldkamp, B. P. (2006). A strategy for optimizing item-pool management. Journal of Educational Measurement, 43(2), 85–96.

Bertsimas & Sim (2003). Robust discrete optimization and network flows. Mathematical Programming, 98(1–3), 49–71.

Bezanson, Edelman, Karpinski, & Shah, V. B. (2017). Julia: A fresh approach to numerical computing. SIAM Review, 59(1), 65–98.

Bradley & Tibshirani, R. J. (1993). An introduction to the bootstrap. Chapman & Hall/CRC.

Charnes & Cooper, W. W. (1959). Chance-constrained programming. Management Science, 6(1), 73–79.

Charnes & Cooper, W. W. (1963). Deterministic equivalents for optimizing and satisficing under chance constraints. Operations Research, 11(1), 18–39.

Charnes, Cooper, W. W., & Symonds, G. H. (1958). Cost horizons and certainty equivalents: An approach to stochastic programming of heating oil. Management Science, 4(3), 235–263.

Chen, J. T. (1973). Quadratic programming for least-cost feed formulations under probabilistic protein constraints. American Journal of Agricultural Economics, 55(2), 175–183.

De Jong, M. G., Steenkamp, J.-B. E. M., & Veldkamp, B. P. (2009). A model for the construction of country-specific yet internationally comparable short-form marketing scales. Marketing Science, 28, 674–689.

Deb, Sindhya, & Hakanen (2016). Multi-objective optimization. In Decision sciences (pp. 161–200). CRC Press.

Debeer, Ali, U. S., & van Rijn, P. W. (2017). Evaluating statistical targets for assembling parallel mixed-format test forms. Journal of Educational Measurement, 54(2), 218–242.

Foy (2016). TIMSS 2015 user guide for the international database. Boston College, Chestnut Hill, MA: TIMSS & PIRLS International Study Center. https://timssandpirls.bc.edu/timss2015/international-database/downloads/T15_UserGuide.pdf

Freund, R. J. (1956). The introduction of risk into a programming model. Econometrica, 24(3), 253–263.

Goffe, W. L. (1996). SIMANN: A global optimization algorithm using simulated annealing. Studies in Nonlinear Dynamics & Econometrics, 1(3), 169–176.

Gurobi. (2018). The Gurobi optimizer [version 8.0].

IBM. (2019). IBM ILOG CPLEX optimization studio [version 12.10.0].

Kataria, Elofsson, & Hasler (2010). Distributional assumptions in chance-constrained programming models of stochastic water pollution. Environmental Modeling and Assessment, 15, 273–281.

Kim, C. S., Schaible, G. D., & Segarra (1990). The deterministic equivalents of chance-constrained programming. Journal of Agricultural Economics Research, 42(2), 30–39.

Krokhmal, Palmquist, & Uryasev (2002). Portfolio optimization with conditional value-at-risk objective and constraints. Journal of Risk, 4, 43–68.

Margellos, Goulart, & Lygeros (2014). On the road between robust optimization and the scenario approach for chance constrained optimization problems. IEEE Transactions on Automatic Control, 59(8), 2258–2263.

Mislevy, R. J., Wingersky, M. S., & Sheehan, K. M. (1994). Dealing with uncertainty about item parameters: Expected response functions. ETS Research Report Series, 1994(1), i–20.

Nemirovski & Shapiro (2006). Convex approximations of chance constrained programs. SIAM Journal on Optimization, 17, 969–996.

Patton, J. M., Cheng, Yuan, K.-H., & Diao (2014). Bootstrap standard errors for maximum likelihood ability estimates when item parameters are unknown. Educational and Psychological Measurement, 74(4), 697–712.

Rockafellar, R. T., & Uryasev (2000). Optimization of conditional value-at-risk. Journal of Risk, 2, 21–42.

Rockafellar, R. T., & Uryasev (2001). Conditional value-at-risk for general loss distributions. ISE Dept., University of Florida.

Scott, J. T., Jr., & Baker, C. B. (1972). A practical way to select an optimum farm plan under risk. American Journal of Agricultural Economics, 54(4), 657–660.

Song, Luedtke, J. R., & Küçükyavuz (2014). Chance-constrained binary packing problems. INFORMS Journal on Computing, 26(4), 735–747. https://doi.org/10.1287/ijoc.2014.0595

Soyster, A. L. (1973). Convex programming with set-inclusive constraints and applications to inexact linear programming. Operations Research, 21, 1154–1157.

Spaccapanico, P. G. (2020). https://doi.org/10.6092/unibo/amsdottorato/9217

Spaccapanico, P. G. (2021a). ATA.jl: Automated test assembly made easy [Computer software]. https://github.com/giadasp/ATA.jl

Spaccapanico, P. G. (2021b). Psychometrics.jl [Computer software]. https://github.com/giadasp/Psychometrics.jl

Spaccapanico, P. G., Matteucci, & Mignani (2020). Automated test assembly for large-scale standardized assessments: Practical issues and possible solutions. Psych, 2(4), 315–337. https://www.mdpi.com/2624-8611/2/4/24

Stocking, M. L., & Swanson (1993). A method for severely constrained item selection in adaptive testing. Applied Psychological Measurement, 17(3), 277–292.

Tarim, S. A., Manandhar, & Walsh (2006). Stochastic constraint programming: A scenario-based approach. Constraints, 11(1), 53–80.

Tsutakawa, R. K., & Johnson, J. C. (1990). The effect of uncertainty of item parameter estimation on ability estimates. Psychometrika, 55(2), 371–390.

van der Linden, W. J. (2005). Linear models for optimal test design. Springer.

Veldkamp, B. P. (1999). Multiple objective test assembly problems. Journal of Educational Measurement, 36(3), 253–266.

Veldkamp, B. P. (2013). Application of robust optimization to automated test assembly. Annals of Operations Research, 206(1), 595–610.

Veldkamp, B. P., Matteucci, & de Jong, M. G. (2013). Uncertainties in the item parameter estimates and robust automated test assembly. Applied Psychological Measurement, 37(2), 123–139.

Veldkamp, B. P., & Paap, M. C. S. (2017). Robust automated test assembly for testlet-based tests: An illustration with analytical reasoning items. Frontiers in Education, 2(63), 1–8.

Veldkamp, B. P., & Verschoor, A. J. (2019). Robust computerized adaptive testing. In Theoretical and practical advances in computer-based educational measurement (pp. 291–305). Springer. https://doi.org/10.1007/978-3-030-18480-3_15

Wang, Guan, & Wang (2011). A chance-constrained two-stage stochastic program for unit commitment with uncertain wind power output. IEEE Transactions on Power Systems, 27(1), 206–215.

Xie (2019). The impact of collateral information on ability estimation in an adaptive test battery [Doctoral dissertation]. University of Iowa. https://doi.org/10.17077/etd.njvy-42a6

Yang, J. S., Hansen, & Cai (2012). Characterizing sources of uncertainty in item response theory scale scores. Educational and Psychological Measurement, 72(2), 264–290.

Zhang, Xie, & Song (2011). Investigating the impact of uncertainty about item parameters on ability estimation. Psychometrika, 76(1), 97–118.

Zheng (2016). Online calibration of polytomous items under the generalized partial credit model. Applied Psychological Measurement, 40(6), 434–450.