Abstract
Large-scale achievement tests require item banks with items ready for use in future tests. Before an item is included in the bank, its characteristics need to be estimated. The process of estimating the item characteristics is called item calibration. For the quality of future achievement tests, it is important to perform this calibration well, and it is desirable to estimate the item characteristics as efficiently as possible. Methods of optimal design have been developed to allocate pretest items to examinees with the most suitable abilities. Theoretical evidence shows advantages of using ability-dependent allocation of pretest items. However, it is not clear whether these theoretical results also hold in a real testing situation. In this paper, we investigate the performance of an optimal ability-dependent allocation in the context of the Swedish Scholastic Aptitude Test (SweSAT) and quantify the gain from using the optimal allocation. On average over all items, we see improved precision of calibration. While this average improvement is moderate, we are able to identify for which kinds of items the method works well. This enables targeting specific item types for optimal calibration. We also discuss possibilities for improving the method.
Introduction
Item calibration is the process of estimating the characteristics of new pretest items. The goal of item calibration is to develop a bank of items with precisely estimated item parameters ready for use in operational tests. It is of great importance that the item parameters are estimated as precisely as possible, since their precision directly affects the accuracy and estimated standard errors of the latent ability estimates (Cheng & Yuan, 2010). As Ali and Chang (2014) point out, well-calibrated items are particularly important for computerized adaptive testing (CAT), where test items are assigned to the examinees adaptively based on gradually updated estimates of their latent ability. In this setting it is assumed that the item parameters have been estimated with enough precision to be treated as the true ones (van der Linden & Pashley, 2000).
Methods of optimal experimental design can be applied in this context to determine which examinees to select from a population such that the item parameters are estimated as efficiently as possible. The precision of the item-parameter estimates depends on the ability levels of the examinees that respond to the item. The design problem here is the problem of selecting a sample of examinees with the most suitable abilities, that is to find an optimal sampling design (Berger, 1991; Buyske, 2005). In addition to being more cost-effective, using an optimal sampling design means that the item assignment will be targeted to examinees in a better way in terms of item difficulty compared to when assigning items randomly. Assigning items with a fitting difficulty level reduces the burden of the examinee taking the test and the risk of identification of the pretest items.
Which sampling design strategy is feasible depends on whether the calibration is conducted in an “online” or “offline” setting and whether the pretest items can be administered adaptively. In an online setting, pretest items are integrated in a test together with the operational items (Stocking, 1988). This allows items to be assigned to the examinees in an adaptive way, where the assignment rule is determined by an optimal design. Alternatively, the calibration can be done in a separate test consisting of pretest items only, a so-called offline calibration (He et al., 2020).
The online calibration setting is well suited for adaptive optimal design schemes adjusted to sequentially arriving examinees; sequential calibration methods have been implemented in different ways by van der Linden and Pashley (2000), Chang and Lu (2010), Ali and Chang (2014), Lu (2014), and van der Linden and Ren (2015). Zheng and Chang (2017) compared five methods for calibration in an online sequential setup. While the considered methods based on an optimality criterion should, in theory, improve parameter precision, they observed in their simulation study that a random design achieved comparable performance in many situations.
However, not all testing situations have sequentially arriving examinees. Instead, the test may be given to a large number of examinees simultaneously in parallel. Such a parallel setting is common for large-scale achievement tests, for example, for the Programme for International Student Assessment (PISA; OECD, 2018) and for the Swedish Scholastic Aptitude Test (SweSAT; Umeå University, 2021) considered in this paper. For item calibration under such parallel test settings, other optimal design methods are needed. The optimal design algorithm by Ul Hassan and Miller (2021) is suitable for calibrating items in a parallel setting: it is designed to allocate a large group of examinees in parallel to items based on examinee ability. The algorithm utilizes a so-called optimal restricted design. The optimal restricted design creates a set of non-overlapping ability intervals that dictate which examinee should be given which pretest item. Once the ability of an examinee is estimated, it is known which ability interval the examinee belongs to, and the designated item can be allocated to the examinee.
The computations involved in the optimal design algorithm are based on some approximations, which means that the demonstrated theoretical efficiency gains are not guaranteed to hold in practice. The optimal design therefore needs to be evaluated in a real testing situation. Moreover, Zheng and Chang (2017) demonstrated that a random assignment can give results comparable to optimal designs across several different test conditions.
In this study, we evaluate the optimal allocation strategy proposed in Ul Hassan and Miller (2021) and compare the results to a random allocation strategy. We use real data from the Swedish Scholastic Aptitude Test (SweSAT) to conduct an empirically based simulation study, designed to replicate the real calibration setting of SweSAT. Moreover, we consider a set of different simulation scenarios with varying degrees of assumptions made. In this way, we are able to separate the effect of each assumption. The responses in the test round of 2018 are used to estimate item parameters that are utilized in the optimal design algorithm to decide the optimal allocation on the basis of estimated examinee ability. The calibration procedure of SweSAT is that each examinee is given the same fixed number of pretest items, in contrast to alternative schemes where items are given to examinees until the standard errors of the estimates meet some criterion, or where a fixed number of examinees is given each pretest item (Ren et al., 2017; van der Linden & Ren, 2015).
The abilities used for assigning the examinees to items are unknown and need to be estimated from an operational test. The estimates can, for example, be derived using the Expected A Posteriori (EAP) method, producing ability estimates with good properties (Bock & Mislevy, 1982). Once the examinee’s responses to the pretest items are available, the pretest item parameters are estimated via maximum likelihood, treating the abilities as fixed. This ignores the uncertainty in the ability estimates, which is one of the factors investigated in the present simulation study. For an approach to handle the uncertainty in the abilities while optimizing the calibration design, see Bjermo et al. (2025).
The item response theory (IRT) models used to describe the relationship between the items and latent abilities depend on the model parameters in a nonlinear way. This means that the optimal allocation design depends on the unknown item parameters that are to be estimated, see, for example, Atkinson et al. (2007). The item parameters therefore need to be either pre-estimated or given some values by expert guessing; for references justifying the latter, see Berger et al. (2019). We explore the effect of pre-estimation in the simulation study by comparing the precision of the estimates when the optimal design is derived assuming the true parameter values are known with when it is based on pre-estimated values. The SweSAT consists of multiple-choice items and we fit the three-parameter logistic (3PL) IRT model. We note that Berger et al. (2019) investigated and compared, in a simulation study, calibration designs that can be used when examinees take the test simultaneously in parallel. Their designs are not based on optimality criteria, and they assume the Rasch model for the pretest items. They also investigated the impact of item-parameter pre-estimation.
The article is organized as follows. We start with describing the 3PL model and the proposed optimal allocation method and give some background about the SweSAT. Next, the details about the setup of the empirically based simulation study, and its different scenarios, are specified along with the definitions of the measures and evaluation metrics used to compare the optimal and random designs. Then the results from the simulation study are presented, and the paper ends with a discussion about the results.
The 3PL Model
The test items considered here are all multiple-choice questions with dichotomous outcomes; a response is either correct or incorrect. Therefore, a 3PL model (see, e.g., Lord, 1980) is used to model how the probability of a correct response depends on examinee ability. The 3PL model has three item parameters for item i: the discrimination a_i, the difficulty b_i, and the pseudo-guessing parameter c_i. The probability that an examinee with ability θ responds correctly to item i is

p_i(θ) = c_i + (1 − c_i) / (1 + exp(−a_i(θ − b_i))).    (1)
The 3PL model is a generalized nonlinear model (GNLM) with logit link; the logit of the response probability, η_i(θ) = logit(p_i(θ)), depends nonlinearly on the item-parameter vector β_i = (a_i, b_i, c_i).
The graph of the probability of a correct response to an item as a function of examinee ability θ illustrates the item characteristics and is often referred to as the Item Characteristic Curve (ICC) or item response curve. Figure 1 displays three item response curves with different levels of item discrimination.

Figure 1. Item response curves with varying discrimination levels.
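As a concrete illustration of Equation (1), the following short R sketch (with purely illustrative parameter values) computes the 3PL response probability and plots three item response curves that differ only in their discrimination parameter, similar in spirit to Figure 1.

```r
# 3PL probability of a correct response for ability theta
# a = discrimination, b = difficulty, c = pseudo-guessing
p3pl <- function(theta, a, b, c) {
  c + (1 - c) / (1 + exp(-a * (theta - b)))
}

theta <- seq(-4, 4, by = 0.01)
# Three illustrative items differing only in discrimination a
plot(theta, p3pl(theta, a = 0.5, b = 0, c = 0.2), type = "l", ylim = c(0, 1),
     xlab = "Ability (theta)", ylab = "P(correct response)")
lines(theta, p3pl(theta, a = 1, b = 0, c = 0.2), lty = 2)
lines(theta, p3pl(theta, a = 2, b = 0, c = 0.2), lty = 3)
legend("bottomright", legend = c("a = 0.5", "a = 1", "a = 2"), lty = 1:3)
```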
Optimal Design Allocation
An optimal design of an experiment determines the experimental conditions in a way that the estimation is optimal with respect to some criterion, see for example, Silvey (1980). Methods of optimal experimental design are used here to derive an optimal allocation strategy for item calibration. The optimal design allocation provides rules for matching pretest items to examinees based on examinee ability, such that the precision of item-parameter estimates is optimized.
Using an optimal design allocation, the item parameters in the 3PL model will be estimated with better precision compared to a non-optimal design, at least in theory. Since the item-parameter estimates are correlated (Baker & Kim, 2004), we do not focus on the precision of the three model parameters separately. Rather, we use here the so-called D-optimality criterion (Atkinson et al., 2007), which takes the correlation of the parameters into account. This criterion minimizes the determinant of the inverse information matrix (asymptotically equivalent to the covariance matrix) of the item-parameter estimators. This determinant is proportional to the volume of a confidence ellipsoid for the three model parameters.
A standard (unrestricted) optimal design specifies a number of ability levels that would be optimal to sample from. For example, a design where equal proportions of the examinees are divided between two specific θ points would be an optimal design under the 2PL model. Such an unrestricted design is feasible if there are no restrictions on the availability of examinees with certain abilities. However, in a real test situation, it is not realistic to be able to choose an examinee with the exact ability needed. Instead, Ul Hassan and Miller (2019) proposed a restricted design that is more reasonable to attain in practice. In the restricted design, ability intervals are instead specified, as opposed to ability points for the unrestricted optimal design.
While Ul Hassan and Miller (2019) exemplified the restricted design approach for the 2PL model, it is valid also for other IRT models, including the 3PL model; the latter model has been considered by Ul Hassan and Miller (2021).
Let g be a continuous density on Θ = ℝ which describes the abilities of the examinees; we assume in this article that the examinees have standard normally distributed abilities and g is the N(0, 1)-density. A restricted design is described by sub-densities h_1, h_2, …, h_n ≥ 0, one for each of the n items in the test, where

h_1(θ) + h_2(θ) + … + h_n(θ) = g(θ) for all θ ∈ Θ.

Writing an expression for the information matrix in a GNLM (see Section 5 of Seber & Wild, 1989) in the notation of Ul Hassan and Miller (2019, 2021), the standardized information matrix of the item parameters β = (β_1, …, β_n) is M(h) = diag(M_1(h_1), …, M_n(h_n)) with

M_i(h_i) = ∫_Θ p_i(θ)(1 − p_i(θ)) (∂η_i(θ)/∂β_i)(∂η_i(θ)/∂β_i)^T h_i(θ) dθ.

Here, the density h summarizing the sub-densities h_i describes the allocation rule saying which pretest item 1, …, n should be given to examinees with a specific ability θ (formally, it is defined as h(θ, i) = h_i(θ) and is a density on the product space Θ × {1, …, n}). Also, ∂η_i(θ)/∂β_i is the derivative of the logit link described in Section The 3PL Model with respect to β_i.
To obtain an optimal design, an appropriate convex function Ψ of M(h) needs to be optimized; a design h* is Ψ-optimal if h* = arg min h Ψ(M(h)).
The considered model is not linear, which means that M(h) depends on the item parameters β_i. Therefore, some initial values must be assigned to the item parameters β_i; these can be obtained from an expert guess or from pre-estimation using a small sample of examinees. The resulting optimal design h* is then called a locally optimal restricted design (Atkinson et al., 2007; Ul Hassan & Miller, 2019). A sample of 30 examinees was enough for pre-estimation of item parameters in the situation considered by He and Chen (2020).
As mentioned above, the D-optimality criterion is used, which corresponds to minimizing

Ψ(M(h)) = −log det M(h) = −Σ_{i=1}^{n} log det M_i(h_i).
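To make the criterion concrete, the following R sketch (an illustration based on our reconstruction of the information matrix above, not the authors' implementation) approximates M_i(h_i) for a single 3PL item on a grid of abilities, with the sub-density h_i taken as the N(0, 1) density restricted to a hypothetical allocation interval, and evaluates the item's contribution to the D-criterion.

```r
# Fisher information matrix of one 3PL item at ability theta,
# for the parameter vector beta = (a, b, c)
info_3pl <- function(theta, a, b, c) {
  pstar <- 1 / (1 + exp(-a * (theta - b)))  # 2PL part
  p     <- c + (1 - c) * pstar              # 3PL probability
  # gradient of p with respect to (a, b, c)
  dp <- c((1 - c) * pstar * (1 - pstar) * (theta - b),
          (1 - c) * pstar * (1 - pstar) * (-a),
          1 - pstar)
  tcrossprod(dp) / (p * (1 - p))            # outer product / Bernoulli variance
}

# Standardized information under a sub-density h_i: here the N(0, 1) density
# restricted to a hypothetical ability interval [lo, up]
M_item <- function(a, b, c, lo, up, grid = seq(-4, 4, by = 0.01)) {
  h <- dnorm(grid) * (grid >= lo & grid <= up)  # sub-density on the grid
  M <- matrix(0, 3, 3)
  for (k in seq_along(grid)) {
    M <- M + info_3pl(grid[k], a, b, c) * h[k] * 0.01
  }
  M
}

M <- M_item(a = 1.2, b = 0.5, c = 0.2, lo = -0.5, up = 4)
-log(det(M))  # this item's contribution to the D-criterion
```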
Ul Hassan and Miller (2019) derived a new equivalence theorem for the calibration of multiple items, which makes it possible to verify whether a design is D-optimal. To find an optimal design, an exchange algorithm can be used (Ul Hassan & Miller, 2021), which is implemented in an accompanying R package.

Figure 2. Calibration of four items under the 3PL model.
The Swedish Scholastic Aptitude Test (SweSAT)
SweSAT is a standardized test used for admission to higher education in Sweden (Umeå University, 2021). The test is given twice a year to around 40,000 examinees and consists of two main sections, a quantitative and a verbal section. It is a paper-and-pencil test with a total of five parts: two quantitative parts, two verbal parts, and one try-out/calibration part used for testing the performance of new items. Every part consists of 40 items, all identical across test locations except for the try-out parts, which may differ between locations. The total score is reported on a scale from 0 to 2 and is equated between tests to be comparable.
In this study, we use response data from the quantitative part of the second test round of 2018 and estimate the item parameters, which are used as a starting point in our simulation study. The test parts consist of multiple-choice questions with dichotomous outcomes. The 3PL model is often appropriate for analyzing such items, and based on goodness-of-fit comparisons we concluded that the 3PL model had the best fit also in these particular cases.
Methods
Calibration of New Items
The method we propose here for calibrating new items can be performed under a separate offline setup or an integrated online setup. For both setups, the estimated ability is used to determine which examinee will be given which pretest item. The items used for estimating the ability are called the operational items, or the operational test. SweSAT uses a separate offline calibration, as described in Section The Swedish Scholastic Aptitude Test (SweSAT). For an integrated setup, the calibration part is usually placed at the end of the test, since the estimated abilities have better precision when the estimation can be based on more operational items. In this paper, we use item parameters that are estimated from the SweSAT. We divide the items into an operational part and a calibration part consisting of the same number of items. Which examinee should be given which pretest item is determined either randomly (random design) or according to the restricted optimal design defined in Section Optimal Design Allocation. The optimal design should theoretically produce estimates with the lowest standard errors for the parameters in the item-parameter vector β_i of the 3PL IRT model (Equation (1)).
Block Design
An assumption in the derivation of the restricted optimal design is that each examinee responds to one pretest item. To allow each examinee to respond to several pretest items, the pretest items are divided into blocks of m items each, and the optimal allocation is derived separately within each block.
Every examinee will be given l pretest items, one from each block, in addition to the operational items. Which item from each block should be assigned to an examinee is decided by the examinee's estimated ability, which is based on the responses to the operational test.
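Once the design has been derived, applying the allocation rule within a block is a simple interval lookup. A minimal R sketch follows, assuming that the restricted design for one block of m = 4 items can be summarized by three cut points on the ability scale (all values are hypothetical; in general an item may also be assigned several disjoint intervals).

```r
# Hypothetical interval boundaries from the optimal restricted design for one
# block of four items: (-Inf, -1.1], (-1.1, 0.2], (0.2, 1.3], (1.3, Inf)
cuts <- c(-1.1, 0.2, 1.3)
items_in_block <- c(39, 8, 38, 40)  # illustrative item IDs, ordered by difficulty

theta_hat <- rnorm(10)  # estimated abilities of some examinees (simulated here)

# findInterval() returns 0, ..., 3, so add 1 to index the item vector
assigned_item <- items_in_block[findInterval(theta_hat, cuts) + 1]
assigned_item
```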
Table 1. One block of pre-estimated parameters.
Simulation Setup - 4 Cases
The proposed method is evaluated using simulation studies divided into four separate scenarios. An outline of the elements in the simulations under the four cases is illustrated in Figure 3. The different cases range from a purely theoretical case to a case replicating a real calibration setting as closely as possible. There are two intermediate steps, where we relax one factor at a time, aiming to isolate the influence of each.

Figure 3. Outline of the simulation studies. Data and design generation is marked with rectangles, estimation of abilities or item parameters is marked with rectangles with rounded corners, and computations for result generation are marked with parallelograms.
The simulation study compares an optimal design allocation to a random allocation of pretest items to examinees, in terms of the precision of the estimated item parameters. The simulations are run S = 2,000 times. Let the estimated parameter vector for item i, i = 1, …, I, in simulation run s, s = 1, …, S, when design d, d = O (Optimal), R (Random), is used be denoted β̂_i^(d,s).
We use the SweSAT data as a starting point, with the aim to replicate the SweSAT test and calibration setting in the simulations. The total number of items in the quantitative part of the SweSAT is 80; 39,321 examinees took the test, and the 39,321 × 80 response matrix is used to estimate the item parameters for the 80 items. This is done via marginal maximum likelihood estimation (MMLE), assuming a normal distribution for the examinee population (Bock & Aitkin, 1981). These MMLE item-parameter estimates are then utilized as the true item parameters β_true when generating a response matrix in each simulation iteration via the 3PL model. Using the same number of examinees as in the SweSAT, 39,321 abilities are drawn from the N(0, 1) distribution. These function as the true abilities of the examinees, θ_true, and are also used when generating the simulated response matrices (together with the “true” item parameters β_true).
We decided to let the operational and calibration parts consist of 40 items each, since each part in SweSAT consists of 40 items. The calibration part is divided into l = 10 blocks with m = 4 items per block. This means that every examinee will be given 10 pretest items. Which items each examinee is given is determined by the optimal design on the basis of their EAP ability estimate. This optimal design is compared to the random design, where 10 items are randomly allocated to every examinee. This means that the sample size per pretest item is about 39,321/4 ≈ 9,830 for the random design. The 80 true items are randomly divided into an operational part and a calibration part, both of length 40. Responses are simulated by generating draws from the binomial distribution Bin(1; p_ij(θ_j | β_i)), where p_ij is defined as in Equation (1). The R-package
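The data-generating step can be sketched in a few lines of base R. The code below is an illustration of this step (not the authors' exact script) and reuses the p3pl() helper from the earlier sketch; the item parameters are drawn randomly here, whereas in the study they are the MMLE estimates from the 2018 SweSAT data.

```r
set.seed(1)
N <- 39321              # number of examinees
theta_true <- rnorm(N)  # true abilities

# Illustrative "true" item parameters for the 80 items
beta_true <- data.frame(a = runif(80, 0.5, 2),
                        b = rnorm(80),
                        c = runif(80, 0, 0.3))

# N x 80 matrix of 3PL response probabilities and simulated 0/1 responses
P <- sapply(1:80, function(i)
  p3pl(theta_true, beta_true$a[i], beta_true$b[i], beta_true$c[i]))
responses <- matrix(rbinom(length(P), size = 1, prob = P), nrow = N)
```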
Case I - Theoretical
The first step is to form the blocks of items, so that the optimal design allocation rules can be derived through the exchange algorithm (Ul Hassan & Miller, 2021). In this purely theoretical case, the relative efficiencies are computed from the information matrices of the designs, without simulating any responses.
Case II - True Abilities
For this case, as well as for the two following cases, we generate response matrices based on θ_true and β_i = β_true. The optimal design rules derived with the true item parameters are applied, and the allocation of pretest items is based on the true abilities θ_true.
For the random design, every examinee is randomly assigned responses for 10 pretest items (and missing values for the rest of the items). The item parameters for the pretest items are estimated using these 10 responses per respondent, yielding the random-design estimates of the item parameters β̂_i^(R,s).
Design efficiencies are then calculated using the item-parameter estimates of the two designs over the S = 2000 simulation runs.
Case III - Estimated Abilities and True Parameters
In a real calibration situation, examinee abilities are not known and need to be estimated. We use the operational part of the response matrix to first estimate the ability of each examinee; the estimated abilities are then used to allocate the pretest items according to the optimal design, while the item parameters used to derive the design are still the true ones.
Case IV - Estimated Abilities and Pre-estimated Parameters
To come even closer to a realistic setup, we also relax the assumption that the item parameters used in the design generation are known. Instead, we assume that a small pre-calibration study with 200 examinees was done to obtain initial information about all item parameters, which can then be used for design generation in the following calibration with N = 39,321 examinees. Therefore, we simulate 200 responses and pre-estimate the parameters of the pretest items; these pre-estimated values are used when deriving the optimal design.
Estimation of Parameters
In Cases III and IV, when pretest item parameters are estimated, the ability estimates are treated as the true abilities. The pretest item parameters are then estimated by maximum likelihood for known, fixed abilities, as in regular logistic regression with a non-latent covariate. This approach is similar to the so-called Method A (Stocking, 1988), in the sense that abilities are estimated separately in a first step and treated as fixed in the subsequent step, when the item parameters are estimated. However, Method A uses as a first step separate standard maximum likelihood estimation, via, for example, Newton-Raphson maximization, of the ability of each examinee, given the operational item parameters and responses. In our simulations we instead use the EAP ability estimates obtained through MMLE in the first step, which has several advantages over ML for CAT (Bock & Mislevy, 1982). If the abilities were not treated as fixed, the EM algorithm used for estimating the item parameters would assume that the examinees' abilities follow a standard normal distribution. Since the examinees given a certain pretest item are purposely selected when the optimal design allocation is used, their abilities do not follow the assumed distribution, and the estimates would be biased.
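The fixed-ability maximum likelihood step for a single pretest item can be written as a small numerical optimization. The sketch below is a minimal illustration of this idea (not the estimation routine used in the study); the usage example reuses p3pl() from the earlier sketch.

```r
# Negative log-likelihood of one 3PL item with the abilities treated as fixed
negloglik_item <- function(par, y, theta) {
  a <- par[1]; b <- par[2]; c <- plogis(par[3])  # keep c in (0, 1)
  p <- c + (1 - c) / (1 + exp(-a * (theta - b)))
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

# y: 0/1 responses of the examinees allocated to the item
# theta_fixed: their (estimated) abilities, plugged in as known constants
fit_item <- function(y, theta_fixed) {
  fit <- optim(c(1, 0, qlogis(0.2)), negloglik_item,
               y = y, theta = theta_fixed, method = "BFGS")
  c(a = fit$par[1], b = fit$par[2], c = plogis(fit$par[3]))
}

# Usage example with simulated data
theta_fix <- rnorm(5000)
y_sim <- rbinom(5000, 1, p3pl(theta_fix, a = 1.3, b = 0.4, c = 0.2))
fit_item(y_sim, theta_fix)
```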
Ban et al. (2001) evaluated Method A among other estimation methods in the setting of online CAT and concluded that the Multiple Expectation-Maximization (MEM) method performed best, while Method A was associated with the highest error. Yet, for some models, and especially with larger sample sizes, Method A has been shown to perform well (Chen et al., 2012, 2017). Chen and Wang (2016) developed and assessed an improved version of Method A for multidimensional CAT. Another variant is the adapted methods proposed by He et al. (2017). For our purposes, the possibility to set the abilities according to the distribution used to derive the optimal allocation is essential. Since our method ignores the estimation error of the ability estimates, the estimates of the pretest item parameters will be inaccurate to some extent. A comparison between simulation Cases II and III provides some insight into this difference.
Estimation of Abilities
In Case III and Case IV, the abilities of the examinees are estimated. Based on the results from the operational part of the test, the ability of each examinee is estimated using the EAP method. Unfortunately, despite the true abilities being N(0, 1) in our simulation studies, the EAP estimates have a non-normal distribution. The negative abilities are somewhat shrunken towards 0, see Figure 4, which is due to the asymmetry of the 3PL model.

Figure 4. Histogram of estimated abilities.
Since the optimal design method assumes a standard normally distributed population, we transform the estimated abilities to the standard normal scale using a percentile (rank-based) transformation. The percentile-transformed estimates are then used when allocating the pretest items according to the optimal design.
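A minimal sketch of these two steps, EAP estimation on a quadrature grid followed by a rank-based transformation to the standard normal scale, is given below. The exact percentile transformation is not spelled out above, so the qnorm/rank formula used here is an assumption for illustration.

```r
# EAP ability estimate from the responses y to the operational items with
# parameters (a, b, c), using a N(0, 1) prior on a quadrature grid
eap_ability <- function(y, a, b, c, grid = seq(-4, 4, length.out = 81)) {
  lik <- sapply(grid, function(th) {
    p <- c + (1 - c) / (1 + exp(-a * (th - b)))
    prod(p^y * (1 - p)^(1 - y))   # likelihood of the response vector at th
  })
  post <- lik * dnorm(grid)
  sum(grid * post) / sum(post)    # posterior mean
}

# Percentile (rank-based) transformation of the EAP estimates towards N(0, 1);
# this particular formula is an assumption used only for illustration
percentile_transform <- function(theta_hat) {
  n <- length(theta_hat)
  qnorm((rank(theta_hat) - 0.5) / n)
}
```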
Definitions of Measures
In this section, several summary measures used to evaluate the designs in the simulation studies are presented. For clarity of presentation, the index i is temporarily dropped from the formulas. Note that each measure is defined per item, although the index i is not explicitly written out.
Error Matrix
The empirical error matrix can be viewed as a multivariate version of the mean squared error of an estimator. When design d, d = O, R, is used, it is computed as

Err_d = (1/S) Σ_{s=1}^{S} (β̂^(d,s) − β_true)(β̂^(d,s) − β_true)^T,

where β̂^(d,s) is the vector of parameter estimates for the item in simulation run s and β_true is the vector of true parameters.

Based on this, the empirical D-criterion is defined as D_d = det(Err_d).
MSE
For each of the three item parameters, the empirical mean squared error for the parameter estimator under design d is

MSE_d(a) = (1/S) Σ_{s=1}^{S} (â^(d,s) − a_true)^2,

and analogously for b and c. This is equivalent to the diagonal of the empirical error matrix defined above, that is, (MSE_d(a), MSE_d(b), MSE_d(c)) = diag(Err_d).

To summarize the MSE of all three parameters in the model, the average MSE is computed:

AMSE_d = (1/3) (MSE_d(a) + MSE_d(b) + MSE_d(c)).

This measure is the empirical counterpart of the A-optimality criterion.
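For one item and one design, these measures can be computed directly from the S × 3 matrix of parameter estimates over the simulation runs; a small illustrative sketch:

```r
# est: S x 3 matrix of estimates (a, b, c) over the simulation runs
# beta_true: vector of length 3 with the true parameters of the item
error_measures <- function(est, beta_true) {
  dev <- sweep(est, 2, beta_true)          # estimation errors per run
  err_mat <- crossprod(dev) / nrow(est)    # empirical error matrix
  list(error_matrix = err_mat,
       D_criterion  = det(err_mat),        # empirical D-criterion
       MSE          = diag(err_mat),       # per-parameter MSE
       AMSE         = mean(diag(err_mat))) # average MSE (A-criterion)
}
```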
CC Method
Inspired by the Haebara approach for IRT test equating (Kolen & Brennan, 2014), the squared difference between the ICCs based on estimated and true parameters is evaluated at a certain ability level θ_j. The empirical characteristic curve difference is

CC_d(θ_j) = (1/S) Σ_{s=1}^{S} (p(θ_j | β̂^(d,s)) − p(θ_j | β_true))^2.

To judge the overall difference between estimated and true ICCs, we can integrate this difference over θ (the area between the curves). If we weight according to the distribution of the examinees, the integral corresponds to a sum over the abilities of the j = 1, …, N examinees. This yields the total difference

CC_d = Σ_{j=1}^{N} CC_d(θ_j).
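The CC measure compares estimated and true ICCs at the abilities of the N examinees. A short sketch, again reusing the p3pl() helper from the earlier sketch and following the reconstruction of the formulas above:

```r
# est: S x 3 matrix of estimates, beta_true: true (a, b, c),
# theta: abilities of the N examinees used as evaluation points
cc_difference <- function(est, beta_true, theta) {
  icc_true <- p3pl(theta, beta_true[1], beta_true[2], beta_true[3])
  per_run <- apply(est, 1, function(bh)
    (p3pl(theta, bh[1], bh[2], bh[3]) - icc_true)^2)  # N x S matrix
  sum(rowMeans(per_run))  # average over runs, summed over the examinees
}
```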
Evaluation Metrics
To evaluate which of the two designs, Optimal (O) or Random (R), performs better, several metrics are calculated as ratios of the previously defined measures. These ratios are called (relative) efficiencies; when they exceed 1, the parameters are estimated less precisely with the random design than with the optimal design. We can interpret the efficiencies in terms of sample size: an efficiency of 1.25 means that 25% more examinees are needed for the random design than for the optimal design to achieve a similar precision of the estimates, independent of the actual sample sizes used.
Relative Average MSE
The relative average MSE, or relative A-efficiency, is given by the ratio of the empirical AMSE of the random design (R) and the optimal design (O):

RE_A = AMSE_R / AMSE_O.
Relative D-efficiency
The relative D-efficiency is obtained as the ratio of the empirical D-criteria of the random design (R) and the optimal design (O):

RE_D = (D_R / D_O)^(1/3).

The exponent 1/3 (in general, 1/(number of parameters in the model)) is needed for scaling such that the efficiency can be interpreted in terms of sample size, see, for example, Atkinson et al. (2007), Section 11.
Relative CC-efficiency
The relative CC-efficiency is defined as the ratio of the CC-criteria of the random design (R) and the optimal design (O):

RE_CC = CC_R / CC_O.
Overall Evaluation
All of the previously defined measures are calculated per item. For an overall assessment across all 40 items, we take the geometric mean over them. Since efficiencies are relative measures, the geometric mean is more appropriate than the arithmetic mean. For example, if Item 1 has efficiency 0.5 and Item 2 has efficiency 2, then the random design needs half as many examinees for Item 1 and twice as many for Item 2 to obtain precision similar to the optimal design. Overall, both designs should then be of the same quality, which is reflected by the geometric mean of 1. In contrast, the arithmetic mean would be 1.25, suggesting an advantage of the optimal design. The arithmetic mean would thus give a too optimistic overall measure; the geometric mean is always smaller than or equal to the arithmetic mean.
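Combining the measures above, the relative efficiencies and their overall geometric mean can be computed as in the following sketch:

```r
# err_R, err_O: empirical error matrices of one item under the two designs
relative_D_efficiency <- function(err_R, err_O) {
  (det(err_R) / det(err_O))^(1 / 3)  # exponent = 1 / (number of item parameters)
}

# Geometric mean of per-item efficiencies (overall evaluation)
geometric_mean <- function(re) exp(mean(log(re)))

geometric_mean(c(0.5, 2))  # = 1: the two designs are of equal overall quality
```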
Results
The results are presented as the evaluation metrics per item, based on averages taken over the S = 2,000 simulation runs. The number of runs was chosen to obtain a precision of the relative item efficiencies of about ±0.05 (we have derived bootstrap 95% confidence intervals for the simulation error in some cases, which are approximately of that size, but for the sake of clarity we have not included them in the results below). This means that efficiency values within about 0.05 of 1 should not be over-interpreted as real differences between the designs.
Table 2. Relative efficiencies RE_D, RE_CC, and RE_A for optimal versus random design, summarized over all 40 items for the three measures (geometric mean).
Since the D-optimality criterion was used for the optimization of the design, the D-efficiency RE_D is in focus when examining the results. Scatter plots displaying the relationship between the true item parameters and the relative D-efficiency RE_D are given below. Corresponding plots for RE_CC and RE_A can be found in the Appendix.
A general result is that the proposed method, using the block design and the optimal restricted allocation, estimates most of the pretest items more efficiently than the random design.
Table 3. Relative efficiencies RE_D, RE_CC, and RE_A for optimal versus random design and true item parameters a, b, c, per item. Case IV: optimal allocation based on estimated abilities and pre-estimated parameters.
Table 4. Relative efficiencies RE_D, RE_CC, and RE_A for optimal versus random design and true item parameters a, b, c, per item. Case III: optimal design allocation based on estimated abilities and true item parameters.
Figures 5 and 6 show scatter plots of RE_D versus the true item parameters for Case IV with pre-estimated item parameters and Case III with true item parameters, respectively. The colors of the dots indicate the position in the blocks. The position in the block is determined by the difficulty of the item (the b_i-parameter), which means that position is a measure of the relative difficulty within the block. In both cases, irrespective of which parameter we examine, it is almost always the easiest item in each block that stands out as being less precisely estimated in terms of D-efficiency RE_D (with just a few exceptions). The per-item efficiencies are generally higher for Case III than for Case IV, since there is no item-parameter uncertainty when the item parameters are assumed known in Case III. A similar pattern is observed also with respect to the other two measures, as can be seen in Figures A1 to A4 in the Appendix.

Figure 5. True item parameters plotted against RE_D, grouped by position in the block. Case IV: optimal allocation based on estimated abilities and pre-estimated parameters.

Figure 6. True item parameters plotted against RE_D, grouped by position in the block. Case III: optimal design allocation based on estimated abilities and true item parameters.

Discussion
We studied four different simulation scenarios with varying degrees of practicality, from entirely theoretical to the most realistic, designed to be as close as possible to the SweSAT setup. This allows us to quantify the influence of the different factors (on average) by comparing the overall efficiencies, summarized over all items, between the scenarios. We conclude that there is an efficiency loss at each step, especially between Cases III and IV (pre-estimation of item parameters) but also between Cases II and III (estimation of abilities). For the most realistic Case IV, the optimal design allocation was still better than the random design allocation, being about 5% more efficient in terms of RE_D and RE_CC and almost 10% in terms of RE_A. This means that about 5% more examinees are needed for the random design than for the optimal design to obtain the same precision of the item-parameter estimates.
The results per item show that the optimal design method estimates the pretest items more efficiently than the random design in most cases. Also, analyzing the block positions, we are able to identify that it is mainly for the easiest items in the blocks that the optimal design method is inferior to the random allocation. This pattern can be observed both when the item parameters used in the optimization are pre-estimated (Case IV) and when true item parameters were used (Case III), see Figures 5 and 6. The reason is the asymmetry of the 3PL model, which implies that the c-parameters are estimated mainly based on examinees with low abilities. It is more difficult to estimate c when the item is easy (low b). Since it is more difficult to estimate b and c for easy items, the optimal design puts more focus on the other items in the block. Their precision can more easily be improved, leading to an increased overall efficiency even when accepting a decreased efficiency for the easy items. In reality, however, most of the examinees' abilities will lie in θ ∈ [−2, 2]. In that span, two items can have similar item characteristic curves even if the item parameters differ.
Since most of the examinees' abilities lie in the span θ ∈ [−2, 2], the consequences of using such an item in an item bank might therefore not be that severe, even if its item parameters are less precisely estimated.
As noted, the overall RE_D (Table 2) is lower when the pretest item parameters are pre-estimated (Case IV) than when true item parameters are used (Case III). For individual items, the differences can be even larger, for example, Item 1 in Tables 3 and 4. This indicates that the effect of putting more focus on the harder items is bigger in the scenario where the pretest item parameters are pre-estimated. But even if the item parameters used in the design generation are only pre-estimated, the optimal allocation is still, on average, more efficient than the random allocation.
We have shown that, with the proposed method, most of the items are estimated with higher efficiency, and that it is possible to identify when the method is not useful. Since it is the easiest item in every block that is estimated with lower efficiency, it would be beneficial if such items could be identified in advance. Even though the exact value of the parameter b is not known, it is reasonable to believe that the items can be ranked with respect to difficulty by using, for example, comparative judgments. Attali et al. (2014) showed that judges were able to rank several items in terms of difficulty. The items that are likely to be estimated with worse efficiency under the optimal design could then, for example, be calibrated using a random design.
In this paper, we are not taking into account practical issues with assigning items. It could be necessary to put constraints on the pretest items, for example, on items that cannot be simultaneously included in the calibration part. There must also be a mix of content in the pretest items, and they need to be compatible with the operational test items.
In the optimal design approach we are using, it is assumed that the abilities of the examinees are known. In the Cases III and IV of our study, we apply this optimal design for abilities which are estimated. This discrepancy leads to the drop in efficiency observed. However, it could be possible in future research to modify the optimal design approach to include the uncertainty around the ability estimates and to determine optimal designs for the situation of estimated abilities. With these optimal designs, it could be possible to increase efficiency in the case of estimated abilities.
Also, instead of using a pre-estimated point estimate or guess for the pretest item parameters, a prior distribution could be assigned from which an optimal-on-average (also known as Bayesian) design could be derived.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Swedish Research Council (Vetenskapsrådet) Grant 2019-02706.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Appendix
Figure A1. True item parameters plotted against RE_CC, grouped by position in the block. Case IV: optimal allocation based on estimated abilities and pre-estimated parameters.

Figure A2. True item parameters plotted against RE_CC, grouped by position in the block. Case III: optimal design allocation based on estimated abilities and true item parameters.

Figure A3. True item parameters plotted against RE_A, grouped by position in the block. Case IV: optimal allocation based on estimated abilities and pre-estimated parameters.

Figure A4. True item parameters plotted against RE_A, grouped by position in the block. Case III: optimal design allocation based on estimated abilities and true item parameters.
Theoretical relative efficiencies RE_D, RE_CC, and RE_A for optimal versus random design and true item parameters a, b, c, per item. Case I: Theoretical.

Block  Pos  RE_D   RE_CC  RE_A   a      b       c      Item
1      1    1.124  1.043  1.580  0.799  −1.516  0.071  39
1      2    1.103  1.083  1.195  1.281  −0.466  0.189   8
1      3    1.060  1.070  1.283  0.989  −0.170  0.093  38
1      4    1.188  1.199  1.383  1.144   0.510  0.169  40
2      1    1.136  1.067  1.612  0.983  −1.227  0.052  31
2      2    1.076  1.096  1.474  0.794  −0.454  0.004  14
2      3    1.146  1.130  1.252  1.369  −0.140  0.210  33
2      4    1.226  1.248  1.453  1.676   0.513  0.185  29
3      1    1.121  1.052  1.811  0.775  −1.223  0.010  36
3      2    1.082  1.056  1.434  1.070  −0.428  0.116  37
3      3    1.190  1.138  1.187  1.782  −0.108  0.140   2
3      4    1.293  1.329  1.557  2.385   0.607  0.191  25
4      1    1.076  1.012  1.494  0.813  −0.949  0.137  15
4      2    1.052  1.028  1.350  0.924  −0.389  0.147  17
4      3    1.105  1.093  1.208  1.256   0.157  0.138   9
4      4    1.224  1.250  1.499  1.582   0.694  0.275  19
5      1    1.230  1.100  1.574  1.803  −0.900  0.310  20
5      2    1.155  1.196  1.411  1.911  −0.376  0.240  11
5      3    1.104  1.176  1.213  1.471   0.196  0.225  28
5      4    1.108  1.050  1.249  1.200   0.741  0.096  34
6      1    1.149  1.107  1.866  0.876  −0.792  0.016  27
6      2    1.143  1.079  1.314  1.399  −0.346  0.150  35
6      3    1.252  1.254  1.325  2.374   0.208  0.281  23
6      4    1.191  1.169  1.503  1.187   0.806  0.137  18
7      1    1.167  0.974  1.362  1.458  −0.739  0.318   3
7      2    1.099  1.025  1.384  1.181  −0.270  0.372   7
7      3    1.213  1.218  1.351  2.203   0.315  0.461  26
7      4    1.256  1.363  1.785  2.238   1.116  0.301   6
8      1    1.157  1.033  1.619  1.214  −0.691  0.062  32
8      2    1.125  1.044  1.365  1.334  −0.241  0.187   4
8      3    1.187  1.184  1.408  1.930   0.315  0.266  24
8      4    1.261  1.236  1.855  1.687   1.147  0.207  10
9      1    1.109  0.997  1.450  1.333  −0.648  0.296  21
9      2    1.140  1.089  1.201  1.781  −0.228  0.073   5
9      3    1.151  1.165  1.318  1.426   0.421  0.158  12
9      4    1.261  1.295  1.876  1.789   1.169  0.274  30
10     1    1.237  1.156  1.586  1.652  −0.590  0.166  22
10     2    1.146  1.186  1.337  1.627  −0.184  0.375   1
10     3    1.123  1.203  1.207  1.407   0.427  0.358  16
10     4    1.103  1.022  1.198  0.946   1.250  0.100  13