
Abstract

To solve the attenuation paradox in computerized adaptive testing (CAT), this study proposes an item selection method, the integer programming approach based on real-time test data (IPRD), to improve test efficiency. The IPRD method turns information regarding the ability distribution of the population from real-time test data into feasible test constraints and uses integer programming to reversely assemble shadow tests for item selection, thereby preventing the attenuation paradox. A simulation study was conducted to thoroughly investigate IPRD performance. The results indicate that the IPRD method can efficiently improve CAT performance in terms of the precision of trait estimation and satisfaction of all required test constraints, especially for conditions with stringent exposure control.

Keywords

computerized adaptive testing, attenuation paradox, shadow test, integer programming, optimal design

Computerized adaptive testing (CAT) sequentially selects each suitable item for an examinee’s current trait estimate, and thus, the examinee does not need to answer items that are too easy or too difficult. Because of the adaptive item selection procedure, CAT can effectively improve test efficiency compared with linear testing. Although CAT is one of the most efficient test formats in practice, it is by no means the optimal one. The inherent nature of adaptive testing procedures could potentially affect CAT performance.

Considering that CAT is a sequential test, an examinee’s ability estimate will become more precise as the test progresses. In the early stages of CAT, where trait estimates are less precise, always selecting the optimal item (e.g., with maximum Fisher information [MFI]) toward an examinee’s current ability estimate may not always be best. The current ability estimate could be far from the examinee’s true ability, and the administration of the optimal item can be inefficient (e.g., provides little information), which is known as the attenuation paradox in CAT (Lord & Novick, 1968; van der Linden & Pashley, 2000). Furthermore, when CAT is under exposure control, the attenuation paradox causes greater inefficiency in ability estimation. Because the usage of each item is restricted, the consequences of the attenuation paradox are not only obtaining little information from the optimal item but also the preclusion of the possibility for administering the optimal item to other suitable examinees (Chen et al., 2020). The more stringent the item exposure control, the more CAT efficiency may be reduced due to the attenuation paradox under item exposure control (denoted as the preclusion problem). It is worth noting the preclusion problem arises when examinees take CAT in a consecutive manner, where preceding examinees’ CAT administrations can influence the subsequent ones.

Several item selection rules (ISRs) have been proposed to mitigate the effects of the attenuation paradox, including Kullback–Leibler information (Chang & Ying, 1996; Cover & Thomas, 1991; Kullback, 1959) and general weighted information (Veerkamp & Berger, 1997). Studies have also been conducted to investigate the effects of the attenuation paradox on ability estimation. In summary, the proposed ISRs can generally reduce the effects of the attenuation paradox, yielding an improvement in ability estimation in the early stages of CAT. Once the test is sufficiently long (e.g., no fewer than 20 items), most ISRs become indistinguishable in ability estimation precision, regardless of whether the ISR takes the attenuation paradox into account (e.g., Chen & Ankenman, 2004; Chen et al., 2000).

However, the lack of a distinguishable difference in ability estimation among item selection methods (with or without considering the attenuation paradox) is not sufficient for inferring that the attenuation paradox eventually has a negligible effect on CAT. That is, the attenuation paradox may not have been fully solved by the proposed ISRs. Furthermore, few studies have addressed the preclusion problem of the attenuation paradox when there is exposure control, in which the degree of adaptation between examinees may be reduced (Chen et al., 2020). Thus, test efficiency could be compromised, especially for high-stakes achievement tests that take item exposure control as a routine procedure. To summarize, we must reconsider how to deal with the attenuation paradox (including the preclusion problem) and develop an item selection method that can respond to it accordingly.

To prevent the attenuation paradox (including the preclusion problem), especially when there is exposure control, CAT must correctly administer each item to suitable examinees at an appropriate administration order according to an optimization rule (e.g., maximizing the sum of test information across all examinees). Before delving into the problem, we can start by examining the features of a conventional CAT relevant to the attenuation paradox to gain insights into solutions. According to Figure 1, the conventional CAT has two inherent features that have the potential to cause the attenuation paradox: between-examinee sequential administration and within-examinee sequential administration. Specifically, regarding the first feature, examinees participate in tests in sequence, so that we do not know whether the next examinees are more appropriate to the currently selected items, and this is why the preclusion problem occurs under item exposure control. Regarding the second feature, a test starts from the first item, where a trait estimate is less precise. In the early stages, therefore, CAT stands the risk of inefficient administration (e.g., the optimal item selected does not suit the examinee), causing the attenuation paradox. To deal with the attenuation paradox, correspondingly, the two features of the conventional CAT could be modified by administering tests to all examinees at the same time (in responding to the feature of between-examinee sequential administration) and administering items to an examinee using a reverse procedure (in responding to the feature of within-examinee sequential administration). The reverse procedure means starting a test from the last item, where a trait estimate is most precise. The CAT with the two modified features proposed here (hereafter called the ideal CAT) is not necessarily optimal but is expected to effectively solve the attenuation paradox (including the preclusion problem) that the conventional CAT encountered.

Figure 1.

Illustration of the conventional and ideal computerized adaptive testing.

In the ideal CAT, because each examinee simultaneously receives a test, we can determine how to administer each item (with restricted usage) to examinees who are the most suitable for optimizing test efficiency (e.g., the sum of test information). Therefore, the preclusion problem of the attenuation paradox can be avoided. In addition, the reverse administration procedure that starts a test from the last item, where an ability estimate is the most precise, can reduce the effect of the attenuation paradox. Specifically, items selected in the early stages tend to be of a higher quality (e.g., larger discrimination power) than those selected in the later stages because CAT always selects the optimal item. The ideal CAT administers higher (lower) quality items when the ability estimates are more (less) precise, which can reduce the possibility that high-quality items are administered to examinees with imprecise ability estimates.

Although the ideal CAT is conceptually simple and is effective in regard to the attenuation paradox, it is not possible to implement it in practical tests. Specifically, CAT is an on-demand test, and examinees take CAT in sequence. CAT cannot administer items with the full consideration of all examinees. Furthermore, the reverse administration procedure is unreasonable because CAT can never start a test when an examinee’s ability estimate is the most precise. Consequently, developing the ideal CAT as immune to the attenuation paradox seems impossible. Fortunately, real-time test data from the previously administered examinees could provide useful information to achieve the ideal CAT in practical test conditions.

Real-time test data accumulated by previously administered examinees contain information about each item administration, including the corresponding item, examinee, administration order, response, and response time. Given that these examinees are a representative sample of the population, real-time test data can be used to infer the CAT administrations of the population. That is, utilizing the information of the population’s CAT administrations in the real-time test data while administering a CAT to an examinee may approach the ideal CAT, which considers all examinees simultaneously, solving the preclusion problem of the attenuation paradox.

Regarding the reverse administration procedure, it can be achieved by preassembling a test after each item administration, just as the shadow test approach does (van der Linden, 2000). Specifically, the shadow test here assembles items in a reverse manner, sequentially selecting each item from the last to the current one being administered. Because the current item is selected last, it should be the least optimal item in the assembled shadow test. Consequently, the effect of the attenuation paradox caused by item administration with a less precise ability estimate and a high-quality item can be reduced.

Therefore, the purpose of the current study is to propose an item selection method, the integer programming approach based on real-time test data (IPRD), which utilizes the information regarding CAT administrations of the population in real-time test data to solve the attenuation paradox (including the preclusion problem). Specifically, the IPRD method turns the information in real-time test data into feasible test constraints and uses integer programming to reversely assemble shadow tests, optimizing each item administration. The IPRD method is expected to reduce the effect of the attenuation paradox, thereby improving CAT’s ability estimation precision.

While the IPRD method represents a novel approach, it shares similarities with the a-stratified (AS) method proposed by Chang and Ying (1999). The AS method involves stratifying an item pool into multiple strata based on the discrimination parameters of the items, ensuring that items with higher discrimination are administered in the later stages of CAT. This parallels the strategy employed by the IPRD method, where reversely assembled shadow tests are utilized, enabling the use of optimal items when examinees possess precise ability estimates. However, it is worth noting that the AS method does not consider other examinees during item selection, thus failing to directly address the preclusion problem. In situations with stringent exposure control, where the preclusion problem is more prominent, the IPRD method is expected to outperform the AS method in terms of trait estimation precision.

The remainder of the current article is organized as follows: First, the new item selection method, the IPRD, which is proposed in the current study to enhance optimal design in CAT, is illustrated. The procedures for implementing the IPRD method are demonstrated using three examples from simple to complex scenarios. Next, a simulation study is conducted to evaluate the performance of the IPRD method compared with several relevant item selection methods (e.g., the AS method with content blocking [ASCB]; Yi & Chang, 2003). Finally, based on the simulations, the results are discussed, and we draw our conclusions.

Integer Programming Method Based on Real-Time Test Data (IPRD)

To solve the attenuation paradox (including the preclusion problem), the features of between- and within-examinee sequential administration possessed by the conventional CAT need to be modified. That is, CAT should reversely administer items to all examinees simultaneously, just like the ideal CAT mentioned above. Although it is infeasible to implement the ideal CAT, we can utilize real-time test data and reversely assembled shadow tests to approach it. Specifically, the real-time test data from the sample (i.e., previously administered examinees) can provide information regarding the population, while the reverse-assembly procedure can reduce the effect of the attenuation paradox. To turn the information in real-time test data into feasible test constraints, selecting items like the ideal CAT becomes an optimization problem. Accordingly, the IPRD method is developed to optimize each item selection with a certain objective function (e.g., maximizing average test information) by reverse-assembling the shadow test according to the test constraints formulated by the real-time test data. Three examples are given to illustrate, step by step, the core concepts and implementation procedures of the IPRD method. It is worth noting that the conditions assumed in Examples 1 and 2 are ideal rather than real, and these two examples are only introduced to provide insights into CAT optimization in practical test scenarios (i.e., Example 3).

Example 1: CAT Optimization for All Examinees With Known Ability Levels

In this example, we assume that the ability level of each examinee is known; thus, here, CAT does not have the two features (i.e., between- and within-examinee sequential administration) related to the attenuation paradox. Under these circumstances, the optimization of CAT administration is straightforward: to simultaneously assemble tests for all examinees, as depicted in Figure 2. As an optimization problem, this can be formulated as the following equations. To facilitate the illustration, only the constraints of the item exposure rates are considered:

$$\max \sum_{i=1}^{I} \sum_{j=1}^{J} X_{ij} I_{ij}(\theta_{j}), \tag{1}$$

$$\begin{aligned}
\text{s.t.}\quad & \sum_{j=1}^{J} X_{ij} \leq r_{max} \times J, && \text{for } i = 1, \dots, I, \\
& \sum_{j=1}^{J} X_{ij} \geq r_{min} \times J, && \text{for } i = 1, \dots, I, \\
& \sum_{i=1}^{I} X_{ij} = L, && \text{for } j = 1, \dots, J,
\end{aligned} \tag{2}$$

where $X_{i j}$ is the decision variable of item i for examinee j, $I_{i j}$ is the item information of item i for examinee j, $θ_{j}$ is the ability estimate of examinee j, L is the test length, I is the pool size, J is the number of examinees, and $r_{max}$ and $r_{min}$ are the tolerance of the maximum and minimum item exposure rates, respectively.

Figure 2.

Illustration of Example 1.

The objective function aims to maximize the sum of test information across J examinees. The constraints of Equation 2 mean that the item exposure rates are restricted between $r_{min}$ and $r_{max}$, where $r_{min} \leq \bar{r} \leq r_{max}$ and $\bar{r}$ is the average item exposure rate (i.e., $L / I$). $r_{min}$ and $r_{max}$ are chosen by test practitioners according to an acceptable range of item exposure rates. Other constraints (e.g., content balancing) can be directly added to Equation 2.
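As a rough illustration of Equations 1 and 2, the sketch below solves a toy instance in Python. It is only a sketch under stated assumptions: exhaustive enumeration stands in for an integer programming solver (feasible only for very small problems), the 3PL information function uses a scaling constant of D = 1, and the four-item pool, the three ability values, and the function names are hypothetical.

```python
from itertools import combinations, product
from math import exp

def info_3pl(theta, a, b, c):
    """3PL Fisher information at theta (scaling constant D = 1 is an assumption)."""
    p = c + (1.0 - c) / (1.0 + exp(-a * (theta - b)))
    return a * a * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def assemble_all(items, thetas, length, r_min, r_max):
    """Search every assignment X for the one maximizing total information (Equation 1)."""
    n_items, n_examinees = len(items), len(thetas)
    info = [[info_3pl(t, *item) for t in thetas] for item in items]  # info[i][j]
    lo = r_min * n_examinees - 1e-9  # exposure bounds of Equation 2, with a
    hi = r_max * n_examinees + 1e-9  # small tolerance for floating-point products
    forms = list(combinations(range(n_items), length))  # each form has exactly L items
    best_value, best_tests = float("-inf"), None
    for tests in product(forms, repeat=n_examinees):  # one form per examinee
        counts = [0] * n_items
        for form in tests:
            for i in form:
                counts[i] += 1
        if any(n < lo or n > hi for n in counts):
            continue  # violates the minimum or maximum exposure constraint
        value = sum(info[i][j] for j, form in enumerate(tests) for i in form)
        if value > best_value:
            best_value, best_tests = value, tests
    return best_value, best_tests

# Hypothetical pool of (a, b, c) triples and known abilities; not taken from the study.
pool = [(1.6, 0.0, 0.2), (1.2, -1.0, 0.2), (1.0, 1.0, 0.2), (0.8, 0.0, 0.2)]
abilities = [-1.0, 0.0, 1.0]
value, tests = assemble_all(pool, abilities, length=2, r_min=1 / 3, r_max=2 / 3)
print(round(value, 3), tests)
```

With I = 4 and L = 2, the average exposure rate is .5, so the chosen bounds of 1/3 and 2/3 satisfy the ordering required above. For realistic pool sizes the enumeration would have to be replaced by a proper integer programming solver.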

Example 2: CAT Optimization for Each Examinee With Known Ability Levels in Sequence

Because CAT offers time flexibility for examinees in scheduling the tests, examinees may take tests in sequence. To get closer to the real situation, in this example, we assume only that the ability level of the current examinee is known. Here, CAT therefore has the vulnerable feature of between-examinee sequential administration in regard to the preclusion problem since we do not know the ability levels of future examinees. In this case, the IPRD method cannot simultaneously optimize the test for all examinees; however, previously administered examinees can be utilized as a sample to infer the population, optimizing the current examinee’s test. According to Figure 3, since the ability levels of previous examinees are known, we can assemble the test for examinee j* with the information of previously p administered examinees. Given that p administered examinees are sufficient to represent the population distribution, sequentially optimizing each examinee’s test with previously p administered examinees would be a good approximation of optimizing the tests for all examinees at the same time. Specifically, the optimization problem for examinee j* can be solved using the following equations:

$$\max \sum_{i=1}^{I} \sum_{j=j^{*}-p}^{j^{*}} X_{ij} I_{ij}(\theta_{j}), \tag{3}$$

$$\begin{aligned}
\text{s.t.}\quad & \sum_{j=j^{*}-p}^{j^{*}} X_{ij} \leq r_{max} \times (p+1), && \text{for } i = 1, \dots, I, \\
& \sum_{j=j^{*}-p}^{j^{*}} X_{ij} \geq r_{min} \times (p+1), && \text{for } i = 1, \dots, I, \\
& \sum_{i=1}^{I} X_{ij} = L, && \text{for } j = j^{*}-p, \dots, j^{*},
\end{aligned} \tag{4}$$

where p is the number of previously administered examinees.

Figure 3.

Illustration of Example 2.

The objective function here is to maximize the sum of test information for examinee j* and previously p administered examinees. Given that the p administered examinees are sufficient in providing information for the population distribution, the test that is assembled by the vector $X_{i j^{*}}$ for examinee j* is deemed the optimal test considering global adaptiveness among all examinees. However, in testing practice, we do not know the true ability of any examinee. Example 3 further demonstrates how the IPRD method solves the CAT optimization problem under practical test conditions suffering from the attenuation paradox (including the preclusion problem) by using real-time test data via reverse-assembling shadow tests.

Example 3: CAT Optimization for Each Examinee With an Unknown Trait Level in Sequence

CAT is a sequential test; an examinee’s ability estimate is unknown and updated after each item administration, becoming more precise as the test progresses. Hence, in this example, we demonstrate how the IPRD can be applied to practical test conditions without assuming that any ability level is known. Considering that high-quality items are always selected in the early stages, the IPRD method uses a reverse approach to assemble the shadow test to reduce the effect of the attenuation paradox. To solve the preclusion problem of the attenuation paradox, real-time test data are incorporated into the reverse-assembly approach. Specifically, the real-time test data of the item administrations at each administration order from previously administered examinees are utilized to form feasible constraints to achieve ideal item selection.

According to Figure 4, the IPRD method assembles a shadow test for each item administration of the current examinee. Specifically, there are two steps for assembling the shadow test. First, the cumulative distributions of item exposure counts are calculated in a reverse manner at each administration order. That is, the reverse cumulative distribution at a specific administration order is derived based on summing the items administered last to the specified administration order. For example, assume a three-item test with a six-item pool and three previously administered examinees. According to Figure 5, the three tables on the left record the items the three examinees received at each administration order, which are then used to calculate the reverse cumulative distribution table on the right. Taking the reverse cumulative distribution at administration order 1, for example, the exposure counts from the last item administration (i.e., the third administration) to the first item administration are summed.
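The first step can be sketched in a few lines of Python. This is a minimal illustration: the function name is hypothetical, and the records are made-up data for three examinees, three-item tests, and a six-item pool. They are not the values shown in Figure 5.

```python
def reverse_cumulative_counts(records, n_items):
    """Return rev[k] = exposure counts summed over administration orders k..L.

    records[e][k - 1] is the (1-based) item given to examinee e at order k.
    """
    length = len(records[0])
    running = [0] * (n_items + 1)  # index 0 unused so item numbers stay 1-based
    rev = {}
    for order in range(length, 0, -1):  # walk from the last order back to the first
        for test in records:
            running[test[order - 1]] += 1
        rev[order] = running[1:]  # snapshot (copy) of counts for items 1..n_items
    return rev

# Hypothetical administration records for three previous examinees.
records = [[1, 2, 3], [2, 5, 4], [5, 1, 6]]
rev = reverse_cumulative_counts(records, n_items=6)
for order in sorted(rev):
    print(order, rev[order])
```

For these made-up records, the row for administration order 3 counts only the last items, while the row for order 1 equals the total exposure counts, [2, 2, 1, 1, 2, 1].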

Figure 4.

Illustration of Example 3.

Figure 5.

Illustration of formulating the reverse cumulative distribution.

The second step in assembling a shadow test is to select items sequentially in reverse order based on the reverse cumulative distribution derived in the first step. That is, when selecting an item for the shadow test (e.g., the third item), the corresponding reverse cumulative distribution (e.g., the cumulative distribution at the third administration order) is used to determine the eligibility of each candidate item. For example, given $r_{max}$ = .6, all items are eligible for the item selection of the last item in the shadow test. In contrast, for the first item in the shadow test, only Items 3, 4, and 6 are eligible (their exposure rates are .33 and will be .5 if administered). In brief, with the reverse cumulative distribution of item exposure counts, the reverse-assembly procedure can imitate the reverse administration procedure with all examinees simultaneously in the ideal CAT because whether an item is included in the current shadow test depends on item exposure counts accumulated only from the last to the current administration order based on all previously administered examinees.
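The eligibility check described above can be sketched as follows. The function name and the count vectors are hypothetical. The counts were chosen only so that they reproduce the pattern in the text (all items eligible for the last position, and only Items 3, 4, and 6 eligible for the first) and are not the data in Figure 5.

```python
def eligible_items(rev_counts_at_order, in_shadow_test, r_max, p):
    """Items that stay within r_max * (p + 1) exposures if selected at this order.

    rev_counts_at_order[i - 1] is the reverse cumulative count of item i over the
    p previous examinees; in_shadow_test holds items already placed for the
    current examinee, which cannot be reused.
    """
    cap = r_max * (p + 1) + 1e-9  # small tolerance for floating-point products
    return [i for i, count in enumerate(rev_counts_at_order, start=1)
            if i not in in_shadow_test and count + 1 <= cap]

# Hypothetical reverse cumulative counts for a six-item pool and p = 3 examinees.
last_order = [0, 0, 1, 1, 0, 1]   # counts at administration order 3 only
first_order = [2, 2, 1, 1, 2, 1]  # counts summed over orders 1 to 3
print(eligible_items(last_order, set(), r_max=0.6, p=3))
print(eligible_items(first_order, set(), r_max=0.6, p=3))
```

Here an item with one prior exposure would reach 2 of 4 administrations (.5) if selected, which is within the cap, whereas an item with two prior exposures would reach 3 of 4 (.75) and is therefore excluded.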

To more precisely illustrate the IPRD method, we suppose that the test length is L, and the number of previously administered examinees as a sample for inferring the population distribution is p. The IPRD method can solve the attenuation paradox (including the preclusion problem) as a CAT optimization problem by assembling shadow tests according to corresponding constraints with a reverse-assembly procedure using the following steps:

Step 1. For the first item administration (i.e., $s^{*}$ = 1) of examinee $j^{*}$, assemble the first shadow test (i.e., an L $\times$ 1 vector $T_{1}$) as follows:

Step 1.1. Assemble the one-item tests for the Lth item to the second item in the shadow test $T_{1}$ in sequence. For $k^{*}$ = L, L − 1, …, 2, the following equations are used:

$$\max \sum_{i=1}^{I} X_{ij^{*}k^{*}} I_{ij^{*}}(\theta_{j^{*}}^{0}), \tag{5}$$

$$\begin{aligned}
\text{s.t.}\quad & \sum_{k=k^{*}}^{L} \sum_{j=j^{*}-p}^{j^{*}} X_{ijk} \leq r_{max} \times (p+1), && \text{for } i = 1, \dots, I, \\
& \sum_{k=1}^{k^{*}} \sum_{j=j^{*}-p}^{j^{*}} X_{ijk} \geq r_{min} \times (p+1), && \text{for } i = 1, \dots, I, \\
& \sum_{k=k^{*}}^{L} \sum_{i=1}^{I} X_{ij^{*}k} = L - k^{*} + 1, \\
& \sum_{k=k^{*}}^{L} X_{ij^{*}k} \leq 1, && \text{for } i = 1, \dots, I,
\end{aligned} \tag{6}$$

where $θ_{j^{*}}^{0}$ is set to CAT’s initial starting point of ability estimate by design (e.g., zero).

Step 1.2. For $k^{*}$ = 1, randomly select an item from underexposed items (i.e., item exposure rate lower than $r_{min}$ ), with its selection probability being proportional to the difference between its exposure rate and $r_{min}$ . If all items in the pool have satisfied the minimum exposure rate constraint (i.e., no item exposure rate is lower than $r_{min}$ ), we can randomly select an item with an exposure rate lower than $\bar{r}$ with a selection probability proportional to the difference between its exposure rate and $\bar{r}$ . If item $i^{*}$ is selected, then $X_{i^{*} j^{*} 1}$ is 1.

Step 1.3. The $k^{*}$th element ($k^{*}$ = 1, 2, …, L) in the vector of the shadow test $T_{1}$ is the location where $X_{i j^{*} k^{*}}$ = 1. The first element in $T_{1}$ is the first administered item.

Step 1.4. Estimate ${\hat{θ}}_{j^{*}}^{1}$ based on the response to the first administered item.

Step 2. For the $s^{*}$th item administration ($s^{*} > 1$) of examinee $j^{*}$, assemble the $s^{*}$th shadow test as follows:

Step 2.1. Assemble the one-item tests for the Lth item to the $s^{*}$ th item in the shadow test in sequence based on Equations 5 and 6 using ${\hat{θ}}_{j^{*}}^{s^{*} - 1}$ .

Step 2.2. The $k^{*}$ th element ( $k^{*}$ = $s^{*}$ , $s^{*}$ +1,…, L) in the vector of the shadow test $T_{s^{*}}$ is the location where $X_{i j^{*} k^{*}}$ = 1. Vector $T_{s^{*}}$ is the shadow test for the $s^{*}$ th item administration, and the first to the $s^{*}$ −1th elements in $T_{s^{*}}$ are items administered previously to examinee $j^{*}$ .

Step 2.3. The $s^{*}$ th element in $T_{s^{*}}$ is the $s^{*}$ th administered item for examinee $j^{*}$ . Estimate ${\hat{θ}}_{j^{*}}^{s^{*}}$ based on the responses of the first $s^{*}$ administered items.

Step 3. Repeat Step 2 for the $s^{*}$ +1th item administration to the Lth item administration for examinee $j^{*}$ .

Step 4. Repeat steps 1–3 until the required number of examinees has been administered.
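The reverse assembly in Step 2.1 can be sketched in Python under simplifying assumptions. When only the maximum exposure constraint and the no-reuse constraint are active, each one-item problem reduces to picking the most informative eligible item, so the sketch uses that greedy choice instead of an integer programming solver. The minimum exposure constraint and content balancing are omitted, D = 1 is assumed in the 3PL information function, and the pool, counts, and function names are hypothetical.

```python
from math import exp

def info_3pl(theta, a, b, c):
    """3PL Fisher information at theta (scaling constant D = 1 is an assumption)."""
    p = c + (1.0 - c) / (1.0 + exp(-a * (theta - b)))
    return a * a * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def assemble_shadow_test(pool, theta_hat, administered, rev_counts, length, r_max, p):
    """Fill positions L down to s* with the most informative eligible item.

    pool:         {item_id: (a, b, c)}
    administered: items already given to the current examinee (positions 1..s*-1)
    rev_counts:   {order: {item_id: reverse cumulative count over p previous examinees}}
    """
    s_star = len(administered) + 1
    shadow = list(administered) + [None] * (length - len(administered))
    used = set(administered)
    cap = r_max * (p + 1) + 1e-9  # small tolerance for floating-point products
    for k in range(length, s_star - 1, -1):  # reverse order: L, L-1, ..., s*
        candidates = [i for i in pool
                      if i not in used and rev_counts[k].get(i, 0) + 1 <= cap]
        if not candidates:
            raise ValueError(f"no eligible item for position {k}")
        best = max(candidates, key=lambda i: info_3pl(theta_hat, *pool[i]))
        shadow[k - 1] = best
        used.add(best)
    return shadow

# Hypothetical six-item pool and reverse cumulative counts from p = 3 examinees.
pool = {1: (1.8, 0.4, 0.2), 2: (1.5, -0.5, 0.2), 3: (1.0, 0.6, 0.2),
        4: (0.9, 0.0, 0.2), 5: (0.8, 1.0, 0.2), 6: (0.7, 0.3, 0.2)}
rev_counts = {3: {3: 1, 4: 1, 6: 1},
              2: {i: 1 for i in range(1, 7)},
              1: {1: 2, 2: 2, 3: 1, 4: 1, 5: 2, 6: 1}}
shadow = assemble_shadow_test(pool, theta_hat=0.5, administered=[4],
                              rev_counts=rev_counts, length=3, r_max=0.6, p=3)
print(shadow)
```

The element at position $s^{*}$ of the returned list is the item that would be administered next. Because it is filled last, it is the least informative of the newly selected items, which is the intended effect of the reverse procedure.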

The IPRD method utilizes the distributions of item exposure counts at each administration order based on the p previously administered examinees to reversely assemble a shadow test for each item administration. Given that the real-time test data from the sample of p previous examinees are representative of the population distribution, the reversely assembled shadow tests would be the optimal tests not only for examinee $j^{*}$ but also for the other examinees who will take CAT in the future. The IPRD method is expected to provide optimal tests for all examinees by considering the attenuation paradox (including the preclusion problem) while satisfying all test constraints.
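The random draw in Step 1.2 above can be sketched as follows. The function name and the exposure rates are hypothetical, and the final uniform fallback (used when no item is below the average exposure rate) is an assumption added only so the sketch always returns an item; the steps do not specify that case.

```python
import random

def draw_first_item(exposure_rates, r_min, r_bar, rng=random):
    """Draw the first item with probability proportional to its exposure shortfall."""
    shortfall = {i: r_min - r for i, r in exposure_rates.items() if r < r_min}
    if not shortfall:
        # Every item already meets r_min: fall back to the average exposure rate.
        shortfall = {i: r_bar - r for i, r in exposure_rates.items() if r < r_bar}
    if not shortfall:
        # Assumption (not in the steps): draw uniformly if nothing is below r_bar.
        shortfall = {i: 1.0 for i in exposure_rates}
    items = list(shortfall)
    return rng.choices(items, weights=[shortfall[i] for i in items], k=1)[0]

# Hypothetical exposure rates: Items 1 and 2 are below r_min = .05.
rates = {1: 0.00, 2: 0.02, 3: 0.10, 4: 0.20}
rng = random.Random(2024)
draws = [draw_first_item(rates, r_min=0.05, r_bar=1 / 12, rng=rng) for _ in range(1000)]
print(draws.count(1), draws.count(2))
```

With these rates, Item 1 has a shortfall of .05 and Item 2 a shortfall of .03, so Item 1 should be drawn about 5/8 of the time.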

Two hypotheses are evaluated in this study:

When there is no item exposure control, compared to the other recruited ISRs (e.g., the MFI method), the IPRD method can further reduce the effect of the attenuation paradox, yielding better trait estimation.

When there is item exposure control, the IPRD method can solve the preclusion problem of the attenuation paradox, leading to further improvements in trait estimation. The more stringent the exposure control is, the greater the improvement is.

Method

The purpose of the current study is to investigate the performance of the IPRD method using rigorously controlled experimental designs based on an operational item pool. CAT with practical constraints, including maximum item exposure control and content balancing control, was employed. The performance of the IPRD was compared with several item selection methods, including the MFI and ASCB methods, in terms of the precision of trait estimation and balance of item pool usage. The results from the simulation study allow us to generalize the efficiency of the IPRD method to conditions with practical considerations.

Design

The study attempts to create a practical context for investigating the performance of the IPRD method. The test length was set to 30 to mimic a test of medium length. A real item pool consisting of 360 items from a practical math test with item parameters calibrated using the three-parameter logistic model (3PLM; Birnbaum, 1968) was utilized. Table 1 shows the descriptive statistics of the item pool. As shown in Figure 6, there was a moderate correlation between a and b and low correlations between a and c and between b and c. The item pool’s content areas were as follows: (A) prealgebra and elementary algebra, 144 items (40%); (B) intermediate algebra and coordinate geometry, 108 items (30%); and (C) plane geometry and trigonometry, 108 items (30%). The ratio of each content area for controlling content balancing was set to its proportion in the item pool. That is, there were 12, 9, and 9 items for content areas A through C, respectively. Item pool usage was restricted by maximum item exposure control.

Table 1.

Descriptive Statistics for the Item Parameters of the Item Pool

Content	Item Parameter	N	Mean	SD	Minimum	Maximum
All	a	360	0.969	0.324	0.282	2.366
All	b	360	0.398	1.122	−3.429	2.943
All	c	360	0.185	0.087	0.025	0.500
A	a	144	0.841	0.262	0.282	1.665
A	b	144	−0.128	1.154	−3.429	2.576
A	c	144	0.184	0.083	0.047	0.500
B	a	108	1.021	0.314	0.379	2.058
B	b	108	0.749	0.920	−1.737	2.943
B	c	108	0.187	0.088	0.035	0.500
C	a	108	1.087	0.350	0.370	2.366
C	b	108	0.750	0.988	−1.450	2.927
C	c	108	0.185	0.091	0.025	0.500

Figure 6.

Correlations among item parameters of the item pool.

In addition to the IPRD method, the ASCB method, which stratifies an item pool into several strata according to item discrimination parameters, was used for comparisons. The ASCB method was thought to reduce the effects of the attenuation paradox by administering more discriminative items in later stages, which, in principle, is similar to the reverse-assembly procedure of the IPRD method. Specifically, the ASCB method was used to consider both the content balancing control and the correlation (i.e., r = .476) between a and b parameters while stratifying the item pool. Additionally, the random item selection (RN) method and MFI method were also employed to serve as baselines for evaluating the performance of the IPRD method. In sum, four methods, the IPRD, ASCB, MFI, and RN methods, were compared.

Two independent variables were manipulated: (a) $r_{max}$ for maximum item exposure control (.09, .10, .12, .15, .20, .30, .50, and 1.00) and (b) the mean of the distribution of examinees’ abilities (0 and 1). Hence, the four methods (i.e., IPRD, MFI, ASCB, and RN) can be compared under CAT with each combination of the two independent variables. The levels of $r_{max}$ represented the degree of exposure control from stringent to no control, which can be used to evaluate the effect of the preclusion problem of the attenuation paradox on trait estimation. The manipulation regarding the ability distribution was used to investigate the generalizability of the results for different ability distributions, particularly for the IPRD method in CAT optimization. Specifically, the ability distribution with a mean of 0 represents a group of examinees with moderate abilities, while the ability distribution with a mean of 1 represents examinees with higher abilities. The correspondence between the ability distribution and the item difficulty distribution varied across these conditions. This variation allows us to examine the robustness of the IPRD method in CAT, specifically in its ability to deal with the changing correspondence between examinees’ abilities and item difficulties.

For each condition, CAT was administered to 10,000 examinees, whose abilities were drawn from the normal distribution with the manipulated mean and fixed variance of 1. In sum, 160,000 CATs were administered, and they constituted 16 conditions (i.e., $r_{max}$ [8] × ability distribution [2]), each with 10,000 examinees. The simulation study was conducted by a computer program written by the author using Fortran 90.

General Implementation

The present study applied 3PLM, estimated the traits with the expected a posteriori (EAP) estimation with a normal prior, and assumed the initial trait estimate to be zero. All the methods used the MFI criterion for item selection except for the RN method. The MFI, ASCB, and RN methods used the Sympson–Hetter method (Sympson & Hetter, 1985) with online procedure (SHO; Chen et al., 2008; Ju & Chen, 2008) for exposure control, while the IPRD method was directly applied to control item exposure. See Supplemental Appendix A for a detailed explanation of the implementation of the SHO method. Regarding the content balancing constraint, the modified multinomial model (MMM; Chen & Ankenman, 2004) method was used to determine the content order of each examinee for all methods. After each item administration, the MMM method updated the number of items that were still required to meet the content constraint for each content area. The percentages of the numbers were then calculated to form a cumulative distribution. Based on the cumulative distribution, a random number was drawn from U[0, 1] to determine the content area. Taking Figure 7 for example, an item in content area B will be selected for the next item administration.
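For readers unfamiliar with EAP estimation, the following Python sketch computes the posterior mean of the trait on an evenly spaced grid under a normal prior. It is a generic illustration and not the study's Fortran implementation: the grid range and number of points, the scaling constant D = 1, and the function names are assumptions.

```python
from math import exp

def prob_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PLM (D = 1 is an assumption)."""
    return c + (1.0 - c) / (1.0 + exp(-a * (theta - b)))

def eap_estimate(responses, items, prior_mean=0.0, prior_sd=1.0,
                 n_points=81, bound=4.0):
    """Posterior mean of theta on an evenly spaced grid (grid settings are assumptions)."""
    step = 2.0 * bound / (n_points - 1)
    numerator = denominator = 0.0
    for q in range(n_points):
        theta = -bound + q * step
        # Unnormalized normal prior; the normalizing constant cancels in the ratio.
        weight = exp(-0.5 * ((theta - prior_mean) / prior_sd) ** 2)
        for u, (a, b, c) in zip(responses, items):
            p = prob_3pl(theta, a, b, c)
            weight *= p if u == 1 else 1.0 - p  # likelihood of the observed response
        numerator += theta * weight
        denominator += weight
    return numerator / denominator

# Hypothetical two-item response pattern: first item correct, second incorrect.
print(eap_estimate([1, 0], [(1.2, -0.5, 0.2), (1.0, 0.8, 0.2)]))
```

With no responses, the estimate equals the prior mean, which matches the initial trait estimate of zero used in the study.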

Figure 7.

An example illustrating the mechanism of the multinomial model method, where there are seven, six, and seven items required to be administered for the A–C content areas, respectively, with their percentages and cumulative percentages, and a random number of .53 is drawn.
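The content selection mechanism described above can be sketched as follows, using the numbers in the Figure 7 caption (seven, six, and seven remaining items and a random draw of .53). The function name and its interface are hypothetical.

```python
import random

def next_content_area(remaining, u=None, rng=random):
    """Choose a content area with probability proportional to its remaining items.

    remaining maps each content area to the number of items still required;
    u is a draw from U[0, 1] (drawn here if not supplied).
    """
    total = sum(remaining.values())
    if total <= 0:
        raise ValueError("no content area still requires items")
    if u is None:
        u = rng.random()
    cumulative, chosen = 0.0, None
    for area, count in remaining.items():
        if count == 0:
            continue  # an area with nothing left can never be selected
        cumulative += count / total
        chosen = area
        if u <= cumulative:
            break
    return chosen  # the last nonempty area also covers rounding when u is near 1

# Figure 7 setting: cumulative proportions are .35, .65, and 1.00.
print(next_content_area({"A": 7, "B": 6, "C": 7}, u=0.53))
```

Because .53 falls between .35 and .65, the draw selects content area B, in agreement with the example in the text.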

IPRD Implementation

To implement the IPRD method, Steps 1 through 4 mentioned in Example 3 were applied, where p was set to the number of examinees who have taken tests. For example, if the current examinee is the nth, p will be $n - 1$ . It is worth noting that applying the IPRD procedures in Example 3 is relatively simple and can be directly programmed in any computer language, such as Fortran. For more complex conditions (to be addressed in the Discussion section), more sophisticated software or packages (e.g., package lpSolve in R language; Berkelaar, 2022) would be required.

Evaluation Criteria

Two sets of criteria were used to evaluate CAT efficiency at the examinee and test levels. For the examinee-level evaluation, two indexes, the root mean square error (RMSE) and bias, were calculated from the trait estimates.

$$\mathrm{RMSE}(\hat{\theta}) = \sqrt{\frac{1}{J} \sum_{j = 1}^{J} (\hat{\theta}_{j} - \theta_{j})^{2}},$$

$$\mathrm{Bias}(\hat{\theta}) = \frac{1}{J} \sum_{j = 1}^{J} (\hat{\theta}_{j} - \theta_{j}),$$

where J is the number of examinees, and $\hat{\theta}_{j}$ and $\theta_{j}$ are, respectively, the estimated and true abilities of the jth examinee. In addition, the test information was averaged across all examinees to evaluate precision in trait estimation.
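As a small worked illustration of the two indexes, the sketch below computes them for made-up values (the ability vectors are not from the study). It also shows why both are reported: errors of opposite sign cancel in the bias but not in the RMSE.

```python
import math


def bias(theta_hat, theta):
    """Mean signed error of the estimates."""
    return sum(e - t for e, t in zip(theta_hat, theta)) / len(theta)


def rmse(theta_hat, theta):
    """Root mean square error of the estimates."""
    squared = sum((e - t) ** 2 for e, t in zip(theta_hat, theta))
    return math.sqrt(squared / len(theta))


# Made-up abilities: the low examinee is overestimated and the
# high examinee is underestimated by the same amount.
theta_true = [-1.0, 0.0, 1.0, 2.0]
theta_est = [-0.5, 0.0, 0.5, 2.0]

print(bias(theta_est, theta_true))  # the signed errors cancel
print(rmse(theta_est, theta_true))  # sqrt(0.5 / 4), about 0.354
```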

For the test-level evaluation, the test overlap rate ($\bar{T}$) between pairs of examinees was calculated as (Chen et al., 2003)

$$\bar{T} = \frac{\sum_{i = 1}^{I} \binom{m_{i}}{2}}{L \binom{J}{2}},$$

where I is the number of items in the pool, $m_{i}$ is the number of times item i is administered, L is the test length, and J is the number of examinees, as above.
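The overlap formula counts, for every item, the number of examinee pairs that share it, and divides by the number of pairs times the test length. A minimal sketch with a made-up three-examinee example (the function name `overlap_rate` is hypothetical):

```python
from itertools import combinations
from math import comb


def overlap_rate(admin_counts, n_examinees, test_length):
    """Test overlap rate from per-item administration counts m_i."""
    shared_pairs = sum(comb(m, 2) for m in admin_counts)
    return shared_pairs / (test_length * comb(n_examinees, 2))


# Made-up example: three examinees, two items each, from a three-item pool.
tests = [{1, 2}, {1, 2}, {1, 3}]
counts = [3, 2, 1]  # item 1 seen three times, item 2 twice, item 3 once
rate = overlap_rate(counts, n_examinees=3, test_length=2)

# Cross-check by averaging the shared proportion over all examinee pairs.
pairs = list(combinations(tests, 2))
direct = sum(len(x & y) for x, y in pairs) / (len(pairs) * 2)
print(rate, direct)  # both equal 2/3
```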

In addition, the average value of the discrimination parameters of the administered items within each administration order was calculated to illustrate the quality of the items administered as the test proceeded.
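A sketch of this last summary, assuming a fixed test length and a log that stores, for each examinee, the discrimination parameters of the administered items in administration order (the layout and the values are hypothetical):

```python
def mean_a_by_position(administered_a):
    """Average discrimination parameter at each administration order.

    administered_a: one list per examinee holding the a-parameters of the
    administered items in order (assumes every test has the same length).
    """
    n_examinees = len(administered_a)
    n_positions = len(administered_a[0])
    return [
        sum(test[k] for test in administered_a) / n_examinees
        for k in range(n_positions)
    ]


# Made-up logs for two examinees and a three-item test.
logs = [[0.8, 1.0, 1.4], [1.0, 1.2, 1.6]]
profile = mean_a_by_position(logs)
print(profile)  # roughly [0.9, 1.1, 1.5], an increasing profile
```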

Results

According to the simulation outcomes, in every condition the item exposure rates were held within the prespecified $r_{\max}$ level, and the content balancing constraint was satisfied for each examinee’s CAT. This indicates that the IPRD method can control not only the maximum item exposure rates but also the content balancing as well as the other methods (e.g., the ASCB method) did. The performance of these methods was further evaluated in terms of trait estimation and item pool usage. Because the patterns of the simulation outcomes were quite similar across conditions with different ability distributions, only the results with abilities drawn from a standard normal distribution are presented here. The results for the other conditions are listed in Supplemental Appendix B.

Conditions With Examinees From N(0, 1)

Outcomes for trait estimation

Figure 8 shows the bias of the trait estimates for all methods under different $r_{\max}$ levels. All biases were close to 0, indicating that the trait estimates generated by the methods were generally unbiased. Figure 9 shows the RMSE of the trait estimates under different $r_{\max}$ levels. In general, the RMSE decreased as $r_{\max}$ increased, except for the RN method, whose RMSE was stable at around 0.42 across all $r_{\max}$ levels. For conditions with stringent exposure control (e.g., $r_{\max} = .09$), the IPRD method yielded a remarkable improvement in RMSE over the other methods. As the exposure control became less stringent, the advantage of the IPRD method in RMSE became less obvious. When there was mild or no exposure control (e.g., $r_{\max} \geq .5$), all the methods except the RN method performed similarly. The IPRD method consistently yielded a lower RMSE than the ASCB method. In summary, the IPRD method showed advantages in trait estimation for CAT under exposure control, and these advantages became more pronounced as the exposure control became more stringent.

Figure 8.

Bias of trait estimation under conditions with examinees from N(0, 1).

Figure 9.

Root mean square error of trait estimation under conditions with examinees from N(0, 1).

Figure 10 shows the average test information under different $r_{\max}$ levels. The patterns of the average test information among the methods mirrored those of the RMSE in the opposite direction. Generally, the test information increased as $r_{\max}$ increased, except for the RN method, whose average test information was stable at around 6.35 across all $r_{\max}$ levels. For conditions with stringent exposure control (e.g., $r_{\max} = .09$), the IPRD method yielded the highest average test information. As the item exposure control became less stringent, the difference in the average test information between the IPRD method and the other methods decreased. When there was mild or no exposure control (e.g., $r_{\max} \geq .5$), the MFI method yielded the highest average test information. The IPRD method consistently yielded higher average test information than the ASCB method. In brief, the results indicated that the IPRD method can yield more average test information than the other methods when exposure control is relatively stringent (e.g., $r_{\max} < .5$). Specifically, for conditions with $r_{\max} = .09$, the IPRD method yielded about 33% more average test information than the MFI and ASCB methods.

Figure 10.

Average test information under conditions with examinees from N(0, 1).
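For readers who want to reproduce a quantity like the one plotted in Figure 10, the sketch below gives the standard Fisher information of a 3PL item, test information as its sum over administered items, and an MFI pick from a pool. The scaling constant `D`, the function names, and the item parameters are assumptions for illustration; the article does not state the scaling constant or whether information is evaluated at the true or the estimated ability.

```python
import math


def p_3pl(theta, a, b, c, D=1.0):
    """3PL probability of a correct response (D is an assumed scaling constant)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c, D=1.0):
    """Fisher information of one 3PL item at ability theta."""
    p = p_3pl(theta, a, b, c, D)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def total_information(theta, items, D=1.0):
    """Test information: sum of item information over the administered items."""
    return sum(item_information(theta, a, b, c, D) for a, b, c in items)


def select_mfi(theta_hat, pool, administered):
    """Index of the unused item with maximum information at theta_hat."""
    candidates = [i for i in range(len(pool)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta_hat, *pool[i]))


# Made-up (a, b, c) triples, not the operational pool.
pool = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.2), (1.5, 1.0, 0.2)]
best = select_mfi(0.0, pool, administered=set())
print(best, total_information(0.0, pool[:2]))
```

Averaging `total_information` over examinees, each evaluated on that examinee's administered items, gives one plausible version of the averaged index.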

Outcomes for test overlap rate and item administration

Figure 11 shows the test overlap rates under different $r_{\max}$ levels. In general, the test overlap rate increased as $r_{\max}$ increased, except for the RN method. When the exposure control was stringent (i.e., $r_{\max} \leq .2$), the IPRD, MFI, and ASCB methods produced similar test overlap rates. In contrast, for the conditions with relatively loose exposure control (i.e., $r_{\max} > .2$), the IPRD method yielded the lowest test overlap rates. Notably, the test overlap rate for the IPRD method was no more than .32 even when there was no item exposure control.

Figure 11.

Test overlap rate under conditions with examinees from N(0, 1).

Figure 12 shows the average discrimination parameter ($a$) of the administered items at each administration order. In general, the average $a$ increased as $r_{\max}$ increased, except for the RN method, for which it was stable at about .97 (close to the mean $a$ of the whole item pool). For the other three methods, distinct patterns of average $a$ by administration order could be observed. The IPRD method generally yielded a monotonically increasing pattern of average $a$, except for the first item, which was administered randomly by design. Specifically, the slope of the average $a$ on administration order increased as $r_{\max}$ decreased, which indicated that the IPRD method increasingly tended to administer items with higher discrimination in the later stages of CAT as the exposure control became more stringent. In contrast, the MFI method exhibited a monotonically decreasing pattern of average $a$. The ASCB method showed a generally increasing pattern except at the beginning and end of the CAT.

Figure 12.

Averaged values of the discrimination parameters of the administered items under conditions with examinees from N(0, 1).

Conditional outcomes for trait estimation

To obtain the conditional outcomes, the examinees were divided into seven groups of equal size according to their true abilities, so that the performance of the methods could be investigated for examinees at particular ability levels (e.g., moderate or extremely high). The average abilities of the seven groups were about −1.58, −0.80, −0.37, 0.00, 0.37, 0.80, and 1.58, respectively. Figure 13 shows the bias of trait estimation for each ability level under four $r_{\max}$ levels (i.e., .09, .12, .20, and 1.00), which represent the conditions with stringent, slightly stringent, normal, and no exposure control, respectively. Generally, the methods tended to overestimate the abilities of examinees at lower ability levels and to underestimate those of examinees at higher ability levels. The degree of overestimation and underestimation decreased as the item exposure control became less stringent. The IPRD, MFI, and ASCB methods performed similarly across all $r_{\max}$ levels except for the condition in which $r_{\max} = .09$, where the IPRD method yielded the smallest bias for examinees with extremely low ability levels.

Figure 13.

Conditional bias of trait estimation under conditions with examinees from N(0, 1).
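The grouping behind the conditional outcomes can be sketched as below: examinees are sorted by true ability, cut into equal-size groups, and the bias is computed within each group. The function name, the assumption that the number of examinees is divisible by the number of groups, and the toy data (estimates shrunk toward zero by a made-up factor of 0.9) are all illustrative.

```python
def conditional_bias(theta, theta_hat, n_groups=7):
    """Mean true ability and bias within equal-size groups sorted by true ability.

    Assumes len(theta) is divisible by n_groups.
    """
    order = sorted(range(len(theta)), key=lambda j: theta[j])
    size = len(order) // n_groups
    results = []
    for g in range(n_groups):
        members = order[g * size:(g + 1) * size]
        mean_theta = sum(theta[j] for j in members) / size
        group_bias = sum(theta_hat[j] - theta[j] for j in members) / size
        results.append((mean_theta, group_bias))
    return results


# Toy data: 14 true abilities (deliberately unsorted) and shrunken estimates.
theta = [-3.25 + 0.5 * i for i in range(14)][::-1]
theta_hat = [0.9 * t for t in theta]

for mean_theta, group_bias in conditional_bias(theta, theta_hat):
    print(round(mean_theta, 2), round(group_bias, 2))
```

With shrunken estimates the low groups show positive bias and the high groups negative bias, which is the direction of the pattern described for Figure 13.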

Figure 14 shows the RMSE for each ability group under the four levels of $r_{\max}$. In general, the RMSE decreased as the ability level increased, reflecting the fact that the item pool was composed of slightly difficult items (the average difficulty was 0.398). Regarding the $r_{\max}$ levels, the more stringent the exposure control, the higher the RMSE. For conditions with $r_{\max} = .09$, the IPRD method yielded the lowest RMSE across all ability levels. As the exposure control became less stringent, the advantage of the IPRD method became less obvious. When there was no exposure control (i.e., $r_{\max} = 1.00$), all methods performed similarly except for the RN method, which yielded the highest RMSE. Notably, the IPRD method showed a remarkable improvement in trait estimation for examinees with relatively low ability when there was stringent exposure control. In the plot for $r_{\max} = .12$, for example, the reduction in RMSE relative to the MFI and ASCB methods was larger for examinees with very low and extremely low ability levels than for examinees at other ability levels. Considering that the item pool was relatively weak for examinees with low ability levels, the IPRD method shows its strength in improving trait estimation under “harsh” test conditions (e.g., stringent exposure control and a relatively weak item pool) by appropriately administering items to suitable examinees.

Figure 14.

Conditional root mean square error of trait estimation under conditions with examinees from N(0, 1).

Figure 15 shows the average test information for each ability group under the four levels of $r_{\max}$. In general, the average test information increased as the ability level increased, which was consistent with the RMSE patterns and is likely related to the characteristics of the item pool. The item pool consists of items with slightly high difficulty parameters and a moderate correlation between the difficulty and discrimination parameters. Therefore, more test information can be expected for examinees with higher ability levels. Regarding the $r_{\max}$ levels, the more stringent the exposure control, the lower the average test information. For conditions with $r_{\max} = .09$, the IPRD method yielded the highest average test information across all ability levels. As the exposure control became less stringent, the advantage of the IPRD method became less obvious. When there was no exposure control (i.e., $r_{\max} = 1.00$), the MFI method yielded higher average test information than the IPRD and ASCB methods, especially for examinees at the high end.

Figure 15.

Conditional average test information under conditions with examinees from N(0, 1).

Conditions With Examinees Sampled From N(1, 1)

For brevity’s sake, the results for the conditions with examinees sampled from N(1, 1) are not shown here. However, the patterns of these results followed those observed under the conditions with examinees sampled from a standard normal distribution. Generally, the bias was slightly more negative than under the N(0, 1) conditions, reflecting the mismatch between the EAP prior and the population distribution. The RMSE was slightly lower than under the N(0, 1) conditions because the item pool was composed of items with a slightly higher difficulty level. Among the methods, the IPRD method efficiently improved trait estimation, in terms of lower RMSE and higher average test information, when CAT was under exposure control. See Supplemental Appendix B for detailed results.
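The effect of the prior on the bias can be seen in a small EAP sketch. This is an illustration under assumptions, not the study's estimator: the quadrature grid (61 equally spaced nodes on [−4, 4]), the logistic metric without a scaling constant, and the item parameters are all made up. For the same response pattern, a prior centered at 0 gives a lower estimate than a prior centered at 1, so examinees drawn from N(1, 1) are pulled downward on average.

```python
import math


def p_3pl(theta, a, b, c):
    """3PL probability of a correct response (no scaling constant assumed)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def eap(responses, items, prior_mean=0.0, prior_sd=1.0,
        n_nodes=61, lo=-4.0, hi=4.0):
    """EAP estimate with a normal prior on an equally spaced grid."""
    step = (hi - lo) / (n_nodes - 1)
    numerator = denominator = 0.0
    for k in range(n_nodes):
        t = lo + k * step
        # Unnormalized normal prior density at the node.
        weight = math.exp(-0.5 * ((t - prior_mean) / prior_sd) ** 2)
        # Multiply in the likelihood of each scored response (1 or 0).
        for u, (a, b, c) in zip(responses, items):
            p = p_3pl(t, a, b, c)
            weight *= p if u == 1 else 1.0 - p
        numerator += t * weight
        denominator += weight
    return numerator / denominator


# Made-up items and a made-up response pattern.
items = [(1.0, 0.0, 0.2), (1.2, 0.5, 0.2), (0.9, 1.0, 0.2)]
responses = [1, 1, 0]

print(eap([], []))                             # no data: about 0, the prior mean
print(eap(responses, items, prior_mean=0.0))   # estimate under the N(0, 1) prior
print(eap(responses, items, prior_mean=1.0))   # same data, higher estimate
```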

Conclusion and Discussion

CAT shows an improvement in efficiency when compared with linear testing, but its inherent nature produces the attenuation paradox (including the preclusion problem), which has the potential to reduce CAT efficiency, especially for conditions with item exposure control. To mitigate the effect of the attenuation paradox, the current study proposed the IPRD method, which utilizes real-time test data to implement reversely assembled shadow tests. A simulation study was conducted to thoroughly investigate IPRD’s performance. The results have indicated that the IPRD method can efficiently improve CAT performance in terms of the precision of trait estimation and satisfaction of the required test constraints, particularly in conditions with stringent exposure control.

However, the results of the simulation study do not support our first hypothesis. For conditions with no exposure control, the IPRD method does not outperform the MFI method. This is consistent with previous findings (e.g., Chen & Ankenmann, 2004; Chen et al., 2000). That is, the conclusion that the effects of the attenuation paradox on trait estimation become negligible after a certain number of item administrations still holds in this study. Regarding the second hypothesis, we find that the IPRD method outperforms the other methods when exposure control is present. The more stringent the exposure control, the more obvious the advantages of the IPRD method. That is, the IPRD method can effectively solve the preclusion problem of the attenuation paradox, which supports our second hypothesis. Furthermore, the high test information generated by the IPRD method also reveals the benefit to CAT performance of solving the preclusion problem by improving between-examinee adaptivity. For conditions with stringent exposure control, the average test information of the IPRD method can be as much as one-third higher than that of the MFI method. The present study also demonstrated that the preclusion problem of the attenuation paradox is the key factor affecting CAT efficiency under item exposure control.

To illustrate why the IPRD method can reduce the effects of the preclusion problem of the attenuation paradox, the current study examined how the IPRD method pursues global optimization. The IPRD method aims to maximize test information for all examinees not only by administering higher quality items during the later stages but also by improving the match between examinee and item at each item administration. As a result, the average discrimination parameter of the administered items at each administration order increases as the test progresses. In contrast, the MFI method, which is considered a kind of greedy algorithm (Bengs et al., 2018; Han, 2018), aims to maximize item information only for the current item administration, generating a decreasing pattern of average discrimination parameters. Consequently, the IPRD method could yield the most precise trait estimation with the largest average test information for examinees in most conditions, especially for conditions with stringent item exposure control, which tend to generate a serious preclusion problem.

Regarding the conditional outcomes, the IPRD method effectively improves trait estimation for examinees with very low and extremely low ability levels when there is stringent exposure control. Given the relative lack of easy items in the item pool, this suggests that the IPRD method can efficiently maintain the precision of trait estimation under “harsh” test conditions, including tests with weak banks (e.g., those with relatively few high-quality items). In conclusion, the IPRD method seems to have potential in practical test applications, especially for high-stakes tests that place high demands on both estimation precision and test security. Even in large-scale assessments where many students may take the test simultaneously, and the preclusion problem therefore does not arise directly, preliminary indications suggest that the IPRD method could be used as an exposure control procedure based on existing test data. As long as the existing test data are a representative sample of the population, it is expected that the IPRD method can effectively control item exposure while maintaining trait estimation precision.

There are several directions for future studies that are not only suggested by the findings but also address the limitations of the current research. First, when the item exposure control is less stringent (e.g., $r_{\max} = .5$), the IPRD method does not outperform the MFI method in terms of trait estimation precision, regardless of whether the mean of the ability distribution is 0 or 1. This suggests that the mechanism by which the IPRD method addresses the preclusion problem may come at a cost. To optimize CAT design, it is crucial to further separate the effects on trait estimation of the two components of the IPRD method: the use of reversely assembled shadow tests (to tackle the attenuation paradox) and the utilization of real-time test data (to address the preclusion problem). This differentiation will help us better understand their respective impacts on trait estimation under different levels of exposure control.

Second, the present study employed an operational item pool with slightly difficult items, which may have produced a gap between the distributions of item difficulty and examinee ability. Although the robustness of the IPRD method was evaluated by manipulating the mean of the ability distribution, it is recommended that more item pools (including generated item pools) with different characteristics be used to study the IPRD method’s generalizability. Furthermore, manipulating the examinee ability distribution with more extreme means (e.g., −2 or −3) would also contribute to a better understanding of the robustness of the IPRD method. Additionally, in this study, the examinees were randomly sampled from the manipulated normal distribution; therefore, the previously tested examinees were, by construction, a representative sample of the population. However, this may not always be the case in practice. To evaluate the robustness of the IPRD method in practical test scenarios, different types of ability distribution (e.g., a normal distribution with a time-varying mean) should be included in future simulation studies.

Third, the IPRD method uses real-time test data on item administrations (by examinee and administration order) to formulate constraints for maximizing the objective function while implementing the reverse-assembly procedure. However, information on item responses is not utilized. With item responses, we can obtain the distribution of the ability estimates of previously tested examinees at each administration order. In this way, the IPRD method could simultaneously reversely assemble shadow tests for both current and previous examinees. It is worth noting that although previous examinees have completed their own tests, assembling shadow tests for previous examinees is necessary for the goal of approximating test assembly with the population, as in Example 2 (given that the previous examinees are a representative sample of the population). By doing so, the occurrence of the attenuation paradox and its preclusion problem is expected to be further reduced. See Supplemental Appendix C for a detailed illustration. However, simultaneously assembling shadow tests for all previous examinees would necessitate more advanced software or packages (e.g., the lpSolve package). Furthermore, a large number of computations would be necessary, and the shadow test assembly may thus become infeasible. For example, administering a 30-item CAT with 500 previous examinees will require 15,000 shadow test assemblies (30 × 500), creating obstacles to applying the IPRD method in practice. To this end, finding the minimum number of previous examinees sufficient to provide robust item selection when applying the IPRD method becomes critical.

Fourth, because integer programming is used to assemble shadow tests, whether feasible solutions can always be found becomes an important issue. The present study considers only exposure control and content balancing as test constraints; however, future work is expected to include more complex test conditions (e.g., with test overlap control), not only to meet CAT’s practical needs but also to demonstrate the advantages of shadow tests. Given that integer programming may not be able to assemble a shadow test under a large number of test constraints, the robustness of the IPRD method requires further investigation. This is one of the most important limitations on the applicability of the IPRD method and should be addressed.

Fifth, the reliable performance of the IPRD method is highly reminiscent of the dynamic stratification method (SDC) recently proposed by Chen et al. (2020). Although the SDC and IPRD methods are essentially different (the cores of the SDC and IPRD methods are stratification and test assembly, respectively), they share the same merits in improving trait estimation in CAT with exposure control. Specifically, the SDC method stratifies an item pool into the same number of strata as the test length and employs a dynamic item–stratum adjustment procedure to enhance item usage, which can be considered a heuristic approach. In contrast, the IPRD method turns information from real-time test data into test constraints for reversely assembled shadow tests to prevent the attenuation paradox, and thus belongs to the integer programming approach. When the best feasible solution exists, the integer programming approach is expected to provide optimal performance. However, for test practitioners, methods belonging to the heuristic and integer programming approaches could be complementary, offering flexibility to optimize test performance under various considerations (e.g., computational time; Chen, 2017). For future studies, while it would be interesting to compare these two ISRs, it may be more worthwhile to incorporate the SDC method into the IPRD method and combine the merits of both in developing the optimal ISR.

Sixth, the current study manipulated CAT in a relatively simple way, where unidimensional CAT with a fixed test length was used. The IPRD method can be further extended to CAT with variable test lengths, as well as multidimensional tests, to improve trait estimation under item exposure control. Furthermore, the IPRD method can also be applied to other test formats, including computerized classification tests and on-the-fly assembled multistage testing (Zheng & Chang, 2015), to improve test efficiency.

Finally, the IPRD method can be applied to control item exposure to prevent items from being over- or underexposed. To the best of our knowledge, no method has been proposed to directly control the minimum item exposure rate. It is expected that CAT can greatly benefit from applying the IPRD method to control minimum item exposure rates to improve balanced item pool usage and test security while maintaining its ability estimation precision.

Supplemental Material

Supplemental Material, sj-docx-1-jeb-10.3102_10769986231197666 for Utilizing Real-Time Test Data to Solve Attenuation Paradox in Computerized Adaptive Testing to Enhance Optimal Design by Jyun-Hong Chen and Hsiu-Yi Chao in Journal of Educational and Behavioral Statistics

Footnotes

Acknowledgments

The authors would like to thank the editors and anonymous reviewers for their valuable comments and suggestions on this article.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research and/or authorship of this article: This work was supported by Ministry of Science and Technology, Taiwan (MOST 109-2410-H-006-119).

ORCID iD

Hsiu-Yi Chao

References

Bengs, Brefeld, & Kröhne (2018). Adaptive item selection under Matroid constraints. Journal of Computerized Adaptive Testing, 6(2), 15–36. https://doi.org/10.7333/Fjcat.v6i2.64

Berkelaar (2022). lpSolve: Interface to “Lp_solve” v.5.5 to Solve Linear/Integer Programs. R package version 5.6.17. https://CRAN.R-project.org/package=lpSolve

Birnbaum (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397–472). Addison Wesley.

Chang, H. H., & Ying (1996). A global information approach to computerized adaptive testing. Applied Psychological Measurement, 20(3), 213–229. https://doi.org/10.1177/014662169602000303

Chang, H. H., & Ying (1999). a-Stratified multistage computerized adaptive testing. Applied Psychological Measurement, 23(3), 211–222. https://doi.org/10.1177/01466219922031338

Chen, P. H. (2017). Should we stop developing heuristics and only rely on mixed integer programming solvers in automated test assembly? A rejoinder to van der Linden and Li (2016). Applied Psychological Measurement, 41(3), 227–240.

Chen, J. H., Chao, H. Y., & Chen, S. Y. (2020). A dynamic stratification method for improving trait estimation in computerized adaptive testing under item exposure control. Applied Psychological Measurement, 44(3), 182–196. https://doi.org/10.1177/0146621619843820

Chen, S. Y., & Ankenmann, R. D. (2004). Effects of practical constraints on item selection rules at the early stages of computerized adaptive testing. Journal of Educational Measurement, 41(2), 149–174. https://doi.org/10.1111/j.1745-3984.2004.tb01112.x

Chen, S. Y., Ankenmann, R. D., & Chang, H. H. (2000). A comparison of item selection rules at the early stages of computerized adaptive testing. Applied Psychological Measurement, 24(3), 241–255. https://doi.org/10.1177/01466210022031705

Chen, S. Y., Ankenmann, R. D., & Spray, J. A. (2003). The relationship between item exposure and test overlap in computerized adaptive testing. Journal of Educational Measurement, 40(2), 129–145. https://doi.org/10.1111/j.1745-3984.2003.tb01100.x

Chen, S. Y., Lei, P. W., & Liao, W. H. (2008). Controlling item exposure and test overlap on the fly in computerized adaptive testing. British Journal of Mathematical and Statistical Psychology, 61, 471–492. https://doi.org/10.1348/000711007X227067

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. Wiley.

Han, K. C. T. (2018). Components of the item selection algorithm in computerized adaptive testing. Journal of Educational Evaluation for Health Professions, 15(7), 1–13. https://doi.org/10.3352/jeehp.2018.15.7

Chen, S. Y. (2008). Item exposure control in a-stratified computerized adaptive testing. Psychological Testing, 55(4), 793–811. https://doi.org/10.7108/PT.200812.0015

Kullback (1959). Information theory and statistics. Wiley.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Addison Wesley.

Sympson, J. B., & Hetter, R. D. (1985, October). Controlling item-exposure rates in computerized adaptive testing. Proceedings of the 27th Annual Meeting of the Military Testing Association (pp. 973–977). Navy Personnel Research and Development Center.

van der Linden, W. J. (2000). Constrained adaptive testing with shadow tests. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 27–52). Kluwer Academic.

van der Linden, W. J., & Pashley, P. J. (2000). Item selection and ability estimation in adaptive testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 1–25). Kluwer Academic.

Veerkamp, W. J., & Berger, M. P. F. (1997). Some new item selection criteria for adaptive testing. Journal of Educational and Behavioral Statistics, 22(2), 203–226. https://doi.org/10.3102/10769986022002203

Chang, H. H. (2003). a-Stratified CAT design with content blocking. British Journal of Mathematical and Statistical Psychology, 56(2), 359–378. https://doi.org/10.1348/000711003770480084

Zheng & Chang, H. H. (2015). On-the-fly assembled multistage adaptive testing. Applied Psychological Measurement, 39(2), 104–118.
