Abstract
The testlet design is widely used in educational and psychological assessments. This article proposes a new cognitive diagnosis model, the multiple-choice cognitive diagnostic testlet (MC-CDT) model, for tests built from testlets consisting of MC items. The MC-CDT model uses examinees' original responses to MC items instead of dichotomously scored data (i.e., correct or incorrect) to retain the information carried by different distractors and thus enhance the MC items' diagnostic power. A Markov chain Monte Carlo algorithm implemented in the WinBUGS software was adopted to calibrate the model. A thorough simulation study was then conducted to evaluate the estimation accuracy of both item and examinee parameters in the MC-CDT model under various conditions. The results showed that the proposed MC-CDT model outperformed the traditional MC cognitive diagnostic model: it fit the testlet data better than the traditional model while also fitting data without testlets well. An empirical study further showed that the MC-CDT model fits real data better than the traditional model and can also provide testlet information.
1. Introduction
A cognitive diagnosis model (CDM) allows us to detect the presence or absence of specific skills measured by the items of an assessment. A considerable number of studies exploring CDMs have already been conducted (e.g., Hartz, 2002; Junker & Sijtsma, 2001; Rupp et al., 2010). However, most of the proposed CDMs are built on dichotomously scored data: if an examinee chooses the keyed answer, the response to the item is coded 1, and if they select a distractor, the response is coded 0. This coding is not an optimal way to identify examinees' skills because it ignores the additional diagnostic information about their cognitive processes that could be obtained from the distractor responses (Sadler, 1998). To maximize the potential value of multiple-choice (MC) items in assessments, several CDMs that consider the additional information provided by distractors have been proposed, including the multiple-choice deterministic inputs, noisy "and" gate (MC-DINA) model (de la Torre, 2009a), the scaling individuals and classifying misconceptions (SICM) model (Bradshaw & Templin, 2014), the generalized diagnostic classification models for MC (GDCM-MC; DiBello, Henson, & Stout, 2015), and the three-category structured DINA model for MC items (MC-S-DINA; Ozaki, 2015). Such models, referred to generally as MC-CDMs, exploit the diagnostic information in distractors to improve the estimation accuracy of examinees' attribute profiles (de la Torre, 2009a; DiBello et al., 2015; Ozaki, 2015). For example, the MC-DINA model improved the attribute profile classification rate by 29% over the DINA model (de la Torre, 2009b) under the same simulation condition (de la Torre, 2009a). Hence, MC-CDMs are an effective way to improve diagnostic accuracy while the test content remains the same.
A testlet is a bundle of items that share a common stimulus, such as a reading comprehension passage (Wainer & Kiely, 1987; Wainer & Wang, 2000). Testlets are widely used in educational and psychological tests, including the Test of English as a Foreign Language (TOEFL), the Graduate Record Examinations (GRE), and the Programme for International Student Assessment (PISA). Many researchers and item developers consider the testlet-based form highly efficient, especially for item writing and administration (Huang & Wang, 2012). Nevertheless, traditional CDMs cannot adequately model testlet responses because of local item dependence among items within a testlet, so further research incorporating testlets under the CDM framework is needed. Hansen (2013) introduced an alternative diagnostic model that describes local item dependence by including random effects in CDMs to account for potential residual dependence due to the common stimulus shared by a particular set of items. Zhan et al. (2018) proposed a joint testlet CDM framework to handle two types of random testlet effects, one for item response accuracy and the other for item response times. Although both studies generalized the testlet effect to CDMs, neither considered the diagnostic information provided by distractors: because distractors are not coded in the existing testlet CDMs, these models cannot analyze MC data with distractor information.
Finding a feasible CDM for testlet-based MC data therefore remains a challenge. In this study, we propose an MC-CDM that accounts for testlets (the MC-CDT model) by introducing random effects to represent the testlet effects. The new model not only retains the diagnostic value of MC items but also fits testlet-based response data well.
This article is organized as follows. First, existing MC-CDMs are briefly reviewed and the new MC-CDT model is proposed. Second, parameter estimation for the MC-CDT model using the Markov chain Monte Carlo (MCMC) method is presented. Third, two simulation studies conducted with the WinBUGS 1.4 software (Spiegelhalter et al., 2003) are presented to assess the parameter recovery of the MC-CDT model and the impact of model misspecification. Finally, conclusions and open issues are discussed.
2. Cognitive Diagnostic Models for MC Items Using Testlets
We first introduce some basic concepts and notation used in the MC-CDM framework. Let $\alpha_i = (\alpha_{i1}, \ldots, \alpha_{iK})'$ denote the attribute profile of examinee $i$, where $\alpha_{ik} = 1$ if examinee $i$ has mastered attribute $k$ and $\alpha_{ik} = 0$ otherwise.
2.1. Traditional MC-CDMs
As previously mentioned, four MC-CDMs have already been proposed to fit MC data. de la Torre (2009a) developed the MC-DINA model, which can analyze MC items whose distractors are coded by their required attributes. More specifically, to incorporate diagnostic information from distractors when using MC-CDMs to estimate examinees' attribute profiles, the distractors should be coded to indicate which of the required attributes are missing in examinees' responses. For example, each distractor of an item can be assigned its own option-level q-vector, a subset of the attributes required by the key.
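To make the option-coding idea concrete, the following Python sketch shows one way option-level q-vectors might be represented for a single hypothetical item measuring three attributes; the item, its options, and the matching rule are illustrative assumptions, not an item or algorithm from the article.

```python
import numpy as np

# Hypothetical option-level q-vectors for one MC item measuring three
# attributes (A1-A3). The key requires A1 and A2; each coded distractor
# carries the subset of required attributes an examinee choosing it has
# presumably mastered. Uncoded options get the zero vector.
item_q = {
    "A": np.array([1, 1, 0]),  # key: requires A1 and A2
    "B": np.array([1, 0, 0]),  # distractor: consistent with mastering only A1
    "C": np.array([0, 0, 0]),  # uncoded distractor
    "D": np.array([0, 0, 0]),  # uncoded distractor
}

def expected_choice(alpha):
    """Return the coded option whose q-vector matches the largest subset
    of the examinee's mastered attributes (ties broken by option order)."""
    best, best_match = None, -1
    for option, q in item_q.items():
        if np.all(alpha >= q) and q.sum() > best_match:
            best, best_match = option, q.sum()
    return best

print(expected_choice(np.array([1, 1, 0])))  # -> "A" (full mastery of key)
print(expected_choice(np.array([1, 0, 0])))  # -> "B" (mastered only A1)
```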
If none of the distractors are coded, the MC-DINA model reduces to the DINA model. For the MC-DINA model, the probability that examinee $i$ chooses option $h$ of item $j$ is represented by
To simplify the MC-DINA model, Ozaki (2015) proposed three structured DINA models for MC items (denoted MC1, MC2, and MC3, respectively). For the MC1 model, the probability that examinee $i$ chooses option $h$ of item $j$ is
where $K_j$ is the largest number of required attributes for item $j$.
The SICM model merges the nominal response item response theory model (Bock, 1972) with CDMs, allowing a continuous ability and distinct misconception patterns to be estimated using information from both the coded key option and the distractors. However, the SICM model is rather complex because a large number of item and examinee parameters must be estimated (both continuous and categorical), so its estimation procedure is computationally intensive. Furthermore, the examinee's attribute profile is expanded to include facets of thinking that are either problematic (misconceptions or partially correct ideas) or desirable (skills and conceptual understandings).
Considering everything discussed thus far, the MC1 model appears to be the most parsimonious traditional MC-CDM. According to the simulation studies conducted by Ozaki (2015), the recovery rates of examinee parameters were comparable across the three MC-S-DINA models when the test was relatively long, and the biases and RMSEs of the item parameters obtained from the MC1 model were the smallest of the three across various conditions. Therefore, prioritizing simplicity and effectiveness, the MC1 model was chosen as the base model for incorporating testlet effects in the current study.
2.2. The MC-CDM With Testlet
Various approaches have been developed to analyze response data obtained from testlets in psychological and educational measurement, especially in item response theory, resulting in, among others, the second-order model (Rijmen, 2010), the bifactor model (Demars, 2006; Li et al., 2005), the random-effect testlet model (Bradlow et al., 1999; Wang & Wilson, 2005), and the fixed-effect testlet model (Kang et al., 2021). These models offer different perspectives on describing testlet structures. However, no MC-CDM has thus far been shown to accurately fit testlet items, so the testlet structure needs to be incorporated into MC-CDMs to fully account for testlet-based MC data.
This research focused on the random-effect rather than the fixed-effect testlet structure because the former allows the testlet effect to vary at the examinee level, while the latter does not. As shown in Figure 1, both testlet CDMs treat each item as an indicator of a general attribute profile, with one of M testlet effects attached to each item. The fixed-effect testlet model (Figure 1A) treats the testlet effect as a constant across all examinees' responses, whereas the random-effect testlet model (Figure 1B) allows the testlet effect to differ across examinees. As an illustration, the MC1 model was selected as the base model because of its simplicity and better parameter recovery; nevertheless, the additional testlet effects can also be applied to other MC-CDMs.

Figure 1. Directed acyclic graph of the cognitive diagnosis model with (A) fixed-effect testlet and (B) random-effect testlet, with four items within two testlets.
To incorporate the random testlet effects into the MC1 model, an appropriate modification for Equation 1 is first required. The logit transformation of the item parameter
or
Therefore, the MC1 model can be re-expressed as
Inspired by the modeling approaches of previous studies, such as the joint testlet CDM (Zhan et al., 2018), the Rasch testlet model (Wang & Wilson, 2005), the two-parameter logistic testlet model (Bradlow et al., 1999), and the three-parameter testlet model (Wainer et al., 2000), we added testlet effects to the MC1 model to reflect the testlets' influence on response probabilities. Specifically, we extended Equation 4 to include an additional random effect for the dependence between items within the same testlet. The kernel function of the MC-CDM with testlets is expressed as
Similarly,
Specifically, if

Figure 2. Directed acyclic graph of the multiple-choice cognitive diagnostic testlet model. Note. RG = random guessing.
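The kernel itself is given in the article as Equations 4 through 8, which are not reproduced here. The following Python sketch therefore only illustrates the general shape of such a random-effect testlet kernel: a single item parameter on the logit scale shifted by an examinee-by-testlet random effect, with the remaining probability mass spread uniformly over the other options as random guessing. The function name, the uniform-guessing assumption, and the single-parameter form are simplifications for illustration, not the article's exact Equation 8.

```python
import numpy as np

def inv_logit(x):
    return 1.0 / (1.0 + np.exp(-x))

def mc_cdt_probs(beta_jh, gamma_id, n_options, coded_option):
    """Sketch of a testlet-augmented choice kernel: the examinee selects
    the option implied by their attribute profile with probability
    inv_logit(beta + gamma), where gamma is the examinee-by-testlet
    random effect; the remaining mass is spread uniformly over the
    other options (random guessing)."""
    p_match = inv_logit(beta_jh + gamma_id)
    p = np.full(n_options, (1.0 - p_match) / (n_options - 1))
    p[coded_option] = p_match
    return p

# Example: item parameter 1.5 on the logit scale, testlet effect 0.4,
# four options, examinee's attribute profile pointing to option 0.
probs = mc_cdt_probs(1.5, 0.4, 4, 0)
print(probs, probs.sum())  # option probabilities sum to 1
```

A positive draw of the random effect raises the probability of the profile-consistent option for every item in that testlet, which is exactly how the model induces local dependence within a testlet.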
3. Bayesian Estimation Algorithms
A fully Bayesian approach was used to estimate the parameters of the MC-CDT model. The MCMC method was adopted because it provides a simple and effective way to simulate the joint posterior distribution of the unknown quantities and to obtain simulation-based estimates of the parameters. In this study, the WinBUGS Version 1.4 software was employed to perform the MCMC method, and the posterior means were taken as the parameter estimates. WinBUGS uses Gibbs sampling and the Metropolis algorithm to generate a Markov chain by sampling from full conditional distributions, and it allows users to define and calibrate a wide variety of models. For example, Curtis (2010) developed BUGS code to fit commonly used item response models, such as the two-parameter logistic model, the three-parameter logistic model, the graded response model, the generalized partial credit model, the testlet model, and the generalized testlet model. Furthermore, BUGS code is easy to extend to even more complicated models, including CDMs (Culpepper, 2015).
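As a minimal illustration of the sampling machinery described above (not WinBUGS's actual implementation), the following sketch runs a random-walk Metropolis sampler on a toy one-parameter target; WinBUGS interleaves steps like this with Gibbs draws from tractable full conditionals.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_post(theta):
    # Toy log-posterior: a standard normal target stands in for the
    # full conditional of a single model parameter.
    return -0.5 * theta**2

# Random-walk Metropolis: propose a local move, accept it with
# probability min(1, posterior ratio).
theta, draws = 0.0, []
for _ in range(10_000):
    prop = theta + rng.normal(scale=0.5)
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop
    draws.append(theta)

print(np.mean(draws[5_000:]))  # posterior mean after burn-in, near 0
```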
For the MC-CDT model, the prior distribution of each element of the examinees' attribute profiles is assumed to be $\alpha_{ik} \sim \text{Bernoulli}(0.5)$, which generally represents randomness or the absence of prior information (Wang et al., 2021; Zhan et al., 2015).
The priors of item and testlet parameters are specified as
and
The testlet variances were assigned inverse-gamma prior distributions, the values of which are usually set in models as
4. Simulation Study 1
The main goals of Simulation Study 1 were (a) to evaluate the performance of the MC-CDT model under various conditions and (b) to examine the issues caused by using a traditional MC-CDM to fit MC data with testlets. The study fully crossed three manipulated independent variables: the quality of the item parameters, the size of the testlet effect, and the number of testlets. We also report, in the Online Appendix, a small supplementary simulation that manipulated sample size (250, 500, 1,000, and 2,000) and test length (15 or 30 items) to check the robustness of the two models under small samples and short tests. Each data set in this study was generated using the MC-CDT model (see Equation 8) across all conditions.
4.1. Examinee Generation
The attribute profile $\alpha_i$ of each examinee was generated by drawing a latent vector $\theta_i$ from a multivariate normal distribution and dichotomizing each element, such that $\alpha_{ik} = 1$ if $\theta_{ik} > z_c$ and $\alpha_{ik} = 0$ otherwise, where $z_c$ is a cutoff value, which, in this study, was fixed at 0.
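A minimal sketch of this generation scheme follows. The common correlation of 0.5 stands in for the article's "moderate correlation," and the number of attributes is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(42)

def generate_profiles(n_examinees, n_attrs, rho=0.5, cutoff=0.0):
    """Draw latent traits from MVN(0, Sigma) with common correlation rho
    and threshold them at the cutoff z_c = 0 to obtain binary attribute
    profiles."""
    sigma = np.full((n_attrs, n_attrs), rho)
    np.fill_diagonal(sigma, 1.0)
    theta = rng.multivariate_normal(np.zeros(n_attrs), sigma, size=n_examinees)
    return (theta > cutoff).astype(int)

alpha = generate_profiles(1000, 5)
print(alpha[:3])
print(alpha.mean(axis=0))  # each attribute mastered by about half the sample
```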
4.2. Testlet Number
The 30 items shown in Table 1 were assigned to either three or six testlets, depending on the condition. In the three-testlet condition, the first 10 items constituted the first testlet, the middle 10 items the second, and the final 10 items the third. In the six-testlet condition, each testlet comprised five items, following the same assignment rule.
4.3. Testlet Effect
The variance of the random testlet effect, $\sigma^2_d$, was set to 0.25, 0.5, or 1, representing low, medium, and high levels of local dependence between items within a testlet.
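Under these settings, the item-to-testlet assignment and the examinee-by-testlet random effects can be simulated as below; the function and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)

def generate_testlet_effects(n_examinees, n_items, n_testlets, var_gamma):
    """Assign items to equal-sized testlets in order (as in Section 4.2)
    and draw one random effect per examinee-testlet pair from
    N(0, var_gamma), with var_gamma in {0.25, 0.5, 1}."""
    items_per_testlet = n_items // n_testlets
    testlet_of_item = np.repeat(np.arange(n_testlets), items_per_testlet)
    gamma = rng.normal(0.0, np.sqrt(var_gamma), size=(n_examinees, n_testlets))
    return testlet_of_item, gamma

testlet_of_item, gamma = generate_testlet_effects(1000, 30, 3, 0.5)
print(testlet_of_item)             # item -> testlet map: 0...0 1...1 2...2
print(gamma.var(axis=0).round(2))  # empirical variances near 0.5
```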
4.4. Item Parameters
Existing studies have shown that item quality has a significant impact on the recovery accuracy of item parameters and examinees' attribute profiles (Bradshaw et al., 2014; de la Torre et al., 2010; Ma et al., 2016). Thus, it was necessary to investigate the influence of item quality in the MC-CDT model. Referring to Ma et al.'s (2016) setting of item parameters, items with
4.5. Estimation
The R package R2WinBUGS (Sturtz et al., 2005) was used to run the MCMC method in WinBUGS. Parameter estimates were averaged over three parallel chains with initial values chosen randomly by WinBUGS. Thirty replications were performed for each condition. In every replication, each chain was run for 10,000 iterations, with the first 5,000 discarded as burn-in. These numbers were determined by checking convergence with the multivariate potential scale reduction factor.
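For reference, the following sketch computes the univariate potential scale reduction factor for one parameter monitored across parallel chains; the article uses the multivariate version, but the comparison of between-chain and within-chain variability is the same idea.

```python
import numpy as np

def psrf(chains):
    """Univariate potential scale reduction factor (R-hat) for one
    parameter; `chains` has shape (n_chains, n_draws)."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled posterior variance
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(0)
well_mixed = rng.normal(size=(3, 5000))      # three chains, same target
print(round(psrf(well_mixed), 3))            # close to 1 indicates convergence
```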
4.6. Evaluation Criterion
The bias and root mean square error (RMSE) of the item parameter estimates were computed as
$$\mathrm{Bias} = \frac{1}{R}\sum_{r=1}^{R}\left(\hat{\lambda}^{(r)} - \lambda\right), \qquad \mathrm{RMSE} = \sqrt{\frac{1}{R}\sum_{r=1}^{R}\left(\hat{\lambda}^{(r)} - \lambda\right)^{2}},$$
where $\lambda$ is the true value of a generic parameter, $\hat{\lambda}^{(r)}$ is its estimate in the $r$th replication, and $R$ is the number of replications.
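These two criteria translate directly into Python; the true value and the replication draws below are illustrative only.

```python
import numpy as np

def bias_and_rmse(estimates, truth):
    """Monte Carlo bias and RMSE of a parameter estimate across
    replications; `estimates` has shape (n_replications,)."""
    estimates = np.asarray(estimates, dtype=float)
    bias = np.mean(estimates - truth)
    rmse = np.sqrt(np.mean((estimates - truth) ** 2))
    return bias, rmse

# Example: 30 replications of an item parameter whose true value is 1.5.
rng = np.random.default_rng(3)
est = 1.5 + rng.normal(scale=0.1, size=30)
print(bias_and_rmse(est, 1.5))
```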
The pattern correct classification rate (PCCR) and the attribute correct classification rate (ACCR) were used to quantify the estimation accuracy of the whole attribute profile and of each individual attribute, respectively. For each replication, they were computed as
$$\mathrm{PCCR} = \frac{1}{N}\sum_{i=1}^{N} I\left(\hat{\boldsymbol{\alpha}}_i = \boldsymbol{\alpha}_i\right), \qquad \mathrm{ACCR}_k = \frac{1}{N}\sum_{i=1}^{N} I\left(\hat{\alpha}_{ik} = \alpha_{ik}\right),$$
where $N$ is the number of examinees and $I(\cdot)$ is the indicator function, equal to 1 when the $i$th examinee who has mastered (or not mastered) the attribute (or attribute profile) is classified correctly and 0 otherwise.
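These classification rates are straightforward to compute from the true and estimated attribute matrices, as in the following sketch (the small arrays are illustrative):

```python
import numpy as np

def pccr_accr(alpha_hat, alpha_true):
    """PCCR: proportion of examinees whose whole attribute profile is
    recovered; ACCR: per-attribute proportion of correct classifications.
    Both arrays have shape (n_examinees, n_attributes)."""
    match = alpha_hat == alpha_true
    pccr = match.all(axis=1).mean()
    accr = match.mean(axis=0)       # one value per attribute
    return pccr, accr, accr.mean()  # last value is the AACCR

true = np.array([[1, 0, 1], [0, 0, 1], [1, 1, 1]])
est  = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 1]])
pccr, accr, aaccr = pccr_accr(est, true)
print(pccr, accr, aaccr)  # 2/3, [1., 2/3, 1.], 8/9
```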
4.7. Study 1 Results
Figure 3 shows the average bias and RMSE values of the variance of the testlet effect parameters ($\sigma^2_d$) for the MC-CDT model under the various conditions.

Figure 3. Bias and root mean square error of testlet effect variance ($\sigma^2_d$) for the multiple-choice cognitive diagnostic testlet model in Simulation Study 1.
Figures 4 and 5 show the bias and RMSE values of the item parameters for the MC-CDT and MC1 models under the various conditions. For the MC-CDT model, the estimation accuracy of the item parameters was fairly good, with bias and RMSE values relatively small across all item quality levels and testlet effect configurations. Biases ranged from −0.086 to 0.097, close to zero, and RMSE values ranged from 0.026 to 0.287. The bias and RMSE values of the item parameters were smallest in the high item quality condition and largest in the low quality condition. These results were expected and accord with the conclusion of de la Torre et al. (2010) that extreme response probabilities have less sampling variability. Furthermore, data sets generated with a low level of testlet effect led to slightly more accurate parameter estimates. It is worth noting that the number of testlets had little effect on either bias or RMSE, suggesting that the MC-CDT model can accurately estimate the item parameters regardless of the number of testlets in a test. Overall, the desired estimation accuracy of item parameters was achieved with the MC-CDT model.

Figure 4. Bias of item parameters for the multiple-choice cognitive diagnostic testlet (MC-CDT) and MC1 models (true model = MC-CDT). Note. In the axial coordinates, 3 and 6 represent the testlet number; 0.25, 0.5, and 1 represent the testlet effect (i.e., $\sigma^2_d$).

Figure 5. Root mean square error of item parameters for the multiple-choice cognitive diagnostic testlet (MC-CDT) and MC1 models (true model = MC-CDT). Note. In the axial coordinates, 3 and 6 represent the testlet number; 0.25, 0.5, and 1 represent the testlet effect (i.e., $\sigma^2_d$).
As for the MC1 model, the estimation accuracy of the item parameters was worse across all conditions. Bias values from the MC1 model ranged from −0.708 to 0.637, and RMSEs ranged from 0.083 to 0.474, much larger than those from the MC-CDT model, as shown in Figures 4 and 5. In other words, the MC-CDT model recovered the item parameters much better than the MC1 model. These results supported our expectation that, when MC data are generated with testlets, a traditional MC-CDM yields much worse parameter estimates.
The average correct classification rates of the examinees' attribute profiles and individual attributes under both the MC-CDT and the MC1 models were also investigated. As seen in Table 2, the PCCR and the mean of the ACCRs (denoted AACCR) were virtually unaffected by the number of testlets, suggesting that the MC-CDT model is stable across different numbers of testlets. The estimation accuracy of the examinee parameters was influenced most by item quality. This result was expected because lower item quality weakens the link between examinees' attribute profiles and their probability of choosing the correct option, increasing the chance of responses inconsistent with those profiles; this trend is consistent with the psychometric literature at large (Bradshaw et al., 2014; de la Torre et al., 2010). For the MC-CDT model, the PCCR values were all above 0.94 and the AACCR values all above 0.97 in the high item quality condition, whereas the PCCR and AACCR values decreased to 0.787 and 0.858, respectively, in the low item quality condition. The PCCR and AACCR values of the MC1 model were consistently smaller than those of the MC-CDT model under the same conditions; in particular, the minimum PCCR and AACCR decreased to 0.703 and 0.804, respectively, when the item quality was low. The strength of the testlet effect also had a mild effect on the estimation accuracy of the examinee parameters, as shown in Table 2: as the testlet effect increased, the PCCR and AACCR values slightly decreased, because a larger random testlet effect increases the probability that examinees with the same attribute profile choose different options on the same item. This result is similar to those of previous studies (Glas et al., 2000) and suggests that inaccurate examinee parameter estimates will be obtained if a traditional MC-CDM is used to fit MC data with testlets, a negative effect that may grow when item quality is not particularly high.
Recovery Rate of Examinee Parameters for Model Comparison in Simulation Study 1
Note. IQ = item quality; TE = testlet effect; TN = testlet number; MC-CDT = multiple-choice cognitive diagnostic testlet; PCCR = pattern correct classification rate; ACCR = attribute correct classification rate; AACCR = average ACCR.
An additional small simulation study considering limited test length and small sample sizes was performed to investigate the robustness of the MC-CDT model. As shown in Table S1 of the Online Appendix, the PCCR of the MC-CDT model ranged from 0.582 to 0.968, an improvement of 3%–20.8% over the MC1 model, and its AACCR ranged from 0.779 to 0.992, an improvement of 0.3%–11.8%. Test length, sample size, and item quality also had a great impact on the PCCR and AACCR values in the supplementary simulations. When the test length was 15 items, an acceptable classification rate (i.e., PCCR > 0.75 and AACCR > 0.85) was obtained with a sample size of 500 and moderate item quality; when the test length increased to 30 items, a similar classification rate was obtained with a sample size of only 250. These results indicate that the MC-CDT model performs well even when item numbers are limited and sample sizes are small, and that lower PCCR and AACCR values will be obtained if a traditional MC-CDM is fit to MC data with testlets. Moreover, the bias and RMSE values of the item parameters in the MC-CDT model were smaller than those in the MC1 model, and the testlet parameters of the MC-CDT model were also estimated accurately in these scenarios, as shown in Figures S1 through S3 of the Online Appendix.
5. Simulation Study 2
The purpose of Simulation Study 2 was to investigate the robustness of the MC-CDT model. The true model in this study was the MC1 model, and both the MC-CDT and MC1 models were used to fit the MC data without testlets to evaluate (a) whether the MC-CDT model provides parameter estimation accuracy comparable to that of the MC1 model and (b) whether the MC-CDT model can identify that there are no testlet effects.
5.1. Design
The data sets were generated using the MC1 model, and both models were then fit to these MC data without testlet effects. Because no testlet effects were included when the data sets were generated by the MC1 model, the true values of the testlet effect variances were 0.
5.2. Study 2 Results
Figure 6 shows the average bias and RMSE values of the testlet effect variance for the MC-CDT model. There is no corresponding result for the MC1 model because it contains no testlet effect terms. For the MC-CDT model, the bias values ranged from 0.0040 to 0.0079, and the RMSEs ranged from 0.0093 to 0.0122. These values were very close to zero across all conditions, indicating that the MC-CDT model can effectively identify the absence of a testlet effect when the test contains no testlets.

Figure 6. Bias and root mean square error of testlet effect variance for the multiple-choice cognitive diagnostic testlet model in Simulation Study 2. Note. In the axial coordinates, 3 and 6 represent the testlet number; H, M, and L represent the high, medium, and low item quality.
Figures 7 and 8 give the bias and RMSE values of the item parameters for the MC-CDT and MC1 models. The bias values obtained from the MC1 model ranged from −0.016 to 0.026 with a mean of 0.006, and the RMSE values ranged from 0.012 to 0.118 with a mean of 0.061 in the high-quality item condition. Meanwhile, the bias values ranged from −0.033 to 0.069 with a mean of 0.032, and the RMSE values ranged from 0.023 to 0.219 with a mean of 0.108 in the low-quality item condition. These results indicate that the MC1 model produced relatively accurate item parameter estimates, which can be treated as a baseline for the estimation results of the MC-CDT model. Although the ranges and means of both bias and RMSE in the low-quality item condition were larger than those in the high-quality condition, the differences between the two conditions were trivial.

Figure 7. Bias of item parameters for the multiple-choice cognitive diagnostic testlet (MC-CDT) and MC1 models. Note. In the axial coordinates, 3 and 6 represent the testlet number; H, M, and L represent the high, medium, and low item quality.

Figure 8. Root mean square error of item parameters for the multiple-choice cognitive diagnostic testlet (MC-CDT) and MC1 models. Note. In the axial coordinates, 3 and 6 represent the testlet number; H, M, and L represent the high, medium, and low item quality.
The dark part of the box plots in Figures 7 and 8 shows the item parameter estimation results for the MC-CDT model. The bias and RMSE values were almost the same across the different testlet numbers, indicating that the number of testlets had a negligible influence on the estimation accuracy of the item parameters. The magnitudes of the bias (between −0.031 and 0.029, mean = 0.008) and the RMSE (between 0.015 and 0.133, mean = 0.072) were small in the high-quality item condition. Similarly, the magnitudes of the bias (between −0.038 and 0.077, mean = 0.038) and the RMSE (between 0.032 and 0.236, mean = 0.116) were also small in the low-quality item condition. These results were comparable to those obtained from the MC1 model under the same conditions, suggesting that the MC-CDT model can be used to fit MC data even when no testlets are present.
Table 3 shows the PCCR and AACCR values calculated using both the MC-CDT and the MC1 models. The estimation accuracy of the examinee parameters was again affected by item quality: higher item quality led to higher PCCR and AACCR values in both models. However, the PCCR and AACCR values were basically unaffected by the number of testlets in the MC-CDT model. Although the MC1 model performed better than the MC-CDT model in terms of both the PCCR and the AACCR, the difference was trivial: the PCCR and AACCR ranged from 0.829 to 0.977 and from 0.958 to 0.995, respectively, for the MC1 model, and from 0.821 to 0.970 and from 0.948 to 0.991, respectively, for the MC-CDT model. These results are reasonable because the MC1 model was the true data-generating model in this case. Most importantly, the estimation accuracy of the examinees' parameters in the MC-CDT model was comparable to that of the traditional MC-CDM. In other words, the MC-CDT model fit the MC data very well, even without testlets.
Recovery Rate of Examinee Parameters for Model Comparison in Simulation Study 2
Note. MC-CDT = multiple-choice cognitive diagnostic testlet; PCCR = pattern correct classification rate; AACCR = average attribute correct classification rate.
6. An Empirical Study
6.1. Data Description
To investigate the performance of the MC-CDT model with real data, we analyzed 15 items from an advanced English reading assessment. The assessment used three testlets in total; each testlet contained five MC items, and each item had four options. The data were collected from a sample of 607 undergraduate students at a Chinese university.
This assessment was developed in accordance with the national language standard China's Standards of English Language Ability, which is published by the Ministry of Education and the National Language Commission of the People's Republic of China and serves as a yardstick for English teaching and learning (National Education Examinations Authority, 2018). Four foreign language professors (two of whom were involved in developing the assessment) worked together to define the attributes required by the assessment. The six attributes measured were: (A1) identifying details, (A2) comprehending syntactic relationships, (A3) deducing implied meaning, (A4) summarizing main ideas, (A5) inferring attitude and tone, and (A6) understanding rhetorical devices. Option-level q-vectors for each item were also specified. Table 4 presents the descriptions of the attributes and their affiliations with each item. All four professors agreed that the coding of all 15 items was reasonable and accurate.
List of Expert-Defined Attributes
Details of the Option-Level q-Vectors for Each Item
Note. The bolded q-vectors indicate the correct answer.
The Akaike information criterion (AIC; Akaike, 1974) and the Bayesian information criterion (BIC; Schwarz, 1978) were computed for the model fit comparisons. In Bayesian analysis, the AIC and the BIC can be defined as
$$\mathrm{AIC} = -2\mathrm{LL} + 2\,\mathrm{NP}, \qquad \mathrm{BIC} = -2\mathrm{LL} + \mathrm{NP}\ln(N),$$
where $-2\mathrm{LL}$ is $-2$ times the log-likelihood, NP is the number of free parameters, and $N$ is the number of examinees.
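In code, both criteria follow directly from the deviance; the numbers below are illustrative, not the article's Table 6 values.

```python
import math

def aic_bic(neg2ll, n_params, n_examinees):
    """AIC and BIC from the deviance (-2 log-likelihood), the number of
    free parameters (NP), and the sample size N."""
    aic = neg2ll + 2 * n_params
    bic = neg2ll + n_params * math.log(n_examinees)
    return aic, bic

# Hypothetical deviance and parameter count for a sample of 607 examinees.
print(aic_bic(neg2ll=18000.0, n_params=120, n_examinees=607))
```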
6.2. Results
As shown in Table 6, the AIC, BIC, and −2LL values obtained from the MC-CDT model were consistently lower than those obtained from the MC1 model, which means that the MC-CDT model was preferred for this data set. The PPP value was 0.84 for the MC1 model and 0.78 for the MC-CDT model, indicating that although both models fit the empirical data adequately, the MC-CDT model fit better.
Model Fittings in the Empirical Study
Note. −2LL = −2 log-likelihood; AIC = Akaike information criterion; BIC = Bayesian information criterion; NP = number of parameters; PPP = posterior predictive probability; MC-CDT = multiple-choice cognitive diagnostic testlet.
Table 7 presents the estimated item parameters and the variances of the testlet effects. To compare the item parameter estimates on the same scale, the item parameters (
Estimated Item Parameters (
Note. Standard errors presented in parentheses. MC-CDT = multiple-choice cognitive diagnostic testlet.
7. Discussion
Although various CDMs have been proposed to analyze response data with testlets (Hansen, 2013; Zhan et al., 2018), none of them can process MC data. This article proposes a new CDM for MC items in tests with testlets, the MC-CDT model. The results of Simulation Study 1 demonstrated that the item and examinee parameters of the MC-CDT model can be estimated accurately using a fully Bayesian MCMC algorithm. Specifically, item quality and testlet effect size affected the item and examinee parameter estimates, but the number of testlets had a negligible influence. This suggests that test developers should pay more attention to item quality and to the degree of dependence between items within testlets. In contrast, the traditional MC-CDM (the MC1 model in this article) was not a favorable fit for testlet-based tests: its bias and RMSE values increased greatly, while its PCCR and AACCR decreased notably. These results indicate that traditional MC-CDMs are not appropriate for MC data involving testlets, and that ignoring the local dependence between items within testlets is inappropriate.
The results of Simulation Study 2 showed that the MC-CDT model is also suitable for fitting MC data without testlets. The MC-CDT model recovered not only the item parameters but also the examinees' attribute profiles with accuracy comparable to that of the true model, and it correctly identified the absence of testlet effects.
The results of the empirical study indicate that the MC-CDT model fit the real data better than the MC1 model for an MC test using testlets. Moreover, the MC-CDT model can estimate the size of the testlet effect to describe the local dependence within each testlet, which cannot be achieved with a traditional MC-CDM.
Although the MC-CDT model demonstrated advantages in our two simulation studies and the real data study, several areas should be explored in future research. First, the MC1 model was used as the base model in the current research because of its simplicity and feasibility. However, the MC1 model is a constrained model that strongly assumes each item contains only one item parameter, so it may not be able to handle all situations encountered in practice. Beyond that, another limitation of the MC-CDT model is that it only allows the testlet effect to impact the response when
Second, the random-effect testlet structure was integrated into the MC1 model in this article to allow us to analyze the MC data with testlets. However, the fixed-effect testlet model also has certain advantages in terms of describing local item dependence within a testlet (Hoskens & De Boeck, 1997; Tuerlinckx & De Boeck, 2001; Wang & Wilson, 2005). Further studies should be conducted to compare the random- and fixed-effect testlet approaches in MC diagnosis assessments.
Third, following existing cognitive diagnostic studies, the current research investigated only the independent attribute structure and did not consider other attribute hierarchy structures, such as linear, convergent, divergent, or unstructured hierarchies (Leighton et al., 2004). Under such hierarchies, some combinations of attributes may be nonexistent, so the MC-CDT model should be extended to incorporate other attribute hierarchy structures and its performance under them should be explored. Indeed, de la Torre (2004) proposed a method for modeling the joint distribution of a latent attribute vector based on higher order latent traits, and it would be interesting to probe the effectiveness of the HO-DINA model in the context of MC data with testlets.
Furthermore, in the current study, a moderate correlation was assumed in the examinee ability correlation matrix, whereas Henson and Douglas (2005) set the correlations to zero to represent independence between attributes. Future research should explore conditions with no, low, or high correlations to further examine the influence of attribute correlations on the MC-CDT model.
Finally, as previously stated by de la Torre (2009a), the attribute profiles represented by the distractors should be subsets of the attribute profile corresponding to the key answer, and the MC-CDT model also follows this principle. However, when certain misconceptions are examined, the q-vector of a distractor may not be a subset of the q-vector of the key answer. In such instances, further investigations using the MC-CDT model may provide interesting insights.
Supplemental Material
Supplemental Material, sj-docx-1-jeb-10.3102_10769986231165622, for "Cognitive Diagnosis Testlet Model for Multiple-Choice Items" by Lei Guo, Wenjie Zhou, and Xiao Li in Journal of Educational and Behavioral Statistics.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research and/or authorship of this article: This work was supported by the National Natural Science Foundation of China (31900793) and the Fundamental Research Funds for the Central Universities (SWU2109222).
