Sage Journals: Discover world-class research

Abstract

As computer-based testing becomes more prevalent, the attention paid to response time (RT) in assessment practice and psychometric research correspondingly increases. This study explores the rate of Type I error in detecting preknowledge cheating behaviors, the power of the Kullback-Leibler (KL) divergence measure, and the L person fit statistic under various conditions by modeling patterns of response accuracy (RA) and RT using a joint hierarchical model. Four design factors were manipulated: test length, the difficulty level of compromised items, the ratio of compromised items, and variations in the RTs for these compromised items. The results indicate that the KL measure consistently exhibits higher power and Type I error rates than the person fit statistics $L ({l_{s}}^{y})$ and $L (l^{t})$ across all RA and RT patterns and under all conditions. Furthermore, the KL measure demonstrates the greatest power at a medium test length.

Plain language summary

Investigation of Pre-Knowledge Cheating Using Response Time Model

This study examines the Type I error rate and power of the Kullback–Leibler (KL) divergence measure and the L person-fit statistic in detecting preknowledge cheating under different conditions, by modeling patterns of response accuracy (RA) and response time (RT) with a joint hierarchical model. Item responses were modeled with the three-parameter logistic model, and RTs were modeled with the log-normal RT model. To obtain the mean Type I error rates of the methods, 200 data sets were generated (50 replications per condition) under the conditions of test length and difficulty level of the compromised items. To obtain the mean power rates of the methods, 1,800 data sets were generated (50 replications per condition) under the conditions of test length, difficulty level of the compromised items, ratio of compromised items, and variation in the RTs of the compromised items. The Gibbs sampling algorithm was used as a Markov chain Monte Carlo (MCMC) approach to estimate the model parameters for each data set. In the cheating scenario, examinees with item preknowledge were selected from those with low ability levels, and fraudulent data were created for 5% of the 1,000 examinees. The KL measure had higher mean power and Type I error rates than the person-fit statistics L(l_s^y) and L(l^t) for RA and RT patterns under all conditions. In addition, the KL measure for RA and RT patterns showed its highest mean power at medium test length, when the RTs of compromised items of medium and high difficulty were fixed at 20 s and the ratio of compromised items was high.

Keywords

joint modeling; person fit statistics; response times; response accuracy

Computer-based testing has become widespread in large-scale assessments and in speeded in-class tests. In computer-based tests, examinees' responses to test items also provide more detailed timing information, such as the moment the item appears on the computer screen, the time spent solving it, and the moment the response to the item is viewed. Thus, in computer-based testing (CBT), the time information recorded for an examinee across the test items is used to support or define aberrant testing behaviors (such as preknowledge cheating, answer copying, and rapid guessing) and aberrant response patterns. In recent studies of computerized adaptive and computer-based testing, examinees’ response times (RTs) to items are considered an important variable in identifying examinees who have item preknowledge, copy answers, or guess rapidly (Bolsinova et al., 2017; Boughton et al., 2017; Cizek & Wollack, 2017; De Boeck et al., 2017; Eckerly et al., 2018; Fox & Marianti, 2016, 2017; Gorney & Wollack, 2022; Guo et al., 2016; Kasli & Zopluoglu, 2018; Kasli et al., 2023; Lee & Jia, 2014; Lee & Wollack, 2017; Lee, 2018; Lu et al., 2020; Man & Harring, 2021; Man et al., 2018; Man, Harring, Jiao et al., 2019; Marianti et al., 2014; Meng et al., 2015; Qian et al., 2016; Ranger et al., 2021; Sinharay & Johnson, 2020; Toton & Maynes, 2019; Ulitzsch, 2019; van der Linden & Guo, 2008; van der Linden & van Krimpen-Stoop, 2003; van der Linden et al., 2010; van der Linden, 2006; van Rijn & Ali, 2017; Wang et al., 2018; Wise & Kong, 2005; Zhan et al., 2018; Zhan et al., 2021). Accordingly, various psychometric models for RTs have been developed to investigate topics such as speed–accuracy relationships, speed tests, testing strategies (e.g., rapid guessing behavior), and subgroup differences (Fox et al., 2007; Klein Entink et al., 2009; Meijer & Sotaridona, 2006; van der Linden & Guo, 2008; van der Linden, 2006, 2007; Wise & DeMars, 2006; Wise & Kong, 2005).

Several response time (RT) models have been developed, such as the loglinear RT model, the effective RT model, the Bayesian lognormal RT model, models that combine an RT model with an item response theory (IRT) model (e.g., 1PL, 2PL, 3PL) to model responses and RTs simultaneously, and the mixture model (Lee & Wollack, 2017; Meijer & Sotaridona, 2006; van der Linden et al., 1999; van der Linden et al., 2010; van der Linden, 2006, 2007; von Davier & Rost, 2006; Wang & Xu, 2015; Wang et al., 2018; Zhan et al., 2018). The RTs do not contain any information about the examinees’ abilities. Thus, integrating RTs with item responses contributes to the investigation of response behaviors and may help determine the type of aberrant response behavior (Lee, 2018). A joint model in which the response accuracy (RA) and RT data are modeled with a hierarchical latent variable model, called H-IRTRT, has been proposed (van der Linden, 2007). At the first level, the responses and RTs are modeled separately: one model relates the distribution of the responses to the examinees’ latent ability, and the other relates the distribution of the RTs to the examinees’ latent speed. The two models are not directly linked; conditional on ability and speed, the responses and RTs are assumed to be independent (Fox & Marianti, 2017; van der Linden, 2011). At the second level, population distributions are defined for the latent variables (speed and ability) of the potential test takers and for the item parameters of the RT model and the IRT model (Lee, 2018). The covariance matrix for the item parameters defines the correlation between the parameters in the first stage of modeling and the item parameters. Thus, the covariance structures defined in both stages of modeling reveal the relationship between item responses and RTs (Fox & Marianti, 2017; van der Linden, 2009).

Several studies have applied H-IRTRT (Bolsinova et al., 2017; Boughton et al., 2017; Fox & Marianti, 2016, 2017; Fox et al., 2007; Klein Entink et al., 2009; Lee & Wollack, 2020; Man & Harring, 2021; Man et al., 2018; Man, Harring, Jiao et al., 2019; Molenaar et al., 2015a, 2015b, 2016; Qian et al., 2016; Ranger et al., 2020, 2021; Sinharay & Johnson, 2020; van der Linden & Guo, 2008; van Rijn & Ali, 2017; Wang & Xu, 2015; Wang et al., 2013, 2018; Zhan et al., 2021). The Bayesian person-fit approach is used to detect aberrant testing behavior based on latent-variable models and on both responses and RTs. Standardized residuals, mixture models, data mining methods, and the ${\bar{χ}}^{2}$ statistic are used to detect test fraud (Boughton et al., 2017; Fox & Marianti, 2017; Man et al., 2018; Man, Harring, & Sinharay, 2019; Marianti et al., 2014; Qian et al., 2016; Sinharay & Johnson, 2020; van der Linden & Guo, 2008; van der Linden et al., 2010; Wang et al., 2018; Zhan et al., 2018). Existing studies that used a joint model to detect preknowledge cheating from the departure of observed RA patterns from expected RA patterns have used the Neyman–Pearson lemma approach, the $L ({l_{s}}^{y})$ , PSS (Posterior Shift Statistic), SBNPL, and $L_{s}$ person-fit statistics, and residual analysis. For determining aberrant RT patterns, the $L (l^{t})$ , $χ_{pf}$ , and $Λ_{t}$ person-fit statistics and the Kullback–Leibler divergence (KL) approach are widely used (Belov, 2013, 2015, 2016; Kullback & Leibler, 1951; Levine & Drasgow, 1988; Marianti et al., 2014; Qian et al., 2016; Sinharay, 2017a, 2018, 2020).

A Bayesian person-fit approach has been used to detect preknowledge cheating (Fox & Marianti, 2017; Man et al., 2018). Fox and Marianti (2017) investigated the performance of the $L$ person-fit statistic under different conditions by modeling the patterns of RA and RTs using a joint hierarchical model. In that study, the researchers investigated the performance of the $L ({l_{s}}^{y})$ and $L (l^{t})$ person-fit statistics in the detection of cheating behavior under varying conditions of aberrant response patterns. The results showed that, when both cheating and random RT behavior were present, the person-fit statistics performed better at higher rates of aberrant response patterns. Man et al. (2018) found that the parametric $L (l^{t})$ person-fit statistic and the RT-based nonparametric KL measure, both applied via H-IRTRT modeling, were effective in the detection of preknowledge cheating and answer copying. They concluded that the RT-based nonparametric KL measure proposed in their study performed better under all conditions and was more sensitive in detecting preknowledge cheating than in detecting answer copying.

Purpose of the Study

A review of related studies shows that few have investigated aberrant testing behaviors via H-IRTRT modeling. Additionally, because the KL measure is a nonparametric measure that can evaluate the RA and RT patterns of the test taker directly from the test scores, calculating KL measure values appears to be an effective approach for evaluating the test-taking behavior of examinees (Man et al., 2018). Studies have shown that person-fit statistics based on the log-likelihood of RA and RT patterns are more powerful than other person-fit statistics in detecting aberrant behavior in educational testing (Dimitrov & Smith, 2006; Karabatsos, 2003). Moreover, to our knowledge, no study has considered the $L$ person-fit statistics and the KL divergence approach together in the detection of aberrant RA and RT patterns using a joint hierarchical model in preknowledge cheating. Although the performance of the KL measure in determining aberrant RA patterns under various conditions has been investigated, its ability to determine aberrant RT patterns has not been adequately investigated. Further research is required to determine the response information for the detection of aberrant response behaviors. This study aims to reveal the performance of the KL measure and the $L$ person-fit statistic in detecting preknowledge cheating behaviors under different conditions by modeling patterns of RA and RTs using a joint hierarchical model. In related studies on preknowledge cheating, the criteria used to evaluate the performance of the methods were the Type I error rate and power. Methods for detecting preknowledge cheating are expected to correctly detect examinees who have item preknowledge when preknowledge cheating has occurred in the test.
Accordingly, power is the correct identification of examinees who have item preknowledge, whereas a Type I error is the mistaken identification of examinees who do not have item preknowledge as having it. Thus, Type I error and power rates were used as outcome measures to evaluate the performance of the methods.

Research Question

What are the Type I error and power rates of the KL measure and the $L$ person-fit statistic when used to detect preknowledge cheating under various conditions by modeling patterns of RA and RTs using a joint hierarchical model?

Sub-questions

The sub-questions are as follows:

What are the main and common effects of test length and the difficulty level of compromised items on the Type I error rates of the KL measure and of the person-fit statistics $L ({l_{s}}^{y})$ for RA patterns and $L (l^{t})$ for RT patterns?

What are the main and common effects of test length and of the difficulty level, ratio, and RT variation of compromised items on the power rates of the KL measure and of the person-fit statistics $L ({l_{s}}^{y})$ for RA patterns and $L (l^{t})$ for RT patterns?

Method

Table 1 presents the variables of the study and their levels, which were chosen in light of studies in the relevant literature.

Table 1.

Simulation Design Conditions and Levels.

	Condition	Level values	Number of levels
Manipulated	Test length	15, 50	2
	Difficulty level of compromised items	Medium, high	2
	Ratio of compromised items	20%, 40%, 60%	3
	Variation in RTs of the compromised items	U(10,15), fixed at 20 s, fixed at 30 s	3
Fixed	Sample size*	1,000	1
	Proportion of examinees with preknowledge*	5%	1
Total number of conditions			2 × 2 × 3 × 3 × 1 × 1 = 36

* Fixed variable.
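As a quick check on the design size, the crossing of the manipulated factors can be enumerated directly. The following is only a minimal sketch: the variable names are illustrative and not taken from the authors' code, and the 50 replications per condition follow the study description.

```python
from itertools import product

# Factor levels as listed in Table 1 (names are illustrative, not the authors').
test_lengths = [15, 50]
difficulty_levels = ["medium", "high"]
compromised_ratios = [0.20, 0.40, 0.60]
rt_schemes = ["U(10,15)", "fixed at 20 s", "fixed at 30 s"]
REPLICATIONS = 50  # replications per condition, as stated in the study

# Power conditions cross all four manipulated factors.
power_conditions = list(
    product(test_lengths, difficulty_levels, compromised_ratios, rt_schemes)
)
# Type I error conditions cross only test length and difficulty level.
type1_conditions = list(product(test_lengths, difficulty_levels))

n_power_datasets = len(power_conditions) * REPLICATIONS
n_type1_datasets = len(type1_conditions) * REPLICATIONS

print(len(power_conditions), n_power_datasets, n_type1_datasets)
```

The counts agree with the 36 conditions in Table 1 and with the 1,800 and 200 data sets reported for the power and Type I error analyses.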

In related studies in which preknowledge cheating was detected through RT models, the sample size was varied among 500, 1,000, 10,000, and 50,000, and the test length among 15, 20, 25, 30, and 50 items (Fox & Marianti, 2017; Lee, 2018; Man et al., 2018; Marianti et al., 2014; Shu et al., 2013; Sinharay, 2020; van der Linden & Guo, 2008; van der Linden & van Krimpen-Stoop, 2003). In this study, the sample size was not varied and was fixed at 1,000 because of the long run times of the Markov chain Monte Carlo (MCMC) algorithm used for Bayesian estimation of the joint model parameters. In addition, in the cheating scenario, examinees with item preknowledge were selected from those with low ability levels, and the proportion of fraudulent data was set at 5% of the 1,000 examinees to reduce parameter estimation errors during the analysis. In the relevant literature, test length is regarded as an important factor in detecting preknowledge cheating. Because the probability of observing cheating behavior increases with the number of items in the test, test length was set at 15 and 50 items, representing short and medium test lengths, respectively.

Studies on answer copying treat the difficulty level of the compromised items as an important factor, but this factor has not been varied in studies that detect preknowledge cheating with H-IRTRT. In a study examining the relationship between item statistics and RTs, Altuner (2019) stated that item RTs vary according to the difficulty level of the items. Thus, in this study, the difficulty level of the compromised items was set to either medium or difficult. Sunbul and Yormaz (2018) varied the difficulty levels of the items in the test and used b-parameter intervals of (−2.50–0.00) for easy items and (0.01–2.50) for difficult items. In this study, the b-parameter intervals were (−1.50–1.50) for items of medium difficulty and (1.51–3.00) for difficult items.

The ratio of the compromised items is an important factor because it affects the performance of person-fit statistics for the RA and RT patterns. Marianti et al. (2014) set the proportion of aberrant responses at 5%, 10%, and 20%. Fox and Marianti (2017) set the proportion of aberrant responses at 10% and 20%. Lee (2018) varied the ratio of compromised items at three levels: 20% (low), 40% (medium), and 60% (high) of the total test items. Considering the ratios of compromised items used in related studies, this study adopted ratios of 20% (low), 40% (medium), and 60% (high) of the total test items.

When designing a simulation study, one of the major decisions is how to simulate the difference in RT between examinees with and without item preknowledge. Meijer and Sotaridona (2006) reduced the RTs of compromised items to a half and a quarter. Van der Linden and Guo (2008) fixed the RT of compromised items at 10, 20, and 30 s. Fox and Marianti (2017) simulated aberrant RTs from a lognormal distribution with a mean equal to the average RT and a standard deviation three times the mean standard deviation of the RTs. Lee (2018) drew the RTs of compromised items from a uniform distribution between 20 and 30 s. Man et al. (2018) simulated the RTs of compromised items for preknowledge cheating by drawing from a uniform distribution in the range of 10 to 15 s. Kasli and Zopluoglu (2018) concluded that preknowledge may increase the latent speed parameter of examinees by approximately 0.45 standard deviations. In this study, the RTs of the compromised items were simulated in three ways: drawn randomly from a uniform distribution in the range of 10 to 15 s (U (10,15)), fixed at 20 s, and fixed at 30 s. These values were chosen in light of the observed RTs of examinees with preknowledge in a large-scale assessment and represent the time elapsed between remembering the compromised item and responding, as stated by Qian et al. (2016) and van der Linden and Guo (2008).

Data Simulation

This study manipulated four design factors: test length (15 and 50 items), difficulty level of compromised items (medium and high), ratio of compromised items (20, 40, and 60%), and variations in the RTs for these compromised items (U (10,15), fixed at 20 s, and fixed at 30 s). The four factors were fully crossed, resulting in 36 conditions. Fifty replications were simulated for each condition. In line with the purpose of the study, to obtain the mean Type I error rates of the methods, 200 datasets were generated with 50 replications under the conditions of test length and difficulty level of the compromised items; item responses were modeled with the 3PLM, and RTs were modeled with the Bayesian lognormal RT model. To obtain the mean power rate of the methods, 1,800 datasets were generated with 50 replications under the conditions of test length, difficulty level of the compromised items, ratio of compromised items, and change in RTs of the compromised items. Each dataset consisted of simulated RA and RT data for 15 or 50 items and 1,000 examinees according to the H-IRTRT model. The H-IRTRT model parameters were simulated from the prior distributions for each simulated dataset. The item parameters were simulated from the multivariate prior with a mean vector of $μ_{a} = 1, μ_{b} = 0, μ_{α} = 1, μ_{β} = 0$ and a diagonal covariance matrix with elements ${σ_{a}}^{2} = 0.05, {σ_{b}}^{2} = 1, {σ_{α}}^{2} = 0.05, {σ_{β}}^{2} = 1$ . To investigate the performance of the MCMC algorithm while minimizing the effect of guessing behavior, the guessing parameter was fixed at c = 0.10. The person parameters were simulated from a multivariate normal distribution with means equal to zero and variances equal to one. In Fox et al. (2007) and Fox and Marianti (2017), where the correlation between the speed and ability parameters was set at 0.30, 0.50, and 0.75, the performance of the methods increased when the correlation between the person parameters was high. Therefore, in this study, the correlation between the person parameters was fixed at 0.75. The measurement error variances were obtained from a log-normal distribution with a mean of zero and variance of 0.30. The data were generated using code written by the researcher and the “LNIRT” package (Fox et al., 2019) in the R software.
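To make the generating process concrete, the sketch below simulates one dataset in Python using only the standard library. It is not the authors' R code or the LNIRT implementation, and several choices are assumptions: the log-RT mean is taken as β_i − α_i τ_j, the error variances are drawn as exp(N(0, 0.30)), the discrimination parameters are truncated at a small positive value, and the function and variable names are hypothetical.

```python
import math
import random


def simulate_dataset(n_persons=1000, n_items=15, rho=0.75, guess=0.10, seed=1):
    """Simulate 3PL responses and log-normal RTs for one dataset (sketch)."""
    rng = random.Random(seed)
    sd_small = math.sqrt(0.05)
    # Item parameters from independent normal priors (diagonal covariance).
    a = [max(rng.gauss(1.0, sd_small), 0.05) for _ in range(n_items)]      # discrimination
    b = [rng.gauss(0.0, 1.0) for _ in range(n_items)]                      # difficulty
    alpha = [max(rng.gauss(1.0, sd_small), 0.05) for _ in range(n_items)]  # time discrimination
    beta = [rng.gauss(0.0, 1.0) for _ in range(n_items)]                   # time intensity
    # Assumption: error variances are exp(N(0, 0.30)), one per item.
    sigma2 = [math.exp(rng.gauss(0.0, math.sqrt(0.30))) for _ in range(n_items)]

    theta, tau, responses, log_rts = [], [], [], []
    for _ in range(n_persons):
        # Standard bivariate normal person parameters with correlation rho.
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        ability = z1
        speed = rho * z1 + math.sqrt(1.0 - rho ** 2) * z2
        row_y, row_t = [], []
        for i in range(n_items):
            # 3PL success probability with a fixed guessing parameter.
            p = guess + (1.0 - guess) / (1.0 + math.exp(-a[i] * (ability - b[i])))
            row_y.append(1 if rng.random() < p else 0)
            # Assumed log-RT model: mean beta_i - alpha_i * tau_j, variance sigma2_i.
            mu = beta[i] - alpha[i] * speed
            row_t.append(rng.gauss(mu, math.sqrt(sigma2[i])))
        theta.append(ability)
        tau.append(speed)
        responses.append(row_y)
        log_rts.append(row_t)
    return {"theta": theta, "tau": tau, "responses": responses, "log_rts": log_rts}


data = simulate_dataset()
```

Drawing the speed parameter as a weighted sum of two independent standard normals is one simple way to obtain the stated correlation of 0.75 with unit variances; a Cholesky factor of the 2 × 2 covariance matrix gives the same result.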

MCMC Estimation

The Gibbs sampling algorithm developed by Fox et al. (2007) and Klein Entink et al. (2009), as implemented in the LNIRT package, was used as an MCMC approach to estimate the parameters of the joint hierarchical model for the RA and RT patterns in each dataset. Parameter estimates were obtained as the posterior means of 10,000 iterations after an initial burn-in period of 1,000 iterations. The number of iterations (10,000) was chosen in light of studies that used the Gibbs sampling algorithm as an MCMC approach to estimate the model parameters (Fox & Marianti, 2017; Marianti et al., 2014; Qian et al., 2016). Before investigating the convergence of the model parameters, 10 data files were randomly selected to check whether the data of 1,000 examinees, generated in accordance with the 3PLM for 15 items, were unidimensional. The dimensionality of the data in the selected files was examined using the “sirt” (Robitzsch, 2020) and “paran” (Dinno, 2018) packages in the R software. The convergence of the model parameters (item difficulty, item discrimination, time discrimination, and time intensity) over 10,000 MCMC iterations for 15 test items was monitored by visual inspection of trace plots, posterior probability density plots, and autocorrelation graphs. The Geweke convergence test was used as a convergence diagnostic. Convergence investigations were conducted using the “R2OpenBUGS” (Thomas, 2020) and “coda” (Plummer et al., 2020) packages in the R software. The trace plots of the model parameters showed no extreme fluctuations; that is, the chains reached the posterior distribution quickly and remained stable throughout the iterations, each parameter converged, and the parameter estimates gradually stabilized without taking extreme values. The posterior probability density plots for the model parameters showed that the distributions were normal.
Thus, the chains converged, and the posterior distribution was reached. In a Markov chain, the values obtained by simulation are not independent of each other, and the relationship between these values is measured by autocorrelation (Hosmer & Lemeshow, 2013). The autocorrelation graphs for the model parameters showed that the autocorrelation values generally approached zero; that is, there was no relationship between the parameter values, and thus convergence was achieved. The Geweke diagnostic examines the convergence of a Markov chain with a two-sided Z-test statistic that compares the means of the first 10% and the last 50% of the chain. Because the hypothesis tests were set at the α = .05 level and the Geweke test statistic values of all parameters were less than 1.96, convergence was considered to be achieved (Santos et al., 2009).
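The idea behind the Geweke comparison can be illustrated with a short sketch. This is a simplified version, not the `coda` implementation: it uses plain sample variances for the standard error, whereas the usual diagnostic estimates the variances from the spectral density to account for autocorrelation. The function name and the toy chains are hypothetical.

```python
import math
import statistics


def geweke_z(chain, first=0.10, last=0.50):
    """Z-score comparing the mean of the first 10% and last 50% of a chain.

    Simplification: naive sample variances replace the spectral-density
    estimates used by the standard Geweke diagnostic.
    """
    n = len(chain)
    head = chain[: int(first * n)]
    tail = chain[n - int(last * n):]
    se = math.sqrt(
        statistics.variance(head) / len(head)
        + statistics.variance(tail) / len(tail)
    )
    return (statistics.mean(head) - statistics.mean(tail)) / se


# A chain that oscillates around a fixed level: both segment means agree.
stable_chain = [0.0, 1.0] * 500
# A chain that keeps drifting upward: the segment means differ strongly.
drifting_chain = [float(i) for i in range(1000)]

print(geweke_z(stable_chain), geweke_z(drifting_chain))
```

A chain is usually treated as consistent with convergence when the absolute value of the statistic stays below 1.96 at the .05 level; the drifting chain above fails that check by a wide margin.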

Cheating Scenario

In the cheating scenario, the abilities of 1,000 examinees were drawn from the multivariate normal distribution for each generated data file. Examinees with item preknowledge were selected from those with low ability levels, and the proportion of fraudulent data was set at 5% of the 1,000 examinees. For the examinees with item preknowledge, the RTs of the compromised items of medium and high difficulty were changed. The responses of these examinees to the items whose RTs had been changed were then set to correct. In this study, fraudulent data were created with simulation code written by the researcher in the R software to observe the effect of the manipulated factors and their levels on the performance of the methods for detecting preknowledge cheating. The code is shared in an additional file.
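The manipulation described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' shared R code: the lowest-ability 5% of examinees are taken as the cheaters (the text says only that they come from the low-ability group), RTs are handled in seconds, and the function name and arguments are hypothetical.

```python
import random


def inject_preknowledge(responses, rts, theta, compromised, fixed_rt=None,
                        proportion=0.05, seed=1):
    """Return modified copies of responses and RTs plus the cheater indices.

    Assumption: cheaters are the examinees with the lowest simulated ability.
    If fixed_rt is None, RTs on compromised items are drawn from U(10, 15);
    otherwise they are fixed at fixed_rt seconds (e.g., 20 or 30).
    """
    rng = random.Random(seed)
    new_responses = [row[:] for row in responses]
    new_rts = [row[:] for row in rts]
    n_cheaters = round(proportion * len(theta))
    cheaters = sorted(range(len(theta)), key=lambda j: theta[j])[:n_cheaters]
    for j in cheaters:
        for i in compromised:
            new_responses[j][i] = 1  # compromised items are answered correctly
            new_rts[j][i] = rng.uniform(10.0, 15.0) if fixed_rt is None else float(fixed_rt)
    return new_responses, new_rts, cheaters


# Toy example: 20 examinees, 5 items, items 0 and 2 compromised, RTs fixed at 20 s.
theta = [float(j) for j in range(20)]
responses = [[0] * 5 for _ in range(20)]
rts = [[60.0] * 5 for _ in range(20)]
new_responses, new_rts, cheaters = inject_preknowledge(
    responses, rts, theta, compromised=[0, 2], fixed_rt=20
)
```

Returning copies leaves the clean dataset available for comparison with the manipulated one.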

Type I Error and Power Analysis

In this study, the person-fit statistics $L ({l_{s}}^{y})$ and $L (l^{t})$ , and KL were used to evaluate the fit of the RA and RT patterns. The KL measure is used to detect aberrant test behaviors (Belov & Armstrong, 2010; Belov et al., 2007; Belov, 2014). The KL measure is a nonparametric measure of divergence between two posterior distributions for examinees based on responses, and is given by:

KL = D (R ‖ S) = \int_{- \infty}^{+ \infty} R (θ_{j}) \log \frac{R (θ_{j})}{S (θ_{j})} d θ_{j}

Large values of the KL measure may indicate copying or item preknowledge (Kullback & Leibler, 1951). Therefore, the KL measure has been used to detect answer copying and preknowledge cheating when we have prior knowledge of the examinees’ abilities (Balta et al., 2019; Belov, 2013, 2014; Cizek & Wollack, 2017; Man et al., 2018; Ucar & Dogan, 2020).
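The integral defining the KL measure can be approximated numerically on a grid of ability values. The sketch below is an illustration only: it uses two normal densities as stand-ins for the posterior distributions R and S, which is an assumption made for the example, and the function names are hypothetical. It does not reproduce the computation in the R package used in the study.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution, used here as a stand-in posterior."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def kl_divergence(r_density, s_density, step):
    """Riemann-sum approximation of the integral of R * log(R / S)."""
    total = 0.0
    for r, s in zip(r_density, s_density):
        # Grid points where either density underflows to zero are skipped;
        # this is a simplification for the sketch.
        if r > 0.0 and s > 0.0:
            total += r * math.log(r / s) * step
    return total


step = 0.01
grid = [-10.0 + step * k for k in range(2001)]   # ability grid on [-10, 10]
r_post = [normal_pdf(x, 0.0, 1.0) for x in grid]  # stand-in for R
s_post = [normal_pdf(x, 1.0, 1.0) for x in grid]  # stand-in for S

kl_value = kl_divergence(r_post, s_post, step)
```

For two unit-variance normal densities whose means differ by one, the closed-form divergence is 0.5, which gives a convenient check on the numerical approximation. The further apart the two posteriors are, the larger the value.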

The person-fit statistic $L ({l_{s}}^{y})$ has the power to determine aberrant RA patterns (Dimitrov & Smith, 2006; Fox & Marianti, 2017; Karabatsos, 2003). Given the log-likelihood based on the 2PLM, the person-fit statistic $L ({l_{s}}^{y})$ is expressed as follows:

\begin{matrix} l^{y} (θ_{j}, a, b; y_{j}) = \log p (y_{j} | θ_{j}, a, b) = \sum_{i = 1}^{I} \log p (y_{i j} | θ_{j}, a_{i}, b_{i}) \\ = \sum_{i = 1}^{I} [y_{i j} \log P (y_{i j} = 1 | θ_{j}, a_{i}, b_{i}) \\ + (1 - y_{i j}) \log (1 - P (y_{i j} = 1 | θ_{j}, a_{i}, b_{i}))] \end{matrix}

where $y_{i j}$ denotes the RA of person j on item i. Whether the RA patterns show extreme values can be determined using a Bayesian significance test. As the likelihood of the item responses increases, the value of the person-fit statistic also increases. Increasing values of the negative person-fit statistic calculated from an examinee’s responses to the items correspond to misfit. Thus, the posterior probability of misfit increases with increasing values of the $L ({l_{s}}^{y})$ person-fit statistic. For the joint model, the relationships between person parameters must be considered in the calculation of the person-fit statistics. Hence, considering the full covariance structure of the item and person parameters, the classification variable ${F_{j}}^{y}$ was defined as follows:

{F_{j}}^{y} = \left\{ \begin{matrix} 1, & P ({l_{s}}^{y} (θ_{j}, a, b; y_{j}) > C) \\ 0, & P ({l_{s}}^{y} (θ_{j}, a, b; y_{j}) \leq C) \end{matrix} \right.

The value of ${F_{j}}^{y}$ is calculated at each MCMC iteration and can be used to estimate the posterior probability of an aberrant RA pattern. This method allows the identification of aberrant RA patterns by calculating person-fit statistics with a posterior probability greater than the given critical value, C (Fox & Marianti, 2017).
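The log-likelihood of an RA pattern and a standardized version of it can be computed as in the sketch below. Two points are assumptions: the standardization uses the familiar l_z form (the log-likelihood minus its expectation, divided by its standard deviation), which may differ in detail from how ${l_{s}}^{y}$ is computed in the LNIRT package, and the function names and example values are hypothetical.

```python
import math


def p_correct_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def log_likelihood(y, theta, a, b):
    """Log-likelihood of a 0/1 response pattern y for one examinee."""
    total = 0.0
    for y_i, a_i, b_i in zip(y, a, b):
        p = p_correct_2pl(theta, a_i, b_i)
        total += y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p)
    return total


def standardized_log_likelihood(y, theta, a, b):
    """(l - E[l]) / sd(l), the usual l_z standardization (an assumption here)."""
    expected, variance = 0.0, 0.0
    for a_i, b_i in zip(a, b):
        p = p_correct_2pl(theta, a_i, b_i)
        expected += p * math.log(p) + (1.0 - p) * math.log(1.0 - p)
        variance += p * (1.0 - p) * math.log(p / (1.0 - p)) ** 2
    return (log_likelihood(y, theta, a, b) - expected) / math.sqrt(variance)


a = [1.0, 1.0, 1.0, 1.0]
b = [-2.0, -1.0, 1.0, 2.0]
consistent = [1, 1, 0, 0]  # easy items right, hard items wrong
reversed_pattern = [0, 0, 1, 1]  # easy items wrong, hard items right

lz_consistent = standardized_log_likelihood(consistent, 0.0, a, b)
lz_reversed = standardized_log_likelihood(reversed_pattern, 0.0, a, b)
```

With this sign convention, a pattern that fits the model gives a positive value and an unlikely pattern gives a negative one, so the negative of the statistic grows with misfit, in line with the description above.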

A person-fit statistic based on the likelihood of RT patterns, derived from a model for the logarithm of RTs, was introduced by Marianti et al. (2014). For the RT pattern ${t_{j}}^{*} = ({t_{1 j}}^{*}, \dots, {t_{I j}}^{*})$ , the statistic is given by

\begin{matrix} {l_{0}}^{t} (τ_{j}, β, α, σ^{2}; {t_{j}}^{*}) = - 2 \log p ({t_{j}}^{*} | τ_{j}, β, α, σ^{2}) \\ = - 2 \sum_{i = 1}^{I} \log p ({t_{i j}}^{*} | τ_{j}, β_{i}, α_{i}, {σ_{i}}^{2}) \\ = \sum_{i = 1}^{I} ({(\frac{{t_{i j}}^{*} - μ_{i j}}{σ_{i}})}^{2} + \log (2 π {σ_{i}}^{2})) \\ = \sum_{i = 1}^{I} ({Z_{i j}}^{2} + \log (2 π {σ_{i}}^{2})) = \sum_{i = 1}^{I} {l_{0 i}}^{t} \end{matrix}

where ${t_{i j}}^{*}$ denotes the logarithm of the RT of person j on item i and $Z_{i j}$ is the standardized error of the normally distributed logarithm of the RT. The posterior probability can be used to evaluate how extreme the pattern of RTs is under the model, that is, whether the value of the standardized version of ${l_{0}}^{t}$ , say ${l_{z}}^{t} ({T_{j}}^{*})$ , exceeds a certain threshold C under its null distribution. The observed value of the statistic is compared with the critical value C obtained from the chi-square distribution as follows:

P (l^{t} ({T_{j}}^{*}) > C) = P ({χ_{I}}^{2} > C) = α,

where $α$ is the significance level. If the observed value of the statistic is greater than the critical value C, the RT pattern is considered extreme. The classification variable ${F_{j}}^{t}$ can be used to calculate the posterior probability. Thus,

{F_{j}}^{t} = \left\{ \begin{matrix} 1, & P (l^{t} ({T_{j}}^{*}) > l^{t} ({t_{p}}^{*})) < α \\ 0, & P (l^{t} ({T_{j}}^{*}) > l^{t} ({t_{p}}^{*})) \geq α \end{matrix} \right.

The posterior probability that the RT pattern of test taker j will be flagged is given by

\begin{matrix} P ({F_{j}}^{t} = 1 | {t_{j}}^{*}) = \int_{β} \int_{τ_{j}} I ({F_{j}}^{t} = 1 | {t_{j}}^{*}, τ_{j}, β) \, p (τ_{j}, β) \, d τ_{j} \, d β \\ \approx \frac{1}{M} \sum_{m = 1}^{M} I ({F_{j}}^{t (m)} = 1 | {τ_{j}}^{(m)}, β^{(m)}) \end{matrix}

where, in MCMC iteration m, ${F_{j}}^{t (m)} = 1$ when $P (χ^{2} > l^{t} ({t_{j}}^{*}) | {τ_{j}}^{(m)}, β^{(m)}) < α$ . Therefore, the probability that an RT pattern will be flagged as an outlier is evaluated at each iteration (Fox & Marianti, 2017; Man et al., 2018; Marianti et al., 2014).
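The flagging rule for RT patterns can be sketched as a Monte Carlo average over posterior draws. The code below is an illustration, not the LNIRT implementation. It assumes that $l^{t}$ is the sum of squared standardized log-RT residuals with a chi-square reference distribution on I degrees of freedom, and that each posterior draw supplies the model-implied means and standard deviations of the log-RTs. All names are hypothetical, and the chi-square tail probability is computed from closed-form series for integer degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0.0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r < df/2} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= half / r
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail plus a finite series in sqrt(x).
    root = math.sqrt(x)
    density = math.exp(-half) / math.sqrt(2.0 * math.pi)
    term, series = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        series += term
    return math.erfc(root / math.sqrt(2.0)) + 2.0 * density * series


def lt_statistic(log_rts, mu, sigma):
    """Sum of squared standardized residuals of the log-RTs."""
    return sum(((t - m) / s) ** 2 for t, m, s in zip(log_rts, mu, sigma))


def flag_probability(log_rts, draws, alpha=0.05):
    """Share of posterior draws in which the RT pattern is flagged.

    Each draw is a (mu, sigma) pair of per-item lists implied by the
    parameter values sampled in that MCMC iteration.
    """
    df = len(log_rts)
    flags = [
        1 if chi2_sf(lt_statistic(log_rts, mu, sigma), df) < alpha else 0
        for mu, sigma in draws
    ]
    return sum(flags) / len(flags)


# Toy example: 15 items, one draw that fits the pattern and one that does not.
pattern = [1.0] * 15
fitting_draw = ([1.0] * 15, [0.5] * 15)
misfitting_draw = ([3.5] * 15, [0.5] * 15)
prob = flag_probability(pattern, [fitting_draw, misfitting_draw])
```

In practice the draws would come from the sampler's saved iterations, and an examinee would be flagged when the resulting posterior probability is large (for example, at least 0.95, as in the analysis described later in this section).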

To obtain the values of the KL measure and the Bayesian posterior probabilities of the person-fit statistics $L ({l_{s}}^{y})$ and $L (l^{t})$ for the RA and RT patterns in each generated dataset, the examinees' abilities were re-estimated with the "LNIRT" package (Fox et al., 2019) in the R software, using the item response matrices (1-0) obtained after manipulating the responses and RTs of the examinees with item preknowledge. For all examinees whose abilities were re-estimated, the values of the KL measure were calculated with the “LaplacesDemon” package (Singmann, 2020), and the Bayesian posterior probabilities of the person-fit statistics for the RA and RT patterns were calculated with the “LNIRT” package (Fox et al., 2019) in the R software.

We obtained the KL measure values for the examinees and used a cutoff score to identify whether they had item preknowledge. The receiver operating characteristic (ROC) curve method was used to determine the cutoff points for the KL measure values. Previous work reported that the KL method detected copiers more effectively when cutoff scores were set by evaluating the ROC classification rule with the minimum p-value approach, while the Youden index offered a balanced approach (Krzanowski & Hand, 2009; Ucar & Dogan, 2021). Thus, the cutoff scores for the KL measure values were calculated using the Youden index. To obtain cut scores at the α = .05 Bayesian significance level using the Youden index, we employed the “OptimalCutpoints” package (Raton-Lopez et al., 2014) in the R software.
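The Youden index selects the cutoff that maximizes sensitivity plus specificity minus one. The sketch below shows the idea on a toy example; it is not the OptimalCutpoints implementation, the decision rule (flag when the score is at or above the cutoff) is an assumption, and the names and data are hypothetical.

```python
def youden_cutoff(scores, labels):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1.

    labels: 1 for examinees known to have preknowledge, 0 otherwise.
    An examinee is flagged when the score is at or above the cutoff.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_cut, best_j = None, -1.0
    for cut in sorted(set(scores)):
        true_pos = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cut)
        false_pos = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cut)
        # sensitivity - (1 - specificity) equals sensitivity + specificity - 1
        j = true_pos / n_pos - false_pos / n_neg
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j


# Toy KL values: the two simulated cheaters have the largest values.
kl_scores = [0.1, 0.2, 0.3, 0.8, 0.9]
has_preknowledge = [0, 0, 0, 1, 1]
cutoff, j_value = youden_cutoff(kl_scores, has_preknowledge)
```

In a simulation the labels are known by construction, which is what makes this kind of cutoff search possible; with real data the labels would have to come from another source.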

To obtain the Type I error rate mean values of the KL measure for RA, the posterior probability values obtained from the KL measure function and the person-fit statistic $L ({l_{s}}^{y})$ for RA were calculated for examinees who had item preknowledge in each iteration, repeated with the item response data. A 1–0 matrix was formed by assigning “1” to the posterior probabilities calculated for the item responses that were higher than the cutoff score (examinees who were known not to cheat but were mistakenly judged to have “cheated”) and “0” to the others. To obtain the mean power rate of the KL measure for RA, the ratio of compromised items was also manipulated, and a 1–0 matrix was created by assigning “1” to the posterior probabilities that were higher than the cutoff score (a correct “cheating” decision for examinees known to cheat) and “0” to the others. The ratio of the sum of the “1” values in the matrix to the number of examinees known to cheat was calculated. In this way, the probability that the “cheating” decision was correct for examinees known to cheat was obtained. To determine the Type I error rate mean values of the person-fit statistic $L ({l_{s}}^{y})$ for RA patterns, a 1–0 matrix was formed by assigning “1” to the posterior probabilities calculated for the item responses that were greater than or equal to 0.95 and “0” to the others. The ratio of the sum of the “1” values in the matrix to the number of examinees known to cheat was calculated. Relevant studies (Fox & Marianti, 2017; Marianti et al., 2014) were consulted in setting the α = .05 Bayesian significance level. 
To obtain the mean power rate of the person-fit statistic $L ({l_{s}}^{y})$ for RA, the ratio of compromised items was also manipulated, and a 1–0 matrix was created by assigning “1” to the posterior probabilities calculated for the item responses that were greater than or equal to 0.95 (a correct “cheating” decision for examinees known to cheat) and “0” to the others. The ratio of the sum of the “1” values in the matrix to the number of examinees known to cheat was then calculated. Similar analyses were repeated to obtain the Type I error rate mean values and mean power rates of the KL divergence measure and the person-fit statistic $L (l^{t})$ for the RT patterns.
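The flagging and rate calculations described above can be sketched as follows. This is a minimal Python illustration rather than the study's R code; the function names `flag` and `detection_rate` and the example probabilities are hypothetical. It assumes that the Type I error rate is taken over examinees without preknowledge and that power is taken over examinees with preknowledge.

```python
def flag(posterior_probs, cutoff=0.95, inclusive=True):
    """Turn posterior probabilities into a 1-0 list.

    inclusive=True flags p >= cutoff (the rule described for the L statistics);
    inclusive=False flags p > cutoff (the rule described for the KL cutoff score).
    """
    if inclusive:
        return [1 if p >= cutoff else 0 for p in posterior_probs]
    return [1 if p > cutoff else 0 for p in posterior_probs]


def detection_rate(flags):
    """Share of examinees in a group who were flagged."""
    if not flags:
        raise ValueError("empty group")
    return sum(flags) / len(flags)


# Hypothetical posterior probabilities for one replication.
no_preknowledge = [0.10, 0.97, 0.40, 0.62]  # any flag here is a false positive
preknowledge = [0.99, 0.95, 0.30, 0.96]     # any flag here is a correct detection

type_i_error = detection_rate(flag(no_preknowledge))
power = detection_rate(flag(preknowledge))
```

Averaging these per-replication rates over the replications of a condition would give mean values of the kind reported in the tables of the Findings section.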

Findings

Type I Error Rates of Methods

The main effects of the test length and the difficulty level of the compromised items on the Type I error rate mean values of the methods are shown in tables. Table 2 shows the Type I error rate mean values of the KL measure and person-fit statistics $L$ for the RA and RT patterns by condition.

Table 2.

Type I Error Rate Mean Values of KL Measure and Person Fit Statistics $L$ for RA and RT Patterns by Conditions.

Methods	Test length	Difficulty level of compromised items	Type I error rate mean values (for RA patterns)	Type I error rate mean values (for RT patterns)
KL measure	15	High	0.378	0.160
	15	Medium	0.364	0.167
	50	High	0.387	0.157
	50	Medium	0.366	0.161
$L$ person fit statistic	15	High	0.008	0.024
	15	Medium	0.006	0.026
	50	High	0.002	0.223
	50	Medium	0.004	0.029

The joint effects of the test length and the difficulty level of the compromised items on the Type I error rate mean values of the methods are shown in graphs. Figure 1 shows the joint effects of the conditions on the Type I error rate mean values of the KL measure and the person-fit statistic $L ({l_{s}}^{y})$ for RA patterns.

Figure 1.

Joint effects of the conditions on the Type I error rate mean values of the KL measure and the person-fit statistic $L ({l_{s}}^{y})$ for RA patterns.

Figure 2 shows the joint effects of the conditions on the Type I error rate mean values of the KL measure and the person-fit statistic $L (l^{t})$ for RT patterns.

Figure 2.

Joint effects of the conditions on the Type I error rate mean values of the KL measure and the person-fit statistic $L (l^{t})$ for RT patterns.

The KL measure had a higher Type I error rate mean value than the person-fit statistic $L ({l_{s}}^{y})$ for RA patterns under all conditions. In addition, the KL measure showed its highest Type I error rate mean value for the medium test length and difficult items. Although the Type I error rate mean values of the person-fit statistic $L ({l_{s}}^{y})$ for RA were close to 0 under all conditions, the highest value was observed for the short test length and difficult items. Thus, both the KL measure and the $L$ person-fit statistic had higher Type I error rate mean values in detecting aberrant RA patterns when the compromised items were difficult. The Type I error rate mean values of the person-fit statistic $L (l^{t})$ for RT patterns were close to 0 in all conditions except the medium test length with difficult items; in the remaining conditions, they were lower than those of the KL measure.

Power Rates of Methods

Tables 3 and 4 show the power rate mean values of the KL measure and person-fit statistics $L$ for the RA and RT patterns by condition.

Table 3.

Power Rate Mean Values of KL Measure and Person Fit Statistics $L ({l_{s}}^{y})$ for RA Patterns by Conditions.

Methods	Test length	Difficulty level of compromised items	Ratio of compromised items	RTs of compromised items: U(10,15)	RTs of compromised items: fixed at 20 s	RTs of compromised items: fixed at 30 s
KL measure	15	High	20%	0.659	0.692	0.696
			40%	0.691	0.682	0.688
			60%	0.703	0.698	0.682
		Medium	20%	0.758	0.841	0.852
			40%	0.872	0.899	0.902
			60%	0.931	0.932	0.938
	50	High	20%	0.752	0.756	0.784
			40%	0.757	0.758	0.763
			60%	0.812	0.762	0.761
		Medium	20%	0.869	0.885	0.887
			40%	0.940	0.947	0.948
			60%	0.948	0.952	0.954
$L ({l_{s}}^{y})$ person fit statistic	15	High	20%	0.114	0.121	0.091
			40%	0.132	0.134	0.118
			60%	0.134	0.101	0.107
		Medium	20%	0.110	0.112	0.103
			40%	0.288	0.253	0.308
			60%	0.294	0.224	0.198
	50	High	20%	0.279	0.279	0.246
			40%	0.364	0.322	0.392
			60%	0.304	0.312	0.374
		Medium	20%	0.384	0.293	0.394
			40%	0.564	0.651	0.662
			60%	0.425	0.545	0.556

Table 4.

Power Rate Mean Values of KL Measure and Person Fit Statistics $L (l^{t})$ for RT Patterns by Conditions.

Methods	Test length	Difficulty level of compromised items	Ratio of compromised items	RTs of compromised items: U(10,15)	RTs of compromised items: fixed at 20 s	RTs of compromised items: fixed at 30 s
KL measure	15	High	20%	0.601	0.529	0.512
			40%	0.556	0.636	0.539
			60%	0.547	0.658	0.615
		Medium	20%	0.762	0.763	0.827
			40%	0.761	0.813	0.822
			60%	0.731	0.833	0.791
	50	High	20%	0.547	0.577	0.638
			40%	0.543	0.573	0.578
			60%	0.636	0.564	0.565
		Medium	20%	0.739	0.782	0.818
			40%	0.654	0.757	0.801
			60%	0.746	0.823	0.762
$L (l^{t})$ person fit statistic	15	High	20%	0.071	0.012	0.002
			40%	0.078	0.031	0.003
			60%	0.063	0.021	0.001
		Medium	20%	0.219	0.019	0.007
			40%	0.331	0.017	0.005
			60%	0.263	0.002	0.001
	50	High	20%	0.152	0.019	0.003
			40%	0.178	0.001	0.000
			60%	0.158	0.001	0.000
		Medium	20%	0.536	0.009	0.002
			40%	0.684	0.002	0.000
			60%	0.569	0.001	0.000

The joint effects of the test length, the difficulty level of the compromised items, the ratio of compromised items, and the changes in RTs of the compromised items on the power rate mean values of the methods are shown in the following figures. Figure 3 shows the joint effects of the conditions on the power rate mean values of the KL measure and person-fit statistics for RA patterns. Figure 4 shows the joint effects of the conditions on the power rate mean values of the KL measure and person-fit statistics for RT patterns.

Figure 3.

Joint effects of the conditions on the power rate values of the KL measure and the person-fit statistic $L ({l_{s}}^{y})$ for RA patterns.

Figure 4.

Joint effects of the conditions on the power rate values of the KL measure and the person-fit statistic $L (l^{t})$ for RT patterns.

The KL measure had higher power rate values than the person-fit statistics $L ({l_{s}}^{y})$ and $L (l^{t})$ for the RA and RT patterns under all conditions. At all levels of RT variation of the compromised items, it showed its highest power rate values for the medium test length, medium-difficulty items, and a high ratio of compromised items. For RA patterns, fixing the RTs of the compromised items at 30 s yielded a higher power rate than the other RT variation levels, whereas for RT patterns, fixing them at 20 s yielded the highest power rate mean value by a small margin. Thus, the power rate of the KL measure for the RA and RT patterns increased with the test length and the ratio of compromised items and decreased as the difficulty level of the compromised items increased. In addition, fixing the RTs of the compromised items at 20 s or at longer values tended to increase the mean power rate values.
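As a rough check on the RT-variation comparison, the KL rows of Tables 3 and 4 can be averaged over the other factors. The sketch below is not part of the original analysis; the unweighted averaging across conditions and the names `KL_RA`, `KL_RT`, `marginal_means`, and `best_level` are assumptions made for illustration.

```python
from statistics import mean

RT_LEVELS = ("U(10,15)", "fixed 20 s", "fixed 30 s")

# KL measure rows copied from Table 3 (RA patterns); columns follow RT_LEVELS.
KL_RA = [
    (0.659, 0.692, 0.696), (0.691, 0.682, 0.688), (0.703, 0.698, 0.682),
    (0.758, 0.841, 0.852), (0.872, 0.899, 0.902), (0.931, 0.932, 0.938),
    (0.752, 0.756, 0.784), (0.757, 0.758, 0.763), (0.812, 0.762, 0.761),
    (0.869, 0.885, 0.887), (0.940, 0.947, 0.948), (0.948, 0.952, 0.954),
]

# KL measure rows copied from Table 4 (RT patterns); columns follow RT_LEVELS.
KL_RT = [
    (0.601, 0.529, 0.512), (0.556, 0.636, 0.539), (0.547, 0.658, 0.615),
    (0.762, 0.763, 0.827), (0.761, 0.813, 0.822), (0.731, 0.833, 0.791),
    (0.547, 0.577, 0.638), (0.543, 0.573, 0.578), (0.636, 0.564, 0.565),
    (0.739, 0.782, 0.818), (0.654, 0.757, 0.801), (0.746, 0.823, 0.762),
]


def marginal_means(rows):
    """Average each RT-variation column over all other conditions."""
    return {level: mean(col) for level, col in zip(RT_LEVELS, zip(*rows))}


def best_level(rows):
    """RT-variation level with the largest marginal mean."""
    means = marginal_means(rows)
    return max(means, key=means.get)
```

Under these assumptions, `best_level(KL_RA)` selects the 30 s condition and `best_level(KL_RT)` selects the 20 s condition, the latter by a small margin over 30 s.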

Although the power rate mean values of the person-fit statistic $L ({l_{s}}^{y})$ were low under all conditions, the highest values were observed for the medium test length, medium-difficulty items, and a moderate ratio of compromised items when the RTs of the compromised items were fixed at 20 or 30 s. The power rate mean values were slightly higher when the RTs of the compromised items were fixed at 30 s than when they were fixed at 20 s. The power rate mean values of the person-fit statistic $L (l^{t})$ for RT patterns were low in all conditions except the one with the medium test length, a moderate ratio of compromised items, medium-difficulty compromised items, and RTs of the compromised items drawn from a uniform distribution between 10 and 15 s. In all conditions, $L (l^{t})$ showed higher power rate mean values when the RTs of the compromised items were drawn from the uniform distribution between 10 and 15 s. Thus, although the power rate mean values of the person-fit statistics $L ({l_{s}}^{y})$ and $L (l^{t})$ increased as the test length increased and the difficulty level of the compromised items decreased, they tended to decrease when the ratio of compromised items was high. Additionally, fixing the RTs of the compromised items at longer values tended to increase the power rate mean values of the person-fit statistic $L ({l_{s}}^{y})$ for RA patterns, whereas manipulating the RTs of the compromised items toward shorter values tended to increase the power rate mean values of $L (l^{t})$ for RT patterns.

Discussion and Conclusions

Large-scale, high-stakes exams play an important role in decisions about examinees, and with the need to administer exams online in distance education, computer-based measurement and evaluation are gaining momentum. In computer-based testing, which is widely used to provide more evidence for the validity and reliability of test scores, and in fraud-detection studies conducted to ensure test security, it is important to obtain detailed information beyond the examinees’ responses, such as the moment at which an item appears on the screen and the time taken to solve it. The hierarchical model for responses and RTs (H-IRTRT) is becoming more common owing to the increase in computer-based testing in educational and psychological assessment. In computer-based testing, it is recommended that responses be used together with RTs to identify aberrant response behaviors, because RTs provide information about working speed and time intensity, and responses provide information about examinees’ abilities.

This study applied the H-IRTRT to detect aberrant testing behavior by simulating preknowledge cheating. Person-fit statistics and divergence measures have been developed to identify aberrant RA and RT patterns using the H-IRTRT. In this study, the power rate mean value of the KL measure for the RA and RT patterns increased with the test length and the ratio of compromised items and decreased as the difficulty level of the compromised items increased. Fixing the RTs of the compromised items at 20 s increased the power rate mean values of the KL measure for the RT pattern, and fixing them at longer values increased the power rate mean values of the KL measure for the RA pattern. In addition, the KL measure showed its highest Type I error rate mean value for the medium test length and difficult items. This indicates that the KL distance, which is calculated as the difference between the prior and posterior probability distributions of the examinees' ability parameters, is highly sensitive to differences arising from ability parameter estimation, and that the cutoff point method used in identifying aberrant RA and RT patterns is quite conservative because of the uncertainty of the distribution under the null hypothesis. Because the sampling distribution of the KL measure under the null hypothesis is unknown, the cut score used in identifying aberrant RA and RT patterns was calculated with the Youden Index, one of the ROC analysis methods; this is one of the limitations of this study. As Uçar and Doğan (2021) stated, the KL measure shows a high Type I error rate mean value because the Youden Index, which is used to determine the cut-off score, tends to optimize the classification by balancing sensitivity and specificity. In addition, the KL measure for RA and RT patterns performed best when the RTs of compromised items of high and medium difficulty were fixed at 20 s. 
Further, because Type I error rate control could not be achieved under all conditions with a sample size of 1,000, the use of the KL measure is not recommended for sample sizes below 1,000. Likewise, using the KL measure to identify aberrant RA and RT patterns is not advisable when the RTs of compromised items are less than 20 s. Moreover, simulation studies are recommended to determine the performance of the KL measure for RA and RT patterns in detecting answer copying under various conditions by creating copier and source pairs under conditions in which the Type I error rate is controlled. Because approaches such as the KL measure may produce false positives when identifying examinees with preknowledge, multidimensional studies are also recommended, especially in large-scale and high-stakes computer-based test applications, that evaluate detailed information obtained with dedicated software (e.g., recordings of examinees’ eye movements and posture) in addition to such approaches. In light of the findings of this study, the conditions under which the Type I error rate is controlled could be investigated by calculating the threshold value of the KL distance with different cut-off points when detecting preknowledge cheating from responses and RTs.
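The KL distance referred to above, described as the difference between the prior and posterior distributions of an examinee's ability, can be illustrated numerically. The sketch below is an illustration under stated assumptions, not the authors' implementation: it assumes normal shapes for both distributions, evaluates them on a fixed ability grid, and computes the divergence in the direction KL(posterior || prior). The function names and the example parameters are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and SD sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def discretize(mu, sigma, lo=-8.0, hi=8.0, step=0.01):
    """Normal density evaluated on an ability grid and normalized to sum to 1."""
    n = int(round((hi - lo) / step)) + 1
    dens = [normal_pdf(lo + i * step, mu, sigma) for i in range(n)]
    total = sum(dens)
    return [d / total for d in dens]


def kl_divergence(p, q):
    """Discrete KL(p || q) = sum_i p_i * log(p_i / q_i); terms with p_i = 0 add 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical example: standard normal prior, shifted and sharper posterior.
prior = discretize(0.0, 1.0)
posterior = discretize(0.5, 0.5)
kl_value = kl_divergence(posterior, prior)

# Closed form for two normal distributions, used only as a cross-check.
closed_form = math.log(1.0 / 0.5) + (0.5 ** 2 + 0.5 ** 2) / (2.0 * 1.0 ** 2) - 0.5
```

A larger value means that the data moved the ability posterior further from the prior; how large is "too large" still depends on the cutoff rule, which is the point of the Type I error discussion above.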

The power rate mean values of the person-fit statistics $L ({l_{s}}^{y})$ and $L (l^{t})$ for the RA and RT patterns increased with the test length and decreased as the difficulty level and the ratio of the compromised items increased. It was concluded that manipulating the RTs of the compromised items toward shorter values increased the power rate mean values of the person-fit statistics $L ({l_{s}}^{y})$ and $L (l^{t})$ for the RA and RT patterns. Overall, the performance of the person-fit statistics $L ({l_{s}}^{y})$ and $L (l^{t})$, which are based on the likelihood of item responses and RTs, was low for RA and RT patterns, but they controlled the Type I error rate in all conditions. Thus, this study, which was carried out as a simulation, can be repeated with real data and at different alpha levels to increase the performance of the person-fit statistics $L$. In addition, the sample size and the proportion of examinees with preknowledge, which were treated as fixed variables here, can be varied to reflect the conditions of large-scale exam applications. Furthermore, simulation studies should be conducted to improve the performance of the person-fit statistics $L ({l_{s}}^{y})$ and $L (l^{t})$ for the RA and RT patterns. This will contribute to obtaining more evidence that aberrant response behaviors have occurred.

This simulation study showed that the KL measure has higher power and Type I error rate mean values than the person-fit statistics $L ({l_{s}}^{y})$ and $L (l^{t})$ for RA and RT patterns under all conditions. These findings are similar to those of Man et al. (2018) and Fox and Marianti (2017). The limitations of this research are the modeling of item responses with the 3PLM and of RTs with the Bayesian log-normal RT model, the methods used to detect preknowledge cheating, and the selected simulation conditions and their levels. Thus, further simulation studies can be conducted under real data conditions in which item responses are modeled with a Nominal Response Model (NRM) and RTs with a log-normal model to detect aberrant RA and RT patterns. Moreover, only preknowledge cheating behavior was considered in this study; a more comprehensive study is needed to investigate the performance of the KL measure and the $L$ person-fit statistic for joint models in detecting other aberrant response behaviors (such as random guessing and answer copying). Combining RTs with responses increases the ability to detect aberrant response behaviors (van der Linden & Guo, 2008). Different methods, such as standardized residuals, mixture models, and data mining methods, have been suggested to detect aberrant testing behavior using both item scores and RTs (Sinharay & Johnson, 2020; van der Linden et al., 2010; Wang et al., 2018; Zhan et al., 2018). The joint hierarchical model of van der Linden (2007) was used in the current study. Simulation studies on the use of other models for RA and RT patterns with person-fit tests and the divergence measure approach would be useful. In addition, the performance of the methods for detecting aberrant testing behaviors using various models can be investigated in computerized adaptive testing. 
In practice, it is important for testing companies to detect test fraud because aberrant testing behavior in computerized and adaptive test applications damages test validity. In addition to the proposed approaches for detecting test fraud, using examinees’ biometric information (e.g., accessing examinees’ cameras and tracking their eye movements and posture) will increase confidence in the scores obtained from the evaluations.

Supplemental Material

sj-txt-1-sgo-10.1177_21582440241297946 – Supplemental material for Investigation of Preknowledge Cheating via Joint Hierarchical Modeling Patterns of Response Accuracy and Response Time

Supplemental material, sj-txt-1-sgo-10.1177_21582440241297946 for Investigation of Preknowledge Cheating via Joint Hierarchical Modeling Patterns of Response Accuracy and Response Time by Ebru Balta and Celal Deha Dogan in SAGE Open

Supplemental Material

sj-txt-2-sgo-10.1177_21582440241297946 – Supplemental material for Investigation of Preknowledge Cheating via Joint Hierarchical Modeling Patterns of Response Accuracy and Response Time

Supplemental material, sj-txt-2-sgo-10.1177_21582440241297946 for Investigation of Preknowledge Cheating via Joint Hierarchical Modeling Patterns of Response Accuracy and Response Time by Ebru Balta and Celal Deha Dogan in SAGE Open

Footnotes

Acknowledgements

We would like to thank Editage for English language editing.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Ebru Balta

Data Availability Statement

All relevant data that support the findings of this study are available from the corresponding author upon reasonable request.

Supplemental material

Supplemental material for this article is available online.

References

Altuner (2019). Investigation of the relationship between item statistics and item response time [Unpublished master's dissertation]. Mersin University. https://tez.yok.gov.tr/UlusalTezMerkezi/tarama.jsp

Balta, Uçar, & Sahin (2019). Detection of cheating behavior in online unproctored CATs via a validation test [Oral presentation]. The 2019 International Association for Computerized Adaptive Testing (IACAT) Conference, Minneapolis, MN, USA.

Belov, D. I. (2013). Detection of test collusion via Kullback–Leibler divergence. Journal of Educational Measurement, 50(2), 141–163. https://doi.org/10.1111/jedm.12008

Belov, D. I. (2014). Detecting item preknowledge in computerized adaptive testing using information theory and combinatorial optimization. Journal of Computerized Adaptive Testing, 2(3), 37–58. https://doi.org/10.7333/1410-0203037

Belov, D. I. (2015). Robust detection of examinees with aberrant answer changes. Journal of Educational Measurement, 52(4), 437–456. https://www.jstor.org/stable/43940632. https://doi.org/10.1111/jedm.12094

Belov, D. I. (2016). Comparing the performance of eight item preknowledge detection statistics. Applied Psychological Measurement, 40(2), 83–97. https://doi.org/10.1177/0146621615603327

Belov, D. I., & Armstrong, R. D. (2010). Automatic detection of answer copying via Kullback–Leibler divergence and K-index. Applied Psychological Measurement, 34(6), 379–392. https://doi.org/10.1177/0146621610370453

Belov, Pashley, Lewis, & Armstrong (2007). Detecting aberrant responses with Kullback–Leibler distance. In Shigemasu, Okada, Imaizumi, & Hoshino (Eds.), New trends in psychometrics (pp. 7–14). Universal Academy Press.

Bolsinova, De Boeck, & Tijmstra (2017). Modelling conditional dependence between response time and accuracy. Psychometrika, 82(4), 1126–1148. https://doi.org/10.1007/s11336-016-9537-6

Boughton, Smith, & Ren (2017). Using response time data to detect compromised items and/or people. In G. J. Cizek & J. A. Wollack (Eds.), Handbook of detecting cheating on tests (pp. 177–190). Routledge.

Cizek, & Wollack (2017). Identification of item preknowledge by the methods of information theory and combinatorial optimization. In Cizek & Wollack (Eds.), Handbook of quantitative methods for detecting cheating on tests (pp. 217–233). Routledge.

De Boeck, Chen, & Davison (2017). Spontaneous and imposed speed of cognitive test responses. British Journal of Mathematical and Statistical Psychology, 70(2), 225–237. https://doi.org/10.1111/bmsp.12094

Dimitrov, D. M., & Smith, R. M. (2006). Adjusted Rasch person-fit statistics. Journal of Applied Measurement, 7(2), 170–183. https://www.researchgate.net/publication/7147714

Dinno (2018). “paran: Horn's test of principal components/factors.” R package version 1.5.2. https://cran.r-project.org/web/packages/paran/paran.pdf

Eckerly, Smith, & Lee (2018). An introduction to item preknowledge detection with real data applications [Oral presentation] Conference on Test Security, Park City, UT, USA, (Vols. 10-12) p. 2018.

Fox, J.-P., Klein Entink, & van der Linden (2007). Modeling of responses and response times with the package CIRT. Journal of Statistical Software, 20(7), 1–14. https://www.jstatsoft.org/index

Fox, J.-P., Klotzke, & Klein Entink, R. H. (2019). ‘LNIRT: Lognormal response time item response theory models.’ R package version 0.4.0. https://cran.r-project.org/web/packages/LNIRT/index.html

Fox, J.-P., & Marianti (2016). Joint modeling of ability and differential speed using responses and response times. Multivariate Behavioral Research, 51(4), 540–553. https://doi.org/10.1080/00273171.2016.1171128

Fox, J.-P., & Marianti (2017). Person-fit statistics for joint models for accuracy and speed. Journal of Educational Measurement, 54(2), 243–262. https://doi.org/10.1111/jedm.12143

Gorney, & Wollack, J. A. (2022). Two new models for item preknowledge. Applied Psychological Measurement, 46(6), 447–461. https://doi.org/10.1177/01466216221108130

Guo, Rios, J. A., Haberman, Liu, O. L., Wang, & Paek (2016). A new procedure for detection of students’ rapid guessing responses using response time. Applied Measurement in Education, 29(3), 173–183. https://doi.org/10.1080/08957347.2016.1171766

Hosmer, D. W., & Lemeshow (2013). Applied logistic regression. Hoboken.

Karabatsos (2003). Comparing the aberrant response detection performance of thirty-six person-fit statistics. Applied Measurement in Education, 16(4), 277–298. https://doi.org/10.1207/S15324818AME1604_2

Kasli, & Zopluoglu (2018, October 10–12). Do people with item pre-knowledge really respond faster to items they had prior access? An empirical investigation [Oral presentation]. The 2018 Conference on Test Security, Park City, UT, USA.

Kasli, Zopluoglu, & Toton, S. L. (2023). A deterministic gated lognormal response time model to identify examinees with item preknowledge. Journal of Educational Measurement, 60(1), 148–169. https://doi.org/10.1111/jedm.12340

Klein Entink, R. H., Fox, J.-P., & van der Linden, W. J. (2009). A multivariate multilevel approach to the modeling of accuracy and speed of test takers. Psychometrika, 74(1), 21–48. https://link.springer.com/article/10.1007/s11336-008-9075-y. https://doi.org/10.1007/s11336-008-9075-y

Krzanowski, W. J., & Hand, D. J. (2009). ROC curves for continuous data. Chapman & Hall/CRC Press.

Kullback, & Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22(1), 79–86. https://www.jstor.org/stable/2236703

Lee, S. Y. (2018). A mixture model approach to detect examinees with item preknowledge [Unpublished doctoral dissertation, The University of Wisconsin-Madison]. ProQuest Dissertations Publishing. https://www.proquest.com/openview/6e21b5081c4f66e866ce6ca5b5572b9e/1?pq-origsite=gscholar&cbl=18750

Lee, S. Y., & Wollack (2017). A mixture model to detect item preknowledge using item responses and response times [Oral presentation]. Conference on Test Security, Madison, WI, USA, (Vols. 6-7).

Lee, Y.-H., & Jia (2014). Using response time to investigate students’ test-taking behaviors in a NAEP computer-based study. Large-Scale Assessments in Education, 2(8), 2–24. http://www.largescaleassessmentsineducation.com/content/2/1/8

Lee, S. Y., & Wollack (2020). Concurrent use of response time and response accuracy for detecting examinees with item preknowledge. In Feinberg & Margolis (Eds.), Integrating timing considerations to improve testing practices (pp. 165–175). Routledge.

Levine, M. V., & Drasgow (1988). Optimal appropriateness measurement. Psychometrika, 53(2), 161–176. https://doi.org/10.1007/BF02294130

Wang, Zhang, & Tao (2020). A mixture model for responses and response times with a higher-order ability structure to detect rapid guessing behaviour. British Journal of Mathematical and Statistical Psychology, 73(2), 261–288. https://doi.org/10.1111/bmsp.12175

Man, & Harring, J. R. (2021). Assessing preknowledge cheating via innovative measures: A multiple-group analysis of jointly modeling item responses, response times, and visual fixation counts. Educational and Psychological Measurement, 81(3), 441–465. https://doi.org/10.1177/0013164420968630

Man, Harring, J. R., Jiao, & Zhan (2019). Joint modeling of compensatory multidimensional item responses and response times. Applied Psychological Measurement, 43(8), 639–654. https://doi.org/10.1177/0146621618824853

Man, Harring, J. R., & Sinharay (2019). Use of data mining methods to detect test fraud. Journal of Educational Measurement, 56(2), 251–279. https://doi.org/10.1111/jedm.12208

Man, Harring, J. R., Ouyang, & Thomas, S. L. (2018). Response time based nonparametric Kullback-Leibler divergence measure for detecting aberrant test-taking behavior. International Journal of Testing, 18(2), 155–177. https://doi.org/10.1080/15305058.2018.1429446

Marianti, Fox, J.-P., Avetisyan, Veldkamp, B. P., & Tijmstra (2014). Testing for aberrant behavior in response time modeling. Journal of Educational and Behavioral Statistics, 39(6), 426–451. https://doi.org/10.3102/1076998614559412

Meijer, R. R., & Sotaridona, L. S. (2006). Detection of advance item knowledge using response times in computer adaptive testing (LSAC Research Report 03-03). Law School Admission Council. https://ris.utwente.nl/ws/portalfiles/portal/5129730/LSAC_CT-03-03.pdf

Meng, X.-B., Tao, & Chang, H.-H. (2015). A conditional joint modeling approach for locally dependent item responses and response times. Journal of Educational Measurement, 52(1), 1–27. https://www.jstor.org/stable/43940552. https://doi.org/10.1111/jedm.12060

Molenaar, Oberski, D. L., Vermunt, J. K., & De Boeck (2016). Hidden Markov item response theory models for responses and response times. Multivariate Behavioral Research, 51(5), 606–626. https://doi.org/10.1080/00273171.2016.1192983

Molenaar, Tuerlinckx, & van der Maas, H. L. (2015a). A bivariate generalized linear item response theory modeling framework to the analysis of responses and response times. Multivariate Behavioral Research, 50(1), 56–74. https://doi.org/10.1080/00273171.2014.962684

Molenaar, Tuerlinckx, & van der Maas, H. L. (2015b). A generalized linear factor model approach to the hierarchical framework for responses and response times. British Journal of Mathematical and Statistical Psychology, 68(2), 197–219. https://doi.org/10.1111/bmsp.12042

Plummer, Best, Cowles, Vines, Sarkar, Bates, Almond, & Magnusson (2020). ‘Coda: Output analysis and diagnostics for MCMC.’ R package version 0.19-4. https://cran.r-project.org/web/packages/coda/coda.pdf

Qian, Staniewska, Reckase, & Woo (2016). Using response time to detect item preknowledge in computer-based licensure examinations. Educational Measurement: Issues and Practice, 35(1), 38–47. https://doi.org/10.1111/emip.12102

Ranger, Kuhn, J. T., & Ortner, T. M. (2020). Modeling responses and response times in tests with the hierarchical model and the three-parameter lognormal distribution. Educational and Psychological Measurement, 80(6), 1059–1089. https://doi.org/10.1177/0013164420908916

Ranger, Kuhn, J. T., & Wolgast (2021). Robust estimation of ability and mental speed employing the hierarchical model for responses and response times. Journal of Educational Measurement, 58(3), 308–334. https://doi.org/10.1111/jedm.1228420

Raton-Lopez, Rodriquez-Alvarez, X. M., Suarez-Cadarso, & Sampedro-Gude (2014). ‘OptimalCutpoints: Computing optimal cutpoints in diagnostic tests.’ R package version 1.1-5. https://cran.r-project.org/web/packages/OptimalCutpoints/OptimalCutpoints.pdf

Robitzsch (2020). ‘sirt: Supplementary item response theory models.’ R package version 3.11-21. https://cran.r-project.org/web/packages/sirt/sirt.pdf

Santos, M. A., Moala, F. A., & Tachibana, V. M. (2009). Approximate Bayesian methods for logistic regression model. Revista Brasileira de Biometria, 27(2), 288–300. http://jaguar.fcav.unesp.br/RME/fasciculos/v27/v27_n2/Maola.pdf

Shu, Henson, & Luecht (2013). Using deterministic, gated item response theory model to detect test cheating due to item compromise. Psychometrika, 78(3), 481–497. https://link.springer.com/article/10.1007/S11336-012-9311-3. https://doi.org/10.1007/s11336-012-9311-3

Singmann (2020). Complete Environment for Bayesian Inference (LaplaceDemon) [Computer Software Manual]. https://cran.r-project.org/web/packages/LaplacesDemon/LaplacesDemon.pdf

Sinharay (2017a). Detection of item preknowledge using likelihood ratio test and score test. Journal of Educational and Behavioral Statistics, 42(1), 46–68. https://www.jstor.org/stable/26447648. https://doi.org/10.3102/1076998616673872

Sinharay (2018). A new person-fit statistic for the lognormal model for response times. Journal of Educational Measurement, 55(4), 457–476. https://doi.org/10.1111/jedm.12188

Sinharay (2020). Detection of item preknowledge using response times. Applied Psychological Measurement, 44(5), 376–392. https://doi.org/10.1177/0146621620909893

Sinharay, & Johnson, M. S. (2020). The use of item scores and response times to detect examinees who may have benefited from item preknowledge. British Journal of Mathematical and Statistical Psychology, 73(3), 397–419. https://doi.org/10.1111/bmsp.12187

58.

Sunbul

Yormaz

(2018). Investigating the performance of omega index according to item parameters and ability levels. Eurasian Journal of Educational Research, 74, 207–226. https://doi.org/10.14689/ejer.2018.74.11

59.

Thomas

(2020). Package ‘ R2OpenBUGS: Running OpenBUGS from R.’ R PackageVersion 3.2-3.2.1. https://cran.rproject.org/web/packages/R2OpenBUGS/R2OpenBUGS.pdf

60.

Toton

S. L.

Maynes

D. D.

(2019). Detecting examinees with pre-knowledge in experimental data using conditional scaling of response times. Frontiers in Education, 4(49), 1–18. https://www.frontiersin.org/article/10.3389/feduc.2019.00049.

61.

Ucar

Dogan

C. D.

(2021). Defining cut point for Kullback-Leibler divergence to detect answer copying. International Journal of Assessment Tools in Education, 8(1), 156–166. https://doi.org/10.21449/ijate.864078.

62. Ulitzsch (2019). Using response times for modeling missing responses in large-scale assessments [Doctoral dissertation, Freie Universitaet Berlin]. ProQuest Dissertations Publishing. https://www.proquest.com/openview/a8fd7f4f41b80de070e7d3cc904e6ea9/1?pq-origsite=gscholar&cbl=51922&diss=y

63. van der Linden, W. J. (2006). A lognormal model for response times on test items. Journal of Educational and Behavioral Statistics, 31(2), 181–204. https://doi.org/10.3102/10769986031002181

64. van der Linden, W. J. (2007). A hierarchical framework for modeling speed and accuracy on test items. Psychometrika, 72(3), 287–308. https://doi.org/10.1007/s11336-006-1478-z

65. van der Linden, W. J. (2009). Conceptual issues in response-time modeling. Journal of Educational Measurement, 46(3), 247–272. https://doi.org/10.1111/j.1745-3984.2009.00080.x

66. van der Linden, W. J. (2011). Modeling response times with latent variables: Principles and applications. Psychological Test and Assessment Modeling, 53(3), 334–358. https://ris.utwente.nl/ws/portalfiles/portal/15037134/05_vanderLinden.pdf

67. van der Linden, W. J., & Guo (2008). Bayesian procedures for identifying aberrant response-time patterns in adaptive testing. Psychometrika, 73(3), 365–384. https://doi.org/10.1007/s11336-007-9046-8

68. van der Linden, W. J., Klein Entink, R. H., & Fox, J.-P. (2010). IRT parameter estimation with response times as collateral information. Applied Psychological Measurement, 34(5), 327–347. https://doi.org/10.1177/0146621609349800

69. van der Linden, W. J., Scrams, D. J., & Schnipke, D. L. (1999). Using response-time constraints to control for differential speededness in computerized adaptive testing. Applied Psychological Measurement, 23(3), 195–210. https://doi.org/10.1177/01466219922031329

70. van der Linden, W. J., & van Krimpen-Stoop, E. M. L. A. (2003). Using response times to detect aberrant responses in computerized adaptive testing. Psychometrika, 68(2), 251–265. https://doi.org/10.1007/BF02294800

71. van Rijn, P. W., & Ali, U. S. (2017). A comparison of item response models for accuracy and speed of item responses with applications to adaptive testing. British Journal of Mathematical and Statistical Psychology, 70(2), 317–345. https://doi.org/10.1111/bmsp.12101

72. von Davier & Rost (2006). 19 Mixture distribution item response models. In Rao, C. R., & Sinharay (Eds.), Handbook of statistics (pp. 643–661). Elsevier B. V. https://doi.org/10.1016/S0169-7161(06)26019-X

73. Wang, Chang, H. H., & Douglas, J. A. (2013). The linear transformation model with frailties for the analysis of item response times. British Journal of Mathematical and Statistical Psychology, 66(1), 144–168. https://doi.org/10.1111/j.2044-8317.2012.02045.x

74. Wang (2015). A mixture hierarchical model for response times and response accuracy. British Journal of Mathematical and Statistical Psychology, 68(3), 456–477. https://doi.org/10.1111/bmsp.12054

75. Wang, Shang, & Kuncel (2018). Detecting aberrant behavior and item preknowledge: A comparison of mixture modeling method and residual method. Journal of Educational and Behavioral Statistics, 43(4), 469–501. https://doi.org/10.3102/1076998618767123

76. Wise, S. L., & DeMars, C. E. (2006). An application of item response time: The effort-moderated IRT model. Journal of Educational Measurement, 43(1), 19–38. https://doi.org/10.1111/j.1745-3984.2006.00002.x

77. Wise, S. L., & Kong (2005). Response time effort: A new measure of examinee motivation in computer-based tests. Applied Measurement in Education, 18(2), 163–183. https://doi.org/10.1207/s15324818ame1802_2

78. Zhan, Jiao, Man, & Wang, W. C. (2021). Variable speed across dimensions of ability in the joint model for responses and response times. Frontiers in Psychology, 12, 469196. https://doi.org/10.3389/fpsyg.2021.469196

79. Zhan, Jiao, Wang, W.-C., & Man (2018). A multidimensional hierarchical framework for modeling speed and ability in computer-based multidimensional tests. arXiv preprint arXiv:1807.04003. https://arxiv.org/abs/1807.04003

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
