Sage Journals: Discover world-class research

Abstract

Item response theory (IRT) models the relationship between the possible scores on a test item against a test taker’s attainment of the latent trait that the item is intended to measure. In this study, we compare two models for tests with polytomously scored items: the optimal scoring (OS) model, a nonparametric IRT model based on the principles of information theory, and the generalized partial credit (GPC) model, a widely used parametric alternative. We evaluate these models using both simulated and real test data. In the real data examples, the OS model demonstrates superior model fit compared to the GPC model across all analyzed datasets. In our simulation study, the OS model outperforms the GPC model in terms of bias, but at the cost of larger standard errors for the probabilities along the estimated item response functions. Furthermore, we illustrate how surprisal arc length, an IRT scale invariant measure of ability with metric properties, can be used to put scores from vastly different types of IRT models on a common scale. We also demonstrate how arc length can be a viable alternative to sum scores for scoring test takers.

Keywords

item response theory item characteristic curves nonparametric IRT simulation

1. Introduction

Most performance tests are scored by adding up predetermined item scores. We will refer to scores resulting from this method as sum scores. Let $i = 1, 2, ..., n$ denote the test items in a test, and $m = 0, 1, ..., M_{i}$ denote the possible item scores for item i. Furthermore, let $x_{i j} = 0, 1, ..., M_{i}$ be the observed score for test taker j on item i. Then, the sum score for test taker j can be defined as the sum of all the item scores $\sum_{i}^{n} x_{i j}$ .

Some examples of large-scale tests using sum scores include the Scholastic Aptitude Test (SAT) or tests produced by the American College Testing program (Dorans, 1999). A downside of sum scores is that all items contribute equally to the overall score and implicitly to the underlying trait the test aims to measure. This is typically not the case, and a good alternative to using sum scores is to score test takers with item response theory (IRT). IRT scores take properties of each individual item into account, and different items have different levels of precision for different trait levels. For instance, a hard item might better discriminate among individuals with high ability levels but would be less informative for individuals with low abilities. Consequently, IRT can more precisely estimate an individual’s trait level across the entire trait spectrum and also allows for uncertainty estimates of the scores to differ across the score scale. As a consequence, IRT scores are also less affected by items measuring something different than intended.

Additionally, IRT models are commonly used to solve a wide range of practical test related problems. Examples include test equating, establishing the psychometric properties of test forms and test items as well as optimizing the efficiency of test delivery in more complex assessment systems. Parametric IRT models (Lord, 1980) have been the most common choice in the past. A disadvantage with parametric IRT is that, even in well-designed tests, some items may not fit a parametric IRT model. When this is the case, using a nonparametric IRT model can be preferable.

For binary scored items, several nonparametric estimation methods have been suggested. Mokken (1997) studied monotonicity and nonparametric estimation for binary scored items. Ramsay (1991, 1997) proposed to estimate item response functions (IRFs) with kernel smoothing over quantiles of the Gaussian distribution. Rossi et al. (2002) and Ramsay and Silverman (2002) optimized the penalized marginal likelihood using the expectation–maximization algorithm, and their IRFs were close to the three parameter logistic IRT model when the roughness penalty increased. Ramsay and Silverman (2005) also proposed a nonparametric method for not strictly monotonic curve estimates. Woods and Thissen (2006) and Woods (2006) used a spline-based approximation of the ability distribution when estimating the item parameters. Lee (2007) compared several methods for nonparametric estimation of item characteristic curves for binary items.

Items with more than two scored outcomes are commonly referred to as polytomously scored items. Such items are used in a variety of settings, including the scoring of rated tasks, the scoring of testlets or groups of dependent dichotomously scored items, multiple-choice items for which the incorrect options are retained for scoring purposes, and rating scales used to measure psychological and behavioral traits. Nonparametric IRT for polytomously scored items was first proposed by Molenaar (1997), after which a wide variety of nonparametric/semiparametric approaches have been studied (Emons, 2008; Falk & Cai, 2016; Sijtsma & Molenaar, 2002; Stochl et al., 2012). Emons (2008) examined the effectiveness of generalizations of nonparametric person-fit statistics to polytomous item response data. Stochl et al. (2012) examined Mokken scale polytomous rating scale health data coded as 1–2–3–4 and recoded them as 0–0–1–1, thus essentially making the items binary before analyzing the data. Falk and Cai (2016) compared the monotonic polynomial method to other nonparametric and semiparametric alternatives. More recently, Ramsay et al. (2020) studied the nonparametric optimal scoring (OS) approach for rating scale data.

The OS IRT framework introduces new interesting perspectives to IRT. Through the fusion of psychometric and information theory, the arc length $θ$ transformation (presented in more detail in Section 4) introduces a way to represent test taker performance on a ratio scale, on which addition and subtraction have meaning in terms of bits of information.

Prior research on OS has mostly focues on binary scored multiple choice items (Ramsay & Wiberg, 2017). Ramsay et al. (2019) notably expanded the scope by taking the information from all incorrectly chosen response alternatives into account. A comparison of binary scored items using parametric and nonparametric IRT revealed that sum-scored tests would need to be longer than an optimally scored test in order to attain the same level of accuracy (Wiberg et al., 2019). OS also offers a flexible alternative for estimating IRFs of items that do not show a good fit with parametric IRT models (Ramsay et al., 2019). Recently, OS was proposed as an alternative for analyzing rating scales, which are commonly found in health instruments and questionnaires (Ramsay et al., 2020). Although OS for rating scales was explored in Ramsay et al. (2020), OS for polytomous response data has not yet been compared with parametric IRT.

The overall aim of this study is to compare OS with parametric IRT for polytomously scored items and sum scores, using both simulated and real data. Specifically, we evaluate and compare the item fit and performance of OS against the generalized partial credit (GPC) model, a commonly used parametric IRT model for polytomous response data where test takers can receive partial credit on an item. Additionally, IRT estimated test taker scores, transformed to a ratio scale using the previously mentioned arc length $θ$ transformation, are compared against the commonly used sum scores, expanding the work by Ramsay et al. (2020). Furthermore, we also demonstrate how the arc length transformation can be used to align the latent trait scales of different IRT models to a common scale, serving as a tool for comparing IRT models.

In Sections 2 and 3, the GPC and OS models are described. After that, the concept of arc length is introduced in Section 4 as an IRT ability scale invariant measure of ability to compare scores from different types of IRT models. The OS and GPC models are then used to analyze real test data in Section 5, followed by a simulation study to make comparison between the two approaches in Section 6. Finally, the paper ends with a discussion with concluding remarks, advantages, disadvantages, and practical implications in Section 7.

2. The GPC Model

The GPC model (Muraki, 1992) is an extension of the partial credit model (Masters, 1982). It can be used to model polytomously scored items, where the items can yield partial credit based on the respondent’s degree of attainment with the underlying trait. The GPC model, similar to most parametric IRT models, requires that the probability of each score on each test item varies smoothly over an index $θ$ . $θ$ represents the underlying construct the test is meant to measure and is commonly assumed to follow the standard normal distribution. Let $P_{i m} (θ)$ denotes the probability that a test taker with ability $θ$ responds with a score of m on item i. The GPC model defines $P_{i m} (θ)$ as follows:

P_{i m} (θ) = {\begin{matrix} \frac{1}{1 + \sum_{r = 1}^{M_{i}} (exp \sum_{c = 1}^{r} [a_{i} (θ - b_{i c})])}, & if m = 0 \\ \frac{exp (\sum_{c = 1}^{m} [a_{i} (θ - b_{i c})])}{1 + \sum_{r = 1}^{M_{i}} (exp \sum_{c = 1}^{r} [a_{i} (θ - b_{i c})])}, & otherwise \end{matrix}

where a_i is the item discrimination parameter and $b_{i c}$ is the boundary parameter for a score of c, denoting the ability $θ$ for which the probability of getting a score of c overtakes the probability of getting a score of $c - 1$ for item i. The important model assumptions are as follows:

Unidimensionality: Test taker ability $θ$ is unidimensional.

Local independence: Item responses are conditionally independent given $θ$ .

Monotonicity: The probability of obtaining a score of m or higher is monotonically increasing in $θ$ .

3. Optimal Scoring

The first proposal of OS for binary scored items was made by Ramsay and Wiberg (2017), and they showed that OS can be beneficial in comparison to sum scores. In a later proposal, Ramsay et al. (2019) incorporated information of the incorrect response alternatives as this information proved to give valuable information. Recently, Ramsay et al. (2020) demonstrated the use of OS for polytomous-ordered response scales with clinical data.

In parametric IRT, $θ$ is typically defined over the whole real line. For OS, any bounded interval can be used. In line with Ramsay et al. (2020), the interval 0–100 was used in this study. This is also the default in the TestGardener (Ramsay & Li, 2022) R package, in which the OS algorithm is implemented.

3.1. Surprisal and Probability

Surprisal, which originates from information theory (Shannon, 1948), is a central concept within the OS framework. A probability P can be transformed into surprisal using the transformation $S (P) = - {log}_{M} (P)$ . Figure 1 shows the surprisal plotted against probability when a base two logarithm is used. Surprisal is sometimes referred to as the self-information of an event. If an unlikely event occurs, it gives us more information (higher surprisal). For an event with $100 %$ probability, the surprisal is $- {log}_{M} (1) = 0$ . This makes intuitive sense, as we get no new information for knowing the event occurred as we already knew it would occur, no surprise. When the logarithm uses base M, $S (P)$ is the surprisal for an event with probability $1 / M$ occurring $S (P)$ times in a row.

Figure 1.

Probability plotted against the corresponding surprisal (2-bit).

The unit of measurement for surprisal depends on the base of the logarithm used in the computation: If base 2 is used, the unit is “bit”; if base 3 is used, the unit is “trit”; and if base 10 is used, the unit is “Hartley” or “dit.” In this article, when we mention “M-bit,” we are referring to the unit of measurement for surprisal based on the logarithm with base M. This concept of “bits” (2-bits) comes from the realm of digital computing, where “bit” is short for “binary digit,” the smallest increment of data on a machine. A bit can hold only one of two values: 0 or 1. Similarly, in information theory, a bit of information represents a decision between two equally likely alternatives.

For example, the probability of landing 3 heads in a row when flipping a coin is ${(1 / 2)}^{3} = 12.5 %$ , and thus, we get the 2-bit surprisal $S (0.125) = 3$ . The 3 2-bits of surprisal in this case indicates that there are three “binary decisions” or stages (each coin flip being a stage) to reach the event in question. This event, therefore, offers three bits of information.

3.2. OS Algorithm

OS models are estimated using an iterative estimation algorithm. The algorithm alternates between using splines to model the surprisal for each score against $θ$ and estimating $θ$ scores for each test taker. Estimating surprisals instead of probabilities makes for computational advantages, as removing the upper bound of 1 makes spline fitting easier. A full overview of the OS fitting algorithm is shown in Algorithm 1, and we will go through each step in more detail in this section.

Algorithm 1 OS Fitting Algorithm
1: Jitter the sum scores and use sum score percentiles as initial $θ$ estimates for each test taker. 2: Partition the $θ$ estimates into bins. 3: Compute the surprisal to obtain each score within each bin for all items. 4: For each score on every item, fit a surprisal curve to the bins obtained in the previous step. 5: Use the fitted curves to estimate new $θ$ scores for each test taker using maximum likelihood. 6: Repeat Steps 2–5 until sufficient convergence has been reached.

Algorithm 1 OS Fitting Algorithm

1: Jitter the sum scores and use sum score percentiles as initial

θ

estimates for each test taker.
2: Partition the

θ

estimates into bins.
3: Compute the surprisal to obtain each score within each bin for all items.
4: For each score on every item, fit a surprisal curve to the bins obtained in the previous step.
5: Use the fitted curves to estimate new

θ

scores for each test taker using maximum likelihood.
6: Repeat Steps 2–5 until sufficient convergence has been reached.

Step 1. In order to start the estimation algorithm, the sum score percentile ranks are used as initial estimates of $θ$ for each test taker. To break ties and reduce bias, the sum scores are jittered by adding a random value from a uniform distribution between −0.5 and 0.5 before computing the percentiles.

Steps 2 and 3. After the initial scores have been obtained, the test takers are ordered by their scores and partitioned into a set of bins, each bin containing a preset number of test takers. The proportion of test takers in each bin obtaining each score is calculated and translated into surprisal. If no test taker in a bin responds in an item category, a fixed large surprisal value is used. The surprisal value for each bin is placed horizontally at the center $θ$ score of the bin. The purpose of these steps is to aid the curve fitting in Step 4, as well as to regularize the fit.

Step 4. Using the binned surprisal values, surprisal curves are estimated using linear combinations of B-spline basis functions $ϕ_{v} (θ)$ , $v = 1, \dots, V$

S_{i m} (θ) = \sum_{v = 1}^{V} β_{i m v} ϕ_{v} (θ),

where $β_{i m v}$ are the coefficients corresponding to each of the basis functions. The basis functions are piecewise polynomials reaching up to a specified order.¹ The function resulting from the sum in Equation 2 is called a spline function. To increase the flexibility of a spline function, knots can be placed at various locations on the $θ$ scale. Between each knot, the number of piecewise polynomials having positive values is the same as the spline order. Thus, adding knots implicitly adds more basis functions to the spline. Figure 2 shows an example of B-spline basis functions $ϕ_{v} (θ)$ of order 4 with four knots. To summarize, one can control curve flexibility by adjusting the spline function order and/or the number of knots. A spline function of a given order is order $- 2$ times differentiable.

Figure 2.

B-spline basis functions $ϕ_{v} (θ)$ of order 4 with four knots, each marked as vertical dashed lines.

For more information about splines and how to estimate curves refer to Ramsay and Silverman (2005) and Ramsay et al. (2009). To assure that the surprisal corresponding probabilities within each item sum to one for all $θ$ values, a retraction operation is required. The reader is referred to Ramsay et al. (2020) for further details. The final spline curves are typically referred to as surprisal splines.

Step 5. After the surprisal splines have been obtained, new $θ$ scores are estimated for each test taker using maximum likelihood estimation (MLE). Let $U_{j i m} = 1$ if test taker j responded with a score of m on item i and let $U_{j i m} = 0$ otherwise. The negative log likelihood (LL)

H_{j} (θ) = \sum_{i = 1}^{n} \sum_{m = 0}^{M_{i}} U_{j i m} S_{i m} (θ),

is proportional to surprisal. The MLE of $θ$ is the value of $θ$ , which minimizes Equation 3 by solving the gradient equation

{H^{'}}_{j} (θ) = \sum_{i = 1}^{n} \sum_{m = 0}^{M_{i}} U_{j i m} \frac{d S_{i m} (θ)}{d θ} = 0.

Note that only the chosen response options with associated surprisal curve derivatives contribute to the summations. The assumptions of unidimensionality and local independence required for the GPC model persist for the OS model, as the spline curves are fit independently of each other conditional on $θ$ , and only one $θ$ value is estimated for each test taker. However, because of the spline fitted IRFs, monotonicity is not required and no logistic parametric form is assumed.

4. Arc Length as a Measure of Ability

A direct comparison of OS estimated $θ$ scores and parametric IRT estimated $θ$ scores cannot be done as the $θ$ s from each model are fitted onto different scales. In theory, the $θ$ scale can be arbitrarily rescaled for a fitted IRT model as long as the ordering of the $θ$ values is kept the same. Any monotone $θ$ scale transformation can be made, and the item response probabilities for any given $θ$ on the new scale would be the same as the corresponding $θ$ on the old scale, resulting in an equally valid model. However, there is no obvious scale transformation that assures a fair comparison between OS and parametric IRT $θ$ scores. On top of this, there is no way to transform a closed-set scale ( $[0, 100]$ produced by the OS algorithm) into an open one ( $(- \infty, \infty)$ produced by parametric IRT) as the endpoints of a closed set have no equivalent in an open set.

The concept of arc length gives us a solution to the ability comparison problem. The surprisal curves from each item category obtained from a fitted IRT model together define a smooth one-dimensional curve in multidimensional space, parameterized by $θ$ . Distance traveled along this curve, when moving from one $θ$ to another, can be computed using arc length. The distance $D (θ)$ from the smallest possible $θ$ , denoted $θ_{0}$ , to a test taker’s estimated $θ$ can thus be used as a measure of ability:

D (θ) = \int_{θ_{0}}^{θ} \sqrt{\sum_{i = 1}^{n} \sum_{m = 0}^{M_{i}} {[\frac{d S_{i m} (t)}{d t}]}^{2}} d t .

Since surprisal is measured in bits, $D (θ)$ inherits the bit unit and can be seen as a measure of the amount of information (in bits) covered by the test or instrument that a respondent with index value $θ$ has attained. Thus, when $θ$ equals the upper limit of the $θ$ scale, $D (θ)$ represents the total amount of information contained within the test. This may feel somewhat intuitive, even for someone who does not quite grasp the concepts of bits and information theory. Hence, in comparison to traditional IRT scores, arc length scores may present themselves as more accessible and straightforward, as for many test takers, negative $θ$ scores and a scale reaching toward infinity can be confusing.

One should also note Equation 5 is a one-to-one monotonic transformation function of $θ$ . Because of this, it inherits all properties of IRT, including the significant advantage of considering different properties of individual items such as their difficulty and discrimination during scoring. This is a feature not present in sum scores, which treat all items as equal. While it might be tempting to consider arc length scores as complex and hard to grasp, given their reliance on advanced concepts like logarithms and high-dimensional space distances, they offer a distinct advantage over traditional IRT $θ$ scores. Specifically, similar to sum scores, arc length scores come with well-defined minimum and maximum limits, something that resonates more naturally with people.

Furthermore, arc length operates on a ratio scale, possessing an absolute zero. As opposed to using a $θ$ scale, a fixed change in arc length means the same thing anywhere on the arc length scale; the same amount of change in the surprisal curves across all item categories. Because of this, the scale can be rescaled to a total length of 100. The rescaled scale can be seen as a measure of the percentage of test information attained at each level of $θ$ . This also gives us a measure, which can be used to compare IRT models using different $θ$ scales in a fair way. We will denote the rescaled % arc length with $D_{%} (θ)$ .

Arc length has been used in previous research on OS (e.g., Ramsay et al., 2023; Ramsay et al., 2020) and is also implemented in the TestGardener package. However, using it for comparison purposes with models that produce very small or even 0 probabilities like the GPC model creates problems, as surprisal goes to $\infty$ when probability approaches 0.

There is also a fundamental difference between OS and parametric IRT. For the parametric models such as the GPC model, the $θ$ scale is assumed normal and is supposed to reflect the underlying construct measured by the test. The fitted IRFs are allowed to range beyond the strongest/weakest test taker ability in the sample used to fit the model, allowing for a broader prediction scope. On the other hand, the OS model does not make the same assumptions about the ability distribution and how it aligns with the underlying construct. Instead, it fits the model to the specific population of test takers producing each dataset. The $θ$ values in the OS model are constrained within an interval that is only meant to cover the content measured by the test. In other words, the ability range of a fitted OS model is adjusted to the specific range of abilities found in the sample, and a fitted model does not generalize to other test taker populations.

Because of the above, we calculate arc length for the GPC model by setting $θ_{0}$ in Equation 5 to the smallest $θ$ estimate obtained by any test taker used to fit the model. By doing this, parametric IRT and OS arc lengths will be tied to the sample used to fit the data and thus comparable. As we used the mirt R package (Chalmers, 2012) for our analysis, we used the package default $θ$ estimation method, Bayesian expected a-posteriori (EAP). This method prevents $θ$ estimates from approaching negative or positive infinity for test takers with perfect or 0 sum scores. Consequently, as all IRFs go toward 0 or 1 at the extremes by the GPC model definition (Equation 1), the EAP method also reduces the impact of extremely large surprisal values for probabilities near 0.

5. Empirical Study

Parametric and OS IRT models were fit to four real data sets from two different tests: two data sets from the Swedish SAT, one from 2013 ( ${SAT}_{2013}$ ) and one from 2014 ( ${SAT}_{2014}$ ), as well as two data sets from the Swedish national test in mathematics, one from 2019 ( ${NAT}_{2019}$ ) and one from 2018 ( ${NAT}_{2018}$ ). Summary statistics for each test are shown in Table 1. In the upcoming subsections, each test is described in more detail.

Table 1.

Summary Statistics for Each Test Form

Statistic	${NAT}_{2018}$	${NAT}_{2019}$	${SAT}_{2013}$	${SAT}_{2014}$
Sample size	1,008	1,401	30,000	30,000
Number of items	28	28	60	60
Dichotomous items	9	9	50	50
Polytomous items	19	19	10	10
Maximum sum score	57	58	80	80
Mean sum score	25.72	25.54	43.65	42.49
Standard deviation	12.12	12.62	13.58	13.10
Skewness	0.30	0.15	0.17	0.16

5.1. Swedish National Mathematics Test

The national mathematics test is given to high-school students in Sweden taking the mathematics 3c course. The course is mandatory for students taking the natural science and technology programs but can also be taken by students from other programs by choice. The test is distributed by the end of the course and has a big impact on the course grade. It contains a mix of different item types, some being dichotomously scored while others are scored polytomously.

5.2. Verbal Swedish SAT

The Swedish SAT is administered twice a year and is used when applying for higher education in Sweden. It consists of a verbal and a quantitative part. Only the verbal part was considered in this study. The verbal part contains multiple-choice items of three different types: word interpretation, sentence completion, and reading comprehension. In the third type, texts are presented to the test taker and several multiple-choice items relate to each text. Responses to items relating to the same text cannot be assumed independent given a certain level of a unidimensional underlying trait, as different test takers may be more or less familiar with the text subject. Because of this, scores from items related to the same text were added together and treated as polytomous.

5.3. Fitting Procedure and Evaluation Metrics

GPC models were fitted to ${NAT}_{2018}$ , ${NAT}_{2019}$ , ${SAT}_{2013}$ , and ${SAT}_{2014}$ using the R package mirt (Chalmers, 2012). The TestGardener R package was used to fit OS models to the same data sets.

To evaluate model fit while still adjusting for overfitting, 10-fold cross validation (CV) was used to estimate the out-of-sample LL for each model. Specifically, each data set was subset into 10 equally sized folds $f = 1, 2, ..., 10$ . For each fold, the negative LL was computed for a model fit using all data outside the fold:

CVLL = - \frac{1}{10} \sum_{f = 1}^{10} \sum_{j = 1}^{n_{f}} \sum_{i = 1}^{I} \sum_{m = 0}^{M_{i}} U_{f j i m} ln ({\hat{P}}_{f i m} ({\hat{θ}}_{j})),

where n_f is the number of test takers in fold f. ${\hat{P}}_{f i m} ({\hat{θ}}_{j})$ denotes the model estimated probability that a test taker with estimated index ${\hat{θ}}_{j}$ achieves a score of m on item i from a model estimated without the test takers in fold f. $U_{f j i m} = 1$ if test taker j in fold f responded with a score of m on item i and 0 otherwise. We used the LL as it is widely adopted and mathematically sound, but one should note that it does not take into account an intuitive aspect of ordinal or partial credit data; the proximity of item response categories (scores). For instance, if a person rates an item with a “3” on a 5-point Likert-type scale, intuitively, “2” and “4” are more similar responses than “1” or “5.” Yet, in the typical LL function, only the probability of “3” is considered while calculating the likelihood of the response.

Additionally, a comparison method based on binning the data over the $θ$ scale was applied. For each model, the $θ$ scale was divided into K bins. These will sometimes be referred to as evaluation bins to differentiate them from the bins used to fit OS models (see Section 3.2). The bin sizes were chosen in a way such that the same number of individuals in the data set used to fit the model was contained within each bin. Kullback–Leibler divergence (KLD) and root mean squared error (RMSE) were computed to evaluate model performance for item i using the bins

{KLD}_{i} = \frac{1}{K} \sum_{k = 1}^{K} \sum_{m = 0}^{M_{i}} [O_{i} (m, k) {log}_{2} (O_{i} (m, k)) - O_{i} (m, k) {log}_{2} (E_{i} (m, k))],

{RMSE}_{i} = \sqrt{\frac{1}{K M_{i}} \sum_{k = 1}^{K} \sum_{m = 0}^{M_{i}} {(E_{i} (m, k) - O_{i} (m, k))}^{2}},

where $O_{i} (m, k)$ and $E_{i} (m, k)$ are the observed and model expected proportions of individuals in bin k getting score m on item i. The model expected number of individuals was computed by approximating

E_{i} (m, k) = \int_{k} {\hat{P}}_{i m} (θ) g (θ) d θ,

where ${\hat{P}}_{i m} (θ)$ is the fitted model probability to get a score of $m - 1$ given ability $θ$ . The integral goes over the $θ$ interval spanned by bin k. $g (θ)$ is the ability distribution of the test takers, assumed normal for GPC models and based on a spline fitted density for OS models. This model performance evaluation method gives an indication for how well the model fits the data, but at the same time, the binning process punishes overfitting. The performance for national mathematics models was evaluated using 10, 20, and 30 bins for KLD/RMSE computation, while 10, 35, and 70 bins were used to evaluate Swedish SAT models. A larger number of bins were considered for the SATs because of the larger samples. Note also that the quantities in Equations 6 and 7 are essentially surprisal values. KLD in Equation 7 is the loss in 2-bits of information/surprisal if the probabilities are assumed to be the model expected ones $E_{i} (m, k)$ when they are actually the observed ones $O_{i} (m, k)$ .

5.4. Comparing IRFs

The GPC model was compared against OS models fitted to the same data set using different settings for bins, knots, and spline orders. KLD, RMSE, and CV LL for a large set of fitted models using the different combinations of OS parameter values are provided in Tables A1 to A4 in Online Appendix A. These combinations were chosen to cover a large range of settings, and it is clear from the results that further increasing the flexibility results in overfitting.

Table 2 shows the performance measures for a selected subset of models, which performed relatively well. KLD and RMSE were averaged over all bins, items, and item scores for a fitted model. If a model performed well using one evaluation bin size, it generally performed better throughout the other evaluation bin sizes. The CV LL was smaller for less flexible models fit to the Swedish SAT data, and models using order 4 splines without knots fit to 5 bins performed the best. For the national mathematics data sets, CV LL sometimes favored more flexible OS models despite the much smaller samples. Using 20 bins with no knots was the best setting for both data sets in terms of this metric, with order 5 and order 4 splines for ${NAT}_{2018}$ and ${NAT}_{2019}$ , respectively. Models with knots also performed relatively well for these data sets, when compared to similar models fit to the SAT data.

Table 2.

Performance Measures for Fitted IRT Models

					KLD			RMSE			CV LL
	Model	Bins	Knots	Spline Order	10 Bins	20 Bins	30 Bins	10 Bins	20 Bins	30 Bins	CV LL
NAT 2018	GPC				27.54	35.11	49.22	4.01	5.49	6.77	18.71
	OS	6	0	4	21.17	28.79	42.28	3.20	4.80	6.05	16.19
	OS	7	0	4	20.86	29.10	43.05	3.19	4.91	6.07	16.15
	OS	7	1	5	25.73	35.35	48.41	3.35	5.31	6.57	16.17
	OS	8	0	4	22.43	28.28	43.04	3.11	4.60	5.87	16.08
	OS	10	0	4	22.43	29.64	41.78	2.92	4.71	5.74	15.95
	OS	12	0	4	23.52	29.36	44.13	3.03	4.59	5.88	15.86
	OS	20	0	5	27.41	31.20	43.93	2.98	4.56	5.85	15.38
	OS	20	3	4	33.95	36.4	49.59	3.47	5.16	6.36	15.55
	OS	20	3	5	42.27	35.29	47.66	3.19	4.8	6.02	15.61
NAT 2019	GPC				28.93	30.69	38.82	3.69	4.77	5.72	27.33
	OS	6	0	4	21.13	23.74	33.20	3.01	4.25	5.30	23.90
	OS	7	0	4	20.00	22.66	30.87	2.85	4.21	5.11	23.92
	OS	7	0	5	19.63	23.89	32.62	2.76	4.24	5.29	23.79
	OS	7	1	4	21.38	24.74	33.24	2.94	4.34	5.27	23.85
	OS	8	0	4	20.18	21.97	31.78	2.78	4.02	5.15	23.81
	OS	10	0	4	20.83	22.41	31.90	2.87	4.17	5.15	23.54
	OS	20	0	4	24.21	24.80	34.33	2.95	4.18	5.22	22.66
	OS	20	3	4	31.27	28.74	34.81	2.59	4.35	5.01	22.87
					KLD			RMSE			CV LL
					10 Bins	35 Bins	70 Bins	10 Bins	35 Bins	70 Bins	CV LL
SAT 2013	GPC				6.51	4.68	5.54	2.22	2.67	3.09	474.84
	OS	5	0	4	3.33	2.16	3.09	0.97	1.87	2.39	468.40
	OS	6	0	4	3.48	2.03	3.02	1.00	1.84	2.40	469.83
	OS	7	0	4	3.86	2.12	3.07	1.20	1.94	2.46	470.81
	OS	7	1	4	7.04	4.92	6.02	2.24	3.06	3.52	471.59
	OS	8	0	4	4.03	2.16	3.21	1.25	1.97	2.53	471.49
	OS	10	0	4	4.72	2.48	3.58	1.52	2.20	2.72	472.48
	OS	12	0	4	5.36	2.94	4.08	1.78	2.44	2.94	473.13
	OS	10	2	5	9.24	5.40	6.41	1.96	3.08	3.56	475.16
SAT 2014	GPC				6.21	4.40	5.21	2.13	2.61	3.02	476.79
	OS	5	0	4	4.01	2.82	3.92	1.54	2.25	2.73	474.00
	OS	6	0	4	3.63	2.45	3.52	1.29	2.09	2.60	475.14
	OS	7	0	4	3.63	2.07	3.13	1.17	1.90	2.43	475.85
	OS	7	1	4	6.89	5.12	6.15	2.29	3.08	3.50	477.12
	OS	8	0	4	3.87	2.17	3.26	1.24	1.98	2.51	476.68
	OS	10	0	4	4.33	2.37	3.41	1.40	2.11	2.61	477.49
	OS	12	0	4	5.02	2.92	4.06	1.67	2.36	2.87	478.53
	OS	10	2	5	8.74	5.20	6.07	2.06	3.05	3.48	481.84

Note. KLD times 1,000, probability RMSE in percentage points, and CV LL divided by 1,000 are displayed. Each OS entry also displays the number of bins and knots as well as the order of the splines used to fit the OS model. The KLD/RMSE columns show the number of bins used to evaluate model performance. The model with the smallest error for each measure is highlighted in bold. KLD = Kullback–Leibler divergence; RMSE = root mean squared error; CV = cross validation; LL = log likelihood; GPC = generalized partial credit; OS = optimal scoring; IRT = item response theory.

In terms of KLD and RMSE, simpler OS models generally outperformed more flexible ones, using higher spline orders and/or multiple knots, for all data sets. However, there were some exceptions to this, especially for the Swedish national mathematics tests. For example, the OS model fit using 20 bins, three knots, and order 4 splines outperformed the GPC model on ${NAT}_{2019}$ in all measures except KLD with 10 bins (see Table 2).

For all data sets, better fitting OS models could be identified for all evaluation metrics, but there were also OS parameter combinations, which would result in worse performance for OS.

Figures 3 and 4 show IRFs for some of the worst fitting polytomously scored items on ${NAT}_{2018}$ and ${SAT}_{2013}$ , respectively. To make the ability scales comparable, they have been transformed to percentile arc length for each model. For each of the 10 evaluation bins used for model evaluation, see Equations (7) and (8), the observed probability in the data and the fitted model expected probability are plotted as dots for each item score. As such, an overlap between the observed and expected dots for a bin indicates good model fit. Note that even for models fit to the same dataset, the observed probabilities within each bin differ as the ordering of individuals on each scale differ for different models.

Figure 3.

Various item response functions from ${NAT}_{2018}$ models, together with the expected and observed probabilities within each evaluation bin.

Figure 4.

Various item response functions from ${SAT}_{2013}$ models, together with the expected and observed probabilities within each evaluation bin.

Plots A and B in Figure 3 show the GPC and OS IRFs from the item with the worst RMSE performance using the GPC model. For this item, the parametric form of the GPC model struggles to model the probabilities for top performing test takers, plot A. The GPC model suggests that the expected probability for a test taker among the top 10% to get the max score on the item is lower than the expected probability to get the second highest score. However, for the observed probabilities in the data, the reverse is true, as shown by the nonfilled dots. When using the best-performing OS model, plot B, the expected probabilities for the top 10% align more closely to the observed ones when compared to the GPC model, plot A.

Plot D in 3 shows the IRF for the item with the worst fit for the OS model, while plot C shows the IRFs for the same item for the GPC model. Even though it has the largest KLD out of all polytomously scored items for the OS model, the KLD is still marginally smaller than the resulting KLD from the GPC model for the same item. Despite this, there are some bins for which the GPC model has a better fit.

IRFs from the worst fitting polytomously scored items on ${SAT}_{2013}$ are shown in Figure 4. For the worst fitting GPC item, plot A, the observed probability for a score of 1 is larger than the observed probability for getting a score of 2 for all bins. Despite the large dataset, this behavior cannot captured in a good way by the GPC model, as the observed binned probabilities for a score of 1 are above the expected ones for the top performing test takers. On the other hand, the OS model shows a great fit to the binned data, as shown in Figure 4B. For this data set, the item with the worst fit to the OS model still beats the GPC model fit for the same item by a relatively large margin in terms of KLD, plots C and D.

Plots similar to those in Figures 3 and 4 for ${NAT}_{2019}$ and ${SAT}_{2014}$ are provided in Online Appendix B. In all of the figures, the GPC and OS model curves corresponding to the same items are somewhat similar in shape. A big difference is that for the OS model, the probabilities do not need to fully reach 0 or 1 at the upper and lower ability scale boundaries. Whether this makes sense or not depends on the aim of the IRT model and the test. If the items are multiple choice, as is the case in the Swedish SAT, one always has a chance to get it right by guessing. The same may be true for an item requiring, for example, a numeric response, and only a certain amount of numbers are reasonable from the context.

5.5. Score Comparison

Figure 5 shows the kernel-smoothed score distributions for sum scores and percentage arc length $D_{%} (θ)$ from both OS and GPC models. $D_{%} (θ)$ was used in place of $θ$ scores, as it puts the GPC and OS scores on the same ratio scale (refer to Section 4 for further details). The sum scores were multiplied by 100 and divided by the maximum sum score of each test to have a max score of 100. The displayed OS model score distributions were computed from one of the best fitting models for each data set (see Table 2). Specifically, 10 bins, no knots, and order 4 splines for ${NAT}_{2018}$ ; eight bins, no knots, and order 4 splines for ${NAT}_{2018}$ ; 5 bins, no knots, and order 4 splines for ${SAT}_{2014}$ ; and finally, seven bins, no knots, and order 4 splines for ${SAT}_{2014}$ .

Figure 5.

Estimated score distributions resulting from different scoring methods. Each vertical line shows the mean for the corresponding distribution.

Despite $D_{%} (θ)$ being computed similarly for the OS or GPC models, the resulting score distributions are remarkably different. The GPC $D_{%} (θ)$ scores show less variation when compared to OS $D_{%} (θ)$ or sum scores, and relatively few test takers obtain small or large scores. These differences are likely caused by the GPC IRF shapes (Equation 1), where probabilities always go toward $0 / 1$ , together with the open-ended interval for $θ$ . As previously mentioned, score probabilities close to 0 lead to very large surprisal values, and consequently also large changes in surprisal and arc length as $θ$ increases. The items in Figures 3 and 4 are the examples of this. In all plots in the figures, the GPC IRF probabilities for each item score are closer to the probability boundaries toward the edges of the % arc length scale. These differences between the OS and GPC IRFs are subtle, but it makes for large differences in score distributions if one was to use arc length as a measure of test taker performance. This does not matter for the ordering of test taker scores, but it may affect the score interpretation. From the GPC $D_{%} (θ)$ distribution on the Swedish SAT from 2023, the interpretation that only about 30% of the test takers have attained more than 50% of the information (in bits) contained within the test may not seem plausible. Note that despite the differences in distribution shape, the OS and GPC mean % arc lengths are similar in size in all four plots in Figure 5.

So which score is better? In all four plots in Figure 5, a larger portion of test takers end up closer to or at the score scale boundaries when using OS compared to the other scoring methods. This makes perfect sense for the Swedish SATs, which only contain multiple-choice items. A person not knowing the answer to a single question would be guessing, still accommodating an average of 20% correct responses as all items in the Swedish SATs have five response options each. This is also under the assumption that all distractor options are equally plausible to the true one, which is very hard to achieve. In this sense, the sum score is by definition a biased score estimate for these types of tests if the score is meant to represent a test taker’s attained knowledge of the information contained within the test.

Moreover, even for a well constructed test such as the Swedish SAT, it is inevitable that some items will perform better or worse, both in general and for different test takers. Examples of worse performing items include items with confusing distractor options, unclear problem descriptions, or items that are less related to the underlying construct(s) the test is meant to measure. Since the Swedish SAT is used for applying to university, such items can be punishing, as all items have the same impact on the resulting sum scores. In this respect, the sum score is a worse choice than $D_{%} (θ)$ for scoring people on the Swedish SATs despite the differences being small in many practical settings.

For the national mathematics tests, the score distributions of each scoring method are more similar. The items on these tests are not multiple-choice questions, and it is not as clear just from context which distribution is more reasonable/desirable. The ordering of the test takers from using each scoring method is also very similar, with Spearman correlations ranging between 0.978 and 0.997 for all score pairs on all four tests. However, the inherent benefits from IRT of more accurate weighting of items with various relationships with the underlying latent trait together the theoretical concept of a ratio scale from an information theory perspective should speak in the favor of the % arc length scores.

6. Simulation Study

To further compare the fit of parametric IRT and OS IRT models, a simulations study was conducted. Data were generated using IRFs from models fit to the ${SAT}_{2014}$ data set. Curves fit to this data set were chosen as the true ones because of the large number of test takers, as opposed to the smaller samples in the national mathematics data sets. To avoid exclusively favoring one method based on the model generating test data, data were generated using either all GPC IRFs, all OS IRFs or a mix of the two in different scenarios. One of the best performing OS models from the empirical study, using order 4 splines with seven bins and no knots, was used to generate OS item data (see Table 2).

Figure 6.

Ability density functions used to generate test data in the simulation study. The dots in the OS distribution plots mark the proportions of test takers at the min/max values of $θ$ . (A) Normal GPC θ distribution. (B) SAT 2014 OS θ distribution. (C) Skewed GPC θ distribution. (D) Skewed OS θ distribution.

To be able to generate item scores for test takers using a mixture of OS and GPC IRFs, each test taker’s GPC $θ$ score needs to be mapped to an equivalent OS $θ$ . The percentiles in each model’s $θ$ distribution were used for this mapping. For the GPC model, the $θ$ s were assumed to follow the standard normal distribution (Figure 6A). For OS, the proportion of test takers getting a maximum or minimum score was first computed; then, a spline smoothed density was fitted to the scores between the maximum and minimum using the R package gss (Gu, 2014). The density function for the resulting smoothed distribution is shown in Figure 6B.

The methods were compared throughout various realistic scenarios. The sample size was either 1,000, 5,000, or 10,000 test takers and test lengths of 30 and 60 items were examined. Furthermore, the percentage of OS IRFs used for data generation was varied between 0%, $50$ %, and $100$ %. To explore performance when the test taker population is skewed, the $θ$ distribution for the GPC items was varied between the standard normal distribution and the skew normal distribution with location, scale, and shape parameters 1.5, 1.2, and −4, respectively. The density function for this skew normal distribution is shown in Figure 6C. The OS $θ$ density function corresponding to the skew normal GPC density in Figure 6C is plotted in Figure 6D. When presenting results from the simulation study, the populations in plots A and B will be referred to as data set populations, while the populations in plots C and D will be referred to as skewed populations.

IRF probability estimation was evaluated over $R = 1, 000$ simulation iterations. The full simulation procedure for each scenario is described in detail in Algorithm 2. The R code used for the simulations can be obtained at https://github.com/joakimwallmark/os-gpc-simulation.

Algorithm 2 Simulation Procedure for Each Simulated Scenario
1: Sample items fit to ${SAT}_{2014}$ to match the test length and the number of OS/GPC items in the scenario.2: Sample GPC $θ_{j}$ scores for $j = 1, 2, 3, ..., n_{s i m}$ test takers using the test taker distribution for the given scenario. These are treated as the true GPC $θ$ scores.3: Transform each GPC $θ_{j}$ from the previous step to its equivalent OS $θ_{j}$ using the $θ$ transformation previously described. These are treated as the true OS $θ$ scores. For each $r \in {1, 2, 3, ..., R}$ :4: Generate test data for the sampled items using the true $θ$ scores.5: Fit an OS and a GPC model to the generated dataset.6: Compute $θ_{j}$ estimates ${\hat{θ}}_{j r}$ for each test taker using both fitted models.7: Let ${\hat{P}}_{i m r} (θ)$ denote the estimated probability to get a score of m on item i from a model in simulation iteration r. Compute the following performance measures for both the OS and GPC models: ${KLD}_{i j} = \frac{1}{R} \sum_{r = 1}^{R} \sum_{m = 0}^{M_{i}} P_{i m} (θ_{j}) {log}_{2} (P_{i m} (θ_{j})) - P_{i m} (θ_{j}) {log}_{2} ({\hat{P}}_{i m r} ({\hat{θ}}_{j r})),$ 9 ${BIAS}_{i m j} = \frac{1}{R} \sum_{r = 1}^{R} [{\hat{P}}_{i m r} ({\hat{θ}}_{j r}) - P_{i m} (θ_{j})],$ 10 ${SE}_{i m j} = \sqrt{\frac{1}{R - 1} \sum_{r = 1}^{R} {[{\hat{P}}_{i m r} ({\hat{θ}}_{j r}) - \frac{1}{R} \sum_{q = 1}^{R} {\hat{P}}_{i m q} ({\hat{θ}}_{j q})]}^{2}},$ 11 ${RMSE}_{i m j} = \sqrt{\frac{1}{R} \sum_{r = 1}^{R} {[{\hat{P}}_{i m r} ({\hat{θ}}_{j r}) - P_{i m j} (θ_{j})]}^{2}} .$ 12

Algorithm 2 Simulation Procedure for Each Simulated Scenario

1: Sample items fit to

{SAT}_{2014}

to match the test length and the number of OS/GPC items in the scenario.2: Sample GPC

θ_{j}

scores for

j = 1, 2, 3, ..., n_{s i m}

test takers using the test taker distribution for the given scenario. These are treated as the true GPC

θ

scores.3: Transform each GPC

θ_{j}

from the previous step to its equivalent OS

θ_{j}

using the

θ

transformation previously described. These are treated as the true OS

θ

scores. For each

r \in {1, 2, 3, ..., R}

:4: Generate test data for the sampled items using the true

θ

scores.5: Fit an OS and a GPC model to the generated dataset.6: Compute

θ_{j}

estimates

{\hat{θ}}_{j r}

for each test taker using both fitted models.7: Let

{\hat{P}}_{i m r} (θ)

denote the estimated probability to get a score of m on item i from a model in simulation iteration r. Compute the following performance measures for both the OS and GPC models:

{KLD}_{i j} = \frac{1}{R} \sum_{r = 1}^{R} \sum_{m = 0}^{M_{i}} P_{i m} (θ_{j}) {log}_{2} (P_{i m} (θ_{j})) - P_{i m} (θ_{j}) {log}_{2} ({\hat{P}}_{i m r} ({\hat{θ}}_{j r})),

{BIAS}_{i m j} = \frac{1}{R} \sum_{r = 1}^{R} [{\hat{P}}_{i m r} ({\hat{θ}}_{j r}) - P_{i m} (θ_{j})],

{SE}_{i m j} = \sqrt{\frac{1}{R - 1} \sum_{r = 1}^{R} {[{\hat{P}}_{i m r} ({\hat{θ}}_{j r}) - \frac{1}{R} \sum_{q = 1}^{R} {\hat{P}}_{i m q} ({\hat{θ}}_{j q})]}^{2}},

{RMSE}_{i m j} = \sqrt{\frac{1}{R} \sum_{r = 1}^{R} {[{\hat{P}}_{i m r} ({\hat{θ}}_{j r}) - P_{i m j} (θ_{j})]}^{2}} .

6.1. Simulation Study Results

Table 3.

Simulation Performance Measures

			OS	Abs. Bias		SE		RMSE		KLD
	n	Items	Item Prop.	GPC	OS	GPC	OS	GPC	OS	GPC	OS
Dataset population	1,000	30	0.0	1.77	1.18	5.85	10.34	6.26	10.59	0.022	0.064
	1,000	30	0.5	2.33	1.15	6.05	9.74	6.76	9.95	0.026	0.057
	1,000	30	1.0	3.09	1.18	5.82	9.62	6.91	9.8	0.028	0.057
	1,000	60	0.0	0.95	0.64	4.70	6.72	4.86	6.82	0.013	0.026
	1,000	60	0.5	1.81	0.63	4.72	6.52	5.32	6.61	0.016	0.024
	1,000	60	1.0	2.55	0.60	4.75	6.48	5.67	6.54	0.017	0.023
	5,000	30	0.0	1.81	0.97	5.56	9.00	6.00	9.19	0.020	0.049
	5,000	30	0.5	2.34	1.00	5.78	8.45	6.52	8.64	0.024	0.046
	5,000	30	1.0	3.10	1.13	5.52	8.40	6.67	8.61	0.026	0.049
	5,000	60	0.0	0.95	0.41	4.35	5.31	4.52	5.38	0.011	0.016
	5,000	60	0.5	1.83	0.48	4.37	5.26	5.01	5.33	0.014	0.016
	5,000	60	1.0	2.56	0.45	4.41	5.29	5.39	5.34	0.016	0.016
Skewed population	1,000	30	0.0	2.09	1.29	5.13	10.48	5.70	10.72	0.019	0.068
	1000	30	0.5	2.29	1.39	5.42	10.15	6.10	10.40	0.022	0.066
	1,000	30	1.0	2.76	1.47	5.59	10.24	6.51	10.49	0.024	0.065
	1,000	60	0.0	1.21	0.96	4.39	7.60	4.63	7.76	0.012	0.033
	1,000	60	0.5	1.66	1.00	4.45	7.37	4.93	7.55	0.014	0.032
	1,000	60	1.0	2.12	1.02	4.50	7.26	5.20	7.45	0.015	0.031
	5,000	30	0.0	2.09	1.27	4.85	9.70	5.46	9.98	0.018	0.062
	5,000	30	0.5	2.31	1.38	5.17	9.30	5.89	9.57	0.021	0.061
	5,000	30	1.0	2.82	1.40	5.29	9.14	6.29	9.41	0.023	0.056
	5,000	60	0.0	1.20	0.83	4.04	6.49	4.30	6.66	0.010	0.025
	5,000	60	0.5	1.69	0.79	4.10	6.04	4.63	6.19	0.012	0.022
	5,000	60	1.0	2.18	0.77	4.16	5.70	4.94	5.85	0.014	0.020

Note. Average absolute bias, SE, and RMSE are displayed in percentage points. For each measure, the model with the best performance is emphasized in bold. Averages are calculated across all test takers, items, and categories. OS = optimal scoring; RMSE = root mean squared error; GPC = generalized partial credit; KLD = Kullback–Leibler divergence.

Table 3 shows the performance measures in Algorithm 2 for each method averaged over test takers, items, and item scores. As the total probability bias averaged over all item categories always sum to zero, the bias entries in the table were computed using the absolute values of the bias estimates retrieved using Equation 10. The results from using 10,000 test takers are omitted, as only slight reductions in bias and SE were observed. It is worth highlighting the importance of low bias, as it is crucial for accurate estimation of IRFs and test scores, and the OS model excels in this aspect. In all simulated scenarios, the OS model exhibited a smaller average absolute bias compared to the GPC model, including scenarios where the true IRFs adhered to the parametric shape of the GPC model. The GPC model shows larger bias when more OS items are used for data generation, while the bias from the OS model is more similar in size no matter if OS or GPC items are used to generate data. The bias is close to the same for all sample sizes for both models, but a slight decrease in bias is shown when using OS as the sample sizes increases, provided other parameters are kept the same. However, the OS model has larger average SEs compared to the GPC model, and the decrease in bias does not compensate for the increase in SE when evaluating performance using measures, such as RMSE and KLD, with the exception of the scenario where all items were generated using the OS model and a sample size of 5,000. Increasing the sample size has a marginal effect on reducing average SE when comparing scenarios with the same number of items and proportion of true OS items. In contrast, adding more items leads to substantial improvements in SE, RMSE, and KLD, with the effect on SE being more pronounced for the OS model compared to the GPC model.

Having a skewed population does not have a large effect on any performance measure. One should note that these are average measures computed only for each test takers sampled from the skewed/data set population matching the scenario. Figure 7 plots the resulting bias and RMSE from each model against the true percentile arc length $D_{%} (θ)$ using both the data set and the skewed test taker populations. The plots show scenarios with 5,000 test takers and 60 items when only OS items were used to generate the data.

Figure 7.

Item response functions probability absolute bias and root mean squared error averaged over all items and item scores plotted against % arc length.

It is clear from Figure 7 that the bias is larger for high and low scoring test takers. Having a left skewed population results in larger bias for low scorers. This is expected, as fewer test takers are found closer to the lower $D_{%} (θ)$ boundary when the population is skewed. For the data set population, the difference in bias between the methods is the largest for the high and low scoring test takers, but when the population is skewed, OS shows increased bias toward the lower end, matching the bias of the GPC model. The smaller bias for low/high performing test takers translates to smaller RMSE for the same test takers when using the OS model for the data set population. Equivalent plots to those in Figure 7 from other scenarios would often result in similar curve shapes but different bias/RMSE magnitudes. Simulation estimated absolute bias, SE, RMSE, and KLD for individual item probabilities are supplied in Table C5 in Online Appendix C.

7. Discussion

The aim of this study was to investigate whether the OS model is preferable to the GPC model in terms of model fit and to showcase the efficacy of the arc length $θ$ transformation, both as a method of test taker scoring and as a tool for model comparison.

A key feature of arc length is its invariance to the arbitrary choice of latent trait $θ$ scale for a fitted IRT model. As shown in this study, this property allows it to serve as an effective solution for aligning IRT models on vastly different $θ$ scales to a common ratio scale. On the surprisal arc length scale distances carry an inherent meaning in terms of bits of information, thus facilitating better comparison and interpretation.

In terms of model fit, the OS model showcased a superior fit for multiple polytomously scored items, some examples shown in Figures 3 and 4. These results are in line with previous studies, where OS was compared against parametric alternatives on tests using dichotomous test items (Ramsay & Wiberg, 2017; Wiberg et al., 2019). When comparing model fit to real test data using CV LL and the binning method as described in Section 5.3, OS also showed promising results. OS models with similar or better performance than the GPC model could always be found by tweaking the OS model parameters. However, the process of finding the appropriate number of bins, knots, and the appropriate spline order for the data at hand can be cumbersome and using a more flexible model can lead to overfitting. This was especially true for the Swedish SAT data sets despite the large sample sizes, while the results for models fit to the national mathematics data sets were more conflicting.

One should consider the fact that the test items in the Swedish SAT forms are constructed by experts in test construction and the items are thoroughly tested before being included in an administered test form to the general public. This process should generally lead to items being more closely related to the latent construct(s) the test is aimed to measure and produce items that are more likely to be responded to correctly by the more able students. Despite the inherent simplification of reducing the responses from all items to a single construct, this could be a potential explanation to why, for most items in the data, the estimated IRFs behave in a lenient way, where the probabilities for correct responses monotonically increase with test taker ability. These properties should generally favor simpler parametric models such as the GPC. For other tests where the relationship between item scores and test taker ability is more complex, the benefits of using a nonparametric approach such as OS would most likely be even larger.

Fitting OS models using splines without knots proved to be efficient in this study. In previous research on OS, more flexible spline functions were used. For example, Ramsay et al. (2020) used 18 bins in conjunction with order 5 spline functions and two knots to analyze a 473 sample from the symptom distress scale, a 13 item rating scale form measuring distress in cancer patients. Ramsay et al. (2019) used 53 bins with 24 basis functions to analyze Swedish SAT data. On the contrary, previous studies also applied a roughness penalty on the third derivatives of each spline to prevent overfitting. For this study, the use of penalized smoothing splines resulted in worse performance in terms of the empirical model evaluation measures, and the results were thus omitted. Further research should be conducted on how to obtain a suitable amount of smoothing in a practical way.

It is important to strike a balance between bias and variance when choosing a model, as a model with low bias but high variance can lead to overfitting, while a model with high bias but low variance can fail to model the actual relationship between the underlying ability and the item responses. Through our simulation study, we showed that fitting an OS model results in smaller IRF probability bias compared to fitting a GPC model. As with many nonparametric approaches, this comes with the cost of larger SEs. The GPC model outperformed OS in terms of KLD and item response probability recovery RMSE in most simulated scenarios. Looking at the simulations, it appears one would need large samples containing thousands of test takers in combination with a large number of test items to really reap the benefits of the bias reduction from the nonparametric curve estimation for tests such as the Swedish SAT. However, small bias is often considered more important than small SE. In situations where the RMSE and/or KLD from two competing models are somewhat similar, one may argue that it is more fair that someone gets a different score than they should by chance from larger SEs than by bias from choosing a specific parametric model.

We recognize the assumption of unidimensionality as a limitation of the current study. In reality, multiple abilities often come into play during an exam. While this study did not address multidimensional abilities, we are actively working to extend the OS IRT methodology to handle such scenarios, while maintaining the benefits of our proposed approach. Future publications will provide insights into these multidimensional extensions. In the meantime, readers should interpret our findings in the context of a unidimensional $θ$ .

Overall, OS proves to be a viable alternative to parametric IRT for analyzing test data containing polytomously scored items. Especially for tests containing a large number of test items in combination with large sample sizes. In future studies, it would be of interest to compare OS with other recently introduced nonparametric/semiparametric IRF estimation methods. Examples include Bayesian nonparametric methods (Arenson & Karabatsos, 2018; Duncan & MacEachern, 2008), the monotonic polynomial method (Falk & Cai, 2016; Liang & Browne, 2015) or the approach of using a mix of parametric and nonparametric curves for different items implemented in the mirt R package (Chalmers, 2012).

Supplemental Material

Supplemental Material, sj-docx-1-jeb-10.3102_10769986231207879 - Analyzing Polytomous Test Data: A Comparison Between an Information-Based IRT Model and the Generalized Partial Credit Model

Supplemental Material, sj-docx-1-jeb-10.3102_10769986231207879 for Analyzing Polytomous Test Data: A Comparison Between an Information-Based IRT Model and the Generalized Partial Credit Model by Joakim Wallmark, James O. Ramsay, Juan Li and Marie Wiberg in Journal of Educational and Behavioral Statistics

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research was funded by the Swedish Wallenberg MMW 2019.0129 grant.

ORCID iD

Joakim Wallmark

Note

References

Arenson

E. A.

Karabatsos

(2018). A Bayesian beta-mixture model for nonparametric IRT (BBM-IRT). Journal of Modern Applied Statistical Methods, 17(1), 1–17.

Chalmers

R. P.

(2012). Mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29.

Dorans

N. J.

(1999). Correspondences between ACT and SAT I scores. ETS Research Report Series, 1999(1), i–18.

Duncan

K. A.

MacEachern

S. N.

(2008). Nonparametric Bayesian modelling for item response. Statistical Modelling, 8(1), 41–66.

Emons

W. H.

(2008). Nonparametric person-fit analysis of polytomous item scores. Applied Psychological Measurement 32(3), 224–247.

Falk

C. F.

Cai

(2016). Maximum marginal likelihood estimation of a monotonic polynomial generalized partial credit model with applications to multiple group analysis. Psychometrika, 81(2), 434–460.

(2014). Smoothing spline ANOVA models: R package gss. Smoothing Spline ANOVA Models, 58(5), 1–25.

Lee

Y.-S.

(2007). A comparison of methods for nonparametric estimation of item characteristic curves for binary items. Applied Psychological Measurement, 31(2), 121–134.

Liang

Browne

M. W.

(2015). A quasi-parametric method for fitting flexible item response functions. Journal of Educational and Behavioral Statistics, 40(1), 5–34.

10.

Lord

F. M.

(1980). Applications of item response theory to practical testing problems. Routledge.

11.

Masters

(1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.

12.

Mokken

R. J.

(1997). Nonparametric models for dichotomous responses. In van der Linden

W. J.

Hambleton

R. K.

(Eds.), Handbook of modern item response theory (pp. 351–367). Springer.

13.

Molenaar

I. W.

(1997). Nonparametric models for polytomous responses. In van der Linden

W. J.

Hambleton

R. K.

(Eds.), Handbook of modern item response theory (pp. 369–380). Springer.

14.

Muraki

(1992). A generalized partial credit model: Application of an EM algorithm. ETS Research Report Series, 1992(1), i–30.

15.

Ramsay

J. O.

(1991). Kernel smoothing approaches to nonparametric item characteristic curve estimation. Psychometrika, 56(4), 611–630.

16.

Ramsay

J. O.

(1997). A functional approach to modeling test data. In van der Linden

W. J.

Hambleton

R. K.

(Eds.), Handbook of modern item response theory (pp. 381–394). Springer.

17.

Ramsay

J. O.

Hooker

Graves

(2009). Functional data analysis in R and Matlab. Springer.

18.

Ramsay

J. O.

(2022). TestGardener: Optimal analysis of test and rating scale data. R package version 3.0.0.

19.

Ramsay

J. O.

Wallmark

Wiberg

(2023). An information manifold perspective on psychometrics. [Manuscript submitted for publication].

20.

Ramsay

J. O.

Wiberg

(2020). Better rating scale scores with information-based psychometrics. Psych, 2(4), 347–369.

21.

Ramsay

J. O.

Silverman

(2002). Functional models for test items. In Ramsay

J. O.

Silverman

B. W.

(Eds.), Applied functional data analysis, Chapter 9 (pp. 131–144). Springer-Verlag.

22.

Ramsay

J. O.

Silverman

(2005). Functional data analysis (2nd ed.). Springer-Verlag.

23.

Ramsay

J. O.

Wiberg

(2017). A strategy for replacing sum scoring. Journal of Educational and Behavioral Statistics, 42(3), 282–307.

24.

Ramsay

J. O.

Wiberg

(2019). Full information optimal scoring. Journal of Educational and Behavioral Statistics, 45(3), 297–315.

25.

Rossi

Wang

Ramsay

J. O.

(2002). Nonparametric item response function estimates with the em algorithm. Journal of Educational and Behavioral Statistics, 27(3), 291–317.

26.

Shannon

C. E.

(1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3), 379–423.

27.

Sijtsma

Molenaar

I. W.

(2002). Introduction to nonparametric item response theory (Vol. 5). Sage.

28.

Stochl

Jones

P. B.

Croudace

T. J.

(2012). Mokken scale analysis of mental health and well-being questionnaire item responses: A non-parametric IRT method in empirical research for applied health researchers. BMC Medical Research Methodology, 12(1), 1–16.

29.

Wiberg

Ramsay

J. O.

(2019). Optimal scores: An alternative to parametric item response theory and sum scores. Psychometrika, 84(1), 310–322.

30.

Woods

C. M.

(2006). Ramsay-curve item response theory (RC-IRT) to detect and correct for nonnormal latent variables. Psychological Methods, 11(3), 253–270.

31.

Woods

C. M.

Thissen

(2006). Item response theory with estimation of the latent population distribution using spline-based densities. Psychometrika, 71(2), 281–301.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.05 MB