
Abstract

This article explores innovations for parameter estimation in generalized linear and nonlinear models, which may be used in item response modeling to account for guessing/pretending or slipping/dissimulation and for the effect of covariates. We introduce a new implementation of the EM algorithm and propose a new algorithm based on the parametrized link function. The two novel iterative algorithms are compared to existing methods in a simulation study. Additionally, the study examines software implementation, including the specification of initial values for numerical algorithms and asymptotic properties with an estimation of standard errors. Overall, the newly proposed algorithm based on the parametrized link function outperforms other procedures, especially for small sample sizes. Moreover, the newly implemented EM algorithm provides additional information regarding respondents’ inclination to guess or pretend and slip or dissimulate when answering the item. The study also discusses applications of the methods in the context of the detection of differential item functioning and addresses the measurement error. Methods are offered in the difNLR package and in the interactive application of the ShinyItemAnalysis package; demonstration is provided using real data from psychological and educational assessments.

Keywords

parameter estimation, EM algorithm, generalized linear and nonlinear models, differential item functioning

1. Introduction

In fields such as education, psychology, and health, constructs are commonly measured through multi-item instruments, where understanding how each individual item functions is crucial. Analyzing item functioning not only aids in refining measurement instruments but also provides valuable insights into the behavior and characteristics of different respondent groups. The three-parameter logistic (3PL) and four-parameter logistic (4PL) models are flexible tools that allow capturing of complex item response patterns and accommodating a more comprehensive range of item characteristics, including possible guessing and slipping rates in the context of educational measurement and pretending and dissimulation in the context of psychological and health measurement. However, estimation in these models, both in the item response theory (IRT) (Barton & Lord, 1981; Birnbaum, 1968) and non-IRT framework (Drabinová & Martinková, 2017; Hladká & Martinková, 2020), may become challenging due to several factors, including the complexity of these models caused by their nonlinearity, high-dimensionality of the parameter space, and the nature of the data being analyzed. These models typically require a large sample size (Kim & Oshima, 2013), which can result in computationally demanding fitting. Therefore, efficient algorithms, advanced estimation techniques, and software implementation are crucial for the effectiveness and accessibility of these models’ use in practice.

The growing availability of computing resources has renewed research interest in the 3–4PL IRT models. New approaches to estimation are being studied extensively (Battauz, 2020; Culpepper, 2016; Fu et al., 2021; Loken & Rulison, 2010; Meng et al., 2020), which helps solve some of the computational issues. However, their focus is mainly limited to large-scale assessments, while estimating item parameters with moderate sample sizes remains out of reach. To address the computational issues more effectively and accurately recover item characteristics such as guessing and slipping, the IRT models may benefit from traditional item analysis and generalized linear and nonlinear models (GLNMs), their simpler score-based counterparts (Martinková & Hladká, 2023). The GLNMs can offer improved and more precise starting values for related IRT models and allow for statistical inference regarding item parameters while still being less computationally complex than IRT models.

GLNMs incorporate a class of generalized logistic regression models that are natural extensions of the logistic regression model to describe item functioning. Analogous to 3–4PL IRT models, generalized logistic regression may account for the possibility that an item can be correctly answered or endorsed without the necessary knowledge or trait, for example, due to guessing or pretending. In this case, the logistic regression model is extended by including a parameter defining a lower asymptote of the probability curve, which may be larger than zero. Similarly, the model can consider the possibility that an item is incorrectly answered or opposed by a respondent with a high level of a particular trait due to issues such as inattention, lack of time, or dissimulation; this model includes an upper asymptote of the probability curve, which may be lower than one. These models can be seen as score-based counterparts to 3–4PL IRT models since they assume the same shape of the item response curve; however, in contrast to the class of latent variable models, this approach uses an observed estimate of the underlying latent trait.

Furthermore, logistic regression, its extensions, and their latent variable counterparts have become widely used for identifying between-group differences on item level when responding to multi-item measurements (Swaminathan & Rogers, 1990). The phenomenon, known as differential item functioning (DIF), indicates whether responses to an item vary for respondents with the same level of an underlying latent trait but from different groups (e.g. defined by gender, age, or socioeconomic status). In this vein, DIF detection is essential for a deeper understanding of group differences, assessing the effectiveness of various treatments, or uncovering potential unfairness in educational tests. It is identified as one of the crucial topics in measurement (AERA, APA, & NCME, 2014).

Estimation in the logistic regression model is a straightforward procedure, but extending the parametric space by including additional parameters makes it statistically and computationally more demanding and may result in convergence issues. This is even more pronounced in IRT modeling, where latent ability is estimated together with item parameters. In this vein, GLNMs can be seen as a helpful alternative for describing item functioning and identifying DIF, accounting for possible guessing or inattention while also being accessible in practice.

Therefore, this article examines innovations in the item parameter estimation for the GLNMs in the context of DIF detection. As the main contribution, the work proposes novel iterative algorithms, examines their theoretical properties, and compares the newly proposed methods to existing ones in a simulation study. The use of estimation procedures is then exemplified on real data examples with an application to DIF detection, with the secondary goal of providing possibilities for more accurate DIF detection. Given that GLNMs treat ability as observed, we also address potential biases this approach may bring due to measurement error.

The rest of the manuscript is organized as follows: To begin, Section 2 introduces the GLNMs and their relationship to the IRT framework, examining the estimation techniques. This section provides a detailed description of two existing methods for parameter estimation, the nonlinear least squares (NLS) and the maximum likelihood (ML) method, and their application to fitting GLNMs. Furthermore, as an alternative to the direct implementation of the ML method, this study proposes a novel implementation of the expectation-maximization (EM) algorithm and a new approach based on a parametrized link function (PLF). Additionally, this section provides asymptotic properties of the estimates, an estimation of standard errors, and a software implementation, including a specification of starting values in iterative algorithms. Subsequently, Section 3 describes the design and results of the simulation study. To illustrate the differences and challenges between the existing and newly proposed methods in practice and in the context of DIF detection, this work provides two real data analyses in Section 4. Section 5 contains the discussion and concluding remarks. Finally, Supplemental Appendix A provides asymptotic properties of the discussed estimation approaches, Supplemental Appendix B lists item parameter estimates of real data examples, and Supplemental Appendix C presents an additional study on measurement error. All supplementary material is available at https://osf.io/eu5zm/.

2. Methodology

2.1 Generalized Linear and Nonlinear Models for Item Functioning

GLNMs extend the logistic regression model by accounting for the possibility of guessing or inattention when answering an item. The simple 4PL model describes the functioning of item $i$, that is, the probability that respondent $p$ endorses item $i$, using four parameters:

π_{pi} = P (Y_{pi} = 1 | θ_{p}) = c_{i} + (d_{i} - c_{i}) \frac{\exp (b_{i 0} + b_{i 1} θ_{p})}{1 + \exp (b_{i 0} + b_{i 1} θ_{p})},

(1)

with $θ_{p}$ being an observed trait of respondent $p$ .

2.1.1 Parameter Interpretation

All four parameters have an intuitive interpretation: The parameters $c_{i}$ and $d_{i}$ are the lower and upper asymptotes of the probability sigmoid function $π_{pi} (x)$ since

\lim_{x \to - \infty} π_{pi} (x) = c_{i} \quad \text{and} \quad \lim_{x \to \infty} π_{pi} (x) = d_{i},

where $c_{i} \in [0, 1], d_{i} \in [0, 1]$ and $c_{i} < d_{i}$ if $b_{i 1} > 0$ and $c_{i} > d_{i}$ otherwise. Evidently, with $c_{i} = 0$ and $d_{i} = 1$ , this model recovers a standard logistic regression for item $i$ .
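The role of the four parameters can be checked numerically. The following is a minimal sketch in Python (standard library only, whereas the paper's own implementation is in R); the function names `logistic` and `prob_4pl` and the parameter values are illustrative assumptions, not part of the difNLR package.

```python
import math

def logistic(eta):
    """Numerically stable logistic function exp(eta) / (1 + exp(eta))."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

def prob_4pl(theta, b0, b1, c, d):
    """Probability of a correct response or endorsement under the simple 4PL model (1)."""
    return c + (d - c) * logistic(b0 + b1 * theta)

# Made-up item parameters, chosen only for illustration.
b0, b1, c, d = -0.5, 2.0, 0.2, 0.9

# The curve approaches c for very low and d for very high values of the trait.
for theta in (-10.0, 0.0, 10.0):
    print(theta, round(prob_4pl(theta, b0, b1, c, d), 4))

# With c = 0 and d = 1 the model is a standard logistic regression.
print(prob_4pl(0.3, b0, b1, 0.0, 1.0) == logistic(b0 + b1 * 0.3))
```

The stable form of the logistic function is a design choice that avoids overflow for large negative linear predictors.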

In psychological and health-related assessments, the lower asymptote $c_{i}$ may represent the probability of pretending or simulation, and $1 - d_{i}$ may represent the probability of reluctance to admit difficulties due to social norms or dissimulation. In educational testing, parameter $c_{i}$ can be interpreted as the probability that the respondents guessed the correct answer without possessing the necessary knowledge $θ_{p}$, also known as a pseudo-guessing parameter. On the other hand, $1 - d_{i}$ can be viewed as the probability that respondents were inattentive while their knowledge $θ_{p}$ was sufficient (Hladká & Martinková, 2020), or as a lapse rate (Kingdom & Prins, 2016). Next, the parameter $b_{i 0}$ is an intercept parameter related to the difficulty of item $i$ (or item popularity in psychological and health-related assessments), and parameter $b_{i 1}$ is linked to the slope of the sigmoid curve $π_{pi} (x)$, which is also called the discrimination of the respective item.

2.1.2 Adding Covariates, Group-Specific 4PL Model

The simple model (1) can be further extended by incorporating additional respondents’ characteristics. As a typical example, a binary grouping variable $G_{p}$ might be considered. This variable describes a respondent’s membership to a social group ( $G_{p} = 0$ for the reference group and $G_{p} = 1$ for the focal group), which extends the simple 4PL model (1) to a group-specific form:

\begin{matrix} π_{pi} = P (Y_{pi} = 1 | θ_{p}, G_{p}) = c_{i} + c_{i DIF} G_{p} \\ + (d_{i} - d_{i DIF} G_{p} - c_{i} - c_{i DIF} G_{p}) \frac{\exp (b_{i 0} + b_{i 1} θ_{p} + b_{i 2} G_{p} + b_{i 3} θ_{p} \cdot G_{p})}{1 + \exp (b_{i 0} + b_{i 1} θ_{p} + b_{i 2} G_{p} + b_{i 3} θ_{p} \cdot G_{p})}, \end{matrix}

(2)

which is suitable for testing DIF; see also Hladká and Martinková (2020) and Martinková and Hladká (2023, Chapter 9).

However, the models (1) and (2) can be further generalized. Instead of using a single variable $θ_{p}$ to describe item functioning or added grouping variable $G_{p}$ to test for DIF, we introduce in this paper a vector of covariates $X_{p} = {(1, X_{p 1}, \dots, X_{pk})}^{⊤}$ , $p = 1, \dots, n$ , which includes the original observed trait and an intercept term. This process produces extra parameters $b_{i} = {(b_{i 0}, \dots, b_{ik})}^{⊤}$ . Beyond this, even asymptotes may depend on respondents’ characteristics $Z_{p} = {(1, Z_{p 1}, \dots, Z_{pj})}^{⊤}$ , $p = 1, \dots, n$ , which are not necessarily the same as $X_{p}$ . This general covariate-specific 4PL model is of form

π_{pi} = P (Y_{pi} = 1 | X_{p}, Z_{p}) = Z_{p}^{⊤} c_{i} + (Z_{p}^{⊤} d_{i} - Z_{p}^{⊤} c_{i}) \frac{\exp (X_{p}^{⊤} b_{i})}{1 + \exp (X_{p}^{⊤} b_{i})},

(3)

where $c_{i} = {(c_{i 0}, \dots, c_{ij})}^{⊤}$ and $d_{i} = {(d_{i 0}, \dots, d_{ij})}^{⊤}$ are asymptote parameters for item $i$. Note that we typically assume that $Z_{p}$ contains categorical rather than continuous variables describing respondents’ characteristics to keep a meaningful interpretation, while requiring $0 \leq Z_{p}^{⊤} c_{i} < Z_{p}^{⊤} d_{i} \leq 1$. With $X_{p} = {(1, θ_{p})}^{⊤}$ and $Z_{p} = 1$, we get the simple 4PL model (1), while the choice of $X_{p} = {(1, θ_{p}, G_{p}, θ_{p} G_{p})}^{⊤}$ and $Z_{p} = {(1, G_{p})}^{⊤}$ yields the group-specific 4PL model (2).
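The reduction of the covariate-specific model (3) to the group-specific model (2) can be illustrated with a short Python sketch. It assumes that $X_{p}$ also contains the interaction $θ_{p} G_{p}$, which Equation (2) requires, and that the second element of $d_{i}$ equals $- d_{i DIF}$ under the sign convention of Equation (2) as printed. All function names and numeric values are hypothetical.

```python
import math

def logistic(eta):
    """Numerically stable logistic function."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

def dot(u, v):
    """Inner product of two equally long sequences."""
    return sum(a * b for a, b in zip(u, v))

def prob_covariate_4pl(x, z, b, c, d):
    """Model (3): x and z are covariate vectors; b, c, and d are parameter vectors."""
    lower = dot(z, c)
    upper = dot(z, d)
    return lower + (upper - lower) * logistic(dot(x, b))

def prob_group_4pl(theta, g, b0, b1, b2, b3, c, c_dif, d, d_dif):
    """Model (2) as printed, with upper asymptote d - d_dif * g."""
    lower = c + c_dif * g
    upper = d - d_dif * g
    eta = b0 + b1 * theta + b2 * g + b3 * theta * g
    return lower + (upper - lower) * logistic(eta)

theta, g = 0.4, 1                      # a focal-group respondent
x = (1.0, theta, g, theta * g)         # intercept, trait, group, interaction
z = (1.0, g)                           # intercept and group for the asymptotes

p3 = prob_covariate_4pl(x, z, b=(-0.5, 2.0, 0.3, -0.4), c=(0.2, 0.05), d=(0.9, -0.1))
p2 = prob_group_4pl(theta, g, -0.5, 2.0, 0.3, -0.4, 0.2, 0.05, 0.9, 0.1)
print(abs(p3 - p2) < 1e-12)
```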

2.1.3 Testing for DIF

The group-specific 4PL model (2) can then be used to test between-group differences on the item level with a DIF analysis (Hladká & Martinková, 2020), for example, with the likelihood ratio test. The likelihood ratio test measures the difference between the log-likelihood $l_{i 1}$ of the larger model (e.g. the group-specific 4PL model (2)) and the log-likelihood $l_{i 0}$ of its submodel (e.g. the simple 4PL model (1)) for the given item $i$. The resulting $LR_{i}$ statistic has an asymptotic $χ^{2}$-distribution under the smaller model with degrees of freedom equal to the difference in the number of parameters of the two models:

LR_{i} = - 2 (l_{i 0} - l_{i 1}) \overset{D}{\underset{n \to \infty}{\to}} χ^{2} (df_{i 1} - df_{i 0}) .

(4)

Similarly, any two nested submodels of the group-specific 4PL model (2) can be compared to test for the significance of group-related item parameters. Note that in GLNMs, DIF detection is usually performed item by item.
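As a worked illustration of the likelihood ratio test, the sketch below computes the statistic and its asymptotic p-value in Python. The log-likelihood values are made up. The helper `chi2_sf_even_df` is a hypothetical name and uses a closed form that is valid only for even degrees of freedom, which covers the comparison of model (2) with eight parameters against model (1) with four parameters. A general implementation would rely on a statistical library instead.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper tail probability of the chi-square distribution for even df (closed form)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k          # (x / 2)^k / k!
        total += term
    return math.exp(-half) * total

def lr_test(loglik_sub, loglik_full, n_par_sub, n_par_full):
    """Likelihood ratio statistic, degrees of freedom, and asymptotic p-value."""
    lr = -2.0 * (loglik_sub - loglik_full)
    df = n_par_full - n_par_sub
    return lr, df, chi2_sf_even_df(lr, df)

# Hypothetical log-likelihoods of one item under models (1) and (2).
lr, df, p_value = lr_test(loglik_sub=-612.4, loglik_full=-605.1, n_par_sub=4, n_par_full=8)
print(lr, df, p_value)
```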

2.1.4 Matching Criterion

In these models, $θ_{p}$ is an observed variable describing the measured trait of the respondent, such as anxiety, fatigue, quality of life, or math ability, here called the matching criterion. In the context of the logistic regression method for DIF detection, the total test score (or its standardized version) is typically used as the matching criterion (Swaminathan & Rogers, 1990). Other options for the matching criterion include a pretest score (to identify differential item functioning in change; see Martinková et al., 2020), a score on another test measuring the same construct, or an estimate of the latent trait provided by an IRT model.

2.1.5 IRT Framework

Within the IRT framework, the matching criterion $θ_{p}$ in the models (1)–(3) is latent, necessitating joint estimation with item parameters. Both frameworks share the same shape of item characteristic curves with a comparable interpretation. A notable distinction lies in the estimation process: IRT models estimate parameters for all items simultaneously, which is typically not the case for models (1)–(3), as described below. However, GLNMs entail lower computational demands, as they require smaller sample sizes to yield precise estimates. The estimation algorithms for GLNMs, which are the focus of this article, may thus be further incorporated into the IRT framework, as described in Section 5.

2.2 Estimation of Item Parameters

Numerous algorithms are available to estimate item parameters in the covariate-specific 4PL model (3). First, this section describes two methods that may be directly implemented in the existing software: The NLS method and the ML method. Next, the study introduces two newly proposed iterative algorithms, which might improve the implementation of the computationally demanding ML method: The EM algorithm inspired by the work of Dinse (2011) and an iterative algorithm based on PLF. The described and proposed algorithms treat ability as known or estimated a priori and estimate parameters for each item separately, which is suitable in the GLNM framework. We provide a discussion on incorporating joint ability and item parameter estimation in Section 5.

2.2.1 Nonlinear Least Squares

The parameter estimates of the covariate-specific 4PL model (3) can be determined using the NLS method (Dennis et al., 1981; Drabinová & Martinková, 2017; Hladká & Martinková, 2020), which is based on minimization of the residual sum of squares (RSS) of item $i$ with respect to item parameters $(b_{i}, c_{i}, d_{i})$ :

RSS_{i} (b_{i}, c_{i}, d_{i}) = \sum_{p = 1}^{n} {[Y_{pi} - π_{pi}]}^{2} = \sum_{p = 1}^{n} {[Y_{pi} - Z_{p}^{⊤} c_{i} - (Z_{p}^{⊤} d_{i} - Z_{p}^{⊤} c_{i}) \frac{\exp (X_{p}^{⊤} b_{i})}{1 + \exp (X_{p}^{⊤} b_{i})}]}^{2},

(5)

where $n$ is the number of respondents. Since the criterion function $RSS_{i} (b_{i}, c_{i}, d_{i})$ is continuously differentiable with respect to item parameters $(b_{i}, c_{i}, d_{i})$, the minimizer can be sought among the points where the gradient is zero. Thus, the minimization process involves a calculation of the first partial derivatives with respect to item parameters $(b_{i}, c_{i}, d_{i})$ and finding a solution of the relevant nonlinear estimating equations (e.g., van der Vaart, 1998, Chapter 5). Since the asymptotes $Z_{p}^{⊤} c_{i}$ and $Z_{p}^{⊤} d_{i}$ represent probabilities, it is necessary to ensure that these expressions are kept in the interval $[0, 1]$, which is accomplished using numerical approaches.
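The NLS criterion itself is simple to evaluate. The following Python sketch computes Equation (5) for the simple 4PL model (1), that is, with $X_{p} = (1, θ_{p})$ and $Z_{p} = 1$. The function name `rss_4pl` and the toy data are assumptions for illustration. The minimization under box constraints, which the paper delegates to existing software, is not reproduced here.

```python
import math

def logistic(eta):
    """Numerically stable logistic function."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

def rss_4pl(y, theta, b0, b1, c, d):
    """Residual sum of squares of Equation (5) for the simple 4PL model (1)."""
    if not (0.0 <= c <= 1.0 and 0.0 <= d <= 1.0):
        raise ValueError("asymptotes must stay within [0, 1]")
    total = 0.0
    for yi, ti in zip(y, theta):
        pi = c + (d - c) * logistic(b0 + b1 * ti)
        total += (yi - pi) ** 2
    return total

# Toy data: responses of six respondents and their observed trait values.
y = [0, 0, 1, 0, 1, 1]
theta = [-1.5, -0.8, -0.2, 0.1, 0.9, 1.7]

# The criterion is compared at two made-up parameter vectors.
print(rss_4pl(y, theta, b0=0.0, b1=2.0, c=0.1, d=0.9))
print(rss_4pl(y, theta, b0=0.0, b1=-2.0, c=0.9, d=0.1))
```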

The asymptotic properties of the NLS estimator, such as consistency and asymptotic distribution, can be derived under the classical set of regularity conditions (e.g., van der Vaart, 1998, Theorems 5.41 and 5.42; see also Supplemental Appendix A.1 for more details). This study proposes the sandwich estimator defined by Equation (A1) in Supplemental Appendix A.1 as a natural estimate of the asymptotic variance of the NLS estimate.

2.2.2 Maximum Likelihood

The second option for estimating item parameters in the covariate-specific 4PL model (3) is the ML method (Hladká & Martinková, 2020). Using a notation $ϕ_{pi} = \frac{\exp (X_{p}^{⊤} b_{i})}{1 + \exp (X_{p}^{⊤} b_{i})}$ , the corresponding likelihood function for item $i$ has the following form:

L_{i} (b_{i}, c_{i}, d_{i}) = \prod_{p = 1}^{n} {[Z_{p}^{⊤} c_{i} + (Z_{p}^{⊤} d_{i} - Z_{p}^{⊤} c_{i}) ϕ_{pi}]}^{Y_{pi}} {[1 - Z_{p}^{⊤} c_{i} - (Z_{p}^{⊤} d_{i} - Z_{p}^{⊤} c_{i}) ϕ_{pi}]}^{1 - Y_{pi}},

and the log-likelihood function is then given by,

\begin{matrix} l_{i} (b_{i}, c_{i}, d_{i}) = \\ \sum_{p = 1}^{n} \{Y_{pi} \log (Z_{p}^{⊤} c_{i} + (Z_{p}^{⊤} d_{i} - Z_{p}^{⊤} c_{i}) ϕ_{pi}) + (1 - Y_{pi}) \log (1 - Z_{p}^{⊤} c_{i} - (Z_{p}^{⊤} d_{i} - Z_{p}^{⊤} c_{i}) ϕ_{pi})\} . \end{matrix}

The parameter estimates are obtained by maximization of the log-likelihood function. Thus, this approach proceeds similarly to the logistic regression model, except for a larger dimension of the parametric space. To find the maximizer of the log-likelihood function $l_{i} (b_{i}, c_{i}, d_{i})$ , the first partial derivatives are set to zero and these so-called “likelihood equations” must be solved. However, the solution of a system of nonlinear equations cannot be derived algebraically and needs to be numerically estimated using a suitable iterative process.
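For the simple 4PL model (1), the log-likelihood above can be evaluated as in the following Python sketch. The function name `loglik_4pl` and the toy data are illustrative assumptions. In practice the function would be passed to a bounded numerical optimizer rather than evaluated by hand.

```python
import math

def logistic(eta):
    """Numerically stable logistic function."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

def loglik_4pl(y, theta, b0, b1, c, d):
    """Log-likelihood of item responses under the simple 4PL model (1)."""
    total = 0.0
    for yi, ti in zip(y, theta):
        pi = c + (d - c) * logistic(b0 + b1 * ti)
        # Each respondent contributes log(pi) for Y = 1 and log(1 - pi) for Y = 0.
        total += math.log(pi) if yi == 1 else math.log(1.0 - pi)
    return total

# Toy data and two made-up parameter vectors.
y = [0, 0, 1, 0, 1, 1]
theta = [-1.5, -0.8, -0.2, 0.1, 0.9, 1.7]
print(loglik_4pl(y, theta, b0=0.0, b1=2.0, c=0.1, d=0.9))
print(loglik_4pl(y, theta, b0=0.0, b1=0.5, c=0.1, d=0.9))
```

The values are always negative because every term is the logarithm of a probability strictly between zero and one.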

Using van der Vaart’s (1998) Theorems 5.41 and 5.42, consistency and asymptotic normality can be shown for the ML estimator; see Supplemental Appendix A.2 for more details. Additionally, the estimate of the asymptotic variance of the item parameters is an inverse of the observed information matrix defined by Equation (A2).

2.2.3 EM Algorithm

The ML method may be computationally demanding, and iterative algorithms might help in those situations. Inspired by the work of Dinse (2011), this study adopts a version of the EM algorithm (Dempster et al., 1977) for parameter estimation in the covariate-specific 4PL model (3).

To make use of the EM algorithm, the original 4PL model can be reformulated as a mixture model employing latent classes indicating different types of respondents. In our setting, we consider four mutually exclusive latent variables ($W_{pi 1}$, $W_{pi 2}$, $W_{pi 3}$, $W_{pi 4}$), where $W_{pij} = 1$ indicates that respondent $p$ belongs to category $j = 1, \dots, 4$ for item $i$, whereas $W_{pij} = 0$ indicates that the respondent does not belong to this category. In the context of educational, psychological, health-related, or other types of multi-item measurement, the four categories can be interpreted as follows: Categories 1 and 2 indicate whether a respondent who responded correctly to item $i$ or endorsed it (i.e. $Y_{pi} = 1$) was determined to do so ($W_{pi 1} = 1$, e.g. the respondent guessed the correct answer while their knowledge or ability was insufficient, or the respondent simulated the described situation) or not ($W_{pi 2} = 1$, e.g. had sufficient knowledge or ability to answer correctly and did not guess, or endorsed while experiencing the described situation). On the other hand, Categories 3 and 4 indicate whether the respondent who did not respond correctly or did not endorse the item (i.e. $Y_{pi} = 0$) was prone to do so ($W_{pi 3} = 1$, e.g. did not have sufficient knowledge, ability, or trait) or not ($W_{pi 4} = 1$, e.g. incorrectly answered, or did not endorse, due to another reason such as inattention, lack of time, or dissimulation). Thus, the observed indicator $Y_{pi}$ and its complement $1 - Y_{pi}$ can be rewritten as $Y_{pi} = W_{pi 1} + W_{pi 2}$ and $1 - Y_{pi} = W_{pi 3} + W_{pi 4}$ (Figure 1). While the latent variables $W_{pi 2}$ and $W_{pi 3}$ represent desired response styles, $W_{pi 1}$ and $W_{pi 4}$ represent undesired response styles.

Figure 1.

Graphical representation of the relationships among latent variables for the EM algorithm.

Let $Z_{p}^{⊤} c_{i}$ be the regressor-based probability that the respondent was determined to respond to item $i$ correctly or endorse it (Category 1), and let $Z_{p}^{⊤} d_{i}$ be the regressor-based probability that the respondent belongs to one of Categories 1–3, that is, does not fall into Category 4. Then $Z_{p}^{⊤} d_{i} - Z_{p}^{⊤} c_{i}$ gives the regressor-based probability that the respondent was not determined but prone to (Categories 2 and 3). Further, for such respondents, we denote by $ϕ_{pi}$ and $1 - ϕ_{pi}$ the probabilities of answering a given item correctly (Category 2) and incorrectly (Category 3), respectively, depending on the regressors $X_{p}$. Finally, the probability that the respondent did not respond correctly and was not prone to do so is given by $1 - (Z_{p}^{⊤} d_{i} - Z_{p}^{⊤} c_{i}) - Z_{p}^{⊤} c_{i} = 1 - Z_{p}^{⊤} d_{i}$ (Category 4). In summary, the expected values of the latent variables are then given by the following terms:

Z_{p}^{⊤} c_{i}, (Z_{p}^{⊤} d_{i} - Z_{p}^{⊤} c_{i}) ϕ_{pi}, (Z_{p}^{⊤} d_{i} - Z_{p}^{⊤} c_{i}) (1 - ϕ_{pi}), 1 - Z_{p}^{⊤} d_{i},

and the probability of a correct response or endorsement is given by

\begin{matrix} P (Y_{pi} = 1 | X_{p}) = P (W_{pi 1} + W_{pi 2} = 1 | X_{p}) = P (W_{pi 1} = 1 | X_{p}) + P (W_{pi 2} = 1 | X_{p}) \\ = Z_{p}^{⊤} c_{i} + (Z_{p}^{⊤} d_{i} - Z_{p}^{⊤} c_{i}) ϕ_{pi}, \end{matrix}

which under the logistic model $ϕ_{pi} = \frac{\exp (X_{p}^{⊤} b_{i})}{1 + \exp (X_{p}^{⊤} b_{i})}$ produces the covariate-specific 4PL model (3).

Using the setting of the latent variables, the corresponding log-likelihood function for item $i$ takes the following form:

\begin{matrix} l_{i}^{EM} = \sum_{p = 1}^{n} [W_{pi 2} \log (ϕ_{pi}) + W_{pi 3} \log (1 - ϕ_{pi})] \\ + \sum_{p = 1}^{n} [W_{pi 1} \log (Z_{p}^{⊤} c_{i}) + W_{pi 4} \log (1 - Z_{p}^{⊤} d_{i}) + (W_{pi 2} + W_{pi 3}) \log (Z_{p}^{⊤} d_{i} - Z_{p}^{⊤} c_{i})] \\ = l_{i 1}^{EM} + l_{i 2}^{EM} . \end{matrix}

The log-likelihood function $l_{i 1}^{EM}$ includes only parameters $b_{i}$ and regressors $X_{p}$ , whereas the log-likelihood function $l_{i 2}^{EM}$ incorporates only parameters related to the asymptotes of the sigmoid function and includes only regressors $Z_{p}$ . Notably, the log-likelihood function $l_{i 1}^{EM}$ has a form of the log-likelihood function for the logistic regression. However, in contrast to the logistic regression model, in this setting, it does not necessarily hold that $W_{pi 2} + W_{pi 3} = 1$ since the correct answer could be guessed or the respondent could be inattentive, producing $W_{pi 2} + W_{pi 3} = 0$ . The log-likelihood function $l_{i 2}^{EM}$ takes the form of the log-likelihood for multinomial data with one trial and with the regressor-based probabilities $Z_{p}^{⊤} c_{i}$ , $Z_{p}^{⊤} d_{i} - Z_{p}^{⊤} c_{i}$ , and $1 - Z_{p}^{⊤} d_{i}$ .

The EM algorithm estimates item parameters in two steps—expectation and maximization. These two steps are repeated until the convergence criterion is met, such as until the change in log-likelihood is lower than a predefined value. Since the EM algorithm is designed to obtain ML estimates, their asymptotic properties are the same as described in Supplemental Appendix A.2.

2.2.3.1 Expectation

At the E-step, conditionally on the item responses $Y_{pi}$ and the current parameter estimate $({\hat{b}}_{i}, {\hat{c}}_{i}, {\hat{d}}_{i})$ , the estimates of latent variables are calculated as their expected values:

\begin{matrix} {\hat{W}}_{pi 1} = \frac{Z_{p}^{⊤} {\hat{c}}_{i} Y_{pi}}{Z_{p}^{⊤} {\hat{c}}_{i} + (Z_{p}^{⊤} {\hat{d}}_{i} - Z_{p}^{⊤} {\hat{c}}_{i}) {\hat{ϕ}}_{pi}}, {\hat{W}}_{pi 2} = Y_{pi} - {\hat{W}}_{pi 1}, \\ {\hat{W}}_{pi 4} = \frac{(1 - Z_{p}^{⊤} {\hat{d}}_{i}) (1 - Y_{pi})}{1 - Z_{p}^{⊤} {\hat{c}}_{i} - (Z_{p}^{⊤} {\hat{d}}_{i} - Z_{p}^{⊤} {\hat{c}}_{i}) {\hat{ϕ}}_{pi}}, {\hat{W}}_{pi 3} = 1 - Y_{pi} - {\hat{W}}_{pi 4} . \end{matrix}

(6)
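Equation (6) translates directly into code. The Python sketch below evaluates the E-step for a single respondent under the simple 4PL model, where the asymptotes are the scalars $c$ and $d$. The function name `e_step` and the numeric values are illustrative assumptions.

```python
def e_step(y, phi, c, d):
    """Expected values of the four latent indicators from Equation (6).

    y is the observed response (0 or 1), phi the current logistic probability,
    and c and d the current asymptote estimates of the simple 4PL model.
    """
    pi = c + (d - c) * phi                   # model probability of Y = 1
    w1 = c * y / pi                          # correct because determined to be
    w2 = y - w1                              # correct and not determined
    w4 = (1.0 - d) * (1 - y) / (1.0 - pi)    # incorrect but not prone to be
    w3 = 1 - y - w4                          # incorrect and prone to be
    return w1, w2, w3, w4

# Made-up current estimates: c = 0.2, d = 0.9, phi = 0.5.
correct = e_step(1, 0.5, 0.2, 0.9)
incorrect = e_step(0, 0.5, 0.2, 0.9)
print(correct)
print(incorrect)
```

For a correct response only the first two expected values can be nonzero, and for an incorrect response only the last two, which mirrors the identities $Y_{pi} = W_{pi 1} + W_{pi 2}$ and $1 - Y_{pi} = W_{pi 3} + W_{pi 4}$.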

2.2.3.2 Maximization

At the M-step, conditionally on the current estimates of the latent variables ${\hat{W}}_{pi 2}$ and ${\hat{W}}_{pi 3}$ , the estimates of parameters $b_{i}$ maximize the log-likelihood function $l_{i 1}^{EM}$ . The estimates ${\hat{c}}_{i}$ and ${\hat{d}}_{i}$ are given by a maximization of the log-likelihood function $l_{i 2}^{EM}$ conditionally on current estimates of the latent variables ${\hat{W}}_{pi 1}$ , ${\hat{W}}_{pi 2}$ , ${\hat{W}}_{pi 3}$ , and ${\hat{W}}_{pi 4}$ .

Because the EM algorithm is designed to yield the ML estimates of the item parameters, the estimates have the same asymptotic properties as described above.
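The two steps can be combined into a complete loop. The following Python sketch covers only the simple 4PL model (1) and departs from the paper's R implementation in two labeled ways: the update of $b_{i}$ uses a hand-written Newton iteration instead of glm(), and the asymptote update uses the closed-form multinomial solution that exists when $Z_{p} = 1$ instead of multinom(). Function names, starting values, and the simulated data are assumptions.

```python
import math
import random

def logistic(eta):
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

def loglik(y, theta, b0, b1, c, d):
    """Observed-data log-likelihood of the simple 4PL model (1)."""
    total = 0.0
    for yi, ti in zip(y, theta):
        pi = c + (d - c) * logistic(b0 + b1 * ti)
        total += math.log(pi) if yi == 1 else math.log(1.0 - pi)
    return total

def weighted_logistic(theta, w2, w3, b0, b1, n_newton=25):
    """Maximize sum(w2 * log(phi) + w3 * log(1 - phi)) over (b0, b1) by Newton's method."""
    for _ in range(n_newton):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for ti, p2, p3 in zip(theta, w2, w3):
            phi = logistic(b0 + b1 * ti)
            r = p2 - (p2 + p3) * phi              # score contribution
            v = (p2 + p3) * phi * (1.0 - phi)     # weight of the negative Hessian
            g0 += r
            g1 += r * ti
            h00 += v
            h01 += v * ti
            h11 += v * ti * ti
        det = h00 * h11 - h01 * h01
        s0 = (h11 * g0 - h01 * g1) / det
        s1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + s0, b1 + s1
        if abs(s0) + abs(s1) < 1e-10:
            break
    return b0, b1

def em_4pl(y, theta, b0=0.0, b1=1.0, c=0.1, d=0.9, max_iter=500, tol=1e-6):
    n = len(y)
    trace = [loglik(y, theta, b0, b1, c, d)]
    for _ in range(max_iter):
        # E-step: expected latent indicators, Equation (6).
        w1, w2, w3, w4 = [], [], [], []
        for yi, ti in zip(y, theta):
            phi = logistic(b0 + b1 * ti)
            pi = c + (d - c) * phi
            a1 = c * yi / pi
            a4 = (1.0 - d) * (1 - yi) / (1.0 - pi)
            w1.append(a1)
            w2.append(yi - a1)
            w4.append(a4)
            w3.append(1 - yi - a4)
        # M-step: logistic part for b, closed-form multinomial part for c and d.
        b0, b1 = weighted_logistic(theta, w2, w3, b0, b1)
        c = sum(w1) / n
        d = 1.0 - sum(w4) / n
        trace.append(loglik(y, theta, b0, b1, c, d))
        if abs(trace[-1] - trace[-2]) < tol:
            break
    return b0, b1, c, d, trace

# Simulated item with made-up true values b0 = -0.3, b1 = 2, c = 0.2, d = 0.9.
rng = random.Random(1)
theta = [rng.gauss(0.0, 1.0) for _ in range(500)]
y = [1 if rng.random() < 0.2 + 0.7 * logistic(-0.3 + 2.0 * t) else 0 for t in theta]
b0_hat, b1_hat, c_hat, d_hat, trace = em_4pl(y, theta)
print(b0_hat, b1_hat, c_hat, d_hat, len(trace) - 1)
```

The stored trace makes it possible to check that the observed log-likelihood does not decrease between iterations, which is the defining property of EM updates.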

Additionally, it might be of practical interest that the EM algorithm provides estimates of latent variables $W_{pi 1}$ , $W_{pi 2}$ , $W_{pi 3}$ , and $W_{pi 4}$ . Their mean values over all items may be interpreted in an educational context as follows: The ${\bar{W}}_{p 1}$ as (undesired) “inclination to guess”; ${\bar{W}}_{p 2}$ as a (desired) probability of “knowing correct answers when correctly answering”; ${\bar{W}}_{p 3}$ as a (desired) probability of “not knowing correct answers when incorrectly answering”; and ${\bar{W}}_{p 4}$ as (undesired) “inclination to slipping/inattention.” In the context of psychological or health-related measurements, the mean values may be interpreted as follows: The ${\bar{W}}_{p 1}$ as the (undesired) “inclination to simulate”; ${\bar{W}}_{p 2}$ as the (desired) probability of “endorsing while experiencing described situations”; ${\bar{W}}_{p 3}$ as the (desired) probability of “not endorsing while not experiencing described situations”; and ${\bar{W}}_{p 4}$ as the (undesired) “inclination to dissimulate.”

2.2.4 Parametrized Link Function

In our setting, the covariate-specific 4PL model (3) can be viewed as a generalized linear model with a known PLF,

g (μ_{pi}; c_{i}, d_{i}) = \log (\frac{μ_{pi} - Z_{p}^{⊤} c_{i}}{Z_{p}^{⊤} d_{i} - μ_{pi}}),

(7)

where the parameters $c_{i}$ and $d_{i}$ are unknown and may depend on regressors $Z_{p}$ . Subsequently, the mean function is determined by $μ_{pi} = π_{pi}$ as given by Equation (3) with a linear predictor $X_{p}^{⊤} b_{i}$ . If the asymptote parameters were known, the estimation could proceed analogously to generalized linear models (see, e.g. Dobson & Barnett, 2018), specifically the standard logistic regression model. However, since the asymptote parameters are unknown, an additional step to estimate them is required.

Keeping this setting in mind, this study proposes a new two-stage algorithm to estimate item parameters using the PLF, see Equation (7), which involves repeating two steps until the convergence criterion is fulfilled. Similar to the EM algorithm, the PLF-based estimation method is designed to compute ML estimates; their asymptotic properties align with those detailed in Supplemental Appendix A.2.

2.2.4.1 Step One

First, conditionally on the current estimates ${\hat{c}}_{i}$ and ${\hat{d}}_{i}$ of the PLF, the estimates of parameters $b_{i}$ maximize the following log-likelihood function:

\begin{matrix} l_{i 1}^{PL} (b_{i} | {\hat{c}}_{i}, {\hat{d}}_{i}) = \\ \sum_{p = 1}^{n} \{Y_{pi} \log (Z_{p}^{⊤} {\hat{c}}_{i} + (Z_{p}^{⊤} {\hat{d}}_{i} - Z_{p}^{⊤} {\hat{c}}_{i}) ϕ_{pi}) + (1 - Y_{pi}) \log (1 - Z_{p}^{⊤} {\hat{c}}_{i} - (Z_{p}^{⊤} {\hat{d}}_{i} - Z_{p}^{⊤} {\hat{c}}_{i}) ϕ_{pi})\} . \end{matrix}

The log-likelihood function $l_{i 1}^{PL} (b_{i} | {\hat{c}}_{i}, {\hat{d}}_{i})$ has a similar form to the log-likelihood function $l_{i} (b_{i}, c_{i}, d_{i})$ using the ML method. However, the parameters $c_{i}$ and $d_{i}$ are here replaced by their current estimates, ${\hat{c}}_{i}$ and ${\hat{d}}_{i}$ .

2.2.4.2 Step Two

Next, estimates ${\hat{c}}_{i}$ and ${\hat{d}}_{i}$ of the PLF, see Equation (7), are calculated conditionally on the current estimates ${\hat{b}}_{i}$ as the arguments of the maxima of the following log-likelihood function

\begin{matrix} l_{i 2}^{PL} (c_{i}, d_{i} | {\hat{b}}_{i}) = \sum_{p = 1}^{n} \{Y_{pi} \log (Z_{p}^{⊤} c_{i} + (Z_{p}^{⊤} d_{i} - Z_{p}^{⊤} c_{i}) {\hat{ϕ}}_{pi}) \\ + (1 - Y_{pi}) \log (1 - Z_{p}^{⊤} c_{i} - (Z_{p}^{⊤} d_{i} - Z_{p}^{⊤} c_{i}) {\hat{ϕ}}_{pi})\} . \end{matrix}

Again, the parameters $b_{i}$ are replaced by their estimates ${\hat{b}}_{i}$ , and $ϕ_{pi}$ is thus replaced by ${\hat{ϕ}}_{pi}$ .

In summary, the division into two sets of parameters makes the algorithm based on PLF easy to implement in the R software, where it can take advantage of existing functions. Because the algorithm is designed to produce the ML estimates, their asymptotic properties are the same as described above.
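The alternating scheme can be sketched in Python for the simple 4PL model (1). This is not the paper's implementation: both steps are replaced here by a few hand-written Fisher scoring iterations with step halving, whereas the paper uses existing R routines. The names `scoring_step` and `fit_plf`, the starting values, and the simulated data are assumptions.

```python
import math
import random

def logistic(eta):
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

def loglik(y, theta, par):
    """Log-likelihood of model (1); minus infinity outside 0 <= c < d <= 1."""
    b0, b1, c, d = par
    if not 0.0 <= c < d <= 1.0:
        return -math.inf
    total = 0.0
    for yi, ti in zip(y, theta):
        mu = c + (d - c) * logistic(b0 + b1 * ti)
        if not 0.0 < mu < 1.0:
            return -math.inf
        total += math.log(mu) if yi == 1 else math.log(1.0 - mu)
    return total

def scoring_step(y, theta, par, block):
    """One Fisher scoring step on block 'b' = (b0, b1) or 'cd' = (c, d).

    The step is halved until the log-likelihood does not decrease.
    """
    b0, b1, c, d = par
    g0 = g1 = i00 = i01 = i11 = 0.0
    for yi, ti in zip(y, theta):
        phi = logistic(b0 + b1 * ti)
        mu = c + (d - c) * phi
        if block == "b":                     # derivatives of mu in (b0, b1)
            m = (d - c) * phi * (1.0 - phi)
            j0, j1 = m, m * ti
        else:                                # derivatives of mu in (c, d)
            j0, j1 = 1.0 - phi, phi
        var = mu * (1.0 - mu)
        g0 += (yi - mu) / var * j0
        g1 += (yi - mu) / var * j1
        i00 += j0 * j0 / var
        i01 += j0 * j1 / var
        i11 += j1 * j1 / var
    det = i00 * i11 - i01 * i01
    s0 = (i11 * g0 - i01 * g1) / det
    s1 = (i00 * g1 - i01 * g0) / det
    old, step = loglik(y, theta, par), 1.0
    while step > 1e-8:
        if block == "b":
            cand = (b0 + step * s0, b1 + step * s1, c, d)
        else:
            cand = (b0, b1, c + step * s0, d + step * s1)
        if loglik(y, theta, cand) >= old:
            return cand
        step /= 2.0
    return par

def fit_plf(y, theta, par=(0.0, 1.0, 0.1, 0.9), max_iter=100, tol=1e-6, inner=10):
    trace = [loglik(y, theta, par)]
    for _ in range(max_iter):
        for _ in range(inner):               # step one: update b with (c, d) fixed
            par = scoring_step(y, theta, par, "b")
        for _ in range(inner):               # step two: update (c, d) with b fixed
            par = scoring_step(y, theta, par, "cd")
        trace.append(loglik(y, theta, par))
        if trace[-1] - trace[-2] < tol:
            break
    return par, trace

# Simulated item with made-up true values b0 = -0.3, b1 = 2, c = 0.2, d = 0.9.
rng = random.Random(7)
theta = [rng.gauss(0.0, 1.0) for _ in range(500)]
y = [1 if rng.random() < 0.2 + 0.7 * logistic(-0.3 + 2.0 * t) else 0 for t in theta]
par_hat, trace = fit_plf(y, theta)
print(par_hat, len(trace) - 1)
```

Step halving is a design choice of this sketch: a candidate is accepted only if it keeps the parameters feasible and does not lower the log-likelihood, so the trace is non-decreasing by construction.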

2.3 Implementation and Software

All analyses were performed in R, version 4.3.1 (R Core Team, 2023). The methods proposed here are implemented in the difNLR package, version 1.5.0 (Hladká & Martinková, 2020), and some of them are available in the interactive application of the ShinyItemAnalysis package, version 1.5.3 (Martinková & Drabinová, 2018; Martinková & Hladká, 2023); see Figure A1 in Supplemental Appendix D. The NLS method was implemented using the base nls() function and the “port” algorithm (Gay, n.d.). The sandwich estimator of the asymptotic covariance matrix, defined by Equation (A1) in Supplemental Appendix A.1, was computed using the calculus package (Guidotti, 2022). The ML estimation was performed with the base optim() function and the “L-BFGS-B” algorithm (Byrd et al., 1995). The EM algorithm implements Equation (6) directly in the expectation step and uses the base glm() function and the multinom() function from the nnet package (Venables & Ripley, 2002) in the maximization step. Step one of the newly proposed algorithm based on PLF is implemented with the base glm() function with a modified logit link, which includes the asymptote parameters. The asymptote parameters are estimated in step two using the base optim() function. The maximum number of iterations was 2,000 for all four methods, and the convergence criterion was set to $10^{- 6}$ when possible.

2.3.1 Initial Values

Starting values for item parameters were calculated as follows: The respondents were divided into three groups based upon tertiles of the matching criterion $θ_{p}$ . Next, the asymptote parameters were estimated: $c$ was computed as an empirical probability for those whose matching criterion was smaller than its average value in the first group defined by tertiles. The asymptote $d$ was calculated as an empirical probability of those whose matching criterion was greater than its average value in the last group defined by tertiles. The slope parameter $b_{1}$ was estimated as a difference between the mean empirical probabilities of the last and the first group multiplied by 4. This difference is sometimes called the upper-lower index. Finally, the intercept $b_{0}$ was calculated as follows: First, a center point between the asymptotes was computed, and then we looked for the level of the matching criterion that would have corresponded to this empirical probability. Additionally, smoothing and corrections for the variability of the matching criterion were applied.
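Part of this heuristic can be sketched in Python. The sketch relies on one reading of the description, which is an assumption: tertile groups are formed by sorting respondents and cutting at thirds, and "its average value" is taken as the mean of the matching criterion within the respective group. The intercept, the smoothing, and the variability corrections are omitted because the text does not specify them fully. The function name `starting_values` and the toy data are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def starting_values(y, theta):
    """Rough starting values for c, d, and b1 based on tertile groups of theta."""
    order = sorted(range(len(theta)), key=lambda p: theta[p])
    n = len(order)
    low = order[: n // 3]              # first tertile group
    high = order[n - n // 3:]          # last tertile group

    # Lower asymptote: empirical probability among the lowest part of the first group.
    low_cut = mean([theta[p] for p in low])
    c0 = mean([y[p] for p in low if theta[p] < low_cut])

    # Upper asymptote: empirical probability among the highest part of the last group.
    high_cut = mean([theta[p] for p in high])
    d0 = mean([y[p] for p in high if theta[p] > high_cut])

    # Slope: four times the upper-lower index.
    b1 = 4.0 * (mean([y[p] for p in high]) - mean([y[p] for p in low]))
    return c0, d0, b1

# Toy data with nine respondents.
theta = [-4, -3, -2, -1, 0, 1, 2, 3, 4]
y = [0, 0, 1, 0, 1, 1, 1, 1, 1]
c_start, d_start, b1_start = starting_values(y, theta)
print(c_start, d_start, b1_start)
```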

3. Simulation Study

A simulation study was performed to compare various procedures to estimate parameters in the generalized logistic regression model, including the NLS, the ML method, the EM algorithm, and the newly proposed algorithm based on PLF. Two models were considered—the simple 4PL model (1) and the group-specific 4PL model (2).

3.1 Simulation Design

3.1.1 Data Generation

To generate data, ten sets with different combinations of item parameters were considered. The item parameters were chosen to correspond to common values: Parameters $b_{0}$ , $b_{2}$ , and $b_{3}$ were generated from the standard normal distribution, parameter $b_{1}$ was generated from a normal distribution with a mean value equal to 2.5, and a standard deviation of 0.5. Parameter $c$ was generated from uniform distribution $U (0.05, 0.30)$ for both groups. Parameter $d$ was generated from uniform distribution $U (0.7, 0.95)$ for both groups. In the case of the simple 4PL model (1), only parameters $b_{0}$ , $b_{1}$ , $c$ , and $d$ were considered. Next, the matching criterion $θ_{p}$ was generated from the standard normal distribution for all respondents. Since item parameters are usually estimated item by item in generalized logistic models, we focused on items separately, and only one item was generated for each scenario. We used a generated variable as the matching criterion instead of the total score. Binary responses were generated from the Bernoulli distribution with the calculated probabilities based upon the chosen 4PL model, true parameters, and the matching criterion variable. The sample size was set to $n =$ 500, 1,000, 2,500, and 5,000, that is, 250, 500, 1,250, and 2,500 per group in the case of the group-specific 4PL model (2). Each scenario was replicated 1,000 times.
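The data-generating step for one item under the simple model can be sketched as follows. This is an illustration in Python rather than the code used in the study; it assumes the parametrization $P(Y = 1) = c + (d - c) \, \mathrm{expit}(b_{0} + b_{1} θ)$, and the helper names `expit` and `simulate_item` and the seed are hypothetical.

```python
import math
import random


def expit(x):
    """Inverse logit."""
    return 1 / (1 + math.exp(-x))


def simulate_item(n, b0, b1, c, d, rng):
    """Draw theta ~ N(0, 1) and Bernoulli responses from a simple 4PL curve."""
    theta = [rng.gauss(0, 1) for _ in range(n)]
    prob = [c + (d - c) * expit(b0 + b1 * t) for t in theta]
    y = [1 if rng.random() < p else 0 for p in prob]
    return theta, y


rng = random.Random(2024)
# One parameter set drawn as described: b0 ~ N(0, 1), b1 ~ N(2.5, 0.5),
# c ~ U(0.05, 0.30), d ~ U(0.70, 0.95).
params = {
    "b0": rng.gauss(0, 1),
    "b1": rng.gauss(2.5, 0.5),
    "c": rng.uniform(0.05, 0.30),
    "d": rng.uniform(0.70, 0.95),
}
theta, y = simulate_item(500, rng=rng, **params)
```

For the group-specific model, the same scheme would be extended with a group indicator and the additional parameters.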

3.1.2 Simulation Evaluation

To compare estimation methods, we first computed mean and median numbers of iteration runs and the convergence status of the methods, meaning the percentage of converged simulation runs; the percentage of runs that crashed (caused an error when fitting, e.g. due to singularities); and the percentage of those which reached the maximum number of iterations without convergence. Next, we selected only those simulation runs for which all four estimation methods converged successfully. We computed the mean parameter estimates and parametric confidence intervals, that is, average intervals found for estimated standard errors derived for the respective algorithm. When confidence intervals for asymptote parameters exceeded their boundaries of 0 or 1, confidence intervals were truncated at the boundary value. The proportion of confidence intervals covering the true parameter value was calculated. Subsequently, the mean bias (i.e. the mean difference between estimates and true values) and root mean squared error (RMSE; i.e. the square root of the average of squared errors) were calculated with respect to sample size. Finally, for a deeper insight into ML-based methods (i.e. traditionally implemented ML, the EM algorithm, and the algorithm based on PLF), we compared log-likelihoods for these three methods to those based on true values of parameters.
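The evaluation criteria for a single parameter can be sketched in Python as follows. The sketch assumes Wald-type intervals (estimate ± 1.96 standard errors), which is one common reading of "parametric confidence intervals"; the function name `evaluate` and the numerical inputs are hypothetical.

```python
import math


def evaluate(estimates, std_errors, true_value, lower=None, upper=None, z=1.96):
    """Bias, RMSE, and Wald CI coverage over replications for one parameter."""
    n = len(estimates)
    errors = [e - true_value for e in estimates]
    bias = sum(errors) / n
    rmse = math.sqrt(sum(err ** 2 for err in errors) / n)

    covered = 0
    for est, se in zip(estimates, std_errors):
        lo, hi = est - z * se, est + z * se
        # Truncate at parameter boundaries (e.g. 0 and 1 for asymptotes).
        if lower is not None:
            lo = max(lo, lower)
        if upper is not None:
            hi = min(hi, upper)
        covered += lo <= true_value <= hi
    return {"bias": bias, "rmse": rmse, "coverage": 100 * covered / n}


# Hypothetical estimates of a lower asymptote with true value 0.20.
result = evaluate([0.18, 0.25, 0.21, 0.02], [0.03, 0.03, 0.02, 0.05],
                  0.20, lower=0, upper=1)
```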

3.2 Simulation Results

3.2.1 Convergence Status

All four methods had low percentages of simulation runs that crashed for all sample sizes in the simple 4PL model (1). Still, the rate was mildly increased in the group-specific 4PL model (2) for the NLS method (6.72%) and for the algorithm based on PLF (9.22%) when $n = 500$ . With the increasing sample size, convergence issues disappeared. The EM algorithm struggled to converge in a predefined number of iterations, especially for small sample sizes in both models. Additionally, the method based on PLF reached the maximum limit of 2,000 iterations only in a small percentage of simulation runs when smaller sample sizes were considered (Table 1).

Table 1.

Convergence Status, Proportion of Suspicious Simulation Runs, and the Number of Iterations for the Four Estimation Methods.

Method	Simple 4PL model (1)						Group-specific 4PL model (2)
	Convergence status (%)				Number of iterations		Convergence status (%)				Number of iterations
	Conv.	Crash.	DNF	Susp.	Mean	Median	Conv.	Crash.	DNF	Susp.	Mean	Median
$n$ = 500
NLS	99.48	0.52	0.00	0.06	10.18	8.00	93.28	6.72	0.00	0.50	17.27	14.00
ML	99.84	0.16	0.00	0.06	23.84	23.00	98.87	1.13	0.00	0.45	123.75	104.00
EM	93.70	0.00	6.30	0.03	347.14	152.00	93.38	0.07	6.55	0.50	452.89	211.00
PLF	98.34	1.02	0.64	0.03	144.31	18.00	89.84	9.22	0.94	0.47	248.06	66.00
$n$ = 1,000
NLS	99.93	0.07	0.00	0.00	7.66	7.00	97.90	2.10	0.00	0.15	12.50	10.00
ML	99.89	0.11	0.00	0.00	22.24	22.00	99.99	0.01	0.00	0.06	106.85	99.00
EM	95.76	0.00	4.24	0.00	295.53	149.00	93.50	0.00	6.50	0.10	447.87	201.00
PLF	99.71	0.11	0.18	0.00	97.73	14.00	97.84	1.96	0.20	0.10	172.43	47.00
$n$ = 2,500
NLS	99.98	0.02	0.00	0.00	5.97	6.00	99.39	0.61	0.00	0.01	8.17	7.00
ML	99.94	0.06	0.00	0.00	20.98	20.00	100.00	0.00	0.00	0.00	94.19	91.00
EM	96.23	0.00	3.77	0.00	254.47	134.00	94.55	0.00	5.45	0.01	356.17	161.00
PLF	99.98	0.00	0.02	0.00	51.77	9.00	99.87	0.12	0.01	0.01	95.36	25.00
$n$ = 5,000
NLS	100.00	0.00	0.00	0.00	5.26	5.00	99.97	0.03	0.00	0.00	6.57	6.00
ML	99.95	0.05	0.00	0.00	20.50	20.00	100.00	0.00	0.00	0.00	90.21	89.00
EM	97.64	0.00	2.36	0.00	223.45	127.00	95.90	0.00	4.10	0.01	308.32	141.00
PLF	100.00	0.00	0.00	0.00	24.23	8.00	99.98	0.02	0.00	0.02	50.03	11.00

Note. Conv. = converged; Crash. = crashed; DNF = did not finish; Susp. = suspicious; NLS = nonlinear least squares; ML = maximum likelihood; EM = expectation-maximization; PLF = parametrized link function.

3.2.2 Number of Iterations

Furthermore, the methods differed in the number of iterations needed until the estimation process successfully ended. The EM algorithm yielded the largest mean and median numbers of iterations, which were somewhat inflated by simulation runs that did not finish (i.e. the maximum limit of 2,000 iterations was reached without convergence). The fewest iterations were needed for the NLS method. As expected, all the procedures required fewer iterations when the simple 4PL model (1) was considered than in the group-specific 4PL model (2). Beyond this, the number of iterations decreased with increasing sample size in both models for all the methods (Table 1).

In both models, some estimation procedures produced nonmeaningful estimates of parameters $b_{0}$ – $b_{3}$ (absolute value over 100) despite successful convergence. Such simulations affected mean values significantly, so they were removed from a computation of the mean estimates and their confidence intervals for all four estimation methods. Incidence was similar for all methods (Table 1). Such estimates might be obtained due to insufficient sample size or starting values far from the global maximizer.
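The exclusion rule can be expressed as a short Python sketch. The threshold of 100 comes from the text; the function name `is_suspicious` and the example estimates are hypothetical.

```python
def is_suspicious(b_estimates, threshold=100.0):
    """Flag a converged run whose b0-b3 estimates are not meaningful."""
    return any(abs(b) > threshold for b in b_estimates)


# Hypothetical estimates (b0, b1, b2, b3) from two converged runs.
runs = {"run_1": [0.4, 2.3, -0.1, 0.6], "run_2": [-152.7, 310.2, 0.3, 1.1]}
kept = [name for name, est in runs.items() if not is_suspicious(est)]
```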

3.2.3 Parameter Estimates

In the simple 4PL model (1), the PLF-based algorithm yielded the most precise estimates of parameters $b_{0}$ and $b_{1}$ (in the sense of bias and RMSE) when smaller sample sizes were considered ($n = 500$ or $n = 1,000$). Additionally, the NLS method yielded slightly more biased estimates in these scenarios. The precision of the estimation improved for both parameters in all four methods as the sample size increased, and the differences between estimation procedures narrowed. The accuracy of the estimates of the asymptote parameters $c$ and $d$ was similar for all four methods. The NLS method yielded the least biased estimates, while the PLF-based algorithm produced the lowest RMSE. However, the differences between estimation approaches were minor. The proportion of confidence intervals covering true values of item parameters was high for all four methods (Table 2). Slightly higher coverage for the NLS method was caused by somewhat wider confidence intervals.

Table 2.

Bias, RMSE, and Confidence Interval (CI) Coverage by Four Estimation Methods Using the Simple Model (1).

	Bias				RMSE				
Method	n = 500	1,000	2,500	5,000	n = 500	1,000	2,500	5,000	CI coverage [%]
$b_{0}$
NLS	$0.044$	$0.012$	$0.003$	$0.003$	$1.028$	$0.399$	$0.208$	$0.143$	$96.27$
ML	$0.045$	$0.016$	$0.004$	$0.004$	$0.902$	$0.387$	$0.204$	$0.141$	$95.89$
EM	$0.035$	$0.015$	$0.003$	$0.003$	$0.839$	$0.388$	$0.204$	$0.141$	$95.75$
PLF	$0.016$	$0.010$	$0.006$	$0.006$	$0.514$	$0.322$	$0.196$	$0.144$	$94.95$
$b_{1}$
NLS	$- 0.645$	$- 0.217$	$- 0.070$	$- 0.037$	$2.876$	$0.990$	$0.446$	$0.300$	$95.44$
ML	$- 0.507$	$- 0.183$	$- 0.059$	$- 0.031$	$2.253$	$0.932$	$0.433$	$0.292$	$95.60$
EM	$- 0.470$	$- 0.174$	$- 0.055$	$- 0.026$	$2.194$	$0.922$	$0.430$	$0.291$	$95.47$
PLF	$- 0.159$	$- 0.033$	$0.024$	$0.026$	$1.011$	$0.628$	$0.377$	$0.278$	$95.19$
$c$
NLS	$0.003$	$0.003$	$0.001$	$0.000$	$0.061$	$0.044$	$0.027$	$0.019$	$94.62$
ML	$0.006$	$0.004$	$0.001$	$0.001$	$0.063$	$0.044$	$0.027$	$0.019$	$94.34$
EM	$0.006$	$0.004$	$0.002$	$0.001$	$0.062$	$0.043$	$0.027$	$0.019$	$94.27$
PLF	$0.012$	$0.009$	$0.005$	$0.004$	$0.060$	$0.042$	$0.027$	$0.019$	$95.16$
$d$
NLS	$- 0.001$	$- 0.001$	$0.000$	$0.000$	$0.065$	$0.049$	$0.032$	$0.023$	$94.28$
ML	$- 0.003$	$- 0.002$	$- 0.000$	$0.000$	$0.065$	$0.049$	$0.031$	$0.022$	$93.88$
EM	$- 0.003$	$- 0.001$	$- 0.000$	$0.000$	$0.065$	$0.049$	$0.031$	$0.022$	$93.77$
PLF	$- 0.009$	$- 0.006$	$- 0.003$	$- 0.002$	$0.062$	$0.047$	$0.030$	$0.022$	$94.12$

Note. NLS = nonlinear least squares; ML = maximum likelihood; EM = expectation-maximization; PLF = parametrized link function.

In the group-specific 4PL model (2), the PLF-based algorithm yielded the most precise estimates of parameters $b_{0}$ – $b_{3}$ in the sense of RMSE, especially for the smaller sample sizes. On the other hand, the NLS method produced the largest RMSE in such scenarios. Computed bias was similar for all four methods. As in the simple 4PL model (1), the differences in the precision of the parameter estimates narrowed with increasing sample size, and all four estimation approaches gave estimates close to the true values of the item parameters. The estimates of the asymptote parameters $c$ , $c_{DIF}$ , $d$ , and $d_{DIF}$ were similar for all four methods. The EM algorithm provided slightly less-biased mean estimates of the asymptote parameters, while the PLF-based algorithm produced slightly smaller RMSE. The proportion of confidence intervals covering true values of item parameters was high and similar for all four methods (Table 3). Differences in the widths of the computed intervals caused the differences in coverage of true parameters between the estimation methods.

Table 3.

Bias, RMSE, and Confidence Interval (CI) Coverage by Four Estimation Methods Using the Group-Specific Model (2).

	Bias				RMSE				
Method	n = 500	1,000	2,500	5,000	n = 500	1,000	2,500	5,000	CI coverage [%]
$b_{0}$
NLS	$- 0.025$	$0.008$	$0.007$	$0.003$	$2.230$	$1.259$	$0.351$	$0.204$	$96.63$
ML	$0.002$	$0.023$	$0.009$	$0.004$	$1.777$	$1.105$	$0.341$	$0.200$	$96.07$
EM	$0.023$	$0.026$	$0.008$	$0.005$	$1.770$	$1.086$	$0.337$	$0.200$	$95.40$
PLF	$0.009$	$0.006$	$0.010$	$0.011$	$0.846$	$0.485$	$0.285$	$0.232$	$94.71$
$b_{1}$
NLS	$- 1.400$	$- 0.640$	$- 0.156$	$- 0.072$	$6.027$	$3.041$	$0.896$	$0.445$	$94.64$
ML	$- 1.013$	$- 0.506$	$- 0.122$	$- 0.059$	$4.745$	$2.463$	$0.768$	$0.431$	$95.37$
EM	$- 0.811$	$- 0.475$	$- 0.116$	$- 0.053$	$4.289$	$2.371$	$0.768$	$0.427$	$94.12$
PLF	$- 0.214$	$- 0.120$	$0.010$	$0.036$	$1.510$	$1.000$	$0.541$	$0.390$	$93.67$
$b_{2}$
NLS	$0.082$	$0.016$	$- 0.001$	$0.002$	$3.005$	$1.528$	$0.477$	$0.289$	$96.91$
ML	$0.006$	$- 0.012$	$- 0.007$	$- 0.001$	$2.467$	$1.357$	$0.463$	$0.284$	$96.34$
EM	$- 0.022$	$- 0.011$	$- 0.006$	$0.000$	$2.270$	$1.277$	$0.461$	$0.283$	$95.59$
PLF	$0.024$	$0.005$	$- 0.012$	$- 0.013$	$1.319$	$0.763$	$0.401$	$0.309$	$94.53$
$b_{3}$
NLS	$0.047$	$0.031$	$- 0.000$	$- 0.003$	$8.105$	$4.068$	$1.175$	$0.649$	$97.73$
ML	$- 0.113$	$0.023$	$- 0.003$	$- 0.004$	$6.367$	$3.366$	$1.040$	$0.630$	$97.46$
EM	$0.098$	$0.061$	$0.003$	$0.001$	$5.889$	$3.046$	$1.040$	$0.625$	$95.93$
PLF	$0.084$	$0.101$	$0.065$	$0.050$	$2.329$	$1.532$	$0.800$	$0.584$	$95.06$
$c$
NLS	$0.015$	$0.005$	$0.003$	$0.001$	$0.087$	$0.062$	$0.039$	$0.027$	$94.25$
ML	$0.020$	$0.008$	$0.004$	$0.001$	$0.091$	$0.063$	$0.040$	$0.027$	$94.23$
EM	$0.024$	$0.008$	$0.004$	$0.001$	$0.091$	$0.063$	$0.039$	$0.027$	$93.33$
PLF	$0.029$	$0.015$	$0.009$	$0.006$	$0.089$	$0.061$	$0.039$	$0.027$	$94.51$
$c_{DIF}$
NLS	$- 0.007$	$- 0.002$	$- 0.001$	$- 0.001$	$0.113$	$0.083$	$0.053$	$0.037$	$95.19$
ML	$- 0.007$	$- 0.002$	$- 0.001$	$- 0.001$	$0.116$	$0.085$	$0.053$	$0.037$	$95.13$
EM	$- 0.004$	$- 0.002$	$- 0.001$	$- 0.001$	$0.117$	$0.084$	$0.052$	$0.037$	$93.63$
PLF	$- 0.005$	$- 0.000$	$0.001$	$0.001$	$0.110$	$0.080$	$0.052$	$0.037$	$94.58$
$d$
NLS	$- 0.010$	$- 0.003$	$- 0.003$	$- 0.001$	$0.084$	$0.065$	$0.044$	$0.031$	$93.72$
ML	$- 0.015$	$- 0.005$	$- 0.004$	$- 0.001$	$0.087$	$0.066$	$0.045$	$0.031$	$93.46$
EM	$- 0.018$	$- 0.005$	$- 0.003$	$- 0.001$	$0.088$	$0.065$	$0.043$	$0.031$	$91.56$
PLF	$- 0.022$	$- 0.012$	$- 0.008$	$- 0.005$	$0.084$	$0.063$	$0.043$	$0.031$	$92.31$
$d_{DIF}$
NLS	$0.003$	$0.002$	$0.001$	$0.000$	$0.118$	$0.091$	$0.059$	$0.043$	$95.29$
ML	$0.003$	$0.002$	$0.002$	$0.000$	$0.121$	$0.092$	$0.060$	$0.043$	$95.17$
EM	$- 0.000$	$0.001$	$0.001$	$0.000$	$0.121$	$0.091$	$0.058$	$0.042$	$92.77$
PLF	$0.000$	$- 0.000$	$- 0.001$	$- 0.001$	$0.114$	$0.087$	$0.058$	$0.042$	$93.57$

Note. NLS = nonlinear least squares; ML = maximum likelihood; EM = expectation-maximization; PLF = parametrized link function.

3.2.4 Log-Likelihood Comparison

In the simple 4PL model (1), the algorithm based on PLF yielded log-likelihood values nearest to those computed based on true parameters in 91.31% of cases, followed by the EM algorithm in 8.45% and the directly implemented ML method in 0.23% of cases. There were similar differences between the three ML-based methods in the group-specific 4PL model (2). The algorithm based on PLF outperformed other likelihood-based estimation procedures in 87.56% of cases, while the EM algorithm worked the best in 12.06% and the ML method in 0.38%.
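The comparison can be sketched in Python as follows. The sketch assumes the simple parametrization $P(Y = 1) = c + (d - c) \, \mathrm{expit}(b_{0} + b_{1} θ)$ and reads "nearest" as the smallest absolute difference in log-likelihood; the function names and the toy inputs are hypothetical.

```python
import math


def loglik_4pl(theta, y, b0, b1, c, d):
    """Bernoulli log-likelihood of a simple 4PL curve (assumed parametrization)."""
    total = 0.0
    for t, r in zip(theta, y):
        p = c + (d - c) / (1 + math.exp(-(b0 + b1 * t)))
        total += math.log(p) if r == 1 else math.log(1 - p)
    return total


def closest_to_truth(theta, y, true_params, fits):
    """Name of the method whose log-likelihood is nearest to that at the true values."""
    ll_true = loglik_4pl(theta, y, **true_params)
    return min(fits, key=lambda m: abs(loglik_4pl(theta, y, **fits[m]) - ll_true))


# Toy inputs (hypothetical): method "A" reproduces the true parameters exactly.
theta = [-1.0, 0.0, 1.0, 2.0]
y = [0, 1, 1, 1]
truth = dict(b0=0.0, b1=2.0, c=0.2, d=0.9)
fits = {"A": dict(truth), "B": dict(b0=1.5, b1=0.5, c=0.05, d=0.99)}
winner = closest_to_truth(theta, y, truth, fits)
```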

4. Real Data Examples

4.1 Data Description

We demonstrate the estimation procedures with an application to DIF detection on two real-data examples, which are available in the ShinyItemAnalysis R package and interactive application (Martinková & Drabinová, 2018; Martinková & Hladká, 2023), namely PROMIS anxiety scale¹ and a test measuring learning competence (Martinková et al., 2020).

4.1.1 Anxiety Scale

The Anxiety dataset consisted of responses to 29 Likert-type questions (1 = never, 2 = rarely, 3 = sometimes, 4 = often, and 5 = always) from 766 respondents. Additionally, the dataset included information on the respondents’ age, education, and gender (0 = male and 1 = female). Overall, there were 369 male participants and 397 female participants.

For this work, item responses were dichotomized as follows: 0 was assigned to the response never (i.e. response $= 1$ on the original scale), while 1 was assigned to the responses rarely and more often (i.e. response $\geq 2$ on the original scale).
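The recoding can be written as a small Python sketch. The function name `dichotomize` and the example responses are hypothetical.

```python
def dichotomize(response):
    """Map a 1-5 Likert response to 0 (never) or 1 (rarely or more often)."""
    if response not in (1, 2, 3, 4, 5):
        raise ValueError(f"unexpected response: {response!r}")
    return 0 if response == 1 else 1


coded = [dichotomize(r) for r in [1, 2, 5, 1, 3]]
```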

4.1.2 Learning Competence

The LearningToLearn dataset consisted of binary-coded responses from 782 subjects to a (mostly) multiple-choice test consisting of 41 items within seven subscales. Each respondent was tested twice, first in the sixth grade and then in the ninth grade; only responses from the sixth grade were considered for this analysis. Among other variables, the dataset included information on the school track of respondents (basic school track = 0, academic school track = 1). Overall, 391 students attended basic school, and 391 pursued selective academic school.

4.2 Real Data Analysis Design

This work considered the simple 4PL model (1) and the group-specific 4PL model (2) with different constraints on asymptote parameters, yielding the 3PL models. In the Anxiety dataset, the lower asymptotes were set to zeros, that is, $c_{i} = 0$ and $c_{i DIF} = 0$ , since pretending (i.e. lower asymptote greater than 0) was not expected. On the other hand, in the LearningToLearn dataset, the upper asymptotes were set to ones, that is, $d_{i} = 1$ and $d_{i DIF} = 0$ , since slipping (i.e. upper asymptote lower than 1) was not expected.

The matching criterion $θ_{p}$ in the Anxiety dataset was the overall level of anxiety, which was calculated as a standardized sum of nondichotomized item responses. Similarly, the standardized total test score gained in the sixth grade was used as the matching criterion $θ_{p}$ in the LearningToLearn dataset. For the group-specific models, the grouping variable $G_{p}$ was defined by the gender of respondents for the Anxiety dataset and the school track for the LearningToLearn dataset. DIF detection was performed concerning these variables.

The two newly proposed estimation methods, the EM algorithm and the algorithm based on PLF, were applied to both datasets and both models. The same approach for computing starting values as in the simulation study was used to analyze both datasets. In the case of convergence issues, the initial values were re-calculated based on successfully converged estimates provided by other methods.

In this study, item parameter estimates were computed and reported with confidence intervals. Next, the likelihood-ratio test using the test statistic from Equation (4) was performed to compare the two nested models (simple and group-specific) to identify the DIF for all items and both novel estimation methods. Finally, for the EM algorithm, mean values of estimated latent variables over all items were computed. A significance level of .05 was used for all the tests.
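The test can be sketched in Python under stated assumptions. Equation (4) is not reproduced here, so the sketch assumes the usual form of the statistic, twice the difference in maximized log-likelihoods, referred to a chi-square distribution. With one asymptote constrained, the group-specific model adds three parameters (intercept, slope, and asymptote differences), hence three degrees of freedom, for which the survival function has a closed form. The function name and the log-likelihood values are hypothetical.

```python
import math


def lrt_pvalue_df3(loglik_simple, loglik_group):
    """Likelihood-ratio test of nested models differing by three parameters."""
    stat = 2 * (loglik_group - loglik_simple)
    stat = max(stat, 0.0)  # guard against tiny negative values from numerical error
    # Closed-form chi-square survival function for 3 degrees of freedom.
    p = (math.erfc(math.sqrt(stat / 2))
         + math.sqrt(2 * stat / math.pi) * math.exp(-stat / 2))
    return stat, p


# Hypothetical maximized log-likelihoods of the simple and group-specific models.
stat, p_value = lrt_pvalue_df3(-400.0, -395.0)
is_dif = p_value < 0.05
```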

4.3 Real Data Analysis Results

4.3.1 Anxiety Scale

4.3.1.1 DIF Detection

Using the likelihood-ratio test, the simple 4PL model (1) with constraints on lower asymptotes was rejected for items R6 (“I was concerned about my mental health”; $p$-value = .001 considering either estimation algorithm), R7 (“I felt upset”; $p$-value = .038), R10 (“I had sudden feelings of panic”; $p$-value = .031), R21 (“I had twitching or trembling muscles”; $p$-value = .035), and R29 (“I had difficulty calming down”; $p$-value = .04) by using either of the two newly proposed estimation algorithms. In these items, the less restrictive group-specific model (2) was preferred, allowing for different intercepts, slopes, and upper asymptotes for the two groups (i.e. these items functioned differently).

We now take a closer look at DIF item R7, for which the confidence interval of estimated parameter $d_{DIF}$ (i.e., the difference in upper asymptote between groups of female and male respondents) did not cover 0 (Table A2 in the online version of the journal). According to the model, the male respondents with high anxiety levels seemed not to admit feeling upset with a probability of $1 - 0.92 = 0.08$ , while there was no dissimulation rate for female respondents (Figure 2).

Figure 2.

Estimated item characteristic curves of item R7 of the Anxiety dataset for the group-specific 4PL model (2) with constraints on lower asymptotes.

4.3.1.2 Latent Variables Estimates by EM Algorithm

Using the group-specific model (2), male respondent 272 with an overall level of anxiety equal to 0.62 had the highest “inclination to dissimulate,” equal to 0.20, meaning they would have dissimulated almost six items out of the 29-item Anxiety dataset. On the other hand, female respondent 264 with the same overall anxiety level had a probability equal to 0.06, which would correspond to the dissimulation of less than two items.
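As a check on the arithmetic, and assuming that the expected number of dissimulated items is the estimated inclination multiplied by the number of items (29), the two values above correspond to:

$$0.20 \times 29 = 5.8 \quad \text{(almost six items)}, \qquad 0.06 \times 29 = 1.74 \quad \text{(fewer than two items)}.$$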

4.3.2 Learning Competence

4.3.2.1 DIF Detection

Using the likelihood-ratio test, the simple 4PL model (1) with constraints on upper asymptotes was rejected for items 1A ($p$-value = .006 considering either estimation algorithm), 1D ($p$-value = .009), 6F ($p$-value = .043), 6H ($p$-value = .032), and 7F ($p$-value = .049) by using either of the two newly proposed estimation algorithms. In these items, the less restrictive group-specific model (2) was preferred, allowing for different intercepts, slopes, and lower asymptotes for the two groups.

Items 6F and 6H were identified as functioning differently due to differences in lower asymptotes, that is, confidence interval of estimated parameter $c_{DIF}$ (i.e. the difference in lower asymptotes between the two school tracks) did not cover 0 (Table A4 in the online version of the journal). In both items, students from the basic school track tended to guess more often than students from the academic school track. In item 6F, the probability of guessing in the basic school track was 0.16, while in the academic school track, it was 0 (Figure 3a). In item 6H, the probability of guessing in the basic school track was 0.23, while in the academic school track, it was 0.02 (Figure 3b).

Figure 3.

Estimated item characteristic curves of selected DIF items of the LearningToLearn dataset for the group-specific 4PL model (2) with constraints on upper asymptotes. (a) Item 6F and (b) Item 6H.

Both items were related to “solving tasks with invented mathematical operators which are conditionally defined depending on the value of the digits they connect” (Martinková et al., 2020). The original study suggested that students from academic schools might have been trying to solve these difficult items more often, while students from the basic school track might have been guessing more often.

4.3.2.2 Latent Variables Estimates by EM Algorithm

Considering the group-specific model (2), respondent 486, who attended basic school and had an overall level of learning competence of $- 0.70$ , had the highest “inclination to guess,” equal to 0.38, meaning they would guess almost 16 items out of the 41-item test on learning competencies. On the other hand, respondent 386, who attended academic school and had exactly the same level of learning competence, had a probability equal to 0.17, corresponding to the guessing of 7 items out of 41.
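As a check on the arithmetic, and assuming that the expected number of guessed items is the estimated inclination multiplied by the number of items (41), the two values above correspond to:

$$0.38 \times 41 = 15.58 \quad \text{(almost 16 items)}, \qquad 0.17 \times 41 = 6.97 \quad \text{(about 7 items)}.$$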

5. Discussion

This work explored novel approaches for estimating item functioning within the GLNMs framework. The study proposed two iterative procedures (a procedure using the EM algorithm and a new method based on PLF) as alternatives to the directly implemented ML method. The methods were compared via simulation with existing algorithms and implemented in R.

In the simulation study, the traditional NLS approach produced the most biased parameter estimates with wide confidence intervals. The directly implemented ML method performed satisfactorily; however, the newly proposed methods were superior in some aspects: The EM algorithm provided slightly less-biased parameter estimates than the directly implemented ML method, and it more often produced log-likelihood values closer to those computed based on true parameters. These were at the price of a higher number of iterations being needed for this approach to converge, while the maximum number of iterations was reached in several cases. As an added value, the EM algorithm provided additional information on respondents’ latent response styles. The newly proposed algorithm based on PLF yielded the least-biased parameter estimates of the expit function for most settings, especially when small sample sizes and additional covariates were considered. Moreover, in most scenarios, the PLF-based algorithm yielded log-likelihood values nearest to those computed based on true underlying parameters. Conversely, there was a higher rate of crashed simulations for the group-specific 4PL model (2) and small sample size. The precision of the asymptote parameters was similar for all four estimation techniques. As the sample size increased, differences between the estimation methods vanished, and all estimates were near the true values of the item parameters.

Using two real data examples, we illustrated the possible benefits of generalized logistic regression models in item response modeling, estimating asymptotes, and their application to DIF analysis. Further, we presented how the practitioners may benefit from the added value of the EM algorithm, which can be used to estimate the probability of guessing correctly answered items (in the context of psychological assessment, endorsing an item due to pretending) or answering incorrectly due to inattention (in the context of psychological assessment, not endorsing an item due to dissimulation) for individual respondents. We also demonstrated practical challenges in estimation procedures, including specifying initial values.

The EM algorithm proposed in this study builds on the work of Dinse (2011), while we extend their approach to the group-specific and the general covariate-specific models in the multi-item measurement setting. Meng et al. (2020) proposed a similar EM algorithm for the 4PL simple model (without additional covariates) in the IRT framework. On the other hand, the PLF-based algorithm is novel and has not been proposed in this form for parameter estimation in the generalized logistic regression model. However, in recent decades, the idea of the PLF has been extensively discussed in the literature by many authors in various contexts, including Basu and Rathouz (2005), Flach (2014), and Scallan et al. (1984). For example, Pregibon (1980) proposed the ML estimation of the link parameters using a weighted least squares algorithm. Similarly, McCullagh and Nelder (1989) adapted this approach and presented an algorithm in which several models with the fixed link functions were fitted. Furthermore, Kaiser (1997) proposed a modified scoring algorithm to perform simultaneous ML estimation of all parameters. Scallan et al. (1984) proposed an iterative two-stage algorithm, building on the work of Richards (1961). This study examined generalized logistic regression, accounting for the possibility of guessing/pretending and inattention/dissimulation, where these features may depend upon the respondents’ characteristics.

A crucial part of each estimation process is specifying starting values for item parameters because these values may significantly impact the speed and precision of the estimation process. For instance, initial values far from the true item parameters may lead to situations where the estimation algorithm returns only a local extreme or does not converge. In this work, we used an approach based on an upper-lower index, which resulted in few convergence issues and satisfactory estimation precision. However, other possible naive estimates of discrimination (and other parameters) could be considered, such as a correlation between an item score and the total test score without a given item.

This study has several limitations, and several possible further directions exist. First, the simulation study was limited to two models, the simple 4PL model (1) and the group-specific 4PL model (2), both of which included only one or two covariates. The simulation study suggested that larger sample sizes are required as the number of covariates increases. Second, all the described algorithms implement the estimation item-by-item, which is typical within the GLNM framework and suitable for cases when the ability is known or estimated a priori. The benefits include the fact that the items do not necessarily need to be independent, given the ability. One possible path for future research would be to extend the proposed algorithms to estimate all item parameters simultaneously using a joint model (see, e.g. Martinková & Hladká, 2023, Section 6.8). Further, the proposed algorithms may also be implemented in the IRT framework to allow incorporating the latent ability $θ$ and its estimation, similar to Meng et al. (2020). Third, this article described the NLS method as a simple approach, not accounting for the heteroscedasticity of binary data. For such data, Pearson’s residuals might be more appropriate to use. This weighted form (e.g. Ritz et al., 2015) takes the original squares of residuals and divides them by the variance $π_{pi} (1 - π_{pi})$ . Next, the RSS of item $i$ defined by Equation (5) would take the following form:

$$\mathrm{RSS}_{i} (γ_{i}) = \sum_{p = 1}^{n} \frac{(Y_{pi} - π_{pi})^{2}}{π_{pi} (1 - π_{pi})} . \tag{8}$$
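A minimal Python sketch of the unweighted and Pearson-weighted residual sums of squares is given here for illustration. It assumes the simple parametrization $π = c + (d - c) \, \mathrm{expit}(b_{0} + b_{1} θ)$; the function name `rss` and the toy data are hypothetical.

```python
import math


def rss(theta, y, b0, b1, c, d, weighted=False):
    """Residual sum of squares for a simple 4PL curve; Pearson-weighted if requested."""
    total = 0.0
    for t, r in zip(theta, y):
        p = c + (d - c) / (1 + math.exp(-(b0 + b1 * t)))
        sq = (r - p) ** 2
        # Weighted form: divide each squared residual by the variance p * (1 - p).
        total += sq / (p * (1 - p)) if weighted else sq
    return total


# Toy data (hypothetical) with unexpected responses at both tails.
theta = [-2.0, 0.0, 2.0]
y = [1, 1, 0]
pars = dict(b0=0.0, b1=2.0, c=0.05, d=0.95)
plain = rss(theta, y, **pars)
pearson = rss(theta, y, weighted=True, **pars)
```

Because $π (1 - π) \leq 0.25$, each weighted term is at least four times its unweighted counterpart, and the weights grow quickly as $π$ approaches 0 or 1.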

However, the number of observations on the tails of the matching criterion is typically tiny and provides only small variability at most. These heavy weights would require a nearly exact fit for cases with few observations. Nevertheless, the computation of the NLS estimates demonstrated in this work was straightforward and efficient, providing sufficient precision. Thus, this method could be helpful in some instances, such as producing an initial idea about parameter values and using these estimates as starting values for other approaches. Fourth, it is important to acknowledge that the estimation methods studied here can be sensitive to the choice of optimization algorithm and the control parameters. The directly implemented ML estimation was performed with the “L-BFGS-B” algorithm to account for constraints in asymptotes. Alternatively, asymptote parameters may depend on covariates through a transformation function, so the estimating algorithm does not need to incorporate constraints. The performance of these two approaches might differ. Moreover, the control parameters were set the same for all estimation methods, while the sensitivity of the methods to these settings may vary, and they may be imposed on different quantities in different algorithms (e.g. deviation, likelihood, or the norm of the gradient vector). For instance, the EM algorithm is known to require a large number of iterations until convergence, especially near the maximum. A potential improvement could involve a hybrid strategy, where the EM algorithm is run for a fixed number of iterations, followed by a single ML iteration at the end.

While the primary focus of this article lies in enhancing parameter estimation within GLNMs for multi-item measurement, it also touches upon the application of these algorithms in DIF detection. We illustrated the DIF detection by comparing the largest and the smallest models; however, a step-by-step procedure omitting the parameters with nonsignificant effects might be applied in practice to explore DIF in detail. Although DIF detection is not the central theme, practical examples illustrate the significance of assessing the fairness and validity of assessments across diverse groups. Nevertheless, this study does not aim to evaluate the properties of the underlying DIF detection procedure or to compare it with popular existing methods such as the anchor item-based approaches (Candell & Drasgow, 1988; Clauser et al., 1993; W.-C. Wang & Yeh, 2003; Kopf et al., 2015) or more recent regularization-based approaches (Magis et al., 2015; Tutz & Schauberger, 2015; Belzak & Bauer, 2020; C. Wang et al., 2023).

Establishing a common scale on which respondents from different groups can be scored and ranked is a crucial step in DIF analysis. In both the IRT and non-IRT frameworks (including, e.g., the Mantel–Haenszel test or SIBTEST procedure), the inclusion of DIF items in estimation or computation of ability estimate may have a severe impact on which items are detected as functioning differently. One possibility for dealing with such an issue is applying an item purification iterative algorithm (Lord, 1980). Additionally, as the number of hypotheses tested may get large, p-value adjustments could be considered (see Hladká et al., 2024, for discussion).

In contrast to the IRT framework, GLNMs offer flexibility in selecting the ability variable $θ_{p}$ . In this article, the true underlying ability variable was used in the simulation study for all four estimation methods. While this approach is not feasible with real data, using a unified choice allows for a clear comparison of differences between the estimation algorithms. To account for a measurement error in total scores, one potential approach is to use plausible values; however, this may introduce greater computational complexity. Other possibilities may include the simulation-extrapolation method (Lockwood & McCaffrey, 2014, 2017). In this work, we further investigated the impact of measurement error on item estimation precision through an additional simulation study (see Supplemental Appendix C in the online version of the journal). The differences among estimation methods were similar to those observed when the true ability was used. Besides the standardized total scores, the model may utilize latent trait estimates, possibly in an iterative algorithm, yielding an IRT model. The model may also utilize previous test scores as a matching criterion, which allows studying differential item functioning in change (Martinková et al., 2020; for further applications also see Kolek et al., 2021; 2024), or other relevant criterion variables. Additionally, the covariate-specific model (3) accommodates multidimensional matching criteria, similar to its IRT counterpart. Both frameworks share the same objective when accounting for the same underlying latent trait: to estimate item functioning with a logistic-shaped item characteristic curve. In such instances, the estimating algorithms for GLNMs can provide initial estimates for the corresponding IRT model, as they are less computationally demanding, require smaller sample sizes, and result in fewer convergence issues. Moreover, they may be used for the iterative estimation of ability and item parameters.

Finally, the GLNMs discussed in this article do not account for missing data. However, performing estimation and potential DIF detection for each item separately would minimize the omission of data.

This study’s real data examples explored item functioning in multi-item measurements of anxiety and learning competencies. However, the parameter estimation task in the presented models is also relevant to many other educational, psychological, and health-related measurement areas, such as the assessment of well-being, fatigue, and reading literacy. Moreover, the generalized logistic regression model is not limited to multi-item measurements, since the class determined by Equation (3) represents a broad family of covariate-specific 4PL models. This model might be used and further extended in various fields of study, including but not limited to quantitative pharmacology (Dinse, 2011), applied microbiology (Brands et al., 2020), modeling patterns of urban electricity usage (To et al., 2012), and plant growth modeling (Zub et al., 2012). Therefore, the estimation procedures proposed in this work are highly relevant for a wide range of researchers and practitioners, both within and outside the psychometric field.
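As a rough illustration of the 4PL family referred to above, the Python sketch below evaluates a logistic curve with lower and upper asymptotes. Equation (3) is not reproduced in this section, so the intercept-slope parametrization (`b0`, `b1`), the asymptote names `c` and `d`, and the function name `four_pl` are assumptions made for the example rather than the article's notation.

```python
import math


def four_pl(x: float, b0: float, b1: float, c: float = 0.0, d: float = 1.0) -> float:
    """Response probability under a four-parameter logistic curve.

    x      -- matching criterion (e.g., a standardized total score or a dose)
    b0, b1 -- intercept and slope of the linear predictor (assumed parametrization)
    c, d   -- lower and upper asymptotes, with 0 <= c < d <= 1
    """
    if not 0.0 <= c < d <= 1.0:
        raise ValueError("asymptotes must satisfy 0 <= c < d <= 1")
    z = b0 + b1 * x
    # Numerically stable logistic function: avoids overflow for large |z|.
    if z >= 0:
        p = 1.0 / (1.0 + math.exp(-z))
    else:
        e = math.exp(z)
        p = e / (1.0 + e)
    return c + (d - c) * p
```

With `c = 0` and `d = 1` the function reduces to ordinary logistic regression. A covariate-specific version would let `b0`, `b1`, `c`, and `d` depend on respondent covariates such as group membership, which is how the same curve can serve DIF detection, dose-response analysis, or growth modeling.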

To conclude, this study investigated advances in fitting generalized logistic regression models using various estimation techniques, including two newly proposed ones. We demonstrated the superiority of the novel implementation of the EM algorithm and the newly proposed method based on PLF over the existing NLS and directly implemented ML methods. Improving estimation algorithms is critical because it can increase precision while maintaining a user-friendly implementation. It may also provide additional information regarding individual respondents and items; thus, investing resources in the advancement of estimation methods is worthwhile.

Supplemental Material

Supplemental material, sj-docx-1-jeb-10.3102_10769986241312354 for New Iterative Algorithms for Estimation of Item Functioning by Adéla Hladká, Patrícia Martinková and Marek Brabec in Journal of Educational and Behavioral Statistics

Footnotes

Acknowledgements

We sincerely thank the anonymous reviewers for their valuable comments and suggestions on earlier versions of the manuscript. We especially appreciate their encouragement to explore measurement errors.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The study was funded by the Czech Science Foundation project “Theoretical Foundations of Computational Psychometrics” grant number 21-03658S, by the project “Research of Excellence on Digital Technologies and Well-being CZ.02.01.01/00/22_008/0004583” which is co-financed by the European Union, and by the institutional support RVO 67985807.

Supplementary Material

Accompanying R scripts, simulation data, results, and figures are available in the electronic Supplementary Material at /.

Notes

Authors

ADÉLA HLADKÁ is a Researcher at the Department of Statistical Modelling, Institute of Computer Science of the Czech Academy of Sciences; part of the work on the article was carried out while she was an ICS Ph.D. fellow and a Ph.D. student at the Department of Probability and Mathematical Statistics, Faculty of Mathematics and Physics, Charles University, e-mail: hladka@cs.cas.cz. Her research focuses on statistical and psychometric methods.

PATRÍCIA MARTINKOVÁ is a Senior Researcher and a Chair at the Department of Statistical Modelling, Institute of Computer Science of the Czech Academy of Sciences, and an Associate Professor at the Institute for Research and Development in Education, Faculty of Education, Charles University, e-mail: martinkova@cs.cas.cz. Her research interests include statistical and computational aspects of psychometric methods.

MAREK BRABEC is a Senior Researcher at the Department of Statistical Modelling, Institute of Computer Science of the Czech Academy of Sciences, e-mail: mbrabec@cs.cas.cz. His research interests include statistical methods in general and semiparametric and Bayesian modelling in particular.

References

American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME). (2014). Standards for educational and psychological testing. American Educational Research Association.

Barton, M. A., & Lord, F. M. (1981). An upper asymptote for the three-parameter logistic item-response model. ETS Research Report Series, 1981(1), 1–8. https://doi.org/10.1002/j.2333-8504.1981.tb01255.x

Basu & Rathouz, P. J. (2005). Estimating marginal and incremental effects on health outcomes using flexible link and variance function models. Biostatistics, 6(1), 93–109. https://doi.org/10.1093/biostatistics/kxh020

Battauz (2020). Regularized estimation of the four-parameter logistic model. Psych, 2(4), 269–278. https://doi.org/10.3390/psych2040020

Belzak & Bauer, D. J. (2020). Improving the assessment of measurement invariance: Using regularization to select anchor items and identify differential item functioning. Psychological Methods, 25(6), 673–690. https://doi.org/10.1037/met0000253

Birnbaum (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397–479). Addison-Wesley.

Brands, Schulze Struchtrup, Stamminger, & Bockmühl, D. P. (2020). A method to evaluate factors influencing the microbial reduction in domestic dishwashers. Journal of Applied Microbiology, 128(5), 1324–1338. https://doi.org/10.1111/jam.14564

Byrd, R. H., Nocedal, & Zhu (1995). A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5), 1190–1208. https://doi.org/10.1137/0916069

Candell, G. L., & Drasgow (1988). An iterative procedure for linking metrics and assessing item bias in item response theory. Applied Psychological Measurement, 12(3), 253–260. https://doi.org/10.1177/014662168801200304

Clauser, Mazor, & Hambleton, R. K. (1993). The effects of purification of matching criterion on the identification of DIF using the Mantel-Haenszel procedure. Applied Measurement in Education, 6(4), 269–279. https://doi.org/10.1207/s15324818ame0604_2

Culpepper, S. A. (2016). Revisiting the 4-parameter item response model: Bayesian estimation and application. Psychometrika, 81(4), 1142–1163. https://doi.org/10.1007/s11336-015-9477-6

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1), 1–22. https://doi.org/10.1111/j.2517-6161.1977.tb01600.x

Dennis, J. E. J., Gay, D. M., & Welsch, R. E. (1981). An adaptive nonlinear least-squares algorithm. Transactions on Mathematical Software, 7(3), 348–368. https://doi.org/10.1145/355958.355965

Dinse, G. E. (2011). An EM algorithm for fitting a four-parameter logistic model to binary dose-response data. Journal of Agricultural, Biological, and Environmental Statistics, 16(2), 221–232. https://doi.org/10.1007/s13253-010-0045-3

Dobson, A. J., & Barnett, A. G. (2018). An introduction to generalized linear models. Chapman and Hall/CRC. https://doi.org/10.1201/9781315182780

Drabinová & Martinková (2017). Detection of differential item functioning with nonlinear regression: A non-IRT approach accounting for guessing. Journal of Educational Measurement, 54(4), 498–517. https://doi.org/10.1111/jedm.12158

Flach (2014). Generalized linear models with parametric link families in R (Unpublished master’s thesis). Technische Universität München, Department of Mathematics, München.

Zhang, Y.-H., Shi, & Tao (2021). A Gibbs sampler for the multidimensional four-parameter logistic item response model via a data augmentation scheme. British Journal of Mathematical and Statistical Psychology, 74(3), 427–464. https://doi.org/10.1111/bmsp.12234

Gay (n.d.). Port library documentation. Retrieved December 13, 2022, from http://www.netlib.org/port/

Guidotti (2022). calculus: High-dimensional numerical and symbolic calculus in R. Journal of Statistical Software, 104(5), 1–37. https://doi.org/10.18637/jss.v104.i05

Hladká & Martinková (2020). difNLR: Generalized logistic regression models for DIF and DDF detection. The R Journal, 12(1), 300–323. https://doi.org/10.32614/RJ-2020-014

Hladká, Martinková, & Magis (2024). Combining item purification and multiple comparison adjustment methods in detection of differential item functioning. Multivariate Behavioral Research, 59(1), 46–61. https://doi.org/10.1080/00273171.2023.2205393

Kaiser, M. S. (1997). Maximum likelihood estimation of link function parameters. Computational Statistics & Data Analysis, 24(1), 79–87. https://doi.org/10.1016/S0167-9473(96)00055-2

Kim & Oshima (2013). Effect of multiple testing adjustment in differential item functioning detection. Educational and Psychological Measurement, 73(3), 458–470. https://doi.org/10.1177/0013164412467033

Kingdom, F. A., & Prins (2016). Psychophysics: A practical introduction (2nd ed.). Academic Press. https://doi.org/10.1016/C2012-0-01278-1

Kolek, Martinková, Vařejková, Šisler, & Brom (2024). Is video games’ effect on attitudes universal? Examining the effects of perspective-taking game mechanics and attitude importance. Journal of Computer Assisted Learning, 40(2), 667–684.

Kolek, Šisler, Martinková, & Brom (2021). Can video games change attitudes towards history? Results from a laboratory experiment measuring short- and long-term effects. Journal of Computer Assisted Learning, 37(5), 1348–1369. https://doi.org/10.1111/jcal.12575

Kopf, Zeileis, & Strobl (2015). Anchor selection strategies for DIF analysis: Review, assessment, and new approaches. Educational and Psychological Measurement, 75(1), 22–56. https://doi.org/10.1177/0013164414529792

Lockwood, J. R., & McCaffrey, D. F. (2014). Correcting for test score measurement error in ANCOVA models for estimating treatment effects. Journal of Educational and Behavioral Statistics, 39(1), 22–52. https://doi.org/10.3102/1076998613509405

Lockwood, J. R., & McCaffrey, D. F. (2017). Simulation-extrapolation with latent heteroskedastic error variance. Psychometrika, 82, 717–736. https://doi.org/10.1007/s11336-017-9556-y

Loken & Rulison, K. L. (2010). Estimation of a four-parameter item response theory model. British Journal of Mathematical and Statistical Psychology, 63(3), 509–525. https://doi.org/10.1348/000711009X474502

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Routledge. https://doi.org/10.4324/9780203056615

Magis, Tuerlinckx, & De Boeck (2015). Detection of differential item functioning using the lasso approach. Journal of Educational and Behavioral Statistics, 40(2), 111–135. https://doi.org/10.3102/1076998614559747

Martinková & Drabinová (2018). ShinyItemAnalysis for teaching psychometrics and to enforce routine analysis of educational tests. The R Journal, 10(2), 503–515. https://doi.org/10.32614/RJ-2018-074

Martinková & Hladká (2023). Computational aspects of psychometric methods: With R. Chapman and Hall/CRC. https://doi.org/10.1201/9781003054313

Martinková, Hladká, & Potužníková (2020). Is academic tracking related to gains in learning competence? Using propensity score matching and differential item change functioning analysis for better understanding of tracking implications. Learning and Instruction, 66, Article 101286. https://doi.org/10.1016/j.learninstruc.2019.101286

McCullagh & Nelder, J. A. (1989). Generalized linear models (2nd ed.). Chapman & Hall.

Meng, Zhang, & Tao (2020). Marginalized maximum a posteriori estimation for the four-parameter logistic model under a mixture modelling framework. British Journal of Mathematical and Statistical Psychology, 73, 51–82. https://doi.org/10.1111/bmsp.12185

Pregibon (1980). Goodness of link tests for generalized linear models. Journal of the Royal Statistical Society: Series C (Applied Statistics), 29(1), 15–24. https://doi.org/10.2307/2346405

R Core Team. (2023). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria. https://www.R-project.org/

Richards, F. S. (1961). A method of maximum-likelihood estimation. Journal of the Royal Statistical Society: Series B (Methodological), 23(2), 469–475. https://doi.org/10.1111/j.2517-6161.1961.tb00430.x

Ritz, Baty, Streibig, J. C., & Gerhard (2015). Dose-response analysis using R. PLOS ONE, 10(12), Article e0146021. https://doi.org/10.1371/journal.pone.0146021

Scallan, Gilchrist, & Green (1984). Fitting parametric link functions in generalised linear models. Computational Statistics & Data Analysis, 2(1), 37–49. https://doi.org/10.1016/0167-9473(84)90031-8

Swaminathan & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27(4), 361–370. https://doi.org/10.1111/j.1745-3984.1990.tb00754.x

To, Lai, Lam, & Chung (2012). The growth pattern and fuel life cycle analysis of the electricity consumption of Hong Kong. Environmental Pollution, 165, 1–10. https://doi.org/10.1016/j.envpol.2012.02.007

Tutz & Schauberger (2015). A penalty approach to differential item functioning in Rasch models. Psychometrika, 80, 21–43.

van der Vaart, A. W. (1998). Asymptotic statistics. Cambridge University Press. https://doi.org/10.1017/CBO9780511802256

Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (4th ed.). Springer. https://doi.org/10.1007/978-0-387-21706-2

Wang & Zhu (2023). Using lasso and adaptive lasso to identify DIF in multidimensional 2PL models. Multivariate Behavioral Research, 58(2), 387–407. https://doi.org/10.1080/00273171.2021.1985950

Wang, W.-C., & Yeh, Y.-L. (2003). Effects of anchor item methods on differential item functioning detection with the likelihood ratio test. Applied Psychological Measurement, 27(6), 479–498. https://doi.org/10.1177/0146621603259902

Zub, Rambaud, Béthencourt, & Brancourt-Hulmel (2012). Late emergence and rapid growth maximize the plant development of Miscanthus clones. BioEnergy Research, 5(4), 841–854. https://doi.org/10.1007/s12155-012-9194-2
