Automatic Item Generation Measurement Models Respecting the Stochastic Sampling Space for Cross-Classified and Two-Level Sampling of Subjects and Incidentals

Abstract

In Automatic Item Generation (AIG), item incidentals refer to surface characteristics of an item that are assumed not to influence item parameters (e.g., item difficulty), whereas item radicals refer to attributes that are presumed to affect these parameters. Within the empirical validation process of the item generator, subjects and incidentals may either be sampled independently so that every subject sees every incidental (cross-classified sampling) for a radical, or incidentals may be sampled within each subject so that every subject only sees a specific set of incidentals (two-level sampling) for a radical. We present an approach for scrutinizing the effect of item incidentals relying on two classical test theory models that adhere to the stochastic sampling space of cross-classified and two-level sampling, respectively. We show how these may be used in combination to enable a more optimized investigation of incidental-induced variance within the item generator. We illustrate the approach with the figural short-term memory item-generator “figumem.” Results show that incidentals have little effect on item difficulty in the cross-classified model/sample and that the model parameters generalize to a larger set of incidentals in the two-level model/sample. Implications, limitations, and future research are discussed.

Keywords

automatic item generation, cross-classified, figural memory, stochastic sampling space, structural equation modeling, two-level

Introduction

Automatic Item Generation (AIG) is a modern computer-based approach to test construction based on psychometric and cognitive models (Gierl & Lai, 2012). Usually, AIG consists of two steps: (1) item models are created that serve the role of templates or prototypes and (2) an algorithm manipulates specific elements of these item models to create the individual items (Gierl & Lai, 2012). Irvine (2013) introduced the terms radicals and incidentals for the specific elements to be varied in the second step. Radicals are those structural elements that influence item parameters like difficulties. In contrast, incidentals are surface elements that have no significant effect on the item parameters. For example, one could think of a test measuring numerical short-term memory (STM) with participants/subjects having to memorize a list of natural numbers with all numbers being of the same digit length (e.g., a list of two-digit numbers like 46, 73, and 91; or a list of three-digit numbers like 135, 957, and 641). Here, the digit length for each number would be the radical, since it is assumed to determine item difficulty (it is more difficult to remember a list of three-digit numbers than an equally long list of two-digit numbers). The exact numbers and their order within the list would be the incidentals, assuming that they do not determine item parameters (which might require some restrictions, e.g., excluding easily memorable repdigits like 11 or 222).

AIG can follow either a “strong” or “weak theory” approach. In the “strong theory” approach, the creation of item models is guided by scientific theories that can explain the cognitive features relevant for item responding and thus the item parameters (Gierl & Lai, 2012). On the other hand, in a “weak theory” approach, existing or newly created items require extensive field or pilot testing to determine the elements that influence the item parameters (Gierl & Lai, 2012). Since only radicals have a significant influence on item parameters (in contrast to incidentals), an item with a specific set of radical realizations can function as a so-called parent item (Glas et al., 2009). By letting incidentals vary randomly, item families consisting of parallel items with the same item parameters can be created (Irvine, 2013). The set of items that have the same radical realizations but different incidental realizations and therefore belong to one item family are commonly called isomorphs (Bejar, 2013) or clones (Arendasy et al., 2008).

In the past, AIG has primarily been used to develop items intended to measure cognitive abilities (Jendryczko et al., 2020). In addition to its use to measure knowledge of a specific subject (e.g., Circi et al., 2023; Falcão et al., 2022; Kurdi et al., 2020), AIG was predominantly used to create test material for figural (e.g., M. Arendasy & Sommer, 2005, 2010; Bejar, 1990; Bertling, 2012; Blum et al., 2016; Embretson, 2013; Gierl et al., 2008; Loe, 2019; Zeuch et al., 2011), verbal (e.g., M. Arendasy et al., 2006; M. E. Arendasy & Sommer, 2012; Bertling, 2012; Gierl et al., 2008; Holling et al., 2009; Loe, 2019; Loe et al., 2018) or numerical reasoning (e.g., Gierl et al., 2008; Loe, 2019) abilities. Driven by recent advances in machine learning (ML), particularly in deep learning (DL), natural language processing (NLP), and large language models (LLMs), research on AIG has increased substantially (e.g., Kıyak et al., 2024; Lee et al., 2024; von Davier, 2018). Recent developments enable the use of AIG for noncognitive constructs such as personality traits (Hommel et al., 2022) or even novel constructs (Götz et al., 2023) in a weak theory approach. Although AIG based on LLMs facilitates the generation of extensive initial item pools, human expertise remains essential for selecting promising items and conducting their validation.

In a strong theory approach, on the other hand, researchers have to identify the construct representation (Whitely, 1983), that is, the cognitive features that influence item responding (Harrison et al., 2017). Hypotheses about the relationship between structural features of items and their item parameters can be derived from the construct representation and tested with empirical data (Harrison et al., 2017). If the hypothesized relationship holds, this is evidence for the construct validity of the test and, thus, for newly created items with the same AIG process (Harrison et al., 2017). Since AIG is usually based on these quality control mechanisms, one strength of AIG is the possibility for decreased bias caused by misinterpretation in human item writing and the interpretation of test scores (Lai et al., 2009). AIG can help in creating a larger set of parallel test forms, which are, for example, useful when conducting longitudinal studies to avoid confounding effects of repeated item exposure (Jendryczko et al., 2019; Reeve & Lam, 2005). Even in non-longitudinal contexts, item exposure can be a problem if items are already known to the public (see Jendryczko et al., 2020; for a general overview of AIG-benefits, see Bertling, 2012).

As in traditional test construction, items created with an AIG procedure should be validated. According to Messick (1989), validity is “an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores.” Hence, the validation process is specific to the measure and the underlying theory and there is no simple “recipe” for all validation processes. This is also reflected in the conceptualization of validity as multidimensional, differentiating between different aspects of validity (e.g., Cooper, 2023). Modern approaches to test validation like the Evidence-Centered Design (ECD) framework build upon this multidimensional conceptualization and outline processes to establish validity by explicitly connecting task performance to actual ability (e.g., Mislevy & Haertel, 2006). Importantly, a test has to be reliable in order to be valid. That is, the impact of measurement error should be as small as possible. Thus, the investigation of test reliability can be considered an early step in the validation-process that is concerned with the nomothetic span (statistical relationships between test items) and not yet with the construct representation. In the reliability context, incidentals inherit an important meaning as they should not only be irrelevant for the between-person variability of the construct but should also not increase measurement-error variance. Put another way, while incidentals technically reflect a potential source of variance as they are known properties of an AIG-based test, different incidental realizations should have low impact and not affect measurement residuals. We refer to this concept as “noise stability.”

There are two options to investigate noise stability of automated item generators. Borrowing terminology from multilevel modeling, we will refer to them as the cross-classified and the two-level approach. In the cross-classified approach, the same incidentals are sampled for each subject (person/participant). This means that a certain number of items will be generated by the AIG procedure and presented to each subject. Hence, incidentals are nested in subjects, but subjects are also nested in incidentals (subjects and incidentals form two separate strata at level 2 and the subject-incidental interaction is located at level 1). This differs from the two-level approach, in which different incidentals are sampled for each subject. In consequence, each subject is presented with a unique set of items, with incidentals (level 1) being nested in subjects (level 2). The two-level approach allows for better generalization because the sample of incidentals is larger than in the cross-classified approach (under the practical assumption that the number of items presented per subject is the same in both approaches). Therefore, inductive inference from the sample of realized items to the population of all possible AIG items is more trustworthy in the two-level approach. On the other hand, when using the two-level approach, each specific realization of incidentals is used for exactly one measurement. For this reason, the incidental effect on the item score cannot be estimated since it is not possible to disentangle it from the subject-incidental interaction. This, however, is possible in the cross-classified approach, where each realization of incidentals is presented to all subjects.
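The difference between the two sampling schemes can be made concrete with a small sketch. The function names below are hypothetical, incidental realizations are represented by plain integer IDs, and radical levels are ignored for simplicity. The inputs (208 and 204 subjects, nine items per subject) mirror the two empirical examples reported later in this paper.

```python
from collections import Counter


def cross_classified_design(n_subjects, n_items):
    """Every subject sees the same n_items incidental realizations."""
    shared_items = list(range(n_items))
    return {subject: list(shared_items) for subject in range(n_subjects)}


def two_level_design(n_subjects, n_items):
    """Every subject sees a fresh, unique set of n_items incidental realizations."""
    return {
        subject: [subject * n_items + k for k in range(n_items)]
        for subject in range(n_subjects)
    }


def summarize(design):
    """Count distinct incidental realizations and how many subjects saw each one."""
    exposure = Counter(item for items in design.values() for item in items)
    return {
        "distinct_incidentals": len(exposure),
        "subjects_per_incidental": sorted(set(exposure.values())),
    }


cross = summarize(cross_classified_design(n_subjects=208, n_items=9))
nested = summarize(two_level_design(n_subjects=204, n_items=9))
print(cross)
print(nested)
```

With these inputs, the cross-classified design uses 9 distinct realizations, each seen by all 208 subjects, whereas the two-level design uses 1,836 distinct realizations, each seen by exactly one subject. This is the trade-off described above: a broader sample of incidentals versus the ability to separate the incidental effect from the subject-incidental interaction.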

If the AIG procedure is properly implemented in terms of the realized incidentals, then noise stability will be observed across the two approaches. The proposed statistical structure at the person level should remain unaffected by the approach chosen, the between-incidental variance in the cross-classified approach should be marginally low, and the subject-incidental interaction variance should be approximately the same across the two approaches.

The aim of the current contribution is to show how to properly respect the stochastic sampling procedure in the statistical model for both study designs (cross-classified and two-level sampling). This will be illustrated with the figural STM-AIG test “figumem” created by Jendryczko et al. (2020). Parameter interpretations and procedures for drawing valid conclusions about the noise stability of an item generator are outlined, and the noise stability of figumem AIG is empirically evaluated.

The remainder of this paper is structured as follows: Figumem AIG and its theoretical assumptions will be presented. This includes describing what the radicals and what the incidentals are. Then, a general AIG Structural Equation Modelling (SEM)-framework for the cross-classified sampling will be derived and it will be shown how a simplified form applies to two-level sampling. Both models will be applied to empirical data sets. The paper concludes with a substantive discussion of the results, implications, limitations, and future research.

Description and Brief Presentation of Figumem Items

Figumem is a test to assess figural-visual STM capacity (see Jendryczko et al., 2020 for its introduction and a user’s manual). It is based on AIG and is provided as an R-package.¹ AIG of figumem items builds on a “strong theory” about the relationship between visual load and feature bindings. It is known that the number of units of information that can be remembered for a short period of time (the STM capacity) is severely constrained (Trapp et al., 2021). Regarding visual material, the number of units refers not only to the number of visual objects, but also to their visual (information) load. Visual load is operationalized as the time to find an object in a set of similar objects (Alvarez & Cavanagh, 2004). As shown in a series of experiments, visual load depends mostly on the complexity of the bindings of stimulus features (e.g., color, shape, orientation, pattern; Ueno et al., 2010, 2011). Through manipulation of the feature bindings of the presented objects, AIG of figumem items varies the visual load and thus the expected relative number of objects that can be stored in STM, rendering the items more or less difficult.

In the figumem test, subjects have to memorize which emblems (figures of set A) are associated with which frames (figures of set B; see Figure 1). Each item consists of 20 emblems, each surrounded by a frame. In the learning phase, the framed emblems are displayed in a 5 × 4 matrix for 1 minute. In the recall phase, the same emblems are displayed in a different order, yet the emblems are presented together with four different frames underneath them (1 correct, 3 distractors). The number of correctly marked frames represents the subjects’ score on the item (ranging from 0 to 20). While the figumem package also supports lower numbers of emblems, different time lengths of the learning phase, and an open response format, this paper is concerned only with the standard setup used in the original study by Jendryczko et al. (2020).
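The scoring rule described above is a simple count of matches. The following sketch uses hypothetical function names and is not taken from the figumem package. The guessing baseline is an added assumption (blind, uniform guessing among the four displayed frames) that the source does not discuss; it is included only to give the 0 to 20 score range a reference point.

```python
def score_item(responses, answer_key):
    """Count the emblems whose marked frame matches the frame shown while learning."""
    if len(responses) != len(answer_key):
        raise ValueError("responses and answer_key must have the same length")
    return sum(marked == correct for marked, correct in zip(responses, answer_key))


def expected_score_under_guessing(n_emblems=20, n_options=4):
    """Expected score if every response were a uniform random guess (assumption)."""
    return n_emblems / n_options


# Toy example with three emblems; frames are labeled by hypothetical integer IDs.
print(score_item(responses=[7, 2, 5], answer_key=[7, 3, 5]))
print(expected_score_under_guessing())
```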

Figure 1.

Examples of figumem items with different radical and incidental levels. Figure taken with permission from Jendryczko et al. (2020)

Based on theories of visual load and feature binding (Alvarez & Cavanagh, 2004; Ueno et al., 2010, 2011), a radical with three distinct levels is specified, resulting in the generation of three parent items. By varying the incidental realizations, three item families are created. For the first radical level, 20 distinct emblems and 20 distinct frames are used which differ only in their shape. Thus, the figures possess a low visual load as subjects only need to keep track of one stimulus feature (the shape). Therefore, the item difficulty should be low (examples can be seen in the first column of Figure 1). For the second radical level, the 20 distinct emblems from the first radical level are used again. Four frames from the first radical level (rectangle, trapeze, pentagon & hexagon) are modified such that four variants of each of them are created, each by erasing a different line of the corresponding frame. Those four frames and their variants are used for items with the second radical level. This increases the visual load compared to the first radical level since the frames are more similar and subjects have to keep track of two stimulus features: shape and completeness. Thus, item difficulty should be higher than for the first radical level (examples can be seen in the second column of Figure 1). For the third radical level, the frames from the second radical level are used again. Five emblems from the first radical level (right triangle, black circle with the missing quadrant in the top right, black circle with the prong in the top right, “L”-shape & black ring) are modified such that four variants of each of them are created that are being used instead of the original emblems. This is done for the black ring by editing out a piece of the ring at either the top, right side, bottom, or left side. The other four of the five emblems are rotated clockwise by either 0° (i.e., not changed at all), 90°, 180° or 270°. 
This increases the visual load compared to the first two radical levels since the frames and emblems are, respectively, more similar and subjects have to keep track of three stimulus features: shape, completeness and orientation. Thus, the item difficulty should be the highest of the three radical levels (examples can be seen in the third column of Figure 1).

While the three postulated radical levels determine three item families with theoretically distinct item difficulties, there are three different postulated incidentals that determine specific items. It is randomly varied (1) which emblem is surrounded by which frame, (2) in which order the figures in the 5 × 4 matrix are presented, and (3) which three frames are used as distractors (with the constraint that each frame appears equally often as a distractor).
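The three incidentals can be sketched for an item of the first radical level, where 20 distinct emblems and 20 distinct frames are used. This is a minimal illustration and not the algorithm of the figumem package: the function name, the integer IDs, the reading of the 5 × 4 matrix as five rows of four, and the cyclic-shift construction used to balance the distractors are all assumptions.

```python
import random
from collections import Counter

N_PAIRS = 20       # emblem-frame pairings per item (standard setup)
N_DISTRACTORS = 3  # distractor frames shown per emblem in the recall phase


def draw_incidentals(seed=None):
    """Draw one hypothetical incidental realization for a level-1 item."""
    rng = random.Random(seed)
    emblems = list(range(N_PAIRS))

    # (1) Which emblem is surrounded by which frame: a random one-to-one pairing.
    frames = list(range(N_PAIRS))
    rng.shuffle(frames)
    pairing = dict(zip(emblems, frames))

    # (2) Order of the framed emblems in the matrix (assumed: 5 rows of 4).
    order = emblems[:]
    rng.shuffle(order)
    matrix = [order[row * 4:(row + 1) * 4] for row in range(5)]

    # (3) Distractors: place the frames on a random ring and pick three distinct
    # non-zero shifts. Each shift maps the set of frames onto itself one-to-one,
    # so every frame serves as a distractor exactly three times and never
    # appears as a distractor for its own emblem.
    ring = list(range(N_PAIRS))
    rng.shuffle(ring)
    position = {frame: k for k, frame in enumerate(ring)}
    shifts = rng.sample(range(1, N_PAIRS), N_DISTRACTORS)
    distractors = {
        emblem: [ring[(position[pairing[emblem]] + d) % N_PAIRS] for d in shifts]
        for emblem in emblems
    }
    return pairing, matrix, distractors


pairing, matrix, distractors = draw_incidentals(seed=42)
usage = Counter(frame for chosen in distractors.values() for frame in chosen)
print(set(usage.values()))  # how often each frame is used as a distractor
```

The cyclic-shift construction is only one way to satisfy the balancing constraint; it was chosen here because the constraint then holds by construction rather than by rejection sampling.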

The hypothesized psychometric properties were largely confirmed in an empirical study using the cross-classified approach (Jendryczko et al., 2020). Radical level mostly determined the item difficulties while incidentals had a small, yet not negligible effect. For items with low visual load, subjects were on average able to memorize roughly 10–11 figure associations with considerable variation. For example, the difference in predicted scores for two specific incidental realizations used in the study was 6.82% (an item score of 0.74). For items with medium visual load, the influence of incidentals was slightly higher (and also higher than for items with high visual load). For example, the difference was 12.36% (an item score of 1.11) when comparing two specific incidental realizations used in the study (subjects were on average able to memorize roughly 9-10 figure associations). Accordingly, Jendryczko et al. (2020) recommended taking these findings into account when administering the test and adopting a conservative stance when interpreting small differences in item scores. Despite these potential influences, figumem “holds the potential for an efficient, reliable, and repeatable assessment of figural memory in various non-clinical and clinical populations” (Jendryczko et al., 2020, p. 12).

Model Development

We derive the underlying psychometric model relying on the example of figumem (see Figure 1). Yet, the model applies to all cases where subjects and incidentals are cross-classified. In the specific case of figumem, subjects are sampled from the population of subjects with s denoting the specific sampled subject and, additionally, surface characteristics of an item are sampled from the population of incidentals with i denoting the specific surface characteristic. If every subject “sees” every incidental realization (and vice versa) the sampling process is cross-classified (for a formal definition of the stochastic sampling space see Koch et al., 2016). Further, there is a non-random radical variable for which three structurally different outcomes are given (r = 1, 2, or 3) which correspond to item-difficulty.

In the following, the cross-classified model will be formally defined step by step. It will be shown how the model for the two-level sampling process (each subject only “sees” her or his individual set of incidentals) can be derived by simplifying the cross-classified model. The models are shown in Figure 2.

Figure 2.

Example for the proposed classical test theory AIG-model for a construct measured with three different radical levels (r = 1, 2, or 3). $Y_{s i r}$ = observed response variable of subject s to an item comprised of incidental realization i and radical level r, $γ_{s i r}$ = subject-incidental interaction specific noise variable for radical level r, $η_{s}$ = latent subject-ability variable for the reference-radical level r = 1, $λ_{r}$ = factor-loading for the non-reference radical level r ≠ 1 on the latent ability, $α_{s r}$ = radical-effect variable for the non-reference radical level r ≠ 1 at the subject-level, $σ_{α_{2} α_{3}}$ = radical-effect covariance for the two non-reference radical levels r = 2 and r = 3 at the subject level, $β_{i r}$ = incidental-effect variable for radical level r. Intercepts are not shown. If incidentals are not sampled independently of subjects but nested within subjects (two-level sampling process instead of cross-classified sampling process), the incidental level drops from the model and the $β_{i r}$ are confounded within the noise variables at level 1

According to classical test theory (CTT), any observed response $Y_{s i r}$ of subject s to incidental realization i for radical level r can be decomposed into a true score $τ_{s i r}$ and a measurement-error $ε_{s i r}$ :

$$Y_{s i r} = τ_{s i r} + ε_{s i r}. \tag{1}$$

The true score can be further decomposed into expected values according to the stochastic sampling space (Jendryczko & Nussbeck, 2024; Koch et al., 2016):

$$τ_{s i r} = μ_{r} + η_{s r} + β_{i r} + ψ_{s i r}. \tag{2}$$

Here, the unconditional expectation for the true score $E (τ_{s i r})$ is given by the intercept (mean) for radical level r: $μ_{r}$ . The conditionally expected deviation from $μ_{r}$ given the outcome s of the random subject variable $E (τ_{s i r} - μ_{r} | s)$ for radical level r is denoted as $η_{s r}$ and can be interpreted as the latent subject-ability variable for radical level r. The conditionally expected deviation from $μ_{r}$ given the outcome i of the random incidental variable $E (τ_{s i r} - μ_{r} | i)$ for radical level r is denoted as $β_{i r}$ and can be interpreted as a latent incidental effect. Lastly, the conditionally expected residual deviation from the previous expectations given the specific combination of s and i (the latent subject-incidental interaction effect) is denoted as $E (τ_{s i r} - (μ_{r} + η_{s r} + β_{i r}) | s, i) = ψ_{s i r}$ .

The ability variables for different radical levels are assumed to be correlated since they are assumed to measure the same psychological construct within subjects. We can, thus, define the latent ability of one reference-radical level (e.g., r = 1) as a standard ( $η_{s 1} = η_{s}$ ) and regress the abilities for the non-reference radical levels (r ≠ 1) on this standard ability measure:

$$η_{s r} = λ_{r} η_{s} + α_{s r}. \tag{3}$$

Here, $α_{s r}$ denotes the residual of the ability for the non-reference radical level r after accounting for the reference-level ability, with $E (η_{s r} | η_{s}) = λ_{r} η_{s}$. It can be interpreted as a latent radical effect that goes beyond the item difficulty represented by the unconditional expectation $μ_{r}$. It reflects a specific interplay between a subject’s ability and the non-reference radical level. Note that no intercept is given in equation (3) since the latent ability variables reflect conditional deviations and, therefore, have unconditional expectations of zero.
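The absence of an intercept in equation (3) can be verified with the law of iterated expectations. The short derivation below is a sketch that relies only on the definitions given for equations (2) and (3); it assumes the `amsmath` package for the `align*` environment.

```latex
\begin{align*}
E(\eta_{sr})
  &= E\bigl[\,E(\tau_{sir} - \mu_r \mid s)\,\bigr]
   = E(\tau_{sir}) - \mu_r
   = 0, \\
E(\alpha_{sr})
  &= E(\eta_{sr}) - \lambda_r\,E(\eta_s)
   = 0 - \lambda_r \cdot 0
   = 0 .
\end{align*}
```

The first line uses $E(τ_{s i r}) = μ_{r}$; the second follows by taking expectations on both sides of equation (3).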

Putting everything together, we arrive at

$$Y_{s i 1} = μ_{1} + η_{s} + β_{i 1} + ψ_{s i 1} + ε_{s i 1} \tag{4}$$

for the reference radical level r = 1 and

$$Y_{s i r} = μ_{r} + λ_{r} η_{s} + α_{s r} + β_{i r} + ψ_{s i r} + ε_{s i r} \tag{5}$$

for the non-reference radical levels r ≠ 1. Importantly, no incidental is repeated across different radical levels which means that the different $β_{i r}$ and $ψ_{s i r}$ across different radical levels are, respectively, uncorrelated. Thus, no measurement structure can be assumed for these latent variables. This also implies that, empirically, the latent subject-incidental interaction effect cannot be disentangled from measurement error, giving

$$\begin{array}{l} Y_{s i 1} = μ_{1} + η_{s} + β_{i 1} + γ_{s i 1} \quad \text{and} \\ Y_{s i r} = μ_{r} + λ_{r} η_{s} + α_{s r} + β_{i r} + γ_{s i r} \end{array} \tag{6}$$

with $γ_{s i 1} = ψ_{s i 1} + ε_{s i 1}$ and $γ_{s i r} = ψ_{s i r} + ε_{s i r}$. Note that, from a practical viewpoint, this confound may be regarded to be of little importance: $γ$ may be considered a random noise variable since there should not be any systematics behind the subject-incidental interaction. Note further that different radical-effect variables (r ≠ r’) may covary. These covariances will be denoted as $σ_{α_{r} α_{r^{'}}}$ in the following (the respective correlation will be denoted as $ρ_{α_{r} α_{r^{'}}}$).

In the case of a cross-classified sampling procedure (incidentals are repeated across subjects), the model as formulated in equation (6) is well-constrained (i.e., “identified” in frequentist terminology) and estimable as a structural equation model (SEM). In the case of a two-level sampling procedure, incidental effects $β_{i r}$ cannot be estimated as every incidental realization only appears once (is only seen by one subject). Any potential incidental effects are then additionally confounded within the noise variable $γ$ and the model reduces to:

$$\begin{array}{l} Y_{s i 1} = μ_{1} + η_{s} + γ_{s i 1} \quad \text{and} \\ Y_{s i r} = μ_{r} + λ_{r} η_{s} + α_{s r} + γ_{s i r}. \end{array} \tag{7}$$

If incidentals have no or only negligible influence, then (1) any variability at the person level should remain stable regardless of the employed incidentals; (2) the variability of the incidental effect $β_{i r}$ around its mean of zero should be zero or very low in the cross-classified approach; and (3) the variability of the random noise variable $γ_{s i r}$ should be stable across the two approaches. Because absolute variances are difficult to interpret on their own, one should also investigate relative variance components.
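The confounding of incidental effects with noise under two-level sampling can be illustrated by simulation. The sketch below generates data for the reference radical level only (so $λ_{1} = 1$ and no $α$ term) under both sampling schemes. All parameter values are hypothetical, and the number of incidentals (30) is deliberately much larger than the three per radical level used in the empirical examples so that the estimates are stable. The estimator is a simple method-of-moments (two-way random-effects ANOVA) stand-in and not the structural equation model used in this paper.

```python
import random
from statistics import fmean, variance


def simulate(design, n_subjects=1000, n_incidentals=30, mu=10.0,
             var_eta=4.0, var_beta=1.0, var_gamma=2.0, seed=1):
    """Simulate Y = mu + eta_s + beta_i + gamma_si under a given sampling design."""
    if design not in ("cross", "two_level"):
        raise ValueError("design must be 'cross' or 'two_level'")
    rng = random.Random(seed)
    eta = [rng.gauss(0.0, var_eta ** 0.5) for _ in range(n_subjects)]
    # Cross-classified sampling: one incidental effect per column, shared by all.
    shared_beta = [rng.gauss(0.0, var_beta ** 0.5) for _ in range(n_incidentals)]
    rows = []
    for s in range(n_subjects):
        row = []
        for i in range(n_incidentals):
            if design == "cross":
                beta = shared_beta[i]
            else:
                # Two-level sampling: a fresh incidental for every measurement.
                beta = rng.gauss(0.0, var_beta ** 0.5)
            gamma = rng.gauss(0.0, var_gamma ** 0.5)
            row.append(mu + eta[s] + beta + gamma)
        rows.append(row)
    return rows


def within_subject_variance(rows):
    """Pooled variance of scores around each subject's own mean."""
    return fmean(variance(row) for row in rows)


def two_way_components(rows):
    """Moment estimates of (var_beta, var_gamma) for cross-classified data."""
    n, k = len(rows), len(rows[0])
    row_means = [fmean(row) for row in rows]
    col_means = [fmean(row[i] for row in rows) for i in range(k)]
    grand = fmean(row_means)
    resid_ss = sum(
        (rows[s][i] - row_means[s] - col_means[i] + grand) ** 2
        for s in range(n) for i in range(k)
    )
    ms_resid = resid_ss / ((n - 1) * (k - 1))  # expectation: var_gamma
    ms_col = n * variance(col_means)           # expectation: var_gamma + n * var_beta
    return (ms_col - ms_resid) / n, ms_resid


var_beta_hat, var_gamma_hat = two_way_components(simulate("cross"))
print("cross-classified:", var_beta_hat, var_gamma_hat)
print("two-level, within-subject:", within_subject_variance(simulate("two_level")))
```

In the cross-classified data the two components are estimated separately, whereas the within-subject variance of the two-level data estimates only their sum, $σ_{β}^{2} + σ_{γ}^{2}$. This mirrors the reduction from equation (6) to equation (7).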

Variance Decomposition and Consistency Parameters

Since all latent variables manifesting in an observation are orthogonal (see Figure 2), the variance of each observed variable equals the sum of the latent variances, with the ability variance weighted by the squared factor loading:

$$σ_{Y_{s i r}}^{2} = λ_{r}^{2} σ_{η_{s}}^{2} + σ_{α_{r}}^{2} + σ_{β_{r}}^{2} + σ_{γ_{r}}^{2}. \tag{8}$$
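Equation (8) follows from applying the variance operator to equation (6) and dropping all covariance terms, which vanish under the orthogonality stated above. The derivation below is a sketch; the numerical values in the last line are hypothetical and serve only to illustrate the arithmetic. The `align*` environment assumes the `amsmath` package.

```latex
\begin{align*}
\sigma^2_{Y_{sir}}
  &= \operatorname{Var}\bigl(\mu_r + \lambda_r \eta_s + \alpha_{sr}
       + \beta_{ir} + \gamma_{sir}\bigr) \\
  &= \lambda_r^2 \sigma^2_{\eta_s} + \sigma^2_{\alpha_r}
       + \sigma^2_{\beta_r} + \sigma^2_{\gamma_r}
     \qquad \text{(all covariances are zero)} \\[4pt]
  &= 0.8^2 \cdot 4 + 0.5 + 0.1 + 3 = 6.16
     \qquad \text{(hypothetical values)}
\end{align*}
```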

Therefore, we can define three meaningful consistency-parameters (see also Koch et al., 2016) for the subject-level variance as relative variance components. The radical-level consistency coefficient (RCon_r) for a specific non-standard radical level (r ≠ 1) is defined as:

$$\mathrm{RCon}_{r} = \frac{λ_{r}^{2} σ_{η_{s}}^{2}}{λ_{r}^{2} σ_{η_{s}}^{2} + σ_{α_{r}}^{2}}. \tag{9}$$

It depicts the consistency with which the ability variability as assessed with the standard radical level (r = 1) manifests in the ability variability as assessed with a non-standard radical level (r ≠ 1) and, thus, the degree to which the construct remains stable across radical levels. If different AIG-items measure the exact same construct with only different levels of difficulty, radical-level consistency should be close to one as mere differences in difficulty across items should be fully captured by different unconditional expectations (different $μ_{r}$ ).

The Level-2 consistency coefficient for a specific radical level r (L2Con_r) depicts the subject-level variance relative to the total variance at level-2, that is, the variance at the subject level plus the variance at the incidental level:

$$\mathrm{L2Con}_{r} = \frac{λ_{r}^{2} σ_{η_{s}}^{2} + σ_{α_{r}}^{2}}{λ_{r}^{2} σ_{η_{s}}^{2} + σ_{α_{r}}^{2} + σ_{β_{r}}^{2}}. \tag{10}$$

This consistency should also be close to one if the AIG procedure works sufficiently well with regard to noise stability, as it is assumed that different incidental realizations should not cause additional variability ($σ_{β_{r}}^{2}$ should be low).

The Level-1 consistency coefficient for a specific radical level r (L1Con_r) shows the subject-level variance relative to the total model-implied variance of the observed variable:

$$\mathrm{L1Con}_{r} = \frac{λ_{r}^{2} σ_{η_{s}}^{2} + σ_{α_{r}}^{2}}{σ_{Y_{s i r}}^{2}} = \frac{λ_{r}^{2} σ_{η_{s}}^{2} + σ_{α_{r}}^{2}}{λ_{r}^{2} σ_{η_{s}}^{2} + σ_{α_{r}}^{2} + σ_{β_{r}}^{2} + σ_{γ_{r}}^{2}}. \tag{11}$$

This consistency should be close to one if incidental-specific variability ( $σ_{β_{r}}^{2}$ ) is low and random noise ( $σ_{γ_{r}}^{2}$ ) is low. Moreover, even in the case of substantial random noise, the consistency coefficient should be stable across sampling procedures if random-noise variability ( $σ_{γ_{r}}^{2}$ ) does not substantially depend on the concrete realizations of incidentals. Note that $λ_{1} = λ_{1}^{2} = 1$ and $σ_{α_{1}}^{2} = 0$ for the reference-radical level r = 1. Note further that in the case of a two-level sampling process, $σ_{β_{r}}^{2} = 0$ since no incidental effects $β_{i r}$ are modeled which means that L2Con_r is not meaningful and that L1Con_r simplifies to:

$$\mathrm{L1Con}_{r} = \frac{λ_{r}^{2} σ_{η_{s}}^{2} + σ_{α_{r}}^{2}}{σ_{Y_{s i r}}^{2}} = \frac{λ_{r}^{2} σ_{η_{s}}^{2} + σ_{α_{r}}^{2}}{λ_{r}^{2} σ_{η_{s}}^{2} + σ_{α_{r}}^{2} + σ_{γ_{r}}^{2}}. \tag{12}$$
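The three consistency coefficients are simple ratios of variance components and can be computed from estimated model parameters as follows. The function name and the input values are hypothetical; they are not estimates from the empirical examples.

```python
def consistency_coefficients(lam, var_eta, var_alpha, var_beta, var_gamma):
    """Compute RCon, L2Con, and L1Con (equations (9) to (11)) for one radical level.

    For the reference radical level use lam=1 and var_alpha=0.
    For the two-level model use var_beta=0, which turns L1Con into equation (12);
    L2Con is then trivially 1 and should not be interpreted.
    """
    subject_variance = lam ** 2 * var_eta + var_alpha
    return {
        "RCon": lam ** 2 * var_eta / subject_variance,
        "L2Con": subject_variance / (subject_variance + var_beta),
        "L1Con": subject_variance / (subject_variance + var_beta + var_gamma),
    }


# Hypothetical variance components for a non-reference radical level.
cross_classified = consistency_coefficients(
    lam=0.5, var_eta=4.0, var_alpha=1.0, var_beta=0.5, var_gamma=1.5
)
# Same subject-level parameters in a two-level model, where the incidental
# variance is absorbed into the noise variance.
two_level = consistency_coefficients(
    lam=0.5, var_eta=4.0, var_alpha=1.0, var_beta=0.0, var_gamma=2.0
)
print(cross_classified)
print(two_level)
```

In this constructed example, L1Con is identical in both models because the two-level noise variance equals the sum of the cross-classified incidental and noise variances. That is the pattern one would expect if the noise variability does not depend on the concrete incidental realizations.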

In the following, two empirical examples will be presented. Data and code for both examples can be found in the supplementary materials at https://osf.io/ern58/.

Empirical Example: Data From the Cross-Classified Sampling Process

We start with a summary of the data-collection process described in the original paper by Jendryczko et al. (2020). Data collected from this study will be used to illustrate the cross-classified model (equation (6)).

Sample

The final sample consisted of N = 208 subjects (146 female, 61 male and 1 not stated). On average, subjects were 22.51 years old (SD = 6.18) with most of them being university students. Detailed information about the sample is given in Jendryczko et al. (2020).

Materials

Nine different figumem items (as displayed in Figure 1) were used (three of each radical level, all consisting of 20 emblem-frame pairings). After the learning phase (60 seconds), the emblems were displayed one below the other and subjects had to give their responses and scroll down to the next emblem. Each emblem was presented together with four frames (one correct option and three distractors; presented below each emblem).

Procedure

After giving informed consent and receiving the instruction not to use supporting devices (like paper and pencil), subjects were presented with a black rectangle of the size of a figumem item to check if the items fit on the subjects’ screens. Figumem items were explained with the presentation of an example item for each radical level and a practice item that was presented on the next page. The practice item contained only two framed emblems. It was the only item where subjects received feedback and could click a “next” button if they were able to memorize the two emblem-frame pairings before the end of the learning phase. If they were not able to memorize both pairings correctly a second practice item was presented.

Thereafter, the nine figumem items were presented to the subjects. The order of presentation was varied to deal with confounding sequence effects. To avoid negatively affecting subject motivation and performance by starting the survey with multiple difficult items, a Latin Square design (e.g., Jacobson & Matthews, 1996) was used to determine groups of presentation order instead of a full randomization of item sequence (see Figure 3). Items were distributed into three blocks, each of them containing one item of each radical level. Subjects were (pseudo) randomly assigned to one of three conditions (while targeting an equal distribution). The first block constituted a warm-up phase in which the theoretically easiest item (low visual load) was presented first, followed by the item of theoretically medium difficulty (medium visual load) and, last, the item of theoretically highest difficulty (high visual load). This sequence of the first block was the same for all three conditions. In the other two blocks, the sequence of items was rotated according to the Latin Square design, that is, an item of each radical level appeared once at each position. All subjects went through the blocks in the same order. A forced-choice design was chosen to avoid missing data. After responding to all nine items, socio-demographic information was collected.
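The ordering scheme can be sketched as follows. The rotation rule and all names are assumptions made for illustration; the code does not reproduce the exact cell assignment of Figure 3. It only guarantees the two properties described above: a fixed warm-up block (L, M, H) and, within each later block, every radical level appearing once at each position across the three conditions.

```python
import random
from collections import Counter

LEVELS = ("L", "M", "H")  # low, medium, high visual load


def rotate(seq, k):
    """Cyclically rotate a tuple to the left by k positions."""
    k %= len(seq)
    return seq[k:] + seq[:k]


def presentation_plan(levels=LEVELS, n_blocks=3):
    """Return {condition: [block orders]} with a fixed warm-up block."""
    plan = {}
    for condition in range(len(levels)):
        blocks = [tuple(levels)]  # block 1: same warm-up order for everyone
        for block in range(1, n_blocks):
            # Hypothetical rotation rule; any Latin Square would serve here.
            blocks.append(rotate(tuple(levels), condition + block))
        plan[condition] = blocks
    return plan


def assign_conditions(n_subjects, n_conditions=3, seed=0):
    """Pseudo-randomly assign subjects to conditions with near-equal group sizes."""
    rng = random.Random(seed)
    conditions = [s % n_conditions for s in range(n_subjects)]
    rng.shuffle(conditions)
    return conditions


plan = presentation_plan()
for condition, blocks in plan.items():
    print(condition, blocks)
print(Counter(assign_conditions(208)))
```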

Figure 3.

Latin Square for item order in both studies. Subjects were (pseudo) randomized to one of the three conditions and went through the three blocks in the same order. Each block contains one item of each radical level (L = low visual load, M = medium visual load, H = high visual load). This Latin Square was used in the original study by Jendryczko et al. (2020) and in the new study using the two-level approach. In the original study, the number behind each letter (that indicates the radical level) signifies the incidental-ID of a specific item. In the new validation study presented in this paper in which each subject responds to a distinct set of items, the number behind each letter (that indicates the radical level) signifies a block-ID of a specific item. This block-ID only indicates in which block the distinct item is presented. Figure taken with permission from Jendryczko et al. (2020)

Empirical Example: Data From the Two-Level Sampling Process

In the following, a new empirical study using two-level sampling will be presented. This data will be used for an illustration of the two-level model (equation (7)). The Ethics Committee (institutional review board, IRB) of the University of Konstanz waived the requirement for approval of the study. IRB-number: IRB25KN009-03/w.

Sample

Subjects were recruited via the online platform SONA (Sona Systems, Ltd, 2022) mainly from the University of Konstanz. Out of the 239 subjects, 25 had to be excluded because they indicated that their screen did not always fit the complete figumem item. Five subjects reported problems with image loading. Six subjects did not solve the practice item consisting of only two emblem-frame pairings on their second try. Thus, the final sample consisted of N = 204 subjects, with one subject having produced a missing value due to a technical error on one item for the second radical level. Of these, 169 subjects were female, 31 were male, and four subjects selected the gender option “diverse.” On average, subjects were 22.25 years old (SD = 5.04). Most subjects were university students studying Psychology. When asked about their highest degree, 175 subjects responded high school diploma (“A-level”), 14 subjects had a Bachelor’s degree, nine subjects had a Master’s degree or an equivalent degree (“Diplom”), one subject responded “Secondary School,” and five subjects chose “other” and indicated they had either completed vocational training or obtained a degree equivalent to A-levels but from another country. Psychology students were able to obtain course credits for participation. There was no compensation for other subjects.

Materials

The study was created using the online survey platform FiguGen (Jahn, 2023) which supports online figumem AIG with the R-package by Jendryczko et al. (2020). Three figumem items were used to describe the task to subjects. In the testing phase, nine distinct items were created “on-the-fly” for each subject (three items of each radical level, all consisting of 20 emblem-frame pairings). After the learning phase, the emblems were displayed one per page; subjects gave their response and could move to the next (or the previous) emblem by using the “next” (or the “back”) button. Below each emblem, four frames were presented (one correct, three distractors). FiguGen prevented subjects from using the browser’s “back” button so that they could not see the image to be memorized more than once. Another figumem item with only two framed emblems was used as a practice item.

Procedure

The procedure was the same as in the original study by Jendryczko et al. (2020) except for the fact that no item was presented to more than one subject. Again, a Latin Square Design was used to determine groups of presentation order to prevent confounding sequence effects as in Jendryczko et al. (2020; see Figure 3).

Empirical Example: Analytic Strategy

We applied the AIG model to the cross-classified and the two-level data for figumem as previously described. The easiest radical condition (high dissimilarity for both inner emblems and outer shapes, respectively) was chosen as the reference-radical level. This choice is based on the rationale that the easiest radical level requires only the most “basic” memorization process connecting a specific emblem to a specific outer shape, whereas the other two radical conditions require additional disassociation processes to distinguish the outer shapes from each other (radical level 2) or to distinguish both the outer shapes from each other and the inner emblems from each other (radical level 3).

We estimated both models using Bayesian Markov-Chain-Monte-Carlo (MCMC) Gibbs sampling (three chains for each model) with mostly weakly informative priors (see Appendix A) in Mplus (version 8.7; Muthén & Muthén, 1998–2017). For both models, the number of chain iterations was set to 1,000,000 with the first half treated as warm-up (the estimation-time for the cross-classified model was 17 minutes and the estimation-time for the two-level model was 16 minutes on a 3.4 GHz processor personal computer). This procedure was chosen for two reasons: (1) To the best of our knowledge, maximum-likelihood estimation for cross-classified SEMs containing freely estimated factor loadings is yet to be derived (see Jeon & Rijmen, 2014). (2) Bayesian posterior sampling allows for the computation of credibility intervals for variance components, that is, the consistency coefficients displayed above. Model fit was investigated via the Bayesian posterior predictive checking procedure (BPPC; e.g., Asparouhov & Muthén, 2021; Gelman et al., 1996) using the $χ^{2}$ -statistics. Better model fit is indicated by the closeness of the posterior predictive p-value to .5.
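To make the fit criterion concrete, the following minimal Python sketch shows how a posterior predictive p-value can be obtained from paired discrepancy values. It assumes that, for each retained MCMC draw, one already has the $χ^{2}$ discrepancy for the observed data and for a data set replicated under the model; the function name and the input lists are hypothetical, and the sketch is not a reproduction of the internal Mplus computation.

```python
def posterior_predictive_check(chi2_observed, chi2_replicated):
    """Summarize paired chi-square discrepancies from MCMC draws.

    Returns (ppp, mean_difference), where ppp is the share of draws in which
    the replicated discrepancy exceeds the observed one, and mean_difference
    is the average of (observed - replicated).
    """
    if not chi2_observed or len(chi2_observed) != len(chi2_replicated):
        raise ValueError("need two non-empty sequences of equal length")
    n = len(chi2_observed)
    pairs = list(zip(chi2_observed, chi2_replicated))
    ppp = sum(1 for obs, rep in pairs if rep > obs) / n
    mean_difference = sum(obs - rep for obs, rep in pairs) / n
    return ppp, mean_difference


if __name__ == "__main__":
    # Toy values for illustration only (not taken from the study).
    observed = [10.0, 12.0, 9.0, 11.0]
    replicated = [11.0, 10.0, 12.0, 8.0]
    print(posterior_predictive_check(observed, replicated))
```

A value near .5 means that observed and replicated discrepancies are about equally likely to be the larger one, which is why closeness to .5 is read as good fit.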

We additionally computed the posterior distributions of differences between the unconditional expectations of the radical levels ( $μ_{r} − μ_{r'}$ with r ≠ r′) within each model. This allows us to investigate whether items with different radical levels differed significantly in their difficulty (indicated by the credibility interval not including zero). Furthermore, we computed the posterior distributions of the differences in those parameters that are common to both models, in order to investigate whether these parameters differed significantly across the samples. The data and analysis scripts can be found in the supplementary material.
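The difference-based inference described above can be sketched as follows. The code assumes two equally long lists of posterior draws (for example, draws of two intercepts from the same MCMC iterations); the function names and the linear-interpolation quantile rule are illustrative choices, not the authors' scripts.

```python
import statistics


def quantile(sorted_values, prob):
    """Quantile by linear interpolation between order statistics."""
    pos = prob * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def summarize_difference(draws_a, draws_b, level=0.95):
    """Median and equal-tailed credibility interval of draws_a - draws_b."""
    if not draws_a or len(draws_a) != len(draws_b):
        raise ValueError("need two non-empty sequences of equal length")
    diffs = sorted(a - b for a, b in zip(draws_a, draws_b))
    tail = (1 - level) / 2
    lower = quantile(diffs, tail)
    upper = quantile(diffs, 1 - tail)
    return {
        "median": statistics.median(diffs),
        "lower": lower,
        "upper": upper,
        # "Significant" in the sense used here: zero lies outside the interval.
        "excludes_zero": lower > 0 or upper < 0,
    }
```

For differences within one model, the draws should be paired by iteration so that posterior dependencies between the parameters are respected; for differences across the two independently estimated models, the pairing of draws is arbitrary.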

Empirical Example: Results and Discussion

According to the Gelman-Rubin criterion (Gelman & Rubin, 1992), both models converged with the highest potential scale reduction factors being below 1.1 at the final iteration (cross-classified model: 1.015, two-level model: 1.001). Yet, the posterior distribution plots (see Figure 4) and the trace plots (see supporting information) for $μ_{1}$ indicated convergence problems for this parameter in both models. Thus, we will supplement any inferences concerning the intercepts of the models with t-tests.
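For readers unfamiliar with the convergence criterion, a minimal sketch of the potential scale reduction factor for a single parameter is given below. It follows the classic Gelman and Rubin (1992) variance-ratio formulation without the degrees-of-freedom correction; the function name is hypothetical, and the exact computation implemented in Mplus may differ in detail.

```python
import math
import statistics


def potential_scale_reduction(chains):
    """Potential scale reduction factor for one parameter.

    `chains` is a list of equally long lists of post-warm-up draws,
    one list per chain. Values close to 1 indicate that the chains
    agree; 1.1 is the threshold referred to in the text.
    """
    m = len(chains)
    if m < 2:
        raise ValueError("at least two chains are required")
    n = len(chains[0])
    if n < 2 or any(len(chain) != n for chain in chains):
        raise ValueError("chains must have equal length of at least 2")
    chain_means = [statistics.fmean(chain) for chain in chains]
    # W: average within-chain variance; B: between-chain variance scaled by n.
    within = statistics.fmean([statistics.variance(chain) for chain in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)
```

Because the factor only compares within-chain and between-chain variability, it can fall below the threshold even when a posterior is irregular, which is why density and trace plots are inspected in addition.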

Figure 4.

Posterior density distributions of model parameters. $σ_{γ_{r}}^{2}$ = subject-incidental interaction specific noise variance for radical level r, $σ_{η}^{2}$ = latent subject-ability variance for the reference-radical level r = 1, $λ_{r}$ = factor loading for the non-reference radical level r ≠ 1 on the latent ability $η$ , $σ_{α_{r}}^{2}$ = radical-effect variance for the non-reference radical level r ≠ 1 at the subject level, $ρ_{α_{2} α_{3}}$ = radical-effect correlation for the two non-reference radicals r = 2 and r = 3 at the subject level, $RCon_{r}$ = radical-level consistency of the standard radical level within a non-standard radical level (r ≠ 1), $L1Con_{r}$ = Level-1 consistency of the subject-level variance for radical level r, $μ_{r}$ = intercept (mean) for radical level r, $σ_{β_{r}}^{2}$ = incidental-effect variance for radical level r, $L2Con_{r}$ = Level-2 consistency of the subject-level variance for radical level r. Distributions colored in dark-gray relate to the cross-classified model, distributions colored in light-gray relate to the two-level model. The plots in the last two rows relate to the parameters specific to the cross-classified model, which is why only dark-gray colored distributions are shown in these cases.

The model applied to the cross-classified data yielded an excellent fit (M( $Δ χ^{2}$ ) = −0.793, 95%-confidence interval = [−16.069, 17.391], p = .459). The model applied to the two-level data yielded a less ideal but still acceptable fit (M( $Δ χ^{2}$ ) = −7.424, 95%-confidence interval = [−9.867, 24.520], p = .188).

Table 1 presents the parameter point-estimates (medians) and 95%-equal tailed credibility intervals (CI) for the parameters of the cross-classified and the two-level model respectively, as well as these statistics for the differences of the parameters between the models. Figure 4 shows the posterior distributions of the parameters graphically. As can be seen, the posterior-distributions overlap to a large degree, indicating no substantial differences in common model parameters between sampling procedures/groups. This is further corroborated by the fact that all CI for differences in parameters across models contain zero. In the following, the 95%-CI bounds are given in brackets.

Table 1.

Point estimates and 95%-equal tailed credibility intervals [in brackets] for the parameters of the models and their difference between the models

Parameter	Cross-classified	Two-level	Difference
$σ_{γ_{1}}^{2}$	8.036 [7.037, 9.221]	7.813 [6.838, 8.963]	0.223 [−1.298, 1.757]
$σ_{γ_{2}}^{2}$	6.508 [5.742, 7.405]	6.292 [5.552, 7.159]	0.216 [−0.941, 1.378]
$σ_{γ_{3}}^{2}$	5.133 [4.496, 5.893]	5.365 [4.696, 6.165]	−0.232 [−1.254, 0.779]
$σ_{η}^{2}$	6.811 [5.173, 8.913]	5.339 [3.978, 7.102]	1.466 [−0.929, 3.970]
$λ_{2}$	0.898 [0.771, 1.040]	0.916 [0.770, 1.077]	−0.017 [−0.222, 0.185]
$λ_{3}$	0.525 [0.403, 0.656]	0.598 [0.456, 0.753]	−0.073 [−0.270, 0.121]
$σ_{α_{2}}^{2}$	0.530 [0.057, 1.410]	0.474 [0.042, 1.302]	0.051 [−0.905, 1.050]
$σ_{α_{3}}^{2}$	1.714 [0.933, 2.649]	1.492 [0.677, 2.446]	0.224 [−1.011, 1.462]
$ρ_{α_{2} α_{3}}$	0.814 [0.268, 0.991]	0.827 [0.214, 0.992]	−0.009 [−0.581, 0.619]
$RCon_{2}$	0.913 [0.774, 0.991]	0.905 [0.748, 0.991]	0.008 [−0.159, 0.184]
$RCon_{3}$	0.524 [0.339, 0.715]	0.562 [0.359, 0.779]	−0.039 [−0.323, 0.242]
$L1Con_{1}$	0.447 [0.329, 0.530]	0.406 [0.325, 0.489]	0.039 [−0.103, 0.157]
$L1Con_{2}$	0.459 [0.301, 0.541]	0.443 [0.365, 0.520]	0.015 [−0.155, 0.130]
$L1Con_{3}$	0.405 [0.301, 0.490]	0.390 [0.304, 0.474]	0.015 [−0.121, 0.136]
$μ_{1}$	11.106 [10.171, 12.447]	11.331 [10.936, 11.665]	−0.207 [−1.153, 1.028]
$μ_{2}$	9.358 [8.355, 10.508]	9.702 [9.336, 10.052]	−0.340 [−1.406, 0.853]
$μ_{3}$	7.351 [6.830, 8.126]	7.655 [7.343, 7.960]	−0.300 [−0.906, 0.516]
$μ_{1} − μ_{2}$	1.741 [0.379, 3.282]	1.624 [1.307, 1.935]	0.121 [−1.275, 1.691]
$μ_{1} − μ_{3}$	3.746 [2.635, 5.080]	3.672 [3.312, 4.011]	0.078 [−1.072, 1.425]
$μ_{2} − μ_{3}$	2.001 [0.770, 3.205]	2.046 [1.736, 2.358]	−0.046 [−1.309, 1.196]
$σ_{β_{1}}^{2}$	0.160 [0.010, 4.813]
$σ_{β_{2}}^{2}$	0.400 [0.054, 7.146]
$σ_{β_{3}}^{2}$	0.066 [0.006, 2.045]
$L2Con_{1}$	0.977 [0.583, 0.999]
$L2Con_{2}$	0.938 [0.457, 0.991]
$L2Con_{3}$	0.982 [0.635, 0.998]

Notes. $σ_{γ_{r}}^{2}$ = subject-incidental interaction specific noise variance for radical level r, $σ_{η}^{2}$ = latent subject-ability variance for the reference-radical level r = 1, $λ_{r}$ = factor loading for the non-reference radical level r ≠ 1 on the latent ability $η$ , $σ_{α_{r}}^{2}$ = radical-effect variance for the non-reference radical level r ≠ 1 at the subject level, $ρ_{α_{2} α_{3}}$ = radical-effect correlation for the two non-reference radical levels r = 2 and r = 3 at the subject level, $RCon_{r}$ = radical-level consistency of the standard radical level within a non-standard radical level (r ≠ 1), $L1Con_{r}$ = Level-1 consistency of the subject-level variance for radical level r, $μ_{r}$ = intercept (mean) for radical level r, $σ_{β_{r}}^{2}$ = incidental-effect variance for radical level r, $L2Con_{r}$ = Level-2 consistency of the subject-level variance for radical level r.

As hypothesized, the easiest radical level was the first one, with an expectation of 11.106 [10.171, 12.447] correctly memorized associations in the cross-classified model and an expectation of 11.331 [10.936, 11.665] correctly memorized associations in the two-level model. The third radical level was the most difficult one with an expectation of 7.351 [6.830, 8.126] correctly remembered associations in the cross-classified model and 7.655 [7.343, 7.960] in the two-level model. The second radical level had intermediate difficulty with an expectation of 9.358 [8.355, 10.508] correctly remembered associations in the cross-classified model and 9.702 [9.336, 10.052] in the two-level model. Differences between these difficulties were significant in each model (cross-classified: $μ_{1} − μ_{2}$ = 1.741 [0.379, 3.282]; $μ_{1} − μ_{3}$ = 3.746 [2.635, 5.080]; $μ_{2} − μ_{3}$ = 2.001 [0.770, 3.205]; two-level: $μ_{1} − μ_{2}$ = 1.624 [1.307, 1.935]; $μ_{1} − μ_{3}$ = 3.672 [3.312, 4.011]; $μ_{2} − μ_{3}$ = 2.046 [1.736, 2.358]). Across the two models, intercepts and differences in intercepts were not significantly different within the Bayesian analysis. These findings were corroborated by the t-tests (see Table 2) with the only exception being the intercept for the third radical level. Here, the difference across the two models was significant according to the frequentist analysis ( $μ_{3,\text{cross-classified}} − μ_{3,\text{two-level}}$ = −0.333, t(1,233.40) = −1.994, p = .046). Yet, it must be stated that the t-tests do not take the dependencies due to subjects into account so that confidence intervals are likely understated and this might reflect a type-1 error.

Table 2.

Means of variables (and difference variables) in the cross-classified and the two-level samples (second and third column) and mean comparison across the samples (last column)

Variable	Cross-classified	Two-level	Difference
Y_si1	11.083, t (623) = 71.445, p < .001	11.310, t (611) = 76.881, p < .001	−0.227, t (1,231.70) = −1.062, p = .288
Y_si2	9.316, t (623) = 65.37, p < .001	9.687, t (610) = 71.545, p < .001	−0.371, t (1,231.00) = −1.891, p = .059
Y_si3	7.304, t (623) = 62.288, p < .001	7.637, t (611) = 64.323, p < .001	−0.333, t (1,233.40) = −1.994, p = .046
Y_si1 – Y_si2	1.768, t (623) = 11.505, p < .001	1.627, t (610) = 11.032, p < .001	0.141, t (1,231.90) = 0.661, p = .509
Y_si1 – Y_si3	3.779, t (623) = 23.200, p < .001	3.673, t (611) = 22.558, p < .001	0.106, t (1,233.90) = 0.459, p = .647
Y_si2 – Y_si3	2.011, t (623) = 13.795, p < .001	2.047, t (610) = 14.594, p < .001	−0.036, t (1,232.00) = −0.180, p = .858

Note. Y_sir = observed response variable (s = subject, i = incidental realization, r = radical level). t-tests in the last column reflect two-sample Welch-tests without the homoscedasticity assumption. The remaining t-tests are one-sample t-tests.
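As a plausibility check of the last column of Table 2, the Welch statistic can be recomputed from the summary values in the table alone. The sketch below is illustrative and rests on two assumptions that are not stated in the original: the standard error of each sample mean is recovered as the mean divided by its one-sample t-value, and the number of observations per sample is taken as the one-sample degrees of freedom plus one. Function names are hypothetical.

```python
import math


def standard_error_from_one_sample_t(mean, t_value):
    """Recover the standard error of a mean tested against zero: t = mean / SE."""
    return mean / t_value


def welch_from_summaries(mean1, se1, n1, mean2, se2, n2):
    """Welch's t and Welch-Satterthwaite degrees of freedom from summaries."""
    var_sum = se1 ** 2 + se2 ** 2
    t_value = (mean1 - mean2) / math.sqrt(var_sum)
    df = var_sum ** 2 / (se1 ** 4 / (n1 - 1) + se2 ** 4 / (n2 - 1))
    return t_value, df


if __name__ == "__main__":
    # Row Y_si3 of Table 2; n = df + 1 is an assumption (623 + 1 and 611 + 1).
    se_cc = standard_error_from_one_sample_t(7.304, 62.288)
    se_tl = standard_error_from_one_sample_t(7.637, 64.323)
    t_value, df = welch_from_summaries(7.304, se_cc, 624, 7.637, se_tl, 612)
    print(round(t_value, 3), round(df, 1))
```

Applied to the third radical level, the result should land close to the tabled t(1,233.40) = −1.994, with small deviations caused by the rounding of the inputs. Like the tabled tests, this computation treats all observations as independent and therefore ignores the nesting of responses within subjects.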

The largest difference in model parameters between sampling procedures/groups was observed for the variance in latent ability assessed with the first radical level (cross-classified: 6.811 [5.173, 8.913]; two-level: 5.339 [3.978, 7.102]; difference: 1.466 [−0.929, 3.970]). Factor loadings were higher for the second radical level (cross-classified: 0.898 [0.771, 1.040]; two-level: 0.916 [0.770, 1.077]; difference: −0.017 [−0.222, 0.185]) than for the third radical level (cross-classified: 0.525 [0.403, 0.656]; two-level: 0.598 [0.456, 0.753]; difference: −0.073 [−0.270, 0.121]), while the residual radical-effect variance was lower for the second radical level (cross-classified: 0.530 [0.057, 1.410]; two-level: 0.474 [0.042, 1.302]; difference: 0.051 [−0.905, 1.050]) than for the third radical level (cross-classified: 1.714 [0.933, 2.649]; two-level: 1.492 [0.677, 2.446]; difference: 0.224 [−1.011, 1.462]). This yielded high radical consistency for the second radical level (cross-classified: 0.913 [0.774, 0.991]; two-level: 0.905 [0.748, 0.991]; difference: 0.008 [−0.159, 0.184]) and much lower radical consistency for the third radical level (cross-classified: 0.524 [0.339, 0.715]; two-level: 0.562 [0.359, 0.779]; difference: −0.039 [−0.323, 0.242]). Accordingly, we can conclude that the additional visual load due to varying completeness of the frames introduced only little additional variance between subjects (few additional differential cognitive operations) in comparison to visual load due to shape alone. However, visual load due to shape, completeness, and orientation introduced much more additional variance between subjects (many more additional differential cognitive operations) in comparison to visual load due to shape alone. The radical effects correlated strongly, but this correlation was estimated with low precision (cross-classified: 0.814 [0.268, 0.991]; two-level: 0.827 [0.214, 0.992]; difference: −0.009 [−0.581, 0.619]).
Such a strong radical-effect correlation would indicate that the second radical level can explain much of the remaining variance in the third radical level after the variance of the first radical level has been partialled out.
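The link between loadings, residual radical-effect variances, and radical consistency can be illustrated numerically. The sketch below assumes that radical consistency is the share of subject-level variance at a non-reference radical level that is explained by the reference ability, that is, $λ_{r}^{2} σ_{η}^{2} / (λ_{r}^{2} σ_{η}^{2} + σ_{α_{r}}^{2})$. This form is inferred here from the parameter definitions in the notes to Table 1 and is not copied from the paper's equations; the function name is hypothetical.

```python
def radical_consistency(loading, ability_variance, radical_effect_variance):
    """Share of subject-level variance explained by the reference ability.

    Assumed form: lambda^2 * var(eta) / (lambda^2 * var(eta) + var(alpha)).
    """
    explained = loading ** 2 * ability_variance
    return explained / (explained + radical_effect_variance)


if __name__ == "__main__":
    # Cross-classified point estimates (posterior medians) from Table 1.
    sigma2_eta = 6.811
    print(radical_consistency(0.898, sigma2_eta, 0.530))  # radical level 2
    print(radical_consistency(0.525, sigma2_eta, 1.714))  # radical level 3
```

Plugging in posterior medians gives values close to, but not identical with, the tabled coefficients, because the median of a ratio is not the ratio of the medians. The sketch nevertheless shows why the lower loading and the larger residual variance of the third radical level jointly produce its lower consistency.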

In the cross-classified model, comparatively little variability due to incidental effects was found ( $σ_{β_{1}}^{2}$ = 0.160 [0.010, 4.813]; $σ_{β_{2}}^{2}$ = 0.400 [0.054, 7.146]; $σ_{β_{3}}^{2}$ = 0.066 [0.006, 2.045]), yielding high Level-2 consistency coefficients ( $L2Con_{1}$ = 0.977 [0.583, 0.999]; $L2Con_{2}$ = 0.938 [0.457, 0.991]; $L2Con_{3}$ = 0.982 [0.635, 0.998]).
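How small incidental-effect variances translate into high Level-2 consistency can be sketched in the same way. The code assumes that Level-2 consistency is the subject-level variance divided by the sum of subject-level and incidental-effect variance, with the subject-level variance given by $λ_{r}^{2} σ_{η}^{2} + σ_{α_{r}}^{2}$ (and simply $σ_{η}^{2}$ for the reference level). As before, this form is an inference from the parameter definitions and not a quotation of the paper's equations; the function names are hypothetical.

```python
def subject_level_variance(ability_variance, loading=1.0, radical_effect_variance=0.0):
    """Assumed subject-level variance: lambda^2 * var(eta) + var(alpha)."""
    return loading ** 2 * ability_variance + radical_effect_variance


def level2_consistency(subject_variance, incidental_variance):
    """Assumed form: subject-level variance over subject plus incidental variance."""
    return subject_variance / (subject_variance + incidental_variance)


if __name__ == "__main__":
    # Cross-classified point estimates (posterior medians) from Table 1.
    sigma2_eta = 6.811
    levels = {
        1: (subject_level_variance(sigma2_eta), 0.160),
        2: (subject_level_variance(sigma2_eta, 0.898, 0.530), 0.400),
        3: (subject_level_variance(sigma2_eta, 0.525, 1.714), 0.066),
    }
    for r, (subject_var, incidental_var) in levels.items():
        print(r, level2_consistency(subject_var, incidental_var))
```

The wide credibility intervals reported for the incidental-effect variances carry over to the consistency coefficients, so plug-in values of this kind only illustrate the point estimates and say nothing about their uncertainty.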

In light of incidentals having such little effect in the cross-classified model, it was also observed that the random-noise variance was very similar and not significantly different across models. This suggests that the low incidental effects on item responses found for the limited number of sampled incidentals in the cross-classified sample generalize to a larger set of incidental realizations. The highest random-noise variance was observed for the first radical level (cross-classified: 8.036 [7.037, 9.221]; two-level: 7.813 [6.838, 8.963]; difference: 0.223 [−1.298, 1.757]), followed by the second radical level (cross-classified: 6.508 [5.742, 7.405]; two-level: 6.292 [5.552, 7.159]; difference: 0.216 [−0.941, 1.378]) and third radical level (cross-classified: 5.133 [4.496, 5.893]; two-level: 5.365 [4.696, 6.165]; difference: −0.232 [−1.254, 0.779]). The highest Level-1 consistency was observed for the second radical level (cross-classified: 0.459 [0.301, 0.541]; two-level: 0.443 [0.365, 0.520]; difference: 0.015 [−0.155, 0.130]), followed by the first radical level (cross-classified: 0.447 [0.329, 0.530]; two-level: 0.406 [0.325, 0.489]; difference: 0.039 [−0.103, 0.157]) and the third radical level (cross-classified: 0.405 [0.301, 0.490]; two-level: 0.390 [0.304, 0.474]; difference: 0.015 [−0.121, 0.136]).

General Discussion

In this contribution, we drew attention to the different sampling processes of subjects and item incidentals (surface characteristics of items that are hypothesized to elicit no effects on item parameters) that may be implied across different validation studies of automatic item generators for assessing psychological constructs. In cross-classified sampling, every subject “sees” every incidental realization and incidental effects on the item scores (that should be minimal) can be estimated. Yet, as the number of sampled incidental realizations will often be low due to practical constraints of the data collection, the generalization of sample findings to the complete population of incidental realizations will be limited. In contrast to this, every subject encounters a unique set of sampled incidental realizations in two-level sampling. Here, generalizations of incidental effects are much more comprehensive, but the effects cannot be separated from subject-incidental interactions. Our practical recommendation is to use both designs for an in-depth evaluation of noise stability by comparing the results across the two designs. For that purpose, we introduced a classical test theory AIG model for cross-classified data based on stochastic measurement theory. We showed how a simplification of the model leads to a new model with a two-level sampling process.

In the following, we will first discuss the substantive results of our empirical illustration with respect to the validity of the item generator and potential model extensions. Afterward, we discuss practical limitations that are likely to be encountered in applications, present recommendations for these cases and give an outlook on potential future applications and studies.

Substantive Conclusions for Figumem and Model Extensions

An application of the models to cross-classified and two-level sampling data for the figural short-term memory test-generator “figumem” (Jendryczko et al., 2020) indeed showed only small effects of the incidentals on the item score (cross-classified data) and showed that the variance at the subject-incidental interaction level remained stable in a separate sample with a larger set of incidental realizations (two-level data). Hence, results point to considerable noise stability. Yet, Level-2 consistency coefficients were not exactly 1, which means that the distinction between incidentals and radicals is not strict. That is, some combinations of frames and emblems (incidental realizations) are more easily recalled than others (see also Jendryczko et al., 2020). However, all Level-2 consistency coefficients exceeded .9, pointing to a relatively small impact of incidentals.

To further investigate the importance of incidental realizations, we propose a slightly different study design: One could apply both sampling procedures for incidentals to the same sample of subjects (“two-way incidental sampling approach”). In the case of figumem, for example, a set of nine items (three per radical level) could be sampled for every subject (cross-classified) while for each subject an additional unique set of nine items (again three per radical level) could be sampled (two-level). While the variance decomposition and the investigation of noise stability would remain the same, this would allow for modeling separate variables at the person-level for the cross-classified and two-level items. If correlations among these variables are close to one, this suggests that the construct-relevant additional variability produced by incidentals does not substantially influence the rank-order of subjects and, thus, relative individual ability level.
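The structure of the proposed two-way incidental sampling design can be made explicit with a small sketch. Integer item IDs stand in for generated incidental realizations; the function name, the dictionary layout, and the use of a running counter are hypothetical and serve only to show which items are shared across subjects and which are unique to one subject.

```python
from itertools import count


def two_way_item_assignment(n_subjects, radical_levels=(1, 2, 3), items_per_level=3):
    """Assign each subject a shared and a unique item set (illustrative).

    Shared items are drawn once and presented to every subject
    (cross-classified part); unique items are drawn anew for each
    subject (two-level part).
    """
    next_id = count(1)  # stand-in for calls to an item generator
    shared = {
        r: [next(next_id) for _ in range(items_per_level)] for r in radical_levels
    }
    design = {}
    for subject in range(1, n_subjects + 1):
        unique = {
            r: [next(next_id) for _ in range(items_per_level)] for r in radical_levels
        }
        design[subject] = {"shared": shared, "unique": unique}
    return design


if __name__ == "__main__":
    for subject, sets in two_way_item_assignment(2).items():
        print(subject, sets)
```

With the default settings each subject responds to 18 items, nine of which are common to all subjects, so the doubled test length is the price of applying both sampling schemes to the same sample.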

Interpreting the empirical results with respect to the validity of figumem, we found that the different radical levels (item characteristics that are hypothesized to elicit construct-relevant psychological phenomena and, thus, should elicit differences in item parameters) did not only elicit differences in item difficulties, but also some substantial residual effects in the measurement of the construct. Thus, different radical levels do not only assess increased demands on short-term memory storage capacity, but, apparently, also differential cognitive operations needed to dissociate elements of visual load (different shapes, different types of shape completeness, different shape orientations). Additionally required differential cognitive operations increased with the radical level as the standard radical consistency within the second radical level (one additional operation) was very high (around .9) but decreased substantially (to around .5) within the third radical level (two additional operations). It follows that radicals are not mere determinants of item difficulty and that the item generator is not valid with regard to measuring a single clearly defined ability variable across all radical levels (although the third radical level seems to be the more problematic one and may hence drive this problem; see also Jendryczko et al., 2020). It is an open question what exactly these different/additional cognitive operations are. Unfortunately, the current data do not provide an answer and the presented models alone cannot answer this question. One approach to overcome this problem could be to include already well-established memory tests and tests that measure memory-adjacent constructs reflecting criterion and discriminant validity. Being embedded in a flexible multilevel Bayesian SEM framework, the models in principle allow such additional variables to be included.
Moreover, in contrast to simple single-level models, the proposed models separate the subject-level effects from the incidental and incidental-subject interaction effects, which allows criterion-related correlations at the subject level to be estimated more precisely.

The study is limited in that the samples constitute convenience samples, primarily composed of psychology students, with the cross-classified sample additionally including some students from a second university (see Jendryczko et al., 2020). In future studies, it will be important that subjects are sampled from the same population (or that the two-way incidental sampling approach is applied to a single sample directly; see above). Otherwise, it remains unclear whether (low) incidental effects found in a cross-classified sample are indeed generalizable to a larger set of incidental realizations in a two-level sample, as incidentals might have different effects in different populations. Put another way, if the aim is to generalize from a small sample of incidentals to a larger population of incidental realizations within a specific population of subjects, then the populations of incidentals and subjects must remain constant across the two sampling procedures. In the application, we found that the ability variance for the first radical level ( $σ_{η}^{2}$ ) was considerably lower in the two-level sample. While the credibility interval for the difference across samples contained zero, the lower bound of the interval was much closer to zero than the upper bound, which might indicate potential differences in populations.

Practical Limitations, Options, and Outlook

The proposed models are formulated not only within the framework of stochastic measurement but also within classical test theory. The latter implies a linear relationship between latent and observed variables and constrains their appropriateness when applied to different types of data. In intelligence research, for example, item responses are often ordered categorical or binary (0 = incorrect answer given, 1 = correct answer given) so that linearity will likely be violated. While SEMs for ordered categorical data exist (e.g., Muthén, 1983; Takane & De Leeuw, 1987), one needs to consider that, within these, residuals (such as measurement error or subject-incidental interaction) are determined by remaining model parameters (instead of being freely estimated) and that these models imply different stochastic sampling spaces (e.g., Eid, 1996) which renders a straightforward application of the proposed models questionable.

Many other cognitive ability tests (e.g., figumem) provide count data. Latent Poisson-regression models (Rasch, 1960/1980) are often considered more appropriate for such data, yet the Poisson distribution comes with its own drawbacks such as the strict assumption of equidispersion (the expectation equals the variance) and unity of factor loadings. While the Rasch-Poisson-Counts model has been extended to handle item-specific dispersion parameters (Forthmann et al., 2019), free estimation of factor loadings (Myszkowski & Storme, 2021), and both (Beisemann, 2022), the additional modeling of cross-classified or two-level data remains an obstacle. While count data are not continuous in nature, applications of continuous linear models are more defensible due to the wider range of possible item scores, especially when the normality assumption holds. We argue that the proposed models are particularly useful in the domain of cognitive abilities when tests contain figural, numerical, and verbal material for which the manipulation of radicals and incidentals does not imply a change in semantics (e.g., as in verbally formulated statements for personality assessment) and the outcomes are counts, as in many processing speed (e.g., Doebler & Holling, 2016) and fluency tasks (e.g., Ghanavati et al., 2019).

In addition, the application of continuous response models to ordered categorical data is also common practice and has been defended on a theoretical basis (Robitzsch, 2020). Ordered categories often appear in tests measuring facets of divergent thinking and creativity (e.g., Forthmann et al., 2017) or reading comprehension (e.g., Attali et al., 2022). Besides count data, reaction times that are often used in cognitive aptitude testing (e.g., Doebler & Scheffler, 2016; Kyllonen & Zu, 2016; Wang et al., 2023) are another valid candidate for the application of the models.

In the context of such cognitive ability tests, it should also be mentioned that (AIG) tests are also often constructed to provide proxy measures for general intelligence. The automatic item generation of such tests often relies on a much larger set of radicals (in comparison to figumem), such as many possible combinations of specific rules by which figural stimuli are aligned in visual matrices to form figural analogies (e.g., “MetrixDeveloper” by Freund et al., 2008). In these cases, and in order to determine the impact of radicals on the variance of observed scores, one could also model these as random effects. The proposed models of the current contribution may be modified for including radical effects as random and model fit procedures may be used to evaluate fixed and random radical-effect models against each other.

Promising applications of the models also lie within automatic item generation in the context of personality assessment with large language models (e.g., Götz et al., 2023; Hommel et al., 2022), which often apply continuous response scales (e.g., Bejar, 1977; Kloft & Heck, 2024; Kuhlmann et al., 2017; Simms et al., 2019) or multi-point Likert-scales. However, it might be more difficult in this context to define radicals that have an effect on item difficulty (represented by the intercepts $μ_{r}$ in our models). Usually, there are specific underlying theories about the “mental” structure of a particular personality dimension. For example, the classic BIG-5 personality trait of conscientiousness is theorized to consist of the facets self-discipline, competence, achievement striving, dutifulness, deliberation, and order (Costa & McCrae, 1992) or of the aspects of industriousness and orderliness (DeYoung et al., 2007). Using radical-specific residual effects at the subject-level (represented by $α_{s r}$ in our models), these different facets/aspects can be represented and generalized to multiple items for each facet/aspect.

Keeping all these things in mind, one also needs to reflect on the general costs and time demands of automatic item generators. These start with the initial efforts for creating the item templates and building the generation engines. These efforts vary depending on the psychological construct and the stimulus material. Many cognitive ability item generators rely on “vague” figural (abstract shapes), numerical (random numbers) or verbal (syllables or words that are not meant to form a semantically meaningful sentence) material. For these, the initial efforts are comparatively low and come with comparatively little cost, as freeware programs (such as R) can be used for template and generation engine creation (e.g., figumem itself; for other examples see Blum & Holling, 2018; Doebler & Holling, 2016; Sun et al., 2019). Yet, other constructs such as personality traits rely on concrete verbal material with semantically meaningful statements or questions, which renders the creation of item templates and item generators much more demanding and cost intensive. One might argue that the current and future developments in artificial intelligence and LLMs in particular may reduce time and costs for the generator-creation phase. Yet, we hypothesize that automatically creating “true” incidentals in semantic verbal material without confounding these with “true” radicals is much more difficult. Hence, we believe that these item pools will suffer from noise instability, which can be detected with the presented approaches. If this is the case, each and every item (incidental) will have to be inspected in order to form a pool of noise-stable items, producing costs which may outweigh the benefits. Additionally, an automatic on-the-fly item generation will be impossible due to the different impacts of the incidentals on the measures.

Conclusion

Classical Test Theory-Automatic Item Generation measurement models adhering to the stochastic sampling space of cross-classified and two-level sampling of subjects and incidentals are useful tools for the validation process of automatic item generators as they can be used to investigate noise stability. They are best applied in combination, so that incidental effects can be estimated with cross-classified data and the generalization to a large population of incidentals can be examined with two-level data. While the type of data (categorical, continuous, and count) must be considered before the models are applied, the models are especially promising for the application to item generators for cognitive abilities and might hold potential for generators using large language models.

Footnotes

ORCID iDs

Philipp Jahn

David Jendryczko

Fridtjof W. Nussbeck

Ethical Considerations

The Ethics Committee (institutional review board, IRB) of the University of Konstanz waived the requirement for approval. IRB-number: IRB25KN009-03/w.

Consent to Participate

The following informed consent information was given in written form (in German): “Our study involves a figural memory test in which you are asked to memorize various shapes as well as possible. The survey takes approximately 45–60 minutes. Students at the University of Konstanz can receive 1 course-credit for their participation. In this case, the course-credit will be automatically credited after completion of the survey. Participation in the survey is voluntary and can be discontinued at any time without giving reasons. This survey is for scientific purposes only. Data collection is anonymous and the data you enter will not be traced back to you. The results will be published anonymously so that data cannot be attributed to any individual. If you have any questions or comments, please feel free to contact me at: philipp.jahn@uni-konstanz.de. By clicking ‘Next’, you agree to voluntarily participate in our study and confirm that we may use your data in anonymized form, as well as publish the test results of individual tasks (without any demographic information and without the possibility of identifying specific individuals).”

Consent for Publication

Consent to use data and figures from was given by David Jendryczko, who is the second author of this contribution.

Author Contributions

David Jendryczko (Conceptualisation, Formal Analysis, Methodology, Writing—Original Draft Preparation, Visualizations), Fridtjof Nussbeck (Methodology, Resources, Supervision, Writing—Review and Editing), Philipp Jahn (Conceptualisation, Data Curation, Investigation, Methodology, Software, Writing—Original Draft Preparation).

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The data for this article is open-access and retrievable from .

References

Alvarez & Cavanagh (2004). The capacity of visual short-term memory is set both by visual information load and by number of objects. Psychological Science, 15(2), 106–111. https://doi.org/10.1111/j.0963-7214.2004.01502006.x

Arendasy, M. E., Hergovich, & Sommer (2008). Investigating the ‘g’-saturation of various stratum-two factors using automatic item generation. Intelligence, 36(6), 574–583. https://doi.org/10.1016/j.intell.2007.11.005

Arendasy & Sommer (2005). The effect of different types of perceptual manipulations on the dimensionality of automatically generated figural matrices. Intelligence, 33(3), 307–324. https://doi.org/10.1016/j.intell.2005.02.00

Arendasy, Sommer, Gittler, & Hergovich (2006). Automatic generation of quantitative reasoning items. Journal of Individual Differences, 27(1), 2–14. https://doi.org/10.1027/1614-0001.27.1.2

Arendasy, M. E., & Sommer (2010). Evaluating the contribution of different item features to the effect size of the gender difference in three-dimensional mental rotation using automatic item generation. Intelligence, 38(6), 574–581. https://doi.org/10.1016/j.intell.2010.06.004

Arendasy, M. E., & Sommer (2012). Using automatic item generation to meet the increasing item demands of high-stakes educational and occupational assessment. Learning and Individual Differences, 22(1), 112–117. https://doi.org/10.1016/j.lindif.2011.11.005

Asparouhov & Muthén, B. O. (2021). Bayesian analysis of latent variable models using Mplus. Muthén & Muthén. https://www.statmodel.com/download/BayesAdvantages18.pdf

Attali, Runge, LaFlair, G. T., Yancey, Goodwin, Park, & von Davier, A. A. (2022). The interactive reading task: Transformer-based automatic item generation. Frontiers in Artificial Intelligence, 5, Article 903077. https://doi.org/10.3389/frai.2022.903077

Beisemann (2022). A flexible approach to modelling over‐, under‐ and equidispersed count data in IRT: The two‐parameter Conway–Maxwell–Poisson model. British Journal of Mathematical and Statistical Psychology, 75(3), 411–443. https://doi.org/10.1111/bmsp.12273

Bejar, I. I. (1977). An application of the continuous response level model to personality measurement. Applied Psychological Measurement, 1(4), 509–521. https://doi.org/10.1177/014662167700100407

Bejar, I. I. (1990). A generative analysis of a three-dimensional spatial task. Applied Psychological Measurement, 14(3), 237–245. https://doi.org/10.1177/0146621690014003

Bejar, I. I. (2013). Generative testing: From conception to implementation. In Item generation for test development (pp. 231–250). Routledge.

Bertling, J. P. (2012). Measuring reasoning ability: Applications of rule-based item generation. University of Muenster.

Blum & Holling (2018). Automatic generation of figural analogies with the IMak package. Frontiers in Psychology, 9, Article 1286. https://doi.org/10.3389/fpsyg.2018.01286

Blum, Holling, Galibert, M. S., & Forthmann (2016). Task difficulty prediction of figural analogies. Intelligence, 56, 72–81. https://doi.org/10.1016/j.intell.2016.03.001

Circi, Hicks, & Sikali (2023). Automatic item generation: Foundations and machine learning-based approaches for assessments. Frontiers in Education, 8, Article 858273. https://doi.org/10.3389/feduc.2023.858273

Cooper (2023). An introduction to psychometrics and psychological assessment: Using, interpreting and developing tests (2nd ed.). Routledge.

Costa, P. T., Jr., & McCrae, R. R. (1992). NEO PI-R professional manual. Psychological Assessment Resources.

DeYoung, C. G., Quilty, L. C., & Peterson, J. B. (2007). Between facets and domains: 10 aspects of the big five. Journal of Personality and Social Psychology, 93(5), 880–896. https://doi.org/10.1037/0022-3514.93.5.880

Doebler & Holling (2016). A processing speed test based on rule-based item generation: An analysis with the Rasch Poisson counts model. Learning and Individual Differences, 52, 121–128. https://doi.org/10.1016/j.lindif.2015.01.013

Doebler & Scheffler (2016). The relationship of choice reaction time variability and intelligence: A meta-analysis. Learning and Individual Differences, 52, 157–166. https://doi.org/10.1016/j.lindif.2015.02.009

Eid (1996). Longitudinal confirmatory factor analysis for polytomous item responses: Model definition and model selection on the basis of stochastic measurement theory. Methods of Psychological Research Online, 1(4), 65–85. https://doi.org/10.23668/psycharchives.12714

Embretson, S. E. (2013). Generating abstract reasoning items with cognitive theory. In Item generation for test development (pp. 251–282). Routledge.

Falcão, Costa, & Pêgo, J. M. (2022). Feasibility assurance: A review of automatic item generation in medical assessment. Advances in Health Sciences Education, 27(2), 405–425. https://doi.org/10.1007/s10459-022-10092-z

Forthmann, Gühne, & Doebler (2019). Revisiting dispersion in count data item response theory models: The Conway–Maxwell–Poisson counts model. British Journal of Mathematical and Statistical Psychology, 73(Suppl 1), 32–50. https://doi.org/10.1111/bmsp.12184

Forthmann, Holling, Çelik, Storme, & Lubart (2017). Typing speed as a confounding variable and the measurement of quality in divergent thinking. Creativity Research Journal, 29(3), 257–269. https://doi.org/10.1080/10400419.2017.1360059

Freund, P. A., Hofer, & Holling (2008). Explaining and controlling for the psychometric properties of computer-generated figural matrix items. Applied Psychological Measurement, 32(3), 195–210. https://doi.org/10.1177/0146621607306972

Gelman, Meng, & Stern (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6(4), 733–760. http://www.jstor.org/stable/24306036

Gelman & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7(4), 457–511. https://doi.org/10.1214/ss/1177011136

Ghanavati, Salehinejad, M. A., Nejati, & Nitsche, M. A. (2019). Differential role of prefrontal, temporal and parietal cortices in verbal and figural fluency: Implications for the supramodal contribution of executive functions. Scientific Reports, 9(1), Article 3700. https://doi.org/10.1038/s41598-019-40273-7

Gierl, M. J., & Lai (2012). The role of item models in automatic item generation. International Journal of Testing, 12(3), 273–298. https://doi.org/10.1080/15305058.2011.635830

Gierl, M. J., Zhou, & Alves (2008). Developing a taxonomy of item model types to promote assessment engineering. The Journal of Technology, Learning and Assessment, 7(2), 1–14. https://ejournals.bc.edu/index.php/jtla/article/view/1629

Glas, C. A., Linden, W. J., & Geerlings (2009). Estimation of the parameters in an item-cloning model for adaptive testing. In Elements of adaptive testing (pp. 289–314). Springer.

Götz, F. M., Maertens, Loomba, & van der Linden (2023). Let the algorithm speak: How to use neural networks for automatic item generation in psychological scale development. Psychological Methods, 29(3), 494–518. https://doi.org/10.1037/met0000540

Harrison, Collins, & Müllensiefen (2017). Applying modern psychometric techniques to melodic discrimination testing: Item response theory, computerised adaptive testing, and automatic item generation. Scientific Reports, 7(1), 1–18. https://doi.org/10.1038/s41598-017-03586-z

Holling, Bertling, J. P., & Zeuch (2009). Automatic item generation of probability word problems [assessment of competencies]. Studies in Educational Evaluation, 35(2), 71–76. https://doi.org/10.1016/j.stueduc.2009.10.004

Hommel, B. E., Wollang, F.-J. M., Kotova, Zacher, & Schmukle, S. C. (2022). Transformer-based deep neural language modeling for construct-specific automatic item generation. Psychometrika, 87(2), 749–772. https://doi.org/10.1007/s11336-021-09823-9

Irvine, S. H. (2013). The foundations of item generation for mass testing. In Item generation for test development (pp. 35–66). Routledge.

Jacobson, M. T., & Matthews (1996). Generating uniformly distributed random Latin squares. Journal of Combinatorial Designs, 4(6), 405–437. https://doi.org/10.1002/(SICI)1520-6610(1996)4:6<405::AID-JCD3>3.0.CO;2-J

Jahn (2023). Figugen [Software]. https://figugen.de/

Jendryczko, Berkemeyer, & Holling (2020). Introducing a computerized figural memory test based on automatic item generation: An analysis with the Rasch Poisson counts model. Frontiers in Psychology, 11, Article 945. https://doi.org/10.3389/fpsyg.2020.00945

Jendryczko & Nussbeck, F. W. (2024). Latent reciprocal engagement and accuracy variables in social relations structural equation modeling. Multivariate Behavioral Research, 60(1), 115–137. https://doi.org/10.1080/00273171.2024.2386060

Jendryczko, Scharfen, & Holling (2019). The impact of situational test anxiety on retest effects in cognitive ability testing: A structural equation modeling approach. Journal of Intelligence, 7(4), Article 22. https://doi.org/10.3390/jintelligence7040022

Jeon & Rijmen (2014). Recent developments in maximum likelihood estimation of MTMM models for categorical data. Frontiers in Psychology, 5, Article 269. https://doi.org/10.3389/fpsyg.2014.00269

Kıyak, Y. S., Coşkun, Ö., Budakoğlu, I. İ., & Uluoğlu (2024). ChatGPT for generating multiple-choice questions: Evidence on the use of artificial intelligence in automatic item generation for a rational pharmacotherapy exam. European Journal of Clinical Pharmacology, 80(5), 729–735. https://doi.org/10.1007/s00228-024-03649-x

Kloft & Heck, D. W. (2024). Discriminant validity of interval response formats: Investigating the dimensional structure of interval widths. Educational and Psychological Measurement, 85(3), 565–588. https://doi.org/10.1177/00131644241283400

Koch, Schultze, Jeon, Nussbeck, F. W., Praetorius, A.-K., & Eid (2016). A cross-classified CFA-MTMM model for structurally different and nonindependent interchangeable methods. Multivariate Behavioral Research, 51(1), 67–85. https://doi.org/10.1080/00273171.2015.1101367

Kuhlmann, Dantlgraber, & Reips, U. D. (2017). Investigating measurement equivalence of visual analogue scales and Likert-type scales in internet-based personality questionnaires. Behavior Research Methods, 49(6), 2173–2181. https://doi.org/10.3758/s13428-016-0850-x

Kurdi, Leo, Parsia, Sattler, & Al-Emari (2020). A systematic review of automatic question generation for educational purposes. International Journal of Artificial Intelligence in Education, 30(1), 121–204. https://doi.org/10.1007/s40593-019-00186-y

Kyllonen, P. C. (2016). Use of response time for measuring cognitive ability. Journal of Intelligence, 4(4), Article 14. https://doi.org/10.3390/jintelligence4040014

Lai, Alves, & Gierl, M. J. (2009). Using automatic item generation to address item demands for CAT. In D. J. Weiss (Ed.), Proceedings of the 2009 GMAC Conference on Computerized Adaptive Testing. Minneapolis, June 2 and 3, 2009. Retrieved December 11, 2024, from https://www.psych.umn.edu/psylabs/CATCentral/

Lee, Jung, Jeon, Sohn, Hwang, Moon, & Kim (2024). Few-shot is enough: Exploring ChatGPT prompt engineering method for automatic question generation in English education. Education and Information Technologies, 29(9), 11483–11515. https://doi.org/10.1007/s10639-023-12249-8

Loe, B. S. (2019). The effectiveness of automatic item generation for the development of cognitive ability tests (Doctoral dissertation). Apollo - University of Cambridge Repository. https://doi.org/10.17863/CAM.37218

Loe, B. S., Sun, Simonfy, & Doebler (2018). Evaluating an automated number series item generator using linear logistic test models. Journal of Intelligence, 6(2), Article 20. https://doi.org/10.3390/jintelligence6020020

Messick (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). American Council on Education/Macmillan.

Mislevy, R. J., & Haertel, G. D. (2006). Implications of evidence‐centered design for educational testing. Educational Measurement: Issues and Practice, 25(4), 6–20. https://doi.org/10.1111/j.1745-3992.2006.00075.x

Muthén, B. O. (1983). Latent variable structural equation modeling with categorical variables. Journal of Econometrics, 49(1), 22–45. https://doi.org/10.1016/0304-4076(83)90093-3

Muthén, L. K., & Muthén, B. O. (1998–2017). Mplus user’s guide (8th ed.). Muthén & Muthén. https://www.statmodel.com/download/usersguide/MplusUserGuideVer_8.pdf

Myszkowski & Storme (2021). Accounting for variable task discrimination in divergent thinking fluency measurement: An example of the benefits of a 2‐parameter Poisson counts model and its bifactor extension over the Rasch Poisson counts model. The Journal of Creative Behavior, 55(3), 800–818. https://doi.org/10.1002/jocb.490

Rasch (1960/1980). Probabilistic models for some intelligence and attainment tests. The University of Chicago Press.

Reeve, C. L., & Lam (2005). The psychometric paradox of practice effects due to retesting: Measurement invariance and stable ability estimates in the face of observed score changes. Intelligence, 33(5), 535–549. https://doi.org/10.1016/j.intell.2005.05.003

Robitzsch (2020). Why ordinal variables can (almost) always be treated as continuous variables: Clarifying assumptions of robust continuous and ordinal factor analysis estimation methods. Frontiers in Education, 5, Article 589965. https://doi.org/10.3389/feduc.2020.589965

Simms, L. J., Zelazny, Williams, T. F., & Bernstein (2019). Does the number of response options matter? Psychometric perspectives using personality questionnaire data. Psychological Assessment, 31(4), 557–566. https://doi.org/10.1037/pas0000648

Sona Systems, Ltd. (2022). Sona systems. Retrieved November 14, 2022, from https://www.sona-systems.com/

Sun, Liu, & Luo (2019). Automatic generation of number series reasoning items of high difficulty. Frontiers in Psychology, 10, Article 884. https://doi.org/10.3389/fpsyg.2019.00884

Takane & De Leeuw (1987). On the relationship between item response theory and factor analysis of discretized variables. Psychometrika, 52(3), 393–408. https://doi.org/10.1007/bf02294363

Trapp, Parr, Friston, & Schröger (2021). The predictive brain must have a limitation in short-term memory capacity. Current Directions in Psychological Science, 30(5), 384–390. https://doi.org/10.1177/09637214211029977

Ueno, Allen, R. J., Baddeley, A. D., Hitch, G. J., & Saito (2010). Disruption of visual feature binding in working memory. Memory & Cognition, 39(1), 12–23. https://doi.org/10.3758/s13421-010-0013-8

Ueno, Mate, Allen, R. J., Hitch, G. J., & Baddeley, A. D. (2011). What goes through the gate? Exploring interference with visual feature binding [Attention and Short-Term Memory]. Neuropsychologia, 49(6), 1597–1604. https://doi.org/10.1016/j.neuropsychologia.2010.11.030

von Davier (2018). Automated item generation with recurrent neural networks. Psychometrika, 83(4), 847–857. https://doi.org/10.1007/s11336-018-9608-y

Wang & Zhang (2023). Gender differences in mental rotational training based on computer adaptive tests. Behavioral Sciences, 13(9), Article 719. https://doi.org/10.3390/bs13090719

Whitely, S. E. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93(1), 179–197. https://doi.org/10.1037/0033-2909.93.1.179

Zeuch, Holling, & Kuhn, J. T. (2011). Analysis of the Latin square task with linear logistic test models. Learning and Individual Differences, 21(5), 629–632. https://doi.org/10.1016/j.lindif.2011.03.004