Generalizing Beyond the Test: Permutation-Based Profile Analysis for Explaining DIF Using Item Features

Abstract

Profile analysis is one of the main tools for studying whether differential item functioning can be related to specific features of test items. While relevant, profile analysis in its current form has two restrictions that limit its usefulness in practice: It assumes that all test items have equal discrimination parameters, and it does not test whether conclusions about the item-feature effects generalize outside of the considered set of items. This article addresses both of these limitations, by generalizing profile analysis to work under the two-parameter logistic model and by proposing a permutation test that allows for generalizable conclusions about item-feature effects. The developed methods are evaluated in a simulation study and illustrated using Programme for International Student Assessment 2015 Science data.

Keywords

differential item functioning generalizability item features international large-scale assessment profile analysis

1. Introduction

In many applications of cognitive testing, one of the main aims of administering these tests is to obtain comparable results for different groups. A crucial requirement for being able to make meaningful comparisons across groups on scale scores involves measurement equivalence or measurement invariance, meaning that the measured construct ought to be understood and measured equivalently across groups (Meredith, 1993). In practice, researchers generally adopt one of the two approaches when examining score comparability: multiple-group confirmatory factor analysis (MG-CFA; Jöreskog, 1971) or differential item functioning (DIF)-type approaches to testing measurement invariance (for an overview of such methods, see Millsap & Everson, 1993; Penfield & Lam, 2000; Potenza & Dorans, 1995). One area in which measurement invariance is especially important yet challenging to establish is in international large-scale assessments (ILSAs), which are generally low-stakes, comparative, population-level assessments and surveys administered across countries that differ in geography, culture, and economic development. The number of unique populations and the size of these studies make this a rich and important area for studying measurement invariance. Over the course of ILSAs such as the Programme for International Student Assessment (PISA) or the Trends in International Mathematics and Science Study (TIMSS), substantial evidence around the cross-cultural differences in achievement measures such as math, science, and reading exists (Grisay & Monseur, 2007; Kankaraš & Moors, 2014; Kreiner & Christensen, 2014). Detecting DIF across countries is important as not accounting for DIF can distort international comparisons: It has been shown that the country rankings are not necessarily robust against the violations of measurement invariance (Kreiner & Christensen, 2014; Rutkowski et al., 2016).

Specific operational methods for identifying measurement differences in the ILSA context are varied and depend on the study and the responsible contractor. For example, DIF analyses in early cycles of PISA were based on the comparisons of national and international relative item difficulties, where especially deviant items were identified as dodgy and either not used for scoring of individual countries or removed altogether for scoring (Organization for Economic Co-Operation and Development [OECD], 2014a). In contrast, the Teaching and Learning International Survey analyses are based on MG-CFA and emphasize the scale level (OECD, 2014b). In recent PISA cycles, measures that quantify deviations in observed data from estimated item characteristic curves are used (Oliveri & von Davier, 2011; OECD, 2020). Once identified, a natural line of investigation surrounds seeking explanations for measurement differences, which is especially relevant in the ILSA context, where dozens of countries are measured on hundreds of items, creating a substantial challenge for summarizing such multivariate results. This context also offers opportunities for pattern detection and for developing cultural or content-related profiles. Studying DIF patterns in ILSA is important not only for DIF detection and treatment but also for studying substantive differences between educational systems in different countries, which is both practically and theoretically important.

In an effort to understand potential sources of cross-cultural measurement differences, studies posited pedagogical traditions, language, or item type. For example, researchers argued that the conceptual nature of teaching in Russia as compared with procedural teaching methods in English-speaking countries explained item-by-country interactions (i.e., between-country DIF) in PISA (Kjærnsli & Lie, 2004). Similar research showed evidence of cultural item response patterns based on geography or language (Zabulionis, 2001). The previously cited studies rely on correlations of item residuals submitted to cluster analysis, with clusters given cultural or other meanings imbued by the researcher.

One important direction for studying the structure of DIF is to consider item features as a possible explanation. In the current context, we use the term item features to describe the characteristics of the test items, such as cognitive demand (high, medium, or low), item format (multiple-choice or open-ended), content dimension (algebra, numbers, or geometry in mathematics), and so on. For example, in the mathematics education literature, there is evidence that girls do better than boys under certain test conditions or on certain topics and vice versa (Hyde et al., 1990; Leder, 1992; Mendes-Barnett & Ercikan, 2006; Liu & Wilson, 2009). It is also important to note that such differences should be regarded not only as deviations from measurement invariance but also as differences between the groups that can be of substantive interest. Studying item features in the context of large-scale assessment is logical because these item features are typically built into the assessment design of the studies, rather than being ancillary or post hoc features of the test. Interested readers are encouraged to see, for example, PISA framework documents for complete details (OECD, 2017). Similarly, TIMSS or PIRLS frameworks are illuminating in this regard (Mullis & Martin, 2017, 2019). At the national level, the U.S. National Assessment of Educational Progress (NAEP) is similarly designed with item features that are built into the framework (National Assessment Governing Board, 2019).

Descriptive approaches have been used to study the structure of DIF in ILSA in relation to item features. For example, Bundsgaard (2019) analyzed item features in the International Computer and Information Literacy Study by first estimating item-level DIF parameters and then looking at items grouped by content domain to identify the patterns of DIF by country. Similarly, to describe differences in the relative strengths and weaknesses of different groups of countries, Olsen (2005) looked at the relationship between item features in PISA 2003 and deviations of the country-specific to international probabilities of correct responses to the test items. Such studies, however, do not take the uncertainty about the DIF-measures (item-level DIF parameters or deviations of the probabilities of correct responses) into account and therefore do not allow statistical inferences about the effects of item features. Rather, such findings provide a descriptive picture of the structure of country differences.

One method used to investigate how DIF is related to item features that does take uncertainty about DIF into account is profile analysis (Verhelst, 2012). In profile analysis, the aim is to study whether a particular group (e.g., country) performs notably better or worse than can be expected under measurement invariance on a set of items with certain features compared to a complementary set of items without these features (e.g., multiple-choice compared to open-ended items). This method conditions on the total number of correct responses and is developed under the Rasch modeling framework (Rasch, 1960), where it is possible to analytically derive the expected number of correct responses to items in a certain set, given the total number of correct responses. Using this expectation, one can classify each person as either performing above or below expectation on the set of items and aggregate these results on the group level. For each group, one can compute the expected and observed frequencies of persons with above- or below-expected performance on the set of items. A significant difference between observed and expected frequencies in a group suggests that items with a certain feature may exhibit DIF compared to items without the feature. For example, one could find evidence that for one country, the discrepancy between the performance on multiple-choice items and open-ended items is bigger than expected based on the international results, suggesting that this item feature might explain observed DIF for the studied country. In addition to providing a significance test for each considered group, one can compute the so-called excess percentage, which captures the difference between the observed and expected frequency of response patterns with the above-expected performance on the items in the set. This value is expressed as a percentage of the total sample size in the group. If the excess percentage is positive (negative) then it means that items with a particular feature are less (more) difficult in the group than expected based on the overall (e.g., international) model estimates. This excess percentage can be seen as providing a meaningful effect size measure to gauge the extent to which the item feature can explain observed DIF.

We note that a number of methods, used to describe the multivariate patterns of observations—focused on both continuous (e.g., Ding, 2001) and categorical data (e.g., Culpepper, 2009)—use the term profile analysis. To extract and describe profiles, established methods include multidimensional scaling (Davison, 1994; Ding, 2001), latent profile analysis (Gibson, 1959), latent class analysis (Lazarsfeld, 1950), and cluster analysis (Moses & Pritchard, 1996), among others. Verhelst (2012) distinguishes his approach to profile analysis by noting that profiles are defined at the individual, rather than the group, level; emphasis is on tests that measure a unidimensional trait; profiles describe relative weaknesses and strengths on predefined categories of items; and profiles are highly discrete. In this article, we will use the term “profile analysis” to refer to the method developed by Verhelst (2012) as has been done in multiple studies focused on explaining the structure of DIF in national and international assessments (Saatçioğlu Hamide, 2017; Vaheoja et al., 2019; Verhelst et al., 2016; Yıldırım, 2022).

Notable alternatives to profile analysis are differential bundle functioning (Douglas et al., 1996) and differential facet functioning (Engelhard, 1992). The first approach is aimed at studying whether, conditional on measured ability, the expected number of correct responses on items with a certain feature is the same in the reference and focal groups. This approach is based on the multidimensional DIF paradigm developed by Shealy and Stout (1993), which underlies the SIBTEST procedure (Bolt & Stout, 1996) that is the primary differential bundle functioning detection tool. The second approach is an extension of the linear logistic test model (Fischer, 1973) to DIF. Under this approach, instead of having an item-specific item difficulty in the reference group and item DIF parameters in each of the focal groups (i.e., difference between item difficulty in the focal and the reference group), the differences in item functioning are modeled by parameters that are specific to sets of items sharing a specific feature.

Profile analysis has important advantages over these alternatives. Compared to differential bundle functioning, profile analysis does not require a set of DIF-free items (i.e., items that are assumed to only measure the primary dimension). Moreover, profile analysis can be used for an arbitrary number of groups and does not require the specification of a reference and a focal group. Both are important advantages in a multiple-group setting, where no obvious reference group presents itself, as in the ILSA context. Compared to differential facet functioning methods, profile analysis does not require separate model parameter estimates for each country. This is a clear advantage in the ILSA context, where complex sampling schemes can render per-country sample sizes too small for precise estimates. Finally, profile analysis provides an easily interpretable measure of effect size (i.e., excess percentage), which can be used to summarize and communicate results to stakeholders. All this makes profile analysis a valuable tool for studying the relative strengths and weaknesses of different groups that have taken a test, for example, studying how countries differ in their performance on specific item types in ILSAs.

Despite several advantanges, profile analysis has an important restriction limiting its applicability in large-scale assessments: reliance on the Rasch model or its variants, which assume equal relationship between items and the underlying factor. Given that large-scale assessments such as TIMSS, PIRLS, PISA, and NAEP operationally use a more flexible measurement model (see, e.g., OECD, 2020), extending profile analysis to these contexts is valuable. Our work focuses on this extension, described in detail subsequently, along with an additional development, the context of which we describe next.

Investigating the relationship between item features and DIF allows us to determine whether observed differences between two sets of items on a given test can be attributed to a particular item feature or whether these differences are caused by random sampling variability.¹ In other words, we can investigate whether observed differences are generalizable beyond the immediate test setting—from those items observed on a particular test—to the population of items that share a particular feature. For example: Is one country structurally better in answering multiple-choice items, or is the observed pattern on a particular test administration limited to that setting? In the latter (nongeneralizable) case, differences are only due to idiosyncratic properties of the items in the two sets (e.g., with specific multiple choice items on this test being easier for one country due to reasons unrelated to the item feature).

It is important to distinguish between the results of a specific set of items on a test and the results of those items when applied to the population of items. DIF results, from profile analysis, differential facet functioning, or differential bundle functioning, can reveal if there is a systematic difference between two groups of items on the specific test, but these methods do not necessarily produce results that are generalizable the population level of items that carry a given feature. Importantly, developing generalizable methods can help to develop better measures that are less affected by culture, language, geography, or other demographics. On the other hand, if findings are unique to a test setting, making systematic adjustments to cross-cultural measures becomes an intractable challenge, as adjustments to future test cycles can create a “whack-a-mole” effect, where new measurement differences surface consequently and observed item feature effects do not reoccur in similar ways in other tests. As such, methods that can reveal whether feature-related DIF results can be expected to generalize outside of the test setting are needed. The importance of distinguishing the test-level and population-level results was earlier discussed by Ong et al. (2013).

Hierarchical logistic regression (Swanson et al., 2002), logistic mixed models (Van den Noortgate & De Boeck, 2005; Van den Noortgate et al., 2003) and Bayesian models for DIF (Gonçalves et al., 2013; Soares et al., 2009) can be used to address the issue of generalizability. In these methods, the group-specific deviations of the item parameters from their overall value (e.g., deviation of the country-specific difficulty of the item from its international value) are modeled as random across items and the variability across items is explained by item features included in the analysis as covariates. However, these methods cannot be directly applied in large-scale applications like PISA that include dozens of countries and a complex booklet design. The incomplete design poses challenges to the use of hierarchical logistic regression, where the test sumscore is used as a proxy for the latent variable: Since the sets of items that students respond to are different, their sumscores are not comparable. Furthermore, hierarchical logistic regression and logistic mixed models in their standard implementation are primarily used for application under a Rasch model, whereas many large-scale international assessments use more complex models. Finally, although there is no theoretical challenge in applying these methods when there are more than 50 groups between which DIF can occur, this has not been done to our knowledge and potentially would need some adjustments to the existing software to make it computationally feasible. Bayesian approaches to DIF have also only been applied in the context of comparing two groups, and there is no readily available software for this method for applied researchers to use in large-scale applications like PISA.

Since existing methods that address the issue of generalizability of item-feature-related DIF are not directly applicable in the context of large-scale assessment, it is of interest to develop methods that are easy to use and interpret for researchers in the field. Extending profile analysis to provide generalizable conclusions would fill the current knowledge gap, as it is computationally less intensive (i.e., it does not require estimating country-specific item parameters) and provides an easily interpretable outcome measure (i.e., excess percentage) expressed in terms of the actual performance of students from different groups rather than the estimates of model parameters. In the current article, we develop a straightforward permutation test for profile analysis to detect item-feature DIF that can be generalized outside of the immediate testing situation. Although in this article our application focuses on an ILSA context, our method can be applied readily in many large-scale assessments and other settings, where one wants to study the impact of item features on the performance of different groups of respondents (e.g., gender, ethnicity, or other policy-relevant groups).

The rest of this article is organized as follows. In Section 2, we describe the details of profile analysis in its current form (i.e., based on the Rasch model and with conclusions on the test level). In Section 3, we first generalize the method to the two-parameter logistic (2PL) model (Birnbaum, 1968), which is operationally common in many large-scale assessments, and then propose a permutation test that would allow for generalizable conclusions. In Section 4, we present an illustrative example examining different item features as potential sources of DIF in PISA 2015 Science assessment. In Section 5, we present a simulation study in which we evaluate Type I error and power of the permutation-based profile analysis. This article is concluded with a discussion.

2. Profile Analysis

Profile analysis is based on considering profiles: sets of subscores on test items of different types. Suppose some items in the test have a specific item feature (e.g., high depth of knowledge). We denote this set of items by $A$ and the complementary set of items, that is those that do not have the feature, by $\bar{A}$ . The sumscore for set $A$ (i.e., the number of correct responses to the items in the set) and the sumscore for set $\bar{A}$ constitute a profile. The central part of profile analysis is comparing the observed profiles of respondents in the data with their expected profiles, given their overall performance quantified by their total sumscores (i.e., total numbers of correct responses). Since the sumscore for set $\bar{A}$ is simply the difference between the total sumscore and the sumscore for set $A$ , deriving the expected profile boils down to deriving the conditional expectation of the sumscore for set $A$ given the total sumscore. Currently, profile analysis is developed only for models within the Rasch model family, since in the Rasch model, it is straightforward to derive the conditional distribution of the sumscore for a specific subset of items given the total sumscore. In the Rasch model, the probability of a correct response to an item depends on item difficulty $δ_{i}$ and ability $θ$ :

P (X_{i} = 1 | θ) = \frac{exp (θ - δ_{i})}{1 + exp (θ - δ_{i})} .

The model can alternatively be parameterized with transformed difficulties $ε_{i} = exp (- δ_{i})$ and abilities $ξ = exp (θ)$ :

P (X_{i} = 1 | ξ) = \frac{{ξε}_{i}}{1 + {ξε}_{i}} .

An important property of the Rasch model is that the total sumscore, denoted by S, is the sufficient statistic for ability, from which follows that conditional on this statistic, the distribution of the item scores is independent of ability (Fischer, 1995):

P (X = x | S = s, θ) = P (X = x | S = s) = \frac{\prod_{i = 1}^{n} ε_{i}^{x_{i}}}{γ_{s} (E)},

where n is the number of items, E is a set of transformed item difficulties ( $ε_{i}$ ) of the items in the test, and $γ_{s} (E)$ is the elementary symmetric polynomial of order s, which is formed by adding together all distinct products of s distinct values $ε_{i}$ within the set:

γ_{s} (E) = \sum_{1 \leq i_{1} \leq i_{2} \leq \dots \leq i_{s} \leq n} ε_{i_{1}} ε_{i_{2}} \dots ε_{i_{s}} .

For example, $γ_{1} (E) = \sum_{i} ε_{i}$ and $γ_{2} (E) = \sum_{i \neq j} ε_{i} ε_{j}$ . Additionally, $γ_{0} (E) \equiv 1$ and $γ_{s} (E) \equiv 0$ for $s < 0$ and $s > n$ .

Profile analysis uses a related property of the Rasch model, which has to do with the distribution of the sumscore for a subset of items conditional on the total sumscore. The probability that the sumscore for set $A$ , denoted by $S_{A}$ , would take a particular value t conditional on the total sumscore being equal to $s \in [0 : n]$ is independent of ability and is equal to

P (S_{A} = t | S = s, θ) = \frac{γ_{t} (E_{A}) γ_{s - t} (E_{\bar{A}})}{γ_{s} (E)},

where $E_{A}$ and $E_{\bar{A}}$ are the sets of transformed item difficulties of the items in sets $A$ and $\bar{A}$ , respectively. Note that the probability in Equation 5 is larger than zero only for the values between $min (s, n_{A})$ and $max (0, s - (n - n_{A}))$ , where $n_{A}$ is the number of items in set $A$ . From Equation 5, one can also derive the conditional expectation of the sumscore for set $A$ given the total sumscore:

ℰ (S_{A} | S = s) = \sum_{t = 0}^{n_{A}} t P (S_{A} = t | S = s) .

For a single respondent, one can determine whether their observed sumscore for set $A$ is higher than its conditional expectation given the total sumscore. If so, the respondent is classified as having an $A +$ profile. This information is aggregated on the group level by counting the number of persons with an $A +$ profile. Similarly, the number of persons with an $A -$ profile (i.e., with the observed sumscore for set $A$ lower than its conditional expectation given the total sumscore) is computed. Note that persons with total sumscores of 0 or n would not be classified in either of these two groups, because their observed profile is exactly equal to their expected profile ( ${0, 0}$ and ${n_{A}, n - n_{A}}$ , respectively). Therefore, respondents with these profiles, which are referred to as trivial profiles, are not informative for determining whether items within set $A$ are more or less difficult in the specific group.

In the profile analysis, the observed frequencies of $A +$ and $A -$ profiles in a specific group of respondents, denoted by $O_{c +}$ and $O_{c -}$ , where subscript c refers to the particular group, are compared to the corresponding expected frequencies, denoted by $E_{c +}$ and $E_{c -}$ . The expected frequencies can be computed using the conditional distribution of sumscores for set $A$ given the total sumscore (see Equation 5). Since this distribution is discrete and there is a relatively small number of possible values for $S_{A}$ , the probabilities of having an $A +$ and an $A -$ profile will often be far from .5. The expected frequency of $A +$ profiles is equal to

E_{c +} = \sum_{s = 1}^{n - 1} N_{c s} \sum_{t > ℰ (S_{A} | S = s)} P (S_{A} = t | S = s),

where $N_{c s}$ is the number of persons in group c with the total score equal to s. Note that the summation excludes trivial profiles, since for them, $A +$ profiles are not possible. The expected frequency of $A -$ profiles is equal to the number of persons with nontrivial profiles, denoted by N_c , minus $E_{c +}$ .

Since in the context of many large-scale assessments such as TIMSS and PISA, it is usually impossible to administer all items to all students, and multiple test forms are typically used. In this case, the observed and expected frequencies of $A +$ and $A -$ profiles can be computed separately for each test form and then summed up to obtain the total frequencies for the whole test.

Some deviations of the observed frequencies from the expected frequencies are expected due to sampling variability even when measurement invariance holds. To test whether the observed frequencies are significantly different from the expected frequencies, the $χ^{2}$ -statistic for contingency tables can be used:

χ_{c}^{2} = \frac{N_{c} (O_{c +} - E_{c +}) (O_{c -} - E_{c -})}{E_{c +} E_{c -}} .

This statistic has a $χ^{2}$ -distribution with one degree of freedom if the Rasch model holds. If this statistic is significant, then one can conclude that there are unexpectedly too many $A +$ (if $O_{c} > E_{c}$ ) or $A -$ (if $O_{c -} > E_{c -}$ ) profiles in group c. The former means that the items with the item feature are less difficult in group c than expected and items without the feature are more difficult. The latter means that items with the feature are more difficult than expected.

To compare the deviations between the observed and expected frequencies, Verhelst (2012) suggests to compute the so-called excess percentage:

E P = \frac{O_{c} - E_{c}}{N_{c}} \times 100 %,

which is the difference between the observed and expected frequencies as the percentage of sample size (disregarding the trivial profiles).

3. Extending Profile Analysis

3.1. Extension to the 2PL Model

Due to having only one item parameter and assuming that the items have equal discriminatory power, the Rasch model is usually too restrictive to fit real data. Many operational assessments have moved from using the Rasch modeling framework to using more flexible models, like the 2PL model used by PISA since 2015. This model adds a discrimination parameter ( $α_{i}$ ) to modeling the probability of a correct response to an item:

P (X_{i} = 1 | θ) = \frac{exp (α_{i} (θ - δ_{i}))}{1 + exp (α_{i} (θ - δ_{i}))} .

While the total sumscore is not a sufficient statistic for ability in this model, it is still a very important summary of the data that is easy to interpret and easy to communicate to stakeholders, as it still successfully captures overall ability and provides a stochastic ordering of persons. Therefore, a method for detecting item-feature-related DIF that is based on the distribution of the sumscore of items with a certain feature, conditional on the total sumscore is valuable within the context of the 2PL model.

In the 2PL model, the probability of observing a particular value for the sumscore for set $A$ conditional on the total sumscore is not independent of ability:

P (S_{A} = t | S = s, θ) = \frac{γ_{t} (E_{A} (θ)) γ_{s - t} (E_{\bar{A}} (θ))}{γ_{s} (E (θ))},

where $E (θ)$ is a set of values ${exp (α_{1} (θ - δ_{1})), exp (α_{2} (θ - δ_{2})), \dots, exp (α_{n} (θ - δ_{n}))}$ , $E_{A} (θ)$ and $E_{\bar{A}} (θ)$ are the sets of these values for the items in sets $A$ and $\bar{A}$ , respectively. However, ability can be integrated out using the group-specific distribution of ability given the total sumscore, denoted by $f_{c} (θ | S = s)$ :

P_{c} (S_{A} = t | S = s) = \int P (S_{A} = t | S = s, θ) f_{c} (θ | S = s) d θ.

The distribution $f_{c} (θ | S = s)$ is group-specific since the posterior distribution of ability depends on the population distribution in a specific group.

The integral in Equation 12 can be approximated using numerical integration and considering K quadrature points ${θ_{1}, θ_{2}, \dots, θ_{K}}$ for ability:

P_{c} (S_{A} = t | S = s) = \sum_{k = 1}^{K} (P (S_{A} = t | S = s, Θ = θ_{k}) P_{c} (Θ = θ_{k} | S = s)) .

The discrete distribution of ability conditional on the total sumscore can be easily computed using Bayes’s theorem:

P_{c} (Θ = θ_{k} | S = s) = \frac{P_{c} (Θ = θ_{k}) P (S = s | Θ = θ_{k})}{\sum_{j = 1}^{K} P_{c} (Θ = θ_{j}) P (S = s | Θ = θ_{j})},

where the distribution of the number of correct responses conditional on ability is

P (S = s | θ) = \frac{γ_{s} (E (θ))}{\sum_{t = 0}^{n} γ_{t} (E (θ))}

and the probability $P_{c} (Θ = θ_{k})$ is proportional to the ability distribution within the group, which is typically assumed to be normal with mean $μ_{c}$ and variance $σ_{c}^{2}$ .

The distribution of the sumscore for set $A$ conditional on the total sumscore computed using Equation 13 can be used to compute the group-specific expected value $ℰ_{c} (S_{A} | S = s)$ . The observed and expected frequencies of the $A +$ and $A -$ profiles within a group can be computed similarly to those under the Rasch model but using the group-specific conditional distribution of the sumscore for set $A$ and its group-specific expected value. The same test statistic (see Equation 8) and effect size measure (see Equation 9) can be used to test whether the deviation of the observed frequencies from the expected frequencies is significant and to quantify the size of this deviation, respectively.

3.2. Generalizability of the Results

In standard profile analysis, statistical inference about the presence of item-feature-related violations of measurement invariance is done by comparing the test statistic from Equation 11 with its distribution under measurement invariance (i.e., a $χ^{2}$ -distribution with $d f = 1$ ). If such a comparison shows that the probability of obtaining a value at least as extreme as in the observed data when measurement invariance holds is below the specified $α$ -level, then it is evidence of differential functioning of the two sets of items on the considered test (one set with and one set without the studied item feature, respectively) for the considered group of respondents.

However, finding such evidence does not necessarily mean that the item features themselves have explanatory power for the violations of measurement invariance. It could also be that violations of measurement invariance on the level of individual items are simply aggregated to the level of sets of items with and without the item feature, and if the test would have included a different set of items with the same features (e.g., in the next cycle of the large-scale assessment study), then there would be no such violation. Thus, the significance test only allows us to draw conclusions about the actual sets of items delivered during a particular test administration (i.e., inference at the test level) and not necessarily about the population of items that have that feature versus the population of items that do not (i.e., inference at the population level). As such, standard profile analysis leaves the question of generalizability unanswered.

To answer the question of generalizability, we propose to add a simple permutation test to profile analysis. Permutation tests are a highly flexible, nonparametric alternative to classical hypothesis tests that operate under a population model. Like classical, parametric null hypothesis tests, permutation tests allow for calculating the probability of getting a value as or more extreme under a null hypothesis. An advantage of permutation tests is that they do not rely on a specific probability distribution. Despite this advantage, they are not, formally, distribution free (LaFleur & Greevy, 2009), since samples are assumed to be exchangeable. The general process of conducting a permutation test involves a random rearrangement of the data followed by recalculating the test statistic of interest. This procedure is repeated either for all possible permutations of the data, referred to as systematic permutation tests, or for a random number of permutations (random or Monte Carlo permutation tests; Huo et al., 2014). The distribution of the resultant statistics is used to derive a p-value, from which inferences are drawn. In the subsequent sections, we provide further details about these procedures. Situations where using permutation tests can offer notable benefits are when sample sizes are small (see, e.g., Camilli & Smith, 1990), when the usual distributional assumptions are not met, or when an exact test is desirable. These methods are also useful when the hypothesis test in question is not about means or medians.

The logic of the proposed permutation test is as follows: If it is likely to obtain the same (or more extreme) value of the test statistic from a random grouping of the test items into sets mimicking the sets “with the feature” and “without the feature,” then there is insufficient evidence to conclude that the actual item feature explains the violations of measurement invariance. Only when the observed test statistic is extreme compared to what is obtained from random permutations of the items into subsets, can we conclude that the observed differences in the behavior of the items from different features are likely to not just be an aggregation of the idiosyncratic characteristics of individual items in the concrete test, but that they have to do with the characteristics of the item features themselves. In this case, we would expect to see the same pattern of violations of measurement invariance if different items with these item features would be included in the test.

More formally, the null hypothesis of the permutation-based profile analysis is that DIF (if present) is not related to the studied item feature (i.e., the null hypothesis is different from that of standard profile analysis). Random permutations of the items into sets produce the null distribution, since DIF is not related to any of these artificial item features created by randomly permuting the items. While in standard profile analysis, the test statistic in Equation 8 is compared to its distribution under measurement invariance, in our permutation-based profile analysis, it is compared it to this new null-distribution, that is the distribution of this statistic across different permutations of the items into random sets (of sizes matching the actual set of items with and without the feature). If the probability of obtaining a value at least as extreme as the observed value is below the specified $α$ -level, then it is evidence of generalizability and implies that the conclusions drawn from the item-feature-related DIF can be applied to the population level.

In each permutation $j \in [1 : J]$ , the test items are randomly split into sets $A_{j}$ and ${\bar{A}}_{j}$ with $n_{A}$ and $(n - n_{A})$ items, respectively. In each group, the probabilities for observing the different values of the sumscore for set $A_{j}$ , denoted by $S_{A_{j}}$ , conditional on the the total sumscore are computed:

P_{c} (S_{A_{j}} = t | S = s) = \sum_{k = 1}^{K} \frac{γ_{t} (E_{A_{j}} (θ_{k})) γ_{s - t} (E_{{\bar{A}}_{j}} (θ_{k}))}{γ_{s} (E (θ_{k}))} P_{c} (Θ = θ_{k} | S = s),

where $E_{A_{j}} (θ_{k})$ and $E_{{\bar{A}}_{j}} (θ_{k})$ are defined analogous to $E_{A} (θ)$ and $E_{\bar{A}} (θ)$ in Equation 11. Since the distribution of ability given the total sumscore and the elementary symmetric polynomials for all items do not depend on the permutation, they do not have to be recomputed for every permutation, which saves computation time.

When the test consists of multiple test forms, then the permutation of the items is done for all test items. It is first determined which items belong to $A_{j}$ and ${\bar{A}}_{j}$ . Next, the conditional distribution of the sumscore for the items in the test forms from set $A_{j}$ given the total sumscore in the test form, and the observed and expected frequencies of $A_{j} +$ and $A_{j} -$ profiles are computed for each test form. The observed and expected frequencies are then summed across the test forms to obtain the value of the corresponding test statistic $χ_{c j}^{2}$ . The observed value of the test statistic is then compared to the distribution of the statistic in the permuted data. The proportion of permuted data sets in which $χ_{c j}^{2}$ is larger than $χ_{c}^{2}$ can be used as the p-value for the group-specific permutation test:

p_{permutation, c} = \frac{1}{J} \sum_{j} ℐ (χ_{c j}^{2} \geq χ_{c}^{2}) .

In addition to considering the distribution of the $χ^{2}$ -statistic in the permuted data, one can consider the distribution of the excess percentage in the permuted data. That is, while in standard profile analysis, excess percentage is only used as an effect size measure and statistical inference is done using the $χ^{2}$ -test, when permutation-based profile analysis is used, then both statistics can be used for statistical inference.

4. Illustrative Example

4.1. Data

Publicly available data from 2015 PISA study were used for analysis. Seventy two countries participated in the 2015 cycle of PISA. To remove any potential mode differences, we only used data from the computer-based assessment, in which 55 countries participated. Because science was the major domain of emphasis, we limit our focus there. From the total of 184 items administered in the computer-based science assessment, there were nine items that did not have any valid responses in at least one of the countries. These items were removed to make country comparisons more straightforward. Furthermore, items with partial credit were recoded to binary items (correct = 1; incorrect or partially correct = 0) to simplify the analysis. Omitted responses were coded as incorrect responses.² Only those students who reached all items in their test form were included in the analysis. Further, we omitted responses from students that were administered the Une Heure or UH booklet, which is designed for students with special educational needs (OECD, 2017). Our analysis included 36 different test forms with a minimum of 24 and a maximum of 36 items in them (after the exclusion of the items that did not have any valid responses in one of the countries). The total sample size was 299,240 (with the largest and the smallest sample sizes per country of 17,141 and 1,975, respectively).

The PISA definition of science literacy is “the ability to engage with science-related issues, and with the ideas of science, as a reflective citizen” (OECD, 2017, p. 20). Based on detailed theories of science learning (see the framework for a full list of references), the science domain and associated items are organized according to several elements. Science competencies is the first element: to explain phenomena scientifically; to evaluate and design scientific enquiry; and to interpret data and evidence scientifically (p. 20). The second element is type of knowledge, which includes content, procedural, and epistemic ways of knowing (p. 19). The science domain further deals with different contexts, that is, situations relating to the self, to the community, or to the world (p. 23). Item content is divided across subdomains, including physical, living, and earth and space systems. Finally, items are characterized by their cognitive demands or depth of knowledge, including low, medium, and high (p. 2). Besides the theoretical science framework used to guide the development of PISA items, items can be further categorized by their item format, with options that include simple or complex multiple-choice items, or open-ended items. See the PISA Technical Report, Annex A for a complete categorization of science items (OECD, 2017). The number of items with each item feature is given in Table 1. For each of these item features, we want to determine whether items having this feature are more or less difficult in each of the 55 countries than expected under measurement invariance.

Table 1.

Number of Countries With Significant Results of Standard Profile Analysis (PA) and Permutation-Based PA

Item Feature	Level	$#$ Items	Standard PA	Permutation-Based PA
Competency	Explain	83	35	8
	Evaluate and design	38	35	6
	Interpret	54	32	3
Type of knowledge	Content	91	37	10
	Procedural	58	38	7
	Epistemic	26	38	4
Context	Self	19	28	21
	Community	53	32	3
	World	103	33	2
Subdomain	Physical	58	34	1
	Living	70	34	7
	Earth and space	47	34	1
Depth of knowledge	Low	51	35	5
	Medium	112	29	0
	High	12	27	6
Item format	Simple multiple-choice	50	36	9
	Complex multiple-choice	64	37	10
	Open-ended	61	45	23

4.2. Method

The multigroup 2PL model with item parameters equal across countries was fitted to the full data set using marginal maximum likelihood implemented in the R package TAM (Robitzsch et al., 2020). The estimates of the item parameters and the means and variances of the distribution of ability in each country were used to perform the 2PL-extension of standard profile analysis and permutation-based profile analysis as described in the previous section. Numerical integration was done using 101 equally spaced nodes between −5 and 5. The analyses were performed for every item feature (each of the three levels of the six item classifications). Permutation tests were performed for each item feature with 1,000 replications. For each item feature and country, we also computed the excess percentage in the observed and permuted data.

4.3. Results

Table 1 shows the results of studying DIF related to item features. For each item feature, one can see in how many countries a significant result was detected (at 5% $α$ -level) when comparing the $χ^{2}$ -statistic with the null-distribution under measurement invariance (i.e., standard profile analysis) and when comparing it to the distribution in the permuted data (i.e., permutation-based profile analysis). For practically all of the considered item features, in at least half of the countries, standard profile analysis produced a significant result, suggesting the presence of differential functioning for the two contrasted sets of test items (i.e., there is an effect at the level of the concrete test). However, the number of countries where DIF could be attributed to the item feature itself, rather than to an aggregation of idiosyncratic item-level results, was much smaller, as evidenced by the number of significant results of the permutation-based profile analysis, ranging from 0 (medium depth of knowledge vs. other items) to 23 countries (open-ended vs. other items). That is, there were many cases where a test-level significant result was found (showing differential functioning of the two sets of items on the test) that did not generalize to the population level, since a test statistic that was at least as extreme as the observed one was obtained in more than 5% of the permuted data sets. Thus, in those cases, there is not enough evidence available to attribute the observed differences to the item feature itself.

Using the results of one country (Australia) on one item feature (type of knowledge: content), Figure 1 presents a graphical illustration of the differences between the test-level inferences done in standard profile analysis and the population-level inferences done in permutation-based profile analysis. The values are presented on the log-scale to improve readability of the plot. The same observed $χ^{2}$ -statistic is computed in both methods and it is indicated by a solid vertical line. When performing standard profile analysis, it is compared to the distribution under measurement invariance, which is the $χ^{2}$ -distribution with one degree of freedom (indicated by a dashed line). Here, the probability of obtaining a value equal to or larger than the observed value is .003 (indicated by the dark gray area); therefore, on this test,. the items with and without this specific feature behave differently in the specific country that was expected based on the international estimates. This conclusion, however, applies only to the concrete set of items included in the test (i.e., at the test level) and does not necessarily generalize to a broader population of items (i.e., to the population level). To answer the question of generalizability, we look at the results of the permutation-based profile analysis. Here, the observed statistic is compared to its distribution under random permutations of the items into subsets (indicated by a dotted line). Here, the probability of obtaining a value equal to or larger than the observed one is .461 (indicated by the light gray shaded area), which means that the observed value is not an extreme outcome when dividing the items into two sets randomly rather than based on the item feature. Therefore, in this case, we cannot conclude that the feature “type of knowledge: content” has explanatory power for DIF that is observed for Australia, and the conclusion about differential functioning of the items with the specific feature cannot be generalized beyond the concrete items in the test (i.e., the effect of the item feature is not significant at the population level).

Figure 1.

Graphical representation of the differences between test-level and population-level inferences: The observed χ²-statistic (on the log-scale) is compared to its distribution under measurement invariance (χ²-distribution with one degree of freedom, dashed line) and its distribution under random permutations of the items (dotted line).

Figures 2 and 3 show the results of profile analysis for all countries for the item features “Item format: Open-ended” and “Depth of knowledge: Medium,” the item features for which the highest and the lowest number of significant results at the population level were found. In both figures, the triangles in the plots indicate those countries for which the $χ^{2}$ -test was significant only on the test level (i.e., based on standard profile analysis), and squares indicate countries for which both the test-level and the population-level results were significant (i.e., based on both standard and permutation-based profile analysis). Additionally, the plot shows the 2.5th and 97.5th percentiles of the distributions of excess percentage for each country based on the 1,000 permuted data sets. Note that while with standard profile analysis, excess percentage is only used as an effect size measure and statistical inference is done using the $χ^{2}$ -test, when permutation-based profile analysis is performed then the distributions under permutations can be obtained both for the $χ^{2}$ -statistic and for the excess percentage.

Figure 2.

Results of profile analysis for open-ended versus other items in all countries: The solid line indicates the observed excess percentage in the Programme for International Student Assessment data, while the dotted lines are the 2.5th and 97.5th percentiles of the distribution of the excess percentage in the permuted data; the countries are ordered based on the observed excess percentage.

Figure 3.

Results of profile analysis for medium depth of knowledge versus other items in all countries: The solid line indicates the observed excess percentage in the Programme for International Student Assessment data, while the dotted lines are the 2.5th and 97.5th percentiles of the distribution of the excess percentage in the permuted data; the countries are ordered based on the observed excess percentage.

Figure 2 shows the results for the items with open response format. As can be observed, in the vast majority of countries (45 of 55), DIF with respect to the open-ended versus nonopen-ended items was found with standard profile analysis, with excess percentages ranging from −12.09 to 10.26. Thus, while in Tunisia, the percentage of students performing better than expected on open-ended questions was 12.09 percentage points lower than expected; in China, this percentage was 10.26 percentage points higher than expected. Figure 2 shows that these test-level results do not generalize to the population level for all 45 countries. Rather, results are generalizable in 23 countries, where the estimated excess percentage falls outside of the confidence bounds (i.e., only for these countries, the results are significant according to the permutation-based profile analysis).³ For all countries where the excess percentage falls below the confidence band, there is generalizable evidence that they perform better on open-ended questions than expected based on the international results. For all countries where the excess percentage falls above the confidence band, we have evidence that students find open-ended questions easier than expected based on the international results. Thus, for these countries there is evidence—also for the application of future tests—that their overall test results will benefit or suffer from including more open-ended questions on the test. This conclusion does not extend to countries, where significant results occurred only at the test level.

Figure 3 shows the results for the items with medium depth of knowledge. Here, for all countries, the observed excess percentage lies within the 95% bounds of the distribution obtained based on the 1,000 permuted data sets. That is, in no country is the excess percentage for the item feature extreme compared to expectation when the test is randomly split into two sets with 112 and 63 items each. Thus, while at the test level, there are some countries that show better (or worse) performance than expected on the medium depth of knowledge items, there is no reason to expect that these findings would persist for a different set of items than the ones that were on the test. These results therefore do not suggest that certain countries would benefit from (or be adversely affected by) the inclusion of a larger proportion of medium depth of knowledge items on future tests.

5. Simulation Study

5.1. Method

To evaluate the performance of the permutation test in terms of Type I error and power, we conducted a simulation study. Here, we define Type I error as the probability of finding item-feature-related DIF, where none exists and power as the probability of finding item-feature-related DIF where it does exist. Generating parameters for the study are grounded in the results from our empirical analysis. First, we use empirically estimated international item parameters as described previously. This resulted in 175 items with difficulty and discrimination parameters assumed to be equivalent across our simulated countries to begin. These items were assigned to blocks and booklets that matched the PISA 2015 design. Then, to introduce DIF, we used empirical estimates in an analysis, where difficulty parameters were freely estimated across the 55 PISA countries. Specifically, we fitted country-specific models to each country’s data with discrimination parameters fixed to be equal to their international values, but difficulty parameters freely estimated. We computed DIF values for each country $\times$ item combination (i.e., the differences between the country-specific and international item parameters) and summarized these differences by computing their overall mean and standard deviation (SD). In those simulation conditions in which DIF was introduced (either related or not related to item features), we sampled DIF values from distributions with this mean and SD (for the exact specification of these distributions, see the description of the conditions in the following).

We used empirical sample sizes and achievement distributions as population parameters, with exceptions that we describe next. To exert some control over the study conditions and to examine whether sample size and population-test alignment are factors that influence the results, we replaced 12 empirical countries with generated countries according to the following structure: Two countries each with a combination of three sample sizes ( $N = 2,500$ , $N = 5,000$ , and $N = 10,000$ ) and two levels of population-test alignment (mean of ability equal to $μ = - 0.035$ , which matches the average international item difficulty, and mean of $μ = - 0.480$ , which is 1 SD below the international item difficulty⁴). Replacing empirical countries with generated ones resulted in overall weighted achievement mean and sample size that are close to those in empirical data (−0.155 in the simulated data vs. −0.162 in the empirical example and 298,846 in the simulated data vs. 299,240 in the empirical example, respectively). Finally, for the generated countries, we set the SD of ability to the empirical average across countries ( $σ = 0.888$ ), again, to maintain consistency between the empirical and simulation settings. We used two countries for each combination of sample size and population-test alignment instead of just one. This choice was because when item-feature-related DIF is introduced, it is important to counterbalance the (on average) negative DIF for a group of items in one country with (on average) positive DIF for these items in another country to make sure that across all countries, the DIF values for each item would average to zero. In the remainder of this section, the 12 generated countries are the focus of the analysis in that these countries are manipulated to contain item-feature-related DIF.

Type I error of profile analysis was studied under two different conditions. First, we considered a condition where there is no DIF for any item in any country. Second, we considered the possibility of DIF occurring, but not because of item features. In this condition, we generated DIF values for all items in the 12 generated countries by sampling them from $N {(0, 0.61}^{2})$ , where the mean and the SD (0 and 0.61) are based on the summary of the empirical country-specific DIF values. The item difficulty in these countries was equal to ${\hat{δ}}_{i} + Δ_{i c}$ , where ${\hat{δ}}_{i}$ is the international estimate of item difficulty of item i and $Δ_{i c}$ is the generated DIF value for item i in country c. For the other 43 countries, item difficulties were held equal to the international estimates. This approach ensured that the overall magnitude of simulated DIF was aligned with that of the empirically estimated DIF. For all 55 countries, discrimination parameters were held equal to their international estimates.

To evaluate whether performance of profile analysis is effected by the proportions of items with an item feature, we considered three different (arbitrary) item features that were assigned to items in three proportions: $1 / 2$ , $1 / 3$ , and $1 / 4$ of items. When evaluating Type I error (i.e., when no item-feature-related DIF is present), in each replication, only one data set was generated and profile analysis was performed for all three item features. However, when evaluating power, data were generated for each of the three item features separately as the DIF values are feature-dependent. Furthermore, when evaluating power, we crossed the item proportions with an additional design factor: size of the effect of the item feature on DIF values (i.e., in each replication, 3 (item proportions) × 3 (effect sizes) data sets were generated and analyzed). Effect size was defined in terms of the point-biserial correlation between the item feature and the DIF value: .1 (small), .3 (medium), and .5 (large). That is, the effect size does not quantify the magnitude of DIF that is present, but the strength of the relationship between the item feature and the DIF values when DIF if present. For each combination of item proportion and effect size, we specified the distributions from which the DIF values were sampled in such a way that the correlation between the DIF value and the item feature was equal to the desired value (.1, .3, or .5), the overall mean of the DIF values was 0, and their overall SD was equal to 0.61. That is, when evaluating power, the distributions of the DIF values in terms of the overall mean and SD for each combination of item proportion and effect size were matched to their distribution in the condition, in which DIF was not related to the item feature and to the empirical distribution of the DIF values in the PISA data. In each pair of generated countries with the same sample size and population-test alignment, we generated $Δ_{i c}$ from $N (μ_{1}, τ^{2})$ for the items with the item feature and $N (μ_{2}, τ^{2})$ for the items without the item feature in one country and from $N (- μ_{1}, τ^{2})$ and $N (- μ_{2}, τ^{2})$ in the other country from the same pair. The distributions of $Δ_{i c}$ within each pair of countries were mirrored to make sure that the overall expectation of the DIF values was equal to 0 for each item. The values for $μ_{1}, μ_{2}$ , and $τ$ for the different conditions are given in Table 2.

Table 2.

Parameters for the Distributions of the Differential Item Functioning Values in the Generated Countries in the Different Conditions Evaluating Power

Item Proportion	Small Effect			Medium Effect			Large Effect
	$μ_{1}$	$μ_{2}$	$τ$	$μ_{1}$	$μ_{2}$	$τ$	$μ_{1}$	$μ_{2}$	$τ$
1/2	−.06	.06	.61	−.19	.19	.58	−.31	.31	.53
1/3	−.09	.04	.61	−.26	.13	.58	−.43	.22	.53
1/4	−.11	.04	.61	−.32	.11	.58	−.53	.18	.53

In the no-DIF condition, 1,000 data sets were generated, in which first the student abilities in each country were sampled from their population distribution and then, using the international item parameters, their responses to the items from the assigned booklets were generated. A measurement invariance multigroup IRT model was fitted to the simulated data as in the empirical example and the estimates of the international item parameters were obtained. Profile analysis with and without the permutation test (with 1,000 replications) was performed for the three item features variables in the six of the generated countries (one for each combination of sample size and population-test alignment). Permutations were performed for each item feature with different item proportions separately, which resulted in 18 permutation-based distributions of the $χ^{2}$ -statistics, one for each combination of sample size, population-test alignment, and item proportions. For standard profile analysis, Type I error for each combination of sample size, population-test alignment, and item proportion was computed as the proportion of data sets, in which the corresponding observed $χ^{2}$ -statistic was higher than the 95th percentile of the $χ^{2}$ -distribution with one degree of freedom. For the permutation-based profile analysis, we computed the proportions of data sets in which each of the 18 (i.e., 3 sample size × 2 population-test alignment × 3 item proportion) observed $χ^{2}$ -statistics was larger than the 950th largest value in the corresponding permutation-based distribution.

In the DIF-unrelated-to-item-features condition, the data were generated analogously to the no-DIF condition with the exception that in the 12 generated countries, DIF values randomly sampled for each data set were added to the international item difficulties. Profile analysis was performed in each of the 1,000 simulated data sets and the Type I error was calculated in the same way as in the previous condition. We note that although we use the term “Type I error” for the standard profile analysis, the null hypothesis of this test is not necessarily true in this condition, as the test is for the concrete set of items with an item feature rather than the population of items. Therefore, here we do not expect this test to have the Type I error rate of .05.

Finally, we evaluated the power of the profile for detecting item-feature-related DIF. The data were generated in a similar way as in the DIF-unrelated-to-item-features condition, but the DIF values were generated to be correlated with one of the three item feature variables as described previously. Profile analysis was performed in each of the 9,000 simulated data sets (1,000 for each combination of effect size and proportion of items with the item feature). Power was calculated in similar to the Type I error in the previous two conditions.

5.2. Results

Table 3 shows the results in terms of Type I error for standard profile analysis. When the null hypothesis is true (i.e., there is no country DIF), then the proportion of significant results is close to the nominal value of .05 in all conditions. When measurement invariance is violated but DIF is not related to item features, then for a large proportion of replications, significant results are obtained (.737 on average across conditions, see Table 3). Note that the latter results cannot be interpreted as Type I error since the null hypothesis of standard profile analysis which is aimed at inferences at the test level is violated. The expected value of the difference between the average DIF value of the items with the feature and the average DIF value of the items without the feature is zero, but the probability of it being equal to exactly zero in the sample of 175 test items is negligible. Due to the data generating process, at the test level, there will always be at least some item-feature-related DIF (e.g., in the condition with the item proportion of 1/2, the expected absolute value of the difference between the average DIF values is 0.073), and it is a question of power whether this will can be detected. Because we are looking at averages, the power will be higher than when detecting DIF on the level of individual items. In the case of applications like PISA where sample sizes are large, it is possible to detect even small differences, which is reflected in the large proportion of significant results when DIF is present but not related to item features. As can be seen in Table 3, sample size has a clear effect on this proportion. It is important to note that the overall magnitude of DIF that we used in the simulation was based on what we found in the empirical PISA data. Smaller/larger DIF would lead to smaller/larger percentage of significant results. These results clearly indicate that while standard profile analysis correctly identifies that DIF is present for the sets of items, it cannot be used for making inferences that extend beyond the specific test considered (i.e., about the item feature effect rather than about differences between the two concrete sets of items), as it very often rejects the null hypothesis when violations of measurement invariance are not actually related to item features on the population level. Table 4 shows the results for Type I error for permutation-based profile analysis. Here, both when measurement invariance holds and when country DIF is present, but is not related to item features, Type I error is close to the nominal value of .05.

Table 3.

Type I Error Rate of Standard Profile Analysis and the Proportion of Significant Results for Standard Profile Analysis When Measurement Invariance Is Violated but Differential Item Functioning (DIF) Is Not Related to Item Features (i.e., the Null Hypothesis Tested With Permutation-Based Profile Analysis Is True)

Item Proportion	Country Mean at $\bar{δ}$			Country Mean Below $\bar{δ}$
	$N = 2,500$	$N = 5,000$	$N = 10,000$	$N = 2,500$	$N = 5,000$	$N = 10,000$
There is no country DIF
1/4	.036	.047	.040	.038	.044	.069
1/3	.046	.043	.054	.059	.053	.052
1/2	.054	.057	.055	.044	.047	.059
Country DIF is unrelated to item features
1/4	.668	.774	.796	.642	.748	.788
1/3	.649	.757	.808	.663	.737	.809
1/2	.689	.722	.818	.650	.737	.809

Table 4.

Type I Error Rate of Permutation-Based Profile Analysis

Item Proportion	Country Mean at $\bar{δ}$			Country Mean Below $\bar{δ}$
	$N = 2,500$	$N = 5,000$	$N = 10,000$	$N = 2,500$	$N = 5,000$	$N = 10,000$
There is no country Differential Item Functioning (DIF)
1/4	.039	.053	.038	.045	.049	.066
1/3	.052	.040	.053	.059	.048	.051
1/2	.060	.058	.052	.043	.050	.055
Country DIF is unrelated to item features
1/4	.049	.041	.054	.057	.044	.053
1/3	.054	.043	.048	.054	.058	.052
1/2	.046	.042	.058	.031	.053	.044

Tables 5 and 6 show power for standard profile analysis and permutation-based profile analysis, respectively. Power is generally higher for standard profile analysis (in all conditions higher than .8), which makes sense since it tests a stricter null hypothesis (and had a very high false positive rate when DIF unrelated to the item features was present, as was seen in Table 3). The power of permutation-based profile analysis is larger than .9 for medium and large correlations (on average .967 and 1.000, respectively) between the item feature and DIF parameters, but relatively low for the small effect (on average .223 across conditions). Low power for the small effect is to be expected since, with only 175 items, high power would not be achieved even if the DIF parameters were known rather than estimated with uncertainty and the effect of the item feature was tested simply by testing the significance of the point-biserial correlation between these parameters and the item feature. For example, when 1/2 of the items have the item feature and the effect of item feature is small, the power of the point-biserial correlation test is .234.⁵ Sample size (average power is .726 for N = 2,500, .728 for N = 5,000, and .728 for N = 10,000), proportion of items with the item feature (average power is .727 for 1/4, .736 for 1/3, and .727 for 1/2), and alignment between the country mean and the items’ difficulty (average power is .729 for country mean at $\bar{δ}$ and .730 for country mean below $\bar{δ}$ ) did not have a substantial effect on power.

Table 5.

Power of Standard Profile Analysis

Effect Size	Item Proportion	Country Mean at $\bar{δ}$			Country Mean Below $\bar{δ}$
Effect Size	Item Proportion	$N = 2,500$	$N = 5,000$	$N = 10,000$	$N = 2,500$	$N = 5,000$	$N = 10,000$
Small	1/4	0.833	0.904	0.930	0.838	0.894	0.923
	1/3	0.858	0.891	0.928	0.851	0.895	0.926
	1/2	0.833	0.881	0.929	0.849	0.889	0.930
Medium	1/4	0.999	0.999	1.000	1.000	1.000	0.999
	1/3	1.000	1.000	1.000	0.998	1.000	1.000
	1/2	1.000	1.000	1.000	0.998	1.000	0.999
Large	1/4	1.000	1.000	1.000	1.000	1.000	1.000
	1/3	1.000	1.000	1.000	1.000	1.000	1.000
	1/2	1.000	1.000	1.000	1.000	1.000	1.000

Table 6.

Power of Permutation-Based Profile Analysis

Effect Size	Item Proportion	Country Mean at $\bar{δ}$			Country Mean Below $\bar{δ}$
Effect Size	Item Proportion	$N = 2,500$	$N = 5,000$	$N = 10,000$	$N = 2,500$	$N = 5,000$	$N = 10,000$
Small	1/4	0.220	0.217	0.204	0.211	0.221	0.231
	1/3	0.241	0.233	0.254	0.225	0.256	0.245
	1/2	0.207	0.205	0.211	0.205	0.197	0.236
Medium	1/4	0.957	0.954	0.982	0.954	0.946	0.981
	1/3	0.957	0.969	0.968	0.959	0.971	0.977
	1/4	0.964	0.974	0.979	0.962	0.965	0.979
Large	1/4	1.000	1.000	1.000	1.000	1.000	1.000
	1/3	1.000	1.000	1.000	1.000	1.000	1.000
	1/2	1.000	1.000	1.000	1.000	1.000	1.000

6. Discussion

Investigating measurement invariance and violations of it in relation to item features is important as it can bring insights into how educational systems differ from each other internationally, across cultures. Similarly, measurement invariance is important at the national level for understanding how policy-relevant subgroups differ. In the current article, we proposed an extension of profile analysis that allows its application to the 2PL model, as well as allowing for inferences that generalize to the population of items, and hence to future test applications. By extending it to the 2PL model, profile analysis is more flexible. This is important because using the more restrictive model can potentially lead to confounding when the discrimination parameters actually vary across items, a situation that may occur frequently in practice. This extension should thus make profile analysis more relevant for operational settings, where the restrictions imposed by the Rasch model may not always be considered acceptable.

The second contribution of this article was the inclusion of a permutation test aimed at evaluating whether detected measurement invariance violations in profile analysis with the standard $χ^{2}$ -test can be attributed to the item feature itself and hence may generalize beyond the specific set of items that were considered. Using our permutation-based profile analysis, one can test whether the $χ^{2}$ -statistic quantifying item-feature-related DIF is significantly more extreme for the actual item feature that would be expected under a random splitting of the test items into subsets. If it is not the case, then there is insufficient evidence to conclude that the observed violation of measurement invariance can be attributed to the item feature, rather than simply being the result of an aggregation of item-specific DIF that is unrelated to the items sharing a particular feature.

The proposed permutation-based profile analysis allows users to examine whether results found at the test level can be attributed to particular item features and hence can be generalized to the population of items, rather than only being applicable to the idiosyncratic set of items that were included in the specific test administration. Currently, this level of generalizability is not possible in standard profile analysis. Importantly, such generalizability is valuable when evaluating the fairness of future tests that will be administered to participating countries, as it will allow one to determine whether including items with certain feature is likely to positively or negatively impact the performance of particular countries. The results of our empirical analysis and the simulation study show that basing this assessment on the results of a standard profile analysis $χ^{2}$ -test may lead to an overestimation of the number of countries affected. This is a logical outcome, since the $χ^{2}$ -test of the standard profile analysis evaluates a different (and more stringent) null hypothesis than the one considered by the permutation-based profile analysis, and that more stringent test is not aimed at the level of the population of items but only at the level of the sets of items on the test. It may be noted that the operational costs and challenges that would follow from using the test-level results rather than the results of the permutation test can be substantial, as one would incorrectly assume that more item features are problematic (and for more countries) than is warranted. The illustrative example shows that not all results that are flagged by the $χ^{2}$ -statistic actually will need to be taken into account for future test construction, since not all results generalize. Thus, revealing which item features actually can be expected to matter (and which countries actually are affected by it) should be considered highly relevant for optimal future test construction and evaluation. Caution must be used when making conclusions based on data from international studies with finite sets of items. Our article illustrates how test-level evidence of item-feature-related DIF between groups of items does not necessarily generalize to the population of items that share those features. Further, uncovering generalizable item-feature-related DIF depends on an appropriate method, such as the one we propose here. Researchers and policy-makers should take care when drawing conclusions based on analyses that do not consider the issue of generalizing beyond the test level. For example, when using standard profile analysis without the permutation test, it would be premature to conclude that in some countries, medium depth of knowledge items are more difficult or easy compared to low and high depth of knowledge items than is the case internationally. Rather, one can make such a conclusion only about the concrete sets of items in PISA 2015 Science, and there is no evidence that this result would be found in another cycle of PISA, where a different set of items would be used and, hence, where the PISA 2015 test-level results do not apply. When the results are not generalizable, they are merely a description of the differences between the items in a concrete test and can, at best, serve as grounds for formulating hypotheses for further research.

Understanding that an item feature explains DIF for certain countries can have notable implications for the fairness of country comparisons. Ideally, country comparisons should not depend on the proportion of items included in the test with a specific feature. When a certain item feature is found to have an effect—as was the case for all but one of the features considered in the illustrative example—this raises the question to what extent a comparison of the performance of countries is affected by the design choices of the test. Would the results have looked differently if more items of a certain type were included in the test? How robust are the rankings of countries to different choices for the test composition? These are questions that require one to consider the population of items, rather than just the set of items that were included in the test, and hence require a test for generalizability such as the permutation test proposed in this article. In this sense, the proposed method should facilitate the evaluation of the robustness of ILSA country comparisons to different choices for the test design.

It may be noted that for the test-level analysis of item-feature effects, the main challenge is to obtain sufficient observations per item; however, obtaining sufficiently high power for detecting item-feature effects that generalize outside the test setting also depends heavily on the number of considered items that do and do not have the feature. Thus, even with the sample sizes usually encountered in ILSAs, one may have trouble gaining sufficient power to study the effect of each item feature, especially if a certain feature is only shared by a small subset of the items on the test. In this context, it may be relevant to note that the absence of evidence of an effect of a certain item feature should not be taken as conclusive evidence for the actual absence of an effect, especially when there were not that many items on the test that shared the considered feature.

In this article, we have taken profile analysis as the starting point for developing a procedure for investigating item-feature-related DIF in a generalizable way. However, the kind of permutation test that we proposed can be adapted to other measures of item-feature-related DIF that do not yet address the issue of generalizability, like differential facet functioning and differential bundle functioning. In both cases, the distribution of the DIF-statistic used to quantify DIF for the group of items can be obtained using permutations of items into random subsets and inferences can be made based on how the observed statistic relates to this distribution.

One limitation of the proposed procedure is that item features are evaluated independently of each other and it is not possible to evaluate the effect of one item feature after controlling for other item features. As such, caution should be taken when interpreting the results, especially when item features are correlated with each other.

When testing the effects of multiple item features in multiple countries, researchers should be aware of the effects of multiple testing on the overall Type I error of the procedure. This problem, however, is not specific to the developed permutation-based profile analysis but applies to any method aimed at testing the effects of multiple item features on DIF in multiple groups (e.g., differential facet functioning and differential bundle functioning). Depending on the level at which the inferences are being made (country level, test level, item-feature level), multiple testing corrections can be applied at the corresponding level. As many of these methods (Bonferroni, 1936; Hochberg, 1988; Holm, 1979; Šidák, 1967) involve decreasing the $α$ -level, one might need to increase the number of permutations to achieve sufficient precision for the permutation-based p value. Potentially, a global test for testing the effects of different item features at the same time could be developed, but that goes beyond the scope of the current article.

Another limitation is that currently the permutation-based profile analysis can be used only for binary categorizations of the items (e.g., open-ended items vs. multiple-choice items). Permutation-based profile analysis can be extended to allow for multiple item categories similar to how it has been done for standard profile analysis (Verhelst, 2012), but this goes beyond the scope of the current article. If the item feature is not categorical, but continuous (e.g., some measure of item complexity), then to use our procedure, it has to be discretized, which would lead to some loss of information.

A further area for future research is to investigate whether adjusting the international scaling model to account for known, generalizable item-feature-related DIF would produce desirable outcomes. One possibility could be to include country-specific item parameters for items with features known to function differently in particular countries. A second option could be to use a linear logistic test model with $f e a t u r e \times c o u n t r y$ interactions. A third possibility would be to model item features as extra dimensions directly in the model (e.g., subdomain, depth of knowledge, or multiple choice vs. open response); however, a practical limitation involves the number of item responses per additional dimension, which would be quite small if all dimensions are crossed.

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iDs

Maria Bolsinova

Jesper Tijmstra

Notes

References

Birnbaum

A. L

. (1968). Some latent trait models and their use in inferring an examinee’s ability. In Lord

F. M.

Novick

M. R.

(Eds.), Statistical theories of mental test scores (pp. 397–479). Addison-Wesley, Reading.

Bolt

Stout

(1996). Differential item functioning: Its multidimensional model and resulting sibtest detection procedure. Behaviormetrika, 23(1), 67–95.

Bonferroni

(1936). Teoria statistica delle classi e calcolo delle probabilita. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commericiali di Firenze, 8, 3–62.

Bundsgaard

(2019). DIF as a pedagogical tool: Analysis of item characteristics in Icils to understand what students are struggling with. Large-Scale Assessments in Education, 7(1), 1–14.

Camilli

Smith

J. K.

(1990). Comparison of the Mantel-Haenszel test with a randomized and a jackknife test for detecting biased items. Journal of Educational Statistics, 15(1), 53–67. https://doi.org/10.2307/1164821

Culpepper

S. A

. (2009, September). A multilevel nonlinear profile analysis model for dichotomous data. Multivariate Behavioral Research, 44(5), 646–667. Retrieved November 4, 2021, from https://doi.org/10.1080/00273170903187640 (Publisher: Routledge).

Davison

M. L.

(1994). Multidimensional scaling models of personality responding. In Strack

Lorr

(Eds.), Differentiating normal and abnormal personality (pp. 196–215). Springer Publishing Company.

Ding

. (2001, November). Profile analysis: Multidimensional scaling approach. Practical Assessment, Research, and Evaluation, 7(1). https://scholarworks.umass.edu/pare/vol7/iss1/16. https://doi.org/10.7275/kp41-6458

Douglas

J. A.

Roussos

L. A.

Stout

(1996). Item-bundle DIF hypothesis testing: Identifying suspect bundles and assessing their differential functioning. Journal of Educational Measurement, 33(4), 465–484.

10.

Engelhard

(1992). The measurement of writing ability with a many-faceted Rasch model. Applied Measurement in Education, 5(3), 171–191.

11.

Fischer

G. H.

(1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37(6), 359–374.

12.

Fischer

G. H.

(1995). Derivations of the Rasch model. In Fischer

G.H.

Molenaar

I.W.

(Eds.), Rasch models (pp. 15–38). Springer. https://doi.org/10.1007/978-1-4612-4230-7_2

13.

Gibson

W. A

. (1959). Three multivariate models: Factor analysis, latent structure analysis, and latent profile analysis. Psychometrika, 24(3), 229–252. Retrieved November 4, 2021, from http://link.springer.com/10.1007/BF02289845 doi:10.1007/BF02289845

14.

Gonçalves

Gamerman

Soares

(2013). Simultaneous multifactor DIF analysis and detection in item response theory. Computational Statistics & Data Analysis, 59, 144–160.

15.

Grisay

Monseur

(2007). Measuring the equivalence of item difficulty in the various versions of an international test. Studies in Educational Evaluation, 33(1), 69–86. https://doi.org/10.1016/j.stueduc.2007.01.006

16.

Hochberg

(1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75(4), 800–802.

17.

Holm

(1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6, 65–70.

18.

Huo

Heyvaert

Van den Noortgate

Onghena

(2014). Permutation tests in the educational and behavioral sciences: A systematic review. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 10(2), 43–59. https://doi.org/10.1027/1614-2241/a000067

19.

Hyde

J. S.

Fennema

Lamon

S. J.

(1990). Gender differences in mathematics performance: A meta-analysis. Psychological Bulletin, 107(2), 139.

20.

Johnson

Lord

. (1958). An empirical study of the stability of a group mean in relation to the distribution of test items among students. Educational and Psychological Measurement, 18(2), 325–329.

21.

Jöreskog

(1971). Simultaneous factor analysis in several populations. Psychometrika, 36(4), 409–426. https://doi.org/10.1007/BF02291366

22.

Kankaraš

Moors

(2014). Analysis of cross-cultural comparability of PISA 2009 scores. Journal of Cross-Cultural Psychology, 45(3), 381–399. https://doi.org/10.1177/0022022113511297

23.

Kjærnsli

Lie

(2004). PISA and scientific literacy: Similarities and differences between the Nordic Countries. Scandinavian Journal of Educational Research, 48(3), 271–286. https://doi.org/10.1080/00313830410001695736

24.

Kreiner

Christensen

K. B.

(2014). Analyses of model fit and robustness. A new look at the PISA scaling model underlying ranking of countries according to reading literacy. Psychometrika, 79(2), 210–231. https://doi.org/10.1007/s11336-013-9347-z

25.

LaFleur

B. J.

Greevy

R. A.

(2009). Introduction to permutation and resampling-based hypothesis tests. Journal of Clinical Child & Adolescent Psychology, 38(2), 286–294. https://doi.org/10.1080/15374410902740411

26.

Lazarsfeld

P. F

. (1950). The logical and mathematical foundation of latent structure analysis. In Stouffer

S. A.

Guttman

Suchman

E. A.

Lazarsfeld

P. F.

Star

S. A.

Clausen

J. A.

(Eds.), Measurement and prediction (Vol. 4. Ch. 10, p. 756). Princeton University Press.

27.

Leder

G. C.

(1992). Mathematics and gender: Changing perspectives. In Grouws

D. A.

(Ed.), Handbook of research on mathematics teaching and learning (pp. 597–622). Maxwell Macmillan International.

28.

Liu

O. L.

Wilson

(2009). Gender differences in large-scale math assessments: PISA trend 2000 and 2003. Applied Measurement in Education, 22(2), 164–184.

29.

Mendes-Barnett

Ercikan

(2006). Examining sources of gender DIF in mathematics assessments using a confirmatory multidimensional model approach. Applied Measurement in Education, 19(4), 289–304.

30.

Meredith

(1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58(4), 525–543. https://doi.org/10.1007/BF02294825

31.

Millsap

R. E.

Everson

H. T.

(1993). Methodology review: Statistical approaches for assessing measurement bias. Applied Psychological Measurement, 17(4), 297–334. https://doi.org/10.1177/014662169301700401

32.

Mislevy

R. J

. (1984). Estimating latent distributions. Psychometrika, 49(3), 359–381. Retrieved September 5, 2012, from http://www.springerlink.com/content/hv50347443r572h4/abstract/ doi:10.1007/BF02306026

33.

Mislevy

R. J.

Johnson

E. G.

Muraki

(1992, June). Chapter 3: Scaling procedures in NAEP. Journal of Educational and Behavioral Statistics, 17(2), 131–154. Retrieved September 12, 2012, from http://jeb.sagepub.com/content/17/2/131 doi:10.3102/10769986017002131

34.

Moses

J. A.

Pritchard

(1996). Modal profiles for the Wechsler Adult Intelligence Scale-Revised. Archives of Clinical Neuropsychology, 11(1), 61–68.

35.

Mullis

I. V. S.

Martin

M. O.

(2019). PIRLS 2021 assessment frameworks. TIMSS & PIRLS International Study Center, Boston College.

36.

Mullis

I. V. S.

Martin

M. O.

(Eds.). (2017). TIMSS 2019 assessment framework. TIMSS & PIRLS International Study Center.

37.

National Assessment Governing Board. (2019). Mathematics framework for the 2019 NAEP. National Center for Educational Statistics.

38.

Oliveri

M. E.

von Davier

(2011). Investigation of model fit and score scale comparability in international assessments. Psychological Test and Assessment Modeling, 53(3), 315–333.

39.

Olsen

R. V.

(2005). An exploration of cluster structure in scientific literacy in PISA: Evidence for a Nordic dimension? Nordic Studies in Science Education, 1(1), 81–94.

40.

Ong

Y. M.

Williams

Lamprianou

(2013). Exploring differential bundle functioning in mathematics by gender: The effect of hierarchical modelling. International Journal of Research & Method in Education, 36(1), 82–100.

41.

Organization for Economic Co-Operation and Development. (2014a). PISA 2012 technical report. OECD Publishing.

42.

Organization for Economic Co-Operation and Development. (2014b). TALIS 2013 technical report (Tech. Rep.). OECD Publishing.

43.

Organization for Economic Co-Operation and Development. (2017). PISA2015 assessment and analytical framework. OECD Publishing.

44.

Organization for Economic Co-Operation and Development. (2017). PISA 2015 technical report. OECD Publishing.

45.

Organization for Economic Co-Operation and Development. (2020). PISA 2018 technical report (Tech. Rep.). OECD Publishing.

46.

Penfield

R. D.

Lam

T. C. M.

(2000). Assessing differential item functioning in performance assessment: Review and recommendations. Educational Measurement: Issues and Practice, 19(3), 5–15. https://doi.org/10.1111/j.1745-3992.2000.tb00033.x

47.

Potenza

M. T.

Dorans

N. J.

(1995). DIF assessment for polytomously scored items: A framework for classification and evaluation. Applied Psychological Measurement, 19(1), 23–37. https://doi.org/10.1177/014662169501900104

48.

Rasch

(1960). Probabilistic models for some intelligence and attainment tests. Paedagogiske Institut.

49.

Robitzsch

Kiefer

Robitzsch

M. A.

Adams

Rcpp

LSAmitR

R. E

. (2020). Package “tam.” Test Analysis Modules–Version: 3.4, 26 .

50.

Rutkowski

Gonzalez

von Davier

Zhou

. (2014). Assessment design for international large-scale assessment. In Rutkowski

Davier

M. v.

Rutkowski

(Eds.), Handbook of International large-scale assessment: Background, technical issues, and methods of data analysis (pp. 75–95). Chapman & Hall/CRC Press.

51.

Rutkowski

Zhou

(2016). Item calibration samples and the stability of achievement estimates and system rankings: Another look at the PISA model. International Journal of Testing, 16(1), 1–20.

52.

Saatçioğlu

Hamide

D. G

. (2017). Evaluation of reading skills subtest results of countries participating in the PISA 2009 application via profile analysis. Egitim ve Bilim, 42(190), 401–422.

53.

Shealy

Stout

(1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58(2), 159–194.

54.

Shoemaker

D. M

. (1973). Principles and procedures of multiple matrix sampling (Vol. XVIII). Ballinger.

55.

Šidák

(1967). Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association, 62(318), 626–633.

56.

Soares

T. M.

Gonçalves

F. B.

Gamerman

(2009). An integrated Bayesian model for DIF analysis. Journal of Educational and Behavioral Statistics, 34(3), 348–377. https://doi.org/10.3102/1076998609332752

57.

Swanson

D. B.

Clauser

B. E.

Case

S. M.

Nungester

R. J.

Featherman

(2002). Analysis of differential item functioning (DIF) using hierarchical logistic regression models. Journal of Educational and Behavioral Statistics, 27(1), 53–75.

58.

Vaheoja

Verhelst

N. D.

Eggen

T. J. H. M.

(2019). Profile analyses as feedback by evaluating the balance in exam scores. European Journal of Science and Mathematics Education, 7(2), 83–92.

59.

Van den Noortgate

De Boeck

(2005). Assessing and explaining differential item functioning using logistic mixed models. Journal of Educational and Behavioral Statistics, 30(4), 443–464.

60.

Van den Noortgate

De Boeck

Meulders

(2003). Cross-classification multilevel logistic models in psychometrics. Journal of Educational and Behavioral Statistics, 28(4), 369–386.

61.

Verhelst

N. D.

(2012). Profile analysis: A closer look at the PISA 2000 reading data. Scandinavian Journal of Educational Research, 56(3), 315–332.

62.

Verhelst

N. D.

Banerjee

McLain

(2016). Fairness and bias in language assessment. In Banerjee

Tsagari

(Eds.), Contemporary Second Language Assessment (pp. 243–260). Bloomsbury Academic.

63.

Yıldırım

H. H

. (2022). Relative strengths and weaknesses of countries in the TIMSS 2015mathematics assessment framework categories. International Journal of Mathematical Education in Science and Technology, 53, 1461–1477.

64.

Zabulionis

(2001). Similarity of TIMSS math and science achievement of nations. Education Policy Analysis Archives/Archivos Analíticos de Pol´ıticas Educativas, 9, 33.