Abstract
The current literature on test equating generally defines it as the process necessary to obtain score comparability between different test forms. This definition contrasts with Lord’s foundational paper, which viewed equating as the process required to obtain comparability of the measurement scales of the forms. The distinction between the notions of scale and score is not trivial. The difference is explained by connecting these notions with such standard statistical concepts as probability experiment, sample space, and random variable. The probability experiment underlying the equating of test forms with random scores immediately gives us the equating transformation as a function mapping the scale of one form onto that of the other, and thus supports the point of view taken by Lord. However, both Lord’s view and the current literature appear to rely on the idea of an experiment with random examinees, which implies a different notion of test scores. It is shown how an explicit choice between the two experiments is not just important for our theoretical understanding of key notions in test equating but also has important practical consequences.
Introduction
The process of equating two different forms of the same test is often referred to briefly as “test equating.” The custom obviously is an example of casual use of language, as test forms are physical entities left untouched during the process of equating. Casual language is unproblematic as long as all parties are aware of the actual meaning of the words used in their communications. But this is precisely where the current literature on test equating seems to disagree with a foundational paper published some 70 years ago.
One of the frequently cited introductory texts to test equating defines it as a statistical process that is used to adjust scores on test forms so that scores on the forms can be used interchangeably. (Kolen & Brennan, 2014, p. 2)
However, in an early introduction to the problem of test equating by Frederic M. Lord, one of the founding fathers of modern test theory, we meet a different view of what is actually adjusted during an equating. Taking the frequency distributions of scores on two different test forms as his point of departure, this author used the following analogy to explain the process of equating: We may imagine each of the original two frequency distributions of scores to be drawn on a perfectly elastic surface. It will then always be possible to stretch and compress the scale (as it is drawn on the surface) of either one of the distributions in such a way that this distribution will become identical with the distribution on the other test. This stretching and compressing of the score scale of measurement transforms it into a new converted-score scale of measurement. The converted-score scale on the first test and the unconverted score scale on the second test will now be “comparable.” (Lord, 1950, p. 2)
From a statistical point of view, the notions of scale, score, and score distribution are fundamentally different. The purpose of this note is to analyze these differences in more detail and show how they relate to the problem of test equating. The results of the analysis reveal that the view offered by Lord some 70 years ago appears to be more in line with standard statistics than the current literature on equating. As for the notion of a score distribution, both views appear to rely on a probability experiment with distributions different from what follows from test theory.
Basic Statistical Concepts
Test equating, like most of test theory, is indebted to statistics for its main concepts and methods. It is thus no surprise that such notions as scale, score, and score distribution have their parallels in statistics. In fact, under different names, they underlie all of statistics. In the following, we analyze these parallels; for a more technical introduction to the statistical concepts used in our analysis, Casella and Berger (2002, chap. 1) is recommended.
A typical statistical treatment begins with the explicit definition of a probability experiment in the outcomes of which we are interested. Examples of such experiments are throwing a die, watching a tennis match, recording the effect of a new pharmaceutical drug, and, indeed, observing an examinee taking a test. As an illustration, we first consider the well-known case of throwing a die. The set of all possible outcomes of the throw—that is, the set of integers $\{1, 2, \dots, 6\}$—is known as the sample space of the experiment. The outcome of the throw itself is a random variable defined on this sample space, with a probability distribution over its possible values.
If the experiment can be described in this probabilistic way, statistical analysis becomes possible. Typically, we then begin by deriving a model for the probability distribution of the outcomes from the nature of the experiment. For example, if the assumption of a fair die holds, the observed outcomes are from a uniform distribution. If the die is not fair, the distribution must be a member of the multinomial family with unequal probabilities. And if it is replaced with a fair coin, the distribution is known to be binomial.
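These three models can be illustrated with a short simulation. The sketch below assumes NumPy is available; the unfair-die probabilities are hypothetical values chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Fair die: outcomes uniformly distributed over {1, ..., 6}
fair_throws = rng.integers(1, 7, size=10_000)

# Unfair die: the counts of 10,000 throws follow a multinomial
# distribution with unequal probabilities (hypothetical values)
probs = [0.10, 0.10, 0.15, 0.15, 0.20, 0.30]
unfair_counts = rng.multinomial(10_000, probs)

# Fair coin: the number of heads in 10 tosses is binomial(10, 0.5)
heads = rng.binomial(n=10, p=0.5, size=10_000)
```

In each case, the model for the distribution follows from the description of the experiment alone, before any data are observed.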
The importance of the distinctions between all these elementary concepts is reflected in their notation. The sample space for the die experiment is the set $\mathcal{S} = \{1, 2, \dots, 6\}$, the random variable representing the outcome of a throw is denoted by the capital $X$, and its possible realizations are denoted by the lowercase $x \in \mathcal{S}$.
The notation for the probability distribution associated with $X$ needs a little more explanation. For technical reasons, statistics operates with the cumulative distribution function, which is denoted as $F_X(x) \equiv \Pr\{X \le x\}$, the probability that the random variable takes a value smaller than or equal to $x$.
Probability Experiment of Test Taking
We now consider the experiment of an arbitrary examinee p taking a test where our interest is in the number of correct responses. The experiment served as the point of departure for Novick’s (1966) early axiomatic formalization of classical test theory, shortly thereafter included in Lord and Novick (1968, section 2.2).
For a test of $n$ items, the experiment has sample space $\mathcal{X} = \{0, 1, \dots, n\}$, the set of all possible number-correct scores.
It is now clear that the idea of a score scale in test theory (in test equating: the number-correct scale) is exactly the same as that of a sample space in statistics. It refers to the entire set of possible outcomes of the test; nothing more, nothing less. As for the idea of a test score, for an arbitrary examinee $p$, this is just a random variable $X_p$ defined on this sample space, with a different probability distribution $F_{X_p}(x) \equiv \Pr\{X_p \le x\}$ for each examinee.
Although straightforward, the match between all these concepts is easily blurred due to unfortunate connotations of the terms “observed score” and “observed-score distribution” in use throughout test theory. The former is easily taken to suggest the number of items correct actually observed when an examinee takes a test. However, as just highlighted, the observed score of an examinee p is a random variable Xp with possible realizations ranging over the entire scale, as opposed to the true score, which is its fixed expected value (e.g., Lord & Novick, 1968, section 2.3). What actually is observed for the examinee in a test administration is one realized observed score.
The term “observed score,” when taken at face value, is thus potentially misleading. In fact, when its actual meaning is ignored, it is a small step to take “observed-score distribution” to refer to the frequency distribution of the numbers of items correct actually obtained for the entire group of examinees in the test administration. However, the term should be used for the distributions of the random variables Xp for each of the examinees, not for an empirical distribution of the single realizations recorded for each of them as in Lord’s quote. The best way to keep track of all necessary distinctions is to remember that, paradoxically, the observed score of an examinee is never observed, only one of its possible values is.
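The distinction can be made concrete with a small simulation. The snippet below assumes, purely for illustration, a binomial model for a hypothetical examinee p with success probability 0.65 on a 40-item test: the observed score has a whole distribution of possible realizations, while a single administration records just one of them.

```python
import numpy as np

rng = np.random.default_rng(7)

n_items = 40
pi_p = 0.65                   # hypothetical success probability of examinee p
true_score = n_items * pi_p   # fixed expected value of the observed score X_p

# The observed score X_p is a random variable; simulate its distribution
# via many hypothetical replications of the test administration
replications = rng.binomial(n_items, pi_p, size=100_000)

# A single test administration yields only one realization of X_p
one_realization = rng.binomial(n_items, pi_p)
```

The mean of the replications approximates the true score of 26, while `one_realization` is the only quantity ever observed in practice: the observed score itself is never observed.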
Experiment of Test Equating
It is now time to apply the standard statistical terminology and notation to the problem of test equating. The first step is to realize that the problem actually involves two distinct probability experiments, one for an old test form X and the other for a new form Y, both assumed to measure the same ability. (We use Latin capitals to denote test forms and italics for the random variables associated with them.) When we equate back in time, only the experiment for the new form is conducted. The experiment for the old form is hypothetical; it would have been conducted if the new examinees had taken the old form as well (but then, obviously, no equating problem would have been left).
For notational simplicity, consider the case of forms with the same number of items, $n$. The experiment for the old form has sample space $\mathcal{X} = \{0, 1, \dots, n\}$, with a random score $X_p$ for each examinee $p$; likewise, the experiment for the new form has sample space $\mathcal{Y} = \{0, 1, \dots, n\}$, with random scores $Y_p$.
The necessary transformation is found by setting the two distribution functions equal to each other for each $p$. Thus, setting

$$F_{X_p}(x) = F_{Y_p}(y), \qquad (1)$$

and making $x$ explicit, we obtain it as

$$x = F_{X_p}^{-1}\big(F_{Y_p}(y)\big). \qquad (2)$$
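A minimal sketch of this inverse-cdf mapping in Equation 2, assuming the two conditional score distributions for an examinee are known; the toy cumulative probabilities below are invented for illustration.

```python
import numpy as np

def equate(y, cdf_x, cdf_y):
    """Equipercentile mapping x = F_X^{-1}(F_Y(y)) for discrete scores 0..n.

    cdf_x, cdf_y: arrays of cumulative probabilities at scores 0, 1, ..., n.
    Returns the smallest score x with F_X(x) >= F_Y(y), a common convention
    for inverting a discrete distribution function.
    """
    return int(np.searchsorted(cdf_x, cdf_y[y]))

# Toy score distributions for a single examinee (hypothetical values)
cdf_x = np.array([0.05, 0.20, 0.50, 0.80, 1.00])   # old form X, scores 0..4
cdf_y = np.array([0.10, 0.30, 0.60, 0.85, 1.00])   # new form Y, scores 0..4
mapped = [equate(y, cdf_x, cdf_y) for y in range(5)]
```

Note that the function maps a point on the scale of Y to a point on the scale of X; the distribution functions serve only as the device that defines the mapping.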
Actually, the dependence of the transformation on $p$ is unnecessarily restrictive. A standard assumption underlying all of test equating is a common ability measured by the number-correct scores on the two test forms. As the other characteristics defining the identity of the examinees can be ignored, Equation 2 can therefore be replaced with

$$x = F_{X\mid\theta}^{-1}\big(F_{Y\mid\theta}(y)\big), \qquad (3)$$

where $\theta$ denotes the common ability measured by the two forms and $F_{X\mid\theta}$ and $F_{Y\mid\theta}$ are the conditional distribution functions of the number-correct scores on the two forms given $\theta$.
The equating transformation just derived is thus a mathematical function from scale $y$ of the new test form to scale $x$ of the old form, exactly as claimed by Lord (1950). Distribution functions $F_{X\mid\theta}$ and $F_{Y\mid\theta}$ serve only as intermediaries in its derivation.
Alternative Experiment of Test Equating
As already noted, though Lord and the later equating literature differ in their view of the equating transformation, they agree on a notion of test scores different from our earlier statistical definition. The reason resides in a different probability experiment assumed to represent the problem of test equating. As the choice of experiment is the foundation upon which all of statistical modeling rests, the difference is not inconsequential.
To the author’s knowledge, the first explicit definition of the probability experiment currently present throughout the equating literature is the one in Lord (1982, p. 165), where it was adopted to derive an asymptotic standard error of equipercentile equating. The experiment is as follows: A test form X has been administered to a group of examinees randomly sampled from some specified population. A different form Y has just been administered to a separate random sample from the same population. The problem is to equate the scale of Y back to the scale of X. Observe that the experiment for form X is no longer hypothetical; an actual sample of examinees is now supposed to have taken the form. The two experiments, with additional specifications to allow for different data collection designs, have been the automatic point of departure in the equating literature ever since. (Lord already considered the case of a single random sample taking two different forms. Because of its similar assumption of random sampling of examinees, it is not further considered here.)
The alternative choice of experiment immediately leads to different random variables for the scores of the examinees. Generally, data obtained through a simple random sample of size $n$ from a specified population imply a model with random variables $X_1, X_2, \dots, X_n$ that are independent and identically distributed (iid), that is, with one common distribution for all examinees in the sample rather than a separate distribution for each of them.
As the two experiments differ in their choice of the source of randomness present in test equating, they can be labeled as experiments with random scores versus random examinees. The former clearly is the experiment of choice when the interest is in the test scores of individual examinees with their less than perfect reliability, for instance, as they are used to make an admission or selection decision for each of them. Experiments with random subjects typically occur in such areas as opinion polling, survey analysis, and marketing research, with their interest in such quantities as population means, quantiles, standard deviations, and so on. (Just for the sake of completeness, experiments with both random examinees and random test scores do exist, for instance, in studies of group-based assessment of educational progress. They involve actual sampling of students from well-defined educational populations with scores treated as random too. Their statistics involves the use of two-level models, with a separate level accounting for each source of randomness. However, group-based educational assessment is not the typical application addressed in the observed-score equating literature.)
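The two sources of randomness can be separated in a small two-level simulation, sketched below under an assumed beta population of success probabilities and a binomial model for the scores; all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_examinees, n_reps = 40, 2000, 50

# Level 1: random examinees -- draw success probabilities from a
# hypothetical population distribution
pi = rng.beta(6, 4, size=n_examinees)

# Level 2: random scores -- hypothetical repeated administrations
# of the same test to each sampled examinee
scores = rng.binomial(n_items, pi[:, None], size=(n_examinees, n_reps))

# Decompose the score variance into its two sources
within = scores.var(axis=1).mean()    # randomness of scores, given the examinee
between = scores.mean(axis=1).var()   # randomness due to sampling examinees
```

In this sketch, the between-examinee component dominates; an experiment with random examinees addresses that component, whereas the within-examinee component is the one that matters for the score of an individual examinee.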
Most likely, Lord’s (1982) choice of experiment just followed the statistical tradition of his day. All introductory texts used in Statistics 101 in the educational and behavioral sciences programs at the time, explicitly or implicitly, motivated the use of statistics assuming the case of random sampling with its iid variables (in fact, most of them still do!). Nevertheless, his choice still is a bit of a puzzle. As demonstrated by the careful presentation of the probability experiment of test taking in the introduction to classical test theory in Lord and Novick (1968, section 2.2), he must have been aware of the differences between the two alternative experiments.
Practical Consequences
The choice of a probability experiment of test taking, with its subsequent implications for the treatment of test scores as random variables and the specification of their distribution, is not without practical consequences. Two of these consequences, the choice of scale transformation and the standard error of the equated scores, are briefly discussed.
The scale transformation in Equations 2 and 3 was derived for the equating experiment with non-iid random variables for the two groups of examinees. For the alternative experiment, setting its two distribution functions equal,

$$F_X(x) = F_Y(y), \qquad (4)$$

and making $x$ explicit, it follows as

$$x = F_X^{-1}\big(F_Y(y)\big). \qquad (5)$$
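Under the alternative experiment, the common transformation in Equation 5 can be estimated directly from the two samples by plugging in empirical distribution functions. A minimal sketch, assuming NumPy, integer number-correct scores, and ignoring the presmoothing and continuization steps used in practice:

```python
import numpy as np

def empirical_cdf(scores, n_items):
    """Empirical distribution function on the number-correct scale 0..n_items."""
    counts = np.bincount(scores, minlength=n_items + 1)
    return np.cumsum(counts) / len(scores)

def equate_marginal(y, scores_x, scores_y, n_items):
    """Common transformation of Equation 5, x = F_X^{-1}(F_Y(y)), with the
    marginal distribution functions replaced by their empirical estimates
    from the two samples of examinees."""
    F_x = empirical_cdf(scores_x, n_items)
    F_y = empirical_cdf(scores_y, n_items)
    return int(np.searchsorted(F_x, F_y[y]))
```

Because the same empirical distributions are used for everyone, the resulting mapping is identical for all examinees in the sample.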
The result is one common transformation for all examinees. The transformation is appropriate when the interest is in an “average equating” for examinees sampled from a “synthetic population.” It is less so when we are concerned with the equated scores for each of the examinees that actually took form Y.
It may look difficult to estimate the earlier transformations with the conditional score distributions in Equation 3, but it is not. The available options include item-response theory (IRT) observed-score equating, the use of the conditional observed-score distributions on X and Y given the score on anchor items in the two forms, conditioning on other proxies for the common ability measured by the two forms, single-group equating, and test forms assembled with preequated number-correct scales. For a review of these examples, performed under the name of local equating, see González and Wiberg (2017, chap. 6) and van der Linden (2011, 2018). One advantage shared by all these examples is that there is no need to reconcile the differences in ability distribution between the two groups of examinees that took forms X and Y, a problem inherent in the experiment with random examinees and generally resolved by performing the equating in Equation 5 with arbitrary sampling weights for an assumed synthetic population. Also, for each of these examples, the assumption of a common ability measured by the two forms suffices. Specifically, it is not necessary to have forms of equal length or equal reliability, or to be concerned about the requirement of equity (van der Linden, 2019).
It is required statistical practice to report estimated quantities along with measures of their accuracy. Lord’s (1982) asymptotic standard error of equating serves the goal for an equating based on the experiment with random examinees. However, if the interest is in equated scores for the individual examinees that took form Y, the experiment with random scores applies, and a different standard error is required for examinees with different levels of ability to account for the differences in random error between their equated scores. The choice between the two is completely analogous to the one between a constant error of measurement for all examinees and conditional errors given their level of ability, discussed extensively in the history of testing. The required standard errors of equating for the experiment with random scores could be calculated similarly to Lord (1982), replacing his marginal distributions of the scores X and Y on the two forms by their conditional distributions given the common ability underlying the forms.
Conclusion
It thus appears to be a statistical fact that Lord’s answer to the question of what actually is equated in “test equating” is correct. Just as claimed in his quote, any proper equating transformation is directly from the scale of one form of a test to the scale of another. As for the role of observed-score distributions in the equating, the situation is different, though. The distributions in the earlier quote from Lord (1950) were the empirical frequency distributions of realized scores collected for the two test forms in the equating study. In his later derivation of the asymptotic standard error of equipercentile equating (Lord, 1982), they are random variables with identical distributions for each of the examinees that took the two forms. The same assumption is made throughout the current equating literature.
Both the notions of scale and observed-score distribution deserve to be reconnected with the basic statistical concepts of probability experiment, sample space, and random variable introduced in every modern introduction to statistics. Possible reasons for the current lack of connection might be the tradition in the educational and behavioral sciences of introductory texts motivating the use of statistics only by considering the case of random sampling with iid variables as well as the possible confusion created by the connotation of the terms “observed score” and “observed-score distribution” discussed earlier in this note.
