Abstract

How should researchers and practitioners calculate confidence intervals for individual scores in psychological testing? There are two different intervals recommended in the literature: (a) the standard error of measurement (SEM) interval, which is centered around the individual’s observed score and found in most textbooks on the subject, and (b) the standard error of estimation (SEE) interval, which is centered around a regression-based estimated true score for the individual, less common in practice, but recommended in the psychometric literature (e.g., Dudek, 1979; Lord & Novick, 1968; Nunnally & Bernstein, 1994).
In their article, Stanley and Spence (2024) argued that neither interval is inherently right or wrong but that one of the two intervals is the correct one depending on what you want to know (see “Which Interval Should I Use?”). More specifically, they claimed that if one is interested in an interval estimate for a single test-taker, the SEM interval is correct; in contrast, for measurements of groups of individuals (e.g., in selection decisions), the SEE interval is correct. Accordingly, the authors created new labels to reflect the different goals: “SEM Single-Test-Taker Interval” and “SEE Many-Test-Takers Interval.”
Unfortunately, this “Single-Test-Taker” versus “Many-Test-Takers” distinction obscures the precise nature of the central difference between the two types of intervals and, more consequentially, results in a recommendation at odds with the psychometric literature, in which the SEE interval is favored for determining the confidence intervals for individual scores (e.g., Dudek, 1979; Lord & Novick, 1968; Nunnally & Bernstein, 1994). The central difference between the two types of intervals is not the number of test-takers but the type of distributional information that is used for the interval estimation. Although both the SEM and the SEE intervals take into account the population standard deviation and the reliability estimate—which are both necessary to calculate the standard error—the SEE interval additionally makes use of the population mean.
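To make this difference concrete, the two intervals can be written down directly using the standard classical-test-theory formulas (SEM = SD × √(1 − r); SEE = SD × √(r(1 − r)); estimated true score = M + r(X − M); see Lord & Novick, 1968). The following sketch in Python uses our own variable names, not notation from either article:

```python
import math

def sem_interval(x, sd, rel, z=1.0):
    """SEM interval: centered on the observed score x.
    SEM = SD * sqrt(1 - r); uses only SD and reliability."""
    sem = sd * math.sqrt(1 - rel)
    return (x - z * sem, x + z * sem)

def see_interval(x, mean, sd, rel, z=1.0):
    """SEE interval: centered on the regressed true-score estimate.
    Estimated true score = M + r * (X - M); SEE = SD * sqrt(r * (1 - r)).
    Additionally requires the population mean."""
    t_hat = mean + rel * (x - mean)
    see = sd * math.sqrt(rel * (1 - rel))
    return (t_hat - z * see, t_hat + z * see)
```

Note that both functions need the standard deviation and the reliability; only `see_interval` also takes the population mean, which is exactly the difference in distributional information described above.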
A simple way to illustrate why including information on the distribution mean is important is to imagine a test whose reliability is zero. If a test-taker named Bob has an observed IQ of 120 on such a test, what is the best estimate of his intelligence? Because an IQ test with zero reliability provides no information at all, the best estimate of Bob’s intelligence is not his observed score but the average intelligence in the population (IQ = 100). If we used the SEM interval, the 68% confidence interval around the observed score, 120 ± 1 × SEM = [105, 135], would not even include our best estimate of IQ = 100.
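This zero-reliability case can be checked numerically. With r = 0, the SEM equals the full population standard deviation, so the SEM interval excludes the best estimate, whereas the regressed estimate collapses to the population mean (a minimal sketch in our own notation):

```python
import math

sd, mean, x, rel = 15.0, 100.0, 120.0, 0.0  # Bob's test with zero reliability

sem = sd * math.sqrt(1 - rel)      # = 15: error spans the full population spread
lo, hi = x - sem, x + sem          # 68% SEM interval around the observed score
print((lo, hi))                    # (105.0, 135.0): does not contain 100

t_hat = mean + rel * (x - mean)    # best estimate regresses fully to the mean
print(t_hat)                       # 100.0
```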
Of course, nobody should use a test with a reliability of zero. What happens in a more realistic scenario for a test with a reliability above 0 but below 1? Imagine again that Bob has an observed score of IQ = 120 (Fig. 1a). Because of the measurement error, Bob’s actual true score will deviate from the observed score. But this deviation is not symmetric: Because of the known distribution in the population (IQ: M = 100, SD = 15), a true score of 110 is more likely than a true score of 130. There are simply more people in the population who have a true score of 110 (and who happen to have an observed score that is 10 points higher because of the measurement error) than people who have a true score of 130 (and who happen to have a score that is 10 points lower because of the measurement error). For this reason, the best estimate of the true score is a value regressed toward the mean of the population (Fig. 1b), and therefore a symmetric SEM interval around the observed score (Fig. 1c) is not the best interval we can provide for Bob given all that we know. This also holds from Bob’s perspective: Although his true score may be fixed, he does not know it and must infer it from the available information.
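The regression toward the mean described above follows directly from the estimated-true-score formula, T̂ = M + r(X − M). As a sketch, assuming a reliability of .80 (our choice for illustration; neither article fixes this value):

```python
def estimated_true_score(x, mean, rel):
    """Regression-based estimate of the true score: T_hat = M + r * (X - M)."""
    return mean + rel * (x - mean)

# Bob: observed IQ = 120, population M = 100, assumed reliability r = .80
print(estimated_true_score(120, 100, 0.80))  # 116.0, regressed toward the mean
```

The lower the reliability, the more strongly the estimate is pulled toward the population mean; at r = 1 it equals the observed score, and at r = 0 it equals the mean.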

Fig. 1. True score estimation based on a population distribution and (a) an observed score when (b) taking into account the population mean versus (c) ignoring it.
To account for the available information on the distribution mean of test scores, we calculate the SEE interval around the regression-based estimated true score, which is generally preferred in the psychometric literature (e.g., Dudek, 1979; Lord & Novick, 1968; Nunnally & Bernstein, 1994). Note how making use of existing information is very much compatible with a Bayesian approach; we can indeed derive the very same interval from a Bayesian framework (in which the posterior mean equals the estimated true score and the posterior standard deviation equals the SEE; see Levy & Mislevy, 2016, pp. 155–159).
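The Bayesian equivalence can be verified numerically: With a normal prior on true scores, N(M, r·SD²), and a normal error distribution with variance (1 − r)·SD², standard normal-normal updating yields a posterior mean equal to the estimated true score and a posterior standard deviation equal to the SEE. A sketch with the same illustrative values as before (M = 100, SD = 15, r = .80, X = 120; the reliability is our assumption):

```python
import math

mean, sd, rel, x = 100.0, 15.0, 0.80, 120.0

# Classical quantities
t_hat = mean + rel * (x - mean)          # estimated true score
see = sd * math.sqrt(rel * (1 - rel))    # standard error of estimation

# Bayesian quantities: prior on true scores N(M, r*SD^2);
# observed score = true score + error, with error variance (1 - r)*SD^2
prior_var = rel * sd ** 2
error_var = (1 - rel) * sd ** 2
post_var = 1 / (1 / prior_var + 1 / error_var)   # precision-weighted update
post_mean = post_var * (mean / prior_var + x / error_var)

print(abs(post_mean - t_hat) < 1e-9)             # True
print(abs(math.sqrt(post_var) - see) < 1e-9)     # True
```

Algebraically, 1/(r·SD²) + 1/((1 − r)·SD²) = 1/(r(1 − r)·SD²), so the posterior variance is exactly SEE², and the posterior mean reduces to M + r(X − M).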
One might object that the SEM interval may be the safer choice when one does not want to make the assumption that the individual test-taker is part of the population about which information is available. However, even though the SEM interval does not make use of the mean of the test scores, it does make use of the estimated population standard deviation and reliability that are required to calculate it—so we still implicitly assume that the test-taker is part of the population. Furthermore, one could question the assumption of a normal distribution of scores on which the calculation of both types of confidence intervals is based. However, this assumption is justified because the normal distribution has maximum entropy, that is, makes the fewest assumptions, among all continuous probability distributions with a given mean and standard deviation—thus, unless there is additional information about the distribution, it is the best choice.
There is a more important potential problem with the SEE interval because it has a puzzling property: When reliability decreases below .50, the interval becomes narrower rather than wider, contrary to what one would expect given that a less precise test should increase uncertainty.
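This nonmonotonic behavior follows from the SEE formula itself: SD × √(r(1 − r)) is maximized at r = .50 and shrinks toward zero at both extremes, because at very low reliability the estimate collapses onto the well-known population mean. A minimal sketch (our own notation):

```python
import math

def see(sd, rel):
    """Standard error of estimation: SD * sqrt(r * (1 - r))."""
    return sd * math.sqrt(rel * (1 - rel))

for rel in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(rel, round(see(15.0, rel), 2))
# The SEE peaks at r = .5 (7.5) and falls off symmetrically:
# 0.0 -> 0.0, 0.1 -> 4.5, 0.3 -> 6.87, 0.5 -> 7.5,
# 0.7 -> 6.87, 0.9 -> 4.5, 1.0 -> 0.0
```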
In short, the difference between the SEE interval and the SEM interval is not about whether one wants to make statements about individuals or statements about groups, as suggested by Stanley and Spence (2024). It is about whether one wants to take into account a priori known information about the mean of test results in the population. Taking into account such information is generally recommended in the psychometric literature (e.g., Dudek, 1979; Lord & Novick, 1968; Nunnally & Bernstein, 1994) and also seems prudent from an applied perspective. In practice, researchers usually have to work with limited information and want to make the most out of it.
