Abstract
Psychometrics, the science sustaining psychological test construction and use, has consistently ignored logical criticism of its foundations, even though science as a cognitive enterprise requires criticism. This accumulating corpus of criticisms is generally unacknowledged in histories of the discipline, in textbooks, and in course curricula. Why this critical resource goes unused is thus a mystery. Consideration of three critical offerings from before 1950 makes clear that by mid-century, psychometric methods, consistently marketed as instruments of scientific measurement, were bereft of evidence supporting that boast. A solution to this mystery is proposed.
Introduction
Those practicing psychometrics (including psychologists, educationalists, sociologists, and others) have failed to support claims to measure psychological attributes (such as intellectual abilities, personality traits, social attitudes) by neglecting the following crucial step: they have failed to investigate whether such attributes actually possess quantitative structure (e.g., Michell 1990, 2025a). Commonly used quantitative psychometric theories (factor analytic theories, item response theories, etc.) characterise psychological attributes as measurable quantities. However, since there is no logical necessity that attributes must be quantitative, the issue of whether any given one is quantitative is an empirical matter and, so, claims to measure psychological attributes (in the sense of measure required by quantitative theories) are empty when devoid of empirical evidence supporting quantitative structure. While some might conclude that highlighting this fundamental failure represents an anti-psychometric stance, the opposite is the case: it is pro-psychometrics. It aims to strengthen psychometrics’ culture of criticism (criticism being a necessary engine driving science forward) and to ensure that psychometrics conforms to the traditions of quantitative science. Within these traditions, as present in, say, physical science, claims to measure are underwritten by evidence of quantitative structure in the relevant attributes. Those who claim to measure psychological attributes while ignoring this issue of evidence leave psychometrics exposed to potentially debilitating criticism. Therefore, the aim of highlighting this failure is to advance psychometrics as a science. To this end, not only have criticisms been made but avenues via which relevant evidence may be obtained have been explored and relevant empirical investigations undertaken (e.g., Kyngdon 2006; Kyngdon and Richards 2007; Michell 1994).
Significantly, since the inception of modern psychometrics, the issue of whether psychological attributes are quantitative has been raised by an extensive array of critically minded psychologists, philosophers, and educationalists. My aim in this paper is: first, to consider three critical contributions formulated during the first half of the twentieth century highlighting psychometrics’ evidential deficit, which show that by mid-century this deficiency was clearly evident to any who cared to look; and, second, to propose an explanation for the failure to meet the empirical challenges required by this deficit if psychometrics is to realise its long-standing ambition to become a “quantitative rational science” (Heiser and Hubert 2016, 1175).
The Chorus of Published Psychometric Criticism
Psychologists are unaware that an undercurrent of criticism has always attended psychometrics. To call it a chorus may seem misleading because critics rarely sing simultaneously, although often singing the same song. They are a diachronic, not a synchronic, chorus. Those I have located include the following: (Adams 1931; Berka 1983; Blinkhorn 1997; Boring 1920; Borsboom 2005; Briggs 2022; Brown 1934; Essex and Smythe 1999; Fillmore-Patrick 2025; Garrison 2009; Goldstein and Wood 1989; Gould 1981; Grice et al. 2012; Heine and Heene 2025; Johnson 1936; Kyngdon 2011; Lumsden 1976; Maraun 1998; McCormack 1922; McGrane and Maul 2020; Nash 1990; Reese 1943; Schönemann 1994; Smith 1938; Sutcliffe 1986; Thomson 1916; Trendler 2009; Vautier et al. 2012; Wilson 1928).
These critiques have had little effect and are repressed in histories of the discipline (e.g., Jones and Thissen 2007). Even Derek Briggs (2022), whose history highlights controversies, discusses few of the above. For all the difference they made, most might never have been penned. Logical criticisms are unsung in authoritative mainstream textbooks (such as Lord and Novick 1968 or McDonald 1999) and have minimal institutional recognition in course curricula. While largely a corpus incognito, they are in reality an unacknowledged treasure trove. Given their pertinence, their neglect by a cognitive enterprise presents a mystery. In the next sections, I consider three very different critiques to illustrate their range and significance. Then a solution to the mystery is proposed.
The critiques considered here are as follows: first, Thomson’s critique of Spearman’s two factor theory of intellectual ability, which showed that a non-quantitative alternative hypothesis accounted for relevant data as successfully as Spearman’s quantitative theory; second, Boring’s critique of the psychometric practice of using the normal distribution to infer supposedly quantitative measures from test score data, which showed that this practice begs the question of whether psychological attributes are quantitative; and third, Smith’s criticism that measurement in the desired sense requires evidence of quantitative structure. These critiques show that by the middle of the twentieth century, psychometrics’ evidential deficit was clear, implying that claims to measure psychological attributes are unsubstantiated. The fact that those engaged in psychometrics have failed to investigate the issue of quantitative structure indicates a significant scientific failure, and such a lapse in standard scientific processes requires explanation.
Sir Godfrey Thomson’s Non-Quantitative Alternative
Thomson challenged Charles Spearman’s two-factor theory of abilities, a still influential theory. What Thomson specifically challenged was a premise of this theory, one that became an axiom of the psychometric paradigm. Spearman’s theory emerged from factor analysis (Michell 2023a; Spearman 1904) and he thought that by factor analysing correlation coefficients between cognitive tests, measures of general ability or g could be obtained. This overestimated factor analysis’s logical reach. Based upon numerical data (generally, covariance indices between total test scores), factor analysis necessarily yields numbers, but concluding that these measure mental abilities is invalid.
Long before Spearman, Galton (1869) dreamt of measuring “natural ability,” ignoring the fact that mental abilities are never experienced as quantitative attributes (Michell 2022). He thought psychology’s destiny lay in emulating physics’ quantitative trajectory, enshrining this conviction in the term “psychometry” (Galton 1879, 149), coined for his number-generating methods. Spearman thought factor analysis realised Galton’s dream.
Thomson, on the other hand, saw its limitations and queried Spearman’s conclusions. This critical stance reflected Thomson’s background: he had received his PhD under Ferdinand Braun, “the great wireless telegraphy expert” (Thomson 1952, 280), at Strasburg in 1906, also attending lectures on Einstein’s theory of relativity, as good an introduction as any to the role of criticism in science. Returning to Britain to teach psychology, he encountered Spearman’s work. Unlike Spearman, who considered his theory a “Copernican Revolution” (Spearman 1927a, 325), Thomson proposed ingenious counterexamples demonstrating, “Professor Spearman has drawn over-hasty conclusions” (Thomson 1921, vi).
Without detailing these counterexamples (see Bartholomew et al. 2009; Briggs 2022), Thomson’s proposals culminated in his “sampling theory of ability” (Thomson 1919). “Let us suppose,” he speculated, the mind, in carrying out any activity such as a mental test, has two levels at which it can operate. The elements of activity at the lower level are entirely specific; but those at the higher level are such that they may come into play in more than one kind of activity, in more than one mental test. These elements are assumed to be additive like dice, and each to act on the “all or none” principle, not being in fact further divisible. (Thomson 1919, 341)
The salient point is his invocation of the “all or none” principle. According to him, our brains contain elements (“bonds” [Thomson 1935, 89]), which are switched “on” or “off,” each subserving, in the case of higher-level bonds, many different kinds of cognitive activities or, in the case of lower-level ones, only specific kinds of tasks. Attempting an ability test item recruits a random sample of elements of both levels, producing a response (either correct or incorrect depending upon the elements activated) and thus contributes to a person’s total test score. Thomson’s alternative accounted for relevant data as well as Spearman’s theory, but crucially, without postulating quantitative attributes (such as g), relying only upon discrete, all-or-nothing processes. If true, the processes underlying intellectual performance would not be quantitative attributes and Spearman’s presupposition of underlying quantitatively structured abilities would be false.
That a non-quantitative theory explains the pattern of correlation coefficients as well as a quantitative one is unsurprising. Covariance indices, upon which factor analyses are based, are calculated from total test scores, which in turn derive from the ordered series of correct or incorrect answers given to test items. On each test, a person’s ordered series of correct or incorrect responses (that person’s response pattern) is the original data, total scores being derivatives thereof. Because the classification of responses as “correct” or “incorrect” is qualitative, the original data underlying factor analysis are qualitative, not quantitative. Explaining qualitative data does not necessitate postulating quantitative causes. That is, specifying a person’s cognitive resources (e.g., their knowledge states, skills, and strategies) is sufficient to explain response patterns. Behind these qualitative causes lie deeper causes, but there is no reason to believe that these are quantitative either: explaining qualitative cognitive resources no more requires postulating quantitative causes than does explaining qualitative response patterns. While logically possible, Spearman’s quantitative theory violates Occam’s razor, adding unnecessary complexity (i.e., the complexity of continuous quantitative attributes versus the simplicity of finite discrete qualitative attributes).
Spearman resisted seeing this, leading Thomson to lament, “the nature of my disagreement has almost always been misunderstood” (Thomson 1946, 1). Spearman’s blindness derived from his misinterpretation of “factor.” Prior to factor analysis, in scientific parlance, “factor” meant “any one of a plurality of causes or conditions which together determine a thing or event” (Baldwin 1901, 368). Spearman, in line with this but also presuming that the numerical outputs of factor analysis measure causal factors, used “factor” to describe both those outputs and underlying causal factors (i.e., intellectual abilities), believing they are the same.
Consequently, when Thomson admitted that factor analysing performances on cognitive tests entails more or less g, with the qualification that “g is interpreted as a mathematical entity only” (i.e., as a numerical output) “and judgment is suspended as to whether it is anything more than that” (Thomson 1939, 240), Spearman misconstrued it as an “endorsement of g” (the causal factor) (1946, 121), which it was not. Spearman’s identification of g (the product of factor analysis) with g (the measure of the causal factor), in the false belief that “like all measurements anywhere, [it] is primarily not any concrete thing but only a value or magnitude” (Spearman 1927b, 75), blinded him to Thomson’s distinction. Thomson knew these two concepts are logically distinct and that, without supplementary evidence, equating them is unwarranted. Possessing a viable alternative theory, he accepted the reality of numerical g, the artefact of factor analysis, but was agnostic about g, the measure of Spearman’s hypothesised quantitative causal factor. Spearman, on the other hand, saw g as like a card viewed from two sides: one side portraying a measure obtained via factor analysis; the other, a causally efficacious attribute, “noegenesis” (1931, 408). Thus, convinced he had measured g, he mistook Thomson’s acceptance of mathematical g as endorsement of his theory.
Because factor analysis was seen as a methodological godsend, Spearman’s blind-spot influenced psychometrics and Thomson’s perspicacious critique was given short shrift. For example, Guilford’s Psychometric Methods (1936 [1954]), a leading textbook of the next generation, curtly complained that there was “little likelihood of demonstrating experimentally the existence of the elements hypothesized” (476), without noting that exactly the same difficulty plagued factor-analytic theories and missing the point that Thomson’s alternative was proposed primarily to demonstrate the logical possibility of non-quantitative alternatives. From then on, Thomson’s sampling theory had negligible impact upon mainstream psychometrics.
Had psychometrics been a normal science, the validity of Thomson’s critique would have been accepted and the range of theories considered widened to include qualitative alternatives. Thomson showed that the presupposition that the psychological causes underlying intellectual performance (i.e., abilities) are quantitative attributes was an over-hasty conclusion and that tests of quantitative theories (like Spearman’s) against non-quantitative alternatives were required. But psychometrics was not to be deflected from its quantitative trajectory, a legacy Galton bequeathed, and Thomson, “Once ranked with the top names in intelligence … alone goes almost unmentioned” (Deary et al. 2010, 96). The heroic effort by Bartholomew et al. (2009) to give Thomson’s alternative “a new lease of life” (567) produced little recognition that psychometrics’ quantitative trajectory is not necessarily nature’s path.
Edwin Garrigues Boring and the Normal Law of Error
E. G. Boring, remembered now as historian, was an experimental psychologist, who oversaw intelligence testing of recruits in World War I. Familiar with the quantity objection to psychophysics (i.e., psychophysical measurement is impossible because sensory intensities are not quantitative [Titchener 1905]) and having a degree in electrical engineering, he queried psychometrics’ claims to measure intelligence.
While endorsing the quantitative imperative (“We hardly recognize a subject as scientific if measurement is not one of its tools” [Boring 1929, 286]), he thought that without experimental evidence psychometrics could not progress beyond merely collecting frequencies (i.e., total test scores). In a paper containing his slogan, “intelligence is what the tests test” (Boring 1923, 35) (said to anticipate Bridgman’s [1927] operationism [e.g., Rogers 1992; Mills 1992] and still misunderstood [e.g., van der Maas et al. 2014]), he sought to redirect psychometrics, not towards operationism but towards bridging the gap between test scores and the theoretical concept of intelligence.
Operationists, confusing what is measured with how it is measured (Michell 1990), identify the meaning of concepts with the operations used to measure them, which would imply that the operation of testing defines intelligence. That was not Boring’s view and his slogan is no more operationist than “blood pressure is what sphygmomanometers test.” He claimed that “intelligence as a measurable capacity must at the start be defined as the capacity to do well in an intelligence test” (Boring 1923, 35), construing intelligence as a yet-to-be-discovered capacity causing intellectual performance. Even as late as 1933, Boring still viewed Spearman’s two-factor theory sympathetically, thinking of intelligence as a complex property, the character of which remained to be discovered. Indeed, this is what his 1923 paper actually treated: it concerned, not operationism, but what would now be called intelligence test validation without using that term (“validity” having only just entered psychometrics’ official lexicon [Courtis et al. 1921; Michell 2009]). His intention was not to define intelligence operationally, but to indicate where intelligence might be found amongst the hidden causes of test performance.
The same spirit pervaded his 1920 paper on the normal law of error, which is my main focus. He noted, “Galton in the Hereditary Genius applied the normal law to mental differences and, using it a priori, worked from frequencies of natural ability to a scale of equal intervals of ability” (Boring 1920, 11–12). What was Galton’s reason for this unwarranted a priori manoeuvre? Referring to Quetelet’s observed distributions of certain physical features of “Frenchmen” and “Scotchmen,” which approximated the normal curve, Galton wrote, Now, if this be the case with stature, then it will be true as regards every other physical feature—as circumference of the head, size of brain, weight of grey matter, number of brain fibres, &c.; and thence, by a step on which no physiologist will hesitate, as regards mental capacity. (Galton 1869, 31–32)
The claim that no physiologist would hesitate to infer the form of the distribution of mental capacity from observed distributions of physical features was hyperbole, like his claim that the normal law of error “would have been personified by the Greeks and deified, if they had known of it” (Galton 1889, 66). Galton presumed (1) that mental capacity is an attribute possessing a mathematically continuous distributional form (thereby presupposing that it is quantitative) and (2) that the normal curve is a Platonic ideal holding sway above the flux, and concluded that mental capacity must be normally distributed. Both premises lacked evidence, so his conclusion was unwarranted.
However, Galton was idolised within psychometrics and, to this day, the normal curve is treasured as a philosophers’ stone supposedly able to magically transform qualitative observations into quantitative measures. Yet, as Boring noted, it yields only “a precision of result that is an artefact” (1920, 33), there being “no alchemy of probabilities that will change ignorance into knowledge” (1920, 3).
Boring reasoned this way: on the one hand, there are those (i.e., mainstream psychologists) who, first, transform the observed distribution of test scores to a normal distribution and, second, from that normalised distribution claim to determine the unit of measurement for the relevant attribute; and on the other hand, there are those (i.e., critics of the mainstream) who would, first, establish a unit of measurement (presumably by discovering experimentally the quantitative structure of the relevant attribute, if indeed it is quantitative) and, second, using that unit of measurement, determine the form of the distribution. He recognised that only the latter path gets “the necessary scientific order” (Boring 1920, 30) right because knowledge (of the distributional form) can only be wrought out of knowledge (of the unit of measurement). The former route puts the cart before the horse because “It is wrongly supposed that knowledge could somehow be wrought out of ignorance” (Boring 1920, 33).
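Boring’s point can be made concrete with a small sketch (the scores are hypothetical and `normalize` is an illustrative function, not any standard routine): normalising by rank through the inverse normal distribution function returns identical “equal-interval” values for any order-preserving transformation of the raw scores, so the resulting metric comes from the assumed curve, not from the data.

```python
from statistics import NormalDist

def normalize(scores):
    """Map scores to z-values by rank: mid-rank percentile fed through
    the inverse normal CDF, as in classical test-score normalisation."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    z = [0.0] * n
    for rank, i in enumerate(order):
        z[i] = NormalDist().inv_cdf((rank + 0.5) / n)
    return z

raw = [3, 7, 8, 12, 20, 21, 40]        # hypothetical raw test scores
squashed = [s ** 0.5 for s in raw]     # non-linear transforms that
stretched = [s ** 3 for s in raw]      # preserve only the ordering

# All three yield exactly the same "equal-interval" measures.
print(normalize(raw) == normalize(squashed) == normalize(stretched))  # True
```

The procedure is sensitive only to rank order; whatever interval structure the output appears to have is imposed by the assumed normal curve, which is precisely Boring’s charge.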
Boring’s critique prompted a retort from Truman Lee Kelley (“Thorndike’s pupil and Stanford’s copy of Karl Pearson, perhaps now America’s leading psychologist-statistician,” as Boring [1929, 528] described him). Kelley’s “early training was in mathematics” (Flanagan 1961, 343) and he also endorsed the quantitative imperative, writing, “in the field of psychology, if a designation of some trait or capacity of mental life, is to be given serious consideration, it must be such as to reveal itself as a measurable difference in conduct” (Kelley 1928, 3). While for Boring, psychology’s scientific reputation required exposing psychometrics’ flawed foundation, for Kelley, psychometrics’ scientific reputation required countering Boring’s critique. Attempting to strike Boring’s jugular, Kelley dismissed the claim that psychometrics lacks fixed units of measurement, observing that there is an arbitrariness about choosing units, not just that involving the familiar linear transformation between, say, centimetres and inches, but something more radical, involving non-linear transformations: one could have a science of physical phenomena in which the units were such that the scale of time intervals was the square of the present intervals measured in seconds, and in which the length scale was logarithmic as compared with the present scale in centimeters, etc. …. (Kelley 1923, 418)
From this he concluded, “choice of the unit is purely a question of utility” (418), which would imply unrestricted choice. Whatever the truth of Kelley’s claim regarding non-linear transformations of physical measures (an issue dealt with in detail in Krantz et al. 1971; Michell 1993), his argument fails because it begs the question, presuming that test scores are measurements of quantitative attributes. Test scores are frequencies (numbers of correct responses), not necessarily measures in the desired sense; when transformed non-linearly, they are neither frequencies nor, as far as is known, measures. Importantly, as already noted, frequencies do not require quantitative attributes for their explanation; hence, it is unwarranted to treat them as measures of anything.
Nonetheless, Kelley’s question-begging response was accepted at face value in psychometrics. For example, Stevens, who, incidentally had attended lectures by Kelley as a student (Stevens 1974), and, who, from mid-century onwards, was psychology’s measurement theory guru (Michell 2002), declared, “the assumption of normality has the advocacy of a certain pragmatic usefulness in the measurement of many human traits” (Stevens 1951, 28); and a leading psychometric spokesman pronounced, A good argument can be made that there are no “real” or “correct” intervals for any measurement scale, but rather that the intervals are established as a matter of convention. … The issue is one of which calibration of intervals will prove most useful in the long run. (Nunnally 1970, 21)
The delusion that measurement is attained by assuming desired distributional forms thus became the accepted route to psychometrics’ preferred quantitative destination.
The role of the “normal law of error” (and approximations thereto) as devices for conjuring metrics out of thin air continues unabated in psychometrics, not through normalising test score distributions (the dominant practice in Boring’s day) but by incorporating error distributions into the fabric of probabilistic item response models (see Michell 2004, 2014; Sutcliffe 1986) in such a way that what is most wanted (measurement) is wrought from what is least known (the hypothesised form of the error distribution) by mere postulation. As a past president of the Psychometric Society acknowledged regarding “measurement” so fabricated, “its metric—not only the origin and unit of measurement, but its entire calibration—is not given by data and generally must be imposed by the model” (McDonald 2013, 123). Boring was right: psychometric measurements are illusory and he concluded, “We are left then with the rank-orders of our psychological quantities … and it is with these rank-orders that we must deal” (1920, 33), a judgment too frugal for most psychologists, but in fact too generous given the actual structure of test data.
Bunnie Othanel Smith and the Issue of Quantitative Structure
Smith (1938) appraised psychometrics from the perspective of N. R. Campbell’s (1920, 1928) measurement theory. Evaluations from this perspective already existed (Johnson 1936; McGregor 1935), but Smith’s was more thorough. Campbell, a physicist, had defended a version of the representational theory of measurement, according to which measurement is defined as the assignment of numerals to represent attributes other than number, in virtue of laws governing the structure of these attributes, and he distinguished so-called fundamental from derived measurement. For fundamental measurement to be possible, these laws must show that attributes are ordered and involve a physical operation analogous to numerical addition, the paradigm case being length, in which the operation of physical addition involves conjoining rigid straight rods end to end linearly. Representing attributes numerically enables relationships between them to be expressed as numerical laws, which in turn sometimes enables systems of numerical constants to be identified and thus allows derived measurement, the paradigm case being measurement of the density of substances, identified as constant ratios of mass to volume. While Campbell believed that his theory of measurement accommodated physical measurement, it seemed unlikely that such a conception fits so-called “measurement” in psychometrics. Thus, it was important to raise doubts about psychologists’ claims.
Campbell’s approach to discovering whether measurement is possible in areas of scientific investigation contrasted with the psychometric approach, which derived from Galton’s position on how to subject “the qualities of life and mental processes to mathematical treatment” (Smith 1938, 33). According to Galton, psychometrics is “the art of imposing measurement and number upon operations of the mind” (Galton 1879, 149; emphasis added). The crucial contrast here is between discovering something in nature and imposing something upon nature. Smith, following Campbell, recognised that the possibility of measuring an attribute hinges upon discovering that it possesses quantitative structure. This contrasts with the psychometric approach according to which measurement is a matter of imposing number-generating operations upon the relevant attribute, no matter its structure.
Smith’s book appeared in 1938, on the eve of World War II, and coincided with publication of the Ferguson committee reports (Ferguson et al. 1938, 1940) into psychophysical measurement, a committee dominated by Campbell. These raised doubts regarding attempts at psychophysical measurement. Stevens, whose research area was psychophysics, mulled over them (see his previously unpublished 1939 paper [Stevens 2006; Marks 2006]) and his emphatic repudiation of Campbell (Stevens 1946) came immediately after the War (see Michell 1999). Stevens’s attempted redefinition of measurement was quickly accepted by psychologists and had the effect of quashing logical criticisms of psychometrics for three decades. Smith’s defence of Campbell stood little chance of an audience, let alone a fair hearing, alongside the comforts Stevens’s message seemed to promise, and it garnered only one review in a psychology journal (Cureton 1939), which failed to grasp the import of his critical message.
Smith (1938) saw that measurement begins with “a search for a special kind of structure” (57) in the attributes one aspires to measure and since the “only way in which a particular structure of any character of nature can be ascertained is by careful observation and experimentation” (61), measurement can only be achieved by making the required observations and performing the necessary experiments. These must test whether the relevant attribute is ordered and possesses additive structure.
There are limitations to Smith’s presentation. He failed to acknowledge that even in physics observational and experimental data rarely meet conditions for order and additive structure exactly, and that therefore one is looking for signs of quantitative structure in messy data (Michell 2007). And while he recognised that evidence for additive structure may involve identifying very different kinds of empirical operations in different attributes, his treatment of indirect forms of evidence for additive structure, such as the constancies in ratios of mass to volume (supporting the hypothesis that density is quantitative) and the regularities in the way temperature varies with volume and pressure (supporting the hypothesis that temperature is quantitative), is sketchy.
He was unaware of the back door to evidence of additive structure provided by the theory of conjoint measurement (only introduced to psychology two and a half decades later by Luce and Tukey 1964), although the basic ideas were present (see the historical note in Krantz et al. [1971, 259]). While he referenced Nagel (1930), had he noted Nagel’s references, he might have benefited from Hölder’s (1901) specification of axioms for difference measurement, which anticipate conjoint measurement (Michell and Ernst 1997). Nonetheless, these limitations do not detract from his insistence upon observation and experiment to support (or refute) the hypothesis that psychological attributes possess quantitative structure, which is where attempts at measurement in psychometrics consistently fall short.
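To indicate the kind of empirical test that conjoint measurement later supplied, here is a minimal sketch (with invented data) of the Luce–Tukey double-cancellation condition on a 3×3 table of ordered observations: a table with additive row-plus-column structure satisfies the condition, while a suitably chosen table violates it, showing that the hypothesis of additive structure has testable empirical content.

```python
def double_cancellation(P):
    """Luce-Tukey double cancellation on a 3x3 table P, where larger
    entries indicate more of the combined effect: if P[1][0] >= P[0][1]
    and P[2][1] >= P[1][2], then P[2][0] >= P[0][2] must hold."""
    if P[1][0] >= P[0][1] and P[2][1] >= P[1][2]:
        return P[2][0] >= P[0][2]
    return True  # antecedent not met: condition is vacuously satisfied

# Additive structure (entry = row effect + column effect) passes.
additive = [[r + c for c in (0, 3, 6)] for r in (1, 4, 9)]

# An invented table whose ordering cannot be additively represented.
nonadditive = [[1, 4, 6],
               [5, 2, 4],
               [3, 5, 9]]

print(double_cancellation(additive), double_cancellation(nonadditive))  # True False
```

Nothing in this test requires numerical measurements as input, only orderings of joint effects, which is why conjoint measurement offers a back door to evidence of quantitative structure of the sort unavailable within Smith’s Campbellian framework.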
The Critical Situation at Mid-Century
By the mid-twentieth century, Thomson had demonstrated that non-quantitative theories are a viable alternative to mainstream quantitative theories, which entails doubts regarding the presumed superiority of quantitative theories; Boring had demonstrated that using off-the-shelf probability distributions as measurement mechanisms begs the question of whether mainstream psychometric claims to measure psychological attributes are true; and Smith had insisted that evidence for quantitative structure could only be obtained by observational and experimental research, which entails doubts about whether psychological attributes are quantitative.
Instead of working to resolve these doubts, psychologists continued to presume that their tests measure psychological attributes. This has been a perpetual stance: for example, at the birth of the Psychometric Society, its journal, Psychometrika (founded in 1936), was dedicated to developing “a quantitative rational science” (Heiser and Hubert 2016, 1175) and the cover description for a recent set of Essays on Contemporary Psychometrics (van der Ark et al. 2023) still described psychometrics as “the science devoted to the advancement of quantitative measurement practices in psychology, education and the social sciences.” Such a stance takes for granted that abilities, etc., are quantitative attributes; alternative possibilities were not systematically investigated, and probability distributions were recruited as scaling devices without testing their veracity. In a normal science, evidential gaps stimulate remedial investigations. Not so psychometrics: measurement of psychological attributes using psychometric theories and methods was taken to be a fait accompli. As a widely used methodological resource put the matter: “Most behavioural and social science data are ordinal. However, through certain scaling methods and assumptions, it can be considered as interval scale data” (Kerlinger and Lee 1964 [2000], 639).
The result was a disconnect between the mathematical inventiveness of its quantitative theories and the character of the phenomena underlying psychological test performance, a disconnect captured in a remark of a past president of the Psychometric Society: “I keep thinking of psychometrics as being part of statistics, not so much ‘psycho’” (as quoted in Wijsen and Borsboom 2021, 335). A century earlier, Boring had warned of just this disconnect: “it is senseless to seek in the logical processes of mathematical elaboration a psychologically significant precision that was not present in the psychological setting of the problem” (1920, 33), but the warning immediately fell on deaf ears. The presupposition that psychological attributes are quantitative was elevated to axiomatic status, constraining psychologists to impose upon the phenomena of test performance a level of mathematical complexity not known to be present in the psychological setting of testing.
As a consequence, psychometrics is neither a quantitative science (because relevant psychological attributes have not been shown to possess quantitative structure) nor a rational science (because valid criticisms are ignored) and to the extent that it is science, it is pathological (Michell 2000, 2008) because these attitudes subvert the scientific aim of finding the causes of test performance. And yet, despite that, psychometrics always was, is now, and apparently ever will be a thriving technology (as measured, say, by marketing success). However, because this success involves presenting tests as instruments of scientific measurement, that is, as something they are not, psychometrics is a myth-based technology.
Psychometrics: Myth-Based Technology
Psychometrics always had technological aspirations. Long before psychological tests were invented, Galton looked forward to the day when a system of competitive examination for girls, as well as for youths, had been so developed as to embrace every important quality of mind and body, and where a considerable sum was yearly allotted to the endowment of such marriages as promised to yield children who would grow into eminent servants of the State. (Galton 1865, 165)
Psychometrics came into being to solve this eugenic “problem” of selecting couples adjudged fit to procreate. In step with Galton, Spearman announced, “the eugenic problem from which we set out has reached a definite solution in the theory of ‘two factors’” (1914, 229), thinking factor analysis enabled measurement “of a minimum index to qualify … above all, for the right to have offspring” (Hart and Spearman 1912, 79). 11
Psychometrics did not invent the conviction that psychological attributes are quantitative, but it did invent the technology of testing. The presupposition that mental attributes are quantitative predates psychometrics by more than two millennia, its most influential expression coming from Plato, who entertained the idea that our salvation depends upon finding the right measures of pleasure and pain. This quantitative presupposition was part of a continuous intellectual tradition enduring through to the British utilitarians, Galton’s contemporaries (Michell 2023b). Strange as it seems to modern minds, no measurement technology emerged from this tradition, whereas such a technology was pivotal for psychometrics.
There is no need to trace this technology’s history because its achievements are known: the work of Binet and Simon in identifying children needing special education, thereby creating the prototypes for future intelligence tests; the work of Goddard in adapting these tests to US conditions and, similarly, of Burt in Britain; the work of Terman in recasting intelligence tests as instruments for measuring IQs; of Yerkes in devising group tests for army recruits during World War I; and further innovations by Thurstone (Jones and Thissen [2007] summarise this history). The primary focus of psychometrics was always technological, and substantive psychological theories played second fiddle to theories of test construction. Furthermore, the latter are not treated as raising empirical questions for investigation, but as providing answers enabling test construction. The presupposition that psychological attributes are measurable quantities is an axiom of this technology. As a result, psychometrics is guided by a myth-based technological paradigm.
Transferring Kuhn’s paradigm concept from the philosophy of science to technology studies, Dosi (1982, 152) defined a technological paradigm as a “‘model’ and a ‘pattern’ of solution of selected technological problems, based on selected principles derived from natural sciences and on selected material technologies.” Clearly, he had technologies utilising natural science in mind, but not all technologies are so based. For example, horoscopes are constructed using non-scientific, astrological principles, and the construction of tests as instruments of psychological measurement has always preferred wishful thinking to scientific investigation. Hence, Dosi’s definition is broadened here by replacing his phrase “based on selected principles derived from natural science” with “based on selected pragmatic principles.” This change serves to include not only the paradigm guiding horoscopes but also that guiding psychometrics and it indicates an important difference between scientific and technological paradigms.
Scientific paradigms track truth, in the sense that the aim of science is to discover the structure and ways of working of the systems under investigation; however, the focus of technological paradigms is different. Originally, construction of technological artefacts may have been guided by the belief that the procedures used deliver products fit for special uses. However, once technologies were mass-produced in market economies, the focus shifted. From then on, technological paradigms tracked markets. In market economies, marketing success trumps truth and in disciplines like psychology, where substantive theories and outcomes of interventions can be ambiguous, marketing success seems to provide a more tangible criterion than does scientific success. Hence, packaging tests as instruments of scientific measurement is a marketing strategy, not a truthful assessment.
Thus, psychometrics is the scene of conflict between two paradigms, scientific and technological. According to its technological paradigm, packaging tests as scientific measurement devices is a profitable marketing strategy, which logical critiques directly threaten. According to its scientific paradigm, packaging tests as scientific measurement devices misrepresents them. However, in psychometrics, “When strict logic conflicts with practical utility, it is utility that usually wins, as it probably should” (Ebel and Frisbie 1965 [1991], 31). With this use of “should” the mystery of the missing corpus is solved: psychometrics’ technological paradigm holds the reins and logical critiques are bad for business. 12
The reign of its technological paradigm over the discipline was tested in the 1950s when the suggestion was made to recast the use of testing technology as actuarial prediction (Meehl 1956) and to replace the rhetoric of scientific measurement by that of decision theory (Cronbach and Gleser 1957). Both suggestions fitted psychometric practice better than did the rhetoric of scientific measurement; nonetheless, that rhetoric prevailed, its marketing superiority being obvious. The dominance of the myth-based technological paradigm was confirmed.
In the present cultural environment this dominance shows no sign of abating. For the past century, Western society, seized by “metric fixation” (Muller 2018, 17), has deified numerical indices because of a naïve trust in numerical data (naïve because numbers offer no more security than anything else). Exploiting the fact that quantification is “a technology of distance” 13 (Porter 1995, ix), these indices enable the exclusion of those adjudged unworthy of receiving various social opportunities by cynically claiming to measure relevant human attributes (cynical because decision-makers falsely assume the authority of scientific measurement). With its dazzling statistical models producing its endless array of questionable metrics, psychometrics represents the gold standard of the modern metric fad.
Nonetheless, the fact that the hidden critical corpus has steadily accumulated means that some psychologists march to the drumbeat of the scientific paradigm and decline to endorse the presupposition that tests measure psychological attributes. Opposing the mainstream, they believe that “We should not wish nature to accommodate itself to what seems better ordered and disposed to us; we should rather accommodate our own intellect to what nature has made, in the certainty that this is the best and only way” (from a letter of Galileo to Prince Cesi, 30 June, 1612; Galluzzi 2014 [2017], 114).
Given its social prominence, psychometrics’ pathological deviation from its scientific paradigm requires a sociological explanation. Ken Richardson has identified one, but not the only, social cause: In effect, then, Galton’s aim, and that of his followers, became simply an attempt to reproduce an existing set of ranks (social class) in another, the test scores and pretend that the latter is a measure of something else. This is, and remains, the fundamental strategy of the intelligence-testing movement, and this gloss over the fundamentals of fully scientific measurement is what has dogged it throughout this century. Of course, the quantitative nature of the test helped to create the impression of scientific measurement. (Richardson 2000, 27)
However, what this particular cause implies is that those individuals affected, especially those deemed unworthy to receive social opportunities, have an interest in knowing not only that tests are not instruments of scientific measurement, but also that psychometrics ignores the critiques showing this, while at the same time reaping social rewards by presenting its technology as something it is not known to be. In this sense, and not inappropriately, the renegade philosopher R. G. Collingwood thought it a “fashionable scientific fraud” (1939, 95; see Michell 2020a).
Upshot
Given human fallibility, criticism is a necessary corrective assisting scientific progress. However, because critics question cherished convictions, those motivated to serve science need to be aware of the tactics used to deflect criticism. Psychometrics provides ready examples.
First, Spearman attempted to discredit Thomson’s critique by mocking what he called “Thomson’s unfortunate induration of thought” (1927a, 325). This was an ad hominem response, attacking the critic, not the critique. Whatever Thomson’s state of mind, the only important thing about his critique, from a scientific point of view, is its logical relationship to Spearman’s theory. Second, Kelley’s attempted rebuttal of Boring’s critique is a case of petitio principii (that is, begging the question). Kelley’s starting point was that “our mental tests measure something” (1929, 86), which presumes the very thing Boring was questioning. Third, the desultory reaction to Smith’s critique amounted to turning a blind eye to criticism. This marked a transition from an earlier reliance upon specious arguments, which at least recognised the existence of critiques, to the post-Stevens stance of wilful blindness, which treated critiques as non-existent.
In part this was a consequence of methodological triumphalism. Modern psychology, belittling the armchair methodology of the earlier philosophical psychology, based itself upon the conviction that physics’ success was due to the methods of experiment and measurement, and that these methods were therefore necessary in psychology. This presumption was fervently embraced, and the aping of these methods seemed triumph enough for psychologists. There was little recognition that, as with all methods, scientific success is contingent upon specific empirical conditions in the phenomena investigated. In the case of measurement, the central condition necessary for scientific success is that the relevant attributes possess quantitative structure. In the absence of this, claiming measurement is not just misleading (in the sense of presenting a method as something it is not known to be); it is scientifically stultifying.
However, psychometrics is such a successful technology that its defects as a science escape notice, except by those contributing to the critical corpus. By the mid-twentieth century, psychometrics’ deficiencies were exposed. The fact that few attended to them displays psychometrics’ woeful critical culture and the wider culture’s metric fixation. Cocooned in a psychometric culture craving measurements and a wider culture craving convenient metrics, a discipline marketing a flourishing technology, premised upon the belief that its “measurements” are key to its scientific image, and this image key to its marketing success, has no incentive to aim higher.
Footnotes
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
