Abstract
IQ scores are used in the Netherlands to determine whether an individual qualifies for special education or gains access to healthcare. This use of IQ is referred to as "slagboomdiagnostiek" (doorway diagnostics). Professionals in the field have voiced criticism of this policy. In this article, we discuss whether the current Dutch policy is justified by examining the validity and reliability of IQ tests. Specifically, we address whether the IQ test can be used in clinical decision-making regarding the individual. We argue that while validity and reliability have been demonstrated at the group level, they do not hold at the level of the individual, leading to the conclusion that the current Dutch policy is not justified.
In fact, he [Binet] specifically argued against giving a hereditarian interpretation, because he understood that if you did that, you had misinterpreted the number as a limit, rather than an aid. (Gould, 1996)
Introduction
Back in 1904, Alfred Binet was asked to develop an instrument that could differentiate between children who could benefit from normal schools and children who would need some form of special education (Gould, 1996). In 1905, the first intelligence test was born. Binet intended intelligence tests to show the cognitive strengths and weaknesses of the child; based on those abilities, one can formulate which kind of schooling and help are needed. Binet strongly argued against the use of intelligence scores for predicting future abilities. Low intelligence was not to be seen as fixed, and Binet therefore developed several methods to enhance intelligence, with a focus on learning how to learn (Gould, 1996). The outcome of his test was later called IQ, a term devised by William Stern in 1912. Although a number was used to indicate a score, Binet stated that the number itself means nothing.
The number is only an average of many performances, not an entity onto itself. Intelligence, Binet reminds us, is not a single, scalable thing like height. “We feel it necessary to insist on this fact,” Binet (1911) cautions, “because later, for the sake of simplicity of statement, we will speak of a child of 8 years having the intelligence of a child of 7 or 9 years; these expressions, if accepted arbitrarily, may give place to illusions.” Binet was too good a theoretician to fall into the logical error that John Stuart Mill had identified—“to believe that whatever received a name must be an entity or being, having an independent existence of its own.” (Binet in Gould, 1996, p. 181)
Binet's warning not to reify the score of his test went unheeded: reification is exactly what happened. According to test developers and diagnosticians, IQ scores represent cognitive abilities, and scores on standardized intelligence tests play a prominent role in clinical decision-making. The term used for this phenomenon is "slagboomdiagnostiek" (doorway diagnostics). This doorway is rather well described by Weinberg (1989), who refers to IQ scores as "magical numbers." These numbers are used for selection and placement in certain (educational) programs.
The scores determine who is adopted quickly or accepted in the top tier preschools, who is labelled retarded or gifted or is tracked to receive special education placement and programming, who is placed in the bluebird learning group or the cardinals, who goes to elite colleges or is offered other educational opportunities, and who serves in the military as an officer or gets into a management-training program. IQ tests have played a pivotal role in allocating society’s resources and opportunities. (Weinberg, 1989, p. 100)
Weinberg's statement dates from almost 35 years ago, and not much has changed since then in the Netherlands. Back in 2004, the Dutch professor Peter Tellegen warned psychologists against basing their decisions too strictly on IQ scores. The reason was an e-mail from parents whose son was not admitted to a particular school because his test score was 1 IQ point short of the minimum. One might think that this must be an exceptional case and that normally at least a confidence interval will be honored. However, IQ still plays a prominent, even dominant, role in clinical decision-making and policy in the Netherlands, as the following recent examples illustrate.
The importance of IQ in The Netherlands
To be eligible for permanent care under Dutch legislation, the Wet langdurige zorg (Wlz; Long-term care act), due to intellectual disabilities, IQ scores are required and, in most cases, decisive. This act allocates healthcare to individuals who need care and proximity 24 hours a day. Individuals with psychiatric disorders but without intellectual disabilities, such as autistic individuals, are not eligible for this funding. They instead receive help under another Dutch act, the Zorgverzekeringswet (Zvw; Healthcare insurance act). The Zvw is mandatory for all residents and is used when short-term help is needed. These two Dutch acts can cause conflicting situations, due to the prominent role of IQ. To illustrate a complex situation: a young male adult with a low IQ score (recently determined), psychiatric disorders, and a history of addiction was found ineligible for Wlz, because his low IQ had not been demonstrated before his 18th birthday. In that case it is impossible to determine that there is actually an intellectual disability 1 ; after all, the low IQ might also be a consequence of his addiction. The Zvw, therefore, must fund the care he needs. The Zvw, however, does not do this, because of his low IQ.
To gain more insight into the role of IQ scores when it comes to the Wlz, we asked members of the Nederlandse Vereniging van Orthopedagogen (NVO; Dutch Association of Child Psychologists) to share their experiences (Radstaake et al., 2023). Respondents (n = 29) stated that IQ plays an important (58%) or very important (41%) role in Wlz allocations, whereas all respondents (100%) disagreed with this approach. Seventy-two percent indicated that they had been forced to deliver IQ scores in cases where this was impossible or undesirable due to the individual circumstances of their client. A few quotes reflect respondents' experiences:

IQ is very important to obtain an indication. Sometimes only one IQ point matters. Self-reliance and emotional development are also taken into account, but the IQ score is usually decisive. [Translation by the authors]

As previously mentioned, it is not always easy to establish that the disability occurred before the age of 18. Furthermore, a relatively high IQ (mild intellectual disability or below average) combined with weak adaptive and emotional functioning may also indicate a genuine need for residential care, guidance, and treatment. However, someone may not end up in the right place because they do not receive that indication. [Translation by the authors]
IQ not only plays an important role in the domain of healthcare but is also prominent in the field of education, as illustrated by a recent development in the Netherlands. In 2019, the Sectorraad GO (Sector Council Specialized Education) formulated a Landelijk Doelgroepenmodel (LDGM; National Target Group Model; www.gespecialiseerdonderwijs.nl). This model is used to predict outcomes for learners in the field of special education, such as VSO (extended specialized education), dagbesteding (daytime activity), or beschermd werk (protected work), and is intended to be further developed for use in regular education. The prediction is based, among other things, on full-scale IQ (FSIQ). The developers of the LDGM state that it is important to shed light on the potential of the learner by means of IQ tests, and that this potential serves as a strong predictor of a learner's maximum achievement. It is believed that this will enable teachers to tailor their curriculum to the learner's potential. Note that Binet warned against this use of intelligence tests: ". . . he feared even more what has since been called the 'self-fulfilling prophesy'. A rigid label may set a teacher's attitude and eventually divert a child's behavior into a predicted path" (Gould, 1996, p. 181).
We shall highlight the word potential a bit further with another example. In the Netherlands, a standardized intelligence test was developed by Hop and van Boxtel (2013) under the auspices of CITO (Central Institute for Test Development). At this point, it is important to know that in the Netherlands, CITO is a prestigious institution with far-reaching influence in the educational field. On their website they emphasize the importance of knowing the learner's IQ, and therefore strongly encourage teachers to measure it:

To discover the true potential of your student and provide them with optimal support, we now have the CognitiQ Intelligence test. . . . In my opinion, education should have an indication of the cognitive abilities of all students. Each student should be tested so that education is better equipped to guide every student. This calls for an inexpensive, widely deployable IQ test that schools can administer themselves. (Hop and van Boxtel, 2013, www.cito.nl) [Translation by the authors]
In the manual, the authors make a clear distinction between potential (what is in the child) and performance (what abilities the child actually shows). They state that leervorderingtoetsen (tests for learning progress) measure performance, whereas potential is measured by intelligence tests (IQ). Interestingly, the authors also speak of a continuum between performance and potential, suggesting that a fundamental distinction between the two is impossible. In their opinion, scores on tests for learning progress are not a pure reflection of the learner's potential, because other relevant variables play a role in school achievement. These variables include motivation, knowledge, and skills acquired in school, which research has demonstrated to play an important role (e.g., Richardson, 2002; Sternberg, 2018). In other words, those tests measure performance, mostly based on skills learned in school. The assumption is that, apart from what the child has learned in school, there is a latent form of potential within the child, and this is what intelligence tests are said to measure. The assumption that performance and potential are the extremes of a continuum does not, however, warrant a pre-set cut-off point, because any such point would be arbitrary. After all, where does school-based knowledge end and where does intelligence start? Smith and Thelen (1994) also state that it is impossible to separate the two concepts, potential and performance, at least in an empirical way. The idea that intelligence tests measure an ability independent of what is learned in school or in life might be a myth. As Sternberg puts it: "Instruction emphasizes the same memory and analytical abilities that the tests test. The tests and the schools reinforce each other in an unending loop" (2018, p. 3). In fact, a meta-analysis by Ritchie and Tucker-Drob (2018), including over 600,000 participants, showed that education raises IQ in a causal way. Even Binet and Simon indicated that scholastic acquisition is a source of variance in their test (Richardson, 2002).
Interestingly, and somewhat confusingly, CITO not only developed a standardized intelligence test but also a dynamic assessment (DA) approach. In the manual of the DA, the author states that DA is recommended in special education because "it is questionable how relevant it is what the learner can do in an artificial testing situation on its own" (Kuhlemeier, 2018, p. 4), referring to standardized assessment (SA). "It is more important to know what the learning potential of the learner is, that is to say what a learner can do with the help of an adult in a natural situation and the amount of help needed" (p. 4), and "DA gives teachers an indication of the learning potential, or learning abilities, changeability or modifiability" (p. 5) [Translation and italics by the authors]. In other words, DA might be a better tool to bring the learner's potential to light than standardized intelligence tests. This leaves the work field with the question of how relevant and meaningful IQ, as measured by standardized intelligence tests, really is. Again, we asked members of the NVO working within the educational field to give their thoughts and share their experiences (Radstaake et al., 2023). Some quotes:

The role of IQ is (sometimes heavily) overestimated, especially when it comes to vulnerable students who are struggling in school, which can undermine equal opportunities (due to low expectations). School success is influenced by so much more than just IQ, such as: the child's motivation, well-being, talents, interests, executive functions, the quality of education, the pedagogical and didactic actions of teachers, collaboration within the school team, professional attitude, seeking feedback from each other, parents, and students and learning from it, the extent to which parents can support education, and collaboration between school and parents and parental involvement. [Translation by the authors]

It would be very beneficial if people could truly grasp the limited value of an IQ score. Even more importantly, during normative research, a sample is taken from the normal distribution within the population. As a result, very few children with an IQ below 70 are included in that sample, despite this being the group for whom the consequences of an IQ score (unjustly) have the most significant impact. If one wishes to attach importance to such a score, it is crucial to ensure that the score is as reliable as possible, as I genuinely have many reservations about it. [Translation by the authors]
In conclusion, within the Netherlands we witness the significance attributed to IQ scores, alongside the critiques raised by practitioners in the fields of healthcare and education. We wonder whether the use of IQ scores as a doorway in current Dutch policy is justifiable. To answer that question, we will reflect on two important themes concerning the measurement of intelligence: validity and reliability. We have chosen this angle because the terms validity and reliability are frequently used in publishers' test manuals and in publications by the COmmissie Testaangelegenheden Nederland (COTAN; Committee on Test-related issues, the Netherlands) to indicate the psychometric characteristics of standardized tests. More specifically, the concepts of validity and reliability are used to justify the use of standardized intelligence tests. In the following, we will argue that standardized intelligence tests might seem valid and reliable at the group level, but that these data cannot be transferred to the individual.
The problem of validity
The most important concept that requires attention in IQ research is validity. After all, researchers and diagnosticians alike need to be certain that intelligence exists as a property that can be measured, just as temperature can be measured. Validity thus simply is ". . . whether a test really measures what it purports to measure, . . ." (Kelley, 1927, p. 14). According to Borsboom et al. (2004), this definition of validity requires that "a) the measured attribute exists and b) variations in the attribute causally produce variation in the outcomes of the measurement procedure" (p. 1061). Moreover, the problem of validity is not a matter of proper psychometric techniques, ". . . it cannot be contracted out to methodology" (p. 1062), but must rather be addressed by substantive theory. We will unpack these notions in the next paragraphs.
The measured attribute exists
The first statement means that if intelligence does not exist as an entity, it cannot be measured. It is therefore important to reflect on the concept of intelligence first. Despite the fact that the first formal intelligence test was developed more than a century ago, researchers still do not agree on basic assumptions, such as how to conceptualize intelligence. Another important dispute concerns the question of whether there exists a general intelligence factor, the so-called factor g. To this day, there is no unequivocal answer to questions such as what the nature of factor g is and to what extent this factor can be measured. We will delve further into the matter of g.
A general intelligence factor based on correlation
Charles Spearman (1863–1945) coined the term g. He found that performance on conventional intelligence tests correlates with scores on other mental ability tests and argued that one underlying ability must be the cause: the so-called general ability, or g. Apart from g, he formulated s, which refers to specific abilities necessary for specific tasks (Weinberg, 1989). Group-data research indeed reveals, based on statistical methods, a hierarchical structure of intelligence with g at the top. According to Spearman, the discovery of factor g made psychology a real science (Gould, 1996). Spearman invented factor analysis and discovered factor g in the same period in which intelligence testing and IQ were highly favored in society. At first, Spearman was condescending toward mental testers, who used many uncorrelated subtests to measure intelligence and merged the scores into a single IQ (Gould, 1996). Later, however, he perceived factor g as a justification for the use of IQ: Spearman argued passionately that the justification for Binet testing lay with his own theory of a single g underlying all cognitive activity. IQ tests worked because, unbeknownst to their makers, they measured g with fair accuracy. Each individual test had a g loading and its own specific information (or s), but g-loading varies from nearly zero to nearly 100 percent. Ironically, the most accurate measure of g will be the average score for a large collection of individual tests of the most diverse kind. Each measures g to some extent. The variety guarantees that s-factors of the individual tests will vary in all possible directions and cancel each other out. Only g will be left as the factor common to all tests. IQ works because it measures g. (Gould, 1996, pp. 293–294)
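Spearman's cancellation argument is easy to make concrete. The following minimal simulation (ours, not taken from the cited sources; all parameters are illustrative) generates test scores as a latent factor plus independent specific abilities, and shows that an average over many diverse tests tracks the latent factor far better than any single test does:

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_tests = 1000, 20

# Illustrative data: each person gets a latent "g" plus test-specific
# abilities (Spearman's s) that are independent of g and of each other.
g = rng.normal(size=n_people)
loadings = rng.uniform(0.3, 0.8, size=n_tests)  # g-loading varies per test
s = rng.normal(size=(n_people, n_tests))        # specific factors
scores = g[:, None] * loadings + s              # observed test scores

# A single test correlates only moderately with the latent factor ...
print("single test vs g: r =", round(np.corrcoef(scores[:, 0], g)[0, 1], 2))
# ... but the average over many diverse tests correlates almost perfectly,
# because the uncorrelated s-factors cancel out in the sum.
print("mean of 20 tests vs g: r =",
      round(np.corrcoef(scores.mean(axis=1), g)[0, 1], 2))
```

Note what such a simulation does and does not show: if a common factor is put in, averaging recovers it; it does not show that human test data must be generated this way, which is exactly the point at issue in the remainder of this section.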
The idea of a general intelligence factor remains important in science and in today's society. Carroll (1997) states that "when correlational data can be well fitted using factor-analytic models, it is scientifically appropriate to accept the existence and functioning of the postulated factor" (p. 28), and Sternberg (2018) argues that currently the most popular opinion in the field of intelligence research is the "traditional" view that factor g exists, after short periods of emphasis on broader theories of intelligence. Although the exact nature of g is unknown, many researchers believe, based on adoption and twin studies, that it must relate to a genetic or biological factor (Richardson and Norgate, 2006). Research repeatedly demonstrates that g correlates with job performance, especially when it comes to cognitively complex jobs (Gottfredson, 1997; Sternberg, 2012), and with academic performance (Eysenck, 2017; Nettelbeck and Wilson, 2005). Gottfredson (1997) maintains that psychologists generally agree that g predicts job performance, but that the predictive validity of g depends on the kind of job; the more complex the job, the higher the predictive validity, with correlations ranging from 0.2 to 0.8. The correlations of factor g with school achievement and with occupational status are both about 0.5 (Richardson, 2002). Richardson, however, states that it is hard to measure the correlation between IQ (as a reflection of factor g) and job performance, because of how job performance is measured, mostly by subjective ratings of supervisors. He points out that research results are ambiguous: the assumed relationship between IQ and job performance disappears as employees spend more time in the job and gain confidence. Despite mixed results, these correlations are used to infer that g has predictive value, and for that reason it is believed that measuring g by means of IQ tests is useful. Nettelbeck and Wilson (2005) recommend the use of IQ tests in educational settings, because they provide the teacher, in only "an hour or two," insight into the child's intellect, which can then be used to predict academic performance. As Nettelbeck and Wilson state: ". . . despite shortcomings and important caveats on its application, the IQ score has good construct validity as an estimate for general intelligence. Beyond predicting achievements in life events and longitudinal stability, individual differences in IQ certainly reflect genetic variation" (p. 612).
How to interpret factor g
Thus, factor g usually shows up when the structure of intelligence is investigated in group research. The question remains whether g refers to a real entity with psychological meaning or whether its meaning is only psychometric. Lohman (1997) describes it as follows: From Spearman to the present, those who report factor analyses of correlations among tests have routinely slipped from careful statements about factors representing convenient "patterns of covariation" or "functional unities" to entities that exist in some concrete fashion in the brains of those who responded to the tests. (p. 370)
Several authors have shown that different research methods lead to different conclusions when it comes to factor g. Factor g becomes visible when factor models are used, but these statistical techniques contain serious issues, as shown by Molenaar and Campbell (2009). They conducted a factor analysis on the test scores of the Big Five personality test in a group of 22 people (the population) and indeed obtained the five-factor structure of personality. All participants also completed the Big Five on 90 consecutive days, and a factor analysis was run on each individual participant separately. The results revealed greatly diverging factor models across individual participants. Full factor models of three different participants were presented: the factor model of Participant 13 revealed three factors, that of Participant 1 four, and that of Participant 8 two.
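The between-person versus within-person contrast can be illustrated with a small simulation (ours, with illustrative numbers; it transposes the logic rather than reproducing Molenaar and Campbell's data). Trait levels are generated so that they share a common factor across persons, while a person's day-to-day fluctuations are independent; a cross-sectional analysis then finds a strong general factor that is absent from any single person's time series:

```python
import numpy as np

rng = np.random.default_rng(3)
n_people, n_days, n_vars = 200, 90, 5

# BETWEEN persons: trait levels share one common factor (illustrative).
common = rng.normal(size=n_people)
traits = 0.8 * common[:, None] + 0.4 * rng.normal(size=(n_people, n_vars))
# WITHIN a person: independent day-to-day fluctuations around the traits.
daily = traits[:, None, :] + 0.7 * rng.normal(size=(n_people, n_days, n_vars))

def first_factor_share(data):
    """Share of variance captured by the first principal axis."""
    eigvals = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))
    return eigvals[-1] / eigvals.sum()

# (a) Cross-section (all people, one day): a dominant 'general factor'.
print("between-person share:", round(first_factor_share(daily[:, 0, :]), 2))
# (b) Time series of one person: no comparable factor (share near 1/n_vars).
print("within-person share: ", round(first_factor_share(daily[0]), 2))
```

Under these assumptions, the group analysis and the individual analysis answer different questions; a structure recovered from group data need not describe any single member of the group.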
At the individual level, the structure of personality traits can thus be very different from what the theoretical model based on group data suggests. When it comes to factor g, the same conclusions can be drawn. Recent research by Schmiedek et al. (2020) showed that the existence of g is questionable when individual rather than group data are analyzed. Participants (n = 101) were followed for 6 months, completing several cognitive tasks across 100 sessions. The authors state that according to studies of between-person differences, the conclusion seems to be that the assumed (statistical) factor g exists, but their study of within-person differences shows an entirely different structure of intelligence. Particularly interesting, from our perspective, is the finding that the statistical factor g is less prominent when the individual is tested, which indicates that general intelligence might not exist. If this is true, a different explanation is needed for the rise of factor g in factor analyses, and such an explanation is provided by the mutualism model 2 of van der Maas et al. (2006). This is a dynamic model that explains the positive manifold between cognitive tasks without factor g; there is no latent factor causing the correlations between cognitive tasks. This is in accordance with Schlinger (2012), who argues that g is just a statistical construct with no theoretical foundation; he described the logical error Spearman made by taking the positive correlations to be factor g: . . . it [g] is not a thing being measured. What is measured is the behavior of large numbers of people on various tests. The positive intercorrelations that result from factor analyses of their test scores are themselves far removed from the behavior of any individual in the test-taking situation or, for that matter, in any other context. (p. 17)
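To make the mutualism idea concrete, the sketch below (ours; a simplified rendering with illustrative parameters, not the authors' exact model) lets a handful of cognitive processes grow logistically while boosting one another. Capacities are drawn independently per person, so no latent g is put in, yet the equilibrium ability levels end up positively correlated across persons:

```python
import numpy as np

rng = np.random.default_rng(1)
n_people, n_proc = 500, 6
M = 0.1             # uniform positive coupling between processes
dt, steps = 0.01, 2000

# Independent capacities K and growth rates a per person and process.
K = rng.uniform(0.5, 1.5, size=(n_people, n_proc))
a = rng.uniform(0.5, 1.5, size=(n_people, n_proc))
x = np.full((n_people, n_proc), 0.05)   # small initial ability levels

# Euler steps of dx_i/dt = a_i*x_i*(1 - x_i/K_i) + a_i*M*sum_{j!=i} x_j*x_i/K_i
for _ in range(steps):
    boost = M * (x.sum(axis=1, keepdims=True) - x)  # help from other processes
    x = x + dt * a * x * (1 - x / K + boost / K)

corr = np.corrcoef(x, rowvar=False)
off_diag = (corr.sum() - n_proc) / (n_proc * (n_proc - 1))
print("mean correlation between processes:", round(off_diag, 2))
# A positive manifold emerges from mutual support alone, with no latent g.
```

A factor analysis of such data would nevertheless yield a strong first factor, illustrating Schlinger's point that the statistical g need not correspond to any entity inside the person.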
Variations in the attribute causally produce variation in the measurement outcomes and there should be a theoretical explanation for this causality
The conclusion is warranted that researchers do not agree on whether a general intelligence factor exists. Moreover, science has to date not located factor g in the human body, although research has shown that g correlates with an overall quality of the brain rather than with specific regions of the brain. Studies show that g is associated with the parieto-frontal integration network, which also correlates with working memory (Eling, 2014). The words "correlates" and "associated" are emphasized, because they reveal the problem that Borsboom et al. (2004) elucidated in their seminal paper: validity is not about correlation, it is about causation. In other words, stating that factor g exists because it correlates with something in the brain is not enough. For an intelligence test to be valid, variations in intelligence should cause variations in measurement outcomes, and there should be a theoretical explanation for this causality. When it comes to intelligence, however, a theory about the required causality between the construct and the measurement is missing (Richardson, 2002). In other words, the reason why different levels of intelligence lead to different item responses remains unclear. As long as researchers do not know how to solve that problem, they are left with the often-quoted statement of Boring (1923) that intelligence is what the intelligence test measures. It really is not more than that; at the individual level, when a person does well on an intelligence test, the person does well on an intelligence test (van der Maas et al., 2014). More concretely, when a subject can recite the numbers 4, 9, 10, 11, 13 in reverse order, this person can say 13, 11, 10, 9, 4. Without a good theory, this is the only conclusion to be drawn.
CHC model: a solid theory of intelligence?
The above statements oblige us to comment briefly on the Cattell-Horn-Carroll theory of cognitive abilities (CHC theory; see McGrew, 2009), as CHC theory is perceived as the most widely investigated and substantiated intelligence theory today (Alfonso et al., 2005). CHC theory has had a great impact on test development (Canivez and Youngstrom, 2019). Alfonso et al. (2005) state that before the mid-1980s a theoretical foundation for intelligence tests was almost absent, whereas revisions of those tests were afterwards almost always based on the CHC model. Unfortunately, the disagreement on the nature of intelligence described above hampered the development of intelligence theories, which became clearly visible in the CHC model. CHC theory merges two models, the Cattell-Horn Extended Gf-Gc model and Carroll's Three-Stratum model, a synthesis suggested by McGrew in the late 1990s (Alfonso et al., 2005). The theory describes a hierarchical structure of intelligence, based on group data. In Carroll's Three-Stratum model, cognitive abilities are subdivided into three levels: Stratum III is the general intelligence factor, Stratum II contains broad abilities (e.g., fluid reasoning and cognitive processing speed), and Stratum I a variety of narrow abilities (McGrew, 2009). The Extended Gf-Gc model of Cattell and Horn consists of eight broad cognitive abilities, which correspond to Stratum II in Carroll's model; it is well known for its distinction between fluid and crystallized intelligence. A major difference between the two models is the theoretical stance on g (Kan et al., 2011). Where Cattell and Horn deny the existence of a general mental ability and leave g out of their model, Carroll attributes an important role to g (Stratum III). In CHC theory (the merged model), g is included, but not as Carroll originally intended it, and only with marginal notes. According to Carroll, factor g and the broad abilities both have a direct effect on performance (Canivez and Youngstrom, 2019), but in the CHC model factor g sits at the top of the hierarchy, affecting (i.e., causally) the broad abilities.
The WISC-V was developed to measure five broad CHC abilities (Canivez and Youngstrom, 2019), with little or no role for factor g or a Full-Scale IQ score. This accords with McGrew, who deemphasized the effect of g by stating that it "has little practical relevance" (McGrew & Flanagan, p. 14, cited in Canivez and Youngstrom, 2019). According to McGrew, the focus of assessment should not be on a single IQ score but on intelligence profiles, that is, the strengths and weaknesses of the individual. With this in mind, it is important to note that most of the variance in broad ability scores on the WISC-V is caused by g, rather than being a unique contribution of one of the broad abilities. Remember that this was not the intention of the WISC-V, as the test was designed to measure broad abilities apart from g, to provide insight into an individual's cognitive strengths and weaknesses. What the test is measuring, however, is mostly general intelligence! 3
Canivez and Youngstrom revealed a problem with measuring CHC broad ability factors similar to the one Molenaar and Campbell (2009) found for the Big Five personality test: Results from WJ III, WJ IV, and WISC-V all converged on a common result: despite claims by publishers and authors that the scores represent CHC broad abilities, independent replication of cognitive test structures has been problematic. The subtests do not always load on their intended factors and sometimes fail to load on any factor. (p. 237)
Even more important is their warning against interpreting broad ability scores at the individual level as cognitive strengths or weaknesses of the individual. Broad ability scores, as measured by the WISC-V, contain a large g loading. When it comes to Verbal Comprehension (VCI), for instance, "the majority of true score variance contained in the VCI is not unique to verbal comprehension ability but to g! . . . this substantially complicates what one might 'infer' for individual clients, particularly because we do not know for particular individuals exactly how much of their performance on the VCI tasks, for example, is due to g and how much is due to verbal comprehension" (Canivez and Youngstrom, 2019, p. 240).
Although the statement that the WISC-V is based on CHC theory may provide a seemingly scientific basis, the exact relation between the test and its underlying theory remains unclear in the scientific literature and in test manuals. As Canivez and Youngstrom (2019, p. 233) state: "It is one thing to assert that a test follows a model, but it is something entirely different to examine if data conform to strong predictions from the model." In the work field, the WISC-V is treated as a theoretically substantiated test, but there are only indications that the test is "following a model," and at the individual level it does not resemble that model very well. In line with Borsboom et al. (2004), we must conclude that intelligence tests do not satisfy the most fundamental requirements of validity. Below we will discuss more specific forms of validity.
Construct validity: The problem of standardized tests for multiple target groups
Norm-referenced tests are used for a variety of target groups, such as people with and without disabilities, and assume a stable order of development (Visser et al., 2017). Research, however, shows that the development of children with developmental disabilities might be qualitatively different rather than merely quantitatively delayed. Fuchs et al. (1987) showed that most test publishers failed to validate their tests for the handicapped population, although they are obligated to do so. A number of researchers have independently indicated that intelligence tests, as well as other standardized psychological tests, cannot be used validly and reliably with specific groups of patients or clients, such as young children (Neisworth and Bagnato, 1992), patients with frontal-lobe damage (Sbordone, 2014), autistic children (Koegel et al., 1997), and children with intellectual and developmental disabilities (Ponsioen, 2005; Visser et al., 2017). This leaves a small group of typically developing human beings for whom intelligence tests might be appropriate. Note that the fundamental problem of validity, as presented by Borsboom et al. (2004), still stands. Moreover, normally developing people are a group most likely not referred to as patients or clients and might not be "in need" of intelligence tests as a doorway to special education or healthcare.
That being said, it is interesting to note that even in normal and highly educated individuals, large inter- and intra-individual differences in cognitive functioning have been found. Zakzanis and Jeffay (2011) administered a large number of neurocognitive tests to a group of 20 highly educated participants with a mean IQ of 124 (as measured by two vocabulary and matrix reasoning subtests). Examples of these neurocognitive tests are the Rey Complex Figure Test, the California Verbal Learning Test-II, and Digit Span. When these neurocognitive test scores were compared with IQ scores, the majority did not match (in many cases test scores were more than 1.5 SD below IQ). Research by Heyanka et al. (2013) revealed similar results. Thus, even in a group of normally developing individuals, IQ might not be a meaningful measure when it comes to assessing cognitive abilities.
Construct validity: The problem of cultural influences
As said before, validity refers to measuring what one intends to measure. Intelligence tests intend to measure intelligence. Since it is not clear what intelligence is, it is also not clear what intelligence tests measure. According to Richardson (2002): "all of the population variance in IQ scores can be described in terms of a nexus of sociocognitive-affective factors that differentially prepares individuals for the cognitive, affective and performance demands of the test" (p. 283). This "preparation" can be found in subtle parent-child or teacher-child interactions where belief systems and habits are transmitted, or in the toys children play with (see Dirks, 1982). According to Vygotsky (1988, as cited in Richardson, 2002), intelligence reflects the degree to which the subject has internalized the cultural tools. To understand the deep influence of culture on performance (including performance on intelligence tests), one has to perceive culture as more than an ethnic group. Social classes and even families can be seen as small cultural groups with their own habits and influences on cognitive abilities. Richardson (2002), for example, reviews a number of studies demonstrating that the acquisition of cultural tools differs per social class. Middle-class parent-child interactions are characterized by more factual questions, like "Who was Columbus?" (see also Flavell et al., 2002). According to Richardson (2002), those parents also teach children how to use written materials and how to approach problems. The reason intelligence test items appeal to those skills is that the tests are invented by researchers from the same middle-class background. So, it makes sense that children from these backgrounds achieve higher scores, because they come to the test better prepared.
Apart from cognitive preparedness, children can also be more or less prepared for test taking on an emotional level. Test anxiety or self-confidence, for example, have a great effect on test outcomes (Torres Van Grinsven, 2022) and may be transferred from parents to children (Richardson, 2002). Weiss and Saklofske (2020) further point to the significant impact of SES, parental income, parental expectations, and academic monitoring on the IQ scores of children. They show that these factors underlie the "average IQ-score gap" between children and adults from different ethnic groups and stress the influence of environmental factors on the fulfillment of children's cognitive potential. They also argue that, although efforts have been made to lessen the cultural load of test items, prior and/or present discrimination in education access, career opportunities, and employment still acts upon the realization of cognitive potential in children from minority groups. Richardson (2002) also points to the subtlety of cultural bias in intelligence tests. He argues that non-verbal tests like Raven's Progressive Matrices (RPM), commonly perceived as culture-free, are in fact more culturally loaded than the verbal tests. The implicit rules underlying this test refer to specific knowledge typical of middle-class cultures. "These nearly all require the reading of symbols from top left to bottom right, additions, subtractions and substitutions of numbers or other symbols across columns and down rows, and the deduction of new information from them" (Richardson, 2002, p. 291). Knowledge of and experience with such structures make the RPM a rather simple cognitive task to solve. Thus, the RPM is not at all experience- and culture-free, and it certainly is not a pure measure of general intelligence (Richardson, 2002).
Criterion validity: The problem of multiple intelligence tests
A questionable feature of intelligence tests is that they all provide the subject with an IQ score, but the magnitude of that score depends on which instrument is used. There can be substantial differences between scores on different instruments, which should not be possible if intelligence tests are valid (i.e., they should measure the same construct). What makes this more problematic is that the score itself is interpreted the same way, with 100 as the average and a standard deviation of 15 (Tellegen, 2004). In other words, IQ is treated as a unitary concept, but the test outcome differs depending on which instrument is used.
Habets et al. (2015) investigated repeated measurements of IQ and differences between IQ scores due to the use of different instruments. They collected data from 176 participants with two or more IQ scores. The instruments used were the WAIS, WAIS-III, RAVEN, and sGIT. Three categories were used to classify IQ scores: Normal IQ (>85), Borderline IQ (71–84), and Intellectual Disability (⩽70). At the group level, the results showed high correlations between the different IQ instruments and stability of IQ scores. But for individuals with intellectual disability or a history of special needs, this stability disappeared. Comparison between WAIS-III and WAIS scores showed a different IQ category for 30 of 62 cases (48%). Comparison between the WAIS-III and RAVEN showed a different IQ category for 47 of 77 cases (61%). More specifically, 11 participants shifted from intellectual disability to normal IQ when WAIS-III scores were compared with RAVEN scores. Thus, this research demonstrates that 33–66% of the cases showed a difference of 10 points or more between instruments. Note that 10 points fall outside the 95% confidence interval of most IQ tests (Habets et al., 2015). If measurement outcomes were consistent (e.g., if WISC-III scores were consistently 10 points higher than WJ III scores), there would not be a problem, as one could correct for the systematic error. Unfortunately, this is not the case. Research by Kaufman (2009) shows differences between the WISC-III, KABC-II, and WJ III 4 in a sample of 29 preadolescents, 12–16 years old. Kaufman describes a few cases that show large differences between the three test scores: Brianna (not her real name) would be classified as average (WJ III), high average (WISC-III), or superior (KABC-II) depending on the test she was given (. . .). Leo earned IQs that ranged from 102 on the WJ III to 124 on the WISC-III. Asher had the opposite pattern, scoring higher on the WJ III (111) than on either the WISC-III (95) or KABC-II (90). If Brianna had been tested for entry into a gifted program (with IQ = 125 as the cut-off point), only her score on the KABC-II would have gained her entry. But Danica and Leo would have been more likely to be chosen for the gifted program if tested on the WISC-III than on the KABC-II or WJ III. (p. 153)
The above results reveal that test scores not only differ per instrument, but differ per instrument per individual. There is no consistency, meaning that in some cases individuals benefit 5 from the chosen instrument and in other cases they do not.
van Toorn and Bon (2011) also compared different tests, administering the WAIS-III, KAIT, and GIT2 (short version) 6 to 50 adults in the forensic field. The assessments took place within a timespan of a few weeks. A difference of >10 points was found between the GIT2 and WAIS-III (38% of the cases), between the GIT2 and KAIT (46% of the cases), and between the WAIS-III and KAIT (56% of the cases). Again, there was no consistency in the differences, meaning that in some cases the KAIT score was higher than the GIT2 score and in some cases this pattern was reversed. For more about the importance of correct diagnoses and the possible consequences of misdiagnoses in the forensic field, see Habets et al. (2015).
The problem of reliability
Reliability refers to the consistency of a measurement (Bruton et al., 2000). When repeated measurements provide the same score over and over (given that the individual has not changed in the meantime), the measurement is said to be reliable. It is also important that external factors do not influence the measurement: the test should provide a score independent of the practitioner (Kievit et al., 2008). In reality, this will not be the case; measurements contain unsystematic errors (Bruton et al., 2000), for instance when the practitioner is inaccurate in writing down the client's answer or the client performs at a lower level due to a headache. Thus, environmental influences during test taking can affect the reliability of the test outcome. According to Bruton et al., reliability is a rather complex concept. There are several methods to quantify it, all with different outcomes. It is up to the clinician to interpret the degree of reliability and to determine which degree is acceptable for clinical use. Below we reflect on internal and external factors that might affect the true test score, if it exists, of the individual.
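To give a sense of the magnitudes involved, the following worked example (ours; the reliability value is illustrative) applies the standard Classical Test Theory formulas for the standard error of measurement and the 95% confidence interval around an observed IQ score:

```python
import math

# Classical Test Theory: observed score X = true score T + error E.
# The standard error of measurement follows from the test's reliability.
sd, reliability = 15, 0.95             # IQ scale SD; reliability illustrative
sem = sd * math.sqrt(1 - reliability)  # ~3.35 IQ points

observed = 100
lo, hi = observed - 1.96 * sem, observed + 1.96 * sem
print(f"SEM = {sem:.2f}; 95% CI = [{lo:.1f}, {hi:.1f}]")  # ~[93.4, 106.6]
```

Even under this generous reliability assumption, the interval spans roughly 13 points, which is why the between-instrument gaps of 10 points or more reported above cannot be dismissed as ordinary measurement noise, and why decisions hinging on a single IQ point are indefensible.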
Reliability: The influence of external and internal factors
There is an intertwinement between the individual and the environment (Dickens and Flynn, 2001). When the environment of the individual or the circumstances within the individual change, the IQ score changes with them, for instance due to adoption (te Nijenhuis et al., 2015), cognitive stimulation programs (DTT: Peters-Scheffer et al., 2013; Head Start: te Nijenhuis et al., 2014), psychological treatment (Malarbi et al., 2017), treatment for addiction (Stavro et al., 2013), and the use of a video game (Dirks, 1982). Motivation and the use of material rewards have also been found to affect IQ scores (Ackerman and Heggestad, 1997; Etherton and Axelrod, 2013; Phay, 1990). Research by Koegel et al. (1997) showed that the testing situation influences the performance of children with autism. The study compared a condition with conventional testing circumstances according to the manual with a condition suited to the needs of the child. For instance, when it was known that the child could not or would not sit at a table, the test was conducted on the floor. In this manner, adjustments were made for each individual child. Test outcomes showed significantly higher IQ scores in the adjusted condition. Moreover, Koegel et al. (1997) caution against making decisions based on standardized test outcomes for autistic children. Those test scores give a distorted picture of the cognitive abilities of the individual, which complicates the interpretation of intervention outcomes, because the baseline is not set correctly.
In line with these results are earlier findings by Fuchs and Fuchs (1986) concerning examiner bias. The authors carried out a meta-analysis covering 1489 subjects. The focus was examiner familiarity, a broad concept ranging from familiarity with the class of people the examiner belongs to, to actual acquaintance. The results showed that subjects who are familiar with the examiner obtain higher test scores. This was particularly evident for individuals from low socioeconomic backgrounds and when the subject had known the examiner for a long time. Fuchs and Fuchs (1986) state that test settings are not decontextualized: "[. . .] the effects of examiner familiarity demonstrate the importance of contextual factors in testing" (p. 257). Examiner bias was further assessed by McDermott et al. (2014) in a clinical sample; they concluded that "nearly all WISC-IV scores conveyed significant and nontrivial amounts of variation that had nothing to do with children's actual individual differences and that the Full-Scale IQ and Verbal Comprehension Index scores evidenced quite substantial assessor bias" (p. 207). Their conclusion is clear: FSIQ and VCI should not be used for differential diagnosis and classification, let alone as cut-off points for decision-making.
In the section on validity, we briefly discussed the work of Schmiedek et al. (2020). Their findings about the difference between group data (between-person differences) and individual data (within-person differences) apply not only to validity but also to reliability. One can imagine that the level of motivation, a headache, family problems, and the like all influence the cognitive performance of the individual. The authors draw attention to the day-to-day fluctuations in cognitive performance indicated by their research. Assessing intelligence at one moment in time, as all standardized intelligence tests do, will not capture that cognitive fluctuation. As the authors state: "If the aim is to describe, explain, and modify cognitive structures at the individual level, we need to measure and follow individuals over time" (Schmiedek et al., 2020, p. 22).
To conclude, it is acknowledged that the broader environment, the characteristics of the testing situation itself, and fluctuating characteristics of the individual all affect the reliability of the test outcome (see also Bronfenbrenner and Morris, 2006); therefore, strict prescriptions for the test-taking situation are presented in publisher manuals. The only reason to keep testing situations under control is to enable the diagnostician to compare test scores with the norm group and to rank the subject. Based on the above, however, we can safely conclude that it is a myth that testing situations can be kept under control, because some influencing factors work beneath the surface and it is not known beforehand which factors have a positive or negative influence on the individual. Second, even if environmental conditions could be kept under control, circumstances within the child, or environmental conditions impacting the child before the test is taken, can further threaten the reliability of IQ scores.
Reliability: The problem of ergodicity
As previously stated, the diagnostician intends to keep testing situations under control in order to compare an individual's test score with a norm group. This section discusses the reliability of using the norm group (group data) to draw conclusions about the individual. We already mentioned the work of Schmiedek et al. (2020), but so far we have not explained why between-person differences differ from within-person differences. When you want to know the true score of something, you must measure it a number of times and calculate the average (Lord and Novick, 1968). This average will only be reliable when the conditions under which the measurements were taken and the characteristics of the object or trait measured do not change over the course of the repeated measurements. Clearly, this poses significant challenges when measuring human traits or conditions, especially intelligence: it is not possible to take the same IQ test several times without remembering earlier assessments, and testing can never take place under exactly the same circumstances (due to fatigue, motivation, etc.). It is therefore impossible to determine individual confidence intervals and true scores directly.
To overcome this issue, the reliability estimates of a population are used to determine the confidence interval surrounding the "true score" of the individual (Molenaar, 2013). This procedure is based on Classical Test Theory, and almost all standardized tests in the field of psychology use this method. Loretan et al. (2019; see also Matarazzo, 1990) made an in-depth analysis of the reliability of this procedure. It is important to keep in mind that to draw reliable conclusions about the individual based on these group data, group members must meet two criteria (described in simplified form): 1) all members of the group are identical; 2) the members of the group do not change due to circumstances (Rose, 2016). This is called "ergodicity." Loretan et al. (2019) illustrate that everyone has his or her own confidence interval, a so-called propensity distribution. This interval is created by hypothetically measuring the individual repeatedly. The propensity distribution does not necessarily match the confidence interval of the test, which is based on the norm group (i.e., each member of the group measured only once). Individual propensity distributions may be considerably smaller or larger than the confidence interval of the group. Therefore, comparing the individual's test score with the group-based confidence interval might result in an over- or underestimation of the individual's test score. Because humans are not ergodic systems, using the confidence intervals of the norm group is, by definition, not reliable at the level of the individual.
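The point can be illustrated with a small simulation (ours; all numbers illustrative). Each simulated person has a stable true score but a person-specific amount of within-person variability; the single pooled confidence interval that a Classical Test Theory procedure hands out is then too wide for some people and too narrow for others:

```python
import numpy as np

rng = np.random.default_rng(2)
n_people, n_reps = 1000, 200

# A non-ergodic population: stable true scores, but the within-person
# spread (the propensity distribution) differs from person to person.
true_scores = rng.normal(100, 15, size=n_people)
within_sd = rng.uniform(1, 12, size=n_people)
scores = (true_scores[:, None]
          + within_sd[:, None] * rng.normal(size=(n_people, n_reps)))

# Group-based procedure: one pooled error estimate for everyone.
pooled_half = 1.96 * scores.std(axis=1).mean()
# Individual propensity distributions: per-person 95% half-widths.
indiv_half = 1.96 * within_sd

print(f"pooled CI half-width: {pooled_half:.1f} points")
print(f"pooled CI too wide for {np.mean(indiv_half < pooled_half):.0%} "
      f"and too narrow for {np.mean(indiv_half > pooled_half):.0%} of people")
```

Only if everyone shared the same propensity distribution (the ergodic case) would the pooled interval be correct for each individual.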
Discussion
In this article we have built our case that standardized intelligence tests are not reliable and valid and that they are not informative for drawing conclusions about the individual. This argument is not new. In the past, many researchers pointed to the controversy surrounding the intelligence concept and its measurement (Richardson, 2002; Schlinger, 2012; Tellegen, 2004; Weinberg, 1989). Researchers have warned against drawing conclusions about the individual using the average as a norm and pointed out the consequences of the ergodicity problem for the social sciences (Loretan et al., 2019; Molenaar, 2013; Rose, 2016). Researchers have advised following individuals over time (Schmiedek et al., 2020), considering one's own expertise (i.e., based on observations, experience, and knowledge) as decisive rather than standardized test outcomes (Tellegen, 2004), and combining multiple sources of information in clinical decision-making (Matarazzo, 1990; Neisworth and Bagnato, 2004). Nevertheless, scientists state that measuring intelligence is the biggest accomplishment of the social sciences (Detterman, 2014), and standardized intelligence tests are part of the standard repertoire in clinical decision-making and are still highly valued by experts (Rindermann et al., 2020; Bloemink, 2023). In Dutch policy, too, IQ is highly valued and functions as a doorway to (special) education and healthcare. In search of an explanation for these conflicting opinions, it is important to recognize that most of the knowns and unknowns about intelligence are based on group data. When non-linear research methods are used and individuals are followed over time, a different picture arises concerning factor g as well as the validity and reliability of intelligence tests. This largely explains the mixed research results. Different research methods, however, are only part of the answer. We believe that these research methods are rooted in different paradigms: those of biological determinism and complex systems theory, respectively. The problems with the validity and reliability of standardized intelligence tests will remain unsolvable as long as research is conducted from different paradigms. It is beyond the scope of this paper to elaborate on these two paradigms, but we will do so in a forthcoming paper.
Because our analysis takes a more balanced perspective on intelligence than is seen in Dutch policy, we would like to reflect in this final part on the gap between theory and practice. There are several reasons for this gap. The first concerns the unfounded value placed on CHC theory. In 2009, McGrew noticed that CHC theory was accepted much more quickly in the work field (publishers basing their tests on CHC theory) than in the scientific literature. Although scientific research can be practice-driven, it is worrisome when the work field outruns science, because of the risk that the work field continues to practice without scientific support. The second reason concerns the methods used in validation studies. According to Gould (1996), this can be traced back to the Stanford-Binet intelligence test.
The invalid argument runs: we know that the Stanford-Binet measures intelligence; therefore, any written test that correlates strongly with Stanford-Binet also measures intelligence. Much of the elaborate statistical work performed by testers during the past fifty years provides no independent confirmation for the proposition that tests measure intelligence, but merely establishes correlation with a preconceived and unquestioned standard. (p. 207)
A potential third reason for the gap between theory and practice could be the forgetting, abbreviation, or alteration of previously rich and profound theories of intelligence in subsequent generations, evidenced by the fact that Binet's warnings have been neglected (or forgotten?; see Lohman, 1997). A final reason might be that research results get lost in translation. Weinberg (1989) argued decades ago that psychometricians are not the only ones responsible for how IQ tests are used and how they are valued by society. He specifically calls on researchers to guide what happens with their research outcomes, how society interprets them and translates them into public policy, and to watch over that process. In 2020, Wai made the same call concerning the concept of intelligence. Detterman (2014) points to an inconvenient fact when it comes to intelligence research and universities. Nowadays, at most universities, students learn how to assess mental abilities by means of intelligence tests. They do not learn, however, about intelligence and the status quo of intelligence research. 7
Besides, most universities in the US, and in the Netherlands alike, do not have a faculty position for intelligence research (Wai, 2020). Students seem to have many misconceptions about intelligence, just like the general public does (Detterman, 2014; Wai, 2020). As Wai (2020, p. 8) puts it: “students, after all, are simply members of the public who happen to be in college,” and as Detterman (2014) argues: For graduate students in psychology and education, it is incomprehensible to me that they are sent forth to practice either clinical psychology or education knowing as little about intelligence as they do. Very often they have had only a single course that teaches them how to administer tests with very little instruction on what intelligence is or what scores on a test actually mean. To me, this is equivalent to training surgeons as technicians with no knowledge of anatomy or physiology. (p. 2)
The next statement by Nettelbeck and Wilson (2005) is typical of the take-home message authors tend to give: Thus, interpreting an individual's IQ score continues to require both art and science because of validity limitations. As a consequence, it should not be assumed that a given IQ score or profile of scores provides more than a guide to a relatively narrow range of capabilities for that individual at that point in time. . . . We continue to advocate the use of IQ and similar aptitude/achievement tests because, given current knowledge limitations, they are the best tools available for predicting important future educational and other significant life outcomes. (p. 611 and p. 626)
This statement exemplifies a common move in the intelligence literature: endorsing the use of intelligence tests despite all the caveats. The idea that they are the best tools available for predicting important future educational and other significant life outcomes may be true at the group level, but not at the individual level, as this paper has tried to clarify. This conclusion might leave the reader wondering about the alternative. We believe that validity and reliability will always be a problem when attempting to assess a person's mental faculties (e.g., cognitive, verbal, emotional, social, etc.). This has to do with the nature of human beings, wonderfully described as moving targets by Ian Hacking: We think of these kinds of people as given, as definite classes defined by definite properties. As we get to know more about these properties, we will be able to control, to help, to change, or to emulate them better. But it is not quite like that. They are moving targets because our investigations interact with the targets themselves, and change them. And since they are changed, they are not quite the same kind of people as before. The target has moved. (Hacking, 2007, p. 293)
This is all so true when assessing intelligence and determining the individual’s cognitive potential. Let us not forget that the assessor (or test publisher) defines the properties and the boundaries between normal and divergent intellectual development (Bosman, 2017), and that “. . . intelligence is a ‘relative’ or normative construct” (Ackerman, 2018, p. 2). We think that this awareness is not so prevalent today. A step forward would be to use any assessment of intelligence as a starting point rather than an endpoint. As an aid rather than a limit (see Gould’s citation at the start of this paper). With that in mind, we think the following assumptions might be worth considering, based on the cited articles:
- Assessment should provide insight into the test taker's thought processes, how to overcome learning difficulties, and how to transfer learning strategies to other domains (see Jeltova et al., 2007).
- Assessment should include and describe the influence of the assessor on the testing situation and test outcome.
- The testing situation should be beneficial to the performance of the test taker rather than hampering it. Therefore, the testing situation should shed light on what is needed from the environment to reveal the test taker’s intellectual potential.
- Test scores should be considered relative, describing an individual’s path of development, rather than a comparison between the individual and the norm group.
- The ergodic fallacy can be overcome by letting go of IQ scores and using the instrument solely as an observational tool.
Dynamic assessment (DA) meets the majority of these assumptions, and we refer interested readers to the work of Prof. Reuven Feuerstein, Prof. David Tzuriel, and, in the Netherlands, Prof. Wilma Resing. There are indications that DA has more predictive value in educational settings than IQ (Jeltova et al., 2007). Dynamic assessment, however, brings along other issues to discuss, such as how to conceptualize the construct and how to objectify, indicate, and generalize test results (Beckmann, 2014; Jeltova et al., 2007). Validity also remains a complex issue in dynamic testing (see Beckmann, 2014 for an in-depth discussion). We therefore believe it is important to first initiate a debate that focuses on fundamental questions, such as how to study human beings and how to reintegrate the assessor into the equation, and not so much on alternative (static) assessments.
Conversations with the Ministry of Health, Welfare and Sport revealed that policy makers follow common practices in the work field (I. R. Claassen, personal communication, June 5, 2023), meaning that Dutch policy might change when the work field wants to follow a different direction. However, the majority must lend their support to this direction; it requires a bottom-up process. We therefore call on interested readers to respond to this matter and share their opinions in a scientific debate. In the meantime, we are conducting research on fundamental questions such as how intelligence is perceived by diagnosticians, and on the various diverging views on humanity and their consequences for clinical decision-making.
We would like to recap and conclude with the following: considering that 1) a general ability of intelligence might not exist as a psychological construct, and 2) a valid and reliable measure of intelligence is not possible, we must conclude that we see no justification for the use of standardized intelligence tests, let alone IQ scores, in clinical decision-making about the individual. As evidenced by the cited articles, utilizing these assessments puts individuals at risk of misinterpretation of their cognitive abilities.
Footnotes
Acknowledgements
We thank Dr. M Radstaake and Dr. S R J M Deckers at Radboud University for their valuable comments on an earlier draft of this paper. We would also like to thank the reviewers for their encouraging evaluation of the manuscript.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
