Abstract
In response to Van Hoogdalem and Bosman (2024) who advocate abandoning the use of intelligence tests at the individual level, we argue that their conclusions are too absolute and insufficiently substantiated. While we acknowledge some of the concerns they raise regarding measurement error and contextual influences, we contend that the complete dismissal of standardized intelligence tests overlooks their practical utility. Drawing on empirical evidence and established psychometric theory, we demonstrate that intelligence tests provide valuable information for clinical and educational decision-making. We further argue that replacing standardized testing with purely non-standardized methods introduces greater subjectivity and risk of error, ultimately undermining the quality of professional judgments. We conclude that intelligence tests remain a valuable source of information in many cases of clinical and educational decision-making, when interpreted carefully and integrated with other relevant sources of valid information using predefined decision rules.
Introduction
In their recent article, Van Hoogdalem and Bosman (2024) conclude: “Considering that (1) a general ability of intelligence might not exist as a psychological construct, and (2) a valid and reliable measure of intelligence is not possible, we must conclude that we see no justification for the use of standardized intelligence tests, let alone IQ scores, in clinical decision-making when assessing the individual. As evidenced by the cited articles, utilizing these assessments puts individuals at risk of a misinterpretation of their cognitive abilities.”
While we acknowledge some of the inherent limitations of intelligence testing, we contend that the authors’ conclusions are overly absolute and one-sided. By dismissing standardized intelligence tests altogether and advocating for non-standardized assessment, the authors discard a valuable diagnostic tool and run the risk of decreasing rather than increasing the quality of individual clinical decision-making. The reason is that clinical judgment, where the “judge” combines data using informal methods, is notoriously less accurate than mechanical prediction, where decisions are based on predefined decision rules grounded in empirical evidence (Grove et al., 2000; Grove and Meehl, 1996; Meehl, 1954).
In this article, we follow the call of Van Hoogdalem and Bosman (2024) for a scientific debate by critically examining their key arguments. We will start by discussing areas where we share concerns about intelligence tests. Subsequently, we address points of disagreement. Specifically, we will discuss the misconception that findings at the group level are irrelevant for individual assessment and explain why group-level findings are essential for informing clinical decisions about individuals. Clinical decision-making inevitably involves prediction, so a faultless decision for every individual is an unattainable ideal. Consequently, the goal should be to minimize the proportion of erroneous decisions across individuals. We will argue that non-standardized assessments—suggested by the authors as an alternative to intelligence tests—can be expected to be much more susceptible to measurement error and thus enlarge the risk of erroneous decisions. Finally, we will maintain that, despite their limitations, intelligence tests remain a valuable source of information in many cases of clinical decision-making, provided that they are interpreted carefully and integrated with other relevant sources of valid information using predefined decision rules. We offer practical suggestions for how to achieve this in practice.
Where we agree
Some of the concerns expressed by Van Hoogdalem and Bosman (2024) on intelligence tests resonate with us. These concerns have been acknowledged in the literature for quite some time, as nicely highlighted by Tellegen (2004). We share the concern about an absolute interpretation of test results, particularly when fully ignoring measurement error. Any psychological measurement is influenced by measurement error. This means that an observed score may deviate from the “real score.” We use the term “real score” to describe a score that expresses the position of the individual on the construct of interest. To make clear that this concept is not bound to classical test theory, we intentionally avoid the term “true score” here. A complicating factor is that this “real score” refers to a latent variable—one that cannot be observed directly—and that the latent variable itself reflects a psychological construct that is notoriously difficult to define, and thus difficult to measure.
As Van Hoogdalem and Bosman (2024) rightfully indicate, various factors beyond intelligence can influence observed scores on intelligence tests. They discuss both external and internal factors, such as examiner familiarity, testing conditions, and the individual’s state at the time of testing, all of which can undesirably impact scores in ways unrelated to actual cognitive ability.
These influences contribute to measurement error, and thus to the uncertainty surrounding observed scores.
When different intelligence tests are administered to the same individual, differences in observed scores are to be expected. This is due not only to measurement error inherent in each test, but also to structural differences between the tests themselves. That is, tests vary in the specific tasks administered and in how they define and operationalize intelligence (Ruiter et al., 2017). As a result, what is labeled as “intelligence” can vary from one test to another. Although intelligence tests tend to correlate moderately to highly with each other (say 0.40–0.80, not corrected for measurement error; Ruiter et al., 2017), their scores are certainly not entirely interchangeable.
When interpreting the observed scores on an intelligence test, uncertainty due to measurement error and structural differences between tests needs to be taken into account. We therefore agree with Van Hoogdalem and Bosman (2024) that the sole use of strict IQ cut-offs for decisions related to care eligibility and school placement is inappropriate. This is especially true when the observed score is very low (e.g. IQ < 70), because very low (and very high) test scores are inherently less reliable.
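The role of measurement error in interpreting a single observed score can be made concrete with the classical standard error of measurement, SEM = SD × √(1 − reliability). The sketch below (in Python; the function name and the reliability value of 0.90 are our own illustrative assumptions, not drawn from any test manual) shows why a strict IQ-70 cut-off is problematic:

```python
import math

def iq_confidence_interval(observed_iq, reliability, sd=15.0, z=1.96):
    """95% confidence interval around an observed IQ score.

    Uses the classical standard error of measurement:
    SEM = SD * sqrt(1 - reliability).
    """
    sem = sd * math.sqrt(1.0 - reliability)
    return observed_iq - z * sem, observed_iq + z * sem

# Hypothetical example: observed IQ of 68 on a test with reliability 0.90.
low, high = iq_confidence_interval(68, reliability=0.90)
print(f"95% CI: [{low:.1f}, {high:.1f}]")  # 95% CI: [58.7, 77.3]
```

Under these assumptions, the interval around an observed score of 68 spans roughly 59 to 77 and straddles the IQ-70 cut-off: exactly the situation in which a strict threshold decision is unwarranted.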
Tellegen (2004) also stresses that the norms of an intelligence test must be sound: norm groups should be well-defined, and the chosen norm group must be representative. Especially at the extremes of the scale, such as IQ values below 70 and above 130, scores are sensitive to shortcomings in the norms. Tellegen further comments on the aging of norms, which must be monitored, as well as on test bias. We agree with these points. It is no coincidence that the Dutch Committee on Tests and Testing (COTAN; Evers et al., 2010) imposes strict requirements on tests. Whenever an intelligence test is released, it must be scientifically demonstrated that these requirements are met.
Where we disagree
The one-sided view on validity
A widely cited quote from Messick (1989: 13) captures the modern view on validity: “Validity is an overall evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of interpretations and actions based on test scores.”
Messick thus emphasizes that validity concerns the practical usefulness of test score interpretations, rather than ontological questions about whether the underlying construct “really” exists. In line with this perspective, we prefer to focus on the practical usefulness of intelligence tests, rather than engaging in philosophical debates about whether intelligence truly exists—and if so, how it should be defined. Others, like Nettelbeck and Wilson (2005)—also quoted by Van Hoogdalem and Bosman (2024)—adopt a similar approach.
Van Hoogdalem and Bosman (2024), by contrast, appear to endorse the validity framework advanced by Borsboom et al. (2004), which posits that valid measurement requires a construct to exist as a causally coherent entity. Applying this framework, they argue that intelligence lacks ontological definability and is therefore inherently unmeasurable. However, their critique is logically incomplete: while rejecting intelligence as a measurable construct, they propose dynamic assessment as an alternative means to measure “learning potential”—a concept that is itself latent, poorly defined, and arguably just as ontologically problematic. Like intelligence, learning potential fails to meet Borsboom et al.’s (2004) stringent validity criteria, which demand a clear causal link between the construct and its observable indicators. This unresolved tension undermines their broader argument, as their proposed alternative replicates the very epistemological flaws they attribute to intelligence. Although the ontological discussion is very interesting, Messick’s more practical view on validity therefore seems more appropriate here.
The interpretation of differences in intelligence test scores
Van Hoogdalem and Bosman (2024) argue that intelligence test scores are unreliable because they fluctuate over time due to individual characteristics and varying test conditions. While we acknowledge that some variability exists, research shows that intelligence test scores tend to stabilize with age (Eichelberger et al., 2023a, 2023b). In particular, intelligence test scores become relatively stable from childhood and adolescence onwards (Tucker-Drob and Briley, 2014), and are rather stable at successive periods from infancy (i.e. from 1 year of age) to early adolescence (i.e. up to 17 years old; Yu et al., 2018).
Further, it is important to recognize that variability in observed scores does not imply that a measurement is meaningless. Many common metrics—such as weight or blood pressure—also fluctuate depending on situational factors, but remain valuable for assessing general health. For example, a person’s weight may differ between morning and evening, or over longer time intervals. Likewise, blood pressure typically rises when the situation is stressful. Still, these variations do not invalidate weight or blood pressure as useful indicators of physical health. In the same way, an intelligence test score—though subject to situational influences—can still provide meaningful information, particularly when interpreted in context and alongside other relevant sources of information.
Hence, the examples cited by Van Hoogdalem and Bosman (2024) as evidence for poor criterion validity—due to variation in observed intelligence test scores—are not convincing. For example, in the study of Habets (2015), different tests were administered at different time points. However, the candidates’ developmental changes over time were not controlled for, making it plausible that the observed differences in scores or score categories reflect both measurement error and natural development. Similarly, if an individual’s weight is measured months apart, a shift from “obese” to “overweight” on the BMI scale does not invalidate the concept of weight measurement—it merely reflects change over time.
In the same reasoning, the fact that intelligence is not entirely stable over time does not render intelligence testing useless. It simply implies that interpretations must be made carefully and with appropriate context. Test results may vary based on the time of day, recent experiences, or testing conditions—just like weight or blood pressure. However, as with physical measures, there are likely performance boundaries. If the goal is to assess an individual’s optimal performance—as an indicator of their actual potential—it would be sensible to administer multiple (different) tests over time, at moments when the individual is well-prepared and in optimal condition. Selecting the best performance while taking into account the measurement error and retest effects could offer a more balanced and accurate representation than a single assessment, and directly addresses many of the concerns brought up by Van Hoogdalem and Bosman (2024).
While the preceding discussion appears to endorse test-retest protocols, it is critical to acknowledge that such practices carry the risk of introducing learning or memory effects. This is particularly relevant given that intelligence testing aims to assess an individual’s ability to engage with novel tasks or unfamiliar scenarios, thereby measuring cognitive adaptability and problem solving in new situations. To mitigate this risk while ensuring a comprehensive evaluation, utilizing diverse intelligence assessments becomes essential. Doing so not only reduces practice effects, but also helps to illuminate distinct cognitive strengths and strategic approaches across varied domains.
The misconception of ergodicity
Van Hoogdalem and Bosman (2024) criticize the reflective model of intelligence measurement, arguing that intelligence cannot be fully captured by a single, underlying factor, because it depends on the specific tasks and contexts in which it is measured. However, we argue that intelligence tests can still provide meaningful insights, especially when the scores are interpreted as reflecting specific cognitive functions rather than a singular, global factor—an interpretation that is consistent with a so-called formative model.
Intelligence is a multifaceted construct that extends beyond cognitive abilities alone. It encompasses personal characteristics that enable individuals to function effectively in their environment, including, for example, perseverance, emotion regulation, delay of gratification, and time management. This broader conceptualization is reflected in Wechsler’s definition of intelligence as “the aggregate or global capacity of the individual to act purposefully, to think rationally, and to deal effectively with his environment” (Wechsler, 1944: 3). Nevertheless, most intelligence tests primarily focus on cognitive aspects. Instruments such as the WAIS-IV (Wechsler, 2008) and the IDS-2 (Grob and Hagmann-von Arx, 2018a) assess a range of cognitive functions, such as visual processing, long-term memory, processing speed, short-term memory, and reasoning, and combine these into an overall IQ score. Although they differ in specific tasks and subtests, these tests share a common theoretical framework that justifies their interchangeable use in individual assessments.
A key feature of these intelligence tests is that their scores are norm-referenced: they express an individual’s performance relative to a reference population. For example, the overall IQ score is typically standardized to a normal distribution with a mean of 100 and a standard deviation of 15 within a general population of a certain country or language area of the same age as the individual tested. Since individuals are compared to others of the same age, intelligence test scores are controlled for age-related cognitive development and thus free from effects of development across the age span.
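The norm-referenced transformation described above can be sketched as follows. This is a deliberately simplified linear conversion with hypothetical norm-group statistics; actual test publishers typically use normalized (area) conversions per age band:

```python
def raw_to_iq(raw_score, norm_mean, norm_sd):
    """Convert a raw test score to a deviation IQ (mean 100, SD 15),
    relative to the statistics of the age-appropriate norm group."""
    z = (raw_score - norm_mean) / norm_sd  # standing relative to peers
    return 100.0 + 15.0 * z

# Hypothetical norm-group statistics for one age band: mean 50, SD 10.
print(raw_to_iq(65, norm_mean=50, norm_sd=10))  # 122.5
print(raw_to_iq(50, norm_mean=50, norm_sd=10))  # 100.0
```

The point of the sketch is simply that an IQ score encodes a position relative to same-age peers, which is why it is automatically controlled for age-related development.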
Importantly, the cognitive functions and the overall IQ score are derived from studies involving groups of individuals. This between-person structure is consistently observed across studies (e.g. Grob and Hagmann-von Arx, 2018b; Wechsler, 2008). Thus, IQ scores effectively summarize an individual’s cognitive performance relative to others.
A critical question, however, is whether the same structure can be used to track changes within individuals over time. This requires the within-person structure to be equivalent to the between-person structure (Voelkle et al., 2014). Schmiedek et al. (2020), using intensive longitudinal data (100 observations per individual), showed that the within-person structure of cognitive performance deviates from the between-person structure, so this equivalence cannot simply be assumed.
Because of this limitation, Van Hoogdalem and Bosman (2024) argue for dismissing the overall IQ score entirely. In our view, this is a misinterpretation. Intelligence tests are designed to assess between-person differences and are therefore valuable for predicting behavior relative to others. However, as Schmiedek et al. (2020) showed, a person’s test score on a given day is a poor predictor of their performance on a different task or day. The key issue is thus whether this relative-to-others perspective is informative in clinical decision-making.
Van Hoogdalem and Bosman (2024) question this approach by invoking the concept of ergodicity. Basically, ergodicity refers to the condition where a summary measure calculated across individuals is identical to the same measure calculated within individuals. A lack of ergodicity implies that, for example, the average developmental trajectory derived from a group does not reflect the developmental path of any individual in that group.
However, the absence of ergodicity is not inherently problematic if we are examining a snapshot in time. At a single time point, the individual’s relative standing compared to the norm group remains meaningful. The fact that not all, or even any, individuals follow the mean group trajectory does not invalidate comparisons between individuals at a given moment.
Consider, for example, the child growth charts used in youth healthcare services (e.g. Dutch context: TNO, 2022; international context: WHO, 2025). These charts display growth curves (e.g. height, weight) for various percentiles and help physicians identify abnormal development. Despite individual variability and changing group composition, these charts remain essential in pediatric health care. Early identification of growth-related disorders or problems based on these charts has led to life-saving interventions.
According to the logic of Van Hoogdalem and Bosman (2024), such growth charts should not be used, since their two main criticisms—individual variability and non-identical development trajectories—also apply here. Yet, this ignores a fundamental point: users of these charts understand that deviations are expected, but that extreme deviations may warrant concern. This principle also applies to cognitive development over time. Good assessment tools—whether for height, weight, or cognitive ability—are not designed to indicate causes, but to signal deviations from normative patterns. The causes of those deviations must be investigated separately. Nonetheless, the measurements themselves are not meaningless. The same holds true for intelligence test scores when used for evaluating cognitive functioning relative to others at a specific time point.
In sum, if we were to follow Van Hoogdalem and Bosman’s (2024) reasoning consistently, physicians should abandon standardized growth charts and rely solely on subjective clinical judgment. They argue that scores should only be interpreted within the individual’s developmental context and reject norm-referenced frameworks. This would mean discarding well-validated tools that offer critical insights—something we believe is both unnecessary and unwise. We will delve deeper into this in the next section.
The usefulness of norms for measuring individuals
Dynamic testing, as advocated, can well be thought of and implemented in a standardized and quantitative manner (Beckmann, 2014). Standardized procedures, rather than ones based solely on clinical judgment, are also promoted (Jeltova et al., 2007). We agree that some of the suggestions mentioned in the discussion by Van Hoogdalem and Bosman (2024) can be considered important ways in which established testing procedures could be improved by considering the dynamic nature of individual differences in cognitive ability and performance. However, we disagree with their statement advocating “. . . letting go of IQ.”
These standardized benchmarks, rigorously derived from empirical data, provide critical reference points for interpreting individual and group outcomes, ensuring comparability across studies while mitigating subjective biases. In developing such norms, test constructors follow rigorous methodological standards to ensure reliability, validity, and fairness. Nevertheless, Van Hoogdalem and Bosman (2024) argue that test scores should only be interpreted within the context of an individual’s developmental path. Yet, how is such interpretation supposed to take place in practice—particularly in the absence of a normative framework? Implicitly, the answer is: by the practitioner.
Leaving the interpretation of the test result to the practitioner raises critical questions. On what basis does a practitioner draw conclusions about an individual’s functioning when no reference scores are available? Van Hoogdalem and Bosman (2024) cite work (e.g. Schmiedek et al., 2020) to support a reliance on the practitioner’s expertise and contextual knowledge. Yet, the norms they reject are themselves based on carefully collected data—data derived from structured observations, cumulative experience, and theoretical knowledge. In interpreting scores, scientifically sound norms, based on a clearly defined norm population and a representative sample, are then replaced by the experience of the practitioner. This approach thus risks substituting systematically gathered scientific knowledge with personal, subjective judgment.
It is important to note that the lack of a well-defined norm population is just one of several issues that Tellegen (2004) identifies as problematic. The experiences of practitioners are generally not representative, especially since practitioners typically see relatively many children who need the particular care in question. Such a bias is also clear in the authors’ claim that intelligence tests may only be appropriate for “a small group of typically developing human beings,” while suggesting that such tests are not useful for individuals with conditions such as frontal-lobe damage, autism, or intellectual and developmental disabilities. Fortunately, however, the vast majority of children in the population do not face such limitations. It may be that in the authors’ daily work the percentage of typically developing children is smaller, which would explain their bias. It would also exemplify the need for a well-defined norm population rather than the practitioner’s experience as the ultimate norm.
Note also that while we agree that a minimal condition for a meaningful test administration is that the child should be able to understand and follow the instructions, we do not agree that all children in the groups the authors mention should therefore be excluded from taking an intelligence test. We discuss this issue further in the section on the target groups of intelligence tests.
The proposed approach to interpreting scores by Van Hoogdalem and Bosman (2024) stands in direct contradiction to their own concern (p. 15) that the professional field is advancing ahead of scientific insight—a risk, they argue, that leads to unsupported practice. Strikingly, on the very next page (p. 16), they suggest that policy should follow the professional field’s current practices. These statements appear inconsistent.
By rejecting the norm-referenced interpretation and discouraging comparisons with standardized groups, Van Hoogdalem and Bosman (2024) disregard decades of empirical research. In their seminal paper, Grove and Meehl (1996) systematically address the objections raised against actuarial (or mechanical) prediction—demonstrating its superior accuracy compared to unaided clinical judgment. One of the central fallacies Grove and Meehl (1996) describe closely resembles Van Hoogdalem and Bosman’s (2024) reasoning: the claim that “Statistical predictionists aggregate, whereas we seek to make predictions for the individual, so the actuarial figures are irrelevant in dealing with the unique person” (Grove and Meehl, 1996). Grove and Meehl (1996) convincingly dismantle this objection. Consider a medical analogy: suppose a patient can choose between two medical procedures. Among patients with similar clinical characteristics, one has a success rate of 90%, the other of 10%. Of course, these statistics do not guarantee the outcome for any single individual—someone may still fail under the high-success treatment or recover under the low-success one. Still, faced with this choice, would one prefer to have this statistical information—or not? Van Hoogdalem and Bosman’s (2024) position suggests that such statistical information should be disregarded, because it may not apply precisely to any given individual. We argue the opposite: these statistics offer a rational basis for making probabilistic decisions under uncertainty. Rejecting them removes one of the most powerful tools available for informed decision-making.
This also means that we disagree with Van Hoogdalem and Bosman’s (2024) claim that “because humans are not ergodic systems, using confidence intervals of the norm group are, by definition, not reliable at the level of the individual” (p. 16). If this were true, the same objection would apply to every norm-referenced measure, including the growth charts discussed above, and virtually all normative information in medicine and psychology would have to be discarded as uninformative for individuals.
In conclusion, while individual developmental context certainly matters, it should not replace scientifically validated norms, but complement them. Discarding normative frameworks undermines both the empirical basis of psychological assessment and the transparency of clinical reasoning. We believe that test scores grounded in normative data continue to offer essential, probabilistic information that supports responsible, evidence-based decision-making.
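The kind of predefined decision rule we advocate, in the spirit of Grove and Meehl’s (1996) mechanical prediction, can be illustrated with a deliberately simplified sketch. All thresholds and variable names below are hypothetical and for illustration only; real eligibility criteria must be grounded in empirical validation:

```python
def eligible_for_support(iq_ci_upper, adaptive_score):
    """Illustrative mechanical decision rule (hypothetical thresholds):
    combine two sources of information via a predefined rule rather
    than informal clinical judgment.

    iq_ci_upper    -- upper bound of the IQ confidence interval
    adaptive_score -- standardized adaptive-behaviour score
    """
    return iq_ci_upper < 75 and adaptive_score < 70

print(eligible_for_support(iq_ci_upper=72, adaptive_score=65))  # True
print(eligible_for_support(iq_ci_upper=80, adaptive_score=65))  # False
```

The point is not the particular cut-offs but the mechanism: every piece of information enters the decision in a fixed, transparent, and replicable way, which is exactly what unaided clinical judgment fails to guarantee.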
Target groups of intelligence tests
Apart from doubting the usefulness of test norms more broadly, Van Hoogdalem and Bosman (2024) argue that intelligence tests are unsuitable for children with developmental disabilities. They claim that standardized psychological tests “cannot be used validly and reliably with specific groups of patients or clients” (p. 11), and therefore conclude that such tests are only appropriate for typically developing individuals. However, by entirely discarding the use of intelligence testing in these populations, they risk denying children with developmental disabilities access to an instrument that could support understanding and inform care. In addition, it is worth noting that test authors often investigate the reliability and validity of their instruments for use in specific groups. Some tests can indeed be administered to specific groups and lead to interpretable scores (Grob and Hagmann-von Arx, 2018b; Hendriks et al., 2018).
While we acknowledge that (certain) tests may be inappropriate for some populations, the suggestion that intelligence tests are only suitable for typically developing individuals is inaccurate. Each intelligence test is designed with a specific target group in mind and its manual should clearly define the intended population and purpose. For example, the Dutch version of the IDS-2 (Grob and Hagmann-von Arx, 2018b) explicitly outlines the constructs it intends to measure, the intended age range, and the relevant application contexts—illustrating the diverse and valuable roles intelligence tests can play in both clinical and educational practice.
That said, test interpretation must always take into account factors such as test conditions and the child’s state during testing, which can influence the reliability and validity of outcomes. Observations made during test administration are therefore critical supplemental data for decision-making.
In an ideal situation, all aspects of assessment—including environment, content, item presentation, and responses by the testee—should be equally accessible to all individuals within the test’s target population (Lovett and Lewandowski, 2015). While this ideal of full accessibility cannot always be realized, test accommodations (adaptations of the administration procedure that leave the measured construct intact) can bring practice considerably closer to it.
Test publishers often provide specific guidance for implementing such accommodations. Some even develop entirely separate versions, such as the Bayley-III Special Needs Addition (Ruiter et al., 2014), which was created to address the limitations of the standard Bayley-III for children with certain impairments. Research has shown that the SNA improves test validity for these children with specific impairments (Visser et al., 2013).
In individual cases, an evaluation of the suitability of the instrument always needs to be made for each child to ensure a sufficient level of reliability and validity. This requires information about the test fairness of an instrument, which can be studied using the concept of measurement invariance (MI). MI holds if two persons from different target groups, but with the same underlying ability that the test is supposed to measure, obtain equal results (Wicherts, 2016). Only if MI holds will the test result reflect the latent (“true”) ability similarly for children from these target groups. Related terms are measurement bias (Millsap, 2011) when MI does not hold and test fairness (e.g. Mickley and Renner, 2015) when it does.
Nevertheless, intelligence tests—even when adapted—may not be appropriate for all individuals. For example, children with very low levels of cognitive functioning may fall outside the scope of all standardized instruments due to issues like floor effects and reduced reliability. In these cases, alternative assessments tailored to the specific developmental level and needs of the child are likely to be more effective.
Non-standardized assessment as the new intelligence test
Van Hoogdalem and Bosman (2024) advocate for the use of alternative methods, such as dynamic assessment (DA) and qualitative observation, suggesting that these offer more accurate and fair evaluations than standardized intelligence tests. We agree that alternative methods may offer this, but only insofar as the method is based on a form of standardization.
DA is “an umbrella term that refers to a wide range of approaches typified by the provision of instruction and feedback as part of the testing process” (Elliott et al., 2018: 8). It focuses on a person’s learning potential rather than on static performance (Tzuriel, 2021). A common format within dynamic assessment is dynamic testing, which typically involves a test-intervention-retest format to measure how much support an individual needs to improve. While these methods can offer valuable insights, many of the discussed limitations of intelligence tests apply to DA as well.
In many applications of DA, qualitative observation plays an important role. Qualitative observation gathers information by describing the qualities, characteristics, or properties of something without using numbers or measurements. Instead of focusing on “how much” or “how many,” the focus is on what something is like (e.g. a child frequently approaches others). Qualitative observation relies on expert interpretation of behavior during tasks (Kirk and Miller, 2011), and has its own challenges and limitations.
A key concern with qualitative observations is their reliability and validity (Kirk and Miller, 2011). Qualitative observations often lack a structured observation procedure. Consequently, they are inherently subjective and susceptible to large intra-rater and inter-rater variability (Boeije and Bleijenbergh, 2023). In addition, when observations are not done in a standardized way, based on proper norm data and validation research, assessors may unintentionally allow personal expectations and contextual factors to influence their judgments (Hasselhorn and Gold, 2013). This can introduce various biases, such as confirmation bias (where assessors unconsciously focus on behaviors that confirm their pre-existing beliefs about an individual’s cognitive abilities) and the halo effect (where a positive impression in one area, such as social skills, leads to an overestimation of cognitive abilities, or vice versa; Kaplan and Saccuzzo, 2018).
In general, qualitative observation and DA can thus heavily depend on the assessor’s interpretation of interactions during the assessment, which yields a possible source of examiner bias. In particular, cultural bias may be an issue, as the interactional nature of DA may reflect cultural norms around communication, help-seeking, and authority (Cannata et al., 2024; Wilby et al., 2017). There is also a considerable burden on the assessor’s ability to both interact with and objectively evaluate the test-taker—something humans are demonstrably poor at (e.g. Grove and Meehl, 1996; Kahneman et al., 1982). In addition, Van Hoogdalem and Bosman (2024) assume that DA reveals insight into the test-taker’s thought process. Yet, the ability to verbalize these thought processes varies widely depending on factors such as verbal ability and cultural background. Some individuals are more verbally reflective than others, and similar thought processes can be verbalized in very different ways, or different processes in similar ways. Naturally, when thought processes must be derived from non-verbal cues, the interpretation of the examiner plays an important role as well, which again can lead to significant bias.
Van Hoogdalem and Bosman (2024: 8) consider possible examiner bias an important issue in standardized testing. We agree, but we think it could be even more of an issue in DA. Even in a highly standardized DA, the assessor plays a role, just as in standardized IQ tests: assessors have to focus on the administration procedure and describe the actions of the child. Van Hoogdalem and Bosman (2024) note that the influence of the assessor on the testing situation and test outcome should be included and described. However, reflecting on the child being tested while simultaneously reflecting on one’s own behavior with the necessary level of objectivity is a difficult task. Hence, the impact of the assessor can be very large.
In addition to the inclusion and description of the assessor’s behavior, substantiated guidelines should be available on how this information should affect the interpretation of the test result and the decision that follows. Grove and Meehl (1996) have shown that mechanical (i.e. algorithmic) combination of information leads to more accurate decisions than clinical judgment, which is prone to inconsistency and bias. Hence, relying on unstructured assessment methods can reduce the reliability and fairness of decision-making.
Thus, for a fair use of DA and qualitative observation, it is essential that the assessor is thoroughly trained (Cohen and Swerdlik, 2018: 443). Also, to enhance transparency, clear protocols need to be provided and followed (Kapiszewski and Karcher, 2021). The protocol should specify what behaviors or responses are observed, using structured tasks or checklists where possible, and include both qualitative and quantitative data. Bias mitigation strategies must be incorporated, such as pre-defining hypotheses, real-time recording, and multiple observers (Hallgren, 2012; Haven and Van Grootel, 2019) to reduce confirmation or halo effects. Observations from different contexts and modalities should be systematically integrated, with professionals documenting how each piece of evidence informs their overall judgment, including any uncertainties or conflicting information. Finally, all observations, scoring, and reasoning should be thoroughly recorded, ensuring that the decision-making process is transparent, replicable, and open to evaluation.
Standardized tests, including intelligence tests, address some of these limitations by offering objective, replicable metrics. A promising bridge between qualitative observation and standardization is found in tools like the Guide to the Assessment of Test Session Behavior (GATSB; Glutting and Oakland, 1993), a 29-item, norm-based behavior rating checklist. During each test session of the Wechsler Intelligence Scale for Children (WISC-III), a child’s behavior was rated on a three-point scale (usually applies, sometimes applies, doesn’t apply). The ratings were then summed for three factor-based scales—avoidance, inattentiveness, and uncooperative mood—as well as a total scale, and converted into standard scores (with a mean of 50 and a standard deviation of 10). Thus, this structured, norm-based checklist quantifies behavioral data collected during test administration, offering reliable supplementary insights.
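The scoring logic of such a norm-based checklist can be made concrete in a short sketch. The item wordings, scale compositions, and norm statistics below are hypothetical and purely illustrative; only the general procedure (summing three-point item ratings per scale and converting the raw sum to a standard score with mean 50 and SD 10) follows the description above.

```python
# Illustrative sketch of norm-based checklist scoring in the style of the
# GATSB. Item assignments and norm values are hypothetical examples.

RATING = {"doesn't apply": 0, "sometimes applies": 1, "usually applies": 2}

# Hypothetical norm statistics: (mean, SD) of raw scores in the norm group.
NORMS = {
    "avoidance": (6.0, 3.0),
    "inattentiveness": (8.0, 4.0),
    "uncooperative mood": (4.0, 2.5),
}

def standard_score(raw: float, norm_mean: float, norm_sd: float) -> float:
    """Convert a raw scale score to a standard score (M = 50, SD = 10)."""
    return 50 + 10 * (raw - norm_mean) / norm_sd

def score_scale(ratings: list[str], scale: str) -> float:
    """Sum three-point item ratings and norm-reference the raw sum."""
    raw = sum(RATING[r] for r in ratings)
    mean, sd = NORMS[scale]
    return standard_score(raw, mean, sd)

# A child rated on four (hypothetical) avoidance items:
t = score_scale(
    ["usually applies", "sometimes applies",
     "doesn't apply", "sometimes applies"],
    "avoidance",
)
print(round(t, 1))  # a raw sum of 4 against norms (6.0, 3.0)
```

The key point of the sketch is that, once the norms are fixed, two assessors who record the same ratings necessarily arrive at the same standard score; the subjectivity is confined to the item ratings themselves.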
Note also that not all forms of DA are exclusively qualitative. For example, DA can generate quantitative outcomes (e.g. gain scores between pre- and post-tests, or ratings of the level of support required). Moreover, in some approaches the prompts and levels of mediation are protocolled in advance and administered at fixed points in the procedure, making the process more standardized and less reliant on open-ended interpretation.
Moreover, rather than serving as a replacement, DA is best viewed as a complement to standardized testing. Where traditional tests assess static performance, dynamic methods provide insights into cognitive processes, such as learning potential, instructional responsiveness, and problem-solving strategies. Adding protocols, and thereby increasing the degree of standardization of DA, could possibly help alleviate some of the limitations discussed. In research on DA so far, its validity and its potential to form a link between assessment and intervention have not yet been supported (Elliott et al., 2018). Further research is thus needed to determine how well DA outcomes predict real-world learning and under what conditions. While Van Hoogdalem and Bosman (2024) may not have rejected this view, their paper left us wondering what their position is on this point.
Finally, when it comes to dichotomous decisions on, for example, access to a specific type of education or care, the challenge of a fair decision remains. Van Hoogdalem and Bosman (2024) start their paper with justified criticism of gatekeeper diagnostics (“
Conclusion
While we acknowledge that some concerns raised by Van Hoogdalem and Bosman (2024) are valid, several of their conclusions seem to be based on debatable assumptions and appear to reflect underlying biases. For instance, their assertion that standardized intelligence tests are only appropriate for a narrow group of typically developing individuals overlooks the wide variety of populations for which such tests have been specifically developed and validated. Although we agree that certain individuals require tailored assessments, we challenge the notion that this group is so large that only a small minority remains for whom the tests are appropriate. The bias in their reasoning stems from generalizing findings from their specific research population to the broader population—a methodological leap that is not empirically justified.
In this paper, we have sought to critically and systematically address their claims using evidence-based reasoning. We argue that, although intelligence tests have limitations, these do not justify their outright rejection in clinical or educational contexts—a position implicitly or explicitly advocated by Van Hoogdalem and Bosman (2024).
Importantly, intelligence test results are seldom, if ever, interpreted in isolation when professionals make decisions about individuals. Whether for school placement, diagnostic evaluation, or personnel selection, professionals typically draw upon a broad range of data sources, including test scores, contextual and developmental information, behavioral observations, and informant reports. A key question, then, is not whether an intelligence test score should stand on its own, but how it is combined with these other sources of information.
We strongly advocate for the use of mechanical methods of data integration—structured decision rules or algorithms—rather than relying on purely intuitive or holistic clinical judgment. Decades of research (e.g. Grove and Meehl, 1996; Meijer et al., 2020; Neumann et al., 2023) show that mechanical decision-making is more transparent, replicable (consistent), and less prone to cognitive biases than clinical judgment. It also allows for feedback and refinement over time, which are essential for professional accountability and continuous improvement. In contrast, relying solely on clinical judgment increases the risk of bias, inconsistency, and inequality, which may negatively impact individual lives.
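What "mechanical data integration" means in practice can be illustrated with a minimal sketch in the spirit of Grove and Meehl (1996): each data source is standardized and combined with predefined weights, and the same cutoff rule is applied to every case. The predictors, weights, and cutoff below are hypothetical; in practice they would be derived from validation research for the specific decision at hand.

```python
# Minimal sketch of mechanical (rule-based) combination of multiple data
# sources. Predictors, weights, and the cutoff are hypothetical examples,
# not validated values.

WEIGHTS = {"iq_score": 0.5, "school_grades": 0.3, "informant_rating": 0.2}
CUTOFF = 0.0  # predefined decision threshold on the composite

def standardize(value: float, mean: float, sd: float) -> float:
    """Convert a raw value to a z-score given reference mean and SD."""
    return (value - mean) / sd

def composite(z_scores: dict[str, float]) -> float:
    """Combine standardized predictors using the predefined weights."""
    return sum(WEIGHTS[name] * z for name, z in z_scores.items())

def decide(z_scores: dict[str, float]) -> str:
    """Apply the same predefined rule to every case."""
    return "eligible" if composite(z_scores) >= CUTOFF else "not eligible"

# One hypothetical case: IQ 85 (norms M=100, SD=15), grade average 6.5
# (reference M=6.0, SD=1.0), informant rating already standardized.
case = {
    "iq_score": standardize(85, 100, 15),        # z = -1.0
    "school_grades": standardize(6.5, 6.0, 1.0),  # z = 0.5
    "informant_rating": 0.2,
}
print(composite(case), decide(case))
```

The benefit is exactly the one argued above: the rule is explicit, applied identically to every individual, open to scrutiny, and refinable as new validity evidence accumulates, whereas holistic judgment combines the same inputs differently from case to case.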
This position is further explained (including the scientific support for this approach) in the practical guide developed by Niessen et al. (2025), supported by the Dutch Committee on Tests and Testing, which offers concrete guidelines for applying mechanical decision-making across diverse professional settings (e.g. clinical neuropsychology, work and organizational psychology). By walking the reader through realistic case scenarios, they demonstrate what steps can be taken during the decision-making process to achieve a structured, transparent, and mechanical integration of multiple data sources—including intelligence test results. Such an approach allows professionals not only to improve their judgments, but also to explain and justify them in a clear and consistent manner.
Completely abolishing intelligence tests due to concerns about their use as a gateway to special education or healthcare strikes us as overly radical. While we agree that a person’s intelligence cannot—and should not—be reduced to a single test score, many of the objections the authors raise in general can be refuted or apply at least as strongly—if not more so—to alternatives such as DA. The application of intelligence tests for special groups should be handled with great care, but we think that outright abolition—as the authors advocate—amounts to throwing the baby out with the bathwater.
Concluding, intelligence tests should neither be dismissed wholesale nor used uncritically or as standalone tools. Rather, they should be used thoughtfully, as part of a broader, structured, and transparent diagnostic process. When interpreted with nuance and embedded within a mechanical decision-making framework, intelligence tests, in our opinion, continue to hold diagnostic value and contribute meaningfully to responsible, equitable, and informed professional practice.
Acknowledgements
Not applicable.
Author note
All five authors are members of the Dutch Committee on Tests and Testing (COTAN; in Dutch: Commissie Testaangelegenheden Nederland) of the Dutch Institute of Psychologists (NIP; in Dutch: Nederlands Instituut van Psychologen).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: M.E. Timmerman was supported by Dutch Research Council (NWO) [grant number 406.27.GO.022].
Ethical considerations
Ethical approval was not required.
Consent to participate
Not applicable (no participants are used).
Consent for publication
Not applicable (no participants are used).
Data availability statement
Not applicable (no data was collected for this paper).
