Abstract
In response to Van Hoogdalem and Bosman (2024) who advocate abandoning the use of intelligence tests at the individual level, we argue that their conclusions are too absolute and insufficiently substantiated. While we acknowledge some of the concerns they raise regarding measurement error and contextual influences, we contend that the complete dismissal of standardized intelligence tests overlooks their practical utility. Drawing on empirical evidence and established psychometric theory, we demonstrate that intelligence tests provide valuable information for clinical and educational decision-making. We further argue that replacing standardized testing with purely non-standardized methods introduces greater subjectivity and risk of error, ultimately undermining the quality of professional judgments. We conclude that intelligence tests remain a valuable source of information in many cases of clinical and educational decision-making, when interpreted carefully and integrated with other relevant sources of valid information using predefined decision rules.
Introduction
In their recent article, Van Hoogdalem and Bosman (2024) conclude: “Considering that (1) a general ability of intelligence might not exist as a psychological construct, and (2) a valid and reliable measure of intelligence is not possible, we must conclude that we see no justification for the use of standardized intelligence tests, let alone IQ scores, in clinical decision-making when assessing the individual. As evidenced by the cited articles, utilizing these assessments puts individuals at risk of a misinterpretation of their cognitive abilities.”
While we acknowledge some of the inherent limitations of intelligence testing, we contend that the authors’ conclusions are overly absolute and one-sided. By dismissing standardized intelligence tests altogether and advocating for non-standardized assessment, the authors discard a valuable diagnostic tool and run the risk of decreasing rather than increasing the quality of individual clinical decision-making. The reason is that clinical judgment, where the “judge” combines data using informal methods, is notoriously less accurate than mechanical prediction, where decisions are based on predefined decision rules grounded in empirical evidence (Grove et al., 2000; Grove and Meehl, 1996; Meehl, 1954).
In this article, we follow the call of Van Hoogdalem and Bosman (2024) for a scientific debate by critically examining their key arguments. We will start by discussing areas where we share concerns about intelligence tests. Subsequently, we address points of disagreement. Specifically, we will discuss the misconception that findings at the group level are irrelevant for individual assessment and explain why group-level findings are essential for informing clinical decisions about individuals. Clinical decision-making inevitably involves prediction, so a faultless decision for every individual is an unattainable ideal. Consequently, the goal should be to minimize the proportion of erroneous decisions across individuals. We will argue that non-standardized assessments—suggested by the authors as an alternative to intelligence tests—can be expected to be much more susceptible to measurement error and thus enlarge the risk of erroneous decisions. Finally, we will maintain that, despite their limitations, intelligence tests remain a valuable source of information in many cases of clinical decision-making, provided that they are interpreted carefully and integrated with other relevant sources of valid information using predefined decision rules. We offer practical suggestions for how to achieve this in practice.
Where we agree
Some of the concerns expressed by Van Hoogdalem and Bosman (2024) on intelligence tests resonate with us. These concerns have been acknowledged in the literature for quite some time, as nicely highlighted by Tellegen (2004). We share the concern about an absolute interpretation of test results, particularly when fully ignoring measurement error. Any psychological measurement is influenced by measurement error. This means that an observed score may deviate from the “real score.” We use the term “real score” to describe a score that expresses the position of the individual on the construct of interest. To make clear that this concept is not bound to classical test theory, we intentionally avoid the term “true score” here. A complicating factor is that this “real score” refers to a latent variable—one that cannot be observed directly—and that the latent variable itself reflects a psychological construct that is notoriously difficult to define, and thus difficult to measure.
As Van Hoogdalem and Bosman (2024) rightfully indicate, various factors beyond intelligence can influence observed scores on intelligence tests. They discuss both external and internal factors, such as examiner familiarity, testing conditions, and the individual’s state at the time of testing, all of which can undesirably impact scores in ways unrelated to actual cognitive ability.
These influences contribute to measurement error, and thus to the uncertainty surrounding observed scores.
When different intelligence tests are administered to the same individual, differences in observed scores are to be expected. This is due not only to measurement error inherent in each test, but also to structural differences between the tests themselves. That is, tests vary in the specific tasks administered and in how they define and operationalize intelligence (Ruiter et al., 2017). As a result, what is labeled as “intelligence” can vary from one test to another. Although intelligence tests tend to correlate moderately to highly with each other (say 0.40–0.80, not corrected for measurement error; Ruiter et al., 2017), their scores are certainly not entirely interchangeable.
When interpreting the observed scores on an intelligence test, uncertainty due to measurement error and structural differences between tests needs to be taken into account. We therefore agree with Van Hoogdalem and Bosman (2024) that the sole use of strict IQ cut-offs for decisions related to care eligibility and school placement is inappropriate. This is especially true when the observed score is very low (e.g. IQ < 70), because very low (and very high) test scores are inherently less reliable.
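The role of measurement error in interpreting a single observed score can be made concrete with the classical standard error of measurement, SEM = SD × √(1 − reliability). The sketch below (in Python; the function name and the reliability value of 0.90 are our own illustrative assumptions, not drawn from any test manual) shows why a strict IQ-70 cut-off is problematic:

```python
import math

def iq_confidence_interval(observed_iq, reliability, sd=15.0, z=1.96):
    """95% confidence interval around an observed IQ score.

    Uses the classical standard error of measurement:
    SEM = SD * sqrt(1 - reliability).
    """
    sem = sd * math.sqrt(1.0 - reliability)
    return observed_iq - z * sem, observed_iq + z * sem

# Hypothetical example: observed IQ of 68 on a test with reliability 0.90.
low, high = iq_confidence_interval(68, reliability=0.90)
print(f"95% CI: [{low:.1f}, {high:.1f}]")  # 95% CI: [58.7, 77.3]
```

Under these assumptions, the interval around an observed score of 68 spans roughly 59 to 77 and straddles the IQ-70 cut-off: exactly the situation in which a strict threshold decision is unwarranted.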
Tellegen (2004) also stresses that the norms of an intelligence test must be sound: norm groups should be well-defined, and the chosen norm group must be representative. Especially at the extremes of the scale, such as IQ values below 70 and above 130, scores are sensitive to shortcomings in the norms. Tellegen further comments on the aging of norms, which must be monitored, as well as on test bias. We agree with these points. It is no coincidence that the Dutch Committee on Tests and Testing (COTAN; Evers et al., 2010) imposes strict requirements on tests. Whenever an intelligence test is released, it must be scientifically demonstrated that these requirements are met.
Where we disagree
The one-sided view on validity
A widely cited quote from Messick (1989: 13) captures the modern view on validity: “Validity is an overall evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of interpretations and actions based on test scores.”
Messick thus emphasizes that validity concerns the practical usefulness of test score interpretations, rather than ontological questions about whether the underlying construct “really” exists. In line with this perspective, we prefer to focus on the practical usefulness of intelligence tests, rather than engaging in philosophical debates about whether intelligence truly exists—and if so, how it should be defined. Others, like Nettelbeck and Wilson (2005)—also quoted by Van Hoogdalem and Bosman (2024)—adopt a similar approach.
Van Hoogdalem and Bosman (2024), by contrast, appear to endorse the validity framework advanced by Borsboom et al. (2004), which posits that valid measurement requires a construct to exist as a causally coherent entity. Applying this framework, they argue that intelligence lacks ontological definability and is therefore inherently unmeasurable. However, their critique is logically incomplete: while rejecting intelligence as a measurable construct, they propose dynamic assessment as an alternative means to measure “learning potential”—a concept that is itself latent, poorly defined, and arguably just as ontologically problematic. Like intelligence, learning potential fails to meet Borsboom et al.’s (2004) stringent validity criteria, which demand a clear causal link between the construct and its observable indicators. This unresolved tension undermines their broader argument, as their proposed alternative replicates the very epistemological flaws they attribute to intelligence. Although the ontological discussion is very interesting, Messick’s more practical view on validity therefore seems more appropriate here.
The interpretation of differences in intelligence test scores
Van Hoogdalem and Bosman (2024) argue that intelligence test scores are unreliable because they fluctuate over time due to individual characteristics and varying test conditions. While we acknowledge that some variability exists, research shows that intelligence test scores tend to stabilize with age (Eichelberger et al., 2023a, 2023b). In particular, intelligence test scores become relatively stable from childhood and adolescence onwards (Tucker-Drob and Briley, 2014), and are rather stable at successive periods from infancy (i.e. from 1 year of age) to early adolescence (i.e. up to 17 years old; Yu et al., 2018).
Further, it is important to recognize that variability in observed scores does not imply that a measurement is meaningless. Many common metrics—such as weight or blood pressure—also fluctuate depending on situational factors, but remain valuable for assessing general health. For example, a person’s weight may differ between morning and evening, or over longer time intervals. Likewise, blood pressure typically rises when the situation is stressful. Still, these variations do not invalidate weight or blood pressure as useful indicators of physical health. In the same way, an intelligence test score—though subject to situational influences—can still provide meaningful information, particularly when interpreted in context and alongside other relevant sources of information.
Hence, the examples cited by Van Hoogdalem and Bosman (2024) as evidence for poor criterion validity—due to variation in observed intelligence test scores—are not convincing. For example, in the study of Habets (2015), different tests were administered at different time points. However, the candidates’ developmental changes over time were not controlled for, making it plausible that the observed differences in scores or score categories reflect both measurement error and natural development. Similarly, if an individual’s weight is measured months apart, a shift from “obese” to “overweight” on the BMI scale does not invalidate the concept of weight measurement—it merely reflects change over time.
In the same reasoning, the fact that intelligence is not entirely stable over time does not render intelligence testing useless. It simply implies that interpretations must be made carefully and with appropriate context. Test results may vary based on the time of day, recent experiences, or testing conditions—just like weight or blood pressure. However, as with physical measures, there are likely performance boundaries. If the goal is to assess an individual’s optimal performance—as an indicator of their actual potential—it would be sensible to administer multiple (different) tests over time, at moments when the individual is well-prepared and in optimal condition. Selecting the best performance while taking into account the measurement error and retest effects could offer a more balanced and accurate representation than a single assessment, and directly addresses many of the concerns brought up by Van Hoogdalem and Bosman (2024).
While the preceding discussion appears to endorse test-retest protocols, it is critical to acknowledge that such practices carry the risk of introducing learning or memory effects. This is particularly relevant given that intelligence testing aims to assess an individual’s ability to engage with novel tasks or unfamiliar scenarios, thereby measuring cognitive adaptability and problem solving in new situations. To mitigate this risk while ensuring a comprehensive evaluation, utilizing diverse intelligence assessments becomes essential. Doing so not only reduces practice effects, but also helps to illuminate distinct cognitive strengths and strategic approaches across varied domains.
The misconception of ergodicity
Van Hoogdalem and Bosman (2024) criticize the reflective model of intelligence measurement, arguing that intelligence cannot be fully captured by a single, underlying factor, because it depends on the specific tasks and contexts in which it is measured. However, we argue that intelligence tests can still provide meaningful insights, especially when the scores are interpreted as reflecting specific cognitive functions rather than a singular, global factor—an interpretation that is consistent with a so-called formative model.
Intelligence is a multifaceted construct that extends beyond cognitive abilities alone. It encompasses personal characteristics that enable individuals to function effectively in their environment, including, for example, perseverance, emotion regulation, delay of gratification, and time management. This broader conceptualization is reflected in Wechsler’s definition of intelligence as “the aggregate or global capacity of the individual to act purposefully, to think rationally, and to deal effectively with his environment” (Wechsler, 1944: 3). Nevertheless, most intelligence tests primarily focus on cognitive aspects. Instruments such as the WAIS-IV (Wechsler, 2008) and the IDS-2 (Grob and Hagmann-von Arx, 2018a) assess a range of cognitive functions, such as visual processing, long-term memory, processing speed, short-term memory, and reasoning, and combine these into an overall IQ score. Although they differ in specific tasks and subtests, these tests share a common theoretical framework that justifies their interchangeable use in individual assessments.
A key feature of these intelligence tests is that their scores are norm-referenced: they express an individual’s performance relative to a reference population. For example, the overall IQ score is typically standardized to a normal distribution with a mean of 100 and a standard deviation of 15 within a general population of a certain country or language area of the same age as the individual tested. Since individuals are compared to others of the same age, intelligence test scores are controlled for age-related cognitive development and thus free from effects of development across the age span.
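The norm-referenced transformation described above can be sketched as follows. This is a deliberately simplified linear conversion with hypothetical norm-group statistics; actual test publishers typically use normalized (area) conversions per age band:

```python
def raw_to_iq(raw_score, norm_mean, norm_sd):
    """Convert a raw test score to a deviation IQ (mean 100, SD 15),
    relative to the statistics of the age-appropriate norm group."""
    z = (raw_score - norm_mean) / norm_sd  # standing relative to peers
    return 100.0 + 15.0 * z

# Hypothetical norm-group statistics for one age band: mean 50, SD 10.
print(raw_to_iq(65, norm_mean=50, norm_sd=10))  # 122.5
print(raw_to_iq(50, norm_mean=50, norm_sd=10))  # 100.0
```

The point of the sketch is simply that an IQ score encodes a position relative to same-age peers, which is why it is automatically controlled for age-related development.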
Importantly, the cognitive functions and the overall IQ score are derived from studies involving groups of individuals. This between-person structure is consistently observed across studies (e.g. Grob and Hagmann-von Arx, 2018b; Wechsler, 2008). Thus, IQ scores effectively summarize an individual’s cognitive performance relative to others.
A critical question, however, is whether the same structure can be used to track changes within individuals over time. This requires the within-person structure to be equivalent to the between-person structure (Voelkle et al., 2014). Schmiedek et al. (2020), using intensive longitudinal data (100 observations per individual), showed that the within-person structure of cognitive performance deviates from the between-person structure, so this equivalence cannot simply be assumed.
Because of this limitation, Van Hoogdalem and Bosman (2024) argue for dismissing the overall IQ score entirely. In our view, this is a misinterpretation. Intelligence tests are designed to assess between-person differences and are therefore valuable for predicting behavior relative to others. However, as Schmiedek et al. (2020) showed, a person’s test score on a given day is a poor predictor of their performance on a different task or day. The key issue is thus whether this relative-to-others perspective is informative in clinical decision-making.
Van Hoogdalem and Bosman (2024) question this approach by invoking the concept of ergodicity. Basically, ergodicity refers to the condition where a summary measure calculated across individuals is identical to the same measure calculated within individuals. A lack of ergodicity implies that, for example, the average developmental trajectory derived from a group does not reflect the developmental path of any individual in that group.
However, the absence of ergodicity is not inherently problematic if we are examining a snapshot in time. At a single time point, the individual’s relative standing compared to the norm group remains meaningful. The fact that not all, or even any, individuals follow the mean group trajectory does not invalidate comparisons between individuals at a given moment.
Consider, for example, the child growth charts used in youth healthcare services (e.g. Dutch context: TNO, 2022; international context: WHO, 2025). These charts display growth curves (e.g. height, weight) for various percentiles and help physicians identify abnormal development. Despite individual variability and changing group composition, these charts remain essential in pediatric health care. Early identification of growth-related disorders or problems based on these charts has led to life-saving interventions.
According to the logic of Van Hoogdalem and Bosman (2024), such growth charts should not be used, since their two main criticisms—individual variability and non-identical development trajectories—also apply here. Yet, this ignores a fundamental point: users of these charts understand that deviations are expected, but that extreme deviations may warrant concern. This principle also applies to cognitive development over time. Good assessment tools—whether for height, weight, or cognitive ability—are not designed to indicate causes, but to signal deviations from normative patterns. The causes of those deviations must be investigated separately. Nonetheless, the measurements themselves are not meaningless. The same holds true for intelligence test scores when used for evaluating cognitive functioning relative to others at a specific time point.
In sum, if we were to follow Van Hoogdalem and Bosman’s (2024) reasoning consistently, physicians should abandon standardized growth charts and rely solely on subjective clinical judgment. They argue that scores should only be interpreted within the individual’s developmental context and reject norm-referenced frameworks. This would mean discarding well-validated tools that offer critical insights—something we believe is both unnecessary and unwise. We will delve deeper into this in the next section.
The usefulness of norms for measuring individuals
Dynamic testing, as advocated, can well be thought of and implemented in a standardized and quantitative manner (Beckmann, 2014). Standardized procedures, rather than ones based solely on clinical judgment, are also promoted (Jeltova et al., 2007). We agree that some of the suggestions mentioned in the discussion by Van Hoogdalem and Bosman (2024) can be considered important ways in which established testing procedures could be improved by considering the dynamic nature of individual differences in cognitive ability and performance. However, we disagree with their statement advocating “. . . letting go of IQ.”
These standardized benchmarks, rigorously derived from empirical data, provide critical reference points for interpreting individual and group outcomes, ensuring comparability across studies while mitigating subjective biases. In developing such norms, test constructors follow rigorous methodological standards to ensure reliability, validity, and fairness. Nevertheless, Van Hoogdalem and Bosman (2024) argue that test scores should only be interpreted within the context of an individual’s developmental path. Yet, how is such interpretation supposed to take place in practice—particularly in the absence of a normative framework? Implicitly, the answer is: by the practitioner.
Leaving the interpretation of the test result to the practitioner raises critical questions. On what basis does a practitioner draw conclusions about an individual’s functioning when no reference scores are available? Van Hoogdalem and Bosman (2024) cite work (e.g. Schmiedek et al., 2020) to support a reliance on the practitioner’s expertise and contextual knowledge. Yet, the norms they reject are themselves based on carefully collected data—data derived from structured observations, cumulative experience, and theoretical knowledge. In interpreting scores, scientifically sound norms, based on a clearly defined norm population and a representative sample, are then replaced by the experience of the practitioner. This approach thus risks substituting systematically gathered scientific knowledge with personal, subjective judgment.
It is important to note that the lack of a well-defined norm population is just one of several issues that Tellegen (2004) identifies as problematic. The experiences of practitioners are generally not representative, especially since practitioners typically see relatively many children who need the particular care in question. Such a bias is also clear in the authors’ claim that intelligence tests may only be appropriate for “a small group of typically developing human beings,” while suggesting that such tests are not useful for individuals with conditions such as frontal-lobe damage, autism, or intellectual and developmental disabilities. Fortunately, however, the vast majority of children in the population do not face such limitations. It may be that in the authors’ daily work the percentage of typically developing children is smaller, which would explain their bias. It would also exemplify the need for a well-defined norm population rather than the practitioner’s experience as the ultimate norm.
Note also that while we agree that a minimal condition for a meaningful test administration is that the child should be able to understand and follow the instructions, we do not agree that all children in the groups the authors mention should therefore be excluded from taking an intelligence test. We discuss this issue further in the section on the target groups of intelligence tests.
The proposed approach to interpreting scores by Van Hoogdalem and Bosman (2024) stands in direct contradiction to their own concern (p. 15) that the professional field is advancing ahead of scientific insight—a risk, they argue, that leads to unsupported practice. Strikingly, on the very next page (p. 16), they suggest that policy should follow the professional field’s current practices. These statements appear inconsistent.
By rejecting the norm-referenced interpretation and discouraging comparisons with standardized groups, Van Hoogdalem and Bosman (2024) disregard decades of empirical research. In their seminal paper, Grove and Meehl (1996) systematically address the objections raised against actuarial (or mechanical) prediction—demonstrating its superior accuracy compared to unaided clinical judgment. One of the central fallacies Grove and Meehl (1996) describe closely resembles Van Hoogdalem and Bosman’s (2024) reasoning: the claim that “Statistical predictionists aggregate, whereas we seek to make predictions for the individual, so the actuarial figures are irrelevant in dealing with the unique person” (Grove and Meehl, 1996). Grove and Meehl (1996) convincingly dismantle this objection. Consider a medical analogy: suppose a patient can choose between two medical procedures. Among patients with similar clinical characteristics, one has a success rate of 90%, the other of 10%. Of course, these statistics do not guarantee the outcome for any single individual—someone may still fail under the high-success treatment or recover under the low-success one. Still, faced with this choice, would one prefer to have this statistical information—or not? Van Hoogdalem and Bosman’s (2024) position suggests that such statistical information should be disregarded, because it may not apply precisely to any given individual. We argue the opposite: these statistics offer a rational basis for making probabilistic decisions under uncertainty. Rejecting them removes one of the most powerful tools available for informed decision-making.
This also means that we disagree with Van Hoogdalem and Bosman’s (2024) claim that “because humans are not ergodic systems, using confidence intervals of the norm group are, by definition, not reliable at the level of the individual” (p. 16). If this were true, the same objection would apply to every norm-referenced measure, including the growth charts discussed above, and virtually all normative information in medicine and psychology would have to be discarded as uninformative for individuals.
In conclusion, while individual developmental context certainly matters, it should not replace scientifically validated norms, but complement them. Discarding normative frameworks undermines both the empirical basis of psychological assessment and the transparency of clinical reasoning. We believe that test scores grounded in normative data continue to offer essential, probabilistic information that supports responsible, evidence-based decision-making.
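The kind of predefined decision rule we advocate, in the spirit of Grove and Meehl’s (1996) mechanical prediction, can be illustrated with a deliberately simplified sketch. All thresholds and variable names below are hypothetical and for illustration only; real eligibility criteria must be grounded in empirical validation:

```python
def eligible_for_support(iq_ci_upper, adaptive_score):
    """Illustrative mechanical decision rule (hypothetical thresholds):
    combine two sources of information via a predefined rule rather
    than informal clinical judgment.

    iq_ci_upper    -- upper bound of the IQ confidence interval
    adaptive_score -- standardized adaptive-behaviour score
    """
    return iq_ci_upper < 75 and adaptive_score < 70

print(eligible_for_support(iq_ci_upper=72, adaptive_score=65))  # True
print(eligible_for_support(iq_ci_upper=80, adaptive_score=65))  # False
```

The point is not the particular cut-offs but the mechanism: every piece of information enters the decision in a fixed, transparent, and replicable way, which is exactly what unaided clinical judgment fails to guarantee.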
Target groups of intelligence tests
Apart from doubting the usefulness of test norms more broadly, Van Hoogdalem and Bosman (2024) argue that intelligence tests are unsuitable for children with developmental disabilities. They claim that standardized psychological tests “cannot be used validly and reliably with specific groups of patients or clients” (p. 11), and therefore conclude that such tests are only appropriate for typically developing individuals. However, by entirely discarding the use of intelligence testing in these populations, they risk denying children with developmental disabilities access to an instrument that could support understanding and inform care. In addition, it is worth noting that test authors often investigate the reliability and validity of their instruments for use in specific groups. Some tests can indeed be administered to specific groups and lead to interpretable scores (Grob and Hagmann-von Arx, 2018b; Hendriks et al., 2018).
While we acknowledge that (certain) tests may be inappropriate for some populations, the suggestion that intelligence tests are only suitable for typically developing individuals is inaccurate. Each intelligence test is designed with a specific target group in mind and its manual should clearly define the intended population and purpose. For example, the Dutch version of the IDS-2 (Grob and Hagmann-von Arx, 2018b) explicitly outlines the constructs it intends to measure, the intended age range, and the relevant application contexts—illustrating the diverse and valuable roles intelligence tests can play in both clinical and educational practice.
That said, test interpretation must always take into account factors such as test conditions and the child’s state during testing, which can influence the reliability and validity of outcomes. Observations made during test administration are therefore critical supplemental data for decision-making.
In an ideal situation, all aspects of assessment—including environment, content, item presentation, and responses by the testee—should be equally accessible to all individuals within the test’s target population (Lovett and Lewandowski, 2015). While this ideal of full accessibility cannot always be realized, test accommodations (adaptations of the administration procedure that leave the measured construct intact) can bring practice considerably closer to it.
Test publishers often provide specific guidance for implementing such accommodations. Some even develop entirely separate versions, such as the Bayley-III Special Needs Addition (Ruiter et al., 2014), which was created to address the limitations of the standard Bayley-III for children with certain impairments. Research has shown that the SNA improves test validity for these children with specific impairments (Visser et al., 2013).
In individual cases, an evaluation of the suitability of the instrument always needs to be made for each child to ensure a sufficient level of reliability and validity. This requires information about the test fairness of an instrument, which can be studied using the concept of measurement invariance (MI). MI holds if two persons from different target groups, but with the same underlying ability that the test is supposed to measure, obtain equal results (Wicherts, 2016). Only if MI holds will the test result reflect the latent (“true”) ability similarly for children from these target groups. Related terms are measurement bias (Millsap, 2011) when MI does not hold and test fairness (e.g. Mickley and Renner, 2015) when it does.
Nevertheless, intelligence tests—even when adapted—may not be appropriate for all individuals. For example, children with very low levels of cognitive functioning may fall outside the scope of all standardized instruments due to issues like floor effects and reduced reliability. In these cases, alternative assessments tailored to the specific developmental level and needs of the child are likely to be more effective.
Non-standardized assessment as the new intelligence test
Van Hoogdalem and Bosman (2024) advocate for the use of alternative methods, such as dynamic assessment (DA) and qualitative observation, suggesting that these offer more accurate and fair evaluations than standardized intelligence tests. We agree that alternative methods may offer this, but only insofar as the method is based on a form of standardization.
DA is “an umbrella term that refers to a wide range of approaches typified by the provision of instruction and feedback as part of the testing process” (Elliott et al., 2018: 8). It focuses on a person’s learning potential rather than on static performance (Tzuriel, 2021). A common format within dynamic assessment is dynamic testing, which typically involves a test-intervention-retest format to measure how much support an individual needs to improve. While these methods can offer valuable insights, many of the discussed limitations of intelligence tests apply to DA as well.
In many applications of DA, qualitative observation plays an important role. Qualitative observation gathers information by describing the qualities, characteristics, or properties of something without using numbers or measurements. Instead of focusing on “how much” or “how many,” the focus is on what something is like (e.g. a child frequently approaches others). Qualitative observation relies on expert interpretation of behavior during tasks (Kirk and Miller, 2011), and has its own challenges and limitations.
A key concern with qualitative observations is their reliability and validity (Kirk and Miller, 2011). Qualitative observations often lack a structured observation procedure. Consequently, they are inherently subjective and susceptible to large intra-rater and inter-rater variability (Boeije and Bleijenbergh, 2023). In addition, when observations are not done in a standardized way, based on proper norm data and validation research, assessors may unintentionally allow personal expectations and contextual factors to influence their judgments (Hasselhorn and Gold, 2013). This can introduce various biases, such as confirmation bias (where assessors unconsciously focus on behaviors that confirm their pre-existing beliefs about an individual’s cognitive abilities) and the halo effect (where a positive impression in one area, such as social skills, leads to an overestimation of cognitive abilities, or vice versa; Kaplan and Saccuzzo, 2018).
In general, qualitative observation and DA can thus heavily depend on the assessor’s interpretation of interactions during the assessment, which yields a possible source of examiner bias. In particular, cultural bias may be an issue, as the interactional nature of DA may reflect cultural norms around communication, help-seeking, and authority (Cannata et al., 2024; Wilby et al., 2017). There is also a considerable burden on the assessor’s ability to both interact with and objectively evaluate the test-taker—something humans are demonstrably poor at (e.g. Grove and Meehl, 1996; Kahneman et al., 1982). In addition, Van Hoogdalem and Bosman (2024) assume that DA reveals insight into the test-taker’s thought process. Yet, the ability to verbalize these thought processes varies widely depending on factors such as verbal ability and cultural background. Some individuals are more verbally reflective than others, and similar thought processes can be verbalized in very different ways, or different processes in similar ways. Naturally, when thought processes must be derived from non-verbal cues, the interpretation of the examiner plays an important role as well, which again can lead to significant bias.
Van Hoogdalem and Bosman (2024: 8) consider possible examiner bias an important issue in standardized testing. We agree, but we think it could be even more of an issue in DA. Even in a highly standardized DA, the assessor plays a role, just as in standardized IQ tests: assessors have to focus on the administration procedure and describe the actions of the child. Van Hoogdalem and Bosman (2024) note that the influence of the assessor on the testing situation and test outcome should be included and described. However, reflecting on the child being tested while simultaneously reflecting on one’s own behavior with the necessary level of objectivity is a difficult task. Hence, the impact of the assessor can be very large.
In addition to the inclusion and description of the assessor’s behavior, substantiated guidelines should be available on how this information should affect the interpretation of the test result and the decision that follows. Grove and Meehl (1996) have shown that mechanical (i.e. algorithmic) combination of information leads to more accurate decisions than clinical judgment, which is prone to inconsistency and bias. Hence, relying on unstructured assessment methods can reduce the reliability and fairness of decision-making.
Thus, for a fair use of DA and qualitative observation, it is essential that the assessor is thoroughly trained (Cohen and Swerdlik, 2018: 443). Also, to enhance transparency, clear protocols need to be provided and followed (Kapiszewski and Karcher, 2021). The protocol should specify what behaviors or responses are observed, using structured tasks or checklists where possible, and include both qualitative and quantitative data. Bias mitigation strategies must be incorporated, such as pre-defining hypotheses, real-time recording, and multiple observers (Hallgren, 2012; Haven and Van Grootel, 2019) to reduce confirmation or halo effects. Observations from different contexts and modalities should be systematically integrated, with professionals documenting how each piece of evidence informs their overall judgment, including any uncertainties or conflicting information. Finally, all observations, scoring, and reasoning should be thoroughly recorded, ensuring that the decision-making process is transparent, replicable, and open to evaluation.
Standardized tests, including intelligence tests, address some of these limitations by offering objective, replicable metrics. A promising bridge between qualitative observation and standardization is found in tools like the Guide to the Assessment of Test Session Behavior (GATSB; Glutting and Oakland, 1993), a 29-item, norm-based behavior rating checklist. During each test session of the Wechsler Intelligence Scale for Children (WISC-III), a child’s behavior was rated on a three-point scale (usually applies, sometimes applies, doesn’t apply). The ratings were then summed for three factor-based scales—avoidance, inattentiveness, and uncooperative mood—as well as a total scale, and converted into standard scores (with a mean of 50 and a standard deviation of 10). Thus, this structured, norm-based checklist quantifies behavioral data collected during test administration, offering reliable supplementary insights.
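The scoring logic of such a norm-based checklist can be made concrete in a short sketch. The item wordings, scale compositions, and norm statistics below are hypothetical and purely illustrative; only the general procedure (summing three-point item ratings per scale and converting the raw sum to a standard score with mean 50 and SD 10) follows the description above.

```python
# Illustrative sketch of norm-based checklist scoring in the style of the
# GATSB. Item assignments and norm values are hypothetical examples.

RATING = {"doesn't apply": 0, "sometimes applies": 1, "usually applies": 2}

# Hypothetical norm statistics: (mean, SD) of raw scores in the norm group.
NORMS = {
    "avoidance": (6.0, 3.0),
    "inattentiveness": (8.0, 4.0),
    "uncooperative mood": (4.0, 2.5),
}

def standard_score(raw: float, norm_mean: float, norm_sd: float) -> float:
    """Convert a raw scale score to a standard score (M = 50, SD = 10)."""
    return 50 + 10 * (raw - norm_mean) / norm_sd

def score_scale(ratings: list[str], scale: str) -> float:
    """Sum three-point item ratings and norm-reference the raw sum."""
    raw = sum(RATING[r] for r in ratings)
    mean, sd = NORMS[scale]
    return standard_score(raw, mean, sd)

# A child rated on four (hypothetical) avoidance items:
t = score_scale(
    ["usually applies", "sometimes applies",
     "doesn't apply", "sometimes applies"],
    "avoidance",
)
print(round(t, 1))  # a raw sum of 4 against norms (6.0, 3.0)
```

The key point of the sketch is that, once the norms are fixed, two assessors who record the same ratings necessarily arrive at the same standard score; the subjectivity is confined to the item ratings themselves.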
Note also that not all forms of DA are exclusively qualitative. For example, DA can generate quantitative outcomes (e.g. gain scores between pre- and post-tests, or ratings of the level of support required). Moreover, in some approaches the prompts and levels of mediation are protocolled in advance and administered at fixed points in the procedure, making the process more standardized and less reliant on open-ended interpretation.
Moreover, rather than serving as a replacement, DA is best viewed as a complement to standardized testing. Where traditional tests assess static performance, dynamic methods provide insights into cognitive processes, such as learning potential, instructional responsiveness, and problem-solving strategies. Adding protocols, and thereby increasing the degree of standardization of DA, could possibly help alleviate some of the limitations discussed. In research on DA so far, its validity and its potential to form a link between assessment and intervention have not yet been supported (Elliott et al., 2018). Further research is thus needed to determine how well DA outcomes predict real-world learning and under what conditions. While Van Hoogdalem and Bosman (2024) may not have rejected this view, their paper left us wondering what their position is on this point.
Finally, when it comes to dichotomous decisions on, for example, access to a specific type of education or care, the challenge of a fair decision remains. Van Hoogdalem and Bosman (2024) start their paper with justified criticism of gatekeeper diagnostics (“
Conclusion
While we acknowledge that some concerns raised by Van Hoogdalem and Bosman (2024) are valid, several of their conclusions seem to be based on debatable assumptions and appear to reflect underlying biases. For instance, their assertion that standardized intelligence tests are only appropriate for a narrow group of typically developing individuals overlooks the wide variety of populations for which such tests have been specifically developed and validated. Although we agree that certain individuals require tailored assessments, we challenge the notion that this group is so large that only a small minority remains for whom the tests are appropriate. The bias in their reasoning stems from generalizing findings from their specific research population to the broader population—a methodological leap that is not empirically justified.
In this paper, we have sought to critically and systematically address their claims using evidence-based reasoning. We argue that, although intelligence tests have limitations, these do not justify their outright rejection in clinical or educational contexts—a position implicitly or explicitly advocated by Van Hoogdalem and Bosman (2024).
Importantly, intelligence test results are seldom, if ever, interpreted in isolation when professionals make decisions about individuals. Whether for school placement, diagnostic evaluation, or personnel selection, professionals typically draw upon a broad range of data sources, including test scores, contextual and developmental information, behavioral observations, and informant reports. A key question, then, is not whether an intelligence test score should stand on its own, but how it is combined with these other sources of information.
We strongly advocate for the use of mechanical methods of data integration—structured decision rules or algorithms—rather than relying on purely intuitive or holistic clinical judgment. Decades of research (e.g. Grove and Meehl, 1996; Meijer et al., 2020; Neumann et al., 2023) show that mechanical decision-making is more transparent, replicable (consistent), and less prone to cognitive biases than clinical judgment. It also allows for feedback and refinement over time, which are essential for professional accountability and continuous improvement. In contrast, relying solely on clinical judgment increases the risk of bias, inconsistency, and inequality, which may negatively impact individual lives.
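What "mechanical data integration" means in practice can be illustrated with a minimal sketch in the spirit of Grove and Meehl (1996): each data source is standardized and combined with predefined weights, and the same cutoff rule is applied to every case. The predictors, weights, and cutoff below are hypothetical; in practice they would be derived from validation research for the specific decision at hand.

```python
# Minimal sketch of mechanical (rule-based) combination of multiple data
# sources. Predictors, weights, and the cutoff are hypothetical examples,
# not validated values.

WEIGHTS = {"iq_score": 0.5, "school_grades": 0.3, "informant_rating": 0.2}
CUTOFF = 0.0  # predefined decision threshold on the composite

def standardize(value: float, mean: float, sd: float) -> float:
    """Convert a raw value to a z-score given reference mean and SD."""
    return (value - mean) / sd

def composite(z_scores: dict[str, float]) -> float:
    """Combine standardized predictors using the predefined weights."""
    return sum(WEIGHTS[name] * z for name, z in z_scores.items())

def decide(z_scores: dict[str, float]) -> str:
    """Apply the same predefined rule to every case."""
    return "eligible" if composite(z_scores) >= CUTOFF else "not eligible"

# One hypothetical case: IQ 85 (norms M=100, SD=15), grade average 6.5
# (reference M=6.0, SD=1.0), informant rating already standardized.
case = {
    "iq_score": standardize(85, 100, 15),        # z = -1.0
    "school_grades": standardize(6.5, 6.0, 1.0),  # z = 0.5
    "informant_rating": 0.2,
}
print(composite(case), decide(case))
```

The benefit is exactly the one argued above: the rule is explicit, applied identically to every individual, open to scrutiny, and refinable as new validity evidence accumulates, whereas holistic judgment combines the same inputs differently from case to case.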
This position is further explained (including the scientific support for this approach) in the practical guide developed by Niessen et al. (2025), supported by the Dutch Committee on Tests and Testing, which offers concrete guidelines for applying mechanical decision-making across diverse professional settings (e.g. clinical neuropsychology, work and organizational psychology). By walking the reader through realistic case scenarios, they demonstrate what steps can be taken during the decision-making process to achieve a structured, transparent, and mechanical integration of multiple data sources—including intelligence test results. Such an approach allows professionals not only to improve their judgments, but also to explain and justify them in a clear and consistent manner.
Completely abolishing intelligence tests due to concerns about their use as a gateway to special education or healthcare strikes us as overly radical. While we agree that a person’s intelligence cannot—and should not—be reduced to a single test score, many of the objections the authors raise in general can be refuted or apply at least as strongly—if not more so—to alternatives such as DA. The application of intelligence tests for special groups should be handled with great care, but we think that outright abolition—as the authors advocate—amounts to throwing the baby out with the bathwater.
Concluding, intelligence tests should neither be dismissed wholesale nor used uncritically or as standalone tools. Rather, they should be used thoughtfully, as part of a broader, structured, and transparent diagnostic process. When interpreted with nuance and embedded within a mechanical decision-making framework, intelligence tests, in our opinion, continue to hold diagnostic value and contribute meaningfully to responsible, equitable, and informed professional practice.
Acknowledgements
Not applicable.
Author note
All five authors are members of the Dutch Committee on Tests and Testing (COTAN; in Dutch: Commissie Testaangelegenheden Nederland) of the Dutch Institute of Psychologists (NIP; in Dutch: Nederlands Instituut van Psychologen).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: M.E. Timmerman was supported by Dutch Research Council (NWO) [grant number 406.27.GO.022].
Ethical considerations
Ethical approval was not required.
Consent to participate
Not applicable (no participants are used).
Consent for publication
Not applicable (no participants are used).
Data availability statement
Not applicable (no data was collected for this paper).
