Abstract
Neuropsychological assessments require adequate task engagement, yet paediatric performance validity testing (PVT) remains limited. This study evaluated the Groningen Effort Test (GET) as an attention-based PVT in children and adolescents using a simulation design. Participants were typically developing controls (N = 113), instructed simulators (N = 40) and outpatient referrals (N = 47), aged 6–17 years, who completed the Test of Memory Malingering (TOMM Trial-1), Reliable Digit Span (RDS) and the GET. GET performance improved with age, and outpatient referrals scored below controls. Age-stratified cutoffs (<13 years vs. >13 years) produced high specificity and strong sensitivity in children, comparable to TOMM Trial-1 and RDS, while all PVTs showed reduced sensitivity in adolescents. ROC analyses indicated excellent accuracy in children and good to moderate accuracy in adolescents. Regression analyses revealed significant GET effects of group and age. Findings support the GET as a promising paediatric PVT requiring further clinical validation.
Introduction
Validity testing in neuropsychological assessment has grown substantially in both research attention and clinical application since the early 2000s, reflecting concerns about the credibility of self-reported symptoms and test performance (Bush & NAN Policy & Planning Committee, 2005; Heilbronner et al., 2009; Sweet et al., 2021). When invalid responses go undetected, the consequences may include misinterpretation, misdiagnosis, inappropriate treatment, and inefficient use of health care or educational resources (Kirk et al., 2020).
Thus, performance validity represents a central aspect of the diagnostic process. To draw firm conclusions, clinicians must first ensure that task performance reflects genuine cognitive abilities (Larrabee, 2012). Performance validity tests (PVTs) have been developed for this purpose and can be administered across diverse populations, including those with neurological, mental, or developmental conditions (Bush & Heilbronner, 2025).
In adult neuropsychology, PVTs are widely used and well established (e.g., Basso et al., 2024; Dandachi-Fitzgerald et al., 2011; Fuermaier et al., 2025), but their use in paediatric populations remains limited (e.g., Blaskewitz et al., 2008; Kirk et al., 2011; Ploetz et al., 2016). Survey findings indicate that the routine use of validity testing has not yet been implemented in paediatric neuropsychological assessment (Brooks et al., 2016; Hirst et al., 2017; Macallister et al., 2019). This gap was formally acknowledged in the 2021 consensus statement by the American Academy of Clinical Neuropsychology (AACN), which emphasised the need to increase validity assessment in paediatric contexts (Sweet et al., 2021). Accordingly, there have been growing calls to develop and routinely integrate PVTs into paediatric clinical work (Ploetz et al., 2022).
The limited use of PVTs in children may partly stem from the assumption that children are unlikely to engage in intentional deception during testing. Research findings challenge this assumption, showing that even children as young as three can understand others’ beliefs and deliberately deceive when motivated (Talwar & Crossman, 2011; Talwar & Lee, 2008). Children have also been shown to simulate a variety of difficulties during medical evaluations, including cognitive, academic, motor, sensory, and psychological symptoms (Kirkwood, 2015b). Furthermore, children as young as five have demonstrated the ability to intentionally underperform during neurocognitive testing (Kirk et al., 2011; Kirkwood & Kirk, 2010; Ploetz et al., 2016).
Although these findings highlight the importance of assessing performance validity in paediatric populations, they also emphasise the need for careful application. In children, underperformance may be driven by developmentally specific motivators, such as avoiding schoolwork, escaping social stressors, or obtaining academic accommodations (Chafetz et al., 2020). In addition, cognitive, emotional, and behavioural systems continue to mature throughout childhood and adolescence. These developmental changes can thus affect attention, motivation, and self-regulation during testing, increasing variability unrelated to intentional deception (Best & Miller, 2010; Ploetz et al., 2022). Therefore, the validation of PVTs for paediatric populations remains an important research priority. Clinicians in turn must carefully select validated measures to promote accurate assessment and maximise clinical utility.
Meeting this need requires familiarity with the two classes of PVTs and the evidence supporting their appropriateness for use in paediatric populations. Stand-alone PVTs are freestanding tests that appear to measure ability but place low demands on the examinee, so that they primarily assess test engagement (Kirkwood, 2015a). For example, the Test of Memory Malingering (TOMM) is the most widely used stand-alone PVT in younger populations (Kirkwood, 2015b; Tombaugh, 1996). It relies on pictograms, such that reading skills are not required. Research shows excellent malingering detection and demonstrates that children as young as six can meet adult-level performance standards (Blaskewitz et al., 2008; Constantinou & McCaffrey, 2003; Donders, 2005; Kirk et al., 2011; Macallister et al., 2009). Other stand-alone PVTs used in paediatric assessment include the Word Memory Test (WMT; Green et al., 2003) and the Medical Symptom Validity Test (MSVT; Green, 2004). Both demonstrate strong validity in children with at least a third-grade reading level. However, younger children or those with lower reading skills show a higher risk of false positives (Blaskewitz et al., 2008; Green & Flaro, 2021; Gunn et al., 2010).
Another approach involves embedded validity measures, which are derived from conventional neuropsychological tests. These indicators assess response validity within standard cognitive tests without requiring additional testing time (Boone, 2021; Larrabee, 2007). Their use in paediatric assessment is recommended by the AACN, which emphasises combining stand-alone and embedded measures to improve validity evaluation (Heilbronner et al., 2009). One such measure is the Reliable Digit Span (RDS; Greiffenstein et al., 1994), derived from the Digit Span subtest of the Wechsler intelligence tests (e.g., Wechsler Intelligence Scale for Children–5th Edition, WISC-V; Wechsler, 2014). In paediatric populations, the classification accuracy of the RDS varies with the cutoff score used. Lower cutoff scores, such as 4 or 5, tend to produce high specificity but low sensitivity (for a review, see Kirk et al., 2020). A cutoff score of ≤6 has been identified as optimal, yielding a sensitivity of 51% and a specificity of 92% (Kirk et al., 2011; Welsh et al., 2012). Other embedded measures used in paediatric performance testing include the Child and Adolescent Memory Profile (ChAMP; Brooks et al., 2023; Sherman & Brooks, 2015) and the California Verbal Learning Test for Children (CVLT-C; Delis et al., 1994).
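The trade-off described above, in which stricter cutoffs raise specificity at the cost of sensitivity, can be illustrated with a minimal sketch. The scores below are hypothetical and invented purely for illustration; they are not data from this or any cited study, and the function name `classification_accuracy` is likewise an assumption.

```python
from typing import Sequence, Tuple


def classification_accuracy(
    credible: Sequence[int], noncredible: Sequence[int], cutoff: int
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) when scores <= cutoff are flagged as invalid."""
    # Sensitivity: share of noncredible performers correctly flagged.
    true_positives = sum(score <= cutoff for score in noncredible)
    # Specificity: share of credible performers correctly passed.
    true_negatives = sum(score > cutoff for score in credible)
    return true_positives / len(noncredible), true_negatives / len(credible)


# Hypothetical RDS scores (illustrative only, not study data).
credible_scores = [6, 7, 7, 8, 8, 9, 9, 10, 10, 11]
noncredible_scores = [3, 4, 5, 5, 6, 6, 7, 7, 8, 9]

for cutoff in (4, 5, 6):
    sens, spec = classification_accuracy(credible_scores, noncredible_scores, cutoff)
    print(f"cutoff <= {cutoff}: sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

In this toy example, raising the cutoff from 4 to 6 flags more simulated performances but also begins to misclassify credible ones, which is the pattern the cited reviews describe.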
Despite growing interest in paediatric PVTs, the field is still emerging. The above-mentioned stand-alone and embedded PVTs have demonstrated strong accuracy in specific age groups, cutoff scores, and clinical samples (Ernzen et al., 2025; Kirk et al., 2020). However, results vary across studies, and evidence is not yet consistent enough to establish clear, universally accepted standards for this age group. Thus, further research is needed to refine tools, identify optimal cutoff scores, and ensure that validity indices are developmentally appropriate and clinically useful. Moreover, most available PVTs focus narrowly on memory. This leaves other domains, such as visual perception, executive functioning, and attention, underassessed (Boone, 2009; Fuermaier et al., 2020).
Attention is fundamental for children’s learning, academic achievement, and social development (Cuffe et al., 2020; Salari et al., 2023). Attentional deficits are highly prevalent across child and adolescent psychopathology, including anxiety, depression, post-traumatic stress disorder, obsessive-compulsive disorder, and attention-deficit/hyperactivity disorder (ADHD; Depamphilis et al., 2025). Moreover, developmental factors such as state regulation and delay aversion are still maturing in childhood and adolescence (Sonuga-Barke et al., 2010). When distraction, mind wandering, or slower responses occur, test performance can become poor and/or inconsistent. This performance pattern may resemble task disengagement, as different forms of invalid performance can result in similar patterns of poor and/or inconsistent scores. As a result, this resemblance can lead to invalid performance patterns being misinterpreted as low attentional abilities. Conversely, genuine attentional problems may be mistaken for invalid responding, which can produce false-positive classifications on PVTs (Roor et al., 2024). These concerns are compounded by the fact that no validated PVTs currently exist to reliably differentiate genuine attention deficits from noncredible test engagement in paediatric populations (Deright & Carone, 2015; Fuermaier et al., 2016; Harrison et al., 2007, 2015; Sullivan et al., 2007).
To address this limitation, the Groningen Effort Test (GET) was developed as a freestanding PVT focusing on attention-related performance (Fuermaier et al., 2021). The GET has since been validated in adult simulation studies of feigned ADHD symptoms and of cognitive deficits following acquired brain injury (Fuermaier et al., 2016; Fuermaier et al., 2020). More recently, the GET has also shown utility in the diagnostic assessment of adult ADHD, although current cutoff scores may increase the risk of false positives (Kneidinger et al., 2025; Raasch et al., 2025). It has likewise been applied in early retirement claimants presenting cognitive complaints (Teßmann et al., 2025). Nevertheless, its use has been discouraged in patients with suspected chronic solvent-induced encephalopathy, as one of the GET measures yielded more positive results (Van Vliet et al., 2024). While findings in adults are informative, it remains unclear whether the GET is suitable for younger populations. Evaluating the GET in children and adolescents is therefore necessary to expand assessment options beyond memory-focused measures. Such work could address a critical gap in the literature and support more comprehensive, developmentally appropriate evaluations.
The present study evaluated the GET as an attention-based PVT in children and adolescents using a simulation design. Participants comprised three criterion groups: typically developing individuals assigned to either a control (honest responders) or simulation group, and an outpatient referral group. Individuals in the control and outpatient referral groups were instructed to perform the test battery to the best of their abilities. Those in the simulation group were instructed to feign reduced ability. All participants completed three PVTs: the TOMM, the RDS and the GET. The study examined whether age is associated with GET performance in children and adolescents in the control and outpatient referral groups. Given that the GET has not yet been validated for paediatric use, we anticipated that age would be associated with GET performance, making age-stratified cutoffs necessary. Furthermore, we evaluated the sensitivity and specificity of the GET in distinguishing between genuine and invalid test engagement. In addition, we compared GET performance across the three criterion groups to evaluate its effectiveness in identifying noncredible performance. Finally, we considered the influence of demographic factors (age, gender, cultural background, education, language) on GET outcomes. Because PVTs are intended to function independently of demographic factors, such as gender and cultural background, we expected little to no influence on GET outcomes.
Methods
Participants
Participants were recruited to form three criterion groups used to evaluate the validity of the GET. The groups served different roles in the validation design. The control group (honest responders) was sampled more extensively to maximise the stability of the 90% specificity estimate across childhood and adolescence, whereas sensitivity was estimated from a smaller simulation sample modelling instructed underperformance. This sampling structure parallels that employed in prior paediatric performance validity studies (e.g., Blaskewitz et al., 2008; Gunn et al., 2010). The study also included an outpatient referral group to examine GET performance under routine diagnostic assessment, thereby strengthening the external validity of the design. Recruitment was guided primarily by feasibility considerations related to the availability of outpatient referrals and cooperating schools during the study period rather than a predefined a priori target sample size. The characteristics of the mixed outpatient referral group and the typically developing individuals (control group and simulation group) are described below.
Mixed Outpatient Referrals
Children and adolescents in this sample were referred to the psychotherapeutic child and adolescent outpatient clinic at the Department of Psychology, Marburg University, Germany for diagnostic evaluation due to behavioural, emotional, or cognitive concerns raised by teachers, caregivers, or close contacts. This was a mixed clinical sample, and the diagnostic process was tailored to the suspected diagnoses identified during the initial clinical interview. All participants underwent a comprehensive multiaxial diagnostic assessment based on ICD-10 criteria (Remschmidt et al., 2017). This assessment included the Diagnostic Interview for Mental Disorders in Childhood and Adolescence (Kinder-DIPS; Schneider et al., 2009, 2017) as a semi-structured clinical interview designed to assess psychological disorders in children and adolescents.
PVT administration was integrated into the routine diagnostic assessment. When technical difficulties or scheduling constraints occurred, PVT testing was not administered. Consequently, incomplete PVT datasets did not arise, and no participants were excluded due to missing PVT data.
Participants in the outpatient referral group (N = 62) ranged in age from 7.1 to 17.7 years (M = 11.0, SD = 2.98). In this sample, 48.4% (n = 30) identified as female. All outpatients reported a Western European cultural background and spoke German as their primary language. Half of the participants (51.6%, n = 32) were enrolled in primary school (see Table 1 for details).
Demographic Characteristics of Participants Per Group.
Note. N = number of cases.
ᵃAge in years and months at assessment (M ± SD). ᵇNative language is based on the self-reported primary language. All outpatient referrals reported German (N = 62). In the control group, German was reported by 84 participants; other languages were Dutch (n = 19), Swahili (n = 24), English (n = 6), Hindi (n = 2), and Frisian (n = 3). In the simulation group, German was reported by 32 participants; Dutch (n = 2) and Frisian (n = 6) were also reported. For statistical analyses, non-German languages were collapsed into a single category due to small cell counts. ᶜRegion of cultural origin is based on the self-reported cultural origin. Western Europe: Germany, the Netherlands; East Africa: Tanzania. ᵈPost hoc Dunn tests (Bonferroni correction): outpatient referral vs. simulation (p < .001); control vs. simulation (p = .007); outpatient referral vs. control (p = .093). ᵉPost hoc pairwise Fisher-Freeman-Halton exact tests (Bonferroni correction) – Native language: control vs. simulation (p = .111); control vs. outpatient referral (p < .001); simulation vs. outpatient referral (p = .001); Region of cultural origin: control vs. simulation (p < .001); control vs. outpatient referral (p < .001); simulation vs. outpatient referral (p = 1.0). ᶠPost hoc pairwise χ2 tests (Bonferroni correction) – Education: control vs. simulation (p = .001); simulation vs. outpatient referral (p = .004); control vs. outpatient referral (p = 1.0).
Statistically significant at p < .001.
With respect to clinical characteristics, externalising disorders were the most prevalent category of primary diagnoses, accounting for 38.7% of the sample (n = 24). The majority of these cases were ADHD (35.5%, n = 22), followed by oppositional defiant disorder (3.2%, n = 2). By contrast, internalising disorders were less frequent overall, representing 21.0% of the sample (n = 13). Primary internalising diagnoses included anxiety-related disorders (social anxiety/phobia, specific phobia, and separation anxiety disorder; 9.7%, n = 6) and affective disorders (moderate depressive episode, recurrent depressive disorder, and dysthymia; 8.1%, n = 5). Less prevalent internalising conditions were post-traumatic stress disorder (1.6%, n = 1) and obsessive-compulsive disorder (1.6%, n = 1). In addition, 21.0% (n = 13) of participants received a primary diagnosis of other childhood emotional disorders, and 1.6% (n = 1) were diagnosed with an elimination disorder. Notably, 17.7% (n = 11) of participants did not meet criteria for any mental health condition. Their inclusion ensured consistent assessment procedures and preserved the representativeness of the outpatient referral group. Clinical diagnoses with corresponding Kinder-DIPS classifications are provided in Table 2.
Clinical Diagnoses in the Outpatient Referral Criterion Group (N = 62).
Note. N = number of cases; % representing percentage of cases in the entire sample.
ᵃA primary diagnosis is the major overarching diagnosis. ᵇThe secondary diagnosis is the comorbid diagnosis, but with secondary effects. ᶜKinder-DIPS = Diagnostisches Interview bei Psychischen Störungen, according to the ICD-10/Diagnostic Interview for Mental Disorders, according to the ICD-10; structured clinical interview. ᵈOther Disorders encompasses developmental and learning disabilities.
Typically Developing Individuals
Typically developing children and adolescents were recruited from three secondary schools in Germany and two primary schools located in the Netherlands and Tanzania. The international scope was facilitated by graduate students from the international MSc programme in Psychology at the University of Groningen, the Netherlands, who, under faculty supervision, identified and recruited participating schools in these locations. Furthermore, the students were trained and supervised in all study procedures, coordinated with school staff, distributed study information, obtained parental consent and conducted testing.
We recruited and assessed a total of 218 participants. Prior to the assessment, participants were randomly assigned to either the control group (N = 146; instructed to perform to the best of their abilities) or the simulation group (N = 72; instructed to feign reduced ability by making occasional errors). Further details on the assessment procedure are given in the Materials and Procedure sections.
Of the 218 participants initially considered for inclusion, 5 participants from the control group were excluded based on caregiver-reported prior neurological (n = 1; spina bifida) or psychological disorders (n = 4; ADHD). In the simulation group, caregivers reported ADHD for four participants and autism spectrum disorder for one, and these participants were excluded.
Next, participants were excluded from the primary analyses when required PVT data were missing. Incomplete datasets were defined as cases in which at least one of the required PVT measures could not be evaluated. For the simulation group, complete data on all PVTs (GET, RDS, TOMM Trial 1) were required to evaluate the detection of instructed underperformance. Accordingly, 22 simulators (GET: n = 10; TOMM Trial 1: n = 12) were excluded because at least one of these measures was missing.
For the control group, credible responding was determined using the RDS criterion together with valid GET scores. Three control participants were excluded because GET indices were missing. TOMM Trial 1 scores were unavailable for 34 control participants due to technical or procedural constraints during data collection. These cases were retained because TOMM Trial 1 was not used to determine credible responding in this group. Thus, exclusions reflected predefined eligibility rules specified prior to the primary analyses rather than selective removal based on performance (see the Analysis Plan section).
Finally, five additional participants in the simulation group were excluded because they did not pass the post-experimental manipulation check. Participants rated adherence to the feigning instructions on a five-point scale, and scores below 3 were considered insufficient (see the Materials section).
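The exclusion steps described above can be tallied to check the resulting sample sizes. The following is a minimal sketch that uses only the counts reported in the text; the dictionary layout and the helper name `remaining` are illustrative assumptions, not part of the study's analysis code.

```python
# Counts taken from the recruitment and exclusion steps reported in the text.
recruited = {"control": 146, "simulation": 72}

exclusions = {
    "control": {
        "caregiver-reported neurological or psychological disorder": 5,
        "missing GET indices": 3,
    },
    "simulation": {
        "caregiver-reported ADHD or autism spectrum disorder": 5,
        "missing GET or TOMM Trial 1 data": 22,
        "failed post-experimental manipulation check": 5,
    },
}


def remaining(group: str) -> int:
    """Participants left in a group after all reported exclusions."""
    return recruited[group] - sum(exclusions[group].values())


final_sample = {group: remaining(group) for group in recruited}
print(final_sample)  # {'control': 138, 'simulation': 40}
```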
Characteristics of the remaining samples of 138 controls and 40 instructed simulators are presented in Table 1. The control group (honest responders) ranged in age from 6.5 to 17.2 years (M = 11.8, SD = 2.3) and included 50.0% female participants (n = 69). Most controls reported a Western European cultural background (73.9%, n = 102), with others reporting East African cultural origins (26.1%, n = 36). The majority (60.9%, n = 84) were native German speakers. In this group, 51.4% (n = 71) were enrolled in primary school, and 48.6% (n = 67) were enrolled in secondary school. The simulation group (experimental simulators) ranged in age from 9.1 to 17.3 years (M = 13.1, SD = 2.0) and was predominantly female (72.5%, n = 29). All participants reported a Western European cultural background, with most being native German speakers (80.0%, n = 32). Eight participants (20.0%) were enrolled in primary school, and 32 (80.0%) were enrolled in secondary school.
Criterion group comparability was assessed by testing for differences in gender, age, native language, region of cultural origin and education. See Table 1 for detailed group differences. Significant group differences were observed for age (H = 18.6, p < .001), native language (χ2 = 35.0, p < .001), region of cultural origin (χ2 = 31.3, p < .001), and education (χ2 = 13.3, p = .001). Gender did not differ significantly across criterion groups (χ2 = 7.8, p = .10). Post hoc tests showed that the simulation group differed significantly in age from both the outpatient referral group (p < .001) and the control group (p = .007), whereas the outpatient referral and control groups did not differ significantly from each other (p = .093). Native language differed significantly between the outpatient referral group and both the control group (p < .001) and the simulation group (p < .001), whereas the control and simulation groups did not differ from each other (p = .111). Furthermore, the region of cultural origin differed between the control and simulation groups (p < .001) and between the control and outpatient referral groups (p < .001), whereas the simulation and outpatient referral groups did not differ from each other (p = 1.00). Education level differed significantly between the control and simulation groups (p = .001) and between the simulation and outpatient referral groups (p = .004), while no significant difference emerged between the control and outpatient referral groups (p = 1.00).
Materials
Demographic Questionnaire
A brief demographic questionnaire collected information about participants’ characteristics (including age, gender, native language, and education). Caregiver education was assessed as the highest level of schooling completed. For the outpatient referral sample, additional questions addressed diagnostic status, mental health comorbidities, and medication use. For the typically developing sample, information on potential neurological or mental health conditions and medication use was obtained. For this sample, no independent clinical verification or structured diagnostic screening was conducted.
Diagnostic Interview for Mental Disorders in Childhood and Adolescence (Kinder-DIPS)
The Kinder-DIPS is a semi-structured interview widely used in Germany to assess psychological disorders and behavioural problems in children and adolescents (Schneider et al., 2009, 2017). It is available in two versions: one administered to children and adolescents aged 6–18 years as self-reports, and the other to caregivers as corresponding observer-reports. The interview allows for diagnostic classification according to ICD-10 (Remschmidt et al., 2017) as well as DSM-5 criteria (Falkai et al., 2015). In addition to categorical diagnoses, the Kinder-DIPS provides dimensional ratings of current symptoms and collects relevant clinical information such as prior treatments and parental mental health. Interviews are conducted following a standardised manual containing open and closed questions with instructions for skipping or terminating sections. Previous research has demonstrated good to excellent interrater reliability for both the child and parent versions of the Kinder-DIPS, with Kappa coefficients ranging from .87 to .98 (Neuschwander et al., 2013).
Test of Memory Malingering (TOMM)
The TOMM is a stand-alone PVT designed to detect noncredible memory performance (Tombaugh, 1996). It consists of three trials: two immediate/recognition trials (Trial 1 and Trial 2) and an optional delayed retention trial (Trial 3). During the immediate trials, participants view 50 black-and-white line drawings of common objects presented individually. Following each presentation, participants complete a forced-choice recognition task, selecting the target image from a pair that includes a novel distractor. The optional delayed retention trial follows the same procedure and is typically administered 15–30 minutes after Trial 2.
Although the full TOMM protocol includes multiple trials, only Trial 1 (TOMM1) was administered in this study. Empirical findings indicate that performance on Trial 1 provides a reliable and valid indicator of overall TOMM performance (Brooks et al., 2012; Perna & Loughan, 2013). This approach reduces total testing time without compromising the integrity of performance validity assessment. In terms of scoring, Tombaugh (1996) recommended a TOMM1 cutoff of 45 for adults to indicate valid performance. However, research in paediatric samples has shown that this cutoff reduces specificity and increases false positives (Brooks et al., 2012; Perna & Loughan, 2013). More recent findings indicate that a cutoff of >40 on TOMM1 correctly predicts overall TOMM performance in 98% of cases (Loughan et al., 2016). Based on this evidence, a cutoff score of 40 was applied in the present study to classify credible performance.
Reliable Digit Span (RDS)
The RDS is an embedded PVT derived from the Digit Span subtest of the Wechsler intelligence scales for children (e.g., Wechsler Intelligence Scale for Children-5th Edition, WISC-V; Greiffenstein et al., 1994; Wechsler, 2014). It measures immediate verbal recall, attention, and short-term memory. The RDS is based on two subtests: Digit Span Forward and Digit Span Backward. In Digit Span Forward, the child repeats increasingly longer sequences of numbers in the same order as presented by the examiner. In Digit Span Backward, the child repeats increasingly longer sequences of numbers in the reverse of the order presented by the examiner. For both subtests, each number is read aloud at a rate of one per second. Administration is discontinued when both items of a given pair are failed. An RDS score is calculated according to the guidelines of Greiffenstein et al. (1994) as the sum of the longest strings of digits repeated without error on both trials under the forward and backward conditions; for example, a participant who passes both 5-digit trials and both 4-digit trials obtains an RDS of 9. Kirkwood et al. (2011) recommended a cutoff score of ≤6 to indicate underperformance in children, which achieved a specificity of 96% and sensitivity of 51%. Thus, this cutoff was applied in the present study.
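The scoring rule above can be expressed as a minimal sketch. The data layout (a mapping from span length to the pass/fail outcome of its two trials) and all function names are assumptions made for illustration. The example response pattern is hypothetical and assumes that the 5-digit strings in the worked example were passed in the forward condition and the 4-digit strings in the backward condition.

```python
from typing import Dict, Tuple

# Span length -> (trial 1 correct, trial 2 correct); an assumed layout.
TrialResults = Dict[int, Tuple[bool, bool]]

RDS_CUTOFF = 6  # scores <= 6 are flagged, following the cutoff applied in the text


def longest_reliable_span(results: TrialResults) -> int:
    """Longest span length at which both trials were repeated without error (0 if none)."""
    reliable = [length for length, (first, second) in results.items() if first and second]
    return max(reliable, default=0)


def reliable_digit_span(forward: TrialResults, backward: TrialResults) -> int:
    """Sum of the longest reliable spans in the forward and backward conditions."""
    return longest_reliable_span(forward) + longest_reliable_span(backward)


def is_flagged(rds: int, cutoff: int = RDS_CUTOFF) -> bool:
    """True when the RDS falls at or below the cutoff."""
    return rds <= cutoff


# Hypothetical response pattern; testing stops once both trials of a length are failed.
forward = {3: (True, True), 4: (True, True), 5: (True, True), 6: (True, False), 7: (False, False)}
backward = {2: (True, True), 3: (True, True), 4: (True, True), 5: (False, False)}

rds = reliable_digit_span(forward, backward)
print(rds, is_flagged(rds))  # 9 False
```

Note that a span length passed on only one of its two trials (the 6-digit forward item here) does not count towards the score, which is what makes the span "reliable".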
Groningen Effort Test (GET)
The GET is an attention-based performance validity test designed to detect underperformance in adulthood (Fuermaier et al., 2021). It is based on the Embedded Figures Test (EFT) paradigm, in which a simple geometric figure is embedded within a more complex figure (Fuermaier et al., 2016). During the GET, participants are presented with a simple target figure and a complex figure on a computer screen and are asked to indicate whether the target figure is embedded within the complex figure by pressing a button. In adults, interpretation of GET index values (GETI) and error rates (GETE) is based on cutoff guidelines provided in the test manual (Fuermaier et al., 2021). These guidelines have not been validated for paediatric populations and were therefore not applied in the present study. Instead, GETI scores, which combine processing speed and accuracy within each block, and GETE scores, which reflect the total number of errors, were calculated. These indices were then examined for their potential utility in detecting noncredible performance and in developing developmentally appropriate cutoffs for children and adolescents. The GET has demonstrated promising classification accuracy in analogue studies with adults, differentiating simulated malingered ADHD from genuine ADHD with high sensitivity (68%) and specificity (91%) when using a suggested cutoff score for total errors (Fuermaier et al., 2021). In a further validation study with early retirement claimants presenting cognitive complaints, the GETI demonstrated sensitivity of 71.0% and specificity of 81.8%. The GETE demonstrated sensitivity of 77.4% and specificity of 89.1% (Teßmann et al., 2025).
Strengths and Difficulties Questionnaire-Teachers (SDQ-T)
The SDQ-T is a brief 25-item measure of psychological adjustment in children and adolescents aged 4–17 years (Goodman, 1997). Positive and negative behaviours and attributes across five domains (emotional symptoms, conduct problems, hyperactivity, peer problems, and prosocial behaviour) are rated on a 3-point Likert-type scale with the response options “certainly true” (2), “somewhat true” (1), and “not true” (0). A sum score for each scale is calculated by summing the scores of the five respective items, generating a score from 0 to 10. All scale scores except the prosocial behaviour scale are then summed to generate a Total Difficulties Score (TDS) ranging from 0 to 40. Goodman (1997) classifies scores into three bandings, with normal ranging from 0 to 11, borderline from 12 to 15, and abnormal from 16 to 40. This form of scoring was also used in the current study. Prior research has demonstrated good internal consistency for the SDQ-T, with Cronbach’s α reported at .80 for the TDS, .86 for hyperactivity, .81 for prosocial behaviour, and .73 for emotional symptoms (Van Den Heuvel et al., 2017).
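The scoring steps described above can be sketched as follows. This is a minimal illustration, assuming that each item has already been coded 0–2 in the scored direction; the scale keys, function names, and example ratings are hypothetical and are not taken from the study data.

```python
from typing import Dict, List

# The four scales that contribute to the Total Difficulties Score (prosocial is excluded).
DIFFICULTY_SCALES = ("emotional", "conduct", "hyperactivity", "peer")


def scale_score(items: List[int]) -> int:
    """Sum of the five item scores of one scale (range 0-10)."""
    if len(items) != 5 or any(item not in (0, 1, 2) for item in items):
        raise ValueError("each scale needs five items scored 0, 1, or 2")
    return sum(items)


def total_difficulties(responses: Dict[str, List[int]]) -> int:
    """Total Difficulties Score (range 0-40) across the four difficulty scales."""
    return sum(scale_score(responses[scale]) for scale in DIFFICULTY_SCALES)


def banding(tds: int) -> str:
    """Bandings for the Total Difficulties Score as described in the text."""
    if not 0 <= tds <= 40:
        raise ValueError("TDS must lie between 0 and 40")
    if tds <= 11:
        return "normal"
    if tds <= 15:
        return "borderline"
    return "abnormal"


# Hypothetical teacher ratings for one child (illustrative only).
ratings = {
    "emotional": [1, 0, 2, 1, 0],
    "conduct": [0, 0, 1, 0, 0],
    "hyperactivity": [2, 2, 1, 2, 1],
    "peer": [0, 1, 0, 0, 0],
    "prosocial": [2, 2, 1, 2, 2],
}

tds = total_difficulties(ratings)
print(tds, banding(tds))  # 14 borderline
```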
Pre- and Post-Experimental Check
Participants of the simulation group completed a pre- and a post-experimental check. The pre-experimental check ensured that participants understood and were able to follow the instructions. Participants repeated three common words (e.g., pencil, flower, hat), counted backwards from 10 to 1, and then recalled the three words. If errors occurred, the instructions were repeated, and the task was re-administered. Participants who failed to complete the task correctly after two attempts were excluded. Those who completed the task correctly were then asked whether they could perform with reduced ability. This step confirmed their comprehension of the feigning instruction, namely, to simulate reduced ability.
The post-experimental check assessed compliance with the simulation condition. Participants rated two statements about their adherence to the feigning instructions (e.g., “I tried my very best to present I am not as smart”; “I managed convincingly to pretend I am not so smart”) on a five-point Likert-type scale ranging from 1 (strongly disagree) to 5 (strongly agree). Scores above 3 were considered indicative of adequate adherence to the simulation instructions.
Procedure
The protocol of this study was approved by the respective ethical review boards. All participants were informed about the study and participation was voluntary. Caregivers were assured that the decision to participate or decline would not affect clinical evaluation or care for outpatient referrals, nor academic performance or grading for typically developing (control group and simulation group) individuals. Written consent was obtained from caregivers, and children provided assent. Caregivers completed a demographic questionnaire prior to testing, and caregiver responses were used to screen the typically developing participants for known neurological or mental health diagnoses (see Materials). In addition, teachers of participating children were invited to complete the SDQ-T in the appropriate language version (German, Dutch, or English). Teacher ratings were selected because they provide systematic observations of children’s behaviour in classroom settings. In these contexts, attentional and self-regulatory demands are high, which makes difficulties more readily observable (Minder et al., 2018). Classroom staff can evaluate each child’s behaviour relative to same-age peers, which enhances the interpretability of the ratings and ensures consistent assessment across groups. Descriptive statistics for the SDQ-T scores are shown in Table 3.
Descriptive Statistics and Classification of SDQ Scales Scores (Teacher-Rated Questionnaire) Across All Criterion Group Variables.
Note. N = number of cases.
Missing values: Controls (honest responders): All scales N = 3, Prosocial Behaviour N = 6; Outpatient referrals: All Scales N = 5; Experimental Simulators: All Scales N = 5, Prosocial Behaviour N = 6. bScale score range for the normal banding (0–11) according to Goodman (1997). cOutside the normal total difficulties range: Scores falling in the borderline (12–15) or abnormal (16–40) categories on the individual scales, based on Goodman's (1997) bandings for total difficulties scores.
Each participant completed the same assessment, beginning with Trial 1 of the TOMM, followed by the RDS, both administered in pen-and-paper format, and concluding with the GET, which was administered digitally. As certain procedural elements differed between the mixed clinical outpatients and typically developing individuals (control group and simulation group), group-specific procedures are described separately below.
Mixed Outpatient Referrals
Participants in the outpatient referral group were recruited by a therapist in training during routine diagnostic assessment at the child and adolescent outpatient clinic. Standardised instructions were provided in German, and participants were asked to complete all tasks to the best of their ability. Assessments were then administered by a trained examiner in the outpatient clinic’s neuropsychological laboratory. Each participant received a 10€ cinema voucher as a token of appreciation.
Typically Developing Individuals (Control Group and Simulation Group)
Participants in the control and simulation groups were recruited through primary and secondary schools (Germany, the Netherlands, Tanzania). Testing was conducted during school hours in a separate room by supervised psychology graduate students. Instructions were provided in German, Dutch, or English, depending on the participant’s language background. After receiving a standardised instruction on the procedure, participants were randomly assigned to either the control condition (honest responders) or the simulation condition (experimental simulators). Control participants were instructed to complete all tasks with their best effort. Simulation participants were instructed to feign reduced ability by deliberately making some errors while maintaining the appearance of genuine effort so that the examiner would not detect their feigning (see Supplemental Material S1 for the full instructions). Participants in the simulation group completed a brief pre- and post-experimental check to ensure comprehension and compliance with the simulation instructions. The pre-check verified that participants understood how to simulate reduced ability, and the post-check required them to rate their adherence to the instructions (see the “Materials” section for details). Each participant received a small reward as a token of appreciation.
Analysis Plan
All analyses were performed using RStudio (Version 4.3.3; RStudio, 2024). To examine GET performance under credible test engagement, participants were required to meet the RDS criterion. Accordingly, a credible sub-sample was defined prior to the primary analyses. Participants in the control and outpatient referral groups were included if they met the RDS criterion (see the “Materials” section). All participants in these groups were instructed to perform to the best of their ability.
The RDS served as the screening indicator of credible engagement. As TOMM Trial 1 was also examined as a comparison PVT in the present study, using it as an inclusion criterion would have based sample selection on the same validity measure used in the subsequent analyses. To avoid this circularity, TOMM Trial 1 was not used to determine sample inclusion.
Applying the RDS criterion yielded a valid sub-sample of N = 160 participants (n = 113 control group; n = 47 outpatient referral group). The simulation group (N = 40) was included in full, as all participants in that group passed the pre- and post-experimental checks. Using the valid sub-sample defined by the RDS criterion, GET performance was first examined descriptively across all criterion groups to provide an overview of performance patterns and potential age-related variability.
To examine whether age predicted performance on GETI and GETE, separate linear regression models were fitted for the outpatient referral and control groups. These analyses focused exclusively on these groups to model age-related variability under valid performance conditions. After fitting the models, standard regression diagnostics (e.g., evaluation of linearity, homoscedasticity, and normality of residuals) were performed post hoc to determine the appropriateness of the linear models. Predicted values and their 95% confidence intervals were generated to visualise performance trends across the age range.
Next, the classification accuracy of the GET (GETI and GETE) was evaluated. Age-stratified cutoff scores were derived from the collapsed control and outpatient referral groups using the 90th percentile. This approach fixed the false-positive rate at maximally 10% during cutoff construction, consistent with the recommendation that PVTs prioritise high specificity (Larrabee, 2012). Specificity in these groups therefore reflects the imposed threshold and was not treated as an independent validation. To support age-specific cutoff development, descriptive statistics for the age-stratified groups were examined for the GET test performance. These cutoffs were then applied to the simulation group to estimate sensitivity and to the control and outpatient referral groups to calculate specificity. In addition, the classification accuracy of the GET (GETI and GETE) was benchmarked against the established PVTs (TOMM Trial 1 and RDS) within the simulation group.
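The cutoff logic described in this paragraph can be sketched as follows. This is an illustrative Python sketch rather than the authors' code. It assumes that higher GET scores indicate poorer performance, that a score strictly above the cutoff counts as a failure, and that the percentile is computed by linear interpolation; none of these details is specified in the text, and the data are made up.

```python
import statistics


def percentile_cutoff(credible_scores, pct=90):
    """pct-th percentile of the credible group, using linear interpolation.

    Assumption: the interpolation rule is not stated in the paper; the
    'inclusive' method is one common choice.
    """
    cuts = statistics.quantiles(credible_scores, n=100, method="inclusive")
    return cuts[pct - 1]


def failure_rate(scores, cutoff):
    """Share of scores classified as invalid (assumed rule: score > cutoff)."""
    return sum(s > cutoff for s in scores) / len(scores)


# Made-up scores for one age stratum (not study data).
credible = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]  # controls + outpatient referrals
simulators = [12, 9, 15, 11, 8]

cutoff = percentile_cutoff(credible)
specificity = 1 - failure_rate(credible, cutoff)  # fixed by construction near 90%
sensitivity = failure_rate(simulators, cutoff)    # estimated on the simulators

print(f"cutoff={cutoff:.2f} specificity={specificity:.2f} sensitivity={sensitivity:.2f}")
```

In an age-stratified analysis the same two functions would be applied separately within each age group. Because specificity is computed on the sample that defined the cutoff, it lands near 90% by construction, which is why the text does not treat it as independent validation.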
A further comparison of PVT performance was conducted in participants from the control and outpatient referral groups who completed all validity measures (GET, RDS, TOMM Trial 1), without applying the RDS-based inclusion criterion. This complete-data subset comprised N = 166 participants (control group: n = 104; outpatient referral group: n = 62). Age-specific GET cutoffs derived from the primary analyses were applied. Failure rates were calculated within the resulting age groups and compared across GET indices, RDS, and TOMM Trial 1. This approach enabled direct comparison of the proportion of participants falling below established cutoffs across measures under identical data conditions.
In subsequent analyses, the utility of GET measures (GETI and GETE) in differentiating between valid (control and outpatient referral groups) and invalid (simulation) performance was examined using receiver operating characteristic (ROC) analyses. For each GET measure, a ROC analysis was performed to determine the area under the curve (AUC) and identify the optimal threshold. Test accuracy, as indicated by the AUC, was evaluated using Swets (1988) criteria, with values of .70–.80 considered fair, .80–.90 good, and above .90 excellent.
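As a minimal illustration of what the AUC measures, the sketch below computes it as the probability that a randomly chosen invalid score exceeds a randomly chosen valid score, with ties counted as one half. It is written in Python for illustration and does not reproduce the authors' ROC software or their rule for the optimal threshold. The scores are made up. Because the quoted bands share their boundary values, the sketch assumes that the lower bound of each band is inclusive, and the label returned for values below .70 is a placeholder.

```python
def auc(invalid_scores, valid_scores):
    """AUC as P(invalid > valid) + 0.5 * P(tie), over all score pairs.

    Assumption: higher scores indicate poorer (more suspect) performance.
    """
    wins = 0.0
    for x in invalid_scores:
        for y in valid_scores:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(invalid_scores) * len(valid_scores))


def swets_label(value):
    """Verbal labels quoted in the text (.70-.80 fair, .80-.90 good, >.90 excellent)."""
    if value > 0.90:
        return "excellent"
    if value >= 0.80:
        return "good"
    if value >= 0.70:
        return "fair"
    return "below the fair range"  # placeholder; not labelled in the text


# Made-up scores (not study data).
simulated = [9, 7, 8, 4]
credible = [3, 5, 2, 6, 4]

value = auc(simulated, credible)
print(round(value, 3), swets_label(value))
```

With these made-up scores, 17.5 of the 20 pairs favour the simulated group, giving an AUC of 0.875, which falls in the good band.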
Finally, multiple linear regression analyses were conducted to examine whether demographic variables (age, gender, cultural background, education, language) and group membership (outpatient referral group vs. control group) predicted GET performance (GETI and GETE). These analyses evaluated the extent to which demographic factors accounted for variability in GET measures. Potential demographic differences between the outpatient referral and control groups were further examined using a sensitivity analysis with propensity score matching. Full details of the propensity score matching analysis are provided (see Supplemental Material S6). The simulation group was excluded because their instructed invalid performance would confound the assessment of valid demographic effects.
Results
Age-Related Variability on GET Performance
Table 4 presents the descriptive statistics of GET performance across all criterion groups, which precede the inferential analyses reported below. A linear regression revealed that age was a significant predictor of the GETI scores in the outpatient referral group, β = −0.72, SE = 0.19, t(45) = −3.8, p < .001, explaining 25% of the variance, R² = .25. In contrast, no significant age effect emerged in the control group, β = −0.04, SE = 0.21, t(111) = −0.19, p = .85, R² < .01. Model diagnostics (see Supplemental Materials S2–S5) indicated that assumptions were met for the outpatient group but not fully met for the control group. Residuals in the control group deviated from normality and showed heteroscedasticity. However, the large sample size (total N = 160; n = 113 control group; n = 47 outpatient referral group) and robustness of linear regression to moderate assumption violations supported retaining the model (Lumley et al., 2002).
Descriptive Statistics of GET Performance Across Criterion Groups Overall and by Age (M ± SD).
Note. M = mean, SD = standard deviation.
The overall sample of the control (honest responders) group N = 113; outpatient referral group N = 47 are based on the sub-sample (passing of the RDS); experimental simulators N = 40. bControl (honest responders): Children (<13 years; N = 68); outpatient referrals: Children (<13 years; N = 33); experimental simulators: Children (<13 years; N = 20). cControl (honest responders): Adolescents (>13 years; N = 45); outpatient referrals: Adolescents (>13 years; N = 14); experimental simulators: Adolescents (>13 years; N = 20).
Inspection of the predicted GETI values and their 95% confidence intervals (CI) across ages 7 to 17 further illustrated this pattern. In the outpatient referral group, predicted GETI scores decreased from approximately 5.73 (95% CI [3.67, 7.79]) at age 7 to −1.47 (95% CI [−3.77, 0.83]) at age 17, with confidence intervals generally narrowing between ages 9 and 15. Conversely, in the control group, predicted GETI scores remained stable and near zero, ranging from approximately −0.23 (95% CI [−2.51, 2.06]) at age 7 to −0.61 (95% CI [−2.77, 1.55]) at age 17, with wide overlapping confidence intervals, consistent with the absence of a significant age effect. Figure 1 illustrates age and GETI by group with 95% CI.

Age and GETI by group (control group, outpatient referral group) with 95% confidence intervals.
In addition, age significantly predicted GETE in the outpatient referral group, β = −1.56, SE = 0.46, t(45) = −3.38, p = .002, explaining 20% of the variance, R² = .20. In the control group, age also significantly predicted GETE, β = −1.61, SE = 0.36, t(111) = −4.45, p < .001, explaining 15% of the variance, R² = .15. Model diagnostics (see Supplemental Materials S4 and S5) indicated a nonlinear age effect in the outpatient referral group, with a quadratic model providing better fit, while a linear model was sufficient for the control group. Despite these findings, linear models were retained for consistency and interpretability.
Inspection of predicted GETE values and their CI across ages 7–17 further illustrated these effects (Figure 2). In the outpatient referral group, predicted GETE decreased from approximately 23.57 (95% CI [18.50, 28.64]) at age 7 to 7.99 (95% CI [2.32, 13.66]) at age 17, with moderately narrowing confidence intervals across mid-childhood (ages 7–11) and adolescence (ages 12–17). Similarly, in the control group, predicted GETE declined from approximately 19.65 (95% CI [15.65, 23.65]) at age 7 to 3.54 (95% CI [−0.24, 7.33]) at age 17, with wide and overlapping confidence intervals across the age range.

Age and GETE by group (control group, outpatient referral group) with 95% confidence intervals.
Derivation of Age-Specific Cut-Offs and Classification Accuracy of the GET Measures
Age was associated with GETI and GETE performance in the outpatient referral group but not consistently in the control group. In the control group, GETI scores showed no age-related change; the age–GETI slope was near zero, which prevented reliable estimation of a change point. GETE, by contrast, declined with age in both groups. In the outpatient group, both GET indices showed age-related decreases, providing sufficient variability to estimate the age threshold using segmented regression. Breakpoints were identified at 13.9 years for the GETI (95% CI [11.0–16.9]) and 13.7 years for the GETE (95% CI [11.5–16.4]), indicating that performance stabilises in early adolescence. Based on these findings, age-stratified cutoffs were derived for children (<13 years) and adolescents (>13 years). Descriptive differences between children (<13 years) and adolescents (>13 years) are evident in Table 4, supporting the presence of age-related variation in GET test performance.
Cutoff values for the GETI were 7.03 for children (<13 years) and 4.08 for adolescents (>13 years). For GETE, the corresponding cutoffs were 29 and 21.4, respectively. For GETI, sensitivity was 85% in children and 55% in adolescents, whereas for GETE, sensitivity was 75% and 45%, respectively. Specificity remained stable at 90% across all measures and age groups. Normative cutoffs and classification accuracy metrics are summarised in Table 5.
GET Index and Errors: Normative Cutoffs and Accuracy Metrics by Age.
Note. Sensitivity and specificity are reported as percentages.
Cutoff values (90th percentile) were derived from the collapsed group of participants (control group, outpatient referral group) who passed the RDS. This collapsed group (N = 160) was further stratified by age (<13 years, N = 101; >13, N = 59) before calculating percentile-based cutoffs. bSensitivity values were calculated based on the simulation group (N = 40), which was evenly split into two age groups: children (<13 years; N = 20) and adolescents (>13 years; N = 20). cSpecificity values were calculated based on the sub-sample of participants (control group, outpatient referral group) who passed the RDS. This sub-sample (N = 160) was further stratified by age (<13 years, N = 101; >13, N = 59).
For comparison, sensitivity of RDS and TOMM Trial 1 in the simulation group was 75% and 85%, respectively, in children (<13 years), and 50% and 55% in adolescents (>13 years). The precision of sensitivity estimates derived from the simulation group was evaluated using exact binomial 95% confidence intervals as a sensitivity analysis. In children (<13 years; n = 20), the confidence intervals for sensitivity ranged from 62% to 97% for the GETI and TOMM Trial 1 and from 51% to 91% for GETE and RDS. In adolescents (>13 years; n = 20), confidence intervals were wider (GETI: 32%–77%; GETE: 23%–69%; RDS: 27%–73%; TOMM Trial 1: 32%–77%).
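To show how wide an exact binomial interval is with n = 20, the sketch below computes a Clopper-Pearson interval using only the Python standard library. It is an illustration, not the authors' analysis code, and the helper names are hypothetical. The count of 17 detected simulators is inferred from the reported 85% sensitivity with n = 20, and the bounds are found by simple bisection.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(g, target, tol=1e-12):
    """Find p in [0, 1] with g(p) = target, for g increasing in p (bisection)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_binomial_ci(k, n, conf=0.95):
    """Clopper-Pearson interval for k successes out of n trials."""
    alpha = 1 - conf
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2)
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper


# 17 of 20 simulators detected corresponds to 85% sensitivity.
low, high = exact_binomial_ci(17, 20)
print(f"{low:.1%} to {high:.1%}")
```

With only 20 simulators per age group, the interval spans roughly 35 percentage points, which illustrates why the sensitivity estimates carry limited precision.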
Age-stratified failure rates for the GET indices, RDS, and TOMM Trial 1 in the complete-data subset of the control and outpatient referral groups (without application of the RDS inclusion criterion) are presented in Table 6.
PVT Failure Rates in a Complete-Data Sub-Sample of the Control and Outpatient Referral Groups.
Note. N = number of cases. Analyses were conducted in the complete-data subset of the control and outpatient referral groups; participants who completed the GET, RDS, and TOMM Trial 1. Age-specific GET classification was based on the cutoffs derived in Table 5.
PVTs: GETI, GETE = Groningen Effort Test (Index, Errors); RDS = Reliable Digit Span; TOMM Trial 1 = Test of Memory Malingering, Trial 1; cutoff criteria for RDS (see Kirk et al., 2011) and TOMM Trial 1 (see Loughan et al., 2016) were applied. bControl (honest responders): Children (<13 years; N = 73); outpatient referrals: Children (<13 years; N = 46); combined group: Children (<13 years; N = 119). cControl (honest responders): Adolescents (>13 years; N = 31); outpatient referrals: Adolescents (>13 years; N = 16); combined group: Adolescents (>13 years; N = 47).
Age-Stratified Discrimination of Simulated vs. Control and Outpatient Performances
Age-stratified ROC analyses were conducted to evaluate the utility of the GET in distinguishing invalid performance (simulation group) from valid performance (control group, outpatient referral group). Analyses were performed separately in children (<13 years) and adolescents (>13 years). The resulting ROC curves are presented in Figure 3.

Age-stratified receiver operating characteristic (ROC) curves of the GET (GETI and GETE) differentiating invalid performance (simulation group) from valid performance (control group, outpatient referral group). (A) Control (honest responders) vs. Simulation (<13). (B) Control (honest responders) vs. Simulation (>13). (C) Outpatient Referrals vs. Simulation (<13). (D) Outpatient Referrals vs. Simulation (>13).
For the control group versus simulation comparisons (Figure 3, Panels A–B), in children younger than 13 years, the GETI achieved an AUC of 0.95 (SE = 0.03, 95% CI [.89–1.00], p < .001), while the GETE yielded an AUC of 0.92 (SE = 0.04, 95% CI [.84–1.00], p < .001). Among adolescents (>13 years), the GETI revealed an AUC of 0.83 (SE = 0.06, 95% CI [.72–.94], p < .001), and the GETE an AUC of 0.77 (SE = 0.06, 95% CI [.65–.89], p < .001).
For the outpatient referral versus simulation comparisons (Figure 3, Panels C-D), in children younger than 13 years, the GETI achieved an AUC of 0.87 (SE = 0.05, 95% CI [.76–.97], p < .001), while the GETE achieved an AUC of 0.85 (SE = 0.06, 95% CI [.74–.96], p < .001). Among adolescents (>13 years), the GETI yielded an AUC of 0.85 (SE = 0.07, 95% CI [.73–.98], p < .001), and the GETE an AUC of 0.67 (SE = 0.09, 95% CI [.49–.86], p = .064).
Predictors of GET Performance (GETI and GETE) in a Collapsed Group of Control and Outpatient Referral Children
Multiple linear regression models were conducted to examine the effects of predictors on GETI and GETE scores in a collapsed group of controls and outpatient referrals. Results are presented in Tables 7 and 8.
Linear Regression Predicting GET Index Score.
Note. Both (valid) groups are based on the sub-sample that passed the RDS: control (honest responders) group N = 113; outpatient referral group N = 47.
Reference categories were: Male (Gender); Primary (Education); Western Europe (Culture); Non-German (Language), and Control (Group). bAge in years; months.
Statistically significant at p < .01.
Linear Regression Predicting GET Errors.
Note. Both (valid) groups are based on the sub-sample that passed the RDS: control (honest responders) group N = 113; outpatient referral group N = 47.
Reference categories were: Male (Gender); Primary (Education); Western Europe (Culture); Non-German (Language) and Control (Group). bAge in years; months.
Statistically significant at *p < .01, **p < .05.
The overall model predicting GETI scores was significant, F(7, 152) = 2.76, p = .010, explaining 11% of the variance (adjusted R² = .07). Among the predictors, group membership was significant. The outpatient referral group (n = 47) showed significantly higher GETI scores than the control group (n = 113), b = 2.34, SE = 0.89, p = .010.
For GETE, the regression model was also significant, F(7, 152) = 6.52, p < .001, explaining 23% of the variance (adjusted R² = .20). Significant predictors included age, with older participants showing fewer errors (b = −1.23, SE = 0.42, p = .004). Group membership (control group, outpatient referral group) also significantly predicted GETE, with outpatient referrals making more errors than the control group (b = 4.15, SE = 1.68, p = .014).
The robustness of these findings was further examined using a propensity score-matched sub-sample of control and outpatient referral participants. This procedure yielded 78 matched participants (outpatient referrals: n = 39; control group: n = 39). Demographic characteristics of the matched sub-sample are reported in Table S1 (see Supplemental Materials). The regression model predicting GETI scores was not statistically significant. In contrast, the regression model predicting GETE scores remained statistically significant, although age was no longer a significant predictor. Group membership remained a significant predictor of GETE scores. Full regression results are reported in Section S6, Tables S2–S3 (see Supplemental Materials).
Discussion
The purpose of the present study was to evaluate the GET as an attention-based PVT in children and adolescents using a simulation design. In particular, we extended its previous application in adult clinical samples (e.g., Fuermaier et al., 2016, 2020; Kneidinger et al., 2025; Teßmann et al., 2025; Van Vliet et al., 2024). Consistent with expectations, our findings indicate that the GET, as developed for adults, does not per se generalise to children and adolescents, highlighting the need for age-specific cutoffs.
Evidence for this need first emerged when age-related variability in GET performance was observed. Under control testing conditions, outpatient referrals showed elevated GETI scores at age 7 that declined and stabilised by age 17. Conversely, GETI scores in the control group remained stable across the 7–17 age span. For the GETE, both groups demonstrated a steady decline in errors from childhood (approximately age 7) into adolescence (approximately ages 12–17), reflecting improvements in performance consistency.
Age-related variation in GET performance prompted the establishment of separate cutoffs for children (<13 years) and adolescents (>13 years). This age distinction was identified through breakpoints estimated around 13 years in the outpatient group, indicating that performance stabilised in early adolescence. To meet established criteria for validity tests, cutoffs were set to maintain 90% specificity across age groups (Larrabee, 2012). Despite comparable specificity, sensitivity to noncredible performance in the simulation group was higher in children (85% for the GETI; 75% for the GETE) than in adolescents (55% for the GETI; 45% for the GETE). For comparison, adult analogue research has shown sensitivities of 68% for the GETE and 89% for the GETI when detecting simulated ADHD (Fuermaier et al., 2021). Consistent with adult findings, the GETI demonstrated higher sensitivity than the GETE in the present study. Children performed similarly to adults on the GETI but lower on the GETE, while adolescents showed reduced sensitivity across both indices. A comparable trend was observed in the ROC analyses, showing that both GET indices classified noncredible performance more accurately in children than in adolescents. This pattern was consistent across comparisons with control and outpatient referral groups, with the GETI showing moderately stronger overall discrimination than the GETE. These findings may point towards a modest advantage of the GETI in identifying noncredible performance, particularly among children (<13 years).
In the context of established PVTs, the GET yielded comparable sensitivity values to the RDS (75%) and TOMM Trial 1 (85%) among children. Among adolescents, GETI demonstrated a comparable sensitivity to TOMM Trial 1 (55%) and slightly higher than RDS (50%), whereas GETE showed a lower sensitivity. Prior research using the TOMM’s adult cutoff (<45; Tombaugh, 1996) has shown passing rates of 98%–100% in healthy paediatric samples (Constantinou & McCaffrey, 2003; Rienstra et al., 2010). However, studies have reported a higher failure risk among children with clinical diagnoses (e.g., Donders, 2005; Kirk et al., 2011; Macallister et al., 2009), with pass rates ranging from 67% to 93% (Loughan & Perna, 2014). It is therefore unsurprising that Gunn et al. (2010), who used a simulation design without clinical participants, reported higher sensitivity (95%) and specificity (98%). To address administration time, the present study employed Trial 1, which has been shown to provide a valid indicator of overall TOMM classification (Brooks et al., 2012; Perna & Loughan, 2013). Specifically, Loughan et al. (2016) found that a cutoff score of 40 yielded a sensitivity of 80% in outpatients aged 6–18 years, comparable to the present findings in children.
Relative to the TOMM, the RDS has been examined less extensively in simulation-based studies. However, Blaskewitz et al. (2008) constitute one such example, in which 90% of simulators failed an adult cutoff of 7/8. The majority of matched controls (59%) also failed using this cutoff score. Alternatively, Kirkwood et al. (2011) used a clinical criterion group rather than a simulation design, reporting an RDS sensitivity of 51%. This value serves as an overarching reference point for interpreting the present findings, which reveal a comparable sensitivity among adolescents. In fact, this correspondence is likely due to the comparable age range, as Kirkwood et al.’s sample (M = 14.1 years) closely aligns with the present study’s adolescent group (>13 years). The consistent differentiation between children and adolescents across PVTs suggests that developmental factors contribute to variability in sensitivity. Developmental maturation associated with increasing age, including improvements in working memory, processing speed, and sustained attention, may partly account for this pattern. Greater cognitive maturity may also allow adolescents to regulate their responses more strategically during assessment, influencing how performance patterns are expressed in structured testing contexts.
At the same time, the finding of high classification accuracy in children (<13 years) when appropriate cutoffs are applied warrants cautious interpretation. Vickery et al. (2001) cautioned that analogue designs may inflate detection rates relative to real-world feigning. This concern is not unique to the GET and may apply to the reported findings for the RDS and TOMM Trial 1. Although this limitation is inherent to analogue designs, the present study sought to address this issue by improving methodological control. Participants in the simulation group completed the pre-experimental comprehension check and post-experimental adherence rating. The pre-check ensured that participants understood simulation instructions, and the post-check confirmed that they followed these instructions during testing. These procedures enhance confidence in the validity of the simulation condition and extend beyond the methods typically applied in previous child simulation studies, which often relied on examiner judgement or single post-test questions (e.g., Blaskewitz et al., 2008; Gunn et al., 2010; Nagle et al., 2006; Rambo et al., 2015).
Nonetheless, even with these methodological refinements, the elevated sensitivity observed in children likely reflects developmental characteristics rather than superior test performance. Younger participants may adopt more concrete and exaggerated response strategies when asked to simulate reduced ability (Lee, 2013). Such responding can accentuate differences between credible and simulated responding, thereby inflating apparent classification accuracy, although similar strategies may also occur in real-world paediatric cases of noncredible performance. Adolescents, in contrast, may employ more deliberate and nuanced feigning strategies to maintain credibility and avoid detection (Rogers, 2008). This response pattern produces greater overlap with genuine performance, yielding lower sensitivity but providing a more ecologically valid representation of noncredible performance. The consistent reduction in sensitivity observed in adolescents across the GET, TOMM Trial 1, and RDS is therefore better understood in terms of simulation strategy than instrument performance. When feigning becomes more controlled and credibility-focused, PVTs are expected to show less robust detection rates. This interpretation aligns with theoretical models of malingering, which emphasise the need to balance symptom exaggeration with the risk of detection (Merckelbach & Merten, 2012). From this perspective, stronger detection in children and weaker detection in adolescents reflect differences in how noncredible responding is enacted, rather than differences in the adequacy of the measures themselves. Whether this pattern extends beyond analogue paradigms remains to be examined in clinical contexts.
A further objective of this study was to examine predictors of GET performance in a collapsed control and outpatient referral group, beyond age-stratified cutoffs. Specifically, the demographic differences identified in the broader group comparison raised the question of whether such characteristics were related to variation in GET performance. Regression analyses indicated that, as expected, both age and group membership influenced GET indices. Outpatient referrals scored higher on the GETI and demonstrated increased error frequency on the GETE relative to controls. These elevations may reflect the cognitive load imposed by the GET, as psychological conditions represented in the outpatient referral sample (e.g., ADHD, anxiety-related disorders) are known to be associated with limited attentional resources and slowed processing speed. Such patterns can resemble task disengagement, making it difficult to distinguish genuine attentional problems from noncredible performance, which may thus increase the risk of false positives (Roor et al., 2024). As expected, older participants made fewer errors on the GETE, consistent with developmental improvements in attentional control and self-monitoring (Best & Miller, 2010).
Beyond the effects of group membership and age, demographic variables (e.g., gender, culture, education, language) were largely unrelated to GET performance. To examine whether residual demographic imbalances influenced these findings, the analyses were repeated in a propensity score-matched sub-sample. The overall pattern of results remained stable. The regression model predicting GETI scores was not statistically significant in the matched analyses. Although the coefficient for group membership reached significance, this effect should be interpreted with caution, given the non-significant overall model. In contrast, the regression model predicting GETE scores remained significant, with group membership continuing to predict higher error scores among outpatient referrals, whereas age was no longer a significant predictor. The persistence of group differences in GETE under demographically comparable conditions suggests that these differences are unlikely to be explained by demographic composition. These findings align with prior research showing that PVT outcomes are generally stable across demographic factors in paediatric samples (Baker et al., 2014; Donders & Romain, 2025). Similar evidence has also been reported in adults, where Raasch et al. (2025) found that GET performance was largely robust to demographic factors in an adult ADHD sample. The present results therefore support the interpretation of the GET as a relatively robust attention-based PVT, while acknowledging that developmental factors may contribute to performance variability in paediatric populations.
Limitations and Future Directions
The results of the present study need to be interpreted in the context of several limitations. First, the groups were uneven in size (control group n = 113; outpatient referral group n = 47; simulation group n = 40). Such imbalances reduce statistical power and constrain the stability of the regression models and classification analyses. This limitation was further compounded when age-specific cutoffs were derived for the GET based on the collapsed group passing the RDS (outpatient referral and control groups). The relatively small number of adolescents (>13 years; n = 59) increases the likelihood of unstable percentile estimates and limits the validity of these thresholds. Sensitivity values were based on the simulation group (n = 40). Although this group was evenly divided by age, the modest sample size reduced the extent to which these estimates can be generalised. This uncertainty was reflected in the exact binomial 95% confidence intervals, which indicated limited precision in the sensitivity estimates, particularly among adolescents. Moreover, exclusions due to incomplete PVT data reduced the effective simulation sample, which may further limit the generalisability of the sensitivity estimates. Future research should therefore prioritise larger and more balanced samples, particularly by improving the ratio of children to adolescents. By doing so, future studies can generate more stable sensitivity estimates and improve the reliability of age-specific GET cutoffs.
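To illustrate why sensitivity estimates from roughly 20 simulators per age band are imprecise, the following minimal sketch computes an exact binomial (Clopper-Pearson) 95% confidence interval using only the Python standard library. The function names and the detection counts in the example are hypothetical and are not the study's results. The bounds are found by bisection on the binomial distribution function rather than with a statistics package.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect(f, lo=0.0, hi=1.0, tol=1e-10):
    """Root of a function that decreases from positive to negative on [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided confidence interval for k successes in n trials."""
    # Lower bound: the p at which P(X >= k | p) equals alpha / 2.
    lower = 0.0 if k == 0 else _bisect(
        lambda p: alpha / 2 - (1 - binom_cdf(k - 1, n, p))
    )
    # Upper bound: the p at which P(X <= k | p) equals alpha / 2.
    upper = 1.0 if k == n else _bisect(
        lambda p: binom_cdf(k, n, p) - alpha / 2
    )
    return lower, upper


# Hypothetical counts of detected simulators out of 20 per age band.
for label, detected in [("children", 18), ("adolescents", 12)]:
    low, high = clopper_pearson(detected, 20)
    print(f"{label}: sensitivity = {detected / 20:.2f}, "
          f"95% CI [{low:.3f}, {high:.3f}]")
```

With n = 20, intervals of this kind span several tenths of the probability scale, which is the imprecision described above.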
In addition, group assignments were not fully coherent across criterion groups. Outpatient referral and control participants were included if they scored above the cutoff of the RDS, indicating credible performance, whereas the simulation group was defined by adherence to simulation instructions. Moreover, instructed simulators were asked after completion of the assessment if they had done their best in feigning poor effort. Although this approach ensured adequate performance screening, it limited comparability across groups due to differing inclusion criteria.
Furthermore, the definition of the typically developing individuals (control group and simulation group) relied on caregiver-reported demographic questionnaire data to exclude known neurological and/or mental health conditions. No formal diagnostic interviews or independent clinical evaluations were conducted. Therefore, undetected conditions cannot be ruled out, and the reported information could not be independently verified. These factors may have introduced additional heterogeneity into the group of typically developing individuals.
Relatedly, the criterion groups differed on several demographic variables (age, native language, region of cultural origin, education), which may further complicate direct comparison across groups. Moreover, native languages other than German (e.g., Dutch, Swahili, English, Hindi, Frisian) were collapsed into a single “non-German” category for statistical analyses due to small cell counts within individual language groups. This aggregation reduced the specificity of language-related comparisons and may have obscured potential differences between distinct linguistic backgrounds. Although most of the demographic variables did not seem to affect GET performance meaningfully, future research should recruit more clearly defined criterion groups that are demographically comparable and psychologically homogeneous.
A further limitation involves the testing context. Conducting simulation testing in a school setting may have made it difficult for children to intentionally underperform. As Blaskewitz et al. (2008) noted, children in such environments are accustomed to striving for approval from teachers or parents and may therefore have been reluctant to give suboptimal effort, despite the instruction to do so (Nagle et al., 2006).
Finally, simulation designs, by nature, can be criticised for limited external validity (Rogers & Cruise, 1998; Vickery et al., 2001). The participants and context of simulation designs will always differ from those found in real-world forensic or clinical assessments. On this note, examiners were also aware of group membership. The absence of deception further reduces ecological validity, as noncredible performance in applied settings typically occurs without examiner awareness. Moreover, it remains uncertain whether younger children can meaningfully engage in simulation paradigms, raising questions about the validity of their simulated performance. Future research should enhance ecological validity by incorporating examiner blinding to minimise expectancy effects. In addition, in paediatric populations, experimental studies with criterion groups (e.g., distinct clinical populations such as children with ADHD or traumatic brain injury) are warranted before the GET can be used with confidence in applied neuropsychology.
Conclusion
The present study employed a simulation design to determine the utility of the GET in differentiating credible from noncredible performance in paediatric samples. Age-specific cutoffs were established based on a collapsed group (control group and outpatient referral group) that passed the RDS. GET classification accuracy was higher in children (<13 years) than in adolescents (>13 years), a pattern also observed for the TOMM Trial 1 and RDS. The greater sensitivity in children likely reflects more overt simulation strategies, whereas adolescents simulated more subtly, yielding lower but presumably more realistic detection rates. In addition, the GETI showed stronger discrimination than the GETE, supporting its use as a more robust measure of performance validity. Importantly, group differences in GET performance remained evident in the propensity score-matched sub-sample, indicating that these effects were not primarily attributable to demographic imbalances. These findings support the interpretation of the GET as a robust attention-based PVT that is relatively stable across demographic characteristics.
To date, paediatric clinical guidelines for performance validity assessment remain less clearly defined than in adult neuropsychology. Consensus regarding optimal test combinations for children and adolescents has not yet been established. Current recommendations emphasise the use of multiple validity indicators, including both stand-alone and embedded measures. Combining indicators may improve the interpretability of validity profiles and help minimise false-positive classifications.
Within this framework, PVTs may also be considered with respect to the cognitive domains they primarily assess. Many commonly used PVTs, such as the TOMM Trial 1 and the RDS, rely primarily on memory performance. In contrast, the GET represents an attention-based indicator. It may therefore complement established PVTs by assessing a different aspect of performance validity. From this perspective, the GET should be considered a complement to existing indicators rather than a replacement. Combining attention- and memory-based indicators may broaden the scope of validity assessment and support a multi-method approach to neuropsychological evaluation.
At the same time, the clinical role of the GET should be interpreted cautiously. The present study demonstrates that the GET can differentiate credible from instructed noncredible responding when age-appropriate cutoffs are applied. However, its incremental value beyond established PVT combinations has not yet been demonstrated. Future studies should examine whether adding the GET improves classification accuracy beyond TOMM Trial 1 and the RDS alone.
Additional research is also required in clearly defined paediatric clinical samples and under conditions that more closely resemble routine neuropsychological assessment. Evidence from such studies will be necessary before firm recommendations for routine paediatric use can be established.
Supplemental Material
Supplemental material, sj-docx-1-asm-10.1177_10731911261452115 for Attention-Based Performance Validity Assessment in Paediatric Samples by Emily Raasch, Hanna Christiansen, Johanna Kneidinger, Björn Albrecht and Anselm B. M. Fuermaier in Assessment
Footnotes
Acknowledgements
We thank all research assistants involved in this project for their support in data collection and processing.
Ethical Considerations
The study was approved by the respective ethical review boards: the Ethics Committee of the Faculty of Psychology, Marburg University, Germany, for the assessment of patients from the outpatient clinic (approval: 2022-53k, September 15, 2022), and the Ethical Committee Psychology (ECP) affiliated with the University of Groningen, the Netherlands, for typically developing participants (approval: 18027-O, September 20, 2018; approval: PSY-2122-S-0152, January 31, 2022; approval: PSY-2223-S-0419, June 19, 2023). Prior approval from an Institutional Review Board was therefore obtained in line with point 23 of the Declaration of Helsinki.
Consent to Participate
Written informed consent was obtained from the legal guardians of all participants.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: A.B.M.F. has a contract with Schuhfried GmbH for the development and evaluation of neuropsychological assessment instruments. A.B.M.F. is the test author of the Groningen Effort Test (GET), which is administered and examined in the present study. The GET is an attention-based neuropsychological test on the Vienna Test System (VTS), owned and distributed by the test publisher Schuhfried GmbH.
Data Availability Statement
The data presented in this study are available on request from the corresponding author.
Supplemental Material
Supplemental material for this article is available online.
References