
Abstract

Objective

The current diagnosis of intellectual disability (ID) in children relies on resource-intensive assessments by experts, limiting their use for widespread estimation. Eye-tracking offers a potential digital biomarker, but its application to the multifaceted cognitive profile of ID remains scarce. This study aimed to develop and validate a novel eye-tracking assessment combined with deep learning as an automated tool for estimating cognitive capacity of ID.

Methods

We developed three cognitive subtasks to elicit spatio-temporal gaze patterns related to three subindices including verbal comprehension (VCI), fluid reasoning (FRI), and working memory (WMI). With data collected from seven children with ID and nine typically developing (TD) children, we compared a logistic regression (LR) model using predefined gaze metrics and behavioral features with a convolutional neural network (CNN) trained directly on raw scanpath images to classify participants.

Results

The CNN model demonstrated superior performance, achieving a 0.93 F1-score in subject-level classification, while the feature-based LR model achieved a 0.76 F1-score. Notably, the CNN predictions derived from the working memory task significantly correlated with full-scale IQ as well as FRI and visuospatial (VSI) subscores, suggesting the model effectively captured higher-order reasoning and visuospatial processes.

Conclusions

This study demonstrates that deep learning analysis of spatio-temporal gaze patterns from a multidimensional cognitive task can serve as a robust digital biomarker, paving the way for accessible and objective tools for estimating cognitive capacity in children with neurodevelopmental disorders.

Keywords

intellectual disability eye-tracking cognitive profiling estimation digital biomarkers deep learning

Introduction

Intellectual Disability (ID) is a neurodevelopmental disorder characterized by significant limitations in intellectual functioning and adaptive behavior in the conceptual, social, and executive domains, with onset during the developmental period.¹ Generally, the global prevalence of ID is estimated to be approximately 1–2% of the population.^2,3 The prevalence of ID increases with age throughout the developmental period, with a prevalence of 1.39% in ages 3-7, 1.79% in ages 8-12, and 2.35% in ages 13-17.⁴ Children with ID typically present with deficits in attention, memory, executive function, and language,⁵ leading to delays in emotional development that affect social interactions and self-regulation.⁶ These developmental delays result in significant limitations in adaptive behaviors and cognitive functioning,⁷ as their impaired ability to recognize situational cues and problem-solving impedes their capacity for strategic planning and goal-oriented behaviors.⁸ A core characteristic of ID is deficits in executive function,^9,10 a higher-order cognitive system that encompasses working memory, inhibitory control, cognitive flexibility, and planning.¹¹ These deficits lead to significant limitations in daily life¹² and negatively affect academic achievement.^13,14

The pervasive consequences of ID necessitate early detection and therapeutic interventions¹⁵ that allow for the continuous monitoring of behavioral and affective issues and facilitate access to essential social support.¹⁶ In accordance with this necessity, there has been an increasing trend in utilizing diverse instructional technologies to address the educational and developmental challenges faced by individuals with special needs.¹⁷ However, the current “gold standard” for diagnosis, the Wechsler Intelligence Scale for Children (WISC), presents significant practical barriers to comprehensive cognitive assessment. Although highly reliable, the WISC is a resource-intensive assessment tool that must be administered and interpreted by a qualified professional, making it both time-consuming and costly.^15,16,18 These demanding requirements limit the feasibility of cognitive profiling, leaving a critical gap for more accessible tools, such as continuous monitoring applications.¹⁹

To fill the current diagnostic gap, digital technologies are emerging as promising alternatives for identifying objective behavioral indicators for continuous monitoring.^20–22 Previous studies have demonstrated that digital learning environments, such as those employing augmented reality or online project-based frameworks, can significantly influence academic achievement and cognitive monitoring.^23,24 These technological advancements have led to significant attention being given to eye tracking due to its capacity to identify distinct differences in visual responses and gaze patterns in children with developmental disabilities.^25,26 This potential has led to recent research on the use of eye tracking to classify developmental disorders.^27,28 A significant portion of this research has centered on autism spectrum disorder (ASD), as the technology is particularly well-suited to objectively measure its characteristic of reduced attention to social stimuli.²⁹ For instance, virtual reality (VR) with machine learning (ML) achieved high classification accuracies of 86% and 92.9% for participants with ASD versus typically developing (TD) participants.^30,31 The application of this methodology has also been extended to attention-deficit/hyperactivity disorder (ADHD), leveraging the findings that eye-movement characteristics can serve as biomarkers for attention-related cognitive processes^7,32 and achieving an 83% classification accuracy between ADHD and TD groups.³³

The application of eye-tracking methodology to ID has been more challenging, leading to limited research compared to ASD and ADHD. Cognitive and behavioral heterogeneity within the ID population presents a complex classification problem. Despite these difficulties, a notable study has demonstrated that gaze-based data can quantify differences in cognitive processes.³⁴ Using the Raven Progressive Matrices (RPM), a widely used non-verbal test to evaluate individual reasoning and problem solving,^35,36 the study identified significant differences in gaze patterns between 15 adults with Down syndrome and 35 typically developing children who were matched for mental age.³⁴ Prior research on eye-tracking in individuals with ID has primarily focused on functions within a single cognitive domain, such as working memory or social attention, or on predicting specific abilities, such as problem-solving strategies.^25,29,34 However, the diagnosis of ID necessitates a multidimensional approach that integrates information from various cognitive domains and adaptive behaviors.^37,38 Indeed, the WISC, the gold standard for ID diagnosis, is structured to derive overall intelligence from several distinct indices, including verbal comprehension (VCI), fluid reasoning (FRI), working memory (WMI), visual-spatial (VSI), and processing speed (PSI), supporting this multidimensional view.^39,40 Building upon previous RPM-based research that evaluates cognitive processes, the present study aims to enable the rapid estimation of these multidimensional cognitive profiles using objective gaze-based biomarkers.

To achieve this estimation, the present study developed a novel set of eye-tracking tasks structurally modeled on the RPM to create a multidimensional cognitive profile for ID classification. These tasks were primarily designed to elicit cognitive processes corresponding to three core WISC indices: VCI, FRI, and WMI. Furthermore, because gaze data inherently consist of spatiotemporal aspects, such as scanpaths, gaze location, and fixation durations, it was hypothesized that the remaining VSI and PSI would be implicitly encoded within the eye-gaze patterns generated during task performance.⁴¹ To test this hypothesis and extract the encoded information, the collected gaze patterns were converted into image-based representations for a deep learning classification model. Specifically, a convolutional neural network (CNN) was utilized because this type of model excels at recognizing and hierarchically learning the spatial information contained within 2D images, such as gaze scanpaths or heat maps.^42,43 Prior research has demonstrated the effectiveness of this approach by successfully extracting cognitive characteristics from gaze data to distinguish between individuals with ASD and TD.^44,45 Therefore, the present study aims to validate a CNN-based approach for classifying ID using gaze-path images from custom-designed tasks. Moreover, we explored the potential of the model to quantify standardized intelligence scores (full-scale IQ and subscores) as an objective method for evaluating developmental characteristics. Finally, the performance of this image-based CNN approach is compared with that of models using traditional gaze metrics from previous RPM-based research.

This study was guided by the following objectives and hypotheses: The first objective was to design a digital cognitive assessment consisting of three distinct subtasks capable of measuring the cognitive functions of individuals with ID and detecting spatiotemporal gaze patterns. We hypothesized that the WISC-derived full-scale IQ (FSIQ) and its core sub-scores would significantly correlate with the gaze patterns recorded during these tasks. The second objective was to develop a deep-learning classification model from the collected gaze data and evaluate its potential to effectively discriminate between individuals with ID and TD. It was hypothesized that the CNN model would classify ID versus TD with superior performance compared to a traditional ML model using pre-calculated gaze metrics from prior RPM-based eye-tracking research on fluid reasoning. The third objective was to analyze the relationship between model predictions and actual intelligence scores. We hypothesized that the model’s overall output would significantly quantify the FSIQ and that the outputs derived from each specific subtask would correlate with their corresponding WISC subscores.

Methods

Participants

This research was designed as a cross-sectional pilot study. Participant recruitment and clinical assessments were conducted from November 2023 to August 2024 at the Choong-Hyeon social welfare center in Seoul, Republic of Korea. The participants were divided into two groups: children with ID and TD controls. For the ID group, nine Korean children aged 12–14 years were initially recruited from a Social Welfare Center (Seoul, Korea). One participant was excluded for personal reasons, and another for severe behavioral issues that precluded task performance, resulting in a final sample of seven children (four males and three females). Eleven children aged 10–13 years were recruited for the TD group. One participant was excluded because of an eye-tracking equipment malfunction, resulting in a final sample of nine children (six males, three females). The Fifth Edition of Korean WISC (K-WISC-V) was administered to all participants to confirm their group assignments. Subsequently, individuals were identified as having either ID based on a FSIQ score of 70 or lower, or TD with a FSIQ of 90 or higher. A summary of the participants’ demographics is provided in Table 1. As intended for group assignment, the ID group had a significantly lower FSIQ than the TD group (p < .001, Mann-Whitney U test). While the two groups did not differ significantly in sex distribution (p = 1.0, Fisher’s exact test), the ID group was significantly older than the TD group (p < .01, Mann-Whitney U test), a factor that was statistically controlled for in all subsequent analyses.

Table 1.

Participant demographics.

	TD	ID	p
N (Male/Female)	9 (6/3)	7 (4/3)	= 1.0
Age (Mean ± SD)	10.6 ± 1.01	12 ± 0.58	< 0.01
FSIQ (Mean ± SD)	104.89 ± 10.22	58 ± 11.73	< 0.001

Note. TD means Typical development group and ID means Intellectual disability group.

This study was conducted with the approval of the Institutional Review Board (IRB) of the Catholic University of Korea, Seongsim Campus (IRB No. 1040395-202305-02). Prior to participation, all children, their parents or legal guardians, and an impartial witness (social workers) were provided with a detailed explanation of the study procedures according to the IRB-approved informed consent form, and written consent was obtained.

Experimental paradigm

This study aimed to objectively characterize and differentiate individual cognitive and executive functions in children at risk for ID by leveraging behavioral patterns observed from their gaze. To this end, we developed a novel experimental paradigm that effectively elicited a range of gaze movements, thereby facilitating the assessment of children’s cognitive capabilities. The paradigm consisted of three subtasks designed to correspond to the core indices of the K-WISC-V and target cognitive domains known to be impaired in individuals with ID.

Experimental task design principles

Three subtasks were designed to assess VCI, FRI, and WMI. This design was motivated by the established cognitive characteristics of ID, including language impairment,⁴⁶ deficits in reasoning and solving novel and unfamiliar problems,⁴⁷ and limitations in working memory capacity.⁴⁸ The remaining K-WISC-V indices, VSI and PSI, were not designed as stand-alone tasks. Instead, we hypothesized that these functions would be implicitly encoded within the spatiotemporal dynamics of gaze data collected across all three tasks. This integrated approach allowed for a comprehensive cognitive assessment while minimizing the total task duration and cognitive load on the participants.

All three subtasks shared a common visual format inspired by the RPM, a widely used tool designed to assess fluid reasoning. In the RPM, participants are required to select an appropriate figure from a set of options to complete a missing cell in a matrix. Recent studies have demonstrated that combining this structure with eye tracking is a powerful method for analyzing cognitive processes, distinguishing problem-solving strategies, and predicting performance.^25,49 Building on this methodology, our paradigm was specifically designed to elicit and capture problem-solving gaze patterns. Each trial presented a problem matrix area (3 × 3 grid) and a response area (1 × 4 grid), as shown in Figure 1. This standardized structure was chosen to facilitate the observation of gaze patterns associated with problem-solving strategies.²⁵ Each subtask comprised 20 trials with progressively increasing difficulty levels.

Figure 1.

Eye-tracking-based tasks for evaluating intellectual ability.

Subtask descriptions

Although the RPM format is traditionally used for fluid reasoning (FRI), we adapted this matrix-based structure to create a comprehensive paradigm comprising three distinct subtasks. In this study, the FRI task was directly modeled after the RPM, which is a widely used tool for measuring fluid intelligence.⁵⁰ In each trial, participants were shown an 8-panel matrix with a missing panel in the bottom-right corner for 2 seconds. Subsequently, four answer choices appeared in the response area. Participants had 8 seconds to identify the underlying rule (e.g., symmetry, rotation, and progression) and select the correct missing panel. If no response was provided, the task proceeded automatically to the next trial. This task was selected because of its minimal reliance on linguistic knowledge, which makes it suitable for assessing children across a wide range of cognitive abilities, including those with ID.³⁶

In addition to the FRI task, novel tasks were developed in the same format to assess verbal comprehension (VCI) and working memory (WMI). The VCI task was designed to assess naming ability and vocabulary with stimuli selected based on the Korean Boston Naming Test (K-BNT). The K-BNT is a 60-item assessment that is culturally and linguistically adapted from the Boston Naming Test (BNT),⁵¹ a foundational assessment that measures word retrieval from a series of line drawings. The K-BNT has been validated for evaluating language development in Korean children.⁵² In each trial, a 3 × 3 problem matrix displayed nine images belonging to a single semantic category. After 2 seconds, four written words appeared in the response area. Participants had 8 seconds to select the word that correctly identified the objects depicted in the images. The trial advanced automatically after selection or if the 8-second time limit expired.

The WMI task was designed to assess visuospatial working memory, a key component of the working memory model responsible for encoding and maintaining integrated visual and spatial information.^53,54 To specifically target the ability to process multiple pieces of information as a unified whole,^55,56 our task utilized a presentation format that displayed color and location information simultaneously. Consistent with the preceding tasks, the task format consisted of a 3 × 3 problem matrix and a 1 × 4 response area. Each trial began with a 3-second presentation of randomly arranged colored squares in the problem matrix. Immediately following this, the grid was shown for 2 seconds, and several target squares were removed. Participants were then required to select the response option that correctly identified the color and location of the removed items within an 8-second time limit. This design required participants to integrally encode and store visuospatial information and then selectively retrieve and compare it with the response options to make a final choice. Trial difficulty was progressively increased by adjusting the number of presented colors and target items to be remembered. This calibration was informed by several findings showing that performance decreases with complexity and an increased number of stimuli,^57,58 and that the working memory capacity for color is limited to approximately three items.^59,60 It is also crucial to consider the finding that excessively difficult tasks can increase task-avoidant behaviors, particularly in children with developmental disabilities.⁶¹ Therefore, to accommodate these cognitive characteristics while ensuring sustained engagement, the maximum difficulty was set to three presented colors and three target items to be remembered.

Experimental procedure and apparatus

To counterbalance potential order effects, the administration of the K-WISC-V and the eye-tracking paradigm was randomized across participants. The K-WISC-V for the child participants, administered by trained graduate students specializing in clinical psychology, required roughly 60 minutes. These administrators operated under the direct supervision of a licensed clinical psychologist (J.W.Y.), who performed the final evaluation. The eye-tracking session totaled approximately 30 minutes, which included 10 minutes for equipment setup and the 9-point calibration procedure. The three subtasks were then administered in a fixed order (VCI, FRI, and WMI), with approximate durations of 4 minutes for VCI, 4 minutes for FRI, and 5 minutes for WMI. Prior to the main experiment, the participants completed three practice trials for each subtask. Eye-tracking data were recorded using a Gazepoint GP3 system (Gazepoint, Vancouver, Canada) at a sampling rate of 150 Hz. Stimuli were presented on a 24-inch LCD monitor (1920 × 1080 resolution) with a viewing distance of approximately 60 cm. A chin rest was used to minimize head movements and maintain a consistent viewing distance.

Conventional eye-tracking metrics and logistic regression

Following the methodology of Liu et al.,²⁵ three conventional gaze metrics were calculated from raw eye-tracking data. The proportional time on matrix (PTM) is the total fixation time on the 3 × 3 problem matrix divided by the total trial response time. Higher values indicate greater attention to the problem space. The rate of toggling (ROT) is the total number of gaze shifts between the problem matrix and response area, divided by the total trial response time. Higher values indicate more frequent comparisons between the matrix and response areas. The rate of latency to the first toggle (RLT) is the time elapsed before the first gaze shift from the problem matrix to the response area, divided by the total trial response time. Higher values may suggest that more time is spent on initial problem encoding and planning. In the original study, the ROT was calculated by dividing the total trial response time by the number of toggles between the matrix and the response area, as this form was considered more suitable for expressing the strategy. In this study, however, following related research,^62,63 the ROT was calculated by dividing the number of toggles by the total trial response time, so that a higher ROT value indicates more frequent gaze shifts. These three metrics were calculated for each of the three subtasks (VCI, FRI, and WMI), yielding nine eye-tracking features. To test the predictive utility of these metrics, two logistic regression (LR) classifiers were trained to distinguish between ID and TD groups.⁶⁴ The first model used only the nine eye-tracking features, whereas the second used a combined set of eye-tracking and behavioral features (task accuracy and response time for each subtask). All features were standardized before model training.
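The three metrics defined above can be illustrated with a minimal sketch. It assumes that each trial has already been reduced to a time-ordered list of fixations labeled by area of interest; the `Fixation` record, the AOI labels ("matrix", "response", "other"), and the choice to return `None` for RLT when no toggle occurs are hypothetical and are not taken from the study's pipeline.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Fixation:
    aoi: str         # hypothetical AOI label: "matrix", "response", or "other"
    onset: float     # seconds from trial start
    duration: float  # seconds


def gaze_metrics(fixations: List[Fixation], response_time: float) -> Dict[str, Optional[float]]:
    """Compute PTM, ROT, and RLT for one trial, each normalized by response time."""
    if response_time <= 0:
        raise ValueError("response_time must be positive")

    # PTM: total fixation time on the problem matrix / total response time.
    matrix_time = sum(f.duration for f in fixations if f.aoi == "matrix")

    # Keep only fixations on the two AOIs of interest, in temporal order.
    seq = [f for f in fixations if f.aoi in ("matrix", "response")]

    # ROT: number of matrix<->response gaze shifts / total response time.
    toggles = sum(1 for a, b in zip(seq, seq[1:]) if a.aoi != b.aoi)

    # RLT: time of the first matrix->response shift / total response time.
    # Assumption: undefined (None) when the participant never toggles.
    first_toggle: Optional[float] = None
    seen_matrix = False
    for f in seq:
        if f.aoi == "matrix":
            seen_matrix = True
        elif seen_matrix:
            first_toggle = f.onset
            break

    return {
        "PTM": matrix_time / response_time,
        "ROT": toggles / response_time,
        "RLT": None if first_toggle is None else first_toggle / response_time,
    }


example = [
    Fixation("matrix", 0.0, 1.0),
    Fixation("matrix", 1.0, 0.5),
    Fixation("response", 2.0, 0.5),
    Fixation("matrix", 3.0, 1.0),
]
metrics = gaze_metrics(example, response_time=4.0)
```

Computing the three values per subtask and concatenating them would give the nine-feature vector described above; how trials were aggregated per participant is not specified here and is left out of the sketch.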

Gaze path imaging and convolutional neural networks

To overcome the limitations of the conventional LR approach, such as its reliance on sophisticated feature engineering and the resulting loss of spatiotemporal gaze information, we employed a CNN. CNNs are deep learning architectures that are particularly effective in processing 2D image data, such as scanpath images, and identifying salient spatial patterns.⁶⁵ For our model, the gaze data from each of the 20 trials within the three subtasks were converted into a unique scanpath image. These images were subsequently processed using the CNN, which consisted of three parallel convolutional blocks, one for each subtask. Within each block, the scanpath image was passed through three successive convolutional layers, employing 16, 32, and 128 filters of size 3×3, respectively, each followed by a max-pooling layer to produce a latent vector. Subsequently, the three vectors were concatenated, and the resulting combined vector was fed into a dense layer comprising 256 neurons, followed by a dropout layer with a rate of 0.3. A final sigmoid activation produced the binary classification of ID versus TD. The overall architecture of our model is depicted in Figure 2, and the detailed hyperparameters are described in Supplementary Table 1. In this study, all models were implemented and run on a system equipped with an AMD Ryzen 5 5600 6-core processor, 32GB of RAM, and an NVIDIA GeForce RTX 2060 GPU.
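How a trial's gaze samples might be turned into a scanpath image can be sketched as follows. The study does not describe its rendering procedure, so everything here is an assumption for illustration: gaze samples are taken as pixel coordinates on the 1920 × 1080 display, the output is a small binary grid, and consecutive samples are joined by straight line segments. The function name `rasterize_scanpath` and the grid size are hypothetical.

```python
from typing import List, Sequence, Tuple


def rasterize_scanpath(
    points: Sequence[Tuple[float, float]],
    screen: Tuple[int, int] = (1920, 1080),
    size: Tuple[int, int] = (64, 64),
) -> List[List[int]]:
    """Draw a gaze path as a binary image (rows of 0/1), joining samples with lines."""
    width, height = size
    img = [[0] * width for _ in range(height)]

    def to_cell(x: float, y: float) -> Tuple[int, int]:
        # Scale screen pixels to grid cells and clamp off-screen samples to the border.
        cx = min(width - 1, max(0, int(x / screen[0] * width)))
        cy = min(height - 1, max(0, int(y / screen[1] * height)))
        return cx, cy

    cells = [to_cell(x, y) for x, y in points]

    # Mark every sample, so a single-sample trial still leaves a trace.
    for cx, cy in cells:
        img[cy][cx] = 1

    # Join consecutive samples by linear interpolation between grid cells.
    for (x0, y0), (x1, y1) in zip(cells, cells[1:]):
        steps = max(abs(x1 - x0), abs(y1 - y0), 1)
        for i in range(steps + 1):
            cx = x0 + round((x1 - x0) * i / steps)
            cy = y0 + round((y1 - y0) * i / steps)
            img[cy][cx] = 1

    return img


# A horizontal sweep along the top edge of the screen on a tiny 8 x 8 grid.
demo = rasterize_scanpath([(0, 0), (1919, 0)], size=(8, 8))
```

A binary image discards fixation duration and order; encoding them (for example through line intensity or color) would be one way to preserve the temporal aspect, but whether the study did so is not stated in this section.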

Figure 2.

The CNN architecture using scanpaths from three subtasks.

To prevent data leakage from the same participant into both training and validation sets, a Group K-Fold Cross-Validation (K=4) strategy was implemented, using participant ID as the grouping variable. The model was trained and validated on each fold, and the final performance was evaluated by averaging the results across all folds.
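The grouping constraint described above can be made concrete with a small pure-Python sketch. It is not the study's code: the function `group_k_fold` and its greedy balancing rule are illustrative assumptions. The only property it is meant to show is that all trials from one participant fall on the same side of every split.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def group_k_fold(groups: Sequence[Hashable], k: int = 4) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, validation_indices) pairs with no group in both."""
    members: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, g in enumerate(groups):
        members[g].append(idx)
    if k > len(members):
        raise ValueError("k cannot exceed the number of distinct groups")

    # Greedy balancing: place the largest groups first into the currently smallest fold.
    fold_of: Dict[Hashable, int] = {}
    fold_sizes = [0] * k
    for g in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        target = min(range(k), key=lambda i: fold_sizes[i])
        fold_of[g] = target
        fold_sizes[target] += len(members[g])

    splits = []
    for fold in range(k):
        val = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        splits.append((train, val))
    return splits


# 16 participants x 20 items each, matching the sample sizes reported in this study.
participant_ids = [p for p in range(16) for _ in range(20)]
splits = group_k_fold(participant_ids, k=4)
```

With 16 participants and K = 4, each validation fold in this sketch holds the items of four participants. The sketch does not stratify by diagnosis, so a fold may contain an uneven mix of ID and TD participants.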

Statistical analysis and evaluation metrics

Prior to all statistical analyses, the normality of data distributions was assessed using the Shapiro-Wilk test. Because none of the behavioral and eye-tracking metrics were normally distributed, non-parametric Mann-Whitney U tests were primarily used for group comparisons. Furthermore, to control for the potential confounding effect of the significant age difference between the groups, a non-parametric analysis of covariance (ANCOVA) with age as a covariate was performed on all behavioral and eye-tracking metrics. Similarly, for correlation analyses, partial Spearman’s rank correlation coefficients were used to statistically adjust for the effect of age. To correct for multiple comparisons, a false discovery rate (FDR) method was applied, and significance was determined based on a q-value of less than 0.05. The performance of the classification models was evaluated using accuracy, F1-score, sensitivity, precision, and specificity, averaged across the cross-validation folds.
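The conversion from p-values to FDR-adjusted q-values can be illustrated as follows. The text does not name the specific FDR procedure, so this sketch assumes the widely used Benjamini-Hochberg step-up method; the function name and the example p-values are hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values (q-values) in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


# Hypothetical p-values from four tests in one family.
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [q < 0.05 for q in q_values]
```

Because each q-value depends on the number of tests in the family, the same raw p-value can pass or fail the q < 0.05 criterion depending on how many comparisons are corrected together.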

Results

Behavioral performance

Correlation analyses between task performance accuracy and expert-evaluated FSIQ scores revealed positive correlations, particularly for the WMI and FRI tasks (Figure 3). Specifically, the WMI task exhibited the strongest correlation with FSIQ (r = 0.705, q = .015), followed by the FRI task (r = 0.681, q = .015), whereas the correlation with the VCI task did not reach statistical significance (r = 0.549, q = .068). Furthermore, task accuracy correlated significantly with specific WISC-V subscores. The accuracy for the FRI task was correlated with its respective FRI subscore (r = 0.798, q = .003). The WMI task accuracy also exhibited a significant correlation with the FRI subscore (r = 0.870, q = .001) rather than the WMI subscore (r = 0.538, q = .127). Additionally, performance on both the FRI and WMI tasks showed positive correlations with the WISC-V VSI subscore (r = 0.799, q = .003 for the FRI task; r = 0.795, q = .003 for the WMI task). Correlations with other subscores did not survive FDR correction. In contrast to the consistent positive correlations observed with accuracy, response time analyses did not reveal any significant correlations with the cognitive indices after FDR correction (Supplementary Table 2).

Figure 3.

Relation between task accuracy and WISC indices.

While accuracy in the WMI and FRI tasks correlated with cognitive abilities, a group comparison did not yield statistically significant differences after FDR correction. Although participants in the ID group demonstrated a trend of lower accuracy than those in the TD group across all three cognitive tasks, these differences were not statistically significant (Supplementary Figure 1a). The analysis of response times also revealed no statistically significant differences between the ID and TD groups in any of the three tasks (Supplementary Figure 1b).

Conventional eye tracking measures

Analyses of eye tracking measures (ROT, RLT, and PTM) established in the previous study²⁰ were conducted to explore their relationships with participants’ WISC-V scores. However, following rigorous FDR correction across the numerous hypotheses, none of the correlations reached the threshold for statistical significance (q < .05). Despite this, descriptive evaluation of the uncorrected coefficients showed task-dependent trends. Overall, the PTM metric generally exhibited positive correlations with the FSIQ and subscores (Figure 4). The most notable trend emerged within this metric during the FRI task, which exhibited a strong positive correlation with its corresponding FRI subscore (r = 0.702). Although this relationship did not survive the conservative FDR penalty, a distinct linear pattern remains visually apparent in the data. Conversely, the ROT metric demonstrated inconsistent directional patterns across tasks. Specifically, when correlated with FSIQ, VCI ROT showed a relatively flat or slightly positive trend, whereas FRI and WMI ROT exhibited negative correlations. Finally, the RLT metric exhibited the weakest correlation across all measures due to the high variance (Supplementary Table 3). Similarly, group comparisons did not yield statistically significant differences (Supplementary Figure 2).

Figure 4.

Relation between eye gaze based features and WISC indices.

Predictive modeling at the item-level using eye tracking data

To address the limitations of relying on isolated gaze metrics, we investigated whether a multidimensional modeling approach could better capture the underlying patterns for classifying the ID and TD groups. We compared two distinct approaches: a traditional feature-based LR classifier and a deep learning-based CNN. Our baseline LR model, trained using only nine eye tracking measures (ROT, RLT, and PTM from each of the three subtasks), demonstrated suboptimal performance at the item level (classifying each of the 20 experimental trials), achieving an average accuracy of 66.56% and an F1-score of 0.543 (Figure 5(a)). An analysis of the LR coefficients revealed that the feature with the largest absolute value was ROT from the WMI task, followed by PTM and ROT from the VCI task (Supplementary Figure 3a). Next, a second LR model was developed by adding behavioral features (task accuracy and response time), for a total of fifteen features. The accuracy was enhanced to 77.81%, and the F1-score increased to 0.719 (Figure 5(b)). While the top three predictive features remained consistent (WMI ROT, VCI PTM, and VCI ROT), the response time from the FRI task emerged as the fourth most influential feature (Supplementary Figure 3b).

Figure 5.

Performance of the Logistic Regression (LR) and CNN models.

To overcome the limitations of traditional feature-based methods, we employed CNNs to classify ID and TD groups directly from raw eye-tracking patterns across all trials. This approach involves transforming gaze sequences into image-like representations without manual feature engineering. The CNN model achieved significantly superior classification performance compared to the LR models, achieving an accuracy of 84.69% and an F1-score of 0.809.

Subject-level classification and correlation with WISC-V scores

To assess the practical applicability of these models for individual classification, we aggregated the item-level predictions to a subject-level decision. For each participant, the predicted probabilities from the 20 trials were averaged, and a classification threshold (0.5) was applied to this mean probability. The eye-tracking-only LR model achieved an accuracy of 68.75% and an F1-score of 0.545. As seen in the confusion matrix (Figure 6(a)), this stemmed from a limited ability to identify individuals with ID: the model correctly classified only three of seven participants with ID (42.86% sensitivity), although it correctly identified eight of nine TD individuals (88.89% specificity). The inclusion of behavioral features improved sensitivity, correctly identifying five of seven participants with ID (Figure 6(b)). This resulted in a higher overall accuracy of 81.25% and an F1-score of 0.769. In contrast, the CNN model achieved the most robust subject-level classification performance, yielding an overall accuracy of 93.75% and an F1-score of 0.933. It achieved 100% sensitivity (seven of seven) while maintaining 88.89% specificity (eight of nine) (Figure 6(c)). Crucially, no participant with ID was misclassified, and only one TD participant was misclassified as having ID.
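The aggregation rule and the reported subject-level figures can be checked with a short worked example. The function names are illustrative, and treating a mean probability exactly equal to 0.5 as ID is an assumption, since the tie rule is not stated. The confusion counts for the CNN and the eye-tracking-only LR model come from the text; for the LR model with behavioral features, the TD counts (eight correct, one incorrect) are inferred from the reported 81.25% accuracy.

```python
from typing import Dict, Sequence


def subject_level_labels(trial_probs: Dict[str, Sequence[float]], threshold: float = 0.5) -> Dict[str, int]:
    """Average per-trial ID probabilities per participant, then threshold (1 = ID)."""
    return {
        subject: int(sum(probs) / len(probs) >= threshold)  # tie rule is an assumption
        for subject, probs in trial_probs.items()
    }


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Standard metrics with ID as the positive class."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0,
    }


# Hypothetical per-trial probabilities for two participants.
labels = subject_level_labels({"s01": [0.9, 0.8, 0.2], "s02": [0.1, 0.2, 0.6]})

# Confusion counts as (tp, fn, tn, fp) for the three subject-level models.
cnn = binary_metrics(tp=7, fn=0, tn=8, fp=1)
lr_gaze = binary_metrics(tp=3, fn=4, tn=8, fp=1)
lr_gaze_behavior = binary_metrics(tp=5, fn=2, tn=8, fp=1)  # TD counts inferred
```

For the CNN, precision is 7/8 and sensitivity is 1, so F1 = 14/15 ≈ 0.933 and accuracy = 15/16 = 93.75%, which agrees with the values reported above.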

Figure 6.

Confusion Matrix based on Averaged Per-Item Probabilities.

A correlation analysis was then conducted to validate the model’s predictions against the participants’ WISC-V scores (Figure 7). For this analysis, the mean CNN-predicted probability of having an ID, averaged across all 60 experimental items, was used. The results revealed a significant negative correlation between the model’s predictions and the FSIQ (r = -0.568, q = .027). The direction of this relationship was negative across all cognitive domains. Particularly strong correlations were observed with the FRI (r = -0.740, q = .008) and the VSI (r = -0.669, q = .016). No significant correlations were observed with the VCI (r = -0.549, q = .057), WMI (r = -0.411, q = .160), or PSI (r = -0.247, q = .374). These findings suggest that the eye-gaze patterns extracted by the CNN serve as a robust indicator of an individual’s overall intellectual functioning, with a particular sensitivity to fluid reasoning and visuospatial processing. To further investigate the relative contribution of each task to overall predictive performance, separate correlation analyses were conducted for the VCI, FRI, and WMI subtasks. After FDR correction, none of the individual subtasks reached statistical significance at the q < .05 level. However, as illustrated in Supplementary Figure 4, the WMI task consistently exhibited negative correlations that approached statistical significance not only with the FSIQ (r = -0.570, q = .080), but also with the VSI (r = -0.636, q = .054) and FRI (r = -0.644, q = .054).
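An age-adjusted rank correlation of the kind used in these analyses can be sketched in pure Python. This is an illustration under stated assumptions rather than the study's implementation: it uses the first-order partial correlation formula applied to Spearman coefficients, with average ranks for ties, and the variable names and toy data are hypothetical.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def partial_spearman(x: Sequence[float], y: Sequence[float], covariate: Sequence[float]) -> float:
    """Rank correlation of x and y after adjusting both for one covariate (e.g. age)."""
    r_xy = spearman(x, y)
    r_xz = spearman(x, covariate)
    r_yz = spearman(y, covariate)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy data: model output, a cognitive score, and age for five hypothetical participants.
model_output = [1, 2, 3, 4, 5]
score = [2, 1, 4, 3, 5]
age = [1, 3, 2, 5, 4]
rho = spearman(model_output, score)
rho_adjusted = partial_spearman(model_output, score, age)
```

The adjusted coefficient is undefined when either variable is perfectly rank-correlated with the covariate, since the denominator becomes zero; a production analysis would need to handle that case and compute p-values, which this sketch omits.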

Figure 7.

Spearman’s correlations between CNN predicted values and WISC indices with false discovery rate (FDR).
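As a rough illustration of the two steps behind the r and q values above, the following sketch computes Spearman's rank correlation and then adjusts a set of p-values for the false discovery rate. The text does not name the FDR procedure, so the Benjamini-Hochberg step-up method used here is an assumption; the function names and all numbers are hypothetical, and the p-values themselves would normally come from a statistics package rather than from this code.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (q-values), in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank, idx in zip(range(m, 0, -1), reversed(order)):
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Hypothetical values for five participants (not study data).
toy_prob_id = [0.9, 0.8, 0.6, 0.3, 0.1]   # mean predicted probability of ID
toy_fsiq = [55, 70, 65, 100, 110]          # full-scale IQ
print(spearman_rho(toy_prob_id, toy_fsiq))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```

Because the adjustment depends on how many tests are corrected together, the same raw p-value can yield different q-values in the overall analysis and in the per-subtask analyses.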

Explainable gaze patterns via class activation mapping

To explicitly demonstrate how the CNN model derives its multidimensional predictions and to ensure interpretability of gaze patterns, we conducted a gradient-weighted class activation mapping (Grad-CAM) analysis. Figure 8 illustrates the saliency maps of four representative cases, highlighting the specific spatio-temporal gaze features the model relied upon for classification. In TD cases that were correctly classified (Case 2), the model heavily focused on dense fixation areas primarily within the problem matrix and intersecting gaze trajectories between the problem matrix and the response options. The strong activation in these regions indicates that the CNN successfully learned the logical exploratory patterns characteristic of TD children. These visual explanations aligned with the behavioral metrics, characterized by a relatively larger PTM and smaller ROT in the WMI task, reflecting the cognitive process patterns of the TD group. Conversely, in correctly classified ID cases (Case 7), the saliency maps revealed activation over sparse and fragmented fixation areas, confirming that the model utilizes the lack of structured spatio-temporal exploration as a key discriminative feature for ID. In this case, a smaller PTM and a larger ROT in the WMI task were observed, reflecting an erratic and scattered visual search where the participant frequently shifted their gaze without sufficiently focusing on the problem matrix.

Figure 8.

Grad-CAM visualizations of CNN model predictions.

To further understand the sensitivity of the model, we analyzed a false-negative case (Case 3) where a high-IQ TD participant was misclassified as ID. Despite their high overall cognitive capacity, this participant produced an incorrect answer on the FRI task. The saliency map for this specific task visually captures a momentary lapse in sustained attention. It exhibits a distinct absence of prominent activation across both the problem matrix and response options. This diminished visual exploration indicates a failure to engage in the active visual encoding typically required for this task. Furthermore, in the WMI task, their gaze activation was diffuse, yielding a high ID prediction probability for this subtask. Consequently, the CNN detected the absence of systematic exploratory behavior and interpreted it as an ID characteristic, demonstrating that the model evaluates the actual cognitive effort exerted during the task. Finally, the false-positive case (Case 5) highlights the capability of the model for multidimensional estimation. Clinically diagnosed with borderline ID, this participant successfully solved all three subtasks. While their traditional discrete gaze metrics were uneven, they demonstrated active exploratory behaviors in the VCI task and systematic visual scanning in the WMI task. Notably, the participant exhibited a complete lack of toggling (ROT = 0) in the FRI task. This absence may reflect an intuitive resolution of the problem or the inherent limitations of rigid area of interest (AOI) definitions failing to capture boundary fixations. Unbound by predefined AOI constraints, the CNN successfully integrated these continuous multidimensional spatial patterns, predicting the participant as TD. This indicates that our model does not merely overfit to clinical labels; rather, it effectively estimates the underlying quality of the cognitive processes and the actual functional capacity demonstrated by the participant during the assessment.
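For readers unfamiliar with how such saliency maps are produced, the sketch below shows the core Grad-CAM computation in its standard form: each feature map of a convolutional layer is weighted by the spatial average of the class-score gradient with respect to that map, the weighted maps are summed, and negative values are clipped. It is a minimal illustration rather than the analysis code of this study; the function name `grad_cam` and the toy 2 x 2 arrays are hypothetical, and in practice the activations and gradients would be extracted from the trained network by a deep learning framework.

```python
def grad_cam(activations, gradients):
    """Compute a normalized Grad-CAM map from one layer's outputs.

    activations: list of K feature maps, each an H x W nested list
    gradients:   gradients of the class score w.r.t. those maps, same shape
    """
    n_channels = len(activations)
    height, width = len(activations[0]), len(activations[0][0])

    # Channel weights: global average pooling of the gradients.
    weights = [sum(sum(row) for row in grad) / (height * width) for grad in gradients]

    # Weighted sum over channels, followed by ReLU.
    cam = [
        [
            max(0.0, sum(weights[k] * activations[k][i][j] for k in range(n_channels)))
            for j in range(width)
        ]
        for i in range(height)
    ]

    # Scale to [0, 1] so that the map can be overlaid on the scanpath image.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy example with two channels (hypothetical numbers).
toy_activations = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 3.0], [1.0, 0.0]],
]
toy_gradients = [
    [[1.0, 1.0], [1.0, 1.0]],      # channel 0 supports the class
    [[-1.0, -1.0], [-1.0, -1.0]],  # channel 1 counts against it
]
print(grad_cam(toy_activations, toy_gradients))
```

Because the map is computed at the resolution of the chosen convolutional layer and then upsampled, it highlights regions rather than individual fixations.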

Discussion

This study demonstrated that a novel eye-tracking-based cognitive assessment can successfully classify ID and quantify the underlying cognitive profiles of children. To achieve this, we developed three distinct subtasks structurally modeled on the RPM to elicit and capture the spatio-temporal patterns of eye gaze across core cognitive domains, including verbal comprehension, fluid reasoning, and working memory. A key contribution of this study is its multidimensional assessment design, mirroring the multi-index structure of the WISC-V. Previous eye-tracking studies involving ID have typically concentrated on isolated cognitive functions such as working memory, attentional control, or specific problem-solving strategies.^13,25,34 However, the WISC-V, the clinical standard for ID diagnosis, derives its assessment from several distinct cognitive indices.³⁹ The WISC-V is composed of five primary indices: FRI, VCI, WMI, VSI, and PSI. As conventional RPM paradigms primarily target FRI, we designed two additional subtasks for VCI and WMI. We hypothesized that the remaining two indices, VSI and PSI, could be captured through these three subtasks because computer-based eye-tracking tests are inherently capable of capturing spatial patterns and processing speed. Accordingly, this study proposed a unified assessment that captures multiple cognitive characteristics while potentially minimizing test time and increasing engagement in children with ID.

This multidimensional approach is particularly significant because prior eye-tracking research on neurodevelopmental disorders has largely concentrated on autism spectrum disorder (ASD), whereas its application to ID remains scarce.^27,66 Most ASD research has employed tasks that measure core symptoms, such as a lack of social attention or repetitive behaviors.⁶⁷ Specifically, distinct responses to faces versus nonsocial stimuli or emotional expressions can be readily captured by gaze patterns.⁶⁸ In contrast, such social-perceptual tasks are less suitable for assessing the core deficits of ID, which are characterized by broad impairments in intellectual and adaptive functioning. Therefore, by adapting the RPM, the gaze patterns captured during the tasks developed in our study provided a more direct and objective estimation of the specific cognitive processes relevant to ID.

The analysis of individual subtasks offers preliminary evidence for our multidimensional design hypothesis, demonstrating that the specific cognitive demands of our paradigms, particularly for FRI and WMI tasks, were adequately calibrated to capture FSIQ, as well as the FRI and VSI subscores. Specifically, the FRI task was directly adapted from the conventional RPM to assess fluid reasoning, where previous studies demonstrated significant results with FSIQ.^25,34 Meanwhile, the WMI task was newly designed in this study to measure the visuospatial working memory by simultaneously presenting color and location information.^54,56 Consequently, this paradigm imposes a high cognitive load by requiring the integration and maintenance of complex information.⁶⁹ Aligning with established literature for the FRI task and confirming our novel design hypothesis for the WMI task, behavioral accuracy from both paradigms yielded significant positive correlations with overall FSIQ (Figure 3). The effectiveness of the WMI paradigm lies in its sequential structure that requires participants to integrate color and spatial information, maintain that representation during a masked delay, and then actively retrieve it to select the correct answer from the response options (Figure 1(c)). This high cognitive load ensures that successful performance heavily relies on attentional control and information processing.^70–72 Furthermore, the effectiveness of our paradigm extended beyond a single cognitive domain by successfully capturing specific WISC-V subindices. Notably, behavioral accuracy from both the FRI and WMI tasks correlated significantly with the FRI, reflecting the active reasoning required across these problem-solving processes. Because we also intended to capture VSI and PSI implicitly without separate tasks, we examined relationships with the VSI and PSI. 
We found that behavioral accuracy from both tasks correlated significantly with the VSI, confirming the inherent visuospatial demands of the tasks. However, contrary to our expectation, it did not correlate significantly with the PSI. It is also worth noting that, despite its name, behavioral accuracy from the WMI task did not show a significant correlation with the WISC-V WMI. This discrepancy may be attributed to the composite nature of the WISC-V WMI, which integrates both auditory-verbal (e.g., digit span) and visual-sequential (e.g., picture span) tasks to assess a broad working memory construct. In contrast, our WMI task focused on high-load integration of visuospatial features, which may explain its significant correlation with the VSI and FRI rather than the WMI.

In contrast to the behavioral results, however, translating these underlying cognitive processes into quantitative eye-tracking biomarkers using conventional metrics proved challenging. For instance, in the FRI task, the PTM metric yielded a notable correlation with the FRI subscore, suggesting that the RPM-based format effectively elicited fluid reasoning strategies as intended. However, in contrast to previous studies that reported significant associations between PTM and FSIQ,^25,34 our analysis did not yield a significant correlation between any predefined gaze metric and FSIQ after correction for multiple comparisons. This discrepancy is likely attributable to participant characteristics. Unlike the university students in a previous RPM-based study,²⁵ the children in our study may lack stable and consistent problem-solving strategies, resulting in higher individual variability in gaze patterns that obscured a detectable group-level effect. Indeed, even TD children struggled with high-difficulty items in the later stages of the task, leading them to adopt nonstrategic behaviors.

For the VCI task, the results were less consistent than those of the FRI and WMI tasks. Unlike the other paradigms, the VCI task showed no significant correlation between either behavioral accuracy or gaze metrics and the FSIQ or the VCI subscore. This suggests that the VCI paradigm did not effectively capture vocabulary-specific processing as intended. This was likely because the words were drawn from the K-BNT, a formal neuropsychological test that includes items not commonly used in daily life. Therefore, the task outcome was likely determined more by a child’s prior vocabulary knowledge than by in-task visual search strategies. Collectively, these findings underscore a central challenge in conventional eye-tracking assessment: the need for task designs tailored to specific cognitive processes and populations makes the extraction of universal features difficult.

The challenge of relying on task-dependent predefined gaze metrics motivated our investigation into ML approaches for the automated estimation of cognitive capacity in children with ID. Our first approach used an LR model trained with nine eye-tracking features (i.e., ROT, RLT, and PTM for each of the three subtasks). However, this model yielded modest performance, with an F1-score of 0.543 at the item level and 0.545 at the subject level (Figures 5(a) and 6(a)). Although adding behavioral features (accuracy and response time) improved the F1-score to 0.72 and 0.769, respectively (Figures 5(b) and 6(b)), behavioral metrics are insufficient for capturing the multifaceted nature of cognitive processes. While task accuracy was significantly correlated not only with the FSIQ but also with the FRI and VSI, its consistently positive pattern across all tasks suggests that it may primarily reflect a general cognitive factor rather than providing the distinct signals needed for multivariate classification. Moreover, response time showed task-dependent instability and high variability. For instance, a longer response time could indicate cognitive impairment, but it could also reflect strategic slowness.⁷³ Furthermore, pediatric populations naturally exhibit high variability in processing speed.⁷⁴ Therefore, this knowledge-based feature extraction approach, which relies on a few potentially noisy and task-dependent metrics, has inherent limitations, particularly in its difficulty in generalizing to novel tasks where the underlying structure and cognitive demands differ.

To overcome these limitations, we designed a CNN architecture that learns directly from the raw spatiotemporal patterns of the scanpath images from the three subtasks. This approach yielded higher performance, achieving an F1-score of 0.8086 at the item level and 0.933 at the subject level (Figures 5(c) and 6(c)). This improvement suggests that deep learning methods capable of automatically extracting complex visuospatial patterns are more effective in capturing the cognitive characteristics of children than models that rely on a few predefined metrics. Furthermore, the subject-level predictions of the CNN model demonstrated a significant correlation with the actual FSIQ scores, confirming that the estimated probabilities effectively align with established intellectual measures (Figure 7). This negative relationship extended to the FRI and VSI subscores. To better understand the underlying drivers of this performance, we evaluated the relative contribution of each subtask to the predictive power of the model. While all three tasks were integrated into the CNN, the WMI task appeared to provide the most distinctive signals for identifying cognitive markers. As illustrated in Supplementary Figure 4, the average posterior probabilities from the WMI task consistently exhibited strong negative correlations that approached statistical significance with the FSIQ, VSI, and FRI. Although these probabilities did not show a significant correlation with the WMI subscore itself, the high cognitive load and sequential complexity of the WMI paradigm likely elicited the dense spatiotemporal gaze dynamics that the CNN utilized as an integrated proxy. In contrast, despite the relatively high correlations observed with conventional gaze metrics in the FRI task, the FRI task exhibited high individual variance in CNN prediction. 
This suggests that the WMI task was a more effective paradigm for the CNN model, providing the stable signals required for cognitive assessment.

The Grad-CAM analysis provides visual evidence of what these signals represent, particularly highlighting why the WMI task serves as such a robust cognitive marker. The high cognitive load and sequential complexity of the WMI paradigm elicit dense spatiotemporal gaze dynamics that reflect underlying processes such as attentional control, memory integration, and gaze organization. For a participant with a higher FSIQ (Figure 8(a)), the strong activation along dense and intersecting gaze trajectories reflects a systematic and efficient exploration strategy. This aligns with the finding that higher working memory capacity is associated with more structured gaze patterns.⁷⁰ Notably, the CNN translated these structured spatial trajectories into a highly decisive TD prediction for the WMI subtask (probability of 0.003), which was far more confident than the predictions for the VCI and FRI tasks (0.377 and 0.162, respectively). Conversely, for a participant with a lower FSIQ score (Figure 8(b)), the model was strongly activated by unorganized and scattered patterns, which may stem from limitations in executive functions.^61,65 Consistent with this visual interpretation, the model yielded a highly confident ID prediction for the WMI subtask (probability of 0.997), while the predictions for the VCI and FRI tasks remained relatively ambiguous (0.375 and 0.394, respectively). These disparities further highlight the importance of the WMI task for final CNN predictions, as it provides the most discriminative signals for identifying cognitive markers. Furthermore, the misclassified cases highlight broader implications for automated cognitive assessments. The false-negative instance (Figure 8(c)) points to a critical clinical challenge: vulnerability to temporary disengagement. 
This underscores the necessity for future digital assessments to incorporate real-time engagement monitoring or adaptive difficulty calibration, preventing momentary lapses from skewing overall estimations.⁶⁰ Finally, the false-positive case (Figure 8(d)) emphasizes the methodological advantage of the CNN over conventional metric-dependent LR models. While traditional variables like ROT are constrained by rigid spatial boundaries and can be disproportionately skewed by idiosyncratic viewing habits, our deep learning approach implicitly evaluates the holistic quality of visual exploration, proving more resilient in capturing latent cognitive capacities that predefined metrics might overlook.

The findings of this study should be interpreted in the context of several limitations, primarily stemming from the small sample size. Given these limitations, the generalizability of the current findings to broader populations remains restricted, and the results should be considered preliminary and exploratory. The inherent challenges in recruiting participants with ID and the resource-intensive nature of the WISC assessment made developing the hypothesized regression model infeasible. Deep-learning models typically require large datasets to generalize effectively, particularly for a target as complex and heterogeneous as the WISC-V. With a limited sample size, there is a risk of overfitting, and the model may capture sample-specific noise rather than patterns that can be generalized. Indeed, our preliminary analysis demonstrated that the posterior probabilities of the CNN model explained a limited portion of the variance in FSIQ scores ($R^{2}$ = 0.494; Supplementary Table 4). Moreover, for the goal of cognitive profiling, it was difficult to differentiate the subscores because they are inherently inter-correlated, all contributing to the FSIQ. With a limited dataset, a model may successfully predict FSIQ by capturing general cognitive ability rather than differentiating the unique characteristics of specific cognitive profiles.

An additional limitation was the significant age difference between the groups. However, the fact that the ID group was older than the TD group mitigates this concern, suggesting that the observed performance deficits are not attributable to age. To empirically address potential age-related effects, we first performed partial Spearman correlation analyses using age as a covariate. To further validate the independence of our findings from age-related variance, we conducted a supplementary validation analysis restricted to the largest age-homogeneous subgroup (N = 7, aged 12 years; 5 ID and 2 TD). As shown in Supplementary Figure 5, within this subgroup, CNN predictions demonstrated a significant correlation with FSIQ (r = -0.964, q < .001), VCI (r = -0.955, q = .003), FRI (r = -0.946, q = .003), VSI (r = -0.821, q = .029), and PSI (r = -0.857, q = .023). Although the correlation with WMI did not reach significance (r = -0.500, q = .253), these results confirm that the CNN predictions reflect underlying cognitive functioning rather than chronological age.
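One common way to compute a partial Spearman correlation with a single covariate is to rank-transform all three variables and apply the first-order partial correlation formula, r_xy.z = (r_xy - r_xz r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)). The sketch below follows that approach; whether the study used exactly this computation is not stated in the text, so this is an assumption, and the function names and toy values are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


def partial_spearman(x, y, covariate):
    """Spearman correlation of x and y after removing the covariate's effect."""
    rx, ry, rz = average_ranks(x), average_ranks(y), average_ranks(covariate)
    r_xy, r_xz, r_yz = pearson(rx, ry), pearson(rx, rz), pearson(ry, rz)
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5


# Hypothetical values for five participants (not study data).
toy_prob_id = [0.10, 0.25, 0.40, 0.70, 0.90]  # mean predicted probability of ID
toy_fsiq = [100, 110, 70, 85, 60]             # full-scale IQ
toy_age = [9, 13, 10, 8, 12]                  # age in years
print(partial_spearman(toy_prob_id, toy_fsiq, toy_age))
```

If the partial coefficient stays close to the unadjusted Spearman coefficient, as it does for these toy values, the association cannot be explained by the covariate alone.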

Furthermore, while the present study focused on the core cognitive characteristics of ID, the potential influence of comorbidities, such as ASD and ADHD, was not systematically delineated. Therefore, future research with a larger and more diverse sample is essential for validating and extending our findings and for establishing the broader generalizability of gaze-based cognitive assessment. This should include a wider range of ages, intellectual levels, and diverse clinical groups. Beyond simply increasing sample size, careful consideration should be given to the composition of the ID group itself. As ID encompasses a broad spectrum of etiologies, specific subgroups, such as children with Down syndrome or ASD, may exhibit distinct neurocognitive and oculomotor profiles. Therefore, targeted recruitment of etiologically homogeneous cohorts will reduce intra-group variability and enable the identification of gaze signatures specific to each subgroup. Furthermore, investigating the inter-group differences among these various etiologies could provide insights into condition-specific cognitive profiles. Conducting these focused investigations through large-scale clinical collaborations will ultimately enhance both the model performance and interpretability. Securing a larger and more varied dataset would provide the statistical power for significant methodological advancements. This would enable the current binary classification model to evolve into a more sophisticated system capable of differentiating between the subtypes of neurodevelopmental disorders. Critically, it would also allow the development of a regression model designed to predict both FSIQ and specific index scores from gaze data. 
Additionally, as the present findings suggest that the three subtasks contribute differentially to the estimation of composite cognitive scores, future studies should systematically investigate the relative weight assigned to each cognitive domain in predicting FSIQ. Furthermore, domain-specific response time, which showed task-dependent relationships with cognitive indices in the current study, should be integrated as a complementary feature alongside gaze-based measures. Such an approach would allow for a more fine-grained and interpretable cognitive profile, moving beyond a single composite score toward a multidimensional estimation framework that reflects the distinct contributions of each cognitive domain. As the model matures toward practical application, future studies should also evaluate its classification performance in terms of sensitivity and specificity. Although the present study frames gaze-based assessment as an estimation tool, its clinical utility depends critically on the implications of misclassification, including the potential consequences of overlooking children who require intervention.

The interpretability of the deep learning model poses an additional limitation. Although our CNN model achieved better performance than the conventional feature-engineering-based ML model, it was difficult to intuitively understand how the CNN arrived at a classification based on specific gaze pattern features because of its nature as a black-box model. Although post hoc methods for interpreting CNN models have recently been employed, they have their own constraints. For instance, even in the case of misclassification, the model often concentrated on specific regions in a pattern similar to that of the correct classification (Figure 8). This suggests that while heat maps are effective in visualizing where the model is focused, they do not provide a comprehensive explanation for why attention was directed toward a specific class, or what subtle features the model may have overlooked. These limitations likely stem from the inherent constraints of the methods. Grad-CAM, for instance, is known for its inability to process fine-grained elements at the image pixel level,⁷⁵ whereas SHapley Additive exPlanations (SHAP) operates under the assumption of feature independence, which leads to misinterpretations when variables are correlated.⁷⁶ Therefore, future research will require the selection of more inherently interpretable models and the application of more meticulous post-hoc analysis.

Conclusion

This study demonstrates the significant potential of applying deep learning to eye-tracking data as a novel approach for the objective assessment of ID. By developing a novel multidimensional assessment model based on the RPM, we demonstrated that eye-tracking can capture underlying cognitive characteristics, not just behavioral outputs. The high classification accuracy achieved by the CNN model trained on raw scanpath images supports the importance of analyzing holistic spatiotemporal gaze patterns. Ultimately, this study established a methodological foundation and highlighted the potential of this approach for the development of accessible and objective estimation tools for children with developmental disabilities.

Supplemental material

Supplemental material for A multidimensional eye-tracking assessment for estimating cognitive profiles in intellectual disability: A preliminary deep learning study by Kyeong-Bin Park, Jae-Won Yang, Seeun Kim, Dahyeon Sim and Dong-Hwa Jeong in Digital health.

Footnotes

Acknowledgements

The authors would like to express their sincere gratitude to all the children, participants, and legal guardians who participated in this study. We also thank the staff at Choonghyeon Social Welfare Center (Seoul, Korea) for their assistance with participant recruitment. We acknowledge MindHub Inc. for their collaboration and support in this research, which was performed as an industry-university joint project.

ORCID iD

Dong-Hwa Jeong

Ethical considerations

This study was conducted with the approval of the Institutional Review Board (IRB) of the Catholic University of Korea, Seongsim Campus (IRB No. 1040395-202305-02).

Consent to participate

Written informed consent was obtained from the parents or legal guardians of all participants included in the study.

Consent for publication

Not applicable as this manuscript does not contain any images or data that could identify individual participants.

Author contributions

Kyeong-Bin Park: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Visualization, Writing – original draft. Jae-Won Yang: Conceptualization, Investigation, Methodology, Resources, Validation, Writing – review & editing. Seeun Kim: Software, Investigation, Writing – review & editing. Dahyeon Sim: Investigation, Validation, Writing – review & editing. Dong-Hwa Jeong: Conceptualization, Funding acquisition, Project administration, Supervision, Writing – review & editing.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2025-00532988, 70%). This work was also supported by the IITP (Institute of Information & Communications Technology Planning & Evaluation) - ICAN (ICT Challenge and Advanced Network of HRD) (IITP-2025-RS-2024-00438207, 30%) grant funded by the Korean government (Ministry of Science and ICT).

Declaration of conflicting interests

The authors declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: All authors are inventors on a Korean patent application (Application No. 10-20230184326) described in this manuscript. The rights to this pending patent have been licensed to MindHub Inc.

Data Availability Statement

The data and code that support the findings of this study are via .

Guarantor

Dong-Hwa Jeong takes full responsibility for the work and the conduct of the study, had access to the data, and controlled the decision to publish.

Supplemental material

Supplemental material for this article is available online.

References

American Psychiatric Association . Diagnostic and statistical manual of mental disorders. 5th ed. American psychiatric association, 2022.

Mazza

Rossetti

Crespi

, et al. Prevalence of co‐occurring psychiatric disorders in adults and adolescents with intellectual disability: A systematic review and meta‐analysis. Journal of Applied Research in Intellectual Disabilities 2020; 33: 126–138. https://doi.org/10.1111/jar.12654

Olusanya

Smythe

Ogbo

, et al. Global prevalence of developmental disabilities in children and adolescents: A systematic umbrella review. Frontiers in public health 2023; 11: 1122009. https://doi.org/10.3389/fpubh.2023.1122009

Zablotsky

Black

, et al. Diagnosed developmental disabilities in children aged 3–17 years: United States, 2019–2021, 2023.

Hronis

Roberts

Kneebone

. A review of cognitive impairments in children with intellectual disabilities: Implications for cognitive behaviour therapy. British Journal of Clinical Psychology 2017; 56: 189–207. https://doi.org/10.1111/bjc.12133

Sappok

Budczies

Bölte

, et al. Emotional development in adults with autism and intellectual disabilities: a retrospective, clinical analysis. PloS one 2013; 8: e74036. https://doi.org/10.1371/journal.pone.0074036

Scherzer

Chhagan

Kauchali

, et al. Global perspective on early diagnosis and intervention for children with developmental delays and disabilities. Developmental Medicine & Child Neurology 2012; 54: 1079–1084. https://doi.org/10.1111/j.1469-8749.2012.04348.x

Bailey

Willner

Dymond

. A visual aid to decision-making for people with intellectual disabilities. Research in developmental disabilities 2011; 32: 37–46. https://doi.org/10.1016/j.ridd.2010.08.008

Danielsson

Henry

Messer

, et al. Strengths and weaknesses in executive functioning in children with intellectual disability. Research in developmental disabilities 2012; 33: 600–607. https://doi.org/10.1016/j.ridd.2011.11.004

10.

Memisevic

Sinanovic

. Executive function in children with intellectual disability–the effects of sex, level and aetiology of intellectual disability. Journal of intellectual disability research 2014; 58: 830–837. https://doi.org/10.1111/jir.12098

11.

Diamond

. Executive functions. Annual review of psychology 2013; 64: 135–168. https://doi.org/10.1146/annurev-psych-113011-143750

12.

Gligorović

Buha Ðurović

. Inhibitory control and adaptive behaviour in children with mild intellectual disability. Journal of intellectual disability research 2014; 58: 233–242. https://doi.org/10.1111/jir.12000

13.

Henry

Winfield

. Working memory and educational achievement in children with intellectual disabilities. Journal of Intellectual Disability Research 2010; 54: 354–365. https://doi.org/10.1111/j.1365-2788.2010.01264.x

14.

Torra Moreno

Canals Sans

Colomina Fosch

. Behavioral and Cognitive Interventions With Digital Devices in Subjects With Intellectual Disability: A Systematic Review. Frontiers in Psychiatry 2021; 12: 2021, Systematic Review. https://doi.org/10.3389/fpsyt.2021.647399

15.

Organization WH and Fund UNCs . Global report on children with developmental disabilities: from the margins to the mainstream. World Health Organization, 2023.

16.

McKenzie

Murray

, et al. Child and Adolescent Intellectual Disability Screening Questionnaire to identify children with intellectual disability. Developmental Medicine & Child Neurology 2019; 61: 444–450. https://doi.org/10.1111/dmcn.13998

17.

Kalemkuş

. Trends in instructional technologies used in education of people with special needs due to intellectual disability and autism. Journal of Research in Special Educational Needs 2025; 25: 237–261. https://doi.org/10.1111/1471-3802.12723

18.

Burns

. Wechsler intelligence scale for children-V: Test review. Applied Neuropsychology: Child 2016; 5: 156–160. https://doi.org/10.1080/21622965.2015.1015337

19.

Ryan

Glass

Brown

. Administration time estimates for Wechsler Intelligence Scale for Children‐IV subtests, composites, and short forms. Journal of Clinical Psychology 2007; 63: 309–318. https://doi.org/10.1002/jclp.20343

20.

Alam

Raja

Gulzar

. Investigation of Machine Learning Methods for Early Prediction of Neurodevelopmental Disorders in Children. Wireless Communications and Mobile Computing 2022; 2022: 5766386. https://doi.org/10.1155/2022/5766386

21.

Mazumdar

Arru

Battisti

. Early detection of children with autism spectrum disorder based on visual exploration of images. Signal Processing: Image Communication 2021; 94: 116184. https://doi.org/10.1016/j.image.2021.116184

22. Kandeel, Morsy, Alkhodair, et al. Digital health interventions for individuals with disabilities and their impacts on health, quality of life, and social participation. Digital Health 2024; 10: 20552076241294190. https://doi.org/10.1177/20552076241294190

23. Kalemkuş. Effect of the use of augmented reality applications on academic achievement of student in science education: meta analysis review. Interactive Learning Environments 2023; 31: 6017–6034. https://doi.org/10.1080/10494820.2022.2027458

24. Kalemkuş, Bulut-Özek. The effect of online project-based learning on metacognitive awareness of middle school students. Interactive Learning Environments 2024; 32: 1533–1551. https://doi.org/10.1080/10494820.2022.2121733

25. Liu, Zhan, et al. Using a multi-strategy eye-tracking psychometric model to measure intelligence and identify cognitive strategy in Raven's advanced progressive matrices. Intelligence 2023; 100: 101782. https://doi.org/10.1016/j.intell.2023.101782

26. Mahanama, Jayawardana, Rengarajan, et al. Eye Movement and Pupil Measures: A Review. Frontiers in Computer Science 2022; 3: 733531. https://doi.org/10.3389/fcomp.2021.733531

27. Jenner, Farran, Welham, et al. The use of eye-tracking technology as a tool to evaluate social cognition in people with an intellectual disability: a systematic review and meta-analysis. Journal of Neurodevelopmental Disorders 2023; 15: 42. https://doi.org/10.1186/s11689-023-09506-9

28. Wei, Cao, Shi, et al. Machine learning based on eye-tracking data to identify Autism Spectrum Disorder: A systematic review and meta-analysis. Journal of Biomedical Informatics 2023; 137: 104254. https://doi.org/10.1016/j.jbi.2022.104254

29. Lewis, Krupenye. Eye-tracking as a window into primate social cognition. American Journal of Primatology 2022; 84: e23393. https://doi.org/10.1002/ajp.23393

30. Alcañiz, Chicchi-Giglioli, Carrasco-Ribelles, et al. Eye gaze as a biomarker in the recognition of autism spectrum disorder using virtual reality and machine learning: A proof of concept for diagnosis. Autism Research 2022; 15: 131–145. https://doi.org/10.1002/aur.2636

31. Roth, Jording, Schmee, et al. Towards computer aided diagnosis of autism spectrum disorder using virtual environments. In: 2020 IEEE International Conference on Artificial Intelligence and Virtual Reality (AIVR). IEEE, 2020, pp. 115–122.

32. Yoo, Kang, Lim, et al. Development of an innovative approach using portable eye tracking to assist ADHD screening: a machine learning study. Frontiers in Psychiatry 2024; 15: 1337595. https://doi.org/10.3389/fpsyt.2024.1337595

33. Lev, Braw, Elbaum, et al. Eye Tracking During a Continuous Performance Test: Utility for Assessing ADHD Patients. Journal of Attention Disorders 2022; 26: 245–255. https://doi.org/10.1177/1087054720972786

34. Vakil, Lifshitz-Zehavi. Solving the Raven Progressive Matrices by adults with intellectual disability with/without Down syndrome: Different cognitive patterns as indicated by eye-movements. Research in Developmental Disabilities 2012; 33: 645–654. https://doi.org/10.1016/j.ridd.2011.11.009

35. Dawson, Soulières, Ann Gernsbacher, et al. The level and nature of autistic intelligence. Psychological Science 2007; 18: 657–662. https://doi.org/10.1111/j.1467-9280.2007.01954.x

36. Goharpey, Crewther. Problem solving ability in children with intellectual disability as measured by the Raven's Colored Progressive Matrices. Research in Developmental Disabilities 2013; 34: 4366–4374. https://doi.org/10.1016/j.ridd.2013.09.013

37. Patel, Cabral, et al. A clinical primer on intellectual disability. Translational Pediatrics 2020; 9: S23. https://doi.org/10.21037/tp.2020.02.02

38. Pinals, Hovermale, Mauch, et al. Persons with intellectual and developmental disabilities in the mental health system: part 1. Clinical considerations. Psychiatric Services 2022; 73: 313–320. https://doi.org/10.1176/appi.ps.201900504

39. Canivez, Watkins, Dombrowski. Factor structure of the Wechsler Intelligence Scale for Children-Fifth Edition: Exploratory factor analyses with the 16 primary and secondary subtests. Psychological Assessment 2016; 28: 975–986. https://doi.org/10.1037/pas0000238

40. Watkins, Canivez. Assessing the psychometric utility of IQ scores: A tutorial using the Wechsler intelligence scale for children–fifth edition. School Psychology Review 2022; 51: 619–633. https://doi.org/10.1080/2372966x.2020.1816804

41. Hayes, Henderson. Scan patterns during real-world scene viewing predict individual differences in cognitive capacity. Journal of Vision 2017; 17: 23. https://doi.org/10.1167/17.5.23

42. Alsaidi, Obeid, Al-Madi, et al. A convolutional deep neural network approach to predict autism spectrum disorder based on eye-tracking scan paths. Information 2024; 15: 133. https://doi.org/10.3390/info15030133

43. Vortmann L-M, Knychalla, Annerer-Walcher, et al. Imaging time series of eye tracking data to classify attentional states. Frontiers in Neuroscience 2021; 15: 664490. https://doi.org/10.3389/fnins.2021.664490

44. Ahmed, Senan, Rassem, et al. Eye tracking-based diagnosis and early detection of autism spectrum disorder using machine learning and deep learning techniques. Electronics 2022; 11: 530. https://doi.org/10.3390/electronics11040530

45. Cilia, Carette, Elbattah, et al. Computer-Aided Screening of Autism Spectrum Disorder: Eye-Tracking Study Using Data Visualization and Deep Learning. JMIR Human Factors 2021; 8: e27706. https://doi.org/10.2196/27706

46. Abbeduto, Kover, McDuffie. Studying the Language Development of Children with Intellectual Disabilities. Research Methods in Child 2012; 330: 330.

47. Söderqvist, Nutley, Ottersen, et al. Computerized training of non-verbal reasoning and working memory in children with intellectual disability. Frontiers in Human Neuroscience 2012; 6: 271. https://doi.org/10.3389/fnhum.2012.00271

48. Numminen, Service, Ruoppila. Working memory, intelligence and knowledge base in adult persons with intellectual disability. Research in Developmental Disabilities 2002; 23: 105–118. https://doi.org/10.1016/s0891-4222(02)00089-6

49. Jia. Measuring Raven’s Progressive Matrices Combining Eye-Tracking Technology and Machine Learning (ML) Models. Journal of Intelligence 2024; 12: 116. https://doi.org/10.3390/jintelligence12110116

50. Huepe, Roca, Salas, et al. Fluid intelligence and psychosocial outcome: from logical problem solving to social adaptation. PLoS One 2011; 6: e24858. https://doi.org/10.1371/journal.pone.0024858

51. Kaplan, Goodglass, Weintraub. Boston naming test. The Clinical Neuropsychologist 1983.

52. Kim. A Normative Study of the Boston Naming Test in 3- to 14-Year-Old Korean Children. The Clinical Neuropsychologist 2008; 22: 84–97. https://doi.org/10.1080/13854040601064526

53. Baddeley. Working Memory: Theories, Models, and Controversies. Annual Review of Psychology 2012; 63: 1–29. https://doi.org/10.1146/annurev-psych-120710-100422

54. Mammarella, Borella, Pastore, et al. The structure of visuospatial memory in adulthood. Learning and Individual Differences 2013; 25: 99–110. https://doi.org/10.1016/j.lindif.2013.01.014

55. Mammarella, Pazzaglia, Cornoldi. Evidence for different components in children's visuospatial working memory. British Journal of Developmental Psychology 2008; 26: 337–355. https://doi.org/10.1348/026151007X236061

56. Retzler, Johnson, Groom, et al. A comparison of simultaneous and sequential visuo-spatial memory in children born very preterm. Child Neuropsychology 2022; 28: 496–509. https://doi.org/10.1080/09297049.2021.1993808

57. Bays, Husain. Dynamic Shifts of Limited Working Memory Resources in Human Vision. Science 2008; 321: 851–854. https://doi.org/10.1126/science.1158023

58. Vogel, Machizawa. Neural activity predicts individual differences in visual working memory capacity. Nature 2004; 428: 748–751. https://doi.org/10.1038/nature02447

59. Adam KCS, Robison, Vogel. Contralateral Delay Activity Tracks Fluctuations in Working Memory Performance. Journal of Cognitive Neuroscience 2018; 30: 1229–1240. https://doi.org/10.1162/jocn_a_01233

60. Zhang, Liu, et al. Visual Working Memory Capacity for Color Is Independent of Representation Resolution. PLOS ONE 2014; 9: e91681. https://doi.org/10.1371/journal.pone.0091681

61. Fanning, Hocking, Dissanayake, et al. Delineation of a spatial working memory profile using a non-verbal eye-tracking paradigm in young children with autism and Williams syndrome. Child Neuropsychology 2018; 24: 469–489. https://doi.org/10.1080/09297049.2017.1284776

62. Laurence, Mecca, Serpa, et al. Eye Movements and Cognitive Strategy in a Fluid Intelligence Test: Item Type Analysis. Frontiers in Psychology 2018; 9: 380. https://doi.org/10.3389/fpsyg.2018.00380

63. Vigneau, Caissie, Bors. Eye-movement analysis demonstrates strategic influences on intelligence. Intelligence 2006; 34: 261–272. https://doi.org/10.1016/j.intell.2005.11.003

64. Pritalia, Wibirama, Adji, et al. Classification of Learning Styles in Multimedia Learning Using Eye-Tracking and Machine Learning. 2020 FORTEI-International Conference on Electrical Engineering (FORTEI-ICEE) 23–24 Sept, Bandung, Indonesia, 2020, pp. 145–150.

65. Chang, et al. Eye movements of spatial working memory encoding in children with and without autism: chunking processing and reference preference. Autism Research 2021; 14: 897–910. https://doi.org/10.1002/aur.2398

66. Frazier, Strauss, Klingemier, et al. A meta-analysis of gaze differences to social and nonsocial information between individuals with and without autism. Journal of the American Academy of Child & Adolescent Psychiatry 2017; 56: 546–555. https://doi.org/10.1016/j.jaac.2017.05.005

67. Guillon, Hadjikhani, Baduel, et al. Visual social attention in autism spectrum disorder: Insights from eye tracking studies. Neuroscience & Biobehavioral Reviews 2014; 42: 279–297. https://doi.org/10.1016/j.neubiorev.2014.03.013

68. Chita-Tegmark. Attention allocation in ASD: A review and meta-analysis of eye-tracking studies. Review Journal of Autism and Developmental Disorders 2016; 3: 209–223. https://doi.org/10.1007/s40489-016-0077-x

69. Feng, Pratt, Spence. Attention and visuospatial working memory share the same processing resources. Frontiers in Psychology 2012; 3: 103. https://doi.org/10.3389/fpsyg.2012.00103

70. Jayawardena, Michalek, Jayarathna. Eye tracking area of interest in the context of working memory capacity tasks. In: 2019 IEEE 20th International Conference on Information Reuse and Integration for Data Science (IRI). IEEE, 2019, pp. 208–215.

71. Mohammadhasani, Caprì, Nucita, et al. Atypical visual scan path affects remembering in ADHD. Journal of the International Neuropsychological Society 2020; 26: 557–566. https://doi.org/10.1017/S135561771900136X

72. Van der Stigchel, Hollingworth. Visuospatial Working Memory as a Fundamental Component of the Eye Movement System. Current Directions in Psychological Science 2018; 27: 136–143. https://doi.org/10.1177/0963721417741710

73. Hedge, Vivian-Griffiths, Powell, et al. Slow and steady? Strategic adjustments in response caution are moderately reliable and correlate across tasks. Consciousness and Cognition 2019; 75: 102797. https://doi.org/10.1016/j.concog.2019.102797

74. Tamnes, Fjell, Westlye, et al. Becoming consistent: developmental reductions in intraindividual variability in reaction time are related to white matter integrity. Journal of Neuroscience 2012; 32: 972–982. https://doi.org/10.1523/JNEUROSCI.4779-11.2012

75. Selvaraju, Cogswell, Das, et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 2017, pp. 618–626.

76. Salih, Raisi-Estabragh, Galazzo, et al. A Perspective on Explainable Artificial Intelligence Methods: SHAP and LIME. Advanced Intelligent Systems 2025; 7: 2400304. https://doi.org/10.1002/aisy.202400304