Abstract
Visual Reproduction is a test of visual-spatial memory, one of the commonly assessed cognitive domains. Geometric figures serve as stimulus material, and probands have to reproduce the figures from memory in a hand drawing. The scoring of the drawing has subjective elements. This study aims to evaluate the scoring criteria of the Figural Reproduction Test (FRT), part of the Indonesian Neuropsychological Test Battery, and to develop and evaluate an automated scoring system based on computer vision technology (FRT-CVAS). The scoring evaluation was conducted using Cohen's Kappa analysis, accuracy, sensitivity, and specificity. The analyses of the three criteria of the manual confirmed a subjective element in the scoring of the shape of the triangles, reflected in a moderate (0.74) inter-rater agreement; this agreement could be improved to 0.84 by a slight modification of the criteria. FRT-CVAS, based on computer vision’s identification of the different elements of the hand drawing, was developed and trained using 290 drawings. The system was additionally tested by comparing its scoring with that of two independent raters on 120 drawings from a second dataset. FRT-CVAS recognized all elements, and the comparison with human raters showed high accuracy and sensitivity (at least 0.91), while the specificity was 0.80 for one of the three criteria. FRT-CVAS offers highly standardized, consistent, precise, and objective output for the first card of the FRT. This approach is advantageous compared with data-hungry alternatives such as deep learning when applied to the automated scoring of hand drawings with relatively little data available for training.
Plain Language Summary
This study proposes a computer vision approach to developing an automated scoring system for an adapted visual-spatial test with fewer than 1,000 training samples. It offers practitioners a standardized and objective scoring system for a visual-spatial test (FRT-CVAS: Figural Reproduction Test—Computer Vision Automated Scoring System). The visual-spatial test is one of the standard tests adapted and used in many countries. The stimulus may differ in each adaptation, but the main part is always the same: the geometric stimulus. Because the number of hand-drawn responses is limited in the early phase of an adaptation, this approach is suitable for training on small datasets. The research team collected 410 hand-drawn responses to the Figural Reproduction Test, part of the neuropsychological test battery adapted in Indonesia, and compared manual and automated scoring. The scoring of a hand-drawn geometrical figure by human raters has subjective elements. In the Figural Reproduction Test (FRT), these subjective factors are particularly notable in the judgment of the correct shape of the triangles. The scoring accuracy can be improved somewhat by using single criteria instead of the compound criteria suggested in the scorer’s manual. FRT-CVAS, a computer vision approach, further removes such subjectivity: by extracting and evaluating all of the hand-drawn elements in detail, it achieves high accuracy and sensitivity and reasonable specificity. FRT-CVAS achieves a more standardized, consistent, precise, and objective result.
Introduction
Visually presented stimuli are indispensable to any cognitive test battery. Moetesum et al. (2022) reviewed and highlighted the two techniques most commonly used in visual neuropsychological assessment: visual analysis and procedural analysis. Visual analysis assesses visual functions such as object recognition, visual memory, and visual attention, whereas procedural analysis evaluates an individual’s ability to perform tasks that require the integration of visual perception and motor coordination.
In visual-spatial tests, participants are tasked with producing a copy of a figure stimulus by hand. The scoring method involves quantifying the precision of each component and its spatial alignment, reflecting the degree to which the drawn image matches the presented design (Zhang et al., 2021). According to Awad et al. (2004), the scoring of hand drawings is prone to rater subjectivity. They suggest that a simple, objective, and explicit scoring system can mitigate this issue, and found that explicitly defining accuracy and placement separately for each element of the Rey-Osterrieth Complex Figure (ROCF; Osterrieth, 1944; Rey, 1941) reduces the scorer’s latitude in determining what constitutes an accurate reproduction. Moreover, scoring criteria need adjustment when manual tests transition to computerized formats (Kim et al., 2020).
Technology trends in recent decades have significantly impacted the automated scoring of drawings and handwriting through more precise and efficient assessment techniques (Pereira et al., 2015). Examples of computerized assessments of hand drawing and handwriting are the Clock Drawing Test, the ROCF, and the Bender-Gestalt Test (Canham et al., 2000, 2005; Chen et al., 2020; Pereira et al., 2015; Vogt et al., 2019), handwriting segmentation (Vessio, 2019), handwriting classification (Taleb et al., 2019), and spiral drawing. The key to an automated scoring system lies in the algorithms used in its development. These algorithms have been based on geometric template matching, fuzzy logic, statistics (histograms, Local Binary Patterns), Convolutional Neural Networks (CNN), Support Vector Machines (SVM), and decision trees, with accuracies ranging from 63% to 99%.
An automated scoring system standardizes scoring, removes all rater variability (Canham et al., 2005; Diaz-Orueta et al., 2022), and does not fatigue. It also reduces processing time and relieves psychologists and expert physicians of tedious tasks. Vogt et al. (2019) developed a deep learning method for scoring hand drawings and compared its judgments with those of six expert raters. They obtained a high (0.88) correlation between automatic and manual scoring, and suggested adding segment detection to the automated system to achieve an accuracy equal to that of human raters.
The Indonesia Neuropsychology Consortium adapted a visual memory test (the Visual Reproduction test) from the Wechsler Memory Scale (WMS) and named it the Figural Reproduction Test (FRT). It is part of the Indonesian Neuropsychological Test Battery (INTB), which includes nine other cognitive tests (Sulastri et al., 2018; Wahyuningrum et al., 2022). The FRT consists of three cards with different geometric designs. Figure 1 presents a sample of hand-drawn responses to the first of the three stimulus cards. As can be seen, the large variation in the drawings by different subjects leaves room for subjective scorer judgment. Langer et al. (2022) mentioned that the diagnostic effectiveness of such a test is limited by a lack of clear scoring criteria, vagueness as to the allowable extent of deviation from the standard, and differences among scorers in their interpretation of complex criteria. Diaz-Orueta et al. (2022) reviewed the Visual Reproduction test (a subtest of the WMS) and noted that the WMS has been revised several times with regard to the scoring criteria (WMS-R), the stimulus material (WMS-III), and the scoring procedure (WMS-IV), demonstrating a perceived need for improved precision in the scoring of this test.

The left side is the original first card. The right side illustrates some scanned hand-drawn responses.
The present study aims to evaluate the scoring criteria provided in the scorer’s manual of the Figural Reproduction Test (Wahyuningrum et al., 2022). Next, we seek to develop an automated scoring system for the first figure of the FRT. We named this new system FRT-CVAS (Figural Reproduction Test Computer Vision Automated Scoring). It relies heavily on state-of-the-art computer vision techniques to identify various features of the drawings: the shape of the triangles (the flags), the orientation of the four triangles, the positions of the various elements, the crossing lines, and their angles. Finally, we evaluate the performance of FRT-CVAS in terms of its specificity, sensitivity, and accuracy, and we compare its performance with that of human raters, including the computer’s performance when two raters disagreed.
Method
The Figural Reproduction Test: Procedure
The FRT data were collected as part of a larger study in which other cognitive tests were also administered under the INTB data collection protocol. We used the drawings of Card 1 (Figure 1) of the FRT produced by 410 healthy Indonesian participants; these drawings are referred to as image responses. Participants’ ages ranged from 16 to 80 years, and their years of education varied from 6 to 22. The drawings were scanned to obtain digital images. All scanned images were randomly split into two parts: a training dataset (n = 290) and an evaluation dataset (n = 120). The first dataset was used to develop FRT-CVAS and for an initial evaluation; the second was used to compare the performance of FRT-CVAS with that of two independent raters.
The scores of the FRT were determined by examining the presence, orientation, size, and shape of the different elements in the hand drawing. A previous study reported the normative scores obtained with the Indonesian version of the FRT, and the results agreed with international reports (Wahyuningrum et al., 2022). The FRT consists of different geometric stimuli presented on three cards. The test requires the participant to reproduce the figures one by one on a separate blank sheet of paper after the tester has shown each card for 10 s. The tester then manually scores the participant’s responses according to the test manual’s instructions.
Manual Scoring Rules
We used intra-rater and inter-rater reliability to evaluate the scoring criteria. Two senior clinical psychologists with 3 to 4 years of experience in neuropsychological assessment independently graded 120 hand drawings. The raters were first asked to grade the responses using the scoring criteria provided in the manual. Three weeks later, they rated the drawings again with one change: the three compound criteria were subdivided into six single criteria. It was hypothesized that scoring accuracy would improve if the raters used six single criteria (Serra, 1986) instead of three compound criteria.
We hypothesized that the manual’s scoring instructions are ambiguous where two detailed criteria are combined into one parameter, and we propose a modified scoring to make the results more accurate. Table 1 presents the differences between the original and modified scoring criteria.
Two Ways of Scoring: The Left Column Contains the Compound Criteria, and the Right Column the Single Scoring Criteria.
The Figural Reproduction Test: Computer Vision Automated Scoring
System development: FRT-CVAS was developed using Python 3 and OpenCV. Python is a programming language widely used for developing artificial intelligence applications, and OpenCV is a cross-platform library that focuses primarily on image processing. Python was employed to extract image elements and automate the grading process (Hyun et al., 2018). Together, these tools were used to identify and judge the correctness of the image elements—the two crossing lines, their angles, and the presence, direction, and proper shape of the four pointing flags—and to calculate the total score. Figure 2 presents the steps taken by FRT-CVAS.

Flow diagram of the automated scoring process (FRT-CVAS).
Pre-Processing: Images of hand drawings vary in many respects (such as line thickness, brightness, and size), and they may contain misplaced scores written by a research assistant. To standardize the varying line thickness of the different hand drawings, algorithms from mathematical morphology were used: erosion when the lines were too thick and dilation when the lines were too thin (Guoquan et al., 2008). Furthermore, a Gaussian filter (Gong et al., 2018) was used to remove elements that were not part of the respondent’s original drawing, such as a score or comment written by the research assistant.
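The paper does not publish its implementation, so the following is a minimal Python/OpenCV sketch of how such a pre-processing stage could look. The kernel size, the target stroke thickness, and the distance-transform heuristic for estimating stroke width are illustrative assumptions, not the authors’ actual parameters.

```python
import cv2
import numpy as np

def preprocess(image_path, target_thickness=3.0):
    """Binarize a scanned drawing, attenuate extraneous marks, and
    standardize stroke thickness with morphological operators."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Gaussian filter suppresses scanner noise and faint stray marks
    img = cv2.GaussianBlur(img, (5, 5), 0)
    # Invert-binarize so pencil strokes become the white foreground
    _, binary = cv2.threshold(img, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Approximate mean stroke width as twice the mean distance of
    # foreground pixels to the background (illustrative heuristic)
    dist = cv2.distanceTransform(binary, cv2.DIST_L2, 5)
    thickness = 2.0 * dist[binary > 0].mean()
    kernel = np.ones((3, 3), np.uint8)
    if thickness > target_thickness:       # lines too thick: erode
        binary = cv2.erode(binary, kernel, iterations=1)
    elif thickness < target_thickness:     # lines too thin: dilate
        binary = cv2.dilate(binary, kernel, iterations=1)
    return binary
```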
Line extraction: This step identified the two crossing lines. The Hough Transform algorithm detected straight lines by selecting candidate lines (Sim & Wright, 2005), and a median formula was used to determine the desired line. The automated system then identified the two straight lines that cross and assessed their intersection point, which was used to determine the angle of the intersecting lines. According to the scoring criteria, this angle had to be equal to or greater than 60° and less than or equal to 120°.
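As an illustration of this step, the sketch below detects line segments with OpenCV’s probabilistic Hough transform, takes the median direction of each of the two diagonal clusters, and checks the 60°–120° criterion. The Hough parameters and the 90° split used to cluster the segments are assumptions for illustration; the authors’ exact procedure is not published.

```python
import cv2
import numpy as np

def crossing_angle(binary):
    """Estimate the angle between the two crossing lines and test the
    manual's criterion (intersection angle between 60 and 120 degrees)."""
    segments = cv2.HoughLinesP(binary, rho=1, theta=np.pi / 180,
                               threshold=80, minLineLength=100,
                               maxLineGap=10)
    if segments is None:
        return None, False
    pts = segments[:, 0, :]                       # each row: x1, y1, x2, y2
    # Orientation of each candidate segment, folded into [0, 180)
    theta = np.degrees(np.arctan2(pts[:, 3] - pts[:, 1],
                                  pts[:, 2] - pts[:, 0])) % 180
    # Split candidates into the two diagonal clusters and take the median
    # direction of each as the representative crossing line (assumes
    # both diagonals were detected)
    line_a = np.median(theta[theta < 90])
    line_b = np.median(theta[theta >= 90])
    acute = min(abs(line_b - line_a), 180 - abs(line_b - line_a))
    # An intersection angle within [60, 120] degrees is equivalent to the
    # acute angle between the two lines being at least 60 degrees
    return acute, acute >= 60
```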
Contour Extraction: Flag extraction was conducted by dividing the image into four parts and extracting each part using a contour detection algorithm, which detects the borders of every object of interest as a representative feature (Awad et al., 2004). The individual flags had to be located at the ends of the crossing lines: top left, top right, bottom left, or bottom right.
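A minimal sketch of this quadrant-wise contour extraction follows, assuming OpenCV’s findContours and an illustrative minimum-area threshold (the paper does not report its actual threshold). In practice, the crossing lines would first be masked out so that they are not picked up as flag candidates.

```python
import cv2

def extract_flags(binary):
    """Split the image into four quadrants and keep the largest
    sufficiently large contour in each as the flag candidate."""
    h, w = binary.shape
    quadrants = {
        "top_left":     binary[:h // 2, :w // 2],
        "top_right":    binary[:h // 2, w // 2:],
        "bottom_left":  binary[h // 2:, :w // 2],
        "bottom_right": binary[h // 2:, w // 2:],
    }
    flags = {}
    for name, roi in quadrants.items():
        contours, _ = cv2.findContours(roi, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        # An area must have sufficient surface to count as a flag;
        # the 200 px threshold here is illustrative, not the paper's
        candidates = [c for c in contours if cv2.contourArea(c) > 200]
        if candidates:
            flags[name] = max(candidates, key=cv2.contourArea)
    return flags
```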
Shape Identification and Orientation: The extracted contours form a set of polygon areas across the image. We extracted the shapes of the four triangle flags by calculating the number of sides of each polygon using the approxPolyDP function (www.opencv.org). Finally, the orientation of each flag was determined by computing the moment of the polygon relative to the flag’s line. Figure 3 illustrates the result of this processing, and a code sketch of this step follows the figure.

Sample of the FRT-CVAS process: (a) noise removal and line extraction, (b) contour detection, and (c) shape position and orientation.
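For illustration, the following minimal sketch shows how the side count and an orientation cue could be computed with approxPolyDP and image moments. The epsilon ratio and the use of the centroid as an orientation cue are illustrative assumptions rather than the exact parameters used in FRT-CVAS.

```python
import cv2

def classify_flag(contour, eps_ratio=0.04):
    """Approximate the flag contour as a polygon, count its sides
    (a correctly drawn flag reduces to 3 vertices), and return the
    centroid from the image moments as an orientation cue."""
    perimeter = cv2.arcLength(contour, True)
    approx = cv2.approxPolyDP(contour, eps_ratio * perimeter, True)
    is_triangle = len(approx) == 3
    m = cv2.moments(contour)
    # Centroid of the polygon; its position relative to the flag's
    # attachment point on the crossing line indicates the orientation
    cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]
    return is_triangle, (cx, cy)
```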
Statistical Analysis
The evaluation was conducted in three parts. First, the training data were used to evaluate the automated scoring system. Second, the FRT scores of the two raters were evaluated. Third, the two scoring methods (manual and automated) were compared.
The computer’s performance was evaluated in several ways. The training dataset was first evaluated by determining, for each element, its sensitivity (True Positives/[True Positives + False Negatives]), specificity (True Negatives/[True Negatives + False Positives]), and accuracy ([True Positives + True Negatives]/[True Positives + False Positives + True Negatives + False Negatives]). The same calculations were applied to the evaluation dataset to assess FRT-CVAS on an independent dataset that had not been used before. A True Positive means that an element was identified as correct both by the computer and by a single rater (training dataset) or by two raters (evaluation dataset). Conversely, a True Negative means that the rater(s) and the computer agreed in identifying an element as incorrect. False Negatives occurred when the computer recognized an element as incorrect but the raters considered it correct; False Positives occurred when the computer identified an element as correct but the raters scored it as incorrect.
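These definitions translate directly into code; the helper below restates them and returns None where a metric is undefined, which is how the undefined specificity cells in the results tables arise (no negative cases for an element).

```python
def confusion_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, and accuracy as defined in the text.
    A metric is undefined (None) when its denominator is zero, e.g.
    specificity when an element has no negative cases at all."""
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy
```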
The scoring of the different elements of the drawing yielded binary data: zero for a missing or incorrect item and one for a correct item. Cohen’s Kappa was therefore used to investigate the intra-rater reliability of the two scoring methods (compound vs. single criteria). Next, the inter-rater reliability of both ways of scoring was assessed, as well as the agreement between FRT-CVAS and the human raters. The benchmarks for Cohen’s Kappa used in this study were: <0.20 = no agreement; 0.21 to 0.39 = minimal; 0.40 to 0.59 = weak; 0.60 to 0.79 = moderate; 0.80 to 0.90 = strong; and >0.90 = almost perfect agreement.
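For reference, Cohen’s Kappa for two binary ratings reduces to a few lines; this is the standard formula, not code from the study.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two binary (0/1) rating vectors:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e the agreement expected by chance."""
    n = len(rater_a)
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    pa, pb = sum(rater_a) / n, sum(rater_b) / n
    p_e = pa * pb + (1 - pa) * (1 - pb)
    return (p_o - p_e) / (1 - p_e)
```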
The item scores on which the two raters agreed were used to evaluate FRT-CVAS. For the instances in which the raters disagreed, a third rater was recruited to score them, and his or her judgment was considered decisive.
Results
The results section is divided into three parts. The first part contains the analysis of how the neuropsychologists scored manually, using intra- and inter-rater reliability indices. In the second part, FRT-CVAS is described and a preliminary evaluation is provided using the training dataset (N = 290). In the third part, FRT-CVAS is evaluated against the judgment of two independent raters on a different dataset (N = 120).
Manual Scoring Evaluation
Two neuropsychologists scored the 120 hand drawings twice, with an interval of three weeks. The intra-rater reliability, expressed by Cohen’s Kappa, determined the degree of agreement for each rater between the first and second rounds, and the inter-rater reliability coefficients measured the degree of agreement between the two raters separately for rounds 1 and 2.
Intra-rater reliability: Cohen’s Kappa for the total score was K = .75, p < .01, for rater 1, with an inconsistency of 11.65%, and K = .86, p < .01, for rater 2, with an inconsistency of 7.5%. The analyses of the individual criteria showed excellent agreement (>0.90) for both criterion 1 (the combined score of 1a and 1b) and criterion 2 (the combined score of 2a and 2b; see Table 1). Only moderate agreement (K = .66, p < .01) was found for criterion 3 (the combined score of 3a and 3b).
Inter-rater reliability: Table 2 presents the results of the Cohen’s Kappa analysis showing the degree to which the two raters agreed in their scoring. In round one, when compound criteria were used, the raters were in almost perfect agreement for criteria 1 and 2 (K above 0.90). In contrast, the agreement for criterion 3 was weak (K = 0.46). Consequently, the agreement on the total score was only moderate (K = 0.74). In the second round, using single criteria, the K values showed almost perfect agreement for criterion 2a (K = 0.93) and strong agreement for criterion 2b (K = 0.89). In contrast, very low agreement was found for criterion 3a (K = −0.01); this negative value indicates that the agreement between the two raters was even less than expected by chance (Sim & Wright, 2005). The low agreement indicates a far greater subjective element in the scoring of this criterion than of any other. For criterion 3b, there was moderate agreement (K = 0.65). The agreement on the total score was stronger than in round one (K = 0.84).
Inter-rater Correlation Coefficients (ICC) for Compound (Round 1) and Single (Round 2) Criterion Scoring, With p-values, F, and 95% Confidence Intervals Between Raters for Criterion Two, Criterion Three, and the Total Score.
Note. The scoring of drawings using compound 1 and single criteria 1a and 1b has no variance since both raters fully agreed.
FRT-CVAS Scoring Evaluation
The automatic scoring system was developed, optimized, and given a preliminary evaluation using 290 hand drawings. It was then independently tested using 120 hand drawings that had not been used before.
Despite high variability in the drawings of the training dataset (see Figure 1), FRT-CVAS handled this variability well and achieved high accuracy. The Hough Transform and median methods were used to determine the correct crossing lines. Next, specific elements of the figure (the four flags or triangles) were identified by choosing the four largest areas and ignoring any smaller ones. An area had to have a sufficient surface to be regarded as a flag. After the flag areas were defined, the system calculated the number of sides of each defined area.
The results of the evaluation of FRT-CVAS on the training dataset are presented in Table 3. The outcomes of FRT-CVAS were first compared with those of a single rater using the training data. The scoring by FRT-CVAS was done separately for each of the different elements characterizing the figure. The crossing lines and the presence of the four flags were always identified correctly, with only true positives and no true negatives or false detections, indicating a sensitivity and accuracy of 1 with undefined specificity.
Element-wise Analyses of the Training Dataset. The Accuracy and Sensitivity of the Recognition of All Elements Are High (≥ .95). The Specificity of Two Elements Was Undefined; It Was Relatively Low for Three of the Four Triangle-Shape Elements and High for the Other Elements.
The next four elements concerned the orientation of the four facing flags. The participants made quite a number of errors on this part of the task, as indicated by the high number of true negatives for these elements. FRT-CVAS correctly identified almost all incorrect flag orientations, resulting in sensitivity, specificity, and accuracy scores all above .95. The orientation of three of the four flags showed a small number of false negatives; the top left flag showed somewhat more (13 out of 290 drawings). The angle detection by FRT-CVAS was also excellent, yielding high scores for sensitivity, specificity, and accuracy. On triangle identification, FRT-CVAS likewise showed good sensitivity and accuracy, both above 0.96.
The specificity of the triangle identification was low (between 0.44 and 0.56), except for the top right triangle (specificity = 0.86). Figure 4 provides some examples of hand drawings that were misidentified by FRT-CVAS and contributed to the low specificity. Figure 4a illustrates the difficulty the system had in identifying the form of a triangle at the bottom when the triangles were very close to one another. Figure 4b shows that the computer misidentified the bottom left triangle as correct, and Figures 4c and 4d show that the computer misidentified the triangle because of the wide gap between the lines; in other words, the lines of the triangle did not connect sufficiently. Figures 4e and 4f show that the system may identify a triangle as correct when human raters judged it incorrect: the system found a closed line, which it then misidentified as a triangle.

(a–f) Examples of hand drawings misidentified by the system.
The Accuracy of the Computerized Versus Manual Scoring
The accuracy of the computerized assessment was also evaluated against the manually assigned scores of the evaluation dataset. Interestingly, most of the participants’ errors concerned the proper recall of the orientation of the flags: 22% of the subjects made such errors. We included in this first comparison only the element scores on which both raters agreed; for criterion 1, this concerned all 120 figures in both scoring rounds.
Table 4 shows the results of the analysis of all image elements. Overall, the automated scoring exhibited excellent sensitivity and accuracy: for all image elements, these were between 91% and 100%. Regarding compound criterion 1, all responses were scored correctly by FRT-CVAS; therefore, sensitivity and accuracy were 1.00, and the specificity for compound criterion 1 remained undefined. Similarly, scoring according to single criteria 1a and 1b produced no errors and yielded perfect sensitivity and accuracy with undefined specificity. FRT-CVAS was also excellent at identifying the true negative responses made by the participants: the score was 99% for all criteria. FRT-CVAS produced a few false negatives: a small number for compound criteria 2 (5) and 3 (9) and for single criteria 2a (2), 2b (3), and 3b (7).
Comparison of FRT-CVAS and Two Ways of Human Scoring. N is the Number of Human-rater Agreements on Scoring. The Results Revealed High Sensitivity, Accuracy, and Specificity.
Note. N is the number of responses where both raters agreed.
The subjectivity of human raters is also illustrated in Table 5. The two raters judged some hand drawings differently: they disagreed on 5 of the 120 drawings of the evaluation dataset regarding the direction of the facing flags (criterion 2) and on 7 drawings regarding the correct shape of the flags (criterion 3). More specifically, three drawings involved both criteria 2a and 2b and two involved only 2b; one drawing involved both 3a and 3b, four involved only 3b, and two involved only 3a. To obtain an objective judgment of what should be considered correct, we introduced a third independent judge to rate the figures on which there was disagreement and compared the scoring of FRT-CVAS against the majority score. The automated scoring agreed with this majority in four of the five drawings for criterion 2. Regarding criterion 3, FRT-CVAS proved advantageous in determining the correct angle of the crossing lines: in all such cases, its score matched that of the majority of raters. On triangle shape identification, the FRT-CVAS score matched the majority of the raters in four out of five drawing responses.
FRT-CVAS Agreement With Three Human Raters. The Automated Scoring is Consistent With the Majority of the Raters.
Discussion
This study evaluated the scoring of a visual-spatial reproduction/memory test by human raters and by computer. This was done by comparing the scores of FRT-CVAS with those of a single rater on the training dataset and with the shared opinion of two independent raters on the test dataset.
Based on the two ways of applying the scoring criteria, we found that the raters’ consistency was excellent regarding the presence of elements (i.e., the crossed lines and the flags). Conversely, there was disagreement about whether the two flags at the top and bottom were facing each other, about the angular size of the intersecting lines, and about the shape of the flags. Measuring the angle of the intersecting lines with a protractor would make judgments of line angle more consistent, but this seems rather time-consuming and inefficient (Awad et al., 2004). The identification of the shape of the flags also led to more general disagreement between raters because the shape of the drawn flags varied not only between drawings but also within a single drawing. This difficulty in judging whether a shape is correct was also found in a previous study of a complex figure test (Langer et al., 2022). It might be addressed by giving scorers additional practice in scoring geometric figures. Alternatively, as demonstrated here, complex drawings like these can be scored accurately and objectively by an automated system (Guoquan et al., 2008).
In round two of the study, the use of six single criteria, as opposed to three compound criteria, resulted in a discernible enhancement in inter-rater agreement. This observation parallels findings reported by Awad et al. (2004), who compared two scoring methodologies for Taylor’s complex figure. The use of six single criteria helps to reduce human subjectivity. This finding aligns with previous studies in which using single criteria increased the sensitivity and specificity of scoring (Jamus et al., 2023; Troyer & Wishart, 1997; Zhang et al., 2021). In addition, our intra-rater analysis (see Table 3) shows that single criteria help raters score more accurately. Using single criteria may also help raters remain consistent when scoring a large number of hand drawings. For the future, we suggest that, as long as the FRT is administered manually, raters consider using single criteria rather than compound criteria to improve the accuracy of their judgments.
Li et al. (2013) identified unit recognition as the pivotal challenge in automating the analysis and scoring of the Rey-Osterrieth Complex Figure (ROCF). Apart from contending with ambiguous drawing features, the need to discern individual units and their constituent elements, such as line segments, circles, and points, adds complexity. Moreover, variations in the number and geometric relations of these components across units further complicate the scoring procedure. Methodologies by which computer vision decomposes complex images into their constituent parts have been reported in previous studies (El-gayar et al., 2013; Fleuret et al., 2011). In this study, we extracted the individual elements of the drawing following the suggestion of Vogt et al. (2019), who predicted this would increase scoring accuracy. Indeed, the accuracy of our system was high: all individual elements reached at least 91%, and often close to 95%.
The inception of automated scoring methodologies for visual-spatial memory tasks, notably exemplified by the Rey-Osterrieth Complex Figure (ROCF), dates back over two decades (Canham et al., 2000). Initial endeavors primarily relied on rudimentary feature extraction techniques to discern select components of hand-drawn responses. The computer vision approach has also achieved significant improvement in the automated scoring of the Complex Figure Test (Canham et al., 2005; Gao et al., 2018; Li et al., 2013; Webb et al., 2021) and other figural tests such as the Trail Making Test (Dahmen et al., 2017) and Ruff’s Figural Fluency Test (Elderson et al., 2016).
There are two ways of collecting hand-drawn responses: using a digital device such as a tablet or drawing pad, or using a scanner to digitize a paper drawing. In prior investigations by Webb et al. (2021) and Li et al. (2013), which leveraged digital devices for data collection, participants were instructed to produce drawings directly on the provided digital interfaces. The adoption of digital devices was driven by the desire to mitigate noise artifacts and the overlapping strokes that challenge handwriting image processing. In practice, however, many practitioners still employ the paper-and-pencil method. In this study, we adopted a distinct methodological approach by using scanned representations of paper-and-pencil drawings. To address the inherent noise artifacts in the scanned images, a Gaussian filter was employed to attenuate extraneous signal fluctuations. Moreover, to disentangle overlapping stroke lines and facilitate accurate angle calculations, a line extraction method was applied, enhancing the fidelity of the image analysis.
Recent advances in computer image recognition techniques have significantly improved medical disease classification and diagnosis. A computer-vision approach to handwriting analysis has enabled the early diagnosis of Parkinson’s disease (Pereira et al., 2015; Souza et al., 2018). One of the advantages of any automated scoring system is accuracy, in our case in determining the angle of the intersecting lines (Webb et al., 2021). Subsequent advances, particularly in the field of artificial intelligence (AI), facilitated the emergence of deep learning algorithms tailored for automated scoring applications. Recent studies by Langer et al. (2022) and Park et al. (2023) underscore the utility of deep learning methodologies in assessing memory deficits through automated scoring protocols, extending beyond the ROCF.
The efficacy of deep learning algorithms is contingent upon access to training data, typically comprising thousands of images. This prompts the question of how much data is required before deep learning algorithms become practical. Langer et al. (2022) posit that once a dataset surpasses thousands of images, deep learning techniques can be used effectively for system development. They note that at around 3,000 images the mean MAEs (Mean Absolute Errors) decrease, and that beyond approximately 10,000 images, additional data does not lead to significant improvement. Another study, by Li et al. (2013), emphasized the need for caution when employing deep learning with limited sample sizes and underscored the importance of congruence between the training data and the target images.
To evaluate the subjectivity among raters, we examined the drawings that the two raters scored differently from FRT-CVAS. The points of disagreement all concerned the evaluation of the correct shape of the triangles. The raters had a different tolerance than the automated system for the number of triangle sides: for instance, a triangle was sometimes accepted as correct when a small fourth side or a curved line had been added to the drawing. The variable shape of the triangles was also responsible for the false negatives produced by FRT-CVAS: 11.7% and 10% for the compound and single criteria, respectively. The differences in the scores given by the raters are due to a more generous interpretation of what the “triangle-shaped flag” concept requires. In daily practice, we recommend that when the computer marks an item as incorrect, it be re-examined by a neuropsychologist, who then makes the final judgment. This re-examination can be done quickly if the digitized drawing and the scores are displayed simultaneously. Our computer-rater comparisons showed that this automated system, with its standardized procedures and high agreement with human raters, can help neuropsychologists score more objectively. Kenda et al. (2022) also reported that the problem of inter-rater variability can be eliminated by using a standardized automated method.
The subjective nature of scoring visual-spatial tests stems from perceptual organization, which occurs commonly in human vision (Nevatia, 2000). Some studies note that human raters, even well-trained clinicians, may not score hand drawings consistently (Canham et al., 2000; Langer et al., 2022). In this study, two raters were initially involved in scoring the hand drawings; later, a third rater was added to provide a deciding vote. Adding more human raters would help to better estimate the threshold or deviation parameters for determining shape and could thereby improve the computer’s accuracy in identifying a triangle. However, the main problem, namely the vagueness in the definition of what should be considered a triangle, is not solved by adding more raters. Therefore, to reduce differences between raters, we advise adapting the test manual to provide more detailed and unambiguous criteria.
We expect that the automatic scoring system can be made more precise in scoring the triangles by involving more human raters in determining what counts as a correct triangle and by comparing the output of FRT-CVAS with the shared opinion of these raters. These judgments can then be used as a new training dataset for further development of the automated scoring system, thereby improving its agreement with human judgment.
A limitation of this study is that the hand-drawn responses were produced by self-described “healthy” participants, whereas the test is designed for application to clinical populations. As a next step, we therefore plan to clinically validate the test and its automated scoring in various groups of patients. We did find some drawings that seemed indicative of mild or moderate motor problems, but these drawings did not appear to affect the performance of the automated system negatively. Another limitation is that we developed automated scoring for the first card of the FRT only. However, the computer vision principles and feature extraction techniques we applied can be extended to the other figures of the FRT and to the automated analysis of other hand drawings of geometrical figures.
Conclusion and Outlooks
Human judgments of hand-drawn geometrical figures involve subjective elements. In the Figural Reproduction Test (FRT), these subjective factors are particularly noticeable in the judgment of the correct shape of the triangles. The accuracy of the scoring can be improved somewhat by using single criteria rather than the compound criteria suggested in the scorer’s manual. FRT-CVAS, a computer vision approach, further removes this subjectivity: by extracting and evaluating all of the hand-drawn elements in detail, it achieves a high level of accuracy and sensitivity and good specificity. FRT-CVAS produces a more standardized, consistent, precise, and objective result. Given the leniency of the test’s scoring instructions regarding acceptable triangle shapes, FRT-CVAS cannot be expected to produce results that expert scorers will always agree with. In the meantime, the best course of action is for the individual clinician to check the shape of the triangles themselves when in serious doubt.
It should be noted that the improvement in scoring achieved by using six single criteria instead of three compound criteria was based on only one of the three figures of the FRT. If the scoring of the other two, even more complex, figures had been divided in this way, the difference between the two scoring methods would probably have been greater. This simple innovation could therefore have a beneficial effect on the accurate diagnosis of pathology.
Automated scoring systems offer professionals access to objective measurements, thereby enhancing the reliability and standardization of assessments. With computer vision techniques, even datasets comprising fewer than a thousand training samples of hand drawings can be processed effectively. This methodology proves instrumental in developing automated scoring mechanisms tailored to newly adapted visual-spatial assessments. Furthermore, integrating FRT-CVAS into online platforms facilitates widespread accessibility, enabling neuropsychologists across different geographical regions to use tools such as the hand-drawing response of the Figural Reproduction Test with greater ease and efficiency.
Concerning future research, our focus is on developing automated scoring for the remaining cards of the Figural Reproduction Test (FRT). The approach used in FRT-CVAS is a comprehensive process that includes image segmentation, feature extraction, classification, and the grouping of constituent small image units. This methodology shows promise as a pre-processing tool for subsequent integration with deep learning algorithms. Our future research trajectory is broad, starting with the collection of hand-drawing results from visual tests such as the Five Point Test, the Trail Making Test, and the FRT, not only in healthy groups but also in diverse patient groups. Data from these patient categories will then be used as training data to improve the effectiveness of our system for the early detection and identification of latent neurological pathology using a deep learning paradigm.
Acknowledgements
The authors are indebted to Dr. John V. Keller, who helped us with linguistic corrections.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded by DIKTI (Directorate of Higher Education General of Indonesia, number: 076/E5/PG.02.00.PL/2023).
Ethical Approval
The research ethics committee of Soegijapranata Catholic University-Indonesia (number: 001B/B.7.5/FP.KEP/IV/2018) approved the data collection, storage, and use for scientific purposes. This approval followed the principles of the Declaration of Helsinki of the World Medical Association and local legislation.
Data Availability Statement
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.
