Abstract
Research on the automated assessment of mental disorders has primarily focused on adult participants and on behaviors on the individual level. We propose an approach to automatically assess the severity of children’s behavioral, emotional, and social problems from videos of face-to-face parent-child interaction. Children’s behavioral, emotional, and social problems were quantified using the Child Behavior Checklist (CBCL), focusing on the two broad categories “internalizing” and “externalizing” and the more specific categories “anxious”, “withdrawn”, and “aggressive”. Our experimental data comes from a cohort of 81 8- to 10-year-old children and their parents. We constructed features to represent the nonverbal face and head behaviors of the parents and children, combined them with the children’s symptom scores, and then fed these data to binary classifiers to make broad estimations of symptom severity. Prediction performance was good only for anxiety scores, although the prediction of withdrawal and internalizing scores did show some promise as well. We moreover identified the behaviors that were most informative in the context of predicting anxiety and withdrawal and investigated how they were influenced by symptom severity and topic of conversation. Our results exemplify how machine learning and computer vision can be used to gain further insights into child psychopathology.
Introduction
Current methods to assess mental disorders depend primarily on clinical interviews and self-report scales in which psychological dysfunction is reported by either the affected individual or by someone else (e.g., their caregiver). Clinical interviews are time intensive, difficult to standardize across settings, and inherently subjective, while self-report scales are limited by various factors, such as the patient’s reading ability and differences between how clinicians and patients conceptualize their symptoms (Hinduja et al., 2024). They can be influenced by patients’ recall biases (e.g., downplaying or overestimating one’s symptoms), cognitive limitations (e.g., problems with memory), and social stigma (Low et al., 2020). Increased objectivity in the assessment of mental disorders may thus be achieved by examining behavior in a manner that is less susceptible to such biases. Nonverbal behaviors, for example, are often under less voluntary control than verbal behaviors and can sometimes be early and reliable indicators of disorders (Philippot et al., 2003). As such, the observation and measurement of nonverbal behaviors may be used to support existing methods of clinical screening and diagnosis.
The traditional approach to detecting nonverbal behaviors has been through manual annotation by trained human annotators, which is extremely time consuming and often requires compromises, such as annotating fewer frames or investigating fewer contexts. The automatic detection of nonverbal behaviors can be an efficient alternative, with many pre-built software toolkits readily available (e.g., Baltrušaitis et al., 2018; Bredin, 2023; Cao et al., 2017; Hinduja et al., 2023; Lugaresi et al., 2019; Onal Ertugrul et al., 2024; Plaquet & Bredin, 2023). These methods also have the potential for increased objectivity and representativeness, as it is much more feasible to annotate behavior from multiple angles and multiple contexts with automated methods than it is manually. The automated assessment of mental and neurodevelopmental disorders from multimodal cues has thus become an important topic in the field of affective computing.
Considerable focus has been placed on the automated detection of depression (see Girard & Cohn, 2015; Pampouchidou et al., 2019, for reviews), anxiety (see e.g., Tayarani-N & Shahid, 2025), and autism spectrum disorder (see de Belen et al., 2020, for a systematic review). This body of research, however, has two important shortcomings. First, there exists little research investigating nonverbal cues and their relation to psychopathology in children. Such research is crucial, because the patterns of nonverbal behavior that characterize specific disorders in children are likely to differ from those of adults (Kazdin et al., 1985). Second, the dyadic context of behavior has often been overlooked: studies have typically focused on the nonverbal behaviors of individuals in isolation. This can be problematic, as most nonverbal behaviors occur and have meaning in the context of interaction. Even in clinical interviews, prospective patients observe and react to the behavior of the interviewer. To further our knowledge about the subtle nonverbal cues that mark mental disorders, it is important to consider the dyadic nature of those cues (e.g., Bey et al., 2024; Bilalpur et al., 2023; Isaev et al., 2024).
To address the above-mentioned shortcomings, we propose an approach to automatically assess the severity of children’s behavioral, emotional, and social problems from videos of face-to-face parent-child interaction. These problems are commonly grouped under two broad categories: internalizing problems (e.g., anxiety, withdrawal, and somatic complaints) and externalizing problems (e.g., aggressive and rule-breaking behavior).
Automated methods for assessing parent-child interactions exist and have been used to investigate constructs such as engagement level, interaction quality, and attachment style of children from infancy to adolescence (see Karaca et al., 2024, for a review). Important early work was done by Rehg et al. (2013), who introduced a multimodal dataset of adults and young children engaged in an interactive play protocol designed to assess socio-communicative milestones in the first two years of life. Studies utilizing automated methods with child participants have primarily focused on neurodevelopmental disorders such as attention-deficit/hyperactivity disorder and autism spectrum disorder (e.g., Bey et al., 2024; Isaev et al., 2024), whereas affective and conduct disorders have received little attention. A notable exception is the work of Halfon et al. (2021), who developed a multimodal system to automatically assess children’s anger, anxiety, pleasure, and sadness levels from videotaped psychodynamic play therapy sessions. They used facial features extracted from videos together with transcripts of the sessions, and analyzed these for affect levels. Overall, pleasure was predicted best, particularly when video and text data were used together. Anger, anxiety, and sadness, on the other hand, were predicted well from text data alone, and facial analysis by itself was not sufficient. The natural play setting also made it difficult to capture enough faces from a single camera viewpoint.
In the present study, we aimed to automatically assess the internalizing and externalizing problems of children, as measured by the CBCL, from videos of parent-child interaction. Building on previous research and gaps in the literature, we incorporated a wide range of nonverbal behaviors related to facial expressions, head orientation, and gaze direction, and examined behavior on both an individual and dyadic level. Although hand and body gestures are widely used and often have communicative power (see e.g., Hessels et al., 2025; Holler et al., 2009; Hostetter, 2011), and body posture data has previously been utilized in automatic depression detection (e.g., Joshi et al., 2013), the videos in our dataset only contained information from the shoulders up and did not allow for the inclusion of hand gestures or body posture (except for head movements). The specific research questions we investigated were: (1) Can we automatically predict children’s “internalizing”, “externalizing”, “anxious”, “withdrawn”, and “aggressive” syndrome scale scores from the face and head dynamics, eye gaze, and facial expressions of parents and children while they interact with each other? (2) If so, which features contribute most to prediction performance and (3) how are they represented when examined back in the context of nonverbal behavior during parent-child interaction? The study was exploratory, and no specific hypotheses were made.
Method
Dataset
The dataset used in the present study was collected as a separate add-on experiment in the Child & Adolescent cohort of the YOUth study conducted at Utrecht University (Onland-Moret et al., 2020). A complete description of participant recruitment and demographics, the technical details of the setup (e.g., data synchronization), and experimental procedure is presented in Holleman et al. (2021), with additional technical details found in Hessels et al. (2017).
Participants
The dataset consisted of 81 parent-child interaction videos. Participating children were between 8 and 10 years of age (M = 9.34 years), with 55 of them female (68%). Parents were aged between 33 and 56 years (M = 42.11 years), with 64 of them female (79%), and had middle to high education levels, representative of the overall relatively high socioeconomic status in the YOUth study compared to the general Dutch population (see Fakkel et al., 2020). The average family/household size of the sample was 4.27 residents (SD = 0.71 residents). Out of the 81 children, 70 had two parents or caregivers (92%), 7 had no siblings (9%), 42 had one sibling (55%), 23 had two siblings (30%), and 4 had three siblings (5%). All parents provided written informed consent for themselves and on behalf of their children. The study was approved by the Medical Research Ethics Committee of the University Medical Center Utrecht and is registered under protocol number 19–051/M.
Setup
The parent-child interaction videos were recorded using a dual eye-tracking setup consisting of two web cameras, two eye trackers, and two computer monitors (i.e., a dual eye-tracking setup with screens, using the terminology introduced in Valtakari et al., 2021). The parents and children sat on opposite sides of the setup in front of their respective computer monitors. The cameras were placed behind half-silvered mirrors and recorded at a resolution of 800 x 600 px at 30 Hz, allowing the parent and child to view each other through live video feeds of the cameras, which were presented at a resolution of 1024 x 768 px on their respective computer monitors. The camera feeds were saved with timestamps. Eye movements were recorded at 120 Hz using SMI RED eye trackers, and two AKG C417-PP Lavalier microphones were used to record the speech of both the parent and the child. Although physically present in the same room, participants viewed each other through computer monitors, so one might argue that conversing in the setup was not completely representative of a typical face-to-face conversation. On the other hand, the nature of the setup allowed for fully frontal face data acquisition, which was beneficial for our automated analyses. Participants seemed to quickly get used to conversing in the setup, and we have no reason to believe that some participants perceived being in the setup differently than others.
Procedure
Each dyad underwent two conversation scenarios in a fixed order. The “conflict” scenario had the parent and child discuss a recent disagreement and try to agree on possible solutions, while the “cooperation” scenario had them discuss an occasion for which they would like to organize a party and the activities that the party would entail. Dyads were instructed to discuss each scenario for approximately five minutes.
Questionnaire
The CBCL is a widely used diagnostic screening questionnaire which contains 120 items representing typical problem behaviors of children. Parents rate each item on a 3-point scale (0: not true; 1: somewhat or sometimes true; 2: very true or often true for their child, see Achenbach & Ruffle, 2000). Items are categorized under different syndrome scales, allowing various dimensions of problematic behavior to be measured. The “internalizing” syndrome scale is a combination of the “anxious”, “withdrawn”, and “somatic complaints” syndrome scales. The “anxious” syndrome scale contains items that describe behaviors such as feeling fearful, worthless, and nervous. The “withdrawn” syndrome scale contains items related to behaviors such as feeling sad, not talking, and not enjoying things. The “somatic complaints” syndrome scale consists of specific somatic problems, such as feeling dizzy or tired. The “externalizing” syndrome scale is a combination of the “aggressive” and “rule-breaking” syndrome scales. The “aggressive” syndrome scale consists of items related to behaviors such as arguing, being mean, and destroying things, while the “rule-breaking” syndrome scale consists of items related to behaviors such as stealing and lying. We included both the “internalizing” and “externalizing” syndrome scales for an overall view, as well as the specific syndrome scales “anxious”, “withdrawn”, and “aggressive” to further differentiate between specific problems. For additional examples of items on CBCL syndrome scales we refer the reader to Achenbach and Ruffle (2000) and the CBCL manual (Achenbach & Rescorla, 2001).
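To make the scale composition concrete, the sketch below sums item responses into the narrow syndrome scales and combines them into the broad scales as described above. This is a minimal illustration only: the item numbers and item-to-scale assignments are placeholders, not the actual CBCL items.

```python
# Placeholder item-to-scale mapping; the real assignments are in the CBCL manual.
SCALE_ITEMS = {
    "anxious": [14, 29, 30],          # placeholder item numbers
    "withdrawn": [42, 65, 69],        # placeholder item numbers
    "somatic_complaints": [51, 54],   # placeholder item numbers
    "aggressive": [3, 16, 37],        # placeholder item numbers
    "rule_breaking": [26, 43, 81],    # placeholder item numbers
}

def syndrome_scores(responses):
    """responses maps item number -> parent rating (0, 1, or 2)."""
    scores = {scale: sum(responses.get(item, 0) for item in items)
              for scale, items in SCALE_ITEMS.items()}
    # Broad scales are sums of their constituent narrow scales.
    scores["internalizing"] = (scores["anxious"] + scores["withdrawn"]
                               + scores["somatic_complaints"])
    scores["externalizing"] = scores["aggressive"] + scores["rule_breaking"]
    return scores

print(syndrome_scores({14: 2, 29: 1, 42: 1}))
```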
Final Dataset
CBCL data was available for 68 of the 81 parent-child dyads; of these children, 46 were female (67%) and the mean child age was 9.37 years. Of those 68 dyads, 63 completed the conflict scenario, while all 68 completed the cooperation scenario. Since each conversation scenario resulted in one video recording per dyad, the final dataset consisted of 131 video recordings. Video recordings for the conflict scenario had an average duration of 4.67 min (SD = 0.46 min, range = 3.43–5.93 min) and video recordings for the cooperation scenario had an average duration of 4.95 min (SD = 0.47 min, range = 3.07–6.06 min).
Video and Eye Tracker Data Analysis
To extract frame-level behavioral data, each frame of each video recording was first processed using two automated facial behavior analysis toolkits: OpenFace (2.0; Baltrušaitis et al., 2018) and PyAFAR (Hinduja et al., 2023; Onal Ertugrul et al., 2024). OpenFace was used to estimate head pose and gaze direction. When tested against ground-truth head pose and gaze data collected at distances comparable to normal screen viewing, OpenFace estimated head pose with a mean absolute error of 2.60° to 3.20° and eye gaze with a mean absolute error of 9.10° (Baltrušaitis et al., 2018). In practice, this means that its head pose estimations are rather reliable, whereas its eye gaze estimations leave a lot of room for improvement, but can potentially be used with sparse stimuli (for a validation, see Valtakari et al., 2023). PyAFAR was used to detect the occurrence of facial action units (henceforth AUs), representing specific facial muscle movements that are based on the facial action coding system (Ekman & Friesen, 1978). Importantly, AUs can be related to the expression of emotions. The combined activation of the cheek raiser (AU 6) and lip corner puller (AU 12), for example, represents a genuinely felt smile (i.e., a Duchenne smile, see Ekman et al., 1990). The adult version of PyAFAR detects the activation of 12 AUs (Hinduja et al., 2023). Given the nature of our dataset and previous PyAFAR evaluations, we expect average detection performance for AUs 6, 12, 14, and 15 to be around 0.89 or slightly lower due to cross-domain differences (Onal Ertugrul et al., 2019). Higher performance is anticipated for AUs 6 and 12, and lower for AUs 14 and 15 (Hinduja et al., 2023; see also Girard et al., 2017; Zhang et al., 2016, for the details of the GFT and BP4D+ databases PyAFAR was tested on, respectively). Lastly, the gaze data recorded by the eye trackers were downsampled and matched to the video frames using their respective timestamps.
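As an illustration of this frame-level alignment step, the sketch below reads per-frame OpenFace output, a per-frame table of PyAFAR AU occurrence values, and eye-tracker samples, and matches the 120 Hz gaze samples to the 30 Hz video frames by nearest timestamp. The file names and the PyAFAR table layout are assumptions made for illustration; only the OpenFace column names (frame, timestamp, pose_Rx/Ry/Rz, gaze_angle_x/y) follow the toolkit's actual CSV output.

```python
import pandas as pd

# OpenFace 2.x writes one CSV per video with head pose (pose_Rx/Ry/Rz, radians)
# and gaze angles (gaze_angle_x/y, radians); skipinitialspace handles the
# space-padded column headers OpenFace emits.
openface = pd.read_csv("child_conflict_openface.csv", skipinitialspace=True)

# PyAFAR output is assumed here to be a per-frame table with AU occurrence
# values; the column layout is illustrative, not the toolkit's exact API.
pyafar = pd.read_csv("child_conflict_pyafar.csv")

# Eye-tracker samples recorded at 120 Hz; match each 30 Hz video frame to the
# nearest gaze sample in time (both tables assumed to share a timestamp column).
gaze = pd.read_csv("child_conflict_eyetracker.csv").sort_values("timestamp")
frames = openface.sort_values("timestamp")
frames = pd.merge_asof(frames, gaze, on="timestamp", direction="nearest")

# Combine all frame-level signals into a single table for feature construction.
frames = frames.merge(pyafar, on="frame", how="left")
```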
Feature Construction
We used the frame-level OpenFace, PyAFAR, and eye tracker data to construct two sets of features to describe observable nonverbal behaviors of the parents and children on the level of the videos. “Individual” features were constructed separately for the parent and the child, representing behaviors observable in their respective video recordings and eye tracker data. “Dyadic” features, on the other hand, were constructed using the combined data of both the parent and the child, representing synchronicity in their behavior. The complete feature construction process is explained in detail in the following two subsections.
Individual Feature Construction
For AU selection, we were inspired by the results of Girard et al. (2014), who found depression severity to be related to the decreased activation of AU 12 (the lip corner puller) and AU 15 (the lip corner depressor), and increased activation of AU 14 (the dimpler). As such, we only included AU features related to the activation of those AUs, with the addition of the Duchenne smile, consisting of the combined activation of AU 6 (the cheek raiser) and AU 12. For head orientation and gaze direction, we focused on features related to face looking.
First, for each frame, an AU was labeled as active if its occurrence value was at least 0.5 and as inactive otherwise (following Bilalpur et al., 2023), while head orientation and gaze direction were labeled as either forward (i.e., directed toward the other person’s face) or averted, based on their deviation from a baseline defined by the median head and gaze angles of the first 100 frames of each recording.
Second, we grouped together consecutive frames for which an AU was labeled either active or inactive or head orientation or gaze angle was labeled either forward or averted to form episodes spanning multiple frames. Episodes of only one frame were filtered out and surrounding episodes were merged. These episodes were computed for all included AUs (active/inactive), head orientation (forward/averted), and gaze direction (forward/averted). Episode durations were determined using the frame timestamps described earlier in the method section.
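A minimal sketch of this episode construction is given below, assuming a per-frame boolean label (e.g., AU 12 active/inactive or head forward/averted) and the frame timestamps described above; the handling of single-frame episodes at the very start or end of a recording is simplified here.

```python
import numpy as np

def episodes(labels, timestamps):
    """Group consecutive frames with the same label into episodes.

    Interior single-frame episodes are removed by flipping their label,
    which merges the two surrounding episodes. Returns a list of
    (label, start_time, end_time) tuples.
    """
    labels = np.asarray(labels).copy()
    for i in range(1, len(labels) - 1):
        # A frame whose label differs from both neighbours forms a
        # one-frame episode; flip it so the neighbours merge.
        if labels[i] != labels[i - 1] and labels[i] != labels[i + 1]:
            labels[i] = labels[i - 1]
    out, start = [], 0
    for i in range(1, len(labels) + 1):
        # Close an episode when the label changes or the sequence ends.
        if i == len(labels) or labels[i] != labels[start]:
            out.append((bool(labels[start]), timestamps[start], timestamps[i - 1]))
            start = i
    return out

# Example: AU 12 activity over eight frames recorded at roughly 30 Hz.
au12_active = [0, 1, 1, 0, 1, 1, 1, 0]
times = [i / 30 for i in range(8)]
print(episodes(au12_active, times))
```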
Table 1. A list of all individual features. The table lists the features that were computed using data derived from the individual video recordings of each participant, meaning that all these features were created separately for both the parent and the child.
Dyadic Feature Construction
Table 2. A list of all dyadic features. The table lists the features that were computed using the combined data of the parent and the child.
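Because the dyadic features combine the parent's and the child's frame-level labels, a brief sketch of one such feature is given below: the average duration of episodes in which only the parent shows a behavior (e.g., the parent-only Duchenne smiling referenced in the Results). This is an illustration under stated assumptions: the function name is hypothetical, and a constant frame period is used here instead of the frame timestamps used in the study.

```python
import numpy as np

def parent_only_episode_durations(parent_active, child_active, frame_period=1 / 30):
    """Average duration (s) of episodes where the parent shows a behavior
    (e.g., a Duchenne smile) while the child does not.

    parent_active, child_active: per-frame boolean arrays aligned in time.
    """
    parent_only = np.asarray(parent_active, bool) & ~np.asarray(child_active, bool)
    durations, run = [], 0
    for flag in parent_only:
        if flag:
            run += 1            # extend the current parent-only run
        elif run:
            durations.append(run * frame_period)
            run = 0
    if run:                     # close a run that reaches the end of the video
        durations.append(run * frame_period)
    return float(np.mean(durations)) if durations else 0.0
```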
Binary Classification, Model Evaluation, and Model Explanation
To perform binary classification, the CBCL syndrome scale scores first needed to be reduced to binary classes. Given that our sample consisted of nonreferred children participating in a large cohort study, the raw CBCL syndrome scale scores were far from the diagnostic ranges. Therefore, we based our binary classes on the raw mean scores of the nonreferred sample as presented in the CBCL manual (Achenbach & Rescorla, 2001). Taking gender into account has been shown to improve classification performance, at least for depression detection (Alghowinem et al., 2018; Maddage et al., 2009; Pampouchidou et al., 2016; Stratou et al., 2015; Yang et al., 2016). Due to the relatively limited number of boys in the final sample, we did not train classifiers separately for boys versus girls. However, gender was considered in binary class assignment: if a child’s score on a syndrome scale was above the mean of their respective age group and gender, the child was labeled as belonging to the positive class (i.e., the class “1”) on that syndrome scale, and to the negative class (“0”) otherwise. The raw CBCL scores and the class cutoff thresholds are presented in Figure 1.
Figure 1. Raw CBCL scores.
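A sketch of this binarization step follows, under stated assumptions: NORM_MEANS holds the gender-specific (and, where applicable, age-group-specific) raw mean scores looked up from the CBCL manual, and the values shown are placeholders rather than the manual's actual norms.

```python
# Placeholder norm means per (syndrome scale, gender); the real values come
# from the CBCL manual (Achenbach & Rescorla, 2001) and are not reproduced here.
NORM_MEANS = {
    ("anxious", "girl"): 2.6,
    ("anxious", "boy"): 2.4,
    ("withdrawn", "girl"): 1.3,
    ("withdrawn", "boy"): 1.4,
}

def binary_class(raw_score, scale, gender):
    """Return 1 if the child's raw score exceeds the norm mean for their
    gender on the given syndrome scale, and 0 otherwise."""
    return int(raw_score > NORM_MEANS[(scale, gender)])

print(binary_class(3.0, "anxious", "girl"))  # -> 1
```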
Next, using features from each dyadic interaction, we trained and tested XGBoost classifiers (Chen & Guestrin, 2016) for the binary CBCL syndrome classes using stratified four-fold cross-validation. XGBoost was chosen due to its good performance even with slightly imbalanced datasets (i.e., with at least 25% positive classes, see Velarde et al., 2023) and its ease of use. The classifier was trained and tested separately for three feature combinations and three scenario combinations. The three feature combinations were (1) individual parent and individual child features, (2) dyadic features, and (3) individual+dyadic features. The three scenario combinations were (1) conflict, (2) cooperation, and (3) conflict+cooperation. As the combination with both scenarios together included data extracted from two separate video recordings for most participants, we ensured that all the data for any unique dyad was only in either the training or the testing dataset for any given fold. Using speaker diarization achieved with pyannote.audio (Bredin, 2023; Plaquet & Bredin, 2023), we further trained and tested the classifier separately with features built on data extracted during moments when the children were speaking and during moments when the children were not speaking. The resulting classifier models were evaluated using the area under the receiver operating characteristic curve (ROC AUC), a performance metric that is not affected by imbalanced datasets (Jeni et al., 2013), after which the top contributing features for the best-performing models were assessed using SHapley Additive exPlanations (SHAP) analysis (Lundberg & Lee, 2017).
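The sketch below illustrates such an evaluation loop under stated assumptions: X is a NumPy feature matrix with one row per recording, y the binary syndrome class, and groups a dyad identifier used so that no dyad's recordings end up in both the training and test split of a fold (here via scikit-learn's StratifiedGroupKFold; the study's exact fold construction may differ). Hyperparameters are left at XGBoost defaults for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

def evaluate(X, y, groups, n_splits=4, seed=0):
    """Stratified, dyad-grouped cross-validation returning mean and SD ROC AUC."""
    cv = StratifiedGroupKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    aucs = []
    for train_idx, test_idx in cv.split(X, y, groups):
        model = XGBClassifier(eval_metric="logloss", random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        # Score the held-out recordings with the predicted positive-class probability.
        scores = model.predict_proba(X[test_idx])[:, 1]
        aucs.append(roc_auc_score(y[test_idx], scores))
    return np.mean(aucs), np.std(aucs)
```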
Finally, to gain further insights into which features contributed most to classifier performance, we circled back to the distributions of those features to examine differences and interactions between symptom severity and the two conversation scenarios. An illustration of the analysis pipeline is presented in Figure 2.
Figure 2. The analysis pipeline.
Results
How did the Classifier Perform?
Table 3. Classifier performance. The table depicts the positive class percentage and the performance of the classifiers for each syndrome scale and all the feature and scenario combinations. Values in the table represent the mean (standard deviation in parentheses) ROC AUC across all folds. Bold text indicates that the model achieved non-failure performance, meaning an ROC AUC score of at least 0.60.

Figure 3. Classifier performance.
Classifier Performance During and Outside Episodes of Child Speech
Next, we examined classifier performance separately with features extracted during and outside episodes of child speech. As Figure 4 shows, classifier performance for features extracted during episodes of child speech was overall quite similar to the general results presented in Table 3 and Figure 3, with two notable exceptions: First, the best-performing “anxious” syndrome scale classifier (conflict scenario, all features, see Figure 3C) now performed at chance level (M = 0.49, SD = 0.12; see Figure 4C). Second, there was a marked increase in performance on the “withdrawn” syndrome scale for the cooperation scenario using individual features only (M = 0.67, SD = 0.12; see Figure 4A), a classifier that previously performed at chance level (see Figure 3A).
Figure 4. Classifier performance during episodes of child speech.
When examining classifier performance outside episodes of child speech (see Figure 5), overall performance was highly similar to the general results presented in Table 3 and Figure 3. Thus, it seems that the nonverbal behaviors that contributed to the prediction of anxiety mainly presented themselves when the children were not speaking, while the behaviors most important to withdrawal became somewhat more evident under different circumstances, namely during episodes of child speech in discussions about a potentially cooperative topic. Not much can be concluded about the predictions for the “aggressive”, “externalizing”, and “internalizing” syndrome scales, as performance on them was similar both during and outside episodes of child speech.
Figure 5. Classifier performance outside episodes of child speech.
Which Features Contributed Most to Classifier Performance?
To shed light on which features contributed most to classifier performance, we computed the top 10 features of the best classifier for the “anxious” syndrome scale (conflict scenario, individual and dyadic features, see Figure 3C) and the best classifier for the “withdrawn” syndrome scale (cooperation scenario, individual features, during child speech only, see Figure 4A) by averaging all features’ absolute SHAP values across all four folds. Children’s anxiety scores were best predicted from features related to nonreciprocal parent smiling, child head orientation, and parent head orientation/gaze direction (Figure 6A). This is not surprising, as anxiety has been linked to specific patterns of gaze behavior during dyadic interaction (Hessels et al., 2018; Kleberg et al., 2017; Wieser et al., 2009) and panic disorder severity has been linked to more frequent dyadic patterns of facial affective behavior between patients and therapists (Benecke & Krause, 2007). The individual features that contributed most to predicting withdrawal were related to parent head orientation/gaze direction, child head orientation/gaze direction, child dimpling, child smiling, and parent lip corner depression (Figure 6B). Such behaviors were also expected, as differing patterns of head movements, smiling, and dimpling have all been implicated in relation to depression (see e.g., Bilalpur et al., 2023; Peham et al., 2015).
Figure 6. Feature contributions assessed with SHAP analysis.
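A sketch of how such a ranking could be computed is given below, assuming the per-fold models and held-out feature matrices were kept from the cross-validation step (fold_models, fold_test_sets, and feature_names are hypothetical names): absolute SHAP values are averaged within each fold and then across folds.

```python
import numpy as np
import shap

def top_features(fold_models, fold_test_sets, feature_names, k=10):
    """Rank features by mean absolute SHAP value averaged across folds."""
    importance = np.zeros(len(feature_names))
    for model, X_test in zip(fold_models, fold_test_sets):
        # For a binary XGBoost model, shap_values is an (n_samples, n_features) array.
        shap_values = shap.TreeExplainer(model).shap_values(X_test)
        importance += np.abs(shap_values).mean(axis=0)
    importance /= len(fold_models)
    order = np.argsort(importance)[::-1][:k]
    return [(feature_names[i], importance[i]) for i in order]
```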
From Binary Classification Back to Nonverbal Behavior
As we had now identified one classifier which reliably predicted whether children fell above or below average (henceforth “high” and “low”, respectively) on the “anxious” syndrome scale, we circled back to the level of the features to examine whether we could observe differences in the children’s nonverbal behaviors depending on their anxiety levels and the topic of conversation. We limited our investigation to features with an average SHAP value that was greater than 0.2 (i.e., the top four features). This investigation yielded three interesting patterns (see Supplementary S2 for complete descriptive statistics).
First, as illustrated in Figure 7A, parents whose children scored low in anxiety demonstrated shorter episodes of parent-only Duchenne smiling in the conflict scenario than parents whose children scored high in anxiety. There are multiple possible explanations for this finding. For example, it could be that parents whose children were more anxious engaged in longer periods of Duchenne smiling in the conflict scenario in general, or it could be that the children themselves engaged in shorter periods of Duchenne smiling in the conflict scenario, resulting in shorter joint episodes. To investigate this further, we examined the average duration of the episodes in which parent-child dyads engaged in joint Duchenne smiling as well as the average duration of the episodes in which parents engaged in Duchenne smiling in general (see Supplementary S2). We did not find the average duration of joint Duchenne smiling between parents and children in the conflict scenario to depend on the child’s anxiety level, but we did find parents whose children were more anxious to engage in more overall Duchenne smiling during the conflict scenario, thus favoring the former of the proposed explanations. This suggests that parents with children who scored high in anxiety engaged in more parent-only Duchenne smiling for reasons unrelated to the amount of Duchenne smiling demonstrated by their child.
Figure 7. Children’s behaviors as a function of anxiety level and conversation scenario.
Second, children who scored high in anxiety performed many more horizontal head shakes than children who scored low in anxiety, in both the conflict and the cooperation scenario, suggesting that how much children shook their heads was influenced by their anxiety level and not by the topic of conversation (see Figure 7B).
Third, as Figure 7C shows, children with a low anxiety score oriented their head forward for much longer average periods in the conflict scenario than children with a high anxiety score. Although a similar pattern was observed for the cooperation scenario, the greater overlap between the error bars indicates that the difference was likely not significant. It therefore seems that children with a high anxiety level averted their head more, but only when the topic being discussed was potentially conflicting.
Next, given the marked increase in performance on the “withdrawn” syndrome scale that was observed for the cooperation scenario during episodes of child speech, we conducted a similar investigation on its important features (see Supplementary S3 for complete descriptive statistics). Three additional interesting patterns were observed: (1) Parents whose children scored low in withdrawal gazed forward more than parents whose children scored high in withdrawal (Figure 8A). This was true for both the conflict scenario and the cooperation scenario. (2) Children low in withdrawal gazed forward more than children high in withdrawal (Figure 8B) in both the conflict and the cooperation scenario. (3) Children low in withdrawal demonstrated more AU 12 (the lip corner puller) activation than children high in withdrawal (Figure 8C), but only in the cooperation scenario; the opposite was true for the conflict scenario. As the complete descriptive statistics suggest (see Supplementary S3), withdrawal symptoms were best characterized by a reduction in behavior, which was generally either roughly equal between scenarios or greater for the cooperation scenario.
Figure 8. Children’s behaviors as a function of withdrawal level and conversation scenario.
Discussion
Our aim was to automatically assess the severity of children’s behavioral, emotional, and social problems from videos of parent-child interaction recorded under two scenarios: a conversation about a potentially conflicting topic and a conversation with a cooperative nature. To do this, we first quantified a wide range of nonverbal behaviors from the face and head dynamics of parents and children individually as well as on the level of the dyad (e.g., joint smiling). Next, we fed these nonverbal behaviors to a binary classifier to make broad estimations of symptom severity (i.e., whether a child’s symptoms were above or below average) on the more general Child Behavior Checklist (CBCL) syndrome scales “internalizing” and “externalizing” as well as the narrower syndrome scales “anxious”, “withdrawn”, and “aggressive”. Reliable predictions were achieved only on the “anxious” syndrome scale and specifically for the conflict scenario.
What might explain such results? The “anxious” syndrome scale is composed of items mainly related to the increase of negative affect, such as being fearful, as well as items that identify crying and nervous behavior. Perhaps the conflict scenario was more likely to elicit such behaviors than the cooperation scenario. In line with this idea, Thomas et al. (2017) assigned adolescents to either a conflict or a control task and found that adolescents assigned to the conflict task experienced greater levels of arousal than adolescents assigned to the control task, and that adolescents with higher baseline levels of conflict, as estimated through an interview, were also more likely to exhibit more hostile behavior, but only in the conflict task. Conversely, the “withdrawn” syndrome scale contains items mainly related to the reduction of positive affect, such as withdrawal from interactions, shyness, and feeling a lack of energy. One might thus expect withdrawn behavior to be more observable during moments that elicit an increase in positive affect. Our results were partly in favor of this, as we did observe almost fair performance for the “withdrawn” classifier under the cooperation scenario but not under the conflict scenario. Notably, however, this was true only when the classifier was trained using features extracted during episodes of child speech, suggesting that withdrawn behavior may not have been fully reflected during all moments of the interaction. The dichotomy between the presentation of anxious and withdrawn behavior (i.e., an increase in negative affect versus a reduction in positive affect) was also reflected in our results: smile-related behaviors were not included in the top 10 features of the best “anxious” syndrome scale classifier but were included in the top 10 features of the best “withdrawn” syndrome scale classifier. Ultimately, however, prediction performance for the “withdrawn” syndrome scale was, even at its best, still poor. A potential explanation for this could be that the cooperation scenario did not elicit enough positive affect for reductions of positive affect in certain children to become noticeable. Future research should find reliable ways to define and operationalize perceived levels of conflict and cooperation.
Prediction performance for the “externalizing” and “aggressive” syndrome scales was below poor for all tested combinations. It could be that the behaviors we quantified were not sufficiently representative of the “externalizing” and “aggressive” syndrome scales, which consist of behaviors that are not often characterized by specific facial expressions (e.g., stealing, swearing, breaking rules, being mean, arguing, fighting, screaming, and so on). Another possible explanation is that the context of the present study was not extreme enough for externalizing behaviors to occur. Such extreme behaviors may require more extreme contexts. Sampling bias may also have factored in, as children exhibiting more externalizing symptoms may have been less likely to participate (the experiment was done at the end of a long day of varied parent-child observations; Holleman et al., 2021). To add to this, there was limited representation of the clinical ranges for all syndrome scales. Small differences between syndrome scale scores may not be reflected in observable differences in the behaviors they are characterized by. Such effects may also vary across syndrome scales. For example, differences in symptom severity on the “anxious” syndrome scale may have been more observable in the behavior of the children than differences in symptom severity on the other syndrome scales.
What Can Automatic Assessment Methods Teach Us About Child Psychopathology?
A reliable automatic assessment model could potentially be used as a tool to aid in the screening and diagnosis of mental disorders. This can be advantageous, especially if what the model predicts allows clinicians or researchers to save time and money (e.g., by eliminating the need to conduct extensive clinical interviews or manually annotate large streams of data). The present study is the first we are aware of to attempt to automatically assess the internalizing and externalizing behavior of children. While our models may not yet be good enough to use as diagnostic or screening tools, our results provide initial evidence to suggest that making broad predictions of children’s anxiety levels from parent-child video recordings is possible. In addition to this, they have allowed us to identify some key behaviors.
First, parents whose children scored above average on anxiety demonstrated longer average episodes of parent-only Duchenne smiling (i.e., episodes during which the parent engaged in Duchenne smiling but the child did not) compared to parents whose children scored below average on anxiety, but only in the conflict scenario. In general, smiles have many social outcomes (see Gunnery & Hall, 2015, for a review). For example, people smiling in pictures are perceived as more attractive and kinder than people who are not (Otta et al., 1996), and people are more likely to trust a photograph of a person when playing a trust game if the person is smiling (Scharlemann et al., 2001). Duchenne smiles, in particular, seem to generate a sense of trustworthiness in the observer, especially when issues regarding trust or cooperation are made salient (Johnston et al., 2010). Perhaps parents were sensitive to the anxiousness of their children and engaged in longer periods of Duchenne smiling in an attempt to appear more trustworthy or cooperative during the conflict scenario. Relatedly, a recent study found increased maternal mental health challenges to be associated with an increased onset amplitude, onset duration, offset duration, and total duration of mothers’ smiles (Dust et al., 2024). Perhaps the anxiety levels of children were also reflected in the anxiousness or discomfort experienced by their parent, thus resulting in longer bouts of Duchenne smiling. Note, however, that the difference in the average duration of parent-only Duchenne smiles between parents with children scoring high on anxious behavior and parents with children scoring low on anxious behavior was small (0.12 s) and should be interpreted with caution. An interesting line of future inquiry would be to examine whether this difference might increase when examining the nonverbal behaviors of more distinct groups with greater variability between children’s symptom scores. Moreover, while parent-only Duchenne smiling is a behavior demonstrated by the parent, it occurs in the context of parent-child interaction. Future research might investigate its underlying motivations further. For example, do parents tend to smile more in general when the topic of discussion involves potential conflict, or do they do so to adapt to the behavior of their child?
Second, children with more anxiety symptoms shook their head more than children with less anxiety symptoms, and this was true for both the conflict and the cooperation scenario. Shaking one’s head from side to side, a behavior learned in early childhood, is often used to signal disapproval in many human cultures and even among some non-human primates (Bross, 2020) but has many other uses as well (see Kendon, 2002). Perhaps children who scored high on the “anxious” syndrome scale were more likely to signal disapproval, regardless of whether they were planning a party or resolving a conflict together with their parent.
Third, children who scored high in anxiety oriented their head toward their parent’s face when resolving a conflict for shorter average durations than children who scored low in anxiety. In other words, it appears that children’s higher anxiety levels were associated with more head avoidance when resolving a conflict. Increased head avoidance on its own is not surprising, as traits related to social anxiety disorder have been found to be related to less looking at others’ eyes (Hessels et al., 2018; Weeks et al., 2019). Crucially, however, the fact that increased head avoidance mainly presented during the conflict scenario suggests that it is not globally related to anxious behavior but is dependent on the context of the interaction.
Children’s withdrawal symptoms, on the other hand, were best reflected by gaze behavior. Specifically, parents whose children scored low in withdrawal looked more at their child’s face during both the conflict and the cooperation scenario than parents whose children scored high in withdrawal, and children who scored high in withdrawal looked at their parent’s face fewer times than children who scored low in withdrawal. Overall, this was a similar pattern to that observed in the nonverbal behaviors related to children’s anxiety scores, suggesting that gaze avoidance plays a role in both anxiousness and withdrawal. The CBCL manual considers both syndrome scales to also represent depressed behavior, so finding similarities between the behaviors they are characterized by is not surprising. We also found children who scored high in withdrawal to smile less than children who scored low in withdrawal in the cooperation scenario, while the opposite was true for the conflict scenario. Interestingly, withdrawal was predominantly characterized by reduced presentation, and this reduction was either equal between scenarios or greater for the cooperation scenario. Behaviors representing anxiousness, on the other hand, were characterized by both reduced and increased presentation, and this was more pronounced for the conflict scenario than the cooperation scenario, further reinforcing the idea that context dictates the presentation of many symptomatic behaviors.
Lastly, we identified some more general aspects of behavior that contributed to classifier performance. The classifier for anxious behavior performed best when it was trained using both individual and dyadic features. The set of AUs included in the present study was motivated by the results of Girard et al. (2014), who found the symptom severity of depressed individuals to be associated with decreased activation of the lip corner puller and lip corner depressor as well as increased activation of the dimpler, and argued the results to suggest that individuals engage in specific patterns of facial behavior to maintain or increase interpersonal distance. Given our results, it may be that these facial behaviors serve as indicators of anxiety symptoms in children as well, and it moreover seems that they are not only important on the individual levels of the child and the parent but also on the dyadic level of the interaction.
It is moreover worth noting that the nonverbal behaviors that related to children’s anxiety scores were mainly observable outside of child speech (and therefore likely also involving moments of parent speech). Thus, the nonverbal behaviors of parents and children that relate to children’s anxiety levels may depend on conversational roles: parent behaviors may become more apparent during moments of speech while child behaviors may become more apparent during moments when they are listening and reacting to their parent’s speech. The conversational roles in the present study differed between the two conversation scenarios; as Holleman et al. (2021) summarize, parents spoke more in the conflict scenario than they did in the cooperation scenario while the opposite was true for children. Moreover, parents spoke more than children overall, and there was more silence during the conflict scenario than there was in the cooperation scenario. As such, the roles of the parent and child may have been more equal when, e.g., planning a party in the cooperation scenario, while the parent may have taken the lead more when resolving a conflict (Holleman et al., 2021). Differences in conversational roles between the scenarios may explain why behaviors related to anxiety symptoms became more apparent during the conflict scenario. This raises the question of whether the inclusion of verbal behaviors, particularly on the parent’s side, may have improved our predictions. This idea is partly supported by the results of Halfon et al. (2021), who found that affect information from the text modality, representing the speech content of play therapy sessions, proved to be much more useful for the prediction of children’s anger, anxiety, and sadness levels than facial affect information. To further explore the interaction between nonverbal and verbal behaviors in relation to child psychopathology, future research should aim to include representative behaviors from multiple modalities.
Challenges, Limitations, and Suggestions for Future Research
An important challenge in the automatic assessment of mental disorders from video recordings lies in the quantification of behaviors. Behaviors often need to be quantified on the level of the videos, meaning that the temporal dynamics and interactive nature of behaviors are easily lost. There exist a wide range of computations that one can perform on a single behavior, and through feature engineering one may reduce one’s dataset to features that contribute most to one’s model to improve its performance. However, once complex behaviors are represented by several different variables that all describe the same behavior but in different ways, the actual contributions of those behaviors can become difficult to identify. We attempted to keep our behavioral descriptions simple and easily explainable. This, however, may have contributed to us not finding links between nonverbal behaviors and symptom severity on most of the syndrome scales, highlighting the delicate balance between performance and explainability, which often differs between research fields. In social and affective computing, for example, one may wish to optimize on prediction performance, while in psychology, one may wish to optimize on explainability.
One major limitation of the study is related to the nature of the recording setup, which was designed to explore fine-grained details in gaze behavior. This was optimal for the analysis of face and head dynamics but rendered the analysis of full body posture and hand gestures impossible. The inclusion of additional features to describe the various body and limb movements people make when interacting with each other is likely to yield valuable information and will be incorporated in our future work.
Still other limitations lie in the constraints of the automatic analysis tools we used. PyAFAR, the tool we used to detect the occurrence of facial muscle movements, detects some movements better than others (Hinduja et al., 2023). It may be that the facial expressions that better characterize the other syndrome scales were more reliant on the facial muscle movements that are not detected well in the first place. It is also important to note that PyAFAR, as many other computer vision and machine learning models, was trained on adult participants and may generalize less well to children. The PyAFAR adult model has been shown to generalize less well (although still at acceptable levels) to infant data than Infant AFAR, a PyAFAR model fine-tuned on infant faces (Onal Ertugrul et al., 2023). However, since the children in our dataset were much older than infants, better generalization can be expected. Also, like Bilalpur et al. (2023), we set the threshold for AUs to be labeled as active if their occurrence value was at least 0.5. Given that AUs have different occurrence rates in general, each AU is likely to have a different optimal occurrence threshold. Future studies may benefit from empirically determined AU thresholds.
Finally, defining a baseline for head and gaze avoidance using data provided by OpenFace was not straightforward. Our baseline for facing or looking at the other person was defined using the head and gaze angles of periods during which we knew that both parents and children were likely to look at each other’s face (i.e., the median head/gaze angles of the first 100 frames of the video recordings). However, when observing the recordings, we noticed that in some cases parents and especially children moved their heads around quite a bit during the recordings. Although OpenFace reports head angle and gaze angle with respect to the camera, its estimates seem to be rather sensitive to changes in head position (and its gaze estimates are rather inaccurate, see Valtakari et al., 2023). This may have resulted in errors identifying episodes of head or gaze avoidance.
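For illustration, a sketch of this baseline approach is given below: yaw and pitch are per-frame head (or gaze) angles in radians as reported by OpenFace, the baseline is the median over the first 100 frames, and frames deviating from it by more than a threshold are labeled averted. The 15° threshold is an assumption made for illustration; the threshold actually used in the study is not specified here.

```python
import numpy as np

def label_forward(yaw, pitch, n_baseline=100, threshold_deg=15.0):
    """Label frames as forward (True) or averted (False) relative to a
    per-recording baseline.

    yaw, pitch: per-frame angles in radians (e.g., OpenFace pose_Ry/pose_Rx
    or gaze_angle_x/gaze_angle_y). The threshold value is illustrative only.
    """
    yaw, pitch = np.asarray(yaw, float), np.asarray(pitch, float)
    # Baseline: median angles over the first frames, when both participants
    # were likely looking at each other's face.
    base_yaw = np.median(yaw[:n_baseline])
    base_pitch = np.median(pitch[:n_baseline])
    # Combined angular deviation from the baseline, converted to degrees.
    deviation = np.degrees(np.hypot(yaw - base_yaw, pitch - base_pitch))
    return deviation <= threshold_deg
```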
Conclusion
Applying machine learning and computer vision to automatically assess mental disorders from videos of parent-child interaction shows promise. It seems that at least anxiety symptoms are represented by the nonverbal behaviors of these interactions, particularly those pertaining to genuinely felt smiles on the side of the parent. The behavioral representations of such complex disorders are likely to vary greatly between individuals and between specific groups, such as adults and children. As our results demonstrate, they are also likely to vary depending on the context of the interaction, and many behaviors need to be examined in their interactive contexts. It is important to generate further research to identify such behaviors and the specific contexts they occur in.
Ethical Considerations
The study was approved by the Medical Research Ethics Committee of the University Medical Center Utrecht and is registered under protocol number 19–051/M.
Consent to Participate
All parents provided written informed consent for themselves and on behalf of their children.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded by a Utrecht University Dynamics of Youth (DoY) invigoration grant.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
Supplemental Material
Supplemental material for this article is available online.