Abstract
Accurately measuring changes in core autism symptoms following early intervention is challenging. The Brief Observation of Social Communication Change (BOSCC) is a promising tool for assessing social interactions in preschoolers with autism spectrum disorder (ASD). However, data on its responsiveness remain limited, warranting validation to support its use in evaluating treatment-related changes. This study aimed to assess the reliability, validity, and sensitivity to change of the BOSCC based on international recommendations from the COSMIN expert group. The BOSCC was rated using 414 video observations from a large multicenter randomized controlled trial including 177 preschoolers with ASD between 19 and 36 months. Videos were coded using the original BOSCC protocol by trained, blinded raters. Analyses addressed reliability, structural and convergent validity, and responsiveness over a 2-year follow-up. Interrater, intrarater, and test–retest reliability were consistently high, with good internal consistency (Cronbach’s α = 0.88–0.98). Factor analysis supported a three-factor structure. Convergent validity was modest with Autism Diagnostic Observation Schedule (ADOS) change scores (r = 0.05–0.20) but stronger with global ratings of improvement on the Clinical Global Impression–Improvement (CGI-I; r = 0.40–0.60, p < 0.001). ROC analyses confirmed acceptable to good responsiveness when anchored to the CGI-I, but poor discrimination relative to the ADOS. The BOSCC is a reliable and responsive measure of change in preschoolers with ASD. Its naturalistic format supports its use as an outcome measure in early intervention trials. Establishing thresholds for clinically meaningful change will be a critical next step for both research and clinical practice.
Lay Abstract
Early intervention may improve social interaction and communication in young children with autism spectrum disorder (ASD). However, it is often difficult to measure changes in core autism symptoms over time. Many commonly used assessment tools were developed for diagnosis and are not always sensitive to treatment-related change. The Brief Observation of Social Communication Change (BOSCC) was specifically designed to address this gap by using short, naturalistic observations of children’s social communication. This study examined how well the BOSCC works as a tool to capture change. We assessed its reliability (whether it gives consistent results), validity (whether it measures what it is intended to measure), and responsiveness (whether it can detect change over time). The study included 414 video observations from a large multicenter randomized controlled trial involving 177 preschool children with ASD, aged 19–36 months, followed up for over a 2-year period. Videos were rated by trained, independent observers using the original BOSCC coding system. Results showed that BOSCC is a highly reliable measure, with strong agreement between different raters and good consistency over time. The structure of the scale was supported, and BOSCC scores were meaningfully related to clinicians’ overall judgments of improvement. In contrast, changes in BOSCC scores were less closely related to changes measured by standard diagnostic tools. Overall, these findings support the BOSCC as a useful and sensitive outcome measure for evaluating change in early autism interventions. Future research should define thresholds for clinically meaningful change to strengthen its use in research, clinical practice, and service evaluation.
Introduction
Accurately measuring changes in the core symptoms of autism spectrum disorder (ASD) remains a key challenge in both clinical and research contexts. This challenge stems from the inherently complex nature of ASD core features, which include impairments in social communication and the presence of restricted and repetitive behaviors (RRBs). These features are highly heterogeneous across individuals, may evolve substantially over time, and can influence adaptive functioning in ways that are strongly dependent on environmental contexts (American Psychiatric Association, 2022; Sandbank et al., 2024; Zeidan et al., 2022). Early intervention with nonpharmacological approaches is recommended to improve child development. However, despite the increasing number of early intervention studies, there remains no clear consensus on which outcome measures best capture clinically meaningful changes in ASD (Bal et al., 2019). As highlighted by McConachie et al. (2015), numerous measurement tools have been developed; however, most provide limited evidence regarding their reliability in measuring responsiveness. Some instruments, such as the ADOS, originally developed for diagnostic purposes and not specifically designed to detect change, have nevertheless been widely used in intervention trials, and empirical evidence regarding their responsiveness should be acknowledged when assessing their ability to capture meaningful changes in autistic symptoms (Carruthers et al., 2023; Green et al., 2022; Pickles et al., 2016). Outcome measures, such as the Clinical Global Impression–Improvement (CGI-I) scale, have also been used (Bearss et al., 2015; Siafis et al., 2020; Toolan et al., 2022) and have shown some sensitivity to changes over time in core autistic behaviors (Choque Olsson & Bölte, 2014).
As the validity and responsiveness of most existing instruments remain subject to debate, the use of complementary measures is essential, alongside careful evaluation of their psychometric properties. It is within this framework that Catherine Lord’s team developed the Brief Observation of Social Communication Change (BOSCC), a measure specifically developed to capture change in a naturalistic context for the child. The BOSCC is a filmed semistructured interaction using a dedicated protocol and materials. Conceptually, the BOSCC was designed to provide a reproducible and objective assessment of treatment-related changes across three behavioral domains: social communication, RRBs, and nonspecific behaviors. Its semistructured format is grounded in the same rationale as the ADOS: Structured but naturalistic interactions create comparable contexts across participants while still eliciting spontaneous behaviors, thereby increasing the likelihood of detecting meaningful change over time. Building on the ADOS-2 algorithm with an expanded coding scheme, the BOSCC aims to capture subtle shifts in behavior with greater sensitivity. Although highlighted as promising in McConachie’s review, evidence regarding its sensitivity to change remains mixed. According to COSMIN expert group recommendations (Mokkink et al., 2006), the psychometric properties of a measurement tool refer to its validity, reliability, and responsiveness, reflecting its ability to accurately, consistently, and sensitively assess a given construct and its evolution. The BOSCC has demonstrated robust interrater and test–retest reliability (Carruthers et al., 2021; Grzadzinski et al., 2016; Kim et al., 2019; Kitzerow et al., 2016; Nordahl-Hansen et al., 2016; Pijl et al., 2018), but evidence regarding its validity and responsiveness remains limited and mixed.
Some studies report modest but significant effect sizes and suggest greater sensitivity than the ADOS in detecting changes in social communication (Grzadzinski et al., 2016; Kim et al., 2019; Pijl et al., 2018), whereas others find no clear advantage over established instruments such as the Social Responsiveness Scale (SRS), Social Communication Questionnaire (SCQ), or ADOS (Carruthers et al., 2021; Kitzerow et al., 2016). Divergent results have also been observed when comparing the BOSCC to other outcome measures, such as joint engagement (Nordahl-Hansen et al., 2016), with the BOSCC sometimes failing to differentiate intervention and control groups. These discrepancies may reflect variations in sample characteristics (e.g. age, verbal level), study design (observational vs randomized controlled trial [RCT]), BOSCC versions used, or differences in rater training and adherence to the protocol. Taken together, these findings highlight both the potential of the BOSCC to capture certain changes in social communication and the need for further research to clarify its validity and responsiveness. The current study aims to contribute to this ongoing evaluation. For details of previous studies on the BOSCC’s psychometric properties, including populations, versions, objectives, and results, see Table S1 in the Supplemental Material.
The present study aimed to address these gaps by evaluating the psychometric properties of the original BOSCC protocol in a large cohort of nonverbal preschoolers enrolled in an RCT. Specifically, we sought to assess its reliability, convergent and structural validity, and responsiveness to change over a 2-year period, comparing it with both the ADOS and CGI-I. In addition, we explored the influence of age and of verbal and nonverbal developmental quotients (DQ) on the reliability of BOSCC ratings. To our knowledge, this is the first large-scale study to assess the psychometric properties of the BOSCC in a French-speaking context.
Objectives and Hypotheses
The objective of this study was to evaluate the psychometric properties of the original BOSCC protocol in a large cohort of nonverbal preschoolers enrolled in an RCT. We examined its reliability, construct and convergent validity, and responsiveness to clinical change and conducted complementary responsiveness analyses of its associations with broader developmental outcomes.
Based on prior literature, we formulated the following hypotheses:
Reliability: The BOSCC would demonstrate strong reliability in the full sample and within subgroups defined by age and DQ, including high interrater, intrarater, and test–retest agreement (one-way random-effects intraclass correlation coefficient [ICC] ⩾ 0.75), as well as adequate internal consistency within domains (Cronbach’s α ⩾ 0.70).
Construct validity: The BOSCC would show adequate construct validity, with a three-factor structure reflecting its conceptual domains and acceptable item performance, including floor and ceiling effects below 15%.
Convergent validity: BOSCC scores would correlate moderately with ADOS scores (r ⩾ 0.30), supporting convergent validity with an established measure of autism symptom severity.
Responsiveness: The BOSCC would demonstrate acceptable responsiveness to clinical change when anchored to CGI-I ratings (area under the curve [AUC] ⩾ 0.70) and moderate correlations between BOSCC change scores and CGI-I (r ⩾ 0.30). Together, these indicators were expected to show greater sensitivity to change than the ADOS, which has not demonstrated comparable responsiveness when anchored to CGI-I.
Complementary responsiveness analyses would reveal meaningful associations between changes in BOSCC scores and changes in broader developmental domains, including cognitive functioning (Mullen Scales of Early Learning [MSEL]) and adaptive behavior (Vineland Adaptive Behavior Scales, 2nd Edition [VABS-2]).
Method
Dataset and Participants
To assess the psychometric properties of the BOSCC, we repurposed individual patient data from a multicenter RCT evaluating the Early Start Denver Model in preschoolers with ASD, conducted between 2015 and 2021 across five sites in France and Belgium. Data from one child were excluded from the reanalysis due to parental opposition to their reuse. The children were referred by community health professionals. Inclusion criteria were as follows: (a) age between 15 months and 36 months and 30 days; (b) diagnosis of ASD confirmed by both the ADOS-second edition (ADOS-2) (Hus & Lord, 2014) and the Autism Diagnostic Interview-Revised (ADI-R) for toddlers (Kim et al., 2013); (c) a DQ ⩾ 30 on the MSEL (Akshoomoff, 2006); and (d) multidisciplinary diagnostic confirmation. Exclusion criteria included severe neurological or somatic disorders preventing intensive intervention or a diagnosis of Rett syndrome. The children included in this RCT were assessed at three time points: inclusion (T0), 1-year follow-up (T1), and 2-year follow-up (T2).
The BOSCC was applied retrospectively to video recordings collected at three time points (T0 inclusion, T1 1 year, and T2 2 years). BOSCC videos were available for 152 of the 177 children included in the RCT (139 videos at T0, 141 at T1, and 131 at T2), for a total of 411 rated videos. Data from 50 children were missing due to poor video quality or recording issues and opposition to the reanalysis of data for one child. A total of 103 children were assessed at all three time points and were evaluated using the CGI-I (see Supplemental Material Figure S1: Flow diagram of the study). Population characteristics (N = 152) are described in detail in Table 1.
Description of the Population at Inclusion (N = 152).
All assessments were conducted in accordance with ethical standards, with data centralized and anonymized under secure conditions (CPP Sud Est III N° 2015 - 013 B). The local ethics committee approved this secondary analysis (CEREVI 2024/025).
BOSCC Protocol and Coding
Each video captured a 12-minute semistructured play session between the child and researcher, conducted using the standard BOSCC protocol. The BOSCC protocol (Grzadzinski et al., 2016) involves an interaction with an unfamiliar adult, structured into three sequences: (a) 5 minutes of free play with a first box containing symbolic and construction toys, (b) a 2-minute bubble play activity, and (c) another 5 minutes of free play with a second box of different toys, also combining symbolic and construction play. The evaluator follows the BOSCC manual, which provides guidance on how to initiate, prompt, or adjust interactions while allowing the child to lead part of the play. Although the BOSCC is designed to elicit more naturalistic interactions than the ADOS, both instruments rely on comparable semistructured social contexts. The ADOS remains the most widely used observational benchmark in ASD intervention research; its broad adoption makes it a valuable point of comparison, even though its responsiveness is debated. In this study, the ADOS is therefore used not as a gold standard, but as an established reference measure against which BOSCC responsiveness can be examined.
The researcher who coded the BOSCC was blinded to the treatment allocation, age, and assessment times in the study.
We used the 2016 version of the BOSCC (Grzadzinski et al., 2016). The scale consists of 15 items: Nine items assess social-communication skills (eye contact, facial expressions, gestures, directed vocalizations, integration of communication modes, social overtures, responses, engagement in activities, play with objects), three items assess RRBs (unusual sensory interests, mannerisms, repetitive and stereotyped interests), and three items assess behaviors not specific to ASD (activity level, disruptive behaviors, and anxiety). Each item is rated on a 6-point scale (0: abnormality is not present to 5: abnormality is present and can significantly disrupt functioning) based on the frequency and quality of behavior. The total score is the sum of the scores of the nine items assessing social-communication skills and three assessing RRB. The three items that assess behaviors not specific to ASD are not included in the total score.
The 12-minute video recordings were coded in two 6-minute segments. The raters reviewed each segment twice before scoring and used the item-specific decision tree to determine the ratings. The scores from both segments were averaged to obtain the total score. When the duration of the video was greater than 12 minutes, a 12-minute extract was systematically selected and rated: The 5 minutes before and after the 2 minutes of bubble activity were rated.
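The scoring rule described above (12 core items summed within each 6-minute segment, then averaged across the two segments) can be illustrated with a minimal sketch. The function names, the dictionary layout, and the order of operations (sum within segment, then average) are assumptions made for illustration; they are not taken from the BOSCC manual.

```python
from statistics import mean

# Items 1-9 (social communication) and 10-12 (RRBs) enter the total score.
# Items 13-15 (behaviors not specific to ASD) are rated but excluded.
CORE_ITEMS = range(1, 13)
ALL_ITEMS = range(1, 16)


def segment_total(ratings):
    """Sum the 12 core items of one segment.

    `ratings` is a hypothetical layout: a dict mapping item number (1..15)
    to a rating on the 0..5 scale.
    """
    if set(ratings) != set(ALL_ITEMS):
        raise ValueError("expected ratings for items 1..15")
    if any(not 0 <= value <= 5 for value in ratings.values()):
        raise ValueError("each rating must lie between 0 and 5")
    return sum(ratings[item] for item in CORE_ITEMS)


def boscc_total(segment_a, segment_b):
    """Average the two segment totals to obtain the video-level total."""
    return mean([segment_total(segment_a), segment_total(segment_b)])
```

Under these assumptions, the three nonspecific items never influence the total, whatever their ratings.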
Two researchers conducted the BOSCC assessments blinded from the treatment allocation, age, and assessment time point: a senior child psychiatrist trained directly by R. Grzadzinski (Center for Autism and the Developing Brain, NY) (MMGC), and a second psychiatrist trained locally under supervision of MMGC (AJD).
High initial fidelity was established following the BOSCC protocol: Raters were required to reach agreement on three consecutive double-coded videos before independently scoring; thereafter, double scoring occurred every five videos. The fidelity criteria included agreement within 1 point on ⩾80% of items per segment and a total score difference of <4 points. In the event of a disagreement, discussions were conducted between the raters until a consensus was reached.
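As an illustration of the fidelity criteria above, the following sketch checks one double-coded video. It assumes that agreement is computed over all items rated in a segment and that the total scores are supplied separately; the function and parameter names are hypothetical.

```python
def segment_agreement(rater1, rater2):
    """Share of items on which the two raters differ by at most 1 point."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("both raters must score the same non-empty item list")
    close = sum(abs(a - b) <= 1 for a, b in zip(rater1, rater2))
    return close / len(rater1)


def meets_fidelity(segments_r1, segments_r2, total_r1, total_r2,
                   min_agreement=0.80, max_total_gap=4):
    """Apply both criteria: >=80% near-agreement per segment, total gap < 4."""
    per_segment_ok = all(
        segment_agreement(s1, s2) >= min_agreement
        for s1, s2 in zip(segments_r1, segments_r2)
    )
    return per_segment_ok and abs(total_r1 - total_r2) < max_total_gap
```

A video fails the check as soon as either segment falls below the agreement threshold, even if the total scores are close.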
Comparator Measures
Clinical Global Impression–Improvement
This scale (Choque Olsson & Bölte, 2014) was used to assess functional changes in reciprocal social interaction based on BOSCC video recordings between T0 and T1, T1 and T2, and T0 and T2. Symptomatology changes were rated on a 7-point scale by comparing video recordings from T0–T1 and T1–T2, with higher scores indicating poorer outcomes. In this study, the CGI-I was used to reflect change specifically in social-communication behaviors, and clinicians based their rating on observed behaviors during the BOSCC administration.
CGI-I ratings were blinded to treatment allocation and were completed by one researcher (RH), a newly trained occupational therapist with knowledge of autism symptomatology but not trained in either the BOSCC or the ADOS. The rater received dedicated training. Prior to the study, a multidisciplinary team of clinicians (child and adolescent psychiatrists and psychologists with expertise in ASD) discussed and aligned on the clinical interpretation of change in social-communication behaviors to ensure consistency of judgment.
During scoring, the CGI-I was used to capture perceived change in social communication, based on behaviors observed during the BOSCC administrations (T0–T1 and T1–T2). Ratings therefore reflected improvement or worsening in social-communication functioning rather than global clinical change. This approach ensured conceptual alignment between the external anchor and the construct assessed by the BOSCC, in accordance with COSMIN recommendations, which allow external anchors to be derived from clinician judgment provided that their origin and limitations are clearly reported.
ADOS-2
The ADOS-2 is a semistructured standardized diagnostic tool for evaluating autistic symptoms through structured observation (Gotham et al., 2007; Hus & Lord, 2014). Scores for social affect and RRBs, as well as a total comparison score (1–10), are provided. Higher scores reflect greater severity of autistic symptomatology. Modules were selected based on the participants’ age and verbal ability. Assessments were performed by ADOS-trained psychologists with cross-site interrater calibrations.
Vineland Adaptive Behavior Scales, 2nd Edition
The VABS-2 is a parent-report interview measuring adaptive functioning (communication, socialization, daily living, motor skills) that provides standard domain and composite scores (Farmer et al., 2020; Yang et al., 2016). Higher scores indicate higher adaptive functioning in daily life. The interviews were conducted by professionals trained in the VABS.
Mullen Scales of Early Learning
The MSEL is a direct observation tool for assessing cognitive and developmental functioning in children from birth to 68 months of age (Akshoomoff, 2006). As most children are likely to score too low on the MSEL early learning composite score (risk of floor effect), we used the DQ score derived from four subscales (fine motor skills, visual reception, expressive language, and receptive language). Verbal DQ (expressive language and receptive language) and nonverbal DQ (fine motor skills and visual reception) scores were also calculated. Higher scores indicate a higher developmental level. The tests were conducted by professionals trained in the MSEL.
Although not originally designed to measure change, these last three tools have been used as outcome measures in previous ASD intervention trials (Pickles et al., 2016). In our study, they were used to assess the responsiveness of the BOSCC to external changes.
Statistical Analysis
All analyses were conducted using R software (R Core Team, 2019) following the COSMIN expert group recommendations (Mokkink et al., 2006).
BOSCC Reliability
The sample size for the reliability analyses was estimated using the ICC Shiny App (https://iriseekhout.shinyapps.io/ICCpower/), with an expected ICC of 0.80 based on prior BOSCC reliability studies (Grzadzinski et al., 2016; Kitzerow et al., 2016; Pijl et al., 2018). We assumed no systematic differences across the three repeated ratings and set the 95% CI width to 0.20 (±0.10) for precision. An expected variance of 10 reflected the observed distribution of the BOSCC scores. The required sample size was estimated to be 40.
Reliability was calculated using the ICC, with an assessment of interrater, intrarater, and test–retest reliability for the BOSCC. Intrarater reliability was evaluated by having the same coder re-score a subset of videos approximately 1 month later, which allowed us to monitor potential coder drift over time. Test–retest reliability was assessed using BOSCC videos collected 1 month apart during the study. For the BOSCC, the ICC was calculated for each section (A and B) and for the total score, both in the full sample and in subgroups obtained by splitting the sample at the median for age, verbal DQ, and nonverbal DQ.
Reliability was assessed using the following cutoffs:
ICC ⩾ 0.75 was considered indicative of good reliability.
ICC ⩾ 0.50 and < 0.75 was considered indicative of acceptable reliability.
ICC < 0.50 was considered poor reliability.
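For readers who wish to reproduce this type of analysis, the sketch below computes a one-way random-effects ICC and applies the cutoffs listed above. The hypotheses section names a one-way random model, but the choice of the single-measure form, ICC(1,1), is an assumption; the analyses reported here were run in R, and this Python code is only an illustration with hypothetical function names.

```python
def icc_oneway(ratings):
    """Single-measure one-way random-effects ICC, i.e. ICC(1,1).

    `ratings` holds one row per video, each row containing k scores
    (for example one total score per rater or per occasion).
    """
    n, k = len(ratings), len(ratings[0])
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need at least 2 videos with the same k >= 2 scores each")
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    # Between-video and within-video mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


def reliability_label(icc):
    """Map an ICC onto the cutoffs used in this study."""
    if icc >= 0.75:
        return "good"
    if icc >= 0.50:
        return "acceptable"
    return "poor"
```

Note that this estimator can return negative values when within-video variance exceeds between-video variance; such values fall in the "poor" band.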
Internal consistency, that is, the degree of interdependence between items, was calculated with Cronbach’s alpha score to evaluate the correlation of various items with each other on the totality of the rated videos.
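A minimal sketch of the standard Cronbach's alpha formula, α = k/(k − 1) × (1 − Σ item variances / variance of the summed score), is given below. The data layout (one row of item scores per rated video) and the function name are assumptions for illustration.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of rows, each holding k item scores."""
    k = len(rows[0])
    if k < 2 or len(rows) < 2 or any(len(row) != k for row in rows):
        raise ValueError("need at least 2 rows with the same k >= 2 items each")
    # Sample variance of each item (column) and of the summed score.
    item_variances = sum(variance(column) for column in zip(*rows))
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - item_variances / total_variance)
```

The function raises a `ZeroDivisionError` if all summed scores are identical, since alpha is undefined in that case.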
BOSCC Validity
Construct validity: The underlying construct measured by the BOSCC is defined as “deficits in socio-communicative interactions and the presence of repetitive, stereotyped behaviors,” consistent with the construct assessed by the ADOS. To evaluate whether the BOSCC accurately reflects this construct (i.e. to assess structural validity), we conducted a dimensional structure analysis using eigenvalue plots, exploratory factor analysis, and ascending hierarchical classification. In addition, we examined the psychometric quality of the individual BOSCC items by identifying potential floor or ceiling effects and assessing item redundancy through graphical representations. The cutoff for floor and ceiling effects was set at 15%, with items exceeding this threshold considered to have such effects.
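The floor and ceiling criterion can be sketched as follows. The sketch assumes that raw item ratings on the 0 to 5 scale are used and that an effect is flagged when strictly more than 15% of ratings sit at the scale minimum or maximum; the function name is hypothetical.

```python
def floor_ceiling(scores, minimum=0, maximum=5, threshold=0.15):
    """Report the share of ratings at each extreme and whether it exceeds the threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    floor_share = sum(s == minimum for s in scores) / n
    ceiling_share = sum(s == maximum for s in scores) / n
    return {
        "floor_share": floor_share,
        "ceiling_share": ceiling_share,
        "floor_effect": floor_share > threshold,
        "ceiling_effect": ceiling_share > threshold,
    }
```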
Convergent validity was assessed using Spearman’s correlations between the BOSCC scores and ADOS scores (total and subdomain scores). While the ADOS is not a gold-standard instrument for measuring responsiveness in ASD (no such gold standard currently exists), it is commonly used to evaluate changes in core symptomatology. Correlation strength was interpreted according to Cohen’s benchmarks: weak (0.10–0.29), moderate (0.30–0.49), and strong (0.50–1.0) (Cohen, 1988).
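To make the procedure concrete, the sketch below computes a Spearman correlation (Pearson correlation of average ranks, so that ties are handled) and labels its magnitude with the benchmarks above. It is written with the Python standard library only; the function names and the "negligible" label for values below 0.10 are assumptions, not part of the original analysis.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for position in range(i, j + 1):
            ranks[order[position]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    covariance = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    spread = sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))
    return covariance / spread


def cohen_label(r):
    """Label the absolute size of a correlation with Cohen's benchmarks."""
    size = abs(r)
    if size >= 0.50:
        return "strong"
    if size >= 0.30:
        return "moderate"
    if size >= 0.10:
        return "weak"
    return "negligible"  # label assumed here for |r| < 0.10
```

Because the label is based on the absolute value, a negative correlation is classified by its magnitude alone.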
BOSCC Responsiveness
Responsiveness refers to the ability of an instrument to detect meaningful changes in the construct it measures over time. Responsiveness includes two key components: internal and external responsiveness (Husted et al., 2000). Internal responsiveness reflects the capacity of a measure to detect changes within a given timeframe, typically evaluated by estimating effect sizes in RCTs involving interventions with established efficacy. However, in the context of this study, this approach was not deemed appropriate, as the early intervention under evaluation did not demonstrate a clear or large treatment effect (Geoffray et al., 2025; Touzet et al., 2017).
In contrast, external responsiveness refers to the extent to which changes in the instrument’s scores correspond with changes in other validated reference measures. It can be assessed using three primary methods: (a) receiver operating characteristic (ROC) curve analysis and calculation of the AUC; (b) correlation analyses of change scores; and (c) regression models.
In the current study, we evaluated the external responsiveness of the BOSCC by calculating ROC curves and AUC values and conducting Spearman correlation analyses with change scores from the ADOS and CGI-I.
We plotted ROC curves to visualize the sensitivity and specificity of the BOSCC scores relative to the CGI-I and ADOS scores across three time intervals (T0–T1, T1–T2, and T0–T2). AUC values were calculated to quantify the discriminative ability of the BOSCC over time.
The score change thresholds were defined as follows:
For ADOS:
Improvement: ADOS < –1; major improvement: ADOS < –2
Deterioration: ADOS > 1; major deterioration: ADOS > 2
For CGI-I:
Minimal improvement: CGI-I ⩽ 3; improvement: CGI-I < 3; major improvement: CGI-I < 2
No change or worsening: CGI-I > 3; worsening: CGI-I > 4
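The thresholds above turn each anchor into a binary outcome for the ROC analysis. A minimal sketch of that binarization follows; it assumes that ADOS change is computed as the later score minus the earlier score (so that negative values indicate improvement), and the dictionary and function names are hypothetical.

```python
# CGI-I ratings run from 1 to 7; lower ratings indicate greater improvement.
CGI_I_CUTOFFS = {
    "minimal improvement": lambda score: score <= 3,
    "improvement": lambda score: score < 3,
    "major improvement": lambda score: score < 2,
    "no change or worsening": lambda score: score > 3,
    "worsening": lambda score: score > 4,
}

# ADOS change = later score minus earlier score (assumed sign convention).
ADOS_CHANGE_CUTOFFS = {
    "improvement": lambda change: change < -1,
    "major improvement": lambda change: change < -2,
    "deterioration": lambda change: change > 1,
    "major deterioration": lambda change: change > 2,
}


def binarize(values, rule):
    """Return 1 where the rule holds and 0 otherwise."""
    return [int(rule(value)) for value in values]
```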
AUC interpretation (Çorbacıoğlu & Aksel, 2023):
0.5 = No discrimination; 0.5–0.7 = Poor; 0.7–0.8 = Acceptable
0.8–0.9 = Good; ⩾0.9 = Excellent
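The AUC can be computed directly as the probability that a randomly chosen case from the positive group has a higher score than a randomly chosen case from the negative group, with ties counted as one half. The sketch below uses this rank-based definition together with the interpretation bands above. Two assumptions are made: the score passed in is oriented so that larger values point toward the positive class (for improvement, this could be the earlier BOSCC total minus the later one), and boundary values are assigned to the higher band. The function names are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    # Each positive/negative pair scores 1 for a win and 0.5 for a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def auc_label(value):
    """Map an AUC onto the interpretation bands used in this study."""
    if value >= 0.9:
        return "excellent"
    if value >= 0.8:
        return "good"
    if value >= 0.7:
        return "acceptable"
    if value > 0.5:
        return "poor"
    return "no discrimination"  # also covers values at or below chance
```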
We also conducted complementary analyses using two broader developmental assessment tools. Specifically, we examined the correlations between changes in BOSCC scores and changes in cognitive and adaptive functioning, as measured by the MSEL and the VABS-2.
Results
BOSCC Reliability
Reliability was examined across three dimensions: interrater, intrarater, and test–retest. A threshold of ICC ⩾ 0.75 was considered indicative of good reliability.
Interrater reliability (N = 50, 4 girls, 46 boys) was high across all sections and the total score:
Section A: ICC = 0.90 [95% CI: 0.82–0.94]
Section B: ICC = 0.91 [0.85–0.95]
Total: ICC = 0.93 [0.88–0.96]
Reliability remained high within subgroups defined by age and verbal and nonverbal DQ (Table 2).
Summary Table of Reliability Coefficients (ICC).
Intrarater reliability (N = 40, 6 girls, 34 boys; reassessed after 4 weeks) was also high:
Section A: ICC = 0.84 [0.70–0.92]
Section B: ICC = 0.98 [0.97–0.99]
Total: ICC = 0.98 [0.94–0.99]
Test–retest reliability (N = 33, 4 girls, 29 boys; 1-month interval) was strong for the Total score and Section B (ICC = 0.91 [0.83–0.96] and 0.92 [0.84–0.96], respectively), with slightly lower reliability for Section A (ICC = 0.79 [0.62–0.89]). Lower reliability was observed for participants with nonverbal DQ below the median (ICC = 0.43 [−0.05 to 0.75]) (Table 2).
Internal consistency was high for the total score (Cronbach’s α = 0.98) and individual items (α = 0.88) across 407 videos (N = 152), exceeding the 0.70 threshold.
BOSCC Validity
Construct Validity
Factor analysis supported a three-dimensional structure:
Social communication (Items 1–9)
RRBs (Items 10–12)
Disruptive behaviors (Items 13–15)
The ascending hierarchical classification confirmed this structure (Table 3; Supplemental Figures S2A–B).
Factor Analysis of BOSCC Items.
Figure 1 presents the item quality analysis, with graphical representations of the different BOSCC items. Item-level quality revealed floor and ceiling effects (threshold 15%):
Clear floor effects for Items 2, 10, 11, 12
Ceiling effect for Item 3
Both floor and ceiling effects for Item 4
Items 13–15 assessing disruptive behaviors also showed expected floor effects, as these behaviors are generally absent; their inclusion prevents bias when present.

Item quality analysis—graphical representations of the various items of the BOSCC.
These findings confirm adequate coverage of target constructs, while highlighting items prone to floor/ceiling effects.
Convergent Validity
Convergent validity was examined using correlations between BOSCC scores and ADOS scores across 407 videos (152 participants). Correlation interpretation followed Cohen’s guidelines (small ⩾ 0.10, moderate ⩾ 0.30, large ⩾ 0.50).
Total score: r = 0.478 [0.345–0.596], p < 0.001 (moderate, above r ⩾ 0.30 threshold)
Social affect domain: r = 0.65 [0.54–0.74], p < 0.001 (high)
RRBs domain: r = 0.26 [0.09–0.42], p = 0.001 (below moderate threshold)
These results indicate that BOSCC scores converge well with ADOS for social communication, but less so for RRBs (Table 4).
Summary Table of Correlation of BOSCC Score Changes With Other Clinical Measures.
BOSCC Responsiveness
External responsiveness was assessed relative to two anchors: CGI-I and binarized ADOS scores. The ADOS was included because of its widespread use and demonstrated sensitivity to longer-term changes.
AUC interpretation: 0.5 = no discrimination, 0.5–0.7 = poor, 0.7–0.8 = acceptable, 0.8–0.9 = good, ⩾0.9 = excellent.
BOSCC vs CGI-I: AUCs ranged from 0.668 to 0.874. Most values indicated acceptable to good discrimination across minimal, moderate, and major improvements, although the lowest value (0.668) fell just below the 0.70 threshold for acceptable discrimination; performance was strongest over T0–T2 (Table 5). Figure 3 presents ROC curves illustrating BOSCC sensitivity and specificity relative to CGI-I scores across three time intervals: T0–T1, T1–T2, and T0–T2.
BOSCC AUC Values for Each Interval and CGI-I Score Cutoff.
Yellow shading indicates acceptable AUC values, whereas green shading indicates good AUC values.
BOSCC vs binarized ADOS: AUCs were generally lower (0.475–0.582) except for T0–T2 deterioration (0.709, acceptable), demonstrating BOSCC’s superior sensitivity to short-term changes, while confirming ADOS captures longer-term deterioration (Table 6). Figure 2 presents ROC curves illustrating BOSCC sensitivity and specificity relative to ADOS score changes across three time intervals: T0–T1, T1–T2 and T0–T2.

Receiver operating characteristic (ROC) curves illustrating BOSCC sensitivity and specificity relative to ADOS scores across three time intervals (T0–T1, T1–T2, and T0–T2), with AUC values quantifying discriminative ability. ADOS change thresholds: improvement (<–1), major improvement (<–2), deterioration (>1), major deterioration (>2).

ROC curves showing BOSCC sensitivity and specificity relative to CGI-I scores across three time intervals (T0–T1, T1–T2, and T0–T2), with AUC values quantifying discriminative ability. CGI-I change thresholds: minimal improvement (⩽3), improvement (<3), major improvement (<2), no change/worsening (>3), worsening (>4).
BOSCC AUC Values for Each Interval and ADOS Score Cutoff.
Yellow shading indicates acceptable AUC values, blue shading indicates low AUC values, and light blue shading indicates no discrimination.
Correlation With Anchors
BOSCC vs CGI-I: r = 0.55 [0.395–0.669] (T0–T1), r = 0.44 [0.269–0.584] (T1–T2), r = 0.65 [0.525–0.751] (T0–T2); all p < 0.001.
BOSCC vs ADOS: r = 0.051 [−0.125 to 0.224] (T0–T1), r = 0.047 [−0.130 to 0.222] (T1–T2), r = 0.20 [0.017–0.363] (T0–T2).
These results confirm BOSCC’s strong alignment with CGI-I and limited short-term convergence with ADOS, consistent with expectations.
Complementary Responsiveness Analyses
Associations between BOSCC change scores and broader developmental outcomes were explored using VABS-2 and MSEL.
VABS-2: T0–T1: r = –0.274 [–0.429 to –0.102], p = 0.0021 (low, negative); T1–T2: r = –0.012 [–0.192 to 0.168], ns; T0–T2: r = –0.402 [–0.545 to –0.236], p < 0.00001 (moderate, negative).
MSEL: T0–T1: r = –0.399 [–0.537 to –0.240], p < 0.00001 (moderate, negative); T1–T2: r = –0.266 [–0.423 to –0.094], p = 0.0029 (low, negative); T0–T2: r = –0.384 [–0.527 to –0.219], p < 0.00002 (moderate, negative).
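These findings suggest that the BOSCC captures broader adaptive and cognitive changes over longer periods, with absolute correlations exceeding the 0.30 threshold for meaningful associations over T0–T2 for both measures and over T0–T1 for the MSEL (Table 4). The negative sign is expected: higher BOSCC scores indicate greater symptom severity, whereas higher VABS-2 and MSEL scores indicate better functioning.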
Discussion
This study is the first to comprehensively evaluate the reliability, construct validity, and responsiveness to change of the BOSCC in a large sample of nonverbal preschoolers with ASD (N = 152) drawn from an RCT cohort. We implemented the original 12-minute BOSCC interaction with an unfamiliar adult (Grzadzinski et al., 2016) without modification, and all sessions were coded by raters blinded to treatment allocation. In line with COSMIN recommendations (Mokkink et al., 2006), we prioritized external responsiveness, and we compared BOSCC change against reference measures (CGI-I and ADOS) over a 2-year follow-up.
Overall, our findings confirm the high fidelity of the BOSCC, its satisfactory structural and convergent validity, and its responsiveness, that is, the ability to detect meaningful change over time, ranging from moderate, when compared with ADOS, to strong, when compared with CGI-I, using commonly accepted benchmarks for reliability (ICC ⩾ 0.75), validity (r > 0.30), and discrimination (AUC > 0.70).
Fidelity, Construct, and Convergent Validity
Interrater, intrarater, and test–retest reliability were consistently high across the total sample and in most subgroups defined by age and DQ, in line with previous reports (Grzadzinski et al., 2016; Kim et al., 2019; Pijl et al., 2018), exceeding the threshold for good reliability (ICC ⩾ 0.75). Internal consistency was also high, supporting the BOSCC as a coherent measure of social communication and restricted/repetitive behaviors in this population (Cronbach’s alpha > 0.70). Lower test–retest coefficients observed in children with lower nonverbal DQ suggest that behavioral variability and attentional challenges in this subgroup may influence score stability. As in ADOS-2, psychomotor instability and other behaviors could artificially increase scores (Hong et al., 2022). It may be advisable to re-evaluate the child later if they score highly on the three items measuring behaviors not specific to autism. In addition, the reduced variance in this subgroup may be partly due to the fact that only two researchers coded the BOSCC, which, while ensuring high fidelity, may have limited the variability and precision of ICC estimates. The small sample size of this subgroup further reduces the precision of the ICC, potentially affecting the reliability of the results.
Factor analysis confirmed overall robust structural validity and yielded a three-factor model—social interaction, restricted/repetitive behaviors, and a third factor related to anxiety/agitation—consistent with the BOSCC’s theoretical framework and with autism-associated nonspecific behaviors. Nevertheless, at the level of individual items, it is important to remain cautious regarding the distribution of the “facial expression,” “sensory interest,” “mannerisms,” and “vocalisation” items, which displayed floor and/or ceiling effects exceeding the conventional 15% threshold. These characteristics may limit their sensitivity to change, particularly over a short observation period.
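The floor and ceiling check described above can be sketched as follows: for each item, compute the share of observations at the lowest and highest possible score and flag the item when either share exceeds 15%. The item name, the score range passed as `min_score` and `max_score`, and the example scores are hypothetical and serve only to illustrate the calculation.

```python
def floor_ceiling_rates(scores, min_score, max_score):
    """Return the proportions of scores at the scale minimum and maximum."""
    n = len(scores)
    if n == 0:
        raise ValueError("scores must not be empty")
    floor = sum(1 for s in scores if s == min_score) / n
    ceiling = sum(1 for s in scores if s == max_score) / n
    return floor, ceiling


def has_floor_or_ceiling_effect(scores, min_score, max_score, threshold=0.15):
    """Flag an item when more than `threshold` of scores sit at either extreme."""
    floor, ceiling = floor_ceiling_rates(scores, min_score, max_score)
    return floor > threshold or ceiling > threshold


# Hypothetical item scores on an assumed 0-5 scale.
example_item = [0, 0, 1, 2, 5]
print(floor_ceiling_rates(example_item, 0, 5))          # (0.4, 0.2)
print(has_floor_or_ceiling_effect(example_item, 0, 5))  # True
```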
Responsiveness
External responsiveness, assessed by two different methods, confirmed strong alignment with CGI-I ratings, that is, with clinician-rated global improvement, with correlations exceeding the commonly accepted threshold for convergent validity (r > 0.30). However, BOSCC poorly discriminated change based on ADOS severity scores, with low AUC values, particularly over short intervals (below the threshold for acceptable discrimination, AUC > 0.70). Only modest associations emerged over longer periods (T0–T2). In contrast, BOSCC showed moderate to strong correlations with CGI-I change scores and acceptable to good AUC performance (AUCs ⩾ 0.70). This divergence may reflect fundamental differences between the ADOS, which was designed for diagnostic classification and is relatively insensitive to short-term change, and the BOSCC, which was explicitly developed to capture subtle behavioral variations.
Exploratory analyses using the MSEL and VABS revealed low to moderate correlations, reinforcing the idea that these instruments measure different constructs. This further supports the notion that no single tool suffices for assessing ASD outcomes, and that the BOSCC’s value may be maximized when combined with complementary measures, such as functional assessments (Mazurek et al., 2020).
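As an illustration of anchor-based discrimination, the AUC can be read as the probability that a randomly chosen child rated as improved on the anchor shows a larger BOSCC improvement than a randomly chosen child rated as not improved. The sketch below computes this directly from pairwise comparisons. It is not the authors' analysis code, and the improvement scores (assumed here to be baseline minus follow-up, so that larger values mean more improvement) and the group split are hypothetical.

```python
def auc(improved, not_improved):
    """AUC as the share of (improved, not improved) pairs ranked correctly.

    Ties count as half a correct ranking, which matches the area under
    the empirical ROC curve.
    """
    if not improved or not not_improved:
        raise ValueError("both groups must contain at least one score")
    wins = 0.0
    for a in improved:
        for b in not_improved:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(improved) * len(not_improved))


# Hypothetical BOSCC improvement scores, grouped by an anchor rating.
improved_on_anchor = [3, 4, 5]
not_improved_on_anchor = [1, 2, 3]
value = auc(improved_on_anchor, not_improved_on_anchor)
print(f"AUC = {value:.3f}; meets 0.70 benchmark: {value >= 0.70}")
```

An AUC of 0.5 corresponds to chance-level discrimination, which is why values well below the 0.70 benchmark indicate that BOSCC change and the anchor classify children differently.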
Our results were generally consistent with previous studies (Grzadzinski et al., 2016; Kim et al., 2019; Pijl et al., 2018) but differed from those of Carruthers et al. (2021), who found no overall superiority of the BOSCC over the ADOS in detecting change. In our study, the BOSCC closely reflected the clinician’s impression of change in reciprocal social interactions and showed stronger performance than ADOS. This discrepancy may be explained by methodological differences: Carruthers et al. coded the BOSCC from ADOS videos (a semi-structured interview), whereas we used the original BOSCC protocol, scoring interactions between the child and an adult during minimally structured activities. Furthermore, they relied on internal responsiveness, a property that is contingent upon the magnitude of the effect of the tested intervention.
Clinical and Research Implications
The BOSCC is a valid, responsive-to-change, and naturalistic tool that enhances ecological validity and requires minimal supervision. Results indicate it can be reliably administered without specialized accreditation, unlike other standard measures (Bolte & Diehl, 2013). Nonetheless, achieving and maintaining interrater reliability requires time and strict adherence to the BOSCC protocol, such as double coding every five videos, to ensure methodological rigor.
In contrast, tools like the CGI-I provide a rapid global clinical assessment but do not capture the detailed features of autistic behaviors. The BOSCC, by using a structured coding system with standardized items and scoring criteria, allows for a more systematic and reproducible evaluation of social-communication behaviors.
Incorporating BOSCC alongside functional outcome measures could enhance the ecological relevance of ASD clinical trials. Moreover, BOSCC has demonstrated good potential for use in ecological contexts, with studies showing consistent social-communication scores across home and school settings (Reszka et al., 2024). Similar findings have been reported for other observational tools, with social-communication behaviors remaining stable across varied routines such as play and snack (Frost et al., 2019).
An online platform dedicated to BOSCC coding has been developed, and a study has shown that the performance of online coding is comparable to that of manual coding. This platform makes BOSCC more accessible and less time-consuming for both researchers and clinicians (Toolan et al., 2025).
Study Limitations
The cohort was drawn from an RCT including an intensive intervention arm, offering a unique opportunity to assess BOSCC in the context of structured therapeutic programs. Future research in more diverse contexts, including cohorts outside specific intervention trials, will help strengthen generalizability. Variations in adherence to the 12-minute recording protocol and occasional missing data reflect the real-world challenges of large-scale implementation and will inform refinements to improve feasibility. Finally, the absence of a strong intervention effect in the clinical trial limited internal responsiveness analyses, highlighting the need for future studies including samples with a wider range of clinical change.
Because the CGI-I ratings were derived from the same BOSCC observation sessions, the external responsiveness analysis may be affected by shared method variance. While this approach is consistent with prior BOSCC validation studies and COSMIN guidance, results should therefore be interpreted with caution. Future studies should examine responsiveness using CGI-I ratings based on independent observational material or full clinical information.
Conclusion
These findings indicate that the BOSCC demonstrates strong reliability, validity, and responsiveness for detecting treatment-related changes in core autism domains within clinical trials. Overall, it emerges as a relevant indicator of clinically meaningful change, particularly over periods of 12–24 months and when anchored to global ratings (CGI-I) and functional measures. Future work should focus on refining the tool, including exploring domain-specific scoring to optimize its use, and establishing Minimal Clinically Important Difference (MCID) and Minimal Detectable Change (MDC) thresholds using both anchor-based and distribution-based approaches, to enhance its applicability in research and clinical practice.
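For readers unfamiliar with distribution-based thresholds, a widely used formulation derives the standard error of measurement (SEM) from the baseline standard deviation and a reliability coefficient, and then the MDC at the 95% level as 1.96 × √2 × SEM. The sketch below applies that conventional formula. The input values are hypothetical and are not estimates from this study.

```python
import math


def sem(sd, icc):
    """Standard error of measurement from a standard deviation and an ICC."""
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must lie between 0 and 1")
    return sd * math.sqrt(1.0 - icc)


def mdc95(sd, icc):
    """Minimal detectable change at the 95% level: 1.96 * sqrt(2) * SEM."""
    return 1.96 * math.sqrt(2.0) * sem(sd, icc)


# Hypothetical inputs: baseline SD of 10 points and a test-retest ICC of 0.91.
print(f"SEM = {sem(10, 0.91):.2f}, MDC95 = {mdc95(10, 0.91):.2f}")
```

A change smaller than the MDC cannot be distinguished from measurement error, whereas the MCID, estimated with an anchor such as the CGI-I, addresses whether a change is clinically meaningful.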
Supplemental Material
Supplemental material, sj-docx-1-aut-10.1177_13623613261442618 for Assess the Fidelity, Validity and Responsiveness of the Brief Observation of Social Communication Change (BOSCC) to Measure Changes in the Social Interactions of Preschool Children With Autism Spectrum Disorder by Agathe Jay, Lucie Jurek, Riham Hamadeh, Marie-Joelle Oreve, Carmen M. Schröder, Veronique Delvenne, Sandrine Sonie, Bruno Falissard, Olivia Febvey-Combes, Mario Speranza and Marie-Maude Geoffray in Autism
Footnotes
Acknowledgements
We thank all the assessment and intervention teams (Denver Lyon CH le Vinatier, Denver Lyon Saint-Jean de Dieu, Strasbourg, Versailles, Bruxelles), Mrs Sartelet and Mrs Vial from CH le Vinatier, and the children and parents who took part in this study.
Ethical Considerations
All assessments were conducted in accordance with ethical standards, with data centralized and anonymized under secure conditions (CPP Sud Est III N° 2015–013 B). The local ethics committee approved this secondary analysis (CEREVI 2024/025).
Consent to Participate
Each family had provided written consent to participate in the RCT. A letter of nonopposition was sent to them for the reanalysis of data in the present study, and one family declined.
Author Contributions
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The initial RCT was supported by a grant from the Direction Générale de l’Offre de Soins (PREPS 14-0533) and a grant from the Fondation de France (2015–013B). This current reanalysis was supported by The Pole of Research and Valorisation (PRV), CH le Vinatier, Bron, France.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
Trial-related anonymized participant data will be made available upon reasonable request to the principal investigator (MMG,
Community Involvement Statement
Of the 12 authors, 10 are clinicians: nine are child and adolescent psychiatrists and one is an occupational therapist. Eleven of these clinicians are also researchers in autism and in neurodevelopment. Among the two others, one is a public health researcher and the other is a statistician. They are based in large hospitals or centers in France specializing in diagnosing and providing care for children with autism and associated neurodevelopmental diagnoses, or in clinical research centers.
Supplemental Material
Supplemental material for this article is available online.
References
