Abstract
Universal behavior screening is used in schools worldwide to detect students with and at risk for behavioral challenges. A plethora of instruments is available for this purpose, though little metascience has been conducted to review and synthesize methods used to study these instruments in educational settings, nor is there a comprehensive list of instruments to support educators in selecting an appropriate tool. We conducted this review to provide a rigorous—and accessible—overview of the research base for universal behavior screening instruments to facilitate educators’ decision-making process when selecting a systematic screening tool for the students they serve and identify areas of further refinement for the research community. This scoping review includes an extensive list of behavior screening instruments, an examination of how these tools have been studied, and areas for future research. We identified 56 behavior screening instruments. The most common psychometric analyses included coefficient alpha for internal consistency, correlations between theoretically related variables, and confirmatory factor analysis. We discuss other methods currently employed as well as methods and complexities for consideration in future research.
Many school-age youth experience social, emotional, and behavioral health concerns at some point during their school years. Forness et al. (2012) compiled point prevalence estimates of students meeting criteria for any emotional or behavioral disorder (EBD), whether internalizing (e.g., anxiety) or externalizing (e.g., aggression), during their school years and conservatively estimated 12%, with included studies ranging from 3.7% to 21.1%. Prevalence rates, particularly for internalizing disorders (e.g., anxiety, depression), have been on the rise in recent years among school-aged youths (Lebrun-Harris et al., 2022). For example, in the 2018–2019 National Survey of Children’s Health, 13.2% of U.S. children ages 13–17 had a diagnosed mental or behavioral health condition (Health Resources and Services Administration, 2020). More recently, in the 2023 Youth Risk Behavior Survey, 29% of high school students reported poor mental health in the last 30 days (Centers for Disease Control and Prevention, 2024). Although internalizing disorders are far less disruptive than externalizing behaviors in the classroom environment, they can still have detrimental impacts on students’ school experiences and may contribute to poor mental health (Weist et al., 2018). Collectively, estimates indicate many students need additional support at some point in their school career, though internalizing behaviors are often more challenging to recognize (Bradshaw et al., 2008).
Before the COVID-19 pandemic, many schools had adopted a systemic approach to examine overall levels of EBDs in school systems, identify students experiencing these challenges, and connect students with evidence-based strategies at the first sign of concern. These systems are now being recommended as a way to support students’ social, emotional, and behavioral needs as part of pandemic recovery (Office of Special Education and Rehabilitative Services, U.S. Department of Education, 2022; Walker et al., 2014). Tiered systems such as Positive Behavioral Interventions and Supports (PBIS; Sugai & Horner, 2002); Comprehensive, Integrated, Three-tiered (Ci3T) model of prevention (Lane et al., 2009); Multi-Tiered System of Supports (MTSS; McIntosh & Goodman, 2016); and Interconnected Systems Framework (ISF; Barrett et al., 2017) share a key element: universal screening. When implementing these systems as designed, schools screen all students in one or more domains: academic, behavioral, and social and emotional well-being. Just like vision or hearing screenings, academic, behavioral, and social-emotional screening involves brief assessments used with all students intended to measure the specific constructs of interest (e.g., externalizing or internalizing behaviors; Oakes et al., 2017). Educators can use systematic screening data to shape instructional practices at Tier 1 for all students, which may include low-intensity strategies (e.g., precorrection, instructional choice) to maximize engagement and limit challenging behavior (Korpershoek et al., 2016; Ma et al., 2022). Additionally, screening within a tiered system can create a clear and equitable path for connecting students to Tier 2 and 3 interventions when Tier 1 efforts, even when implemented as planned, are not sufficient to meet students’ needs (Lane, Menzies et al., 2020).
Tier 2 and 3 interventions may include validated interventions in the academic, behavioral, mental health, and/or social-emotional learning domains (e.g., Cipriano et al., 2023; K. A. Cohen et al., 2024; Murano et al., 2020; Sabey et al., 2017). As schools prepare to implement these systems, they must select a screening tool.
Several scholars have published recommendations to support education leaders in selecting a screening assessment. Glover and Albers (2007) recommended first evaluating the appropriateness for intended use, which includes making sure the (a) tool fits the need, (b) construct of interest aligns with their purpose, (c) tool has been validated for a similar population, and (d) instrument has theoretical and empirical support. Second, schools or districts evaluate the technical adequacy of instruments meeting the first set of standards by ensuring the (a) normative sample included students with similar characteristics to the student population they are serving, (b) measure demonstrated adequate reliability properties (e.g., internal consistency, test-retest stability, inter-rater reliability), and (c) studies have established adequate validity evidence (e.g., criterion, construct, content, prediction). Third, the tool must be feasible given cost, social validity, school systems or infrastructure, available accommodations, and applicability of results. Oakes et al. (2017) provided a similar framework for selecting a behavior screening tool: determine constructs of interest (e.g., internalizing and externalizing), narrow down potential instruments based on feasibility (e.g., cost, time), and then evaluate psychometric properties. The American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME, 2014) provide further guidance regarding instrument development and criteria for selecting an instrument in the Standards for Educational and Psychological Testing. The National Center on Intensive Intervention (NCII, 2022) provided specific criteria for educational screening.
These collective recommendations offer a roadmap for selecting a screening tool, though this roadmap may be challenging for schools and districts to implement as information on systematic screening tools has developed substantially in the last decade (Lane et al., 2012), leaving a widely dispersed set of information to review.
Aim and Scope
Initially, we planned to compile psychometric information on behavior screening instruments to assist practitioners in meeting established guidelines for selecting a screening instrument. As we prepared to conduct this review, we continually located more relevant instruments, though no comprehensive list of these instruments nor a consolidated source of information across instruments existed. We found several prior reviews with narrower scopes such as a single instrument (Kersten et al., 2015; Lambert, Sointu, et al., 2018; Marzocchi et al., 2004), a smaller selection of instruments (Caselman & Self, 2008; Jenkins et al., 2014), or a specific age range (Whalen et al., 2017). These reviews provide important depth to their respective areas, yet these were not designed to detail the full scope of available instruments to support practitioners in the early stages of instrument selection.
As such, we shifted our focus toward conducting a thorough scoping review to provide practitioners and researchers with a comprehensive yet accessible set of information regarding (a) behavior screening tools available to detect a range of behavior patterns of interest, to aid in instrument selection, and (b) analyses used to study these tools, to inform future research. We offer this rigorous and accessible scoping review of behavior screening tools to support schools in selecting instruments, identify current research practices, and highlight areas for refinement in future research (Boveda et al., 2023; Munn et al., 2018). We address two research questions. First, what systematic behavior screening tools are available for use in schools with students in grades pre-K–12? Second, how have researchers studied the psychometric properties of these instruments?
Method
We conducted a multi-phase search initially guided by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) checklist (Page et al., 2021). After our focus shifted toward a scoping review of universal behavior screening research, we used the PRISMA extension for scoping reviews to update our process and reporting (Tricco et al., 2018), both to maintain methodological rigor and to extend the potential reach of the review in guiding decisions regarding systematic screening tool selection (Boveda et al., 2023). Our search included three phases with reliability assessed in each phase: (1) an electronic database search using Boolean operators, (2) hand searches through identified journals, and (3) a second electronic database search including additional instruments identified in the first two phases (see Figure 1). See LDBase.org for a summary of findings from each included study.

Figure 1. Overview of search process.
Inclusion Criteria
The first author trained all authors on how to apply the following inclusion criteria to ensure the accuracy of the article selection process. They used a PowerPoint and provided each coauthor a draft of the inclusion criteria below to reference throughout the coding process. All authors independently rated three articles to practice applying criteria, with 100% agreement.
Published in English
The study may take place in any country, though the article must be published in English so the research team can determine if the study meets inclusion criteria. Studies published in other languages (e.g., Bacanli & Erdoğan, 2003) were not included, a noted limitation.
Published in a Peer-Reviewed Journal
The study must be published in a peer-reviewed journal. We did not include dissertations and theses as we could not determine the rigor of review, given the variability of different university requirements. Yet, many methodologically sound theses and dissertations are later published in journals and could then be included. We did not have any non-examples based on search parameters (see Tables S1 and S2, available online).
Administered in a PreK–12 Educational Setting
The study must have been conducted in a school environment or educational context such as residential, self-contained, private, university lab, or public school. We excluded studies in clinical settings and community-based programs (e.g., Lambert et al., 2015; Walrath et al., 2004) as our research question focused on providing guidance for selecting screening tools for use in pre-K–12 educational settings.
Administered Universally
The study must include a behavior screening tool administered to all students in a set age or grade range, or offered to all such students pending consent and assent. Many studies did not provide detailed information on their sampling procedures, so we could not create a more precise decision rule beyond universal administration. Studies including only a subset of students pre-identified by another characteristic (e.g., disability status, teacher referral) were excluded (e.g., Hysing et al., 2007; Treyvaud, 2014). Furthermore, studies that included a subset of students selected randomly or for nonspecified reasons were not included (e.g., Shojaei et al., 2008).
Behavior Screener
Because little metascience has been conducted in this field, there is no broadly accepted operational definition for a behavior screener. We focused on identifying studies that included a brief, universal assessment of characteristic behavior patterns of major childhood behavior disorders (i.e., internalizing and/or externalizing; Achenbach, 1991). Screening instruments are typically brief, though they vary in length based on the number and depth of constructs assessed and the stage in instrument development, as they are often pruned over time (Oakes et al., 2017). We excluded studies using only extended rating scales intended for diagnostic or special education eligibility purposes, such as the Teacher Report Form (TRF; Achenbach & Rescorla, 2001) and Social Skills Improvement System Rating Scales (SSiS-RS; Gresham & Elliott, 2008), as they are not brief screening instruments intended for universal implementation (e.g., Flanagan et al., 1996; Gresham et al., 2010).
Complete Instrument
The study must use a complete screening tool or an entire subscale (e.g., Stage II measures of the Systematic Screening for Behavior Disorders; SSBD; Walker et al., 2014) as designed or intentionally study a revised version of a tool. If the instrument was edited or modified (e.g., modified response anchors, unintentionally omitted items) and the changes were not the primary focus, or the conclusions were related to the original instrument, we excluded the study (e.g., King & Reschly, 2014; Moulton & Young, 2021), as seemingly minor instrument changes can impact findings (Goodman et al., 2007).
Psychometric Analysis
The study must present at least one psychometric analysis of a screening tool to indicate the study evaluated the instrument itself in some way. The study could focus on reliability or validity evidence of any kind (AERA et al., 2014). We did not include practice guides (e.g., Lane et al., 2011) and application studies (e.g., screening tool used as a pre- and post-test for an intervention study) as they do not provide information on how instruments are studied (e.g., Kamps et al., 2003).
Primary Database Search
Prior to developing Boolean search terms, we conducted informal searches through existing literature (e.g., Caselman & Self, 2008; Jenkins et al., 2014; Lane et al., 2012), practice briefs (e.g., Lane, 2019), and professional learning content about systematic behavior screening (e.g., ci3t.org/screening; pbis.org) to create a preliminary list of instruments. We determined a preliminary list was necessary to limit the initial search given the variety of language used to describe behavior screening instruments and relevant analyses. Using approximately 30 different sources, we compiled an initial list of screening tools and behavior rating scales often used as comparisons for screening tools. Next, we contacted senior scientists and researchers working with districts to implement systematic screening to ensure all screening tools they were aware of were included on this list. Although this was not a comprehensive search, we expected to find additional screening tools in articles referencing other established measures in the introduction or as a comparison tool. Any additional screening instruments found in the primary and hand searches were included in a secondary electronic search conducted approximately one year following the primary search.
We conducted the electronic database search in the ERIC and APA PsycINFO databases, two databases with substantial representation of relevant educational research that have been used in other scoping reviews related to measurement in tiered systems (e.g., Buckman et al., 2021). The primary search included terms to specify the study occurred in a school setting, addressed psychometric properties, and included at least one behavior rating tool (Table S1, available online). We used check boxes to limit results to articles published in English in peer-reviewed journals. We conducted the primary search through the University of Kansas library on December 22, 2021. A second author ran the same search terms and returned the same articles with 100% reliability. The primary search yielded 2,163 articles.
Titles and Abstracts Coding
The first author reviewed all 2,163 abstracts to determine if they met inclusion criteria, erring on the side of inclusion when there was no clear reason for exclusion. Two other authors served as secondary coders and assessed 25% of abstracts (n = 569) for reliability. Primary and secondary coders reached 94.71% interrater agreement (IRA) on the binary decision for inclusion in the next step, with Cohen’s kappa (J. Cohen, 1960; κ = 0.84, 95% CI = [0.79, 0.90]) in the almost perfect agreement range per Landis and Koch (1977).
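As a minimal sketch of how the two agreement statistics reported above relate, the following Python code computes percent agreement and Cohen's kappa for binary include/exclude decisions and maps kappa to the Landis and Koch (1977) descriptive bands. The function names and the ten paired ratings are hypothetical illustrations, not the coding data or software used in this review.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of items on which two raters made the same decision."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's (1960) kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    # Chance agreement: product of each rater's marginal proportions per category.
    p_expected = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when expected agreement equals 1")
    return (p_observed - p_expected) / (1 - p_expected)


def landis_koch_label(kappa):
    """Descriptive band for kappa following Landis and Koch (1977)."""
    if kappa < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "substantial"), (1.00, "almost perfect")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


# Hypothetical decisions (1 = include, 0 = exclude) for ten abstracts.
primary = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
secondary = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

agreement = percent_agreement(primary, secondary)
kappa = cohens_kappa(primary, secondary)
print(f"agreement = {agreement:.2%}, kappa = {kappa:.2f} ({landis_koch_label(kappa)})")
```

Because most abstracts are excluded by both coders, chance agreement is high, so kappa falls well below raw percent agreement in this toy example. This is one reason to report both statistics.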
Full Read Coding
From the abstracts, we identified 362 articles for a preliminary full-text read. Coauthors completed an average of 10 full article reads per week for 8–10 weeks, enabling the first author to conduct weekly reliability checks to prevent drift between raters. Each author read approximately one-fourth of the articles in total to determine if the article was eligible for inclusion. The first author checked reliability on a minimum of 25% of articles coded by each coauthor. Overall agreement between the first author and reliability raters was 90.29% (κ = 0.81, 95% CI = [0.69, 0.92], indicating very high agreement). Three articles had substantial ambiguity regarding procedures and/or measures. Authors met to reach a consensus on these articles, which led to clarification on how to determine if an instrument qualified as a screener for EBD. We decided to include all brief instruments that were implemented universally and measured at least one behavior characteristic of EBD. From this preliminary read, we identified 122 articles for inclusion.
Journal Hand Search
We identified 15 journals with three or more published articles included in the primary search to search by hand online. The journals are listed here in alphabetical order: Assessment, Assessment for Effective Intervention, Behavioral Disorders, European Journal of Psychological Assessment, International Journal of School & Educational Psychology, Journal of Child and Family Studies, Journal of Emotional and Behavioral Disorders, Journal of Psychopathology and Behavioral Assessment, Journal of Psychoeducational Assessment, Journal of School Psychology, Psychological Assessment, Remedial and Special Education, School Mental Health, School Psychology, and School Psychology Review. For these journals, we reviewed all issues from the date of the first included article (1987) or the earliest issue available online (whichever was later) through September 30, 2022.
The first author reviewed all 23,101 titles, and a secondary coder assessed a minimum of 25% of issues (n = 622) from each hand-searched journal. Raters noted the number of articles in each issue to ensure they checked the same titles and abstracts and reached 99.64% agreement (range = 97.87 to 100%). The average agreement for the number of articles to include from each issue was 95.75% (range = 85.71 to 100%). We identified 251 articles for full read based on titles and abstracts, of which 147 (58.57%) were previously read in full in the primary search. The first author read all 104 unique articles, and another author completed reliability coding of 25% of articles (n = 26; IRA = 92.31%; κ = 0.84, 95% CI = [0.62, 1.05], indicating high agreement). We identified 25 new studies for inclusion, of which five were published after we conducted the primary search.
Secondary Database Search
As expected, we identified additional screening instruments from the primary and hand searches. We added search terms for these instruments to the original terms from the primary electronic database search and removed criterion measures (e.g., TRF, SSiS-RS) as they were included only to assist in identifying additional screening tools (online Table S2). Besides screening-tool-specific search terms, we used the same university library, databases, and search parameters as the primary search. We conducted the secondary search on January 2, 2023. Two authors conducted the search separately and returned the same 1,816 articles with 100% reliability, yielding 353 new articles, including articles published after the primary search.
Titles and Abstracts Coding
Following the same procedures as the primary search, the primary coder reviewed all titles and abstracts (n = 1,491) except those already read in full. Two other authors served as reliability coders for 25% of abstracts (n = 373; IRA = 94.10%; κ = 0.74, 95% CI = [0.63, 0.84], indicating substantial agreement).
Full Read Coding
From abstracts, we identified 459 articles for a preliminary read, of which 125 were already included from the primary and hand searches. The first author read all new articles, and two other authors scored at least 25% for reliability (n = 94), yielding 96.81% agreement (κ = 0.90, 95% CI = [0.79, 1.01], indicating very high agreement). From this preliminary read, we identified 33 new articles to include for information extraction.
Information Extraction
Finally, authors read all included articles (n = 180) in full and recorded key information, primarily from method and results sections, in a coding spreadsheet designed for systematic reviews available at ci3t.org. We recorded participant demographics, school level, geographic location, locale (when available), screening informant(s), and all included instruments. Then, we recorded psychometric analyses reported and a brief overview of the findings. Another author checked coding for 25% of articles (n = 47) and found 98.16% average accuracy (range = 92.86 to 100%). See our coding sheet at LDBase.org for sample characteristics from all included articles.
Results
We begin by providing an overview of included studies. Many studies involved multiple school levels (i.e., elementary, middle, high school), informants, geographic regions, and analyses. The categories presented are not mutually exclusive, so percentages within each grouping do not sum to 100%. Studies included elementary-aged students most often (n = 110; 61.11%), followed by middle school (n = 67; 37.22%), high school (n = 42; 23.33%), and preschool or early childhood (n = 19; 10.56%). Teachers served as informants most often (n = 117; 65.00%), followed by students (n = 68; 37.78%) and parents (n = 35; 19.44%). Of the included studies, 119 (66.11%) incorporated samples from North American countries (116 [64.44%] from the United States), 35 (19.44%) from Europe, 18 (10.00%) from Asia, 8 (4.44%) from Australia, 4 (2.22%) from South America, and 2 (1.11%) from Africa; 3 (1.67%) did not explicitly report the location. Our coding sheet with information on the samples from each article and summaries of findings is available at LDBase.org, though not all articles reported the same depth of demographic characteristics. Notably, describing demographic characteristics of parent or teacher raters was the exception rather than the norm. The earliest included study was published in 1987, and our search concluded with articles published in 2022. We present the cumulative total of studies examining psychometric properties of behavior-screening instruments administered universally in a celeration graph (Figure 2; Kennedy, 2005), which shows slow growth in the body of literature for the first 20 years and greater increase in the last 15 years.

Figure 2. Cumulative total of universal screening articles by year published.
Identified Instruments
To address question one in assisting practitioners in identifying potential screening instruments appropriate for the students they serve and their intended use, we created a list of all behavior-screening instruments located throughout the search. We identified 56 instruments (Table 1) and listed articles included for each instrument in online Table S3. Wording for instrument and subscale names varies greatly across instruments due to different foci (e.g., internalizing, externalizing) and approaches (i.e., deficit skills, strength-based). Instrument length ranged from 3 to 66 items, with nearly all instruments utilizing a Likert-type scale with 3 to 11 response options.
Table 1. Identified Screening Instruments: Listed in Development Chronology
Note. The exact number of items may vary across publications.
Reliability Analyses
The most frequently reported analyses were estimates of internal consistency, which appeared in 149 articles (82.78%). Coefficient alpha was by far the most reported (n = 110; 61.11%), with more authors including omega in recent years (n = 21; 11.67%) and few other metrics (n = 18; 10.00%). Twelve articles (6.67%) included both alpha and omega estimates (Black et al., 2021; Español-Martín et al., 2021; Gillé et al., 2021; Gomez & Stavropoulos, 2020; Margherio et al., 2019; Pierce et al., 2016; Renshaw, 2019; Renshaw & Cook, 2019; Sharma et al., 2022; Volpe et al., 2021; von der Embse et al., 2017; Xu et al., 2021), five articles (2.78%) included alpha and another reliability estimate (Aitken et al., 2015; Garrido et al., 2020; Gustafsson et al., 2016; Kilgus, von der Embse, Allen, et al., 2018; Naser, Brown, et al., 2018), and one article (0.56%) included alpha, omega, and Spearman-Brown estimates (Naser, Hitti, et al., 2018).
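For readers less familiar with the statistic, coefficient alpha can be computed directly from item-level scores as k/(k − 1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The sketch below uses only the Python standard library. The function name and the five-student, three-item data set are hypothetical and are not drawn from any included study.

```python
from statistics import variance


def coefficient_alpha(item_scores):
    """Coefficient alpha from a list of respondents, each a list of item scores."""
    n_items = len(item_scores[0])
    if n_items < 2 or len(item_scores) < 2:
        raise ValueError("need at least two items and two respondents")
    if any(len(row) != n_items for row in item_scores):
        raise ValueError("every respondent must have a score on every item")
    # Transpose so each tuple holds all respondents' scores on one item.
    columns = list(zip(*item_scores))
    sum_item_variances = sum(variance(col) for col in columns)
    total_variance = variance([sum(row) for row in item_scores])
    return (n_items / (n_items - 1)) * (1 - sum_item_variances / total_variance)


# Hypothetical ratings: five students rated on three items (0-3 scale).
ratings = [
    [0, 1, 0],
    [1, 1, 1],
    [2, 2, 1],
    [3, 2, 3],
    [3, 3, 2],
]

alpha = coefficient_alpha(ratings)
print(f"coefficient alpha = {alpha:.3f}")
```

With so few respondents the estimate is only illustrative. Applied studies would typically also report a confidence interval.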
Several studies (n = 43; 23.89%) examined test-retest reliability or temporal stability of behavioral ratings over intervals ranging from two weeks to over a year. Most test-retest reliability analyses involved correlations (n = 39; 21.67%) with relatively few other analysis methods employed: intraclass correlations (ICC; n = 3; 1.67%; Aanondsen et al., 2020; Erdogan & Ozturk, 2011; Isolan et al., 2011), kappa coefficients (n = 2; 1.11%; Feil et al., 1995; Pagano et al., 2000), percent at risk (n = 1; 0.56%), and ANOVA (n = 1; 0.56%). Fewer studies evaluated inter-rater reliability (n = 31; 17.22%). Correlations were the most common method for assessing inter-rater reliability (n = 28; 15.56%), followed by intraclass correlations (n = 4; 2.22%; Caldarella et al., 2008; Downs et al., 2012; Kilgus et al., 2015; Mieloo et al., 2014) and kappa coefficients (n = 3; 1.67%; Feil et al., 1995; Kilgus et al., 2015; Margherio et al., 2019). ICCs can also be used to estimate the degree to which students within a class had similar scores by estimating the proportion of score variance due to the cluster or class (McCoach & Cintron, 2022). Very few studies reported these ICCs (n = 8; 4.44%; Eklund et al., 2017; Kilgus, von der Embse, Allen, et al., 2018; Kilpatrick et al., 2018; Liu et al., 2020; Splett et al., 2017, 2018; von der Embse et al., 2016; Wiesner & Schanding, 2013).
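To illustrate the clustering use of the ICC described above, the following sketch estimates the proportion of score variance attributable to class membership with a one-way ANOVA estimator. It assumes balanced classes (equal numbers of students) for simplicity. Applied studies more often fit an unconditional multilevel model, and the function name and scores here are hypothetical.

```python
from statistics import mean


def icc_oneway(clusters):
    """One-way ANOVA ICC(1) for balanced clusters (e.g., students within classes)."""
    sizes = {len(c) for c in clusters}
    if len(sizes) != 1:
        raise ValueError("this sketch assumes equal cluster sizes")
    k = sizes.pop()          # students per class
    g = len(clusters)        # number of classes
    if g < 2 or k < 2:
        raise ValueError("need at least two clusters of at least two students")
    grand_mean = mean([x for c in clusters for x in c])
    cluster_means = [mean(c) for c in clusters]
    # Mean squares between and within clusters.
    ms_between = k * sum((m - grand_mean) ** 2 for m in cluster_means) / (g - 1)
    ms_within = sum(
        (x - m) ** 2 for c, m in zip(clusters, cluster_means) for x in c
    ) / (g * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical screening totals for three classes of three students each.
classes = [[2, 3, 4], [5, 6, 7], [8, 9, 10]]

icc = icc_oneway(classes)
print(f"ICC = {icc:.3f}")
```

A value near 1 indicates that most score variance lies between classes (or raters), whereas a value near 0 indicates that scores vary mainly among students within the same class.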
Structural Analyses
Many studies evaluated the internal structure of instruments with factor analysis. Confirmatory factor analysis (CFA), often reported in relation to construct validity, was the third most common analysis overall (n = 77; 42.78%). Fewer studies reported exploratory factor analysis (EFA; n = 45; 25.00%), though more than half of those reporting EFA also reported CFA (n = 25; 13.89%). Multiple group confirmatory factor analysis (MGCFA) or measurement invariance was even less common (n = 34; 18.89%), though all of these studies evaluated the fit of the measurement model with CFA first. MGCFA has become more common, with 31 of the 34 studies published in the last decade. The structure of the Strengths and Difficulties Questionnaire (SDQ; Goodman, 1997), including many translations, was frequently studied, accounting for at least a third of CFA (n = 26), EFA (n = 15), and MGCFA (n = 12) studies. Relatively few studies examined the structure of the instrument using item response theory (IRT) analyses. Only 12 (6.67%) articles reported some form of IRT analysis, of which eight (4.44%) included differential item functioning analyses (Barbarin et al., 2020; Bøe et al., 2016; Bosik et al., 2022; Deighton et al., 2013; Harrell-Williams et al., 2015; Kim & Kamphaus, 2018; Lambert et al., 2014; Schatschneider et al., 2014).
Relational Analyses
Researchers established validity evidence by comparing screening scores to other theoretically related variables such as other screeners, behavior rating scales, or educational outcomes (e.g., attendance, office discipline referrals [ODRs], academic outcomes). Many studies reported correlation coefficients (n = 84; 46.67%), though correlations are often a prerequisite for more advanced analyses. Conditional probabilities (n = 47; 26.11%) paired with receiver operating characteristic curves (n = 33; 18.33%) were the next most common relational analyses, followed by regression (n = 41; 22.78%) and analysis of variance techniques (e.g., ANOVA, MANOVA; n = 14; 7.78%). Few studies reported other techniques such as kappa coefficients (n = 7; 3.89%), t-tests (n = 8; 4.44%), structural equation modeling (n = 6; 3.33%), and graphical examination (n = 2; 1.11%).
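The conditional probabilities and receiver operating characteristic analyses summarized above can be sketched as follows. The code computes sensitivity, specificity, positive predictive value, and negative predictive value at a single cut score, plus the area under the ROC curve as the probability that a randomly chosen student with the criterion condition scores higher than one without it. The scores, criterion classifications, and cut score of 5 are hypothetical.

```python
def conditional_probabilities(screen_positive, criterion_positive):
    """Sensitivity, specificity, PPV, and NPV from paired binary decisions."""
    tp = fp = fn = tn = 0
    for flagged, has_condition in zip(screen_positive, criterion_positive):
        if flagged and has_condition:
            tp += 1
        elif flagged:
            fp += 1
        elif has_condition:
            fn += 1
        else:
            tn += 1
    # Note: a real analysis should guard against empty cells (division by zero).
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def roc_auc(scores, criterion_positive):
    """AUC via pairwise comparison; ties between groups count as one half."""
    pos = [s for s, c in zip(scores, criterion_positive) if c]
    neg = [s for s, c in zip(scores, criterion_positive) if not c]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical screener totals and criterion status (True = condition present).
scores = [9, 7, 6, 4, 3, 2, 1, 0]
criterion = [True, True, False, True, False, False, False, False]
cut_score = 5  # hypothetical cut score: totals at or above 5 are flagged

flags = [s >= cut_score for s in scores]
stats = conditional_probabilities(flags, criterion)
auc = roc_auc(scores, criterion)
print({name: round(value, 3) for name, value in stats.items()}, round(auc, 3))
```

Conditional probabilities depend on the chosen cut score, whereas the AUC summarizes classification accuracy across all possible cut scores, which is why the two are often reported together.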
Cluster Analyses
Seven articles (3.89%) presented clustering analyses to examine patterns of behavioral risk over time (Bauer, 2022). These studies included k-means cluster analysis (n = 1; 0.56%; Dever et al., 2017), latent profile analysis (n = 3; 1.67%; Dowdy et al., 2014; Kilgus et al., 2015; Warmbold-Brann et al., 2017), latent class analysis (n = 2; 1.11%; Kilgus et al., 2019; King et al., 2016), and latent transition analysis (n = 1; 0.56%; Iaccarino et al., 2019).
Social Validity
Twelve articles (6.67%) included some analysis of social validity—the acceptability of goals, procedures, and outcomes of universal behavior screening (Wolf, 1978)—to give voice to those completing the screening tools. Researchers examined social validity via response rates (e.g., Lane et al., 2010) or instruments such as the Assessment Rating Profile-Revised (ARP-R; Daniels et al., 2017; Eckert et al., 1999), Screening Tool Rating Scale (STR; Lane & Oakes, 2010; Lane et al., 2014, 2015; Oakes et al., 2016), and Usage Rating Profile–Assessment (URP-A; Chafouleas et al., 2012; Hartman et al., 2017).
Discussion
Educators around the world use universal behavior screening to inform instructional experiences for all learners; connect students with validated interventions when the core curriculum available to all students, or Tier 1, is insufficient; and direct teacher efforts to respectfully maximize instruction (Gresham et al., 2013; Lane et al., 2021). Although a wealth of behavior screening tools exists, there was not an extensive list to assist educators beginning the instrument selection process and there was little metascience on the field at large. Our goal was to provide a rigorous, accessible review of universal behavior screening instruments (Boveda et al., 2023). This scoping review resulted in an extensive list of universal behavior screeners and mapped the methods used to study these tools.
Identified Instruments
We found far more screening instruments than expected—56 in total—though we suspect many tools measure slightly different constructs. For example, social-emotional and behavior screeners, especially those with a strength-based approach, often contain similar language, though the constructs are not necessarily interchangeable (Lane, Oakes, Monahan et al., 2023). Both are valuable tools for informing instruction and intervention efforts, though they do not have the same intended use in practice thus requiring different validity evidence to align with their respective uses (Kane, 2013). We encourage instrument developers to clearly define intended constructs, researchers to engage in critical conversations on how to clarify the boundaries of these constructs, and educators to consider which constructs align best with their intended use when selecting an instrument.
In recent years, the number of universal screening studies has increased substantially. The bulk of the evidence base for universal behavior screening has been established at the elementary level, suggesting there is ample room to study these tools in early childhood, middle school, and high school settings. Teachers were the most common informant in school-based screening studies, which seems fitting when the purpose is to inform educational experiences and detect students in need of additional support at school specifically. Students were the second most common informant followed by parents. Information gleaned from student or familial raters brings additional and important perspectives to inform students’ behavioral needs, especially beyond the school setting (De Los Reyes & Epkins, 2023). However, the school system must be prepared to respond swiftly and effectively when new information is gleaned from students and families, given ethical and legal obligations as mandated reporters (Lane et al., 2021). Most studies in this review included samples from the United States, though our requirement of being published in English certainly impacted the geographical distribution of studies, a noted limitation.
For educators interested in implementing universal screening, there is a wealth of literature available. This is encouraging given systematic screening tools are one potential path to equitably connecting students with supports aligned with their level of needs. Beyond this review, a wide range of practice guides are available on pbis.org and ci3t.org/screening for educators and technical assistance providers, not only to assist with selecting screening instruments (e.g., Lane, 2019; Oakes et al., 2021) but also to set up structures to prepare to implement screening (e.g., Lane, Oakes, Menzies, et al., 2020; Oakes et al., 2022; Schonour et al., 2022) and use screening tools to shape instructional experiences for students (e.g., Lane, Powers, et al., 2020; Ma et al., 2022).
Analysis Methods
As we examine the methods used to study these instruments, we consider the evolution of research methodology as well as the chronological entry-point of any given instrument over the 45-year range of included studies. We do not devalue a study using a technique considered dated by today’s standards when it was cutting-edge or an expected analysis at the time of publication. Similarly, we recognize each of these instruments was once, or may still be, in its infancy. We encourage further inquiry into tools with less published research or fewer modern analyses as they may be efficient and effective tools for educators. Even for established instruments, validation is an ongoing process as populations of interest, understanding of a construct, and inferences made from the scores may change over time (McCoach et al., 2013). We encourage researchers to reevaluate the utility of each instrument regularly, with continued research and refinement viewed as a strength of the instrument rather than a limitation.
In terms of reliability evidence, we were encouraged to see researchers moving away from relying solely on coefficient alpha as the underlying assumptions are rarely met fully in practice (Cronbach & Shavelson, 2004; Sijtsma & Pfadt, 2021). Instead, researchers can report other estimates of reliability with more plausible assumptions—particularly regarding tau equivalence—in addition to coefficient alpha (McNeish, 2018; Revelle & Condon, 2019). Test-retest and inter-rater reliability studies are relatively common, though we encourage researchers to consider their underlying theory when conducting these studies. When examining test-retest reliability, behaviors may not be stable over extended periods (e.g., over one year), especially if the student is receiving intervention in response to a prior screening score. Similarly, we encourage researchers to consider how student behavior may differ across settings when selecting multiple raters for inter-rater reliability as true differences in behavior may impact these reliability estimates (Achenbach et al., 1987).
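As a point of reference for the reliability discussion above, coefficient alpha can be computed directly from a respondent-by-item score matrix. The following is a minimal sketch rather than an analysis from any reviewed study: the function name `coefficient_alpha` and the small rating matrix are hypothetical, and the calculation inherits the tau-equivalence assumption noted above. Alternative estimates with weaker assumptions (e.g., omega) require fitting a factor model and are not shown here.

```python
import numpy as np


def coefficient_alpha(scores):
    """Coefficient alpha for a respondents-by-items score matrix.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total score)
    """
    x = np.asarray(scores, dtype=float)
    k = x.shape[1]                               # number of items
    item_var = x.var(axis=0, ddof=1).sum()       # sum of sample item variances
    total_var = x.sum(axis=1).var(ddof=1)        # sample variance of total scores
    return (k / (k - 1)) * (1.0 - item_var / total_var)


# Hypothetical teacher ratings: 5 students (rows) on 4 items scored 0-3 (columns).
ratings = np.array([
    [0, 1, 0, 1],
    [1, 1, 1, 2],
    [2, 2, 1, 2],
    [3, 2, 3, 3],
    [0, 0, 1, 0],
])

alpha = coefficient_alpha(ratings)
print(f"coefficient alpha = {alpha:.3f}")
```

In practice this would be computed separately for each subscale, and reported alongside at least one other reliability estimate.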
Confirmatory factor analysis, a well-known process for confirming that observed data fit the hypothesized structure of the instrument, was the second most reported analysis. Conversely, item response theory (IRT) approaches to item evaluation are relatively underutilized with behavior screeners. IRT allows individuals’ scores and screener items to be placed along the same continuum of the latent trait to determine one’s location and an item’s discrimination (de Ayala, 2022). Thus, IRT may be particularly useful for identifying highly informative items for brief screeners to ensure adequate information around specific trait levels, like those used as cut scores (Embretson, 1996). Computer adaptive testing, an advanced application of IRT, could even be used to create hybrid screening and diagnostic behavioral assessments to more swiftly and precisely identify students’ behavioral strengths and areas for support (Chang, 2004). Multiple group confirmatory factor analysis and differential item functioning analyses are becoming more common as methods to detect measurement bias (Berry, 2015). We anticipate these analyses will continue to become more commonplace as researchers work to support equitable educational practices and new criteria are established for screening instruments (e.g., NCII, 2022).
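To illustrate how IRT can flag informative items near a cut score, the sketch below computes item information under a two-parameter logistic (2PL) model, where the information of an item at trait level theta is a^2 * P * (1 - P). This is an assumption-laden example: the 2PL model treats items as dichotomous, whereas many screeners use polytomous ratings, and the item names, parameter values, and cut-score location are all hypothetical.

```python
import math


def p_endorse(theta, a, b):
    """2PL probability of endorsing an item at trait level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at theta: a^2 * P * (1 - P)."""
    p = p_endorse(theta, a, b)
    return a * a * p * (1.0 - p)


# Hypothetical items: name -> (discrimination a, location b).
items = {
    "item_1": (0.8, 0.0),
    "item_2": (1.6, 1.0),
    "item_3": (1.2, 2.0),
}

cut_theta = 1.0  # hypothetical trait level corresponding to a cut score

# Rank items from most to least informative at the cut score.
ranked = sorted(
    items,
    key=lambda name: item_information(cut_theta, *items[name]),
    reverse=True,
)

for name in ranked:
    a, b = items[name]
    print(name, round(item_information(cut_theta, a, b), 3))
```

An item is most informative where theta equals its location b, and more discriminating items concentrate more information there, which is why a brief screener might favor items located near the cut score.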
We grouped analyses relating screening instruments to other variables (e.g., predictive, discriminative, convergent, and concurrent validity) to align with more recent conceptualizations of validity evidence (AERA et al., 2014). Correlations were the most common analysis, which is expected when they precede other analyses but leaves much room for expansion when they are reported alone. We hope more researchers will use analysis methods that can appropriately model the known nested structure of the data, with students nested within their classrooms (Huang, 2018). In the eight studies reporting ICC estimates for cluster effects, ICCs ranged from .02 to .55, which empirically supports the use of multilevel modeling to account for nesting (Geldhof et al., 2014). Overlooking cluster or rater effects can lead to underestimated standard errors and an inflated Type I error rate (McCoach & Cintron, 2022). There are several methods for correcting standard errors, though researchers in social sciences utilize hierarchical linear modeling most often (McNeish et al., 2017). Although these analyses are more complex to conduct and interpret, we believe collaboration with methodological experts may result in a more rigorous and nuanced understanding of how behavior screening instruments function in schools (e.g., Lane et al., 2024; Reinke et al., 2022).
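One way to see why the reported ICC range matters is the design effect for equal-sized clusters, DEFF = 1 + (m - 1) * ICC, which approximates how much sampling variance is inflated when nesting is ignored. The sketch below applies it to the lowest and highest ICCs reported above (.02 and .55). The classroom size of 20 and total sample of 500 are hypothetical values chosen only for illustration, and the formula is a rough guide, not a substitute for a multilevel model.

```python
def design_effect(icc, cluster_size):
    """Design effect for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1.0 + (cluster_size - 1) * icc


def effective_sample_size(n, icc, cluster_size):
    """Approximate number of independent observations after accounting for clustering."""
    return n / design_effect(icc, cluster_size)


# ICC bounds reported in the reviewed studies; other values are hypothetical.
n_students = 500
students_per_class = 20

for icc in (0.02, 0.55):
    deff = design_effect(icc, students_per_class)
    n_eff = effective_sample_size(n_students, icc, students_per_class)
    print(f"ICC = {icc:.2f}: design effect = {deff:.2f}, effective n = {n_eff:.1f}")
```

Even the smallest reported ICC reduces the effective sample size noticeably under these assumptions, and the largest reduces it by roughly an order of magnitude.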
Mixture modeling methods such as latent class analysis, latent profile analysis, and latent transition analysis were also relatively uncommon, appearing in only seven of the included articles. These techniques are known as person-centered approaches, which summarize data by grouping observations with similar response patterns (Schmiege et al., 2017). Mixture models can help researchers retroactively evaluate how the tool functioned for grouping students, which may be particularly useful when studying these tools in the context of tiered systems, though they should be used with caution for diagnostic purposes. Even when using the correct mixture model, individuals are not always classified into the correct group, especially with small or unequal groups (Cintron et al., 2023). In the context of tiered systems, screeners are often used to identify approximately 15% of students with some additional need and connect them with Tier 2 interventions, and to connect approximately 5% of students with the most intensive needs to Tier 3 interventions (Gresham et al., 2013; Lane et al., 2009), although these percentages have likely increased in the aftermath of the COVID-19 pandemic (Weist et al., 2023). Given these unequal and progressively smaller group sizes, classifications derived from mixture models may be inaccurate.
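The effect of unequal group sizes on classification can be illustrated with a deliberately simplified case: a two-class normal mixture whose parameters are treated as known, so that only the Bayes-rule assignment step is shown and no model is estimated. The class means (0 and 2) and common standard deviation (1) are hypothetical; the 5% and 15% class proportions echo the tier percentages cited above.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_small_class(x, prop_small, mean_small=2.0, mean_large=0.0, sd=1.0):
    """Posterior probability that a score x belongs to the smaller class (Bayes' rule)."""
    num = prop_small * normal_pdf(x, mean_small, sd)
    den = num + (1.0 - prop_small) * normal_pdf(x, mean_large, sd)
    return num / den


# A student scoring exactly at the smaller class's mean (hypothetical score of 2.0).
score = 2.0
for prop in (0.50, 0.15, 0.05):
    post = posterior_small_class(score, prop)
    assigned = "small class" if post > 0.5 else "large class"
    print(f"class proportion {prop:.2f}: posterior = {post:.2f} -> {assigned}")
```

Under these assumptions, a student whose score sits exactly at the smaller class's mean is still assigned to the larger class once the smaller class makes up only 5% of the population, which is one reason modal class assignment can misclassify students with the most intensive needs.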
Social validity was among the least common though longest-standing analyses identified in this review. Stemming from behavioral intervention research, social validity is a measure of the acceptability of goals, procedures, and outcomes as determined by invested parties (Wolf, 1978). For universal screening, social validity among informants is particularly important to understand barriers to implementation and ensure future participation in universal screening. We found several instruments available for assessing social validity for researchers interested in studying this further (e.g., ARP-R, Eckert et al., 1999; STR, Lane & Oakes, 2010; URP-A, Chafouleas et al., 2012). We encourage additional inquiry in this area to uncover barriers to universal screening implementation, which may show the need for further research and professional learning to address educator concerns.
Over the 45 years of school-based behavior screening research, researchers have utilized a broad range of methods. We are encouraged to see the growth of literature aimed at supporting educators and students in schools. In future articles, we recommend researchers provide demographic information on both informants and students when available and include detailed information on their procedures for data collection, especially if they limited their sample by some criteria. In terms of analyses, we suggest authors include two or more estimates of reliability by instrument subscale each time they evaluate an instrument. We also encourage researchers to report correlations between subscales and other measures used in the study prior to more complex analyses, including methods that appropriately account for the nesting of students within the classroom. With ever-expanding analytic options, we encourage researchers to carefully consider which method best aligns with their research questions and available data and select the most appropriate and robust methods to evaluate behavior screening instruments.
Limitations
The primary limitations of this scoping review relate to our inclusion criteria. First, we specified the article must be published in a peer-reviewed journal. Studies with null or negative results may not have been included due to publication bias (Cook et al., 2017). We hope the open science movement, coupled with the understanding that null effects also provide important contributions, will minimize the effects of this limitation in the coming years. For logistical reasons, our team was limited to studies published in English, which makes this scoping review more relevant for predominately English-speaking countries than other locales. We also limited our search to studies in which the screener was administered universally. Some studies used robust sampling procedures; however, many studies included little description of their sampling procedures or utilized techniques that could introduce bias into the study. Often, we could not evaluate the strength of the sampling procedures, so we agreed upon the universally administered criterion to ensure relevance for educators seeking a universal screening tool. Open science practices are calling for increasingly clear methods for future replication, and this limitation could be mitigated with those practices. The universal criterion also eliminated studies that examined screening for a subset of a school’s student population (e.g., English language learners, children in foster care, refugees). Although outside the scope of this review, we recognize the need and merit in studying the functionality of these instruments with more specific populations and we strongly encourage practitioners to reference those studies when they have a particular population they are hoping to screen. Despite these limitations, we are hopeful this review will assist educators in locating screening tools and inform future research in universal behavior screening.
Summary
We conducted this review to create a rigorous and comprehensive overview of the field, including an extensive list of behavior screening instruments for implementers, and to map how instruments have been studied to date. We located 56 instruments available for screening and 180 articles meeting our criteria that examined the psychometric properties of these tools. The most common analyses examined internal consistency, correlations with related variables, and factor analysis. We encourage researchers investigating systematic screening to continue to evolve with research methodology and to use the most appropriate and robust analytic techniques for their research questions. We hope in another 5 to 10 years the field will have continued the ongoing validation process for these instruments to ensure we are accurately and equitably screening for emotional and behavioral disorders in our school-age population. In the meantime, we are hopeful this scoping review will be useful to researchers as they conduct additional psychometric inquiry using current data analytic methodologies. In addition, and perhaps most importantly, we hope this review will help educational leaders decide which screening tools to adopt to inform instruction within integrated systems, supporting educators in providing positive and productive learning environments for preK–12 students.
Supplemental Material
Supplemental material, sj-docx-1-rer-10.3102_00346543251315168 for Mapping the Research Base for Universal Behavior Screeners by Katie Scarlett Lane Pelton, Kathleen Lynne Lane, Wendy Peia Oakes, Mark M. Buckman, David James Royer and Rebecca Lee Sherod in Review of Educational Research
Acknowledgements
I extend my sincere appreciation to the coauthors who contributed their valuable time and expertise to this student-led, unfunded scoping review as well as my advisor, Dr. Betsy McCoach, for ongoing support throughout this project.
Author Note
This manuscript was cited while in preparation under K. S. Lane et al., 2023, prior to the first author’s name change. Our coding sheet with information for each included article is available at
.
Authors
KATIE SCARLETT LANE PELTON, MA, is a doctoral student in educational psychology at the University of Connecticut in the Research Methods, Measurement, and Evaluation program; email: katie.lane@uconn.edu. She is interested in school-based measurement, particularly behavioral measurement, and intervention research to ensure educators are equipped with feasible and effective instruments and interventions.
KATHLEEN LYNNE LANE, PhD, BCBA-D, CF-L2, is a Roy A. Roberts Distinguished Professor in the Department of Special Education at the University of Kansas and Associate Vice Chancellor for Research; email: kathleen.lane@ku.edu. Dr. Lane’s research interests focus on designing, implementing, and evaluating Comprehensive, Integrated, Three-tiered (Ci3T) models of prevention to (a) prevent the development of learning, behavior, and social and emotional well-being challenges and (b) respond to existing challenges in these areas, with an emphasis on systematic screening.
WENDY PEIA OAKES, PhD, is the Nadine Mathis Basha Professor of Early Childhood Education and Executive Strategist of the Institute of Learning Design and Discovery in the Mary Lou Fulton College for Teaching and Learning Innovation at Arizona State University; email: woakes@asu.edu. Her research focuses on designing, implementing, and evaluating Comprehensive, Integrated, Three-tiered (Ci3T) models of prevention for the prevention and intervention of students with and at risk for emotional and behavioral disorders and professional learning for preservice and in-service educators in implementing practices with fidelity.
MARK M. BUCKMAN, PhD, is an assistant research professor in the Department of Special Education at the University of Kansas; email: buckman@ku.edu. His research focuses on the implementation of comprehensive, integrated, three-tiered (Ci3T) models of prevention, behavior screening, and professional development.
DAVID JAMES ROYER, PhD, BCBA, is an associate professor at the University of Louisville, Kentucky; email: david.royer@louisville.edu. His research focuses on systems change via comprehensive, integrated, three-tiered (Ci3T) models of prevention; low-intensity strategies to increase engagement and prevent challenging behavior; Seeing Stars and Visualizing and Verbalizing reading curricula; and My IEP®, a student-directed individualized education program model for teaching students to lead their full IEP meeting.
REBECCA SHEROD, MSE, is a doctoral student in the Mary Lou Fulton College for Teaching and Learning Innovation at Arizona State University; email: rsherod@asu.edu. Her research focuses on supporting students with and at risk for emotional and behavioral disorders (EBD) and teachers who serve them within Comprehensive, Integrated, Three-tiered (Ci3T) models of prevention and other tiered systems of support.
