Abstract
Adhering to PRISMA guidelines, this study systematically reviews the reliability and validity evidence for two prominent sentence completion tests (SCTs): the Rotter Incomplete Sentences Blank (RISB) and the Washington University Sentence Completion Test (WUSCT). SCTs have often been conceptualized as projective tests and summarily dismissed as lacking efficacy in psychological assessment, though this characterization is not entirely accurate. Conceptually and methodologically, SCTs are more aptly characterized as performance-based measures. Interest in potential applications of SCTs with clinical populations persists. This review incorporated 51 studies conducted in the United States that met rigorous reliability and validity criteria. Although there is evidence to support the reliability and validity of the RISB and WUSCT, results vary widely depending on sample characteristics (e.g., age, race/ethnicity), study context, and the purpose of the score interpretations examined. Therefore, practitioners who utilize results from SCTs should consider these characteristics when engaging in interpretation.
Public Significance Statement
This article systematically reviews studies on the reliability and validity of two widely used sentence completion tests (SCTs) in psychological assessment: the Rotter Incomplete Sentences Blank (RISB) and the Washington University Sentence Completion Test (WUSCT). Despite significant skepticism surrounding this genre of assessment instrument, their clinical use persists. The variability in evidence supporting scores across assessment use and client characteristics underscores the critical need for nuanced and context-sensitive interpretation.
SCTs occupy a distinctive position within the broader domain of personality assessment, a position that calls for systematic evaluation of the empirical evidence relevant to their application. They are designed to elicit self-expressive responses that reflect individuals’ characteristic ways of perceiving themselves and others (Loevinger, 1957; Rotter & Rafferty, 1950). Responses generated through open-ended sentence stems rather than direct questions position SCTs as performance-based measures (PBMs). In Weiner’s (2013) characterization, PBMs are assessment methods that require individuals to demonstrate psychological processes through their performance, thereby providing information that surpasses what can be obtained through self-report. Although SCTs have historically been classified as projective techniques, this terminology oversimplifies their methodological and theoretical nature. The term “projective” derives from Frank’s (1939) hypothesis that individuals project unconscious internal needs and meanings onto ambiguous stimuli—but it does not define a distinct category of assessment.
Contemporary scholarship calls for conceptualizing SCTs as PBMs of personality that sample behavior, cognition, and affect under standardized conditions (McGrath & Carroll, 2012; Meyer & Kurtz, 2006). Building on Meyer and Kurtz’s (2006) call for more informative taxonomies of assessment methods, subsequent commentaries have emphasized differentiating techniques according to the psychological processes they engage—such as expressive, perceptual, narrative, or self-evaluative operations (Bornstein, 2011; Schultheiss, 2007). Bornstein (2011) has suggested that personality assessment methods should be classified not by presumed theoretical allegiance but by the mechanisms through which examinees generate responses. Within this perspective, SCTs represent a distinct methodological class that engages controlled verbal expression and reflective meaning-making, distinguishing them from the perceptual processing demands of inkblot methods or the narrative construction required by storytelling tasks.
In the absence of a consensus term, we adopt PBMs because it aptly captures the methodological essence of SCTs by eliciting observable performance under standardized conditions while avoiding the historical and theoretical assumptions associated with the term projective. Personality assessment tools vary in the degree to which they structure or constrain a respondent’s behavior, forming a continuum that ranges from highly structured self-report inventories to minimally structured expressive tasks. Although PBMs are less structured than self-report inventories or rating scales completed by collateral informants, there is no sharp subjective/objective divide; assessment instruments are best conceptualized as differing in the degree of stimulus ambiguity (Meehl, 1945).
Although survey data indicate that their overall use has declined (Benson et al., 2019; Piotrowski, 2015; Wright et al., 2016), PBMs continue to represent significant components of practice in certain specialties such as inpatient, forensic, and educational settings (Wright et al., 2016). For example, PBMs, such as SCTs, are used in approximately 11.3% (vs. 22.6% for the Wechsler Adult Intelligence Scale) of child custody evaluations (Mathy, 2019). Given that custody evaluations are widely viewed as among the most complex, technically demanding, interpersonally challenging, and legally contentious types of psychological assessments (Benjamin et al., 2017), the continued allowance of PBMs by the courts is noteworthy. At the same time, PBMs of personality continue to attract scholarly attention, though often within the context of long-standing debates about the scientific credibility of projective methods. Influential critiques have questioned their empirical foundation relative to self-report inventories (e.g., Stemplewska-Żakowicz & Paluchowski, 2013; Weiner, 2013). However, these critiques frequently treat projective techniques as a monolithic category. While such scholarship is valuable for highlighting psychometric concerns, it tends to overgeneralize, obscuring important distinctions among subclasses such as storytelling, inkblot, drawing, and SCTs. The conceptual shift toward PBMs provides an opportunity to reconsider these instruments within a methodological framework that emphasizes observed performance, standardized administration, and empirically testable constructs (Bornstein, 2011; Meyer & Kurtz, 2006). Within this framework, SCTs represent a distinctive PBM subclass that operationalizes personality assessment through structured verbal production. Far from being obsolete, SCTs remain in active use across clinical, educational, and research contexts, particularly for assessing ego development, psychological adjustment, and personality integration (Lambie, 2007; McCloskey, 2014; Piotrowski, 2019, 2024).
Surveys of practitioners consistently identify SCTs as among the most frequently used performance-based techniques (Holaday et al., 2000), and they have been embedded for decades in personality assessment courses, developmental research programs, and applied evaluations. Among the wide range of SCTs developed over the past century, the RISB (Rotter & Rafferty, 1950) and the WUSCT (Loevinger, 1957, 1998) stand out as the two most extensively researched and widely applied instruments, and they are the only SCTs with the empirical foundations, standardized scoring systems, and decades of validity research needed to support a systematic evidence review.
The RISB was designed to assess personal adjustment through respondents’ attitudes, conflicts, and affective concerns expressed in sentence completions (Churchill & Crandall, 1955; Rotter & Willerman, 1947). Its standardized scoring system classifies individuals as “normal,” “maladjusted,” or “questionably adjusted” based on the coherence, emotional tone, and defensiveness of responses (Lah & Rotter, 1981; Rotter et al., 1954). Later work has broadened interpretive applications to include interpersonal functioning, affect regulation, and psychopathology (Ames & Riggio, 1995; Fuller et al., 1982; McCloskey, 2014; Torstrick et al., 2015; Weis et al., 2008). Survey research with members of the Society for Personality Assessment found that the RISB was the most frequently used SCT, administered to nearly half of respondents’ adult clients and one third of adolescents (Holaday et al., 2000).
The WUSCT, by contrast, was developed as a measure of ego development grounded in Loevinger’s structural model of personality organization (Hoppe & Loevinger, 1977; Loevinger, 1957). The WUSCT is the most extensively studied and widely used standardized measure of ego development (Jorgenson, 2017; Manners & Durkin, 2001). Decades of research have extended the measure’s applications to domains such as identity status (Adams & Fitch, 1981; Adams & Shea, 1979), object relations and social competence (Avery & Ryan, 1988; Larson et al., 2007), and adaptive functioning in clinical and community samples (Jennings & Armsworth, 1992; Lambie, 2007; Wilber et al., 1982).
Collectively, the WUSCT and RISB possess stronger empirical foundations, clearer interpretive frameworks, and broader applications than other SCTs. Both instruments benefit from decades of psychometric research, replicable scoring procedures, and well-articulated theoretical underpinnings, contributing to their enduring relevance in contemporary personality and developmental assessment (American Educational Research Association [AERA] et al., 2014; Holaday et al., 2000; McGrath & Carroll, 2012). Despite their widespread use and substantial research history, no systematic review has synthesized the evidence on reliability and validity for the WUSCT and RISB, leaving the field without an integrated evaluation of the empirical foundations of these extensively studied and widely applied SCTs.
Purpose of the Study
The primary aim of this study was to systematically review and evaluate the evidence supporting the reliability and validity of scores derived from two of the most extensively researched SCTs: the WUSCT and the RISB. Despite limited recent publications, these measures remain among the most psychometrically developed and conceptually coherent PBMs, offering structured yet expressive indices of personality functioning. As the field increasingly emphasizes evidence-based standards, renewed evaluation of these measures is essential to determine their current scientific and applied value. To address this need, we conducted a comprehensive systematic review to synthesize empirical findings accumulated over several decades, clarifying the reliability, validity, and interpretive foundations of the WUSCT and RISB. The review identifies strengths, limitations, and gaps in the literature to inform contemporary assessment practice and guide future research on sentence completion methodologies within an evidence-based framework consistent with the Standards for Educational and Psychological Testing (AERA et al., 2014).
Method
This systematic review adhered to Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (Page et al., 2021). The initial step involved formulating a plan designed to uphold the scientific integrity of the process, as described below.
PICOT Statement
A PICOT statement (Haynes, 2012) was generated as part of the study plan and prior to the search. A PICOT statement offers a specialized framework that aids researchers in formulating questions and facilitating the literature review, and it is easily modifiable to account for other variables. The search plan was developed in alignment with the study goal of systematically reviewing evidence of reliability and validity for scores derived from projective tests. The following PICOT (Population, Instrument, Corroboration, Outcome, Time Frame) statement guided the study:
P (Population): Participants of any age, sex, race, or ethnicity living in the United States
I (Instrument): Sentence completion tests
C (Corroboration): Evidence of reliability or validity
O (Outcome): Decisions regarding score interpretation or use
T (Time): Is there a difference in study results over time?
To identify studies for inclusion, several databases were searched: PsycINFO, Embase, Scopus, Web of Science, and Medline PubMed through EBSCO. Alternative search approaches to identify relevant articles not available in the above databases were used: (a) hand searching of the references of included articles identified in the searched databases and (b) conducting citation tracking of published articles. A formal meta-analysis could not be conducted due to the significant variability in the methods, purposes, and projective measures used in the studies that were included in this review.
Search Terms
The databases were searched using various combinations of the following search terms: “interrater agreement,” “interrater concordance,” “interrater reliability,” “intercoder reliability,” “test-retest,” “test-retest reliability,” “test retest reliability,” “result reliability,” “test reliability,” “reproducibility of results,” “predictive validity,” “concurrent validity,” “convergent validity,” “discriminant validity,” “clinical utility,” “classification accuracy,” “diagnostic accuracy,” “diagnostic utility,” “treatment utility,” “test validity,” “face validity,” or “result validity” and “projective,” “apperception,” “projective test,” “test validity,” “projective technique,” “sentence completion,” “human figure drawings,” “projective personality test,” “thematic apperception test,” “children’s apperception test,” “house-tree-person,” “draw-a-person,” “Rotter,” “Rorschach,” or “Miner.” Although the reliability and validity of projective techniques other than SCTs were not directly analyzed, it was important to consider a broader range of projective techniques in the search terms to identify studies that may have compared these measures to SCTs.
Inclusion/Exclusion Criteria
Inclusion and exclusion criteria were established for reliability and validity with each psychometric property considered separately. For a study to be included in the analysis of reliability it had to meet the following criteria: (a) involve administration of the RISB and/or the WUSCT and (b) report correlation coefficients to estimate reliability, rather than merely reporting percent agreement between raters (Cronbach & Shavelson, 2004; McCrae et al., 2011). Studies that met these criteria are presented in Figure 1.

Flowchart of Studies Identified That Discussed Reliability.
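To make the rationale behind criterion (b) for reliability studies concrete, the brief sketch below contrasts raw percent agreement with a chance-corrected coefficient; the ratings and the helper functions percent_agreement and cohens_kappa are invented for illustration. Two raters can agree on most protocols simply because both assign one category frequently, which is why chance-corrected indices rather than percent agreement were required for inclusion (Cronbach & Shavelson, 2004; McCrae et al., 2011).

```python
from collections import Counter

def percent_agreement(a, b):
    """Proportion of cases on which two raters assign the same category."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = percent_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    # Expected chance agreement from each rater's marginal category rates.
    p_e = sum((ca[k] / n) * (cb[k] / n) for k in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

# Two hypothetical raters classifying 10 protocols as include/exclude.
r1 = ["inc"] * 8 + ["exc"] * 2
r2 = ["inc"] * 7 + ["exc", "inc", "exc"]
print(percent_agreement(r1, r2))        # 0.80 -- looks high
print(round(cohens_kappa(r1, r2), 2))   # 0.38 -- far lower once chance is removed
```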
For a study to be included in the analysis of validity, it needed to meet the following criteria: (a) involve administration of the RISB and/or the WUSCT, (b) report an interrater reliability in projective test scoring greater than .80, and (c) report one or more of the following: correlation coefficients used to estimate validity, effect size metrics based on differences between means or the information needed to calculate them, effect size metrics for associations among categorical variables or the information needed to calculate them, confidence intervals for differences between the means of two related groups, or diagnostic utility statistics (Frick & Loney, 2000; Smith & Smithson, 2015). Studies that met these criteria are presented in Figure 2.

Flowchart of Studies That Discussed Validity.
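Criterion (c) permits effect sizes to be computed from summary statistics alone. As a hedged illustration (all numbers are hypothetical, and the function names are ours), the sketch below derives a standardized mean difference from group means and standard deviations and a phi coefficient from a 2 × 2 classification table, the two kinds of metrics the criterion references (Frick & Loney, 2000; Smith & Smithson, 2015).

```python
import math

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled

def phi_from_2x2(a, b, c, d):
    """Phi coefficient for a 2x2 table [[a, b], [c, d]]."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical clinical vs. nonclinical group means/SDs on an SCT total score.
print(round(cohens_d(142.0, 18.0, 72, 121.0, 16.0, 69), 2))
# Hypothetical referred/non-referred status crossed with above/below cutoff.
print(round(phi_from_2x2(45, 15, 20, 40), 2))
```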
Studies were excluded if they met any of the following criteria: (a) participants were sampled exclusively from a country outside of the United States, (b) the study was a dissertation, (c) the information reported focused exclusively on projective techniques that were not being compared with an SCT, (d) the study focused on cognitive problems related to aging, (e) the study was a case report, and (f) the study was nonempirical.
Exclusion criteria were designed to ensure that the synthesized evidence was methodologically rigorous and relevant to the intended context. Studies conducted outside the United States were excluded because validity is specific to particular scores, uses, and populations; cultural and linguistic differences can alter item functioning and score interpretation (AERA et al., 2014). Dissertations were omitted because they rarely affect review conclusions and often provide incomplete or inconsistently reported data, adding workload with minimal evidentiary benefit (Briscoe et al., 2023; Vickers & Smith, 2000). Studies of other projective techniques were excluded to maintain conceptual focus (Frank, 1939; Teglasi, 1998; Weiner & Kuehnle, 1998). Investigations centered on cognitive decline or aging-related processes were excluded because they predominantly address neurocognitive rather than personality or psychological adjustment constructs, making their score interpretation and validity evidence distinct and not commensurate with the adjustment or ego-development foci of the review (Caselli et al., 2016). Finally, nonempirical studies (e.g., theoretical reviews, commentary pieces) were excluded because validity and reliability arguments require empirical data to support score interpretation, and nonempirical sources cannot provide the statistical evidence necessary for a rigorous systematic review.
Coding of Variables
The titles, references, and abstracts from the EBSCO search results were uploaded to Covidence (2023), a web-based collaboration software platform that streamlines the production of systematic and other literature reviews. The authors of this study completed article screenings and reviews. Articles pertaining to reliability and validity were screened separately and were independently coded by two reviewers to mitigate any inclusion/exclusion bias. A third reviewer examined articles for which there was a disagreement about whether the article should be included. After the initial screening, full texts of the articles were uploaded into Covidence and further screened independently by two reviewers for inclusion in this review. Variables coded included the study design, reliability or validity evidence reported, participant ages, total sample size, clinical disorder discussed, effect size, mean, and standard deviation. Each article included in the review was coded by one or more authors.
Inclusion and Coding Agreement
Four authors independently reviewed the studies returned through the search procedures to determine which articles met this study’s inclusion criteria. Each article was reviewed and coded by two authors. Cohen’s kappa was used as a measure of interrater reliability to ensure that chance agreement did not overly impact the inclusion of studies in this article (McHugh, 2012). When a disagreement occurred between two authors on whether a study should be included, a third author reviewed the article to resolve the disagreement. Although interrater reliability among the reviewers was generally high, lower agreement rates involving one of the reviewers contributed to some lower interrater reliability metrics. This discrepancy may be attributed to differing interpretations of the evaluation criteria and highlights the importance of having a third reviewer resolve potential disagreements regarding inclusion of articles. In addition, this review was limited to English-language studies conducted in the United States to help ensure relevance to populations, policies, and health systems within the United States. The authors were only able to review studies published in English; translating non-English publications was beyond the scope of the review at this time.
Transparency and Openness
Materials and analysis information for this study are available by emailing the corresponding author. No elements of this study were preregistered. This paper was a systematic review that involved no additional analysis of new data. Artificial intelligence, specifically ChatGPT (OpenAI, 2025), was used to help generate the tables included in this article based on numerical data written in paragraph form by the authors. Prompts provided to ChatGPT included only data and statistical results (e.g., “please create an APA style table that summarizes the following descriptive statistics:”) that were already reported in the manuscript, and no new data or additional analyses were generated by ChatGPT. All use of artificial intelligence occurred after the manuscript was written, and all content generated in any way by artificial intelligence was checked for accuracy, clarity, and adherence to APA standards by the authors before being included.
Results
Five hundred and seventy-seven studies met the initial search criteria based on their titles and abstracts and were coded for subsequent analysis. Ultimately, 51 studies met the full-text inclusion criteria. The independent review of the articles resulted in adequate interrater reliability, both when coding the title/abstracts (Table 1) and in coding the full text of the articles (Table 2). Consensus coding was used to address and agree upon any differences in coding that occurred between raters.
Title/Abstract Interrater Reliability (IRR).
Note. Too few co-coded articles to accurately calculate Cohen’s kappa.
Full Text Interrater Reliability (IRR).
Articles that met inclusion criteria for this study reported measures of reliability such as interrater reliability, split-half reliability, Cronbach’s alpha, and test–retest reliability (Table 3). In addition, the articles that met inclusion criteria for this study reported measures of validity such as concurrent validity, predictive validity, construct validity, discriminant validity, and convergent validity (Table 4).
Types of Reliability Reported by Articles Included in This Study.
Types of Validity Reported by Articles Included in This Study.
RISB
The RISB (Rotter & Rafferty, 1950) is an SCT that was first developed in 1950 as a way of evaluating psychological adjustment and personality. It consists of sentence stems that participants are asked to complete, and their responses are scored to yield an overall index of “adjustment” or “maladjustment” (Torstrick et al., 2015). The RISB remains one of the most widely used sentence completion tools because of the simplicity of its administration, the flexibility of its use, and the level of nuanced information it can provide to a psychological evaluation (Holaday et al., 2000).
Studies of RISB Reliability
Four studies examined the reliability of the RISB when used with populations of school-age children. One such study reported high reliability for RISB scoring.
Four studies examining the reliability of the RISB in a mix of adolescents and adults found that interrater reliability ranged from .88 to .95 (Ames & Riggio, 1995; Arnold & Walter, 1957; Jessor et al., 1963; Lah & Rotter, 1981), while agreement between initial and rescored records ranged from .90 to .95 after intervals of “many” years and reached .97 after a four-week interval (Arnold & Walter, 1957; Jessor et al., 1963). Arnold and Walter (1957) found that split-half reliability was .76. Ames and Riggio (1995) found a Cronbach’s alpha of .91 for internal consistency.
Twelve studies examining the reliability of RISB scores using samples of adults found that interrater reliability of the total scores ranged from .81 to .98 (Baker & King, 1970; Edwards & Sapp, 2002; Gardner, 1967; Jessor & Hess, 1958; Lah, 1989; Logan & Waehler, 2001; McCloskey, 2014; Rotter et al., 1949; Rotter & Willerman, 1947; Tempone & Lamb, 1967; Torstrick et al., 2015). Split-half reliability ranged from .77 to .96 depending on the gender of the study participants, with females tending to have the higher split-half reliability (Gardner, 1967; Rotter et al., 1949; Rotter & Willerman, 1947; Torstrick et al., 2015). Test–retest reliability ranged from .38 to .54 (Churchill & Crandall, 1955). Gardner (1967) found an interrater reliability of .80 across an eight-month interval. Torstrick et al. (2015) found a Cronbach’s alpha of .81 for internal consistency.
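To clarify the internal-consistency indices cited above, the sketch below simulates item-level data in the RISB format (40 stems, each scored on a 0–6 scale; the data-generating values and seed are arbitrary) and computes an odd–even split-half coefficient with the Spearman–Brown correction alongside Cronbach’s alpha.

```python
import numpy as np

def split_half(scores):
    """Odd-even split-half r, stepped up to full length via Spearman-Brown."""
    odd = scores[:, 0::2].sum(axis=1)
    even = scores[:, 1::2].sum(axis=1)
    r = np.corrcoef(odd, even)[0, 1]
    return 2 * r / (1 + r)

def cronbach_alpha(scores):
    """alpha = k/(k-1) * (1 - sum of item variances / total-score variance)."""
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

# Simulate 40 respondents x 40 stems scored 0-6, driven by a latent trait.
rng = np.random.default_rng(0)
trait = rng.normal(3, 1, size=(40, 1))
scores = np.clip(np.round(trait + rng.normal(0, 1, size=(40, 40))), 0, 6)
print(round(split_half(scores), 2), round(cronbach_alpha(scores), 2))
```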
Across 20 studies examining RISB reliability (Table 5), a clear consensus emerges for strong interrater reliability, which was reported in 16 studies and consistently exceeded .80 across age groups. Test–retest reliability was examined in only three studies and yielded highly variable coefficients, limiting firm conclusions about temporal stability. Overall, the available findings indicate that RISB scoring procedures yield reliable estimates across developmental levels, but longitudinal research is needed to clarify stability over time.
Reliability and Validity Metrics for RISB Studies.
Studies of RISB Validity Using Samples of School-Age Children
Two studies were identified that examined the validity of the RISB in school-aged children. Rotter et al. (1954) developed a scoring manual for a high school RISB form to differentiate between clinic-referred “maladjusted” adolescents and their non-referred peers. The criterion group included 241 adolescents (56 referred, 185 non-referred), while the cross-validation study involved 256 adolescents (118 referred, 138 non-referred). For females, RISB scores correlated positively with both psychologist interview ratings and sociometric criteria, with weaker associations observed for males.
Fuller et al. (1982) studied a “delinquent” group consisting of 30 adolescent male students who had been placed in a residential treatment center and a “non-delinquent” group consisting of 30 randomly selected male high school students. They then developed a short RISB form with a revised maladjustment score, which was cross-validated with 48 male high school students and 67 male “delinquent” students. The form accurately identified 82% of the original 60 participants, 75% of the high school students, and 80% of the “delinquent” students in the cross-validation sample.
In summary, validity evidence for the RISB in school-age samples supports its utility for group discrimination. Across studies, RISB scores differentiated delinquent or clinic-referred adolescents from non-referred peers with moderate-to-high classification accuracy (approximately 65–85%), while concurrent validity correlations with interview and sociometric criteria were small to moderate, particularly for males. Gender differences were evident, with stronger associations and higher classification accuracy for females, suggesting that RISB validity in youth may be influenced by developmental and gender-related factors and may be strongest when used for categorical screening purposes.
Studies of RISB Validity Using Mixed Samples of Children and Adults
Ames and Riggio (1995) produced the only study examining the validity of the RISB in a mixed sample of children and adults. Their study examined the correlation between RISB scores and the Maudsley Personality Inventory (MPI) in high school and college students. Correlations between RISB maladjustment scores and MPI neuroticism were stronger among the college students than among the high school students, and female adolescents were classified as “maladjusted” at higher rates than in the original validation samples.
In summary, the limited evidence from mixed-age samples suggests that RISB validity may strengthen with age, as reflected in higher associations with neuroticism among college students compared with high school students. Findings suggest that adolescent RISB scores may differ from original validation samples, with higher rates of maladjustment classification among female adolescents. These patterns indicate that developmental context may influence score interpretation and that caution is warranted when applying uniform norms and cutoff scores across age groups.
Studies of RISB Validity Using Samples of Adults
Eight studies examining the validity of the RISB in adults were reviewed. Rotter et al. (1949) examined the RISB as a measure of maladjustment in college students. Criterion groups included 42 women (20 classified as “normal,” 15 as “maladjusted,” and 7 as “questionably adjusted”) and 33 men (13 classified as “normal,” 15 as “maladjusted,” and 5 as “questionably adjusted”). Separate scoring manuals for males and females were developed, with common scoring issues addressed using an additional 16 women and 20 men. In total, 58 women and 53 men were included in the development of the manuals. Additional analysis of records from 82 females and 124 males demonstrated strong correlations between RISB scores and psychologists’ adjustment ratings (“adjusted” or “maladjusted”) for both males and females.
Gardner (1967) administered the RISB to 50 outpatients at a narcotic rehabilitation center (20 male heroin users, 20 male pill users, 10 female heroin users). Using a cutoff score of 135, the RISB correctly identified 80% of male heroin users, 90% of male pill users, and 100% of female heroin users.
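Gardner’s cutoff-based classification can be expressed directly in code. In the sketch below, only the cutoff of 135 is taken from the study; the scores and the hit_rate helper are fabricated to show how a correct-identification rate is computed.

```python
CUTOFF = 135  # maladjustment cutoff reported by Gardner (1967)

def hit_rate(scores, cutoff=CUTOFF):
    """Proportion of known clinical cases whose RISB totals reach the cutoff."""
    return sum(s >= cutoff for s in scores) / len(scores)

# Fabricated RISB totals for ten clinic cases (higher = more maladjusted).
male_heroin = [150, 141, 138, 129, 162, 144, 136, 140, 137, 128]
print(f"correctly identified: {hit_rate(male_heroin):.0%}")  # 80% in this toy set
```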
Churchill and Crandall (1955) studied the use of the RISB with college students and mothers. For 65 women and 24 men who entered counseling, and 123 women and 132 men who did not, biserial correlations between RISB scores and counseling utilization were slightly higher for females than for males.
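For readers unfamiliar with the biserial correlation used by Churchill and Crandall, the sketch below implements the standard formula, which treats the dichotomous criterion (entered counseling or not) as a split on an underlying normal variable; the data, function name, and parameters are ours, not the study’s.

```python
import numpy as np
from scipy.stats import norm

def biserial_r(x, group):
    """Biserial r between a continuous score and an artificial dichotomy."""
    x, group = np.asarray(x, float), np.asarray(group, bool)
    p = group.mean()              # proportion in the "counseling" group
    y = norm.pdf(norm.ppf(p))     # normal ordinate at the split point
    m1, m0 = x[group].mean(), x[~group].mean()
    return (m1 - m0) / x.std() * (p * (1 - p) / y)

# Simulated data: higher RISB totals make entering counseling more likely.
rng = np.random.default_rng(1)
risb = rng.normal(125, 20, 200)
counseling = risb + rng.normal(0, 25, 200) > 140
print(round(biserial_r(risb, counseling), 2))
```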
Torstrick et al. (2015) studied 72 individuals either currently or previously receiving psychiatric treatment and 69 nonclinical undergraduate college students, finding that RISB scores effectively distinguished between the two groups.
McCloskey (2014) administered the RISB and six tests of adjustment to 41 recent admissions to psychotherapy at community mental health centers. The overall RISB score showed a range of correlations with the six adjustment measures.
Tempone and Lamb (1967) examined adult outpatients at a mental health clinic, relating RISB adjustment scores to repression-sensitization.
Lah (1989) examined the RISB using 116 undergraduate students involved in sororities and fraternities and found that the RISB total scores correlated with a sociometric measure of adjustment created by the author for the purpose of the study. The highest correlations were for items associated with aspects of adjustment.
Logan and Waehler (2001) examined potential influences of racial group and socially desirable response tendencies on the RISB in 94 African American and 100 White undergraduate students. The authors found that overall RISB scores correlated negatively with socially desirable responding, and African American students were classified as “maladjusted” at disproportionately higher rates when the standard cutoff score was applied.
In summary, adult studies provide the strongest and most consistent evidence of RISB validity, with scores showing moderate to strong associations with clinician ratings, psychopathology indicators, and related personality constructs, as well as moderate-to-large group differences between clinical and nonclinical samples. Classification accuracy was often high in clinical contexts, though performance varied by setting and criterion. At the same time, adult findings revealed systematic demographic effects, including higher maladjustment scores among women and disproportionate classification of African American participants when standard cutoffs were applied, indicating that while RISB validity is most robust in adulthood, interpretive caution remains necessary.
Overall Summary of RISB Validity Findings
In summary, the preponderance of the studies suggests that the RISB demonstrates acceptable internal consistency, as well as convergent and predictive validity. Internal consistency estimates for total scores and maladjustment indices, most often reported as Cronbach’s alpha, generally fell in the acceptable-to-strong range, with coefficients ranging from approximately .78 to .93 across studies. Convergent and predictive validity were supported through associations with clinician adjustment ratings, neuroticism, repression-sensitization, counseling utilization, indices of psychopathology, and life satisfaction, with validity coefficients typically ranging from small to strong depending on the criterion, sample, and context.
Across school-age, mixed-age, and adult samples, RISB scores demonstrated meaningful discriminant validity, consistently differentiating referred from non-referred participants, counseling seekers from non-seekers, and clinical from non-clinical groups. Across studies, classification accuracy for these distinctions typically ranged from approximately 54% to 85% when established or empirically derived cutoff scores were applied, with several investigations reporting moderate to large group differences.
Washington University Sentence Completion Test (WUSCT)
The WUSCT (Loevinger, 1957, 1998) was initially designed to use sentence completion stems to assess an individual’s ego development. The WUSCT is grounded in Loevinger’s stage theory of ego development, and its continued use reflects both its status as one of the premier tools for evaluating ego development and the additional nuance that many providers believe it adds to a psychological evaluation (Manners & Durkin, 2001).
Studies of WUSCT Reliability
Nine studies examined the reliability of the WUSCT using school-age participants. These studies found that interrater reliability ranged from .57 to .96 (Avery & Ryan, 1988; Borst et al., 1991; Frank & Quinlan, 1976; Holt, 1980; Hoppe & Loevinger, 1977; Maron & Rock, 1984; Noam et al., 1984). Test–retest reliability ranged from .14 to .92 (Frank & Quinlan, 1976; Kitchener et al., 1984; Redmore & Loevinger, 1979). A significant portion of the variability in test–retest reliability was reported by Redmore and Loevinger (1979; Table 6).
Test–Retest Correlations Reported by Redmore and Loevinger (1979).
A study examining the reliability of the WUSCT with a mix of adolescent and adult participants found that interrater reliability was .93 for the Total Protocol Rating (TPR) and .88 for the item sum scores (Sutton & Swensen, 1983).
Nineteen studies examined the reliability of the WUSCT with adult participants. These studies found that interrater reliability for TPR and item sum scores ranged from .70 to .94 (Adams & Fitch, 1981; Adams & Shea, 1979; DeMoss & McCann, 1997; Jennings & Armsworth, 1992; Lambie, 2007; Lambie et al., 2010; Lambie & Ieva, 2012; Lanning et al., 2007; Larson et al., 2007; Loevinger et al., 1985; Luthar et al., 2001; Novy, 1993; Novy & Francis, 1992; Picano, 1987; Shea et al., 1978; Waugh, 1981; Welfare et al., 2013; White, 1985; Wilber et al., 1982).
Considered collectively, the 29 available studies provide moderately strong evidence of scoring reliability for the WUSCT across age groups, with interrater coefficients commonly falling in the acceptable-to-strong range (.70–.94) despite variability across studies and samples. Test–retest reliability was reported far less frequently and showed substantial fluctuation, with coefficients ranging from very low to very high (approximately .14 to .92), limiting confidence in the temporal stability of ego development scores.
Studies of WUSCT Validity Using Samples of School-Age Children
Six studies examined the validity of the WUSCT in childhood. Avery and Ryan (1988) recruited 92 children in grades 4 to 6 with no known disorders at an urban magnet school. The WUSCT TPR was examined for relationships with the Blatt Parent Nurturance Scores of the Blatt Object Representation Scale (BORS) created by Blatt et al. (1981), the Teacher Rating Scale of Achievement and Social Adjustment (TRS) created by Ryan et al. (1985), the Child Rating Scale (CRS) created by Hightower et al. (1987), the Gesten Class Wheel (GCW) created by Gesten (1979), and the Metropolitan Achievement Tests (MAT) created by Durost et al. (1970). Scores on the WUSCT correlated with the Academic Competence subscale of the TRS and with reading achievement on the MAT, with coefficients generally in the small-to-moderate range.
Frank and Quinlan (1976) recruited 66 Black and Puerto Rican “inner-city” adolescent girls: 25 “delinquents” from a city institution, 25 “nondelinquents” from a settlement house recreational program, and 16 from the settlement’s leadership training program. When WUSCT scores were adjusted for intelligence as measured by the Peabody Picture Vocabulary Test, “delinquent” girls scored at significantly lower levels of ego development than their “nondelinquent” peers.
Borst and colleagues (1991) recruited 219 adolescent psychiatric inpatients with an affective disorder, conduct disorder, or mixed conduct-affective disorder. Participants were classified as either “suicide attempters” or “non-suicidal.” Chi-square analyses showed a relation between suicide attempts and ego development on the WUSCT, χ²(1) = 8.65, p < .01.
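The chi-square analysis reported by Borst et al. is mechanically straightforward to reproduce from a contingency table. The cell counts below are hypothetical (chosen only so the table sums to the study’s 219 inpatients), so the resulting statistic is illustrative and will not match the published χ²(1) = 8.65.

```python
from scipy.stats import chi2_contingency

# Rows: attempters vs. non-suicidal; columns: lower vs. higher ego level.
# Counts are invented; only the total n of 219 mirrors Borst et al. (1991).
table = [[40, 60],
         [35, 84]]
chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.3f}")
```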
Hoppe and Loevinger (1977) studied 107 boys from grades 8, 9, and 11 at a private school to explore the relationship between ego development and conformity using a self-report measure of conformity developed by Hoppe (1973). Their analysis revealed a significant quadratic trend in self-reported conformity scores across ego levels based on the item sum, consistent with conformity peaking at the middle (Conformist) stages of ego development.
Maron and Rock (1984) recruited 17 adolescent girls at a coeducational Orthodox Jewish high school. Ego development, as measured by the WUSCT, was related to two of the three scales of religiosity derived from a structured interview, one of which was God’s Relationship to Man (χ² = 12.9).
In summary, validity evidence for the WUSCT in school-age samples centers on group discrimination and theoretically predicted associations. Across studies, ego development scores differentiated “delinquent” from “nondelinquent” girls, distinguished suicide attempters from non-suicidal psychiatric inpatients, and related to academic competence, conformity, and religiosity in directions consistent with Loevinger’s theory. Effects varied by criterion and sample, however, suggesting that the validity of WUSCT scores in youth depends on the construct being predicted and the characteristics of the group assessed.
Studies of WUSCT Validity Using Mixed Samples of Children and Adults
Sutton and Swensen (1983) recruited 70 participants made up of children and adolescents from a juvenile detention center, a junior high school, and a high school, as well as adults from a university, a group of college graduates and graduate students, and a group of retired university professors. All participants completed the WUSCT, the Thematic Apperception Test (TAT; Murray, 1943), and an unstructured interview. The authors used Loevinger and Wessler’s developmental framework (1970) to score participant responses for level of ego development and achieved high interrater agreement for item scoring (.88, with .93 for the Total Protocol Rating, as noted above), with WUSCT ego levels showing concurrent agreement with ego development ratings derived from the TAT and interview responses.
In summary, evidence from mixed-age samples is limited to a single study, which suggests convergence between WUSCT ego development scores and ego-level ratings derived from other performance-based methods and interviews. This evidence base is too thin to support firm conclusions about WUSCT validity across the full age range, and caution is warranted when generalizing score interpretations across developmental levels.
Studies of WUSCT Validity Using Samples of Adults
Five studies examining the validity of the WUSCT in adulthood were reviewed. Welfare and colleagues (2013) recruited 120 counselors-in-training and practicing counselors with no known disorders. Items on the WUSCT that seemed specifically relevant to the work of counselors were used to assess overall cognitive complexity. In a linear regression, the WUSCT was a statistically significant predictor for two of the four criterion variables, including the number of positive client traits identified by counselors for clients with whom they felt effective.
Reliability and Validity Metrics for WUSCT Studies.
Larson et al. (2007) evaluated 133 participants who had initially been recruited as adolescents for a longitudinal study of psychosocial development. This study took place 11 years after the participants had been recruited. Seventy-four participants had been hospitalized as adolescents for anxiety, depression, behavioral, or other disorders, and 59 participants were originally recruited from a demographically matched control group of high school students. Ego development differentiated the formerly hospitalized group from the control group.
White (1985) recruited 163 women from a nurse practitioner training program and administered various personality tests, including the Adjective Check List (ACL) created by Gough and Heilbrun (1965), the California Psychological Inventory (CPI) created by Gough (1957), and the Rotter Internal-External Locus of Control Scale (LCS) created by Rotter (1966). A questionnaire on attitudes toward children and childrearing experiences was also given, assessing satisfaction related to rearing children, being a mother, whether the children were planned, and the number of children. Higher ego levels were positively related to five of the eight ACL scales examined.
Luthar and colleagues (2001) recruited 91 mothers from the community. Participants were given several measures assessing difficulties related to mothers’ feelings in the maternal role, including the Brief Symptom Inventory (BSI) developed by Derogatis (1993), the State-Trait Anger Expression Scale (STAXI) developed by Spielberger (1996), the Addiction Severity Index (McLellan et al., 1990), the Parent-Child Relationship Inventory (PCRI) developed by Gerard (1994), the Parenting Stress Index Short Form (PSI) developed by Abidin et al. (2006), and vignettes on family relations designed for this study. Ego development was related to six of the seven variables examined, including anger expression.
Shea and colleagues (1978) recruited 294 undergraduate students. Participants completed the Marcia Ego-Identity Incomplete Sentence Blank (EI-ISB) developed by Marcia (1966) and the Internal Locus of Control Scales (Levenson, 1974). Analyses of variance found that ego development, as measured by the WUSCT, had no significant relationship with scores on the other measures. The same sample and measures were used in a follow-up study conducted by Adams and Shea (1979). Using a Kruskal–Wallis one-way analysis of variance, they found that students in the four identity status groups differed significantly in their ego stage distributions.
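As an illustration of the Kruskal–Wallis procedure Adams and Shea applied, the sketch below compares ordinal ego-stage scores across four identity status groups; the group labels follow Marcia’s statuses, but every score is fabricated to show the mechanics only.

```python
from scipy.stats import kruskal

# Fabricated ordinal ego-stage scores for Marcia's four identity statuses.
diffusion   = [3, 3, 4, 4, 4, 5]
foreclosure = [4, 4, 4, 5, 5, 5]
moratorium  = [5, 5, 6, 6, 6, 7]
achievement = [5, 6, 6, 7, 7, 8]
H, p = kruskal(diffusion, foreclosure, moratorium, achievement)
print(f"H = {H:.2f}, p = {p:.4f}")
```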
In summary, adult studies provide broad but uneven support for WUSCT validity. Ego development was related to counselors’ cognitive complexity, long-term psychosocial outcomes, parenting attitudes and maternal distress, and identity status, yet some theoretically expected associations, such as those with locus of control, failed to emerge. The magnitude of effects varied across criteria and samples, indicating that WUSCT validity in adulthood is meaningful but context dependent, and that interpretive caution remains necessary.
Overall Summary of WUSCT Validity Findings
In summary, the preponderance of the studies suggests that the WUSCT demonstrates acceptable internal consistency as well as convergent and predictive validity across age groups, albeit with slightly stronger evidence of validity in use with adult populations. Several investigations reported adequate internal consistency for total protocol ratings and item sum scores, with coefficients typically ranging from .70 to .94. These results support the coherence of the ego development construct in applied validity contexts. Convergent and predictive validity were reflected in associations between WUSCT scores and a range of theoretically relevant indicators, including academic competence, reading achievement, deviant behaviors, psychosocial functioning, parenting satisfaction, and maternal distress, with reported validity coefficients most often falling in the small-to-moderate range.
Evidence of concurrent validity was supported by associations between WUSCT scores and theoretically related measures administered within the same assessment context, including structured interviews, other sentence completion measures, and performance-based assessments of ego development and personality functioning. In these studies, concurrent validity coefficients were generally moderate to strong.
Across school-age, mixed-age, and adult samples, WUSCT scores demonstrated meaningful associations with external indicators of developmental maturity, behavioral functioning, adjustment, and interpersonal competence. Ego development differentiated “delinquent” from “nondelinquent” adolescents, psychiatric inpatients with and without a history of suicide attempts, adults originally hospitalized as adolescents from matched controls, and individuals at different levels of occupational, relational, and educational attainment, with reported group differences ranging from moderate to large in magnitude.
Discriminant validity was supported in studies showing that ego development correlated with unique aspects of functioning, such as conformity patterns, religious beliefs, maternal role stress, and affective complexity, but not uniformly across all domains (e.g., peer-rated conformity or locus of control in some samples). Taken together, the available evidence indicates that the WUSCT demonstrates meaningful evidence of validity across developmental levels, although the variability in effect sizes, sample characteristics, and analytic strategies highlights the need for more methodologically rigorous and consistently designed validity research.
Discussion
The RISB and WUSCT are two SCTs that have been studied extensively for their reliability and validity across various populations and age ranges, although many of these studies are now several years to several decades old. Both assessment tools have shown some degree of potential in assessing psychological adjustment and personality development, but they also have limitations that practitioners should consider when interpreting results and forming diagnostic conclusions about a patient. Of course, reliance on a single instrument for purposes of conceptualization and classification of a clinical condition is a practice to be eschewed (e.g., Dombrowski, 2020), but the two SCTs under review have demonstrated potential for clinical use when considered in the context of a comprehensive evaluation.
For the RISB, studies generally reported high interrater reliability and moderate internal consistency. However, test–retest reliability is much more variable, indicating inconsistency in the stability of test scores over time. Demographic factors like gender and age appeared to influence the reliability of the RISB to some extent. Still, the instrument generally appears to demonstrate sufficient reliability for cautious use when acknowledging the impact that gender and age can have on the results.
In terms of validity, there is some evidence that scores on the RISB correlate with various other psychological measures demonstrating adequate concurrent validity. However, the RISB has shown potential shortcomings regarding its sensitivity to demographic variables of the participants included in the studies. This can contribute to overidentification of African American and female individuals as “maladjusted,” raising concerns about its appropriateness for these populations. Overall, while the RISB shows some potential value as a tool for assessing psychological adjustment, its results should be interpreted cautiously when used with diverse populations considering its lack of demonstrable validity for populations from diverse racial and gendered groups. This issue is not limited to sentence completion tasks (Bornstein, 2011; Putnick & Bornstein, 2016); the impact of cultural biases on test scores represents a field-wide concern, as even the most consequential instruments (e.g., intelligence tests) often lack investigations of measurement invariance (e.g., Dombrowski et al., 2021) and long-term stability (e.g., Watkins et al., 2022) with diverse populations.
The WUSCT has also been studied for its reliability and validity across populations. In studies with school-age children, interrater reliability was generally high. Although some studies involving both adolescents and adults as study participants reported high interrater reliability, there was much more variability in these values. Test–retest reliability estimates varied considerably across studies, indicating inconsistent stability over time. Based on our review, results varied markedly based on characteristics of the sample as well as the specific conditions of the study. This finding poses problems for consistent interpretation. If an interpretation suggests maladjustment at time 1 but then indicates normal functioning at time 2, then it can be difficult to establish interpretive consistency and therefore veracity. Overall, the preponderance of the WUSCT reliability evidence suggests adequate reliability, but the results also seem to vary significantly based on the sample population and the specific conditions of the study.
Various forms of validity have also been explored for the WUSCT. There is evidence that it has some predictive validity in terms of academic attainment, intelligence, and certain behaviors (e.g., aggression, suicidality). However, these associations appear to be influenced by demographic factors such as socioeconomic status, age, gender, and ethnicity. Despite this, the WUSCT has demonstrated evidence as a tool for measuring an individual’s personality and identity development over time.
Future research aimed at improving the clinical utility of the RISB and WUSCT should prioritize two interrelated directions. First, studies should examine the incremental validity of SCT scores within multimethod assessment batteries, clarifying whether these measures provide clinically meaningful information beyond routinely used procedures such as interviews, rating scales, and cognitive tests. Second, research should focus on developing context-sensitive interpretive frameworks, including revised cutoff scores, demographically informed norms, and explicit decision rules linked to specific clinical questions. Together, these efforts would shift SCT research from establishing psychometric adequacy toward determining how, when, and for whom these instruments enhance clinical judgment.
There are several limitations to this review that should be considered when interpreting the results. First, there was significant variability in the sample sizes, methods, and populations of the included studies, which may impact the generalizability of the findings. Second, it is possible that studies relevant to this review were missed when searching the databases outlined in the Method section. Third, there is the potential that, despite using a comprehensive and systematic approach to the review, publication bias may have impacted the studies that were included. Fourth, both tests showed variability in test–retest reliability, reflecting inconsistencies in the stability of measured constructs over time. Fifth, the evidence reviewed indicated limitations concerning demographic factors, including age, gender, race, and ethnicity. Sixth, studies have revealed a tendency for overidentification or misclassification of maladjustment. Finally, many of the studies that initially evaluated the reliability and validity of these assessment measures were conducted decades ago, and few studies include current populations to determine how the tools, presently still used in clinical practice, hold up over time. While there is evidence of reliability and validity of the use of SCTs for select purposes, these findings strongly suggest a need for careful contextual interpretation of results and underscore the ethical requirement for practitioners to use SCTs, and any assessment instrument for that matter, in the context of a multi-informant, multimethod evaluation process when arriving at diagnostic decisions.
Conclusions
Adhering to PRISMA guidelines, this study systematically reviewed 51 studies that examined the reliability and validity of the RISB and WUSCT. The evidence reviewed supports the use of these measures as part of a comprehensive assessment when evaluating ego development, exploring cognitive styles, and conceptualizing personality structure. In addition, results from these measures can complement interviews and rating scales in treatment planning, goal setting, and monitoring therapeutic progress. These tests are particularly promising for specialized clinical populations, such as individuals with substance abuse issues, psychiatric conditions, adolescents with delinquent behaviors, and those at risk for suicidality.
The results of this study have important implications for applied psychological practice. Practitioners must recognize the nuanced nature of SCTs and their place within a broader multimethod assessment battery. These instruments are most beneficial for exploring specific aspects of personality, cognitive styles, or ego development, rather than serving as standalone diagnostic tools, a stricture not specific to SCTs but imposed on any assessment instrument. Because these measures provide a level of nuance and the ability to probe responses that is not always afforded by more standardized broadband measurement tools, many providers still choose to use these assessments within their psychological evaluations. Given the need for additional evidence to support their widespread use across diverse purposes and populations—a requirement of almost all assessment instruments—training programs will need to promote a more nuanced understanding of SCTs and consider whether teaching about them represents the most efficient use of instructional time and resources. However, if practitioners choose to use the two SCTs reviewed in this study, they should familiarize themselves with the available evidence concerning reliability and validity to recognize each instrument’s limitations, apply critical interpretive judgment, and remain attentive to demographic diversity and cultural sensitivity when interpreting results, an admonition that applies to any instrument used in the assessment process (Dombrowski et al., 2022). Future directions for the use of SCTs should include the development of revised, valid scoring systems and supplemental interpretive guidelines tailored explicitly to demographic subgroups. In addition, it would be beneficial to include detailed procedures for training coders and for monitoring interscorer reliability. Future research into these measures and SCTs in general would benefit from a thorough evaluation of their reliability and validity in relation to current populations and with other updated assessment tools. While instruments like the RISB and WUSCT may provide valuable insights into psychological adjustment and personality structures, additional research is necessary to determine their incremental value and ensure their ethical use in psychological practice. However, the results of this study suggest that SCTs may provide a degree of incremental validity for specific assessment purposes and should not necessarily be summarily dismissed as being devoid of an evidentiary basis.
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
