Abstract
This article selectively reviews the key issues and measures for the assessment of depressive disorders and symptoms in youth and adults. The first portion of the article addresses the nature and conceptualization of depression and some key issues that must be considered in its assessment. Next, the diagnostic interviews and clinician- and self-administered rating scales that are most widely used to diagnose, screen for, and assess the severity of depression in adults and youth are selectively reviewed. In addition, the assessment of three transdiagnostic clinical features (anhedonia, irritability, and suicidality) that are frequently associated with both depression and other forms of psychopathology is discussed. The article concludes with some broad recommendations for assessing depression in research and clinical practice and suggestions for future research.
Assessment of Depression
Depressive disorders are among the most common mental disorders (Kessler & Bromet, 2013) and are among the most burdensome disorders of any kind according to the World Health Organization’s metric of years lived with disability (James et al., 2018). Hence, depression is a top priority for clinical research, practice, and health care policy. Each of these domains requires reliable and valid assessment of depression.
This article focuses on the assessment of depressive disorders, particularly major depressive disorder (MDD) and persistent depressive disorder (PDD), from the perspective of the
In preparing this article, I first searched for publications in the past 10 years that provided broad reviews of the assessment of depression in adults and/or youth using Google Scholar. I systematically used combinations of terms such as “depression” and “mood disorders” with terms such as “assessment,” “measures,” “rating scales,” “questionnaires,” “inventories,” and “interviews.” I then scoured these sources for further references. Next, I used these sources to determine which issues and assessment instruments to include in the article. I then did another round of literature searches focusing on specific issues and instruments, with a particular emphasis on systematic reviews and meta-analyses. Where possible, I have relied on systematic reviews and meta-analyses in my appraisals and recommendations.
I begin by discussing the construct of depression and some key issues that should be considered in assessing depression. I will then review measures that are widely used to diagnose, screen for, and assess the severity of depression in adults and youth. I will also briefly discuss the assessment of some important transdiagnostic clinical features that are frequently associated with both depression and other forms of psychopathology. There are a vast number of measures to assess depression, and the literatures on many are quite extensive, so the review is selective, and I will not go into depth on any specific measure. Finally, I will make some recommendations for assessing depression in research and clinical practice and offer a few suggestions for future research.
The Construct of Depression
Descriptions of the phenomenology of depression prior to the advent of explicit diagnostic criteria include a number of symptoms that do not appear in the
The Feighner and
An important implication is that there is a distinction between diagnostic constructs and the criteria that are used to define them. The
The categorical nature of
If depression is a dimensional phenomenon, the question arises whether it comprises a single dimension or multiple dimensions. Most depression scales sum all items to create a total score. If a scale has a multidimensional, rather than unidimensional, structure, total scores will reflect different contributions from each dimension, clouding interpretation (Fried et al., 2022).
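The interpretive problem with total scores can be illustrated with a minimal sketch. The item names, the two subscales ("cognitive" and "vegetative"), and the 0 to 3 rating range are hypothetical and are not taken from any published scale; the point is only that identical totals can conceal very different symptom profiles.

```python
# Hypothetical item-to-subscale assignment, for illustration only.
SUBSCALES = {
    "cognitive": ["sadness", "guilt", "worthlessness"],
    "vegetative": ["insomnia", "appetite_loss", "fatigue"],
}


def total_score(ratings):
    """Sum all item ratings, as most depression scales do."""
    return sum(ratings.values())


def subscale_scores(ratings):
    """Sum item ratings separately within each (hypothetical) subscale."""
    return {
        name: sum(ratings[item] for item in items)
        for name, items in SUBSCALES.items()
    }


# Two invented respondents, each item rated 0 (absent) to 3 (severe).
respondent_a = {"sadness": 3, "guilt": 3, "worthlessness": 3,
                "insomnia": 0, "appetite_loss": 0, "fatigue": 0}
respondent_b = {"sadness": 0, "guilt": 0, "worthlessness": 0,
                "insomnia": 3, "appetite_loss": 3, "fatigue": 3}

# The totals are identical, but the subscale profiles are opposite.
print(total_score(respondent_a), subscale_scores(respondent_a))
print(total_score(respondent_b), subscale_scores(respondent_b))
```

In this invented example both respondents receive the same total even though one reports only cognitive symptoms and the other only vegetative symptoms.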
Factor analytic studies of depression rating scales have yielded inconsistent findings regarding the number and nature of dimensions that best represent their items (Brouwer et al., 2013; Fried et al., 2022; Watson & O’Hara, 2017). However, careful scale development can produce depression measures with a coherent general scale that can be broken down into a number of highly correlated subscales (e.g., Watson & O’Hara, 2017). This suggests that depression can be viewed at several levels of resolution.
Most depression measures focus on assessing symptoms. However, the course of depression is at least as important for prognosis (e.g., episode duration and recurrence) and treatment planning (i.e., type, intensity, and duration of treatment; Klein, in press). Course can also be conceptualized dimensionally (Klein, 2008). For example, Pettit et al. (2009) identified a factor consisting of age of onset, and number and duration of episodes that had good predictive validity in an 11-year follow-up.
Although it is not a focus of this article, it is important to acknowledge that several alternative approaches to classification have emerged since the publication of
In HiTOP, the depressive disorders fall within the internalizing spectrum, and at the next level, the distress subfactor, which also includes generalized anxiety disorder (which is reminiscent of the broader pre-DSM-III conceptualization of depression), as well as aspects of posttraumatic stress disorder and borderline personality disorder. The level beneath subfactors consists of syndromes, which have not yet been well-defined. However, preliminary work suggests that the syndromes break down into depression and the specific anxiety disorders, although depression may further split into separate cognitive and vegetative syndromes (Waszczuk et al., 2017). HiTOP is still a work in progress, and measures designed to assess its constructs are currently being developed (Simms et al., 2022).
The Research Domain Criteria (RDoC; Kozak & Cuthbert, 2016) focuses on a core set of psychobiological domains, each with multiple constructs, whose neural circuitry has been at least partially delineated and which are relevant to psychopathology. These domains can be observed across species and assessed in a dimensional manner across multiple units of analysis (e.g., molecules, neural circuits, physiology, behavior, and self-report). RDoC intentionally does not address clinical phenotypes; for example, depression does not appear in this system. Phenotypes are to be discovered inductively by identifying the clinical manifestations associated with dysfunction in one or more core domains or constructs. However, certain domains and constructs (e.g., Loss in the Negative Valence Systems domain, and Reward Responsiveness in the Positive Valence Systems domain) are especially relevant to depression-related psychopathology. The RDoC website offers suggestions for measures reflecting these constructs across levels of analysis: https://www.nimh.nih.gov/research/research-funded-by-nimh/rdoc.
Both categorical (e.g.,
Considerations in Assessing Depression
Development
The
Although research in this area is surprisingly limited, it appears that depressive disorders are generally characterized by similar symptoms throughout the lifespan (Klein et al., 2017). However, the loadings of depressive symptoms on a latent depression factor increase after age 12, suggesting greater coherence of the construct after the transition to adolescence (Morken et al., 2021). Genetic influences are also weaker in prepubertal than postpubertal depression, and the stability (or homotypic continuity) of depression increases after age 12 (Klein et al., 2017; Morken et al., 2021).
Older adults frequently have significant, often multiple, medical conditions. As a result, it can be challenging to distinguish depressive symptoms (e.g., fatigue, insomnia) from the effects of general medical disorders. For most older adults, depressive episodes are recurrences of an earlier onset condition. However, in a subgroup, particularly those with no prior history, depression may be secondary to cardiovascular disease. The putative subtype of “vascular depression” is characterized by a late-life onset, a negative family history of depression, apathy, poor executive functioning on neuropsychological tests, a history of hypertension, nonspecific findings of white matter hyperintensities in magnetic resonance imaging, a poor response to antidepressant medication, and an increased risk of developing dementia (Steffens, 2019). Thus, even though depression may be phenotypically similar across development, causal processes, mechanisms, and challenges in assessment differ across the lifespan.
Data Source
Most measures for assessing depression rely on clinicians’ evaluations of depressed individuals’ reports and behavior (e.g., semistructured diagnostic interviews and clinician rating scales) or individuals’ reports of their symptoms (self-rating scales). Self-ratings are highly economical, as they do not require clinicians to conduct the assessment. Clinicians’ evaluations have the advantage that raters can elaborate on questions, elicit examples, and probe inconsistencies to ensure that respondents understand what is being asked and to evaluate the clinical significance of their responses. In addition, clinicians can utilize observations of respondents’ behavior (which is particularly important for assessing psychomotor retardation or agitation) and obtain information that can be difficult to elicit with a highly structured format like a self-report questionnaire, such as past history and course of depression and suicide risk.
Agreement between clinician and self-rating scales is moderate, but higher when similar formats are employed. Patients tend to rate themselves as being more severely depressed than clinicians. However, both sources of information contribute unique variance in predicting outcomes, even when identical scales are used (Dozois et al., 2020; Uher et al., 2012).
Informants’ reports are often useful, especially for youth and individuals who are psychotic or otherwise limited in their ability to provide valid information. Reports of informants, such as parents and teachers, are especially important for younger children, whose cognitive processes and language abilities are less developed than those of older youth and adults. Younger children have particular difficulty reporting on temporal characteristics; therefore, informants must be relied on for information on the onset and duration of symptoms and previous episodes (Dougherty et al., 2018). Nonetheless, it is still important to obtain children’s reports because informants may not be aware of the child’s feelings and thoughts. Indeed, children report higher levels of depressive symptoms than their parents and teachers (Jensen et al., 1999; Makol & Polo, 2018).
Informants are less essential for adolescents, who are more reliable reporters than children (Edelbrock et al., 1985), and because parents are less involved in the day-to-day lives of adolescents. However, informants can be useful in reporting on adolescents’ externalizing problems, which teens may minimize (Dougherty et al., 2018).
Although obtaining data from multiple sources is optimal, it is complicated by the fact that agreement between informants is only fair to moderate (Achenbach et al., 2005; De Los Reyes et al., 2015). Informants tend to agree more when they have the same relationship to the target, observe the target in the same context, and for observable behaviors (e.g., externalizing, as opposed to internalizing, symptoms). However, informant discrepancies can provide meaningful information, such as the situational specificity of the target’s behavior (De Los Reyes et al., 2015). Moreover, youth, parent, teacher, and clinician ratings each account for unique variance in predicting outcomes (Cohen et al., 2019; Ferdinand et al., 2003; Verhulst et al., 1997).
Due to the modest agreement between sources, clinicians and researchers often seek to integrate the conflicting information. Approaches to integrating data from multiple sources include rating the feature or diagnosis as present if any source reports it or using only the variance shared by sources by treating each of their reports as indicators of a latent construct and considering nonshared variance as measurement error. The approach that most closely mirrors clinical practice is the “best-estimate” procedure, in which the assessor uses their clinical judgment to evaluate the credibility of each source’s report and weighs them accordingly in reaching a decision (Klein et al., 1994).
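The simplest of these integration approaches, rating a feature as present if any source reports it, can be sketched as follows. The informant labels and symptom names are hypothetical, and treating a symptom that an informant was not asked about as not endorsed is an assumption of this sketch. The latent-variable and best-estimate approaches depend on model fitting and clinical judgment, respectively, and are not represented here.

```python
def integrate_any_source(reports):
    """Combine informant reports with the "any source" rule.

    reports maps each informant to a dict of {symptom: endorsed (bool)}.
    A symptom is rated present if at least one informant endorses it.
    A symptom missing from an informant's report is treated as not endorsed
    (an assumption of this sketch).
    """
    symptoms = set()
    for report in reports.values():
        symptoms.update(report)
    return {
        symptom: any(report.get(symptom, False) for report in reports.values())
        for symptom in sorted(symptoms)
    }


# Invented example: a child and a parent disagree about sadness and irritability.
example_reports = {
    "child": {"sadness": True, "insomnia": False},
    "parent": {"sadness": False, "insomnia": False, "irritability": True},
}
combined = integrate_any_source(example_reports)
print(combined)
```

Because a single endorsement is enough, this rule can only raise the number of symptoms rated present relative to any one informant.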
However, as suggested above, discrepancies between sources can be meaningful, as they may reflect context-specific behavior (De Los Reyes et al., 2013). Thus, it can be informative to examine each informant’s data separately. De Los Reyes et al.’s (2013) Operations Triad Model provides a means of determining whether informant discrepancies reflect situation-specific behavior or measurement error.
Populations and Contexts
The assessment of depressive disorders is relevant for a wide variety of populations and contexts: inpatient and outpatient mental health facilities, primary and specialty medical care settings, schools, forensic contexts, and the community (e.g., population screening, epidemiological research). The severity of depression can differ across populations and contexts (e.g., milder in community and more severe in mental health settings). Measures (and items on the same measure) often vary in their sensitivity to the severity level of depression. Item response theory methods can provide estimates of where on the distribution of a latent trait each measure or item provides the most information. For example, Olino et al. (2012) found that in a community sample of older adolescents, the Center for Epidemiologic Studies Depression Scale (CES-D; Radloff, 1977) provided the most information at lower levels of depression, while the Beck Depression Inventory (BDI; Beck et al., 1988) provided more information at higher levels of depression. Thus, the CES-D may be more useful for epidemiological studies, while the BDI may be better suited for clinical samples. Ideally, measures should include items that are sensitive to different levels of severity so that together they assess the full range of severity of the construct.
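The idea that an item is most informative at a particular point on the latent trait can be made concrete with a minimal sketch. It assumes a two-parameter logistic model for a dichotomous item, which is a simplification and not necessarily the model used in the study cited above; the discrimination (a) and severity (b) values are invented for illustration. Under this model, the item information at trait level theta is a² · P(theta) · (1 − P(theta)), which peaks at theta = b.

```python
import math


def endorsement_probability(theta, a, b):
    """Two-parameter logistic model: probability of endorsing the item."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a two-parameter logistic item at trait level theta."""
    p = endorsement_probability(theta, a, b)
    return a * a * p * (1.0 - p)


# Hypothetical items: one tuned to mild and one to severe depression.
mild_item = {"a": 1.5, "b": -0.5}
severe_item = {"a": 1.5, "b": 1.5}

for theta in (-1.0, 0.0, 1.0, 2.0):
    print(
        theta,
        round(item_information(theta, **mild_item), 3),
        round(item_information(theta, **severe_item), 3),
    )
```

With these invented parameters, the mild item carries more information at low trait levels and the severe item at high trait levels, which is why combining items with different severity parameters covers more of the severity range.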
An important question is whether the underlying latent construct assessed by a measure is comparable across groups, such as individuals of different genders, sexual orientations, races, ethnicities, socioeconomic positions, cultures, and geographic regions. This issue can be examined by using confirmatory factor analysis to test for measurement invariance (MI). If an instrument is examined in two groups and found to lack invariance, then it cannot be assumed to measure the same construct in each group. MI of depression measures has primarily been examined for self-rated scales. This literature indicates that depression measures are often invariant across a variety of populations (e.g., Lindley & Bauerband, in press; Mellick et al., 2019; Olino, 2020; Patel et al., 2019; but see Fried et al., 2022 for some exceptions). Future work should test MI in diagnostic interviews and clinician rating scales.
However, MI cannot determine whether existing measures omit symptoms that are important in characterizing depression in some groups but not others. For example, it has been posited that in males, depression can take the form of externalizing behaviors such as anger, aggression, risk-taking, and substance misuse that are not included in most depression measures, and that this omission accounts for the higher prevalence of depression in females (Martin et al., 2013). Similarly, there is evidence that depression is experienced as musculoskeletal pain, headaches, anger, and/or loneliness in some cultures (Haroz et al., 2017), raising questions about whether the measures developed by North American and European investigators and used in most cross-cultural research capture culture-specific idioms of distress that are analogous to the Western construct of depression (Kirmayer et al., 2022). The generalizability of Western conceptualizations and measures of depression to other contexts and cultures should be a priority for future research. In addition, there is a need for investigators in non-Western cultures to study how depression-related symptoms are conceptualized and assessed in communities that live in vastly different contexts.
Measures
Aims of Assessment
The major aims of clinical assessment include screening, diagnosis and prognosis, case conceptualization and treatment planning, and treatment monitoring and evaluation (see Dougherty et al., 2018 for a more extensive discussion). In screening, large numbers of individuals are assessed for early identification and intervention or inclusion in research. To minimize costs and burden, brief self-rating scales are typically utilized. Individuals scoring above the case threshold should then receive more in-depth assessment.
Diagnosis and prognosis involve assessing diagnostic criteria for depressive disorders and exclusionary diagnoses (e.g., bipolar disorder), severity of symptoms (including the presence of symptoms with particular significance for clinical decision-making such as suicidality and psychotic symptoms), prior history and course of depression, comorbid mental and general medical disorders, and treatment history. Diagnostic interviews, supplemented by clinician or self-rating scales, are generally the primary means of collecting this information.
These data are also required for case conceptualization and treatment planning. In addition, a comprehensive assessment of personal, interpersonal, and systemic factors is necessary to provide clues to the development and maintenance of symptoms and functional impairment and to determine the focus of treatment. I will not review the many constructs and measures that can be included in such a comprehensive assessment, but for helpful discussions see Dougherty et al. (2018) and Persons et al. (2018).
Treatment monitoring and evaluation involves systematically assessing the degree of change in target symptoms and impairments to determine whether treatment should be continued, intensified, augmented, changed, or terminated. Clinician and self-rating scales are typically used for this purpose. I will focus on measures of depressive symptoms; for assessing functioning see Dougherty et al. (2018), Persons et al. (2018), and Sheehan (2022). There is considerable evidence that administering rating scales throughout the course of treatment (i.e., measurement-based care) enhances patient outcomes (D’Avanzato & Zimmerman, 2017; Tasca et al., 2019).
Diagnostic Interviews
Structured diagnostic interviews were developed because of the limited interrater reliability of standard (i.e., unstructured) clinical interviews (e.g., Regier et al., 2013). Structured diagnostic interviews can be semistructured, where a trained interviewer (optimally a clinician, but often a trained research assistant) follows a set format but can ask follow-up questions, elicit examples, resolve inconsistencies, and observe the respondent’s behavior, and is ultimately tasked with rating items based on their best judgment. In contrast, in fully structured interviews, the interviewer adheres to a script and accepts the interviewee’s responses at face value. Fully structured interviews (e.g., the Composite International Diagnostic Interview; Kessler & Üstün, 2004) are primarily used in epidemiological research where costs preclude using clinically trained interviewers to assess large samples. This review is limited to semistructured interviews.
If interviewers are properly trained and supervised, the reliability of semistructured interviews appears to be substantially higher than that of unstructured interviews (e.g., Williams et al., 1992). In addition, semistructured interviews yield a greater number of diagnoses, suggesting that unstructured interviews overlook clinically significant psychopathology, and are more accurate compared with a gold-standard assessment procedure (Basco et al., 2000). Semistructured diagnostic interviews are widely used in research, but rarely employed in clinical practice, most likely because they take longer to administer than unstructured interviews and clinicians may not be reimbursed for the additional time. However, many semistructured interviews have a modular format, so clinicians can choose the modules that are most relevant for each patient.
Semistructured interviews typically assess the criteria necessary for diagnosing depression and the most common comorbid disorders. Thus, they are designed to yield categorical diagnoses, which, although clinically useful, have lower reliability and statistical power than dimensional measures (Markon et al., 2011). It is generally possible to sum symptom ratings to produce dimensional scores. However, due to the skip-out structure of most diagnostic interviews, if the “gate” symptoms for a disorder (e.g., depressed mood and loss of interest/pleasure) are absent, the remaining symptoms are not rated, potentially underestimating scores of individuals without the core symptoms. That said, it is reasonable to question whether accessory symptoms are really reflections of the disorder if there is no evidence of the core symptoms. That is, should sleep or concentration problems in the absence of depressed mood and anhedonia “count” toward a depression score, or are they better viewed as indicators of other disorders or problems? Nonetheless, investigators are beginning to develop interviews with minimal skip-outs to provide both categorical and dimensional assessments (e.g., Shankman et al., 2018).
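The effect of a skip-out on a summed symptom score can be shown with a minimal sketch. The symptom names, the choice of two gate symptoms, and the 0/1 ratings are hypothetical simplifications and do not reproduce the scoring rules of any specific interview.

```python
# Hypothetical gate symptoms; real interviews define their own gate items.
GATE_SYMPTOMS = ("depressed_mood", "anhedonia")


def score_with_skip_out(ratings):
    """Sum symptom ratings, but skip accessory symptoms if no gate symptom is present."""
    if not any(ratings[gate] for gate in GATE_SYMPTOMS):
        # The module is skipped, so accessory symptoms are never rated.
        return 0
    return sum(ratings.values())


def score_without_skip_out(ratings):
    """Sum all symptom ratings regardless of the gate symptoms."""
    return sum(ratings.values())


# Invented respondent with accessory symptoms but neither gate symptom.
respondent = {
    "depressed_mood": 0,
    "anhedonia": 0,
    "insomnia": 1,
    "poor_concentration": 1,
    "fatigue": 1,
}
print(score_with_skip_out(respondent), score_without_skip_out(respondent))
```

For this invented respondent the two scoring rules diverge, whereas they give the same score for anyone who endorses at least one gate symptom. Whether the higher score is the more valid one is exactly the question raised in the paragraph above.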
A second limitation of most semistructured interviews is that their assessment of the course of psychopathology is cursory, despite its implications for prognosis and decisions regarding intensity and duration of treatment. To assess course in a more detailed fashion, it is very helpful to work collaboratively with the respondent to complete a timeline mapping the trajectory of the disorder since onset (e.g., see the procedure described by McCullough et al., 2016).
The most widely used semistructured diagnostic interview for adults is the Structured Clinical Interview for
The research version of the SCID was designed for clinically trained interviewers. If additional sources of information (e.g., informants, case records) are available, the interviewer can take them into consideration in their ratings. The SCID begins with a screening module with gate questions for most disorders, the responses to which determine which modules for specific disorders will be administered subsequently. There are variants of the interview designed for patients unlikely to have psychotic disorders and for nonpatients. The
Administration time for the SCID-5-RV ranges from 30 to 120 min depending on the breadth of the respondent’s psychopathology and the respondent’s and interviewer’s interview styles. Interrater reliability for depressive disorders is moderate to substantial (Williams et al., 1992; Zanarini et al., 2000). As a considerable portion of the research on depression in adults has used the SCID to establish diagnoses, the convergent, discriminant, and construct validity of the interview are inextricably intertwined with that of the diagnostic construct (see Klein, in press for a more extensive discussion).
The most widely used semistructured diagnostic interview for children and adolescents (6–18 years) is the Kiddie Schedule for the Affective Disorders and Schizophrenia (K-SADS; Townsend et al., 2020). There are many modified versions (Ambrosini, 2000), hence it cannot be assumed that all studies using the K-SADS have used the same interview. Like the SCID, there is now a streamlined version for clinical use (Townsend et al., 2020).
The K-SADS was designed to be used by clinically trained interviewers. It is administered separately to the youth and a parent. Differences between informants should be reconciled and final ratings are derived using a best-estimate approach. Similar to the SCID, the initial module consists of screening questions that determine the disorder-specific modules to be administered subsequently. Administration times for the parent and child interviews (research version) range from 30 min to 2.5 hr each, depending on the breadth of the child’s psychopathology and the styles of the respondents and interviewer. However, the time required for administration has been considerably reduced in a new computerized version (Townsend et al., 2020). Interrater reliability ranges from adequate to excellent for depressive disorders, and numerous sources of evidence support the K-SADS’ convergent and construct validity (Dougherty et al., 2018; Townsend et al., 2020).
The MINI International Neuropsychiatric Interview has versions for adults (MINI; Sheehan et al., 1997) and children (MINI-KID; Sheehan et al., 2010). Whereas the SCID and K-SADS assess both current and lifetime diagnoses, the MINI is limited to current diagnoses for most disorders. Compared with the SCID, the MINI takes about half the time to administer but provides less information. Both the MINI and MINI-KID show good interrater reliability and convergent validity (Sheehan, 2022).
Rating Scales
Clinician- and self-administered rating scales are useful for assessing severity and response to treatment. Typically, scores reflect the number and intensity of symptoms experienced during the previous 1 to 2 weeks. Clinician rating scales are interviews focusing on a circumscribed area (e.g., depression symptoms). They include a number of items that are rated by the interviewer, and often, but not always include a set of required or suggested questions and probes. Self-rating scales cover similar content; however, respondents directly report on their own symptoms (self-report) or informants (e.g., parents, teachers) report on another individual’s symptoms. Unlike diagnostic interviews, clinician and self-administered rating scales do not collect sufficient information to make diagnoses (e.g., duration and exclusion criteria are not assessed) or obtain information on development and course and comorbid conditions.
There are a number of issues in considering rating scales. First, rating scale scores are generally interpreted as indices of severity, but the construct of severity is contentious and poorly defined (Zimmerman et al., 2018). For example, should it refer only to symptoms, and if so, should they all be weighted equally or are some symptoms more indicative of severity than others (e.g., suicidality) and should therefore be weighted more heavily (Fried et al., 2022)? In addition, should other factors, such as psychosocial functioning, be included (Zimmerman et al., 2018)? The rating scales reviewed below focus solely on symptoms and sum all items, weighting them equally. However, this issue is an important area for future research.
Second, if a rating scale is used to monitor treatment response, it should be sensitive to change, a property that has been demonstrated for most widely used depression rating scales (Sheehan, 2022). In principle, the scale should also demonstrate MI over time (i.e., the structural properties of the measures should be comparable across repeated assessments). However, many widely used clinician and self-rating scales fail to show MI across repeated assessments, probably due to restriction of range and regression to the mean (Fried, van Borkulo, et al., 2016). Unfortunately, this problem may be unavoidable in many contexts in which rating scales are needed (e.g., most patients seeking treatment will have elevated scores, and high scores are generally inclusion criteria for clinical trials). In addition, some studies in unselected community samples have reported attenuation effects in which scores diminish simply as a function of repeated assessments, making it difficult to distinguish the effects of treatment and time from methodological artifact (Dougherty et al., 2018; Dozois et al., 2020). These problems are priority areas for future conceptual and empirical work.
Finally, a number of different conventions and cutoffs have been applied to rating scales to define clinically important concepts such as response, remission, relapse, recovery, and recurrence (Sheehan, 2022). Unfortunately, empirical support for these definitions is often absent or even negative (de Zwart et al., 2019).
Clinician-Administered Rating Scales
The three most widely used clinician rating scales for depression in adults are the Hamilton Rating Scale for Depression (HAM-D; Hamilton, 1960), the Montgomery-Åsberg Depression Rating Scale (MADRS; Montgomery & Åsberg, 1979), and the Inventory of Depressive Symptomatology—Clinician version (IDS-C; Rush et al., 1996) or its briefer version, the Quick IDS-C (QIDS-C; Rush et al., 2003). The HAM-D is the most widely used clinician rating scale and takes 20 to 30 min to administer. It predates the development of explicit diagnostic criteria and includes a number of symptoms that are not
There are also a number of alternative versions of the HAM-D (for reviews see Carrozzino et al., 2020; Dozois et al., 2020; Sheehan, 2022). Some of these versions have expanded coverage of depressive symptoms, including a 24-item HAM-D that adds three cognitive symptoms and a 29-item version that includes reverse vegetative symptoms (e.g., increased appetite/weight and sleep). In addition, some investigators have developed short versions of 6 to 8 items.
The original HAM-D consisted of rating scales for each item but did not have suggested questions or probes, which limited interrater reliability. However, a number of semistructured interview versions of the HAM-D have been developed and show greater interrater reliability than unstructured versions (e.g., Williams, 1988; Williams et al., 2008). The HAM-D has a number of conceptual and psychometric limitations, including uneven coverage of depressive symptoms, a varying number of response options, lack of interval consistency across items, poor interrater reliability of some items, and items that fail to load on a latent severity dimension in item-response theory analyses (Bagby et al., 2004; Dozois et al., 2020). However, many of these problems have been rectified in more recent versions of the scale (Carrozzino et al., 2020).
The MADRS is a 10-item scale that was developed empirically to maximize sensitivity to the effects of antidepressant medication and is widely used in clinical trials (Sheehan, 2022). It takes approximately 10 min to administer. The content of the MADRS only partially overlaps with
The IDS-C was developed to provide more comprehensive coverage of depression symptoms than the HAM-D and MADRS. It has a semistructured interview format and 30 items. The IDS-C has been extensively evaluated, and there is substantial support for its interrater reliability and convergent validity (Rush et al., 1996). The QIDS-C includes 16 items covering the
The most widely used clinician rating scale for youth is the 17-item Children’s Depression Rating Scale—Revised (CDRS-R; Poznanski et al., 1979; Poznanski & Mokros, 1999). The CDRS was developed to assess the severity of depression in children aged 6–12 years. It is administered separately to the child and an adult informant, with the clinician subsequently integrating the data using clinical judgment. Its psychometric properties are generally favorable (Dougherty et al., 2018). The CDRS is often used with adolescents as well, although evidence for its reliability and validity in teens is lacking (Stallwood et al., 2021).
Self-Administered Rating Scales
There are numerous self-rated scales for depression in adults and youth, a number of which have been shown to possess strong psychometric properties in a variety of samples and most of which are highly intercorrelated (see Dozois et al., 2020; Fried et al., 2022). Given space constraints, I will focus on the four scales for adults and two scales for children that are most widely used. Other frequently used scales that are not discussed here include the CES-D (Radloff, 1977) and the Depression Anxiety Stress Scales (Lovibond & Lovibond, 1995), both of which provide only limited coverage of the range of depression symptoms. I also will not discuss self-rated depression scales that were designed for specific contexts and populations, such as the postpartum period (see Sultan et al., 2022 for a review) and older adults (see Balsamo et al., 2018 for a review). Finally, I will not discuss omnibus rating scales designed to assess a much broader range of symptoms and impairments (e.g., Achenbach et al., 2017; Kraus et al., 2005). Although these measures capture a great deal of information in an efficient manner, their value for assessing depression specifically is less clear.
Currently, some of the most widely used and best-studied self-rated scales for adults are the Beck Depression Inventory, second edition (BDI-II; Beck et al., 1996), the self-report versions of the IDS (IDS-SR; Rush et al., 1996) and QIDS (QIDS-SR; Rush et al., 2003), the Patient Health Questionnaire-9 (PHQ-9; Kroenke et al., 2001), and the expanded version of the Inventory of Depression and Anxiety Symptoms (IDAS-II; Watson et al., 2012).
The BDI-II (Beck et al., 1996) is a 21-item self-rating scale that covers most
The IDS-SR and QIDS-SR include the same items as their respective clinician-rated versions, which facilitates comparison and integration of data from both sources. Like the clinician-rated versions, the self-report versions have good psychometric properties (Rush et al., 1996, 2003).
The PHQ-9 (Kroenke et al., 2001) has nine items, each reflecting a
The IDAS-II (Watson et al., 2012) is unique among the measures reviewed here due to its multilevel, transdiagnostic design. It includes 99 items reflecting 18 factor-analytically derived dimensions of internalizing symptoms (depression, mania, anxiety disorders, and posttraumatic stress and obsessive-compulsive disorders). The 18 scales can be reduced to three higher-order dimensions: distress, fear/obsessions, and positive mood. In addition, a 20-item General Depression scale can be extracted that covers most
Two of the most frequently used self-rated scales for children and adolescents are the Children’s Depression Inventory, 2nd edition (CDI-2; Kovacs, 2011) and the Mood and Feelings Questionnaire (MFQ; Costello & Angold, 1988). Both the CDI-2 and MFQ have child self-report and parent-report versions, and the CDI-2 also has a teacher-report version.
The CDI-2 is designed for youth ages 7 to 17 and includes 28 items covering a broad range of depression and associated symptoms. There is also a 12-item short form. In addition, the CDI-2 has a 17-item version for parents and a 12-item version for teachers to complete about the child. The differing number of items for the child, parent, and teacher forms reflects each informant’s unique perspective. However, along with differences in rating formats, this makes it more challenging to compare results across informants. The CDI-2 exhibits good internal consistency, test–retest stability, and convergent and construct validity; acceptable discriminant validity; and sensitivity to change (Dougherty et al., 2018; Kovacs, 2011).
The MFQ (Costello & Angold, 1988) assesses depression in 8- to 18-year-olds. It consists of 33 items covering the
Transdiagnostic Features
As exemplified by the transdiagnostic structure of the IDAS-II (Watson & O’Hara, 2017), most depression symptoms are also evident in other disorders. In this section, I briefly discuss the assessment of three selected transdiagnostic features: anhedonia, irritability, and suicidality. Although each of these features is assessed by one or more items in most of the diagnostic interviews and rating scales discussed above, for more reliable and detailed information, measures focusing specifically on these constructs are required. In addition to the measures described below, a number of the IDAS-II subscales may be used to assess transdiagnostic constructs (e.g., dysphoria, insomnia, suicidality, and ill-temper).
Anhedonia is a cardinal symptom of depression and has been shown to predict poor response to antidepressant pharmacotherapy (Rizvi et al., 2016; Sheehan, 2022). However, it is also characteristic of other disorders, particularly schizophrenia spectrum disorders. Anhedonia is thought to reflect dysfunction of neural reward circuitry, and can be decomposed into a number of aspects including deficits in anticipatory and consummatory pleasure (Khazanov et al., 2020; Rizvi et al., 2016). Most measures of anhedonia use a self-rating format, but there are also a number of behavioral and neural paradigms assessing various aspects of reward dysfunction (Khazanov et al., 2020; Rizvi et al., 2016). The most widely used self-rating measure is the Snaith-Hamilton Pleasure Scale (Snaith et al., 1995), which focuses on consummatory pleasure deficits. The self-rated Temporal Experience of Pleasure Scale (Gard et al., 2006) is also widely used, and assesses both anticipatory and consummatory pleasure.
The experience of pleasure may vary across contexts (e.g., food/taste, physical or perceptual sensations, social interactions; but see Khazanov et al., 2020 for contrary evidence). Two measures that focus specifically on anhedonia in social contexts are the Revised Social Anhedonia Scale (RSAS; Mishlove & Chapman, 1985) and the Anticipatory and Consummatory Interpersonal Pleasure Scale (ACIPS; Gooding & Pflum, 2014). The RSAS is most frequently used in research on schizophrenia spectrum disorders and focuses on anticipatory pleasure. The ACIPS is used in research on a variety of disorders, covers both anticipatory and consummatory pleasure, and has child, adolescent, and adult versions.
In the
A number of measures have been developed to assess irritability, or closely related constructs such as anger, in children (Althoff & Ametti, 2021) and adults (Saatchi et al., 2023; Toohey & DiGiuseppe, 2017). One of the most commonly used measures is the Affective Reactivity Index (Stringaris et al., 2012), which was initially developed as a rating scale for children and adolescents, with self- and parent-report forms, but has been extended to adults (Mulraney et al., 2014) and is also available as a clinician rating scale (Haller et al., 2020). Existing measures do not distinguish tonic from phasic irritability, so there is a need for further scale development in this area (Klein et al., 2021).
Depression is arguably the leading risk factor for suicide and suicide attempts, but suicide rates are also elevated in most other mental disorders. Although suicide is extremely difficult to predict due to its low prevalence (Franklin et al., 2017), assessment of the spectrum of suicidal ideation and behavior is critical for treatment planning and monitoring. Numerous measures of suicidal ideation and behavior for youth and adults have been developed (for reviews see Carter et al., 2019; Runeson et al., 2017; Thom et al., 2020).
Two of the most frequently used measures are the Columbia Suicide Severity Rating Scale (C-SSRS; Posner et al., 2011) and the Scale for Suicide Ideation (SSI; Beck et al., 1997). The C-SSRS has clinician and self-rating versions, as well as a computer-automated version using interactive voice technology. It includes subscales for severity of ideation, intensity of ideation, behavior, and lethality. The SSI is also available in clinician- and self-administered formats. More recently, there has been growing interest in using ecological momentary assessment (Sedano-Capdevila et al., 2021) and other approaches to ambulatory assessment, such as monitoring physiology and physical and geospatial activity (Kleiman et al., 2021), to assess suicidality and suicide risk, although this work is still in its infancy.
Recommendations
In this section, I make some broad recommendations of measures for adults and youth for the various assessment goals. These recommendations draw on traditional aspects of reliability (internal consistency, test–retest stability, and interrater reliability, as relevant) and validity (convergent, discriminant, construct, and as relevant, sensitivity to change). I also took adequacy of coverage of the domain (or content validity) into account. Where appropriate and where data were available, I considered factor structure and MI. Finally, I looked for data on incremental validity and treatment utility and took into account the few such data that are available. However, it is important to note that the psychometric properties of measures can vary as a function of the population being assessed. Hence, it is important to consult the literature for data on the measure in the target population of interest.
For diagnosis and prognosis, I recommend using a semistructured diagnostic interview (SCID-5 for adults; K-SADS for youth and a parent), supplemented by a timeline to elucidate the course of depression (McCullough et al., 2016). In clinical contexts, the MINI or the clinician versions of these interviews are recommended. It may also be desirable to assess transdiagnostic features using the IDAS-II or more focused measures of specific features.
The information from the diagnostic interviews and transdiagnostic measures discussed above is also critical for case conceptualization and treatment planning. In addition, it is necessary to assess the severity of depression and a range of personal, interpersonal, and systemic factors that may contribute to development and maintenance and provide treatment targets (Dougherty et al., 2018; Persons et al., 2018). Optimally, both clinician and self-rating scales should be used to assess severity, although in many contexts only the latter will be feasible. For adults, the IDS-C and IDS-SR have the advantage of being complementary instruments; however, a semistructured version of the HAM-D and the BDI-II or IDAS-II are also good choices. In clinical practice, the QIDS-C and QIDS-SR are recommended or, if that is too burdensome, only the QIDS-SR or BDI-II. For children, the CDRS is recommended. For adolescents, a well-validated clinician rating scale is not available. Provisionally, one might use the CDRS for younger adolescents and a semistructured version of the HAM-D or the IDS-C/QIDS-C for older adolescents. In addition, for children and adolescents, both the child and a parent should complete the CDI-2 or the MFQ.
For treatment monitoring and evaluation, the same clinician and self-rating scales should continue to be used throughout the course of treatment, although in many instances only self-rated scales will be feasible. Unfortunately, I am not aware of evidence-based guidelines for the frequency of these assessments; however, attenuation effects, feasibility, and patient burden must all be considered. Finally, in contexts requiring screening, the PHQ-9 or PHQ-2 and the SMFQ are reasonable choices for adults and youth, respectively.
Additional Areas Needing Research
A number of conceptual and methodological challenges and gaps in the literature have been noted throughout this article. I will not repeat them here, but I will briefly comment on several other issues that have not been raised. First, for over two decades, investigators have been pointing to the need for research on the incremental validity and treatment utility of assessment tools, and have outlined a number of study designs that can provide this information (e.g., Hunsley & Meyer, 2003; Nelson-Gray, 2003). Although I have noted several important examples of such work (e.g., the incremental validity of semistructured diagnostic interviews; the treatment utility of measurement-based care), there are still very few data on the incremental validity and treatment utility of most depression measures (Dougherty et al., 2018).
Second, advances in technology have provided new opportunities and challenges. Several of the rating scales reviewed above are available in voice-activated automated formats, and many semistructured interviews can now be computer-administered. Furthermore, diagnostic interviews and clinician- and self-rating scales can be administered via the internet. However, the field is still in the early stages of assessing the equivalence of these newer formats to traditional means of administration (D’Avanzato & Zimmerman, 2017; Sheehan, 2022).
Relatedly, the use of ambulatory assessment, both active (e.g., ecological momentary assessment [EMA]) and passive (e.g., wearable devices monitoring physiology, activity, and geolocation), has mushroomed (Stange et al., 2019). In addition to offering new approaches for between-persons research, it provides exciting opportunities for within-person designs, which may have greater clinical relevance (Wright & Woods, 2020). However, the field has been slow to develop standardized, well-validated EMA assessments for depression, perhaps because surveys must be brief and the optimal time frame for assessing depressive symptoms is not uniform (e.g., mood can be assessed frequently throughout the day, but sleep and weight/appetite require longer time frames). In addition, the incremental value and clinical utility of passive ambulatory assessments are unclear (Kleiman et al., 2021).
Finally, there have been significant advances in the analysis of facial and vocal characteristics (Girard & Cohn, 2015) and natural language, both in social media and interpersonal interactions (e.g., Eichstaedt et al., 2018). These methods may ultimately provide new tools for diagnosis, screening, and monitoring clinical status, but it is still necessary to demonstrate incremental validity and clinical utility and address ethical concerns such as unintended biases (Martinez-Martin, 2019).
Limitations and Conclusion
The literature on the assessment of depression is vast and includes numerous approaches and measures. I have attempted to summarize some of the key issues, such as the nature of the depression construct and general considerations in assessment, and to review some of the most widely used diagnostic interviews, clinician and self-rating scales, and measures of relevant transdiagnostic constructs in adults and youth. I have also offered some recommendations for measures for diagnosis and prognosis, conceptualization and treatment planning, and treatment monitoring and evaluation in research and clinical contexts. It is important to note that the review is selective, rather than systematic, and that my recommendations are subjective, albeit informed by past narrative reviews and systematic reviews and meta-analyses where available. Despite the considerable effort and expense devoted to assessing depressive disorders and symptoms over the past 50 to 60 years, there is still much that is uncertain and unknown. Moreover, as alternative approaches to conceptualizing psychopathology, such as HiTOP, RDoC, and network models, develop, and as new technologies and applications emerge, the ways in which depression is conceptualized and assessed are likely to change as well.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Supported by National Institute of Mental Health grant R01 MH069942.
