Abstract
The optimal number of response categories (NRC) is among the most discussed, yet least settled, topics in self-report measurement. In addition, there is a dearth of scoping reviews that summarize its impact on validity evidence (e.g., evidence based on internal structure). To that end, we conducted a scoping review of methodological literature to provide developers of self-report measures and applied researchers with evidence-based recommendations for selecting the optimal NRC. Given the inconsistent results reported in previous research, a key recommendation is to investigate, through cognitive interviews, how a sample of potential participants with varying perspectives interprets the response options. This procedure is paramount to ascertain that response options are accurately interpreted and function as intended. The present scoping review is expected to become a valuable resource for applied researchers and practitioners to make informed decisions about the optimal NRC, taking into account validity evidence, and therefore contributes to the literature on educational and psychological measurement and research methods.
Plain Language Summary
The optimal number of response categories is a critical decision to be made in self-report measures, since it has implications for score reliability, validity evidence based on internal structure, validity evidence based on relations to other variables, and quantitative fairness analyses.
• Including vs. excluding the midpoint should be carefully considered in rating scales.
• Cognitive interviews should be utilized to ensure respondents uniformly understand response options.
Since Likert (1932) introduced the summative scoring technique, rating scales have been ubiquitously used for measuring unobserved continuous variables in many research fields including, but not limited to, counseling, education, health, marketing, and psychology (Wakita et al., 2012; Zumbo et al., 2007). For instance, researchers have commonly utilized rating scales for measuring individuals’ attitudes and perceptions of unidimensional and multidimensional constructs using various measurement instruments such as scales and surveys, among others (Barnette, 2010). Though a vast amount of published research has addressed the impact of the number of response categories (hereafter referred to as NRC) on psychometric properties (e.g., score reliability) of self-report measures, one of the most controversial issues has been selecting the optimal NRC, which supports validity of score interpretations and uses (E. P. Cox, 1980; Garner, 1960; Jones & Loe, 2013). It is worth noting that response categories, response options, and scale points are used interchangeably in the present review.
Literature Review
The Optimal NRC and Associated Factors
The debate over the appropriate NRC has been deeply rooted in the literature for approximately a century. For instance, Freyd (1923) argued that the appropriate number of scale points was a function of the scale purpose and the level of discrimination required by score users (i.e., distance among response options). Overall, researchers have taken different positions and provided relatively different recommendations for the optimal NRC. Many researchers have recommended four to seven response options (e.g., Barnette, 2010; Bindak, 2013; de Winter & Dodou, 2010; Sullivan & Artino, 2013). Others have recommended five to seven response options (e.g., Lietz, 2010). According to Dillman et al. (2014), the NRC should be large enough to cover the entire continuum of possible responses, but the difference between two adjacent scale points should not be so small as to be practically meaningless. In a review of nine studies published in the International Journal of Clinical and Health Psychology, Hartley (2014) found that measurement instruments with four or five scale points were commonly utilized.
Turning to respondents’ characteristics, researchers have outlined several factors that affect the appropriate NRC, such as participants’ age and education level. First, young learners or children should respond to measurement instruments with three to four options, whereas adult learners can handle five or more options, which enhances psychometric properties (e.g., Bandalos, 2018; Borgers et al., 2004; Ramsay, 1973). Second, based on a meta-analytic investigation, Narayan and Krosnick (1996) found that education level moderated some response option effects on attitude measurement. Specifically, respondents with a low education level were more prone to response order and acquiescence effects (i.e., responding positively to a statement regardless of its content), among others. A third factor has been the polarity (e.g., unipolar, bipolar) of the self-report measure. For unipolar instruments designed to measure intensity (e.g., very, somewhat), the recommended optimal NRC has generally been four. However, for bipolar measurement instruments designed to measure direction (e.g., agree or disagree), the recommended optimal NRC has ranged from five to seven. This allows for two or three levels of differentiation on either side of the middle response category (Dillman et al., 2014). A fourth factor associated with the NRC has been the research field. In health studies, for instance, there may be a desire to quantify very specific pain levels by providing 11 response options (less painful to most painful), given that each pain level requires a certain intervention or treatment (Safikhani et al., 2017).

It is worth noting, however, that when the NRC increases (e.g., seven), it is likely to exceed respondents’ capacity to discriminate among response options and consequently to increase measurement error (Komorita & Graham, 1965). Relatedly, the mental effort required to respond accurately increases, which can lead unmotivated participants to respond randomly and consequently produce more measurement error (Struk et al., 2017). Additionally, it has sometimes been challenging to label seven or more response options. In that sense, if researchers themselves find it challenging to label seven or more response options, how can participants be expected to accurately interpret them and select the response option that best reflects their position on the measured construct? Moreover, some researchers have not found a substantial impact of reducing the NRC on measurement properties (e.g., Jones & Loe, 2013). From a statistical perspective, Wolf et al. (2019) argued that “there was no rationale to switch to a 6-point response option when a 4-point response option was already in place” (p. 2).
Prior Research Limitations
It seems obvious that there have been several recommendations and associated factors regarding the NRC generally used in self-report measures. Stated another way, there have been no universally agreed-upon guidelines for the NRC to utilize when developing self-report measures (Krosnick & Presser, 2010; Menold et al., 2018). Specifically, whether to include or exclude the midpoint response category (i.e., an odd vs. an even NRC) has also been controversial among applied researchers (more information is provided below). In addition, there has been a notable paucity of scoping reviews of methodological literature that summarize the NRC impact on validity evidence, including score reliability, validity evidence based on internal structure, validity evidence based on relations to other variables, and quantitative fairness analyses (differential item functioning [DIF] and measurement invariance). These limitations call for more research efforts (e.g., scoping reviews) to guide such a critical decision (i.e., selecting the optimal NRC) when developing self-report measures to enhance score interpretations and uses. Practically, developers of self-report measures therefore need to make evidence-based and informed decisions about including versus excluding the midpoint as well as about the optimal NRC to utilize, both of which are likely associated with validity evidence. To that end, we conducted a scoping review of methodological literature, which has the potential to be a valuable resource for developers of self-report measures, applied researchers, and practitioners in many disciplines when designing a self-report measure and collecting validity evidence to support validity of score interpretations and uses.
In the sections that follow, we first outline a comprehensive validity framework that combines score reliability, sources of validity evidence, and quantitative fairness analyses. Second, we detail the method, focusing on the five steps of conducting the present scoping review. Third, we summarize results of previous research related to including versus excluding the midpoint as well as research related to the impact of the NRC on validity evidence, providing some recommendations at the end of each section. Fourth, we conclude with some practical implications for developers of self-report measures, applied researchers, and practitioners to make evidence-based and informed decisions with respect to selecting the optimal NRC, taking into account validity of score interpretations and uses.
A Comprehensive Validity Framework
Validity of score interpretations relies largely on the quality and rigor of evidence collected (American Educational Research Association [AERA] et al., 2014). With that said, researchers and practitioners should outline a comprehensive validity agenda that supports score interpretations and uses (Kane, 2013; Messick, 1989). The validity framework adopted in this review combines score reliability, validity evidence based on internal structure, validity evidence based on relations to other variables, and quantitative fairness analyses.
Score reliability has been viewed as a prerequisite for validity of score interpretations (Rios & Deng, 2022). Thus, we argue that score reliability is validity evidence, since valid score interpretations necessitate that participants respond consistently across items (internal consistency), test forms (parallel forms), and time points (test-retest) to reduce random measurement error. Similarly, to make valid score interpretations, we need to collect the sources of validity evidence that best support score interpretations and uses (AERA et al., 2014). In this review, we focus on validity evidence based on internal structure and relations to other variables, given that they have received much attention with regard to the NRC. Relatedly, researchers have conducted DIF analysis and measurement invariance testing to, respectively, examine whether individual items and scales/tests function equally across participant subgroups (Abulela & Rios, 2022; Khalaf & Abulela, 2021). These analyses have been recommended for ensuring fairness as a prerequisite for valid score interpretations and uses (Wells, 2021). The above discussion illustrates how score reliability, sources of validity evidence, and fairness analyses all constitute the overall body of validity evidence. With this in mind, applied researchers and practitioners should prepare a validity agenda with the sources of validity evidence that enhance score interpretations and uses.
Method
In this study, we adopted a scoping review framework, since it is more appropriate for identifying knowledge gaps, reviewing a body of literature for a specific topic, and identifying the types of available evidence in a given field (Munn et al., 2018). In that sense, the latter is of particular interest for reviewing the association between the NRC and validity evidence as illustrated in the validity framework outlined earlier. According to previous research (e.g., Arksey & O’Malley, 2005; Sucharew & Macaluso, 2019), scoping reviews have five main steps: (a) identifying research questions, (b) identifying relevant studies, (c) selecting studies (inclusion and exclusion criteria), (d) charting the data, and (e) collating, summarizing, and reporting results. In the next sections, we illustrate each step in more detail with a special emphasis on the present study.
Identifying Research Questions
There are two research questions for the present review: (a) “What recommendations exist in the extant literature for including versus excluding the midpoint response category?” and (b) “Does the NRC impact validity evidence, including score reliability, validity evidence based on internal structure and relations to other variables, and quantitative fairness analyses (measurement invariance and DIF)?”
Identifying Relevant Studies
To locate relevant studies, we conducted a web search in seven academic and commercial databases including APA PsycNet, EBSCOhost, ERIC, ProQuest, PsycINFO, Scopus, and Web of Science, in addition to Google Scholar, backward citations (i.e., reviewing publications cited in specific articles), and forward citations (i.e., locating an article cited by many other articles). Given that early research on rating scales dates back to the 1920s, we set the search dates between 1920 and 2020. Keywords used in the search process included “Likert scale,” “number of alternatives,” “number of categories,” “rating scale categories,” “response categories,” “response options,” “scale points,” “response alternatives,” “scale alternatives,” “midpoint,” “mid-point,” “neutral response,” “neutral point,” “middle response,” and “middle point.”
Selecting Studies: Inclusion and Exclusion Criteria
To be included in the present review, a study had to: (a) be published in an English- or Arabic-language peer-reviewed journal, (b) provide recommendations for including versus excluding the midpoint, and (c) investigate the impact of the NRC on score reliability, factor analysis, validity evidence based on relations to other variables, measurement invariance, DIF, or a combination of these. Based on the first inclusion criterion, preprints, unpublished articles, and articles published in other languages were not included in the present scoping review. Since choosing the optimal NRC has received considerable attention in different languages, we included Arabic studies, which made the present review more comprehensive. It is worth noting that locating Arabic studies was not a challenge, since both authors are bilingual.
Charting Studies
This step is crucial for sorting the included studies into recommendations for including versus excluding the midpoint or any source of validity evidence illustrated in the validity framework outlined in the study. However, if a particular study addresses the impact of the NRC on, for instance, both score reliability and factor analysis, it is summarized twice in the results section: in the (a) score reliability and (b) validity evidence based on internal structure subsections. Such a study is counted once in the total number of included studies that met the inclusion criteria, which is 58. Out of the 58 studies, 11 (19%) presented recommendations for including versus excluding the midpoint, 22 (37.9%) covered score reliability, 13 (22.4%) addressed validity evidence based on internal structure (i.e., factor analysis), three (5.2%) covered validity evidence based on relations to other variables (i.e., correlation coefficients), two (3.4%) focused on quantitative fairness analyses, and seven (12.1%) covered more than one source of validity evidence. It is evident that most published studies focused on score reliability relative to other sources of validity evidence.
Collating, Summarizing, and Reporting Results
In this final step, we summarized the 58 included studies and sorted them according to their thematic focus (see step 4). Specifically, there were five categories: (a) the midpoint response category, (b) score reliability, (c) validity evidence based on internal structure, (d) validity evidence based on relations to other variables, and (e) quantitative fairness analyses.
Results
In the following subsections, we provide detailed results for each category outlined above. Such detailed description is intended to give the reader the necessary information to comprehensively and critically understand the impact of the NRC on validity evidence of score interpretations and uses of self-report measures.
The Midpoint Response Category
Besides the appropriate NRC to utilize in developing self-report measures, whether to include or exclude the midpoint response category has received considerable attention due to its association with measurement error, which can negatively affect validity evidence. In the extant literature, there has been little consensus on including versus excluding the midpoint. Matell and Jacoby (1972) argued that as the mean testing time and NRC increased, the selection of the “uncertain” response category decreased. Tsang (2012) highlighted two reasons underlying the controversy over including versus excluding the midpoint: (a) methodological and (b) epistemological. The former pertains to the effect of the NRC on score reliability and other sources of validity evidence, whereas the latter is associated with respondents’ beliefs and interpretations of the psychological distance among response options. Relatedly, Subedi (2016) recommended the use of the midpoint from an epistemological rather than a methodological perspective.
Rationale for Including versus Excluding the Midpoint
A number of researchers have called for including the midpoint for several reasons. First, participants have the opportunity to select the response category that reflects their true response, which may reduce random measurement error and consequently increase precision (Converse, 1970, as cited in Hurley, 1998; Nadler et al., 2015). Second, the midpoint may sometimes be meaningful for the construct being assessed, since it has a meaningful position on the latent trait continuum (Nadler et al., 2015; Wakita et al., 2012). Third, excluding the neutral response option may “(a) affect the judgment of extreme options…, (b) result in respondents favoring the option superior on the more important attribute, and (c) result in more risk aversion” (Nowlis et al., 2002, p. 319). Fourth, based on a discussion of the midpoint literature, some have argued that the omission of the midpoint may increase the rate of missing data, since participants who have neutral opinions may skip many items (Bandalos, 2018). Fifth, participants who truly have a neutral opinion face a forced choice between agreement and disagreement if the midpoint is not included (Chyung et al., 2017).
On the other hand, some researchers have provided different reasons for excluding the midpoint. First, the inclusion and labeling (e.g., undecided vs. neutral) of the midpoint have produced different score distributions (Holdaway, 1971). From a psychometric perspective, different response distributions mean different levels of score variability due to range restriction, which affects the psychometric properties of measurement instruments. Second, the midpoint can result in inaccurate responses, particularly when participants respond with social desirability; technically, social desirability bias can be mitigated by eliminating the midpoint (Garland, 1991). Third, overuse of the midpoint is a potential outcome when instrument items are ambiguous, socially undesirable, or unfamiliar to participants (Chyung et al., 2017). Fourth, some participants have misinterpreted the neutral response option (Komorita, 1963), which yields more measurement error due to a random response style. Fifth, inclusion of the “no opinion” response option may be a target for respondents lowest in cognitive skills, and consequently “may not enhance data quality and instead may preclude measurement of some meaningful opinions” (Krosnick et al., 2001, p. 371).
Empirical Studies on the Midpoint
In the early 1980s, Kalton et al. (1980) conducted an experimental study in which a random half of the sample was offered the midpoint, whereas the other half was not. The authors concluded that offering the middle response option significantly increased the proportion of respondents who expressed a neutral response. Kulas and Stachowski (2009) conducted a study among 100 undergraduates using the international personality item pool. They found that choosing the middle response option took more time and was negatively correlated with item clarity and respondents’ self-awareness. Baka et al. (2012) collected data from 71 users of an online voting advice application. Based on the selection of the midpoint, participants were classified into two main categories: (a) those who did not have an attitude and (b) those who were undecided. According to a thematic analysis of participants’ answers, there were four justifications underlying the selection of the midpoint: (a) indifference, (b) ambivalence, (c) ambiguity of items, and (d) lack of knowledge. Overall, the four justifications indicate that participants may interpret the midpoint in ways inconsistent with the intended score interpretations. In particular, the third justification calls for strictly following item writing guidelines to reduce measurement error. Chyung et al. (2017) summarized research findings on using the midpoint response category as follows:
“Simply omitting a midpoint from the scales, however, is not the best practice. The more important question that practitioners and researchers should seek to answer is not whether or not to include a midpoint, but rather when to omit or present a midpoint in a Likert-type scale” (p. 18).
To explain Chyung and colleagues’ view, let us consider two examples in which the midpoint is offered but the response category labels differ: a behavioral frequency scale whose midpoint is labeled “Sometimes” and an agreement scale whose midpoint is labeled “Neither agree nor disagree.”
In the first example, the middle response category “Sometimes” has a meaning on a rating scale measuring students’ behavioral engagement, since some respondents do “sometimes” attend classes on time. In that case, including the middle response category is necessary to provide some respondents with the response category that better reflects their standing on the latent construct, behavioral engagement. In the second example, however, including the midpoint may not be a good methodological decision. Specifically, “Neither agree nor disagree” may simply mean “I do not have an opinion.” When selected by a large number of careless respondents, this is likely to yield an increased amount of measurement error. In addition, when scored, this middle category receives a score of three, which is not reasonable for a respondent who is not even interested in the item. In that sense, we argue that item writers should be very cautious when including or excluding the midpoint. As noted, if excluded, some respondents are not likely to find the response option that best reflects their standing on the latent construct (see example one). If included, it can be a target for participants who prefer to respond with social desirability, those who do not have an opinion, or those who avoid exerting the mental effort to carefully read and effortfully respond to rating scale items (see example two).
Practical Implications for Including versus Excluding the Midpoint
To summarize, based on a thorough literature review and work experience with measurement specialists, we generally recommend that developers of self-report measures proceed cautiously and take the following considerations into account when deciding whether to include or exclude the midpoint response category:
The neutral middle response option may mean “I do not have an opinion” or “I do not care,” and consequently its selection may not be informative in most cases, such as example two presented above.
The neutral response option may or may not exist on the latent trait continuum; when it does not, introducing “no opinion” or “do not know” options could be helpful.
Not all participants may interpret the psychological distance between the midpoint and other response options similarly, which leads to increased measurement error and consequently less precision. This also may be the case for other response options.
When the midpoint is randomly selected by a large number of careless respondents, measurement error increases; participants are then scored with this error included, and decisions are informed accordingly.
A respondent may have a high location on the latent trait continuum just because of randomly selecting the midpoint in most items. In fact, s/he may not be interested or even care about the topic of the study. Conversely, despite having an opinion, other respondents may have low locations on the latent trait continuum because of selecting low response options that truly express their opinions.
The overselection of the midpoint is known as “moderacy bias,” which should be reduced, particularly when it does not reflect participants’ true opinions.
Regarding score reliability, participants who randomly select the midpoint increase random measurement error, and consequently score reliability decreases.
Regarding validity evidence, increased random measurement error underestimates correlation coefficients, since randomness does not correlate with anything, which negatively impacts validity evidence based on internal structure and relations to other variables.
NRC and Validity Evidence
As discussed, score reliability, validity evidence based on internal structure, validity evidence based on relations to other variables, and quantitative fairness analyses (DIF and measurement invariance) are paramount for valid score interpretations and uses. Accordingly, reviewing the impact of NRC on these sources of validity evidence is necessary for developers of self-report measures and applied researchers.
Score Reliability
In classical test theory, score reliability is the proportion of true score variance to observed score variance (Bandalos, 2018). Put another way, the magnitude of score reliability estimates increases as the amount of true score variance increases or, alternatively, as the amount of random measurement error variance decreases. That said, developers of self-report measures and applied researchers should seek the factors that maximize the amount of true score variance relative to that of random measurement error. In that sense, the question that arises is, “What is the effect of the NRC on score reliability?” To address this question, there have been two main research streams in the literature. In the first, researchers have concluded that as the NRC has increased, score reliability estimates have increased. In the second, however, score reliability estimates have not necessarily increased as the NRC has increased.
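In classical test theory notation (a standard formulation rather than an equation reproduced from any reviewed study), this definition can be written as

$$\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2}, \qquad \sigma_X^2 = \sigma_T^2 + \sigma_E^2,$$

where $\sigma_T^2$ is true score variance, $\sigma_E^2$ is random measurement error variance, and $\sigma_X^2$ is observed score variance. The expression makes explicit that anything inflating error variance, including a poorly functioning set of response categories, lowers the reliability coefficient.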
Research Stream 1
Most researchers have concluded that the magnitude of score reliability estimates has increased as the NRC has increased. For instance, Symonds (1924) and Pemberton (1933, as cited in Nadler et al., 2015) concluded that score reliability estimates were higher when scale points were seven compared to five. Administering the 24 bipolar adjectives and sociability scales among 260 undergraduates, Komorita and Graham (1965) concluded that the difference in score reliability estimates, as a function of the two and six response options, was moderately large for the sociability scale. However, it was negligible for the 24 bipolar adjectives scale. Lissitz and Green (1975) simulated data and found that alpha and test-retest score reliability estimates increased with more response options (2, 3, 5, 7, 9, and 14). Cicchetti et al. (1985) simulated data for a clinical scale with 2 to 100 response options and concluded that interrater reliability estimates increased up to seven points. Similarly, Alwin and Krosnick (1991) used three to nine response options and drew the same conclusion, indicating that there was no notable difference in score reliability estimates between the three and four response options. Bandalos and Enders (1996) simulated data and found that score reliability estimates increased as the NRC and interitem correlations increased. They added, however, that maximum gains were noted with five or seven response options.
Preston and Colman (2000) compared the effect of two to 11 response options on alpha and test-retest score reliability coefficients using the service satisfaction scale. Based on data collected from 149 employees, the authors found that score reliability estimates increased as the NRC reached 11. Weems and Onwuegbuzie (2001) administered the students’ experience questionnaire among 1,162 undergraduates and concluded that alpha increased when the NRC was seven. Weng (2004) administered the teacher attitude test to 1,247 college students with three to nine response options. Results indicated that fewer response options tended to yield lower reliability estimates, particularly lower test-retest reliability (i.e., the correlation between the first and second administrations). Zumbo et al. (2007) simulated data for 25 items with two to seven response categories and found that alpha and theta coefficients increased as the NRC increased. Lozano et al. (2008) simulated data for 30 items with 2 to 9 scale points and 0.20 to 0.90 interitem correlations. The authors concluded that score reliability estimates increased as the NRC increased from four to seven, particularly with higher interitem correlations. The authors also noted that there was no notable increase in score reliability estimates beyond seven response options.
To continue this line of research, Wong et al. (2011) investigated the effect of four and five response categories on coefficient alpha using the big five personality inventory, job satisfaction, job commitment, job perception, and demanding feedback behavior scales. The authors found that alpha was higher with five response categories. A. Cox et al. (2012) administered the Minnesota multiphasic personality inventory-2 to 199 undergraduates, who responded to two- and four-response option versions of the inventory. The authors concluded that score reliability estimates increased for the extended response options. In a replication study, Cox and colleagues obtained the same results (A. Cox et al., 2017). Comparing the effect of five to nine response categories based on data obtained from 100 participants who responded to the compulsive buying scale, Choudhury and Bhattacharjee (2014) concluded that reliability coefficients increased as the NRC increased. Finn et al. (2015) found that reliability estimates increased for four compared with two response options after administering the Minnesota Multiphasic Personality Inventory-2-Restructured Form among 406 undergraduate students. Using two scales measuring attitudes toward the European Union and studying effort among 800 undergraduates, Menold and Tausch (2016) concluded that omega coefficients were higher when the NRC was seven compared to five. In their study among 1,358 undergraduates who were randomly assigned to 10 groups based on the NRC (2–11) to complete a personality measure, Simms et al. (2019) concluded that psychometric precision was attenuated for response scales with two to five response options. Drawing on clinical assessment data from 829 undergraduate students, Shi et al. (2021) used the current symptoms scale to diagnose attention-deficit/hyperactivity disorder with three different forms based on the NRC (two, four, and artificially dichotomized responses derived from the observed 4-point alternatives). They found that fewer response categories yielded less measurement precision.
To summarize the first research stream, more response options have been associated with higher score reliability estimates. However, it is worth noting that most of the studies reviewed above have utilized coefficient alpha, despite its strict assumptions associated with the essentially tau-equivalent model (e.g., equality of factor loadings), which have been difficult to meet in practice (e.g., Dunn et al., 2014). If violated, alpha may underestimate or overestimate the true reliability, particularly in the case of correlated errors (Bandalos, 2018). We argue that using alpha without checking its assumptions makes it difficult to determine whether gains in score reliability estimates are attributable to the increased NRC or to assumption violations. In more detail, if a scale with five response options has more correlated errors than one with three response options, it becomes challenging to determine whether the increase in reliability estimates is due to the correlated errors or the increased NRC.
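As a minimal illustration of the computations discussed in this subsection, the sketch below (a hypothetical simulation written for this review, not code from any of the studies cited) generates one-factor continuous item responses, discretizes them into k ordered categories, and computes coefficient alpha for each k. The data-generating model, sample size, and cut points are assumptions; changing them (e.g., the interitem correlations or item quality) can change the pattern, which is precisely the caution raised above.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for a respondents-by-items matrix of numeric ratings."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

rng = np.random.default_rng(seed=42)
n_respondents, n_items = 500, 10

# Hypothetical one-factor data: a latent trait plus item-specific noise.
theta = rng.normal(size=(n_respondents, 1))
continuous = 0.7 * theta + 0.7 * rng.normal(size=(n_respondents, n_items))

# Discretize the same continuous responses into k ordered categories
# (equal-width cut points) and compare alpha across values of k.
for k in (2, 3, 5, 7, 11):
    cuts = np.linspace(continuous.min(), continuous.max(), k + 1)[1:-1]
    categorized = np.digitize(continuous, cuts) + 1  # categories 1..k
    print(f"{k:2d} categories: alpha = {cronbach_alpha(categorized):.3f}")
```

Such a sketch only probes the question under the stated assumptions; it does not settle whether more categories improve reliability for a given instrument and population.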
Research Stream 2
Other researchers have concluded that score reliability estimates have not necessarily increased with more response options (e.g., Bendig, 1953). Using the food preference rating scale, Bendig (1954) found that score reliability estimates were higher for the three and seven response options compared to the two, five, and nine response options. Matell and Jacoby (1971) compared the impact of 2 to 19 response options on test-retest reliability estimates and found that the increase in response options did not necessarily lead to an increase in score reliability estimates. Others have noted a joint effect of the trait being measured and the NRC on score reliability. For instance, Chomeya (2010) compared five and six response categories using the locus of control, attitude toward alcoholic drinks, and achievement motivation scales. After administering each scale to 60 undergraduates with five and six response options, he indicated that the alpha coefficient was higher for the locus of control scale with five scale points. Conversely, the other two scales had higher score reliability estimates when the NRC was six.
Using the Rosenberg self-esteem scale administered to 1,217 students, Leung (2011) found no major differences in coefficient alpha among the four, five, six, and 11 response options. Shaftel et al. (2012) used the attitudes toward cultures scale administered to 1,554 undergraduates and found that alpha estimates were equal for the six, seven, and nine response options. After comparing two to six response options, Lee and Paek (2014) did not find differences in alpha coefficients. However, when the NRC was two and the number of items was five, there was a substantial difference in alpha coefficients, which suggested an interplay between the NRC and the number of items. Colvin et al. (2020) found that score reliability estimates were equal for the Rosenberg self-esteem scale with four and five response options.
Practical Implications
To conclude, the NRC may not necessarily increase or decrease score reliability estimates, since other factors may play a potential role (e.g., item quality). Stated differently, the relation between the NRC and score reliability is not that simple. We argue that a scale with fewer response options (e.g., four) may have a higher score reliability estimate than a scale with more response options (e.g., five or more). To illustrate, the former may have well-written items, and therefore respondents are likely to respond more accurately, yielding less measurement error and consequently higher score reliability estimates. Conversely, the latter may have some flawed items or noise and therefore elicit a more random response style, increasing the amount of measurement error, which in turn results in lower score reliability estimates. In this case, the score variance contains more noise than signal (i.e., true variance; Simms et al., 2019). That said, applied researchers and practitioners should be cautious about drawing a general conclusion that more response options yield higher score reliability estimates, given the contradictory results discussed above as well as other factors (confounding variables) that may affect score reliability estimates.
Validity Evidence Based on Internal Structure
When the intended score interpretations and uses stipulate collecting validity evidence based on internal structure, exploratory and confirmatory factor analyses (hereafter referred to as EFA and CFA) are used, respectively, to explore or test the hypothesized factor structure of a self-report measure. That said, the quality of this source of validity evidence depends essentially on the robustness with which EFA and CFA are conducted. Hall (2017) stated that one of the most controversial methodological issues is the NRC and its effects on factor analysis results. In the sections that follow, we discuss the effect of the NRC on EFA and CFA.
NRC and EFA
EFA is widely used for exploring the factor structure of a measurement instrument when there is no prior theory or hypothesized factor structure. As a multivariate technique, EFA has a set of assumptions that should be met in order to obtain statistical conclusion validity and consequently draw valid inferences. These assumptions include sampling adequacy (assessed by the Kaiser-Meyer-Olkin [KMO] measure, which should be ≥ .60), factorability of the correlation matrix (assessed by Bartlett’s test of sphericity, which should be significant, indicating that the correlation matrix is not an identity matrix), and a determinant of the correlation matrix greater than .00001, indicating that multicollinearity is not a concern (Field, 2013; Tabachnick & Fidell, 2018).
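To make these assumption checks concrete, the sketch below (an illustrative implementation using NumPy and SciPy with the cutoffs cited above; it is not code from Abdelsamea, 2020, or any other reviewed study) computes the determinant of the correlation matrix, Bartlett’s test of sphericity, and the KMO measure from a respondents-by-items data matrix.

```python
import numpy as np
from scipy import stats

def efa_assumption_checks(data: np.ndarray) -> dict:
    """data: respondents-by-items matrix of numeric item responses."""
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)

    # Determinant of the correlation matrix (guideline: should exceed .00001).
    determinant = np.linalg.det(corr)

    # Bartlett's test of sphericity: H0 is that the correlation matrix is an identity matrix.
    chi_square = -(n - 1 - (2 * p + 5) / 6) * np.log(determinant)
    df = p * (p - 1) / 2
    p_value = stats.chi2.sf(chi_square, df)

    # KMO: compares observed correlations with partial correlations derived
    # from the inverse of the correlation matrix (guideline: KMO >= .60).
    inv = np.linalg.inv(corr)
    scale = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / scale
    np.fill_diagonal(partial, 0.0)
    off_diag_corr = corr - np.eye(p)  # keep only off-diagonal correlations
    kmo = (off_diag_corr ** 2).sum() / ((off_diag_corr ** 2).sum() + (partial ** 2).sum())

    return {"determinant": determinant, "bartlett_chi2": chi_square,
            "bartlett_p": p_value, "KMO": kmo}
```

A developer could run such checks separately on pilot data collected with different numbers of response categories to see whether, as reported by Abdelsamea (2020), a small NRC pushes the KMO value or the determinant toward their cutoffs.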
In a recent study, Abdelsamea (2020) found that the NRC affected the extent to which these assumptions can be met. In more detail, using the academic boredom scale administered to 307 undergraduates, the author reported that KMO values were 0.75, 0.80, and 0.82 for the three, five, and seven response options, respectively. This result highlights that the sampling adequacy assumption may be a function of the NRC in addition to sample size. For instance, if a researcher has a large sample size but administers an instrument with a small NRC (e.g., three), this is likely to yield a low KMO value (e.g., 0.50). In this case, one may mistakenly conclude that the data are not factorable due to violating the sampling adequacy assumption. On the other hand, another researcher may have a smaller sample size but more response options (e.g., six or seven), which may lead to meeting the sampling adequacy assumption. Regarding the other assumptions, as the NRC increased, the value of χ2 associated with Bartlett’s test of sphericity increased and was more likely to be significant. Relatedly, determinant values were 0.0002, 0.0006, and 0.0004 for the three, five, and seven response options, respectively. The above results highlight the potential effect of the NRC on the extent to which EFA assumptions are met.
Compared to the dearth of studies conducted on the NRC impact on EFA assumptions, more studies have been conducted on response options and EFA results. For instance, Comrey and Montag (1982) administered a translated version of the Comrey personality scale to 159 male applicants for a motor vehicle. Results revealed higher intercorrelations and factor loadings for the seven compared to the two response options. Weems (1999) administered the students’ experience questionnaire to 1,162 university students with five different numbers of response options (three to seven). The percentages of variance explained were 45.2%, 46.2%, 46.4%, 50.4%, and 51% for the three to seven response options, respectively. The author recommended using six response options to mitigate the effect of the midpoint compared to the four response options. Lozano et al. (2008) simulated data for a 30-item one-factor model with two to nine response options, 0.20 to 0.90 intercorrelations, and four sample sizes (50, 100, 200, 500). The authors found that percentages of variance explained tended to increase as the NRC increased, regardless of the magnitude of intercorrelations or sample size. After seven response options, however, there was a less notable increase, and consequently the authors recommended four to seven response options. Although this study employed simulation and four different sample sizes, it utilized maximum likelihood estimation with ordinal data, which may lead to biased parameter estimates (Beauducel & Herzberg, 2006).
Abdelsamea (2020) also found that the NRC might be a potential factor underlying the number of factors extracted, the magnitude of loadings, and the amount of variance explained. For the academic boredom scale, in particular, the number of factors extracted utilizing parallel analysis was three, four, and four for the three, five, and seven response options, respectively. In addition, when the NRC was three, more items did not load significantly based on the threshold adopted (e.g., 0.30), and consequently the likelihood of revising or removing items increased. This raises a critical question about the association between fewer response options and removing items (i.e., construct underrepresentation as a threat to validity evidence based on test content; Messick, 1989). Consistent with prior research, various authors concluded that the amount of variance explained increased as the NRC increased, particularly with five or more response options (Green et al., 1997; Maydeu-Olivares et al., 2017; Mvududu & Sink, 2013).
Other researchers utilized principal component analysis to explore the factor structure underlying a measurement instrument, even with ordinal data and despite its being a data reduction technique. For instance, Muñiz et al. (2005) administered the Eysenck personality questionnaire with two to nine response options to 1,149 high school students and undergraduates. Results revealed that seven response options maximized the percentage of variance explained, and psychometric properties generally improved with four or more response options. Conversely, Chomeya (2010) concluded that the percentage of variance explained decreased as the NRC increased, except for the attitude scale. The author added that the number of components extracted was not equal across the NRC conditions, except for the locus of control scale. This result implies that the NRC can yield different factor structures and consequently invalid inferences about validity evidence based on internal structure. Using the Rosenberg self-esteem scale, Leung (2011) found no significant differences in factor loadings with 4, 5, 6, and 11 response options. Ismail (2015) administered the organizational commitment scale with three, five, seven, and nine response options to 369 undergraduates. He found that all items loaded on one component, with percentages of explained variance of 46.8%, 56.1%, 57.9%, and 60.2% for the three, five, seven, and nine response options, respectively.
NRC and CFA
CFA is widely used to test the hypothesized factor structure of an instrument for a specific model or competing models (Brown, 2015; Kline, 2016). As with EFA, a great deal of research has been conducted to investigate the NRC effect on CFA results. Overall, research findings have not been consistent, and therefore recommendations have been contradictory, particularly for goodness-of-fit indices. However, there has been relative agreement that more response categories are associated with better CFA outcomes, including greater measurement precision, higher standardized loadings, and a lower likelihood of obtaining spurious factors.
In the first research stream, a group of researchers concluded that more response options yield better goodness-of-fit indices. For instance, in a simulation study, Dolan (1994) manipulated two, three, five, and seven response options and three sample sizes (200, 300, 400) to examine their effects on the chi-square statistic. Results revealed that as the NRC increased, the value of chi-square tended to be smaller, increasing the potential for accepting the hypothesized factor structure. Maydeu-Olivares et al. (2017) noted that as the NRC decreased, the probability of rejecting incorrect models decreased. Accordingly, with fewer response options, it is more likely to accept misspecified models and make invalid inferences about factor structures, particularly with the common factor model. Hall (2017) used empirical and simulated data to investigate the effect of the NRC on the chi-square test statistic, the standardized root mean square residual (SRMR), and the root mean square error of approximation (RMSEA). Results revealed that as the NRC increased, the probability of correctly rejecting incorrect models increased. Relatedly, the author found that as the NRC decreased, the probability of obtaining spurious factors increased. Xu and Leung (2018) administered the Rosenberg self-esteem scale to 1,807 students from secondary schools in Macau. Varying the NRC from 4 to 11, the authors concluded that the four-point scale was not recommended due to its higher skewness and lower standardized loadings.
Shi et al. (2021) fitted ordinal CFA models for two, four, and artificially dichotomized responses derived from the observed 4-point alternatives. Overall, the NRC affected parameter estimates, standard errors, goodness-of-fit indices, and individuals’ test scores. In particular, the fewer the NRC, the less the measurement precision and the less power for goodness-of-fit indices to detect model misfit. The authors recommended reconsidering a small NRC when developing self-report measures used for clinical purposes. Abdelsamea (2020) added that the standard errors of estimates tended to be smaller as the NRC increased. Consistent with the EFA results, the number of items that did not load significantly on their hypothesized factors increased as the NRC decreased.
A second group of researchers found that fewer response options yielded better indices for accepting the hypothesized factor structure. For example, Green et al. (1997) simulated data for 20 items with two, four, and six response options and found that the chi-square statistic was inflated with two response options. However, the comparative fit index (CFI) was higher with two response options. Based on item factor analysis and item response theory models, Maydeu-Olivares et al. (2009) concluded that goodness-of-fit indices were better as the NRC decreased from five to two. Abdelsamea (2020) concluded that RMSEA values tended to increase as the NRC increased, bearing implications for the likelihood of rejecting correct models.
A third group of researchers reported scale-dependent findings. After fitting a one-factor CFA model for the attitude toward the European Union scale, Menold and Tausch (2016) found higher standardized loadings, a higher CFI, and a lower chi-square statistic when the NRC was seven compared to five. For the studying effort scale, however, the CFI value was higher and the chi-square statistic was lower for the five response options.
Practical Implications
To summarize, developers of self-report measures and applied researchers need to be highly attentive to the NRC when conducting EFA or CFA to gather validity evidence based on internal structure. As illustrated, when conducting EFA, a small NRC (e.g., three) may lead to incorrectly concluding that assumptions are violated and to over-factoring (i.e., spurious extracted factors), with more items becoming candidates for revision or removal due to their low standardized loadings (e.g., <0.30), potentially leading to construct underrepresentation. When conducting CFA, a small NRC may increase the probability of accepting incorrect models. Fewer response options may also yield overestimated standard errors, leading to more nonsignificant parameter estimates. Based on the studies reviewed above, we caution against sole reliance on goodness-of-fit indices to accept or reject hypothesized factor structures. Put another way, imagine an applied researcher who utilizes only goodness-of-fit indices to accept a hypothesized factor structure; s/he is likely to fail to reject an incorrect model due to a small NRC (less than four). In that sense, in addition to goodness-of-fit indices, applied researchers should utilize other pieces of evidence, such as standardized loadings, standard errors, and factor correlations, to accept or reject a hypothesized factor structure (Brown, 2015).
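As a minimal sketch of this recommendation (assuming the Python package semopy is available and using hypothetical variable names, engagement and item1 to item4), the example below fits a one-factor CFA and returns the parameter estimates and standard errors alongside the global fit indices, so that judgments do not rest on fit indices alone. It is a plain maximum likelihood illustration; ordinal Likert items are often better served by categorical estimators, which this sketch does not implement.

```python
import pandas as pd
import semopy  # assumed dependency; install with `pip install semopy`

# Hypothetical one-factor model for four behavioral engagement items.
MODEL_DESC = "engagement =~ item1 + item2 + item3 + item4"

def evaluate_cfa(data: pd.DataFrame):
    """Fit the CFA and return (fit indices, parameter estimates)."""
    model = semopy.Model(MODEL_DESC)
    model.fit(data)

    # Global goodness-of-fit indices (chi-square, CFI, RMSEA, etc.).
    fit_stats = semopy.calc_stats(model)

    # Loadings and (co)variances with standard errors and p-values;
    # examine these alongside, not instead of, the fit indices.
    estimates = model.inspect()

    return fit_stats, estimates
```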
Validity Evidence Based on Relations to Other Variables
Validity evidence based on relations to other variables can be concurrent (convergent or discriminant) and/or predictive, consistent with the intended score interpretations and uses (AERA et al., 2014). For criterion validity evidence, Chang (1994) found no difference between the four and six response categories. A. Cox et al. (2012, 2017) found no significant gains in convergent validity evidence for the extended response options. Similarly, Finn et al. (2015) found that differences between correlations with external criteria were very rarely statistically significant when comparing the two and four response options. Hilbert et al. (2016) found the same pattern after comparing two and five response options based on responses obtained from 866 participants who completed a questionnaire on the personality facet “dutifulness.”
Correlation and Regression Analyses
Since correlation and regression analyses are the two main statistical techniques used to quantify validity evidence based on relations to other variables, it has been argued that the NRC has potentially indirect effects on correlation and regression coefficients through score reliability (i.e., as illustrated earlier, the NRC may affect score reliability estimates in different directions). In more detail, some applied researchers may not be attentive to the association between correlation and score reliability. Low score reliability estimates for two variables attenuate their observed correlation (Kline, 2016; Trafimow, 2016). Technically speaking, the correlation between two variables is equal to or less than the square root of the product of their score reliability estimates (Kline, 2016), as illustrated in equation (1):
$$r_{XY(\max)} = \sqrt{r_{XX'} \times r_{YY'}} \qquad (1)$$

where $r_{XY(\max)}$ is the maximum observed correlation between variables X and Y, and $r_{XX'}$ and $r_{YY'}$ are their respective score reliability estimates.
For instance, suppose that the population correlation between emotion regulation and self-efficacy is .90 and the estimated score reliabilities of emotion regulation and self-efficacy are .90 and .88, respectively. Applying equation (1), the maximum estimated correlation is .89, owing to the high reliability estimates. On the other hand, if the respective estimated score reliabilities of emotion regulation and self-efficacy are .73 and .70, the maximum estimated correlation becomes .71, attenuating the population correlation between the two constructs (.90). With that said, validity evidence based on relations to other variables is negatively impacted by low score reliability estimates. As a result, one may conclude that convergent validity evidence is not supported due to the low observed correlation between two constructs, which is likely attributable to low score reliability that may arise from either measurement error or a small NRC.
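The arithmetic in this example follows directly from equation (1); the construct names and reliability values below are the hypothetical ones used above.

```python
import math

def max_observed_correlation(rel_x: float, rel_y: float) -> float:
    """Upper bound on the observed correlation given two score reliabilities (equation 1)."""
    return math.sqrt(rel_x * rel_y)

# Hypothetical emotion regulation / self-efficacy example from the text.
print(round(max_observed_correlation(0.90, 0.88), 2))  # 0.89
print(round(max_observed_correlation(0.73, 0.70), 2))  # 0.71
```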
Regarding predictive validity evidence, suppose that emotion regulation is hypothesized to predict self-efficacy among undergraduates. However, due to the low score reliability of one or both measures, one may conclude that emotion regulation does not provide predictive validity evidence based on the low observed correlation between the two measures. With regard to discriminant validity evidence, one desires a low correlation between the two measures of interest. In this case, low score reliability attenuates the correlation between the two constructs, and consequently one may conclude that there is discriminant validity evidence, which may be an invalid inference. As illustrated, one may or may not obtain validity evidence based on relations to other variables due to score reliability, which is potentially impacted by a small NRC.
Practical Implications
When collecting validity evidence based on relations to other variables, developers of self-report measures and applied researchers should be cautious that low estimated correlations are likely to be a function of their attenuated score reliability estimates, which in turn may be a function of the NRC.
Quantitative Fairness Analyses
Fairness of test score interpretations is the subject of the third chapter in the Foundations section of the Standards for Educational and Psychological Testing (AERA et al., 2014). To make fair, and consequently valid, score interpretations, particularly when comparing groups, test developers and psychometricians need to provide evidence that test items or the measure as a whole function equally across groups. Multiple group confirmatory factor analysis (MGCFA) is used to assess measurement invariance, or to test the lack of bias, at the scale level. On the other hand, DIF analyses, alongside sensitivity reviews by subject matter experts, are used to assess bias at the item level (Wells, 2021). In the context of assessing fairness using DIF or MGCFA, we ask whether the NRC makes a difference in whether an item or a measure is flagged as biased.
Menold and Tausch (2016) utilized MGCFA to assess the measurement invariance of the attitudes toward the European Union and studying effort scales while varying the NRC (five or seven). They noted that measurement invariance was not ascertained for either scale, particularly metric and scalar equivalence. They concluded that applied researchers should not assume that the same latent variable is measured with the same items when different response options are utilized. This result implies that a measure may appear fair or biased across groups depending on the NRC utilized. In that sense, applied researchers should be cautious when assessing measurement invariance. Conversely, Xu and Leung (2018) found no effect of varying the NRC from four to 11 on longitudinal measurement invariance, since scalar invariance was met. Such contrasting results are likely to leave applied researchers unable to determine the reasons for which measurement invariance is not ascertained (i.e., does the NRC play a role?).
Regarding item-level invariance, Allahyari et al. (2016) simulated data to assess the effect of the NRC on the power of ordinal logistic regression (OLR) to detect DIF. They found that the power of OLR to detect uniform DIF increased by 8% when the NRC was five and the ability (θ) was distributed identically in the reference and focal groups. However, the power of OLR was less than 3.6% when the theta distribution was highly skewed to the left and right in the reference and focal groups, respectively. They recommended utilizing a minimum of five response options when applying OLR to conduct DIF analyses on rating scale data. With that said, we also argue that one can make unfair score interpretations, and consequently invalid inferences, simply because of utilizing a small NRC (e.g., three) that leads to a failure to detect biased items.
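For illustration, the sketch below outlines a likelihood-ratio test for uniform DIF on a single ordinal item using OLR. It assumes the statsmodels package and hypothetical column names (item, ability, group); it is a generic OLR DIF sketch, not the specific implementation used by Allahyari et al. (2016).

```python
import pandas as pd
from scipy import stats
from statsmodels.miscmodels.ordinal_model import OrderedModel

def uniform_dif_lr_test(item: pd.Series, ability: pd.Series, group: pd.Series):
    """Likelihood-ratio test for uniform DIF on one ordinal item.

    item: ordinal responses (e.g., 1-5); ability: matching variable
    (e.g., rest score or theta); group: 0 = reference, 1 = focal.
    """
    endog = item.astype("category").cat.as_ordered()

    base = OrderedModel(endog, pd.DataFrame({"ability": ability}),
                        distr="logit").fit(method="bfgs", disp=False)
    dif = OrderedModel(endog, pd.DataFrame({"ability": ability, "group": group}),
                       distr="logit").fit(method="bfgs", disp=False)

    lr_statistic = 2 * (dif.llf - base.llf)       # likelihood-ratio statistic
    p_value = stats.chi2.sf(lr_statistic, df=1)   # one extra parameter (group)
    return lr_statistic, p_value
```

A nonuniform DIF test would add an ability-by-group interaction term and compare the resulting model against the uniform DIF model in the same way.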
Practical Implications
When conducting measurement invariance and DIF, developers of self-report measures and applied researchers should pay much attention to the role of the NRC in drawing inferences about the fairness or lack thereof of score interpretations, which is an important component of validity evidence.
Discussion and Conclusion
The ultimate goal of assessing human attributes is to reduce the amount of measurement error and consequently provide precise estimates of individuals’ scores reflecting their true abilities, behaviors, and dispositions. Score reliability, sources of validity evidence, and fairness are essential components of the validity of score interpretations for proposed uses. Given the vast amount of published research using empirical and simulated data, we argue that the NRC should be selected carefully, since it is an essential component of the validity agenda planned to support score interpretations and uses. Selecting the NRC is a methodological and epistemological decision rather than an arbitrary or quick choice, and it should be consistent with the intended score interpretations and uses. We recommend using the midpoint response option only if it has a meaning on the latent trait and is not likely to be misinterpreted by instrument respondents.
We argue that when the NRC is not appropriately selected during item writing (e.g., generally fewer than four or more than seven options), the amount of random measurement error increases, resulting in much uncertainty in the estimated scores. In the first case, adult participants may not have the opportunity to express their true responses due to the small NRC (e.g., three), likely leading to a random response pattern. In the second case, when the NRC is large (e.g., more than seven) and is not consistent with participants’ characteristics (e.g., age or educational level) or the domain of interest (e.g., degree of pain in the health sciences), participants may encounter a challenge interpreting the response options and therefore respond randomly. Both random response patterns increase measurement error, which in turn reduces score consistency or, more technically, score reliability. Relatedly, to increase score variability, it is recommended to write assessment items that cover the whole range of the latent trait being measured rather than increasing the NRC, which may yield more measurement error in some cases.
Despite the claim made by most researchers that more response options yield higher score reliability estimates, this should not be generalized, given other factors that may influence reliability estimates, such as artificially increased score variability, interitem correlations, and item quality, among others. Low score reliability estimates, which can be attributed to many factors including random responding and fewer response options, can affect validity evidence based on relations to other variables. Additionally, the NRC is central to meeting EFA assumptions and affects EFA and CFA results, and consequently validity evidence based on internal structure. Furthermore, the NRC has a potential impact on whether items are flagged for DIF and therefore on the fairness of score interpretations for proposed uses.
Practical Implications
To conclude, the NRC should be selected carefully given the intended score interpretations and uses, the target population, and the trait being assessed. To ensure that response options function as intended, it is highly recommended to include them in the cognitive interview protocol designed during the cognitive interview phase of developing a self-report measure. Specifically, the test developer should ask a sample of potential respondents, who have varying perspectives, about their understanding of each response option. This is especially important if the midpoint response category is included, since respondents sometimes misinterpret it. Additionally, if the intended score interpretations involve comparisons across respondent subgroups, it is essential to ensure that response options are uniformly understood by all subgroups. Relatedly, if the desired NRC is seven or more, it is critical to ensure that it does not cause cognitive fatigue and that participants are able to distinguish among adjacent response options and can process the meaning of the psychological distance among intervals (Cook et al., 2001). Accordingly, making use of cognitive interviews is a major evidence-based practice for selecting the optimal NRC that supports validity evidence and consequently enhances score interpretations and uses (for more information about cognitive interviews, see Peterson et al., 2017).
Limitations
After reviewing the NRC effect on score reliability, validity evidence, and quantitative fairness analyses, we emphasize that the reviewed studies were based on real and/or simulated data in which researchers varied the NRC. The quality of these two types of studies differed in terms of methodological rigor. Specifically, simulation studies have potentially been more robust, since the truth is known when generating the data. In that sense, they should receive more weight when drawing conclusions about the NRC and validity evidence. On the other hand, results based on real data are likely less robust because of other confounding variables (e.g., participants’ mood), which may be challenging to control. With that said, we caution readers of this review against overgeneralization. Specifically, determining the NRC is not as simple as it is thought to be, given the detrimental consequences it can have on score reliability, validity evidence based on internal structure and relations to other variables, and bias at both the scale and item levels.
Acknowledgements
The authors would like to deeply thank Dr. Michael Harwell, Emeritus Professor of Educational Psychology, University of Minnesota and Dr. Samantha Holquist, Child Trends, Minneapolis, Minnesota, for their thoughtful comments on an earlier draft of the manuscript.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Data Availability Statement
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.
