Abstract
This investigation introduces a novel tool for identifying conscientious responders (CRs) and random responders (RRs) in psychological inventory data. The Conscientious Responders Scale (CRS) is a five-item validity measure that uses instructional items to identify responders. Because each item instructs responders exactly how to answer that particular item, each response can be scored as either correct or incorrect. Given the long odds of answering a CRS item correctly by chance alone on a 7-point scale (14.29%), we reasoned that RRs would answer most items incorrectly, whereas CRs would answer them correctly. This rationale was evaluated in two experiments in which CRs’ CRS scores were compared against RRs’ scores. As predicted, results showed large differences in CRS scores across responder groups. Moreover, the CRS correctly classified responders as either conscientious or random with greater than 93% accuracy. Implications for the reliability and effectiveness of the CRS are discussed.
When a self-report psychological inventory is administered, the expectation is that respondents follow testing instructions and answer its items as honestly and accurately as possible. That is to say, they respond conscientiously. Some respondents, however, answer items indiscriminately, without regard to their content, and this random responding compromises the validity of the resulting data.
For clinical and applied psychologists, the presence of random data in their supposedly valid data sets may lead them to make erroneous conclusions, diagnoses, and/or predictions about their clients (Ben-Porath & Waller, 1992). Bruehl, Lofland, Sherman, and Carlson (1998) showed this possibility in a clever study using a widely used pain inventory. They concluded that if the measure was administered to a group of RRs in a clinical setting and their random responding went unidentified, 35% of them would be classified as having elevated levels of interpersonal distress and another 35% as being highly adaptive copers. For researchers, random responding poses different problems. Primarily, it increases measurement error, making it more difficult to identify significant relations when they are present in data. In other words, it increases the likelihood of making Type II errors and otherwise jeopardizes the validity of one’s results (Osborne & Blanchard, 2011). In a recent study, Credé (2010) showed that even low rates of random responding (e.g., 5%) can have meaningful moderating effects on the size and direction of correlations, even increasing the likelihood of making Type I errors.
Old Approaches to Detecting Random Responding
Historically, there have been two types of validity scales that have been effective at identifying RRs in self-report inventory data: infrequency scales and inconsistency scales. An infrequency scale is composed of absurd item content—items that are endorsed so infrequently (e.g., <10% of conscientious responders [CRs] answer “true” to the item “I drink 10 glasses of milk a day”) that it is reasonable to interpret responses in the infrequent direction as highly unusual. If a respondent endorses too many infrequency items in the unusual direction, it strongly indicates the presence of random responding and, therefore, an invalid test profile. An exemplar of an infrequency scale is the Minnesota Multiphasic Personality Inventory–2’s (MMPI-2) Infrequency (F) scale (Butcher et al., 1989).
The logic behind an inconsistency scale is that if individuals are paying attention to item content, they should respond consistently to items that are semantically similar and inconsistently to items that are semantically dissimilar. For example, a person who answers “true” to an item like “I am a happy person” should also answer “true” to the item “I am happy.” An inconsistency scale is created by identifying pairs of items that are usually answered in the same way, usually correlating above .90 (e.g., Butcher et al., 1989). Given this, if a person responds inconsistently to several such item pairs, it strongly indicates the likelihood of random responding and that the responder’s data are invalid. Exemplars of inconsistency scales are the Variable Response Inconsistency (VRIN) and True Response Inconsistency (TRIN) scales of the MMPI-2 (Butcher et al., 1989).
Although effective, researchers rarely use or develop these scales due to their extensive costs. First, embedding an infrequency scale in a questionnaire makes it longer to administer, making it costly in terms of fatigue for the test taker and labor intensive for the test administrator. More importantly, because researchers must expect that a small proportion of CRs answer these items truthfully but in the infrequent direction (e.g., some individuals really do drink 10 glasses of milk a day), extensive normative testing is first necessary to establish base rates of infrequent responding in samples of CRs. In addition, normative testing is required to establish base rates for different categories of responders (e.g., disordered, non-disordered) and the various settings in which these scales are likely to be administered (e.g., vocational, forensic). For example, in a disordered population, one should expect that mean infrequency scale scores are higher than they are in a non-disordered population (see Archer et al., 2002; McNulty et al., 2003). This is partly because, in a disordered population, CRs are more likely to endorse infrequency items in the infrequent direction. An item like “I talk to dead people” may be affirmatively endorsed because the responder actually believes he or she is communicating with the dead, not because he or she is answering indiscriminately. This has been a long-standing criticism of infrequency scales—that infrequency scale scores can sometimes confound random responding with psychopathology (Arbisi & Ben-Porath, 1995).
In contrast, inconsistency scales have the advantage of not requiring additional items to be embedded in an inventory because they are composed of an inventory’s existing items. They do, however, still require an extensive amount of research to create and validate. First, a psychologist must identify a pool of highly correlated items within a measure from which to create its inconsistency scale. Subsequently, the psychologist must generate normative data to determine the optimal cutoff score that most effectively discriminates between CRs and RRs. Similar to the above concerns, some cutoff scores may be more appropriate to a particular responder population and setting than others. Consequently, the entire process is unappealingly laborious, time-consuming, and complex.
Unfortunately, apart from these costly types of infrequency- and inconsistency-validity scales, there are currently no practical means for psychologists to differentiate between CRs and RRs in self-report inventory data (see Meade & Craig, 2012, for an evaluation of item-based and statistically based indices). Researchers and applied psychologists alike therefore stand to benefit from a simple, reliable, and flexible tool that can effectively identify RRs in inventory data without the extensive development costs associated with traditional validity scales. The development and evaluation of such a tool was the primary goal of this investigation.
Using Instructional Item Content to Identify Random Responding
The tool we developed for this investigation is called the Conscientious Responders Scale (CRS; see the appendix), which is a five-item variant of a traditional validity scale. The main advantage of the CRS over standard infrequency and inconsistency scales is that it does not require extensive normative testing to establish its cutoff scores. This is because the CRS is made up of instructional item content. Instructional item content directs responders how to answer each particular item (e.g., CRS item 3, “To respond to this question, please choose option number five, ‘slightly agree’”); thus, unlike typical psychological inventory items, there is an objectively correct response for every item. Each correct response is given a score of 1 and each incorrect response a score of 0. A CRS total score is generated by summing all of a respondent’s correct responses. Thus, scores range from 0 (all incorrect responses) to 5 (all correct responses).
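As a sketch, CRS scoring reduces to checking each response against its item’s keyed option and summing the matches. The keys below are illustrative placeholders, not the published scale’s keys (only item 3’s key, option 5, is given in the text):

```python
# Illustrative CRS scoring sketch. Responses are coded 1-7; each of the
# five instructional items has exactly one keyed (correct) option.
# These keys are hypothetical placeholders, except item 3 (option 5).
CRS_KEYS = {1: 2, 2: 6, 3: 5, 4: 1, 5: 4}

def score_crs(responses: dict) -> int:
    """CRS total: 1 point per response that matches the keyed option."""
    return sum(1 for item, key in CRS_KEYS.items() if responses.get(item) == key)

# A responder who follows every instruction scores 5; missing all scores 0.
assert score_crs({1: 2, 2: 6, 3: 5, 4: 1, 5: 4}) == 5
assert score_crs({1: 7, 2: 1, 3: 2, 4: 4, 5: 6}) == 0
```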
Depending on how many response options a responder has to choose from, the likelihood that a RR will answer an item correctly by chance alone can be estimated a priori using probability theory. In our investigation, we used a 7-point response-option scale for all measures. Consequently, the probability of a RR answering a CRS item correctly was 14.29% (i.e., 1/7).
Calculating A Priori CRS Cutoff Scores
Using the binomial distribution, we determined that only 2.33% of RRs would achieve a CRS score of 3 or higher by chance alone (percentage answering 3 correct = 2.14% + 4 correct = 0.18% + 5 correct = 0.01%), whereas 15.19% would answer 2 or more correctly by chance alone (percentage answering 2 correct = 12.86%). In fact, most RRs would only be able to generate a CRS score of 0 (46.25%) or 1 (38.56%), answering almost no items correctly. Using the ubiquitous critical probability value of .05, we therefore adopted an a priori cutoff of 2/3: because fewer than 5% of RRs (2.33%) could score 3 or higher by chance alone, responders scoring 3 or higher were labeled “conscientious responder,” whereas those scoring 2 or lower were labeled “random responder.”
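These figures follow directly from the binomial distribution with n = 5 items and a per-item success probability of 1/7. A short script reproduces the reported percentages to within rounding:

```python
from math import comb

N_ITEMS, P = 5, 1 / 7  # five CRS items; 7-point scale -> 1/7 chance per item

def binom_pmf(k: int) -> float:
    """Probability that a random responder gets exactly k items correct."""
    return comb(N_ITEMS, k) * P**k * (1 - P) ** (N_ITEMS - k)

pmf = {k: binom_pmf(k) for k in range(N_ITEMS + 1)}
p_3_or_more = pmf[3] + pmf[4] + pmf[5]   # chance of passing the 2/3 cutoff
p_2_or_more = p_3_or_more + pmf[2]

print(f"P(0)={pmf[0]:.2%}  P(1)={pmf[1]:.2%}  P(2)={pmf[2]:.2%}")
print(f"P(>=3)={p_3_or_more:.2%}  P(>=2)={p_2_or_more:.2%}")
```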
The Present Investigation: Purpose, Design, and Strategy
The purpose of this investigation was to evaluate the effectiveness of the CRS for discriminating between CRs and RRs in self-report inventory data. We evaluated our new measure in two identical experiments in which a single 89-item questionnaire was administered to university students either using a paper-and-pencil format (Experiment 1) or over the Internet (Experiment 2). The question of interest was whether CRS scores could reliably discriminate between RRs and CRs across these two widely used means of data collection.
According to recent studies, Internet-collected data are equivalent to data collected through traditional methods, such as paper-and-pencil questionnaires, in terms of psychometric properties such as factor structures, inter-scale correlations, means, and standard deviations (Johnson, 2005). Importantly, Internet data have not yet been heavily scrutinized in terms of data distortion tendencies such as random responding. In one of the few such studies, Pettit (2002) found that paper-and-pencil responders actually produced slightly higher rates of random responses than their Internet counterparts. In that study, however, the Internet participants were all self-selected and therefore probably highly motivated to participate from the outset. The extent to which the average undergraduate student completes questionnaires conscientiously has long been a source of doubt and controversy among psychologists (Sears, 1986). Because students are often compelled to participate in psychological research as a means to fulfill program requirements, the typical student is probably less motivated to participate conscientiously than we expect him or her to be. Given that the Internet provides participants with greater anonymity than traditional forms of data collection, unmotivated students may take advantage and produce more random responding than they normally would on a paper-and-pencil questionnaire. Experiment 2 was conducted to replicate the findings of Experiment 1 in a typical online administration of a psychological questionnaire. We also included a traditionally developed infrequency scale in our questionnaire to serve as a comparative measure against which we could evaluate the effectiveness of the CRS.
Data were collected in both experiments using the analog design, in which participants are randomly assigned either to respond conscientiously or to respond randomly (e.g., Clark et al., 2003).
In each experiment, the statistical strategy for evaluating the CRS was threefold. First, group differences on CRS and Pettit Random Responding Scale (PRRS) scores would be assessed using independent-samples t tests, with the expectation that CRs would substantially outscore RRs on both measures (Hypothesis 1). Second, zero-order correlations between the CRS and PRRS would be computed, with the expectation that the two measures would be strongly positively related (Hypothesis 2). Third, the classification accuracy of each measure’s a priori cutoff score would be evaluated, with ≥80% accuracy serving as the criterion for effectiveness (Hypothesis 3).
Method
Participants
Experiment 1
A total of 68 participants were recruited from a second-year psychology class in exchange for being entered into a draw for $50 (CAD). In total, 33 students were randomly assigned to the CR condition and 35 were randomly assigned to the RR condition. Three participants were removed from the CR sample due to missing data. The final total sample (N = 65) consisted of 30 CRs and 35 RRs.
Experiment 2
A total of 412 participants were recruited from an undergraduate research participant pool in exchange for course credit. Thirty-two participants (7.8%) were removed due to missing data. The final total sample (N = 380) comprised 191 participants in the CR condition and 189 in the RR condition.
Measures
The same questionnaire was administered in both experiments. In addition to the CRS and PRRS, all of the measures were selected for their acceptable levels of internal consistency and for breadth of content (e.g., perfectionism, ethics). The specific subject matter and validity of each scale were irrelevant to their selection. All of the questionnaire’s 89 items, including the CRS and PRRS items, were presented in a scrambled, random order. All items were answered on a 7-point Likert-type scale, ranging from 1 = strongly disagree to 7 = strongly agree.
Self-Esteem Scale (SES)
The 10-item SES is a widely used self-report measure of trait self-esteem (Rosenberg, 1965). It has acceptable internal consistency across a variety of samples and has been extensively used in psychological research (Blascovich & Tomaka, 1993). Higher scores reflect greater levels of trait self-esteem. A sample item is “I wish I could have more respect for myself.”
Right-Wing Authoritarianism–Short Form (SRWA)
The 14-item SRWA (Manganelli Rattazzi, Bobbio, & Canova, 2007) was created by factoring Altemeyer’s (1996) 30-item RWA scale into two subscales measuring Authoritarian Aggression and Submission (SRWA-A) and Conservatism (SRWA-C). Each subscale has acceptable reliability and correlates highly with the original 30-item RWA scale (Bobbio, Canova, & Manganelli, 2010). Higher scores on either subscale reflect greater levels of RWA. A sample item from the authoritarian aggression and submission scale is “The situation in our country is getting so serious, the strongest methods would be justified if they eliminated the troublemakers and got us back to our true path.”
Multidimensional Perfectionism Scale (MPS)
The 35-item MPS (Frost, Marten, Lahart, & Rosenblate, 1990) is a scale that assesses six factors of trait perfectionism. Its subscales are Concern Over Mistakes (MPS-CM), Organization (MPS-O), Parental Criticism (MPS-PC), Personal Standards (MPS-PS), Doubts (MPS-D), and Parental Expectations (MPS-PE). The MPS has acceptable internal consistencies, with estimates for its subscales ranging between .73 and .93 (Frost et al., 1990), and has been used extensively in perfectionism research (Parker & Adkins, 1995). Higher scores reflect greater levels of perfectionism on all subscales. A sample item from the personal standards subscale is “I set higher goals than most people.”
Ethics Position Questionnaire (EPQ)
The 20-item EPQ (Forsyth, 1980) measures the philosophical framework from which individuals justify their decisions and behaviors. It contains two subscales, Idealism (EPQ-I) and Relativism (EPQ-R), which previous research has shown to be internally consistent measures (e.g., Davis, Andersen, & Curtis, 2001). Higher scores reflect greater levels of both ethical idealism and relativism. A sample item from the relativism subscale is “Whether a lie is judged to be moral or immoral depends upon the circumstances surrounding the action.”
PRRS
The PRRS is a 10-item infrequency scale containing all absurd item content (Pettit, 1999, 2002). In the original scale, items endorsed in the infrequent (statistically unusual) direction are scored as 1s, whereas items endorsed in the frequent direction are scored as 0s. We reversed this scoring system so that higher PRRS sum scores reflect greater conscientious responding, not random responding. The original measure’s cutoff score also had to be altered because of this scaling change, such that the original cutoff score of 2/3 was changed to 7/8. Put another way, only responders who achieved a high score of 8, 9, or 10 were labeled “conscientious responder.” Low scorers (i.e., 7 and less) were labeled “random responder.”
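Assuming 0/1 item scoring, the reversal and the relabeled cutoff can be sketched as follows (illustrative code, not the published scoring key):

```python
# Sketch of the scoring reversal described above. Originally, 1 = an
# infrequent (random-looking) endorsement, so high sums indicated random
# responding and the cutoff was 2/3. With 1 = a frequent (conscientious)
# endorsement instead, the cutoff flips to 7/8 on the 10-item scale.

def relabel(original_score: int) -> int:
    """Map an original PRRS sum to the reversed (conscientiousness) sum."""
    return 10 - original_score

def classify_prrs(reversed_score: int) -> str:
    """Apply the reversed 7/8 cutoff: 8-10 conscientious, 0-7 random."""
    return "conscientious responder" if reversed_score >= 8 else "random responder"

# Original cutoff 2/3 (<=2 conscientious, >=3 random) maps onto 7/8.
assert relabel(2) == 8 and classify_prrs(8) == "conscientious responder"
assert relabel(3) == 7 and classify_prrs(7) == "random responder"
```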
The original scale was validated on a large Internet sample using a dichotomous response scale and its psychometric properties are acceptable (Pettit, 1999, 2002). Given that a 7-point scale was used in this investigation, the likelihood of a RR answering an item correctly by chance alone was reduced from 50% to 14.29%, theoretically making the PRRS’ 7/8 cutoff score a more difficult standard for a RR to meet. A sample PRRS item is “Sometimes I feel warm or cool,” to which answering anything but “strongly agree” is an empirically infrequent response and is assigned a score of 0.
CRS
The CRS is a variant of a traditional infrequency scale that relies on instructional item content to identify CRs and RRs (see the appendix). The CRS is made up of five items that direct the responder exactly how to answer that particular item, such that each item has only one possible correct response. Thus, the number of the measure’s items, as well as the number of response options, can be used to generate effective cutoff scores using probability theory. For the purposes of this investigation, we adopted the theoretically derived 2/3 cutoff score described above: responders scoring 3 or higher were labeled “conscientious responder,” whereas those scoring 2 or lower were labeled “random responder.”
Procedure
Whether recruited in a second-year psychology class to complete an in-class paper-and-pencil questionnaire (Experiment 1) or from an undergraduate research participant pool to complete an online questionnaire of the same length (Experiment 2), participants were randomly assigned to complete one of two versions of the 89-item questionnaire. Participants assigned to the CR group received standard questionnaire instructions with some additional language that prepared them for the instructional nature of the CRS items: “Some of the items will ask you to answer them in a particular way . . .” Their questionnaire contained all of the measures listed above. In contrast, participants assigned to the RR group received a questionnaire with no items inside, only a 7-point Likert-type scale they had to endorse for each missing item. These participants were instructed to “respond on the scales below as randomly as possible, but do this in such a way that it will not be apparent that this is what you did.” Thus, in an effort to simulate reality, participants tried not to make their random responses so obvious that they would be easily identified by a visual inspection of the data. These instructions have similarly been used in other validity scale studies that used the analog design (e.g., Clark et al., 2003). In the debriefing, no CR participants reported having any difficulty understanding the instructions or completing the items in the questionnaire.
Results
Descriptive statistics for all of the measures across responder groups and experiments are presented in Table 1. Our first analysis involved conducting independent-samples t tests to assess group differences on CRS and PRRS scores across responder conditions (Hypothesis 1).
Table 1. Descriptive Statistics Across Responder Groups and Experiments.
We next calculated zero-order correlations to test Hypothesis 2, that the CRS and PRRS would be strongly positively related because they both purportedly measure the same construct. This hypothesis was also fully supported by the data in Experiment 1,
In the final stage of the analysis, we examined CRS scores in the RR and CR groups to assess the effectiveness of our theoretically derived, a priori cutoff score (see Table 2 for CRS scores across responder groups and experiments). In the RR group in Experiment 1 (
Table 2. CRS Scores Across Responder Groups and Experiments.
In Experiment 2, results were highly similar for the CRS. Altogether, our a priori 2/3 cutoff produced a classification accuracy rate of 90.05% in the CR group (making 19 errors of 191), 96.83% in the RR group (making 6 errors of 189), and 93.42% averaged across both groups, in full support of Hypothesis 3. In contrast, results were substantially worse for the PRRS. The PRRS correctly labeled 47.12% in the CR group as “conscientious responders” (101 errors of 191), 100% in the RR group as “random responders” (0 errors of 189), and achieved a 73.42% classification accuracy averaged across both groups. Thus, the CRS produced a similar result, again exceeding the ≥80% classification accuracy criterion, whereas the PRRS failed to meet that standard.
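The Experiment 2 rates are straightforward to verify from the reported error counts; note that the 93.42% figure "averaged across both groups" is the pooled, case-weighted accuracy rather than the simple mean of the two group rates:

```python
# Verify the Experiment 2 CRS classification rates from the error counts.
def accuracy(correct: int, total: int) -> float:
    return correct / total

crs_cr  = accuracy(191 - 19, 191)              # CR group: 19 errors of 191
crs_rr  = accuracy(189 - 6, 189)               # RR group: 6 errors of 189
crs_all = (191 - 19 + 189 - 6) / (191 + 189)   # pooled across both groups

print(f"CR: {crs_cr:.2%}  RR: {crs_rr:.2%}  overall: {crs_all:.2%}")
```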
Given that the PRRS results were so jarringly different across the experiments, we reasoned that the problem was likely due to the imposition of the a priori 7/8 cutoff score on the data. Although it suited the Experiment 1 data well, in Experiment 2 it was too conservative, leading to too many CR participants being labeled as “random responders.” To explore this hypothesis further, we conducted binary logistic regression analyses for each of the CRS and PRRS measures in both sets of experimental data. Specifically, we sought to examine whether empirically derived cutoffs generated by the logistic regression models would be different and more effective than the a priori cutoff scores we imposed on the data.
In both sets of data, two binary logistic regressions were conducted in which the criterion variable (responder group: RR or CR) was regressed on either the CRS or the PRRS as the predictor variable. As expected, results of all four regressions were significant. For the CRS, results showed that 2/3 was the best empirically based cutoff to accurately differentiate between CRs and RRs—the same as the theoretically derived cutoff. In contrast, results from the PRRS logistic regressions were significant, but showed that the a priori 7/8 cutoff was not the optimal cutoff in either data set. In Experiment 1, 4/5 was shown to be a better empirical cutoff (correctly classifying 100% of CR participants and 100% of RR participants), whereas in Experiment 2, the best cutoff was 3/4 (correctly classifying 90% of CR participants and 97.88% of RR participants, producing an average accuracy rate of 93.93%). The smaller 3/4 cutoff was better in Experiment 2 because CR participants in that study produced a lower mean score than in Experiment 1.
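The idea of an empirically optimal cutoff can be illustrated with a brute-force expected-accuracy search. This is a simplified stand-in for the logistic regressions the authors actually ran, and the conscientious-responder score distribution below is purely an assumption for illustration; the random-responder distribution is the chance binomial model from earlier:

```python
# Brute-force cutoff search (illustrative; not the study's logistic models).
# RR scores follow the chance binomial model; the CR score distribution
# here is an assumed near-ceiling mix, not observed data.
from math import comb

P = 1 / 7
rr_pmf = [comb(5, k) * P**k * (1 - P) ** (5 - k) for k in range(6)]
cr_pmf = [0.0, 0.0, 0.0, 0.05, 0.15, 0.80]  # hypothetical CR distribution

def expected_acc(cut: int, n_cr: int = 191, n_rr: int = 189) -> float:
    """Expected accuracy when scores >= cut are labeled 'conscientious'."""
    hits = n_cr * sum(cr_pmf[cut:]) + n_rr * sum(rr_pmf[:cut])
    return hits / (n_cr + n_rr)

best = max(range(1, 6), key=expected_acc)
print(best)  # under these assumptions, the search recovers the 2/3 cutoff
```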
In sum, these logistic regression data showed that an effective CRS cutoff score can be generated a priori using probability theory and applied reliably across data sets. In contrast, an effective a priori PRRS cutoff cannot be reliably applied across data sets. Rather, an empirical cutoff score needs to be generated for every data set in which the PRRS is used in order to optimize its discriminative power.
Discussion
The purpose of this investigation was to evaluate the effectiveness of a novel tool for identifying CRs and RRs in self-report inventory data. The CRS is a five-item variant of a traditional validity scale, which uses instructional item content and theoretically derived cutoff scores as its means to identify responders. Because CRs are assumed to follow testing instructions diligently, answering items as honestly and accurately as possible, we expected them to answer all of the CRS items correctly. In contrast, because RRs answer items indiscriminately, we expected them to account for nearly all of the incorrect responses in the data, with only a small proportion of correct responses attributable to chance. In our questionnaire, we used a 7-point scale; thus, the chance of a RR answering an item correctly was 14.29%. Given this rationale, we hypothesized that CRs would produce CRS total scores near the ceiling of the scale’s range (i.e., 5) and RRs near the floor (i.e., 0). The large gap in expected scale scores would make it easy to discriminate between individual cases of conscientious and random responding. The other unique advantage of instructional items over the traditional variety used in infrequency scales stems from the fact that they can be objectively scored as correct or incorrect. Because of this difference, effective cutoff scores can be generated using probability theory, eliminating the need for extensive normative testing. This would save test developers the laborious task of having to validate every single validity measure they create, and would also allow test administrators the flexibility of being able to change the CRS format depending on their particular testing requirements (e.g., by increasing or decreasing the number of its items or the size of its response-option scale). These were the main ideas behind the CRS when we designed it. This investigation was conducted to assess whether these lofty speculations were realistic.
Overall, the findings of this investigation were positive for the discriminative power and validity of the CRS. As predicted, CRs produced significantly larger CRS scores than RRs across experiments and these group differences were large in magnitude. The PRRS, a traditionally developed infrequency scale, which was administered alongside the CRS for comparative purposes, correlated positively and strongly with the CRS. For both measures, CRs produced scores toward the ceiling of the measures’ scale range (5 for the CRS and 10 for the PRRS), whereas RRs produced mean scores near the scale floors (0 for both measures). Because the PRRS contains twice the number of items the CRS has, the average difference between the CR and RR groups’ sum scores was larger for the PRRS and this produced a larger effect size. In Experiment 2, this difference was largely eliminated because CRs produced PRRS scores nearer the middle of the measure’s scoring range.
The implication of this consistency is positive for the CRS. Given that across testing situations one can reliably expect CRs to produce a score near the ceiling and RRs near the floor, an a priori cutoff will be consistently effective at identifying responders. In these data, the theoretically derived 2/3 cutoff accurately discriminated between CR and RR about 93% of the time. Additional analyses with binary logistic regression models showed that the theoretically derived cutoff was identical to empirically derived cutoff scores from both sets of data. This agreement boosts the validity of our probability-based approach to generating cutoff scores.
Results for the PRRS were good, but less positive. Like the CRS, the PRRS produced large group differences between CR and RR, making distinguishing between them a fairly easy task. However, unlike the CRS, the size of the group difference in PRRS scores was inconsistent across studies. Moreover, the optimal empirical cutoff score changed from Experiment 1 to Experiment 2 and in neither study agreed with the a priori cutoff score of 7/8. The consequence of this was nicely demonstrated by the dramatic loss of discriminative power across studies. In Experiment 1, the a priori PRRS cutoff score produced an average classification accuracy of 93%, whereas in Experiment 2, its accuracy fell to just above 73%, failing to even meet the 80% classification standard. Consequently, one has to seriously doubt the utility of an a priori PRRS cutoff score, like the one we used in this investigation. The PRRS worked best using empirically based cutoffs. This means that after collecting PRRS data, an administrator should generate an equally large set of random data and run statistical analyses to identify the best empirical cutoff score. In sum, the effort required to use the PRRS effectively is far greater than it is to effectively use the CRS. With the CRS, a theoretically derived cutoff score can work reliably and effectively in a greater variety of testing scenarios. One has only to tally responders’ scores and then assign them their appropriate responder labels. No normative testing is required.
Limitations and Future Research
Although the findings of this investigation were straightforward, producing large CRS score differences across responder conditions and nearly identical results across experiments, the utility of the CRS should be further evaluated using a wider variety of study designs and settings in which inventories are commonly used. For example, the CRS’s effectiveness in forensic and psychiatric settings, where rates of random responding are highest (Archer et al., 2002; McNulty et al., 2003), may be somewhat different than it was in this investigation’s non-disordered student samples. Given that the CRS only requires respondents to follow simple instructions, at this point we believe that the CRS is safe for use in non-disordered samples and for research purposes. Caution should be exercised when using it outside of these groups or for individual assessment. In addition, the classification accuracy of the CRS should be examined at finer gradients of random responding, for example, in identifying responders who engage in random responding on only 25% of a questionnaire’s items versus 100% of them. Research on the prevalence of random responding suggests that this form of intermittent random responding may account for the bulk of all random responding cases, as most responders admit to responding randomly to at least some of a questionnaire’s items, but few report doing it to all of them (e.g., Baer, Ballenger, Berry, & Wetter, 1997).
Also, the classification accuracy of the CRS should be evaluated against multiple standards of comparison, such as the highly regarded validity scales of the MMPI series (the F, VRIN, and TRIN scales; Butcher et al., 1989).
Finally, there is an off chance that embedding validity scales like the CRS or PRRS in a questionnaire may exacerbate random responding by lowering the questionnaire’s face validity. Face validity is defined as the extent to which item content seems appropriate for the purposes of a given testing situation (Holden & Jackson, 1979). For example, when assessing sadism, an item like “I enjoy hurting others” has higher face validity than the item “I would enjoy the occupation of a butcher.” Because the CRS items instruct one how to respond, which is very different from what people expect to find in an inventory, their odd nature may sap the motivation of some responders to participate conscientiously. Perhaps even the notion of being told what to do may motivate some individuals to respond noncompliantly in a fit of psychological reactance (Miron & Brehm, 2006). With infrequency scales, their item content may seem so absurd in some cases that responders may feel ridiculous and put off in having to respond to them, which similarly may sap their motivation to act conscientiously. As far as we are aware, this hypothesis that random responding scales exacerbate random responding has not been experimentally examined and perhaps should be pursued in future research. We speculate, however, that given the prevalence and utility of some low-face-validity inventories (e.g., the MMPI series), if there were an effect here to find, it would be negligible in size and fully offset by the positive effects of validity scales (i.e., being able to discriminate between CRs and RRs).
Conclusion
Due to the many costs associated with random responding scales, basic and applied psychologists rarely use them when administering inventories. This investigation aimed to remedy this situation with the introduction and preliminary validation of the CRS, a five-item measure that uses instructional item content to achieve this goal. Results across two experiments were compelling in that effect sizes were large, results were consistent across samples, and the CRS was accurate in classifying responders about 93% of the time. Simply put, by embedding the CRS items randomly throughout a questionnaire, researchers can use endorsements of the CRS items as reliable indicators of whether data were generated conscientiously and should be retained or whether they were produced randomly and should be deleted.
Appendix
Acknowledgements
We thank Cathy Faye, Lisa Fiksenbaum, and David Flora for their reviews of this article.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research and/or authorship of this article.
