Sage Journals: Discover world-class research

Abstract

The multidimensional forced-choice (MFC) format is an alternative to rating scales in which participants rank items according to how well the items describe them. Currently, little is known about how to detect careless responding in MFC data. The aim of this study was to adapt a number of indices used for rating scales to the MFC format and additionally develop several new indices that are unique to the MFC format. We applied these indices to a data set from an online survey (N = 1,169) that included a series of personality questionnaires in the MFC format. The correlations among the careless responding indices were somewhat lower than those published for rating scales. Results from a latent profile analysis suggested that the majority of the sample (about 76–84%) did not respond carelessly, although the ones who did were characterized by different levels of careless responding. In a simulation study, we simulated different careless responding patterns and varied the overall proportion of carelessness in the samples. With one exception, the indices worked as intended conceptually. Taken together, the results suggest that careless responding also plays an important role in the MFC format. Recommendations on how it can be addressed are discussed.

Keywords

multidimensional forced-choice format data quality response bias careless responding inattentive responding insufficient effort responding

When using self-reports to infer respondents’ trait levels, one concern researchers and practitioners alike have is that some participants might not have responded accurately and carefully to some (or even most) items, that is, data quality might be affected by careless responding. Alternative response formats to the ubiquitous rating scales, such as the multidimensional forced-choice (MFC) format, which requires participants to rank multiple items rather than rate individual items, have been successful in reducing other response biases such as faking. However, it is currently unclear how to detect careless responding in this format. The aim of this study was therefore to develop indices to detect careless responding in the MFC format. In doing so, we took existing indices for rating scales, adapted them to the MFC format, and developed new indices which are unique to the MFC format. Our first aim was to examine the performance of and correlations among these indices in the MFC format (Research Question 1). Furthermore, we were interested in whether different types of careless respondents in MFC questionnaires exist (i.e., different classes in a latent profile analysis; Research Question 2) as well as in the proportion of the sample that responded carelessly on the developed indices (Research Question 3) and over the course of the study (Research Question 4). Moreover, we conducted a simulation study for a conceptual proof of the developed indices.

In the following, we will give some background on the forced-choice format and briefly present it as an alternative to rating scales in personality questionnaires. Then, we will introduce careless responding as a response bias and present methods to detect careless responding in rating scale data. Next, indices that can be used to detect careless responding in MFC questionnaires are described. Finally, our research questions and hypotheses are derived.

The Multidimensional Forced-Choice Format

In the forced-choice format, two or more items are presented together in blocks. Participants are then instructed to select the items that describe them most and/or least (partial ranking) or to rank all items according to how well they describe them (full ranking). The items within a block can measure the same trait or different traits. The latter is known as the MFC format (see Figure 1 for an example). Variations in block size and different ranking instructions are described in Wetzel et al. (2020), for example. In our study, blocks were multidimensional and we instructed participants to rank the three items according to how well the items described them in their typical behavior (i.e., full ranking of three items representing different traits).

Figure 1.

An Example Triplet in the Multidimensional Forced-Choice Format.

Given the circumstance that items are presented together in blocks, the resulting rankings are relative. Therefore, applying classical scoring techniques results in ipsative trait estimates that only allow for intraindividual comparisons (e.g., Hicks, 1970). As interindividual comparisons are essential in many assessment settings, the use of the MFC format was limited in the beginning. This limitation was overcome in the past decade with the development of multidimensional item response theory models that enable obtaining normative trait estimates and therefore allow drawing interindividual comparisons (see Brown, 2016, for an overview of these models). One of these models is the Thurstonian item response theory model (Brown, & Maydeu-Olivares, 2011). This model was applied in the current study because it is suitable for modeling multidimensional item responses to blocks of three (or more) items. In addition, it is broadly applicable (e.g., to a variety of latent variable structures or other variants of the forced-choice format), and it can be estimated in available software, like Mplus (Muthen & Muthen, 1998–2021). This versatility is transparent by its predominant use in the last few years (e.g., Frick, 2023; Guenole et al., 2018; Lee et al., 2021; Walton et al., 2020; Watrin et al., 2019; Wetzel & Frick, 2020). When describing the indices in detail below, we will briefly delineate some of the model properties that are relevant for understanding the indices and refer the reader to Brown and Maydeu-Olivares (2011, 2018) for a deeper understanding of the model.

Careless Responding

In most self-report questionnaires assessing personality traits, participants are instructed to indicate an intensity. For example, they might be asked to what extent they agree or disagree with a given item (e.g., “I am often late”) or how often they engaged in a specific behavior in a given time period. Answers are conventionally measured on a rating scale with different response options (e.g., from “I strongly agree” to “I strongly disagree,” or from “never” to “always”). This format is also known as a single-stimulus format because participants respond to one item at a time.

In the rating scale format, a number of response biases can occur (Paulhus, 1991). These include, for example, response styles such as a preference for extreme or non-extreme categories, socially desirable responding (i.e., faking and self-deception), and careless responding. Careless responding is characterized by selecting response options without considering the (full) item content (e.g., Nichols et al., 1989). As the expression “careless” suggests, it is a listless, unmotivated, or uninterested participant behavior that manifests itself in responses that seem to be arbitrary or indiscriminate. Other terms for the same response behavior include inattentive (e.g., Maniaci & Rogge, 2014), random (e.g., Credé, 2010), or insufficient effort (e.g., Huang et al., 2012) responding. Notably, we define careless responding as a participant behavior that therefore is distinguishable from computer-generated (i.e., survey bot or automated) responses. The detection of such responses is described elsewhere (e.g., Dupuis et al., 2019; Teitcher et al., 2015).

Like all response biases, careless responses add noise to the data. Therefore, not removing careless respondents from the sample can potentially lead to erroneous conclusions. Various studies have investigated how different proportions of careless respondents in a sample can impact the results when using rating scale questionnaires. These studies suggest that not removing careless respondents can inflate or deflate the correlations between items (Credé, 2010; Huang et al., 2015), distort the factor structure (Goldammer et al., 2020; Schmitt & Stuits, 1985), reduce statistical power (Maniaci & Rogge, 2014), or even change the results of statistical tests (DeSimone & Harms, 2018). Nevertheless, researchers should be cautious about flagging participants if the construct of interest is associated with careless responding (e.g., Rios et al., 2017). The base rate of careless responding differs considerably between samples, applied methods to detect careless responding, and other determinants. Using factor mixture modeling to detect careless respondents in personality questionnaires, Arias et al. (2020) flagged between 4% and 10% in their online samples and Meade and Craig (2012) flagged between 10% and 12% in a sample of undergraduate students. In a group assessment of leadership rankings in military training, 33% of the recruits were identified as careless respondents relying on four careless responding indices (Goldammer et al., 2020). Another characteristic of careless responding is that it increases over the course of the study (e.g., Bowling et al., 2021; Clark et al., 2003; Galesic & Bosnjak, 2009).

Researchers have developed various methods to detect careless responding in rating scale data. Meade and Craig (2012) differentiated two types of methods. The first type comprises special items or scales to detect careless responding. These should be assessed together with the questionnaires of interest. They include, for example, items asking the participants to report the quality of their data such as, “Did you expend effort and attention sufficient to warrant using your responses for this research study?” or items with instructions on which response category should be chosen, such as “For this query, mark X [insert X] and move on.” (Abbey & Meloy, 2017, p. 66). The second type consists of indices that can be computed after data collection is complete. These include analyses of the response time (e.g., extremely fast responses), the inconsistency of the responses (e.g., responding differently to similar items), response patterns (e.g., choosing the same response option multiple times consecutively), and outliers (e.g., responses that deviate strongly from the average responses of the sample). Curran (2016) and DeSimone et al. (2015) summarized the existing methods and indices for detecting careless responding in questionnaires using rating scales.

Careless Responding in the MFC Format

Within a given MFC block, it is only possible to place one item on each rank (see Figure 1). Hence, the nature of the forced-choice format prevents the occurrence of the aforementioned response styles. One of the main reasons why the forced-choice format gained research interest was to prevent or mitigate faking as one form of socially desirable responding (e.g., Christiansen et al., 2005). Meta-analytic results from the rating scale and forced-choice format indicate that responses to personality inventories are less susceptible to faking in the forced-choice format (Birkeland et al., 2006; Cao & Drasgow, 2019; Viswesvaran & Ones, 1999). They are even less susceptible to faking when the items within blocks are matched in terms of their social desirability (Wetzel et al., 2021).

Unlike the other aforementioned response biases, careless responding cannot be prevented by the design and characteristics of the MFC format. Whether the occurrence of careless responding can be reduced by using the MFC format rather than rating scales is an open research question. To our knowledge, careless responding has not been investigated in MFC questionnaires. As the design of the forced-choice format differs substantially from rating scales, some of the methods used to identify careless responding in rating scale data cannot be applied and others need to be adapted. Hence, little is known about how to identify careless responding and how often it occurs in questionnaires using the forced-choice format.

Indices for Detecting Careless Responding in MFC Questionnaires

In this study, we focus on the development of indices¹ to detect careless responding in MFC questionnaires. Indices to detect careless responding in the MFC format can be grouped into two types, following Meade and Craig (2012). First, we implemented special items such as instructed response triplets and self-report items on data quality together with the questionnaires in the online study. Second, we adapted and developed post hoc methods such as analyses of response time, consistency of the responses (i.e., consistency score), response patterns (i.e., rank order indices and triplet variance), and outliers (i.e., Mahalanobis distance). These indices will now be described in detail (see Table 1 in the Supplemental Materials for an overview). All of the indices were preregistered together with our research questions, hypotheses, and a detailed analysis plan (https://osf.io/sfwnp/). The response time index and missing values index were not part of the preregistration and are therefore marked with an asterisk.

Response Time

For each participant, the completion time for each page of the survey was measured. In addition to this response time measure, we developed a response time index (RTI*) combining two approaches from the literature: First, some studies that included the response time to screen for careless responding decided to treat breaks or study interruptions as missing values (e.g., Meade & Craig, 2012). Second, other studies focused more on detecting extremely fast responses because some participants respond so quickly that it seems unlikely that they have really read and considered the item content (Wise & Kong, 2005). Hence, these researchers used a lower cutoff (e.g., two seconds per item) to screen for careless respondents (e.g., Bowling et al., 2016; DeSimone & Harms, 2018; Huang et al., 2012). To combine these two approaches, we extended the response time effort index developed by Wise and Kong (2005) to include an upper threshold for delayed responses ( $S_{upper}$ ), yielding the RTI for the MFC format. Furthermore, we investigated in a small sample (N = 10) how much time is needed to respond to a triplet in the MFC format with and without reading the items (see Appendix A in the Supplemental Materials). The results revealed that six seconds per triplet page was an adequate lower threshold ( $S_{lower}$ ). To calculate the RTI, first, the response time $R T_{pt}$ of each participant p on every triplet page t of the questionnaire is recoded into a categorical variable $C_{pt}$ :

C_{p t} = {\begin{matrix} N A & i f R T_{p t} \geq S_{u p p e r} \\ 1 & i f R T_{p t} < S_{l o w e r} \\ 0 & o t h e r w i s e \end{matrix}

(1)

Second, the RTI for each participant is determined as:

RT I_{p} = \frac{\sum_{t = 1}^{T} C_{pt}}{k},

(2)

where k is the number of values of $C_{pt}$ without those coded as missing (NA) and T is the number of triplets in the questionnaire. As can be seen in Equation 2, a participant’s $RT I_{p}$ is the average of $C_{pt}$ , with missing values removed.

Self-Report Items

At the end of the online survey, participants were asked to evaluate (a) the effort they put into the study: “I put forth . . . effort towards this study” (response options: “almost no,”“very little,”“some,”“quite a bit,” and “a lot of”), (b) the attention they paid to the study: “I gave this study . . . attention” (response options: “almost no,”“very little of my,”“some of my,”“most of my,” and “my full”), and (c) whether their data should be used in the analyses (use me): “In your honest opinion, should we use your data in our analyses in this study?” (response options: “yes,”“no”; Meade & Craig, 2012). All three items and instruction sets originated from Meade and Craig (2012, p. 441).

Instructed Response Triplets

In two questionnaires of the online study, an instructed response triplet appeared at a random position. The items making up these triplets contain instructions in which order to place the statements (see Figure 2 for an example).

Figure 2.

An Example of an Instructed Response Triplet.

Consistency Score

In the Thurstonian item response theory model, it is assumed that participants’ ranking preferences can be described by their trait levels and item parameters (Brown & Maydeu-Olivares, 2011). A consistency score can be calculated to examine whether participants responded consistently with this model (Brown & Bartram, 2011). To better explain the consistency score, we will briefly mention some features of the Thurstonian item response theory model. For a deeper understanding of the model, we refer the reader to Brown and Maydeu-Olivares (2011, 2018). Before analyzing the data from forced-choice questionnaires, the responses are coded into binary outcomes of pairwise comparisons. In the case of triplets, three pairwise comparisons are made: Item 1 with Item 2, Item 1 with Item 3, and Item 2 with Item 3. When the first item i in the pairwise comparison is preferred over the second item k, the binary outcome $y_{i, k}$ is coded as 1 (else 0). Subsequently, the binary coded outcomes can be analyzed using the Thurstonian item response theory model. Based on the parameters originating from this model, an item characteristic function for each pairwise comparison $y_{i, k}$ can be computed. The item characteristic function expresses the probability of preferring item i over item k conditional on the traits measured by the items, $η_{a}$ and $η_{b}$ (Brown & Maydeu-Olivares, 2011, p. 473):

P (y_{i, k} = 1 | η_{a}, η_{b}) = ϕ (\frac{- γ_{i, k} + λ_{i} η_{a} - λ_{k} η_{b}}{\sqrt{ψ_{i}^{2} + ψ_{k}^{2}}})

(3)

However, note that response probabilities for pairwise comparisons involving the same items are locally dependent given the latent traits. To account for this, in the Thurstonian item response theory model the item parameters involving the same items are constrained to the same absolute value, and the item parameters and trait covariances are estimated via limited information methods. The local dependencies are neglected in the estimation of the person parameters; however, this has a negligible impact on the estimates’ precision according to a simulation study conducted by Maydeu-Olivares and Brown (2010).

In Equation 3, the parameter $γ_{i, k}$ is the threshold of the binary outcome $y_{i, k}$ , $λ_{i}$ and $λ_{k}$ are the factor loadings, and $ψ_{i}^{2}$ and $ψ_{k}^{2}$ the residual variances of the two items involved in the pairwise comparison. $ϕ (x)$ represents the cumulative standard normal distribution function evaluated at x. When a participant’s responses are in line with his or her expected responses based on the trait levels, the resulting probabilities are high. In contrast, they are lower when the observed responses are not in line with the expected ones. The consistency score expresses the proportion of binary outcomes with a probability higher than the chance level (p > .5) of being preferred. The higher a participant’s consistency score, the more consistent his or her response behavior is with his or her latent trait levels. Thus, lower values could indicate careless responding.

Mahalanobis Distance

A commonly used method to detect outliers, for example on predictors in regression analysis, is Mahalanobis distance (Mahalanobis, 1936). Mahalanobis distance is the multivariate generalization of the Euclidian distance between two points. It can be used to detect values that deviate more from the centroid of all participants in the sample than the other participants’ values do (e.g., Pituch & Stevens, 2015). Thus, Mahalanobis distance expresses the distance between a person’s response and the sample mean, accounting for correlations between responses. In the case of an MFC questionnaire, the vector $D^{2}$ with an entry for each participant is calculated using the equation:

D^{2} = (Y - γ) \cdot CO V^{- 1} \cdot {(Y - γ)}^{T} .

(4)

$Y$ is a matrix with the binary-coded outcomes for all participants and all pairwise comparisons in the questionnaire. $γ$ is a vector of thresholds of the binary outcomes and $CO V^{- 1}$ the inverse of the covariance matrix of the binary outcome variables. As the computation of $D^{2}$ requires complete data (i.e., no missing values), we used multiple imputation on the binary outcome variables before calculation to avoid too many missing values in the results (for details on the computation, we refer the reader to the R-script “4_calc_indices.R,”https://github.com/rkupffer/ExploratoryAnalysesCRinMFC).

Rank Order² Indices

In the rating scale format, LongString indices are used to identify respondents who repeatedly choose the same response category (Johnson, 2005; Meade & Craig, 2012). We adapted the logic of LongString indices to the MFC format to generate longOrder indices. In an item block of size $K$ in the MFC format, there are $K!$ Possible rank orders. In the case of triplets (K = 3), there are six possible rank orders: 1–2–3, 1–3–2, 2–1–3, 2–3–1, 3–1–2, or 3–2–1. For each participant, it is counted how often in a row the same rank order is chosen (longOrder). Based on this, we calculated two indices: the maximum of the six longOrder values (longOrderMax) and the average of all six longOrder values (longOrderAvg). In addition, the proportion of triplets for which a person simply copied the presented order (sameOrder) was calculated.

Triplet Variance

This index expresses the variance of a person’s ranking patterns across all triplets in a questionnaire or the whole survey. The variance is high when participants use all rank orders equally often and low(er) when some rank orders are preferred over others. For a nominal variable with K! levels, the dispersion H can be calculated as (Eid et al., 2017, p.133):

H = \frac{- 1}{\ln (K!)} \cdot \sum_{j = 1}^{K!} h_{j} \cdot \ln (h_{j}) .

(5)

In the MFC format, K! equals the number of possible rank orders and $h_{j}$ is the proportion of rank order k.

Missing Values Index*

This index expresses the ratio of missing values (NA) on item ranks to the total number of item ranks in the questionnaire or survey. This index was not part of the preregistration and therefore the analyses regarding this index are exploratory.

The Present Study

Our aim was to examine the performance of and correlations among the indices we developed to detect careless responding in the MFC format. Furthermore, we were interested in the types of careless respondents in MFC questionnaires as well as in the proportion of the sample that responded carelessly. We preregistered four exploratory research questions and three hypotheses (https://osf.io/sfwnp/) that are described in the following. In addition, we conducted a simulation study in which we modeled careless responding in MFC questionnaires to provide a conceptional proof of the developed indices for detecting careless responding in MFC questionnaires. The simulation study was suggested by reviewers and was therefore not included in the preregistration. The design and rationale of the simulation are described in the Method section.

Research Questions and Hypotheses

For rating scales, Meade and Craig (2012) found that three factors underly the indices: consistency, self-report, and LongString. As the nature of MFC data differs substantially from rating scale data, we sought to investigate how the indices to detect careless responding in the MFC format relate to each other. Thus, our first research question was:

Research Question 1: How do the indices to detect careless responding in MFC questionnaires correlate with each other and what is their factor structure?

Some of the indices including the self-report items and the consistency or rank order indices have also been used to detect careless responding in rating scale data (e.g., Meade & Craig, 2012). For these indices, based on Meade and Craig’s findings, we posited the following three hypotheses:

Hypothesis 1: The correlations of the self-report items (attention, effort, and use me) with the other indices will be positive and of small to moderate size.

Hypothesis 2: The self-report items will show moderate positive correlations among each other.

Hypothesis 3: The correlations between the consistency score and the indices longOrderAvg and longOrderMax will be positive but small.

Besides investigating the intercorrelations among the indices to detect careless responding, we were also interested in whether different subgroups of respondents exist. Studies on detecting careless responding in rating scale data have often found three subgroups/classes of respondents (e.g., Kam & Meyer, 2015; Maniaci & Rogge, 2014; Meade & Craig, 2012). One class contains thoughtful participants, a careless class contains participants who choose one response option repeatedly in a row, and another careless class contains participants with inconsistent response patterns. The sizes of the different classes varied in these studies. To extend this knowledge with regard to MFC data, we asked:

Research Question 2: Do different subgroups of careless respondents exist?

As described above, the proportion of the sample that responds carelessly in studies based on rating scales varies across studies, indices, and cutoff values. Identifying careless respondents by performing latent class or latent profile analysis is only one approach commonly used in the literature. A more widely used approach is the multiple hurdle approach (e.g., Goldammer et al., 2020). In this index-based approach, a set of careless responding indices is computed and cutoffs (hurdles) are applied to each of them to identify careless respondents. To investigate what proportion of the sample responds carelessly according to the different indices in the MFC format with different cutoff values, we asked:

Research Question 3: What proportion of the sample responds carelessly?

Some researchers (e.g., Bowling et al., 2021; Clark et al., 2003; Galesic & Bosnjak, 2009) expect careless responding to occur more frequently near the end of a questionnaire due to tiredness or listlessness among participants. For example, Galesic and Bosnjak (2009) found that participants responded faster at the end of the study and that the variance in the responses decreased. Therefore, in our study, we compared the proportion of careless respondents identified by the six indices in the first and last questionnaires of the online survey, and thus asked:

Research Question 4: Does careless responding appear more frequently in questionnaires that are placed near the end of the study (compared with questionnaires at the beginning)?

Exploratory Analyses on the Criterion Validity and Reliability Estimates

Besides these research questions and hypotheses, we also preregistered exploratory analyses to investigate the impact of removing careless respondents from the sample in the MFC format. As described earlier, for rating scales, various studies have investigated how different proportions of careless respondents can impact the size and direction of observed correlations (see, for example, Credé, 2010). To examine the criterion validity in our sample, we computed several correlations with and without the respondents identified as careless in our analyses and compared these results with correlations reported in the literature. We focused on the relationships between the Big Five personality traits and narcissism with age and gender. Our tentative expectation was that the correlations without careless respondents would be more in line with those reported in the literature and stronger than those with careless respondents. A second exploratory analysis concerned the reliability of trait estimates on the scales, as some researchers (e.g., DeSimone et al., 2018) argued that careless responding leads to decreased reliability estimates. Hence, we compared the empirical reliability of trait estimates on the scales based on the whole sample to that in the sample without the careless respondents.

Method

In the following, we report how we determined our sample size, all data exclusions, and all measures in the study. In addition, we report which software we used for the analyses and a detailed analysis plan that we also preregistered. Moreover, we describe the design of the simulation study that we conducted to gain first insights into the kind of careless responding patterns that are detected by different kinds of indices.

Sample

Data were collected online on Prolific Academic (https://www.prolific.co) on September 21, 2017 and October 5, 2017. Participants signed up for the study, gave informed consent, and filled out the questionnaire online on SoSci Survey. On September 21, 2017, participants were remunerated with 1.75 British pounds, although participants who took longer than 20 min to complete the questionnaire received bonus payments of between 0.40 and 2 British pounds depending on their response time to comply with the payment principles of Prolific Academic with respect to minimum wage. These participants did not know beforehand that they would receive bonus payments if they took longer. On October 5, 2017, all participants were remunerated with 2.35 British pounds.

The computation of some of the careless responding indices is based on parameters from the Thurstonian item response theory model. Therefore, the sample size was determined based on simulation studies regarding the recovery of the model parameters (e.g., Maydeu-Olivares & Brown, 2010). We planned conservatively (N > 1,000) because only one of the six questionnaires was originally an MFC instrument. This sample size was also sufficiently large for the planned statistical analyses (e.g., correlations, latent profile analysis, and McNemar’s $χ^{2}$ tests).

After the removal of cases without a valid Prolific Academic ID (who were therefore not remunerated for their participation), cases who began the questionnaire multiple times, and cases who sent strange emails to the study organizer, the sample consisted of N = 1,169 persons. This sample size deviates from the preregistered size of N = 1,211 because the preregistered sample included some participants who were not remunerated for their participation. The mean age of the final sample was 36.27 years (SD = 11.27) and 64% were female. The participants originated from six different English-speaking countries (68% from the United Kingdom and 28% from the United States).

Measures

We administered six questionnaires in English. We used an MFC format with items presented in triplets and a full ranking instruction. The participants were instructed to rank the items as follows:

On each page you will be presented with a block of three statements. Rank these statements according to how well they describe you in your usual behavior. Place the statement that describes you best at the top, the statement that describes you neither best nor least in the middle, and the statement that describes you least at the bottom.

We included questionnaires measuring personality traits (HEXACO, Big Five, Dark Triad), vocational interests, and other scales related to personality. The six questionnaires in the online survey are displayed in Figure 3. Only the Big Five Triplets is originally an MFC instrument. For the HEXACO-60, we used the MFC version from Wetzel and Frick (2020). For the other instruments, we constructed an MFC version with triplets. The number in the boxes in Figure 3 corresponds to the number of triplets in the questionnaire. In two questionnaires (SD3 and IPIP), an instructed response triplet appeared at a random position. The order in which the triplets were presented was randomized within the six questionnaires, but the order of the questionnaires was the same for all participants (as displayed in Figure 3).

Figure 3.

Properties and Questionnaires of the Online Survey.

Following the questionnaires, participants were asked to evaluate the quality of their data. Three self-report items were used to assess the effort they put into the study, the attention they paid to the study, and whether their data should be used in the analyses (for more details on the item wording see Section Indices for Detecting Careless Responding in MFC Questionnaires). On the last survey page, the participants were asked for demographic information (age, gender, and nationality).

Analytic Strategy

The analyses were conducted using Mplus Version 8.9 (Muthen & Muthen, 1998–2021) and R Version 4.1.0 (R Core Team, 2021) with the packages psych (Revelle, 2021), MplusAutomation (Hallquist & Wiley, 2018), tidyLPA (Rosenberg et al., 2018), mice (van Buuren & Groothuis-Oudshoorn, 2011), MOTE (Buchanan et al., 2019), multiplex (Ostoic, 2020), MFCblockInfo (Frick, 2023), doParallel (Corporation & Weston, 2020), ggplot2 (Wickham, 2016), patchwork (Pedersen, 2022), and apaTables (Stanley, 2021). Moreover, we wrote functions to calculate the careless responding indices in the MFC format (CRinMFC, Kupffer et al., 2022b) and other functions to prepare and read out data related to the Thurstonian item response theory analyses (TirtAutomation, Frick & Kupffer, 2022). Both packages containing these functions are available on GitHub. The analysis plan described in this section was also part of our preregistration.

Pre-Analysis

Sometimes longer response times per page occur due to pausing the study, distraction, or spending more time than usual on thinking before answering. This can bias the average response time over all triplets. Therefore, we calculated the average response time per triplet for each participant. These values are shown in Figure 1 in the Supplemental Materials. To avoid extreme response time values from skewing the overall measure, we replaced values that were at least twice as high as the 75th percentile of the sample’s average response time per triplet (green dashed line in bottom boxplot) with random values between the 25th and the 75th percentile. In total, 5,766 (4%) values were replaced. We replaced these values rather than treating them as missing because missing response times already occur when participants drop out.

Research Question 1

To answer the first research question regarding the correlations among the indices and their factor structure, we first computed the correlations among the indices. For testing Hypotheses 1 to 3, we defined a low correlation as $. 10 \leq | r | < . 30$ and a moderate correlation as $. 30 \leq | r | < . 50$ , as suggested by Cohen (1988). We tested several correlations in our hypotheses and decided that the hypotheses would be considered fulfilled if two-thirds of the correlations met the criteria described above. We checked the prerequisites for calculating an exploratory factor analysis using two criteria: a Kaiser–Meyer–Olkin coefficient above .80 (Kaiser & Rice, 1974), and a Bartlett-Test (Bartlett, 1950) result that differs significantly (p < .05) from zero.

Research Question 2

To answer the second research question regarding whether different subgroups of careless respondents existed, we ran a latent profile analysis with the following indices: response time, self-reported effort, self-reported attention, consistency score, Mahalanobis distance, longOrderMax, longOrderAvg, sameOrder, and triplet variance. To determine the number of latent groups, we used the Bayesian Information Criterion (BIC; with smaller values indicating a better fit) and the model entropy (above .60, with higher values indicating a more accurate classification of respondents into latent groups), the interpretability of the resulting groups, and the parsimony of the solution.

Research Question 3

To answer the third research question regarding the proportion of careless respondents, we computed the proportion of respondents identified as careless by each index. Furthermore, we derived the proportion of persons assigned to profiles that indicated careless responding from the latent profile analysis.

Research Question 4

To answer the fourth research question regarding the occurrence of careless responding, we compared the proportion of careless respondents measured by our indicators in the first (BFT) and last (ORVIS) questionnaire of the online study. For both questionnaires, we computed the proportion of careless respondents identified by the following indices: consistency score, Mahalanobis distance, longOrderMax, longOrderAvg, sameOrder, and triplet variance. These proportions were compared using McNemar’s tests with a Bonferroni-corrected significance level of α = .01.

Simulation of Different Careless Responding Patterns and Proportions

The simulation study aimed to conceptually prove the developed indices and supplements the empirical part by giving first insights into the kind of careless responding patterns that can be detected by different kinds of careless responding indices. Thus, the research questions were: (a) How do the developed indices respond to increasing proportions of careless responding? (b) Which manifestations of careless responding patterns can be detected by which kind of careless responding index? and (c) What is the sensitivity and specificity of the indices at different cutoff values?

According to the literature on careless responding in rating scale data cited earlier, the base rate of careless responding is between 4% and 33%. Therefore, we varied the proportion of careless responding in our samples between 2% and 40% (in intervals of 5%). We modeled the two kinds of careless responding patterns that are discussed in the literature: randomness and longString (in our case longOrder). We divided the second kind into a moderate and a strong form of rank order repetition. The first careless responding pattern type is called random order, as it is manifested in the random selection of the rank orders for all of the blocks in the questionnaire. The second type strong repetition order is manifested in the selection and repetition of one rank order throughout the whole questionnaire. The third type moderate repetition order is manifested in the selection and repetition of one rank order for five consecutive triplets. In all types of careless responding patterns, the rank order “1–2–3” had a higher probability (p = .25) of being selected than the remaining rank orders (p = .15).

We also varied the sample composition. Thus, for each proportion of careless responding in the sample, we also varied the shares (0, .25, .50, .75, and 1) of random and one of the repetition order patterns. For example, for an overall percentage of careless responding in the sample of 2%, we simulated nine different sample compositions (see Table 1). In the first condition, only one type of careless responding (i.e., strong repetition order) was modeled. In Condition 9 on the contrary, the subsample of careless respondents (also 2% in total) consisted of 75% random selection of rank orders and 25% moderate repetition of rank orders. This results in a total of 72 conditions. Each of the simulation conditions was modeled with a sample size of N = 1,000 and was replicated 1,000 times.

Table 1.

Simulation Design of Sample Compositions.

Samplecomposition	Random selectionof rank orders (in %)	Strong repetitionof rank orders (in %)	Moderate repetitionof rank orders (in %)
1	.00	1.00	.00
2	.25	.75	.00
3	.50	.50	.00
4	.75	.25	.00
5	1.00	.00	.00
6	.00	.00	1.00
7	.25	.00	.75
8	.50	.00	.50
9	.75	.00	.25

Note. The overall proportion of careless responding in the simulated samples varies between [.02, .07, .12, .17, .22, .27, .32, and .37] percent.

The simulation was based on the Big Five Triplets questionnaire (for details see Table 2 in the Supplemental Materials), as it is the only validated MFC instrument in our study. We simulated the careful/thoughtful responses based on the Thurstonian IRT model with intercepts [–1; 1] and loadings [0.65; 0.95] sampled from a uniform distribution. For model identification we set the trait means to zero. The correlations among the Big Five traits were set to the correlations obtained in the meta-analysis by Anglim et al. (2020, p. 60).

In each replication, we computed the mean and standard deviation of the indices consistency score, Mahalanobis distance, triplet variance, longOrderMax, longOrderAvg, and sameOrder.³ A range of cutoff values were applied to each index (see Table 2) to compare sensitivity and specificity at these cutoff values. Specifically, we used the Youden-Index (YI; Youden, 1950) to determine the trade-off between sensitivity and specificity. As an overall performance measure of the careless responding indices, taking into account the balance between sensitivity and specificity over a range of cutoff values, we calculated the Area Under the Receiver–Operating Characteristic Curve (AUC).

Table 2.

Cutoff Values in the Simulation Study.

Index	Cutoff 1	Cutoff 2	Cutoff 3	Cutoff 4	Cutoff 5
Consistency score	.60	.65	.70	.75	.80
Mahalanobis distance	$D^{2} > χ_{df, α}^{2}$	–	–	–	–
LongOrderMax	2.0	2.5	3.0	3.5	4.0
LongOrderAvg	.75	1.00	1.25	1.50	1.75
SameOrder	.20	.22	.24	.26	.28
Triplet variance	.65	.70	.75	.80	.85

Note. We also applied a sample-dependent cutoff value (M– 2*SD) for the consistency score.

Results

We first calculated all of the indices adapted to and developed for the MFC format. As the consistency score is based on parameters from Thurstonian item response theory models, we ran these models for each questionnaire before calculating the indices. Unfortunately, the model for the Short Dark Triad Scale (Jones & Paulhus, 2014) in the MFC format failed to converge. Therefore, the consistency score was calculated for each of the remaining five questionnaires. These scores were then averaged. According to the root mean square error of approximation, the model fit of the remaining models was good (BFT = 0.040, BFI-2 = 0.035, HEXACO = 0.031, IPIP scales = 0.033, and ORVIS = 0.044). The three rank order indices, the triplet variance, and the missing values index were calculated for each of the six questionnaires as well as for the whole survey (including all 121 triplets). The response time and RTI were calculated for the whole survey and the RTI for the first and last questionnaires as well. Table 3 displays the means, standard deviations, confidence intervals, cutoff values, and percentage of the sample identified as careless respondents by the indices.

Table 3.

Descriptive Statistics on the Careless Responding Indices Including Cutoff Values and % of the Sample Identified as Careless Respondents by Each Index.

Index	M	SD	95% CI	Cutoff	% careless respondents
Response time in minutes	21.92	6.37	[8.25; 34.83]	M– 2*SD	2
Response time index*	.10	.18	[−.25; .45]	M+ 2*SD	6
Self-report effort	4.72	0.56	[3.62; 5.82]	3 “some”	3
Self-report attention	4.67	0.70	[3.30; 6.04]	3 “some”	4
Self-report use me	.01	.09	[−.17; .19]	1 “no”	1
Instructed response triplet	0.17	0.52	[−0.85; 1.19]	1	10
Consistency score	.76	.06	[.64; .88]	M– 2*SD	5
Mahalanobis distance	13.06	16.18	[−12.55; 41.27]	$D^{2} > χ_{df, α}^{2}$	2
LongOrderMax	2.97	5.09	[−7.01; 12.95]	7	3
LongOrderAvg	1.41	0.84	[−0.24; 3.06]	2	6
SameOrder	.21	.11	[−.01; .43]	.40	3
Triplet variance	.93	.17	[0.60; 1.26]	.70	7
Missingness*	.03	.14	[−.24; .30]	M+ 2*SD	4

Note. Indices marked with an asterisk were not preregistered. CI = confidence interval.

For some of the indices, including the self-report items and the instructed response triplets, natural cutoff values exist. For example, on the instructed response triplets, a given participant can be flagged as careless when he or she fails to select the prescribed rank order. For other indices, cutoff values are needed. As this study is the first to investigate the newly developed indices in the MFC format, no established cutoff values exist. Hence, we used visual inspection to determine cutoff values for the indices longOrderMax, longOrderAvg, sameOrder, and triplet variance and calculated deviations from the sample mean to determine the cutoff values for the response time, RTI, consistency score, and missing values indices. The calculation of the indices and the determination of the cutoff values are part of the R-script “4_calc_indices.R.”

After the cutoff values were applied, a sum score of the 11 preregistered indices was calculated for each participant (TotalSumScore). The TotalSumScore thus counts the number of indices on which a participant was flagged as careless and ranges from 0 to 11. Eight hundred and eighty-four participants (76% of the sample) were not flagged by any of the indices, 165 participants (14% of the sample) were flagged by one index, and 120 participants (10% of the sample) were flagged by two or more indices. Two participants scored the highest TotalSumScore of eight.

RQ1: Correlations Among the Indices

To test Hypotheses 1 to 3, we calculated the correlations among the indices and tested whether two-thirds of the correlations met the criteria of a small or moderate correlation as described above (the correlations are displayed in Table 4). The indices response time and missing values index were not part of the preregistration and therefore not included in the hypotheses tests, but are reported as exploratory results. Bivariate scatter plots of the indices are displayed in Figure 2 in the Supplemental Materials. The self-report item use me did not correlate with most of the preregistered indices. The self-report item effort had a positive but small correlation with the other indices, except for the sameOrder index and Mahalanobis distance for which there were no correlations. The self-report item attention had a positive but small correlation with the instructed response triplet, longOrderMax, and triplet variance indices. Only 11 (out of 24) correlations were low or moderate. Therefore, Hypothesis 1 was not confirmed. The correlation between the self-report items attention and effort was moderate and positive. The correlations of use me with effort and use me with attention were positive but small. Therefore, Hypothesis 2 was not confirmed. The indices longOrderAvg and longOrderMax did not correlate with the consistency score. Therefore, Hypothesis 3 was not confirmed. As preregistered and described in the Method section, we used two criteria to decide whether applying principal axis factoring is reasonable. The Bartlett test showed that the correlations among the indices differed significantly from zero, $χ^{2}$ (91) =10,635, p < .001, but the Kaiser–Meyer–Olkin coefficient was below .80 (KMO = .60). Therefore, the exploratory factor analysis was not carried out.

Table 4.

Correlations Among the Careless Responding Indices.

Index	1	2	3	4	5	6	7	8	9	10	11	12	13
1. Response time in minutes
2. Response time index*	−.63
3. Self-report effort	−.16	.24
4. Self-report attention	−.10	.15	.42
5. Self-report use me	−.01	.10	.26	.13
6. Instructed response triplet	−.18	.32	.20	.13	.05
7. Consistency score	−.16	.35	.18	.09	.10	.41
8. Mahalanobis distance	−.06	.08	.08	.07	−.02	.08	.02
9. LongOrderMax	−.06	.09	.16	.12	−.02	.34	.01	.10
10. LongOrderAvg	.02	.01	.12	.08	−.03	.21	−.03	.08	.93
11. SameOrder	.02	.08	.06	.07	−.02	.26	−.11	.13	.50	.45
12. Triplet variance	−.32	.27	.21	.14	.05	.66	.17	.08	.38	.13	.18
13. Missing values index*	−.34	.16	.17	.11	.10	.28	.08	.01	−.06	−.25	−.26	.78
14. TotalSumScore	−.26	.36	.36	.30	.20	.83	.38	.09	.50	.32	.38	.69	.30

Note. The indices self-report effort, self-report attention, consistency score, and triplet variance were recoded so that higher values indicate “more” careless responding. Indices marked with an asterisk were not preregistered. TotalSumScore = sum score counting for how many of the preregistered indices a given person was categorized as a careless responder.

RQ2: Subgroups of Careless Respondents

We conducted a latent profile analysis on the indices response time, self-report effort, self-report attention, consistency score, Mahalanobis distance, longOrderMax, longOrderAvg, sameOrder, and triplet variance. As described above, we used the BIC and the entropy of the models, the interpretability of the resulting groups, and the parsimony of the solution to determine the number of latent groups. Using these criteria, two models seemed to fit to the data: Model 1 with two classes and Model 3 with four classes (see Table 5). The model with five classes failed to converge. All latent profile analysis models are displayed in Table 6 with the resulting class means on the different indices. In Model 1, 84% of the sample was assigned to Class 1 and 16% to Class 2. A comparison of the predicted class means of the two classes in this model revealed that participants who were assigned to Class 2 on average responded a bit faster, scored lower on the self-report item attention, had higher Mahalanobis distance values, repeated a rank order twice as often in a row on longOrderMax, and had a lower triplet variance. The response time index, instructed response triplet, missing values index, and the self-report item use me were not part of the latent profile analysis. Therefore, for these indices, the class means were calculated post hoc. On average, a participant who was assigned to Class 2 missed one (out of two) instructed response triplets and had 18% missing values. Based on the class means, it appears that Class 2 contains careless participants.

Table 5.

Model Fit of the Latent Profile Analysis Models.

Model	No. of classes	BIC	Entropy	prob_min	prob_max
0	1	25,930	1.00	1.00	1.00
1	2	14,424	0.98	0.98	1.00
2	3	13,407	0.97	0.97	0.99
3	4	12,925	0.95	0.92	1.00

Note. Prob_min/prob_max—minimum/maximum of the average latent class probability for the class membership which is most likely. Model 4 did not converge. BIC = Bayesian information criterion.

Table 6.

Latent Profile Analysis Results.

Index	Model 1		Model 2			Model 3
Index	Class 1	Class 2	Class 1	Class 2	Class 3	Class 1	Class 2	Class 3	Class 4
Proportion	.84	.16	.79	.20	.01	.76	.19	.03	.02
Class means
Response time in minutes	22.45	16.92	22.60	17.55	18.12	22.64	19.23	11.88	16.80
Response time index*	.03	.23	.02	.22	.19	.02	.19	.26	.27
Self-report effort	4.78	4.22	4.81	4.26	4.06	4.82	4.35	4.85	4.20
Self-report attention	4.78	4.04	4.80	4.10	4.38	4.80	4.21	3.94	4.37
Self-report use me*	.01	.02	.01	.02	.00	.00	.02	.05	.00
Consistency score	.77	.71	.77	.70	.78	.78	.70	.70	.76
Mahalanobis distance	13.89	16.79	13.61	16.36	29.18	13.56	16.10	14.32	22.84
LongOrderMax	2.43	5.76	2.40	3.23	32.38	2.39	3.27	0.53	20.80
LongOrderAvg	1.39	1.51	1.39	1.20	5.52	1.38	1.38	0.16	3.67
SameOrder	.20	.25	.20	.20	.78	.20	.21	.03	.62
Triplet variance	.98	.67	.98	.76	.30	.98	.88	.27	.41
Instructed response triplet*	0.03	1.01	0.02	0.76	1.56	0.01	0.57	1.36	1.56
Missing values index*	.00	.18	.00	.15	.01	.00	.05	.69	.09

Note. Indices marked with an asterisk were not part of the latent profile analysis but were calculated post hoc.

In Model 3, 76% of the participants were assigned to Class 1. The second largest class (19%; Class 2) was on average 3 min faster than Class 1, scored lower on the self-report items (especially on attention, similarly to Class 2 in Model 1), had higher Mahalanobis distance values, and repeated on average one triplet more on longOrderMax. Respondents who were assigned to Class 3 (comprising 3% of the sample) responded ten minutes faster on average than Class 1 and skipped on average 69% of the items. Due to this high rate of missing values, the other indices will not be described for this class. The fourth class in this model (Class 4; 2%) contained respondents who responded on average almost five minutes faster than Class 1. Participants in this class had on average high values on the rank order indices and Mahalanobis distance and low values on triplet variance. Based on the class means, Model 3 revealed three different classes of careless respondents: one that was characterized by producing many missing values, another one that repeated one rank order very often in a row, and one that exhibited less extreme values on the indices but still deviated from the careful respondents in Class 1.

Graphical and Statistical Model Validation

In order to cross-check the results of Model 3, we conducted additional analyses. For a graphical model validation, we selected three of the indices used in the latent profile analysis to gain deeper insights into the behavioral differences between the four classes: triplet variance, response time, and longOrderMax. Figure 4 in the Supplemental Materials shows violin plots combined with scatter overlays to compare the density distribution and dispersion of data points across five (sub)samples: the first column shows data from all participants and the second through fifth columns show data from Classes 1 to 4. In general, the distinction between the four classes is most obvious when looking at the differences between the classes on several indices (not just one).

Triplet Variance

Class 1 contained participants with high values on the triplet variance. The dispersion in this class was very small compared with Class 2 which also included participants with comparably high values. The values on triplet variance in Class 3 and Class 4 also had a large dispersion, but the majority of the values were below .70, which is rather low compared with the total sample.

Response Time

Participants in Class 3 had on average the fastest response times. It was also evident that Class 2 included participants with faster response times in comparison to the careful Class 1.

LongOrderMax

The lower left plot shows the large dispersion in the total sample and Class 4 on the index longOrderMax. Shortening the y-axis (12 values are truncated in the lower right plot), again Class 1 contained the homogeneous values and Class 2 had a greater dispersion. Class 3 contained those individuals who had the fewest repetitions on the index longOrderMax compared with the total sample. As a reminder, a longOrderMax of 0 means that a participant selected a rank order and on the next page a different rank order, i.e., individuals in this class on average repeated a rank order less than once. Class 4 contained those participants who repeated a given rank order very often in a row.

Multinominal Logistic Regression Models

To evaluate the plausibility of the four classes resulting from Model 3, we ran two multinominal logistic regressions to examine the relationship between the indices response time index and instructed response triplet and class membership. We chose these indices because they were not used to assign the participants to the latent classes and thus can be used to cross-check the model solution. The dependent variable has four categories (i.e., the classes of Model 3). We chose Class 1 as the reference category because it is the largest class and expected to include careful/thoughtful participants.

Response Time Index

The model fit the data ( $χ^{2} (3$ ) = 132.66, p < .001). The response time index explained 11% of the variance in class membership according to Nagelkerke’s Pseudo $R^{2}$ . The response time index was significant in all of the comparisons. Participants with higher values on the response time index were 12.68 times more likely to be in Class 2 than Class 1 (b = 2.98, Wald $χ^{2} (1)$ = 8.30, p < .001, $e^{b}$ = 12.68), 19.69 times more likely to be in Class 3 than Class 1(b = 2.98, Wald $χ^{2} (1)$ = 6.42, p < .001, $e^{b}$ = 19.69), and 19.89 times more likely to be in Class 4 than Class 1 (b = 2.99, Wald $χ^{2} (1)$ = 6.17, p < .001, $e^{b}$ = 19.89).

Instructed Response Triplet

The model fit the data ( $χ^{2} (3$ ) = 414.73, p < .001). The instructed response triplet explained 36% of the variance in class membership according to Nagelkerke’s Pseudo $R^{2}$ . The instructed response triplet was significant in all of the comparisons. Participants with higher values on the instructed response triplet were 16.11 times more likely to be in Class 2 than Class 1 (b = 2.78, Wald $χ^{2} (1)$ = 9.45, p < .001, $e^{b}$ = 16.11), 43.81 times more likely to be in Class 3 than Class 1 (b = 3.78, Wald $χ^{2} (1)$ = 8.22, p < .001, $e^{b}$ = 43.81), and 58.56 times more likely to be in Class 4 than Class 1 (b = 4.07, Wald $χ^{2} (1)$ = 10.37, p < .001, $e^{b}$ = 58.56).

In summary, the model validation of the latent profile analysis was accomplished in two ways: (a) graphically by going deeper into indices that were part of the analysis and (b) statistically by multinominal logistic regression models in which class membership was predicted by indices that were not included in the latent profile analysis. Both kinds of analyses revealed that respondents in the four classes differed on a behavioral level that is captured by the careless responding indices.

RQ3: Proportion of Careless Responding

According to the results of the latent profile analysis, 16% to 24% of the sample were assigned to classes that showed careless responding behaviors. Another approach to group respondents as careless versus careful is the use of cutoff values. Table 3 displays the proportion of the sample identified as careless by each of the indices. The proportion of participants flagged by the indices varied between 1% for the self-report item use me and 10% for the instructed response triplets.

RQ4: Comparison of Careless Responding Between First and Last Questionnaire

In this part of the analyses, we compared the proportion of careless respondents identified by our indices in the first (BFT) versus the last (ORVIS) questionnaire in the online survey. We applied the cutoff values described in Table 3. The results of the McNemar’s tests are displayed in Table 7. For four out of the six preregistered indices (longOrderMax, longOrderAvg, sameOrder, and triplet variance), the proportion of careless respondents differed significantly between the two questionnaires. For three of these, the proportion of careless respondents was larger in the last compared with the first questionnaire, whereas for sameOrder, the effect went in the opposite direction. We additionally compared the proportion of careless respondents identified by the response time index and missing values index in an exploratory analysis, as these indices were not preregistered. For both indices, the proportion in the last questionnaire was higher than in the first.

Table 7.

Careless Responding in the First and Last Questionnaire in the Online Survey.

Index	Cutoff	Proportion of careless respondents		$χ^{2}$	p value
Index		BFT	ORVIS
Consistency score	M−2*SD	.04	.04	0.00	1.000
Mahalanobis distance	$D^{2} > χ_{df, α}^{2}$	.00	.00	0.17	.683
LongOrderMax	7	.01	.03	13.79	<.001
LongOrderAvg	2	.00	.01	7.56	.006
SameOrder	.40	.06	.04	8.00	.005
Triplet variance	.70	.06	.09	13.93	<.001
Response time index*	M+ 2*SD	.05	.08	18.85	<.001
Missing values index*	M+ 2*SD	.02	.04	13.56	<.001

Note. We preregistered to test on a Bonferroni-corrected alpha level of α = .01. Indices marked with an asterisk were not preregistered. BFT = Big Five Triplets; ORVIS = Oregon Vocational Interest Scales.

Exploratory Analyses: Impact of Removing Careless Respondents From the Sample

To investigate the impact of removing the participants who were flagged as careless by our indices from the sample, we computed several correlations with and without these respondents and compared the results with correlations reported in the literature (see Table 3 in Supplemental Materials). We chose two criteria on which to flag individuals: (a) the results of the latent profile analysis (Class 2 to 4 in Model 3) and (b) the combination of four of our indices (RTI, triplet variance, instructed response triplet, and longOrderMax). Criterion A flagged 24% of the sample and Criterion B 16%. Seventy-four percent of the sample was identified as careful by both criteria and 14% as careless by both criteria. For 12% of the sample, the two criteria did not overlap. The results of these two approaches did not differ much. Due to estimation problems with the Thurstonian item response theory model for the Short Dark Triad Scale, we cannot compare the correlations of narcissism with age and gender as we planned in our preregistration.

The correlations of the Big Five personality traits with age and gender reported in the literature and based on our sample are displayed in Table 3 in the Supplemental Materials. In general, most of the correlations did not differ meaningfully after removing the careless participants from the sample. A second exploratory analysis addressed the reliability of trait estimates on the scales. We compared the empirical reliability of the trait estimates on the scales based on the whole sample with the one without the careless respondents. To compare the reliability estimates, we transformed them using Fischer’s Z transformation. Removing the careless participants slightly improved the reliability estimates of all Big Five scales measured using the Big Five Inventory 2 (see Table 3 in Supplemental Materials).

Simulation Study

In 22,831 of the 72,000 replications (32%), factor scores could not be computed due to convergence problems of the Thurstonian item response theory model. These convergence problems occurred mainly in conditions where the proportion of careless responding exceeded 22% of the sample and strong repetition of rank orders was modeled. Thus, the first finding of the simulation study is that high proportions of careless responding can lead to model convergence problems. In these cases, the consistency score index could not be computed. Details on the simulation study can be found in the GitHub repository in the subfolder “7_simulation_study.”

Type of Careless Responding

A number of possible sample compositions were modeled (see Table 1). Figure 4 shows the trend for each index as the number of careless respondents increases. For each index, the values are also broken down by modeled type of careless responding. Thus, the purple line of strong repetition order consists only of values from conditions where 100% strong repetition of rank orders was modeled. The red line represents the average across all conditions for the given proportion of careless responding in the sample. As expected, the indices do capture different manifestations of careless responding, which will be addressed in detail.

Figure 4.

Different Proportions and Types of Careless Responding in the Simulated Samples.

Consistency Score

With increasing proportions of careless responding in the sample, the trend for all modeled types of careless responding is negative for the consistency score (see upper left corner in Figure 4). Higher values indicate a more consistent response pattern and thus less careless responding. Therefore, the results are in line with the conceptual idea of this index. Due to the above-mentioned convergence problems with the Thurstonian item response model, some data points are missing for strong repetition of rank orders.

Triplet Variance

With increasing proportions of careless responding, the trend for moderate and strong repetition of rank orders is negative while random selection of rank orders remains stable. Higher values refer to more variance in the response pattern (i.e., variation in the different ranking options). Therefore, the results are in line with the conceptual idea of this index.

LongOrderMax and LongOrderAvg

With increasing proportions of careless responding, the trend for moderate and strong repetition of rank orders is positive while random selection of rank orders remains stable (see middle row in Figure 4). Higher values indicate a longer sequence of repeated rank orders. Therefore, the results are in line with the conceptual idea of the two indices.

SameOrder

The third rank order index expresses the proportion of triplets for which the presented rank order was copied. With increasing proportions of careless responding the trend for all types of careless responding is slightly positive. The results are in line with the conceptual idea of the index.

Mahalanobis Distance

This index expresses the distance between a person’s response and the sample mean, accounting for correlations between responses. With increasing proportions of careless responding the trend for all types of careless responding is stable (despite the outlier). Therefore, the results are not in line with the conceptual idea (i.e., more carelessness, more deviation from the sample mean) of this index.

Sensitivity, Youden-Index, and AUC for a set of Cutoff Values

In Table 8, the results regarding the sensitivity, Youden-Index of different cutoff values, and the AUC for each index are displayed. This table is condensed, as only the means for three proportions of careless responding are reported. A complete results table can be found in the Supplemental Materials. For all analyzed indices except for the Mahalanobis distance, the AUC values are in an acceptable to excellent range (see Figure 5). The Youden-Index yields the best results for the following cutoff values: consistency score of .70 or .75, triplet variance of .80, LongOrderMax of 3 or 3.5, LongOrderAvg of 1.25 (but comparable to the other values), and sameOrder between .20 and .24. The Youden-Index for the cutoff value for the Mahalanobis distance is very low (see Figure 6).

Table 8.

Simulation Study Results.

Index	% CR	M	SD	AUC	Cutoff value 1		Cutoff value 2		Cutoff value 3		Cutoff value 4		Cutoff value 5
Index					Sensitivity	YI	Sensitivity	YI	Sensitivity	YI	Sensitivity	YI	Sensitivity	YI
Consistency score	.02	.84	0.07	1.00	.72	.72	.90	.90	.97	.96	1.00	.95	1.00	.82
	.22	.79	0.11	.92	.38	.38	.58	.57	.72	.70	.80	.71	.84	.55
	.37	.75	0.11	.91	.31	.31	.55	.55	.76	.73	.88	.75	.92	.55
Triplet variance	.02	.92	0.08	.82	.48	.48	.48	.48	.49	.48	.58	.56	.62	.54
	.22	.85	0.20	.82	.48	.48	.48	.48	.49	.49	.58	.56	.62	.54
	.37	.79	0.24	.81	.48	.48	.48	.48	.49	.49	.58	.56	.62	.54
LongOrderMax	.02	1.49	1.38	.79	.59	.54	.59	.54	.56	.56	.56	.56	–	–
	.22	2.77	3.64	.79	.59	.54	.59	.54	.56	.56	.56	.56	–	–
	.37	3.73	4.39	.79	.59	.54	.59	.54	.56	.56	.56	.56	–	–
LongOrderAvg	.02	0.47	0.33	.79	.61	.52	.56	.56	.56	.56	.55	.55	.55	.55
	.22	0.74	0.76	.79	.61	.52	.56	.55	.56	.56	.55	.55	.55	.55
	.37	0.94	0.90	.79	.61	.51	.56	.55	.56	.56	.55	.55	.55	.55
SameOrder	.02	0.17	0.09	.67	.53	.29	.53	.29	.53	.29	.32	.22	.32	.22
	.22	0.19	0.14	.65	.52	.28	.52	.28	.52	.28	.31	.21	.31	.21
	.37	0.20	0.17	.65	.52	.29	.52	.29	.52	.29	.31	.21	.31	.21
Mahalanobis distance	.02	32.65	147.32	.55	.26	.00	–	–	–	–	–	–	–	–
	.22	7.96	46.58	.52	.06	.00	–	–	–	–	–	–	–	–
	.37	−20.98	93.54	.52	.03	.00	–	–	–	–	–	–	–	–

Note. % CR = Proportion of careless responding in the sample; AUC = area under the receiver–operating-characteristic curve; YI = Youden-Index.

Figure 5.

Area Under the Receiver–Operating Characteristic Curve Values for Different Careless Responding Indices and Proportions of Careless Responding in the Sample.

Figure 6.

Youden-Index for Different Cutoff Values for the Careless Responding Indices.

The results of the simulation study can be summarized as follows: (a) With the exception of Mahalanobis distance, the results are consistent with the conceptual idea of the indices, resulting in trends in the expected direction with increasing proportions of careless responding in the sample. (b) The indices capture different kinds of careless responding: the consistency score captured moderate and random responding best, the triplet variance captured strong and random responding best, the longOrderMax and -Avg captured strong and moderate responding best, and the sameOrder captured all forms of careless responding. (c) The balance between sensitivity and specificity of the indices over a range of cutoff values (i.e., AUC) was in an acceptable to excellent range.

Discussion

The development of models that allow deriving normative trait estimates boosted the use of the MFC format in research and practice. Several MFC questionnaires especially for the assessment of personality traits have been developed including occupational personality inventories (e.g., Watrin et al., 2019), and scales assessing maladaptive personality traits (e.g., Guenole et al., 2018). In addition, research has, for example, investigated the validity of existing instruments (e.g., Walton et al., 2020; Watrin et al., 2019; Wetzel & Frick, 2020; Zhang et al., 2020) and developed methods that can be used for constructing new instruments (e.g., Frick, 2023; Pavlov et al., 2021) and for detecting differential item functioning (e.g., Lee et al., 2021). However, it was so far unclear how to detect careless responding in MFC data. Therefore, in this study, we developed a number of indices to detect careless responding in MFC questionnaires. These indices can easily be computed using the functions we published on GitHub (https://github.com/rkupffer/CRinMFC). In the present study, we provide a first examination of the indices’ performance and the extent of careless responding in MFC data based on a large online sample. In general, it seems that careless responding is an important issue that should be addressed not only in rating scale data but also in the MFC format. However, adapting careless responding indices from the rating scale to the MFC format is not straightforward and will not necessarily lead to the same behavior and performance. Our hypotheses on the indices’ relations, which were based on prior research with similar indices in the rating scale format, all had to be rejected. According to the results of the latent profile analysis, between 16% and 24% of the sample responded carelessly. The examined indices differed in their performance with the triplet variance, longOrderMax, instructed response triplets, and response time index performing best overall. There was twofold evidence for this in the data: (a) the correlations of these indices with the other indices were comparatively high, and (b) the average class means in the latent profile analysis differed on these indices. Removing participants who were flagged by these four indices slightly improved reliability estimates for most of the scales. The results of the simulation study confirmed that, with the exception of the Mahalanobis distance, the examined indices work as intended conceptually.

Is Careless Responding the Same in Rating Scales and MFC Questionnaires?

Various results from this study can be compared with the results of studies investigating careless responding in rating scale questionnaires. Although the correlations among the indices for the MFC format were smaller than expected, the results regarding latent classes of careless respondents and the increase in careless responding over the course of the online survey were in line with studies using questionnaires in the rating scale format. These similarities and differences will be discussed in the following paragraphs.

Correlations Among the Careless Responding Indices

Our results showed some differences to findings from rating scales regarding the nature of careless responding and the performance of the indices. One of the most evident differences was the correlational structure among the indices. Some of the indices that worked well in the rating scale format, including Mahalanobis distance, did not correlate with most of the other indices in the MFC format. Moreover, the indices did not share a substantial amount of variance, as indicated by a low KMO value.

We posed our hypotheses regarding the correlations among the indices based on results obtained with rating scale data by Meade and Craig (2012) and implemented the same self-report items in our online survey as they did. Two of the three hypotheses concerned the correlations with and among the three self-report items effort, attention, and use me. The two response formats differ from each other; therefore, it is not surprising that the correlations of our MFC indices with the self-report items differed from the ones reported by Meade and Craig for the rating scale indices. However, the correlations among the self-report items also differed between studies, despite the fact that these items were identical in both studies. The most striking differences were found for use me, for which we found no relationships with most of the other indices. It is possible that this result is related to our data collection using Prolific Academic. On this platform, every user has a unique ID that can be linked across several studies. Researchers who detect low-quality or suspicious data can report these IDs to Prolific (e.g., Lumsden, 2018). Participants might have feared being rejected from future studies if they stated that their data should not be used for analysis.

Types of Indices and Groups of Respondents

Other studies using the rating scale format tended to report small relationships among the careless responding indices and argued that this small overlap underscores the need to use different types of indices (e.g., DeSimone & Harms, 2018). With the term types of indices, we refer to groups of indices that capture similar response behavior: For instance, the rank order indices are of the same type, while the self-report items are of another type. Another result of our study which might support the recommendation to use different types of indices for data cleaning are the different latent classes. In latent class or profile analysis, multiple indicators are used to assign participants to classes. The results of the latent profile analysis in this study suggest that our sample consisted of four classes: a large careful class containing the majority of participants, and three smaller careless classes. The three careless classes differed with regard to the extent and manifestation of careless responding measured by our indicators. Studies conducting latent profile analysis to investigate careless responding in the rating scale format have often found three classes of respondents (e.g., Kam & Meyer, 2015; Maniaci & Rogge, 2014; Meade & Craig, 2012). The additional class we found is more likely to be attributed to differences in the handling of missing values (i.e., we did not exclude participants with missing values from the analysis) than to the difference in response format. The remaining three classes converge with those found for rating scales. In line with results by Kam and Meyer (2015), Maniaci and Rogge (2014), and Meade and Craig (2012), we found a small class containing participants who chose one response option (scale point or rank order) repeatedly in a row and another (somewhat larger) careless class with participants responding inconsistently. In the previous studies, the latter class was also distinguishable from the other classes by high values on Mahalanobis distance, which was not evident in our data. The proportion of the sample assigned to the different classes varied in previous research. In Kam and Meyer (2015), the class of careless respondents with high values on LongString comprised 14% of the sample. In our sample as well as in the other two rating scale studies (Maniaci & Rogge, 2014; Meade & Craig, 2012), this class was notably smaller (i.e., 1–2%). Thus, although the rating scale and MFC format differ substantially from each other, our results suggest some similarities in the nature of careless responding across the two formats.

Careless Responding Over the Course of the Study

For the rating scale format, increases in the occurrence of careless responding over the course of the study have been found (e.g., Bowling et al., 2021; Clark et al., 2003; Galesic & Bosnjak, 2009). Hence, we compared the percentage of respondents identified as careless in the first and last questionnaires of the online survey. Interestingly, the sameOrder index yielded a larger proportion of careless respondents in the first questionnaire. This might be related to the fact that people who are not yet familiar with the forced-choice format choose the presented rank order more often when they are unsure about how to rank the items. Later in the questionnaire, as they gain more experience, they may choose different rank orders in this case.

The proportion of participants flagged as careless by the other two rank order indices, longOrderMax, and longOrderAvg, and by the triplet variance was significantly higher in the last questionnaire. This increase in careless responding over the course of the study is in line with findings for the rating scale format regarding the LongString index (Bowling et al., 2021) and the variability in the response pattern (Galesic & Bosnjak, 2009). In an exploratory analysis, we found that the proportion of participants identified as careless by the response time index and missing values index was higher in the last questionnaire. The result for the response time index is in line with previous research (e.g., Bowling et al., 2021; Galesic & Bosnjak, 2009), but the result for missing values contrasts with the results by Galesic and Bosnjak (2009), who reported no increase in missing values.

These results suggest that at least some participants who were flagged as careless respondents by the indices did not respond carelessly to all questionnaires in the online study. Therefore, it might be beneficial to exclude careless respondents using model-based approaches, perhaps based on response times (e.g., Ulitzsch et al., 2022), or on the questionnaire level rather than cutoff values based on the whole survey.

Recommendations for Detecting Careless Responding in the MFC Format

The results of our study have several implications for practitioners using questionnaires in the MFC format. First, we recommend including instructed response triplets together with the questionnaire of interest. The efficacy of the instructed response triplets was highlighted by the results of the latent profile analysis: The instructed response triplet index was not used as a variable to assign the participants to different classes, but the post hoc calculation of the average class mean on this index showed that the careful classes missed practically none of the instructed response triplets. When including this kind of question in a survey, researchers must bear in mind that it may unsettle or confuse participants who are conscientiously filling out the questionnaire. Therefore, it is recommended to keep the number of instructed response items in balance with the total number of survey items (e.g., one instructed response item per 50–100 questionnaire items; Meade & Craig, 2012).

Second, throughout both parts of our study, the Mahalanobis distance did not provide promising results. Specifically, there was no substantial overlap with the other indices. Moreover, with increasing proportions of careless responding in the sample, there was no obvious effect on the Mahalanobis distance visible. Therefore, we recommend calculating the index only as an additional measure, if at all.

Third, as our indices varied in their sensitivity to different manifestations and degrees of careless responding, we recommend using different kinds of indicators to identify distinct careless responding groups. Therefore, after data collection is completed, we recommend using at least the indices triplet variance, longOrderMax, and response time index to screen for careless respondents. These three indices capture different kinds of careless responding behaviors. To calculate the response time index, the response time per page is needed. Therefore, when implementing an online survey, it is advisable to check beforehand how the response time is measured.

Limitations and Future Directions

As this study is the first to investigate detection methods for careless responding in the MFC format, the results need to be validated in further studies using different data sets. For practitioners, the careless responding indices can be a valuable tool. However, we must acknowledge their limitations (e.g., flagging false positives). As our understanding of the manifestation of careless responding in MFC questionnaires deepens, future research should focus on the development of model-based detection methods. A general limitation of some indices, including Mahalanobis distance or the response time, is that their cutoff values are sample-dependent and cannot easily be generalized across studies. The simulation study was useful for a conceptual proof of the indices and we identified cutoff values for some indices. These results are subject to validation in further analysis, as we made strong assumptions in our simulation (e.g., only interindividual variation of careless responding). Furthermore, the performance of the consistency score might depend on how well the model fits the data. Another limitation of the present study is that all questionnaires were forced-choice questionnaires with three items per block and a full ranking instruction. For other variants of the MFC format, the careless responding indices may have to be slightly modified. To what extent our results are applicable to other variants of the MFC format needs to be investigated in further studies.

Conclusion

This study serves as a starting point for investigating careless responding in MFC questionnaires. The adapted and newly developed indices to detect careless responding in the MFC format differed in how well they worked. Some practical recommendations include using an instructed response triplet and calculating several indices post hoc that capture different manifestations of careless responding behavior. Future studies should validate the results and work to establish appropriate cutoff values.

Supplemental Material

sj-pdf-1-epm-10.1177_00131644231222420 – Supplemental material for Detecting Careless Responding in Multidimensional Forced-Choice Questionnaires

Supplemental material, sj-pdf-1-epm-10.1177_00131644231222420 for Detecting Careless Responding in Multidimensional Forced-Choice Questionnaires by Rebekka Kupffer, Susanne Frick and Eunike Wetzel in Educational and Psychological Measurement

Footnotes

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by a grant from the Young Scholar Fund of the University of Konstanz awarded to Eunike Wetzel and by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)—GRK 2277 “Statistical Modeling in Psychology.”

ORCID iDs

Rebekka Kupffer

Susanne Frick

Eunike Wetzel

Data Availability Statement

The study was preregistered (https://osf.io/sfwnp/). The data and analysis code are available on GitHub (Kupffer et al., 2022a; ).

Supplemental Material

Supplemental material for this article is available online.

Notes

References

Abbey

J. D.

Meloy

M. G.

(2017). Attention by design: Using attention checks to detect inattentive respondents and improve data quality. Journal of Operations Management, 53–56(1), 63–70. https://doi.org/10.1016/j.jom.2017.06.001

Anglim

Horwood

Smillie

L. D.

Marrero

R. J.

Wood

J. K.

(2020). Predicting psychological and subjective well-being from personality: A meta-analysis. Psychological Bulletin, 146(4), 279–323. https://doi.org/10.1037/bul0000226

Arias

V. B.

Garrido

L. E.

Jenaro

Martínez-Molina

Arias

(2020). A little garbage in, lots of garbage out: Assessing the impact of careless responding in personality survey data. Behavior Research Methods, 52(6), 2489–2505. https://doi.org/10.3758/s13428-020-01401-8

Ashton

M. C.

Lee

(2009). The HEXACO-60: A short measure of the major dimensions of personality. Journal of Personality Assessment, 91(4), 340–345. https://doi.org/10.1080/00223890902935878

Bartlett

M. S.

(1950). Tests of significance in factor analysis. British Journal of Statistical Psychology, 3(2), 77–85. https://doi.org/10.1111/j.2044-8317.1950.tb00285.x

Birkeland

S. A.

Manson

T. M.

Kisamore

J. L.

Brannick

M. T.

Smith

M. A.

(2006). A meta-analytic investigation of job applicant faking on personality measures. International Journal of Selection and Assessment, 14(4), 317–335. https://doi.org/10.1111/j.1468-2389.2006.00354.x

Bowling

N. A.

Gibson

A. M.

Houpt

J. W.

Brower

C. K.

(2021). Will the questions ever end? Person-level increases in careless responding during questionnaire completion. Organizational Research Methods, 24(4), 718–738. https://doi.org/10.1177/1094428120947794

Bowling

N. A.

Huang

J. L.

Bragg

C. B.

Khazon

Liu

Blackmore

C. E.

(2016). Who cares and who is careless? Insufficient effort responding as a reflection of respondent personality. Journal of Personality and Social Psychology, 111(2), 218–229. https://doi.org/10.1037/pspp0000085

Brown

(2016). Item response models for forced-choice questionnaires: A common framework. Psychometrika, 81(1), 135–160. https://doi.org/10.1007/s11336-014-9434-9

10.

Brown

Bartram

(2011). OPQ32r technical manual. SHL Group Ltd.

11.

Brown

Maydeu-Olivares

(2011). Item response modeling of forced-choice questionnaires. Educational and Psychological Measurement, 71(3), 460–502. https://doi.org/10.1177/0013164410375112

12.

Brown

Maydeu-Olivares

(2018). Modeling of forced-choice response formats. In Irwing

Booth

Hughes

(Eds.), The Wiley handbook of psychometric testing (pp. 523–569). John Wiley & Sons.

13.

Buchanan

E. M.

Gillenwaters

Scofield

J. E.

Valentine

K. D.

(2019). MOTE: Measure of the effect: Package to assist in effect size calculations and their confidence intervals. http://github.com/doomlab/MOTE

14.

Cao

Drasgow

(2019). Does forcing reduce faking? A meta-analytic review of forced-choice personality measures in high-stakes situations. Journal of Applied Psychology, 104(11), 1347–1368. https://doi.org/10.1037/apl0000414

15.

Christiansen

N. D.

Burns

G. N.

Montgomery

G. E.

(2005). Reconsidering forced-choice item formats for applicant personality assessment. Human Performance, 18(3), 267–307. https://doi.org/10.1207/s15327043hup1803_4

16.

Clark

M. E.

Gironda

R. J.

Young

R. W.

(2003). Detection of back random responding: Effectiveness of MMPI-2 and Personality Assessment Inventory validity indices. Psychological Assessment, 15(2), 223–234. https://doi.org/10.1037/1040-3590.15.2.223

17.

Cohen

(1988). Statistical power analysis for the behavioral sciences. Lawrence Erlbaum Associates.

18.

Corporation

Weston

(2020). doParallel: Foreach parallel adaptor for the “parallel” package. https://CRAN.R-project.org/package=doParallel

19.

Credé

(2010). Random responding as a threat to the validity of effect size estimates in correlational research. Educational and Psychological Measurement, 70(4), 596–612. https://doi.org/10.1177/0013164410366686

20.

Curran

P. G.

(2016). Methods for the detection of carelessly invalid responses in survey data. Journal of Experimental Social Psychology, 66, 4–19. https://doi.org/10.1016/j.jesp.2015.07.006

21.

DeSimone

J. A.

DeSimone

A. J.

Harms

P. D.

Wood

(2018). The differential impacts of two forms of insufficient effort responding. Applied Psychology, 67(2), 309–338. https://doi.org/10.1111/apps.12117

22.

DeSimone

J. A.

Harms

P. D.

(2018). Dirty data: The effects of screening respondents who provide low-quality data in survey research. Journal of Business and Psychology, 33(5), 559–577. https://doi.org/10.1007/s10869-017-9514-9

23.

DeSimone

J. A.

Harms

P. D.

DeSimone

A. J.

(2015). Best practice recommendations for data screening. Journal of Organizational Behavior, 36(2), 171–181. https://doi.org/10.1002/job.1962

24.

Dupuis

Meier

Cuneo

(2019). Detecting computer-generated random responding in questionnaire-based data: A comparison of seven indices. Behavior Research Methods, 51(5), 2228–2237. https://doi.org/10.3758/s13428-018-1103-y

25.

Eid

Gollwitzer

Schmitt

(2017). Statistik und Forschungsmethoden: Lehrbuch [Statistics and research methods]. (Originalausgabe, 5., korrigierte edition). Beltz.

26.

Frick

(2021). MFCblockInfo: Compute block information for multidimensional forced-choice questionnaires [Computer software]. https://github.com/susanne-frick/MFCblockInfo

27.

Frick

(2023). Estimating and Using Block Information in the Thurstonian IRT Model. Psychometrika, 88(4), 1556–1589. https://doi.org/10.1007/s11336-023-09931-8

28.

Frick

Kupffer

(2022). R package: Automate the estimation of Thurstonian IRT models [Computer software]. https://github.com/susanne-frick/TirtAutomation

29.

Galesic

Bosnjak

(2009). Effects of questionnaire length on participation and indicators of response quality in a web survey. Public Opinion Quarterly, 73(2), 349–360. https://doi.org/10.1093/poq/nfp031

30.

Goldammer

Annen

Stöckli

P. L.

Jonas

(2020). Careless responding in questionnaire measures: Detection, impact, and remedies. The Leadership Quarterly, 31(4), Article 101384. https://doi.org/10.1016/j.leaqua.2020.101384

31.

Goldberg

L. R.

Johnson

J. A.

Eber

H. W.

Hogan

Ashton

M. C.

Cloninger

C. R.

Gough

H. G.

(2006). The international personality item pool and the future of public-domain personality measures. Journal of Research in Personality, 40(1), 84–96. https://doi.org/10.1016/j.jrp.2005.08.007

32.

Guenole

Brown

A. A.

Cooper

A. J.

(2018). Forced-choice assessment of work-related maladaptive personality traits: Preliminary evidence from an application of Thurstonian item response modeling. Assessment, 25(4), 513–526. https://doi.org/10.1177/1073191116641181

33.

Hallquist

M. N.

Wiley

J. F.

(2018). MplusAutomation: An R package for facilitating large-scale latent variable analyses in Mplus. Structural Equation Modeling: A Multidisciplinary Journal, 25(4), 621–638. https://doi.org/10.1080/10705511.2017.1402334

34.

Hicks

L. E.

(1970). Some properties of ipsative, normative, and forced-choice normative measures. Psychological Bulletin, 74(3), 167–184. https://doi.org/10.1037/h0029780

35.

Huang

J. L.

Curran

P. G.

Keeney

Poposki

E. M.

DeShon

R. P.

(2012). Detecting and deterring insufficient effort responding to surveys. Journal of Business and Psychology, 27(1), 99–114. https://doi.org/10.1007/s10869-011-9231-8

36.

Huang

J. L.

Liu

Bowling

N. A.

(2015). Insufficient effort responding: Examining an insidious confound in survey data. Journal of Applied Psychology, 100(3), 828–845. https://doi.org/10.1037/a0038510

37.

Johnson

J. A.

(2005). Ascertaining the validity of individual protocols from Web-based personality inventories. Journal of Research in Personality, 39(1), 103–129. https://doi.org/10.1016/j.jrp.2004.09.009

38.

Jones

D. N.

Paulhus

D. L.

(2014). Introducing the Short Dark Triad (SD3): A brief measure of dark personality traits. Assessment, 21(1), 28–41. https://doi.org/10.1177/1073191113514105

39.

Kaiser

H. F.

Rice

(1974). Little Jiffy, Mark Iv. Educational and Psychological Measurement, 34(1), 111–117. https://doi.org/10.1177/001316447403400115

40.

Kam

C. C. S.

Meyer

J. P.

(2015). How careless responding and acquiescence response bias can influence construct dimensionality: The case of job satisfaction. Organizational Research Methods, 18(3), 512–541. https://doi.org/10.1177/1094428115571894

41.

Kupffer

Frick

Wetzel

(2022a). Data set and scripts of an exploratory investigation of the indices to detect careless responding in the multidimensional forced-choice format [Computer software]. https://github.com/rkupffer/ExploratoryAnalysesCRinMFC

42.

Kupffer

Frick

Wetzel

(2022b). R package: Detecting careless responding in multidimensional forced-choice questionnaires [Computer software]. https://github.com/rkupffer/CRinMFC

43.

Lee

Joo

S.-H.

Stark

(2021). Detecting DIF in Multidimensional Forced Choice Measures Using the Thurstonian Item Response Theory Model. Organizational Research Methods, 24(4), 739–771. https://doi.org/10.1177/1094428120959822

44.

Lumsden

(2018, August 17). Improve your data quality from online survey research. Prolific. https://blog.prolific.ac/how-to-improve-your-data-quality/

45.

Mahalanobis

P. C.

(1936). On the generalized distance in statistics. http://bayes.acs.unt.edu:8083/BayesContent/class/Jon/MiscDocs/1936_Mahalanobis.pdf

46.

Maniaci

M. R.

Rogge

R. D.

(2014). Caring about carelessness: Participant inattention and its effects on research. Journal of Research in Personality, 48, 61–83. https://doi.org/10.1016/j.jrp.2013.09.008

47.

Maydeu-Olivares

Brown

(2010). Item response modeling of paired comparison and ranking data. Multivariate Behavioral Research, 45(6), 935–974. https://doi.org/10.1080/00273171.2010.531231

48.

Meade

A. W.

Craig

S. B.

(2012). Identifying careless responses in survey data. Psychological Methods, 17(3), 437–455. https://doi.org/10.1037/a0028085

49.

Muthen

L. K.

Muthen

B. O.

(1998–2021). Mplus [Computer software]. Muthén & Muthén.

50.

Nichols

D. S.

Greene

R. L.

Schmolck

(1989). Criteria for assessing inconsistent patterns of item endorsement on the MMPI: Rationale, development, and empirical trials. Journal of Clinical Psychology, 45(2), 239–250. https://doi.org/10.1002/1097-4679(198903)45:2<239::aid-jclp2270450210>3.0.co;2-1

51.

Ostoic

J. A. R.

(2020). Algebraic analysis of multiple social networks with multiplex. Journal of Statistical Software, 92(11), 1–41. https://doi.org/10.18637/jss.v092.i11

52.

Paulhus

D. L.

(1991). Measurement and control of response bias. In Paulhus

D. L.

(Ed.), Measures of personality and social psychological attitudes (pp. 17–59). Elsevier. https://doi.org/10.1016/B978-0-12-590241-0.50006-X

53.

Pavlov

Shi

Maydeu-Olivares

Fairchild

(2021). Item desirability matching in forced-choice test construction. Personality and Individual Differences, 183, Article 111114. https://doi.org/10.1016/j.paid.2021.111114

54.

Pedersen

T. L.

(2022). patchwork: The composer of plots. https://cran.r-project.org/package=patchwork

55.

Peng

Man

Veldkamp

B. P.

Cai

. (2023). A mixture model for random responding behavior in forced-choice noncognitive assessment: Implication and application in organizational research. Organizational Research Methods, 10944281231181642. https://doi.org/10.1177/10944281231181642

56.

Pituch

K. A.

Stevens

J. P.

(2015). Applied multivariate statistics for the social sciences: Analyses with SAS and IBM’s SPSS. Routledge.

57.

Pozzebon

J. A.

Visser

B. A.

Ashton

M. C.

Lee

Goldberg

L. R.

(2010). Psychometric characteristics of a public-domain self-report measure of vocational interests: The Oregon Vocational Interest Scales. Journal of Personality Assessment, 92(2), 168–174. https://doi.org/10.1080/00223890903510431

58.

R Core Team. (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/

59.

Revelle

(2021). psych: Procedures for psychological, psychometric, and personality research. Northwestern University. https://cran.r-project.org/package=psych

60.

Rios

J. A.

Guo

Mao

Liu

O. L.

(2017). Evaluating the impact of careless responding on aggregated-scores: To filter unmotivated examinees or not? International Journal of Testing, 17(1), 74–104. https://doi.org/10.1080/15305058.2016.1231193

61.

Rosenberg

Beymer

Anderson

van Lissa

Schmidt

(2018). TidyLPA: An R package to easily carry out latent profile analysis (LPA) using open-source or commercial software. Journal of Open Source Software, 3(30), Article 978. https://doi.org/10.21105/joss.00978

62.

Schmitt

Stuits

D. M.

(1985). Factors defined by negatively keyed items: The result of careless respondents? Applied Psychological Measurement, 9(4), 367–373. https://doi.org/10.1177/014662168500900405

63.

Soto

C. J.

John

O. P.

(2017). The next Big Five Inventory (BFI-2): Developing and assessing a hierarchical model with 15 facets to enhance bandwidth, fidelity, and predictive power. Journal of Personality and Social Psychology, 113(1), 117–143. https://doi.org/10.1037/pspp0000096

64.

Stanley

(2021). apaTables: Create American Psychological Association (APA) style tables. https://cran.r-project.org/package=apaTables

65.

Teitcher

J. E. F.

Bockting

W. O.

Bauermeister

J. A.

Hoefer

C. J.

Miner

M. H.

Klitzman

R. L.

(2015). Detecting, preventing, and responding to “fraudsters” in internet research: Ethics and tradeoffs. The Journal of Law, Medicine & Ethics, 43(1), 116–133. https://doi.org/10.1111/jlme.12200

66.

Ulitzsch

Pohl

Khorramdel

Kroehne

von Davier

(2022). A response-time-based latent response mixture model for identifying and modeling careless and insufficient effort responding in survey data. Psychometrika, 87(2), 593–619. https://doi.org/10.1007/s11336-021-09817-7

67.

van Buuren

Groothuis-Oudshoorn

. (2011). Mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1–67. https://doi.org/10.18637/jss.v045.i03

68.

Viswesvaran

Ones

D. S.

(1999). Meta-analyses of Fakability estimates: Implications for personality measurement. Educational and Psychological Measurement, 59(2), 197–210. https://doi.org/10.1177/00131649921969802

69.

Walton

K. E.

Cherkasova

Roberts

R. D.

(2020). On the validity of forced choice scores derived from the Thurstonian item response theory model. Assessment, 27(4), 706–718. https://doi.org/10.1177/1073191119843585

70.

Watrin

Geiger

Spengler

Wilhelm

(2019). Forced-choice versus likert responses on an occupational big five questionnaire. Journal of Individual Differences, 40(3), 134–148. https://doi.org/10.1027/1614-0001/a000285

71.

Wetzel

Frick

(2020). Comparing the validity of trait estimates from the multidimensional forced-choice format and the rating scale format. Psychological Assessment, 32(3), 239–253. https://doi.org/10.1037/pas0000781

72.

Wetzel

Frick

Brown

(2021). Does multidimensional forced-choice prevent faking? Comparing the susceptibility of the multidimensional forced-choice format and the rating scale format to faking. Psychological Assessment, 33(2), 156–170. https://doi.org/10.1037/pas0000971

73.

Wetzel

Frick

Greiff

(2020). The multidimensional forced-choice format as an alternative for rating Scales. European Journal of Psychological Assessment, 36(4), 511–515. https://doi.org/10.1027/1015-5759/a000609

74.

Wickham

(2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag. https://ggplot2.tidyverse.org

75.

Wise

S. L.

Kong

(2005). Response time effort: A new measure of examinee motivation in computer-based tests. Applied Measurement in Education, 18(2), 163–183. https://doi.org/10.1207/s15324818ame1802_2

76.

Youden

W. J.

(1950). Index rating for diagnostic tests. Cancer, 3(1), 32–35. https://doi.org/10.1002/1097-0142(1950)3:1<32::aid-cncr2820030106>3.0.co;2-3

77.

Zhang

Sun

Drasgow

Chernyshenko

O. S.

Nye

C. D.

Stark

White

L. A.

(2020). Though forced, still valid: Psychometric equivalence of forced-choice and single-statement measures. Organizational Research Methods, 23(3), 569–590. https://doi.org/10.1177/1094428119836486

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

2.21 MB

Detecting Careless Responding in Multidimensional Forced-Choice Questionnaires

Abstract

Keywords

The Multidimensional Forced-Choice Format

Careless Responding

Careless Responding in the MFC Format

Indices for Detecting Careless Responding in MFC Questionnaires

Response Time

Self-Report Items

Instructed Response Triplets

Consistency Score

Mahalanobis Distance

Rank Order 2 Indices

Triplet Variance

Missing Values Index*

The Present Study

Research Questions and Hypotheses

Exploratory Analyses on the Criterion Validity and Reliability Estimates

Method

Sample

Measures

Analytic Strategy

Pre-Analysis

Research Question 1

Research Question 2

Research Question 3

Research Question 4

Simulation of Different Careless Responding Patterns and Proportions

Results

RQ1: Correlations Among the Indices

RQ2: Subgroups of Careless Respondents

Graphical and Statistical Model Validation

Triplet Variance

Response Time

LongOrderMax

Multinominal Logistic Regression Models

Response Time Index

Instructed Response Triplet

RQ3: Proportion of Careless Responding

RQ4: Comparison of Careless Responding Between First and Last Questionnaire

Exploratory Analyses: Impact of Removing Careless Respondents From the Sample

Simulation Study

Type of Careless Responding

Consistency Score

Triplet Variance

LongOrderMax and LongOrderAvg

SameOrder

Mahalanobis Distance

Sensitivity, Youden-Index, and AUC for a set of Cutoff Values

Discussion

Is Careless Responding the Same in Rating Scales and MFC Questionnaires?

Correlations Among the Careless Responding Indices

Types of Indices and Groups of Respondents

Careless Responding Over the Course of the Study

Recommendations for Detecting Careless Responding in the MFC Format

Limitations and Future Directions

Conclusion

Supplemental Material

sj-pdf-1-epm-10.1177_00131644231222420 – Supplemental material for Detecting Careless Responding in Multidimensional Forced-Choice Questionnaires

Footnotes

Declaration of Conflicting Interests

Funding

ORCID iDs

Data Availability Statement

Supplemental Material

Notes

References

Supplementary Material

Rank Order² Indices