Abstract
Respondent behavior in questionnaires may vary in terms of attention, effort, and consistency depending on the survey administration context and motivational conditions. This pre-registered experimental study examined whether motivational context influences response inconsistency, response times, and the role of conscientiousness in survey responding. A sample of 66 university students in Cyprus completed five psychological scales under both low-stakes and high-stakes instructions in a counterbalanced within-subjects design. To identify inconsistent respondents, two index-based methods were used: the mean absolute difference (MAD) index and Mahalanobis distance. Results showed that inconsistent responding was somewhat more frequent under low-stakes conditions, although differences were generally small and significant only for selected scales when using a lenient MAD threshold. By contrast, internal consistency reliability was slightly higher, and response times were significantly longer under high-stakes instructions, indicating greater deliberation. Conscientiousness predicted lower inconsistency only in the low-stakes condition. Overall, high-stakes instructions did not substantially reduce inconsistent responding but fostered longer response times and modest gains in reliability, suggesting enhanced behavioral engagement. Implications for survey design and data quality in psychological and educational research are discussed.
Survey questionnaires typically consist of individual questions and multi-item scales, including instruments such as psychological inventories that measure traits, beliefs, attitudes, behaviors, and emotions, as well as prior experiences and demographic and contextual characteristics. Data from these instruments help researchers examine research hypotheses, understand individual differences, and inform practice.
However, a growing body of evidence suggests that self-report surveys are vulnerable to various forms of careless or insufficient-effort responding, whereby participants provide answers that do not reflect thoughtful engagement with the item content (Arias et al., 2024; Huang et al., 2012; Meade & Craig, 2012). Responses unrelated to the item content can introduce measurement bias, undermining the validity of individual scores and potentially distorting key findings, especially when analyzing latent traits or drawing conclusions for policy and practice (Arias et al., 2024). As such, understanding and addressing careless responding has become critical for safeguarding data quality in survey-based research. A particularly common manifestation of careless responding is inconsistent responding, in which participants provide answers that do not logically align across similar or highly correlated items, creating response patterns that contradict theoretical expectations or data trends (Ward & Meade, 2023).
Mixed-worded scales, which combine positively and negatively phrased items, have long been used in survey design to minimize response biases and promote more attentive responding (Podsakoff et al., 2003). While mixed-worded scales are intended to enhance response validity, studies show that the prevalence of inconsistent responding varies widely, ranging from 1% to 41% across cross-cultural samples, educational systems, and age groups (Konstantinidou & Michaelides, 2025; Steinmann et al., 2024; Steinmann, Sánchez, et al., 2022). Even a small proportion of inconsistent responses can have a substantial impact on the psychometric properties of the measured constructs. Respondents may misinterpret or ignore item valence and produce inconsistent response patterns, resulting in measurement problems including reduced internal consistency, inflated factor structures, and overestimation of scale dimensionality (e.g., Arias et al., 2020; Podsakoff et al., 2024; Steinmann, Sánchez, et al., 2022).
Respondent Motivation and Data Quality in Low-Stakes Surveys
Survey questionnaires are widely used in national and international research across education, psychology, and other social science domains (Sturgis & Luff, 2021). They are typically administered under low-stakes conditions with minimal or no direct incentives or consequences for respondents. As a result, participants often exhibit reduced motivation (Alahmadi & DeMars, 2024) and are more prone to careless or insufficient-effort responding (Huang et al., 2012; Rios et al., 2017). A notable concern arises when a survey is low stakes for the individual respondent but high stakes for institutions that depend on the data for critical decisions about placements, rankings, policies, or accountability (e.g., Hopfenbeck et al., 2018).
A consistent finding across the literature on assessments and testing is that stakes significantly affect both motivation and performance. Wolf and Smith (1995) compared two groups of college students completing a standardized exam. Students told their scores would count toward course grades reported significantly higher motivation and outperformed those in the no-consequence condition. The effect size for self-reported motivation was substantially larger than that for performance, indicating that stakes primarily affect effort investment. Similarly, Cole et al. (2008) found that students performed significantly better on the CollegeBASE exam when it was administered under high-stakes conditions than when it was presented as low-stakes. Hoyt (2001) reported that up to 22% of college students admitted to putting little or no effort into a low-stakes general education test. Sundre and Kitsantas (2004) further found that in low-stakes contexts, motivation was a significant predictor of test performance—a relationship that disappeared under high-stakes conditions, likely due to uniformly high levels of motivation across participants.
Effort during low-stakes assessments—often measured by time spent on task—strongly predicts the extent of performance improvement when stakes are increased. Students who were less engaged in a low-stakes condition tended to show larger gains once the assessment became high stakes (Attali, 2016). Wise and DeMars (2005), in a meta-analysis of 12 empirical studies, reported that students who were offered incentives or informed that their results would carry consequences consistently outperformed peers in no-stakes conditions, with performance differences reaching up to 0.59 standard deviations. Similarly, Goodman et al. (2015) showed that both point-based incentives (e.g., extra credit) and non-point-based rewards (e.g., gift cards, certificates) can be equally effective in promoting effort, whether applied at the individual or group level. Taken together, these findings highlight how raising assessment stakes can strengthen student motivation and, in turn, improve the validity and reliability of the data collected.
While the primary focus is often on test performance, similar response behavior may occur in surveys, undermining data quality and increasing nonresponse and careless responding, thus resulting in inaccurate inferences about the measured constructs (e.g., Huang et al., 2012). Research has further highlighted not only the prevalence of insufficient effort in surveys (Hong et al., 2020; Steinmann, Sánchez, et al., 2022; Steinmann, Strietholt, & Braeken, 2022), but also its specific types (e.g., straightlining), as the impact on survey data can vary differentially (DeSimone et al., 2018). For example, Zamarro et al. (2019) analyzed large-scale international assessment data and found that students’ disengagement from the background survey, specifically when they skipped questions (item nonresponse), was strongly related to how well countries performed on the accompanying test. A meta-analysis by Loy et al. (2025) on personality tests found that validity coefficients differed between employee assessment (low-stakes) and applicant testing (high-stakes) settings, possibly due to differential faking.
Methods for Detecting Careless and Inconsistent Responding
Detection methods can be classified as either a priori or post hoc (Meade & Craig, 2012). A priori methods, such as attention-check items, timers, or mixed-worded scales, are incorporated into the design of the study to promote more attentive and conscientious responding. Post hoc methods, by contrast, rely on statistical analysis after data collection. Among these, Mahalanobis distance is widely used to flag respondents whose overall response patterns deviate from the multivariate center of the sample (Meade & Craig, 2012). The Mean Absolute Difference (MAD) method detects inconsistency between positively and negatively worded items of a scale measuring a construct, with larger values suggesting greater inconsistency (Steinmann, Sánchez, et al., 2022). Both are index-based approaches, which calculate a numerical index for each respondent reflecting the degree of deviation or inconsistency in their response patterns and apply cut-off values to flag potential careless or inconsistent responders. In addition to these techniques, more complex statistical modeling, such as factor mixture models, has been applied to classify differential response patterns by modeling latent classes within the data, though this approach requires larger sample sizes (Arias et al., 2020; Steinmann, Strietholt, & Braeken, 2022).
In addition to statistical and modeling approaches, researchers have also employed behavioral and self-report methods to detect low-effort responding. Self-report items assessing test-taking effort or motivation are often included in large-scale assessments and can help identify disengaged participants (Finn, 2015; Knekta & Eklöf, 2015). For example, Berry et al. (1992) found that 50%–60% of college students admitted to randomly answering one or more items on a standardized psychological test. However, self-report data must be interpreted cautiously, as participants may lack insight into their own behavior or may respond in socially desirable ways, especially in low-stakes contexts (Meade & Craig, 2012).
To complement these subjective measures, response time data has emerged as a reliable behavioral indicator of effort. Studies show that motivated respondents typically take more time to answer, while careless respondents tend to rush through items (De Boeck & Jeon, 2019; Wise & Kong, 2005). Hosch (2012) reported that students who scored higher on a low-stakes assessment spent significantly more time on the test. Analyzing response times can reveal low-effort behaviors such as rapid guessing or speeding, which are frequently observed in low-stakes contexts (Anghel et al., 2024; Michaelides et al., 2020). Unlike self-reports, these measures provide objective, unobtrusive evidence of cognitive engagement, enhancing the accuracy of inferences drawn from survey data (Lundgren & Eklöf, 2023; Soland et al., 2019).
Factors Influencing Inconsistent Responding
Inconsistent responding is influenced by a variety of factors including individual personality traits and response styles. In particular, individuals with lower conscientiousness are more likely to engage in careless or inconsistent responding (Bowling et al., 2016; Ward et al., 2017), especially in low-stakes contexts where external motivation is limited (Ward & Meade, 2023). Conscientious respondents tend to provide more thorough and consistent answers, regardless of the assessment conditions.
Due to social desirability bias, respondents answer items in a way that portrays them in a favorable or socially acceptable light rather than responding truthfully (Krumpal, 2013). To our knowledge, however, its role in inconsistent responding has not been empirically studied. Nor has a positive attitude toward research been examined, even though participants who personally value research may engage with questionnaires more attentively and consistently.
Other factors such as test length, test anxiety, age, and cognitive ability have also been found to affect response consistency. For instance, longer tests may increase fatigue and reduce attentiveness (Meade & Craig, 2012), while inconsistent responses tend to be more common among younger respondents and those with lower reading comprehension or cognitive abilities (Bulut & Bulut, 2022; Marsh, 1986). Gender differences have also been observed, with male students often reporting lower motivation and exhibiting faster, potentially less careful responding than female students (DeMars et al., 2013; Eklöf, 2007).
The Present Study
This study aims to experimentally investigate response patterns on mixed-worded scales administered under low-stakes and high-stakes motivational conditions. Specifically, we examined the prevalence of inconsistent responding, variations in response time, and the role of personality characteristics, particularly conscientiousness, in predicting response inconsistency.
Prior research on inconsistent responding has predominantly relied on observational secondary data, collected from young student samples under low-stakes conditions. The use of observational data prevents causal claims, while results based on young student populations may not generalize to adults or other demographic groups. Also, low-stakes settings do not necessarily reflect real-world contexts in which motivation can vary substantially (DeMars, 2007). The present study extends existing literature in three critical ways: (a) employing an experimental design to directly examine differences in response patterns, (b) explicitly comparing high- with low-stakes conditions, where motivation is expected to differ, and (c) using university rather than school student samples. By focusing on these aspects, this study offers new insights into how motivational contexts influence response behavior on surveys with mixed-worded scales, with potential implications for survey design and the interpretation of self-report data. The study is guided by three hypotheses: (H1) inconsistent responding will be more prevalent under low-stakes than under high-stakes instructions; (H2) participants will take more time to respond under high-stakes than under low-stakes instructions; and (H3) conscientiousness will be negatively associated with inconsistent responding, particularly under low-stakes conditions.
Method
Sample and Study Design
Participants were 66 students from a public university in Cyprus: 50 women and 16 men, with a mean age of 20.53 years (SD = 2.56), who received either course credit or a voucher for their research involvement. They completed five scales in Greek under both a high-stakes and a low-stakes condition in a within-participants design. To control for potential order effects, they were systematically assigned to one of two counterbalanced orders (high-stakes first, followed by low-stakes; or low-stakes first, then high-stakes). The study design was approved by the Cyprus National Bioethics Committee (EEBK EΠ 2025.01.58) and was pre-registered at OSF (https://doi.org/10.17605/OSF.IO/AFQJX) on March 11, 2025.
Procedure and Materials
After scheduling an appointment, participants entered the laboratory, received a unique identification number from the researcher, and read and signed the consent form. They were assigned to one of the two order conditions (high-low or low-high) and individually completed a questionnaire on a computer consisting of five parts:
(a) A section collecting data on personal characteristics,
(b) A section that included five personality scales administered under the high-stakes (or low-stakes) condition,
(c) A distractor task designed to provide a break and reduce carryover effects between conditions,
(d) A section that included five personality scales administered under the low-stakes (or high-stakes) condition, and
(e) A final brief section collecting data on personal characteristics and demographics.
The experimental manipulation consisted of survey instructions describing a high- and a low-stakes condition. The instructions for both conditions were adapted from research paradigms of faking on personality questionnaires, for example, McFarland and Ryan (2000). The high-stakes instruction read as follows: “Please respond to the following scales. While answering, consider that your responses are extremely important! Assume that they will be evaluated by someone making a critical decision about you, such as selection for a job position or admission to a university program. Take your time to carefully consider each question and provide accurate answers that truly represent yourself.”
The low-stakes instruction read as follows: “Please respond to the following scales without worrying about the accuracy of your choices. For example, think about how you would answer a questionnaire in which you have no interest in the questions or the subject matter. No one will judge your answers, and they will not affect any important decision concerning you.”
Primary Measures
Five scales were administered in both conditions and participants were asked to respond to the same 63 items. To ensure completeness of responses, the survey required participants to answer every question before continuing to the next page.
The 5-item Satisfaction with Life scale is designed to measure global cognitive judgments of one’s life satisfaction (Diener et al., 1985). Participants indicate how much they agree or disagree with each of the items using a 7-point scale that ranges from 7 (strongly agree) to 1 (strongly disagree). The Greek version of the scale used in this study was retrieved from Diener’s online archive, http://labs.psychology.illinois.edu/~ediener/SWLS.html, and was validated in a Cypriot sample by Anthimou et al. (2021).
The 20-item Revised UCLA Loneliness Scale (Russell et al., 1980) was developed to provide a detailed description of the experience of loneliness. Each item is rated on a 4-point scale ranging from 1 (never) to 4 (often). The scale was adapted into Greek by Anderson and Malikiosi-Loizos (1992) and includes nine reverse-coded items.
The 10-item Rosenberg Self-Esteem Scale (Rosenberg, 1965) is designed to measure global self-esteem. Responses are recorded on a 4-point Likert-type scale ranging from 1 (strongly agree) to 4 (strongly disagree). The scale was translated into Greek using a forward and backward translation method and has been utilized in Cyprus for data collection by Panayiotou and Papageorgiou (2007). The scale is balanced, consisting of five positively worded items and five negatively worded items.
The Self-Regulation scale (Schwarzer et al., 1999) assesses post-intentional self-regulation, focusing on the stage where individuals pursue their goals but encounter challenges in sustaining their actions. In these situations, it is essential to maintain attention on the task and preserve a positive emotional state. As a result, the scale’s items capture aspects of both attention regulation and emotion regulation. The scale consists of 10 items, and responses were recorded on a 4-point Likert-type scale from 1 (not at all true) to 4 (exactly true). The Greek adaptation of the scale was conducted by the first author and his research team after receiving permission and specific guidelines from the second author of the Self-Regulation scale. The scale contains seven positive and three negative items.
Need for cognition refers to an individual’s tendency to engage in and enjoy effortful cognitive endeavors (Cacioppo et al., 1984). The 18-item short version of the Need for Cognition scale was used: nine items indicating a high need for cognition and nine reverse-scored items indicating a low need for cognition. Responses are recorded on a 4-point Likert-type scale ranging from 1 (extremely uncharacteristic of me) to 4 (extremely characteristic of me). The option “uncertain” was not presented. The Greek translation of the scale was conducted by Diakidoy et al. (2015).
Covariates
IPIP Big Five Factor Marker
The 50-item International Personality Item Pool representation of Goldberg’s (1992) markers for the Big-Five factor structure consists of 10 items for each personality dimension: Extraversion, Agreeableness, Conscientiousness, Emotional Stability, and Intellect/Imagination. Participants respond to the items on a 5-point ordinal scale ranging from 1 (very inaccurate) to 5 (very accurate). The Greek translation (Vakola, n.d.) of the scale was downloaded from https://ipip.ori.org/Greek50-itemBigFiveFactorMarkers.htm. Cronbach’s alpha coefficients for the five subscales in the current study were 0.878, 0.809, 0.781, 0.839, and 0.734, respectively.
Social-Desirability Scale
The Marlowe-Crowne Social Desirability Scale (Crowne & Marlowe, 1960) is a 33-item scale that reflects the tendency of individuals to seek approval or avoid disapproval by responding in a culturally appropriate way (Crowne & Marlowe, 1960; Paulhus & Reid, 1991). The 13-item short version of the scale, as developed by Reynolds (1982), was used. Responses were binary, with participants selecting either “true” or “false.” The scale has been translated and adapted into Greek (Koutsogiorgi, 2020). In the current study, Cronbach’s alpha was very low (.549).
Personal Norm in Research Participation
This self-report scale was developed by the authors according to the norm activation model (Schwartz, 1977) to measure individuals’ personal norms for research participation. By capturing their attitudes toward engaging in research activities without reliance on external rewards or incentives, the scale reflects internalized feelings of responsibility, obligation, and commitment to contributing responsibly to scientific data collection. Cronbach’s alpha was 0.790.
Manipulation Checks
Two methods were employed to assess the extent to which participants adhered to the instructions in each condition. First, instructed response items were used as a straightforward method of detecting careless responses. The item “To ensure data quality, please choose ‘not at all true’ for this item” was included twice near the middle of one of the focal scales: once in the high-stakes and once in the low-stakes condition. Self-report items were also administered twice, at the end of each condition. The two questions were: “In the previous five scales, I was very focused on my responses” and “In the previous scales, I did not pay much attention.” These questions measured participants’ self-reported levels of attention and focus during their responses in the preceding scales on a 4-point Likert-type scale.
Data Analysis
To identify inconsistent respondents, we employed two methods to compare their prevalence across the two conditions. First, the mean absolute difference (MAD) method was applied by reverse-coding the mean score of negatively worded items and subtracting it from the mean score of positively worded items. Respondents with an absolute difference of ≥1.75 were classified as inconsistent (Steinmann, Sánchez, et al., 2022); this arbitrary threshold essentially implies a change in endorsement from the agreement to the disagreement side of the 4-point scale (or vice versa). We also considered a more lenient threshold of ≥1.00, which implies a one-unit difference in the responses to positively and negatively worded items of the ordinal scale.
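As an illustration, the MAD computation can be sketched in a few lines of Python. This is a minimal example with simulated data, not the study’s analysis code (available on OSF); the item names and column layout are assumed for demonstration.

```python
import numpy as np
import pandas as pd

def mad_index(responses: pd.DataFrame, pos_items: list, neg_items: list,
              scale_max: int = 4) -> pd.Series:
    """Mean absolute difference between positively and negatively worded items.

    Negative items are reverse-coded (x -> scale_max + 1 - x) so that a
    consistent respondent obtains similar means on both item sets.
    """
    pos_mean = responses[pos_items].mean(axis=1)
    neg_mean = (scale_max + 1 - responses[neg_items]).mean(axis=1)
    return (pos_mean - neg_mean).abs()

# Simulated responses: 66 respondents, balanced 10-item scale on 1-4 points
rng = np.random.default_rng(1)
data = pd.DataFrame(rng.integers(1, 5, size=(66, 10)),
                    columns=[f"item{i}" for i in range(1, 11)])
mad = mad_index(data,
                pos_items=[f"item{i}" for i in range(1, 6)],
                neg_items=[f"item{i}" for i in range(6, 11)])
print("strict (>= 1.75):", (mad >= 1.75).sum(), "flagged")
print("lenient (>= 1.00):", (mad >= 1.00).sum(), "flagged")
```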
Second, we used Mahalanobis distance, a statistical approach for detecting multivariate outliers by assessing how far a respondent’s data deviate from the overall sample distribution. Response time differences across the two conditions were also compared with a paired samples test. If statistical assumptions were violated, a non-parametric test was used instead of a parametric one. For hypothesis three and the prediction of inconsistent response status, a negative binomial model was employed because the outcome was a count variable. Finally, for the manipulation checks we compared each method across the two experimental conditions with paired tests. The dataset and analysis code can be found at https://osf.io/r93jg/overview?view_only=a04a24a680344512a9b23eabafc7bc21 on OSF.
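A comparable sketch for the Mahalanobis screening computes squared distances directly from the sample mean vector and covariance matrix and flags cases beyond the upper 5% chi-square critical value. The matrix and sample size here are illustrative; the study itself obtained the distances through a regression-based procedure described in the Results.

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_flags(X: np.ndarray, alpha: float = 0.05) -> np.ndarray:
    """Flag rows whose squared Mahalanobis distance exceeds the chi-square
    critical value with df equal to the number of items (columns)."""
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.pinv(np.cov(X, rowvar=False))   # pseudo-inverse guards against singularity
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)  # squared distance per respondent
    return d2 > chi2.ppf(1 - alpha, df=X.shape[1])

# Simulated 1-4 responses: 66 respondents on a 10-item scale
rng = np.random.default_rng(2)
X = rng.integers(1, 5, size=(66, 10)).astype(float)
print(mahalanobis_flags(X).sum(), "respondents flagged as multivariate outliers")
```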
Results
Participants’ Background Characteristics
Participants were 66 university students, of whom 75.8% were women. The mean age was 20.53 years (SD = 2.56) and the majority (72.7%) were Psychology students. Regarding their year of study, 47.0% were in their second year, 21.2% in their third year, 19.7% in their first year, and 12.1% were either in their final undergraduate year or enrolled in a postgraduate program.
To ensure that the counterbalance-order groups were comparable, we conducted independent samples t-tests and chi-square tests on key demographic and background variables (e.g., age, gender, year of study). These analyses confirmed that the two groups did not differ significantly, supporting the assumption of randomly equivalent groups.
Manipulation Checks
The results from the instructed response items (e.g., “To ensure data quality, please choose ‘not at all true’ for this item”) indicated that under the high-stakes condition 65 participants (98.5%) complied with the instruction, compared with 61 (92.4%) in the low-stakes condition. This manipulation check suggests that participants remained generally attentive in their responses, with a small, nonsignificant difference between the two conditions in the expected direction (McNemar’s test p = .219).
The second manipulation check involved two self-report items. After the high-stakes condition, all participants reported being at least somewhat focused on their responses and disagreed (somewhat or completely) about being inattentive. In the low-stakes condition, the corresponding percentages were 69.6% and 66.7%, suggesting that the majority self-reported being focused and attentive in completing the scales, with only about a third of the sample responding that they were not. A one-sided paired samples t-test revealed that participants reported significantly greater focus in the high-stakes condition, t(65) = 6.246, p < .001, d = 0.769, and were significantly less likely to agree with the reversed item indicating inattention, t(65) = −6.347, p < .001, d = −0.781, compared with the low-stakes condition. Wilcoxon’s non-parametric tests concurred with the results of the t-tests.
The manipulation checks yielded mixed results. While the self-report items indicated that participants perceived the high-stakes condition as more engaging by reporting greater focus and less inattention, the results were less consistent in the low-stakes condition. About one-third of participants reported lower concentration and acknowledged inattention, aligning with the expected disengagement under low-stakes conditions. However, the majority still reported being focused. Moreover, the instructed response item was correctly answered by more than nine out of 10 participants in the low-stakes condition, contrary to expectations that disengaged participants in the low-stakes condition would fail to comply.
Hypothesis Testing
To examine the prevalence of inconsistent responding (H1), we conducted a series of pairwise comparisons. After calculating the mean absolute difference (MAD) score between responses on the positively and negatively worded items of each scale, the one-sided t-test comparisons for the high and low conditions were significant only for the Rosenberg Self-Esteem scale, t(65) = −2.736, p = .004, d = −0.337; for the remaining scales, results were as follows: UCLA Loneliness scale, t(65) = 0.730, p = .234, d = 0.090; Self-Regulation scale, t(65) = −0.855, p = .198, d = −0.105; Need for Cognition scale, t(65) = −0.992, p = .162, d = −0.122. When combining all MAD scores across the four scales, separately for each condition, the mean comparison was marginally significant, t(65) = −1.714, p = .046, d = −0.211.
Participants were classified into the inconsistent respondent category if their MAD was ≥1.75, or as consistent otherwise. Very few cases were classified as inconsistent: in the high-stakes condition, three participants on just one scale each; in the low-stakes condition, five students were classified as inconsistent on at least one scale (Table 1). Because of the binary outcome variable (being inconsistent or not), a paired non-parametric test comparison was carried out for each scale, with all results being nonsignificant. The comparison of the number of occasions of inconsistent response classification under the high- versus the low-stakes conditions was also not significant (Z = −1.414, p = .157).
Table 1. Number of Scales for Which a Participant Was Classified as Inconsistent by Threshold and Condition
We also adopted a less strict criterion of MAD ≥1.00 as an indication of inconsistent behavior. As anticipated, this identified many more inconsistent respondents. Across all scales, 31 were inconsistent in the high-stakes and 37 in the low-stakes condition (Table 1). Among those participants, more tended to be classified as inconsistent in multiple scales in the low-stakes condition. The difference in the prevalence of inconsistent respondents between the two conditions was significant for the two shorter scales, the Rosenberg Self-Esteem (Z = −2.496, p = .013) and the Self-Regulation (Z = −2.714, p = .007) scales. Across all scales, the comparison of the number of occasions of inconsistent response classification under the high-stakes versus the low-stakes conditions was significant (Z = −2.495, p = .013).
Mahalanobis distance was an alternative way to compare the high-stakes and low-stakes conditions for outlying response patterns on each scale separately. In separate linear regression models, we predicted a random uniform variable from the items of a scale; Mahalanobis distances were saved, and those exceeding the critical value of the upper 5% of the chi-square distribution, with the number of items as degrees of freedom, were flagged as outliers. The results in Table 2 illustrate that the number of outliers based on Mahalanobis distances does not vary systematically across the two conditions; differences were small, with slightly more outliers sometimes in the high-stakes and other times in the low-stakes condition.
Table 2. Results of the Mahalanobis Distance and Reliability Analysis by Condition
The Cronbach’s alpha reliability coefficients for each scale by condition (Table 2) suggest that, for the four mixed-worded scales, the alphas were slightly higher in the high-stakes compared with the low-stakes condition.
To test Hypothesis 2, the distributions of the response time variables of the five scales were inspected and were found to have positive skewness less than 2. In four cases, the Kolmogorov–Smirnov tests suggested violations of normality; therefore, Wilcoxon signed rank tests were conducted to compare response times between the low-stakes and high-stakes conditions for each scale (Table 3). Participants took significantly more time to complete a scale in the high-stakes compared with the low-stakes condition, supporting the hypothesis that increased stakes are associated with longer and more deliberate responding.
Table 3. Response Time Differences Across the Two Conditions
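For readers who wish to reproduce this type of comparison, a minimal sketch with SciPy follows; the response time vectors are simulated stand-ins, and the one-sided alternative mirrors the directional hypothesis that high-stakes times are longer.

```python
import numpy as np
from scipy.stats import wilcoxon

# Simulated per-participant completion times (seconds) for one scale
rng = np.random.default_rng(3)
rt_low = rng.lognormal(mean=3.8, sigma=0.4, size=66)             # low-stakes condition
rt_high = rt_low * rng.lognormal(mean=0.15, sigma=0.2, size=66)  # tends to be longer

# One-sided Wilcoxon signed-rank test: are high-stakes times greater?
stat, p = wilcoxon(rt_high, rt_low, alternative="greater")
print(f"W = {stat:.1f}, p = {p:.4f}")
```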
To examine the third hypothesis—conscientiousness as a predictor of being classified as inconsistent—we used the dependent variable “number of times classified as inconsistent across the four mixed-worded scales” in each of the two conditions. As this is a count variable, a negative binomial regression model was employed. The conscientiousness score from the IPIP was specified as the main predictor. We considered the personal norms for research as a covariate, but social desirability was not included due to its very low reliability coefficient. The results of fit indices and predictor coefficients appear in Table 4. The overall model fit was not significant. The coefficients for conscientiousness were also nonsignificant, with the exception of the low-stakes condition when the strict criterion (1.75) for identifying inconsistent respondents was applied. The negative unstandardized coefficient implies that participants with higher conscientiousness scores are less likely to be classified as inconsistent respondents across the four mixed-worded scales completed under low-stakes instructions; this result accounts for personal norms about involvement in research, a covariate that was non-significant. With the lenient threshold of MAD ≥ 1.00, conscientiousness and personal norms were not significant predictors in either of the experimental conditions.
Table 4. Fit Indices and Unstandardized Coefficients From the Negative Binomial Models with the Number of Inconsistent Response Classifications as Outcome
*p < .05.
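The negative binomial model can be sketched as follows with statsmodels; the variable names and simulated data are illustrative only, with the count outcome bounded at four (the number of mixed-worded scales).

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated stand-ins for the study variables
rng = np.random.default_rng(4)
df = pd.DataFrame({
    "conscientiousness": rng.normal(3.5, 0.5, 66),
    "personal_norms": rng.normal(3.0, 0.6, 66),
    "n_inconsistent": np.clip(rng.poisson(0.5, 66), 0, 4),  # scales flagged (0-4)
})

# Negative binomial GLM (log link) with conscientiousness as the main
# predictor and personal norms as a covariate
X = sm.add_constant(df[["conscientiousness", "personal_norms"]])
model = sm.GLM(df["n_inconsistent"], X, family=sm.families.NegativeBinomial())
print(model.fit().summary())
```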
Discussion
A validity concern in survey research is the degree to which respondents engage thoughtfully and attentively with questionnaire items, particularly when there are no direct consequences for the responses they provide. Low motivation in such settings increases the risk of careless or insufficient-effort responding (Huang et al., 2012) which can compromise data quality and bias research findings (Rios et al., 2017). The present pre-registered study experimentally manipulated the motivational context by comparing responses provided under low- and high-stakes instructional conditions. This design allowed us to test whether increasing perceived stakes reduces inconsistent responding observed in mixed-worded psychological scales. Mixed-worded scales are often used in social science research (Podsakoff et al., 2003; Steinmann, Sánchez, et al., 2022) to mitigate acquiescence bias and improve validity, but they also provide an opportunity to detect non-content-based responding by comparing answers to positively and negatively worded items.
Previous research has suggested that no single approach can reliably identify all forms of careless and insufficient-effort responding with a recommendation to combine multiple indicators (Hong et al., 2020; Meade & Craig, 2012). In this study, we applied the mean absolute difference and Mahalanobis distance as complementary detection methods for inconsistent responding, providing a novel experimental test of whether response inconsistency differs by survey stakes.
The first hypothesis predicted that inconsistent responding would be more prevalent under low-stakes compared with high-stakes survey instructions. Using multiple detection methods, we observed slightly more instances of inconsistent responding in the low-stakes condition. However, the magnitude of this difference was small. When using the strict MAD threshold (≥ 1.75) and the Mahalanobis distance approach, the differences were not statistically significant. The more lenient MAD threshold (≥1.00) flagged a higher number of inconsistent respondents overall, and for two scales and in the overall comparison the difference between low- and high-stakes conditions reached statistical significance. These results indicate that manipulating survey stakes instructions had only a weak effect on response consistency.
We further examined internal consistency reliability to explore whether differences in response patterns were reflected at the scale level. Cronbach’s alpha coefficients were slightly higher in the high-stakes condition compared with the low-stakes condition across all mixed-worded scales. These small but consistent differences suggest that increased stakes may have promoted more deliberate responding. Although the differences were not large enough to indicate substantial shifts in measurement quality, they show that motivational context can have a subtle influence on the psychometric properties of self-report scales.
One possible explanation for the weak effect relates to the nature of our sample and the experimental setting. Prior studies on inconsistent responding have often been conducted with younger students and in observational, low-stakes contexts (e.g., Konstantinidou & Michaelides, 2025; Steinmann, Sánchez, et al., 2022). In contrast, our study implemented an explicit high-stakes condition and used adult students who participated voluntarily in a controlled laboratory setting, in most cases for course credit. Such participants are generally motivated and familiar with research participation, which is likely to promote engagement even in a low-stakes condition. Consistent with this interpretation, previous multinational studies also report relatively low levels of inconsistent responding in low-stakes conditions in some cultures (e.g., Steinmann, Sánchez, et al., 2022), suggesting a potential floor effect that may limit the detectability of differences induced by experimental manipulations of stakes.
The manipulation check analysis also aligns with this explanation. Self-report items confirmed significantly greater attentiveness and focus under high-stakes instructions; yet a substantial proportion of participants still reported being attentive in the low-stakes condition. The instructed response items demonstrated slightly lower compliance under low-stakes instructions, but the overall compliance rates were high in both conditions. Taken together, these findings suggest that most participants remained engaged and consistent, even when explicitly instructed to treat the task as low stakes.
Mahalanobis distance did not identify systematically more outliers in the low-stakes compared with the high-stakes condition. This result is consistent with recent methodological discussions on the limitations of index-based detection methods (Vogelsmeier et al., 2025). While indices such as Mahalanobis distance are attractive because they are easily computed, a central challenge lies in the freedom researchers have when defining cutoffs that separate attentive from careless respondents. Vogelsmeier et al. (2025) emphasized that numerical values such as Mahalanobis distances do not lend themselves to straightforward interpretation or generally applicable cutoffs. Classifications end up being sample-dependent, because identical percentile thresholds behave differently depending on the sample: in low-variability samples, small deviations from average patterns may be flagged as extreme, while in high-variability samples larger deviations may be overlooked. In contrast, the MAD index is specifically tailored to mixed-worded scales, and its cutoffs, although arbitrary, can be defined in relation to the response scale itself. In our study, we employed thresholds of 1.75 and 1.00, which correspond to meaningful shifts in endorsement on a 1 to 4 ordinal scale, making the method both more targeted and interpretable for detecting inconsistent responding.
Hypothesis two predicted that participants would spend more time responding under high-stakes instructions than under low-stakes instructions, reflecting greater engagement and deliberation. The findings supported this expectation: participants required significantly more time to complete the scales when responding under high-stakes instructions, indicating that the motivational manipulation successfully encouraged more deliberate and careful responding. This result aligns with research from cognitive testing showing that rapid responding, often conceptualized as “rapid guessing,” is a strong indicator of reduced effort and inattentive responding (Michaelides & Ivanova, 2022; Wise & Kong, 2005). The observed effect emerged in a context where the overall prevalence of inconsistent responding was relatively low, suggesting that even among generally motivated participants, increasing perceived stakes can further enhance engagement. This finding underscores the value of using behavioral indicators, such as response time, as a complement to inconsistent responding indices when assessing data quality.
The final hypothesis predicted that conscientiousness would be negatively associated with inconsistent responding, particularly under low-stakes conditions where external motivation is limited. The findings provided partial support for this hypothesis, in the case of the strict method of detecting inconsistent respondents only. Participants with higher conscientiousness scores were less likely to be classified as inconsistent responders when completing mixed-worded scales under low-stakes instructions, even after accounting for personal norms regarding participation in research. Conscientious individuals were more likely to maintain attentive and deliberate responding even when the task carried little consequence. As expected, the association was not found in the high-stakes condition. This finding is consistent with previous research showing that conscientiousness is linked to greater effort investment (Bowling et al., 2016; Ward et al., 2017).
Taken together, these results suggest that motivational stakes influence data quality primarily by enhancing behavioral engagement, as reflected in longer response times and slightly higher reliability, rather than by substantially reducing inconsistent responding. While the present study provides new insights into how motivational context influences response consistency on mixed-worded scales in university students by introducing a high-stakes condition, its experimental manipulation may be limited in ecological validity. Participants were given instructions to consider that their responses carried high stakes rather than actually experiencing tangible consequences. Future research could address this limitation by implementing designs where identical assessments are administered under ecologically valid high- and low-stakes contexts. For example, the same test could be used for both employment selection purposes and for research purposes, allowing direct comparisons of response behavior under authentic motivational conditions. Such designs could offer stronger external validity and further clarify the extent to which stakes influence engagement, response consistency, and overall data quality in real-world applications. Extensions could also incorporate the mode of administration as a factor (Konstantinidou & Michaelides, 2025), because response behavior in computerized versus paper-and-pencil administration may differ.
Acknowledgements
Not applicable.
Ethical Considerations
The study was approved by the Cyprus National Bioethics Committee (approval no. EEBK EΠ 2025.01.58) on February 12, 2025.
Consent to Participate
All participants provided written informed consent prior to participating in the experiment.
Consent for Publication
Not applicable.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The authors would like to acknowledge funding support to Michalis P. Michaelides by the University of Cyprus (Internal Funding Programs).
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
