Abstract
Mixed-worded questionnaire scales, containing both positively and negatively phrased items, are intended to encourage thoughtful responses, but they may elicit inconsistent responses and wording effects and can reduce reliability and validity. This study examined the prevalence and predictors of inconsistent responding to a mixed-worded scale in the 2022 Programme for International Student Assessment (PISA) across seven countries. Using factor mixture analysis, we found a low but variable rate of inconsistency (2–8%) across countries. Separate confirmatory factor analysis models fitted to the effective samples and to filtered samples of consistent respondents showed improved fit and reliability indices for the filtered samples in all countries, suggesting that inconsistent respondents exhibited unexpected response patterns. In logistic regression models, inconsistent responding was associated with lower PISA reading scores and often with lower self-reported effort, suggesting that inconsistent responses may stem from lower reading ability and/or carelessness. The results contribute to ongoing discussions about the use of mixed wording in questionnaires and highlight the need to account for respondent behavior when analyzing survey data.
Large-scale assessment studies comprise achievement tests and contextual questionnaires to assess students’ learning conditions and outcomes. The questionnaires consist of individual questions and multi-item scales measuring students’ demographics, learning environments, past experiences, beliefs, attitudes, feelings, and more. To ensure the quality of self-reported data, techniques such as including neutral items, ensuring respondent anonymity, and using mixed-worded scales have conventionally been employed to mitigate response biases, promote more attentive responding, and enhance construct validity (Benson & Hocevar, 1985; Nunnally, 1967).
“I make friends easily at school” and “I feel lonely at school” are two items of opposite valence administered within the PISA Sense of Belonging scale, using the same response scale. To make a consistent statement, respondents have to either agree with “I make friends easily at school” and disagree with “I feel lonely at school”, or vice versa. Agreeing or disagreeing with both items is, by contrast, an inconsistent statement. Despite established recommendations to incorporate mixed-worded items in scale construction, there is scarce empirical evidence supporting the use of positively and negatively worded items within a scale to measure an underlying construct. Baumgartner and Steenkamp (2001) reported a reduction in variance due to acquiescence in balanced scales with equal numbers of positively and negatively keyed items. A recent experimental study showed that respondents dedicate more time to, and revisit more often, survey items that are negatively worded (Koutsogiorgi & Michaelides, 2022); this may indicate higher attentiveness to those items but does not demonstrate a reduction in response bias. Moreover, mixed wording has been shown to inadvertently introduce measurement problems, including reduced internal consistency, inflated factor structures, and overestimation of scale dimensionality (e.g., Arias et al., 2020; Podsakoff et al., 2024; Steinmann, Sánchez, et al., 2022).
In surveys with mixed-worded scales, students may not answer consistently across all items (Steinmann, Strietholt, & Braeken, 2022) because they sometimes respond to positively and negatively worded items as if they were phrased in the same valence. Such inconsistent responses raise concerns regarding the reliability and validity of results (e.g., Chen et al., 2024; Steinmann, Sánchez, et al., 2022). Factor analyses of mixed-worded scales often result in artifactual or “method” factors beyond the intended substantive factors (e.g., Marsh et al., 2010; Michaelides et al., 2017) and thus jeopardize the scale’s construct validity. Consequently, previous studies highlighted the importance of identifying inconsistent responding and its impact on psychometric scale properties, factorial structures, and the nomological network of mixed-worded scales (e.g., Bulut & Bulut, 2022; Steedle et al., 2019; Steinmann, Sánchez, et al., 2022).
Different approaches have been suggested for detecting inconsistent responses to mixed-worded scales (e.g., Bolt et al., 2020; Hong et al., 2020). Two prevalent methods are (a) the mean absolute difference method (Hong et al., 2020; Steedle et al., 2019), which measures the alignment between responses to positively and negatively worded items using threshold values, and (b) the constrained factor mixture analysis (FMA) method (Steinmann, Strietholt, & Braeken, 2022), which stems from factor analysis (e.g., Lubke & Muthén, 2005; Masyn et al., 2010). The FMA has the advantage of modeling the latent factor structure of the scale under investigation while simultaneously classifying participants into groups that reflect the heterogeneity in theoretically specified response patterns. Although computationally more complex, FMA avoids the arbitrariness of thresholds based on observed mean score differences.
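To make the threshold-based approach concrete, the following is a minimal sketch of the mean absolute difference idea under illustrative assumptions: the item names, the 4-point response scale, and the flagging threshold are hypothetical, not the exact operationalization used in the cited studies.

```python
# Minimal sketch of the mean absolute difference method for flagging
# inconsistent respondents (cf. Hong et al., 2020; Steedle et al., 2019).
# Item names, the 4-point response scale, and the threshold of 1.5 are
# illustrative assumptions, not values taken from the cited studies.
import pandas as pd

def flag_inconsistent(df, pos_items, neg_items, max_cat=4, threshold=1.5):
    """Flag respondents whose positive-item mean and reverse-coded
    negative-item mean diverge by more than `threshold` scale points."""
    pos_mean = df[pos_items].mean(axis=1)
    # Reverse-code negative items so that, for a consistent respondent,
    # both means express the attitude in the same direction.
    neg_rev_mean = (max_cat + 1 - df[neg_items]).mean(axis=1)
    return (pos_mean - neg_rev_mean).abs() > threshold

# Hypothetical responses (1 = Strongly Agree ... 4 = Strongly Disagree):
data = pd.DataFrame({
    "belong_pos1": [1, 1, 4], "belong_pos2": [2, 1, 4],
    "belong_neg1": [4, 1, 4], "belong_neg2": [3, 2, 3],
})
print(flag_inconsistent(data,
                        pos_items=["belong_pos1", "belong_pos2"],
                        neg_items=["belong_neg1", "belong_neg2"]))
# Respondent 0 answers consistently; respondents 1 and 2 agree (or
# disagree) with both item wordings and are flagged.
```

The FMA method described in the Statistical Analysis section replaces this fixed cut-off with model-based class membership.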
The prevalence of inconsistent responding, identified by different empirical methods, ranges from very low (e.g., Hong et al., 2020) to substantial (e.g., Swain et al., 2008) across studies. This variation is more salient in international large-scale assessments conducted on different topics and across many countries. Recent studies highlighted the large international variation in inconsistent responding, ranging from 2 to 36% across educational systems (Steinmann, Sánchez, et al., 2022) and from 1 to 21% across countries and grades (Steinmann et al., 2024).
What Causes Inconsistent Responding to Mixed-Worded Scales?
There are different explanations for why some respondents might respond inconsistently to mixed-worded scales. One hypothesis is that they lack the reading and/or cognitive skills needed to answer consistently. Some respondents might struggle to fully understand the item wording and response categories, and/or to adjust their responses in accordance with the opposite item wording, for example, by disagreeing with negative statements to express a positive attitude (e.g., Alessandri et al., 2015; Gnambs & Schroeders, 2020; Marsh, 1986; Swain et al., 2008).
An alternative hypothesis, originally proposed by Schmitt and Stults (1985), attributes inconsistent responding to carelessness and inattention. When individuals put insufficient effort into completing a survey, whether by providing random responses to items, selecting the same response across a series of items irrespective of their content (straight-lining), or reading only some items of a scale while responding similarly to all, one would also expect inconsistent patterns of responding to mixed-worded scales. Furthermore, it is possible that respondents both lack cognitive and/or reading skills and are careless.
Importantly, however, the two explanations for inconsistent responding behavior, the lack-of-skills and the carelessness hypotheses, have very different implications. If carelessness were the (only) explanation for inconsistent responding, including mixed-worded scales in questionnaires would allow analysts to detect and remove inconsistent respondents to obtain “clean” datasets that contain only compliant respondents (Hong et al., 2020; Patton et al., 2019). If a lack of skills were the cause of inconsistent responding, at least in part, including mixed-worded scales in questionnaires would make it more difficult for some respondents to answer properly. This would conflict with the guiding design principle that questionnaires should be as easy and straightforward to fill out as possible (Schulz & Carstens, 2020). To decide whether to use mixed-worded scales, it is therefore important to determine the factors that cause inconsistent responses.
Moreover, inconsistent responding may also arise from response styles, such as acquiescence, the tendency to agree with items regardless of content (Bentler et al., 1971; Buchholz, 2022). Personality traits, particularly conscientiousness, have also been shown to be correlated with response consistency: individuals low in conscientiousness are more likely to respond carelessly, especially in low-stakes assessments (Ward & Meade, 2023). However, an argument can be made that lower self-reported conscientiousness might be a proxy for careless responding (Chen et al., 2024; Steinmann, Strietholt, & Braeken, 2022).
Individual Characteristics Associated With Inconsistent Responding
One way to disentangle the factors contributing to inconsistent responding is to compare the characteristics of consistent and inconsistent respondents. A number of studies showed that inconsistent responding was associated with low cognitive ability or reading comprehension test scores (e.g., Alessandri et al., 2015; Bulut & Bulut, 2022; Gnambs & Schroeders, 2020; Marsh, 1986; Steedle et al., 2019; Steinmann, Strietholt, & Braeken, 2022; Swain et al., 2008). Marsh (1986) even referred to inconsistent responding as a cognitive-developmental phenomenon, as young children with poorer reading ability are more likely than older children to respond inconsistently to mixed-worded scale items. An alternative interpretation of this association is that careless respondents to mixed-worded questionnaire scales might also answer tests carelessly and thus score lower. Therefore, the association between inconsistent responding and lower ability scores can be explained by both hypotheses. However, experimental studies with university students suggest that negative wording requires more resources for comprehending an item: it was associated with longer response times, more revisits, and more item-viewing time (Koutsogiorgi & Michaelides, 2022, 2023). These findings support the lack-of-skills hypothesis.
Several demographic variables have been examined in relation to inconsistent responding. Inconsistent respondents were more prevalent among children than adolescents (Konstantinidou & Michaelides, 2025; Marsh, 1986; Steinmann, Strietholt, & Braeken, 2022; Steinmann et al., 2024), possibly due to lower cognitive or reading skills. Furthermore, males were overrepresented among inconsistent respondents in one study (Steinmann, Strietholt, & Braeken, 2022). This gender gap can be explained by both hypotheses. According to the lack-of-skills hypothesis, males might be more likely to respond inconsistently because they typically have lower reading abilities (Mullis et al., 2017); according to the carelessness hypothesis, they might be more likely than females to respond inconsistently because they are often found to have lower conscientiousness levels (De Bolle et al., 2015) and to exhibit less effort in completing surveys (Anaya & Zamarro, 2024).
Compared to students from families with higher socioeconomic status, disadvantaged students showed higher rates of survey disengagement, indicated by non-response behavior and a lack of internal consistency in responding to scale items, in most countries participating in PISA 2018 (Buchholz et al., 2022). Conflicting results come from studies examining the effect of socioeconomic status on test-taking effort in the PISA cognitive test: a positive relationship between the two variables was found when student performance was not included in the model (Azzolini et al., 2019; Rios & Soland, 2022), whereas the association became non-significant or negative when achievement scores were incorporated in a multilevel regression model (Azzolini et al., 2019; Ivanova, 2024). A weak positive association between the two variables was observed in Dutch low-stakes tests when controlling for past achievement (Borghans et al., 2024).
Some variables can help to directly investigate the carelessness hypothesis. Self-reported survey effort can help identify disengaged students (Zamarro et al., 2019) who read the scales superficially, do not notice the mixed item wording, and thus respond inconsistently (e.g., Schmitt & Stults, 1985). Bulut (2021) reported that during the questionnaire session in PISA 2015, 40% of students were disengaged, as indicated by rapid responding across multiple scales. This points to the importance of screening for effort in large-scale surveys, so that low-effort respondents can be filtered out to increase the validity of the data. Response time indicators have been used extensively to understand students’ behavior (e.g., via very rapid responses) in low-stakes assessments (e.g., Anghel et al., 2024; Michaelides et al., 2020; Wise, 2017), but less so in surveys (e.g., Lundgren & Eklöf, 2023; Soland et al., 2019). A relationship between response time and inconsistent responses on mixed-worded scales was observed by Soland et al. (2019): when students gave very rapid responses on a scale, they tended to exhibit inconsistent responding behavior more often.
Aim of the Study
The present study aimed to identify inconsistent respondents in the mixed-worded Sense of Belonging scale in PISA 2022 across seven countries and to compare their characteristics with those of the consistent respondents. Two hypotheses were tested:
Hypothesis 1: Earlier studies with students found prevalence rates ranging from as low as 1% up to 36%, with higher estimates in samples of young children. Because the PISA samples consist of adolescents, we predicted that the prevalence of inconsistent respondents would be at the lower end of that range and variable across countries.
Hypothesis 2: Inconsistent respondents will more often be male, report lower effort put into completing the survey, complete the scale in less time, and have lower reading achievement scores. Furthermore, we explored socioeconomic status as a predictor of inconsistent responding.
Although there is some previous evidence regarding the prevalence of inconsistent responding and its individual correlates, we extend the existing body of research by considering not only background characteristics and reading ability as predictors of inconsistent responding, but also students’ self-reported questionnaire response effort and the time taken to respond to the scale. PISA data allow for cross-cultural comparisons based on representative samples of student populations. Among the scales administered in the 2022 questionnaire, the Sense of Belonging scale was the only one that was sufficiently unidimensional and mixed-worded, with multiple positively and negatively phrased items.
Methods
Sample
Initial and Effective Student Sample Sizes and Demographics by Country
Note. Exclusions resulted from having invalid, valid skip, missing, or not applicable responses on all items of the scale. For the effective sample, participants who (a) gave no response to administered items or (b) responded to less than two positively or less than two negatively worded items on the scale were further excluded. ESCS = index of economic, social, and cultural status.
Measures
Mixed-Worded Scale
Item Wording and Item Names of the Sense of Belonging Scale
Note. For each item of the Sense of Belonging scale, the response categories were 1 = Strongly Agree, 2 = Agree, 3 = Disagree, 4 = Strongly Disagree. Positively worded items are indicated by “+” and negatively worded items by “−”.
The questionnaire was administered after 2 hours of cognitive tests. To minimize the burden on students while including as many scales and items as possible (OECD, 2024), a within-construct matrix sampling design was applied, in which each respondent received a subset of items from each scale. Specifically, respondents were randomly administered only five of the six items of the Sense of Belonging scale. Since this study aimed to identify inconsistent respondents based on their responses to mixed-worded items, we only included students who, under this design, received and responded to at least two positively and two negatively worded items of the scale; an illustrative filter is sketched below.
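As an illustration of this inclusion rule, the following minimal sketch filters an effective sample. The item column names are hypothetical placeholders, not PISA’s actual variable names.

```python
# Illustrative sketch of the inclusion rule: retain only students with
# valid responses to at least two positively and two negatively worded
# items of the scale. Column names are hypothetical placeholders, not
# PISA's actual variable names.
import pandas as pd

POS_ITEMS = ["belong_pos1", "belong_pos2", "belong_pos3"]
NEG_ITEMS = ["belong_neg1", "belong_neg2", "belong_neg3"]

def effective_sample(responses: pd.DataFrame) -> pd.DataFrame:
    """Items not administered or not answered are assumed to be NaN."""
    n_pos = responses[POS_ITEMS].notna().sum(axis=1)
    n_neg = responses[NEG_ITEMS].notna().sum(axis=1)
    return responses[(n_pos >= 2) & (n_neg >= 2)]
```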
Covariates of Inconsistent Responding
To investigate the association between inconsistent responding and student characteristics within countries, we included five student-level variables: (1) gender (1 = female, 2 = male); (2) ESCS, a standardized composite index of economic, social, and cultural status calculated from three family background measures (highest parental education, highest parental occupation, and home possessions), with a mean of 0 and a standard deviation of 1; (3) the self-reported effort students put into giving accurate answers on the context questionnaire (response to “How much effort did you put into giving accurate answers?” on a scale of 1 to 10); (4) the total time used to answer the items on the scale (in seconds); and (5) reading achievement scores (10 plausible values, scaled to an approximately normal distribution with an international average of about 500 and a standard deviation of about 100 points).
Statistical Analysis
We applied the constrained factor mixture analysis (FMA) developed by Steinmann, Strietholt, and Braeken (2022) to classify respondents as consistent or inconsistent based on their responses to the mixed-worded items of the scale. This model determines respondents’ most likely membership in one of two latent classes, consistent and inconsistent respondents, who respond to items loading on one underlying factor. In the consistent class, the factor loadings are constrained to be of opposite directionality for positively and negatively worded items; in the inconsistent class, the loadings are constrained to be of the same directionality, reflecting that inconsistent respondents either agree or disagree with both positively and negatively worded items. We estimated the FMA models for the mixed-worded scale and for each country with Mplus 8.5 (Muthén & Muthén, 1998-2017). The estimation method was robust maximum likelihood (MLR) with adjustments for sampling (sampling weight W_FSTUWT) and students’ clustering in schools. A posterior probability of belonging to each class was estimated for each student, and class membership was determined by the highest posterior probability.
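The classification step itself is straightforward once the model is estimated. The sketch below illustrates it under assumptions: it presumes the posterior class probabilities have been exported from Mplus (e.g., via the SAVEDATA option SAVE = CPROBABILITIES) to a file with the hypothetical column names used here.

```python
# Sketch of the classification step: assign each student to the class
# with the highest posterior probability and compute the weighted
# prevalence of inconsistent respondents. Assumes the posteriors were
# exported from Mplus (SAVEDATA: SAVE = CPROBABILITIES) into a file
# with these hypothetical column names.
import pandas as pd

posteriors = pd.read_csv("fma_cprobs.csv")  # hypothetical export file

# Most likely class membership = highest posterior probability.
cols = ["cprob_consistent", "cprob_inconsistent"]
posteriors["inconsistent"] = (
    posteriors[cols].idxmax(axis=1) == "cprob_inconsistent"
)

# Prevalence weighted by the final student weight W_FSTUWT.
w = posteriors["W_FSTUWT"]
prevalence = (posteriors["inconsistent"] * w).sum() / w.sum()
print(f"Weighted prevalence of inconsistent respondents: {prevalence:.1%}")
```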
To study which students were more likely to respond inconsistently to the mixed-worded questionnaire scale, binary logistic regression analyses were conducted for each country separately. Specifically, we ran student-level, multivariate logistic regression models to investigate the associations between inconsistent responding (0 = consistent response, 1 = inconsistent response) and gender, ESCS, self-reported effort, time spent on the scale, and reading scores. This analysis was conducted in the IDB Analyzer (IEA, 2022), accounting for the final student weights and replicate sampling weights, and applying Rubin’s (1987) rules to the repeated analyses across the 10 plausible values of reading performance. Variance estimates for the parameters of interest were calculated with multiple imputation methodology.
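For readers unfamiliar with the plausible-value workflow, the following minimal sketch shows how a coefficient estimated separately for each of the 10 plausible values is pooled under Rubin’s (1987) rules. The numeric inputs are invented for illustration, and the sketch omits the replicate-weight (BRR) component of the sampling variance that the IDB Analyzer additionally computes.

```python
# Minimal sketch of pooling a logistic regression coefficient across the
# 10 plausible values of reading achievement with Rubin's (1987) rules.
# The numbers below are invented for illustration; the sketch omits the
# replicate-weight (BRR) sampling-variance machinery of the IDB Analyzer.
import numpy as np

def pool_rubin(estimates, variances):
    """One estimate and sampling variance per plausible-value analysis."""
    m = len(estimates)
    pooled = np.mean(estimates)              # pooled point estimate
    within = np.mean(variances)              # average sampling variance
    between = np.var(estimates, ddof=1)      # variance across the PVs
    total = within + (1 + 1 / m) * between   # Rubin's total variance
    return pooled, np.sqrt(total)

# e.g., 10 reading-score coefficients from 10 logistic regressions:
betas = [-0.0060, -0.0058, -0.0061, -0.0059, -0.0062,
         -0.0057, -0.0060, -0.0061, -0.0058, -0.0059]
est, se = pool_rubin(betas, [0.0008**2] * 10)
print(f"pooled beta = {est:.5f}, pooled SE = {se:.5f}")
```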
Results
Prevalence of Inconsistent Respondents Across Countries
Prevalence of Inconsistent Respondents on the Sense of Belonging Scale per Country (%)
Note. Results weighted by the student final weight (W_FSTUWT) which is appropriate for country-level analysis.
Approximate Fit and Reliability Indices for the Effective and Consistent Respondent Samples by Country
Note. E = effective sample, C = sample of consistent respondents.
Correlates of Inconsistent Responding
Results of Logistic Regressions of Inconsistent Responding on Student Predictors by Country
Note. Coefficients with p < .05 in bold. Logistic regressions were conducted in IDB Analyzer accounting for sampling weights and the 10 reading plausible values. Senate weights were used in the total sample analysis.
Females were more likely to be identified as inconsistent respondents in Brazil, Saudi Arabia, Singapore, and in the total sample, while in the other countries the association was not statistically significant. ESCS was a significant, negative predictor of inconsistent responding only in the USA after controlling for all other variables.
The self-reported effort put into answering the questionnaire was significantly and negatively associated with inconsistent responding on the Sense of Belonging scale in the overall sample and in three of the country samples. The time spent on the scale was a statistically significant negative predictor of inconsistent responding in the Saudi Arabia sample only.
Reading achievement was the student-level covariate that systematically had a significant, negative association with inconsistent responding across all countries. Better readers were less likely to give inconsistent responses to the scale. The effect sizes appear small but must be interpreted in light of the scaling of the reading scores, which have an international standard deviation of about 100. For example, in the total sample, a 100-point increase on the PISA reading achievement scale was associated with 45% lower odds of being classified as an inconsistent respondent, adjusting for the other variables.
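As a worked check of this interpretation, the snippet below converts between a per-point logistic coefficient and the odds ratio for a 100-point score difference; the coefficient is back-calculated from the 45% figure reported above rather than taken from the tables.

```python
# Worked check of the odds-ratio interpretation above: rescaling a
# per-point logistic coefficient to a 100-point change in reading score.
# The coefficient is back-calculated from the reported 45% figure, not
# taken from the article's tables.
import math

odds_ratio_100 = 0.55                       # 45% lower odds per 100 points
beta_per_point = math.log(odds_ratio_100) / 100
print(f"implied coefficient per score point: {beta_per_point:.5f}")

# And back again: from the per-point coefficient to the 100-point effect.
print(f"OR per 100 points: {math.exp(100 * beta_per_point):.2f}")   # 0.55
print(f"odds reduction: {1 - math.exp(100 * beta_per_point):.0%}")  # 45%
```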
Discussion
This study investigated the prevalence and characteristics of inconsistent respondents to the mixed-worded scale of Sense of Belonging in PISA 2022 across seven heterogeneous countries. Specifically, it examined two hypotheses about the prevalence and the variation of prevalence of inconsistent responding across countries, and about student characteristics associated with this response behavior.
In line with our hypotheses and previous research (Steinmann, Sánchez, et al., 2022; Steinmann et al., 2024), we found small percentages of inconsistent respondents, with some variation among the seven countries. Saudi Arabia and Brazil had relatively higher percentages of inconsistent respondents, despite having the highest average time spent on completing the Sense of Belonging scale and moderate self-reported effort on the questionnaire; this can arguably be attributed to both countries’ lower country-average achievement scores in PISA (Steinmann, Sánchez, et al., 2022). In previous research on cognitive tests, rapid guessing, a form of careless responding, was also observed in countries with lower average achievement scores (Ivanova, 2024; Michaelides & Ivanova, 2022). However, Singapore, a top performer in PISA 2022, did not have the lowest percentage of inconsistent respondents in our study. The reasons for inconsistent responding can differ between contexts, beyond different levels of student achievement. For example, negative wording might be more common or easier to understand in some languages than in others.
For the cognitive and demographic predictors of inconsistent responding, the results mostly confirmed the lack-of-skills hypothesis. In the multivariate models, low reading performance was a more robust predictor of inconsistent responding than the indicators of carelessness (low self-reported effort put into answering the questionnaire and a very fast response time to the scale items) and the demographic variables (socioeconomic index, gender). A low PISA reading score systematically emerged as a significant correlate of inconsistent responding in all country samples. The systematic association between low cognitive test scores and inconsistent responding aligns with previous research (e.g., Bulut & Bulut, 2022; Chen et al., 2024; Gnambs & Schroeders, 2020; Marsh, 1986; Michaelides, 2019; Steedle et al., 2019; Steinmann, Sánchez, et al., 2022; Steinmann et al., 2024) and with the notion that processing mixed-worded items is cognitively demanding (Baumgartner et al., 2018; Koutsogiorgi & Michaelides, 2022). However, it is conceivable that careless respondents give inconsistent responses to mixed-worded questionnaire items and also score low on the tests; the carelessness hypothesis can therefore also explain this finding.
Low self-reported effort was significantly associated with inconsistent responding in three of the seven countries and in the total sample model. In contrast, a faster response time was significantly associated with inconsistent responding in only one of the country models. A possible explanation is that students’ self-reported effort is a more general evaluation that may be influenced by their behavior in certain parts of the survey, whereas behavioral indicators of engagement, such as response time, may become more relevant later in a questionnaire. This may be the case if effort diminishes toward the end of an assessment and respondents become less attentive or careful, as has been demonstrated for achievement tests (Debeer et al., 2014; Nagy et al., 2019; Pools & Monseur, 2021). These are potentially interesting avenues for future research, since this is the first study to examine these effort and time variables in relation to inconsistent responding.
The ESCS coefficients for inconsistent responding were not significant in most samples, possibly due to the inclusion of reading performance in the model, in agreement with previous non-significant results for ESCS and rapid responding on cognitive test items in multivariate models (Azzolini et al., 2019; Ivanova, 2024). Only in the USA sample was a higher socioeconomic level associated with less inconsistent responding. Contrary to our hypothesis that males are more likely to be flagged as inconsistent (Steinmann et al., 2024), the effect of gender on response behavior was not significant in most samples, suggesting no systematic differences in inconsistent responding between males and females. In three countries and in the total sample, however, females were more likely to be classified as inconsistent than males. This unexpected finding may result from including other covariates in the model on which gender differences favoring females have been reported, such as effort and time, and especially reading performance, where females tend to perform higher than males (Mullis et al., 2017).
Limitations
The current study selected seven countries participating in the computerized PISA 2022 survey. Even though PISA provides extensive international data and administers comprehensive questionnaires with many attitude scales, it follows specific administration procedures and targets a population of 15-year-old students. We acknowledge that the results may be influenced by these administration procedures or the age of the students. Furthermore, the seven country samples were selected for their diversity in mean performance, geographic location, language, culture, and socioeconomic level. Results may have been different had other countries been selected, considering that there are further differences in testing cultures, with higher or lower emphasis on standardized testing programs. To enhance the generalizability of the findings, future research may replicate the results using different survey data, populations, a larger sample of nations, as well as alternative statistical approaches to classifying inconsistent respondents (e.g., Arias et al., 2020, 2024; Kam & Cheung, 2024; Konstantinidou & Michaelides, 2025), which might lead to somewhat different classifications (García-Batista et al., 2021) and capture response patterns in the “careless response” classes beyond inconsistent responses.
We were constrained by the availability of non-missing data and of mixed-worded unidimensional scales in the PISA database. Due to the PISA matrix sampling design, under which only five items were randomly administered to each student, the FMA was conducted on a small number of items. We selected examinees who responded to at least two positively and two negatively worded items because we considered this the minimum number of data points for a reliable classification. Future research should consider using scales with more positively and negatively worded items and with other response scale anchors, to reduce sensitivity to specific item characteristics and to improve the robustness of the classification.
Answering questions about the actual causes of inconsistent responding to mixed-worded scales would require other types of data, such as experimental data, rather than the cross-sectional data and self-reported measures used here. Thus, we emphasize again that we cannot disentangle many of the assumed mechanisms with the data at hand. The finding that inconsistent respondents scored lower on average on the reading test can, for example, reflect poor readers struggling with the mixed wording (i.e., the lack-of-skills hypothesis), or unengaged, careless respondents both giving inconsistent responses and scoring low on the test. However, the fact that inconsistent responding did not systematically coincide with rapid responding (as an indicator of carelessness) seems to weaken the carelessness hypothesis.
Implications
Studies like PISA include questionnaires that are low-stakes and often long, and the validity of responses is a concern when analyzing data from such studies. As in previous studies, we found a rather low prevalence of inconsistent respondents to mixed-worded scales, but with variability across countries. As a result, the validity of scale score comparisons across countries in multi-national surveys may be jeopardized if some country samples are more affected by inconsistent responding than others.
While the primary motivation for retaining scales with positively and negatively worded items is to promote thoughtful responding and to make careless respondents detectable, the psychometric consequences of using mixed-worded scales raise validity and fairness concerns, such as contaminated factor structures and suboptimal reliability indices. Even though relatively few responses were flagged as inconsistent, their removal resulted in better model fit for the anticipated unidimensional structure of the scale. Reliability coefficients also improved after these exclusions.
The question of whether to continue using mixed-worded scales depends on the assumed cause of inconsistent responding. If all inconsistent respondents were inattentive and careless, their responses could be considered invalid, and researchers would be able to detect and remove them from analyses, which would justify the inclusion of mixed-worded scales. Mixed wording would thus function as a built-in data quality check for analysis purposes when aggregate results are of interest and individual measurements are not required for all participants. If, however, (part of) the inconsistent responding is attributable to a lack of the skills necessary to detect and handle the mixed wording appropriately, the conclusion would be to reconsider the use of mixed-worded scales; in such cases, researchers may opt for positively worded items only (cf. Greenberger et al., 2003) to make sure that all respondents manage to give valid responses. This recommendation is particularly relevant in large-scale, low-stakes surveys and in studies that include populations with lower levels of reading proficiency. While more research on the causes of inconsistent responding is necessary, by considering multiple predictors, the present study provides new and important evidence that speaks to both hypotheses, carelessness as well as lack of skills. If carelessness were the only explanation for inconsistent responding, we would have expected to find systematically faster response times to the mixed-worded scale and systematically lower self-reported effort among the inconsistent respondents; neither pattern held consistently across the country samples. Considering experimental findings which show more inconsistent responding in low-stakes contexts (Michaelides & Konstantinidou, 2026), we recommend that international large-scale assessments reconsider the use of mixed-worded scales in their questionnaires.
Acknowledgements
We thank Valentina Ierotheou for research support during the early stages of this project. Minor language editing was performed using OpenAI’s ChatGPT. The authors take full responsibility for the content of the manuscript.
Ethical Considerations
This study is based on secondary analysis of publicly available, anonymized data.
Funding
The authors would like to acknowledge funding support to Michalis Michaelides by the University of Cyprus (Internal Funding Program).
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
