
Abstract

The Item Wording Effect (IWE) in psychological testing describes how individuals respond differently to positively and negatively worded items. Previous IWE research faced challenges due to measures varying beyond item valence. This study aimed to address this problem by developing an inventory, the Positive and Negative Descriptor Inventory (PANDI), with items varying solely on valence. Semantic framing was manipulated to examine which factor (valence vs. framing) contributed more to the IWE. Using an online survey on Mechanical Turk, 336 Canadian participants responded to PANDI items in different experimental conditions. Results indicated that item valence had a bigger impact on the IWE than semantic framing. PANDI-Good items in the Affirming Condition exhibited lower reliability but higher means and response variance than other groups, emphasizing the significant difference in how individuals interpret positive and negating inventory items. This study recommends using negatively worded items sparingly, and not using negating items at all.

Keywords

Item wording effect, psychometric, systematic method variance, semantic differential

Introduction

In the development of psychological inventories, equal numbers of positively worded items (e.g., I am good) and negatively worded items (e.g., I am bad) are used to minimize the effects of response biases such as acquiescence and naysaying (DeVellis, 2003), as well as to slow completion times: negative words are processed more slowly than positive words (Wason, 1959), giving responders additional time to think about their answers (Podsakoff et al., 2003). After negatively worded items are reverse scored, their scores should align closely with those of positively worded items. It is presumed that responders answer positively and (reverse scored) negatively worded items similarly because the items measure the same hypothetical construct (Nunnally, 1978).

An unintended consequence of this best-practices principle is the discovery that responders answer positively and negatively worded items differently; that is, the presumption of item response similarity is false. In fact, studies show that positively and negatively worded items yield differences in mean scale scores (positive items produce higher means than negative items; Schriesheim & Hill, 1981; Weems et al., 2003), factor structure (positive and negative items often load on unique factors, which changes the impression of single-dimension constructs into multidimensional ones; Bulut & Bulut, 2022; Marsh, 1996), reliability (positive items often yield greater Cronbach’s alphas than negative items; Barnette, 2000; Eys et al., 2007), and structural equation models (positively worded items produce better fitting models than negatively worded items; Gnambs & Schroeders, 2020; Tomas & Oliver, 1999). These results have been found over decades of study, in student and community samples, using a variety of inventories, and across diverse cultures: they are robust. We refer to these differences collectively as the Item Wording Effect (IWE).

Despite decades of research, the exact nature of the IWE remains unknown. Various studies have found diverse empirical results leading to a variety of distinct conclusions (Kam, 2018). The cause of the IWE is largely thought to be multifactorial; however, most attribute it to inventory development decisions or methodological style (Horan et al., 2003). Others have found that the IWE correlates significantly with social desirability (Rauch et al., 2007), though these results could not be replicated in later studies (DiStefano & Motl, 2006; Kam, 2018). Inventory developers who incorporate positively and negatively worded items in their scales encounter these problems, whereas developers who opt for all positively or all negatively worded items avoid the IWE. Some researchers explain that the IWE occurs because responders have greater difficulty answering negatively worded items, which have greater interpretive complexity than positive items (Wason, 1959), whereas others point to the deleterious effect that emotion-laden negative words have on cognitive processing and categorization (Isen & Daubman, 1984).

Research has also identified trait characteristics or substantive issues related to the IWE, making it more than just an artifact of methodological style. Much of the work connecting personality and the IWE has focused on Rosenberg’s self-esteem scale. The difference in responses to its positively and negatively worded items has caused some to question whether the scale is a two-dimensional or a unidimensional measure (Tomas & Oliver, 1999). Similar doubts have been raised about measures of optimism (Marshall et al., 1992; Plomin et al., 1992), anxiety (Vagg et al., 1980; Vautier et al., 2004), psychological wellbeing (Hystad & Johnsen, 2020), and affect (Watson et al., 1988). The strongest case to be made for a substantive connection with the IWE is intelligence. Research shows the IWE is most discernible among responders with cognitive-developmental deficits and disabilities. Across studies, researchers find that response differences lessen as intelligence levels grow (Gnambs & Schroeders, 2020; Marsh, 1996). Additional research has found that people with low levels of reading achievement are more likely to agree with positively worded items, less likely to agree with negatively worded items, and have greater difficulty disagreeing with negatively worded items due to the complexity of the task (Bulut & Bulut, 2022; Gnambs & Schroeders, 2020; Greenberger et al., 2003).

The Present Study

Although the IWE has been the subject of decades of research, the causes and correlates of the IWE are still unclear. We agree the trends in the literature are likely correct: the IWE is due to both stylistic and substantive factors, such that negative item valences and certain traits make the IWE more pronounced. Unfortunately, research findings heretofore have been obscured by study measures that lacked consistent and clear distinctions between positively and negatively worded items. In most studies, inventory items differ in more than just semantic valence (whether the key descriptors in items are evaluatively positive or negative, e.g., I am kind vs. I am cruel; Haigler & Widiger, 2001). Items also differ in terms of semantic framing (affirmation vs. negation: “I am kind,” “I am not kind”; Schriesheim et al., 1991), affix adjustments (the use of prefixes and suffixes to turn positive items into negative ones, e.g., kind to unkind, meaningful to meaningless; Barnette, 2000), item complexity (e.g., “I am not discouraged by untrustworthy misbehavior”), response option valences (whether the ends of a response scale begin with a positive option [strongly agree] or negative option [strongly disagree], or switch back and forth; Bors, Gruman, & Shukla, 2010), and the number of themes within an item (“I like to eat apples and bananas”; Kline, 1993).

To simplify our study design and the interpretability of results, we opted to test an inventory that maximizes the difference between positive and negative item valences, without introducing any of the problems associated with semantic framing, affix adjustments, item complexity, response option valences, and multiple themes. We created an inventory ourselves by utilizing single-word descriptors instead of the typical statements or questions that make up inventory items. To test the widely held hypothesis that negative wording is more difficult to understand than positive wording, we asked participants to complete equal numbers of positively and negatively worded items. We also chose to manipulate semantic framing (I am… vs. I am not… item stems) to make comparisons with item valence. Thus, we tested a within-participants (positive items vs. negative items) and between-participants (I am vs. I am not) mixed-model design.

Hypothesis 1. Scale reliability and mean inter-item correlations would be positively affected by item valence and semantic framing. Positively worded items and affirmatively framed items (I am…) would generate greater scale reliability and mean inter-item correlations than negatively worded and negatively framed items.

Hypothesis 2. Inventory mean scores would be positively affected by item valence and semantic framing. Positively worded items and affirmatively framed item stems (I am…) would generate greater mean scores than negatively worded or negatively framed items.

Hypothesis 3. Intra-scale response variance, as measured by inter-item standard deviations (ISDs; Marjanovic et al., 2015), would be affected by item valence and semantic framing. Positively worded items and affirming item stems (I am…) would generate smaller ISDs than negatively worded and negating items.

Method

Participants

Participants were recruited via Mechanical Turk to complete an online questionnaire, which took on average 11.77 min (SD = 19.93) to complete. These non-student community members were recruited from Canada and paid a nominal participation fee of US$1.00. The original sample comprised 403 responders but was reduced because of missing data (n = 12, 2.98%), indiscriminate or careless responding (n = 38, 9.43%), and administration times of less than three minutes (n = 35, 8.68%). The final sample comprised 336 responders: 246 men (73.21%) and 90 women (26.79%), with a mean age of 33.32 years (SD = 8.75). The final sample had a mean completion time of 13.30 min (SD = 21.38).

Measures

Demographics were assessed with two items querying age (in years) and biological sex (man, woman, other). The data were scrutinized for validity using three indicators.

1. Missing data. Participants who completed fewer than 90% of all items or failed to complete any one measure in full were eliminated from the final sample.

2. Conscientious Responders Scale (CRS; Marjanovic et al., 2014). The CRS is a 5-item validity scale that differentiates conscientious responding (CR: answers generated systematically, which we presume to be the result of honest and accurate responding) from indiscriminate responding (IR: answers generated unsystematically and/or carelessly). CRS items were embedded at random positions throughout the questionnaire so that they did not appear in a row or stand out too obviously to participants. The CRS achieves a high degree of classification accuracy through the use of instructional items. Each item directs the responder exactly how to answer that item (e.g., “Please select response option one (strongly disagree) to answer this item”). Therefore, responses that are congruent with item instructions are presumed to be the result of CR and scored as 1s, whereas incongruent responses are presumed to be the result of IR and scored as 0s. CRS sum scores from 0 to 2 are labelled IR and scores from 3 to 5 are labelled CR. Research shows the CRS is an effective and widely used measure for authenticating the validity of questionnaire data (Marjanovic et al., 2019).

3. Questionnaire completion times. Although it is difficult to pin down an exact principle for flagging quick administration times, it is unreasonable to assume questionnaires completed extremely quickly yield valid data (Huang et al., 2012; Wood et al., 2017). Prior to online administration, we had seven individuals complete the questionnaire to gain an idea of how long it would take to administer. Their mean administration time was a little over 12 minutes. To be conservative in setting an intuitive cutoff, we decided a priori that participants who completed the questionnaire in less than 25% of our pilot group’s mean administration time would be eliminated from the sample.
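The three screening indicators above can be sketched in a few lines of Python. This is only an illustration, not the authors' actual procedure: the `Responder` structure, the function names, and the order in which the checks are applied are assumptions, the pilot mean is rounded to 12 minutes (the text says "a little over 12"), and the per-measure completeness rule is left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Assumed constants: the pilot mean is approximated as 12 minutes.
PILOT_MEAN_MINUTES = 12.0
MIN_MINUTES = 0.25 * PILOT_MEAN_MINUTES  # 25% rule, i.e. 3 minutes
MIN_COMPLETION_RATE = 0.90               # at least 90% of items answered
CRS_PASS_SCORE = 3                       # sum scores 3-5 = CR, 0-2 = IR


@dataclass
class Responder:
    """Hypothetical record for one questionnaire submission."""
    answers: Sequence[Optional[int]]  # None marks a skipped item
    crs_given: Sequence[int]          # responses to the 5 CRS items
    crs_instructed: Sequence[int]     # option each CRS item asked for
    minutes: float                    # total completion time


def crs_score(r: Responder) -> int:
    """Count CRS items answered exactly as instructed (1 point each)."""
    return sum(int(g == i) for g, i in zip(r.crs_given, r.crs_instructed))


def exclusion_reason(r: Responder) -> Optional[str]:
    """Return the first screen a responder fails, or None if retained."""
    answered = sum(a is not None for a in r.answers)
    if answered / len(r.answers) < MIN_COMPLETION_RATE:
        return "missing data"
    if crs_score(r) < CRS_PASS_SCORE:
        return "indiscriminate responding"
    if r.minutes < MIN_MINUTES:
        return "too fast"
    return None
```

Applying the checks in a fixed order means each excluded responder is counted under one reason only; whether the reported exclusion counts were tallied this way is not stated in the text.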

Study Measure

Positive and Negative Descriptor Inventory (PANDI; adapted from the theoretical model and items in Osgood et al., 1964). Developed for this investigation, the Good and Bad subscales each contain 15 trait descriptors, which describe aspects of being a good or bad person (e.g., honest, deceitful), a strong or weak person (powerful, fragile), and an active or passive person (e.g., lively, lethargic). Items are answered on a 5-point scale ranging from 1 = Not At All to 5 = A Great Deal. All negatively worded PANDI-Bad items are reverse scored to be in semantic alignment with the PANDI Good items. Consequently, greater scores on either subscale reflect higher trait levels of goodness, strength, and energy. A description of our development of this inventory is provided below.

Procedure

The questionnaire was part of a larger study including personality and educational tests. Appearing at the end of the questionnaire, the PANDI measure instructions contained a manipulation that put responders in one of two experimental conditions.

Manipulation

Responders were randomly assigned to either the Affirming Condition (n = 165) or Negating Condition (n = 171). Both groups’ questionnaires were identical until the last inventory (the PANDI), which began with these instructions: “Using the 5-point response scale beside each item, please indicate how much the following items describe you in a general way – the way you are most of the time. Please answer all items as honestly and as accurately as possible.” Each of the following 30 PANDI items began with either an “I am” (the Affirming Condition) or an “I am not” (the Negating Condition) item stem before each descriptor. Once finished, and after reading a short debriefing statement, participants were thanked before leaving the questionnaire.

Development of the Study Inventory

We sourced our pool of descriptor items from Osgood, Suci, and Tannenbaum’s work on semantic differential scales (SDSs; 1957; 1964). In SDSs, the stem of the item asserts some subject-verb combination (e.g., I am…) followed by a series of bipolar word pairs assessing attitudes about some social issue or topic (e.g., good. . . . . . . bad). The responder selects a point along a response continuum to answer the item. Osgood et al. identified three factors that categorize attitudes: (1) Evaluation – whether a topic is good or bad; (2) Potency – whether a topic is strong or weak; and (3) Activity – whether a topic is active or passive. Together, the Evaluation-Potency-Activity (EPA) model of attitude structure has been applied in the various fields of psychology (Friborg et al., 2006), political science (Abelson et al., 1982), information systems (Verhagen et al., 2015), and marketing (Themistocleous et al., 2019).

We opted not to use the original 1957 SDS because of its use of affixes to create bipolar pairs (e.g., pleasant-unpleasant, fair-unfair). We instead looked to the list of cross-cultural bipolar pairs published in 1967 for a source of items to adapt into a new measure. In the end we used 15 items from Osgood (1964) and chose 15 alternative words to replace affixed and complex words from their list. We also chose to separate each bipolar word pair into two single-word items to avoid disputes over whether word pairs are opposite ends of a single continuum or represent unique dimensions. Our selection criteria for items were fourfold (Clark & Watson, 2019). First, for breadth, items had to have high content validity, spanning the breadth of the domains of the EPA model. Second, for clarity, all items needed to be free of affixes (i.e., prefixes and suffixes). Third, for interpretability, we favored simple words over complex words. Fourth, for balance, we selected equal numbers of positively and negatively worded item pairs representing all three EPA dimensions (e.g., nice-awful, big-little, strong-weak, fast-slow). In the end, we developed a list of 30 descriptors for the inventory, half positively worded and half negatively worded, representing all the EPA constructs. We call the two halves the PANDI-Good and PANDI-Bad, respectively (Table 1).

Table 1.

Positive and Negative Descriptor Inventory (PANDI) by Item Valence (+, -) and Component.

IV	Evaluation	C1 loading	C2 loading	Potency	C1 loading	C2 loading	Activity	C1 loading	C2 loading
+	Good	−.03	.67	Strong	.05	.62	Active	.15	.76
+	^aKind	−.03	.53	Big	−.58	.35	Fast	−.22	.55
+	^aGiving	−.26	.50	Powerful	−.14	.69	Energetic	−.08	.64
+	^aHonest	−.04	.58	^aTough	−.62	.09	^aIndustrious	.15	.60
+	^aEthical	−.19	.44	^aHealthy	−.06	.69	Lively	−.17	.63
-	Bad	.85	.08	Weak	.88	.10	Passive	.55	−.22
-	Cruel	.81	−.14	Small	.76	−.15	Slow	.87	.04
-	^aSelfish	.80	−.09	^aMeek	.67	−.06	^aTired	.80	.10
-	^aDeceitful	.72	−.14	Fragile	.82	−.07	^aLazy	.75	−.13
-	^aCorrupt	.79	−.17	^aSick	.84	−.12	^aLethargic	.78	.00

Note. ^a = descriptors we contributed. All other items taken from Osgood (1964). IV = Item Valence. C1 & C2 = Principal Component Analysis components 1 and 2 item loadings.

Results

Preliminary Statistics

All negatively worded items in the Affirming Condition (e.g., I am bad) and all positively worded PANDI items in the Negating Condition (e.g., I am not good) were reverse scored so that all PANDI data were positively aligned. After this, higher scores on all PANDI items, across Semantic Framing conditions, indicated greater levels of goodness, strength, and energy. Exploratory principal component analysis with varimax rotation was conducted to force items into a two-factor structure. We observed that all negatively worded items loaded onto component 1, accounting for 47.22% of the variance in responding, and all positively worded items loaded onto component 2, accounting for another 13.55% of the variance.
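The condition-dependent reverse scoring described above can be written out as a short rule. The sketch below is illustrative only: the function names and string labels are hypothetical, and it assumes the usual reversal for a 1-to-5 response scale (a score x becomes 6 − x).

```python
SCALE_MIN, SCALE_MAX = 1, 5  # 1 = Not At All, 5 = A Great Deal


def reverse(score: int) -> int:
    """Mirror a response around the scale midpoint (1<->5, 2<->4, 3<->3)."""
    return SCALE_MIN + SCALE_MAX - score


def align(score: int, valence: str, condition: str) -> int:
    """Return a score where higher always means more good/strong/active.

    valence is "positive" or "negative" (the descriptor itself);
    condition is "affirming" ("I am ...") or "negating" ("I am not ...").
    """
    needs_reversal = (
        (valence == "negative" and condition == "affirming")    # "I am bad"
        or (valence == "positive" and condition == "negating")  # "I am not good"
    )
    return reverse(score) if needs_reversal else score
```

For example, a response of 5 to "I am bad" and a response of 5 to "I am not good" would both become 1, while "I am good" and "I am not bad" are left unchanged.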

Although the results of the PCA showed some items were psychometrically better than others, almost all the items passed conservative rules for item retention, such as component loadings above .40 and loading at least twice as strongly on the intended component as on any other component (Kline, 1993). In the aggregate there was little difference between scales with 30 items, 24 items, and 18 items. For the sake of simplicity, because the items were previously vetted by Osgood (1964), and to avoid the potential for selectively choosing items to produce a desired outcome (i.e., p-hacking; Head et al., 2015), we retained all 30 items for the main analysis of this study.
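The two retention rules can be checked against the loadings in Table 1. The sketch below reflects one reading of those rules and is not the authors' code: it assumes the intended component is C2 for positive items and C1 for negative items, and it compares absolute loading sizes for the "double" rule. The loadings are transcribed from Table 1.

```python
# (descriptor, valence, C1 loading, C2 loading), transcribed from Table 1
LOADINGS = [
    ("Good", "+", -0.03, 0.67), ("Strong", "+", 0.05, 0.62), ("Active", "+", 0.15, 0.76),
    ("Kind", "+", -0.03, 0.53), ("Big", "+", -0.58, 0.35), ("Fast", "+", -0.22, 0.55),
    ("Giving", "+", -0.26, 0.50), ("Powerful", "+", -0.14, 0.69), ("Energetic", "+", -0.08, 0.64),
    ("Honest", "+", -0.04, 0.58), ("Tough", "+", -0.62, 0.09), ("Industrious", "+", 0.15, 0.60),
    ("Ethical", "+", -0.19, 0.44), ("Healthy", "+", -0.06, 0.69), ("Lively", "+", -0.17, 0.63),
    ("Bad", "-", 0.85, 0.08), ("Weak", "-", 0.88, 0.10), ("Passive", "-", 0.55, -0.22),
    ("Cruel", "-", 0.81, -0.14), ("Small", "-", 0.76, -0.15), ("Slow", "-", 0.87, 0.04),
    ("Selfish", "-", 0.80, -0.09), ("Meek", "-", 0.67, -0.06), ("Tired", "-", 0.80, 0.10),
    ("Deceitful", "-", 0.72, -0.14), ("Fragile", "-", 0.82, -0.07), ("Lazy", "-", 0.75, -0.13),
    ("Corrupt", "-", 0.79, -0.17), ("Sick", "-", 0.84, -0.12), ("Lethargic", "-", 0.78, 0.00),
]


def passes_retention(valence: str, c1: float, c2: float,
                     min_loading: float = 0.40, ratio: float = 2.0) -> bool:
    """Apply both rules: loading above .40 on the intended component, and
    at least `ratio` times the (absolute) loading on the other component."""
    intended, other = (c2, c1) if valence == "+" else (c1, c2)
    return intended > min_loading and abs(intended) >= ratio * abs(other)


# Items that fall short of at least one rule under this reading.
flagged = [name for name, v, c1, c2 in LOADINGS if not passes_retention(v, c1, c2)]
print(flagged)
```

Under this reading, three of the 30 descriptors (Big, Giving, and Tough) are flagged, which is in line with "almost all" items passing; a different reading of the rules could flag a different set.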

Inferential Statistics

To test hypothesis 1, that item valence and semantic framing would influence inventory reliability, we calculated PANDI-Good and PANDI-Bad Cronbach’s alpha and inter-item correlation statistics in each experimental condition (Table 2). In the Affirming Condition (I am), contrary to our expectations, the PANDI-Good Cronbach’s alpha and mean inter-item correlation were significantly smaller than the PANDI-Bad statistics. A Fisher’s Z-test comparison of the two correlations was statistically significant. Also, the PANDI-Good correlation in the Negating Condition was significantly larger than it was in the Affirming Condition. In sum, the positively worded items in the Affirming Condition performed the worst of all four groups, contradicting hypothesis 1.
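A Fisher Z-test of this kind can be sketched as follows. The exact formula the authors used is not given in the text, so this is an assumption: the standard test for two independent correlations, applied to the rounded mean inter-item correlations from Table 2 with the condition sizes (165 and 171) as the sample sizes.

```python
import math


def fisher_z_test(r1: float, n1: int, r2: float, n2: int) -> tuple[float, float]:
    """Compare two independent correlations via Fisher's r-to-z transform.

    Returns the Z statistic and its two-tailed p-value.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)          # r-to-z transform
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # SE of the difference
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))           # two-tailed normal p
    return z, p


# PANDI-Good: Affirming (r = .27, n = 165) vs. Negating (r = .61, n = 171)
z, p = fisher_z_test(0.27, 165, 0.61, 171)
print(f"Z = {z:.2f}, p = {p:.5f}")
```

With these inputs the statistic comes out at roughly −3.92, close to the corresponding value reported in Table 2, which suggests (but does not confirm) that a test of this form was used.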

Table 2.

Cronbach’s Alpha and Mean Inter-item Correlations by Item Valence (Positive, Negative) and Semantic Framing (Affirming, Negating).

Conditions	Positive α	Positive Mr	Negative α	Negative Mr	Mr Comparison
Affirming	.85	.27	.95	.57	Z = −3.34***
Negating	.96	.61	.95	.54	Z = 0.95
Mr comparison	--	Z = −3.92***	--	Z = 0.39	--

Note. *** = p < .001. α = Cronbach’s alpha. Mr = mean inter-item correlation. Mr Comparison = Fisher’s correlation coefficient comparison Z-test.
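The link between the two reliability statistics in Table 2 can be illustrated with the standardized form of Cronbach's alpha, which depends only on the number of items and the mean inter-item correlation. Whether the reported alphas are raw or standardized is not stated, so treating them as standardized is an assumption; the function name is hypothetical.

```python
def standardized_alpha(mean_r: float, k: int) -> float:
    """Standardized Cronbach's alpha for k items with mean inter-item r."""
    return k * mean_r / (1.0 + (k - 1) * mean_r)


K_ITEMS = 15  # each PANDI subscale has 15 descriptors

for label, mean_r in [("Affirming, PANDI-Good", 0.27),
                      ("Affirming, PANDI-Bad", 0.57),
                      ("Negating, PANDI-Good", 0.61),
                      ("Negating, PANDI-Bad", 0.54)]:
    print(f"{label}: alpha = {standardized_alpha(mean_r, K_ITEMS):.2f}")
```

With 15 items, the rounded mean correlations in Table 2 give alphas of about .85, .95, .96, and .95, matching the tabled values. The formula also shows why alpha stays fairly high even when the mean inter-item correlation is as low as .27: with many items, alpha is much less sensitive than the mean correlation itself.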

To test hypothesis 2, that item valence and semantic framing would influence inventory means, we conducted a mixed-model ANOVA (2 within-participants [item valence] × 2 between-participants [semantic framing]) using PANDI-Good and PANDI-Bad mean scores as the outcome variable (Table 3 & Figure 1). The ANOVA yielded a statistically significant item valence by semantic framing interaction. PANDI-Good items produced significantly larger mean scores than PANDI-Bad items in the Affirming Condition, but means were about equal in the Negating Condition. As expected, PANDI-Good means in the Affirming Condition were the highest of all four groups. Hypothesis 2 was supported.

Table 3.

ANOVAs by Item Valence (Positive, Negative) and Semantic Framing (Affirming, Negating).

ANOVA	DV	Condition	Positive Mean	Positive SD	Negative Mean	Negative SD	Effect	F	p	η²
1	Means	Affirming	3.86	0.52	3.03	0.97	IV	25.95	<.001	.072
		Negating	3.16	1.11	3.06	1.00	SF	54.68	<.001	.141
							IV × SF	16.50	<.001	.047
2	ISDs	Affirming	0.76	0.31	0.82	0.33	IV	9.71	.002	.028
		Negating	0.79	0.39	0.85	0.41	SF	0.76	.386	.002
							IV × SF	0.00	.958	.000

Note. PANDI = Positive and Negative Descriptor Inventory-Good and -Bad subscales. ISDs = Inter-item Standard Deviations. IV = Item Valence Main Effect, SF = Semantic Framing Main Effect, IV × SF = Item Valence by Semantic Framing Interaction Effect, F = F-test, p = probability value, η² = partial eta squared.
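The effect sizes in Table 3 can be recovered from the F statistics alone, which is a useful consistency check. The degrees of freedom are not reported, so the sketch below assumes 1 for each effect and 334 for error (336 participants minus 2 conditions); the function name is hypothetical.

```python
def partial_eta_squared(f: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared computed from an F statistic and its dfs."""
    return (f * df_effect) / (f * df_effect + df_error)


DF_EFFECT = 1          # assumed: each factor has two levels
DF_ERROR = 336 - 2     # assumed: final sample size minus number of conditions

# F values for the ANOVA on means, taken from Table 3
for effect, f in [("IV", 25.95), ("SF", 54.68), ("IV x SF", 16.50)]:
    print(f"{effect}: {partial_eta_squared(f, DF_EFFECT, DF_ERROR):.3f}")
```

Under these assumed degrees of freedom the three values come out at about .072, .141, and .047, agreeing with the table.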

Figure 1.

ANOVA 1 by item valence (positive, negative) and semantic framing (affirming, negating).

A second mixed-model ANOVA was conducted to test Hypothesis 3, that item valence and semantic framing affect intra-scale response variance. In this ANOVA, the outcome variable was the inter-item standard deviation (ISD), which quantifies intra-scale response variance across all the items of a scale. It stands to reason that if all the items of an inventory gauge the same construct, one should expect responses to all its items to be similarly located in the response range. We expected the PANDI-Good subscale to have smaller ISDs than the PANDI-Bad subscale, and the Affirming Condition to produce smaller ISDs than the Negating Condition.
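As an illustration of the outcome variable, an ISD can be computed for each responder as the standard deviation of that person's (aligned) responses across the items of one subscale. The sketch below assumes the sample standard deviation (n − 1 denominator); the function name and the example response patterns are hypothetical and not taken from the study data.

```python
import statistics
from typing import Sequence


def inter_item_sd(responses: Sequence[int]) -> float:
    """Within-person SD of one responder's answers across a scale's items.

    Assumes the responses are already reverse scored where needed.
    """
    return statistics.stdev(responses)  # sample SD (n - 1 denominator)


# Hypothetical responders on a 15-item subscale (1-5 response scale)
consistent = [4, 4, 5, 4, 3, 4, 4, 5, 4, 4, 3, 4, 4, 5, 4]
erratic = [1, 5, 2, 5, 1, 4, 1, 5, 3, 1, 5, 2, 4, 1, 5]
straightline = [3] * 15

for label, resp in [("consistent", consistent), ("erratic", erratic),
                    ("straightline", straightline)]:
    print(f"{label}: ISD = {inter_item_sd(resp):.2f}")
```

A responder who answers all items of a scale in a similar region of the response range gets a small ISD, an erratic responder gets a large one, and a responder who selects the same option for every item gets an ISD of exactly zero.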

In support of Hypothesis 3, results of the second ANOVA yielded a statistically significant main effect of item valence in the expected direction (Table 3 & Figure 2). PANDI-Good items produced less intra-scale response variance than PANDI-Bad items. The main effect for semantic framing was not statistically significant but was in the expected direction. In sum, the effect of item valence on intra-scale response variance was predictable, meaningful, and greater than the effect of semantic framing.

Figure 2.

ANOVA 2 inter-item standard deviations by item valence (positive, negative) and semantic framing (affirming, negating).

Discussion

The Item Wording Effect is a psychological testing phenomenon wherein responders answer positively worded items (e.g., I am good) differently than negatively worded items (e.g., I am bad). Research shows meaningful differences in the reliabilities, mean scores, and factor structure of positively and negatively worded items. Apart from that, the literature on the IWE has been murky due in part to researchers’ use of study measures that are differentiated by more than just item valence, but also semantic framing, affix use, changing response option valences, item complexity, and multiple themes.

The purpose of this study was to use a clarified inventory in which items could be differentiated on item valence alone. For this purpose, we developed a 30-item personality inventory, adapted in part from Osgood et al. (1964), that separates into two 15-item subscales, one positively worded and one negatively worded, called the PANDI-Good and PANDI-Bad, respectively. Further, we manipulated semantic framing to compare it with item valence and see which variable produced the larger IWE.

Hypothesis 1 findings were contrary to our expectations and completely inconsistent with the IWE literature. Positively worded items in the Affirming Condition produced the lowest reliability of all four groups. We attribute this odd result to a restriction of range. Table 3 shows PANDI-Good standard deviations in the Affirming Condition were about half the size of those in the other three groups. The PANDI-Good means in the Affirming Condition were more narrowly distributed as compared to the PANDI-Bad and Negating Condition items. This may reflect the jarring, slowing effect that negative items have on responders’ completion times (Podsakoff et al., 2003).

Hypothesis 2 showed interesting effects of item valence and semantic framing on scale means. Because all negatively worded and Negating Condition items were reverse scored, and all their content was conceptually aligned, mean scores should have been equivalent across groups. Yet, results showed responders seemed to process positively worded items in the Affirming Condition differently than in the other three groups. Positively worded and Affirming Condition item means were high above the response scale midpoint, whereas means in the other three groups were more conservatively near the midpoint. In sum, responders showed a positivity bias when completing positively worded and affirmingly framed items. This is consistent with positive biases found in self-referent attributions (Watson et al., 2007), worldview/outlook (Mezulis et al., 2004; Peeters, 1971), and language (Augustine et al., 2011; Dodds et al., 2015).

Hypothesis 3 analysis showed positively worded items produced less intra-scale response variance than negatively worded items, which we attribute to response hesitancy or a lack of certainty in responding to negatively worded items (Podsakoff et al., 2003). This explanation is consistent with research that shows responders answer positively worded items significantly quicker than negatively worded items (Watson et al., 2007).

In sum, positively worded and affirming items produced the best results for means and intra-scale response variance, but the worst results for inventory reliability. Altogether, the IWE was meaningfully influenced by item valence, and to a lesser extent semantic framing. This finding violates a fundamental assumption of objective personality testing, that responders can interpret items correctly (Kline, 1993; Watson et al., 1988). From these findings, and previous research, we argue this presumption is untenable and should be addressed.

Inventory developers may not be able to entirely avoid using negatively worded items. The subject matter they measure is often dark, disturbed, and pathological (e.g., the dark tetrad), which necessitates negatively worded content. Nevertheless, developers would benefit from carefully choosing clarified items that vary on item valence alone. Avoid items that are negating, affixed, sesquipedalian (pun intended), or that contain multiple themes. This study showed negative and negating items caused meaningful deleterious effects in reliability, mean scores, and intra-scale response variance as compared to positively worded and affirming items. Because these findings are mostly consistent with the IWE literature, which suggests a systematic methodological biasing effect of negating and negatively worded items, researchers would be sensible to limit their use when possible. The main loss in the absence of negative and negating items (e.g., control for response biases) can be regained with the clever use of data validity scales (e.g., acquiescence and naysaying can be detected with extremely small ISDs; Marjanovic et al., 2015).

Limitations and Future Directions

Our data were restricted to online self-report questionnaires that produced high rates of indiscriminate or careless responding. Although we were careful to screen the data to identify and expunge invalid responders, we believe some may nevertheless have gotten through the screens, with sophisticated bots and response-generating software contaminating our data (Dupuis et al., 2019). We believe this because of the large number of responders that completed the questionnaire in less than 3 minutes but were still able to pass the validity scale and missing data cutoffs. This is important because data contaminated by indiscriminate responding can mistakenly yield factors made up of positively and negatively worded items (e.g., Schmitt & Stults, 1985). The more indiscriminate responding (IR) in a set of data, the more strongly the data would show signs of bidimensionality. We therefore urge researchers to make efforts to filter impurities from their data before commencing with analyses.

In future research, the IWE and its suspected causes and correlates can be re-examined using clarified attitudinal and personality inventories such as the one used in this study. Existing measures like the Positive and Negative Affect Schedule (PANAS; Watson et al., 1988) are also promising measures to use in IWE research because items can be segregated based on item valence alone. For now, we advise researchers to be wary of inventories containing negating and negatively worded items: avoid their use if possible. Despite the advantages they offer researchers (e.g., reduction of response bias), their disadvantages are too costly to ignore.

Footnotes

Acknowledgments

We thank Morgan van Morgan for her assistance throughout this project.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the SSHRC Insight Grant # 435-2019-0529.

ORCID iD

Zdravko Marjanovic

Data availability statement

All data and pre-registered hypotheses and study details are available at the Open Science Framework.

Author Biographies

Zdravko Marjanovic has taught and conducted research in Ontario, Newfoundland, and British Columbia, Canada, before settling in Alberta as an Associate Professor of Psychology at Concordia University of Edmonton. His research focuses on various topics within the area of social and personality psychology.

Anna Louisa Maidens was an undergraduate student at Concordia University of Edmonton during this study. She has since graduated with distinction, earning her BA in Psychology, and is now pursuing graduate studies in a counseling program.

References

Abelson

R. P.

Kinder

D. R.

Peters

M. D.

Fiske

S. T.

(1982). Affective and semantic components in political person perception. Journal of Personality and Social Psychology, 42(4), 619–630. https://doi.org/10.1037/0022-3514.42.4.619

Augustine

A. A.

Mehl

M. R.

Larsen

R. J.

(2011). A positivity bias in written and spoken English and its moderation by personality and gender. Social Psychological and Personality Science, 2(5), 508–515. https://doi.org/10.1177/1948550611399154

Barnette

J. J.

(2000). Effects of stem and Likert response option reversals on survey internal consistency: If you feel the need, there is a better alternative to using those negatively worded stems. Educational and Psychological Measurement, 60(3), 361–370. https://doi.org/10.1177/00131640021970592

Bores

D. A.

Gruman

J. A.

Shukla

(2010). Measuring tolerance of ambiguity: Item polarity, dimensionality, and criterion validity. European Review of Applied Psychology, 60(4), 239–245. https://doi.org/10.1016/j.erap.2010.07.001

Bulut

H. C.

Bulut

(2022). Item wording effects in self-report measures and reading achievement: Does removing careless respondents help? Studies In Educational Evaluation, 72, Article 101126. https://doi.org/10.1016/j.stueduc.2022.101126

Clark

L. A.

Watson

(2019). Constructing validity: New developments in creating objective measuring instruments. Psychological Assessment, 31(12), 1412–1427. https://doi.org/10.1037/pas0000626

DeVellis

R. F.

(2003). Scale development: Theory and applications (2nd ed.). Sage Publications.

DiStefano

Motl

R. W.

(2006). Further investigating method effects associated with negatively worded items on self-report surveys. Structural Equation Modeling, 13(3), 440–464. https://doi.org/10.1207/s15328007sem1303_6

Dodds

P. S.

Clark

E. M.

Desu

Frank

M. R.

Reagan

A. J.

Williams

J. R.

Mitchell

Harris

K. D.

Kloumann

I. M.

Bagrow

J. P.

Megerdoomian

McMahon

M. T.

Tivnan

B. F.

Danforth

C. M.

Danforth

C. M.

(2015). Human language reveals a universal positivity bias. Proceedings of the National Academy of Sciences, 112(8), 2389–2394. https://doi.org/10.1073/pnas.1411678112

Dupuis, Meier, & Cuneo (2019). Detecting computer-generated random responding in questionnaire-based data: A comparison of seven indices. Behavior Research Methods, 51(5), 2228–2237. https://doi.org/10.3758/s13428-018-1103-y

Eys, M. A., Carron, A. V., Bray, S. R., & Brawley, L. R. (2007). Item wording and internal consistency of a measure of cohesion: The Group Environment Questionnaire. Journal of Sport & Exercise Psychology, 29(3), 395–402. https://doi.org/10.1123/jsep.29.3.395

Friborg, Martinussen, & Rosenvinge, J. H. (2006). Likert-based vs. semantic differential-based scorings of positive psychological constructs: A psychometric comparison of two versions of a scale measuring resilience. Personality and Individual Differences, 40(5), 873–884. https://doi.org/10.1016/j.paid.2005.08.015

Gnambs, & Schroeders (2020). Cognitive abilities explain wording effects in the Rosenberg Self-Esteem Scale. Assessment, 27(2), 404–418. https://doi.org/10.1177/1073191117746503

Greenberger, Chen, Dmitrieva, & Farruggia, S. P. (2003). Item-wording and the dimensionality of the Rosenberg self-esteem scale: Do they matter? Personality and Individual Differences, 35(6), 1241–1254. https://doi.org/10.1016/s0191-8869(02)00331-8

Haigler, E. D., & Widiger, T. A. (2001). Experimental manipulation of NEO-PI-R items. Journal of Personality Assessment, 77(2), 339–358. https://doi.org/10.1207/S15327752JPA7702_14

Head, M. L., Holman, Lanfear, Kahn, A. T., & Jennions, M. D. (2015). The extent and consequences of p-hacking in science. PLoS Biology, 13(3), Article e1002106. https://doi.org/10.1371/journal.pbio.1002106

Horan, P. M., DiStefano, & Motl, R. W. (2003). Wording effects in self-esteem scales: Methodological artifact or response style? Structural Equation Modeling, 10(3), 435–455. https://doi.org/10.1207/s15328007sem1003_6

Huang, J. L., Curran, P. G., Keeney, Poposki, E. M., & DeShon, R. P. (2012). Detecting and deterring insufficient effort responding to surveys. Journal of Business and Psychology, 27(1), 99–114. https://doi.org/10.1007/s10869-011-9231-8

Hystad, S. W., & Johnsen, B. H. (2020). The dimensionality of the 12-item general health questionnaire (GHQ-12): Comparisons of factor structures and invariance across samples and time. Frontiers in Psychology, 11, Article 1300. https://doi.org/10.3389/fpsyg.2020.01300

Isen, A. M., & Daubman, K. A. (1984). The influence of affect on categorization. Journal of Personality and Social Psychology, 47(6), 1206–1217. https://doi.org/10.1037//0022-3514.47.6.1206

Kam, C. C. S. (2018). Why do we still have an impoverished understanding of the item wording effect? An empirical examination. Sociological Methods & Research, 47(3), 574–597. https://doi.org/10.1177/0049124115626177

Kline (1993). Personality: The psychometric view. Routledge.

Marjanovic, Bajkov, & MacDonald (2019). The Conscientious Responders Scale helps researchers verify the integrity of personality questionnaire data. Psychological Reports, 122(4), 1529–1549. https://doi.org/10.1177/0033294118783917

Marjanovic, Holden, Struthers, Cribbie, & Greenglass (2015). The Inter-Item Standard Deviation (ISD): An index that discriminates between conscientious and random responders. Personality and Individual Differences, 84, 79–83. https://doi.org/10.1016/j.paid.2014.08.021

Marjanovic, Struthers, C. W., Cribbie, R. A., & Greenglass, E. R. (2014). The conscientious responders scale: A new tool for discriminating between conscientious and random responders. Sage Open, 4(3), 1–10. https://doi.org/10.1177/2158244014545964

Marsh, H. W. (1996). Positive and negative global self-esteem: A substantively meaningful distinction or artifactors? Journal of Personality and Social Psychology, 70(4), 810–819. https://doi.org/10.1037//0022-3514.70.4.810

Marshall, G. N., Wortman, C. B., Kusulas, J. W., Hervig, L. K., & Vickers, R. R., Jr. (1992). Distinguishing optimism from pessimism: Relations to fundamental dimensions of mood and personality. Journal of Personality and Social Psychology, 62(6), 1067–1074. https://doi.org/10.1037//0022-3514.62.6.1067

Mezulis, A. H., Abramson, L. Y., Hyde, J. S., & Hankin, B. L. (2004). Is there a universal positivity bias in attributions? A meta-analytic review of individual, developmental, and cultural differences in the self-serving attributional bias. Psychological Bulletin, 130, 711–747.

Nunnally (1978). Psychometric theory (2nd ed.). McGraw-Hill.

Osgood, C. E. (1964). Semantic differential technique in the comparative study of cultures. American Anthropologist, 66(3), 171–200. https://doi.org/10.1525/aa.1964.66.3.02a00880

Peeters (1971). The positive‐negative asymmetry: On cognitive consistency and positivity bias. European Journal of Social Psychology, 1(4), 455–474. https://doi.org/10.1002/ejsp.2420010405

Plomin, Scheier, M. F., Bergeman, C. S., Pedersen, N. L., Nesselroade, J. R., & McClearn, G. E. (1992). Optimism, pessimism and mental health: A twin/adoption analysis. Personality and Individual Differences, 13(8), 921–930. https://doi.org/10.1016/0191-8869(92)90009-e

Podsakoff, P. M., MacKenzie, S. B., Lee, J.-Y., & Podsakoff, N. P. (2003). Common method biases in behavioral research: A critical review of the literature and recommended remedies. Journal of Applied Psychology, 88(5), 879–903. https://doi.org/10.1037/0021-9010.88.5.879

Rauch, W. A., Schweizer, & Moosbrugger (2007). Method effects due to social desirability as a parsimonious explanation of the deviation from unidimensionality in LOT-R scores. Personality and Individual Differences, 42(8), 1597–1607. https://doi.org/10.1016/j.paid.2006.10.035

Schmitt, & Stuits, D. M. (1985). Factors defined by negatively keyed items: The result of careless respondents? Applied Psychological Measurement, 9(4), 367–373. https://doi.org/10.1177/014662168500900405

Schriesheim, C. A., Eisenbach, R. J., & Hill, K. D. (1991). The effect of negation and polar opposite item reversals on questionnaire reliability and validity: An experimental investigation. Educational and Psychological Measurement, 51(1), 67–78. https://doi.org/10.1177/0013164491511005

Schriesheim, C. A., & Hill, K. D. (1981). Controlling acquiescence response bias by item reversals: The effect on questionnaire validity. Educational and Psychological Measurement, 41(4), 1101–1114. https://doi.org/10.1177/001316448104100420

Sturman, E. D., Cribbie, R. A., & Flett, G. L. (2009). The average distance between item values: A novel approach for estimating internal consistency. Journal of Psychoeducational Assessment, 27(5), 409–420. https://doi.org/10.1177/0734282908330937

Themistocleous, Pagiaslis, Smith, & Wagner (2019). A comparison of scale attributes between interval-valued and semantic differential scales. International Journal of Market Research, 61(4), 394–407. https://doi.org/10.1177/1470785319831227

Tomas, J. M., & Oliver (1999). Rosenberg's self‐esteem scale: Two factors or method effects. Structural Equation Modeling: A Multidisciplinary Journal, 6(1), 84–98. https://doi.org/10.1080/10705519909540120

Vagg, P. R., Spielberger, C. D., & O'Hearn, T. P., Jr. (1980). Is the state-trait anxiety inventory multidimensional? Personality and Individual Differences, 1(3), 207–214. https://doi.org/10.1016/0191-8869(80)90052-5

Vautier, Callahan, Moncany, & Sztulman (2004). A bistable view of single constructs measured using balanced questionnaires: Application to trait anxiety. Structural Equation Modeling, 11(2), 261–271. https://doi.org/10.1207/s15328007sem1102_7

Verhagen, Hooff, B. V. D., & Meents (2015). Toward a better use of the semantic differential in IS research: An integrative framework of suggested action. Journal of the Association for Information Systems, 16(2), 108–143. https://doi.org/10.17705/1jais.00388

Wason, P. C. (1959). The processing of positive and negative information. Quarterly Journal of Experimental Psychology, 11(2), 92–107. https://doi.org/10.1080/17470215908416296

Watson, Clark, L. A., & Tellegen (1988). Development and validation of brief measures of positive and negative affect: The PANAS scales. Journal of Personality and Social Psychology, 54(6), 1063–1070. https://doi.org/10.1037//0022-3514.54.6.1063

Watson, L. A., Dritschel, Obonsawin, M. C., & Jentzsch (2007). Seeing yourself in a positive light: Brain correlates of the self-positivity bias. Brain Research, 1152(6), 106–110. https://doi.org/10.1016/j.brainres.2007.03.049

Weems, G. H., Onwuegbuzie, A. J., Schreiber, J. B., & Eggers, S. J. (2003). Characteristics of respondents who respond differently to positively and negatively worded items on rating scales. Assessment & Evaluation in Higher Education, 28(6), 587–606. https://doi.org/10.1080/0260293032000130234

Wood, Harms, P. D., Lowman, G. H., & DeSimone, J. A. (2017). Response speed and response consistency as mutually validating indicators of data quality in online samples. Social Psychological and Personality Science, 8(4), 454–464. https://doi.org/10.1177/1948550617703168

A Clarified Examination of the Item Wording Effect: Item Valence (Good vs. Bad) Versus Semantic Framing (I Am vs. I Am Not)
