Abstract
A great challenge to employing questionnaires is ensuring that respondents understand them. One possible strategy is to explain the meaning of items, but doing so risks biasing respondents’ answers. In this study, we assessed how explaining the meaning of items to respondents might affect their responses and the validity and reliability of the Positive System Usability Scale (PSUS). We employed straightforward, general explanations that we judged to match the intended meaning of the items. We found a statistically significant difference between the Explanations and No Explanations groups, but we judged this effect not to be practically significant. We found the validity and reliability of the questionnaires from both groups to be acceptable. Finally, the reliability of the PSUS questionnaires given to the Explanations group was systematically higher than that of the No Explanations group, suggesting that correctly explaining items may enhance the reliability of questionnaires, perhaps by improving respondents’ understanding of the questionnaire items.
Introduction
Subjective questionnaires are an instrumental tool in the application and research of usability. When applied well, they are easy to administer, require little time, facilitate comparisons between products, and provide a way to summarize results and communicate them to non-expert stakeholders. One of the greatest hindrances to their use is that respondents may not understand the items, producing meaningless ratings that can lead to incorrect results (e.g., failing a product that should pass, or passing one that should fail) or to puzzling response patterns (e.g., all 3s). Investigators have limited options for solving this problem.
The System Usability Scale (SUS; Brooke, 1996) is a prime example of this challenge and how it might be addressed. Finstad (2006) found that one SUS item (“I found the system very cumbersome to use”) was difficult to understand for respondents whose primary language was not English. His recommended solution was to substitute the word “awkward” for “cumbersome” to increase readability. Bangor et al. (2008) supported this modification, reporting that it drastically reduced the number of questions they received when administering the SUS. This represents a rare case in which changing the wording of an item is recommended. Generally, such changes are discouraged because modifications to previously validated scales can change their interpretation and render the psychometric qualities of a scale unknown.
To mitigate careless responding and acquiescence bias, the SUS was created with an alternating item format: half of the items are positively worded and the other half are negatively worded (Sauro & Lewis, 2011). In practice, however, the alternating item format can lead to misinterpretation, response mistakes, and miscoding by the scorer (Sauro & Lewis, 2011). To address these issues, Sauro and Lewis (2011) created the Positive System Usability Scale (Positive SUS), which converts all the negatively worded SUS items into positive statements. Comparing responses to the Positive SUS and the original SUS, they found no detectable difference in response patterns or mean scores. In contrast to Finstad’s (2006) single-word substitution, this approach is much more resource-intensive, so many investigators may not be able to pursue questionnaire modification and validation.
Another option is to adopt a different measurement instrument, but investigators may be impeded by limited time or limited knowledge of alternatives. Yet another strategy is to explain the meaning and intent of items to respondents. This may take many forms, including an overview of the interpretation of items, answering clarifying questions from respondents, or giving written instructions on how to read and interpret questionnaire content. Although faster and easier to apply than the other options, these strategies risk introducing bias from the person administering the questionnaire and thereby influencing responses to the items.
In this study, we examined arguably the most benign form of this final strategy: explaining the relevance and meaning of questionnaire items without commenting on directionality (e.g., suggesting that a product would rate lower on some aspect). We explored how these explanations may affect response patterns and the psychometric qualities (reliability and validity) of the usability scale, and how helpful respondents found the explanations.
Method
Design
Sixty participants retrospectively rated 10 products (selected from Kortum & Bangor, 2013) using the Positive SUS (Sauro & Lewis, 2011). We also used the Adjective Rating Scale (Bangor et al., 2009) to measure the convergent validity of ratings in both conditions. We randomly assigned participants to two groups. The No Explanations group rated products without any additional information. The Explanations group received the PSUS with 10 statements, each coupled to a Positive SUS item and explaining the item’s intended meaning.
Afterwards, both groups rated how understandable the Positive SUS items were (“The statements I answered about each product were understandable”) and how well the explanations matched their interpretations (“The supplemental information matched with my interpretation of this statement”; rated once for each of the 10 item-explanation couplings). We gauged the Explanations group’s perceived helpfulness of the explanations (“The supplemental information was helpful for me to understand the statements”; “I feel confident that I could have answered the statements without the supplemental information”). After they had rated all the products, we asked the No Explanations group how helpful the explanations would have been for them (“The supplemental information would be helpful for me to understand the statements”; “I would have an easier time in answering the statements with the supplemental information”).
We asked both groups if their responses would have differed based on the absence (Explanations; “My responses would have been the same if I had not been given the supplementary information”) or addition (No Explanations; “My responses would be the same if I had been given the supplemental information”) of the explanations. Responses were collected with a scale ranging from 1 “Strongly disagree” to 5 “Strongly agree.”
Participants
We recruited 60 undergraduate students from Rice University using the online participant portal, SONA. Participants ranged from 18 to 25 years old. There were 11 males and 46 females. Two participants identified themselves as non-binary, and one participant preferred not to disclose their gender identity. Students reviewed the IRB-approved informed consent form before participating in the study. Students received partial course credit upon completion of the study.
Materials and Procedure
Participants were randomly assigned to retrospectively rate 10 products using one of two versions of an online Qualtrics survey: one without any additional information on the meaning of items (the No Explanations group) and one with additional information (the Explanations group). The 10 products were selected from products that Kortum and Bangor (2013) identified as ubiquitous, best-in-class, and spanning software, hardware, and web-based systems. Some products on their list were excluded because they are no longer commonly used (e.g., landlines). The 10 chosen products were Amazon’s website, ATMs, Microsoft Excel, Gmail, Google Search, iPhones, microwaves, Microsoft PowerPoint, the Nintendo Wii, and Microsoft Word. Consistent with the recommendation from Sauro (2011), each product’s name was substituted for the word “system” in its corresponding PSUS items to increase readability. Responses were captured on a 5-point Likert scale from 1 “Strongly disagree” to 5 “Strongly agree.”
In the No Explanations version of the task, the rating of each product was composed of three parts. First, participants reported their level of experience with the product (Anchors: “I don’t use the product,” low, medium-low, medium, medium-high, or high). Second, the Positive SUS (Sauro & Lewis, 2011) was presented to measure participants’ perceived product usability. The Positive SUS is a 10-item scale with responses ranging from 1 “Strongly disagree” to 5 “Strongly agree.” Third, the Adjective Rating Scale (Bangor et al., 2009) was used as a single question measuring usability with a choice among seven adjectives (Anchors: Worst imaginable, Awful, Poor, OK, Good, Excellent, Best imaginable). Participants who reported having no experience with a product were not presented with the Positive SUS or Adjective Rating Scale for that product. The presentation order of the 10 products was randomized. The Explanations version was the same as the No Explanations version, except that additional information was provided to explain the meaning and relevance of each Positive SUS item. Each explanation (Table 1) was coupled to a Positive SUS item. The explanations were written to be below an 8th-grade level as evaluated with the Flesch-Kincaid (FK) readability test (Flesch, 1948).
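For reference, the Flesch-Kincaid grade-level formula, a later grade-level adaptation of Flesch’s (1948) reading-ease measure and the standard way to operationalize such a check, is:

FK grade level = 0.39 × (total words / total sentences) + 11.8 × (total syllables / total words) − 15.59

Scores below 8 indicate text that should be readable at or below an 8th-grade level.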
Table 1. Explanations in Order of Presentation.
Results
Positive SUS Scores
The average Positive SUS scores for each product by condition are recorded in Table 2. After eliminating participants for missing data, data from 46 participants and 8 of the products (bolded in Table 2) were retained for a mixed ANOVA. The interaction between Condition and Product was not statistically significant, F(7, 308) = 0.75, p = .63, generalized eta squared (GES) = .01. There was a statistically significant main effect for Condition, F(1, 44) = 5.40, p = .03, GES = .03. The main effect for Product was also statistically significant after a Huynh-Feldt sphericity correction, F(5.0, 220.1) = 45.78, p < .001, GES = .43.
Table 2. Product Positive System Usability Scores by Condition.
Note. * Average calculated only with bolded products.
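A mixed ANOVA of this form can be run in R with, for example, the afex package. The sketch below is illustrative only; the data frame d and its column names (pid, condition, product, psus) are assumptions, not the study’s actual code, and the paper does not specify which package was used for this analysis.

library(afex)
# d: long-format data, one row per participant x product, with columns
# pid (participant), condition (between-subjects factor),
# product (within-subjects factor), and psus (0-100 Positive SUS score)
m <- aov_ez(
  id = "pid", dv = "psus", data = d,
  between = "condition", within = "product",
  anova_table = list(correction = "HF", es = "ges")  # Huynh-Feldt correction; generalized eta squared
)
m  # prints the ANOVA table with corrected df, p-values, and GES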
In practical terms, the difference between the groups’ average Positive SUS ratings was 5.2 points, or about half a letter grade on the SUS letter-grade scale (Bangor et al., 2009). On the raw scale this amounts to about 2 ticks on average; put another way, participants in the Explanations group rated two items one tick lower, or one item two ticks lower, on average. This difference was observed for five of the products (Amazon, Excel, Gmail, Google, and iPhone). The average Microwave and PowerPoint PSUS scores differed by only about 1 point or less between the two groups. The average Word PSUS score was higher for the Explanations group than for the No Explanations group, showing that the overall difference was not consistent across products.
The main effect for Product was not analyzed further: it was neither surprising (these products were expected to differ from one another) nor relevant to our research question.
Reliability
As can be seen in Table 3, for the No Explanations group, the reliability of the Positive SUS for all eight products met the generally recommended threshold of α = .70. Explaining the meaning of the items did not appear to harm scale reliability, as Cronbach’s alpha exceeded the recommended threshold for the Explanations group’s Positive SUS scores as well. Interestingly, the reliability of the Positive SUS for the Explanations group was consistently higher than that of the No Explanations group, as evidenced by the confidence intervals, for all products but Word. This result may indicate that participants in the Explanations group responded more similarly to one another than participants in the No Explanations group.
Table 3. Cronbach’s Alpha by Product and Condition.
Note. Reliability analysis was conducted using only the products retained for the Positive SUS ANOVA. Coefficient alpha was calculated using the psych package (Revelle, 2023) in R; confidence intervals were calculated using the Feldt method.
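A minimal sketch of this computation with the psych package, assuming a data frame items that holds the 10 Positive SUS item responses for one product within one condition (the object name is illustrative):

library(psych)
# coefficient alpha for one product/condition cell
a <- alpha(items)
a$total$raw_alpha                          # Cronbach's alpha
# Feldt-method confidence interval around that alpha
alpha.ci(a$total$raw_alpha, n.obs = nrow(items),
         n.var = ncol(items), p.val = .05)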
Validity
Turning to convergent validity, the correlation between Positive SUS scores and Adjective Rating scores (see Table 4) was statistically significant for all eight products and was of comparable strength in both groups (No Explanations average r = .74; Explanations average r = .77), suggesting that explaining the items did not undermine the validity of the Positive SUS.
Table 4. Adjective Rating Scale Median Scores and Validity Coefficients.
Note. All correlations were statistically significant, p ≤ .003.
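Coefficients of this kind can be obtained with per-product correlations; the sketch below is illustrative and assumes the same long-format data frame d with hypothetical column names (psus, adjective, product):

# Pearson correlation between the two usability measures for each product
for (p in unique(d$product)) {
  x <- d[d$product == p, ]
  ct <- cor.test(x$psus, x$adjective)      # r and its p-value
  cat(p, ": r =", round(ct$estimate, 2),
      ", p =", signif(ct$p.value, 2), "\n")
}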
Understandability
Participants in both groups indicated that they could understand the Positive SUS items. The small difference between the No Explanations group’s ratings (M = 4.40, SD = 0.56) and the Explanations group’s ratings (M = 4.27, SD = 0.52) indicates that the explanations did little to improve participants’ reading of the items, probably because the items were understandable without any aid.
Match of Meaning Between Explanations and Items
Participants reported that the explanations given to them matched their interpretation of the Positive SUS items. The Explanations group gave a higher rating on average (M = 4.24) than the No Explanations group (M = 4.08), a difference we did not test inferentially given its small size. This result suggests that the explanations were generally consistent with their coupled items. The ratings are summarized at the item level in Table 5.
Table 5. Participants’ Judgment of Compatibility between Explanations and PSUS Items.
Helpfulness
Participants’ ratings of the explanations’ helpfulness were moderately positive. There was a small difference between the prospective helpfulness ratings of the No Explanations group (M = 3.90, SD = 1.06, Median = 4) and the post-response helpfulness ratings of the Explanations group (M = 4.20, SD = 0.71, Median = 4). However, when asked if they could have answered the questions without the explanations, the Explanations group responded largely in the affirmative (M = 4.03, SD = 1.03, Median = 4), while the No Explanations group mostly agreed with the statement that they would have had an easier time responding had they been given the explanations (M = 3.80, SD = 1.13, Median = 4).
Would Responses Have Changed?
When asked whether their responses would have been different had they been given the other version of the Positive SUS, participants seemed uncertain. The No Explanations group trended slightly more toward the middle of the response scale (M = 3.27, SD = 1.28, Median = 3), indicating more uncertainty, whereas the Explanations group trended more toward indicating that their answers would have been the same had they not been given the additional information (M = 3.63, SD = 1.13, Median = 4).
Were Responses Just Random Noise?
An alternative explanation for the results in this paper is that participants simply responded at the extremes of the scales or in a random manner. To at least partially address this alternative hypothesis, we compared the ratings from the current study to those from Kortum and Bangor (2013). As can be seen in Table 6, for the most part, the average ratings from the original study and the two experimental groups are not far apart from one another.
Table 6. Comparison of Current Study and Kortum and Bangor (2013).
This was confirmed using Lin’s Concordance Correlation Coefficient (CCC), a measure of agreement for continuous variables. The closer the CCC is to 1, the higher the agreement between the two sets of scores. Both the No Explanations group (CCC = .93 [.76, .98]) and the Explanations group (CCC = .91 [.73, .97]) showed high agreement with the scores from the original study.
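The CCC is available in several R packages; a minimal sketch using DescTools (an assumption, as the paper does not name the package used), with vectors of product-level means aligned by product:

library(DescTools)
# orig:  mean ratings per product from Kortum and Bangor (2013)
# group: mean ratings for the same products from one experimental group
CCC(orig, group, ci = "z-transform", conf.level = 0.95)$rho.c
# returns the concordance estimate with its lower and upper confidence bounds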
Discussion
Based on the results, we concluded that when explanations are consistent with questionnaire items, there do not seem to be many detectable downsides. Although we observed a statistically significant difference, a 5.19-point difference on average amounts to only about 2 Likert ticks, as the Positive SUS applies a scaling factor of 2.5 from raw to scaled scores. Under most circumstances, this is likely not a practical difference.
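To make the arithmetic concrete, here is a minimal sketch assuming the standard all-positive SUS scoring convention (each item contributes its response minus 1, and the 0-40 raw sum is multiplied by 2.5):

# score one respondent's ten 1-5 Positive SUS ratings on the 0-100 scale
score_psus <- function(responses) {
  stopifnot(length(responses) == 10, all(responses %in% 1:5))
  sum(responses - 1) * 2.5
}
score_psus(rep(4, 10))       # 75.0
score_psus(c(rep(4, 9), 3))  # 72.5: one tick on one item shifts the score 2.5 points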
Psychometrically, the scores collected from both groups behaved similarly from a validity standpoint, and the average scores from both groups were consistent with those reported by Kortum and Bangor (2013). We interpreted this as evidence against the possibility that either group produced meaningless or random responses, bolstering the conclusion that both groups gave earnest ratings.
One interesting observation is that the internal reliability of the scales collected from the Explanations group was on average higher than that of the No Explanations group. This may indicate that the explanations enhanced respondents’ understanding of the items, helping to reduce error variance or leading to a more standardized response pattern. This suggests an upside to adding clarifying statements to questionnaires. The increase in reliability is likely irrelevant here, given that reliability was already far above the acceptable threshold for the No Explanations group, but it may be worth considering with smaller respondent samples, as such samples are more vulnerable to problems stemming from variability than larger samples. We feel that the increase in internal reliability due to explanations should be replicated and explored with other questionnaires to further establish the existence and magnitude of this effect.
Although there were no discernible drawbacks to the explanations, there may also not be enough benefits to justify generating and adding them in the first place. Although both groups generally reported that they found the explanations helpful or potentially helpful, participants indicated more often than not that their responses to the Positive SUS would have been the same either way. Participants in the Explanations group reported they were confident they did not need the extra information to complete the Positive SUS, undermining the purpose of the explanations. That said, it may be that most participants do not need the assistance, while the few who do benefit from the additional information.
Conclusions
The takeaway from this study is that test moderators probably should not worry that they may be unduly influencing usability questionnaire responses by providing additional information to help respondents understand the test content. The caveat is that the explanations need to be consistent with the meaning and intent of the items. How to determine the level of consistency or the appropriateness of that additional information is an entirely different question that requires further investigation.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
