Abstract
The present study introduces the concept of validating data quality indicators through re-classification. We use the term re-classification to mean the evaluation of how well an indicator detects the quality of different versions of a survey question for which the quality is known a priori. We illustrate its application with two examples. In both, we make use of 12 questions from prior experiments that manipulated text features of questions to create ‘low’ and ‘high’ quality versions of each question. In the first example, we coded each question version in SQP 2.1 to obtain indicators of validity, reliability, and quality. We compared these indicators between the two versions of each question to assess whether the SQP outcomes were sensitive to text features. In the second example, we used a pretest survey to obtain three indicators of survey quality: response latencies, item nonresponse, and consistency over time. Again, we compared these indicators between question versions to assess whether the indicators were sensitive to text features. We give recommendations for applying re-classification and an outlook for future research opportunities.
Introduction
To assess the quality of survey questions, different methods and indicators have been proposed. In survey-based market and social science research, several pretesting methods exist, including cognitive pretesting (e.g., Willis, 2005), expert reviews (e.g., Olson, 2010), split-ballot experiments (e.g., Petersen, 2008; Schuman & Presser, 1981), or eye-tracking studies (e.g., Neuert & Lenzner, 2016). A more general approach is to run a pretest survey (with a comparatively small number of cases) to assess the quality of questions based on indicators such as item nonresponse (e.g., Olson et al., 2019; Shoemaker et al., 2002), non-differentiation (e.g., Krosnick & Alwin, 1988; Roßmann et al., 2018), response latencies (Bassili & Scott, 1996; Hillygus et al., 2014), or attention check questions (e.g., Gummer et al., 2021; Meade & Craig, 2012). The Survey Quality Predictor (SQP; Saris & Gallhofer, 2007) is a program which allows researchers to test questions without performing a dedicated data collection. SQP provides information on three quality indicators: reliability, validity, and question quality. The estimation of these indicators is based on a meta-analysis of multitrait-multimethod (MTMM) experiments testing question formulations and characteristics. To compute the three indicators, researchers can manually enter the question to be tested in the SQP web application and code its characteristics, following the coding instructions, to receive a quality prediction (Zavala-Rojas, 2016).
New pretesting approaches and indicators to assess question quality have been developed regularly, with web probing (e.g., Lenzner & Neuert, 2017; Meitinger, 2017) and more sophisticated non-differentiation indicators (e.g., Kim et al., 2018) receiving considerable attention in recent years. Often, indicators used to assess questions are deduced from theoretical frameworks such as the model of the cognitive response process (Tourangeau et al., 2000) and the satisficing theory (Krosnick, 1991; 1999); most common in survey research are indicators which are considered both indicative of low quality response behavior and the result of specific question characteristics. Examples include non-differentiation, which indicates response behavior that leads to little variation in answers; the share of respondents selecting ‘don’t know’ options as cognitive shortcuts; or response latencies to identify superficial processing of the question. Similarly, eye-tracking studies are used to evaluate the effects of question wording on question comprehensibility (e.g., Kamoen et al., 2017; Lenzner et al., 2011). We argue that survey researchers will continue to propose new indicators and approaches. These new indicators, methods, and approaches need to be evaluated to determine whether they detect what they are supposed to detect. In other words, researchers who deduce and propose these methods must make sure that they successfully identify those issues in questions that cause low data quality. In this regard, it is important to establish which indicators are most effective in investigating a specific problem. This knowledge will enable researchers to make educated decisions as to which indicators to use in their analyses (e.g., in which instances to rely on response latencies or non-differentiation indicators).
In the present study, we focus on how existing and new methods and indicators for question quality can be validated (i.e., we test whether they measure what they are purported to measure). For this purpose, we detail an approach that we term ‘re-classification’ and that we find underused in survey research. In this approach, we evaluate an indicator based on its success in correctly classifying the quality of at least two different versions of a survey question for which the quality is known a priori and which has been experimentally varied. In other words, re-classification assesses whether indicators or methods detect what they are supposed to detect.
Validation through re-classification has several advantages that make it a viable supplement to other methods. First, re-classification can draw on existing experiments that have been developed, evaluated, and used in prior studies. The reuse of existing work is not only prudent in terms of resource-efficient research, but also contributes to replicability in particular and open science in general. Second, drawing on question experiments that consist of versions explicitly designed to represent ‘low’ or ‘high’ quality questions rests on a rather weak assumption: that the design works as intended (e.g., complex question wording is indeed complex). Third, reusing an experiment means that the researcher is fully in control of which aspects are validated in their study. For instance, when selecting questions, one might select those that vary with regard to the complexity of their syntactical structure. In the subsequent analyses, indicators of interest can then be evaluated in terms of how well they detect overly complex syntactical structures in question wordings. Ultimately, the results of these analyses allow researchers to select evaluation methods in a more targeted manner (i.e., to select the indicator most sensitive for detecting an issue of interest).
With the present study, we aim to close an existing research gap and to illustrate the use of re-classification. We show how existing experiments can be reused for this purpose based on two illustrative examples. In the first example, we drew on the well-known SQP program that is used by large-scale surveys such as the European Social Survey (ESS) during questionnaire development and translation processes. We apply the proposed method to show how insights could be gained from re-classification to advance indicators or programs such as SQP. To this end, we drew on 12 survey questions from prior experiments that have one ‘low quality’ and one ‘high quality’ version. We coded each using SQP and analyzed its output. At the time of data analysis, SQP was available to the public as version 2.1 but has now been upgraded to version 3.0 (Weber & Revilla, 2021).
In the second example, we drew on a pretesting survey which implemented the same 12 survey questions as part of a web survey. We then computed two indicators: item nonresponse and response times. In this example, we showcase the use of re-classification to examine the applicability of both indicators when investigating data quality. Here, we analyzed how well the indicators classified the different versions of the survey questions in terms of data quality. Re-interviewing the respondents in a second survey with the same 12 survey questions allowed us to further investigate consistency over time as a third indicator. Again, we analyzed how well this indicator classified the different versions of the 12 survey questions.
Our study is structured as follows. In the next section, we introduce re-classification, then present our two illustrative examples. We close with practical recommendations on how to validate quality indicators through re-classification and an outlook for future research.
Re-Classification
Re-classification relies on using one or more survey questions that are available in at least two versions: one that is of low quality and one that is of high quality, with the difference in quality known a priori (e.g., because it was experimentally induced through a problematic text feature).
Re-classification follows a two-step process detailed in Figure 1: (i) application of the quality indicator to each version of the question, and (ii) comparison of the resulting indicator values between the versions. Re-classification is successful if the indicator differs between the versions.
Figure 1. Schematic of validation through re-classification.
Moreover, for a successful re-classification, this difference should be in the correct direction, that is, it should indicate low quality for the low-quality version of the question and high quality for the high-quality version. If re-classification is successful, this means that the data quality indicator is correctly measuring the aspect of data quality manipulated in the survey experiment (e.g., complexity of question wording). Thus, a successful re-classification provides support that the quality indicator works as intended.
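To make this logic concrete, the following minimal sketch (written in Python; the indicator values and the convention that higher values indicate higher quality are hypothetical assumptions for illustration, not part of the method itself) expresses the comparison underlying re-classification:

# Minimal sketch of re-classification: compare a quality indicator between a
# low-quality and a high-quality version of the same question.
# Indicator values and the 'higher = better' convention are hypothetical.
def reclassification_successful(indicator_low, indicator_high, higher_is_better=True):
    """Return True if the indicator separates the versions in the expected direction."""
    if higher_is_better:
        return indicator_high > indicator_low
    return indicator_high < indicator_low

# Example with made-up values, e.g., an SQP-style quality prediction:
print(reclassification_successful(indicator_low=0.45, indicator_high=0.62))  # True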
Illustrative Examples
Example 1: SQP
SQP is a program developed by Saris and Gallhofer (2007) that can be used to predict the quality of survey questions based on their characteristics without needing to collect new data. SQP uses cumulative data from past MTMM experiments for a meta-analysis to investigate the effect of questions’ characteristics on the reliability and validity of these questions. The idea behind SQP is that a meta-analysis of MTMM experiments can be used together with characteristics of questions to predict the quality of new questions. Researchers need to code the characteristics of the new question following a coding scheme in a web application and then receive SQP-specific measures of reliability and validity, as well as an overall question quality, which is the product of reliability and validity.
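As a hypothetical numerical illustration of this definition (the values are ours and not taken from SQP output), a question coded to have a reliability of 0.80 and a validity of 0.90 would receive a predicted quality of 0.80 × 0.90 = 0.72.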
Data & Method
We reused a set of 12 question pairs from Lenzner et al. (2010) and Lenzner et al. (2011) that had been experimentally manipulated to be either well formulated or not. The low-quality versions (text feature condition) contained problematic linguistic text features (e.g., a vague or ambiguous noun phrase), which have been shown to undermine question comprehensibility, while the questions in the control condition did not contain this text feature (Lenzner et al., 2011). We selected two questions for each of six problematic linguistic text features (low-frequency words, vague or imprecise relative terms, vague or ambiguous noun phrases, complex syntax, complex logical structures, low syntactic redundancy). Vague or imprecise relative terms, for instance, are words like ‘many’ or ‘often,’ which are frequently used in survey questions (Schaeffer & Dykema, 2020). To manipulate the questions in the text feature version, a vague quantification term was, for instance, added to the question. Questions with complex logical structures contain, for example, numerous logical operators like ‘or’ to connect subsentences (see Lenzner et al. (2010) for a detailed description of the different text features and question manipulations). In Table A.1 in the online supplemental material, we provide an overview of the 12 questions and the 24 low- and high-quality versions.
The two authors independently coded the 24 question versions in SQP 2.1 following the coding instructions provided on the SQP website (SQP, 2017). The coding scheme of this SQP version contains more than 60 formal and linguistic characteristics to be coded, such as the polarity of questions, the use of response labels, the position of the question in the questionnaire, or the number of words in the question text and response options. After coding the first questions, the authors discussed experiences and discrepancies in assigning codings to SQP characteristics in order to avoid systematic differences. The average intercoder agreement was 81% but varied between 63% and 96% for the individual questions. To determine the final codes, the two coders discussed instances of mismatching codes and jointly decided on the appropriate codings.
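For readers who wish to compute such an agreement figure themselves, a minimal Python sketch of simple percentage agreement between two coders is shown below; the characteristic names and codes are hypothetical and do not reproduce our actual codings.

# Minimal sketch: percentage agreement between two coders for one question.
# Characteristic names and codes are hypothetical examples, not SQP output.
def percent_agreement(coder_a, coder_b):
    """Share of characteristics coded identically by both coders (in %)."""
    shared = sorted(set(coder_a) & set(coder_b))
    matches = sum(coder_a[c] == coder_b[c] for c in shared)
    return 100 * matches / len(shared)

coder_a = {"polarity": "positive", "labels": "fully labelled", "n_words": 18}
coder_b = {"polarity": "positive", "labels": "partially labelled", "n_words": 18}
print(round(percent_agreement(coder_a, coder_b), 1))  # 66.7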
Results
[Table: SQP Estimated Indicators for Control (C) and Text Feature (TF) Versions of the Questions. Note. C: Control Condition; TF: Text Feature Condition; S: Survey Question; Δ: Difference.]
Overall, we did not find substantial differences between the means of reliability, validity, and quality. In summary, by applying validation through re-classification, we found that the three quality indicators provided by SQP 2.1 were not sensitive to differences in question quality resulting from problematic text features. Accordingly, the re-classification was not successful, and the indicators thus cannot be deemed suitable for detecting the impact of the text features used in the experiments on data quality.
Example 2: Pretesting Web Survey
As a second example, we conducted a web survey in which we implemented the same set of 12 question pairs. We compared the question pairs using item nonresponse and response times as data quality indicators. Respondents who completed the survey were invited to participate in a second survey about 3 weeks later, in which the same questions were administered to assess the consistency over time of the responses.
Data & Method
The pretest survey was conducted between November 11th and November 21st, 2021, with respondents from a German opt-in online panel operated by the survey company Respondi AG. The web survey used quotas for gender, age (18–39; 40–59; 60 and older), and education (with and without university entrance qualification). Overall, 2257 panelists accepted the survey invitation, of whom 64 broke off and 2193 completed the survey. The mean and median questionnaire completion times were 6 minutes 55 seconds and 4 minutes 58 seconds, respectively. Respondents were randomly assigned to either the well-formulated question version or to the version containing one of the six text features. To evaluate possible differences in the sample composition between the two experimental conditions, χ2-tests were conducted. The results showed no statistically significant differences for gender (χ2 = .366; df = 2; p = .833), age (χ2 = 4.442; df = 4; p = .336), or education (χ2 = 3.848; df = 6; p = .697).
The second survey was fielded between November 30th and December 2nd, 2021. Overall, 1732 participants of wave one accepted the second survey invitation, 26 participants broke off, and 1706 completed it (75.6%). Another 12 participants had to be removed from the analyses because they could not be assigned to the first wave, resulting in 1694 completes (833 in the control and 861 in the text feature condition). Of those, 49.4% were female; 28.4% were between 18 and 39 years, 34.0% were between 40 and 59 years, and 37.6% were 60 years and older. Respondents in the second survey did not differ between the two conditions with regard to gender (χ2 = 1.255; df = 2; p = .534), age (χ2 = 4.0; df = 4; p = .406), and educational attainment (χ2 = 3.625; df = 6; p = .727).
All statistical analyses were performed with Stata (version 16.1). To compare response times, we report two sample t-tests and present p values for two-sided tests. When differences were in the hypothesized direction, we interpreted p < .100 as significant (i.e., converting a two-sided test to a one-sided test by dividing the p-value in half for directional differences consistent with our hypothesized direction).
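A minimal Python sketch of this decision rule is given below (the p-value and mean difference are hypothetical; the rule of retaining the two-sided p-value when the difference runs against the hypothesized direction is our assumption for completeness):

# Minimal sketch of the directional testing rule described above.
def directional_p(two_sided_p, observed_diff, hypothesized_sign=1):
    """Halve the two-sided p-value when the observed difference is consistent
    with the hypothesized direction; otherwise keep the two-sided p-value."""
    if observed_diff * hypothesized_sign > 0:
        return two_sided_p / 2
    return two_sided_p

print(directional_p(two_sided_p=0.08, observed_diff=3.2))  # 0.04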
Data Quality Indicators
For item nonresponse, we calculated the proportion of respondents who left a question blank before proceeding to the next question. For each question, we divided the number of item nonresponses in each experimental condition by the sample size of that condition and multiplied by 100. We applied χ2-tests to compare item nonresponse per question and a Kruskal–Wallis rank test to compare overall item nonresponse between conditions.
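As a minimal sketch of this computation (Python, with hypothetical answer vectors in which None marks a skipped question):

# Minimal sketch: item nonresponse rate per question and condition.
def item_nonresponse_rate(responses):
    """Percentage of respondents who left the question blank."""
    missing = sum(1 for r in responses if r is None)
    return 100 * missing / len(responses)

control = [3, None, 2, 5, 4, 1, None, 2]    # hypothetical answers, control condition
text_feature = [4, 2, 2, None, 5, 3, 1, 2]  # hypothetical answers, text feature condition
print(item_nonresponse_rate(control))       # 25.0
print(item_nonresponse_rate(text_feature))  # 12.5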
Response times were measured using server-side response times (in seconds). Respondents whose response times were more than two standard deviations above or below the mean were excluded from the analyses (Mayerl, 2013). Because the distribution of response times was still skewed, we also applied logarithmic transformations (Fazio, 1990).
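A minimal Python sketch of this preparation step, using hypothetical response times in seconds, could look as follows (the exclusion threshold of two standard deviations follows the rule stated above):

# Minimal sketch: exclude response times more than two standard deviations
# from the mean, then log-transform the remaining times to reduce skew.
import math

times = [4.1, 5.3, 3.8, 6.0, 4.6, 45.0, 5.1, 4.9]  # hypothetical times in seconds
mean = sum(times) / len(times)
sd = math.sqrt(sum((t - mean) ** 2 for t in times) / (len(times) - 1))

trimmed = [t for t in times if abs(t - mean) <= 2 * sd]  # drops the outlier 45.0
log_times = [math.log(t) for t in trimmed]
print(trimmed)
print([round(x, 2) for x in log_times])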
To calculate consistency over time across the two web surveys, we analyzed the gross error rate (Poe et al., 1988). The gross error rate was defined as the number of respondents for whom the response to the second survey differed from the response to the first survey, divided by the total number of respondents. We calculated the gross error rate for each question.
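As a minimal Python sketch of this definition (with hypothetical answers of the same respondents in both surveys):

# Minimal sketch: gross error rate for one question across two survey waves.
def gross_error_rate(wave1, wave2):
    """Share of respondents whose answer changed between the two surveys (in %)."""
    changed = sum(1 for a, b in zip(wave1, wave2) if a != b)
    return 100 * changed / len(wave1)

wave1 = [1, 3, 2, 4, 2, 5]  # hypothetical answers, first survey
wave2 = [1, 2, 2, 4, 3, 5]  # hypothetical answers, second survey
print(round(gross_error_rate(wave1, wave2), 1))  # 33.3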
Results
Provided that the quality indicators are sensitive to data quality, the text feature versions should theoretically yield higher item nonresponse, because respondents skip these questions due to their higher complexity (or lower comprehensibility). Similarly, text feature versions should take longer to answer than questions without text features, because of the higher effort needed for cognitive processing, and the consistency of responses over time should be lower. In these cases, re-classification will be successful.
Across all questions, response times were significantly shorter in the control than in the text feature condition (Mc = 1967.8, Mtf = 2301.6, t = −1.4595, p = .008). Cohen’s d indicated a small effect size (d = −.069). Overall, item nonresponse was low, amounting to 3.1% in the control and 1.9% in the text feature condition; nonresponse did not differ significantly. On average, respondents did not respond substantively to .16% of questions with a text feature; however, they failed to offer substantive responses to .26% of questions in the control condition containing no text feature (χ2 = .004; df = 1; p = .952).
The average gross error rate across all 12 questions was significantly higher in the text feature condition (37.4%) than in the control condition (35.4%), indicating that the text feature questions reduced the consistency of responses over time (χ2 = 5.84, df = 1, p = .016).1 On the level of the individual questions, the differences were significantly higher in the text feature condition in four out of 12 questions (S4: χ2 = 8.60, df = 1, p = .003; S6: χ2 = 5.03, df = 1, p = .025; S9: χ2 = 5.00, df = 1, p = .025; S11: χ2 = 7.37, df = 1, p = .007).
In summary, when applying re-classification we found that response latencies and consistency over time were effective in the detection of low or high question quality, when this difference in quality was the result of text features. We found that item nonresponse was not sensitive to these variations in our study. The results of our re-classification for item nonresponse suggest that further analyses are warranted to study what caused these issues (e.g., the omission of a “don’t know” category leading to respondents providing non-substantive responses). In general, the re-classification analysis determined that response latencies and consistency over time were appropriate indicators to detect issues in questions that derive from text features.
Conclusion
With the present study, we introduced and illustrated re-classification as a method for evaluating data quality indicators. We argue that this method has been underused in survey-based market and social science research to date, despite it representing a legitimate complement to traditional validation methods. Re-classification involves evaluating an indicator based on how well it detects the quality of (at least) two different versions of a survey question for which the quality is known a priori.
We have presented the application using two examples to illustrate that not every indicator is suitable to detect every aspect of data quality, and thus re-classification may be an easily implementable method to determine these limitations. Our first illustrative example showed that SQP version 2.1 has issues detecting changes in question wordings that impaired question comprehensibility. Overall, the differences in the indicators obtained via SQP were very small and did not allow for a differentiation in quality. In our second illustrative example, we assessed how three data quality indicators (item nonresponse, response latencies, and consistency over time) performed in detecting the same quality issue in the same 12 survey questions based on data from a pretest survey. Analyzing the pretest data showed that response latencies and consistency over time were indicative of the differences in the questions’ quality, while item nonresponse was not.
In summary, re-classification can be used to determine how well an indicator is suited to assessing a specific aspect of question quality. Consequently, the method can help us to better understand the still growing landscape of data quality indicators, and specifically to validate selected indicators. When quality indicators or pretesting methods are compared, diverging or even contradictory findings may arise. Consider response latencies as an example: question difficulties could lead to slower response times, because solving the task requires processing time, but also to faster responses (e.g., speeding) when respondents engage in satisficing response behavior by not taking enough care with their answers (e.g., Bassili & Scott, 1996; Zhang & Conrad, 2014). The proposed method can thus help researchers select indicators for specific research questions on a fit-for-purpose basis. Moreover, re-classification can draw on existing survey experiments that systematically vary question characteristics, and thus promotes the re-use and replication of previous studies. Not only does this fit well with the increasing demand for open science practices in market and social science research, but it also encourages resource-efficient research practices that utilize existing work.
The present study is intended to introduce a method we perceive as underused in survey-based market and social science research and to illustrate its application. We invite further research to apply the re-classification approach to existing indicators and research questions. Contributing experimental research and knowledge on data quality indicators will help researchers select the appropriate indicator(s) or method(s) for their specific research goal. Besides enriching our understanding of existing indicators, we see special merit in the proposed method when it comes to developing new data quality indicators. Here, re-classification can be applied to gauge which aspects of question quality an indicator is genuinely able to measure, to establish where further development is needed, or to identify an indicator’s limitations.
Acknowledgments
We would like to thank Ranjit Singh for his thoughtful comments and suggestions that helped to improve the manuscript and Timo Lenzner for providing the translations of the questions in English.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Funded by the Deutsche Forschungsgemeinschaft—Project number 491156185.
Supplemental Material
Supplemental material for this article is available online.