Abstract
In this study, I examine data-quality evaluation methods in online surveys and their frequency of use. Drawing from survey-methodology literature, I identified 11 distinct assessment categories and analyzed their prevalence across 3,298 articles published in 2022 from 200 psychology journals in the Web of Science Master Journal List. These English-language articles employed original data from self-administered online questionnaires. Strikingly, 55% of articles opted not to employ any data-quality evaluation, and 24% employed only one method despite the wide repertoire of methods available. The most common data-quality indicators were attention-control items (22%) and nonresponse rates (13%). Strict and unjustified nonresponse-based data-exclusion criteria were often observed. The results highlight a trend of inadequate quality control in online survey research, leaving results vulnerable to biases from automated response bots or respondents’ carelessness and fatigue. More thorough data-quality assurance is currently needed for online surveys.
Similar to other scientific fields, psychology relies on research-data quality to establish dependable conclusions. In psychological research, data quality typically hinges on participants’ willingness and capability to offer truthful and precise answers. People might decline research participation or, more concerning, partake but submit responses that are biased or entirely untrue—stemming from misunderstanding, negligence, or deliberate deceit. Naturally, researchers must exert effort to avert, identify, and address such responses. In this article, I focus on prevailing practices within psychological science for identifying low-quality data in self-administered online surveys.
Modern electronic self-administered surveys, hosted on online platforms (e.g., Qualtrics, Amazon Mechanical Turk, SoSci Survey, Survey Monkey, or Google Forms), have become the prevailing method for collecting survey data. In psychology research, surveys commonly incorporate self-report scales and questionnaires. Robins et al. (2009) even anticipated that self-report measures would dominate personality-psychology assessments. However, because of the typical lack of an administrator, online survey responses often unfold without monitoring or corrective feedback. In such situations, various validity-related data concerns are more likely to arise than in controlled environments with an administrator present.
Both theoretical and empirical evidence indicates that careless responses can reduce the power of statistical tests, bias survey outcomes, and even lead to erroneous conclusions if they are not identified and removed from analyses (Huang et al., 2015; Oppenheimer et al., 2009). Consequently, evaluating data quality constitutes a sound research practice. Numerous articles have already compiled convenient summaries of diverse methods suitable for online-survey design, detailing their presumed effectiveness, merits, and drawbacks (see Brühlmann et al., 2020; Curran, 2016; Goldammer et al., 2020; Meade & Craig, 2012; Niessen et al., 2016). However, the popularity or actual utilization rates of these methods in empirical research are currently unknown.
This lack of information is concerning because the actual adoption of data-quality assessment methods significantly empowers researchers to manage specific biases arising from flawed or substandard data. For instance, response-time analysis best detects hurried respondents, and inattentive respondents consistently favoring a specific response option are effectively identified through long-string indices or answer-variability analysis. Respondents who neglect instructions can be discerned using instructional manipulation checks or control (bogus) items, among other techniques. Estimating the prevalence of methods employed for data-quality assessment would offer insights into the potential—or lack thereof—of online-survey research to identify and exclude compromised-quality data and be resilient to associated bias. However, note that “potential” is the crucial term here because the utilization of any data-quality evaluation method does not inherently ensure a positive impact in terms of validity or reliability. Effective implementation requires researchers to establish sensible exclusion criteria suitable to the study’s context and design. Indeed, applying overly stringent cutoff rules for any method can disrupt both validity and reliability if excessive data are erroneously discarded. Consequently, the number of distinct methods used in a study to evaluate data quality solely reflects the potential for addressing various biases rather than guaranteeing actual efficacy in doing so.
Nevertheless, it is important to describe the prevailing practices in evaluating data quality in online self-administered psychology surveys. A description of the current state of the art is a prerequisite for gauging the collective capability to counteract biases stemming from inadequate data quality and for pointing out potential deficiencies. A suitable way to achieve this objective is to document the utilization of data-quality assessment methods across empirical research articles featured in peer-reviewed journals.
Research Question
What types of methods are used for data-quality evaluation in online self-administered psychology surveys, and how often?
Disclosures
Preregistration, data, and materials
This study was preregistered. The preregistered document, data, analytic script, and all materials are available at https://osf.io/yxzm2/. Note that the preregistered study plan includes a survey sent to the contact authors of the articles. The survey queries them regarding potential data-quality evaluation methods that might have been employed but remained unreported in their articles. However, a substantial portion of the reviewed articles features a disclosure statement concerning the comprehensiveness of information regarding data manipulation, analysis, and exclusion criteria. Consequently, unreported utilization of data-quality evaluation methods was estimated to be minimal, and as a result, the survey was deemed unnecessary and was not conducted.
Reporting
The method for selecting the sample, all data exclusions, all manipulations, and all measures in the study are fully reported.
Ethical approval
The study protocol was approved by the institutional review board and was carried out in accordance with the Declaration of Helsinki.
Method
Sample
To secure a representative article sample, suitable source journals had to be defined. These journals needed to satisfy the following criteria: (a) inclusion in the Web of Science Master Journal List, (b) categorization in the field of psychology, and (c) publication in English. Among the 483 journals meeting these conditions, 200 (41%) were chosen at random. The chosen journals provided the source of articles for subsequent analysis. The journal sample size was set with the overall coverage of the indexed journals and the expected time needed for coding in mind. Arguably, 41% coverage should allow for broader generalization of the results while leaving 200 journals feasible for the author, as the sole rater, to code within a 1-year time frame.
To qualify for analysis, journal articles had to meet certain criteria. Specifically, a portion of their data had to be both original and collected through unsupervised, electronic self-administered surveys—without a physical or virtual presence of an administrator. Online surveys administered in a supervised environment (e.g., in a computer lab, in a cubicle) were deemed ineligible for this review because certain methods of data-quality evaluation, such as IP address checks or captcha items, are not reasonably applicable when the survey is administered under supervision. Furthermore, the articles had to be published in English in a 2022 journal issue. Following a large-scale review of articles published in the selected 200 journals, 3,298 eligible articles were identified and included in the study sample.
Measures
Codebook
A codebook was written to classify various data-quality evaluation methods into distinct categories based on the data aspect they target. The foundation for these categories was derived from works by Curran (2016) and Meade and Craig (2012). Each category was to be dichotomously coded to indicate whether the respective method is used for data-quality assessment in the article (1) or not (left blank). During the course of this study, the codebook was revised two times, and categories were added, merged, or dropped. For the sake of brevity, only the final codebook version is described here. However, all codebook versions and the revision protocols providing detailed justifications for changes are available in the online resources.
Codebook reliability was tested before and after the data coding. Each time, three raters independently read and coded a random sample of articles. Interrater reliability was assessed with percentage agreement and Fleiss’s kappa. However, note that Fleiss’s kappa becomes unreliable and susceptible to bias when the occurrence of a method is extremely rare; such cases are typically identified by high percentage agreement paired with low Fleiss’s kappa. Table 1 shows the estimates of interrater reliability for the preliminary version of the codebook (pretest) and the final version (posttest). Focusing on the reliability of the final codebook categories, Table 1 shows that the raters reached at least fair agreement on most categories. Almost all categories reached good (> 90%) agreement; only missing rates showed poor agreement (63% agreement, κ = .16), and control items showed acceptable agreement (87% agreement, κ = .72). The poor performance of the missing-rates coding category is explained in the Results section.
Table 1. Interrater Reliability
Note: “Observed” is the total occurrence across the ratings of three independent raters. NA = not applicable.
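For illustration, both indices can be computed directly from the raters’ dichotomous codes. The following is a minimal Python sketch using made-up ratings from three raters; it is not the script used in this study.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss's kappa for a subjects x categories matrix of rating counts."""
    counts = np.asarray(counts, dtype=float)
    n_sub = counts.shape[0]
    n_rat = counts.sum(axis=1)[0]                 # raters per subject (assumed constant)
    p_j = counts.sum(axis=0) / (n_sub * n_rat)    # marginal category proportions
    P_i = (np.sum(counts ** 2, axis=1) - n_rat) / (n_rat * (n_rat - 1))
    P_bar, P_e = P_i.mean(), np.sum(p_j ** 2)
    return (P_bar - P_e) / (1 - P_e)

# Toy data: three raters dichotomously code six articles (1 = method used, 0 = not)
ratings = np.array([
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
    [1, 0, 0],
    [1, 1, 1],
])
# Per-article counts of 0s and 1s, as required by the kappa formula
counts = np.column_stack([(ratings == 0).sum(axis=1), (ratings == 1).sum(axis=1)])

# Percentage agreement here means the share of articles on which all raters agreed
pct_agreement = np.mean(ratings.min(axis=1) == ratings.max(axis=1))
print(f"agreement = {pct_agreement:.0%}, kappa = {fleiss_kappa(counts):.2f}")
```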
Final codebook categories
Response time
Response time encompasses measurements of time spent or latency in providing single or multiple answers or, more broadly, engaging in the study. Responses with excessively brief or prolonged latencies generally signify potential issues in responding. Responding overly swiftly leaves insufficient time for question comprehension, whereas extended response times might indicate respondent struggles, such as confusion or distraction. Extremely rapid or slow responses tend to carry a higher likelihood of bias or invalidity (e.g., Meade & Craig, 2012; Yan & Olson, 2013).
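For illustration, the following minimal Python sketch applies one common screening heuristic; the column names, the 2-seconds-per-item floor, and the three-times-median ceiling are hypothetical choices rather than recommended cutoffs.

```python
import pandas as pd

# Hypothetical data: total survey duration in seconds and number of items answered
df = pd.DataFrame({
    "respondent": ["r1", "r2", "r3", "r4"],
    "duration_s": [95, 610, 48, 1900],
    "n_items":    [40, 40, 40, 40],
})

# Flag anyone averaging under ~2 s per item as suspiciously fast and anyone
# taking more than three times the median duration as suspiciously slow.
df["sec_per_item"] = df["duration_s"] / df["n_items"]
too_fast = df["sec_per_item"] < 2
too_slow = df["duration_s"] > 3 * df["duration_s"].median()
df["flag_speed"] = too_fast | too_slow
print(df[["respondent", "sec_per_item", "flag_speed"]])
```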
Respondents’ self-reported study engagement, study knowledge, or answer validity (self-report)
This category includes self-reported aspects of study engagement, motivation, attention, and the sincerity and precision of respondents’ own answers. This approach possesses a distinctive capability to reveal threats to data validity, notably in low-stakes scenarios in which respondents are less likely to deny subpar performance. However, it leans on entirely subjective criteria unique to each respondent (Curran, 2016). The evaluation of respondents’ self-reported lack of awareness about the study’s purpose or design also belongs to this category.
Control of multiple survey submissions by a single respondent (multiple submissions)
The category consists of methods to detect and potentially forestall undesired instances of a single respondent submitting the survey multiple times. The study should explicitly detail the control measures for multiple submissions. Reips (2002) contended that although multiple submissions generally pose minimal risk to overall study validity, employing techniques to manage them has the potential to heighten data quality in online research.
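For illustration, one simple implementation flags repeated combinations of submission metadata such as IP address and browser fingerprint. The sketch below uses hypothetical column names and data.

```python
import pandas as pd

# Hypothetical submission log with an IP address and a browser fingerprint
subs = pd.DataFrame({
    "submission_id": [1, 2, 3, 4, 5],
    "ip":          ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.1"],
    "fingerprint": ["aaa", "bbb", "aaa", "ccc", "ddd"],
})

# Keep the first submission per (ip, fingerprint) pair and flag later ones as
# potential duplicates. Shared networks can produce false positives, and IP
# manipulation can produce false negatives, so flags warrant manual review.
subs["possible_duplicate"] = subs.duplicated(subset=["ip", "fingerprint"], keep="first")
print(subs)
```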
Comparison with data from a second source (cross-check)
Comparing self-reported data with data derived from an alternative source serves as a valuable validation mechanism because disparities can signal potentially problematic self-reported data, for example, when the self-reported age of a respondent differs from the respondent’s age in an official registry. Although this method is not frequently referenced in survey literature and might be scarcely employed in survey research, it still has utility as a scientific approach for data validation.
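For illustration, such a cross-check can be implemented as a simple merge of the survey data with the second source; the identifiers, ages, and one-year tolerance below are hypothetical.

```python
import pandas as pd

# Hypothetical survey data and a second data source sharing an identifier
survey = pd.DataFrame({"id": [1, 2, 3], "age_selfreport": [25, 31, 47]})
registry = pd.DataFrame({"id": [1, 2, 3], "age_registry": [25, 29, 47]})

merged = survey.merge(registry, on="id")
# Flag respondents whose self-report deviates from the second source by more than
# one year (a tolerance chosen here only to allow for birthdays between records).
merged["flag_mismatch"] = (merged["age_selfreport"] - merged["age_registry"]).abs() > 1
print(merged)
```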
Consistency of answers (consistency)
The methods within the consistency category can manifest in various ways. For instance, this might involve comparing a respondent’s responses on reverse-coded scale items with the respondent’s responses on the remaining scale items or contrasting answers on two items with identical or closely related meanings. When a respondent’s answers diverge from the anticipated pattern of answers based on theoretical assumptions or the characteristics of the responses obtained on the rest of the research sample, the validity of the respondent’s response might be challenged. Other approaches include person-total correlations or polytomous Guttman errors. For a condensed overview of this method type, see Curran (2016) or Meade and Craig (2012).
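For illustration, the sketch below computes a person-total correlation, one consistency indicator of this kind, by correlating each respondent’s answer vector with the sample’s item means; the data and any cutoff applied to the resulting values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical responses to six Likert items (1-5); rows are respondents
X = pd.DataFrame(
    [[5, 2, 5, 2, 5, 2],
     [1, 5, 2, 5, 1, 5],   # runs against the sample's typical answer pattern
     [4, 1, 4, 1, 4, 1],
     [5, 2, 4, 2, 5, 1]],
    columns=[f"item{i}" for i in range(1, 7)],
)

# Person-total correlation: correlate each respondent's answer vector with the
# vector of sample item means. Low or negative values suggest answers that run
# against the typical pattern and may warrant closer inspection.
item_means = X.mean(axis=0)
person_total_r = X.apply(lambda row: np.corrcoef(row, item_means)[0, 1], axis=1)
print(person_total_r.round(2))
```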
Control/infrequency/bogus items (control items)
Within this category, certain items may pertain to events, issues, attitudes, and so on that are exceedingly rare, occasionally nonsensical, or even implausible—thus making agreement highly improbable. Conversely, some items might relate to matters that are exceedingly likely, common, or hard to disagree with (Curran, 2016, pp. 13–15). These items effectively function as attention checks, enabling researchers to readily pinpoint respondents who select unlikely or irrational responses, presumably because of inattentiveness. This category also encompasses captcha items for detecting automatized bot responses, instructional manipulation checks (see Oppenheimer et al., 2009), and a broad range of rudimentary assessments of respondents’ attention, knowledge, reasoning, or comprehension.
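For illustration, scoring such checks typically amounts to counting how many of them a respondent fails; the items, coding, and any exclusion rule in the sketch below are hypothetical.

```python
import pandas as pd

# Hypothetical responses (1-5 Likert) to two embedded checks:
#   ic1:    "Please select 'strongly agree' for this item" -> correct answer is 5
#   bogus1: "I was born on the planet Mars"                -> agreement is implausible
df = pd.DataFrame({
    "respondent": ["r1", "r2", "r3"],
    "ic1": [5, 3, 5],
    "bogus1": [1, 4, 2],
})

failed_instruction = df["ic1"] != 5
endorsed_bogus = df["bogus1"] >= 4
df["n_failed_checks"] = failed_instruction.astype(int) + endorsed_bogus.astype(int)
# One possible (study-specific) rule: review or exclude respondents failing any check.
print(df[["respondent", "n_failed_checks"]])
```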
Outliers, residuals, extraordinarily low/high scores (outliers)
Outlier analysis is based on the premise that biased or inattentive responses render a respondent a univariate or multivariate outlier. Outlier analysis can employ relative criteria (e.g., standardized scores, Mahalanobis distances, or comparison with other participants’ task performance) or absolute criteria (e.g., raw cutoff scores). For further insights, refer to Curran (2016, pp. 8–9).
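For illustration, the sketch below computes squared Mahalanobis distances on simulated data and flags respondents beyond a chi-square quantile; the .999 cutoff is an illustrative convention, not a fixed rule.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # hypothetical scale scores for 200 respondents
X[0] = [6, -6, 6, -6, 6]             # plant one blatant multivariate outlier

mean = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mean
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)   # squared Mahalanobis distances

# Under multivariate normality, d2 follows a chi-square distribution with p degrees
# of freedom; the .999 quantile is a conventional but discretionary cutoff.
cutoff = chi2.ppf(0.999, df=X.shape[1])
print(np.where(d2 > cutoff)[0])      # indices of flagged respondents
```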
Missing answer rates (missing rates)
An excessive number of missing answers implies respondents’ reduced motivation or their reluctance to answer. Furthermore, it can indicate other challenges respondents might face while answering, potentially jeopardizing response validity. This convenient metric of data quality is frequently overlooked in articles discussing methods for survey data-quality evaluation. Nevertheless, missing-answer rates yield valuable insights, warranting inclusion among the other methods detailed herein.
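For illustration, a per-respondent missing rate is straightforward to compute, and stating an explicit threshold makes the exclusion rule transparent; the 50% threshold in the sketch below is an arbitrary example rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical item responses with NaN marking unanswered items
df = pd.DataFrame({
    "item1": [4, np.nan, 5, 2],
    "item2": [3, np.nan, np.nan, 2],
    "item3": [5, 1, np.nan, 3],
    "item4": [4, np.nan, np.nan, 2],
})

missing_rate = df.isna().mean(axis=1)   # proportion of unanswered items per respondent
# Instead of dropping anyone with a single missing answer (listwise deletion), an
# explicit and more forgiving rule might exclude only respondents missing >50% of items.
flag = missing_rate > 0.5
print(pd.DataFrame({"missing_rate": missing_rate, "flag": flag}))
```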
Variability of answers, long-string indices (variability)
Responses from respondents displaying unusually low or high variability warrant closer examination. This category includes diverse indicators, with the simplest being the within-subjects variance of responses. Another is the long-string index, representing the lengthiest unbroken sequence of identical responses within a respondent’s data. Respondents repeatedly offering identical or highly similar answers may arouse suspicion of careless responding. In general, any method analyzing response variability or patterns qualifies for this category. For in-depth insights, consult Curran (2016, pp. 6–8), Goldammer et al. (2020, p. 2), or Gottfried et al. (2022).
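For illustration, the sketch below computes a long-string index and the within-person standard deviation on hypothetical Likert responses.

```python
import itertools
import pandas as pd

def long_string(responses):
    """Length of the longest run of identical consecutive answers."""
    return max(len(list(run)) for _, run in itertools.groupby(responses))

# Hypothetical responses to ten Likert items (1-5)
X = pd.DataFrame([
    [3, 4, 2, 5, 3, 4, 2, 1, 4, 3],
    [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],   # straight-lining respondent
    [5, 4, 5, 4, 5, 5, 4, 4, 5, 4],
])

longest_run = X.apply(long_string, axis=1)
within_sd = X.std(axis=1)             # intra-individual variability
print(pd.DataFrame({"longest_run": longest_run, "within_sd": within_sd}))
```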
Open answer
This method is based on evaluating the quality of open-text responses. Even when an item is not specifically designed to assess data quality, researchers might still exclude respondents who provide nonsensical, implausible, excessively concise, vague, plagiarized, or irrelevant text answers.
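For illustration, simple heuristics such as answer length and verbatim duplication can shortlist open-text answers for manual review; the data and the three-word minimum below are hypothetical.

```python
import pandas as pd

# Hypothetical open-text answers
answers = pd.DataFrame({
    "respondent": ["r1", "r2", "r3", "r4"],
    "open_text": [
        "I usually study in the evening because the library is quiet then.",
        "asdf",
        "good",
        "I usually study in the evening because the library is quiet then.",
    ],
})

too_short = answers["open_text"].str.split().str.len() < 3     # fewer than three words
duplicated_text = answers["open_text"].duplicated(keep=False)  # verbatim copies
answers["flag_open"] = too_short | duplicated_text             # shortlist for manual review
print(answers[["respondent", "flag_open"]])
```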
Other
This category encompasses custom or otherwise specific methods that did not fit into the previous categories. The prerequisite for classification is that a method has to be actively used to evaluate the quality of respondents’ answers.
Procedure
In the selected journals, full texts of all empirical articles in issues dated to 2022 were accessed. For journals published by the American Psychological Association (APA), preprint versions of articles were accessed (where available) because of the lack of a subscription. Consequently, a portion of articles published in APA journals could not be reviewed because of inaccessibility (estimated as 20%–33% of APA articles).
As the first step, articles not containing any instance of the words “online,” “survey,” or “questionnaire” were considered unlikely to be online-survey studies and were not investigated further. Second, I read the method section of each article and evaluated whether the article fit the criteria for analysis (see Sample). For each eligible article, I read the method and results sections and recorded the utilization of data-quality evaluation methods per the codebook rules. To enhance coding speed, relevant word parts were automatically highlighted in the text—“attent” (for respondent attention), “check” (for quality/validity checks), “delet” (for data deletion), “excl” (for data exclusion), and “remov” (for data removal).
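For illustration, the keyword screening and word-stem highlighting described above can be approximated with a few regular expressions; the following is a simplified sketch of that logic rather than the actual workflow used in this study.

```python
import re

KEYWORDS = re.compile(r"online|survey|questionnaire", re.IGNORECASE)
HIGHLIGHT = re.compile(r"attent|check|delet|excl|remov", re.IGNORECASE)

def screen_article(full_text):
    """Return None for articles unlikely to report an online survey; otherwise
    return the coding-relevant word stems found, to speed up manual review."""
    if not KEYWORDS.search(full_text):
        return None
    return sorted({m.group(0).lower() for m in HIGHLIGHT.finditer(full_text)})

print(screen_article(
    "Participants completed an online questionnaire; two attention checks were used."
))
```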
Results
Utilization of data-quality evaluation methods
Regarding data-quality assessment, a considerable portion of studies refrained from using any method at all (see Figure 1). Specifically, 55% of studies did not employ any method, 24% employed one method category, 13% adopted two method categories, 5% integrated three method categories, and a mere 2% used four or more method categories. Put differently, only 20% of studies incorporated two or more distinct methods.

Figure 1. Utilization of data-quality evaluation methods.
Table 2 shows utilization rates of method categories for data-quality evaluation. Predominantly, control items designed to gauge respondent attention emerged as the most widespread method, employed in 22% of all reviewed articles. Relatively common as well was the exclusion of respondents because of nonresponse issues, applied in 13% of cases. Occasionally employed methods include response-time analysis (9%), outlier-score examination (8%), and identification of multiple survey submissions by a single respondent (7%). Utilization rates of other methods were very low (< 5%).
Table 2. Methods Used for Evaluating Data Quality
Note: N = 3,298.
Relevant observations
Multiple survey submissions
Identification of duplicate or multiple survey submissions by an individual respondent was documented based on codebook guidelines. However, there are reasons to believe that the utilization rate of this particular category is susceptible to underestimation. This stems from the recording approach, restricted to instances in which the method was explicitly reported in the article. Yet the capacity to deal with multiple submissions might inherently result from a study’s design. For instance, longitudinal studies necessitate respondent identification for within-subjects data linkage. Specific study designs or data-collection platforms might inherently prevent multiple submissions, although this fact could remain unnoticed because of a lack of explicit reporting. Hence, this review likely underestimates the true utilization prevalence of methods in the multiple-submission category.
Missing rates
Throughout the review, a noteworthy observation surfaced regarding the handling of nonresponse-based data exclusion. A significant proportion of studies employed listwise data deletion with an exceptionally stringent threshold—requiring 100% survey completion. This meant exclusion of respondents’ data from analysis if they left even a single survey item unanswered. Furthermore, the article descriptions often lacked clarity on whether data exclusion was aimed at enhancing data quality or bolstering analysis robustness against potential missing-data issues. After reviewing the codebook testing, I determined that this vagueness about the purpose of missing-data-based exclusion is the most likely cause of the poor agreement on missing-rates ratings across raters.
The general tendency for researchers to opt for listwise data exclusion upon any instance of nonresponse is of particular concern. Unfortunately, stringent data-exclusion strategies based on nonresponse were rarely accompanied by clear rationale or justification. Only a fraction of reviewed articles took a more elaborate approach in this regard.
Discussion
This expansive review of online psychology survey studies describes the infrequent practice of evaluating research data quality. More than half of the examined articles sidestepped data-quality evaluation altogether. Roughly a quarter employed just one type of data-quality evaluation method, and about a fifth used two or more types of methods. As a result of an alarming lack of data-quality control, a substantial majority of published findings remain vulnerable to biases arising from low-quality data. Such biases limit the accuracy of research findings and might hinder their replicability.
Within the subset of articles that did address data quality, attention-check control items, such as bogus items or instructional manipulation checks (see Curran, 2016; Oppenheimer et al., 2009), emerged as the preferred method. In addition, excluding data based on nonresponse was relatively common. However, a prevalent approach was all-encompassing exclusion of respondent data upon any instance of missing data, often devoid of substantiation for such rigid criteria. Although listwise data deletion may appear appealing because of its simplicity, it leads to biased estimates and inflated standard errors if data are not missing completely at random (Schafer & Graham, 2002, pp. 155–157). This aspect of data-exclusion practice in particular calls for enhancement within psychology research. Indiscriminate data exclusion not only undermines efficiency and introduces bias in results but also raises ethical concerns about discarding potentially valid data. Researchers should be required to explicitly clarify whether missing-data-based exclusions are driven by data-quality concerns or are an aspect of the chosen analytical approach.
Fewer than 10% of studies used response times as a data-quality indicator despite online platforms often allowing recording of survey- and item-response durations. Overlooking this relatively simple metric deprives researchers of valuable insights. Likewise, the identification of multiple survey submissions by individual respondents remained infrequent, although there is a potential for underestimating the actual prevalence (see Relevant Observations above). Moreover, as a reviewer noted, multiple submissions in online surveys can be made relatively easily if respondents employ any kind of IP-address manipulation. On the other hand, submissions from respondents sharing the same IP network (be it at school, home, or work) might be incorrectly labeled as multiple submissions. Altogether, this makes identification of multiple submissions based on IP addresses an easily available, albeit quite unreliable, method on its own. Finally, utilization of data-quality evaluation methods based on answer variability or consistency was rare (< 5%) despite the availability of numerous methods for this purpose—such as long-string indices, psychometric synonyms/antonyms, or interitem standard deviation (see Curran, 2016; Meade & Craig, 2012).
Overall, review findings show that the availability of data-quality evaluation methods does not directly translate into the frequency of their actual utilization. The majority of online survey studies reviewed in this article did not evaluate data quality at all, and among those that did, attention-check items and listwise deletion of cases with missing data were the favored methods. Conversely, more analytically difficult approaches, such as assessing response times, outliers, answer variability, or consistency, were relatively scarce. This trend could stem from researchers hesitating to employ methods necessitating self-defined data-exclusion thresholds, unlike attention-check items and listwise data deletion, which generally offer straightforward cutoff criteria.
Given the random sampling and the coverage of more than 40% of psychology journals indexed in the Web of Science Master Journal List, the sample of articles is likely representative of online survey studies typically published in peer-reviewed influential journals. Consequently, the findings can be reasonably generalized to the prevailing data-quality evaluation practices in psychology research.
Conclusion
The observed usage of methods for evaluating data quality in online self-administered surveys is low; more than half of the studies did not inspect the quality of their research data. The most frequently used indicators for data exclusion are designed control items and nonresponse rates, although researchers often provide poor justification for listwise data exclusion based on missing answers. The current state of practice in this regard could be improved by motivating researchers to inspect the quality of collected data. There are plenty of possible approaches to reach such a goal, such as journal editors and reviewers taking the issue of data quality more into consideration when reviewing articles, granting badges for studies that do assess the quality of their data, and promoting tutorials on how to use different methods for evaluating data quality. This study provides the first necessary step in this regard—it identifies issues in the current data-quality evaluation practices in psychology research and hopefully raises awareness about the subject.
Acknowledgements
I thank Hynek Cígler, Vít Suchý, and Martin Tancoš for their help with the pretest and the posttest of the codebook and for their helpful suggestions and remarks. To improve readability of this article, the text was revised based on feedback from ChatGPT (OpenAI, 2023).
Transparency
Action Editor: Pamela Davis-Kean
Editor: David A. Sbarra
Author Contributions
