Abstract
A recent article in Perspectives on Psychological Science (Webb & Tangney, 2022) reported a study in which just 2.6% of participants recruited on Amazon’s Mechanical Turk (MTurk) were deemed “valid.” The authors highlighted some well-established limitations of MTurk, but their central claims—that MTurk is “too good to be true” and that it captured “only 14 human beings . . . [out of] N = 529”—are radically misleading, yet have been repeated widely. This commentary aims to (a) correct the record (i.e., by showing that Webb and Tangney’s approach to data collection led to unusually low data quality) and (b) offer a shift in perspective for running high-quality studies online. Negative attitudes toward MTurk sometimes reflect a fundamental misunderstanding of what the platform offers and how it should be used in research. Beyond pointing to research that details strategies for effective design and recruitment on MTurk, we stress that MTurk is not suitable for every study. Effective use requires specific expertise and design considerations. Like all tools used in research—from advanced hardware to specialist software—the tool itself places constraints on what one should use it for. Ultimately, high-quality data is the responsibility of the researcher, not the crowdsourcing platform.
Amazon’s Mechanical Turk (MTurk) is a crowdsourcing platform used widely across the social and behavioral sciences (Anderson et al., 2019; Buhrmester et al., 2018; Crump et al., 2013; Mason & Suri, 2012; Paolacci & Chandler, 2014; Paolacci et al., 2010; Zallot et al., 2021). Webb and Tangney (2022; henceforth W&T) described a study conducted on MTurk involving a mental-health questionnaire. Their view is that MTurk is “too good to be true” because of “bots and bad data” and that only 2.6% of their total sample (14 of N = 529) were “human beings.” W&T admit that their report is “not an empirical assessment of validity of all MTurk data” (p. 4), but it has been received as implying just that. Discussion of the study (for instance, see tweets at https://sage.altmetric.com/details/138168456) has interpreted the specific “2.6%” statistic as applying to MTurk in general, interpreting W&T’s conclusion as “a cautionary tale on the value of MTurk samples” or a “warning call” against MTurk.
W&T identify a core issue as “ambiguity” and “opacity” surrounding MTurk data quality. Despite the unique challenges of online research, there are effective strategies researchers can employ to ensure data quality. Although these strategies are documented in a substantial literature, this does not mean they are straightforward to understand or implement, or that they are unchanging; the best practice for any method, like science itself, evolves. Like other tools in psychological science, the effective use of MTurk or other online crowdsourcing tools is complex. MTurk is only “too good to be true” if researchers have misconceptions about the considerable labor involved in collecting high-quality data online or the time and effort needed to develop the requisite expertise. The net costs of using MTurk ethically and rigorously are not necessarily lower than using in-person studies (and may sometimes be more), but in any case, the costs are different: The costs of up-front investment in experimental design and data-quality monitoring counterbalance the benefits of quicker data collection and access to a more diverse participant pool (Moss et al., 2020; Rodd, 2023).
In other words, there is a substantial burden associated with collecting high-quality behavioral data online; however, we argue that this burden lies on researchers, not recruitment platforms. We should not expect these platforms—which are involved in data collection for a wide range of sectors—to have interests that align completely with those of academic researchers, and we should want to maintain rigorous control over our own data quality as part of a comprehensive research pipeline.
We aim to explain why the estimated 2.6% participant viability in W&T’s study is unusually low and how some claims in W&T’s article are indicative of widespread misunderstanding of what MTurk is or how to use it. W&T mention some strategies drawn from organizational psychology to ensure data quality (Keith et al., 2017) but not the vast literature on using MTurk effectively, including specific guidance for the focus of their study: mental health and clinical psychology (Agley et al., 2022; Chandler et al., 2020; Ophir et al., 2020; Zorowitz et al., 2023). Our discussion references the robust and growing literature on MTurk, but this is not a tutorial on how to run a study online—that would require considerable depth (data quality has various complex roots in design and recruitment strategies) and breadth (design and recruitment must adapt to fit a specific study’s needs). This has received book-length treatment elsewhere (e.g., Litman & Robinson, 2020), and there are many shorter guides to crowdsourcing (Bauer et al., 2020; Hauser et al., 2019; Rodd, 2023; Zallot et al., 2021). Instead, we offer mental models of MTurk (or crowdsourcing platforms in general) to guide researchers’ expectations.
Too Good to Be True?
W&T assessed their 529 respondents on six criteria, with the number of respondents lost to each criterion in parentheses: (a) eligibility (−193); (b) performance on a consent quiz (−136); (c) noncompletion (−60; 44 abandoned the task, 16 clicked through directly to payment); (d) failing attention checks (−16); (e) response time (−47); and (f) unusual answers to open-ended questions such as “Who are you? Write ten sentences below describing yourself as you are today” (−77).
W&T frame their 2.6% claim in terms of validity, but validity typically refers to properties of a measure (e.g., whether it captures the phenomenon of interest), often evaluated alongside reliability (e.g., the consistency of a measure across contexts). As validity is generally not a property of participants, we refer instead to those who were “ineligible” (and did not participate) and those who were “unviable” (failing data-quality checks). We also highlight elements of the study design that either depart from established best practice in online research or stem from issues that likely go beyond online data collection.
Ineligible respondents are not participants
The majority of the reported participant sample (N = 529) did not complete the task (329 from criteria “a” and “b”; 44 from criterion “c”). But these “participants” did not actually participate. Participants who do not complete a study (in the lab or online) should be reported as having voluntarily withdrawn or as having been excluded. Exclusions based on demographic characteristics or understanding of informed consent are reasonable, but such exclusions are fundamentally not part of the sample.
Beyond demographic segmenting provided by MTurk—part of criterion “a”—there are various ways to perform eligibility exclusions, though these involve familiarity with web design (Hauser et al., 2019), multiphase recruitment (Hydock, 2018; Springer et al., 2016), or third-party services that prescreen MTurk workers (or “Turkers”; e.g., CloudResearch offers fine-grained age filters and checks whether reported demographics are accurate). All of these cost time and money. Such costs must be considered when choosing to crowdsource data; doing so would have averted most of the issues W&T encountered.
It may seem alarming that the consent quiz in criterion “b” had a 40% failure rate (136/336 remaining after criterion “a”). However, participants do not read consent forms carefully in in-person studies, either (Douglas et al., 2021). W&T’s rate is in range for in-person studies—from ~25% for simple components of informed consent to ~45% for more complex components (Tam et al., 2015). It may be a general problem for participant-based research that participants do not carefully read consent information, but it is not a problem specific to online data collection or MTurk.
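The arithmetic behind the 40% figure can be checked directly. The short sketch below uses only the counts reported above; the variable names are ours, and the in-person range is the approximate one attributed to Tam et al. (2015) in the text.

```python
# Respondent counts as reported in the text above.
TOTAL_RESPONDENTS = 529
LOST_ELIGIBILITY = 193    # criterion "a"
LOST_CONSENT_QUIZ = 136   # criterion "b"

# Approximate in-person failure range quoted in the text (~25% to ~45%).
IN_PERSON_RANGE = (0.25, 0.45)


def failure_rate(lost: int, at_risk: int) -> float:
    """Proportion of the respondents still in the study who were lost at a stage."""
    if at_risk <= 0:
        raise ValueError("at_risk must be positive")
    return lost / at_risk


remaining_after_a = TOTAL_RESPONDENTS - LOST_ELIGIBILITY
consent_failure = failure_rate(LOST_CONSENT_QUIZ, remaining_after_a)
within_in_person_range = IN_PERSON_RANGE[0] <= consent_failure <= IN_PERSON_RANGE[1]

print(f"Remaining after criterion a: {remaining_after_a}")
print(f"Consent-quiz failure rate: {consent_failure:.1%}")
print(f"Within the quoted in-person range: {within_in_person_range}")
```

The point of computing the rate against the 336 respondents who reached the quiz, rather than against all 529, is that each criterion should be judged against those who were still in the study when it was applied.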
Good data is not a given
The attention checks in criterion “d” (known as instructional manipulation checks, or IMCs) are commonly used to assess data quality and participant viability both online and in the lab. W&T imply that unviable participants were not human (i.e., bots), but humans also produce unviable data during in-person studies (Hauser & Schwarz, 2016; Necka et al., 2016). For W&T, 16 participants failed these checks (~10% of the 156 remaining). In-person studies have IMC failure rates in the range of 14 to 18% for motivated or monitored participants and up to 28% for unmotivated participants (Oppenheimer et al., 2009). The 10% rate for W&T’s criterion “d” does not reflect an MTurk-specific phenomenon, and worries about pervasive bots on MTurk have been debunked (Ahler et al., 2021; Kennedy et al., 2020; Moss et al., 2021).
Recent assays of quality on unfiltered MTurk find that ~60% of Turkers provide acceptable quality data and ~40% unacceptable (Hauser et al., 2022), though more may display “careless behavior” in a broader, less pernicious sense, highlighting the importance of nuanced approaches to data quality (Brühlmann et al., 2020). The aforementioned filtering services, such as CloudResearch, decrease the rate of unviable responses substantially (Douglas et al., 2023; Hauser et al., 2022). In our experience, combining such services with two-stage recruitment (in which researchers rerecruit participants who had previously passed attention checks) further reduces unviability by an order of magnitude. Again, such strategies cost time or money, and researchers choosing to recruit online must accept such costs if they want good data.
However, IMCs reflect only localized attention and not global task attention (Gummer et al., 2021) and are only moderately effective in bot detection (Pei et al., 2020; Storozuk et al., 2020); they also have measurement problems (Hauser et al., 2019) and frustrate participants (Silber et al., 2022). We thus advise against treating IMCs as straightforward indices of attention, because even attentive participants might fail an occasional item. Our own convention is to count participants as unviable only if they fail more than one attention-check item. More generally, we recommend prioritizing techniques that are more sophisticated than IMCs: techniques that track patterns of responses within and between participants (Buchanan & Scofield, 2018; Curran, 2016; Dupuis et al., 2019; Wood et al., 2017; see SM10 in Sulik et al., 2023, for a worked example, available at https://osf.io/xw23p).
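The convention of tolerating a single failed attention-check item is simple to encode. The following is a minimal sketch, assuming each participant's results are stored as a mapping from item name to pass/fail; the participant IDs, item names, and function name are hypothetical and not taken from any published pipeline.

```python
from typing import Mapping


def is_unviable(check_results: Mapping[str, bool], max_failures: int = 1) -> bool:
    """Return True if more than `max_failures` attention-check items were failed.

    `check_results` maps an item name to True (passed) or False (failed).
    The default tolerates one failed item, following the convention in the text.
    """
    failures = sum(1 for passed in check_results.values() if not passed)
    return failures > max_failures


# Hypothetical results for three participants on three attention-check items.
participants = {
    "p01": {"imc_1": True, "imc_2": True, "imc_3": True},
    "p02": {"imc_1": True, "imc_2": False, "imc_3": True},   # one slip: retained
    "p03": {"imc_1": False, "imc_2": False, "imc_3": True},  # two failures: flagged
}

flagged = sorted(pid for pid, results in participants.items() if is_unviable(results))
print(flagged)
```

A rule like this is only a first pass; it would normally sit alongside the response-pattern indices cited above rather than replace them.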
Raw completion times are not diagnostic
Overall task-completion times (criterion “e”) are not informative gauges of data quality. More informative indicators would be rates (e.g., hourly-equivalent participant-payment rates or seconds taken per response) and nuanced tracking of response behavior.
W&T’s reported rate of viability is likely affected by the compensation offered and the length of the task. Assuming their reference to “minimum wage” means the U.S. federal minimum wage ($7.25/hour), and given the reported upper estimate of completion time (50 min), W&T presumably offered ~$6 as compensation. Minimum wage in most U.S. states and territories is higher than the federal minimum, and tools used by Turkers to track compensation rates (e.g., https://turkerview.com/) consider the federal rate to be low pay. Our own policy is to offer a minimum $12/hr equivalent; we arrived at this rate after considering mean response times from pilots with MTurk participants, calculating time from when participants consent and begin the survey, not from when they open the survey window. We urge against making assumptions about respondents’ likely behavior and against estimating completion times using non-MTurk samples (or samples not from the recruitment platform the study will use).
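The compensation estimate above is a simple rate calculation, sketched below. The $7.25 federal rate, the 50-minute upper estimate, and the $12/hr policy come from the text; the function names and the example payment are illustrative assumptions.

```python
FEDERAL_MINIMUM_WAGE = 7.25   # USD per hour, as assumed in the text
TARGET_HOURLY_RATE = 12.00    # USD per hour, the authors' stated minimum


def payment_for(minutes: float, hourly_rate: float) -> float:
    """Payment (USD, rounded to cents) that yields `hourly_rate` for a task of `minutes`."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return round(hourly_rate * minutes / 60, 2)


def hourly_equivalent(payment: float, minutes: float) -> float:
    """Hourly-equivalent rate (USD per hour) implied by a flat payment."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return payment * 60 / minutes


# A 50-minute task at the federal minimum wage versus at $12/hr.
at_federal_minimum = payment_for(50, FEDERAL_MINIMUM_WAGE)
at_target_rate = payment_for(50, TARGET_HOURLY_RATE)

print(f"50 min at federal minimum: ${at_federal_minimum:.2f}")
print(f"50 min at $12/hr: ${at_target_rate:.2f}")
print(f"$6.00 for 50 min is ${hourly_equivalent(6.00, 50):.2f}/hr")
```

In practice the `minutes` argument would be the mean or median completion time from a pilot run on the same platform, measured from consent rather than from when the survey window was opened.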
The compensation offered by W&T was not only relatively low, but the task itself was very long. Online surveys should ideally be 5 to 15 min (Aguinis et al., 2021; Moss et al., 2023; Revilla & Höhne, 2020), with ~20 min being a reasonable maximum (Cape & Phillips, 2015; Chandler, 2023; Revilla & Ochoa, 2017). It would come as no surprise if many of W&T’s participants abandoned the task or rushed to get a better hourly wage or avoid boredom (indeed, the lower bound of their reported completion times aligns with the maximum recommended time for a survey noted above). W&T mention that “owing to the compensation structure, ‘workers’ have little incentive to invest the extra time and thought required by open-ended qualitative items” (p. 4). However, this overlooks the fact that researchers, not MTurk, choose the compensation structure and task duration.
Motivation and incentives for online studies differ from in-person studies (Hauser et al., 2019). There is a complex and evolving relationship between compensation and data quality online: U.S. Turkers identify money as a primary motivator and are less likely to accept low-paying tasks (Litman et al., 2015). Additionally, low pay leads to participant frustration (Fowler et al., 2022), and as the results from Litman et al. (2015) involved a short 6-min task, such issues may be compounded in long questionnaires. Participants speed up toward the end of longer questionnaires; they are also less likely to move sliders, they give shorter responses to open-text questions, and they grow increasingly careless (Bowling et al., 2022; Cape & Phillips, 2015).
Researchers (and ethical-review bodies) considering MTurk should be aware of the well-documented potential for worker exploitation (Pittman & Sheehan, 2016), adjusting compensation to scale with what they would expect to pay in the lab. Further, researchers should carefully and honestly communicate their expectations about time or attention to participants (Galesic & Bosnjak, 2009; Hauser et al., 2019; Zallot et al., 2021). If MTurk is to be considered “cheap” (a characterization we resist in this simple form), it is not because the pay rate should be lower, but because researchers can get data from MTurk for short studies; it may be difficult to get participants into the lab for similarly short periods.
Overall, completion times are only a useful index of data quality in the context of other information: A two-hour completion time could reflect a participant who is working distractedly while watching TV, but it could just as easily represent someone who opens the survey window and does not immediately begin the survey, but responds attentively once the survey is launched. In our experience, the latter scenario is common. To distinguish such cases, researchers can record times between clicks, track mouse movements, or monitor when a participant has moved away from the survey’s browser window. All of these can be achieved with flexible platforms such as jsPsych (De Leeuw, 2015), but again, this requires some skill development.
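To illustrate why raw completion time is ambiguous, the sketch below separates idle time before the survey starts from the time taken per response. It assumes a hypothetical log of timestamps (in seconds) for when the window was opened, when the survey was started, and when each response was submitted; it does not reproduce the data format of jsPsych or any other platform.

```python
from dataclasses import dataclass
from statistics import median
from typing import Sequence


@dataclass
class TimingSummary:
    idle_before_start: float            # seconds between opening the window and starting
    median_seconds_per_response: float  # median gap between consecutive responses
    total_elapsed: float                # raw completion time, from window open to last response


def summarize(opened_at: float, started_at: float,
              response_times: Sequence[float]) -> TimingSummary:
    """Break a raw completion time into idle time and per-response pace."""
    if not response_times:
        raise ValueError("at least one response timestamp is required")
    stamps = [started_at, *sorted(response_times)]
    gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
    return TimingSummary(
        idle_before_start=started_at - opened_at,
        median_seconds_per_response=median(gaps),
        total_elapsed=stamps[-1] - opened_at,
    )


# Two hypothetical participants with similar raw completion times (about two hours).
late_starter = summarize(0, 6600, [6612, 6625, 6636, 6650])   # waits, then responds steadily
distracted = summarize(0, 5, [1800, 3600, 5400, 7200])        # long gaps between responses

print(late_starter)
print(distracted)
```

Both participants would look alike on total elapsed time, yet the first responds every 12 or so seconds once started, whereas the second leaves about half an hour between responses.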
Open-ended questions must be motivated
Finally, W&T used two open-ended questions for which they sought 10 sentences in response and at least 20 sentences total (criterion “f”). Participants were eliminated if their responses were deemed unusual, nonsense, contradictory, or repetitive across participants. Assuming that this approach is a variant of the 20-statements test (Kuhn & McPartland, 1954), there is no published comparison of performance on this task online versus in the lab, nor are there established procedures for coding how humans (as opposed to bots) respond to this. It is not clear, for example, that replicated statements like “I will . . . get married” or “I will . . . buy a car” across respondents indicate that respondents are not likely to be human or are providing low-quality data (given how these are likely future events for respondents in their early twenties).
A common complaint among Turkers is frustration or confusion with open-ended or repetitive questions (Fowler et al., 2022). This is less a matter of comprehending the question and more a matter of not understanding why the question is being posed or what researchers expect in terms of a satisfactory response. Participants may deliberately give unusual responses when the perceived pointlessness of such questions inspires frustration (Fowler et al., 2022).
It is thus difficult to determine the root cause of the poor-quality data obtained in W&T’s study. Odd responses may have been generated by alleged bots, or they may be merely the last gasp of thoroughly bored participants. Because W&T did not filter by location, there is also the possibility of a language barrier.
Understanding MTurk
Well-recompensed studies that rigorously check multiple interrelated indices of data quality show a radically different picture of data quality on MTurk than W&T suggest (Hauser et al., 2022). W&T’s reported 2.6% rate is off by more than an order of magnitude relative to even the most conservative estimates (Brühlmann et al., 2020; Chmielewski & Kucker, 2020; Douglas et al., 2023). Although W&T’s experience does not reflect the rates of data quality that researchers should expect from MTurk, it highlights a larger problem: many researchers seem to be operating with the wrong mental model of what MTurk is and how it should be used. Disappointing results may reflect mismatches between many researchers’ expectations and what MTurk actually provides.
Expect nothing from MTurk in isolation
Researchers should proceed as if MTurk offered no promise of data quality. MTurk is used across heterogeneous branches of academic research, and even this is a small share of its business: Private-sector companies use it for disparate tasks (Schmidt, 2015) with diverse criteria. Reporting data-quality metrics that span all these areas is not feasible, so researchers should consult recent estimates for specific fields (e.g., psychology, Douglas et al., 2023; advertising, Berry et al., 2022; political science, Kennedy et al., 2020) and calibrate their expectations accordingly.
The opportunity costs involved in ensuring data quality on MTurk have led to the emergence of crowdsourcing platforms, such as Prolific, that were originally designed for academic research. Although some researchers may prefer such purpose-built platforms, Prolific is not immune to the data-quality issues inherent to crowdsourced approaches (see Charalambides, 2021), and using CloudResearch to filter MTurk participation results in similar—and sometimes even higher—quality than Prolific (Douglas et al., 2023; Peer et al., 2022).
Researchers should think of recruitment platforms like MTurk as the classifieds section of a newspaper: It is just a place to list or browse offerings. Recruiting directly from MTurk is like offering a job to the first N people who respond to a classified ad, whereas paying for CloudResearch-filtered participants is like engaging a recruitment agency. Alternatively, researchers could take the time and effort to vet their own participants using multistage recruitment with multiple mutually corroborative indices of quality (Storozuk et al., 2020).
Expect little from basic filtering
MTurk’s main signal of participant quality is the HIT Approval Rate (HAR), tracking the percentage of a participant’s previous work that was approved rather than rejected by requesters. Various papers on using MTurk for academic studies recommend filtering out workers with an HAR below some high threshold (commonly > 95%, as in W&T). However, high HAR is not a guarantee of good data quality, even though low HAR may indicate a likelihood of low-quality data (Ahler et al., 2021; Hauser et al., 2022; Peer et al., 2022). Conversely, MTurk’s most prolific participants typically have a high HAR, so researchers can access more naive populations if they set a lower threshold (Robinson et al., 2019). How (or whether) a researcher wants to use a worker’s HAR as a filter during recruitment will depend on specific study requirements.
However, researchers should focus on the broader process of recruiting online, rather than just checking boxes or setting filters such as this. Whether in person or online, it remains a researcher’s responsibility to develop a rigorous recruitment strategy that is tailored to a particular study’s needs (Chandler, 2023; Rodd, 2023). Online researchers must check who they are recruiting, build data-quality indices into their studies from the ground up, and demonstrate a rigorous approach to data quality for their own sake and for their audience’s sake (whether editors, reviewers or readers).
Do not expect MTurk to be both cheap and fast
Ever since the first appearance of the term “web experiment” in psychology (Reips, 2000), there have been widespread assumptions that online data collection is quick and cheap, or simply a matter of translating an in-lab study for the web browser (Rodd, 2023). Such assumptions persist despite decades of evidence to the contrary. This perception of low cost may be a holdover from the early MTurk “boom”: Low per-participant costs were considered a main draw, but concerns about exploitation mean this is no longer the default (Pittman & Sheehan, 2016). As with in-person studies, it remains the ethical responsibility of researchers to pay participants at a rate that is neither exploitative nor coercive.
MTurk is an efficient way to collect data once a study has been effectively designed, but that does not mean that researchers should expect overall costs to be lower; in fact, efficient recruitment means spending extra resources on other aspects of study design. Whether vetting their own participants or paying third-party filter services, researchers should harness the efficiency of MTurk to iterate over multiple versions of their study, benchmarking comprehension and attention, and using this to fine-tune their instructions and response formats.
Although W&T reported running a pilot of their study, it is not clear whether they did so on MTurk. If they had done so with 30 participants, and if their exclusion criteria worked as described, a pilot would have yielded, at most, one viable participant by their standards (and possibly none). In that case, recruitment of the full sample should not have proceeded until the entire approach was reconsidered. We stress that reducing unviable responses to 0 is unlikely, but a pilot study should aim for a low rate (say, 7%–10%) that suits a researcher’s balance between available time and money. Targeted piloting counteracts the speed of recruitment on MTurk relative to in-person recruitment. Although the relative costs of different parts of the research pipeline may vary, rigorous behavioral research has a fairly fixed cost when considered holistically.
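The pilot estimate above follows from applying W&T's overall viability rate (14 of 529) to a hypothetical pilot of 30. The sketch below does this and, as an added assumption of ours, treats each pilot participant as an independent draw at that rate to estimate how likely it is that the pilot yields no viable participant at all.

```python
VIABLE_REPORTED = 14   # viable participants reported by W&T
TOTAL_REPORTED = 529   # total respondents reported by W&T
PILOT_SIZE = 30        # hypothetical pilot size used in the text

viability_rate = VIABLE_REPORTED / TOTAL_REPORTED

# Expected number of viable participants in a pilot of this size.
expected_viable = PILOT_SIZE * viability_rate

# Probability that the pilot yields no viable participant, assuming each
# participant is an independent draw at the same rate (a simplifying
# assumption, not a claim about W&T's data).
prob_no_viable = (1 - viability_rate) ** PILOT_SIZE

print(f"Viability rate: {viability_rate:.1%}")
print(f"Expected viable participants in a pilot of {PILOT_SIZE}: {expected_viable:.2f}")
print(f"Probability of no viable participant: {prob_no_viable:.0%}")
```

Under these assumptions the expected count is below one, which is why a pilot of that size would have signaled that the approach needed rethinking before full recruitment.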
Expect to spend time developing technical skills and keeping them up to date
Many methods in psychological science require specific expertise, and studies employing them must be adapted to their affordances. There is no reason to think crowdsourcing data should be a lone exception. An eye-tracking study run without calibration of the device and without data preprocessing would yield misleading conclusions, and yet such failures would not undermine the value of eye trackers for psychology research more generally. Tools for studying complex problems evolve rapidly, but the pace of these changes does not excuse poor practice. High-tech landscapes have an even faster rate of change (cf. the sudden explosion of ChatGPT in recent years), so researchers opting for online research should expect to keep abreast of current developments.
Relevant technical skills include lessons from web development and data science (Hauser et al., 2019). Web applications generally undergo extensive testing and revision before launch, and data wrangling, cleaning, and preprocessing are major parts of any data-science pipeline (Wickham & Grolemund, 2016). Specific skills for online research include coding for the web (e.g., in JavaScript), which allows one to enhance the relatively constrained affordances of survey platforms such as Qualtrics or to benefit from the flexibility of custom libraries such as jsPsych (De Leeuw, 2015). Other platforms lie between these extremes (e.g., Gorilla Psych, https://gorilla.sc; Labvanced, https://labvanced.com). Even without coding, basic attention to the control flow of a task would prevent participants clicking through to the end of the study (cf. W&T’s criterion “c”).
If a researcher cannot invest time in skill development, there is the option to pay expert consultants (e.g., offered by both Gorilla and CloudResearch). Whether the investment comes in the form of time or other resources, the net cost of conducting effective research online is substantial.
Expect to prioritize participant perspectives
Some researchers may think of online studies as being like in-person laboratory studies occurring at a distant undisclosed location (Rodd, 2023). But online research benefits from considering online participants’ perspectives throughout the design and recruitment process. Our emphasis on expertise notwithstanding, everyone has to start somewhere. While researchers are working to build expertise, they can benefit immensely from the perspectives of Turkers and their communities (Fowler et al., 2022; Schmidt, 2015; Silber et al., 2022), including browsing forums where Turkers discuss their experiences (e.g., https://www.reddit.com/r/TurkerNation/, https://turkopticon.net, https://turkerview.com). Even more simply, while iteratively piloting a new study, one should regularly ask for participant feedback (Aguinis et al., 2021) and act on it, offering bonus payments where such feedback is useful. Indeed, we consider that every crowdsourced study should end by asking for open-ended user feedback, and it is relatively easy to build a pool of trusted testers.
The issue extends beyond basic survey-design principles. For instance, boredom is a major issue on MTurk (Fowler et al., 2022), though often not considered in the lab. Researchers should accept that they are competing with everything else on the Internet when it comes to people’s attention, and they should design engaging studies accordingly. MTurk is best suited to short studies with broad participant requirements and without dozens of Likert-style questions (aptly called “bubble hell” by workers; Fowler et al., 2022).
Our emphases on technical skills and perspective-taking converge in UX (user experience in the context of web development). This includes designing interfaces for ease of use, taking into account user perceptions and motivations, and it forms a natural connection with gamification to enhance participant motivation in online studies (Rodd, 2023; Tinati et al., 2017). Ultimately, task design should make it effortless for respondents to give the kinds of responses researchers need, but effortful for them to try to bypass researcher needs. Even then, not every survey or experiment is suitable for online recruitment, just as not every research question could be effectively addressed using eye tracking.
Conclusions
Webb and Tangney (2022) reported a study with alarmingly low rates of data quality using MTurk. We used their report to highlight how researcher expectations of MTurk often don’t match the affordances of the platform, though these challenges are well documented in the literature. MTurk is a large marketplace where tasks can be posted for workers to complete, giving researchers access to a diverse participant pool (Buhrmester et al., 2011; Smith et al., 2015) and relatively quick data collection once a task has been effectively designed. Alone, it is not effective as a filter for task eligibility or participant viability. The responsibility of ensuring a desirable sample and high-quality data still lies with the researcher, requiring either careful design and specialized skills or the hiring of services or experts. Information and strategies for how to accomplish this using MTurk are available in literature going back over a decade, much of which includes periodic estimates of data quality and how this changes over time, alongside task- and discipline-specific recommendations. This literature should form the basis of a researcher’s assessment of whether the costs and benefits inherent to the platform are a fit for their research goals. MTurk may not always be an appropriate tool, given the skills of the research team or the nature of the research question. It is only “too good to be true” if researchers assume, despite a growing and robust literature to the contrary, that it is a magic bullet that will drastically reduce the overall costs of human-subjects research.
