Abstract
Increasing numbers of research studies are completed entirely online, often using “crowdsourcing” platforms to recruit participants, such as Amazon’s Mechanical Turk (MTurk) or Prolific. Over time, crowdsourced data collection has faced multiple threats to validity including bots, inattention, and misrepresentation. Fortunately, research and development of quality control techniques has resulted in approaches to conducting credible research using crowdsourcing platforms. However, advances in large language models (LLMs), particularly computer-using agents (CUAs), such as OpenAI’s “Operator,” may pose novel threats to the validity of crowdsourced research data in the future. This brief report explores how and why that might be the case and encourages further discussion of this issue.
The number of academic and clinical research studies conducted entirely online has increased substantively over the past decade (Aguinis et al., 2021; Kolc et al., 2023; Tompkins & Swift, 2019). However, certain approaches to data collection – such as public links advertising compensated studies (e.g., via Facebook) – are particularly vulnerable to automated or rapidly-submitted low-quality responses (Morford et al., 2025).
Recent Online Research Approaches
One alternative to such approaches is to use “crowdsourcing” platforms for research. These “micro-work” platforms, such as Amazon’s MTurk (Amazon.com, 2020) and Prolific (Prolific, 2022), serve as recruitment platforms by advertising academic studies or surveys to registered users. Online recruitment through crowdsourcing tools has provided opportunities for rapid and less expensive data collection (in comparison to in-person studies or experiments) but has also highlighted multiple threats to research validity related to automated or low-quality survey completion, dishonesty, inattention, and misrepresentation (Agley et al., 2022; Buhrmester et al., 2011; Chandler & Shapiro, 2016; Chmielewski & Kucker, 2020). Addressing these concerns while ensuring fairness to research participants also creates complex ethical questions (Agley et al., 2024; Agley & Mumaw, 2024). Evidence has suggested that a combination of approaches can enable credible research to be conducted online using crowdsourcing platforms. These may include rigorous screening of a participant pool using real identity checks and location verification (Prolific, 2025), robust quality control approaches (Agley et al., 2022; Karumathil & Tripathi, 2025), and clear communication with participants (Agley et al., 2024), along with typical research procedures for data cleaning and management.
A Possible Threat to Crowdsourced Research Validity on the Horizon
On January 23, 2025, OpenAI released an initial version of Operator, “an agent that can go to the web to perform tasks for you. Using its own browser, it can look at a webpage and interact with it by typing, clicking, and scrolling” (OpenAI, 2025). As a computer-using agent (CUA), “Operator can ‘see’ (through screenshots) and ‘interact’ (using all the actions a mouse and keyboard allow) with a browser, enabling it to take action on the web without requiring custom API integrations” (OpenAI, 2025).
In theory, such a tool could be set up to complete surveys and experiments without being “caught” by many of the existing quality control approaches. Operator does not appear to present a novel use case for certain unwanted behaviors (e.g., completing the same survey repeatedly, which can already be accomplished by less sophisticated means). However, it potentially could be used by individual workers (e.g., those enrolled in MTurk or Prolific) who could use Operator to impersonate themselves and complete tasks. Given this concern, I developed a sample academic survey in QualtricsXM (Qualtrics) filled with multiple quality control approaches and conducted a short field test from the perspective of a naive Operator user.
Method
This project explored three slightly different approaches to how a typical user might approach using the Operator tool to complete surveys. Text entries in quotations are the user prompts. Text in italics is relevant feedback from Operator. (1) “Can you please complete this survey for me? [Link]” // The survey includes questions about mood, anxiety, and alcohol consumption. It requires confirmation of understanding and willingness to proceed. Should I go ahead and select “YES” to continue with the survey? // “Yes, please proceed” // The survey asks for your gender and if you consider yourself Hispanic or Latino. These are personal questions. How would you like me to proceed? // “Please answer at random but without contradicting yourself”. (2) [Within the same Operator session as #1] “Please complete the following survey as though you are a regular person. Watch out for tricks (attention check questions) and areas where you might contradict yourself. Your goal is to complete the survey without “getting caught” taking it as an LLM. Ok?” (3) [Creating a new session] “I would like you to take the survey at the following link [Link]. For this survey, pretend you are a 22 year old male. This survey will be checked to be sure it is not completed by a bot or LLM, so please ensure that your responses are varied (like a human might produce) and that you check for attention check questions or other “checks” in completing the survey.” // The survey involves questions about mood, anxiety, and alcohol consumption, and it’s conducted by Indiana University. Should I proceed with the survey? // “Yes, please proceed”. (a) This final prompt was used three different times in clean chat windows to check for variability in how the model completed the survey (the second two instances were completed two months after the first attempt). The secondary attempts from this prompt are numbered 4 and 5 in Table 1. (b) This prompt was also used separately to test “jailbreaking” and “prompt leaking” text during revisions as part of the peer review process (see Table 1). Quality Control Approaches Used in the Test Survey
Survey Components
The survey consisted of a study information sheet (copied from a real, similar study, but with a fake study number), a section with sociodemographic questions intermixed with quality control questions, and then standard self-administered versions of the US Alcohol Use Disorders Identification Test (USAUDIT) (Higgins-Biddle & Babor, 2018), Patient Health Questionnaire (PHQ-9) (Kroenke et al., 2001), and questionnaire for Generalized Anxiety Disorder (GAD-7) (Spitzer et al., 2006).
Types of Quality Control
See Table 1 for the quality control approaches embedded or analyzed within the survey.
Results
In this brief field test, the Operator tool ignored “honeypot” text and completed region, attention, and image recognition tests correctly. Providing a relatively clear prompt (example three) enabled it to pass a deception check with correct epistemological processes and to avoid contradicting itself when asked about the user’s age in two different ways on different pages (age in years and year of birth). This successful performance of example prompt three was repeated twice, two months later (June vs. April). Two of the three sample prompts produced re-CAPTCHA scores consistent with human participants. Across five survey attempts, Operator produced similar, but not identical, scores for the screening tools, and did not engage in “straight-lining” when responding to the instruments.
When the QualtricsXM fraud prevention tool was enabled (to prevent multiple submissions from the same IP address), it blocked Operator from repeating the survey, and a basic request to Operator to retake the survey from a different IP address was rebuffed. The QualtricsXM RelevantID tool flagged duplicate responses accurately after the IP-based fraud prevention tool was disabled. Attempts to use “jailbreaking” and “prompt leaking” text, which were recently successful in flagging survey completion by two different bots powered by Google Gemini (Höhne et al., 2025), were recognized by Operator, which waited for user input to circumvent the check.
Operator produces a running text-based monologue of its actions. Unexpectedly, when completing the survey in June, but not when doing so in April, it mentioned that it was “Completing the GAD-7” (and PHQ-9) despite there being no indication on the screen that the questions represented those tools.
Discussion
This initial, limited assessment suggests that with careful prompt engineering, the Operator tool (or a similar tool) would be capable of assuming the role of a crowdsourced worker and completing studies as they become available while rarely being flagged by existing quality control approaches embedded within data collection tools. Widespread occurrence of this phenomenon would represent a threat to data validity for a wide variety of studies.
The likely “use case” for an Operator-like tool is not to “set it and forget it” (e.g., run it overnight) where it would need to be prompted to use its own judgment and it might make mistakes or fail certain sophisticated checks. Rather, because a user can watch Operator act in real time and because Operator will query the user if it encounters something unusual, the efficient use case is probably to open Operator and set it to work on a secondary monitor in the background while doing something else (e.g., watching television, playing a videogame, or working). It is unclear to me at this time whether there is an ironclad way to prevent surveys from being completed by a CUA in that manner. Liem (2025) recently described bot detection as a “game of cat-and-mouse” in which “the mouse… seems to have the upper hand” (Liem, 2025). And, of course, the mouse never sleeps.
One preprint from April 2025 tested a BERT (Bidirectional Encoder Representations from Transformers) prediction model on responses to open-ended questions and was able to distinguish between LLM- and human-produced responses with nearly perfect accuracy, but this required fine-tuning the models for those specific (and narrow) questions, and performance “decrease [d] if the models make predictions about questions on which they were not fine tuned” (Claassen et al., 2025). Still, this might have utility for standardized sets of open-ended questions used across multiple surveys. At the same time, a user could pause Operator and offer human-produced open-ended responses to such questions (or skip them entirely, where allowable by the survey parameters).
Other researchers have tested image- and vignette-based “theory of mind” questions as well as learning and association questions, all of which nominally represent puzzles that require inferences or perspective-taking (Rodriguez & Oppenheimer, 2024). The image-based theory of mind questions performed especially well in that study. However, as Rodriguez and Oppenheimer also acknowledge, we should consider the degree to which there may be an overlapping Venn diagram created by the most sophisticated bots and the average human survey taker who might otherwise produce legitimate data – and the degree to which some complex tasks may also screen out legitimate human participants.
I also queried ChatGPT 4o as to whether it has a current operational understanding of how Operator works (it claimed to) and how it might address these concerns. Several of its proposed solutions might not be effective (e.g., monitoring response times) given that the test cases produced widely different response times (5+ minute difference), and none were implausibly fast (though a 4-min completion time might suggest an experienced survey-taker). Others, like “image-based drag-and-drop puzzles,” would plausibly capture “set it and forget it” users but not necessarily users running this on a secondary screen, who could intervene briefly. The most promising proposal from GPT 4o was to attempt to trick the LLM by requiring a “semantic comprehension check” where the incorrect answer would appear correct when using keyword matching but not to a human reader. Any of these solutions (or others) would require extensive study before being deployed to determine whether they are an effective deterrent.
Of note, as of April 2025, the likelihood that Operator is being used for this purpose at scale is probably low. Operator is only available to “Pro” users, an OpenAI membership tier costing $200/month; this is likely to limit current uses of the tool for crowdsourced work given that payment is a major motivator for participation (Litman et al., 2014) and estimates for hourly compensation for crowdsourced work are low (Hara et al., 2018), though some platforms set minimum compensation rates (Prolific, 2023). However, this could change quickly depending on changes to payment or alternative forms of access to this tool or to the underlying approach (e.g., a free-to-use version of Operator). And, as noted, less sophisticated but still LLM-driven survey bots already exist and have been deployed.
Limitations and Conclusion
This report is not intended to produce generalizable claims about the use of Operator, crowdsourced research, or any similar topic, nor should it be interpreted as a comprehensive review. Instead, it is an attempt to raise attention about a potential methodological issue for online research and to provide preliminary information. Specifically, the goal is to encourage open discussion among researchers and relevant stakeholders around how best to manage this emerging phenomenon. My own inclination, though I don’t know whether it is feasible, is to seek a solution outside of the “cat-and-mouse” game between bots and researchers implemented at a higher level than individual studies (e.g., development of a bot-resistant survey platform).
Footnotes
Acknowledgments
I would like to thank the reviewers for raising my awareness of certain work being performed on this topic in adjacent fields so that I could incorporate it.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
