Abstract
Most scientific research is conducted by small teams of investigators who together formulate hypotheses, collect data, conduct analyses, and report novel findings. These teams operate independently as vertically integrated silos. Here we argue that scientific research that is horizontally distributed can provide substantial complementary value, aiming to maximize available resources, promote inclusiveness and transparency, and increase rigor and reliability. This alternative approach enables researchers to tackle ambitious projects that would not be possible under the standard model. Crowdsourced scientific initiatives vary in the degree of communication between project members from largely independent work curated by a coordination team to crowd collaboration on shared activities. The potential benefits and challenges of large-scale collaboration span the entire research process: ideation, study design, data collection, data analysis, reporting, and peer review. Complementing traditional small science with crowdsourced approaches can accelerate the progress of science and improve the quality of scientific research.
There is no perfect study. Scientists, in their effort to understand nature, are constrained by limited time, resources, and expertise. This constraint may produce a dilemma between a lower-quality, expedient approach and a better-powered, more intensive investigation that allows for stronger inferences. Ideals of the scientific process can be outweighed by the pragmatic reality of scientists' available resources and pursuit of career advancement. Scientists are rewarded for being the originators of new ideas and evidence through the authorship of articles. These cultural incentives foster a focus on novelty and authorship that can come at the expense of rigor and encourage questionable practices (Bakker, van Dijk, & Wicherts, 2012; Greenland & Fontanarosa, 2012; Nosek, Spies, & Motyl, 2012; Open Science Collaboration, 2015). One alternative is for researchers to take more time for individual studies, expend more resources on each project, and publish fewer findings. Scientists could also work more collectively, combining resources across more contributors. But such choices have implications for productivity, individual credit, and career advancement.
Here we consider the standard model of scientific investigation and describe a complementary model—crowdsourcing science. Crowdsourced approaches seek to maximize the use of available resources, diversify contributions, enable big science, and increase transparency and reliability. The adaptation of cultural norms and incentives to promote crowdsourcing as a complement to the standard model promises to make science more rigorous and inclusive and accelerate discovery.
Two Models of Doing Science
Standard model: vertical integration
Some academic research resembles a vertically integrated business. An individual or small research team conceives a research question, designs studies to investigate the question, implements the studies, analyzes the data, and writes a report of what was found. The closed team conducts the entire process from conceiving the idea to reporting the outcomes. The team members responsible for these steps are active collaborators and coauthors on a manuscript reporting the research. The sought-after reward is acceptance and publication in the most widely read and prominent journal possible.
This model has several notable characteristics. It is localized, with funding distributed to particular labs and institutions, and resource intensive, with the project work divided among a few individuals. Access to productive research pipelines is constrained, and experience and status lead to opportunities to engage in research collaborations (Merton, 1968). It produces a large quantity of small science with teams of limited size conducting projects that are correspondingly limited in scope—a small team can collect only so much data, carry out only so many analyses, and consider only so many alternatives to their methodology. Finally, contribution is recognized and rewarded through authorship on the final publication.
The standard model is akin to the philosopher model of scholarly contribution. An independent thinker conceives and generates a stand-alone piece of scholarship. After peer review by a small number of select colleagues, that scholarship is entered into the marketplace of ideas for others to examine, discuss, critique, and extend. Independence in developing and enacting the idea allows the scholar to dig deeply into a question or idea without interference, and credit allocation is straightforward. Scholars are evaluated on the basis of the reception of their work in the idea marketplace. Outstanding ideas and evidence may become permanently linked to the scholar’s identity, securing a lasting reputation and impact.
So what is wrong with the standard approach to science? For many research questions and contributions, nothing. Independently generated contributions are an efficient means of getting initial evidence for many ideas into the marketplace. Indeed, the decentralized nature of science is presumed to feed the productive generation and culling of ideas by the independent actions of scholars with different priors, assumptions, expertise, and interests. Small teams often work together repeatedly and develop cospecializations that enable deep dives into a methodology or phenomenon. A community of scientists then shares its work, exchanges feedback, and serially builds on each other’s findings.
At the same time, for some research questions and contributions, the standard model may limit progress. Individual researchers and small teams must consider certain trade-offs when directing their research efforts. They could vary design elements and stimuli instead of holding them constant, collect larger samples for fewer studies instead of smaller samples for more studies, and replicate their findings across multiple conditions or contexts rather than demonstrating a phenomenon and moving on. Researchers inevitably weigh these trade-offs against the potential rewards. And because the present culture prizes innovation and discovery (Bakker et al., 2012), some behaviors that would foster research credibility and cumulative progress are performed ineffectively or infrequently. Underperformed behaviors include collecting large, cross-cultural samples to evaluate generalizability and estimate effect sizes precisely (Henrich, Heine, & Norenzayan, 2010), replicating findings systematically in independent laboratories (Klein et al., 2014; Makel, Plucker, & Hegarty, 2012; Mueller-Langer, Fecher, Harhoff, & Wagner, 2019; Simons, 2014), obtaining several different perspectives on how to analyze the same data (Silberzahn et al., 2018), and using a wide variety of study designs and stimuli (Judd, Westfall, & Kenny, 2012; Wells & Windschitl, 1999).
Alternative model: horizontal distribution
The alternate model—crowdsourcing—eschews vertical integration and embraces the horizontal distribution of ownership, resources, and expertise (Howe, 2006). In a distributed collaboration, numerous researchers each carry out specific components of a larger project, usually under the direction of a core coordination team (such that crowd projects are rarely perfectly horizontally distributed). Modern science is already stretching the standard model in more collaborative directions (see Supplement 1 in the Supplemental Material available online). Solo authorship is now the exception in most fields. This is partly due to the diversification of expertise required to conduct research with modern tools (Börner et al., 2010). Across disciplines, mean team size almost doubled from 1.9 authors per article in the 1960s to 3.5 in 2005 (Valderas et al., 2007; Wuchty, Jones, & Uzzi, 2007), and working in teams is associated with greater individual career success (Kniffin & Hanks, 2018). Team-authored articles are more cited than solo-authored articles, and this gap in scholarly impact has increased over time (Valderas et al., 2007; Wuchty et al., 2007).
Rather than two qualitatively distinct categories of research, the vertically integrated and horizontally distributed approaches are better conceived as a continuum, with variation in the depth of contribution by any given individual and the number of individuals contributing to the project. New opportunities and challenges emerge when moving further across the continuum from singular, independent scholars to a distributed, interdependent community. Crowdsourcing carefully selected research questions, in parallel to the necessarily far greater number of small team projects, holds several potential benefits for science, among which are enabling the conduct of large-scale research projects, democratizing who contributes to science, and assessing the robustness of findings.
Enabling big science
An inclusive, diversified contribution model enables ambitious projects that would be unattainable by individuals or small teams working in isolation. Combining resources enables crowdsourced teams to enact research designs that vastly exceed what could be accomplished locally. Instead of holding sampling, stimulus, or procedural variables constant and hoping they do not matter, crowdsourced teams can allow them to vary and test whether they do. Instead of carrying out a low-powered, imprecise test, crowdsourced teams can conduct high-powered, precise studies and draw confident conclusions. Crowdsourcing complex activities seeks to mobilize the crowd’s competencies, knowledge, and skills and may leverage underused resources such as a better way to analyze the data, access to hard-to-recruit populations, knowledge of unpublished research or articles published in other languages, and translation of research materials into local languages and dialects. Crowdsourcing flips research planning from “what is the best we can do with the resources we have to investigate our question?” to “what is the best way to investigate our question, so that we can decide what resources to recruit?”
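To make the power argument concrete, the following is a minimal sketch in Python of how pooling data collection across laboratories changes the statistical power of a simple two-group comparison. The effect size, per-lab sample, and number of labs are illustrative assumptions, not figures from any particular project.

```python
# A minimal sketch: statistical power for a two-sample comparison of a small
# effect (Cohen's d = 0.20) run by a single lab versus pooled across many labs.
# The effect size, per-lab sample, and number of labs are illustrative assumptions.
from statsmodels.stats.power import TTestIndPower

power_calc = TTestIndPower()
d = 0.20          # assumed true effect size (small)
n_per_lab = 50    # participants per condition in one lab
n_labs = 30       # labs contributing to a crowdsourced collection

single_lab = power_calc.power(effect_size=d, nobs1=n_per_lab, alpha=0.05)
crowdsourced = power_calc.power(effect_size=d, nobs1=n_per_lab * n_labs, alpha=0.05)

print(f"One lab (n = {n_per_lab} per condition): power = {single_lab:.2f}")            # ~0.17
print(f"{n_labs} labs (n = {n_per_lab * n_labs} per condition): power = {crowdsourced:.2f}")  # ~1.00
```

Under these assumed numbers, a single lab has well under a one-in-five chance of detecting the effect, whereas the pooled collection detects it nearly every time; the same logic applies to the precision of effect-size estimates.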
Democratizing science
Although personal factors (Clemente, 1973; Hirsch, 2007; Williamson & Cable, 2003) and merit play a role in success in science, scientific careers also exhibit a Matthew effect (Merton, 1968). Early advantages in doctoral institution rank, professional connections, and grant funding accumulate benefits over time (Bol, De Vaan, & van de Rijt, 2018; Clauset, Arbesman, & Larremore, 2015). Grant funding is overallocated to elite universities, and evidence suggests that returns on investment would be greater if the funds were distributed more evenly (Wahls, 2018). Early-career researchers from less well-known institutions, underrepresented demographic groups, and countries that lack economic resources may never have a fair chance to compete (Petersen, Jung, Yang, & Stanley, 2011; Wahls, 2018). Academic fields are generally rich in talent, such that globally distributed projects can recruit individuals with advanced training and much to offer yet too few resources to enact the vertical model competitively on their own. Few people enjoy the resource benefits of research-intensive institutions, including laboratory space, professional staff to support grant writing and management, graduate students, light teaching loads, and a community of colleagues for developing ideas and sharing infrastructure. Crowdsourcing aims to provide a new avenue through which those outside of major research institutions can contribute to high-profile projects, increasing inclusiveness, merit, and returns on investment (Chargaff, 1978; Feyerabend, 1982).
Assessing the robustness of findings
A crowdsourced approach is uniquely advantaged in determining the reliability and generalizability of findings. The ecosystem of standard science leads to the publication of massive numbers of small-sample studies (Pan, Petersen, Pammolli, & Fortunato, 2016), each with observations typically drawn from a single population (e.g., undergraduates from the researchers’ home institution in the case of behavioral experiments; Sears, 1986). Combined with the filter of an academic review process that primarily permits statistically significant results to appear in the published record (Fanelli, 2010), the end result is a research literature filled with inaccurately estimated effect sizes as a result of publication bias (Ioannidis, 2005, 2008). The standard approach to science is also susceptible to issues such as study designs generated from a single theoretical perspective (Monin, Pizarro, & Beer, 2007), unconsidered cultural differences (Henrich et al., 2010), and researcher degrees of freedom in data analysis (Gelman & Loken, 2014; Simmons, Nelson, & Simonsohn, 2011). Large-scale collaboration helped transform epidemiology into a more reliable field (Ioannidis, Tarone, & McLaughlin, 2011; Panagiotou, Willer, Hirschhorn, & Ioannidis, 2013), and this process is currently under way in psychology and other scientific disciplines. Multilab collaborations facilitate directly replicating findings (same materials and methods, new observations; Ebersole et al., 2016; Klein et al., 2014) and conceptually replicating them (new approach to testing the same idea; Landy et al., 2018). Crowdsourcing research is a part of a changing landscape of science that seeks to improve research reliability and advance the credibility of academic research (LeBel, McCarthy, Earp, Elson, & Vanpaemel, 2018; Nosek et al., 2012).
At the same time, there are opportunity costs and diminishing returns involved in organizing many laboratories to carry out a single scientific investigation. Organizing a collective for a globally distributed project can create bureaucracy and transaction costs. For the same effort, a larger number of ideas with initial supporting evidence could have been introduced into the literature by smaller teams working separately. Crowdsourcing allows for systematically examining cross-population variability, but it is important to begin by making sure the effect emerges reliably in at least one location. It will often be beneficial to rely on research from small teams for these reasons, especially when it comes to new areas of inquiry. Crowd projects with dozens or even hundreds of authors also create credit ambiguity and lack extrinsic incentives for participation, topics we address in depth later when we discuss structural reforms to encourage greater crowdsourcing. We believe the two models should coexist, with individual investigators and small teams generating initial evidence for new ideas and crowdsourced initiatives implemented to select particularly critical questions for intense examination. A diverse array of scientific projects, everywhere along the continuum from lone researchers to huge collectives, may produce the greatest return of useful knowledge from the resources invested. The remainder of this article discusses circumstances in which crowdsourcing offers particular opportunities and challenges as a complement to the standard model.
Forms of Scientific Crowdsourcing
Rather than supplanting the standard approach, organizing many individuals and laboratories into shared projects seeks to offset some of the weaknesses of vertically integrated science. Crowd initiatives vary on multiple dimensions that can create advantages and disadvantages depending on the research application (Lakhani, Jeppesen, Lohse, & Panetta, 2007; Muffatto, 2006; Salganik, 2017; Srinarayan, Sugumaran, & Rajagopalan, 2002; Surowiecki, 2005). For example, crowdsourced projects vary in terms of the degree of communication between project members, from largely independent work curated by a coordination team to crowd collaboration on shared activities. Crowd-science initiatives also vary in their inclusivity, from open calls for collaborators to carefully chosen groups of topic experts.
Figure 1 crosses the horizontal dimension of communication (anchored at the left end by curated contributions and at the right by crowd collaboration) with the vertical dimension of selectivity to create a 2 × 2 matrix. Examples of relevant crowdsourced projects are placed in this matrix as illustrations. These projects are described in greater detail in the next section and in Tables 1 and 2 (see also Supplements 1 and 2 in the Supplemental Material). Citizen-science initiatives that include anyone willing to collect data involve a high degree of independence between actors and thus fall into the bottom-left quadrant (Gura, 2013). Posing a research question to specialists (e.g., moral-judgment researchers) and asking them to independently design studies to test the same idea falls into the top-left quadrant (Landy et al., 2018). Iterative contests in which topic experts work together to improve experimental interventions (Lai et al., 2014) and the collective development of open-source software (Muffatto, 2006) are in the top-right quadrant, and more inclusive forms of crowd writing (Christensen & van Bever, 2014) are in the bottom-right quadrant. Open peer review, in which anyone can publicly comment on a scientific manuscript or article, falls into the bottom-right quadrant, and crowd review by experts carefully chosen by a journal editor falls into the top-right quadrant. Traditional small-team research, with unrestricted communication and select membership, falls outside the extreme top-right corner of the matrix at the far end of both axes.

Fig. 1. Forms and examples of crowdsourcing. Curated contributions refers to projects in which project coordinators collect the individual work of a crowd of contributors whose communication with one another is limited to nonexistent. Crowd collaborations refers to projects in which a large group of contributors engage in regular communication regarding their shared work. CREP = Collaborative Replications and Education Project; RP:P = Reproducibility Project: Psychology.
Table 1. Crowdsourcing Different Stages of the Research Process
Table 2. Examples of Crowdsourced Scientific Initiatives
Multistage projects may operate in different locations in this space during the research life cycle. For example, to explore consensus building about disparate findings from the same data set, Silberzahn et al. (2018) segued from isolated individual work to round-robin feedback and then open-group debate. Indeed, much crowdsourced science moves gradually from left to right on the communication dimension over the life course of the project, culminating in collective e-mail exchanges and editing of the manuscript draft. Likewise, crowd projects tend to rely more on selective expertise over time (i.e., move up the vertical axis), as project coordinators and specialized subteams of statistical experts check the collective work for errors and play leading roles in producing the final report.
On the vertical dimension, greater inclusivity facilitates scaling up for massive initiatives. In contrast, selectivity in project membership prioritizes specific areas of expertise for contribution. It is not yet clear under what conditions involving large crowds of contributors (i.e., moving downward on the vertical axis) compromises overall project quality relative to applying mild or strong selectivity standards for contribution (Budescu & Chen, 2015; Mannes, Soll, & Larrick, 2014). Research done by lone scientists and small teams is already known to be prone to error (Bakker & Wicherts, 2011; Berle & Starcevic, 2007; Garcia-Berthou & Alcaraz, 2004; Salter et al., 2014; Westra et al., 2011), and the quality-quantity trade-off that can accompany scaling up is potentially offset by the numerous eyes available to catch mistakes (e.g., Silberzahn et al., 2018). The available evidence suggests that data collected by citizen scientists are comparable in error rates and general quality to those assembled by professionals (Kosmala, Wiggins, Swanson, & Simmons, 2016; Thelen & Thiet, 2008). Online coders and political scientists reach near-perfect agreement on policy positions in political manifestos (Benoit, Conway, Lauderdale, Laver, & Mikhaylov, 2016), Wikipedia entries are as accurate as the Encyclopedia Britannica (Giles, 2005), highly published and less prolific researchers are similarly likely to successfully replicate a given behavioral effect (Bench, Rivera, Schlegel, Hicks, & Lench, 2017; see also Klein, Vianello, Hasselman, & Nosek, 2018), and crowds of investigators do not exhibit measurably different “flair” at designing studies that obtain significant findings (Landy et al., 2018).
These null findings are surprising—there must be some point at which a crowd project becomes overly inclusive and insufficiently expert members compromise overall quality. One possibility is that coordinators of the crowd projects thus far have chosen the degree of inclusiveness and communication best suited to their research question (i.e., the correct location in Fig. 1), leading to judicious scaling without losses in quality. Logically, only individuals with specialized training (e.g., with physiological equipment) would be recruited to collect data for certain projects (e.g., pooling data from fMRI across laboratories; top-left quadrant of Fig. 1). Even with an open call, potential contributors may volunteer for projects in which they feel they can add value (e.g., an avid bird watcher volunteers to help track migrations), leading to self-screening based on relevant skill sets. Testing the conditions under which crowdsourcing increases and decreases project quality will inform future investments in crowdsourced research.
In contrast, there is little direct evidence regarding the consequences of information exchange between project members in crowdsourced scientific initiatives. Nevertheless, potential costs and benefits of crowd communication are suggested by the literature on group influence and decision making. One of the virtues of crowds of independent agents, especially demographically and intellectually diverse ones, is their tendency to balance out individual biases and errors in the aggregate (Galton, 1907; Larrick, Mannes, & Soll, 2012; Surowiecki, 2005). Crowdsourcing scientific investigations with little to no communication between project members (i.e., the far-left regions of Fig. 1) may help to avoid the potentially biasing effect of individuals' overcommitment to intellectual claims (Berman & Reich, 2010; Luborsky et al., 1999; Manzoli et al., 2014; Mynatt, Doherty, & Tweney, 1977) and path dependencies in which knowledge of others' approaches has an inordinate influence (Derex & Boyd, 2016). The effectiveness of crowds is more difficult to evaluate in situations that lack normatively correct answers or objective measures of accuracy. Yet even then, the diversity in approaches and results on the part of independent scientists, for example in analytic choices and study designs, is at least made transparent to the reader (Landy et al., 2018; Silberzahn et al., 2018).
That the "wisdom of the crowd" effect is spoiled when peer influence between members of the crowd is possible (Lorenz, Rauhut, Schweitzer, & Helbing, 2011) suggests that the more one moves toward crowd collaborations (i.e., right on the horizontal axis), the more conformity and deference to authority become risks. The one crowdsourced project that has tracked individual beliefs under conditions of gradually increasing communication found little evidence of convergence over time, beyond what would be expected from sensitivity to new evidence (see Fig. 4 in Silberzahn et al., 2018). The circumstances under which conformity effects occur in crowd science remain an open empirical question, and future projects should consider manipulating factors such as task interdependence and anonymity of communications.
Allowing information exchange and creating interdependencies between project members also come with potentially important benefits. One of the hypothesized benefits of crowd collaboration is the ability of members of the community to learn from each other (Wenger, 1998). For example, teams in the Lai et al. (2014) intervention contest observed the effectiveness of others' interventions between rounds and used those insights to improve their own interventions. Likewise, the round-robin feedback between different analytic teams in the crowdsourcing data-analysis initiative (Silberzahn et al., 2018) helped several analysts to identify clear errors and adopt improved specifications. These are only anecdotal examples, and further research is needed to examine when peer learning occurs systematically in iterative, multistage crowd collaborations and how it might best be facilitated. As reviewed next, evidence of the viability of crowdsourcing across all stages of the research process has accumulated rapidly in recent years.
Crowdsourcing Science in Action
Science can benefit from crowdsourcing activities that span the entire research process (see Table 1). These include coming up with research ideas, assembling the research team, designing the study, collecting and analyzing the data, replicating the results, writing the article, obtaining reviewer feedback, and deciding next steps for the program of research. Table 2 and Supplement 2 in the Supplemental Material summarize some recent crowdsourced scientific initiatives, organized by the respective stages on which they focused their crowd efforts.
Ideation
Crowds of scientists can be organized to collaborate virtually on complex problem-solving challenges, each proposing ideas for solving components of the problem and commenting on each other's suggestions (open communication; the far-right regions of Fig. 1). This approach has been used to great effect in the Polymath projects, resulting in several important mathematical proofs (Ball, 2014; Polymath, 2012, 2014; Tao, Croot, & Helfgott, 2012). As in product-design contests (Poetz & Schreier, 2012), crowds of researchers can also be used to generate original research hypotheses and select which ideas are most likely to be of broad interest and impact (Jia et al., 2018; Schweinsberg, Feldman, et al., 2018). This approach may be particularly useful when it comes to data sets that for legal or ethical reasons cannot be publicly posted or further distributed—for instance, the personnel records of a private firm, which might agree to share them with one research team or institution but not for general distribution. Even in such cases, the core coordination team that serves as custodian of the data can post an overview of the variables and sample online and publicly solicit ideas for testing (Jia et al., 2018). The crowdsourced generation and selection of research ideas is one way to open up data sets and collaboration opportunities that would otherwise remain closed to most scientists.
Assembling resources
Genome-wide association studies distribute the task of investigating the entire genome across many collaborators and institutions with specialized roles, leading to important discoveries related to genes and pathways of common diseases (Visscher, Brown, McCarthy, & Yang, 2012). Consider the innumerable lost opportunities for similarly combining resources across laboratories in other scientific fields. For instance, a researcher at one institution may have a great idea but lack access to the right equipment or sample of subjects to test it. Elsewhere, another team finds it has an excess of research resources (e.g., it compensates participants for a 30-min session for completing a 15-min study). Some researchers have resources that other researchers need to meet their research goals. One way to minimize this collective waste and maximize researchers' collective ability to meet their goals is to match "haves" with "needs" using online platforms such as Science Exchange (https://www.scienceexchange.com) and StudySwap (http://osf.io/view/StudySwap). Such exchanges, which could be expanded into full-scale online academic labor markets similar to oDesk or Elance (Horton, 2010), seek to push academic communities into the top-right quadrant of Figure 1 by opening novel lines of communication and creating opportunities to connect resources and expertise.
Study design
Another limitation to standard science is narrow sampling of the constructs of interest (Baribault et al., 2018; Judd et al., 2012; Monin & Oppenheimer, 2014; Wells & Windschitl, 1999). A small team is at risk of generating a limited set of stimuli, operationalizations of variables, and study designs. Another team might have carried out a very different test of the same idea because of different prior training and theoretical assumptions. Even seemingly small differences in methods might produce substantial differences in research results. An alternative crowd approach is to assign the same research question to different experts, who then independently design studies aimed at answering it (low communication combined with high expertise; top-left corner of Fig. 1). Landy et al. (2018) did precisely this, finding that variability in effect sizes due to researcher design choices was consistently high. Indeed, study designs from different researchers produced significant effects in opposite directions for four of five research questions related to negotiation, moral judgment, and implicit cognition. Crowdsourcing conceptual replications more effectively reveals the true consistency in support for a scientific claim.
Data collection
Online platforms for crowdsourced labor such as Amazon's Mechanical Turk have become widely used as a source of inexpensive research participants and coders (Stewart, Chandler, & Paolacci, 2017; see Supplement 3 in the Supplemental Material). Rather than merely serving as research subjects, members of the general public can also be recruited to collect data and observations. This strategy moves the project into the bottom-left corner of Figure 1 (inclusive projects with low communication), with anyone willing to help being included as a project member. The tradition of citizen science dates back to Denison Olmsted's use of observations from a crowd of both amateur and professional astronomers to track the great meteor storm of 1833 (Littmann & Suomela, 2014; Olmsted, 1834). Citizen science today is a movement to democratize science (Chargaff, 1978; Feyerabend, 1982), engage the public, create learning opportunities, and gather data and solve problems at minimal cost with the aid of a host of volunteers (Cavalier & Kennedy, 2016; Gura, 2013). Amateur scientists participate actively in scientific investigations in biology, astronomy, ecology, conservation, and other fields, working under the direction of professionals at research institutions. A related approach is to gamify scientific problems and recruit citizen scientists to aid in cracking them, as in the video game Quantum Moves, in which players move digital renditions of atoms (Sørensen et al., 2016), the online game EyeWire, in which players help map the structure of neurons in the retina (Kim et al., 2014), and the protein-folding game FoldIt (Cooper et al., 2010). Note that for some types of citizen-science projects, contributors may have substantial skills and knowledge—or even formal training, such as an advanced degree—and in such cases are far from novices. One of the strengths of crowdsourcing is the ability to tap into the expertise of individuals outside of mainstream academia who are able and willing to contribute to science.
Data analysis
Researchers working with a complex data set are confronted with a multitude of choices regarding potential statistical approaches, covariates, operationalizations of conceptual variables, and the like. In a quantitative review, Carp (2012a, 2012b) found that 241 published articles on fMRI used 223 distinct analytic strategies. Researchers may consciously or unconsciously choose statistical specifications that yield desired results, in particular statistically significant results, in support of a favored theory (Bakker et al., 2012; Ioannidis, 2005; Ioannidis & Trikalinos, 2007; Simmons et al., 2011; Simonsohn, Nelson, & Simmons, 2014). One way to maximize transparency is to turn the analysis of data over to a crowd of experts. The same data set is distributed to numerous scientists who are asked to test the same theoretical hypothesis, at first without knowing the specifications used by their colleagues (high expertise combined with low communication; top-left quadrant of Fig. 1). This offers an opportunity to assess how even seemingly minor differences in choices may affect research outcomes and reduces the pressure to observe any particular outcome—at least for the purposes of publishability. Silberzahn et al. (2018) found that 29 different teams of analysts used 29 distinct specifications and returned effect-size estimates for the same research question (“Do dark skin toned soccer players receive more red cards?”) that ranged from slightly negative to large positive effects. Crowdsourcing the analysis of the data reveals the extent to which research conclusions are contingent on the defensible yet subjective decisions made by different analysts.
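The logic of handing one data set to many analysts can be sketched in a few lines of code. The toy multiverse-style example below is illustrative only: the data are simulated, and the covariate sets and exclusion rules stand in for the defensible specification choices different analysts might make; it does not reproduce the Silberzahn et al. (2018) analyses.

```python
# A minimal multiverse-style sketch of how defensible analytic choices can shift
# an effect estimate. Data are simulated; the predictor, covariates, and the set
# of specifications are illustrative, not those of any published crowd project.
import itertools
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "x": rng.normal(size=n),       # focal predictor
    "cov_a": rng.normal(size=n),   # optional covariate
    "cov_b": rng.normal(size=n),   # optional covariate
})
df["y"] = 0.1 * df["x"] + 0.4 * df["cov_a"] + rng.normal(size=n)

# Each "analyst" picks one covariate set and one exclusion rule.
covariate_sets = ["", " + cov_a", " + cov_b", " + cov_a + cov_b"]
exclusion_rules = [lambda d: d, lambda d: d[np.abs(d["x"]) < 2]]

estimates = []
for covs, exclude in itertools.product(covariate_sets, exclusion_rules):
    fit = smf.ols("y ~ x" + covs, data=exclude(df)).fit()
    estimates.append(fit.params["x"])

print(f"{len(estimates)} specifications; estimates for x range from "
      f"{min(estimates):.3f} to {max(estimates):.3f}")
```

Distributing the same choices across independent analysts, rather than running them all within one lab, adds the further benefit that no single team's theoretical commitments determine which specification is reported.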
The growth of large-scale data has created opportunities to leverage this diversity to identify the most robust means of analyzing such complex and massive data sets. Crowdsourced challenges have been used by researchers for benchmarking new computational methods, as in the Dialogue for Reverse Engineering Assessments and Methods (DREAM) Challenge focused on predicting the survival of breast-cancer patients (Saez-Rodriguez et al., 2016; Stolovitzky, Monroe, & Califano, 2007). Organizers provide a shared data set and a particular question to be addressed to many independent analysts (a top-left-quadrant approach) and then apply the submitted analytic strategies to a hold-out data set to evaluate their robustness.
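A minimal sketch of this challenge protocol follows, using simulated data and two stand-in "submissions"; the data and models are assumptions for illustration, not DREAM Challenge materials.

```python
# A minimal sketch of the challenge protocol described above: submitted models
# are trained on a shared data set and then ranked on a hold-out set the
# analysts never saw. The data and the two candidate "strategies" are simulated.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 0).astype(int)

# Organizers release the training split; the hold-out split stays private.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)

submissions = {
    "team_A_logistic": LogisticRegression(max_iter=1000),
    "team_B_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, model in submissions.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_hold, model.predict_proba(X_hold)[:, 1])
    print(f"{name}: hold-out AUC = {auc:.2f}")
```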
Another innovative method is to hold constructs, models, and covariates constant and leverage a network of researchers to carry out this same analysis on different existing data sets (a coordinated analysis; Hofer & Piccinin, 2009). This approach was pioneered by the Integrative Analysis of Longitudinal Studies on Aging network (Lindwall et al., 2012). Testing a research question of common interest (e.g., “Does education protect against cognitive decline?”; Piccinin et al., 2013) on existing data sets that include the same constructs (e.g., measures of cognitive function such as memory, reasoning, and fluency) and yet measure them in disparate ways in different populations (e.g., Sweden, Austria, the Netherlands, and the United Kingdom) far more systematically assesses the generalizability of the results than relying on a single data source. Because members of this network of experts communicate extensively to agree on their shared analytic approach and measures to use from each longitudinal data set, a coordinated analysis falls into the top-right quadrant of Figure 1.
Note that all of these approaches are qualitatively different from fields in which many researchers independently leverage a central data source (e.g., the General Social Survey, or GSS). In fields such as political science, resources such as the GSS are used to investigate separate research questions, such that aggregation and metascientific comparisons are less informative. Crowdsourcing is especially useful, we suggest, for fields that rely on local resources that can remain siloed. That said, the data corpus generated by crowdsourced projects often serves as a public resource after the publication of the article (e.g., Open Science Collaboration, 2015; Tierney et al., 2016).
Replicating findings before publication
Individual laboratories are typically constrained in the amount and type of data they can collect. Replicating unpublished findings in independent laboratories before they are submitted for publication (Schooler, 2014; Tierney, Schweinsberg, & Uhlmann, 2018) addresses power and generalizability directly. Authors can specify a priori in which replication samples and laboratories they expect their findings to emerge; for example, they might select only topic experts as their replicators, thus moving up the vertical axis of Figure 1. This approach, which thus far returns a modest reproducibility rate even under the seemingly best of conditions (Schweinsberg et al., 2016), has recently been integrated into graduate and undergraduate methods classes (Schweinsberg, Vignanola, et al., 2018), thus traveling downward along the vertical axis toward greater inclusiveness. Such crowdsourced pedagogical initiatives are one means of turning replication into a commonplace aspect of how science is conducted and students are educated (Everett & Earp, 2015; Frank & Saxe, 2012; Grahe et al., 2012).
Writing research reports
The conceptualization, drafting, and revision of research articles represent another opportunity to leverage distributed knowledge. The article "The Capitalist's Dilemma," conceptualized and written by two professors and 150 of their MBA students, is one example (Christensen & van Bever, 2014). As with other forms of collaborative writing online, such as Wikipedia, channeling the contributions of many collaborators into a quality finished article requires a few group leaders who complete a disproportionate amount of the work and organize and edit the written material of others (Kittur & Kraut, 2008; Kittur, Lee, & Kraut, 2009). Our personal experience with articles with many authors is that a large number of contributors commenting publicly on the draft greatly facilitates working out a solid framework and set of arguments, identifying relevant articles and literatures to cite (especially unpublished work), ferreting out quantitative and grammatical errors, and tempering claims appropriately. More radically, efforts such as CrowdForge suggest that nonexperts (e.g., elite Mechanical Turk workers) are surprisingly capable of drafting quality summaries of scientific findings for lay readers (Kittur, Smus, Khamkar, & Kraut, 2011). Such quality raw material could be carefully vetted and included in reviews of scientific research for practitioners and lay audiences. This suggests cautious optimism in moving down the vertical axis of Figure 1 to allow for written work from unconventional contributors, with the degree of inclusiveness varying by the technical expertise and topic knowledge required for a given article.
Peer review
In the current system of academic peer review, an unpublished manuscript is submitted to a journal and evaluated by the editor and usually two to five external referees, each of whom provides detailed feedback, often over multiple rounds of revisions and serially across multiple journals. Even when successful, it can be a slow and arduous process taking months or years. For example, Nosek and Bar-Anan (2012) reported a case study of a researcher’s corpus of publications and found that the average time from manuscript submission to ultimate publication was 677 days. There is little doubt that detailed feedback from colleagues can be immensely helpful, yet it remains unknown whether research reports are consistently improved by the review process (“Revolutionizing Peer Review?” 2005). Empirical studies indicate that the interrater reliability of independent assessors is low, with median reliability coefficients of .30 for journal articles and .33 for grant reviews (Bornmann, Mutz, & Daniel, 2010; Cicchetti, 1991; Marsh, Jayasinghe, & Bond, 2008) and that there is bias in favor of authors with strong networks (Wenneras & Wold, 1997). There are also the diminishing returns on time investments to consider—completing iterative rounds of review and revisions consumes time that might have been better allocated to pursuing a novel scientific discovery. The reviewers, typically anonymous, receive minimal professional benefit from their work, and the broader community may never hear worthy criticisms left unaddressed in the published version of the article. Ultimately, publication in a prestigious outlet is a poor signal of an article’s scholarly impact, with journal impact factors driven by outlier articles and only a weak predictor of the citations accrued by the typical article in the journal (Baum, 2011; Holden, Rosenberg, Barker, & Onghena, 2006; Seglen, 1997).
An alternative is to open scientific communication and crowdsource the peer-review process (Nosek & Bar-Anan, 2012). This moves rightward on the horizontal axis by opening communication and downward on the vertical axis to the extent the review process is inclusive of many commentators. Both might be accomplished simultaneously using a centralized platform for review and discussion of research reports, with a content feed similar to social-media sites (e.g., Facebook, Twitter) and users able to comment on and evaluate content as with the websites run by Reddit, Yelp, Amazon, and others (Buttliere, 2014). Posted files could include not only manuscripts but also data sets, code, and materials and reanalyses, replications, and critiques by other scientists. Peer review would be open, credited, and citable, and prominent articles that attract attention would be evaluated by a potentially more reliable crowd of scientists rather than a small group of select colleagues. Further, reviewers would have access to the underlying data, facilitating the early identification of errors (Sakaluk, Williams, & Biernat, 2014). Measures of contribution would be diverse, with scholarly reputation enhanced not just via citations to authored manuscripts but also via proposals of novel ideas, the posting of data and code that others find useful, insightful feedback on others' work, and the curation of content related to specialized topic areas (e.g., replicability of the effects of mood on helping behaviors; LeBel et al., 2018). Original authors would have the opportunity to update their article in light of new evidence or arguments, with older versions archived, as in the Living Reviews group of journals in physics.
In contrast to such a radical bottom-right-quadrant approach (open communication, highly inclusive), top-right-quadrant versions of peer review would invite a crowd of topic experts carefully selected by a journal editor. However, in this more conservative scenario journal reviews would still be public, citable, and greater in number than is currently the norm. Open and citable reviews allow readers who weight traditional credentials highly to do so, whereas individuals lower in formal expertise but whose comments are high in quality have the opportunity to be recognized. The barriers to wider experimentation are not so much technological—there are already platforms that facilitate open scientific communication (Wolfman-Arent, 2014)—but rather social, with current professional reward structures still encouraging publication via the traditional process and outlets. Only by experimenting with diverse approaches, some staying close in important respects to traditional academic review and others departing radically, can we identify the most effective ways to communicate scientific ideas and knowledge.
Replicating published findings
Among the best known uses of crowdsourcing are large-scale initiatives to directly replicate published research in psychology, biomedicine, economics, and other fields (e.g., Alogna et al., 2014; Errington et al., 2014; McCarthy et al., 2018; O’Donnell et al., 2018). In these crowdsourced projects, up to 100 laboratories attempt to repeat the methodology of previous studies, collecting much larger samples to provide improved statistical power to detect the hypothesized effect. Aggregating across six major replication initiatives in the social sciences, examining 190 effects in total, crowdsourced teams successfully replicated 90 (47%; Camerer et al., 2016, 2018; Ebersole et al., 2016; Klein et al., 2014, 2018; Open Science Collaboration, 2015).
A crowdsourced approach to replicability reveals that high levels of heterogeneity in effect-size estimates across laboratories are observed primarily for large effects, not small ones (Klein et al., 2018). In other words, effects that fail to be replicated tend to fail consistently across cultures and demographic populations, which casts doubt on the argument that as-yet-unidentified moderators explain the disappointing results. The lack of consistent laboratory differences in effect-size estimates (i.e., some research teams are not "better" than others at obtaining support for the original hypothesis; Bench et al., 2017; Klein et al., 2014, 2018) suggests that cautious scaling (e.g., moving downward on the vertical axis of Figure 1 toward greater inclusiveness) ought to be considered. The Collaborative Replications and Education Project (CREP; Grahe et al., 2013; Wagge et al., 2019) seeks to achieve this by organizing undergraduate experimental methods classes into research teams, an approach that promises to radically scale up data collection for replications by integrating this activity into student education (Everett & Earp, 2015; Frank & Saxe, 2012). The Psychological Science Accelerator, an international network of more than 300 psychological-science laboratories, has committed to contributing to large-scale collaborations on an ongoing basis, including regularly involving students via the Accelerated CREP initiative (Moshontz et al., 2018).
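For readers who want to see how between-laboratory heterogeneity is typically quantified in these projects, here is a minimal sketch using the DerSimonian-Laird random-effects estimator on simulated per-lab results; the effect sizes, standard errors, and number of labs are placeholders, not data from any registered replication report.

```python
# A minimal sketch of quantifying between-lab heterogeneity in a multilab
# replication, using the DerSimonian-Laird estimator. The per-lab effect sizes
# and standard errors below are simulated placeholders, not real project data.
import numpy as np

rng = np.random.default_rng(7)
n_labs = 20
true_effect, tau_true = 0.30, 0.15          # assumed mean effect and between-lab SD
se = rng.uniform(0.08, 0.20, size=n_labs)   # per-lab standard errors
effects = rng.normal(true_effect, np.sqrt(tau_true**2 + se**2))

w = 1 / se**2                                # inverse-variance (fixed-effect) weights
pooled_fe = np.sum(w * effects) / np.sum(w)
Q = np.sum(w * (effects - pooled_fe) ** 2)   # Cochran's Q
df = n_labs - 1
c = np.sum(w) - np.sum(w**2) / np.sum(w)
tau2 = max(0.0, (Q - df) / c)                # DerSimonian-Laird tau^2
i2 = max(0.0, (Q - df) / Q) * 100            # I^2: % of variance from heterogeneity

print(f"Pooled effect = {pooled_fe:.2f}, tau = {np.sqrt(tau2):.2f}, I^2 = {i2:.0f}%")
```

Large tau or I^2 values signal meaningful variation across sites of the kind Klein et al. (2018) report for large effects, whereas values near zero are consistent with a single underlying effect (or its absence) across populations.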
Deciding what findings to pursue further
Faced with a voluminous and constantly growing research literature—more than 30 million academic articles have been published since 1965 (Pan et al., 2016)—and evidence that many published findings are less robust than initially thought (Begley & Ellis, 2012; Errington et al., 2014; Open Science Collaboration, 2015; Prinz, Schlange, & Asadullah, 2011), researchers must determine how best to distribute limited replication resources. Viable options include focusing on highly cited articles, findings covered in student textbooks, results that receive widespread media coverage, or on research with practical relevance (e.g., for government policies or interventions to reduce demographic gaps in educational attainment). The replication value of a study might be calculated on the basis of the impact of the finding relative to the strength of the available evidence (e.g., statistical power of the original demonstrations; Nosek et al., 2012).
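As one illustration of how such a replication value might be operationalized (our own simplified sketch, not the specific metric proposed by Nosek et al., 2012), impact could be placed in the numerator and the strength of existing evidence in the denominator:

```latex
% Illustrative only: one possible replication-value index, where higher values
% flag findings that are widely cited yet rest on relatively little evidence.
\mathrm{RV} \;=\; \frac{\text{impact}}{\text{strength of existing evidence}}
\;\approx\; \frac{\log\!\left(1 + \text{citations}\right)}{\sum_i n_i},
\qquad n_i = \text{sample size of prior demonstration } i
```

Alternative numerators (media coverage, policy relevance) and denominators (estimated statistical power of the original demonstrations) could be substituted without changing the basic logic.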
Another complementary rather than competing approach is to leverage the collective wisdom of the scientific community. The aggregated estimates of crowds perform surprisingly well at predicting future outcomes—such as election results, news and sporting events, and stock-market fluctuations—because in many cases, the aggregation cancels out individual errors (Galton, 1907; Mellers et al., 2014; Surowiecki, 2005). Likewise, the averaged independent predictions of scientists regarding research outcomes—based solely on examinations of short summaries of the findings, research abstracts, or study materials—are remarkably well aligned with realized significance levels and effect sizes (Camerer et al., 2016; DellaVigna & Pope, 2018a, 2018b; Dreber et al., 2015; Forsell et al., 2018; Landy et al., 2018). Senior academics (e.g., full professors) and junior academics (e.g., graduate students and research assistants) exhibit similar forecasting accuracy (DellaVigna & Pope, 2018a, 2018b; Landy et al., 2018), suggesting the feasibility of an inclusive bottom-left-quadrant approach. It may be reasonable to avoid allocating replication resources to findings considered either clearly spurious or well-established by a heterogeneous crowd of scientists and focus on findings about which beliefs are conflicting or uncertain.
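The error-cancelling property of aggregation invoked here can be illustrated with a toy simulation, under the assumption of independent and roughly unbiased individual forecasts, which is the condition under which the wisdom-of-crowds effect holds.

```python
# A toy simulation of the wisdom-of-crowds effect described in the text: when
# individual forecast errors are independent and roughly unbiased, the mean of
# the crowd is far closer to the truth than the typical individual forecaster.
import numpy as np

rng = np.random.default_rng(42)
true_effect = 0.25                 # the quantity being forecast (illustrative)
n_forecasters, n_trials = 50, 2000

individual_err, crowd_err = [], []
for _ in range(n_trials):
    forecasts = true_effect + rng.normal(0, 0.30, size=n_forecasters)  # noisy guesses
    individual_err.append(np.mean(np.abs(forecasts - true_effect)))
    crowd_err.append(abs(forecasts.mean() - true_effect))

print(f"Mean absolute error, individual forecaster: {np.mean(individual_err):.3f}")
print(f"Mean absolute error, crowd average:         {np.mean(crowd_err):.3f}")
```

With correlated or systematically biased forecasters the advantage of the crowd average shrinks, which is one reason the independence and heterogeneity of forecasters matter.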
A decision market might be used to select among the many available options for independent replication, the idea being to allocate resources as efficiently as possible. Crowdsourced replications will be most useful when a clear, widely agreed-on question of broad interest is present. Large-scale efforts seem less appropriate for findings the community considers highly unlikely to be true (e.g., extrasensory perception) or not particularly theoretically interesting if true. Such crowd-based selection might be ongoing, with attention dynamically shifting away from effects that have experienced repeated replication failures and for which the community's expectations drop below a predetermined threshold (Dreber et al., 2015). This would help prevent cases in which numerous laboratories conduct replications of an effect, collecting many thousands of participants, when fewer tests would have already led to strong inferences. Decision markets might also be used to select the populations in which an effect is most and least likely to emerge, as an initial test of universality (Norenzayan & Heine, 2005).
Crowd science can also be used to make gradual improvements to existing research paradigms and interventions. Lai and colleagues (Lai et al., 2014, 2016) held a series of crowdsourced contests to identify the best interventions for reducing implicit racial biases. Beginning in the top-left quadrant of Figure 1 (low communication, high expertise), research teams submitted 17 interventions to reduce implicit biases (e.g., exposure to positive exemplars, perspective taking, empathy). Of those interventions, 8 successfully reduced implicit intergroup bias in the short term. Moving horizontally into the top-right quadrant by adding the element of information exchange, teams were able to observe and learn from each other's approaches between rounds of data collection. Several teams used this opportunity to improve their own intervention, leading to progressively greater effectiveness in reducing intergroup bias across rounds. We believe this contest model holds widespread applicability for identifying and improving upon practical interventions to address societal challenges. We envision a future scientific landscape in which forecasting surveys and decision markets are run in tandem with research contests and other large-scale empirical data collections on an ongoing basis.
Reforms to Facilitate Large-Scale Collaboration
We believe most researchers have an intrinsic interest in contributing to the accumulation of knowledge and are not solely driven by prestige. At the same time, professional reward systems can be updated in ways that encourage voluntary participation in large-scale collaboration and better align intrinsic and extrinsic motives. The current culture and reward system pressure researchers, more than is ideal for scientific progress, to act independently rather than collectively and to pursue initial evidence for novel findings rather than systematic verification. Further, although merit matters in science, there are also Matthew effects (Bol et al., 2018; Clauset et al., 2015; Merton, 1968; Petersen et al., 2011; Wahls, 2018). The resulting hierarchical and network-based arrangements interfere with inclusivity for researchers who have much to offer but come from disadvantaged backgrounds and/or lack resources. Thus, we advocate for changes to include greater rewards for collective engagement.
Distribution of grant funding
Empirical evidence suggests that distributing grant funding more evenly would increase the total return on investment in terms of scientific knowledge (Wahls, 2018). The receipt and renewal of such funds could be further linked to evidence of ongoing contributions to open science. These might include publicly posting data and materials (Simonsohn, 2013), disclosing data exclusions and stopping rules (Simmons et al., 2011), running highly powered studies (Stanley, Carter, & Doucouliagos, 2018), preregistering studies and analysis plans (Nosek, Ebersole, DeHaven, & Mellor, 2018; Nosek & Lakens, 2014; Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit, 2012), conducting replications, helping to develop new methods, sharing resources on platforms such as StudySwap, and participating in crowdsourced initiatives, among other options. A more equitable distribution of financial support for research could reward merit and encourage excellence, not only by providing additional opportunities for those with useful skills and knowledge to contribute (Wahls, 2018) but also by directly incentivizing emerging best practices. To avoid the diffusion of responsibility on projects with many collaborators, not only authorship but also grant funding might be made contingent on specific deliverables (e.g., minimum number of participants collected, provision of annotated analysis code others can reproduce).
Author contribution statements
Although some especially elaborate crowd projects involve specialized subteams who are able to publish a separate report of their work (e.g., Dreber et al., 2015; Forsell et al., 2018), these are atypical cases. Articles with many authors that report large-scale projects require reforms in how intellectual credit is allocated. Input can be documented through careful and detailed author contribution statements, which academic journals increasingly require. A good starting point for the crafting of clear contribution statements is the CRediT taxonomy (Brand, Allen, Altman, Hlava, & Scott, 2015), in which contributions throughout the full research life cycle are represented in categories such as conceptualization, data curation, writing, and visualization. Providing information about which coauthors contributed to which CRediT categories allows collaborators to transparently communicate how authorship was determined and which author deserves credit for which components of a research project. This sort of detailed accounting is a necessary precursor for the acceptance of increasingly long author lists that are already commonplace in fields such as high-energy physics.
Selection and promotion criteria
In addition to traditional metrics of scholarly merit, search and promotion committees should take into account an applicant’s contributions to conducting rigorous research and making science better. In some fields, a demonstrated commitment to open science and scientific reform is already starting to be factored into selection and promotion decisions (Nosek, 2017; Schönbrodt, 2018). One way in which applicants might choose to fulfill these criteria is by participating in crowdsourced initiatives to replicate findings, reanalyze data, generate and select ideas, and so forth. Comprehensive shifts in incentives will require that hiring and tenure and promotion committees rely more on specific indicators of contribution (Brand et al., 2015), such as the author contribution statements described above, rather than heuristics of counting articles and whether the person was first, last, or somewhere in the middle of an authorship list. In this way, individuals who led an important subcomponent of a massive project (e.g., the subteam that conducted the forecasting survey, qualitative analyses, or Bayesian meta-analysis) can be more fairly recognized.
Another more radical option is making entire project workflows open and linked to each contributor (something possible through the Open Science Framework; http://osf.io/) and for hiring and promotion committees to examine these workflows before making their decisions. In a future in which open peer review becomes commonplace, online links to feedback provided on the articles of colleagues might be formally listed on one’s curriculum vitae (CV) as further evidence of intellectual contribution and service to the field. If the multifold aspects of an academic’s workflow are made transparent, decision makers can move beyond heuristics and use more complete information to better allocate rewards on the basis of merit.
Integrating crowd science into pedagogy
Another way to encourage crowd science is to build such initiatives into activities that scientists in many fields already do routinely, such as collecting data in methods classes for student projects and analyzing complex data sets as part of graduate education (Everett & Earp, 2015; Frank & Saxe, 2012; Grahe et al., 2012; Mavor et al., 2016). The CREP (Grahe et al., 2013; Wagge et al., 2019) and Pipeline Projects (Schweinsberg et al., 2016; Schweinsberg, Vignanola, et al., 2018) offer opportunities to leverage such activities for articles with many authors that report crowdsourced replications. In these cases, for both students and course instructors, being the middle author on a report of an interesting initiative is better than no author credit at all. Crowdsourcing keeps students' hard work on data collection from going to waste on repetitions of established paradigms (e.g., the Stroop effect) in unpublishable class projects whose results offer little information gain. As a further incentive, the second Pipeline Project offers course instructors a free curriculum they can use in their lectures, reducing course preparation time (https://osf.io/hj9zr). Whether graduate programs provide opportunities for experiential education and authored work on crowd-science projects could potentially be factored into their rankings and accreditations.
Changes in publication criteria
Top-down changes in publication requirements at journals (e.g., disclosure rules and open-science badges) are already changing how science is done and what gets published (Everett & Earp, 2015; Nosek et al., 2015). Such systematic shifts in policy help to avoid collective-action problems in which only a subset of scientists engage in best practices that increase research quality but may also reduce productivity, placing those scientists at a professional disadvantage (Kidwell et al., 2016). One option, aimed at encouraging prepublication independent replication (Schweinsberg et al., 2016), is to include independent verification of findings in another laboratory as a publication criterion at the most prestigious empirical journals (Mogil & Macleod, 2017). It is often useful to make initial evidence for a finding available for examination and debate by the scientific community, and individual careers should continue to advance primarily in this way. However, it is also reasonable for the publication outlets that provide the most professional benefit to authors and are perhaps perceived as most authoritative (e.g., Science, Nature, Proceedings of the National Academy of Sciences) to set the bar higher. Prominent journals are also increasingly recognizing the value of metascientific work that relies on a crowd approach, a trend that promises to encourage future crowdsourced projects. A more general shift in emphasis toward rigorous verification, relative to novelty, as a publication criterion would incentivize high-powered crowd projects that are well positioned to assess the replicability and generalizability of findings.
Developing infrastructure
Another avenue is to create infrastructure and tools that make crowdsourcing easier and more efficient. Online platforms such as the Harvard Dataverse and the Open Science Framework are available to host data, research and teaching materials, and preregistrations and to document workflows. Journal mechanisms such as Registered Reports, which review methodology and accept articles in principle before data collection, have now been adopted at scores of outlets (https://cos.io/rr), and journals are increasingly experimenting with innovative formats such as open review, crowd review, and updatable articles. Recently introduced tools such as StudySwap and standing laboratory networks such as the Psychological Science Accelerator likewise hold promise to change the landscape of everyday science.
These approaches to encourage large-scale collaboration are important complements to reforms in how small-team science is conducted and funded. Larger samples (Stanley et al., 2018), disclosure rules (Simmons et al., 2011), preregistration (Nosek et al., 2018; Wagenmakers et al., 2012), and Registered Report formats at journals (Chambers, 2013; Nosek & Lakens, 2014) promise to increase the true positive rate for small studies, with scaling up for crowd projects then allowing for strong inferences about the generalizability versus context sensitivity of particularly important findings. At the same time, crowdsourced metascientific investigations can help to assess the effectiveness of new practices intended to improve science but that may also have unwanted side effects. For instance, preregistration might reduce false-positive results but could also negatively affect the rate of novel discoveries by dampening creativity (Brainerd & Reyna, 2018). A crowdsourced project in progress (Ebersole et al., 2018) will randomly assign researchers to preregister their analyses of a complex data set to empirically assess the costs and benefits of this proposed reform. Finally, the encouragement of large-scale collaborations to help democratize participation in research is a complement to supporting research at teaching institutions through grants, addressing gender gaps in representation, and other efforts to reduce systematic inequalities in science.
Conclusion
Crowdsourcing holds the potential to greatly expand the scale and impact of scientific research. It seeks to promote inclusion in science, maximize material and human resources, and make it possible to tackle problems orders of magnitude greater than those that could be solved by individual minds working independently. Although most commonly used in the data-collection phase of research and for conducting replications, opportunities to take advantage of a distributed, interdependent collective span the entire scientific endeavor—from generating ideas to designing studies, analyzing the data, replicating results, writing research reports, providing peer feedback, and making decisions about which findings are worth pursuing further. Crowdsourcing is the next step in science's progression from individual scholars to increasingly large teams and now to massive, globally distributed collaborations. The crowdsourcing movement is not the end of the traditional scholar or of the vertically integrated model. Rather, it seeks to complement this standard approach and provide more options for accelerating scientific discovery.
Supplemental Material
Supplemental material for this article, Scientific Utopia III: Crowdsourcing Science, by Eric Luis Uhlmann, Charles R. Ebersole, Christopher R. Chartier, Timothy M. Errington, Mallory Kidwell, Calvin K. Lai, Randy J. McCarthy, Amy Riegelman, Raphael Silberzahn, and Brian A. Nosek, is available online in Perspectives on Psychological Science.
Action Editor
Timothy McNamara served as action editor for this article.
Author Contributions
The study was outlined and the literature review conducted through a crowdsourced process to which all authors contributed. The third through ninth authors are listed alphabetically in the byline. E. L. Uhlmann, C. R. Ebersole, C. R. Chartier, T. M. Errington, C. K. Lai, R. J. McCarthy, A. Riegelman, R. Silberzahn, and B. A. Nosek drafted the body of the manuscript. C. R. Ebersole created the figure and tables. All of the authors provided critical edits and revisions and approved the final manuscript for publication.
Declaration of Conflicting Interests
The author(s) declared that there were no conflicts of interest with respect to the authorship or the publication of this article.
Funding
This research was supported by the Center for Open Science, John Templeton Foundation, Templeton World Charity Foundation, Templeton Religion Trust, Arnold Ventures, James McDonnell Foundation, and a research and development grant from INSEAD.
