Abstract
Political scientists increasingly use crowdworkers to produce data, predominantly to code researcher-curated text or to retrieve simple data from the internet. In this article, we provide a theoretical and empirical basis for understanding when crowdworkers can provide data of sufficient quality to substitute for other types of coders. First, we introduce a typology of data-producing actors – experts, trained coders and crowds – and hypothesize factors that affect the substitutability of crowdworkers. We then examine how crowdworkers perform across coding tasks that vary along multiple dimensions of difficulty: information verifiability, availability and complexity. The results provide scope conditions bounding the substitutability of crowdworkers in political science applications. Although crowds can substitute for trained coders in the context of relatively simple information retrieval tasks, there is little evidence that crowdworkers can substitute for experts, whose tasks require both information retrieval and data synthesis.
Political scientists frequently rely on experts to assign categorical or numeric values to concepts that are difficult to observe (Carey et al., 2019; Lindberg et al., 2014; Hooghe et al., 2010); they often ask trained coders to code more directly observable phenomena (Cruz et al., 2021; Hyde and Marinov, 2019; Marshall and Jaggers, 2017). At the same time, crowdsourcing – large-scale recruitment of laypersons to accomplish coding tasks – has emerged as a competing tool for social science data collection. Theoretically, crowdsourcing can offer improvements over expert-coded data in terms of reliability, validity, cost efficiency and replicability. 1
However, existing crowdsourcing applications are largely limited to asking crowds to either: (a) code or compare expert-aggregated and curated text excerpts with varying levels of complexity (Benoit et al., 2016; Carlson and Montgomery, 2017; Horn, 2019; Lehmann and Zobel, 2018; Narimanzadeh et al., 2023; Skytte, 2022; Voong et al., 2020); or (b) perform web-based information retrieval (Porter et al., 2020; Sumner et al., 2020). The potential application of crowdsourcing to a more general class of coding tasks in political science remains largely unexplored. In particular, many political science tasks demand information retrieval (gathering relevant information about a concept as applied to a specific case), curation (organizing and prioritizing these data) and synthesis (summarizing and making sense of these data).
Although these information retrieval and synthesis (IRS) tasks range in complexity, they are often substantially more demanding than the tasks crowdworkers typically perform. As a result, crowdworkers may not be substitutable for the other actors who conduct these common coding tasks. Exploring the boundaries of crowdworker substitutability – and the contexts and conditions under which they are more or less substitutable – is thus an important endeavour.
In this article, we provide a conceptual framework, backed by empirical evidence, for understanding crowdworker substitutability. First, we consider a typology of three actors who code data in social scientific applications: experts, trained coders (henceforth ‘coders’) and crowdworkers. Using this typology, we hypothesize and test how task attributes and crowdworker incentives enhance or diminish the substitutability of crowds for coders and experts. We find evidence that crowdworkers can sometimes substitute for coders, but find only limited evidence that crowdworkers can substitute for experts. Although crowdworkers can often perform information retrieval, and can sometimes classify observations according to social scientific concepts, our results show that crowdworkers lack the background and training to interpret or synthesize complex information when any informed subjective judgement is required.
Theorizing experts, coders and crowds
Multiple types of actors – experts, coders and crowdworkers – code data for social science research. Though scholars often refer to these actors interchangeably, the actors exhibit important differences in skills and incentives. 2
According to Morris (1977), an expert is ‘anyone with special knowledge about an uncertain quantity or event’. Researchers typically recruit experts to perform IRS tasks, relying on them to retrieve – or already know – hard-to-find information. Researchers expect experts to be able to classify observations by synthesizing data that the experts themselves have collected. Typically, experts have spent years developing specialized case and domain knowledge, and therefore have substantial practice in retrieving and synthesizing task-relevant information. In exchange for lending their expertise to a coding enterprise, experts generally receive both monetary and non-monetary (e.g. reputation) benefits.
Unlike experts, trained coders often lack prior task-relevant knowledge. Coders are often graduate students and their tasks may be part of paid responsibilities or serve as a kind of research apprenticeship; coders are pre-screened for general suitability and given task-specific training. Like experts, coders also have a mixture of monetary and non-monetary incentives. Although coders typically receive financial compensation, in the process of their coding work they may also obtain research experience to facilitate future study or employment. Indeed, because coders often commit considerable time to a project, they may develop a level of knowledge in the topic that approximates expertise.
Coders can perform multiple types of tasks. Some simply retrieve observable information, such as the names of mayors of US cities. Others classify expert-aggregated materials, applying researcher-generated rubrics to curated data points. For example, coders may classify sentences from party manifestos as liberal or conservative. Previous research on crowdsourcing in political science has largely focused on supplanting these two sorts of coders (Benoit et al., 2016; D’Orazio et al., 2016; Honaker et al., 2013; Horn, 2019). A third category of coders performs both tasks: retrieving information which they also classify.
Crowdworkers are independent contractors hired through enterprises such as Amazon Mechanical Turk. Their compensation is almost purely financial. Unlike experts, crowdworkers do not have substantial domain-specific expertise. Unlike coders, researchers cannot recruit crowdworkers specifically for their aptitude at research-related tasks, nor can they give them substantial task-specific training. Instead, crowdworkers conduct many different tasks, without spending substantial time on any of them. As such, crowdworkers are unlikely to master any specific task.
The main purported advantage of crowdworkers over experts and coders is that they are numerous and relatively inexpensive to hire. In addition to facilitating large-scale data-gathering projects, this advantage could also lend itself to the creation of reproducible crowd-coded datasets (Benoit et al., 2016).
This advantage hinges on two key assumptions: first, that crowdworkers can produce data of similar quality to those which experts and coders produce; and second, that crowdworkers are sufficiently cheap and numerous to actually substitute for these other coders.
While the remainder of this paper investigates the first assumption in detail, we briefly note here that there is considerable reason to doubt the second assumption. First, analyses in Appendix A demonstrate that recruiting a sufficiently large number of coders to reproduce a large-scale expert-coding enterprise, such as the Varieties of Democracy (V-Dem) dataset as of 2017, would be much more expensive than using experts and would likely require recruiting more crowdworkers than existed in 2017. Second, less than a quarter of the crowdworkers we recruited for this project (54 out of 229) correctly responded to all screener questions and at least half of our ‘gold standard’ questions (more cognitively demanding screener questions). 3 This latter result aligns with the findings of others who have noted challenges both in screening and compensating high-performing crowdworkers (Boas et al., 2020; Carlson and Montgomery, 2017; Chandler et al., 2019; Horton et al., 2011; Loepp and Kelly, 2020; Pyo and Maxfield, 2021; Strange et al., 2019), and in comprehensively accounting for crowdworker incentive structures when designing tasks (Brown and Pope, 2021; Chandler et al., 2019). Cumulatively, such findings indicate that it is necessary to over-recruit crowdworkers and have them complete a high volume of tasks in order to obtain accurate estimates. In applications such as the cross-national data-gathering enterprise we discuss here, the costs of doing so may outweigh the benefits of recruiting crowdworkers over experts or trained coders.
When can crowds replace experts and coders?
Previous research demonstrates that crowds can substitute for experts and coders when coding expert-curated data. There is also more tentative evidence that crowdworkers can perform the more complicated IRS tasks on which we focus here. In the context of political science, Sumner et al. (2020) deploy crowds to code recent experiences of financial distress in US cities. In a set of analyses spanning three other social-scientific contexts, Porter et al. (2020) find that crowdworkers perform well when asked to find and code easily accessible online data for data augmentation purposes. Finally, La Barbera et al. (2024) positively assess the ability of crowdworkers to fact-check political statements. Since performing all of these tasks requires retrieval, curation and synthesis, these lines of research provide a proof of concept for crowdsourcing IRS tasks.
Here we explore the bounds of substituting crowds for experts and coders, investigating the extent to which crowds can holistically retrieve, curate and synthesize information. Most broadly, we hypothesize that crowds can substitute for experts and coders in IRS tasks.
Table 1. Determinants of substitutability.
In the interest of space, in the text we focus on the first two dimensions – task attributes and crowdworker incentives – and relegate crowdworker attributes to a brief discussion. 4 With regard to task attributes, we hypothesize that five attributes should have a substantial effect on crowd substitutability for other coders. Two of these attributes relate to information retrieval, whereas the remaining three relate to information curation and synthesis.
With regard to information retrieval, we hypothesize that information availability should increase substitutability: the easier it is to locate information relevant to a task, the better crowds should perform.
We also hypothesize that information verifiability should increase the substitutability of crowds: tasks whose answers can be checked directly against published sources demand less informed judgement from the coder.
Along these lines, different aspects of curating and synthesizing material may present particular hurdles to effective coding by crowds. First, certain political phenomena are more complicated than others. Although experts possess the background to weigh different issues when making judgements about these phenomena, and coders have the time and resources to investigate them in detail, crowds have none of these attributes. As a result, increasing issue complexity should reduce the substitutability of crowds.
Finally, we hypothesize that incentives should influence the substitutability of crowds. Whereas experts often decide to provide codings for non-monetary reasons, and coders provide codings for a mix of monetary and non-monetary reasons, crowds are primarily pay-motivated. As a result, high pay should increase the motivation of crowdworkers to provide quality codings and thus increase their substitutability for experts and coders.
Research design
We rely on questions from the V-Dem project to test our hypotheses (Coppedge et al., 2017). V-Dem is a large-scale data-gathering enterprise that provides country-year measures of political institutions. To do so, V-Dem recruits both experts and coders to code political phenomena. To code directly observable (‘verifiable’) questions, V-Dem uses coders whom the project trains and provides with scholarly oversight to ensure the accuracy of their data. To code concepts that are difficult or impossible to directly measure – and thus require substantial, informed judgement to code – V-Dem uses experts. These experts are generally scholars who hold a PhD and have a research focus on specific countries and concepts (Marquardt et al., 2019). V-Dem provides these experts with a standardized survey with Likert-scale response categories, which experts use to code concepts across cases.
The V-Dem project strictly limits coders to providing data for directly observable indicators, assuming that they do not have the relevant conceptual or case expertise to code indicators that require informed judgement. In the terminology of the project (Coppedge et al., 2017), coders provide ‘A’ data and experts provide ‘C’ data. This variation and clear delineation of data sources between experts and coders makes V-Dem an ideal context for examining the practical potential of crowd-sourcing for IRS coding tasks with different levels of verifiability and complexity.
To assess crowd substitutability, we compare both coder-coded (‘A’) and expert-coded (‘C’) V-Dem data to crowdworker responses from a March 2017 survey. 5 Following Benoit et al. (2016), we ran the study on CrowdFlower. 6 Crowdworkers self-selected into the research pool, although we randomized task attributes and pay. Per pre-analysis power calculations, our sampling strategy aimed to obtain 20 observations per indicator-country-year-treatment group, with a 20% buffer for attrition. Although this strategy targeted 216 individuals, our sample included 229 crowdworkers because recruitment could not be stopped with complete precision. Crowdworkers used an interface similar to that used by V-Dem. 7 We used ‘gold standard’ questions and screeners to weed out low-quality crowdworkers and bots. The analyses we present in the main text include only the 54 crowdworkers who correctly answered at least 2 of 4 ‘gold standard’ questions and all screeners (roughly 12 crowdworkers per V-Dem indicator). 8
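As a concrete illustration, a minimal Python sketch of this screening rule follows; the column names, the number of screener questions and the data are invented for the example, not drawn from the study.

```python
import pandas as pd

# Toy illustration of the screening rule: keep only crowdworkers who
# passed every screener and answered at least 2 of 4 'gold standard'
# questions correctly. Column names, the assumed count of three
# screeners and the data are invented for this example.
workers = pd.DataFrame({
    "worker_id":        [1, 2, 3, 4],
    "screeners_passed": [3, 3, 2, 3],  # assume 3 screener questions
    "gold_correct":     [2, 4, 3, 1],  # out of 4 gold standard questions
})

N_SCREENERS, MIN_GOLD = 3, 2
kept = workers[(workers["screeners_passed"] == N_SCREENERS)
               & (workers["gold_correct"] >= MIN_GOLD)]
print(kept["worker_id"].tolist())  # -> [1, 2]
```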
We reiterate that V-Dem data collection is a difficult task for crowdworkers. In addition to the difficulty of retrieving information on some concepts and cases, V-Dem questions use technical language to convey complex social-scientific concepts in a way that is difficult to explain to a neophyte without extensive training. 9 As such, we intend for this project to test the boundaries of what tasks crowds can and cannot perform, thereby establishing scope conditions lacking in other work on this topic.
Operationalization and variable description
To assess the effect of task attributes on crowdworker substitutability, we randomly assigned crowdworkers to two of nine V-Dem indicators that varied along important metrics (Table 2). 10 We assigned crowdworkers to only two of these indicators to avoid overextending them. For similar reasons, instead of asking crowdworkers to code the full time period, we asked them to code each of these indicators sequentially for six five-year periods for one country, Argentina. After crowdworkers finished coding the second indicator, we asked them to code that indicator for an additional set of six time periods for another country, Senegal. 11
Table 2. Varieties of Democracy (V-Dem) indicators used.
Our logic in choosing Argentina and Senegal as the two countries for crowdworkers to code was twofold. First, both countries are internationally prominent; any cross-national dataset with pretensions of complete global coverage must therefore include data for them. Second, neither Argentina nor Senegal is among the countries for which relevant information is most easily accessible in English-language materials. As a result, they represent good test cases for assessing whether crowdworkers can provide substitutable data for cross-national datasets. Had we chosen the most easily accessible cases (e.g. the United States), our results would only have generalized to the easiest cases in a cross-national dataset. 12
The choice of these two countries also allows us to test the information availability hypothesis: since relevant information is more readily available for Argentina than for Senegal, crowdworkers should be more substitutable when coding Argentina.
That said, since all crowdworkers coded Argentina first and then had the option to continue coding for Senegal, differences in substitutability between countries could also be due to fatigue. 14
We therefore use an additional metric to assess the effect of information availability: time period. Since information should be more accessible for more recent years, crowdworkers should be more substitutable when coding more recent periods.
In addition to information availability, we also expect information verifiability to influence crowd substitutability. Because coder-coded V-Dem indicators concern directly observable phenomena, whereas expert-coded indicators require informed judgement, we expect crowds to substitute better for coders than for experts.
However, within coder-coded data there is also variation in how much effort information retrieval and verification require; some data are more findable than others. For example, although the legal provisions for suffrage are located in a country's legal documents, the suffrage level in practice is not always officially published. As a result, we expect crowds to substitute better for more easily verifiable coder-coded indicators, such as minimum voting age and referenda permitted, than for indicators such as bicameral legislature and suffrage level.
We also expect issue complexity to influence crowd substitutability for experts: the more complex the issue an indicator captures, the less substitutable crowds should be.
Table 3. Selected Varieties of Democracy (V-Dem) coder-coded indicators.
We base our metric for the issue complexity of expert-coded V-Dem indicators on the assumption that experts become less confident when rating more complex issues. This assumption allows us to use average self-reported expert confidence across indicators in the V-Dem dataset to assess the issue complexity of specific V-Dem indicators. Specifically, V-Dem experts report their confidence in their codings for each indicator-country-year combination; we take the average of these scores across experts, years and countries for each indicator as our metric for its overall complexity. Based on this metric, our analyses include two low-complexity questions (i.e. questions for which experts are relatively confident: journalist harassment and judicial independence) and two high-complexity questions (i.e. questions for which experts are relatively unconfident: gender equality and forced labour).
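A minimal sketch of this aggregation, assuming a long-format table of hypothetical expert confidence scores (the column names and values are ours, not V-Dem's):

```python
import pandas as pd

# Illustrative computation of the issue-complexity metric: average
# self-reported expert confidence per indicator, taken across experts,
# countries and years. Scores and column names are invented.
conf = pd.DataFrame({
    "indicator":  ["judicial_independence", "judicial_independence",
                   "forced_labour", "forced_labour"],
    "country":    ["ARG", "SEN", "ARG", "SEN"],
    "year":       [2010, 2010, 2010, 2010],
    "confidence": [0.9, 0.8, 0.5, 0.4],
})

# Lower average confidence is read as higher issue complexity.
complexity = conf.groupby("indicator")["confidence"].mean().sort_values()
print(complexity)
# forced_labour            0.45   <- high complexity
# judicial_independence    0.85   <- low complexity
```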
We also hypothesize that crowdworkers are less substitutable for experts when they are asked to code more difficult tasks. To capture this dimension, we additionally classify the selected expert-coded indicators by the complexity of the questions themselves.
We expect crowds to substitute best for the indicators that have low issue and question complexity, and least for those with high complexity on both dimensions. Table 4 presents a two-by-two table of our four expert-coded indicators along these metrics, illustrating that judicial independence (low complexity on both metrics) should be the indicator for which crowdworkers are most substitutable, and forced labour (high complexity on both metrics) the indicator for which they are least substitutable.
Table 4. Selected Varieties of Democracy (V-Dem) expert-coded indicators.
Finally, we also hypothesize that crowdworkers are less substitutable for experts when they code indicators with high issue polarization; we include the political killings indicator to assess this expectation.
Incentives
To test our high pay hypothesis, we randomly assigned crowdworkers to receive either a standard or an elevated rate of pay for their coding.
Results
In contrast to our most general hypothesis, the results indicate that crowdworker substitutability is far from uniform: crowds can sometimes substitute for trained coders, but rarely for experts.
Crowdworker substitutability for coders
In order for crowdworkers to substitute for trained coders, a majority (ideally large) of crowdworkers would need to provide correct values for the indicators they code. If that is the case, a data-gathering enterprise could recruit multiple coders and relatively safely assume that the modal response is the correct value. We therefore assess the substitutability of crowdworkers for coders over different indicators by estimating the probability that a crowdworker would provide the correct answer for a given indicator-country-year.
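This aggregation logic can be illustrated with a short sketch that takes the modal response across several hypothetical crowdworkers for a single task; the task identifier and responses are invented.

```python
import pandas as pd

# Toy example of modal aggregation: the most common crowdworker answer
# per task is taken as the coded value. Data are invented.
responses = pd.DataFrame({
    "task":   ["min_voting_age_ARG_2015"] * 5,
    "answer": [18, 18, 18, 16, 21],
})
modal = responses.groupby("task")["answer"].agg(lambda s: s.mode().iloc[0])
print(modal)  # -> 18: the majority response becomes the coded value
```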
Figure 1 illustrates this relationship, showing results from a dichotomous probit regression analysis of the probability that a crowdworker exposed to different task characteristics would provide a correct response. 16 Dots represent point estimates of this probability and horizontal lines represent 95% confidence intervals.

Figure 1. Substantive effect of task characteristics on the probability of correct answer to coder-coded questions. Note: Point estimates and 95% confidence intervals from a probit regression analysis. Vertical line represents the point estimate at the reference level (2015, Argentina, minimum voting age, standard payment).
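For readers who wish to replicate this style of analysis, the following is a minimal sketch of such a probit model on simulated data. It is not the authors' actual specification: all variable names and values are assumptions, with reference levels chosen to mirror those noted in the figure.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated-data sketch of a probit model for the probability of a
# correct response as a function of task attributes and pay.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "correct":   rng.integers(0, 2, n),   # 1 = matched coder benchmark
    "year":      rng.choice([1990, 2000, 2015], n),
    "country":   rng.choice(["Argentina", "Senegal"], n),
    "indicator": rng.choice(["min_voting_age", "suffrage_level"], n),
    "high_pay":  rng.integers(0, 2, n),
})

model = smf.probit(
    "correct ~ C(year, Treatment(reference=2015)) + C(country)"
    " + C(indicator) + high_pay",
    data=df,
).fit(disp=False)

# Predicted probability of a correct answer at the reference level
ref = pd.DataFrame({"year": [2015], "country": ["Argentina"],
                    "indicator": ["min_voting_age"], "high_pay": [0]})
print(model.predict(ref))
```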
The results demonstrate clear variation in the substitutability of crowdworkers for coders. In line with our information availability and verifiability hypotheses, crowdworkers were markedly more likely to provide correct responses for recent years and for more easily verifiable indicators.
Cumulatively, these results suggest that crowdworkers can substitute for coders when they are presented with very simple tasks with easily accessible data: for recent years and simpler indicators, the estimated probability that a (screened) crowdworker provided a correct response is well over 0.50. However, the probability that a crowdworker provides a correct response is much lower for more complicated indicators and for coding periods further back in time, substantially lowering their substitutability in these contexts.
Crowdworker substitutability for experts
The fundamental promise of crowd-sourced data is that average responses will converge toward true values as the sample size increases. Our primary metric for the substitutability of crowdworkers for experts is therefore the convergence of crowdworkers to a benchmark with an increasing sample size. In the case of expert-coded data, a reasonable benchmark is the average score across experts. We therefore assess the bootstrapped standard errors – the squared difference between the average bootstrapped crowdworker score and the average expert score – using a varying count of bootstrapped crowdworkers across our expert-coded indicators. Specifically, for increasing values of n, we randomly draw n crowdworkers with replacement for different country-year-indicator combinations, calculate the mean scores within these groups and compute the difference in means between these n-coder groups and the full set of available expert scores. We repeat this process 100 times for each value of n = 1,2,…,10. 17
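Our description of this procedure can be summarized in a short sketch; the scores below are invented, and the paper's exact error summary may differ (here we report a root-mean-squared summary per group size).

```python
import numpy as np

# Sketch of the bootstrap procedure described above: for each group
# size n, draw n crowdworkers with replacement, take the group mean,
# and record its squared difference from the expert mean; repeat 100
# times per n.
rng = np.random.default_rng(42)
crowd_scores = np.array([1, 2, 2, 3, 4, 2, 1, 3, 2, 4])  # hypothetical
expert_mean = 2.0  # benchmark: the average score across experts

for n in range(1, 11):
    draws = rng.choice(crowd_scores, size=(100, n), replace=True)
    sq_diff = (draws.mean(axis=1) - expert_mean) ** 2
    print(n, round(float(np.sqrt(sq_diff.mean())), 3))  # RMSE-style summary
```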
For purposes of comparison, we also examine the convergence of the V-Dem experts who coded these cases to the expert mean, using the same algorithm but with bootstrapped draws of these experts. By design, expert errors converge toward zero, since the expert average is itself the benchmark. We therefore intend these analyses to serve purely as a heuristic to enable comparison.
Figure 2 presents the bootstrapped standard errors between bootstrapped groups and overall expert averages. Points represent the errors for a country-year for a given number of crowdworkers (grey) or experts (black). If the crowdworker average approaches the expert average as the number of crowdworkers grows, these errors should tend toward zero. Although the figure provides evidence that a greater number of crowdworkers decreases error for all indicators, this convergence is sharply bounded. Even with the maximum number of bootstrapped crowdworkers, and even for the best-performing indicators (judicial independence, gender equality and political killings), a substantial number of country-year errors are above one, indicating that crowdworker averages tend to diverge from the expert average by more than one point on a 5-point scale.

Figure 2. Bootstrapped standard errors across expert-coded indicators by number of experts (black) and crowdworkers (grey). (a) Judicial independence. (b) Journalist harassment. (c) Gender equality. (d) Forced labour. (e) Political killings.
In contrast to our general substitutability hypothesis, then, crowdworker averages fail to converge to expert averages even for the least complex expert-coded indicators.
We also examine the degree to which substitutability varies with task characteristics and crowdworker incentives, using a similar strategy as for the substitutability of crowdworkers for trained coders. Specifically, we regress the absolute difference between the expert mean and crowdworker scores for a given country-year-indicator on task attributes and incentives. 18 Figure 3 plots the results of this analysis, illustrating predicted crowd substitutability for experts as a function of task attributes and crowdworker incentives. Points represent average difference, surrounded by 95% confidence intervals; lower values represent greater substitutability.
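A parallel sketch of this second regression on simulated data follows; all names and values are assumptions, not the authors' actual data or specification.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated-data sketch of the second regression: the absolute distance
# between each crowdworker score and the expert mean, regressed on task
# attributes and pay.
rng = np.random.default_rng(1)
n = 400
df = pd.DataFrame({
    "indicator":   rng.choice(["judicial_independence", "forced_labour"], n),
    "country":     rng.choice(["Argentina", "Senegal"], n),
    "high_pay":    rng.integers(0, 2, n),
    "crowd_score": rng.integers(1, 6, n).astype(float),  # 1-5 Likert scale
    "expert_mean": rng.uniform(1.0, 5.0, n),
})
df["abs_dev"] = (df["crowd_score"] - df["expert_mean"]).abs()

# Lower predicted abs_dev implies greater substitutability for experts.
ols = smf.ols("abs_dev ~ C(indicator) + C(country) + high_pay", data=df).fit()
print(ols.params.round(2))
```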

Figure 3. Substantive effect of task characteristics on distance from expert mean.
The results of the regression analysis reinforce the findings from Figure 2. Figure 3 shows that, at the reference level, crowdworkers tend to deviate by 1.16 Likert-scale points from the mean expert coding. This value is slightly less than a quarter of the scale range, indicating low substitutability and again contradicting our general substitutability hypothesis.
The final expert-coded indicator, political killings, shows similar levels of crowdworker substitutability to judicial independence, the reference level. This result indicates that issue polarization does not appear to further reduce the substitutability of crowdworkers for experts.
Examining information availability, we find no variation in crowdworker substitutability when crowdworkers code the case with more information available (Argentina, the reference level, compared to Senegal). Similarly, there is little evidence that substitutability increases with the recency of the year being coded. Together, these two results indicate that we cannot reject the null hypothesis of no effect for information availability.
Finally, increased crowdworker pay slightly increased substitutability, in line with our high pay hypothesis, although the effect is weak.
These results cumulatively indicate that crowdworkers are not substitutable for experts: even in the best-performing indicator, crowdworker scores are, on average, roughly 18% of a 5-point Likert scale away from the expert average. Moreover, there is evidence that asking crowdworkers to code complex issues further reduces substitutability, and only weak evidence that paying them more can increase substitutability.
Demographic correlates of substitutability
Appendix G analyses the demographic correlates of crowdworker substitutability, focusing on the full sample of crowdworkers (i.e. both those crowdworkers who successfully completed screening tasks and those who did not). In short, there is little evidence that any demographic factor – political interest, education, familiarity with a case, performing well on screeners, coding diligence – affects the substitutability of crowdworkers for experts. However, demographics do have a strong effect on crowdworker substitutability for trained coders. More diligent coders – those who correctly complete screeners and take more time to code – are much more likely to provide correct responses than other crowdworkers, as are crowdworkers who report having spent time in the country they coded. Crowdworkers who report using the V-Dem dataset in their coding task are also more likely to provide the correct response, whereas political science majors are less likely.
In conjunction with the previous analyses of task characteristics, these results indicate that scholars can take concrete steps to increase the substitutability of crowdworkers for coders. Although the most important step is ensuring that the coding task is straightforward, scholars can also screen crowdworkers carefully before and during the coding period, and attend to their coding behaviour. However, there is little scholars can do to make crowdworkers substitutable for experts or coders of complicated tasks.
Conclusion
The literature on crowdsourced data in political science largely focuses on classification tasks, such as categorizing party manifestos, news articles or country descriptions. The success of such endeavours relies on expertise: experts obtain, curate and sometimes synthesize information before delegating classification to crowdworkers. However, social science data collection enterprises often more directly require their coders to perform information retrieval and curation, in addition to classification. Here we investigate whether crowds can economically and efficiently provide such labour. We do so by analysing the degree to which crowdworkers perform such tasks similarly to experts and coders.
We find little evidence that crowds can substitute for experts on IRS tasks. Our typology provides a possible explanation: most crowdworkers do not have the background, support or incentives to gather and analyse the types of data that experts do. IRS tasks are difficult to break into small chunks, and it is therefore not surprising that crowdworkers perform poorly in our study. Nevertheless, since essentially every published study on crowdsourcing in political science reports a success, it is valuable to highlight the method's limitations.
However, we do find evidence for the substitutability of crowdworkers for coders in some tasks, though the boundaries of this substitutability are somewhat blurry. Although our results demonstrate that crowdworkers can substitute for coders in some simple information retrieval tasks, the difficulty of the task has a strong negative relationship with their substitutability.
Acknowledgements
We thank Ryan Bakker, Ken Benoit, Adam Glynn, Noah Nathan, Amy Semet and participants at the American Political Science Association Annual Meeting, the European Political Science Association General Conference, the Midwest Political Science Association Annual Conference and the V-Dem Annual Conference (all 2017) for comments on earlier drafts of this paper. We also thank three anonymous reviewers and the editor, Daniel Stockemer, for their valuable contributions during the review process. Josefine Pernes and Natalia Stepanova provided invaluable administrative support, and Paige Ottmar provided outstanding research assistance. Susan Williams copy-edited the final draft of the manuscript.
Author contributions
First authors are listed alphabetically and are followed by second authors, also listed alphabetically.
Data availability statement
All materials necessary to replicate the analyses in the text and appendices are available in the associated online supplementary materials.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship and/or publication of this article: Riksbankens Jubileumsfond, Grant M13-0559:1; the Knut and Alice Wallenberg Foundation Grant 2013.0166 and the National Science Foundation Grant No. SES-1423944; as well as internal grants from the vice-chancellor’s office, the dean of the College of Social Sciences, and the Department of Political Science at the University of Gothenburg. Michael Bernhard’s work on the article was supported by the University of Florida Foundation.
Ethical considerations
Human subjects research was approved by Regionala Etikprövningsnämnden i Göteborg, DNR 1079-16.
Supplemental material
Supplemental material for this article is available online.