Sage Journals: Discover world-class research

Abstract

Expert-coded datasets provide scholars with otherwise unavailable data on important concepts. However, expert coders vary in their reliability and scale perception, potentially resulting in substantial measurement error. These concerns are acute in expert coding of key concepts for peace research. Here I examine (1) the implications of these concerns for applied statistical analyses, and (2) the degree to which different modeling strategies ameliorate them. Specifically, I simulate expert-coded country-year data with different forms of error and then regress civil conflict onset on these data, using five different modeling strategies. Three of these strategies involve regressing conflict onset on point estimate aggregations of the simulated data: the mean and median over expert codings, and the posterior median from a latent variable model. The remaining two strategies incorporate measurement error from the latent variable model into the regression process by using multiple imputation and a structural equation model. Analyses indicate that expert-coded data are relatively robust: across simulations, almost all modeling strategies yield regression results roughly in line with the assumed true relationship between the expert-coded concept and outcome. However, the introduction of measurement error to expert-coded data generally results in attenuation of the estimated relationship between the concept and conflict onset. The level of attenuation varies across modeling strategies: a structural equation model is the most consistently robust estimation technique, while the median over expert codings and multiple imputation are the least robust.

Keywords

Bayesian methods civil conflict conflict onset ethnic politics expert-coded data latent variable models

Expert-coded datasets such as the Chapel Hill Expert Survey, Electoral Integrity Project, Human Rights Measurement Initiative, and Varieties of Democracy (V-Dem) allow scholars to conduct cross-national longitudinal research on vital concepts (Bakker et al., 2012; Norris, Frank & Martínez i Coma, 2014; Clay et al., 2020; Coppedge et al., 2018). However, expert-coded data come with potential disadvantages. Experts are susceptible to different sources of error (Clinton & Lewis, 2008; Bakker et al., 2014; Marquardt & Pemstein, 2018b); such error may bias results in statistical analyses (Lindstädt, Proksch & Slapin, 2018). These concerns are particularly acute in the context of quantitative peace research. Since outcomes such as conflict onset are rare events, quantitative analyses will be sensitive to measurement error on the right-hand side. Moreover, expert perceptions of key correlates of conflict may be endogenous to this outcome.

Given these concerns, awareness of the degree to which different forms of expert error substantively matter is of great importance, as is understanding of the extent to which different modeling strategies can correct for these errors. I provide insight into these two issues by conducting a series of ecologically valid simulation analyses in which I vary expert error in the measurement of a latent concept. I then regress conflict onset on the simulated data using five modeling strategies: three strategies utilize common point estimates, while the other two incorporate measurement uncertainty.

Results from these analyses indicate that most methods roughly recover the correct relationship between the expert-coded concept and conflict onset, even when expert error is extremely high. At the same time, simulated expert error almost always results in attenuation bias, in line with standard expectations from the error-in-variables literature (Chesher, 1991; King, Keohane & Verba, 1994; Stefanski, 2000; Hausman, 2001): simulated error reduces the magnitude of the relationship between the expert-coded variable and conflict onset. However, the degree to which this attenuation occurs varies across modeling strategies: the most robust strategy is a structural equation model which iteratively estimates concept values and their relationship to conflict onset, while the median and multiple imputation are the least robust.

Measuring latent concepts with expert-coded data

As articles in this special issue illustrate, scholars can use data from a variety of sources to estimate important latent concepts (Barnum & Lo, 2020; Fariss, Kenwick & Reuning, 2020; Krüger & Nordås, 2020; Terechshenko, 2020). Individuals who have extensive knowledge about both these concepts and particular cases – henceforth ‘experts’ – can also efficiently provide information for such purposes (Marquardt et al., 2017). However, since latent concepts are not directly observable, equally expert experts are likely to have different perceptions of ‘true’ latent values. Experts will therefore disagree in their codings. While such disagreement is an integral element of expert coding, disagreement may also result from variation in expert (1) scale perception (a concept known as differential item functioning, or DIF) and (2) reliability (Clinton & Lewis, 2008; Bakker et al., 2014; Lindstädt, Proksch & Slapin, 2018; Marquardt & Pemstein, 2018b).

These latter forms of disagreement present problems for quantitative analyses that use expert-coded data, particularly in the context of peace research. To illustrate these problems, I focus on an expert-coded latent variable from the V-Dem dataset, identity-based discrimination (‘Social group equality in respect for civil liberties,’ Coppedge et al., 2018).¹ Though identity-based discrimination underlies many theories of ethnic conflict and separatism (Gellner, 1983; Gurr, 1993; Horowitz, 2000), it is difficult to observe: relevant forms of identity and discrimination vary across countries, and governments generally have little incentive to publicize the degree to which they engage in discrimination. As a result, this concept is an excellent example of an important-but-challenging latent concept for expert coding.

Differential item functioning in expert-coded data

Experts may have different perceptions of question scales. For example, the V-Dem identity-based discrimination question asks experts to use a five-point Likert scale to report the degree to which a government deprives citizens of civil liberties on the basis of membership in social groups (a term which largely aligns with ethnicity).² The words ‘much’, ‘substantially’, ‘moderately’, and ‘slightly’ modify the degree to which social groups enjoy fewer civil liberties in the question scale. In this example, one expert’s ‘substantially’ could easily be another expert’s ‘much’. As a result, two experts who perceive the same latent level of discrimination may report different values.

If DIF is randomly distributed across a sufficiently large number of experts, this source of expert disagreement is problematic only in that it increases uncertainty about an estimate. However, it is likely that DIF is not randomly distributed, but instead clustered by cases. Such clustering raises concerns about cross-national comparability.

Most generally, experts with similar backgrounds may exhibit similar biases (Maestas, Buttice & Stone, 2014). Since experts with expertise on a specific country are those most likely to code it, these biases may be clustered by country. More specific concerns are also possible. If a country changes from having relatively high levels of a latent variable to a lower level, experts who focus on this country may code the change as more extreme than would experts with more comparative experience (Pemstein, Tzelgov & Wang, 2015). Actors in conflict settings may use grievances – ranging from discrimination to repression to corruption – to justify their struggle, even if the true cause of conflict lies elsewhere (Fearon & Laitin, 2003). Such framing may lead country experts to perceive high levels of the concept linked to the salient grievance, regardless of the concept’s actual level.

Variation in expert reliability

Potential variation in expert reliability is also a concern. For example, V-Dem uses a network of over 3,500 expert coders. Although the project employs a rigorous procedure to select experts (Coppedge et al., 2019), there is almost certainly variation in the degree to which these coders are knowledgeable about their cases and concepts.

As with DIF, more idiosyncratic forms of variation in reliability are also possible. Reliable information about political processes may not be readily available in cases with conflict, causing experts to rely on less accurate and potentially biased sources. Civil conflict is a context that may particularly lead to expert polarization: in periods leading up to conflict a regime-sympathizing expert may code latent concepts such as identity-based discrimination as decreasing, while a regime opponent may code the same concept as increasing.

Aggregating expert-coded data

The method by which a researcher aggregates expert-coded data has important implications for the degree to which the resulting estimates are robust to DIF and variation in expert reliability. Here I focus on three aggregation techniques: (1) the normalized average; (2) the median; and (3) latent variable models, in this case a modified Bayesian ordinal item response theory (IRT) model. The primary virtue of the first two methods is that they are straightforward, with Lindstädt, Proksch & Slapin (2018) arguing that the median is a robust alternative to the more commonly used average.

The third aggregation technique has two main virtues. First, it provides estimates of uncertainty that can be readily incorporated into regression analyses, as I will discuss in the following section. Second, it can account for both DIF and variation in expert reliability. In particular, IRT models outperform both the normalized average and the median in recovering latent values when expert error is high (Marquardt & Pemstein, 2018a,b).³

To explain how IRT models can account for variation in expert scale perception and reliability, I provide a brief overview of the standard V-Dem IRT model.⁴ Equation 1 presents the partial likelihood for this model.

\begin{array}{l} Pr (y_{c t r} = k) & = Φ (τ_{r, k} - θ_{c t} β_{r}) - Φ (τ_{r, k - 1} - θ_{c t} β_{r}) \end{array}

y represents the ordinal coding (values $1, ..., k$ ) which expert r provides for country-year $c t$ , and $θ$ the latent value for this country-year. The model accounts for DIF through $τ$ , $k - 1$ threshold values that are specific to expert r. Since thresholds provide the value relative to which $θ$ must be greater in order for an expert to provide a given scale item over the next lowest, this strategy allows scale perception to vary by expert. The model also clusters expert thresholds by the main country an expert codes. By assuming that experts who focus on the same country have similar understandings of the scale, the approach leverages ‘bridging’ – overlap in codings by experts who generally code different cases – to correct for systematic biases shared by these experts and thereby facilitate cross-national comparability (Pemstein, Tzelgov & Wang, 2015).⁵

The model accounts for variation in expert reliability through $β$ , idiosyncratic discrimination parameters for each expert. Essentially, this parameter weights an expert’s contribution to the measurement process based on the degree they diverge from other experts who code the same cases: experts who diverge more from other experts contribute less.

Modeling expert-coded data in regression analyses

Since expert disagreement is an inherent aspect of expert-coded data, incorporating the resulting measurement uncertainty into regression analyses is of clear theoretical value. I illustrate this process using a very reduced Bayesian probit model,⁶ in which I regress a country-year level indicator of civil conflict onset (Gleditsch et al., 2002; Pettersson & Wallensteen, 2015; Girardin et al., 2015) on different aggregations of a latent expert-coded concept with a one-year lag. The model also includes a cubic spline and country and year effects. Equation 2 presents the baseline model:

\begin{array}{l} Pr (y_{i} = 1) = Φ (α_{j} + ψ_{j} θ_{i - 1} + ζ_{j,1} t_{i} + ζ_{j,2} t_{i}^{2} + ζ_{j,3} t_{i}^{3} + C o u n t r y_{j, i} + Y e a r_{j, i}) \end{array}

Here $Φ$ represents the CDF of a normal distribution, i the observation and j the iteration over the Markov chain Monte Carlo (MCMC) algorithm. $θ$ represents a point estimate from a latent variable; t, $C o u n t r y$ , and $Y e a r$ are self-explanatory.⁷

The Bayesian estimation strategy allows me to incorporate measurement uncertainty about the latent concept in two ways. First, it facilitates multiple imputation using draws from the IRT model in Equation 1.⁸ Specifically, I rerun the model eight times with each of 500 draws from the posterior distribution of $θ$ . Equation 3 illustrates this measurement strategy, with the key distinction being subscript $k = 1, ..., 500$ , denoting 500 posterior draws from $θ$ .

\begin{array}{l} Pr (y_{i} = 1) = Φ (α_{j} + ψ_{j} θ_{k, i - 1} + ζ_{j,1} t_{i} + ζ_{j,2} t_{i}^{2} + ζ_{j,3} t_{i}^{3} + C o u n t r y_{j, i} + Y e a r_{j, i}) \end{array}

Second, I embed the IRT model within the regression equation using a structural equation model. This model iteratively estimates $θ$ (the latent concept) and $ψ$ , the coefficient estimating the relationship between $θ$ and conflict onset. As with the multiple imputation model, this model accounts for measurement error in $θ$ . However, the iterative estimation of $ψ$ and $θ$ essentially allows $ψ$ to act as a perfectly bridged equivalent of $β$ in Equation 1. The model thus allows $ψ$ to adjudicate between experts in cases of disagreement, lowering the influence of those whose opinion is inconsistent with the relationship between the latent variable and conflict. Equation 4 presents this model, with the subscript j on $θ$ illustrating that $θ$ is estimated iteratively with all other model parameters.

\begin{array}{l} Pr (y_{i} = 1) = Φ (α_{j} + ψ_{j} θ_{j, i - 1} + ζ_{j,1} t_{i} + ζ_{j,2} t_{i}^{2} + ζ_{j,3} t_{i}^{3} + C o u n t r y_{j, i} + Y e a r_{j, i}) \end{array}

Simulation analyses of expert error

While there are clear theoretical distinctions between different aggregation techniques and modeling strategies, the extent to which they matter in an applied regression context is unclear. To put it bluntly: while we know that experts can be unreliable and have different scale perceptions, we do not know the extent to which expert-coded data can yield misleading regression results. Equally importantly, we also do not know if different modeling strategies can correct for the different forms of measurement error we expect in expert coding.

I therefore conduct a series of simulation analyses to provide insight into the sensitivity of regression analyses to different forms of expert error, using the modeling strategies described in the previous sections.⁹ Note that conflict onset – the outcome in these analyses – is a rare event, occurring in only 2.8% of observations. As a result, these analyses should be highly sensitive to perturbations in the expert-coded data and thus constitute a strong test of robustness.

I use data from the V-Dem identity-based discrimination variable to create ecologically valid simulated data. Specifically, experts in the simulated datasets code the same set of observations as their real-world counterparts. I set the true values for each observation as equal to the posterior median estimate of identity-based discrimination, which means that the true relationship between the simulated concept and conflict onset is known. I then simulate different forms of error across expert coders, and use these forms of error in conjunction with the known true values to create simulated expert coding datasets.¹⁰ I regress conflict onset on different aggregations of the simulated data, using the models in Equations 2 –4. I replicate this procedure thrice for each form of simulated error to check robustness.

Although I derive the data from a particular latent variable (identity-based discrimination), this simulation strategy means that the results should be generalizable to other expert-coded data, with several caveats. First, these analyses constitute a particularly hard test for expert-coded data. Not only is the outcome of interest a rare event, but the assumed relationship between identity-based discrimination and conflict onset is relatively weak, if consistently positive: a change from a low to high level of identity-based discrimination correlates with a .04 increase in the posterior probability of conflict onset.¹¹ As a result, the simulated data are likely more sensitive to expert error than they would be for a latent concept with a stronger relationship with conflict. Second, V-Dem data are relatively sparse: while a total of 1,418 experts code some subset of cases for this variable, the median observation has six coders. If the data were less sparse (i.e. more than six experts coded each observation), the analyses would be more robust to random expert error; the converse is also true (Marquardt & Pemstein, 2018b). Third, there is relatively substantial bridging in the data: the median number of additional countries which the experts who coded a given country also coded is 25, while the equivalent statistic at the observation level is 10. As a result, the data present the IRT model in Equation 1 with the potential to correct for systematic differences in scale perception and reliability across countries. As with general data density, greater bridging density would likely make the data more robust to systematic error.

Random error

The first set of simulations assume that expert error is randomly distributed across experts, matching modeling assumptions in Marquardt & Pemstein (2018b). These simulations include two levels of error for both expert reliability and DIF. The first level corresponds to a moderate level of error, while the second corresponds to a high level in which DIF spans the threshold range and a substantial proportion of experts have negative reliability.¹²

Figure 1 presents posterior effect estimates from regressions that include moderate and high levels of both types of expert error, divided by modeling strategy.¹³

Figure 1.

Posterior estimate of latent concept’s effect on probability of conflict onset, simulated data with random error in expert coding

Each strategy has three estimates, one for each simulation replication. The cells also have three vertical lines, representing the estimated true relationship between the concept and conflict (i.e. the median and 95% credible region for the posterior median in Figure F.2 in the Online appendix). A method that fully recovers the true relationship between the concept and conflict onset would have a median estimate on the middle line and credible regions that coincide with the left and right lines.

Perhaps the most important result in Figure 1 is that, even in a scenario of extremely high simulated DIF and variation in expert reliability (Subfigure 1B), all modeling strategies result in consistently positive estimates of the relationship between the latent concept and conflict onset, in line with the assumed true relationship. However, the figure also indicates that estimates from all strategies attenuate the relationship between the concept and conflict onset, even in the presence of only moderate variation in expert reliability and DIF (Subfigure 1A).

The degree to which this attenuation occurs varies across methods. The structural equation model yields effect estimates that are the closest to the true relationship, and the credible regions for all estimates that use this aggregation technique overlap with the point estimate for the true relationship. In contrast, the median and the multiple imputation techniques yield the most attenuated estimates. This result is most apparent in the context of high variation in expert error (Subfigure 1B), in which the credible regions of the estimated effect for both of these aggregation techniques do not overlap with the true effect in any simulation. The average and the posterior median perform similarly to each other: worse than the structural equation model and better than the median and multiple imputation.

Systematic error

Expert error may not be randomly distributed across experts, and such systematic error may be even more problematic than random error. I therefore create two simulated datasets in which experts who focus on countries with ethnic conflict systematically differ from other experts.¹⁴ In the first dataset, these experts have lower reliability on average than other experts. This simulation approach is in line with the concern that experts who code certain cases of conflict may have less access to information about these cases, or are ideologically polarized and thus provide divergent codings. This approach should further attenuate the estimated relationship between the latent concept and conflict onset.

Figure 2.

Posterior estimate of latent concept’s effect on probability of conflict onset, simulated data with systematic error in expert coding

In the second dataset, experts who code ethnic conflict tend to perceive higher levels of the latent concept. This approach models the possibility that experts who focus on cases with a particular form of conflict may systematically – and erroneously – perceive high levels of a related latent concept. This simulation strategy should artificially increase the relationship between the concept and conflict onset.

I present results from simulations with systematic variation in expert reliablity and DIF in Figure 2.¹⁵ Subfigure 2A presents results from analyses of simulated data with systematic variation in expert reliability. The structural equation model performs very well in this context: in two simulations, the point estimates and credible regions coincide substantially with the true relationship. The mean over expert scores and posterior median also perform well in these analyses, while the relationship between the median over expert scores and conflict onset tends to have relatively more attenuation. As with previous analyses, multiple imputation yields the most attenuated relationships between the concept and conflict onset.

Subfigure 2B presents results from analyses of simulated data with systematic DIF. Contrary to expectations, there is little evidence that this form of simulated DIF artificially strengthens the relationship between the concept and conflict. Instead, the results again tend to show a generally attenuated relationship between the concept and outcome, though the structural equation model, posterior median, median and mean all perform relatively well in recovering the true relationship. Multiple imputation again provides the most attenuated estimates of the relationship between the concept and conflict.

Conclusion

The simulation analyses in this article have illustrated that regression analyses of civil conflict onset which use expert-coded data on the right-hand side are relatively robust to expert error. As a rare event, conflict onset is an outcome for which regression analyses are likely highly sensitive to expert error; the not-overwhelmingly strong relationship between the expert-coded concept and conflict onset increases this sensitivity. These analyses thus present a hard test for the robustness of expert-coded data, which the data largely pass.

Despite their broad robustness, the analyses demonstrate that expert error almost always attenuates the estimated relationship between the expert-coded concept and conflict. The level of attenuation varies based on modeling strategy. Multiple imputation consistently provides the least robust estimates. The median over expert codings is less robust than either the average over expert codings or the posterior median from an IRT model; Online appendix J also presents some evidence that the posterior median is more robust to variation in expert reliability than both the average and the median.

The most robust estimates of the relationship between the concept and conflict onset come from a structural equation model, indicating that scholars should consider using such a model in future quantitative research. To that end, the replication materials provide code and instructions for using such models.

These conclusions come with several scope conditions. First, though simulated error leads to attenuation bias in this relatively simple regression context, both the form and level of bias may change in more complicated multivariate analyses (Cochran, 1968; Cragg, 1994). Moreover, concerns about the unpredictable effects of measurement error are particularly acute in non-linear contexts (Stefanski & Carroll, 1985). Second, if the latent explanatory variable had a stronger relationship with the outcome – and the outcome were not a rare event – expert error would likely attenuate their estimated relationship to a lesser extent and could in fact amplify it if error is systematic. Third, if there were more experts-per-observation – as is the case with expert-coded datasets such as the Chapel Hill Expert Survey and the Human Rights Measurement Initiative – analyses would likely be even more robust to random expert error. Fourth, greater bridging density or more efficient use of anchoring vignettes could make the data more robust to systematic error. Future research would do well to probe these results using different outcomes, forms of expert error, and patterns of expert coding.

Footnotes

Replication data

An abbreviated replication dataset that focuses on applied analyses is available at https://doi.org/10.7910/DVN/BINH8N. The complete replication files for the empirical analysis in this article can be found at http://www.prio.org/jpr/datasets. All analyses were conducted using R; all Bayesian analyses use Stan (Stan Development Team, 2018).

Acknowledgments

I thank Ruth Carlitz, Carl Henrik Knutsen, Anna Lührmann, Juraj Medzihorsky, and Daniel Pemstein for their helpful insights. I also thank Chris Fariss, James Lo, and four anonymous reviewers for their valuable comments on earlier drafts.

Funding

I prepared the article within the framework of the HSE University Basic Research Program, with funding from the Russian Academic Excellence Project ‘5-100’. The work was also supported by the National Science Foundation (SES-1423944), Riksbankens Jubileumsfond (M13-0559:1), the Swedish Research Council (2013.0166), the Knut and Alice Wallenberg Foundation, and the University of Gothenburg (E 2013/43). I performed simulations using resources provided by the High Performance Computing section and the Swedish National Infrastructure for Computing at the National Supercomputer Centre in Sweden (SNIC 2017/1-406 and 2018/3-543).

ORCID iD

Kyle L Marquardt

Notes

References

Bakker

Ryan

de Vries

Catherine

Edwards

Erica

Hooghe

Liesbet

Jolly

Seth

Marks

Gary

Polk

Jonathan

Rovny

Jan

Steenbergen

Marco

Vachudova

Milada Anna

(2012) Measuring party positions in Europe: The Chapel Hill Expert Survey trend file, 1999–2010. Party Politics 21(1): 143–152.

Bakker

Ryan

Jolly

Seth

Polk

Jonathan

Poole

Keith

(2014) The European common space: Extending the use of anchoring vignettes. Journal of Politics 76(4): 1089–1101.

Barnum

Miriam

James

(2020) Is the NPT unraveling? Evidence from text analysis of review conference statements. Journal of Peace Research 57(6): 740–751.

Blackwell

Matthew

Honaker

James

King

Gary

(2017) A unified approach to measurement error and missing data: Overview and applications. Sociological Methods & Research 46(3): 303–341.

Chesher

Andrew

(1991) The effect of measurement error. Biometrika 78(3): 451–462.

Clay

K Chad

Bakker

Ryan

Brook

Anne-Marie

Hill

Daniel W

Jr Murdie

Amanda

(2020) Using practitioner surveys to measure human rights: The Human Rights Measurement Initiative’s civil and political rights metrics. Journal of Peace Research 57(6): 715–727.

Clinton

Joshua D

Lewis

David E

(2008) Expert opinion, agency characteristics, and agency preferences. Political Analysis 16(1): 3–20.

Cochran

William G

(1968) Errors of measurement in statistics. Technometrics 10(4): 637–666.

Coppedge

Michael

Gerring

John

Knutsen

Carl Henrik

Lindberg

Staffan I

Skaaning

Svend-Erik

Teorell

Jan

Altman

David

Bernhard

Michael

Steven Fish

Cornell

Agnes

Dahlum

Sirianne

Gjerlow

Haakon

Glynn

Adam

Hicken

Allen

Krusell

Joshua

Lührmann

Anna

Marquardt

Kyle L

McMann

Kelly M

Mechkova

Valeriya

Medzihorsky

Juraj

Olin

Moa

Paxton

Pamela

Pemstein

Daniel

Pernes

Josefine

Seim

Brigitte

Sigman

Rachel

Staton

Jeffrey

Stepanova

Natalia

Sundström

Aksel

Tzelgov

Eitan

Wang

Yi-ting

Wig

Tore

Wilson

Steven L

Ziblatt

Daniel F

(2018) V-Dem Dataset v8. Technical report. Varieties of Democracy Project (http://dx.doi.org/10.2139/ssrn.3172819).

10.

Coppedge

Michael

Gerring

John

Knutsen

Carl Henrik

Lindberg

Staffan I

Teorell

Jan

Marquardt

Kyle L

Medzihorsky

Juraj

Pemstein

Daniel

Pernes

Josefine

Römer

Johannes von

Stepanova

Natalia

Tzelgov

Eitan

Wang

Yi-ting

Wilson

Steven L

(2019) Varieties of Democracy Methodology v9. Working paper. Varieties of Democracy Project (http://dx.doi.org/10.2139/ssrn.3441063).

11.

Cragg

John G

(1994) Making good inferences from bad data. Canadian Journal of Economics 27(4): 776–800.

12.

Fariss

Christopher J

Kenwick

Michael R

Reuning

Kevin

(2020) Estimating one-sided-killings from a robust measurement model of human rights. Journal of Peace Research 57(6): 801–814.

13.

Fearon

James D

Laitin

David D

(2003) Ethnicity, insurgency, and civil war. American Political Science Review 97(1): 75–90.

14.

Gellner

Ernest

(1983) Nations and Nationalism. Ithaca, NY: Cornell University Press.

15.

Girardin

Luc

Hunziker

Philipp

Cederman

Lars-Erik

Bormann

Nils-Christian

Vogt

Manuel

(2015) Grow

+^{u p}

– Geographical Research On War, Unified Platform. Technical report. ETH Zurich (http://growup.ethz.ch/).

16.

Gleditsch

Nils Petter

Wallensteen

Peter

Eriksson

Mikael

Sollenberg

Margareta

Strand

Håvard

(2002) Armed conflict 1946–2001: A new dataset. Journal of Peace Research 39(5): 615–637.

17.

Gurr

Ted R

(1993) Minorities At Risk: A Global View of Ethnopolitical Conflicts. Washington, DC: United States Institute of Peace Press.

18.

Hare

Christopher

Armstrong

David A

Bakker

Ryan

Carroll

Royce

Poole

Keith T

(2015) Using Bayesian Aldrich-Mckelvey Scaling to study citizens’ ideological preferences and perceptions. American Journal of Political Science 59(3): 759–774.

19.

Hausman

Jerry

(2001) Mismeasured variables in econometric analysis: Problems from the right and problems from the left. Journal of Economic Perspectives 15(4): 57–67.

20.

Horowitz

Donald L

(2000) Ethnic Groups in Conflict, 2nd edition. Berkeley, CA: University of California Press.

21.

King

Gary

Wand

Jonathan

(2007) Comparing incomparable survey responses: Evaluating and selecting anchoring vignettes. Political Analysis 15(1): 46–66.

22.

King

Gary

Keohane

Robert O

Verba

Sidney

(1994) Designing Social Inquiry: Scientific Inference in Qualitative Research. Princeton, NJ: Princeton University Press.

23.

Krüger

Jule

Nordås

Ragnhild

(2020) A latent variable approach to measuring wartime sexual violence. Journal of Peace Research 57(6): 728–739.

24.

Lindstädt

René

Proksch

Sven-Oliver

Slapin

Jonathan B

(2018) When experts disagree: Response aggregation and its consequences in expert surveys. Political Science Research & Methods 8(3): 580–588.

25.

Maestas

Cherie D

Buttice

Matthew K

Stone

Walter J

(2014) Extracting wisdom from experts and small crowds: Strategies for improving informant-based measures of political concepts. Political Analysis 22(3): 354–373.

26.

Marquardt

Kyle L

Pemstein

Daniel

(2018a) Estimating latent traits from expert surveys: An analysis of sensitivity to data generating process. Working paper 83. Varieties of Democracy Project (http://dx.doi.org/10.2139/ssrn.3302459).

27.

Marquardt

Kyle L

Pemstein

Daniel

(2018b) IRT models for expert-coded panel data. Political Analysis 26(4): 431–456.

28.

Marquardt

Kyle L

Pemstein

Daniel

Petrarca

Constanza Sanhueza

Seim

Brigitte

Wilson

Steven L

Bernhard

Michael

Coppedge

Michael

Lindberg

Staffan I

(2017) Experts, coders, and crowds: An analysis of substitutability. Working paper 53. Varieties of Democracy Project (http://dx.doi.org/10.2139/ssrn.3046462).

29.

Norris

Pippa

Frank

Richard W

Coma

Ferran Martínez i

(2014) Measuring electoral integrity around the world: A new dataset. PS: Political Science & Politics 47(4): 789–798.

30.

Pemstein

Daniel

Marquardt

Kyle L

Tzelgov

Eitan

Wang

Yi-ting

Krusell

Joshua

Miri

Farhad

(2018) The V-Dem Measurement Model: Latent variable analysis for cross-national and cross-temporal expert-coded data. Working paper 3rd edition. Varieties of Democracy Project (http://dx.doi.org/10.2139/ssrn.3167764).

31.

Pemstein

Daniel

Tzelgov

Eitan

Wang

Yi-ting

(2015) Evaluating and improving Item Response Theory models for cross-national expert surveys. Working paper. Varieties of Democracy Project (http://dx.doi.org/10.2139/ssrn.2613421).

32.

Pettersson

Therése

Wallensteen

Peter

(2015) Armed conflicts, 1946–2014. Journal of Peace Research 52(4): 536–550.

33.

Schennach

Susanne M

(2016) Recent advances in the measurement error literature. Annual Review of Economics 8: 341–377.

34.

Shor

Boris

Bafumi

Joseph

Keele

Luke

Park

David

(2007) A Bayesian multilevel modeling approach to time-series cross-sectional data. Political Analysis 15(2): 165–181.

35.

Stan Development Team (2018) RStan: The R interface to Stan R package version 2.18.2 (http://mc-stan.org/).

36.

Stefanski

Leonard A

(2000) Measurement error models. Journal of the American Statistical Association 95(452): 1353–1358.

37.

Stefanski

Leonard A

Carroll

Raymond J

(1985) Covariate measurement error in logistic regression. Annals of Statistics 13(4): 1335–1351.

38.

Terechshenko

Zhanna

(2020) Hot under the collar: A latent measure of interstate hostility. Journal of Peace Research 57(6): 764–776.

How and how much does expert error matter? Implications for quantitative peace research

Abstract

Keywords

Measuring latent concepts with expert-coded data

Differential item functioning in expert-coded data

Variation in expert reliability

Aggregating expert-coded data

Modeling expert-coded data in regression analyses

Simulation analyses of expert error

Random error

Systematic error

Conclusion

Footnotes

Replication data

Acknowledgments

Funding

ORCID iD

Notes

References