Abstract
Experts code latent quantities for many influential political science datasets. Although scholars are aware of the importance of accounting for variation in expert reliability when aggregating such data, they have not systematically explored either the factors affecting expert reliability or the degree to which these factors influence estimates of latent concepts. Here we provide a template for examining potential correlates of expert reliability, using coder-level data for six randomly selected variables from a cross-national panel dataset. We aggregate these data with an ordinal item response theory model that parameterizes expert reliability, and regress the resulting reliability estimates on both expert demographic characteristics and measures of their coding behavior. We find little evidence of a consistent substantial relationship between most expert characteristics and reliability, and these null results extend to potentially problematic sources of bias in estimates, such as gender. The exceptions to these results are intuitive, and provide baseline guidance for expert recruitment and retention in future expert coding projects: attentive and confident experts who have contextual knowledge tend to be more reliable. Taken as a whole, these findings reinforce arguments that item response theory models are a relatively safe method for aggregating expert-coded data.
Introduction
Many political science datasets use experts to code concepts that are difficult to directly assess (Bakker et al., 2012; Buttice and Stone, 2012; Kitschelt and Kselman, 2012; Castles and Mair, 1984; Clinton and Lewis, 2008). Although modeling rater-level bias and reliability when aggregating codings is of clear importance (Johnson and Albert, 1999; Maestas et al., 2014; Wagner et al., 2010), there has been little exploration of either the factors that influence reliability in political science contexts or their implications for model design. Such exploration is essential for both assessing the validity of data-aggregation methods and determining criteria for expert retention and recruitment.
Here we analyze potential correlates of expert reliability in the context of a cross-national survey of political traits: the Varieties of Democracy (V–Dem) Dataset (Coppedge et al., 2018a), which employs a diverse body of over 3000 experts to code over 121 ordinal variables covering a variety of regime traits from 1900–2017. 1 This diversity of experts and contexts provides an ideal laboratory for analyzing coder reliability.
We measure reliability using expert-specific discrimination (reliability) parameters from six randomly selected V–Dem variables. In the item response theory (IRT) context, reliability parameters represent the degree to which an expert randomly diverges from other experts who code the same cases; experts who code in patterns similar to those of their peers receive higher reliability scores and thus contribute more to the estimation of the latent concept. This operationalization aligns with classic definitions of reliability (Carmines and Zeller, 1979), as well as work examining convergence among crowd-platform coders (Benoit et al., 2016).
In this analysis, we regress these reliability parameters on both expert coding behavior and demographic characteristics. Doing so provides insight into the degree to which a prominent method for aggregating expert-coded data—an IRT model that accounts for both variation in expert reliability and scale perception (Clinton and Lewis, 2008; Pemstein et al., 2019)—provides substantively unbiased estimates of latent concepts.
In general, we find weak and inconsistent relationships between reliability and expert characteristics. Most of these null findings regard variables that could constitute problematic sources of bias in the estimation procedure, such as gender. The exceptions are intuitive: reliable experts tend to be those who (a) are more confident in their codings, (b) vary their codings, and (c) evince contextual knowledge of an important concept. Cumulatively, these findings indicate that IRT models incorporating expert reliability and scale perception parameters are a safe method for aggregation.
Reliability in the V–Dem model
We use a modified version of the V–Dem measurement model (Pemstein et al., 2019) to estimate expert reliability for each of the six variables. 2
This model derives from the basic assumption that each expert r perceives the latent value z_{ct} of the concept for country c in year t with error, such that her perception is \tilde{z}_{ctr} = z_{ct} + e_{ctr}, where e_{ctr} ~ N(0, \sigma_r^2). The expert then translates this perception into an ordinal rating using expert-specific thresholds, reporting category k when \tilde{z}_{ctr} falls between her thresholds \tau_{r,k-1} and \tau_{r,k}.

Two sets of parameters in this model are of particular importance. First, the error variance \sigma_r^2 captures the degree to which expert r's codings randomly diverge from those of her peers; its inverse is the expert's reliability, with smaller error variance yielding greater reliability and thus a greater contribution to the estimate of z_{ct}. Second, the thresholds \tau_{r,k} allow different experts to perceive the ordinal scale differently, accounting for differential item functioning (DIF) across experts. In IRT terminology, the reliability parameter is a discrimination parameter: codings from experts with higher discrimination are more informative about the latent trait.
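As an illustration, the ordinal coding process the measurement model assumes can be sketched with a small simulation. The parameter values, thresholds, and function name below are hypothetical, and this is the data-generating process only, not the estimation code:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_codings(z, sigma_r, thresholds_r):
    """Simulate one expert's ordinal codings of latent country-year traits.

    z: array of latent trait values (one per country-year)
    sigma_r: expert-specific error SD (smaller SD = more reliable expert)
    thresholds_r: expert-specific ordered cutpoints mapping perception to categories
    """
    perceived = z + rng.normal(0.0, sigma_r, size=len(z))  # noisy perception of z
    # np.digitize returns the ordinal category implied by the cutpoints
    return np.digitize(perceived, thresholds_r)

z = np.array([-1.5, -0.5, 0.5, 1.5])    # hypothetical latent values
cutpoints = np.array([-1.0, 0.0, 1.0])  # a 4-category ordinal scale

reliable = simulate_codings(z, sigma_r=0.1, thresholds_r=cutpoints)
noisy = simulate_codings(z, sigma_r=5.0, thresholds_r=cutpoints)
```

The reliable expert's codings track the ordering of the latent values closely, while the noisy expert's codings diverge from them in a pattern the model treats as random error.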
Benefits of analyzing reliability correlates
Analyses of potential reliability correlates provide a diagnostic of a measurement model. In the model we use, experts with lower reliability scores contribute less to the estimation of country-year latent traits, the parameters of interest in most applications. Systematic biases that are inconsistent with model assumptions—notably case-varying systematic differences across experts—will appear to the model like random error, resulting in lower reliability scores among experts who exhibit such biases. Although certain coder characteristics—such as conceptual knowledge—should correlate with reliability, other traits should not. Analyses of respondent-level reliability can therefore provide insight into potential threats to validity by highlighting classes of experts for which selection procedures and modeling assumptions do not effectively adjust for systematic bias.
A key example is gender. A majority of V–Dem experts are men. If women systematically perceive a latent trait differently than men, and this systematic bias is not adequately modeled through threshold estimates, women could receive lower reliability scores even though their viewpoint is equally valid. Such a result would indicate problematic bias in the measurement process.
Analyses of reliability correlates also provide tentative evidence regarding the characteristics of more reliable experts, which may facilitate decisions on expert recruitment and retention. Although research stresses that expertise is important for data validity (Maestas et al., 2014), potential correlates of intra-expert variation in this context remain largely unexplored.
Variables and descriptive statistics
Reliability
We analyze reliability (the expert-specific discrimination parameter from the measurement model) for six V–Dem variables. We randomly selected all six variables. 5 We selected one variable (Female freedom of discussion) from the subset of variables that explicitly pertain to gender, given our interest in the relationship between expert gender and reliability.
We use Markov chain Monte Carlo (MCMC) methods to estimate the IRT model for each of the variables included in the analysis. 7 MCMC methods generate samples from the posterior distributions of model parameters; we use the full posterior of reliability estimates across iterations of the MCMC algorithm to account for measurement error.
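A minimal sketch of how the full posterior of reliability estimates can be summarized, assuming a hypothetical matrix of MCMC output (the dimensions and distribution below are illustrative, not V–Dem's actual output):

```python
import numpy as np

# Hypothetical MCMC output: rows are posterior draws, columns are
# experts; entries are samples of the reliability parameter.
rng = np.random.default_rng(1)
draws = np.abs(rng.normal(1.0, 0.3, size=(4000, 5)))

# Summarize each expert's reliability with the posterior median and a
# 90% equal-tailed credible interval, rather than collapsing to a
# point estimate before downstream analysis.
medians = np.median(draws, axis=0)
lower, upper = np.quantile(draws, [0.05, 0.95], axis=0)
```

Retaining the full matrix of draws, rather than only `medians`, is what allows measurement error in reliability to propagate into later analyses.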
Correlates of reliability
We discuss each set of potential reliability correlates in turn. All variables related to coding behavior refer to the variable being analyzed; the self-reported confidence and coding variation variables use reduced data. 8 Online Appendix C presents descriptive statistics.
Demographics
Previous research illustrates that a rater’s background can influence their perception of latent traits (Cumming, 1990; Michael et al., 1980; Royal-Dawson and Baird, 2009), and raters with greater expertise are more reliable when rating complex or broad tasks (Schoonen et al., 1997). We therefore include measures of education and university employment, which indicate relevant expertise and thus potentially greater reliability.
We trichotomize education: experts with a (a) PhD (reference level), (b) Professional degree such as a Master of Business Administration or Doctor of Jurisprudence, or (c) MA or lower. We analyze employment with four indicators: employees of a Public university (the reference level), Private university, the Government, and Other (non-governmental, non-academic employment). We separate public and private employment because experts in the private sector may be more reliable, because they are potentially less susceptible to government pressure or other incentives to provide biased estimates.
Because gender may influence reliability for reasons previously discussed, we include the dichotomous indicator Female. We also include the natural logarithm of a respondent's Age.
Knowledge
We a priori expect all experts to have a high level of knowledge about the cases and concepts they code. Equally knowledgeable experts should provide similar coding patterns, although their codings may vary due to DIF or case-level stochastic error. However, if some experts know less about a concept or case, their codings may vary in a fashion that is not attributable to DIF or case-level stochastic error. For example, a less knowledgeable expert may miss changes in latent concept values. As a result, less knowledgeable experts should receive lower reliability scores.
Measuring knowledge is difficult in the absence of concrete data (e.g., responses to factual questions about a case). We therefore use three proxies to measure different types of knowledge. Because these proxies are not comprehensive, results should be interpreted with caution.
We proxy lower case knowledge with an indicator for experts who are Not resident in the country they are coding, assuming that residing in a country can provide an expert experience with a case. We also measure both conceptual awareness and general knowledge. The indicator Low awareness represents experts who reported in a post-survey questionnaire that they do not consider electoral democracy—a principle that underpins most definitions of democracy—important to the broader concept of democracy. The indicator Low knowledge represents experts who either consider (a) very democratic Sweden to be non-democratic or (b) very non-democratic North Korea to be democratic. 9
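The Low knowledge proxy can be sketched as a simple flag over post-survey responses. The column names, rating scale, and midpoint cutoff below are illustrative assumptions, not V–Dem's actual questionnaire coding:

```python
import pandas as pd

# Hypothetical post-survey responses: democracy ratings of two anchor
# cases on a 0-10 scale (column names are illustrative).
experts = pd.DataFrame({
    "expert_id": [1, 2, 3],
    "sweden_rating": [9, 2, 8],        # very democratic anchor case
    "north_korea_rating": [1, 1, 9],   # very non-democratic anchor case
})

# Flag experts who rate Sweden as non-democratic OR North Korea as
# democratic, using 5 as an assumed scale midpoint.
experts["low_knowledge"] = (
    (experts["sweden_rating"] < 5) | (experts["north_korea_rating"] > 5)
).astype(int)
```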
Democracy in residence country
Experts living in democratic countries may have better access to information and may thus be more knowledgeable than experts residing in autocracies. They may also be less concerned by potential government sanction, allowing them to more accurately code sensitive concepts and cases. For both of these reasons, such experts may be more reliable. Democracy represents the average level of V–Dem’s electoral democracy index from 2008 to 2017 for an expert’s residence country.
Confidence
Experts self-report their case-level Confidence on a 0–1 scale, which we aggregate to an expert’s average over a given variable. This measure provides a rough estimate of an expert’s knowledge about the variable they are coding; experts who are generally not confident are potentially signaling low knowledge. 10 For the reasons detailed in the previous section, lower knowledge could result in lower reliability.
Attentiveness
Less attentive experts may be less reliable, because they will be less sensitive to changes in latent traits for the variables they code than more attentive experts. We measure attentiveness with two sets of indicators. First, because most countries vary in political traits, the degree to which an expert varies their scores may proxy their attentiveness. Second, because expertise likely varies over time and across countries, attentive experts should vary in self-reported confidence. We measure both variation in coding and confidence with two indicators each. Coding variation and Confidence variation indicate if an expert changed their scores on either metric at least once. Because the extent to which an expert varied their coding or confidence may also be important for reliability, we also include Coding sd and Confidence sd to measure an expert’s standard deviation on these metrics, with those who did not vary coded as zero.
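The coding-variation measures described above can be sketched as follows, under the assumption of hypothetical long-format coding data (variable and column names are illustrative):

```python
import pandas as pd

# Hypothetical long-format codings: one row per expert-year score.
codings = pd.DataFrame({
    "expert_id": [1, 1, 1, 2, 2, 2],
    "score":     [2, 3, 2, 1, 1, 1],
})

per_expert = codings.groupby("expert_id")["score"].agg(["nunique", "std"])

# Coding variation: 1 if the expert ever changed their score.
coding_variation = (per_expert["nunique"] > 1).astype(int)

# Coding sd: standard deviation of an expert's scores, with
# non-varying experts (and single-observation experts) coded as zero.
coding_sd = per_expert["std"].fillna(0.0)
```

The analogous Confidence variation and Confidence sd measures would apply the same transformations to an expert's self-reported confidence scores.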
Volume
High coding volume may lead experts to overextend themselves, causing them to either be less attentive or code cases and concepts with which they are less familiar. Such overextended experts may therefore be less reliable. We measure coding volume along three dimensions: first, the natural logarithm of the number of country-years an expert coded (Country-years); second, the natural logarithm of the number of variables an expert coded (Variables); and third, because most experts coded only one country but many coded several, the natural logarithm of the number of unique countries an expert coded (Countries).
Results
We conduct analyses of each variable’s reliability scores individually, regressing each posterior draw of reliability parameters on the complete set of potential correlates. 11 Given that some countries and years may be more difficult to code than others, we include fixed effects for the coded country and year in all analyses. 12
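The per-draw regression strategy can be sketched as below. The design matrix, sample sizes, and simulated reliability draws are all hypothetical, and the country and year fixed-effect dummies are omitted for brevity; in practice they would be appended as additional columns of the design matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
n_experts, n_draws = 50, 200

# Hypothetical design matrix: intercept plus one expert-level correlate.
X = np.column_stack([np.ones(n_experts), rng.normal(size=n_experts)])

# Hypothetical posterior draws of reliability, one row per MCMC draw,
# generated with a true slope of 0.5 on the correlate.
beta_draws = 1.0 + 0.5 * X[:, 1] + rng.normal(0, 0.2, size=(n_draws, n_experts))

# Regress every posterior draw on the correlates and pool the resulting
# coefficients, so measurement uncertainty in the reliability estimates
# propagates into the coefficient distribution.
coefs = np.array([np.linalg.lstsq(X, y, rcond=None)[0] for y in beta_draws])
slope_median = np.median(coefs[:, 1])
slope_lo, slope_hi = np.quantile(coefs[:, 1], [0.05, 0.95])
```

Summarizing `coefs` by its median and a 90% interval yields the type of estimates displayed in the coefficient plots.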
Figure 1 presents coefficient estimates by variable, with points representing the median bootstrapped coefficient estimate and horizontal lines the 90% highest-density interval around this estimate. The vertical line marks an effect magnitude of zero; we center the intercept at zero for illustrative purposes.

Figure 1. Bootstrapped posterior coefficient estimates of correlates of reliability.
Demographics
The difference between female and male coders is generally low in magnitude and inconsistent across variables, indicating that the model does not erroneously penalize female experts. Age and employment also show little correlation with reliability. Respondents with a professional degree tend to have higher reliability than experts with a PhD (the reference level) in four of the six variables, with relatively high magnitudes, although these estimates are based on a relatively small number of experts; results for experts with a Master's degree or lower level of education are ambiguous. Experts who code historical data tend to be less reliable than other experts in four of the five variables (there are no historical data for Reasoned justification), although this result may be an artifact of differences in the cases these experts code.
Democracy in residence country
Democracy shows an ambiguous relationship with reliability, evincing little relationship in four of the six variables and contradictory signs in the remaining two.
Knowledge
Experts who show a lack of general knowledge are less reliable than other experts in four of the six variables, and slightly more reliable in the remaining two; the magnitude of this relationship is generally small. The remaining knowledge measures (Not resident and Low awareness) show little consistent relationship with reliability.
Confidence
In five of the six variables, self-reported confidence shows a positive correlation with reliability; in the remaining variable there is little evidence of a relationship.
Attentiveness
Variation in coding shows the most consistent results in these analyses: in all variables, experts who varied more in their coding tend to have higher reliability than their peers who varied less. However, results regarding the difference between those experts who did not vary their codings and their peers are inconsistent, which may be due to the relative lack of variation in latent concept levels in some cases across variables. Variation in self-reported confidence shows little correlation with expert reliability.
Volume
Neither the number of country-years an expert coded nor the number of variables they coded shows a relationship with reliability in any variable. Results regarding the number of unique countries an expert coded are inconsistent; volume and reliability are uncorrelated for two variables, negatively correlated for three variables, and positively correlated for one variable.
Predicted reliability
The coefficient plots also show a high level of uncertainty in the intercept, which indicates that they may be misleading regarding the substantive importance of reliability correlates. Figure 2 presents the predicted reliability of experts with different characteristics across variables. Points represent the bootstrapped median predicted reliability for experts with certain demographic or coding characteristics, holding all other correlates constant at their mean or mode. 13 The range represents each variable's posterior median range of reliability scores.
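The predicted-reliability computation can be sketched as a linear prediction over covariate profiles. The coefficient values, covariate names, and function below are hypothetical, illustrating only the hold-at-mean-or-mode logic:

```python
# Hypothetical fitted coefficients: intercept, Confidence, Female.
coef = {"intercept": 0.8, "confidence": 0.6, "female": -0.02}

def predict_reliability(confidence, female, coef):
    """Linear prediction of reliability for one covariate profile."""
    return (coef["intercept"]
            + coef["confidence"] * confidence
            + coef["female"] * female)

# Vary one characteristic at a time, holding the others at their
# sample mean (continuous variables) or mode (binary indicators).
mean_confidence, mode_female = 0.7, 0
high_conf = predict_reliability(0.9, mode_female, coef)
low_conf = predict_reliability(0.4, mode_female, coef)
baseline = predict_reliability(mean_confidence, mode_female, coef)
```

Repeating this over every bootstrapped coefficient draw, rather than a single set of point estimates, yields the posterior predicted medians shown in Figure 2.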

Figure 2. Posterior bootstrapped predicted reliability of experts with different characteristics.
As Figure 2 makes clear, once we incorporate overall posterior uncertainty into the assessment, the substantive relationship between the correlates of reliability and this outcome is generally minimal. The main exceptions to this rule are Confidence, Low knowledge, and Coding variation, which retain their relatively strong correlation with reliability. In four of the six indicators, experts with high average confidence are more reliable than those with lower confidence, and experts with low knowledge tend to be less reliable. Across all variables, experts who vary their coding are more reliable than those who do not or do so minimally.
Conclusion
The analyses in this paper assess the correlates of expert reliability in the context of cross-national panel data. Most potential correlates show little substantive relationship with reliability; these null results provide evidence that the IRT model is well specified in this context and, more generally, that IRT models are a safe method for aggregating expert-coded data.
The most notable exception to this rule regards coding variation, which is positively correlated with reliability. This result provides a simple heuristic for evaluating respondents to expert surveys: of those experts who vary their codings, those that vary the most will tend to be most reliable. Other results are more tentative, albeit intuitive: lower conceptual knowledge and lower confidence predict lower reliability. This suggests that expert-coding enterprises should endeavor to recruit experts who have knowledge of the concepts they are coding and are confident in their knowledge.
This paper also suggests directions for further research. Although the analyses here focus on expert-level correlates of reliability, they provide tentative evidence that task difficulty also matters: the country and year being coded explains a great deal of variation in reliability (Online Appendix Table E.1), and the distribution of reliability scores varies substantially across questions (Online Appendix Figure C.1). Although it is important to not overinterpret these results, future scholarship would do well to probe them.
Supplemental Material
Supplemental material (reliable_app) for "What makes experts reliable? Expert reliability and the estimation of latent traits" by Kyle L. Marquardt, Daniel Pemstein, Brigitte Seim and Yi-ting Wang, Research & Politics.
Acknowledgements
Earlier drafts presented at the 2016 MPSA Annual Conference, 2016 EIP/V–Dem APSA Workshop, 2018 SPSA Annual Conference and 2018 Annual V–Dem Conference. The authors thank David Armstrong, Ryan Bakker, Ruth Carlitz, Chris Fariss, John Gerring, Adam Glynn, Kristen Kao, Laura Maxwell, Juraj Medzihorsky, Jon Polk, Sarah Repucci, Jeff Staton, Laron Williams and Matthew Wilson for their comments on earlier drafts of this paper, as well as the editor and two anonymous reviewers for their valuable insights. The authors also thank Staffan Lindberg and other members of the V–Dem team for their suggestions and assistance. Regionala etikprövningsnämnden i Göteborg 1080-16 provided ethics approval, including informed consent guidelines.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The authors acknowledge research support from the National Science Foundation (SES-1423944, PI: Daniel Pemstein), Riksbankens Jubileumsfond (M13-0559:1, PI: Staffan I. Lindberg), the Swedish Research Council (2013.0166, PI: Staffan I. Lindberg and Jan Teorell); the Knut and Alice Wallenberg Foundation (PI: Staffan I. Lindberg) and the University of Gothenburg (E 2013/43), as well as internal grants from the Vice-Chancellor’s office, the Dean of the College of Social Sciences, and the Department of Political Science at University of Gothenburg. Marquardt acknowledges the support of the HSE University Basic Research Program and funding by the Russian Academic Excellence Project ‘5-100.’ The authors performed simulations and other computational tasks using resources provided by the Swedish National Infrastructure for Computing at the National Supercomputer Centre in Sweden (SNIC 2017/1-406 and 2018/3-133, PI: Staffan I. Lindberg).
Carnegie Corporation of New York Grant
This publication was made possible (in part) by a grant from the Carnegie Corporation of New York. The statements made and views expressed are solely the responsibility of the author.