Abstract
Here I propose procedural replication as a method for diagnosing errors and omissions and identifying research artifacts in published studies. The goal of procedural replication is not to make substantive contributions so much as to improve research practice, that is, how scientists go about doing science. This is accomplished by generating checklists of lessons learned that scholars can use to assess the reliability of new or existing studies, guide editorial reviews, and make scientific knowledge production more reliable. I demonstrate the method by implementing a procedural replication of Michael Ross’s controversial finding that democracy has no effect on child mortality. I show this null finding is an artifact of the way five-year averages were computed and of the static nature of the preferred model. Using causal diagrams, I also show how concerns about listwise deletion and selection bias affecting previous studies may have been overstated. Finally, I provide a checklist of lessons learned.
Introduction
The impact of regime type on welfare outcomes is a topic of great importance. Against the prevailing consensus, a recent study finds “little evidence that the rise of democracy contributed to the fall in infant and child mortality rates” in the period between 1970 and 2000 (Ross, 2006: 872). 1 According to the study, the negative association between democracy and infant and child mortality reported in previous studies disappears once corrections are made for additive, time-invariant, unobserved heterogeneity; common time trends; and selection bias from listwise deletion. Here, I report results from a procedural replication of this finding. I show this null result is an artifact of a highly restrictive dynamic specification. I also find that concerns about selection bias in previous studies may have been overstated. Finally, I document sources of errors and omissions in a checklist that scholars can use to assess the reliability of new or existing studies, guide editorial reviews, and make scientific knowledge production more reliable.
Ross’s (2006) null finding has wide-ranging theoretical, policy, and practical implications. It questions core theories of representation, electoral accountability, and redistribution (Boix, 2003; Bueno de Mesquita et al., 2003b; Lake and Baum, 2001; Meltzer and Richard, 1983); and it challenges democracy’s constructive and direct roles in the conceptualization and satisfaction of needs (Sen, 2000). It also provides evidence against the desirability of political conditionality in foreign aid (Alesina and Dollar, 2000; Crawford, 1997). Moreover, by toning down the perceived accomplishments of democracy, it can influence the very process of democratization via problem definition (Rochefort and Cobb, 1993), preference formation (Druckman and Lupia, 2000), and political persuasion (Cobb and Kuklinski, 1997). As a testament to this importance, Ross’s (2006) study has been cited over 300 times. 2
Replication studies remain highly controversial. One reason for this is that social scientists disagree on what constitutes a replication study, how replications should be done, and for what purpose (Anderson et al., 2008; Berthon et al., 2002; Bueno de Mesquita et al., 2003a; Burman et al., 2010; Gelman, 2013; Herrnson, 1995; Ishiyama, 2014; King, 1995; McCullough et al., 2008; Schmidt, 2009). For example, many scholars are adamant that replication studies should make a substantive, as opposed to procedural, contribution. In this view, replications are uninteresting unless accompanied by a new discovery. Yet from the point of view of research practice, or the study of how scientists do science, a primary goal of replication studies is to identify research artifacts, diagnose sources of errors and omissions, and document lessons learned through checklists, guidelines, and assessment scales. 3
It is a mistake to think all replication studies should make a substantive contribution. First, such replication studies often struggle with the question of where the replication ends and a new study begins. After all, most scientific studies are nothing but replication studies with extensions and new findings (King, 2006: 1). Second, replication studies that aim to make a substantive contribution often relegate the details of the replication (what went wrong and lessons learned) to a footnote, thus forsaking an opportunity to inform and improve research practice. Third, these studies often focus on contentious solutions rather than on diagnosing errors and omissions, yet diagnosis is central to prevention and remediation. For example, procedural checklists can help bring problematic areas to the attention of researchers and reviewers, whilst avoiding contentious debates about specific solutions that are best resolved in the context of specific applications.
Here I introduce procedural replication, a formal method to diagnose errors and omissions, identify research artifacts, and contribute new (or improved) checklists that scholars can use to inform future research, streamline editorial reviews, and improve research practice. The replication proceeds in two steps. First, I conduct a pure replication to infer the exact procedures and technologies, or scientific standard, used in producing the original study outputs. Second, I undertake a critical evaluation of the inferred standard, including a checklist to prevent future errors and omissions. The focus of the evaluation step is on “what to look out for,” not what an ideal study would do; and on problematic areas rather than contentious solutions. Standard replication studies aim to make substantive contributions by changing data, model specifications, estimators or inference procedures. By contrast, procedural replications aim to improve research practice and the reliability of scientific knowledge production by reporting sources of errors and omissions.
This case-study in research practice proceeds as follows. The section titled “Data” describes the data in the replication file. “Pure replication” presents results from a pure replication, including inferences about the scientific standard used in the original study. “Diagnosing errors and omissions in the inferred scientific standard” demonstrates how a minor change to the way quinquennial averages are computed is enough to reverse the null finding, and how concerns about selection bias from listwise deletion may be overblown. The article ends with “Conclusion.”
Data
In response to a request for replication materials for Ross’s (2006) study “Is democracy good for the poor?” the author kindly provided the following: (1) a raw annual frequency data set; (2) a quinquennial frequency data set used in the analysis (the dependent variable is only available every five to 10 years so all annual data are collapsed to quinquennial frequency); and (3) five quinquennial frequency multiply imputed data sets. The estimation results using the quinquennial frequency data and the multiply imputed data were sourced from the published manuscript. The original computer code, random seed for multiple imputation, and other elements of the procedures used to generate the estimates could not be obtained. Even so, judging from information in the published paper and replication file I infer that the principal software platforms used were Stata, R (R Core Team, 2013), and Amelia I (King et al., 2001). The exact version of each could not be determined.
Pure replication
The objective of a pure replication is to infer the exact procedures and technologies used in the original article and to document potential errors and omissions. These are then diagnosed further at the evaluation stage (see “Diagnosing errors…”). The pure replication was done in three steps. First, I reproduced the quinquennial frequency estimation data from the raw annual data using the exact same procedures reported in the original article. Next, I tried to reproduce the listwise deleted estimation results. Finally, I did the same with multiply imputed data and results. I report the findings from each step below.
Replicating the estimation dataset
First, the available replication data are incomplete. Collapsing the raw annual data into quinquennia by computing five-year averages, I was unable to replicate the key measure of democracy – the polity variable – in the quinquennial estimation data set. 4 In addition, the adult HIV prevalence and democratic years variables were both missing from the annual dataset, so I could not replicate their quinquennial averages in the estimation dataset. All other variables in the original quinquennial frequency dataset could be replicated from the annual data, with the exception of real GDP (gross domestic product) per capita, where I found minor differences for the UK, Greece, Ireland, and Thailand between the quinquennial data in the replication file and the corresponding five-year averages I computed from the annual data.
Second, the original study computes quinquennial data using a forward average. As shown in “Diagnosing errors…,” this can severely dampen the estimated effect of democracy on child and infant mortality. For example, mortality data for the 1970 quinquennium refer to mortality in the year 1970. These data are regressed on the Polity data for the 1965 quinquennium. However, the latter is computed as a forward average for the years 1965–1969, which is centered on 1967, not as a centered average for the years 1963–1967, which is centered on 1965. This is closer to a three-year than a five-year lag.
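To make the two conventions concrete, the following sketch computes both kinds of quinquennial average for a single annual series (the data and variable names are hypothetical stand-ins, not the original replication code):

```r
# Sketch: forward vs. centered five-year averages of an annual series
years <- 1960:2000
x <- rnorm(length(years))  # placeholder for annual Polity data

# Forward average: the 1965 quinquennium is the mean of 1965-1969,
# centered on 1967
forward_avg <- function(y0) mean(x[years >= y0 & years <= y0 + 4], na.rm = TRUE)

# Centered average: the 1965 quinquennium is the mean of 1963-1967,
# centered on 1965
centered_avg <- function(y0) mean(x[years >= y0 - 2 & years <= y0 + 2], na.rm = TRUE)

forward_avg(1965)   # the convention used in the original study
centered_avg(1965)  # the convention used in the re-analysis below
```

Note that `na.rm = TRUE` mirrors the original practice of averaging over non-missing years only, itself a form of implicit imputation (see “Replicating the multiply imputed estimates”).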
Third, the original study’s criteria for defining the population of interest – sovereign countries with a population larger than 200,000 – are ambiguous. In effect, the study uses an artificially balanced panel of 169 countries for the quinquennia 1970, 1975, …, 2000 (see Table 1). This panel excludes extinct countries (like the USSR) and includes extrapolations for non-sovereign country years (like Ukraine prior to 1989). As a result, the original study sample contains 14% more annual “observations” than Przeworski et al.’s (2000) unbalanced panel of sovereign country years, which, by recording history as it happened, avoids extrapolations and includes deceased entities. As shown in “Diagnosing errors…,” the extra observations in artificially balanced panels can inadvertently deflate standard errors. 5
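The mechanical effect of such padding on standard errors can be seen in a toy regression (purely illustrative numbers, not the original data):

```r
# Sketch: padding a sample with extrapolated rows shrinks standard
# errors without adding real information
set.seed(42)
n <- 500
d <- data.frame(x = rnorm(n))
d$y <- 0.5 * d$x + rnorm(n)
fit_true <- lm(y ~ x, data = d)

# "Balance" the panel by appending near-duplicates of 14% of the rows
d_padded <- rbind(d, d[sample(n, round(0.14 * n)), ])
fit_padded <- lm(y ~ x, data = d_padded)

coef(summary(fit_true))["x", "Std. Error"]
coef(summary(fit_padded))["x", "Std. Error"]  # smaller, spuriously precise
```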
Summary statistics for Ross’s quinquennial data set, 1970–2000.
Data for 169 countries observed over the quinquennia 1970, 1975, …, 2000. In general, data are forward quinquennial averages of annual data (e.g. the quinquennial datum for 1970 is the arithmetic average of the log annual data for the years 1970–1974 inclusive). These averages are computed ignoring missing values. Over 30% of mortality data from various sources are missing in the quinquennial data; the equivalent figure for the annual data is 80%. Lagged variables are lagged one quinquennium.
WB: World Bank; GDP: gross domestic product
Fourth, the original study did not use the available data efficiently or consistently. With two exceptions, all annual data prior to 1970 were discarded before computing the quinquennial averages and their lagged values. 6 Consequently, all lagged values for the 1970 quinquennium, which refer to 1965–1969, are missing even when annual data are available for those years. Some of these observations were then imputed manually before multiple imputation (see Table 2 and the sketch that follows it).
Summary statistics for Ross’s quinquennial data set, 1970.
Data for the first quinquennium (1970–1974) are missing for most regressors. The annual data set was truncated in 1970 before computing the lagged quinquennial averages, with two exceptions. First, the quinquennial lag of polity was calculated before truncation. Second, the dependent variable in 1970 and its lagged value are exactly identical because the former was used to manually impute the latter. For the other variables, truncating the annual data before calculating the lags discards one-eighth of the cells, including UNICEF data on child and infant mortality for 1965.
WB: World Bank; GDP: gross domestic product
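The order of operations matters here. A minimal sketch of the two pipelines, for a single hypothetical country series:

```r
# Sketch: a five-year (one-quinquennium) lag computed before vs. after
# truncating an annual series at 1970
years  <- 1965:1980
polity <- seq_along(years)  # placeholder annual data for one country

lag5 <- function(v) c(rep(NA, 5), head(v, -5))

# Lag first, then keep 1970 onward: the lags for 1970-1974 use the
# available 1965-1969 data
lag_then_cut <- lag5(polity)[years >= 1970]

# Truncate first, then lag: every lag for 1970-1974 is NA even though
# the 1965-1969 data exist
cut_then_lag <- lag5(polity[years >= 1970])
```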
Using the quinquennial data in the replication file, I was able to replicate exactly the listwise deleted results reported in Table 3 of the original study (Ross, 2006: 869). However, I found that both the R² and the dependent variable are misreported. The reported R² measures the overall, as opposed to the within-country, variation. 7 Furthermore, the dependent variable used in Tables 3 and 4 of the original study is not UNICEF’s child mortality rate (Ross, 2006: 866), but rather the World Bank’s under-five mortality rate, which has the most missing cells.
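The distinction between the two R² figures is easy to reproduce (a schematic sketch with simulated panel data; none of these numbers come from the study):

```r
# Sketch: overall (dummy-variable) vs. within R-squared in a two-way
# fixed effects regression
set.seed(1)
d <- data.frame(country = rep(1:50, each = 7), period = rep(1:7, times = 50))
d$x <- rnorm(350) + d$country / 50
d$y <- 0.5 * d$x + d$country / 10 + rnorm(350)  # large country effects

# Dummy-variable fit: the R-squared credits the fixed effects themselves
lsdv <- lm(y ~ x + factor(country) + factor(period), data = d)
summary(lsdv)$r.squared

# Within fit: R-squared after sweeping out country and period means
# (the panel is balanced, so sequential demeaning is exact); this is
# the figure a fixed effects routine such as Stata's xtreg, fe reports
w2 <- function(v, g1, g2) v - ave(v, g1) - ave(v, g2) + mean(v)
fe <- lm(w2(d$y, d$country, d$period) ~ w2(d$x, d$country, d$period))
summary(fe)$r.squared  # much smaller than the dummy-variable figure
```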
Procedural replication of Ross’s Table 4 (Ross, 2006: 869).
Significantly different from zero at 99% confidence. Only noted for key independent variables.
Significantly different from zero at 95% confidence. Only noted for key independent variables.
Significantly different from zero at 90% confidence. Only noted for key independent variables.
These estimates were computed using a stricter definition of sovereign country years (Przeworski et al., 2000), centered quinquennial averages, and the software package Amelia II for multiple imputation (see main text for further details). Both polity and democratic years are now highly statistically significant in all specifications except the last. By contrast, in the original study they are insignificant across all specifications. All regressors, except period dummies, are lagged one quinquennium. The LDV specification uses panel-corrected standard errors, assuming a panel-specific AR(1) autocorrelation structure. The static fixed effects (FE) specification uses robust standard errors (although, in theory, these are not needed (Greene, 2008: 200)). Standard errors are in parentheses.
LDV: lagged dependent variable; NT: number of countries (N) × number of time periods (T), i.e. NT = N × T; AR: autoregressive.
Source: Reproduced from Ross M (2006) Is democracy good for the poor? American Journal of Political Science 50(4): 860–874.
Procedural replication of Ross’s Table 5 showing the coefficients on polity across alternate multiply imputed measures of infant and child mortality.
Significantly different from zero at 99% confidence.
Significantly different from zero at 95% confidence.
Significantly different from zero at 90% confidence.
The top panel reports estimates using Przeworski et al.’s (2000) more restrictive definition of sovereign country years, centered quinquennial averages (so the quinquennial datum for 1970 is the average of the years 1968–1972), and corrections for other minor errors and omissions (see main text). The bottom panel reports the same estimates using the original study’s population of sovereign country years, forward quinquennial averages (so the quinquennial datum for 1970 is the average of the years 1970–1974), and corrections for other minor errors and omissions (see main text).
FE: fixed effects; CMR: child mortality rate; WB: World Bank; WHO: World Health Organization; IMR: infant mortality rate
CLP: Przeworski et al.’s (2000) dataset; LDV: lagged dependent variable; NT: number of countries (N) × number of time periods (T), i.e. NT = N × T.
Source: Reproduced from Ross M (2006) Is democracy good for the poor? American Journal of Political Science 50(4): 860–874.
Replicating the multiply imputed estimates
I could not replicate exactly the multiply imputed data and estimates because the replication file does not include a random seed for the imputations. Even so, I found missing data were imputed inconsistently. For example, quinquennial averages were computed ignoring missing observations (a form of imputation); lagged values of the dependent variable in the first period were manually imputed using the actual observation in that period; and the key independent variable, polity, was never included in the imputation model. 8 Moreover, the imputation software available at the time ignored time dependency (King et al., 2001).
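For reference, a reproducible time-series cross-section imputation along these lines might look as follows. This is only a sketch with simulated data: the original imputation model is unknown, and all variable names here are hypothetical. The key points are recording the random seed, including the key regressor (polity) in the imputation model rather than omitting it, and modeling time dependence explicitly, as Amelia II allows:

```r
# Sketch: a reproducible panel imputation with Amelia II
library(Amelia)
set.seed(12345)  # a recorded seed makes the imputations replicable

# Simulated quinquennial panel with missing mortality data
quinq <- expand.grid(country = 1:30, period = 1:7)
n <- nrow(quinq)
quinq$polity   <- rnorm(n)
quinq$log_gdp  <- rnorm(n)
quinq$log_u5mr <- 4 - 0.05 * quinq$polity - 0.3 * quinq$log_gdp +
                  rnorm(n, sd = 0.2)
quinq$log_u5mr[sample(n, 60)] <- NA  # inject missingness

imp <- amelia(quinq, m = 5,             # five imputed data sets
              ts = "period", cs = "country",
              polytime = 2,             # smooth within-panel time trends
              lags = "log_u5mr", leads = "log_u5mr")
```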
Diagnosing errors and omissions in the inferred scientific standard
The pure replication in the previous section helped me partially infer the combination of technological inputs and procedures (or scientific standard) used in the original study. Here, I diagnose potential errors and omissions. The objective is to learn from mistakes, assess their impact, and propose preventive measures, not to make a substantive contribution about regime type and human welfare by changing data sources, model specification, estimators, or inferential procedures.
Relaxing the dynamic specification in the observed dataset
The original study’s preferred two-way fixed effects specification includes only one lag of polity and no interactions with time. Combined with the forward quinquennial averages, this imposes severe dynamic restrictions. As formulated, the original study asks not whether democracy has an effect (any effect) on infant and child mortality, but whether it has a constant additive effect in the first three years after a transition. Few social scientists would expect democracy to have a substantive impact in such a short period, yet the original study provides no theoretical justification for this choice. This is all the more puzzling considering that most econometric textbooks highlight the flexibility of panel data in characterizing dynamic effects (e.g. Baltagi, 2001: 6).
For example, in the preferred two-way fixed effects specification the outcome for 1975 is given by

$$y_{i,1975} = \alpha_i + \delta_{1975} + \lambda\,\overline{\text{polity}}_{i,1970} + \mathbf{x}_{i,1970}'\boldsymbol{\beta} + \varepsilon_{i,1975},$$

where $\alpha_i$ is a country effect, $\delta_{1975}$ a period effect, $\overline{\text{polity}}_{i,1970} = \tfrac{1}{5}\sum_{s=1970}^{1974}\text{polity}_{i,s}$ the forward quinquennial average (centered on 1972), and $\mathbf{x}_{i,1970}$ the remaining lagged controls. Because $y_{i,1975}$ refers to mortality in the year 1975 itself, the effective lag between regime type and mortality is roughly three years, and the specification allows no effect before or after this single lag.
The dynamic restrictions are so severe that even a minor relaxation, like computing centered quinquennial averages (which allow for a slightly longer lagged effect), is enough to reverse the null findings in the original study. To show this separately from the multiple imputation, I replicated the listwise deleted estimation results in Table 3 of the original study, replacing the forward quinquennial data in the original estimation with centered quinquennia. I also used Przeworski et al.’s (2000) unbalanced panel of sovereign country years. 9 Using the exact same two-way fixed effects specification as in the published table – the study’s preferred specification – I obtained a point estimate of −0.005 (s.e. 0.002). This is statistically and practically significant, and about twice as large as the published estimate (−0.0021 (s.e. 0.002)). For example, a movement of 10 points in the polity variable implies a 5% average decline in child mortality after five years (95% CI: −9.4% to −1.3%).
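Because the dependent variable is in logs, the substantive magnitude follows directly from the point estimate. A quick check with the reported coefficients (the confidence interval above comes from the replication output, not from this back-of-the-envelope conversion):

```r
# A 10-point move in polity under the re-estimated coefficient (-0.005)
100 * (exp(10 * -0.005) - 1)   # about -4.9%, i.e. a ~5% decline
# The same conversion for the published estimate (-0.0021)
100 * (exp(10 * -0.0021) - 1)  # about -2.1%
```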
To check that this was not a result of using a different population, I replaced Przeworski et al.’s (2000) unbalanced panel of sovereign country years with the original study’s artificially balanced panel. The result is an almost identical estimate (–0.0054; s.e. 0.002). Relaxing the dynamic specification further, by adding an extra lag of polity or an interaction with a linear time trend, also yielded practically and statistically significant results. These results underscore the importance of omitted dynamics in generating the original study’s null result.
Relaxing the dynamic specification in the multiply imputed dataset
The original study claims previous significant findings may have been biased by missing data. Specifically, listwise deletion drops rich autocracies with enviable records in reducing mortality from the sample, thus biasing the estimated effect in favor of democracies. Assuming this bias is removed by multiply imputing the missing data and controlling for two-way fixed effects, the original article showed democracy has a negligible effect on mortality. In what follows I show the claim about selection bias has weak theoretical support. The null result in the original study is an artifact of the extremely restrictive dynamic specification, not the result of correcting selection bias in previous studies.
First, the selection bias argument does not withstand theoretical scrutiny. The fact that rich autocracies with enviable declines in infant and child mortality are listwise deleted from the estimation sample tells us nothing about the effect of democracy on mortality. For all we know, their declines could have been faster (or slower) had they been democracies. Therefore the bias, if it exists, could go either way. Besides, if selection is on the basis of income and regime type, as the original study claims, then controlling for these variables, as most previous studies do, should help alleviate the bias. For example, suppose autocratic regimes above a certain income threshold all drop out of the sample. If the effect of democracy increases with income, then the sample estimate will underestimate the population effect. But the estimate will remain unbiased for the countries in the sample, which in this case includes most of the world. Figure 1 illustrates this logic using a causal diagram, and the simulation sketch following the figure makes the same point numerically. 10
Simplified causal diagram illustrating the causes of the outcome of interest, the missingness, and the selection through listwise deletion. The graphical model assumes mortality is caused by last period’s regime type (polity_{t−1}), income (GDP_{t−1}), and other unobserved causes (U_t). For simplicity I assume GDP_{t−1} is the only variable with missing data. The missingness indicator for GDP_{t−1} is itself assumed to depend only on income and regime type, so conditioning on these observed variables renders the deletion ignorable (see main text).
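The point can also be checked numerically. The following simulation is entirely hypothetical (stylized parameter values and a homogeneous effect of democracy, unlike the income-varying effect discussed above): it drops “rich autocracies” from the sample and shows that, because selection operates only through observed regressors, the listwise-deleted estimate remains unbiased for the retained sample:

```r
# Sketch: selection on observed covariates (income, regime type) does
# not bias a regression that controls for those covariates
set.seed(7)
n <- 100000
gdp    <- rnorm(n)                              # lagged log income
polity <- 0.5 * gdp + rnorm(n)                  # regime correlates with income
y      <- -0.5 * polity + 1.0 * gdp + rnorm(n)  # log mortality

# Listwise deletion: rich autocracies drop out of the sample
kept <- !(gdp > 1 & polity < 0)

full     <- lm(y ~ polity + gdp)
listwise <- lm(y ~ polity + gdp, subset = kept)
round(cbind(full = coef(full), listwise = coef(listwise)), 3)
# Both fits recover the polity coefficient of -0.5: the deletion is
# ignorable once income and regime type are conditioned on
```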
Second, the selection bias argument does not withstand empirical scrutiny. For example, using centered quinquennia and multiple imputation software better suited to time dependence, I found democracy has a practically and statistically significant effect on infant and child mortality, as reported in Table 3. 11 These results are very different from the corresponding estimates in Table 4 of the original study (Ross, 2006: 869), where none of the coefficients on polity or democratic years is significant. By contrast, the corresponding estimates reported in Table 3 are all practically and statistically significant, with the exception of democratic years in the last column.
To further check the robustness of these results, I replicated Table 5 in the original study (Ross, 2006: 870). That table reports the coefficients on polity when different multiply imputed measures of infant and child mortality are used. I ran the replication twice: once using Przeworski et al.’s (2000) unbalanced panel of sovereign country years and centered quinquennia (top panel, Table 4), and again using the artificially balanced panel of sovereign country years and forward quinquennia from the original study (bottom panel, Table 4).
DEMOCHECK: A procedural checklist for studying the impact of democracy on human welfare.
First, using the original study’s preferred two-way fixed effects model, I found polity is significant at the 10% level or better whenever I used the centered quinquennial data (top panel, last column of Table 4). However, using the forward quinquennia, all estimated coefficients are insignificant and about half the size (bottom panel, last column).
Second, using the lagged dependent variable specifications I found most point estimates are similar and significant, even when forward quinquennia are used (first two columns of Table 4). This is because the lagged dependent variable specification, though still very simple, allows for distinct short- and long-run effects. However, I found standard errors are higher under Przeworski et al.’s (2000) stricter definition of sovereign country years (top panel) than under the original study’s criteria (bottom panel). The reason for this difference is that the stricter criteria yield 957 observations, as opposed to the original study’s 1183: the original criteria treat observations for countries like Ukraine prior to 1989 as missing rather than undefined. Multiply imputing these data may exaggerate the amount of information in the dataset, thus underestimating uncertainty.
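To see why the lagged dependent variable specification is less restrictive, consider its simplest stylized form (abstracting from the study’s full set of controls, so this is illustrative rather than the exact estimated model):

$$y_{i,t} = \alpha + \rho\, y_{i,t-1} + \lambda\, \text{polity}_{i,t-1} + \varepsilon_{i,t}.$$

A permanent one-unit change in polity then has a short-run (one-quinquennium) effect of $\lambda$ and a cumulative long-run effect of $\lambda/(1-\rho)$, whereas the static fixed effects model forces the entire effect into a single lag.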
I summarize the lessons from the pure replication in the section “Pure replication” and the diagnosis and criticism in “Diagnosing errors…” in a checklist (see Table 5). Such a checklist can be used prospectively to help design more effective studies on the impact of regime type on human welfare. It can also be used retrospectively to assess the quality of existing studies and as a quality control in the peer review process.
Conclusion
Ross’s (2006) controversial and widely cited finding that democracy has no effect on child mortality is of momentous significance. If true, it has wide-ranging theoretical, policy, and practical implications. I subjected the study to a procedural replication and found reasons to challenge it.
I found Ross’s (2006) original null result is an artifact of an extremely restrictive dynamic specification. The preferred static fixed effects specification, combined with forward quinquennial averages, assumes democracy has an additive effect on child or infant mortality only within the first three years or so after a transition, and not thereafter. Few social scientists expect democracy to have a substantive impact in such a short period. The restriction is so severe that even a small change in the dynamic specification, like using centered quinquennial averages that allow for a five-year lag, is enough to detect practically and statistically significant effects. The lesson here is that scholars should think carefully about dynamics when estimating the effect of democracy on mortality.
I also found the original study’s concern over listwise deletion and selection bias may have been overstated. Ross (2006) claims previous significant findings are biased by missing data: listwise deletion drops rich autocracies with excellent records in reducing mortality from the sample, thereby biasing the estimated effect in favor of democracies. However, the theoretical arguments and empirical evidence presented here suggest selection bias may not have been such a problem after all. What drives the null result is the extremely restrictive dynamic specification, not the presumed correction for selection bias in previous studies. I also found that a sounder definition of the population of interest yields better measures of uncertainty. Finally, I showed how causal diagrams can be used to analyze selection bias and listwise deletion succinctly.
Here I have demonstrated the use of procedural replication. The objective of procedural replication is to diagnose sources of errors and omissions, identify research artifacts, and propose preventive measures including checklists to inform future research, streamline editorial reviews, and improve cumulative research about the impact of regime type on human welfare. In pursuing this objective I deliberately avoided questioning the basic research design, choice of measures, model specifications, estimators, inferential techniques and other assumptions. 12 Advocating alternative choices and assumptions, and testing them, is the remit of standard scientific studies, not procedural replications. Indeed, it is a mistake to think all replication studies must make a substantive contribution. The point of any consequential scientific endeavor is not just to present novel findings, but to actually answer existing questions reliably. Procedural replication focuses squarely on improving our answers to existing questions and so the method ought to be as relevant as the questions are consequential.
Acknowledgements
I would like to thank Michael Ross for his kindness and generosity in sharing his data, answering my questions, and providing excellent constructive comments on earlier drafts of this manuscript; Patrick Royston for kind help with the Stata package mim; Neal Beck; anonymous reviewers; and the editors of Research & Politics. All errors are my own.
Declaration of conflicting interest
The author declares that there is no conflict of interest.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors. The views expressed are my own and not necessarily those of Cambridge Social Science Decision Lab Inc.
