Abstract

Macroeconomic variables like unemployment, inflation, trade, or GDP are not set in stone: they are preliminary estimates that are constantly revised by statistical agencies. These data revisions, or data vintages, often provide conflicting information about the size of a country’s economy or its level of development, reducing our confidence in established findings. Would researchers come to different conclusions if they used different vintages? To answer this question, I survey all articles published in a top political science journal between 2005 and 2020. I replicate three prominent articles and find that the use of different vintages can lead to different statistical results, calling into question the robustness of otherwise rigorous empirical research. These findings have two practical implications. First, researchers should always be transparent about their data sources and vintages. Second, researchers should be more modest about the precision and accuracy of their point estimates, since these estimates can mask large measurement errors.

Keywords

data quality, statistical capacity, development indicators, replication

Introduction

How much did the economy of Equatorial Guinea grow in 2000? Between 2005 and 2020, the World Bank’s World Development Indicators (WDI) reported four different values for Equatorial Guinea’s 2000 GDP growth, ranging from 1.47% to 18.2%. As Figure 1 shows and Johnson et al. (2013) confirm, this is no exception: researchers seeking to explain economic growth might come to different conclusions depending on their data source and version. As with growth, recent estimates of unemployment, inflation, or trade are preliminary and under constant revision, in what Croushore and Stark (2003) call “data vintaging.”

Figure 1.

GDP growth in 2000 for selected countries: Difference in values reported by different WDI releases.

This study examines how data vintaging affects the replicability of published research. Would researchers come to different conclusions if they used different data? To answer this question, I survey all articles published in a prominent journal between 2005 and 2020, collecting detailed information about data usage, and replicate three articles using various data sources and vintages of the same source. I find that using different data can lead to different statistical results, calling into question the robustness of rigorous research. Researchers should not only disclose their data sources and vintages but also be more modest about the precision and accuracy of their point estimates, which can mask large measurement errors.

The logic behind data vintaging

Datasets like the WDI are revised to incorporate improved data, eliminate errors, and account for new price benchmarks (Ciccone and Jarociński, 2010). The WDI relies on data reported by national statistical agencies; as agencies improve their statistical capacity, they collect and disseminate better data. In 2014, for example, Nigeria’s GDP increased by 89% after the government updated the base year for calculations, incorporating information about the telecommunication and film-making industries (Kerner et al., 2017). International organizations advise governments to update their GDP base year at least every 10 years, but half of the 189 countries surveyed by Berry et al. (2018) use older base years, reporting figures that are likely biased downward.

New administrations are often eager to rectify the errors committed by previous administrations. Data manipulation increased under the Kirchner and Rousseff presidencies in Argentina and Brazil, respectively; in both cases, the successor came from a rival party and publicized these data issues (Aragão and Linsi, 2022). Shortly after coming to power, Prime Minister Papandreou requested help from Eurostat and the IMF to revise Greece’s public finance statistics, which had long been misrepresented to follow EU rules (Alt et al., 2014). Some countries deliberately report biased figures to seem poorer and meet World Bank eligibility criteria for foreign aid, though these figures are typically corrected ex post (Kerner et al., 2017).

Lastly, the WDI standardizes country-specific data using a purchasing power parity (PPP) adjustment set by the International Comparison Program (ICP) based on international price surveys. Until 1996, these surveys covered developed countries and made extrapolations for the rest of the world; updates in 2005, 2011, and 2017 began to cover developing countries (Deaton and Aten, 2017). Using satellite-recorded nighttime lights as a “true” measure of economic activity, Pinkovskiy and Sala-i-Martin (2020) find that vintages based on newer ICP benchmarks have smaller measurement errors.

Data vintaging can be good news. The standardization methodology is improving: current estimates of the past are closer to the “truth” than past estimates of the past. It is better to revise inaccurate figures than to keep them consistently inaccurate. However, data vintaging calls into question the robustness of existing findings, since many vintages provide conflicting information. With few exceptions (Boehmer et al., 2011; Hollyer et al., 2014; Inklaar and Prasada Rao, 2017; Fariss et al., 2022), researchers did not examine these issues until recently.

Figure 2.

Available observations by country, 1980–1999.

A survey of published studies

To gauge the prevalence of vintaged data in political science, I assembled all 459 research articles and research notes published in International Organization between 2005 and 2020, including special issues. Of these studies, 173 conducted a cross-country statistical analysis using contemporary macroeconomic variables, like GDP growth, unemployment, inflation, or trade.¹

Table 1 classifies each study according to the data sources mentioned in its main text or appendix. The percentages sum to more than 100% because I recorded every source when a single study mentioned several. These are conservative estimates, as 24 studies (13.87%) do not mention any source of macroeconomic data. This does not mean that they did not use such data, only that they did not report where the data came from.

Table 1.

Relevant studies published in International Organization according to data source, 2005–2020.

	Number of studies	Percentage
Use Relevant Data	173	100.00
Mention Data Source	149	86.13
World Development Indicators	106	61.27
Penn World Table	52	30.06
Maddison	9	5.20
Other	25	14.45
Mention Data Vintage	119	68.79
World Development Indicators until 2005	28	16.18
World Development Indicators 2006–2010	23	13.29
World Development Indicators 2011–2015	20	11.56
World Development Indicators 2016–2020	8	4.62
Penn World Table 5.6, 1994	24	13.87
Penn World Table 6.1–6.3, 2002–2009	14	8.09
Penn World Table 7.0–7.1, 2011–2012	7	4.05
Penn World Table 8.0, 2013	2	1.16
Maddison 2003	3	1.73
Maddison 2007	1	0.58
Maddison 2010	4	2.31

Among the studies that mention their data source, 106 use the WDI. Some combine multiple sources. To construct their GDP variable, Goldstein et al. (2007, 50) “turned first to the 2005 edition of the World Bank’s World Development Indicators …then extended the series backwards to 1946, using U.S. dollar figures from the Penn World Tables, the United Nations, the Oxford Latin American Economic History Database, and the IMF International Financial Statistics. In a few cases, we used the GDP indices from Maddison …to complete the data set.” Mansfield and Reinhardt (2008, 636, footnote 61) use GDP data “from a hierarchy of sources, starting with the World Bank’s World Development Indicators, the OECD’s Monthly Statistics of International Trade, UNCTAD’s Handbook of Statistics On-Line, the IMF’s World Economic Outlook, the Penn World Table version 6.1, and the IMF’s International Financial Statistics.” Yet, WDI, Penn World Table (PWT), and Maddison figures are quite different from each other—even if the underlying data are the same—due to differences in currency conversions or PPP adjustments. Ram and Ural (2014) identify 33 countries for which GDP estimates from the WDI and the PWT differ by at least 25%. It is unclear if authors combining multiple sources are aware of these differences or address potential discrepancies.

Instead of providing a direct source, 21 articles refer readers to Gleditsch (2002), who collected GDP, trade, and population data from PWT version 5.6, imputing missing observations and providing additional estimates from the CIA’s World Factbook. Others refer readers to Fearon and Laitin (2003), who used WDI data to extend PWT estimates for GDP growth. However, the data compiled by Gleditsch and Fearon and Laitin rely on older PWT and WDI vintages, which have larger measurement errors. Many authors do not know the true nature of the data underlying their empirical analyses, since they did not collect these data themselves.

Though 86.13% of all articles mention their data source, only 68.79% mention their data vintage, indicating a release date or version number (“World Development Indicators 2006” or “Penn World Table 5.6”). Given the differences between data vintages, this information should always be provided.

Empirical consequences of data vintaging

To show that even rigorous research is vulnerable to data vintaging, I use different sources and vintages to replicate three of the surveyed studies, selected following three criteria. First, they had to be transparent about their sources and vintages, allowing me to locate equivalent variables elsewhere. Second, these had to be older studies, ensuring that there were more recent comparable data. Third, supplementary materials (both data and replication code) had to be publicly available, allowing me to estimate the models exactly as published. Since the International Organization website does not provide supplementary materials for issues before 2011, my selection was restricted to authors who provided this information on their own websites.

De Soysa and Neumayer (2005)

Using data for 135 countries between 1980 and 2000,² de Soysa and Neumayer (2005, 732) show that economic globalization has a positive and statistically significant effect on sustainable development, defined as a state’s “ability to maintain (increase) the aggregate value of manufactured, human, and natural capital.” In eliminating price distortions and promoting an efficient allocation of resources across borders, globalization minimizes waste.

The outcome, genuine savings (% of GNI), combines six WDI variables.³ The key independent variable, trade (% of GDP), comes from the WDI, as do six control variables: current GNI per capita in PPP; agriculture (% of GDP); GDP per capita growth (%); population; population density; and urban population (% of the total population). The random effects generalized least squares model also controls for regime type, fuel exporter status, political constraints, stability of the political system, occurrence of a currency crisis or a civil war, and number of peace years since 1946.

All WDI variables come from the 2002 release, though data for Angola and Sudan come from the 2003 release “because their values seem to be reported with errors” (de Soysa and Neumayer, 2005, 740, footnote 56). Upon closer inspection of the 2002 data, the genuine savings rate for Angola and Sudan is exceptionally low, but the World Bank (2023) did not flag these countries as problematic, and more recent releases report similarly extreme values. Even if the 2002 vintage suffers from measurement error and the 2003 vintage does not, there is no evidence that this error is limited to Angola and Sudan.

Table 2 compares the original results to results using 2002, 2012, and 2022 WDI data. Since older benchmarks tend to be biased downward, I expect to recover larger effect sizes when using newer vintages. But these differences should be in levels, not trends. The effect of the independent variable on the outcome should be substantively and statistically consistent; both variables come from the same source and should suffer from similar measurement errors.

Table 2.

The effect of trade dependence on genuine savings (random effects GLS), 1980–1999.

	(1)	(2)	(3)	(4)
	Original Model	WDI 2002	WDI 2012	WDI 2022
Trade/GDP (ln)	2.416***	2.109**	2.901***	−0.898
	(0.774)	(0.999)	(1.053)	(2.747)
GNI pc (ln)	21.933***	21.070***	12.256**	37.320
	(6.248)	(7.938)	(6.323)	(33.424)
(GNI pc)² (ln)	−0.977***	−0.986**	−0.532	−2.617
	(0.370)	(0.473)	(0.371)	(1.895)
Economic Growth	−0.009	0.029	0.115***	0.100
	(0.024)	(0.031)	(0.031)	(0.079)
Agriculture/GDP	−0.052	−0.131**	−0.166**	0.056
	(0.047)	(0.060)	(0.063)	(0.174)
Currency Crisis	0.017	−0.255	1.117**	0.828
	(0.397)	(0.520)	(0.560)	(1.100)
Fuel Exporter	−18.230***	−22.429***	−21.671***	−1.718
	(2.397)	(2.711)	(2.590)	(20.317)
Democracy	1.034	1.009	0.577	1.196
	(0.714)	(0.939)	(0.934)	(2.160)
Political Constraints	−1.203	1.239	−2.597	−4.619
	(1.503)	(1.978)	(1.954)	(3.711)
Government Stability	−0.344	−0.337	−0.175	0.168
	(0.345)	(0.455)	(0.467)	(0.909)
Population Density (ln)	1.078**	1.406***	1.232**	1.741
	(0.435)	(0.499)	(0.490)	(3.627)
Population Size (ln)	0.116	−0.098	0.534	−1.021
	(0.405)	(0.471)	(0.479)	(3.214)
Population Urban	−8.490***	−7.855***	−6.664***	16.062
	(1.524)	(1.816)	(1.783)	(11.964)
Civil War	−1.473**	1.320	−0.038	1.149
	(0.734)	(0.956)	(0.953)	(1.794)
Peace Years	0.010	0.018	0.008	0.032
	(0.025)	(0.032)	(0.031)	(0.091)
Constant	−89.726***	−81.206**	−52.733**	−173.620
	(26.202)	(33.409)	(26.636)	(145.987)
Number of Observations	2,069	2,058	1,822	840
Countries	135	135	122	109

This is a replication of Model 1, Table 1 in de Soysa and Neumayer (2005). Standard errors appear in parentheses. All regressions assume an AR1 correlation structure and include year dummies. All independent variables are lagged one year. *p < 0.1, **p < 0.05, ***p < 0.01.

Models 1 to 3 confirm these expectations and support de Soysa and Neumayer’s finding: countries that trade more tend to have a higher genuine savings rate. The coefficient for Trade/GDP is smallest when using 2002 data and largest when using 2012 data. Since the exact point estimate is not robust to using different vintages, the interpretation of results should focus on the direction and statistical significance of the effects.

The original results are not robust to using 2022 data: in Model 4, the coefficient for Trade/GDP is negative and not statistically significant. This is because PPP indicators were revised after 2014 to adopt newer ICP benchmarks. Consequently, “PPP data are now provided only from 1990, as the longer the time period between the estimate and the benchmark, the greater the risk of inaccuracy” (World Bank, 2023). If the authors used 2022 data, they would not find support for their argument. This is not because the argument is wrong, but because the sample covered by each vintage affects one’s empirical conclusions. The number of observations shrinks from 2,069 in Model 1 to 840 in Model 4. These observations are not lost evenly across countries, as Figure 2 shows. Recent vintages might be closer to the “truth,” but if “truthful” values are not available for all countries, the sample (and with it the empirical results) will be biased.
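
The significance stars in Table 2 can be checked by hand from the reported coefficients and standard errors. The sketch below is illustrative only: it assumes a two-sided test under a standard normal approximation, which may differ slightly from the reference distribution used by the estimation software. The function names are placeholders; the coefficients and standard errors are copied from the Trade/GDP (ln) row of Table 2.

```python
import math


def two_sided_p(coef: float, se: float) -> float:
    """Two-sided p-value for coef/se under a standard normal approximation."""
    z = abs(coef / se)
    return math.erfc(z / math.sqrt(2))


def stars(p: float) -> str:
    """Map a p-value to the star convention in the table notes."""
    if p < 0.01:
        return "***"
    if p < 0.05:
        return "**"
    if p < 0.1:
        return "*"
    return ""


# Trade/GDP (ln): (coefficient, standard error), copied from Table 2.
trade_gdp = {
    "Original Model": (2.416, 0.774),
    "WDI 2002": (2.109, 0.999),
    "WDI 2012": (2.901, 1.053),
    "WDI 2022": (-0.898, 2.747),
}

results = {name: stars(two_sided_p(b, se)) for name, (b, se) in trade_gdp.items()}
for name, mark in results.items():
    print(f"{name}: {mark or 'not significant'}")
```

Under this approximation, the stars match those printed in Table 2: the coefficient is significant in Models 1 to 3 and not significant in Model 4, where the standard error is roughly three times the size of the point estimate.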

The discrepancies identified in Table 2 are not just a function of sample selection bias, but also a consequence of data revisions: different vintages might disagree about the same country-year pairs. Table 3 reduces the analysis to the 735 country-year pairs common to all vintages. When all vintages have the exact same coverage, the effect of trade on genuine savings, while never significant, is positive in Models 1 to 3 and negative in Model 4.
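
Restricting the analysis to a common sample amounts to intersecting the country-year pairs available in each vintage and filtering every dataset to that intersection. The following is a minimal sketch, not the replication code: the country codes, values, and function names are hypothetical, and only the vintage labels come from the text.

```python
from functools import reduce


def common_pairs(coverage):
    """Return the (country, year) pairs available in every vintage."""
    return reduce(set.intersection, (set(pairs) for pairs in coverage.values()))


def restrict(rows, keep):
    """Drop rows whose (country, year) key is missing from any vintage."""
    return [row for row in rows if (row["country"], row["year"]) in keep]


# Hypothetical coverage for three releases; placeholder codes, not WDI data.
coverage = {
    "WDI 2002": [("AAA", 1989), ("AAA", 1990), ("BBB", 1990)],
    "WDI 2012": [("AAA", 1990), ("BBB", 1990), ("CCC", 1990)],
    "WDI 2022": [("AAA", 1990), ("BBB", 1990)],
}
shared = common_pairs(coverage)

# Hypothetical rows from one vintage, filtered to the common sample.
rows_2002 = [
    {"country": "AAA", "year": 1989, "trade_gdp": 41.0},
    {"country": "AAA", "year": 1990, "trade_gdp": 43.5},
    {"country": "BBB", "year": 1990, "trade_gdp": 60.2},
]
rows_2002_common = restrict(rows_2002, shared)
print(sorted(shared))
print(len(rows_2002_common))
```

Holding the sample fixed in this way separates the two channels discussed here: any remaining difference between models reflects revised values, not changes in coverage.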

Table 3.

The effect of trade dependence on genuine savings (random effects GLS), including only observations available from all sources, 1980–1999.

	(1)	(2)	(3)	(4)
	Original Model	WDI 2002	WDI 2012	WDI 2022
Trade/GDP (ln)	2.053	1.912	1.779	−0.181
	(1.304)	(1.303)	(1.397)	(3.097)
GNI pc (ln)	25.009**	24.102**	24.598***	51.776
	(10.070)	(10.057)	(9.232)	(37.989)
(GNI pc)² (ln)	−1.190**	−1.160*	−1.164**	−3.395
	(0.594)	(0.593)	(0.534)	(2.178)
Economic Growth	0.016	0.020	0.027	0.131
	(0.042)	(0.042)	(0.049)	(0.086)
Agriculture/GDP	−0.080	−0.130*	−0.001	0.138
	(0.073)	(0.074)	(0.091)	(0.192)
Currency Crisis	0.421	0.458	1.253*	1.726
	(0.594)	(0.594)	(0.740)	(1.226)
Fuel Exporter	−21.870***	−22.094***	−20.993***	−4.186
	(3.161)	(3.158)	(3.395)	(25.065)
Democracy	0.746	0.703	−0.855	0.241
	(1.030)	(1.028)	(1.146)	(2.534)
Political Constraints	−0.942	−0.949	−3.491*	−4.927
	(1.847)	(1.844)	(2.059)	(4.208)
Government Stability	−0.627	−0.621	−0.677	0.195
	(0.483)	(0.483)	(0.619)	(0.998)
Population Density (ln)	1.814***	1.834***	1.379**	1.751
	(0.556)	(0.555)	(0.598)	(4.190)
Population Size (ln)	−0.059	−0.116	0.731	−1.254
	(0.559)	(0.559)	(0.608)	(3.767)
Population Urban	−8.244***	−8.635***	−7.373***	14.016
	(2.234)	(2.232)	(2.612)	(13.373)
Civil War	−0.474	−0.491	−0.085	1.214
	(0.899)	(0.898)	(1.068)	(1.927)
Peace Years	0.028	0.028	0.013	0.035
	(0.035)	(0.035)	(0.037)	(0.107)
Constant	−96.740**	−87.473**	−107.309***	−231.806
	(42.707)	(42.747)	(38.624)	(164.192)
Number of Observations	735	735	735	735
Countries	95	95	95	95

Vreeland (2008)

Many studies only derive their control variables from the WDI, not their main variables. This is still empirically consequential: since different vintages cover different countries and years, vintaged control variables affect the sample size. Moreover, coefficients tend to be biased downward when there are measurement errors on the right-hand side, attenuating the “true” effect that would appear had variables been measured correctly (Hausman, 2001). Researchers might be underestimating the substantive importance of their findings, as the second replication shows.
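
The attenuation logic can be illustrated with a small simulation. This sketch covers only the textbook bivariate case of classical measurement error (noise that is independent of the true regressor and of the outcome); it is not a model of the WDI revisions. All names, the seed, and the parameter values are assumptions chosen for illustration. With equal variances for the true regressor and the noise, the estimated slope should shrink toward roughly half of the true value.

```python
import random


def ols_slope(x, y):
    """Slope from a bivariate OLS regression of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    return cov_xy / var_x


rng = random.Random(42)  # fixed seed so the sketch is reproducible
n = 20_000
true_beta = 1.0

# True regressor and outcome (illustrative values, unit variances).
x_true = [rng.gauss(0, 1) for _ in range(n)]
y = [true_beta * x + rng.gauss(0, 1) for x in x_true]

# Observed regressor = true value + independent noise of equal variance.
x_noisy = [x + rng.gauss(0, 1) for x in x_true]

slope_clean = ols_slope(x_true, y)
slope_noisy = ols_slope(x_noisy, y)
print(f"slope with true regressor:  {slope_clean:.2f}")
print(f"slope with noisy regressor: {slope_noisy:.2f}")
```

The expected shrinkage factor in this setup is the share of the observed regressor's variance that comes from the true regressor, here 1 / (1 + 1).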

Using data on 109 dictatorships between 1985 and 1996, Vreeland (2008) finds that multiparty dictatorships are more likely to torture opponents and more likely to enter the United Nations Convention Against Torture (CAT) than one-party or no-party dictatorships. When power is shared, there is more room to disagree with the ruling party. Since at least some dissent is tolerated, defection is more common, as is the punishment of defectors. But interest groups can force the regime to make concessions—and entering the CAT is one concession. The focus on dictatorships is valuable: autocrats overstate their growth rates and do not revise these figures (Martínez, 2022), so there could be fewer discrepancies between WDI releases than in the previous replication.

I replicate the first part of Vreeland’s argument: multiparty dictatorships are more likely to engage in torture. The outcome is a five-point ordinal scale of torture, ranging from one (no allegations of torture) to five (torture is prevalent or widespread), and the estimated model is an ordinal logit. The main explanatory variable, Parties, takes the value of one if more than one party exists legally, and zero otherwise. Four control variables come from the WDI: GDP per capita in 1995 PPP dollars, GDP growth (%), population, and trade (% of GDP). The study also controls for communist regimes and the occurrence of a civil war.

Vreeland uses 2004 WDI data, combined with PWT 6.1. I compare the original data to three WDI releases: 1998 (the earliest release for which all required years are available), 2004 (without PWT additions), and 2018. These vintages report GDP per capita with 1987, 1995, and 2011 as the base year, respectively.

Table 4 presents the results of this replication. Rather than interpret the coefficients for control variables, I investigate how their inclusion affects the main results. Controlling for GDP per capita, GDP growth, population, and trade using 1998 or 2004 data, Parties continues to have a significant positive effect on the outcome. The coefficient for Parties is smaller in Model 2 than in Model 3, confirming that older benchmarks are biased downward: their measurement errors attenuate the “true” effect of multiple parties on torture.

Table 4.

The effect of multiple parties on torture in dictatorships (ordinal logit), 1985–1996.

	(1)	(2)	(3)	(4)
	Original Model	WDI 1998	WDI 2004	WDI 2018
Parties	0.578***	0.447***	0.507***	−0.066
	(0.149)	(0.152)	(0.148)	(0.246)
GDP/Capita	0.016	0.044*	−0.006	0.003
	(0.025)	(0.026)	(0.025)	(0.011)
Growth	0.007**	−0.003	0.002	−0.001
	(0.003)	(0.015)	(0.013)	(0.012)
Population	0.002***	0.003***	0.003***	0.006***
	(0.001)	(0.001)	(0.001)	(0.001)
Trade/GDP	−0.010***	−0.009***	−0.011***	−0.010***
	(0.002)	(0.002)	(0.002)	(0.002)
Civil War	0.795***	1.072***	0.856***	0.576**
	(0.170)	(0.184)	(0.176)	(0.241)
Communist	−1.098***	−1.655***	−1.561***	−2.968***
	(0.355)	(0.396)	(0.346)	(0.618)
Cut 1	−3.073***	−2.891***	−3.069***	−3.720***
	(0.241)	(0.244)	(0.255)	(0.380)
Cut 2	−1.048***	−0.910***	−1.163***	−1.533***
	(0.165)	(0.177)	(0.202)	(0.274)
Cut 3	1.141***	1.312***	0.960***	0.315
	(0.172)	(0.186)	(0.198)	(0.268)
Cut 4	2.700***	2.942***	2.505***	1.716***
	(0.219)	(0.239)	(0.236)	(0.301)
Number of Observations	694	668	710	403
Log Likelihood	−893.6	−852.1	−927.6	−535.3

This is a replication of Model 1, Table 1 in Vreeland (2008). Robust standard errors appear in parentheses. *p < 0.1, **p < 0.05, ***p < 0.01.

Model 4 indicates that the original results are not robust to using 2018 data. The number of observations shrinks from 694 (Model 1) to 403 (Model 4) because the World Bank ceased to provide data before 1990 after revising PPP indicators in 2014. This should not be taken as evidence against the original findings: since the CAT was opened for signature in 1984, the reduced sample drops crucial years and only covers a fraction of the 109 dictatorships included in the original study. Figure 3 reiterates that researchers using different vintages would draw inferences from different samples, as the missing observations are not the same. Even if only the control variables are vintaged, their inclusion might shape our substantive conclusions about the relationship between other unvintaged variables.

Figure 3.

Available Observations by Country, 1985–1996.

Table 5 reduces the analysis to the 283 country-year pairs common to all vintages. When all vintages have the same coverage, all models agree that multiparty dictatorships are more likely to torture opponents, though the coefficient for Parties is no longer statistically significant, but they disagree about the magnitude of the effect. This discrepancy is driven by data revisions, not just by changes in the sample size. It underscores the need to look beyond point estimates, which might suffer from measurement error.

Table 5.

The effect of multiple parties on torture in dictatorships (ordinal logit), including only observations available from all sources, 1985–1996.

	(1)	(2)	(3)	(4)
	Original Model	WDI 1998	WDI 2004	WDI 2018
Parties	0.250	0.284	0.268	0.332
	(0.314)	(0.316)	(0.316)	(0.315)
GDP/Capita	0.078	0.081	0.094	0.070***
	(0.051)	(0.068)	(0.061)	(0.025)
Growth	0.006	−0.003	−0.010	−0.012
	(0.018)	(0.026)	(0.025)	(0.026)
Population	0.006***	0.006***	0.006***	0.006***
	(0.001)	(0.001)	(0.001)	(0.001)
Trade/GDP	−0.015***	−0.018***	−0.015***	−0.018***
	(0.004)	(0.004)	(0.004)	(0.004)
Civil War	0.327	0.321	0.298	0.319
	(0.284)	(0.286)	(0.280)	(0.288)
Communist	−3.279***	−3.301***	−3.221***	−3.026***
	(0.813)	(0.831)	(0.828)	(0.830)
Cut 1	−4.072***	−4.354***	−4.086***	−4.135***
	(0.545)	(0.562)	(0.545)	(0.536)
Cut 2	−1.446***	−1.714***	−1.458***	−1.506***
	(0.375)	(0.397)	(0.375)	(0.377)
Cut 3	0.629*	0.390	0.618*	0.597
	(0.362)	(0.379)	(0.363)	(0.372)
Cut 4	2.214***	1.998***	2.206***	2.208***
	(0.399)	(0.417)	(0.401)	(0.413)
Number of Observations	283	283	283	283
Log Likelihood	−354.405	−351.280	−353.962	−351.121

This is a replication of Model 1, Table 1 in Vreeland (2008). Robust standard errors appear in parentheses. *p < 0.1, **p < 0.05, ***p < 0.01.

Goldstein et al. (2007)

The final replication looks at variation across different data sources, not just different releases of the same source. According to Rose (2004), formal membership in the General Agreement on Tariffs and Trade (GATT) and the World Trade Organization (WTO) did little to increase trade. However, Goldstein et al. (2007) argue that the GATT/WTO created rights and obligations even for countries that had not attained formal membership, like colonies and newly independent states. Using dyadic data from 1946 to 2004, the authors introduce a measure of participation that goes beyond formal members to include nonmember participants; using this measure, formal members and nonmember participants trade more than nonparticipants.

I do not replicate Goldstein et al.’s main results, but rather a gravity model used to establish Rose’s original finding, without the novel measure of GATT/WTO participation. The outcome is the value of imports (in 1967 USD) from country i to country j. The key explanatory variables indicate the existence of a unilateral or bilateral GATT/WTO membership. Besides the standard gravity variables (the distance between i and j as well as the product of their GDP), the model controls for participation in preferential trade agreements (PTA) or in the Generalized System of Preferences (GSP), currency unions, land area, colonial ties, shared language or border, and whether the two countries are islands or landlocked.

Combining data from multiple sources into one single variable appears to be standard practice to maximize coverage. This study is no different. The main source for Log Product Real GDP (in 1967 USD) is the 2005 WDI, complemented by version 6.1 of the PWT and the 2003 Maddison Project, which report GDP in 2000 USD, 1996 USD, and 1990 international dollars, respectively. It is not clear how these data were rescaled to 1967 USD, as the Maddison Project does not provide nominal GDP data that would enable such calculations. Therefore, I use WDI, PWT, and Maddison data in their raw form, without changing the base year.

Instead of interpreting control variable coefficients, I ask whether their inclusion affects the main results. According to Model 1 in Table 6, formal GATT/WTO membership significantly reduces trade. Model 2, estimated using WDI data, corroborates this finding (which is unsurprising; the original data rely primarily on the WDI). Models 3 and 4 find the opposite: trade increases when both members of the dyad are formal GATT/WTO members.⁴ Using PWT or Maddison data, the authors would find support for their argument even without the updated participation measure.⁵

Table 6.

The effect of GATT/WTO membership on trade (ordinary least squares), 1946–2004.

	(1)	(2)	(3)	(4)
	Original Model	WDI 2005	PWT 6.1	Maddison 2003
Both Formal GATT/WTO Members	−0.070***	−0.182***	0.120***	0.107***
	(0.026)	(0.031)	(0.031)	(0.027)
Only One Formal GATT/WTO Member	−0.211***	−0.289***	−0.223***	−0.203***
	(0.025)	(0.032)	(0.032)	(0.027)
Reciprocal PTA	0.334***	0.197***	0.289***	0.287***
	(0.027)	(0.030)	(0.035)	(0.030)
Nonreciprocal PTA	0.139***	0.178***	0.181***	0.160***
	(0.035)	(0.036)	(0.038)	(0.037)
GSP	−0.097***	−0.086***	0.214***	0.177***
	(0.022)	(0.023)	(0.027)	(0.025)
Currency Union	1.010***	1.134***	1.069***	1.129***
	(0.075)	(0.086)	(0.093)	(0.079)
Colonial Orbit	1.755***	1.771***	1.639***	1.548***
	(0.104)	(0.258)	(0.153)	(0.100)
Log Product Real GDP	0.771***	0.752***	0.907***	0.844***
	(0.005)	(0.005)	(0.007)	(0.006)
Log of Distance	−0.708***	−0.830***	−0.795***	−0.754***
	(0.015)	(0.017)	(0.019)	(0.016)
Common Language	0.357***	0.269***	0.469***	0.421***
	(0.034)	(0.038)	(0.042)	(0.038)
Land Border	0.577***	0.562***	0.471***	0.438***
	(0.059)	(0.068)	(0.080)	(0.066)
Number of Landlocked	−0.142***	−0.124***	−0.197***	−0.218***
	(0.020)	(0.022)	(0.025)	(0.022)
Number of Islands	0.237***	0.288***	0.231***	0.224***
	(0.032)	(0.035)	(0.037)	(0.034)
Log Product Land Area	−0.095***	−0.043***	−0.161***	−0.134***
	(0.005)	(0.005)	(0.006)	(0.005)
Constant	−11.754***	−13.221***	−18.447***	−17.085***
	(0.252)	(0.278)	(0.358)	(0.321)
Number of Observations	381,656	269,313	243,109	360,730
R²	0.613	0.643	0.631	0.583

This is a replication of Model 1, Table 1 in Goldstein et al. (2007). Robust standard errors, clustered by directed dyad, appear in parentheses. All regressions include year dummies. *p < 0.1, **p < 0.05, ***p < 0.01.

Figure 4 confirms that different sources cover different periods, but this alone cannot explain the discrepancies between results. Table 7 shows that these discrepancies persist when the analysis includes only the 202,819 dyad-year observations common to all three sources. This is because measures like real GDP, GDP growth, and trade to GDP rely on nominal GDP information that differs across sources due to currency conversions or PPP adjustments. Compared to the PWT, the WDI consistently overestimates the size of developed economies and underestimates the economies of small nations. Just as researchers using different vintages might draw inferences from different samples, the choice of one source over another can affect researchers’ conclusions.

Figure 4.

Available Observations by Country, 1946–2004.

Table 7.

The effect of GATT/WTO membership on trade, including only observations available from all sources (ordinary least squares), 1946–2004.

	(1)	(2)	(3)	(4)
	Original Model	WDI 2005	PWT 6.1	Maddison 2003
Both Formal GATT/WTO Members	−0.133***	−0.110***	0.111***	0.088**
	(0.033)	(0.034)	(0.035)	(0.036)
Only One Formal GATT/WTO Member	−0.259***	−0.292***	−0.267***	−0.376***
	(0.034)	(0.036)	(0.037)	(0.038)
Reciprocal PTA	0.229***	0.147***	0.239***	0.227***
	(0.032)	(0.033)	(0.037)	(0.036)
Nonreciprocal PTA	0.099***	0.176***	0.138***	0.138***
	(0.037)	(0.038)	(0.039)	(0.039)
GSP	−0.083***	−0.060**	0.239***	0.224***
	(0.024)	(0.025)	(0.028)	(0.028)
Currency Union	1.055***	1.222***	1.048***	1.167***
	(0.097)	(0.102)	(0.104)	(0.102)
Colonial Orbit	1.321***	1.505***	1.310***	1.374***
	(0.277)	(0.250)	(0.257)	(0.238)
Log Product Real GDP	0.829***	0.763***	0.919***	0.900***
	(0.006)	(0.006)	(0.008)	(0.008)
Log of Distance	−0.792***	−0.837***	−0.830***	−0.834***
	(0.018)	(0.019)	(0.020)	(0.020)
Common Language	0.434***	0.349***	0.486***	0.443***
	(0.040)	(0.042)	(0.045)	(0.045)
Land Border	0.604***	0.568***	0.543***	0.522***
	(0.075)	(0.076)	(0.083)	(0.085)
Number of Landlocked	−0.153***	−0.124***	−0.235***	−0.153***
	(0.024)	(0.025)	(0.026)	(0.027)
Number of Islands	0.230***	0.239***	0.232***	0.211***
	(0.035)	(0.037)	(0.039)	(0.039)
Log Product Land Area	−0.108***	−0.050***	−0.170***	−0.156***
	(0.006)	(0.006)	(0.006)	(0.006)
Constant	−13.350***	−13.539***	−18.862***	−18.751***
	(0.290)	(0.305)	(0.373)	(0.375)
Number of Observations	202,819	202,819	202,819	202,819
R²	0.682	0.668	0.643	0.640

Conclusions

Vintaged data are ubiquitous in political science. However, 31.21% of the studies published in International Organization from 2005 to 2020 do not mention their vintage, while 13.87% do not mention their source. Besides sharing their replication code and data, researchers should disclose their sources and vintages, as de Soysa and Neumayer (2005), Vreeland (2008), and Goldstein et al. (2007) do.

Even rigorous findings might disappear if re-estimated using different data. To identify potential sources of imprecision, researchers can consult the WDI’s Data Updates and Errata, which describe “additions, deletions, and changes in codes, descriptions, definitions, sources and topics” (World Bank, 2023). Plotting the distribution of variables can help identify extreme values. A plot of GDP growth over time would reveal that Equatorial Guinea’s economy grew 18.2% in 2000. Using case knowledge, researchers can assess whether this is a “true” outlier—the result of increased oil production—or a product of human error—in which case Equatorial Guinea would appear in the Data Updates and Errata (it does not).
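
One way to screen for heavily revised observations is to compute, for each country-year, the spread between the values reported by different releases and flag those above a chosen threshold. The sketch below is a rough illustration under stated assumptions: the function names and the five-percentage-point threshold are arbitrary choices, and the second entry is a hypothetical comparison case. The only figures taken from the text are the lowest and highest values reported for Equatorial Guinea's 2000 GDP growth.

```python
def vintage_spread(values_by_vintage):
    """Range (max minus min) of the values reported across vintages."""
    values = list(values_by_vintage.values())
    return max(values) - min(values)


def flag_revised(series, threshold):
    """Return the keys whose spread across vintages exceeds the threshold."""
    return [
        key
        for key, by_vintage in series.items()
        if vintage_spread(by_vintage) > threshold
    ]


series = {
    # Lowest and highest GDP growth values (%) mentioned in the text; the
    # labels are placeholders, since the text does not say which release
    # reported which value.
    ("Equatorial Guinea", 2000): {"lowest reported": 1.47, "highest reported": 18.2},
    # Hypothetical comparison case with a small revision.
    ("Hypothetical country", 2000): {"release A": 3.1, "release B": 3.3},
}

# Assumed threshold: flag spreads above 5 percentage points.
flagged = flag_revised(series, threshold=5.0)
print(flagged)
```

A flagged observation is a prompt for closer inspection using case knowledge and the errata, not evidence of an error in itself.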

Researchers should strike a balance between maximizing coverage and using the most recent data. Unless working with pre-1950 data, one should favor recent PWT or WDI releases, which are revised more frequently than Maddison data. But there are trade-offs: more recent series might have worse coverage or cover an entirely different sample, as the replication of Vreeland (2008) illustrates. Depending on the sample of interest, older releases may be preferable, even if they are further from the “truth.”

Finally, researchers should avoid mixing different vintages or sources. If a vintage or source suffers from inherent measurement error, this error should at least be consistent across all observations. Off-the-shelf datasets are very convenient, but their use does not absolve researchers from thinking about the quality of their data and, ultimately, the robustness of their findings.
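One way to guard against accidental mixing is to record the source and vintage of every observation and check that each variable draws on a single combination. The sketch below assumes a hypothetical row layout. The field names "variable", "source", and "vintage" and all values are illustrative and are not taken from any replication file.

```python
from collections import defaultdict

# Hypothetical merged dataset: each row records where its value came from.
rows = [
    {"variable": "gdp_growth", "country": "AAA", "year": 2000, "source": "WDI", "vintage": "2010"},
    {"variable": "gdp_growth", "country": "BBB", "year": 2000, "source": "WDI", "vintage": "2015"},
    {"variable": "trade", "country": "AAA", "year": 2000, "source": "WDI", "vintage": "2010"},
    {"variable": "trade", "country": "BBB", "year": 2000, "source": "WDI", "vintage": "2010"},
]


def mixed_provenance(rows):
    """Return variables that draw on more than one (source, vintage) pair."""
    provenance = defaultdict(set)
    for row in rows:
        provenance[row["variable"]].add((row["source"], row["vintage"]))
    return {var: sorted(pairs) for var, pairs in provenance.items() if len(pairs) > 1}


mixed = mixed_provenance(rows)
print(mixed)
```

In this toy example only gdp_growth is reported, because its two rows come from different vintages of the same source.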

Supplemental Material

Supplemental Material for New data, new results? How data sources and vintages affect the replicability of research by Iasmin Goes in Research & Politics

Footnotes

Acknowledgements

I would like to thank Gabriella Gricius for research assistance and Kerice Doten-Snitker, Matthew Hitt, and Daniel Weitzel, as well as participants of the 2022 ACUNS-UN Workshop, the 2022 DVPW-ÖGPW-SVPW International Political Economy Conference, and the University of Vienna’s Government Department speaker series, for helpful comments. In particular, thanks to Indra de Soysa, Judith Goldstein, Eric Neumayer, Douglas Rivers, Michael Tomz, and James Vreeland for supporting my replication of their work.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

Correction (June 2025):

The article has been updated with the correct Dataverse link in the Supplemental Material section. For more details, please see the correction notice.

ORCID iD

Iasmin Goes

Supplemental Material

Supplemental material for this article is available online.

The files can be found at

References

Alt, Lassen and Wehner (2014) It isn’t just about Greece: domestic politics, transparency and fiscal gimmickry in Europe. British Journal of Political Science 44(4): 707–716.

Aragão and Linsi (2022) Many shades of wrong: what governments do when they manipulate statistics. Review of International Political Economy 29(1): 88–113.

Berry, Iommi, Stanger, et al. (2018) The status of GDP compilation practices in 189 economies and the relevance for policy analysis. IMF Working Paper 37.

Boehmer, Jungblut and Stoll (2011) Tradeoffs in trade data: do our assumptions affect our results? Conflict Management and Peace Science 28(2): 145–167.

Ciccone and Jarociński (2010) Determinants of economic growth: will data tell? American Economic Journal: Macroeconomics 2(4): 222–246.

Croushore and Stark (2003) A real-time data set for macroeconomists: does the data vintage matter? Review of Economics and Statistics 85(3): 605–617.

de Soysa and Neumayer (2005) False prophet, or genuine savior? Assessing the effects of economic openness on sustainable development, 1980–99. International Organization 59(3): 731–772.

Deaton and Aten (2017) Trying to understand the PPPs in ICP 2011: why are the results so different? American Economic Journal: Macroeconomics 9(1): 243–264.

Fariss, Anders, Markowitz, et al. (2022) New estimates of over 500 years of historic GDP and population data. Journal of Conflict Resolution 66(3): 553–591.

Fearon and Laitin (2003) Ethnicity, insurgency, and civil war. American Political Science Review 97(1): 75–90.

Gleditsch (2002) Expanded trade and GDP data. Journal of Conflict Resolution 46(5): 712–724.

Goldstein, Rivers and Tomz (2007) Institutions in international relations: understanding the effects of the GATT and the WTO on world trade. International Organization 61(1): 37–67.

Hausman (2001) Mismeasured variables in econometric analysis: problems from the right and problems from the left. Journal of Economic Perspectives 15(4): 57–67.

Hollyer, Rosendorff and Vreeland (2014) Measuring transparency. Political Analysis 22(4): 413–434.

Inklaar and Rao (2017) Cross-country income levels over time: did the developing world suddenly become much richer? American Economic Journal: Macroeconomics 9(1): 265–290.

Johnson, Larson, Papageorgiou, et al. (2013) Is newer better? Penn World Table revisions and their impact on growth estimates. Journal of Monetary Economics 60(2): 255–274.

Kerner, Jerven and Beatty (2017) Does it pay to be poor? Testing for systematically underreported GNI estimates. The Review of International Organizations 12(1): 1–38.

Linsi and Mügge (2019) Globalization and the growing defects of international economic statistics. Review of International Political Economy 26(3): 361–383.

Mansfield and Reinhardt (2008) International institutions and the volatility of international trade. International Organization 62(4): 621–652.

Martínez (2022) How much should we trust the dictator’s GDP growth estimates? Journal of Political Economy 130(10): 2731–2769.

Pinkovskiy and Sala-i-Martin (2020) Shining a light on purchasing power parities. American Economic Journal: Macroeconomics 12(4): 71–108.

Ram and Ural (2014) Comparison of GDP per capita data in Penn World Table and World Development Indicators. Social Indicators Research 116(2): 639–646.

Rose (2004) Do we really know that the WTO increases trade? American Economic Review 94(2): 98–114.

Vreeland (2008) Political institutions and human rights: why dictatorships enter into the United Nations Convention Against Torture. International Organization 62(1): 65–101.

World Bank (2023) Data Updates and Errata. https://datahelpdesk.worldbank.org/knowledgebase/articles/906522-data-updates-and-errata
