Abstract
Every year at the United Nations (UN), member states deliver statements during the General Debate (GD) discussing major issues in world politics. These speeches provide invaluable information on governments’ perspectives and preferences on a wide range of issues, but have largely been overlooked in the study of international politics. This paper introduces a new dataset consisting of over 7300 country statements from 1970–2014. We demonstrate how the UN GD corpus (UNGDC) can be used as a resource from which country positions on different policy dimensions can be derived using text analytic methods. The article provides applications of these estimates, demonstrating the contribution the UNGDC can make to the study of international politics.
Introduction
Every September, the heads of state and other high-level country representatives gather in New York at the start of a new session of the United Nations General Assembly (UNGA) and address the Assembly in the General Debate (GD). The GD provides the governments of the almost 200 UN member states with an opportunity to present their views on international conflict and cooperation, terrorism, development, climate change and other key issues in international politics. As such, the statements made during the GD are an invaluable and largely untapped source of information on governments’ policy preferences across a wide range of issues over time.
Government preferences are central to the study of international relations and comparative politics. As preferences cannot be directly observed, they must be inferred from states’ observed behaviour. One approach has been to use military alliances as an indicator of preference similarity (e.g. Bueno de Mesquita, 1983). This approach, however, provides little information about preferences when states do not have alliances. Scholars have instead overwhelmingly relied on UNGA voting records to estimate foreign policy preferences (see Bailey et al., 2015; Voeten, 2013). However, UNGA voting-based methods – like all measures of preference – rely on certain assumptions and, as such, have both strengths and limitations (see Voeten, 2013). For example, one shortcoming is that estimates of state preference are derived from the limited number of issues that are voted on in the UNGA in a given year. 1 Therefore, it is essential that researchers can draw on additional data and measures to avoid producing findings about government preferences that are based on one type of observed state behaviour.
We argue that the application of text analytic methods to GD statements can provide much-needed additional measures and tools that can broaden our understanding of government preferences and their effects. The use of text analytic methods is rapidly gaining ground in comparative politics and legislative studies (see Herzog and Benoit, 2015; Laver et al., 2003; Proksch and Slapin, 2010). To date, however, there has been little effort to use speeches to estimate policy preferences in international relations. The formal and institutionalised setting of the GD, its inclusion of all UN member states, which are provided with equal opportunity to address the Assembly, and the fact that it takes place every year make it an ideal resource from which to derive, using text analysis, estimates of state preferences that can be applied to systematic analyses of international politics.
This paper introduces a new dataset, the UN GD corpus (UNGDC), consisting of 7314 GD statements from 1970–2014, that we have preprocessed, categorised and prepared for empirical applications. In the next section, we discuss the characteristics, content and purpose of the UN GD. Second, we explain the process of collecting and preprocessing the statements, and provide an overview of the UNGDC. We then use the text as data approach to show how the UNGDC can be used as a resource from which estimates of government preferences can be derived, providing applications of these estimates. We conclude by outlining potential uses of the UNGDC in future research.
The UN GD and world politics
The GD marks the start of the UNGA regular session each year. By tradition, the opening speech is made by Brazil, with the US also scheduled to speak on the first day. Typically, the heads of state and governments speak during the first days of the GD, followed by vice-presidents, deputy prime ministers and foreign ministers, and concluding with the heads of delegation to the UN (Bailey, 1960; Luard and Heater, 1994; Nicholas, 1959; Smith, 2006).
The GD provides governments with an opportunity to declare, and to have on public record, their official position on various major international events of the past year (Smith, 2006). In addition, country representatives use the GD venue to present their governments’ perspectives on broader underlying issues in international politics. Their speeches frequently deal with issues of mutual concern such as terrorism, nuclear non-proliferation, development and aid, and climate change, often appealing to the international community to do more to tackle these issues. For example, in 1995, the US discussed UN reform, non-proliferation, terrorism, money laundering and the narcotics trade in its GD statement. In turn, the UK and France both drew attention to the challenges of UN peacekeeping, while India discussed terrorism, disarmament, human rights and concerns about the inability of global institutions, such as the WTO, to address the needs of the Global South.
Several important characteristics of GD speeches have implications for their use in deriving estimates of state preferences. In contrast to UNGA roll-call votes, which are directly linked to the adoption of UN resolutions, GD speeches are not institutionally connected to decision-making within the UN. As a consequence, states face lower external constraints and pressures when delivering GD statements than when voting in the UNGA. Indeed, studies that use UNGA voting highlight the various constraints countries face when voting as a result of, among other things, aid relationships and strategic voting blocs (see Alesina and Dollar, 2000; Kim and Russett, 1996; Voeten, 2000). The lack of external constraints means that when delivering their GD statements, governments have more latitude in the positions they take and the issues they emphasise. Hence, GD statements provide more information on key national priorities than the limited number of votes in the UNGA.
This view is supported by interviews conducted by the authors with members of the diplomatic community. The Deputy Representative of the Finnish Mission to the UN, for example, explained, ‘speeches at the General Debate are interesting because they flesh out national policies – what states think … the General Debate is the one place where states can speak their mind; it reflects the issues that states consider important’. Similarly, a spokesperson for the German Mission to the UN stated that the absence of external pressures when delivering GD statements means ‘these speeches are the most sovereign thing that a country does as a member of the UN’. 2 It is clear that non-democratic regimes also attach great importance to GD statements. For example, members of Russia’s inner political circle not only viewed the 2015 GD statement as a key summary of the country’s foreign policy concerns, they were also apparently aware of its content weeks in advance. 3
A significant consequence of the relative lack of external constraints in the GD is that member states can more freely express their government’s perspectives on issues deemed important – including more contentious issues. As Smith (2006: 155) argues, a key function of the GD is that ‘it provides members with the opportunity to blow off steam on contentious issues without causing undue damage’. This is particularly relevant for smaller nations who can use the GD to raise more disagreeable political issues (see Nicholas, 1959). For example, in 2014, Antigua and Barbuda’s statement emphasised the failure of the US government to adhere to a ruling from the WTO’s Dispute Settlement Body that stated that the US should pay compensation to Antigua and Barbuda. In making this complaint, the Antiguan representative highlighted the importance of the GD for smaller nations, stating ‘my small nation has no military might, no economic clout. All that we have is membership of the international system as our shield and our voice in this body as our sword.’
The presence of fewer external constraints on representatives when delivering GD statements does not, however, imply that these speeches are not strategic. Scholars have long recognised that ‘member states present themselves exclusively in the guise in which they wish to be known’ during these annual debates (Nicholas, 1959: 98). In fact, a key purpose of the GD is that it provides governments with the opportunity to ‘influence international perceptions of their state, aiming to position their states favorably, as well as to influence the perception of other states’ (Hecht, 2016: 10). Therefore, governments use GD speeches strategically to signal their preferences among the community of states. This use of strategic signalling in the GD can be seen when we compare references to Iran in the US statements of 2012 and 2013. In the 2012 address, President Obama 4 was highly critical of Iran:
In Iran we see where the path of a violent and unaccountable ideology leads […] Time and again, it has failed to take the opportunity to demonstrate its nuclear program is peaceful […] Make no mistake: a nuclear-armed Iran is not a challenge that can be contained. It would threaten the elimination of Israel, the security of Gulf nations and the stability of the global economy […] and that is why the United States will do what we must to prevent Iran from obtaining a nuclear weapon.
In contrast, speaking a year later, 4 the US president was more reconciliatory, offering to give diplomacy one last chance in relation to Iran’s nuclear programme:
if we can resolve the issue of Iran’s nuclear program, that can serve as a major step down a long road towards a different relationship, one based on mutual interests and mutual respect […] America prefers to resolve its concerns over Iran’s nuclear program peacefully […] We are not seeking regime change, and we respect the right of the Iranian people to access peaceful nuclear energy.
A few hours later during the same session, President Rouhani in his address 4 also emphasised diplomacy and the hope of reaching a compromise. The world has subsequently learned that the US and Iran held secret talks in the background, which eventually led to the breakthrough and signing of the interim deal (Borger and Kamali, 2013). As such, the change in rhetoric between 2012 and 2013 demonstrates the strategic nature of GD speeches. A further example of both the importance placed by governments on the GD address and its strategic purpose is provided by the Chilcot Inquiry into the UK’s role in the Iraq War. The report contains a memo sent by Prime Minister Tony Blair to President George W. Bush, complimenting the US president on the speech delivered at the 2002 GD that set out the case for war: ‘It was a brilliant speech … it puts us on exactly the right strategy to get the job done.’ 5 Hence, the US speech was seen as part of the US and UK strategy to build support for intervention in Iraq.
The lack of external constraints on member states in delivering GD statements means that they can use their address to indicate the issues considered most important by devoting more attention to these topics. As governments can choose what issues to discuss or ignore, and how strongly to emphasise certain issues, the GD provides detailed information about a government’s position on a policy issue and, also, the importance – or salience – of an issue for a government. As Smith (2006: 155) notes, the GD acts ‘as a barometer of international opinion on important issues, even those not on the agenda for that particular session’. The focus on position and salience means that GD speeches can be used to uncover the most important topics that emerge in international politics over time.
UNGDC: The UN General Debate corpus
The speeches made in the GD are subsequently deposited at the UN Dag Hammarskjöld Library. However, statements made before 1992 are stored as image copies of typewritten documents. These are of very poor quality and require additional preprocessing using optical character recognition software. We collected speeches through the dedicated webpages of the individual UNGA GDs and the UN Bibliographic Information System (UNBIS).
Speeches are typically delivered in the native language. Based on the rules of the Assembly, all statements are then translated by UN staff into the six official languages of the UN. If a speech was delivered in a language other than English, we use the official English version provided by the UN. Therefore, all of the speeches in the UNGDC are in English.
The annual sessions are assigned numbers, starting with the first session in 1946 up to the most recent seventieth session in 2015. We collected all GD speeches from 1970 (Session 25) to 2014 (Session 69). In total, there are 7314 country statements delivered between 1970 and 2014. The number of countries participating in the GD increased from 70 in 1970 to 193 in 2014 in line with the increase in UN membership. Non-member states may also participate in the GD (e.g. the Holy See and Palestine). Several states that previously participated in the GD have ceased to exist. Where possible, we linked such states to their legal successor states (e.g. USSR and the Russian Federation). If this was not possible, we kept speeches in the data under the country’s last known name (e.g. German Democratic Republic). Overall, the corpus contains the GD contributions from 198 countries. On average, speeches contain 123 sentences and 945 unique words. 6
Table 1 provides an overview of the UNGDC. It shows the average frequency of types (unique word forms), tokens (individual words) and sentences for each individual speech in the text corpus. In terms of who delivered the statement, for sessions with identifiable speakers and their posts, 1909 (44.3%) were delivered by heads of state or government (e.g. presidents, prime ministers, kings), 2126 (49.3%) by vice-presidents, deputy prime ministers and foreign ministers and 276 (6.4%) by country representatives at the UN. 7
Table 1. UN GD corpus.
Note: Descriptive statistics for the UNGDC containing 7314 statements delivered by heads of state or their representative from 1970–2014. From 2011, the president of the European Commission made a separate statement on behalf of the EU. UN: United Nations; GD: General Debate.
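The quantities summarised in Table 1 can be sketched in a few lines of Python. The following is a minimal illustration only, not the preprocessing pipeline used to build the UNGDC: the regular-expression tokeniser, the punctuation-based sentence splitter and the function names `speech_statistics` and `corpus_averages` are all assumptions made for the example.

```python
import re


def speech_statistics(text):
    """Count types, tokens and sentences in one speech.

    Assumptions for illustration: a token is a run of letters or
    apostrophes (case-folded), and a sentence ends at ., ! or ?.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "types": len(set(tokens)),   # unique word forms
        "tokens": len(tokens),       # individual words
        "sentences": len(sentences),
    }


def corpus_averages(texts):
    """Average the per-speech statistics over a list of speeches."""
    stats = [speech_statistics(t) for t in texts]
    return {
        key: sum(s[key] for s in stats) / len(stats)
        for key in ("types", "tokens", "sentences")
    }


example = speech_statistics("We meet today. We seek peace!")
# example == {"types": 5, "tokens": 6, "sentences": 2}
```

A production tokeniser would need to handle abbreviations, numerals and hyphenation, which this simple splitter ignores.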
Empirical application: Preferences on single-issue dimensions
The UNGDC can be used by scholars who require easy access to the statements and may want to read a particular text, or compare selected statements. Primarily, however, we envision the UNGDC being used in quantitative applications looking at the nature, formation and effects of state preferences in world politics. Treating text as data has a long tradition in political science (for a review, see Laver, 2014). Since the earlier introduction of text scaling methods, such as Wordscores (Laver et al., 2003) and Wordfish (Slapin and Proksch, 2008), to estimate policy positions on dimensions of interest, the availability and complexity of methods have increased dramatically (Grimmer and Stewart, 2013; Herzog and Benoit, 2015). The majority of such methods are either derived directly from, or can be traced to, the natural language processing literature in computer science and computational linguistics (e.g. Benoit and Nulty, 2013; Lowe, 2008). Based on a Google Scholar citation count, Wordscores is by far the most popular text scaling method in political science. It is related to the Naive Bayes classifier deployed for text categorisation problems (Benoit and Nulty, 2013).
Working with text as data generally involves using the bag-of-words approach, whereby each document can be represented by a multiset (bag) of its words that disregards grammar and word order. Word frequencies in the document are then used to classify the document into one of two categories. In Wordscores, the learning is supervised by providing training documents that are a priori known to belong to either category, so that the chosen dimension is substantively defined by the choice of training documents.
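To make the bag-of-words representation concrete, the sketch below turns documents into word-count multisets and then into a document-term matrix. It is an illustration under simple assumptions (a lowercase regular-expression tokeniser, and the hypothetical helper names `bag_of_words` and `document_term_matrix`), not the preprocessing applied to the UNGDC.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Represent a document as a multiset of its words.

    Grammar and word order are discarded; only counts remain.
    """
    return Counter(re.findall(r"[a-z']+", text.lower()))


def document_term_matrix(texts):
    """Build a documents-by-words count matrix over a shared vocabulary."""
    bags = [bag_of_words(t) for t in texts]
    vocabulary = sorted(set().union(*bags))
    matrix = [[bag[word] for word in vocabulary] for bag in bags]
    return vocabulary, matrix


vocabulary, matrix = document_term_matrix(
    ["Peace and security.", "Security, security and trade."]
)
# vocabulary == ["and", "peace", "security", "trade"]
# matrix == [[1, 1, 1, 0], [1, 0, 2, 1]]
```

Because word order is ignored, "peace and security" and "security and peace" map to the same bag.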
As an illustration of this approach, we derive from our resource estimates of preferences on the very specific issue of US–Russia rivalry in world politics. Figure 1 maps Wordscores estimates for the 2014 UN GD. We use statements by the US and Russia as reference texts. We therefore a priori define the policy dimension as Russia vs US. We do not use the resulting scores as an explanatory variable in an empirical application here due to limited space. However, such an application would clearly be of value for research on international relations. Here, we simply demonstrate how it is possible to derive estimates of differences between UN member states from our resource using the text as data approach.

Figure 1. Wordscores map 2014.
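The scoring logic behind such a map can be sketched as follows. This is a simplified rendering of the raw Wordscores calculation described by Laver et al. (2003), with two reference texts anchoring the dimension; it omits the rescaling of raw scores proposed in the original method, and the tokeniser, the function names and the toy texts are assumptions made purely for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def relative_frequencies(tokens):
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def word_scores(reference_texts, reference_positions):
    """Score each word as the expected position of the reference text
    it came from, given its relative frequency in each reference text."""
    freqs = [relative_frequencies(tokenize(t)) for t in reference_texts]
    scores = {}
    for word in set().union(*freqs):
        f = [fr.get(word, 0.0) for fr in freqs]
        total = sum(f)
        scores[word] = sum(
            fi / total * pos for fi, pos in zip(f, reference_positions)
        )
    return scores


def score_text(text, scores):
    """Score an unseen text as the frequency-weighted mean of the
    scores of its words; words absent from the references are ignored."""
    counts = Counter(t for t in tokenize(text) if t in scores)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("text shares no words with the reference texts")
    return sum(n / total * scores[w] for w, n in counts.items())


# Toy reference texts standing in for two anchoring statements.
scores = word_scores(
    ["peace peace trade", "security security trade"], [-1.0, 1.0]
)
position = score_text("peace trade", scores)  # -0.5
```

Words used equally by both reference texts (here "trade") receive a score midway between the anchors, which is why raw scores of unseen texts tend to cluster towards the centre of the scale.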
Empirical application: Preferences on multiple dimensions
While estimating state preferences on single-issue dimensions has many benefits, countries routinely express preferences on multiple dimensions of foreign policy. We therefore turn to correspondence analysis (CA) – a dimensionality reduction technique (e.g. Bonica, 2013). In CA, the first dimension is fitted to explain maximal variation in the data, while subsequent dimensions explain maximal residual variation (which means dimensions are orthogonal to each other). Unlike Wordscores, the definition of the dimensions produced by CA must be discerned inductively, a posteriori (Laver, 2014). This also implies that the dimensions produced by CA may correspond to single, multiple or meta issues. Figure 2 presents the positions over time of the US and Russia (opponents) and the US and the UK (allies) on the first and second dimensions (CA1 and CA2) uncovered using CA.

Figure 2. CA1 and CA2 of allies and opponents.
Lowe (2016) suggests that the position estimated by such models is a low-dimensional summary of the relative emphasis of one topic over another, compared to what would be expected by chance. This is consistent with a key assumption of the saliency theory of party competition (Budge et al., 2001), which posits that the policy differences between parties are determined by their contrasting emphases on different issues. In the context of GD statements, the CA model fitted to the count data of unique words captures countries’ relative emphasis of different issues – and therefore the differences in their policy preferences.
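As a rough illustration of how CA extracts such dimensions from word counts, the sketch below applies the standard textbook formulation: a singular value decomposition of the standardised residuals of a document-term matrix. It assumes NumPy is available and that no document or word has a zero total count; the function name and the toy matrix are hypothetical, and this is not necessarily the software or specification used for the estimates reported here.

```python
import numpy as np


def correspondence_analysis(counts, n_dims=2):
    """Return document coordinates and the inertia of each dimension.

    counts: documents-by-words matrix of non-negative counts with no
    all-zero rows or columns (an assumption of this sketch).
    """
    N = np.asarray(counts, dtype=float)
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # document (row) masses
    c = P.sum(axis=0)                    # word (column) masses
    expected = np.outer(r, c)            # counts expected by chance
    S = (P - expected) / np.sqrt(expected)   # standardised residuals
    U, sigma, _ = np.linalg.svd(S, full_matrices=False)
    # Principal coordinates of documents; dimensions are orthogonal
    # and ordered by the share of variation (inertia) they explain.
    coords = (U * sigma) / np.sqrt(r)[:, None]
    return coords[:, :n_dims], sigma[:n_dims] ** 2


# Toy matrix: documents 0-1 stress word 0, documents 2-3 stress word 1.
toy_counts = np.array([[10, 0, 2], [8, 1, 3], [0, 9, 4], [1, 10, 2]])
coords, inertia = correspondence_analysis(toy_counts, n_dims=2)
```

The residuals compare observed word use with what document length and overall word frequency alone would predict, which matches the "relative emphasis compared to chance" reading above. The sign of each dimension is arbitrary, so only relative positions are meaningful.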
A benefit of using CA is that it allows us to easily estimate positions on multiple dimensions. We demonstrate the ease of using multidimensional text scaling by including the new CA measures in an existing analysis of the International Criminal Court (ICC) and US nonsurrender agreements (Kelley, 2007). The format of this article prevents us from covering issues in detail; therefore, the following is intended merely as an illustrative example. In brief, the US sought to pressure other states to sign bilateral agreements not to surrender US citizens to the ICC. This attempt to seek exceptional treatment was widely criticised for inconsistency with international norms, and many countries (but not all) turned it down. Kelley (2007: 573) argues that, for these states, normative preferences trumped strategic concerns. Overall, the views on the nonsurrender agreements were complex and unlikely to be reduced to an easily identifiable single-issue dimension.
To determine the optimal number of CA dimensions to include in the model, we rely on the leave-one-out cross-validation (LOOCV) method (James et al., 2013: 178). Given the sample size, we considered alternative specifications with up to 10 CA dimensions, as presented in Figure 3. 8 For each alternative model we calculate the cross-validation error. As the LOOCV indicates that the optimal number of CA dimensions is three, we add three dimensions to the original specification that predicts whether countries signed nonsurrender agreements (Kelley, 2007).

Figure 3. Choosing optimal number of CA dimensions and the estimated model results.
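The selection procedure can be sketched as a loop over candidate numbers of dimensions, each evaluated by LOOCV. The example below is a simplified stand-in: it uses ordinary least squares with an intercept and squared prediction error, which is an assumption made to keep the code short and is not the specification estimated in the paper. The function names and toy data are likewise hypothetical, and NumPy is assumed to be available.

```python
import numpy as np


def loocv_error(X, y):
    """Mean squared leave-one-out prediction error of an OLS fit."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])   # add an intercept
    errors = []
    for i in range(n):
        keep = np.arange(n) != i                # hold out observation i
        beta, *_ = np.linalg.lstsq(design[keep], y[keep], rcond=None)
        errors.append((y[i] - design[i] @ beta) ** 2)
    return float(np.mean(errors))


def choose_dimensions(scores, y, max_dims):
    """Compare models using the first k columns of `scores`, k = 1..max_dims,
    and return the k with the lowest cross-validation error."""
    errors = {
        k: loocv_error(scores[:, :k], y) for k in range(1, max_dims + 1)
    }
    best = min(errors, key=errors.get)
    return best, errors


# Toy data: the outcome depends on the first two dimensions only.
toy_scores = np.column_stack([
    np.arange(8.0),
    np.array([1, 0, 0, 1, 0, 1, 1, 0], dtype=float),
    np.array([0, 1, 0, 1, 0, 1, 0, 1], dtype=float),
])
toy_outcome = toy_scores[:, 0] - 2.0 * toy_scores[:, 1]
best_k, cv_errors = choose_dimensions(toy_scores, toy_outcome, max_dims=3)
```

For a binary outcome such as signing an agreement, the same loop applies with a classification model and an appropriate loss in place of the least-squares fit.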
The results presented in the second subplot in Figure 3 indicate that the CA3 coefficient is statistically significant. What does this mean substantively? A detailed discussion is limited by the scope of the paper, but we can gain some insight from Figure 4, which shows the most important words defining the variation on that dimension. The results suggest that states who expressed stronger concerns about security and terrorism were more likely to sign the nonsurrender agreement. We interpret this as meaning that indicating security concerns alongside normative goals influenced decisions on whether to sign the nonsurrender agreement with the US. It is, however, important to note that further analysis would be required to fully support this claim.

Figure 4. Word cloud of top 100 words on CA3.
Conclusion
This paper introduces a new dataset, the UNGDC, for understanding and measuring state preferences in world politics. We have demonstrated how scholars can extract relevant information from the UNGDC using text analytic methods. Specifically, we have shown how the UNGDC can be used to uncover single and multiple dimensions of government preferences, and have provided examples of how such estimates can be applied.
Estimates derived from the UNGDC complement existing measures of government preferences based on UNGA voting. In fact, a possible application of the UNGDC would be to investigate the relationship between preferences expressed by governments in their GD statements and their voting behaviour in the UNGA across different issue areas. This would shed light on whether governments express their foreign policy preferences in different ways depending on the particular audience they face and the associated costs.
A benefit of using texts to extract information about preferences is that they provide detailed information about countries’ views on a particular policy area, and so can be compared to other text data. Hence, a future application of the UNGDC would be to compare the statements with international treaties and laws. Such comparisons can show whether some countries have greater influence on specific international agreements than others, and how countries perceive such agreements. For example, researchers may consider the extent to which states adopt language based on international law in their GD statements. Finally, in addition to examining the effects of government preferences, the UNGDC can also help us better understand how state preferences are formed, and which groups in a country influence preferences across different issues.
Footnotes
Acknowledgements
We would like to thank Altaf Ali, Sofia Collignon Delmar, Elvin Gjevori, Karl Murphy, Mohsen Moheimany and Bethsabee Souris for their excellent research assistance. We are also grateful to Kristin Bakke, Alex Braithwaite, David Hudson, Tim Hicks, Jeff Kucik, Lucas Leemann, Neil Mitchell and Erik Voeten for their helpful comments and advice.
Correction (June 2025):
Authors’ note
Authors’ names are listed in alphabetical order. Authors have contributed equally to all work.
Declaration of conflicting interest
The authors declare that there is no conflict of interest.
Funding
We acknowledge the receipt of the Dublin City University Enhancing Performance Award.
Notes
Carnegie Corporation of New York grant
This publication was made possible (in part) by a grant from Carnegie Corporation of New York. The statements made and views expressed are solely the responsibility of the author.
