Sage Journals: Discover world-class research

Abstract

Topic models have been widely used by researchers across disciplines to automatically analyze large textual data. However, they often fail to automate content analysis, because the algorithms cannot accurately classify individual sentences into pre-defined topics. Aiming to make topic classification more theoretically grounded and content analysis in general more topic-specific, we have developed Seeded Sequential Latent Dirichlet allocation (LDA), extending the existing LDA algorithm, and implementing it in a widely accessible open-source package. Taking a large corpus of speeches delivered by delegates at the United Nations General Assembly as an example, we explain how our algorithm differs from the original algorithm; why it can classify sentences more accurately; how it accepts pre-defined topics in deductive or semi-deductive analysis; how such ex-ante topic mapping differs from ex-post topic mapping; how it enables topic-specific framing analysis in applied research. We also offer practical guidance on how to determine the optimal number of topics and select seed words for the algorithm.

Keywords

international relations LDA topic model political science semi-supervised model topic model United Nations

Introduction

Researchers from various fields of social science have extensively used topic models to analyze vast amounts of textual data because their algorithms can automatically group documents with a similar content and provide a summary of the corpus with topic words. Even though existing algorithms recognize topics only as clusters of words, they can produce meaningful results because co-occurrences of words tend to reflect their semantic relationships. Among several topic models, Latent Dirichlet allocation (LDA) (Blei et al., 2003) has been arguably the most popular algorithm in social science research (Grimmer & Stewart, 2013). For example, scholars have employed it to analyze thinktank reports on climate change (Boussalis & Coan, 2016), central bank committees transcripts (Baerg & Lowe, 2018), international news about political violence (Mueller & Rauh, 2018), and patents on robotics (Savin et al., 2022).

LDA is typically applied to a corpus of documents in order to either identify known topics based on a theoretical framework (deductive approach) or to discover unknown topics in order to improve such frameworks (inductive approach). However, researchers often find it difficult to interpret and accept topics identified by the algorithm, because these topics only loosely match the theoretical concepts (Eshima et al., 2020). A common solution to this problem is to manually map topics to the target concepts after fitting a model (ex-post mapping), but another, and better, solution is to “teach” the algorithm the target concepts through seed words before fitting a model (ex-ante mapping). Such a semi-supervised approach is enabled by Seeded LDA (Jagarlamudi et al., 2012; Lu et al., 2011), but only a few researchers have to date applied the model in applied research (e.g., Curini & Vignoli, 2021).

Furthermore, researchers who attempt to perform topic-specific content analysis by correlating topics and other traits (e.g., sentiment, lexical complexity, named entities) of individual sentences often fail, because LDA struggles to identify topics in shorter documents, in which words co-occur less frequently (Yan et al., 2013).¹ The solution to this problem is to rely on topic models that associate words in a group of documents (Amoualian et al., 2016; Du et al., 2012; Gruber et al., 2007; Jiang et al., 2019; Yan et al., 2013). Among proposed algorithms, Sequential LDA (Amoualian et al., 2016; Du et al., 2012) appears to be the most useful for classification of sentences, but it has not been widely used either.

Aiming to facilitate the use of topic models in broader research applications across disciplines, we have developed Seeded Sequential LDA and implemented it in an open-source package by extending the popular LDA algorithm.² With the new algorithm, we contribute to making topic classification more theoretically grounded in the deductive (or semi-deductive) approach and to making content analysis in general more topic-specific with its capability to classify individual sentences more accurately.

In this article, we explain how Seeded Sequential LDA differs from the original algorithm and why it can be used to perform topic-specific content analysis of sentences. In the following sections, we first explain the key aspects of the LDA algorithms. Specifically, since Heinrich (2008) has given a detailed explanation of the original LDA algorithm, we focus on how the proposed changes affect the way the Gibbs sampler assigns topics to words. Second, we compare the new LDA algorithms (unseeded, seeded non-sequential and seeded sequential) by applying them to sentences from the United Nations General Assembly speeches. Third, we present topic-specific sentiment analysis of the sentences as an example of framing analysis that the Seeded Sequential LDA enables.³ Finally, we conclude this article by offering the best practice in topic classification of sentences in social science research.

LDA Algorithms

LDA: Unsupervised Topic Model

We present algorithms using graphical models to help readers understand the key differences between the seeded, sequential and original algorithms. In Figure 1, gray circles are variables whose values are known, whereas white circles are latent variables whose values are unknown; rectangles around them indicate that there are many such variables. The most important variables for LDA are topics of words, $Z$ , and words used to express these topics, $W$ ; the lengths of the two variables are the same so that all the words in the corpus can be assigned to one of the topics; $θ$ is the topic distribution for documents while $ϕ$ is the topic distribution for words. When $d$ is one of the documents, $d \in D$ ; $v$ is one of the unique words, $v \in V$ , in the corpus; $k$ , is one of the topics, $k \in K$ , $θ$ and $ϕ$ are $| D | \times K$ and $K \times | V |$ matrices, respectively.

Figure 1.

Graphical model of the original LDA. $θ a n d ϕ$ are Dirichlet topic distributions for documents and words determined by their hyperparameter $α a n d β$ , respectively. $Z$ is multinomial topic distribution for which the most appropriate topic words, $W$ , are selected.

Although $W$ is the only known variable for LDA, it can infer $θ a n d ϕ$ through Gibbs sampling. First, the algorithm initializes $Z$ by randomly assigning topics to words, and computes $θ a n d ϕ$

θ_{d k} = P (Z = k | d)

= \frac{M_{d k} + α}{M_{d .} + α | K |}

(1)

ϕ_{k v} = P (W = v | k)

= \frac{N_{k v} + β}{N_{. v} + β | V |}

(2)where

M_{d k}

is the frequency of topic

k

to be found in document

d

;

N_{k v}

is the frequency of topic

k

to be assigned to word

v

;

α

and

β

are the hyperparameters for smoothing.⁴ Second, the algorithm updates

Z

by sampling topics from a multinominal distribution computed based on

θ a n d ϕ

P (Z = k | W = v, d) \propto θ_{d k} ϕ_{k v}

(3)

Every time a new topic is sampled, $θ a n d ϕ$ are updated to increase the likelihood of the next sample. By repeating this process 1000 to 2000 times, the algorithm completes inference of the unknown variables, revealing the most important topics for documents and the more relevant words for topics as the highest probability scores in $θ a n d ϕ$ , respectively.

Seeded LDA: Semi-Supervised Topic Model

In unsupervised topic models, topics and concepts match only by accident because the algorithms recognize topics as clusters of words in the corpus. Such mismatches between topics and concepts make it difficult for researchers to employ LDA when analysis is theoretically motivated (Watanabe & Zhou, 2020). However, Seeded LDA can solve this problem because it allows researchers to define topics a priori through seed words, before fitting the model.

Computer scientists have developed algorithms to guide LDA to discover user-defined a priori topics with a small number of seed words. Lu et al. (2011) proposed an LDA algorithm that clusters words around seed words by inducing bias in Gibbs sampling. Jagarlamudi et al. (2012) created an LDA algorithm that allows both seeded and unseeded topics to coexist. Eshima et al. (2020) further extended Jagarlamudi’s algorithm to incorporate document-level meta data into topic identification.

Lu’s algorithm only requires adding pseudo-count, $ω_{k v}$ , that induces bias in $ϕ$ . With the biased topic distribution, $\bar{ϕ}$ , a Gibbs sampler becomes more likely to assign words, which co-occur with the seed words, to the seeded topic (Figure 2). This can be accomplished by making small changes in Equation 3

{\bar{ϕ}}_{k v} = \frac{N_{k v} + β + ω_{k v}}{N_{. v} + β | V |}

(4)

P (Z = k | W = v, d) \propto θ_{d k} {\bar{ϕ}}_{k v}

(5)

Figure 2.

Graphical model of Seeded LDA. The topic distribution for words, $\bar{ϕ}$ , is biased by the pseudo-count, $ω$ , for seed words.

The authors suggested that the size of pseudo-count is about 1% of the total number of words in the corpus, $\sum f$ , for any seed word, but we found it is too large for infrequent seed words. Therefore, we adjusted the size by the frequency of seed words, $f_{v}$ , in the corpus and a hyperparameter, $0 < μ$ ⁵

\begin{array}{c} ω_{k v} = {\begin{array}{c} μ^{*} \frac{f_{v}}{\sum f}, v \in S_{k} \\ 0, v \notin S_{k} \end{array} \\ μ^{*} = 100 μ \end{array}

(6)

We also allow $K$ to be larger than the number of seeded topics to have both seeded and unseeded topics in the same model in a similar manner to Jagarlamudi’s algorithm. Seeded LDA is useful for researchers when it has both seeded and unseeded topics, because (1) they can perform semi-deductive topic classification without defining all the topics in the corpus; (2) they can improve the classification accuracy for seeded topics by adding unseeded topics as residual categories; (3) they can discover new topics and seed words in the topic terms for unseeded topics.

Sequential LDA: Unsupervised Topic Model for Sentences

The original LDA ignores the order of documents like many other topic models because the content of a document is assumed to be unrelated to that of other other documents (Blei et al., 2003). The inability of LDA to consider the positions of documents in the corpus makes it unsuitable for classification of sentences because topics in the current sentence usually depend on those in the earlier sentences. However, Sequential LDA explicitly models such relationship between the current and the preceding sentencesto identify topics more accurately.

Computer scientists have proposed several algorithms that can be employed to classify sentences. Gruber et al. (2007) applied the hidden Markov model to detect changes in topics between sentences; Du et al. (2012) adopted the Poisson–Dirichlet process to model the relationship between the current and previous topics; Amoualian et al. (2016) proposed several variants of LDA to classify topics of sequentially structured documents; Jiang et al. (2019) developed a variant of LDA that considers topics of both previous and following sentences.

We propose a Sequential LDA algorithm for sentence classification with minimal changes to preserve the well-understood behavior of the original algorithm. Its algorithm partially resembles those proposed by Du et al. (2012) and Amoualian et al., (2016), but it is much simpler to implement. Keeping all the variables in the original LDA intact, we add a hyperparameter, $γ$ , to Equation 3. It determines how much the Gibbs sampler should consider the topic of the previous sentence, $θ_{d - 1}$ , in choosing topics for words in the current sentence

P (Z = k | W = v, d) \propto θ_{d k} θ_{(d - 1) k}^{γ} ϕ_{k v} .

(7)where

γ = 0

d = 1

, otherwise

0 < γ \leq 1.0

. When

γ = 0

, it considers only the topics of the current sentence (identical to the original LDA); when

γ > 0

, it considers the topics of both the previous and the current sentences. Figure 3 shows that the topic distribution for documents of the current sentence,

θ_{d}

, is influenced by

θ_{d - 1}

when

γ > 0 a n d d > 1

, but the topic distribution for words,

ϕ

, affects the sentence in the same way as the original LDA.

Figure 3.

Graphical model of Sequential LDA. The topics of current sentence $θ_{d}$ are affected by the topics of the previous sentence $θ_{d - 1}$ . The types of distributions of variables are not shown in this figure for simplicity.

Sequential LDA can classify sentences more accurately than the original LDA because (1) it alleviates the data sparsity problem due to the short lengths of the documents; (2) it makes the transition of topics between sentences smoother; (3) it can identify topics of sentences that lack strong features based on the context.

Seeded Sequential LDA: Semi-supervised Topic Model for Sentences

It is easy to combine Sequential LDA and Seeded LDA because they affect different parts of the LDA algorithm. By merging Equations 5 and 7, we yield the sampling distribution for Seeded Sequential LDA

P (Z = k | W = v, d) \propto θ_{d k} θ_{(d - 1) k}^{γ} {\bar{ϕ}}_{k v}

(8)Since the equation is an extension of Equation 3, Seeded Sequential LDA behaves in the same way as Sequential LDA when

ω = 0

, as Seeded LDA when

γ = 0

or as the original LDA when

ω = 0 a n d γ = 0

Hyperparameter Optimization

We have added two hyperparameters, $γ$ and $μ$ to create Seeded Sequential LDA, but users usually do not need to optimize their values because the algorithm works sufficiently well when $γ = 0.5$ and $μ = 0.01$ However, we still need to find the optimal number of topics, $K$ for a new corpus. The number of seeded topics determines the value in Seeded LDA, but it can also have an unrestricted number of unseeded topics.

Regularized Topic Divergence

There are several methods for optimizing $K$ in the computer science literature (Arun et al., 2010; Cao et al., 2009; Deveaud et al., 2014; Griffiths & Steyvers, 2004), but they are not always useful in applied research because they tend to give numbers that are too large for users to interpret manually (Mimno et al., 2011). Further, these methods assume that there is an optimal $K$ for all, ignoring the fact that researchers may prefer different levels of granularity in topic analysis depending on research design. Therefore, we propose a regularized topic divergence measure that allows researchers to apply LDA with their desired topic granularity by extending Deveaud’s (2014). With our measure, users can find an optimal number of topics for their analysis by specifying the minimum size of topics as the proportion of words in the corpus.

According to Deveaud et al. (2014), the number of topics, $K$ is optimal when the average Jensen–Shannon divergence (JSD) between the pairs of topics maximizes.⁶ The average divergence scores can be computed by taking the sum of divergence scores weighed by the transposed cross-products, $e * e^{t}$ , of the inverse of $K$ Equation 9

\begin{array}{c} D_{i j} = J S D (ϕ^{K}) (e * e^{t}) \\ e = [\frac{1}{K}, \dots, \frac{1}{K}] w h e r e K = | e | \\ \hat{K} = \underset{K}{argmax} \sum_{i} \sum_{j} D_{i j} \end{array}

(9)For our regularized measure, the divergence scores are calculated in the same manner, but they are weighted by the sizes of the topics,

P (Z = k)

(Equation 10). The product of topic sizes,

e * e^{t}

minus the user-provided minimum sizes,

δ^{2}

, yield biased weights⁷

\begin{array}{c} D_{i j} = J S D (ϕ^{K}) (e * e^{t} - δ^{2}) \\ e = P (Z = k) \\ \hat{K} = \underset{K}{argmax} \sum_{i} \sum_{j} D_{i j} + δ^{2} \end{array}

(10)

The biased weights regularize $\hat{K}$ because they become negative and decrease the sum of divergence scores when $e * e^{t} < δ^{2}$ . This is useful in applied research because the sum of divergence scores stays the same or increases when an additional topic splits existing topics into distinctive sub-topics with sizes larger than $δ$ , but decreases when it makes existing topics smaller without making them more distinctive.

Evaluation of the Algorithms

We evaluate the Seeded Sequential LDA algorithm and the regularized topic divergence measure on a corpus of speeches at the United Nations General Assembly (Baturo et al., 2017).⁸ At the introductory General Debate meeting, delegates from the member states express their opinions on various topics in speeches (between 70 and180 sentences each in length. The speeches often start with diplomatic greetings to the audience and contain literary quotes or verbiage.

In an earlier study, Watanabe and Zhou (2020) sampled speeches delivered by delegates from 27 countries between 1991 and 2017 and had sentences manually labeled (n = 2507) into one of six general topics: “greeting” (5.7%), “UN” (18.3%), “security” (43.3%), “human rights” (6.8%), “democracy” (3.6%) or “development” (28.0%). They also compiled a list of seed words to identify topics of sentences using two semi-supervised machine learning algorithms, Seeded LDA and Newsmap.⁹ They reported that Newsmap outperformed Seeded LDA, but that the overall F1 scores were only 0.57 and 0.40, respectively; both algorithms struggled to classify sentences about “human rights and “democracy” (0.14 ≤ F1 ≤0.36) because of the abstract nature of the concepts.

In this section, first, we determine the optimal number of topics for the corpus using the regularized divergence measure; second, we apply Seeded LDA and Seeded Sequential LDA to compare their classification accuracy; third, we test the sensitivity of classification accuracy to the amount of weight given to seed words; finally, we assess the impact of the residual topics on classification accuracy.

In our evaluation of the algorithms, we pre-processed the textual data using the Quanteda package in R (Benoit et al., 2018): we tokenized the texts based on the rules defined by the Unicode; we removed punctuation marks, numbers and infrequent (fewer than 10 times in the corpus) or grammatical words; we compounded collocations that are statistically significantly (p < 0.001) associated with each other.

Optimal Number of Topics

Researchers in international relations have identified different numbers of topics in speeches in the General Assembly, either manually or automatically. The chosen number reflected their varying research interests and ranged from around five (Bailey & Voeten, 2018; Finke, 2021; Kim et al., 2020; Watanabe & Zhou, 2020) to as large as 20 (Brun-Mercer, 2018) and 23 (Gray & Baturo, 2021).

We fitted LDA models with $3 \leq K \leq 60$ on a sample of 1000 speeches and computed topic divergence scores with high ( $δ = 0.01)$ , medium ( $δ = 0.05$ ) and low ( $δ = 0.1$ ) topic granularity (Figure 4).¹⁰ The average divergence (AD) scores maximize when $K = 42 o r 37$ , but the regularized divergence (RD) scores for the low granularity peak when $K = 5 o r 6$ and for the high granularity when $K = 24 o r 22$ . The peaks of RD scores largely agree with the number of topics in the aforementioned studies. Nevertheless, the RD scores become the highest when $K = 8$ for the medium granularity, suggesting that there are one or two more topics that the algorithm can identify. Therefore, we choose $K = 8$ and present the most important topics words from an unseeded LDA model in Table 1.

Figure 4.

Changes in topic divergence scores for unseeded and seeded LDA models. The black line is the average divergence (AD), while others are the regularized divergence (RD) with different levels of granularity ( $δ$ ). Solid symbols indicate the peaks of the divergence scores.

Table 1.

Top 20 Topic Words Identified By Non-Sequential Unseeded LDA With $K = 8$ .

Topic 1	Topic 2	Topic 3	Topic 4	Topic 5	Topic 6	Topic 7	Topic 8
human	countries	assembly	development	nations	world	people	peace
rights	development	general	international	united	us	world	region
international	economic	session	conference	international	new	war	international
peace	developing	like	agenda	states	must	terrorism	community
security	resources	president	year	organization	can	years	african
law	social	also	nuclear	role	challenges	many	republic
respect	trade	mr	new	security council	international	one	process
must	global	secretary-general	weapons	reform	future	conflicts	east
democracy	poverty	work	goals	member	global	violence	africa
political	financial	wish	sustainable	must	problems	country	efforts
development	developed	country	summit	support	time	children	union
principles	assistance	government	action	efforts	community	conflict	conflict
social	growth	people	implementation	members	one	end	country
economic	per	republic	held	security	today	lives	european
right	small	opportunity	states	cooperation	now	still	government
peoples	economy	election	millennium	charter	make	threat	security
people	also	express	convention	system	need	even	south
justice	world	first	meeting	organizations	however	poverty	middle
world	cent	year	treaty	peace	order	fight	solution
national	climate	take	agreement	effective	century	suffering	countries

Seed Word Selection

In deductive topic analysis, researchers should select seed words from words that frequently occur in the contexts of the target concepts to inform the algorithm effectively (Watanabe & Zhou, 2020). They can gather seed words not only from external sources (e.g., thesauri, glossaries, and indices of books) but also from topic words of an unseeded LDA fitted on the same corpus with relatively small

K

(Table 1). To simulate the latter method, we extracted the top 100 most important words from unseeded LDA topics and matched them against the existing seed word patterns for the six topics (Table 2).¹¹

Table 2.

Seed words Selected From the Top 100 Topic Words of A Non-Sequential Unseeded LDA Model.

Topic	Seed words
Greeting	thank, thanks, congratulate, congratulations, sir, express, great, greater, mr, wish, wishes, hope, contribution, contribute, anniversary, welcome, welcomes
UN	united nations, security council, general assembly, organization, organizations, reform, reforms, secretary-general, resolution, resolutions, charter, session, conference
Security	security, security council, victims, peace, peaceful, peacekeeping, peace-keeping, terrorism, terrorist, weapons, nuclear, conflicts, conflict, war, wars, disarmament, threat, threats, crisis, solutions, solution, settlement, forces, force, destruction, military, violence, armed, arms, fight
Human rights	human rights, dignity, protection, protect, citizens, education, humanitarian, children, women, refugees, community, people, respect, responsibility, food, health
Democracy	democracy, democratic, representative, election, elections, leadership, president, institutions, government, law, republic, free, freedom, legal
Development	development, developing, developed, developments, market, markets, investment, economic, economy, economies, climate change, assistance, sustainable, poverty, trade, growth, social, environment, environmental, prosperity, progress, financial, cooperation

Classification Accuracy

We evaluate the ability of Seeded Sequential LDA to identify topics of sentences accurately by fitting it on a sample of speeches with different hyperparameters. We set the number of residual topics, $r$ , to be one of ${0, 1, 2, 5, 10, 15}$ , the size of seed word weight, $μ$ , to be one of ${0.01, 0.02, 0.03, 0.04, 0.05}$ , and the influence of preceding sentences $σ = 0 o r 0.5$ . For each of the 60 unique combinations of the values, we fit the model on 10 random samples of 1000 speeches, resulting in 600 sets of F1 scores. After fitting the model, we classify sentences into topics with the highest probability only among the six seeded topics, virtually setting the probability for residual topics to zero.

We also compare the classification accuracy between ex-ante topic mapping with Seeded LDA and ex-post topic mapping with unseeded LDA. For ex-post mapping, we fit unseeded LDA with $σ = 0 o r 0.5$ and $10 \leq K \leq 100$ ; then we assign the topics to one of the six topics based on the total count of their topic words that match the seed words.¹²

Sequential Classification

We found that the overall F1 score of non-sequential models with two residual topics is only 0.47 (Figure 5). The poor performance of the non-sequential model is mainly due to very low F1 scores in ' “democracy” (0.12) and “human rights” (0.29). The algorithm struggles with these “difficult topics” because they usually relate to other topics but occur only infrequently.¹³ However, the sequential algorithm improves the overall score from 0.47 to 0.62 with the higher classification accuracy in all the topics by between 0.09 and 0.18. The resulting median F1 scores are 0.60 in “greeting,” 0.62 in “UN,” 0.65 in “security,” 0.38 in “human rights,” 0.31 in “democracy,” and 0.70 in “development.” The scores vary around the median mainly due to the different values of hyperparameters as discussed below.

Figure 5.

Classification accuracy (F1) of non-sequential ( $γ = 0$ ) and sequential ( $γ = 0.5$ ) models with two residual topics ( $K = 8; μ = 0.3$ ). Boxes are the quantile range, and the whiskers are the entire distribution. Precision and recall are micro-averaged in computing the overall F1 scores.

The sequential model can classify sentences more accurately than the non-sequential model because it recognizes transitions of topics in speeches. Figure 6 shows that the probability of topics estimated by the sequential model ( $σ = 0.5$ ) is far more stable than the non-sequential model ( $σ = 0$ ): the speech starts with 20 sentences about “greeting” followed by 10 sentences about “security” in the former, whereas the topics with the highest probability change in almost every sentence in the latter. The better fit of the sequential model to the data leads to higher confidence in classification as well: the probability of topics often reaches 80% in the sequential model, while it is around 60% in the non-sequential model.

Figure 6.

Probability of topics of sentences in a speech non-sequential ( $γ = 0$ ) and sequential ( $γ = 0.5$ ) models. LDA algorithms classify them into topics with the highest probability.

Residual Topics and Seed Word Weights

We also discovered that the non-sequential and sequential algorithms work differently under the same hyperparameters (Figure 7): while the non-sequential model works similarly with different sizes of the seed weight, the sequential model performs slightly better when the weights are 0.02 or larger. Further, whereas the scores of the non-sequential model slowly increase as more residual topics are added, the scores of the sequential model decrease visibly as two or more residual topics are added.

Figure 7.

Classification accuracy (F1) of the non-sequential ( $γ = 0$ ) and sequential ( $γ = 0.5$ ) models with varying weights and residuals. Boxes are the quantile range, and the whiskers are the entire distribution.

Table 3 shows that the top 10 topic words identified by the sequential model match the seed words, but there is a generic term “world” in “human rights.” However, it disappears from the topic when residual topics are added to the model because the added topics serve as ‘junk' topics that absorb noises (Table 4). Further, it is also possible to define topics only partially and leave other topics unseeded. Table 5 shows that, if seed words were provided only for the first two topics, residual topics capture words related to development (“other 1”) and security (“other 2” and “other 4”).

Table 3.

Top 10 Topic Words Identified by Seeded Sequential LDA.

Greeting	UN	Security	Human rights	Democracy	Development
mr	united nations	peace	people	government	development
great	security council	security	community	republic	economic
hope	organization	security council	human rights	democratic	cooperation
wish	general assembly	international	respect	president	countries
session	session	conflict	world	democracy	developing
united nations	reform	weapons	Responsibility	law	social
express	secretary-general	terrorism	women	institutions	sustainable
welcome	conference	nuclear	children	free	poverty
world	charter	war	humanitarian	international	international
general assembly	international	conflicts	international	freedom	trade

Table 4.

Top 10 Topic Words Identified by Seeded Sequential LDA With Two Residual Topics (“Other1” and “Other2”).

Greeting	UN	Security	Human rights	Democracy	Development	Other1	Other2
mr	united nations	peace	people	government	development	world	people
great	security council	security	community	republic	economic	us	world
hope	organization	security council	human rights	democratic	cooperation	must	us
wish	general assembly	conflict	respect	president	countries	new	one
session	session	international	women	democracy	developing	international	country
express	reform	weapons	responsibility	law	social	can	can
welcome	secretary-general	terrorism	international	institutions	sustainable	global	many
united nations	charter	nuclear	humanitarian	free	poverty	one	must
general assembly	conference	war	children	peace	trade	nations	years
also	resolution	conflicts	protection	freedom	international	political	today

Table 5.

Top 10 Topic Words Identified by Seeded Sequential LDA With Two Seeded Topics (“Greeting” and “UN”) and Four Residual Topics.

Greeting	UN	Other 1	Other2	Other3	Other4
mr	united nations	Development	peace	people	international
great	organization	countries	international	world	weapons
hope	security council	international	people	international	nuclear
united nations	general assembly	economic	community	must	states
international	session	world	united nations	us	world
world	reform	developing	government	peace	treaty
wish	conference	global	process	can	countries
session	international	must	region	one	peace
peace	secretary-general	united nations	country	country	terrorism
express	charter	new	efforts	countries	united nations

Ex-post Topic Mapping

We found that the overall F1 scores are highest when $K \leq 25$ in both the non-sequential ( $γ = 0$ ) and sequential ( $γ = 0.5$ ) algorithms (Figure 8). When $K = 15$ , their median scores are 0.46 and 0.58, respectively, which are close to the seeded algorithm’s (Figure 5). However, when $K > 25$ , the scores of both non-sequential and sequential algorithms fall rapidly.

Figure 8.

Classification accuracy (F1) of the unseeded non-sequential ( $γ = 0$ ) and unseeded sequential ( $γ = 0.5$ ) models based on ex-post topic mapping.

While we achieved the ex-post topic mapping based on the total count of the topic words regardless of their ranks, researchers usually perform topic mapping based only on highest-rank topic terms. To simulate this, we counted the number of topics that have at least one seed word among the top 10 or 50 topic terms from the non-sequential and the sequential algorithms (Figure 9). The number of such topics are consistently higher in the latter than in the former by about 2% if words are the top 50 topic terms; the difference between them expands to 4% if words are limited to the top 10 topic terms.

Figure 9.

The number of topics in the unseeded non-sequential ( $γ = 0$ ) and unseeded sequential ( $γ = 0.5$ ) models that have at least one topic term that matches the seed words. Top 10 and 50 topic terms are matched against the seed words.

Example: Topic-Specific Sentiment Analysis

As an illustration of how the new algorithm can be successfully applied in empirical research, we present the results of topic-specific sentiment analysis. We classified all the sentences in the corpus into the six topics using a Seeded Sequential LDA model ( $K = 8; γ = 0.5; μ = 0.3$ ) and performed sentiment analysis on them, employing the Lexicoder Sentiment Dictionary, which is widely used in social sciences (Young & Soroka, 2012).¹⁴ Following the Security Council’s geopolitical classification, we aggregate the results by speakers from five regional groups: Africa, Asia (Asian and Pacific), East Europe (former Soviet Union and Central and Eastern Europe), Latin America (Latin American and Caribbean), and West (West European, North America and their allies) (Peterson, 2005).

Figure 10 plots the average sentiment of sentences classified as about “security” and “development” by Seeded Sequential LDA, where lower scores indicate a more negative sentiment about the topics. The upper subplot for “security” reveals that the Kosovo crisis in 1998 as well as the 9/11 attacks triggered strong negative reactions of speakers from all the regions. Similarity, positive sentiment in 2011 at the onset of the Arab Spring turned very negative in the following years because of the Syrian civil war, which led to the rise of Islamic extremists. We can also observe some regional patterns, such as a more negative sentiment among Eastern European speakers during the 2008 Russo-Georgian and the 2014 Russo-Ukrainian wars. Likewise, while representatives of Latin America in general tend to express a more negative sentiment than others, there is a clear spike in a more positive sentiment in 2015, when the Colombia peace agreement was reached.

In the lower subplot for “development,” speakers from all the regions expressed a more negative sentiment about development during the 1998 Asian financial crisis and the 2008–2009 global financial crisis. Speakers from East Europe were particularly negative during their transition to the market economy in the early 1990s, while those from Western and East European were negative during the European sovereign debt crisis in 2011. Speakers from Africa and Latin America are consistently most negative about development, but the former is less negative than the latter because arguably they must appeal to foreign donors through more positive language.¹⁵

In short, these patterns reflect the primary function of the United Nations: maintaining peace and security. With the topic-specific sentiment analysis of the speeches, the scholars of international relations can account for the role of emotions in the discursive maintenance of power in global politics.

Figure 10.

Sentiment about “security” and “development” by geopolitical groups of countries. The vertical axis shows the average sentiment scores for the topics. Higher scores indicate a more positive sentiment.

Discussion

Our evaluation of the algorithms and the illustrative example has clearly demonstrated that users can match algorithmic topics to theoretical concepts with only a small number of seed words. The classification accuracy of Seeded LDA is poor because of the short lengths of the documents, but Seeded Sequential LDA has improved the accuracy significantly. Its accuracy was lower for complex concepts such as “democracy” and “human rights,” but it performed reasonably well in other topics. The sequential algorithm achieved a higher accuracy with more stable transition of topics between sentences in the same speech.

Seeded LDA enables researchers to perform not only deductive topic analysis through user-provided seed words but also semi-deductive topics analysis with the assistance of unseeded residual topics. Unseeded LDA allows users to conduct inductive topic analysis by discovering topics and their seed words from the corpus. Nevertheless, the number of topics in unseeded LDA must be small enough (thus, the sizes of topics are large enough) to identify relevant frequent seed words from the corpus; scholars must also choose seed words based on external resources to avoid over-fitting to the current corpus.

The optimal number of topics for LDA has been determined based solely on their divergence in earlier studies, but the regularized divergence measure can incorporate users’ preferred granularity of the analysis defined as the expected minimum sizes of topics. The granularity can be any value between the $0 < δ < 1.0$ range, but we recommend users to set $δ = 0.01$ for high granularity, $δ = 0.05$ for medium granularity, and $δ = 0. 1$ for low granularity because values smaller or larger than the range would lead to the number of topics being too small or too large. Further, the topic divergence can be computed for any of the LDA algorithms, but sequential models are unsuitable for the optimization purpose.

The larger seed word weights have improved the classification accuracy of Seeded Sequential LDA, but they had little or no effect on Seeded LDA. Larger weights improve the classification accuracy only in the sequential algorithm because large weights are necessary to increase the probability of both current and following sentences (small weights are sufficient to affect current sentences). This also demonstrates the robustness of the seeded algorithms, suggesting that users can give large weights ( $μ > 0.01$ ) to seed words without the risk of deteriorating the performance of the algorithm.

The comparison of the LDA models has revealed significant differences in the behavior of non-sequential and sequential algorithms with respect to the number of topics. The differences originate primarily from the amended definition of what constitutes a topic: these are not only words that frequently occur in the same sentences, but also words that occur in neighboring sentences in the sequential model. This makes the algorithm susceptible to unseeded topics that interrupt a sequence of seeded topics, especially when small unseeded topics accidentally receive high probabilities.

The ex-post mapping (unseeded LDA) of topics has appeared comparable to the ex-ante mapping (Seeded LDA) with the non-sequential and sequential algorithms. This is comforting because ex-post mapping is a common practice in applied research. Still, the lower F1 scores with large $K$ warn us that regrouping of very small topics would lead to less accurate results. If $K = 40$ , around which the average divergence score maximizes (Figure 4), the F1 score falls to 0.47 even in the sequential algorithm. In fact, when $K$ is as large as this, on average, 12% of the topics lack terms that are strongly associated with any of the pre-defined topics (Figure 9). Without clear topic terms, researchers would overlook or misclassify them in ex-post mapping.

The sharp fall in the F1 and the number of topics with topic terms can be explained by the sparsity of words in small topics that makes Gibbs sampling less accurate. The average frequency of topic terms decreases rapidly as the $K$ becomes larger. In the non-sequential algorithm, when $K = 10$ , the 10th and 100th topic terms appear 1081 and 210 times in the entire corpus, respectively, but they become only 429 and 45 times when $K = 40$ (Appendix 3). In the sequential algorithm, the fall is less steep, but it still suffers from the sparsity of words. These results suggest that it is advisable to keep the number of topics as small as possible.

To summarize, we propose the following four-step procedure for topic classification of sentences based on the strength and weaknesses of the LDA algorithms:

(1) Optimization: Estimate the optimal number of topics, $\hat{K}$ , based on the regularized divergence scores for a non-sequential unseeded model with $δ = 0.01, 0.05 o r 0.1,$ depending on the preferred granularity of the analysis.

(2) Seed Word Selection: Select seed words for the target concepts from the most important topic words identified by the unseeded non-sequential model. The number of seeded topics is $\hat{K}$ minus one or two to create residual topics.

(3) Estimation: Fit a Seeded Sequential LDA model with the seed words with the chosen number of residual topics to absorb junk words. The seed word weight should be $μ = 0.02$ or more to influence many sentences.

(4) Prediction: Compute the probability of sentences using the fitted Seeded Sequential LDA model and classify them into topics with the highest values, ignoring those for residual topics.

Finally, classification of sentences into topics allows users to interpret the results in the same way as the traditional content analysis. They can also combine topic classification with additional analyses of the sentences (e.g., measuring positive-negative sentiment, lexical complexity or detecting named entities). Combined results will show not only which topics, but also how such topics, are discussed, enabling the users to study how speakers framed issues in a particular light.

Conclusions

With Seeded Sequential LDA, we have proposed a new approach to topic-specific content analysis of sentences. The sequential classification algorithm is relatively simple, but it significantly improves the accuracy of sentence classification across all topics. With the assistance of this algorithm, researchers can perform highly accurate topic-specific content analysis whether in a deductive, semi-deductive, or inductive approach: in a deductive analysis, scholars must provide seed words a priori; in a semi-deductive analysis they should do so only for some of the topics; and there is no need for seed words in an inductive analysis.

The ability of our algorithm to classify individual sentences allows researchers to correlate topics with other traits as demonstrated in our example. If topics and sentiment are combined in each sentence, researchers can show not only which topics, but also how such topics, are discussed, enabling the users to study how speakers framed issues in a particular light.¹⁶ Although we chose dictionary-based sentiment analysis for simplicity, more complex traits of sentences could be captured and analyzed along with topics if other types of semi-supervised algorithms are employed (e.g., Trubowitz & Watanabe, 2021).

In our evaluation, ex-post topic mapping with unseeded LDA appeared as good as ex-ante mapping with Seeded LDA as long as $K$ is small enough to avoid the sparsity problem, but researchers often set the much larger number partially because the commonly used methods for optimization do not consider topic sizes. Our regularized divergence measure does exactly this to avoid the problem, but ex-ante topic mapping is still a better way to minimize the number.

There are a lot of potential applications of Seeded Sequential LDA beyond topic-specific analysis of sentences. Seeded LDA could also be used to identify topics whose words change over time by applying it to subsets of a historical corpus with the same seed words.¹⁷ Such dynamic topic analysis is hard to implement using unseeded algorithms unless the relationship between topics from different time windows is explicitly modeled (Blei & Lafferty, 2006), but it is easy using the seeded algorithm because topics are pre-defined.¹⁸

Sequential LDA can be applied not only to the sentences of a long speech by the same person but also to the short comments in a conversation by a group of people, because speakers usually discuss the same topic, responding to earlier speakers’ comments. Even if they refer to the subject of the discussion only with pronouns in informal fora (e.g., social media), the sequential algorithm can identify the topic of the current comment based on the previous comment. This, in turn, allows researchers to detect points at which the topics of the conversation changed and split it into sub-conversations for further analysis.

We hope this article, along with the R package that implements the Seeded Sequential LDA algorithm, will make deductive topic classification more accessible in applied research. The algorithm already allows researchers to perform types of analysis that are otherwise hard to achieve, but there are unanswered questions for future research. First, we believe that the algorithm can classify the multifaceted difficult topics (“democracy” and “human rights”) more accurately with better seed words, but we do not have a way to easily distinguish between “good” and “bad” seed words. To make seed word selection easier, we must devise a method to detect seed words that are causing false matches in the corpus. Second, we found the sparsity of topic words is an important factor affecting the performance of the algorithms, but we have assessed its impact only in the classification of very short documents (i.e., sentences). To identify the best practice in topic modeling, especially the reliability of ex-post topic mapping, we must conduct similar evaluations of the LDA algorithms with longer documents. Finally, we discovered the performance of Sequential LDA deteriorates when many residual topics are added. We believe it is also related to the sparsity of topic words, but we do not know how to address the issue yet.

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iDs

Kohei Watanabe

Alexander Baturo

Notes

Appendix

Author Biographies

Kohei Watanabe is a Senior Data Scientist at Lazard Asset Management. He is also a visiting scholar at US Centre of the London School of Economics and Political Science and at Waseda Institute for Advanced Study. He develops R packages for quantitative text analysis and studies international flows of political and economic information.

Alexander Baturo is an Associate Professor of Government at the Dublin City University, Ireland. He studies comparative democratization, the politics of authoritarian regimes, and international organizations. He has constructed and published text corpora on international relations, such as the United Nations General Assembly speeches.

References

Ahall

Thomas

(2015). Emotions, politics and war. Routledge. https://www.routledge.com/Emotions-Politics-and-War/Ahall-Gregory/p/book/9780815377139

Amoualian

Clausel

Gaussier

Amini

M.-R.

(2016). Streaming-LDA: A copula-based approach to modeling topic dependencies in document streams. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 695–704, NY, USA, Association for Computing Machinery. https://doi.org/10.1145/2939672.2939781

Arun

Suresh

Veni Madhavan

C. E.

Narasimha Murthy

M. N.

(2010). On finding the natural number of topics with latent dirichlet allocation: Some observations. In Zaki

M. J.

J. X.

Ravindran

Pudi

(Eds.), Advances in knowledge discovery and data mining (pp. 391–402). Springer. https://doi.org/10.1007/978-3-642-13657-3_43

Baerg

Lowe

(2018). A textual Taylor rule: Estimating central bank preferences combining topic and scaling methods. Political Science Research and Methods, 8(1), 106–122. https://doi.org/10.1017/psrm.2018.31

Bailey

M. A.

Voeten

(2018). A two-dimensional analysis of seventy years of United Nations voting. Public Choice, 176(1-2), 33–55. https://doi.org/10.1007/s11127-018-0550-4

Baturo

Dasandi

Mikhaylov

S. J.

(2017). Understanding state preferences with text as data: Introducing the UN General Debate corpus. Research & Politics, 4(2), 205316801771282. https://doi.org/10.1177/2053168017712821

Benoit

Watanabe

Wang

Nulty

Obeng

Müller

Matsuo

(2018). quanteda: An R package for the quantitative analysis of textual data. Journal of Open Source Software, 3(30), 774. https://doi.org/10.21105/joss.00774

Blei

D. M.

Lafferty

J. D.

(2006). Dynamic topic models. Proceedings of the 23rd international conference on Machine learning - ICML '06, 113–120, NY, USA, Association for Computing Machinery. https://doi.org/10.1145/1143844.1143859

Blei

D. M.

A. Y.

Jordan

M. I.

(2003). Latent dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993–1022.

10.

Boussalis

Coan

T. G.

(2016). Text-mining the signals of climate change doubt. Global Environmental Change, 36, 89–100. https://doi.org/10.1016/j.gloenvcha.2015.12.001.

11.

Brun-Mercer

(2018). Nations united through discourse: A corpus analysis of UN general assembly addresses. [Ph.D., Northern Arizona University]. https://www.proquest.com/docview/2109753182/abstract/C133FDDFA9944330PQ/1

12.

Burscher

Vliegenthart

Vreese

C. H. de.

(2016). Frames beyond words: Applying cluster and sentiment analysis to news coverage of the nuclear power issue. Social Science Computer Review, 34(5), 530–545. https://doi.org/10.1177/0894439315596385

13.

Cao

Xia

Zhang

Tang

(2009). A density-based method for adaptive LDA model selection. Neurocomputing, 72(7–9), 1775–1781. https://doi.org/10.1016/j.neucom.2008.06.011

14.

Curini

Vignoli

(2021). Committed moderates and uncommitted extremists: Ideological leaning and parties’ narratives on military interventions in Italy. Foreign Policy Analysis, 17(3), 1–20. https://doi.org/10.1093/fpa/orab016

15.

Deveaud

Sanjuan

Bellot

(2014). Accurate and effective latent concept modeling for ad hoc information retrieval. Document numérique, 17(1), 61–84. https://doi.org/10.3166/DN.17.1.61-84

16.

Buntine

Jin

Chen

(2012). Sequential latent Dirichlet allocation. Knowledge and Information Systems, 31(3), 475–503. https://doi.org/10.1007/s10115-011-0425-1

17.

Eshima

Imai

Sasaki

(2020). Keyword assisted topic models. ArXiv:2004.05964 [Cs, Stat] http://arxiv.org/abs/2004.05964

18.

Finke

(2021). Regime type and co-sponsorship in the UN general assembly. International Interactions, 47(3), 559–578. https://doi.org/10.1080/03050629.2020.1865947

19.

Gray

Baturo

(2021). Delegating diplomacy: Rhetoric across agents in the united Nations general assembly. International Review of Administrative Sciences, 87(4), 718–736. https://doi.org/10.1177/0020852321997560

20.

Griffiths

T. L.

Steyvers

(2004). Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101(suppl 1), 5228–5235. https://doi.org/10.1073/pnas.0307752101

21.

Grimmer

Stewart

B. M.

(2013). Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis, 21(3), 267–297. https://doi.org/10.1093/pan/mps028

22.

Gruber

Weiss

Rosen-Zvi

(2007). Hidden topic Markov models. In Meila

Shen

(Eds.), Proceedings of the eleventh international conference on artificial intelligence and statistics (2, pp. 163–170). PMLR. https://proceedings.mlr.press/v2/gruber07a.html

23.

Heinrich

(2008). Parameter estimation for text analysis. http://www.arbylon.net/publications/text-est.pdf

24.

Hutchison

Bleiker

(2014). Theorizing emotions in world politics. International Theory, 6(3), 491–514. https://doi.org/10.1017/S1752971914000232

25.

Jagarlamudi

Daumé

III Udupa

(2012). Incorporating lexical priors into topic models. Proceedings of the 13th conference of the European chapter of the association for computational linguistics, 204–213, Avignon, France, Association for Computational Linguistics. http://dl.acm.org/citation.cfm?id=2380816.2380844

26.

Jiang

Zhou

Zhang

Wang

Zhang

(2019). Sentence level topic models for associated topics extraction. World Wide Web, 22(6), 2545–2560. https://doi.org/10.1007/s11280-018-0639-1

27.

Kim

LaFleur

others (2020). What does the United Nations “say” about global agenda? An exploration of trends using natural language processing for machine learning. https://www.un.org/en/desa/what-does-united-nations-say-about-global-agenda-exploration-trends-using-natural-language

28.

Ott

Cardie

Tsou

B. K.

(2011). Multi-aspect sentiment analysis with topic models. 2011 IEEE 11th International Conference on Data Mining Workshops, 81–88.

29.

Mai

(2021). Economic Narratives and market outcomes: A semi-supervised topic modeling approach. (SSRN scholarly paper ID 3883747). Social Science Research Network. https://doi.org/10.2139/ssrn.3883747

30.

Mimno

Wallach

H. M.

Talley

Leenders

McCallum

(2011). Optimizing semantic coherence in topic models. Proceedings of the Conference on Empirical Methods in Natural Language Processing, 262–272, Abu Dhabi, UAE, Association for Computational Linguistics.

31.

Mueller

Rauh

(2018). Reading between the lines: Prediction of political violence using newspaper text. American Political Science Review, 112(2), pp. 358–375. http://dx.doi.org.ez.wul.waseda.ac.jp/10.1017/S0003055417000570

32.

Nielsen

(2019). On the jensen–shannon symmetrization of distances relying on abstract means. Entropy, 21(5), 485. Article 5. https://doi.org/10.3390/e21050485

33.

Peterson

M. J.

(2005). The UN general assembly. Routledge. https://doi.org/10.4324/9780203087831

34.

Phan

X.-H.

Nguyen

C.-T.

(2007). GibbsLDA++: A C/C++ implementation of latent dirichlet allocation. http://gibbslda.sourceforge.net/

35.

Savin

Ott

Konop

(2022). Tracing the evolution of service robotics: Insights from a topic modeling approach. Technological Forecasting and Social Change, 174, 121280. https://doi.org/10.1016/j.techfore.2021.121280

36.

Shiller

R. J.

(2019). Narrative economics: How stories go viral and drive major economic events. Princeton University Press. https://press.princeton.edu/books/hardcover/9780691182292/narrative-economics

37.

Trubowitz

Watanabe

(2021). The geopolitical threat index: A text-based computational approach to identifying foreign threats. International Studies Quarterly, 65(3), 852–865. sqab029 https://doi.org/10.1093/isq/sqab029

38.

Watanabe

(2018). Newsmap: A semi-supervised approach to geographical news classification. Digital Journalism, 6(3), 294–309. https://doi.org/10.1080/21670811.2017.1293487

39.

Watanabe

Zhou

(2020). Theory-driven analysis of large corpora: Semisupervised topic classification of the UN speeches (40, pp. 1–21). Social science computer review. https://doi.org/10.1177/0894439320907027

40.

Yan

Guo

Lan

Cheng

(2013). A biterm topic model for short texts. Proceedings of the 22nd international conference on world wide web. 1445–1456, Geneva, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva. https://doi.org/10.1145/2488388.2488514

41.

Young

Soroka

(2012). Affective news: The automated coding of sentiment in political texts. Political Communication, 29(2), 205–231. https://doi.org/10.1080/10584609.2012.671234

Seeded Sequential LDA: A Semi-Supervised Algorithm for Topic-Specific Analysis of Sentences

Abstract

Keywords

Introduction

LDA Algorithms

LDA: Unsupervised Topic Model

Seeded LDA: Semi-Supervised Topic Model

Sequential LDA: Unsupervised Topic Model for Sentences

Seeded Sequential LDA: Semi-supervised Topic Model for Sentences

Hyperparameter Optimization

Regularized Topic Divergence

Evaluation of the Algorithms

Optimal Number of Topics

Seed Word Selection

Classification Accuracy

Sequential Classification

Residual Topics and Seed Word Weights

Ex-post Topic Mapping

Example: Topic-Specific Sentiment Analysis

Discussion

Conclusions

Footnotes

Declaration of Conflicting Interests

Funding

ORCID iDs

Notes

Appendix

Author Biographies

References