Topic modelling is a means to an end: On topic modelling in corpus linguistics and discourse analysis

Abstract

Topic modelling (TM) is becoming an increasingly popular method in the corpus linguistics toolbox, especially when researchers are grappling with a large corpus and want to derive insights for a discourse analysis of the data. Following on from Bednarek’s discussion in this issue, I would like to draw attention to three specific aspects of TM that should be considered when applying it.

The first issue concerns the so-called ‘black box’ nature of the method. Researchers may apply TM without fully grasping the underlying principles, especially when it comes to parameters. Fundamentally, the criticism is that TM is technically difficult. My view is that it is no more technically challenging than, for example, keyword analysis, which can be computed using different statistics (Gabrielatos, 2018) and which corpus linguists apply, presumably, with full awareness of the possible options. It is not unreasonable to ask a researcher to study the principles behind TM or, as Bednarek suggests, to work collaboratively with somebody who has.

Each step in TM is relevant to the results, including what kind of normalization is applied to the data, whether lemmatization or stemming is chosen, and whether and which stop words are removed. As an example, in Rao and Taboada (2021), we removed a standard set of stop words. We also performed relative pruning to remove both common words (because they occur across all documents) and rare words (because they are unlikely to be representative of common topics across the data). Additionally, given that we were working with news stories, we removed words related to news (‘say’, ‘report’, ‘story’, ‘press’, ‘news’), social media and URLs (‘post’, ‘tag’, ‘inbox’, ‘https’, ‘href’). Since those words were so frequent across all articles, they were not meaningful, and removing them simplified the calculation of topics. Such decisions are grounded in a deep understanding of the different steps needed to perform TM and in a preliminary qualitative analysis of the corpus; that is, they require technical expertise both in TM and in discourse analysis.
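The pruning steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in Rao and Taboada (2021): the function names, the toy corpus, the thresholds (`min_df`, `max_df`) and the short standard stop word list are all hypothetical. Only the news, social media and URL words are taken from the text.

```python
from collections import Counter

# Hypothetical, deliberately tiny standard stop word list.
STANDARD_STOP_WORDS = {"the", "a", "of", "in", "to", "and", "is", "on"}
# Domain-specific words mentioned in the text (news, social media, URLs).
DOMAIN_STOP_WORDS = {"say", "report", "story", "press", "news",
                     "post", "tag", "inbox", "https", "href"}


def prune_vocabulary(docs, min_df=0.2, max_df=0.8):
    """Keep words that are not stop words and whose document frequency
    (share of documents containing the word) lies in [min_df, max_df]."""
    n_docs = len(docs)
    # Count each word once per document, however often it occurs there.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    stop_words = STANDARD_STOP_WORDS | DOMAIN_STOP_WORDS
    return {
        word for word, count in doc_freq.items()
        if word not in stop_words and min_df <= count / n_docs <= max_df
    }


def filter_docs(docs, vocabulary):
    """Drop every token that is not in the pruned vocabulary."""
    return [[word for word in doc if word in vocabulary] for doc in docs]


# Toy corpus of already tokenized, lower-cased documents (invented).
docs = [
    "the airline report say market growth".split(),
    "the election report say vote market".split(),
    "the hospital report say vaccine".split(),
    "the airline news say passenger growth".split(),
    "the election news say vote turnout".split(),
]
vocab = prune_vocabulary(docs, min_df=0.4, max_df=0.8)
pruned = filter_docs(docs, vocab)
print(sorted(vocab))
print(pruned)
```

With these thresholds, words that appear in only one of the five documents are dropped as too rare, and the stop word lists remove the words that appear everywhere. The third document ends up empty, which shows how strongly such choices can shape the input to TM.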

The second important aspect of TM is the question of what a ‘topic’ is and how it can be interpreted. It is crucial to understand that TM captures a probabilistic distribution of topics not only across the corpus, but also within each document. Thus, each document may feature different topics in different proportions. In this sense, a document is not about a topic, but instead contains words that may be representative of several topics. As we illustrate in an application of TM (Rao and Taboada, 2021), an article about an airline purchasing new airplanes may contain words related to finance, geopolitics, travel, passenger trends or markets. The topic of the document is not finance or airlines; the document contains words related to those larger topics in the entire collection. Brookes and McEnery (2019) conduct an experiment whereby they inspect documents with high probability for several topics and read them to ascertain whether they are indeed representative of such topics. While such qualitative analysis is useful, we need to remember that an article is never about a topic in the TM sense; an article contains words that have been found to be representative of one or more topics across the corpus.
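The idea that a document is a mixture of topics, rather than being ‘about’ one topic, can be illustrated with a small Python sketch. It assumes that the topic-word probabilities are already known and estimates only the topic proportions of a single document, using a simple expectation-maximization loop. This is a simplified stand-in for what a full topic model does, not the procedure of any particular TM implementation; the topic names, probabilities and function name are invented for illustration.

```python
# Hypothetical topic-word probabilities (each topic sums to 1). A real model
# would estimate these from the entire corpus.
TOPIC_WORD = {
    "finance": {"profit": 0.4, "market": 0.4, "airline": 0.1, "passenger": 0.1},
    "travel": {"profit": 0.1, "market": 0.1, "airline": 0.4, "passenger": 0.4},
}


def infer_topic_proportions(tokens, topic_word, n_iter=50):
    """Estimate one document's topic proportions by EM, holding the
    topic-word probabilities fixed."""
    topics = list(topic_word)
    theta = {t: 1.0 / len(topics) for t in topics}  # start from a uniform mix
    for _ in range(n_iter):
        totals = {t: 0.0 for t in topics}
        for word in tokens:
            # E-step: share of this token attributed to each topic.
            weights = {t: theta[t] * topic_word[t].get(word, 0.0) for t in topics}
            norm = sum(weights.values())
            if norm == 0.0:
                continue  # word unknown to every topic; skip it
            for t in topics:
                totals[t] += weights[t] / norm
        total = sum(totals.values())
        if total == 0.0:
            return theta  # no known words; keep the current estimate
        # M-step: the new proportions are the normalized shares.
        theta = {t: totals[t] / total for t in topics}
    return theta


# Invented article about an airline: mostly travel words, some finance words.
doc = ["airline", "airline", "passenger", "market", "profit", "airline"]
theta = infer_topic_proportions(doc, TOPIC_WORD)
for topic, share in theta.items():
    print(f"{topic}: {share:.2f}")
```

In this toy case the document receives a larger share of the travel topic, but a non-trivial share of the finance topic remains: the document contains words representative of both.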

Related to the question of what constitutes a topic is the labelling of topics. TM produces a researcher-determined number of topics and a list of words that are representative of each topic. Using those words, typically the top 20 or so, researchers may apply a label to the topic. This is a subjective and often iterative procedure, which requires an understanding of the genre of the corpus and the types of documents it contains (Rao and Taboada, 2021; Vogel and Jurafsky, 2012).¹ Whether the topics are interpretable depends, in part, on how TM was carried out. Nevertheless, bad labels do not imply that the method is bad; it may simply be that the person labelling them did not know the corpus well enough. We have been labelling topics monthly for the last six years² and have found that a monthly collection of news articles yields consistent topics and can help us observe emerging trends and changes over time. At the end of January 2022, we labelled an emerging topic ‘Russia-Ukraine tensions’. By the end of February 2022, we labelled a similar topic ‘Russian invasion of Ukraine’. Similarly, at the end of January 2020 we labelled a topic ‘Coronavirus outbreak & spread’. By the end of March 2020, the single topic had multiplied into eight distinct topics, such as ‘Covid-19: Cases, deaths & spread’ or ‘Covid-19: Schools & university impact’. In summary, each of the steps in TM is important but not terribly difficult technically, and researchers should familiarize themselves with the excellent literature available.
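The labelling step starts from the highest-weighted words in each topic, which can be extracted as in the following Python sketch. The word weights are invented for illustration and do not come from our data, and `top_words` is a hypothetical helper; the labels are still assigned by hand, by a researcher who knows the corpus.

```python
def top_words(topic_word_weights, n=20):
    """Return the n highest-weighted words for each topic, best first."""
    return {
        topic_id: [
            word for word, _ in
            sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:n]
        ]
        for topic_id, weights in topic_word_weights.items()
    }


# Invented topic-word weights standing in for the output of a topic model.
topic_word_weights = {
    0: {"ukraine": 0.09, "russia": 0.08, "troops": 0.05,
        "border": 0.04, "nato": 0.03, "sanctions": 0.02},
    1: {"cases": 0.10, "covid": 0.09, "deaths": 0.06,
        "spread": 0.05, "outbreak": 0.03, "health": 0.02},
}

# Labels are a human judgement; the code only prepares the word lists.
manual_labels = {0: "Russia-Ukraine tensions", 1: "Coronavirus outbreak & spread"}

summary = top_words(topic_word_weights, n=3)
for topic_id, words in summary.items():
    print(f"Topic {topic_id} ({manual_labels[topic_id]}): {', '.join(words)}")
```

The value of `n` is itself a choice: a short list may hide the words that distinguish two similar topics, which is one reason why labelling tends to be iterative.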

My third point is that TM, in corpus linguistics and discourse analysis, is likely not an analysis in itself, but one of the many possible ways we can study the data. Bednarek lists examples of studies that have combined TM with other corpus linguistic techniques. In our study (Rao and Taboada, 2021), the objective of TM was to establish whether some topics had a higher proportion of men or women quoted (the results were perhaps unsurprising: women are more frequently quoted in lifestyle and health; men in politics, business and sports). In all these cases, TM was part of a larger study with more precise research questions than ‘What are the topics in this corpus?’. Jaworska and Nanda (2018: 395) characterize TM as a ‘good exploratory tool’ that can ‘signpost important terms for further discourse-analytical investigations’. This is precisely what I mean when I say that TM is a means to an end; one does not simply ‘do’ TM, one does it to answer specific questions about the data. This is also why the limits of the method are tolerable if it is not the only method deployed.

Footnotes

Declaration of conflicting interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Maite Taboada

Notes

Author biography

Maite Taboada is Distinguished Professor of Linguistics at Simon Fraser University. Her research intersects discourse analysis and computational linguistics, with a focus on sentiment analysis, social media language, and misinformation.

References

Brookes and McEnery (2019) The utility of topic modelling for discourse studies: A critical evaluation. Discourse Studies 21(1): 3–21.

Gabrielatos (2018) Keyness analysis: Nature, metrics and techniques. In: Taylor and Marchi (eds) Corpus Approaches to Discourse: A Critical Review. New York, NY: Routledge, pp.225–258.

Grootendorst (2022) BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv Preprint arXiv:2203.05794.

Jaworska and Nanda (2018) Doing well by talking good: A topic modelling-assisted discourse study of corporate social responsibility. Applied Linguistics 39(3): 373–399.

Rao and Taboada (2021) Gender bias in the news: A scalable topic modelling and visualization framework. Frontiers in Artificial Intelligence 4: 664737. DOI: 10.3389/frai.2021.664737.

Vogel and Jurafsky (2012) He said, she said: Gender in the ACL anthology. In: Proceedings of the ACL-2012 special workshop on rediscovering 50 years of discoveries, Jeju Island, Korea, pp.33–41.