Both corpus linguistics and discourse analysis are increasingly using techniques that originate outside of linguistics to supplement traditional methods in these fields. Such techniques have attracted criticism for overlooking the expertise required in corpus or discourse analysis and replacing linguistic insight with mere number-crunching.
It is, therefore, high time to contextualise topic models within corpus linguistics and discourse analysis, and to consider what they can and cannot offer. Bednarek (this issue) contributes to this goal by first providing an overview of previous studies on corpus-based discourse analysis that have utilised topic models, demonstrating their varied and increasing use in the field. She then raises important issues and challenges related to the use of topic models in corpus-based discourse analysis and, more broadly, in the relationship between discourse analysis, corpus linguistics, and natural language processing (NLP). Below are some thoughts provoked by her article.
One issue discussed in the article is the (in)adequacy of forming interpretations of a corpus that are based solely on lists of topic-words (Brookes and McEnery, 2019; Gillings and Hardie, 2023). However, this concern relates to a specific use of topic models rather than to topic models in general. We fully agree that a list of words is insufficient for deriving an appropriate interpretation of a topic. Instead, the list should be viewed as a starting point for exploring the underlying co-occurrence patterns or their implications. The same argument could be made in relation to other analytical methods in corpus linguistics, such as word lists, keyword analysis, and multidimensional analysis. In each case, these lists point toward aspects of the text rather than being self-explanatory. It is, therefore, crucial to examine how the topic-words function within central texts to interpret a topic appropriately. In this sense, topic models help us delve deeper into texts rather than providing a definitive list of topics.
In relation to the above, it is important not to assume that the ‘topic’ in topic models corresponds to the conventional understanding of a topic (i.e. themes). In fact, Blei et al. (2003), who proposed probabilistic topic models based on latent Dirichlet allocation (LDA), stated that ‘[w]e refer to the latent multinomial variables in the LDA model as topics, so as to exploit text-oriented intuitions, but we make no epistemological claims regarding these latent variables beyond their utility in representing probability distributions on sets of words’ (p. 996, fn 1). It is in this sense that we claimed that the term topic model is a ‘misnomer’ (Murakami et al., 2017: 244). It should be viewed as an analytical method for identifying co-occurring sets of words. What these text-level word co-occurrences signify is an open question. While some co-occurrence patterns may indeed indicate thematic topics, as users of topic models often hope, others may be due to rhetorical (Murakami et al., 2017) or other factors. What is clear, however, is that these groups of co-occurring words require scrutiny in their interpretations.
A major strength of topic models is that they do not require a point of reference. In corpus linguistics, comparison is fundamental: We cannot determine whether a word (or category or feature) is significantly frequent in corpus A without knowing its frequency in corpus B. This reliance on comparison makes abandoning it feel uncomfortable. However, in situations where an appropriate point of reference is genuinely unknown, topic models – with their linguistically naïve ‘bag-of-words’ approach – can offer researchers additional resources for ‘opening up’ discourses whose salient features are not apparent.
Finally, topic models are an active area of research in NLP, and we believe this is an area where corpus linguists and NLP researchers can benefit from collaboration. For example, regarding stopwords and lemmatisation mentioned by Bednarek (this issue), Schofield et al. (2017) and Schofield and Mimno (2016) found that neither having an extensive stopwords list nor using lemmatisation significantly improves topic models. However, their studies relied on automatically computed measures to evaluate the models. While some of these measures have been validated against human evaluation to some extent (Mimno et al., 2011), they may not fully capture the nuances that corpus linguists wish to capture. This presents an opportunity for both fields to work together in refining quality measures for topic models. Bednarek rightly points out that using topic models involves making a number of decisions. While we do not see this as inherently problematic, we believe there is a research agenda here: exploring how these decisions affect analytical outcomes and identifying optimal configurations for different purposes. Collaborating with NLP researchers is crucial to achieving these objectives.
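To illustrate how such decisions and measures enter in practice, the sketch below (again assuming scikit-learn; the toy corpus and settings are hypothetical) treats stopword removal as one explicit configuration option and computes perplexity, one common automatic quality measure, for each setting. As the comments note, such scores have well-known limitations, which is precisely why refining quality measures is a promising site for collaboration.

```python
# Illustrative sketch: analytic decisions (here, stopword removal) are
# explicit configuration options whose effects can be compared.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the patients reported the symptoms after the treatment",
    "the treatment outcomes varied across the patients",
    "the match ended after the extra time",
    "the players scored twice in the extra time",
]

results = {}
for stop in (None, "english"):  # one decision: keep or remove stopwords
    vec = CountVectorizer(stop_words=stop)
    X = vec.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    # Perplexity is one automatically computed measure (lower is nominally
    # better). Note two caveats: scores over different vocabularies are not
    # directly comparable, and such measures may not track the qualities
    # that corpus linguists care about when interpreting topics.
    results[stop] = lda.perplexity(X)

print(results)
```

The point of the sketch is not the particular numbers but that every such choice is a parameter whose impact on analytical outcomes can, and should, be studied systematically.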
Declaration of conflicting interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
