Both corpus linguistics and discourse analysis are increasingly using techniques that originate outside of linguistics to supplement traditional methods in these fields. Such techniques have attracted criticism for overlooking the expertise required in corpus or discourse analysis and replacing linguistic insight with mere number-crunching.
It is, therefore, high time to contextualise topic models within corpus linguistics and discourse analysis, and to consider what they can and cannot offer. Bednarek (this issue) contributes to this goal by first providing an overview of previous studies on corpus-based discourse analysis that have utilised topic models, demonstrating their varied and increasing use in the field. She then raises important issues and challenges related to the use of topic models in corpus-based discourse analysis and, more broadly, in the relationship between discourse analysis, corpus linguistics, and natural language processing (NLP). Below are some thoughts provoked by her article.
One issue discussed in the article is the (in)adequacy of forming interpretations of a corpus that are based solely on lists of topic-words (Brookes and McEnery, 2019; Gillings and Hardie, 2023). However, this concern relates to a specific use of topic models rather than to topic models in general. We fully agree that a list of words is insufficient for deriving an appropriate interpretation of a topic. Instead, the list should be viewed as a starting point for exploring the underlying co-occurrence patterns or their implications. The same argument could be made in relation to other analytical methods in corpus linguistics, such as word lists, keyword analysis, and multidimensional analysis. In each case, these lists point toward aspects of the text rather than being self-explanatory. It is, therefore, crucial to examine how the topic-words function within central texts to interpret a topic appropriately. In this sense, topic models help us delve deeper into texts rather than providing a definitive list of topics.
In relation to the above, it is important not to assume that the ‘topic’ in topic models corresponds to the conventional understanding of a topic (i.e. themes). In fact, Blei et al. (2003), who proposed probabilistic topic models based on latent Dirichlet allocation (LDA), stated that ‘[w]e refer to the latent multinomial variables in the LDA model as topics, so as to exploit text-oriented intuitions, but we make no epistemological claims regarding these latent variables beyond their utility in representing probability distributions on sets of words’ (p. 996, fn 1). It is in this sense that we claimed that the term topic model is a ‘misnomer’ (Murakami et al., 2017: 244). It should be viewed as an analytical method for identifying co-occurring sets of words. What these text-level word co-occurrences signify is an open question. While some co-occurrence patterns may indeed indicate thematic topics, as users of topic models often hope, others may be due to rhetorical (Murakami et al., 2017) or other factors. What is clear, however, is that these groups of co-occurring words require scrutiny in their interpretations.
A major strength of topic models is that they do not require a point of reference. In corpus linguistics, comparison is fundamental: We cannot determine whether a word (or category or feature) is significantly frequent in corpus A without knowing its frequency in corpus B. This reliance on comparison makes abandoning it feel uncomfortable. However, in situations where an appropriate point of reference is genuinely unknown, topic models – with their linguistically naïve ‘bag-of-words’ approach – can offer researchers additional resources for ‘opening up’ discourses whose salient features are not apparent.
Finally, topic models are an active area of research in NLP, and we believe this is an area where corpus linguists and NLP researchers can benefit from collaboration. For example, regarding stopwords and lemmatisation mentioned by Bednarek (this issue), Schofield et al. (2017) and Schofield and Mimno (2016) found that neither having an extensive stopwords list nor using lemmatisation significantly improves topic models. However, their studies relied on automatically computed measures to evaluate the models. While some of these measures have been validated against human evaluation to some extent (Mimno et al., 2011), they may not fully capture the nuances that corpus linguists wish to capture. This presents an opportunity for both fields to work together in refining quality measures for topic models. Bednarek rightly points out that using topic models involves making a number of decisions. While we do not see this as inherently problematic, we believe there is a research agenda here: exploring how these decisions affect analytical outcomes and identifying optimal configurations for different purposes. Collaborating with NLP researchers is crucial to achieving these objectives.
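To illustrate how such decisions and measures enter in practice, the sketch below (again assuming scikit-learn; the toy corpus and settings are hypothetical) treats stopword removal as one explicit configuration option and computes perplexity, one common automatic quality measure, for each setting. As the comments note, such scores have well-known limitations, which is precisely why refining quality measures is a promising site for collaboration.

```python
# Illustrative sketch: analytic decisions (here, stopword removal) are
# explicit configuration options whose effects can be compared.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the patients reported the symptoms after the treatment",
    "the treatment outcomes varied across the patients",
    "the match ended after the extra time",
    "the players scored twice in the extra time",
]

results = {}
for stop in (None, "english"):  # one decision: keep or remove stopwords
    vec = CountVectorizer(stop_words=stop)
    X = vec.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    # Perplexity is one automatically computed measure (lower is nominally
    # better). Note two caveats: scores over different vocabularies are not
    # directly comparable, and such measures may not track the qualities
    # that corpus linguists care about when interpreting topics.
    results[stop] = lda.perplexity(X)

print(results)
```

The point of the sketch is not the particular numbers but that every such choice is a parameter whose impact on analytical outcomes can, and should, be studied systematically.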
Declaration of conflicting interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
