Abstract
The use of automated coding procedures to scale up content analysis has risen in recent years. Using a mixed-method approach, we examine researchers’ justifications for scaling up content analysis and assess the methodological adjustments—or lack thereof—when supervised machine learning is employed in 38 large-scale content analyses. Almost all of the included studies displayed deficiencies in study design, primarily the uncritical use of frequentist statistics on datasets containing the entire statistical population, or the use of supervised machine learning without methodological adjustments to account for misclassifications. Our findings question the need for large datasets and automated coding in the first place.
Researchers are increasingly drawn to methods that promise to extract insights from vast amounts of information, often believing that larger datasets inherently offer more valid, objective, or revealing results. This belief, described by boyd and Crawford (2012) as the “mythology” of Big Data, positions data scale as a marker of truth, objectivity, and knowledge. With the ever-growing torrent of digital data, we communication researchers have adopted automated coding, such as supervised machine learning (SML), as one of the tools for scaling up content analysis. While automated coding allows for efficient coding of large corpora, its application in communication research raises critical questions about researchers’ justifications for favoring large census datasets over traditional representative samples, and about the intended (or unintended) methodological adjustments. Through a review of 38 empirical studies that used SML, this article critically examines their study designs, focusing specifically on the use of automated coding in content analysis of text data. Our results suggest that the allure of Big Data is associated with researchers adopting large data and automated coding without sufficient methodological consideration. By confronting the mythology of Big Data and its influence on research practices, this study evaluates the appropriateness of very large datasets and automated coding in communication research.
To address these questions of justifications and methodological adjustments, we employed a mixed-method approach, combining qualitative and quantitative analyses across a dataset of 38 empirical studies published between 2013 and 2020. We applied reflexive thematic analysis (RTA) to explore how researchers justify the automation of manual coding tasks in content analysis and to understand their conscious and unconscious methodological adjustments to these decisions. This qualitative component allowed us to capture both explicit justifications and subtler, latent factors influencing the use of SML. By analyzing articles that directly compared manual and automated coding, we gained a comprehensive understanding of the factors driving the adoption of SML and the potential impact of its use in content analysis. We consolidate the themes we discovered through RTA and present points of self-reflection (PSRs) to the researcher community, which we complement with statistical evidence on key methodological aspects.
The Big Data Meta
We define Big Data as a cultural, technological, and scholarly phenomenon that rests on the interplay of: (1) Technology: maximizing computation power and algorithmic accuracy to gather, analyze, link, and compare large data sets. (2) Analysis: drawing on large data sets to identify patterns in order to make economic, social, technical, and legal claims. (3) Mythology: the widespread belief that large data sets offer a higher form of intelligence and knowledge that can generate insights that were previously impossible, with the aura of truth, objectivity, and accuracy. (boyd & Crawford, 2012, p. 663)
As an anecdote, a new textbook on content analysis for communication researchers was recently published (Oehmer-Pedrazzi et al., 2023). It was published after what several scholars call the Computational Turn (Hase et al., 2022; Hepp et al., 2021; Lukito & Pruden, 2023); 1 therefore, it might reflect how communication researchers think about content analysis today. The book contains chapters on manual and automated content analyses. Like previous textbooks (e.g., Krippendorff, 2018; Neuendorf, 2017; Riffe et al., 2014), the chapter on manual content analysis contains a section on sample size planning, which includes a description of the population and the sampling procedure. The subsequent chapter on automated content analysis, however, cites the four-step procedure by Wilkerson and Casas (2017) as “[c]ommon steps of analysis and research designs.” According to this four-step procedure, there is no sample size planning phase 2 and automated content analysis starts immediately with data collection, which entails obtaining a large text corpus either through news databases or application programming interfaces (APIs). The subsequent steps are data preprocessing, data analysis, and data validation.
This anecdote highlights a drastic difference in practice between manual and automated content analysis that goes beyond the coding procedure (often thought to be the definitional difference between the two), as expressed in this new textbook: The size of text corpora plays a strong role in this form of content analysis. But the focus on data size in this textbook was not a singular event. 3 Several methodological introductions published in communication journals also carry this focus. For example, Günther and Quandt (2015, p. 75) posit automated text analytic methods as a reaction to “new challenges to describe and analyze the wealth of information.” Similarly, Trilling and Jonkman (2018, p. 160) claim that the decision to scale up content analysis through automation arises “from the need to keep pace with the data revolution.” And Jarvis et al. (2021, p. 42) suggest that automation in content analysis would “offer new ways to measure what people think, feel, and talk about—on the level of a whole society” and that “[t]he abundance of text data now available makes the limited scalability of traditional approaches more apparent.” It appears to us that the initial skepticism toward automated coding in content analysis (e.g., Grimmer & Stewart, 2013; Riffe et al., 2014) has faded away. Empirical research supports this view: Jünger et al.’s (2022) literature review of communication articles analyzing social media data found that automated data collection often goes hand in hand with automated data analysis techniques in practice.
What Justifies Large-Scale Content Analysis?
While it is true that computational methods have made it easier to obtain and process more data for content analysis, this does not imply that one should always do so. From a statistical standpoint, one does not need to analyze very large samples or even conduct a census to study the population (see Appendix A). Even though the uncertainty around one’s estimates of interest diminishes with increasing data size, all else being equal, inferring from a sufficiently sized representative sample is often adequate. In line with that, the traditional literature on content analysis (e.g., Krippendorff, 2018; Neuendorf, 2017) emphasizes the role of unitizing and sampling in the data making process. Also, in the early days of internet research, Weare and Lin (2000) discussed the opportunities offered by the abundance of data from the World Wide Web but still concentrated on unitizing and sampling (cf. Mahrt & Scharkow, 2013, on the difficulty of doing that).
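To illustrate how quickly sampling error stabilizes, consider estimating the proportion of documents that carry a certain frame. The following minimal R sketch (our own illustration, separate from the argument in Appendix A; the population size and true proportion are arbitrary assumptions) computes the 95% margin of error for increasing sample sizes, including the finite population correction:

```r
# Margin of error (95%) for estimating a proportion p in a population of size N
N <- 1e6                                         # assumed population of one million documents
p <- 0.30                                        # assumed true proportion carrying the frame
n <- c(100, 400, 1000, 2500, 10000, 100000)

se_srs <- sqrt(p * (1 - p) / n)                  # standard error under simple random sampling
fpc    <- sqrt((N - n) / (N - 1))                # finite population correction
moe    <- round(1.96 * se_srs * fpc, 3)

data.frame(n, sampling_fraction = n / N, margin_of_error = moe)
# Around n = 1,000 the margin of error is already about +/- 0.03;
# coding 100,000 instead of 10,000 documents buys very little extra precision.
```

The diminishing returns visible in this toy example are the statistical core of the argument: beyond a moderately sized representative sample, additional coded units mostly add cost, not insight.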
This shift from studying a relatively small sample to studying large data (e.g., a large sample or census) can be partly attributed to the improvement of techniques such as SML. But it is not clear what other factors drive researchers to study ever-increasing datasets. Therefore, for our study, we formulate our first research question: 4
The decision to study either large datasets or smaller samples has implications for the choice between manual and automated coding methods. The automation of coding serves as a means to scale up content analysis. A related question is:
With this first set of questions, we aim to investigate whether obtaining very large samples and using automated coding are even necessary in the first place. Using the critical definition of Big Data by boyd and Crawford (2012, quoted above), the quest for large datasets could be attributed to the mythological (not methodological) belief that large datasets offer a higher form of intelligence, although there is no empirical evidence supporting this belief.
Are There Methodological Adjustments?
The first wave of methodological reflections on content analysis of text data, best summarized by Baden et al. (2021), has criticized the lack of validity of many measurements with automated coding, the lack of generalizability, and the lack of language diversity. Haim et al. (2023) critically evaluated current content analyses (including automated ones) and (re)established the quality criteria of validity, reliability, reproducibility, robustness, replicability, and ethics. While these evaluations are valid, they position the automated coding procedure mainly as an isolated component. Using Breiman’s (2001) notion of statistical modeling cultures, this isolation is rooted in the algorithmic modeling culture in computer science, which focuses mainly on the prediction made by an algorithm for a specific purpose. However, content analysis as a research approach in communication science is in the data modeling culture, which focuses on finding and explaining relationships among variables. From the perspective of the data modeling culture, automated coding methods such as SML represent one component within the broader system of the data making process and should not be considered in isolation.
The classic methodological literature (e.g., Krippendorff, 2018; Neuendorf, 2017; Riffe et al., 2014) points out that the data making process (including unitizing, sampling, and recording, i.e., coding) impacts the subsequent inference process. This means that the consequences of the data making step must be evaluated within the entire research process (a useful summary of the entire process is available as a schematic by Krippendorff, 2018, p. 90). An example of how the research process should be adjusted because of larger data size is given by van Atteveldt and Peng (2018): They state that data size and statistical significance are no longer “a sign of validity nor of invalidity of the conclusions” when the data are too large (p. 86); instead one should focus on substantive effect size and validity. This points out the statistical repercussions of using a large dataset. Given these considerations, our second research question is:
Some researchers express their concerns about automated data making and theory building (Hilbert et al., 2019; Waldherr et al., 2021). From our understanding, the most comprehensive diagnoses thus far are Mahrt and Scharkow (2013) and De Grove et al. (2020). They disclose the rarely discussed trade-off between measurement quality and sample quantity. Concrete evidence of this trade-off is TeBlunthuis et al.’s (2024) study on the measurement errors introduced by misclassifications in SML and their impact on the subsequent regression analysis.
With the above questions, we aim to understand the justifications and reasoning for employing SML as a means to scale up content analysis and also investigate whether researchers reflect upon their methodological choices and study design. We ask these questions about SML because, in our view, they are indicative of a larger trend in recent years in communication science that follows the Big Data Meta (Fuchs, 2017). Methodologically speaking, when to use manual or automated coding in content analysis is not clearly demarcated, except in terms of social factors such as knowledge silos (Hilbert et al., 2019; cf. Baden et al., 2021) or unequal access to computational resources and data (boyd & Crawford, 2012; Bruns, 2019). Traditionally, researchers need to plan the sample size of their content analyses carefully, because manual coding is often thought to be resource intensive and researchers want to make the best use of (human) resources. However, for automated coding, researchers might have taken the exploratory approach from data science 5 and become used to unplanned data collection activities to “compile huge troughs of data with few constraints” (Puschmann, 2019, p. 1582), engaging in frequent iteration of their methods until they reach a point that they find acceptable (Wilkerson & Casas, 2017). Sample size planning has seemingly become less important, and the exploratory approach contradicts the hypothetico-deductive tradition of quantitative content analysis. The research community should not merely accept the usage of computational methods for their own sake, especially given that SML can be error prone and, in the worst cases, may lead to wrong substantive conclusions (Fong & Tyler, 2021; H. Zhang, 2021).
Methodology
Inspired by the in-depth analyses by Baden et al. (2021) and Jünger et al. (2022), we took a mixed-method approach to study our questions.
Research Object: Content Analysis With Automated Coding
In this article, we specifically focus on content analysis of textual data that employs automated coding using SML. We chose SML for two reasons. First, it is at the intersection of manual and automated content analysis. When it is applied in content analysis, it can be understood as an automated extension of the manual coding process (Scharkow, 2011). It requires researchers to provide a manually coded “gold standard” (or training set), which is then fed into an SML algorithm. Second, we think SML is the best research object to answer our research questions.
The research object of this study is not “automated content analysis.” Although the term is widely used, we found many conflicting definitions. Its meaning is not clear enough to formulate suitable inclusion and exclusion criteria for our study. An overarching feature is that one uses different computational methods to replace or extend manual coding: dictionaries, topic modeling, SML, and so on. Yet, as outlined in the introduction, “automated content analysis” does not seem to be determined simply by the method of coding.
There are other self-operating, or automatic, coding methods, such as off-the-shelf dictionary methods and topic modeling techniques. However, dictionary approaches are well understood (e.g., Boukes et al., 2019) and unsupervised methods, specifically topic models, are widely used but exhibit a separate set of complexities that exceed the scope of this study. There are other works that have scrutinized these methods in detail (e.g., Bernhard et al., 2023; Chen et al., 2023).
Data
Our study is based on the list of 102 publications provided by TeBlunthuis et al. (2024), which in turn is a compilation of reviews by Baden et al. (2021), Hase et al. (2022), Jünger et al. (2022), and Song et al. (2020). TeBlunthuis et al. (2024) gathered 102 articles published between 2013 and 2020 that employ some form of SML for content analysis of text data. Of those, they excluded articles that were either methods papers or articles that used SML for data cleaning but not for data analysis. The remaining 48 empirical articles serve as the basis of the current analysis. One drawback of this dataset is that the most recent articles were published in 2020. However, the data represent the 2010s, a decade that saw a strong increase in the use of computational methods, later coined the Computational Turn. This first wave lasted until approximately 2020, right before the first major self-reflections and prescriptive works were published. We are particularly interested in how and why researchers scaled content analysis during this period. Our findings may not reflect how researchers think about these issues as of the date of this publication, or thereafter.
As we are interested in the decisions for scaling and automation over manual methods in empirical content analysis articles during this period, we established the following inclusion criteria:
The analysis was guided by hypotheses or research questions to study social phenomena.
A subset of the studied units was manually coded.
The first criterion excludes articles that studied methodological questions. The same exclusion criterion was used by Baden et al. (2021), Jünger et al. (2022), and Song et al. (2020). The second criterion is important because it ensures that manual coding produced the labeled data (rather than structural clues such as publication time or URL structure) for the SML algorithm. It also excludes articles that used machine learning models trained elsewhere (e.g., Microsoft Azure Sentiment Analysis API) or training data from elsewhere. In short, it ensures that part of the manual coding process was actually automated.
Based on these two inclusion criteria, we excluded 10 articles (8 did not conduct any manual coding; 2 did not study social phenomena but methodological questions). Hence, 38 articles were eligible for further analysis. We will discuss the appropriateness of this data size in the next section. In the subsequent sections, we use a hexadecimal notation to refer to these 38 articles (AD to D2). Although their true identities are available in Appendix B, the purpose of this partial de-identification is to prevent any direct quote from being associated with author names in the APA citation style. Our potential critique is not directed toward any particular article nor, by extension, toward particular researchers. Instead, our aim is to study how practitioners of content analysis (including us) thought about data size and automated coding. In the subsequent sections, we will use “we” to refer to the author(s) of the current article (except in quotes) and “researchers” to refer to practitioners of automated coding. We are researchers too.
Reflexive Thematic Analysis
For our qualitative analysis, we chose to use RTA because it makes use of the researcher’s own expertise, perspective, and even subjectivity in the qualitative interpretation of data (Braun & Clarke, 2006, 2019a). We are actively aware that this analysis was anchored in our understanding of several methodological concepts: content analysis, frequentist analysis, and SML (see Terminology). For transparency, our team was composed of communication researchers and statisticians. We do have a methodological ideal of how a content analysis should be conducted, which is mostly based on Krippendorff (2018, or its older editions) as well as the recent canon about automated coding such as Grimmer and Stewart (2013), Haim et al. (2023), Scharkow (2011), and van Atteveldt and Peng (2018). We also had a preconception that some aspects of current content analysis contradict our methodological ideal. We dealt with this by making our coding orientation explicit. According to Braun and Clarke (2022), investigators should choose and state their own orientation along three axes: experiential/constructionist, inductive/deductive, and semantic/latent. We decided to choose a constructionist, deductive, and latent orientation. With this orientation, we coded both manifest and latent meaning in the data in a critical-realist manner. For example, we coded the absence of some elements (e.g., random sampling).
We followed the six-step approach advocated by Braun and Clarke (2006) and Braun and Clarke (2019a). One coder familiarized themselves with the 38 articles (Step 1) and then did the initial coding (Step 2). Guided by our research questions, the initial codes were developed into candidate themes, which were then reviewed, defined, and named before being written up (Steps 3–6).
The original developers of RTA reject the idea of saturation and therefore also the idea of collecting enough samples to attain so-called data saturation (Braun & Clarke, 2019b). Instead, they advocate the concept of information power, which holds that a smaller sample rich in relevant information is more valuable than a large sample with less relevant information. Our 38 articles should provide enough relevant information, as evidenced by the richness of the theme descriptions below.
Results
Given our research questions, we constructed two themes (Ts) and several subthemes (STs) that address the study designs of the publications together with their justifications and methodological choices. Table 1 shows a summary of the themes and subthemes, where T1 is mainly concerned with the size of datasets used in the reviewed publications alongside their statistical implications and T2 focuses on the practices surrounding SML.
Constructed Themes (T) and Subthemes (ST) With Their Corresponding Research Questions.
T1: From Scale Comes Census, But Rarely Adjustment of Statistical Practice
We identified five studies that used a form of probabilistic sampling in (parts of) their study designs (C9, C4, B0, D2, AE). Among those, C9, C4, B0, and D2 used cluster sampling. AE used random sampling only for one subset analysis. Importantly, AE contained a survey experiment, C9 and D2 were linkage analyses (linking survey and content data), and C4 and B0 were pure content analyses.
The rest of the studies, however, used a census dataset. Figure 1 shows the distribution of data sizes for the studies where data size was mentioned unambiguously. Census datasets tended to be larger than sampled datasets, but even the sampled datasets were comparatively large. Despite this difference from traditional content analysis study designs, adjustments in statistical practice were rare. We observed several fixtures, that is, established practices of data analysis in the social sciences:
Fixture 1: Frequentist inference, such as reporting p values.
Fixture 2: Use of two-sided nil hypothesis significance testing. That is, a parameter of interest is hypothesized to be different from zero, and the null hypothesis that it is exactly zero is tested.
Fixture 3: Rare reporting of effect sizes, especially standardized ones. Only nine articles reported any effect size.
Fixture 4: Extremely rare sharing of data and code.
We would like to expand further on the last fixture. We only had access to reproducible materials for two articles (C6 and B3) and one article with incomplete data that does not allow the SML training (D2). At the same time, the description of the SML procedure in many articles was unclear. Among the 38 articles, there were 26 for which we had no doubt about how the SML procedure was done. For the remaining articles, the actual size of the data, training set, or test set was obscured with imprecise language, for instance “about 100 million sentences” (D2) and “training set comprises 400–500 randomly-selected texts” (B8). Some articles reported performance metrics, but we do not know with certainty whether these metrics were calculated on a separate test set. In BE, a confusing method of validation was mentioned: “leave-one-out cross-validation (with an 80/20% train/test split),” which left us wondering whether it was leave-one-out cross-validation, validation on a separate test set, or both. 6 These ambiguities could have been resolved by the provision of reproducible material such as the data and code. Similarly, when an article did not report any effect size, it would have been possible to recalculate it if the data (or at least summary statistics) and code were available.

Sizes of Datasets Among the Reviewed Studies. Note That Only 26 Out of 38 Studies Were Considered Because for the Remaining 12 Studies the Data Size Could Not Be Inferred Unambiguously.
Regarding the justifications for scaling the dataset, the inferiority of evidence based on probabilistic samples is a strong perception among researchers (ST1.2). However, there are other justifications for scaling (ST1.3).
ST1.1: Some Awareness of the Census Dataset
Some awareness of using a census dataset can be found among researchers. Implicitly, 13 studies presented only descriptive information and data visualizations without explicitly mentioning that they analyzed a census dataset. Explicit mention of adjusting statistical practice because of the census was rare, but there were two mentions. 7
In B6, only a descriptive analysis of the data was reported, and the authors stated that “[d]ue to full data analysis, no inferential statistics are needed.” In BB, the issue was acknowledged, but the statistical practice remained the same as Fixture 1: We note that we have “population data.” That is, we have all of the posts from all [subjects] over the timeframe of our study. Since our data are not a sample, p-values of estimated model parameters are suspect; thus, we also report confidence intervals. (BB, actual research subjects replaced)
Although confidence intervals are more informative than p values alone, this “adjustment” of reporting cannot address the issue of having a census dataset: Like p values, confidence intervals quantify sampling error, which is absent when the entire population has been observed.
ST1.2: Perceived Inferiority of Evidence Based on Probabilistic Samples
The use of probabilistic samples, even when labeled as “representative” in many articles, was perceived by some researchers as inferior because such samples are not complete data (i.e., a census). This perceived inferiority, together with the limited choice of news outlets, was thought to limit the generalizability of previous research: Most research on [a particular communication phenomenon] uses only representative samples from a small number of news outlets, limiting the generalizability of its findings. (B9, actual communication phenomenon replaced)
An advantage of SML was posited as a way to skip the “inferior” process: Conversely to the usual practice of employing human coders, we used computer-based coding. The SML was
Similarly, SML was described as the enabler of full data analysis: . . .[A]utomated coding via machine learning enabled a full data analysis of complex content-related variables. (B6)
ST1.3: Scale Not for Small Effect, But Granularity
Most studies did not provide a reason for using a large, census dataset. Notably, the classic statistical argument (Cohen, 1992) of detecting a small effect with enough power was not among the reasons given. Among the articles with an explicit reason, none mentioned effect size (see Fixture 3 above). Similarly, we did not find any sample size determination, either a priori or a posteriori.
Scaling appears to be a passive process of dealing with the increasing amount of data that can be obtained easily. C7 made this point clear (we will discuss this further below in T2): SML is a relatively new technique that allows analysis of
For those that explicitly mentioned a reason for scaling, we observed one overarching methodological reason: granularity. With a systematically collected large dataset from various sources or across time, researchers can slice the data at different levels of granularity for more sophisticated analyses. Granularity allows comparative analysis across geographical regions (D1, CB, B7) and outlets (CE, B9, CB). The most powerful use, in our opinion, was dicing the data into time series to study temporal dynamics (B3, CA, BA, B5, C2, CE, D0, B8), as expressed in this quote: Machine-based content analytic methods, complementing human coders, can help capture accumulated [a particular communication phenomenon] dynamics as they change over time, and allow us to examine important subsequent questions, such as how many individual [a particular communication phenomenon] mentions in the media, over what period of time. (BA, communication phenomenon replaced)
T2: SML as Shortcut
Most of the included articles did not explicitly mention any reason for employing SML despite the potential challenges of its implementation (Zamith & Lewis, 2015). In the rare instances where explicit reasons were given, using SML was justified by scaling up the data size (C6, CA, D0, CB) and by harnessing technological advancements (C7, C3). Another common way to introduce the need for SML was to pit it against manual coding (ST2.1). SML was thought to be a means to conquer two perceived adversaries: the immediate availability of large data and the tediousness of manual coding (ST2.1). However, when SML became a shortcut to skip a huge chunk of manual coding and to enable the analysis of Big Data, this external force from computer science also turned “magical” (magnificent, ubiquitous, and mysterious): Researchers might justify the use of SML with incompatible expectations or misconceptions (ST2.2). SML becomes mysterious as researchers also show inconsistencies in how to tame this external force, encountering practical challenges and often lacking sufficient methodological consideration (ST2.3).
ST2.1: Manual Coding as a Chore
Manual coding was presented as a task that researchers want to get rid of. In one extreme case (C0), even though 91.27% of the content was manually coded, the remaining 8.73% was coded automatically with SML. The same article states: The research also makes a significant methodological contribution, highlighting the value of automated coding efforts that can enhance or potentially even
Several articles made similar methodological claims (D1, B1). Manual coding in previous studies was nonetheless described as theoretically better but more costly than SML (C4, C8, B7, BC). However, these costs were not made concrete, for instance: When manually coding [texts], deeper insights (though still not entirely accurate) may be extracted from the [texts] by the coders. However, a high cost in both time and effort is required. (C4, actual text type replaced)
An exception was C8, which gives a vague estimate of person hours: But as new technologies emerge and online content grows, these strategies [manual coding] are becoming unwieldy and cost prohibitive. Hand coding what could be considered a representative sample could require hundreds of hours of work. (C8)
A related issue was the methodological concern about scaling manual coding, such as the danger of code drift in previous studies: . . .[M]anually classifying thousands of documents increases the likelihood that coding becomes inconsistent and colored by a researcher’s personal expectations. (C6)
However, manual coding is still an essential part of SML: Researchers still need to code manually to produce training and testing data. Researchers took the opportunity to express their dissatisfaction with this inevitable “long and tedious process.” (C4)
ST2.2: Incompatible Expectations and Misconceptions
We identified several expectations of SML that can, in principle, be achieved. However, they are incompatible with the practices of academic research or with the currently accepted practices of computational methods. An example of this incompatible expectation is real-time analysis based on “real time social media information”: The real time social media information will be beneficial to understand the public’s perception, since the opinion information extracted from unstructured, free-text messages often reflect people’s attitudes in backstage view and can be monitored in real-time. (B4)
While it is true that one can apply SML for the real-time analysis of social media information, this is not how researchers usually do their analysis. As a matter of fact, B4 was still an analysis of a historical snapshot of data. 8
Another example of an incompatible expectation is the idea that a model that was trained with SML can be reused for future studies, making it more efficient than using manual coders: Trained and validated in this study, the classifiers can now be applied to code even larger datasets. This makes it possible to
Again, while it is true that a trained classifier can be applied elsewhere with little additional effort, this is not compatible with the current understanding of validity (Hosseini et al., 2017; Rauchfleisch & Kaiser, 2020; Schöpke-Gonzalez et al., 2025; van Atteveldt & Peng, 2018; Vargo, 2024). An “off-the-shelf” classifier, similar to “off-the-shelf” dictionaries, requires individual validation with respect to the data at hand. Moreover, none of the reviewed articles made their trained model publicly available for reuse. Other reviews of software and tool usage in computational communication research have shown that classifiers are rarely reused in practice, which suggests that researchers are aware of these validity issues (Stecker et al., 2024). This is unfortunate, because if researchers made their models and data available, 9 SML could have the potential to enable large-scale content analysis for researchers in various situations. Fostering such a sharing culture could lead to more standardization of coding practices and counteract the theoretical fragmentation found in journalism studies (De Grove et al., 2020).
There were also some expectations that were probably due to misconceptions about SML: . . .[A]utomated coding may help us find patterns that counter conventional wisdom, none of which would be discovered if conventional wisdom dictated the coding process. (C6)
This quote does not fit the description of SML because the generation of the training and test sets is still “dictated” by conventional wisdom. Training sets still dictate how classifiers are trained, and manually coded test sets are assumed to be the ideal (or “gold standard”) for evaluating their performance. This quote in our understanding fits better with unsupervised learning methods such as Latent Dirichlet Allocation.
Another quote shows misconception about what a test set in machine learning is: . . .[T]he algorithm uses information from texts whose [label] are assumed known (“training set”) to learn about a second set of texts whose [label] are
ST2.3: Implicit Criteria and Handling Strategies for Failure (and Success)
Although nine articles did not report anything about classifier performance, performance tuning was mentioned in several articles (B0, CE, AF). For instance:
It turned out that the initial sample of 1000 [texts] was not enough to train a model with satisfactory performance. It was especially challenging to reach satisfactory accuracy for classifying [label]-mentioning [texts] . . . Therefore, more [texts] were manually labeled. After several rounds of iterations, a final sample of 2100 labeled [texts] were used to build a model to classify [label]. (B0, emphasis added, original text type and label replaced)
However, what exactly constitutes satisfactory performance was not disclosed. Among the articles without any reported tuning, we observed a large variation in performance. For example, B2 and CC reported F1 scores of 0.53 and 0.47, respectively. C4 explicitly declared the SML a failure because the F1 values were low (e.g., 0.22, 0.36, and 0.52). Consequently, the authors decided to use the manually coded training data for their analysis. Given that the manually coded data from C4 were a representative sample, the authors could use them directly as such. In contrast, study CB suggested that their training set was not representative. Nonetheless, the misclassified variables were not corrected for in the statistical analyses of any of the articles. 10
Discussion and Reflections
Rather than aiming to identify research gaps or prescribe a research agenda, our goal is to reflect upon and discuss the why of scaling up content analysis and its methodological implications. We acknowledge that our study has limitations, such as the age of the reviewed publications and the focus on SML for content analysis of text data. The themes and subthemes reflect a particular moment in the field’s development, where “computational methods are becoming increasingly established in the discipline” and “fundamental concerns with epistemological parameters can be found” (Jünger et al., 2022, p. 1488). With this in mind, we also acknowledge the evolution in our field since the first wave of the Computational Turn.
The second limitation is that we only focused on studies that employed SML for content analysis of text data. We did not consider studies using SML for analyzing image or video data, which have arguably become highly relevant for communication research. Video content has become ubiquitous; for example, political communication is increasingly mediated on video-centric platforms. Therefore, the analysis of video data in particular probably benefits strongly from computational methods, where the complexity of the material can be reduced with utilities such as key frame extraction or automated transcription.
While existing research agendas provide valuable guidance on the how of future research (e.g., Baden et al., 2021; Birkenmaier et al., 2023; Hilbert et al., 2019; Lukito & Pruden, 2023; Pouwels et al., 2023; Shah et al., 2015; Shen et al., 2024; Waldherr et al., 2024), our discussion and reflections aim to contribute by examining the fundamental why behind these methodological shifts.
With our analysis, we would like to elicit the zeitgeist of the first wave of the Computational Turn and to reflect on and learn from it for the second wave. This is particularly relevant because in the first wave researchers had easy access to a large variety of data—such as social media data—that is now hard to acquire (Bruns, 2019). Moreover, modern computational methods require more computing power than ever before, which in turn requires fresh water and energy. 11
In this second wave, researchers have begun to leverage resource-hungry methods for analyzing content data for communication research (e.g., Fan et al., 2024; Joo & Steinert-Threlkeld, 2022; Kroon et al., 2023; Peng et al., 2023). We are no historians, so we quote Timothy Snyder, reflecting upon the clarity of why: When we forget the
Researchers should put an equal amount of mental energy into critically evaluating the why of scaling and automation. During the initial phase, researchers’ imagination of the why of automated coding was based largely, if not exclusively, on the mainstream discourse of scaling up content analysis to (passively) cope with the amount of data (Günther & Quandt, 2015; Trilling & Jonkman, 2018).
Our RTA suggests several justifications, the whys, for researchers to adopt SML, as one example of an automated coding approach (ST1.2, ST1.3, T2, ST2.1, ST2.2). ST1.2 (perceived inferiority of evidence based on probabilistic samples) is particularly noteworthy, because the other side of the same coin is exactly the mythological belief of Big Data defined by boyd and Crawford (2012). Even if one supposes that the Big Data myth were correct and a larger data size alone were really an indicator of a higher form of intelligence, the answers to our second research question show that the methodological adjustments such data would require were rarely made.
We observed that the use of SML was not principled in terms of criteria for failure (ST2.3). Meanwhile, measurement errors arising from SML were completely ignored (ST2.3). The unprincipled application of the Fixtures (T1, the status quo) to vastly different, upscaled census datasets was also prevalent, although we observed some awareness of this issue (ST1.1). The irony of these Fixtures is that most of them were established under conditions of data scarcity and apply only to relatively small probabilistic samples. This historical methodological state during the initial phase was confusing, conflicting, and even ironic. In our opinion, researchers should reflect on the why again, or else this methodological state from that period will be perpetuated into the future.
As an essential step of RTA, we provide a reflexive account of our findings and of our experience of getting them published. Readers can consider these to be our own PSRs on scaling and automation. From our reading of the methodological literature in our field, discussions of these points are not common, with the exception of similar points expressed by Mahrt and Scharkow (2013), De Grove et al. (2020), and TeBlunthuis et al. (2024). We punctuate all of them with the interrobang, a superposition of the exclamation and question marks, to simultaneously opine and doubt our own opinion on behalf of the readers. Please note again the distinction between our use of researchers and the pronoun we.
PSR1: One Content Analysis‽
Krippendorff (2018, p. 384) grouped content analyses into three different classes based on their entry point: text-driven content analysis (“the availability of texts”), problem-driven content analysis (“a desire to know something currently inaccessible and the belief that a systematic reading of potentially available texts and other data could provide answers”), and method-driven content analysis (“the analysts’ desire to apply known analytical procedures to areas previously explored by other means”). This classification is important because it deals with the why of content analysis.
Because of our inclusion criteria, all 38 included articles looked like problem-driven content analyses. 12 If the entry point is an epistemic question, then one should plan how best to answer that question. The deciding factors, ideally speaking, should be neither which type of data is easily obtainable (text-driven) nor which methods allow the data to be processed at scale (method-driven).
If this is the ideal, then in the beginning, researchers created one content analysis. Only one. Manual or automated coding should only be a methodological choice that suits a study design. This point may look banal, but it restores coding to its integral role within the entire methodological approach of content analysis and invites systems thinking about the entire research process. It also prevents what Krippendorff (2018, p. 398) called the risk of the “Law of the Instrument”: “. . .[W]hen researchers get hooked on one analytical technique, when they become experts in its use, they may well end up applying that technique to everything in sight—and not without pleasure.” 13
A subpoint is—unsurprisingly for this field—how one communicates about this branch of research methodology. Perhaps a healthy first step for us is to talk about “automated coding” and “manual coding” in a content analysis instead of using the terms “automated content analysis” and “manual content analysis.” The term “automated content analysis” is ambiguous and could imply that the entire research process, which includes the formulation of research questions, data making, making inferences, and narration, is automated. As of this writing, we have never encountered such a technological marvel. Even if it existed in an artificial intelligence (AI) laboratory somewhere, one should question the epistemic value of such an “automated content analysis.” Whom would it serve?
PSR2: Sampling Is More Efficient Than Automation‽
Scaling can be done by increasing the number of observations drawn from a population. Researchers look for efficiency and cost effectiveness (ST2.1) and perceive manual coding as an inefficient chore. However, manual coding is an inevitable step even for automated coding. From ST2.1, researchers usually consider automated coding a cost-saving measure (cf. De Grove et al., 2020). This train of thought deserves rethinking because, in comparison to scaling up and then automating coding, sampling is a more valid and efficient way to reduce costs. Using a representative sample reduces the amount of manual coding work, often resulting in a sample that is much smaller than the one needed for training an SML classifier, as attested by the mean training set size of 3,049.5 across the 38 papers.
The claim that sampling is valid and efficient, however, goes against the perceived inferiority of representative samples among researchers (ST1.2). This perception is unjustifiable because the difference between a census dataset and a representative sample is the sampling fraction, the proportion of members selected from a population. A census dataset is a special case of a representative sample in which the sampling fraction is 100%. Purely from a statistical viewpoint, the higher the sampling fraction, the larger the sample size for a given population and thus the smaller the sampling error. 14 If the sample is representative, sampling error is a random error within a predictable range (contrast this with another source of error in PSR4). This controllable, predictable error is usually not influential enough to affect the validity of a research finding. Of course, if this error is too large, it would lead us to question the validity of the finding. Krippendorff (2018, p. 366) subsumed sampling error under the validity of a content analysis and called it sampling validity (a). 15 When the sample is representative, one can control the acceptable range of sampling error through proper sample size planning. However, to plan a proper sample size, the often-ignored effect size (see T1 and Fixture 3) must play a stronger role.
Establishing a sampling plan is not trivial (see Cohen, 1988, 1992 for thorough treatments of power analysis). Many uncertain quantities play a role and must be considered. Nonetheless, it is worth doing 16 to avoid underpowered but also overpowered studies. Under the frequentist statistical paradigm, the sampling plan is typically fixed before the start of the study (but see Pocock, 1977). Adapting the sampling plan based on interim inspection of results risks inflating the Type I and II error rates (Armitage et al., 1969; Yu et al., 2014) and obtaining meaningless results.
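The inflation of the false-positive rate under such interim inspections is easy to demonstrate with a short simulation. The sketch below (a minimal illustration with arbitrary look points, not a reanalysis of any reviewed study) tests a true null correlation after every 50 additional coded units and stops at the first p < .05:

```r
set.seed(2024)
looks <- seq(50, 500, by = 50)   # assumed interim looks after every 50 coded units

one_run <- function() {
  x <- rnorm(max(looks))
  y <- rnorm(max(looks))         # H0 is true: x and y are uncorrelated
  p_at_looks <- sapply(looks, function(n) cor.test(x[1:n], y[1:n])$p.value)
  any(p_at_looks < 0.05)         # did we ever "find" an effect?
}

mean(replicate(2000, one_run()))
# Far above the nominal .05 (around .19 is expected for ten equally
# spaced looks; see Armitage et al., 1969).
```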
To get an intuition about effect sizes and the required sample sizes to detect a certain effect, we provide a simplified example. Figure 2 displays the correlation effect size that can be detected for a given sample size.

The Black Dots Display Effect Sizes in Terms of Correlation (See Right Axis).
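As a rough guide to the relationship shown in Figure 2, the minimal detectable correlation for a given sample size (and, conversely, the sample size required to detect a given correlation) can be approximated with the Fisher z transformation. The following base-R sketch assumes the conventional α = .05 (two-sided) and 80% power; it is an approximation for illustration, not the exact computation behind Figure 2:

```r
alpha  <- 0.05
power  <- 0.80
z_crit <- qnorm(1 - alpha / 2) + qnorm(power)    # approx. 2.80

# Smallest correlation detectable with 80% power at a given sample size n
min_detectable_r <- function(n) tanh(z_crit / sqrt(n - 3))
sapply(c(50, 100, 500, 1000, 10000), min_detectable_r)
# approx. 0.39, 0.28, 0.12, 0.09, 0.03: very large samples only pay off
# when very small correlations are of substantive interest.

# Conversely, the sample size needed to detect a correlation rho
required_n <- function(rho) ceiling((z_crit / atanh(rho))^2 + 3)
required_n(c(0.1, 0.2, 0.3, 0.5))
# approx. 783, 194, 85, and 30 units, respectively
```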
As an alternative to frequentist statistics, one could use the Bayesian statistical framework (for introductions, see Dienes, 2011; Gelman et al., 2013; Kass & Raftery, 1995; Kruschke, 2015; Wagenmakers et al., 2018). Among the several advantages of Bayesian statistics (e.g., Dienes, 2011; Wagenmakers, 2007), one that is immediately relevant for the current discussion is that Bayesian statistics allows sequential testing, so that one can monitor the data as they come in and, based on the results thereof, decide whether sampling should be continued or stopped (Rouder, 2014; Schönbrodt & Wagenmakers, 2018). A predefined decision criterion 18 guides the choice to continue or stop data collection; this criterion could be a desired Bayes factor (evidence for one hypothesis over the other) or a desired width of a credible interval of a specified percentage.
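As a minimal sketch of such a stopping rule, consider monitoring the proportion of coded items that contain a given frame, comparing a point null (θ = .5) against a uniform prior under the alternative, and stopping once the Bayes factor passes 10 in either direction. The scenario, prior, batch size, and threshold are illustrative assumptions, not recommendations drawn from the reviewed literature:

```r
set.seed(7)
theta_true <- 0.58          # assumed true prevalence of the frame
batch      <- 50            # code 50 additional units per round
threshold  <- 10            # stop when BF10 > 10 or BF01 > 10

# Bayes factor for H1: theta ~ Beta(1, 1) versus H0: theta = 0.5,
# given k "frame present" codes in n coded units (beta-binomial marginal likelihood)
bf10 <- function(k, n) exp(lbeta(k + 1, n - k + 1) - n * log(0.5))

k <- 0; n <- 0
repeat {
  k  <- k + rbinom(1, batch, theta_true)   # manually code another batch
  n  <- n + batch
  bf <- bf10(k, n)
  cat(sprintf("n = %4d  BF10 = %6.2f\n", n, bf))
  if (bf > threshold || bf < 1 / threshold) break   # decision criterion reached
}
```

Under this rule, coding stops as soon as the accumulated evidence is compelling for either hypothesis, rather than at an arbitrary, pre-fixed corpus size.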
PSR3: Let Census be Census‽
If there really is a need to conduct a census (for example, for granularity, ST1.3), one should adjust the statistical practices appropriately (ST1.2). The practice of applying frequentist statistics (e.g., reporting p values) to a census dataset is not meaningful, because there is no sampling error left to quantify when the entire population has been observed.
A census should be descriptive. Regression can still be used for the estimation of effect sizes, which is also a descriptive exercise. Practical significance, evaluated by the magnitude of an effect, is meaningful; statistical significance of that effect is not. Here, one can apply systems thinking (PSR1) and also consider PSR2: An effect estimate for the evaluation of practical significance can be obtained more efficiently from a sample (PSR2). Therefore, we cannot think of a reason why conducting a census is methodologically necessary for the estimation of effect sizes.
PSR4: Hold Automated Coding to a Higher Standard Than Manual Coding‽
Our review raises the following questions on how one should approach automated coding vis-à-vis manual coding: Should one hold automated coding to a higher standard than manual coding? Is it fair to ask for more justifications for SML? Do we overlook limitations of manual approaches?
We hold SML to a higher standard than manual coding because the former has more uncertainties than the latter. Although SML can be considered a convenient extension of manual coding, it inherits all the issues of manual coding and also comes with unique issues that extend beyond it. The methodological literature has long documented the validity and reliability issues of manual coding (e.g., Krippendorff, 2018; Neuendorf, 2017). We can assume that manually coded data always have validity and reliability issues. For instance, a Krippendorff’s α of .80, commonly treated as acceptable, still tolerates disagreement between coders on a non-trivial share of the coded units.
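To make this concrete, the following sketch (our own illustration, using the R package irr and an arbitrary 10% disagreement rate) simulates two coders of a binary variable, one of whom errs on roughly one in ten units, which lands close to the customary α ≈ .80 benchmark:

```r
# install.packages("irr")  # assumed to be available
library(irr)
set.seed(123)

n_units <- 1000
truth   <- rbinom(n_units, 1, 0.5)          # latent "true" category, balanced
coder1  <- truth                            # coder 1 codes perfectly
errs    <- rbinom(n_units, 1, 0.10) == 1    # coder 2 errs on ~10% of the units
coder2  <- ifelse(errs, 1 - truth, truth)

kripp.alpha(rbind(coder1, coder2), method = "nominal")$value
# approx. 0.80: an alpha usually deemed acceptable still leaves roughly
# one in ten "gold standard" labels in disagreement.
```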
Without fixing the imperfectly coded data, manually coded data are used as the gold standard in automated coding. The imperfect—even erroneous—manual coding decisions are modeled computationally by an SML algorithm. The accuracy of the SML is then evaluated by treating the manual coding as the “ground truth” (an F1 score cannot exceed 1; the classifier can at best reproduce the manual coding). Because of that, automated coding inherits all the errors of manual coding. Garbage in, garbage out. Up to this point, one should at least hold manual coding and automated coding to the same standard.
In ST2.2, we did not list the arguably biggest misconception among researchers about SML as a coding method for content analysis: that SML-coded variables can be directly used in subsequent analyses without any consequences, whether or not the performance of the SML classifier relative to human performance was demonstrated. All 38 included articles suffer from this misconception. It may have arisen from the conflict between the two cultures of statistical modeling (Breiman, 2001). In computer science, SML is assumed to be used for prediction only. How the predictions are used afterward was not a concern of computer science during the early development of machine learning. Hence, the demonstration of acceptable predictive accuracy suffices, even though we do not know what the acceptable level is (ST2.3). In the context of content analysis, however, one uses SML to create new surrogate variables and subsequently uses these variables in further analyses. This difference in cultures calls for different yardsticks of validity. Communication researchers should use a validity yardstick that is useful for “our culture,” that is, whether or not the classification distorts subsequent analyses. Predictive accuracy metrics derived from a confusion matrix, such as F1, are incomplete indicators of validity for communication researchers’ use case.
A unique validity threat of SML is that misclassifications are seldom random, yet a single summary value such as F1 treats them as if they were. In practice, misclassifications are often associated with other variables; that is, misclassifications are differential. An example of differential misclassification is the variation in SML performance relative to humans across languages (Baden et al., 2021). If a multilingual populism classifier, for instance, classifies German content closer to human performance than Hebrew content, the misclassifications are differential, because the populism variable generated with the classifier will be more “wrong” (or less human-like) for the Hebrew content than for the German content. In other words, the language difference (and, by extension, the countries where these languages are spoken, such as Austria versus Israel) causes the differential misclassifications.
Unlike the predictable sampling error arising from a small sample size, differential misclassifications introduce bias that is systematic and unpredictable before data collection; the direction and magnitude of the bias introduced by SML depend on the causal structure of the misclassifications with all measured variables and unmeasured confounders (see also Fong & Tyler, 2021; H. Zhang, 2021). Evaluating the complete causal structure of misclassifications is impossible. Therefore, researchers cannot, in general, show that the misclassifications do not bias the results. The best they can do is partially adjust for the biases using additional manually coded data (see the overview by TeBlunthuis et al., 2024; but adjustment is not needed in the first place if the data size is not large and there is no need to automate the coding). Scaling without adjusting for differential misclassifications amplifies the biases introduced by SML and consequently generates highly precise but highly inaccurate estimates. A Monte Carlo simulation based on the R program from TeBlunthuis et al. (2024) is presented in Figure 3. It simulates the effect of scaling on regression analysis when the dependent variable is coded automatically with an impressive F1 of 0.9 but differential misclassifications, compared with manual coding subject to idiosyncratic errors (further details about the simulation can be found here: https://osf.io/a4wb6). While the observed values based on manual coding approach the expected values as the data size increases, the observed values based on automated coding become more precise but move farther away from the expected values. The latter sort of scaling, in our opinion, is harmful to science. As these biases are unpredictable, we have no means to determine how the unadjusted misclassifications influenced the reported results of the included 38 articles, neither in direction nor in magnitude (see Fixture 4 on missing shared data and code; had they been available, we would have been able to check).

A Monte Carlo Simulation to Demonstrate the Difference in Scaling on Statistical Analysis With Misclassifications in Automated Coding (Top Row) and Scaling With Idiosyncratic Coding Errors, or Unreliability, in Manual Coding (Bottom Row); With Increasing Sample Sizes (Columns, Left to Right). In the Regression Model, the Dependent Variable Was Assumed to Be Coded Either Automatically or Manually. The Performance of the Classifier Was Set to an F1 of 0.90 But With Misclassifications Correlated With One Independent Variable. Manual Coders Were Set to Make Random Guesses 1/5 of the Time. There Were Two Independent Variables and Their Regression Coefficients Were Set to Zero (the Intersection Point of the Two Dashed Lines) in the Simulation as the Expected Values and Plotted on the Two Axes. Each Scenario Was Based on 50 Simulation Runs. The Black Dots Display the Observed Values of the Two Regression Coefficients.
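A minimal sketch of such a simulation is shown below. It is not the original OSF code; the error model and parameters (a false-negative rate that increases with one predictor, and human coders guessing at random on one fifth of the units) are illustrative assumptions chosen to mimic the scenario in Figure 3:

```r
set.seed(42)

simulate_once <- function(n, coding = c("automated", "manual")) {
  coding <- match.arg(coding)
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  y  <- rbinom(n, 1, 0.5)            # true coefficients of x1 and x2 are zero

  if (coding == "automated") {
    # differential misclassification: false-negative rate rises with x1
    # (overall performance still corresponds to roughly F1 ~ 0.9)
    p_fn <- plogis(-2 + 1.5 * x1)
    flip <- y == 1 & rbinom(n, 1, p_fn) == 1
  } else {
    # idiosyncratic human error: a random guess on 1/5 of the units
    guess <- rbinom(n, 1, 0.2) == 1
    flip  <- guess & rbinom(n, 1, 0.5) == 1
  }
  y_obs <- ifelse(flip, 1 - y, y)

  # linear probability model of the coded variable on the two predictors
  coef(lm(y_obs ~ x1 + x2))[c("x1", "x2")]
}

# 50 runs per scenario, as in Figure 3
auto   <- replicate(50, simulate_once(10000, "automated"))
manual <- replicate(50, simulate_once(10000, "manual"))
rowMeans(auto)    # the x1 coefficient is systematically pulled away from zero
rowMeans(manual)  # both coefficients scatter around zero (noise only)
```

Increasing n in this sketch tightens the spread of the estimates in both scenarios, but only the manual-coding scenario converges on the true (zero) coefficients; the automated-coding scenario converges, ever more precisely, on a biased value.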
In summary, automated coding inherits all problems of manual coding and introduces the unique problem of differential misclassifications. If manual coding is “wrong,” automated coding is a “wrong” made of another “wrong,” as in “two wrongs don’t make a right.” Using the principle introduced earlier, we should hold automated coding to a higher standard than manual coding.
Acknowledgments
The following R (R Core Team, 2024) packages were used for the data analysis: effectsize (Ben-Shachar et al., 2020), pwrss (Bulus, 2023), viridis (Garnier et al., 2024), tidyverse (Wickham et al., 2019), data.table (Barrett et al., 2024).
We would like to thank Baden et al. (2021), Hase et al. (2022), Jünger et al. (2022), and Song et al. (2020) for making their literature databases publicly available, and TeBlunthuis et al. (2024) for consolidating these databases and providing us with the list of articles. This study would not have been possible without their commitment to Open Science. It would also not have been possible without Professor Klaus Krippendorff’s (1932–2022) lifework on content analysis.
Concluding Remarks
Braun and Clarke (2019a) position RTA in the “big Q” of the “big Q/small q” dichotomy of qualitative research. Unlike the “small q” qualitative research, “big Q” qualitative research is not informed by the positivist values and practices that animate many quantitative approaches. Our work is peculiar because our mixed-method approach has an interpretivist “big Q” component, but the research object is the often positivist approach of content analysis. Our roles in interpreting the data are crucial.
We acknowledge our own position and have reflected on how it shapes our interpretation of the data. We are not interchangeable coders, as is assumed in the calculation of Krippendorff’s α.
If there is one thing we argue in this article, it is that researchers should ask themselves why they use a certain methodological approach, be it scaled-up large data, small historical data of only 38 papers, manual coding, automated coding, or an interpretivist qualitative approach such as RTA. What justifies such a methodological approach as suitable for answering one’s research questions? This question is not related to the research topic (e.g., misinformation) or the availability of massive data (e.g., from social media APIs). In the same way that physicists would not conduct a census of every particle in the universe to test a theory, communication researchers do not need a census of all available texts to answer their research questions.
It is easy to interpret our motive as us—perhaps some methodological conservatives—being against this branch of computational methods and as us wanting to bring back the orthodoxy of the so-called “manual content analysis.” Even though using a small sample and manual coding could be a common outcome when researchers have asked themselves why, it is not our intention to “restrict” researchers to taking that route. We acknowledge the fact that methods for conducting automated coding are improving rapidly, recently due to the rise of large language models. Unless researchers discard human coding as the gold standard, it would be extremely difficult to have automated coding methods that are free of differential misclassifications (PSR4, also Figure 3). When one thinks about how to address the issues we raised, it is beneficial to consider PSR1: One should improve content analysis as a whole package, not just (automated) coding. Improving sampling strategies and statistical methods (PSR2) is, in our opinion, an equally important—perhaps also more efficient—avenue to address the issues. We would also like to advocate for rethinking study designs for content analysis and considering the integration of qualitative methods in the workflow. It may be self-evident to most readers, but we also want to mention explicitly that using automated coding still has merits for many research questions. Higher sample sizes are needed for study designs that require high granularity, such as time-series analysis (ST1.3), comparative designs (ST1.3, Weber & Popova, 2012), 19 or causal inference (Egami et al., 2022). As is evident from this exemplary non-exhaustive list, there is a multitude of research scenarios for which a large sample is beneficial. Even though all of these measures have their place, we argue that researchers should examine more thoroughly and systematically whether they are necessary to answer their research question.
Footnotes
Appendix A
Appendix B
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Ethical Approval and Informed Consent Statements
Not applicable.
