Abstract
This paper provides a practical example and guide on how to augment or replace human coders with Large Language Models (LLMs) during content analysis. We demonstrate this by replicating and extending an influential study on environmental communication. Our setup, running locally on consumer-grade hardware, makes it feasible for university researchers operating within typical computational and legal constraints. We validate the LLM’s performance by replicating the original study’s codings, scaling the analysis to cover a tenfold increase in articles, and extending the LLM’s application to a comparable German-language corpus, comparing these results to human expert coders. We offer guidelines for instructing LLMs, validating output, and handling multilingual coding, presenting a replicable framework for future research. This paper is intended to systematically guide other researchers when integrating LLMs into their workflows, ensuring reliable and scalable coding practices. We demonstrate several advantages of LLMs as coders, including cost-effective multilingual coding, overcoming the limitations of small-sample content analysis, and improving both the replicability and transparency of the coding process.
Content analysis is a fundamental method in communication and media studies, enabling systematic examination of text for patterns, themes, and trends (Coe & Scacco, 2017). Traditionally, this requires human annotation of large corpora, which is both time-consuming and resource-intensive. With the advent of computational text analysis, scholars could algorithmically scale up research on journalistic and political texts (Laurer et al., 2024; Van Atteveldt et al., 2022; Wilkerson & Casas, 2017). Supervised machine learning, especially BERT-style classifiers, has become essential for natural language processing (NLP) (Osnabrügge et al., 2023). However, these methods demand large, manually annotated datasets and multiple training-validation cycles (Laurer et al., 2024; Wilkerson & Casas, 2017). Reliability can also drop when models encounter out-of-domain data, and minor input alterations (e.g., typos) may disrupt classification (Gröndahl et al., 2018). Large Language Models (LLMs) build on transformer architectures and can learn from context without requiring task-specific training (Corral et al., 2024), potentially complementing or replacing conventional approaches.
In this paper, we outline a practical guide to integrating LLMs into content analysis, demonstrated by replicating and extending Feldman et al.'s (2015) study on newspaper coverage of climate change. By adopting LLMs, we address high costs, limited scalability, and replication challenges inherent in human coding. Our key objectives are to: (1) provide guidelines for LLM setup, validation, and handling of multilingual or proprietary data; (2) replicate Feldman et al.'s original coding; (3) demonstrate scalability by analyzing ten times more articles; and (4) assess the model's multilingual capabilities through a comparable German-language corpus. To streamline and clarify our objectives, we pose the following research questions:
Literature Review
Current State of the Methodological Integration of LLMs
The field of NLP was revolutionized by LLMs—massive pre-trained models with the capacity to generate and seemingly “understand” text. Trained on extensive sources (e.g., web scrapes, open-access books, and Wikipedia) (Brown et al., 2020), LLMs have demonstrated remarkable capabilities in zero-shot and few-shot learning scenarios, where they can perform tasks without explicit training on specific datasets and often perform comparably to human coders (Gretz et al., 2023; Pilny et al., 2024). This in-context learning ability allows them to adapt to new tasks with minimal additional training (Brown et al., 2020). Studies show their effectiveness across tasks like sentiment analysis, stance detection, and accusation detection (Corral et al., 2024; Fields et al., 2024).
Once trained on diverse data, LLMs reduce the need for researchers to provide labeled datasets (Kuzman et al., 2023). Moreover, LLMs can handle a variety of languages, helping to overcome the focus on texts from Anglo-Saxon countries (Zhong et al., 2023) or the analysis of only a few languages due to a lack of coders proficient in multiple languages. In content analysis, LLMs can outpace manual methods by removing the need for coders to actually read texts and for extensive coder training (Coe & Scacco, 2017).
Huang et al. (2023) found that ChatGPT showed “great potential as a data annotation tool” (p. 2), outperforming traditional machine learning models in certain tasks. Similarly, Gilardi et al. (2023) demonstrated that ChatGPT can outperform human annotators in terms of speed and accuracy. LLMs have also been found to outperform human coders in reliability for annotation tasks (Törnberg, 2023) and can identify complex concepts such as hate speech or hypocrisy claims (Corral et al., 2024; Reiss, 2023).
However, concerns remain about their reliability, consistency, and validity (Reiss, 2023). They are inherently non-deterministic, producing variable outputs for the same input. Although such disagreement parallels that seen with multiple human coders, no standardized procedures yet exist to enhance intercoder reliability in automated LLM workflows.
Strategies to Improve Reliability
Designing effective prompts is critical to improving the performance and reliability of LLMs. This sensitivity and potential randomness require carefully formulated and engineered prompts (Argyle et al., 2023; Wang et al., 2024): Structured, iterative prompting techniques can enhance performance of LLM tasks, while simultaneously reducing bias coming from the model's parameters (Furniturewala et al., 2024). Below we list techniques that we implemented and that we hope will be generally useful when integrating LLMs into content analysis:
Using chain-of-thought prompting, the Classifier is asked to make its reasoning explicit before assigning a code. Thus, through this approach, the LLM assists the researcher in tracing and auditing how each code was derived.
Pooling outputs from multiple runs of the same classifier (or from multiple classifier agents) reduces the noise introduced by the model's non-determinism, analogous to relying on multiple human coders.
Despite all the above, human involvement remains a crucial part of the workflow in the form of a human-in-the-loop who reviews prompts, inspects problematic classifications, and adjudicates when automated coding falls short.
Pipeline for Developing LLM Instructions
Here, we introduce our agentic workflow for LLM-driven content analysis. It comprises three main entities—Instructor, Classifier, and Judge—plus the Human Intermediary who oversees prompt refinement. We present this general framework before demonstrating its concrete application in a separate section.
Agents and Their Roles
Building on the findings of previous research on the iterative combination of classifications from multiple LLM agents with adjudication by human experts (e.g., Heseltine et al., 2024), we present a pipeline for a continuously self-refining classification process. This classification pipeline is general enough to be useful to others, leveraging the collaborative interaction between three key entities: the Instructor Agent, the Human Intermediary, and the Classifier Agent. The process, visualized in Figure 1, ensures that text classification tasks are refined iteratively, increasing coding accuracy and consistency. Additionally, we found it beneficial to incorporate a Judge Agent to reduce noise and enhance accuracy in the Classifier's outputs, especially when coding complex concepts like, in our case, "efficacy." Below, we provide a brief description of each agent's role.
Figure 1. Our agentic workflow for optimizing and validating the prompt of the Classifier Agent.
The Iterative Workflow
Table 1. Input and Output of Agents.
This table shows an example in which the classifier codes impacts and their frames. During our actual analysis, classifiers received one prompt for impacts and all the associated constructs presented in Table 2. A separate prompt was used for actions. Text in italics provides further information. Instructor received Feldman et al.’s manuscript and codebook as context.
This prompt is then reviewed and refined by the Human Intermediary. Once the prompt is approved by the human, the Classifier codes a sample of documents based on these instructions. If the classifications are satisfactory, the system moves forward to classify the entire dataset. In our case, the criteria were high intercoder reliability (Krippendorff's alpha >0.7): (a) across multiple iterations of the Classifier on the same document, and (b) between the LLM's and the human codes. If issues arise, the Human Intermediary intervenes, identifying the problems and explaining them to the Instructor. The Instructor is then asked to adjust the prompt to address them.
This iterative loop, where prompts are continuously adjusted and outputs are re-evaluated, enhances the classification process by improving the accuracy and robustness of the model. This approach leverages both the strengths of LLMs in performing large-scale classification tasks and the precision that human oversight provides. However, we found that after two iterations of the refinement loop, little further gain in reliability is to be expected. If reliabilities are not acceptable by that point, we suggest reconsidering the theoretical concept or dropping it from this workflow.
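To make the loop concrete, the sketch below shows one way it could be implemented in Python. The helper names (draft_prompt, classify_sample), the local chat endpoint, and the simple 0/1 parsing are illustrative assumptions rather than our actual implementation; only the stopping rule mirrors the criteria above (Krippendorff's alpha above 0.7 within repeated Classifier runs and against human codes, with at most two refinement iterations).

```python
import requests
import krippendorff  # pip install krippendorff

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed local serving endpoint
MODEL = "llama3.1:70b"

def chat(system: str, user: str) -> str:
    """Send one system/user exchange to the locally served model and return its reply."""
    messages = [{"role": "system", "content": system}, {"role": "user", "content": user}]
    response = requests.post(OLLAMA_URL, json={"model": MODEL, "messages": messages,
                                               "stream": False}, timeout=600)
    return response.json()["message"]["content"]

def draft_prompt(codebook: str, feedback: str = "") -> str:
    """Instructor agent: turn the codebook (plus human feedback) into coding instructions."""
    system = ("You write precise coding instructions for a text classifier. The instructions "
              "must ask for a single digit: 1 if the construct is present, 0 if not.")
    return chat(system, f"Codebook:\n{codebook}\n\nFeedback from the human intermediary:\n{feedback}")

def classify_sample(prompt: str, texts: list[str]) -> list[int]:
    """Classifier agent: apply the current instructions to each text (toy 0/1 parsing)."""
    return [int(chat(prompt, text).strip().startswith("1")) for text in texts]

def alpha(codes_a: list[int], codes_b: list[int]) -> float:
    """Krippendorff's alpha between two sets of nominal codes."""
    return krippendorff.alpha(reliability_data=[codes_a, codes_b], level_of_measurement="nominal")

def refine(codebook: str, sample_texts: list[str], human_codes: list[int],
           max_iterations: int = 2, threshold: float = 0.7):
    """Iterative Instructor -> Human Intermediary -> Classifier loop with a reliability gate."""
    feedback = ""
    for _ in range(max_iterations):
        prompt = draft_prompt(codebook, feedback)
        # The Human Intermediary reviews (and may edit) the drafted prompt before it is used.
        prompt = input(f"Review or edit the drafted prompt:\n{prompt}\n> ") or prompt
        run_1 = classify_sample(prompt, sample_texts)
        run_2 = classify_sample(prompt, sample_texts)
        if alpha(run_1, run_2) > threshold and alpha(run_1, human_codes) > threshold:
            return prompt  # reliable enough: use this prompt on the full corpus
        # Otherwise the human describes the observed problems for the Instructor's next draft.
        feedback = input("Describe the problems observed for the Instructor:\n> ")
    return None  # no acceptable reliability after two iterations: reconsider or drop the concept
```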
Our Case Study
Comparing Feldman et al.’s and Our Approach
In this section, we demonstrate how we applied the workflow described in Section 3 to replicate and extend Feldman et al. (2015). They examined how climate change was portrayed in four major U.S. newspapers between 2006 and 2011. Initially, they collected all articles containing “climate change” or “global warming” in the headlines or lead paragraphs, resulting in 3274 articles. To manage the workload associated with manual coding, they selected every fifth article, yielding a sample of 642 articles that were coded by three human experts.
While commendable given the human labor required, this sampling introduced limitations that restricted their analysis. The uneven distribution of articles across newspapers—for instance, USA Today contributed only 67 articles—made fine-grained comparisons statistically challenging: subgroup comparisons had limited statistical power, and the many tests required increased the risk of Type I errors. This potentially hindered detecting nuanced differences in climate change reporting over time or across outlets. This is not a fundamental critique of Feldman et al. but highlights the scalability advantages LLMs offer for content analyses.
Table 2. Comparison of Concepts and Levels of Measurement.
Note. Comparison of concepts and levels of measurement between Feldman et al. (2015) and LLM codes. The last column presents examples of new types of analyses possible with LLM codes.
Our Setup and Sample
We next detail how the agentic pipeline in Section 3 was used to replicate Feldman et al.'s codes, validate the LLM's reliability, and then scale up our analysis. We aimed to reproduce the original coding decisions from Feldman et al. with our pipeline, thereby assessing the model's ability to replicate human coding reliably. We then expanded the analysis by applying the LLM to the entire corpus of 3274 articles, demonstrating how LLMs can help overcome the limitations of small-sample studies. Furthermore, we extended the study to a comparable German-language corpus (4863 articles) to evaluate the LLM's multilingual capabilities. For comparability between the samples, all articles were collected from the same database (Factiva) using the same inclusion criteria as Feldman et al. In the United States, we used the same four outlets as Feldman et al. (2015): The New York Times (liberal), The Washington Post (center-liberal), USA Today (center), and The Wall Street Journal (conservative). For Germany, we chose four of the country's largest dailies—taz (liberal), Süddeutsche Zeitung (center-liberal), Frankfurter Allgemeine Zeitung (center-conservative), and Die Welt (conservative)—based on prior classifications (Donsbach et al., 1996; Eilders, 2004). Both sets of outlets are thus broadly comparable in terms of national reach and ideological diversity within their respective media landscapes.
To achieve our objectives, we selected the Llama 3.1–70b model with 6-bit quantization, executed on a high-end consumer-grade local computing setup. Llama 3.1 was chosen after pre-testing several state-of-the-art models available in 2024, with Llama 3.1 demonstrating robust multilingual capabilities and the ability to handle nuanced, large-scale textual data. We chose the largest model (in terms of number of parameters) that could fit in the RAM available to us, since larger models generally perform better at classifying complex concepts (Corral et al., 2024). Running the LLM locally allowed us to process proprietary data securely while adhering to data privacy regulations and legal restrictions, constraints typical for research in media studies and whenever dealing with confidential data. The data is not passed to corporate service providers and remains on the researchers' own infrastructure. This ensures that no copyright infringement or privacy issues arise from the unauthorized transfer of (training) data.
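As a rough guide to this choice, the back-of-the-envelope estimate below shows why a 6-bit quantized 70B model fits in high-end consumer RAM while the full-precision weights would not. The figures cover the weights only, ignore KV-cache and runtime overhead (which add several more gigabytes), and are not a description of our exact configuration.

```python
def quantized_weight_size_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the model weights alone, in gigabytes:
    (parameters in billions) * (bytes per weight)."""
    return n_params_billion * (bits_per_weight / 8)

# Llama 3.1 70B: 16-bit weights need roughly 140 GB, 6-bit quantization roughly 52.5 GB,
# which fits into the unified memory of a high-end consumer machine.
for bits in (16, 8, 6, 4):
    print(f"{bits}-bit: ~{quantized_weight_size_gb(70, bits):.1f} GB")
```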
As our starting point, we provided the LLM with the original codebook used by Feldman et al. (2015). Their codebook included definitions of each coding category, criteria for determining the presence or absence of specific information, and examples from the original dataset. We then instructed the LLM to go beyond the codebook and code each identified impact and action separately. In Section 3, we describe in detail how these instructions were developed.
Validation of LLM Classifications
We compared the LLM output with the original human-coded data to assess accuracy and consistency. Iterative refinements were made to the prompts and instructions based on identified discrepancies, aiming for a Krippendorff’s alpha of 0.7 or higher, usually considered indicative of reliable coding. The most successful refinements for improving reliability scores were: (1) explicitly reminding the model not to “hallucinate” (i.e., invent information not present in the source text), (2) breaking down complex coding instructions into separate prompts, and (3) requiring the model to quote directly from the text to justify each assigned code.
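As an illustration, the condensed prompt sketch below shows how these three refinements could be expressed for a single construct. The wording, placeholder names, and JSON answer format are illustrative; they are not the verbatim instructions used in the study.

```python
import json

# Refinement 2: one prompt per construct; only the "impacts" prompt is sketched here.
IMPACTS_PROMPT = (
    "You are coding a newspaper article about climate change.\n"
    "Task: decide whether the article mentions IMPACTS of climate change as defined below.\n"
    "Codebook definition: {definition}\n\n"
    # Refinement 1: explicit reminder not to hallucinate.
    "Base every decision strictly on the article text; never invent or assume information "
    "that is not present in the text.\n"
    # Refinement 3: require verbatim quotes as evidence for each assigned code.
    "For every code you assign, quote the exact passage that justifies it.\n\n"
    'Answer only with JSON: {{"impact_present": true or false, "quote": "<verbatim passage or empty>"}}\n\n'
    "Article:\n{article}\n"
)

def parse_answer(raw: str) -> dict:
    """Parse the model's JSON answer; malformed output is flagged for re-prompting or human review."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"impact_present": None, "quote": None, "error": "unparseable output"}

# Usage: prompt = IMPACTS_PROMPT.format(definition=codebook_definition, article=article_text)
```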
Table 3. Intercoder Reliabilities Between Human and LLM Coders (Krippendorff's Alpha) per Concept and Language.
Note. Feldman et al. coded at the article level, so LLM codes are aggregated to the same level for the alpha calculation. Numbers in brackets are alphas for the validation sample.
We leveraged the LLM's multilingual capabilities by applying the same coding instructions without translation, allowing the model to process German-language articles directly. Recent LLMs, particularly the Llama version used in this study, demonstrate strong performance in classification tasks across both German and English. This effectiveness has been shown to remain consistent even when the language of the query differs from that of the document (Kleinle et al., 2024). Results from the LLM were compared with coding performed by two German expert coders on 200 randomly selected articles (see Table 3). Intercoder reliability between the human coders and the LLM was then assessed. Comparing LLM and human codings, we see similar performance across the two languages, with reliabilities reaching acceptable levels. The classifications for "efficacy" form an exception, with reliability below 0.7. We discuss possible reasons for this in Section 5.
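For the reliability checks themselves, standard tools suffice. The sketch below illustrates the article-level comparison, assuming statement-level LLM codes are first collapsed to one binary code per article (as noted under Table 3); the data layout and helper names are illustrative.

```python
import numpy as np
import krippendorff  # pip install krippendorff

def aggregate_to_article(statement_codes: dict[str, list[int]]) -> dict[str, int]:
    """Collapse statement-level binary LLM codes to one article-level code:
    an article is coded 1 if any coded statement mentions the construct."""
    return {article: int(any(codes)) for article, codes in statement_codes.items()}

def alpha_llm_vs_humans(llm_codes: dict[str, int], human_codes: dict[str, int]) -> float:
    """Krippendorff's alpha between LLM and human article-level codes (nominal data)."""
    articles = sorted(set(llm_codes) | set(human_codes))
    llm_row = [llm_codes.get(a, np.nan) for a in articles]      # np.nan marks missing codes
    human_row = [human_codes.get(a, np.nan) for a in articles]
    return krippendorff.alpha(reliability_data=[llm_row, human_row],
                              level_of_measurement="nominal")
```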
Analyzing the results in Table 3, we observe that incorporating the Judge agent generally enhanced the alignment between the LLM’s classifications and those of human coders. Specifically, using two classifiers in combination with a judge improved the consistency and reliability of the predictions, as evidenced by higher Krippendorff’s alpha values. Additionally, as detailed in Appendix Table A1, the LLM tended to be more sensitive than human coders, identifying the presence of coded variables more frequently within the texts. This increased sensitivity was especially pronounced when the Judge agent was included, suggesting that while the ensemble approach enhances detection rates, it may also lead to a higher frequency of positive identifications compared to human coding. However, we also show in the Appendix that this did not qualitatively change the results found by the more conservative human coders.
Discussion and Recommendations
Our experience integrating LLMs into content analysis demonstrates their practical advantages, including scalability, multilingual reach, and cost-effectiveness. Yet these benefits demand a structured approach, and human oversight remains essential—particularly for prompt refinement and performance evaluation.
Computational overhead varies greatly depending on how LLMs are instructed. Breaking coding into subtasks can boost reliability but multiplies processing time because the model must reset and reprocess each document to minimize hallucinations. Researchers should consider which subtasks are truly necessary and bundle them carefully to manage resources.
Similarly, re-running the classifier to improve reliability adds significant overhead. With two M3 Max processors, our process took two weeks of continuous operation, so we limited ourselves to two runs per article. Future advances may ease these trade-offs, but balancing computational costs against analytical rigor remains inevitable under current constraints.
A key insight is the transparency provided by the LLM’s chain-of-thought reasoning. Discrepancies between the LLM’s classifications and those of Feldman et al.’s human coders were challenging to reconcile due to the lack of access to human coders’ rationales. In contrast, the LLM offered explicit reasoning for each classification, facilitating troubleshooting and enhancing replicability.
Our procedure uses a Judge agent to review the Classifier’s chain-of-thought reasoning, providing a systematic, transparent way to reconcile discrepancies and enhance accountability. While many content-analysis pipelines rely on a single pass or simple voting, the Judge explicitly weighs chain-of-thought outputs from multiple runs before rendering a final decision—an advantage when nuanced interpretation (e.g., “efficacy,” Table 3) is required. In traditional human-coded analyses, a third expert or internal discussions typically settle coder disagreements without leaving transparent rationales. By contrast, our automated Judge preserves explicit reasoning, enabling future researchers to examine or challenge each decision.
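A minimal sketch of how such a Judge could be implemented is shown below; the prompt wording and helper names are illustrative, and the local-chat helper repeats the assumption, made in the earlier sketches, of a locally served model.

```python
import requests

def chat(system: str, user: str, model: str = "llama3.1:70b",
         url: str = "http://localhost:11434/api/chat") -> str:
    """Minimal local-chat helper (same assumption as in the earlier sketches)."""
    messages = [{"role": "system", "content": system}, {"role": "user", "content": user}]
    return requests.post(url, json={"model": model, "messages": messages, "stream": False},
                         timeout=600).json()["message"]["content"]

def judge(article: str, run_a: str, run_b: str) -> str:
    """Judge agent: weigh two classifier runs (codes plus chain-of-thought reasoning)
    and return a final, justified decision rather than a simple majority vote."""
    system = ("You adjudicate between two codings of the same article. Weigh the quoted "
              "evidence and the reasoning of both runs, then give a final code together "
              "with a short justification of your decision.")
    user = f"Article:\n{article}\n\nRun A:\n{run_a}\n\nRun B:\n{run_b}"
    return chat(system, user)
```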
We acknowledge alternative approaches to reconciling multiple LLM outputs, including ensemble-learning techniques from machine learning. Although the Judge agent shares similarities with the concept of an “ensemble,” it differs by leveraging chain-of-thought rationales, rather than merely counting votes. A full benchmark against different ensemble methods would be informative but lies beyond our present scope.
Efficacy coding proved especially challenging, with reliability dropping both across repeated LLM runs and between LLM output and human coders. This may reflect the nuanced, less concrete nature of “efficacy” compared to attributes like geographic focus. Despite three prompt refinement cycles aimed at clarifying this construct, we did not reach our target reliability. This shortfall could stem from the concept’s theoretical vagueness, incomplete instructions, or current LLM limitations—exposing a common issue that even human coders debate. Future work must clarify whether advanced models or refined conceptual frameworks can resolve these persistent gaps.
Looking ahead, we recommend exploring the most recent suggestions to further improve LLM reliability. For example, Kumar et al. (2025) show that a pipeline combining chain-of-thought reasoning with retrieval, self-consistency, and self-verification halves hallucination rates and improves accuracy by 6–10 percentage points on fine-grained classification benchmarks. Drawing on these findings, we would, first, break the task into a two-stage sequence, detecting the mere presence of any efficacy claim before classifying the claim further; this hierarchical approach lightens the model's cognitive load and reduces false positives. Second, we would exploit the larger context windows of current models by prefacing each prompt with several previously validated efficacy sentences, giving the model concrete anchor points for its reasoning. Third, we would add a self-verification pass: the model re-examines its own prediction in light of the retrieved evidence and either confirms the code or flags the item for human review. Kumar et al. report that such a retrieval-augmented, self-verifying workflow pushes Krippendorff's α from roughly 0.65 to above 0.80 on comparable tasks, suggesting that comparable gains are attainable for the efficacy construct analyzed here.
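A compact sketch of how this three-step sequence could look for the efficacy construct is given below; the staging follows the steps just described, but the prompts and helpers are illustrative assumptions, not an implementation taken from Kumar et al. (2025).

```python
import requests

def chat(system: str, user: str, model: str = "llama3.1:70b",
         url: str = "http://localhost:11434/api/chat") -> str:
    """Minimal local-chat helper (same assumption as in the earlier sketches)."""
    messages = [{"role": "system", "content": system}, {"role": "user", "content": user}]
    return requests.post(url, json={"model": model, "messages": messages, "stream": False},
                         timeout=600).json()["message"]["content"]

def code_efficacy(article: str, codebook_definition: str, validated_examples: str) -> dict:
    """Three-pass coding of 'efficacy': detect presence, classify with anchors, self-verify."""
    # Step 1: presence detection only, to lighten the model's load and reduce false positives.
    present = chat("Answer only 'yes' or 'no'.",
                   f"Efficacy is defined as follows:\n{codebook_definition}\n\n"
                   f"Does the following article contain any efficacy claim?\n\n{article}")
    if present.strip().lower().startswith("no"):
        return {"efficacy_present": False}

    # Step 2: fine-grained classification, anchored with previously validated efficacy sentences.
    code = chat("Classify the efficacy claim and quote the passage that supports your code.",
                f"Validated examples of efficacy claims:\n{validated_examples}\n\n"
                f"Definition:\n{codebook_definition}\n\nArticle:\n{article}")

    # Step 3: self-verification; the model re-checks its decision against the quoted evidence
    # and either confirms the code or flags the article for human review.
    verdict = chat("Check whether the quoted evidence really supports the assigned code. "
                   "Answer 'confirmed' or 'needs human review'.",
                   f"Article:\n{article}\n\nProposed coding:\n{code}")
    return {"efficacy_present": True, "code": code, "verdict": verdict.strip()}
```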
We also found that the LLM’s multilingual capabilities greatly facilitate analyses across different languages. The same coding instructions could be applied irrespective of the language of the text, eliminating the need for translation or separate coding protocols and coders. This is a substantial improvement over traditional methods, where language barriers among human coders and the limitations of language-specific machine learning models often hinder comprehensive multilingual analyses.
LLMs will likely extend beyond text to multimodal data, such as images, audio, and video, greatly expanding the scope of content analysis. Our workflow can adapt to these developments, enabling researchers to examine diverse data while preserving methodological rigor and transparency.
We encourage future work to harness these LLM capabilities for greater analytical depth. Sharing chain-of-thought outputs in scholarly datasets can foster more open, replicable research. Locally deployed LLMs also allow researchers to code proprietary data without breaching privacy or legal constraints, ensuring the benefits of advanced text analysis remain ethically and legally sound.
Despite these opportunities, replacing human coders with LLMs alters both research design and text interpretation. LLMs may match or exceed human reliability but do not truly “understand” content, instead applying learned patterns. Qualitative inquiry, by contrast, relies on critical, in-depth engagement with text. Incorporating LLMs therefore requires rethinking how we infer meaning and draw conclusions, underscoring the need for triangulation with human-led approaches to ensure robust interpretations.
As in all applications, the use of LLMs for content analysis raises ethical concerns related to transparency, accountability, and the risk of perpetuating biases without proper human oversight. Additionally, there are environmental concerns, particularly regarding energy consumption and the associated environmental effects when training the models.
In conclusion, while LLMs present powerful tools for content analysis, their effective deployment requires careful consideration of computational resources, methodological design, and the balance between depth and reliability of analysis. Many of these challenges are not new, however. Rather, novel LLM-based approaches to the application of text analysis highlight in a more pressing way the need for scientific rigor.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
