Abstract
Recent months have witnessed an increase in suggested applications for large language models (LLMs) in the social sciences. This proof-of-concept paper explores the use of LLMs to improve text quality and to extract predefined information from unstructured text. The study showcases promising results with an example focussed on historical newspapers and highlights the effectiveness of LLMs in correcting errors in the parsed text and in accurately extracting specified information. By leveraging the capabilities of LLMs in these straightforward, instruction-based tasks, this research note demonstrates their potential to improve the efficiency and accuracy of text analysis workflows. The ongoing development of LLMs and the emergence of robust open-source options underscore their increasing accessibility for both the quantitative and the qualitative social sciences, as well as for other disciplines working with text data.
Introduction
Recent months have witnessed a surge in suggested applications for large language models (LLMs) in the social sciences: LLMs are being used to generate synthetic survey answers (Argyle et al., 2023), as chatbots in conversations in intervention studies (Costello et al., 2024), or to detect constructs and annotate and label text in a variety of contexts (Macanovic and Przepiorka, 2024; Rathje et al., 2023; Ziems et al., 2024). Among the first, Brown et al. (2020) argued that LLMs are few-shot (or even zero-shot) learners and can greatly improve performance in many typical natural language processing (NLP) tasks.
These uses of LLMs come with limitations and have drawn critique (see also Brown et al., 2020). The deep neural networks underlying LLMs are widely recognised as black boxes with opaque decision-making processes (Dobson, 2023) which retain the biases inherent in the data they were trained on (Bender et al., 2021; Navigli et al., 2023). Moreover, asking LLMs the right question is not straightforward: ‘prompt engineering’, the crafting of efficient prompts, is an iterative process of trial and error, guided by best practices and further complicated by the fact that LLMs are sensitive to changes in wording and not necessarily deterministic, so that the same prompt can lead to different responses (Chen et al., 2023; Ouyang et al., 2023; Strobelt et al., 2023). Given these difficulties regarding reproducibility, the usefulness of LLMs for (social) scientific use cases is an important topic of debate (Ball, 2023; Ye et al., 2023).
This research note will highlight one of LLMs’ strengths amidst their acknowledged shortcomings: their ability to comprehend even poorly formatted or low-quality text and to extract predefined information from it. Before analysis, text-as-data typically undergoes a pipeline of preprocessing steps, such as tokenisation, the removal of stopwords or stemming (e.g. Grimmer et al., 2022: 50–53). Effective text analysis hinges on the quality and cleanliness of the input data, but the reality often looks messy: social media data, for example, is user-generated and can thus exhibit all kinds of errors and misspellings. Digitised, optical character recognised (OCR-ed) text data is also often plagued by inaccuracies and artifacts introduced during the digitisation process. Unlike other tools, LLMs circumvent the need for extensive data cleaning and preprocessing and can instead be instructed to take over this task. In addition, LLMs are well equipped to extract pieces of information from text. Through a proof-of-concept demonstration, this research note discusses how LLMs can effectively be integrated into the research process when working with unstructured text data.
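Such a classical preprocessing pipeline can be sketched in a few lines. The stopword list and suffix-stripping ‘stemmer’ below are deliberately crude stand-ins for fuller resources (such as NLTK’s stopword lists and the Porter stemmer) and serve only to illustrate the pipeline’s stages:

```python
import re

# Toy stopword list and naive suffix stemmer; real pipelines would draw on
# fuller resources, e.g. NLTK's stopword lists and the Porter stemmer.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "was"}

def preprocess(text: str) -> list[str]:
    """Tokenise a raw string, remove stopwords and crudely stem the tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())             # tokenisation
    tokens = [t for t in tokens if t not in STOPWORDS]       # stopword removal
    return [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]   # naive stemming

print(preprocess("The wedding of Miss Wright was celebrated in May."))
```

Note how even this toy stemmer mangles ‘miss’ into ‘mis’ — a reminder that every preprocessing stage can itself introduce distortions.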
Improving the text analysis workflow with large language models
From open-ended survey questions to comments scraped from social media pages and digitised books, text data has amassed within the last decade. These data sources hold information in unstructured text instead of tidy numeric datasets, making them less straightforward to work with. This increase in data has led to the development of more sophisticated methods to systematically and automatically understand and process texts as data (Grimmer and Stewart, 2013).
Even before the recent surge in the popularity of LLMs, language models had been suggested as tools to improve untidy text (e.g. Neto et al., 2020; Xu et al., 2022). Given the new widespread and transdisciplinary interest in LLMs, this research note advocates for their inclusion in the toolbox of social scientists who work with text data. LLMs are advanced artificial intelligence systems designed to interpret and generate human language. They are based on neural networks and are trained on vast amounts of textual data, which has allowed them to learn patterns, semantics and grammar.
One important emergent capability of these models is that they can analyse textual statements in line with a question posed by the user. This includes typical use cases in the social sciences such as identifying themes and emotions, or coding text for key features like hate speech or other labels, with new tasks still being discovered (Lupo et al., 2023; Ornstein et al., 2023; Törnberg, 2023a, 2023b). However, as users pose more complex questions, granting the model greater autonomy and agency in its responses (see also Latour, 1996), the risk of introducing biases and limiting replicability also increases. With more complex questions, LLMs base their answers on more contextual knowledge; however, which contextual knowledge is at play remains unknown to the researcher. LLMs do not offer a positionality statement (Bourke, 2014): as researchers and users, we do not have a clear understanding of the underlying biases, assumptions and values guiding the model’s responses, so it becomes challenging to discern the basis upon which decisions are made.
On the other hand, LLMs excel in tasks where they are provided with narrow instructions and have limited autonomy; these tasks are often more straightforward but can still be challenging to automate in research contexts. They include, for example, correcting messy OCR text or misspelled social media posts, or extracting specific pieces of information. In these scenarios, the model’s capacity to understand and follow precise instructions allows it to perform with accuracy and efficiency, while still being more flexible than more traditional approaches (such as named entity recognition), since it can cope with unclear instructions and noisy input text. As Törnberg (2023b) phrased it, LLMs can be thought of as virtual student assistants tasked with textual analysis: versatile and capable, but prone to misunderstandings; narrow instructions and clear use cases minimise these misunderstandings. For instance, when tasked with correcting OCR or spelling mistakes, LLMs can leverage their language understanding capabilities to identify and rectify errors, improving the overall quality and readability of the text. Similarly, when instructed to extract specific information, LLMs can navigate through the text, locate relevant details and extract the requested information.
An example: Preprocessing historical newspapers with Command R+
To showcase the capabilities of an LLM, I take historical newspaper excerpts as an example. In the following, I aim to retrieve the bride’s and groom’s names from American newspapers from 1861. The digitised newspapers from Chronicling America are available as OCR-ed text; the data quality of these texts varies. Information extraction with regular expressions and named entity recognition becomes particularly challenging: wedding reports come in diverse phrasings (e.g. ‘Miss X gets married to Mr X’, ‘the unity of Mr X and Miss X’, ‘Miss X, daughter of Y, celebrated her wedding to Mr X, son of Z’, etc.), they often include misidentified letters (e.g. ‘Miss’ can easily be parsed as ‘Mlss’) and line breaks in the original newspaper text have introduced hyphenations which are difficult to resolve.
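To illustrate why rigid approaches struggle here, consider a hypothetical regular-expression extractor for one canonical phrasing; the pattern and helper function below are my own illustrations, not part of the study’s pipeline. Hyphenation at line breaks can at least be repaired mechanically, but a single misread letter is enough to break the match:

```python
import re

def dehyphenate(text: str) -> str:
    """Join words split by a hyphen at a line break, a common OCR artifact."""
    return re.sub(r"-\s*\n\s*", "", text)

# A rigid pattern for one canonical phrasing: 'Miss X ... to Mr. Y ...'.
PATTERN = re.compile(
    r"Miss ([A-Z][\w.]+(?: [A-Z][\w.]+)*) to Mr\.? ([A-Z][\w.]+(?: [A-Z][\w.]+)*)"
)

clean = "Miss Martha S. Wright to Mr. R. Leander Tomlinson"
noisy = "Mlss Martha S. Wright to Mr. R. Leander Tomlinson"  # 'Miss' misread as 'Mlss'

print(PATTERN.search(clean))  # matches
print(PATTERN.search(noisy))  # None: one misread letter breaks the pattern
```

Covering every phrasing, misspelling and OCR artifact with such patterns quickly becomes intractable, which is exactly the gap the LLM fills.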
An example text looks as follows:
These texts pose challenges for rigid approaches. However, the data can be preprocessed by prompting the LLM to improve the text quality and to extract specific information in a desired format. Via an application programming interface (API), the LLM can be accessed with a few lines of code by passing a prompt (see also Törnberg, 2023b). Different LLMs exist, the most popular one currently being GPT-4 (developed by OpenAI). I make use of open-source alternatives via HuggingFace and use Command R+, one of Cohere’s LLMs and the first open-weight model to beat GPT-4 in the Chatbot Arena, a crowdsourced open platform for LLM evaluations (as of the writing of this manuscript in April 2024).
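As a minimal sketch of such an API call, the snippet below wraps the correction prompt used in this study. The client code assumes Cohere’s Python SDK (`pip install cohere`) and a `COHERE_API_KEY` environment variable; the exact interface may differ across SDK versions, so treat it as illustrative rather than canonical:

```python
import os

# The OCR-correction prompt used in this study, with a placeholder for the excerpt.
CORRECTION_PROMPT = (
    "You are an OCR expert. You are perfect at fixing errors which happen "
    "when digitising text. Please correct the following text and return "
    "the corrected version: {excerpt}"
)

def build_prompt(excerpt: str) -> str:
    """Insert one OCR-ed newspaper excerpt into the correction prompt."""
    return CORRECTION_PROMPT.format(excerpt=excerpt)

# The network call is only attempted when an API key is available; the
# cohere.Client interface is an assumption and may vary between SDK versions.
if os.environ.get("COHERE_API_KEY"):
    import cohere
    co = cohere.Client(os.environ["COHERE_API_KEY"])
    response = co.chat(model="command-r-plus", message=build_prompt("Mlss Sarah Vincent"))
    print(response.text)
```

The same pattern, with a different prompt string, serves the extraction step described below.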
Advising the LLM to correct the text (using the following prompt: ‘You are an OCR expert. You are perfect at fixing errors which happen when digitising text. Please correct the following text and return the corrected version: [excerpt]’) led to the following results:
Further prompting the LLM to return the name of the bride and groom as a comma-separated table (using the prompt: ‘Please extract the name of the bride and groom and return it as a comma-separated table with two columns ‘name_bride’ and ‘name_groom’. These are the people getting married and their first and last names are given. If you do not find any names, put ‘NA’ in the table. Return nothing else.’) led to the following output:

Martha S. Wright,R. Leander Tomlinson (National Endowment for the Humanities, 1861d)

Cynthia A. Jackson,Andrew Jackson Boswell
Susan Murray,George N. Ladd (National Endowment for the Humanities, 1861b)

Sallie Vincent,Alfred Turner
Loula Cochran,Jas. McCarley (National Endowment for the Humanities, 1861a)
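Assuming the model follows the prompt and returns plain comma-separated rows, this output can be read into structured records with a few lines of standard-library Python. The helper below is an illustrative sketch of my own, including the mapping of ‘NA’ cells to missing values:

```python
import csv
import io

def parse_couples(llm_output: str) -> list[dict]:
    """Parse the comma-separated table returned by the LLM into records.

    Assumes the model followed the prompt and returned two columns per row;
    'NA' cells are mapped to None (missing values).
    """
    reader = csv.DictReader(
        io.StringIO(llm_output), fieldnames=["name_bride", "name_groom"]
    )
    return [
        {key: (None if value == "NA" else value) for key, value in row.items()}
        for row in reader
    ]

print(parse_couples("Martha S. Wright,R. Leander Tomlinson\nNA,NA"))
```

In practice, such a parser should be paired with checks for malformed rows, since the model is not guaranteed to honour the requested format every time.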
The LLM-processed text has cleaned up errors introduced through OCR and produced well-structured results. It is important to note that the results are not perfect; especially with difficult source material, improving the data is hard (garbage in, garbage out). For example, the name ‘8aiiaii Vincent’ in the third excerpt gets corrected to ‘Sallie Vincent’, which is wrong; a look at the scanned image shows that the actual name is Sarah Vincent. Some errors thus persist, but the improvement in quality over the OCR-ed text is undeniable. Using the LLM has also allowed the extraction of key information which can now be analysed with more traditional methods of quantitative data analysis.
Conclusions
The (optimal) use of LLMs in research methodologies has emerged as a prominent subject in recent scientific discourse. When LLMs are given clear instructions and well-defined tasks, the results become reproducible and simple to validate. In the presented example, I used Command R+ to extract information from historical newspapers, leading to overall good results. Challenging source material led to less-than-ideal output, but the overall improvement in quality is impressive. Nevertheless, it is of utmost importance that results produced by LLMs are validated. This proof-of-concept paper showed promising results for (English-language) newspapers, but the application of LLMs must be adapted and tested within specific research contexts and input texts. For example, with untidy OCR, correcting misrecognised letters (e.g. replacing l’s with i’s) is more feasible than correcting gibberish or wrong digits, and shorter inputs are generally easier to process than long texts (Chang et al., 2024).
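One simple way to validate such output is to hand-code a small gold-standard sample and compute an exact-match rate against it. The sketch below is a hypothetical illustration using the names from the example above (where the third bride was misrecognised as ‘Sallie’ instead of ‘Sarah’):

```python
def name_accuracy(extracted: list[str], gold: list[str]) -> float:
    """Share of extracted names that exactly match the hand-coded gold names."""
    assert len(extracted) == len(gold), "lists must be aligned pairwise"
    matches = sum(e == g for e, g in zip(extracted, gold))
    return matches / len(gold)

# Hypothetical validation sample: two of three names match the gold standard.
extracted = ["Martha S. Wright", "Cynthia A. Jackson", "Sallie Vincent"]
gold = ["Martha S. Wright", "Cynthia A. Jackson", "Sarah Vincent"]
print(name_accuracy(extracted, gold))
```

Exact matching is a strict criterion; depending on the research question, fuzzier comparisons (e.g. edit distance) may be more appropriate, but the principle of checking against a hand-coded sample remains the same.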
Preprocessing and information extraction with LLMs will not always work, nor can perfection be the benchmark: correcting errors is a balancing act between fixing mistakes and not introducing new ones (Kim et al., 2021). While errors are thus part of the process, (open-source) LLMs have the capability to improve data quality and thus make more text accessible for research. This also means that text available in more niche contexts can be made more accessible, allowing for even more widespread use and analysis of text data within the social sciences. Improved open-source models like the Command R+ used in this study are democratising access to sophisticated language processing capabilities. As such, ongoing innovations and efforts to enhance the efficiency and affordability of LLMs hold promise for empowering research across diverse domains, including projects with limited resources, both financial and technical.
Footnotes
Acknowledgements
The publication of this article was funded by the Mannheim Centre for European Social Research (MZES).
Declaration of conflicting interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
Ethical approval
Ethical approval was not required as only public data is being used.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Data availability statement
Data used in this study is publicly available and cited.
