Sage Journals: Discover world-class research

Abstract

Open-ended survey questions have a long history in public opinion research and are seeing a renewed interest as computing power and tools of text analysis proliferate. A major challenge in performing text analyses on open-ended responses is that the documents—especially if transcribed or collected through web surveys—may contain measurement error in the form of misspellings which are not easily corrected in a reliable and systematic manner. This paper provides evidence that large language models (LLMs), specifically OpenAI’s GPT-4o, offer a flexible, dependable, and low-cost solution to correcting misspellings in open-ended responses. We demonstrate the efficacy of this approach with open-ended responses about the Democratic and Republican parties from the 1996–2020 American National Election Studies, where GPT is shown to correct 85%–90% of misspellings identified by human coders in a sample of responses. Following spelling correction on ∼50,000 responses, we document several consequential changes to the data. First, we show that spelling correction reduces the number of unique and single-use tokens while increasing the number of words matched to a sentiment dictionary. Then, to highlight the potential benefits and limitations of spelling correction we show improved out-of-sample prediction accuracy from a text-based machine learning classifier. Finally, we show a significantly larger degree of emotionality is captured in the spelling-corrected texts, though the size of this measure’s relationship with a known correlate in political interest remains relatively unchanged. Our findings point to LLMs as an effective tool for reducing measurement error by correcting misspellings in open-ended survey responses.

Keywords

Large language models text analysis open-ended survey responses data quality

Introduction

Open-ended survey questions have been central to many influential works in political science (e.g., Converse, 1964; Lazarsfeld, 1944; Zaller, 1992), allowing researchers to explore the freely expressed attitudes of the public unconstrained by the limited response options presented in closed-ended questions (Geer, 1988). With the proliferation of computing power and resources in recent decades has come an explosion of new approaches to automated text analysis that enable researchers to analyze and extract knowledge from these open-ended responses (Grimmer et al., 2022; Kraft, 2023; Roberts et al., 2014). However, a particular challenge to analyzing open-ended responses is their susceptibility to measurement error in the form of misspellings, which may arise as responses are hastily written or typed by survey respondents and interviewers. Existing approaches to correcting these errors are unfortunately limited, coarse, and require considerable time and supervision, such that spelling correction is almost always excluded from the pre-processing stage (see e.g., Denny and Spirling, 2018). Importantly, failure to correct misspellings can have several negative consequences such as artificially inflating the vocabulary size, altering word counts, and ultimately hampering our ability to discover patterns in language-use across texts.

We show that recently developed artificial intelligence tools, specifically OpenAI’s GPT-4o large language model (LLM), offer a flexible, dependable, and low-cost solution to correcting misspellings in open-ended survey responses. LLMs are massive deep learning models trained on troves of data (typically large sections of the internet) that are capable of generating natural language responses to human-generated text inputs. Recently, social scientists have begun leveraging LLMs—particularly Generative Pre-trained Transformers or GPTs—to assist in a variety of research tasks such as text summarizing or labeling (Gilardi et al., 2023; Goyal et al., 2023; Heseltine and Celmm Von Hohenberg, 2024; Mellon et al., 2024). We propose that LLMs can also be used to correct misspellings in open-ended survey responses, providing more accurate representations of the thoughts and opinions of the public and facilitating our ability to discover those patterns.

We point to four specific advantages that LLMs have over existing methods of spelling correction such as built-in spellcheckers in programs like Microsoft Word/Excel or the “hun-spell” R package (Ooms, 2024). First, LLMs are more aware of and better able to account for the context in which a word appears, allowing them to accurately identify and correct more complex errors such as distinguishing between homophones (e.g., “capitol” and “capital”). Second, LLMs excel at handling slang, acronyms, and proper nouns, which are quite common in political speak but handled poorly by traditional spellcheckers (e.g., ACA, AOC, MAGA). Third, LLMs are highly customizable, allowing users to provide detailed instructions for spelling and grammar corrections to match one’s use-case. Fourth and finally, whereas existing approaches may recognize misspellings and suggest replacements, researchers are still typically required to confirm each correction. We show that LLMs can help to automate the entire spelling correction process and does so at performance levels very similar to human coders. Overall, the advanced capabilities of LLMs make them a highly effective tool for addressing a pernicious data quality issue of misspellings in open-ended responses.¹

Correcting misspellings in ANES open-ended responses

We demonstrate the efficacy of GPT-4o for correcting misspellings with thousands of open-ended survey responses from the American National Election Studies about people’s attitudes toward the two major political parties. The ANES is a nationwide, probability-based survey of adult Americans’ various political attitudes and behaviors conducted every 4 years in the context of the U.S. presidential elections.² For decades, the ANES has consistently asked participants if there is anything they “like” or “dislike” about the Democratic and Republican Parties, and those answering affirmatively then provide an open-ended response. The raw text of those responses have recently been made available to the public in a redacted form.^3,⁴ Beginning in 1996, a number of responses appear to contain spelling errors which, if not corrected, will hinder our ability to discover relationships across texts.⁵ Consider two distinct respondents to the ANES: the first speaks about the “state of our economey” while the other is worried about how the “ecomony” will affect their pocketbook next month. Both respondents are speaking on the same subject (i.e., the “economy”), but the connection between the participants and between their words and the topic at hand will not be recognized by most modern statistical software.⁶ Correcting these misspellings would therefore lead to more accurate representations of the public’s attitudes, which we intend to demonstrate through the use of OpenAI’s GPT-4o large language model.

Evaluating GPT for spelling correction

We begin with a manual validation of the performance of GPT in spelling correction across prompts, across repeated calls, and in comparison to texts corrected with the “hunspell” R package. The first step here is to create prompts to instruct the LLM. We generate a set of four candidate prompts (shown in SI 1) that vary in their structure and specificity of instructions to demonstrate the flexibility of the approach. The first prompt simply asks that misspelled words be corrected and the text be returned. The second prompt adds to the first a set of guidelines for the task and output, including a direction that only misspellings should be corrected and not grammar, punctuation, or casing. The third prompt adds additional guidelines specific to the ANES texts, including instructions for handling abbreviations such as “dem(s),” “rep(s),” and “gov(t),” as well as tokens with prefixes such as “pro-” or “anti.” Finally, the fourth prompt further clarifies the task and context, stating explicitly that the texts are open-ended survey responses about the political parties. Researchers may wish to provide their own context and instructions to fit the peculiarities of their data.

With these four prompts we performed spelling correction on 140 randomly selected open-ended responses (4 response types × 5 responses each × 7 survey years = 140 responses).^7,⁸ We also automated spelling correction with the “hunspell” R package by writing a function to automatically accept the first recommendation for any words identified as misspelled.⁹ Our goal is to compare the original and the spelling-corrected texts to measure the efficacy of each approach against a desired result or “ground truth.” The “ground truth” against which we benchmark the corrected texts was established by having two members our research team independently review each text and listing the set of misspelled words along with their acceptable spellings, which is then consolidated into a final list after deliberation.¹⁰ This ground truth is almost always the dictionary spelling of a term (e.g., ground truth of “economey” is “economy”), though this is complicated by words that do not appear in a dictionary (e.g., names, slang) and for words with multiple proper spellings (e.g., “health care” and “healthcare”). Researcher discretion is required here to designate a desired spelling(s) for such terms and one may consider such instructions in their prompts as we have done. Examples of several texts, the words identified in manual review as misspelled, and the set of acceptable corrections are shown in Table 1.

Table 1.

Illustrating the problem of misspellings in open-ended responses.

Text	Misspellings	Replacement
BACKING A GOOD CANDIDTE	{candidte}	{candidate}
More in tough with the real world with no idealogical glasses on	{idealogical}	{ideological}
They want to negotiate more than repubicans	{repubicans}	{republicans}
Almost everything. Envirnmental spending, entitlements, taxes, transparency, foriegn affairs	{envirnmental; foriegn}	{environmental; foreign}

Note: “Text” column shows the original open-ended responses exactly as they appear in the data minus any markers for interviewer questions or notes (e.g., (AE), /TM/) removed in pre-processing. Red underlines indicating misspellings identified in a manual review by both members of the research team. Replacements were determined first by independent review of two members of the research team, which were then consolidated into a final list. Our replacement words are insensitive to casing as the texts are often converted to lowercase before being used in any analyses.

Through our manual validation exercise we identified a total of 114 misspelled words across the 140 documents we sampled, or about 0.8 misspellings per document. The efficacy of the “hunspell” package and of our four prompts in correcting these misspellings are shown in Figure 1, which reports the share of all identified misspellings in the sample that were appropriately corrected. Spelling correction with the “hunspell” package resulted in the poorest performance, accurately correcting only 40 of the 114 terms (35%). Further, given that we automated the use of “hunspell” by automatically accepting the first recommendation for misspelled words, errors were often introduced when that recommendation did not match the ground truth. Of the 74 words the “hunspell” package failed to replace with our desired term, 27 (36%) were replaced with a word outside our defined set of acceptable replacements, suggesting this approach can inadvertently introduce additional measurement error without careful oversight. A similar issue was not observed when using GPT for spelling correction, which demonstrated far superior performance that increased with the specificity of our prompts. The most basic prompt resulted in 87 (76%) of the 114 misspellings corrected. The second prompt that adds more context about the task demonstrates slightly poorer performance (82 of 114; 72%), though still better than “hunspell.” The third and fourth prompts adding context and guidelines specific to the ANES open-ended texts improves performance considerably, correcting 98 (86%) and 94 (83%) of the 114 words, respectively. We settle on prompt 4 for correction of the remaining texts as this prompt’s additional context that the texts were from a political survey about the parties seemed to help with some of the trickier corrections (e.g., “spec ints” successfully recognized as “special interests”).

Figure 1.

Proportion of misspellings properly corrected by approach (Hunspell, GPT prompts 1–4).

Given that GPT can provide different responses to the exact same input across repeated calls to the model, it is important to consider the replicability of this approach. Therefore, we used prompt #4 to perform spelling correction once again on the texts and compare its performance across the two runs (far right of Figure 1). We find remarkable consistency in the spelling correction rate across the two, and even see an improvement on the second run such that 91% of all misspellings we identified were appropriately corrected (an 8 percentage point improvement). This exercise does suggest some marginal variability in the efficacy of this approach across prompts and repeated calls, but performance overall is quite strong compared to the standard approach of ignoring the issue altogether and compared to existing tools such as the “hunspell” package. Our final step is to perform correction on the entire set of ∼49,000 texts with prompt #4 at a cost of about $50.¹¹

How spelling correction changes the data

Now we turn to describing the extent and nature of the changes made to our corpus by performing spelling correction with GPT-4o. We do so in Figure 2 which shows the number of unique tokens (Panel A) and the share of single-use tokens (Panel B) in the original (left bar, gray) and GPT-corrected (right bar, black) texts, as well as the share of unique tokens in either corpus matched to the sentiment dictionary (Panel C).¹²

Figure 2.

Changes to tokens before and after GPT spelling correction.

We see first in Panel A that the number of unique tokens decreases markedly after spelling correction by almost 12,000 tokens, or an astonishing ∼45%. We also see in Panel B a large reduction (0.21) in the share of either corpus consisting of single-use tokens. The reason for such changes is that misspellings—which often appear only once across a set of documents—will artificially inflate the vocabulary size of a corpus. For example, the misspelled word “politjics” will likely represent a unique token that is unlikely to have been used more than once, when in reality it should be counted as the frequently used term “politics”; correcting the word to its appropriate form consequently reduces both the number of unique and single-use tokens, and will lead to more accurate word counts across the corpus. Panel C then points to the potential substantive implications of spelling correction. Here we see that the share of unique tokens affirmatively matched as “positive” or “negative” increases by 0.08 while the share not matched (i.e., “neutral”) decreases by the corresponding amount. The implication is that misspelled terms hamper our ability to identify and count tokens of interest, consequently introducing error into our measurement of substantive concepts in the text. And while we have used a sentiment dictionary to illustrate the nature of this problem, the same logic would apply for the many other dictionaries that remain popular in the social sciences such as LIWC (Boyd et al., 2022), in that misspellings hamper our ability to identify key tokens of interest and to discover patterns of token use across texts.

The results of these descriptive exercises suggest that spelling correction can lead to more efficient and accurate modeling of text data, so now we turn to demonstrating those benefits in practice. We do this with two empirical exercises, the first where we construct a machine learning classifiers and show the performance gains from using spelling corrected text, and the second where we show improvements in the measurement of emotionality in text.

Consequences of spelling correction for modeling text

Our next exercise is based on the authors’ related work where we manually coded a random sample of respondents’ “likes” and “dislikes” about the parties for the presence of ideological language (Converse, 1964), and then used those codings to construct a machine learning classifier (Support Vector Machine) to categorize the remaining unlabeled texts.¹³ Classification of the open-ended responses proceeds in a multi-stage process, where first a model is constructed to identify ideological language in respondents’ “thoughts about the Republican Party” (i.e., their combined “likes” and “dislikes” about the party) before the process is repeated with respondents’ “thoughts about the Democratic Party.” We then create a final classification based on the results of the two models that takes a value of 1 if ideological language was detected in one’s responses about either party, and 0 if no ideological language was detected in either. Here we construct the two underlying classifiers with both the original and GPT-corrected texts and assess the models’ predictive ability on a sample manually labeled by the research team but held-out from model construction. Figure 3 presents performance metrics from these models including balanced accuracy (top row) and F1 (bottom row) which are often used to assess performance of models with imbalanced data as we have here.^14,¹⁵

Figure 3.

Classifier performance metrics before and after spelling correction.

Beginning in the top row of Figure 3, we see that balanced accuracy (i.e., the arithmetic mean of sensitivity and specificity) increases in both models following spelling correction: the model of people’s “thoughts about the Democratic Party” (light and dark blue bars) shows a one percentage point improvement in balanced accuracy while the model of peoples’ “thoughts about the Republican Party” (light and dark red bars) show a more considerable four percentage point increase. Similar patterns can be seen with respect to the F1 measure in the bottom row (i.e., the harmonic mean of precision and recall). The F1 score of the “thoughts about the Democratic Party” shows again a one percentage point improvement, while the “thoughts about the Republican Party” show a slightly larger improvement at two percentage points. More broadly, these results showing that the performance of our classifiers can be improved by several percentage points by simply correcting misspellings in the text. Of course, the size of the performance improvement is not monumental, but the relatively low cost of spelling correction in terms of time and resources to potentially buy back several degrees of predictive power appears worthwhile in our view.

Consequences of spelling correction for measurement in text

Lastly, we build on the earlier findings that a larger share of the unique tokens in the spelling corrected texts were successfully matched to a sentiment dictionary. We do so by first statistically comparing respondents’ emotionality across all four of their open-ended responses about the parties. We measure emotionality by counting the number of words across all four responses that are again matched to a sentiment dictionary; put simply, emotional responses are those which use more emotional words, whether positive or negative. We then regress this measure of emotionality on an indicator for the GPT-corrected texts, including respondent random effects given the paired observations for each respondent. Panel A in Figure 4 shows the coefficient estimate for the GPT-corrected texts (relative to the original texts) and reveals that emotionality is significantly greater (β^ˆ = 0.039, p = .000) in the GPT-corrected texts. Failure to correct spelling errors would therefore lead researchers to significantly underestimate the amount of emotionality in the open-ended responses.

Figure 4.

How political interest relates to emotionality in original and GPT-corrected texts.

We take this exercise one final step further to show how, if at all, the relationship between emotionality and political interest changes following spelling correction. Miller (2011) previously used closed-ended items from the ANES to show that political sophisticates are more likely to express emotions such as anger, fear, hope, and pride toward the presidential nominees. Here we examine how emotionality is expressed across one such dimension of sophistication—political interest—when emotionality is measured in the original and GPT-corrected texts. Panel B of Figure 4 plots the mean emotionality in either set of texts across respondents’ levels of political interest. Here we see that emotionality is greater at all levels of political interest when measured in the GPT-corrected texts, again because spelling correction is facilitating the matching of words to the sentiment dictionary. We then regress either measure of emotionality on the political interest scale plus co-variates for race, sex, age, and education, and find only a marginally improved model fit when emotionality is measured in the GPT-corrected texts as indicated by a higher adjusted R-squared value (Adj. $R_{O r i g i n a l}^{2}$ = 0.166, Adj. $R_{G P T}^{2}$ = 0.173). The size of the coefficient on political interest also appears larger in the model using the GPT-corrected texts (β^ˆ_Original = 1.16, β^ˆ_GPT = 1.21; see Table S5), though these estimates do not appear statistically distinguishable (p = .68). In summary, these exercises have revealed that spelling correction has clear implications for the measurement of concepts in text such as emotionality, and may marginally influence any subsequent relationships estimated with these measurements, though in cases where the misspellings are not too egregious it appears unlikely that spelling correction will entirely alter the substantive conclusions one draws from their analyses.

Conclusion

Spelling correction is rarely (if ever) considered as part of pre-processing text data. However, our results suggest that improvements to our models and measures of text may be left on the table if such misspellings are not addressed. We have shown that tools of artificial intelligence, specifically the GPT-4o large language model, offer one fast, flexible, and dependable solutions to correct such errors, and can help us to reclaim knowledge from the text that is often lost on the margins to this unique measurement.

Moving forward, we hope that researchers conducting text analyses consider performing spelling correction as part of their pre-processing workflow. While we have outlined the potential returns to spelling correction with specific respect to open-ended survey responses, we believe there is potential to apply this approach to many other sources of text data. For instance, content from social media may contain a sizable number of spelling errors as there are few incentives to proofread one’s typed thoughts before pressing “post.” Texts such as these that may contain large numbers of spelling errors stand to gain the most from spelling correction. To be sure, some texts such as news articles, press releases of elected officials, and government documents may require few (if any) spelling corrections given that they likely are edited prior to release. Therefore, researchers should think carefully about the text-generating process and its potential to introduce measurement error when deciding whether spelling correction is appropriate or necessary for their data.

Supplemental Material

Supplemental Material - Spelling correction with large language models to reduce measurement error in open-ended survey responses

Supplemental Material for Spelling correction with large language models to reduce measurement error in open-ended survey responses by Maxwell B. Allamong, Jongwoo Jeong and Paul M. Kellstedt in Research & Politics.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Carnegie Corporation of New York Grant

This publication was made possible (in part) by a grant from the Carnegie Corporation of New York. The statements made and views expressed are solely the responsibility of the author.

ORCID iD

Paul M. Kellstedt

Supplemental Material

Supplemental material for this article is available online.

The replication files are available at: .

Notes

References

Boyd

Ashokkumar

Seraj

, et al. (2022) The Development and Psychometric Properties of LIWC-22. Austin, TX: University of Texas at Austin. Technical report.

Converse

(1964) The nature of belief systems in mass publics. In: Apter

(ed). Ideology and Discontent. New York: Free Press, 18.

Denny

Spirling

(2018) Text preprocessing for unsupervised learning: why it matters, when it misleads, and what to do about it. Political Analysis 26(2): 168–189, Publisher: Cambridge University Press.

Geer

(1988) What do open-ended questions measure? Public Opinion Quarterly 52(3): 365.

Gilardi

Alizadeh

Kubli

(2023) ChatGPT outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences 120(30): e2305016120.

Goyal

Jessy Li

Durrett

(2023) News Summarization and Evaluation in the Era of GPT-3. arXiv preprint arXiv:2209.12356.

Grimmer

Roberts

Stewart

(2022) Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton, UK: Princeton University Press. Google-Books-ID: dL40EAAAQBAJ.

Heseltine

Hohenberg

BCvon

(2024) Large language models as a substitute for human experts in annotating political text. Research & Politics 11(1): 20531680241236239.

Liu

(2004) Mining and summarizing customer reviews. In: KDD ’04: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining, Seattle, WA, USA, 22 –25 August 2004, 168–177.

10.

Kraft

(2023) Women also know stuff: challenging the gender gap in political sophistication. American Political Science Review 118(July): 903–921.

11.

Lazarsfeld

(1944) The controversy over detailed interviews-an offer for negotiation. Public Opinion Quarterly 8(1): 38.

12.

Meitinger

Behr

Braun

(2021) Using apples and oranges to judge quality? Selection of appropriate cross-national indicators of response quality in open-ended questions. Social Science Computer Review 39(3): 434–455.

13.

Mellon

Bailey

Scott

, et al. (2024) “Do AIs know what the most important issue is? Using language models to code open-text social survey responses at scale”. Research & Politics 11(1): 20531680241231468. Publisher: Sage Publications Ltd.

14.

Miller

(2011) The emotional citizen: emotion as a function of political sophistication. Political Psychology 32(4): 575–600.

15.

Ooms

(2024) Hunspell: High-Performance Stemmer, Tokenizer, and Spell Checker. (Version 3.0.4) [R package]. Available at https://cran.r-project.org/package=hunspell.

16.

Roberts

Stewart

Tingley

, et al. (2014) Structural topic models for open-ended survey responses. American Journal of Political Science 58(4): 1064–1082.

17.

Simchon

Brady

Van Bavel

(2022) Troll and divide: the language of online polarization. PNAS nexus 1(1): pgac019.

18.

Zaller

(1992) The Nature and Origins of Mass Opinion. Cambridge, UK: Cambridge University Press.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.52 MB