Abstract
Market and survey researchers aim to write survey questions so that the target population can understand them. A common recommendation for general population studies is to write survey questions at an eighth-grade reading level. To evaluate whether questions meet this threshold, survey researchers turn to readability measures, such as the Flesch-Kincaid Reading Grade Level. Researchers may be able to streamline the calculation of question reading levels by using artificial intelligence (AI) tools, such as ChatGPT, in situ as they draft survey questions. One risk is that AI tools may calculate readability measures incorrectly. To our knowledge, whether AI tools calculate readability correctly has not been evaluated. In this paper, we examine readability calculations for 60 survey questions performed by commonly available AI tools, including both ChatGPT and Claude large language models (LLMs), at three time points (Summer 2024, November 2024, April 2025). We compare these to a “gold standard” online readability assessment tool (Readable.com) and to calculations from packages in two open-source software programs, R and Python. We examine the Flesch-Kincaid Grade Level and four other readability measures, as well as the inputs to their calculations (e.g., number of words, sentences). Although there is almost perfect alignment between these metrics as calculated by Readable and R, each LLM varies in its calculations across models and over time. We also examine how each tool calculates these inputs and the implications for the reported overall readability score. Our results suggest that open-source tools are reliable and accurate. We also find that LLMs are evolving in their ability to accurately calculate the readability of survey questions, with large variation across models and over time.
Some, but not all, LLMs are being trained to use the same resources as open-source tools to calculate readability of passages of text.
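To make the comparison concrete: the Flesch-Kincaid Grade Level is defined as 0.39 × (words/sentences) + 11.8 × (syllables/words) − 15.59, so its value depends entirely on how a tool counts words, sentences, and syllables. The sketch below is a minimal, illustrative Python implementation; the vowel-group syllable heuristic is a crude assumption, and the open-source packages evaluated in the paper use more careful tokenization and syllable counting, so their results may differ.

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count runs of consecutive vowels and treat a
    # trailing silent "e" as non-syllabic. Production tools typically
    # use pronunciation dictionaries or more refined rules.
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1 and not word.endswith(("le", "ee")):
        n -= 1
    return max(n, 1)

def flesch_kincaid_grade(text: str) -> float:
    # Sentence and word tokenization are themselves assumptions here;
    # differences at this step propagate into the final grade level.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return round(
        0.39 * (len(words) / len(sentences))
        + 11.8 * (syllables / len(words))
        - 15.59,
        2,
    )

# Example survey question (hypothetical):
# flesch_kincaid_grade("How satisfied are you with your current health plan?")
```

Even small disagreements in these input counts (for example, whether a hyphenated term is one word or two, or how many syllables "satisfied" has) shift the reported grade level, which is why the paper examines the inputs as well as the final scores.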
