Abstract

Generative artificial intelligence (AI) is increasingly presented as a potential substitute for humans, including as research subjects. However, there is no scientific consensus on how closely these in silico clones can emulate survey respondents. While some defend the use of these “synthetic users,” others point toward social biases in the responses provided by large language models (LLMs). In this article, we demonstrate that these critics are right to be wary of using generative AI to emulate respondents, but probably not for the right reasons. Our results show (i) that to date, models cannot replace research subjects for opinion or attitudinal research; (ii) that they display a strong bias and a low variance on each topic; and (iii) that this bias randomly varies from one topic to the next. We label this pattern “machine bias,” a concept we define, and whose consequences for LLM-based research we further explore.

Keywords

LLMs, bias, generative artificial intelligence, computational social sciences, machine learning, survey research

Machine Bias. How Do Generative Language Models Answer Opinion Polls?