Sage Journals: Discover world-class research

Abstract

Large language models (LLMs) provide cost-effective but possibly inaccurate predictions of human behavior. Despite growing evidence that predicted and observed behavior are often not interchangeable, there is limited guidance on using LLMs to obtain valid estimates of causal effects and other parameters. We argue that LLM predictions should be treated as potentially informative observations, while human subjects serve as a gold standard in a mixed subjects design. This paradigm preserves validity and offers more precise estimates at a lower cost than experiments relying exclusively on human subjects. We demonstrate—and extend—prediction-powered inference (PPI), a method that combines predictions and observations. We define the PPI correlation as a measure of interchangeability and derive the effective sample size for PPI. We also introduce a power analysis to optimally choose between informative but costly human subjects and less informative but cheap predictions of human behavior. Mixed subjects designs could enhance scientific productivity and reduce inequality in access to costly evidence.
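
As a rough illustration of how predictions and observations can be combined, the sketch below implements the standard prediction-powered estimator of a mean: the average LLM prediction over a large unlabeled pool, corrected by the average prediction error measured on the human subjects. It is a minimal sketch under stated assumptions, not the article's implementation. The function name `ppi_mean`, its argument names, and the simulated data are hypothetical, and the article's extensions (the PPI correlation, effective sample size, and power analysis) are not reproduced here.

```python
import numpy as np


def ppi_mean(y_human, pred_human, pred_unlabeled):
    """Prediction-powered point estimate and standard error for a mean.

    y_human:        observed outcomes from human subjects (n values)
    pred_human:     LLM predictions for those same n subjects
    pred_unlabeled: LLM predictions for N cases with no human outcome

    All names are illustrative assumptions, not taken from the article.
    """
    y_human = np.asarray(y_human, dtype=float)
    pred_human = np.asarray(pred_human, dtype=float)
    pred_unlabeled = np.asarray(pred_unlabeled, dtype=float)
    n, big_n = len(y_human), len(pred_unlabeled)

    # Rectifier: how far predictions are off, measured on the human sample.
    rectifier = pred_human - y_human

    # Mean prediction on the unlabeled pool, minus the mean prediction error.
    estimate = pred_unlabeled.mean() - rectifier.mean()

    # The two samples are independent, so their variances add.
    variance = pred_unlabeled.var(ddof=1) / big_n + rectifier.var(ddof=1) / n
    return estimate, np.sqrt(variance)


if __name__ == "__main__":
    # Simulated data (purely hypothetical): predictions are biased but
    # correlated with the human outcomes.
    rng = np.random.default_rng(0)
    truth_small = rng.normal(loc=1.0, scale=1.0, size=200)
    truth_large = rng.normal(loc=1.0, scale=1.0, size=20_000)
    pred_small = truth_small + 0.3 + rng.normal(scale=0.5, size=200)
    pred_large = truth_large + 0.3 + rng.normal(scale=0.5, size=20_000)

    est, se = ppi_mean(truth_small, pred_small, pred_large)
    classical_se = truth_small.std(ddof=1) / np.sqrt(len(truth_small))
    print(f"PPI estimate: {est:.3f} (SE {se:.3f})")
    print(f"Human-only SE: {classical_se:.3f}")
```

Because the correction term is estimated from human subjects alone, a systematic bias in the predictions cancels out of the point estimate; how much precision is gained depends on how closely the predictions track the observed outcomes.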

Keywords

mixed subjects design, prediction-powered inference (PPI), PPI correlation, effective sample size, PPI power analysis, machine learning, large language models, computational social science



The Mixed Subjects Design: Treating Large Language Models as Potentially Informative Observations
