Abstract

Systematic content analysis of messaging has been a staple method in the study of communication. While computer-assisted content analysis has been used in the field for three decades, advances in machine learning and crowd-based annotation, combined with the ease of collecting large volumes of text-based communication via social media, have made classifying messages easier and faster. The greatest advancement yet might be in the form of general intelligence large language models (LLMs), which are ostensibly able to classify messages accurately and reliably by leveraging context to disambiguate meaning. It is unclear, however, how effective LLMs are at carrying out content analysis. In this study, we compare the classification of political candidate social media messages by trained annotators, crowd annotators, and large language models from OpenAI accessed through the free web interface (ChatGPT) and the paid API (GPT API) on five categories of political communication commonly used in the literature. We find that crowd annotation generally had higher F1 scores than ChatGPT and an earlier version of the GPT API, although the newest version, GPT-4 API, performed well compared with the crowd and with ground truth data derived from trained student annotators. This study suggests that applying any LLM to an annotation task requires validation, and that freely available and older LLMs may not be effective for studying human communication.
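
To make the F1 comparison concrete, the following is a minimal sketch of how labels from different annotation sources could be scored against ground truth for a single binary category. The labels are invented for illustration and are not the study's data; the function name `f1_score` and the source names `crowd` and `llm` are hypothetical, and the study's actual scoring procedure may differ.

```python
def f1_score(truth, predicted, positive=1):
    """F1 for one binary category: harmonic mean of precision and recall."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for one category (1 = message belongs to the category).
# "ground_truth" stands in for labels from trained annotators.
ground_truth = [1, 0, 1, 1, 0, 0, 1, 0]
labels_by_source = {
    "crowd": [1, 0, 1, 0, 0, 0, 1, 0],
    "llm": [1, 1, 1, 0, 0, 1, 1, 0],
}

# Score each annotation source against the same ground truth.
scores = {
    source: f1_score(ground_truth, labels)
    for source, labels in labels_by_source.items()
}

for source, score in scores.items():
    print(f"{source}: F1 = {score:.3f}")
```

Repeating this per category and per source yields the kind of score table on which such a comparison rests. F1 is used here rather than raw accuracy because it remains informative when a category applies to only a small share of messages.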

Keywords

large language models, artificial intelligence, crowdsourcing, content analysis, social media, machine learning

The Efficacy of Large Language Models and Crowd Annotation for Accurate Content Analysis of Political Social Media Messages
