Abstract
In this study, we generate prompts derived from every category of the Journal of Economic Literature (JEL) classification system to assess the ability of both the GPT-3.5 and GPT-4 versions of the ChatGPT large language model (LLM) to write about economic concepts. ChatGPT demonstrates considerable competence in offering general summaries but also cites non-existent references. More than 30% of the citations provided by the GPT-3.5 version do not exist, and this rate is only partially reduced in the GPT-4 version. Additionally, our findings suggest that the reliability of the model decreases as prompts become more specific. We provide quantitative evidence of errors in ChatGPT output to demonstrate the importance of verifying LLM output.
Introduction
Economists predict that Artificial Intelligence (AI) tools like OpenAI’s ChatGPT will make human workers more productive (Eloundou et al., 2023; Goldfarb et al., 2023; Peng et al., 2023). While acknowledging that AI tools like ChatGPT can augment human labor, we present systematic evidence of errors common in the output of large language models (LLMs) when prompted to write about and cite social science literature. Researchers should be aware of ChatGPT’s tendencies to hallucinate (i.e., generate false information or content that appears real). The problem of hallucination is significant for applications in business, policy, or wherever humans rely on text generated by AI.
We query ChatGPT (both the GPT-3.5 and the more recent GPT-4 version) about economics through prompts derived from each of the Journal of Economic Literature (JEL) categories. While acknowledging that some of the output is useful and accurate, we note the common occurrence of hallucinations. The issue of hallucination was known to the computer science community before the public release of ChatGPT. Ji et al. (2023) warn that, “Hallucination … is concerning because it hinders performance and raises safety concerns for real-world applications. For instance, in medical applications, a hallucinatory summary generated from a patient information form could pose a risk to the patient.” Alkaissi and McFarlane (2023) document this problem in a medical context. Even though the error rate will decline as LLM technology improves, the problem is unlikely to be eliminated entirely, since LLMs are fundamentally creative. AI-generated content is already on the internet, including on websites that users might consult for medical advice (Thompson, 2023).
To provide an objective count of hallucinations per response, we check how many of the citations provided by ChatGPT exist and how many do not. There can be other types of mistakes in LLM output; however, we consider this quantitative measure to be useful data on hallucinations in social science writing. For the GPT-3.5 version of ChatGPT, the overall rate of false citations in our data is over 30%. The GPT-4 version fares better but still produces false citations at a rate over 20%. We also find that the rate of false citations increases significantly when the prompt moves from a general topic to a narrow question.
Our work supports the notion that certain creative tasks can be outsourced to LLMs, while some of the more routine work in research production still cannot be entirely entrusted to ChatGPT (Buchanan & Hickman, 2023; Cowen & Tabarrok, 2023). Our contribution is not limited to the usefulness of ChatGPT for the particular task of writing a literature review. We use citations as a proxy for factual errors because they can be counted and verified. The broader implication is that LLM output can contain falsehoods that appear legitimate because ChatGPT can mimic scientific writing.
Our goal is to present systematic evidence of this problem in a social science research context so that researchers can use LLMs appropriately. Like Korinek (2023), we find that the LLM can produce valid references for widely cited books and papers, because the LLM has access to many training examples in human writing. Jungherr (2023) presents examples showing that LLM performance declines at the conceptual level as requests become more specific. Roberts et al. (2023) provide similar cases for geography using GPT-4. Our paper provides quantitative evidence for this phenomenon.
Method
We asked ChatGPT to provide content on a wide range of topics within economics. For every JEL category, we constructed three prompts of increasing specificity. Each prompt was provided to both the GPT-3.5 and GPT-4 versions of ChatGPT. OpenAI indicates that the GPT-4 version is “more reliable, creative, and able to handle much more nuanced instructions” than the GPT-3.5 version (OpenAI, 2023). The enhanced functionality of GPT-4 should lead to better results, but we demonstrate that the newer model is still susceptible to frequent hallucinations.
The three levels of prompts are as follows:
Level 1: The first prompt, using JEL Category A here as an example, was “Please provide a summary of work in JEL Category A, in less than 10 sentences, and include citations from published papers.”
Level 2: The second prompt asked about a well-known topic within the JEL category. An example for JEL category Q is, “In less than 10 sentences, summarize the work related to the Technological Change in developing countries in economics, and include citations from published papers.”
Level 3: We used the word “explain” instead of “summarize” in the Level 3 prompt, asking about a more specific topic related to the JEL category. For JEL category L, we asked, “In less than 10 sentences, explain the change in the car industry with the rising supply of electric vehicles and include citations from published papers as a list. Include author and year in parentheses and journal for the citations.”
Using 19 JEL categories and three levels of prompts across the two versions of ChatGPT, we obtained 114 responses. The GPT-3.5 responses were collected in May 2023; the GPT-4 responses were collected in October 2023. The variety of topics demonstrates that hallucination is pervasive and not restricted to one subject. The variation in specificity allows us to test the hypothesis that hallucinations are more likely when less public training literature is available to the LLM.
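For concreteness, the design can be sketched in code. The following is a minimal illustration, not our actual collection script; the category list and the topic arguments are assumptions for demonstration (the actual Level 2 and Level 3 topics varied by JEL category, as in the examples above).

    # Minimal sketch of the prompt design (illustrative only).
    # The category list is an assumption; the study used 19 top-level JEL codes.
    JEL_CATEGORIES = list("ABCDEFGHIJKLMNOPQR") + ["Z"]  # 19 categories (assumed set)
    MODELS = ["GPT-3.5", "GPT-4"]

    def make_prompts(category, broad_topic, specific_topic):
        """Build the three prompt levels for one JEL category.

        broad_topic and specific_topic are placeholders; the actual
        topics varied by category, as in the examples above.
        """
        return [
            f"Please provide a summary of work in JEL Category {category}, "
            "in less than 10 sentences, and include citations from published papers.",
            f"In less than 10 sentences, summarize the work related to {broad_topic} "
            "in economics, and include citations from published papers.",
            f"In less than 10 sentences, explain {specific_topic} and include "
            "citations from published papers as a list. Include author and year "
            "in parentheses and journal for the citations.",
        ]

    # One response per (category, level, model) cell.
    n_responses = len(JEL_CATEGORIES) * 3 * len(MODELS)
    print(n_responses)  # 114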
We analyzed the responses in two ways. First, we used a measure that we argue is objective: whether the citation exists. Second, we judged whether the real citations are appropriate answers to our question, which we acknowledge is subjective.
Real: The citation is classified as real if we could find it, as written by ChatGPT, online. Internet search tools such as Google Scholar are comprehensive, so we expect that any paper or book that exists would be indexed by Google’s search engines. The citation is classified as not real if no such published document exists.
For example, in response to “explain how marketing in social media affects profit for businesses…” ChatGPT (the GPT-3.5 version) produced the following citation: Trusov, M., et al. (2009). Evaluating the Impact of Social Networks on Business Performance. Marketing Science, 28(2), 356-378. ChatGPT identified a scholar, Michael Trusov, who writes about social media. However, the cited paper does not exist, and ChatGPT failed to locate any of his papers that might have been a good match.
Correct: The citation is classified as correct if it is a proper source for answering the question. Consider the Level 3 prompt for JEL category R: “explain rental price changes during hurricane seasons…” ChatGPT (GPT-3.5) returned the following citation: Deryugina, T., et al. (2020). The long-run dynamics of electricity demand: Evidence from municipal aggregation. Journal of Urban Economics, 119, 103276. This citation is classified as not real because no paper with that title was published in the Journal of Urban Economics. It would also be classified as not correct, because the subject of the paper is residential electricity demand in Illinois, not hurricanes.
Although we recognize that academics might use different standards to evaluate the quality of references provided by ChatGPT, we believe that these two measures capture an important trend: the errors are significant and frequent. Readers are invited to view every response from GPT-3.5 in the Online Appendix and to construct their own error metrics. Figure 1 shows an example ChatGPT prompt and response, for a Level 1 prompt in JEL category B.
The primary metric used to evaluate a GPT response is the proportion of real citations: the number of real citations divided by the total number of citations provided in the response. The analogous proportion is computed for correct citations.
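As a worked illustration of the metric, consider a sketch in which each response stores its hand-coded citation labels. The data structure, field names, and example labels below are our own assumptions for demonstration, not the study’s actual data files.

    # Sketch: scoring one response from its hand-coded citation labels.
    # The structure and the example labels are illustrative assumptions.
    response = {
        "jel": "R",
        "level": 3,
        "model": "GPT-3.5",
        "citations": [
            {"real": False, "correct": False},  # e.g., the fabricated citation above
            {"real": True, "correct": True},
            {"real": True, "correct": False},   # exists, but does not answer the question
        ],
    }

    def proportion(citations, label):
        """Share of citations for which the given label is True."""
        return sum(c[label] for c in citations) / len(citations)

    print(proportion(response["citations"], "real"))     # ≈ 0.67 (2 of 3 exist)
    print(proportion(response["citations"], "correct"))  # ≈ 0.33 (1 of 3 is appropriate)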
Results
Figure 2 shows the proportion of real citations across each level of specificity. The GPT-4 version of ChatGPT outperforms the GPT-3.5 version, but both versions show a decline in the proportion of real citations as specificity increases. We report the proportion of real citations for each JEL category, by specificity level, in Table 1 (GPT-4) and Table 2 (GPT-3.5).
Figure 3 shows a similar result for the proportions of correct citations: the correctness of citations provided by ChatGPT (GPT-3.5 and GPT-4) also declines as the level of specificity in the prompt increases. In Table 1, an asterisk (*) indicates that ChatGPT did not provide references for that prompt.

ChatGPT returns a considerable proportion of false citations: more than 30% for GPT-3.5 and more than 20% for GPT-4. We test for a difference in the proportion of real citations between Level 1 and Level 2 using a nonparametric Mann–Whitney test. For both GPT-3.5 and GPT-4 there is no significant decline (p = .29 and p = .54, respectively) in accuracy going from a description of the JEL category to a broad topic in economics. We surmise that the Level 2 prompt was still general enough that ChatGPT had access to sufficient training material. Performance declined when ChatGPT was asked to explain more specific topics with a narrower set of available documents: there is a significant decrease in accuracy from Level 1 to Level 3 for both GPT-3.5 and GPT-4 (p < .002 for both). The pattern also holds for the metric of correct citations.
We find a significant decline in accuracy from a general topic to a more specific question posed to ChatGPT (GPT-3.5 and GPT-4).
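For readers who wish to replicate the comparison, a minimal sketch of the test follows, using the mannwhitneyu function from SciPy. The input arrays are placeholders standing in for the per-JEL-category proportions of real citations at each level; they are not the study’s actual data.

    # Sketch: Mann-Whitney comparison of per-category proportions of real
    # citations between two prompt levels. Placeholder numbers, not our data.
    from scipy.stats import mannwhitneyu

    level1 = [1.0, 0.8, 0.9, 0.7, 1.0, 0.6]  # proportion real at Level 1, by JEL category
    level3 = [0.5, 0.4, 0.7, 0.2, 0.6, 0.3]  # proportion real at Level 3, same categories

    stat, p = mannwhitneyu(level1, level3, alternative="two-sided")
    print(f"U = {stat}, p = {p:.3f}")  # a small p indicates a significant difference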
Conclusion
Academics, students, and researchers will use LLM tools (Geerling et al., 2023). We argue that fact-checking is necessary to prevent false statements from propagating in the literature. OpenAI (2022) acknowledged that, “ChatGPT sometimes writes plausible-sounding but incorrect or nonsensical answers.” We systematically document this issue in the context of citations from the economics literature.
LLMs simulate humans convincingly, but we argue that ChatGPT does not yet outperform experts in a subfield of knowledge. Researchers can study the output of LLMs as a new kind of data that provides insights into human behavior and biases, since it was trained on a vast amount of human writing (Bybee, 2023; Horton, 2023).
Hallucinations make LLMs different from previous printing and writing technologies. The issue of false academic citations might be solved if AI creators build a tool to prevent that specific type of error. Nonetheless, this article will serve as evidence that LLM output should be tested for factual accuracy and that errors are more likely to occur for obscure topics. The academic community of teachers and researchers should recognize that hallucinations are possible. The implications are equally important for management and public policy.
Acknowledgments
The authors declare that there is no conflict of interest regarding the publication of this article. Funding was provided by Samford University.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
