Abstract
The rapid advancement of artificial intelligence (AI) has spotlighted ChatGPT as a key technology in the realm of information retrieval (IR). Unlike its predecessors, it offers notable advantages that have captured the interest of both industry and academia. While some consider ChatGPT to be a revolutionary innovation, others believe its success stems from smart product and market strategy integration. The advent of ChatGPT and GPT-4 has ushered in a new era of Generative AI, producing content that diverges from training examples, and surpassing the capabilities of OpenAI’s previous GPT-3 model. Departing from the established supervised learning approach in IR tasks, ChatGPT upends traditional paradigms, introducing fresh challenges and opportunities in text quality assurance, model bias, and efficiency. This paper aims to explore the influence of ChatGPT on IR tasks, providing insights into its potential future trajectory.
Introduction
On November 30, 2022, OpenAI unveiled ChatGPT.
Indeed, ChatGPT heralded a new phase in Generative AI, distinct from previous models like GPT-3. This new generation of AI models, including ChatGPT, is capable of generating unique content, not just refining or predicting information based on training examples [19]. GPT-3.5 established a strong foundation with its robust capabilities [9,53,68]. GPT-4 further expanded these capabilities, offering enhanced understanding, accuracy, and contextual relevance. The evolution from GPT-3.5 to GPT-4 has shown great promise in numerous information retrieval tasks (e.g. [32,93]), particularly in text classification [34], document ranking [35], question-answering systems [11], and multimodal retrieval [96]. The introduction of ChatGPT, leveraging these advancements, has spurred progress in this field, highlighting the impressive abilities of large language models (LLMs) in understanding and generating semantic information.
Amid these rapid technological developments, ChatGPT has been applied in various practical settings. Notably, it powers Microsoft’s AI-driven search engine, New Bing, which is based on GPT-4.
This paper delves into the opportunities and challenges brought forth by ChatGPT in IR tasks. We also offer a forward-looking view on the future development of ChatGPT and its underlying GPT-X models, aiming to provide valuable insights for research and applications in related fields.
Comparison of pre-trained large language models in recent years
The field of information retrieval has experienced a remarkable transformation with the emergence of pretrained large language models (PLLMs). This evolution, progressing from initial simplistic models to the current advanced dense retrieval models, has significantly broadened the scope and capabilities of IR and related fields. A comparison of recent pre-trained language models, including their training datasets and parameter size, can be seen in Table 1.
The evolution of OpenAI’s Generative Pre-trained Transformer (GPT) series is a testament to the success of the Transformer architecture. GPT-1 laid the foundation, and subsequent versions, GPT-2 and GPT-3, dramatically expanded the scale and capabilities of these models. Notably, GPT-3, with its 175 billion parameters, demonstrated an impressive leap in generating human-like text and facilitating meaningful interactions.
ChatGPT represents a significant advancement in creating models capable of more meaningful and context-aware user interactions. Its deployment has demonstrated potential for a wide range of real-world applications, as highlighted by recent studies and deployments [37,41,43,48,58].
Comparison of ChatGPT, Llama-2, Bard, and Claude
In the era of large models, generative models represented by ChatGPT are introducing new perspectives and methodologies for the core task of information retrieval. IR systems aim to extract relevant information from enormous amounts of textual data. Traditional IR systems often rely on keyword matching. However, with the advent of neural networks and deep learning, IR is progressively evolving towards semantic-based retrieval [57].
The deep neural networks of GPT-X enable a profound understanding of text semantics, raising precision beyond traditional keyword-level matching to true semantic-level retrieval. Their generative framework allows for the formulation of precise query expressions and the generation of descriptive retrieval results, adding flexibility and expressiveness to IR. With zero- and few-shot learning capabilities, where models require little to no task-specific training data, these models reduce the necessity for extensive annotated data, making complex retrieval tasks more manageable. The end-to-end training methodology minimizes error propagation and directly optimizes performance from input to output, improving retrieval accuracy and efficiency. Furthermore, the potential for multimodal information retrieval extends the scope beyond text to encompass images and videos, offering richer and more accurate results. Lastly, integrating knowledge graphs leverages structured knowledge in the retrieval process while simultaneously aiding in the construction and updating of those graphs, thus providing a richer knowledge base for IR.
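As an illustration of the semantic-level retrieval discussed above, the sketch below ranks documents by vector similarity. The bag-of-words `embed` function is a deliberately crude stand-in of our own devising; a GPT-style system would use learned dense embeddings from a neural encoder, but the cosine-ranking machinery is the same.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a real system would use learned dense
    embeddings from a transformer encoder instead."""
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_search(query, docs, k=2):
    """Return the k documents most similar to the query vector."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "neural networks for document ranking",
    "cooking recipes for pasta",
    "deep learning improves semantic retrieval",
]
print(semantic_search("semantic document retrieval with neural models", docs))
```

With learned embeddings in place of `embed`, the same ranking loop captures paraphrases and synonyms that keyword matching misses.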
Information extraction
Information Extraction (IE) is a fundamental task in information retrieval, encompassing sub-tasks such as named entity recognition (NER) and event extraction (EE). IE has evolved significantly over the years. Initially, the focus was on structured and semi-structured data extraction, employing various techniques, tools, and systems to extract useful information automatically [16]. Early IE systems were primarily rule-based, relied on a large amount of human involvement, and were tailored for specific domains like chemical or medical search [5,33,51,52,67,94].
Transitioning into the contemporary period, the field has seen a shift towards employing deep learning technologies, which excel at extracting structured information from unstructured text without being confined to a particular domain [2,63]. The core idea of deep learning is to extract features from the original data, moving from low-level to high-level and from the concrete to the abstract through a series of non-linear transformations in a data-driven manner. These methods have significantly advanced the state of the art in various fields, including speech recognition, visual object recognition, and object detection, showcasing the efficacy of deep learning in handling complex IE tasks [89].
Moreover, researchers hope these large-scale language models can process text efficiently and extract valuable information without the necessity for retraining, potentially replacing manual annotation. However, multiple extensive IE experiments on ChatGPT show a significant performance gap between ChatGPT and state-of-the-art (SOTA) results on datasets with zero/few-shot IE sub-tasks [29,45,82,95]. Although the results are unsatisfactory, they spark new research perspectives in IE, such as the possibility that IE tasks can be decomposed into multiple simpler subtasks [82], a rethinking of the evaluation strategy might reflect a more accurate performance of ChatGPT [29], and ChatGPT’s performance can be significantly improved by prompt engineering [95].
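The decomposition idea from [82] can be sketched as a two-stage prompt chain: first identify entity spans, then type each span in a separate, simpler prompt. The `ask_llm` function below is a deterministic offline stand-in for a real ChatGPT API call, with canned answers that are purely illustrative; only the chaining structure is the point.

```python
def ask_llm(prompt):
    """Stand-in for a real chat-completion API call; replace with an
    actual client. It answers two toy prompts deterministically so
    the sketch runs offline."""
    if prompt.startswith("List the entity mentions"):
        return "OpenAI; ChatGPT"
    if prompt.startswith("Classify the entity"):
        name = prompt.split("'")[1]
        return "Organization" if name == "OpenAI" else "Product"
    return ""

def extract_entities(text):
    # Stage 1: span identification, framed as one simple sub-task.
    spans = ask_llm(f"List the entity mentions in: {text}").split("; ")
    # Stage 2: typing each span as a second, independent sub-task.
    return {s: ask_llm(f"Classify the entity '{s}' as Organization/Product.")
            for s in spans}

print(extract_entities("OpenAI released ChatGPT in 2022."))
```

Splitting the task this way gives each prompt a narrow, checkable job, which is precisely why decomposition can lift zero-shot IE performance.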
Text classification
In exploring text classification tasks in the era of large language models, it’s pertinent first to introduce the traditional and prevalent methodologies in text classification. Traditional text classification approaches generally rely on statistical learning paradigms such as Naive Bayes and K-Nearest Neighbors [3,59,103]. These methods entail substantial effort in feature engineering to construct meaningful representations of text. Subsequently, with the advent of deep neural networks, models like RNNs, CNNs, and Graph Neural Networks (GNNs) [91] have emerged as mainstream paradigms, significantly automating the construction of rich semantic representations of text.
Entering the era of LLMs, models like ChatGPT have markedly impacted text classification tasks. These models achieve high-quality semantic modeling of text from massive corpora through self-supervised pre-training, substantially enhancing performance in text classification tasks. Particularly in addressing open-domain tasks, domain adaptation, few-shot (where models learn from a small set of labeled examples), and zero-shot (where models generalize to unseen classes) problems, these large models exhibit impressive performance and exceptional generalization capabilities [14,60,71].
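A zero-shot classifier of this kind amounts to a prompt plus light output parsing, as sketched below. `fake_llm` is a keyword-matching stand-in so the example runs offline; a real pipeline would send the same prompt to a chat-completion API. The labels and keywords are illustrative.

```python
LABELS = ["sports", "politics", "technology"]

def zero_shot_prompt(text):
    # No labeled examples: the candidate labels alone steer the model.
    return (f"Classify the text into one of {LABELS}.\n"
            f"Text: {text}\nLabel:")

def fake_llm(prompt):
    """Offline stand-in for an LLM call: picks the label whose keyword
    appears in the prompt. A real system would query the API instead."""
    for label, kw in [("sports", "match"), ("politics", "election"),
                      ("technology", "software")]:
        if kw in prompt:
            return label
    return LABELS[0]

def classify(text, llm=fake_llm):
    answer = llm(zero_shot_prompt(text)).strip().lower()
    # Guard against free-form answers: fall back to the first label.
    return answer if answer in LABELS else LABELS[0]

print(classify("The new software release improves retrieval speed."))
```

The final guard matters in practice: LLM outputs are free text, so mapping them back onto the closed label set is part of the classifier.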
ChatGPT can be utilized to undertake a knowledge graph extraction task to obtain refined and structured knowledge from raw data. The collected knowledge is then transformed into a graph, which is subsequently utilized to train an interpretable linear classifier to render predictions, exhibiting impressive performance [70].
In scenarios with few or zero examples, LLMs leverage pre-trained knowledge to achieve satisfactory classification outcomes, mitigating the dependency on large labeled datasets inherent in traditional methods. This capability is invaluable in domains encumbered by limited training data due to costly and labor-intensive annotation processes [98]. In addition, high-quality categorization lays a solid foundation for accurate and efficient annotation, thus potentially speeding up the annotation process, reducing costs, and improving the overall quality of the annotated data, which greatly benefits the text annotation task [26].
From an information retrieval perspective, text classification serves as a crucial mechanism for ranking and categorizing textual data, aiding in the efficient retrieval and management of information. Combining the knowledge graph and few-shot learning capabilities based on LLMs, text classification tasks can extract and utilize relevant information from extensive data, achieving more accurate and efficient categorization.
Document ranking
Document ranking is a crucial process in information retrieval systems, determining the order in which retrieved documents are presented based on their estimated relevance to a query. Historically, the methodologies employed for document ranking have predominantly centered on term-based matching, leveraging standard techniques such as Term Frequency-Inverse Document Frequency (TF-IDF) and BM25 [65]. These traditional approaches assess the significance of terms within documents and their corresponding relevance to the query at hand [100,101]. However, they often fall short in capturing the semantic relationships between terms and may overlook contextual relevance, which is increasingly important in refining the precision of document retrieval.
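The BM25 scoring mentioned above can be written out directly. The minimal implementation below uses the standard Okapi formula with the usual defaults (k1 = 1.5, b = 0.75) over pre-tokenized documents; it is a sketch for exposition, not a production scorer.

```python
import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of one tokenized document against a query,
    computed over a small in-memory corpus."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N  # average document length
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in docs if term in d)          # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
        f = tf[term]
        # Term-frequency saturation with length normalization.
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

docs = [["neural", "ranking", "models"],
        ["pasta", "recipes"],
        ["ranking", "with", "bm25", "ranking"]]
query = ["ranking", "bm25"]
scores = [bm25_score(query, d, docs) for d in docs]
print(scores)
```

Note how the saturation term caps the benefit of repeating a query word, one of the ways BM25 improves on raw TF-IDF.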
Transitioning into the modern era, machine learning has found a foothold in document ranking through methods like Learning to Rank [49], which predicts a relevance score for each document-query pair, ranking documents accordingly. Thereafter, deep learning models started gaining traction. CNNs, RNNs, and attention-based mechanisms such as BERT [20] have been employed to enhance the representation of text data and improve the understanding of natural language queries [42]. Recently, the focus has also shifted towards dense retrieval and re-ranking models [38]. Dense retrieval models propose a more accurate approach to document ranking tasks by embedding both documents and queries in a continuous vector space. Re-rankers take an initial set of retrieved candidates and re-sort them based on relevance scores, ensuring a more reliable list of results in response to a query.
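The retrieve-then-rerank pattern described above can be sketched in a few lines. Here a cheap lexical filter provides recall, and `overlap_score` (Jaccard similarity) stands in for what would, in practice, be a learned cross-encoder re-ranker such as a BERT model; both scorers are illustrative stand-ins.

```python
def first_stage(query, docs, k=3):
    """Cheap lexical recall: keep documents sharing any query term."""
    q = set(query.lower().split())
    hits = [d for d in docs if q & set(d.lower().split())]
    return hits[:k]

def rerank(query, candidates, score):
    """Re-sort candidates by a finer relevance score; in practice
    `score` would be a neural cross-encoder."""
    return sorted(candidates, key=lambda d: score(query, d), reverse=True)

def overlap_score(query, doc):
    # Jaccard overlap as a toy stand-in for a learned scorer.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)

docs = ["bm25 term weighting", "dense retrieval with encoders",
        "gardening tips", "dense retrieval re-ranking pipelines"]
cands = first_stage("dense retrieval pipelines", docs)
print(rerank("dense retrieval pipelines", cands, overlap_score))
```

The two-stage split is the key design choice: the expensive scorer only ever sees the handful of candidates the cheap stage lets through.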
The advent of large language models has opened new possibilities for document ranking in IR. Investigations have revealed that ChatGPT can deliver competitive or even superior ranking performance compared to supervised methods on popular IR benchmarks when properly instructed [72,79]. The emergence of GPT-4 has further pushed the boundaries, showcasing AI-driven document ranking, significantly impacting the search engine domain [72]. In addition, a human-involved experiment comparing the search performance and user experience of ChatGPT and Google Search points to practical insights. Although ChatGPT cannot always outperform Google Search, it considerably enhances work efficiency and increases user satisfaction [87].
Furthermore, domain-specific document ranking emerges as a promising area for the application of GPT-4. Presently, ranking methods heavily rely on training data and fine-tuning. However, the scarcity of high-quality annotated datasets in specialized domains such as medicine and law poses a significant challenge, impeding the efficacy of deploying pre-trained models for ranking documents [36]. LLMs like GPT-4, endowed with expansive knowledge and pronounced generalization capability due to their vast training data spectrum, present a viable solution. These models hold the potential to serve as data augmentation tools in such contexts, synthesizing pseudo-label data that could improve the performance of retrieval models in data-scarce situations [22,78]. By generating synthetic yet relevant data, GPT-4 could significantly enhance the model’s ability to accurately rank documents in domain-specific scenarios, thereby bridging the data gap and facilitating improved performance in document retrieval tasks.
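The pseudo-label augmentation idea can be sketched as below, in the spirit of doc2query-style methods: for each unlabeled domain document, ask an LLM to invent a query that the document answers, yielding synthetic (query, positive document) training pairs. The `llm` parameter is a hypothetical callable; when none is supplied, the offline fallback fabricates a query from the document's opening words so the example runs.

```python
def draft_query(doc, llm=None):
    """Ask an LLM to invent a search query answered by `doc`. The
    offline fallback simply reuses the document's first three words;
    a real setup would call the ChatGPT/GPT-4 API with `prompt`."""
    prompt = f"Write a short search query answered by: {doc}"
    if llm is not None:
        return llm(prompt)
    return " ".join(doc.split()[:3])

def synthesize_pairs(corpus):
    """Build (query, positive document) pairs for a data-scarce domain,
    suitable as weak supervision for training a retrieval model."""
    return [(draft_query(d), d) for d in corpus]

medical_docs = ["aspirin reduces fever and mild pain",
                "insulin regulates blood glucose levels"]
for q, d in synthesize_pairs(medical_docs):
    print(q, "->", d)
```

The resulting pairs are noisy by construction, so they are typically filtered or used only to fine-tune, not to evaluate.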
Conversational search
Conversational Search (CS) has evolved significantly over the years, transitioning from rule-based models to the more advanced machine learning and deep learning models prevalent today [39,105]. Traditionally, it is divided into two main subtasks: task-oriented and open-domain (interactive) dialogue. Task-oriented conversational IR (Information Retrieval) systems employed a pipeline approach, integrating modules such as intent recognition, dialogue management, and response generation to handle user interactions [99]. Conversely, open-domain conversational IR systems aim to engage users in more social and less goal-directed conversations. Initially, these systems relied on retrieval-based approaches, functioning like an IR system that selects related responses from a pre-built repository, but the advent of generative models allowed for more fluid and natural responses [1,85,106].
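The modular pipeline of a task-oriented system can be sketched with toy components: keyword intent detection, a trivial dialogue manager, and templated response generation. Each stage here is a deliberately simple stand-in; in a deployed system every one of them would be a learned model, and the intents and templates shown are invented for illustration.

```python
def detect_intent(utterance):
    """Toy intent classifier; production systems use learned models."""
    if "weather" in utterance.lower():
        return "get_weather"
    return "chitchat"

def manage_dialogue(intent, state):
    """Trivial dialogue manager: track turn count, map intent to action."""
    state["turns"] = state.get("turns", 0) + 1
    return {"get_weather": "call_weather_api", "chitchat": "small_talk"}[intent]

def generate_response(action):
    """Templated responses; a generative model would replace this."""
    return {"call_weather_api": "It looks sunny today.",
            "small_talk": "Tell me more!"}[action]

state = {}
intent = detect_intent("what's the weather like?")
print(generate_response(manage_dialogue(intent, state)))
```

Part of what LLM-based systems change is collapsing these three hand-wired stages into a single model that handles intent, state, and generation jointly.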
The introduction of transformer-based models, such as OpenAI’s ChatGPT, Anthropic AI’s Claude, and Google’s LaMDA [74], marked a paradigm shift in the domain of CS. These models’ capability to generate human-like text based on a given context has expanded the horizons of what’s possible in task-oriented and open-domain CS systems.
Several opportunities arise as the field advances with contributions from models like ChatGPT and GPT-4. Thanks to these models’ impressive intent understanding, semantic parsing, and API integration capabilities, the union of task-oriented and open-domain dialogues under a single technical framework is now attainable. This union could lead to the development of CS systems that are not only functional but also emotionally intelligent, catering to the practical needs of users. Moreover, the pursuit of creating more personalized CS systems remains a significant area of research and development. Advancements in these areas are expected to push CS systems closer to delivering a truly human-like and enriching conversational experience.
Multimodal retrieval
In the realm of multimodal retrieval, the transition from traditional methods to cutting-edge techniques showcases a remarkable development. Initially, traditional multimodal retrieval predominantly fell under the Nearest Neighbor (NN) problem [10]. However, these methods struggled to bridge the semantic gap between low-level features (such as color, texture, and shape) and users’ high-level informational needs. As technology advanced, the focus shifted towards crafting unified representations for data across different modalities, such as text, images, audio, and video, aiming to foster seamless and enriched interactions between these modalities [102].
The field then embraced cross-modal retrieval, emphasizing the importance of modeling relationships between different modalities. This approach allowed users to retrieve desired information by submitting data in one modality to fetch related data in another, marking a significant stride towards enhancing accuracy and scalability in retrieval [30,77]. Additionally, the emergence of retrieval-augmented multimodal models began integrating external knowledge more scalably and modularly. For a given input text, such models use a retriever to fetch relevant documents from external sources and a generator (often a language model) to produce predictions based on the acquired information. Typically, these external sources include text corpora and structured knowledge bases. However, retrieval-augmented methods were initially researched for text, and extending them to the multimodal domain remains challenging. The main difficulty lies in the design of the retriever and generator that can handle multimodal documents containing both images and text.
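The retriever-plus-generator pattern described above can be shown in miniature for the text-only case: a term-overlap retriever fetches evidence, and a generator stub stitches it into an answer. Both components are toy stand-ins; a real pipeline would use BM25 or a dense encoder for retrieval and pass the evidence to an LLM prompt for generation.

```python
def retrieve(query, corpus, k=1):
    """Toy retriever: rank documents by shared terms with the query."""
    q = set(query.lower().split())
    return sorted(corpus,
                  key=lambda d: len(q & set(d.lower().split())),
                  reverse=True)[:k]

def generate(query, evidence):
    """Stand-in generator: a real system would place `evidence` into an
    LLM prompt and let the model compose the answer."""
    return f"Based on: '{evidence[0]}', answering: '{query}'"

corpus = ["The Eiffel Tower is in Paris.",
          "Mount Fuji is in Japan."]
hits = retrieve("Where is the Eiffel Tower?", corpus)
print(generate("Where is the Eiffel Tower?", hits))
```

The multimodal extension discussed in the text keeps this exact skeleton but requires both components to handle image-and-text documents.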
Addressing this challenge, the Retrieval-Augmented Text-to-Image Generator (Re-Imagen) [13] represents a significant advancement. Utilizing a diffusion-based method, this model generates high-fidelity images that are remarkably accurate, even when depicting entities not previously encountered. The process hinges on the effective use of information retrieved from external sources, enabling the creation of visually precise representations. Similarly, the Multimodal Retrieval-Augmented Transformer (MuRAG) [12] focuses on answering natural language questions using image retrieval methods. Although these works concentrate on generating single modalities (text or image), RA-CM3 [92] proposed a comprehensive and unified model capable of retrieving and generating both images and text. Notably, the generator model develops capabilities such as controlled image generation in a contextual learning framework through retrieval-enhanced training.
The debut of GPT-4 notably impacted the field of multimodal retrieval, ushering in an era closer to human-like AI. GPT-4 is a large multimodal model capable of processing both text and image inputs while delivering text outputs, pushing closer to human-level performance on various benchmarks, albeit with certain limitations in real-world scenarios [47]. Meanwhile, ChatGPT has been empowered by GPT-4V(ision) [55], boosting its multimodal capabilities. For instance, the integration of DALL-E 3 allows ChatGPT to generate images directly from natural-language descriptions.
The arrival of large language models marked a significant milestone in bridging the semantic gap between multiple types of information, paving the way for more intuitive and rich interactions across diverse data modalities. In recommendation systems, LLMs have shown immense promise [80]. They foster a more comprehensive understanding of user preferences and behaviors by integrating information from various sources and modalities. For example, a recommendation system powered by a multimodal LLM can analyze textual reviews, image-based preferences, and purchase histories to generate more accurate and personalized product recommendations. Moreover, by understanding the semantic relationships between different items and user interactions, these models can provide a more enriched and personalized user experience, thereby enhancing user satisfaction and engagement.
Similarly, the medical field has seen substantial advancements by incorporating LLMs [31,46]. In clinical settings, multimodal LLMs can assist in synthesizing information from diverse sources such as electronic health records, medical imaging, and genomic data to provide more comprehensive and personalized insights. This holds vast potential to support diagnostic processes, treatment planning, and personalized medicine. For instance, integrating textual clinical notes with medical imaging data can empower clinicians with a more holistic understanding of a patient’s condition, enabling better-informed decision-making.
ChatGPT illustrates the dual nature of technological advances in AI. On one hand, it can greatly enhance the productivity of users from all walks of life thanks to its excellent language comprehension and generation capabilities. Whether in education, business, or personal assistance, ChatGPT is a powerful tool that facilitates task completion, inspires creativity, and supports sound decision-making.
On the other hand, it raises ethical dilemmas associated with misinformation, disinformation, and the potential misuse of fabricating deceptive or harmful content. Its remarkable ability to produce realistic text blurs the boundaries of information authenticity, making it challenging for individuals to discern real from fake content. These risks temper ChatGPT’s seemingly limitless possibilities and underscore the need for ethical regulation to accompany such groundbreaking innovations.
Hallucination
The challenge of hallucination in large language models, underscored by Google AI researchers in 2018 [44], presents a formidable hurdle to their deployment. Hallucination, a phenomenon where models generate convincing yet factually incorrect or misleading content, harbors serious risks. This is particularly concerning in critical applications such as decision-making, where the propagation of false information can lead to adverse outcomes [6,66]. OpenAI, the developer of ChatGPT, has recognized the concerns regarding the model’s propensity for factual inaccuracies and is actively pursuing measures to mitigate this issue.
Information retrieval strategies are poised to be instrumental in addressing the hallucination challenge. A viable approach could be establishing a continuous feedback loop wherein the model’s outputs are rigorously evaluated, and refinements are made based on identified inaccuracies. This iterative process aims to bolster the model’s accuracy and reliability over time. Specifically, integrating IR models to work in tandem with LLMs could present a robust solution [21]. By augmenting LLMs with updated and accurate information extracted from external sources, IR models can potentially curtail the generation of factually inaccurate responses, thus mitigating the occurrence of hallucinations.
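One way to couple IR with an LLM's outputs, as suggested above, is a post-hoc support check: each generated claim is tested against retrieved evidence and flagged when unsupported. The sketch below uses crude term overlap with a threshold tuned to this toy data; a real system would use an entailment model over documents fetched by an actual retriever.

```python
def supported(claim, evidence_docs, threshold=0.8):
    """Crude support check: the fraction of claim terms found in the
    best single evidence document. Real systems would use a trained
    entailment or fact-verification model instead."""
    terms = set(claim.lower().split())
    best = max(len(terms & set(d.lower().split())) / len(terms)
               for d in evidence_docs)
    return best >= threshold

evidence = ["ChatGPT was released by OpenAI in November 2022."]
claims = ["ChatGPT was released by OpenAI in 2022.",
          "ChatGPT was released by Google in 1995."]
print([supported(c, evidence) for c in claims])
```

Unsupported claims would then feed the feedback loop described above: they are either regenerated with the retrieved evidence in the prompt or surfaced to the user with a warning.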
Ethical and safety concerns
The ethical and safety concerns surrounding ChatGPT are multi-faceted, arising from its profound language understanding and generation capabilities. As advanced iterations of language models developed by OpenAI, these models harbor significant expectations alongside concerns due to their potential transformative impact on society [104].
The expansive training data and the complex nature of these models introduce risks associated with bias and fairness. The training material, sourced from human-generated content, may inadvertently perpetuate existing societal biases. Instances where models exhibited gender or racial biases are emblematic of this problem. These biases can manifest across various applications, potentially leading to unfair or discriminatory outcomes [18].
Moreover, the emergence of generative AI poses challenges related to misinformation and abuse. Their ability to generate text can be leveraged to fabricate misleading information, contribute to online misinformation campaigns, or even generate harmful or abusive content. The lack of source attribution in responses generated by ChatGPT exacerbates this issue, as users may struggle to discern the veracity of the generated content [64]. The potential misuse of generative AI for criminal activities such as fraud or harassment is another significant concern. LLMs can be employed to create realistic fake content for nefarious purposes, thereby reducing costs and increasing the efficiency of executing fraudulent activities.
In an IR system, relevance is often prioritized, which can lead to insufficient diversity in the results [33,94]. Frequently, the most prominent subtopic groups dominate the search results, marginalizing minority topics. This imbalance forces users to exert extra effort to find items on less common topics and yields a partial, skewed view of the available information. Moreover, the tendency of users to click primarily on top search results can perpetuate a cycle of unfairness: ranking or recommendation algorithms that incorporate user feedback tend to keep these items in top positions, creating a feedback loop that entrenches the bias. This is particularly problematic in systems like ChatGPT, where an initial response with an ethical issue can be challenging to correct internally; if users trust such a response, repeated interactions can exacerbate the problem. Quantifying bias in terms of gender and age can help address these challenges [25]. A fairness-aware ranking algorithm that accounts for these factors can have a positive impact, and framing fairness as an optimization problem opens up further approaches [23]. Implementing a fairness-constrained reinforcement learning algorithm can help balance relevance with the need for diversity and fairness in IR systems [24].
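A fairness constraint on a ranked list can be as simple as capping the number of early slots any one topic group may occupy, a greedy sketch of the constrained-ranking idea discussed above. The cap, the topic labels, and the grouping function are all illustrative; published methods formulate this as a proper constrained optimization rather than a greedy pass.

```python
def fair_rerank(ranked, topic_of, max_per_topic=2):
    """Greedy fairness constraint: preserve relevance order, but allow
    at most `max_per_topic` early slots per topic group, deferring the
    overflow so minority topics surface sooner."""
    counts, head, deferred = {}, [], []
    for doc in ranked:
        t = topic_of(doc)
        if counts.get(t, 0) < max_per_topic:
            counts[t] = counts.get(t, 0) + 1
            head.append(doc)          # within quota: keep its position
        else:
            deferred.append(doc)      # over quota: push down the list
    return head + deferred

ranked = ["pop1", "pop2", "pop3", "niche1", "pop4", "niche2"]
topic = lambda d: "popular" if d.startswith("pop") else "niche"
print(fair_rerank(ranked, topic))
```

Even this crude cap breaks the feedback loop described above: deferred majority items no longer monopolize the positions that attract most clicks.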
Furthermore, the scalability of these models amplifies privacy concerns. As models become larger and necessitate more computational resources, the need to offload processing to cloud servers escalates. This centralization can heighten the risk of data breaches and misuse of personal information, especially if adequate measures are not in place to secure user data.
Overall, the array of ethical and safety concerns emanating from the deployment of ChatGPT underscores the imperative of diligent oversight, robust regulatory frameworks, and continuous dialogue among stakeholders to ensure the responsible development and use of these transformative technologies.
Interpretability
As language models become more complex with increased parameters and depth, their decision-making processes become less interpretable. This complexity also challenges understanding the vector and parameter representations within deep neural networks [17].
Characterizing large language models as “black-box” models summarizes this fundamental challenge in deploying and trusting these systems [28]. While the user can observe the inputs and outputs, the intricacies of the processes in between remain hidden, preventing a clear understanding of how the model derives a particular output from a given input. This opacity extends to an inability to discern what aspects of the input data the model considers important, obscuring interpretability.
The main reason for this challenge is that while LLMs are good at recognizing patterns and correlations in data, they lack a grasp of causality [86]. This inadequacy is particularly evident in decision-making. Moreover, LLMs are prone to inherit biases present in the training data, which highlights another dimension of the interpretability challenge. Any bias may permeate the model’s behavior, leading to anomalous or unfair results. Diagnosing and mitigating these biases becomes difficult without a clear window into the model’s inner workings.
The challenge of interpretability is further exacerbated by the unpredictability of LLMs when confronted with new or adversarial inputs. These models may exhibit erratic behavior on unexpected input scenarios, which is difficult to address without an interpretability framework. Improving the interpretability of LLMs is, therefore, not just an academic exploration but a pragmatic need to ensure responsible and credible deployment of these models, especially as they enter increasingly sensitive and critical domains. Uncovering the “black box” nature of LLMs and building robust interpretability frameworks is necessary for the responsible development of machine learning and AI.
Retrieval-Enhanced Machine Learning [96] presents a promising approach to addressing the issue of interpretability. In pre-trained language models, the training knowledge is embedded within the learned model parameters, making it difficult to understand model predictions. In contrast, when the reasoning process relies on retrieved information, predictions can be directly linked to specific data, typically stored in an accessible text format. This feature improves the interpretability of the model’s outputs. Additionally, Aspect Learning [40] can further enhance interpretability. By incorporating aspects, the model not only grasps general language semantics, like other pre-trained models, but also acquires domain-specific knowledge, enabling it to identify aspects relevant to a particular domain. These “explicit aspects” significantly improve interpretability, as the retrieved documents are expected to share similar aspects (or categories) with the input query.
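The interpretability advantage of retrieval-based prediction can be made concrete: the answer is returned together with the specific stored example that produced it, provenance that a purely parametric model cannot offer. The nearest-neighbour matcher below is a toy stand-in for the retrieval component; the labeled examples are invented for illustration.

```python
def knn_predict(query, labeled_docs):
    """Retrieval-based prediction with provenance: return the label of
    the best-matching stored example together with that example's text,
    so the prediction can be traced to concrete evidence."""
    q = set(query.lower().split())
    best = max(labeled_docs,
               key=lambda item: len(q & set(item[0].lower().split())))
    text, label = best
    return label, text

data = [("the match ended two goals to one", "sports"),
        ("parliament passed the new budget", "politics")]
label, why = knn_predict("who won the match", data)
print(label, "| evidence:", why)
```

The second return value is the whole point: a user (or auditor) can inspect exactly which stored text justified the prediction, in line with the interpretability argument above.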
OpenAI has initiated efforts to automate the interpretability of large language models by using GPT-4 itself to generate and score explanations of neuron behavior in other language models [8]. This initiative aims to uncover how different parts of the neural network operate, although the technique still struggles with larger models, indicating room for improvement. The initiative by OpenAI represents a significant stride towards demystifying the operations of LLMs, hoping to foster more responsible and effective use of these powerful tools in various domains.
Conclusions and future directions
ChatGPT signifies a remarkable stride in Generative AI, enriching multiple information retrieval tasks. Models of this kind excel in understanding and generating textual content, with applications extending to various practical and academic domains such as healthcare, education, and programming, thus reshaping traditional paradigms. However, this advancement isn’t without challenges.
Ethical dilemmas such as misinformation, disinformation, and potential misuse for harmful content generation pose serious concerns. The issue of hallucination, generating incorrect or misleading content, highlights the need for robust mechanisms to ensure accuracy and reliability. Furthermore, the challenge of interpretability remains a substantial hurdle. The “black box” nature of these models hinders transparency in their decision-making processes, which is essential for responsible AI deployment, especially in critical domains.
In addressing these challenges, recent works in IR have made strides in these areas. We note that fairness retrieval methods have shown the potential to mitigate biases in PLLMs, promoting more equitable and unbiased content generation. Additionally, the application of retrieval-enhanced learning methods has been identified as beneficial in tackling interpretability issues. By integrating context-rich information into the learning process, these methods can provide insights into the decision-making mechanisms of these complex models.
The advent of ChatGPT embodies the broader narrative of AI development, filled with promises of technological innovation and the imperative of addressing ethical, safety, and privacy challenges. Continued research and proactive steps to mitigate these challenges while exploring new ways to harness the power of these models responsibly will help navigate the complexities of AI. Collaborative efforts among researchers, practitioners, and policymakers are pivotal in realizing a future where AI significantly enhances human capabilities while preserving ethical and social values.
Acknowledgements
We express our sincere gratitude to the reviewers for their insightful comments and to the editor for their valuable assistance, both of which have significantly contributed to the enhancement of this paper. This research is supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada and the York Research Chairs (YRC) program.
