Retrieval-Augmented Generation-Based Relation Extraction

Abstract

Information extraction (IE) is a transformative process that converts unstructured text data into a structured format by employing entity and relation extraction (RE) methodologies. Identifying the relation between a pair of entities plays a crucial role within this framework. Despite the availability of various techniques for RE, their efficacy heavily depends on access to labeled data and substantial computational resources. To address these challenges, large language models (LLMs) have emerged as promising solutions; however, they are prone to generating hallucinated responses due to the limitations of their training data. To overcome these shortcomings, this work proposes a retrieval-augmented generation-based relation extraction (RAG4RE) approach to enhance RE performance. We evaluate the effectiveness of RAG4RE using various LLMs. By leveraging established benchmarks such as TACRED, TACREV, Re-TACRED and SemEval RE datasets, we aim to comprehensively assess the efficacy of our methodology. Specifically, we employ prominent LLMs, including Flan T5, Llama2, and Mistral, in our investigation. The results of our work demonstrate that RAG4RE outperforms traditional RE methods based solely on LLMs, with significant improvements observed in the TACRED dataset and its variations. Furthermore, our approach exhibits remarkable performance compared to previous RE methodologies across both TACRED and TACREV datasets, underscoring its efficacy and potential for advancing RE tasks in natural language processing.

Keywords

relation extraction, large language models, retrieval-augmented generation, RAG, RAG4RE

1. Introduction

Information extraction (IE) is a process of converting unstructured text data into structured data by applying entity and relation extraction approaches. Identifying the relation between a pair of entities in a sentence, relation extraction (RE), is one of the most significant tasks in the IE pipeline (Grishman, 2015). RE plays a pivotal role in constructing domain-specific knowledge graphs (KGs) from text data and ensuring the completeness of KGs. An example of a relation type between entity pairs, such as per:cities_of_residence, is illustrated in Figure 1, where the head entity (Eugenio Vagni) is linked to the tail entity (Sulu). Various RE approaches have been developed, including supervised RE, distant supervision, unsupervised RE methods, rule-based and semi-supervised approaches (e.g., weakly supervised RE) (Agichtein & Gravano, 2000; Aydar et al., 2021; Efeoglu, 2022; Pawar et al., 2017). However, well-performing RE approaches, for example, supervised learning, require a large amount of labeled data and substantial computation time. Another effective method for identifying relation types between entities is fine-tuning language models (Chen et al., 2024; Cohen et al., 2022; Han et al., 2022; Li et al., 2023; Wang et al., 2022; Zhou & Chen, 2022). It is important to note, however, that both supervised learning approaches and fine-tuning language models demand significant GPU memory and computational time during their training phase.

Figure 1.

Example of a Relation Between Head and Tail Entities in a Sentence.

General-purpose large language models (LLMs) exhibit remarkable inference capabilities when applied with zero-shot prompting techniques, allowing them to effectively handle key tasks in IE, such as entity recognition (ER) and RE. However, they are prone to generating hallucinated outputs when lacking prior knowledge, owing to the next-token prediction mechanism inherent in these autoregressive models. Additionally, LLM prompt-tuning approaches require both prompt template engineering and domain experts for domain-specific IE approaches (Chen et al., 2024). However, template engineering is time-consuming due to its manual nature. Retrieval-augmented generation (RAG) has been proposed to reduce hallucinations in LLMs when LLM-based conversational systems produce random responses to queries (Lewis et al., 2020). The RAG system functions akin to an open-book exam, integrating relevant information from the Embedding Database directly into the query (sentence) (Lewis et al., 2020). Although there have been attempts to apply the LLM approach in conjunction with zero-shot prompting techniques, such as multiple-choice questioning (Zhang et al., 2023) and rationale prompting (Xiong et al., 2023), these works underperform on RE benchmarks due to a high number of false predictions. Specifically, they do not incorporate relevant context or information into the query sentence within the prompt template. As a result, the RAG, using zero-shot settings, could improve and reduce false predictions of LLMs in identifying relation types between entity pairs in sentences.

Well-performing RE approaches, for example, supervised learning, require a large amount of labeled training data and significant computational time, as they learn RE patterns in a supervised manner. Another effective method, fine-tuning language models, requires considerable GPU memory and computational time, particularly when both the base model and the training data are large, as the base model weights and training data must be loaded onto the GPU to facilitate efficient training (Han et al., 2022). In the era of LLMs, well-designed prompts (or prompt engineering approaches) might help us identify the relations between entities in a sentence. Well-designed prompts have yielded highly accurate results in other downstream tasks, for example, ontology-driven knowledge graph generation (Mihindukulasooriya et al., 2023), text-to-image generation (Ahmad et al., 2023), where information about fictional characters is included in the prompt template in a zero-shot setting, and ontology matching (Hertling & Paulheim, 2023), where information about the concept to be matched is included in the prompt. Building on previous works that utilize zero-shot prompting, enriching the context of the prompt could provide task-relevant information to the LLMs, improving their responses. This can be done while still preserving the zero-shot settings of the prompt within the context of RAG.

In this work, our goal is to explore the potential performance enhancement in relation extraction between entity pairs in a sentence through the use of a retrieval-augmented generation-based relation extraction (RAG4RE) approach,¹ which leverages zero-shot settings. Specifically, we propose a pipeline for RAG-based relation extraction that utilizes open-source language models. To evaluate our RAG4RE, we leverage RE benchmark datasets, such as TACRED (Zhang et al., 2017), TACREV (Alt et al., 2020), Re-TACRED (Stoica et al., 2021), and SemEval (Hendrickx et al., 2010) RE datasets. We utilize both encoder–decoder models, for example, Flan T5 (Thoppilan et al., 2022), an instruction fine-tuned variant of the T5 model (Chung et al., 2024), and decoder-only models like Llama2 (Touvron et al., 2023) and Mistral (Jiang et al., 2023), all of which are integrated into the approach outlined in Figure 2. Furthermore, we compare the performance of our RAG4RE approach with that of simple query (or vanilla) prompting to highlight how incorporating relevant contextual information into the prompt improves results and reduces false predictions. In this work, our findings are:

– The RAG-based RE (RAG4RE) approach has the potential to outperform both the simple query (without a relevant sentence), known as vanilla LLM prompting, and the best-performing RE approaches from previous studies.

– While decoder-only LLMs (Pan et al., 2024) still encounter hallucination issues on these datasets, our RAG4RE effectively mitigates this problem, especially when compared to the results obtained from the simple query.

Figure 2.

RAG-Based Relation Extraction (RAG4RE) Pipeline, Featuring a Sample Query Sentence and the Corresponding Similar Sentence Retrieved.

In the following section, we first summarize recent works in RE and RAG in Section 2, and then provide a detailed description of the proposed RAG4RE in Section 3. We evaluate our RAG4RE on RE benchmark datasets, integrating different types of LLMs in Section 4. Next, we conduct ablation studies, which yield promising results for SemEval and provide inspiration for applying RAG4RE to domain-specific datasets in Section 5. Subsequently, Section 6 discusses RAG4RE’s results in comparison to those of previous approaches. Finally, we summarize the outcomes of our RAG4RE approach in Section 7.

2. Related Works

In this section, we summarize recent works into two categories: (i) relation extraction and (ii) retrieval-augmented generation.

2.1. Relation Extraction

RE is one of the main tasks of Information Extraction and plays a significant role among natural language processing tasks. RE aims to identify or classify the relations between entity pairs (head and tail entities).

RE can be carried out with various types of approaches: (i) supervised techniques including feature-based and kernel-based methods, (ii) a special class of techniques which jointly extract entities and relations (semi-supervised), (iii) unsupervised, (iv) Open IE, and (v) distant supervision-based techniques (Pawar et al., 2017). Supervised techniques require a large annotated dataset, and the annotation process is time-consuming and costly (Pawar et al., 2017). Distant supervision is one of the popular methods for dealing with the problem of obtaining annotated data. However, distant supervision, which is based on existing knowledge bases, brings its own drawbacks: wrongly labeled sentences introduce an excessive amount of noise that disturbs training (Aydar et al., 2021). Another popular approach is weakly supervised RE (Agichtein & Gravano, 2000). However, the weakly supervised approach is more error-prone because of semantic drift in the set of patterns at each iteration of an incremental learning approach such as the Snowball algorithm (Agichtein & Gravano, 2000). In rule-based RE approaches, finding relations is mostly restricted by predefined rules (Pawar et al., 2017).

In terms of the best-performing RE approach, obtained by fine-tuning the language models, Cohen et al. (2022) proposed a span-prediction-based approach for relation classification instead of single embedding to represent the relations between entities. This approach has improved the state-of-the-art scores on the well-known datasets. DeepStruct (Wang et al., 2022) proposed an innovative approach aimed at enhancing the structural understanding capabilities of language models. This work introduced a pre-trained model comprising 10 billion parameters, facilitating the seamless transfer of language models to structure prediction tasks. Specifically, regarding the RE task, the output format entails a structured representation of (head entity, relation, tail entity), while the input format comprises the input text along with a pair of head and tail entities. Zhou and Chen (2022) concentrated on addressing two critical issues that affect the performance of existing sentence-level RE models: (i) Entity Representation and (ii) noisy or ill-defined labels. Their approach extends the pretraining objective of masked language modeling to entities and incorporates a sophisticated entity-aware self-attention mechanism, enabling more accurate and robust RE. Li et al. (2023) proposed a label graph to review candidate labels in the top-K prediction set and learn the connections between them. When predicting the correct label, they first compute that the top-K prediction set of a given sample contains useful information.

Zhang et al. (2023) generated multiple-choice question prompts from test sentences, where the choices consist of verbalizations of entities and possible relation types. These choices are selected from the training sentences based on the entities in a test sentence. However, this approach could not outperform the previously introduced rule-based and ML-based approaches. In the context of their work, Zhang et al. (2023) showed that enriching the prompt context improves the prediction results on benchmark datasets such as TACRED and Re-TACRED. Melz (2023) focuses on auxiliary rationale memory for the RAG approach in the relation extraction task, and the proposed system learns from its successes without incurring high training costs. Chen et al. (2024) propose a Generative Context-Aware Prompt-tuning method, which also tackles the problem of prompt template engineering. This work proposed a prompt generator that finds context-aware prompt tokens by extracting and generating words regarding entity pairs, and it was evaluated on four benchmark datasets: TACRED, TACREV, Re-TACRED, and SemEval. Furthermore, Han et al. (2022) employed prompt-tuning approaches as a mask-filling task, utilizing various encoders such as BART, RoBERTa, and the encoder component of T5-large on datasets like TACRED, Re-TACRED, TACREV, and Wiki80. Their approach achieved F1 scores of 75.3% on TACRED and 84.0% on TACREV. However, the primary limitation of this work lies in the time efficiency of these autoregressive models.

In this work, we introduce a retrieval-augmented generation-based relation extraction (RAG4RE) approach that operates in zero-shot settings to identify relations between entity pairs within a sentence. Previous approaches have been limited by their dependence on either labeled data during training (Chen et al., 2024; Cohen et al., 2022; Han et al., 2022; Li et al., 2023; Wang et al., 2022; Zhou & Chen, 2022) or by their use of prompt templates that lack sufficient contextual information (Zhang et al., 2023). In contrast, our RAG4RE method incorporates contextual information, reducing reliance on the potentially outdated internal knowledge of (vanilla) language models. A key advantage of our approach is that it eliminates the need for both a training process and a labeled dataset. We provide a detailed explanation of RAG in the next section.

2.2. Retrieval-Augmented Generation

RAG for large language models can be classified into two categories: (i) naive RAG and (ii) advanced RAG. Naive RAG follows basic steps: retrieval, augmentation, and generation. In contrast, the advanced version incorporates post-processing steps, such as selecting essential information, condensing the context to be processed, and emphasizing critical parts of the retrieval context before delivering the retrieved information to the user (Gao et al., 2023). The concept of RAG has been suggested as a way to minimize hallucinations when conversational systems built on LLMs generate arbitrary responses to a query (Lewis et al., 2020). RAG can be viewed as an open-book exam applied to the usage of LLMs. The retriever mechanism in RAG finds an example for the user query (prompt), and then the user query is regenerated along with the example by the data-augmentation module in RAG. Ovadia et al. (2024) evaluated the knowledge injection capacities of both fine-tuning and the RAG approach and found that unsupervised fine-tuning suffered from performance problems, whereas RAG outperformed it.

3. Methodology

In this work, we have developed an RAG4RE approach to identify the relation between a pair of entities in a sentence. Our proposed RAG4RE, illustrated in Figure 2, consists of three modules: (i) retrieval, (ii) data augmentation, and (iii) generation. Our proposed RAG4RE approach is a variant of an advanced RAG (Gao et al., 2023), as its retrieval module includes “Result Refinement” which applies post-processing after responses from the generation module. An example demonstrating the different responses returned to RAG4RE and a simple query is given in Table 11. The rest of the section explains the details of how each module of our proposed approach in Figure 2 works under specific subsections.

3.1. Retrieval

A user submits a sentence (query) along with a pair of entities (head and tail entities) that might have a relation to the Retrieval module, as demonstrated in Figure 2. Then, the Retriever sends this query to the Data Augmentation module, which extends the original query with a semantically similar sentence from the training dataset, as in the example given in Figures 2 and 3. “Result Refinement” in this module applies post-processing techniques, if necessary, to the results returned by the Generation module. The “Result Refinement” consists of a couple of response processing steps, such as refining prefixes (e.g., changing “per:member_of” to “org:member_of” as illustrated in Table 1 and Figures 8(a) and 8(b)), and converting “no relation” answers into “no_relation” as defined in the predefined relation types.² Unfortunately, because LLMs are based on next-token prediction, they might still generate undefined relation types, as analyzed in Section 4.3.

Figure 3.

An Example of a Re-Generated Prompt from a Sample in the TACRED Dataset.

Table 1.

Prefix Refinement Samples From Flan T5 XL, Along With TACRED and Its Variants, Based on Predictions From the Evaluation Phase.

Raw Predicted Relations by LLMs	Refined Relation
org:religion	per:religion
per:member_of	org:member_of
org:employee_of	per:employee_of
org:schools_attended	per:schools_attended
per:members	org:members
org:parents	per:parents
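The two refinement steps described above (prefix correction and normalizing “no relation”) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `refine_relation`, the suffix-matching rule, and the choice to return `None` for ambiguous or undefined answers are all hypothetical, and the sketch does not use entity types to resolve cases where the same suffix exists under both prefixes.

```python
def refine_relation(raw_answer, defined_relations):
    """Map a raw LLM answer onto a defined relation label, or return None.

    Hypothetical sketch: the matching rules below are assumptions.
    """
    answer = raw_answer.strip().lower()

    # Step 1: normalize "no relation" to the predefined label "no_relation".
    if answer.replace(" ", "_") == "no_relation":
        return "no_relation" if "no_relation" in defined_relations else None

    # Step 2: accept answers that already match a defined relation type.
    if answer in defined_relations:
        return answer

    # Step 3: prefix refinement. Compare on the part after "per:" / "org:",
    # so a missing or wrong prefix can be repaired when the match is unique.
    suffix = answer.split(":", 1)[-1]
    candidates = [r for r in defined_relations if r.split(":", 1)[-1] == suffix]
    if len(candidates) == 1:
        return candidates[0]

    # Anything else is treated as undefined (or ambiguous) in this sketch.
    return None


# Toy subset of relation types, chosen only for illustration.
toy_relations = {"org:member_of", "per:religion", "per:employee_of", "no_relation"}
refined = refine_relation("per:member_of", toy_relations)
```

In this sketch an answer that matches no defined suffix is left unresolved rather than guessed, which corresponds to the undefined relation types that the text says may still occur.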

3.2. Data Augmentation

The data augmentation module includes an Embedding Database (DB) containing embeddings of the training data, which are computed using the Sentence BERT (SBERT) model (Reimers & Gurevych, 2019). In our approach, we use the “all-MiniLM-L6-v2” version of SBERT.³ Within this module, the embedding of the query sentence is also computed by SBERT. The system then calculates similarity scores between the embedding of each training sentence in the Embedding DB and the query sentence embedding using the cosine similarity metric, as described in equation 1. This formula measures the cosine similarity between two embedding vectors $A$ and $B$, computed as the dot product of the vectors divided by the product of their magnitudes.

\begin{aligned} \text{Cosine Similarity} = \cos(\theta) = \frac{A \cdot B}{\| A \| \, \| B \|} \end{aligned}

(1)

where $A$ and $B$ are the embedding vectors representing two different sentences in equation 1.

After computing the similarity scores between the query sentence embeddings and those in the embedding DB, the system selects the sentence with the highest similarity (top one) and incorporates it into the prompt template, as shown in Figure 4. For example, the cosine similarity score between the query sentence, “ Survivors include his wife, Sandra; four sons, Jeff, James, Douglas, and Harris; a daughter, Leslie; his mother, Sally; and two brothers, Guy and Paul.” and the top-ranked similar sentence, “ Survivors include his wife, Mary Russell Flowers, two sons, a daughter, 10 grandchildren, and four great-grandchildren.” is approximately 0.77904, as illustrated in Figure 2. Both the query sentence and the most similar sentence are then input into the prompt generator, which constructs the prompt based on this template. Essentially, the prompt generator reformulates the user query by including the relevant sentence from the Embedding DB. An example of the generated prompt is displayed in Figure 3. Additionally, no similarity threshold is set for retrieving the most similar sentence. Furthermore, the similarity computation based on embeddings does not guarantee that the most similar sentence will contain both the head and tail entities of the query sentence, as well as the same relation types between entities, since embeddings are computed for all tokens in a sentence. The generated prompt is then passed to the Generation module for further processing. Our prompt template (see Figures 4 and 3) incorporates possible relation types to leverage the conditional generation capabilities of LLMs.
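The retrieval and prompt re-generation steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the three-dimensional vectors stand in for SBERT embeddings, and the function names (`cosine_similarity`, `retrieve_top1`, `build_prompt`) and the prompt wording are hypothetical, since the actual template is the one shown in Figure 4.

```python
import math


def cosine_similarity(a, b):
    """Equation 1: dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)  # assumes neither vector is all zeros


def retrieve_top1(query_embedding, embedding_db):
    """Return the (sentence, score) pair with the highest cosine similarity.

    No similarity threshold is applied, mirroring the description in the text.
    """
    scored = [(s, cosine_similarity(query_embedding, e)) for s, e in embedding_db]
    return max(scored, key=lambda pair: pair[1])


def build_prompt(query, head, tail, relation_types, example_sentence):
    """Assemble a prompt (hypothetical wording, not the Figure 4 text)."""
    return (
        f"Example sentence: {example_sentence}\n"
        f"Query sentence: {query}\n"
        f"Head entity: {head}\n"
        f"Tail entity: {tail}\n"
        f"Relation types: {', '.join(relation_types)}\n"
        "Answer with exactly one relation type from the list."
    )


# Toy embedding database: made-up 3-d vectors in place of SBERT embeddings.
toy_db = [
    ("sentence A", [1.0, 0.0, 0.0]),
    ("sentence B", [0.6, 0.8, 0.0]),
    ("sentence C", [0.0, 0.0, 1.0]),
]
best_sentence, best_score = retrieve_top1([0.8, 0.6, 0.0], toy_db)
prompt = build_prompt("query text", "head", "tail", ["no_relation"], best_sentence)
```

Because the top-ranked sentence is always returned, a weakly related sentence can still end up in the prompt when nothing in the database is close to the query, which is the behavior the paragraph above describes for a retrieval step without a threshold.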

Figure 4.

Illustration of the Re-Generated Prompt Template. The Blue-Colored Query Sentence, Head and Tail Entities, and Relation Types are Provided by the User.

Figure 5.

Number of Undefined Relation Predictions Across TACRED and Its Variants on Different LLMs Along With RAG4RE. These Relation Types Are Not Among the Relation Types Defined in the Datasets (see Table 2).

3.3. Generation

The LLM generates a response for the prompts using zero-shot settings in the generation module. We integrate LLMs with different architectures, including encoder–decoder and decoder-only models (Pan et al., 2024), in our experiments so that we can evaluate the performance of our proposed RAG4RE approach with different LLMs and compare them within the RAG4RE framework. Subsequently, the response is sent to the “Result Refinement” in the Retrieval module. Result refinement might be necessary if the relation type includes a prefix, as the response might omit or incorrectly predict the prefix. For example, the response might return member_of or per:member_of instead of org:member_of, as given in Table 1⁴ (see Figures 8(a) and 8(b) for statistics about prefix refinement when Flan T5 has been integrated into our RAG4RE pipeline). The Generation module concludes when the results are sent to the Retrieval module. Afterwards, the “Retriever” sends the results to the user. An example of the responses returned by the retriever is shown in Figure 2.

4. Evaluation

In this section, we examine the performance of our RAG4RE work. We first introduce the experimental settings in Section 4.1. Then, we present the results of our experiments in Section 4.2. Finally, false predictions are analyzed in Section 4.3.

4.1. Experimental Setup

In our work, we assess the effectiveness of our RAG4RE approach using well-established RE benchmarks, including the TACRED (Zhang et al., 2017), TACREV (Alt et al., 2020), Re-TACRED (Stoica et al., 2021), and SemEval (Hendrickx et al., 2010) RE datasets. These benchmarks consist of head and tail entities in the given sentences, along with the ground truth relation types between these entities. Further insights into the datasets can be found in Section 4.1.1. Additionally, we compare the performance of our RAG4RE approach, which incorporates a relevant (similar) sentence alongside a query sentence in its prompt template (see the prompt template in Figure 4), to that of a simple query prompt (referred to as the vanilla prompt), which excludes any relevant sentence related to the query sentence, as explored in previous works (Zhang et al., 2023). This comparison helps assess our approach’s performance across different LLMs.

4.1.1. Datasets

We utilize four benchmark datasets, as detailed below and in Table 2. Figures 6 and 7 provide statistics for the SemEval RE dataset, while Table 3 offers details about the ‘no_relation’ type in TACRED and its variants.

– TACRED, namely the TAC RE Dataset, is a supervised RE dataset obtained via crowdsourcing and targeted towards TAC KBP relations. Its predefined relations have no directionality and can also be extracted from the tokens of a given sentence. We directly used this licensed dataset from the Linguistic Data Consortium (LDC).⁵
– TACREV is a revisited version of TACRED that reduces noise in sentences labeled with “no_relation.” In our work, this dataset is generated from the original TACRED by running the source code provided by Alt et al. (2020).⁶
– Re-TACRED is a re-annotated version of the TACRED dataset that can be used to perform reliable evaluations of RE models. To generate Re-TACRED from the original TACRED, we leverage the source code⁷ provided by Stoica et al. (2021).
– SemEval focuses on multi-way classification of semantic relationships between entity pairs. The predefined relations (target relation labels) in this benchmark dataset have directions and cannot be extracted from the tokens of the (test or train) sentences in the dataset. Figures 6 and 7 provide details about this dataset. The dataset is obtained from HuggingFace.⁸

Table 2.
Overview of Benchmark Datasets. ‘-’ Indicates the Absence of a Validation Split.

Split	TACRED	TACREV	Re-TACRED	SemEval
Training	68124	68124	58465	8000
Test	15509	15509	13418	2717
Validation	22631	22631	19584	–
# of Relations	42	42	40	19

Table 3.
Statistics of Relation Type ‘no_relation’ Across TACRED and Its Variants.

Dataset	Total Test	# of No_Relation	Percent of No_Relation
TACRED	15,509	12,184	78.56%
TACREV	15,509	12,386	79.86%
ReTACRED	13,418	7,770	57.91%

4.1.2. Pre-Trained Language Models

We evaluate our RAG4RE approach on the aforementioned benchmark datasets by integrating various LLMs, including Flan T5 (XL⁹ and XXL¹⁰ ), Llama-2-7b-chat-hf,¹¹ and Mistral-7B-Instruct-v0.2.¹² We use the instruction fine-tuned versions of Llama2 and Mistral, as Flan T5 (Thoppilan et al., 2022) is itself an instruction fine-tuned version of the T5 model (Chung et al., 2024).

4.1.3. Evaluation Metrics

We compare our RAG4RE approach with the simple query (vanilla prompting) in terms of micro Precision, Recall, and F1 scores, as given in equations 2, 3, and 4, respectively. We report micro-averaged scores because of the class imbalance in these benchmark datasets (see Table 3). In these equations, True Positive, False Positive, and False Negative are denoted as TP, FP, and FN, respectively; n is the total number of classes (relation types), and i is the index of a class in the multi-class classification problem. To compute these metrics, we leverage the metrics library of sklearn.¹³

\begin{aligned} \text{Micro Precision} & = \frac{\sum_{i = 1}^{n} \mathrm{TP}_{i}}{\sum_{i = 1}^{n} (\mathrm{TP}_{i} + \mathrm{FP}_{i})} \end{aligned}

(2)

\begin{aligned} \text{Micro Recall} & = \frac{\sum_{i = 1}^{n} \mathrm{TP}_{i}}{\sum_{i = 1}^{n} (\mathrm{TP}_{i} + \mathrm{FN}_{i})} \end{aligned}

(3)

\begin{aligned} \text{Micro F1} & = \frac{2 \cdot \text{Micro Precision} \cdot \text{Micro Recall}}{\text{Micro Precision} + \text{Micro Recall}} \end{aligned}

(4)
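Equations 2 to 4 can be illustrated with a short plain-Python sketch. It is not the sklearn code used for the reported results; the function name `micro_scores` and the per-class counts are made up purely to show how the sums over classes are formed.

```python
def micro_scores(counts):
    """Compute micro precision, recall, and F1 from per-class (TP, FP, FN) counts.

    `counts` maps a class label to a (tp, fp, fn) tuple.
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())

    # Equations 2 and 3: pool the counts over all classes before dividing.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    # Equation 4: harmonic mean of micro precision and micro recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts for three relation types (illustration only).
toy_counts = {
    "no_relation": (80, 10, 5),
    "per:title": (10, 5, 10),
    "org:member_of": (5, 0, 5),
}
precision, recall, f1 = micro_scores(toy_counts)
```

With these toy counts the pooled totals are TP = 95, FP = 15, and FN = 20, so micro precision is 95/110, micro recall is 95/115, and micro F1 is 190/225. Because the counts are pooled before dividing, a frequent class such as ‘no_relation’ dominates the score.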

4.1.4. Hardware Details

In terms of hardware, the language models were evaluated on a system with four NVIDIA GeForce GTX 1080 Ti GPUs (4 GPUs × 12 GB). The memory configuration of the system reaches 300 GB.

4.2. Results

Our experiments are conducted using the four benchmark datasets mentioned in Section 4.1.1. First, we assess the performance of our proposed RAG4RE and then compare it to that of a simple query (sentence), that is, vanilla LLM prompting, in terms of micro F1 score. As mentioned earlier, our evaluation criteria take into account the micro F1 score, Recall, and Precision metrics due to the imbalanced labelling of the datasets (see Table 3). Furthermore, we explore how our approach enhances the performance of LLM responses. In RAG4RE, the example sentence that is most similar to the query sentence, as determined by the cosine similarity metric in equation 1 over SBERT embeddings, is incorporated into the prompt template alongside the query sentence (see Data Augmentation in Figure 2). We compare the results of a simple query without any relevant sentence to our RAG4RE results in Table 4.

Table 4.
Experimental Results on Four Benchmark Datasets Using Different LLMs.

We utilize various LLMs, including Flan T5 XL and XXL, Mistral-7B-Instruct-v0.2, and Llama-2-7b-chat-hf, to conduct our experiments and evaluate the performance of our proposed RAG4RE framework. The results demonstrate that RAG4RE achieves remarkable performance compared to a simple query approach, as shown in Table 4 and Figure 9. Notably, RAG4RE consistently outperforms the simple query approach across the TACRED, TACREV, and Re-TACRED datasets, even when the underlying language model is varied. The highest F1 scores achieved, as detailed in Table 4, are 86.6%, 88.3%, and 73.3% for TACRED, TACREV, and Re-TACRED, respectively. These remarkable results are primarily accomplished by integrating the Flan T5 XL model into the Generation module. Nonetheless, RAG4RE does not achieve comparable performance on the SemEval dataset. This might be primarily due to either the predefined relations (target relation labels) in this dataset, which cannot be directly extracted from the sentence tokens, or the lack of knowledge about this dataset in the vanilla LLMs used in RAG4RE. Furthermore, the SemEval dataset includes manually annotated sentences for specific, defined semantic relation types (Hendrickx et al., 2010).

The remarkable improvement observed in RAG4RE’s results can be primarily attributed to the incorporation of relevant (or similar) example sentences, extracted from the training data of benchmark datasets, into the user query sentence. As highlighted in Lewis et al. (2020), RAG operates akin to an open-book exam, where adding a relevant (or similar) sentence to the query sentence facilitates the LLM’s understanding of the query sentence in the Generator module of our approach, as shown in Figure 2. The example sentence and query sentence might have similar or same entities, as illustrated in Figure 2, which helps the LLM make accurate inferences and reduce hallucinations. This interpretation is further supported by the results of the simple query and RAG4RE approaches, as outlined in Table 4.

Consequently, our RAG4RE improves F1 scores on benchmark datasets such as TACRED, TACREV, and Re-TACRED when its results are compared to those of a simple query, as demonstrated in Table 4. This performance improvement can be explained by the spread of activation theory (Abramski et al., 2025), which describes how a computational model activates one word (token) and spreads its influence to related words or concepts in the model; for instance, “sky” reminds one of “blue” or “cloud.” In terms of embeddings, similar sentences might contain the same or semantically similar (closely related) entities as those in the query sentence. In this context, similar sentences serve as facilitators for comprehending the entities in the query sentence. In the next section, we closely analyze the prediction errors on TACRED, its variants, and SemEval for further insights into how RAG4RE works.

4.3. Error Analysis

In this section, we analyze how RAG4RE improves the results of the simple query (sentence) across three benchmarks, whereas it does not show this improvement on the SemEval dataset. We mainly discuss false predictions and undefined relation types predicted by LLMs.

Decoder-only LLMs, for example, Mistral and Llama2, are prone to producing hallucinated results when a simple prompt is sent to those models (Pan et al., 2024). In the responses of the experiments conducted with Mistral-7B-Instruct-v0.2 and Llama-2-7b-chat-hf, we observed relation types that are not defined among the relation types of the prompt template (see the example prompt in Figure 4). Therefore, we analyze these undefined relation types generated by LLMs across TACRED and its variants in Figure 5. According to Figure 5, both decoder-only models used in our experiments generate more undefined relation types than the encoder–decoder models, Flan T5 XL and XXL.

We also analyze the false negative (FN) and false positive (FP) relation predictions of three language models in Table 5. In terms of FN predictions, RAG4RE decreases the number of FNs with the Flan T5 XL model, and likewise with Mistral-7B-Instruct-v0.2 and Llama-2-7b-chat-hf. Regarding dataset insights, Flan T5 XL performs better on the TACRED and TACREV datasets but underperforms on ReTACRED mainly because of the number of ‘no_relation’ labels in these datasets (see Table 6). ReTACRED is a re-annotated version of TACRED generated using the released source code, while TACRED is noisier and less reliable. The relation type ‘no_relation’ makes up 57.91% of the relation labels in the ReTACRED test set, yet Flan T5 XL predicts ‘no_relation’ for 80.69% of the test samples. In other words, the model predicts ‘no_relation’ on ReTACRED about as frequently as this label occurs in the TACRED test set (78.56%). This might be related to the data used in the training phase of Flan T5 models, as TACRED is a web-crawled dataset and T5, the base model of Flan T5, was trained on a web-crawled dataset.¹⁴ Additionally, the decrease in FNs is greater than the increase in FPs in most cases in Table 5.

Table 5.
Comparison of False Positives (FP) and False Negatives (FN) Between the Simple Query and RAG4RE Approaches Across Different LLMs.

Table 6.

Statistics of no_relation Predictions Across TACRED and Its Variants With Flan T5 XL for RAG4RE.

Dataset	Total Test	# No_Relation Predicted	% No_Relation Predicted	% No_Relation in test Dataset	F1 of No_Relation
TACRED	15,509	12,811	82.60%	78.56%	87%
TACREV	15,509	13,548	87.36%	79.86%	88%
ReTACRED	13,418	10,827	80.69%	57.91%	73%
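The gap between predicted and annotated ‘no_relation’ shares can be recomputed directly from the counts in Table 6. The following sketch uses only the numbers reported in the table; the variable and function names are illustrative.

```python
# Rows copied from Table 6:
# (dataset, total test instances, predicted no_relation, annotated no_relation in %)
TABLE_6 = [
    ("TACRED", 15509, 12811, 78.56),
    ("TACREV", 15509, 13548, 79.86),
    ("ReTACRED", 13418, 10827, 57.91),
]


def over_prediction(total, predicted, gold_pct):
    """Return the predicted no_relation share and its gap to the annotated share."""
    predicted_pct = 100.0 * predicted / total
    return round(predicted_pct, 2), round(predicted_pct - gold_pct, 2)


gaps = {name: over_prediction(total, pred, gold) for name, total, pred, gold in TABLE_6}
for name, (predicted_pct, gap) in gaps.items():
    print(f"{name}: predicted {predicted_pct:.2f}%, gap {gap:+.2f} percentage points")
```

The gap is about 4 percentage points on TACRED and 7.5 on TACREV, but close to 23 on ReTACRED, which is consistent with the lower ‘no_relation’ F1 reported for ReTACRED in the same table.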

With regard to the SemEval dataset, the increase in FPs is higher than the decrease in FNs when RAG4RE is used with Flan T5 XL, meaning the results are not improved. Similarly, the number of FNs increases with decoder-only models when RAG4RE is used instead of a simple query, so the results are not improved. Additionally, the SemEval dataset is partially manually annotated and is not a web-crawled dataset; therefore, the vanilla LLMs might not have prior knowledge of this dataset.

Overall, both Llama-2-7b-chat-hf and Mistral-7B-Instruct-v0.2 produce more false predictions than Flan T5 XL. According to Table 5, RAG4RE mitigates the hallucination problem of the simple query by reducing the number of false predictions for all three language model types on TACRED and its variants.

5. Ablation Study

In this section, we conduct three ablation studies: (i) prompt engineering approaches, (ii) a comparison of Llama variants, and (iii) post-training approaches on SemEval. We evaluate these experimental approaches on the original TACRED and SemEval datasets.¹⁵

5.1. Prompt Engineering Approaches

This section evaluates two approaches: (i) one-shot prompting and (ii) another prompt template for RAG4RE. We first present the results of one-shot prompting in the following section, and then discuss the impact of varying prompt templates on the results.

5.1.1. One-Shot Prompting

In this section, we conduct an experiment using one-shot prompting as a prompt engineering approach on the original TACRED dataset and SemEval, with the previously used Flan T5 models (XL and XXL). This experiment departs from the zero-shot setting by including an example in the prompt. We identify similar sentences and incorporate them into our prompt templates. For this experiment, we include the head and tail entities, as well as their relation type, in the sample prompt template, as shown in Figures 4 and 10. The results in Table 7 show that the one-shot prompting approach cannot outperform RAG4RE (see Table 4) when using the Flan T5 (XL and XXL) models on the TACRED and SemEval datasets. This one-shot approach also fails to match the performance of a simple query on TACRED (see Table 4). However, one-shot prompting with Flan T5 XL and XXL on SemEval improves micro F1 to 16.44% and 38.60%, respectively.

Table 7.
One-Shot Experiment on TACRED and SemEval.

	TACRED			SemEval
LLM	P(%)	R(%)	F1(%)	P(%)	R(%)	F1(%)
Flan T5 XL	81.36	70.90	75.77	21.75	13.21	16.44
Flan T5 XXL	74.36	49.99	59.79	38.83	38.37	38.60
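As a consistency check, the F1 values in Table 7 can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The sketch below uses only the numbers from Table 7; the function name `f1_score` is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (model, dataset) -> (P, R, reported F1), copied from Table 7.
TABLE_7 = {
    ("Flan T5 XL", "TACRED"): (81.36, 70.90, 75.77),
    ("Flan T5 XL", "SemEval"): (21.75, 13.21, 16.44),
    ("Flan T5 XXL", "TACRED"): (74.36, 49.99, 59.79),
    ("Flan T5 XXL", "SemEval"): (38.83, 38.37, 38.60),
}

for (model, dataset), (p, r, reported) in TABLE_7.items():
    computed = f1_score(p, r)
    print(f"{model} on {dataset}: computed F1 {computed:.2f}, reported {reported:.2f}")
```

All four recomputed values agree with the reported F1 scores to two decimal places.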

5.1.2. Another Prompt Template for RAG4RE

We also evaluate our RAG4RE system on the TACRED and SemEval datasets using a different prompt template that excludes the problem definition shown in Figure 4. We do not evaluate this prompt template, a sample of which is shown in Figure 11, on TACREV or Re-TACRED, since they are variants of the TACRED dataset. Excluding the problem definition from the prompt template decreases the F1 score of RAG4RE with Flan T5 XL on TACRED from 86.6% (see Table 4) to 82.89%, and a similar drop of 4.71% is observed for the simple query results on TACRED in Table 8. In contrast, on the SemEval dataset, this prompt template increases the micro F1 score by 18.91% for simple queries and by 17.42% for RAG4RE with Flan T5 XL. These results indicate that the effectiveness of the prompt template is highly dependent on the dataset. While the choice of prompt template can significantly enhance or diminish RAG4RE performance, RAG4RE with Flan T5 outperforms simple queries on TACRED with both prompt templates. As with the previously used prompt template, RAG4RE shows no performance improvement over the simple query on SemEval with this prompt template.

5.2. Comparing Llama Variants

We evaluate a different instruction-tuned version of the Llama model: Llama-3.1-8B-Instruct.¹⁶ This model has more parameters than the one used to produce the results shown in Table 4. Our goal is to examine whether the proposed approach behaves differently when using a model with a larger parameter count. As shown in Table 9, our RAG4RE approach outperforms the simple prompting baseline on the TACRED, TACREV, and Re-TACRED datasets when using Llama-3.1-8B-Instruct within our framework (illustrated in Figure 2), similar to the results obtained with Llama-2-7b-chat-hf. For the SemEval dataset, the simple query performs slightly better, though the results between the simple query and RAG4RE remain close, as seen in Table 9. Beyond internal comparisons, Llama-3.1-8B-Instruct with the simple query achieves higher performance than Llama-2-7b-chat-hf on four benchmark datasets (see Tables 4 and 9). However, when using the RAG4RE approach, Llama-3.1-8B-Instruct does not surpass Llama-2-7b-chat-hf on TACRED and TACREV.

Table 8.
Another Prompt Template Which Does Not Have Any Problem Definition.

		TACRED			SemEval
LLM	Method	P(%)	R(%)	F1(%)	P(%)	R(%)	F1(%)
Flan T5 XL	simple query	93.66	69.81	79.99	32.86	35.82	34.28
	RAG4RE	88.70	77.80	82.89	38.91	26.44	31.49
Llama-2-7b-chat-hf	simple query	32.13	19.52	24.29	26.51	26.59	26.55
	RAG4RE	34.19	13.08	18.92	17.20	16.83	17.01
Mistral-7B-Instruct-v0.2	simple query	20.66	11.19	14.52	100.0	0.04	0.08
	RAG4RE	23.45	11.37	15.31	57.89	0.44	0.88

Table 9.

The Experimental Results on Four Benchmark Datasets Along With Llama3.1-8B.

		TACRED			TACREV			Re-TACRED			SemEval
LLM	Method	P(%)	R(%)	F1(%)	P(%)	R(%)	F1(%)	P(%)	R(%)	F1(%)	P(%)	R(%)	F1(%)
Llama-3.1-8B-Instruct	simple query	37.12	24.06	29.19	30.89	18.79	23.36	27.10	17.61	21.35	27.95	27.94	27.94
	RAG4RE	49.90	46.75	48.27	62.33	54.52	58.16	36.24	30.83	33.32	27.72	27.71	27.72
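The size of the effect with Llama-3.1-8B-Instruct becomes easier to read as per-dataset F1 differences. The sketch below subtracts the simple query F1 from the RAG4RE F1 using the values in Table 9; the names are illustrative.

```python
# dataset -> (simple query F1, RAG4RE F1), copied from Table 9.
TABLE_9_F1 = {
    "TACRED": (29.19, 48.27),
    "TACREV": (23.36, 58.16),
    "Re-TACRED": (21.35, 33.32),
    "SemEval": (27.94, 27.72),
}

# Positive values mean that RAG4RE improves on the simple query.
deltas = {
    dataset: round(rag4re - simple, 2)
    for dataset, (simple, rag4re) in TABLE_9_F1.items()
}

for dataset, delta in deltas.items():
    print(f"{dataset}: {delta:+.2f} F1 points")
```

The differences are roughly +19, +35, and +12 points on TACRED, TACREV, and Re-TACRED, against about -0.2 on SemEval, matching the description of the SemEval results as close.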

5.3. Post-Training Approaches on SemEval Dataset

We hypothesize that fine-tuning LLMs on a small subset of the SemEval dataset could enhance model performance and better adapt the LLMs to the domain of the dataset. Therefore, we fine-tune the Flan T5 Base (250M) model on two distinct versions of the SemEval dataset: a prompt dataset based on the simple queries introduced in previous sections (see Table 11 for an example), and the original SemEval sentence dataset, where the input is a sentence and the output is a relation type. The subset is selected from the data that remains after identifying the sentences in the training dataset that are most similar to the test sentences. Fine-tuning is then performed on 5,283 samples (train and validation) from the SemEval training dataset using 5-fold cross-validation.¹⁷ The hyperparameters are set to 5 epochs, a batch size of 16, and a learning rate of 0.001, together with low-rank adaptation (LoRA) (Hu et al., 2021). For the Flan T5 Base model, $\text{LoRA}_{\alpha}$ is set to 32, the rank parameter ($r$) is 4, the task type is Seq2SeqLM, and $\text{LoRA}_{\text{dropout}}$ is 0.01.
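The reported setup can be summarized in a small, library-free sketch. The hyperparameter values are taken from the paragraph above, and the scaling factor $\alpha / r$ follows the LoRA formulation of Hu et al. (2021). The class `FineTuneConfig`, the function `k_fold_indices`, and the contiguous, unshuffled fold assignment are assumptions made for illustration and not necessarily how the folds were built in the experiments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters as reported in the text (container name is hypothetical)."""
    epochs: int = 5
    batch_size: int = 16
    learning_rate: float = 1e-3
    lora_alpha: int = 32
    lora_rank: int = 4
    lora_dropout: float = 0.01
    task_type: str = "Seq2SeqLM"

    @property
    def lora_scaling(self) -> float:
        # LoRA scales the low-rank update by alpha / r.
        return self.lora_alpha / self.lora_rank


def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists for k contiguous folds.

    Assumption: no shuffling; the first (n_samples % k) folds get one extra sample.
    """
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


config = FineTuneConfig()
folds = list(k_fold_indices(5283, 5))
print(config.lora_scaling, [len(validation) for _, validation in folds])
```

With 5,283 samples and five folds, each validation fold holds 1,056 or 1,057 samples, and the reported $\alpha = 32$ with $r = 4$ gives a scaling factor of 8.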

We first fine-tune the model on the original dataset¹⁸ (see Table 12 for a sample from the SemEval dataset) and evaluate its performance using simple queries and RAG4RE, as shown in Table 10. Afterwards, we fine-tune Flan T5 Base on the prompt dataset¹⁹ and apply the same evaluation with the simple query prompt and RAG4RE. According to the results in Table 10, RAG4RE with these fine-tuned Flan T5 Base models outperforms the simple query approach. These findings might therefore motivate evaluating RAG4RE on domain-specific datasets.

Table 10.
Mean Metrics of the Fine-Tuned Flan T5 Base Models Across 5 Runs With 5-Fold Cross-Validation.

Dataset	Method	P (%)	R (%)	F1 (%)
Prompt Dataset from SemEval	fine-tune + Simple Query	78.01	70.33	73.89
	fine-tune + RAG4RE	75.65	75.65	75.65
Original SemEval sentences	fine-tune + Simple Query	36.15	36.15	36.15
	fine-tune + RAG4RE	38.37	38.37	38.37

6. Discussion

In our experiments, we compared two methods: (i) RAG4RE and (ii) a simple query (vanilla LLM prompting), which does not include the relevant example sentence described in Figure 4. Our findings indicate a notable improvement in F1 scores when employing the RAG4RE approach instead of the simple query on the TACRED dataset and its variants, as outlined in Table 4. This improvement stems from the integration of a relevant sentence into the prompt template, as illustrated in Figure 4. Including the relevant example sentence supports the predictions made by the LLM in the Generation module of our proposed architecture, as depicted in Figure 2.
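The difference between the two compared methods can be sketched in a few lines: the simple query sends only the query sentence, while the retrieval-augmented prompt additionally carries the most similar training sentence. This is a minimal sketch under stated assumptions: the bag-of-words cosine similarity is a stand-in for the sentence-similarity component, the prompt wording is hypothetical and does not reproduce the template in Figure 4, and all sentences and names are toy examples.

```python
import math
import re
from collections import Counter


def bow_vector(sentence):
    """Bag-of-words term counts (stand-in for a sentence embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, training_sentences):
    """Return the training sentence most similar to the query sentence."""
    query_vector = bow_vector(query)
    return max(training_sentences, key=lambda s: cosine(query_vector, bow_vector(s)))


def build_prompt(query, head, tail, relation_types, example=None):
    """Hypothetical prompt layout; the example line is the only RAG-specific part."""
    lines = []
    if example is not None:
        lines.append(f"Example sentence: {example}")
    lines.append(f"Sentence: {query}")
    lines.append(f"Head entity: {head}. Tail entity: {tail}.")
    lines.append("Relation types: " + ", ".join(relation_types))
    lines.append("Answer with one relation type.")
    return "\n".join(lines)


TRAIN = [
    "Maria has lived in Lisbon since she moved there.",
    "The company was founded by two engineers.",
    "The committee postponed the vote until next week.",
]
RELATIONS = ["per:cities_of_residence", "org:founded_by", "no_relation"]
QUERY = "John has lived in Berlin since he moved there in 2010."

example = retrieve(QUERY, TRAIN)
simple_prompt = build_prompt(QUERY, "John", "Berlin", RELATIONS)
rag_prompt = build_prompt(QUERY, "John", "Berlin", RELATIONS, example=example)
print(rag_prompt)
```

Keeping the two prompts identical apart from the example line isolates the effect of the retrieved sentence, which is the comparison the experiments rely on.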

The incorporation of this relevant sentence serves to mitigate hallucinations in the LLM’s responses, subsequently reducing the occurrence of false predictions, as demonstrated in Table 5. However, while RAG4RE enhances the generation capabilities of LLMs, they might still produce hallucinated relation types. Figure 5 gives the count of undefined relations in predictions across datasets and LLMs. Mistral and Llama models generate a significant number of undefined relations, that is, relations that do not exist in the predefined set of relation types in the given prompt. In contrast, the Flan T5 XL model produces the fewest undefined relations. Our assessment of the effectiveness of the RAG4RE approach is based on using Flan T5 XL as the LLM in Figure 2, given that our approach, combined with Flan T5 XL, yields the highest F1 scores across benchmarks except for SemEval (see Table 4). Although the prompt tuning approach using a mask-filling template is applied with T5 Large (fine-tuning its encoder) and evaluated on TACRED (Han et al., 2022), achieving an F1 score of 75.3%, it does not outperform Flan T5 XL with RAG4RE. This discrepancy in performance might be related to the size of the language models and to the fact that only the encoder part of the T5 model is fine-tuned in Han et al. (2022).

The large margin between Re-TACRED and TACREV, or between Re-TACRED and TACRED, in the RAG4RE approach is due to the percentage of sentences in their test datasets whose relation type between the given entities is ‘no_relation’. In TACRED and TACREV, 78.56% and 79.86% of the test sentences, respectively, are labeled ‘no_relation’, and Flan T5 XL predicts a correspondingly high number of ‘no_relation’ sentences. In Re-TACRED, only 57.91% of the test sentences are labeled ‘no_relation’, yet Flan T5 XL predicts ‘no_relation’ for 80.69% of the test sentences, well above the actual percentage (see Table 6). Furthermore, Re-TACRED evaluates 40 relation types, whereas TACREV and TACRED evaluate 42 relation types. Another important reason why Flan T5 XL does not perform better on ReTACRED might be that the base model of Flan T5, T5, was trained on the C4 (Colossal Clean Crawled Corpus) dataset,²⁰ which was constructed from freely available public data, as was the TACRED dataset (Zhang et al., 2017). Additionally, ReTACRED is reannotated from TACRED with a codebase and is not available on the web. Nevertheless, RAG4RE improves the performance of the simple query approach on TACRED and its variants, even if Flan T5 models might have prior knowledge of TACRED. Likewise, fine-tuning Flan T5 on a small amount of the SemEval dataset could improve the performance of both RAG4RE and simple queries, as it would adapt the model to this specific dataset, as shown in the ablation study in Section 5.3.

We present an analysis of the performance of our RAG4RE approach, comparing its F1 score with both LLM-based methods and state-of-the-art RE techniques reported in the current literature. In terms of LLM-based RE approaches, our RAG4RE outperforms other methods utilizing LLMs across all benchmark datasets except for SemEval, as illustrated in Table 4. The superior performance of our RAG4RE, as presented in Table 4, is largely attributed to the absence of a relevant sentence in the prompt templates of both LLMQA4RE (Zhang et al., 2023), whose prompt template is based on multiple-choice questions, and RationaleCL (Xiong et al., 2023) (with an F1 of 80.8% on TACRED), which is based on conversational prompting with a rationale strategy. These results are further supported by Min et al. (2022), who compared the performance of multiple-choice tasks with classification tasks, demonstrating that language models perform better in classification tasks than in multiple-choice templates. Additionally, LLMQA4RE does not include the prefix of the relation in its prompt templates for TACRED and its variants, so it has no need to address incorrect prefix predictions. In contrast, we incorporate the prefix in our RAG4RE architecture. The changes made to the prefix during the integration of Flan T5 (XL and XXL) can be observed in Figure 8. In addition to the performance comparison with LLM-based approaches, we also compare our results with the best-performing methods in the literature, as shown in Table 4. Our RAG4RE outperformed all state-of-the-art approaches, including recently proposed models, fine-tuned language models, and advanced techniques, on both TACREV and TACRED, achieving F1 scores of 86.8% and 88.3%, respectively. However, it did not achieve the same performance on ReTACRED, primarily due to the high number of ‘no_relation’ predictions, as detailed in Table 6. Similarly, our RAG4RE did not yield promising results on the SemEval dataset, either because of the unique features of this dataset, such as directed relations and relations that cannot be predicted from sentence tokens, or because the dataset was not used in the training of the vanilla LLMs in Table 4. For example, the LLM-based approach GAP (Chen et al., 2024) fine-tunes the RoBERTa large model (355M parameters) using a prompting strategy and achieves an F1 score of 90.3%. Likewise, fine-tuning our Flan T5 Base (250M) model on a subset of the SemEval training dataset improves the results of RAG4RE in Section 5.3.

With regard to ethical considerations, our proposed approach is evaluated on local hardware using open-source models. Therefore, no data is shared outside the local hardware. Furthermore, this study does not involve any human subject data, and thus raises no ethical concerns related to human data use. Because all processing stays local, the approach preserves privacy and can be applied to the evaluation of sensitive domain data, such as in the healthcare sector.

Overall, our RAG4RE demonstrates strong performance on the TACRED and TACREV datasets. However, its performance on SemEval does not achieve similar improvements, likely due to challenges posed by the predefined relation types (target relation labels) in this benchmark dataset or the limitations of vanilla LLMs’ prior knowledge (see results in Section 5.3). For example, directly extracting the “Cause-Effect (e2,e1)” relation type from the provided sentence tokens between entity 1 (e1) and entity 2 (e2) remains challenging for zero-shot LLM prompting, as this relation type often requires logical inference for accurate identification.

7. Conclusion and Future Work

In this work, we introduce a novel approach to RE called RAG4RE, which leverages zero-shot prompting settings. Our aim is to identify the relation types between head and tail entities in a sentence, utilizing an RAG-based LLM prompting approach.

We show that RAG4RE outperforms the simple query (vanilla LLM prompting) on TACRED and its variants. To support this claim, we conducted experiments using four different RE benchmark datasets: TACRED, TACREV, Re-TACRED, and SemEval, in conjunction with three distinct LLMs: Mistral-7B-Instruct-v0.2, Flan T5 (XL and XXL), and Llama-2-7b-chat-hf. Our RAG4RE yielded remarkable results compared to those of the simple query and exhibited notable results on the benchmarks compared to previous works. However, neither RAG4RE nor vanilla LLM prompting performed well on the SemEval dataset. This can be attributed either to the absence of logical inference in LLMs or to the lack of prior knowledge in vanilla LLMs regarding this dataset, as predefined or target relation types cannot be directly derived from the sentence tokens in the SemEval dataset. The ablation study conducted with SemEval in Section 5.3 yields promising results when the post-trained model is integrated into RAG4RE. These findings may also encourage the application of RAG4RE to domain-specific datasets. In addition to the results presented in Section 5.3, Llama-3.1-8B-Instruct outperforms Mistral-7B-Instruct-v0.2, Flan T5 (XL and XXL), and Llama-2-7b-chat-hf on the SemEval dataset. These findings suggest that larger models may serve as more effective domain experts compared to their smaller counterparts.

In future work, we aim to extend our approach to real-world dynamic learning scenarios, inspired by the ablation study on SemEval, and evaluate it on real-world datasets. Additionally, we intend to integrate LLMs fine-tuned on training datasets into our RAG4RE system to address the performance issues encountered when datasets require logical inference to identify relation types between entities in a sentence and when target relation types cannot be extracted from the sentence tokens, as in SemEval.

Footnotes

Acknowledgements

Sefika Efeoglu is funded by the Turkish Ministry of National Education, Republic of Türkiye, under the Postgraduate Study Abroad Program.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

ORCID iDs

Sefika Efeoglu

Adrian Paschke

Notes

Appendix A. Dataset Overview

Table 12.

Some Data from Benchmark Datasets.

Dataset	Sentence	Entities	Relation
TACRED	He has served as a policy aide to the late U.S. Senator Alan Cranston, as National Issues Director for the 2004 presidential campaign of Congressman Dennis Kucinich, as a co-founder of Progressive Democrats of America and as a member of the international policy department at the RAND Corporation think tank before all that.	Head: Progressive Democrats of America, Tail: international policy department	no_relation
SemEval	The <e1>surgeon</e1> cuts a small <e2>hole</e2> in the skull and lifts the edge of the brain to expose the nerve.	e1: surgeon, e2: hole	Product-Producer (e1,e2)

Appendix B. SemEval Dataset Statistics

Appendix C. Results

Appendix D. Ablation Study

References

Abramski, Improta, Rossetti, & Stella (2025). The “llm world of words” english free association norms generated by large language models. Scientific Data, 12, 803. https://doi.org/10.1038/s41597-025-05156-9

Agichtein & Gravano (2000). Snowball: Extracting relations from large plain-text collections. In Proceedings of the Fifth ACM conference on digital libraries (pp. 85–94). DL ’00. New York, NY, USA: Association for Computing Machinery. ISBN 158113231X.

Ahmad, Critelli, Efeoglu, Mancini, Ringwald, Zhang, & Merono Penuela (2023). Draw me like my triples: Leveraging generative ai for Wikidata image completion. https://wikidataworkshop.github.io/2023/. The 4th Wikidata Workshop; Conference date: 07-11-2023.

Alt, Gabryszak, & Hennig (2020). TACRED revisited: A thorough evaluation of the TACRED relation extraction task. In D. Jurafsky, J. Chai, N. Schluter & J. Tetreault (eds.), Proceedings of the 58th Annual meeting of the association for computational linguistics (pp. 1558–1569). Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.142

Aydar, Bozal, & Özbay (2021). Neural relation extraction: A review. Turkish Journal of Electrical Engineering and Computer Sciences, 29(2), 1029–1043. https://doi.org/10.3906/elk-2005-119

Chen, Zeng, & Zhang (2024). Gap: A novel generative context-aware prompt-tuning method for relation extraction. Expert Systems with Applications, 248, 123478. https://doi.org/10.1016/j.eswa.2024.123478

Chung, H. W., Hou, Longpre, Zoph, Tai, Fedus, Wang, Dehghani, Brahma, Webson, Gu, S. S., Dai, Suzgun, Chen, Chowdhery, Castro-Ros, Pellat, Robinson, Valter, ... Wei (2024). Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(1), 1–53.

Cohen, A. D., Rosenman, & Goldberg (2022). Supervised relation classification as two-way span-prediction. In 4th Conference on automated knowledge base construction.

Efeoglu (2022). A continual relation extraction approach for knowledge graph completeness (short paper). In Padua, Italy: 26th International conference on theory and practice of digital libraries (TPDL) CEUR Workshop. https://ceur-ws.org/Vol-3246/15_paper110.pdf.

Gao, Xiong, Gao, Jia, Pan, Dai, Sun, & Wang (2023). Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997. https://arxiv.org/abs/2312.10997.

Grishman (2015). Information extraction. IEEE Expert, 30(5), 8–15.

Han, Zhao, & Cheng (2022). Generative prompt tuning for relation classification. In Y. Goldberg, Z. Kozareva & Y. Zhang (eds.), Findings of the association for computational linguistics: EMNLP 2022 (pp. 3170–3185). Abu Dhabi, United Arab Emirates: Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.findings-emnlp.231.

Hendrickx, Kim, S. N., Kozareva, Nakov, Ó Séaghdha, Padó, Pennacchiotti, Romano, & Szpakowicz (2010). SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In K. Erk & C. Strapparava (eds.), Proceedings of the 5th international workshop on semantic evaluation (pp. 33–38). Uppsala, Sweden: Association for Computational Linguistics. https://aclanthology.org/S10-1006.

Hertling & Paulheim (2023). Olala: Ontology matching with large language models. In Proceedings of the 12th knowledge capture conference 2023 (pp. 131–139). K-CAP ’23. New York, NY, USA: Association for Computing Machinery. ISBN 9798400701412. https://doi.org/10.1145/3587259.3627571.

Hu, E. J., Shen, Wallis, Allen-Zhu, Wang, & Chen (2021). Lora: Low-rank adaptation of large language models. https://arxiv.org/abs/2106.09685.

Jiang, A. Q., Sablayrolles, Mensch, Bamford, Chaplot, D. S., Casas, Bressand, Lengyel, Lample, Saulnier, Lavaud, L. R., Lachaux, M-A., Stock, Scao, T. L., Lavril, Wang, Lacroix, & Sayed, W. E. (2023). Mistral 7b. arXiv preprint arXiv:2310.06825.

Lewis, Perez, Piktus, Petroni, Karpukhin, Goyal, Küttler, Lewis, Yih, W. T., Rocktäschel, Riedel, & Kiela (2020). Retrieval-augmented generation for knowledge-intensive nlp tasks. In Proceedings of the 34th international conference on neural information processing systems, NIPS’20. Red Hook, NY, USA: Curran Associates Inc. ISBN 9781713829546.

Zhang (2023). Reviewing labels: Label graph network with top-k prediction set for relation extraction. Proceedings of the AAAI Conference on Artificial Intelligence, 37(11), 13051–13058. https://doi.org/10.1609/aaai.v37i11.26533

Melz (2023). Enhancing llm intelligence with arm-rag: Auxiliary rationale memory for retrieval augmented generation. arXiv e-prints: arXiv–2311.

Mihindukulasooriya, Tiwari, Enguix, C. F., & Lata (2023). Text2kgbench: A benchmark for ontology-driven knowledge graph generation from text. In T. R. Payne, V. Presutti, G. Qi, M. Poveda-Villalón, G. Stoilos, L. Hollink, Z. Kaoudi, G. Cheng & J. Li (eds.), The Semantic Web—ISWC 2023 (pp. 247–265). Cham: Springer Nature Switzerland.

Min, Lyu, Holtzman, Artetxe, Lewis, Hajishirzi, & Zettlemoyer (2022). Rethinking the role of demonstrations: What makes in-context learning work? In Y. Goldberg, Z. Kozareva & Y. Zhang (eds.), Proceedings of the 2022 conference on empirical methods in natural language processing (pp. 11048–11064). Abu Dhabi, United Arab Emirates: Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.emnlp-main.759.

Ovadia, Brief, Mishaeli, & Elisha (2024). Fine-tuning or retrieval? Comparing knowledge injection in LLMs. In Y. Al-Onaizan, M. Bansal & Y. N. Chen (eds.), Proceedings of the 2024 conference on empirical methods in natural language processing (pp. 237–250). Miami, FL, USA: Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.emnlp-main.15.

Pan, Luo, Wang, Chen, & Wang (2024). Unifying large language models and knowledge graphs: A roadmap. IEEE Transactions on Knowledge and Data Engineering, 36(7), 3580–3599. https://doi.org/10.1109/TKDE.2024.3352100

Pawar, Palshikar, G. K., & Bhattacharyya (2017). Relation extraction: A survey. https://arxiv.org/abs/1712.05191. Preprint.

Reimers & Gurevych (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In K. Inui, J. Jiang, V. Ng & X. Wan (eds.), Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP) (pp. 3982–3992). Hong Kong, China: Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1410.

Stoica, Platanios, E. A., & Poczos (2021). Re-tacred: Addressing shortcomings of the tacred dataset. Proceedings of the AAAI Conference on Artificial Intelligence, 35(15), 13843–13850. https://doi.org/10.1609/aaai.v35i15.17631

Thoppilan, Freitas, D. D., Hall, Shazeer, Kulshreshtha, Cheng, H. T., Jin, Bos, Baker, Lee, Zheng, H. S., Ghafouri, Menegali, Huang, Krikun, Lepikhin, Qin, Chen, ... Le (2022). Lamda: Language models for dialog applications. https://arxiv.org/abs/2201.08239.

Touvron, Martin, Stone, K. R., Albert, Almahairi, Babaei, Bashlykov, Batra, Bhargava, Bhosale, Bikel, D. M., Blecher, Ferrer, C. C., Chen, Cucurull, Esiobu, Fernandes, Fuller, ... Scialom (2023). Llama 2: Open foundation and fine-tuned chat models. ArXiv abs/2307.09288. https://api.semanticscholar.org/CorpusID:259950998.

Wang, Liu, Chen, Hong, Tang, & Song (2022). DeepStruct: Pretraining of language models for structure prediction. In S. Muresan, P. Nakov & A. Villavicencio (eds.), Findings of the association for computational linguistics: ACL 2022 (pp. 803–823). Dublin, Ireland: Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.findings-acl.67.

Xiong, Song, & Wang (2023). Rationale-enhanced language models are better continual relation learners. In H. Bouamor, J. Pino & K. Bali (eds.), Proceedings of the 2023 conference on empirical methods in natural language processing (pp. 15489–15497). Singapore: Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.958.

Zhang & Jimenez Gutierrez (2023). Aligning instruction tasks unlocks large language models as zero-shot relation extractors. In A. Rogers, J. Boyd-Graber & N. Okazaki (eds.), Findings of the association for computational linguistics: ACL 2023 (pp. 794–812). Toronto, Canada: Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-acl.50.

Zhang, Zhong, Chen, Angeli, & Manning, C. D. (2017). Position-aware attention and supervised data improve slot filling. In M. Palmer, R. Hwa & S. Riedel (eds.), Proceedings of the 2017 conference on empirical methods in natural language processing (pp. 35–45). Copenhagen, Denmark: Association for Computational Linguistics. https://doi.org/10.18653/v1/D17-1004.

Zhou & Chen (2022). An improved baseline for sentence-level relation extraction. In Y. He, H. Ji, S. Li, Y. Liu & C. H. Chang (eds.), Proceedings of the 2nd conference of the asia-pacific chapter of the association for computational linguistics and the 12th international joint conference on natural language processing (Volume 2: Short Papers) (pp. 161–168). Online only: Association for Computational Linguistics. https://aclanthology.org/2022.aacl-short.21.
