Abstract
Knowledge graphs and ontologies represent symbolic and factual information that can offer structured and interpretable knowledge. Extracting and manipulating this type of information is a crucial step in complex processes. While large language models (LLMs) are known to be useful for extracting and enriching knowledge graphs and ontologies, previous work has largely focused on comparing models of a single architecture (e.g. encoder-decoder only) across benchmarks from similar domains. In this work, we provide a large-scale comparison of the performance of certain LLM features (e.g. model architecture and size) and task learning methods (fine-tuning vs. in-context learning (iCL)) on text-to-graph benchmarks in two domains, namely the general and the biomedical one. Experiments suggest that, in the general domain, small fine-tuned encoder-decoder models and mid-sized decoder-only models used with iCL reach overall comparable performance, with high entity and relation recognition and moderate yet encouraging graph completion. Our results also suggest that, regardless of other factors, biomedical knowledge graphs are notably harder to learn and are better modelled by small fine-tuned encoder-decoder architectures. Pertaining to iCL, we analyse hallucination behaviour related to sub-optimal prompt design, suggesting an efficient alternative to prompt engineering and prompt tuning for tasks with structured model output.
Introduction
Acquiring structured knowledge from text is a fundamental step in complex processes like reasoning and question answering, whether such processes are carried out by a human or an artificial intelligence (AI) system (Tiwari et al., 2023). In natural language processing (NLP), structured knowledge is often handled via ontologies or knowledge graphs (Hogan et al., 2021; Paulheim, 2017; Peng et al., 2023). Knowledge graphs are typically organised as collections of (subject, relation, object) triplets, where nodes represent entities and labelled edges represent the relations between them.
Knowledge graphs have seen a surge in their application in recent years (Chen et al., 2020; Ji et al., 2022). However, building them can be laborious and costly (Kejriwal et al., 2019; Peng et al., 2023). This has led to the development of numerous methods aimed at auto-generation of these graphs from text sources in various fields (Ji et al., 2022; Kejriwal, 2022; Liu et al., 2016; Peng et al., 2023). Until recently, extracting and manipulating knowledge graphs and other forms of graphs has been largely dealt with by small knowledge graph embedding models (KGEs) (Wang et al., 2017), which are lightweight but limited in capabilities, or different types of graph neural networks (GNNs) (Wu et al., 2023; Ye et al., 2022), such as convolutional GNNs (CGNNs) (Zhang et al., 2018) or graph attention networks (GATs) (Veličković et al., 2018). Recently, many of these architectures have been replaced by transformer-based large language models (LLMs) (Vaswani et al., 2017a), which have shown great potential in modelling graph-based data.
Despite these advancements, current techniques still suffer from significant limitations concerning accuracy, completeness, privacy, bias, and scalability (Peng et al., 2023; Radulovic et al., 2018; Rashid et al., 2019). Therefore, generating a large-scale knowledge graph automatically from text corpora remains an open challenge (Hogan et al., 2021; Kejriwal, 2022; Peng et al., 2023). As shown by a consistent body of evidence (Jin et al., 2023; Liu et al., 2023; Pan et al., 2024), LLMs can be adapted both to extract knowledge graphs from a reference text (text-to-graph task) and to convert knowledge graphs into natural language while maintaining the semantic meaning (graph-to-text task). We are interested in the former and adopt two text-to-graph benchmark datasets, to be referred to as Web NLG (Gardent et al., 2017) and Bio Event (Frisoni et al., 2022). Web NLG is a popular benchmark containing text-graph pairs with multiple relation types, maintaining a rather general domain, while Bio Event pertains to biomedical data, aggregating 10 popular biomedical datasets.
To adapt an LLM to a particular task, two popular task learning methods are fine-tuning and in-context learning (iCL) (Brown et al., 2020). Given a training dataset pertaining to the new task at hand, fine-tuning an LLM amounts to an additional training phase to update a subset of learnable model parameters to adapt to the new task. In-context learning, on the other hand, amounts to including a few task examples in the model prompt at inference time - a special case of few-shot learning. Typically, iCL provides weaker performance than fine-tuning and is computationally more expensive at inference time (Brown et al., 2020; Liu et al., 2022), yet it is highly flexible as it does not require any parameter updates. Both options involve a vast amount of design choices, from the quality and quantity of available training data to the amount of in-context examples to include in iCL.
While most work on knowledge graph extraction has focused on pushing the state of the art in terms of performance, or on summarising the field in terms of different applications and formulations of scenarios and tasks, it remains unclear to the general AI practitioner what, given a specific dataset and computational resources, would be the best end-to-end LLM-based solution to a text-to-graph task.
Research Objective and Contribution
We direct this work to the general AI practitioner in the general or biomedical domain who aims to develop an end-to-end LLM-based knowledge graph extraction system from textual sources. We investigate how to best approach such a task by examining various combinations of model design choices, assuming a fixed and accessible computational resource of a single NVIDIA Quadro RTX 8000 GPU. The main variables under investigation are model architecture (encoder-decoder and decoder-only), model family (T5, BART, Mistral-v0.1 and Llama-2), model size (from small (60M) to mid-sized (13B) learnable parameters), task learning method (fine-tuning and iCL) and additional pre-training data (relation extraction data, conversation data, instruction data, and (bio)medical data). In brief, this article's main insights are as follows:
- In the general domain, small fine-tuned encoder-decoder models and mid-sized decoder-only models adopting iCL achieve comparable results, with high entity and relation recognition and moderate yet encouraging graph completion.
- We provide tentative evidence that biomedical knowledge graphs are substantially harder to model from textual sources than general-domain ones. Mid-sized decoder-only models adopting iCL show weak performance, while the performance of small fine-tuned encoder-decoder models is robust compared to the general domain. However, we discover issues with the biomedical benchmark adopted from Frisoni et al. (2022), and a thorough revision is required before it can serve as a gold standard for text-to-graph tasks.
- Only additional pre-training data on relation extraction tasks boosts model performance, while observing conversation data, instruction data or (bio)medical data during pre-training makes no notable difference.
- We propose and experimentally demonstrate the effectiveness of a simple truncation-based heuristic on model output to control a specific type of in-context learning hallucination, avoiding expensive prompt tuning and prompt engineering.
- Off-the-shelf LLMs in the zero-shot setting (i.e. no in-context examples, only a task instruction and reference text) show weak performance. Careful design choices and a task learning method are required for such a task, especially in safety-critical and domain-specific contexts such as the biomedical domain.

Throughout, we highlight several areas of importance for the general AI practitioner.
Article Structure
This article is structured as follows. Section 2 introduces the related work, with a focus on architecture, tasks, and proposed benchmarks. Subsequently, Section 3 presents the methodology, as well as the used datasets (3.2), metrics (3.3), model architectures (3.4), task learning methods (3.5), and experiments’ set-up (3.6). Then Section 4 shows the experimental results, followed by Sections 5 and 6 to, respectively, discuss and conclude the article.
Related Work
Recent surveys suggest LLMs are of primary interest and hold potential for multiple types of graph-based tasks (Jiang & Usbeck, 2022; Jin et al., 2023; Liu et al., 2023; Pan et al., 2024). The task of automatically generating a knowledge graph from a reference text (text-to-graph) is closely related to the more general NLP task of relation extraction, traditionally composed of the two separate steps of named entity recognition and relation classification (Huguet Cabot & Navigli, 2021). Named entity recognition involves identifying and classifying entities in the text, which can be seen as nodes in the resulting knowledge graph (Yadav & Bethard, 2018); sub-tasks include co-reference resolution and entity disambiguation. Relation classification, in turn, aims at identifying the relation between two given entities, as (often implicitly) expressed in the reference text containing the identified entities. In an LLM-based knowledge graph extraction system, both steps are potentially entangled in a single end-to-end solution.
Previous works have explored different approaches to text-to-graph tasks and the utilisation of LLMs (Babaei Giglou et al., 2023; Hofer et al., 2024; Neuhaus, 2023). However, the main focus has been on pushing the state of the art on specific sub-tasks and benchmarks, such as research data (Dessí et al., 2022; Kabongo et al., 2024; Kuhn et al., 2018), question answering (Kacupaj et al., 2021; Kapanipathi et al., 2021), common sense (Ilievski et al., 2021; Zavarella et al., 2024), biomedicine (Himmelstein et al., 2023; Zietz et al., 2024), and others (Angioni et al., 2024). Moreover, unlike this work, the literature does not offer a systematic experimental comparison of the effect of contributing factors in an end-to-end text-to-graph task, as suggested by the summary proposed in Table 1. For instance, Bosselut et al. (2019) proposed COMET, a model that generates commonsense knowledge graphs from textual inputs. At the same time, the authors introduced the ATOMIC dataset, which is designed for commonsense inference, and utilised a transformer-based model to extract and generate the graph-based knowledge. The presented transformer model was only compared against existing LSTM-based solutions, which transformers were already known to outperform.
Related Work Summary. Schematic Representation of the Most Relevant Related Work on Text-to-Graph, Focusing on Five Dimensions: Model Architecture (Encoder-Decoder vs. Decoder-Only), Main Benchmark (Name), Learning Method (Tuning vs. In-Context Learning (iCL)), and New (Yes vs. No, for Both Model Architecture and Benchmark).
Further work on the state-of-the-art includes Guo et al. (2020) introducing CycleGT, a two-loss model to learn from text-to-graph and graph-to-text tasks, based on an encoder-decoder pre-trained model (T5), by bootstrapping from fully non-parallel graph and text data, and iteratively back translating between the two forms. The authors propose a comparison of this solution and architecture on multiple general-domain datasets, using alignment and an unsupervised setting (Jin et al., 2020). In addition, Dash et al. (2021) propose a new text-to-graph model, called CUVA (canonicalizing using variational autoencoders), that addresses the redundancy and ambiguity of noun and relation phrases in open knowledge graphs. Unlike current methods that use a two-step process, CUVA simultaneously learns both embeddings and cluster assignments, resulting in better performance. Additional advancements in the field of knowledge graph extraction include Zhang and Zhang (2020), who demonstrate the effectiveness of pre-trained models for generating knowledge graphs from text when fine-tuned with graph-aware objectives. They propose a graph-augmented text representation model that significantly improves the performance of this task.
Other relevant work has focused on introducing new datasets and benchmarks. Wang et al. (2021) introduce Wiki Graphs, a dataset to benchmark text-to-graph and graph-to-text tasks. The study focuses on a single solution, a combination of GNN graph-transformers, compared against transformer models. The study focuses more on the structure of the graph than its content and, as such, the dataset is generally stripped of named entities in the task outputs. Colas et al. (2021) introduce EventNarrative, an event-based text-to-graph and graph-to-text dataset. Similar to previous endeavours, the authors compare a graph-based transformer with two encoder-decoder pre-trained LLMs, namely T5 and BART, and find mixed performance. Frisoni et al. (2022) test a self-implemented version of these models on a new benchmark of biomedical graph-to-text and text-to-graph datasets.
In another study, Mihindukulasooriya et al. (2023) introduced Text2KGBench, a text-to-graph dataset that benchmarks the ability of language models to generate knowledge graphs from text guided by an ontology.
The previous paragraphs suggest a disparity in the proposed approaches to the task under investigation, in terms of a lack of decoder-only solutions. Of note, Khorashadizadeh et al. (2023) propose a qualitative analysis of the abilities of multiple large and mostly decoder-only models, such as ChatGPT and Bard (now Gemini), in terms of knowledge graph completion and question answering. Focusing on biomedical queries, the authors conclude that ChatGPT might present a valuable asset in automatically extracting knowledge graphs, albeit at a significant computational cost. Hu et al. (2020) examined the impact of incorporating graph structural information into the encoding process of a decoder-only model. Their proposed model, GPT-GNN, combines the generative pre-trained transformer with GNNs to enhance the learning of graph representations.
To the best of our knowledge, this work proposes a first large-scale, systematic comparison of LLM architectures, families, sizes, pre-training data and task learning methods for end-to-end text-to-graph extraction across the general and biomedical domains.
Our work differs from previous studies in knowledge graph extraction in that most proposed either a new model, a new dataset, or a comparison between relatively similar transformer models, while this study experimentally compares the effect of model architecture, family, size and pre-training data across two different domains, namely the general and biomedical ones. As such, the goal is not to push state-of-the-art performance, but to study the impact of each factor and describe best practices and important considerations, given a specific amount of computational resources.
Materials and Methods
This section presents the materials and methods, focusing on knowledge graph structure (Section 3.1), benchmark datasets (Section 3.2), evaluation metrics (Section 3.3), LLMs of various architecture, family, size and pre-training characteristics (Section 3.4), task learning methods (Section 3.5), and experimental setup (Section 3.6).
Knowledge Graph Structure
To ensure a stable and fair comparison across domains, we pre-process our benchmark datasets to match the following linearised text-graph structure. Formally, a dataset consists of two sets of strings: natural language reference texts and their corresponding linearised knowledge graphs, where each graph is serialised as a sequence of (subject, relation, object) triplets separated by special delimiter tokens.
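To make the structure concrete, the following is a minimal sketch of the linearisation step. The delimiter tokens <S>, <P> and <O> are illustrative placeholders rather than the actual special tokens of the benchmarks, and the example triplets are invented for illustration.

```python
# Minimal sketch of graph linearisation; <S>, <P> and <O> are
# hypothetical delimiters marking subject, predicate and object.
def linearise(graph: list[tuple[str, str, str]]) -> str:
    """Serialise a knowledge graph into a single string of triplets."""
    return " ".join(f"<S> {s} <P> {p} <O> {o}" for s, p, o in graph)

graph = [("Alan_Bean", "occupation", "Test_pilot"),
         ("Alan_Bean", "mission", "Apollo_12")]
print(linearise(graph))
# <S> Alan_Bean <P> occupation <O> Test_pilot <S> Alan_Bean <P> mission <O> Apollo_12
```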
Benchmark Datasets
We adopt two parallel text-to-graph datasets, to be referred to as Web NLG (Gardent et al., 2017) and Bio Event (Frisoni et al., 2022). Table 2 contains the basic descriptive statistics for both datasets.
Descriptive Statistics of the Web NLG and Bio Event Dataset.
The Web NLG dataset (we choose version 3.0) is a widely used text-to-graph dataset that contains text-graph pairs with multiple relation types, maintaining a rather general domain. For each text-graph pair, the corresponding DBpedia category is available, pertaining to the topic of the Wikipedia article it is extracted from. Using this information, we divide the test set into categories that are either seen or unseen in the training and validation data, providing the opportunity to test in- and out-of-distribution generalisation capabilities of LLMs (see Appendix C). Out of a total of 18 categories, the categories film, scientist and musical work are unseen in training and validation.
Contrary to the general domain, Bio Event pertains to biomedical data and aggregates 10 popular biomedical datasets, thus (potentially) representing an important domain-specific benchmark for text-to-graph tasks in healthcare. However, the overall quality of the Bio Event dataset requires further comment. Originally presented by Frisoni et al. (2022) as separate datasets for text-to-graph and graph-to-text tasks, we adopt the graph-to-text dataset for our text-to-graph task as it surprisingly includes a larger number of unique text-graph pairs. Another surprising finding is that, in this Bio Event dataset for the graph-to-text task, up to 78 unique sets of triplets correspond to a single reference text. This contradicts the underlying assumption that knowledge graphs and reference texts differ only syntactically and not semantically: while the same knowledge graph could correspond to multiple natural language reference texts with the same semantic meaning, a reference text naturally has only one corresponding knowledge graph that contains the full set of entities and relations described. Finally, there exists a large degree of dataset contamination, where the same reference text appears in both the training and test sets, yet connected to a different knowledge graph.
For the purpose of this work, we use a quick heuristic to clean up the Bio Event dataset, allowing us to use it as a biomedical alternative to the general domain of Web NLG. First, we refine the Bio Event graph-to-text dataset by removing any duplicate reference texts and breaking ties in favour of the text-graph pair pertaining to the longest linearised knowledge graph, assuming that the longest knowledge graph is the most complete description of the entities and relations described. This filters out 66% of the datapoints in the original Bio Event dataset, explained by the earlier observation that a single reference text in that dataset often corresponds to multiple unique sets of triplets, whereas we limit ourselves to strictly unique text-graph pairs. We further process the knowledge graphs to match the Web NLG setup by removing metadata associated with nodes and edges, and we finally obtain a train/validation/test set with an 80/10/10% split.
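The de-duplication step can be expressed compactly. The sketch below assumes a pandas dataframe with hypothetical columns text (reference text) and graph (linearised graph), which is not necessarily how the original data is stored.

```python
import pandas as pd

def deduplicate(df: pd.DataFrame) -> pd.DataFrame:
    """Keep, per reference text, the pair with the longest linearised graph."""
    df = df.assign(graph_len=df["graph"].str.len())
    # Sort so the longest graph per text comes first, then keep that one.
    df = df.sort_values("graph_len", ascending=False)
    return (df.drop_duplicates(subset="text", keep="first")
              .drop(columns="graph_len"))
```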
As shown in Table 2, the issue remains that there is a large number of unique entities and triplets compared to the number of available examples. Furthermore, as shown in Appendix A, manual inspection of cherry-picked examples reveals that reference texts and knowledge graphs do not always contain the same set of entities and relations, and thus they differ semantically. Consider, for example, the tuple taken from the Bio Event test set reproduced in Appendix A.
Producing the ‘correct’ knowledge graph in this instance requires (i) ignoring most of the text, and (ii) producing a relation that is absent in any grammatical form. The opposite is true for general-domain graphs in Web NLG, whose relations typically appear, in some grammatical form, in the reference text.
Finally, we follow Keysers et al. (2020) in assessing the compositional generalisation aspect of the train, validation and test sets. In brief, texts, or knowledge graphs in this case, can be categorised into atoms, the primitive building blocks (here, individual entities and relations), and compounds, combinations of those atoms (here, full triplets). The divergence between the atom and compound distributions of two splits then quantifies how much a model must generalise compositionally.
On Web NLG, we observe that the training and validation sets are similar in terms of both atoms and compounds, whereas the training and test sets differ substantially in terms of both atom and compound divergence. Not surprisingly, this is highest for unseen categories. How the distribution of Web NLG examples over train and test set originates is unclear, but a redistribution of examples over train, validation and test set in future work could be beneficial to more robustly test compositional generalisation. On Bio Event, we see low atom divergence and high compound divergence between the train, validation and test set, in line with the principles of compositionality. We recommend that the general practitioner pay attention to the quality of benchmark datasets, such as descriptive statistics and atom/compound divergence, since model performance is highly sensitive to the quality of the underlying data.
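As a rough guide, atom and compound divergence can be computed along the following lines, using the Chernoff-coefficient formulation of Keysers et al. (2020) under the simplifying assumption that atoms are individual entities and relations and compounds are whole triplets; the alpha values (0.5 for atoms, 0.1 for compounds) follow that paper.

```python
from collections import Counter

def chernoff_divergence(p: Counter, q: Counter, alpha: float) -> float:
    """1 minus the Chernoff coefficient between two frequency distributions."""
    p_total, q_total = sum(p.values()), sum(q.values())
    coeff = sum((p[k] / p_total) ** alpha * (q[k] / q_total) ** (1 - alpha)
                for k in p.keys() & q.keys())
    return 1.0 - coeff

def divergences(train_triplets, test_triplets):
    # Atoms: individual subjects, relations and objects; compounds: full triplets.
    atoms = lambda ts: Counter(x for t in ts for x in t)
    compounds = lambda ts: Counter(ts)
    atom_div = chernoff_divergence(atoms(train_triplets), atoms(test_triplets), alpha=0.5)
    compound_div = chernoff_divergence(compounds(train_triplets), compounds(test_triplets), alpha=0.1)
    return atom_div, compound_div
```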
Evaluation Metrics
We evaluate a model's performance with Recall-Oriented Understudy for Gisting Evaluation (Rouge) scores (Lin, 2004). Originally introduced for summarisation evaluation, the set of metrics is transferable to our text-to-graph setup by identifying a graph as a single-sentence summary. We specifically focus on Rouge-1, Rouge-2 and Rouge-L.
All scores are reported as the harmonic mean between recall and precision, making use of the implementation by Hugging Face.
The exact formulas to calculate Rouge-$n$ and Rouge-L are as follows. Let $C$ denote the candidate (generated) linearised graph and $R$ the reference graph. Rouge-$n$ measures the overlap in $n$-grams:

$$\text{Rouge-}n = \frac{2\,P_n R_n}{P_n + R_n}, \qquad P_n = \frac{|\text{$n$-grams}(C) \cap \text{$n$-grams}(R)|}{|\text{$n$-grams}(C)|}, \qquad R_n = \frac{|\text{$n$-grams}(C) \cap \text{$n$-grams}(R)|}{|\text{$n$-grams}(R)|}.$$

The Rouge-L metric is based on the longest common sub-sequence (LCS) between the candidate graph $C$ and the reference graph $R$:

$$\text{Rouge-L} = \frac{2\,P_L R_L}{P_L + R_L}, \qquad P_L = \frac{\mathrm{LCS}(C, R)}{|C|}, \qquad R_L = \frac{\mathrm{LCS}(C, R)}{|R|}.$$
The classical knowledge graph extraction pipeline consists of multiple stages, typically including co-reference resolution, named entity recognition, entity disambiguation and relation classification. Using LLMs for knowledge graph extraction entangles all stages into one, therefore complicating the process of ascribing model errors to one or more stages. Comparing the Rouge metrics employed here, however, allows some insights. For example, Rouge-1 is a direct measure of entity and relation recognition, although it does not distinguish between the two. Furthermore, Rouge-2 and Rouge-L additionally take sequence order into account, such that their difference with Rouge-1 measures whether entities and relations appear in the right order. In both cases, however, entity ambiguity and co-reference problems might be confounding factors. Finally, we note that the longest common sub-sequence differs from the longest common sub-string in that the matched tokens need not be contiguous, so Rouge-L rewards correctly ordered entities and relations even when other tokens appear in between.
How to best disentangle failure modes in knowledge graph extraction using LLMs remains an open question that other popular natural language evaluation metrics such as BLEU and cross-entropy do not provide an obvious answer to. Perhaps a more natural solution to evaluate knowledge graphs would be to extract entities, relations and triplets from structured natural language output and use knowledge-graph-specific metrics to measure model accuracy, coverage, coherence and succinctness. In this preliminary work, however, natural language in model output is not always rigidly structured and thus we pertain to natural language metrics. Furthermore, it is not obvious whether to assess knowledge graphs on entity or triplet level and how to deal with entity disambiguation and co-reference issues, such that natural language metrics on token level provide an easily accessible baseline.
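For completeness, a minimal sketch of the scoring step with Hugging Face's evaluate package is given below; the linearised strings are invented for illustration, and by default the returned values are F1 scores, matching the harmonic-mean reporting above.

```python
import evaluate  # pip install evaluate rouge_score

rouge = evaluate.load("rouge")

# Illustrative candidate and reference linearised graphs.
predictions = ["<S> Alan_Bean <P> occupation <O> Fighter_pilot"]
references = ["<S> Alan_Bean <P> occupation <O> Test_pilot"]

scores = rouge.compute(predictions=predictions, references=references)
print(scores["rouge1"], scores["rouge2"], scores["rougeL"])
```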
Model Architectures
This work focuses on the systematic comparison of two transformer architectures that allow text-generation capabilities, namely encoder-decoder and decoder-only models. The next paragraphs introduce these architectures from a high-level perspective and discuss the specific pre-trained models adopted in the experiments. Importantly, all models mentioned below are fully open-source and accessible through Hugging Face by adopting the transformers library (Wolf et al., 2020). A summary of the adopted models and their specifications is provided in Table 3.
Models Specification. Summary of the Models Used for Our Experiments, and Their Basic Specifications.
iCL: in-context learning; T5: text-to-text transfer transformer; BART: bidirectional auto-regressive transformers; REBEL: relation extraction dataset during pre-training.
Encoder-Decoder Models
Encoder-decoder architectures are a generic class of transformer models (Vaswani et al., 2017b). At their core, encoder-decoder models leverage the power and computational scalability of self-attention mechanisms and feed-forward neural networks in transformer architectures. Both the encoder and decoder consist of a stack of multi-head attention layers, except that the attention layers in the decoder are masked to prevent attending to future output tokens. Whereas the encoder learns a rich vector representation of model inputs, the decoder auto-regressively generates model output by attending to the encoded inputs and its own generated output. For a detailed description of encoder-decoder architecture and self-attention mechanisms, please refer to Vaswani et al. (2017b). We make use of two families of pre-trained encoder-decoder models, namely T5 and BART.
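For reference, the scaled dot-product attention at the heart of both the encoder and the decoder (Vaswani et al., 2017b) can be written as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where $Q$, $K$ and $V$ are the query, key and value projections of the token representations and $d_k$ is the dimensionality of the keys.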
T5
The T5 family of encoder-decoders was introduced by Raffel et al. (2020). The models undergo two pre-training stages, one self-supervised and one supervised. In the first, self-supervised stage, often referred to as denoising language modelling, multi-token spans of sentences are hidden from the input and the model is trained end-to-end to spell out which tokens are missing. In the second, supervised stage, the model is trained to solve task-specific scenarios, with inputs prefixed by task instructions. We adopt three sizes in the T5 family, namely 60.5M (t5-small), 223M (t5-base) and 738M (t5-large) learnable parameters.
BART
The BART model (Lewis et al., 2020) likewise follows the encoder-decoder architecture, but is pre-trained as a denoising autoencoder: input text is corrupted with an arbitrary noising function and the model learns to reconstruct the original text.
For a more in-depth understanding of these differences and the underlying architectural designs of the BART and T5 models, the reader is referred to Lewis et al. (2020) and Raffel et al. (2020). In our experiments we consider a recent version of BART (406M learnable parameters), along with variants that observed additional data during pre-training, namely conversation data and the REBEL relation extraction dataset.
Decoder-Only Models
Decoder-only models omit the encoder module of encoder-decoder models, trading a richer understanding of model inputs for the computational benefits of a streamlined architecture with fewer learnable parameters. We adopt two families of decoder-only models, namely Llama-2 and Mistral-v0.1.
Llama-2
Upon release, the Llama-2 family of decoder-only models (Touvron et al., 2023) showed among the strongest performances in the field of LLMs. We include two sizes, one with 7B and one with 13B learnable parameters.
Mistral-v0.1
A more recent introduction, the Mistral-v0.1 family of decoder-only models showcases strong performance, outperforming the Llama-2 13B model with almost half its parameters (Jiang et al., 2023). Mistral-v0.1 models make use of a number of computational advancements, for example sliding-window attention (Beltagy et al., 2020), which allows the model to dramatically extend the number of tokens it can simultaneously process. We make use of three models in this family, namely the original 7B model, its instruction-tuned variant, and a 7B variant fine-tuned on the OpenOrca dataset.
Learning Methods
We adopt two distinct task learning methods, fine-tuning for the smaller encoder-decoder models and iCL (Brown et al., 2020) for the larger decoder-only models.
Fine-Tuning
All fine-tuning experiments are based on the pre-trained checkpoints introduced in Section 3.4, accessed through the Hugging Face transformers library (Wolf et al., 2020), with learnable parameters updated on the training split of the respective benchmark.
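As a point of reference, a fine-tuning run of this kind can be sketched with the transformers trainer API as below. The hyper-parameter values are illustrative placeholders rather than the exact settings of our experiments, and train_ds/val_ds stand for tokenised datasets of (reference text, linearised graph) pairs.

```python
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

args = Seq2SeqTrainingArguments(
    output_dir="t2g-t5-small",
    learning_rate=3e-4,             # illustrative value
    per_device_train_batch_size=8,  # sized for a single GPU
    num_train_epochs=5,             # illustrative value
    evaluation_strategy="epoch",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,  # assumed tokenised (text, graph) pairs
    eval_dataset=val_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```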
In-Context Learning
Under the iCL setting, each pre-trained model is queried with a simple prompt containing a set of in-context examples, that is, pairs of reference texts and their linearised knowledge graphs, followed by a new reference text for which the model is expected to generate the corresponding graph.

Example prompt.
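Below is an illustrative reconstruction of how such a prompt can be assembled; the instruction wording and the Text/Graph markers are assumptions, not the exact prompt used in our experiments.

```python
def build_prompt(examples, new_text,
                 instruction="Convert the following text into a knowledge graph."):
    """Assemble an iCL prompt from N (text, graph) example pairs."""
    parts = [instruction]
    for text, graph in examples:
        parts.append(f"Text: {text}\nGraph: {graph}")
    parts.append(f"Text: {new_text}\nGraph:")  # the model completes the graph
    return "\n\n".join(parts)
```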
We omit time-consuming prompt engineering and computationally expensive prompt tuning to resemble the common practice of end users. However, we highlight the importance of such practices to prevent model hallucinations of the kind highlighted in Section 4.2.1 and, more generally, to prevent spurious features in prompt design along the lines of Sclar et al. (2023). To provide a fair estimate of iCL performance, we introduce a simple post-hoc hallucination-control heuristic to determine the end of the desired structured output (i.e. the end of a knowledge graph). Simply put, we truncate model output at the first appearance of the tokens that close the linearised graph structure.
The motivation behind this heuristic arises from preliminary experiments that showed our prompt design to be sub-optimal. In various instances the model seems to misinterpret the specified text-to-graph task by continuously generating text-graph alternations along the lines of the in-context example sequence, instead of outputting one single graph corresponding to the final reference text (see Appendix A for some cherry-picked examples). There are ample known methods to engineer a prompt for a given task, such as using delimiters to separate in-context examples, chain-of-thought prompting or more careful phrasing of the task description to maintain intent, but we avoid engaging with this time-consuming process and instead adopt the previously mentioned heuristic, sketched below. An extensive error analysis of the hallucination phenomenon and our hallucination-control heuristic is given in Section 4.2.1.
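Concretely, the heuristic amounts to the following sketch, where END_OF_GRAPH is a placeholder for whichever delimiter closes the linearised graph in the adopted serialisation scheme.

```python
END_OF_GRAPH = "[EOG]"  # hypothetical end-of-graph delimiter

def truncate_at_first_graph(output: str) -> str:
    """Keep model output up to and including the first end-of-graph marker."""
    idx = output.find(END_OF_GRAPH)
    if idx == -1:
        return output  # no marker generated; leave the output untouched
    return output[: idx + len(END_OF_GRAPH)]
```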
Experimental Setup
This article comprises two sets of experiments, designed to unveil the approximate overall power of the selected models and task learning methods, as well as to understand what impacts and shapes their performance. All our experiments are run on a single NVIDIA Quadro RTX 8000 GPU to resemble the experience and facilities of a general AI practitioner. We do recognise that assuming larger computational power could significantly improve results, especially by including larger decoder-only models or by fine-tuning the mid-sized Mistral-v0.1 and Llama-2 families, which is out of the computational reach of the current setup.
General Comparison
The main goal of this experiment is to understand how LLM characteristics and task learning methods perform in our text-to-graph task, under fixed computational resources, adopting two task domains, general and biomedical. Throughout, we aim to guide the general AI practitioner in understanding which combination is most suited for such a task and to showcase how to navigate (part of) the vast and complex spectrum of model design choices. Given the fixed computational resources, we fine-tune the previously introduced set of smaller encoder-decoder models and compare performance to the set of larger decoder-only models in combination with iCL. This choice is framed in the context of a given computational resource, such that fine-tuning is computationally infeasible for the larger models, while the short context window of the T5 and BART families (1k tokens or below) renders iCL unsuitable for them. We adopt eight in-context examples throughout this experiment.
Effect of the Number of In-Context Examples
The optimal number of in-context examples varies with the task at hand and perhaps with other unknown factors. In this experiment, we investigate the impact of the number of in-context examples on model performance, ranging from the zero-shot setting up to 32 examples, the maximum our computational setup and model context windows allow.
Results
General Comparison
The overall results of various combinations of model architecture, family, size, relevant pre-training data and task learning method are shown in Table 4. We describe general trends and subsequently compare Rouge metrics in Section 4.1.1. Note that for Web NLG, we compute Rouge metrics based on the full test set, and we refer to Appendix C for a comparison of seen and unseen categories.
General Experiment Results. We Report Rouge Scores Obtained by Various Combinations of Model Architecture, Family, Size (Learnable Parameters), Relevant Additional Data Seen During Pre-Training and Our Task Learning Method of Fine-Tuning or iCL on Two Benchmarks – Web NLG and Bio Event.
Scores of decoder-only models refer to results obtained with our hallucination-control heuristic (see Section 3.5.2).
iCL: in-context learning; T5: text-to-text transfer transformer; BART: bidirectional auto-regressive transformers; REBEL: relation extraction dataset during pre-training.
First, we compare task learning methods. On Web NLG, smaller fine-tuned encoder-decoder models and larger decoder-only models adopting iCL obtain comparable results across metrics. On Bio Event, however, there is a clear benefit in fine-tuning smaller encoder-decoder models. We hypothesise this relates to the issues regarding benchmark quality outlined in Section 3.2. For example, the high number of unique entities and triplets in Bio Event implies a complex distribution of patterns in reference texts that is difficult to infer correctly from just eight in-context examples. Overall best performance on both benchmarks across metrics is reached with fine-tuned encoder-decoder models, that is, the largest model in the T5 family on Web NLG and BART + REBEL on Bio Event.
A visualisation of Rouge-L model performance is shown in Figure 2. Within both the T5 and Llama-2 families, we find a clear positive correlation between model size and performance, in line with other findings in the literature (Hestness et al., 2017). Focusing on the BART family, we see that observing an additional relation extraction dataset during pre-training (REBEL) yields universally superior results. This is in sharp contrast to the other pre-training additions, since neither conversation data nor the instruction, OpenOrca or Meditron datasets seem to affect performance on either benchmark. We hypothesise none are particularly relevant to our text-to-graph task, although this is most surprising for the biomedical knowledge in the Meditron pre-training data.

General comparison. The scatter plot visualises the Rouge-L scores presented in Table 4, in addition to the results of in-context learning (iCL) with and without the hallucination-control heuristic applied.
Assessing in Figure 2 the performance of our hallucination-control heuristic for iCL, we observe that it yields a large performance boost, independent of benchmark, model architecture, family, size and pre-training data. To briefly reiterate, we avoided computationally and experimentally demanding prompt engineering or tuning by simply truncating model output at the tokens that signal the end of a knowledge graph (i.e. the delimiters closing the linearised graph structure).
Figure 3 presents the performance obtained by different combinations of model architecture, family, size, task learning method, and additional pre-training data, as evaluated by the set of Rouge metrics discussed in Section 3.3. To simplify the visualisation, results for models with the same amount of learnable parameters have been collapsed, reporting mean scores and an interval of one standard error both ways.

Metrics analysis. The lineplot visualises the Rouge scores in Table 4 as a function of model size. Hue and style refer to the different Rouge metrics (see Section 3.3). At 406M and 7000M learnable parameters we report mean scores, obtained by collapsing the results for models with the same amount of learnable parameters, while error bars report one standard error both ways.
We observe Rouge-1 scores to be consistently high, especially on Web NLG, indicating strong entity and relation recognition. On Bio Event, recognising entities and relations poses a more difficult task, as there is a significantly larger number of entities to be learned. On Web NLG, Rouge-2 and Rouge-L lie in similar neighbourhoods with a consistent gap towards Rouge-1 across model sizes, indicating that where entities and relations are often well identified, models struggle to put them in the right order and produce the right triplets. This reflects the more difficult relation classification step of the knowledge graph completion pipeline. On Bio Event, however, Rouge-L is consistently above Rouge-2 and close to Rouge-1, indicating the identified entities and relations are often in the right order, but certain entities or relations are missing, such that correct 2-grams are lacking.
For the remainder of this section, we report only Rouge-L scores to simplify visualisations, while noting that performances relative to other metrics are robust to what is described here.
Effect of the Number of In-Context Examples
In the previous section, we learned that complex knowledge graphs consisting of a wide spectrum of entities and relations are hard to model using iCL, especially with exposure to only eight in-context examples. In this experiment, we aim to understand how the amount of in-context examples affects model performance.
In the zero-shot setting (i.e. no in-context examples, only a task instruction and reference text), all models considered show weak performance.

Effect of the number of in-context examples for the Mistral-v0.1 models, evaluated on standard output and with the hallucination-control heuristic applied.
When evaluating models on their standard generated output, we observe a peak of performance at four to eight samples, preceded by a sharp increase, and then a slightly more moderate decrease. We go into further detail on the dynamics behind this phenomenon in the next section.
Hallucination Analysis
We now turn to a quantitative analysis of model hallucinations and our proposed heuristic for controlling them. Although hallucinations are observed across all included models, we focus specifically on the Mistral-v0.1 model fine-tuned on OpenOrca.

Hallucination analysis: Boxplots for the number of tokens and the percentage of graph output for the Mistral-v0.1 model fine-tuned on OpenOrca, as a function of various amounts of in-context examples.
Figure 5 allows us to draw more general conclusions on the nature of this hallucinated text. The figure displays a boxplot of the percentage of graph-based tokens in the output text, measured as the number of tokens in between the delimiters that open and close the linearised graph, relative to the total number of generated tokens.
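For reproducibility, the percentage of graph-based tokens can be measured along the following lines; the <graph> and </graph> delimiters are hypothetical placeholders for the tokens opening and closing the linearised graph.

```python
def graph_token_percentage(output: str, tokenizer,
                           open_tok="<graph>", close_tok="</graph>") -> float:
    """Share of generated tokens that fall inside the first linearised graph."""
    start = output.find(open_tok)
    end = output.find(close_tok, start + len(open_tok))
    if start == -1 or end == -1:
        return 0.0  # no complete graph found in the output
    graph_span = output[start + len(open_tok): end]
    return 100 * len(tokenizer.tokenize(graph_span)) / len(tokenizer.tokenize(output))
```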
Finally, Appendix E shows identical figures, except with our hallucination-control heuristic applied. We observe that all boxplots are close to the target distribution across all amounts of in-context examples, suggesting the heuristic effectively strips the hallucinated continuations from the model output.
Discussion
Overall, this work has been directed at the general AI practitioner in the general or biomedical domain aiming to develop an end-to-end LLM-based automatic graph extraction system from textual sources. Assuming a realistic computational baseline, we aim to contribute to the development of more effective and efficient pipelines for knowledge extraction and representation tasks by highlighting the impact of a plethora of design choices. Our large-scale comparison provided several empirical insights.
Foremost, we showed that off-the-shelf LLMs, together with a task learning method, achieve strong entity and relation recognition, and results on the overall task of knowledge graph completion can be classified as moderate yet promising. Optimal performance is likely higher than displayed here, e.g. with prompt engineering/tuning, hyper-parameter tuning, more computational power and more model parameters. Without a task learning method, off-the-shelf LLMs are not directly suitable for text-to-graph tasks; especially in the biomedical domain, zero-shot performance is weak. Comparing task learning methods, fine-tuning has proven more robust than iCL, since mid-sized decoder-only models adopting iCL show weak performance in the biomedical domain, while small fine-tuned encoder-decoder models achieve robust, moderate results in both the general and biomedical domains. We hypothesise that the expert knowledge contained in biomedical reference texts poses a more difficult knowledge extraction problem, such that iCL with a small amount of in-context examples is not sufficient to learn the task correctly. That is, knowledge graphs in the biomedical domain might require knowledge obtained through parameter updates on domain data, rather than knowledge that can be conveyed by a handful of in-context examples.
Due to computational constraints at inference time, we experimented with fine-tuning models of up to 738M learnable parameters and, due to context window constraints, with up to 32 in-context examples. As context window limits stem from computational constraints during pre-training, and the general practitioner designing a knowledge extraction pipeline typically starts from a pre-trained LLM, they only benefit from scaling up computational resources when opting for fine-tuning as the preferred task learning method. When it comes to fine-tuning, we find that including additional datasets during pre-training only boosts performance when the dataset directly pertains to the text-to-graph task. Among the options explored in this paper, only relation extraction data in the form of the REBEL dataset showed a significant boost in performance, while neither conversation data, instruction-tuning data nor additional biomedical data made an impact. It is especially surprising that biomedical data does not make a difference on our biomedical benchmark.
We leave the general AI practitioner with three recommendations with regard to being mindful of small details when designing an end-to-end LLM-based knowledge graph extraction pipeline. First, an LLM-based system is highly sensitive to the underlying training and benchmark data. In the case of knowledge graphs, data should contain text-graph pairs that carry identical semantic meaning but differ syntactically in their natural language and graph structure. Furthermore, benchmarks offering a sparse distribution of entities, relations and triplets pose a difficult knowledge extraction problem, since LLMs are given few examples to learn from. In addition, assessing atom and compound divergence between training and test sets should be common practice to validate an LLM's compositional generalisation capabilities and to distinguish between in- and out-of-distribution generalisation. We outline several issues with the biomedical benchmark adopted in this article in Section 3.2, such that our findings in this setting should be regarded as tentative. A thorough revision of Bio Event is required to establish a gold standard benchmark in the biomedical domain. With regard to the general domain, we find the Web NLG benchmark does not adhere well to the principles of compositional generalisation, and a redistribution of examples over training and test sets is recommended.
Second, the choice of evaluation metric for knowledge graph extraction using LLMs requires careful consideration. We opted for a set of Rouge metrics for natural language evaluation in order to allow for unstructured model output, yet a post-hoc method to identify entities and triplets, in combination with graph-specific evaluation metrics, might better distinguish between failure modes in the knowledge graph extraction pipeline. We discuss this in Section 3.3. Such a setup is applicable when LLMs can be trusted to adhere to a rigid output structure, but we find LLMs often violate our desired linearised graph structure. Using Rouge metrics, we find strong results for entity and relation recognition and moderate yet promising results for putting these together in the correct triplets.
Third, careful prompt engineering and prompt tuning are essential to avoid task misinterpretation and model hallucinations of the types described in Section 4.2.1. LLMs are known to be highly sensitive to small changes in input prompts, and we provide empirical evidence for this in the case of text-to-graph tasks. To solve the type of hallucinations we encountered, we proposed a simple heuristic based on our linearised graph structure that truncates model output at the tokens signalling the end of the first knowledge graph. This proved highly effective in boosting model performance without time-consuming prompt engineering or computationally expensive prompt tuning, which is not necessarily generalisable across subsets of the same dataset (Bertolini et al., 2022). These results suggest that when the output of a model follows a constrained structure, simple rule-based heuristics can be an efficient method to limit undesired output. Finally, we mention what seems to be a standard recommendation in the field of prompt engineering: use delimiters to separate in-context examples from a new observation and phrase the task description carefully to maintain intent, such that post-hoc heuristics become redundant.
Conclusion
In this work, we examined the performance of LLMs in automatically generating knowledge graphs from reference texts in the general and biomedical domains. In an end-to-end fashion, we used LLMs in combination with a task learning method, fine-tuning small encoder-decoder models or adopting in-context learning with mid-sized decoder-only models. We obtained comparable performance in the general domain, with high named entity and relation recognition and moderate yet promising knowledge graph completion. We show tentative evidence that knowledge graphs in the biomedical domain are harder to learn from textual sources than in the general domain, independent of the other factors considered. Moreover, in the biomedical domain, mid-sized decoder-only models adopting iCL show weak results, while small fine-tuned encoder-decoder models perform robustly. However, we find that a gold standard benchmark of text-to-graph data in the biomedical domain is lacking. In the zero-shot setting, we obtained weak performance for all LLMs considered. Additionally, we found no benefit from including additional datasets during pre-training that are not directly linked to the text-to-graph task, such as conversation data, instruction-tuning data and biomedical expert knowledge. Only including REBEL, a relation extraction dataset, showed a notable boost in performance. Finally, we proposed a simple heuristic to control for model hallucinations resulting from sub-optimal prompt design and provided evidence of its positive impact on performance. We hope these results guide best practices for implementing LLMs in automatic graph extraction; they suggest that smaller fine-tuned models with domain-specific optimisations are preferable over larger models adopting iCL. Future work could focus on refining these findings, especially by developing novel benchmark datasets and tools to evaluate structured knowledge graph output and disentangle failure modes in the knowledge graph extraction pipeline. The ultimate goal is to enhance knowledge extraction pipelines by utilising the power of LLMs for complex reasoning and text-to-graph AI systems.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
