Abstract
Understanding complex societal events reported on the Web, such as military conflicts and political elections, is crucial in digital humanities, computational social science, and news analyses. While event extraction is a well-studied problem in natural language processing (NLP), there remains a gap in semantic event extraction methods that leverage event ontologies for capturing multifaceted events in knowledge graphs. In this article, we aim to compare two paradigms to address this task of semantic event extraction: the fine-tuning of traditional transformer-based models versus the use of large language models (LLMs). We exemplify these paradigms with two newly developed approaches: T-SEE for transformer-based and L-SEE for LLM-based semantic event extraction.
Introduction
Event extraction (EE) aims to identify and classify events and their relations in text, including Web sources such as social media, news websites, and online encyclopedias like Wikipedia. Typically, this extraction process is conducted without relying on pre-existing knowledge structures or further structuring of extracted data. In contrast, the goal of semantic EE is to leverage an existing event ontology to lift unstructured text into a structured representation capturing the essence of the event, including its type (e.g. presidential election) and relations to entities (e.g. ⟨US presidential election 2020, successful candidate, Joe Biden⟩). Specifically, semantic EE aims at enriching knowledge graphs to make event information more accessible, that is, by adding events that are not yet contained in the knowledge graph because (i) the input texts are about recent events or (ii) the events of that type are considered out of domain (e.g. if the knowledge graph only contains more coarse-grained event types). Practical applications of event knowledge graphs include event-centric visualisations (Gottschalk & Demidova, 2020; Latif et al., 2021), biography generation (Gottschalk & Demidova, 2019a), event narrativisation (Porzel et al., 2022) and question answering over event-related information (Souza Costa et al., 2020).
Semantic EE operates at a critical juncture of Semantic Web and natural language processing (NLP) technologies:
The Semantic Web offers rich event ontologies such as LODE (Linking Open Descriptions of Events, Shaw et al., 2009) and the Simple Event Model (Van Hage et al., 2011) to represent events. However, cross-domain knowledge graphs such as DBpedia (Auer et al., 2007) and Wikidata (Erxleben et al., 2014) typically focus on named events, such as political summits and natural disasters, and lack adaptability to diverse expressions in text-based event descriptions. In addition, relation extraction and link prediction for knowledge graph population typically suffer from noisy data (Ji et al., 2022; Li et al., 2019a; Shirai et al., 2023), require the presence of the related entities in the knowledge graph (Stoica et al., 2020), and are thus not applicable for extracting relations of newly identified events. NLP employs named entity recognition and EE techniques to identify finer-grained, transient events like individual meetings or transactions (Xiang & Wang, 2019) from text. However, traditional NLP methods often deconstruct the task of semantic EE into smaller sub-tasks such as event detection (Mehta et al., 2019; Zheng et al., 2021) and argument extraction (Li et al., 2021; Ma et al., 2022; Wang et al., 2019), with each garnering its specific benchmark datasets (Ebner et al., 2020; Wang et al., 2020) typically not bound to semantic event ontologies.
This divergence results in a critical gap, creating a need for semantic EE that blends structured, ontology-based classification with the adaptability to handle a wide range of event types – from transient interactions to significant historical occurrences.
Although some efforts have been made towards semantic EE (Guo et al., 2023a; Rospocher et al., 2016), Guan et al. note that the construction of event knowledge graphs still suffers from the unsatisfactory performance of existing EE methods, especially for argument extraction (Guan et al., 2022). Most methods still fall short of delivering an integrative approach that works across various domains and effectively accommodates sufficiently rich and diverse ontologies (Hamborg et al., 2019; Wang et al., 2021b; Zhou et al., 2021), centring instead on dated NLP benchmark datasets such as ACE05 (Linguistic Data Consortium, 2005) or, conversely, on highly specific domains (Davani et al., 2019; Xu et al., 2021).

Example of semantic event extraction for an event mentioned in the Wikipedia article ‘2017 UEFA European Under-21 Championship Final’ using classes and properties in Wikidata. The figure shows a text (top-left), a set of queries consisting of an event class and a property (bottom-left), and the extracted event triples (right).
In this article, we introduce two approaches for semantic EE, which share the same structure but follow two different paradigms: transformer-based architectures and large language models (LLMs).
- Event classification: Approached as a multilabel classification problem, we identify the classes of the events mentioned in a text.
- Relation extraction: Utilising a span prediction transformer model, we target class-specific relations to construct a nuanced representation of events. In our example, we extract relations such as those shown in Figure 1.
In this way, we aim to contribute to the ongoing discourse on the potential and limitations of leveraging LLMs for information extraction (IE) and knowledge engineering, particularly in cases where LLMs may uncover information beyond the predefined ground truth or existing knowledge graphs.
Our contributions are as follows:
- We outline the underexplored area of semantic EE, situated at the intersection of the Semantic Web and NLP.
- We present two new approaches to semantic EE: the transformer-based T-SEE and the LLM-based L-SEE.
- We provide two new semantic EE datasets created from Wikipedia, Wikidata, and DBpedia: Wikidata-SEE and DBpedia-SEE.
- We demonstrate the efficacy of both approaches on these datasets.
- We perform an extensive manual annotation of the predictions of T-SEE and L-SEE.
We formally define the problem of semantic EE to bridge the gap between granular, structured information and the adaptability required to capture a wide variety of events.
In the context of this work, an event is an occurrence of societal importance, typically happening at a specific time and location, involving a set of participants. Examples of events include military conflicts, such as the Second World War, political shakeups, such as Brexit, but also more fine-grained events, such as the battles and air raids in the Second World War or specific football games.
We model information regarding entities (representing real-world events and real-world objects such as persons or locations) and their relations in an event knowledge graph. The classes and properties within the knowledge graph are defined by an event ontology:
Event Ontology
An event ontology O consists of a set of event classes C and a set of properties P that describe the relations of events.
Classes and properties in an event ontology are uniquely identified by an Internationalized Resource Identifier (IRI).
For example, properties can describe the location and the number of participants of events. Examples of event classes include final as a sub-class of sporting event.
Based on an event ontology, we formally define an event knowledge graph as follows:
Event Knowledge Graph
An event knowledge graph contains events, their classes from C and their relations to entities and literal values via properties in P.
In a relation ⟨s, p, o⟩, the subject s is an event or entity, p is a property from P, and the object o is an entity, an event class or a literal value.
We define the task of semantic EE as follows:
Semantic EE
Given an event ontology O, an event knowledge graph G and a text t describing an event, semantic EE denotes the task of:
- Identifying the event's class, that is, its event class relation.
- Extracting a set of relations of the event from t.
These relations, and the classes they involve, must adhere to the properties and classes of O.
Figure 1 illustrates an example text, a set of queries consisting of an event class and a property, and the event triples extracted for them.
Assumptions
To perform semantic EE given the defined problem statement, we propose methodologies that employ transformers and LLMs based on the following assumptions:
Tasks and Models
- Task representation: Following Definition 3, we frame semantic EE as a two-step task: event classification followed by relation extraction. This decomposition is assumed to be effective and meaningful for capturing events and their relations. Further, we directly intertwine the tasks of event detection and event classification: event classification detects and classifies events at the same time, that is, there are no events without an event class.
- Task dependency: Relation extraction depends on the results of event classification. This dependency is intentional, as event classes determine which relations to extract. Consequently, we assume errors to propagate across the entire pipeline, so any misclassification naturally affects relation extraction results. This error propagation needs to be reflected during evaluation.
- Model selection: We assume that both transformer-based models and LLMs are suitable for semantic EE.
- Transformers: Fine-tuned transformer models (e.g. Bidirectional Encoder Representations from Transformers, BERT) are assumed to generalise effectively for event classification and relation extraction when trained on high-quality, ontology-aligned datasets.
- LLMs: LLMs are assumed to generate structured outputs reliably when prompted with event ontologies. However, we acknowledge LLMs' sensitivity to prompt design and their tendency to hallucinate relations not present in training data, requiring careful validation.
- Event ontology scope: The selected event ontology must comprehensively define event classes and properties for the target domain. We assume the event ontology is extracted from a knowledge graph (e.g. Wikidata) and filtered to exclude overly specific or metadata-like entries. As described in our evaluation setup in Section 5.1.1, we use two event ontologies for training and evaluation, extracted from DBpedia and Wikidata.
- Data availability: Training data must consist of texts annotated with events, classes, and relations aligned with the event ontology. As described in our evaluation setup in Section 5.1, we use two datasets for training and evaluation. They contain triples from DBpedia and Wikidata, respectively, both linked to texts from Wikipedia. An example of a text annotated with Wikidata triples is given in Table 1.
- Annotation quality: In order to generate such large-scale datasets, we assume distant supervision during dataset creation to link triples to Wikipedia texts, acknowledging potential noise in annotations. Consequently, even despite a cautious dataset creation process, ground truth annotations may still contain omissions or inaccuracies, particularly in large-scale datasets. Further, annotations can vary regarding granularity (e.g. dbo:SportsEvent vs. dbo:TennisTournament) and completeness. This assumption motivates the manual validation of the evaluation results that we perform in Section 6.
Example of an Annotated Text as Required in a Dataset for Training and Evaluating a Semantic Event Extraction Model. This Example Is Based on Figure 1 Using Wikidata as the Target Event Ontology.
The given text is annotated with Wikidata triples representing the mentioned event and its relations.
- Setup: As described above, the evaluation setting requires a training and evaluation dataset and needs to assess the quality of event classification, relation extraction and their combination in semantic EE.
- Metrics: Metrics must reflect pipeline-wide performance, including error propagation. Therefore, we compute precision, recall and F1 scores.
- Error analysis: We assume manual error analysis is critical to identify phenomena like event ambiguity, type misalignment, and annotation discrepancies, which automated metrics may overlook.
T-SEE: Transformer-based Semantic Event Extraction
In this section, we present T-SEE, our transformer-based approach to semantic EE.
Through a three-step procedure of event classification, relation extraction and event modelling, we ensure comparability with L-SEE. To allow seamless integration into the Semantic Web, the whole architecture of T-SEE is guided by the classes and properties of the target event ontology.
Figure 2 offers a visual summary of T-SEE.

Overview of transformer-based semantic event extraction (T-SEE).
- Event classification: We formulate event classification as a multilabel classification problem and apply it to a given text.
- Query-based relation extraction: For each identified event, we extract its relations using a transformer-based extraction model and a subset of the pre-generated queries.
- Event modelling: We transform the extracted event information into triples and add them to the event knowledge graph.
With this process, T-SEE creates semantic event representations that conform to the event ontology.
In the following, we describe each step of T-SEE in detail.

Example of event classification and query-based relation extraction on a sentence in the Wikipedia article ‘Mahsa Amini protests’.
The query generation step is a preprocessing step that creates a set of queries, each consisting of an event class and a property.
Given an event ontology, we create queries for the properties that are commonly used with each event class.
Example
Figure 4 shows an example Wikidata SPARQL query to extract Wikidata properties commonly (i.e. above a minimum fraction of instances) used with a given event class.

SPARQL query on Wikidata to extract Wikidata properties commonly used with a given event class.
Using such SPARQL queries, we can generate the queries used in the subsequent relation extraction step.
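To make the query generation concrete, the following sketch retrieves frequently used properties of an event class from the public Wikidata SPARQL endpoint. The event class IRI (wd:Q40231) and the frequency threshold are illustrative assumptions, not the exact configuration used for our ontologies; each returned property would be paired with the event class to form one query.

```python
# Illustrative sketch of property extraction for query generation, assuming
# the public Wikidata endpoint; class IRI and threshold are example values.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://query.wikidata.org/sparql"

PROPERTY_QUERY = """
SELECT ?property (COUNT(DISTINCT ?event) AS ?usage) WHERE {
  ?event wdt:P31 wd:Q40231 .   # instances of the example event class
  ?event ?property ?value .
}
GROUP BY ?property
ORDER BY DESC(?usage)
"""

def frequent_properties(min_fraction=0.1):
    sparql = SPARQLWrapper(ENDPOINT, agent="semantic-ee-sketch/0.1")
    sparql.setQuery(PROPERTY_QUERY)
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    if not rows:
        return []
    # use the most frequent property's count as a proxy for the instance count
    total = max(int(r["usage"]["value"]) for r in rows)
    return [r["property"]["value"] for r in rows
            if int(r["usage"]["value"]) / total >= min_fraction]
```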
Example Queries Extracted From the Wikidata Ontology.
Given a text, the event classification step identifies the classes of the events mentioned in it.
Specifically, the input to our event classification model is a sequence of tokens derived from the text.
The hidden states are then passed through a dropout layer to reduce the number of connections between the pre-trained layers and the downstream layers, effectively forcing the downstream layers to learn more robust and generalisable representations of the input data. Finally, a fully connected layer and a sigmoid activation function are used in the output layer, generating a probability distribution over the possible event classes in the input text.
Additionally, we conduct threshold optimisation on a validation set. Prior work on multilabel classification, such as binary relevance methods (Tsoumakas & Katakis, 2007), often employs a fixed decision threshold (usually 0.5); instead, we select the threshold that maximises performance on the validation set.
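A minimal sketch of such a classifier and of the threshold search is given below, assuming a BERT encoder with a dropout and linear output layer as described above; the model name, pooling choice and threshold grid are illustrative assumptions rather than our exact configuration.

```python
# Minimal sketch of the multilabel event classifier and threshold optimisation
# described above; hyperparameters and the threshold grid are assumptions.
import torch
from transformers import AutoModel

class EventClassifier(torch.nn.Module):
    def __init__(self, num_event_classes, dropout=0.1):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("bert-base-uncased")
        self.dropout = torch.nn.Dropout(dropout)
        self.out = torch.nn.Linear(self.encoder.config.hidden_size,
                                   num_event_classes)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        pooled = hidden[:, 0]                  # [CLS] token representation
        # sigmoid yields one independent probability per event class
        return torch.sigmoid(self.out(self.dropout(pooled)))

def best_threshold(probs, labels, grid=None):
    """Select the decision threshold maximising micro-F1 on a validation set."""
    grid = grid if grid is not None else [t / 20 for t in range(1, 20)]
    def micro_f1(t):
        pred = (probs >= t).float()
        tp = (pred * labels).sum()
        return (2 * tp / (pred.sum() + labels.sum() + 1e-9)).item()
    return max(grid, key=micro_f1)
```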
Example
In our example, the event classification model receives the whole text shown in Figure 3 ('Amidst a revolution in Tehran, continuing conflicts on the streets persist as the Iranian government violently reacts to the Mahsa Amini protests.') as an input and returns two event classes (conflict and revolution) corresponding to the two events in the text.
Training
To train the event classification model, we use a corpus of texts annotated with event classes from the event ontology.
Query-Based Relation Extraction
Given the text and the identified event classes, the query-based relation extraction step extracts the values of the class-specific properties.
We leverage BERT (Devlin et al., 2019) as the base of our query-based relation extraction model.
Specifically, we encode the text together with each selected query.
As shown in line 14 of Algorithm 1, each selected query is answered by predicting the span of the text that contains the property's value.
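The following sketch illustrates this query-based span prediction using a generic question-answering head; in practice, the head would be fine-tuned on our ontology-aligned data, and the query format shown here is a simplified assumption.

```python
# Sketch of query-based relation extraction as extractive span prediction.
# The QA head of this off-the-shelf model is untrained; in practice it would
# be fine-tuned on our datasets. The query format is an assumption.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")

def extract_relation(text, event_class, prop):
    query = f"{event_class} {prop}"            # e.g. "conflict location"
    enc = tokenizer(query, text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**enc)
    start = out.start_logits.argmax().item()
    end = out.end_logits.argmax().item()
    if end < start:                            # no valid span predicted
        return None
    return tokenizer.decode(enc["input_ids"][0, start:end + 1])
```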
Example
For our predicted event classes conflict and revolution, the queries belonging to these classes are selected.
Training
To train our query-based relation extraction model, we use a corpus of texts with event mentions and their relations with properties in the event ontology.
Event Modelling
In the event modelling step, we materialise the extracted event information as triples and enrich the event knowledge graph with them (lines 17 and 18). Precisely, for each text, we create the following triples for each extracted event:
- Type relation for the identified event class.
- Description of the event.
- Relations extracted with our query-based relation extraction.
This process is repeated for all texts in an input corpus and the events extracted within them, after which the ontology-mapped relations can be transformed into RDF triples. As described in Definition 3, the event modelling step creates new triples of events not yet represented in the target knowledge graph.
For representing the provenance and explicitly providing the source of the semantic event representation, further information could be added, for example, a URL pointing to the source text and a description of the extraction method. To do so, sources can be directly linked to a source statement in Wikidata. Another option would be to use the PROV-O ontology (Hoekstra & Groth, 2015).
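As a sketch, the triples of a newly extracted event could be materialised with rdflib as follows, including a PROV-O provenance link to the source text; all IRIs below are illustrative examples rather than outputs of our pipeline.

```python
# Illustrative sketch of the event modelling step using rdflib, including a
# PROV-O provenance link to the source text; all IRIs are example values.
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF, PROV

WD = Namespace("http://www.wikidata.org/entity/")
WDT = Namespace("http://www.wikidata.org/prop/direct/")
EX = Namespace("http://example.org/event/")

g = Graph()
event = EX["extracted_event_1"]
g.add((event, RDF.type, WD["Q40231"]))            # type relation (event class)
g.add((event, WDT["P17"], WD["Q183"]))            # extracted relation: country
g.add((event, PROV.wasDerivedFrom,                # provenance of the source text
       URIRef("https://en.wikipedia.org/wiki/Example_article")))
print(g.serialize(format="turtle"))
```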
Example
Figure 3 illustrates relations extracted for the example events conflict and revolution. Given the conflict event, the relations shown in the figure are created.
We provide examples of generated RDF triples in Section 5.6.
L-SEE: LLM-based Semantic Event Extraction
In this section, we present L-SEE, our LLM-based approach to semantic EE.
Figure 5 offers a visual summary of L-SEE.

Overview of large language model-based semantic event extraction (L-SEE).
- Event classification: We perform event classification as a multilabel classification problem by prompting an LLM to detect events and their classes in a text.
- Relation extraction: We prompt the LLM to extract relations of all identified events.
- Event modelling: We transform the extracted event information into triples and add them to the event knowledge graph.
For event classification (line 7 in Algorithm 2), L-SEE prompts the LLM with the input text and the set of event classes of the event ontology.
The event classification LLM prompt template is shown in Figure A.1 in the Appendix. It consists of the following parts:
- Instruction: Explicitly defines event classification and the operational definition of an 'event'.
- Example: Illustrates the task with a one-shot example, including a sample text, identified event classes, and explanations to clarify expectations.
- Output options: Explicitly lists the full set of potential outputs, that is, the set of all event classes of the event ontology.
- Task: Specifies the input text to be classified.
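A condensed, hypothetical version of this prompt structure is sketched below; the wording is an assumption, and the authoritative template is the one shown in Figure A.1.

```python
# Condensed sketch of the event classification prompt; the wording is an
# assumption, the authoritative template is shown in Figure A.1.
EVENT_CLASSIFICATION_PROMPT = """\
Instruction: Identify all events mentioned in the text. An event is an \
occurrence of societal importance with a time, a location and participants. \
Only return event classes from the output options.

Example:
Text: "Amidst a revolution in Tehran, ... the Mahsa Amini protests."
Event classes: ["conflict", "revolution"]

Output options: {event_classes}

Task:
Text: "{text}"
Event classes:"""

prompt = EVENT_CLASSIFICATION_PROMPT.format(
    event_classes='["conflict", "revolution", "election", ...]',
    text="...",
)
```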
Relation Extraction
For relation extraction (line 11 in Algorithm 2), L-SEE prompts the LLM with the input text, the identified event classes and their class-specific properties.
The relation extraction LLM prompt template is shown in Figure A.2 in the Appendix. For our condensed example in Table 2, it consists of the following parts:
- Instruction: Defines the task (relation extraction), specifies expected property-value formats (e.g. temporal or spatial attributes), and mandates valid JSON output. Semantic constraints enforced through data type conventions ensure consistency for downstream processing.
- Example: Provides a one-shot demonstration with a text snippet, event classes, properties, and a corresponding JSON output to model structured responses.
- Task: Presents the input text together with the identified event classes and their properties.
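Analogously, a condensed sketch of the relation extraction prompt and of parsing the mandated JSON answer might look as follows; again, the wording is an assumption and the authoritative template is Figure A.2.

```python
# Condensed sketch of the relation extraction prompt and JSON parsing; the
# wording is an assumption, the authoritative template is shown in Figure A.2.
import json

RELATION_EXTRACTION_PROMPT = """\
Instruction: For the given event, extract the values of the listed properties
from the text. Use YYYY-MM-DD for dates. Answer with valid JSON only.

Example:
Text: "..." Event: conflict Properties: ["location", "start time"]
Answer: {{"location": "Tehran", "start time": "2022-09-16"}}

Task:
Text: "{text}" Event: {event_class} Properties: {properties}
Answer:"""

def parse_relations(llm_answer):
    try:
        return json.loads(llm_answer)
    except json.JSONDecodeError:   # misformatted output, cf. Section 6.3.1
        return {}
```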
Event Modelling
The event modelling step (lines 14 and 15 in Algorithm 2) follows the procedure outlined for T-SEE in Section 3.
Evaluation
In this section, we introduce two new datasets for semantic EE and compare T-SEE and L-SEE against baseline approaches.
Datasets
We introduce two new large-scale datasets that currently stand as the largest and most diverse datasets for the task of semantic EE and follow our assumptions stated in Section 2.1: DBpedia-SEE and Wikidata-SEE. They are available online.
DBpedia-SEE and Wikidata-SEE serve as training and test corpora for semantic EE based on event ontologies of DBpedia and Wikidata. To comply with the definition of semantic EE in Definition 3, each dataset belongs to an event ontology.
Event Ontology Extraction
In the first step, we extract relevant event classes and their properties from DBpedia and Wikidata to create two event ontologies. The main reason why we extract event ontologies from DBpedia and Wikidata instead of using event ontologies such as LODE (Shaw et al., 2009) and the Simple Event Model (Van Hage et al., 2011) is that we do not only require an event ontology but also a large corpus of events modelled with such ontology, as available in the DBpedia and Wikidata knowledge graphs. Further reasons are as follows: (i) we focus on cross-domain knowledge graphs, with DBpedia and Wikidata being well-established cross-domain knowledge graphs that are nevertheless inherently incomplete and bear potential for extension (Shenoy et al., 2022), (ii) as described in Section 1, we focus on named events and (iii) to create our evaluation datasets (see Section 5.1.2), we utilise Wikipedia links which can be directly mapped to Wikidata and DBpedia entities.
Filtering Protocols and Thresholds
To ensure the quality and relevance of the event classes and properties extracted from Wikidata and DBpedia, we apply stringent filtering protocols. Specifically, we restrict event classes and properties to those used in the context of events and apply minimum usage thresholds for event classes and properties.
While we try to keep manual interventions minimal and to be as consistent as possible in our annotations, for the remaining events and properties, we need to manually filter out overly specific event classes and metadata properties. Specifically, for Wikidata, we filtered out the following three types of event classes and properties:
- Event classes specifically about a country (we still consider their parent classes; for example, instead of 'UK Parliamentary by-election', there still is 'by-election'). Examples are:
  - Turkish general election (wd:Q22333900)
  - Spanish Grand Prix (wd:Q9208)
  - Sydney International (wd:Q248952)
- Classes that are wrongly categorised as event classes in Wikidata. Examples are:
  - communications satellite (wd:Q149918)
  - space telescope (wd:Q148578)
  - crewed spacecraft (wd:Q7217761)
- Properties that do not represent real-world relations (e.g. identifiers). An example is:
  - X username (wdt:P2002)
Statistics of the resulting DBpedia and Wikidata event ontologies, that is, their numbers of event classes and properties, are shown in Table 3.
Statistics of the Extracted DBpedia and Wikidata Event Ontologies.
SEE = semantic event extraction.
To extract texts and the RDF triples representing mentioned events, we follow a distant-label generation process. The individual texts are sentences extracted from articles in the English Wikipedia describing events. Event classes and relations are extracted by exploiting existing links to events and their DBpedia or Wikidata representations.
Figure 6 illustrates the distant-label generation process with an example: The Wikipedia article 'Turkish involvement in the Syrian civil war' has a link to the event 'Operation Euphrates Shield', which has a relation to Syria that is also mentioned in the same text. Consequently, we select the text, the event class military operation, and the country relation to Syria.

Example illustrating how we label texts with events and relations. The Wikipedia text on the left links to the Wikidata event on the right side, which also has a relation to an entity mentioned in the text.
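The core of this labelling heuristic can be summarised in a few lines; the data structures below are simplified assumptions about how sentence links and knowledge graph relations are represented.

```python
# Simplified sketch of the distant-labelling heuristic: a sentence is labelled
# with an event and its relations if it links to the event and also mentions
# entities the knowledge graph relates to that event. Data layout is assumed.
def label_sentence(sentence_links, event_id, event_relations):
    """sentence_links: ids of entities/events linked from the sentence;
    event_relations: mapping property id -> object id for the event."""
    if event_id not in sentence_links:
        return []                          # sentence does not mention the event
    return [(event_id, prop, obj)
            for prop, obj in event_relations.items()
            if obj in sentence_links]      # keep relations grounded in the text

# Toy example: the sentence links to the event and to an entity related to it
print(label_sentence({"event1", "syria"}, "event1", {"country": "syria"}))
```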
As delineated in Table 4, DBpedia-SEE includes 42,648 texts, and Wikidata-SEE contains 37,988 texts, where each text contains at least one annotated event and its corresponding relations. Together, these datasets feature 80,636 uniquely annotated events and 111,663 relation instances, making them the most extensive repositories for training and evaluating EE models to date.
Statistics of Our Datasets for Semantic Event Extraction.
SEE = semantic event extraction.
DBpedia-SEE and Wikidata-SEE distinctly surpass existing benchmarks for the task of semantic EE due to their use of RDF annotations, their focus on general-domain events with societal impact and the coverage of both event detection and relation extraction annotations. Datasets such as SuicideED (Guzman-Nateras et al., 2022), SCIERC (Luan et al., 2018) and GENIA (Ohta et al., 2002) only cover very domain-specific events. MAVEN (Wang et al., 2020) and MINION (Pouran Ben Veyseh et al., 2022) only provide annotations for event detection, not relation or argument extraction. The existing larger event datasets like GDELT (Leetaru & Schrodt, 2013; Li et al., 2022) are less structured and not in RDF. In comparison to the ACE05 (Linguistic Data Consortium, 2005) dataset typically used for EE, our datasets DBpedia-SEE and Wikidata-SEE:
- are freely available, whereas ACE05 is only available for $4,000.00 to non-members of the Linguistic Data Consortium,
- have wider coverage of event domains (for example, ACE05 does not have sport-related events),
- use RDF classes and properties,
- have a large number of event classes and properties (see Table 3), and
- provide a large number of texts (DBpedia-SEE: 42,648 texts; Wikidata-SEE: 37,988 texts).
These attributes amplify the datasets' potential for semantic EE, which cannot be performed with other existing datasets.
With our distantly labelled datasets DBpedia-SEE and Wikidata-SEE, we are able to (i) train T-SEE and (ii) evaluate both T-SEE and L-SEE.
In our experiments, we split the datasets into training, test, and validation sets.
Evaluation Setup
Next, we describe our evaluation setup, that is, baselines and metrics.
Baselines
We compare T-SEE and L-SEE to a set of baseline approaches.
The selection of baselines for our study is carefully considered but constrained by the availability and adaptability of existing EE methodologies due to the following reasons: (i) despite their valuable contributions, several works do not provide any accessible implementations (Huang et al., 2023; Li et al., 2020; Liu et al., 2022b), which is a critical barrier to replication and further research. (ii) The usability of many EE frameworks is hampered by a lack of comprehensive documentation and a dependency on specific or proprietary datasets, notably the ACE05 dataset (Du & Cardie, 2020; Hsu et al., 2022; Liu et al., 2019a; Lu et al., 2023). Other methodologies like DEGREE (Hsu et al., 2022) and the question-answering paradigms by Du and Cardie (2020) and Lu et al. (2023) necessitate additional, task-specific inputs such as argument and description queries, complicating their integration into diverse research settings. Similarly, Liu et al. (2019a) and ChatIE (Wei et al., 2023) are hindered by very limited documentation and strict data formatting requirements. (iii) CLEVE (Wang et al., 2021b) cannot be adapted to our definition of semantic EE due to its presupposition of argument type knowledge. (iv) Frameworks like AllenNLP (on which DyGIE++ (Wadden et al., 2019) is built) have been discontinued, and (v) the substantial computational resources required for models like the 10-billion parameter Deepstruct (Wang et al., 2022) model further limit their viability.
Given these considerations, we have chosen baselines that are accessible, documented and adaptable to our definition of semantic EE.
Metrics and Setting
To evaluate T-SEE and L-SEE, we use the following metrics and settings.
We judge the accuracy of event classification using precision, recall, and F1 scores.
Analogously, we use the same metrics for evaluating relation extraction, where relations are only considered to be correct if connected to a correctly classified event via the correct property and to the correct entity or value.
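Under this strict matching, the scores can be computed as in the following sketch, where predictions and ground truth are represented as sets of fully specified relation triples (an assumed representation).

```python
# Sketch of the strict relation-level scoring described above: a predicted
# relation only counts as correct if event class, property and value all match.
def precision_recall_f1(predicted, gold):
    """predicted, gold: sets of (event_class, property, value) triples."""
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```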
In this section, we report the results of T-SEE, L-SEE and the baselines.
Event Classification
Table 5 shows the evaluation results of event classification.
Precision (P), Recall (R) and F1 Scores for Event Classification on DBpedia-SEE and Wikidata-SEE.
SEE = semantic event extraction; T-SEE = transformer-based semantic event extraction; L-SEE = large language model-based semantic event extraction.
Best results are marked in bold.
The performance of the individual approaches is discussed in the following.
Table 6 presents the relation extraction performance of the approaches.
Precision (P), Recall (R) and F1 Scores for Relation Extraction on DBpedia-SEE and Wikidata-SEE.
SEE = semantic event extraction; T-SEE = transformer-based semantic event extraction; L-SEE = large language model-based semantic event extraction.
Best results are marked in bold.
Given the sequential structure of our approaches, errors in event classification propagate to relation extraction.
For both DBpedia-SEE and Wikidata-SEE, we observe consistent trends across the approaches.
Example Result
Finally, we provide example RDF triples of an event extracted from the Wikipedia article '1991 Monte Carlo Open'.

Example of RDF triples generated from the Wikipedia article ‘1991 Monte Carlo Open’ using the Turtle syntax.
To address the variability of the LLM in generating outputs for identical inputs, we evaluate the consistency of L-SEE across repeated executions.
Our analysis reveals a high level of consistency for both event ontologies, as summarised in Table 7. For DBpedia-SEE, we observe a high average Fleiss' κ.
Consistency Analysis Results of L-SEE for Event Classification and Relation Extraction.
L-SEE = large language model-based semantic event extraction; SEE = semantic event extraction.
The few cases demonstrating disagreement across LLM executions for event classification and relation extraction can be attributed to the design and complexity of the two tasks and the event ontologies. There is a stronger agreement for DBpedia-SEE compared to Wikidata-SEE due to the lower number of classes and properties (see Table 3): with a lower number of event classes to select from, there naturally is a higher chance of agreement. This also explains the consistency observed for relation extraction, where the prompts include only a small number of properties.
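For reference, the following sketch shows how such a consistency score can be computed with statsmodels, treating each repeated L-SEE run as one rater per item; the ratings shown are invented toy data.

```python
# Sketch of the consistency analysis via Fleiss' kappa, treating each repeated
# L-SEE execution as a 'rater' of each item; the ratings are toy data.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = items, columns = runs; values = predicted category per run
ratings = np.array([
    [1, 1, 1],   # all three runs agree on item 0
    [1, 2, 1],   # one run deviates on item 1
    [0, 0, 0],
])
table, _ = aggregate_raters(ratings)
print(fleiss_kappa(table))
```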
In order to implement our multilabel classification model, we leverage a pre-trained uncased BERT base model.
The model is fine-tuned on the respective training set for a fixed number of epochs.
To generate the training data, we extract Wikipedia articles using the MWDumper. For entity linking, we use the Spacy Entity Linker, a named entity linking tool specifically designed for Wikidata.
For L-SEE, we use an instruction-following LLM prompted as described in Section 4.
Comparison of T-SEE and L-SEE
A significant finding of our evaluation is the worse performance of L-SEE compared to T-SEE in the automatic evaluation.
Manual Evaluation
In this section, we aim to understand the differences between the two paradigms of transformer-based architecture versus using LLMs for semantic EE. Therefore, on top of the automatic evaluation performed in Section 5, we perform a comparison of T-SEE and L-SEE based on a manual assessment of their predictions.
We create DBpedia-SEE100 – a subset of DBpedia-SEE with 100 texts – for this manual assessment.
Table 8 shows the results of evaluating T-SEE and L-SEE on DBpedia-SEE100 before and after manual assessment.
Evaluation of T-SEE Versus L-SEE on DBpedia-SEE100 Before and After Manual Assessment.
TP = true positives, FP = false positives, FN = false negatives; SEE = semantic event extraction; T-SEE = transformer-based semantic event extraction; L-SEE = large language model-based semantic event extraction.
After manual assessment, the performance of L-SEE increases considerably.
These results underscore the strengths and limitations of both methodologies. While T-SEE closely mimics the characteristics of the distantly labelled ground truth, L-SEE extracts correct information beyond it.
To understand the differences in behaviours between T-SEE and L-SEE, we inspect their errors in detail.
Our manual annotation process has unveiled a structured classification of errors, which we have divided into three principal categories:
Extraction Inaccuracies
Errors arising from the model’s inability to accurately interpret information within texts:
- Omissions or missing events/relations: The event or its relations are not extracted.
- Type misalignment: An inappropriate type of entity or value is selected for a given property.
- Granularity mismatch: The model's predictions lack the specificity of the ground truth, for example, categorising an event broadly as dbo:SportsEvent rather than the more specific dbo:TennisTournament.
- Erroneous extraction: The extraction of incorrect properties or values, leading to a misrepresentation of the factual content.
Annotation Discrepancies
Errors stemming from inconsistencies, errors or omissions in the ground truth:
Imprecise event class: The model’s predictions provide a more detailed event classification. Imprecise property: The model predicts property values with greater accuracy than the ground truth, such as specifying the exact match score when the ground truth only acknowledges the victory. Annotation error: The presence of omissions or inaccuracies within the ground truth itself, such as neglecting to annotate the specific date of a match or other pertinent details.
Other Anomalies
Errors arising from other sources:
- Event ambiguity: The model struggles to distinguish between multiple distinct events described within a single sample, which may lead to conflated or mixed property assignments.
- Processing error: Errors introduced by pre-processing steps such as entity linking.
Examples of Errors
We provide four semantic event representations generated by T-SEE and L-SEE, annotated with the identified error types.
The first example (Figure 8) is based on ground truth extracted from dbr:Black_Monday_(1360) and contains an omission and, on the other hand, an annotation error. Example of an omission error and an annotation error.

- Type misalignment: The commanders are incorrectly identified and assigned to group entities instead of individuals.
- Processing error: In the entity linking process, 'Lebanese militia Hezbollah' is wrongly linked to three entities.
Example of type misalignment and processing errors.

- Imprecise event class: The model predicts a more detailed event class than annotated in the ground truth.
- Erroneous extraction and event ambiguity: The same tennis tournament did not happen in both Wuhan and Beijing.
Example of an imprecise event class, an erroneous extraction and event ambiguity.

- Event ambiguity: The date '1961-01-01' indicates confusion between multiple events. Specifically, this is because the event annotated in the ground truth is derived from the link tied to the string 'previous election', referring to dbr:1959_Ontario_general_election.
- Erroneous extraction: The use of dbo:secondLeader to indicate a chronological successor is highlighted in red, illustrating a misunderstanding of the property, as dbo:secondLeader is meant to instead describe second ranking in a competition.
- Annotation error: The relation using the dbo:affiliation property is missing in the ground truth.
Example of event ambiguity, erroneous extraction and an annotation error.

On the basis of our error taxonomy, we annotated each semantic event representation generated by T-SEE and L-SEE on DBpedia-SEE100.
First, we categorise errors into extraction inaccuracies, annotation discrepancies, and other anomalies to clarify our approaches' error landscapes. Figure 12 visualises these error profiles for T-SEE and L-SEE.

Distribution of error categories for T-SEE and L-SEE.
Figure 13 provides a detailed analysis of the error types. As can be seen in the figure, different error types manifest with varying frequencies across T-SEE and L-SEE.

Distribution of error types for T-SEE and L-SEE.
For L-SEE, we observe a notable share of Misunderstanding and Type Misalignment errors, with the LLM, for example, misinterpreting the semantics of properties. Conversely, T-SEE's errors more often mirror the characteristics of the distantly labelled ground truth (see Figure 13).
Upon a more nuanced examination, especially after correcting for annotation errors, the performance landscape shifts. Initially, T-SEE appears superior; after the manual assessment, however, L-SEE's results improve considerably.
In summary, while T-SEE achieves higher scores in the automatic evaluation, L-SEE extracts valid information beyond the distantly labelled ground truth.
As indicated in Section 5.2.2, for L-SEE, we observe the following types of invalid outputs:
- Misformatted output: The LLM-generated JSON strings could not always be parsed.
- Non-existing event classes: In several cases, an event class was predicted that is not part of the event ontology.
- Invalid properties: In 1,191 cases, a property was identified which is not part of the event ontology.
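Such outputs can be detected by validating each parsed prediction against the event ontology, as in the following sketch; the prediction structure is an assumption about the parsed JSON, not the exact format used by L-SEE.

```python
# Sketch of post-hoc validation of L-SEE outputs against the event ontology,
# covering the three failure modes listed above; the output format is assumed.
def validate(prediction, ontology_classes, ontology_properties):
    errors = []
    if prediction is None:                       # JSON could not be parsed
        return ["misformatted output"]
    for event in prediction.get("events", []):
        if event.get("class") not in ontology_classes:
            errors.append(f"non-existing event class: {event.get('class')}")
        for prop in event.get("relations", {}):
            if prop not in ontology_properties:
                errors.append(f"invalid property: {prop}")
    return errors
```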
Effect of Text Characteristics on Semantic EE
To get a sense of how text characteristics affect T-SEE's and L-SEE's performance, we evaluate both approaches on targeted subsets of the data.
Text Characteristics
We employ a collection of strategies to generate meaningful subsets of the dataset, each aimed at isolating different factors that could influence extraction performance:
- Semantic diversity: We assess samples for semantic diversity. The semantic diversity of a text is measured by the variety of verb phrases and their arguments, approximated by the count of unique verb lemmas in the text. Samples with high semantic diversity are chosen for this subset, aiming to test the model's understanding of varied semantic contexts and its ability to extract a broad range of event semantics.
- Text length: This strategy sorts the samples by the length of the text. Samples are then selected from the sorted list, prioritising those with the longest texts.
- Geographical entities: Samples of this subset are generated based on the count of geographical entities identified by the Spacy NLP pipeline (i.e. 'GPE' and 'LOC' labelled entities) in each text. To assess the model's proficiency in dealing with texts containing diverse geographical references, we select samples with the highest counts of such entities.
- Temporal expressions: We identify texts with temporal expressions using the Spacy library and select those where they occur most frequently. As temporal expressions can be crucial for event understanding, this subset evaluates the handling of temporal information.
- Entity diversity: This subset focuses on the diversity of named entities. We again utilise the Spacy library to extract named entities and then sort and select samples with the widest range of entities. This subset tests the model's ability to accurately recognise and categorise entities in the context of events.
- Syntactic complexity: Samples with intricate syntactical constructions are selected to challenge the models' handling of complex sentences.
A sketch of how such text characteristics can be computed is given below.
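The following sketch approximates these characteristics with Spacy; the statistics mirror the strategies above, while the exact selection thresholds are not reproduced here.

```python
# Sketch of the text statistics behind the subset construction, assuming the
# Spacy English pipeline (requires: python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")

def text_characteristics(text):
    doc = nlp(text)
    return {
        "semantic_diversity": len({t.lemma_ for t in doc if t.pos_ == "VERB"}),
        "text_length": len(doc),
        "geo_entities": sum(e.label_ in ("GPE", "LOC") for e in doc.ents),
        "temporal_expressions": sum(e.label_ in ("DATE", "TIME")
                                    for e in doc.ents),
        "entity_diversity": len({e.text for e in doc.ents}),
    }
```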
In the following, we detail the outcomes of this analysis, demonstrating how the text characteristics affect the performance of T-SEE and L-SEE.

Large language model-based semantic event extraction (L-SEE) performance on the dataset subsets (event classification).

Large language model-based semantic event extraction (L-SEE) performance on the dataset subsets (relation extraction).
For comparison, we also include the full dataset performance in our analysis. The distinctly strongest relation extraction performance on the full dataset suggests that we have successfully sampled parts of our data that are particularly challenging. The results further suggest that longer sentences pose significant challenges, with lower macro metrics for event classification and even more pronounced difficulties in relation extraction. This indicates that long and complex texts remain hard to process. Across the other subsets, we observe notable difficulties as well.
This analysis underscores the influence of text characteristics on semantic EE performance.
With our comparison of T-SEE and L-SEE, we identified the following phenomena to be regarded when performing semantic EE:
Mimicry of dataset characteristics: Our analyses, for example, in Tables 6 and 8 and Figure 12, clearly demonstrate that the results of methodologies fine-tuned on the target datasets (T-SEE) mimic the characteristics of these datasets, including their annotation errors.
Distantly labelled datasets: Training a transformer-based architecture requires the availability of large training data, that is, texts annotated with RDF triples. Therefore, we opted for the automated extraction of two new datasets. The use of distantly labelled datasets without human annotations such as DBpedia-SEE and Wikidata-SEE for semantic EE or datasets for relation extraction (Elsahar et al., 2018; Goodrich et al., 2019; Han et al., 2020; Yao et al., 2019) overcomes the issue of training data availability but always comes with questions regarding dataset quality.
Specifically, we identified a large number of false positives when evaluating L-SEE against the distantly labelled ground truth, many of which turned out to be correct in the manual assessment.
Ontology guidance: We carefully guided both our approaches through our event ontologies. By fine-tuning a transformer-based architecture, adherence to the ontology can be enforced, for example, by explicitly classifying into the event classes pertinent to the event ontology. For an LLM, in contrast, while we prompted for the specific event classes and properties, we still observed cases of invalid event classes or properties as discussed in Section 6.3.1. Also, our examples demonstrated cases of type misalignment and a misunderstanding of the semantic definition of a property (dbo:secondLeader in Example 4), demonstrating the need to control the outputs of an LLM. Improving the precision of LLM-based semantic EE is a major future direction, for example, through the provision of property descriptions within the prompt.
Complexity: Setting up a transformer-based architecture and its fine-tuning requires the availability of rich training data, computing and time resources. Setting up an LLM, in contrast, requires access to an LLM and careful prompt engineering, that is, potentially easier-to-obtain resources.
Real-world applications: Given the capability of LLMs to adapt to different inputs and data characteristics, we assume that LLM-based approaches are well-suited under more complex, real-world conditions and to explore low-resource scenarios.
Related Work
Knowledge graphs have, as a form of structured human knowledge, drawn a lot of research attention from both academia and the industry (Ji et al., 2022). With a great deal of event information worldwide, it is essential to bring entities and events together through event-centric knowledge representations (Guan et al., 2022), with EE and relation extraction being key technologies for accessing event knowledge (Xiang & Wang, 2019).
Event Knowledge Graphs
Event knowledge graphs represent knowledge about happenings with societal impact in an event ontology and interlink them with connected entities (Guan et al., 2022). We distinguish between two types of event representations as follows:
- Named events: The predominantly entity-centric information of popular cross-domain knowledge graphs such as DBpedia, YAGO, and Wikidata represent events as named events such as 'Brexit' and 'World War II'. Named events are also the core component of EventKG (Gottschalk & Demidova, 2018), a multilingual event-centric temporal knowledge graph, part of the Open Event Knowledge Graph (Gottschalk et al., 2021) that integrates event-related data sets from multiple application domains. GDELT (Leetaru & Schrodt, 2013) and ICEWS are two datasets of global political events encoded using the CAMEO framework (Gerner et al., 2002), that is, not in RDF.
- Unnamed events: Works that address unnamed events specifically deal with the identification of texts describing events and with the semantic annotation of these texts. For example, Rospocher et al. (2016) build knowledge graphs from news articles, and Zhang et al. (2020) develop a large-scale English event knowledge graph extracted from several sources such as reviews, news, and social media. For the task of event modelling, Yao et al. (2020) propose a weakly supervised approach to extract event relation tuples from text and build an event knowledge base, not focusing on event-entity relations.
All event knowledge graphs require the availability of an event ontology, with popular examples including LODE (Shaw et al., 2009), the Simple Event Model (Van Hage et al., 2011) and more as discussed by Piryani et al. (2023). Relevant patterns for event representation are presented by Carriero et al. (2021) and Krisnadhi and Hitzler (2017), focusing on the spatio-temporal extent of events, the role of their participants and recurring events. In this article, we extracted event ontologies from their vocabularies to allow the population of the well-established cross-domain knowledge graphs DBpedia and Wikidata.
With T-SEE and L-SEE, we aim at populating such event knowledge graphs with events extracted from text.
Event Extraction
EE is a critical task in constructing and populating entity-centric knowledge graphs, with recent advancements significantly diversifying the methodologies employed (Du et al., 2021; Lu et al., 2021; Xu et al., 2021). Earlier approaches have relied on sentence-level pipelines for extracting event triggers and their corresponding argument roles (Du & Cardie, 2020; Liu et al., 2020; Yang et al., 2019a), employing sequence-to-structure generation paradigms like Text2Event (Lu et al., 2021) and multi-task frameworks such as DyGIE++ (Wadden et al., 2019), which utilise contextualised embeddings and dynamic span graph updates. Other studies have extended the scope to document-level EE (Lou et al., 2021; Zheng et al., 2019) or ventured into open-domain EE without predefined event classes (Liu et al., 2019b; Rusu et al., 2014), which, while broadening the applicability, faces challenges due to the absence of a well-defined event ontology.
Innovations in the field have introduced contrastive pre-training frameworks like CLEVE (Wang et al., 2021b), which capitalise on large unsupervised datasets and their semantic structures to enhance EE’s efficacy, demonstrating marked improvements in both supervised and unsupervised settings. Similarly, EventGraph (You et al., 2022) has presented a joint framework that conceptualises events as graphs, facilitating the simultaneous detection and extraction of multiple events and their intricate interrelations, thereby achieving state-of-the-art results in event trigger and argument role classification.
Deepstruct, on the other hand, tries to leverage the structural understanding capabilities of language models through task-agnostic pretraining, allowing for zero-shot knowledge transfer across a wide array of structure prediction tasks and setting new benchmarks on numerous datasets (Wang et al., 2022). With DEGREE (Hsu et al., 2022), authors propose a data-efficient, generation-based model for EE that capitalises on semantic guidance from manually designed prompts and the joint prediction of triggers and arguments, showcasing robust performance in low-resource settings. ONEEE (Cao et al., 2022), on the other hand, utilises a one-stage framework for fast overlapping and nested EE.
A notable shift in EE methodology is the adoption of a question-answering paradigm (Du & Cardie, 2020), which mitigates the prevalent issue of error propagation seen in conventional approaches by facilitating end-to-end argument extraction, including for roles not encountered during training. Following this line, QGA-EE (Lu et al., 2023) has refined the QA-based approach by integrating context-aware question generation, thus accommodating multiple arguments for identical roles and surpassing prior single-task models in performance metrics.
In light of the new methodologies and progress in EE, the research community has also focused on the specific subtasks of EE. For example, with PAIE (Ma et al., 2022), authors devise a prompt tuning approach to document-level event argument extraction similar to the already established question-answering paradigm in EE work. Older work on event argument extraction, such as HMEAE (Wang et al., 2019), a hierarchical approach to argument extraction utilising concept correlation among argument roles, have, in turn, inspired approaches such as DEGREE that aim to resolve issues such as poor handling of the encoding of the labels semantics and other weak supervision signals.
Prompt-based approaches have been explored for event argument extraction, leveraging the ability of pre-trained language models to generate structured outputs. For example, Peng et al. (2024) propose Event Co-occurrences Prefix Event Argument Extraction (ECPEAE), which incorporates co-occurrence information of multiple events in a sentence to improve argument extraction accuracy (Peng et al., 2024b). This method uses a co-occurrence event prefix module to encode template information for all events in the input, enabling the model to leverage causal relationships between events. While ECPEAE focuses on sentence-level event interactions, our approaches target ontology-guided extraction of complete semantic event representations.
The other subtask of EE, event detection, has also received attention with the DRC framework (Zhao & Yang, 2022) trying to compete with trigger-based models as a way of exploring methods of event detection robust to less annotated real-world domains, an area we examine in our work as well. Similarly, recent work has introduced retrieval-augmented prompting for event detection, leveraging LLMs to improve performance in both high- and low-resource settings (Shiri et al., 2024). This approach constructs automatic retrieval-augmented prompts to provide LLMs with structured extraction guidelines, enhancing their ability to detect events without relying solely on trigger words. These advancements align with our exploration of methods for event detection in less-annotated domains.
Other research exploring ontology and schema-based approaches to EE has yielded promising innovations. Notably, Huang et al. (2024) introduce a multi-graph representation for EE, using graph neural networks to model event interactions and improve extraction accuracy (Huang et al., 2024). This graph-based approach contrasts with our ontology-guided extraction pipelines.
These developments reflect a broader trend towards more adaptable, efficient, and comprehensive models for EE, underlining the field’s evolution towards leveraging advanced language model capabilities and innovative problem-solving frameworks.
LLM-based IE
The field of IE has traditionally relied on rule-based and statistical methods to extract structured information from text. However, the emergence of LLMs has opened up new avenues for tackling IE tasks with remarkable capabilities in understanding and generating natural language. This section reviews recent advancements in using LLMs for IE, particularly focusing on unstructured IE and EE.
General IE with LLMs
A few years ago, LLMs were still in their early stages of development, with limited capabilities for tackling complex tasks like IE. While early works explored LLM-based approaches for IE (e.g. Peters et al., 2018), these models faced challenges due to limited model capacity, data inefficiency, and limited adaptation. However, significant advancements in recent years have addressed these challenges, driven by the rise of the transformer architecture (Vaswani et al., 2017) enabling long-range dependencies. Large-scale pre-training pushed things further with BERT (Devlin et al., 2019) and GPT-3 (Brown et al., 2020), allowing LLMs to learn general language understanding capabilities and adapt to specific IE tasks through fine-tuning with smaller labelled datasets. Finally, the growing availability of powerful computing resources like GPUs and TPUs (Jouppi et al., 2017) has enabled the training of larger and more complex LLM models, further enhancing their ability to handle complex IE tasks.
Unstructured IE with LLMs
In 2022, Dunn et al. showed how a pre-trained LLM can extract structured information from scientific abstracts (Dunn et al., 2022). In 2023, Polak and Morgan (2024) expanded on the early promises of unstructured IE with ChatExtract, demonstrating that a significant amount of up-front effort, expertise, and coding may be fully automated using an advanced conversational LLM. By leveraging prompts and follow-up questions, ChatExtract achieves high accuracy and efficiency in extracting materials data, showcasing the potential of LLMs for automated knowledge extraction from scientific literature.
In the same year, Wei et al. proposed ChatIE, a multi-turn QA framework for zero-shot IE demonstrating good performance across a number of datasets, three tasks, and two languages (Wei et al., 2023). Li et al. systematically analysed ChatGPT across seven detailed IE tasks (Li et al., 2023a) including EE. The authors show that while ChatGPT underperforms in standard IE tasks compared to BERT-based models, it excels in OpenIE settings, as confirmed by human evaluators. However, a notable concern is the model’s overconfidence in its predictions, leading to calibration issues. This is further confirmed in the comprehensive survey by Liu et al. (2023), in which the authors evaluate the capabilities and applications of ChatGPT (versions 3.5 and 4) against the backdrop of current state-of-the-art models in NLP. The article highlights ChatGPT’s advancements in large-scale pre-training, instruction fine-tuning, and reinforcement learning from human feedback, which collectively enhance its adaptability and performance across a myriad of NLP tasks. A detailed comparison of ChatGPT with existing state-of-the-art models reveals that while ChatGPT excels in multitask learning and shows promising results in some NLP tasks, it falls short in multilingual capabilities and specialised tasks when compared to dedicated models. Moreover, stability and consistency emerge as areas where ChatGPT does not yet match the performance levels of state-of-the-art models, which could impact its reliability in critical applications.
LLM-based EE
LLMs have recently been utilised for the task of EE. In general, as already mentioned, Li et al. (2023a) evaluate the performance of ChatGPT on a number of IE tasks, revealing an increasingly worse performance as the complexity of the evaluated task increases, where the worst performance is reported on the task of EE.
A comparison between LLMs and traditional methods has been conducted on several tasks related to EE: Kirti et al. (2023) explore prompt-based learning with GPT-4 for detecting factual events in literary narratives. The study concludes that while BiLSTM with BERT embeddings excels in event detection within literary texts, GPT-4 shows promise in prompt-based learning approaches, particularly in few-shot settings. Sharif et al. (2024) conducted an in-depth analysis of ChatGPT's performance on the task of characterising information-seeking events, where ChatGPT underperformed compared to transformer models like XLNet, especially in domain-specific contexts requiring extensive knowledge.
Zhan et al. introduce GLEN (Li et al., 2023b), a large-scale general-purpose event detection dataset that significantly expands the ontology of event types. While InstructGPT underperformed compared to other baselines in their experiments, the authors attribute this to the limited input length and the lack of fine-tuning.
Peng et al. (2024) introduce CsEAE, a model that combines small language models (SLMs) and LLMs for document-level event argument extraction (Peng et al., 2024a). CsEAE incorporates co-occurrence-aware and structure-aware modules to handle semantic boundaries between events and reduce interference from redundant information. The authors also demonstrate that insights from SLMs can enhance LLM performance via supervised fine-tuning and prompt engineering. This work aligns with our comparison of transformer-based and LLM-based paradigms for semantic EE.
Liu et al. (2024) propose EventRL, a framework that enhances LLM-based EE using reinforcement learning with outcome supervision (Gao et al., 2024). EventRL improves extraction accuracy by rewarding the LLM based on its alignment with human-annotated triggers and arguments. While EventRL focuses on refining LLM outputs through external feedback, L-SEE guides the LLM through ontology-based prompts.
While the early attempts at utilising LLMs for the complex task of EE have shown mixed results, with LLMs often underperforming in comparison to traditional methods, especially in domain-specific contexts, there is a clear trajectory of improvement. As LLMs continue to evolve, gaining the ability to handle larger context windows, and as researchers refine their prompting techniques – such as breaking down the task into simpler sub-tasks as demonstrated in L-SEE – the performance of LLM-based EE can be expected to improve further.
LLM-Based Knowledge Graph Population
The use of LLMs for the population of knowledge graphs has also been explored recently. For example, Mihindukulasooriya et al. experimented on ontology-driven triple extraction from sentences (Mihindukulasooriya et al., 2023), while Yao et al. performed instruction tuning for the tasks of triple classification, relation prediction and entity link prediction (Yao et al., 2023). In another innovative approach, AutoKG leverages a multi-agent-based approach employing LLMs and external sources for KG construction and reasoning (Zhu et al., 2024). Zhang et al. propose KoPA, which ingests entity and relation embeddings into LLMs (Zhang et al., 2024c).
These papers present a glimpse into the rapidly evolving field of LLM-based IE. While promising results have been achieved, further research is needed to address challenges such as factual correctness, bias mitigation, and adapting LLMs to specific domains and tasks. As research progresses, LLMs are set to play a key role in the future of IE, enabling efficient and accurate knowledge extraction from vast amounts of unstructured text data.
Conclusion
In this article, we compared two paradigms for semantic EE: fine-tuning transformer-based architectures, as exemplified by our approach T-SEE, and prompting LLMs, as exemplified by our approach L-SEE.
Both approaches consist of two main steps: event classification and relation extraction, where the extracted relations are guided by the event classes identified in the first step.
In our evaluation, we first introduced two new datasets for semantic EE. Then, we compared T-SEE and L-SEE on these datasets, both automatically and through a manual assessment.
Consequently, we derive a set of phenomena to be regarded when performing semantic EE, including the role of distantly labelled datasets and the event ontology.
In future work, we plan to further improve both approaches, for example, through the provision of property descriptions within the LLM prompts.
Acknowledgements
This work was partially funded by the Federal Ministry for Economic Affairs and Energy (BMWE), Germany (‘ATTENTION!’, 01MJ22012D). The publication of this article was funded by the Open Access Fund of Leibniz Universität Hannover.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
