
Abstract

In recent years, hazardous chemical incidents caused by human or natural factors have occurred frequently, resulting in significant human casualties, property damage, and environmental pollution. Accurately mining the lessons learned from accumulating incident reports and constructing a knowledge graph for hazardous chemical incident management can assist managers in identifying patterns and analyzing common attributes, thereby preventing the recurrence of similar incidents. This article addresses the challenges of dispersed textual information, specialized vocabulary, and data formats in hazardous chemical incidents. We propose a novel entity-relation extraction model called CPBA-CLIM (content-position-based attention-cross-label intersect matching) to provide an accurate data foundation for constructing the hazardous chemical incident knowledge graph. The content-position-based attention (CPBA) module incorporates contextual semantic information into the combined content and position encoding of bidirectional encoder representations from transformers (BERT) to obtain dynamic word vectors that align with the thematic context of the text. Additionally, the cross-label intersect matching (CLIM) strategy evaluates the rationality of entity–relation interactions in sets containing potential overlaps, reducing the impact of entity–relation overlap on triplet extraction accuracy. Comparative experimental results on public datasets demonstrate the model's outstanding performance on overlapping triplets. Qualitative experiments on a self-constructed dataset integrate our model with ontology construction techniques, successfully establishing a knowledge graph for managing hazardous chemical incidents. This research effectively enhances the degree of automation and efficiency in knowledge graph construction, thus offering support and decision-making foundations for hazardous chemical safety management.

Keywords

Contextual semantic; overlapping triplets; content-position-based attention module; cross-label intersect matching strategy

Introduction

In recent years, the rapid development of the chemical industry has significantly propelled global economic growth, shaping modern lifestyles worldwide.¹ However, due to the harmful chemical properties of certain substances, accidents such as explosions, fires, poisoning, corrosion, and radiation incidents may occur during their production, storage, transportation, and usage, posing potential risks to human health, wildlife, and the environment.² Chemical accidents often result from the interaction of various hazardous factors. Therefore, comprehensive safety knowledge is crucial for effectively managing hazardous chemicals. The field of chemical safety encompasses a wealth of safety regulations and accident reports, providing valuable information and resources. For instance, the Hazardous Chemical Incident Report published by the Ministry of Emergency Management of the People's Republic of China includes various direct causes of incidents. Additionally, documents such as the Regulations on the Safety Management of Hazardous Chemicals (State Council Decree No. 591) and the Comprehensive Governance Plan for Hazardous Chemical Safety (State Council [2016] No. 88) issued by the State Council outline various safety management standards. However, traditional methods of chemical safety management lack the means to transform them into reusable forms, hindering the full utilization and extraction of the dispersed, valuable, and applicable knowledge stored in different documents.

Emerging intelligent application technologies, such as knowledge graphs,³ provide more comprehensive and intelligent means for safely managing hazardous chemicals. These technologies further strengthen regulatory agencies’ supervisory and response capabilities, significantly enhancing the overall level of safety management. For example, knowledge graphs extract entities, relationships, and attributes from text or databases, representing elements such as hazardous substances, the environment, equipment, and regulations as nodes and edges.⁴ This systematic representation offers a comprehensive relational view, deepening the understanding of frequent interconnections related to hazardous substances. It provides comprehensive knowledge support for overall hazardous material management, risk assessment, and emergency response.⁵ Various intelligent applications in hazardous chemical management rely on extracting valuable information from large volumes of text and data.⁶ This information extraction is the foundational data for establishing and operating these systems.

However, hazardous chemical safety faces challenges due to dispersed textual information, specialized vocabulary, and potential ambiguity in data formats and sentence expressions. These issues give rise to uncertainty in knowledge representation and inference when constructing a knowledge graph in this domain. Currently, most research relies on static word vectors, such as Word2Vec, to represent inputs.⁷,⁸ Although static word vectors partially capture the semantic associations between words, they lack contextual semantic information, impacting the accuracy of hazardous chemical safety knowledge management.

Furthermore, an analysis of incident report texts reveals frequent instances of overlapping multiple entities and relationships within sentences. This overlap can hinder the effectiveness of information extraction and lead to missed or erroneous detections. For example, as depicted in Figure 1, the incident entity “poisoning incident” serves as the subject for the time entity “November 25, 2022,” the company entity “Hengsheng Nord High-Tech Company,” and the casualty entity “3 fatalities and 1 injured person.” Additionally, the company entity is also the subject of the location entity. This example demonstrates the presence of multiple overlapping entities and relationships. Current research attempts to address this issue by employing innovative approaches to extract the three elements of entities and relationships. For example, Chen et al.⁹ proposed a novel feature engineering method to improve entity-relationship extraction performance. Jiang and Cao¹⁰ introduced an unknown heterogeneous graph attention network to enhance semantic analysis and fusion between entities and relationships, thereby improving entity–relationship extraction. However, when these models encounter complex relationships, although they can partially address the problem of triple element overlap, they still struggle to achieve accurate extraction and matching when multiple overlapping triplets exist within a single sentence. Utilizing these models undoubtedly poses challenges in constructing a knowledge graph for hazardous chemical incident management. They fall short of meeting the precision requirements for understanding accident scenarios and their causes.

Figure 1.

Example of entity–relationship overlap in hazardous chemical incident reports.

Based on the abovementioned issues, this article introduces CPBA-CLIM (content-position-based attention-cross-label intersect matching), a hazardous chemical safety incident entity-relation extraction model. The model utilizes natural language processing and knowledge graph mining techniques to extract information triplets from hazardous chemical safety incident reports. It stores them in a graph database for constructing incident knowledge graphs and performing key information statistics.

The main contributions of this article can be summarized as follows:

Introducing a dynamic word vector encoding module based on position and content. This module utilizes composite vectors containing word content and positional information to capture contextual semantic information, thereby representing the semantic features of each word in the input layer. This approach addresses the challenges of dispersed textual information and specialized vocabulary in hazardous chemical safety.

Proposing an entity and relation decoding module based on the cross-label intersect matching strategy. This module evaluates and matches the rationality of interactions between entities and relationships from the predicted sets generated by the model. The aim is to extract triplets in sentences with various overlapping elements accurately.

Conducting comprehensive experiments on public and self-constructed datasets, including comparative, ablation, and qualitative analyses. The results demonstrate the model's excellent handling of sentences containing overlapping triplets.

Using the proposed model, we constructed a knowledge graph for managing hazardous chemical accidents based on ontology strategies. We have completed applications such as safety accident profiling and accident information retrieval.

Related work

Data-driven approaches in hazardous chemical safety management

The hazardous chemical safety situation is currently severe, characterized by “high temperature, high pressure, flammability, explosiveness, toxicity, continuous operation, and extensive chain length.”¹¹ Once incidents occur, they severely threaten people's lives and cause enormous socio-economic losses. In recent years, various advanced technologies have been applied in hazardous chemical safety management.¹² In computer science, with the advent of the big data era, the trend is to use multi-source data for comprehensive analysis. Knowledge graphs, driven by data-driven approaches,¹³ have shown significant advantages in data management and representation by collecting and acquiring historical data and utilizing data mining or machine learning methods to extract useful information from a large amount of raw data.

Data-driven research in hazardous chemical safety involves several aspects. For example, based on learning from historical data, there are studies on the adjustment of environmental and conditional decision variables in the design and optimization of chemical processes. Sharma et al.¹³ applied intelligent algorithm techniques to the adjustment of chemical process units and proposed an improved optimization algorithm to solve dynamic optimization problems in chemical engineering. In the research of process detection and fault diagnosis, scholars have studied the IPSO-optimized SVM-BOXPLOT method¹⁴ for intelligent identification of abnormal operating conditions in the production process of hazardous chemicals. Regarding quality prediction and control, Lui et al.¹⁵ corrected product quality defects by establishing a soft measurement model to improve product quality. For example, based on fuzzy neural network algorithms,¹⁶ a device for product quality prediction is constructed to enhance product quality.

Application of knowledge graphs in hazardous chemical safety management

Knowledge graphs, intuitive tools for organizing and managing knowledge, have gradually found rich applications in various fields. For example, Moon et al.¹⁷ developed a semantic retrieval system that allows users to retrieve incident-related information according to their needs using knowledge graph-based information retrieval applications. It enables the retrieval of safety risk factors, incident emergency response measures, and more. Combining various machine learning techniques, Zhang et al.¹⁸ classified the causes of incidents in safety incident reports, providing technical support for summarizing incident lessons for safety professionals. Lee et al.,¹⁹ using dependency syntax analysis and other techniques, automatically detect potential risk clauses and propose an automatic extraction model for contract risk knowledge, significantly improving management efficiency.

In hazardous chemical safety regulation, there is a large amount of unstructured data,²⁰ such as incident reports, various safety regulations, and other data information. These data can provide decision support for hazardous chemical safety management. However, this knowledge is scattered in fragmented materials, and traditional safety management needs to utilize this information more thoroughly. Zheng et al.¹² established a knowledge graph for managing hazardous chemicals by ontology design and entity recognition, aiming to overcome the information gap between dispersed databases and improve the management of dangerous substances. Wu and Liu²¹ proposed a risk knowledge question-answering model in hazardous chemicals based on relevant knowledge graph technology, which meets the needs for safe storage and management of hazardous chemicals. Establishing a knowledge graph in hazardous chemical safety can facilitate quick retrieval, statistical analysis of various incident case information, and safety regulations knowledge and help improve hazardous chemical safety management.

Unstructured data information extraction technologies in knowledge graphs

Word vectors, as the first-step input for building knowledge graphs, directly affect the accuracy of knowledge graph construction.²² According to the different generation methods, word vectors can be divided into static and dynamic. Static word vectors are pre-computed before training, where each word is represented by a fixed-length vector that remains unchanged throughout the graph construction process. Dynamic word vectors,²³ on the other hand, adjust the vectors dynamically to represent word semantics based on specific tasks and contexts. Therefore, compared to static word vectors, dynamic word vectors can reflect contextual information and, to some extent, address issues such as polysemy and word sense disambiguation. As a result, dynamic word vectors can better adapt to different tasks and domains. In recent years, various dynamic word vector models have emerged, such as the generative pre-trained transformer model,²⁴ neural network-based bidirectional encoder representations from the transformer (BERT) model,²⁵ graph-structured gross domestic wellbeing model,²⁶ and time-attribute network Deep Convolutional Transformational Autoencoders Networks (DCTANs) based on the dynamic random state-space framework.²⁷ These dynamic word vector models can improve the accuracy of graph construction to a certain extent. However, text in hazardous chemical safety is highly specialized, and the data format needs to be more cohesive, with many entity words being uncommon. Finding a more suitable word vector representation model is crucial to improving the accuracy of building the hazardous chemical safety knowledge graph.

On the other hand, relation extraction is also a core task in building knowledge graphs, aiming to extract relationship triplets from unstructured natural language text. With the development of neural networks, researchers have gradually focused on relation extraction models in complex scenarios.²⁸ Some scholars²⁹ follow the pipelined method to complete relation extraction: first, identify entities in the text sequence, then use relation classification methods to distinguish the relationships between entities. Although the pipelined approach can achieve relation extraction functionality, its obvious drawback is that the accuracy of the first module directly affects the effectiveness of the second module and the overall accuracy of relation extraction. Some researchers have proposed joint entity and relation detection models to address this issue. Zheng et al. used a Seq2Seq model with a copy mechanism to address overlapping relationships.²⁹ Fu et al.³⁰ introduced a binary tree structure from graph theory into the network to handle overlapping triplets.

In summary, previous research has made significant progress in hazardous chemical safety. In hazardous chemicals driven by data-driven approaches, emerging intelligent applications such as knowledge graphs provide a more comprehensive and intelligent means of managing hazardous chemical safety. This strengthens the supervisory and response capabilities of regulatory authorities. However, due to the unique characteristics of relevant texts in hazardous chemicals, the accuracy of knowledge graph construction in this domain needs further improvement. This is primarily manifested in two aspects: (1) The information in the field of hazardous chemical safety is scattered, and the vocabulary is highly specialized, with many entity terms being uncommon. Finding a more suitable word vector representation model is crucial for enhancing the accuracy of constructing hazardous chemical safety knowledge graphs. (2) Text data in hazardous chemicals often include sentences with multiple overlapping relationships. The current widely used information extraction models exhibit suboptimal accuracy when faced with complex sentences containing multiple overlapping relationships. Therefore, there is a need for a relationship extraction model that is more suitable for hazardous chemical safety to extract more accurate entity relationships.

In response to these challenges and limitations, this article presents a CPBA-CLIM model for constructing a knowledge graph in hazardous chemical safety. The model utilizes an end-to-end encoding–decoding framework, including a dynamic word vector (content-position-based attention (CPBA)) encoding module based on position and content and an entity and relation decoding module based on the cross-label intersect matching (CLIM) strategy. Finally, a knowledge graph in hazardous chemical incident management is constructed, enabling applications such as incident profiling, incident information retrieval, and incident statistical analysis.

Methods

Problem formulation and framework

Problem formulation

This paper addresses the challenges of entity relationship extraction in reports on accidents involving hazardous chemicals, including issues such as dispersed information, specialized terminology, and overlapping data relationships. An information extraction model is proposed to tackle these challenges. The model is capable of accurately extracting entity relationships in the form of triplets from a large number of unstructured documents. This provides a reliable data foundation for building knowledge graphs in hazardous chemical accident management and enhances the automation and efficiency of knowledge graph construction, offering essential support and decision-making basis for managing hazardous chemical safety.

The entity-relationship extraction task is defined as follows:

Define the entity set as E and the relation set as R.

Define a sentence from hazardous chemical safety management text as $S = \{t_{1}, t_{2}, t_{3}, \dots, t_{N}\}$ , where $t_{i}$ represents the $i$ -th token in the sentence.

Define the entity-relation triples as $T = ⟨ subject, relation, object ⟩$ , where $subject, object \in E$ , $relation \in R$ .

The entity-relationship extraction task objectives are described as follows:

Given a sentence $S$ and a predefined set of relations R, the CPBA-CLIM is tasked with learning to extract all relation triples from S that belong to $R$ accurately. It is important to note that S may involve multiple overlapping scenarios, where the extracted triples may share the same entities or relations.
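As a small illustration of the task output, the sketch below represents triples as plain tuples and groups them by shared subject, using the entities from the Figure 1 example. The relation names (`occurred_on`, `involved_company`, `caused_casualties`) and the helper `shared_subjects` are hypothetical and serve only to make the overlap visible; they are not taken from the model's relation set.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    relation: str
    object: str


# Entities follow the Figure 1 example; relation names are assumed for illustration.
triples = [
    Triple("poisoning incident", "occurred_on", "November 25, 2022"),
    Triple("poisoning incident", "involved_company", "Hengsheng Nord High-Tech Company"),
    Triple("poisoning incident", "caused_casualties", "3 fatalities and 1 injured person"),
]


def shared_subjects(ts):
    """Group triples by subject and keep subjects that occur in more than one triple."""
    groups = defaultdict(list)
    for t in ts:
        groups[t.subject].append(t)
    return {s: g for s, g in groups.items() if len(g) > 1}


overlap = shared_subjects(triples)
print(sorted(overlap))
```

Here one subject participates in three triples, which is the kind of shared-entity overlap the extraction task has to resolve.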

The problem description of ontology-based knowledge graph construction is as follows:

The above entity relationship extraction results, represented as T = $⟨ subject, relation, object ⟩$ , are combined with ontology construction techniques to establish a knowledge graph specifically designed for managing incidents related to hazardous chemicals. This knowledge graph enables the integration, aggregation, analysis, and other services for unstructured data. It provides accurate domain information and necessary decision-making support for relevant safety management personnel.

Framework

Based on hazardous chemical safety incident reports and incorporating natural language processing and knowledge graph techniques, this paper proposes a feature extraction model called CPBA-CLIM for encoding and decoding hazardous chemical safety incidents. The hazardous chemical incident management system is constructed through ontology-based knowledge graph-building techniques. The research framework is illustrated in Figure 2.

Figure 2.

The overall structure of the paper's method.

The CPBA encoder: The encoder effectively captures relevant information in hazardous chemical incident reports, providing richer contextual semantic information for subsequent decoding. Firstly, the text sequence containing the [CLS] token is fed into the pre-trained BERT model as the initial input data. Each token forms token pairs with every other token, taking token pairs at positions i and j as examples. Each token is encoded into two vectors, $C_{i}$ and $P_{i | j}$ , representing its content and relative positional embedding. These vectors undergo element-wise multiplication, generating four components of content-position combination embeddings (used as value projections). Subsequently, a self-attention operation is performed to obtain attention weights capturing the overall semantic context of the sentence for the token pair with relevant contextual information. Finally, these attention weights are multiplied with the value projections to obtain dynamic word vectors, which are then used as inputs to the CLIM decoder.

The CLIM decoder: The decoder employs a novel CLIM strategy to address the overlap issue between entities and relationships. The results are categorized into Subject, Object, and Relation sets by predicting all possible classification labels. The decoding strategy first identifies entity groups corresponding to the same relationship from entity–relation and relation–entity labels. Subsequently, the rationality of interactions within entity–entity and entity–relation pairs (including subject–relation and relation–object) is assessed. The triplet is considered valid only when all three judgments are positive, ensuring the accuracy of triplet extraction.

Ontology-based knowledge graph construction involves two main aspects. Firstly, we conduct ontology construction based on the features of data sources in the hazardous chemicals domain and human knowledge. This entails defining an appropriate class hierarchy and inter-class relationships. Secondly, leveraging the constructed ontology, we utilize the CPBA-CLIM model to identify the required entities and their relational data from the self-constructed dataset hazardous chemicals incident report (HCAR). This process culminates in the successful establishment of a knowledge graph for the management of hazardous chemical incidents.
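The two aspects above can be sketched in a few lines, assuming the ontology is reduced to a table of allowed (subject class, object class) pairs per relation. The class names, relation names, and the `build_graph` helper are hypothetical, and the in-memory adjacency dictionary merely stands in for a graph database; the filtering step is one plausible way to apply an ontology to extracted triples, not necessarily the procedure used here.

```python
# Hypothetical ontology: each relation constrains the classes of its subject and object.
ONTOLOGY = {
    "occurred_on": ("Incident", "Time"),
    "involved_company": ("Incident", "Company"),
    "located_in": ("Company", "Location"),
}


def build_graph(triples, entity_class):
    """Keep triples whose entity classes satisfy the ontology; return an adjacency dict."""
    graph = {}
    for subj, rel, obj in triples:
        expected = ONTOLOGY.get(rel)
        actual = (entity_class.get(subj), entity_class.get(obj))
        if expected is None or actual != expected:
            continue  # reject triples that violate the class constraints
        graph.setdefault(subj, []).append((rel, obj))
    return graph


entity_class = {
    "poisoning incident": "Incident",
    "November 25, 2022": "Time",
    "Hengsheng Nord High-Tech Company": "Company",
}
extracted = [
    ("poisoning incident", "occurred_on", "November 25, 2022"),
    ("poisoning incident", "involved_company", "Hengsheng Nord High-Tech Company"),
    ("November 25, 2022", "involved_company", "poisoning incident"),  # violates the ontology
]
kg = build_graph(extracted, entity_class)
```

Looking up `kg["poisoning incident"]` then returns the outgoing edges of that incident node, which is the basic operation behind an incident profile.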

CPBA encoder

Static word vectors are determined before they are input to the model and do not change with the semantic context during training. In the field of hazardous chemical safety, textual data is often complex and contains a large number of domain-specific terms and abbreviations. For example, in incident reports, there are many domain-specific terms, such as chemical names and company names, which often include common vocabulary, such as “carbon monoxide poisoning,” “sulfide alkali workshop,” and “Petrochemical Company.” Therefore, the true meaning of a token is closely related to its contextual semantics. Consequently, in data processing, it is necessary to quantitatively encode the data based on the contextual semantics, adapting to the domain-specific vocabulary and knowledge representation, thereby improving the accuracy and robustness of the model.

The quantification of contextual semantics involves two factors: the content information of the word concerning its context and the position it occupies in the text sequence. This module proposes a dynamic word vector encoding model based on content and position to address this issue and accurately extract relevant triplets from hazardous chemical safety textual data. The structure of the CPBA module is shown in Figure 3.

Figure 3.

Structure of the content-position based attention (CPBA) module.

Assuming the length of the input text sequence is N, and the position of a token in the sequence is i, $C_{i}$ represents the content vector of that token, and $P_{i | j}$ represents the positional vector relative to position j. We believe that the attention weight of a word depends not only on its content but also on its relative position. For example, in the sentences “Store the flammable liquids in a well-ventilated area.” and “Well-ventilate the area before storing flammable liquids.,” the same vital terms “flammable liquids” and “well-ventilated” appear in different positions. In the first sentence, the emphasis is on highlighting the storage conditions when dealing with flammable liquids. The second sentence emphasizes proper area ventilation before storing flammable liquids. Therefore, if we disregard the relative positions of the words and solely concentrate on their content, there is a potential for misunderstanding the primary emphasis of the sentences. This underscores the importance of considering the position of words in language processing for a correct understanding of context and sentence intent.

Therefore, in this article, we define the attention weight of a word as follows:

$$A_{i, j} = w_{c \to c} C_{i} C_{j}^{T} + w_{c \to p} C_{i} P_{j | i}^{T} + w_{p \to c} P_{i | j} C_{j}^{T} + w_{p \to p} P_{i | j} P_{j | i}^{T} \tag{1}$$

In this equation, $w_{c \to c} C_{i} C_{j}^{T}$ represents the attention value from the content of token i to the content of token j, $w_{c \to p} C_{i} P_{j | i}^{T}$ represents the attention value from the content of token i to the position of token j relative to token i, $w_{p \to c} P_{i | j} C_{j}^{T}$ represents the attention value from the position vector of token i relative to token j to the content of token j, and $w_{p \to p} P_{i | j} P_{j | i}^{T}$ represents the attention value between the relative position vectors of two tokens.

According to the definition of an attention model, the formula for calculating dynamic word content attention weights is as follows:

$$Q_{c} = C W_{q}^{c}, \quad K_{c} = C W_{k}^{c}, \quad V_{c} = C W_{v}^{c} \tag{2}$$

where C represents the hidden unit content vector of the input layer, and $W_{q}^{c}$, $W_{k}^{c}$, and $W_{v}^{c}$ are parameter matrices obtained through network training. $Q_{c}$ represents the projected query-content vector, $K_{c}$ represents the projected key-content vector, and $V_{c}$ represents the projected value-content vector.

Similarly, the formula for calculating positional attention weights is as follows:

$$Q_{p} = P W_{q}^{p}, \quad K_{p} = P W_{k}^{p}, \quad V_{p} = P W_{v}^{p} \tag{3}$$

By substituting equation (1), the calculation of attention score from token i to token j is as follows:

$$w_{i, j}^{att} = \alpha_{1} Q_{i}^{c} (K_{j}^{c})^{T} + \alpha_{2} Q_{i}^{c} (K_{l (i, j)}^{p})^{T} + \alpha_{3} K_{j}^{c} (Q_{l (j, i)}^{p})^{T} + \alpha_{4} K_{l (i, j)}^{p} (Q_{l (j, i)}^{p})^{T} \tag{4}$$

where $w_{i, j}^{att}$ represents an element of the attention matrix $\mathrm{weights}_{i, j}^{att}$, $Q_{i}^{c}$ corresponds to the $i$-th row of $Q_{c}$, $K_{j}^{c}$ corresponds to the $j$-th row of $K_{c}$, $K_{l (i, j)}^{p}$ corresponds to the $l (i, j)$-th row of $K_{p}$, and $Q_{l (j, i)}^{p}$ corresponds to the $l (j, i)$-th row of $Q_{p}$. Therefore, $\alpha_{1} Q_{i}^{c} (K_{j}^{c})^{T}$, $\alpha_{2} Q_{i}^{c} (K_{l (i, j)}^{p})^{T}$, $\alpha_{3} K_{j}^{c} (Q_{l (j, i)}^{p})^{T}$, and $\alpha_{4} K_{l (i, j)}^{p} (Q_{l (j, i)}^{p})^{T}$, respectively, represent the attention scores for content-to-content, content-to-position, position-to-content, and position-to-position. Here, $l (i, j) \in [0, 2 N]$ denotes the relative distance between token i and token j, defined as follows:

$$l (i, j) = \begin{cases} 0, & i - j \leq - N \\ 2 N - 1, & i - j \geq N \\ i - j + N, & \text{otherwise} \end{cases} \tag{5}$$
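The clipped relative distance can be written as a short function. This is a minimal sketch that assumes a single clipping distance, passed here as the parameter `k`, is used both for the thresholds and for the offset, so that the result always lies in `[0, 2k - 1]`; the function name `relative_index` is hypothetical.

```python
def relative_index(i: int, j: int, k: int) -> int:
    """Clipped relative-position index in the style of equation (5).

    k is the clipping distance (assumption: the same k is used for the
    thresholds and for the offset, so the result lies in [0, 2k - 1]).
    """
    delta = i - j
    if delta <= -k:
        return 0
    if delta >= k:
        return 2 * k - 1
    return delta + k


# Indices seen from token 2 when looking at tokens 0..5 with clipping distance 3.
row = [relative_index(2, j, k=3) for j in range(6)]
```

Tokens to the left of position i receive indices above `k`, tokens to the right receive indices below `k`, and anything farther away than the clipping distance shares the boundary index.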

Finally, the scaling factor $1 / (2 \sqrt{d})$ is applied to $w_{i, j}^{att}$ to obtain the self-attention output of the sequence:

$$w_{output} = \mathrm{softmax} \left( \frac{w_{i, j}^{att}}{2 \sqrt{d}} \right) V_{c} \tag{6}$$
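To make the four-term score and the scaled softmax concrete, the following is a minimal pure-Python sketch of equations (4) to (6) for a single attention head. It assumes that the projected matrices are already available as nested lists, that the positional projections `Qp` and `Kp` have `2k` rows indexed by the clipped relative distance, and that the coefficients default to 1; the function names and the toy inputs are hypothetical.

```python
import math


def rel(i, j, k):
    """Clipped relative index, following equation (5) with clipping distance k."""
    return max(0, min(2 * k - 1, i - j + k))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cpba_attention(Qc, Kc, Vc, Qp, Kp, k, alphas=(1.0, 1.0, 1.0, 1.0)):
    """Single-head content-position attention over a sequence of n tokens."""
    n, d = len(Qc), len(Qc[0])
    a1, a2, a3, a4 = alphas
    out = []
    for i in range(n):
        scores = []
        for j in range(n):
            kp, qp = Kp[rel(i, j, k)], Qp[rel(j, i, k)]
            s = (a1 * dot(Qc[i], Kc[j])   # content-to-content
                 + a2 * dot(Qc[i], kp)    # content-to-position
                 + a3 * dot(Kc[j], qp)    # position-to-content
                 + a4 * dot(kp, qp))      # position-to-position
            scores.append(s / (2.0 * math.sqrt(d)))
        m = max(scores)                   # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * Vc[j][c] for j, w in enumerate(weights))
                    for c in range(len(Vc[0]))])
    return out


# Toy input: with all-zero queries and keys the weights are uniform,
# so every output row is the mean of the value rows.
zeros2 = [[0.0, 0.0], [0.0, 0.0]]
zeros4 = [[0.0, 0.0] for _ in range(4)]
uniform_out = cpba_attention(zeros2, zeros2, [[1.0, 2.0], [3.0, 4.0]], zeros4, zeros4, k=2)
```

In a stacked encoder, the rows returned by such a function would serve as the updated token representations passed to the next layer.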

The content-position-based attention encoder is obtained by stacking the CPBA module described above, as shown in Figure 4. It takes a joint content-position text vector as input and uses multiple stacked CPBA layers to extract semantic information and correlations between words. The output consists of dynamically updated word vectors associated with contextual information.

Figure 4.

Content-position-based attention encoder.

Decoder module based on the CLIM strategy

Given a sentence $S = \{t_{1}, t_{2}, \dots, t_{N}\}$ , the encoded representation of the sentence is denoted as $D = \{DyEmb_{1}, DyEmb_{2}, \dots, DyEmb_{N}\}$ , where $DyEmb_{i}$ represents the dynamic word vector obtained by the CPBA encoder (as described in the “CPBA encoder” section) and N is the length of the sentence. For engineering purposes, a predefined set of relations is defined as $R = \{r_{1}, r_{2}, \dots, r_{K}\}$ , where K is the number of predefined relations. To address the issue of overlapping entity relationships, the BIOE labels are combined to identify all entity relationship triples in the sentence as $T_{(S)} = \{(subject, relation, object) \mid subject, object \in E, relation \in R\}$ .

Based on the embedding $D \in \mathbb{R}^{N \times d}$ of a sentence with N tokens, a fully connected neural network is used to extract subjects and objects. Here, a simple, fully connected neural network is chosen instead of mainstream LSTM-CRF models.²⁹,³¹ The main reason for this choice is that the functionality required by this module is simple and does not involve complex entity relationship matching. By using a simpler model, higher computational efficiency can be achieved at a minor cost to accuracy. The extraction of subjects and objects is performed as follows:

$$h_{i}^{sub} = \mathrm{Softmax} (M_{sub} (DyEmb_{i}) + b_{sub}) \tag{7}$$

$$h_{j}^{obj} = \mathrm{Softmax} (M_{obj} (DyEmb_{j}) + b_{obj}) \tag{8}$$

where $M_{sub}, M_{obj} \in \mathbb{R}^{d \times 4}$ are trainable parameter matrices, the tag set $\{B, I, O, E\}$ has a length of 4, and $b_{sub}$ and $b_{obj}$ are trainable biases.
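The tagging step can be illustrated with a short sketch that applies a softmax to per-token logits and turns the most probable tags into token spans. It assumes a convention in which every entity opens with a B tag and closes with an E tag, so single-token entities are not covered; the logits are made-up numbers standing in for the output of the fully connected layer, and `decode_spans` is a hypothetical helper.

```python
import math

TAGS = ["B", "I", "O", "E"]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def decode_spans(tag_probs):
    """Turn per-token tag distributions into (start, end) token spans, end inclusive."""
    tags = [TAGS[max(range(4), key=p.__getitem__)] for p in tag_probs]
    spans, start = [], None
    for idx, tag in enumerate(tags):
        if tag == "B":
            start = idx                     # open a new candidate span
        elif tag == "E" and start is not None:
            spans.append((start, idx))      # close the span opened by the last B
            start = None
        elif tag == "O":
            start = None                    # an O tag cancels an unfinished span
    return spans


# Made-up logits for a five-token sentence, one row per token.
logits = [
    [0.1, 0.0, 3.0, 0.0],  # O
    [4.0, 0.0, 0.0, 0.0],  # B
    [0.0, 4.0, 0.0, 0.0],  # I
    [0.0, 0.0, 0.0, 4.0],  # E
    [0.0, 0.0, 5.0, 0.0],  # O
]
spans = decode_spans([softmax(row) for row in logits])
```

Running the same decoder on the subject tagger and on the object tagger yields the two candidate sets used in the next step.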

Let $H_{sub} = \{h_{1}^{sub}, h_{2}^{sub}, \dots, h_{m}^{sub}\}$ and $H_{obj} = \{h_{1}^{obj}, h_{2}^{obj}, \dots, h_{n}^{obj}\}$ denote the two sets of extracted subjects and objects, and define $(h_{i}^{sub}, h_{j}^{obj})$ as a $(subject, object)$ pair. The entity-relationship feature matrix $R_M \in \mathbb{R}^{m \times n}$ is initialized, where the element $rm_{(i, j)}$ in the $i$ -th row and $j$ -th column represents the relationship label of $(h_{i}^{sub}, h_{j}^{obj})$ . The calculation formula is as follows:

$$rm_{(i, j)} = M_{r} \, \mathrm{ReLU} (h_{i}^{sub} \odot h_{j}^{obj}) + b_{r} \tag{9}$$

where $M_{r}$ is a trainable parameter matrix, $\mathrm{ReLU}$ is the activation function, $\odot$ represents the Hadamard product operation for multiple matrices, and $b_{r}$ is a trainable bias.
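A worked instance of equation (9) for a single (subject, object) pair might look as follows. The sketch assumes that $M_{r}$ has one row per relation and that the two feature vectors have equal length; the numbers and the function name `relation_logits` are invented for illustration.

```python
def relation_logits(h_sub, h_obj, M_r, b_r):
    """Score each relation for one (subject, object) pair, following equation (9)."""
    # ReLU applied to the element-wise (Hadamard) product of the two feature vectors.
    fused = [max(0.0, s * o) for s, o in zip(h_sub, h_obj)]
    # One linear score per relation: a row of M_r times the fused vector, plus a bias.
    return [sum(w * f for w, f in zip(row, fused)) + b for row, b in zip(M_r, b_r)]


h_sub = [1.0, 2.0]
h_obj = [0.5, -1.0]
M_r = [[2.0, 1.0], [0.0, 3.0]]   # two hypothetical relations
b_r = [0.5, -0.5]
logits = relation_logits(h_sub, h_obj, M_r, b_r)  # [1.5, -0.5]
```

The second feature is zeroed by the ReLU because the product is negative, so only the first feature contributes to the relation scores.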

After completing all sequence labeling, $H_{sub}$ , $H_{obj}$ , and $R_M$ contain all candidate subjects and objects, together with their relation labels, in the sentence. The entity set is defined as $H_{e} = H_{sub} \cup H_{obj}$ .

Based on the table-filling mechanism, a CLIM strategy is proposed to address the issue of overlapping entity relationships and to accomplish the task of matching relationships between entities and extracting the final relationship triples. The core idea of the CLIM strategy lies in extracting all entity pairs and entity–relation pairs that involve the determined relationship, denoted as $\{(e_{i}, e_{j}), (e_{i}, r_{k}), (r_{k}, e_{j})\}$ . The entity–relation triple $T_{(e_{i}, r_{k}, e_{j})}$ is defined as follows:

$$T_{(e_{i}, r_{k}, e_{j})} = \begin{cases} 1, & L (e_{i}, e_{j}) \cap L (e_{i}, r_{k}) \cap L (r_{k}, e_{j}) = 1 \\ 0, & \text{else} \end{cases} \tag{10}$$

where $e_{i}, e_{j} \in H_{e}$, $e_{i} \in H_{sub}$, $e_{j} \in H_{obj}$, and $r_{k} \in R_M$. $L ()$ is defined as a function that determines whether there is a relationship label that matches between entities or relations. It returns $True$ if there is a match and $False$ otherwise.

By combining the table-filling concept with the powerful deep correlation-capturing ability provided by Transformer layers, the specific calculation formula for $L ()$ is as follows:

$$Q_{c} = H_{e} W_{c}^{Q} \tag{11}$$

$$K_{c} = H_{e} W_{c}^{K} \tag{12}$$

$$\hat{B} = \mathrm{sigmoid}\left(\frac{1}{d_{heads}} \sum_{c}^{d_{heads}} \frac{Q_{c} K_{c}^{T}}{\sqrt{d}}\right) \tag{13}$$

where $d_{heads}$ represents the number of heads in the attention model, and $W_{c}^{Q}$ and $W_{c}^{K}$ are trainable weights.
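Equations (11) to (13) can be traced with a small pure-Python sketch: each head projects the representations into queries and keys, the scaled dot products are averaged over the heads, and a sigmoid turns each cell into a probability. The helper names, the toy inputs, and the reading of $d$ as the per-head projection width are assumptions made for illustration only.

```python
import math


def matmul(A, B):
    """Plain matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def label_scores(H, W_Q, W_K):
    """Average scaled dot-product scores over all heads, then apply sigmoid.

    H:   K x d_model matrix of item representations.
    W_Q: list with one d_model x d query projection per head.
    W_K: list with one d_model x d key projection per head.
    """
    n_heads = len(W_Q)
    size = len(H)
    d = len(W_Q[0][0])  # assumed meaning of d in Eq. (13)
    total = [[0.0] * size for _ in range(size)]
    for c in range(n_heads):
        Q_c = matmul(H, W_Q[c])                 # Eq. (11)
        K_c = matmul(H, W_K[c])                 # Eq. (12)
        S = matmul(Q_c, transpose(K_c))
        for i in range(size):
            for j in range(size):
                total[i][j] += S[i][j] / math.sqrt(d)
    # Eq. (13): mean over heads followed by sigmoid
    return [[sigmoid(total[i][j] / n_heads) for j in range(size)] for i in range(size)]


identity = [[1.0, 0.0], [0.0, 1.0]]
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B_hat = label_scores(H, W_Q=[identity, identity], W_K=[identity, identity])
```

A cell of `B_hat` could then be turned into a True/False label by thresholding, for example at 0.5; the threshold is not stated in this section and is an assumption.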

Finally, based on the entity and relationship label judgment matrices, entities and relationships are matched to obtain all the triples. The matching strategy is given in Algorithm 1.

Algorithm 1. Entity–relation matching strategy.

Input: The pre-defined relation set $R$, the entity–relation label matrix $L_{e-r}$, the relation–entity label matrix $L_{r-e}$, the entity–entity label matrix $L_{e-e}$, and the entity and relation index dictionaries $Dic_{e}$ and $Dic_{r}$.
Output: The predicted triple set $T_{(S)}$.
Define temporary sets $Tmp_{H}$ and $Tmp_{T}$
# Step 1: Extract subject–relation pairs from matrix $L_{e-r}$
for each cell $(i, j)$ in $L_{e-r}$ do:
    if $L_{e-r}[i][j]$ is $True$ then:
        $Tmp_{H}[Dic_{r}[j]] \leftarrow Dic_{e}[i]$
    end if
end for
# Step 2: Extract relation–object pairs from matrix $L_{r-e}$
for each cell $(i, j)$ in $L_{r-e}$ do:
    if $L_{r-e}[i][j]$ is $True$ then:
        $Tmp_{T}[Dic_{r}[j]] \leftarrow Dic_{e}[i]$
    end if
end for
# Step 3: Check subject–relation pairs against relation–object pairs
for $r$ in $Tmp_{H}$.keys() do:
    if $Tmp_{T}[r]$ is not $null$ then:
        if $L_{e-e}[Tmp_{H}[r]][Tmp_{T}[r]]$ is $True$ then:
            triple = ($Tmp_{H}[r]$, $r$, $Tmp_{T}[r]$)
            $T_{(S)}$.add(triple)
        end if
    end if
end for
return $T_{(S)}$

The decoding strategy first identifies entity groups corresponding to the same relation from the entity–relation and relation–entity labels. Then, it verifies the validity of the identified entity groups based on the entity–entity label matrix. As a result, it matches all the desired relation corresponding triplets. The above decoding method considers complex sentence structures that involve overlapping relationships, such as entity pair overlap (EPO) and single entity overlap (SEO), and can accurately match all possible entity and relation triplets.
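The decoding steps can be sketched in Python as follows. This is an illustrative reading of Algorithm 1, not the authors' code: it assumes that all three label matrices are nested lists of booleans, that both entity–relation matrices are indexed as `[entity][relation]`, and that each relation may collect several candidate subjects and objects (stored in sets) so that overlapping triples are kept. The function name `match_triples` and the example labels are hypothetical.

```python
from collections import defaultdict


def match_triples(L_er, L_re, L_ee, dic_e, dic_r):
    """Match subjects, relations, and objects into triples.

    L_er[i][j]: entity i is a subject of relation j.
    L_re[i][j]: entity i is an object of relation j.
    L_ee[s][o]: entities s and o form a valid (subject, object) pair.
    dic_e / dic_r: map indices to entity / relation names.
    """
    tmp_h = defaultdict(set)  # relation index -> candidate subject indices
    tmp_t = defaultdict(set)  # relation index -> candidate object indices

    # Step 1: subject-relation pairs
    for i, row in enumerate(L_er):
        for j, flag in enumerate(row):
            if flag:
                tmp_h[j].add(i)

    # Step 2: relation-object pairs
    for i, row in enumerate(L_re):
        for j, flag in enumerate(row):
            if flag:
                tmp_t[j].add(i)

    # Step 3: keep only pairs confirmed by the entity-entity matrix
    triples = set()
    for r, subjects in tmp_h.items():
        for s in subjects:
            for o in tmp_t.get(r, ()):
                if L_ee[s][o]:
                    triples.add((dic_e[s], dic_r[r], dic_e[o]))
    return triples


dic_e = {0: "Carbon Monoxide Leak", 1: "Poisoning", 2: "Carbon Monoxide"}
dic_r = {0: "type", 1: "involves"}
L_er = [[True, True], [False, False], [False, False]]
L_re = [[False, False], [True, False], [False, True]]
L_ee = [[False, True, True], [False, False, False], [False, False, False]]
triples = match_triples(L_er, L_re, L_ee, dic_e, dic_r)
```

In this example the subject "Carbon Monoxide Leak" takes part in two triples, which is the SEO pattern; both triples are recovered because each is confirmed by the three label matrices.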

Loss

Since the problem of extracting entity-relation triplets containing overlapping relationships is formulated as determining the existence of relationships between entities or relation groups, this module utilizes binary cross-entropy as the loss function to assist model training. It is defined as follows:

$$Loss = - \frac{1}{(K_{e} + K_{r})^{2}} \sum_{i}^{K_{e} + K_{r}} \sum_{j}^{K_{e} + K_{r}} \left[ L \log \hat{L} + (1 - L) \log (1 - \hat{L}) \right] \tag{14}$$

where $K_{e}$ and $K_{r}$ represent the number of entities and the number of relations, respectively. $L$ represents the ground truth of the entity–relation label judgment function, and $\hat{L}$ represents the predicted results of the judgment function.
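To make equation (14) concrete, the sketch below averages the binary cross-entropy over every cell of a square label table of side $K_{e} + K_{r}$. The clipping constant `eps`, which avoids taking the logarithm of zero, is an implementation assumption and is not part of the original formula.

```python
import math


def bce_loss(L_true, L_pred, eps=1e-12):
    """Mean binary cross-entropy over all cells of a square label table."""
    size = len(L_true)  # size corresponds to K_e + K_r
    total = 0.0
    for i in range(size):
        for j in range(size):
            y = L_true[i][j]
            p = min(max(L_pred[i][j], eps), 1.0 - eps)  # assumed clipping
            total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / (size * size)


L_true = [[1, 0], [0, 1]]
uninformed = bce_loss(L_true, [[0.5, 0.5], [0.5, 0.5]])  # equals ln 2
confident = bce_loss(L_true, [[0.9, 0.1], [0.1, 0.9]])   # equals -ln 0.9
```

An uninformed prediction of 0.5 in every cell gives a loss of $\ln 2$, and the loss falls as the predictions move toward the ground-truth labels.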

Construction of the ontology in hazardous chemical accident management

Due to the diverse types of hazardous chemicals and the variety of accident scenarios, constructing a knowledge graph that can continuously evolve becomes particularly crucial. Such a knowledge graph can better support the prevention, management, and response to hazardous chemical incidents. Ontology, as the cornerstone of the knowledge graph, plays a crucial role by eliminating semantic barriers between different data sources, ensuring consistency and accuracy in knowledge representation. By carefully defining entity types, properties, and their relationships, ontology provides a standardized semantic framework for the knowledge graph, laying a solid foundation for subsequent knowledge management, queries, and reasoning.

This section presents a rule-based approach to construct an ontology for hazardous chemical accident management. The aim is to provide a standardized semantic framework for mapping the predictions of the CPBA-CLIM model to the knowledge graph. Through such construction, we ensure the sustainability of the knowledge graph for hazardous chemical incident management. This process establishes a robust foundation for future updates and the evolution of the knowledge graph, enabling it to adapt to dynamic changes in hazardous chemicals and to provide persistent support for practical application scenarios.

The overall architecture for defining ontology construction rules is illustrated in Figure 5.

Step 1: Determine ontology scope and objectives. Based on meaningful information contained in incident reports and drawing from human knowledge and experience, the scope and objectives of the ontology were determined. The application scope of the ontology was defined within the domain of hazardous chemical incident information to provide knowledge services for hazardous chemical management.

Step 2: Design ontology structure. The ontology's categories, properties, relationships, etc., were defined based on domain knowledge and scope objectives.

Step 3: Identify categories. In the ontology for hazardous chemical management, defined categories included chemical, incident, hazard, organization, etc.

Step 4: Establish class hierarchy. The ontology's class hierarchy aims to provide a clear and orderly framework for the field of hazardous chemical incidents management, accurately reflecting entities and their relationships. By defining inheritance relationships between categories, it achieves a precise expression of the generality and specificity among classes, enhancing the clarity and comprehensibility of the organizational structure of knowledge.

Figure 5.

The architecture of rules for the construction of the hazardous chemical incident management ontology.

In this study, we adhered to the following design principles in establishing the class hierarchy. Firstly, based on human prior knowledge, we considered professional terminology and general classifications in hazardous chemical incident management. Secondly, we integrated entities and relationships extracted by the CPBA-CLIM model, ensuring close relevance between the ontology and actual data/model calculation results. Lastly, combined with the requirements analysis of safety management in the hazardous chemical domain, we ensured that the ontology structure authentically reflects users’ needs in their practical work.

According to the design principles above, the class hierarchy is designed as follows:

Five top classes: These top-level categories define the framework of the entire classification or ontology, representing the most extensive and abstract concepts indicative of the overall scope of the domain. This study identified five top-level classes: organization, equipment, chemical, person, and incident.

Addition of subclasses: The addition of subclasses aims to provide more detailed support for describing each top-level class. In this study, top-level categories were further divided into three hierarchical levels of subclasses, ensuring the rationality of the hierarchy. The lowest-level subclasses represent the most specific entities.

The class hierarchy of the ontology we constructed to manage hazardous chemical incidents is illustrated in Figure 6. This structure provides an overall framework and a specific and orderly knowledge organization format for practical applications.

Step 5: Define interclass relationships. Interclass relationships describe semantic connections between different categories, providing a more intricate and enriched network of associations within a knowledge graph. Defining these relationships reveals the inherent connections among diverse categories, thereby giving the knowledge graph greater semantic depth. Building upon the class hierarchy defined in Step 4, we integrate domain-specific expertise from relevant personnel and analyze practical case studies to identify the actual associations between different categories. This process aims to ensure that the defined relationships hold practical value in real-world applications, offering support for the practical implementation of the knowledge graph. The comprehensive and precise definition of interclass relationships is pivotal to ontology development. Through this step, we provide robust support for the structure of the knowledge graph, enhancing the clarity and practical significance of the conceptual relationships within it. For example, we defined the object attribute “invChe” (involving chemical) between the Chemical and Incident classes. The domain of this object attribute is Incident, and its range is Chemical. Additional class relationships are illustrated in Figure 7.
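The role of domain and range can be illustrated with a small Python sketch that checks whether a candidate triple is compatible with the “invChe” attribute. The `ObjectProperty` class, the `is_valid` helper, and the capitalized class names are hypothetical stand-ins; the source does not specify how the ontology is encoded.

```python
from dataclasses import dataclass

# The five top-level classes named in Step 4.
TOP_CLASSES = {"Organization", "Equipment", "Chemical", "Person", "Incident"}


@dataclass(frozen=True)
class ObjectProperty:
    name: str
    domain_cls: str  # class allowed as the subject
    range_cls: str   # class allowed as the object


INV_CHE = ObjectProperty(name="invChe", domain_cls="Incident", range_cls="Chemical")


def is_valid(prop, subject_cls, object_cls):
    """Return True when the subject and object classes fit the property."""
    return (
        subject_cls in TOP_CLASSES
        and object_cls in TOP_CLASSES
        and subject_cls == prop.domain_cls
        and object_cls == prop.range_cls
    )


ok = is_valid(INV_CHE, "Incident", "Chemical")        # incident involves chemical
reversed_ok = is_valid(INV_CHE, "Chemical", "Incident")
```

A check of this kind lets extracted triples whose direction contradicts the ontology be rejected before they are written to the graph.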

Figure 6.

The class hierarchy structure of the ontology for hazardous chemical incident management.

Figure 7.

The interclass relationships of the ontology for hazardous chemical incident management.

Experiments

Experimental settings

To intuitively verify the ability of the proposed model to handle complex textual scenarios with entity and relation overlap and accurately extract entity–relation triplets, the experimental setup is as follows:

Firstly, this article selected two significant publicly available datasets, NYT³² and WebNLG,³³ for experimentation. This selection aims to provide a comparable benchmark for our model's performance, facilitating result comparison with current strong baseline and state-of-the-art models, thus accurately assessing the model's accuracy.

Secondly, after validating the model's capability to process complex texts, we qualitatively analyzed the model's performance. We used hazardous chemical incident reports as input data for the model and extracted entity-relation triplets from these reports. These triplets were stored as < entity, relation, entity > to visually evaluate the model's ability and performance.

Finally, to further demonstrate the practical value of the model, we stored the entity and relation data extracted from the incident reports in a Neo4j graph database. We constructed an ontology-based knowledge graph of hazardous chemical incident management and implemented application functionalities such as incident information retrieval and statistical analysis.

Datasets

Public datasets: Two widely used benchmark datasets, NYT³² and WebNLG,³³ were selected. The NYT dataset comprises articles from the New York Times and 24 predefined relations. The WebNLG dataset was originally created for natural language generation and was adapted by Zeng et al.³¹ for relation triplet extraction tasks; it consists of 171 predefined relations. Both NYT and WebNLG have two versions, one marking only the last word of each entity and the other marking the entire span of the entity. In this study, the first version was chosen as the experimental dataset.

Self-constructed dataset: For this research project, 868 hazardous chemical incident reports were collected from national government websites. Each report ranged from 50 to 300 words. After sentence segmentation, organization, and manual annotation to handle multiple entity relationships, the experimental corpus comprised ∼38.15 million characters, totaling 2738 instances of entity–relation extraction. The custom dataset consisted of six predefined relations. This dataset is named the hazardous chemicals incident report (HCAR) dataset. The schema definition of the dataset is illustrated in Table 1.

Table 1.

Schema example of self-built dataset hazardous chemicals incident report (HCAR).

Subject	Object	Relation	Triple example
Incident name	Incident type	Type	<Carbon Monoxide Leak, type, Poisoning>
Incident name	Hazardous material	Involves	<Carbon Monoxide Leak, involves, Carbon Monoxide>
Incident name	Occurrence time	Time	<Carbon Monoxide Leak, time, July 21, 2013>
Incident name	Occurrence location	Location	<Carbon Monoxide Leak, location, Alkali Sulfide Workshop>
Factor/event	Incident	Leads to	<Reversal of Elevator Housing, leads to, High Carbon Monoxide Concentration in Pit>
Cause description	Hazardous	Cause	<Carbon Monoxide, cause, Frequency Converter Trip of Forced Draft Fan, insufficient draft flow>

Data splitting and evaluation metrics

This study employs the standard data-splitting method³¹ to divide each dataset into training, validation, and testing sets. This keeps the training setup in line with current research practice and allows a fair comparison with the baseline models. Additionally, because one of the research objectives is to evaluate the accuracy of extracting overlapping entity–relation triplets, we also counted the triplets under each overlapping scenario in every dataset. The statistical results are presented in Table 2:

Table 2.

Statistics of the experimental datasets.

				Overlapping pattern in test datasets
Dataset	Train	Valid	Test	Normal	SEO	EPO	SOO
NYT	56,195	5000	5000	3266	1297	978	45
WebNLG	5019	500	703	245	457	26	84
HCAR	1644	547	547	204	346	20	42

SEO: single entity overlap; EPO: entity pair overlap; SOO: subject object overlap.

This experiment follows the evaluation method employed by Fu et al.³⁰ It is considered that the model correctly extracts a relationship triplet, denoted as $T = ⟨ subject, relation, object ⟩$ , only when entities (subject and object) and the relationship (relation) are all accurately extracted. The evaluation in this experiment utilizes three commonly used metrics in entity relation extraction tasks: standard precision (Prec.), recall (Rec.), and F1 score (F1). The mathematical formulations for Prec. and Rec. are as follows:

$$\mathrm{Prec.} = \frac{TP}{TP + FP}$$

$$\mathrm{Rec.} = \frac{TP}{TP + FN}$$

where TP represents true positive, indicating the number of instances where the model correctly predicts a sample belonging to a specific category. FP stands for false positive, denoting the cases where the model incorrectly predicts a non-category sample belonging to that category. FN corresponds to false negative, representing instances where the model wrongly predicts a sample of that category as a non-category.

The F1 score is the harmonic mean of precision and recall, serving as a comprehensive measure of the model's performance. Its formula is given by

$$F1 = \frac{2 \times \mathrm{Prec.} \times \mathrm{Rec.}}{\mathrm{Prec.} + \mathrm{Rec.}}$$

The F1 score ranges from 0 to 1, with higher values indicating better model performance. The evaluation is conducted by reporting the standard micro precision, recall, and F1-score on the testing set.

Precision evaluates the accuracy of the model's positive predictions: higher Prec. values mean that more of the predicted triplets are correct. Recall assesses the model's ability to identify all true positive instances: higher Rec. values mean that fewer true triplets are missed. The F1 score provides a comprehensive evaluation by considering both precision and recall.
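The micro-averaged metrics under the exact-match criterion can be sketched as below: a predicted triple counts as a true positive only if its subject, relation, and object all match a gold triple, and the counts are pooled over all sentences before the ratios are taken. The function name `micro_prf` and the placeholder triples are illustrative assumptions.

```python
def micro_prf(gold, pred):
    """Micro precision, recall, and F1 over per-sentence triple collections."""
    tp = fp = fn = 0
    for gold_triples, pred_triples in zip(gold, pred):
        g, p = set(gold_triples), set(pred_triples)
        tp += len(g & p)  # triples matched exactly
        fp += len(p - g)  # predicted but not in the gold set
        fn += len(g - p)  # gold triples that were missed
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1


gold = [
    {("A", "type", "B"), ("A", "involves", "C")},
    {("E", "time", "F")},
]
pred = [
    {("A", "type", "B"), ("A", "involves", "D")},  # wrong object in one triple
    {("E", "time", "F")},
]
prec, rec, f1 = micro_prf(gold, pred)
```

Here two of the three predicted triples are correct and two of the three gold triples are found, so precision, recall, and F1 all equal 2/3.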

Implementation details

The model is implemented based on the PyTorch framework. The Adam optimizer³⁴ is used for model optimization. The learning rate is set to 3 × 10⁻⁵. The model is trained with a batch size of 24 for 100 epochs. The size of the attention heads, $d_{heads}$ , is 64.

To compare the performance with other models, the maximum length of input sentences is set to 128. The encoder model utilizes a pre-trained language model, BERT-base, with 108 M parameters. Additionally, the CPBA model is implemented to measure the model's accuracy and efficiency. The training and validation are executed using two NVIDIA GeForce RTX2080Ti GPUs.

Ten strong models, including SOTA models, are selected as benchmark methods for the comparative experiments. These methods specifically address the issue of entity relation overlap. NovelTagging¹⁰ and TPlinker_BERT³⁵ employ improved sequence labeling techniques to solve the triplet overlap problem. CopyRE³¹ and GraphRel³⁰ are end-to-end information extraction models. CasRel_BERT³⁶ learns a mapping function from subject entities to object entities to accomplish triplet extraction tasks. GRTE_BERT³⁷ utilizes graph neural networks to capture complex dependency relationships by mapping entities and relations to nodes and edges of a graph, respectively. PRGC_BERT,³⁸ EmRel,³⁹ and OneRel_BERT⁴⁰ are joint relation extraction models that employ different strategies for the joint training of entity relations.

Experimental results

This section will present the experimental results of our model compared to other baseline methods on publicly available datasets, including overall results and results in complex scenarios.

Overall results

Table 3 presents the results of our model compared to other baseline methods on the two publicly available datasets, NYT and WebNLG. Considering that the encoder module of most mainstream methods is based on pre-trained BERT, this experiment reports results with both pure BERT and BERT with the additional CPBA module as the encoder. The table shows that, regardless of whether CPBA or BERT is used as the encoder, our model outperforms most mainstream baseline models in precision (Prec.), recall (Rec.), and F1-score. Furthermore, comparing the two encoders in our model suggests that the CPBA encoder achieves higher accuracy than the pure BERT encoder due to its incorporation of contextual semantics.

Table 3.

Comparison (%) of the proposed method with the baseline methods on the NYT and WebNLG datasets.

	NYT			WebNLG
Model	Prec.	Rec.	F1	Prec.	Rec.	F1
NovelTagging	62.4	31.7	42.0	52.5	19.3	28.3
CopyRE	61.0	56.6	58.7	37.7	36.4	37.1
GraphRel	63.9	60.0	61.9	44.7	41.1	42.9
OrderCopyRE	77.9	67.2	72.1	63.3	59.9	61.6
CasRel_BERT	89.7	89.5	89.6	93.4	90.1	91.8
TPlinker_BERT	91.3	92.5	91.9	91.7	92.0	91.9
GRTE_BERT	92.9	93.1	93.0	93.7	94.2	93.9
PRGC_BERT	93.3	91.9	92.6	94.0	92.1	93.0
EmRel	91.7	92.5	92.1	92.7	93.0	92.9
OneRel_BERT	92.8	92.9	92.8	94.1	94.4	94.3
Ours_BERT	93.0	93.3	93.1	94.0	94.4	94.4
Ours_CPBA	93.5	93.6	93.2	94.4	94.6	94.7

Prec.: precision; Rec.: recall; F1: F1 score; BERT: bidirectional encoder representations from the transformer.

Complex scenarios results

To evaluate the performance of our model in handling sentences with different overlap patterns and varying numbers of triplets, we conducted further experiments on the NYT and WebNLG datasets. Since handling overlap patterns is closely related to the decoder module and most baseline methods use BERT as the encoder module, we used a pre-trained BERT model as the encoder in this section. The experimental results are summarized in Table 4.

Table 4.

F1-score (%) on sentences with different overlapping patterns and different numbers of triples.

		Overlap scenario				Count of triples
Dataset	Model	Normal	SEO	EPO	SOO	C = 1	C = 2	C = 3	C = 4	C ≥ 5
NYT	CasRel_BERT	87.3	91.4	92.0	77.0	88.2	90.3	91.9	94.2	83.7
	TPlinker_BERT	90.1	93.4	94.0	90.1	90.0	92.8	93.1	96.1	90.0
	PRGC_BERT	91.0	94.0	94.5	81.8	91.1	93.0	93.5	95.5	93.0
	OneRel_BERT	90.6	95.1	94.8	90.8	90.5	93.4	93.9	96.5	94.2
	Ours	90.0	95.3	95.0	90.1	91.0	94.1	94.2	96.1	94.4
WebNLG	CasRel_BERT	89.4	92.2	94.7	90.4	89.3	90.8	94.2	92.4	90.9
	TPlinker_BERT	87.9	92.5	95.3	86.0	88.0	90.1	94.6	93.3	91.6
	PRGC_BERT	90.4	93.6	95.9	94.6	89.9	91.6	95.0	94.8	92.8
	OneRel_BERT	91.9	95.4	94.7	94.9	91.4	93.0	95.9	95.7	94.5
	Ours	91.0	95.5	95.4	94.6	90.3	93.5	95.3	95.7	94.5

SEO: single entity overlap; EPO: entity pair overlap; SOO: subject-object overlap; BERT: bidirectional encoder representations from transformer.

According to the results in Table 4, our model performs well in the overlapping scenarios. It achieves the best F1 score in the SEO scenario on both datasets and in the EPO scenario on the NYT dataset. In the remaining overlap scenarios it ranks second or ties for second, except for Normal sentences on NYT, where it trails PRGC_BERT by 1.0 point (90.0 versus 91.0). In the experiments with different numbers of triplets, the model achieves the best results when the number of triplets is 2, 3, or ≥ 5 on NYT and when it is 2, 4, or ≥ 5 on WebNLG (tying OneRel_BERT in the latter two cases), and it stays within the top three in the other cases. These results confirm the model's ability to extract overlapping triplets on the public benchmarks.

It is worth noting that in our self-constructed dataset HCAR, the SEO overlap pattern accounts for 56.5% of the experimental samples. The strong SEO results on the public datasets therefore suggest that the model is well suited to triplet extraction in hazardous chemical incident scenarios.

Application instance of CPBA-CLIM model in the knowledge graph

(1) Mapping of model results and knowledge graph construction

Applying an entity-relation extraction model in hazardous chemical incident management overcomes the challenge of low information utilization in incident reports within the hazardous chemical domain, effectively enhancing the automation and efficiency of knowledge graph construction. This section uses the CPBA-CLIM model developed in this paper to identify the necessary entities and inter-entity relationship data from the self-constructed dataset HCAR. Then, following the knowledge graph construction rules based on the hazardous chemical incident management ontology given in the “Construction of the ontology in hazardous chemical accident management” section, the data are stored in a Neo4j graph database using the LOAD CSV method, which completes the construction of the knowledge graph for hazardous chemical accident management.
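A possible shape of the LOAD CSV step is sketched below in Python, which writes the extracted triples to CSV text and builds one Cypher statement per relation. Because Cypher does not allow a relationship type to be supplied from a row value in a plain MERGE pattern, the sketch filters the rows by relation and emits a separate statement for each relation type. The node label `Entity`, the column names, and the file name `hcar_triples.csv` are hypothetical; the schema used by the authors is not given in this section.

```python
import csv
import io


def triples_to_csv(triples):
    """Serialize (subject, relation, object) triples as CSV text with a header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["subject", "relation", "object"])
    writer.writerows(triples)
    return buffer.getvalue()


def load_csv_query(relation, file_url="file:///hcar_triples.csv"):
    """Build a Cypher LOAD CSV statement for one relation type."""
    rel_type = relation.upper().replace(" ", "_")
    if not rel_type.replace("_", "").isalnum():
        raise ValueError(f"unsupported relation name: {relation!r}")
    return (
        f"LOAD CSV WITH HEADERS FROM '{file_url}' AS row\n"
        f"WITH row WHERE row.relation = '{relation}'\n"
        "MERGE (s:Entity {name: row.subject})\n"
        "MERGE (o:Entity {name: row.object})\n"
        f"MERGE (s)-[:{rel_type}]->(o)"
    )


triples = [
    ("Carbon Monoxide Leak", "type", "Poisoning"),
    ("Carbon Monoxide Leak", "time", "July 21, 2013"),
    ("Reversal of Elevator Housing", "leads to",
     "High Carbon Monoxide Concentration in Pit"),
]
csv_text = triples_to_csv(triples)
query = load_csv_query("leads to")
```

The generated statements would then be run against the database, for example through the Neo4j Browser, after the CSV file has been placed in the database import directory.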

The resulting knowledge graph for hazardous chemical incident management is presented in Figure 8. For ease of analysis and considering the limitations of information presentation due to image size, the paper displays only seven classes of entities and six classes of relationship data. The seven entity classes are Incident Type (INC), Incident Time (TIME), Incident Location (LOC), Injured Personnel (INJ), Fatality (FAT), Economic Loss (ELOSS), and Incident Cause (CAUSE). The six classes of relationship data are TIME_to_INC, LOC_to_INC, INJ_to_INC, FAT_to_INC, ELOSS_to_INC, and CAUSE_to_INC.

Figure 8.

Example of a knowledge graph for hazardous chemical incident management.

The example of the knowledge graph for hazardous chemical incident management demonstrates that the model's results effectively showcase the entity–relation extraction outcomes, particularly for SEO scenarios. The extraction performance of the model for SEO, EPO, and subject object overlap (SOO) scenarios, obtained through statistical analysis, is presented in Table 5. The “Total” column represents the number of triplets in the dataset with a specific overlap type, while “RecCnts” indicates the number of triplets extracted by the model for that overlap type. The analysis reveals that the model achieves an extraction rate exceeding 91% for all overlapping scenarios. It demonstrates the model's capability to accurately capture crucial information from unstructured data, such as hazardous chemical incident reports.

Table 5.

Statistics of extracted overlapping entity–relation triples in the HCAR dataset.

Overlapping type	Total	RecCnts	Extraction rate (%)
SEO	3758	3432	91.3
EPO	217	199	91.7
SOO	436	401	91.9

SEO: single entity overlap; EPO: entity pair overlap; SOO: subject-object overlap.

(2) Incident statistics and case analysis based on knowledge graph

Statistical analysis is an essential requirement in the analysis of hazardous chemical incidents. Once an accurate hazardous chemical incident report is established, different statistical analyses can be performed according to specific needs. Figure 9 illustrates three commonly used types of retrieval statistics, including the statistical codes and examples of visual presentations. These types include incident time, location, and incident-type statistics. It is worth noting that achieving a combined statistical analysis of incident information is possible by controlling the retrieval conditions through codes.

Figure 9.

Example of hazardous chemical incident information statistics.
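The combined retrieval statistics described above can be imitated on an in-memory list of triples, as in the following sketch. It is a stand-in for the graph queries shown in Figure 9, not a reproduction of them: the function name `count_by_relation`, the `where` filter argument, and the three example incidents are hypothetical.

```python
from collections import Counter


def count_by_relation(triples, relation, where=None):
    """Count the objects of `relation`, optionally filtering the incidents.

    `where` maps a relation name to a required object value, so several
    retrieval conditions can be combined in one call.
    """
    where = where or {}
    by_subject = {}
    for subject, rel, obj in triples:
        by_subject.setdefault(subject, {}).setdefault(rel, set()).add(obj)
    counts = Counter()
    for rels in by_subject.values():
        if all(value in rels.get(key, ()) for key, value in where.items()):
            counts.update(rels.get(relation, ()))
    return counts


# Hypothetical incidents used only to exercise the function.
example_triples = [
    ("Incident A", "type", "Explosion"),
    ("Incident A", "location", "Storage area"),
    ("Incident B", "type", "Poisoning"),
    ("Incident B", "location", "Processing area"),
    ("Incident C", "type", "Explosion"),
    ("Incident C", "location", "Processing area"),
]
all_types = count_by_relation(example_triples, "type")
processing_types = count_by_relation(
    example_triples, "type", where={"location": "Processing area"}
)
```

Adding entries to `where` narrows the retrieval conditions, which corresponds to the combined statistical analysis of incident information.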

According to the knowledge graph of hazardous chemical safety management based on accident reports, the data analysis of instances is as follows: The HCAR dataset contains 868 accidents. When examining the incident data from 2011 to 2022, there is a notable 22.4% reduction in accidents in 2022 compared to 2011. At first glance, the data trend suggests that the chemical industry is becoming safer.

In terms of accident types, explosive accidents are the most prevalent, accounting for 487 cases, followed by leakage and poisoning accidents (292 cases) and fire accidents (184 cases). Asphyxiation accidents represent a smaller proportion at 5.9% (40 cases). Additionally, the majority of accidents, 75.9%, occur in processing areas, with storage areas accounting for 14.7%. According to related searches, 80% of accidents in storage areas are associated with explosions, highlighting a critical need for preventative measures in these locations.

Analyzing hazardous chemical types reveals that hydrogen sulfide is the primary cause of accidents, contributing to 11.2% of cases (77 accidents), predominantly resulting in poisoning. Carbon monoxide ranks second with 5.3% (36 cases), followed by hydrogen at 4.7% (32 cases). It is worth noting that nitrogen, while not classified as hazardous, is implicated in 4.1% of incidents (28 cases) due to its high concentration in enclosed spaces leading to asphyxiation.

Compared with hydrogen sulfide and carbon monoxide, natural gas (2.9%, 25 cases) and petroleum vapor (3.5%, 24 cases) are more prone to causing explosions, accounting for 52 incidents.

The graph data structure of the hazardous chemical incident management knowledge graph facilitates a more precise and intuitive understanding of incidents. For instance, retrieving causes of leakage and poisoning incidents from the past five years enables analysis to identify patterns and common attributes, contributing to proactive measures in preventing similar incidents in the future.

(3) The application significance of our method in hazardous chemical safety management

There is a wealth of textual information on hazardous chemical safety, with incident reports providing detailed records and summaries of hazardous chemical accidents. These reports offer valuable empirical data for chemical safety management, facilitating a profound understanding of the underlying causes of accidents. Effectively extracting and analyzing this information, and transforming it into a graph data structure to build a domain knowledge graph, provides managers with more accurate and intuitive insights into events. This, in turn, supports improving management systems, equipment, and training based on scientific evidence.

However, traditional methods face challenges in fully utilizing and extracting dispersed knowledge stored in different files. In this context, the CPBA-CLIM model developed in this study addresses the challenges of entity relationship extraction in hazardous chemical accident reports, along with a statistical analysis approach based on knowledge graphs, demonstrating significant innovation and advantages. This research provides an accurate data foundation for constructing knowledge graphs in hazardous chemical accident management.

The CPBA-CLIM model proposed in this study and the method of building ontology-based knowledge graphs from the information extracted and mapped by the model represent both a technical innovation and a tool for managers to gain insights into the deep-seated causes of accidents.

For instance, conducting statistical analysis of information related to hazardous chemical accidents stored in graph data structure form (as discussed in the “Application instance of CPBA-CLIM model in the knowledge graph” section, point (2)) helps reveal patterns in accident occurrences, discover hidden common attributes, and derive valuable lessons. Furthermore, connecting multiple entity nodes such as accident types, storage locations, types of hazardous chemicals, and accident causes to form a graph structure can assist relevant personnel in understanding whether a leakage incident in a specific area is related to certain environmental factors in that area (e.g. hydrogen sulfide easily causing poisoning accidents in storage (11.2%), with improper personnel operations leading to leaks (28%) or failure to wear personal protective equipment correctly (13%)).

This multidimensional correlation may be challenging to capture with traditional statistical methods. Therefore, the approach presented in this paper records accident data and provides various retrieval and statistical means for managers to have a data foundation for accident analysis. This helps them understand the multidimensional characteristics of hazardous chemical accidents, comprehend the fundamental causes, and formulate safety strategies and preventive measures more effectively. Ultimately, it contributes to preventing similar incidents in the future.

Summary of experiments

To simultaneously assess the model's generalization capabilities and performance in specific application scenarios and to enable a more effective comparison with other baseline models, we conducted experimental evaluations of the proposed methods on public and self-constructed datasets in the experimental section. The outcomes indicate that our proposed methods perform exceptionally well in extracting text features and managing overlapping triplets. The detailed summary is as follows:

The experiments on public datasets provide a comparable benchmark for the performance of the proposed model against strong baseline models, achieving F1 scores of 93.2% and 94.7% on the NYT and WebNLG datasets, respectively. The model also exhibits extraction rates of entity–relationship overlaps above 91% in complex contextual situations. The extraction results on the self-built dataset further validate the model's effectiveness in scenarios involving specialized vocabulary and overlapping entity relationships. The qualitative analysis demonstrates the model's ability to provide data support for constructing a hazardous chemical incident knowledge graph based on extracted triple data from reports. In the final part of the experiment, we demonstrated the application value of the CPBA-CLIM model in hazardous chemical incident management. The application of this model not only improves the utilization rate of report data and the efficiency of knowledge graph construction but also provides insights through data analysis. This assists relevant personnel in taking more effective preventative and responsive measures in similar scenarios in the future.

However, we also acknowledge certain limitations of the model, such as a noticeable decrease in the completeness and accuracy of information extraction in complex sentence structures or extreme cases. Therefore, the robustness of the model when handling more challenging textual scenarios remains an area of concern. Future work will focus on addressing these limitations and improving model performance. Specifically, we plan to expand the scope of the dataset by incorporating additional materials, such as regulations and standardized documents related to hazardous chemical safety management, to cover a broader range of scenarios and enhance model generalization. Furthermore, ongoing efforts will be dedicated to refining the model's architecture to better adapt to various text structures.

Conclusion

In conclusion, this article presents the CPBA-CLIM model, which tackles the challenges of entity–relation extraction from hazardous chemical incident reports, including scattered text, specialized terminology, and overlapping data relationships. The CPBA-CLIM model combines content-position attention with BERT encoding for dynamic contextual analysis of hazardous chemical incident text. Its CLIM strategy addresses entity–relation overlaps, improving triplet extraction accuracy. Our experimental results on public and self-constructed datasets confirm its strong performance in text feature extraction and in complex scenarios, including overlapping triplets. Finally, by applying the CPBA-CLIM model to knowledge graph construction, we demonstrated its application value in hazardous chemical incident management, where it provides technical support for intelligent services in hazardous chemical safety management. Future work will expand the dataset and refine the model architecture to improve its handling of complex texts and its adaptability to different textual scenarios.

Footnotes

Acknowledgements

The authors would like to acknowledge the support provided by Aerospace Hongka Intelligent Technology (Beijing) Co., Ltd.

Authors’ contribution

Wanru Du and Quan Zhu: study conception and design; Wanru Du, Quan Zhu, Xiaoyin Wang, Xiaochuan Jing, and Xuan Liu: data collection; Wanru Du, Quan Zhu, Xiaochuan Jing, Xiaoyin Wang, and Xuan Liu: analysis and interpretation of results; Wanru Du, Quan Zhu, and Xiaoyin Wang: draft manuscript preparation. All authors reviewed the results and approved the final version of the manuscript.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Wanru Du

Author biographies

Wanru Du is a PhD candidate in Systems Engineering. Her research areas are systems engineering, machine learning, and data mining.

Xiaoyin Wang is a professor of Computer Engineering. Her research areas are artificial intelligence and natural language processing.

Quan Zhu is a master's student in Computer Engineering. His research areas are data mining and sentiment analysis.

Xiaochuan Jing is a professor of Systems Engineering. His research areas are knowledge engineering and artificial intelligence.

Xuan Liu holds a PhD in Systems Engineering. Her research areas are behavior recognition and safety engineering.

Appendix


CPBA-CLIM: An entity-relation extraction model for ontology-based knowledge graph construction in hazardous chemical incident management
