Sage Journals: Discover world-class research

Abstract

Attempts to express information from various documents in graph form are rapidly increasing. The speed and volume in which these documents are being generated call for an automated process, based on machine learning techniques, for cost-effective and timely analysis. Past studies responded to such needs by building knowledge graphs or technology trees from the bibliographic information of documents, or by relying on text mining techniques in order to extract keywords and/or phrases. While these approaches provide an intuitive glance into the technological hotspots or the key features of the select field, there still is room for improvement, especially in terms of recognizing the same entities appearing in different forms so as to interconnect closely related technological concepts properly. In this paper, we propose to build a patent knowledge network using the United States Patent and Trademark Office (USPTO) patent filings for the semiconductor device sector by fine-tuning Huggingface’s named entity recognition (NER) model with our novel edge weight updating neural network. For the named entity normalization, we employ edge weight updating neural network with positive and negative candidates that are chosen by substring matching techniques. Experiment results show that our proposed approach performs very competitively against the conventional keyword extraction models frequently employed in patent analysis, especially for the named entity normalization (NEN) and document retrieval tasks. By grouping entities with named entity normalization model, the resulting knowledge graph achieves higher scores in retrieval tasks. We also show that our model is robust to the out-of-vocabulary problem by employing the fine-tuned BERT NER model.

Keywords

Knowledge graph named entity normalization information extraction keyword extraction

1. Introduction

With the rapid increase in the volume of documents, studies on techniques for analyzing these documents and organizing textual information are continuing. Among information representation techniques, knowledge graphs are receiving more attention recently [1, 2, 3]. This increasing interest can be attributed to their effectiveness in various domains, such as drug purposing [2], stock market analysis [3], and generative knowledge graph construction [1] for general domains. Knowledge graphs express information from various documents in intuitive forms. By representing the information in more intuitive form, knowledge graph can foster the transfer and the concoction of knowledge. The construction of a knowledge graph consists of the following major processes: a data acquisition, a natural language processing (NLP), and entity pair creation [4]. NLP includes text preprocessing, part of speech tagging, named entity recognition (NER), named entity normalization (NEN). Many previous works for document knowledge graph construction utilize those text mining techniques. However, many of these studies are focused on connecting keywords and key phrases to documents. We find it would be more intuitive if the named entity knowledge graph is constructed especially for the domain professionals who will benefit from using knowledge graphs. For example, the named entity “NASA” is more intuitive for the domain experts than the descriptive key phrase, “the agency of U.S. for the civil space program and space research.” To construct the knowledge graph based on named entities, connecting named entities with the same meaning but with different surface forms should be conducted. We implement NEN techniques to overcome these problems. To be more specific, with our proposed NEN model, our goal is to link “NASA” and “National Aeronautics and Space Administration.”

Although this paper has mainly discussed NLP techniques, especially NER and NEN, and the construction of named entity knowledge graphs, it is essential to acknowledge related literature in fields such as entity resolution (ER) and reference matching. NEN and Entity resolution share the commonality of distinguishing and connecting entities with different representations with similar meanings. Reference matching is another technique that is widely used in retrieving the relative bibliographic records which involves identifying and matching different documents and citations that points to the same source. These fields share some similarities with the problem we address in this paper, as they all involve linking entities with different surface forms but the same underlying meaning. Our NEN model which connects the entities with different surface form is influenced by the ideas from ER studies. Incorporating insights from such studies leads more robust and a more effect model NEN model that ultimately improving the quality of the end product, the semiconductor-related patent knowledge graph. Our end-product, semiconductor-related patent knowledge graph, is tested to perform document retrieval from given named entity which is influence by the reference matching techniques.

In this paper, we present a novel named entity knowledge graph construction framework that can be applied to various text data. Specifically, we focus on patent documents. Patent documents are open to the public, there is a relatively larger amount of documents, and they contain many named entities including neologisms.

Recent advances in technology have been actively witnessed through a wide range of venues, one of which includes the patent claims. Patent claims contains information on the new breakthroughs at the forefront of the industry and academia in the rawest form, which may potentially help solve various tasks such as discovering contemporary technological trends, forecasting future developments in specific domains, evaluating ideas for R&D investment decisions, identifying competitors in the technological horse-races, or developing strategic technological planning [5].

The rapid speed and the vast volume of patent filings have been worsening the challenge of distillation of useful information from the claims, which is calling for the automation, at least in part, of patent analysis. Until recently, research on patent analysis has generally involved extracting technology trees based on the bibliographic connections of the claim filings [6] or extracting keywords using text mining techniques [7, 8, 9]. While these keyword-based approaches have provided meaningful insights on the current technological developments, only few attempts have been made to extract a more complicated form of information from the patent filings, such as named entities. Named entities, which include technological concepts, specific techniques used, names of the devices or the end products, and the associated company names, are of a significant importance for richer and deeper understanding of the innovations and technology underlying the patent filings.

Moreover, past efforts have failed to provide information on the intricate connectivity among the concepts extracted from the patent-related documents. For example, a well-designed keyword detection models may successfully determine the term “Gate-All-Around” to be the arising keyword alongside the word “transistors” from the patent filings within the field of semiconductor devices, yet it will not be able to show through which patent documents and other keywords these two phrases are interconnected. Furthermore, the conventional NLP approach will parse the terms “Gate-All-Around” and “GAA” separately as two independent terms, leaving the task of recognizing them as the same entities to additional human efforts.

In this study, we address the issues of interconnecting key technological concepts and matching the same entities appearing in different forms by constructing a semiconductor-related patent knowledge graph from patent filings using the NER and NEN models with a novel edge weight updating neural network. More specifically, we constructed the NEN dataset based on the patent documents. We fine-tuned the NER model [10] using Huggingface’s Python repository, pre-trained with the CoNLL-2003 NER dataset [11]. Our BERT token concatenator for NER tasks provides more complete named entity phrases. We propose a state-of-the-art NEN model with an edge weight updating neural network with triplet loss to extract named entities and connect them through the semiconductor related patent documents and present them in the form of a knowledge graph. Our proposed NEN model achieves the highest performance for not only the conventional candidate retrieval task in NEN but also pairwise named entity matching task.

Extensive experiment results show that our proposed approach performs, against the conventional keyword extraction models frequently employed in patent analysis, very competitively, especially for the NEN and document retrieval tasks. We also show that our knowledge graph construction method is robust to the out-of-vocabulary problem. Finally, we further contribute to the existing literature by releasing our semiconductor-related patent knowledge graph online, available for all non-commercial purposes. The remainder of this paper is organized as follows. We survey the literature related to our work in Section 2. In Section 3, we introduce our proposed method in detail. Section 4 reports experiment setting details. We report experiment results in Section 5, and Section 6 concludes the paper.

2. Related work

The importance of knowledge graphs is emphasized in [12, 13]. Knowledge graphs have attracted academic and industrial attention as a form of structured knowledge representation. We conduct the literature survey regarding the key components in our proposed framework for creating the technical document knowledge graph. Section 2.1 describes the dataset used for the NEN model training in various fields. Section 2.2 reports the studies involving NEN. Section 2.3 outlines the knowledge graph construction of documents from various subjects.

2.1 Named entity normalization dataset

To fulfill the need for the domain-specific NEN dataset which is used to train NEN models, many NEN datasets are developed. For example, IUPAC [14] and SCAI [15] are two commonly used chemical name matching datasets. Weston et al. [16] developed the NEN dataset for material engineering. The NCBI [17] dataset gathered the disease names from the PubMed abstracts and linked disease names with the same intrinsic meanings. By linking the disease names in clinical notes, ShARe/CLEF [18] is commonly used bioinformatics NEN datasets Similar to disease names, drug labels are linked in TAC2017ADR [19]. Datasets such as BioNLP09 [20], and BioNLP-OST19 [21] are also distributed by academic conferences’ for various NLP challenges. BioNLP09 [20], and BioNLP-OST19 [21] are NEN dataset for proteins and bacteria, respectively. BC2GM [22], BC5CDR-disease [23], BC5CDR-chemical [23] are distributed by BioCreative for the NEN challenges in the biological domain. Above studies developed engineering and science-related NEN datasets. On the other hand, NEN datasets to satisfy the more general NEN tasks are also introduced. Sun et al. [24] constructed the NEN dataset which consists of product entity names. Francis et al. [25] extracted the International Bank Account Number (IBAN) of the beneficiary, invoice number, invoice date, and due date from the financial invoices and constructed the financial NEN dataset based on the extracted information.

2.2 Named entity normalization

Various NEN models adopt string matching techniques such as Siamese neural network [26, 27, 28]. The morphological similarity between strings was linked by the Siamese RNN model [29]. Siamese Graph Neural Network is used in [30] for company name normalization. Sun et al. [24] utilized pre-constructed product ontology for product name NEN. Rahmani et al. [31] proposed random walk applied on the augmented graph to link similar entities in genealogical graphs. More recently, attention model [32] and transformer-based model [33] were applied to medical entity normalization and linked the entity from the Wikipedia knowledge graph.

For a gene name NEN tasks, GenNorm [34] and GNAT [35] are widely used toolkits. ChemSpot [36] is trained on the SCAI [15] chemical NEN dataset and achieved an F1 measure of 79%. Cho et al. [37] listed existing NEN products and services such as ProMiner [38] and MetaMap [39]. ProMiner [38] and MetaMap [39] are the NEN tools for biomedical domain. DNorm [40] used pairwise entity ranking scoring for NEN. TaggerOne [41] utilized semi-Markov model for NEN. For documents in the material science domain, MatScholar [16] is a python repository for general text mining including NEN tasks.

In 2015, D‘Souza et al. [42] constructed the rule-based NEN model. The rule-based NEN model is one of the earlier NEN models. The machine learning-based NEN model was introduced in 2016 by Leaman et al. [41]. Leaman et al. [41] proposed TaggerOne [41] which is the model based on semi-Markov model. Deep learning algorithms such as convolutional neural network (CNN), long short-term memory (LSTM), and gated recurrent unit (GRU) based model was adapted to NEN research. Li et al. [43] introduced the word-level CNN ranking NEN model. Recurrent neural network-based models are also implemented [44]. Wright and Dustin [44] proposed BiGRU based NEN model. Phan et al. [45] also suggested the BiLSTM-based NEN model. Recently, the transformer-based NEN model achieved higher performances. BERT ranking model [46] and BioSyn [47] utilized BERT architecture and fine-tuned with distinct objective functions. BERT Ranking model [46] was trained with a ranking-based objective function. BioSyn [47] is the one of highest performing NEN models that the model used a pre-trained BioBERT [48] model and was trained by a synonym marginalization algorithm. Our proposed NEN model, Edge weight updating neural network with triplet loss, successfully capture the semantic similarity between entities by trained with hard positive and negative entities for each epoch. Furthermore, incorporating the substrings for generating the hard positive and negative entities enables to capture the morphological similarity between the matched entities.

2.3 Knowledge graph construction

Text mining techniques and their applications have received remarkable and rapidly growing attention as a means to acquire useful information from corpora of various backgrounds and characteristics. Technology management fields have responded by actively utilizing text mining approaches to process and analyze professionally written technological reports and other technology-related documents [49]. One of the most prevailing examples includes text-mining-based patent analysis: to date, numerous studies have attempted to analyze the patent documents to investigate contemporary technological trends, assess technological capabilities, and/or analyze the commercial value of select technologies [50]. Kim et al. [7], for instance, built a semantic network to analyze the “ubiquitous computing technology” by merging pre-determined keywords, recommended by experts in the field, from the patent claims. Patent claims were queried based on those pre-determined keywords, and the returned documents were characterized further by employing the k-means clustering algorithm. It is, however, very costly to pre-define manually the target technology-related keywords as the authors did in their study because it requires a great amount of background knowledge, time, and human labor during the process.

The number of studies on knowledge graph construction has grown rapidly. Relatively earlier knowledge representations through the ontological graph and semantic web approaches for manufacturing are listed in [51]. Rahmani et al. [52, 53] proposed human disease network and human drug network based on protein-protein interaction. DDREL [2] is more recent research on constructing the drug-drug relation graph. Li et al. [54] constructed the knowledge graph from electronic medical records (EMRs). EMR2vec [55] suggested the platform which incorporated patient data and clinical trials by a medical ontological graph. Bipartite graph [56] and hypergraph [57] are also used to represent knowledge in graph form. Technology topic network which was built based on the patent documents aided to establish the improved R&D plannings [58]. Liu et al. [59] constructed the industrial knowledge graph based on various industrial documents and applied the knowledge graph to few-shot text classification problems.

Recently, the availability of NLP tools has led to the introduction of a wide range of automatic keyword extraction models. TechNet [60] is the leading example of such efforts, which was derived by applying word embedding algorithms to a massive amount of patent filings to establish the semantic relations between the technological terms presented as vectors on the same linear space. While these studies suggest meaningful approaches for extracting insights from patent filings, they still suffer from several limitations, such as lacking the implementation of entity matching and normalization. For example, the terms “CNN” and “convolutional neural networks” convey virtually the same meaning; yet, the standard word embedding approaches would vectorize these terms separately as independent entities. In this study, we attempt to address such issues by normalizing the named entities whose definitions are supposed to be aligned as identical by exploiting the edge updating neural network of our novel design, with triplet loss, as first proposed by [61].

3. Proposed method

Our approach consists of three major components: (1) NER using the Huggingface’s pre-trained NER model; (2) NEN by relying on our novel edge updating neural network; and (3) construction of the semiconductor-related patent knowledge graph. Figure 1 shows the overall framework of the proposed method.

Figure 1.

Overall framework.

3.1 Named entity recognition using the Huggingface’s pre-trained NER model

To extract named entities from the patent claims, we rely on the BERT NER model [10] provided by Huggingface’s1

¹
https://huggingface.co.
Python package. The underlying model was pre-trained with the CoNLL-2003 NER dataset [11]. In order to adapt this pre-trained model to the specific domain of patent claims in the semiconductor related sector, we apply the concept of transfer learning. Transfer learning is a machine learning technique that pre-trained model trained with large-scale dataset is fine-tuned to a new, related task or domain with a smaller dataset [62]. By fine-tuning the BERT NER model with our novel edge weight updating neural network, we can effectively recognize and further normalize entities specific to the patent domain, thus enabling the construction of a more accurate and comprehensive patent knowledge network.

Four types of named entities are provided by the CoNLL-2003 dataset: persons (PER), organizations (ORG), locations (LOC), miscellaneous names (MISC), and those not recognizable by the given dataset (O). Because our main focus is to extract technological concepts and terms, we detect entities labeled as ORG and MISC tags only.

Figure 2 depicts an example of the NER case using the pre-trained BERT model on the sentence “The FinFET device structure includes a second fin structure embedded in the isolation structure,” which actually appears in one of the semiconductor-related patent documents considered in our experiment.

The raw output of the BERT NER model is in the form of WordPiece [63] tokens, which are very difficult to interpret to the human eye at first glance. In this study, we enhance the human understanding of the preliminary NER results by conducting further token concatenation. More specifically, we construct a rigorous, rule-based token concatenation model to detect the named entities. Table 1 lays out the token concatenation scenarios we propose by the different concatenation types.

Table 1
Example of named entity token concatenation

Named entity Tokens Token types

FinFET Fin, ##F, ##ET MISC, ORG, ORG

Amazon.com, Inc Amazon, ., com, ,, Inc ORG, ORG, ORG, O, ORG

Anti-Hebbian Anti, -, He, ##bb,##ian MISC, O, MISC, MISC

Micro USB Micro, USB O, MISC

(NFC) tag (, NFC, ), tag O, MISC, O, O

Figure 2.
BERT named entity recognition example.

The first case listed, where the tokens “Fin”, “##F”, and “##ET” are successfully recognized, is the simplest case that can be detected and easily concatenated utilizing Huggingface’s default concatenation package. The term “FinFET” refers to one of the field effect transistors, and it will make sense only if the tokens “Fin” and the acronym “FET” are concatenated together. The tag of the token following “Fin” begins with “##”, which indicates that it should be joined with the token appearing before itself. The following token, which ends with “##ET” suggests that the formerly merged term should be completed with the letters “F” and “ET”, hence leading to the final form of the detected named entity as “finFET”.

The next case is slightly more complicated yet easily solvable. Entities whose name includes punctuation marks such as periods (.) or commas (,), “Amazon.com, Inc”, for example, require an extra step to be properly concatenated because these punctuation marks are recognized with the other (o) tags. In this case, we join the (o)-tagged token with the surrounding tokens if they are labeled with ORG tags.

Meanwhile, a compound noun, whose meaning changes owing to the combination with prefixes, such as “Anti-Hebbian” in the given example, should be distinguished from its original root word, “Hebbian”. In this case, one needs to carefully concatenate the prefix “Anti-” with the following token “Hebbian” in order not to deteriorate the implication of the original wording.

The next row in the table presents the case where the major named entity token is decorated with a descriptive word or phrase, in this case “Micro USB”. We aim to keep as many information in the named entity as possible.

Our proposed token concatenator model binds such descriptive named entity tokens together effectively, hence providing richer understanding of the given corpus without misleading results.

The last token concatenation scenario presents the case of the recognition and concatenation of tokens appearing in parentheses. For example, the pre-trained BERT NER model dissects the given token “(NFC) tag” into pieces; our proposed token concatenator model, in contrast, preserves the parentheses intact with the acronym within as a unified entity. Thus, it provides the accurate interpretation that the detected entity is an acronym. Such instances are quite prevailing, especially in scientific documents, because acronyms appear rather frequently. To detect these acronyms in NEN tasks, we preserve acronyms in parentheses and their corresponding expansions. For example in NEN task, given the input text “Near Field Communication (NFC) tag”, the NEN model would detect that “NFC” is an acronym by recognizing the pattern within parentheses, and associates it with the preceding full name “Near Field Communication”. Preserving these additional context is crucial for a deeper understanding of the text and enables more accurate entity recognition.

Our token concatenator was manually built to address the above issues better than the existing BERT NER token concatenator. The semiconductor-related patent documents covered in this paper are domain-specific, and there is no NER dataset available for this domain, making it impossible to train an NER model. We manually examined various cases within the documents and developed a rule-based token concatenator to extract named entities for the final knowledge graph while preserving the meaning of the named entities as much as possible.
3.2 Named entity normalization

Named entity	Tokens	Token types
FinFET	Fin, ##F, ##ET	MISC, ORG, ORG
Amazon.com, Inc	Amazon, ., com, ,, Inc	ORG, ORG, ORG, O, ORG
Anti-Hebbian	Anti, -, He, ##bb,##ian	MISC, O, MISC, MISC
Micro USB	Micro, USB	O, MISC
(NFC) tag	(, NFC, ), tag	O, MISC, O, O

Figure 3.

Named entity graph construction based on named entities’ substrings.

With the named entities concatenated as described in Section 3.1, we initialize the construction of the named entity graph by connecting the substrings of the extracted named entities, and the process is summarized graphically in Fig. 3.

3.2.1 Edge weight normalization

Our proposed model, which fine-tunes the BERT model as its backbone, has already successfully detected the semantic similarity between two entities. However, to also detect morphological similarity, we construct a substring graph and extract the hard positive candidate with the lowest embedding similarity among the connected entities and the hard negative candidate with the highest embedding similarity among the unconnected entities at each epoch for training with triplet loss. In addition, we aimed to reflect morphological similarity by calculating a simple similarity between entities using the substring graph of entities as shown in Eq. (3.2.1). Detecting the morphological similarities between entities is possible by using graph measures such as Random Walk-based SimRank [64] and Graph Edit Distance [65] in the proposed substring graph to calculate the similarity between entities.

Our named entity graph construction proceeds as follows: first, we parse the named entities, resulting from the concatenation stage, using the whitespaces. For example, the entity “Fin Field Effect Transistor”, after the parsing, will result in the token pieces “fin”, “field”, “effect”, and “transistor”. Please note that we exclude the punctuation and the common stopwords during the parsing process. We repeat the parsing process on every named entity and construct a bipartite graph with the named entities as one group, and the associated substrings resulting after the parsing, as another, as illustrated in the left panel of Fig. 3. Then, we one-mode project the bipartite graph on the named entity level. The resulting network consists of named entities as its nodes and the number of shared substrings between each pair of named entities as the edge weights, as depicted in the right box of Fig. 3.

At this time, the entities “FinFET” and “Fin Field Effect Transistor” are not directly connected by an edge, but only indirectly via the common neighbor node “Fin Field Effect Transistor (FinFET)”. Given that these indirectly connected components of entities imply similar ideas, our goal is to determine and connect nodes of the exact same concept/definition. To do so, we compute the relative strength between the two named entities by normalizing the edge attributes based on Eq. (3.2.1) as follows:

$\displaystyle A=(W_{s},W_{t},W_{m})$ $\displaystyle W_{s}=\frac{w_{s,t}}{\max(\{w_{s,i}\mid i\in C_{s}\})}$

(1) $\displaystyle W_{t}=\frac{w_{t,s}}{\max(\{w_{t,j}\mid j\in C_{t}\})}$ $\displaystyle W_{m}=W_{s}\cdot W_{t}$

where $A=(W_{s},W_{t},W_{m})$ denotes the edge attributes and $W_{s},W_{t}$ the normalized strength between the two named entities given the source entity and the target entity, respectively. To dilute the over-fitting calculations for the relatively shorter named entity and to overcome the under-fitting calculations for the relatively longer named entity, $W_{m}$ is set to the product of $W_{s},W_{t}$ . The number of shared substrings between the named entity $i$ and $j$ is denoted by $w_{i,j}$ , and the connected components of the named entity are $i$ , $C_{i}$ .

In Fig. 3, let “Fin Field Effect Transistor (FinFET)” be the source entity and “Fin Field Effect Transistor” be the target entity. This case, $W_{s}$ is $4/5$ and $W_{t}$ is $4/4=1$ . If “Fin Field Effect Transistor (FinFET)” is the source entity and “FinFET” is the target entity, then, $W_{s}$ is $1/5$ and $W_{t}$ is $1/1=1$ .
3.2.2 Edge weight updating neural network using triplet loss

We first learn the named entities using the pre-trained BERT embedding model, and then we fine-tune the parameters with our novel edge weight updating neural network [61], which employs a triplet loss [66] instead of using KL divergence. This decision was made because of the instability caused by the cases of scarce number of connected entities (relatively fewer entities in a entity group) compared to the relatively large number of negative entities available for the query entity. By using the triplet loss with hard positive and hard negative samples, which are retrieved for every epoch, we aim to provide a more stable learning process and improve the performance of the edge weight updating neural network.

The triplet loss function we employ is mathematically defined by Eq. (3.2.2), where $a$ , $p$ , and $n$ are the anchor and the positive and negative vectors, respectively:

$\displaystyle\mathcal{L}(a,p,n)=\max\{0,d(a,p)-d(a,n)+\textit{margin}\}$ $\displaystyle\text{where},$ (2) $\displaystyle d(x,y)=\|x-y\|_{2}$

For training the model using the triplet loss function, several issues need to be considered. First, because the loss function takes triplets as its input, the number of triplet combinations explodes as the size of the data increases. At the same time, the model performance is found to be sensitive to the quality of the triplets used during the training. In other words, a selection of adequate triplets for the training is necessary.

Previous studies have suggested promising solutions to this challenge. Hermans et al. [67] proposed the batch-hard triplet loss, which chooses the most definitively positive and negative samples when constructing the triplets for the online training. Yu et al. [68], averaged the negative and positive samples instead of constructing sample-to-sample triplets.

In this study, we adopt the batch-hard triplet loss approach as implemented in [67]. Furthermore, we make use of a scarcely labeled dictionary of named entities with its variant identities as supplementary data source because the use of external information during the training process exploiting the triplet loss function leads to a significant increase in the quality of the positive and the negative samples [61].

We begin our training by using the graph resulting from Section 3.2. The similarity between the two BERT vectors is then determined by computing the inner products. After the first epoch of training, the positive and negative samples are determined as follows: among the entities connected on the network, the entity pair with the greatest similarities, yet labeled as unmatched (that is, labeled as “0”) in the dictionary, is considered as the negative input of the triplet loss. In contrast, the entity pair labeled as matched (labeled as “1”), yet the inner product of its BERT vectors, is the lowest and is considered as the positive input of the triplet loss. The positive and negative samples are then consumed as inputs in the next training epoch, given the BERT vector similarity of the previous epoch among connected entities in the substring graph.

This approach resembles the gradient descent method, as it concentrated on the errors (false positive and true negative) of the each model iteration. By re-calculating both positive and negative samples for each epoch, harder positive and negative samples are generated. Utilizing the triplet loss trains the model to train and to perform better by emphasizing these challenging cases and improving the overall performance of the model.

Mathematically speaking, let the set of named entities be denoted by $N=[\textit{entity}1,\textit{entity}2,\ldots]$ ; and the connected entities of the anchor entity $a$ with the positive and negative labels, $C_{a_{\textit{pos}}}$ and $C_{a_{\textit{neg}}}$ , respectively. Then, the total training loss can be expressed as Eq. (3.2.2).

$\displaystyle\mathcal{L}=\sum_{a\in N}\max[0,\min\{d(f(a),f(x_{p}))\ \mid x_{p% }\in C_{a_{\textit{pos}}}\}$ $\displaystyle\quad-\max\{d(f(a),f(x_{n}))\mid x_{n}\in C_{a_{\textit{neg}}}\}+% \textit{margin}]$ $\displaystyle\text{where},$ (3) $\displaystyle d(x,y)=\|x-y\|_{2}$ $\displaystyle f(x)=\textit{BERT}(x)[\textit{CLS}]$

We further illustrate our approach using an example. An entity group refers to a set of semantically related entities that share common attributes, which can be represented as connected components within the semiconductor-related patent knowledge graph. Each connected component consists of nodes representing entities that are closely related, and edges representing the relationships between these entities. For instance, consider an anchor entity, “Fin Field Effect Transistor (FinFET)”, and another one, “FinFET”, from the same entity group, meaning they are part of the same connected component in the graph. These entities share only one substring, whereas “Metal Oxide Semiconductor Field Effect Transistor (MOSFET)”, which should be placed in a different entity group (i.e., a separate connected component), shares the three substrings “Field”, “Effect”, and “Transistor”. In this case, “FinFET” will serve as the positive input for the entity “Fin Field Effect Transistor (FinFET)”, while “Metal Oxide Semiconductor Field Effect Transistor (MOSFET)” will be the negative sample.

Figure 4.

Edge weight updating neural network with triplet loss.

Figure 4 shows the example of the positive and negative inputs to train the edge weight updating neural network with triplet loss.

3.3 Construction of the semiconductor-related patent knowledge graph

The semiconductor-related patent knowledge graph is completed using the following process. We employ a logistic regression model to assess the linkage of the two given named entity pairs, initially represented by fine-tuned BERT embeddings in the substring graph Section 3.2. The logistic regression model evaluates if these pairs are still connected after the training process described in Section 3.2.2. All of the linked entities in the substring named entity graph are tested and updated. Then, the connected components of the final graph are considered the unique named entity groups. These groups are expressed as separate nodes of a different mode, which corresponds to the named entity groups. Finally, the semiconductor patent knowledge is completed by linking the patent document, in which the named entity appeared.

4. Experiment settings

4.1 Data

In our experiment, we exploit USPTO data2

²
https://bulkdata.uspto.gov.
from January, 2020 until the end of October, 2020. As our goal is to construct the patent knowledge graph from semiconductor-related patent documents, we filter the patent claims by querying the word “semiconductor” in the description of the Cooperative Patent Classification (CPC), as shown in the example exhibited in Fig. 5. The resulting dataset covers the following 12 CPC subclasses: H01C, H01F, H01G, H01L, H01M, H01P, H01R, H01S, H03H, H04R, H05B, and H05K. From the total of 35,734 documents, we recognized 69,812 named entities.
4.2 Named entity normalization training

For the NEN evaluation, we manually built a scarcely labeled dictionary matching named entities of different forms, yet with the same meanings. Out of 69,812 named entities, we hand-labeled 6,797 named entities to be matched with 1,000 unique named entity groups.

Building a domain-specific dictionary manually makes the paper specific to the “semiconductor-related patent” domain. We also have made the dictionary publicly available, allowing researchers to access and utilize it in related works. Furthermore, the dictionary is included as an appendix to the paper in Appendix A.1, making it accessible to who wish to understand the preview of the named entities and NEN dictionary used in this study.

We hand-matched entities of the following six types: (1) synonyms, (2) abbreviations, (3) acronyms, (4) different combinations of punctuation and alphabets, (5) descriptive phrases, and (6) possible parsing errors. We report the examples of the matching named entities by type in Table 2.

Table 2
Matching categories and example for the named entity normalization dataset

	Named entity	Matching named entity
Synonyms	WiFi	IEEE 802.11
	Silicon Carbide.	SiC
Abbreviations	German Patent Appl.	German Patent Application
	Microsoft Corp.	Microsoft Corporation.
Acronyms	IoT	Internet of Things
	TFT	Thin Film Transistor
Combinations of punctuations	USB Type C	USB Type-C.
	Internet	“Internet”.
Descriptive phrase	Indium Tin Oxide	Indium Tin Oxide (ITO),
	LTE-A	LTE-Advanced (LTE-A),
Parsing errors	LEDs	Emitting Diodes (LEDs) can
	DRAM	Access Memory (DRAM) is

Figure 5.

Description of CPC subclass H01L.

Table 3 presents an example of the resulting, manually built dictionary for semiconductor-related patent NEN.

Table 3

Excerpt from the semiconductor patent named entity normalization dataset

Named entity	Group
Fin Field Effect Transistors (FinFETs).	Group_0
(FinFets).	Group_0
(e.g., FinFETs) and	Group_0
FinFet	Group_0
WiFi.	Group_29
WiFi 802.11	Group_29
IEEE802.11 (WiFi),	Group_29
802.11 (WiFi),	Group_29
CD-ROMs (Compact Disc-Read Only Memories),	Group_70
(CD-ROM), Compact Disk	Group_70
(CD) ROM	Group_70
(CD-ROMs), CD	Group_70

We report the positive and negative entity pairs based on the matching status on the substring graph, as described in Section 3.2, in cross-check with our manually built dictionary, as mentioned in Section 4.2. Given the entity pairs connected on the substring graph, if the two entities are labeled to be in the same named entity group in our manually built dictionary, the two entities are labeled positive. In contrast, if the two entities connected in the substring graph are placed in different groups, the two entities are labeled negative.

The detailed statistics are listed in Table 4.

Table 4

Statistics of the pairwise named entity matching evaluation dataset

	Number of positive pairs	Number of negative pairs	Total pairs
Train	25,695	29,241	54,936
Test	14,133	14,050	28,183
Total	39,828	39,745	83,119

Finally, we provide the basic summary statistics for the training and test sets in Table 5.

Table 5

Statistics of the semiconductor patent named entity normalization dataset

	Number of named Entity	Number of groups
Train	3,802	552
Test	2,995	448
Total	6,797	1,000

4.3 Comparison models

We tested our proposed approach against conventional and standard text mining models: Word2vec [69], Glove [70], Fasttext [71], and BERT [10]. SciBERT [72] is the variant of the original BERT model, pre-trained with scientific text, which might be more suitable for patent-related analysis. Hence, we also included SciBERT in our experiment. BioSyn [47] is one of the state-of-the-art NEN models. The Bio-medical documents were used for training in the original BioSyn paper. We trained the BioSyn model with our patent NEN dataset and compared the performance with other models including our proposed model. The weighted averaged vectors of each word embedding model were used for the embedding of the named entities. Table 6 summarizes the basic characteristics of the baseline models in terms of the NEN and document retrieval tasks.

Table 6
Models used for the evaluation

Models	Descriptions
Word2vec [69]	Word2vec is one of the most widely used NLP models. In our research, pre-trained Word2vec vectors were used. More specifically, word2vec-google-news-300 ${}^{1}$ was used, which consists of 3 million words using Google News. The dimension of each embedding vectors is 300.
Glove [70]	Word2vec is constructed to predict the neighboring words given a window size. However, in Glove the dot product of the embedding of the target word and that of neighboring word matches the co-occurrence of words in the corpus. We used the pre-trained Glove vectors, glove-wiki-gigaword-300 ${}^{2}$ . The pre-trained model consists of 400,000 word vectors trained with Wikipedia 2014 data and Gigaword 5 ${}^{2}$ . The dimension of each word embedding vectors is 300.
Fasttext [71]	Fasttext utilizes a training mechanism similar to that of Word2vec. Unlike Word2vec and Glove, Fasttext splits the words into subwords tokens. Fasttext is known to be the more robust NLP model when handling out-of-vocabulary problems. In technical documents, where new terminologies frequently appear, Fasttext can be the more suitable model. The pre-trained Fasttext model, fasttext-wiki-news-subwords-300 ${}^{3}$ , was used in our research, which contains a million word vectors. The dimension of each word embedding vectors is 300. The pre-trained model was trained with the Wikipedia 2017, UMBC webbase corpus, and statmt.org news datasets ${}^{3}$ .
BERT [10]	BERT is one of the state-of-the-art models for various NLP tasks. However, for our specific tasks, the BERT model has a limitation in capturing the morphological similarity between entity pairs. We used pre-trained BERT vectors ${}^{4}$ with size of 768 and train a simple MLP classifier with batch size of 4096 to determine the linkage between entity pairs.
SciBERT [72]	SciBERT uses the BERT model architecture. The model has been fine-tuned with various scientific documents. For specific NLP tasks, such fine-tuned models shows higher performance compared to the vanilla BERT model. As the structure of SciBERT is the same as that of the BERT model, the embedding dimension is same as that of the BERT model.
BioSyn [47]	BioSyn is one of the state-of-the-art NEN models. In the original report, the NEN was concentrated on bio-medical documents and used BioBERT [48] for the pre-trained embedding model. BioSyn implements marginal maximum likelihood (MML) for the objective function. We trained the BioSyn model with our patent NEN dataset. As the aim of the BioSyn model is NEN, evaluations of BioSyn in information retrieval tasks in Sections 5.1.3 and 5.1.4 are excluded.
Edge weight updating NN [61]	The model is trained to minimize the distributions of edge weight on the Ground Truth Entity Graph and the Similarity-Based Entity Graph. The model shows state-of-the-art performance on NEN tasks in both finance and biomedical domain.

${}^{1}$ https://code.google.com/archive/p/word2vec/. ${}^{2}$ https://nlp.stanford.edu/projects/glove/. ${}^{3}$ https://fasttext.cc/docs/en/english-vectors.html. ${}^{4}$ https://github.com/google-research/bert.

4.4 Parameter settings

The experiments were executed using an Intel Core-i9-10940X CPU with 128 GB of memory and three NVIDIA GeForce Titan RTX GPUs. For training the edge weight updating neural network using triplet loss as described in Section 3.2.2, the batch size was set at 64, and the learning rate was $10^{-5}$ using the Adam optimizer [73] with weighted decay [74]. The model was trained for 50 epochs. We report the best performing model out of all the results obtained after each epoch.

In order to provide a more in depth analysis of the parameter tuning process, we have included additional evaluation metrics such as accuracy, precision, recall, and F-score per each epoch. This allows us to track the model’s performance over training progresses.

Figure 6 shows the performance of the model for each epoch, where the x-axis represents the epoch, and the y-axis represents the corresponding evaluation metrics. It can be observed that the model’s performance improves relatively steeper for the first three epochs. Between five to ten iterations of training, the model’s performance reaches the plateau which is typical in deep learning models. By analyzing the trends in these metrics, we can observe the optimal set of parameters and epochs for the model and further fine-tune the model’s performance.

Figure 6.

Model accuracy, precision, recall, F-score, V-measure per epoch.

5. Results

5.1 Quantitative evaluations

5.1.1 Named entity normalization: candidate retrieval

Many NEN models from previous studies are evaluated by the candidate retrieval tasks [43, 45, 46, 47]. We evaluated the performance of the candidate retrieval for NEN with various models. An evaluation was conducted to validate the efficacy under the same conditions as those for the previous NEN models including BioSyn [47], the current state-of-the-art NEN model. The evaluation was reported based on whether the group id of query entity and the group id of the most similar entity from the dictionary dataset were the same. The performance of the models is presented in Table 7.

Table 7
Named entity normalization by candidate retrieval performances

	Accuracy
Word2vec ${}^{1}$	89.71%
Glove ${}^{2}$	73.88%
Fasttext ${}^{3}$	85.81%
BERT ${}^{4}$	64.17%
SciBERT ${}^{4}$	57.33%
BioSyn	92.52%
Edge weight updating NN	90.25%
Edge weight updating NN with triple loss	97.46%

${}^{1}$ Out of vocabulary: 1,074. ${}^{2}$ Out of vocabulary: 1,802. ${}^{3}$ Out of vocabulary: 937. ${}^{4}$ Query vectors are smoothed by the entity group. The smoothing is conducted by averaging all entity vectors in the group.

BERT [10] and SciBERT [72] models not specifically trained for NEN tasks. We utilized the similarity ranking model described in the study of Ji et al. [46], but the retrieval of a single entity was unsuccessful for many entities. Smoothing the dictionary vectors by averaging the entity vectors in the named entity group gave relatively higher accuracy. Among the models we tested, our proposed model achieved the highest performance in candidate retrieval tasks for NEN.

5.1.2 Named entity normalization: pairwise matching

the model performances were tested by precision, recall, f-score, and accuracy, computed as defined in Eq. (5.1.2), which are standard metrics for evaluating the pairwise named entity matching tasks.

$\displaystyle\textit{precision}=\frac{tp}{tp+fp}$ $\displaystyle\textit{recall}=\frac{tp}{tp+fn}$ (4) $\displaystyle F1=2\cdot\frac{\textit{precision}\cdot\textit{recall}}{\textit{% precision}+\textit{recall}}$

We detected connected components from the semiconductor-related patent knowledge graph as the unique named entity groups. Thus, we evaluated to which extent these connected components matched well as compared to the ground truth groups by computing the V-measure [75]. The V-measure calculates the harmonic mean of the other two widely used clustering evaluation metrics, homogeneity and completeness, to assess the range of the overlap between the given clusters and the ground truth grouping. The mathematical definition is expressed by Eq. (5).

$\displaystyle V=\frac{(1+\beta)\cdot\textit{homogeneity}\cdot\textit{% completeness}}{\beta\cdot\textit{homogeneity}+\textit{completeness}}$ (5)

In our evaluation, we assumed that the weight was equal across homogeneity and completeness by setting $\beta=1$ .

Table 8 reports the performance of our proposed approach against the baseline models. The results show that our model beats the conventional embedding methods in almost every case. In particular, only our model achieved over 90% in precision and recall in the pairwise entity matching tasks. By scoring over 0.97 in V-measure, the named entity groups constructed by our proposed model highly resembled the ground truth named entity groups. SciBERT with the substring graph showed the best performance in terms of recall, yet compared to our model, the measure is very close, and it differs by 0.1%. Such outstanding performances against the baseline models, we believe, is largely owed to the out of vocabulary problems. To support our claims, we additionally report the number of out of vocabularies at the end of Table 8. Word2vec, Glove, and Fasttext are known to perform relatively less robust to the words unseen during the training process, hence deteriorating performance when they met newly rising concepts. Given the recent fast-paced technological developments, however, handling out-of-vocabulary concepts is critical in scientific documents. The experiment results show that our proposed model performs well in such cases and works robustly when faced with newly introduced words never seen before.

Table 8

V-measure, precision, recall, F-score, and accuracy of models

	V-measure	Precision	Recall	F-score	Accuracy
Word2vec ${}^{1}$	0.5810	65.53%	65.01%	65.27%	66.11%
Word2vec ${}^{1}$ with substring graph	0.7579	81.79%	86.95%	84.29%	84.12%
Glove ${}^{2}$	0.5101	61.45%	64.15%	62.77%	65.62%
Glove ${}^{2}$ with substring graph	0.7528	81.43%	86.71%	83.99%	85.06%
Fasttext ${}^{3}$	0.6013	69.85%	62.55%	66.00%	68.76%
Fasttext ${}^{3}$ with substring graph	0.8298	82.91%	90.58%	86.58%	86.38%
BERT	0.4922	68.66%	76.11%	72.20%	70.60%
BERT with substring graph	0.5943	82.04%	92.32%	86.88%	86.01%
SciBERT	0.5091	72.59%	72.14%	72.37%	72.37%
SciBERT with substring graph	0.7644	86.51%	88.79%	87.63%	87.44%
BioSyn	0.5824	73.24%	74.75%	73.99%	73.64%
BioSyn with substring graph	0.6688	80.96%	89.48%	85.01%	84.17%
Edge weight updating NN with triple loss	0.9787	94.45%	91.92%	93.17%	93.24%
Edge weight updating NN with triple loss and substring graph	0.9787	94.45%	92.20%	93.31%	93.37%

${}^{1}$ Out of vocabulary: 1,074. ${}^{2}$ Out of vocabulary: 1,802. ${}^{3}$ Out of vocabulary: 937.

5.1.3 Document retrieval from the named entities

In this section, we report the performance of our proposed model in relation to the document retrieval task. To be as fair as possible, we restrained from querying named entities as we conducted the test. For the competing embedding models such as Word2vec, Glove, Fasttext, BERT, and SciBERT, the representations of each entity and each document was computed as the weighted average of all the tokens associated with the respective named entity or the documents. BioSyn is a model that specifically focuses on NEN tasks, so BioSyn is not used for the retrieval tasks. As for our proposed model, because our end-product has the form of a network, we take advantage of the structural characteristics of the knowledge graph. When given the query, we return the document with the highest edge weight connected to the given named entity’s group. We test the relevance of the document recommendations in response to the given query based on whether the named entity and the retrieved documents are from same CPC group (total: 50) and CPC subgroups (total: 449). The performances of each model are reported in Table 9.

Table 9
Accuracy of document retrieval from the named entities

	Accuracy for CPC group (50)	Accuracy for CPC subgroups (449)
Word2vec	70.70%	51.58%
Glove	68.14%	50.48%
Fasttext	48.31%	33.79%
BERT	67.26%	53.83%
SciBERT	62.09%	46.78%
Our Model	85.78%	77.46%

Across CPC groups and subgroups, our proposed model reports the highest accuracy. Our model achieved over 77% accuracy on retrieving the relevant documents with respect to the CPC subgroups. This, in particular, is an impressive result given that there were 449 subgroups. Due to the granularity of the sub-groupings, all of the other baseline models suffer gravely in terms of accuracy.

5.1.4 Named entity retrieval from the documents

Retrieving the relevant named entities from patent documents is another important task. The most related named entity was retrieved using the following procedures. For Word2vec, Glove, Fasttext, BERT, and SciBERT, the embedding vectors of the named entities that appeared in each document were averaged. Based on the embedding obtained for each document, the named entity with the highest similarity was recommended. As for our proposed model, we returned the named entity within the group that had the highest connected edge weight to the given document’s named entities. For the named entity recommendations, the named entities that appeared directly in the document were excluded from the candidates. We evaluated the performance of the named entity recommendations in response to the given query based on whether the documents and the retrieved named entities were from same CPC group (total: 50) and CPC subgroups (total: 449). The performances of each model are reported in Table 10.

Table 10
Accuracy of the named entity retrieval from the patent documents

	Accuracy for CPC group (50)	Accuracy for CPC subgroups (449)
Word2vec	84.96%	72.78%
Glove	83.65%	68.53%
Fasttext	81.44%	67.73%
BERT	85.86%	64.28%
SciBERT	77.04%	61.05%
Our Model	91.65%	83.68%

Our proposed model reports the highest accuracy for the named entity retrieval tasks. Our model achieved over 83% accuracy on retrieving the relevent named entity with respect to the CPC subgroups. The second best performing model was word2vec, which showed an accuracy of 73%, and the difference in performance compared to that of our model was approximately 10%. The proposed model has a significant improvement in performance compared to that of other models, and it can provide insights by accurately retrieving the related named entities from the document.

Table 11

Examples of false positives of pairwise named entity normalization task

Entity 1	Entity 2	Confidence ${}^{1}$
Silicon germanium (SiGe),	Silicon Nitride (SiN).	0.5400
(Organic Light Emitting Diode) or	Light-Emitting-Diode (LED) is	0.6526
(PVD) processes	(CVD) processes	0.6303
Digital Subscriber Line (DSL),	Digital Signal Processor (DSP),	0.6072
DC-AC inverter	(AC) power	0.5275

${}^{1}$ Confidence of our model to predict the entity pair as the label 1(the matching entity pair).

Table 12

Examples of false negative of pairwise named entity normalization task

Entity 1	Entity 2	Confidence ${}^{1}$
Cobalt (Co) based	Cobalt (Co) material	0.5504
Teflon (PTFE),	Teflon ${}^{\@setsize{\tiny}{7pt}{\vipt}{\@vipt}\textregistered}$	0.6565
FPC (Flexible Printed Circuit),	(Flexible Printed Circuit).	0.5798
Linux OS	Linux ${}^{\text{TM}}$	0.5720
APPLE iPad.	(e.g., iPad ${}^{\@setsize{\tiny}{7pt}{\vipt}{\@vipt}\textregistered}$ ),	0.6894

Confidence of our model to predict the entity pair as the label 0 (the non-matching entity pair).

Table 13

Statistics of the semiconductor patent named entity knowledge graph

Types	Number of nodes	Number of edges
Patent document nodes	34,356	–
Named entity nodes	25,938	–
Named entity group nodes	8,525	–
Total nodes	68,819	–
Total edges	–	297,542

Figure 7.

Semiconductor patent named entity knowledge graph.

Figure 8.

Subgraph of USB type C related groups.

Figure 9.

Subgraph of DNN related groups.

Figure 10.

Subgraph of Samsung Galaxy related groups.

5.2 Qualitative evaluations

5.2.1 Error analysis on the pairwise named entity normalization

As the quality of the semiconductor-related patent knowledge graph relies heavily on the performance of the NEN process, we report the result of the error analysis we conducted on the pairwise NEN in this section. More particularly, we report the false positive examples in Table 11 and the false negative examples in Table 12 on the pairwise NEN tasks with the model’s confidence.

In general, both false positive and false negative results have relatively low confidence. This implies that, when constructing the semiconductor patent named entity graph, connecting the undesired entity pairs can be prevented by connecting the entities with higher confidence. However, it is important to examine some cases where the model might have been misled. For example, the proposed model linked “DC-AC inverter” and “(AC) power” with a confidence of 0.52, likely due to the sharing term “AC”. Moreover, “inverter” and “power” are closely related concepts within the same domain of electricity, which might have contributed to the model’s confusion in distinguishing between these two entities.

The model not considering the “Linux OS” and “Linux ${}^{\text{TM}}$ ” as similar could be caused by the presence of different textual features in the two entities. The addition of the trademark symbol might have introduced a level of dissimilarity that the model found significant enough to classify the entities as different. In the train dataset, there exists some connecting entities such as “Android OS” and “Android ${}^{\text{TM}}$ ”, “Windows OS” and “Windows ${}^{\text{TM}}$ ”. Introducing more examples for such cases might help model to successfully link “Linux OS” and “Linux ${}^{\text{TM}}$ ” together.

5.3 Knowledge graph visualization and exemplary investigation

By training the NEN model as discussed in Section 3, with our hand-labeled dataset as described in Section 4.2, we have successfully recognized 69,812 named entities and connected the entity pairs with a confidence over 0.999 to maximize precision. After pruning the false positive links, we ended up with a knowledge graph with 25,938 named entities assigned to the total of 8,525 unique named entity groups. The overall statistics of the semiconductor patent named entity knowledge graph are listed in Table 13.

We present the graphical visualization of the entire knowledge graph in Fig. 7. The resulting graph may also be accessed freely online via an interactive environment, available for all non-commercial purposes3

³
https://sjeon7.github.io/Semiconductor_Patent_Named_Entity_Graph/network/.
.

The purple nodes represent the patent documents; the green nodes, the named entity groups, and; the orange nodes, the associated named entities.

As it is difficult to distinguish visually the graph with almost 70,000 nodes, we selected three named entity groups, “USB type C”, “deep neural network”, and “Samsung Galaxy ” and report the resulting subgraphs in Figs 8–10, respectively, for demonstrative purposes. As can be easily observed in these subgraphs, the named entity nodes of similar technological concepts are successfully grouped. For example, the terms “Universal Serial Bus” is well connected to the entities “USB” and “USB Type C” in the subgraph in Fig. 8, and the patent connected to those named entity nodes well encompasses these terms.

A similar pattern is observed for the subgraph reported in Fig. 9, which shows the connection between the original phrase, “deep neural network”, with its abbreviation, “DNN”, correctly established.

The example reported in Fig. 10 shows that our model, in addition to the technological jargon, also successfully extracted and connected brand and product names.
6. Conclusion

The knowledge graph has been recently attracting attention in the field of patent analysis as a useful tool to summarize and represent information from patent filings. Past research has mainly relied on extracting keywords to summarize and represent the information enclosed in patent filings. While keyword extraction models do deliver meaningful insights, named entities such as technological concepts, specific techniques used, name of the devices or end product, and the associated company names may additionally provide richer and deeper understanding of the innovations and technology underlying the patent filings.

In this study, we construct a semiconductor-related concept and entity knowledge graph by applying a novel edge updating neural network algorithm on patent claims. More specifically, our proposed model builds a knowledge network of semiconductor-related named entities from the patent filings. During this process, named entities with different surface forms, but of identical meanings, are placed into unique groups, hence providing a clearer picture and better understanding of the patent filings in hand. Our proposed model shows the highest performance on both the NEN and document retrieval tasks against that of standard baseline models. Further, experiment results show that the proposed knowledge graph construction method is robust to the out-of-vocabulary problem.

While the proposed model has showed great performances, there still is a room for further development. Currently, our research focuses only on the topics involving semiconductor devices. A focus switch to other fields may lead to a clearer understanding of a different area of innovations, while an extension to encompass a greater range of topics will help assemble a more complete picture of the recent technological advances in general. In addition, this study uses the edge updating neural network approach to discover the inter-connectivity among named entities and connect these named entity groups to the patent documents to construct the knowledge graph. However, in the constructed knowledge graph, a relationship between patent documents and entity groups is defined as existence of named entity in certain patent document. Research on the detection of more complex relationships between an entity and a patent document or another entity is our next research topic.

Footnotes

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2021R1A2C2093785 and No. 2018R1D1A1A02045842). This work was supported by the Artificial Intelligence Graduate School Program (Seoul National University) (NO.2021-0-01343).

Appendix

Supplementary data

References

Zhang

Chen

, Generative Knowledge Graph Construction: A Review, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 1–17. https://aclanthology.org/2022.emnlp-main.1.

Allahgholi

Rahmani

Javdani

Sadeghi-Adl

Bender

Módos

Weiss

, DDREL: From drug-drug relationships to drug repurposing, Intelligent Data Analysis 26(1) (2022), 221–237.

Choi

Gwak

Song

J.W.

Chang

, Stock market network based on bi-dimensional histogram and autoencoder, Intelligent Data Analysis 26(3) (2022), 723–750.

Hao

Yin

Liu

Sun

Liu

Yang

, Construction and application of a knowledge graph, Remote Sensing 13(13) (2021), 2511.

Abbas

Zhang

Khan

S.U.

, A literature review on the state-of-the-art in patent analysis, World Patent Information 37 (2014), 3–13.

Yoon

Park

, A text-mining-based patent network: Analytical tool for high-technology trend, The Journal of High Technology Management Research 15(1) (2004), 37–50.

Kim

Y.G.

Suh

J.H.

Park

S.C.

, Visualization of patent analysis for emerging technology, Expert Systems with Applications 34(3) (2008), 1804–1812.

Cagan

Kotovsky

Wood

, Discovering structure in design databases through functional and surface based mapping, Journal of mechanical Design 135(3) (2013), 031006.

Chan

Cagan

Kotovsky

Schunn

Wood

, The meaning of “near” and “far”: The impact of structuring design databases and the effect of distance of analogy on design output, Journal of Mechanical Design 135(2) (2013).

10.

Devlin

Chang

M.-W.

Lee

Toutanova

, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805, 2018.

11.

Sang

E.F.

De Meulder

, Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition, arXiv preprint cs/0306050, 2003.

12.

Pan

Cambria

Marttinen

Philip

S.Y.

, A survey on knowledge graphs: Representation, acquisition, and applications, IEEE Transactions on Neural Networks and Learning Systems 33(2) (2021), 494–514.

13.

Galkin

Auer

Vidal

M.-E.

Scerri

, Enterprise Knowledge Graphs: A Semantic Approach for Knowledge Management in the Next Generation of Enterprise Information Systems, in: ICEIS (2), 2017, pp. 88–98.

14.

Klinger

Kolářik

Fluck

Hofmann-Apitius

Friedrich

C.M.

, Detection of IUPAC and IUPAC-like chemical names, Bioinformatics 24(13) (2008), i268–i276.

15.

Kolárik

Klinger

Friedrich

C.M.

Hofmann-Apitius

Fluck

, Chemical names: terminological resources and corpora annotation, in: Workshop on Building and Evaluating Resources for Biomedical Text Mining (6th edition of the Language Resources and Evaluation Conference), 2008.

16.

Weston

Tshitoyan

Dagdelen

Kononova

Trewartha

Persson

K.A.

Ceder

Jain

, Named entity recognition and normalization applied to large-scale information extraction from the materials science literature, Journal of Chemical Information and Modeling 59(9) (2019), 3692–3702.

17.

Doğan

R.I.

Leaman

, NCBI disease corpus: A resource for disease name recognition and concept normalization, Journal of Biomedical Informatics 47 (2014), 1–10.

18.

Suominen

Salanterä

Velupillai

Chapman

W.W.

Savova

Elhadad

Pradhan

South

B.R.

Mowery

D.L.

Jones

G.J.

et al., Overview of the ShARe/CLEF eHealth evaluation lab 2013, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2013, pp. 212–231.

19.

Demner-Fushman

Shooshan

S.E.

Rodriguez

Aronson

A.R.

Lang

Rogers

Roberts

Tonning

, A dataset of 200 structured product labels annotated for adverse drug reactions, Scientific Data 5 (2018), 180001.

20.

Kim

J.-D.

Ohta

Pyysalo

Kano

Tsujii

, Overview of BioNLP’09 shared task on event extraction, in: Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task, 2009, pp. 1–9.

21.

Bossy

Deléger

Chaix

Nédellec

, Bacteria Biotope at BioNLP Open Shared Tasks 2019, in: Proceedings of The 5th Workshop on BioNLP Open Shared Tasks, 2019, pp. 121–131.

22.

Smith

Tanabe

L.K.

nee Ando

R.J.

Kuo

C.-J.

Chung

I.-F.

Hsu

C.-N.

Lin

Y.-S.

Klinger

Friedrich

C.M.

Ganchev

et al., Overview of BioCreative II gene mention recognition, Genome Biology 9(S2) (2008), S2.

23.

Sun

Johnson

R.J.

Sciaky

Wei

C.-H.

Leaman

Davis

A.P.

Mattingly

C.J.

Wiegers

T.C.

, BioCreative V CDR task corpus: A resource for chemical disease relation extraction, Database 2016 (2016).

24.

Sun

Lin

Liu

Sha

, A product named entity normalization method based on entity relations, in: 2012 8th International Conference on Information Science and Digital Content Technology (ICIDT2012), Vol. 1, IEEE, 2012, pp. 166–169.

25.

Francis

Van Landeghem

Moens

M.-F.

, Transfer learning for named entity recognition in financial and biomedical documents, Information 10(8) (2019), 248.

26.

Mueller

Thyagarajan

, Siamese recurrent architectures for learning sentence similarity, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 30, 2016.

27.

Ranasinghe

Orasan

Mitkov

, Semantic textual similarity with siamese neural networks, in: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), 2019, pp. 1004–1011.

28.

Liu

Zhang

Niu

Lin

Lai

, Matching long text documents via graph convolutional networks, arXiv preprint arXiv:1802.07459, 2018, 2793–2799.

29.

Neculoiu

Versteegh

Rotaru

, Learning text similarity with siamese recurrent networks, in: Proceedings of the 1st Workshop on Representation Learning for NLP, 2016, pp. 148–157.

30.

Krivosheev

Atzeni

Mirylenka

Scotton

Casati

, Siamese Graph Neural Networks for Data Integration, arXiv preprint arXiv:2001.06543, 2020.

31.

Rahmani

Ranjbar-Sahraei

Weiss

Tuyls

, Entity resolution in disjoint graphs: An application on genealogical data, Intelligent Data Analysis 20(2) (2016), 455–475.

32.

Niu

Yang

Zhang

Sun

Zhang

, Multi-task character-level attentional networks for medical concept normalization, Neural Processing Letters 49(3) (2019), 1239–1256.

33.

Mulang’

I.O.

Singh

Prabhu

Nadgeri

Hoffart

Lehmann

, Evaluating the impact of knowledge graph context on entity disambiguation models, in: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020, pp. 2157–2160.

34.

Wei

C.-H.

Kao

H.-Y.

, Cross-species gene normalization by species inference, BMC Bioinformatics 12(S8) (2011), S5.

35.

Hakenberg

Gerner

Haeussler

Solt

Plake

Schroeder

Gonzalez

Nenadic

Bergman

C.M.

, The GNAT library for local and remote gene mention normalization, Bioinformatics 27(19) (2011), 2769–2771.

36.

Rocktäschel

Weidlich

Leser

, ChemSpot: A hybrid system for chemical named entity recognition, Bioinformatics 28(12) (2012), 1633–1640.

37.

Cho

Choi

Lee

, A method for named entity normalization in biomedical articles: Application to diseases and plants, BMC Bioinformatics 18(1) (2017), 451.

38.

Hanisch

Fundel

Mevissen

H.-T.

Zimmer

Fluck

, ProMiner: Rule-based protein and gene entity recognition, BMC Bioinformatics 6(1) (2005), 1–9.

39.

Aronson

A.R.

, Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program, in: Proceedings of the AMIA Symposium, American Medical Informatics Association, 2001, p. 17.

40.

Leaman

Islamaj Doğan

, DNorm: Disease name normalization with pairwise learning to rank, Bioinformatics 29(22) (2013), 2909–2917.

41.

Leaman

, TaggerOne: Joint named entity recognition and normalization with semi-Markov Models, Bioinformatics 32(18) (2016), 2839–2846.

42.

D’Souza

, Sieve-based entity linking for the biomedical domain, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2015, pp. 297–302.

43.

Chen

Tang

Wang

Huang

, CNN-based ranking for biomedical entity normalization, BMC Bioinformatics 18(11) (2017), 79–86.

44.

Wright

, NormCo: Deep disease normalization for biomedical knowledge base construction, PhD thesis, UC San Diego, 2019.

45.

Phan

M.C.

Sun

Tay

, Robust representation learning of biomedical names, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 3275–3285.

46.

Wei

, Bert-based ranking for biomedical entity normalization, AMIA Summits on Translational Science Proceedings 2020 (2020), 269.

47.

Sung

Jeon

Lee

Kang

, Biomedical Entity Representations with Synonym Marginalization, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 3641–3650.

48.

Lee

Yoon

Kim

C.H.

Kang

, BioBERT: A pre-trained biomedical language representation model for biomedical text mining, Bioinformatics 36(4) (2020), 1234–1240.

49.

Noh

Lee

, Keyword selection and processing strategy for applying text mining to patent analysis, Expert Systems with Applications 42(9) (2015), 4348–4360.

50.

Choi

Hwang

Y.-S.

, Patent keyword network analysis for improving technology development efficiency, Technological Forecasting and Social Change 83 (2014), 170–182.

51.

Ramos

, Semantic Web for manufacturing, trends and open issues: Toward a state of the art, Computers & Industrial Engineering 90 (2015), 444–460.

52.

Rahmani

Blockeel

Bender

, Using a human disease network for augmenting prior knowledge about diseases, Intelligent Data Analysis 19(4) (2015), 897–916.

53.

Rahmani

Blockeel

Bender

, Using a human drug network for generating novel hypotheses about drugs, Intelligent Data Analysis 20(1) (2016), 183–197.

54.

Wang

Yan

Wang

Jiang

Sun

Tang

Chang

T.-H.

Wang

et al., Real-world data medical knowledge graph: Construction and applications, Artificial Intelligence in Medicine 103 (2020), 101817.

55.

Dhayne

Kilany

Haque

Taher

, EMR2vec: Bridging the gap between patient data and clinical trial, Computers & Industrial Engineering 156 (2021), 107236.

56.

Dang

, Solution knowledge mining and recommendation for quality problem-solving, Computers & Industrial Engineering 159 (2021), 107313.

57.

Wang

Chen

C.-H.

Zheng

Song

, A hypergraph-based approach for context-aware smart product-service system configuration, Computers & Industrial Engineering 163 (2022), 107816.

58.

Seo

, A patent-based approach to identifying potential technology opportunities realizable from a firm’s internal capabilities, Computers & Industrial Engineering 171 (2022), 108395.

59.

Liu

Guo

, A metrics-based meta-learning model with meta-pretraining for industrial knowledge graph construction, Computers in Industry 143 (2022), 103753.

60.

Sarica

Luo

Wood

K.L.

, TechNet: Technology semantic network based on patent data, Expert Systems with Applications 142 (2020), 112995.

61.

Jeon

S.H.

Cho

, Edge Weight Updating Neural Network for Named Entity Normalization, Neural Processing Letters, 2022, 1–22.

62.

Pan

S.J.

Yang

, A survey on transfer learning, IEEE Transactions on Knowledge and Data Engineering 22(10) (2010), 1345–1359.

63.

Schuster

Chen

Q.V.

Norouzi

Macherey

Krikun

Cao

Gao

Macherey

et al., Google’s neural machine translation system: Bridging the gap between human and machine translation, arXiv preprint arXiv:1609.08144, 2016.

64.

Jeh

Widom

, Simrank: a measure of structural-context similarity, in: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002, pp. 538–543.

65.

Sanfeliu

K.-S.

, A distance measure between attributed relational graphs for pattern recognition, IEEE transactions on systems, man, and cybernetics, 1983, 353–362.

66.

Schroff

Kalenichenko

Philbin

, Facenet: A unified embedding for face recognition and clustering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.

67.

Hermans

Beyer

Leibe

, In defense of the triplet loss for person re-identification, arXiv preprint arXiv:1703.07737, 2017.

68.

Dou

Bai

Zhang

Bai

, Hard-aware point-to-set deep metric for person re-identification, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 188–204.

69.

Mikolov

Chen

Corrado

Dean

, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781, 2013.

70.

Pennington

Socher

Manning

C.D.

, Glove: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.

71.

Joulin

Grave

Bojanowski

Douze

Jégou

Mikolov

, FastText.zip: Compressing text classification models, arXiv preprint arXiv:1612.03651, 2016.

72.

Beltagy

Cohan

, Scibert: A pretrained language model for scientific text, arXiv preprint arXiv:1903.10676, 2019.

73.

Kingma

D.P.

, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980, 2014.

74.

Loshchilov

Hutter

, Decoupled weight decay regularization, arXiv preprint arXiv:1711.05101, 2017.

75.

Rosenberg

Hirschberg

, V-measure: A conditional entropy-based external cluster evaluation measure, in: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007, pp. 410–420.

Building knowledge graphs from technical documents using named entity recognition and edge weight updating neural network with triplet loss for entity normalization

Abstract

Keywords

1. Introduction

2. Related work

2.1 Named entity normalization dataset

2.2 Named entity normalization

2.3 Knowledge graph construction

3. Proposed method

4. Experiment settings

4.1 Data

Table 2 Matching categories and example for the named entity normalization dataset

Table 6 Models used for the evaluation

5.1 Quantitative evaluations

5.1.1 Named entity normalization: candidate retrieval

Table 7 Named entity normalization by candidate retrieval performances

Table 9 Accuracy of document retrieval from the named entities

Table 10 Accuracy of the named entity retrieval from the patent documents

5.2.1 Error analysis on the pairwise named entity normalization

5.3 Knowledge graph visualization and exemplary investigation

Footnotes

Acknowledgments

Appendix

Supplementary data

References

Table 2
Matching categories and example for the named entity normalization dataset

Table 6
Models used for the evaluation

Table 7
Named entity normalization by candidate retrieval performances

Table 9
Accuracy of document retrieval from the named entities

Table 10
Accuracy of the named entity retrieval from the patent documents