Enhanced character embedding for Chinese named entity recognition

Abstract

Traditional named entity recognition methods mainly rely on hand-crafted features. With the popularity of deep learning, neural networks have been introduced to capture deep features for named entity recognition. However, most existing methods target only modern corpora. Named entity recognition in ancient literature is challenging because names have evolved over time. In this paper, we attempt to recognise entities by exploring the characteristics of characters and strokes. An enhanced character embedding model, named ECEM, is proposed on the basis of bidirectional encoder representations from transformers (BERT) and strokes. First, ECEM generates semantic vectors dynamically according to the context of the words. Second, the proposed algorithm introduces morphological-level information of Chinese words. Finally, the enhanced character embedding is fed into a bidirectional long short-term memory-conditional random field (BiLSTM-CRF) model for training. To evaluate the proposed algorithm, experiments are carried out on both ancient literature and a modern corpus. The results indicate that our algorithm outperforms the traditional baselines on both.

Keywords

Enhanced character embedding; stroke; bidirectional encoder representations from transformers; named entity recognition

Introduction

Because of the popularity of the web, a great many unstructured texts have emerged to represent web contents. These texts contain many named entities, such as times, locations and organisations. Given a text, the results of named entity recognition (NER) provide valuable information for several downstream natural language processing (NLP) tasks, such as relation extraction,^1,2 event extraction³ and question answering.^4,5 Traditional NER approaches are based on rules. For example, RENAR uses rules that are independent of any language, and obtains a precision of 83.15% on organisation, 87.7% on person, and 85.9% on location.⁶ Singh et al. developed various rules to extract thirteen named entities.⁷ Gazetteers have been shown to benefit NER.⁸ Rule-based methods usually rely on hand-designed features to build a special dictionary or some basic rules, according to the existing entities. More recently, numerous machine learning approaches have been studied for the NER task, including Conditional Random Fields (CRFs), Support Vector Machines (SVMs) and Hidden Markov Models (HMMs).⁹ Meanwhile, with the popularity of artificial intelligence, neural networks have been applied to the NER task.^10,11 Word embedding is beneficial to NER. Word2vec, fastText and GloVe were used to train word embeddings in the past. Recently, language model pre-training has emerged, such as ELMo,¹² GPT¹³ and Bidirectional Encoder Representations from Transformers (BERT),¹⁴ which can effectively model context to improve many natural language processing tasks.

Although the methods discussed above have achieved great improvements, some issues need to be addressed further. On the one hand, most recognition results are affected by word segmentation. English NER assumes that words are clearly separated by explicit delimiters, such as blank spaces, and predicts a tag for each word. Although word-based algorithms have achieved some success in Chinese NER,^11,15 there are still many challenges. This is because the names of people, places and organisations keep increasing without a uniform naming rule, and the ambiguity of the Chinese language is inherent. Therefore, in most cases, the granularity of word segmentation is difficult to determine. Recently, some research results suggested that character-based algorithms outperform word-based algorithms in the NER task.^16–19 For Chinese NER, however, existing character-based algorithms not only lose context information, but also cannot leverage morphological-level information of Chinese characters. On the other hand, researchers mainly pay attention to modern corpora, and there is little research on named entity recognition in ancient literature. Xie et al. first attempted to recognise entities in Historical Records and obtained an F1 score of 67.02%,²⁰ relying on the result of new word detection. Therefore, the precision is high, but the recall is low. In response to these challenges, an enhanced character embedding algorithm, named ECEM, is proposed, in which BERT and strokes are integrated to learn the character representation, and its performance in the Chinese NER domain is explored. BERT is pre-trained on a large corpus to capture semantic features and abundant knowledge, which is of critical importance for NER tasks. The strokes denote the basic structures of Chinese characters, which represent the semantic information hidden in morphology. Our main contributions are as follows.

BERT and strokes are integrated to learn the embeddings of Chinese characters. The new language representation is crucial for improving the learning of Chinese character embeddings.

ECEM is first used in ancient literature named entity recognition, which not only captures context information and abundant knowledge by fine-tuning BERT, but also flexibly acquires morphological information generated through strokes.

The annotated ancient poetry dataset is important for further research in ancient literature NER. ECEM helps to extract useful entity information, which is of great significance to understand ancient books and documents efficiently. The experimental results demonstrate that ECEM is superior to traditional methods, especially in the ancient corpus.

The structure of the rest of this paper is as follows. Section 2 is about the related work. The recognition algorithm is shown in Section 3, and the experimental results are described in Section 4. Section 5 concerns the conclusions and some suggestions for future work.

Related works

NER has been studied extensively for many years in big data processing. Early approaches relied on feature engineering. However, such methods depended heavily on domain-specific corpora and required a large number of domain experts. To solve these problems, researchers introduced machine learning into NER. Lafferty et al. used CRFs to build probabilistic models to segment and label sequence data.⁹ Isozaki et al. obtained better scores than conventional systems with an SVM-based method.²¹ Bikel et al. presented a statistical, learned approach to find names and other non-recursive entities in texts, using a variant of the standard hidden Markov model.²² Ratinov et al. analysed the design challenges in improving the efficiency and robustness of NER systems.²³ Zhou et al. formulated Chinese NER as a joint identification and categorisation task.²⁴

Recently, neural network models have played important roles in NER tasks. Hammerton tried to solve this problem with a unidirectional LSTM, which was among the first neural models for NER.²⁵ A CNN-CRF model also produced the best results among all statistical models.²⁶ A character CNN was explored to enhance a CNN-CRF model.²⁷ The LSTM-CRF architecture has been leveraged in most recent work. Huang et al. used an LSTM to obtain features and fed them into a CRF decoder.²⁸ After that, many researchers have adopted the LSTM-CRF model as the baseline.^29,30 Lample et al. used a character LSTM to represent spelling characteristics.¹⁰ Moreover, a gated convolutional neural network (GCNN) was presented by Wang et al. for Chinese NER.³¹ Peng et al. jointly trained Chinese NER with the Chinese word segmentation (CWS) task. However, the specific features brought by the CWS task can lower the performance of the Chinese NER task.¹¹ Cao et al. used an adversarial transfer learning framework to solve this problem and reported a 58.70% F1 score on the Weibo dataset.³² Zhang et al. made use of a lattice-structured LSTM to choose the more useful characters and words for Chinese NER.³³ On the Weibo and Resume datasets, this method obtained F1 scores of 58.79% and 94.46%, respectively. Hang et al.³⁴ utilised relative positional encoding to capture character-level information. Meanwhile, some NER research focuses on neural representation learning. Cao et al. exploited stroke-level information to learn Chinese word embeddings, and their experiments showed that better word embeddings are effective for NER tasks.³⁵ In addition, BERT obtains semantic vectors dynamically according to the context of the words, and has achieved success in question answering,³⁶ information retrieval,^37,38 and text classification.³⁹ Fine-tuning the pre-trained BERT model with one additional output layer can therefore produce state-of-the-art models for various downstream tasks. Huang et al. proposed a multi-criteria method for CWS, which adopts BERT to introduce external knowledge.⁴⁰ The above methods mainly focus on modern corpora and cannot obtain deep semantic information of characters from multiple levels. We are the first to explore NER methods for Chinese ancient literature. Meanwhile, the smallest component of a Chinese character is the stroke, which carries morphological-level information. The combination of BERT and strokes yields better character embeddings for Chinese NER, containing rich syntactic and semantic information.

The proposed recognition algorithm

For Chinese, each character is semantically meaningful. Therefore, this recognition algorithm attempts to obtain better character embeddings for Chinese NER. To achieve this, ECEM is proposed, which combines BERT with strokes. The character embedding captures useful features and is fed into a BiLSTM layer. As illustrated in Figure 1, the whole algorithm consists of four components. The first component is data pre-processing, whose purpose is to repartition the corpora, segment them at the character level and relabel the tokens with the BIO scheme. The second component, the enhanced character embedding layer, converts each character into context-level and stroke-level embeddings through BERT and cw2vec, respectively. The third component concatenates the deeper context information output by two BiLSTMs as the final representations of the characters. The fourth component adopts a CRF to predict the sequence labels.

Figure 1.

The whole algorithm framework based on enhanced character embedding.

Data pre-processing

This paper mainly explores Chinese NER on ancient literature. The easiest way to deal with NER problems is to transform them into sequence labelling problems, in which each word in an input sentence is assigned a named entity tag. There are two common tag schemes: BIO and BIOES. The tag B-X denotes the first word of a named entity of type X, where X is LOC (Location), PER (Person) or ORG (Organisation), while E-X is used for the last word of a named entity. The tag I-X indicates that a word is part of an entity but not the first word. The tag S-X indicates that a single word forms an entity on its own. The tag O marks all non-entity words.⁴¹ Table 1 compares the two tag schemes.

Table 1.

Tag scheme comparison.

Tag scheme	Begin	Inside	End	Single	Outside
BIO	B-X	I-X	I-X	B-X	O
BIOES	B-X	I-X	E-X	S-X	O

In most cases, BIOES tags are better than BIO. However, ancient Chinese corpora contain more short syllables and monosyllables. Using the BIOES scheme may lead to fewer E-X labels and more unbalanced data, so that the trained model cannot predict the end of an entity well. Therefore, ECEM uses the BIO scheme for annotation.
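The mapping between the two schemes in Table 1 can be sketched in a few lines of Python. This is only an illustration of the table, not the authors' pre-processing code; the function name `bioes_to_bio` is hypothetical.

```python
def bioes_to_bio(tags):
    """Map BIOES tags to BIO tags following Table 1 (E-X -> I-X, S-X -> B-X)."""
    converted = []
    for tag in tags:
        if tag.startswith("E-"):
            converted.append("I-" + tag[2:])  # the end of an entity becomes an inside tag
        elif tag.startswith("S-"):
            converted.append("B-" + tag[2:])  # a single-token entity becomes a begin tag
        else:
            converted.append(tag)  # B-X, I-X and O are identical in both schemes
    return converted


example = ["B-PER", "E-PER", "O", "S-LOC", "O", "B-ORG", "I-ORG", "E-ORG"]
print(bioes_to_bio(example))
```

After conversion, no E-X or S-X labels remain, which is the property the BIO choice relies on for corpora dominated by short entities.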

Enhanced character embedding

Chinese character-level embedding avoids errors caused by wrong segmentation and out-of-vocabulary words. However, character embeddings may lose word and word-sequence information. The key to solving this problem is to better exploit implicit context features and integrate morphological features. Therefore, ECEM learns the character embedding at two levels. Strokes describe how words and characters are constructed, and they provide morphological-level information of Chinese characters. Meanwhile, BERT is pre-trained on a large corpus to capture semantic features and abundant knowledge, which is vital for obtaining context-level information. Accordingly, this layer contains two modules.

BERT

BERT facilitates pre-training deep bidirectional representations on unlabelled texts by fusing the left and right context in all layers.¹⁴ The corresponding token, segment and position embeddings are summed to form the input representation. A special classification embedding [CLS] is added as the first token of every sequence, and [SEP] is inserted as the final token. The symbol [CLS] is used for aggregating features. The core structure of BERT is illustrated in Figure 2, where Trm is the abbreviation of transformer block. Given the input embedding $E = [E_{1}, \dots, E_{N-1}, E_{N}]$, the corresponding output is $T = [T_{1}, \dots, T_{N-1}, T_{N}]$, which is computed in parallel through several Trms.

Figure 2.

The structure of BERT.

Pre-training and fine-tuning are the two steps of BERT. During pre-training, the model is trained on large unlabelled texts using two strategies, namely Masked Language Model (MLM) and Next Sentence Prediction (NSP). MLM trains a deep bidirectional representation by masking some percentage of the input tokens at random and then predicting the masked tokens.

CW2VEC

Most character representation methods focus on English, which uses a completely different writing system from Chinese. Each Chinese character carries rich semantic information in its structure. Strokes are the most basic units for building character meanings and offer guidance in exploring the semantic information hidden in the morphology. Therefore, cw2vec improves the learning of Chinese word embeddings by using stroke-level information. Equation (1) shows the objective function of this algorithm.

L = \sum_{w \in D} \sum_{c \in T(w)} \log \sigma(sim(w, c)) + \lambda E_{w' \sim P} [\log \sigma(-sim(w, w'))]

(1)

where w is the current word and c denotes one of its context words. T(w) denotes the set of context words of the current word w within a window. D is the set of all words in the training corpus. λ is the number of negative samples, and $E_{w' \sim P}[\cdot]$ is the expectation over a negative sample w′ drawn from the distribution P, which is mainly based on word frequency. Therefore, the more frequently a word appears in the corpus, the more likely it is to be sampled. Equation (2) gives the similarity function between w and w′.

sim(w, w') = \sum_{q \in S(w)} \vec{q} \cdot \vec{w}'

(2)

where S(w) denotes the set of stroke n-grams of the word w and $\vec{q}$ is the embedding of q. $\vec{w}'$ is the embedding of each context word of w.
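Equations (1) and (2) can be illustrated with a small pure-Python sketch. It is a simplified reading rather than the cw2vec implementation: the expectation term is approximated by a plain sum over the sampled negative words (an assumption of this sketch), and the names `sim`, `log_sigmoid` and `pair_objective` are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def sim(stroke_ngram_vectors, word_vector):
    """Equation (2): sum of dot products between every stroke n-gram
    embedding of w and the embedding of the other word."""
    return sum(dot(q, word_vector) for q in stroke_ngram_vectors)


def log_sigmoid(x):
    """Numerically stable log(sigma(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pair_objective(stroke_ngram_vectors, context_vector, negative_vectors):
    """Contribution of one (w, c) pair to equation (1).
    Assumption: the expectation is replaced by a sum over sampled negatives."""
    positive = log_sigmoid(sim(stroke_ngram_vectors, context_vector))
    negative = sum(log_sigmoid(-sim(stroke_ngram_vectors, neg)) for neg in negative_vectors)
    return positive + negative


# Toy two-dimensional embeddings, chosen only for illustration.
ngrams_of_w = [[1.0, 0.0], [0.0, 1.0]]
context = [2.0, 3.0]
negatives = [[-1.0, 0.5], [0.2, -0.4]]
print(sim(ngrams_of_w, context))
print(pair_objective(ngrams_of_w, context, negatives))
```

Training would maximise the sum of `pair_objective` over all words and their context words, which pushes the similarity to true context words up and the similarity to sampled negatives down.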

A Chinese word can be mapped into stroke n-grams according to the following four steps. (1) Word segmentation: dividing Chinese words into characters. (2) Stroke acquisition: retrieving the stroke sequence of each character and concatenating the sequences. (3) Representing the stroke sequence by stroke IDs: the strokes are classified into the five types shown in Table 2, numbered from 1 to 5. (4) Capturing stroke features: generating stroke n-grams with a sliding window of size n.

Table 2.

The corresponding relation between strokes and number.

Stroke name	Horizontal/rising	Vertical/vertical-hook	Left-falling	Right-falling/dot	Turning
ID	1	2	3	4	5
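The four steps and the stroke IDs of Table 2 can be sketched as follows. The stroke lookup `STROKE_IDS` is a hypothetical two-character dictionary used only for illustration (a real system needs a complete stroke dictionary), and the n-gram range `min_n` to `max_n` is an assumption, since the text only speaks of a window of size n.

```python
# Hypothetical stroke lookup (illustration only): each character maps to the
# string of its stroke IDs from Table 2, in writing order.
STROKE_IDS = {
    "大": "134",  # horizontal, left-falling, right-falling
    "人": "34",   # left-falling, right-falling
}


def stroke_ngrams(word, min_n=3, max_n=5):
    """Return the stroke ID sequence of a word and its stroke n-grams."""
    # Steps 1-3: split the word into characters, look up each character's
    # strokes and concatenate the stroke IDs.
    sequence = "".join(STROKE_IDS[ch] for ch in word)
    # Step 4: slide a window of every size n in [min_n, max_n] over the sequence.
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(sequence) - n + 1):
            grams.append(sequence[start:start + n])
    return sequence, grams


sequence, grams = stroke_ngrams("大人")
print(sequence)  # 13434
print(grams)     # ['134', '343', '434', '1343', '3434', '13434']
```

Each n-gram string would then index an embedding vector, giving the set S(w) used by the similarity function.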

Take the verse (leaving at dawn the White King crowned with rainbow cloud) as an example. Figure 3 shows the overall architecture of cw2vec with this example. First, the current word (White King) is reduced to stroke n-grams. Second, the similarity is computed between the current word and its context words. Third, the above objective function is optimised, and the context word embeddings are the final output word embeddings. In this paper, cw2vec is leveraged to generate character embeddings.

Figure 3.

The example of “leaving at dawn the White King crowned with rainbow cloud.”

Each character is converted into a multi-dimensional vector through BERT and cw2vec respectively, and the two vectors have the same dimension. They carry the semantic information hidden in the context and in the morphology, and each is fed into its own BiLSTM layer.

BiLSTM layer

LSTM is a member of the RNN family, which overcomes the long-range dependency problem by utilising memory cells and a gate mechanism. A unidirectional LSTM mainly depends on past information and ignores future information. A BiLSTM, in contrast, processes the input in two directions, one from past to future and one from future to past. Therefore, BiLSTM is adopted to capture features from both sides of the sequence for a better understanding of context. Let $h_{1}, \dots, h_{n}$ be the sequence of vectors output by BERT, which is the input of the BiLSTM. The hidden state can be calculated through the following equations at each time step t.

i_{t} = σ (W_{hi} h_{t} + W_{li} l_{t - 1} + W_{ci} c_{t - 1} + b_{i})

(3)

f_{t} = σ (W_{hf} h_{t} + W_{lf} l_{t - 1} + W_{cf} c_{t - 1} + b_{f})

(4)

c_{t} = f_{t} ⊙ c_{t - 1} + i_{t} ⊙ \tanh (W_{hc} h_{t} + W_{lc} l_{t - 1} + b_{c})

(5)

o_{t} = σ (W_{ho} h_{t} + W_{lo} l_{t - 1} + W_{co} c_{t} + b_{o})

(6)

l_{t} = o_{t} ⊙ \tanh (c_{t})

(7)

where σ denotes the element-wise sigmoid function, and i, f, o and c represent the input gate, forget gate, output gate and current cell state, respectively. In equations (5) and (7), $⊙$ denotes the element-wise product. $W_{*}$ and $b_{*}$ denote weight matrices and biases, respectively. In addition to $h_{t}$, $l_{t-1}$ and $c_{t-1}$ are also inputs. The BiLSTM generates the forward hidden state $\overrightarrow{l_{t}}$ by processing the aforementioned sequence from $h_{1}$ to $h_{n}$. Similarly, the backward hidden state $\overleftarrow{l_{t}}$ provides useful information starting from the end of the sequence. Eventually, $\overrightarrow{l_{t}}$ and $\overleftarrow{l_{t}}$ are concatenated into a single vector, that is, $L_{t} = [\overrightarrow{l_{t}}; \overleftarrow{l_{t}}]$. In the same way, the output of cw2vec is converted into ${L^{'}}_{t} = [\overrightarrow{{l^{'}}_{t}}; \overleftarrow{{l^{'}}_{t}}]$ through a second BiLSTM layer. Finally, $L_{t}$ and ${L^{'}}_{t}$ are concatenated as shown in equation (8), so that the original character sequence is transformed into a sequence of vectors.

V_{t} = [L_{t}; L'_{t}]

(8)
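To make equations (3) to (8) concrete, the following sketch implements them for scalar (one-dimensional) inputs and states, so that the element-wise product reduces to ordinary multiplication. It is a toy under stated assumptions, not the trained model: the parameter dictionary, the helper `make_params` and the zero initial states are all hypothetical choices.

```python
import math

PARAM_NAMES = ["W_hi", "W_li", "W_ci", "b_i", "W_hf", "W_lf", "W_cf", "b_f",
               "W_hc", "W_lc", "b_c", "W_ho", "W_lo", "W_co", "b_o"]


def make_params(value):
    """Hypothetical helper: set every weight and bias to the same scalar."""
    return {name: value for name in PARAM_NAMES}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(h_t, l_prev, c_prev, p):
    """One time step of equations (3)-(7) with scalar input and state."""
    i_t = sigmoid(p["W_hi"] * h_t + p["W_li"] * l_prev + p["W_ci"] * c_prev + p["b_i"])
    f_t = sigmoid(p["W_hf"] * h_t + p["W_lf"] * l_prev + p["W_cf"] * c_prev + p["b_f"])
    c_t = f_t * c_prev + i_t * math.tanh(p["W_hc"] * h_t + p["W_lc"] * l_prev + p["b_c"])
    o_t = sigmoid(p["W_ho"] * h_t + p["W_lo"] * l_prev + p["W_co"] * c_t + p["b_o"])
    l_t = o_t * math.tanh(c_t)
    return l_t, c_t


def run_lstm(inputs, p):
    """Run the cell over a sequence, starting from zero states (an assumption)."""
    l_t, c_t, states = 0.0, 0.0, []
    for h_t in inputs:
        l_t, c_t = lstm_step(h_t, l_t, c_t, p)
        states.append(l_t)
    return states


def bilstm(inputs, p_forward, p_backward):
    """Return L_t = [forward l_t, backward l_t] for every position t."""
    forward = run_lstm(inputs, p_forward)
    backward = run_lstm(inputs[::-1], p_backward)[::-1]
    return [[f, b] for f, b in zip(forward, backward)]


bert_sequence = [0.2, -0.1, 0.4]    # stand-in for the BERT outputs h_1..h_n
stroke_sequence = [0.3, 0.0, -0.2]  # stand-in for the cw2vec outputs
L = bilstm(bert_sequence, make_params(0.1), make_params(0.2))
L_prime = bilstm(stroke_sequence, make_params(0.3), make_params(0.4))
V = [a + b for a, b in zip(L, L_prime)]  # equation (8): concatenation per position
print(V[0])
```

In practice each quantity is a vector and each W a matrix, but the data flow is the same: two independent BiLSTMs, one per embedding type, whose outputs are concatenated position by position.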

CRF layer

Although the hidden vector V_t could be used directly to predict each tag independently, the tags in a sequence are not independent. For example, an I-LOC tag cannot follow a B-PER tag. Therefore, the CRF layer is used to predict the optimal sequence of tags. Given an input sentence $X = (x_{1}, x_{2}, \dots, x_{n})$, P is regarded as the matrix of scores output by the BiLSTM layer. The size of P is $n \times k$, where k is the number of distinct tags, and $P_{i,j}$ represents the score of the j-th tag for the i-th character in the sentence. Given a prediction sequence $y = (y_{1}, y_{2}, \dots, y_{n})$, its score at the sentence level is given by the function f(X, y) in equation (9).

f (X, y) = \sum_{i = 1}^{n} P_{i, y_{i}} + \sum_{i = 0}^{n} T_{y_{i}, y_{i + 1}}

(9)

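where T is the matrix of transition scores, so that $T_{i,j}$ models the transition from tag i to tag j. Since the second sum runs from i = 0 to n, $y_{0}$ and $y_{n+1}$ are understood as additional start and end tags of the sentence. Then, the probability of the sequence y is calculated using a softmax over all possible tag sequences.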

p (y | X) = \frac{e^{f (X, y)}}{\sum_{\tilde{y} \in Y_{X}} e^{f (X, \tilde{y})}}

(10)

where $Y_{X}$ is the set of all possible tag sequences for the input $X$. When training the model, the following log-likelihood of the correct tag sequence is used to calculate the loss, and the parameters are learned by gradient descent.

\log (p (y | X)) = f (X, y) - \log \sum_{\tilde{y} \in Y_{X}} e^{f (X, \tilde{y})}

(11)

The purpose of our model is to produce the optimal output sequence. Therefore, at prediction time the output sequence y^* is the one with the maximum score among all feasible outputs, as given by equation (12).

y^{*} = \arg\max_{\tilde{y} \in Y_{X}} \log (p (\tilde{y} | X))

(12)
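The CRF scoring and decoding of equations (9) to (12) can be sketched in plain Python. This is an illustrative reading, not the authors' code: tags are integer indices, the transition matrix is assumed to have two extra rows and columns for hypothetical start and end tags, the normaliser is computed by brute-force enumeration (feasible only for tiny examples), and the arg max is found with the standard Viterbi algorithm.

```python
import itertools
import math


def sequence_score(P, T, y, start, end):
    """Equation (9): emission scores plus transition scores, with start and
    end tags standing in for y_0 and y_{n+1}."""
    path = [start] + list(y) + [end]
    emission = sum(P[i][tag] for i, tag in enumerate(y))
    transition = sum(T[a][b] for a, b in zip(path, path[1:]))
    return emission + transition


def log_partition(P, T, k, start, end):
    """Log of the denominator of equation (10), by enumerating all sequences."""
    scores = [sequence_score(P, T, y, start, end)
              for y in itertools.product(range(k), repeat=len(P))]
    top = max(scores)
    return top + math.log(sum(math.exp(s - top) for s in scores))


def log_likelihood(P, T, y, k, start, end):
    """Equation (11): score of y minus the log-partition."""
    return sequence_score(P, T, y, start, end) - log_partition(P, T, k, start, end)


def viterbi(P, T, k, start, end):
    """Equation (12): highest-scoring tag sequence by dynamic programming."""
    best = [T[start][tag] + P[0][tag] for tag in range(k)]
    back = []
    for i in range(1, len(P)):
        pointers, scores = [], []
        for tag in range(k):
            prev = max(range(k), key=lambda q: best[q] + T[q][tag])
            pointers.append(prev)
            scores.append(best[prev] + T[prev][tag] + P[i][tag])
        back.append(pointers)
        best = scores
    last = max(range(k), key=lambda q: best[q] + T[q][end])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy example: n = 3 characters, k = 2 tags, start tag = 2, end tag = 3.
P_example = [[1.0, 0.2], [0.3, 1.5], [0.9, 0.1]]
T_example = [[0.5, -0.3, 0.0, 0.1],
             [-1.0, 0.4, 0.0, 0.2],
             [0.2, 0.0, 0.0, 0.0],
             [0.0, 0.0, 0.0, 0.0]]
print(viterbi(P_example, T_example, 2, 2, 3))  # [0, 0, 0]
```

Viterbi visits each position once and compares k previous tags for each of the k current tags, so decoding costs on the order of n times k squared operations instead of the k to the power n sequences enumerated by `log_partition`. A production CRF would compute the normaliser with the forward algorithm for the same reason.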

Experimental results

In order to explore the effect of our proposed algorithm, an extensive set of experiments is carried out. This section details the datasets, parameter settings and results.

Datasets

Most existing NER methods focus on the standard corpora provided by public competitions, and researchers pay little attention to named entity recognition in ancient literature. At present, there is no large-scale labelled ancient dataset. Therefore, this paper uses the manually labelled Historical Records dataset²⁰ and an Ancient Poetry dataset that we annotate. The Historical Records dataset contains 3200 short sentences randomly extracted from historical records, and has 1100 named entities. The entire dataset is partitioned into training, development and testing sets according to the ratio of 8:1:1. Our Ancient Poetry dataset is drawn from “A Dictionary of Chinese Poems on Scenic Spots and Historical Sites”, in which we annotate 2520 named entities. This dataset is split in accordance with the ratio of 7:2:1. The statistics of the ancient literature datasets are detailed in Table 3.

Table 3.

Statistics of the ancient literature.

Statistics	Historical records			Ancient poetry
	Person	Location	Organisation	Person	Location	Organisation
Train	224	233	372	377	1169	64
Dev	38	29	50	64	233	14
Test	32	62	60	138	447	14
Total entity	294	324	482	579	1849	92
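A ratio-based partition such as the 8:1:1 split of the 3200 Historical Records sentences can be sketched as below. The shuffling, the fixed seed and the sentence-level granularity are assumptions of this sketch; the paper does not state how the partition was drawn, and `split_dataset` is a hypothetical name.

```python
import random


def split_dataset(sentences, ratios, seed=0):
    """Shuffle and split a list of sentences into train/dev/test parts
    according to integer ratios such as (8, 1, 1)."""
    items = list(sentences)
    random.Random(seed).shuffle(items)  # assumption: random split with a fixed seed
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_dev = len(items) * ratios[1] // total
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # the remainder goes to the test set
    return train, dev, test


sentence_ids = list(range(3200))  # stand-in for the 3200 sentences
train, dev, test = split_dataset(sentence_ids, (8, 1, 1))
print(len(train), len(dev), len(test))  # 2560 320 320
```

Note that the entity counts in Table 3 need not follow the same ratio exactly, because entities are unevenly distributed across sentences.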

To further check the effect of our proposed algorithm in other domains, experiments are also conducted on a modern corpus, namely the Resume NER dataset. The Chinese Resume dataset is a standard dataset, which is extracted from Sina Finance and has eight types of named entities.³³ The detailed statistics of the Resume NER dataset are given in Table 4. For this dataset, the official data split is used.

Table 4.

Statistics of the Resume dataset.

Statistics	Train	Dev	Test
Country	260	33	28
Educational Institution	858	106	112
Location	47	2	6
Person	952	110	112
Organisation	4611	523	553
Profession	287	18	33
Ethnicity background	115	15	14
Job Title	6308	690	772
Total Entity	13438	1497	1630

Baselines

The proposed ECEM method is compared with several of its variants. For convenience in later references, the models are briefly introduced as follows.

BiLSTM-CRF. It is the naive model which applies BiLSTM to capture context information for words and CRF to obtain optimal sequence combinations.

STROKE-BiLSTM-CRF. The character embeddings, which are the input of BiLSTM-CRF, are learned from the stroke-level information of Chinese words.

BERT-BiLSTM-CRF. This model first uses BERT to learn the character embedding. Then, a sequence of character vectors is fed into BiLSTM-CRF layer.

ECEM. It is the method proposed in this paper. ECEM can enhance character embeddings for ancient literature and modern corpus.

In order to illustrate the performance of BiLSTM, the BiLSTM layers in the above four models are replaced with LSTM layers. The new generated models are LSTM-CRF, STROKE-LSTM-CRF, BERT-LSTM-CRF and BERT-STROKE-LSTM-CRF.

Tag scheme and evaluation measure

The ancient text contains many short syllables and monosyllables. Consequently, for a fair comparison, all models except Lattice use the BIO scheme as the annotation scheme on all datasets. An entity is counted as correctly predicted only when both its boundary and its category label are correct.

There are different metrics for evaluating NER. This paper employs Precision (P), Recall (R) and F1 score to assess recognition results.²¹ More precisely, precision is the ratio of the number of correctly recognised entities to the total number of recognised entities, recall is the number of correctly recognised entities divided by the number of real entities, and F1 score is the harmonic mean of precision and recall. The definitions of precision, recall and F1 score are given in equations (13) to (15), respectively.

P = \frac{\text{number of correctly recognised entities}}{\text{number of recognised entities}}

(13)

R = \frac{\text{number of correctly recognised entities}}{\text{number of real entities}}

(14)

F1 = \frac{2 \cdot P \cdot R}{P + R}

(15)
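Equations (13) to (15), together with the exact-match criterion (boundary and category both correct), can be sketched as follows for a single BIO-tagged sentence. The function names are hypothetical, and a corpus-level evaluation would accumulate the three counts over all sentences before dividing.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans in a BIO tag sequence;
    end is exclusive. Stray I-X tags that do not continue an entity are ignored."""
    entities, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not continues:
            entities.add((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return entities


def f1_from(precision, recall):
    """Equation (15): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(gold_tags, predicted_tags):
    """Equations (13)-(15) with exact matching of boundary and type."""
    gold = extract_entities(gold_tags)
    predicted = extract_entities(predicted_tags)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f1_from(precision, recall)


gold = ["B-PER", "I-PER", "O", "B-LOC", "O"]
pred = ["B-PER", "I-PER", "O", "B-ORG", "O"]
print(precision_recall_f1(gold, pred))  # (0.5, 0.5, 0.5)
```

In the example the location is found at the right position but with the wrong type, so it counts as an error for both precision and recall. Equation (15) can also be checked against the reported figures: P = 80.72 and R = 87.58 give an F1 of about 84.01, which matches the ECEM result on Historical Records.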

Parameter settings

The Chinese BERT-Base model is used, which has 12 layers, 12 attention heads and a hidden size of 768. Accordingly, the dimension of the character embeddings output by cw2vec is also set to 768, and the maximum sequence length is set to 128. Character sequences shorter than 128 are padded with all-zero vectors, and character sequences longer than 128 are ignored. During fine-tuning, parameters are adjusted according to the performance on the development set of Historical Records. The hidden-layer dimension of BiLSTM and LSTM is set to 150. Adam is used for optimisation, with a dropout of 0.5, a batch size of 16 and a learning rate of 1e-5. When learning word embeddings through cw2vec, the window size is set to 5 and the learning rate is 2.5e-3. The parameters are updated during training.

To tune the number of epochs, ECEM and its variants are compared. Each model is trained for 20 epochs in total, and the version with the highest F1 score is selected. Tables 5, 6 and 7 report the F1 score of the LSTM-CRF, STROKE-LSTM-CRF, BERT-LSTM-CRF, BERT-STROKE-LSTM-CRF, BiLSTM-CRF, STROKE-BiLSTM-CRF, BERT-BiLSTM-CRF and ECEM models against the number of epochs, and Figure 4 plots the same results. It can be seen from Figure 4 that BiLSTM improves NER performance compared with LSTM on all datasets. In addition, both BERT and strokes boost the performance. One reason is that fine-tuning BERT learns sentence-level and word-level representations from longer sequences. Another is that strokes provide meaningful sub-word information, which is useful for character embeddings. Therefore, ECEM obtains a certain improvement over its variants. Concretely, on Historical Records, LSTM-CRF, STROKE-LSTM-CRF, BERT-LSTM-CRF, BERT-STROKE-LSTM-CRF, BiLSTM-CRF, STROKE-BiLSTM-CRF, BERT-BiLSTM-CRF and ECEM obtain their highest F1 scores of 62.75%, 64.45%, 79.01%, 81.76%, 65.16%, 66.24%, 82.50% and 84.01% at epochs 16, 18, 16, 20, 12, 20, 20 and 10, respectively. On Ancient Poetry, the highest F1 scores are 55.11%, 57.70%, 73.92%, 74.59%, 58.94%, 59.58%, 75.03% and 75.34% at epochs 11, 17, 20, 16, 17, 20, 18 and 9. On Resume, they are 91.06%, 92.51%, 94.99%, 95.18%, 92.29%, 93.39%, 95.31% and 95.83% at epochs 14, 16, 17, 20, 19, 17, 18 and 14.

Table 5.

F1 of each epoch on Historical Records (%).

Models	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20
LSTM-CRF	44.23	42.74	48.58	53.13	55.48	57.16	53.81	55.50	56.26	57.88	59.99	61.76	61.27	61.92	60.00	62.75	61.42	62.18	60.59	62.52
STROKE-LSTM-CRF	45.16	44.64	49.81	54.77	56.86	58.99	55.37	57.70	58.34	58.51	62.60	63.90	62.00	62.04	61.35	62.12	62.37	64.45	61.60	63.24
BERT-LSTM-CRF	48.27	58.70	60.39	61.15	71.25	73.50	73.90	75.63	76.04	76.33	76.66	76.00	76.50	77.52	77.50	79.01	78.60	78.50	78.12	78.00
BERT-STROKE-LSTM-CRF	49.00	58.80	65.99	69.56	74.21	75.42	76.54	77.77	77.04	78.00	78.43	77.34	77.56	74.23	79.00	80.00	80.14	78.12	80.12	81.76
BiLSTM-CRF	46.83	45.34	50.68	55.63	57.98	59.66	56.31	60.90	59.46	59.28	63.19	65.16	64.67	62.32	62.30	63.40	63.82	64.50	62.99	64.92
STROKE-BiLSTM-CRF	47.16	58.64	51.81	57.77	58.86	60.99	60.37	62.70	61.34	62.51	63.60	65.90	65.00	64.04	64.35	65.12	65.37	65.70	66.20	66.24
BERT-BiLSTM-CRF	49.27	58.86	65.39	72.15	76.25	77.50	78.90	79.63	77.74	78.66	78.88	77.74	78.50	74.92	79.50	81.25	80.50	78.50	80.50	82.50
ECEM	56.81	60.87	66.04	78.73	78.26	78.02	81.25	79.88	79.54	84.01	81.29	81.01	80.99	78.53	80.25	82.73	81.25	80.85	80.75	83.25

Bold numbers indicate the highest F1.
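The selection rule used here (train for 20 epochs, keep the epoch with the highest F1) amounts to an arg max over a row of the table. A minimal sketch using the ECEM row of Table 5, with the hypothetical helper name `best_epoch`:

```python
# F1 (%) of ECEM on Historical Records for epochs 1..20, copied from Table 5.
ecem_f1 = [56.81, 60.87, 66.04, 78.73, 78.26, 78.02, 81.25, 79.88, 79.54, 84.01,
           81.29, 81.01, 80.99, 78.53, 80.25, 82.73, 81.25, 80.85, 80.75, 83.25]


def best_epoch(f1_per_epoch):
    """Return (1-based epoch, F1) of the best epoch; ties go to the earliest epoch."""
    best_index = max(range(len(f1_per_epoch)), key=lambda i: f1_per_epoch[i])
    return best_index + 1, f1_per_epoch[best_index]


print(best_epoch(ecem_f1))  # (10, 84.01)
```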

Table 6.

F1 of each epoch on Ancient Poetry (%).

Models	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20
LSTM-CRF	40.28	42.78	45.02	48.08	50.60	51.23	52.53	53.14	53.82	54.72	55.11	54.55	53.81	53.54	53.99	54.48	52.94	51.39	52.47	53.51
STROKE-LSTM-CRF	44.12	45.09	51.90	52.70	54.69	53.60	55.55	54.80	54.00	55.90	56.00	56.40	56.90	55.49	56.19	56.90	57.70	57.12	57.30	57.00
BERT-LSTM-CRF	55.60	60.21	61.67	64.67	68.93	69.47	70.61	71.86	72.45	71.02	71.60	72.29	72.88	71.33	71.90	72.83	72.34	73.20	72.44	73.92
BERT-STROKE-LSTM-CRF	59.00	63.01	65.16	68.00	69.00	71.61	72.00	72.08	73.14	71.41	73.38	72.07	73.16	73.93	74.10	74.59	74.31	74.01	73.29	73.94
BiLSTM-CRF	45.28	46.78	52.02	55.28	56.60	56.23	57.53	57.14	57.82	56.72	56.34	58.35	56.81	56.54	57.99	56.48	58.94	58.39	58.47	58.51
STROKE-BiLSTM-CRF	52.70	52.53	53.14	56.67	57.11	57.35	58.67	58.80	58.07	58.03	58.80	58.56	57.46	58.73	58.50	58.01	58.31	58.12	58.66	59.58
BERT-BiLSTM-CRF	60.60	64.21	67.67	70.67	71.93	72.47	71.61	73.86	74.45	72.02	72.60	73.29	73.88	72.33	72.90	74.83	74.34	75.03	73.84	74.20
ECEM	67.24	69.51	72.16	71.87	73.95	74.61	73.85	74.08	75.34	72.41	73.38	74.07	74.16	74.93	74.90	74.95	74.71	75.23	74.29	74.94

Bold numbers indicate the highest F1.

Table 7.

F1 of each epoch on Resume (%).

Models	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20
LSTM-CRF	40.00	50.00	64.50	70.75	71.93	72.92	79.87	80.54	81.46	83.87	85.23	88.26	89.94	91.06	90.35	89.76	89.35	88.89	88.17	88.36
STROKE-LSTM-CRF	87.00	87.45	88.23	89.21	89.89	90.14	90.76	91.12	91.29	91.61	91.87	92.13	92.34	92.08	92.11	92.51	92.00	92.34	92.40	91.90
BERT-LSTM-CRF	90.45	90.99	91.23	91.99	92.00	92.23	92.89	92.90	93.00	93.12	93.25	93.67	93.89	93.90	94.00	94.56	94.99	94.12	93.45	92.99
BERT-STROKE-LSTM-CRF	91.08	92.08	92.59	92.92	93.70	93.20	93.20	93.31	93.87	93.05	93.17	94.16	93.92	93.83	94.15	94.00	93.16	95.02	94.58	95.18
BiLSTM-CRF	42.00	51.90	68.09	71.75	72.13	73.24	80.12	82.54	85.64	86.56	87.65	88.54	89.12	92.00	90.35	91.76	90.35	91.89	92.29	90.36
STROKE-BiLSTM-CRF	90.00	90.23	90.89	91.65	91.50	91.49	91.19	92.09	92.29	92.61	92.87	92.35	92.11	93.11	93.20	93.32	93.39	92.41	92.08	91.77
BERT-BiLSTM-CRF	91.18	91.77	93.27	93.82	93.94	93.96	93.89	94.02	94.32	93.51	93.73	94.26	94.24	93.95	94.44	94.14	93.68	95.31	94.32	95.19
ECEM	93.80	95.08	94.59	94.92	95.70	95.20	95.10	95.31	94.87	95.65	95.17	95.16	94.92	95.83	95.15	94.70	95.16	95.72	95.58	95.27

Bold numbers indicate the highest F1.

Figure 4.

F1 against epoch number: (a) F1 against epoch number on Historical Records, (b) F1 against epoch number on Ancient Poetry, and (c) F1 against epoch number on Resume.

Table 8 reports the overall recognition accuracy of each model on each dataset at the epoch with the highest F1 score. As illustrated in the table, ECEM acquires the highest accuracy (95.01%, 93.32% and 97.25%) on the three datasets. This indicates that the enhanced character embedding improves the recognition result by capturing context and morphological information together.

Table 8.

The total accuracy rate of baselines on each dataset.

Models	Accuracy (%)
	Historical Records	Ancient Poetry	Resume
LSTM-CRF	88.73	89.11	95.20
STROKE-LSTM-CRF	89.91	88.48	95.88
BERT-LSTM-CRF	93.64	92.89	96.87
BERT-STROKE-LSTM-CRF	94.59	93.23	97.01
BiLSTM-CRF	90.34	88.97	95.84
STROKE-BiLSTM-CRF	90.50	89.76	96.56
BERT-BiLSTM-CRF	94.05	93.21	97.21
ECEM	95.01	93.32	97.25

Time comparison

Runtime reflects the complexity of the model. We conduct our experiments on a physical machine with Ubuntu 14.04, 2 Intel Xeon E5-2609 v4 CPUs, and 4 GTX 1080 GPUs. Each epoch takes about 10 minutes to train the corpus using cw2vec. On this basis, we train ECEM and its variants with the above parameters for a total of 20 epochs and obtain the average training time per epoch. As illustrated in Figure 5, ECEM performs better with little extra time cost on all datasets.

Figure 5.

Running time of baselines.

Final results

To further understand the results, Tables 9 and 10 report the precision, recall and F1 score on the three datasets, which come from different domains. The top part of each table shows the results of previous state-of-the-art models, and the other parts show the results of the baselines described above. Overall, ECEM obtains competitive improvements over the other models.

Table 9.

The P, R and F1 of ECEM on ancient datasets.

Models	Historical Records			Ancient Poetry
	P (%)	R (%)	F1 (%)	P (%)	R (%)	F1 (%)
Zhang et al.³³	72.03	67.32	69.59	77.17	57.75	66.06
LSTM-CRF	63.16	62.34	62.75	64.31	48.21	55.11
STROKE-LSTM-CRF	65.99	62.99	64.45	62.77	53.38	57.70
BERT-LSTM-CRF	75.29	83.12	79.01	76.50	71.50	73.92
BERT-STROKE-LSTM-CRF	79.27	84.42	81.76	76.14	73.10	74.59
BiLSTM-CRF	64.74	65.58	65.16	65.02	53.91	58.94
STROKE-BiLSTM-CRF	65.61	66.88	66.24	67.85	53.11	59.58
BERT-BiLSTM-CRF	79.04	86.27	82.50	76.92	73.24	75.03
ECEM	80.72	87.58	84.01	77.57	73.24	75.34

Table 10.

The P, R and F1 of ECEM on resume dataset.

Models	P (%)	R (%)	F1 (%)
Zhang et al.³³	94.81	94.11	94.46
Zhu et al.¹⁹	95.01	94.82	94.94
Yan et al.³⁴	-	-	95.00 ± 0.25
LSTM-CRF	90.89	91.23	91.06
STROKE-LSTM-CRF	92.25	92.76	92.51
BERT-LSTM-CRF	93.90	96.11	94.99
BERT-STROKE-LSTM-CRF	94.52	95.86	95.18
BiLSTM-CRF	91.70	92.88	92.29
STROKE-BiLSTM-CRF	93.16	93.62	93.39
BERT-BiLSTM-CRF	94.64	95.98	95.31
ECEM	95.18	96.48	95.83
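As a quick consistency check on Tables 9 and 10, the reported F1 scores can be recomputed from the reported precision and recall. The sketch below assumes the standard definition of F1 as the harmonic mean of P and R (the excerpt does not state the formula here); the function name `f1_score` is ours, and the figures are copied from the ECEM rows.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F1) for ECEM, copied from Tables 9 and 10.
ECEM_ROWS = {
    "Historical Records": (80.72, 87.58, 84.01),
    "Ancient Poetry": (77.57, 73.24, 75.34),
    "Resume": (95.18, 96.48, 95.83),
}

if __name__ == "__main__":
    for dataset, (p, r, reported) in ECEM_ROWS.items():
        computed = f1_score(p, r)
        print(f"{dataset}: computed F1 = {computed:.2f}, reported F1 = {reported:.2f}")
```

Under that assumption, the recomputed values agree with the reported ones to two decimal places.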

Ancient datasets

Zhang et al.³³ leverage not only character information but also word and word-sequence information, without segmentation errors. However, the relationship between characters is not close in ancient datasets, and noise characters may affect the recognition results. Therefore, this model only achieves F1 scores of 69.59% and 66.06% on Historical Records and Ancient Poetry, respectively. In the second block of Table 9, LSTM-CRF shows relatively low performance on all the corpora, especially on the Ancient Poetry dataset. The results imply that LSTM-CRF cannot obtain deep semantic features from the sparse and unbalanced corpora. In addition, to check the role of the various components, we also report the performance of STROKE-LSTM-CRF, BERT-LSTM-CRF and BERT-STROKE-LSTM-CRF. The last block of Table 9 presents the results of the models based on BiLSTM. BiLSTM-CRF outperforms LSTM-CRF. Compared with BiLSTM-CRF, BERT-BiLSTM-CRF improves the F1 score by 17.34 percentage points on the Historical Records dataset and 16.09 percentage points on the Ancient Poetry dataset. Correspondingly, STROKE-BiLSTM-CRF improves the F1 score by 1.08 percentage points on the Historical Records dataset and 0.64 percentage points on the Ancient Poetry dataset. This means that the semantic features captured by BERT and the morphological information captured by STROKE help the algorithm improve recognition. On the ancient datasets, BERT-BiLSTM-CRF gives a higher F1 score than STROKE-BiLSTM-CRF (82.50% against 66.24% on Historical Records, and 75.03% against 59.58% on Ancient Poetry). Finally, ECEM achieves the best results on both datasets, which shows that the combination of BERT and strokes performs best.
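The gains quoted above (17.34 and 16.09 for BERT, 1.08 and 0.64 for strokes) can be reproduced from the BiLSTM-based rows of Table 9. The sketch below is only an arithmetic check; the dictionary `F1` and the helper `gain` are illustrative names, and the gains are expressed in percentage points.

```python
# F1 scores (%) copied from Table 9: (Historical Records, Ancient Poetry).
F1 = {
    "BiLSTM-CRF": (65.16, 58.94),
    "STROKE-BiLSTM-CRF": (66.24, 59.58),
    "BERT-BiLSTM-CRF": (82.50, 75.03),
    "ECEM": (84.01, 75.34),
}


def gain(model, baseline="BiLSTM-CRF"):
    """F1 gain of `model` over `baseline` in percentage points, one value per dataset."""
    return tuple(round(m - b, 2) for m, b in zip(F1[model], F1[baseline]))


if __name__ == "__main__":
    for model in ("STROKE-BiLSTM-CRF", "BERT-BiLSTM-CRF", "ECEM"):
        historical, poetry = gain(model)
        print(f"{model}: +{historical:.2f} (Historical Records), +{poetry:.2f} (Ancient Poetry)")
```

Read this way, adding strokes alone gives a small gain over BiLSTM-CRF, adding BERT alone gives a large one, and combining both in ECEM gives the largest.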

Resume dataset

Experimental results for the Resume dataset are given in Table 10. Zhu et al.¹⁹ capture semantic knowledge from contexts and adjacent characters based on a character-based convolutional neural network and a gated recurrent unit. This method improves the F1 score from 94.46% to 94.94% compared with the model proposed by Zhang et al.³³ Yan et al.³⁴ proposed a transformer encoder with distance-aware, direction-aware and un-scaled attention, which gave an F1 score of 95.00 ± 0.25%. BiLSTM-CRF achieves an F1 score of 92.29%. STROKE-BiLSTM-CRF improves the F1 score to 93.39%, and BERT-BiLSTM-CRF improves it to 95.31%. By using enhanced character embeddings, ECEM further improves the F1 score to 95.83%, which is the highest result among the compared models. Overall, ECEM is more suitable for ancient literature, while strokes have a greater effect on modern corpora. One reason is that ancient texts are short, which leads to a lack of context information. Another reason is that ancient texts are more diverse in expression, which increases the difficulty of feature extraction. Therefore, ancient literature relies heavily on enhanced character embedding. Surprisingly, ECEM also achieves the best results on modern corpora. The results demonstrate that our model is more efficient and robust than other models.

To visualise the effect of the proposed algorithm, Figure 6 shows the relative error of the F1 score for each compared model. Each column is obtained by computing the difference in F1 score between the compared model and ECEM. As can be seen from the figure, ECEM is clearly superior to most of the other models.
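The per-model differences behind such a figure can be recomputed from Table 10. The sketch below does this for the Resume dataset only. It assumes the difference is taken as ECEM minus the compared model (the sign convention is not stated in the text), and it omits the third row of Table 10, whose F1 is reported as a range. The names `RESUME_F1`, `ECEM_F1` and `gap_to_ecem` are illustrative.

```python
# F1 scores (%) on the Resume dataset, copied from Table 10.
RESUME_F1 = {
    "Zhang et al.": 94.46,
    "Zhu et al.": 94.94,
    "LSTM-CRF": 91.06,
    "STROKE-LSTM-CRF": 92.51,
    "BERT-LSTM-CRF": 94.99,
    "BERT-STROKE-LSTM-CRF": 95.18,
    "BiLSTM-CRF": 92.29,
    "STROKE-BiLSTM-CRF": 93.39,
    "BERT-BiLSTM-CRF": 95.31,
}
ECEM_F1 = 95.83


def gap_to_ecem(scores, reference):
    """Return {model: reference - F1} in percentage points (assumed sign convention)."""
    return {model: round(reference - f1, 2) for model, f1 in scores.items()}


if __name__ == "__main__":
    gaps = gap_to_ecem(RESUME_F1, ECEM_F1)
    # List the models from the largest gap to the smallest.
    for model, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
        print(f"{model:22s} {gap:5.2f}")
```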

Figure 6.

Relative error of the F1 score of each model compared with ECEM: (a) comparison results on the ancient datasets and (b) comparison results on the Resume dataset.

Conclusion

Existing NER methods mainly focus on modern corpora. Researchers have occasionally explored domain-specific NER, for example on microblog, biomedical, telecommunications and legal corpora. However, little research addresses named entity recognition in ancient literature. As large numbers of ancient books and documents are digitised, it will be of great significance if NER technology can help researchers efficiently extract useful entity information from them.

ECEM, which is based on enhanced character embedding, recognises entities in both ancient literature and modern corpora. The algorithm not only captures context information and abundant knowledge by fine-tuning BERT, but also flexibly acquires morphological information generated from strokes. Extensive experiments with different parameter settings are conducted on three datasets to explore the recognition effect of enhanced character embedding. The results indicate that the ECEM algorithm improves substantially on the traditional models. The proposed algorithm also gives guidelines for future research in this domain.

Although this paper has addressed some problems, some issues remain untouched in the model. A limited training set leads to the out-of-vocabulary problem. For example, consider the sentence “(Chang Jianliang is an associate professor of Beijing Wuzi University)”. If the organisation name “(Beijing Wuzi University)” is not in the training set, it will not be recognised by ECEM. When there are fewer training examples, it is difficult to capture sufficient morphological and contextual information. Therefore, ECEM needs further improvement and exploration in future research. We plan to cut the training time without reducing accuracy, and to apply new neural network models to Chinese NER problems.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is supported by the National Key R&D Program of China (No.2018YFC0831500), the National Natural Science Foundation of China (No. 61972047), the Key Project of Natural Science Research of Universities in Anhui (No.KJ2020A0062), and the Key Project of University Outstanding Young Talents Project in Anhui (No.gxyqZD2018069). We are grateful to the anonymous reviewers for their careful reading.

ORCID iD

Bingjing Jia

References

1. Bunescu, Mooney. A shortest path dependency kernel for relation extraction. In: Proceedings of the conference on human language technology and empirical methods in natural language processing, Vancouver, British Columbia, Canada, October 2005, pp. 724–731. Association for Computational Linguistics.

2. Yang, Liu, Qian, et al. Information extraction from electronic medical records using multitask recurrent neural network with contextual word embedding. Appl Sci 2019; 9(18): 3658.

3. Chen, Peng, Shan, et al. Chinese named entity recognition with conditional probabilistic models. In: Proceedings of the fifth SIGHAN workshop on Chinese language processing, Sydney, Australia, July 2006, pp. 173–176. Association for Computational Linguistics.

4. Yao, van Durme. Information extraction over structured data: question answering with freebase. In: Proceedings of the 52nd annual meeting of the association for computational linguistics (Volume 1: Long Papers), Baltimore, Maryland, 2014, pp. 956–966. Association for Computational Linguistics.

5. Damiano, Minutolo, Silvestri, et al. Query expansion based on wordnet and word2vec for Italian question answering systems. In: International conference on P2P, parallel, grid, cloud and internet computing, Barcelona, Spain, Nov 8–Nov 10, 2017, pp. 301–313. Springer.

6. Zaghouani. Renar: a rule-based Arabic named entity recognition system. ACM Trans Asian Lang Inform Process (TALIP) 2012; 11(1): 2.

7. Singh UP, Goyal V, Lehal GS. Named entity recognition system for Urdu. In: Proceedings of COLING 2012, Mumbai, India, December 2012, pp. 2507–2518. The COLING 2012 Organizing Committee.

8. Ding, Xie, Zhang, et al. A neural multi-digraph model for Chinese NER with gazetteers. In: Proceedings of the 57th annual meeting of the association for computational linguistics, Florence, Italy, July 2019, pp. 1462–1467. Association for Computational Linguistics.

9. Lafferty, McCallum, Pereira FCN. Conditional random fields: probabilistic models for segmenting and labeling sequence data. ICML, 2001.

10. Lample, Ballesteros, Subramanian, et al. Neural architectures for named entity recognition. In: Proceedings of NAACL-HLT, San Diego, CA, 2016, pp. 260–270.

11. Peng, Dredze. Improving named entity recognition for Chinese social media with word segmentation representation learning. arXiv preprint arXiv:1603.00786, 2016.

12. Peters, Neumann, Iyyer, et al. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.

13. Radford, Narasimhan, Salimans, et al. Improving language understanding by generative pre-training, https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/languageunderstandingpaper.pdf, 2018.

14. Devlin, Chang M-W, Lee, et al. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.0480, 2018.

15. Zhou, Dai, et al. Chinese named entity recognition with a multi-phase model. In: Proceedings of the fifth SIGHAN workshop on Chinese language processing, Sydney, Australia, 2006, pp. 213–216. Association for Computational Linguistics.

16. Marulli, Pota, Esposito. A comparison of character and word embeddings in bidirectional LSTMs for POS tagging in Italian. In: International conference on intelligent interactive multimedia systems and services, Gold Coast, Australia, 20–22 June, 2018, pp. 14–23. Springer.

17. Pota, Marulli, Esposito, et al. Multilingual POS tagging by a composite deep architecture based on character-level features and on-the-fly enriched word embeddings. Knowl-Based Syst 2019; 164: 309–323.

18. Dong, Zhang, Zong, et al. Character-based LSTM-CRF with radical-level features for Chinese named entity recognition. In: Natural language understanding and intelligent applications, Kunming, China, December 2–6, 2016, pp. 239–250. Springer.

19. Zhu, Wang, Karlsson. CAN-NER: convolutional attention network for Chinese named entity recognition. arXiv preprint arXiv:1904.02141, 2019.

20. Xie. Research and implementation of named entity recognition based on ancient literature. Master’s thesis, Beijing University of Posts and Telecommunications, 2018.

21. Isozaki, Kazawa. Efficient support vector classifiers for named entity recognition. In: Proceedings of the 19th international conference on computational linguistics-Volume 1, Taibei, August 2002, pp. 1–7. Association for Computational Linguistics.

22. Bikel, Miller, Schwartz, et al. Nymble: a high-performance learning name-finder. arXiv preprint cmp-lg/9803003, 1998.

23. Ratinov, Roth. Design challenges and misconceptions in named entity recognition. In: Proceedings of the thirteenth conference on computational natural language learning, Boulder, Colorado, June 2009, pp. 147–155. Association for Computational Linguistics.

24. Zhou, Zhang. Chinese named entity recognition via joint identification and categorization. Chinese J Electron 2013; 22(2): 225–230.

25. Hammerton. Named entity recognition with long short-term memory. In: Proceedings of the seventh conference on natural language learning at HLT-NAACL 2003-Volume 4, Edmonton, Canada, May 27–June 1, 2003, pp. 172–175. Association for Computational Linguistics.

26. Collobert, Weston, Bottou, et al. Natural language processing (almost) from scratch. J Mach Learn Res 2011; 12: 2493–2537.

27. dos Santos, Guimaraes. Boosting named entity recognition with neural character embeddings. arXiv preprint arXiv:1505.05008, 2015.

28. Huang. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991, 2015.

29. Abdo Ali, Tan, Hussain. Boosting Arabic named-entity recognition with multi-attention layer. IEEE Access 2019; 7: 46575–46582.

30. Zhang, et al. Chinese NER using dynamic meta-embeddings. IEEE Access 2019; 7: 64450–64459.

31. Wang, Chen. Named entity recognition with gated convolutional neural networks. In: Chinese computational linguistics and natural language processing based on naturally annotated big data, Nanjing, China, October 13–15, 2017, pp. 110–121. Springer.

32. Cao, Chen, Liu, et al. Adversarial transfer learning for Chinese named entity recognition with self-attention mechanism. In: Proceedings of the 2018 conference on empirical methods in natural language processing, Brussels, Belgium, October-November, 2018, pp. 182–192.

33. Zhang, Yang. Chinese NER using lattice LSTM. arXiv preprint arXiv:1805.02023, 2018.

34. Yan, Deng, et al. TENER: adapting transformer encoder for name entity recognition. arXiv preprint arXiv:1911.04474, 2019.

35. Cao, Zhou, et al. cw2vec: learning Chinese word embeddings with stroke n-gram information. In: Thirty-second AAAI conference on artificial intelligence, Louisiana, USA, February 2–7, 2018. AAAI.

36. Yang, Xie, Lin, et al. End-to-end open-domain question answering with BERTserini. arXiv preprint arXiv:1902.01718, 2019.

37. Wang, Liu, Zhu, et al. A text abstraction summary model based on BERT word embedding and reinforcement learning. Appl Sci 2019; 9(21): 4701.

38. Yang, Zhang, Lin. Simple applications of BERT for ad hoc document retrieval. arXiv preprint arXiv:1903.10972, 2019.

39. Adhikari, Ram, Tang, et al. DocBERT: BERT for document classification. arXiv preprint arXiv:1904.08398, 2019.

40. Huang, Cheng, Chen, et al. Toward fast and accurate neural Chinese word segmentation with multi-criteria learning. arXiv preprint arXiv:1903.04190, 2019.

41. Che, Wang, Manning, et al. Named entity recognition with bilingual constraints. In: Proceedings of the 2013 conference of the North American chapter of the association for computational linguistics: human language technologies, Atlanta, Georgia, June 2013, pp. 52–62. Association for Computational Linguistics.

42. MUC-6. The sixth in a series of message understanding conferences, http://cs.nyu.edu/cs/faculty/grishman/muc6.html, 1995.