Keywords extraction in Chinese–Vietnamese bilingual news based on hypergraph

Abstract

Keyword extraction from Chinese–Vietnamese bilingual news events plays an important role in understanding those events: it makes it possible to quickly locate and briefly compare how the two countries report the same event. Chinese–Vietnamese news texts are typical unstructured big data, and extracting the keywords that characterize the news from such data is a difficult problem in unstructured big data analysis. Bilingual documents are hard to process jointly because Chinese and Vietnamese do not share the same language space. A hypergraph model, however, can express the multiple relations of vocabulary association and entity association in bilingual news. Therefore, a hypergraph-based method for bilingual news keyword extraction is proposed. In this method, words are extracted from the bilingual news to construct a bilingual word set, and the words are taken as vertices. Chinese and Vietnamese sentences and sets of bilingual words with the same meaning are taken as two different types of hyperedges, and the bilingual word frequency is used as an attribute, to construct a bilingual news word hypergraph model. Then, the directional diffusion algorithm from wireless sensor networks is used to iteratively calculate the vertex weights and thereby extract the keywords of the Chinese–Vietnamese bilingual news. The experimental results show that the proposed hypergraph method outperforms the single-document extraction method and better captures the keywords of bilingual unstructured text data.

Keywords

Chinese–Vietnamese bilingual news events, keywords extraction, hypergraph model, directional diffusion

Introduction

With the rapid development of the Internet, the amount of Chinese–Vietnamese-related news is increasing. Understanding these unstructured data effectively plays an important role in promoting political, economic, and cultural exchanges between the two countries. Keyword extraction is an important direction of text data mining.^1,2 Keywords are the smallest units that describe news events and are also key to understanding unstructured text data. Keyword extraction is important in automatic document summarization, web page extraction, document classification and clustering, and information retrieval. For Chinese–Vietnamese bilingual news keyword extraction, how to represent the complex structure of Chinese and Vietnamese bilingual news documents is very important.

To address these problems, this article introduces a hypergraph to represent Chinese–Vietnamese bilingual news documents and associates bilingual vocabulary through a bilingual dictionary, so that documents from different language spaces are mapped to the same space. Once the documents are mapped to the same space, we can iteratively calculate the change of weights to achieve keyword extraction.

The remainder of this article is organized as follows. Section “Related works” discusses the main keyword extraction algorithms. Section “An algorithm for extracting words from Chinese–Vietnam news based on hypergraph” describes how to construct a hypergraph model of the Chinese–Vietnamese news documents. Section “Chinese–Vietnamese keyword extraction based on directed diffusion” discusses how to iteratively extract keywords on the hypergraph model. Section “Experiments” describes the experiments and analysis. Section “Conclusion” presents the conclusion and future work.

Related works

Keyword extraction research falls mostly into single-document keyword extraction, single-language multi-document keyword extraction, and multi-language multi-document keyword extraction. Single-document keyword extraction extracts keywords from one document only, mainly using word frequency, the relationships between words, and topic information. Salton and Buckley³ rank the words in a document by their Term Frequency–Inverse Document Frequency (TF-IDF) value, computed from the Term Frequency (TF) of each word in the document and its Inverse Document Frequency (IDF). The algorithm proposed by Y Matsuo and M Ishizuka⁴ divides the candidate words into different subsets by word frequency and judges whether a word is a keyword of the document based on the degree of bias of the word toward these subsets. R Mihalcea and P Tarau⁵ take words as nodes, convert the text into a graph through word co-occurrence, and finally extract the keywords through iteration. DM Blei et al.⁶ proposed latent Dirichlet allocation (LDA). Z Liu et al.⁷ and R Saeidi et al.⁸ applied the LDA model to keyword extraction and achieved good results. Single-language multi-document keyword extraction extracts keywords that express the main theme of a set of news documents written in the same language. It mainly uses information about words within documents and information relating the documents to one another. KM Hammouda and DN Matute⁹ extract key phrases for multi-document collections and clusters with the CorePhrase algorithm. Y Jie and J Duo¹⁰ use the Average Term Frequency (ATF) × Proportional Document Frequency (PDF) method to select candidate words in the document set and then use a combined-weights method to extract the keywords according to the semantic similarity between the candidate words. The ATF × PDF method determines the weight of a word as the product of the average frequency of the word in the entire document set and the proportion of documents that contain the word. Multi-language multi-document keyword extraction has received less research attention. Its main purpose is to extract, from documents in several languages, a bilingual keyword set that expresses the news event; concretely, it is a bilingual multi-document understanding problem. However, because different languages cannot be computed in the same space, extraction algorithms designed for a single-language environment are not fully suitable for Chinese–Vietnamese bilingual news keyword extraction.

Statistical keyword extraction obtains statistical information about terms but cannot describe the relationship between words and sentences. Graph-based keyword extraction considers the relationships between words, but it is limited to simple binary co-occurrence relations and cannot express the multiple relations of cross-language documents well. Keyword extraction based on topic models depends too heavily on the topic distribution, and it is difficult to represent cross-language documents in the same space. Therefore, this article uses a hypergraph to model Chinese–Vietnamese bilingual news documents.

The hypergraph is a generalization of the ordinary graph. In an ordinary graph, an edge can only connect two vertices that have a certain relationship. In reality, however, the relationships among objects are usually much more complex. A hyperedge of a hypergraph can contain multiple vertices. For example, when constructing a hypergraph for a document set, a hyperedge can represent an author, and each vertex of this hyperedge can represent one of the author's works, so the hypergraph expresses the relationship between the author and the works well. Therefore, we believe that the hypergraph can better express multivariate relationships. Chinese–Vietnamese bilingual news event keyword extraction is a multi-lingual document understanding problem. Bilingual news events that are related have corresponding bilingual entities and are also related in terms of time, place, and reason. These relationships are complex, but the hypergraph structure can characterize bilingual multivariate relationships. Therefore, this article constructs a bilingual news event word hypergraph model and uses it to obtain bilingual news keywords.

An algorithm for extracting words from Chinese–Vietnam news based on hypergraph

Hypergraph

The hypergraph is based on set theory and graph theory and was first proposed by C Berge^11–13 in 1973. Since then, hypergraph theory has developed rapidly in computer science and artificial intelligence. The concept of a hypergraph is defined as follows:

Let a hypergraph be $H = (V, E)$ , where $V = \{v_{1}, v_{2}, v_{3}, \dots, v_{n - 1}, v_{n}\}$ is the vertex set of the hypergraph and $v_{i}$ is a vertex of the hypergraph. $E = \{e_{1}, e_{2}, e_{3}, \dots, e_{m - 1}, e_{m}\}$ is a set of non-empty subsets of $V$ called hyperedges. The number of vertices contained in a hyperedge is called the hyperedge degree, defined as $δ (e) = | e |$ . In general, the vertices and hyperedges of the hypergraph have corresponding weights.

The incidence matrix of the hypergraph is $H = (a_{ij})$ , where $i = 1, 2, 3, \dots, n$ indexes the vertices of the hypergraph and $j = 1, 2, 3, \dots, m$ indexes the hyperedges of the hypergraph. The value of $a_{ij}$ is defined as follows

a_{ij} = \begin{cases} 1, & v_{i} \in e_{j} \\ 0, & v_{i} \notin e_{j} \end{cases}

(1)

The elements of the weighted hypergraph incidence matrix $H_{w}$ ¹⁴ are defined as follows

h_{w} (v, e) = \begin{cases} w (v_{e}), & \text{if } v \in e \\ 0, & \text{if } v \notin e \end{cases}

(2)

where $w (v_{e})$ represents the weight of the vertex $v$ in the hyperedge $e$ .

The degree of the hyperedges in the weighted hypergraph is defined as follows

δ (e_{w}) = \sum_{v \in V} h_{w} (v, e)

(3)

The hypergraph model reflects the various relationships in news documents well. If the words of a news document are taken as vertices, a hyperedge can represent a sentence in the document. The model can be made more precise by assigning weights to the hyperedges: a sentence that contains multiple core words is evidently important, so the hyperedge that represents it can be given a higher weight. The hypergraph model not only reflects the inclusion relation between words and sentences but also describes attribute information about sentences, which helps the extraction of keywords.
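As an illustration of formulas (1) to (3), the following minimal sketch builds the incidence matrix and the weighted hyperedge degrees for a toy hypergraph. The words, the weights, and the function names are made up for the example, and the sketch assumes that each vertex carries a single weight that does not depend on the hyperedge.

```python
def incidence_matrix(vertices, hyperedges):
    """Formula (1): n x m matrix with 1 where vertex i belongs to hyperedge j."""
    return [[1 if v in e else 0 for e in hyperedges] for v in vertices]


def weighted_incidence_matrix(vertices, hyperedges, vertex_weight):
    """Formula (2): the vertex weight where v is in e, otherwise 0.

    Assumption: one weight per vertex, shared by all hyperedges.
    """
    return [[vertex_weight[v] if v in e else 0.0 for e in hyperedges]
            for v in vertices]


def weighted_edge_degrees(h_w):
    """Formula (3): column sums of the weighted incidence matrix."""
    return [sum(column) for column in zip(*h_w)]


# Toy data (hypothetical): four words and two sentence hyperedges.
vertices = ["fishermen", "arrest", "boat", "sea"]
hyperedges = [{"fishermen", "arrest"}, {"fishermen", "boat", "sea"}]
vertex_weight = {"fishermen": 0.5, "arrest": 0.25, "boat": 0.25, "sea": 0.125}

H = incidence_matrix(vertices, hyperedges)
H_w = weighted_incidence_matrix(vertices, hyperedges, vertex_weight)
edge_degree = [sum(column) for column in zip(*H)]  # delta(e) = |e|
weighted_degree = weighted_edge_degrees(H_w)
```

Here `edge_degree` counts the vertices in each hyperedge, while `weighted_degree` sums their weights.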

Build a hypergraph model

This article takes Chinese and Vietnamese bilingual news event documents as the research object. We first crawl news from the Internet and obtain the documents of related news events through news text clustering and related-event analysis. Then, after word segmentation, part-of-speech tagging, and entity recognition are applied to the Chinese and Vietnamese news documents, the stop words are removed, and nouns, verbs, adjectives, time words, position words, and place words are selected as the word set that represents the news. Because entities, times, places, and logical relations recur within news event documents, words and entities in different documents are also associated with one another. Therefore, the bilingual news documents can be related through the Chinese–Vietnamese bilingual dictionary and the entity correspondence library, establishing multiple relations such as word association and entity association between Chinese–Vietnamese bilingual events.

The hypergraph model can express the collection of Chinese and Vietnamese bilingual news documents and describe the multiple relations between Chinese and Vietnamese words. The hypergraph vertices are used to represent words of the Chinese and Vietnamese bilingual news document, and the hyperedges are used to express the sentences in the Chinese and Vietnamese bilingual news document.

The hypergraph model is constructed as follows. In Figure 1, $e_{2}$ represents a sentence in the Chinese news document and $e_{3}$ represents a sentence in the Vietnamese news document; each sentence is composed of words. $V_{c 1}, V_{c 2}, \dots, V_{u 4}$ are the words in the Chinese and Vietnamese news documents. The similarity of words across the Chinese–Vietnamese bilingual news documents is calculated by looking them up in the Chinese–Vietnamese dictionary. Words that appear, with the same meaning, in the sentences of both the Chinese and the Vietnamese news documents are reported by both sides, so they are included in a new hyperedge, that is, $e_{1}$ in Figure 1. In this way, news documents in different languages are linked together to express their vocabulary associations. Entity associations between them are expressed through lookups in the entity correspondence library.

Figure 1.

Chinese–Vietnamese bilingual hypergraph model.
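The construction described above can be sketched as follows. This is a simplified illustration, not the article's implementation: the dictionary is a toy one-to-one mapping standing in for the bilingual dictionary and entity correspondence library, all shared words are placed in a single same-meaning hyperedge, and the function name and the `("zh", word)` / `("vi", word)` vertex encoding are assumptions of the example.

```python
def build_bilingual_hypergraph(zh_sentences, vi_sentences, dictionary):
    """Return (vertices, hyperedges) from tokenized sentences.

    Each hyperedge is a (type, frozenset_of_vertices) pair. `dictionary`
    maps a Chinese word to a Vietnamese word (hypothetical toy lexicon).
    """
    hyperedges = []
    # One hyperedge per sentence, tagged with its language.
    for sentence in zh_sentences:
        hyperedges.append(("sentence_zh", frozenset(("zh", w) for w in sentence)))
    for sentence in vi_sentences:
        hyperedges.append(("sentence_vi", frozenset(("vi", w) for w in sentence)))

    zh_words = {w for s in zh_sentences for w in s}
    vi_words = {w for s in vi_sentences for w in s}

    # Words reported on both sides go into one extra hyperedge.
    shared = set()
    for zh_word, vi_word in dictionary.items():
        if zh_word in zh_words and vi_word in vi_words:
            shared.add(("zh", zh_word))
            shared.add(("vi", vi_word))
    if shared:
        hyperedges.append(("same_meaning", frozenset(shared)))

    vertices = sorted({v for _, edge in hyperedges for v in edge})
    return vertices, hyperedges


zh_sentences = [["渔民", "逮捕"]]
vi_sentences = [["ngư dân", "tạm giam"]]
dictionary = {"渔民": "ngư dân"}
vertices, hyperedges = build_bilingual_hypergraph(zh_sentences, vi_sentences, dictionary)
```

In this toy case the result has two sentence hyperedges and one same-meaning hyperedge linking 渔民 and ngư dân.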

Vertex weight calculation

In the Chinese–Vietnamese bilingual news hypergraph model, the vertices represent the words of Chinese news documents and Vietnamese news documents. When computing vertex weights, this article combines some ideas of single-document news keyword extraction and at the same time takes into account the characteristics of bilingual news documents and finally selects the following features to participate in the vertex weight calculation.

1. Word span factor. Word span refers to the distance between the first occurrence and the last occurrence of a word in a document, indicating the extent over which the word appears in the document. The larger the value, the wider the word's range of influence in the text. The advantage of word span is that it shows whether a particular word appears throughout the text or only in part of it.

The word span is calculated as follows

span (t) = \frac{last (t) - first (t) + 1}{sum (s)} \times \frac{d_{t}}{D}

(4)

In formula (4), $first (t)$ is the position of the first occurrence of word $t$ in a document, $last (t)$ is the position of the last occurrence of $t$ in the document, $sum (s)$ is the total number of words obtained after word segmentation, $d_{t}$ is the number of paragraphs of the document in which $t$ appears, and $D$ is the total number of paragraphs in the document.

Word span can effectively reduce the impact of local keywords on the final extraction results.
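A small sketch of formula (4) on a toy document follows. It assumes 1-based word positions and a document given as a list of tokenized paragraphs; the function name and the data are illustrative only.

```python
def word_span(term, paragraphs):
    """Formula (4): (last - first + 1) / sum(s) * d_t / D.

    `paragraphs` is a list of token lists. Positions are 1-based
    (an assumption; the article does not state the indexing).
    """
    tokens = [tok for para in paragraphs for tok in para]
    positions = [i for i, tok in enumerate(tokens, start=1) if tok == term]
    if not positions:
        return 0.0
    spread = (positions[-1] - positions[0] + 1) / len(tokens)
    paragraph_ratio = sum(1 for para in paragraphs if term in para) / len(paragraphs)
    return spread * paragraph_ratio


paragraphs = [
    ["china", "vietnam", "boat"],
    ["boat", "sea"],
    ["vietnam", "sea", "china"],
]
span_china = word_span("china", paragraphs)  # spans the whole text, 2 of 3 paragraphs
span_boat = word_span("boat", paragraphs)    # two adjacent positions, 2 of 3 paragraphs
```

The word that occurs at both ends of the text receives a much larger span than the word that occurs only locally, which is the effect the factor is meant to capture.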

2. The part-of-speech factors. The part of speech has a positive effect on keyword extraction.^15–17 The part of speech of a word expresses the function of the word in a sentence. According to syntactic rules, words of different parts of speech tend to have different degrees of importance. For example, nouns are often used to describe the concept of an entity and can express specific things. Therefore, most of the keywords in a news document are nouns. Conjunctions and prepositions, in contrast, are less capable of expressing specific things and almost never become keywords.

In this article, 100 news pages are randomly selected from Tencent News, Sina News, and NetEase News, and these news documents are manually annotated with keywords. Then, the keywords are counted by part of speech, and these frequencies are used to define the part-of-speech weight $pos (t)$ , as shown in Table 1.

3. Location factors. News writing requires a clear structure, so most news documents briefly introduce the main content of the news event at the beginning. Therefore, the earlier a word appears in a news document, and especially if it appears in the title, the more likely it is to become a keyword.

Table 1.

Part of speech and weights.

Part of speech	pos(t)
Nouns and verbs	1
Time words, location words, and place words	0.5
Adjectives and adverbs	0.2
Others	0

The formula of position weight of words is as follows

loc (t) = 1 - \frac{first (t)}{sum (s)}

(5)

In formula (5), $first (t)$ is the position of the first occurrence of the word $t$ in a news document and $sum (s)$ is the total number of words in the news document. In particular, when the word $t$ appears in the title, $loc (t)$ takes the value 1.
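The part-of-speech weight of Table 1 and the position weight of formula (5) can be sketched together. The tag names in `POS_WEIGHT` are hypothetical labels chosen for the example (a real tagger uses its own tag set), and positions are assumed to be 1-based.

```python
# Weights from Table 1; the tag names are illustrative, not a real tag set.
POS_WEIGHT = {
    "noun": 1.0, "verb": 1.0,
    "time": 0.5, "location": 0.5, "place": 0.5,
    "adjective": 0.2, "adverb": 0.2,
}


def pos_weight(tag):
    """pos(t): any tag not listed in Table 1 gets weight 0."""
    return POS_WEIGHT.get(tag, 0.0)


def loc_weight(term, title_tokens, body_tokens):
    """Formula (5): 1 - first(t) / sum(s), or 1 if the term is in the title."""
    if term in title_tokens:
        return 1.0
    if term not in body_tokens:
        return 0.0
    first = body_tokens.index(term) + 1  # 1-based position (assumption)
    return 1.0 - first / len(body_tokens)


title = ["boat", "collision"]
body = ["china", "vietnam", "boat", "sea"]
loc_boat = loc_weight("boat", title, body)        # in the title
loc_vietnam = loc_weight("vietnam", title, body)  # second of four body words
```

With 1-based positions, the last word of the body receives a position weight of 0.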

4. Bilingual word frequency factors. Taking into account the characteristics of bilingual news documents, we add the characteristic information of bilingual word frequency to the vertices of the Chinese–Vietnamese news hypergraph model.

One of the most prominent features of bilingual news is that, when the same news event is described, the content reported by the two sides differs because of their different national standpoints. Table 2 shows the differences between Chinese and Vietnamese news reports on the same incident.

Table 2.

Differences in bilingual news reports.

Vietnamese fishermen were arrested by China
Chinese news report	Vietnamese news report
渔民 (fishermen)	ngư dân (fishermen)
逮捕 (arrest)	tạm giam (detention)
驱离 (expelled)	mối đe dọa (threats)
非法捕鱼 (illegal fishing)	giam giữ độc hại (malicious detention)

In bilingual news keyword extraction, conventional TF-IDF cannot accurately express the importance of a word, because in a bilingual document set the news content reported by the two countries may differ. Certain words occur frequently in the news of one country but are almost absent from the news documents of the other. If TF-IDF is computed blindly over the entire bilingual news document set, a large error is inevitable. Therefore, in the Chinese–Vietnamese bilingual news hypergraph model, we restrict the TF-IDF value of a word to be calculated only over the news documents in the same language. In this way, errors caused by differences in the news content of the two countries can be avoided.

Keyword extraction in a bilingual environment must also take into account words that are infrequent in one country’s news media but may be important in the other country's. Based on this analysis, we propose the bilingual word frequency feature of a word, which is calculated as follows

cross (t) = tf (t) \times \log \frac{C}{C_{t} + 1}

(6)

In formula (6), $tf (t)$ is the TF-IDF value of the word $t$ in the news document set of its own language, $C$ is the number of news documents in the document set of the other language, and $C_{t}$ is the number of documents in that set containing words semantically similar to $t$ . For example, if $t$ is the Vietnamese word “” (harassment), then $C$ is the number of documents in the Chinese news document set, and $C_{t}$ is the number of Chinese news documents in which “harassment” appears. For a word, the higher its TF-IDF value in its own language and the lower the number of documents containing corresponding words in the other-language document set, the higher its bilingual word frequency value will be.
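Formula (6) can be illustrated with a short sketch. The article does not state the base of the logarithm, so the natural logarithm is assumed here; the function name, the toy documents, and the returned value for an empty document set are also assumptions of the example.

```python
import math


def cross_weight(tfidf_same_language, other_language_docs, translations):
    """Formula (6): tf(t) * log(C / (C_t + 1)).

    `other_language_docs` is a list of token sets in the other language;
    `translations` holds the words considered semantically similar to t.
    """
    c = len(other_language_docs)
    if c == 0:
        return 0.0  # assumption: no evidence from the other language
    c_t = sum(1 for doc in other_language_docs
              if any(word in doc for word in translations))
    return tfidf_same_language * math.log(c / (c_t + 1))  # natural log (assumption)


# Toy Chinese document set (hypothetical) for a Vietnamese word meaning "harassment".
chinese_docs = [{"渔民", "骚扰"}, {"渔船", "海域"}, {"主权"}, {"钻井平台"}]
cross = cross_weight(0.5, chinese_docs, {"骚扰"})
```

With four documents in the other language and one of them containing a translation, the logarithm term is log(4 / 2). Note that the term becomes negative when translations occur in every document of the other language, since then $C_{t} + 1 > C$.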

In summary, we can get the final calculation formula of the vertex weights in Chinese–Vietnamese bilingual news hypergraph model

W (v_{i}) = α \cdot cross (t) + β \cdot span (t) + γ \cdot pos (t) + ε \cdot loc (t)

(7)

Here, $α$ , $β$ , $γ$ , and $ε$ are the proportion factors of the individual weights; their values represent the contribution of the corresponding weight to the total weight. In this article, $α$ , $β$ , $γ$ , and $ε$ are set to 0.4, 0.2, 0.2, and 0.2, respectively.
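As a worked example of formula (7), the sketch below combines four feature values with the proportion factors 0.4, 0.2, 0.2, and 0.2 given in the text. The feature values themselves are invented for illustration.

```python
# Proportion factors from the text: alpha, beta, gamma, epsilon.
FACTORS = {"cross": 0.4, "span": 0.2, "pos": 0.2, "loc": 0.2}


def vertex_weight(features, factors=FACTORS):
    """Formula (7): weighted sum of the four word features."""
    return sum(factors[name] * features[name] for name in factors)


# Hypothetical feature values for one word.
features = {"cross": 0.5, "span": 0.25, "pos": 1.0, "loc": 0.75}
weight = vertex_weight(features)  # 0.4*0.5 + 0.2*0.25 + 0.2*1.0 + 0.2*0.75
```

For these values the weight is 0.6, up to floating-point rounding.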

Hyperedge weight calculation

There are two different types of hyperedges in the Chinese–Vietnamese bilingual news hypergraph model. One type represents a sentence in a news document, and the other represents a set of same-meaning words in the Chinese–Vietnamese bilingual news document set.

For a hyperedge that represents a sentence, the underlying idea is that the more important the words contained in a sentence, the more important the sentence. Therefore, the weight of a sentence is determined by the weights of all the words it contains. At the same time, long sentences are penalized so that a sentence that is not important does not receive a high weight merely because it contains many words.

The calculation formula for the weight of the hyperedge is as follows

W (e_{i}) = \frac{1 + \sum_{v \in e_{i}} w (v)}{δ {(e_{i})}^{2}}

(8)

In formula (8), $\sum_{v \in e_{i}} w (v)$ represents the sum of the weights of all the vertices in the hyperedge $e_{i}$ , and $δ (e)$ represents the degree of the hyperedge.

The hypergraph model can express the inclusion relationship between words and sentences. In general, important sentences will get a higher weight, and the words of them rank higher in the rankings.
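The effect of the length penalty in formula (8) can be seen in a small sketch. The vertex weights and the function name are illustrative.

```python
def sentence_edge_weight(edge, vertex_weight):
    """Formula (8): (1 + sum of vertex weights) / degree squared."""
    degree = len(edge)
    return (1.0 + sum(vertex_weight[v] for v in edge)) / degree ** 2


# Hypothetical weights: every word is equally important.
vertex_weight = {"a": 0.5, "b": 0.5, "c": 0.5, "d": 0.5}
short_sentence = {"a", "b"}
long_sentence = {"a", "b", "c", "d"}

w_short = sentence_edge_weight(short_sentence, vertex_weight)  # (1 + 1.0) / 4
w_long = sentence_edge_weight(long_sentence, vertex_weight)    # (1 + 2.0) / 16
```

Although the long sentence has a larger total word weight, dividing by the squared degree gives it a lower hyperedge weight than the short sentence.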

For the hyperedges that represent sets of same-meaning words in the bilingual news documents, we consider that if some words appear frequently in both the Chinese and the Vietnamese news documents, these words are likely to be related to the news event. Even though the Chinese and Vietnamese descriptions differ in point of view and content, they describe the same news event; for example, for “Vietnamese fishermen arrested by China,” the descriptions of time, place, people, and specific figures are the same on both sides. At the same time, many words such as “starting” and “to make” appear in both the Chinese and the Vietnamese news documents but carry little value for the documents themselves. Therefore, the weight of such hyperedges must take into account the importance of the Chinese and Vietnamese words contained in the hyperedge, as shown in the following formula

W (e_{j}) = \frac{1 + \sum_{v \in e_{j}} w (v)}{δ {(e_{j})}^{2}} \times \sqrt{\frac{1}{N - 1} \sum_{v \in e_{j}} {(tf (v) - \bar{tf (v)})}^{2}}

(9)

In formula (9), $N$ is the number of vertices included in the hyperedge $e_{j}$ , $tf (v)$ is the TF-IDF value of one vertex in the hyperedge, and $\bar{tf (v)}$ is the average TF-IDF value of all vertices in the hyperedge.
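A minimal sketch of formula (9) follows. It uses `statistics.stdev` from the Python standard library, which computes the sample standard deviation with the $N - 1$ denominator that appears in the formula and requires at least two values. The words, weights, and TF-IDF values are invented for the example.

```python
import math
import statistics


def same_meaning_edge_weight(edge, vertex_weight, tfidf):
    """Formula (9): the formula (8) term times the sample standard
    deviation of the TF-IDF values of the vertices in the hyperedge."""
    degree = len(edge)
    base = (1.0 + sum(vertex_weight[v] for v in edge)) / degree ** 2
    spread = statistics.stdev([tfidf[v] for v in edge])  # needs len(edge) >= 2
    return base * spread


edge = ["渔民", "ngư dân"]
vertex_weight = {"渔民": 0.5, "ngư dân": 0.5}
tfidf = {"渔民": 0.2, "ngư dân": 0.4}
w_edge = same_meaning_edge_weight(edge, vertex_weight, tfidf)

# Identical TF-IDF values on both sides give a zero standard deviation.
w_flat = same_meaning_edge_weight(edge, vertex_weight, {"渔民": 0.3, "ngư dân": 0.3})
```

As written, the formula assigns weight 0 to a hyperedge whose vertices all have the same TF-IDF value, which is worth keeping in mind when choosing the TF-IDF inputs.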

Chinese–Vietnamese keyword extraction based on directed diffusion

Wireless sensor networks have a large number of nodes and are data-centric.¹⁸ At an abstract level, wireless sensor networks and hypergraphs are similar. The directional diffusion protocol¹⁹ is a query-based routing mechanism and a typical data-centric one. The protocol establishes a route by broadcasting an interest message to the network. The message can traverse every node of the wireless sensor network, and each node maintains an interest list to record interest information. This protocol can be applied to the iterative calculation of vertex weights on hypergraphs.

The iterative weighting of hypergraphs uses the interest diffusion phase of the directional routing protocol. In addition, certain changes are made to the original route-directed diffusion protocol. The specific description is as follows.

In the first step, starting from the current node $u$ , a hyperedge $e$ containing $u$ is selected at random, with probability equal to the ratio of the weight of $e$ to the sum of the weights of all the hyperedges containing $u$ .

In the second step, within the selected hyperedge $e \in E (u) \cap E (v)$ , the target vertex $v$ to jump to is selected with probability equal to the ratio of the weight of $v$ to the sum of the weights of all the vertices in $e$ . At the same time, the broadcast round is recorded in the interest list of each node. Within the same round of broadcasts, nodes that have already been broadcast to are not visited again, which ensures that every node is iterated.

The transfer matrix $P$ is defined as follows

P (u, v) = \sum_{e \in E} w (e) \frac{h (u, e)}{\sum_{\hat{e} \in E (u)} w (\hat{e})} \frac{h_{w} (v, e)}{\sum_{\hat{v} \in e} h_{w} (\hat{v}, e)}

(10)

The matrix representation is as follows

P = D_{v}^{- 1} H W_{e} D_{ve}^{- 1} H_{w}^{T}

(11)

where $D_{v}$ is the diagonal matrix of the vertex degrees, $W_{e}$ is the diagonal matrix of the hyperedge weights, $D_{ve}$ is the diagonal matrix of the hyperedge degrees, $H$ is the incidence matrix of the hypergraph, and $H_{w}$ is the weighted incidence matrix of the hypergraph. To avoid self-loops, we set the diagonal elements of $P$ to zero. $P$ is then normalized so that the sum of the squares of the elements in each row is 1.
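The two-step selection described above can be turned into a transition matrix as in the following sketch: from vertex $u$, pick a hyperedge in proportion to its weight among the hyperedges containing $u$, then pick a vertex in that hyperedge in proportion to its vertex weight. The sketch stops there; zeroing the diagonal and the subsequent normalization are not shown. Names and toy weights are assumptions of the example.

```python
def transition_matrix(vertices, hyperedges, edge_weight, vertex_weight):
    """Build P(u, v) by summing, over the hyperedges shared by u and v,
    P(choose e | u) * P(choose v | e)."""
    n = len(vertices)
    P = [[0.0] * n for _ in range(n)]
    for i, u in enumerate(vertices):
        incident = [k for k, e in enumerate(hyperedges) if u in e]
        total_edge_weight = sum(edge_weight[k] for k in incident)
        for k in incident:
            p_edge = edge_weight[k] / total_edge_weight
            total_vertex_weight = sum(vertex_weight[x] for x in hyperedges[k])
            for j, v in enumerate(vertices):
                if v in hyperedges[k]:
                    P[i][j] += p_edge * vertex_weight[v] / total_vertex_weight
    return P


# Toy hypergraph (hypothetical): two hyperedges sharing vertex "b".
vertices = ["a", "b", "c"]
hyperedges = [{"a", "b"}, {"b", "c"}]
edge_weight = [1.0, 3.0]
vertex_weight = {"a": 1.0, "b": 1.0, "c": 2.0}

P = transition_matrix(vertices, hyperedges, edge_weight, vertex_weight)
```

Before the diagonal is zeroed, each row of this matrix sums to 1, so it can be read directly as a table of jump probabilities.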

After that, this article uses the PageRank algorithm to iteratively calculate the vertex weights; $\vec{v}$ is the vertex weight vector to be sorted and $α$ is the damping coefficient, and the formula is defined as follows

\vec{v} (i + 1) = α P^{T} \vec{v} (i) + (1 - α) \frac{\vec{e}}{n}

(12)

In this article, the damping coefficient $α$ has a value of 0.85 and $n$ is the number of vertices in the hypergraph. $\vec{e} \in R^{n \times 1}$ is the all-ones vector of length $n$ . The term $α P^{T} \vec{v}$ corresponds to jumping from the current vertex $u$ along an associated hyperedge, and $(1 - α) \vec{e} / n$ corresponds to jumping to a new vertex with a probability of $(1 - α) / n$ .

During the iteration, whenever the value of $\vec{v}$ changes, the vertex weights change. Correspondingly, the values of $D_{v}$ , $D_{ve}$ , $W_{e}$ , and $W$ also change, and $P$ is recomputed. This process is repeated until, at every position, the absolute difference between two successive iterations is less than the threshold (0.0001), at which point the iteration stops. The $K$ words with the highest scores are then taken as the extracted keywords.
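The iteration of formula (12) with the stopping threshold 0.0001 can be sketched as below. For brevity the sketch keeps $P$ fixed between iterations instead of recomputing it from the updated weights, which is a simplification of the procedure described in the text; the matrix values, the `max_iter` safeguard, and the function names are assumptions of the example.

```python
def rank_vertices(P, alpha=0.85, threshold=1e-4, max_iter=1000):
    """Iterate v <- alpha * P^T v + (1 - alpha) / n until every entry
    changes by less than `threshold`. P is kept fixed (simplification)."""
    n = len(P)
    v = [1.0 / n] * n
    for _ in range(max_iter):
        new_v = [alpha * sum(P[u][j] * v[u] for u in range(n)) + (1 - alpha) / n
                 for j in range(n)]
        converged = max(abs(a - b) for a, b in zip(new_v, v)) < threshold
        v = new_v
        if converged:
            break
    return v


def top_k(vertices, scores, k):
    """Return the k vertices with the highest scores."""
    order = sorted(range(len(vertices)), key=lambda i: scores[i], reverse=True)
    return [vertices[i] for i in order[:k]]


# A small row-stochastic transition matrix (hypothetical values).
vertices = ["a", "b", "c"]
P = [
    [0.5, 0.5, 0.0],
    [0.125, 0.375, 0.5],
    [0.0, 1.0 / 3.0, 2.0 / 3.0],
]
scores = rank_vertices(P)
keywords = top_k(vertices, scores, 2)
```

Because the rows of this `P` sum to 1, the scores continue to sum to 1 throughout the iteration, and `top_k` simply reads off the highest-scoring words.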

Experiments

Experimental data

This article selects 400 news articles from mainstream news media in China and Vietnam. Of these, 200 are Chinese news texts, mainly from websites such as Xinhua News Agency, NetEase News, and Tencent News. The other 200 are Vietnamese news texts, mainly from Vietnam’s People’s Daily, Vietnam News Agency, Vietnam Daily Express, and other mainstream news media in Vietnam. Each document contains about 300–1000 words.

We first extract the keywords of the Chinese news and the Vietnamese news manually. For the Vietnamese news, each article is translated with Google Translate and then read through so that the Vietnamese keywords can be extracted manually. In this way, 20 keywords are manually extracted for each news document. Finally, we use the method proposed in this article to extract cross-language news keywords in Chinese and Vietnamese. The method outputs the top 20 Chinese words and the top 20 Vietnamese words as the keywords of the Chinese and Vietnamese documents, respectively, and these are compared with the manually extracted keywords to verify the validity of the method. The evaluation criteria are precision, recall rate, and F value.
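The three evaluation measures can be computed as in the following sketch. It assumes that an extracted keyword counts as correct only when it exactly matches a manually extracted keyword, which the text does not specify; the keyword lists are shortened, made-up examples.

```python
def evaluate(extracted, reference):
    """Return (precision, recall, F) using exact keyword matches (assumption)."""
    extracted, reference = set(extracted), set(reference)
    hits = len(extracted & reference)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(reference) if reference else 0.0
    f_value = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f_value


extracted = ["中国", "越南", "撞沉", "作业"]
reference = ["中国", "越南", "撞沉", "渔民", "获救"]
precision, recall, f_value = evaluate(extracted, reference)
```

Here three of the four extracted words are correct and three of the five reference words are found, so the F value is the harmonic mean of 0.75 and 0.6.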

Documents preprocessing

To process the Chinese and Vietnamese bilingual news documents effectively, the documents must first undergo word segmentation and part-of-speech tagging. The Institute of Computing Technology, Chinese Lexical Analysis System (ICTCLAS) is used for Chinese word segmentation, part-of-speech tagging, and entity recognition. Vietnamese news documents are analyzed with the vnTokenizer²⁰ toolkit developed by the Hanoi National University in Vietnam, which provides APIs and models for Vietnamese word segmentation and tagging. On this basis, a Vietnamese word segmentation and part-of-speech tagging tool is used to build the Vietnamese word segmentation and part-of-speech tagging platform.²¹

Experiments and analysis

We compared the results of bilingual news keyword extraction with those of keyword extraction in a single-language environment to judge the effect of the bilingual extraction method proposed in this article. The following methods are compared:

Hypergraph-Directed Diffusion (HDD) for Chinese. For the 200 Chinese news documents in the document set, a single-document news keyword extraction method based on hypergraph sorting is used to construct a Chinese news document hypergraph model. The HDD algorithm is then applied iteratively, and the 20 highest-ranked words are output as the keywords of the news documents.

HDD for Vietnamese. Using the single-document hypergraph sorting method, we extract keywords from the 200 Vietnamese documents in the document set and obtain the keywords of the Vietnamese news documents.

HDD for Bilingual. The hypergraph is constructed using the Chinese–Vietnamese bilingual hypergraph model proposed in this article, and the HDD algorithm is then applied to it. The final output is the 20 highest-ranked Chinese words and the 20 highest-ranked Vietnamese words, that is, the keywords of the Chinese–Vietnamese bilingual news documents (Table 3).

Table 3.

Experimental results.

Methods	Precision	Recall rate	F value
HDD for Chinese	0.450	0.498	0.473
HDD for Vietnamese	0.405	0.459	0.430
HDD for Bilingual	0.490	0.541	0.514

The experimental results show that, on the bilingual corpus, the bilingual hypergraph sorting method proposed in this article performs better than single-document hypergraph sorting applied to Chinese and Vietnamese separately. The result of Vietnamese single-document news extraction is the worst, mainly because of the lexical differences between Vietnamese and Chinese: when a method that calculates feature information designed for Chinese words is applied to the Vietnamese news document set, it can produce errors. In Chinese–Vietnamese bilingual news keyword extraction, however, for each word we consider not only the characteristics of the word in its own language environment but also its characteristics in the other language environment. Therefore, bilingual news keyword extraction performs better than single-document keyword extraction. This is also partly because the news documents we chose for the bilingual setting concern important domestic or international news events. For these events, the news media of both countries report on the same event, so although there are some differences in content, the documents also share some similarity.

Table 4 shows the keywords taken under different methods, taking the news document “5.27 China Hitting the Fishing Boat in Vietnam” as an example.

Table 4.

Chinese–Vietnamese news keyword extraction results.

System	The Chinese–Vietnamese news keywords
Manual extraction (China news)	中国 (China), 越南 (Vietnam), 981 钻井平台 (981 drilling platform), 2014年5月27日 (27 May 2014), 西沙群岛海域 (the sea area of the Xisha islands), 17 海里 (17 nautical miles), 越方渔船 (Vietnamese fishing boats), 强行冲闯 (forced to break into), 干扰 (interference), 编号 11,209 (Number 11,209), 撞沉 (sinking), 10 名渔民 (10 fishermen), 落海 (fell into the sea), 获救 (rescued), 正常钻探作业 (normal drilling operation), 警告 (warning), 无理行为 (irrational behavior), 非法捕捞 (illegal fishing), 正当防卫 (justifiable defense), 航行警告 (sailing warning), 固有领土 (inherent territory), 毫无争议 (there is no dispute)
Manual extraction (Vietnam news)	Việt Nam (Vietnam), Trung Quốc (China), Nền tảng khoan 981 (981 drilling platform), Ngày 27 tháng 5 (27 May 2014), Biển Đông (East China Sea), Đánh cá (fishing), Chìm xuống (sinking), Việt Nam thành phố Đà Nẵng (Da Nang, Vietnam), Tàu cá Trung Quốc (Chinese fishing boat), Bao vây (surround), Ngư dân (fishermen), Vụ tấn công nghiêm trọng (serious attack), Được giải thoát (rescued), Nhiều người bị thương (many people were injured), Phá hoại (damage), Bài tập bình thường (normal operation), Vùng đặc quyền kinh tế (exclusive economic zone), Buổi họp (assembly), Phản đối Trung Quốc (oppose China)
HDD for Chinese	中国 (China), 越南 (Vietnam), 981 钻井平台 (981 drilling platform), 5月27日 (27 May 2014), 西沙群岛 (the sea area of the Xisha islands), 越方渔船 (Vietnamese fishing boats), 撞沉 (sinking), 越南渔民 (Vietnamese fishermen), 非法捕捞 (illegal fishing), 海域 (sea area), 正当防卫 (justifiable defense), 中方正常 (China is normal), 外交部 (Ministry of Foreign Affairs), 钻探作业 (drilling operation), 主权 (sovereign), 骚扰 (harassment), 挑衅 (provocation), 作业 (operation), 系 (relationship), 正常 (normal)
HDD for Vietnamese	Việt Nam (Vietnam), Trung Quốc (China), Biển Đông (the east China sea), Vùng biển (sea area), Nền tảng khoan 981 (981 drilling platform), Ngày 27 tháng 5 (27 May), Đà Nẵng (Da Nang), Bắt bài (fishing operation), Chìm xuống (sinking), Tàu cá (fishing boats), Ngư dân (fishermen), Đâm vào tai nạn (collision accident), Va chạm (hit), Tấn công (attack), An toàn (safety), Vùng lãnh thổ (territory), Ngư trường truyền thống (traditional fishing grounds), Ngư dân Khánh Hòa Hiệp hội (Khánh Hòa Fishermen Association), Phản đối (oppose), Vô lý (unreasonable)
HDD for bilingual (China news)	中国 (China), 越南 (Vietnam), 981 钻井平台 (981 drilling platform), 5月27日 (27 May), 西沙群岛海域 (the sea area of the Xisha islands), 越方渔船 (Vietnamese fishing boats), 干扰 (interference), 撞沉 (sinking), 越南渔民 (Vietnamese fishermen), 正常钻探作业 (normal drilling operation), 航行警告 (sailing warning), 非法捕捞 (illegal fishing), 正当防卫 (justifiable defense), 无理 (unreasonable), 外交部 (Ministry of Foreign Affairs), 主权 (sovereignty), 挑衅 (provocation), 作业 (operation), 两国关系 (relations between the two countries), 秦刚 (Qin Gang)
HDD for bilingual (Vietnam news)	Việt Nam (Vietnam), Trung Quốc (China), Nền tảng khoan 981 (981 drilling platform), Ngày 27 tháng 5 (27 May), Biển Đông (East Sea, i.e. the South China Sea), Tàu cá Việt Nam thành phố Đà Nẵng (fishing boat of Da Nang, Vietnam), Tàu cá Trung Quốc (Chinese fishing boats), Chìm xuống (sinking), Ngư dân (fishermen), Bắt bài (fishing), Tấn công (attack), Bị thương (be injured), Vùng đặc quyền kinh tế (exclusive economic zone), Mối đe dọa (threat), Cướp giật (loot), Ngư trường truyền thống (traditional fishing grounds), Ngư dân Khánh Hòa Hiệp hội (Khánh Hòa Fishermen Association), Buổi họp (assembly), Phản đối (oppose), Bất hợp pháp (illegal)

Conclusion

In this article, we propose a hypergraph-based keyword extraction method for Chinese–Vietnamese bilingual news that exploits both the characteristics of news text and the correlations between Chinese and Vietnamese reports of the same event. The method takes bilingual words as vertices, uses bilingual sentences and sets of semantically similar bilingual words as hyperedges, and computes the hyperedge weights from the importance of the words in the sentences. The Chinese–Vietnamese hypergraph model is then constructed with bilingual dictionaries and a bilingual entity library, and the keywords are obtained by iteratively updating the vertex weights. The experimental results demonstrate the effectiveness of the proposed method and show that the multiple relations between the two languages support keyword extraction for Chinese–Vietnamese bilingual events. Future research will focus on how to use bilingual relations such as entities, entity relationships, and sentence relevance to construct multi-language event hypergraphs, in order to find a better way to mine the key data in unstructured big data.
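As a rough illustration of the pipeline summarized above, the following minimal sketch builds a small word hypergraph and ranks its vertices iteratively. It is not the directional diffusion update used in this article: it substitutes a standard random walk with restart on the hypergraph, and it assumes uniform hyperedge weights instead of weights derived from word importance. The function names `build_hypergraph` and `rank_vertices`, the `damping` parameter, and the toy sentences are all hypothetical.

```python
from collections import defaultdict


def build_hypergraph(zh_sentences, vi_sentences, translation_pairs):
    """Return hyperedges as (vertex_set, weight) pairs.

    Each segmented sentence becomes one hyperedge over its words, and each
    translation pair becomes one cross-lingual hyperedge. The uniform weight
    of 1.0 is an assumption made for this sketch.
    """
    edges = []
    for sentence in zh_sentences + vi_sentences:
        if sentence:
            edges.append((frozenset(sentence), 1.0))
    for zh_word, vi_word in translation_pairs:
        edges.append((frozenset({zh_word, vi_word}), 1.0))
    return edges


def rank_vertices(edges, iterations=50, damping=0.85):
    """Score vertices with a random walk with restart on the hypergraph."""
    vertices = sorted({v for members, _ in edges for v in members})
    n = len(vertices)
    score = {v: 1.0 / n for v in vertices}

    # Weighted degree: total weight of the hyperedges containing a vertex.
    degree = defaultdict(float)
    for members, weight in edges:
        for v in members:
            degree[v] += weight

    for _ in range(iterations):
        updated = {v: (1.0 - damping) / n for v in vertices}
        for members, weight in edges:
            # Each vertex sends a share weight / degree of its score into the
            # hyperedge, which then spreads that mass evenly over its members.
            mass = sum(score[v] * weight / degree[v] for v in members)
            for v in members:
                updated[v] += damping * mass / len(members)
        score = updated
    return score


# Toy example with already segmented sentences (illustrative data only).
zh = [["中国", "渔船"], ["中国", "海域"]]
vi = [["Trung Quốc", "tàu cá"], ["Trung Quốc", "vùng biển"]]
pairs = [("中国", "Trung Quốc")]

scores = rank_vertices(build_hypergraph(zh, vi, pairs))
top = sorted(scores, key=scores.get, reverse=True)[:2]
print(top)
```

In this toy case the two words linked by the translation hyperedge collect score from both languages and rank first, which mirrors the intuition that cross-lingual relations reinforce words shared by the Chinese and Vietnamese reports.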

Footnotes

Handling Editor: Songhua Xu

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China (Grant Nos 61472168, 61672271, and 61732005), High-tech Industry Development Project of Yunnan Province (Grant No. 2016ZA006), Science and Technology Leading Talent Program of Yunnan Province (Grant No. 2017HA001), and Major Science and Technology Project of Yunnan Province (Grant No. 2016ZA006).
