Detecting new Chinese words from massive domain texts with word embedding

Abstract

Textual information retrieval (TIR) relies on the relationships between word units. Traditional word segmentation techniques attempt to identify word units accurately in text, but they cannot identify all new words appropriately and efficiently. New word identification remains a challenge, especially in languages such as Chinese. In recent years, word embedding methods have used numerical word vectors to retain the semantic and correlation information between words in a corpus. In this article, we propose the word-embedding-based method (WEBM), a novel method that combines word embedding with frequent n-gram string mining to discover new words in domain corpora. First, we mapped all word units in a domain corpus to a high-dimensional word vector space. Second, we used a frequent n-gram word string mining method to identify a set of new-word candidates. We then designed a pruning strategy based on the word vectors that quantifies the likelihood of a word string being a new word, so that candidates are evaluated by the similarity of the word units within the same string. In a comparative study, our experimental results showed that WEBM had a clear advantage in detecting new words in massive Chinese corpora.
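
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the pruning formula, so the sketch assumes a candidate's score is the mean cosine similarity of adjacent word units. The function names (`frequent_ngrams`, `cohesion`, `detect_new_words`), the thresholds, and the two-dimensional toy vectors are all hypothetical. In practice the vectors would come from an embedding model trained on the domain corpus.

```python
from collections import Counter
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def frequent_ngrams(sentences, max_n=3, min_count=2):
    """Count word-unit n-grams (n = 2..max_n) and keep the frequent ones."""
    counts = Counter()
    for units in sentences:
        for n in range(2, max_n + 1):
            for i in range(len(units) - n + 1):
                counts[tuple(units[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}

def cohesion(gram, vectors):
    """Assumed score: mean cosine similarity of adjacent units in the string."""
    sims = [cosine(vectors[a], vectors[b]) for a, b in zip(gram, gram[1:])]
    return sum(sims) / len(sims)

def detect_new_words(sentences, vectors, max_n=3, min_count=2, min_cohesion=0.8):
    """Mine frequent n-grams, then prune those whose units are dissimilar."""
    candidates = frequent_ngrams(sentences, max_n, min_count)
    kept = [
        "".join(gram)
        for gram in candidates
        if all(u in vectors for u in gram) and cohesion(gram, vectors) >= min_cohesion
    ]
    return sorted(kept)

# Toy corpus, already split into word units by a segmenter (assumption).
sentences = [
    ["区块", "链", "技术", "发展"],
    ["区块", "链", "应用", "发展"],
    ["技术", "发展", "很", "快"],
]

# Hypothetical 2-dimensional vectors standing in for trained embeddings.
vectors = {
    "区块": (1.0, 0.1),
    "链": (0.9, 0.2),
    "技术": (0.1, 1.0),
    "发展": (1.0, 0.0),
    "应用": (0.5, 0.5),
    "很": (0.0, 1.0),
    "快": (0.3, 0.7),
}

print(detect_new_words(sentences, vectors))
```

In this toy setup, both `区块` + `链` and `技术` + `发展` occur twice and pass the frequency filter. Only the first pair has similar enough vectors to survive the pruning step, which is the behaviour the abstract attributes to the vector-based evaluation of candidates.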

Keywords

Natural language processing; new word detection; similarity measurement; textual information retrieval; word embedding
