Automatic Chinese character similarity measurement

Abstract

Automatically identifying Chinese characters that are similar in glyph, pronunciation and meaning is important for building smart question-generation tools in a computer-assisted language-learning environment. Previous research on Chinese character similarity measurement focused on the character glyph (e.g. structures, strokes and radicals), using heuristic algorithms whose parameters have preset values. This article presents a machine learning (regression) approach to measuring the similarity between two Chinese characters, based on information that includes not only the glyph but also pronunciation (pinyin) and semantic meaning derived from HowNet. We evaluated various regression models on a test set of 2586 character pairs selected from elementary Chinese textbooks. Four regression models (M5, Support Vector Machine, Gaussian Process and Linear Regression) produced similar results ($0.617 \leq \text{Mean Absolute Error} \leq 0.641$, $0.772 \leq \text{Root Mean Square Error} \leq 0.790$). The study also suggested that the performance of the regression model could be influenced by character frequency. Moreover, we evaluated the regression model on a well-known Chinese language learning resource, the 100 pairs of most confusing Chinese characters. The experimental results indicated that this approach has potential for recognizing and generating confusing Chinese character pairs.
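The two error measures reported above can be illustrated with a minimal Python sketch. The rating values and the names `human_ratings` and `model_scores` are hypothetical and chosen only for illustration; they are not data from the study, and the abstract does not state the rating scale that was used.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between reference and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_square_error(y_true, y_pred):
    """Square root of the average squared difference; penalizes large errors more."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Hypothetical similarity ratings for four character pairs (not from the study).
human_ratings = [1.0, 2.0, 3.0, 4.0]
# Hypothetical scores predicted by a regression model for the same pairs.
model_scores = [1.5, 2.0, 2.0, 5.0]

mae = mean_absolute_error(human_ratings, model_scores)
rmse = root_mean_square_error(human_ratings, model_scores)
print(f"MAE = {mae:.3f}, RMSE = {rmse:.3f}")
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same predictions, which is consistent with the reported RMSE range lying above the MAE range.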

Keywords

Natural language processing, Chinese character similarity measurement, intelligent authoring tools

