Abstract
Automatically identifying Chinese characters that are similar in glyph, pronunciation and meaning is important for building smart question-generation tools in a computer-assisted language-learning environment. Previous research on Chinese character similarity measurement focused on the character glyph (e.g. structures, strokes and radicals), using heuristic algorithms whose parameters have preset values. This article presents a machine learning (regression) approach to measuring the similarity between two Chinese characters, based on information that includes not only the glyph but also the pronunciation (pinyin) and the semantic meaning derived from HowNet. We evaluated various regression models on a test set of 2586 pairs of characters selected from elementary Chinese textbooks. The study results showed that four regression models (M5, Support Vector Machine, Gaussian Process and Linear Regression) produce similar results (
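The regression idea described above can be illustrated with a minimal sketch. It is not the article's implementation. It assumes that each character pair has already been reduced to three hypothetical similarity features in [0, 1] (glyph, pinyin, semantic), and it uses plain ordinary least squares as a stand-in for the Linear Regression model named in the abstract. The feature values, the target rule and all function names are invented for illustration.

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [a | b].
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Move the row with the largest pivot candidate into place.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_linear_regression(features: Sequence[Sequence[float]],
                          targets: Sequence[float]) -> List[float]:
    """Ordinary least squares with an intercept, via the normal equations."""
    rows = [[1.0] + list(f) for f in features]  # prepend the intercept term
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(weights: Sequence[float], feature: Sequence[float]) -> float:
    """Predicted overall similarity for one character pair."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], feature))


# Hypothetical (glyph, pinyin, semantic) similarity features per character pair.
toy_features = [
    (0.9, 1.0, 0.2),
    (0.1, 0.0, 0.8),
    (0.5, 0.5, 0.5),
    (0.7, 0.0, 0.1),
    (0.2, 1.0, 0.9),
    (0.8, 0.3, 0.6),
]
# Hypothetical target scores, generated from a known rule so the fit can be checked.
toy_targets = [0.1 + 0.5 * g + 0.3 * p + 0.2 * s for g, p, s in toy_features]

weights = fit_linear_regression(toy_features, toy_targets)
score = predict(weights, (0.5, 0.5, 0.5))
```

Because the toy targets come from an exactly linear rule, the fitted weights should recover that rule up to rounding. Real similarity judgements would be noisy, which is one reason to compare several regression models as the abstract describes.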