Abstract
A technique for automated categorization of text documents, based on byte-level n-gram profiles and a new dissimilarity measure between profiles, is presented. A k-nearest-neighbors classifier is used. The technique is language independent. It has been applied to four document collections in English, Chinese and Serbian: Reuters-21578 newswire articles, 20-Newsgroups, Tancorp and Ebart. The evaluation was done by using the micro- and macro-averaged
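
The pipeline described in the abstract (byte-level n-gram profiles, a dissimilarity between profiles, and a k-nearest-neighbors vote) can be illustrated with a minimal sketch. The abstract does not define the new dissimilarity measure, so the `dissimilarity` function below is a stand-in: a squared relative-difference measure of the kind known from earlier n-gram profile work, not the measure proposed in the article. The function names and the parameters `n`, `size` and `k` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def ngram_profile(text, n=3, size=100):
    """Return the `size` most frequent byte-level n-grams of `text`
    with their relative frequencies (n and size are illustrative)."""
    data = text.encode("utf-8")  # work on bytes, not characters
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: c / total for gram, c in counts.most_common(size)}


def dissimilarity(p, q):
    """Stand-in measure (an assumption, NOT the article's new measure):
    sum of squared relative frequency differences over the union of
    n-grams in both profiles."""
    total = 0.0
    for gram in p.keys() | q.keys():
        a = p.get(gram, 0.0)
        b = q.get(gram, 0.0)
        # a + b > 0 because gram occurs in at least one profile
        total += ((a - b) / ((a + b) / 2)) ** 2
    return total


def knn_classify(text, training, k=1, n=3, size=100):
    """Label `text` by majority vote among the k training documents
    whose profiles are least dissimilar to its profile.
    `training` is a list of (document_text, label) pairs."""
    profile = ngram_profile(text, n, size)
    scored = sorted(
        (dissimilarity(profile, ngram_profile(doc, n, size)), label)
        for doc, label in training
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]
```

Because the profiles are built from UTF-8 bytes rather than words, no tokenizer or language-specific preprocessing is involved, which is what makes an approach of this kind language independent. A call such as `knn_classify(new_text, training, k=5)` would return the predicted category label for `new_text`.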
