Abstract
Document novelty detection is a concept learning problem in which the system learns only from the positive documents belonging to a concept and, with that limited knowledge, attempts to detect the negative cases. This work focuses on learning author style as a concept from a given set of documents, particularly emails. Because author attribution is more complex for short texts such as emails than for longer documents, techniques originally developed for long documents prove inefficient for short texts. To address this shortcoming of existing algorithms in detecting aberrations in author style, we propose a graph-model-based technique for extracting a feature set from short documents. Given the extracted feature set, we also develop two probability-based text representation schemes intended to best represent a text document to an underlying one-class SVM classifier. The proposed models are compared and evaluated on the public Enron email dataset. Combining the graph-based feature set extraction technique with the inclusive compound probability-based text representation proved very efficient. The generality of the proposed method makes the approach applicable to all kinds of text documents, including emails.
