Abstract
Keyphrase extraction, the task of identifying representative words or phrases, is essential in document processing. Most traditional models rely on word frequency features computed from a document and its associated corpus. The word frequency approach has two major limitations: first, as a bag-of-words method, it fails to fully exploit the semantic information in the document; second, it tends to be skewed by local word frequency in short texts when the linked corpus is unavailable or incomplete. This paper proposes WAGRank, a novel unsupervised ranking model built on a word attention graph, in which nodes are words and edges are semantic relations between words. To assign edge weights, two interpretable statistical methods for assessing the correlation strength between words are designed using the attention mechanism. WAGRank relies on word semantics rather than only on frequency in the current text, drawing on external knowledge stored in a pre-trained language model. WAGRank was evaluated on two publicly available datasets against twelve baselines, and the results demonstrate its effectiveness and robustness. In addition, a Granger causality test showed that word attention has a statistically significant predictive effect on word frequency, providing a more reasonable explanation for word frequency analysis.
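The idea of ranking words on a weighted graph can be illustrated with a minimal sketch. It assumes a PageRank-style random walk, which the abstract does not confirm as WAGRank's ranking procedure, and it uses a hand-written toy weight matrix in place of attention scores from a pre-trained language model. The names `rank_words`, `WORDS`, and `ATTENTION` are hypothetical and chosen only for illustration.

```python
def rank_words(words, weights, damping=0.85, iterations=100, tol=1e-9):
    """Rank words by a weighted PageRank computed with power iteration.

    weights[i][j] is a non-negative edge weight from word i to word j.
    Assumption: this generic random-walk ranking stands in for the
    paper's method, which the abstract does not specify.
    """
    n = len(words)

    # Row-normalize so each word distributes its score over its out-edges.
    transition = []
    for row in weights:
        total = sum(row)
        if total > 0:
            transition.append([w / total for w in row])
        else:
            # A word with no out-edges spreads its score uniformly.
            transition.append([1.0 / n] * n)

    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = [
            (1.0 - damping) / n
            + damping * sum(scores[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]
        converged = sum(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break

    # Highest-scoring words first.
    return sorted(zip(words, scores), key=lambda pair: pair[1], reverse=True)


# Hypothetical toy input: made-up weights, not real attention values.
WORDS = ["keyphrase", "extraction", "graph", "the"]
ATTENTION = [
    [0.0, 0.6, 0.3, 0.1],
    [0.7, 0.0, 0.2, 0.1],
    [0.5, 0.4, 0.0, 0.1],
    [0.4, 0.3, 0.3, 0.0],
]

ranking = rank_words(WORDS, ATTENTION)
for word, score in ranking:
    print(f"{word}: {score:.3f}")
```

In this toy graph, the word that receives the most incoming weight ranks first and the function word ranks last, regardless of how often either appears in the text. That is the sense in which a graph built from semantic relations can rank words without relying on frequency alone.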
