Abstract
This paper proposes a method for learning social networks from incomplete relationship data. Link prediction is addressed as a semi-supervised learning problem in which the task is to predict a larger part of a network from available knowledge of smaller parts. Under this assumption, social network extraction is translated into a classification problem. Because in real-world scenarios the majority of links are unknown, we hypothesize that self-training, the most common semi-supervised learning method, can provide an effective approach for learning from unlabelled data. We propose an interpolative self-training technique that leverages node information to generate a set of training examples, with the connections between nodes as their associated labels. The approach generates data by interpolating the documents assigned to a pair of nodes. Documents, as the implicit content shared between individual nodes, provide a basis for estimating node similarity. The generated training data are then employed in a link prediction model under two different scenarios. The first scenario treats link prediction as a conventional classification problem in which examples from both the positive (link) and negative (no-link) classes are available. The second scenario addresses the more realistic case in which only some positive examples (links or connections) are known. Social networks are usually very sparse structures. This sparsity implies that, in the classification framework of link prediction, we deal with an imbalanced class distribution: among all possible links there are few connections (positive class) and many disconnections (negative class). To deal with this class skew and enhance the performance of the classifier, a data selection method based on node similarity is proposed. To evaluate the merit of the proposed methods, a set of experiments was conducted on co-authorship networks from 18 different domains. The results indicate that the proposed self-training approach can achieve high performance on most of the networks.
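
As a rough illustration of the general idea, the following sketch runs a plain self-training loop for link prediction on a toy network. It is not the interpolative technique of the paper: the document interpolation step is replaced by a simple cosine similarity between bag-of-words representations of each node's documents, and the classifier is assumed to be a threshold midway between the mean similarities of the two classes. The names `cosine`, `self_train`, `seed_labels`, `rounds` and `per_round`, as well as the toy data, are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def self_train(docs, seed_labels, rounds=3, per_round=2):
    """Toy self-training loop (illustrative stand-in, not the paper's method).

    docs: node -> list of tokens from the documents assigned to that node.
    seed_labels: frozenset({u, v}) -> 1 (link) or 0 (no link); must contain
    at least one example of each class.
    Returns the seed labels extended with pseudo-labels.
    """
    bags = {node: Counter(tokens) for node, tokens in docs.items()}
    # One similarity score per unordered node pair.
    sim = {frozenset(pair): cosine(bags[pair[0]], bags[pair[1]])
           for pair in combinations(sorted(docs), 2)}
    labels = dict(seed_labels)
    for _ in range(rounds):
        pos = [sim[p] for p, y in labels.items() if y == 1]
        neg = [sim[p] for p, y in labels.items() if y == 0]
        # Assumed classifier: threshold midway between the class means.
        threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        unlabelled = [p for p in sim if p not in labels]
        if not unlabelled:
            break
        # Pseudo-label the pairs that lie farthest from the threshold,
        # i.e. those the current classifier is most confident about.
        unlabelled.sort(key=lambda p: abs(sim[p] - threshold), reverse=True)
        for p in unlabelled[:per_round]:
            labels[p] = int(sim[p] > threshold)
    return labels


# Hypothetical toy data: four authors and the tokens of their documents.
docs = {
    "a": ["graph", "link", "graph"],
    "b": ["graph", "link"],
    "c": ["protein", "cell"],
    "d": ["protein", "cell", "cell"],
}
seed_labels = {frozenset({"a", "b"}): 1, frozenset({"a", "c"}): 0}
labels = self_train(docs, seed_labels)
```

In this toy run, authors "c" and "d" share vocabulary and are therefore pseudo-labelled as linked, while pairs with no shared terms are labelled as unlinked. With a real, sparse network the negative class would dominate, which is the class skew that the similarity-based data selection described above is meant to address.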
