Interpolative self-training approach for link prediction

Abstract

This paper proposes learning social networks from incomplete relationship data. Link prediction is treated as a semi-supervised learning problem in which the task is to predict a larger part of a network from available knowledge of smaller parts. Under this assumption, social network extraction becomes a classification problem. Because the majority of links are unknown in real-world scenarios, we hypothesise that self-training, as the most common semi-supervised learning method, can provide an effective approach for learning from unlabelled data. We propose an interpolative self-training technique that leverages node information to generate a set of examples in the learning phase, with their connections as the associated labels. The approach generates data by interpolating the documents assigned to a pair of nodes. Documents, as the implicit content shared between individual nodes, provide a basis for estimating node similarity. The generated training data are then used in a link prediction model under two scenarios. The first treats link prediction as a conventional classification problem with examples from both the positive (link) and negative (no-link) classes. The second addresses the more realistic case in which only some positive examples (links or connections) are known. Social networks are usually very sparse, so in the classification framework of link prediction the class distribution is imbalanced: among all possible links there are few connections (positive class) and many disconnections (negative class). To deal with this class skew and improve classifier performance, a data selection method based on node similarity is proposed. To evaluate the proposed methods, a set of experiments was conducted on co-authorship networks from 18 different domains. The results indicate that significantly high performance is achievable for most of the networks using the proposed self-training approach.
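The abstract does not give the details of the interpolation step, so the following is only a minimal sketch of plain self-training for link prediction, not the interpolative method itself. It assumes that each node carries a bag-of-words document, that a node pair is described by the cosine similarity of its two documents, and that the classifier is a simple similarity threshold. The names `cosine`, `fit_threshold`, `self_train` and the toy data are hypothetical and are not taken from the paper.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse term-count dicts."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def fit_threshold(labelled):
    """Toy classifier: midpoint between mean link and mean no-link similarity."""
    pos = [s for s, y in labelled if y == 1]
    neg = [s for s, y in labelled if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def self_train(labelled, unlabelled, rounds=3, per_round=1):
    """Generic self-training: pseudo-label the most confident pairs each round."""
    labelled = list(labelled)   # (similarity, label) examples
    pool = dict(unlabelled)     # pair -> similarity, labels unknown
    for _ in range(rounds):
        if not pool:
            break
        thr = fit_threshold(labelled)
        # Confidence is taken as the distance from the decision threshold.
        ranked = sorted(pool, key=lambda p: abs(pool[p] - thr), reverse=True)
        for pair in ranked[:per_round]:
            s = pool.pop(pair)
            labelled.append((s, 1 if s >= thr else 0))
    return fit_threshold(labelled), labelled


# Hypothetical toy data: one term-count document per node.
docs = {
    "a": {"graph": 2, "link": 1},
    "b": {"graph": 1, "link": 2},
    "c": {"protein": 3},
    "d": {"protein": 1, "kernel": 1},
    "e": {"graph": 1, "kernel": 1},
}

# One known link (a, b) and one known non-link (a, c).
seed = [(cosine(docs["a"], docs["b"]), 1), (cosine(docs["a"], docs["c"]), 0)]
# Pairs whose link status is unknown.
candidates = {
    (u, v): cosine(docs[u], docs[v])
    for u, v in [("c", "d"), ("b", "c"), ("a", "e")]
}

threshold, final_examples = self_train(seed, candidates)
```

After the loop, `final_examples` holds the two seed examples plus one pseudo-labelled pair per round, and `threshold` is the decision boundary refitted on all of them. A practical system would replace the threshold with a stronger classifier and would need to handle the class imbalance described above.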

Keywords

Link prediction, classification problem, semi-supervised learning, self-training
