Handling imbalanced classification problem: A case study on social media datasets

Abstract

The imbalanced data problem occurs when the number of representative instances for the classes of interest is much lower than for other classes. The influence of imbalanced data on classification performance has been discussed in previous research as an open challenge. In this paper, we propose a method that addresses the imbalanced data problem at the preprocessing stage, using: i) sampling techniques (i.e., under-sampling, over-sampling, and hybrid-sampling) and ii) an instance weighting method, to increase the number of features in minority classes and to reduce comprehensive coverage in majority classes. The experimental results show that noisy data is reduced, yielding a smaller dataset and a significantly shorter training time. Moreover, the distinct properties of each class are examined effectively. The refined data is used as training input for Naive Bayes and support vector machine classifiers. The proposed methods are evaluated based on the number of non-geotagged resources that are labeled correctly with their geo-locations. The proposed method achieves an accuracy of 84%, compared with 75% in previous research.
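
As a rough illustration of the preprocessing ideas named in the abstract, the following Python sketch shows plain random under-sampling, random over-sampling, and inverse-frequency instance weights. It is not the authors' method: the abstract does not specify their sampling or weighting rules, so the function names, the toy labels `city_a` and `city_b`, and the weighting formula (the common "balanced" heuristic, n_samples / (n_classes * class_count)) are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def group_by_label(samples):
    """Group (features, label) pairs by their label."""
    groups = defaultdict(list)
    for features, label in samples:
        groups[label].append((features, label))
    return groups


def random_undersample(samples, seed=0):
    """Randomly drop instances until every class matches the smallest class."""
    rng = random.Random(seed)
    groups = group_by_label(samples)
    target = min(len(g) for g in groups.values())
    out = []
    for g in groups.values():
        out.extend(rng.sample(g, target))
    return out


def random_oversample(samples, seed=0):
    """Randomly duplicate instances until every class matches the largest class."""
    rng = random.Random(seed)
    groups = group_by_label(samples)
    target = max(len(g) for g in groups.values())
    out = []
    for g in groups.values():
        out.extend(g)
        out.extend(rng.choices(g, k=target - len(g)))
    return out


def inverse_frequency_weights(samples):
    """Assumed heuristic: weight = n_samples / (n_classes * class_count)."""
    counts = Counter(label for _, label in samples)
    n, k = len(samples), len(counts)
    return [n / (k * counts[label]) for _, label in samples]


# Hypothetical toy data: 90 majority-class and 10 minority-class instances.
data = [((i,), "city_a") for i in range(90)] + [((i,), "city_b") for i in range(10)]

under = random_undersample(data)
over = random_oversample(data)
weights = inverse_frequency_weights(data)
```

A hybrid scheme would combine the two samplers, for example by under-sampling the majority class to some intermediate size and over-sampling the minority class up to it. The resampled data or the per-instance weights would then be passed to a classifier such as Naive Bayes or a support vector machine.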

Keywords

Imbalanced datasets, geotags resources, sampling method, instance weighting, location prediction
