A spatial, temporal and sentiment based framework for indexing and clustering in Twitter blogosphere

Abstract

As the number of social network users grows, so does their reliance on social networks for communication. Social networks let people connect with one another in many ways. Many of them, such as Twitter, let users tag a post with their current location. This geographical information can be used in various information retrieval tasks. Many existing methods cluster tweets with the traditional K-means algorithm, in which the user must specify the number of clusters to form; tweets that do not fall within those clusters are treated as outliers and discarded. This paper presents a framework for clustering and indexing tweets on the basis of their geographical and temporal features. It uses X-means clustering, which does not require the user to supply the number of clusters and instead takes its input from an index built over the specified tweet characteristics. The indexing mechanism not only eases searching but also supports many retrieval tasks. Experimental analysis shows that the proposed framework improves on traditional tweet clustering methods.

Keywords

Tweet indexing, tweet clustering, microblog, information retrieval
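The abstract does not describe the structure of the index, so the following is only a minimal sketch of one way tweets could be indexed by geographical and temporal features. The grid-cell size, the time-bucket width, the field names (`id`, `lat`, `lon`, `ts`) and the helper names `index_key` and `build_index` are all hypothetical, and using the number of occupied index cells as a starting cluster count is an assumption made for illustration, not the authors' stated method.

```python
from collections import defaultdict
from datetime import datetime, timezone


def index_key(lat, lon, ts, cell_deg=1.0, bucket_hours=6):
    """Map coordinates and a timestamp to a (row, col, time bucket) key.

    cell_deg and bucket_hours are illustrative defaults, not values from the paper.
    """
    row = int((lat + 90.0) // cell_deg)    # latitude band
    col = int((lon + 180.0) // cell_deg)   # longitude band
    bucket = int(ts.timestamp() // (bucket_hours * 3600))  # time window
    return (row, col, bucket)


def build_index(tweets, cell_deg=1.0, bucket_hours=6):
    """Group tweet ids by spatio-temporal cell (hypothetical index layout)."""
    index = defaultdict(list)
    for t in tweets:
        key = index_key(t["lat"], t["lon"], t["ts"], cell_deg, bucket_hours)
        index[key].append(t["id"])
    return index


# Made-up sample data for illustration only.
tweets = [
    {"id": 1, "lat": 28.61, "lon": 77.21,
     "ts": datetime(2016, 5, 30, 9, 0, tzinfo=timezone.utc)},
    {"id": 2, "lat": 28.65, "lon": 77.23,
     "ts": datetime(2016, 5, 30, 10, 30, tzinfo=timezone.utc)},
    {"id": 3, "lat": 51.51, "lon": -0.13,
     "ts": datetime(2016, 5, 30, 9, 15, tzinfo=timezone.utc)},
    {"id": 4, "lat": 28.61, "lon": 77.21,
     "ts": datetime(2016, 5, 31, 9, 0, tzinfo=timezone.utc)},
]

index = build_index(tweets)

# Assumption: occupied cells could seed the cluster count instead of a user-supplied k.
initial_k = len(index)
```

With this layout, tweets posted close together in space and time share a key, so a lookup by cell returns candidate tweets directly, and a clustering step could read its starting configuration from the index rather than from the user.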

