Where to go and what to play: Towards summarizing popular information from massive tourism blogs

Abstract

In this work, we propose a novel method to summarize popular information from massive tourism blog data. First, we crawl blog contents and segment each blog into a semantic word vector. Second, we extract the geographical terms of each word vector into a corresponding geographical term vector, and present a new method to explore hot tourism locations and, in particular, their frequent sequential relations from the set of geographical term vectors. Third, we propose a novel word vector subdividing method to collect local features for each hot location, and introduce the max-confidence metric to identify the Things of Interest (ToI) associated with that location from the collected data. We illustrate the benefits of this approach by applying it to a Chinese online tourism blog dataset. The experimental results show that the proposed method can efficiently explore hot locations, their sequential relations and their corresponding ToI.
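
The abstract names max-confidence without defining it. As a rough illustration, the sketch below uses the definition common in the pattern-mining literature, max(P(A|B), P(B|A)), which equals the joint support divided by the smaller of the two individual supports. The function name, the toy segments and the idea of scoring a location against a candidate term per segment are assumptions made for this example, not details taken from the paper.

```python
def max_confidence(segments, a, b):
    """Return max(P(a|b), P(b|a)) estimated from segment-level co-occurrence.

    `segments` is an iterable of sets of terms. This is the textbook
    max-confidence measure; how the paper applies it is not stated
    in the abstract, so this usage is an assumption.
    """
    sup_a = sum(1 for s in segments if a in s)
    sup_b = sum(1 for s in segments if b in s)
    sup_ab = sum(1 for s in segments if a in s and b in s)
    if sup_ab == 0:
        return 0.0  # the terms never co-occur
    # sup_ab / min(sup_a, sup_b) == max(sup_ab / sup_a, sup_ab / sup_b)
    return sup_ab / min(sup_a, sup_b)


# Hypothetical toy data: each set holds the terms of one blog segment.
segments = [
    {"Hangzhou", "West Lake", "boat"},
    {"Hangzhou", "West Lake", "tea"},
    {"Hangzhou", "tea"},
    {"Beijing", "Great Wall"},
]

score = max_confidence(segments, "Hangzhou", "West Lake")
```

Because the denominator is the smaller support, a term that appears rarely but almost always alongside a given location still scores highly, which is one plausible reason to prefer this measure over plain confidence when looking for location-specific terms.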

Keywords

Blog mining; information retrieval; max-confidence; things of interest; travel sequence
