Hot topic identification from micro-blog based on improved Single-pass algorithm

Abstract

Identifying hot topics in micro-blogs is important for detecting and controlling public opinion. When the Single-pass algorithm is used to cluster hot topics in Chinese micro-blogs, Chinese word segmentation is a necessary preprocessing step, but it inevitably introduces segmentation errors. These errors lower the clustering precision of topic identification. To solve this problem, this paper proposes an improved algorithm based on Single-pass that combines CS (Cosine Similarity) and LCS (Longest Common Subsequence) to calculate the similarity between Chinese words. Experiments on three different micro-blog data sets show that the improved algorithm achieves both higher recall and higher precision than the original algorithm, indicating that the proposed approach is feasible and effective.

Keywords

Hot topic identification, clustering, Single-pass, word segmentation
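
The abstract does not state how CS and LCS are combined, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes a weighted sum (the hypothetical parameter `alpha`) of a bag-of-words cosine similarity and an averaged best-match LCS ratio between words, plugged into a plain Single-pass loop. The function names, the threshold, the LCS normalisation, and the English stand-in tokens (which mimic a segmenter splitting one word into two) are all illustrative assumptions.

```python
import math
from collections import Counter


def lcs_len(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch == bj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def word_sim(a, b):
    """LCS length divided by the longer word's length (assumed normalisation)."""
    if not a or not b:
        return 0.0
    return lcs_len(a, b) / max(len(a), len(b))


def cosine(u, v):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def lcs_part(u, v):
    """Mean best-match word similarity, averaged over both directions."""
    fwd = sum(max(word_sim(a, b) for b in v) for a in u) / len(u)
    bwd = sum(max(word_sim(b, a) for a in u) for b in v) / len(v)
    return (fwd + bwd) / 2


def combined_sim(u, v, alpha=0.5):
    """Assumed combination: alpha * cosine + (1 - alpha) * LCS part."""
    if not u or not v:
        return 0.0
    return alpha * cosine(u, v) + (1 - alpha) * lcs_part(u, v)


def single_pass(docs, threshold=0.4, alpha=0.5):
    """Assign each document, in arrival order, to the most similar
    existing cluster, or open a new cluster if none reaches the threshold."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [doc index]}
    for i, tokens in enumerate(docs):
        vec = Counter(tokens)
        best, best_sim = None, 0.0
        for c in clusters:
            s = combined_sim(vec, c["centroid"], alpha)
            if s > best_sim:
                best, best_sim = c, s
        if best is not None and best_sim >= threshold:
            best["centroid"].update(vec)  # centroid = summed term counts
            best["members"].append(i)
        else:
            clusters.append({"centroid": Counter(vec), "members": [i]})
    return [c["members"] for c in clusters]


if __name__ == "__main__":
    docs = [
        ["earthquake", "rescue", "team"],
        ["earth", "quake", "rescue"],   # same topic, word split by the segmenter
        ["football", "final", "goal"],
    ]
    print(single_pass(docs, alpha=0.5))  # [[0, 1], [2]]
    print(single_pass(docs, alpha=1.0))  # cosine only: [[0], [1], [2]]
```

In this toy input, cosine similarity alone keeps the mis-segmented post in its own cluster, while the LCS term recovers the partial overlap between the split and unsplit words. The specific weights and threshold here are arbitrary and would need tuning on real data.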

