Abstract
The aim of topic detection is to automatically identify events and hot topics in social networks and to continuously track known topics. Traditional methods such as Latent Dirichlet Allocation and Probabilistic Latent Semantic Analysis are difficult to apply given the high dimensionality of massive event texts and the short-text sparsity of social networks. The sparse distribution of topics also leads to unclear topics. To address these challenges, we propose a novel word embedding topic model, named the Cbow Topic Model (CTM), which combines a topic model with the continuous bag-of-words (Cbow) word embedding method for topic detection and summary in social networks. We cluster similar words in the target social network text dataset by introducing the classic Cbow word vectorization method, which effectively learns the internal relationships between words and reduces the dimensionality of the input texts. We employ the topic model to model short text, effectively weakening the sparsity problem of social network texts. To detect and summarize topics, we propose a topic detection method that leverages similarity computing for social networks. We collected a Sina microblog dataset and conducted various experiments. The experimental results demonstrate that the CTM method is superior to existing topic model methods.
Introduction
In recent years, the rapid development of online social networks, such as Twitter, Facebook, and Sina Weibo, has greatly affected people’s social and working styles. In social networks, individuals can interact with friends anytime and anywhere to share relevant information, follow users of interest, subscribe to information, and view a variety of news. Companies and official organizations can also use social networks to publish new products and news.1–3 Due to the openness and sharing of social networks, the information that people share, or the topics they discuss, may spread widely through the network and cause a huge social impact. It is therefore necessary to detect social topics and emergencies and to identify all kinds of events, so as to purify the network environment and improve the social atmosphere. These approaches can help grasp public sentiment and public opinion and provide a basis for government decision-making. Therefore, research on social network topic detection and summary has both theoretical and practical significance.
Topic detection requires a series of processes such as preprocessing, modeling, and similarity calculation. At first, researchers proposed methods based on the vector space model and statistical language models to implement event monitoring. However, these models largely ignored semantics, which in turn limited the effectiveness of event monitoring. In 1999, Hofmann 4 incorporated statistical concepts into the Latent Semantic Analysis (LSA) model and proposed the Probabilistic Latent Semantic Analysis method. This method, based on dual-mode co-occurrence data analysis, became a research hotspot in academia. Subsequently, Blei et al. 5 proposed the classic Latent Dirichlet Allocation (LDA) model, which discovers topics through document-word co-occurrence; it achieved outstanding results in event monitoring and greatly influenced both academic and industrial circles. However, social network text is usually short and sparse, resulting in few word co-occurrences within an event topic. Moreover, due to the casual nature of social network posting, related topic words appear infrequently and lack relevant background information. These two aspects mean that traditional topic models face many challenges in social network event monitoring.
To address the short-text sparsity of social network topics, Cheng et al. 6 proposed the Biterm Topic Model (BTM), a word-pair topic model that effectively alleviates the sparsity of short-text topics in social networks and provided new ideas for applying LDA to short texts. Zuo et al. 7 proposed a short-text topic model, the Pseudo-document-based Topic Model (PTM), which leverages text self-aggregation. Mehrotra et al. 8 proposed a hashtag pooling scheme for a Twitter dataset: in the original LDA preprocessing stage, this method builds content pools for Twitter through hashtags and implements automatic tagging. The above approaches can overcome the sparsity problem in the social network context by modeling different features or attributes. However, these methods ignore the importance of the relationships between words and require complex heuristic processing.
In the study of mixed methods for topic detection and summary, Cao et al. 9 combined LDA and deep learning to propose a topic monitoring framework, which effectively represents documents and words and has good applicability. Based on LDA, Xie et al. 10 proposed a real-time emergency event monitoring algorithm based on a set of simplified graph topic models; the validity of the algorithm was verified on a Twitter dataset. Yang et al. 11 proposed a new hierarchical Bayesian topic model that adopts a latent topic hierarchy of N conceptual levels and can capture the dependence of words appearing in the local context of a word, achieving good results in multi-document topic discovery. Xu et al. 12 leveraged the sentiments and topics of microblogs to model short texts, addressing the sparsity problem, and detected bursty topics by detecting the burstiness of words. Huang et al. 13 proposed a bursty topic tracking method that discovers topics by aligning bursty word detection from a temporal view and calculates topic novelty by designing an optimization problem. However, the above methods require specialized data and extra post-processing and can easily cause over-fitting.
In this paper, we propose a novel word embedding topic model for topic detection and summary, named CTM. First, we apply the continuous bag-of-words (Cbow) 14 word embedding method to learn the internal relationships between words for generating coherent topics. Second, we directly model the aggregated-documents instead of the short texts to weaken the sparsity problem of short text by leveraging the self-aggregation topic model (SATM). Finally, to effectively detect and summarize topics, we propose a topic detection and summary method that adopts the similarity computing approach within the proposed topic model.
Related work
Research on topic monitoring and summary has attracted considerable attention. The research can be grouped into two main categories: the topic model method and the clustering method.
In the topic model category, Li et al. 15 aggregated short texts into long documents to indirectly model short text, alleviating the sparsity problem in social networks. Zuo et al. 16 built a word co-occurrence network to model text directly and used LDA to model semantic information, effectively addressing the lack of word co-occurrence in the social network context. Pang et al. 17 used a two-step approach to generate a multi-modal description of an Internet topic from background-removed similarities; by removing background similarities, they generated a coherent and informative multi-media description for a topic. Liu et al. 18 proposed a topic model that samples global and local topics by leveraging location information for detecting social events. He et al. 19 improved the BTM using the Metropolis-Hastings and alias methods to reduce the sampling complexity of BTM without degrading topic quality. Liang et al. 20 used the preferences of users’ followers to model topics and improve the accuracy of topic monitoring. Quan et al. 21 proposed a two-stage SATM to alleviate the sparsity problem in topic monitoring. Yan et al. 22 used the burstiness of words as a prior and incorporated it into the topic model for topic monitoring. Although the above methods weaken the sparsity problem in topic monitoring, they ignore the importance of the internal relationships between social network words for topic detection and summary.
In the clustering method, documents are usually clustered according to the topic similarity of the corpus. Hasan et al. 23 proposed a topic detection system that provides a low computational cost solution by leveraging improved inverted indices and an incremental clustering method for detecting real-time social network events. Xu et al. 12 proposed a user sentiment topic modeling method that models users’ sentiments and topics to alleviate the sparsity problem and designed a method to identify bursty topics by clustering bursty words. Ai et al. 24 proposed a two-stage distributed topic monitoring method by leveraging Spark and topic clustering. Li et al. 25 first extended short texts in social networks using a word concept network to minimize the sparsity of short text and then, based on the extended words, proposed a clustering-based method to implement topic monitoring. Zhao et al. 26 used a hypergraph to model various features of social networks, such as temporal, spatial, and cross-media features, and clustered these features to monitor social network topics. However, these methods require complex preprocessing or post-processing and are prone to over-fitting.
Algorithm introduction and inference
Algorithm framework
Given the above problems, we propose a novel CTM model. The key idea of the model is to vectorize the short text using the Cbow algorithm and to assume that a large amount of short text is generated from a small number of regular-sized aggregated-documents. The topic modeling of short texts is thus converted into topic modeling of a smaller number of aggregated-documents by learning the topic distribution of the aggregated-documents. Through this process, better performance can be obtained, over-fitting can be prevented, and the problems of social network topic dispersion and short-text sparsity can be effectively solved. The topics can then be detected and summarized by leveraging similarity computing. The algorithm framework is shown in Figure 1.

Figure 1. Algorithm framework.
Vector representation of the event text
Cbow is a method based on distributed representation. 14 The basic idea is to map the words of the dataset, together with their context information, into K-dimensional real-valued vectors through training (K is generally a hyperparameter of the model), thereby embedding the high-dimensional sparse word space into a low-dimensional dense one. Then, by calculating the distance between words (e.g. cosine similarity or Euclidean distance), their semantic similarity can be judged. Specifically, the words in the dataset are mapped through the input layer on the left to a middle projection layer, where a vector summation is performed over the context vectors, and the output layer predicts the target word over the dictionary. The specific structure is shown in Figure 2.

Figure 2. Cbow structure.
Suppose that the D words E(d) of event dataset E are projected to corresponding vectors W(d). The context of W(d) is then used to predict W(d), and the specific training process is as follows.

First, input the word vectors of the context: the context window of the input layer is defined by the window size k, the words in the window are read, and a hash table is used to locate their positions in the projection layer, which yields the context vocabulary of W(d).

Second, in the middle projection layer, the context vectors of W(d) are summed to obtain the hidden representation h.

Third, at the output layer, the conditional probability of the target word is given by the softmax expression in equation (1)

p(W(d) | context) = exp(v′(W(d)) · h) / Σ_w exp(v′(w) · h)  (1)

where v′(w) denotes the output vector of word w and the sum runs over the vocabulary.

The gradient descent method is used to adjust the input word vectors so that the actual path approaches the correct path. After training, we obtain the word vector corresponding to each word in the vocabulary. 19 The result is the set of learned word vectors shown in equation (2)

E = {W(d1), W(d2), …, W(dD)}  (2)

where W(di) is the learned K-dimensional vector of the ith word.
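As a minimal illustration of the projection-and-softmax computation described above (not the paper’s implementation; the toy vocabulary, the dimension K = 8, and the learning rate are invented for the example), the following sketch averages the context vectors at the projection layer, applies a softmax over the vocabulary, and takes one gradient-descent step on the output vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["explosion", "tianjin", "rescue", "fire", "news"]
V, K = len(vocab), 8                    # vocabulary size, embedding dimension (hyperparameter)

W_in = rng.normal(0, 0.1, (V, K))       # input (projection-layer) vectors
W_out = rng.normal(0, 0.1, (V, K))      # output-layer vectors

def cbow_probs(context_ids):
    """Average the context vectors (projection layer), then softmax over the vocabulary."""
    h = W_in[context_ids].mean(axis=0)  # projection: combine the context word vectors
    scores = W_out @ h                  # output-layer scores for every vocabulary word
    e = np.exp(scores - scores.max())   # numerically stable softmax
    return e / e.sum(), h

# Predict the center word "tianjin" (id 1) from a window context
probs, h = cbow_probs([0, 2])           # context: "explosion", "rescue"
target = 1

# One gradient-descent step on the cross-entropy loss for the target word
grad = probs.copy()
grad[target] -= 1.0                     # dL/dscores for softmax + cross-entropy
W_out -= 0.5 * np.outer(grad, h)        # update the output vectors

probs_after, _ = cbow_probs([0, 2])
assert abs(probs.sum() - 1.0) < 1e-9    # softmax is a proper distribution
assert probs_after[target] > probs[target]  # the target word becomes more probable
```

In practice, negative sampling or a hierarchical softmax replaces the full softmax for efficiency, but the gradient step above shows why the predicted probability of the observed center word increases during training.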
CTM model
Inspired by the PTM 7 and its effective modeling of short texts, we propose the CTM to model short texts and solve the sparsity problem. The CTM has K topics, each a multinomial distribution over words, together with M short texts and C aggregated-documents. The short texts are observable, whereas the aggregated-documents are hidden variables. It is also assumed that each short text belongs to one aggregated-document, and each word in a short text is generated by sampling a topic z. The input differs from LDA, where word frequencies are input and topics are extracted directly from documents: the CTM input is the Cbow-vectorized word vectors, short texts are converted into aggregated-documents, and topics are extracted from the aggregated-documents. Table 1 lists the variables and notations of the proposed CTM model.
Table 1. Variables and notations.
The LDA model and CTM model are compared in Figure 3, where Figure 3(a) is the plate diagram of the LDA model and Figure 3(b) is the plate diagram of the CTM model. Figure 3(a) shows that LDA is a Bayesian model for learning the hidden topics of documents. LDA assumes an implicit topic layer between the visible document layer and word layer: each document is a multinomial distribution over topics, and each topic is a multinomial distribution over words. Unlike LDA, CTM assumes that a large amount of short text is generated from a relatively small number of latent documents, which we call aggregated-documents. By learning the topics of the aggregated-documents instead of learning topics directly from the short texts, CTM avoids the problems caused by the lack of word co-occurrence information in the social network context. This modeling choice also means that the number of CTM parameters does not grow with the data, which reduces the risk of over-fitting and ensures the efficiency of the CTM learning algorithm.

Figure 3. Comparison of the LDA model and CTM model: (a) the plate diagram of the LDA model and (b) the plate diagram of the CTM model.
The CTM generation process is as follows:

Step 1. Sample the distribution over aggregated-documents ψ ~ Dirichlet(λ).

Step 2. For each topic k ∈ {1, …, K}, sample a topic-word distribution φk ~ Dirichlet(β).

Step 3. For each aggregated-document c ∈ {1, …, C}, sample a topic distribution θc ~ Dirichlet(α).

Step 4. For each short text m, sample its aggregated-document cm ~ Multinomial(ψ); then, for each word in the short text, sample a topic z ~ Multinomial(θcm) and sample the word w ~ Multinomial(φz).
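The generative story above can be sketched as a small simulation (a toy illustration, not the paper’s code; the symbols φ, θ, and ψ, the symmetric hyperparameters α and β, and all the sizes are assumptions made for the example):

```python
import numpy as np

rng = np.random.default_rng(1)

V, K, C, M = 50, 4, 3, 10      # vocab size, topics, aggregated-documents, short texts (toy values)
alpha, beta = 0.1, 0.01        # symmetric Dirichlet hyperparameters (assumed)

phi = rng.dirichlet([beta] * V, size=K)      # Steps 1-2: topic-word distributions
theta = rng.dirichlet([alpha] * K, size=C)   # Step 3: per-aggregated-document topic distributions
psi = rng.dirichlet([1.0] * C)               # distribution over aggregated-documents

corpus = []
for m in range(M):                           # Step 4: for each short text
    c = rng.choice(C, p=psi)                 #   sample its aggregated-document
    words = []
    for _ in range(rng.integers(3, 8)):      #   for each word position (short texts are short)
        z = rng.choice(K, p=theta[c])        #     sample a topic from the aggregated-document
        w = rng.choice(V, p=phi[z])          #     sample a word from that topic
        words.append(int(w))
    corpus.append(words)

assert len(corpus) == M
assert all(0 <= w < V for doc in corpus for w in doc)
```

Because every short text draws its topics from one of only C aggregated-documents, the number of document-level parameters is fixed by C rather than by the (much larger) number of short texts, which is the source of the over-fitting protection discussed above.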
The academic community often uses approximate inference methods to estimate the parameters of topic models. Commonly used methods include the variational Expectation-Maximization (EM) algorithm and Gibbs sampling. However, previous studies have found that Gibbs sampling yields a better approximation.6,7 Therefore, the Gibbs sampling algorithm 27 is used to derive the relevant parameters. The Gibbs sampling procedure of the proposed CTM is shown in Algorithm 1.
Given the related variables, we first derive the sampling formula for the topic assignment of each word
where
The sampling formula for the aggregated-document assignment of each short text is
where
After multiple iterations, the sampling results stabilize, and the mean of the learned parameters is used as the parameter estimate. The following distribution results are obtained using equations (5)–(7)
where
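As an illustration of how such a collapsed Gibbs sampler maintains its count statistics, the following sketch runs the standard LDA-style conditional, proportional to (n_ck + α)(n_kw + β)/(n_k + Vβ), on invented toy data; this generic update stands in for, and is not necessarily identical to, the CTM conditionals above:

```python
import numpy as np

rng = np.random.default_rng(2)

V, K, C = 20, 3, 2             # toy vocabulary, topic, and aggregated-document sizes
alpha, beta = 0.1, 0.01        # Dirichlet hyperparameters (assumed)

# Toy corpus: each token is a (aggregated-document id, word id) pair
tokens = [(int(rng.integers(C)), int(rng.integers(V))) for _ in range(100)]
z = [int(rng.integers(K)) for _ in tokens]            # random initial topic assignments

# Count matrices: aggregated-document/topic, topic/word, and topic totals
n_ck = np.zeros((C, K)); n_kw = np.zeros((K, V)); n_k = np.zeros(K)
for (c, w), k in zip(tokens, z):
    n_ck[c, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

for it in range(20):                                  # Gibbs sweeps
    for i, (c, w) in enumerate(tokens):
        k = z[i]                                      # remove the current assignment's counts
        n_ck[c, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
        # full conditional: p(z_i = k | rest) ∝ (n_ck + α)(n_kw + β)/(n_k + Vβ)
        p = (n_ck[c] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
        k = int(rng.choice(K, p=p / p.sum()))         # resample the topic
        z[i] = k
        n_ck[c, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

# Parameter estimate from the final counts (the "mean of learned parameters" step)
phi_hat = (n_kw + beta) / (n_k[:, None] + V * beta)
assert np.allclose(phi_hat.sum(axis=1), 1.0)          # each row is a word distribution
```

The essential invariant is that the counts always exclude the token being resampled and are restored afterward, so the chain targets the correct posterior.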
For the same event text content, the CTM algorithm performs similarity clustering of the word vectors. Essentially, the document-word distribution input to the topic model is optimized so that the solved results are updated and new terms and topic items are obtained.
Topic detection and summary
After the relevant topics and parameters are obtained, the above results are used to monitor and predict event topics. Given a new microblog text, we first sample the topic assignment of each word from the learned CTM method. When the sampling process is completed, the topic distribution of the new microblog can be obtained. We then leverage similarity computing to detect and summarize new microblog topics. The method is shown in equations (8) and (9)
where
In particular, each topic first needs to be initialized using CTM. Second, incoming microblogs are categorized according to equations (8) and (9). Then, we can determine the classification of the microblog topics and update the CTM method. Finally, we can detect and summarize the microblog topics.
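A minimal sketch of this similarity-based assignment step (assuming cosine similarity between topic distributions and the 0.75 threshold used in the experiments; the distribution vectors are invented):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two topic-distribution vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def assign_topic(doc_dist, topic_dists, threshold=0.75):
    """Attach the microblog to the most similar known topic, or flag it as a new topic (-1)."""
    sims = [cosine(doc_dist, t) for t in topic_dists]
    best = int(np.argmax(sims))
    return best if sims[best] >= threshold else -1

# Two known topic distributions over 3 topics (toy values)
topics = [np.array([0.8, 0.1, 0.1]), np.array([0.1, 0.1, 0.8])]

matched = assign_topic(np.array([0.7, 0.2, 0.1]), topics)   # close to topic 0
emerging = assign_topic(np.array([0.34, 0.33, 0.33]), topics)  # close to nothing

assert matched == 0
assert emerging == -1   # below the threshold for every topic: treated as a new topic
```

Documents that fall below the threshold for every known topic seed a new topic, after which the model statistics are updated, matching the initialize-classify-update loop described above.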
Experiment and result analysis
Dataset
We crawled data from Sina Weibo (https://weibo.com/), obtaining more than 600,000 microblogs from August 2015 to September 2015, including several hot topics on Weibo, such as “Explosion Accident in Tianjin Binhai New Area.” The Chinese sentences were segmented using a popular deep learning-based word segmentation tool, which was trained on the People’s Daily corpus and achieves a word segmentation accuracy of 97.5%. 28 Then, noise such as stop words and advertisement content was removed.
Experimental setup
The event dataset is input into the vectorization program of the CTM model, and gradient ascent with negative sampling is used for the computation. The similarity threshold is set to 0.75. We set
For comparison, we selected the current mainstream topic model methods: LDA, 5 SATM, 21 LTM, 15 and BBTM. 22 The evaluation indicators were cluster purity and average accuracy. In total, 90% of the event data in the dataset were randomly selected to train the model, and the remaining 10% were employed for testing. Performance was verified in two ways: first, a strict ground-truth comparison was conducted; second, a topic evaluation was pursued, in which three random microblogs were selected for manual evaluation, with the first five results of each event evaluated by three human judges. Two methods were used because an objective evaluation alone is not sufficiently comprehensive for this task; the topic evaluation broadens the evaluation indices and produces more realistic results.
Result analysis
Comparison and analysis of topic discovery accuracy
Since the crawled data have no tag information, manual annotation was used to judge the accuracy of topic detection and summary, with hashtags (in Sina Weibo, hashtags are written as “#…#”) as auxiliary labels. We invited four students from different colleges to mark the results; search engines and other Internet resources were available as aids. When more than two students judged a topic to be a bursty topic, we counted it as a correct result. The precision (P@T) corresponding to the number of topics T is reported in Table 2.
Table 2. Accuracy comparison of topic detection.
LDA: Latent Dirichlet Allocation; SATM: self-aggregation topic model; LTM: Latent Topic Model; BBTM: Bursty Biterm Topic Model; CTM: Cbow Topic Model.
The experimental results show that the proposed CTM method was consistently more accurate than all the comparison methods across different numbers of topics, indicating that the proposed CTM can accurately monitor bursty topics. We also found that the performance of CTM is slightly worse when T is 5, which may be because a small number of topics makes the topics more discrete and less focused. BBTM and LTM outperform SATM and LDA but are slightly less effective than CTM. This indicates that CTM can improve the performance of topic monitoring by representing microblog short texts with Cbow word vectors and self-aggregating short texts through aggregated-document modeling. LDA’s performance was the worst, mainly because it could not overcome the sparsity problem and did not consider the burstiness of topics.
Topic coherence
To verify the performance of CTM, pointwise mutual information (PMI), commonly used in topic model research, was used to evaluate the topic coherence of the CTM method. 29 Given a topic, the average PMI of the top T words with the highest probability in the topic is calculated using an auxiliary corpus. The higher the PMI score, the more coherent the topic detection results. The PMI calculation is shown in equation (10)

PMI = 2 / (T(T − 1)) · Σ_{1 ≤ i < j ≤ T} log [ p(wi, wj) / (p(wi) p(wj)) ]  (10)

where p(wi, wj) is the probability that words wi and wj co-occur in a document of the auxiliary corpus and p(wi) is the probability that wi occurs in a document.
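The PMI coherence score can be computed as in the following sketch (toy documents; document-level co-occurrence probabilities are assumed, as is common in PMI-based coherence evaluation):

```python
import math
from itertools import combinations

def pmi_score(top_words, docs):
    """Average PMI over word pairs among top_words, with probabilities estimated from docs."""
    N = len(docs)
    def p(*ws):
        # Fraction of documents containing every word in ws
        return sum(all(w in d for w in ws) for d in docs) / N
    scores = []
    for wi, wj in combinations(top_words, 2):
        joint = p(wi, wj)
        if joint > 0:
            scores.append(math.log(joint / (p(wi) * p(wj))))
    return sum(scores) / len(scores) if scores else 0.0

# Toy auxiliary corpus: each document is a set of words
docs = [{"explosion", "tianjin", "rescue"}, {"explosion", "tianjin"}, {"weather", "rain"}]

coherent = pmi_score(["explosion", "tianjin"], docs)  # words that co-occur often
mixed = pmi_score(["explosion", "rain"], docs)        # words that never co-occur

assert coherent > 0
assert mixed == 0.0   # no co-occurrence evidence contributes nothing
```

Word pairs that co-occur more often than chance push the score up, so a topic whose top words frequently appear together in the auxiliary corpus receives a higher coherence value.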
Using Chinese Wikipedia data as an auxiliary corpus, the experimental results of the topic consistency are shown in Figure 4.

Figure 4. Topic coherence comparison.
Figure 4 shows that the topic coherence of the proposed CTM method is superior to that of all comparison methods, and the BBTM method also performed well. This shows that CTM can learn more coherent topic representations. The high coherence of the BBTM topics indicates that modeling word pairs can reduce the sparsity of microblog text to a certain extent. LTM and SATM also performed well, mainly because the two methods weaken the sparsity problem from different perspectives. Among all methods, LDA performed the worst, mainly because it was originally designed to model standard news documents, whereas the microblog environment has a short-text sparsity problem.
To further verify the effectiveness of our CTM, we qualitatively analyzed the detected and summarized topic. For all baseline methods, we list the top 10 most probable words related to “#天津滨海爆炸事故 (Explosion Accident in Tianjin Binhai New Area) #” that appeared in each topic. Table 3 lists the top 10 words of the topic discovered by each method.
The top 10 most probable words related to “#天津滨海爆炸事故 (Explosion Accident in Tianjin Binhai New Area) #.”
LDA: Latent Dirichlet Allocation; SATM: self-aggregation topic model; LTM: Latent Topic Model; BBTM: Bursty Biterm Topic Model; CTM: Cbow Topic Model.
From Table 3, we can see that (1) the words detected by CTM are the most relevant to the topic; this is because CTM can learn the relationships between words and solve the sparsity problem; (2) BBTM also obtains good results but contains some general words, such as “环境 (surroundings)” and “水质 (water quality)”; (3) LTM contains many irrelevant words, such as “结果 (result),” “河边 (riverside),” and “图片 (image),” which indicates that bursty word clustering is more sensitive to noise; (4) the topics detected by SATM are mixed with words from different topics, and only some of the words are related to “#Explosion Accident in Tianjin Binhai New Area#”; and (5) LDA performed the worst of all methods because it cannot address the sparsity problem.
Quality of topic detection and summary
The quality of topic discovery can be verified using clustering indicators. Experiments were conducted with two clustering evaluation indicators: cluster purity and clustering entropy. The higher the cluster purity, the higher the quality of topic discovery; the smaller the clustering entropy, the higher the quality of topic discovery. Since the microblog data have no labels, the hashtags of the microblogs were used as clustering labels. The experimental results are shown in Figures 5 and 6.
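Both clustering indicators can be computed as in the following sketch (invented labels; each cluster is represented as a list of ground-truth hashtag labels for its microblogs):

```python
import math
from collections import Counter

def purity(clusters):
    """Fraction of items that carry their cluster's majority label (higher is better)."""
    n = sum(len(c) for c in clusters)
    return sum(Counter(c).most_common(1)[0][1] for c in clusters) / n

def cluster_entropy(clusters):
    """Size-weighted average entropy of each cluster's label distribution (lower is better)."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        h = -sum((k / len(c)) * math.log2(k / len(c)) for k in Counter(c).values())
        total += (len(c) / n) * h
    return total

# Toy clustering: hashtag labels of the microblogs assigned to each cluster
clusters = [["fire", "fire", "fire", "rain"], ["rain", "rain"]]

assert abs(purity(clusters) - 5 / 6) < 1e-9   # 5 of 6 items match their cluster's majority
assert cluster_entropy(clusters) < 1.0        # mostly homogeneous clusters -> low entropy
```

A perfect clustering has purity 1 and entropy 0; mixing labels within a cluster lowers purity and raises entropy, which is why the two indicators move in opposite directions in Figures 5 and 6.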

Figure 5. Clustering purity comparison.

Figure 6. Clustering entropy comparison.
The results show that CTM outperformed all comparison methods on both clustering indicators. This is because the CTM algorithm vectorizes the text data before the input stage and clusters similar words, which effectively learns the internal relations between words and reduces the dimension of the model’s input text. The proposed CTM topic detection method considers both the high dimensionality of massive social network texts and the sparse characteristics of short texts, and therefore surpasses the comparison methods on the cluster purity and clustering entropy indices. The original LDA method cannot overcome the short-text sparsity problem, so its performance is the worst of all algorithms. The BBTM method considers the sparsity of short texts in social networks but not the high dimensionality of the text or the dispersion of topics, so its cluster purity is good but still slightly worse than that of the algorithm proposed in this paper.
Conclusion
In this paper, we propose a novel word embedding topic model for topic detection and summary that combines a Cbow event vectorization model with an aggregated-document topic model. First, we adopted the Cbow method to cluster similar words and learn the relationships between words; the representations learned by Cbow are used as the input of the model, so the dimension of the event topics is reduced and the topics are expressed more clearly. Second, we employ a self-aggregation topic model to address the sparsity problem by aggregating short texts into long documents. Finally, to detect and summarize topics, we propose a topic detection method based on the similarity computing approach. We conducted experiments on the Sina microblog dataset, and the results showed that our CTM outperforms all baseline methods. However, social networks also contain images and temporal and spatial information, which our CTM cannot yet model.
In the future, we will focus on the optimization and expansion of CTM by leveraging social relationships, images, and spatial–temporal information to achieve cross-media retrieval and bursty event detection. We also plan to extend the CTM method to sentiment analysis and hashtag recommendation.
Acknowledgements
The authors would like to thank all reviewers and editors for their comments and views that improved the quality of this paper.
Author contributions
L.S. and G.C. designed the overall framework and conceived the idea of this paper; L.S. analyzed the data using the relevant algorithms; L.S. and G.C. wrote the paper; S.X. helped typeset and revise the paper; and G.X. helped correct the English grammar and offered suggestions for improving the writing of the paper.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded by the National Natural Science Foundation of China (no. 41702347), National Key Research and Development Program of China (no. 2018YFC1505104), Natural Science Foundation of Hebei Province of China (no. D2018508107), China Postdoctoral Science Foundation (no. 2019M651786), Hebei IoT Monitoring Engineering Technology Research Center (no. 3142018055), and Scientific Research Projects of Education Department of Hebei Province, China (no. Z2017043).
