Abstract
A valid mechanism for online suicide detection and intervention across a wide population has not yet been fully established. Motivated by the increasing suicide rate, we propose an approach that examines temporal patterns of potential suicidal ideation and behavior on Twitter to better understand their risk factors and time-varying features. The approach identifies latent suicide topics and then models the suicidal topic–related score time series to represent behavior patterns on Twitter quantitatively. Evaluated on a collection of suicide-related tweets from 2016, it uncovered 13 key risk factors and identified temporal patterns of suicidal behavior across the days of the week, highlighting distinct time-varying features for different risk factors. The study can help public health services and others develop refined prevention strategies and monitor and support high-risk populations at the right moments.
Introduction
Mental health disorders affect a substantial portion of the population. It is estimated that nearly half of all Americans will experience a mental illness during their lifetime. 1 The economic and social costs associated with mental illness are significant. 2 Individuals with mental health conditions are more likely to exhibit suicidality, defined as any suicide-related behavior, including completed or attempted suicide, suicidal ideation or suicidal communications. 3 Suicide has been a leading cause of death globally in recent years. With the advent of open and massive social networking sites such as Twitter, attention has focused on how these new modes of communication may become a highly interconnected forum for the collective communication of suicidal ideation on a large scale. 4
Concerns have been raised that communication on social media may strongly influence suicidal ideation and cause a contagion effect. Suicide has a devastating impact on both families 5 and communities, 6 even though many suicide deaths are preventable. 7 Given the large volume of Twitter data, it is not yet feasible or ethical to directly contact and survey every Twitter user who may be at risk. 8 The public nature of Twitter nevertheless makes it a potentially valuable source of information about suicide across a wide population, and some studies 9–11 have analyzed communication on Twitter about mental health, particularly suicide. A mechanism for intervening in suicidality at the community level, with valid, reliable and acceptable methods of online detection, has not yet been fully established. 12 Owing to the lack of effective detection methods, a significant portion of the US population with mental illness does not receive any treatment. 13 This results in undesirable social consequences. For instance, negative perceptions of and discrimination toward persons with mental illness are substantial and widespread. 14
Given increasing suicide rates and their large impact on individuals and society, it is important to gain more knowledge to support vulnerable populations and contribute to suicide prevention. Although several risk factors for suicide have already been identified, it remains difficult to identify persons at risk, especially those who show no signs of suicidality. Time trends in suicide incidence have received broad international attention, 15,16 and it is recommended that mental health care services be planned so that they are available especially at high-risk moments, such as during the spring and at the beginning of January. 17 However, the data used to study suicide incidence usually come from suicide registers held by statistical institutions. In addition, the reported time trends are limited to seasonal or monthly trends, a granularity too coarse for effective prevention at a time of sharply increasing suicide rates. Furthermore, some studies of suicide incidence did not reveal the detailed factors behind suicide, so their results are of limited practical use for tailoring prevention to individuals with different suicide risks.
Therefore, a better understanding of the risk factors of suicidality and their time trends is needed to develop more appropriate suicide prevention strategies. Insight into high-risk time frames and suicide factors would contribute to more refined prevention strategies. This article therefore aims to provide insight into the temporal patterns with which a large number of people communicate their suicidality on Twitter. To this end, we propose an approach to mining suicidal behavior on Twitter based on quantitative content analysis and time series analysis. Tweets were retrieved through the Twitter streaming API using suicide-related terms as queries. The approach combines semantic analysis with tweet timestamps to extract temporal patterns of users’ suicidal behavior. Because we measure suicidality through semantic analysis, we used latent topic modeling to reveal latent suicide risk factors from a large volume of tweets, and we propose a quantitative metric, the suicidal topic–related score (ST-Score), to assess the suicidal tendency related to each risk factor. We then explored temporal patterns in the ST-Score using Fourier series analysis to discover suicidal patterns across the days of the week on Twitter. The contributions of this article are summarized as follows:
We identified suicidal risk factors from tweets by content analysis; these factors can reflect how suicidal risk factors trend over time.
We discovered temporal patterns of suicidal behavior from a quantitative measure of suicidality, which reveal that suicidality peaks occur in different time frames for different risk factors.
We proposed an approach for exploring temporal patterns of suicidal behavior on Twitter, which potentially provides a means of online detection across a wider high-risk population and a better understanding of its behavior.
Related work
Some individuals communicate their suicidal thoughts and plans to friends and family prior to suicide; 18 however, many do not disclose their intent. In recent years, individuals have broadcast their suicidality on social media sites such as Twitter, 9 indicating that social media sites have potential as a suicide prevention tool. 10 Twitter has recognized that individuals express suicidality in their broadcasts. Depression-related chatter on Twitter can offer insight into social networking about mental health. 19 Guntuku et al. 20 reviewed recent studies that aimed to predict mental illness using social media and suggested that depression and other mental illnesses are detectable in several online environments, but that the generalizability of these studies to broader samples has not been established. However, few studies to date have analyzed communication on Twitter about mental health, particularly suicide. 11 Suicide is a serious public health concern and is preventable. Automated detection methods may help to identify at-risk individuals through large-scale passive monitoring of social media. Access to appropriate mental health care, when and where it is needed, is vital for the prevention of suicide.
Christodoulou et al. 16 reviewed the literature on suicide seasonality published between 1979 and 2009 and found that the majority of studies confirm a peak in spring and a secondary peak in autumn. Day-of-week patterns in suicide incidence were reported by Beauchamp et al., 21 who found that the beginning of the week and the spring and fall seasons were associated with higher numbers of suicide attempts. Associations between particular days and holidays and increased attempted and completed suicides have also been studied. 17 Regarding seasonal patterns in suicide incidence, Durkheim et al. 22 suggested as early as the late 19th century that suicide incidence shows seasonal variation, and seasonality is now one of the most studied phenomena in suicide research. 21,23,24 However, the results mostly indicate a peak in spring 23,25,26 or autumn. 21 Furthermore, no studies to date have analyzed seasonality patterns of suicidality on Twitter, which limits the development of suicide prevention strategies that reach the population at the right moment.
Materials and methods
Data collection and pre-processing
First, we read through many suicide-related tweets to become familiar with the expressions used for suicidal thoughts or ideation, such as “suicide” and “kill myself,” and collected the terms used most frequently to express them. Second, we reviewed suicide-related research papers, which provided valuable information about suicide expressions. From these two steps, we compiled a relatively large list of terms used on Twitter, including keywords related to suicide or self-harm. The final list of suicide-related terms was determined iteratively. We used the initial term list to collect real-time tweets through the Twitter streaming API, manually checked the collected tweets, and updated the list by adding, deleting and modifying terms. We also added a stop-word list to remove obviously irrelevant tweets, such as those about a “suicide attack.” We stopped the process once most of the collected tweets were related to suicide. Generating the list of suicide-related key terms took roughly 2 to 3 weeks. The full list of suicide-related terms and stop words is given in Table 1.
Suicide-related Terms as Queries for Tweets Retrieval and Stop Words for Cleaning.
We collected a data set of 716,899 public tweets from January to November 2016 by searching the Twitter streaming API with suicide-related terms, including “suicide,” “want to die” and “to be dead.”
Next, the collected data were cleaned by removing stop words, that is, terms regarded as conveying no significant semantics to the texts or phrases in which they appear and which are consequently discarded. The stop-word list also includes special characters (e.g. “_,” “https,” “&”) and meaningless words (e.g. “oh,” “lol”) in the text.
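The retrieval and cleaning steps described above can be illustrated with a minimal Python sketch. It uses only the example terms quoted in the text; the full lists are in Table 1 and are not reproduced here. The function names, the tokenization rule and the URL pattern are assumptions made for illustration and are not the authors' implementation.

```python
import re

# Example entries quoted in the text; the complete lists live in Table 1.
QUERY_TERMS = ["suicide", "kill myself", "want to die", "to be dead"]
EXCLUDE_PHRASES = ["suicide attack"]           # obviously irrelevant context
NOISE_TOKENS = {"oh", "lol", "https", "_", "&"}  # special characters and filler words


def is_candidate(text):
    """Keep a tweet if it matches a query term and no exclusion phrase."""
    lowered = text.lower()
    if any(phrase in lowered for phrase in EXCLUDE_PHRASES):
        return False
    return any(term in lowered for term in QUERY_TERMS)


def clean(text):
    """Lowercase, drop URLs, tokenize on letters, and remove noise tokens."""
    lowered = re.sub(r"https?://\S+", " ", text.lower())  # assumed URL pattern
    tokens = re.findall(r"[a-z']+", lowered)              # assumed tokenizer
    return [tok for tok in tokens if tok not in NOISE_TOKENS]


if __name__ == "__main__":
    tweet = "oh lol I want to die https://t.co/abc &"
    if is_candidate(tweet):
        print(clean(tweet))
```

Substring matching is the simplest choice here; a production filter would likely need word-boundary handling so that short terms do not match inside unrelated words.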
Since we focused on content analysis and temporal analysis, a set of main features was extracted and used to model the data set collected above; these features are defined in Table 2.
Tweets Data Features.
The large volume of collected tweets made it challenging to extract useful semantic information. Because discussion on Twitter is unconstrained, a large percentage of the collected tweets do not express suicidality, which makes the data sparse and diverse. To improve the study of suicidal behavior, we used the convolutional neural network (CNN) model for short text classification proposed by Kim 27 to build a binary tweet classifier that selects suicide-related tweets precisely. We used GloVe Twitter embeddings to initialize the model input. The model was trained on a corpus of 3000 annotated tweets, of which 1985 were annotated as related to suicide. It achieved a precision of 0.78, a recall of 0.88 and an F1 measure of 0.83 on the test corpus. The model was also compared with traditional machine learning algorithms, including Support Vector Machine, Extra Trees, Random Forest and Logistic Regression, and with a Bi-directional Long Short-Term Memory model. As shown in Du et al., 28 the CNN model performed best on the Positive type, the Negative type and overall accuracy.
The CNN-based classifier assigns each tweet a Positive or Negative label. Positive means the tweet relates to suicide or suicidal ideation of the Twitter user (personal experience or feeling). Negative means the tweet is not related to suicide or suicidal ideation, negates suicide or suicidal ideation, or is otherwise non-positive. We used the trained CNN model to select 191,473 Positive tweets for the subsequent analysis, reducing the data set to roughly 27% of the 716,899 collected tweets.
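As a quick consistency check on the reported classifier metrics, the F1 measure is the harmonic mean of precision and recall. The short sketch below recomputes it from the reported precision and recall; the function name is illustrative only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Values reported for the CNN classifier on the test corpus.
    print(round(f1_score(0.78, 0.88), 2))
```

With a precision of 0.78 and a recall of 0.88, the result rounds to the reported value of 0.83.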
Identifying suicide-related topics
In this section, we aim to capture interesting facts about suicide-related semantic themes through topic modeling. The two main models commonly used for topic modeling are (1) Latent Dirichlet Allocation (LDA), a probabilistic generative topic model proposed by Blei et al., 29 and (2) Non-negative Matrix Factorization (NMF), a vector space factorization method for topic modeling. 30 Topic modeling is a key tool for discovering latent semantic structure within a variety of document collections. 31 LDA is a probabilistic model capable of expressing uncertainty about the placement of topics across texts and the assignment of words to topics. 32 NMF is a deterministic algorithm that arrives at a single representation of the corpus. For this reason, NMF is often characterized as a machine-learning algorithm. Although LDA has been employed effectively in many text mining fields, it often does not scale to large data sets with millions of documents or tweets. 33
In our research, we identified suicide-related latent topics by NMF topic modeling. First, latent topics were inferred from the suicidal tweets data set. Second, the optimal latent topic structure was determined to shed light on the significant semantics of suicide tweets.
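To make the factorization idea concrete, the following is a minimal sketch of NMF using the classic multiplicative update rules for the squared reconstruction error, written in plain Python on a toy matrix. The notation (a tweet-by-term matrix A approximated by W times H, with k topics), the toy data and the random initialization are assumptions for illustration; the solver, weighting scheme and initialization used in the study are not specified here.

```python
import random


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def squared_error(a, w, h):
    """Sum of squared differences between A and its approximation W @ H."""
    approx = matmul(w, h)
    return sum((a[i][j] - approx[i][j]) ** 2
               for i in range(len(a)) for j in range(len(a[0])))


def nmf(a, k, iters=200, seed=0, eps=1e-9):
    """Factor non-negative A (n x m) into W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(a), len(a[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T A) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, a), matmul(matmul(wt, w), h)
        h = [[h[r][c] * num[r][c] / (den[r][c] + eps) for c in range(m)]
             for r in range(k)]
        # W <- W * (A H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(a, ht), matmul(w, matmul(h, ht))
        w = [[w[r][c] * num[r][c] / (den[r][c] + eps) for c in range(k)]
             for r in range(n)]
    return w, h


# Toy tweet-by-term counts: rows are tweets, columns are terms (hypothetical).
A = [
    [2, 1, 0, 0],
    [3, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 3],
]
W, H = nmf(A, k=2)
```

Each row of H weights the terms of one topic, and each row of W gives the strength of every topic in one tweet. For a corpus of realistic size, an optimized library implementation would be used instead of these list-based helpers.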
Inferring topics by NMF topic model
NMF is a technique for decomposing a non-negative matrix
Choosing the optimal latent topics structure
It is critical to determine an appropriate number of topics
The metric is based on the assumption that a model with an appropriate number of topics is more robust to missing data, which was proposed by Greene et al. 36
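A stability measure of this kind needs a way to compare the ranked top-term lists of two topics. One common choice is the average Jaccard similarity over the top-d prefixes of the two lists, sketched below. This is an assumption about the general form of such a measure; the exact weighting used in the study, and the step that matches topics between a reference model and models fitted on resampled data, are not reproduced here.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def average_jaccard(ranked_a, ranked_b):
    """Mean Jaccard similarity over the top-d prefixes, d = 1..t."""
    t = min(len(ranked_a), len(ranked_b))
    if t == 0:
        raise ValueError("both ranked lists must be non-empty")
    return sum(jaccard(ranked_a[:d], ranked_b[:d]) for d in range(1, t + 1)) / t


if __name__ == "__main__":
    # Hypothetical top terms of the same topic from two model runs.
    run_1 = ["kill", "myself", "school"]
    run_2 = ["kill", "school", "myself"]
    print(average_jaccard(run_1, run_2))
```

Because every prefix contributes, disagreement near the top of the ranking lowers the score more than disagreement near the bottom, which is the property that makes the measure suitable for ranked term lists.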
Given a value of
Modeling the ST-Score time series
Topics found by NMF topic modeling became the main themes about which Twitter users are discussing or expressing the suicidal thought. Each vector in
To reflect the user-involved topic strength for a tweet message, we defined ST-Score as
Next, we partitioned the tweets into
Behavior patterns mining
In order to get a clear description of behavior patterns, we hope the patterns are as unrelated as possible. So, we evaluate the correlation between two ST-Score time series, when the value of, in which
Because ST-Score values vary over time, the ST-Score time series fluctuate and exhibit oscillatory behavior that is worth studying. To find users’ time-varying behavior patterns, the series should be analyzed with a tool that exploits this oscillation. The Fourier transform has been widely used in scientific research, such as signal processing and time series analysis, to quantify underlying signals and repetitive cycles in data, and it is well suited to this study.
We utilized Fourier series to make a model of periodic analysis. 37
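As an illustration of periodic modeling with a Fourier series, the sketch below fits a truncated series with period 7 to one value per day of the week, using the form s(t) = a0 + sum over n of (a_n cos(2πnt/7) + b_n sin(2πnt/7)). The order of 3 is an assumption chosen because 1 + 2·3 = 7 coefficients reproduce seven daily values exactly; the closed-form coefficients below rely on the seven points being equally spaced and are not necessarily how the coefficients were estimated in the study. The input values are hypothetical.

```python
import math

PERIOD = 7  # days in a week


def fit_weekly_fourier(y, order=3):
    """Fit a0 and (a_n, b_n) for n = 1..order to seven daily values (t = 0..6)."""
    if len(y) != PERIOD:
        raise ValueError("expected exactly one value per day of the week")
    a0 = sum(y) / PERIOD
    coeffs = []
    for n in range(1, order + 1):
        angle = 2 * math.pi * n / PERIOD
        a_n = 2 / PERIOD * sum(v * math.cos(angle * t) for t, v in enumerate(y))
        b_n = 2 / PERIOD * sum(v * math.sin(angle * t) for t, v in enumerate(y))
        coeffs.append((a_n, b_n))
    return a0, coeffs


def evaluate(a0, coeffs, t):
    """Evaluate the truncated Fourier series at time t (in days)."""
    total = a0
    for n, (a_n, b_n) in enumerate(coeffs, start=1):
        angle = 2 * math.pi * n * t / PERIOD
        total += a_n * math.cos(angle) + b_n * math.sin(angle)
    return total


if __name__ == "__main__":
    # Hypothetical log scores for Monday..Sunday.
    scores = [1.2, 1.5, 1.1, 0.9, 1.0, 1.8, 2.0]
    a0, coeffs = fit_weekly_fourier(scores)
    print([round(evaluate(a0, coeffs, t), 3) for t in range(PERIOD)])
```

A lower order gives a smoother weekly curve that no longer passes through every point, which is usually preferable when the daily values are noisy.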
Let
It is required to estimate the
where
For example, with weekly seasonality and
We employed
It is valuable to explore user behavior patterns of weekly period. We explored the temporal patterns of different day during 1 week. For weekly periodic analysis, we have found Fourier series expansion for
Fourier series expansion for
Results
Suicidal topics discovery
Guided by the metric described previously, we set the topic number

Weighted Jaccard average stability for different number of topics.
According to the stability value and the discovered topics, we evaluated and found that the topics structure with
Table 3 lists the most relevant words for each of the 13 topics identified from the tweets with NMF. They reveal that a variety of issues or factors related to suicide circulate on Twitter. For example, the tweet “my school actually makes me want to kill myself” is related to the suicidal factor “school,” and the tweet “I have severe depression and I want to kill myself” is related to the suicidal factor “depression.”
Identified topics and labels.
After reviewing the highest-weight words and the high-probability tweets for each topic, we summarized its meaning and assigned a descriptive label. Since the topics discovered with NMF topic modeling are latent semantic themes of the large body of tweet text, the labels can reflect the risk factors related to users’ suicidality. For instance, Topic T0 covers life factors related to suicidality, and Topic T4 is a special event-based factor that reflects feelings related to the movie “Suicide Squad.” Topics T1, T2, T3, T5, T6, T7, T8, T9, T10, T11 and T12 cover the manifest factors of loss of energy, caring for, depression, indifferent or appearance, fashion, emotion, change, work or finance, fandom, low mood and school, respectively. Table 3 also lists sample tweets to help illustrate the risk factors related to users’ suicidality. From the topic proportions in Table 3, we can see that Topics T9, T0 and T11 are the top three factors that emerged from the tweets.
ST-Score time series
According to the topic modeling results, we computed the ST-Score for all tweets in the data set. We then obtained 13 ST-Score time series
where

Number of each suicidal topic–related score time series.
Behavior patterns discovery
Correlation analysis
To leverage the discovered topics to highlight the temporal behavior patterns, first, we evaluated the correlation between two ST-Score time series
Then we used a heat map to plot the correlation matrix. Figure 3 shows the resulting heat map, with a color gradient running from negative to positive correlation.

Correlational matrix.
Overall, the maximum absolute value of the correlation is less than 0.45, and no pair of series is highly correlated. The ST-Score time series are therefore sufficiently distinct to be used for mining behavior patterns.
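The check described above can be sketched as follows: compute the pairwise correlation matrix of the time series and report the largest absolute off-diagonal value. The sketch assumes the Pearson correlation coefficient, which the text does not name explicitly, and the three short series are hypothetical stand-ins for the 13 ST-Score series.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sd_x * sd_y)


def correlation_matrix(series):
    """Square matrix of pairwise Pearson correlations."""
    return [[pearson(a, b) for b in series] for a in series]


def max_abs_off_diagonal(matrix):
    """Largest absolute correlation between two different series."""
    size = len(matrix)
    return max(abs(matrix[i][j])
               for i in range(size) for j in range(size) if i != j)


if __name__ == "__main__":
    # Hypothetical score series for three topics.
    series = [
        [0.2, 0.5, 0.1, 0.4, 0.3],
        [0.3, 0.1, 0.4, 0.2, 0.5],
        [0.5, 0.4, 0.6, 0.1, 0.2],
    ]
    print(round(max_abs_off_diagonal(correlation_matrix(series)), 3))
```

A small maximum indicates that no two series carry largely the same signal, which is the condition the analysis relies on before mining separate patterns per topic.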
Temporal patterns of weekly different day
Figure 4 shows the variation of the ST-Score across the days of the week: the y axis is the log of the ST-Score and the x axis is the day of the week. The fluctuation differs for each topic, revealing a weekly seasonality of suicidal behavior that depends on the risk factor.

Temporal patterns of weekly different day.
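A day-of-week profile like the one plotted in the figure can be built by grouping per-tweet scores by weekday and taking the logarithm of the aggregate. The sketch below is an illustration only: the per-tweet scores are hypothetical, and summing the scores within each weekday is an assumption, since the exact aggregation behind the figure is not restated here.

```python
import math
from collections import defaultdict
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def weekday_log_scores(records):
    """Map each weekday to the log of its summed score.

    records: iterable of (ISO timestamp string, positive score) pairs.
    """
    totals = defaultdict(float)
    for stamp, score in records:
        day_index = datetime.fromisoformat(stamp).weekday()  # Monday is 0
        totals[day_index] += score
    return {DAY_NAMES[d]: math.log(total)
            for d, total in sorted(totals.items()) if total > 0}


if __name__ == "__main__":
    # Hypothetical (timestamp, score) pairs for one topic.
    records = [
        ("2016-01-03T10:00:00", 1.0),   # a Sunday
        ("2016-01-04T08:00:00", 1.0),   # a Monday
        ("2016-01-04T21:00:00", 0.5),
    ]
    print(weekday_log_scores(records))
```

Weekdays with no tweets are simply absent from the result, so a caller plotting all seven days should decide how to display such gaps.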
Several features stand out. First, higher ST-Scores occur on Saturday and Sunday for T2, T5, T6, T10 and T11. During weekends, people may engage in leisure-related activities with family members or others. For people with mental disorders, what they see and hear may tend to induce suicidality, resulting in a higher ST-Score on weekends than on weekdays.
Second, the ST-Score is higher on Monday or Tuesday for T1, T7, T9, T10 and T12. This suggests that people are at high risk on Monday or Tuesday because of the stress of returning to work or school after the weekend.
Third, the ST-Score is higher on weekdays for T3 and T4. This suggests that those with latent mental disorders are at high risk on weekdays, because they may feel lonely or stressed by weekday work. As individuals engage more in family visits, visits with friends and other activities on Saturday or Sunday, their mental symptoms may be relieved to some extent.
Though these observations warrant additional investigation, the temporal patterns of suicidal behavior are important for understanding the weekly seasonality of suicidality. The temporal patterns identified in this study highlight fine-grained behavior features related to different risk factors, which is of practical use for refining prevention strategies and for monitoring and supporting a wider population.
Discussion and implications
Social media such as Twitter offer the opportunity to open new frontiers in the behavioral and health sciences, as they provide data where many purpose-designed studies either cannot be launched in time or would be prohibitively difficult to conduct. More generally, mental health researchers often lack data covering longer periods of time with which to assess mental health problems systematically. Because social media data are inherently longitudinal, they could also facilitate the investigation of mental health–related problems. Furthermore, social media data are available in real time, facilitating surveillance and prediction of mental health risk. 40
Together, these points suggest that Twitter research on suicidality can be carried out using more advanced content analysis across temporal domains and multiple levels of analysis. In this study, temporal patterns of suicidal behavior on Twitter in 2016 were examined. First, we examined the risk factors of suicide on Twitter. The results indicate that latent suicidal factors can be detected from Twitter by semantic analysis. The identified risk factors not only correlate significantly with ground truth surveillance and survey data (for example, depression) but also keep pace with the times and can reflect current trends in suicidality (for example, fashion and fandom).
Second, we examined day-of-week patterns of suicidal behavior on Twitter by constructing quantitative ST-Score time series. Suicidal risk peaks occur in different time frames for different suicidal risk factors. The temporal patterns are related to users’ interaction behavior on Twitter. The weekly patterns reflect how people’s everyday life and work relate to suicidality and give fine-grained findings on time-varying suicidality trends. Such temporal patterns give insight into suicidal behavior and help direct attention to vulnerable populations at the right moments.
Third, the proposed approach for exploring temporal patterns of suicidal behavior on Twitter potentially provides a means of online detection for a wider population. It contributes to scientific research on suicidal behavior on social media by presenting a valid, reliable and acceptable method of online suicide detection that uses a quantitative measure of suicidality to support time trend analysis. It can potentially yield new insights not easily achievable through traditional qualitative methods. More research is needed to better understand the underlying behavioral mechanisms of suicide.
The findings help to develop appropriate suicide prevention strategies and can be generalized to a wider population. They provide fine-grained knowledge about high-risk time frames, which indicate when attention to vulnerable populations is most needed. We therefore recommend that mental health care services, communities, friends and family members be available especially at high-risk moments to give proper support and resources to a broader population. Suicide prevention intervention at the right moments and at multiple levels seems promising. Furthermore, the findings could inform mental health products that improve the treatment of suicide-related mental disorders. For instance, a music player, app or wristband could be designed according to the temporal patterns of suicidal behavior discovered in this article, so as to relieve symptoms and reduce the risk of suicide.
One limitation of this study lies in data collection and pre-processing. We searched for and collected suicide-related tweets according to the suicide-related terms in Table 1. Undeniably, some users express their suicidal ideation in unusual terms or new Internet words. To our knowledge, no complete list of such terms exists in the published literature, and compiling a clear list of unusual terms and new Internet words would take more time. We therefore collected as many common terms as possible in Table 1 and used them to collect the data, and we look forward to drawing on further related research for more information about suicide-related terms in the future.
The tweets collected using suicide-related terms may include tweets that do not express suicidal thoughts, so we filtered them with the CNN-based classifier to obtain Positive tweets. The Positive results may still contain tweets with vague or exaggerated expressions. Although some of these expressions may show suicidal ideation to some extent, the classifier cannot classify a vaguely worded tweet with full accuracy, which is a limitation. Considering more tweets from the same user could clarify whether a vague tweet is Positive, and future analysis over a user’s timeline may improve accuracy in recognizing vague expressions. In this study, however, given the good performance of the CNN, vague tweets are likely to make up only a small proportion of the large data set and are therefore unlikely to have an obvious influence on the research results.
Another limitation of this study is that the location data collected from Twitter are inaccurate and incomplete, so spatio-temporal analysis was not possible.
Conclusion
Exploring temporal behavior patterns in large-scale data from social media sites offers a potentially valuable channel for new insights into public health concerns. By systematically assessing suicide risk factors and time-varying behavior on Twitter, government and public health institutions may be able to improve suicide prevention and treatment initiatives, such as setting up help lines and making health care services available at the right moments. Future work will exploit more of the available social media information to conduct multiple-level analysis.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the National Nature Science Foundation of China (Grant No.71501172) and Zhejiang Provincial Natural Science Foundation of China (Grant No. LY18G020017). This research was partially supported by the National Library of Medicine of the National Institutes of Health under award number 2R01LM010681-05, R01LM011829 and the Cancer Prevention Research Institute of Texas (CPRIT) Training Grant #RP160015. The authors also acknowledge the scholarship support from the China Scholarship Council.
