Abstract
The new coronavirus outbreak has been officially declared a global pandemic by the World Health Organization. To grapple with the rapid spread of this ongoing pandemic, most countries have banned indoor and outdoor gatherings and ordered their residents to stay home. Given the developing situation with coronavirus, mental health is an important challenge in our society today. In this paper, we discuss the investigation of social media postings to detect signals relevant to depression. To this end, we utilize topic modeling features and a collection of psycholinguistic and mental-well-being attributes to develop statistical models to characterize and facilitate representation of the more subtle aspects of depression. Furthermore, we predict whether signals relevant to depression are likely to grow significantly as time moves forward. Our best classifier yields F-1 scores as high as 0.8 and surpasses the utilized baseline by a considerable margin, 0.173. In closing, we propose several future research avenues.
Introduction
The ongoing coronavirus outbreak has been officially defined a global pandemic by the World Health Organization (WHO) on 11 March 2020. Coronavirus disease 2019 (COVID-19) is an infectious disease caused by a newly discovered coronavirus. 1 COVID-19 causes a respiratory illness characterized by symptoms such as cough, fever, difficulty breathing, and pneumonia in both lungs. These symptoms may take up to 14 days to appear after exposure to COVID-19. COVID-19 spares no one and infects people of all ages. Older people and those with pre-existing medical conditions like cardiovascular disease, diabetes, chronic respiratory disease, and cancer appear to be more vulnerable to becoming severely ill with COVID-19.1,2
WHO has reported a drastic increase in confirmed cases and deaths all over the world. To mitigate the rapid spread of COVID-19, many countries have forbidden indoor and outdoor gatherings in excess of particular numbers of people; asked non-essential services, nonprofit entities, and retail businesses to close; issued stay-at-home orders for their residents; and advised them to practice social distancing and avoid all non-essential travel abroad. We are living through a pivotal moment in history. The onslaught of the pandemic has severely challenged our economic systems 3 and caused substantial changes to people’s daily routine. The current pandemic can affect people both physically and psychologically. 4 For example, in China, 96.2% of clinically stable COVID-19 patients in the early recovery phase reported significant post-traumatic stress disorder (PTSD) symptoms. 5 Psychological distress is increasing worldwide and may have long-lasting consequences and repercussions on mental health.6–9
Given the developing situation with the pandemic, social media allows people to inform themselves and get updates from official sources. People may naturally panic when seeing headlines announcing bad news and numbers of cases. This may affect ways in which individuals express themselves and share opinions, thoughts, and personal experiences with others. The emotion and language in social media postings may potentially indicate feelings such as loneliness, 10 anxiety, anger and stress, among others.11,12 For instance, a person may express emotional reactions that can be unpleasant, disturbing, and overwhelming. Emotional problems like anxiety and depression manifest themselves as feelings of inner emotional distress. Mental health issues can comprise a wide range of disorders that affect mood, thinking, and behavior. Some examples of mental illness include PTSD, depression, anxiety disorders, addictive behaviors, etc. In this paper, our primary interest is in depression. Depression is a serious condition that can cause a persistent feeling of sadness and loss of interest and can affect a person’s daily life. 13 Survey research conducted by Mental Health Research Canada found that feelings of depression are rising constantly. 14 Before the pandemic, 7% of Canadians reported high levels of depression. This rate has risen to 16% during the stay-at-home period and 22% predict high levels of depression if social isolation continues for two more months. 14
Recognizing early signs of depression is of critical importance and can aid mental health services in assessing the impact of the pandemic on the population and implementing healthier coping strategies to build personal resilience. In addition, appropriate services can be provided for those in need. In this paper, we leverage social media postings to detect signals relevant to depression due to COVID-19. To this end, we build a corpus of postings shared on Twitter during the stay-at-home period. We make use of a topic modeling approach to generate topics addressed by individuals and evaluate language features from topic words to determine whether they indicate signals for depression. It should be noted that we retain solely depression-indicative topics and collect individuals who engage with these topics to investigate their posting histories since the onset of the stay-at-home order. Specifically, this work makes the following contributions: 1. We demonstrate the effectiveness of our data collection and data pre-processing strategy to gather social media postings containing signals relevant to depression. 2. We capture evidence from a corpus of postings and potential individuals who manifest signals for depression and consider them as an experimental group. We measure the similarity between different topics addressed by individuals in the experimental group to discover their overlapping behavioral characteristics and understand their linguistic idiosyncrasies. 3. We develop models to predict whether signals relevant to depression are likely to grow significantly as time moves forward.
Related work
The role of social media in mental health has been explored by De Choudhury. 11 The study suggested a guideline that emphasizes the use of social media postings to gauge what the pertinent mental literature would predict at the individual- and population-levels. This could allow the identification of depressed or otherwise at-risk individuals through the large-scale passive monitoring of social media. 15 Recently, research has associated social media with several mental health conditions, including stress,16–18 post-traumatic stress disorder,19–21 and depression.15,22–32
To quantify depression from texts, De Choudhury et al. proposed a social media depression index to identify levels of depression among individuals and predict social network behavior changes related to post-partum depression using several features, including structural properties of social networks. 24 While some studies rely exclusively on open-vocabulary analysis and lexicon-based techniques such as Linguistic Inquiry and Word Count (LIWC) 33 to build a classifier, other studies couple LIWC with topic modeling features.27,34–36 For instance, Coppersmith et al. used LIWC to demonstrate characteristic differences in language use for mental disorders. 19 Their approach utilizes uni-grams and 5-grams to indicate the presence of mental health conditions. Stark et al. combined LIWC and latent Dirichlet allocation (LDA)-based features in the classification of social relationships. 34 Resnik et al. explored the value-add of topic modeling in text analysis for depression and showed that topic models can take us beyond the LIWC categories to relevant themes related to depression and neuroticism as a strongly associated personality measure. 27 Another work of Resnik et al. investigated the use of supervised topic models in the analysis of linguistic signals for detecting depression. 28 Tadesse et al. demonstrated that multiple feature combinations (LIWC+LDA+bi-gram) can yield competitive results. 35 In this paper, we take a step forward by combining LDA with bi-gram, LIWC and other psycholinguistic dictionary-based features to identify depression-indicative topics, in order to facilitate the investigation of signals relevant to depression. The rationale behind the incorporation of additional features is to enrich the model to be able to capture depression-related terms and patterns that may escape the LIWC dictionary. We utilize correlation metrics to compare the performance of the proposed features with other alternative feature combinations.
Detection of depression signals
Dataset during the stay-at-home period
All data we obtained is public, posted between 12 March 2020 and 25 May 2020, 1 and made available from Twitter. Specifically, we extracted tweets bearing the words or hashtags: COVID, coronavirus, #StayAtHome, or #StayHome. For privacy and ethical reasons, we avoid displaying personally identifiable information, especially names and pseudonyms. Therefore, we randomly replaced such information to ensure the anonymity and privacy of the data.
To preprocess the data, we limited our set to Canadian users and removed tweets written in a language other than English or French. Additionally, we discarded redundant tweets, retweets without comments, tweets containing only the keyword (i.e., words or hashtags utilized for extraction), and multimedia such as image and video. We removed links in tweets, but kept emojis, since research has proven that emotions within a text can be expressed through the use of emojis. 37 We used the Python Googletrans 2 implementation package to translate tweets from French to English. We removed tweets in which the word COVID or coronavirus occurs simultaneously with the term mental health or depression. 3 We believe that people reacting emotionally may avoid combining the two words in a single tweet when it conveys a personal account. Consequently, we assume that these kinds of tweets are more likely to convey information or warnings about mental health. We eliminated stopwords but kept pronouns. 4 Pronouns reveal information on people’s emotional state, thinking, and personality. 33 Chung and Pennebaker discovered that individuals susceptible to mental illness such as depression more frequently use first-person pronouns, suggesting higher self-attention focus. 38
To concentrate exclusively on data containing signals relevant to depression, we quantified different aspects of the language usage and patterns of individuals, using automated methods in order to extract features indicative of depression in tweets.
Dataset before the stay-at-home order
We replicated and applied the same logic as above to collect tweets posted before the stay-at-home order, that is, from 1 January 2020 to 11 March 2020. In total, we extracted 1,006,941 tweets and 161,327 distinct users, that is, users who had at least five tweets.
Feature Design
Bi-gram features
We extracted bi-grams from tweets by leveraging the vectors based on the term frequency-inverse document frequency (TF-IDF) approach.35,39 We used TF-IDF as a statistical measure to evaluate how important a word is to each tweet in the corpus. We convert each tweet into its bag-of-word representation and calculate the TF-IDF value of each word utilizing the standard formula (Equation (1)).
LIWC features
The Linguistic Inquiry and Word Count (LIWC) dictionary is a widely used psychometrically validated system for psychology-related analysis of language and word classification. 33 LIWC includes word categories that have pre-labeled meanings. For each tweet, we calculated the number of observed words, using the LIWC dictionary and focusing on three LIWC categories: linguistic dimensions, psychological processes, and personal concerns. For the psychological processes and personal concerns categories, we utilized all of their subcategories, while for the linguistic dimensions category, we exclusively measured the proportion of first-person pronouns in the tweet.
PLUS features
We extracted depression-related features from the MRC psycholinguistic database, 40 the WHO glossary of psychiatric and mental health terms, 41 and the NRC emotion lexicons. 42 The NRC emotion lexicon is a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). MRC provides information about 26 different linguistic properties and includes more than 150,000 words with linguistic and psycholinguistic features of each word. For each tweet, we identified depression-related words using the WHO glossary and verified whether these words fall into the NRC emotion lexicons. Specifically, we discarded all the words that imply “joy” as the emotional state. Each MRC feature was computed by averaging the scores of all the depression-related words found in the database.
LDA features
We utilized LDA 43 to learn the topics addressed from the tweets. LDA is a probabilistic model that discovers latent topics in a text corpus and can be trained using collapsed Gibbs sampling. A topic is a distribution over a fixed vocabulary. As the parameters of LDA, we set α and β to 0.01. All extracted topics were used as features.
Experimental Setup and Results
Prediction of depression during the stay-at-home period
Top fifteen words for the first five of the 38 validated depression-indicative topics.
Prediction quality for depression, for different feature sets and all combinations, as measured using the Pearson r. For LIWC features, we consider one feature per category and for LDA features, we take one feature per topic.
To make predictions over time for signals relevant to depression, we divided our data (857,294 tweets) into one-week periods. Specifically, we separately derived 50 topics from each subset. We prepared the training set using topics from the first to the penultimate week and took topics from the last week as the test set. We utilized three different classifiers: support vector machine (SVM), logistic regression (LR), and random forest (RF). We trained our classifiers with the three feature sets which achieved the highest Pearson’s (r) results in Table 2: LIWC+LDA, LIWC+bi-gram+LDA, and LIWC+PLUS+bi-gram+LDA. We considered the feature set LIWC itself as a baseline. For SVM, we set the regularization parameter λ = 0.0001 and the value γ of the radial basis function kernel to 0.5 and for RF, we set the number of trees to 500 and the maximum depth and number of features to 3 and 30, respectively. The prediction performances are reported as F-1 scores, i.e., the harmonic mean of precision and recall.
Prediction performances over time. Bold font indicates the best result for each feature set.
Similarity between topics before and during stay-at-home restrictions
To discover overlapping behavioral characteristics of depression-related terms, we experimented with 50 topics on each one-week subset of the data as divided above. Each topic was represented by the top fifteen highest-probability words, out of which we retained solely the top ten depression-related words. We computed topic similarity using measures based on topic word probability distributions 45 (such as Kullback-Leibler divergence (KL) 46 ) and topic word sets 47 (such as Jaccard similarity (JS). 48 )
Similarity between different depression-related topics addressed by individuals between before and during the stay-at-home period.
The Spearman correlation (ρ) between the two-similarity metrics is presented. We obtain ρ = 0.839 for LIWC+LDA, ρ = 0.873 for LIWC+bi-gram+LDA, and ρ = 0.930 for LIWC+PLUS+bi-gram+LDA during the stay-at-home period; and ρ = 0.011 for LIWC+LDA, ρ = 0.016 for LIWC+bi-gram+LDA, and ρ = 0.02 for LIWC+PLUS+bi-gram+LDA before the stay-at-home order. We report that all correlations are statistically significant (p Monthly trends of similarity between depression-related topics addressed by individuals. Note that we utilize LIWC+PLUS+bi-gram+LDA.
These results indicate strong and meaningful correlations between depression-indicative topics addressed during the stay-at-home. The language in these topics appears to be somewhat similar and recurs from one period to another during the stay-at-home period. This suggests that we should pay more attention to this vocabulary when predicting depression from the individual-level.
Figure 2 shows the trend of individuals who have participated in depression-related topics. We observe a rise of participants within the second week of March, which symbolizes the onset of lockdown; and we note that the number substantially decreased within the fifth week of May, which represents the date on which COVID-19 lockdown restrictions began slowly being relaxed across the country. We calculated the percentage that individuals who have participated in depression-related topics represents to the overall number of individuals collected for each month. We found that 6.9%, 7.7%, 28.4%, 36.4% and 30.1%, respectively, for January, February, March, April and May. The number of individuals who have participated in depression-related topics. We make a weekly count of these individuals in the months before and during the stay-at-home order. For instance, the blue bar in Jan (January) is associated with the first week (W1), the red bar with the second week (W2), and so on.
Discussion and Conclusion
The COVID-19 pandemic has upended much of society in unprecedented ways. Due to border closures, lockdowns, social-distancing measures and other restrictions implemented by the government, people often use social media such as Twitter and Facebook for socializing. In this paper, we collected geo-located tweets before and during the first lockdown in Canada in order to detect relevant-depression signals from the content of social media postings.
The primary objective of this research was to measure the similarity between different topics addressed by people to discover overlapping behavioral characteristics of depression-related words. To this end, we computed the language similarity between all possible topic pairs addressed by people. By analyzing the similarity between depression-related topics, we observed that all correlations are statistically significant during stay-at-home restrictions, but not significant before the coronavirus lockdown. Specifically, during the lockdown, the correlations are statistically significant for all the features utilized, and the proposed features specifically achieve the strongest correlation. Our results find that people were engaged more deeply in depression-related topics during lockdown in Canada. Interestingly, our results indicate similar patterns with a study conducted to detect community depression dynamics due to COVID-19 in Australia and another study conducted to estimate the prevalence of and risk factors associated with depression symptoms among United States adults before and during the pandemic.50,51 Our research and aforementioned studies found a surge of depression-indicative signals from the onset of lockdown (March 2020) to the relaxation of lockdown (May 2020). Our findings highlight the urgent need to develop psychological interventions and preventive strategies that can preserve the mental health of individuals during the COVID-19 lockdown.
The second main aim of the research was to build a predictive model to investigate the evolution of relevant-depression signals during the coronavirus lockdown. Our predictive model provided evidence of pronounced depression patterns in the experimental datasets. Our best classifier achieves F-1 scores as high as 0.8, which is a 0.173 relative the improvement over the baseline features. The proposed features yield a higher Pearson correlation (r = 0.506) than other alternative feature combinations and the improvement is statistically significant (p
In future work, we aim to include socioeconomic and demographic attributes with network and language information to predict depression at the regional level. Furthermore, we would like to investigate affinity relationships between individuals who manifest signs of depression, including their personality types.52–54
Footnotes
Acknowledgements
The authors would like to thank the anonymous reviewers for their insightful and constructive comments. This work was partially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) under Discovery Grant RGPIN-2020-07110 and Discovery Accelerator Supplements Grant RGPAS-2020-00089 to Pr. Wang. Tshimula thanks the Arbour Foundation for awarding him a scholarship.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
