Abstract
This paper aims to identify users’ information needs regarding coronavirus and the differences in those needs between the online health community MedHelp and the question-and-answer forum Quora during the COVID-19 global pandemic. We obtained the posts in the sub-community Coronavirus on MedHelp (195 posts with 1627 answers) and under the topic COVID-19 (2019-2020) on Quora (263 posts with 8401 answers) via web scraping built on Selenium WebDriver. After preprocessing, we conducted topic modeling on both corpora and identified the best topic model for each corpus based on diagnostic metrics. Leveraging the improved sqrt-cosine similarity measurement, we further compared the topic similarity between the two corpora. This study finds common information needs on both platforms about vaccination and the essential elements of the disease, including onset symptoms, transmission routes, preventive measures, and the treatment and control of COVID-19. Discussions unique to MedHelp concern psychological health and the therapeutic management of patients. Users on Quora show particular interest in the association between vaccines and Luciferase and in attacks on Fauci after the release of an email trove. The work is beneficial for researchers who aim to provide accurate information assistance and build effective online emergency response programs during the pandemic.
Introduction
The growth of the Internet has made health information more accessible over the past decade. Studies have reported that more than 70% of Internet users have searched online for health information or have used the Internet for health-related purposes.1,2 People most frequently went online to search for condition-specific information, a particular medical treatment or procedure, and disease prevention. 1
The COVID-19 pandemic is a public health and medical emergency on an unprecedented scale that began in 2020. As a pandemic, COVID-19 can cause feelings of worry, fear, distress and anxiety. 3 Meanwhile, the knowledge that can be gained from the Internet can have a remarkable impact during pandemics. Previous research has established that appropriate and timely access to quality healthcare information during infectious disease outbreaks can help people mitigate anxiety, develop adequate risk perceptions and make proper health decisions to adopt protective measures.4,5 Therefore, it is crucial for stakeholders to understand the public’s preferences for healthcare information. Knowing which topics the public needs information about could help present healthcare information in a manner and format that suits public needs.
Medical journals have emphasized the growing importance of social media platforms as a valuable tool for the dissemination of disease mitigation strategies. 6 The public has become more reliant on social networking sites to stay informed during a crisis. 7 Instead of focusing on social media platforms built for socializing via many-to-many conversation, such as Twitter or Facebook,8–10 this study investigates public health information needs for COVID-19 on online health communities (OHCs) and question-and-answer (Q&A) forums, which are built on one-to-many knowledge-based communication. The main aim of this study is to identify the variety of information needs of the public over the course of the pandemic. This work will generate fresh insights to help public health agencies know how to communicate and what to focus on as the world continues to navigate the COVID-19 outbreak.
Drawing on previous research, this study investigates the following three research questions:
Research Question 1 (RQ1): What kinds of health-related information do users discuss in the OHC about the COVID-19 pandemic?
Research Question 2 (RQ2): What kinds of health-related information do users communicate in the Q&A forum about the COVID-19 pandemic?
Research Question 3 (RQ3): Do users’ information needs on COVID-19 in the OHC differ from those in the Q&A forum?
Related work
Health information needs during COVID-19 pandemic
Information need is a rather nebulous term, difficult to define, isolate and measure. 11 According to Nicholas, 12 “when definitions of the concept information needs are provided, they are typically vague or highly complicated, and individuals often talk about information needs when they are actually referring to information wants and demands” (p. 9). Within the context of health, Ormandy 13 suggested that a need for information arises when a person recognizes that their knowledge is insufficient to satisfy a healthcare-related goal. Throughout this paper, the term information need refers to the public’s desire for more information during the COVID-19 pandemic, expressed verbally or in active communication, in order to become better informed about self-care and prevention. We use the concept of information need to understand what information people require about ongoing health emergencies.
Several studies have begun to examine user information needs during the coronavirus pandemic. Using entity identification and text analysis, Zhao et al. 14 identified 1496 patients with COVID-19 infections from Wuhan, China, and investigated their health information searching behavior on a Chinese social media platform. They reported that the three most searched topics were access to medical care, isolation and quarantine guidelines, and offline-to-online support. Springer et al. 15 used Google Trends data to track the search trends and the patterns of worldwide interest, concerns, and information needs during COVID-19. Wei et al. identified 15 categories of questions about COVID-19 across 13 data sources; the most asked questions were about the transmission, prevention, and societal effects of COVID-19. 16 So far, however, no analyses have mined the health information needs for COVID-19 on OHC and Q&A websites in the United States. This study sets out to extend previous work by investigating and comparing information needs on the OHC website MedHelp and the Q&A website Quora during the COVID-19 pandemic.
Health information needs on online health communities
OHCs are patient-led sites that focus exclusively on health-related topics. OHCs exist for a wide range of diseases and health issues, from cancer support groups to simple calorie counter forums. Compared to other health-related sites which only allow users to retrieve information, OHCs allow for communication between multiple people.
Scholars have made significant progress in discovering discussion topics in OHCs. Chen 17 used the k-means algorithm to cluster the discussion content from three cancer-related OHCs and unearthed a set of common topics in each: support, treatment experience, disease and medication management. Park and Park 18 investigated cancer-related information needs among Korean Americans by collecting posts from MissyUSA, one of the largest online communities among Koreans in the USA. They found that the most discussed medical topics were treatment, diagnosis and symptoms. To understand the potential information needs of patients with physiological and psychological diseases, Liu et al. 19 used topic modeling and sentiment analysis to study the differences in topics and emotions expressed by the two groups of patients in OHCs. They showed that people with physical illness pay close attention to medical treatment, while people with mental illness are actively involved in seeking emotional support in the community.
Health information needs on question-and-answer forums
Question-and-answer (Q&A) websites are knowledge-sharing platforms for asking explicit questions as well as posting answers. Health has been identified as a major domain to observe user interactions and identify user needs in Q&A website research. 20 These websites provide a venue for consumers to seek experiential information from other users for a quick solution to their healthcare concerns. 21 It has previously been observed that Q&A platforms offer a valuable opportunity to better understand users’ information needs and concerns about various health issues. 22
Text mining has been introduced as a useful method for studying health information needs in Q&A forums from different perspectives. Oh et al. 23 collected cancer-related questions posted on Yahoo Answers and investigated cancer-related topic categories with text mining techniques to reveal users’ multidimensional information needs. In a study investigating consumer information needs on dietary supplements, Rizvi et al. 24 retrieved a total of 2,820,179 questions and corresponding answers from Yahoo Answers. By implementing an unsupervised topic modeling method, they found that the information most sought by users concerned “use and adverse effects”, “product-related”, and “healthy lifestyle”. Zhao et al. 14 analyzed 10,862 depression-related posts on Zhihu, the largest Chinese social Q&A platform. Combining LDA and manual methods, they showed that users who sought help for depression paid more attention to information linked to depression symptoms and social activities.
Method
Data collection
In this study, two typical platforms, the OHC website MedHelp and the Q&A website Quora, were chosen as the comparative cases for data collection. MedHelp is one of the earliest and most popular online forums which attracts more than 12 million users browsing monthly. 25 Quora, one of the largest and most popular Q&A forums in America, has attracted over 300 million monthly active users. 26 A key area where Quora shines is health care. All the data collected are in English and mainly for American users based on their website traffic. 1
Two web scraping scripts written in R 4.0.5 and based on Selenium WebDriver 27 were used to obtain all posts (each post consisting of one question and its following comments) in the sub-community Coronavirus on MedHelp 28 and under the topic COVID-19 (2019-2020) on Quora. 29 Data collection was finished on 16 October 2021. Given that short answers might not contain meaningful information for our further analysis, we removed all answers of fewer than 5 words. Following common practice, we also removed all URLs, punctuation, numbers, and English stopwords, and performed lowercasing and stemming during data preprocessing. All the data collected in the study are readily available to the public, and there was no direct interaction with participants during the data collection process. We also complied with the privacy policy of each online forum, as no user-level personal information was obtained or stored at our end and there is no possible way to link a record with a particular individual. 30 Additionally, we consulted an Institutional Review Board analyst at our institute to confirm that no ethical approval is required for the current study. For the process and the code of the web scraping, please refer to the Supplemental Material.
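The cleaning steps described above can be illustrated with a minimal Python sketch. It is not the authors' code (their pipeline was written in R and is in the Supplemental Material). The stopword set is a tiny illustrative subset, the names clean_answer and preprocess are hypothetical, and stemming is deliberately left out because it would require an external stemmer.

```python
import re
import string

# Illustrative subset only (assumption); a real pipeline would use a full English stopword list.
STOPWORDS = {"a", "an", "and", "are", "is", "it", "of", "the", "to",
             "i", "in", "my", "for", "on", "this"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))
MIN_WORDS = 5  # answers with fewer than 5 words are dropped


def clean_answer(text):
    """Return the cleaned tokens of one answer (no stemming in this sketch)."""
    text = URL_PATTERN.sub(" ", text)   # remove URLs
    text = text.lower()                 # lowercase
    text = re.sub(r"\d+", " ", text)    # remove numbers
    text = text.translate(PUNCT_TABLE)  # replace ASCII punctuation with spaces
    return [tok for tok in text.split() if tok not in STOPWORDS]


def preprocess(answers):
    """Drop short answers based on their raw word count, then clean the rest."""
    kept = [a for a in answers if len(a.split()) >= MIN_WORDS]
    return [clean_answer(a) for a in kept]
```

Filtering on the raw word count before cleaning mirrors the order in which the steps are listed in the text; whether the original scripts applied them in exactly this order is an assumption.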
Table 1. The descriptive statistics of the MedHelp and Quora corpora.
a. 4149 of 7664 terms (4149 of 51,143 tokens) were removed due to frequency.
b. 7694 of 14,478 terms (7694 of 119,493 tokens) were removed due to frequency.

Figure 1. The word count distribution of the MedHelp and Quora corpora.
Data analysis
Taking the posts from these two corpora as the unit of analysis, we conducted an in-depth text analysis to understand public health information needs during the COVID-19 pandemic. First, after collecting and cleaning the textual data, we used topic modeling to analyze the two corpora separately and uncover the underlying themes. Second, the differences in users’ health information needs between the two sites were investigated through topic similarities (i.e. improved sqrt-cosine similarity). Our data analysis road map is presented in Figure 2.
Figure 2. Data analysis road map.
Topic modeling
We used structural topic modeling (STM) without structural metadata to perform the topic modeling via the stm R package. 31 An STM without a covariate renders a result similar to that of correlated topic modeling (CTM). 2 CTM is an unsupervised machine learning algorithm for text mining that derives from the latent Dirichlet allocation (LDA) model. LDA renders K clusters of co-occurring terms (the topics) with a bag-of-words approach, which disregards the order of terms within a document and the order of documents in the corpus. In LDA, topics are also assumed to be uncorrelated with each other. CTM extends LDA by allowing correlations among the latent topics. 31 Moreover, CTM adopts a logistic normal distribution rather than the Dirichlet distribution used by LDA in order to better capture the covariance structure among the topics, and it outperforms regular LDA. 31 The main reason we chose the stm package is its utilization of elements of the SAGE and DMR topic models. Meanwhile, its flexible searchK function and model diagnostic metrics (i.e. semantic coherence, exclusivity, residuals, lower bound, and held-out likelihood) helped us greatly in identifying the optimal K for the MedHelp and Quora corpora.
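The contrast between the two priors on the document-topic proportions can be written compactly. The notation below (theta_d for the topic proportions of document d, eta_d for its unconstrained counterpart, alpha, mu and Sigma for the prior parameters) is standard textbook notation and is an assumption of this sketch, not notation taken from the paper.

```latex
% LDA: topic proportions are drawn from a Dirichlet prior,
% which cannot encode positive correlation between topics.
\theta_d \sim \mathrm{Dirichlet}(\alpha)

% CTM: proportions are a softmax transform of a multivariate normal draw
% (the logistic normal distribution).
\eta_d \sim \mathcal{N}(\mu, \Sigma), \qquad
\theta_{d,k} = \frac{\exp(\eta_{d,k})}{\sum_{j=1}^{K} \exp(\eta_{d,j})}, \quad k = 1, \dots, K
```

The off-diagonal entries of Sigma are what allow two topics to co-occur more or less often than chance, which is the covariance structure referred to above.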
Topic similarity
Cosine similarity is one of the common methods for comparing the text similarity between documents, or between two non-zero vectors, based on Euclidean distance via a bag-of-words model. 32 Specifically, for two vectors x = (x_1, ..., x_m) and y = (y_1, ..., y_m), cosine similarity based on Euclidean distance could be defined as
$$\mathrm{Cos}(x, y) = \frac{\sum_{i=1}^{m} x_i y_i}{\sqrt{\sum_{i=1}^{m} x_i^{2}}\,\sqrt{\sum_{i=1}^{m} y_i^{2}}} \qquad (1)$$
However, Euclidean distance might not be a good metric for dealing with probabilities, such as the topic-word distribution generated by LDA or CTM topic modeling. 33 Instead, Hellinger distance with probability-based approaches may be a preferable choice, especially when handling high-dimensional data. 34 As an attempt at a better solution, Sohangir and Wang 35 proposed an improved sqrt-cosine similarity measurement (ISC) based on equation (1) and the sqrt-cosine similarity. 34 Sohangir and Wang contend, based on their comparative experiments, that the ISC approach outperforms other similarity measurements:
$$\mathrm{ISC}(x, y) = \frac{\sum_{i=1}^{m} \sqrt{x_i y_i}}{\sqrt{\sum_{i=1}^{m} x_i}\,\sqrt{\sum_{i=1}^{m} y_i}} \qquad (2)$$
Like cosine similarity, the ISC score for each pair of documents ranges from 0 to 1, where 1 indicates that the two documents are very similar and 0 means that they are completely different. We used the same approach here to compare the topic similarity between our MedHelp and Quora corpora. After deciding the optimal K-topics model for each corpus, each topic in each corpus was represented as a vector of words with corresponding prevalence (or the per-topic-per-word probabilities, as the
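The ISC comparison described in this section can be sketched in a few lines of Python. This is an illustration, not the authors' R implementation. It assumes the ISC form given by Sohangir and Wang (the sum of sqrt(x_i * y_i) divided by the product of the square roots of the two vector sums) and assumes each topic is supplied as a {term: probability} mapping. The helper names and the toy topics in the usage example are hypothetical. Restricting both corpora to their overlapping vocabulary follows the description in the Limitations section.

```python
import math


def isc_similarity(x, y):
    """Improved sqrt-cosine similarity between two equal-length non-negative vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    denom = math.sqrt(sum(x)) * math.sqrt(sum(y))
    if denom == 0:
        return 0.0  # a topic with no mass on the shared vocabulary
    return sum(math.sqrt(a * b) for a, b in zip(x, y)) / denom


def topic_vector(topic, vocab):
    """Project a {term: probability} topic onto an ordered vocabulary."""
    return [topic.get(term, 0.0) for term in vocab]


def similarity_matrix(topics_a, topics_b):
    """ISC score of every topic in corpus A against every topic in corpus B,
    computed on the terms that appear in both corpora."""
    vocab_a = set().union(*topics_a)
    vocab_b = set().union(*topics_b)
    shared = sorted(vocab_a & vocab_b)
    return [
        [isc_similarity(topic_vector(a, shared), topic_vector(b, shared))
         for b in topics_b]
        for a in topics_a
    ]


# Toy usage with made-up topics (not data from the study).
toy_a = [{"mask": 0.7, "symptom": 0.3}, {"vaccin": 0.8, "symptom": 0.2}]
toy_b = [{"vaccin": 0.9, "mask": 0.1}, {"fauci": 1.0}]
toy_scores = similarity_matrix(toy_a, toy_b)
```

With two 31-topic models, the same call would yield a 31 by 31 matrix that could be drawn as a heatmap.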
Results
Topics generated on MedHelp corpus
With an exploratory approach, we started by testing various K-topics models on the MedHelp corpus, where K = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}. We then narrowed the search for the optimal K to between 20 and 40. Eventually, we decided that 31 would be the optimal number of topics for the MedHelp corpus, based on the model diagnostic metrics and intersubjective qualitative human judgment within the research team. For details about the decision, please refer to the Supplemental Material.
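The coarse-to-fine search over K described above can be sketched as follows. The sketch only shows the shape of the procedure. The function names are hypothetical, and the diagnostic values in the usage example are invented placeholders. Shortlisting models on the Pareto frontier of semantic coherence and exclusivity is a common heuristic for stm models, not the authors' documented rule, and their final choice also relied on human judgment.

```python
def coarse_grid(start=10, stop=100, step=10):
    """First pass: widely spaced candidate values of K."""
    return list(range(start, stop + 1, step))


def fine_grid(low=20, high=40):
    """Second pass: every K inside the promising range."""
    return list(range(low, high + 1))


def pareto_candidates(diagnostics):
    """Shortlist Ks that no other K beats on both metrics.

    diagnostics maps K to (semantic_coherence, exclusivity); higher is
    better for both (semantic coherence is typically negative, so closer
    to zero is better).
    """
    keep = []
    for k, (coh, exc) in diagnostics.items():
        dominated = any(
            c2 >= coh and e2 >= exc and (c2 > coh or e2 > exc)
            for k2, (c2, e2) in diagnostics.items()
            if k2 != k
        )
        if not dominated:
            keep.append(k)
    return sorted(keep)


# Invented placeholder diagnostics, for illustration only.
toy_diagnostics = {20: (-80.0, 9.2), 30: (-85.0, 9.6), 40: (-95.0, 9.5)}
shortlist = pareto_candidates(toy_diagnostics)
```

The shortlist would then be inspected qualitatively, for example by reading the top terms of each topic, before a single K is chosen.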
Table 2. The categories and labels of the 31 topics of the MedHelp corpus*.
Note: *The highly similar topics (based on the heatmap Figure 4) are highlighted in the table.
The top three topics in the MedHelp corpus are Topic 25 Self-monitor for possible COVID-19 symptoms, Topic 16 Face shield/mask for protection, and Topic 7 Viral transmission. We summarized the 31 topics into seven categories, with two topics in diagnosis, six topics in protection and prevention, six topics in pathogenic mechanisms, seven topics in vaccination and vaccine reaction, two topics in public health directives, four topics in social support, and four topics in treatment.
Topics generated on Quora corpus
Adopting the same approach as above, we also started with 11 models whose number of topics ranged from 10 to 100, and then tested a more granular range of models with K between 20 and 40. Similarly, we found that the 31-topic model is the best one for the Quora corpus based on the model diagnostic metrics and human judgment. The details of the decision can also be reviewed in the Supplemental Material.
Table 3. The categories and labels of the 31 topics of the Quora corpus*.
Note: *The highly similar topics (based on the heatmap Figure 4) are highlighted in the table.
The topic modeling shows that the top three topics in the Quora corpus are Topic 8 Immune and antibody response to vaccination, Topic 2 Common reaction to vaccination, and Topic 4 Comparing the differences among COVID-19 vaccines. As Table 3 shows, we further classified the 31 topics into six categories. There are eight topics in vaccine reaction, seven topics in vaccine development and distribution, six topics in pathogenic mechanisms, six topics in politicization and polarization, three topics in treatment and the control of COVID-19, two topics in conspiracy theories, and one standalone topic in COVID-19 employee benefits.
Topic similarities on MedHelp and Quora
The topic modeling process renders two essential probabilities across the corpus. The first one is
Figure 3. The distribution of topic ISC similarity scores of the 31 × 31 matrix.
Figure 4. The heatmap of topic ISC similarity scores between MedHelp and Quora topics.

Given the ISC similarity scores follow a normal distribution, we set up the arbitrary threshold based on the
An objective of this study is to explore the similar topics and distinctive topics between MedHelp and Quora. Based on the result of the heatmap,
Additionally, we also find
Part of the aim of RQ3 is to identify the relatively distinctive topics in each corpus. We find
Discussion
Principal findings
For the first research question, most of the users on MedHelp expressed confusion and concerns about diagnosis, protection and prevention, coronavirus pathogenic mechanisms, vaccination and vaccine reaction, public health directives, social support, and treatment. Meanwhile, many users on Quora were concerned about vaccine reaction, vaccine development and distribution, pathogenic mechanisms, politicization and polarization, treatment and control of COVID-19, and COVID-19 employee benefits. These findings provide a corresponding answer to RQ2. For the third research question, users on MedHelp and Quora expressed the same concerns about vaccination, including possible vaccine reactions, the safety and effectiveness of possible vaccines, and vaccine development and distribution. A considerable number of users on MedHelp and Quora also cared about the basic elements of the disease, including onset symptoms, transmission routes, preventive measures, and the treatment and control of COVID-19.
There are some differences in users’ information needs on COVID-19 between the OHC and the Q&A forum. One major difference is that the proportion of posts in which users reported being worried about themselves and their close ones was higher on MedHelp than on Quora. This finding reveals that users on MedHelp have more information needs regarding psychological health, which accords with an earlier observation reported by Chen et al. 39 The reason for the topic disparities might be the anonymity feature of the OHC. It is easier to build strong relationships within an OHC, which leads users on MedHelp to describe their concerns and conditions more carefully, including more details and personal feelings. Previous evidence suggests that one of the main reasons people participate in an OHC is to seek and obtain various types of social support, 40 and social bonds grow stronger during times of uncertainty and crisis. 41 Thus, while individual experiences during COVID-19 are nuanced, OHCs such as MedHelp could serve to facilitate connections and provide opportunities for new modes of interaction and meaningful relationship-building during the pandemic.
Quora is a Q&A platform where diverse perspectives and voices are shared and heard in a culture of knowledge sharing. 42 We found that the current dominant users of Quora represent the users of an online knowledge community well, as Quora users hold more profound perspectives than users on MedHelp. For instance, the discussion around vaccination on MedHelp mainly focuses on the individual-level risk/benefit balance, while vaccination discussions on Quora also touch on social ethics, such as the ethical considerations of COVID-19 clinical trials. The results also show that Quora users are more active in discussions around controversial topics, such as anti-vaccine statements and political divides. In that sense, information exchanges on these topics were not only about exchanging helpful knowledge but were also imbued with conspiracy beliefs and political judgment.
Practical implications
There are widespread public concerns about vaccines, including reactions following vaccination, the safety and effectiveness of possible vaccines, and vaccine development and distribution. Although evidence has demonstrated that vaccinations were the best hope society had to contain the pandemic, fear and confusion still muddled people’s confidence in vaccines. Concerns about approvals being rushed, suspicion of the pharmaceutical industry, and uncertainty surrounding the vaccines are the most widely mentioned reasons for vaccine hesitancy. 43 Public health authorities and pharmaceutical manufacturers must continue to communicate transparently with the public about any potential side effects of COVID-19 vaccines when highlighting the vaccines’ effectiveness. The Food and Drug Administration (FDA) and the Centers for Disease Control and Prevention (CDC) in the United States have been thorough and transparent about the safety profiles of each of the vaccines. One example is that in April 2021 the CDC and FDA paused the use of the Johnson and Johnson vaccine and then further investigated rare blood clots that had been reported after vaccination.
Health professionals and communicators must help individuals cope emotionally with psychological trauma during the COVID-19 pandemic and navigate to a post-crisis new normal. Users on MedHelp have expressed their need for social support during the pandemic. As individuals practice social distancing and quarantine in an effort to help prevent the spread of coronavirus, they may experience a higher prevalence of loneliness, feelings of isolation, and poor mental health. Research has demonstrated that having strong social support during times of crisis can help mitigate mental disorders. 44 Public health authorities and communicators should intervene and fill the support gap for the public during the pandemic. For instance, healthcare could be delivered remotely through several telehealth modalities and treatment protocols to provide medical assistance and psychotherapy services. Medical staff should pay attention to the emotional and psychological conditions of patients, encourage them to speak openly about their concerns, and provide support when needed. Governments should provide resources and social support services, such as disaster financial assistance with food, housing, and bills.
Quora has unique features, such as the real-name environment and the upvote feature, which make it a more suitable platform for rationally discussing health issues than for emotionally expressing personal attitudes. Another strength of Q&A websites is that multiple users can answer the same question, offering more than one explanation from different perspectives, which could aid in reconciling varying viewpoints and foster more conciliatory conversations. The echo chambers in the online environment can thus be disrupted by providing balanced arguments. Nevertheless, we found that some discourse on Quora has become more polarized on the issue of handling the coronavirus crisis. Public opinion is deeply divided on the urgency of the crisis, responsiveness to government decisions, and personal behavioral responses to the COVID-19 pandemic. This finding is consistent with previous observations that partisan gaps exist in views of many aspects of the pandemic, such as risk perceptions and responses to pandemics.45,46 Future research could investigate the effects of politicized and polarized online information on community vulnerability to COVID-19, as well as strategies to reduce divisions and break away from previous patterns of reflexive partisanship.
Limitations
This paper has a few limitations that could be addressed in the future. This study took the post as the unit of analysis to adjust for the varying length of online posts, which might miss nuances within each answer. Future research could employ bi-term topic modeling or single-topic LDA to incorporate more flexibility within each answer, or incorporate user-defined seed words for the topic-word distribution to better address the domain-specificity problem. 47 Also, we acknowledge the subjective judgment involved in assigning the topic labels, and the possible bias could affect the conclusions. In the future, we can incorporate a generative labeling approach to help evaluate topic quality. 48 During text preprocessing, though we followed the common practices in the field, including removing stopwords and performing stemming, we should be more cautious about these steps, as some of them might have limited effects on improving the performance of topic modeling, as Schofield et al. argue.49,50 Besides, this study regarded all the posts as static. In reality, however, the temporal trend would be an important element for capturing the dynamics of information needs. In future research, we can scrape the posting time for each answer and conduct structural topic modeling to examine the fluctuation of the topic perplexity across time. Note that Quora does not show the exact date of a post made more than 1 year ago. Lastly, when we computed the ISC similarity score between the MedHelp and Quora corpora, we used only the overlapping terms, which could ignore unique linguistic patterns, especially in the Quora corpus (since only 3098 out of 6784 terms were included). In the future, we may compute the similarity score based on the sparse matrix with all the terms in both corpora and examine the differences from the approach in this study.
Conclusion
Coronavirus was a heated discussion topic on both MedHelp and Quora. In this paper, we examined the COVID-19 related posts on the OHC MedHelp and on the Q&A platform Quora and identified a variety of information needs of the general public over the course of the pandemic. Moreover, we discovered disparities in information needs between the users of these two online platforms. To the best of our knowledge, this is the first study to examine the public’s information needs by comparing two different one-to-many knowledge-based online platforms during the COVID-19 pandemic. This insight is beneficial for tracking and responding to the public’s information needs during a pandemic. The findings from this study could also provide refined knowledge for researchers or practitioners who aim to provide accurate information assistance and build effective online emergency response programs.
Supplemental Material
Supplemental Material - Understanding information needs during COVID-19: A comparison study between an online health community and a Q&A platform
Supplemental Material for Understanding information needs during COVID-19: A comparison study between an online health community and a Q&A platform by Rachel X Peng and Ryan Yang Wang in Health Informatics Journal
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Ethical statement
All the data collected in the study is readily available to the public and there is no direct interaction with participants during the web scraping process. No personal information on user-level was obtained or stored and there is no possible way to link a record with a particular individual. Therefore, no ethical approval is required in the current study.
Notes
References
