Abstract
The aim of this study is to analyze drug mentions in web forums to evaluate the utility of this data source for drug post-marketing studies. We automatically annotated over 60 million posts extracted from 21 French web forums. Drug mentions detected in this corpus were matched to drug names in a French drug database (Theriaque®). Our analysis showed that a high proportion of the most frequent drug mentions in the selected web forums correspond to drugs that are usually prescribed to young women, such as combined oral contraceptives. The most mentioned drugs in our corpus correlated weakly with the most prescribed drugs in France but seemed to be influenced by events widely reported in traditional media. We conclude that web forums have high potential for post-marketing drug-related studies, such as pharmacovigilance and observation of drug utilization. However, the bias related to forum selection and the corresponding population representativeness should always be taken into account.
Introduction
Web forums have become a major platform for information sharing. Online discussions reflect the interest of the population in a topic at a given time. As health is a major preoccupation for many people, medical topics regarding diseases and treatments are often present in web forums. Such data constitute a valuable resource for research in public health.
In the pharmacovigilance domain, studies examining the value of social media for drug safety have gained much attention in the past few years.1,2 Several neologisms were even proposed to introduce this new approach, such as “Pharmacovigilance 2.0,” 3 “Cyber pharmacovigilance” 4 and “Digital pharmacovigilance.” 5 In addition to research on adverse drug reactions (ADRs) performed from preclinical or clinical data, social media are also an appealing information resource for other drug-related domains, such as drug utilization studies,6–9 drug misuse,10,11 attitudes of patients concerning safety issues, 12 or the drug development process. 13 Although most of the studies agree on the high potential of using social media in drug research, many challenges have been identified in this perspective, including, for instance, spam elimination and the processing and interpretation of patient language.14,15 As the cost of retrieving and managing big data from social media is high, the evaluation of the added value of such a resource is necessary.
In this context, the main aim of our work was to analyze drug mentions in web forums in order to assess the utility of this data resource for drug-related studies. This global objective was characterized through the following study questions:
What are the most recurrent drugs in users’ posts?
Does the evolution over time of a drug mentioned in web forums correspond to the events reported for this drug in the traditional media (Press, TV, and Radio)?
Do the most prescribed drugs in France correspond to the most mentioned drugs in web forums?
In the literature, few studies have addressed these issues. To our knowledge, only Wiley et al. 16 and Carbonell et al. 17 investigated drug mentions in social media without focusing on a specific type of drug or disease. Wiley et al. 16 studied drug categories involved in users’ posts within several types of social media, including Twitter, Google+ and web forums. Their analysis was based on drug categories and did not point out the most popular active ingredients or trade names. Carbonell et al. 17 analyzed mentions of trade names on Twitter over a limited period of 3 weeks. Recently, Mahroum et al. 18 exploited statistics about web search queries related to the vaccination against the Human Immunodeficiency Virus (HIV). This last approach required predefining search keywords and evaluating the proportion of Google queries that are related to these keywords. Unlike the study of Mahroum et al., 18 our study addresses users’ discussions in web forums rather than online queries, and it does not focus on a specific disease.
None of the previous studies performed a general analysis of drug mentions in social media. Nor did they analyze the evolution of drug mentions in web forums over time or the relation between this evolution and related events in traditional media. Our work in this article aims to fill these gaps.
Methods
To carry out our study, we performed the following three steps: (1) data extraction and preparation, (2) automatic detection of drug mentions, and (3) analysis of drug mentions over users’ posts. This article is an ancillary work of the French national project Vigi4Med (Vigilance for Medication in web forums) that aimed to detect ADRs from user posts in web forums. Thus, the first two steps have been exhaustively described in previous publications related to the Vigi4Med project.19,20 The result of the first step 19 was the extraction of more than 63 million posts between the years 2000 and 2015 from 21 web forums selected by pharmacovigilance experts. These forums were chosen using Google keyword-based search and a list of certified health websites as described in Karapetiantz et al. 21 The names of these forums and the number of posts extracted from each forum can be found in Audeh et al. 19 The second step allowed the detection of drug and pathology mentions in users’ posts using a machine learning approach. 20 This approach used two successive classifiers based on well-known methods in machine learning: a Conditional Random Fields classifier 22 to identify medical-related named entities and a Support Vector Machine classifier 23 to identify relations between the entities.
The data extracted for the Vigi4Med project were filtered to keep only the posts containing automatically detected mentions of at least one drug and one pathology, without considering causality assessment. As shown in Figure 1, these data consisted of 55,350,564 (drug, pathology) pairs corresponding to 6,572,528 unique posts and 829,363 unique drug mentions. The matching process described in this article led to 4165 drug mentions found in 1,627,793 posts, which represent the final corpus of our analysis. We imported all the data into a MySQL database, which we used as our analysis environment. In the following subsections, we describe the processes that we applied within the third step of our study regarding the analysis of drug mentions over users’ posts.

Illustration of the selection process that led to the posts and mentions included in our study.
Ethical approval
In this work, confidentiality was ensured by storing the data on a server with restricted and secured access. Although this work was completed before the European General Data Protection Regulation (GDPR) came into application, special attention was paid to respecting users’ privacy by anonymizing users’ identities and submitting an official declaration about data usage to the National Committee on Computers and Liberties (CNIL). 21
Normalization by matching to a drug database
In order to explore our data, we matched drug mentions detected in step 2 to global trade names (e.g. Actifed®) or active ingredients (e.g. ibuprofen) in an official drug database. An active ingredient is a substance that alone or in combination with other ingredients is considered to fulfill the intended activity of a medicine. We chose not to match drug mentions to complete trade names that include dosage and forms information (e.g. Actifed 10 mg tablet). In fact, complete trade names are often composed of several terms and numbers. Our observation of multiple users’ posts showed that these complex names are not often employed in users’ posts. In addition, the probability of precisely matching them to complete trade names in a reference drug database is low.
The resource that we used to match drug names was Theriaque®, a French drug database developed by the National Hospital Centre on Medicines Information (CNHIM). 24 Table 1 illustrates some associations extracted from Theriaque® for the trade name ACTIFED® and some of its complete trade names.
Associations for the trade name ACTIFED® in Theriaque database.
The preprocessing of drug mentions consisted of removing periods, extra spaces and commas. All matching operations were performed using case- and accent-insensitive comparisons. For each drug mention, the matching process starts by checking whether there is a match to an active ingredient in Theriaque®. If a match is found, the drug mention is associated with the matched active ingredient; otherwise, the same process is repeated to match a trade name. If a trade name match is found, the mention is associated with both the corresponding trade name and all its active ingredients. Table 2 shows examples of matching a drug mention to an active ingredient or a trade name.
Examples of the table containing posts that match a trade name or an active ingredient.
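The matching procedure described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the function names `normalize` and `match_mention` are hypothetical, and the two small lookup structures are a toy stand-in for the drug database, not actual Theriaque® content or the project's implementation.

```python
import unicodedata


def normalize(mention: str) -> str:
    """Strip periods and commas, collapse spaces, drop accents, ignore case."""
    text = mention.replace(".", "").replace(",", "")
    text = " ".join(text.split())
    decomposed = unicodedata.normalize("NFKD", text)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unaccented.casefold()


# Hypothetical miniature stand-in for the drug database (assumption).
INGREDIENTS = ["paracétamol", "ibuprofène"]
TRADE_NAMES = {"Doliprane": ["paracétamol"], "Advil": ["ibuprofène"]}

# Index both vocabularies by their normalized form.
INGREDIENT_INDEX = {normalize(name): name for name in INGREDIENTS}
TRADE_INDEX = {normalize(name): name for name in TRADE_NAMES}


def match_mention(mention: str):
    """Try active ingredients first, then trade names; None if unmatched."""
    key = normalize(mention)
    if key in INGREDIENT_INDEX:
        return {"trade_name": None, "active_ingredients": [INGREDIENT_INDEX[key]]}
    if key in TRADE_INDEX:
        trade = TRADE_INDEX[key]
        return {"trade_name": trade, "active_ingredients": list(TRADE_NAMES[trade])}
    return None
```

In this sketch, a mention that matches neither index returns `None`; such mentions correspond to the unmatched list handled by expert review.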
Expert intervention for unmatched mentions
The mentions that did not match any active ingredient or trade name were listed with the number of posts associated with them and presented in descending order to a pharmacovigilance specialist. This list contained more than 800,000 mentions. Thus, the expert evaluated only unmatched mentions that were present in more than a threshold number of posts (arbitrarily fixed at 350) and decided whether they should be considered in the study. Table 3 shows examples of such unmatched mentions and the corresponding actions suggested by the expert.
Examples of mentions unmatched to Theriaque® database that were presented to an expert.
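The selection of unmatched mentions for expert review can be illustrated with a short sketch. The threshold of 350 posts comes from the text; the function name `mentions_for_review` and the example counts are hypothetical.

```python
POST_THRESHOLD = 350  # threshold stated in the text


def mentions_for_review(unmatched_post_counts, threshold=POST_THRESHOLD):
    """Keep mentions found in more than `threshold` posts, most frequent first."""
    frequent = [
        (mention, n_posts)
        for mention, n_posts in unmatched_post_counts.items()
        if n_posts > threshold
    ]
    # Descending by post count; alphabetical order breaks ties deterministically.
    return sorted(frequent, key=lambda pair: (-pair[1], pair[0]))


# Invented counts, for illustration only.
example_counts = {"mention_a": 1200, "mention_b": 350, "mention_c": 351, "mention_d": 12}
review_list = mentions_for_review(example_counts)
```

With these invented counts, only `mention_a` and `mention_c` would reach the expert, since a mention present in exactly 350 posts does not exceed the threshold.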
Statistical analysis
For statistics on the number of posts per drug mention, a post was counted only once for each trade name or active ingredient that it contained. For active ingredients, a post was counted if it contained an explicit mention of the active ingredient, or a product name that contains this active ingredient. Thus, a post that mentioned a drug containing several active ingredients was counted as one occurrence for each of these active ingredients. For example, if a post mentioned the trade name “Actifed®,” this post was counted once for the trade name “Actifed®” and once for each of its active ingredients “cetirizine,” “diphenhydramine,” “paracetamol,” “pseudoephedrine” and “triprolidine” (cf. Table 1).
In order to evaluate the correlation between the most prescribed drugs and the most mentioned ones, we used Open Medic. This open dataset is provided and certified by the French Health Insurance System. It details the list of reimbursements performed for all deliveries of active ingredients in community pharmacies in France over a selected period. We considered the list of the most mentioned active ingredients in web forums during the year 2015 in order to compare them to the number of deliveries related to the active ingredients prescribed in France in the same year. To enable statistical correlation tests between these lists, only the 510 active ingredients common to both lists were taken into account.
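The counting rule can be sketched as follows. The data layout (a list of posts, each with a list of `(kind, name)` mentions) and the placeholder product `BrandAB` with ingredients `a` and `b` are assumptions made for illustration only.

```python
from collections import Counter


def count_posts(posts, trade_to_ingredients):
    """Count each post at most once per trade name and per active ingredient."""
    trade_counts = Counter()
    ingredient_counts = Counter()
    for _post_id, mentions in posts:
        trades = set()
        ingredients = set()
        for kind, name in mentions:
            if kind == "trade":
                trades.add(name)
                # A trade name also counts for every ingredient it contains.
                ingredients.update(trade_to_ingredients.get(name, []))
            else:
                ingredients.add(name)
        # Sets guarantee that repeated mentions in one post count only once.
        trade_counts.update(trades)
        ingredient_counts.update(ingredients)
    return trade_counts, ingredient_counts


# Hypothetical data: one product with two ingredients, two posts.
toy_products = {"BrandAB": ["a", "b"]}
toy_posts = [
    (1, [("trade", "BrandAB"), ("trade", "BrandAB"), ("ingredient", "a")]),
    (2, [("ingredient", "a")]),
]
trade_counts, ingredient_counts = count_posts(toy_posts, toy_products)
```

Here post 1 contributes one occurrence to `BrandAB`, `a` and `b` even though `BrandAB` appears twice and `a` is also mentioned explicitly.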
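As an illustration of comparing two rankings restricted to their common items, the sketch below computes a Kendall tau coefficient in plain Python. It assumes the simple tau-a variant with no tied ranks; the text does not state which variant or software was used, and the counts are invented.

```python
from itertools import combinations


def ranks(counts, items):
    """Rank `items` by descending count (rank 1 = highest count)."""
    ordered = sorted(items, key=lambda item: (-counts[item], item))
    return {item: position for position, item in enumerate(ordered, start=1)}


def kendall_tau(counts_a, counts_b):
    """Kendall tau-a between two rankings, using only items present in both."""
    common = sorted(set(counts_a) & set(counts_b))
    if len(common) < 2:
        raise ValueError("need at least two common items")
    rank_a = ranks(counts_a, common)
    rank_b = ranks(counts_b, common)
    concordant = discordant = 0
    for x, y in combinations(common, 2):
        product = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(common) * (len(common) - 1) // 2
    return (concordant - discordant) / n_pairs


# Invented counts: "z" and "y" are dropped because they are not in both lists.
forum_posts = {"a": 50, "b": 40, "c": 10, "z": 5}
deliveries = {"a": 900, "b": 100, "c": 500, "y": 7}
tau = kendall_tau(forum_posts, deliveries)
```

For real data with tied counts, a tie-aware variant such as the one provided by `scipy.stats.kendalltau` would be a more appropriate choice.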
Another aspect of our study was to verify whether the temporal trends of drug mentions in social media were influenced by medical events in traditional media. For this part of the study, we chose a set of drugs involved in highly publicized “crises” in France:
Combined oral contraceptives (COCs). In December 2012, a case of stroke in a young woman related to a third-generation COC was reported in a French newspaper. This event alarmed health professionals and national health authorities and opened a debate concerning the use and prescription of COCs.25–27 To prepare data for the analysis relative to COCs, an expert identified the active ingredients that characterize the first, second, third or fourth generation of COCs.
Champix®. In September 2006, Champix® (varenicline) was approved in Europe as an aid to smoking cessation, and it was marketed in France in February 2007. In December 2007, the European Medicines Agency (EMA) warned doctors and patients that cases of suicide attempts had been reported with this drug. The French Health Insurance stopped reimbursing Champix® in 2011 on the basis of an unfavorable reassessment of its benefit–risk balance. In 2017, Champix® was finally admitted again for reimbursement in France.
Baclofene. In January 2012, the off-label use of baclofene at high dosages in alcohol-dependent patients was debated and received high media coverage.
Mirena®. In 2017, women drew attention to the potential ADRs related to the use of Mirena®, a levonorgestrel-releasing intrauterine device, and this was widely echoed in the media. Although the data used in our study were collected in 2015, we aimed to investigate whether Mirena® was already discussed in web forums before the crisis.
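A temporal trend for one of these drugs can be obtained by counting posts per calendar period. The sketch below assumes a monthly granularity and a simple in-memory representation of posts as `(date, set of matched drugs)` pairs; both are assumptions made for illustration, and the dates are invented.

```python
from collections import Counter
from datetime import date


def monthly_post_counts(posts, drug):
    """Number of posts mentioning `drug`, keyed by (year, month)."""
    counts = Counter()
    for post_date, drugs in posts:
        if drug in drugs:
            counts[(post_date.year, post_date.month)] += 1
    return dict(sorted(counts.items()))


# Invented posts, for illustration only.
toy_posts = [
    (date(2012, 12, 20), {"drug_x"}),
    (date(2013, 1, 5), {"drug_x", "drug_y"}),
    (date(2013, 1, 17), {"drug_x"}),
    (date(2013, 1, 30), {"drug_y"}),
]
trend = monthly_post_counts(toy_posts, "drug_x")
```

Plotting such counts against the dates of media events is one way to look for the correspondences investigated in this part of the study.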
Results
What are the most recurrent drugs in users’ posts?
Table 4 presents the 50 most mentioned active ingredients, and Table 5 presents the most mentioned trade names in the studied web forums. The high and correlated numbers of occurrences of sodium-based active ingredients drew our attention. We found that this situation was the result of the high frequency of the trade name of the antacid drug “Gaviscon®,” which contains a combination of sodium-based ingredients. As we counted all active ingredients associated with the detected trade names, this naturally led to high occurrences of sodium-based active ingredients in our calculations. In order to synthesize the results, we grouped these ingredients (in addition to calcium carbonate) in one line in Table 4 (line 23) with the maximum number of occurrences found among its ingredients.
Top 50 active ingredients in web forums. To facilitate the analysis and improve the presentation of drug counts, raw results of active ingredients counts were reviewed by a pharmacist who grouped frequently associated ingredients and ignored synonyms of the same ingredient.
Top 50 trade names in web forums.
Tables 4 and 5 show that the most mentioned drugs in web forums are related to pregnancy, contraception or ovulation stimulation. This finding was confirmed by the results in Table 6, which presents the proportions of posts associated with each Anatomical Therapeutic Chemical (ATC) class. It is important to note that a post can be associated with several ATC classes depending on the types of mentions that it contains.
Posts’ counts in web forums per ATC class.
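Because a post can belong to several ATC classes, the per-class proportions need not sum to 1. The following sketch illustrates this, assuming that the first letter of an ATC code identifies the class and that proportions are taken over the total number of posts; the post identifiers and the data layout are hypothetical.

```python
from collections import Counter


def atc_class_shares(post_atc_codes):
    """Share of posts associated with each first-level ATC class."""
    counts = Counter()
    for codes in post_atc_codes.values():
        # A post counts once per class, however many of its codes fall in it.
        counts.update({code[0] for code in codes})
    total = len(post_atc_codes)
    return {atc_class: n / total for atc_class, n in sorted(counts.items())}


# Two invented posts, each mapped to the ATC codes of its matched mentions.
toy_posts = {
    "post_1": ["G03AA07", "N02BE01"],
    "post_2": ["G03GB02"],
}
shares = atc_class_shares(toy_posts)
```

In this toy example, class G covers both posts and class N covers one of them, so the shares add up to 1.5.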
In addition to identifying the most frequent drug mentions in web forums, we wanted to verify whether patients mostly use trade names or active ingredients when talking about drugs in web forums. Figure 2 shows the proportion of trade name mentions for the most frequent active ingredients. These results demonstrate that patients usually use trade names rather than active ingredients when discussing drugs on web forums. An exception to this observation was the frequent use of the mentions “Insulin” and “Cortisone” by forum users. We expect that users use these mentions to refer to insulin-based preparations or variations of corticosteroids. In order to match trade names that correspond to “Insulin” and “Cortisone,” we considered these two mentions as umbrella terms for any active ingredient whose ATC code starts with “A10A” (Insulins and analogues) or “H02AB” (corticosteroids for systemic use), respectively.

The proportion of using trade names versus active ingredients names in web forums.
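The umbrella-term rule based on ATC code prefixes can be sketched as a simple prefix test. The two prefixes come from the text; the function name `umbrella_members` and the small ingredient-to-ATC table are illustrative assumptions, not an extract of the database used in the study.

```python
# Prefixes stated in the text for the two umbrella mentions.
UMBRELLA_PREFIXES = {"insulin": "A10A", "cortisone": "H02AB"}


def umbrella_members(term, ingredient_atc):
    """Active ingredients with at least one ATC code under the term's prefix."""
    prefix = UMBRELLA_PREFIXES[term]
    return sorted(
        name
        for name, codes in ingredient_atc.items()
        if any(code.startswith(prefix) for code in codes)
    )


# Small illustrative table of ingredients and ATC codes (assumption).
toy_atc = {
    "insulin glargine": ["A10AE04"],
    "metformin": ["A10BA02"],
    "prednisone": ["H02AB07"],
    "prednisolone": ["H02AB06"],
}
insulin_group = umbrella_members("insulin", toy_atc)
cortisone_group = umbrella_members("cortisone", toy_atc)
```

Note that metformin is excluded from the insulin group in this sketch because its code starts with “A10B”, not “A10A”.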
We noticed that the Doctissimo forums were the most dominant in our corpus (Figure 3). Figure 4 depicts the distribution of the top 10 mentioned active ingredients in web forums (excluding cortisone and magnesium) over the dominant forums.

The distribution of posts containing active ingredients over web forums.

The distribution of the top active ingredients on the dominant forums in our study.
Figures 5 and 6 show the proportions of baclofene and ethinylestradiol, respectively, within the total number of users’ posts for each of the studied forums. As expected, Figure 5 confirms that two French forums dedicated to discussions on baclofene contributed most of the mentions of this drug, while several generic forums contained relatively high counts of the ethinylestradiol mention.

Proportion of baclofene (red) to the total number of posts (blue) per forum.

Proportion of ethinylestradiol (red) to the total number of posts (blue) per forum.
Does the evolution over time of a drug mention in web forums correspond to the events reported for this drug in the traditional media?
Figure 7 shows the evolution of mentions of “old” generation versus “new” generation COCs. The third and fourth generations of COCs appeared more popular until March 2013. After this date, the trend reversed, and posts mentioning old-generation pills, especially second generation, became more frequent in web forums.

Outcome of old versus new COC mentions in web forums over 11 years.
The evolution of the number of posts mentioning Baclofene, Champix® and Mirena® from July 2007 to May 2015 in French forums is represented in Figure 8. The analysis of these mentions revealed that the increase of interest in Baclofene started in 2010. Conversely, mentions regarding Champix® that frequently appeared in web forums till April 2007 continuously declined through the following years. Overall, the number of mentions of Mirena® in web forums was stable, with the exception of a peak observed at the start of 2013.

The use of Baclofene, Champix®, and Mirena® in web forums.
Do the most prescribed drugs in France correspond to the most mentioned drugs in web forums?
Table 7 compares the top 10 active ingredients in web forums in 2015 with the top 10 active ingredients in Open Medic over the same year. The Kendall tau correlation coefficient between the ranking of active ingredients found in the forums and in Open Medic was estimated at 0.31, which indicates a weak correlation between both rankings. 28 The scatterplot of the rankings shown in Figure 9 confirms this interpretation. In this figure, each point is an active ingredient whose rank in the Open Medic list is on the X axis and whose rank in the forums list is on the Y axis.
Top 10 active ingredients mentions versus top prescribed active ingredients in 2015 in France.

Scatterplot of the active ingredient ranks in the Forum’s list and Open Medic List.
Discussion
Web forums represent an interesting source of information for drug-related studies. In this article, we presented a generic method for analyzing drug mentions that are detected in automatically annotated posts extracted from web forums. This method was applied to the case study of France. The use of French drug and active ingredient names and a French database for drug matching (Theriaque®) does not decrease the generality of our method, which could be applied to any web forum in any language as long as a suitable drug database is used for matching drug names.
In this study, most of the posts were dedicated to topics on contraception or pregnancy, including fertility and pregnancy development. Among the top 50 mentioned drugs, a single active ingredient concerned men (finasteride), which was ranked 38th among the most mentioned active ingredients in the forums. These results could be explained by the fact that women aged 18–45 years correspond to the most important population of web forum users.29,30 Moreover, although our study detected the presence of drugs related to chronic diseases like diabetes mellitus (insulin/Lantus®, metformine), asthma (salbutamol/Ventoline®), hypothyroidism (Levothyrox®), epilepsy (clonazepam/Rivotril®, Depakine®, Lamictal®), depression (venlafaxine/Effexor®, paroxetine/Deroxat®, escitalopram/Seroplex®, fluoxetine/Prozac®, sertraline/Zoloft®) or cancer (Tarceva®, triptoreline/Decapeptyl®), most of the highly mentioned drugs concerned mainly young patients without serious pathologies. In the second position, we found painkiller medications (paracetamol, ibuprofen, morphine, etc.). Among the top 50 active ingredients, psychotropic-related active ingredients (alprazolam, venlafaxine, paroxetine, escitalopram, bromazepam, fluoxetine, valproate, clonazepam and sertraline) were represented in 147,750 posts, while weaning-related active ingredients for tobacco and alcohol (baclofene, nicotine, varenicline) were represented in 83,248 posts.
These findings were confirmed by the high number of posts associated with the ATC classes G (Genito-urinary system and sex hormones), A (Alimentary tract and metabolism) and N (Nervous system). Our results are consistent with the findings of Wiley et al. 16 about the high frequency of discussions concerning nervous system, hormones and respiratory agents in social media.
There is a clear trend to use trade names in web forums (cf. Figure 2). However, some active ingredients such as morphine, copper and glucose were frequently used in the web forums we analyzed. To understand the context in which these ingredients were used, we examined the pathological conditions detected by machine learning in association with these three ingredients (Table 8). High frequencies of morphine and copper mentions (33,388 and 28,877 mentions, respectively) were particularly unexpected. Findings in Table 8 suggest that copper is probably used by women describing non-hormonal contraceptive devices and that morphine is mentioned as a strong painkiller in serious diseases like cancer, while glucose was probably used by diabetic patients to describe their glycemic status.
Pathologies associated with unexpected active ingredients often mentioned by users.
For the analysis of the temporal trends of the case study on COCs, Figure 7 showed that at the beginning of 2013, discussions about the old COCs became more frequent than discussions about the new generations. Actually, the use of COCs in general declined significantly after 2013, as many young women stopped using this contraception method after the media coverage of the risks. However, this finding is difficult to confirm based on the reimbursement data we used in our study, as numerous COCs among those available in France are not reimbursed by the Health Insurance system.
For the case study of Champix®, a correspondence was observed between Champix® mentions in web forums and the dates of events related to its marketing and reimbursement (Figure 8). Unfortunately, we could not study the impact of the recent media interest on Champix® mentions as our study did not cover the year 2017. Nevertheless, we studied the case of Baclofene, which also constituted a particular case study in our analysis. In fact, Baclofene has been used for years as a muscle relaxant under the trade name of Lioresal®. After the media coverage concerning its new use and potential interest for alcohol abstinence, it became more common on web forums. Indeed, a clear increase in mentions was observable from January 2012 for this drug, which coincides with the date at which French authorities authorized its prescription for the treatment of alcohol dependence. Furthermore, we noticed an increasing interest in baclofene in 2008, which could be related to the release of a book about the efficacy of baclofene in alcohol withdrawal. Finally, for the Mirena® temporal case study, Figure 8 shows the high number of posts mentioning Mirena® in January 2013. We found that at this date, a report comparing the uterine perforation rate between Mirena® and copper-based IUDs was published, which could have influenced the discussions in web forums at that period. From these case studies, we conclude that temporal trends of drug mentions seemed to be influenced by events widely reported in traditional media. One positive consequence of this influence is that we can collect descriptions of ADRs by patients in near real time. Such a process is much more expensive and time-consuming when using traditional reporting procedures.
Another important finding of our study is that the most mentioned drugs in web forums are not necessarily the most prescribed. For example, although Levothyrox® is highly prescribed in France (Levothyroxine Sodique is at position 33 in 2015 in Open Medic), this drug does not appear in the list of top 50 mentioned drugs in web forums over the same studied period. It is important to keep in mind that the posts included in our study were all published before the change of the composition of Levothyrox® in France, which happened in March 2017 31 and was widely covered by the media. Indeed, several adverse reactions were reported by patients after lactose was replaced by mannitol as an excipient. Our study showed that the most mentioned drug in the selected web forums is clomiphene, which is an ovulation stimulant for women with infertility problems. This finding confirms the results of a previous study. 32
Limitations
The methods used in this article are generic, but the application proposed in this work was limited to 21 selected French-language forums for a period that ends in 2015. The domination of some big forums biased our results by representing mostly young women and pregnancy-related topics. In our case study, matching mentions in web forums to the Theriaque® drug database led to counting ingredients that are not relevant to the study, such as glucose. Another limitation was the disambiguation of misspelled drug mentions in users’ discussions, which required expert intervention. Finally, using social media to analyze patients’ reactions excludes patients who do not have access to the Internet or who are not familiar with the use of online discussions.
Perspectives
Extending the period of our study to include recent posts and English forums will be the next step in our work. This extension will allow us to analyze the echo of recent events in social media and the influence of cultural and linguistic particularities on the results. Considering the bias regarding the age, sex and health context of the studied population, any future work should pay careful attention to the crucial choice of forums to consider. Furthermore, in order to minimize expert intervention, it is important for future work to define a list of drug ingredients that should not be taken into account when counting active ingredients. Another perspective for this work is to consider a system that analyzes drug mentions in web forums “on the fly.” This procedure will be part of the current PHARES project that aims to establish a pipeline for detecting ADRs and off-label uses in web forums.
Finally, this article focused only on how patients mention drugs in social media. Future work will focus on extracting causal relations between drugs and adverse events mentioned in social media posts. The development and the evaluation of a robust machine learning approach are necessary to correctly detect ADRs.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was funded by the French Agency for Drug Safety (Agence nationale de sécurité du médicament et des produits de santé) through the research projects: Vigi4MED (grant AAP-2013-052) and the convention n°2016S076 through the PHARES project.
