Abstract
Background
There is a notable disparity between guideline recommendations for BCG therapy in non-muscle-invasive bladder cancer (NMIBC) and its use in clinical practice. Reddit has emerged as a popular online platform for individuals seeking information and exchanging their experiences related to bladder cancer.
Objective
To investigate and classify public opinions about intravesical BCG therapy as shared on Reddit, a popular social media platform.
Methods
This study employed an artificial intelligence-based approach to examine discussions related to intravesical BCG therapy on a social media platform over the past ten years. An artificial intelligence framework was developed to categorize these conversations into distinct topics and thematic categories. This framework included a partially supervised model for processing natural language (using BERT [Bidirectional Encoder Representations from Transformers]), a method for reducing data complexity, and an algorithm for clustering. Additionally, each conversation was assessed for sentiment.
Results
A total of 1223 unique discussions related to BCG therapy were analyzed, comprising 110 unique posts and 1113 comments from 268 distinct authors. We identified four overarching thematic groups: (1) BCG administration procedures, (2) hesitancy in initiating or maintaining BCG treatment, (3) issues related to BCG shortage and alternative treatments, and (4) side effects of BCG treatment. Sentiment analysis of the 1223 discussions revealed that 25.2% (308) exhibited a negative sentiment, 58.3% (713) were neutral, and 16.5% (202) showed a positive sentiment.
Conclusion
Online social media often contains detailed personal experiences with BCG therapy, not commonly found in medical literature. Understanding these experiences can help medical professionals improve care and treatment adherence in NMIBC.
Keywords
Introduction
In 2023, it was estimated that there were approximately 82,290 new cases of bladder cancer in the United States, 1 with approximately 70% identified as non-muscle-invasive bladder cancer (NMIBC). 2 The American Urological Association (AUA) guidelines recommend a six-week course of intravesical Bacillus Calmette–Guérin (BCG) therapy for intermediate and high-risk NMIBC, typically initiated two to six weeks after transurethral resection of the bladder tumor. 3 However, there is a significant gap between these guidelines and the actual clinical practice of BCG therapy. 4
The reasons for the underuse of intravesical BCG therapy in NMIBC are not entirely clear. Historically, concerns about side effects were a major factor 5 ; however, with advances in medical understanding and practice, serious side effects are now rare, affecting fewer than 5% of patients, and are generally manageable. 6 Another potential factor is the recent shortage of BCG drugs, which may have affected NMIBC treatment approaches. 7
In recent years, social media platforms have become important in capturing public opinion on health issues outside of traditional healthcare settings. 8 These platforms facilitate the sharing of experiences and information, thereby impacting public perception and knowledge. One social media platform gaining traction for health discussions is Reddit.9,10 This forum-based site allows users to share questions, comments, and discussions on a wide range of topics. It is a free platform with 52 million daily active users and approximately 430 million monthly users, attracting over 30 billion views per month. 11 Given its extensive reach, it could provide substantial data on public opinion about BCG and offer opportunities for new insights.
This study aimed to explore and categorize public perceptions of intravesical BCG therapy as discussed on Reddit. We hypothesized that social media discussions would reveal patient concerns and experiences that may not be readily captured in clinical settings or through traditional research methodologies.
Methods
Data set and search
Reddit was used as the data source for this study. Data were collected between January 2013 and November 2023. This social media platform is organized into various focused communities, known as subreddits, indicated by the ‘r/’ prefix. Interaction on the platform occurs through users posting new discussion threads or commenting on existing ones. Most of these communities and their associated posts and comments are publicly accessible.
To compile a dataset of discussions related to BCG on this platform, we identified two pertinent subreddits by searching for the terms ‘BCG’ and ‘bladder cancer’ in the platform's search tool: r/BladderCancer and r/Cancer.
We then gathered posts and comments from these subreddits using Reddit's application programming interfaces (APIs), employing custom scripts written in Python. Our search within these posts and comments was designed to include case-insensitive instances of the word ‘BCG’, and the generic or brand names: Bacillus Calmette-Guerin, TheraCys® BCG, and TICE® BCG.
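The keyword filter described above can be sketched as follows. This is a minimal illustration, not the authors' script: the regular expression, the function name `mentions_bcg`, and the sample texts are assumptions. Because both brand names listed in the text contain the word "BCG", a case-insensitive match on "BCG" plus the generic name covers all of the stated search terms.

```python
import re

# Pattern built from the search terms named in the text. The authors' exact
# matching code is not given, so this regex is an assumption.
BCG_PATTERN = re.compile(
    r"\bbcg\b|\bbacillus\s+calmette[-\s\u2013]+gu[eé]rin\b",
    re.IGNORECASE,
)


def mentions_bcg(text: str) -> bool:
    """Return True if the text contains a BCG-related search term."""
    return BCG_PATTERN.search(text) is not None


# Hypothetical example texts, not taken from the study data.
sample_texts = [
    "Starting bcg next week, any advice?",
    "My urologist mentioned TICE® BCG maintenance.",
    "Has anyone had Bacillus Calmette-Guerin instillations?",
    "Scheduled for a cystoscopy on Monday.",
]
matched = [t for t in sample_texts if mentions_bcg(t)]
```

The word boundaries around "bcg" keep the filter from matching the letters inside an unrelated longer token.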
Data preprocessing
To prepare the raw text collected from Reddit for topic modeling, the following series of operations was performed. We first cleaned the data by removing punctuation, capitalization, URLs, and special characters. For uniformity, the title and body of each post were then merged, as the main idea is often written in the title. Subsequently, English stop words and duplicate posts were removed. Lastly, to minimize irrelevant data and increase the substantive value of the content, we excluded any comments that were fewer than 5 words in length.12,13
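A minimal sketch of these preprocessing steps is given below, using only the Python standard library. Several details are assumptions made for illustration: the short stop-word list stands in for a full English list, the five-word threshold is applied to the raw comment text, and the function names and sample records are hypothetical. Merging the title and body before cleaning gives the same result as cleaning each part separately.

```python
import re

# Small illustrative stop-word list (assumption); a full English list would
# normally come from an NLP library.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "i", "my"}
MIN_COMMENT_WORDS = 5  # threshold stated in the text


def clean_text(title: str, body: str) -> str:
    """Merge title and body, then lowercase and strip URLs, symbols, stop words."""
    text = f"{title} {body}".lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # remove punctuation and special characters
    return " ".join(tok for tok in text.split() if tok not in STOP_WORDS)


def preprocess(records):
    """records: iterable of (title, body, is_comment) tuples."""
    seen, cleaned_docs = set(), []
    for title, body, is_comment in records:
        # Assumption: the word-count filter is applied to the raw comment text.
        if is_comment and len(body.split()) < MIN_COMMENT_WORDS:
            continue
        cleaned = clean_text(title, body)
        if cleaned and cleaned not in seen:  # drop empty and duplicate entries
            seen.add(cleaned)
            cleaned_docs.append(cleaned)
    return cleaned_docs


# Hypothetical records for illustration only.
records = [
    ("First BCG treatment tomorrow", "Any tips? See https://example.com/info", False),
    ("", "Good luck!", True),
    ("", "I had six BCG instillations and the burning faded after two days.", True),
    ("", "I had six BCG instillations and the burning faded after two days.", True),
]
docs = preprocess(records)
```

In this example the two-word comment and the repeated comment are dropped, leaving two cleaned documents.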
Topic modeling
We used BERTopic, an advanced natural language processing (NLP) algorithm that takes advantage of BERT (Bidirectional Encoder Representations from Transformers) for effective topic modeling. 14 This tool uses the latest advancements in language processing to organize and interpret large amounts of text by identifying their main topics.
Initially, BERTopic embeds textual data at a sentence level, and for this, it can utilize the Sentence-BERT framework. Sentence-BERT is a specialized method for converting sentences into numerical representations (embeddings) that capture their semantic meaning. 15 Following this, the Uniform Manifold Approximation and Projection (UMAP), an unsupervised learning algorithm, is applied to reduce the dimensionality of these embeddings before clustering.
For the embedding step, the all-MiniLM-L6-v2 model was chosen within the Sentence-BERT framework. 16 This model was selected due to its proven effectiveness in analyzing content from diverse domains, including social media and scientific texts. The all-MiniLM-L6-v2 model has been pre-trained on a wide array of data sources. This includes over 600 million social media posts and the S2ORC database, encompassing more than 12.8 million scientific publications in medicine and related fields.
The identification of specific topics was achieved through spectral clustering, a technique that aggregates similar dialogues into discrete topics. The quality of clustering was quantified using two metrics: the Silhouette coefficient and the Davies-Bouldin index.17,18 The Silhouette coefficient assesses the similarity of a discussion to its own topic (cohesion) relative to other topics (separation), with values nearer to 1 indicating better performance. Conversely, the Davies-Bouldin index adopts a broader perspective, evaluating the mean similarity of each topic to its closest counterpart, where scores approaching 0 indicate greater topic distinctiveness.
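To make the two metrics concrete, the sketch below computes both from their standard definitions with Euclidean distance on a toy set of two-dimensional points. It is an illustration only: the study reports using scikit-learn, whereas this version uses the standard library, and the function names, points, and labels are hypothetical.

```python
from math import dist
from statistics import mean


def _group(points, labels):
    """Collect points by cluster label."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    return groups


def silhouette_coefficient(points, labels):
    """Mean of (b - a) / max(a, b) over all points; higher is better."""
    groups = _group(points, labels)
    scores = []
    for point, label in zip(points, labels):
        own = groups[label]
        if len(own) == 1:  # convention: singleton clusters score 0
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(mean(dist(point, q) for q in other)
                for key, other in groups.items() if key != label)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return mean(scores)


def davies_bouldin_index(points, labels):
    """Mean over clusters of the worst (scatter_i + scatter_j) / centroid distance."""
    groups = _group(points, labels)
    centroids = {k: tuple(mean(axis) for axis in zip(*pts)) for k, pts in groups.items()}
    scatter = {k: mean(dist(p, centroids[k]) for p in pts) for k, pts in groups.items()}
    worst = [
        max((scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
            for j in groups if j != i)
        for i in groups
    ]
    return mean(worst)


# Toy data: two well-separated blobs, labeled correctly and then incorrectly.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]
bad_labels = [0, 1, 0, 1, 0, 1]

sil_good = silhouette_coefficient(points, good_labels)
sil_bad = silhouette_coefficient(points, bad_labels)
db_good = davies_bouldin_index(points, good_labels)
db_bad = davies_bouldin_index(points, bad_labels)
```

On this toy data the correct labeling gives a Silhouette coefficient close to 1 and a Davies-Bouldin index close to 0, while the scrambled labeling scores worse on both, which matches the direction of interpretation described above.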
Following the generation of the specific topics, we performed a subsequent clustering analysis on the mathematical representation of these topics to identify overarching themes (groups). The optimal number of groups was determined by maximizing the Silhouette coefficient and minimizing the Davies-Bouldin index, ensuring the best balance between intra-group cohesion and inter-group separation.
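One way to combine the two criteria when choosing the number of groups is sketched below. The text does not state how the two metrics were balanced, so the rank-sum rule used here is an assumption, and the candidate scores are made-up values for illustration, not results from the study. A higher Silhouette coefficient and a lower Davies-Bouldin index are both treated as better.

```python
def pick_number_of_groups(scores):
    """scores: dict mapping k -> (silhouette, davies_bouldin).

    Ranks candidates by silhouette (descending) and Davies-Bouldin (ascending)
    and returns the k with the smallest rank sum. Ties go to the smaller k.
    The rank-sum rule is an illustrative assumption.
    """
    candidates = list(scores)
    by_silhouette = sorted(candidates, key=lambda k: scores[k][0], reverse=True)
    by_davies_bouldin = sorted(candidates, key=lambda k: scores[k][1])
    rank_sum = {k: by_silhouette.index(k) + by_davies_bouldin.index(k) for k in candidates}
    return min(candidates, key=lambda k: (rank_sum[k], k))


# Hypothetical scores for illustration only, not values from the study.
candidate_scores = {
    2: (0.21, 1.90),
    3: (0.26, 1.55),
    4: (0.31, 1.20),
    5: (0.27, 1.35),
    6: (0.22, 1.60),
}
best_k = pick_number_of_groups(candidate_scores)
```

In practice each candidate's scores would come from rerunning the group-level clustering with that number of groups and evaluating both metrics on the result.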
Sentiment analysis
Sentiment analysis is a method used to detect and categorize subjective elements within textual data. A typical approach in sentiment analysis involves categorizing the emotional tone of text into different classes, such as positive, neutral, or negative.
For this study, we utilized a pretrained BERT-based model known as RoBERTa, which was specifically trained using social media content. 19 RoBERTa is adept at assigning multiclass labels, allowing for the classification of text as positive, neutral, or negative. This model has been previously employed in various research projects focusing on healthcare-related issues using data derived from social media sources. 20 To analyze the variation of sentiments across different topics and groups, we converted these sentiment labels into numerical scores: assigning −1 for negative, 0 for neutral, and 1 for positive sentiments.
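The label-to-score conversion and the per-group summary can be sketched as follows. The mapping of −1, 0, and 1 comes from the text; the function name, group names, and example rows are hypothetical, and the labels are assumed to have already been produced by the sentiment model.

```python
from statistics import mean, stdev

# Mapping stated in the text.
LABEL_TO_SCORE = {"negative": -1, "neutral": 0, "positive": 1}


def summarize_sentiment(rows):
    """rows: iterable of (group_name, sentiment_label).

    Returns {group_name: (mean_score, sample_sd)} for each group.
    """
    by_group = {}
    for group, label in rows:
        by_group.setdefault(group, []).append(LABEL_TO_SCORE[label])
    return {
        group: (mean(scores), stdev(scores) if len(scores) > 1 else 0.0)
        for group, scores in by_group.items()
    }


# Hypothetical labeled discussions for illustration only.
rows = [
    ("shortage", "negative"),
    ("shortage", "negative"),
    ("shortage", "neutral"),
    ("shortage", "positive"),
    ("administration", "neutral"),
    ("administration", "neutral"),
    ("administration", "positive"),
    ("administration", "neutral"),
]
summary = summarize_sentiment(rows)
```

A mean near −1 indicates a predominantly negative group, a mean near 0 a neutral or mixed one, and a mean near 1 a predominantly positive one.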
Statistical analysis
We described discussion characteristics using mean and SD. All analyses and figure generation were done in the Google Colaboratory Pro environment (https://research.google.com/colaboratory) with the Python programming language, version 3.10.12 (Python Software Foundation) and multiple key libraries: scikit-learn, version 1.1.2; BERTopic, version 0.16.0; tensorflow, version 2.14; and matplotlib, version 3.7.1.
Results
Figure 1 presents a flowchart of the data collection and analysis process. A total of 1223 unique discussions related to BCG therapy were analyzed, comprising 110 unique posts and 1113 comments from 268 distinct authors, as detailed in Table 1. On average, posts were longer than comments, with a mean (SD) number of characters of 934.1 (626.1) for posts compared to 345.9 (368.9) for comments. The temporal dynamics of BCG-related posts and comments are illustrated in Figure 2A-B. Figure 2A shows the fraction of data entries per year, demonstrating a consistently low frequency until a sharp increase at the end of 2020. Figure 2B, showing the cumulative fraction, displays a gradual and consistent ascent beginning in 2014, which markedly intensifies from 2020.

Flowchart illustrating our data analysis process.

BCG-Related posts and comments over time.
Post and comment summary statistics.
From the dataset, 50 BCG-related discussion topics were identified. Clustering quality was assessed using the Silhouette coefficient, which yielded a score of 0.015, and the Davies-Bouldin index, which yielded a score of 3.67. Subsequently, we conducted a clustering analysis of these 50 topics to identify overarching thematic groups. Four groups were identified: (1) BCG administration procedures, (2) hesitancy in initiating or maintaining BCG treatment, (3) issues related to BCG shortage and alternative treatments, and (4) side effects of BCG treatment (Figure 3A). An overview of these groups with example text is provided in Table 2. Temporal patterns within these thematic groups are depicted in Figure 3B. Additionally, supplementary Figure 1 presents a word cloud depicting the most frequent words in each group.

Topic modeling: Figure 3A presents the spatial distribution of 50 topics (shown as circles) within 4 principal groups. The ‘Feature 1’ and ‘Feature 2’ axes are the result of applying Uniform Manifold Approximation and Projection to reduce complex, high-dimensional data for easier visual interpretation. ‘Feature 1’ captures one dimension of data variation, while ‘Feature 2’ captures a second, distinct dimension. The closeness of points indicates the similarity between topics, and the size of a point indicates the number of related discussions. Figure 3B tracks the temporal progression of discussions for each of the 4 groups.
Overview of groups of topics with example text.
The sentiment analysis of the 1223 discussions, based on individual posts or comments, revealed that 25.2% (308) exhibited a negative sentiment, while the majority, 58.3% (713), were classified as neutral. A smaller proportion, 16.5% (202), reflected a positive sentiment. The overall mean sentiment score across all discussions skewed slightly towards neutral-negative, with a score of −0.10 (SD 0.35). As depicted in Figure 4A-B, all thematic groups predominantly exhibited neutral or negative sentiments. Notably, none of the four thematic groups displayed a predominantly positive sentiment.

Sentiment analysis: mean sentiment (color) across topics (circles) is shown. The size of each topic represents the relative number of discussions grouped in that topic. Mean sentiment scores that were close to −1 reflected a predominantly negative sentiment (red), close to 0 reflected an overall neutral sentiment (orange), and close to 1 reflected an overall positive sentiment (yellow). The x- and y-axes represent the 2 Uniform Manifold Approximation and Projection axes that were dimensionally reduced to allow for topic visualization.
Discussion
Reddit has become a widely used online platform for individuals seeking information and sharing experiences about various types of cancer.21,22 In this AI-based theme clustering study, we harnessed over ten years of user-generated content on this social media site to delve into the public's views and attitudes towards BCG therapy. Using advanced artificial intelligence techniques, we examined 1223 distinct discussions contributed by 268 unique users. Our analysis delineated 50 topics, which were categorized into four main thematic areas. Sentiment analysis of these discussions revealed a general trend of neutral to negative emotions. The study's results bring to light the community's perspective on BCG therapy and identify areas where modifications could potentially enhance its acceptance and usage.
First, the high level of investment that patients have in their health and treatment options was particularly evident during periods when BCG was in short supply. In 2020, due to this shortage, the AUA advised administering lower doses of BCG. 23 Consequently, providers have had to resort to alternative intravesical therapies, 24 which are often more costly and less effective. This was reflected in the comments and discussions on Reddit, where individuals openly expressed their concerns, dissatisfaction, and frustrations. Many shared personal experiences and the impact of the shortage on their treatment plans, highlighting their dependency on this specific therapy. Moreover, in response to the shortage, patients explored alternatives such as mitomycin, 25 gemcitabine, 26 or participation in clinical trials. 27 These discussions emphasize the need for robust supply chains and transparent communication from healthcare providers and authorities, especially in managing critical treatments like BCG. 28
Second, comments and discussions on Reddit revealed a significant lack of information about how BCG is instilled. Many people expressed confusion and a lack of clear guidance about the procedure itself, despite the availability of existing resources such as BCAN (Bladder Cancer Advocacy Network) booklets and other educational materials. This suggests that some members of the bladder cancer community may not be taking full advantage of these resources. The queries ranged from the correct method of administration to handling potential immediate side effects. This lack of information often led to apprehension and unease among patients, who turned to online forums seeking advice and sharing personal experiences for support. These discussions highlight the critical importance of providing detailed, accessible information to patients undergoing BCG treatment, not only about the procedure itself but also about what to expect afterwards.29,30 The healthcare community could address this need by developing more robust educational materials and communication strategies to assist patients throughout their treatment journey. 31
Interestingly, despite the significance of side effects in BCG treatments, they did not dominate the conversations in our study. This finding is somewhat paradoxical, given the usual concerns patients have about the post-instillation adverse effects of treatments like BCG. One possible explanation for this could be that patients discussing BCG on Reddit might have focused more on issues such as treatment availability, effectiveness, or procedural information, rather than on side effects. Local adverse effects from BCG therapy are frequently observed, with incidences ranging between 62.8% and 75.2% in patients. 32 These complications are typically mild in nature. They often present within hours of BCG administration and are self-limited, resolving within 48–72 h. Common issues include chemical cystitis, bacterial cystitis, hematuria, and increased urinary frequency. Each of these symptoms is reported in 20% to 50% of patients in most extensive studies.33,34 This shift in focus could also indicate a level of acceptance or a well-managed approach to the side effects among this patient community, possibly due to adequate prior information or effective coping strategies. Alternatively, it could suggest that for these patients, other aspects of BCG treatment were more pressing or challenging than the side effects.
Finally, the sentiment analysis further enriches our understanding by quantifying the emotional tone of these discussions. The predominance of neutral (58.3%) and negative sentiments (25.2%) with a mean sentiment score leaning towards neutral-negative (−0.10) indicates a general trend of concern or dissatisfaction among individuals discussing BCG on Reddit. This could reflect challenges in treatment, apprehensions about side effects, or frustrations due to the shortage of BCG. Notably, none of the identified groups had a predominantly positive sentiment, underscoring the need for better communication and support for individuals undergoing or considering BCG treatment.
Our approach is similar to previous studies that have used AI-driven analytics to extract insights from social media platforms, particularly in the field of public health.20,35,36 While these methods are promising, there are significant challenges to validating insights derived from such approaches. A key challenge is ensuring that AI-generated themes accurately capture the nuanced emotions and experiences of the patient population, which can sometimes be missed in automated analyses. In addition, the representativeness of the patient population on social media may be skewed toward younger, more tech-savvy users, making it difficult to generalize findings to the broader population of bladder cancer patients. To address these challenges, future research could employ mixed-methods approaches, combining AI-based analyses with traditional qualitative methods to compare and verify identified themes. Alternatively, conducting patient interviews or surveys could help validate AI-generated findings. These strategies would not only increase the rigor of the research, but also improve the applicability of this approach by providing a more accurate and comprehensive understanding of patient perspectives. 37
This study has several limitations. First, the absence of demographic and geographic data for Reddit users limits our ability to assess the representativeness of the sample within the broader bladder cancer population. This lack of location data also prevents analysis of potential regional variations in patient experiences and BCG treatment access. Second, the methodology did not control for individual user contribution frequency, potentially allowing frequent commenters to disproportionately influence the findings. Third, our analysis was confined to English-language discussions on a single social media platform. While Reddit was chosen due to its large, active health-related communities and accessibility via API, it may not represent all online forums frequented by bladder cancer patients, potentially limiting the generalizability of our findings. Fourth, while our AI-based approach using BERTopic enabled efficient analysis of large-scale data, it may not capture the nuanced interpretations typical of traditional qualitative methods such as reflexive thematic analysis. Although we contextualized identified themes using existing literature and clinical experience, subtle themes, cultural contexts, and emotional nuances might have been overlooked compared with comprehensive human qualitative analysis. Finally, the temporal patterns observed, particularly the increase in posts during 2020–22, 38 warrant cautious interpretation as they may reflect pandemic-driven shifts to online health communities rather than natural trends in bladder cancer discussions.
Conclusion
Information shared on online social media platforms often includes detailed accounts of personal experiences with BCG therapy, which are not extensively documented in medical literature. Patients frequently express numerous questions and concerns regarding BCG as a treatment option, particularly about its administration and effectiveness. Additionally, they often voice a variety of frustrations, such as issues with accessing treatment and inquiries about alternative therapies. By gaining a deeper understanding of these patient perspectives, medical professionals can more effectively address patient needs, thereby enhancing the care and adherence to treatment in NMIBC.
Supplemental Material
sj-docx-1-blc-10.1177_23523735241304907 - Supplemental material for Research report: BCG therapy for bladder cancer: Exploring patient experiences and concerns through artificial intelligence-based social media analysis
Supplemental material, sj-docx-1-blc-10.1177_23523735241304907 for Research report: BCG therapy for bladder cancer: Exploring patient experiences and concerns through artificial intelligence-based social media analysis by Zine-Eddine Khene, Isamu Tachibana, Raj Bhanvadia, Hagan Ausmann, Vitaly Margulis and Yair Lotan in Bladder Cancer
Footnotes
Author contributions
Conceptualization and design: Zine-Eddine Khene, Isamu Tachibana, Hagan Ausmann, Raj Bhanvadia, Vitaly Margulis, Yair Lotan.
Methodology: Zine-Eddine Khene, Vitaly Margulis, Yair Lotan.
Data curation: all authors.
Writing the original draft: Zine-Eddine Khene,
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: Yair Lotan is a Consultant for Nanorobotics, C2I genomics, Photocure, Astrazeneca, Merck, Fergene, Abbvie, Nucleix, Ambu, Seattle Genetics, Hitachi, Ferring Research, verity pharmaceutics, virtuoso surgical, Stimit, Urogen, Vessi medical, CAPs medical, Xcures, BMS, Nonagen, Aura Biosciences, Inc., Convergent Genomics, Pacific Edge, Pfizer, Phinomics Inc, CG oncology, Uroviu, On target lab, Promis Diagnostics, Valar labs, Uroessentials.
Zine-Eddine Khene received financial support through grants from Fondation France for Interdisciplinary Studies.
Data availability
The data supporting the findings of this study are available on request from the corresponding author.
Supplemental material
Supplemental material for this article is available online.
