Abstract
Objectives
Urinary tract infections (UTIs) frequently affect individuals of all ages, necessitating antibiotic treatment and medical care, which can impair quality of life and cause psychological strain. Online Health Consultation (OHC) platforms serve as a widely used communication tool, offering integrated support for medical guidance and disease management. By examining OHC interactions, this study explores the concerns and difficulties experienced by UTI patients to better understand their perspectives.
Methods
Data from 20,000 anonymized UTI-related records (2020–2024) were obtained from a major Chinese online healthcare platform, Good Doctor Online. Analysis occurred in two stages: BERTopic extracted key themes and keywords from text data, followed by sentiment analysis of these findings using a generative AI language model. All data were publicly accessible and de-identified.
Results
Analysis of 18,479 cleaned records using BERTopic identified six key themes: “Polite Expressions for Consultation,” “Diagnosis, Symptoms, and Management Challenges,” “Differential Diagnosis of Cystitis,” “Etiology Related to Sexual Activity,” “Nocturnal Symptoms and Fever,” and “Perinatal Considerations.” Sentiment analysis showed predominantly negative emotions, reflecting the condition's substantial physical and mental toll. The “Etiology Related to Sexual Activity” theme had the highest negativity (97%), while “Polite Expressions for Consultation” showed the most positivity (9%).
Conclusion
These research results highlight the important role of online communities in providing support and information to patients, and the insights derived from this study can provide valuable reference for social media developers, medical service providers, and policymakers.
Keywords
Introduction
Urinary tract infections (UTIs) represent a complex and escalating public health challenge, particularly pronounced in developing nations. UTIs affect individuals across all age groups, frequently necessitating medical consultations and antibiotic therapy while significantly compromising patient quality of life.1–4 Beyond the immediate physical discomfort, characteristic symptoms disrupt fundamental daily activities, including work, exercise, and sleep. Moreover, the persistent burden of disease management contributes to substantial emotional distress, often manifesting as psychological sequelae that elevate the risk of comorbid anxiety and depression.5,6
Concurrently, the integration of digital technologies into healthcare systems has fundamentally transformed patient–physician interactions. Online Health Consultation (OHC) platforms have emerged as pivotal channels for healthcare delivery, particularly in regions with limited access to traditional medical expertise. Facilitated by mobile internet technology, these platforms exemplify a broader digital transformation that has reshaped public health information acquisition. 7 This technological shift, bolstered by supportive policies and evolving healthcare-seeking behaviors, has accelerated the digitalization of healthcare in China, 8 establishing the internet as the primary source for health-related information. 9 Areas such as information sharing and online diagnosis have positioned the internet as a critical space for obtaining health knowledge, 10 a trend amplified during the COVID-19 pandemic.11–13
Importantly for conditions like UTIs, which may involve sensitive or stigmatized topics, OHC offers distinct advantages over traditional face-to-face consultations: 24/7 accessibility irrespective of location, 14 and a degree of anonymity that facilitates disclosure of concerns with reduced psychological burden.15–17 Consequently, a 2021 nationwide survey indicated that 28.9% of Chinese residents intended to utilize OHC for diagnosis and treatment, confirming its status as a prevalent doctor–patient communication modality that delivers an integrative support environment for professional guidance and illness management. 18 These digital interactions generate extensive clinical narratives through text-based consultations, creating an unprecedented repository of real-world clinical insights. Crucially, such patient-generated data captures nuanced aspects of disease presentation, progression, and management often undetected in conventional structured clinical data. Furthermore, OHC empowers patients by providing social support, including informational and emotional support from healthcare professionals. 19
Despite this wealth of data generated by OHC platforms, characterizing patient perspectives and experiences on conditions like UTIs has traditionally relied on cohorts recruited through clinical settings.20,21 While valuable, these methods are often inefficient for capturing insights from the broader patient community actively engaging online. Within these high-interaction digital communities, vast amounts of highly unstructured, tacit knowledge accumulate.22,23 However, the inherent complexity of this narrative data presents significant analytical challenges. Extracting meaningful, clinically relevant insights necessitates sophisticated computational methods for knowledge extraction and interpretation, 24 which have historically been lacking.
Fortunately, contemporary advances in medical informatics offer significant methodological progress in handling unstructured healthcare data. Text mining techniques based on natural language processing (NLP) and machine learning have demonstrated notable efficacy in extracting clinically relevant insights from diverse medical text sources25,26 (e.g., adverse drug reaction detection, 27 disease trajectory prediction, 28 novel clinical association discovery). Nevertheless, unique linguistic, contextual, and cultural features within Chinese medical discourse introduce additional methodological hurdles that remain inadequately addressed within current analytical frameworks.
Specifically to address these challenges, particularly the nuanced and indirect nature prevalent in Chinese social media discourse, advanced analytical tools are required. While traditional topic modeling methods like Latent Dirichlet Allocation (LDA) and Top2Vec often struggle with contextual complexity, BERTopic has demonstrated its capability to excel in identifying subtle themes with greater precision, terminological diversity, and flexibility. 29 Building upon this foundation and to achieve deeper semantic analysis, we employed the cutting-edge capabilities of ChatGPT-4. This novel approach integrates BERTopic's contextual depth with ChatGPT-4's advanced semantic understanding and generative labeling. We utilized ChatGPT-4 both to classify the themes identified by BERTopic and to assess sentiment within the discourse. This AI-driven integration aims to significantly improve the accuracy and interpretability of topic modeling, thereby unlocking meaningful insights from the intricate linguistic patterns found in large-scale, real-world patient narratives.
Therefore, this research aims to harness the combined analytical power of BERTopic and ChatGPT-4. We will apply this integrated methodology to delve into large volumes of consultation records from China's premier OHC platform. Our primary goal is to uncover the common concerns, information needs, and emotional experiences of patients and health community users regarding UTIs within this digital landscape. Ultimately, we seek to generate actionable insights that can inform strategies for improving patient-centered healthcare services and optimizing the organization and structuring of health information within online health communities.
Methods
Study overview
This retrospective observational study integrates data mining and textual analysis within the interdisciplinary fields of medical informatics and health services research. It employs a mixed-methods approach, combining quantitative topic modeling and qualitative sentiment analysis on textual data, to systematically identify key challenges in UTI diagnosis and treatment and to uncover multidimensional characteristics of patient experiences.
The research consisted of two main phases: First, the BERTopic model was applied to patient consultation texts to extract core themes and representative keywords, with typical comment excerpts selected to enhance thematic interpretability. Subsequently, large language models (LLMs) were employed to further deepen the understanding of thematic connotations and their associated emotional tendencies. The entire process included detailed topic labeling and emotion-oriented identification to uncover predominant emotional characteristics within each theme.
Data collection
Data were obtained from “Haodf.com” (Good Doctor Online), a prominent Chinese OHC platform. The selection of this platform was based on comprehensive ranking metrics for medical and health websites, incorporating authoritative indicators such as Alexa ranking, Baidu weight, PR value, and mobile compatibility.
Relevant consultation records were collected by searching the following Chinese keywords: “Urinary Tract Infection,” “Bladder Infection,” “Cystitis” and “Urethritis.” All related records from June 2020 to September 2024 were crawled. Haodf.com provides full access to registered users, allowing the review of all publicly available content. The consultation data typically include text-based interactions between patients and healthcare providers, covering symptom descriptions, medical history, timestamps, and physician identifiers, among other metadata.
In evaluating the stability of clustering results, Harloff and colleagues 30 examined the convergence rates of various clustering methods using six sets of partitioned ranking data. Their results suggest that, for domains containing 25 items or more, around 20 samples per topic are needed to ensure clustering stability. Applying this principle, the theoretical minimum sample size is 400 entries, which corresponds to 20 samples for each of 20 topics. The present study uses 20,000 records, well above this minimum, and therefore meets the requirement for stable clustering.
Although these consultation data are publicly accessible and commonly used for research, this study strictly adhered to ethical guidelines, emphasizing user privacy and data confidentiality. All records underwent manual review to exclude any potentially personally identifiable information, ensuring compliance with privacy protection standards.
Data preprocessing
Preprocessing of the raw data was performed using Python (Python Software Foundation). From an initial set of 20,000 consultation records, incomplete documents, entries from unrelated medical specialties, and duplicate records were excluded. Numbers, special characters, and stop words were also filtered out. The application of these criteria resulted in a final dataset of 18,479 valid online consultation records, forming a robust sample for analyzing UTI-related consultations.
For Chinese word segmentation, the Python library “jieba” was used to convert continuous text into discrete lexical sequences. The process integrated four major stop word lexicons: the Harbin Institute of Technology Stop Word List, Baidu Stop Words, the Renmin University of China Stop Words, and the Sichuan University Machine Intelligence Laboratory Stop Words, supplemented by a custom stop word list. This effectively removed irrelevant words and transformed the text into a word-list format. These preprocessing steps significantly enhanced the robustness and interpretability of subsequent analyses.
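The cleaning and segmentation steps above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `clean_and_segment`, the regular expressions, and the one-word stop list are assumptions. The only call taken from the text is jieba's segmenter (`jieba.lcut`), which is imported lazily so that another segmenter can be substituted.

```python
import re


def clean_and_segment(text, stop_words, segment=None):
    """Strip digits and special characters, segment the text, drop stop words.

    `segment` is any callable mapping a string to a list of tokens. When it is
    omitted, jieba.lcut is used (requires the third-party jieba package).
    """
    if segment is None:
        import jieba  # imported lazily; only needed for the default segmenter
        segment = jieba.lcut
    text = re.sub(r"\d+", " ", text)       # numbers, including full-width digits
    text = re.sub(r"[^\w\s]", " ", text)   # punctuation and other special characters
    tokens = (token.strip() for token in segment(text))
    return [token for token in tokens if token and token not in stop_words]


# Illustration with a whitespace splitter on pre-segmented text. In the study,
# the stop-word set would be the union of the four lexicons plus the custom list.
example = clean_and_segment("尿频 尿急 3 天，谢谢 医生！", {"谢谢"}, segment=str.split)
print(example)
```

Passing the segmenter as an argument keeps the filtering logic testable without the jieba dependency; in normal use the argument can simply be left out.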
Topic identification with BERTopic methodology
The BERTopic method is an unsupervised topic modeling method built on pretrained deep learning models and is often used in social science research. It achieves significant results in analyzing social media content, 31 performs well when reviewing analyst reports, 32 plays a key role in conducting literature meta-analysis, 33 and is also effective in evaluating customer reviews. 34 This technology uses the self-attention mechanism and class-based Term Frequency-Inverse Document Frequency (C-TF-IDF) to generate well-defined clusters that improve the interpretability of topics while retaining key descriptive keywords.
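To make the class-based weighting concrete, the sketch below computes C-TF-IDF scores in plain Python using the weighting described for BERTopic, W(t, c) = tf(t, c) · log(1 + A / f(t)), where tf(t, c) is the frequency of term t in topic c, f(t) is its frequency across all topics, and A is the average number of words per topic. The function name `c_tf_idf`, the toy tokens, and the normalization of tf by topic length are illustrative assumptions rather than details taken from the study.

```python
import math
from collections import Counter


def c_tf_idf(classes):
    """Return {topic: {term: weight}} for a mapping {topic: list of tokens}.

    Each topic is treated as one long document made of all of its records.
    """
    term_counts = {topic: Counter(tokens) for topic, tokens in classes.items()}
    corpus_counts = Counter()
    for counts in term_counts.values():
        corpus_counts.update(counts)
    avg_words = sum(corpus_counts.values()) / len(classes)  # A in the formula

    weights = {}
    for topic, counts in term_counts.items():
        n_tokens = sum(counts.values())
        weights[topic] = {
            term: (freq / n_tokens) * math.log(1 + avg_words / corpus_counts[term])
            for term, freq in counts.items()
        }
    return weights


# Toy example: "urine" occurs in both topics, so it is down-weighted in topic 0.
toy = {0: ["pregnancy", "infant", "urine"], 1: ["fever", "urine", "urine"]}
scores = c_tf_idf(toy)
print(sorted(scores[0], key=scores[0].get, reverse=True))
```

Terms that are frequent inside one topic but rare across the others receive the highest weights, which is why the top-ranked words tend to be descriptive of a single topic.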
While conventional topic modeling approaches such as LDA, 31 Probabilistic Latent Semantic Analysis (PLSA), 35 and the Correlated Topic Model (CTM) 36 have their merits, BERTopic offers three distinct improvements. First, it uses a pretrained bidirectional BERT model to produce text embeddings, which helps identify document themes precisely and handle polysemous words.37,38 Second, by integrating the Transformer framework with the C-TF-IDF technique, it builds compact clusters with improved topic coherence. 39 Third, whereas other methods may lose important terms, BERTopic retains the most meaningful words throughout the clustering process, making the final output easier to interpret.
The BERTopic topic modeling workflow encompasses five essential stages 40 : first, it uses a BERT model to generate numerical embedding vectors; next, it applies Uniform Manifold Approximation and Projection (UMAP) for dimensionality reduction; then, it employs Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) for clustering; following that, it implements various topic generalization approaches; lastly, it extracts pivotal keywords for each cluster using C-TF-IDF, streamlining topic identification and ranking. The system's modular structure allows researchers to adjust each phase to suit their requirements. Given these benefits, this research employs BERTopic to identify topics within consultation records.
Therefore, we employed the BERTopic Python library for structured topic modeling to explore patients’ experiences with UTIs. The default Sentence-Transformers embedding method in BERTopic was utilized, specifically the paraphrase-multilingual-MiniLM-L12-v2 model, which supports Chinese text embedding. For dimensionality reduction and clustering, we adopted the default UMAP and HDBSCAN methods, respectively. After comparing multiple modeling results, the key parameters were set as follows: UMAP n_neighbors = 10, n_components = 10; HDBSCAN min_cluster_size = 200; and random_state = 42. During text processing, Scikit-learn's CountVectorizer was first applied for feature extraction and vectorization, converting raw text into a term frequency matrix. In terms of performance, the BERTopic model demonstrated significantly higher topic coherence compared to the traditional LDA model.41,42 Subsequent optimization strategies (“C-TF-IDF,” “Distributions,” and “Embeddings”) were implemented by reloading the model, transforming text data, reducing outlier topics, and updating the model to iteratively improve topic quality and structural rationality.
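A configuration along these lines could be assembled as sketched below. The embedding model name and the UMAP, HDBSCAN, and random-state values are those reported in the text; everything else is an assumption made for illustration. This includes the helper names, the whitespace tokenizer (which presumes documents are jieba tokens joined by spaces), the `min_dist` and `metric` values (BERTopic's documented defaults, assumed unchanged), and the exact order of the outlier-reduction calls. Third-party imports are kept inside the functions so that the parameter block can be read on its own.

```python
# Parameter values reported in the text.
EMBEDDING_MODEL = "paraphrase-multilingual-MiniLM-L12-v2"
UMAP_PARAMS = {"n_neighbors": 10, "n_components": 10, "random_state": 42}
HDBSCAN_PARAMS = {"min_cluster_size": 200}


def build_topic_model():
    """Assemble a BERTopic model from the reported settings."""
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from sklearn.feature_extraction.text import CountVectorizer
    from umap import UMAP

    # Assumption: documents are pre-segmented tokens joined by spaces, so a
    # whitespace split keeps single-character Chinese words.
    vectorizer = CountVectorizer(tokenizer=str.split, token_pattern=None)
    return BERTopic(
        embedding_model=EMBEDDING_MODEL,
        # min_dist and metric are BERTopic's defaults, assumed unchanged.
        umap_model=UMAP(min_dist=0.0, metric="cosine", **UMAP_PARAMS),
        hdbscan_model=HDBSCAN(**HDBSCAN_PARAMS),
        vectorizer_model=vectorizer,
    )


def fit_and_reduce_outliers(docs, strategy="c-tf-idf"):
    """Fit the model, reassign outlier documents, and refresh topic keywords.

    `strategy` may also be "distributions" or "embeddings", the other two
    options named in the text.
    """
    topic_model = build_topic_model()
    topics, _ = topic_model.fit_transform(docs)
    new_topics = topic_model.reduce_outliers(docs, topics, strategy=strategy)
    topic_model.update_topics(
        docs, topics=new_topics, vectorizer_model=topic_model.vectorizer_model
    )
    return topic_model, new_topics
```

With `min_cluster_size = 200`, any group of fewer than 200 similar records is treated as noise, which is one reason an explicit outlier-reduction pass is useful on a corpus of this size.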
Topic labeling with LLM using prompt engineering
We used LLMs to refine and enrich the interpretation of the topics that emerge from BERTopic. By inputting BERTopic's findings into OpenAI's ChatGPT-4o-mini, guided by carefully designed prompts, we aimed to validate the core meanings and emotional undertones of the topics identified via unsupervised machine learning. This strategy clarifies the results and supports a more nuanced automatic thematic analysis. In the context of large-scale text sentiment analysis, LLMs have demonstrated significant advantages in understanding context, parsing complex semantics (e.g., identifying sarcasm and irony), and capturing fine-grained emotional tendencies (e.g., distinguishing between “disappointment” and “anger”). A growing body of literature has already adopted this approach.43,44
We have developed a multitier emotional dictionary, integrating automated collection with manual validation to balance efficiency and accuracy. The dictionary comprises two main parts:
1. Basic emotion words: contains general Mandarin emotion terms, including:
Positive emotion words: for example, “satisfied,” “effective,” “improved,” etc., used to express positive emotions. Negative emotion words: for example, “pain,” “discomfort,” “worry,” etc., used to express negative emotions. Each emotion word is tagged with its emotional polarity (positive/negative) and an intensity score reflecting how strongly the emotion is conveyed.
2. Domain-specific emotion words: to boost analysis performance in particular contexts, we have collected specialized vocabulary for medical consultation scenarios, such as “alleviate,” “side effects,” “therapeutic effect good,” etc. These terms were extracted from medical forums, consultation records, and professional literature to ensure strong applicability in the medical field.
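One way to represent such a two-part dictionary is sketched below. The example terms come from the text, but the class and function names, the 0 to 1 intensity scale, and every intensity value are invented placeholders. The text does not specify how the dictionary was combined with the LLM, so the scoring function is only an illustration of how polarity and intensity tags might be used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmotionEntry:
    term: str
    polarity: str            # "positive" or "negative"
    intensity: float         # hypothetical 0-1 scale; values below are placeholders
    domain_specific: bool = False


LEXICON = {
    entry.term: entry
    for entry in [
        EmotionEntry("satisfied", "positive", 0.6),
        EmotionEntry("pain", "negative", 0.8),
        EmotionEntry("worry", "negative", 0.5),
        EmotionEntry("alleviate", "positive", 0.5, domain_specific=True),
        EmotionEntry("side effects", "negative", 0.6, domain_specific=True),
    ]
}


def polarity_score(tokens, lexicon=LEXICON):
    """Sum signed intensities of the tokens found in the lexicon."""
    score = 0.0
    for token in tokens:
        entry = lexicon.get(token)
        if entry is not None:
            sign = 1.0 if entry.polarity == "positive" else -1.0
            score += sign * entry.intensity
    return score


print(polarity_score(["pain", "worry", "alleviate"]))
```

Keeping the general and domain-specific entries in one structure, distinguished by a flag, makes it easy to audit which part of the dictionary drives a given score.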
To ensure the accuracy and consistency of LLMs in sentiment annotation tasks, this study employs a systematic validation process that comprises two main stages:
First, validation checks were the primary mode, with human reviewers directly evaluating whether the LLM's sentiment judgments were correct and correcting any mislabeling. To ensure uniform review standards and reduce subjective differences among reviewers, we established explicit sentiment-discrimination criteria. Following statistical requirements at the 95% confidence level with a ±5% margin of error, 400 texts were randomly sampled from the overall dataset for detailed verification. Two reviewers independently assessed the same batch of samples, and disagreements were arbitrated by a third senior expert, who determined the final label. Furthermore, during manual review, systematic errors of the LLM (such as consistently misclassifying certain sarcastic statements as positive sentiment) were proactively identified and fed back into the model optimization process. These findings can inform iterative adjustments to prompt design or postprocessing workflows, forming a closed-loop quality control mechanism for ongoing improvement.
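The sample size quoted above is consistent with the standard formula for estimating a proportion, n = z² · p(1 − p) / e². The text does not name the formula, so its use here is an assumption, as is the conservative choice p = 0.5; the function name is hypothetical.

```python
import math


def cochran_sample_size(z=1.96, p=0.5, margin=0.05):
    """Minimum sample size for estimating a proportion (large population)."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


# 1.96^2 * 0.25 / 0.05^2 = 384.16, which rounds up to 385.
print(cochran_sample_size())
```

Under these assumptions the minimum is 385 texts, so a verification sample of 400 meets the stated requirement with a small buffer.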
Second, to quantify the overall performance of LLM labeling, another random sample of 200 items was selected for independent coding validation, with domain experts independently performing sentiment labeling, and the agreement between LLM outputs and expert labels was calculated. Interrater reliability was assessed using Cohen's Kappa, and statistical analysis was conducted with IBM SPSS Statistics 23.0. Results showed a Kappa value of 0.854, exceeding the conventional threshold of 0.80, indicating a high level of agreement between the LLM outputs and human expert annotations, thereby supporting the reliability of the approach in this study and enabling further analyses based on automated labeling results. The prompt used in this study is provided in the supplementary materials.
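For readers unfamiliar with the statistic, Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). The study computed it in SPSS; the plain-Python function below is only an illustrative equivalent, and the six labels in the example are made up.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / n ** 2
    if expected == 1.0:  # both raters used a single identical category
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up labels: the LLM and the expert disagree on one of six records.
llm = ["neg", "neg", "neg", "pos", "neu", "neg"]
expert = ["neg", "neg", "pos", "pos", "neu", "neg"]
print(round(cohens_kappa(llm, expert), 3))
```

Because sentiment labels in this corpus are heavily skewed toward the negative class, chance agreement is high, which is why kappa is more informative here than raw percentage agreement.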
Ethical considerations
The Ethics Committee of Jiangsu Province Hospital of Chinese Medicine approved this study and waived the requirement for written informed consent (Approval No. 2024NL-025-01), as all data were obtained from a public platform. Our study was conducted in accordance with the principles outlined in the Declaration of Helsinki. All data underwent comprehensive de-identification to remove personal details. In accordance with the platform's terms of service and established international research practices, user consent followed an implied consent model.45–47
Results
Building on the framework of prior studies, this research utilizes the BERTopic topic modeling technique to identify predominant themes in UTI online consultation data. Subsequently, the study undertakes a detailed analysis of these thematic clusters.
Topic identification
Under the preset parameter conditions, the BERTopic model identified 14 core topics (ranging from Topic 0 to Topic 13). These topics were arranged in descending order of their prevalence within the corpus. Each topic is represented by a set of characteristic terms, with varying weights assigned to each term; terms possessing higher weights contribute more significantly to defining the topic. The marginal diminishing effect associated with the feature word weights for each topic is illustrated in Figure 1.

Figure 1. The declining trend of feature words’ weights.
Analysis reveals that 3–5 key terms are typically sufficient to capture the essence of most topics. Beyond five terms, the inclusion of additional features yields diminishing returns, offering minimal added value for topic representation. Figure 2 presents the probabilistic feature weights and corresponding lexicon for the 14 identified research topics. This visualization enables systematic corpus-based assessment of topic-defining terminology across consultation records, permitting definitive thematic designations through comprehensive analytical synthesis. To illustrate, consider Topic 0: key terms such as “Pregnancy,” “Infant,” and “Lactation” distinctly delineate its core focus. An examination of associated discussions confirms that this topic centers on investigating the causes of urinary tract infections during the perinatal period. Consequently, documents associated with Topic 0 are categorized under “Perinatal Health.” The identification and naming of subsequent topics follow this methodology, ensuring the assigned labels accurately reflect the research nuance and maintain a clear, scholarly tone.

Figure 2. Fourteen consultation topic feature words and weight distribution.
This study uses a combination of internal consistency (evaluating keywords within each topic) and external validity (document sampling validation) to comprehensively validate the results of the topic model.
Regarding internal consistency, a panel of experts first assessed the semantic coherence and logical connections among the keywords inside each topic to ensure internal consistency, while also comparing keyword sets across different topics to test how well the topics are distinguished, thereby confirming the uniqueness of each topic.
Regarding external validity, we randomly selected 50 records assigned to each topic for manual cross-checking by the expert panel. For example, for Topic 0 (keywords: “Pregnancy,” “Infant,” “Preconception care”), the experts reviewed the original texts and found that 48 of the 50 records were related to pregnancy, family planning, infant care, and menstrual health, indicating a high level of classification accuracy for this topic.
Main themes
To begin, we computed a similarity matrix across topics using cosine similarity. This metric allowed us to visualize the strength of associations between different subjects through a heatmap (Figure 3). Building on these relationships, we then applied clustering techniques to systematically organize the topics, ultimately establishing a comprehensive framework for discussions about urinary tract infections. The clustering outcomes (Figure 4) revealed that the 14 initial topics naturally fell into 6 broader thematic categories, each reflecting core areas of patient inquiry. We assigned descriptive labels to these clusters based on their predominant keywords, with the full breakdown provided in Table 1. Topic 12 exhibits minimal thematic associations with other research topics in the cluster analysis visualization. Given that its characteristic keywords constitute standard consultation terminology, it has consequently been classified alongside topics 2, 3, and 4 within the overarching theme of “Polite expressions for consultation.”
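The topic-to-topic similarity matrix behind such a heatmap can be illustrated with a few lines of plain Python. The function name and the three two-dimensional toy vectors are assumptions; in practice each row would be a topic representation (for example, a topic embedding) produced by the fitted model, and the text does not state which representation or clustering routine was used.

```python
import math


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between equally long, non-zero vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    return [
        [sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)
         for v, norm_v in zip(vectors, norms)]
        for u, norm_u in zip(vectors, norms)
    ]


# Toy topic vectors: the third lies halfway between the first two.
toy_topics = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in cosine_similarity_matrix(toy_topics):
    print([round(value, 3) for value in row])
```

A hierarchical clustering step can then operate on the corresponding distances (one minus similarity), so that topics with high mutual similarity merge first into broader themes.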

Figure 3. Heatmap of themes in consultation records related to urinary tract infections.

Figure 4. Thematic hierarchy diagram: research theme clustering of urinary tract infection records based on cosine similarity.
Table 1. Keywords representative of the six main themes.
Note: Keywords are ranked by their relevance scores from the BERTopic results; the 10 most representative words for each topic are shown, after removing duplicates. WBCs: white blood cells; UA: urinalysis; RBCs: red blood cells; IC: interstitial cystitis; TUR: transurethral resection.
Attention is distributed unevenly across the thematic clusters. Table 2 details the number of themes encompassed within each cluster, the corresponding volume of related consultation records, and their proportional distribution. This quantification offers an objective basis for analyzing the primary thematic clusters within UTI online consultation data. The overall distribution indicates that the cluster “Diagnosis, Symptoms, and Management Challenges” contains the highest number of consultation records, suggesting that topics within this cluster receive the greatest patient attention. The clusters “Polite Expressions for Consultation” and “Nocturnal Symptoms and Fever” exhibit comparable record counts. In contrast, the clusters “Perinatal Considerations,” “Differential Diagnosis of Cystitis,” and “Etiology Related to Sexual Activity” each comprise only a single theme, representing the clusters with the lowest number of constituent themes.
Table 2. The volume of themes and records related to urinary tract infections.
Thematic content analysis
Topic 1: Polite expressions for consultation
This topic accounts for 14.20% of the discussions (2624/18,479), focusing primarily on the conventional phrases used when consulting doctors online. Key terms include “Trouble you,” “Hello,” and “Thanks.” In internet-based communications, patients frequently employ such polite expressions to convey respect for medical professionals. Examples are as follows: Hello, Professor. I’ve been experiencing painful urination ever since a high-risk encounter with my ex-girlfriend in March. Hello, Doctor. I’m a patient who consulted you before. You mentioned I was suffering from anxiety and advised me to see a psychiatrist.
Topic 2: Diagnosis, symptoms, and management challenges
Constituting 56.01% of the discussions (10,349/18,479), this topic centers on the diagnosis of UTIs, their primary symptoms, and preventive and therapeutic measures. As the most prominent topic, it garners the greatest attention from patients.
Terms related to specific diagnostic tests include “White blood cells (WBCs),” “Red blood cells (RBCs),” “Bacteria,” as well as testing methods such as “Urinalysis (UA)” and “Urine culture.”
Examples of records utilizing these keywords: Now my routine urine test is basically normal, but the urine culture still grows bacteria—though not in large quantities. Even after taking sensitive medications, the symptoms only improve slightly. The urine test showed elevated levels of occult blood, white blood cells, and bacteria, leading to a diagnosis of acute cystitis.
Patients also frequently report UTI-related discomfort, with key terms such as “Incomplete bladder emptying” and “Urinary hesitancy.”
Examples of such consultations:
Frequent urination, incomplete emptying, and occasional difficulty urinating. The bladder trigone feels swollen and irritated, and I have to get up to urinate over ten times at night.
Two weeks ago, after drinking lemon water, I developed frequent and urgent urination with pain. Due to the pandemic, I couldn’t get a proper urine culture, so I couldn’t receive targeted medication and had to rely on empirical treatment instead.
Posts addressing the correct usage, precautions, and dosages of medications are also prevalent, with high-frequency keywords including “Antibiotics,” “Cephalosporins,” “Levofloxacin,” and “Sanjin tablets (Chinese herbal medicine).”
For instance:
I’ve been taking antibiotics before, sometimes along with Sanjin tablets. This time, I was worried about antibiotic resistance, so I tried using only Sanjin tablets, but the results weren’t satisfactory.
Topic 3: Differential diagnosis of cystitis
This topic makes up 9.70% of the discussions (1793/18,479), focusing on the differential diagnosis of various types of cystitis. Key terms include “Cystitis glandularis,” “Cystitis,” “Interstitial cystitis,” and “Chronic cystitis.” Cystitis is a common form of UTI. While glandular cystitis and interstitial cystitis share similar symptoms with ordinary cystitis and UTIs, they differ in nature, etiology, clinical significance, and management.
Examples of records using these keywords: Is it possible for a pathological examination of glandular cystitis to be misdiagnosed? Is it necessary to get a second opinion at another hospital? Hello, my condition is a bit complicated. I suspect I have interstitial cystitis. Frequent urination, discomfort, and pubic/lower abdominal pain—this feels like interstitial cystitis.
Topic 4: Etiology related to sexual activity
Accounting for 6.59% of the discussions (1217/18,479), this topic explores the etiological links between sexual activity and UTIs. Key terms include “Sexual activity,” “Masturbation,” “Erection,” “Ejaculation,” “Semen,” “Penis,” and “Frequent.” Posters in this category attribute UTIs primarily to sexual activity.
Examples incorporating these keywords:
Unprotected sexual activity (including condomless intercourse and oral sex) … led to frequent urination, incomplete emptying, and urgency.
Two days after masturbating, combined with eating too much sugarcane, I started having frequent urination the next day. Then, at night, I had severe night sweats, couldn’t sleep, couldn’t eat, and felt weak all over.
Two days ago, after excessive sexual activity, I developed itching in the urethra during urination.
Topic 5: Nocturnal symptoms and fever
This topic constitutes 12.10% of the discussions (2236/18,479), focusing on patients’ nocturnal symptoms and fever. Key terms include “Fever,” “Nocturia,” “Sleep,” and “Midnight,” which appear frequently in descriptions of discomfort, indicating significant distress. Nocturnal symptoms include worsening conditions at night, frequent nighttime awakenings, and insomnia. Examples of consultation records: My buttocks and lower abdomen hurt nonstop, making it impossible to sleep. I’m extremely exhausted and have had insomnia for days. Drinking more water leads to frequent urination, with nighttime awakenings every two hours. There's a constant dull pain. Recurrent high fever for three days, accompanied by frequent urination and mild lower abdominal pain.
Topic 6: Perinatal considerations
Comprising 1.40% of the discussions (259/18,479), this topic reflects the distinct concerns of perinatal women regarding UTIs. Key terms include “Pregnancy,” “Infant,” “Preconception care,” and “Postpartum.”
Examples of sentences containing these keywords: Doctor! Hello! I had in vitro fertilization and am currently around 6 weeks pregnant. I’ve had frequent urination with small volumes since the embryo transfer, and it's gotten worse lately. During pregnancy, I have a UTI, with lower back and leg pain. My urinary tract gets infected very easily. What medications are safe to take for a UTI during preconception? I have urine occult blood and lower back pain. Have symptoms of frequent urination. I got a urinary tract infection while I was pregnant and I never took my medicine. The diagnosis is now cystitis.
Sentiment analysis of the UTIs topics
Figure 5 presents the proportions of positive, neutral, and negative consultation records under each topic. Across all six topics, negative records predominate, while neutral and positive records account for much smaller proportions.

Figure 5. Topic emotional analysis: comparative distribution of positive, neutral, and negative emotions across six consultation topics.
Specifically, the proportions of positive, neutral, and negative records in the four topics, namely “Diagnosis, Symptoms, and Management Challenges,” “Differential Diagnosis of Cystitis,” “Nocturnal Symptoms and Fever,” and “Perinatal Considerations,” are relatively close. Among all six topics, the “Etiology Related to Sexual Activity” topic exhibits the highest proportion of negative records, reaching as high as 96.88%. In contrast, the “Polite Expressions for Consultation” topic has the lowest proportion of negative records at 82.32% and the highest proportion of positive records at 9.03%.
Discussion
Online doctor–patient communication is a core component of internet-based healthcare: physicians and patients remain the main subjects of medical services, while new technologies extend the scenarios and boundaries of healthcare delivery. Because internet-based healthcare is still at an early stage of development, research on online doctor–patient interactions remains limited. Against this backdrop, our findings offer insight into the concerns and challenges faced by patients with UTIs. The analysis of discussions within online health communities revealed several key points with implications for patient care, education, and support.
Our analysis revealed that patients commonly adopt polite and respectful language when seeking online consultations. This communication style aligns with established frameworks for effective doctor–patient interaction, such as the Pendleton model and the Calgary-Cambridge guidelines, which emphasize empathy, rapport-building, and mutual respect. Moreover, prior research has shown that users’ positive attitudes toward online platforms significantly enhance their willingness to use such services (P < .01).8 Patients’ polite expressions may thus reflect both trust in and appreciation for the online consultation process, potentially eliciting more positive and empathetic responses from physicians. This may explain why the topic “Polite Expressions for Consultation” showed the highest proportion of positive patient emotions.
The present study found that patients’ primary concerns regarding UTIs center on interpreting test results and making sense of complex symptoms. These concerns stem partly from the variability of UTI symptoms across populations and partly from the limitations of conventional diagnostic methods. Urine culture remains the diagnostic gold standard, but it has limitations. For example, significant bacterial growth in culture does not necessarily indicate an active infection, particularly in asymptomatic bacteriuria, which is common among elderly women. Accurate diagnosis therefore requires laboratory findings to be interpreted in light of patient-reported symptoms.48 Our findings also show that patients frequently mention nocturia, particularly in relation to sleep disturbances. This symptom is likely linked to bacterial irritation of the bladder mucosa, which induces hypersensitivity and abnormal contractility, thereby diminishing the patient's ability to perceive bladder fullness and leading to frequent nighttime urination. The relationship appears to be bidirectional, as fragmented sleep may in turn exacerbate nocturia.49 Notably, this connection has been found to persist even after adjusting for factors such as BMI and diabetes.50 Elevated body temperature is another concern that deserves attention, especially when accompanied by chills and flank discomfort. These are typical symptoms of upper urinary tract infections, such as pyelonephritis, and are common in people with weakened immune systems.51 Overall, our data underscore fundamental challenges in UTI diagnosis: the subjectivity of symptoms and the necessity of laboratory confirmation. Consistent with clinical guidelines, diagnosis cannot rely solely on symptoms or on isolated test results.
Patients often struggle to interpret discrepant findings, such as asymptomatic positive cultures, suggesting that online health platforms should enhance patient education. Providing clear explanations of diagnostic procedures, the significance of culture results, and the differences between asymptomatic bacteriuria and symptomatic infection may help alleviate patient anxiety and improve understanding.
Our data also reveal a worrisome phenomenon: patients frequently mention antibiotics without adequate understanding, suggesting that the public generally regards them as a universal remedy for all urinary tract discomfort. This cognitive bias may stem from two factors: first, the long-standing overuse of antibiotics; second, the prescription pressure that patients exert on doctors. Studies have confirmed that self-medication with antibiotics is commonplace and fraught with risks, not only potentially delaying proper treatment but also worsening the escalating problem of antimicrobial resistance.52–55 In this context, the role of online doctors extends beyond providing information; they must also take on a crucial health education role. They should proactively correct patients’ erroneous beliefs, emphasize the importance of targeted therapy when indicated by urine culture and susceptibility results, and firmly discourage self-medication. This educational function is essential for the deep integration of online consultations with antimicrobial stewardship principles.
International experience offers useful reference points. Registry data from Sweden 56 show that antibiotic prescribing in digital primary care is markedly lower than in offline clinics (for urinary tract infections, the difference in prescribing rates can reach 34–41 percentage points). This finding suggests that digital healthcare can substitute for traditional service models without increasing the risk of antibiotic misuse. The study also identifies two core problems in doctor–patient communication records: first, there is widespread confusion among patients about antibiotic use (e.g., duration of therapy, handling of side effects); second, clinicians fail to convey prescribing guidelines adequately. Similarly, an analysis of electronic prescriptions for urinary tract infections in Saudi Arabia during the pandemic yields a comparable warning. The study 57 found that antibiotic prescribing accounted for 32.1%, with higher prescribing among men, children, and patients with urogenital disorders. Even more concerning, “primary adherence” was very low (only 35.5% of prescriptions were redeemed), which was attributed primarily to “short prescription validity” and “poor patient awareness.” This phenomenon underscores the urgent need to strengthen patient education.
Our analysis also revealed frequent confusion among patients regarding different subtypes of cystitis. While acute cystitis is highly prevalent among women,58–60 noninfectious etiologies such as glandular cystitis or interstitial cystitis (IC) must also be considered. Glandular cystitis involves metaplastic changes in the bladder lining, often triggered by chronic inflammation, whereas IC, or bladder pain syndrome (BPS), is characterized by chronic pelvic pain, urgency, and frequency in the absence of infection.61,62 Despite their differing pathophysiology, the similarity in symptoms between these conditions complicates differential diagnosis,63 underscoring the importance of accurate interpretation in both online and offline clinical settings.
The analysis of consultation timing within our data suggests potential delays in seeking professional care, with patients often attempting self-management first. This delay can be particularly risky for upper UTIs such as pyelonephritis, where prompt treatment is essential. Online consultations, while potentially adding a step, can serve as a funnel that expedites care by triaging these cases effectively. Our data also reveal concerns and vulnerabilities specific to particular demographic groups. A notable advantage of online consultations is the degree of anonymity they offer, which appears to reduce patients’ anxiety and increase their willingness to disclose sensitive information. This is especially evident in discussions around sexual activity, a topic often stigmatized in face-to-face settings. The reduced visibility of social cues (e.g., facial expression, tone) may shield patients from judgment, thereby encouraging more open communication.64,65 Sexually active young women are at higher risk, largely because of the frequency of intercourse and their anatomy: a shorter urethra located close to the anus makes it easier for bacteria to reach the urinary tract.66,67 In our study, patients appeared more willing to disclose risky sexual behavior, a well-known contributor to UTIs.68 UTIs are also more common among pregnant women. Hormonal changes and physical compression of the urinary tract slow urine flow and promote bacterial growth. If untreated, bacteriuria in pregnancy may lead to serious complications such as pyelonephritis or preterm birth.69–72 This risk is further elevated in those with gestational diabetes mellitus (GDM), where insulin resistance may impair immune responses. In parallel, our findings show that pregnant women express substantial concerns about medication use.
Consistent with earlier studies, many believe that all medications may pose risks to the fetus, leading over one-third to avoid prescriptions altogether and more than half to self-modify medication regimens.73,74 This hesitation was echoed in our dataset, where perinatal patients were often reluctant to inquire about pharmaceutical options, highlighting the need for better education and reassurance in these scenarios.
Sentiment analysis shows that, across all topics, patients are in a predominantly negative emotional state. This negativity has several sources. First, the persistent physical discomfort and pain caused by UTI symptoms such as frequent urination, urgency, and painful urination take a considerable toll. Second, chronic discomfort and worry about recurrence bring emotional distress, including anxiety, low mood, and a sense of helplessness. Together, these reduce overall quality of life, affecting physical functioning, mental well-being, social life, and the ability to carry out daily routines. In addition, repeated treatment failures or recurrent infections can lead to intense frustration and a sense of losing control. Finally, the potential stigma attached to the illness or its symptoms, such as urinary urgency or incontinence, can cause people to withdraw socially. These emotional findings are consistent with previous studies.5,75,76 Given the complex interactions among these factors, a comprehensive intervention model that integrates physical treatment and psychological support is vital in the clinical management of UTIs.
Limitations
This study has several limitations that warrant consideration when interpreting the results. The research design focuses exclusively on patient-initiated questions and therefore excludes the dynamics of physician–patient interaction, although this approach does capture patients’ most spontaneous and unguided expressions of healthcare needs. Current research on online healthcare often prioritizes the physician's perspective; consequently, patient-driven inquiries serve as crucial indicators of unmet medical demands. Future investigations should adopt more comprehensive frameworks that also cover physicians’ responses. Elderly patients are likely to be underrepresented, because their potentially lower engagement with social media platforms may limit their presence in the data. Furthermore, participation bias is likely, as individuals actively posting about UTIs on social media may disproportionately represent those experiencing more severe symptoms, complex treatment courses, or particularly burdensome healthcare challenges compared to nonposting patients. In addition, the LLM employed in this study has inherent limitations. First, its operational mechanism is essentially a “black box,” with internal reasoning processes that lack transparency and make full verification challenging. Second, as a general-purpose model trained on vast internet-scale data, it may carry forward biases present in its training data, which could influence outcomes in tasks like sentiment classification. To address this, we incorporated a manual verification step, and we recognize the need to adopt more automated verification frameworks, such as Retrieval-Augmented Generation (RAG), in future work. This approach aims to enhance the verifiability of the model's outputs and establish a more robust verification process.
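The text does not state how the manual verification of the LLM's sentiment labels was scored. One common option, offered here purely as an illustration and not as the authors' procedure, is to compare model labels with human labels on a checked subsample using raw agreement and Cohen's kappa. The function name `cohen_kappa` and the example labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")

    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n**2

    if expected == 1:
        # Both raters used one identical label throughout; treat as perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical subsample: model labels vs. manual labels (not study data).
model_labels = ["negative", "negative", "negative", "positive"]
manual_labels = ["negative", "negative", "neutral", "positive"]

kappa = cohen_kappa(model_labels, manual_labels)
```

Kappa corrects raw agreement for chance, which matters when one class dominates, as negative records do in this dataset.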
Conclusions
By analyzing online doctor–patient communications, this study investigates the concerns and challenges faced by patients with UTIs. It finds that patients commonly struggle with interpreting diagnoses, using antibiotics, and managing anxieties particular to different population groups, underscoring the need for more proactive, educational, and guidance-oriented approaches in online consultations. The findings suggest that incorporating clinical decision support tools, guiding clinicians to address key knowledge points, providing standardized test interpretation information, and identifying potential risk factors based on patients’ reported symptoms could improve the quality and safety of online consultation platforms.
Moreover, this study not only systematically maps patients’ concerns but also offers practical insights for clinical practice in the digital health arena. Future research should prioritize assessing how such integrated educational interventions affect patient outcomes, antibiotic prescribing rates, and the management of common infections via remote healthcare, thereby supplying internet-based platforms and policymakers with a clearer understanding of patient needs and a basis for designing targeted intervention strategies.
Supplemental Material
sj-docx-1-dhj-10.1177_20552076251393289 - Supplemental material for Personalized insights into urinary tract infection management: A text mining analysis of online consultation data
Supplemental material, sj-docx-1-dhj-10.1177_20552076251393289 for Personalized insights into urinary tract infection management: A text mining analysis of online consultation data by Ruijie Tang, Peiqi Zhu, Ruxue Yan, Yaping Zhou, Zhian Tang and Weiming He in DIGITAL HEALTH
Supplemental Material
sj-docx-2-dhj-10.1177_20552076251393289 - Supplemental material for Personalized insights into urinary tract infection management: A text mining analysis of online consultation data
Supplemental material, sj-docx-2-dhj-10.1177_20552076251393289 for Personalized insights into urinary tract infection management: A text mining analysis of online consultation data by Ruijie Tang, Peiqi Zhu, Ruxue Yan, Yaping Zhou, Zhian Tang and Weiming He in DIGITAL HEALTH
Footnotes
Acknowledgments
We thank the database provider, maintenance team, and all reviewers for their valuable contributions.
Ethical approval
This study was approved by the Ethics Committee of Jiangsu Province Hospital of Chinese Medicine. Since the observed data were obtained directly from the public platform, the requirement for participant written informed consent was waived by the Ethics Committee of Jiangsu Province Hospital of Chinese Medicine (Approval No. 2024NL-025-01). Our study was conducted in accordance with the principles outlined in the Declaration of Helsinki.
Contributorship
RT led the conceptualization of the study, with PZ contributing equally. RY and YZ were responsible for data curation and investigation. RT conducted the formal analysis, supported by PZ. WH and ZT acquired funding and provided resources. Project administration was led by RT and PZ, with equal participation from YZ and RY, and supporting roles from WH and ZT. Supervision and validation were jointly performed by RT and PZ. RT carried out the visualization. The original draft was led by RT and PZ, with supporting input from WH and ZT. All authors participated in reviewing and editing the manuscript, with RT and PZ taking the lead.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China (grant number 82575032), the Jiangsu Provincial Medical Innovation Center (grant number 202215), and the Postgraduate Research and Practice Innovation Program of Jiangsu Province (grant number SJCX24_0954).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Guarantor
WH.
Supplemental material
Supplemental material for this article is available online.
References
