Abstract
Objective
This study explored the capabilities of the large language models (LLMs) GPT-3.5, GPT-4, and Llama 3 to summarize qualitative data from an online brain tumor support forum, comparing their outputs with traditional thematic analysis.
Methods
Eight posts and responses were collected in September 2024 from the American Brain Tumor Association Brain Tumor Support Group, using the passive/unobtrusive method. The data were analyzed using two methods: (1) traditional thematic coding with Dedoose software and (2) summarization and interpretation using LLMs. Prompts guided the LLMs in generating summaries and identifying key challenges, with results evaluated using the metrics BLEU, ROUGE-1, ROUGE-2, ROUGE-L, METEOR, and BERTScore (f1). Flesch-Kincaid grade levels and readability ease scores were also calculated and compared.
Results
GPT-4 demonstrated superior performance across ROUGE and METEOR metrics, outperforming GPT-3.5 and Llama 3. Semantic similarity scores were comparable across models. GPT-4's capacity to process entire transcripts increased efficiency, while GPT-3.5 and Llama 3 required data segmenting. Summaries produced by LLMs aligned closely with human-generated thematic analysis, with significant reductions in time and labor.
Conclusion
LLMs, particularly GPT-4, show strong potential for summarizing complex, qualitative health data, offering time-efficient and consistent outputs. These tools may enhance research efficiency and provide support in patient-centered environments. However, challenges such as training data biases and capacity limitations in some models warrant further investigation.
Introduction
The rapid evolution of large language models (LLMs) has revolutionized the processing and analysis of textual information. LLMs such as GPT-3.5, 1 GPT-4, 2 and Llama 3 3 have shown a remarkable ability to summarize and interpret complex data without specific training. In its early stages, artificial intelligence operated using supervised training on specific datasets, which limited its ability to respond to novel information. More recent LLMs possess zero-shot capability, the ability to perform specified tasks without prior task-specific training. 4 This allows LLMs to adapt quickly with relatively little supervision, reducing user burden. Zero-shot capability can be instrumental in healthcare, a field that typically generates large volumes of wide-ranging, complex, and often nuanced data.
Research indicates that LLMs can perform tasks such as medical summarization, patient education, and decision support, which highlights their relevance in time-sensitive, fast-paced settings.5–8 Recent studies illustrate the potential of LLMs to assist in additional patient-focused tasks, including patient feedback analysis and qualitative interview summarization.9–11 Models such as GPT-3.5 and GPT-4 have also shown success in summarizing radiology reports, generating discharge instructions, and processing large sets of patient data. 12 These models have demonstrated strong performance across healthcare tasks by interpreting medical text and preserving key points without introducing false or conflicting information. Llama 3 also exhibits considerable effectiveness in clinical summarization, competing closely with models like GPT-4 and proving to be a suitable alternative for research in medical language synthesis and generation. 13
LLMs have the potential to offer significant benefits in environments such as online community forums, where patients and caregivers share experiences, seek support, and discuss treatment-related challenges. Such forums provide rich qualitative data, including patient experiences, caregiver challenges, and general sentiments regarding treatment, symptoms, and support systems.14,15 While manual qualitative analysis is a valued approach for analyzing textual data for patterns and insights, it is time and labor-intensive. 16 With the zero-shot capability of LLMs, it is possible to generate summaries and themes rapidly without prior training on a specific dataset. Although research has highlighted LLMs' effectiveness in clinical summaries, little is known about implementing such tools in patient support settings. Evaluating LLMs in the context of online patient and caregiver forums could offer an advanced understanding of their capabilities and limitations in handling emotional, subjective, and often nuanced discussions. LLMs such as GPT-3.5, GPT-4, and Llama 3 require further evaluation to determine their effectiveness and reliability in this more novel domain.17,18
This study explored the capabilities of the widely used LLMs GPT-3.5, GPT-4, and Llama 3 for summarizing data from an online health forum. By comparing these models with traditional thematic analysis, this study investigated the differences between model results, providing insights into the future use of LLMs for analyzing health-related online discussion forums. This research is essential as the role of LLMs within the healthcare realm continues to expand, with a strong potential to impact healthcare decision-making and patient and caregiver support.
Methods
Data
The data were collected in September 2024 from the American Brain Tumor Association (ABTA) Brain Tumor Support Group and Discussion Community. This forum was selected due to its active user base and its focus on brain tumor-specific challenges. We used data from an online brain tumor support forum because it represents a discussion that captures the everyday lived experiences, needs, and informational gaps of patients and informal caregivers. Furthermore, using LLMs to summarize and interpret an online discussion about brain tumors can test the models' abilities to handle a sensitive topic that has a range of medical and emotional contexts. One researcher (MLF) extracted the eight most recent initial posts and up to six responses to each initial post by multiple discussion community members (21 pages of double-spaced text). This extraction followed the passive/unobtrusive approach, in which researchers do not actively engage with the conversation participants. 19 Any names that appeared in the posts were removed for confidentiality purposes.
Mathematical relations
LLMs share underlying mathematical structures based on transformer architectures. These rely on self-attention mechanisms, which compute how strongly each token should attend to every other token in the sequence, with positional embeddings encoding word order. Layer normalization and residual connections stabilize gradient flow and improve training efficiency, and probability-based token prediction using the softmax function determines the likelihood of each candidate word in the generated text. While each LLM differs in scale, optimization strategy, and fine-tuning data, all utilize transformer-based sequence modeling as their foundational framework. 20
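The mechanisms above can be sketched in a few lines of NumPy. This is a minimal, illustrative single-head example rather than the implementation of any particular model; the sequence length, dimensions, and random weights are arbitrary stand-ins:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a token sequence X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # Attention weights: how much each token attends to every other token
    A = softmax(Q @ K.T / np.sqrt(d_k))
    return A @ V

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))  # token embeddings (positions assumed added)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one contextualized vector per token

# Next-token prediction: a linear layer projects the final hidden state
# onto a (toy) vocabulary, and softmax turns logits into probabilities.
vocab_size = 10
W_vocab = rng.normal(size=(d_model, vocab_size))
probs = softmax(out[-1] @ W_vocab)
print(round(float(probs.sum()), 6))  # 1.0: probabilities sum to one
```

Production models stack many such attention layers with layer normalization and residual connections around each, but the attention and softmax steps shown here are the shared core.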
Data analysis
The research team analyzed the data using two different methods for comparison. In the first method, two researchers (CM and a graduate assistant for this portion only (see Acknowledgments)) used Dedoose to conduct a codebook thematic analysis, developing a summary of the data content and tone and the top four challenges expressed by forum participants. The researchers discussed any disagreements until they reached agreement. This method was chosen because it is a widely used, systematic method for analyzing qualitative data, 21 and its results (paragraph summary and list of challenges) could be used as a "reference" for evaluating the LLM outputs.
In the second method, three researchers (MCH, CM, and MLF) used three different LLMs (GPT-3.5, GPT-4, and Llama 3) to summarize and interpret the same discussion forum transcript analyzed by the first method. For GPT-4, the entirety of the transcript could be uploaded with the prompts “Summarize the content and tone of the following 8 posts and responses in one paragraph. Summarize the top 4 challenges (in no particular order) expressed in the following 8 posts and responses as an itemized list.” GPT-3.5 required that the transcripts be broken into three parts, and Llama 3 required that the transcripts be broken into two parts due to reduced capacity. The prompts for GPT-3.5 and Llama 3 were: “Summarize the content and tone of the last two/three submissions in one paragraph. Summarize the top 4 challenges (in no particular order) expressed in the posts and responses as an itemized list.” Figure 1 illustrates the LLM evaluation process.

Figure 1. Flowchart of the models and evaluation process.
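The transcript segmentation required for GPT-3.5 and Llama 3 can be illustrated with a simple greedy chunking helper. This is a hypothetical sketch: model capacity is actually measured in tokens, so the `max_chars` limit here is only a character-count proxy for a context window:

```python
def chunk_transcript(posts, max_chars):
    """Greedily pack whole posts into chunks no longer than max_chars,
    so each model receives as much contiguous context as it can hold.
    Posts are never split, preserving each post-and-response unit."""
    chunks, current, size = [], [], 0
    for post in posts:
        # Start a new chunk if adding this post would exceed the limit
        if current and size + len(post) > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(post)
        size += len(post)
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Demo with dummy posts; the 600-character limit is purely illustrative
posts = [f"Post {i}: " + "text " * 40 for i in range(8)]
chunks = chunk_transcript(posts, max_chars=600)
print(len(chunks))  # 4: two ~208-character posts fit per chunk
```

A model that can hold the whole transcript (as GPT-4 did here) would receive a single chunk, while smaller-capacity models receive several, each summarized separately.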
In our study, the LLMs were used to process and interpret the unstructured text data of an online brain tumor support forum. The methods were evaluated using widely adopted metrics for assessing the quality of text generation and summarization: BERTScore (f1), 22 BLEU, 23 cosine similarity, 24 Jaccard similarity, 25 METEOR, 26 ROUGE-1, ROUGE-2, 27 and ROUGE-L. 28 The output range for these metrics is 0 to 1, with 1 indicating an exact match to the reference. Table 1 presents a description of these metrics. Readability was assessed using the Flesch reading ease score and the Flesch–Kincaid grade level score. The Flesch reading ease score evaluates the readability of English text by assigning a value between 0 and 100, where higher scores indicate easier readability; the formula considers the average sentence length and the average number of syllables per word. The Flesch–Kincaid grade level score translates the readability assessment into a U.S. school grade level, indicating the years of education required to comprehend the text. It uses a formula similar to the Flesch reading ease but with different weighting factors to produce a grade-level result. 29 These readability metrics are widely utilized and can assist in tailoring content to appropriate reading levels, thereby enhancing comprehension and engagement.
Table 1. Descriptions of the evaluation metrics.
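As an illustration of how two of these measures work, the following simplified functions compute a unigram-overlap ROUGE-1 F1 score and the two standard Flesch formulas. These are pedagogical approximations, not the library implementations used in the study: the syllable counter in particular is a rough vowel-group heuristic, and published ROUGE implementations add stemming and other preprocessing:

```python
import re
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap ROUGE-1 F1 (simplified: whitespace tokens, no stemming)."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def count_syllables(word):
    # Crude approximation: count groups of consecutive vowels
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_scores(text):
    """Return (reading ease, Flesch-Kincaid grade level) using the standard weights."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences          # average words per sentence
    spw = syllables / len(words)          # average syllables per word
    ease = 206.835 - 1.015 * wps - 84.6 * spw
    grade = 0.39 * wps + 11.8 * spw - 15.59
    return ease, grade
```

For example, `rouge1_f1("the scan was clear", "the scan was clear")` returns 1.0, while short, monosyllabic sentences yield high reading ease and low grade-level scores.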
Researchers in our author group had no direct experience as brain tumor patients and had varying caregiving experience; one researcher had experience caring for a patient with a brain tumor.
Ethical considerations
To limit ethical concerns, this study only included data collected from online forums that were publicly accessible. Researchers had no contact with the forum participants. All user names and other in-text name use were removed from the data before analysis to anonymize those who participated in the online communities. The Institutional Review Board (IRB) at Northern Illinois University determined the study to be exempt from human subjects review in accordance with federal regulation criteria (Protocol # HS25-0048).
Results
Completing the codebook thematic analysis required 111 minutes of the combined researchers' time. Completing the same summarization and analysis of the transcripts took 1 minute for GPT-4, 2 minutes for GPT-3.5, and 6 minutes for Llama 3. Researchers noted feeling emotional, fatigued, and distracted by the subject matter when conducting the codebook thematic analysis; they avoided such feelings when conducting the LLM analyses. The comparison between traditional codebook thematic analysis and the outputs from the three LLMs (GPT-3.5, GPT-4, and Llama 3) revealed several common themes, as well as distinct differences in the identification of key challenges faced by patients and caregivers in an online brain tumor support community. All four approaches identified treatment-related uncertainty, fear and anxiety, and the dismissal or disbelief of symptoms by family and healthcare providers as primary concerns. However, differences emerged in the specificity and emphasis of themes. Codebook thematic analysis highlighted the need for coping strategies and additional resources, while the LLMs placed greater focus on functional rehabilitation, cognitive and physical impairments, emotional strain, and decisions surrounding end-of-life care. Although the LLM summaries contained few glaring errors, omissions, or oversimplifications, human analysis was more adept at identifying complex emotions, coping strategies, and higher-order reasoning, such as recognizing unmet patient and caregiver needs. Table 2 presents the four major themes, and Supplemental Table 1 presents the summary paragraph from each analysis. Additionally, Table 3 compares the output of each model across several aspects. Note that GPT-3.5 was more reactive to input-data sensitivity and prompt phrasing, while Llama 3 and GPT-4 were less so, with GPT-4 demonstrating the greatest resiliency to input-data and prompt variability.
Table 2. Top challenges identified by codebook thematic analysis and the large language models.
Table 3. LLM comparison.
Figure 2 presents a comparative evaluation of the LLM-generated summaries against the human-generated reference summary. ROUGE metrics measure the overlap of unigrams, bigrams, and the longest common subsequences between the candidate and the reference text, and here GPT-4 demonstrated superior performance, surpassing GPT-3.5 by 0.08 and Llama 3 by 0.02. Unlike ROUGE metrics, METEOR extends the evaluation to include synonyms and stemmed versions of words, along with exact word matches. On this metric, GPT-4 again showed enhanced results, outperforming GPT-3.5 by 0.05 and Llama 3 by 0.06. For cosine similarity, which measures the angle between text vectors, GPT-4 achieved scores 0.20 points higher than GPT-3.5 and 0.03 points higher than Llama 3. Similarly, on the Jaccard similarity metric, which quantifies the overlap between word sets, GPT-4 outperformed GPT-3.5 by 0.06 and Llama 3 by 0.04. However, when semantic similarity was assessed through contextual embeddings with BERTScore, all tested LLMs exhibited nearly equivalent performance.

Figure 2. Performance of the different LLMs (GPT-3.5, Llama 3, and GPT-4) on the evaluation metrics. GPT-4 demonstrated superior performance on all metrics. (Note that BLEU scores for GPT-3.5 and Llama 3 were 0, so they do not appear here.)
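For reference, the cosine and Jaccard similarities discussed above can be computed over simple bag-of-words representations as follows. This is a simplified sketch; practical evaluations typically apply proper tokenization and TF-IDF or embedding-based vectors rather than raw term counts:

```python
import math
from collections import Counter

def jaccard_similarity(a, b):
    """Overlap of word sets: |A intersect B| / |A union B|."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

def cosine_similarity(a, b):
    """Cosine of the angle between term-frequency vectors of two texts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Both return values in [0, 1], matching the metric ranges reported in Table 1; cosine similarity rewards shared word frequencies, while Jaccard considers only set membership.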
Figures 3 and 4 present the Flesch–Kincaid grade level and Flesch reading ease scores. The readability analysis demonstrated notable differences in the complexity of outputs between the reference and LLM methods. The reference method had the lowest Flesch reading ease score (11.6) and the second highest Flesch–Kincaid grade level (15.9) after GPT-4, making its summary highly complex and advanced. Among the LLMs, GPT-4 generated summaries with the highest reading ease score (18.1) but also the highest grade level (17.6), compared with GPT-3.5 (reading ease: 13.8; grade level: 15.1) and Llama 3 (reading ease: 14.9; grade level: 14.7). Llama 3 produced the most accessible summaries among the LLMs, with a slightly lower grade-level complexity than GPT-3.5 and GPT-4. These results show that LLM text readability is comparable with human-generated summaries and that LLMs provide varying levels of accessibility, with Llama 3 balancing readability and detail most effectively.

Figure 3. Flesch–Kincaid grade level of the reference method and LLMs, where higher scores indicate a need for more advanced reading skills, according to U.S. educational standards.

Figure 4. Flesch reading ease scores for the reference method and LLMs, where higher scores indicate easier general readability.
Discussion
In this study, we found that LLMs can produce coherent and accurate summaries of a dataset in mere minutes, a task that took two researchers nearly 2 hours to accomplish. The LLMs developed summaries with Flesch–Kincaid grade level and Flesch reading ease scores at the college-graduate level, consistent with those of the traditional thematic analysis. Furthermore, LLMs were not burdened by human factors such as fatigue, distraction, and emotional responses, which affected researchers during the codebook thematic analysis.
The differences in labor and time between the reference and LLM methods highlight one of the key strengths of LLMs: their capacity to process and synthesize large volumes of text almost instantaneously. When LLMs are trained on large and diverse datasets, they can better generate quick and meaningful summaries. 30 The speed at which LLMs operate makes them particularly useful in environments where time is a valuable commodity, such as real-time data processing or when dealing with rapidly changing information. 31 By automating data summarization, researchers could dedicate their time and resources to more challenging and nuanced tasks that require human guidance, such as data interpretation and hypothesis generation. Regarding online health forums, this heightened summarization capability could translate into quicker access to valuable insights from vast amounts of peer-shared experiences and advice.
LLM summarization may also be superior to human summarization in mitigating the influence of emotional states and cognitive biases. Humans inevitably bring their personal perspectives and biases into data analysis despite great efforts to avoid doing so. 32 LLMs instead operate based on patterns in data rather than unique personal experiences or feelings. 33 This detachment from individual human emotion and bias may result in more objective and consistent outputs. For example, human summarization might prioritize certain aspects of a dataset based on personal interest or subconscious preferences. In contrast, LLM summarization does not carry the bias of just one or two people and is systematic, focusing on the most relevant or frequent information. 34 Furthermore, because human analysis takes longer, it is also more susceptible to errors driven by fatigue, stress, or emotion. 35 Once trained, LLMs tend to be consistent in their performance, and research has highlighted their reliability when performing repeated tasks. 36 LLMs therefore show promise as a tool for reducing human error and increasing the accuracy and efficiency of tasks like data summarization, particularly in high-volume and/or fast-paced environments.
The objectivity of LLMs is particularly significant in health and disease management, where decision-making often relies on synthesizing a large quantity of information under time-sensitive and emotionally charged conditions. In the context of managing chronic or life-threatening conditions like brain tumors, patients and caregivers may turn to discussion forums and online support communities. As our study suggests, LLMs can provide comprehensive summaries of these discussions, highlighting actionable trends or common challenges others face in similar situations. Such summaries could potentially help patients and informal caregivers quickly learn effective coping strategies and become aware of new treatment options or side effects, thus enhancing their ability to make informed decisions. Additionally, the consistency and efficiency of LLMs in handling large-scale data could assist healthcare professionals by summarizing patient discussions and monitoring patient-reported outcomes, prioritizing urgent cases, generating easy-to-understand educational materials, and developing individualized care plans tailored to patient needs. For researchers, LLMs could streamline qualitative data analysis, identify care gaps, and provide insights into clinical trials. Furthermore, caregivers could benefit from AI tools that offer personalized resources, real-time support, and decision-making guidance based on shared patient experiences. By mitigating the risks of human error and biases, LLMs show promise in improving the accuracy and timeliness of valuable information, ultimately supporting better outcomes in health and disease management.
When using LLMs, minor changes in input, such as altering prompt wording or segmentation strategies, can significantly impact summarization outputs. While GPT-4, with its larger context window, remains highly robust to input variations, it exhibits slight differences in phrasing when summaries are re-generated. In contrast, GPT-3.5 is more sensitive to prompt modifications, occasionally producing inconsistent outputs when given reworded instructions. 37 Llama 3, though less affected by minor prompt changes, struggles with longer, segmented inputs, often leading to repetition or content omission. These differences highlight the importance of standardized input structures, careful prompt engineering, and context-length considerations to ensure reliable and consistent model performance. 38
Limitations
Our study had some limitations. Although LLMs have many benefits, such as summarizing and synthesizing large volumes of information, their outputs depend greatly on the quality and diversity of the data on which they were trained. 39 The LLM-generated summaries closely aligned with human analyses; however, human reviewers were slightly more effective in identifying coping strategies and unmet needs, which were not explicitly captured by the LLMs. These findings suggest that while LLMs excel at thematic summarization, they may require further refinement to recognize and interpret emotional complexity within patient and caregiver discussions. Additionally, LLMs may unknowingly reinforce biases that exist within their training datasets, which could result in skewed outputs. Furthermore, only GPT-4 could analyze the transcripts in their entirety; for GPT-3.5 and Llama 3, transcripts had to be input in parts due to smaller capacity. This diminished ability to analyze the transcripts as a whole may have contributed to a lack of depth in the GPT-3.5 and Llama 3 analyses compared to GPT-4. Although segmenting added only several minutes relative to GPT-4, it nonetheless reduced the efficiency advantage of these two models.
It is important to note that while the Flesch–Kincaid readability score is useful for benchmarking readability, these scores may not fully capture the specific needs of patients and caregivers navigating complex diseases. For one, this readability score may not capture comprehensibility. When summarizing, the text may lack critical medical details or oversimplify, losing crucial contextual information. Additionally, readability metrics do not account for aspects like empathy, reassurance, or clarity in emotionally charged situations like those experienced by patients and caregivers. Finally, the Flesch–Kincaid readability score does not assess logical flow or medical accuracy, which is essential for effective health communication.
Future outlook
LLMs provide significant advantages in efficiency and objectivity over human summarization when analyzing health discussion forum data. LLMs' ability to summarize and interpret in minutes data that would typically require considerable human effort highlights their strong potential in healthcare, where large-scale data processing is vital. Future research into ways that healthcare providers can use LLM-summarized discussion forum data to identify emerging trends in patient and caregiver concerns, uncover significant gaps in care, and inform evidence-based improvements to treatment and support strategies would be helpful to the field. To better match human analysis, future development of LLMs could focus on refining their capacity for empathy and emotional intelligence. This would enable LLMs to recognize subtle emotional undertones and apply higher-order reasoning, such as identifying unmet needs based on the challenges expressed. Additionally, combining text-based data such as the forum data in this study with feature extraction from medical images such as MRIs, when available, could further enhance understanding of patient conditions and caregiver experiences.40,41 Furthermore, future considerations of how LLMs can help address the needs of underserved groups should be a priority area as this body of research expands. Although LLMs have great potential, extensive validation is necessary to ensure accuracy, minimize biases, and prevent misinterpretation or omission of critical details. This includes expert review by healthcare professionals, user testing for clarity and emotional appropriateness, bias assessments, and compliance with regulatory standards.
Conclusion
This study demonstrated that LLMs, particularly GPT-4, can accurately and efficiently summarize qualitative data from an online brain tumor discussion forum. Compared to traditional thematic analysis, LLMs produced summaries with similar themes in a fraction of the time, significantly reducing the labor burden associated with manual coding. All LLMs tested identified similar patient and caregiver concerns, with GPT-4 exhibiting superior performance in text generation metrics and readability. Additionally, LLM summaries were not influenced by emotional fatigue or cognitive biases, factors which are known to affect human analyses. However, limitations such as training data biases and model capacity constraints demonstrate areas for further improvement. These findings suggest that LLMs have strong potential for use in qualitative health research and patient-centered applications due to their ability to provide rapid insights from large volumes of text.
Supplemental Material
Supplemental material, sj-docx-1-dhj-10.1177_20552076251337345 for Exploring large language models for summarizing and interpreting an online brain tumor support forum by Christy Muasher-Kerwin, M Courtney Hughes, Michelle L Foster, Ibrahim Al Azher and Hamed Alhoori in DIGITAL HEALTH
Supplemental Material
Supplemental material, sj-docx-2-dhj-10.1177_20552076251337345 for Exploring large language models for summarizing and interpreting an online brain tumor support forum by Christy Muasher-Kerwin, M Courtney Hughes, Michelle L Foster, Ibrahim Al Azher and Hamed Alhoori in DIGITAL HEALTH
Footnotes
Acknowledgments
The authors would like to thank Samantha M. Econie for her assistance in analyzing the data.
Ethical considerations
The Institutional Review Board at Northern Illinois University determined the study to be exempt from human subjects review in accordance with federal regulation criteria.
Author contributions
Researchers CM, MCH, and MLF were involved in conceptualization, methodology, formal analysis, investigation, and writing. CM drafted the original draft, and MCH conducted a critical revision of the manuscript. MLF was responsible for data curation, while researchers IA and HA were responsible for software, formal analysis, and writing. MCH was involved in project administration and supervision of the research project and team.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability statement
The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request.
Guarantor
MCH
Supplemental material
Supplemental material for this article is available online.
References