Abstract
Objective
This study explored the capabilities of the large language models (LLMs) GPT-3.5, GPT-4, and Llama 3 to summarize qualitative data from an online brain tumor support forum, comparing their outputs with traditional thematic analysis.
Methods
Eight posts and responses were collected in September 2024 from the American Brain Tumor Association Brain Tumor Support Group, using the passive/unobtrusive method. The data were analyzed using two methods: (1) traditional thematic coding with Dedoose software and (2) summarization and interpretation using LLMs. Prompts guided the LLMs in generating summaries and identifying key challenges, with results evaluated using the metrics BLEU, ROUGE-1, ROUGE-2, ROUGE-L, METEOR, and BERTScore (f1). Flesch-Kincaid grade levels and readability ease scores were also calculated and compared.
Results
GPT-4 demonstrated superior performance across ROUGE and METEOR metrics, outperforming GPT-3.5 and Llama 3. Semantic similarity scores were comparable across models. GPT-4's capacity to process entire transcripts increased efficiency, while GPT-3.5 and Llama 3 required data segmenting. Summaries produced by LLMs aligned closely with human-generated thematic analysis, with significant reductions in time and labor.
Conclusion
LLMs, particularly GPT-4, show strong potential for summarizing complex, qualitative health data, offering time-efficient and consistent outputs. These tools may enhance research efficiency and provide support in patient-centered environments. However, challenges such as training data biases and capacity limitations in some models warrant further investigation.
Introduction
The rapid evolution of large language models (LLMs) has revolutionized the processing and analysis of textual information. LLMs such as GPT-3.5, 1 GPT-4, 2 and Llama 3 3 have shown a remarkable ability to summarize and interpret complex data without specific training. In its early stages, artificial intelligence operated using supervised training on specific datasets, which limited its ability to respond to novel information. More recent LLMs possess zero-shot capability, the ability to perform specified tasks without prior task-specific training. 4 This allows LLMs to adapt quickly with relatively little supervision, reducing user burden. Zero-shot capability can be instrumental in healthcare, a field that typically generates large volumes of wide-ranging, complex, and often nuanced data.
Research indicates that LLMs can perform tasks such as medical summarization, patient education, and decision support, which highlights their relevance in time-sensitive, fast-paced settings.5–8 Recent studies illustrate the potential of LLMs to assist in additional patient-focused tasks, including patient feedback analysis and qualitative interview summarization.9–11 Models such as GPT-3.5 and GPT-4 have also shown success in summarizing radiology reports, generating discharge instructions, and processing large sets of patient data. 12 These models have demonstrated strong performance across healthcare tasks by interpreting medical text and preserving key points without introducing false or conflicting information. Llama 3 also exhibits considerable effectiveness in clinical summarization, competing closely with models like GPT-4 and proving to be a suitable alternative for research in medical language synthesis and generation. 13
LLMs have the potential to offer significant benefits in environments such as online community forums, where patients and caregivers share experiences, seek support, and discuss treatment-related challenges. Such forums provide rich qualitative data, including patient experiences, caregiver challenges, and general sentiments regarding treatment, symptoms, and support systems.14,15 While manual qualitative analysis is a valued approach for analyzing textual data for patterns and insights, it is time and labor-intensive. 16 With the zero-shot capability of LLMs, it is possible to generate summaries and themes rapidly without prior training on a specific dataset. Although research has highlighted LLMs' effectiveness in clinical summaries, little is known about implementing such tools in patient support settings. Evaluating LLMs in the context of online patient and caregiver forums could offer an advanced understanding of their capabilities and limitations in handling emotional, subjective, and often nuanced discussions. LLMs such as GPT-3.5, GPT-4, and Llama 3 require further evaluation to determine their effectiveness and reliability in this more novel domain.17,18
This study explored the capabilities of the widely used LLMs GPT-3.5, GPT-4, and Llama 3 for summarizing data from an online health forum. By comparing these models with traditional thematic analysis, this study investigated the differences between model results, providing insights into the future use of LLMs for analyzing health-related online discussion forums. This research is essential as the role of LLMs within the healthcare realm continues to expand, with a strong potential to impact healthcare decision-making and patient and caregiver support.
Methods
Data
The data were collected in September 2024 from the American Brain Tumor Association (ABTA) Brain Tumor Support Group and Discussion Community. This forum was selected due to its active user base and its focus on brain tumor-specific challenges. We used data from an online brain tumor support forum because it represents a discussion that captures the everyday lived experiences, needs, and informational gaps of patients and informal caregivers. Furthermore, using LLMs to summarize and interpret an online discussion about brain tumors can test the models' abilities to handle a sensitive topic that has a range of medical and emotional contexts. One researcher (MLF) extracted the eight most recent initial posts and up to six responses to each initial post by multiple discussion community members (21 pages of double-spaced text). This extraction followed the passive/unobtrusive approach, in which researchers do not actively engage with the conversation participants. 19 Any names that appeared in the posts were removed for confidentiality purposes.
Mathematical relations
LLMs share underlying mathematical structures based on transformer architectures. These rely on self-attention mechanisms, which compute how strongly each token should attend to every other token in the sequence, with positional embeddings encoding word order. Layer normalization and residual connections stabilize gradient flow and improve training efficiency, and probability-based token prediction using the softmax function determines the likelihood of each candidate word in the generated text. While each LLM differs in scale, optimization strategy, and fine-tuning data, all utilize transformer-based sequence modeling as their foundational framework. 20
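The mechanisms above can be sketched in a few lines of NumPy. This is a minimal, illustrative single-head example rather than the implementation of any particular model; the sequence length, dimensions, and random weights are arbitrary stand-ins:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a token sequence X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # Attention weights: how much each token attends to every other token
    A = softmax(Q @ K.T / np.sqrt(d_k))
    return A @ V

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))  # token embeddings (positions assumed added)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one contextualized vector per token

# Next-token prediction: a linear layer projects the final hidden state
# onto a (toy) vocabulary, and softmax turns logits into probabilities.
vocab_size = 10
W_vocab = rng.normal(size=(d_model, vocab_size))
probs = softmax(out[-1] @ W_vocab)
print(round(float(probs.sum()), 6))  # 1.0: probabilities sum to one
```

Production models stack many such attention layers with layer normalization and residual connections around each, but the attention and softmax steps shown here are the shared core.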
Data analysis
The research team analyzed the data using two different methods for comparison. In the first method, two researchers (CM and a graduate assistant for this portion only (see Acknowledgments)) used Dedoose to conduct a codebook thematic analysis, developing a summary of the data content and tone and the top four challenges expressed by forum participants. The researchers discussed any disagreements until they reached agreement. This method was chosen because it is a widely used, systematic method for analyzing qualitative data, 21 and its results (paragraph summary and list of challenges) could be used as a "reference" for evaluating the LLM outputs.
In the second method, three researchers (MCH, CM, and MLF) used three different LLMs (GPT-3.5, GPT-4, and Llama 3) to summarize and interpret the same discussion forum transcript analyzed by the first method. For GPT-4, the entirety of the transcript could be uploaded with the prompts “Summarize the content and tone of the following 8 posts and responses in one paragraph. Summarize the top 4 challenges (in no particular order) expressed in the following 8 posts and responses as an itemized list.” GPT-3.5 required that the transcripts be broken into three parts, and Llama 3 required that the transcripts be broken into two parts due to reduced capacity. The prompts for GPT-3.5 and Llama 3 were: “Summarize the content and tone of the last two/three submissions in one paragraph. Summarize the top 4 challenges (in no particular order) expressed in the posts and responses as an itemized list.” Figure 1 illustrates the LLM evaluation process.

Figure 1. Flowchart of the models and evaluation process.
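The transcript segmentation required for GPT-3.5 and Llama 3 can be illustrated with a simple greedy chunking helper. This is a hypothetical sketch: model capacity is actually measured in tokens, so the `max_chars` limit here is only a character-count proxy for a context window:

```python
def chunk_transcript(posts, max_chars):
    """Greedily pack whole posts into chunks no longer than max_chars,
    so each model receives as much contiguous context as it can hold.
    Posts are never split, preserving each post-and-response unit."""
    chunks, current, size = [], [], 0
    for post in posts:
        # Start a new chunk if adding this post would exceed the limit
        if current and size + len(post) > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(post)
        size += len(post)
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Demo with dummy posts; the 600-character limit is purely illustrative
posts = [f"Post {i}: " + "text " * 40 for i in range(8)]
chunks = chunk_transcript(posts, max_chars=600)
print(len(chunks))  # 4: two ~208-character posts fit per chunk
```

A model that can hold the whole transcript (as GPT-4 did here) would receive a single chunk, while smaller-capacity models receive several, each summarized separately.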
In our study, the LLMs were used to process and interpret the unstructured text data of an online brain tumor support forum. The methods were evaluated using widely adopted metrics for assessing the quality of text generation and summarization: BERTScore (f1), 22 BLEU, 23 cosine similarity, 24 Jaccard similarity, 25 METEOR, 26 ROUGE-1, ROUGE-2, 27 and ROUGE-L. 28 The output range for these metrics is 0 to 1, with 1 indicating an exact match to the reference. Table 1 presents a description of these metrics. Readability was assessed using the Flesch reading ease score and the Flesch–Kincaid grade level score. The Flesch reading ease score evaluates the readability of English text by assigning a value between 0 and 100, where higher scores indicate easier readability; the formula considers the average sentence length and the average number of syllables per word. The Flesch–Kincaid grade level score translates the readability assessment into a U.S. school grade level, indicating the years of education required to comprehend the text. It uses a formula similar to the Flesch reading ease but with different weighting factors to produce a grade-level result. 29 These readability metrics are widely utilized and can assist in tailoring content to appropriate reading levels, thereby enhancing comprehension and engagement.
Table 1. Descriptions of the evaluation metrics.
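As an illustration of how two of these measures work, the following simplified functions compute a unigram-overlap ROUGE-1 F1 score and the two standard Flesch formulas. These are pedagogical approximations, not the library implementations used in the study: the syllable counter in particular is a rough vowel-group heuristic, and published ROUGE implementations add stemming and other preprocessing:

```python
import re
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap ROUGE-1 F1 (simplified: whitespace tokens, no stemming)."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def count_syllables(word):
    # Crude approximation: count groups of consecutive vowels
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_scores(text):
    """Return (reading ease, Flesch-Kincaid grade level) using the standard weights."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences          # average words per sentence
    spw = syllables / len(words)          # average syllables per word
    ease = 206.835 - 1.015 * wps - 84.6 * spw
    grade = 0.39 * wps + 11.8 * spw - 15.59
    return ease, grade
```

For example, `rouge1_f1("the scan was clear", "the scan was clear")` returns 1.0, while short, monosyllabic sentences yield high reading ease and low grade-level scores.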
Researchers in our author group had no direct experience as brain tumor patients and had varying caregiving experience; one researcher had experience caring for a patient with a brain tumor.
Ethical considerations
To limit ethical concerns, this study only included data collected from online forums that were publicly accessible. Researchers had no contact with the forum participants. All user names and other in-text name use were removed from the data before analysis to anonymize those who participated in the online communities. The Institutional Review Board (IRB) at Northern Illinois University determined the study to be exempt from human subjects review in accordance with federal regulation criteria (Protocol # HS25-0048).
Results
Completing the codebook thematic analysis required 111 minutes of the combined researchers' time. Completing the same summarization and analysis of the transcripts took 1 minute for GPT-4, 2 minutes for GPT-3.5, and 6 minutes for Llama 3. Researchers noted feeling emotional, fatigued, and distracted by the subject matter when conducting the codebook thematic analysis; they avoided such feelings when conducting the LLM analyses. The comparison between traditional codebook thematic analysis and the outputs from the three LLMs (GPT-3.5, GPT-4, and Llama 3) revealed several common themes, as well as distinct differences in the identification of key challenges faced by patients and caregivers in an online brain tumor support community. All four approaches identified treatment-related uncertainty, fear and anxiety, and the dismissal or disbelief of symptoms by family and healthcare providers as primary concerns. However, differences emerged in the specificity and emphasis of themes. Codebook thematic analysis highlighted the need for coping strategies and additional resources, while the LLMs placed greater focus on functional rehabilitation, cognitive and physical impairments, emotional strain, and decisions surrounding end-of-life care. Although the LLM summaries contained few glaring errors, omissions, or oversimplifications, human analysis was more adept at identifying complex emotions, coping strategies, and higher-order reasoning, such as recognizing unmet patient and caregiver needs. Table 2 presents the four major themes, and Supplemental Table 1 presents the summary paragraph from each analysis. Additionally, Table 3 compares the output of each model across several aspects. Note that GPT-3.5 was more reactive to input-data sensitivity and prompt phrasing, while Llama 3 and GPT-4 were less so, with GPT-4 demonstrating the greatest resiliency to input-data and prompt variability.
Table 2. Top challenges identified by codebook thematic analysis and the large language models.
Table 3. LLM comparison.
Figure 2 presents a comparative evaluation of the LLM-generated summaries against the human-generated reference summary. ROUGE metrics measure the overlap of unigrams, bigrams, and the longest common subsequences between the candidate and the reference text, and here GPT-4 demonstrated superior performance, surpassing GPT-3.5 by 0.08 and Llama 3 by 0.02. Unlike ROUGE metrics, METEOR extends the evaluation to include synonyms and stemmed versions of words, along with exact word matches. On this metric, GPT-4 again showed enhanced results, outperforming GPT-3.5 by 0.05 and Llama 3 by 0.06. For cosine similarity, which measures the angle between text vectors, GPT-4 achieved scores 0.20 points higher than GPT-3.5 and 0.03 points higher than Llama 3. Similarly, on the Jaccard similarity metric, which quantifies the overlap between word sets, GPT-4 outperformed GPT-3.5 by 0.06 and Llama 3 by 0.04. However, when semantic similarity was assessed through contextual embeddings with BERTScore, all tested LLMs exhibited nearly equivalent performance.

Figure 2. Performance of the different LLMs (GPT-3.5, Llama 3, and GPT-4) on the evaluation metrics. GPT-4 demonstrated superior performance on all metrics. (Note that BLEU scores for GPT-3.5 and Llama 3 were 0, so they do not appear here.)
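For reference, the cosine and Jaccard similarities discussed above can be computed over simple bag-of-words representations as follows. This is a simplified sketch; practical evaluations typically apply proper tokenization and TF-IDF or embedding-based vectors rather than raw term counts:

```python
import math
from collections import Counter

def jaccard_similarity(a, b):
    """Overlap of word sets: |A intersect B| / |A union B|."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

def cosine_similarity(a, b):
    """Cosine of the angle between term-frequency vectors of two texts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Both return values in [0, 1], matching the metric ranges reported in Table 1; cosine similarity rewards shared word frequencies, while Jaccard considers only set membership.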
Figures 3 and 4 present the Flesch–Kincaid grade level and Flesch reading ease scores. The readability analysis demonstrated notable differences in the complexity of outputs between the reference and LLM methods. The reference method had the lowest Flesch reading ease score (11.6) and the second highest Flesch–Kincaid grade level (15.9) after GPT-4, making its summary highly complex and advanced. Among the LLMs, GPT-4 generated summaries with the highest reading ease score (18.1) but also the highest grade level (17.6), compared with GPT-3.5 (reading ease: 13.8; grade level: 15.1) and Llama 3 (reading ease: 14.9; grade level: 14.7). Llama 3 produced the most accessible summaries among the LLMs, with a slightly lower grade-level complexity than GPT-3.5 and GPT-4. These results show that LLM text readability is comparable with human-generated summaries and that LLMs provide varying levels of accessibility, with Llama 3 balancing readability and detail most effectively.

Figure 3. Flesch–Kincaid grade level of the reference method and LLMs, where higher scores indicate a need for more advanced reading skills, according to U.S. educational standards.

Figure 4. Flesch reading ease scores for the reference method and LLMs, where higher scores indicate easier general readability.
Discussion
In this study, we found that LLMs can produce coherent and accurate summaries of a dataset in mere minutes, a task that took two researchers nearly 2 hours to accomplish. The LLMs developed summaries with Flesch–Kincaid grade level and Flesch reading ease scores at the college-graduate level, consistent with those of the traditional thematic analysis. Furthermore, LLMs were not burdened by human factors such as fatigue, distraction, and emotional responses, which affected researchers during the codebook thematic analysis.
The differences in labor and time between the reference and LLM methods highlight one of the key strengths of LLMs: their capacity to process and synthesize large volumes of text almost instantaneously. When LLMs are trained on large and diverse datasets, they can better generate quick and meaningful summaries. 30 The speed at which LLMs operate makes them particularly useful in environments where time is a valuable commodity, such as real-time data processing or when dealing with rapidly changing information. 31 By automating data summarization, researchers could dedicate their time and resources to more challenging and nuanced tasks that require human guidance, such as data interpretation and hypothesis generation. Regarding online health forums, this heightened summarization capability could translate into quicker access to valuable insights from vast amounts of peer-shared experiences and advice.
LLM summarization may also be superior to human summarization in mitigating the influence of emotional states and cognitive biases. Humans inevitably bring their personal perspectives and biases into data analysis despite great efforts to avoid doing so. 32 LLMs instead operate based on patterns in data rather than unique personal experiences or feelings. 33 This detachment from individual human emotion and bias may result in more objective and consistent outputs. For example, human summarization might prioritize certain aspects of a dataset based on personal interest or subconscious preferences. In contrast, LLM summarization does not carry the bias of just one or two people and is systematic, focusing on the most relevant or frequent information. 34 Furthermore, because human analysis takes longer, it is also more susceptible to errors driven by fatigue, stress, or emotion. 35 Once trained, LLMs tend to be consistent in their performance, and research has highlighted their reliability when performing repeated tasks. 36 LLMs therefore show promise as a tool for reducing human error and increasing the accuracy and efficiency of tasks like data summarization, particularly in high-volume and/or fast-paced environments.
The objectivity of LLMs is particularly significant in health and disease management, where decision-making often relies on synthesizing a large quantity of information under time-sensitive and emotionally charged conditions. In the context of managing chronic or life-threatening conditions like brain tumors, patients and caregivers may turn to discussion forums and online support communities. As our study suggests, LLMs can provide comprehensive summaries of these discussions, highlighting actionable trends or common challenges others face in similar situations. Such summaries could potentially help patients and informal caregivers quickly learn effective coping strategies and become aware of new treatment options or side effects, thus enhancing their ability to make informed decisions. Additionally, the consistency and efficiency of LLMs in handling large-scale data could assist healthcare professionals by summarizing patient discussions and monitoring patient-reported outcomes, prioritizing urgent cases, generating easy-to-understand educational materials, and developing individualized care plans tailored to patient needs. For researchers, LLMs could streamline qualitative data analysis, identify care gaps, and provide insights into clinical trials. Furthermore, caregivers could benefit from AI tools that offer personalized resources, real-time support, and decision-making guidance based on shared patient experiences. By mitigating the risks of human error and biases, LLMs show promise in improving the accuracy and timeliness of valuable information, ultimately supporting better outcomes in health and disease management.
When using LLMs, minor changes in input, such as altering prompt wording or segmentation strategies, can significantly impact summarization outputs. While GPT-4, with its larger context window, remains highly robust to input variations, it exhibits slight differences in phrasing when summaries are re-generated. In contrast, GPT-3.5 is more sensitive to prompt modifications, occasionally producing inconsistent outputs when given reworded instructions. 37 Llama 3, though less affected by minor prompt changes, struggles with longer, segmented inputs, often leading to repetition or content omission. These differences highlight the importance of standardized input structures, careful prompt engineering, and context-length considerations to ensure reliable and consistent model performance. 38
Limitations
Our study had some limitations. Although LLMs have many benefits, such as summarizing and synthesizing large volumes of information, their outputs depend greatly on the quality and diversity of the data on which they were trained. 39 The LLM-generated summaries closely aligned with human analyses; however, human reviewers were slightly more effective in identifying coping strategies and unmet needs, which were not explicitly captured by the LLMs. These findings suggest that while LLMs excel at thematic summarization, they may require further refinement to recognize and interpret emotional complexity within patient and caregiver discussions. Additionally, LLMs may unknowingly reinforce biases that exist within their training datasets, which could result in skewed outputs. Furthermore, only GPT-4 could analyze the transcripts in their entirety; for GPT-3.5 and Llama 3, transcripts had to be input in parts due to smaller capacity. This diminished ability to analyze the transcripts as a whole may have contributed to a lack of depth in the GPT-3.5 and Llama 3 analyses compared to GPT-4. Although segmenting added only several minutes relative to GPT-4, it nonetheless reduced the efficiency advantage of these two models.
It is important to note that while the Flesch–Kincaid readability score is useful for benchmarking readability, these scores may not fully capture the specific needs of patients and caregivers navigating complex diseases. For one, this readability score may not capture comprehensibility. When summarizing, the text may lack critical medical details or oversimplify, losing crucial contextual information. Additionally, readability metrics do not account for aspects like empathy, reassurance, or clarity in emotionally charged situations like those experienced by patients and caregivers. Finally, the Flesch–Kincaid readability score does not assess logical flow or medical accuracy, which is essential for effective health communication.
Future outlook
LLMs provide significant advantages in efficiency and objectivity over human summarization when analyzing health discussion forum data. LLMs' ability to summarize and interpret in minutes data that would typically require considerable human effort highlights their strong potential in healthcare, where large-scale data processing is vital. Future research into ways that healthcare providers can use LLM-summarized discussion forum data to identify emerging trends in patient and caregiver concerns, uncover significant gaps in care, and inform evidence-based improvements to treatment and support strategies would be helpful to the field. To better match human analysis, future development of LLMs could focus on refining their capacity for empathy and emotional intelligence. This would enable LLMs to recognize subtle emotional undertones and apply higher-order reasoning, such as identifying unmet needs based on the challenges expressed. Additionally, combining text-based data such as the forum data in this study with feature extraction from medical images such as MRIs, when available, could further enhance understanding of patient conditions and caregiver experiences.40,41 Furthermore, future considerations of how LLMs can help address the needs of underserved groups should be a priority area as this body of research expands. Although LLMs have great potential, extensive validation is necessary to ensure accuracy, minimize biases, and prevent misinterpretation or omission of critical details. This includes expert review by healthcare professionals, user testing for clarity and emotional appropriateness, bias assessments, and compliance with regulatory standards.
Conclusion
This study demonstrated that LLMs, particularly GPT-4, can accurately and efficiently summarize qualitative data from an online brain tumor discussion forum. Compared to traditional thematic analysis, LLMs produced summaries with similar themes in a fraction of the time, significantly reducing the labor burden associated with manual coding. All LLMs tested identified similar patient and caregiver concerns, with GPT-4 exhibiting superior performance in text generation metrics and readability. Additionally, LLM summaries were not influenced by emotional fatigue or cognitive biases, factors which are known to affect human analyses. However, limitations such as training data biases and model capacity constraints demonstrate areas for further improvement. These findings suggest that LLMs have strong potential for use in qualitative health research and patient-centered applications due to their ability to provide rapid insights from large volumes of text.
Supplemental Material
Supplemental material, sj-docx-1-dhj-10.1177_20552076251337345 for Exploring large language models for summarizing and interpreting an online brain tumor support forum by Christy Muasher-Kerwin, M Courtney Hughes, Michelle L Foster, Ibrahim Al Azher and Hamed Alhoori in DIGITAL HEALTH
Supplemental Material
Supplemental material, sj-docx-2-dhj-10.1177_20552076251337345 for Exploring large language models for summarizing and interpreting an online brain tumor support forum by Christy Muasher-Kerwin, M Courtney Hughes, Michelle L Foster, Ibrahim Al Azher and Hamed Alhoori in DIGITAL HEALTH
Footnotes
Acknowledgments
The authors would like to thank Samantha M. Econie for her assistance in analyzing the data.
Ethical considerations
The Institutional Review Board at Northern Illinois University determined the study to be exempt from human subjects review in accordance with federal regulation criteria.
Author contributions
Researchers CM, MCH, and MLF were involved in conceptualization, methodology, formal analysis, investigation, and writing. CM drafted the original draft, and MCH conducted a critical revision of the manuscript. MLF was responsible for data curation, while researchers IA and HA were responsible for software, formal analysis, and writing. MCH was involved in project administration and supervision of the research project and team.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability statement
The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request.
Guarantor
MCH
Supplemental material
Supplemental material for this article is available online.
References