Development of a cancer information chatbot model: Retrieval-augmented generation with data from the national center for cancer knowledge and information

Abstract

Background

With the rapid advancement of digital health technologies, there is a growing need for reliable healthcare solutions. However, the vast amount of available cancer-related information and the challenges in identifying trustworthy sources highlight the requirement for systematic management.

Objective

This study aimed to develop a National Cancer Information Center–grounded RAG chatbot and to evaluate evidence traceability using automatic metrics.

Methods

We implemented a RAG-based chatbot using GPT-4o, FAISS vector search, and OpenAI embeddings, grounded in verified cancer-related data from the National Cancer Information Center. Two retrieval strategies were compared: (1) non-filtering retrieval based solely on vector similarity and (2) heuristic cancer-type filtering applied as a post-retrieval string-matching constraint. A total of 72 responses were evaluated using automatic evidence-traceability metrics, including retrieved evidence count, verified evidence count, cancer-matched evidence count, total answer sentence count, and evidence-aligned sentence count. Paired comparisons were conducted using the Wilcoxon signed-rank test with bootstrap confidence intervals and Holm correction.

Results

The non-filtering strategy retrieved significantly more verified and cancer-matched evidence and produced more evidence-aligned answer sentences than heuristic filtering (all p<0.01). The total number of answer sentences did not differ significantly.

Conclusion

Heuristic cancer-type filtering degraded evidence grounding in a National Cancer Information Center–based RAG chatbot. Automatic traceability metrics provide a reproducible framework for evaluating and monitoring evidence-grounded performance.

Keywords

cancer information chatbot healthcare large language model retrieval-augmented generation

Introduction

With the rapid advancement in technology, the Internet has become a widely accessible source of health information.¹ While physician consultations remain essential for patients with cancer seeking disease-related information, many turn to online sources of medical and treatment-related knowledge.² Previous studies indicated that 75% of patients are influenced by online health information when making treatment decisions.³ However, the vast amount of available cancer-related information, coupled with the challenges in identifying the most up-to-date and trustworthy sources, highlights the need for systematic management.^4,5 As the Internet continues to serve as a primary platform for disseminating health information, ensuring patient access to accurate and reliable sources has become increasingly crucial.^2,3

Currently, digital health is rapidly evolving, driven by technological advancements and the increasing demand for innovative healthcare solutions.⁶ Chatbots, computer-based systems designed to simulate and process human conversations, enable interactions through various communication modes, including text, speech, and graphics.^7,8 By analyzing user input, text-based or spoken chatbots generate predetermined responses by accessing relevant knowledge sources.^9,10 Notably, numerous studies have explored the application of chatbots in healthcare. For instance, previous research focused on developing chatbots to assist hospital caregivers by providing answers to medication-related inquiries and pharmaceutical management.^6,11 Additionally, a Korean chatbot study examined the impact of emotional disclosure on user satisfaction and intention to reuse the chatbot in the context of mental health counseling.^6,12 More recent advances in large language models have further expanded the capabilities of healthcare chatbots. In particular, retrieval-augmented generation (RAG) has emerged as a key approach to mitigating hallucination and improving response reliability by grounding generated outputs in external evidence.⁸

The medical industry is expanding its health promotion services in response to increased life expectancy, aging, and lifestyle changes, driving a paradigm shift toward smart health services.¹³ Health chatbots are increasingly utilized to enhance user experience, support healthcare professionals, and optimize healthcare processes, particularly for the dissemination of cancer information.¹⁴ Consequently, there is a growing demand for accurate, context-aware, and easily accessible health information on online platforms.¹⁵ The use of chatbots to deliver reliable health information in a user-friendly manner has become increasingly essential, allowing users to ask questions and receive accurate responses. Also, evaluating the reliability and evidence grounding of such generative AI–based systems remain a significant challenge. Existing evaluation approaches often rely on subjective expert judgment or general accuracy measures, which do not explicitly assess whether generated responses are supported by verifiable evidence. In retrieval-augmented generation (RAG) systems, the relationship between retrieved documents and generated responses is critical, yet this evidence–response alignment is not consistently quantified in prior studies.

To address this need, we developed a chatbot designed to provide responses to cancer-related inquiries by grounding its answers in content exclusively from Korea’s National Cancer Information Center. The target audience of the chatbot includes individuals seeking cancer-related health information from publicly accessible online platforms. This study aims not only to develop a RAG-based chatbot grounded in the National Cancer Information Center but also to propose an automatic evaluation framework that quantifies evidence traceability.¹⁶

Methods

Design

A Retrieval-augmented Generation (RAG) based chatbot was implemented by integrating nonparametric memory with a pre-trained Large Language Model (LLM). Designed to minimize hallucinations, the chatbot retains previous user interactions, enabling multiturn conversational dialogue.¹⁷ It generates context-aware responses by incorporating the user’s query and the retrieved evidence from the National Cancer Information Center (Figure 1).

Figure 1.

Chatbot architecture.

Data source and preprocessing

Cancer information content was collected via API endpoints of the National Cancer Information Center (https://www.cancer.go.kr). The content was used solely for research purposes, was not redistributed as a standalone dataset, and the source was attributed in accordance with the Center’s usage guidance.

Six major cancer types were selected for evaluation: liver, colorectal, gastric, breast, pancreatic, and lung cancer. Raw JSON data were first filtered to include only these cancer types. Documents with identical cancer sequence and menu sequence identifiers were consolidated into unified text entries. HTML tags, boilerplate content, URLs, and user-interface artifacts were removed to ensure clean textual input.

Each consolidated document was then segmented into section-level chunks based on original heading markers (e.g., “###”) to preserve semantic coherence. Document chunks were normalized to reduce excessive whitespace and formatting artifacts. The average chunk length was approximately 183 tokens (median 134 tokens), providing balanced retrieval granularity while maintaining contextual completeness.

Text embedding and vector indexing

Text embeddings were generated using OpenAI’s text-embedding-3-large model (OpenAI; public release: January 25, 2024). All document chunks were converted into vector representations and indexed using FAISS (Facebook AI Similarity Search). Prior to indexing, embeddings were L2-normalized, and similarity search was conducted using inner-product similarity. The resulting vector index enabled efficient top-k retrieval during query processing.

Question prompts were also embedded using the same embedding model to ensure vector-space consistency between queries and documents.

Retrieval strategies

Two retrieval strategies were evaluated:

1. Non-filtering: The top-k documents were retrieved solely based on vector similarity scores.

2. Heuristic Cancer-type filtering: A larger candidate pool (top-k × 20) was first retrieved based on similarity. A heuristic string-matching rule was then applied, retaining only documents that explicitly contained the target cancer type in the title or text. The final top-k evidence items were selected from the filtered subset.

The filtering approach relied on surface-form string matching across multiple fields. This design allowed assessment of whether heuristic cancer-type filtering improves or degrades evidence grounding. In all experiments, k was set to 5.

Response generation

Responses were generated using GPT-4o (OpenAI; public release: May 13, 2024) as the base large language model.¹⁸ A total of 72 prompts were used, consisting of 12 questions for each of six cancer types. This sample size was determined to ensure balanced representation across cancer types. The chatbot outputs were originally generated in Korean and evaluated in their original language. Retrieved evidence text was concatenated and provided as contextual input to the model. The system prompt explicitly instructed the model to answer using only the retrieved evidence and to respond with “insufficient evidence” if the provided context did not support the query. The temperature parameter was set to 0.1 to reduce overly deterministic outputs while maintaining factual consistency in evidence-grounded responses. The generative AI models used in this study, including GPT-4o and text-embedding-3-large (OpenAI), are proprietary and were accessed via the OpenAI API. GPT-4o was used as a base model without any task-specific fine-tuning. The prompts were manually constructed by a single researcher and remained fixed throughout the experiments. No patients or members of the public were involved in the development of the prompts. All queries were conducted in March 2026 in Goyang, Republic of Korea. Each prompt was submitted as an independent query in a separate session without retaining conversational history.

Automatic evidence-traceability metrics

To quantitatively assess evidence grounding, automatic evidence-traceability metrics were defined, as summarized in Table 1. For each question-response pair, the following metrics were recorded.

Table 1.

Evaluation criteria for response quality.

Metric	Definition
Retrieved evidence count	Number of document titles returned as supporting evidence
Verified evidence count	Number of retrieved evidence items confirmed to correspond to existing documents
Cancer-matched evidence count	Number of retrieved evidence items corresponding to the intended cancer type
Answer sentence count	Total number of sentences in the generated response
Evidence-aligned sentence count	Number of answer sentences directly supported by retrieved evidence text

Statistical analysis

Because identical question prompts were evaluated under two retrieval conditions, paired comparisons were conducted. As the evaluation metrics were count-based and not assumed to follow a normal distribution, the Wilcoxon signed-rank test was used for primary comparisons.

Mean differences and 95% confidence intervals were estimated using non-parametric bootstrap resampling. Holm correction was applied to adjust for multiple comparisons across evaluation metrics. Statistical significance was defined as a two-sided p-value < 0.05. Reproducibility across multiple runs was not formally evaluated, as each prompt was processed once under controlled conditions.

Supplementary expert evaluation

Expert-based evaluation was conducted to assess the overall response quality of the chatbot. The reference standard for evaluation was established clinical knowledge and the source material used in the retrieval process. Five domain experts participated in the evaluation. The expert panel included cancer specialists (two preventive medicine physicians, a nurse, and a cancer registry program specialist), along with a public health researcher who contributed a broader, population-level perspective.

The experts independently reviewed a subset of chatbot-generated responses. Each response was evaluated using a 5-point Likert scale (0–5) across the following dimensions: accuracy, relevance, completeness, clarity, and consistency. A score of 0 indicated poor quality or incorrect information, and a score of 5 indicated excellent quality fully aligned with established clinical knowledge and the source material. Experts were instructed to assess responses based solely on the information presented in the chatbot output, without access to retrieval logs or system prompts. No discussion among evaluators was permitted during the scoring process to preserve independence. No patients or members of the public were involved in the evaluation process. Blinding was partially applied, as evaluators did not have access to system prompts or retrieval evidence; however, they were aware that the responses were generated by a generative AI-based chatbot.

Results

A total of 72 question–response pairs across six cancer types were evaluated under two retrieval strategies (non-filtering and heuristic cancer-type filtering) using paired comparisons. The non-filtering condition retrieved a higher number of retrieved evidence items than the heuristic filtering condition (mean difference 0.25, 95% bootstrap CI 0.11, 0.39; p<0.001). It also retrieved more verified evidence items (mean difference 1.39, 95% bootstrap CI 1.13, 1.65; p<0.001) and more cancer-matched evidence items (mean difference 1.40, 95% bootstrap CI 1.14, 1.68; p<0.001). The non-filtering condition produced a higher number of evidence-aligned answer sentences (mean difference 0.50, 95% bootstrap CI 0.21, 0.81; p=0.003). In contrast, the total number of answer sentences did not differ significantly between conditions (mean difference 0.32, 95% bootstrap CI −0.06, 0.72; p=0.114). The distributional differences across metrics are visualized in Figure 2, while Table 2 provides a representative question with the corresponding retrieved evidence and generated responses under the filtering and non-filtering conditions.

Figure 2.

Comparison of evaluation metric distributions between filtering and non-filtering conditions/Box plots show the median and IQR across paired instances. Non-filtering consistently achieves higher scores in evidence-related metrics (retrieved, verified, cancer-matched evidence, and evidence-aligned sentences), while answer sentence count remains comparable across conditions.

Table 2.

Representative case comparing retrieved evidence and generated responses under filtering and non-filtering retrieval conditions.

	Filtering	Non-filtering
Cancer type	Gastric cancer
Question	How is the stage of gastric cancer determined?
Retrieved evidence	[Overview – Characteristics by Stage] The term stage refers to the extent of cancer …	[Differential Diagnosis] The clinical findings and symptoms of gastric cancer are …
Cancer-matched count	2	5
Verified evidence count	2	5
Answer	The stage of gastric cancer is determined according to … the presence of distant metastasis.	The stage of gastric cancer reflects the extent of … postoperative chemotherapy—are established.
Evidence-aligned sentence count	1	3
Answer sentence count	4	4

Discussion

The results of this study show that heuristic cancer-type filtering overall degraded both evidence document quality and evidence-grounded answer quality. This degradation was reflected in errors such as missing evidence, partial grounding, or unsupported statements. The non-filtering condition retrieved more verified evidence documents (1.39, p<0.001) and more cancer-matched evidence documents (1.40, p<0.001). Non-filtering also produced a greater number of evidence-aligned answer sentences (0.50, p=0.003). In contrast, total number of sentences did not differ significantly between conditions (p=0.114). Additional expert evaluation of overall chatbot responses is reported in the Supplementary Material.

These findings indicate three key points. First, evidence traceability can be quantitatively evaluated using automatic metrics in a National Cancer Information Center–based RAG chatbot. The observed statistically significant differences across conditions demonstrate that evidence-grounded performance can be compared reproducibly without relying on subjective expert scoring. Second, Responses were generated with similar length, but fewer sentences were directly supported by retrieved evidence. This suggests that filtering affected grounding quality rather than verbosity. Third, even constraints that appear redundant can substantially change outcomes. The documents explicitly contained cancer type information, and the dataset was structurally organized by cancer type. Nevertheless, the addition of heuristic filtering resulted in statistically significant differences. Even within a structured dataset, a small post-retrieval constraint can alter the composition of the top-k evidence set, and such changes can influence generation grounding.

Several limitations should be noted. Because the experiments relied on a single LLM and embedding configuration (GPT-4o with text-embedding-3-large), the observed effects may not fully generalize to other model combinations. While the proposed evidence-traceability metrics provide a quantitative proxy for evidence-groundedness, they do not directly evaluate clinical correctness of the generated answers. This study focused on hard post-retrieval filtering, and the effects of soft ranking strategies remain unknown. Evaluation of potentially harmful, biased, or misleading responses was not conducted, as this study focused on evidence traceability. This represents an important direction for future research.

Despite these limitations, this study suggests several theoretical and practical implications for medical RAG system design. In datasets that are already structurally organized by cancer type, such as the National Cancer Information Center corpus, string-based domain filtering may be unnecessary and may even be counterproductive. Moreover, in settings where the evidence corpus is already expert-verified, ensuring evidence traceability and measurable grounding may be more important than introducing additional heuristic constraints. Finally, automatic traceability metrics can serve as a practical monitoring tool for tracking RAG quality in deployment.

Future work should examine retrieval designs that incorporate soft constraints—such as score weighting or reranking-based approaches—as alternatives to hard post-retrieval filtering. The analysis should be extended across multiple LLMs and embedding models to evaluate the robustness and generalizability of the observed effects. It would also be important to examine whether these retrieval and grounding effects vary across cancer type, particularly in terms of evidence availability and cancer-specific relevance.

Conclusions

This study found that heuristic cancer-type filtering degraded both evidence document quality and evidence-grounded answer quality in a National Cancer Information Center–based RAG chatbot. Non-filtering retrieved more verified and cancer-matched evidence documents and produced more evidence-aligned answer sentences, despite similar response length. Overall, the results suggest that string-based post-retrieval filtering can reduce grounding in structured medical corpora, and that automatic traceability metrics enable reproducible evaluation of evidence-grounded RAG performance.

Supplemental material

Supplemental material - Development of a cancer information chatbot model: Retrieval-augmented generation with data from the national center for cancer knowledge and information

Supplemental material for Development of a cancer information chatbot model: Retrieval-augmented generation with data from the national center for cancer knowledge and information by Eunzi Jeong, Wonjeong Jeong, Eunkyoung Song, Eun Hye Park, Kyoung Hee Oh and Jae Kwan Jun in Digital Health.

Footnotes

ORCID iD

Eunzi Jeong

Ethical consideration

This study was exempted from Institutional Review Board approval because it utilized publicly available data from an open online platform () and did not involve human participants or identifiable personal data.

Author contributions

EJ and WJ designed the study. EJ performed the statistical analyses. WJ and EJ drafted the manuscript. JKJ contributed to the discussion and reviewed and edited the manuscript. EP reviewed and provided comments on the statistical analysis and contributed to manuscript revision. ES and KHO provided assistance in drafting the manuscript. All the authors have read and approved the final manuscript.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Cancer Center Grant (2511622-2). The funding sources did not have interventions such as study design and data interpretation.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The data used in this study were obtained from the National Cancer Information Center (cancer.go.kr) and are publicly available. Code is available from the corresponding author upon reasonable request. A formal study protocol was not registered; however, all methodological details are fully described in the Methods section.*

Supplemental material

Supplemental material for this article is available online.

Appendix

References

Alsaiari

Joury

Aljuaid

, et al. The Content and Quality of Health Information on the Internet for Patients and Families on Adult Kidney Cancer. Journal of Cancer Education 2017; 32: 878–884. https://doi.org/10.1007/s13187-016-1039-9

Steeb

Reinhardt

Harlaß

, et al. Assessment of the Quality, Understandability, and Reliability of YouTube Videos as a Source of Information on Basal Cell Carcinoma: Web-Based Analysis. JMIR Cancer 2022; 8: e29581. https://doi.org/10.2196/29581

Madathil

Rivera-Rodriguez

Greenstein

, et al. Healthcare information on YouTube: A systematic review. Health Informatics Journal 2014; 21: 173–194. https://doi.org/10.1177/1460458213512220

Neha

Bhati

Shukla

. Retrieval-Augmented Generation (RAG) in Healthcare: A Comprehensive Review. AI 2025; 6: 226. https://doi.org/10.3390/ai6090226

Borges do Nascimento

Pizarro

Almeida

, et al. Infodemics and health misinformation: a systematic review of reviews. Bull World Health Organ 2022; 100: 544–561. https://doi.org/10.2471/blt.21.287654

Cárdenas

Falconi

Tusa

, et al. Development of a ChatBot model for health telecare: Integration of LangChain, embeddings with OpenAI, and Pinecone using the question answering technique. Journal of Applied Research and Technology 2024; 22: 389–402. https://doi.org/10.22201/icat.24486736e.2024.22.3.2367

Xue

Zhang

Zhao

, et al. Evaluation of the Current State of Chatbots for Digital Health: Scoping Review. J Med Internet Res 2023; 25(Review 19.12): e47217. https://doi.org/10.2196/47217 2023).

Yang

Ning

Keppo

, et al. Retrieval-augmented generation for generative artificial intelligence in health care. npj Health Systems 2025; 2: 2. https://doi.org/10.1038/s44401-024-00004-1

Dahiya

. A tool of conversation: chatbot. International Journal of Computer Sciences and Engineering 2017; 5: 158.

10.

Sanders

, et al. Chatbot for Health Care and Oncology Applications Using Artificial Intelligence and Machine Learning: Systematic Review. JMIR Cancer 2021; 7(Review 29.11): e27850. https://doi.org/10.2196/27850

11.

Daniel

de Chevigny

Champrigaud

, et al. Answering Hospital Caregivers’ Questions at Any Time: Proof-of-Concept Study of an Artificial Intelligence–Based Chatbot in a French Hospital. JMIR Hum Factors 2022; 9: e39102, Original Paper 11.10.2022. https://doi.org/10.2196/39102

12.

Park

Chung

Lee

. Effect of AI chatbot emotional disclosure on user satisfaction and reuse intention for mental health counseling: a serial mediation model. Current Psychology 2023; 42: 28663–28673. https://doi.org/10.1007/s12144-022-03932-z

13.

Chung

Park

. Chatbot-based heathcare service with a knowledge base for cloud computing. Cluster Computing 2019; 22: 1925–1937. https://doi.org/10.1007/s10586-018-2334-5

14.

Liu

Y-l

Yan

, et al. Effects of personalization and source expertise on users’ health beliefs and usage intention toward health chatbots: Evidence from an online experiment. DIGITAL HEALTH 2022; 8: 20552076221129718.

15.

Amugongo

Mascheroni

Brooks

, et al. Retrieval augmented generation for large language models in healthcare: A systematic review. PLOS Digit Health 2025; 4: e0000877. https://doi.org/10.1371/journal.pdig.0000877

16.

Liu

McCoy

Wright

. Improving large language model applications in biomedicine with retrieval-augmented generation: a systematic review, meta-analysis, and clinical development guidelines. J Am Med Inform Assoc 2025; 32: 605–615. https://doi.org/10.1093/jamia/ocaf008

17.

Lewis

PPE

. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems 2020; 33: 9459–9474.

18.

OpenAI . Hello GPT-4o. https://openai.com/index/hello-gpt-4o/ (2024).

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.31 MB