Abstract
Background
With the rapid advancement of digital health technologies, there is a growing need for reliable healthcare solutions. However, the vast amount of available cancer-related information and the challenges in identifying trustworthy sources highlight the requirement for systematic management.
Objective
This study aimed to develop a National Cancer Information Center–grounded RAG chatbot and to evaluate evidence traceability using automatic metrics.
Methods
We implemented a RAG-based chatbot using GPT-4o, FAISS vector search, and OpenAI embeddings, grounded in verified cancer-related data from the National Cancer Information Center. Two retrieval strategies were compared: (1) non-filtering retrieval based solely on vector similarity and (2) heuristic cancer-type filtering applied as a post-retrieval string-matching constraint. A total of 72 responses were evaluated using automatic evidence-traceability metrics, including retrieved evidence count, verified evidence count, cancer-matched evidence count, total answer sentence count, and evidence-aligned sentence count. Paired comparisons were conducted using the Wilcoxon signed-rank test with bootstrap confidence intervals and Holm correction.
Results
The non-filtering strategy retrieved significantly more verified and cancer-matched evidence and produced more evidence-aligned answer sentences than heuristic filtering (all p<0.01). The total number of answer sentences did not differ significantly.
Conclusion
Heuristic cancer-type filtering degraded evidence grounding in a National Cancer Information Center–based RAG chatbot. Automatic traceability metrics provide a reproducible framework for evaluating and monitoring evidence-grounded performance.
Introduction
With the rapid advancement in technology, the Internet has become a widely accessible source of health information. 1 While physician consultations remain essential for patients with cancer seeking disease-related information, many turn to online sources of medical and treatment-related knowledge. 2 Previous studies indicated that 75% of patients are influenced by online health information when making treatment decisions. 3 However, the vast amount of available cancer-related information, coupled with the challenges in identifying the most up-to-date and trustworthy sources, highlights the need for systematic management.4,5 As the Internet continues to serve as a primary platform for disseminating health information, ensuring patient access to accurate and reliable sources has become increasingly crucial.2,3
Currently, digital health is rapidly evolving, driven by technological advancements and the increasing demand for innovative healthcare solutions. 6 Chatbots, computer-based systems designed to simulate and process human conversations, enable interactions through various communication modes, including text, speech, and graphics.7,8 By analyzing user input, text-based or spoken chatbots generate predetermined responses by accessing relevant knowledge sources.9,10 Notably, numerous studies have explored the application of chatbots in healthcare. For instance, previous research focused on developing chatbots to assist hospital caregivers by providing answers to medication-related inquiries and pharmaceutical management.6,11 Additionally, a Korean chatbot study examined the impact of emotional disclosure on user satisfaction and intention to reuse the chatbot in the context of mental health counseling.6,12 More recent advances in large language models have further expanded the capabilities of healthcare chatbots. In particular, retrieval-augmented generation (RAG) has emerged as a key approach to mitigating hallucination and improving response reliability by grounding generated outputs in external evidence. 8
The medical industry is expanding its health promotion services in response to increased life expectancy, aging, and lifestyle changes, driving a paradigm shift toward smart health services. 13 Health chatbots are increasingly utilized to enhance user experience, support healthcare professionals, and optimize healthcare processes, particularly for the dissemination of cancer information. 14 Consequently, there is a growing demand for accurate, context-aware, and easily accessible health information on online platforms. 15 The use of chatbots to deliver reliable health information in a user-friendly manner has become increasingly essential, allowing users to ask questions and receive accurate responses. Also, evaluating the reliability and evidence grounding of such generative AI–based systems remain a significant challenge. Existing evaluation approaches often rely on subjective expert judgment or general accuracy measures, which do not explicitly assess whether generated responses are supported by verifiable evidence. In retrieval-augmented generation (RAG) systems, the relationship between retrieved documents and generated responses is critical, yet this evidence–response alignment is not consistently quantified in prior studies.
To address this need, we developed a chatbot designed to provide responses to cancer-related inquiries by grounding its answers in content exclusively from Korea’s National Cancer Information Center. The target audience of the chatbot includes individuals seeking cancer-related health information from publicly accessible online platforms. This study aims not only to develop a RAG-based chatbot grounded in the National Cancer Information Center but also to propose an automatic evaluation framework that quantifies evidence traceability. 16
Methods
Design
A Retrieval-augmented Generation (RAG) based chatbot was implemented by integrating nonparametric memory with a pre-trained Large Language Model (LLM). Designed to minimize hallucinations, the chatbot retains previous user interactions, enabling multiturn conversational dialogue.
17
It generates context-aware responses by incorporating the user’s query and the retrieved evidence from the National Cancer Information Center (Figure 1). Chatbot architecture.
Data source and preprocessing
Cancer information content was collected via API endpoints of the National Cancer Information Center (https://www.cancer.go.kr). The content was used solely for research purposes, was not redistributed as a standalone dataset, and the source was attributed in accordance with the Center’s usage guidance.
Six major cancer types were selected for evaluation: liver, colorectal, gastric, breast, pancreatic, and lung cancer. Raw JSON data were first filtered to include only these cancer types. Documents with identical cancer sequence and menu sequence identifiers were consolidated into unified text entries. HTML tags, boilerplate content, URLs, and user-interface artifacts were removed to ensure clean textual input.
Each consolidated document was then segmented into section-level chunks based on original heading markers (e.g., “###”) to preserve semantic coherence. Document chunks were normalized to reduce excessive whitespace and formatting artifacts. The average chunk length was approximately 183 tokens (median 134 tokens), providing balanced retrieval granularity while maintaining contextual completeness.
Text embedding and vector indexing
Text embeddings were generated using OpenAI’s text-embedding-3-large model (OpenAI; public release: January 25, 2024). All document chunks were converted into vector representations and indexed using FAISS (Facebook AI Similarity Search). Prior to indexing, embeddings were L2-normalized, and similarity search was conducted using inner-product similarity. The resulting vector index enabled efficient top-k retrieval during query processing.
Question prompts were also embedded using the same embedding model to ensure vector-space consistency between queries and documents.
Retrieval strategies
Two retrieval strategies were evaluated: 1. Non-filtering: The top-k documents were retrieved solely based on vector similarity scores. 2. Heuristic Cancer-type filtering: A larger candidate pool (top-k × 20) was first retrieved based on similarity. A heuristic string-matching rule was then applied, retaining only documents that explicitly contained the target cancer type in the title or text. The final top-k evidence items were selected from the filtered subset.
The filtering approach relied on surface-form string matching across multiple fields. This design allowed assessment of whether heuristic cancer-type filtering improves or degrades evidence grounding. In all experiments, k was set to 5.
Response generation
Responses were generated using GPT-4o (OpenAI; public release: May 13, 2024) as the base large language model. 18 A total of 72 prompts were used, consisting of 12 questions for each of six cancer types. This sample size was determined to ensure balanced representation across cancer types. The chatbot outputs were originally generated in Korean and evaluated in their original language. Retrieved evidence text was concatenated and provided as contextual input to the model. The system prompt explicitly instructed the model to answer using only the retrieved evidence and to respond with “insufficient evidence” if the provided context did not support the query. The temperature parameter was set to 0.1 to reduce overly deterministic outputs while maintaining factual consistency in evidence-grounded responses. The generative AI models used in this study, including GPT-4o and text-embedding-3-large (OpenAI), are proprietary and were accessed via the OpenAI API. GPT-4o was used as a base model without any task-specific fine-tuning. The prompts were manually constructed by a single researcher and remained fixed throughout the experiments. No patients or members of the public were involved in the development of the prompts. All queries were conducted in March 2026 in Goyang, Republic of Korea. Each prompt was submitted as an independent query in a separate session without retaining conversational history.
Automatic evidence-traceability metrics
Evaluation criteria for response quality.
Statistical analysis
Because identical question prompts were evaluated under two retrieval conditions, paired comparisons were conducted. As the evaluation metrics were count-based and not assumed to follow a normal distribution, the Wilcoxon signed-rank test was used for primary comparisons.
Mean differences and 95% confidence intervals were estimated using non-parametric bootstrap resampling. Holm correction was applied to adjust for multiple comparisons across evaluation metrics. Statistical significance was defined as a two-sided p-value < 0.05. Reproducibility across multiple runs was not formally evaluated, as each prompt was processed once under controlled conditions.
Supplementary expert evaluation
Expert-based evaluation was conducted to assess the overall response quality of the chatbot. The reference standard for evaluation was established clinical knowledge and the source material used in the retrieval process. Five domain experts participated in the evaluation. The expert panel included cancer specialists (two preventive medicine physicians, a nurse, and a cancer registry program specialist), along with a public health researcher who contributed a broader, population-level perspective.
The experts independently reviewed a subset of chatbot-generated responses. Each response was evaluated using a 5-point Likert scale (0–5) across the following dimensions: accuracy, relevance, completeness, clarity, and consistency. A score of 0 indicated poor quality or incorrect information, and a score of 5 indicated excellent quality fully aligned with established clinical knowledge and the source material. Experts were instructed to assess responses based solely on the information presented in the chatbot output, without access to retrieval logs or system prompts. No discussion among evaluators was permitted during the scoring process to preserve independence. No patients or members of the public were involved in the evaluation process. Blinding was partially applied, as evaluators did not have access to system prompts or retrieval evidence; however, they were aware that the responses were generated by a generative AI-based chatbot.
Results
A total of 72 question–response pairs across six cancer types were evaluated under two retrieval strategies (non-filtering and heuristic cancer-type filtering) using paired comparisons. The non-filtering condition retrieved a higher number of retrieved evidence items than the heuristic filtering condition (mean difference 0.25, 95% bootstrap CI 0.11, 0.39; p<0.001). It also retrieved more verified evidence items (mean difference 1.39, 95% bootstrap CI 1.13, 1.65; p<0.001) and more cancer-matched evidence items (mean difference 1.40, 95% bootstrap CI 1.14, 1.68; p<0.001). The non-filtering condition produced a higher number of evidence-aligned answer sentences (mean difference 0.50, 95% bootstrap CI 0.21, 0.81; p=0.003). In contrast, the total number of answer sentences did not differ significantly between conditions (mean difference 0.32, 95% bootstrap CI −0.06, 0.72; p=0.114). The distributional differences across metrics are visualized in Figure 2, while Table 2 provides a representative question with the corresponding retrieved evidence and generated responses under the filtering and non-filtering conditions. Comparison of evaluation metric distributions between filtering and non-filtering conditions/Box plots show the median and IQR across paired instances. Non-filtering consistently achieves higher scores in evidence-related metrics (retrieved, verified, cancer-matched evidence, and evidence-aligned sentences), while answer sentence count remains comparable across conditions. Representative case comparing retrieved evidence and generated responses under filtering and non-filtering retrieval conditions.
Discussion
The results of this study show that heuristic cancer-type filtering overall degraded both evidence document quality and evidence-grounded answer quality. This degradation was reflected in errors such as missing evidence, partial grounding, or unsupported statements. The non-filtering condition retrieved more verified evidence documents (1.39, p<0.001) and more cancer-matched evidence documents (1.40, p<0.001). Non-filtering also produced a greater number of evidence-aligned answer sentences (0.50, p=0.003). In contrast, total number of sentences did not differ significantly between conditions (p=0.114). Additional expert evaluation of overall chatbot responses is reported in the Supplementary Material.
These findings indicate three key points. First, evidence traceability can be quantitatively evaluated using automatic metrics in a National Cancer Information Center–based RAG chatbot. The observed statistically significant differences across conditions demonstrate that evidence-grounded performance can be compared reproducibly without relying on subjective expert scoring. Second, Responses were generated with similar length, but fewer sentences were directly supported by retrieved evidence. This suggests that filtering affected grounding quality rather than verbosity. Third, even constraints that appear redundant can substantially change outcomes. The documents explicitly contained cancer type information, and the dataset was structurally organized by cancer type. Nevertheless, the addition of heuristic filtering resulted in statistically significant differences. Even within a structured dataset, a small post-retrieval constraint can alter the composition of the top-k evidence set, and such changes can influence generation grounding.
Several limitations should be noted. Because the experiments relied on a single LLM and embedding configuration (GPT-4o with text-embedding-3-large), the observed effects may not fully generalize to other model combinations. While the proposed evidence-traceability metrics provide a quantitative proxy for evidence-groundedness, they do not directly evaluate clinical correctness of the generated answers. This study focused on hard post-retrieval filtering, and the effects of soft ranking strategies remain unknown. Evaluation of potentially harmful, biased, or misleading responses was not conducted, as this study focused on evidence traceability. This represents an important direction for future research.
Despite these limitations, this study suggests several theoretical and practical implications for medical RAG system design. In datasets that are already structurally organized by cancer type, such as the National Cancer Information Center corpus, string-based domain filtering may be unnecessary and may even be counterproductive. Moreover, in settings where the evidence corpus is already expert-verified, ensuring evidence traceability and measurable grounding may be more important than introducing additional heuristic constraints. Finally, automatic traceability metrics can serve as a practical monitoring tool for tracking RAG quality in deployment.
Future work should examine retrieval designs that incorporate soft constraints—such as score weighting or reranking-based approaches—as alternatives to hard post-retrieval filtering. The analysis should be extended across multiple LLMs and embedding models to evaluate the robustness and generalizability of the observed effects. It would also be important to examine whether these retrieval and grounding effects vary across cancer type, particularly in terms of evidence availability and cancer-specific relevance.
Conclusions
This study found that heuristic cancer-type filtering degraded both evidence document quality and evidence-grounded answer quality in a National Cancer Information Center–based RAG chatbot. Non-filtering retrieved more verified and cancer-matched evidence documents and produced more evidence-aligned answer sentences, despite similar response length. Overall, the results suggest that string-based post-retrieval filtering can reduce grounding in structured medical corpora, and that automatic traceability metrics enable reproducible evaluation of evidence-grounded RAG performance.
Supplemental material
Supplemental material - Development of a cancer information chatbot model: Retrieval-augmented generation with data from the national center for cancer knowledge and information
Supplemental material for Development of a cancer information chatbot model: Retrieval-augmented generation with data from the national center for cancer knowledge and information by Eunzi Jeong, Wonjeong Jeong, Eunkyoung Song, Eun Hye Park, Kyoung Hee Oh and Jae Kwan Jun in Digital Health.
Footnotes
Ethical consideration
Author contributions
EJ and WJ designed the study. EJ performed the statistical analyses. WJ and EJ drafted the manuscript. JKJ contributed to the discussion and reviewed and edited the manuscript. EP reviewed and provided comments on the statistical analysis and contributed to manuscript revision. ES and KHO provided assistance in drafting the manuscript. All the authors have read and approved the final manuscript.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Cancer Center Grant (2511622-2). The funding sources did not have interventions such as study design and data interpretation.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The data used in this study were obtained from the National Cancer Information Center (cancer.go.kr) and are publicly available. Code is available from the corresponding author upon reasonable request. A formal study protocol was not registered; however, all methodological details are fully described in the Methods section.
Supplemental material
Supplemental material for this article is available online.
Appendix
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
