Abstract
Background/Aims
Clinical trials require numerous documents to be written: protocols, consent forms, clinical study reports, and many others. Large language models offer the potential to rapidly generate first-draft versions of these documents; however, there are concerns about the quality of their output. Here, we report an evaluation of how well large language models generate sections of one such document, the clinical trial protocol.
Methods
Using an off-the-shelf large language model, we generated protocol sections for a broad range of diseases and clinical trial phases. We assessed each of these document sections across four dimensions:
Results
We find that the off-the-shelf large language model delivers reasonable results, especially when assessing
Discussion
Our results suggest that hybrid large language model architectures, such as the retrieval-augmented generation method we utilized, offer strong potential for clinical trial-related writing, including a wide variety of documents. This is potentially transformative, since it addresses several major bottlenecks of drug development.
Keywords
Background and aims
During clinical trials, large volumes of documents need to be written, including protocols, amendments, patient informed consent forms, clinical study reports, and many others. These documents are critically important for the planning and execution of trials and are often required by regulation; therefore, high-quality writing is essential. Specifically, clinical trial documents must be scientifically and clinically precise and accurate, with correct use of terminology, and must contain appropriate references to the literature, regulatory guidelines, and other documents. Due to these stringent requirements, sponsors of clinical trials spend considerable time and resources on trial-related writing. For example, most large pharmaceutical companies each employ tens to hundreds of medical writers and reviewers.1 Even with these resources, it often takes organizations a long time to write, review, and finalize clinical trial documents. As an illustration, a clinical trial protocol typically has 50–150 or more pages and can take 3–6 months or longer to prepare.2 A substantial proportion of this time is due to the writing and reviewing process, to ensure that the document achieves the high quality expected. As a result, writing is one of the major rate-limiting steps in the development process. With pharmaceutical companies under pressure to accelerate trials3,4 and to submit regulatory documents faster, there is strong interest across the industry in using new technologies and approaches to speed up trial-related writing.
In the past few years, large language models (LLMs), a new class of generative artificial intelligence algorithms, have advanced to a point where they can produce near-human-quality writing.5 Since the arrival of ChatGPT,6 the first widely used tool built on LLMs, there has been interest in using these algorithms in the context of clinical trials. Examples include enhancing patient-trial matching,7 clinical trial planning,8 assisting in medical writing tasks,9 and others. While it is early days for these efforts, we see signs of great potential but also challenges, such as accuracy and potential biases in the LLMs, as well as concerns about robustness and reproducibility.10,11 To help address some of these questions, we describe a framework for document quality evaluation, and we report an analysis of LLMs in the context of clinical trial-related writing.
Methods
Our assessment focused on GPT-4, one of the leading LLMs available today,12 utilizing it to generate key sections of clinical trial protocols. The LLM output was subsequently assessed in terms of writing quality. Specifically, we analyzed four dimensions:

Overview of methodology and approach used in this analysis. (a) Typical use of off-the-shelf LLMs. (b) Retrieval-augmented generation (RAG) methodology for enhancing LLMs. (c) ClinEval methodology for assessing the output of large language models (LLMs). Further details are described in the “Methods” section and in the supplementary information.
Our analysis does not make direct comparisons between LLM-written text and fully human-written text. This is because, from our experience in the field, there is often substantial variability between individual human writers. It is therefore challenging to establish a single, objective “ground truth” to compare against. Our evaluation framework, with its four dimensions described above, addresses this challenge by breaking down the assessment into discrete sub-dimensions which can be assessed objectively.
Our assessment targeted two key sections of a clinical trial protocol document: the
Both the off-the-shelf GPT-4 and RAG-augmented GPT-4 were prompted with a natural-language user query of the form “Write the {
The evaluation process was strictly identical for both the off-the-shelf and RAG-augmented LLMs, and involved a combination of algorithmic and human expert-based scoring. Briefly, the algorithmic assessment consisted of prompting GPT-4, used as an evaluator LLM, with the generated protocol section or an individual section element (i.e. an endpoint or an eligibility criterion) and asking it to provide a binary score for each sub-dimension based on a specific list of requirements (detailed in Supplementary Table 2). For every protocol section that had been generated, metrics for each dimension were then obtained as an average of the scores of the relevant sub-dimensions. These requirements were developed in consultation with internal and external experts and aim to capture best human practices in clinical protocol writing.
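The algorithmic scoring step described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the two requirement prompts are hypothetical stand-ins for the Supplementary Table 2 checklist, and `call_evaluator_llm` is a stub where a real system would issue a GPT-4 API call and parse a yes/no answer.

```python
from statistics import mean

# Illustrative sub-dimension requirements (placeholders, not the actual
# checklist from Supplementary Table 2).
REQUIREMENTS = {
    "terminology": "Does the text use correct clinical terminology?",
    "guidance": "Does the text align with applicable regulatory guidance?",
}


def call_evaluator_llm(section_text: str, requirement: str) -> bool:
    """Stub for prompting an evaluator LLM (e.g. GPT-4) with the generated
    section and one requirement, returning a binary judgment. A real
    implementation would send a prompt and parse the model's yes/no reply."""
    # Stubbed here so the sketch is self-contained and runnable.
    return True


def score_section(section_text: str) -> float:
    """Dimension metric: average of binary sub-dimension scores."""
    scores = [
        1.0 if call_evaluator_llm(section_text, req) else 0.0
        for req in REQUIREMENTS.values()
    ]
    return mean(scores)


print(score_section("Primary endpoint: overall survival at 24 months."))
```

Because the evaluator returns binary judgments per sub-dimension, averaging them yields the dimension-level metric described above.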
Each of the two models generated a total of 140 document sections, covering protocols for 14 diseases across different phases of clinical trials (see Supplementary Table 3). The scores are presented as percentages indicating the mean score achieved by the generated documents across all diseases, phases, and section types. Because of the non-deterministic nature of LLMs, we performed five repetitions for each query combination. This approach mitigates the impact of inherent randomness in the model’s responses. Statistical tests were performed to assess whether differences in performance metrics between the two models were statistically significant.
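The aggregation across repetitions and query combinations can be sketched as follows; the per-repetition scores below are invented for illustration (the real data comprise five repetitions for each disease/phase/section combination).

```python
from statistics import mean

# Hypothetical per-repetition scores for two query combinations
# (disease, phase, section type); five repetitions each, as in the study.
scores = {
    ("tuberculosis", "phase3", "eligibility"): [0.8, 0.7, 0.9, 0.8, 0.8],
    ("asthma", "phase2", "endpoints"): [0.9, 1.0, 0.9, 0.8, 0.9],
}

# Averaging over repetitions mitigates run-to-run randomness of the LLM.
per_combination = {combo: mean(reps) for combo, reps in scores.items()}

# Headline metric: mean across all combinations, reported as a percentage.
overall_pct = 100 * mean(per_combination.values())
print(f"{overall_pct:.1f}%")
```

The same averaging applies over all 28 combinations (14 diseases, 2 section types) per phase in the actual analysis; statistical testing between the two models would then be run on these per-combination means.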
Results
An overview of the results of our assessment is shown in Figure 2. Overall, we find that the off-the-shelf LLM delivers reasonable results, specifically good

Comparison of off-the-shelf LLM and RAG-augmented LLM. Further information is described in the “Methods” section and supplementary information.
As an illustrative example, when we asked the algorithm to draft a Phase 3 protocol for tuberculosis, the off-the-shelf LLM suggested in the eligibility section to exclude patients with human immunodeficiency virus (HIV)/acquired immunodeficiency syndrome, diabetes, liver disease, and kidney disease. This contrasts with regulatory guidance documents, which state that “Sponsors should include in trials […], subjects with renal insufficiency, diabetes mellitus, and subjects with hepatic impairment, if feasible. Because of the high incidence of tuberculosis in patients coinfected with HIV, subjects with HIV should be included in trials.”15
The output of the RAG-augmented LLM (Figure 2) shows high
As detailed in Supplementary Table 2, those trends were similar across different protocol sections, with a marked improvement in
Discussion
Across both endpoints and eligibility criteria sections, we find that the off-the-shelf LLM produces seemingly well-written content, as reflected by high scores in
To address these challenges, we explored alternative approaches to using LLMs, specifically RAG, which has emerged as a promising methodology for incorporating knowledge from external databases.13,14 RAG involves providing the LLM with external sources of knowledge, to supplement the model’s internal representation of information.14 As a result of the RAG methodology, the LLM is primarily used not for its memorized knowledge, but instead for its ability to read, synthesize, and evaluate information provided to it.
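The core RAG idea can be sketched minimally as follows. Everything here is an illustrative simplification: the two-document knowledge base is hypothetical, retrieval is by naive keyword overlap (a production system would typically use embedding-based search over real guidance documents), and `call_llm` is a stub for a real LLM API call.

```python
import re

# Hypothetical external knowledge base; in practice this would hold
# regulatory guidance, prior protocols, and similar reference documents.
KNOWLEDGE_BASE = [
    "Regulatory guidance for tuberculosis trials: subjects with HIV "
    "should be included where feasible.",
    "General guidance on statistical analysis plans for clinical studies.",
]


def tokenize(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, k: int = 1) -> list:
    """Rank knowledge-base documents by keyword overlap with the query."""
    q = tokenize(query)
    ranked = sorted(
        KNOWLEDGE_BASE, key=lambda doc: len(q & tokenize(doc)), reverse=True
    )
    return ranked[:k]


def call_llm(prompt: str) -> str:
    """Stub for a real LLM call (e.g. GPT-4 via an API)."""
    return f"[draft generated from a prompt of {len(prompt)} characters]"


def rag_generate(user_query: str) -> str:
    """Prepend retrieved external knowledge to the prompt, so the LLM is
    used for its ability to read and synthesize supplied information
    rather than for its memorized knowledge."""
    context = "\n".join(retrieve(user_query))
    prompt = f"Context:\n{context}\n\nTask: {user_query}"
    return call_llm(prompt)


print(rag_generate("Write the eligibility criteria for a tuberculosis trial"))
```

The design choice is the key point: because the relevant guidance is supplied in the prompt at generation time, the draft can ground statements such as the HIV-inclusion requirement in the retrieved source rather than in the model's training data.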
In our assessment, the use of RAG augmentation produced high scores for both
While the results shown in Figure 2 are intriguing, it is important to acknowledge a number of limitations of this work: First, the evaluation framework we report is a mixture of quantitative scores (
In summary, our results suggest that hybrid LLM architectures, such as the agent-based RAG methodology we used, offer strong potential for GenAI-powered clinical trial-related writing, potentially covering a wide variety of documents. This is exciting, since it addresses several major bottlenecks of drug development. Indeed, when we applied the RAG-augmented LLM approach in the context of recent clinical trials, we observed dramatic acceleration. For writing tasks, such as protocols or clinical study reports, the time to generate first draft versions of documents is typically reduced from days or weeks (in the case of fully human writing) to minutes (when using a RAG-augmented LLM). For the end-to-end document creation process, which normally consists of multiple cycles of writing and review, we observe time reductions of 25%–50% or more, depending on which document is being created. The time reduction for the end-to-end process is somewhat smaller than for writing alone because review by human experts is always required.
Beyond the writing abilities of LLMs in clinical trials, which our work demonstrates, there are a number of practical considerations which pharmaceutical companies and other trial sponsors will need to address. First and most importantly, there are questions about the ethical use of LLMs and other GenAI tools in the context of clinical trials. If sponsors wish to utilize these tools in the design of clinical trials and the writing of documents, it is imperative that there be responsible human oversight. Clinical trial experts and medical writers will need to be “in the loop” to ensure that trials are designed and executed safely and following all applicable rules and guidelines. Second, there are regulatory questions. The US FDA and other agencies have outlined their plans to regulate artificial intelligence in medical products, including building relevant infrastructure and technical expertise.18 As the regulatory framework evolves, sponsors of clinical trials will likely need to adapt their LLM and other tools in clinical trial writing and other processes. Third, there are talent and capability considerations. Hiring and retaining suitable expertise, ideally combining clinical trial knowledge and GenAI technical experience, is critical for these efforts to succeed. A recent white paper, jointly authored by major pharmaceutical companies, highlights talent as a major challenge for the industry.19 Fourth, there are questions regarding technical readiness. In recent years, many pharmaceutical companies and other trial sponsors have made major investments in their data analytics platforms and in data partnerships.19 However, from the work reported here, we learned that configuring these technologies, and ingesting, integrating, and analyzing the required data sources is often challenging.
Despite these challenges, our experience of beginning to deploy LLMs in a number of real-life settings suggests strong potential for accelerating and improving clinical trial-related writing. These benefits typically require strong collaboration between medical writers, clinical researchers, data scientists, and data engineers. Organizations that achieve this cross-functional collaboration are already beginning to reap significant acceleration gains. Going forward, we expect these benefits to increase even further. Over time, we therefore expect that sponsors of clinical trials will adopt LLM technology in their clinical and other writing tasks.
Supplemental Material
Supplemental material, sj-pdf-1-ctj-10.1177_17407745251320806, for “From RAGs to riches: Utilizing large language models to write documents for clinical trials” by Nigel Markey, Ilyass El-Mansouri, Gaetan Rensonnet, Casper van Langen and Christoph Meier in Clinical Trials.
Footnotes
Acknowledgements
The authors thank their colleagues Dr Jennifer Griffin and Dr Souparno Bhattacharya for their assistance in conducting this assessment.
Author contributions
N.M.: Methodology, Data processing, LLM and other analysis, and Manuscript writing, review, and editing.
C.v.L.: Methodology, Data processing, LLM and other analysis, and Manuscript writing, review, and editing.
G.R.: Methodology, Data processing, LLM and other analysis, and Manuscript review and editing.
I.E.-M.: Methodology, Data processing, and LLM and other analysis.
C.M.: Methodology, LLM and other analysis, and Manuscript writing, review, and editing.
All authors contributed to the article and approved the submitted version.
Declaration of conflicting interests
The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: The authors of this article are employees of The Boston Consulting Group (BCG), a management consultancy that works with the world’s leading companies.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research for this article was funded by BCG’s Health Care practice and by BCG X, the firm’s in-house data science unit.
Supplemental material
Supplemental material for this article is available online.
References