Abstract
Clinical trials are vital for advancing care. However, a systematic approach to tracking trial participation across different facilities and sponsors has been lacking. We developed natural language processing (NLP) methods to extract study enrollment history, including enrollment status, consent date, and study title, from clinical trial participation recorded in clinical notes in the national Veterans Affairs (VA) electronic health record. The method exhibited high test-set precision for enrollment status (0.94), consent date (0.97), and study title (0.87) and acceptably high recall (0.76, 0.70, and 0.84, respectively). From a single center, the classifier correctly identified 111 of 125 trial participants (88.8%) across 12 distinct trials. Our study demonstrates the feasibility of using NLP to capture trial enrollment from a nationwide healthcare system. This algorithm creates a novel data resource for analyzing and tracking trial enrollment at the population level.
Introduction
Electronic health records (EHR) are increasingly used to facilitate clinical trial recruitment and follow-up.1 Specifically, for follow-up of trial participants, clinicians and study personnel can obtain event data in real time from the EHR with or without the need for manual review. Thus, various methods have been developed to mark trial participant data in the EHR for long-term follow-up in clinical research.1 When systematically integrated into the EHR on a large scale, these electronic markers of trial participation provide a unique opportunity to study and monitor trial participation at a population level. For example, population-level monitoring of patient demographics (e.g., geographic location and rurality) and trends (e.g., over time and location) can inform efforts to increase clinical trial access and participation.
To facilitate EHR use for clinical research, the United States Veterans Affairs (VA) healthcare system encourages use of "research" notes for documentation of clinical trial participation.2,3 These notes often have research-specific note titles to distinguish them from other patient chart documentation. Research note titles are metadata labels (e.g., "RESEARCH ENROLLMENT NOTE," "RESEARCH CONSENT") assigned by VA clinical staff to distinguish research-related documentation from routine clinical care notes. The VA system is the largest integrated healthcare system in the United States, providing care for over 9 million Veterans each year at over 1,255 facilities across the nation.4 As part of required or recommended documentation for trial participation, the "research" trial enrollment consent note contains both the title of the study for which the patient was consented and the date of consent.5 The study title can provide insight into the study's intent, and when titles are matched to trials known to have recruited at the VA, additional details can be obtained, such as intervention, sponsor, and start and completion dates.
The VA's use of a nationwide shared EHR and shared clinical trial documentation processes enables tracking trial enrollments on a large scale, all linked to the rich clinical data contained within the EHR. This approach will enable in-depth analysis of multifaceted barriers to trial enrollment at various levels: trial, facility, provider, and patient.6 Performing this detailed, multi-level analysis is currently not feasible because existing methods for identifying comprehensive trial enrollment do not link to the EHR,6 and other large-scale resources that cover multiple facilities lack patient-level or facility-level data.7 In this study, we create a generalizable EHR-linked trial enrollment classifier. This EHR-based classifier will systematically identify and categorize trial enrollment, enabling a comprehensive and systematic understanding of enrollment gaps at various levels.
The purpose of the current study is to develop and evaluate a method to extract research study enrollment as recorded in unstructured clinical notes, using data from the VA's national EHR database. We use independently recorded trial enrollments from a medical oncology department at a VA facility to determine the completeness of our trial enrollment capture. We focus specifically on cancer because oncology trials are a leading specialty in clinical trial development and execution, representing over 20% of interventional clinical studies registered on ClinicalTrials.gov, a comprehensive and robust trial registry whose use is mandated by US regulations.8
Methods
Data source and primary dataset
This study is based on EHR data from the nationwide VA healthcare system, which is collated in the VA Corporate Data Warehouse (CDW). The Primary Dataset comprised notes with a title containing the keyword "Research", the most common location for documentation of trial enrollment in the VA EHR.
System architecture
We developed a rule-based system with three components. First, we created a rule-based NLP algorithm, the Enrollment Classifier, to identify the subset of notes in the Primary Dataset that are most likely to record study enrollment. This step distinguishes enrollment notes from other notes, such as those that primarily describe research follow-up appointments. Application of the Enrollment Classifier produces the Refined Dataset, the subset of notes classified as recording a patient's enrollment in a study. Second, we created two rule-based NLP algorithms to extract information from the Refined Dataset about each study a patient enrolled in: (a) the Study Title NLP, which extracts the title of the study the patient enrolled in, and (b) the Consent Date NLP, which extracts the date of the patient's consent to participate in the study. These three algorithms (the Enrollment Classifier, the Study Title NLP, and the Consent Date NLP) are described in the following sections. Source code embodying the algorithms is described in Code Availability below.
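Schematically, the three components chain together as follows. This is a minimal sketch with hypothetical function names and placeholder logic, not the production implementation (see Code Availability):

```python
# Hypothetical sketch of the three-component pipeline; function names and
# the placeholder rule are illustrative, not the production VA implementation.
from typing import Optional

def classify_enrollment(note_text: str) -> bool:
    """Enrollment Classifier (rule logic sketched in the next section)."""
    return "consent form was signed" in note_text.lower()  # placeholder rule

def extract_study_title(note_text: str) -> Optional[str]:
    """Study Title NLP (sketched under 'Study title NLP and consent date NLP')."""
    return None  # placeholder

def extract_consent_date(note_text: str) -> Optional[str]:
    """Consent Date NLP; returns a date normalized to YYYY-MM-DD."""
    return None  # placeholder

def run_pipeline(primary_dataset: list[str]) -> list[dict]:
    # Step 1: the Enrollment Classifier yields the Refined Dataset.
    refined = [note for note in primary_dataset if classify_enrollment(note)]
    # Step 2: extract study title and consent date from each enrollment note.
    return [{"study_title": extract_study_title(note),
             "consent_date": extract_consent_date(note)} for note in refined]
```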
Enrollment classifier
To develop the Enrollment Classifier, we randomly selected 300 notes from the Primary Dataset and allocated 50% of the notes (150 documents) as the training set for Enrollment Classifier development and the remaining 50% (150 documents) as the held-out test set used for final performance evaluation. All model development work, including all preliminary performance evaluations conducted as part of model development, was conducted in the training set; it was not necessary to create a separate "tuning" or "development" set because we used a manual rule-development process. The 300 notes were split into 3 sets of 100 notes, each annotated by one of our 3 pairs of annotators. Each member of the pair independently labeled their set of 100 notes so that all notes were annotated twice. To measure annotator reliability while accounting for expected chance agreement, we calculated a note-level Cohen's Kappa as the measure of inter-annotator agreement. The note-level Cohen's Kappa averaged across the three pairs of annotators was κ = 0.61 on the initial annotation, demonstrating moderate-to-substantial agreement. Informal review of conflicting annotations revealed disagreement about how to handle notes that mixed information on study enrollment with information on screening for eligibility (before enrollment) and/or study follow-up (after enrollment). Conflicting annotations were adjudicated by a third reviewer, who included all notes that recorded study enrollment, regardless of whether they also included other information. Clinical note annotation was performed using Label Studio, an open-source data labeling tool that provides an easy-to-use interface for annotating data.
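For reference, note-level Cohen's Kappa for one annotator pair can be computed as below; this is a minimal sketch, and the label vectors are illustrative rather than actual study annotations.

```python
# Minimal sketch: note-level Cohen's Kappa for one annotator pair.
# The label lists are illustrative, not actual study annotations.
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

pair_a = ["Enrollment", "Not Enrollment", "Enrollment", "Enrollment"]
pair_b = ["Enrollment", "Not Enrollment", "Not Enrollment", "Enrollment"]
print(round(cohens_kappa(pair_a, pair_b), 2))  # 0.5 for this toy example
```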
Annotators were instructed to label the notes as "Enrollment" or "Not Enrollment" and to annotate the text phrases relevant to making their decision. The annotators used the following annotation guidelines. (a) If the note wording indicated that a patient had enrolled in a study, e.g., "a consent form was signed", it was labeled "Enrollment". (b) Notes detailing patient screening that included phrases such as "patient enrolled" were also labeled "Enrollment". (c) Notes incidentally describing earlier patient enrollment or a patient's progress, e.g., containing phrases such as "visit #2", "3 weeks visit", "6 months visit", were classified as "Not Enrollment". (d) If the note text only referenced a scanned consent document stored elsewhere but did not provide further details, then it was also labeled "Not Enrollment".
We used the training set of annotated notes to build a rule-based enrollment classifier. The training set was reviewed for the presence of repeating patterns consistent with either the note being an Enrollment note (inclusion patterns) or a Not Enrollment note (exclusion patterns). Inclusion patterns included phrases indicating a signature for informed consent or patient agreement to participate in the study. Exclusion patterns included phrases indicating the note was related to a follow-up visit, a patient progress note, or a patient's withdrawal from a study. For each note, the algorithm first searched for relevant exclusion patterns in the note text and assigned the "Not Enrollment" label to matching notes. Next, the algorithm searched for inclusion patterns and assigned the "Enrollment" label to matching notes. Finally, all residual unassigned notes were labeled "Not Enrollment". We iteratively refined the algorithm using the training set.
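A condensed sketch of this decision logic is shown below. The patterns are illustrative placeholders; the full rule set is in the repository cited under Code Availability.

```python
import re

# Illustrative placeholder patterns; the production rule set is larger.
EXCLUSION_PATTERNS = [
    r"\bfollow[- ]up visit\b",
    r"\bwithdr[ae]w(?:al|n)? from (?:the )?study\b",
    r"\bvisit #\d+\b",
]
INCLUSION_PATTERNS = [
    r"\bconsent form was signed\b",
    r"\bsigned (?:the )?informed consent\b",
    r"\bagreed to participate\b",
]

def classify_enrollment(note_text: str) -> str:
    text = note_text.lower()
    # Exclusion patterns are checked first and take precedence.
    if any(re.search(p, text) for p in EXCLUSION_PATTERNS):
        return "Not Enrollment"
    if any(re.search(p, text) for p in INCLUSION_PATTERNS):
        return "Enrollment"
    # Residual unmatched notes default to "Not Enrollment".
    return "Not Enrollment"

print(classify_enrollment("Patient reviewed the study and a consent form was signed."))
```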
Study title NLP and consent date NLP
The Enrollment Classifier was executed on the full Primary Dataset and classified approximately 24% of the notes (1,068,879 notes) as Enrollment; this set of clinical notes constitutes the Refined Dataset. We randomly selected 750 notes from the Refined Dataset and allocated 50% (375 notes) as the training set for Study Title NLP and Consent Date NLP development and the remaining 50% (375 notes) as the test set. The 750 notes were split into 3 sets of 250 notes, each assigned to one of our 3 pairs of annotators; as with the previous annotation scheme, each member of the assigned pair labeled the set independently so that all notes were annotated twice. The token-level Cohen's Kappa was κ = 0.89 for the consent date annotation and κ = 0.93 for the study title annotation. Conflicting annotations were adjudicated by a third reviewer. Figure 1 presents examples of annotated enrollment notes.
Figure 1. Enrollment notes with study title and consent date annotation. The general structure of clinical notes can vary significantly. (A-C) Three de-identified notes are shown with their respective study title (yellow) and consent date (pink) annotated.
Annotators used the following annotation guidelines for Study Title: (a) Annotate the first occurrence of the fully written-out, human-readable study title, if such a title is present. (b) If a full title is not present, annotate the first occurrence of the study acronym, for example, "EVALUATE-HF". (c) If neither is present, annotate the first occurrence of the trial number, IRB number, or other study identifier.
Annotators used the following annotation guidelines for Consent Date: (a) Annotate the first occurrence of the consent or enrollment date. (b) If there is both a generic visit date and a more specific date in regard to consent or enrollment, use the more specific date. (c) Do not annotate a testing date as a consent date. (d) Exclude the time part of the date, if present (e.g., "3/14/15 @ 9:26am").
We used the training set of annotated clinical notes to build a rule-based NLP algorithm. The set was reviewed for the presence of repeating word patterns associated with the protocol title and consent date. These patterns were then used to generate regular expressions to facilitate automated rule-based annotation. The rule-based NLP algorithm was designed to extract information consistent with the annotation guidelines described above.
More specifically, we used two strategies to identify research protocol titles. The first approach employed two regular expression sets, one to detect the start of a protocol title, e.g., "Research study name:", "Name of protocol:", and another to detect the end of a title, e.g., a line break, paragraph end, or subsequent section title. The text between the start and end of the title was annotated as the research protocol title.
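As an illustration of this first strategy, a minimal sketch follows; the start and end patterns shown are examples rather than the complete sets.

```python
import re
from typing import Optional

# Example start/end patterns; the full sets are larger.
TITLE_START = re.compile(
    r"(?:research study name|name of protocol|study title|protocol title)\s*:\s*",
    re.IGNORECASE,
)
TITLE_END = re.compile(r"\n|$")  # line break or end of text

def extract_title(note_text: str) -> Optional[str]:
    start = TITLE_START.search(note_text)
    if not start:
        return None
    end = TITLE_END.search(note_text, start.end())
    return note_text[start.end():end.start()].strip()

note = "Name of protocol: A Phase III Trial of Drug X in Stage IV NSCLC\nConsent..."
print(extract_title(note))  # "A Phase III Trial of Drug X in Stage IV NSCLC"
```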
The second strategy used visual exploration of word order and syntax within high-frequency clinical notes (i.e., near-identical notes likely arising from the same study) to create more accurate regular expression sets. More specifically, this entailed minimizing extraction of unwanted context flanking the title, as well as context surrounding consent dates.
We iteratively refined our rule-based NLP algorithm using the training note set.
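A minimal sketch of the consent-date extraction under the same design is shown below; the cue patterns and supported date formats are illustrative, and the production pattern set is more extensive.

```python
import re
from datetime import datetime
from typing import Optional

# Illustrative cue-plus-date patterns; the production set is more extensive.
DATE = r"(\d{1,2}/\d{1,2}/\d{2,4})"
CONSENT_DATE_PATTERNS = [
    re.compile(r"date of consent\s*:?\s*" + DATE, re.IGNORECASE),
    re.compile(r"consented (?:to the study )?on\s*" + DATE, re.IGNORECASE),
    re.compile(r"signed (?:the )?informed consent (?:form )?on\s*" + DATE, re.IGNORECASE),
]

def extract_consent_date(note_text: str) -> Optional[str]:
    for pattern in CONSENT_DATE_PATTERNS:
        match = pattern.search(note_text)
        if match:
            raw = match.group(1)
            # Normalize to YYYY-MM-DD, trying 4-digit then 2-digit years.
            for fmt in ("%m/%d/%Y", "%m/%d/%y"):
                try:
                    return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
                except ValueError:
                    continue
    return None

print(extract_consent_date("Patient consented on 3/14/2015 at the clinic."))  # 2015-03-14
```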
Performance evaluation
Performance of the Enrollment Classifier, Consent Date NLP, and Study Title NLP was evaluated with reference to their respective test sets. After tabulating the confusion matrix (the number of true positives, true negatives, false positives, and false negatives), standard evaluation measures were computed, including precision, recall, accuracy, and the F1 score.
The Enrollment Classifier is a binary classifier, so tabulation of the confusion matrix was straightforward. For Consent Date NLP, we converted all annotated dates into a normalized form (YYYY-MM-DD) and compared the results by value. For Study Title NLP, we used two different methods to produce the confusion matrix. In the Offset Overlap Method, we considered any non-zero length overlap between the NLP annotation and the human annotation as a match. In the Normalized Levenshtein Distance Method, we used the normalized Levenshtein distance metric with a threshold of 0.7 to evaluate the results. Cases where both the NLP algorithm and the annotator produced a value but these results did not match under the chosen criterion were classified as false positives.
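To make the two Study Title matching criteria concrete, a minimal sketch is shown below; it assumes the 0.7 threshold applies to normalized Levenshtein similarity (1 minus distance divided by the longer string's length).

```python
def offset_overlap(nlp_span: tuple[int, int], human_span: tuple[int, int]) -> bool:
    """Offset Overlap Method: any non-zero-length overlap counts as a match."""
    return max(nlp_span[0], human_span[0]) < min(nlp_span[1], human_span[1])

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def normalized_match(nlp_text: str, human_text: str, threshold: float = 0.7) -> bool:
    """Normalized Levenshtein similarity in [0, 1]; >= threshold is a match."""
    longest = max(len(nlp_text), len(human_text)) or 1
    similarity = 1 - levenshtein(nlp_text, human_text) / longest
    return similarity >= threshold

print(normalized_match("A Phase III Trial of Drug X", "Phase III Trial of Drug X"))  # True
```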
Validation relative to external clinical trial enrollment data
In addition to validating the NLP algorithm's performance relative to the labels our annotators assigned in the held-out test set, we also aimed to assess its performance in identifying individuals confirmed to have enrolled in trials based on gold-standard enrollment data external to our EHR data source. We randomly selected patients known to have consented to specific cancer clinical trials from the medical oncology department of a single VA facility (Durham VA Medical Center), encompassing various trial types, enrollment periods, and sponsors. There was no overlap between these patients and those with notes annotated in the training and test sets described in the prior subsections. In addition to the trial name, we obtained other possible trial identifiers that may be used instead of the study title, such as the protocol number and IRB number. We used NLP to identify trial enrollment status in the validation set as follows. We obtained patient notes from the period of potential trial enrollment, during which a patient could have been enrolled in more than one trial. If no enrollment note was detected by the Enrollment Classifier in the patient's notes during this period, the patient was labeled "not found". If at least one NLP-derived trial title matched a recorded trial identifier (title, IRB number, or protocol number), the patient was labeled "found-match". If the NLP study title did not match any recorded study identifier, the patient was labeled "found-nonmatch".
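The validation labeling logic can be summarized as follows; this is a simplified sketch in which identifier matching is reduced to case-insensitive substring comparison.

```python
from typing import Optional

def label_patient(nlp_titles: list[Optional[str]], trial_identifiers: list[str]) -> str:
    """Assign a validation label per the scheme above.

    nlp_titles: NLP-derived titles from the patient's detected enrollment notes
    trial_identifiers: recorded title, IRB number, and protocol number
    """
    if not nlp_titles:
        return "not found"  # no enrollment note detected during the period
    ids = [ident.lower() for ident in trial_identifiers]
    for title in nlp_titles:
        if title and any(ident in title.lower() for ident in ids):
            return "found-match"
    return "found-nonmatch"

print(label_patient(["A Phase III Trial of Drug X (IRB# 01234)"], ["IRB# 01234"]))
```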
Code availability
Code for the complete system, including the Enrollment Classifier, the Consent Date NLP, and the Study Title NLP, is available at https://github.com/bostoninformatics/.
Results
Performance evaluation results
Evaluation results. Evaluation measures of the enrollment classifier, the consent date NLP, and the study title NLP relative to their respective test sets are provided, including the number of true positives, true negatives, false positives, false negatives, precision, recall, accuracy, and F1 score. For the study title NLP, evaluation results based on both the offset overlap method and the normalized Levenshtein distance method are displayed, as defined in the methods.
Error analysis
Error analysis. False positive error analyses for the Study Title NLP and Consent Date NLP are shown. Each false positive in the test set was reviewed and classified by error type. For the Study Title NLP, error analysis is shown only relative to evaluation with the offset overlap method, since performance was similar with the normalized Levenshtein distance method.
For the Study Title NLP false positives, the majority (29 of 40, i.e., 73% of the false positives) were only technically false positives, meaning that while the extracted span did not exactly match the annotated target used for evaluation, the algorithm still correctly captured a valid study identifier. These comprised 15 errors where the study number or IRB number was extracted instead of the title, 7 where a correct or partial title was matched in a different part of the note than the annotated span, 4 where the study title abbreviation was matched even though the full title appeared alongside the abbreviation in the same sentence, and 3 where the correct study title was captured from the wrong note type (e.g., a termination note or progress note). If these errors were reclassified as true positives, the precision of the Study Title NLP would be 0.96.
For Consent Date NLP false positives, all but one (4 of 5 false positives, or 80%) captured the correct consent date but in a context that did not align with our guidelines. If these errors were reclassified as true positives, the precision of the Consent Date NLP would be 0.99. We also analyzed Consent Date NLP false negatives. The most common cause of missed consent dates was a consent date appearing in a context not seen in the training set; the NLP algorithm relies on an extensive set of specific patterns that capture the majority of consent dates seen in the training data but limits its use of wildcard matching to reduce the number of false positives. The second most common cause was unresolved coreference: for example, the date of the initial study participation discussion is mentioned in one part of the document, while another part states that the patient consented on that date. Other causes included misspelled dates (e.g., a year written with three digits) and dates in obscure formats not recognized by the NLP as valid dates.
Validation relative to external clinical trial enrollment data
Missingness analysis of participants in cancer clinical trials from a single center. Comparison of recorded enrollments from the medical oncology department of a single VA center from 2000 to 2023, representing 12 distinct trials.
§Includes all trials that enrolled within the given time period. Trials did not have to recruit for the entire duration of the listed time period. Trials that span multiple time periods are split by the years of enrollment of the individual trial participants. Specific dates of enrollment were not recorded externally, so the "total patients enrolled" and "enrollment not found" columns have been purposefully left blank for those rows.
Application to full dataset
Enrollment note count based on information content. The number of notes identified in the primary dataset, the refined dataset, and those with a study title, a study title and consent date, and a study title but no consent date are presented. The number of distinct patients represented in these notes is also shown.
Among the 335,704 patients with both a Study Title and Consent Date, the vast majority (271,842; 81%) had only one record of enrollment, 42,140 (13%) had a record of enrollment in two studies, and 21,722 (6%) had a record of enrollment in three or more studies. For patients with multiple enrollments, the median time between enrollments was 796 days (mean, 1,226 days).
Discussion
In this study, we developed and evaluated an automated and accurate approach to extracting clinical study enrollment information from unstructured clinical notes using NLP methods. We chose a rule-based approach over alternatives such as large language models because it is more interpretable and was feasible to deploy in our environment. We found that rule-based NLP methods performed very well in identifying study titles and consent dates. In fact, most of the errors identified through our pre-specified evaluation measures still captured useful information, such as the second rather than the first mention of a study title, or an abbreviation of a title rather than the full title. Using data from a single center, we also determined that the NLP captured known trial enrollments across a variety of trial types, sponsors, and enrollment periods.
Our work complements prior work on information extraction related to clinical trials. First, a large body of literature investigates using information extraction techniques to determine patient eligibility for clinical trials. This contrasts with our work, which identifies patients already enrolled in past trials rather than determining eligibility for future ones. For example, numerous methods have been developed to parse unstructured trial documentation into structured eligibility rules,10,11 while others use structured eligibility criteria to automatically extract information from the EHR to identify potentially eligible patients.12,13 More recently, machine learning and large language models have been used to combine these steps,14,15 and dedicated user-facing applications like MatchMiner and the VA's Matching Patients to Accelerate Clinical Trials (MPACT) system have been developed to integrate these processes into easy-to-use prescreening interfaces.16,17 A second major area of work involves extracting information about clinical trial design or characteristics from publications of trial results. For example, methods have been developed to extract study characteristics such as eligibility criteria, sample size, treatments, and outcomes from publications reporting clinical trial results,18 while other work examines published trials to evaluate empirical barriers to clinical trial enrollment.19 This line of research differs from our own, as it focuses on extracting summary data from publications, whereas our method extracts individual-level data from the EHR.
LLM approaches have recently shown strong performance on information-extraction tasks, and some of the errors we observed, such as mismatches arising from ambiguous context, might occur less often with LLMs.20,21 However, rule-based methods retain important operational advantages, including minimal computational requirements, easier deployment at national VA scale, fewer dependencies, and full auditability. Future work can compare this system with LLM-based extraction to assess potential accuracy gains and explore hybrid approaches that balance performance with operational constraints.
Limitations of the study include the following. First, we only attempt to extract study enrollment information from EHR notes with a title that includes the keyword "Research", since this is by far the most common location where such data are captured in the VA's EHR. Expansion to other note types, such as oncology notes, could potentially capture additional trial enrollment data from non-standard locations and improve sensitivity, but would also likely capture more spurious information, such as consent to non-research procedures. Continued standardization efforts in trial enrollment reporting can improve data accuracy. To apply our approach in another healthcare system, similar pre-selection of notes likely to include trial enrollment information will be an important factor for generalizability, and further refinement of the rules may be necessary to optimize performance.
Conclusion
We developed and evaluated a method to extract research study enrollment history as recorded in unstructured clinical notes. The method exhibited strong performance in a held-out test set and an external validation set. This method establishes a unique data source for studying nationwide, population-level barriers to clinical trial enrollment. Unlike existing resources, it encompasses clinical trial participation across multiple sponsors, including industry and federal entities, and incorporates patient- and facility-level data. This resource enables comprehensive analysis of multi-level barriers to trial enrollment, which can yield insights to guide interventions to boost trial accrual and enhance access.
Footnotes
Ethical considerations
The analyses for this report were conducted as part of a VA project to support clinical operations that was classified as non-research by the VA Boston Healthcare System Research and Development Committee.
Consent to participate
Informed consent was not applicable as the project was classified as non-research by the VA Boston Healthcare System Research and Development Committee.
Author contributions
DCE, NVD, and NRF conceived the study. SG, EL, JL, CY, and DC annotated data. SG, EL, RZ, JL, and JC conducted data analysis. SG, JTW, EL, DRF, RD, DCE, NVD, MTB, and NRF interpreted data. SG, RD, EL, JTW, and NRF drafted the manuscript. All authors critically edited the manuscript.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the U.S. Veterans Affairs (VA) Cooperative Studies Program (MTB, NVD, NRF) and the VA Boston Medical Informatics Fellowship (EL). The views expressed are those of the authors and do not necessarily reflect the position or policy of the Department of Veterans Affairs nor the United States government.
Declaration of conflicting interests
The authors declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: Research funding from Bayer and Merck unrelated to the present work (JL, NRF; research funds to institution).
Data Availability Statement
The United States Department of Veterans Affairs (VA) places legal restrictions on access to veterans' health care data, which include both identifying data and sensitive patient information. The analytic data sets used for this study are not permitted to leave the VA firewall without a Data Use Agreement. This limitation is consistent with other studies based on VA data. However, VA data are made freely available to researchers behind the VA firewall with an approved VA study protocol. For more information, please visit
or contact the VA Information Resource Center (VIReC) at
