
Abstract

Objectives:

The objective of the Prediction Augmented Screening Initiative (PASI) pilot application was to design and implement a clinical tool to optimize the lung cancer screening (LCS) workflow for providers. The Boston Informatics Group (BIG) at the Department of Veterans Affairs (VA) developed the Enabling Technologies for Rapid Learning Health Systems Platform (ENTHRALL) to support delivery of knowledge in a Learning Health System (LHS) framework. The BIG leveraged ENTHRALL to implement the PASI pilot application on a very short timeline. The application uses VA data to estimate patients’ benefit from LCS based on National Cancer Institute (NCI) models, allowing proactive outreach to patients with high predicted benefit from LCS.

Methods:

The application was designed utilizing ENTHRALL infrastructure, including optimized nightly data pulls to gather patient information, Natural Language Processing to extract smoking history, and a user interface (UI). Cross-functional collaboration allowed the use of the NCI’s peer-reviewed prediction algorithm to provide daily patient benefit scores.

Results:

The UI displays patients in descending order of benefit, delivering a prioritized list to providers. Clinicians can fill in information and track patient status to assist with their outreach activities. For the pilot, only patients meeting USPSTF LCS criteria (the current field standard) were displayed. Five VA stations were included.

Conclusions:

Utilizing the VA BIG’s ENTHRALL framework for an LHS, the group demonstrated its ability to design and deliver a new application within 3 months of inception, which was successfully utilized at 5 VA hospitals. The VA’s capability to rapidly build clinically relevant applications will help it become an LHS tailored to current problems impacting the Veteran. Due to the success of the pilot, the clinical research team received approval to expand their study. The BIG is working on a non-pilot build.

Keywords

risk-prediction, natural language processing, data structures, computer science, screening, statistical algorithms, predictive modeling, bioinformatics, clinical data

Introduction

Lung cancer remains the leading cause of cancer-related deaths in the United States, particularly among Veterans, who are at a heightened risk due to factors such as age, smoking history, and exposure to environmental toxins during military service. Early detection through screening is crucial to improving survival rates and reducing avoidable deaths, but despite its importance, lung cancer screening (LCS) is not uniformly utilized within the Veteran population.¹ To address this challenge, this paper presents an overview of the development and implementation of a risk prediction algorithm-based application designed to support the initial phase of a planning project, the Prediction Augmented Screening Initiative (PASI). PASI aims to identify patients at high risk for lung cancer who might benefit most from screening. This allows clinicians to prioritize screenings for those patients and intervene early when it can have the greatest impact.

The initial phase of PASI was executed as a pilot project with limited resources, reflecting a scaled-down version of what a fully funded initiative might achieve. Despite these constraints, significant progress was made in a short time frame. The pilot serves as an essential step in demonstrating the potential of dynamic risk prediction-driven solutions in Veteran healthcare and sets the stage for future improvements, including a PASI research trial that will be conducted by the clinical research team. The development of these risk prediction algorithms for healthcare applications, especially in screening services, is growing rapidly. However, there remains a gap between the availability of algorithms and their active use in clinical practice. Clinicians can be hesitant to adopt algorithm-based systems due to concerns over efficacy and reliability, emphasizing the need for rigorous validation of algorithms.² Our application, built for LCS population management through PASI, offers an example of how externally-validated technical tools can be introduced in a clinical setting like the Veterans Health Administration (VHA), which is committed to patient-centered care.

The increasing prominence of cancer screening highlights the importance of this application. Early screening not only saves lives but also reduces long-term healthcare costs, particularly in systems like the VHA where the efficient use of government resources is key.³ Current literature supports the notion that prevention, rather than treatment alone, can be beneficial for both patients and healthcare systems in the long term.⁴,⁵ Pursuing primary prevention aligns with the VA’s broader objective of implementing strategies that provide both high-quality care and cost savings. Screening is central to this approach, with LCS offering a particularly high-impact opportunity due to the disease’s prevalence among Veterans.⁶

Lung cancer screening tools built on algorithms, artificial intelligence (AI), machine learning (ML), and other novel technology-based approaches have traditionally relied on vast amounts of validated data. Electronic health records (EHRs) are critical for amassing the high-quality data sets that support these tools.⁷ While many of these models provide evidence-based thresholds for screening and can help identify high-risk patients, their implementation at the point of care remains limited due to workflow integration challenges and lack of real-time clinical decision support.⁸ Unlike some existing systems, PASI was developed specifically for integration within the VA’s clinical infrastructure, using real-time clinical data, the VA’s expansive EHR system, and natural language processing (NLP) to continuously update patient profiles. The easy operationalization of the application allows VA data systems to be leveraged in creating a transparent, clinician-facing tool that supports screening decisions at the point of care.

The pilot application supporting PASI is also a prime example of the VA’s long-term vision to create a Learning Health System (LHS) and leverages the Enabling Technologies for Rapid Learning Health Systems Platform (ENTHRALL) developed by the VA Boston Informatics Group (BIG) to support delivery of knowledge in an LHS. LHS frameworks focus on continuously improving patient care by learning from every patient interaction and using data to inform clinical decision-making.⁹ This pilot contributes to that vision by demonstrating how risk prediction algorithms can play a role in ongoing system improvements, facilitating data-driven insights to refine screening and treatment protocols.¹⁰ An LHS relies on leveraging data integration for continuous learning and improvement¹¹; this pilot application is a practical embodiment of these features, and it can offer insight into how such systems can be effectively applied within the VA. The PASI pilot application, preceded by the BIG’s MPACT application, provides real-time data that is incorporated into patient care through the use of algorithms and NLP to implement a dynamic risk prediction model for lung cancer.¹² The application employs this model to identify patients with high lung cancer risk, establishing a cohort of Veterans that would most benefit from early screening. This instant feedback loop, a key component of an LHS, aligns the project with the VA’s mission of bridging the gap between bench and bedside.¹³

In summary, this pilot project not only advances the VA’s goal of enhancing LCS and prevention but also contributes to the LHS. This study’s focus on Veterans, a population that stands to benefit significantly from improved screening protocols, provides a compelling case for the continued development of algorithm-driven, data-focused solutions in clinical settings. The work presented here lays the groundwork for more expansive projects that can further solidify the role of algorithms in improving healthcare outcomes for Veterans.

Methods

The PASI pilot application was built using ENTHRALL infrastructure.¹³ This includes Agile development methodology, defined standard operating procedures for gathering and processing user requirements, and optimized nightly data pulls which support the multiple applications under ENTHRALL.

The development process was an iterative feedback loop between the clinical research team, ML subject-matter experts (SMEs), and the BIG team. This cross-disciplinary collaboration allowed clinical needs to be translated into technical requirements. Utilizing common development tools such as wireframes and GitHub tickets, a technical project leader mediated discussions between the groups, ensuring that consensus was reached efficiently and allowing rapid development of high-impact technology. Following the framework of an LHS, this kind of cross-department collaboration allowed research advancements to be integrated into clinical care more rapidly. Figure 1 shows the initial wireframe that was agreed to by the development and clinical research teams. Although some aspects of it were changed during development, it provided a baseline understanding between the groups.

Figure 1.

The initial wireframe agreed on by both the clinical and engineering teams. It uses fake patient data, so privacy is preserved.

The PASI pilot application can be broken into 4 main pieces: the optimized nightly data pulls, an element of the ENTHRALL infrastructure; a database, fed by those nightly pulls; an external prediction algorithm¹⁴; and a user interface (UI) designed to display patients in descending order of benefit, allowing providers to prioritize those with the greatest predicted benefit from LCS. Figure 2 shows an overview of how these pieces connect.

Figure 2.

This diagram shows the 4 sections of the application – the nightly data pulls, the application database, the prediction algorithm, and the UI.

ETL Data Pulls and Database: The optimized nightly data pulls are structured to support the inclusion and exclusion criteria of the project. For this application, the Extract, Transform, and Load (ETL) processes create the project cohort and provide the data needed by the prediction algorithm. The eligibility criteria – the data conditions a patient needs to be included in the project – were determined by the clinical research team at the beginning of the planning phase. To be included in the PASI pilot application, patients need to have (1) a history of smoking; (2) Health Factor data available about the length and intensity of their smoking; (3) never been diagnosed with lung cancer; and (4) not been counseled or screened for lung cancer in the last 18 months. Further restrictions required patients to have primary care access at one of the 5 VA stations included in the pilot.
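The eligibility rules above can be sketched as a small check. This is a minimal illustration in Python, not the project's SQL/SSIS implementation: the record fields, the station identifiers, and the 18-month approximation are all assumptions introduced here for clarity.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

# Hypothetical station identifiers; the real station codes are not given in the text.
PILOT_STATIONS = {"site_1", "site_2", "site_3", "site_4", "site_5"}

# Rough 18-month look-back (assumption: the exact date arithmetic is not described).
LOOKBACK = timedelta(days=18 * 30)


@dataclass
class PatientRecord:
    has_smoking_history: bool
    has_smoking_health_factors: bool  # length and intensity of smoking available
    lung_cancer_diagnosed: bool
    last_lcs_contact: Optional[date]  # most recent LCS counseling or screening, if any
    primary_care_station: str


def ineligibility_reasons(p: PatientRecord, today: date) -> List[str]:
    """Return every criterion the patient fails; an empty list means eligible."""
    reasons = []
    if not p.has_smoking_history:
        reasons.append("no smoking history")
    if not p.has_smoking_health_factors:
        reasons.append("no smoking Health Factor data")
    if p.lung_cancer_diagnosed:
        reasons.append("prior lung cancer diagnosis")
    if p.last_lcs_contact is not None and today - p.last_lcs_contact < LOOKBACK:
        reasons.append("LCS counseling or screening in the last 18 months")
    if p.primary_care_station not in PILOT_STATIONS:
        reasons.append("not at a pilot station")
    return reasons
```

Returning the list of failed criteria, rather than a single yes/no, makes it straightforward to store a reason whenever a patient drops out of the cohort.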

Patient information is extracted from the Corporate Data Warehouse (CDW), the VA’s national repository of EHR data, using SQL scripts run through nightly SQL Server Integration Services (SSIS) executions. The results are stored in a SQL database designated for the PASI pilot application. Clinical researchers and medical SMEs were consulted in the development of the data-pull scripts to ensure the proper variables were identified. Beyond collecting the information needed to determine patients’ inclusion in the application, the data pulls also gather the data needed by the external prediction algorithm, such as environmental exposures, history of non-lung cancers, family history of diseases, other comorbid diseases, and more. Once the cohort has been created and patient data has been pulled into the database, NLP is used to extract packs-per-day and years-smoked data from the smoking history Health Factors. These data pulls run each night, so that the application always has a current calculation of the eligible patients, and those patients have up-to-date information. Sometimes patients become ineligible overnight through the data pull (eg, if they age out of the cohort, participate in LCS, or get a lung cancer diagnosis). When this happens, the prediction algorithm no longer runs on them; they still appear on the UI, but without risk scores. Their reason for ineligibility is recorded and saved in the database for future analysis.
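The overnight handling of newly ineligible patients can be sketched as follows. This is an illustrative Python outline under stated assumptions, not the project's pipeline: the row layout, the `score_fn` callback, and the function name are hypothetical.

```python
from typing import Callable, Dict, Optional, Set


def nightly_refresh(
    rows: Dict[str, dict],
    eligible_ids: Set[str],
    reasons: Dict[str, str],
    score_fn: Callable[[str], dict],
) -> Dict[str, dict]:
    """Update each stored patient row after the nightly data pull.

    Eligible patients get freshly computed scores. Patients who became
    ineligible stay in the table (so they still appear on the UI) but lose
    their scores, and the reason is kept for later analysis.
    """
    for patient_id, row in rows.items():
        if patient_id in eligible_ids:
            row["scores"] = score_fn(patient_id)
            row["ineligible_reason"] = None
        else:
            row["scores"] = None
            row["ineligible_reason"] = reasons.get(patient_id, "unknown")
    return rows
```

Keeping the row and clearing only the scores preserves any outreach tracking a clinician has already entered for that patient.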

External Prediction Algorithm: The application’s external risk prediction algorithm, designed by the National Cancer Institute (NCI), is provided through an R package named “lcmodels.”¹⁴ The lcmodels function within this package takes 33 specific comorbid conditions from each patient and returns several lung cancer prediction scores. Especially important are the years-smoked and packs-per-day variables extracted from Health Factor data using NLP.

The smoking data processed with NLP were stored in the database as handwritten medical notes, which resulted in inconsistent free-text data. The lcmodels function required clean numerical inputs for these fields, such as years-smoked = 40 or packs-per-day = 0.5. Preparing the natural language data for the algorithm required extensive cleaning and parsing by NLP algorithms. As this pilot-level application was developed on a short timeline, the NLP algorithms were simple regular expression (regex) functions written within the application’s R pipeline. We manually defined natural language expressions that each represented a recognizable data element. For example, “2 ppd” or “2 packs per day” would be assigned the value packs-per-day = 2. Similarly, for years-smoked, expressions like “35 yrs” or “smoked for 35 years” were assigned years-smoked = 35. Regex phrases were only generated for terms that could be reasonably inferred to indicate packs per day and years smoked. No conversions were made for data that was listed in units besides packs per day or years smoked; these patient records had to be excluded. For example, entries like “3 cig/day” or “quit in 2006” were not mathematically adjusted and had to be left out of the pilot phase.

Twenty regex phrases were defined for each of the 2 variables. When patient data was pulled into the application, it was passed through these regex NLP algorithms to be assigned numerical values for packs-per-day and years-smoked. As the lcmodels prediction algorithm requires these 2 features to be present, any patients with data not matching the defined regex were excluded.
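The regex-based extraction described above can be illustrated with a short sketch. The pilot's expressions were written in R and are not reproduced in this paper; the Python patterns below are assumptions chosen only to cover the examples quoted in the text, and all names are hypothetical.

```python
import re
from typing import Dict, List, Optional, Pattern

NUMBER = r"(\d+(?:\.\d+)?)"

# Illustrative patterns only; the pilot defined twenty per variable.
PACKS_PER_DAY_PATTERNS: List[Pattern[str]] = [
    re.compile(NUMBER + r"\s*ppd\b", re.IGNORECASE),
    re.compile(NUMBER + r"\s*packs?\s*(?:per|a|/)\s*day\b", re.IGNORECASE),
]
YEARS_SMOKED_PATTERNS: List[Pattern[str]] = [
    re.compile(r"smoked\s+for\s+" + NUMBER + r"\s*(?:years?|yrs?)\b", re.IGNORECASE),
    re.compile(NUMBER + r"\s*(?:years?|yrs?)\b", re.IGNORECASE),
]


def first_match(text: str, patterns: List[Pattern[str]]) -> Optional[float]:
    """Return the number captured by the first pattern that matches, if any."""
    for pattern in patterns:
        match = pattern.search(text)
        if match:
            return float(match.group(1))
    return None


def parse_smoking_history(text: str) -> Optional[Dict[str, float]]:
    """Extract both smoking variables, or None if either is missing.

    Returning None mirrors the pilot's rule that patients whose notes do not
    match a defined expression are excluded; no unit conversion is attempted.
    """
    packs = first_match(text, PACKS_PER_DAY_PATTERNS)
    years = first_match(text, YEARS_SMOKED_PATTERNS)
    if packs is None or years is None:
        return None
    return {"packs_per_day": packs, "years_smoked": years}
```

With these patterns, a note such as "2 ppd, smoked for 35 years" yields both values, while "3 cig/day" or "quit in 2006" yields no result and the record would be excluded.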

The remaining required input variables were identified with International Classification of Diseases (ICD) codes. The clinical team defined sets of ICD codes that represented features like “dust exposure,” “hay fever,” “diabetes,” “liver condition” and “heart attack.” The lcmodels algorithm required that these be passed in as a binary yes/no (or 1/0) for whether the condition was present or not. For each patient, if they were found to have any of the relevant ICD codes, they were marked as having that condition (condition = 1). If a relevant code was not found, they were marked as not having that condition (condition = 0). The engineering team consulted with one of the lcmodels authors, a statistical SME, to ensure the variables were being properly mathematically defined for use in the algorithm.
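The mapping from ICD codes to binary inputs can be sketched as a set intersection. The code sets below are placeholders for illustration (the clinical team's actual lists are not given in the text), and the function and variable names are hypothetical.

```python
from typing import Dict, Iterable, Set

# Placeholder code sets; the clinical team's actual ICD code lists are not published here.
CONDITION_CODE_SETS: Dict[str, Set[str]] = {
    "diabetes": {"E10.9", "E11.9"},
    "heart_attack": {"I21.9"},
    "hay_fever": {"J30.1"},
}


def condition_flags(
    patient_codes: Iterable[str],
    code_sets: Dict[str, Set[str]] = CONDITION_CODE_SETS,
) -> Dict[str, int]:
    """Mark each condition 1 if the patient has any relevant code, else 0."""
    codes = set(patient_codes)
    return {name: int(bool(codes & relevant)) for name, relevant in code_sets.items()}
```

Note that under this rule the absence of a code is treated as absence of the condition, which is the convention the text describes.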

For the PASI pilot application, the lcmodels algorithm was run nightly on every eligible patient in the PASI application database, and 2 risk scores were recorded – probability of lung cancer death within 5 years without screening,¹⁵ and days of life gained from undergoing 3 rounds of CT screening.¹⁶ The algorithm also identifies United States Preventive Services Task Force (USPSTF) eligibility status. Thus, patients with available data were given a newly-calculated risk/benefit score each day.

User Interface (UI): The UI displays a table of patients predicted to benefit from LCS, with 1 row per patient. The application was restricted to only display patients who are USPSTF eligible, per the lcmodels algorithm. The patient data rows contain the predicted benefit and risk scores, as well as several columns of checkboxes to allow medical professionals to track the progress of patients through the LCS process. As described above, the 2 scores are calculated daily from the most up-to-date patient data. The probability of death in 5 years from lung cancer without LCS was displayed as a percentage. The days of life gained from LCS was represented categorically to show the benefit of screening. Patients who were predicted to gain greater than or equal to 16 days of life from LCS were assigned the Encourage Screening category, signaling to providers that those patients would likely benefit from LCS. Patients predicted to gain less than 16 days of life from LCS were assigned the Preference Sensitive category, suggesting personal preference for whether to undergo LCS.¹⁷ Patients are displayed in descending order of benefit, based on days of life gained and risk of death. This provides users with a prioritized list of patients to contact regarding LCS. As this was the planning phase of the project, patient data from only 5 VA stations was included: Durham, NC; Los Angeles, CA; St. Louis, MO; Houston, TX; and Fayetteville, NC.
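The categorization and ordering rules can be sketched as below. The 16-day threshold and category names come from the text; the field names, the percentage formatting, and the use of 5-year risk as a tie-breaker after days of life gained are assumptions made for illustration.

```python
from typing import Dict, List

ENCOURAGE_THRESHOLD_DAYS = 16  # threshold stated in the text


def benefit_category(days_gained: float) -> str:
    """Map predicted days of life gained to the displayed category."""
    if days_gained >= ENCOURAGE_THRESHOLD_DAYS:
        return "Encourage Screening"
    return "Preference Sensitive"


def prioritized_rows(patients: List[Dict]) -> List[Dict]:
    """Order patients by descending benefit and attach display fields.

    Sorting by (days gained, 5-year risk) is an assumed tie-break order;
    the text says only that both values drive the ordering.
    """
    ranked = sorted(
        patients,
        key=lambda p: (p["days_gained"], p["risk_5yr"]),
        reverse=True,
    )
    return [
        {
            **p,
            "category": benefit_category(p["days_gained"]),
            "risk_display": f"{p['risk_5yr']:.1%}",
        }
        for p in ranked
    ]
```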

Due to the cross-discipline collaboration encouraged by ENTHRALL, and the national nature of the VA’s databases, the team was able to build and launch the application on a very short timeline. Within 2 months, the development team had an internal version of the application running, with all vital infrastructure built. Figure 3 shows a visual timeline of the main development milestones during the pilot phase of the project.

Figure 3.

This timeline displays the major milestones achieved by the Boston Informatics Group from project inception to the move out of pilot.

Results

The PASI pilot application was implemented across 5 VA sites from June 2023 to February 2024. Testing at multiple sites ensured that the application was evaluated under different workflows and regional practices, highlighting its adaptability. The application provides a clinical workflow that comes from a real-world model. The clinical research team provided feedback on the system’s usability and integration into daily operations. They reported that the application had a user-friendly interface¹⁸ and helped them identify high-benefit patients who otherwise may not have been reached to discuss LCS. The engineering team focused on these usability goals throughout the feedback cycle with the clinical research team.

The BIG team was able to deliver this application very quickly, owing to open collaboration with the clinical research team and cross-functional technical work within the informatics group. Generally, applications can take up to a year to start up, but the PASI pilot was live for user testing within 3 months of project initiation (Figure 3). Applications with such a quick lifecycle can easily feed into real-world clinical workflows and then be improved rapidly based on user feedback. This approach was successful enough on PASI that the study was extended beyond the pilot version, and the application will be used in more real-world settings (expanding to 28 sites from the original 5). Figure 4 shows a redacted screenshot of the completed PASI pilot application.

PASI prioritizes patients for LCS based on real-time data, as shown in the ranked Benefit column in Figure 4. Applications like this could help simplify the screening process for providers, optimizing the population of patients to whom they reach out and helping to standardize outreach across VA facilities. This application facilitated the identification of patients who, due to various factors, might not have otherwise been prioritized for LCS.

Figure 4.

Redacted screenshot of the PASI pilot application.

Patient risk profiles are updated based on new, real-time patient data. This enables clinicians to make decisions based on the most current information, enhancing the quality of patient care. The daily feedback loop supports dynamic risk assessments, allowing clinicians to react quickly to changes in patient status. During the pilot phase, users interacted with more than 150 patient profiles inside the PASI application across the 5 stations.

Discussion

This effort demonstrated the feasibility of implementing the PASI population management application with a risk prediction algorithm designed to enhance LCS within the VA healthcare system. The pilot successfully integrated real-time data into clinical workflows, allowing providers to efficiently identify and prioritize high-benefit patients for screening. The project’s expedited timeline and limited resources meant that some features could not be fully developed during the pilot phase. For example, the regex algorithms used to identify critical inputs to the prediction models were manually defined. With more time and resources, a finely-tuned NLP algorithm could be developed to possibly include additional patients in the project. However, the successful deployment and positive clinician feedback¹⁸ resulted in approval for the PASI study to be expanded to 23 additional VA sites, marking a critical step toward broader implementation and integration into an LHS. The results highlighted both the strengths of the application’s design – particularly its user-friendly interface, instant feedback loop, and ability to help streamline workflows – and areas for improvement, such as enhancing transparency in the risk prediction process and expanding the system’s reach to a more diverse clinical population.

The planning phase was intentionally conducted in 5 varied clinical settings to assess the PASI application’s adaptability, though no Cerner sites were included due to differences in data entry requirements. This discrepancy could not be resolved within the pilot’s expedited timeline, which limited the generalizability of the overall study to all patient populations. While the inclusion of Cerner sites was not a necessity for this initial phase, future expansions could incorporate a broader range of clinical settings to enhance generalizability. One key lesson from the pilot was the desire for greater transparency in PASI’s risk prediction process for clinicians using the dashboard. The predictive risk and benefit score did not include sufficient clarification of the contributing factors that generated the dashboard results. This affected clinician confidence in the recommendations. The upcoming expansion will address this by incorporating interface updates that provide a breakdown of patient risk factors that influence the algorithm, alongside more detailed training sessions for frontline clinicians.

The effort’s efficient timeline and quick turnaround allowed for rapid feedback on dashboard development and clinical use. On the other hand, the short duration limited the study, as only a relatively small sample could be included. Additionally, the project was developed with restricted informatics funding and a small team, which constrained the implementation of certain requested features through the latter half of the project. However, the informatics team successfully met the project’s core objectives and implemented as many clinician-driven refinements as possible under the time constraints. The application’s demonstrated effectiveness within these restrictions supports its potential for larger-scale deployment, with a future study planned to assess its long-term impact across a broader range of VA sites.

Moving beyond the pilot, the PASI clinical research team is conducting a trial to study the improvement in LCS uptake that comes with decision support tools. This application will be augmented accordingly based on limitations of the pilot, ensuring wider applicability and improved evaluation of its effectiveness. Enhancements will include expansion to 28 VA facilities, improved transparency in risk score calculations, additional clinician training, and further optimization to align with diverse clinical workflows. For example, clinicians expressed a need for non-technical documentation within the PASI application that describes the statistical methods of the lcmodels algorithm, so they may better understand the predictions during use. Refinements such as this are essential to fostering clinician trust and ensuring that risk prediction tools are effectively utilized in patient care. The iterative nature of this approach reflects the VA’s commitment to building an LHS, where real-world data continuously informs and refines clinical decision-making.

The PASI application’s ability to help streamline patient identification and prioritization has the potential to improve care efficiency, reduce screening delays, and enhance patient outcomes. While long-term impacts require further study, the pilot suggests that the PASI application could reduce clinician workload by automating risk stratification and clinical decision support. Its adaptability also presents opportunities for integration into other preventive care initiatives, reinforcing the VA’s broader mission to provide data-driven, patient-centered care while optimizing resource utilization. The success of this pilot lays the foundation for the application’s continued development and positions it as a model for embedding risk prediction algorithms into routine clinical practice.

Conclusions

The PASI pilot application project successfully demonstrated the potential of an application using a risk prediction algorithm to enhance LCS within the VA healthcare system. By integrating real-time data into clinical workflows, the PASI pilot application helps clinicians efficiently identify and engage Veterans at risk of lung cancer who would benefit most from earlier screening. Despite the constraints of a pilot-level initiative, the system was well-received across 5 VA sites, with positive overall clinician feedback. This early success highlights the application’s potential to improve preventive care and aligns with the VA’s larger efforts to develop evidence-based solutions for enhancing patient outcomes. Furthermore, the ability to deliver knowledge and iteratively improve predictive models based on real-world clinical data exemplifies the principles of an LHS, where continuous feedback informs patient care. The funded initiative that is currently under development has the potential to further solidify the impact of predictive analytics on clinical workflows, furthering the benefits of specialized data tools implemented within VA facilities.

Footnotes

Acknowledgements

The views expressed are those of the authors and do not necessarily reflect the position or policy of the Department of Veterans Affairs or the United States government.

ORCID iDs

Hannah M. Tosi

Danne C. Elbers

Ethical Considerations

This article does not contain any studies with human or animal participants. There are no human participants in this article and informed consent is not required. This quality improvement project was approved under IRB number IRBNet #1721656.

Consent to Participate

Patient data was pulled into the application under a HIPAA waiver.

Author Contributions

Conceptualization and requirement definition: LK, TC, NT, RW, MB, ND, NF, DE. Application design and implementation: HT, CZ, SM, OS, AT, JC, GS, DE. Prediction algorithm implementation: HK, HT, SM, OS, JC. Project supervision: MB, NF, ND, DE. Manuscript development: MY, HT, DE. Manuscript review: all.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: PASI Planning project was funded by ORD as supplemental funds to LPO grant number VA ORD CSP L0018. Dr. Wiener is supported in part by resources from the VA Boston Healthcare System. Dr. Caverly is supported in part by resources from VA HSR, IIR 21-152, VA CSRD LPOP AMP, Grant I01CX002850-01.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

Data from this project cannot be made publicly available, as it is Protected Health Information (PHI).

References

1. Lewis, Samuels, Denton, et al. National lung cancer screening utilization trends in the Veterans Health Administration. JNCI Cancer Spectrum. 2020;4(5):aa053.

2. Darzi, Busse, Torabiardakani, et al. Risk assessment models: considerations prior to use in clinical practice. Eye. 2025;39:617-619.

3. Sarfati, Gurney. Preventing cancer: the only way forward. Lancet. 2022;400(10352):540-541.

4. Ilbawi, Anderson BO. Cancer in global health: how do prevention and early detection strategies relate? Sci Transl Med. 2015;7(278):278cm1.

5. Adams, Stone, Baldwin, Vliegenthart, Lee, Fintelmann FJ. Lung cancer screening. Lancet. 2023;401(10374):390-408.

6. U.S. Department of Veterans Affairs Office of Research & Development. VA Research on Cancer – Lung Cancer. U.S. Department of Veterans Affairs Office of Research & Development. 2025. https://www.research.va.gov/topics/cancer.cfm#research1

7. Gandhi, Gurram, Amgai, et al. Artificial intelligence and lung cancer: impact on improving patient outcomes. Cancers. 2023;15(21):5236.

8. Ladbury, Amini, Govindarajan, et al. Integration of artificial intelligence in lung cancer: rise of the machine. Cell Rep Medicine. 2023;4(2):100933.

9. Enticott, Johnson, Teede. Learning health systems using data to drive healthcare improvement and impact: a systematic review. BMC Health Serv Res. 2021;21(1):200.

10. Fiore, Ferguson, Brophy, et al. Implementation of a precision oncology program as an exemplar of a learning health care system in the VA. Fed Practitioner. 2016;33(Suppl 1):26S-30S.

11. Nash, Bhimani, Rayner, Zwarenstein. Learning health systems in primary care: a systematic scoping review. BMC Fam Pract. 2021;22(1):126.

12. Elbers, Fillmore, et al. Matching patients to accelerate clinical trials (MPACT): enabling technology for oncology clinical trial workflow. Stud Health Technol Inform. 2024;310:1086-1090.

13. Elbers, Fillmore, et al. Building research infrastructure to develop greater learning efficiencies (BRIDGE). Stud Health Technol Inform. 2024;310:1131-1135.

14. Cheung, Kovalchik, Katki HA. Predictions From Lung Cancer Models – ‘Lcmodels’ R Package; Version 4.1.1. National Cancer Institute; 2023. https://dceg.cancer.gov/tools/riskassessment/lcmodels

15. Katki, Kovalchik, Berg, Cheung, Chaturvedi AK. Development and validation of risk models to select ever-smokers for CT lung cancer screening. JAMA. 2016;315(21):2300-2311.

16. Cheung, Berg, Castle, Katki, Chaturvedi AK. Life-gained-based versus risk-based selection of smokers for lung cancer screening. Ann Intern Med. 2019;171(9):623-632.

17. Mazzone, Silvestri, Souter, et al. Screening for lung cancer: CHEST guideline and expert panel report. Chest. 2021;160(5):e427-e494.

18. Kearney, Brady, Pendergast, et al. Formative development of a population management approach to engage high-benefit veterans in lung cancer screening. Am J Respir Crit Care Med. 2024;209:A1421.

Rapid Support and Implementation of an Application for the Prediction Augmented Screening Initiative (PASI) Planning Phase Through the Enabling Technologies for Rapid Learning Health Systems Platform (ENTHRALL) at the Department of Veterans Affairs (VA)
