Abstract
Prostate cancer is the second most diagnosed cancer in the world. Treatment guidelines involve a multitude of therapies; however, adherence to them is not fully established, and the lack of personalized treatment strategies fails to place the patient, as an individual clinical profile, at the center of their treatment. We present the concept of a digital treatment analyzer (TA) for the management of prostate cancer (PC) patients, leveraging real-world data (RWD) and predictive modeling to enhance personalized disease management strategies and adherence to PC guidelines, with the ultimate aim of optimizing therapeutic efficacy and improving outcomes. The TA comprises digital tools integrated into one intuitive user interface, facilitating the development of patient-specific clinical profiles, classification of patients into matched historical RWD cohorts, presentation of relevant clinical guidelines, visual representation of treatment and outcomes, and mortality risk prediction based on validated machine learning models. The Medical Information Mart for Intensive Care (MIMIC) IV dataset was used, including structured and unstructured data from the patient journey. The developed TA represents a promising approach to enhance personalized disease management strategies and adherence to PC guidelines. By integrating contemporary clinical guidelines, RWD and AI-driven insights, our digital TA aims to optimize therapeutic efficacy and improve patient outcomes. The presented concept demonstrates the potential of a digital approach that integrates RWD into the treatment journey to provide healthcare stakeholders with a holistic approach to PC management, drawing on all available modern tools to achieve optimal outcomes.
Introduction
Prostate cancer is the second most commonly diagnosed cancer in the world.1 This is a valid cause for concern among public health and medical experts worldwide.
Treatment of PC may include medication, radiation, surgery or a combination thereof. While treatment guidelines try to capture the entire spectrum of patients and guide clinicians according to the available therapy strategies,2–4 clinician adherence to them is not always ensured, and non-adherence can lead to worse outcomes.5–7
At the same time, to better understand their patients and the possible outcomes of specific therapies, clinicians can refer to past evidence from registries or clinical trials, which, however, do not always represent the real world.8 This gap can be closed by analyzing real-world data (RWD) and implementing the findings in clinical routine.
We introduce the concept of a digital tool, hereafter referred to as the Treatment Analyzer (TA), that uses RWD as the basis for patient classification according to guidelines, development of a clinical profile based on weighted variables, and construction of a historical RWD-based matched cohort for visualization of the standard of care, and that introduces machine learning models for the prediction of mortality.
Case description
User interface
The Treatment Analyzer is a set of digital tools incorporated into one interface comprising three screens, which allow the following:
Screen 1. Patient clinical profile:
The user enters patient and disease data, which are relevant to match the patient to a historical cohort in the next step. A patient-specific clinical profile based on user input is created.
Screen 2. Patient benchmarking:
This screen serves the user as a comparison of the patient with a cohort of “similar” patients. Patient metrics are visualized alongside ‘benchmark’ baselines from a matched historical Real World Evidence (RWE) cohort.
Screen 3. Results:
This screen presents tailored relevant clinical guidelines for treatment, a visual representation of treatment and outcomes based on the matched historical RWE cohort, and mortality risk prediction based on an optimized Random Forest machine learning model.
Data
For our specific TA concept, we used the Medical Information Mart for Intensive Care (MIMIC) IV dataset, version 2.2, from PhysioNet.9,10 It covers various aspects of the patient journey, from admission to discharge as well as mortality, based on both structured and unstructured data.
Data from the MIMIC-IV corpus were extracted and assembled in R, using unique visit and subject identifier keys. Current prostate cancer patients were identified using CDC-identified ICD-9 and ICD-10 codes (see Appendix 1).
The index visit for each patient was defined as the first visit with a prostate cancer specific code, or general codes for prostate cancer disorder (ICD9 601*, ICD10 N41.9) or chemotherapy related codes (ICD9 285.3, V58.11, ICD10 D64.81, N42.9) in the absence of other cancers.
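The index-visit rule can be illustrated with a minimal sketch, written here in Python for illustration (the study itself used R). The record layout (`subject_id`, `admit_date`, `codes`) and the `is_qualifying` predicate are hypothetical stand-ins for the MIMIC-IV tables and the code lists described above, and the check for the absence of other cancers is left out.

```python
from datetime import date


def find_index_visits(visits, is_qualifying):
    """Return {subject_id: earliest visit carrying at least one qualifying code}."""
    index_visits = {}
    # Walk visits in chronological order so the first hit per patient wins.
    for visit in sorted(visits, key=lambda v: v["admit_date"]):
        subject = visit["subject_id"]
        if subject in index_visits:
            continue
        if any(is_qualifying(code) for code in visit["codes"]):
            index_visits[subject] = visit
    return index_visits


# Toy records (hypothetical layout, not the MIMIC-IV schema).
visits = [
    {"subject_id": 1, "admit_date": date(2015, 3, 1), "codes": ["I10"]},
    {"subject_id": 1, "admit_date": date(2016, 7, 9), "codes": ["C61"]},
    {"subject_id": 1, "admit_date": date(2017, 1, 2), "codes": ["C61", "I10"]},
    {"subject_id": 2, "admit_date": date(2014, 5, 5), "codes": ["E11"]},
]

# Simplified predicate for the example: ICD-10 C61 and its subcodes only.
index = find_index_visits(visits, lambda code: code.startswith("C61"))
```

In this toy example, patient 1 receives the 2016 visit as the index visit, and patient 2 has no index visit because no qualifying code was recorded.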
Patients were excluded if they died or received a radical prostatectomy on the same day as admission, as identified through discharge codes, ICD-9 and ICD-10 procedural codes and regex matching to identify endoscopic surgery for treatment of prostate cancer. A wider search of ICD-9 and ICD-10 codes following WHO guidelines11 (see Appendix 2) identified other cancers in these patients. Those with a non-prostate cancer code preceding a prostate cancer code, or with codes specific to secondary prostate cancer, were excluded. Patients presenting with cancer of the lymph, bladder, bone, lung or liver alongside prostate cancer at their first recorded visit were considered metastatic prostate cancer patients. Other patients with multiple cancers at the start of the study, for whom the primary site was unknown, were omitted. The process of patient selection in the database is presented in Figure 1.
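The handling of concurrent cancers can be sketched as a small decision function, written in Python for illustration (the study itself used R). The function name, the site labels and the returned category names are hypothetical. The sketch also assumes that any concurrent cancer outside the five listed sites implies an unknown primary site, which is one possible reading of the rule rather than a confirmed detail. Same-day death, same-day prostatectomy and secondary prostate cancer codes are not covered.

```python
# Sites whose presence at the first visit was treated as metastatic disease.
METASTATIC_SITES = {"lymph", "bladder", "bone", "lung", "liver"}


def classify_patient(sites_before_index, sites_at_index):
    """Classify a patient from the non-prostate cancer sites coded before
    and at the first prostate cancer visit."""
    if sites_before_index:
        # A non-prostate cancer code precedes the first prostate cancer code.
        return "excluded"
    if not sites_at_index:
        return "prostate_only"
    if set(sites_at_index) <= METASTATIC_SITES:
        return "metastatic"
    # Assumption: any other concurrent cancer means the primary site is unknown.
    return "excluded"
```

For example, `classify_patient([], ["bone", "lung"])` yields `"metastatic"`, whereas `classify_patient(["skin"], [])` yields `"excluded"`.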

Figure 1. CONSORT diagram detailing patient selection process.
Statistics
Data were standardized, and eligibility criteria were applied based on the presence of active treatment of PC within the dataset. Relevant variables were extracted from radiology and discharge summary reports using Natural Language Processing (NLP).12 Outcomes were measured as the time from the index event or discharge to the next event and visualized with a Sankey diagram. Medication was extracted from structured and unstructured reports using NLP models.13 Statistical analysis in R included propensity score matching, and mortality prediction was performed with a Random Forest classifier.14
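The outcome measure can be illustrated with a short sketch, written in Python for illustration (the study itself used R). The function name is hypothetical, and what counts as an "event" (for example a readmission or death date) is an assumption, since the text does not enumerate the event types.

```python
from datetime import date


def days_to_next_event(index_discharge, event_dates):
    """Days from index discharge to the first later event, or None if there is none."""
    later = [d for d in event_dates if d > index_discharge]
    if not later:
        return None
    return (min(later) - index_discharge).days


# Toy example: one event before the index discharge and two after it.
gap = days_to_next_event(
    date(2016, 1, 10),
    [date(2015, 12, 1), date(2016, 2, 9), date(2016, 5, 1)],
)
```

Events dated on or before the index discharge are ignored, so only the interval to the first subsequent event is returned.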
Propensity score matching used 10 variables previously identified as significant to prostate cancer mortality (TNM stage, treatment, age, physical function, hemoglobin, systolic blood pressure, diastolic blood pressure, serum hemoglobin, history of cardiovascular disease, hospital frailty risk score (HFRS) and metastases).15 Propensity scores were estimated with a generalized linear model (GLM) with a probit link, using the R MatchIt library. Patients were then matched on propensity score with a nearest neighbour algorithm without replacement, varying the distance to fulfil a 1:30 quota of matches. Random Forest models including all covariates were trained to predict mortality at 6 months, 1 year and 5 years across treatment groups, optimizing the number of trees and nodes within each model.
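The matching step can be illustrated with a simplified sketch, written in Python for illustration. It is not the MatchIt implementation used in the study. It assumes propensity scores have already been estimated, performs a greedy nearest-neighbour search on the absolute score difference without replacement, and uses a hypothetical function name, toy scores and a ratio of 2 instead of 30.

```python
def nearest_neighbour_match(target_scores, pool_scores, ratio):
    """Greedy nearest-neighbour matching on propensity score, without replacement.

    target_scores and pool_scores map patient ids to propensity scores;
    each target receives up to `ratio` distinct pool patients.
    """
    available = dict(pool_scores)
    matches = {}
    for target_id, target_score in target_scores.items():
        # Rank the remaining pool by absolute score distance (id breaks ties).
        ranked = sorted(
            available,
            key=lambda pid: (abs(available[pid] - target_score), pid),
        )
        chosen = ranked[:ratio]
        for pid in chosen:
            del available[pid]  # without replacement: each pool patient is used once
        matches[target_id] = chosen
    return matches


# Toy propensity scores (hypothetical values).
pool = {"a": 0.10, "b": 0.32, "c": 0.35, "d": 0.60, "e": 0.90}
matched = nearest_neighbour_match({"new": 0.30, "other": 0.58}, pool, ratio=2)
```

Because matching is without replacement, the order in which targets are processed affects the result. A production library handles this ordering, calipers and ties more carefully than this sketch does.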
For the user interface graphics, Sankey diagrams were created using the ggvis R library to visualize the patient journey of cases matched to the entered patient specifics, who received various treatments and had a range of outcomes. Bar plots showed the first measurement of each metric recorded at the index visit for this subset of patients alongside the entered patient details, for rapid visual comparison.
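A Sankey diagram is drawn from weighted links between consecutive stages. The sketch below, written in Python for illustration (the study itself used R), shows one way such links could be aggregated from matched patient journeys. The function name and the stage labels are hypothetical, and the plotting step itself is omitted.

```python
from collections import Counter


def sankey_links(journeys):
    """Count transitions between consecutive stages across patient journeys."""
    counts = Counter()
    for stages in journeys:
        for source, target in zip(stages, stages[1:]):
            counts[(source, target)] += 1
    return [
        {"source": s, "target": t, "value": n}
        for (s, t), n in sorted(counts.items())
    ]


# Toy journeys (hypothetical stage labels).
journeys = [
    ["index visit", "hormone therapy", "alive at 1 year"],
    ["index visit", "hormone therapy", "died within 1 year"],
    ["index visit", "radiotherapy", "alive at 1 year"],
]
links = sankey_links(journeys)
```

Each resulting record gives the width of one band in the diagram, so the band from "index visit" to "hormone therapy" has a value of 2 in this example.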
Treatment analyzer
The analysis of patient data resulted in the design of a Treatment Analyzer prototype, which represents the main functionality of a real digital product and helps in understanding its logic and purpose. The prototype consists of the main screens presented below (Figure 2).
The first screen serves as an interface for entering patient and disease data, which are then used for patient matching and benchmarking (Figure 3).
Based on propensity score matching, a particular patient is compared to other patients with similar characteristics. Bar charts of different parameters help to visualize these data (Figure 4).
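The benchmarking comparison can be sketched as follows, in Python for illustration (the study itself used R). Using the cohort median as the benchmark statistic is an assumption, as are the function name and the metric names. The output rows are the values that a bar chart would display side by side.

```python
from statistics import median


def benchmark(patient, cohort, metrics):
    """Pair each patient value with the matched cohort's median for the same metric."""
    rows = []
    for metric in metrics:
        # Skip cohort members with a missing measurement for this metric.
        values = [m[metric] for m in cohort if m.get(metric) is not None]
        rows.append({
            "metric": metric,
            "patient": patient[metric],
            "cohort_median": median(values) if values else None,
        })
    return rows


# Toy data (hypothetical values).
patient = {"age": 71, "hemoglobin": 11.8}
cohort = [
    {"age": 65, "hemoglobin": 13.1},
    {"age": 70, "hemoglobin": None},
    {"age": 78, "hemoglobin": 12.5},
]
rows = benchmark(patient, cohort, ["age", "hemoglobin"])
```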
The Results screen presents different tools that assist the user in clinical decision making. In this example, these tools include:
- a summary of clinical guidelines for treatment relevant to the specific disease and its stage.
- a visual representation of treatment and outcomes based on the matched historical RWE cohort.
- mortality risk prediction based on an optimized Random Forest machine learning model.
Conclusion
Based on metrics readily available in real world EHR data, as described above, the TA digital instrument was developed. In the authors’ opinion, this is a promising approach to enhance personalized disease management strategies and adherence to PC guidelines, thereby aiming to optimize therapeutic efficacy and improve outcomes.
Our PC TA is presented as a concept tool to better inform healthcare professionals and stakeholders, enabling them to navigate the multitude of therapies in heterogeneous populations with the help of contemporary clinical guidelines, RWE and AI. The tool is not intended to predictively model atypical treatment pathways, small patient pools or rare conditions, as these will have lower confidence associated with predicted outcomes.
Illustrative examples of the user interface are included in Figures 2, 3 and 4 and show the three main parts. The first builds the patient clinical profile from variables entered by the user. The second presents the matching of the patient's clinical profile to a representative historical RWE cohort. The last provides clinical guidelines relevant to the patient's classification, a visual representation of treatment and outcomes based on the patient's matched RWE cohort, and the mortality probability calculated by a predictive machine learning model.

Screen #1 with patient clinical profile.

Screen #2 with patient benchmarking.

Screen #3 with tools relevant to clinical decision making.
The data processing methods presented in the scope of our PC TA can also be used in adjacent tasks, such as building synthetic control arms by emulating the standard of care from matched historical cohorts, optimizing clinical workflows and developing digital biomarkers in a multitude of indications. We intentionally left the analysis of such secondary uses for future research.
We present the concept of a digital tool for prostate cancer management that allows users to stay informed of recent clinical guidelines and introduces a personalized approach to patient treatment based on real-world data and predictive modelling.
Our pipeline demonstrates how RWD can be analyzed and used for the development of digital tools and their application by clinicians, pharmaceutical companies and insurance companies. We aim to initiate a discussion of how such an approach can be applied not only to PC, but also to other diverse conditions, as a way to optimize patient care through RWD-based methods.
Supplemental Material
Supplemental material, sj-docx-1-dhj-10.1177_20552076251326021 for Development of a digital treatment analyzer for the management of prostate cancer patients, with the help of real world data and use of predictive modelling by Lev Korolkov, Heather A Robinson and Konstantinos Mouratis in DIGITAL HEALTH
Footnotes
Acknowledgements
Not required.
Author contributions
KM – conceptualization, funding acquisition, investigation, methodology, project administration, supervision, validation, writing – review and editing; LK – investigation, project administration, resources, visualization, writing – original draft; HR – data curation, formal analysis, investigation, methodology, software, validation, visualization, writing – review and editing. All authors reviewed and edited the manuscript and approved the final version.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Ethical approval
Not required.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Patient consent statement
Not required.
Guarantor
Not required.
Supplemental material
Supplemental material for this article is available online.
Appendix 1. Inclusion and exclusion ICD codes.
ICD code list (inclusion):
185 ICD9 Malignant neoplasm of prostate
198.82 ICD9 Secondary malignant neoplasm of genital organs
R97.21 ICD10 Rising PSA following treatment for malignant neoplasm of prostate
233.4 ICD9 Carcinoma in situ of prostate
D07.5 ICD10 Carcinoma in situ of prostate
C61* ICD10 Malignant neoplasm of prostate
C79.82* ICD10 Secondary malignant neoplasm of genital organs
ICD code list (exclusion):
60.5 ICD9 Radical prostatectomy
N52.31 ICD10 Erectile dysfunction following radical prostatectomy
N52.34 ICD10 Erectile dysfunction following simple prostatectomy
V10.46 ICD9 Personal history of malignant neoplasm of the prostate
Z08 ICD10 Encounter for follow-up examination after completed treatment for malignant neoplasm
Z85.46 ICD10 Personal history of malignant neoplasm of prostate
Z90.79 ICD10 Acquired absence of other genital organ
0VB00ZZ ICD10-PCS Excision of prostate, open approach
Regex matching in ICD descriptions for “prostatectomy”, “excision of prostate”, “end surgery” in case of atypical coding behaviours.
Patients with codes specific to prior prostate cancer remission were excluded. An assumption was made that clinicians followed ICD guidelines to code the end of prostate cancer treatment with ICD-10 code Z08 (including transition to palliative care), and to indicate prostate cancer in remission specifically with the personal history of malignant neoplasm of the prostate codes ICD10 Z85.46 or ICD9 V10.46.
* This is the root code, which includes all subcodes.
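Root-code matching of this kind can be sketched as below, in Python for illustration (the study itself used R). The sketch covers only a subset of the inclusion list, treats a trailing `*` as a prefix match, and strips dots before comparison on the assumption that codes may be stored with or without them. The function and variable names are hypothetical. The ICD version is checked explicitly because the same digits can mean different things in ICD-9 and ICD-10.

```python
# Subset of the inclusion list as (ICD version, pattern) pairs.
INCLUSION = [
    (9, "185"),
    (9, "198.82"),
    (9, "233.4"),
    (10, "D07.5"),
    (10, "C61*"),
    (10, "C79.82*"),
]


def normalize(code):
    """Upper-case a code and drop dots so '198.82' and '19882' compare equal."""
    return code.replace(".", "").strip().upper()


def matches_any(code, version, patterns):
    """True if the code matches a pattern of the same ICD version."""
    code = normalize(code)
    for pattern_version, pattern in patterns:
        if pattern_version != version:
            continue
        if pattern.endswith("*"):
            # Root code: any subcode sharing the prefix matches.
            if code.startswith(normalize(pattern[:-1])):
                return True
        elif code == normalize(pattern):
            return True
    return False
```

For example, `matches_any("C7982", 10, INCLUSION)` is true, while `matches_any("185", 10, INCLUSION)` is false because 185 is listed as an ICD-9 code only.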
Appendix 2. ICD codes used to identify other cancer types.
Cancer (general): ICD9 1*, 20*, V10*, 23*, 273.3, ICD10 C*, Z85, V10.85
Regex match: “malignant neoplasm”, “carcinoma” (in absence of Z09), excluding “screening for” and “family history”
Skin cancer 172, 173, C43*, 198.2, C79.2, C44*, Z85.828, V10.83
Brain cancer 191, C71*, 198.3, C79.3, Z85.841, V10.85
Colon cancer 153, C18*
Lung cancer 162, C34*, C78.0, 197.0, 197.2, C78.2
Bladder cancer 188, C67*, C79.1, 233.7, Z85.51, D09.0
Small intestine 152, C17*, C78.4, 197.4
Heart C38*
Thyroid 193, C73*, Z85.850, 164.0, V10.87
Pancreatic 157, C25*, Z85.07
Thymus 164, C37*
Rectal 154, C20*, C78.5, 197.5, Z85.238, C19*, Z85.048, V10.06
Anal C21*
Soft tissue C49*, 171, Z85.831
Lymphatic C77*, 196, Z85.79
Large intestine C78.5, 197.5, 152.0, C16, Z85.01, V10.05, Z85.038
Esophagus 150, C11.1, C15*, V10.03
Peritoneum C78.6, 197.6, 158
Liver C78.7, 197.7, C22.0, C22.3, C22.4, C22.7, C22.8, C78.7, 155.0, 155.2, 156.1, Z85.05, V10.07
Renal C79.0, 198.0, 189, C64*, V10.5, Z85.528
Bone 198.5, C79.5, Z85.830, C41, V10.81
Spleen 197.8, 159.1
Bile C22.1, C24*, 155.1, 156.9
Nervous system 192, 198.4, 237.9, C72*, C79.4
Urinary C79.19, 198.1, C68, D41.9, 189.9, 189.8, 198.1, Z85.59, Z85.50
Leukemia C91*, C92*, C93*, C94*, C95*, 204.0, 204.1, 204.8, 202.4, 206.22, 206.92, 207, 208
Stomach C16.9, 151, Z85.028, V10.04
Breast 175, V10.3
Genital 187, 198.82, C60*, C62*, C63*, Z85.47, Z85.49
Adrenal C74*, C79.7, 194, 198.7
* This is the root code, which includes all subcodes.
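The free-text rule for general cancer codes can be sketched as follows, in Python for illustration (the study itself used R). The search and exclusion terms are taken from the list above. The function name and the way the Z09 check is passed in are hypothetical, and the sketch assumes matching is case-insensitive.

```python
import re

CANCER_TERMS = re.compile(r"malignant neoplasm|carcinoma", re.IGNORECASE)
EXCLUDED_TERMS = re.compile(r"screening for|family history", re.IGNORECASE)


def description_indicates_cancer(description, visit_codes=()):
    """Apply the text rule to one ICD description.

    visit_codes holds the other codes recorded at the same visit; the rule
    is skipped when Z09 is among them.
    """
    if "Z09" in visit_codes:
        return False
    if EXCLUDED_TERMS.search(description):
        return False
    return bool(CANCER_TERMS.search(description))
```

Under these assumptions, "Malignant neoplasm of bladder" is flagged, while "Encounter for screening for malignant neoplasm of colon" and "Family history of carcinoma" are not.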
References
