Abstract
Objective
Cancer is a significant cause of death globally. Effective management requires use of real-world data to inform treatment and planning. This demands improvements in data management to support healthcare delivery, research, and patient outcomes. Electronic health records (EHR) offer a potential solution yet integrating and interpreting the data poses challenges. To address this we developed Data Environment for Cancer Inquiries and Decisions or DECIDe. DECIDe is a cancer data warehouse consolidating patient data from multiple systems in use at Alberta Health Services (AHS).
Methods
DECIDe’s development began with stakeholder engagement and requirements gathering followed by design. The warehouse includes landing, staging, analysis, and reporting schemas that facilitate integration and transformation. Primary EHR, cancer registry, and legacy EHR data were extracted, transformed, and loaded into DECIDe.
Results
DECIDe integrated 2,100+ tables into 36 analysis tables with data from 1,055,186 unique patients. Integration exceeded 93% across all data sources demonstrating robust consolidation. DECIDe’s structured format enables efficient storage, retrieval, and analysis, laying the foundation for comprehensive reporting and analytics.
Conclusion
DECIDe addresses challenges in managing healthcare data, supporting analytics into cancer trends and treatment outcomes. Unique to Canadian healthcare DECIDe provides a comprehensive view of cancer care, enhancing decision-making and research opportunities.
Keywords
Introduction
Cancer remains a leading cause of premature death in 134 of 183 countries including most countries with a high-ranking human development index (HDI). The World Health Organization (WHO) predicts that from 2018 to 2040 there will be in excess of 27 million new cancer cases, or a 50% increase. 1 Our opportunity in the digital age is to leverage technology to assist in quickly collating and curating cancer-related health data with the goal of accelerating research and inquiry that advances understanding of the impacts of cancer diagnosis as well as treatment and survivorship. This will improve outcomes experienced for people facing cancer in the digital age.
The rise in use of EHRs to capture and manage health data has facilitated their use for analytics. However, health data is not always easy to connect, analyze, and interpret. Health data is voluminous, diverse, rapidly changing, and fraught with caveats.2,3 The availability of complete health records for secondary use remains challenging.4,5 One problem is that health data often resides in multiple EHRs for the same patients 6 and these systems may not adopt the same medical concepts or semantics.7,8 Additionally, as EHRs are more common in larger facilities that offer specialized expertise and technology, the data may be biased towards patients seen in these centers. 5 Furthermore, much of the data collected, such as clinical notes and summaries of care, may be unstructured.5,9,10 One approach to tackling many of these challenges is the creation of a data warehouse. A data warehouse is a repository of structured data integrated from multiple sources to leverage downstream analytics. Data warehouses are designed to consolidate vast amounts of disparate data and can help mitigate the aforementioned issues.6,11,12 As such, data warehouses can improve querying and enhance insights for clinicians, planners, researchers, and administrators. This paper describes the methods, design, and implementation of a data warehouse focusing on cancer patients developed by Cancer Advanced Analytics, Health Shared Services and Cancer Care Alberta - Data Environment for Cancer Inquiries and Decisions or DECIDe.
Cancer Care Alberta (CCA) provides care to patients diagnosed with cancer in the province of Alberta. As of October 1, 2023, Alberta’s population was 4,756,408 people. 13 CCA services include oncologist visits, supportive care services, systemic treatment, and radiation therapy. CCA is the agency that oversees cancer services and programs at two tertiary cancer centers, four regional cancer centers and eleven community cancer centers. Individual patient treatment may take place at any combination of CCA and external facilities.
Treatment outside of a CCA facility may be based on factors including availability of specialized services, regional accessibility, and patient preference. Patients are typically referred to the most appropriate facility for their specific diagnosis and treatment plan. As a result, prior to the creation of DECIDe, cancer data was housed in multiple systems. Data from any single system may have been housed in multiple data warehouses and managed by different teams. Additionally, not all cancer patients appear in all systems.
The development of DECIDe as a single repository of Alberta cancer patient data was spurred by AHS’ move to a new enterprise-wide EHR from the vendor Epic Systems. 14 The implementation of Epic has meant the cessation of use of the legacy system Varian’s ARIA® Oncology Information System for Medical Oncology (ARIA) and disconnecting of its’ data from EHR reporting. Merging historical patient data from the ARIA system into the new Epic EHR was not feasible. Furthermore, the Alberta Cancer Registry (ACR), a legally mandated provincial system recording all cases of cancer diagnosed in Alberta, continued as a stand-alone system. As cancer trends and surveillance are monitored over decades for impacts of prevention as well as treatments, 15 a method of preserving the legacy data and merging it with that of the new EHR was required. The solution was to consolidate data within DECIDe merging data from Epic, ARIA, and the ACR to develop a single system for business intelligence. This paper outlines the methods for developing DECIDe to address the common challenges of managing multiple EMRs and databases to support secondary use for quality improvement, surveillance, planning and research.
Methods
Engagement
A governance committee comprised of clinician researchers, AHS Privacy and Information Governance, source systems experts, implementation team members, and end-user stakeholders was established to assess and approve implementation decisions. Cybersecurity is managed within AHS, meaning that cybersecurity representatives were not necessary within the governance committee. However, if needed, they were consulted on an ad hoc basis.
Overall approach
DECIDe is a standalone cancer data warehouse that is housed within a larger AHS enterprise-wide data warehouse. The DECIDe cohort includes all patients from Epic with an encounter at any CCA facility, or who attended a gamma knife radiosurgery department at a non CCA-facility, or patients with a cancer diagnosis. The DECIDe warehouse model has three schemas: landing, staging, and analysis (Figure 1). New data flows from Epic and the ACR into the landing schema. The landing schema is a non-persistent layer within the overall solution and is used as a placeholder. Data is truncated and refreshed with every load. Each new set of source system data is loaded via integration pipeline execution. In the landing schema, tables are brought over in their entirety. Stored procedures then move the data into the staging schema: a persistent layer within the overall solution. The staging schema is where data is sourced from the landing tables and stored. Historical data is available in this layer and all columns from the landing schema are retained. The staging schema is also where required objects such as lookups and extract, transform, and load (ETL) control tables are created. Most of the modelling occurs in the Analysis Schema as data moves and transforms into Dimensions and Facts and is placed in the appropriate tables. In addition to the three modelling schemas described a log schema that houses error logging tables, and a reporting schema were also developed. There is one error table for each table in the staging schema. The reporting schema consists of views constructed for a group of users’ specific analytical needs. High Level Data Flow Diagram.
Data sources
Epic systems
Epic Systems was implemented in multiple phases starting in 2019 with cancer facilities going live in November 2022. Epic captures patient specific data related to specialized care and treatment of cancer, as well as patient reported outcomes. Epic has three reporting layers: (1) the real-time production system entitled Chronicles, (2) a relationally formatted version of the production system entitled Clarity, and (3) an open platform that allows integration with data from elsewhere in AHS entitled Caboodle. Data was extracted from Clarity and ongoing extraction of new and changed records occurs daily.
The data needed to fulfill DECIDe requirements were identified by viewing the Caboodle platform. The Caboodle architecture was then used as the starting point for DECIDe data architecture. Data points were audited to the Clarity environment and extracted.
Alberta cancer registry
The ACR was established in 1942 and is legislated to record and maintain data on all patients in Alberta who have been diagnosed with an in situ or invasive malignant tumor, a benign or borderline central nervous system tumor, a borderline bladder tumor, a borderline hematologic disorder, or an in situ cervical cancer. 16 The ACR has maintained a Gold Certification status from the North American Association of Central Cancer Registries (NAACCR) since 2005 16 and is considered the ‘source of truth’ for cancer diagnoses data in Alberta. ACR contains identifying information, details of diagnosis via the best information available at diagnosis, and follow-up information upon death. Data was extracted from the ACR reporting server. Ongoing extraction of new and updated records occurs weekly.
ARIA
The legacy EHR contained data starting from 2002 that captures patient specific data for treatment and visits in CCA. Its data could not be merged due to incompatibility including but not limited to architectural misalignment, platform level divergence, and semantic misalignment. The decades old platform differs in structure and standards from the modern Epic platform, and data elements are not consistently aligned in a one-to-one manner. Additionally, the implementation of Epic involved consolidating data from multiple legacy systems which increased the complexity of integration and made a direct merge impractical. A one-time full load of required tables and views from the legacy EHR was extracted from the ARIA reporting server directly to the staging schema.
Data integration
Informatica was used to move data from Microsoft SQL Server sources into the landing schema and subsequent schemas in Oracle 11g. To maintain integrity of the source data, there was no intent to alter or edit the source data. Tables containing the required data were converted from Microsoft SQL Server to Oracle data types during import from source to the landing schema as exact copies using an in-house stored procedure. Table naming conventions are applied with tables prefixed with LND_. The only changes that occurred were data type conversions necessary for Oracle 11g. Time stamps of each rows creation in the schema were inserted into the relevant tables. An ARIA one time data load used Microsoft SQL Server utility Bulk Copy Program (BCP) to export the legacy data into a data file. Oracle SQLLDR or SQL*Loader was then used to load the data into Oracle. Once the data file was loaded into Oracle, the data file was deleted. The data in the landing schema is not persistent, and automated scripts are run on these working tables daily to truncate and reload data. A single dedicated ETL specialist focused on maintaining consistency and accuracy of the data throughout the processes helping to ensure fidelity of the data.
Data is extracted as-is from the landing schema into the staging schema. Transition of data from the landing schema to the staging schemas occurs via incremental loads enabling the identification of data differences. In the staging schema Type 1 or Slowly Changing Dimension tables handle data by inserting new records, updating existing records when changes occur, and deleting records that are no longer current. In contrast, Type 2 or historical tables create a new record each time a change is made, allowing for the tracking of data changes over time. Table naming conventions were applied with tables prefixed with STG_. In the staging schema the subset of data required to meet customer requirements is transformed and conformed via ETL packaging containing multiple stored procedures and functions into dimensions and facts. A subset of conformed data is brought into the reporting schema to meet specific data mart needs outlined by business units. Creation of net new fields via transformations and calculations occur during this process. Examples include patient address being used to group patients into AHS geographic administrative zones, and birthdate being leveraged to calculate patient age. Errors are written to an error log table for each table. The full collection of dimensions and facts are brought into the analysis schema. The data pipeline was built on well-established organization-level systems familiar to the implementation team thus aiding in the safeguarding of data quality.
Transition of data from the staging schema to the analysis schema establishes the common data model (CDM) standardizing fields across the three systems. Key decisions were reviewed and approved by the governance committee.
The CDM is organized into several entities identifying key aspects of downstream reporting functionality. These entities include Patients, Treatments, Cancer Staging, Encounters, Referrals, Orders, and Questionnaires. Each entity is defined by semantic metadata describing its purpose, description, and constraints. The entities are further described by attributes that house the data that facilitates data analysis across AHS. For example, the Patient entity contains demographic attributes as well as patient identifier attributes that facilitate the creation of patient durable keys that in turn link to other relevant entities. Errors are written to an error log table for each table.
As the data moves from the staging schema to the analysis schema, patient records from all three systems are matched on several common patient identifiers that exist and are validated within the source systems. These patient identifiers include a Unique Lifetime Identifier (ULI) which is an exclusive and permanent number assigned to every person who presents for health services in Alberta, a Personal Health Number (PHN) which is a unique identifier assigned to each individual by the health insurance plan of each Canadian province or territory, an Alberta Cancer Board Number (ACB#) which is a front-facing patient identifier from the legacy registration system, and an Axon ID which is the system-level unique patient identifier in the same legacy registration system. Matching records were merged in the analysis schema where the patient records are given durable keys. Ongoing loads match incoming records with existing durable keys and update the record. The out-of-date records are set as historical. If no durable key is found, the records are merged and a new durable key assigned.
Durable and non-durable keys ensure data integrity through their enforced relationships. For example, the patient durable key is a stable and unchanging identifier that provides a reliable means of uniquely identifying a patient in DECIDe despite any changes that might occur to other attributes of the patient. The patient durable key is of prime importance in that it acts as a foreign key for multiple other entities and consequently stars in the multi-star schema. In particular, the patient durable key acts as a foreign key for Encounters, Procedures, Medications, Referrals, Diagnoses, Treatments, and Staging. Additionally, the patient centered CDM facilitates the growth and complexity of expanding data capture and reporting needs in that it mirrors the design of electronic health records, that is via patient centered data.
Dedicated backup procedures were not established at the DECIDe level. DECIDe operates within the larger AHS technical environment which centrally manages data security, redundancy, and backup via standardized protocols.
Data harmonization
One method of data harmonization employed was to identify standardized source elements and matching these elements to a look-up table with the subsequent assignment to the desired unified values. For example, diagnosis name was derived by matching ICDO codes from each system to a look-up table that matched each code to the desired end value. This end value is then populated in the data warehouse as the diagnosis name. Another harmonization method for less complex data such as phone number was to choose a ‘source of truth’ and then implement a hierarchy of source values. The source tables contain either a primary key (PK) or a unique key (UK) depending on the source system. These keys were leveraged to prevent duplication within the data warehouse. Any duplicates found during processing are captured in error log tables.
Data access and ethical oversight
Direct access to DECIDe is restricted to an approved group of users within AHS. All AHS employees complete mandatory privacy training, and internal operations comply with established protocols. Researchers requesting access must have already obtained formal ethics approval, and requests are reviewed and approved by established governance structures and in accordance with Alberta’s Health Information Act. 17
Results
Architecture
DECIDe is a dimensional model comprised of multiple star schemas with a central fact table. There is a central patient data star with peripheral stars organized by subject area including staging, encounters, referrals, treatment, and questionnaires (Figure 2). Within each subject area schema there are multiple database objects connected to each other, as well as to other subject area schemas by primary keys and foreign keys. The DECIDe dimensional data model consists of central fact tables with measures along with supporting dim or context tables. The model also leverages bridge tables to resolve many-to-many relationships such as person-diagnosis: one person can have multiple diagnoses, and one diagnosis can be linked to multiple patients. Multiple Start Schema (partial).
Indexes were created for the primary and durable keys as well as source unique identifiers. For improved manageability and performance, several complex areas of data were partitioned into multiple tables. Cancer staging data was partitioned into collaborative staging and TNM staging tables, while treatment data was partitioned into high level and detailed level tables. Table space segregation was performed to allow access of data and indexes simultaneously.
Data integration
The scale of population data: Record count for PATIENTDIM and TREATMENTFACT.
Data integration by data source for PATIENTDIM.
EHR = Electronic Health Record.
ACR = Alberta Cancer Registry.
Data lineage and metadata tracking to all three sources is managed via a metadata Tableau dashboard and underlying spreadsheet. Patient records additionally are linked back to the source table row via the introduction of an additional record identifier linked to source records coming into the landing environment. Traceability is enhanced using timestamps indicating when records have been created or updated in each layer. Additionally, source system change tracking is limited to a maximum of 30 days while DECIDe tracks changes indefinitely. The data flow is visible from Tableau dashboard to source system.
Operations
The transition away from three source systems and the development of DECIDe as the single source of data has streamlined the data request and reporting process in that end-users need only consult with one unit instead of many to complete their data pulls. The consolidation of requests to a single unit also facilitates consistent methods of pulling data and the end products across requests. This in turn normalizes reporting outcomes across the organization. One example would be the normalization of multidisciplinary tumor board reporting. Prior to DECIDe each source of data or each data team had disparate definitions of grouped diagnoses attached to specific cancer care teams. As a result of the development of DECIDe, a common, normalized definition has been employed. Thus, all leadership teams can report on things such as workload burden and clearance, treatments, and wait times by tumor team using the same underlying language. Additionally, more data is available via the integrations than was previously available adding to a wholistic view and improved decision making. The development of the DECIDe as the single source of data has also consolidated evaluation and feedback which has led to a standardized quality assurance approach and improved data quality.
Unifying data into a single repository has also allowed multiple groups of data users to access data in an efficient and easy manner. DECIDe is the primary data source for cancer-related information in Alberta, supporting use across organizational leadership, research, and regulatory agencies. Access is provisioned by a permissions framework that provides varying levels of access across a broad range of data. The framework facilitates straightforward adding and dropping of permissions as necessary as DECIDe grows and changes.
Conceptually scaling the warehouse is simple as we have the necessary framework, and the hardware is managed by AHS Information Technology. A recent example of scaling up is the addition of twenty-eight consult and treatment departments because of the opening of a new facility. The new departments were added to the inclusion list, and look-up tables were updated with the related data.
Furthermore, the development of DECIDe has facilitated the creation of a physician-facing dashboard designed to monitor critical indicators such as suicidal ideation. 18 This dashboard is populated by data derived from a questionnaire that integrates the revised Edmonton Symptom Assessment System (ESAS-r) and the Canadian Problem Checklist (CPC) into a composite measure specifically tailored for the ambulatory care setting. While the dashboard was previously updated on a daily basis, DECIDe enables refreshment of the data every 15 minutes, thereby providing physicians with more timely and accurate patient safety information at the point of consultation.
Discussion
DECIDe has successfully integrated cancer-related health data from multiple systems simplifying formerly complex queries, and enabling downstream analyses that were not considered feasible in the past. The scale of the data incorporated, and continuing to be added, requires ongoing careful management. Determining the logic of many-to-many relationships proved laborious. However, the goals of the project were met in the preservation of legacy EHR data, and the merging of data from multiple systems into a single repository of consistent, and standardized patient-centered data. DECIDe has the potential to advance AHS analytics in the areas of business intelligence, quality improvement, reporting, surveillance, and research.
The construction of DECIDe encountered several limitations primarily related to data heterogeneity and completeness. The integration of multiple EMRs over several decades of data collection introduced variability in coding standards such as ICD-9, ICD-10, and ICD-10-CA, as well as tumor classification systems such as ICD-O. Despite harmonization efforts, complete standardization across systems and time was not achievable. Data completeness also varied considerably because many fields were or are non-mandatory for completion within the EMR, resulting in missing or incomplete entries. Ethical constraints precluded imputation or deletion of records, while evolving data definitions such as changes in gender categories further complicated longitudinal consistency. Additionally, the absence of free-text integration limited the capture of nuanced clinical details. Finally, resource constraints inherent to public-sector funding affected hardware capacity, processing speed, and scalability. These limitations reflect common challenges in large-scale health data integration and should be considered when interpreting study findings. The scope of DECIDe provides a springboard for end-user initiated and co-designed, at-a-glance reporting outcomes for clinicians and administrators via dashboards. DECIDe also provisions consistent and standardized data that has effected efficiencies in the report request and generation process. Reporting outcomes that are not available via current dashboards can be generated directly from DECIDe with no need for repetitive and time-consuming code generation to merge or harmonized data from multiple sources for individual reports.
This embedded nature of business intelligence development benefits all stakeholders. End-users are empowered through a deeper understanding of data and how to use data to make informed decisions. Reduced timelines to meet end-user objectives can also be achieved via rapid iteration and feedback loops. The analytics team also benefits from the reduced timelines. The analytics team can also more precisely determine user requirements thus improving data quality. Additionally, improved understanding of what data is most relevant to end-users enables more impactful solutions and reduces unnecessary resource allocation. The analytical team can also align end-user efforts with broader AHS goals. Moreover, breaking down barriers between technical and non-technical staff can foster a collaborative environment where insights are shared and utilized more widely supporting a cultural shift to data-driven decision-making.
Decision making will be enhanced and improved via augmenting DECIDe in multiple phases. Work is currently underway to design architecture for inclusion of radiation oncology data from the current radiation oncology EHR. In addition to demographics and diagnosis, the radiation oncology system captures treatment plans and treatment record for patients who have received radiation therapy at one of the four radiation therapy treatment cancer centers in Alberta. Data includes treatment days, fractions, courses, modalities, body site, and machine used. The inclusion of this data will supply the necessary treatment information to support further analyses. DECIDe’s patient-centered design provides a robust framework to accommodate this expansion. Since most electronic health records (EHRs) utilize a similar patient-focused approach, semantic integration is greatly simplified. This structural alignment enhances the ease of integrating additional systems.
A similar project is underway in the province of Newfoundland and Labrador known as Health Connect. Health Connect aims to create a single, scalable, reliable, and secure data environment to help with providing optimal care. 19 However, there are key differences between DECIDe and Health Connect. While Health Connect has engaged an external vendor, DECIDe represents a fully internal data solution, designed, developed, and maintained by AHS. Although partnering with an external vendor can be beneficial when an organization seeks access to specialized expertise or when technical resources are operating at full capacity, the development of DECIDe within the organization was preferable for several reasons. Development in-house ensured comprehensive documentation and full retention of technical knowledge, as well as enables more agile responses to evolving requirements. It also mitigated data governance concerns related to third-party access to sensitive information and eliminated the need for additional financial expenditure.
The requirement to solely manage our own challenges facilitated an evolution in our knowledge of and experience with data warehousing. The result is that we can manage DECIDe through future iterations without requiring external consultation. Additionally, Health Connect focuses on screening and cancer registry data, while DECIDe is a system which focuses on all aspects of the patient cancer journey including treatment and visit data. To our knowledge, DECIDe is the first Canadian cancer population-based data warehouse that contains diagnostic, treatment, referral, and patient reported outcomes. DECIDe is uniquely placed in the Canadian healthcare system to offer a comprehensive population-based view of cancer treatment.20,21 When compared internationally, DECIDe appears to be unique in its scope of disease locations11,12,20,21 and number of unique patients managed and available for reporting.6,11,12,20,21
Conclusion
The creation of DECIDe realized a single system for business intelligence and provided a centralized patient cancer data repository effecting streamlined, more comprehensive, and more consistent reporting of cancer-related health services utilization and outcomes across our organization.
The principal lesson learned was to ensure that detailed and stringent requirements are documented and formally approved by the end-user groups. Another key lesson was the importance of thoroughly evaluating the scale of existing and growing data earlier in the project to help plan server growth.
DECIDe will facilitate advancements in cancer care and delivery within CCA as well as provide the fulcrum around which developments in cancer research and innovation will occur. Ultimately, DECIDe represents a step forward for CCA in treating persons diagnosed with cancer, as well as providing invaluable data for research in the fight against cancer.
Footnotes
Acknowledgements
We gratefully acknowledge the financial support provided by Alberta’s Tomorrow Project for funding staff salaries. Without their generous support, this research would not have been possible. We thank data architect Alan De Los Angeles for architectural development, and Senior Programmer Shan Wan who performed the bulk of the stored procedure development. We also thank Meichun Jiang for help with ETL implementation and Ke Chen for contributing to data base coding and stored procedure implementation.
Ethical considerations
The research conducted for this paper did not require formal approval from an ethics board due to the nature of the study and the characteristics of the work involved. Measures were implemented to ensure the privacy and confidentiality of participants’ data. Any personal information collected was anonymized and stored securely to prevent unauthorized access or disclosure.
Author contributions
K.S.: Data Collection, Data Curation, Visualization, Writing, Writing – Review & Editing.
J.S.: Conceptualization, Methodology, Data Collection, Data Curation, Architecture, Validation, Writing – Review & Editing, Supervision, Project Administration
E.M.: Writing – Review & Editing.
J.B.: Writing – Review & Editing.
P.R.: Writing – Review & Editing.
L.S.: Writing – Review & Editing, Supervision.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
In accordance with the guidelines of the Health Informatics Journal, the authors declare that they have no conflicts of interest to disclose pertaining to this research. All authors contributed to the study design, data collection, analysis, and interpretation, and have approved the final version of the manuscript for publication.
