
Abstract

Objective: To explore the application of online analytic processing (OLAP) to improve the efficiency of analytics using large administrative health data sets. Methods: 18 years of administrative health data (1994/95 to 2012/13) were obtained from the Alberta Ministry of Health in Canada. The data sets included hospitalization, ambulatory care and practitioner claims data. Reference files were obtained that provided information including patient demographics, resident postal code, facility, and provider details. Population counts and projections by year, sex and age were included for rate calculations. These sources were used to develop a data cube using OLAP tools. Results: For simple queries that did not require linkage of data sets, run time using the data cube was reduced to 5% of that required with conventional methods. The data cube negated the need for many intermediary steps for data extraction and analyses for research activities. Conventional methods required over 250 GB of server space for multiple analytic subsets, compared to only 10.3 GB for the data cube. Conclusions: Cross-training in information technology and health analytics is recommended to provide capacity to better leverage OLAP tools, which are available with many common applications.

Keywords

Administrative health data, data linkage, multidimensional data cube, on-line analytic processing, OLAP

Introduction

Health care research projects frequently involve integration of multiple large data sets, such as electronic medical records from practitioners’ offices^1,2 and population-based linked administrative health data for physicians’ claims, hospital stays and medication records.³ Common approaches to working with these data, which involve direct database queries for data extraction and/or statistical packages for data management and analysis, can quickly become time-consuming and onerous,⁴ even at sub-terabyte volumes of data. To address this problem, analysts may create multiple smaller subsets or aggregations, followed by merging and compilation, before they can obtain a working analytic data set. Although reducing data set size may address some problems related to run time, fragmented analytic processes result, and data management issues can be compounded. In analytic work, this pattern of resource-intensive data preparation followed by a fraction of resource use for actual data analysis is referred to as the ‘80:20 rule’.⁵ The ratio of time spent in data preparation versus actual analyses and reporting of results can be further compounded when using ‘big data’. Big data are often characterized by volume (amount of data), velocity (rate of data accumulation) and variety (data types).⁶ Some authors also add the term ‘veracity’ to the collection of big data ‘Vs’ to emphasize the importance of data quality.⁴

Although our study did not involve large volumes of data relative to that being experienced in some areas of research,⁷ the combination of the 4 Vs unquestionably presented analytic challenges.^8–10 For example, frequent iterations and/or data refreshes are often required during development of patient-level simulation models for evaluation of alternative models.¹¹ The processes and resources involved in bringing our research data to an analyzable state exemplify the issues encountered. Considerable resources were required to support the initial process for managing and analyzing these data. In addition, future research plans would require appending new releases of the administrative data and addition of other large data sources, such as administrative prescription drug data. These updates were expected to further complicate data management and analyses.

One solution, online analytical processing (OLAP) technology, has been an available and maturing technology for over two decades.^12,13 OLAP technology expedites querying of large data stores via a multidimensional structure, often referred to as a data cube. In addition to supporting integrated calculated measures, many front-end applications for OLAP solutions are also flexible in that experienced analysts can create custom measures based on the underlying data.^14,15

The research data were already being stored in a database environment with OLAP tools readily available. Over the years, occasional reports have appeared indicating adoption of OLAP tools to support genomic and molecular research,^14,16,17 but to our knowledge, applications in health research settings have been uncommon. These tools are increasingly being used in health care organizations,^13,18–20 and recently published articles demonstrate application to extract inputs for machine learning models,^21,22 and analyses in public health during the COVID pandemic.²³ In this work, we explore the potential usefulness of OLAP tools to support data management and analyses required for an economic simulation modeling study. The intent was to find an alternative method of managing and analyzing data that would support fast and flexible data extraction as well as calculation of standardized measures needed for our health research initiative.

In this article, we discuss methods used for the development of a multidimensional data cube using OLAP tools to support an economic health research initiative that involved analyses of multiple large administrative data sources.

Methods

The aim of the System Dynamic Osteoarthritis project is to create a simulation model to inform development of clinical pathways and health service delivery within the province of Alberta, Canada. 18 years of health care utilization data (fiscal years 1994/95 to 2012/13) across multiple health services were obtained from the provincial Ministry of Health to provide data inputs for development of the system dynamics simulation model. In Alberta, administrative data from the publicly funded health system are made available to support organizational needs, such as planning and quality improvement, as well as research activities. These data were made available in anonymized form after completion of a Data Sharing Agreement with the Alberta Ministry of Health.²⁴ The health care utilization data sets included hospitalizations, ambulatory care visits (emergency and other outpatient visits) and practitioner claims (e.g., billing data primarily from physicians, but also other health care providers that were reimbursed by the publicly funded health care system). In addition, several reference files were obtained that provided information such as patient demographics, year(s) of coverage in the Alberta health system, resident postal code, facility, and provider details. A file containing population counts (historic and projections up to 2045) for each year, by sex, residence location (based on postal code), and age, was also required for the denominator in rate calculations. The demographic and other information was limited to that available; these variables were included in the conceptual model of the OLAP solution and assessed as stratification factors in the simulation model. These data, summarized in Table 1, formed the basis of development of the OLAP solution.

Table 1.

Administrative data sources accessed for the osteoarthritis research project.

Data set	Source^a	Rows (approximate)^b	Variables	Size (MB)
Practitioner claims	AH	316 M	100	46,800
Ambulatory data (ACRS and NACRS)^c	AH	53.6 M	48	15,203
Inpatient data (DAD)^c	AH	3.4 M	182	2,120
Cohort file^d	AH	1.5 M	4	160
Patient registry	AH	17.2 K	14	2,301
Provider	AH	17.2 K	4	178
Population counts	AHS	531 M	9	180
Postal code translation file	AHS	90.5 K	16	120
International Classification of Diseases (ICD-10)	AHS	18.4 K	7	196
International Classification of Diseases (ICD-9)	AHS	8.4 K	5	110
Canadian Classification of Health Interventions (CCI)	AHS	45.5 K	8	182
Facility information	AHS	88.2 K	22	210
Totals		560.6 M	419	67,760 (67.8 GB)

^aAlberta Health (AH); Alberta Health Services (AHS).

^bK = thousand; M = million.

^cAmbulatory Care Reporting System (ACRS); National Ambulatory Care Reporting System (NACRS); Discharge Abstract Data (DAD).

^dListing of patients identified in provincial administrative sources with at least one osteoarthritis diagnosis (between 1994/95 and 2012/13, inclusive).²⁵

Extensive data management and analytic tasks were required to utilize these data for the intended research purposes. Development of a multidimensional data cube was undertaken to facilitate quick access to iterations of analyses as information was required for simulation model inputs. The process for creating the data cube for the current research project is described below in terms amenable to all types of research analysts, with minimal use of technical language.

Multidimensional cube development

Development of a data cube was initiated in tandem with the on-going work being conducted by an analyst using conventional epidemiologic methods. This process facilitated validation during development of the cube, and assessment of efficiency improvements. The research data were housed in a standard database with the OLAP tools available. OLAP tools were accessible in a Microsoft SQL Server environment,²⁵ although there is now a plethora of software options with OLAP tools, including open source.^26–28 The hardware requirements for this project were minimal (the solution could run on most new office workstations), but requirements could vary depending on the size of data and number of users, thus consultation with information technology and/or individuals familiar with OLAP is recommended. Initial analytic activities (using conventional methods) were being conducted using database queries to create merged subsets and aggregations; a statistical application was used for more complex statistical computations.

The first step in developing the data cube was to restructure the original data sources into a series of database tables in a format that could be used as the data source for the multidimensional cube. Data cubes consist of two main constructs: fact and dimension tables. Fact tables consist of the core data of interest from which measures are derived, such as inpatient visits, visits to a physician, and cost information. Dimension tables consist of the parameters by which the researchers may want to stratify (i.e., slice/dice) data (e.g., age, geography, health facility, etc.). The source files for a data cube are typically stored in the database using a naming convention that prefixes the table name with the type of table (i.e., dimTableName, factTableName) to facilitate differentiation of these types of tables in the warehouse. Tables may also contain both dimension and fact elements, an example of which is described below.
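As a minimal illustration of the fact/dimension split described above, the following sketch builds a tiny star schema in an in-memory SQLite database using Python's standard library. The table names (factVisits, dimFacility), the column names and all rows are hypothetical and are not taken from the study data; the study itself used Microsoft SQL Server rather than SQLite.

```python
import sqlite3

# In-memory database; every name and row below is an illustrative assumption.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dimFacility (
        facility_id   INTEGER PRIMARY KEY,
        facility_name TEXT,
        zone          TEXT      -- aggregation level above facility
    );
    CREATE TABLE factVisits (
        visit_id    INTEGER PRIMARY KEY,
        patient_id  INTEGER,
        facility_id INTEGER REFERENCES dimFacility(facility_id),
        cost        REAL
    );
    INSERT INTO dimFacility VALUES (1, 'Clinic A', 'North'),
                                   (2, 'Clinic B', 'North'),
                                   (3, 'Hospital C', 'South');
    INSERT INTO factVisits VALUES (1, 101, 1, 50.0),
                                  (2, 101, 2, 75.0),
                                  (3, 102, 3, 400.0),
                                  (4, 103, 3, 120.0);
""")

# Slice fact measures (visit count, distinct patients, total cost)
# by an aggregation level held in the dimension table.
rows = conn.execute("""
    SELECT d.zone,
           COUNT(*)                     AS visits,
           COUNT(DISTINCT f.patient_id) AS patients,
           SUM(f.cost)                  AS total_cost
    FROM factVisits AS f
    JOIN dimFacility AS d ON d.facility_id = f.facility_id
    GROUP BY d.zone
    ORDER BY d.zone
""").fetchall()

for zone, visits, patients, total_cost in rows:
    print(zone, visits, patients, total_cost)
```

The fact table holds only keys and measures, while the descriptive attributes and aggregation levels live in the dimension table; an OLAP engine precomputes aggregations of this kind rather than running the join at query time.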

Fact tables

Two main fact tables were created, from which most measures were derived. Selected fields from the three main data sources (inpatient, ambulatory and practitioner claims) were appended to create a primary fact table (factUtilization). This process required input from analysts knowledgeable about the three utilization data sources to inform mapping of fields across the data sets. The final fact table was an amalgamation of data which included eight fields that utilized data from all three sources, 16 fields from two sources, and 41 fields that obtained data from only one source. The resulting table allowed creation of simple measures, such as distinct counts of patients who utilized any of the three services, and summative measures such as costs and number of health care encounters, that could be stratified (i.e., sliced) by the various dimensions. In creating a fact table, it is also necessary to incorporate the finest level of stratification (or grain) that will be linked to dimension tables (where the levels of aggregation are defined). A field was also added to define the data source from which the utilization data were derived (i.e., inpatient, ambulatory or practitioner claims). Three additional fields for patient demographics (age in utilization year, sex, and residence postal code) that were originally provided in the Health Care Registry file were merged into the fact table.
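The appending step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source field names (pid, patient, phn, and so on), the mapping and the records are invented for the example and do not reflect the actual layouts of the Alberta data sets.

```python
# Hypothetical source extracts; field names differ by source.
inpatient = [{"pid": 1, "admit_year": 2010, "cost": 5200.0}]
ambulatory = [{"patient": 1, "visit_year": 2011, "cost": 310.0},
              {"patient": 2, "visit_year": 2011, "cost": 150.0}]
claims = [{"phn": 2, "service_year": 2011, "paid": 45.0}]

# Mapping from each common fact-table field to the source's own field name
# (an assumed mapping, standing in for the analysts' field-mapping work).
FIELD_MAP = {
    "inpatient":  {"patient_id": "pid",     "year": "admit_year",   "cost": "cost"},
    "ambulatory": {"patient_id": "patient", "year": "visit_year",   "cost": "cost"},
    "claims":     {"patient_id": "phn",     "year": "service_year", "cost": "paid"},
}

def to_fact_rows(records, source):
    """Rename source fields to the common layout and tag each row with its source."""
    mapping = FIELD_MAP[source]
    return [
        {**{common: rec[original] for common, original in mapping.items()},
         "source": source}
        for rec in records
    ]

# Append (rather than subset) the three sources into one fact table.
fact_utilization = (
    to_fact_rows(inpatient, "inpatient")
    + to_fact_rows(ambulatory, "ambulatory")
    + to_fact_rows(claims, "claims")
)

# Simple measures across all three services.
distinct_patients = len({row["patient_id"] for row in fact_utilization})
total_cost = sum(row["cost"] for row in fact_utilization)
print(len(fact_utilization), distinct_patients, total_cost)
```

Keeping a source field on every row is what later allows the same measure to be sliced by service type without returning to the original files.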

Two additional fields were added to the main fact table based on algorithms defined by the research team: i) a visit type field that defined health care utilization for specific services, such as radiology (based on the provider specialty and any of 16 procedure codes) and diagnostic imaging visits (based on financial and diagnostic codes, and specialty type), and ii) a disease stage field, in which one of five stages of disease was assigned to each utilization event. Although some of the information acquired through these additional fields could be extracted from a data cube using a series of filters based on dimension values, integration into the fact table allowed direct slicing of data and expedited extraction of data for the simulation inputs. The process to integrate these two fields provides an example of the team collaboration that was required to develop the analytic platform: the benefits and advantages of the two methods were considered by the multidisciplinary team prior to a decision on the final method.

A second fact table (factPopulation) was created based on historical and projected provincial population data (up to 2045). The population counts were available at age, sex and postal code level. The population counts were required for denominators in calculations of utilization rates, incidence and prevalence of disease, as well as for inputs into the simulation models projecting disease burden, health care utilization and costs.
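To make the role of the population denominators concrete, the sketch below computes a stratified rate per 1,000 population, defined here as the number of distinct patients with any utilization event divided by the population count for the same stratum. The strata, counts and the choice of a per-1,000 scale are assumptions for illustration only; the study's actual rate definitions are not reproduced here.

```python
from collections import defaultdict

# Hypothetical numerator rows: (year, sex, patient_id), one per utilization event.
events = [
    (2010, "F", 1), (2010, "F", 1), (2010, "F", 2),
    (2010, "M", 3),
    (2011, "F", 2),
]
# Hypothetical denominators: population counts by (year, sex).
population = {(2010, "F"): 2000, (2010, "M"): 1000, (2011, "F"): 2500}

def rates_per_1000(events, population):
    """Distinct patients with any event per 1,000 population, by (year, sex)."""
    patients = defaultdict(set)
    for year, sex, patient_id in events:
        patients[(year, sex)].add(patient_id)  # sets de-duplicate repeat visits
    return {
        stratum: 1000 * len(ids) / population[stratum]
        for stratum, ids in patients.items()
    }

rates = rates_per_1000(events, population)
for stratum in sorted(rates):
    print(stratum, rates[stratum])
```

The numerator and denominator must share the same grain (here year and sex), which is why the population table needs counts at the finest level by which rates will be stratified.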

Dimension tables

Several dimension tables were created based on pre-existing reference tables provided by Alberta Health and Alberta Health Services (the service delivery entity in the province), with modification to add aggregation levels, as required for the research analytics. Data for a dimension table with health care provider details (dimProvider) were obtained from Alberta Health and included the provider’s sex, year of registration and age at practice start. Sources for dimension tables obtained from Alberta Health Services included: i) disease classification (dimICD, based on the World Health Organization’s International Classification of Diseases (ICD) codes, versions 9 and 10), ii) procedures (dimProcedure, based on the Canadian Classification of Health Interventions (CCI)), iii) facility details (dimFacility) and iv) geographic information (dimPostalCode, based on the Postal Code Translation file).

Multiple files were obtained for the diagnostic and procedure codes, as the source data were in a semi-relational format. For example, the CCI codes were obtained from four separate files: an intervention code file (the finest level of information, which linked to the factUtilization table), two additional files with aggregation levels and a separate file with provincial definitions for surgical procedures. Similarly, the diagnostic information was provided in three files for the ICD-10 information, plus one additional file with ICD-9 codes and descriptors. These files were compiled (denormalized) to create two dimension tables (dimICD and dimProcedure). Three additional fields were added to the procedure dimension to allow a higher level of aggregation (than was provided in the source files). For diagnostic information, the three ICD-10 reference tables were compiled into one table, and then the one source file with ICD-9 codes and descriptors was appended. A mapping file (ICD-9 to ICD-10) obtained from the Canadian Institute for Health Information will be used to create aggregation levels for the ICD-9 groups to facilitate grouping of diagnostic information (work underway). A flow diagram of the process to create the diagnostic and procedure dimensions is provided in Figure 1.
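The denormalization of semi-relational reference files into a single dimension table can be sketched as follows, again using SQLite from Python's standard library. The three reference tables, the placeholder codes (X01, B1, CH1) and the output table name dimDiagnosis are hypothetical; they stand in for the code-level and aggregation-level files described above and do not reproduce real ICD or CCI content.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Three normalized reference files (hypothetical codes and labels).
    CREATE TABLE refCode    (code TEXT PRIMARY KEY, code_desc TEXT, block_id TEXT);
    CREATE TABLE refBlock   (block_id TEXT PRIMARY KEY, block_desc TEXT, chapter_id TEXT);
    CREATE TABLE refChapter (chapter_id TEXT PRIMARY KEY, chapter_desc TEXT);

    INSERT INTO refChapter VALUES ('CH1', 'Chapter one');
    INSERT INTO refBlock   VALUES ('B1', 'Block one', 'CH1'),
                                  ('B2', 'Block two', 'CH1');
    INSERT INTO refCode    VALUES ('X01', 'Code X01', 'B1'),
                                  ('X02', 'Code X02', 'B1'),
                                  ('X10', 'Code X10', 'B2');

    -- Denormalize: one row per code, carrying every aggregation level.
    CREATE TABLE dimDiagnosis AS
    SELECT c.code, c.code_desc,
           b.block_id, b.block_desc,
           ch.chapter_id, ch.chapter_desc
    FROM refCode AS c
    JOIN refBlock   AS b  ON b.block_id = c.block_id
    JOIN refChapter AS ch ON ch.chapter_id = b.chapter_id;
""")

dim_rows = conn.execute(
    "SELECT code, block_id, chapter_id FROM dimDiagnosis ORDER BY code"
).fetchall()
for row in dim_rows:
    print(row)
```

Because each code row now carries its own block and chapter, the fact table needs only the finest-level code to be rolled up to any level of the hierarchy.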

Figure 1.

Star schema data structure for multidimensional data cube based on administrative health data.

Longitudinal analysis based on cohort entry was also a requirement. Due to the complexity of the definition, the cohort entry dates for patients were first determined using a statistical package, and subsequently this information was integrated into a dimension table. This allowed users to select a block (e.g., year) of cohort patients and examine annual utilization rates longitudinally. The linkages between the fact and dimension tables were then organized in the OLAP software. This organized data format is known as a star schema and is the foundation for the multidimensional data cube (Figure 1).
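A small sketch of how a cohort dimension supports longitudinal slicing is given below. It assumes a hypothetical dimCohort table holding one precomputed entry year per patient and a simplified two-column version of factUtilization; the column names and rows are invented, and the actual cohort definition used in the study was considerably more complex.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical cohort dimension: entry year is computed upstream.
    CREATE TABLE dimCohort (patient_id INTEGER PRIMARY KEY, entry_year INTEGER);
    -- Simplified fact table: one row per utilization event.
    CREATE TABLE factUtilization (patient_id INTEGER, util_year INTEGER);

    INSERT INTO dimCohort VALUES (1, 2005), (2, 2005), (3, 2006);
    INSERT INTO factUtilization VALUES
        (1, 2005), (1, 2005), (1, 2006),
        (2, 2005), (2, 2007),
        (3, 2006), (3, 2007);
""")

def visits_by_followup_year(conn, entry_year):
    """Visits per year since cohort entry for patients entering in entry_year."""
    return conn.execute("""
        SELECT f.util_year - c.entry_year AS followup_year,
               COUNT(*)                   AS visits
        FROM factUtilization AS f
        JOIN dimCohort AS c ON c.patient_id = f.patient_id
        WHERE c.entry_year = ?
        GROUP BY followup_year
        ORDER BY followup_year
    """, (entry_year,)).fetchall()

cohort_2005 = visits_by_followup_year(conn, 2005)
print(cohort_2005)
```

Selecting a block of cohort patients is then a filter on the dimension, and follow-up time is derived from the difference between the utilization year and the entry year.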

The work described above was conducted using SAS^® software (SAS Institute. 2011) for statistical analyses and Microsoft SQL Server^® 2017 Enterprise Edition for development of the multidimensional cube.

Results

Development of the multidimensional cube occurred in iterations over approximately 12 months. The initial challenge was a paucity of human resources with experience or training using the software. The shortfall in access to human resources with the required skills in applying these tools in research settings has been highlighted previously.^29,30 Coordinating the time of a multidisciplinary group of analysts, researchers with content expertise, and information technology resources was also a challenge at times. However, the expertise from various disciplines was crucial to developing this analytic solution.

Preparation of the data for OLAP processing was counter to the conventional processes in many ways (Table 2). Rather than sub-setting the data, the three administrative data sets were compiled into one. This compilation of tables containing the utilization data to be analyzed (e.g., row counts, sums, distinct counts, rate calculations) provided the core table for the OLAP solution (fact table).

Table 2.

Comparison of conventional analytic approaches with cube development.

Conventional methods	Cube development
Independent analysts	Development team: researchers, health data analysts, information technology, subject matter experts
Subset large data sets	Append large data sets
Merge reference information into subsets	Link dimensionalized reference tables to appended source files in star schema
Analytic measures run for simulation inputs as needed, by aggregation levels (in SQL Server Management Studio or statistical software)	Analytic measures integrated into solution: output can be aggregated via simple pivot table interface (SAS Enterprise Guide, Excel, Power BI, Tableau, Reporting Services)

Integration of reference information and definition of stratification variables was also counter-intuitive to conventional approaches for compiling research data. Rather than creating multiple flat files by merging the administrative files with reference data and writing code to create aggregation levels, this information was compiled into dimension tables. End users were then able to extract and analyze the information via a simple pivot table interface using various common tools such as Microsoft Excel and Power BI, Tableau, and statistical software, such as SAS Enterprise Guide.
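The kind of cross-tabulation that a pivot table interface produces can be approximated in plain Python. The sketch below is only an analogy for what front-end tools do against a cube: the records, field order and the pivot_sum helper are hypothetical, and a real OLAP client would retrieve pre-aggregated values rather than loop over rows.

```python
from collections import defaultdict

# Hypothetical extract rows: (year, source, cost).
records = [
    (2010, "inpatient", 5000.0),
    (2010, "claims", 40.0),
    (2010, "claims", 60.0),
    (2011, "ambulatory", 300.0),
    (2011, "claims", 50.0),
]

def pivot_sum(records):
    """Cross-tabulate summed cost with years as rows and sources as columns."""
    table = defaultdict(lambda: defaultdict(float))
    for year, source, cost in records:
        table[year][source] += cost
    # Convert nested defaultdicts to plain dicts for predictable output.
    return {year: dict(cols) for year, cols in table.items()}

pivot = pivot_sum(records)
for year in sorted(pivot):
    print(year, pivot[year])
```

Swapping the row and column fields, or replacing the sum with a count, corresponds to dragging a different dimension or measure into the pivot layout.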

Data retrieval time using a data cube has been reported to be approximately 0.1% of that required to run direct queries from a database.³¹ We observed major time reductions for relatively simple queries, as well as more complex queries that required linkage of several tables (Figure 2). The analytics platform has also negated the need for many intermediary steps for data extraction and analyses for iterations of development of the simulation models.

Figure 2.

Comparison of query time using the data cube versus direct database queries. *Instantaneous run from cube.

In addition, a significant reduction in server space was achieved, primarily due to the compression of data in the cube format, as well as a reduction in the number of subsets required, and pre-aggregated tables stored in the data warehouse (Figure 3).

Figure 3.

Comparison of analytics process using conventional methods with direct database access and statistical tools versus the online analytic processing solution (data cube).

There were also significant differences in storage space requirements when the conventional methods were compared with the OLAP solution. The original source data were over 67 GB (Table 1). This increased to nearly 250 GB of space after creation of multiple subsets and aggregated files for analysis using conventional methods. Comparatively, the data tables required for the star schema were created using views (thus data were not replicated in analytic tables), and the size of the final data cube was only 10.3 GB (Figure 3).
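The relative sizes implied by these figures can be worked out directly. The short calculation below uses the 67.8 GB source total from Table 1, the approximate 250 GB conventional footprint and the 10.3 GB cube size; because the 250 GB figure is approximate, the resulting ratios should be read as rough indications only.

```python
# Storage figures reported in the text and Table 1, in GB.
source_gb = 67.8          # original source data
conventional_gb = 250.0   # approximate footprint with subsets and aggregates
cube_gb = 10.3            # final data cube

cube_vs_conventional = cube_gb / conventional_gb  # cube as a fraction of conventional
cube_vs_source = cube_gb / source_gb              # cube as a fraction of source data
expansion = conventional_gb / source_gb           # growth under conventional methods

print(f"{cube_vs_conventional:.1%} {cube_vs_source:.1%} {expansion:.1f}x")
```

On these numbers the cube occupies roughly 4% of the conventional footprint and about 15% of the original source size, while the conventional workflow expanded the source data by a factor of about 3.7.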

Discussion

This project has demonstrated how an OLAP solution can be utilized to address some of the challenges encountered by researchers related to managing and analyzing large and/or complex data sets. In the current project, traditional health research methods for managing and analyzing data sets in the gigabyte range were inefficient, requiring sub-setting and repeated data linkages to achieve reasonable run time. The major efficiency gains we observed exemplify the need for researchers to begin to explore alternatives to traditional methods for managing their data stores and conducting analyses. The utility of an OLAP solution to expedite more complex analytics has also been demonstrated previously.^32,33 It has been shown that, while an OLAP solution may limit the capacity for complex analyses, OLAP can expedite descriptive analytics and extraction of data for more complex analytics with statistical tools. While we observed significant efficiency gains with respect to storage space and processing, gains may vary with the size of the data set and complexity of analytics. While accessing raw sources can provide more analytic flexibility and will still be required for some analyses, a technical solution such as OLAP could be applied to support even complex analyses,³² provided that end users are adequately trained in use of these types of platforms.

The type of analytics required should also be considered when assessing the utility of an OLAP solution. If the analytics require frequent access to granular data (for example, patient- and physician-level data) rather than aggregate information, an OLAP solution may impart only limited benefits. Consultation with those familiar with OLAP applications for analytics is recommended. Additionally, the current study was able to access line-level data from several sources with an anonymized, but linkable, identifier. Several countries are improving access to such information for research purposes,^34,35 but access does vary^36,37 and should be considered when assessing potential analytic efficiency gains and feasibility of developing OLAP solutions.

Integration of training on these alternative methods into programs for research analysts is needed to increase uptake in the academic environment. This in turn would also benefit health care organizations by providing applied scientists and analysts with the training and skills needed to work with large data sets by exposing trainees to tools typically found in the information technology domain. Failure to adopt technical solutions to enhance analytic efficiencies in health research settings may be due to various factors. First, health data analytics may require very different measures than are required for typical for-profit industries.³⁸ Compared to straightforward business measures such as volume of sales and number of customers, epidemiologic measures often involve complex rate calculations, risk adjustment and standardized estimates. Costing is also not as straightforward in healthcare and may require use of parameters such as resource intensity weighting,³⁹ as was the case in this study. Further, methods to manage and analyze health data are typically taught in public and community health programs using statistical programs. Conversely, education related to technical solutions is usually part of the information technology domain, where methods for analyses of complex health care data are not included in the curriculum.

The lack of application of existing tools that could impart huge efficiencies in how health data are accessed and analyzed in research settings is likely attributable, at least in part, to the disconnect between health data analysts and information technology, including differences in access to technical tools and in training. In addition, cloud-based solutions may not always be feasible for researchers working with sensitive and/or personal health data, thus limiting options to researchers. Some have proposed that issues such as costs or patient confidentiality may be in part responsible for this delay.⁹ However, tools to support large scale analytics platforms are often embedded in database applications that are used to house data on-site. Further, newer open source OLAP applications which run on big data platforms such as Hadoop® have become available in recent years²⁸ and should decrease the costs of implementing this technology. Regarding data security, technical applications used to develop analytic platforms often support integrated role-based security, and thus may provide data security superior to that achievable with systems where analysts must access databases directly to extract and manipulate data for analyses. Thus, the division of roles between health data analysts, information technology, access to tools and their disparate training is a likely culprit that contributes to lack of application of tools that could impart efficiencies in how health data are accessed and analyzed for research in academic settings.

Lessons learned

We found that close collaboration of a multidisciplinary team was crucial to development of a functional solution with validated output acceptable to the research group. The need for strong collaborations has also been stressed in an early discussion of implementing business intelligence solutions in healthcare settings.⁴⁰ The final product relied on collaborative development, with research analysts having extensive input into the structure of the star schema and measures in the solution, and information technology supporting creation of the data sets required for the cube solution. Both researchers and information technology supported development of the solution within the technical platform using OLAP tools, and subsequent validation processes.

The changes in work processes, where researchers now work with ‘pre-analyzed’ data, rather than accessing the raw data and conducting analysis on an individual basis, also created some initial dissonance. However, concerns regarding validity of the solution output were addressed by: i) involving all team members in development, ii) exposing all team members to the software applications and methods involved in creating the solution, and iii) provisioning key members of the research group (epidemiologists and operations researchers) with access to the raw data to support validation checks. In addition, by integrating preliminary work done by a research analyst to define cohort populations into the data cube solution, we were able to expand the application to support longitudinal analyses.

Despite longer than expected development time, the final product has resulted in major efficiency gains in how data analysis and management are done within the research team. There was a steep learning curve, for both information technology support and research analysts, and an adjustment in work processes. We expect that the learnings from this project will serve to expedite future iterations of development of the current research data and can be leveraged to support research involving other large data sets. The scalability of these types of solutions has been demonstrated by others,⁴¹ and we plan to apply these learnings to expedite integration of updated data and new sources into the OLAP platform.

Our experience in this project leads us to propose two key recommendations for the academic community using large data sets: i) courses taught in health research disciplines need to begin exposing students to technical applications that facilitate efficient management and analyses of large data sets, which will in turn bolster skill sets for those entering work environments, and ii) methods to apply technical solutions should be leveraged across research groups (e.g., sharing of methods used to create data structures needed for OLAP solutions, which could be applied using many different software applications). Shared learnings across research groups will allow the research community to strengthen capacity to efficiently utilize the growing amount and complexity of data in academic settings, as well as provide trainees and students with expanded skill sets for entering the workforce more prepared to work with large data sources.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Funding for this study was provided by the Canada Foundation for Innovation (CFI) grant “An integrative approach for translating research to improve musculoskeletal health” and the CIHR (Grant #: 126128) Operating grant “Developing an innovative evidence-based decision support tool to improve osteoarthritis care planning and health service management for diverse patient populations in Alberta, Saskatchewan and Manitoba”. DAM is supported by the Arthur J.E. Child Chair in Rheumatology and a Canada Research Chair in Health Systems and Services Research (2008–2018).

Ethical approval

These data were made available in anonymized form after receiving approval from the University of Calgary Conjoint Research Ethics Board (UCCREB) and completion of a Data Sharing Agreement with the Alberta Ministry of Health.

Consent

The UCCREB also approved a waiver of consent, given that the data were existing, anonymized and that there was no direct contact between the researchers and patients.

ORCID iD

Shelly Vik

References

Garies

Youngson

Soos

, et al. Primary care EMR and administrative data linkage in Alberta, Canada: describing the suitability for hypertension surveillance. BMJ Health Care Inform 2020; 27: e100161. Epub ahead of print 2020. DOI: 10.1136/bmjhci-2020-100161.

Ludwick

Doucette

. Adopting electronic medical records in primary care: lessons learned from health information systems implementation experience in seven countries. Int J Med Inform 2009; 78: 22–31. Epub ahead of print. DOI: 10.1016/j.ijmedinf.2008.06.005.

Doiron

Raina

Fortier

, et al. Linking Canadian population health data: maximizing the potential of cohort and administrative data. Canadian Journal of Public Health 2013; 104: e258–e261. Epub ahead of print 2013. DOI: 10.17269/cjph.104.3775.

Raghupathi

. Big data analytics in healthcare: promise and potential. Health Inf Sci Syst 2014; 2. Epub ahead of print. DOI: 10.1186/2047-2501-2-3.

Nisbet

Elder

Miner

. The right model for the right purpose: when less is good enough. In: Handbook of statistical analysis and data mining applications. San Diego: Elsevier Science & Technology, 2009, pp. 728–728.

Beyer

Laney

. The importance of ‘big data’: a definition. Stamford, CT: Gartner, 2012

Check Hayden

. Experimental drugs poised for use in Ebola outbreak. Nature 2018; 557: 475–476. Epub ahead of print 2015. DOI: 10.1038/nature.2015.17912.

Adams

. Genetics: big hopes for big data. Nature 2015; 527: S108–S109. Epub ahead of print 2015. DOI: 10.1038/527S108a.

Belle

Thiagarajan

Soroushmehr

SMR

, et al. Big data analytics in healthcare. Biomed Res Int 2015; 2015: 370194. Epub ahead of print 2015. DOI: 10.1155/2015/370194.

10.

Kahn

Weng

. Clinical research informatics for big data and precision medicine. Yearb Med Inform 2016; 25: 211–218. Epub ahead of print. DOI: 10.15265/iy-2016-019.

11. Marshall, Burgos-Liz, Pasupathy, et al. Transforming healthcare delivery: integrating dynamic simulation modelling and big data in health economics and outcomes research. Pharmacoeconomics 2016; 34(2). Epub ahead of print 2016. DOI: 10.1007/s40273-015-0330-7.

12. Hettler. Data mining goes multidimensional. Healthc Inform 1997; 14: 43–46.

13. Silver, Sakata, et al. Case study: how to apply data mining techniques in a healthcare data warehouse. J Healthc Inf Manag 2001; 15: 155–164.

14. Alkharouf, Jamison, Matthews. Online Analytical Processing (OLAP): a fast and effective data mining tool for gene expression databases. J Biomed Biotechnol 2005; 2005: 181–188. Epub ahead of print 2005. DOI: 10.1155/JBB.2005.181.

15. Ribeiro, Silva, da Silva. Data Modeling and Data Analytics: A Survey from a Big Data Perspective. Journal of Software Engineering and Applications 2015; 8(12). Epub ahead of print 2015. DOI: 10.4236/jsea.2015.812058.

16. Dzeroski, Hristovski, Peterlin. Using data mining and OLAP to discover patterns in a database of patients with Y-chromosome deletions. Proceedings/AMIA. Annual Symposium AMIA Symposium 2000; 2000: 215–219.

17. Kehl, Simms, Toofanny, et al. Dynameomics: a multi-dimensional analysis-optimized database for dynamic protein data. Protein Engineering, Design and Selection 2008; 21(6). Epub ahead of print 2008. DOI: 10.1093/protein/gzn015.

18. Gordon, Asplin. Using online analytical processing to manage emergency department operations. Acad Emerg Med 2004; 11(11). Epub ahead of print 2004. DOI: 10.1197/j.aem.2004.08.015.

19. Hristovski, Rogac, Markota. Using data warehousing and OLAP in public health care. Proceedings/AMIA. Annual Symposium AMIA Symposium 2000; 2000: 369–373.

20. Studnicki, Fisher, Eichelberger, et al. NC CATCH: advancing public health analytics. Online J Public Health Inform 2010; 2. Epub ahead of print 2010. DOI: 10.5210/ojphi.v2i3.3348.

21. Machine learning based method for prediction of heart disease in big data environment. International Journal of Innovative Technology and Exploring Engineering 2020; 9. Epub ahead of print 2020. DOI: 10.35940/ijitee.f3957.049620.

22. Leung, Chen, Hoi CSH, et al. Machine learning and OLAP on big COVID-19 data. In: Proceedings - 2020 IEEE International Conference on Big Data, Atlanta, GA, USA, 10–13 December 2020. Epub ahead of print 2020. DOI: 10.1109/BigData50022.2020.9378407.

23. Agapito, Zucco, Cannataro. COVID-warehouse: a data warehouse of Italian COVID-19, pollution, and climate data. Int J Environ Res Public Health 2020; 17: 5596. Epub ahead of print 2020. DOI: 10.3390/ijerph17155596.

24. Government of Alberta. Health data for research. Atlanta, GA: Government of Alberta, 2022. https://www.alberta.ca/health-research.aspx

25. Microsoft. Washington, DC: Microsoft SQL Server, 2017.

26. Guarda, Carvaca, Gozabay, et al. Business intelligence analytic tools. In: Bertino, Gao, Steffen (eds). Computational science and its applications. Cham, Switzerland: Springer, 2022.

27. SAS Institute Inc. SAS® OLAP server and SAS® OLAP cube studio. Cary, NC: SAS Institute Inc, 2006.

28. Ranawade, Navale, Dhamal, et al. Online analytical processing on hadoop using Apache Kylin. Int J Appl Inf Syst 2017; 12: 1–5. Epub ahead of print 2017. DOI: 10.5120/ijais2017451682.

29. Anderson, Lee, Brockenbrough, et al. Issues in biomedical research data management and analysis: needs and barriers. J Am Med Inform Assoc 2007; 14: 478–488. Epub ahead of print 2007. DOI: 10.1197/jamia.M2114.

30. Auffray, Balling, Barroso, et al. Making sense of big data in health research: towards an EU action plan. Genome Med 2016; 8: 71. Epub ahead of print 2016. DOI: 10.1186/s13073-016-0323-y.

31. Davenport, Harris. The Architecture of Business Intelligence. In: Competing on analytics: the new science of winning. London, UK: Harvard Business School Publishing Corporation, 2006, pp. 168–169.

32. Vik, Marshall, Liu, et al. Combining multidimensional analytics with conventional statistical tools to expedite large cohort analyses of geographic health cost variations. Canadian Association for Health Services and Policy Research, 2021; 255.

33. Sharif, Vik, Marshall-Catlin. Applications of big data analytics within a dynamic simulation modeling platform to inform osteoarthritis care in Alberta. Int J Popul Data Sci 2018; 3. Epub ahead of print 2018. DOI: 10.23889/ijpds.v3i4.1035.

34. Mues, Liede, Liu, et al. Use of the Medicare database in epidemiologic and health services research: a valuable source of real-world evidence on the older and disabled populations in the US. Clin Epidemiol 2017; 9: 267–277. Epub ahead of print 2017. DOI: 10.2147/clep.s105613.

35. Kendell, Levy, Porter, et al. Factors affecting access to administrative health data for research in Canada: a study protocol. Int J Popul Data Sci 2021; 6: 1653. Epub ahead of print 1 January 2021. DOI: 10.23889/ijpds.v6i1.1653.

36. Burgun, Bernal-Delgado, Kuchinke, et al. Health data for public health: towards new ways of combining data sources to support research efforts in Europe. Yearb Med Inform 2017; 26: 235–240. Epub ahead of print 2017. DOI: 10.15265/IY-2017-034.

37. Xie, Wang, et al. Real-world data for healthcare research in China: call for actions. Value Health Reg Issues 2022; 27: 72–81. Epub ahead of print 2022. DOI: 10.1016/j.vhri.2021.05.002.

38. Parmanto, Scotch, Ahmad. A framework for designing a healthcare outcome data warehouse. Perspect Health Inf Manag 2005; 2: 3.

39. Poole, Robinson, MacKinnon. Resource Intensity Weights and Canadian hospital costs: some preliminary data. Healthcare management forum/Canadian College of Health Service Executives = Forum gestion des soins de santé/Collège canadien des directeurs de services de santé 1998; 11. Epub ahead of print 1998. DOI: 10.1016/S0840-4704(10)61000-9.

40. Mettler, Vimarlund. Understanding business intelligence in the context of healthcare. Health Informatics J 2009; 15(3). Epub ahead of print 2009. DOI: 10.1177/1460458209337446.

41. Francia, Marcel, Peralta, et al. Enhancing cubes with models to describe multidimensional data. Inf Syst Front 2022; 24: 31–48. Epub ahead of print 2022. DOI: 10.1007/s10796-021-10147-3.

Breaking the 80:20 rule in health research using large administrative data sets