Abstract
Big data technologies are increasingly used for biomedical and health-care informatics research. Large amounts of biological and clinical data have been generated and collected at an unprecedented speed and scale. For example, the new generation of sequencing technologies enables the processing of billions of DNA sequence reads per day, and the application of electronic health records (EHRs) is documenting large amounts of patient data. The cost of acquiring and analyzing biomedical data is expected to decrease dramatically with the help of technology upgrades, such as the emergence of new sequencing machines, the development of novel hardware and software for parallel computing, and the extensive expansion of EHRs. Big data applications present new opportunities to discover new knowledge and create novel methods to improve the quality of health care. The application of big data in health care is a fast-growing field, with many new discoveries and methodologies published in the last five years. In this paper, we review and discuss big data applications in four major biomedical subdisciplines: (1) bioinformatics, (2) clinical informatics, (3) imaging informatics, and (4) public health informatics. Specifically, in
Background: What is Big Data?
In the biomedical informatics domain, big data is a new paradigm and an ecosystem that transforms case-based studies to large-scale, data-driven research. It is widely accepted that the characteristics of big data are defined by three major features, commonly known as the 3Vs: volume, variety, and velocity.
First and most significantly, the volume of biomedical data has grown enormously; sequencing platforms and EHR systems now routinely generate datasets measured in terabytes and petabytes.
The second feature of big data is the variety of data types and sources, which range from structured records to free text, images, and physiological signals.
The third characteristic of big data, velocity, refers to the speed at which data are generated and must be processed, as with real-time clinical stream data.
Big Data Technologies
Biomedical scientists are facing new challenges in storing, managing, and analyzing massive datasets. 23 The characteristics of big data require powerful and novel technologies to extract useful information and enable more broad-based health-care solutions. In most of the cases reported, we found multiple technologies used together, such as artificial intelligence (AI) along with Hadoop® 24 and data mining tools.
As such,
Research Methods
We searched four bibliographic databases to find related research articles: (1) PubMed, (2) ScienceDirect, (3) Springer, and (4) Scopus. In searching these databases, we used the main keywords “big data,” “health care,” and “biomedical.” Then, we selected papers based on the following inclusion criteria:
The paper was written in English and published between 2000 and 2015. The paper discussed the design and use of a big data application in the biomedical and health-care domains. The paper reported a new pipeline or method for processing big data and discussed the performance of the method. The paper evaluated the performance of new or existing big data applications.
The following exclusion criteria were used to filter out irrelevant papers:
The paper did not discuss any specific big data applications (eg, general comments about big data). The paper was a tutorial or course material. The paper was not in the four focus areas: bioinformatics, clinical informatics, public health informatics, and imaging informatics.
Two searches were performed. In the first search, the first author (JL) and the second author (MW) of the present study began the search process based on the main keywords. All potentially related papers were collected by reviewing the title and abstract. This initial search resulted in 755 papers from 2000 to 2015. In the second search, the second author (MW) and the third author (DG) screened the papers based on the abovementioned inclusion and exclusion criteria and subsequently selected 94 candidate papers. Finally, each author of the present study evaluated the final selection by reading the content of the papers, and consensus was reached to review 68 papers for this study.
Big Data Applications
Bioinformatics applications
Bioinformatics research analyzes variations in biological systems at the molecular level. With current trends in personalized medicine, there is an increasing need to produce, store, and analyze these massive datasets in a manageable time frame. Next-generation sequencing technology enables genomic data acquisition in a short period of time.27,28 The role of big data techniques in bioinformatics applications is to provide data repositories, computing infrastructure, and efficient data manipulation tools for investigators to gather and analyze biological information. Taylor notes that Hadoop and MapReduce are used extensively within the biomedical field. 29
This section classifies big data technologies/tools into four categories: (1) data storage and retrieval, (2) error identification, (3) data analysis, and (4) platform integration deployment. These categories are correlated and may overlap; for instance, most data storage applications also support simple data analysis, and vice versa. However, our classification in the present study is based only on the main function of each technology.
Data storage and retrieval
Nowadays, a sequencing machine can produce millions of short DNA sequence reads in a single run. These reads need to be mapped to a specific reference genome before they can be used for further analysis, such as genotype and expression variation analysis.
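As a toy illustration of the mapping step (not any particular aligner's algorithm), short reads can be located in a reference sequence by exact substring search:

```python
# Toy illustration of read mapping: align short reads to a reference
# genome by exact substring search. This sketch only conveys the
# input/output relationship of the mapping step, not how production
# aligners actually work.

def map_reads(reference, reads):
    """Return {read: [0-based positions where the read matches]}."""
    hits = {}
    for read in reads:
        positions = []
        start = reference.find(read)
        while start != -1:
            positions.append(start)
            start = reference.find(read, start + 1)
        hits[read] = positions
    return hits

reference = "ACGTACGTGACCA"
reads = ["ACGT", "GACC", "TTTT"]
mapped = map_reads(reference, reads)
# "ACGT" maps to positions 0 and 4; "TTTT" does not map at all.
```

Production aligners such as BWA or Bowtie rely on compressed indexes (eg, the FM-index) to make this search tractable at genome scale, and they tolerate mismatches; the naive scan above is only for illustration.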
The
Error identification
A number of tools have been developed to identify errors in sequencing data;
Data analysis
In addition to the described frameworks and toolkits for sequencing data analysis, the
The
The
Current bioinformatics platforms also incorporate virtual machines.
Deploying the Hadoop cloud platform can be a big challenge for researchers who do not have a computer science background.
Clinical informatics applications
Clinical informatics focuses on the application of information technology in the health-care domain. It includes activity-based research, analysis of the relationship between a patient's main diagnosis (MD) and underlying cause of death (UCD), and storage of data from EHRs and other sources (eg, electrophysiological [such as EEG] data). In this section, we classified big data technologies/tools into four categories: (1) data storage and retrieval, (2) interactive data retrieval for data sharing, (3) data security, and (4) data analysis. Compared with bioinformatics, clinical informatics offers few tools for error identification but pays more attention to data-sharing and data-security issues. Its data analysis methods also differ markedly from those of bioinformatics, as clinical informatics works with both structured and unstructured data, develops specific ontologies, and uses natural language processing extensively.
Data storage and retrieval
It is critical to discuss the ways in which big data techniques (eg, Hadoop, NoSQL databases) are used for storing EHRs. Efficient storage is especially important when working with real-time clinical stream data. 56 Dutta et al evaluated the potential of using Hadoop and HBase 35 as data warehouses for storing EEG data and discussed their high-performance characteristics. Jin et al. 57 analyzed the potential of using Hadoop HDFS and HBase for distributed EHRs.
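Part of HBase's appeal for EEG and EHR storage is its wide-column data model: each row key maps to column families whose columns hold timestamped values. A minimal pure-Python stand-in (not an HBase client; all names are invented) illustrates the model:

```python
# Minimal sketch of the wide-column storage model used by HBase:
# row key -> column family -> column -> {timestamp: value}.
# This pure-Python stand-in only illustrates the data model.

class ToyWideColumnStore:
    def __init__(self):
        self.rows = {}

    def put(self, row_key, family, column, timestamp, value):
        fam = self.rows.setdefault(row_key, {}).setdefault(family, {})
        fam.setdefault(column, {})[timestamp] = value

    def get_latest(self, row_key, family, column):
        versions = self.rows[row_key][family][column]
        return versions[max(versions)]  # newest timestamp wins

store = ToyWideColumnStore()
# EEG samples keyed by patient and channel, versioned by timestamp.
store.put("patient42", "eeg", "channel1", 1000, -0.12)
store.put("patient42", "eeg", "channel1", 1001, -0.09)
latest = store.get_latest("patient42", "eeg", "channel1")
```

The versioned-cell design is what makes append-heavy signal data like EEG a natural fit: new samples are new timestamped cells rather than row updates.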
Furthermore, Sahoo et al. 58 and Jayapandian et al. 59 proposed a distributed framework for storing and querying large amounts of EEG data. Their system,
Compared with a traditional relational database that handles structured data well, the novel
Interactive data retrieval for data sharing
Interactive medical information retrieval is expected to play an important role in sharing medical knowledge and integrating data. Many researchers have seen the need for such a role and have offered possible solutions. Deb and Srirama 63 proposed a three-tier ecosystem to address the shortcomings of cloud-enabled social networks for eHealth solutions. Bahga and Madisetti 64 developed a cloud-based approach for interoperable EHRs. Sharp 65 proposed an application architecture based on the cloud approach to enhance the interaction between researchers in multisite clinical trials. Chen et al. 66 discussed the present and future aspects of translational informatics based on the cloud approach. He et al. 67 provided a private cloud platform architecture for handling enormous data requests from health-care services. To handle huge amounts of online heart disease data analyses in China, Wang et al. 68 used a hybrid XML database and the Hadoop/HBase infrastructure to design the "Clinical Data Managing and Analyzing System."
Data security
Schultz 69 concluded that vast amounts of data can be collected over time and that health-care challenges could be met and addressed in response to big data opportunities. This in turn means that major data technology advancements will enable health-care practitioners to manipulate even larger amounts of data in the future. However, interactive data retrieval places greater pressure on data security. Sobhy et al. 70 proposed
Data analysis
Predicting disease risk and progression over time can be very useful for clinical decision support, and building computational models for clinical prediction requires a complex pipeline. Ng et al. 72 proposed
In addition, Zolfaghar et al. 73 used big data techniques to study the 30-day risk of readmission for congestive heart failure patients. The patient data were extracted from the National Inpatient Dataset and the Multicare Health System. Several algorithms (eg, logistic regression, random forest) were used to build a predictive model to analyze the possibility of patient readmission. The investigators performed several tests on more than three million patient records. The results showed that the use of big data significantly increased the performance of building a predictive model: the models achieved the highest accuracy at 77% and recall at 61%.
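As a sketch of the kind of predictive modeling described above, a logistic regression classifier for readmission risk can be trained by stochastic gradient descent; the features and labels below are invented for illustration, not drawn from the cited datasets:

```python
import math

# Sketch: logistic regression for readmission risk, trained by
# stochastic gradient descent on tiny synthetic data. Real studies
# train on millions of records with libraries such as scikit-learn.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=2000):
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log-loss for one sample
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, xi):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5

# Invented features: [prior admissions, abnormal lab flag];
# label: readmitted within 30 days.
X = [[0, 0], [1, 0], [0, 1], [3, 1], [4, 0], [5, 1]]
y = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(X, y)
preds = [predict(w, b, xi) for xi in X]
```

The per-sample update is the whole of SGD logistic regression; everything else in a production pipeline (cohort extraction, feature engineering, cross-validation) is what the cited studies spend most of their effort on.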
Deligiannis et al. 74 presented a data-driven prototype using MapReduce to diagnose hypertrophic cardiomyopathy (HCM), an inherited heart disease that causes cardiac death in young athletes. Successful diagnosis of HCM is challenging due to the large number of potential variables. Deligiannis et al believed that the diagnosis rate could be improved by using a data-driven analysis. In addition to improved predictive accuracy, the experimental results showed that the overall runtime of predictive analysis decreased from nine hours to only a few minutes when accessing a dataset of 10,000 real medical records; this is a remarkable improvement over previous analyses and could lead to possible future applications for early systematic diagnoses.
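The MapReduce pattern that underpins analyses like these can be summarized in a few lines: a map step emits key-value pairs from each record, a shuffle groups the pairs by key, and a reduce step aggregates each group. The records and feature names below are invented for illustration:

```python
from collections import defaultdict

# Sketch of the MapReduce pattern: map -> shuffle -> reduce,
# here counting how often each clinical feature is flagged
# across a set of (invented) patient records.

def map_phase(record):
    # Emit one (feature, 1) pair per flagged feature.
    return [(feature, 1) for feature in record["features"]]

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

records = [
    {"id": 1, "features": ["lv_wall_thickening", "murmur"]},
    {"id": 2, "features": ["murmur"]},
    {"id": 3, "features": ["lv_wall_thickening"]},
]
pairs = [p for r in records for p in map_phase(r)]
counts = reduce_phase(shuffle(pairs))
```

The speedups reported in these studies come from the fact that the map calls are independent and can run on many cluster nodes at once; the sequential driver above only shows the data flow.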
Furthermore, the use of big data to analyze clinical data could have a significant impact on the medical community. A number of researchers have described future possibilities for the application of big data analytics. Ghani et al. 75 argued that the adoption of EHRs and the use of picture archiving and communication systems (PACS) have led to the capture of mass quantities of digital big data. They also inferred that urologists can use big data analytics for decision support, such as predicting whether a patient will need readmission to hospital after a cystectomy. Ghani et al anticipated that analytics of big data can also be applied to determine whether radiation therapy or prostatectomy should be used for a 75-year-old patient to avoid immediate risks from advanced prostate cancer. Wang and Krishnan 76 gave a systematic review of how big data can facilitate outcomes, such as identifying the causality of patient symptoms, predicting hazards of disease incidence or reoccurrence, and improving primary care quality.
Genta and Sonnenberg 77 provided an overview of big data in gastroenterology research, stating that the big data method is a new tool for finding significant associations among large amounts of "messy" clinical data. Furthermore, the use of large datasets will rapidly expand for gastroenterologists and advance the understanding of digestive diseases. Chawla and Davis 78 illustrated the overall vision of the big data approach to personalized medicine and provided a patient-centered framework. Abbott 79 explained the contribution of big data to perioperative medicine. McGregor 80 contended that using big data could help predict deadly pediatric medical conditions at an early stage, leading to a breakthrough in clinical applications for neonatal intensive care units. Fahim et al. 81 proposed a system for active lifestyles and argued that a visual design engages users by enhancing their self-motivation.
Imaging informatics applications
Imaging informatics is the study of methods for generating, managing, and representing imaging information in various biomedical applications. It is concerned with how medical images are exchanged and analyzed throughout complex health-care systems. With the growing need for more personalized care, the need to incorporate imaging data into EHRs is rapidly increasing.
In this section, we classified big data technologies/tools into three categories: (1) data storage and retrieval, (2) data sharing, and (3) data analysis. Imaging informatics developed almost simultaneously with the advent of EHRs and the emergence of clinical informatics; however, it differs considerably from clinical informatics because of the heterogeneous data types generated by different medical imaging modalities. Data security remains an important consideration in this area, but because current systems rely primarily on commercial cloud platforms and existing protocols, such as Digital Imaging and Communications in Medicine (DICOM), we found no research focusing on improving data security in imaging informatics.
Data storage and retrieval
Imaging informatics is predominantly used for improving the efficiency of image processing workflows, such as storage, retrieval, and interoperation. PACS are popular for delivering images to local display workstations, which is accomplished primarily through DICOM protocols in radiology departments. Many web-based medical applications have been developed to access PACS, and greater use of big data technology has been improving their performance. Silva et al. 82 proposed an approach to integrate the data in PACS, given the current trend among health-care institutions to outsource the two important components of PACS (DICOM object repository and database system) to the cloud. Silva et al proposed to provide an abstract layer with a Cloud IO (input/output) stream mechanism to support more than one cloud provider despite their differences in data access standards.
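The abstraction-layer idea can be sketched as a provider-agnostic storage interface; the class and method names below are hypothetical, not Silva et al's actual API:

```python
from abc import ABC, abstractmethod

# Sketch of an abstraction layer over cloud object storage: the
# application reads and writes DICOM objects through one interface
# while concrete providers hide each cloud's access details.
# All names here are invented for illustration.

class ObjectStore(ABC):
    @abstractmethod
    def put(self, key, data): ...
    @abstractmethod
    def get(self, key): ...

class InMemoryStore(ObjectStore):
    """Stand-in for one cloud provider."""
    def __init__(self):
        self._blobs = {}
    def put(self, key, data):
        self._blobs[key] = data
    def get(self, key):
        return self._blobs[key]

class PrefixedStore(ObjectStore):
    """Stand-in for a second provider with a different key scheme."""
    def __init__(self):
        self._blobs = {}
    def put(self, key, data):
        self._blobs["dicom/" + key] = data
    def get(self, key):
        return self._blobs["dicom/" + key]

def archive_study(store: ObjectStore, study_id, payload):
    store.put(study_id, payload)  # caller is provider-agnostic
    return store.get(study_id)

result_a = archive_study(InMemoryStore(), "study-001", b"\x00DICOM")
result_b = archive_study(PrefixedStore(), "study-001", b"\x00DICOM")
```

The design point is that `archive_study` never changes when a health-care institution switches or adds cloud providers; only a new `ObjectStore` subclass is needed.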
In addition to big data technologies based on the implementation of cloud platforms with PACS, Yao et al. 83 developed a massive Hadoop-based medical image retrieval system that extracted the characteristics of medical images using a Brushlet transform and a local binary pattern algorithm. Then, the HDFS stored the image features, followed by the implementation of MapReduce. The evaluation results indicated a decreased error rate in images compared with the result without homomorphic filtering. Similarly, Jai-Andaloussi et al. 84 used the MapReduce computation model and HDFS storage model to address the challenges of content-based image retrieval systems. They performed experiments on mammography databases and obtained promising results, showing that the MapReduce technique can be effectively used for content-based medical image retrieval.
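A local binary pattern, one of the texture descriptors mentioned above, is simple to compute for a single pixel: each of its 8 neighbors contributes one bit, set when the neighbor is at least the center value. Neighbor ordering conventions vary; clockwise from the top-left is assumed in this sketch:

```python
# Toy computation of a local binary pattern (LBP) code for one pixel.
# Bit i is set when neighbor i >= the center value; the ordering
# (clockwise from top-left) is one common convention, assumed here.

def lbp_code(image, r, c):
    center = image[r][c]
    neighbors = [
        image[r-1][c-1], image[r-1][c], image[r-1][c+1],
        image[r][c+1],   image[r+1][c+1], image[r+1][c],
        image[r+1][c-1], image[r][c-1],
    ]
    code = 0
    for bit, n in enumerate(neighbors):
        if n >= center:
            code |= 1 << bit
    return code

# 3x3 grayscale patch; only the center pixel has a full neighborhood.
image = [
    [10, 20, 30],
    [40, 25, 15],
    [50, 35,  5],
]
code = lbp_code(image, 1, 1)  # one 8-bit texture code, 0..255
```

A histogram of these 8-bit codes over all pixels is what actually serves as the image feature vector in retrieval systems like the one described above.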
Data and workflow sharing
PACS primarily provide image data archiving and analysis workflow at single sites. Radiology groups operating under a disparate delivery model (ie, different services offered by different vendors to complete a single radiology task) face significant challenges in a data-sharing infrastructure. Benjamin et al. 85 developed
Data analysis
Seeking to overcome the challenges brought by large-scale (terabytes or petabytes) data derived from pathological images, Wang et al. 86 proposed
To analyze cardiac imaging and medical data to optimize clinical diagnosis and treatment, Dilsizian and Siegel 87 proposed a framework to integrate AI, massive parallel computing, and big data mining and argued that these technologies are critical components for evidence-based personalized medicine. They also argued that big data mining techniques would be used for next-generation AI techniques in which large numbers of possible factors (eg, whether a patient had myocardial infarction) could be analyzed and a prediction could be completed in less time, thereby improving diagnosis and treatment. Using a cardiac imaging field as a focus area, Dilsizian and Siegel showed that the Formation of Optimal Cardiovascular Utilization Strategies group introduced the use of AI and big data to reduce inappropriate uses of diagnostic imaging; such cases decreased from 10% to 5% among the 55 participating sites.
In addition, Markonis et al. 88 used Hadoop to establish a cluster of computing nodes and MapReduce to speed up processing. Three use cases were analyzed: (1) parameter optimization for lung texture classification using support vector machines (SVMs), (2) content-based medical image indexing, and (3) three-dimensional directional wavelet analysis for solid texture classification. Test results for a parallel grid search over the SVM parameters showed that using concurrent map tasks reduced the total runtime from 50 hours to 9 hours 15 minutes, a significant improvement in computing efficiency while maintaining good classification performance.
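Grid search parallelizes naturally because each parameter combination is scored independently. A minimal sketch using Python's concurrent.futures follows; the scoring function is an invented stand-in for the expensive SVM cross-validation step:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Sketch of a concurrent grid search: every (C, gamma) candidate is
# scored independently, so the grid can be mapped across workers.

def score(params):
    C, gamma = params
    # Invented smooth objective peaking at C=10, gamma=0.1; in
    # practice this would be SVM cross-validation accuracy.
    return -((C - 10) ** 2) - (10 * (gamma - 0.1)) ** 2

grid = list(product([0.1, 1, 10, 100], [0.01, 0.1, 1]))

with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(score, grid))

best_params = grid[max(range(len(grid)), key=scores.__getitem__)]
```

On a Hadoop cluster the `pool.map` call corresponds to concurrent map tasks over the grid, which is where the reported 50-hour-to-9-hour reduction comes from.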
Public health informatics applications
As described by Shortliffe and Cimino, 89 public health has three core functions: (1) assessment, (2) policy development, and (3) assurance. Among these, assessment is the prerequisite and fundamental function. Assessment primarily involves collecting and analyzing data to track and monitor public health status, thereby providing evidence for decision making and policy development. Assurance is used to validate whether the services offered by health institutions have achieved their initial target goals for improving public health outcomes; as such, many large public health institutions, such as the Centers for Disease Control and Prevention and the Administration for Community Living, have collected and analyzed very large amounts of population health data.
In this section, no new approaches are introduced. Instead, we present an integrated view of big data and health from a population perspective rather than a single medical/ clinical activity perspective. This section focuses on four areas: (1) infectious disease surveillance, (2) population health management, (3) mental health management, and (4) chronic disease management.
Infectious disease surveillance
Hay et al. 90 discussed the opportunities for using big data for global infectious disease surveillance. They developed a system that provides real-time risk monitoring on a map, pointing out that machine learning and crowdsourcing have opened new possibilities for developing a continually updated atlas for disease monitoring. Hay et al believed that online social media combined with epidemiological information is a valuable new data source for facilitating public health surveillance. The use of social media for disease monitoring was demonstrated by Young et al, 91 who collected 553,186,016 tweets and extracted more than 9,800 with HIV risk-related keywords (eg, sexual behaviors and drug use) and geographic annotations. They showed that there is a significant positive correlation
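The correlation statistic underlying such analyses can be computed directly; a minimal Pearson correlation over invented per-region counts:

```python
import math

# Plain Pearson correlation between two count series, eg tweets
# flagged with risk-related keywords vs new cases per region.
# The counts below are invented for illustration.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-county counts of risk-related tweets and new cases.
tweets = [120, 80, 200, 150, 60]
cases = [14, 9, 24, 17, 7]
r = pearson(tweets, cases)  # close to 1 for these invented counts
```

At the scale of the cited study the computation is the same; the big data machinery is needed for the upstream step of filtering half a billion tweets down to the geocoded, keyword-matched subset.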
Population health management
To study the distribution and impact of sociodemographic and medico-administrative factors, Lamarche-Vadel et al. 92 analyzed the independent association of patient MD and UCD. The MD was identified by its ICD-10 code, while the UCD was extracted from a death registry. If the MD and UCD corresponded to different events, those events were considered independent. Using health insurance data, information on 421,460 deceased patients was extracted for 2008 to 2009. The results show that 8.5% of in-hospital deaths and 19.5% of out-of-hospital deaths were independent events and that independent death was more common in elderly patients. The results demonstrate that large-scale data analysis can be used to effectively analyze the association of medical events.
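At its core, this kind of association analysis is large-scale counting: per death record, compare the MD and UCD codes and compute the share of independent deaths in each care setting. The records and ICD-10 codes below are invented for illustration:

```python
# Sketch of the MD-vs-UCD independence analysis: a death is counted
# as "independent" when its main diagnosis (MD) code differs from
# its underlying cause of death (UCD) code. All records and ICD-10
# codes below are invented.

def independence_rate(records, setting):
    subset = [r for r in records if r["setting"] == setting]
    independent = sum(1 for r in subset if r["md"] != r["ucd"])
    return independent / len(subset)

records = [
    {"setting": "in_hospital", "md": "I21", "ucd": "I21"},
    {"setting": "in_hospital", "md": "I21", "ucd": "C34"},
    {"setting": "in_hospital", "md": "J18", "ucd": "J18"},
    {"setting": "in_hospital", "md": "E11", "ucd": "E11"},
    {"setting": "out_of_hospital", "md": "I50", "ucd": "C61"},
    {"setting": "out_of_hospital", "md": "I50", "ucd": "I50"},
]
in_hosp = independence_rate(records, "in_hospital")
out_hosp = independence_rate(records, "out_of_hospital")
```

The published analysis applies this comparison over hundreds of thousands of linked insurance and registry records, with a more careful definition of when two codes constitute different disease events.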
Mental health management
Nambisan et al. 93 found that messages posted on social media could be used to screen for and potentially detect depression. Their analysis was based on previous research on the association between depressive disorders and repetitive thoughts/ruminating behavior. Big data analytics tools play an important role in their work by mining hidden behavioral and emotional patterns in messages, or "tweets," posted on Twitter. Within these tweets, disease-related emotion patterns, which are previously hidden symptoms, may be detectable. The authors foresee that future research could delve deeper into the conversations of depressed users to understand more about their hidden emotions and sentiments. In addition, Dabek and Caban 94 presented a neural network model that can predict the likelihood of developing psychological conditions, such as anxiety, behavioral disorders, depression, and post-traumatic stress disorder. They also evaluated their model against a dataset of 89,840 patients, and the results show an overall accuracy of 82.35% across all conditions.
Chronic disease management
Tu et al. 95 introduced the Cardiovascular Health in Ambulatory Care Research Team (CANHEART), a unique, population-based observational research initiative aimed at measuring and improving cardiovascular health and the quality of ambulatory cardiovascular care provided in Ontario, Canada. The research focused on identifying opportunities to improve the primary and secondary prevention of cardiovascular events in Ontario's diverse multiethnic population. The study included data from 9.8 million Ontario adults aged ≥20 years. Data were assembled by linking multiple databases, such as electronic surveys, health administration, clinical, laboratory, drug, and electronic medical record databases using encoded personal identifiers. Follow-up clinical events were collected through record linkages to comprehensive hospitalization, emergency department, and vital statistics administrative databases. The huge linked databases enable the CANHEART study cohort to serve as a powerful big data resource for scientific research aimed at improving cardiovascular health and health services delivery.
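Linking databases on encoded rather than raw identifiers can be sketched as a salted-hash join. The salt, identifiers, and records below are invented, and real linkage systems use vetted privacy-preserving protocols rather than this simplification:

```python
import hashlib

# Sketch of record linkage on encoded identifiers: hash each raw
# identifier with a shared secret salt in both datasets, then join
# on the digests so raw identifiers never need to be exchanged.
# The salt and all records below are invented.

SALT = b"shared-secret"

def encode(identifier):
    return hashlib.sha256(SALT + identifier.encode()).hexdigest()

def link(left, right):
    """Join two {raw_id: record} datasets on encoded identifiers."""
    right_by_code = {encode(k): v for k, v in right.items()}
    linked = {}
    for raw_id, rec in left.items():
        code = encode(raw_id)
        if code in right_by_code:
            linked[code] = {**rec, **right_by_code[code]}
    return linked

surveys = {"ONT-001": {"smoker": False}, "ONT-002": {"smoker": True}}
labs = {"ONT-002": {"ldl": 3.9}, "ONT-003": {"ldl": 2.1}}
linked = link(surveys, labs)  # only ONT-002 appears in both sources
```

Studies like CANHEART link many such sources (surveys, administrative claims, laboratory, drug, and EMR databases) this way, so no analysis dataset ever carries directly identifying information.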
Kupersmith et al. 96 introduced the US Veterans Health Administration's (VHA) health information infrastructure and the factors that made it possible to achieve chronic disease management for its patients. Structured clinical data in the EHRs can be aggregated within specialized databases, while unstructured text data, such as clinician notes, can be reviewed and abstracted electronically from a central location. This rich clinical information makes it possible for professionals to extract insights; for instance, the VHA has identified a high rate of mental illness comorbidity (24.5%) among patients with diabetes. The VHA also uses EHR data to explore the influence of sex and race/ethnicity and to understand the extent to which newer psychotropic drugs contribute to poor outcomes, given that such drugs can promote weight gain, as can mental illness itself. The VHA also uses this information to identify and track diabetic complications, such as early chronic kidney disease without renal impairment, as indicated in the record. After identifying patients at high risk for comorbidities or amputation, the VHA distributes the information to clinicians to better coordinate patient care.
Conclusion
We are currently in the era of "big data," in which big data technology is being rapidly applied to the biomedical and health-care fields. In this review, we presented various examples in which big data technology has played an important role in the modern health-care revolution, as it has fundamentally changed how people view health-care activities. The first three sections of this review showed that big data applications facilitate three important clinical activities, while the last section (especially the chronic disease management section) draws an integrated picture of how separate clinical activities are combined in a pipeline to manage individual patients from multiple perspectives. We summarized recent progress in the most relevant areas in each field, including big data storage and retrieval, error identification, data security, data sharing, and data analysis for electronic patient records, social media data, and integrated health databases.
Furthermore, in this review, we learned that bioinformatics is the primary field in which big data analytics are currently being applied, largely due to the massive volume and complexity of bioinformatics data. Big data application in bioinformatics is relatively mature, with sophisticated platforms and tools already in use to help analyze biological data, such as gene sequencing mapping tools. However, in other biomedical research fields, such as clinical informatics, medical imaging informatics, and public health informatics, there is enormous, untapped potential for big data applications.
This literature review also showed that: (1) integrating different sources of information enables clinicians to depict a new view of patient care processes that consider a patient's holistic health status, from genome to behavior; (2) the availability of novel mobile health technologies facilitates real-time data gathering with more accuracy; (3) the implementation of distributed platforms enables data archiving and analysis, which will further be developed for decision support; and (4) the inclusion of geographical and environmental information may further increase the ability to interpret gathered data and extract new knowledge.
While big data holds significant promise for improving health care, several common challenges face all four fields in using big data technology; the most significant is the integration of various databases. For example, the VHA's database, VISTA, is not a single system; it is a set of 128 interlinked systems. Integration becomes even more complicated when databases contain different data types (eg, integrating an imaging database or a laboratory test results database into existing systems), thereby limiting a system's ability to query all databases and acquire all patient data. The lack of standardization of laboratory protocols and values also creates challenges for data integration. For example, image data can suffer from technological batch effects when they come from different laboratories using different protocols. Efforts have been made to normalize data when a batch effect exists; this may be easier for image data, but it is intrinsically more difficult to normalize laboratory test data. Security and privacy concerns also remain hurdles to big data integration and usage in all four fields, and thus, secure platforms with better communication standards and protocols are greatly needed.
In its latest industry analysis report, McKinsey & Company predicted that big data analytics for the medical field could potentially save more than $300 billion per year in US health-care costs. Future development of big data applications in the biomedical fields holds foreseeable promise, driven by the advancement of new data standards, relevant research and technology, cooperation among research institutions and companies, and strong government incentives.
Author Contributions
JL, MW conceived and designed the experiments. MW, JL jointly developed the structure and arguments for the paper. JL, MW, YZ, DG analyzed the data. JL, MW, DG, YZ wrote the first draft of the manuscript. JL, MW, YZ contributed to the writing of the manuscript. All authors reviewed and approved the final manuscript.
