Abstract

“Big data” in health is no longer a trendy buzzword but a reality. An abundance of data is amassed from a variety of sources, including electronic health records (EHRs), administrative claims, wearables and mobile devices, genomic sequencing, medical imaging, and even social media and the broader Internet, to name a few. Unfortunately, bridging the gap between being able to process these big data sets and discovering knowledge from them remains highly challenging. With artificial intelligence being tested and implemented across the health and health care continuum, intelligent systems and their applications in health have the potential to transform the way we learn from a wide range of data sources to ultimately improve health outcomes. Nevertheless, the increasing complexity of today’s biomedical research requires more than traditional, single point-of-view approaches. Indeed, (big-)data-driven approaches that can reveal patterns in massive heterogeneous data sets and make clinically relevant predictions are becoming increasingly common in translational research.
This Special Issue, “Intelligent Systems in Health,” is an extension of a special track on the same topic at the 30th International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems (IEA/AIE), hosted at the Université d’Artois in Arras, France. Five journal extensions went through additional peer review and were accepted to this Special Issue. The five articles showcase state-of-the-art research and development efforts that effectively apply intelligent algorithms and systems to a wide range of heterogeneous data sources.
ElTayeby et al. 1 developed and tested machine learning models using different types of social media data (text, image, and video) from Facebook to detect binge drinking (i.e. excessive alcohol use that brings a person’s blood alcohol concentration to 0.08 grams percent or above 2) among college students in the United States. Their intelligent system was able to identify drinking-related posts with an F1-score of 0.81, demonstrating the potential of mining social media to identify public health issues, which can subsequently be leveraged for tailored, precision interventions.
Zhang et al. 3 tackled a different public health issue, psychiatric stressors that lead to suicidal behaviors, using a different data source (i.e. clinical narratives in patients’ EHRs). One important challenge of using EHR data is that more than 80 percent of clinical information is documented in clinical narratives. 4 Thus, the authors explored natural language processing (NLP) methods to extract both the psychiatric stressors and suicidal behaviors from physician notes. Their system can accurately recognize stressor entities with an F1-score of 0.89 and mentions of suicidal behaviors with an F1-score of 0.97. With these intelligent NLP methods, they were able to identify the top psychiatric stressors (e.g. health concern and pressure) associated with suicidal thoughts.
Qiu et al. 5 explored a wide range of Internet data (i.e. Google search volume, page view count on Wikipedia, and disease mention frequency on Twitter) to estimate disease burden for 1633 diseases over an 11-year period. They found that Google search volume is strongly correlated with the burdens of 39 diseases, such as viral hepatitis, diabetes mellitus, and hemorrhoids. Being able to efficiently measure disease burdens is important for assessing population health, evaluating the effectiveness of interventions, formulating health policies, and planning future health resource allocations.
Guo et al. 6 assessed the effect of integrating heterogeneous data sources on the predictive ability of cancer survival models. Cancer outcomes such as survival are determined by a complex interplay between individual- and contextual-level risk factors, such as an individual’s access to health care services (which can be measured, e.g. as area-level primary care physician density). Nevertheless, most existing cancer survival models focus only on single-level (primarily individual-level) characteristics, limited by our ability to integrate data from heterogeneous sources. This study showed that (1) models without contextual-level risk factors had suboptimal performance, (2) cross-level interactions were common, and (3) different representations of the same risk factor had differential impacts on model performance. It also highlighted the importance of data integration and multi-level integrative data analyses (IDAs), as they generate better models and new knowledge about cancer outcomes.
Qiang et al. 7 developed machine learning-based named entity recognition systems for identifying biomedical software in the biomedical literature (i.e. MEDLINE abstracts and titles). Specifically, they tested two novel feature engineering strategies: (1) domain knowledge features and (2) unsupervised word embedding features. Their system achieved an F1-score of 0.86. Software tools are now essential to research and applications in the biomedical domain, and this study shows the potential of using advanced NLP methods to automatically build a high-quality software catalog, an ongoing effort promoted by the National Institutes of Health in the United States.
Footnotes
Author’s note
François Modave is now affiliated with the University of Florida, USA.
