Big data in clinical biochemistry

Abstract

The term ‘big data’ is used to refer both to data-sets and to how the data are analysed and used. The data may be heterogeneous and complex and may change or accumulate rapidly. A given data-set may combine many different databases, so connecting them together can present a big challenge.

Big data appeal to providers of services. By knowing more about the target audience, companies can target their products more accurately and perhaps more cheaply, thus increasing profit or reducing the cost to the consumer. Using big data may allow providers of public services, such as local boroughs, transport providers, and health authorities, to understand what services are needed and to target them at the appropriate consumers.

The public is naturally suspicious about how data are collected and how they are used. This concern may be well founded. For instance, combining socioeconomic status (postcode) with health data, tax returns, Google searches, credit card bills and police data may reveal information about us of which we are unaware. The fear is that the data may be used to control us and to benefit the provider rather than the consumer. This fear has been borne out recently by the demonstration of data leaks by Facebook.¹

The UK is fortunate in having a unified health system, albeit that the component parts are often difficult to unite. This means that, in principle, all public health service data about patients in the UK are available to inform health policy and health outcomes and identify the factors that can help to improve health and provide healthcare in as rational and cost-effective a way as possible.

Sources of data include GP records, collected and anonymized in the General Practice Research Database and the Health Improvement Network, individual hospital data, and laboratory databases. Health sources include genetics databases, pathology and laboratory medicine, pharmacy, outpatient attendances and socioeconomic status (postcodes). The main problems with using these data are maintaining anonymity, ensuring data compatibility, analysing the data and identifying the owner(s) of the combined data.

It is possible to identify individuals simply by knowing their date of birth, sex and postcode, or because they have rare conditions or uncommon single nucleotide polymorphisms. Using effective algorithms to anonymize data is therefore crucial to allowing data to be released safely.
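One simple way to gauge this risk is to count how many records share each combination of quasi-identifiers; a combination held by a single record points to one person. The following is a minimal sketch of that check (often described as k-anonymity), not a complete anonymization method. The records, the field names and the choice to coarsen date of birth to a year and postcode to its first part are all illustrative assumptions.

```python
from collections import Counter

def smallest_group(records, quasi_identifiers):
    """Size of the smallest set of records sharing the same quasi-identifier values."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values())

def generalize(record):
    """Coarsen a record: keep only the birth year and the first part of the postcode."""
    return {
        "birth_year": record["date_of_birth"][:4],
        "sex": record["sex"],
        "postcode_area": record["postcode"].split()[0],
    }

# Invented records, for illustration only.
records = [
    {"date_of_birth": "1950-03-14", "sex": "F", "postcode": "OX3 9DU"},
    {"date_of_birth": "1950-11-02", "sex": "F", "postcode": "OX3 7LE"},
    {"date_of_birth": "1982-06-30", "sex": "M", "postcode": "OX1 2JD"},
    {"date_of_birth": "1982-01-19", "sex": "M", "postcode": "OX1 4AR"},
]

# With full date of birth and postcode, every record is unique.
raw_k = smallest_group(records, ["date_of_birth", "sex", "postcode"])

# After coarsening, each combination is shared by at least two records.
coarse = [generalize(r) for r in records]
coarse_k = smallest_group(coarse, ["birth_year", "sex", "postcode_area"])

print(raw_k, coarse_k)
```

Coarsening reduces the risk at the cost of detail, and it does not by itself protect people with rare conditions or uncommon genetic variants, whose clinical data may identify them.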

To combine health data from different sources requires common identifiers for people, healthcare professionals and locations. While everyone registered with the NHS has an NHS number, not everyone in the UK is registered, and certain organizations, such as prisons and the armed forces, tend not to use NHS numbers. Hospitals often use their own registration numbers internally in preference to NHS numbers, and the format may differ between departments, for instance in whether a check letter is used and in whether the number is stored as a number or as a character string. These factors increase the difficulty of combining data between departments and trusts. Even dates may be recorded in different formats.
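Much of this can be handled by converting identifiers and dates to one canonical form before linkage. The sketch below illustrates the idea under stated assumptions: identifiers may arrive as strings with spaces or hyphens, as integers or as spreadsheet floats, and dates arrive in one of three formats, with slashed dates read day first. The function names are hypothetical. The check on the NHS number uses its published Modulus 11 check digit; local hospital numbers with check letters would need their own rules, which are not shown.

```python
from datetime import date, datetime

def normalize_nhs_number(value):
    """Return the NHS number as a 10-digit string, or None if it fails validation."""
    # A number exported from a spreadsheet may arrive as a float such as 9434765919.0.
    if isinstance(value, float) and value.is_integer():
        value = int(value)
    digits = "".join(ch for ch in str(value) if ch in "0123456789")
    if len(digits) != 10:
        return None
    # Modulus 11: weight the first nine digits 10 down to 2 and sum the products.
    total = sum(int(d) * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - total % 11
    if check == 11:
        check = 0
    if check == 10 or check != int(digits[9]):
        return None
    return digits

# Assumed formats; a real data-set would need its own list.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")

def parse_date(text):
    """Try each known format in turn; return a date, or None if none matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None

# 943 476 5919 is a widely used example number that passes the check.
print(normalize_nhs_number("943 476 5919"))
print(normalize_nhs_number(9434765919))
print(parse_date("14/03/1950"), parse_date("1950-03-14"), parse_date("19500314"))
```

Returning None rather than guessing keeps doubtful records out of the linked data-set, so they can be reviewed separately.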

Analysing large data-sets requires the ability to scale up familiar methods. These include database manipulations and statistical techniques. Some tools, such as spreadsheet programs, limit the number of rows of data that can be added; they are not scalable and are therefore not appropriate for such analyses. Newer methods, such as machine learning, often using neural networks, have been developed to help analyse such data.²
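One familiar method that scales is to read the data one row at a time and keep only running totals, so that memory use depends on the number of groups and not on the number of rows. The following is a minimal sketch, assuming a CSV export with hypothetical columns `test` and `result`; a small in-memory sample stands in for a large file.

```python
import csv
import io

def mean_by_key(rows, key, value):
    """Stream over the rows once, keeping only a count and a sum for each key."""
    totals = {}
    for row in rows:
        n, s = totals.get(row[key], (0, 0.0))
        totals[row[key]] = (n + 1, s + float(row[value]))
    return {k: s / n for k, (n, s) in totals.items()}

# Invented sample standing in for a large laboratory export.
sample = io.StringIO(
    "test,result\n"
    "sodium,140\n"
    "sodium,138\n"
    "potassium,4.1\n"
    "potassium,4.5\n"
)

# csv.DictReader yields one row at a time, so the whole file is never held in memory.
means = mean_by_key(csv.DictReader(sample), "test", "result")
print(means)
```

To process a real file, the same function could be given `csv.DictReader(open(path, newline=""))` in place of the sample.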

The statistical analysis of data can also provide challenges. Since the data-sets are so large, associations may occur by chance. However, the volume of these data-sets allows the possibility of splitting the data into two or more sections, and seeing whether the same associations occur in the split portions.
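The splitting approach can be sketched as follows, using simulated data in place of real records. One variable is constructed to be related to the outcome and 50 are pure noise; each correlation is calculated separately in two halves of the data, and an association is kept only if it exceeds a threshold, with the same sign, in both halves. The sample size, the number of variables and the threshold of 0.3 are arbitrary assumptions chosen for illustration, and the rows are assumed to be in random order already.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 1000
outcome = [rng.gauss(0, 1) for _ in range(n)]

# One variable genuinely related to the outcome, and 50 unrelated ones.
variables = {"related": [y + rng.gauss(0, 0.5) for y in outcome]}
for i in range(50):
    variables[f"noise_{i}"] = [rng.gauss(0, 1) for _ in range(n)]

half = n // 2

def split_correlations(values):
    """Correlation with the outcome in the first half and in the second half."""
    return (pearson(values[:half], outcome[:half]),
            pearson(values[half:], outcome[half:]))

results = {name: split_correlations(v) for name, v in variables.items()}

THRESHOLD = 0.3  # arbitrary cut-off for this illustration
replicated = [name for name, (a, b) in results.items()
              if min(abs(a), abs(b)) > THRESHOLD and a * b > 0]

# The strongest chance association in the first half, and how it fares in the second.
best_noise = max((k for k in results if k != "related"),
                 key=lambda k: abs(results[k][0]))
print(replicated)
print(best_noise, results[best_noise])
```

The noise variable that looks most promising in the first half has, by construction, no real relationship with the outcome, so there is no reason to expect its correlation to hold up in the second half.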

In medicine, connecting diverse sources of data makes it possible to identify associations and investigate them for possible causes. This may allow us to identify patterns of disease and their possible causes, such as the adverse effects of drugs once they have been licensed and are used in the general population (pharmacovigilance).³

If we can identify risk factors for developing disease, or factors associated with favourable outcomes, we can improve management for individual patients (precision medicine and personalized medicine).

Laboratory data can be used, either alone or combined with other databases, to answer important medical questions.

DeepMind Health was set up by Google to explore the use of data to improve health outcomes in the NHS. An example of this was a project with the Royal Free Hospital, in which alerts concerning acute kidney injury were sent to clinicians using a mobile phone app. Clinicians felt that it was helpful because alerts could be seen earlier. This saved time and possibly reduced morbidity. However, the project ran into problems because of privacy concerns, and the information commissioner ‘found several shortcomings in how the data was handled’.⁴

The Infections in Oxfordshire Research Database⁵ has been in operation for nearly a decade. It integrates data from the patient administration system, pathology and infection control databases.⁶ It has been used to examine linkage between cases of tuberculosis⁷ and to examine questions such as the weekend effect on mortality.⁸

Researchers in Oxford used the laboratory information management system, which contains data going back for more than 30 years for a population of around 660,000, to confirm that lithium use was associated with development of chronic kidney disease, and with both hypothyroidism and hyperthyroidism.⁹ This group also showed that the number of cholesterol tests rose following publication of studies supporting the hypothesis that lowering cholesterol is associated with better outcomes, and the rate of rise fell with adverse reports.¹⁰

In conclusion, the concept of big data is very appealing, offering many advantages to consumers because providers can understand their needs and use of services and target their efforts more efficiently. However, there are challenges in data protection, particularly in effectively anonymizing the data and in understanding how to join together data from disparate sources.

The promise is that, if we get it right, we can make a difference to health outcomes for the population in general, and for groups that are hard to study and target at present. The challenge is to achieve this in an efficient and ethical manner.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Ethical approval

Not applicable.

Guarantor

BS.

Contributorship

BS and JHB contributed equally to writing this article.

References

1. Naughton J. Why Facebook is in a hole over data mining, www.theguardian.com/commentisfree/2017/oct/08/facebook-zuckerberg-in-a-hole-data-mining-business-model (2017, accessed 5 September 2018).

2. Beam, Kohane. Big data and machine learning in health care. JAMA 2018; 319: 1317–1318.

3. Coloma, Trifiro, Patadia, et al. Postmarketing safety surveillance: where does signal detection using electronic healthcare records fit into the big picture? Drug Saf 2013; 36: 183–197.

4. Information Commissioner’s Office. Royal Free – Google DeepMind trial failed to comply with data protection law, https://ico.org.uk/about-the-ico/news-and-events/news-and-blogs/2017/07/royal-free-google-deepmind-trial-failed-to-comply-with-data-protection-law/ (2017, accessed 5 September 2018).

5. Oxford Biomedical Research Centre. Infections in Oxfordshire Research Database (IORD), https://oxfordbrc.nihr.ac.uk/research-themes-overview/antimicrobial-resistance-and-modernising-microbiology/infections-in-oxfordshire-research-database-iord/ (accessed 5 September 2018).

6. Finney, Walker, Peto, et al. An efficient record linkage scheme using graphical analysis for identifier error detection. BMC Med Inform Decis Mak 2011; 11: 7.

7. Yang, Niehaus, Walker, et al. Machine learning for classifying tuberculosis drug-resistance from DNA sequencing data. Bioinformatics 2018; 34: 1666–1671.

8. Walker, Mason, Phuong Quan, et al. Mortality risks associated with emergency admissions during weekends and public holidays: an analysis of electronic health records. Lancet 2017; 390: 62–72.

9. Shine, McKnight, Leaver, et al. Long-term effects of lithium on renal, thyroid, and parathyroid function: a retrospective analysis of laboratory data. Lancet 2015; 386: 461–468.

10. Doll, Shine, Kay, et al. The rise of cholesterol testing: how much is unnecessary. Br J Gen Pract 2011; 61: e81–e88.