Abstract

The concept of ‘Big Data’ is not new, having been in frequent use since the 1990s. In recent years, the term has become more relevant, with increases in the size and type of data available, as well as in the computational capability to rapidly execute analyses across large datasets. ‘Big Data’ is typically characterized by at least ‘3 Vs’: velocity, variety and volume; some posit that ‘veracity’, ‘variability’ or ‘value’ should also be added to the list. 1 For many years, Big Data has been described as having the potential to transform many industries, including healthcare. While there have been significant advances in its use in the financial and retail sectors, its impact in healthcare has been slower to emerge, in part due to regulations and privacy concerns. In drug and vaccine safety, in particular, its postulated benefits have not fully materialized. In this editorial, we examine the impact of Big Data on the cornerstones of post-approval pharmacovigilance: quantitative signal evaluation and the identification of potentially new safety signals. We describe the achievements thus far of Big Data approaches within pharmacoepidemiology; the capabilities and approaches that are most promising for improving the quality of data available for drug safety research; and, lastly, the importance of evaluating whether Big Data sources, in identifying potential safety signals, are redundant to, complement or replace the traditional data sources and techniques used in pharmacovigilance today.
For decades, varied sources of data, ranging from pre-clinical studies and randomized clinical trials to real-world data, have been used throughout the medicinal product lifecycle in order to better understand the benefit–risk profile and describe it for patients and physicians in the approved product label. Pharmacovigilance and benefit–risk assessment occur on an ongoing basis as new data emerge. With increases in the volume and type of data, particularly post-approval data, which often lack medical history and other important patient information, this assessment has become increasingly complex, and manual clinical review has become untenable as the primary method of signal detection. Historically, from the 1960s, safety surveillance used paper-based spontaneous reporting systems for signal detection 2 and specialized primary observational data-collection systems (e.g. Boston Collaborative Drug Surveillance Program, and UK and New Zealand Prescription-Event Monitoring systems 3–7); in the US and UK in particular, observational administrative claims and medical records databases have also been leveraged for drug safety from the 1980s onwards. 8 More recently, spontaneous reporting systems have evolved into vast repositories of reports, primarily submitted electronically and analyzed quantitatively and clinically as part of signal management systems. 9,10 In addition to numerous primary data-collection studies, there are now hundreds of existing longitudinal observational databases (LODs) available for secondary use in epidemiological studies in North America, Europe and Asia, from drug or outcome registries to transactional insurance claims databases and Electronic Medical Record (EMR) databases. 11
While the core of regulated pharmacovigilance practice still centers on the collection of individual case safety reports, change is occurring, in part as a result of Big Data approaches. The greatest change in pharmacovigilance analytics being applied today, and the one most connected to the Big Data revolution, is the more sophisticated use of observational data, as evidenced by pharmacoepidemiologic studies conducted across multiple databases and the development of large networks of observational databases of Electronic Healthcare Records in North America, Europe and Asia. 12 The most well-known example of a longitudinal observational database network for safety assessment is the US FDA’s Sentinel system, initiated in pilot form in 2009 and consisting primarily of private transactional insurance claims data. The system was specifically designed to investigate potential safety concerns, 13 in response to perceived weaknesses of a safety surveillance system reliant on spontaneous reporting. Sentinel, now in routine use at the FDA, conducts hundreds of assessments of products, conditions and product–outcome pairs each year (Jeff Brown, personal communication). Analyses are conducted across a distributed network of data from 16 health plans (with additional datasets coming on line over time), currently covering over 220 million members and over 425 million person-years of data for analysis. Partners retain physical and operational control over their data by using a Common Data Model (CDM). Standard executable programs are sent to each data partner to perform analyses or create analysis files for pooling; summary data are then returned to, and compiled at, a coordinating center. The network routinely runs standardized, simple queries with a turnaround from query initiation to result of as little as 1 week, a rapid analysis capability not previously seen on large-scale observational data.
The distributed database, which is updated quarterly, has information on over seven billion medical encounters and six billion outpatient pharmacy dispensations, and is growing at nearly one billion encounters per year (Jeff Brown, personal communication).
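The distributed-query pattern described above can be illustrated with a minimal sketch. It is not Sentinel's actual software or data model: the `Dispensing` record, the function names and the toy data are all hypothetical, chosen only to show a standard program running locally at each data partner and returning summary counts, never patient-level rows, to a coordinating step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dispensing:
    """One row of a hypothetical, heavily simplified common data model table."""
    patient_id: str
    drug: str
    had_outcome: bool  # assumed flag: outcome of interest observed after dispensing


def count_exposed_outcomes(rows, drug):
    """The 'standard program': runs at the data partner, returns summary counts only."""
    exposed = [r for r in rows if r.drug == drug]
    return {
        "exposed": len(exposed),
        "events": sum(r.had_outcome for r in exposed),
    }


def pool(summaries):
    """Coordinating-center step: combine site-level counts without seeing any rows."""
    return {
        "exposed": sum(s["exposed"] for s in summaries),
        "events": sum(s["events"] for s in summaries),
    }


# Invented example data held separately by two data partners.
site_a = [
    Dispensing("a1", "drug_x", True),
    Dispensing("a2", "drug_x", False),
    Dispensing("a3", "drug_y", False),
]
site_b = [
    Dispensing("b1", "drug_x", False),
    Dispensing("b2", "drug_y", True),
]

# The same program is "sent" to every site; only the summaries travel back.
site_summaries = [count_exposed_outcomes(site, "drug_x") for site in (site_a, site_b)]
pooled = pool(site_summaries)
print(pooled)  # {'exposed': 3, 'events': 1}
```

Because every site stores its data in the same structure, one program can be run unchanged everywhere, which is what makes the short turnaround plausible; real analyses would of course add exposure windows, covariates and confounding control.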
The Sentinel Initiative is increasingly described as a component of a national evidence-generation system. 14 In practice, this means exploring a broader use of this data network by connecting it to additional data types (e.g. disease or drug registries) and/or other data networks such as PCORnet, a network of EMR repositories, 15 resulting in data systems encompassing more than half, and up to two-thirds, of the US population. These systems are envisaged as having value for research other than safety assessment, such as comparative effectiveness studies, pragmatic trials or investigational trials in real-world settings. Further, since these systems are being viewed as a national resource, public–private partnerships have been created to permit use by stakeholders other than the FDA – for example, the Innovation in Medical Evidence Development and Surveillance (IMEDS) program, which enables access to the Sentinel network. 16–18 Similar networks for safety surveillance have been developed around the world: ASPEN in Asia, 19 CNODES in Canada 20 and several multinational European networks such as the Innovative Medicines Initiative (IMI) PROTECT 21 project, ARITMO 22 and SOS, 23 while other networks, such as OMOP and OHDSI, primarily focus on method testing and informatics tool development for data networks. 24,25 While studies using multiple databases naturally provide more power, and therefore the ability to address safety issues more effectively, care needs to be taken to design studies appropriately, taking into account both the drug–event pair of focus and the rich heterogeneity across databases. While many of these systems use CDMs to create distributed data networks, other networks, such as PROTECT, have demonstrated that, when careful epidemiological reasoning is used to produce a common protocol across multiple centers and countries, analyses conducted locally yield generally concordant and predictably discordant results across databases. 21
As more observational data analyses are conducted with these data systems, it is essential to ensure that studies are conducted only when the research question is appropriate for observational study designs, that techniques for confounding control are reasonably good at controlling for important confounders, particularly confounding by indication and severity, and that regulatory and international scientific good practice guidelines are followed. Increased transparency in the conduct and reporting of studies may support better reproducibility and replicability, an approach the European Medicines Agency (EMA) has taken with its requirement for companies to register and disclose protocols and study reports of Post-Authorization Safety Studies (PASS). Recent recommendations from the joint International Society for Pharmacoepidemiology – International Society for Pharmacoeconomics and Outcomes Research (ISPE-ISPOR) taskforce on ‘Real World Evidence in Health Care Decision Making’ aim to guide the design and reporting of pharmacoepidemiological analyses of longitudinal healthcare databases. 26–28 In the coming years, we expect improvements in the quality and variety of clinical data available and linked in these networks. For example, to study the prevalence of congenital malformations among infants exposed and not exposed to varenicline in utero, Danish and Swedish medical birth registries were used to identify live-born infants; data on maternal varenicline use and congenital malformations in offspring were then obtained by linkage to nationwide registries of dispensed prescriptions and hospital admissions. 29 Similarly, a planned study of meningitis B vaccination will examine the safety of Trumenba vaccine exposure during pregnancy using electronic healthcare data and linked birth certificates from multiple healthcare systems in the US, all of which participate in the Sentinel distributed network. 30
Advances are also likely to occur through data enrichment, that is, linkage to clinical disease and drug registries or other primary data-collection systems, and supplementation of coded data with information obtained from the free text of medical records using natural language processing or similar automated techniques. 31
The existence of these large networked data systems, coupled with the ability to gain insights into a study question within days, rather than months or years as in the past, is the Big Data promise most clearly fulfilled. 32 With these advances, researchers have recently explored how these systems might be used for exploratory assessment rather than the usual signal evaluation approaches applied to these data. In this approach, the goal is to capture emerging and previously unsuspected signals – that is, hypothesis-free signal detection in LODs. 33 There are limitations to the data, however, that currently hamper their usefulness in signal detection, including: the lack of a learned reporter who suspects a medicine–adverse event relationship and is able to provide detailed medical information and a rationale for that suspicion; incomplete linkage of data from primary, specialist and in-patient visits; and delays in updating the data and making them available for research, which can be up to a year for some databases. There is now a nascent literature comparing LODs with spontaneous reports for signal detection, which shows cautious promise, at least for outcomes with high background event rates, which are difficult to capture as safety signals in spontaneous reporting systems. 34 For now, research suggests that, at best, this approach is likely to complement rather than replace spontaneous reports.
Consumer wearable technology, such as fitness devices and smartphones, and ‘smart’ digital technology, such as thermometers or glucose and heart monitors, have the potential to supplement these approaches by providing better, more detailed health and behavioral information than that collected routinely in electronic healthcare databases, at least for some diseases and comorbidities. Some of the data are collected automatically, such as a phone’s (and therefore an individual’s) location on a particular day and at a particular time. These data can then be linked to other information known about the location, such as the weather or air pollution levels. Most of the mobile data streams, though, are collected through ‘apps’ designed to collect information about subjective experience and/or objective measures (e.g. heart rate or number of paces). These subjective data streams are potentially more representative and systematic than social media data streams since the apps are able to prompt the user to enter data and to provide data summaries that are useful or of interest to the user. For example, researchers in the UK are using smartphones and linked mobile data to study the relationship between weather patterns and rheumatoid arthritis symptom severity. 35 This study collects information about the severity of pain symptoms from an app and then links it to weather information based on the patient’s location at the time of data entry. To encourage participation and frequent data entry, the researchers have designed the app to be relevant to patients; patients may view their individual symptom reporting over time as well as aggregate reporting trends for the entire study population. Elsewhere, computer games are used to collect data on reaction times to better understand disease progression 36 and smartphone apps are being explored as tools to collect safety information during research studies.
The use of consumer wearable technology for pharmacoepidemiologic research is in its infancy, although some argue that the line between medical devices and consumer wearable technologies is already beginning to blur. 37 If large-volume data streams are being created that may be accessible in near real time and are temporally proximal to a healthcare encounter or experience following the use of a medication, the promise for pharmacoepidemiology is great, particularly as these streams increasingly focus on medical and behavioral data (e.g. heart rate, personal and family medical history, smoking status, diet, alcohol consumption and exercise patterns). Additionally, our analytic capabilities are advancing, as these data streams and networks of sensors make it possible to examine relationships between data types at a scale not previously possible. To give three examples, none yet applied to drug safety research to our knowledge: studies have demonstrated how non-contact visual images can be used to infer muscle activity and force, 38 which one could anticipate would be of value in monitoring ALS progression; video-recorded data can be used to allow more accurate health insights, and therefore treatment, in asthma patients; 39 and research into food-related object recognition 40 could potentially lead to objectively recorded dietary data being linked to electronic healthcare databases.
We anticipate that patients whose healthcare is complex and has an impact on their daily life, such as the chronically ill, or those who are part of active and organized patient communities, will be early adopters of sharing their information for research purposes, despite the loss of privacy. This assumes, of course, that these data-collection tools offer value to the individual patient and, when appropriate, their healthcare providers. Privacy, the perceived risk of misuse of data and the regulatory considerations over tools that measure objective medical data (and therefore may be considered medical devices) are current hurdles to more widespread use. While it is impossible to predict how rapidly these types of apps will be taken up, we expect uptake to be faster among younger generations of healthcare providers and patients, who are arguably more comfortable with sharing data this way and perhaps more likely to find it valuable.
Beyond pharmacoepidemiology, the hope, and much of the hype, of Big Data for pharmacovigilance center on new data streams and technologies as a source for identifying potential new safety signals. For example, there are numerous claims in the popular and trade literature about the value of social and digital media for pharmacovigilance. The terms social media and digital media are often used interchangeably. In fact, they are poor descriptors and are instead ‘catch all’ terms for a range of heterogeneous data sources, from postings on social media applications such as Facebook and Twitter, to chat rooms and blogs, to compilations of search engine logs. Normally the use of these two terms, at least in the context of pharmacovigilance, is loosely connected to descriptions of adverse events actively posted in some way on the internet, with or without explicit linking by the recorder to medicinal product usage, rather than the traditional spontaneous report from a healthcare professional or consumer. The use of these data for pharmacovigilance is fraught with challenges, not least of which is that the primary purpose of most postings/communications is other than the report of a suspected link between a specific health intervention and adverse event. Clearly, when clinical or patient concerns are expressed on social media about perceived harm associated with a healthcare intervention, these should be treated as such. 41 Additionally, how patients and healthcare providers view the effectiveness of treatments or define unmet need is important for drug developers, payers and other healthcare decision-makers.
Despite these potential advantages, most social media reports that mention use of a drug and potential adverse event do not meet the basic regulatory definitions of an individual case safety report (ICSR), an ICSR being considered valid if as a minimum it contains (1) an identifiable patient, (2) an identifiable reporter, (3) a suspect drug and (4) an adverse event/outcome. Such reports often lack sufficient details to conduct a medical assessment, or describe events already included in the approved product label. These individual reports are thus unlikely to provide new medical data relevant to benefit–risk assessment. The same holds true for many customer engagement programs sponsored by commercial functions within pharmaceutical companies; research has demonstrated these programs generate a large volume of safety reports while contributing limited to no relevant new medical information. As a result, some researchers, including our pharmacovigilance department, are investigating the use of quantitative signal detection techniques applied to the data in aggregate, with the goal of identifying previously unsuspected safety signals through pattern or time disturbances in the data. Despite the hype and considerable research activity around social media data streams being able to enhance signal detection capability, evidence of value and demonstrable practical impact is limited to date; resources for routine signal management are therefore best deployed on clinical and healthcare data sources at this time. In the coming years, ongoing initiatives, such as the IMI’s WEB-RADR and the EMA’s goal to measure the impact of pharmacovigilance practices, will likely identify the best uses of these data for pharmacovigilance, including which patient populations, outcomes, or medicines are best suited to using these data for signal detection.
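The four minimum elements of a valid ICSR listed above lend themselves to a simple screening check. The sketch below is illustrative only: the `CandidateReport` structure, its field names and the example values are hypothetical, and a real triage step would need far richer logic (for example, deciding what counts as an "identifiable" patient or reporter).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CandidateReport:
    """Hypothetical container for what could be extracted from a post or report."""
    patient: Optional[str] = None        # identifiable patient
    reporter: Optional[str] = None       # identifiable reporter
    suspect_drug: Optional[str] = None   # suspect drug
    adverse_event: Optional[str] = None  # adverse event/outcome


# The four minimum criteria named in the text, as field names (an assumption).
REQUIRED = ("patient", "reporter", "suspect_drug", "adverse_event")


def missing_elements(report):
    """Return the names of the minimum elements that are absent or blank."""
    return [name for name in REQUIRED if not (getattr(report, name) or "").strip()]


def is_valid_icsr(report):
    """A report passes this screen only if all four minimum elements are present."""
    return not missing_elements(report)


# Invented examples: a typical social media mention versus a complete report.
post = CandidateReport(suspect_drug="drug_x", adverse_event="headache")
full = CandidateReport("female, 54", "treating physician", "drug_x", "rash")

print(missing_elements(post))  # ['patient', 'reporter']
print(is_valid_icsr(full))     # True
```

A screen of this kind only tests for presence of the four elements; it says nothing about whether a report carries enough detail for medical assessment, which is the second limitation noted above.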
The ultimate outcome of any contribution of Big Data to pharmacovigilance practice, whether quantitative signal evaluation or identifying new safety signals, must be safeguarding patient safety and wellbeing. Practically, this means the use of new data sources and technologies that make up Big Data should lead to outcomes that impact individual patient wellbeing and public health. For this reason, as we consider the steady stream of new data and technology platforms, we need to critically evaluate the impact of innovative data sources and techniques, and whether these should uniquely complement or replace existing approaches, or are redundant, adding little or no value to current pharmacovigilance practice. Such evaluations will need to carefully assess performance across data streams, comparing their timeliness, effectiveness and reliability for detecting emerging safety issues. Key to these evaluations is testing against appropriate established external reference sets using transparent and measurable performance criteria. Objective and reproducible performance assessments are essential if such evaluations are intended to modify, enhance or replace components of routine pharmacovigilance practice. There has been much recent focus in the field of pharmacovigilance on further developing the science of measuring the impact of pharmacovigilance activity. It is beyond the scope of this editorial to address this issue in detail, but the interested reader is referred, as an introduction to this emerging research field, to the 2016 EMA Workshop report on ‘Measuring the impact of pharmacovigilance activities’, which provides recommendations to develop a framework for impact evaluation. 42
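As one hedged illustration of what testing against an external reference set with measurable criteria might look like, the sketch below scores the drug–event pairs flagged by a data stream against reference positives and negatives. The choice of sensitivity, specificity and positive predictive value is our assumption rather than a prescribed standard, all names and pairs are invented, and timeliness would need a separate, date-based measure.

```python
def evaluate(flagged, positives, negatives):
    """Score flagged (drug, event) pairs against a reference set.

    positives / negatives: reference pairs regarded as true or false associations.
    Pairs flagged outside the reference set are ignored, since their status is unknown.
    """
    tp = len(flagged & positives)   # true associations that were flagged
    fn = len(positives - flagged)   # true associations that were missed
    fp = len(flagged & negatives)   # known non-associations wrongly flagged
    tn = len(negatives - flagged)   # known non-associations correctly left alone
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,
        "specificity": tn / (tn + fp) if (tn + fp) else None,
        "ppv": tp / (tp + fp) if (tp + fp) else None,
    }


# Invented reference set and invented output of a hypothetical data stream.
positives = {("drug_a", "rash"), ("drug_b", "liver injury"),
             ("drug_c", "bleeding"), ("drug_d", "nausea")}
negatives = {("drug_a", "fracture"), ("drug_b", "cough"),
             ("drug_c", "insomnia"), ("drug_d", "deafness")}
flagged = {("drug_a", "rash"), ("drug_b", "liver injury"), ("drug_c", "bleeding"),
           ("drug_a", "fracture"), ("drug_e", "itch")}  # last pair is unreferenced

metrics = evaluate(flagged, positives, negatives)
print(metrics)  # {'sensitivity': 0.75, 'specificity': 0.75, 'ppv': 0.75}
```

Running the same function over each candidate data stream, with the same reference set, gives directly comparable figures, which is the kind of objective and reproducible assessment argued for above.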
With more data streams than ever and the capacity for very rapid quantitative analyses, the tendency may be to want to do it all, as in theory, these capabilities will lead to better insights for decision-making. To do this requires a transparent and scientific framework for evaluating new data sources and technologies, and measuring their impact on pharmacovigilance process and research relative to the sources and approaches currently used. With such a framework in place, we envision a gradual evolution of pharmacovigilance’s existing multiple data stream strategy to an even more holistic data strategy, one in which evidence from across the ‘data smorgasbord’ is used to detect, understand and manage potential safety issues. Ultimately, a scientifically robust strategy to measure the specific value of innovative Big Data will ensure the best data sources and techniques are applied at the right time and, importantly, that human, financial and technological resources are allocated to those activities with the greatest impact, resulting in an effective, modern pharmacovigilance system.
Footnotes
Disclaimer
The views expressed in this article are the personal views of the authors and are not necessarily those of Pfizer Inc.
Funding
This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.
Conflict of interest statement
The authors declare that there is no conflict of interest.
