Abstract
The rise of electronic medical records promotes the collection and aggregation of medical data. These data have tremendous potential utility for health policy and public health; yet there are gaps in the scholarly literature. No articles in the medical or legal literature have mapped the “information flows” from patient to database, and commentary has focused more on privacy than on data’s social value and incentives for production. Utilizing short case studies of data flows, I show that ample data exist, much of it available online through government websites or hospital trade associations. However, the available information comes from billing records rather than medical records. Turning to legal and policy recommendations for better provision, I note that weak intellectual property law has ironically led to stronger control over health data through private contracts and technological barriers, as these methods of protection lack any exceptions for noncommercial use. I conclude with a series of policy proposals to make data more available.
Uncovering the data flows
The United States has more than 5700 hospitals, 59,000 retail pharmacies, and more than 800,000 physicians.1–3 Virtually every time an individual visits a doctor, fills a prescription, or stays in a hospital, information on the encounter is recorded, and increasingly that record takes electronic form. Once aggregated, this information serves critical purposes for public health and medicine. Researchers use it to determine the effectiveness of public health interventions and medical treatments. Patients can use it to compare the performance of health-care providers before choosing one for medical care, and hospitals may use aggregate data to compare themselves to other medical centers. Indeed, the Institute of Medicine, an influential advisory organization, has noted the importance of health-care data to measure care quality and costs. 4 Medical businesses use the information for planning, such as by determining needs for services in their area. Moreover, multiple authors have commented—usually critically—on the existence of a multibillion dollar industry of proprietary health information available to those capable of paying five-, six-, and even seven-figure fees for data. 5 Thus, health data serve important purposes for research, patient decision making, and business planning.
The literature on aggregate health data paints a portrait of a vast, fragmented landscape of health-related information. Moreover, this portrait is accurate. Understandably, then, articles tend to critique without describing the chaos of data flows. Yet these data flows are important to understand. Where do the available aggregate data originate? What is their path to the databases where they reside? What are the major gaps in data available at low cost, and at any cost? Here, I chart information flows, describe the uses of data, and profile several of the major databases. Only by understanding these data flows—the institutions through which the data flow as they are aggregated—can we evaluate policy “levers” to improve the availability of data.
In every case, data have to start somewhere, typically at the point of care. Clinical data begin with a patient receiving a diagnosis or treatment. In that encounter, whether it is a visit to a primary care doctor or an appendectomy, the patient’s data are transmitted to a facility record. Cost data on what different treatments cost, what hospitals spend on beds and computers, and so on, also begin at the point of care, if in a different form. From that point, data usually flow up from their initial aggregation point to increasing levels of aggregation. Thus, I start with brief examples of data flows from the point of care to aggregation for particular health-care providers. The case studies were performed at the end of 2009 and represent data flows for these providers as of January 2010. Although the flows for these providers have changed in certain respects since then, the case studies still provide an accurate representation of data flows in health care.
Data flows: Some brief examples
Yale-New Haven Hospital
My first example, Yale-New Haven Hospital (YNHH), provides a picture of data flows through a large hospital and illustrates further the characteristics of health data records. YNHH is a nonprofit, university-affiliated medical center that ranks among the top hospitals nationally. 6 As noted, this portrait describes the Yale system as of January 2010. At that time, about 70 percent of its patient encounters entered a unified electronic medical record (EMR) system called Eclipsys (J.R. and Mark Andersen, Chief Information Officer, YNHH, 30 November 2009, personal interview). The remaining data, mainly from outpatient clinics, were recorded in paper records, stored locally. All billings, however, were recorded electronically.
The EMRs in Eclipsys contain a mixture of structured elements (e.g. lab values), structured text (fixed category descriptors), and considerable information entered as free text. The last category consists of “notes” written by health-care providers describing physical exams, evaluations, and treatment plans, time-stamped for entry (but not reflecting the actual time of occurrence). Although YNHH used support software offered by Eclipsys to help aggregate information from the EMRs, most aggregation was performed without automation.
An example of an important quality measure can illustrate the sort of aggregation that must be done “by hand.” YNHH measures the “door to balloon time,” which corresponds to the time it takes for patients who enter the emergency department (ED) experiencing a heart attack to be transferred to a separate area of the hospital, the catheterization laboratory, to have a blocked artery opened by a small balloon inserted into the artery. The precise time is an important predictor of patient outcomes and is best measured in minutes. 7 However, this data point required manual inspection of several records—notes submitted in the ED, test results like an electrocardiogram (EKG), and the procedure details in the catheterization laboratory. The information, in other words, existed in the EMR but not in a format that could be extracted easily through code. This problem of extracting data from records is a major one, which EMR companies are striving to reduce by increasing the amount of information entered into structured fields (J.R. and Brad Shilling-Stad, Sales Representative, Epic Inc., 22 January 2010, personal interview). However, free text remains the principal mode of entry for physician notes in the majority of commercially available EMR systems. YNHH aggregated some other quality measures from the records as well, focusing on outcomes, procedure rates, and other values for the entire institution and broken down by physician.
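The cross-record linkage this measure requires can be made concrete with a short sketch; the record format, field names, and timestamps below are hypothetical illustrations, not YNHH’s actual schema, and in practice the underlying entries are often free text that must first be located and parsed by hand.

```python
from datetime import datetime

# Hypothetical, simplified entries pulled from three separate parts of an EMR:
# an ED note, an EKG result, and a catheterization laboratory report.
records = [
    {"source": "ED_note",  "event": "door",    "time": "2010-01-15 14:02"},
    {"source": "EKG",      "event": "ekg",     "time": "2010-01-15 14:11"},
    {"source": "cath_lab", "event": "balloon", "time": "2010-01-15 15:19"},
]

def minutes_between(records, start_event, end_event):
    """Compute the interval in minutes between two events scattered
    across different record sources."""
    times = {r["event"]: datetime.strptime(r["time"], "%Y-%m-%d %H:%M")
             for r in records}
    delta = times[end_event] - times[start_event]
    return int(delta.total_seconds() // 60)

door_to_balloon = minutes_between(records, "door", "balloon")
print(door_to_balloon)  # 77
```

The computation itself is trivial; the difficulty the text describes lies entirely in assembling these structured inputs from records never designed to be machine-readable together.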
The billing system, in contrast to the EMR, is more structured. YNHH, like the vast majority of American hospitals, used a billing standard known as UB04, developed by the American Hospital Association. Similar standardization of EMR records, though a far more challenging task, would help make EMR data available.
Each UB04 record contains demographic information about the patient (age and gender, but not necessarily ethnicity), an identifier of the treating physician, a coded system for payment type, and clinical information on diagnoses and procedures coded using the International Classification of Diseases, version 9 (ICD-9). Although the system is used for billing insurers, uninsured patients are entered into it as well. The billing system is used to generate trends in diagnoses, procedures, physician patterns of drug and procedure use, and other metrics. Some of these metrics are made publicly available on the YNHH website. Thus, the data might contain the following basket of variables:
Observation 1: Patient characteristics (age, gender, etc.), ICD-9 codes, admission date characteristics, biller characteristics, hospital characteristics, and provider characteristics.
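One way to picture such a billing observation is as a flat record, one row per visit. The sketch below is purely illustrative: the field names, the example ICD-9 codes, and the identifiers are my own simplification, not the actual UB04 field layout.

```python
from dataclasses import dataclass, field

@dataclass
class BillingObservation:
    """Illustrative visit-level billing record. The real UB04 form
    defines many more fields with its own numbering conventions;
    note there is no ethnicity field and no persistent patient key."""
    age: int
    gender: str
    icd9_diagnoses: list = field(default_factory=list)   # coded diagnoses
    icd9_procedures: list = field(default_factory=list)  # coded procedures
    admission_date: str = ""
    payer_type: str = ""     # coded payment type (insurer, self-pay, ...)
    physician_id: str = ""   # identifier of the treating physician
    hospital_id: str = ""

# A hypothetical observation (codes and IDs invented for illustration).
obs = BillingObservation(
    age=67, gender="F",
    icd9_diagnoses=["410.71"], icd9_procedures=["36.06"],
    admission_date="2010-01-15", payer_type="medicare",
    physician_id="MD-1234", hospital_id="CT-0001",
)
print(obs.payer_type)  # medicare
```

The flatness of this structure is the point: such records aggregate easily, which is precisely why billing data, rather than richer EMR data, dominate the databases profiled below.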
Both cost and medical information make their way from YNHH to third parties. At this time, YNHH did not sell any data. It did, however, give some data away. It also exchanged data for the ability to view data of peer institutions. Pursuant to Connecticut state law, it submitted and continues to submit extensive data from its billing system to the state hospital association. The data include exactly what I described previously: diagnoses, procedures, patient information, treating physician information, and payer information. The data that YNHH supplies to the Connecticut Hospital Association are pseudo-anonymous. They do not contain names or a code for each patient that is used beyond the data set. That said, there is enough information about patients that some records might be reidentifiable. These data at that time resided on servers in the headquarters of the Connecticut Hospital Association in Wallingford, Connecticut (J.R. and Mary Lyons, Vice-President, Connecticut State Hospital, 20 January 2010, personal interview).
YNHH also submitted medical and billing data to the University HealthSystem Consortium (UHC). Based in Oak Brook, Illinois, UHC is an association of university medical centers with 119 members in the United States.
Griffin Hospital
Griffin Hospital is a 160-bed, nonprofit community hospital in Derby, Connecticut. It mainly serves a local Connecticut population. Griffin, like YNHH, has a computerized billing system and a hybrid clinic system supported at the time of profiling by the MEDITECH record platform (J.R. and Kenneth Steele, Data Management Coordinator, Griffin Hospital, 19 January 2010, personal interview). At that time, MEDITECH supplied the system for all electronic patient records and administrative and billing records.
Griffin aggregated data internally and hired a company to perform additional analysis. It internally aggregated measures on mortality, length of stay, and 7- and 14-day readmission rates. Although MEDITECH offered some basic metrics for aggregation, Griffin did most of this aggregation manually. One major reason for manual review is that the creation of meaningful aggregate statistics requires excluding certain categories of data. For example, if the purpose of a readmission rate is to show when patients were improperly discharged, the aggregation should exclude certain types of readmission, such as pregnant women with false labor. Griffin also paid an outside company, Deltagroup, to provide additional analysis of Griffin’s own data. Specifically, Griffin wished to view comparative patient outcomes by physician, after controlling for severity of patient load. Deltagroup uses national data from Medicare claims and the company’s own algorithms to perform this adjustment. Deltagroup signed an agreement not to use Griffin’s data for any other purpose without the hospital’s permission.
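The exclusion logic behind a meaningful readmission rate can be sketched as follows. The field names and the exclusion list are hypothetical, and a real measure would involve far more clinically nuanced criteria than this toy filter.

```python
# Readmissions that do not signal an improper discharge (here, false
# labor) are filtered out before the rate is computed.
EXCLUDED_REASONS = {"false_labor"}

def readmission_rate(discharges):
    """Share of discharges followed by a *relevant* readmission
    within 7 days, excluding readmission reasons that do not
    reflect discharge quality."""
    relevant = [d for d in discharges
                if d["readmitted_within_7d"]
                and d["readmission_reason"] not in EXCLUDED_REASONS]
    return len(relevant) / len(discharges)

discharges = [
    {"readmitted_within_7d": True,  "readmission_reason": "infection"},
    {"readmitted_within_7d": True,  "readmission_reason": "false_labor"},
    {"readmitted_within_7d": False, "readmission_reason": None},
    {"readmitted_within_7d": False, "readmission_reason": None},
]
print(readmission_rate(discharges))  # 0.25
```

A naive count would report a 50 percent readmission rate here; the exclusion halves it. This is the kind of judgment-laden filtering that, as the text notes, generic EMR aggregation modules handle poorly.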
Ken Steele, the data management coordinator for Griffin in 2010, noted the difficulty of aggregating certain useful measures. He, like Mr Andersen, the then-Chief Information Officer of YNHH, tended to name “systems” measures, such as catheterization laboratory time and average waiting time in the ED overall. MEDITECH provides only limited tools for aggregation (J.R. and Kenneth Steele, Data Management Coordinator, Griffin Hospital, 19 January 2010, personal interview). While Griffin has named better data aggregation tools as one wish (among many) for improvement to MEDITECH, Mr Steele believed that the costs inherent in switching to a new package made a switch for the purpose of data aggregation alone prohibitive.
Mr Steele provided insight into the process by which billing codes are chosen. As noted, these codes have importance beyond billing. Physicians document the treatment provided during patient visits. Trained coders then map that information to the codes. In some cases, if a coder suspects that better documentation could allow the hospital to bill more codes, the coder will contact the physician and request additional documentation. The coding process is overseen closely by insurance companies.
Private gastroenterology practice
Joel Garsten and his partner run a gastroenterology practice in Waterbury, Connecticut. As of the time of the case study, the practice did not use EMRs because of the cost, but it did submit billing information electronically. The practice maintained a spreadsheet aggregating its patients by diagnosis, which it used for internal planning. The practice did not submit any data to external parties (J.R. and Joel Garsten, Gastroenterologist, 4 December 2009, email communication).
Pediatric practice
Pediatric Medicine of Wallingford has five pediatricians. The practice did not at this time use an electronic system, but it did perform aggregation manually. It was part of a loose consortium of private practices that pool and compare data informally, focusing on one or two diseases at a time. At that time, the disease of focus was asthma (J.R. and Steven Frank, Pediatrician, 14 January 2010, email communication).
CVS pharmacies
CVS is a publicly traded company with 7100 pharmacies around the country, which fill more than 1 billion prescriptions each year. 8 It has at least one store in 44 states. CVS did not respond to multiple requests for an interview, but information on its data-sharing practices is available from court filings and other sources. When a customer comes into a CVS store and fills a prescription, information is immediately transferred to a centralized CVS database in Rhode Island. CVS analyzes data from all its pharmacies internally and sells deidentified data to data companies.
Access to data: Public domain
There is a large amount of publicly available data, and more data are still available for a price. At the same time, much data remain untapped—useful information remains in medical records, unextracted. Here, I highlight what data are and are not available. My goal is not to provide a comprehensive summary of available data sources. Others have already compiled lists. Rather, I hope to profile some of the major data sources. I go in decreasing order of availability.
Almost all data in the public domain are prepared by either the federal government or state departments of health. The Agency for Healthcare Research and Quality (AHRQ), an agency within the Department of Health and Human Services, puts a significant portion of this information on its website. Generally, free public data are drawn from sources, such as the Healthcare Cost and Utilization Project (HCUP), that are otherwise available for purchase. Data in the public domain are further aggregated so that they pose little likelihood of compromising privacy, such as state-level data or even region-level data. For example, AHRQ has begun to put information online from HCUP, discussed in detail in the following.
HCUPnet
The web-searchable format, HCUPnet, offers a variety of state-level trend data on diagnoses, procedures, and patient populations. Thus, a researcher could count the number of pregnancy-related complications in a particular state or compare the morbidity rate for appendectomies in rural and urban hospitals. To serve different audiences, HCUPnet has two starting points for each type of query, one for “lay people” and another for “researchers.”
The lay interface uses simpler language. Rather than requiring users to enter codes, it contains searchable lists of diseases and procedures that map onto the coding system. AHRQ plans to add all available data from HCUP onto the website.
Beth Israel Deaconess quality registry
Physicians associated with Beth Israel Deaconess, one of the Harvard teaching hospitals, developed a “community quality registry.” In 2009, the registry was run by the Beth Israel Deaconess Physicians Associations, led by John Halamka, a physician who is also the Chief Information Officer of the hospital. The registry was designed at that time to feature “treatment process and outcomes” data on 20 important clinical topics. 9 The medical center required physicians in the Beth Israel Deaconess Physicians Organization to submit records to the registry, which is now operated by the statewide Massachusetts eHealth Collaborative. One motivation for the Deaconess registry is to capture Medicare incentives for making quality measures public, 10 and the data have begun to be used for that purpose. I discuss these incentives further in the final section.
Access to data: Available for purchase
HCUP
Perhaps the most comprehensive data set available for purchase is the HCUP, which exploits the widespread use of a standard billing format to provide comprehensive data on hospital patient characteristics, diagnoses, and procedures. HCUP is an effort to make claims data assembled at the state level available to third parties, including researchers, companies, and others. The program is run by AHRQ, but a private contractor, Thomson Reuters, processes requests for data and supplies customer service. Currently, 39 states participate in HCUP.
Typically, states join after having already developed their own systems to macro-aggregate claims data. In most cases, the state hospital association leads this process, though, in a few states, the state department of health has provided leadership. Why do hospitals submit data? In some states, it is required by statute, supported by fines large enough to prompt compliance, but not so large as to guarantee that all institutions will comply. Alaska, for example, had a compliance rate of 50 percent in 2009 (J.R. and Alice Rarig, Alaska Department of Health and Human Services, 3 December 2009, personal interview). The fine in California was US$100/day in 2009 (J.R. and Jonathan Teague, California Office of Statewide Health Planning and Development, 3 December 2009, personal interview). Most institutions use an electronic uniform billing system that makes compliance easy, but some hospitals do not, and those must go to extra efforts to gather the data. Other states have developed programs through voluntary participation. Participation is always nonexclusive; hospitals are free to provide the same data to others.
Usually, the state hospital association contracts with a data clearinghouse to clean, store, and manage access to the data. A popular one is the Healthcare Industry Data Institute (HIDI), a nonprofit spin-off of the Missouri Hospital Association. As of January 2010, HIDI stored data for nine states at its headquarters in Jefferson City, Missouri. The company at that time charged about US$2000/hospital per year, which typically corresponds to US$0.80/patient visit per year. At the state level, these funds are raised in three separate ways. Some states charge hospitals fees directly. Others charge hospitals to view state data through their own state-based systems, not through HCUP.
Finally, some states use state funds. Usually, when the state does not charge hospitals for data access, it conditions access on submitting data. The stored data do not contain patient identifiers, but some patients might be identifiable by working backwards from their information.
Participation in HCUP is voluntary. Just as hospitals are free to distribute data in other ways, so are states permitted to do their own data distribution outside HCUP. In addition, states have autonomy to decide whether they wish to use the HCUP distributor to sell data only to researchers or to anyone who wants to purchase it. Some have tiered pricing for academics, other nonprofit affiliates, and for-profit affiliates; others charge everyone the same price. Prices thus vary dramatically from state to state, ranging from US$35/state-year for educational users in California to US$2635/state-year for commercial users in Maine. In part, pricing differences result from the fact that some states are trying to use HCUP to recoup the costs of data storage, whereas others are not concerned with this purpose. Applicants must fill out an application explaining their need for the data, complete a 30-min privacy training, and sign a data-use agreement.
Medicare and Medicaid data
The Centers for Medicare and Medicaid Services (CMS), the agency that administers Medicare and the federal component of Medicaid, offers a variety of data sets on individuals who receive services through Medicare, the health insurance program for individuals 65 years and older, and Medicaid, the joint federal-state health insurance program for lower income Americans. These data contain fields similar to the HCUP data, but the CMS data are in one sense broader than HCUP and in another narrower. They are broader because they include outpatient care as well as inpatient. (HCUP does have data available on ambulatory surgery.) They are narrower because the Medicare data are skewed to an older population and the Medicaid data to low-income Americans. The CMS data are also released after a longer delay than the HCUP data. Claims data pass from providers to the Medicare intermediaries who administer the program. From there, they pass to the central headquarters of the CMS, where a copy is retained. The data are then distributed to outside parties.
CMS data have three forms: public (no person-specific information), limited (patient- and physician-specific data but stripped of identifiers), and identifiable data. Public data are available for download, but limited and identifiable data require completion of an application with an explanation of the reason for the request. Data are available for a fee. For limited data on Medicare patients’ hospital stays, for example, the cost is US$3655 per year. CMS data have prompted a cottage industry of companies who sell data analysis services to providers and patients. Deltagroup serves the provider market, utilizing CMS data to adjust physician performance by patient-load severity. Healthgrades, in contrast, is an example of a company that serves a patient market, providing ratings of physicians, hospitals, and other health-care providers based on CMS data.
Commercial
Data companies maintain expensive data sets with additional information on prescription drugs, inpatient and outpatient care, and other health data. CVS and other pharmacy chains sell their data to these companies. Verispan LLC (now owned by SDI), for example, claimed to have a database with 50 percent of the drug prescriptions filled each year in the United States. 11 These firms do not disclose details of their hospital databases or their sources (J.R. and representative (unnamed by request), Verispan Inc., 22 January 2010, personal interview). Verispan appears to have similar claims data to HCUP and CMS, but it may have more extensive outpatient data.
Access to data: Private
DATABANK
Thirty-seven states participate in DATABANK, a service run by the Colorado Hospital Association (CHA). The information is administrative and financial. It includes data on hospital bed utilization, average salaries by employee type, and payer distribution, among other areas (Reed, 2010, telephone interview). CHA functions much like a data clearinghouse. State hospital associations pay a license fee (US$11,500 per year in 2009) in return for which CHA cleans and stores the data. Selected employees of CHA have access to all the data. Hospitals also receive access to all data for any month in which they submit data. State hospital associations have ongoing access to their state data. There are no restrictions on what state associations can do with their data. No liability waivers are signed. Generally, state participation is voluntary, but in Colorado, all hospitals serving Medicaid patients are required by law to submit data. As noted, Connecticut hospitals participate, including YNHH and Griffin Hospital.
Although the service is oriented primarily at participating members, DATABANK has begun to accept select requests from third parties. A typical regional slice of the data might cost US$70,000–US$90,000 for 1 year of access.
Making sense of data aggregation: Micro- and macro-aggregators
How can we make sense of these data flows? One notable pattern is the common presence of two levels of aggregation. Health data flow through two major aggregation points. The first, as noted, is the point of care delivery. These “micro-aggregators” include hospitals, physician practices, dental practices, and pharmacies. Given the structure of the US health-care system, this first point of aggregation is all but inevitable. Many of the records are a product of what occurs at this level. Patient and physician data are generated when patients come for care and would not exist but for aggregation at the microlevel. Institutional cost data are generated when micro-aggregators bill for procedures, buy supplies, or pay staff. Some micro-aggregators do extensive data manipulation, others do not, but so long as they are taking the information and organizing it into compiled records, they are aggregators. Furthermore, since the United States does not have portable personal medical records, micro-aggregator records are necessary for the data to exist in the first place. It is surely possible to envision other systems: for patient data, each individual could possess a health record stored on a central government server, for example. However, that is not the US system now, nor will it be in the near future.
Micro-aggregators only rarely furnish data directly to the public sphere, even to those willing to pay for data. Except for examples like YNHH’s quality metrics, micro-aggregators pass data to another level of macro-aggregation. This level consists of four broad groups. First, there are state hospital associations. These are nonprofit corporate entities who receive data from billing and financial records. They organize the HCUP data discussed earlier. Second, there are companies like IMS Health, Verispan, and Cerner Corporation, who purchase data from micro-aggregators, pool it together, and sell it as a line of business. Third, there are government agencies who receive data either directly from micro-aggregators or indirectly from other macro-aggregators. Fourth, recently insurance companies have begun to sell their own aggregated claims data. In some sense, macro-aggregators must exist as well, but their form is more flexible. Google Health notwithstanding, aggregation at the point of care will continue to exist for the foreseeable future. In contrast, if micro-aggregators made their data public, an ambitious programmer could perform macro-aggregation.
Between micro- and macro-aggregators, a third group is emerging, consisting of direct linkages among micro-aggregators or among patients. Regional health information organizations (RHIOs) pool data across multiple micro-aggregators within a geographic region, sharing EMRs and aggregating some measures of quality. RHIOs, however, exist in only a nascent stage. Although more than 30 were created between 2000 and 2005, only 7 remain today, and they do not share records as intended. 12 On the patient side, patients have begun to pool their data through online communities of patients.
Retail pharmacy chains occupy an odd place in this framework. CVS is, in one sense, a macro-aggregator combining data from thousands of micro-aggregator pharmacies. However, it operates almost as one large database, in which each pharmacy is a point to enter data into the system. Yet, even here, there is a second layer: CVS sells data to data companies, which in turn distribute it to others for a price. The micro–macro distinction, after all, is not a perfect one. A hospital could start a data business and become a macro-aggregator. The point, rather, is to note a pattern in data flows: micro-aggregation at the point of care and macro-aggregation “upstream.”
Putting the pieces together: The status of health data
Here, I integrate information from my brief case studies, my research into specific databases, and my analysis of the micro–macro aggregator divide. Clearly, a great deal of data is available. What is available, and at what price, is quite remarkable: extensive information on every hospital visit in most states during the calendar year (for available years). What are the shortcomings in the availability of health data?
Gaps in aggregate health data
No longitudinal patient data. Longitudinal data on patients are limited. HCUP and CMS principally use the “patient visit,” not the patient, as the unit of observation, though the full version of CMS data contains a patient identifier that can be used to link patient visits. Relatedly, patient visit data omit many important characteristics of a patient other than age and gender, such as ethnicity, activity level, family history, and social history.
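The consequence of the visit, rather than the patient, being the unit of observation can be sketched briefly. The records below are hypothetical and deliberately minimal; the point is only the presence or absence of a patient key.

```python
from collections import Counter

# Hypothetical visit-level records, as in HCUP: each row is one hospital
# visit, with no patient identifier that persists across visits.
visits = [
    {"visit_id": 1, "diagnosis": "heart failure"},
    {"visit_id": 2, "diagnosis": "heart failure"},
    {"visit_id": 3, "diagnosis": "pneumonia"},
]

# Cross-sectional questions are easy: count visits by diagnosis.
by_diagnosis = Counter(v["diagnosis"] for v in visits)
print(by_diagnosis["heart failure"])  # 2

# A longitudinal question -- were the two heart-failure visits the *same*
# patient returning? -- is unanswerable: there is no patient key to group
# on. With a persistent identifier (as in the full CMS files), visits
# could be grouped per patient:
linked = [dict(v, patient_id=p) for v, p in zip(visits, ["A", "A", "B"])]
per_patient = Counter(v["patient_id"] for v in linked)
print(per_patient["A"])  # 2
```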
Little data from EMRs rather than billing systems. Most aggregated data come from administrative and billing systems rather than EMRs, introducing certain nearly unavoidable gaps in the data. In particular, there is little information on providers as “health systems.” Operational information—times to the catheterization laboratory, mean emergency room (ER) waiting times, and the like—is not available. Clinical information is filtered through ICD-9 diagnostic and billing codes. While these codes standardize the data, they also lose information that does not map easily onto a code. Differences within codes are lost entirely.
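The information loss within a code can be made concrete: clinically distinct presentations collapse into a single billing code. The coding function below is a toy illustration of that collapse, not an actual coding guideline.

```python
# Two clinically different notes map to the same ICD-9 code, so any
# distinction between them vanishes downstream.
notes = [
    "community-acquired pneumonia, right lower lobe, mild hypoxia",
    "pneumonia of unknown organism, bilateral, severe hypoxia",
]

def to_icd9(note):
    """Toy coder: a real coder applies detailed guidelines. Here, any
    mention of pneumonia without an identified organism is assigned
    486 ("pneumonia, organism unspecified")."""
    return "486" if "pneumonia" in note else None

codes = [to_icd9(n) for n in notes]
print(codes)  # ['486', '486'] -- the clinical differences are gone
```

Anyone working from the coded data alone sees two identical records; the severity and laterality recorded in the notes are unrecoverable without going back to the free text.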
Limited outpatient data. Data on health care provided outside hospitals are limited to Medicare claims data and prescription drug information, the latter of which imposes major financial barriers. It is important not to overlook the importance of outpatient data for planning and policy purposes—some studies have pointed to higher outpatient expenditures as a major reason that the United States spends more on health care than other countries. 13
Hospital information is based on discharges rather than admissions. Although HCUP and CMS data provide detailed information on discharge diagnoses and procedures, they provide little “admission” information—that is, what conditions patients have when they arrive at the provider’s door to receive care. As a result, it is difficult to distinguish complications of care—conditions that arose as a result of treatment—from problems the patient already had upon arrival at the hospital.
Financial barriers
Administrative and cost data. Financial and cost data at the micro-aggregator level are available only from commercial or quasi-commercial sources (DATABANK). Prices range from thousands to US$100,000+.
High cost of private data. While private companies house data on prescription drugs and likely better outpatient claims data, the costs to access this information are staggering. IMS Health charges a minimum of US$2000 for any search of a database. Prices can reach millions of dollars.
Trends data. Trends data are extremely expensive, even from modestly priced sources like HCUP: while 1 year of HCUP data costs at most US$2000 per state, data spanning multiple states and years may cost more than US$100,000. Outside of HCUP, trends data might need to be purchased from private sources.
Technology limitations
Limits in EMR aggregation technology. Although EMR software typically contains a number of add-on modules for data aggregation, these tools have major limits. Most importantly, they cannot extract the massive amount of information entered as free text in the EMR records. They also are less able to aggregate “between records,” such as extracting the time to the catheterization laboratory from multiple entries by different providers at different times.
Variation in aggregation quality. The quality of aggregation tools varies from one EMR provider to another. While Epic has a wide range of tools, MEDITECH, according to Ken Steele, has more limited capabilities. Since switching EMR vendors is a difficult process, micro-aggregators are limited to the tools available for their own platform.
Lack of third-party software. Third-party aggregation tools (i.e. tools separate from the Electronic Health Record (EHR) that work with multiple platforms) are not extensively available, partly because the lack of system standardization makes it difficult to develop a tool that could work with any vendor.
Delays in availability. Imagine if data on waiting times in ERs were available instantaneously. Patients could choose the hospital with the shortest wait time in their local area, which in turn would help allocate patients efficiently among EDs based on how busy each was at any given moment. Yet even if data on waiting times were collected, they would likely be part of a large data set not released for months or even years after the date in question. Real-time information is a distant dream.
Having listed these problems, I wish to make clear that numerous sources of aggregate data exist and data aggregation will continue to become more commonplace with the combined effect of increasing adoption of electronic record keeping and federal reforms designed to reward tracking outcomes. Beth Israel’s efforts to warehouse quality data have led to statewide data aggregation, organized by the Massachusetts eHealth Collaborative. Outside the United States, the Center for Health Record Linkage in Australia, for example, links together data stored by hospitals, health departments, and other sources. 14 Another example is the Infections in Oxfordshire Research Database, which aggregates infectious diseases data from hospital records and laboratories. These efforts will no doubt continue. 15
The fragmentation problem: A lack of standards or integration
Even with these new attempts to aggregate, data remain fragmented. This problem is particularly acute in the United States. Analysts have commented that the US health-care industry is highly fragmented. 16 Fittingly, health data suffer from the same sickness. There is technical fragmentation—EMR vendors differ, and there remains little record standardization between vendors; more standardization would allow third-party vendors to develop more aggregation tools. There is quality reporting fragmentation—different providers report different information, or no information at all, on their websites as they choose. Most fundamentally, there is data fragmentation. Sources like HCUP, Medicare data, and private databases exist. The data in them overlap in content (HCUP contains Medicare data) and standardization (HCUP and CMS share ICD-9 codes), but there are also differences (HCUP vs private hospital data). Private and public sources offer data at different prices in different ways and formats. Researchers face a different application process at every turn—on the IMS website, from the HCUP distributor, from the Medicare agency.
Privacy, limitations on data use, and the failure of aggregation
Although I largely bracket privacy in this discussion, I recognize its importance in the health data debate. Aggregation has its limits. Data so aggregated as to ensure privacy are also of limited usefulness, and features that could remedy the shortcomings of current aggregate data could also pose new privacy concerns. Longitudinal patient data, for example, are high on the wish list of public health researchers; from a privacy standpoint, however, such data raise serious concerns.
Privacy and utility come into inherent conflict. For utility, data should be detailed: they should contain granular variables about the patient, the treatment, and the physician; researchers should be able to link the records by patient or provider to other data sets; and if data are deidentified, it should be possible to reidentify them. For privacy, precisely these features pose the greatest risks. As Paul Ohm has written, “[d]ata can either be useful or perfectly anonymous but never both.” 17 Even aggregate data can still threaten anonymity. Aggregation can, of course, occur at varying levels of granularity, and granular aggregate data can still be reidentified. HCUPnet, for that reason, will not display results on its public website with counts less than 10—if a query of hospital admissions for a certain diagnosis returns only three results, the website will indicate only that the count is under 10. This not only helps prevent reidentification but also decreases the utility of the search. Perfect utility and perfect anonymity conflict, and efforts to make health data more available should weigh privacy concerns very carefully. This trade-off will be an ongoing issue in data aggregation.
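The HCUPnet rule just described is a form of small-cell suppression, and the mechanism is simple enough to sketch. The following is a minimal illustration—not AHRQ’s actual implementation—using the under-10 threshold mentioned in the text:

```python
SUPPRESSION_THRESHOLD = 10  # HCUPnet-style rule described in the text


def display_count(count: int) -> str:
    """Return a publishable string for one aggregate cell count.

    Counts below the threshold are masked rather than reported exactly,
    trading some utility for anonymity.
    """
    if count < SUPPRESSION_THRESHOLD:
        return f"<{SUPPRESSION_THRESHOLD}"
    return str(count)


# A diagnosis with only 3 admissions is masked; larger counts pass through.
print(display_count(3))    # <10
print(display_count(250))  # 250
```

Even this toy version shows the trade-off: a researcher querying a rare diagnosis learns only that the count is small, not whether it is 1 or 9.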
The effort to protect privacy will encounter legal and social barriers as well as technical ones. Once data come into being, rights over their use may take on a life of their own. For example, states have attempted to restrict the sale of data on drug prescriptions, and these attempts have encountered constitutional challenges. In Sorrell v. IMS Health, the Supreme Court struck down Vermont’s law barring the use of prescription data for marketing purposes on the basis that it infringed on the free speech rights of data companies. 18 The Court suggested that one critical flaw in the law was the selectivity of its restrictions, which prohibited particular uses of data by particular parties (e.g. drug companies). Other statutes might fare differently, 19 but the lesson is that once data are created, it is difficult to control their distribution.
Legal control: “Off the shelf” property rights, private contracts, and technological “fences”
How the American legal system treats medical data has tightened control over data availability and use for noncommercial purposes. Essentially, weak property rights have prompted strong contract rights.
Control over paper copies of medical records is governed by state law. Many states allow patients to inspect their records. The records themselves, however, are generally the property of the health-care provider in much the same way as the provider’s phones, cabinets, and office supplies. Electronic medical and billing records raise a host of new—and still unresolved—issues. Given that information is far easier to extract and aggregate from EMRs, one such question is the status of that compiled information, separate from the raw records themselves.
Intellectual property law—copyright, patents, trademark, and trade secrets—is the standard means through which Anglo-American law protects “information goods” such as health data. Information goods have characteristics that differentiate them from physical property: they are nonrival (consumption by one person does not prevent consumption by another), and the marginal cost of production is low. Intellectual property rights (IPRs), however, offer weak and uncertain protections for health data. The most applicable IPR is copyright, but American copyright does not protect raw compilations of data such as phonebook listings. 20 That said, a specific selection and arrangement of data may be copyrightable if it meets a minimum bar of originality. 21 Thus, to an unknown degree, copyright may prevent unauthorized wholesale copies of health databases but is unlikely to prevent using the raw data itself.
Unfortunately, the picture becomes more complicated because certain data elements themselves may be copyrightable, if the element itself involves sufficient creativity. The most likely contender in existing data is the ICD-9 code, as a doctor could argue that the decision making involved in determining applicable codes was sufficiently original. Lists of coin prices, for example, are copyrightable. Ages and genders of patients are almost certainly not. In summary, when a researcher seeks to copy data to use it, she or he may or may not infringe on a valid copyright, depending on the form in which the data get copied (whole database vs extracted selections of raw data) and the types of data elements used (ICD-9 codes may or may not be permissible, ages almost certainly are).
To strengthen and clarify their rights of control over their data, aggregators have turned away from IPRs and instead relied on contracts and password-protected databases. Health data thus offer a good example of what Margaret Radin has described as regulation by contract and machine. 22 Aggregators protect their data by requiring consumers of the data to sign agreements promising not to distribute the data to others, and they guard the data through password-protected databases. Even government officials can be required to agree to contractual restrictions. Alice Rarig, the Alaska state official who oversees Alaska’s HCUP data, said that she had to sign a contract promising to “sign away her first born” if she shared the data. It was her understanding that the agreement prevented her from sharing with anyone outside her department for any reason, even for a purpose that would fall under the umbrella of copyright’s fair use.
By relying on contract, aggregators achieve greater predictability and control. They achieve predictability because they need not worry about the unresolved issues in the use of copyright, or the uncertainties surrounding underutilized concepts such as unfair competition. They achieve control because they can select the rights of control they wish to exercise, for as long as they state by contract, subject only to the willingness of buyers and possible constitutional limits. 23 For researchers, however, contracts lack the exceptions for noncommercial use that copyright provides. Contracts need not contain a fair use provision. That is, an aggregator can sell its data and bar the buyer from sharing the data for any reason, even for research purposes.
The European Union (EU), in contrast, has created special property rights for databases that function alongside copyright law. These provide property-like rights without imposing the same originality requirement as copyright. The EU database directive grants an initial 15 years of protection to a “collection of independent works, data or other materials arranged in a systematic or methodical way and individually accessible by electronic or other means.” 24 The Directive, which member countries must then implement, provides producers the rights to prevent extraction and reutilization of the contents of their databases. Whatever their disadvantages, these laws at least help clarify the property rights available to database producers, and may discourage a reliance on contracts.
Issues and recommendations in the movement toward a public good
What are some means to make health data more available? By “available,” I refer to increasing the availability of existing data and producing additional data that address the gaps noted in section ‘Uncovering the data flows’ of this article. Here I touch on some possibilities.
Changes to ownership rights over data: to recap, the current environment pairs weak property rights with strong contractual rights under private ownership. There are two possibilities:
Public ownership. Some prior writers have called for public ownership of data. But as a matter of intellectual property, raw data are already largely public: aggregators retain only weak property rights over commercialization. Public ownership would have to go further. It could, perhaps, prohibit provisions in contracts with data buyers that prevent them from giving data to researchers for noncommercial purposes. It is unclear whether such prohibitions would significantly lower the incentives for data companies and trade associations to aggregate data for internal and commercial purposes.
Stronger property rights. It is conceivable that stronger property rights for commercial uses, paired with limits on contracts, would make data more available for researchers. One option would be to grant IPRs over data for commercial purposes in exchange for commitments to make data available to researchers.
Direct subsidies for electronic records and data aggregation: One way to promote gathering of additional data and increased availability is to offer financial rewards for doing so. The CMS offers some such incentives as part of its incentive programs for EHRs. These programs were established under the Health Information Technology for Economic and Clinical Health Act of 2009, which earmarked more than US$30 billion to promote EHR adoption. 25 The CMS programs offer financial rewards to health-care providers who make “meaningful use” of an EHR. 26 While much of meaningful use concerns how providers utilize EHRs for patient care delivery and decision support, the provisions also require reporting of “core” quality measures on such outcomes as smoking cessation and offer funds for public health data gathering such as immunization registries. It is too early to assess the effect on EHR utilization, much less data aggregation, or whether this set of programs creates the right incentives. Nevertheless, direct subsidies for EHR adoption, data aggregation, and data reporting are an important policy option.
Indirect subsidies: Another option is to subsidize provision indirectly, such as through tax relief. Tax relief may be an appropriate tool to encourage dissemination of data already aggregated by private companies, which, as noted earlier in the article, are some of the most expensive and inaccessible aggregate data available at this time. Unfortunately, current tax law offers little financial incentive for data companies to donate data to qualifying charitable organizations. In general, tax law treats donations of property quite favorably, allowing donors to take deductions for the entire fair market value of the property, up to 10 percent of the business’s adjusted gross income. However, the tax code views donations of intellectual property less favorably. Since the American Jobs Creation Act of 2004, businesses can no longer deduct the fair market value of donations of patents, copyright, and trade secrets and instead must deduct the basis, which is typically far less.27,28 Donations of “copies” of protected items or unprotected nonpublic information goods usually do not qualify for a significant deduction. These reforms were designed to prevent businesses from donating intellectual property of little value and deducting an inflated estimate of its worth. But reform to allow better incentives through the tax code would be one option to promote data sharing.
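The gap between a fair-market-value deduction and a basis deduction can be large. The sketch below illustrates the two regimes described above with hypothetical figures; the function name and numbers are invented for this example, which is a simplified illustration of the rule, not tax advice:

```python
def allowable_deduction(fair_market_value: float, basis: float,
                        adjusted_gross_income: float,
                        ip_donation: bool) -> float:
    """Simplified sketch of the two regimes described in the text.

    Ordinary property: deduct fair market value, capped at 10 percent
    of adjusted gross income. Intellectual property (post-2004 rules):
    deduct only the basis.
    """
    if ip_donation:
        return basis
    return min(fair_market_value, 0.10 * adjusted_gross_income)


# A database with a claimed FMV of $500,000 but a basis of $50,000,
# donated by a firm with $10,000,000 adjusted gross income:
print(allowable_deduction(500_000, 50_000, 10_000_000, ip_donation=False))  # 500000
print(allowable_deduction(500_000, 50_000, 10_000_000, ip_donation=True))   # 50000
```

Under the post-2004 intellectual property rules, the deductible amount drops by a factor of ten in this hypothetical, which is precisely why the current code offers so little incentive to donate data.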
Promoting professional norms of data production and distribution: Health-care professionals operate according to complex norms. Partly, these derive from informal and formal rules established by professional bodies like the American Medical Association (AMA), rules enforceable by state licensing boards. Norms can change. Hand washing is an area where efforts by hospitals and professional organizations effected a shift. 29 Professional societies and individual champions can be a useful “lever” for norm change. The AMA, for example, promulgates a comprehensive set of ethical and policy stances, yet it currently takes a stance only on data privacy, not public provision: it urges safeguarding privacy when data are made public but says nothing about the need for better aggregate data. 30 The hope with this form of professional norm change is that when one professional adopts a practice, it encourages others to do the same. Social network effects matter for doctors, nurses, and other staff. Consistent with the literature on norms influencing physician practice, past work has shown the role of individual champions in promoting ideas. 31
Record standardization. A lesson from claims data is that standardization makes aggregation far easier. The Department of Health and Human Services has begun to promote efforts to standardize electronic medical information. Record standardization would make technological tools for data aggregation easier to develop.
Third-party software. Given how difficult it is for micro-aggregators to switch EMR vendors, third-party software has an important role to play. Subsidies for research (e.g. free-text parsing of medical records) and third-party software development would be helpful.
Privacy. Data anonymization will never preserve all data utility while safeguarding privacy. The balance between utility and privacy will require placing trust in the appropriate institutions rather than relying entirely on a process of technical deidentification. This describes the HCUP approach to privacy, which combines some technical measures (stripping of patient identifiers) with data-use agreements that require researchers using the data sets to respect patient privacy. Thus, we may want to ask what level of anonymization and aggregation is appropriate for health data sets in different hands.
Further research into available data and the barriers. Research should proceed on at least two fronts. First, the public needs a better understanding of the data possessed by private companies and their efforts to make that data available for noncommercial purposes. As I have noted, these companies tend to be extremely secretive. Verispan would not comment, even off-the-record, on whether it provides data to researchers. Second, research should explore the barriers to data aggregation and distribution on the part of micro-aggregators. I have taken a small first step in that direction, but there is far more work to be done.
Acknowledgements
The author thanks the participants of the fall 2010 Access to Knowledge Practicum: Lea Shaver, Laura Bishop, Steve Gikow, John Lu, and Victoria Stodden for their helpful comments. The author has no personal financial interest related to the subject matter of this article.
