Abstract
Peking University has several prestigious teaching hospitals in China. To make secondary use of massive medical data for research purposes, construction of a clinical data warehouse is imperative at Peking University. However, a major concern in clinical data warehouse construction is how to protect patient privacy. In this project, we propose a combination of symmetric block ciphers, asymmetric ciphers, and cryptographic hashing algorithms to protect private patient information. The novelty of our privacy protection approach lies in message-level data encryption, the key caching system, and the cryptographic key management system. The proposed approach is designed to scale to clinical data warehouses holding medical data of any size. With the composite privacy protection approach, the clinical data warehouse can be made secure enough to keep confidential data from leaking to the outside world.
Introduction
In the last two decades, we have witnessed an ever-increasing volume of data being collected in hospital information systems (HISs), especially in electronic medical records (EMRs). As clinical databases provide a rich source of data for research in areas such as medical care delivery and medical quality monitoring, how to make secondary use of these large volumes of collected data has become a hot research topic in the literature.1–5 Abhyankar et al. 6 proposed a method to standardize clinical laboratory data for secondary use. Tolar and Balka 7 put forward advice about how to enhance care through secondary use of EMR data in a general practice setting. The Strategic Health IT Advanced Research Projects Area 4 Consortium (SHARPn) project, funded by the US government in 2010 to build a robust infrastructure for secondary use of electronic health record (EHR) data, has taken shape. 8
Driven by demand, data warehousing has been proposed as a way of supporting secondary use of those invaluable medical data maintained in current HISs.9,10 With the aid of data warehousing technologies, we can collect and integrate data from various HISs, and build multidimensional patient information cubes or data marts. Based on the raw, aggregate, or statistical information displayed in a clinical data warehouse (CDW), researchers may have various hypotheses tested, which can stimulate further research in healthcare-related issues. Several CDWs11,12 have been developed for this purpose.
However, with the increasing complexity of integrating patient data from different departments or hospitals, and the keen desire of other entities such as insurance companies and pharmaceutical companies to access these data, serious pressure is put on a CDW’s capability to protect patient privacy. Privacy violations may result in loss of dignity, and confidential information from a person’s medical record may influence his or her credit, employment, and ability to get health insurance. 13 Therefore, protecting patient privacy becomes a big challenge for CDW designers.
In response to these concerns, various methods have been employed. From the policy and law side, several bills were introduced during the past two decades in some developed countries. In the United States, for example, the Health Insurance Portability and Accountability Act (HIPAA) was signed into law in 1996, and component privacy regulations were published in December 2000. 14 In China, however, which is a developing country, the most pressing concern of citizens regarding healthcare services is medical cost, and only a small portion of the population may be conscious of their privacy when seeing a doctor. Research on patient privacy protection is still in its initial stage, and aside from some graduate theses15,16 that discuss legislation to protect patient privacy, there is neither a specific bill nor a regulation regarding patient privacy protection. From the technology side, first, user access control17,18 has been proposed as a measure to limit users’ access to confidential and sensitive data. Second, many methods for data encryption and pseudonymization have been proposed by researchers19–23 to protect privacy. In this article, we propose the use of composite methods to implement privacy protection in the CDW that is being developed at Peking University (PKU). In our composite approach, the privacy model is composed of various data encryption methods and a user access control mechanism in a layered CDW structure; this article mainly describes the composite data encryption approach.
The structure of this article is as follows: a brief introduction of the Peking University Clinical Data Warehouse (PKUCDW) and the necessity of protecting patient privacy in the CDW are given in the “Background and significance” section; the “Methods” section presents a composite privacy protection model designed for the PKUCDW; the “Results” section reports performance tests of the encryption system; and finally, the “Conclusion” section summarizes the results and the contribution of this article.
Background and significance
Background of PKUCDW
PKU is affiliated with six prestigious teaching hospitals. Three of them are general hospitals, and the other three are specialty hospitals for cancer, dentistry, and mental disorders. All affiliated general hospitals have adopted HISs such as EMRs, laboratory information systems (LISs), pharmacy information systems (PISs), picture archiving and communication systems (PACS), and so on. To make secondary use of the large volume of data collected in these HISs, a research group with experts from computer science, decision making, data mining, statistics, and epidemiology was formed at the Peking University Medical Informatics Center (PKUMIC) in 2011. A very important mission for PKUMIC is to centralize and integrate data from the HISs located in all affiliated hospitals and to develop a CDW to extract new knowledge and valuable information from a large volume of medical data.
Structure of PKUCDW
We adopted the architecture proposed by Inmon 24 to design our CDW. In this structure, a data warehouse (DW) should be a repository that provides data for data marts, which are created only after the creation of a complete DW.
Inmon’s DW structure matches our requirement for a CDW very well. The main mission of PKUCDW is to warehouse the data in the main HISs across all affiliated hospitals and to analyze the data from different research angles. The end users of PKUCDW include not only researchers at PKUMIC but also medical staff in affiliated hospitals. As different users require data for analysis from their own perspectives, it is better to provide them with different data marts or cubes according to their specific requirements. Here, a cube implies something very specific, while a data mart is more inclusive and can contain tables or cubes. The structure of PKUCDW is shown in Figure 1.

Figure 1. Structure of PKUCDW.
Privacy concerns
The biggest concern for the public about making secondary use of data in HISs is patient privacy violation. Besides those routine clinical data such as common clinical symptoms and laboratory results that are stored in HISs, confidential data such as
In our case, although all researchers and staff involved in the CDW project have signed agreements with PKUMIC or affiliated hospitals on privacy protection, there are still concerns about disclosure by internal personnel, as parties with strong interests in the EMR data may make every possible attempt to access them. Pharmaceutical companies are one example: because clinicians in different hospitals may have different preferences in prescribing medicines, salesmen from pharmaceutical companies are keen to obtain EMR data to analyze which clinician in which hospital favors which medicines. With these concerns in mind, it is necessary for us to employ technological methods together with administrative procedures to protect data privacy in the design and development of PKUCDW.
Overview of privacy protection methods
In general, privacy protection can be enforced from two perspectives. From one perspective, protecting privacy requires limiting the number of users that can access the data, and from the other perspective, it requires limiting the data that can be accessed. For limiting users’ access to data, different user access control mechanisms have been proposed and employed in the literature.17,18,25–27 For limiting the data that can be accessed by users, different data de-identification approaches have been developed.21,22,28,29 Procedures from administrative perspectives and methods from technological perspectives have been proposed and employed in a complementary way in the literature.7,21,22,25,30 However, most methods employed in the literature are for medical data sharing or publishing, and they lack viable and practical approaches to comprehensively protect patient privacy from the beginning of HIS data transmission to the end of researchers’ data manipulations.
Data encryption
The general approaches used to protect or to de-identify person-specific confidential and sensitive information include data encryption and data pseudonymization.7,19–21 As our study focuses on encrypting data to protect privacy, we only briefly discuss data encryption methods in the following. Data encryption can protect sensitive data from unauthorized access. Two widely used data encryption methods in the literature are symmetric block ciphers and public key cryptography. A symmetric block cipher uses a substitution–permutation network to encrypt a fixed-size block of data with a predefined encryption key. The security of a block cipher depends on the length of the encryption key, and the time complexity of a brute force attack against a cipher with a 256-bit key is $O(2^{256})$.
Methods
Message-level encryption
Since the patient data such as name, IDN, and other information are very critical and sensitive, we need to provide an end-to-end protection for them. The name and IDN must be encrypted at rest and in-transit, starting from the upload server of the hospital HIS system to the backend system in the CDW. Encrypting data at rest can effectively prevent sensitive data from being accessed by unauthorized users. Even if the malicious third party gained access to the electronic media, the data they receive are cipher-texts, whose original meaning cannot be revealed without proper cryptographic algorithms and encryption keys. Encrypting data in-transit can protect the data from eavesdropping when data are transmitted along wired or wireless communication channels. Without correct cryptographic algorithms and keys, the malicious third party cannot recover the original plain-text from the cipher-text in a reasonable amount of time.
Encrypting data at rest can be implemented by a variety of techniques at different layers, from lower level hard drive encryption to higher level application layer data encryption. Although hard drive encryption is an easy solution, we chose not to apply it in the CDW for two reasons. The first reason is that if all data on the hard disk are encrypted, they have to be decrypted whenever they are read by the applications, which results in a significant reduction in performance. The second reason is that hard-disk encryption can only protect the data from being breached in some special cases, such as when the system is powered off and the hard drives are stolen. When the system is up and running, the data on the hard drive are decrypted automatically and become clear-text to end users, including malicious users who break into the system, which defeats the whole purpose of data encryption.
Another solution is to use database layer encryption technologies to enforce data encryption at rest. The advantage of this solution is its ease of use and user transparency, which means existing applications can run smoothly without code changes. However, a big disadvantage of this solution is that it is limited to a particular database product, which makes interoperability between different products a big problem. Another issue is the limited cryptographic key management capability of database column-level encryption solutions, because existing solutions ordinarily use only one key for each table or for each column. If one key is compromised, all the data protected by that key are exposed. The third problem is that database encryption can be applied only to structured data. However, there are large volumes of unstructured data, such as XML files, in the HISs. To enforce data encryption for these unstructured data, we need other technologies. Therefore, instead of using hard-disk encryption or database-level encryption, we chose to encrypt data at the message layer, which effectively addresses these issues.
Encrypting data in-transit can also be enforced at different layers. The widely adopted network layer security protocol IPsec and the transport layer protocol SSL/TLS (Secure Sockets Layer/Transport Layer Security) each provide a secure communication channel that is transparent to end users. However, both techniques encrypt all the data transmitted through the channel, which consumes considerable computing resources, and both require considerable effort to set up.
When designing the security architecture of the CDW, we adopted an application-level data encryption solution to protect data at rest and in-transit at the same time. All the data in hospital HISs are first classified into two classes, private or public. Private data are those that contain sensitive information such as patient name, IDN, and so on, and they need to be encrypted at rest and in-transit. Public data are those that do not contain sensitive, personal information and can be accessed freely by a third party; they can be stored on physical devices and transmitted in clear-text format. Most of the clinical data are classified as public data and thus can be stored and transmitted in clear-text, which saves a lot of computing resources. Only a small portion of the clinical data is classified as private data and must be encrypted at rest and in-transit. Figure 2 shows the security architecture of the PKUCDW. The private data are encrypted at the data source at one end and can be decrypted at the other end, based on the end user’s security role and policy. Specifically, an end user such as a CDW developer who lacks the right to access decrypted data works only with the encrypted data. The same rule applies to end users of CDW data.

Figure 2. Security architecture of PKUCDW.
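The private/public classification described above can be illustrated with a minimal sketch. The field names and the `split_record` helper are hypothetical; the text names only patient name and IDN as examples of private data, and the actual classification rules of the PKUCDW are not given here.

```python
from typing import Any, Dict, Tuple

# Assumption: private fields are identified by name. Only "name" and "idn"
# come from the text; a real deployment would use its own field list.
PRIVATE_FIELDS = {"name", "idn"}


def split_record(record: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Partition one record into (private, public) parts by field name."""
    private = {k: v for k, v in record.items() if k in PRIVATE_FIELDS}
    public = {k: v for k, v in record.items() if k not in PRIVATE_FIELDS}
    return private, public
```

Only the first part would then go through encryption, while the second part is stored and transmitted as clear-text.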
Data encryption process
The following pseudocode (Algorithm 1) is the encryption algorithm, which is used by the hospital upload servers. The input is private data and the output is the cipher-text. Encryption key generation and protection happen automatically behind the scenes and are transparent to end users.
Algorithm 1: Encrypt Private Data Into Cipher-Text
All sensitive data are encrypted at the upload server of the hospital HIS, using the Advanced Encryption Standard (AES; Federal Information Processing Standards (FIPS) 197) algorithm with a 256-bit session key. 32 The session key is generated at run time by a cryptographic hashing algorithm such as secure hash algorithm (SHA)-256 and is therefore unique for each session. We chose SHA-256 instead of the still widely used message digest 5 (MD5) and SHA-1 algorithms because of the known security attacks against these two algorithms.26,27 The session key is further encrypted by the RSA algorithm with the 2048-bit RSA public key of the CDW server. This encrypted session key is appended to the cipher-text, and a digest of these two fields is generated by the SHA-256 algorithm. This digest is signed by the RSA algorithm with the 2048-bit private key of the hospital HIS. Each hospital has its own private RSA key, which is unique within the system.
A cryptographic hashing function is used to compute the digest of the session key, which is appended to the message for caching purposes. The last mandatory step is to use the cyclic redundancy check (CRC)-32 algorithm to calculate a checksum over all the data, including the cipher-text, the encrypted session key, the digital signature, and the session key hash value. An optional final step is to transform the whole message into Base64 format, which is easier to transmit as ordinary text. Figure 3 shows the detailed data flow and data format of the cipher-text generation.

Figure 3. Client side private data encryption.
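The message assembly described above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the authors' implementation: the AES cipher-text, the RSA-encrypted session key, and the RSA signature are taken as opaque byte strings produced elsewhere, and the 4-byte length prefix on each field is a hypothetical framing choice, since the text does not say how fields are delimited.

```python
import base64
import hashlib
import struct
import zlib


def _field(data: bytes) -> bytes:
    # Hypothetical framing: 4-byte big-endian length, then the field bytes.
    return struct.pack(">I", len(data)) + data


def digest_to_sign(cipher_text: bytes, enc_session_key: bytes) -> bytes:
    """SHA-256 digest over the cipher-text followed by the encrypted session key."""
    return hashlib.sha256(cipher_text + enc_session_key).digest()


def pack_message(cipher_text: bytes, enc_session_key: bytes, signature: bytes,
                 session_key: bytes, use_base64: bool = True) -> bytes:
    """Join the four fields, append a CRC-32 checksum, optionally Base64-encode."""
    key_hash = hashlib.sha256(session_key).digest()  # kept for key caching
    body = b"".join(_field(f) for f in
                    (cipher_text, enc_session_key, signature, key_hash))
    crc = zlib.crc32(body) & 0xFFFFFFFF
    message = body + struct.pack(">I", crc)
    return base64.b64encode(message) if use_base64 else message
```

In this sketch the signature passed to `pack_message` would be the RSA signature over the value returned by `digest_to_sign`, computed with the hospital's private key.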
Data decryption process
Cipher-texts received from the client are stored on hard disks in the CDW server. Algorithm 2 is applied at the application server on the CDW side, which is the other end of the data flow. Here, the decryption key is retrieved based on the end user’s role, and plain-text is recovered only for valid users.
Algorithm 2: Decrypt Cipher-Text Into Private Data
Upon receiving the cipher-text from the client (one affiliated hospital’s HIS upload server), if the message was coded in Base64, the server transforms the message back into binary form. Then the CRC-32 algorithm is applied to the received message, excluding the trailing CRC portion, to generate a new checksum CRC′. The newly calculated CRC′ is compared against the received CRC to check whether the message was modified during transmission. If the checksum test passes, the RSA verification algorithm is applied to the client’s signature to recover the original digest of the cipher-text and encrypted session key. The SHA-256 algorithm is used to calculate a new digest of the received cipher-text and encrypted session key. The newly generated digest is compared to the original digest to check whether the received cipher-text and encrypted session key are the original ones sent from the client. Figure 4 shows the data verification phase of the decryption process.

Figure 4. Server side data verification.
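The two checks in the verification phase can be sketched as follows. The sketch assumes the CRC-32 value occupies the last four bytes of the message in big-endian order (a hypothetical layout), and it takes the digest recovered from the RSA signature as an input, because RSA verification is outside the Python standard library.

```python
import base64
import hashlib
import hmac
import struct
import zlib


def check_crc(message: bytes, is_base64: bool = True) -> bytes:
    """Return the message body if the trailing CRC-32 matches, else raise."""
    raw = base64.b64decode(message) if is_base64 else message
    body, received_crc = raw[:-4], struct.unpack(">I", raw[-4:])[0]
    if zlib.crc32(body) & 0xFFFFFFFF != received_crc:
        raise ValueError("CRC-32 mismatch: message changed during transmission")
    return body


def digest_matches(cipher_text: bytes, enc_session_key: bytes,
                   recovered_digest: bytes) -> bool:
    """Compare a fresh SHA-256 digest with the one recovered from the signature."""
    new_digest = hashlib.sha256(cipher_text + enc_session_key).digest()
    return hmac.compare_digest(new_digest, recovered_digest)
```

The CRC-32 test catches accidental corruption cheaply before the more expensive signature check, while the digest comparison is what ties the cipher-text and encrypted session key to the signing hospital.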
After the CRC test and digital signature verification, the RSA decryption algorithm is applied to the encrypted session key, with the private key of the CDW server, to recover the original AES-256 session key. This key is then used by the AES-256 decryption algorithm to decrypt the cipher-text and recover the original private data. Figure 5 shows the data decryption phase. The session key hash can be used to greatly improve the performance of bulk data processing, where large amounts of private data must be decrypted in a batch process. When the system runs in this mode, each already decrypted session key and its hash value are stored in memory, and when a new cipher-text is received, its session key hash is looked up in the in-memory cache to find the original session key. The cache is implemented as a hash table, whose search time complexity is $O(1)$ on average.

Figure 5. Server side data decryption.
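The key caching idea can be sketched with a Python dictionary, which is a hash table with average constant-time lookup. The class name `SessionKeyCache` and the `rsa_decrypt` callable are hypothetical; the callable stands in for RSA decryption with the CDW server's private key, which is not shown.

```python
import hashlib
from typing import Callable, Dict


class SessionKeyCache:
    """In-memory map from SHA-256(session key) to the decrypted session key."""

    def __init__(self, rsa_decrypt: Callable[[bytes], bytes]) -> None:
        self._rsa_decrypt = rsa_decrypt      # assumed wrapper around RSA decryption
        self._keys: Dict[bytes, bytes] = {}
        self.misses = 0                      # how many RSA decryptions were needed

    def get(self, key_hash: bytes, enc_session_key: bytes) -> bytes:
        key = self._keys.get(key_hash)
        if key is None:
            # Cache miss: pay for one RSA decryption, then remember the result.
            self.misses += 1
            key = self._rsa_decrypt(enc_session_key)
            if hashlib.sha256(key).digest() != key_hash:
                raise ValueError("session key does not match its hash")
            self._keys[key_hash] = key
        return key
```

In a batch where many cipher-texts share a session key, only the first message triggers the RSA operation; later messages are served from the dictionary.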
Cryptographic key management
Algorithm 3 is used to retrieve the server’s private key from the key server. Note that the private keys of the CDW application server and the hospital HIS upload servers are never stored in plain-text on hard disks. Each private key is encrypted by the AES-256 algorithm with a predefined server side key-encryption key. The resulting cipher-text is then split into two parts: the first part is stored on one key server, and the second part is stored on another key server with privileged access rights. These two parts are transmitted through secure communication channels from the two key servers and are concatenated into one piece on the central key server. Then the AES-256 decryption algorithm is used to decrypt the cipher-text and recover the private key. Figure 6 shows the detailed data flow of the server’s private key retrieval.
Algorithm 3: Retrieve The Server’s Private Key

Figure 6. Server side private key retrieval.
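The split-and-rejoin step of the key retrieval can be sketched as below. Splitting at the midpoint is an assumption, since the text does not say where the cipher-text is divided, and the function names are hypothetical. The AES-256 encryption and decryption of the private key happen before and after these steps and are not shown.

```python
from typing import Tuple


def split_encrypted_key(encrypted_key: bytes) -> Tuple[bytes, bytes]:
    """Split the AES-encrypted private key into two parts for two key servers."""
    mid = len(encrypted_key) // 2  # assumption: split at the midpoint
    return encrypted_key[:mid], encrypted_key[mid:]


def join_encrypted_key(part_one: bytes, part_two: bytes) -> bytes:
    """Concatenate the two parts on the central key server before decryption."""
    return part_one + part_two
```

Neither key server holds a complete cipher-text, so both servers and the key-encryption key are needed before the private key can be recovered.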
Apart from the data encryption technology used to protect patients’ private information, a cryptographic hashing algorithm (SHA-256) is applied to each private data item to generate a digest for indexing and cross-referencing. Since SHA-256 produces a 256-bit digest, the probability of a collision is extremely small, which makes the approach practical for ordinary applications.
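A minimal sketch of such an index digest is shown below. The function name `index_digest`, the UTF-8 encoding, and the whitespace stripping are assumptions made for illustration; the text does not describe how values are normalized before hashing.

```python
import hashlib


def index_digest(private_value: str) -> str:
    """Return the SHA-256 hex digest used as an index key for a private value."""
    # Assumption: values are stripped and UTF-8 encoded so that the same value
    # arriving from different hospitals yields the same digest.
    normalized = private_value.strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Because the same input always yields the same digest, records for one patient can be joined on the digest column without decrypting the underlying field. By the birthday bound, the probability of any collision among $n$ distinct inputs is roughly $n^2 / 2^{257}$.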
The novelty of our privacy protection model lies in the message-level data encryption technology designed into our CDW system, which protects data at rest and in-transit at the same time and, owing to the design of the cryptographic key caching system, causes only a minor reduction in the performance of the whole system. Another novel design is our cryptographic key management system, which utilizes the separation-of-knowledge technique to reduce the attack surface and improve the overall security level of the system.
Results
On a test machine with a 3.1-GHz central processing unit (CPU) and 16 GB of memory, we ran the application 10 times to test the decryption performance. Table 1 shows the performance test results. The first column shows the decryption speed in megabits per second. The second column shows the time, in milliseconds, to read 10,000 medical records of 20 kB each. The third column shows the time, in milliseconds, to read 10,000 records of the same size in which each record contains some cipher-text fields. The average decryption speed is 141.74 Mbps. The average time to read records without cipher-text is 659.4 ms. The average time to read records containing cipher-text is 693.8 ms. The encryption/decryption process added to the workflow therefore costs a performance penalty of about 5% ((693.8 − 659.4)/659.4 ≈ 5.2%).
Table 1. Performance test results of the encryption system.
Conclusion
The security of the confidential data relies upon the encryption of the session key by the CDW’s public key. Recovering the session key without the CDW’s private key is believed to be as hard as solving the integer factorization problem, which is a well-known hard mathematical problem. The fastest known classical algorithm for this problem is the general number field sieve (GNFS) algorithm, whose heuristic time complexity for factoring an integer $n$ is $\exp\left(\left(\sqrt[3]{64/9} + o(1)\right)(\ln n)^{1/3}(\ln \ln n)^{2/3}\right)$.
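To give a rough sense of scale, the standard heuristic GNFS running-time expression can be evaluated for the 2048-bit RSA modulus used here. The sketch below drops the $o(1)$ term, which is an assumption made only for illustration, so the result is an order-of-magnitude estimate rather than a precise operation count. The function name `gnfs_work_bits` is hypothetical.

```python
import math


def gnfs_work_bits(modulus_bits: int) -> float:
    """log2 of exp((64/9)^(1/3) * (ln n)^(1/3) * (ln ln n)^(2/3)), o(1) dropped."""
    ln_n = modulus_bits * math.log(2)  # ln n for n close to 2**modulus_bits
    exponent = (64 / 9) ** (1 / 3) * ln_n ** (1 / 3) * math.log(ln_n) ** (2 / 3)
    return exponent / math.log(2)      # convert from base e to base 2
```

Under these assumptions, `gnfs_work_bits(2048)` comes to roughly 117, that is, on the order of $2^{117}$ operations to factor a 2048-bit modulus.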
Meanwhile, patient privacy protection is a complex social issue, which involves policy-making, technology, and psychology. The focus of our study is on technological solutions, especially data encryption techniques. The composite privacy protection model ensures that the data each end user acquires is secure.
In China, few researchers in computer sciences, medical informatics, or statistics are doing research on privacy protection for medical data. Although our CDW project is small in scope compared to the larger initiatives at the national or international level, we do hope that the data privacy model designed for the PKUCDW can provide some useful ideas and insight for projects of larger scope and stimulate more research on advanced patient privacy protection techniques.
Footnotes
Funding
This research was supported by a grant funded by the Ministry of Education of China under Grant No. 13YJC630066. The research was also supported by a grant from the National Natural Science Foundation of China (NSFC) under Grant No. 81301296.
