Validation of the transformed Clinical Practice Research Datalink (CPRD) GOLD and Aurum data into the OMOP Common Data Model

Abstract

Objective: To assess the transformation of the UK Clinical Practice Research Datalink (CPRD) databases into the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) version 5.3.1. Methods: A systematic approach was used to generate medical code lists and compare prevalent and incident event counts between the source and OMOP CDM versions. Results: For the CPRD General Practitioner Online Database (GOLD), 89.5% of clinical events had no or very small differences in prevalent and incident event counts between the two versions of the database. The differences for CPRD Aurum were even smaller, with 97.4% of events showing no or very small differences in counts between the source and OMOP versions. Some observed discrepancies were due to codes being mapped into different tables. Conclusion: The study findings confirm the consistency of the OMOP transformation and provide confidence in analyses that query CPRD OMOP-transformed data.

Keywords

CPRD Aurum, CPRD GOLD, OMOP, medical codes

Introduction

Within the healthcare industry, robust and generalisable real-world evidence (RWE) is increasingly required by decision-makers, such as regulators, payers and Health Technology Assessment (HTA) bodies.^1,2 RWE studies often need multiple databases to answer research questions that require large sample sizes and diverse study populations.³ Several networks have been designed as platforms to carry out such multi-database studies. Examples include the Food and Drug Administration (FDA) Sentinel Initiative, the Patient-Centered Outcomes Research Network (PCORnet), the European Union Adverse Drug Reactions (EU-ADR) project, the Vaccine Safety Datalink and the Data Analysis and Real-World Interrogation Network (DARWIN EU).^4–10

To harmonise analysis across databases, Common Data Models (CDMs) have been developed to standardise table structures, variable names and the definitions of key concepts.¹¹ The Observational Medical Outcomes Partnership (OMOP) CDM standardises the structure and content of observational data to enable efficient analyses that can produce reliable evidence.¹² What distinguishes this standardisation method from other approaches (e.g. Sentinel/PCORnet) is the use of common vocabularies to which the different coding systems within the source databases are mapped (e.g. for conditions, ICD-9, ICD-10 or Read codes are mapped to the common SNOMED-CT coding format).¹³ If codes from different vocabularies that represent the same diseases, products, etc. are mapped to a common, standard vocabulary, a single set of cohort definitions and analysis scripts can be built and applied unchanged to every participating database.

Accurate transformation of source databases into a CDM is essential for maintaining data integrity. Most peer-reviewed studies assessing data transformation from source data to a CDM have been conducted using claims, Electronic Health Records (EHR) or registry data in the US, and the results have largely demonstrated good mapping between the source and transformed data.^14–18 However, relatively few studies have assessed European data sources.^19–22 A study assessing the transformation of the UK's The Health Improvement Network (THIN) data to the OMOP CDM found that information loss occurred due to incomplete mapping of medical and drug codes as well as limitations in the data structure of the OMOP CDM.¹⁹ In contrast, a study in the UK Clinical Practice Research Datalink (CPRD) General Practitioner Online Database (GOLD) deemed all elements of the OMOP CDM transformation to be of high quality (99.9% of database condition records and 89.7% of database drug records were mapped, and most unmapped drugs were devices or over-the-counter products).²¹ Given these contrasting findings and the lack of a formal evaluation of the CPRD Aurum OMOP transformation, there is value in conducting an additional assessment of CPRD using a recent version of the OMOP CDM. The aim of this study was to compare the medical codes in the source (non-transformed) CPRD GOLD and CPRD Aurum databases with the source medical codes retained within the OMOP-transformed data.

Methods

This is a methodological validation study designed to assess the transformation of CPRD GOLD and CPRD Aurum EHR databases into the OMOP CDM. The objective was to evaluate whether key clinical concepts (such as diagnoses, laboratory results, prescriptions and vaccinations) retained consistency in event counts after OMOP transformation. The study does not involve hypothesis testing, clinical outcome assessment, or inference about treatment effects.

Data source

The CPRD primary care electronic records have been used for research for over 30 years. CPRD databases are among the most thoroughly described and validated primary care databases in the world, with over 3,500 peer-reviewed publications (https://www.cprd.com/bibliography). The UK National Health Service (NHS) provides universal health coverage. Patients are excluded from the CPRD databases if they choose to opt out or if their General Practitioner (GP) practice does not consent to be included.²³ The study was conducted using the CPRD GOLD and CPRD Aurum databases. The two databases include GP practices that use different patient management software (Vision for CPRD GOLD and EMIS for CPRD Aurum).²⁴ The CPRD GOLD database was established in 1987 and includes over 21 million historical patients in England, Wales, Scotland and Northern Ireland. The geographical distribution of the ∼3 million active patients has changed, leaving very few practices in England.²⁵ Within the Vision system, diagnoses and other non-prescription data are recorded using the Read coding system and prescriptions are coded using Gemscript. CPRD Aurum was launched in 2017 and includes ∼45 million patients from 1987 onwards, of whom >15 million are active patients.²⁶ It includes GP practices in England and Northern Ireland. For diagnoses and other non-prescription data, the EMIS system uses a combination of SNOMED CT (UK edition), Read Version 2 and local EMIS Web® software-specific codes that have been cross-mapped to a single code dictionary by NHS Digital. Prescriptions are coded using Dictionary of Medicines and Devices (dm+d) codes, which are a subset of the SNOMED CT terminology. Some of the characteristics of the GP systems and the CPRD databases have been described previously.^24,27–29

Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM)

The OMOP CDM is managed by the Observational Health Data Sciences and Informatics (OHDSI) community. As of October 2024, 544 data sources across 54 countries have been converted into the OMOP CDM.^12,30 Within the OMOP CDM, clinical events are expressed as concepts, which represent the semantic notion of each event. The concepts cover any event related to patient experience (e.g. conditions, procedures, drug exposures, etc.) as well as administrative information (e.g. visits, care sites, etc.). Each standard concept has a concept id (concept_id) and is assigned to a domain (e.g. “Condition”, “Drug”, “Procedure”, “Visit”, “Device”, “Specimen”, etc.), which determines the CDM table and field in which a clinical event or event attribute is recorded.³¹ Records from the source database are mapped from the original table to the table in the OMOP version of the database that corresponds to the domain of the standard concept. This mapping is described in the Vocabulary tables that are an integral part of the OMOP CDM. A detailed online browser of the OMOP Vocabularies is also available.^32,33
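As a rough illustration of how a concept's domain determines the destination table, the sketch below routes records by domain. The table names follow the OMOP CDM v5.3 conventions; the function name `route_record`, the sample codes and the fallback to the observation table for records without a standard concept are illustrative assumptions and do not reproduce the ETL used in this study.

```python
# Domain-to-table routing, using OMOP CDM v5.3 table names.
DOMAIN_TO_TABLE = {
    "Condition": "condition_occurrence",
    "Drug": "drug_exposure",
    "Procedure": "procedure_occurrence",
    "Measurement": "measurement",
    "Observation": "observation",
    "Device": "device_exposure",
    "Visit": "visit_occurrence",
    "Specimen": "specimen",
}


def route_record(domain_id):
    """Return the CDM table for a concept's domain.

    Records with no or an unknown domain fall back to the observation
    table (an assumption made for this sketch).
    """
    return DOMAIN_TO_TABLE.get(domain_id, "observation")


# Hypothetical records: (source code, domain of the mapped standard concept).
records = [
    ("code_a", "Condition"),
    ("code_b", "Measurement"),
    ("code_c", None),  # no standard concept found
]
routed = {code: route_record(domain) for code, domain in records}
print(routed)
```

Because one source table can contain codes from several domains, a single source table may feed several CDM tables under this kind of routing.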

The transformation of the CPRD databases into the OMOP CDM was performed using an Extract, Transform, Load (ETL) process, which describes how the data are systematically converted into the standardised OMOP structure. For this study, the CPRD databases were transformed into version 5.3.1 of the OMOP CDM by Odysseus Data Services. The databases containing data up to 31st December 2019 were used.

The COde list DEvelopment and eXploration (CODEX) approach for generating medical code lists

A novel methodology, termed CODEX, is introduced here to systematically generate comprehensive medical code lists. This approach was applied to produce code lists for 12 diseases, 9 medications, 10 laboratory measures, 2 lifestyle measures, 3 vaccinations and 3 procedures. For specific measures such as body mass index (BMI), smoking status and laboratory values, the presence of a recorded value (and not the actual measurement) was used. The CODEX approach can be summarised in the following three steps.

• Step 1: Relevant medical terms were searched for (using combinations of “OR” and “AND” logical operators) within the database dictionaries, producing a “broad code list”. Only medical codes that appeared in the database at least once were included.

• Step 2: Each medical code description within the “broad code list” was reviewed by a researcher line-by-line to determine inclusion in the final list based on clinical relevance.

• Step 3: A second researcher independently repeated Step 2 to ensure consistency. Any discrepancies were discussed before a final decision on the classification was made.

A complete record of the audit trail was maintained to make the process fully transparent and reproducible. This process was carried out separately to generate medical code lists for CPRD GOLD and CPRD Aurum. A comprehensive description of the CODEX process will be provided in a forthcoming publication.
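Steps 2 and 3 amount to comparing two independent include/exclude decisions per code and setting aside the disagreements for discussion. A minimal sketch of that comparison is given below; the function name `reconcile`, the data layout and the codes "A1" to "A3" are hypothetical and are not taken from the CODEX implementation.

```python
def reconcile(reviewer_1, reviewer_2):
    """Split codes into agreed decisions and discrepancies.

    Each argument maps a medical code to True (include) or False (exclude).
    """
    agreed, discrepancies = {}, []
    for code in sorted(reviewer_1.keys() | reviewer_2.keys()):
        first, second = reviewer_1.get(code), reviewer_2.get(code)
        if first is not None and first == second:
            agreed[code] = first
        else:
            # Differing or missing decisions are set aside for discussion.
            discrepancies.append(code)
    return agreed, discrepancies


# Hypothetical decisions from two reviewers on a three-code broad list.
reviewer_1 = {"A1": True, "A2": False, "A3": True}
reviewer_2 = {"A1": True, "A2": True, "A3": True}

agreed, discrepancies = reconcile(reviewer_1, reviewer_2)
provisional_list = [code for code, keep in agreed.items() if keep]
print(provisional_list, discrepancies)
```

Storing both reviewers' decisions alongside the final classification is one simple way to keep an audit trail of the kind described above.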

As an example, for Peripheral Arterial Disease (PAD), the following search terms were used: “PAD” OR (“peripheral” AND “arterial” AND “disease”) OR (“peripheral” AND “vascular”) OR (“peripheral” AND “angiopath”) OR “claudication” OR (“peripheral” AND “ischaemia”) OR (“peripheral” AND “ischaemic” AND “disease”) OR (“ischaemia” AND “leg”). The search terms were matched anywhere within the corresponding MEDCODE code description, and a “broad code list” of 2,578 codes was generated from the CPRD Aurum database. After careful review of these codes by two researchers independently, 50 codes were included in the final list (Table 1).
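The PAD search expression can be read as an OR over groups of AND-ed terms. The sketch below applies it to a small dictionary, assuming case-insensitive substring matching; the helper names and the two codes prefixed "demo" are hypothetical, while the other three entries are taken from Table 1.

```python
# PAD search expression: an OR over groups of AND-ed terms.
PAD_QUERY = [
    ["PAD"],
    ["peripheral", "arterial", "disease"],
    ["peripheral", "vascular"],
    ["peripheral", "angiopath"],
    ["claudication"],
    ["peripheral", "ischaemia"],
    ["peripheral", "ischaemic", "disease"],
    ["ischaemia", "leg"],
]


def matches(description, query):
    """True if any AND-group has all of its terms in the description."""
    text = description.lower()
    return any(all(term.lower() in text for term in group) for group in query)


def broad_code_list(dictionary, query):
    """Return the codes whose description matches the query."""
    return [code for code, description in dictionary.items()
            if matches(description, query)]


dictionary = {
    "105536013": "Intermittent claudication",
    "742481000006118": "Ischaemia of legs",
    "1847121000006114": "Peripheral arterial disease",
    "demo-1": "Knee pad supplied",      # hypothetical; matches "PAD" only
    "demo-2": "Asthma annual review",   # hypothetical; no match
}
broad = broad_code_list(dictionary, PAD_QUERY)
print(broad)
```

Substring matching deliberately over-selects (the hypothetical "Knee pad supplied" entry is caught by “PAD”), which is consistent with a broad list that is much larger than the final list and is narrowed by manual review.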

Table 1.

Medcodes selected for peripheral arterial disease.

Medcodes	Code description
970921000006119	DNA - Did not attend peripheral vascular disease clinic
906771000006113	(RFC) peripheral vascular disease
8014491000006111	Peripheral vascular disease due to type I diabetes
742481000006118	Ischaemia of legs
741131000000114	Peripheral vascular disease monitoring first letter
741071000000112	Peripheral vascular disease monitoring invitation
7115581000006111	Peripheral vascular congenital anomaly
6850081000006110	Did not attend peripheral vascular disease clinic
6633851000006110	PVD-peripheral vascular disease
6632261000006118	PVD - peripheral vascular disease
6632231000006110	PAOD - peripheral arterial occlusive disease
6632221000006112	Peripheral arterial occlusive disease
6456621000006111	Peripheral vascular resistance
5974641000006112	History of peripheral vascular disease procedure
580401000006115	Congenital anomaly of peripheral vascular system OS
5057981000006111	Peripheral ischemic vascular disease
451376017	H/O: Peripheral vascular disease procedure
4392201000006111	Diabetic peripheral vascular disease
411512011	Claudication
375462013	Claudication distance
3532661000006118	IC - intermittent claudication
350535018	Peripheral ischaemic vascular disease
350533013	Peripheral ischaemia
3285181000006112	Diagnostic ultrasound of peripheral vascular system
3285171000006114	Ultrasonography of peripheral vascular system
3285161000006119	Echography of peripheral vascular system
3285141000006118	Ultrasound peripheral vascular flow study
325067010	Peripheral vascular complications of care NOS
325063014	Peripheral vascular complications of care
313563016	Peripheral vascular system anomaly NOS
300515011	Peripheral vascular disease NOS
2537052019	DNA - Did not attend peripheral vascular disease clinic
235911000006116	Peripheral vascular disease NOS
2231911000000110	Peripheral vascular disease monitoring third letter
2231891000000112	Peripheral vascular disease monitoring second letter
216185010	Peripheral vascular disease monitoring
1849551000006115	No peripheral vascular disease symptoms
1847121000006114	Peripheral arterial disease
1823971000006118	Peripheral arterial disease confirmed
1715241000006112	Vascular claudication
1696241000006111	Peripheral vascular disease monitoring third letter
1696231000006118	Peripheral vascular disease monitoring second letter
1696221000006116	Peripheral vascular disease monitoring first letter
1696211000006112	Peripheral vascular disease monitoring administration
1696201000006114	Peripheral vascular disease annual review
1672661000006115	Edinburgh claudication questionnaire
12729241000006119	Peripheral vascular disease NOS
12335991000006117	Peripheral vascular disorder due to diabetes mellitus
12335981000006115	Peripheral vascular disorder co-occurrent and due to diabetes mellitus
105536013	Intermittent claudication

Analytical methods

Prevalent and incident event counts for the selected diseases, medications, laboratory records, lifestyle measures, vaccinations and procedures/tests were compared between the source and OMOP versions of the CPRD databases. For example, for the “diagnoses” domain we compared counts of incident acute myocardial infarctions (identified via Read code sets). For “vaccinations” we evaluated influenza and Measles, Mumps, Rubella (MMR) vaccine administrations. The year 2019 was selected as the reference year to calculate incident and prevalent cases. To identify incident events in 2019, we screened all records prior to 1st January 2019 to confirm that no earlier occurrence of each event (diagnosis, medication, etc.) existed. For prevalence on 1st January 2019, we reviewed every record before that date to determine whether the event had already occurred and should therefore be counted as prevalent. For each of the 38 clinical concepts, the number of prevalent and incident events was calculated in the source CPRD data and in the OMOP-formatted data. The absolute difference in counts between the two datasets was then derived and expressed as a percentage of the source CPRD count. The formula for the difference can be written as:

$$\%\ \text{difference} = \frac{\left|\text{count}_{\text{native}} - \text{count}_{\text{OMOP}}\right|}{\text{count}_{\text{native}}} \times 100$$

This procedure was performed separately for prevalent and incident measures.
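The incident and prevalent definitions and the percentage difference can be sketched as follows. This is an illustration under stated assumptions rather than the SAS code used for the study: the function names are hypothetical, each patient is represented simply by the list of dates on which a listed code was recorded, and the worked example uses the incident PAD counts for CPRD GOLD from Table 3.

```python
from datetime import date

YEAR_START = date(2019, 1, 1)
YEAR_END = date(2019, 12, 31)


def classify(event_dates):
    """Classify one patient's records for one clinical concept.

    Returns "prevalent" if any record precedes 1 January 2019,
    "incident" if the first-ever record falls within 2019,
    and None if there is no qualifying record.
    """
    if not event_dates:
        return None
    first = min(event_dates)
    if first < YEAR_START:
        return "prevalent"
    if first <= YEAR_END:
        return "incident"
    return None


def percent_difference(count_native, count_omop):
    """Absolute difference as a percentage of the source (native) count."""
    return abs(count_native - count_omop) / count_native * 100


# Incident PAD counts in CPRD GOLD (source 2,035; OMOP 1,973).
pad_difference = round(percent_difference(2035, 1973), 1)
print(pad_difference)
```

Note that Table 3 also shows the sign of each difference (negative where the OMOP count exceeds the source count), which the absolute value in the formula discards.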

For the analyses, the source code lists were used (instead of concept sets created from standardised codes) in both the source and the OMOP-transformed versions of the databases. Specific tables were used to identify events in the source and OMOP databases. For example, for disease diagnoses, the clinical and referral tables were searched in the CPRD GOLD database. For the OMOP CPRD GOLD database, the same codes were sought within the condition, procedure and observation domains. A summary of the datasets and domains used to search for codes in the CPRD GOLD, CPRD Aurum and OMOP databases is given in Table 2.

Table 2.

Datasets/domains used to search for codes, by database.

	Databases
Type of code	CPRD GOLD	CPRD Aurum	OMOP^a
Diagnosis	Clinical, referral	Observation	Condition, procedure, observation
Drug	Therapy	Drug_issue	Drug exposure, device exposure
Lab or test	Test	Observation	Measurement, observation
Vaccination	Immunisation, clinical, referral	Drug_issue	Drug exposure, device exposure
Lifestyle	Additional	Observation	Measurement, observation

^aSame domains used for CPRD GOLD and CPRD Aurum.
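Searching one source code list across several OMOP tables can be sketched with an in-memory SQLite database. The table and column names below follow the OMOP CDM v5.3 (`condition_source_value`, `procedure_source_value`, `observation_source_value`); that the source codes are held in these columns, together with the function name, codes and patients, is an assumption made for illustration.

```python
import sqlite3

# OMOP CDM v5.3 tables and source-value columns searched for diagnosis codes.
DIAGNOSIS_TABLES = {
    "condition_occurrence": "condition_source_value",
    "procedure_occurrence": "procedure_source_value",
    "observation": "observation_source_value",
}


def count_patients(conn, source_codes, tables=DIAGNOSIS_TABLES):
    """Count distinct patients with any listed source code in any table."""
    placeholders = ",".join("?" for _ in source_codes)
    selects = [
        f"SELECT person_id FROM {table} WHERE {column} IN ({placeholders})"
        for table, column in tables.items()
    ]
    # UNION removes duplicate person_id values across tables.
    query = f"SELECT COUNT(*) FROM ({' UNION '.join(selects)})"
    params = list(source_codes) * len(tables)
    return conn.execute(query, params).fetchone()[0]


# Toy database with hypothetical codes and patients.
conn = sqlite3.connect(":memory:")
for table, column in DIAGNOSIS_TABLES.items():
    conn.execute(f"CREATE TABLE {table} (person_id INTEGER, {column} TEXT)")
conn.execute("INSERT INTO condition_occurrence VALUES (1, 'X1'), (2, 'X2')")
conn.execute("INSERT INTO observation VALUES (2, 'X1'), (3, 'X1')")

all_tables = count_patients(conn, ["X1"])
condition_only = count_patients(
    conn, ["X1"], {"condition_occurrence": "condition_source_value"})
print(all_tables, condition_only)
```

In this toy example, restricting the search to the condition table finds one of the three patients, which illustrates why Table 2 lists several OMOP domains for each type of code.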

The data analysis for this paper was generated using SAS software, Version 9.4 of the SAS System for LIN X64. Copyright © SAS Institute Inc. SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc., Cary, NC, USA.

Results

Prevalent and incident event counts (defined using source codes) for diagnoses, laboratory records and tests, vaccination records, lifestyle risk factors and medications were compared before and after the OMOP transformation. When comparing the source and OMOP versions of CPRD GOLD, 89.5% (34 out of 38) of the prevalent and incident event count differences were 0.1% or less. For CPRD Aurum, 97.4% (37 out of 38) of the prevalent and incident event count differences were 0.1% or less. Of the five variables with the largest incident event count differences, four had slightly higher counts in the source data than in the OMOP data (Table 3). Of the five variables with the largest prevalent event count differences, three had slightly higher counts in the source data than in the OMOP data (Table 3).
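The summary percentages can be reproduced from the per-concept differences. The sketch below does so for CPRD Aurum, assuming that a concept counts as concordant when the larger of its incident and prevalent absolute differences is 0.1% or less; the function name is hypothetical and the values are read from Table 3.

```python
def share_within_tolerance(differences, tolerance=0.1):
    """Return (count, percentage) of concepts within the tolerance."""
    within = sum(1 for d in differences if abs(d) <= tolerance)
    return within, round(100 * within / len(differences), 1)


# CPRD Aurum, larger of the incident and prevalent |% difference| per concept:
# albumin 1.6; bilirubin, ALP and AST 0.1; the remaining 34 concepts 0.0.
aurum_max_difference = [1.6] + [0.1] * 3 + [0.0] * 34

aurum_summary = share_within_tolerance(aurum_max_difference)
print(aurum_summary)
```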

Table 3.

The incident and prevalent event counts before and after OMOP transformation for the CPRD GOLD and CPRD Aurum databases.

(a) Incident event counts of patients with at least one record of the listed variables in 2019
		CPRD GOLD			CPRD Aurum
		Source	OMOP	% difference	Source	OMOP	% difference
Procedure/tests	DEXA	6,282	6,282	0.0	12,322	12,322	0.0
	AVR	375	375	0.0	2,410	2,410	0.0
	Cardiac rehabilitation	653	653	0.0	2,322	2,322	0.0
Lifestyle	BMI	38,433	38,433	0.0	158,111	158,111	0.0
	Smoking	21,293	21,293	0.0	105,528	105,528	0.0
Vaccinations	MMR	6,600	6,572	0.4	27,322	27,322	0.0
	BCG	257	261	−1.6	911	911	0.0
	Pneumococcal	28,043	27,759	1.0	128,264	128,264	0.0
Labs	Albumin	83,158	83,158	0.0	13,855	13,640	1.6
	Bilirubin	83,961	83,961	0.0	288,897	289,268	−0.1
	ALP	84,184	84,184	0.0	284,799	284,898	0.0
	AST	22,465	22,465	0.0	86,636	86,651	0.0
	GGT	40,161	40,161	0.0	125,949	125,962	0.0
	ALT	97,151	97,151	0.0	301,757	301,786	0.0
	HbA1c	84,897	84,897	0.0	512,483	512,523	0.0
	LDL	61,231	61,231	0.0	191,347	191,351	0.0
	HDL	63,006	63,006	0.0	239,936	239,936	0.0
	Total cholesterol	58,754	58,754	0.0	70,788	70,788	0.0
Medications	Aspirin	16,089	16,089	0.0	73,938	73,938	0.0
	Atenolol	1,267	1,267	0.0	3,650	3,650	0.0
	Bendroflumethiazide	3,160	3,160	0.0	6,203	6,203	0.0
	Furosemide	15,222	15,222	0.0	47,251	47,251	0.0
	Levothyrox	6,620	6,620	0.0	23,307	23,307	0.0
	Omeprazole	66,503	66,503	0.0	213,246	213,246	0.0
	Paracetamol	40,524	40,524	0.0	107,115	107,115	0.0
	Salbutamol	38,207	38,207	0.0	125,017	125,017	0.0
	Simvastatin	3,840	3,840	0.0	8,587	8,587	0.0
Indications	PAD	2,035	1,973	3.0	7,083	7,083	0.0
	TIA	4,993	4,998	−0.1	22,760	22,760	0.0
	HF	12,398	12,400	0.0	27,076	27,076	0.0
	AAA	469	469	0.0	2,898	2,898	0.0
	AMI	4,036	4,036	0.0	23,313	23,313	0.0
	CRC	2,067	2,067	0.0	7,358	7,358	0.0
	GIOP	1	1	0.0	4	4	0.0
	IS	1,866	1,866	0.0	30,218	30,218	0.0
	Melanoma	1,235	1,235	0.0	4,007	4,007	0.0
	MMR	180	180	0.0	1,024	1,024	0.0
	UA	540	540	0.0	1,751	1,751	0.0

(b) Prevalent event counts of patients with at least one record of the listed variables in 2019
		CPRD GOLD			CPRD Aurum
		Source	OMOP	% difference	Source	OMOP	% difference
Procedure/tests	DEXA	75,462	75,462	0.0	89,249	89,249	0.0
	AVR	3,532	3,532	0.0	21,192	21,192	0.0
	Cardiac rehabilitation	6,272	6,272	0.0	21,078	21,078	0.0
Lifestyle	BMI	2,422,867	2,422,867	0.0	8,388,142	8,388,142	0.0
	Smoking	2,404,926	2,404,926	0.0	5,389,729	5,389,729	0.0
Vaccinations	MMR	1,138,345	1,337,956	−17.5	2,635,619	2,635,619	0.0
	BCG	288,463	294,075	−1.9	970,400	970,400	0.0
	Pneumococcal	967,703	966,714	0.1	3,639,429	3,639,429	0.0
Labs	Albumin	1,766,576	1,766,576	0.0	377,101	372,754	1.2
	Bilirubin	1,821,561	1,821,561	0.0	6,873,408	6,868,283	0.1
	ALP	1,831,363	1,831,363	0.0	6,758,196	6,754,458	0.1
	AST	569,217	569,217	0.0	1,486,309	1,484,877	0.1
	GGT	906,305	906,305	0.0	3,040,184	3,038,890	0.0
	ALT	1,747,089	1,737,089	0.6	6,623,395	6,623,070	0.0
	HbA1c	1,944,321	1,944,321	0.0	5,224,877	5,224,725	0.0
	LDL	1,205,742	1,205,742	0.0	4,608,299	4,608,230	0.0
	HDL	1,344,286	1,344,186	0.0	5,411,880	5,411,880	0.0
	Total cholesterol	1,417,813	1,417,813	0.0	1,228,559	1,228,559	0.0
Medications	Aspirin	350,119	350,199	0.0	1,273,074	1,273,074	0.0
	Atenolol	141,812	141,812	0.0	469,160	469,160	0.0
	Bendroflumethiazide	216,182	216,182	0.0	674,053	674,053	0.0
	Furosemide	167,992	167,992	0.0	502,817	502,817	0.0
	Levothyrox	149,064	149,064	0.0	471,984	471,984	0.0
	Omeprazole	953,666	953,666	0.0	2,922,596	2,922,596	0.0
	Paracetamol	1,703,452	1,703,452	0.0	5,376,173	5,376,173	0.0
	Salbutamol	841,875	841,875	0.0	3,106,528	3,106,528	0.0
	Simvastatin	359,574	359,574	0.0	1,162,364	1,162,363	0.0
Indications	PAD	26,511	25,752	2.9	78,272	78,272	0.0
	TIA	54,200	54,223	0.0	171,711	171,711	0.0
	HF	96,247	96,291	0.0	150,627	150,627	0.0
	AAA	3,128	3,128	0.0	18,845	18,845	0.0
	AMI	53,308	53,318	0.0	253,275	253,275	0.0
	CRC	16,031	16,031	0.0	59,015	59,015	0.0
	GIOP	49	49	0.0	334	334	0.0
	IS	19,047	19,049	0.0	395,618	395,618	0.0
	Melanoma	15,738	15,738	0.0	53,548	53,548	0.0
	MMR	1,039	1,039	0.0	5,228	5,228	0.0
	UA	8,549	8,550	0.0	22,899	22,899	0.0

Note. AAA, abdominal aortic aneurysm; ALP, alkaline phosphatase; ALT, alanine aminotransferase; AMI, acute myocardial infarction; AST, aspartate aminotransferase; AVR, aortic valve disease; BCG, Bacille Calmette-Guérin; BMI, body mass index; CRC, colorectal cancer; DEXA, dual-energy X-ray absorptiometry; GGT, gamma-glutamyl transferase; GIOP, glucocorticoid-induced osteoporosis; HDL, high-density lipoprotein; HbA1c, hemoglobin A1c; HF, heart failure; IS, ischemic stroke; LDL, low-density lipoprotein; MMR, measles, mumps and rubella; PAD, peripheral arterial disease; TIA, transient ischemic attack; UA, unstable angina.

The largest percentage differences in incident event counts between the source and OMOP versions of CPRD GOLD were observed for PAD (3.0%) and for the Bacille Calmette-Guérin (BCG) and pneumococcal vaccines (1.6% and 1.0%, respectively). For CPRD Aurum, the greatest incident event count difference was for the albumin blood test result (1.6%) (Table 3). The largest prevalent event count differences between the source and OMOP versions of CPRD GOLD were for the MMR vaccine (17.5%), PAD (2.9%) and the BCG vaccine (1.9%). For CPRD Aurum, the largest prevalent event count difference was observed for the albumin blood test result (1.2%) (Table 3).

The mapping of events from the tables in the source CPRD GOLD database to their respective domains in the OMOP-transformed version is shown in Figure 1. For CPRD GOLD, events from the source “Clinical” table were split between the “Observation”, “Condition Occurrence” and “Procedure Occurrence” tables following the OMOP transformation. Events from the “Additional” table were split between the “Observation” and “Measurement” domains. Events from the “Test” and “Therapy” tables were mainly moved into the “Measurement” and “Drug Exposure” tables, respectively. Events from the “Immunisation” table were also moved into the “Drug Exposure” domain.

Figure 1.

Records (rows of data) for CPRD GOLD before and after OMOP transformation. Note: This figure could not be replicated for CPRD Aurum as there was no direct access to the full database.

Discussion

In this study, the CPRD GOLD and CPRD Aurum databases were transformed (separately) into the OMOP CDM (version 5.3.1). A total of 38 diseases, medications, laboratory records, lifestyle measures, vaccinations and procedures/tests were used to compare the source and OMOP-transformed CPRD databases. The results showed that, for the CPRD GOLD database, 89.5% of events had no or very small differences in prevalent and incident event counts between the source and OMOP versions. CPRD Aurum showed even better alignment, with 97.4% of events showing no or very small differences in counts between the source and OMOP data. When looking at individual domains, diagnoses and prescriptions showed nearly perfect alignment (differences ≤ 0.1%), whereas vaccinations had the largest mapping discrepancies. Laboratory measurements exhibited marginally higher variability compared with other domains. These differences were due to how the corresponding codes were mapped to tables during the OMOP transformation. The results suggest that vaccine codes (and, to a lesser extent, laboratory codes) require extra attention when building OMOP cohorts. Moreover, to mitigate information loss and ensure that no medical codes are missed, it is important that researchers search for relevant codes across all OMOP tables and domains rather than relying on a single-domain search.

The code lists were generated for the purpose of validating the OMOP transformations for this paper. They were developed using a novel methodology that systematically searches within the description associated with each medical code. With this approach, it is crucial to include all potentially valid medical terms associated with the disease, medication, etc., because missing a relevant search term could lead to an incomplete code list and hence an incomplete patient cohort. The advantage of this approach is that it is fully transparent and reproducible. It requires suitable search terms and criteria for deciding whether to include each code within the code list. Its limitation is that it is resource intensive, as it requires manual review of large numbers of codes; however, this can be mitigated by applying exclusion criteria.

The OMOP CDM standardises different structures across disparate data sources into common tables with a single structure, field datatypes and conventions. This re-formatting of the data is designed to avoid information loss. However, the standardisation process may in some cases inadvertently result in data loss. For example, past OMOP transformations excluded patient information recorded before the dates on which GP surgeries started to provide information according to a set of data quality standards that was introduced at the time. Although this earlier information may not have been recorded as consistently or as fully, it is still believed to be accurate and valuable and should therefore be used.

The 38 cohorts were chosen to cover a broad range of clinical events. We believe these events provide a sufficient test for validating the measures derived using the source against the OMOP-transformed CPRD versions. However, these may not be representative of all possible use cases in CPRD data, particularly for conditions captured infrequently (rare events).

Converting CPRD to the OMOP CDM can be challenging because of the unusual data structure of some of the source tables. Our evaluation relied on the quality and transparency of the ETL process provided by the vendor. Moreover, vocabularies such as Read codes are very comprehensive and granular, which means that some source code types may not be mapped to the OMOP Vocabularies.³⁴ However, all Read codes were retained (whether mapped to a concept id or not) and placed in the Observation table.

Although our analysis focused on UK primary-care data, the same principles can be applied to any database transformed to the OMOP CDM, including specialty medicine, hospital or claims data. Researchers working with these data types can follow a similar approach (generating code lists, manually reviewing mappings, and comparing event counts) to evaluate potential information loss. Moreover, because OMOP Vocabularies are country-agnostic, this approach could be extended to healthcare data outside the UK, enabling consistent validation across international datasets. To replicate and validate CPRD OMOP transformations without having direct access to patient-level data, researchers can use high-fidelity synthetic datasets that have been generated by CPRD (available at https://www.cprd.com/synthetic-data).

Studies carried out on multiple databases have become more common due to the increased accessibility of data.^35,36 Data sources transformed into CDMs are often used as they provide analytic efficiencies.^9,37 The aim of this study was to evaluate the success of the OMOP transformation for CPRD, a major EHR database, by comparing the prevalent and incident event/code counts between the source and OMOP-transformed data. The event count estimates were not produced with the same rigour as when investigating diseases/episodes and should not be used for that purpose. Some differences were due to how codes were mapped to different tables, highlighting the importance of searching across all datasets and domains when defining cohorts. Overall, the findings confirm the consistency of output following the OMOP transformation of both CPRD GOLD and CPRD Aurum and provide confidence in analyses conducted using OMOP-transformed data. The availability of CPRD datasets in OMOP format promotes methodological consistency and streamlines multi-database analyses. Citing OMOP transformation evaluations (for CPRD and other databases) is essential for the scientific community to ensure high-quality research. Future research could include validation of other aspects of the OMOP transformation, specifically the mapping of source codes to concepts.

Conclusion

This study evaluated the transformation of the CPRD GOLD and CPRD Aurum databases into the OMOP CDM. A comparison of prevalent and incident event counts across 38 clinical concepts between the source and OMOP-transformed datasets showed a high degree of consistency, particularly for diagnoses and prescriptions. Small discrepancies were observed in certain domains such as vaccinations and laboratory results, largely due to variation in code-to-domain mappings.

These findings support the reliability of CPRD OMOP-transformed data for real-world evidence generation. However, it is important to note that this study assessed only selected aspects of the OMOP transformation process, focusing on event count consistency, and did not evaluate the accuracy of medical code mappings. Comprehensive validation of OMOP-transformed data requires further examination of these additional components. To enhance confidence in the use of OMOP CDM across diverse data sources, it is critical that researchers working with other OMOP-transformed datasets conduct similar validation exercises tailored to their data context. Consistent validation practices across OMOP data sources are essential not only to improve efficiency for multi-database studies, but also to ensure transparency, reproducibility and credibility of real-world evidence.

Footnotes

ORCID iD

George Kafatos

Ethical consideration

This study used data from the Clinical Practice Research Datalink (CPRD). The CPRD data are collected in compliance with relevant legal and regulatory requirements, and studies conducted using CPRD data are approved by the Independent Scientific Advisory Committee (ISAC) for Medicines and Healthcare products Regulatory Agency (MHRA) Database Research (protocol number: 19_044). This study did not require additional ethical approval as it relied solely on anonymized, de-identified data provided by CPRD.

Author contributions

George Kafatos: Conceptualisation, Methodology, Analysis review, Writing. Joe Maskell: Conceptualisation, Methodology, Analysis, Writing review. Olia Archangelidi: Conceptualisation, Methodology, Analysis review, Writing review. David Neasham: Conceptualisation, Methodology, Analysis review, Writing review.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: GK, JM, OA and DN are Amgen Ltd employees and own Amgen Inc shares.

Data Availability Statement

The data used in this study are derived from the Clinical Practice Research Datalink (CPRD) and were transformed into the OMOP Common Data Model. Access to individual patient-level data is not permitted due to contractual agreements with the data provider. Researchers interested in accessing CPRD data should contact CPRD directly to obtain the necessary permissions and licenses.

References

1. Liu, Panagiotakos. Real-world data: a brief review of the methods, applications, challenges and opportunities. BMC Med Res Methodol 2022; 22(1): 287.

2. Longoni, Ward, Bhasin, et al. Real-world practice patterns reveal <1% adoption of recommended genetic testing for inherited cardiomyopathies. American Journal of Preventive Cardiology 2023; 13: 100422–100450.

3. Toh. Analytic and data sharing options in real-world multidatabase studies of comparative effectiveness and safety of medical products. Clin Pharmacol Ther 2020; 107(4): 834–842.

4. Corley, Feigelson, Lieu, et al. Building data infrastructure to evaluate and improve quality: PCORnet. J Oncol Pract 2015; 11(3): 204–206.

5. Daley, Clarke, Glanz, et al. The safety of live attenuated influenza vaccine in children and adolescents 2 through 17 years of age: a Vaccine Safety Datalink study. Pharmacoepidemiol Drug Saf 2018; 27(1): 59–68.

6. Lin PID, Daley, Boone-Heinonen, et al. Comparing prescribing and dispensing data of the PCORnet common data model within PCORnet antibiotics and childhood growth study. EGEMS (Wash DC) 2019; 7(1): 11.

7. Oliveira, Lopes, Nunes, et al. The EU-ADR Web Platform: delivering advanced pharmacovigilance tools. Pharmacoepidemiol Drug Saf 2013; 22(5): 459–467.

8. Peng, Henke, Reinecke, et al. An ETL-process design for data harmonization to participate in international research with German real-world data based on FHIR and OMOP CDM. Int J Med Inf 2023; 169: 104925.

9. Platt, Brown, Robb, et al. The FDA Sentinel initiative - an evolving national resource. N Engl J Med 2018; 379(22): 2091–2093.

10. Schneeweiss, Brown, Bate, et al. Choosing among common data models for real-world data analyses fit for making decisions about the effectiveness of medical products. Clin Pharmacol Ther 2020; 107(4): 827–833.

11. Platt, Brown, et al. How pharmacoepidemiology networks can manage distributed analyses to improve replicability and transparency and minimize bias. Pharmacoepidemiol Drug Saf 2019; 29(13): 3–7.

12. OHDSI. Observational Health Data Sciences and Informatics. Accessed 5 July 2025. Available from: https://ohdsi.org/

13. European Medicines Agency. A common data model for Europe? Workshop report from a meeting held at the EMA, 11-12 December 2017. 2018. https://www.ema.europa.eu/en/documents/report/common-data-model-europe-why-which-how-workshop-report_en.pdf.

14. Biedermann, Ong, Davydov, et al. Standardizing registry data to the OMOP Common Data Model: experience from three pulmonary hypertension databases. BMC Med Res Methodol 2021; 21(1): 238.

15. Makadia, Ryan. Transforming the Premier Perspective hospital database into the Observational Medical Outcomes Partnership (OMOP) common data model. EGEMS (Wash DC) 2014; 2(1): 1110.

16. Panaccio, Cummins, Wentworth, et al. A common data model to assess cardiovascular hospitalization and mortality in atrial fibrillation patients using administrative claims and medical records. Clin Epidemiol 2015; 7: 77–90.

17. Reisinger, Ryan, O'Hara, et al. Development and evaluation of a common data model enabling active drug safety surveillance using disparate healthcare databases. J Am Med Inf Assoc 2010; 17(6): 652–662.

18. Voss, Ryan. The impact of standardizing the definition of visits on the consistency of multi-database observational health research. BMC Med Res Methodol 2015; 15: 13.

19. Zhou, Murugesan, Bhullar, et al. An evaluation of the THIN database in the OMOP Common Data Model for active drug safety surveillance. Drug Saf 2013; 36(2): 119–134.

20. Haberson, Rinner, Schoberl, et al. Feasibility of mapping Austrian health claims data to the OMOP common data model. J Med Syst 2019; 43(10): 314.

21. Matcho, Ryan, Fife, et al. Fidelity assessment of a Clinical Practice Research Datalink conversion to the OMOP common data model. Drug Saf 2014; 37(11): 945–959.

22. Papez, Moinat, Payralbe, et al. Transforming and evaluating electronic health record disease phenotyping algorithms using the OMOP common data model: a case study in heart failure. JAMIA Open 2021; 4(3): ooab001.

23. Padmanabhan, Carty, Cameron, et al. Approach to record linkage of primary care data from Clinical Practice Research Datalink to other health-related patient data: overview and implications. Eur J Epidemiol 2019; 34(1): 91–99.

24. Hagberg, Vasilakis-Scaramozza, Persson, et al. Presence of breast cancer information recorded in United Kingdom primary care databases: comparison of CPRD Aurum and CPRD GOLD (companion paper 1). Clin Epidemiol 2023; 15: 1183–1192.

25. Sanchez-Santos, Axson, Dedman, et al. Data resource profile update: CPRD GOLD. Int J Epidemiol 2025; 54(4): dyaf077.

26. Coton, Welburn, Williams, et al. The Clinical Practice Research Datalink (CPRD) mother-baby links: a data resource profile. Pharmacoepidemiol Drug Saf 2025; 34(2): e70091.

27. Hagberg, Vasilakis-Scaramozza, Persson, et al. Correctness and completeness of breast cancer diagnoses recorded in UK CPRD Aurum and CPRD GOLD databases: comparison to hospital episode statistics and cancer registry (companion paper 2). Clin Epidemiol 2023; 15: 1193–1206.

28. Jick, Vasilakis-Scaramozza, Persson, et al. Use of the CPRD Aurum database: insights gained from new data quality assessments. Clin Epidemiol 2023; 15: 1219–1222.

29. Vasilakis-Scaramozza, Hagberg, Persson, et al. Comparison of rheumatoid arthritis information recorded in UK CPRD Aurum and CPRD GOLD databases (companion paper 3). Clin Epidemiol 2023; 15: 1207–1218.

30. Kohler, Boscá, Kärcher, et al. Eos and OMOCL: towards a seamless integration of openEHR records into the OMOP Common Data Model. J Biomed Inf 2023; 144: 104437.

31. Observational Health Data Sciences and Informatics. Chapter 5: Standardized vocabularies. In: The Book of OHDSI [Internet]. 2021 (accessed 6/5/2024). Available from: https://ohdsi.github.io/TheBookOfOhdsi/StandardizedVocabularies.html

32. ATHENA. Explore domains. Accessed 5 July 2025. Available from: https://athena.ohdsi.org/search-terms/start

33. Reich, Ostropolets, Ryan, et al. OHDSI Standardized Vocabularies-a large-scale centralized reference ontology for international data harmonization. J Am Med Inf Assoc 2024; 31(3): 583–590.

34. Benson. The history of the Read codes: the inaugural James Read Memorial lecture 2011. Inf Prim Care 2011; 19: 173–182.

35. Kissling, Pozo, Martínez-Baz, et al. Influenza vaccine effectiveness against influenza A subtypes in Europe: results from the 2021-2022 I-MOVE primary care multicentre study. Influenza Other Respir Viruses 2023; 17(1): e13069.

36. Stuurman, Bollaerts, Alexandridou, et al. Vaccine effectiveness against laboratory-confirmed influenza in Europe - results from the DRIVE network during season 2018/19. Vaccine 2020; 38(41): 6455–6463.

37. Sing, Lin, Bartholomew, et al. Global epidemiology of hip fractures: secular trends in incidence rate, post-fracture treatment, and all-cause mortality. J Bone Miner Res 2023; 38(8): 1064–1075.