Abstract

Objectives: Data duplication is one of the most important issues in electronic health record (EHR) processing, owing to the way data are collected in the field. It affects not only data quality in healthcare management but also the reliability of downstream analyses. In this paper, we propose a comprehensive data de-duplication framework tailored for medical databases and apply it to the identification of a kidney disease, acute kidney failure (AKF). Methods: The proposed framework begins with data joining from various sources, followed by basic data de-duplication with automatic removal of dirty text, medical note-event extraction (since the notes can themselves be a source for further de-duplication), NLP-based de-duplication using a pre-trained model, data mapping for integration, elimination of unrelated data and outliers, and finally data imputation by a cluster-based imputer. Results: We evaluated the framework on the MIMIC-III database, on both the de-duplication task and an AKF classification task. The experiments showed that the proposed framework achieved up to 99.59% de-duplication accuracy, 23% higher than the traditional method, and a classification accuracy of 86% with an F1-score of 0.87, outperforming both the traditional method and the original, unmodified dataset. Conclusion: These results suggest that the framework can effectively address the data duplication issue in healthcare.

Keywords

data duplication, electronic medical record, acute kidney failure, natural language processing, classification

Introduction

Acute kidney failure (AKF) is characterized by a rapid decline in renal function and is frequently encountered among patients in intensive care units (ICUs) and other hospital settings.¹ The severity of AKF can range from mild impairment to life-threatening organ dysfunction. If not appropriately managed, AKF may lead to serious complications and increased mortality. Patients with severe AKF may require renal replacement therapy (RRT) to remove waste products from the bloodstream.²

For patients admitted to the ICU, treatment for kidney failure, in addition to the RRT, may comprise pharmacologic interventions aimed at managing blood pressure, controlling inflammation, and addressing the underlying dysfunction. Continuous monitoring of fluid balance, electrolyte concentrations, and other vital parameters may also be required in these patients.³ Typically, the prognosis of acute kidney failure in the intensive care setting can vary depending on the severity of the condition, the underlying cause, and the patient’s overall health. With timely and appropriate intervention, a substantial proportion of ICU patients with AKF can regain renal function and return to baseline health status. However, a subset may sustain chronic kidney impairment necessitating long-term dialysis or eventual renal transplantation.

The standard diagnostic evaluation for AKF includes laboratory tests such as serum creatinine, blood urea nitrogen (BUN), and urinalysis. Serum creatinine is probably the most commonly used indicator of kidney function, and its level is a primary criterion for classifying the severity of AKF. Measurement of BUN provides an additional assessment of renal excretory capacity, as it reflects the amount of nitrogen excreted by the kidneys. Urinalysis may provide critical insights into the underlying etiology of AKF by identifying abnormalities such as red blood cells, white blood cells, or urinary casts. In addition to laboratory testing, clinical evaluation is a critical component of AKF diagnosis. This includes a thorough patient history, physical examination, and assessment of fluid and electrolyte balance. Patients with AKF may present with symptoms such as decreased urine output, swelling, and shortness of breath. Hypertension, edema (swelling caused by excess fluid trapped in the body’s tissues), and electrolyte imbalances are also commonly observed.⁴ Based on these diagnostic tests, AKF can be classified into stages of severity according to the degree of kidney damage and the decline in kidney function.⁵ The diagnosis of AKF helps healthcare providers determine an appropriate course of treatment, including the need for dialysis and other supportive interventions. In this context, the clinical guidelines outlined by KDIGO⁶ or RIFLE⁷ should be followed.

This paper addresses the issue of data duplication in electronic health records (EHRs), which may comprise medical history, diagnoses, medications, laboratory test results, treatment plans, and other relevant clinical information, in the context of identifying a kidney disease, acute kidney failure (AKF). Data duplication in an EHR refers to the presence of repeated patient and related information: the same data elements are recorded multiple times within the EHR. Such duplication can take various forms, e.g., identical patient demographics entered in multiple places, duplicate medication orders or prescriptions, redundant lab test results entered separately, or terms with different presentations but the same meaning, such as “HTN” and “Hypertension” or “DM” and “Diabetes”. These duplications may compromise data integrity and accuracy, thereby adversely affecting patient care, clinical decision-making, and research outcomes.⁸

We propose to tackle the data duplication issue by applying feature engineering techniques to select, extract, and transform the most relevant features from the raw EHR data, thereby improving data quality. In addition to structured data, note-event data comprising medical notes documented by physicians, nurses, and other healthcare professionals are incorporated to enhance the proposed methodology. An NLP technique based on a pre-trained model, scispaCy, which can recognize biomedical terminology, is therefore applied to eliminate duplicate data more efficiently.⁹,¹⁰ Both feature engineering and NLP techniques are investigated in the experiments to reduce data duplication in the context of AKF. The proposed work is evaluated both on its duplication-elimination outcome and on the application of a machine learning algorithm, KNN, to the de-duplicated version of a well-known dataset, the MIMIC-III database.

Method

In this section, the proposed framework to address data duplication in EHRs is presented. It comprises seven sequential processes, as outlined in Figure 1.

Figure 1.

Proposed data management for de-duplication.

Initially, data joining is conducted to integrate data from related sources, ensuring comprehensive and interconnected information. This step can be omitted if data integration is already in place before the de-duplication process is applied. However, it is generally necessary owing to the inherent structure of electronic medical records (EMRs), which are collected and stored in healthcare information systems.

In Step 2, basic de-duplication techniques are employed to identify and remove duplicate records, reducing redundancy in the joined dataset based on metadata.¹¹ This is followed by automated removal of any dirty text elements, which are replaced with an empty string before the cleaned text is stored back into the dataset. Subsequently, in Step 3, the patient medical history (PMH) is extracted from the note-event sources using Algorithm 1. The algorithm extracts the patient’s morbidity from the note-event source with a function that recursively splits the text on PMH keywords until the patient’s morbidity is found. For example, suppose that the note-event text is “HPI: PMHx: PMH: Prostate CA w/spinal mets, Gastric volvulus, Constipation, Depression, Lacunar infarct PSH: Gastropexy, Hiatal Hernia Repair... Current medications: ...PMH: (copied from prior clinic note) HTN; DM2; CKD stage 3; dyslipidemia.” The first split yields “HPI:” as the first part and everything after “PMHx:” as the second part. The algorithm then recursively splits the second part into “PMH: Prostate CA w/spinal mets, Gastric volvulus, Constipation, Depression, Lacunar infarct PSH: Gastropexy, Hiatal Hernia Repair... Current medications: ... ” and “(copied from prior clinic note) HTN; DM2; CKD stage 3; dyslipidemia.” Note that the medical history keyword can vary; the algorithm can be adapted to extract the information accordingly.
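
The recursive keyword splitting described for Step 3 can be sketched as follows. This is a minimal illustration and not a reproduction of Algorithm 1: the keyword pattern, the function name `extract_pmh`, and the choice to return every chunk that follows a PMH keyword are assumptions made for the example.

```python
import re

# Assumed keyword variants ("PMH", "PMH:", "PMHx", "PMHx:"); real notes may need more.
PMH_KEYWORD = re.compile(r"PMHx?:?")


def extract_pmh(text):
    """Recursively split a note on PMH keywords and return the chunks that follow them."""
    match = PMH_KEYWORD.search(text)
    if match is None:
        return []  # no keyword left, nothing to extract
    tail = text[match.end():]
    following = PMH_KEYWORD.search(tail)
    if following is None:
        chunk = tail.strip()
        return [chunk] if chunk else []
    # Keep the text before the next keyword, then recurse on the remainder.
    head = tail[:following.start()].strip()
    rest = extract_pmh(tail[following.start():])
    return ([head] if head else []) + rest


note = ("HPI: PMHx: PMH: Prostate CA, Depression PSH: Gastropexy "
        "Current medications: none PMH: HTN; DM2")
chunks = extract_pmh(note)
```

Here the leading “HPI:” fragment is discarded, the empty chunk between “PMHx:” and “PMH:” is skipped, and the repeated “PMH:” keyword triggers a further split of the remaining text.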

To further improve the de-duplication process, a pre-trained model, scispaCy¹² (the en_core_sci_lg model), is applied in Step 4. This process helps identify and eliminate any remaining duplicated data. Identification relies on tokenization that is aware of scientific and medical terms, so that terms such as mg/dL are not split into separate tokens. In addition, duplicates such as “HTN” and “Hypertension”, “DM” and “Diabetes”, “PVD” and “Peripheral Vascular Disease”, or “CAD” and “Coronary Artery Disease” can be eliminated using the Unified Medical Language System (UMLS), in which the terms are compared with the context data by string similarity. The model is trained on a large amount of biomedical and scientific text and contains approximately 785,000 terms and 600,000 word vectors, which were trained with the GloVe algorithm.¹³ GloVe captures semantic relationships between words by maximizing the likelihood of the observed word co-occurrences within the given text. Additionally, AbbreviationDetector, a spaCy component that executes the abbreviation detection algorithm, is applied to verify the similarity between abbreviated and full-term words.⁹ Finally, in Step 5, the PMH data from the previous step is mapped back into the dataset for seamless data integration.
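
The effect of Step 4 on abbreviated and full-term duplicates can be illustrated with a small sketch. It deliberately does not call scispaCy or UMLS; the lookup table `ABBREVIATIONS` and the function `canonical_terms` are hypothetical stand-ins that only show how mapping terms to one canonical form lets duplicates collapse.

```python
import re

# Hypothetical lookup table standing in for abbreviation detection and UMLS linking.
ABBREVIATIONS = {
    "htn": "hypertension",
    "dm": "diabetes",
    "pvd": "peripheral vascular disease",
    "cad": "coronary artery disease",
}


def canonical_terms(pmh_text):
    """Split a PMH string on ';' or ',' and map each term to one canonical form."""
    seen = set()
    result = []
    for raw in re.split(r"[;,]", pmh_text):
        term = raw.strip().lower()
        if not term:
            continue
        term = ABBREVIATIONS.get(term, term)  # expand known abbreviations
        if term not in seen:  # drop terms already present in canonical form
            seen.add(term)
            result.append(term)
    return result


terms = canonical_terms("HTN; Hypertension, DM, Diabetes; CAD")
```

In the framework itself this mapping is obtained from the pre-trained model rather than from a hand-written dictionary, so the sketch should be read only as an illustration of the canonicalization idea.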

Subsequently, to facilitate the effective application of the designated analytical tasks, two additional steps are performed. First, irrelevant data and outliers potentially present in the dataset are identified and removed to minimize bias and improve the validity of the results in Step 6.

Secondly, in Step 7, data imputation and normalization procedures are performed. Regarding data imputation, missing data in EHRs can be classified into several types, namely missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).¹⁴ This study focuses primarily on MCAR, as it frequently occurs in medical datasets. MCAR arises when certain healthcare measurements cannot be completed uniformly in practice. For instance, patients admitted to the ICU who remain bedridden may be unable to have their height accurately measured, resulting in missing values for height in the clinical notes. Such instances represent MCAR, where data absence is unrelated to observed or unobserved patient characteristics.

Thus, in Step 7, the KNN imputer¹⁵ is applied to impute missing data within the dataset. The KNN imputer is particularly effective in this context because it considers the characteristics of related observations, clusters, or nearest-neighbor data; the approach is therefore well suited to medical data, where features often exhibit strong interdependencies. This can result in accurate and reliable imputation of missing values, especially for data characterized as MCAR. The optimal value of k for the KNN imputer is determined via GridSearchCV,¹⁶ with uniform weighting and the Euclidean distance metric. Furthermore, data normalization is conducted to standardize the scale and range of features, thus facilitating unbiased comparison and more robust analysis across variables.
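
The idea behind KNN imputation with uniform weights and Euclidean distance can be sketched in plain Python. This is a simplified illustration, not the library implementation used in the study: the function name `knn_impute` is hypothetical, distances are computed only over features observed in both rows, and no rescaling for the number of shared features is applied.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature over the k nearest rows."""

    def distance(a, b):
        # Euclidean distance over the features observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return None
        return math.sqrt(sum((x - y) ** 2 for x, y in shared))

    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue  # a neighbour must have the missing feature observed
                d = distance(row, other)
                if d is not None:
                    candidates.append((d, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = candidates[:k]
            if nearest:
                # Uniform weighting: a plain mean of the neighbours' values.
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


# Toy rows of (creatinine, BUN); the third row is missing BUN.
filled = knn_impute([[1.0, 10.0], [2.0, 20.0], [3.0, None], [10.0, 100.0]], k=2)
```

With k = 2, the missing value is filled from the two rows closest in creatinine, so the distant fourth row does not influence the result.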

To evaluate the proposed work, two experiments are conducted. First, a confusion matrix compares the duplicates identified in a manually labeled dataset with the duplicates identified by the proposed work. The criteria for manual duplicate identification are that two records are entirely identical and that their timestamps fall on the same day. This serves as a direct evaluation of whether the proposed work can eliminate duplication. The experiment is conducted on the result dataset up to Step 5 of the framework, before any task-specific data analytics. Second, a machine learning algorithm, KNN classification between the AKF and unspecified kidney condition (UKC) classes, is applied to the de-duplicated result datasets. This evaluates how the output of the proposed work can be used in further analytical tasks. The methods compared are the baseline de-duplication¹¹ under our framework without the NLP process (referred to as basic de-duplication) and classification on the original dataset without any modification (referred to as Org). Note that GridSearchCV¹⁶ is applied in the training process of all three compared methods to automate the search for the optimal combination of hyper-parameters within a predefined range.
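
The manual labeling criterion can be expressed as a small predicate. The sketch below is one possible reading of the criterion, namely that all non-timestamp fields are identical and the timestamps share a calendar day; the function name and the use of `storetime` as the timestamp field are assumptions for illustration.

```python
from datetime import datetime


def is_manual_duplicate(rec_a, rec_b, time_field="storetime"):
    """True if all non-timestamp fields match and both timestamps are on the same day."""
    content_a = {k: v for k, v in rec_a.items() if k != time_field}
    content_b = {k: v for k, v in rec_b.items() if k != time_field}
    same_day = rec_a[time_field].date() == rec_b[time_field].date()
    return content_a == content_b and same_day


# Toy records: a and b differ only in time of day, c is stored a day later.
a = {"subject_id": 1, "pmh": "hypertension", "storetime": datetime(2101, 10, 20, 8, 30)}
b = {"subject_id": 1, "pmh": "hypertension", "storetime": datetime(2101, 10, 20, 17, 5)}
c = {"subject_id": 1, "pmh": "hypertension", "storetime": datetime(2101, 10, 21, 9, 0)}
```

Under this reading, `a` and `b` would be labeled as duplicates, while `a` and `c` would not.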

Results

Dataset exploration

Our proposed work is evaluated on a publicly available medical dataset, the MIMIC-III database version 1.4 (MIMIC-III v1.4). The details of the dataset, as well as some key findings from applying our framework to it, are as follows.

First, the MIMIC-III database consists of 26 tables, which are classified into four types: identifier, events, dictionary, and patient care. The identifier tables, i.e. PATIENTS, ADMISSIONS, and ICUSTAYS, employ distinct primary keys with the ‘ID’ suffix to uniquely identify patients, hospital admissions, and intensive care unit stays, respectively. The events tables, i.e. OUTPUTEVENTS, NOTEEVENTS, CHARTEVENTS, and PROCEDUREEVENTS, hold comprehensive records of distinct events and measurements pertaining to patients, e.g. lab results or medical notes. The dictionary tables facilitate cross-referencing between tables through the IDs of defined terms, e.g. D_ICD_DIAGNOSES, D_ICD_PROCEDURES, D_ITEMS, and D_LABITEMS for diagnosis ICD codes, medical procedure ICD codes, item IDs, and lab item IDs, respectively. Last, the patient care tables contain important information, e.g. physiological measurements, caregiver observations, or comprehensive billing details.

The results of applying our seven-step framework (Figure 1) to the MIMIC-III database are as follows. First, the related tables from the database are retrieved and joined hierarchically, as shown in Figure 2. For example, patient information is joined with hospital admission data, i.e. admission number and preliminary diagnosis, and with laboratory results, i.e. BUN, sodium, potassium, and creatinine. The tables are joined up to the top of the hierarchy. The size of each joining result is shown in Table 1.

Figure 2.

Hierarchy of data joining.

Table 1.

MIMIC-III joining result.

Target table	Source tables	#Records	#Total records
PADL	Patients	46,520	1,128,226
	Admission	58,976
	Pivot lab	1,388,007
PICUL	Patients	46,520	60,840
	ICU stay	61,532
	Pivot lab	1,388,007
CO-PADICL	PADL	1,128,226	1,346,051
CO-PADICL	PICUL	60,840	1,346,051
CO-PADICL+V	CO-PADICL	1,346,051	306,437
CO-PADICL+V	Pivot vital sign	9,152,812	306,437
PNOTE	PICUL	60,840	73,218
	Height-weight	61,532
	Diagnose-icd	651,047
	Noteevents	2,083,180
	D-icd-diagnose	12,167
FINAL	CO-PADICL+V	306,437	812,995
FINAL	PNOTE	85,252	812,995
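
The joining step behind Table 1 can be sketched with a small key-based join. The join type is not stated in the text, so an inner join is assumed here; the helper `inner_join` and the toy rows are hypothetical, with column names borrowed from Table 2.

```python
def inner_join(left, right, key):
    """Join two lists of dict rows on a shared key, keeping only matching rows."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for left_row in left:
        for right_row in index.get(left_row[key], []):
            joined.append({**left_row, **right_row})
    return joined


patients = [{"subject_id": 1, "gender": "M"}, {"subject_id": 2, "gender": "F"}]
admissions = [{"subject_id": 1, "hadm_id": 100}, {"subject_id": 1, "hadm_id": 101}]
labs = [{"hadm_id": 100, "creatinine": 1.2}, {"hadm_id": 100, "creatinine": 1.2}]

patient_admissions = inner_join(patients, admissions, "subject_id")
with_labs = inner_join(patient_admissions, labs, "hadm_id")
```

Each one-to-many join repeats the patient-level fields once per matching row, which is one reason joined EHR tables accumulate redundant records before de-duplication.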

Then, in the basic de-duplication step, duplicates existing in the dataset are first removed based on the metadata. We construct an automated algorithm to remove dirty text elements, e.g. parentheses “()”, braces “{}”, square brackets “[]”, chevrons “⟨⟩”, numbers, symbols, etc., from the note-event column. After this step, the number of records is reduced from 812,995 to 735,609.
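
The dirty-text removal can be sketched with a regular expression that replaces the listed elements with an empty string. The exact character set removed in the study is not specified, so the pattern below is an assumption, as is the whitespace collapsing added to keep the output readable.

```python
import re

# Assumed set of "dirty" characters: all bracket types, digits, and a few symbols.
DIRTY_CHARS = re.compile(r"[()\[\]{}⟨⟩<>\d*#@^~|]")


def strip_dirty_text(note):
    """Replace dirty characters with an empty string, then collapse whitespace."""
    cleaned = DIRTY_CHARS.sub("", note)
    return re.sub(r"\s+", " ", cleaned).strip()


cleaned_note = strip_dirty_text("Pt [**Name**] c/o (chest pain) {HTN} 123")
```

Characters such as “/”, “:” and “;” are kept in this sketch because later steps rely on keywords like “PMH:” and on separators between history items.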

Subsequently, the PMH extraction algorithm is applied to reduce excessive PMH records in the note-event column. Typically, PMH information in the MIMIC-III database was written on paper-based systems and later stored in the EHR system, which results in long texts and redundant data storage. Furthermore, the PMH keywords in the note-event column can be labeled variably, e.g. “PMH:”, “PMHx”, and “PMH”. With Algorithm 1, the note-event column is split on these keywords to find the patient’s morbidity. The text is recursively separated into two chunks until the patient’s morbidity is found in the second one, and the second chunk is then stored back into the dataset. In some cases, a redundant PMH keyword is found in the second chunk, caused by excessive recording by practitioners. The second chunk is therefore re-split to match the patient’s morbidity exactly before it is stored back into the dataset.

Then, scispaCy is applied to further reduce duplicate cases in the note-event records by replacing ambiguous terms with their full, correct terms. After this step, the number of records in the MIMIC-III database is reduced from 735,609 to 675,315. The note-event column is then mapped back into the dataset. In the case of MIMIC-III, this results in a final table of 22 columns, shown together with their descriptions in Table 2. In addition, diseases and conditions unrelated to AKF or kidney injury are excluded to prevent potential bias in the subsequent task.

Table 2.

The description of each column in the dataset.

Column	Description
Subject id	Patients’ identification
Gender	Gender of each patient
Hadm id	Identifier of each patient’s hospital stay
Pre diagnose	Primary diagnoses
Admittime	Time at which each patient was admitted to hospital
Dischart time	Time at which each patient was discharged from hospital
BUN	Blood urea nitrogen value
Sodium	Sodium value
Potassium	Potassium value
Creatinine	Creatinine value
Lab charttime	Time of lab observation
icustay id	Identifier of each patient’s ICU stay
icu intime	Time at which each patient was transferred into the ICU
icu outtime	Time at which each patient was transferred out of the ICU
HeartRate	Heart rate
TempC	Temperature of each patient on initial arrival at hospital
icd9 code	The international coding definition (version 9)
Diagnosis	Diagnosis of each patient
Weight	Weight of each patient
Height	Height of each patient
Storetime	Time of each medical note stored into EHR system
pmh	Patients’ medical history

For outlier elimination, we employ percentile capping, as suggested by domain experts,¹⁷ together with the following two steps:

(1) First, all instances of acute kidney failure (AKF), such as “Acute kidney failure with lesion of tubular necrosis” and “Acute kidney failure, unspecified”, are combined under the category “Acute kidney failure”.

(2) Second, the other cases, which do not belong to conditions that result in kidney injury, are discarded. In addition, any diagnoses that account for less than 10% of the total dataset, as well as those with the highest occurrence, are excluded to mitigate potential biases during the training process. As a result of this elimination process, the final dataset consists of 39,264 records.
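
Percentile capping, mentioned above for outlier elimination, can be sketched as follows. The percentile thresholds used in the study are not stated, so the defaults of the 1st and 99th percentiles are assumptions, and the helper names are hypothetical.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) of an already sorted list."""
    pos = (len(sorted_values) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * frac


def cap_percentiles(values, low_q=1, high_q=99):
    """Clamp values outside the [low_q, high_q] percentile range to its bounds."""
    ordered = sorted(values)
    lo = percentile(ordered, low_q)
    hi = percentile(ordered, high_q)
    return [min(max(v, lo), hi) for v in values]


# Toy creatinine values with an invalid negative entry and an extreme high entry.
capped = cap_percentiles([-5.0, 0.8, 1.0, 1.2, 1.5, 50.0], low_q=20, high_q=80)
```

Capping keeps the record in the dataset while limiting the influence of the extreme value, which differs from discarding the record outright.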

Lastly, data imputation and normalization are performed. First, the dataset is examined to assess the extent of missing data, as illustrated in Figure 3. The analysis reveals considerable variation in the proportion of missing values among variables, with a standard deviation of 0.34. Then, the KNN imputer is applied because of the MCAR nature of the missing data, as mentioned before.

Figure 3.

Missing values in each of variables.

To evaluate classification effectiveness, the distribution of classes between AKF and the UKC conditions in the MIMIC-III database is examined. The frequency of each condition is shown in Table 3.

Table 3.

Distribution of kidney condition within the dataset.

ICD9	Diagnosis	Frequency
5849	Acute kidney failure (AKF)	21,193
5859	Chronic kidney disease (UKC)	5,813
40390	Hypertensive chronic kidney disease, stage I - IV (UKC)	5,411
40391	Hypertensive chronic kidney disease, stage V (UKC)	2,210
—	The other kidney conditions (UKC)	4,637

De-duplication results

In this section, we present the result of the first experiment, the de-duplication effectiveness. The de-duplication results of our proposed method and of the baseline basic de-duplication¹¹ without the NLP process (basic de-duplication) are presented as a confusion matrix in Table 4. The baseline had comparatively high false positives and false negatives, at 86,537 and 56,221 records respectively, resulting in an accuracy of 80.59%. In contrast, the proposed method produced 1,905 false positives and 849 false negatives, giving an accuracy of 99.59%. The proposed method thus reduced false positives and false negatives by factors of approximately 45.43 and 66.22, respectively, compared to the baseline. These results demonstrate a significant improvement in de-duplication performance.

Table 4.

Performance of the proposed method for the de-duplication.

	Actual: Duplicate		Actual: Non-duplicate
	Proposed method	Basic de-duplication	Proposed method	Basic de-duplication
Predicted: Duplicate	429,372 (63.58%)	339,326 (46.13%)	1,905 (0.28%)	86,537 (11.76%)
Predicted: Non-duplicate	849 (0.13%)	56,221 (7.64%)	243,189 (36.01%)	253,525 (34.47%)
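
The accuracy figures and reduction factors quoted in this section follow directly from the counts in Table 4. The short sketch below recomputes them; only the helper name `accuracy` is introduced for illustration.

```python
def accuracy(tp, fp, fn, tn):
    """Share of records whose duplicate / non-duplicate label was predicted correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


# Counts taken from Table 4 (tp: duplicates correctly detected).
proposed = dict(tp=429_372, fp=1_905, fn=849, tn=243_189)
basic = dict(tp=339_326, fp=86_537, fn=56_221, tn=253_525)

acc_proposed = accuracy(**proposed)        # about 0.9959
acc_basic = accuracy(**basic)              # about 0.8059
fp_factor = basic["fp"] / proposed["fp"]   # about 45.43
fn_factor = basic["fn"] / proposed["fn"]   # about 66.22
```

Note that the two methods are evaluated on different record totals (675,315 and 735,609), since the NLP step removes further records before evaluation.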

Classification results

To evaluate the classification performance between the AKF and UKC classes, the dataset was first divided 80:20 into training and testing sets. Note that the number of records differed between methods because of their processing: the proposed method, the basic de-duplication, and Org had 39,264, 43,258, and 43,258 records, respectively. The resulting testing sets contained 7,853, 8,652, and 8,652 records, respectively. As mentioned before, GridSearchCV¹⁶ was applied during training to optimize the hyper-parameters of all three compared methods.

The experiment results, summarized in Table 5, demonstrate that the proposed method outperforms the other approaches in terms of accuracy, precision, and F1-score, achieving values of 0.86, 0.89, and 0.87, respectively. However, as presented in Table 6, the proposed method performs slightly worse with respect to false negatives, with a rate of 8.16% compared to 7.84% for the baseline basic de-duplication method. This difference contributes to the higher recall of the basic method.

Table 5.

Performance of the proposed method for the classification task.

Method	Accuracy	Precision	Recall	F1-score
1. Proposed method	0.86	0.89	0.85	0.87
2. Basic de-duplication	0.85	0.87	0.85	0.86
3. Org	0.82	0.84	0.81	0.82

Table 6.

Confusion matrix of the comparing methods.

	Actual: AKF			Actual: UKC
	Proposed method	Basic de-duplication	Org	Proposed method	Basic de-duplication	Org
Predicted: AKF	3,631 (46.24%)	3,984 (46.05%)	3,754 (43.39%)	447 (5.69%)	609 (7.04%)	694 (8.02%)
Predicted: UKC	641 (8.16%)	678 (7.84%)	869 (10.04%)	3,134 (39.91%)	3,381 (39.07%)	3,335 (38.55%)
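
The metrics in Table 5 can be recomputed from the confusion matrix in Table 6. The sketch below does so for the proposed method and the basic de-duplication, treating AKF as the positive class, which is an assumption that is consistent with the reported values.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts from Table 6, with AKF as the positive class.
proposed_metrics = classification_metrics(tp=3_631, fp=447, fn=641, tn=3_134)
basic_metrics = classification_metrics(tp=3_984, fp=609, fn=678, tn=3_381)
```

Both recalls round to 0.85, but the unrounded value is slightly higher for the basic de-duplication, in line with its lower false negative rate.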

Discussion

First, with regard to de-duplication performance, our proposed method effectively reduced both the false positive and false negative rates compared to the basic de-duplication. The false positives arose when the same notation was written in different contexts in the EHR, which the basic de-duplication cannot detect effectively. The false negatives arose from small variations introduced when different personnel recorded the data for a single tuple in the EHR, which the NLP step can resolve more effectively. As a result, our proposed method outperformed the basic de-duplication by a large margin, with an accuracy of 99.59% compared with 80.59%. In addition, considering the discordant outcomes between the two methods, i.e. cases in which one method detects duplicates incorrectly while the other detects them correctly, McNemar’s test was statistically significant (χ² = 79,800.00, p << 0.001), which clearly shows the difference between the methods.
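
McNemar’s test depends only on the discordant pairs, that is, the records on which exactly one of the two methods is correct. The sketch below shows the standard statistic; the discordant counts behind the reported χ² are not given in the text, so the example counts are purely hypothetical, and whether a continuity correction was used in the study is not stated.

```python
def mcnemar_chi2(b, c, continuity=False):
    """McNemar chi-square statistic from the two discordant counts.

    b: records where method A is correct and method B is wrong.
    c: records where method A is wrong and method B is correct.
    """
    if b + c == 0:
        raise ValueError("no discordant pairs, statistic is undefined")
    diff = abs(b - c)
    if continuity:
        diff = max(diff - 1, 0)  # optional continuity correction
    return diff ** 2 / (b + c)


# Hypothetical discordant counts, for illustration only.
chi2_plain = mcnemar_chi2(10, 30)
chi2_corrected = mcnemar_chi2(10, 30, continuity=True)
```

The statistic is compared against a chi-square distribution with one degree of freedom, so a large imbalance between the two discordant counts yields a small p-value.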

For the classification performance, our proposed method helped the KNN classifier achieve the highest accuracy, 0.86, among the three compared methods. McNemar’s test between the proposed method and the basic de-duplication was also statistically significant (χ² = 199.24, p << 0.001), which clearly shows the difference between the two methods. It can also be seen that the recall of our proposed method was lower than that of the basic de-duplication. In general, achieving higher recall is more challenging because an automatic process cannot judge duplication in exactly the same way as a manual process. In our experiments, creatinine levels were interpreted differently by different practitioners, which may introduce ambiguity: measurement units varied (e.g., mg/dL or µmol/L), and the normal range differed across the dataset. Moreover, some incorrect laboratory test results remained in the dataset. For example, one patient had a negative creatinine level, which is not valid. This issue could be prevented either by more effective data entry processes or by a semi-automated de-duplication process with human interaction.

Furthermore, the presence of diagnostically similar conditions may contribute to misclassification. Although elevated BUN and creatinine levels are commonly indicative of AKF, overlapping value ranges can occur across different renal conditions. For example, a patient presenting with a BUN level of 30 mg/dL and a creatinine level of 1.5 mg/dL was diagnosed with AKF, whereas another patient with a BUN level of 29 mg/dL and a creatinine level of 1.6 mg/dL was classified as having a UKC. Such similarities contributed to an increase in false negative rates, as evidenced by the 8.16% rate reported in Table 6, which subsequently affected the recall performance demonstrated in Table 5.

However, when considering the F1-score, our proposed method still achieved the best result, at 0.87. In contrast, the baseline basic de-duplication method exhibited a slightly lower score of 0.86, as shown in Table 5. The difference was primarily attributable to a decline in precision. The baseline cannot adequately handle the ambiguous cases in the dataset, i.e., semantically equivalent terms with different representations, for example “hypertension” and “HTN”. Furthermore, in certain instances, misclassification occurred because of features that did not discriminate between the classes, e.g., “Chest pain”, which often occurred in pre-diagnostic records and was associated with both AKF and some UKC.

There were some limitations in this study. Our analysis was conducted exclusively using the MIMIC-III dataset. We recognize that the absence of an external validation cohort or cross-institution experiments restricts the immediate generalizability of our derived models to various clinical settings. However, the main objective of this study was to evaluate the relative impact of de-duplication approaches on model development and internal validation, rather than to establish the generalizability of the derived model in particular settings. The MIMIC-III database provides a sufficiently large and heterogeneous sample to statistically validate the internal performance of our framework. Consequently, the significant performance gap observed between our method and the baseline remains a valid indicator of the framework’s utility in managing EHRs. Finally, concerning model performance metrics, the mean differences in AuROCs from ANOVA with Bonferroni-adjusted pairwise comparisons ranged from 0.7% to 3.5%, as presented in Table 7. We acknowledge that these margins might be considered relatively small for clinical relevance and might only slightly affect a clinical decision. Nonetheless, this modest increase does not undermine the methodological value of the framework. Our proposed de-duplication method demonstrated a statistically significant improvement in overall discrimination performance compared to both the original and the basic de-duplicated data. This statistical significance indicates that the framework effectively reduces noise and offers a more robust foundation for predictive modeling than standard techniques, thereby validating the approach, even if the marginal gain appears limited in a clinical context.

Table 7.

Discrimination performance comparisons.

Method	Mean AuROC (1,000 BS)	95% CI	Mean AuROC difference	95% CI	p-value
Proposed	0.861	0.857–0.863	0.041	0.035–0.046	<0.001
Basic	0.853	0.847–0.856	0.033	0.027–0.389	<0.001
Original	0.820	0.816–0.824	Ref.
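
Table 7 reports mean AuROC values over 1,000 bootstrap samples. A minimal sketch of such an estimate is given below; the pairwise AuROC computation, the percentile-based 95% interval, the function names, and the toy data are assumptions for illustration and may differ from the procedure actually used.

```python
import random


def auroc(labels, scores):
    """AuROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auroc(labels, scores, n_boot=1000, seed=0):
    """Mean AuROC and a percentile 95% interval over bootstrap resamples.

    Assumes the data contain both classes; one-class resamples are redrawn.
    """
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # AuROC needs both classes in the resample
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    mean = sum(stats) / n_boot
    return mean, stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
point_estimate = auroc(toy_labels, toy_scores)
mean_auc, ci_low, ci_high = bootstrap_auroc(toy_labels, toy_scores, n_boot=200, seed=1)
```

The pairwise computation is quadratic in the number of records, so a rank-based formulation would be preferable for test sets of the size used here.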

Conclusion

This paper presents a comprehensive data de-duplication framework comprising data acquisition, basic de-duplication by the baseline algorithm, medical note-event extraction, application of NLP methods, data mapping, elimination of unrelated data and outliers, and finally missing data imputation. The proposed de-duplication method was evaluated against the baseline algorithm on the MIMIC-III database in both the de-duplication task and the AKF classification task. The experiment results showed that the proposed work achieved up to 99.59% accuracy in the de-duplication task and a classification accuracy of 86% with an F1-score of 0.87, outperforming the compared methods. Future work will focus on improving recall, particularly by addressing duplicates recorded at different timestamps through more semantic-aware approaches. Distinguishing between timestamp variations that represent genuine clinical updates and those that are redundant duplicates remains a significant challenge. Additionally, further investigations will explore other data analytical tasks within EHRs beyond AKF classification.

Footnotes

ORCID iD

Juggapong Natwichai

Ethical considerations

This research solely uses open, publicly available, anonymized data with no reasonable expectation of privacy issues. It therefore does not require ethical board review, as such data falls outside the purview of ethical oversight.

Author contributions

Chomchanok Yawana is responsible for initial research design, method implementation, experiment conduct, and manuscript drafting. Wachiranun Sirikul is responsible for method validation and data analysis. Juggapong Natwichai is responsible for the research design and manuscript review and editing.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by Chiang Mai University.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The datasets generated during and/or analyzed during the current study are available in the PhysioNet repository, at .

References

1. Mayo Clinic. Acute kidney failure, www.mayoclinic.org/diseases-conditions/kidney-failure/symptoms-causes/syc-20369048 (Accessed 9 April 2023).

2. Ortiz, Covic, Fliser, et al. Epidemiology, contributors to, and clinical trials of mortality risk in chronic kidney failure. The Lancet 2014; 383(9931): 1831–1843.

3. Fidalgo, Bagshaw. Chronic kidney disease in the intensive care unit. Springer, 2014.

4. Ostermann. Diagnosis of acute kidney injury: Kidney disease improving global outcomes criteria and beyond. Curr Opin Crit Care 2014; 20(6): 581–587.

5. Patschan, Müller. Acute kidney injury. J Inj Violence Res 2015; 7(1): 19.

6. Khwaja. KDIGO clinical practice guidelines for acute kidney injury. Nephron Clin Pract 2012; 120(4): c179–c184.

7. Ricci, Cruz, Ronco. The RIFLE criteria and mortality in acute kidney injury: a systematic review. Kidney Int 2008; 73(5): 538–546.

8. Wang ECH, Wright. Characterizing outpatient problem list completeness and duplications in the electronic health record. J Am Med Inf Assoc 2020; 27(8): 1190–1197.

9. Neumann, King, Beltagy, et al. ScispaCy: fast and robust models for biomedical natural language processing. In: Proceedings of the 18th BioNLP Workshop and Shared Task, Florence, Italy, August 2019. Association for Computational Linguistics, pp. 319–327. https://www.aclweb.org/anthology/W19-5034

10. Lossio-Ventura, Sun, Boussard, et al. Clinical concept recognition: evaluation of existing systems on EHRs. Front Artif Intell 2022; 5: 1051724.

11. Kong, Kim, Lee, et al. Two-level metadata management for data de-duplication system. Proceedings of IST, ASTL 2013; 23: 299–303.

12. Kyle. A full spaCy pipeline and models for scientific/biomedical documents, https://github.com/allenai/SciSpaCy (Accessed on 1 December 2022).

13. Pennington, Socher, Manning. Glove: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), Doha, Qatar, October 25-29, 2014, pp. 1532–1543.

14. Beaulieu-Jones, Moore, PROAACT Consortium. Missing data imputation in the electronic health record using deeply learned autoencoders. In: Pacific symposium on bio-computing 2017, Kohala Coast, Hawaii, January 4-8, 2017, pp. 207–218; World Scientific.

15. Juna, Umer, Sadiq, et al. Water quality prediction using KNN imputer and multilayer perceptron. Water 2022; 14(17): 2592.

16. Alghobiri. A comparative analysis of classification algorithms on diverse datasets. Eng Technol Appl Sci Res 2018; 8(2): 2790–2795.

17. Babbar. Integration of domain knowledge for outlier detection in high dimensional space. In: International Conference on Database Systems for Advanced Applications, Brisbane, Australia, April 21 - 23, 2009, pp. 363–368; Springer.

A comprehensive framework for de-duplication: Acute kidney failure (AKF) case study