Sage Journals: Discover world-class research

Abstract

Objective

Machine learning (ML) has enabled healthcare discoveries by facilitating efficient modeling, such as for cancer screening. Unlike data from clinical trials, the real-world data used in ML are often gathered for multiple purposes, leading to bias and missing information for a specific classification task. This challenge is especially pronounced in healthcare because of stringent ethical considerations and resource constraints.

This study proposed an integrated approach to enhance the quality of health evidence from a classification task for predicting Medicare's Diagnosis-Related Groups of ischemic heart disease (IHD) patients.

Methods

Eligible participants were identified from the Medical Information Mart for Intensive Care IV (MIMIC IV), a publicly available hospital database. Six ML models were selected for model triangulation. Sequential triangulation was employed via Local Process Mining (LPM) and Qualitative Comparative Analysis (QCA).

Results

A total of 1545 IHD hospitalizations from 916 patients were identified from the MIMIC IV. Eight health process features that aligned with clinical knowledge were identified through LPM. The absolute correlation coefficients for process features, which ranged from 0.24 to 0.42, were higher than those for non-process features, which ranged from 0.02 to 0.36. A total of 56 unique combinations were identified from the QCA, with 28 configurations having raw coverage lower than 1.0%. The overall model performance (i.e. weighted F1 and area under the curve scores) increased after adopting this integrated approach. The proportion of cases misclassified by all six models decreased by 47% after incorporating process features (from 5.48% to 2.91%) and further decreased to 0.0% after applying the QCA solutions.

Conclusion

The integrated approach enhanced the quality of a classification task through its clinical relevance, improved model performance, and reduced case-level error rates. However, more scalable QCA methods are needed for larger datasets. Developing health process feature engineering for broader applications is a possible future direction.

Keywords

Process mining, classification, machine learning, electronic health records, triangulation, evidence-based assessment

Highlights

Current classification tasks in machine learning that use real-world health data may be subject to noise, missing information, and bias. The quality of empirical insights derived from these tasks influences the quality of health evidence, as reflected in its validity and reliability.

This study proposed an integrated approach to enhancing the quality of health evidence in a classification task. It offers opportunities to enhance validity at both the case and variance levels, as well as to synthesize individual, episodic, and temporal information to increase internal validity. The use of publicly available real-world health data largely increased the reliability (i.e. reproducibility) of health evidence.

We showcased how this proposed approach could improve the predictive quality of classifying Medicare's Diagnosis-Related Groups, which serve as a hospital price estimator, for patients with ischemic heart disease.

As a result, the proposed approach enhanced quality through its clinical relevance, improved model performance and reduced case-level error rates.

Introduction

Recent technological progress, particularly in machine learning (ML) and deep learning (DL), has opened avenues for healthcare discoveries by enabling efficient model generation, especially in classification tasks such as cancer screening¹ and disease diagnosis.² However, unlike medical trials—particularly randomized controlled trials—which are conducted with rigorous protocols, study designs, and targeted data collection methods, the data used in ML and DL models are often sourced from public data that are collected for multiple purposes. These real-world data contain errors, noise, and missing information, which can lead to erroneous conclusions for specific classification tasks. This challenge is especially pronounced in healthcare because of stringent ethical considerations and resource constraints.³

Patient-centered care emphasizes the importance of addressing the diversity of individual cases. However, current model generation methods, such as cross-validation, focus on the variance level, which might overlook records collected with a high proportion of missing information or bias,⁴ and force them into a class in the testing set. The quality of healthcare-specific classification tasks has traditionally focused on overall performance on average, such as F1 score and accuracy, without sufficient attention to case-level variations in realistic clinical practice.⁵ The goal of minimizing discrepancies between predicted class and observed data can overlook the varied impacts and consequences on individual patients. For example, the misclassification of patients with unique care needs, such as abandoned children without complete medical histories, is often underestimated by current quality metrics. A systematic approach that evaluates case-level outcomes while effectively controlling for bias is essential. The triangulation approach, which involves the use of various techniques and multiple data sources to verify the reliability and validity of research outcomes,^6–8 offers a unique opportunity to synthesize evidence at both the variance and case levels.

In this study, we propose a triangulation approach to enhance the quality of health evidence in a classification task from case-based reasoning as well as variance-level process feature engineering. To demonstrate its capabilities, we conducted a case study using real-world hospital records.

Ischemic heart disease (IHD) is one of the leading causes of hospitalizations in the USA.⁹ High costs have hindered low-income patients from adhering to relevant treatments.¹⁰ Improved cost estimation would benefit dynamic management by health professionals.¹¹ Hospital providers receive fixed reimbursement amounts for services under the Diagnosis-Related Groups (DRGs) payment system, making DRG codes essential for cost monitoring and resource allocation.¹⁰ However, coding is typically performed retrospectively, after discharge. Classification on the basis of patients’ electronic health records would support earlier cost management.^10,11 We demonstrated how the proposed approach could identify “difficult-to-treat” cases while enhancing overall model performance in predicting DRG classifications for IHD patients.

Methods and materials

Eligible participants and their common process features were identified from the Medical Information Mart for Intensive Care IV (MIMIC IV) version 2.2, a publicly available hospital database containing over 65,000 de-identified patient records from the Beth Israel Deaconess Medical Center in Boston.^12–14 Over 13 tables were used for this analysis, including “admissions_table”, “d_icd_diagnoses”, “d_icd_procedures”, “d_labitems”, “diagnoses_icd”, “drgcodes”, “hcpcsevents”, “labevents”, “patients_table”, “ICU_stays” and “datetimeevents”.

Nature of the study

The study is methodological in nature, with the goal of enhancing the quality and generalizability of health evidence in a given classification task within real clinical settings. The case study was conducted from 18 June 2023 to 17 July 2024 at the University of Sydney School of Computer Science Innovation Center.

Evidence synthesis via triangulation

Triangulation is a series of approaches that enhances the validity of health findings through the convergence of information and evidence from multiple perspectives (e.g. data sources, diverse points of view, and theoretical methodologies).^6,8 Six widely employed ML models were selected for model triangulation. Sequential triangulation was employed via local process mining (LPM)¹⁵ and qualitative comparative analysis (QCA),^16,35 synthesizing sequential findings from both the case level and the variance level.

Eligibility criteria for participants

IHD patients aged 20 years and above who were eligible for Medicare claims with principal diagnosis codes I20-I25 (ICD-10-CM) were extracted from MIMIC IV. Patients from outpatient settings, as well as those transferred without an approximate discharge record, were excluded from the analysis.

Classification labeling and model building

Hospital separation refers to the process by which an episode of care for an admitted patient ceases.³⁸ The Medicare claim of each hospital separation for eligible participants was assigned a DRG code on the basis of the regulations of the U.S. Centers for Medicare & Medicaid Services (CMS).¹⁷ Over 90 DRG codes were grouped into three broad categories: bypass group, percutaneous cardiovascular intervention (PCI) group and others (Supplementary Appendix 1). The DRG codes in each group represented similar financial burdens for IHD patients. The labels were assessed by a clinician with the aim of maximizing sample balance and clinical relevance. Six commonly used ML models were selected to perform this classification task: logistic regression (LR), k-nearest neighbors (KNN), random forest (RF), decision tree (DT), support vector machine (SVM) and linear discriminant analysis (LDA).

Data randomization (also known as data shuffling) was adopted to prevent bias introduced by the initial record ordering, followed by standardization to eliminate redundancies and inconsistencies in the feature space. Seventy percent of the data were used for training, and the remaining 30% were used for testing. Most records were assigned to the first two categories (bypass and PCI groups). Since ML classifiers tend to exhibit bias toward the majority class, potentially leading to poor classification of minority classes,¹⁸ an oversampling technique¹⁹ was employed to balance the class distribution in the “Other” group. This technique involves duplicating existing records rather than generating new records, thus avoiding the introduction of bias in subsequent analyses while ensuring model performance. Five-fold cross validation³⁹ was employed to evaluate the performance in the training set. Hyperparameter tuning⁴⁰ was used to optimize each model.
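The shuffling, 70/30 split and duplication-based oversampling described above can be illustrated with a minimal sketch. It uses only the Python standard library and toy records; the function names, the seed, and the choice to oversample the training portion after splitting are assumptions made for illustration, not details reported in this study. Standardization, cross-validation and hyperparameter tuning are omitted.

```python
import random
from collections import Counter

def shuffle_and_split(records, train_fraction=0.7, seed=0):
    """Shuffle records, then split them into training and testing lists."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def oversample_by_duplication(records, label_of, target_label, target_count, seed=0):
    """Duplicate randomly chosen records of one class until it reaches target_count."""
    rng = random.Random(seed)
    minority = [r for r in records if label_of(r) == target_label]
    if not minority:
        raise ValueError("no records with the target label")
    extra = max(0, target_count - len(minority))
    # Existing records are copied; no synthetic records are generated.
    duplicates = [rng.choice(minority) for _ in range(extra)]
    return list(records) + duplicates

# Toy data: (record_id, class_label); label 3 plays the role of the "Other" group.
toy = ([(i, 1) for i in range(50)]
       + [(i, 2) for i in range(50, 90)]
       + [(i, 3) for i in range(90, 100)])

train, test = shuffle_and_split(toy, train_fraction=0.7, seed=42)
majority_size = max(Counter(label for _, label in train).values())
balanced = oversample_by_duplication(train, lambda r: r[1], 3, majority_size, seed=42)
```

Because the added rows are copies of existing rows, the set of distinct records in `balanced` is the same as in `train`; only the class proportions change.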

Event log generation and mining local health process feature

Patients’ activities, such as when medication is dispensed and how they are transferred within the ICU ward, provide valuable information that has not yet been fully utilized in ML classification tasks. The discovery and conversion of these features pose significant challenges. Process mining is a technique for discovering complex behavioral patterns by using event logs.¹⁵ To minimize the noise contributed by incomplete patient traces, hospital admission was used as the index event. The end of care was defined as any one of the following scenarios:

Died in a hospital stay

Died within one year after hospital separation

Discharged from a hospital.

The timestamps of clinical and administrative activities, including hospital admission, procedures, selected lab tests and medication scheduling, were extracted for eligible episodes to generate an event log. For patients with multiple ICU stays, events in each ICU stay were assigned distinct indices. A series of ICU events, such as falls, respiratory arrest, unplanned line or catheter removal, were grouped and labeled as significant ICU events.

LPM is a widely employed process mining technique aimed at discovering common local paths within 3‒5 activities.¹⁵ It was employed to discover common behaviors observed in eligible IHD patients. The mining algorithm was chosen for its feasibility of conversion to feature space compared with end-to-end models (i.e. from hospital admission to discharge). The maximum number of transitions in the LPMs was set as five activities to enable long local trace discovery. The number of LPMs to be discovered was set to 10 to maximize the computational efficiency. The pattern mining included sequence, concurrency, exclusive and loop operators in the search space of the mining procedure to enable coverage of the complex behaviors of the IHD care flow.

The extracted events with timestamps were converted from CSV format to XES, a tag-based language designed to capture system behaviors.²⁰ The log projections were performed via Hidden Markov models to address large resource requirements. The discovered local health processes were then transferred and added to feature space for the classification task.
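As an illustration of the CSV-to-XES conversion step, the sketch below groups timestamped events by case and writes a minimal XES document with the Python standard library. The column names (`case_id`, `activity`, `timestamp`) and the sample rows are hypothetical, and the output omits the extension and classifier declarations that a process mining tool may expect; it is not the conversion used in this study, which was performed in ProM.

```python
import csv
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

def csv_to_xes(csv_text, case_col="case_id", activity_col="activity", time_col="timestamp"):
    """Group CSV rows by case and emit a minimal XES log as an XML string."""
    traces = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        traces[row[case_col]].append((row[time_col], row[activity_col]))

    log = ET.Element("log", {"xes.version": "1.0"})
    for case_id, events in traces.items():
        trace = ET.SubElement(log, "trace")
        ET.SubElement(trace, "string", {"key": "concept:name", "value": case_id})
        # ISO 8601 timestamps in one format and time zone sort chronologically as strings.
        for timestamp, activity in sorted(events):
            event = ET.SubElement(trace, "event")
            ET.SubElement(event, "string", {"key": "concept:name", "value": activity})
            ET.SubElement(event, "date", {"key": "time:timestamp", "value": timestamp})
    return ET.tostring(log, encoding="unicode")

# Hypothetical sample rows; MIMIC IV dates are shifted, hence the future years.
SAMPLE_CSV = """case_id,activity,timestamp
h1,ICU start,2180-01-02T08:00:00
h1,Hospital admission,2180-01-01T10:00:00
h2,Hospital admission,2181-05-03T09:30:00
"""

xes_text = csv_to_xes(SAMPLE_CSV)
```

Each hospitalization becomes one `trace` element, and its events are ordered by timestamp regardless of their order in the CSV file.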

Qualitative comparative analysis

QCA is a technique used to study how exposures interact as a causal recipe in causing a specific outcome at the case level, as well as to discover complex cases.¹⁶ This technique has demonstrated its potential in medical informatics to predict risk and outcomes after traumatic brain injury.³⁵ This approach has been widely adopted to study counterfactual effects of selected study factors and partial dependencies,¹⁶ demonstrating its suitability for identifying “difficult-to-classify” cases. The selected features, including the added process features, were calibrated into binary variables (1 or 0) to reflect the membership of each case with the outcome (the allocated class). A value of “1” represented the occurrence of a feature or a measurement above the normal range. For example, records with a cardiac troponin level above 0.04 ng/mL were converted to 1 to indicate an abnormal troponin level.²¹

A truth table was generated to describe the outcome for each possible combination of present and absent interventions. Logical minimization was used to systematically compare the truth table rows with sufficient combinations of conditions. The “difficult-to-classify” cases in this study were defined as those with raw coverage below 1.0%, minimizing the reduction in case numbers. The identified “difficult-to-classify” cases were excluded from the original set. The classification task was performed again after applying the QCA solution.
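The calibration and exclusion steps can be sketched as follows. This is a simplified illustration under stated assumptions: calibration is crisp (binary), the truth table is a count of cases per configuration, and the share of all cases matching a configuration stands in for raw coverage. Logical minimization is not shown, and the function names and toy data are hypothetical.

```python
from collections import Counter

def calibrate(value, threshold):
    """Crisp-set calibration: 1 if the value exceeds the threshold, else 0."""
    return 1 if value > threshold else 0

def truth_table(cases, conditions):
    """Count how many cases share each configuration of binary conditions."""
    return Counter(tuple(case[c] for c in conditions) for case in cases)

def split_by_configuration_share(cases, conditions, cutoff=0.01):
    """Separate cases whose configuration is shared by less than `cutoff` of all cases."""
    table = truth_table(cases, conditions)
    total = len(cases)
    kept, rare = [], []
    for case in cases:
        share = table[tuple(case[c] for c in conditions)] / total
        (rare if share < cutoff else kept).append(case)
    return kept, rare

# Toy records: troponin in ng/mL (0.04 is the cutoff quoted in the text) and one process feature.
raw = ([{"troponin": 0.10, "ED_visit": 1}] * 150
       + [{"troponin": 0.01, "ED_visit": 0}] * 49
       + [{"troponin": 0.01, "ED_visit": 1}])
cases = [{"troponin_high": calibrate(r["troponin"], 0.04), "ED_visit": r["ED_visit"]}
         for r in raw]

kept, rare = split_by_configuration_share(cases, ["troponin_high", "ED_visit"], cutoff=0.01)
```

In this toy set, the single case with normal troponin and an emergency department visit accounts for 0.5% of cases and is set aside, while the other 199 cases are retained for classification.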

Evaluation

This integrated approach was evaluated across three layers:

Quality of components, reflected in the performance of the LPM,¹⁵ QCA¹⁶ and classification model triangulation⁸

Quality of task objectives, measured by model-based and case-level performance^16,36

Quality of clinical evidence, assessed by the generalization of the application to complex real-life settings, tolerance of bias and predictive validity.

The quality of LPM features was evaluated against five criteria. Support measures how often the pattern described by the LPM is found in the event log (Equation (1)). Confidence is the ratio of events in the event log, for the activities described in the pattern, that belong to instances of the pattern (Equation (2)). Determinism indicates the level of predictability of future behaviors and reflects the average number of enabled transitions during replay of the pattern instances on the pattern (Equation (3)). Language fit is defined as the ratio of behaviors allowed by the local process that are observed in the event log (Equation (4)). Coverage measures the proportion of events in the log that can be identified by the pattern (Equation (5)). The overall performance of the LPM was measured using the weighted average of these metrics, known as the fitness score.¹⁵

\text{Support} = \frac{\sum_{i \in L} S_{i}}{\sum_{i \in L} S_{i} + 1}

(1)

where S_i is the number of times a segment of the LPM is found in event log L; Equation (1) maps this count onto the interval [0, 1).

\text{Confidence} = \frac{\sum_{a \in L} E_{a} \in LN}{E_{a}(L)}

(2)

where L is the event log, LN is the mined local process pattern, and E_a(L) is the frequency of event E_a observed in L. For an event E_a from L, the confidence is the ratio of the occurrences of E_a in L that fit LN.

\text{Determinism} = \frac{\sum_{E_{a} \in R(LN)} W_{L}(E_{a})}{\sum_{E_{a} \in R(LN)} W_{L}(E_{a}) D(E_{a})}

(3)

where R(LN) is the set of reachable states of mining pattern LN; W_L is a function assigning the number of times a state is reached while replaying the fitting segments of event log L on pattern LN; and D is a function assigning the number of transitions enabled in a given state of pattern LN. A model that allows more behaviors may have lower determinism.

\text{Language fit} = \frac{| \theta \in \mathcal{L}(LN) |}{| \mathcal{L}(LN) |}

(4)

where \mathcal{L}(LN) represents the behaviors allowed by mining pattern LN and θ represents the observations from event log L. Language fit is equal to 0 for any LN containing a loop.

\text{Coverage} = \frac{\text{Number of events from the log } L \text{ in } LN}{\text{Number of all events in the log } L}

(5)
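Two of these criteria are simple enough to sketch directly. The snippet below implements Equation (1) as stated and approximates Equation (5) by counting the events whose activity appears in the pattern. Treating "events in LN" as membership in the pattern's activity set is an assumption made here for illustration, and the activity names are hypothetical. Confidence, determinism and language fit require replaying the log on the Petri net and are not shown.

```python
def support(pattern_instance_count):
    """Equation (1): map a non-negative instance count onto the interval [0, 1)."""
    return pattern_instance_count / (pattern_instance_count + 1)

def coverage(event_log, pattern_activities):
    """Equation (5), approximated by activity membership in the pattern."""
    events = [activity for trace in event_log for activity in trace]
    if not events:
        return 0.0
    covered = sum(1 for activity in events if activity in pattern_activities)
    return covered / len(events)

# Toy event log: one list of activity names per hospitalization.
toy_log = [
    ["Hospital admission", "ICU start", "Imaging", "Discharge"],
    ["Hospital admission", "Discharge"],
]
toy_pattern = {"Hospital admission", "ICU start"}

toy_support = support(9)                       # nine pattern instances found
toy_coverage = coverage(toy_log, toy_pattern)  # three of six events are covered
```

The support transformation grows toward 1 as more instances are found but never reaches it, which keeps the metric on the same scale as the other criteria.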

QCA was evaluated on consistency, defined as the degree to which a solution leads to the outcome of interest (Equation (6)), and on coverage, which reflects how much of the outcome is covered by each solution (Equation (7)).¹⁶

\text{Consistency} = \frac{\sum_{i = 1}^{n} \min(X_{i}, Y_{i})}{\sum_{i = 1}^{n} X_{i}}

(6)

\text{Coverage} = \frac{\sum_{i = 1}^{n} \min(X_{i}, Y_{i})}{\sum_{i = 1}^{n} Y_{i}}

(7)

where X is a certain feature, Y is the outcome, n is the total number of cases, and i is the case index.
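Equations (6) and (7) translate directly into code. The following minimal sketch assumes binary (crisp-set) membership scores held in plain Python lists; the function names and toy values are illustrative only.

```python
def consistency(x, y):
    """Equation (6): sum of min(X_i, Y_i) divided by the sum of X_i."""
    total_x = sum(x)
    if total_x == 0:
        raise ValueError("no case is a member of the condition")
    return sum(min(a, b) for a, b in zip(x, y)) / total_x

def qca_coverage(x, y):
    """Equation (7): sum of min(X_i, Y_i) divided by the sum of Y_i."""
    total_y = sum(y)
    if total_y == 0:
        raise ValueError("no case shows the outcome")
    return sum(min(a, b) for a, b in zip(x, y)) / total_y

# Toy memberships for six cases: x is the condition, y is the outcome.
x = [1, 1, 1, 0, 0, 1]
y = [1, 1, 0, 1, 1, 1]

toy_consistency = consistency(x, y)  # 3 of the 4 condition members show the outcome
toy_coverage = qca_coverage(x, y)    # 3 of the 5 outcome cases are covered
```

The two measures share a numerator and differ only in the denominator, so a condition can be highly consistent yet cover a small part of the outcome, which is the situation the 1.0% raw coverage cutoff targets.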

The importance of selected features with classification labels was evaluated via the correlation matrix both before and after adding process features. Traditional model evaluation metrics such as the area under the curve (AUC) and/or F1 score focus on the performance of individual models. In a multiclass task, the weighted AUC score is calculated by weighting the score of each group by its support, which is the number of true instances for each class label. The overall AUC score in this study is the average AUC of all possible pairwise combinations of the three classes. The weighted F1 score, regarded as a harmonic mean of the precision and recall, is calculated via Equation (8) and weighted by the number of true instances for each class label.³⁶

\text{F1 score} = \sum_{i = 1}^{N} W_{i} \frac{2}{\frac{1}{\text{Recall}_{i}} + \frac{1}{\text{Precision}_{i}}}

(8)

where N is the number of classes; W_i equals the number of samples in class i divided by the total number of samples; Recall_i equals the number of true positives divided by the total number of positives; and Precision_i is calculated by dividing the number of true positives by the total number of positive predictions.
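A minimal sketch of Equation (8) in plain Python is given below. It uses the algebraically equivalent form 2PR/(P + R) for the per-class harmonic mean and assigns an F1 of 0 to a class with no true positives, which is a convention assumed here rather than one stated in the text.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Equation (8): per-class F1 weighted by the share of true instances per class."""
    total = len(y_true)
    score = 0.0
    for label, n_true in Counter(y_true).items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        if tp == 0:
            f1 = 0.0  # assumed convention when precision or recall is zero
        else:
            precision = tp / n_pred
            recall = tp / n_true
            f1 = 2 * precision * recall / (precision + recall)
        score += (n_true / total) * f1
    return score

# Toy labels for the three DRG groups (1 = bypass, 2 = PCI, 3 = other).
toy_true = [1, 1, 1, 2, 2, 3]
toy_pred = [1, 1, 2, 2, 2, 3]
toy_score = weighted_f1(toy_true, toy_pred)
```

In this toy example the per-class F1 scores are 0.8, 0.8 and 1.0 with weights 3/6, 2/6 and 1/6, giving a weighted F1 of 5/6.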

However, the nature of these metrics does not reflect performance on individual cases. A case-based measure—the proportion of errors—was used to calculate the proportion of cases that were never classified correctly by any of the six selected models. The performance was compared both before and after adding process features, as well as after applying the QCA solutions. The identified patterns were assessed on the basis of their clinical relevance, generalizability to complex real-life settings, tolerance to bias, and predictive validity.
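The case-based proportion of errors can be sketched as follows: a case counts as an error only if every model misclassifies it. The model names and predictions are toy values, and the function name is hypothetical.

```python
def proportion_never_correct(y_true, predictions_by_model):
    """Share of cases that none of the models classifies correctly."""
    never = sum(
        1
        for i, truth in enumerate(y_true)
        if all(preds[i] != truth for preds in predictions_by_model.values())
    )
    return never / len(y_true)

# Toy example with four cases and three of the six model types.
toy_truth = [1, 2, 3, 1]
toy_predictions = {
    "RF": [1, 2, 1, 2],
    "DT": [1, 3, 1, 2],
    "LR": [2, 2, 1, 2],
}
toy_error = proportion_never_correct(toy_truth, toy_predictions)
```

Here the first two cases are each classified correctly by at least one model, while the last two are missed by all three, so the proportion of errors is 0.5.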

The event log conversion and LPM were performed with the ProM plugin version 6.1, an open-source process mining software.³⁷ Event extraction, data preprocessing, analysis, classification model generation and visualization were performed in Python.

Results

A total of 916 eligible patients were identified from MIMIC IV (Table 1). Of these, 72.5% were males, and 27.5% were females. The overall average age at admission was 73 years, with no significant gender variation. A total of 1545 Medicare-claimed hospitalizations were extracted from eligible patients. The average hospital length of stay was 7.4 days for females and 7.5 days for males.

Table 1.

Baseline characteristics of eligible patients with ischemic heart disease (IHD).

Measure	Administrative gender: Males	Administrative gender: Females	Total
Baseline demographics
Patient, n (%)	664 (72.5)	252 (27.5)	916 (100.0)
Medicare claimed hospitalization, n (%)	1075 (69.6)	470 (30.4)	1545 (100.0)
Age at admission in years, n (%)
20–59	64 (65.3)	34 (34.7)	98 (100.0)
60–69	138 (73.0)	51 (27.0)	189 (100.0)
70–79	498 (71.8)	196 (28.2)	694 (100.0)
80–89	87 (65.9)	45 (34.0)	132 (100.0)
90+	288 (66.7)	144 (33.2)	432 (100.0)
Charlson comorbidity score, mean (SD)	6.44 (2.24)	6.93 (2.40)	6.59 (2.3)
Hospital length of stay,^a mean (SD)	7.5 (5.8)	7.4 (5.6)	7.5 (5.7)
In-hospital mortality, n (%)	34 (65.4)	18 (34.6)	52 (100.0)
One-year mortality, n (%)	84 (34.7)	158 (65.3)	242 (100.0)
Cardiac biomarkers
Troponin T (ng/mL), mean (SD)	2.0 (3.9)	1.96 (3.4)	2.0 (3.7)
Creatine kinase MB (ng/mL), mean (SD)	47.7 (87.9)	38.8 (68.3)	44.46 (81.5)
NT-proBNP (pg/mL), mean (SD)	7144.1 (10075.7)	9314.5 (11083.2)	8062 (10516.4)

Note. ^a The hospital length of stay in the study was not adjusted for leave days because the data lacked the relevant information.

A total of 1277 records (82.7%) were allocated to Class 1 and Class 2. After oversampling, the cases were evenly distributed across the three categories, with approximately 30% in each category (Table 2).

Table 2.

Sampling size by classification group before and after sampling.

Label class	Before oversampling, N (%)	After oversampling, N (%)
Class 1: Bypass group	683 (44.4)	683 (35.1)
Class 2: PCI group	594 (38.4)	594 (30.5)
Class 3: Other	268 (17.3)	668 (34.3)
Total	1545 (100.0)	1945 (100.0)

Note. Oversampling in this study involves duplicating existing records rather than generating new records.

Eight out of the 10 health process features were identified from the LPM after two processes that did not align with clinical facts were excluded (Table 3). The fitness score of the mined local processes ranged from 0.58 to 0.64. ICU stay and clinical imaging, such as X-ray, were identified as frequent clinical events in this cohort.

Table 3.

Local process mining and feature engineering.

Petri net of mined local health process (fitness score, feature label)	Feature engineering
(0.641, Imaging)	Records completing a cycle of imaging at their first ICU stay (including Start and End event) were recorded as 1; otherwise they were recorded as 0.
(0.638, ICU_procedure)	Records completing a cycle of any type of clinical procedures at their first ICU stay (including Start and End event) were recorded as 1; otherwise they were recorded as 0.
(0.638, ED_visit)	Records completing a cycle of emergency department stay (including Start and End event) were recorded as 1; otherwise they were recorded as 0.
(0.632, Sig_Event)	Records completing a cycle of any type of significant event in the first ICU stay (including Start and End event) were recorded as 1; otherwise they were recorded as 0.
(0.596, ICU_Sig_Event)	Records including a start of the first ICU stay and at least one observed significant event (including the alert and the end event) were recorded as 1; otherwise they were recorded as 0.
(0.595, Admin_ICU)	Records including a hospital admission and a start of the first ICU treatment were recorded as 1; otherwise they were recorded as 0.
(0.595, ICU_Access_line)	Records including an end of the first ICU stay and peripheral access lines operation were recorded as 1; otherwise they were recorded as 0.
(0.582, ICU_Ventilation)	Records including a start of the first ICU treatment and under ventilation in the first ICU stay were recorded as 1; otherwise they were recorded as 0.

Note. “Start” and “End” in the circles of each local process above represent artificial notations for the input place (i.e. the starting point of a local process) and the output place (i.e. the ending point of a local process). For activities with a duration, shown in rectangles, “Start” and “End” indicate the beginning and end time points of the activity.
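The binary encoding rules in Table 3 can be illustrated with a simplified sketch that marks a feature as 1 when all of its required activities occur in a trace. The activity names are hypothetical, and the rules below ignore event order and do not restrict matches to the first ICU stay, so they only approximate the definitions in the table.

```python
# Hypothetical activity names; only three of the eight features are shown.
FEATURE_RULES = {
    "Admin_ICU": ("Hospital admission", "ICU start"),
    "Imaging": ("Imaging start", "Imaging end"),
    "ED_visit": ("ED start", "ED end"),
}

def process_features(trace):
    """Map one trace (a list of activity names) to binary process features."""
    seen = set(trace)
    return {
        name: int(all(activity in seen for activity in required))
        for name, required in FEATURE_RULES.items()
    }

# A trace with a completed ED stay, an ICU start, and imaging that never ended.
example_trace = ["ED start", "ED end", "Hospital admission", "ICU start", "Imaging start"]
features = process_features(example_trace)
```

For this trace, `Admin_ICU` and `ED_visit` are encoded as 1, while `Imaging` is 0 because the cycle was not completed.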

The absolute correlation coefficients for process features, which ranged from 0.24 to 0.42, were higher than those for non-process features, which ranged from 0.02 to 0.36 (Figure 1). The three local patterns “ICU_Sig_Event”, “Sig_Event” and “ICU_Ventilation” had the highest absolute correlation coefficients with classification labels among the process features (0.42, 0.42 and 0.38, respectively). “ED_visit” was positively correlated with DRG labels (0.24), whereas the remaining process features were negatively associated with DRG labels (−0.30, −0.32, −0.42, −0.42, −0.28, −0.28 and −0.38, respectively).

Figure 1.

Correlation heatmaps of process and non-process features and outcomes. The figure presents the correlation between the machine learning features and classification labels. A higher absolute correlation indicates a stronger association between the paired variables. Process features are drawn as squares and labeled in red; the class label is drawn as a hexagon and labeled in green; and non-process features are drawn as circles in a dark color.

A total of 56 unique combinations were identified from the QCA, with 28 configurations having raw coverage lower than 1.0%. The model performance gradually improved in both the weighted AUC and weighted F1 score in all six models after adding process features. The improvement was further enhanced after the application of the QCA solutions (Table 4).

Table 4.

Model-oriented performance evaluation by scenario.

Model	Original set: weighted AUC	Original set: weighted F1 score	Original set + PM features: weighted AUC	Original set + PM features: weighted F1 score	Original set + PM features + QCA solution: weighted AUC	Original set + PM features + QCA solution: weighted F1 score
RF	0.936	0.832	0.956	0.870	0.999	0.987
SVM	0.874	0.728	0.896	0.818	0.991	0.933
LR	0.837	0.679	0.883	0.738	0.893	0.742
DT	0.825	0.768	0.852	0.802	1.000	1.000
LDA	0.824	0.656	0.877	0.740	0.889	0.693
KNN	0.811	0.749	0.866	0.821	0.889	0.854

Note. The table is presented in descending order of weighted AUC score of the original set.

A raw coverage of 1.0% was chosen as the cutoff point because it minimized the “sacrifice”, that is, the number of excluded cases. A total of 105 records (6.8% of 1545) matching these QCA combinations were excluded from the original data. The proportion of errors fell by 47% in relative terms after the inclusion of process features, from 5.48% to 2.91%. This rate further decreased to 0.0% after the implementation of the QCA solutions, as detailed in Table 5.
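The two percentages in this paragraph follow from the reported figures. As a worked check, using only the values stated in the text:

```latex
\text{Relative reduction} = \frac{5.48 - 2.91}{5.48} = \frac{2.57}{5.48} \approx 0.47,
\qquad
\text{Excluded share} = \frac{105}{1545} \approx 0.068
```

The 47% figure is therefore a relative reduction; the corresponding absolute change in the error proportion is 2.57 percentage points.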

Table 5.

Changes in error rates by scenario.

Scenario	Error (%)	Point difference^a
Original set	5.48	0.00
Original set + PM features	2.91	−0.03
Original set + PM features + QCA solution	0.00	−0.55

Note. ^a Point difference is presented as the difference between the index scenario and the original set. A negative sign indicates a decrease in the error proportion and an improvement in overall performance.

Discussion

Quality of classification task

For a given classification task, a model’s performance ceiling usually depends on feature quality,^22–24 inter-annotator agreement,²⁵ and whether unknown information includes deterministic factors linked to the desired outcome.²⁴ Correlation-based feature selection techniques, such as principal component analysis (PCA),⁷ are commonly employed in classification tasks, but their clinical relevance and data scope often receive insufficient attention.²⁴ Moreover, current approaches, such as imputation, aim to predict missing information on the basis of variances in existing observations. However, imputation carries the risk of introducing false information via different similarity measures.⁴ Existing modeling logic seeks to identify universal patterns from collected data, using techniques such as error minimization through cross-validation, on the assumption that the training features encompass adequate distinct information for the desired outcomes.²⁶ Consequently, a purely variance-based approach may neglect the evidence-based medicine principles described in Cumpston et al.²⁷ from a patient-centered perspective (the case level). Insufficient attention at this level thereby reduces the quality of health evidence derived from these tasks.

In our approach, by excluding a small number of records with a clear strategy, we can ensure a more robust classification performance. The excluded cases can be investigated separately by requesting additional information through queries from different hospital systems or consultations with the clinicians in charge.

Importance of process features in a healthcare classification task

Process feature engineering is an umbrella term whose scope has not yet been clearly defined. It can take different forms related to operational processes and resource supply chains, on the basis of the strategy used for event log generation and the research question. In this work, we focus on the contributions of patients’ clinical information and their sequential interactions. To date, process mining has shown its capacity to identify issues associated with urgent public outbreaks and routine care. In 2020, Chang et al.²⁸ used process mining techniques to study how COVID-19 impacts the length of stay and delay of process in emergency department for acute cerebrovascular patients during the pandemic. Alharbi et al.²⁹ used process mining to detect hidden healthcare sub-processes amongst 356 patients with “altered mental status” diagnoses. Chen et al.³⁰ combined process information as features to predict unplanned ICU readmission in MIMIC IV. However, how to incorporate process information into a given classification task has not been systematically studied.

In addition to the use of traditional clinical features, our findings indicated that the occurrence of significant events during the first ICU stay may serve as a potential early indicator for estimating hospital costs. In-hospital falls, one of the significant events included in the LPM, align with existing evidence that factors such as staffing and unit design may influence fall risk. Since 2008, fall-related injuries have been considered in the CMS reimbursement and DRG regulation guidelines.³¹ By incorporating this process information, we observed improvements in the DRG classification performance at both the model and case levels. However, this study grouped various significant events (such as ventricular drain and unplanned catheter removal) because of computational complexity. Future research is needed to explore the variations in hospital costs associated with the occurrence and sequence of significant events in the ICU for IHD patients.

Importance of QCA in a healthcare classification task

QCA has been widely used in the health literature to understand complex configurations and asymmetrical effects. Cairns et al.³³ utilized QCA to identify pathways linking health and place in geographic areas with resilient outcomes. They reported that the social environment appears to be more important than the natural environment in influencing pathways to health resilience in the selected areas. Triangulating the QCA to understand configuration information helps us identify nonlinear relationships between the feature space and classification labels.

Our findings from the QCA indicated that the set of interrelated process events that positively impact clinical interventions could be very different from the set of events that hinder the effectiveness of the intervention. Certain clinical events that positively affect future clinical outcomes do not necessarily have a reverse effect when their order is changed or when they are removed from the clinical trace. For example, 274 patients who underwent processes such as “imaging”, “ICU_procedure”, “Sig_event”, “Admin_ICU”, “ICU_Access Lines” and “ICU_ventilation” but did not experience “ED_visits” or “ICU_sig_event” were classified as Class 1 bypass surgery with probabilities of positive and negative configurations of 90.5% and 9.5%, respectively. However, by simply changing “Sig_event” from 1 to 0 while keeping the rest of the configuration unchanged, we observed that all 20 cases in this process were classified as Class 1. This suggests that certain process-related factors, such as the occurrence and absence of “ED_visits” may have asymmetrical effects on classification outcomes, which might not have been adequately addressed in previous studies.

After QCA solutions were applied to cases with rare-coverage configurations, case-level performance improved noticeably. In addition, excluding cases according to explicit rules mitigates bias substantially more than excluding records at random without a defined strategy.
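A rule of this kind can be sketched as a raw-coverage filter. The Python example below assumes the crisp-set definition of raw coverage (outcome cases in a configuration divided by all outcome cases) and a 1% threshold; the function name `rare_configurations`, the threshold default, and the toy data are hypothetical, and the study's own QCA software and settings may differ.

```python
from collections import Counter

def rare_configurations(configs, outcomes, threshold=0.01):
    """Return the configurations whose raw coverage is below `threshold`.

    Raw coverage (crisp-set form, assumed here) is the number of outcome
    cases in a configuration divided by the total number of outcome cases.
    """
    total_positive = sum(outcomes)
    if total_positive == 0:
        return set()
    positives = Counter(cfg for cfg, y in zip(configs, outcomes) if y == 1)
    return {
        cfg for cfg in set(configs)
        if positives[cfg] / total_positive < threshold
    }

# Hypothetical toy data: one common configuration and one rare one.
configs = [(1, 1)] * 200 + [(0, 1)]
outcomes = [1] * 200 + [1]

rare = rare_configurations(configs, outcomes)
# Indices of cases that an explicit rule would route to separate handling.
review_idx = [i for i, cfg in enumerate(configs) if cfg in rare]
print(rare, review_idx)
```

Because the rule is stated in terms of configurations, the same cases are flagged on every run, which is what distinguishes it from random exclusion.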

Need for an integrated approach rather than traditional pipeline

Traditionally, ML engineers have followed a linear pipeline that includes data preprocessing, feature elimination and selection, model training, and performance evaluation.⁴² This pipeline approach relies on existing data as the “ground truth” for performance validation. However, inherent assumptions and biases may unconsciously influence the quality assessment and the conclusions drawn.

This study proposed an integrated approach that enhanced quality across three layers:

First, component evaluation examined the quality of key techniques using fitness scores, coverage proportions, and weighted F1 and AUC scores for the LPM, QCA, and classification models, respectively. By using LPM to convert local process patterns into features, additional hidden configurational information can be incorporated into classifier training. Triangulating with the QCA solution enables the identification of cases that require special consideration and enhances the validity of the findings.

Second, the evaluation accounted for both case-level and variation-level uncertainties, enabling the identification of individual needs while also ensuring overall performance.

Third, clinical relevance was assessed based on health insights from each component, enabling flexible cooperation among teams with varied expertise. Systematically identifying and addressing “difficult-to-treat” cases enhanced the generalizability of the approach to real-life settings.
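The case-level view in the second layer can be illustrated with a short Python sketch. It computes the share of cases misclassified by at least one model and lists those cases, which is one plausible reading of a case-level error rate across triangulated classifiers. The function name `case_level_error_rate`, the model names, and the labels are hypothetical; three toy models stand in for the six used in the study.

```python
def case_level_error_rate(y_true, predictions_by_model):
    """Return (rate, indices) for cases misclassified by at least one model."""
    n = len(y_true)
    flagged = [
        any(preds[i] != y_true[i] for preds in predictions_by_model.values())
        for i in range(n)
    ]
    hard_cases = [i for i, bad in enumerate(flagged) if bad]
    return len(hard_cases) / n, hard_cases

# Hypothetical labels and predictions for five cases.
y_true = [0, 1, 2, 1, 0]
predictions_by_model = {
    "model_a": [0, 1, 2, 1, 0],
    "model_b": [0, 1, 1, 1, 0],  # wrong on case 2
    "model_c": [0, 1, 2, 1, 1],  # wrong on case 4
}

rate, hard_cases = case_level_error_rate(y_true, predictions_by_model)
print(rate, hard_cases)
```

A model-level metric such as weighted F1 can stay high while this rate is nonzero, which is why the two views are reported side by side.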

The process features discovered from the LPM illustrate not only the occurrence and absence of multiple events but also the controlled relationships among them. The path “ICU_Access_line” represents patients transferred from the ICU ward for a peripheral access line operation, followed by two hidden activities. One of these activities may trigger a side effect, necessitating the operation. This information cannot be fully reflected by the traditional pipeline through feature selection or elimination. However, the model's performance ceiling cannot be raised simply by adding configuration information. Triangulating the identified process features into the QCA added a further layer that helps identify cases lacking sufficient information, which further raised the inherent performance ceiling.
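To make the idea of a process feature concrete, the Python sketch below turns event traces into binary features that record whether a trace contains a given ordered pattern. This is a deliberate simplification: local process models can also express choice, concurrency, and hidden activities, none of which is modeled here. The helper names, the case identifiers, and the event labels are hypothetical and are not taken from the study's event log.

```python
def contains_in_order(trace, pattern):
    """True if the events in `pattern` occur in `trace` in order (gaps allowed)."""
    remaining = iter(trace)
    # `event in remaining` advances the iterator until the event is found,
    # so each later event must appear after the previous match.
    return all(event in remaining for event in pattern)

def process_features(traces, patterns):
    """Map each case to a dict of binary pattern-occurrence features."""
    return {
        case_id: {
            name: int(contains_in_order(trace, pattern))
            for name, pattern in patterns.items()
        }
        for case_id, trace in traces.items()
    }

# Hypothetical traces: the same two events in different orders.
traces = {
    "case_1": ["ED_visit", "imaging", "ICU_admission", "access_line"],
    "case_2": ["access_line", "imaging"],
}
patterns = {"imaging_then_access_line": ["imaging", "access_line"]}

features = process_features(traces, patterns)
print(features)
```

Both cases contain the same events, yet only the first matches the pattern. An occurrence-only feature would not separate them, which is the kind of order information a conventional feature selection step does not recover.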

Limitations

This integrated approach has some limitations and needs further improvement. First, LPM is not inherently goal-oriented: the discovered common local patterns may or may not be directly related to the classification task.¹⁵ Future studies may therefore require a goal-oriented process miner to establish causal relationships. In 2014, Yan et al. proposed an agent-oriented goal-mining approach for pattern discovery in executed processes using event logs.⁴¹ However, the complexity of the end-to-end model mined by their approach makes it challenging to translate into a feature space. Systematically incorporating duration and configuration in this translation is essential.³²

QCA techniques enabled the identification of “difficult-to-classify” cases. However, QCA demands significant computing resources and is most efficient for small-N problems, which makes it a prevalent choice in qualitative research.³⁴ There is a pressing need for a scalable QCA approach to address big data challenges.

Whether PM can augment hidden information, or whether QCA can address “difficult-to-classify” cases and achieve perfect classifiers (a guarantee of 100% performance), has not yet been fully explored and validated using this integrated framework. This is because, in real-world data, while cases may share common characteristics, the absence of determining factors in feature spaces cannot be entirely resolved by this integrated approach. Future research is needed to address this challenge effectively. Moreover, the impact of different stratifications (e.g. age, race) on the quality of this integrated approach requires further investigation.

Conclusion

The proposed integrated approach demonstrates its ability to enhance overall classification quality through its clinical relevance, improved model performance, and reduced case-level error rates. This approach minimized “difficult-to-classify” cases and combined enriched process information in the feature space. The findings also add value to quality enhancement across three layers. Our findings indicate that certain clinical events that positively affect future clinical outcomes do not necessarily have the opposite effect when their order is changed or when they are removed from the clinical trace. This triangulation approach also shows potential for uncovering asymmetrical effects among process-related factors, which have not been adequately addressed in previous studies.

Supplemental Material

sj-docx-1-dhj-10.1177_20552076251314097 - Supplemental material for Enhance health evidence quality in classification tasks: A triangulation approach utilizing case-based reasoning and process features

Supplemental material, sj-docx-1-dhj-10.1177_20552076251314097 for Enhance health evidence quality in classification tasks: A triangulation approach utilizing case-based reasoning and process features by Ruihua Guo, Ross Smith, Qifan Chen, Angus Ritchie and Simon Poon in DIGITAL HEALTH

Footnotes

Contributorship

RHG contributed to conceptualization, design, data analysis, output interpretation, and writing of the manuscript. RS contributed to the data analysis, output interpretation and editing. QFC contributed to output interpretation and editing. AR contributed to manuscript editing and guidance. SP contributed to conceptualization, design, output interpretation, manuscript editing and guidance.

Data availability

Data included in this study were sourced from a publicly accessible critical care database. The data that support the findings of this study are available from PhysioNet. Restrictions apply to the availability of these data, which were used under license for this study. Data are available at with the permission of PhysioNet.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Ethical Approval

The dataset supporting the conclusions of this article is available in the Medical Information Mart for Intensive Care version IV (MIMIC-IV).¹²–¹⁴ This database is a public de-identified database; thus, informed consent and approval from the Institutional Review Board at the Beth Israel Deaconess Medical Center were waived. Analysts completed the required “Data or Specimens Only Research” course and certificate to use these data (Record ID: 55008135). All methods were performed in accordance with the relevant guidelines and regulations.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

ORCID iDs

Ruihua Guo

Supplemental material

Supplemental material for this article is available online.

References

1. Amrane, Oukid, Gagaoua, et al. Breast cancer classification using machine learning. In: 2018 Electric Electronics, Computer Science, Biomedical Engineerings’ Meeting (EBBT). Turkey, April 2018: 1–4.

2. Zhu, Tang, et al. Machine learning for the preliminary diagnosis of dementia. Sci Program 2020; 2020: 5629090.

3. O’Reilly-Shah, Gentry, Walters, et al. Bias and ethical considerations in machine learning and the automation of perioperative risk assessment. Br J Anaesth 2020; 125: 843–846.

4. Young, Weckman, Holland. A survey of methodologies for the treatment of missing values within datasets: limitations and benefits. Theor Issues Ergon Sci 2011; 12: 15–43.

5. Wilkinson, Arnold, Murray, et al. Time to reality check the promises of machine learning-powered precision medicine. Lancet Digital Health 2020 Dec 1; 2: e677–e680.

6. Valencia MMA. Principles, scope, and limitations of the methodological triangulation. Invest Educ Enferm Epub ahead of print 2022; 40(2): e03.

7. Jolliffe, Cadima. Principal component analysis: a review and recent developments. Philos Trans A Math Phys Eng Sci 2016; 374: 20150202.

8. Dewasiri NJTKGP, Banda YKW, Azeez. Triangulation approaches in finance research. Colombo J Multi-Discip Res 2018; 3: 87–112.

9. Honigberg, Patel, Pandey, et al. Trends in hospitalizations for heart failure and ischemic heart disease among US adults with diabetes. JAMA Cardiol 2021; 6: 354–357.

10. Liu, Capurro, Nguyen, et al. Early prediction of diagnostic-related groups and estimation of hospital cost by processing clinical notes. NPJ Digit Med 2021; 4: 103.

11. Dugani, Moran, Bonow, et al. Ischemic heart disease: cost-effective acute management and secondary prevention. In: Prabhakaran, Anand, Gaziano (eds) Cardiovascular, respiratory, and related disorders. 3rd ed. Washington, DC: The International Bank for Reconstruction and Development/The World Bank, 2017, pp.135–155.

12. Johnson, Bulgarelli, Pollard, et al. MIMIC-IV (version 2.0). PhysioNet. 2022. https://doi.org/10.13026/7vcr-e114.

13. Johnson AEW, Bulgarelli, Shen, et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci Data 2023; 10: 1.

14. Goldberger, Amaral, Glass, et al. Physiobank, PhysioToolkit, and Physionet: components of a new research resource for complex physiologic signals. Circulation 2000; 101: E215–E220.

15. Tax, Sidorova, Haakma, et al. Mining local process models. J Innov Digit Ecosyst 2016; 3: 183–196.

16. Ragin. Using qualitative comparative analysis to study causal complexity. Health Serv Res 1999; 34: 1225–1239.

17. Centers for Medicare & Medicaid Services (CMS). Regulations and guidance, https://www.cms.gov/marketplace/resources/regulations-guidance (accessed 15 March 2024).

18. Yap, Rani, Rahman HAA, et al. An application of oversampling, undersampling, bagging and boosting in handling imbalanced datasets. In: Proceedings of the first international conference on advanced data and information engineering (DaEng-2013). December 2013; Kuala Lumpur, Malaysia. Singapore: Springer, 2014, pp.13–22.

19. Pedregosa F. Scikit-learn: machine learning in Python. J Mach Learn Res 2011; 12: 2825.

20. van Dongen, Shabani. Relational XES: data management for process mining (BPM reports; Vol. 1502). 2015.

21. McIlvennan, Urra, Helmkamp, et al. Magnitude of troponin elevation in patients with biomarker evidence of myocardial injury: relative frequency and outcomes in a cohort study across a large healthcare system. BMC Cardiovasc Disord 2023; 23: 151.

22. Pinkston. Justification for the hierarchical pyramid of evidence-based medicine and a defense of randomization. In: Evidence and hypothesis in clinical medical science [e-book]. Cham, Switzerland: Springer International Publishing, 2020, pp.101–127.

23. Uddin, Lee, Rizvi, et al. Proposing enhanced feature engineering and a selection model for machine learning processes. Appl Sci 2018; 8: 646.

24. Roe, Jawa, Zhang, et al. Feature engineering with clinical expert knowledge: a case study assessment of machine learning model complexity and performance. PLoS ONE 2020; 15: e0231300.

25. Boguslav, Cohen. Inter-annotator agreement and the upper limit on machine performance: evidence from biomedical natural language processing. Stud Health Technol Inform 2017; 245: 298–302.

26. Stone. Cross-validatory choice and assessment of statistical predictions. J R Stat Soc B (Methodol) 1974; 36: 111–133.

27. Cumpston, McKenzie, Welch, et al. Strengthening systematic reviews in public health: guidance in the Cochrane Handbook for Systematic Reviews of Interventions, 2nd edition. J Public Health (Oxf) 2022; 44: e588–e592.

28. Chang, Yoon, et al. Impact of COVID-19 pandemic on the overall diagnostic and therapeutic process for patients of emergency department and those with acute cerebrovascular disease. J Clin Med 2020; 9: 3842.

29. Alharbi, Bulpitt, Johnson. Towards unsupervised detection of process models in healthcare. In: Building continents of knowledge in oceans of data: the future of co-created eHealth 2018. Amsterdam: IOS Press, 2018, pp.381–385.

30. Chen, Tam, et al. Outcome-oriented predictive process monitoring to predict unplanned ICU readmission in MIMIC-IV database. In: ECIS 2022. Timisoara, Romania: ECIS Proceedings. June 2022.

31. Inouye, Brown, Tinetti. Medicare nonpayment, hospital falls, and unintended consequences. N Engl J Med 2009; 360: 2390–2393.

32. Elkhovskaya, Kovalchuk. Feature engineering with process mining technique for patient state predictions. In: International conference on computational science. Cham: Springer International Publishing, 2021, pp.584–592.

33. Cairns, Wistow, Bambra. Making the case for qualitative comparative analysis in geographical research: a case study of health resilience. Area 2017; 49: 369–376.

34. Rutten. Applying and assessing large-N QCA: causality and robustness from a critical realist perspective. Sociol Methods Res 2022; 51: 1211–1243.

35. Gorji, Zador, Poon. A configurational analysis of risk patterns for predicting the outcome after traumatic brain injury. In: AMIA Annual symposium proceedings 2017. November 2017; Washington, DC, USA: American Medical Informatics Association, 2018, p.780.

36. Grandini, Bagli, Visani. Metrics for multi-class classification: an overview. arXiv preprint arXiv:2008.05756; 2020. https://doi.org/10.48550/arXiv.2008.05756.

37. Van der Aalst, van Dongen, Günther, et al. ProM: the process mining toolkit. In: Proceedings of the BPM 2009 Demonstration Track (BPMDemos 2009). September 2009; Ulm, Germany: CEUR-WS.org, 2009, pp.1–4.

38. Meteor. Separation [Internet]. Canberra: Metadata Online Registry. https://meteor.aihw.gov.au/content/327268 (accessed 10 October 2024).

39. Yadav, Shukla. Analysis of k-fold cross-validation over hold-out validation on colossal datasets for quality classification. In: 2016 IEEE 6th International conference on advanced computing (IACC); February 2016. Bhimavaram, India: IEEE, 2016, pp.78–83.

40. Yang, Shami. On hyperparameter optimization of machine learning algorithms: theory and practice. Neurocomputing 2020; 415: 295–316.

41. Yan, Liao, et al. Mining agents’ goals in agent-oriented business processes. ACM Trans Manage Inform Syst 2014 Dec 29; 5: 1–22.

42. Hapke, Nelson. Building machine learning pipelines. Sebastopol: O'Reilly Media, 2020.
