Abstract
Objective
Although the evaluation of left ventricular ejection fraction (LVEF) in patients with atrial fibrillation (AF) or atrial flutter (AFL) is crucial for appropriate medical management, the prediction of reduced LVEF (<50%) with AF/AFL electrocardiograms (ECGs) lacks evidence. This study aimed to investigate deep-learning approaches to predict reduced LVEF (<50%) in patients with AF/AFL ECGs and easily obtainable clinical information.
Methods
Patients with 12-lead ECGs of AF/AFL and echocardiography were divided into those with LVEF <50% and ≥50%. A convolutional neural networks-based model customized to the study (AFibEFNet) and other deep-learning models were investigated. Electrocardiogram signals, ECG features, and clinical features (demographic information, comorbidities, blood cell counts, and blood test results) were collected for training. A hold-out test dataset was constructed using a different recruitment period. Five-fold cross-validation and calibration plots were used to evaluate performance.
Results
A total of 15,683 patients were analyzed (mean age, 70.0 ± 11.7 years; 61.2% men), with 82.2% having LVEF ≥50% and 17.8% having LVEF < 50%. Among the learning models, the AFibEFNet outperformed other models regarding area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), and F1-score. Using ECG signals alone, the AFibEFNet model predicted reduced LVEF with AUROC of 0.798 (95% confidence interval [CI], 0.767–0.829) and AUPRC of 0.508 (95% CI, 0.434–0.564). For the AFibEFNet model, additional training with ECG and clinical features significantly improved AUROC (0.816 vs. 0.798, p = 0.04) and AUPRC (0.547 vs. 0.508, p < 0.001). The AFibEFNet model primarily focused on the R-wave, QRS onset and offset, and T-wave in ECG signals.
Conclusions
Among patients with AF/AFL, machine learning may predict reduced LVEF from 12-lead AF/AFL ECGs.
Introduction
Atrial fibrillation (AF) is the most prevalent cardiac arrhythmia and is often accompanied by atrial flutter (AFL). 1 For optimal management of patients with AF or AFL, measuring left ventricular ejection fraction (LVEF) is essential, as reduced LVEF limits the medications that can be safely prescribed. European guidelines advise against the use of Vaughan-Williams class I antiarrhythmic drugs or dronedarone in patients with AF and reduced LVEF, due to increased risks of proarrhythmia and adverse outcomes.2,3 Additionally, nondihydropyridine calcium channel blockers should be avoided for heart rate control in these patients, as they may further decrease cardiac output.2,3 The presence of reduced LVEF also influences stroke risk and prevention strategies.2,4 However, accurate measurement of LVEF typically requires echocardiographic assessment. 5
Deep learning has been introduced as an effective tool for analyzing 12-lead electrocardiograms (ECGs) and detecting underlying medical conditions. 6 Accordingly, several reports have stated that deep learning is feasible for detecting underlying left ventricular dysfunction by analyzing ECGs from the general population.7–10 However, most previous studies examined sinus-rhythm ECGs; thus, their results may not apply to AF/AFL ECGs, since it is often difficult to define the ST-segment or T-waves during AF/AFL. Therefore, a new deep-learning model is needed to predict reduced LVEF in patients with AF/AFL ECGs.
This study aimed to investigate a deep-learning approach to detect reduced LVEF (<50%) by analyzing 12-lead ECGs of AF/AFL and easily obtainable clinical data such as age, sex, and comorbidities.
Methods
Study design and population
This was a single-center retrospective cohort study. The enrollment flow of the study population is shown in Figure 1. Patients aged ≥18 years who underwent echocardiography between January 2003 and July 2022 were identified. This study included patients who underwent AF/AFL ECG and echocardiography at an interval of <1 month. The exclusion criteria were as follows: (1) no available ECG (n = 13,011); (2) no AF/AFL ECG (n = 231,557); (3) no echocardiographic data for analysis (n = 29,060); (4) missing values for LVEF (n = 3514); (5) an interval of ≥1 month between echocardiography and AF/AFL ECG (n = 10,086); and (6) outliers for study variables (n = 384). The outliers are defined in Supplementary Table 1. Consequently, 15,683 patients with AF/AFL ECG and echocardiographic LVEF data were investigated. The study population was divided into the control group (those with LVEF ≥ 50%; n = 12,899) and the reduced LVEF group (those with LVEF < 50%; n = 2784).

The flowchart of the study population. The study population was patients with AF/AFL ECGs and echocardiographic LVEF data within a one-month interval. The population was categorized into the control group with normal LVEF (≥50%) and the reduced LVEF (<50%) group. AF: atrial fibrillation; AFL: atrial flutter; ECG: electrocardiogram; LVEF: left ventricular ejection fraction.
Data acquisition
This study used clinical data retrieved from the Seoul National University Hospital Patient Research Environment (SUPREME) system. Patients’ demographic information, comorbidities, blood test results, and echocardiography data were extracted from the SUPREME system, while raw ECG data were retrieved from the MUSE Cardiology Information System (GE Healthcare, WI, USA). Physicians at the Seoul National University Hospital diagnosed AF or AFL based on their direct evaluation of the patients. A complete list of the features and their definitions is presented in Supplementary Table 2. The index date was defined as the date of the earliest AF/AFL ECG. Demographic and ECG features were acquired from the index date, whereas comorbidities, blood test results, and echocardiographic features were acquired within three months of the index date.
Echocardiographic features were compared between the groups to further delineate the baseline characteristics of the population (Supplementary Table 3). This study used index ECGs that confirmed AF or AFL. The index ECGs were initially screened by diagnostic labels within the MUSE Cardiology Information System. If multiple echocardiographic studies were available within one month of the index AF/AFL ECG, the study closest in date to the index AF/AFL ECG was chosen for analysis. Both signal data and features were obtained for each ECG. Electrocardiogram features included heart rate, QRS duration, QT interval, R-axis, T-axis, Q onset, Q offset, T offset, and selected cardiac rhythm or conduction abnormalities (AF, AFL, bundle branch block, and atrioventricular block). The ECG parameters were measured from global fiducial points of all 12 simultaneous leads using the Marquette 12SL algorithm (GE Healthcare, Chicago, IL, USA). Numerous studies have utilized the Marquette 12SL algorithm owing to its stability and accuracy in measuring ECG parameters such as amplitudes, durations, and intervals. 11 Previous publications have comprehensively documented the specific details and criteria for ECG diagnostic statements and measurements generated by the Marquette 12SL algorithm.11,12 The ECG data comprised signals from eight leads: I, II, and V1-V6. Each signal was recorded for 10 s at a sampling rate of 500 Hz. Initially encoded in base64 within XML files, the signal data were decoded into numerical arrays using a custom Python script. Missing values among continuous features were imputed with the median of each feature. For all tabular data, robust scaling was performed using the median and interquartile range of each feature for reliable training.
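The waveform decoding and robust scaling described above can be sketched as follows. This is an illustrative sketch, not the study's actual script: the 16-bit little-endian sample encoding, the function names, and the fallback divisor of 1.0 for a zero interquartile range are all assumptions.

```python
import base64
import statistics

def decode_ecg_waveform(b64_text):
    """Decode a base64 waveform string into a list of samples
    (assumed little-endian signed 16-bit, as is common in MUSE XML exports)."""
    raw = base64.b64decode(b64_text)
    return [int.from_bytes(raw[i:i + 2], "little", signed=True)
            for i in range(0, len(raw), 2)]

def robust_scale(values):
    """Center by the median and scale by the interquartile range (IQR)."""
    med = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = (q3 - q1) or 1.0  # guard against constant features (assumption)
    return [(v - med) / iqr for v in values]
```

In the same spirit, median imputation of missing continuous features would run before `robust_scale`, using each feature's training-set median.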
Deep-learning training and evaluation process
This study categorized training data as follows: (1) ECG signals that could be acquired from raw ECG data, (2) clinical features composed of demographic features (age, sex, height, body weight, and body mass index), comorbidities (hypertension, diabetes mellitus, ischemic heart disease, dyslipidemia, chronic obstructive pulmonary disease (COPD), chronic kidney disease (CKD), liver disease, stroke, and thyroid disease), and blood test results (blood urea nitrogen (BUN), serum creatinine, glomerular filtration rate (GFR), high-sensitivity C-reactive protein (hs-CRP), hemoglobin, serum sodium and potassium levels, and white blood cell and platelet counts), and (3) ECG features. The deep-learning task was designed to identify patients with reduced LVEF (<50%). Tabular data, including the ECG features, were used in the LightGBM model, 13 which has been shown to perform well among tree-based algorithms. For the deep-learning model, the ECG signal was processed through convolutional layers, and the remaining data were concatenated with the feature embeddings of the signal in the third fully connected layer of a convolutional neural network (CNN) model with a residual block.
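The late-fusion idea above (concatenating tabular features with the signal embedding before the final fully connected layers) can be illustrated with a minimal NumPy sketch. The random projection below is only a stand-in for the trained convolutional encoder, and all dimensions and names are illustrative, not the AFibEFNet configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def signal_embedding(ecg, dim=32):
    """Stand-in for the convolutional encoder: project an 8x5000 ECG
    (8 leads, 10 s at 500 Hz) down to a `dim`-length embedding."""
    w = rng.standard_normal((ecg.size, dim)) * 0.01
    return ecg.reshape(-1) @ w

def fused_logit(ecg, tabular):
    """Concatenate the signal embedding with the tabular feature vector,
    then apply a linear head to produce a single logit."""
    fused = np.concatenate([signal_embedding(ecg), tabular])
    w = rng.standard_normal(fused.size) * 0.01
    return float(fused @ w)

ecg = rng.standard_normal((8, 5000))   # toy ECG: 8 leads x 10 s x 500 Hz
tabular = rng.standard_normal(40)      # toy clinical + ECG feature vector
logit = fused_logit(ecg, tabular)
```

In the actual model, the encoder and head weights are learned jointly, so gradients from the classification loss shape both the signal embedding and the fusion layer.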
We compared three CNN architectures. Inspired by architectures originally designed for two-dimensional image analysis, these models were modified to use 1-D convolutions to handle ECG data, and all were trained from scratch on our ECG dataset. The first model was a CNN with a residual block based on nEMGNet (46.3 M parameters). 14 The kernel sizes of some convolution layers were modified to fit the input shape of the ECG signals, and the clinical (tabular) data were concatenated after the third fully connected layer. The second model was a modified version of ResNet50 (15.9 M parameters). 15 The kernel size was modified to fit the input shape and to concatenate the clinical data with the embedding vector of the ECG signal in the fully connected layer. The third model was EfficientNet b5 (41.7 M parameters), another CNN-based model that achieves high efficiency and accuracy with fewer parameters. 16 We also analyzed a long short-term memory (LSTM) model (353.2 M parameters) as a representative recurrent neural network (RNN)-based model. 17 The architectures of the four deep-learning models are presented in Supplementary Figure 1. After these analyses, the CNN with a residual block, which exhibited the highest performance, was selected as the final model and named AFibEFNet to distinguish it from the other CNN-based models. To assess the appropriateness of the final model selection, we further investigated its interpretability through various tests, including subgroup and sensitivity analyses.
In our analysis, the training and hold-out test datasets were split by patient, meaning that no patient was included in both datasets. Among the study population, those before 2021 (n = 14,247) were used to develop and train the model, and those after 2021 (n = 1436) were used as the hold-out test dataset. Five-fold cross-validation was performed with patient-based splits for the performance evaluation, and the F1-score was used as the model optimization metric. The success criterion for predicting reduced LVEF was defined as achieving an area under the receiver operating characteristic curve (AUROC) with a lower limit of the 95% confidence interval (CI) above 0.5 on the hold-out test dataset. All experiments were conducted in Python v.3.6.9 and PyTorch v.1.10.2 environments, and the PyCaret library v.2.3.10 was used for the LightGBM model. Detailed information on the hyperparameter settings is presented in Supplementary Table 4.
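The patient-level temporal split and fold assignment above can be sketched in plain Python. The field names (`id`, `year`) and the deterministic round-robin fold assignment are illustrative assumptions; a real pipeline would typically shuffle or stratify the folds:

```python
def temporal_split(patients, cutoff_year=2021):
    """Temporal hold-out: patients before the cutoff develop the model,
    patients from the cutoff onward form the hold-out test set."""
    train = [p for p in patients if p["year"] < cutoff_year]
    test = [p for p in patients if p["year"] >= cutoff_year]
    return train, test

def patient_folds(patients, k=5):
    """Assign each unique patient id to exactly one of k folds,
    so no patient appears in more than one cross-validation fold."""
    ids = sorted({p["id"] for p in patients})
    fold_of = {pid: i % k for i, pid in enumerate(ids)}
    return [[p for p in patients if fold_of[p["id"]] == f] for f in range(k)]
```

Keeping the split at the patient level (rather than the ECG level) prevents leakage of a patient's own signals between training and evaluation.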
This study adheres to the guidelines for machine learning application in biomedical research. 18
Calibration plot
A calibration plot was constructed to evaluate the model's performance by dividing the test dataset into six bins according to the predicted probability of a reduced LVEF. The plot depicts the relationship between the predicted and actual probabilities for a given class.
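The binning behind such a calibration plot can be sketched as follows; equal-width probability bins are assumed for illustration, and empty bins are simply skipped:

```python
def calibration_bins(probs, labels, n_bins=6):
    """For each equal-width probability bin, return the mean predicted
    probability and the observed event rate among samples in that bin."""
    bins = []
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, p in enumerate(probs)
               if lo <= p < hi or (b == n_bins - 1 and p == hi)]
        if idx:  # skip empty bins rather than dividing by zero
            mean_pred = sum(probs[i] for i in idx) / len(idx)
            obs_rate = sum(labels[i] for i in idx) / len(idx)
            bins.append((mean_pred, obs_rate))
    return bins
```

Plotting observed rate against mean predicted probability per bin yields the calibration curve; a perfectly calibrated model lies on the diagonal.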
Evaluating feature attributions
The guided Grad-CAM method 19 was used to visualize the ECG segments important to the deep-learning model in predicting reduced LVEF. Sensitivity maps of the average beat of each lead were plotted for the five patients with the highest predicted probability of reduced LVEF. For the LightGBM model, feature importance was evaluated using Gini importance.
Statistical analyses
For the performance evaluation, various diagnostic parameters were evaluated (AUROC, area under the precision-recall curve (AUPRC), F1-score, sensitivity, and specificity). We compared the diagnostic parameters across the following models: (1) the four deep-learning models trained with ECG signals, clinical features, and ECG features; (2) the four deep-learning models trained with ECG signals alone; (3) the LightGBM model trained with clinical features and ECG features; (4) the LightGBM model trained with ECG features; and (5) the LightGBM model trained with clinical features. Owing to the inherent randomness of deep-learning models, we repeated the measurement of the diagnostic parameters five times and reported the average results with 95% CIs. For the threshold-dependent metrics, we used the Youden J-index. AUROCs were compared using the DeLong test. Two-sided p values < 0.05 were considered statistically significant. All statistical analyses were performed using R and Python.
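The Youden J-index threshold selection mentioned above can be sketched as a simple exhaustive search over the observed predicted probabilities; this is an illustrative implementation, not the study's code:

```python
def youden_threshold(probs, labels):
    """Return the probability threshold maximizing the Youden J-index
    (sensitivity + specificity - 1), together with that J value."""
    best_j, best_t = -1.0, 0.5
    for t in sorted(set(probs)):
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        fn = sum(1 for p, y in zip(probs, labels) if p < t and y == 1)
        tn = sum(1 for p, y in zip(probs, labels) if p < t and y == 0)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        j = sens + spec - 1.0
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j
```

Threshold-dependent metrics such as the F1-score, sensitivity, and specificity are then computed at the selected threshold.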
Subgroup and sensitivity analyses
We performed the subgroup analysis according to sex, the timing of ECG and echocardiography (ECG first/echocardiography first), hypertension, ischemic heart disease, diabetes mellitus, stroke, thyroid disease, COPD, liver disease, CKD, left atrial diameter (<40 mm/≥40 mm), and heart rate (<100 beats/min/≥100 beats/min). To compare the model's performance according to the reduced LVEF criteria, we performed a sensitivity analysis by modifying the definition of reduced LVEF to <40% and <35% and compared the diagnostic parameters with the main results.
Results
Baseline characteristics of the study population
A total of 15,683 patients with AF/AFL were analyzed (12,899 in the control group and 2784 in the reduced LVEF group). Compared to the control group, the reduced LVEF group was more likely to be male (65.5% vs. 60.3%, p < 0.001). In general, the reduced LVEF group had more prevalent comorbidities, such as diabetes mellitus, ischemic heart disease, COPD, CKD, and liver disease, than the control group (Table 1). Accordingly, the reduced LVEF group had significantly higher BUN, serum creatinine, hs-CRP, and potassium levels and lower GFR, hemoglobin, and platelet counts (all p < 0.001) (Table 1). For ECG parameters, the reduced LVEF group had a significantly higher heart rate (103 vs. 91 beats/min), wider QRS duration (98 vs. 92 ms), shorter QT interval (360 vs. 368 ms), and higher prevalences of AFL (13.8% vs. 10.3%) and left bundle branch block (LBBB; 4.7% vs. 0.9%) (all p < 0.001 except for QT interval, which was p = 0.001) (Table 1).
Baseline characteristics.
Data are N (%) or median (interquartile range).
BUN: blood urea nitrogen; COPD: chronic obstructive pulmonary disease; GFR: glomerular filtration rate; hs-CRP: high-sensitivity C-reactive protein; LVEF: left ventricular ejection fraction.
Deep-learning performance for predicting reduced LVEF
Table 2 summarizes model performance across the algorithms and training datasets. For the LightGBM model, training with ECG features yielded a higher AUROC than training with clinical features (0.758; 95% CI, 0.727–0.790; and 0.670; 95% CI, 0.633–0.708, respectively). In general, training with both ECG and clinical features resulted in higher diagnostic performance than training with either feature set alone (AUROC, 0.752; 95% CI, 0.717–0.786; AUPRC, 0.423; 95% CI, 0.356–0.490; F1-score, 0.430; 95% CI, 0.390–0.476; sensitivity, 0.645; 95% CI, 0.543–0.782; and specificity, 0.735; 95% CI, 0.607–0.838). However, the LightGBM model was inferior to the AFibEFNet model, even when the latter was trained with ECG signals alone (Table 2 and Figure 2).

The performance of the AFibEFNet and LightGBM models for predicting reduced LVEF. The performance of the two models was evaluated by comparing their AUROCs and AUPRCs on different training datasets. AUPRC: the area under the precision-recall curve; AUROC: the area under the receiver operating characteristics curve; ECG: electrocardiogram; LVEF: left ventricular ejection fraction.
The deep-learning performance for the prediction of reduced LVEF among patients with AF/AFL.
Data are mean (95% CI). The numbers of training and test samples were 14,247 and 1436, respectively. Compared to the final model, the other models showed significantly lower AUROCs; all p < 0.05.
AF: atrial fibrillation; AFL: atrial flutter; AUPRC: the area under the precision-recall curve; AUROC: the area under the receiver operating characteristics curve; ECG: electrocardiogram; LSTM: long short-term memory; LVEF: left ventricular ejection fraction.
Similarly, the other deep-learning models generally showed improved diagnostic performance as more data types were included in training (Table 2). Among the four deep-learning models, the AFibEFNet outperformed the others regarding AUROC, AUPRC, and F1-score (Table 2). Across the various models, the final model (the AFibEFNet trained with all datasets, including ECG signals, ECG features, and clinical features) generally achieved the highest diagnostic performance (AUROC, 0.816; 95% CI, 0.787–0.845; AUPRC, 0.547; 95% CI, 0.481–0.594; F1-score, 0.492; 95% CI, 0.461–0.536; sensitivity, 0.765; 95% CI, 0.692–0.833; and specificity, 0.738; 95% CI, 0.699–0.799).
Subgroup analysis
Subgroup analyses were performed using the final model (Table 3). Among the subgroups, the model showed significantly higher performance in patients without liver disease or CKD; the AUROC was 0.826 (95% CI: 0.797–0.855) versus 0.648 (95% CI: 0.495–0.801) for liver disease (p = 0.03) and 0.826 (95% CI: 0.797–0.855) versus 0.618 (95% CI: 0.447–0.788) for CKD (p = 0.02). Also, the subgroup of slower heart rates (<100/min) showed a significantly higher AUROC compared to its counterpart (≥100/min); AUROCs of 0.851 and 0.769, respectively; p = 0.009. The final model showed marginally increased AUROCs for those with ischemic heart disease (p = 0.14) and left atrial diameter of ≥40 mm (p = 0.10) versus their counterparts. However, no significant performance differences were observed for other comorbidities. The model showed the best overall performance in patients with ischemic heart disease (AUROC of 0.868, AUPRC of 0.905, F1-score of 0.791, and sensitivity of 0.944). However, the highest specificity (0.800) was observed in the subgroup with heart rates <100 beats/min.
Subgroup analysis.
For the comparison of AUROCs between subgroups.
Data are mean. The analysis was performed with the final model (the AFibEFNet model trained with ECG signals, ECG features, and clinical features).
AUPRC: the area under the precision-recall curve; AUROC: the area under the receiver operating characteristics curve; CKD: chronic kidney disease; COPD: chronic obstructive pulmonary disease; ECG: electrocardiogram; LVEF: left ventricular ejection fraction.
Sensitivity analysis
As the cutoff for reduced LVEF became lower (<40% or <35%), the AUROC generally increased (0.859 vs. 0.868), while the AUPRC and F1-score decreased (0.459 vs. 0.406 and 0.352 vs. 0.274 for LVEF <40% and <35%, respectively) (Supplementary Table 5).
Evaluation of feature attributions for predicting reduced LVEF
Figure 3 illustrates the feature attributions of the LightGBM model. Among the ECG features, information on the QRS morphology (R-axis, Q onset, Q offset, and QRS duration), ventricular repolarization (T-axis and T offset), and heart rate were considered important for the LightGBM model. Among the clinical features, blood cell counts, demographic features (BMI, body weight, and age), and serologic biomarkers (hs-CRP, creatinine, sodium, and blood urea nitrogen) were important. When all the features were trained together, some of the most important were the R-axis, creatinine, heart rate, and T-axis. For deep learning, the important ECG signals were mainly the R-waves (especially lead V6) and QRS onset and offset (leads V1, V2, and V3), and, to a lesser extent, the T-waves (leads V1 and V2) (Figure 4).

Important features for the LightGBM model to predict reduced LVEF. The importance of ECG and clinical features was calculated and compared. BMI: body mass index; BUN: blood urea nitrogen; ECG: electrocardiogram; GFR: glomerular filtration rate; hs-CRP: high-sensitivity C-reactive protein; LVEF: left ventricular ejection fraction; RBBB: right bundle branch block.

Visualization of feature attributions of ECG signals to predict reduced LVEF. Darker shades in Guided Grad-CAM visualize critical ECG features essential for model assessment. Black lines represent the averaged ECG of leads for five patients with the highest likelihood of reduced LVEF. Grey areas represent the standard deviation of the beats. ECG: electrocardiogram, LVEF: left ventricular ejection fraction.
Discussion
This study demonstrated the feasibility of using deep learning to detect underlying reduced LVEF in patients with AF/AFL, employing 12-lead ECG and clinical information. Our principal findings are: First, deep learning achieved the highest diagnostic performance when the AFibEFNet model was trained with raw ECG signals and features. Second, learning raw ECG signals provided better predictions of reduced LVEF compared to learning ECG or clinical features alone, emphasizing the diagnostic importance of raw AF/AFL ECG signals. Compared to previous reports,7–10 our study's strengths include (1) focusing on patients with AF/AFL ECG, (2) using a large-scale dataset (∼15,000 patients), and (3) analyzing multiple diagnostic parameters (AUROC, AUPRC, F1-score, etc.) for a balanced interpretation of the model's performance.
Machine learning can be used to detect reduced LVEF or heart failure.20,21 However, most studies investigated general cardiovascular patients and did not focus on those with AF/AFL. Attia et al. validated a deep-learning prediction model of reduced LVEF using ECGs of general cardiovascular patients. 9 They showed that the deep-learning model achieved a sensitivity of 82.5% and specificity of 86.8%, which seems better than our results (sensitivity of 76.5%, specificity of 73.8%). Possible explanations for the differences between the previous study and ours are as follows: First, the previous study used a stricter definition of reduced LVEF (≤35%) than our study (<50%). We found that the AUROC tended to increase as the cutoff value for reduced LVEF decreased, but this benefit was counterbalanced by a decreased AUPRC (Supplementary Table 5). This could be because ECG signs of left ventricular dysfunction, such as pathologic Q-waves, ST-segment changes, or T-wave abnormalities, become more pronounced as left ventricular dysfunction worsens. Second, previous studies focused on general cardiovascular patients, whereas our study focused on those with AF/AFL. Predicting reduced LVEF using AF/AFL ECGs might be challenging because the signs of left ventricular dysfunction, such as ST-segment and T-wave abnormalities, are often difficult to characterize because of fibrillatory or flutter waves. Therefore, deep-learning models trained with sinus-rhythm ECGs may yield different results if applied to patients with AF/AFL.
To date, few studies have targeted patients with AF to predict heart failure using machine learning. Hamatani et al. investigated the machine-learning prediction of heart failure in patients with AF. 22 The study evaluated seven clinical variables (age, history of heart failure, creatinine clearance, the cardiothoracic ratio on X-ray, LVEF, left ventricular end-systolic diameter, and left ventricular asynergy) to predict heart failure among patients with AF. However, the clinical utility of that model appears limited because it requires a prior echocardiographic study before the machine-learning analysis can be performed. In addition, the study did not analyze AF ECG signals, which could have greater diagnostic importance for reduced LVEF than clinical variables, as shown in our analysis (Table 2). In contrast to that study, we focused on patients with AF/AFL and utilized AF/AFL ECG signals, ECG features, and clinical variables to predict reduced LVEF.
Our data emphasize the importance of the ECG in predicting reduced LVEF with deep learning. The ECG features were considered more important than the clinical features (Figure 3 and Table 2). When the total feature importance scores were evaluated, the 23 clinical features had a total score of 922, whereas the 12 ECG features had a total score of 673. Therefore, each ECG feature carried more weight on average, with a mean importance score of 56.08, compared to 40.09 for the clinical features. These results support the finding that ECG features hold higher feature importance than clinical features in our model. We also conducted a feature ablation analysis in which LightGBM's performance was compared while sequentially excluding each feature (Supplementary Figure 2). The results were generally coherent, suggesting that selected ECG features (particularly heart rate and R-, Q-, and T-wave parameters) were crucial for classifying the reduced LVEF group. However, deep learning of raw ECG signals may outperform machine learning of human-invented ECG features for detecting underlying heart disease.23,24 Accordingly, we also observed that deep learning of raw ECG signals resulted in better predictive performance than the LightGBM model trained with ECG features (Table 2). This finding suggests that, in addition to the analysis of human-invented ECG features, analysis of whole ECG signals using deep learning would be beneficial for predicting reduced LVEF.
Among the deep-learning models, we observed that the CNN-based models (AFibEFNet, ResNet50, and EfficientNet b5) performed better than the LSTM model. However, we acknowledge that our findings do not definitively establish the superiority of CNN over RNN for all applications. This is primarily because the comparative efficacy of these models was not the primary focus of our study, and other RNN-based models may yield different results. Nonetheless, our findings were consistent with existing literature that CNN-based models showed robust performance in ECG signal analysis, 25 which may enhance the robustness and credibility of our study.
Besides the model's performance, the interpretability of the model is crucial, especially in clinical applications. To address this concern, we have visualized the feature attributions of the ECG signals that the model uses to predict reduced LVEF, as shown in Figure 4. The visualization indicates that the model prioritizes the R-wave, QRS onset and offset, and T-wave. This focus aligns with clinical knowledge, as left ventricular dysfunction in heart failure patients often leads to observable changes in QRS morphology. The QRS complex, which represents ventricular activation, is typically affected in conditions such as myocardial infarction and cardiomyopathy, presenting with widened QRS duration, reduced R-wave amplitude, alterations in the R-wave axis, pathological Q-waves, and ST-segment changes. Therefore, the model's emphasis on the QRS complex is clinically reasonable. Further analysis of feature importance (Figure 3) corroborates the findings from the ECG signal attributions. The model consistently considers factors such as the R-wave axis and heart rate, along with Q-wave onset and offset, as significant for identifying reduced LVEF. Among the clinical features, variables such as blood cell count, high-sensitivity C-reactive protein, body mass index, and age were deemed important. These factors are clinically relevant as systemic inflammation and advanced age are recognized risk factors for heart failure. 26 Therefore, the model's prioritization of these specific features for predicting reduced LVEF suggests that it operates on medically relevant criteria, thus providing some interpretability.
In the context of deep-learning classification for medical purposes, there are frequently far more normal samples than diseased ones. In such cases of class imbalance, as in our study, training deep-learning models to produce accurate results is challenging and often leads to a bias toward the majority class in the predicted class probabilities. 27 The calibration plot (Figure 5) was employed to evaluate the model's probabilistic outputs in light of the class imbalance present in our dataset. A well-calibrated model should provide outputs that closely approximate the actual likelihood of an event. According to Figure 5, our final model exhibited under-confidence and was inclined to incorrectly predict the absence of reduced LVEF, especially within subsets of the data with a higher prevalence of the condition. Therefore, if our model is used in clinical practice, physicians should interpret predicted probabilities of reduced LVEF above 0.5 with caution.

A calibration plot of the final model. A calibration plot of the final model (AFibEFNet trained with ECG signals and all features) showed under-confidence. ECG: electrocardiogram.
Limitations
First, there was a class imbalance (12,899 and 2784 patients in the control and reduced LVEF groups, respectively). To assess the impact of this imbalance, we conducted a sensitivity analysis that systematically adjusted the class ratio within our dataset and observed the corresponding effects on the model's performance (Supplementary Table 6). As the class imbalance increased, there was a general trend of increasing AUROC, counterbalanced by a decreasing F1-score. Despite the varying levels of imbalance, the model maintained generally acceptable performance. Another approach to addressing class imbalance is a weighted loss function that penalizes the majority class during training. However, the model's performance with the weighted loss function did not improve compared to the default loss function (Supplementary Table 7). We hypothesize that the default loss function may have allowed the model to learn more generalized patterns without overemphasizing the minority class, resulting in better overall performance. Second, we were unable to perform external validation. However, we implemented a temporal validation approach to mitigate the risk of overfitting and enhance our findings' validity: we trained our model on patient data collected until 2021 and validated it on a subsequent dataset comprising patients from 2021 onward. This approach ensures a clear temporal separation between the training and validation sets, thereby reducing the potential for overfitting. Third, the operational definitions of comorbidities might have over- or under-estimated their prevalence, although they were validated and peer-reviewed elsewhere.28–30 Also, as the data were obtained from a single tertiary hospital, the patient profile may differ from that of the general AF/AFL population.
Fourth, although the deep-learning model was able to detect a reduced LVEF in patients with AF/AFL ECG, its performance was not validated for other underlying cardiac conditions, such as heart failure with preserved LVEF or structural heart disease. Fifth, variability introduced by different ECG equipment and the varying expertise of technicians could affect signal quality. We evaluated the model's performance across test datasets sorted by ECG devices and technicians, as shown in Supplementary Table 8. Despite the limited subgroup sizes, the general preservation of robustness across devices and technicians was observed. Sixth, individual variations may affect the model's performance. Although it is challenging to evaluate every individual factor, the subgroup analysis investigated the impact of selected variables on performance. Seventh, although our model demonstrated the feasibility of predicting reduced LVEF, its current performance remains modest, limiting its immediate clinical applicability. Further development and validation in diverse populations, datasets, and racial groups are essential to enhance its generalizability and potential clinical utility. Finally, our results are primarily based on the Korean population. Thus, the extrapolation of our model to other races needs further validation.
Conclusions
Machine-learning prediction of reduced LVEF from AF/AFL ECGs was feasible. However, the model's modest performance may limit its current clinical applicability. Further model development and validation in broader AF and AFL populations are necessary to assess its potential clinical utility.
Supplemental Material
Supplemental material, sj-docx-1-dhj-10.1177_20552076241311460 for Prediction of reduced left ventricular ejection fraction using atrial fibrillation or flutter electrocardiograms: A machine-learning study by Soonil Kwon, SooMin Chung, So-Ryoung Lee, Kwangsoo Kim, Junmo Kim, Dahyeon Baek, Hyun-Lim Yang, Eue-Keun Choi and Seil Oh in DIGITAL HEALTH
Footnotes
Abbreviations
AF: atrial fibrillation; AFL: atrial flutter; AUPRC: area under the precision-recall curve; AUROC: area under the receiver operating characteristic curve; CI: confidence interval; ECG: electrocardiogram; HR: hazard ratio; LVEF: left ventricular ejection fraction.
Consent to participate
The Seoul National University Hospital Institutional Review Board waived the requirement for informed consent since the study used anonymized data, and thus, consent would be impossible or impracticable to obtain.
Contributorship
Kwon S and Chung S contributed equally and are co-first authors. Lee SR and Kim K are the co-corresponding authors. Conceptualization: Kwon S, Chung S, Kim J, Baek D, Yang HL, Lee SR, Kim K; Data curation: Kwon S, Chung S, Kim J, Baek D; Formal analysis: Kwon S, Chung S, Kim J, Baek D; Funding acquisition: Lee SR, Kim K; Investigation: Kwon S, Chung S, Kim J, Baek D, Yang HL, Lee SR, Choi EK, Oh S, Kim K; Methodology: Kwon S, Chung S, Kim J, Baek D, Yang HL, Lee SR; Project administration: Lee SR, Kim K; Supervision: Yang HL, Lee SR, Kim K; Validation: Kwon S, Chung S, Kim J, Baek D, Yang HL, Lee SR, Kim K; Visualization: Kwon S, Chung S; Writing original draft: Kwon S, Chung S; Writing review & editing: Kwon S, Chung S, Yang HL, Lee SR, Choi EK, Oh S, Kim K.
Data and code availability
Declaration of conflicting interests
The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: Choi EK: Research grants from Bayer, BMS/Pfizer, Biosense Webster, Chong Kun Dang, Daiichi-Sankyo, Samjinpharm, Sanofi-Aventis, Seers Technology, Skylabs, and Yuhan. No fees are received personally.
Ethical approval
The study protocol conformed to the Declaration of Helsinki (revised in 2013) and was reviewed and approved by the Seoul National University Hospital Institutional Review Board (no. H-2207-001-1336).
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the Korea Medical Device Development Fund grant funded by the Korea government (the Ministry of Science and ICT, the Ministry of Trade, Industry and Energy, the Ministry of Health & Welfare, the Ministry of Food and Drug Safety) (Project Number: HI20C1662, 1711138358, KMDF_PR_20200901_0173). The funding source had no role in the study.
Guarantor
KS and CS.
Supplemental material
Supplemental material for this article is available online.
References
