Sage Journals: Discover world-class research

Abstract

Objectives

The automated classification of electrocardiogram (ECG) scatter plots has significant clinical value for the rapid diagnosis of cardiac arrhythmias. However, existing methods based on convolutional neural networks (CNNs) are constrained by their local receptive fields, which limit their ability to effectively capture the non-local contextual relationships and global features essential for classification. This study aims to design and validate a novel deep learning model to overcome these limitations by effectively learning long-range dependencies and key discriminative regions within ECG scatter plots, thereby improving the accuracy of arrhythmia classification.

Methods

This study proposes a vision transformer (ViT) network based on Token Selection. The method first segments the ECG scatter plot into a sequence of patches and utilizes the self-attention mechanism of the Transformer encoder to model global contextual information. The core innovation is the introduction of a Token Selection module deep within the network, which dynamically filters the most discriminative patches (Tokens) to be used for final classification. This enables the model to focus on critical regions decisive for diagnosis while reducing interference from redundant information.

Results

The proposed model was validated on real ECG scatter plot datasets. Experimental results demonstrate that the method achieves superior classification accuracy, outperforming traditional CNN-based models.

Conclusion

This study establishes a vision transformer model with token selection, providing an effective and precise solution for the automated classification of ECG scatter plots. By overcoming the limitations of conventional CNN methods, this model demonstrates exceptional capability in capturing both global and key local features, thus offering a novel approach for the advancement of automated arrhythmia diagnosis technology.

Keywords

Cardiac arrhythmia ECG scatter plot abnormal classification vision transformer token selection

Introduction

In recent years, cardiovascular diseases (CVDs) have become one of the most significant threats to human health.¹ The electrocardiogram (ECG), owing to its non-invasive nature and portability, is widely used for the diagnosis of cardiac diseases.² With the advancement of electrocardiographic technology, the utilization of electrocardiogram scatter plots has gradually gained momentum. Most dynamic electrocardiography (DCG) analyzers now include automated scatter plot analysis software that primarily assesses changes in RR intervals. This differs significantly from the conventional template analysis software in terms of operational perspective, particularly in diagnosing arrhythmias and analyzing heart rate variability (HRV). ECG scatter plots enable rapid identification of abnormal conditions such as premature ventricular contraction (PVC), supraventricular premature contraction (SVPC), atrial fibrillation (AF), atrial fibrillation with premature ventricular contraction, and atrial flutter (AFL). They hold considerable clinical utility in diagnosing arrhythmias and warrant clinical dissemination. Esperer reported that utilization of two-dimensional Lorenz plots generated from 24-h holter recordings has shown potential in improving the accuracy of detecting and distinguishing major tachyarrhythmias. This method, particularly beneficial in cases involving supraventricular tachyarrhythmias, offered a promising approach to enhancing the precision of arrhythmia identification and differentiation.³ Borracci conducted a study in which they found that Modified Poincaré plots, also known as Lorenz scatter plots, with smoothed lines connecting successive points, could effectively classify various types of arrhythmias based on short RR interval sequence variability. The research revealed that distinct emergent patterns could be visually identified, offering the potential for automatic classification systems to differentiate between different arrhythmias. This work introduces an alternative approach for interpreting time-delay plots derived from short-term EKG signal recordings.⁴ Ultimately, the Lorenz scatter plot presents a valuable tool in aiding clinicians in the diagnosis of arrhythmias and could potentially be integrated into automatic classification systems for enhanced diagnostic accuracy.

Currently, the discrimination of ECG scatter plots largely relies on the manual observation of cardiologists. Given the diverse array of cardiovascular diseases and the multifaceted characteristics of ECG scatter plots, fast and accurate image recognition requires a certain level of expertise and clinical experience from the clinician. Furthermore, the diagnostic process is inherently subjective, influenced by clinician interpretation and cognition, and long hours of heavy workload can lead to physician fatigue, increasing the likelihood of misdiagnosis and missed. Consequently, automated identification of cardiac arrhythmias has emerged as a research hotspot. This approach serves to assist physicians in diagnosis and enhance overall workflow efficiency.⁵

Traditional automatic electrocardiogram identification primarily relies on machine learning (ML). Li et al. introduced a novel approach wherein wavelet packet entropy (WPE) was manually extracted as features from the ECG signals, followed by input into a random forest (RF) model for electrocardiogram recognition and classification⁶. Asl et al. and Elhaj et al. utilized linear and nonlinear methods to extract features from heart rate variability (HRV), and subsequently employed a support vector machine (SVM) in combination with a one-versus-many strategy for HRV signal classification, achieving favorable classification results.^7,8 Traditional machine learning depends on manually extracting features from electrocardiograms or ECG signals, a process that is complex, time-consuming, and challenging to ensure accuracy. With the continuous development of deep learning (DL), an increasing number of studies are inclined towards using DL to automatically extract features from electrocardiograms, enabling automatic detection and identification of one or multiple potential heart diseases. Deep learning is particularly sensitive to local details and can automatically extract rich image features. Several typical algorithms have been employed in research, including stacked autoencoders (SAE), deep belief networks (DBN), convolutional neural networks (CNN), and recurrent neural networks (RNN). A review of existing studies applying deep learning to ECG diagnosis reveals that these methods have achieved high accuracy in ECG signals classification.⁹ Ullah and Huang transformed ECG time series signals into two-dimensional (2-D) spectrograms using the Fourier transform (FT) and utilized two-dimensional convolutional neural network(2-D CNN) model to identify arrhythmias.^10,11 This approach successfully accomplished multi-label classification tasks including normal beat, premature ventricular contraction beat (PVC), paced beat (PAB), right bundle branch block beat (RBB), left bundle branch block beat (LBB), and atrial premature contraction beat (APC), achieving an average classification accuracy of 99.11%. Researchers^12,13 employed convolutional neural network (CNN) and bidirectional Long Short-Term Memory(LSTM) to capture contextual relevance of ECG features for automatic arrhythmia classification. Reference¹⁴ utilized CNN for classifying arrhythmia data and achieved different classification outcomes by configuring different convolutional layers. Donida Labati et al. and Acharya et al. utilized a deep convolutional neural network to extract important features from one or multiple leads and compared ECG feature templates using a simple and fast distance function, resulting in higher accuracy.^15,16 However, neural networks like CNN suffer from limited receptive fields of the convolutional kernels, lack a holistic understanding of image information, and may extract many irrelevant redundant features, making it difficult to capture long-distance dependencies and impacting the performance of multi-label classification.

To this end, researches¹⁷ have applied the structure of vision transformer (ViT), based on the mechanism of attention, to medical imaging tasks, assisting convolutional neural network (CNN) in extracting image features. The Transformer¹⁸ can simultaneously capture all information within the entire image, and the attention mechanism's architecture aids the model in focusing on meaningful regions. A transformer-based deep learning neural network, electrocardiogram detection transformer (ECG DETR), has been proposed, which performs arrhythmia detection on continuous single-lead electrocardiogram segments, significantly enhancing classification performance.¹⁹ Literature²⁰ connects features extracted by a Transformer model from RR intervals with attention modules for final electrocardiogram classification. Experimental results demonstrate that the proposed model achieves higher accuracy on the original dataset. Jamil²¹ utilizes ViT to identify phonocardiogram (PCG) signals for rapid and accurate diagnosis of valve heart disease (VHD) at its early stages. The achieved results exhibit an accuracy rate of up to 99.9%, surpassing the current state-of-the-art VHD classification models. Natarajan et al.²² introduce a waveform Transformer that projects electrocardiogram segments through a multi-layer perceptron into one-dimensional vectors, which are then combined with positional embeddings as inputs to the Transformer encoder for final electrocardiogram classification. Experimental results demonstrate that this model exhibits excellent performance in classification tasks. From the aforementioned studies, it can be concluded that Vision Transformer has achieved promising results in electrocardiogram classification, and has improved recognition accuracy compared to deep learning (DL) models such as CNN and LSTM. Literature¹⁸ points out that when Transformer models are input into higher layers of the encoder, single-layer attention weights may not necessarily represent the importance of the original input tokens. In cases where original images, such as scatter plots of electrocardiograms, contain large amounts of blank or irrelevant areas, the recognizability of high-level tokens obtained after multiple layers of computation may be insufficient. Therefore, this study proposes an improved ViT network (TSViT) for ECG scatter plots classification. To further extract discriminative features, a local token selection module is introduced before the final Transformer input. This module merges the attention weights from earlier layers and selects tokens representing discriminative regions as inputs for the final layer while disregarding redundant information. Subsequently, the classification vector is concatenated with the selected tokens to retain global information, and the classifier outputs the results. The network architecture model of TSViT is illustrated in Figure 1. On the left side of Figure 1 is the overall framework of the network. The image to be classified is segmented into multiple patches, which are input into a linear mapping layer along with positional encodings before being fed into the Transformer encoder. To enhance the network's ability to recognize local discriminative features in electrocardiogram scatter plots, the penultimate layer of the Transformer encoder in ViT is modified into a token selection module. This module selects tokens with discriminative characteristics to remove interference from redundant information. The right-hand side of Figure 1 depicts the internal structure of the Transformer.

Figure 1.

The TSViT network architecture.

Methods

Dataset

The experimental dataset comprised 9000 scatter plots of electrocardiogram, collected from patients with arrhythmia and sinus rhythm (SR) who underwent twelve-lead dynamic electrocardiogram examinations in the Department of Electrocardiographic Diagnosis, the Second Affiliated Hospital of Anhui Medical University, from December 2022 to March 2024. The dataset included five types of arrhythmias: ventricular premature beats, supraventricular premature beats, atrial fibrillation, atrial fibrillation with ventricular premature beats, and atrial flutter. The diagnoses were established based on the gold standard provided by expert clinicians and other diagnostic machines in the hospital, and the dataset were manually annotated. Prior to the experiments, the dataset underwent preprocessing, where all input images were resized to 224*224 pixels. To prevent overfitting during training, data augmentation techniques were applied to the training set, including horizontal flipping, perspective transformation, shearing, rotation, translation, scaling, and brightness adjustment.

Experimental setup

The experiment was conducted using Python 3.8 with the PyTorch framework version 1.7.1. The experimental environment consisted of an Intel(R) Core(TM) i9-7900X processor and an NVIDIA RTX 3090 GPU.

The batch size was set to 64, the initial learning rate was set to 0.001, and the Adam optimizer was selected as the optimizer.²³ The loss function used was the cross-entropy loss function.

The original dataset comprised a total of 9000 ECG scatter plot images, with each type having 1200 images in the training set, 150 images in the test set, and 150 images in the validation set.

The feasibility of the method is determined using the following four performance metrics: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). TP represents the number of abnormal traces correctly identified, TN represents the number of normal traces correctly identified, FP represents the number of normal traces incorrectly identified as abnormal, and FN represents the number of abnormal traces incorrectly identified as normal. These metrics are summarized into the following evaluation indicators:

Precision (P) = TP / (TP + FP): Precision measures the proportion of correctly identified abnormal behaviors among all identified abnormal behaviors. It is used to assess the accuracy of abnormal behavior identification.

Recall (R) = TP / (TP + FN): Recall measures the proportion of correctly identified abnormal behaviors among all actual abnormal behaviors. It evaluates the algorithm's ability to identify abnormal behaviors.

The formula for calculating the $F_{\partial}$ score is:

F_{β} = \frac{(β^{2} + 1) \cdot Precision \cdot Recall}{β^{2} \cdot Precision + Recall}

The $F_{\partial}$ score is the harmonic mean of precision and recall, with ∂ used to adjust the weights of precision and recall. When ∂ is set to 2, recall is given more weight than precision. When ∂ is set to 1, $F_{1}$ is defined as the harmonic mean of precision and recall. A higher $F_{1}$ score indicates better overall performance of the anomaly detection method.

The Transformer is a deep learning (DL) network equipped with an attention mechanism, comprising mainly encoder and decoder components. The encoder is responsible for encoding various word vectors, while the decoder generates predicted translation results. In the context of image classification, the vision transformer (ViT) shares a similar network structure to the Transformer, featuring an encoder. However, the ViT, which is used for image classification, has a similar network structure to the Transformer, and the main structure has an encoder, with the difference that ViT outputs predictions from the encoder outputs directly into a multilayer perceptron, which completes the classification task and no longer needs a decoder. The ViT network architecture primarily consists of an encoder, self-attention mechanisms, multi-head attention, and multi-layer perceptrons (MLP). In clinical practice, Lorenz plots of the ECG include two types: the RR interval scatter plot and the RR interval difference Lorenz plot.²⁴ Both types of plots are two-dimensional graphs derived from iterative calculations, and they reflect dynamic changes in cardiac rhythms. The formation of these images is influenced by the so-called attractors, which are divided into two types: homeostatic and non-homeostatic attractors. Homeostatic attractors refer to a cardiac rhythm that is always consistent in its origin and usually manifests itself as a sinus rhythm. Non-stationary attractors, on the other hand, appear when the heart's rhythm changes from one state to another, and they have different morphologies and characteristics, with ventricular rhythms being the common type. By looking at the graphical features of the Lorenz diagram, doctors can categorize arrhythmias. For example, sinus rhythm shows a specific distribution pattern in the graph, while atrial fibrillation shows a fan-shaped distribution. Supraventricular premature beats usually show a three-cluster distribution on the graph, premature ventricular contraction show a four-cluster distribution, and parallel rhythms show inverted Y distribution on the graph.

This paper focuses on the analysis of scatter plots of electrocardiograms, specifically Lorenz-RR scatter plots.²⁵ The principle of plotting is that the first RR interval is the horizontal coordinate and the second RR interval is the coordinate. The Lorenz plot is constructed by plotting the scatter over a period of time to show the characteristic information of the ECG signal, the principle is shown in Figure 2.

Figure 2.

Principle diagram of ECG Lorenz plot.

In these scatter plots, the “B-line” refers to the line closest to the x-axis, formed by points preceding each premature beat and their preceding sinus RR intervals. The B-line can also represent the edge of the graph, adjacent to the x-axis. Diagnostic criteria for ECG scatter plots^5,26 include: premature ventricular contraction, characterized by primarily quadriphasic distribution with some biphasic or triphasic patterns, and a B-line slope approaching zero; supraventricular premature beat, mostly displaying triphasic distribution with some quadriphasic patterns, and a B-line slope ranging from 0.132 to 0.506; atrial fibrillation (AF), indicated by a fan-shaped scatter plot with a base slope >0.11; atrial fibrillation with premature ventricular beat, featuring a fan-shaped scatter plot with a linear component parallel to the x-axis; atrial flutter, observed as an orderly grid-like distribution of scatter points.²⁷ Figure 3 (Source from²⁸) illustrates the shapes of six types of ECG scatter plots, with Plot A representing a sinus rhythm (SR) scatter plot.

Figure 3.

ECG scatter plot categories. A: normal sinus rhythm; B: ventricular premature beats; C: supraventricular premature beats; D: atrial fibrillation(AF); E: atrial fibrillation with ventricular premature beats; F: atrial flutter.

Data processing pipeline

Encoder

Since the encoder of the transformer requires one-dimensional vectors as input, preprocessing of the input image is necessary to embed it. This involves segmenting the image into patches, which are then composed into vectors for input into the network. Specifically, given an input image $x = [x_{1}, \dots, x_{N}] \in R^{H \times W \times C},$ where H and W represent the height and width of the image, respectively, and C is the number of channels. Dividing the entire image into P*P sized patches yields N image blocks, where each block has pixel dimensions of P*P*C. After conversion into vectors, each patch becomes a $P^{2} C$ -dimensional vector. Concatenating the vectors of the N image blocks results in a two-dimensional matrix of size $N times (P^{2} C)$ . Consequently, the final image encoding is represented as $x_{0} = [E x_{1}, \dots, E x_{N}]$ , where $E \in R^{N \times (P^{2} C)}$ denotes the embedding vector for each patch.

Incorporating positional information into the Transformer model is essential. To capture positional variables, positional encoding $pos = [po s_{1}, \dots, po s_{N}] \in R^{N \times D}$ is introduced. Consequently, the Encoder encoding is given by:

z_{0} = [{Ex}_{1}, \dots, E x_{N}] + [po s_{1}, \dots, po s_{N}]

Multi-head attention (MSA)

MSA is the core module in the encoder, derived from self-attention. A self-attention head learns attention weights by computing the dot product of queries (Q), keys (K), and values (V). In the self-attention layer, the input is first subject to three independent linear transformations to obtain query matrix (Q), key matrix, and value matrix. The calculation formula is as follows:

Q = {XW}_{q}, K = {XW}_{k}, V = {XW}_{v}

In the formula, $W_{q}, W_{k}, W_{v}$ are parameter matrices, and X is the input matrix. The attention weights for each position are computed using the product of K and Q, followed by the SoftMax function. The weighted sum of all positions serves as the output of the self-attention layer. The calculation formula is as follows:

F_{A} (Q, K, V) = δ (\frac{{QK}^{T}}{\sqrt{d_{k}}}) V

In the formula, $F_{A}$ represents the attention layer, $δ$ denotes the SoftMax activation function, and $d_{k}$ is the dimensionality of K. Dividing the attention weight matrix by $\sqrt{d_{k}}$ is aimed at mitigating the impact of large variances on the backward propagation process, thereby stabilizing the model optimization process. Figure 4 illustrates the calculation flow of self-attention. In multi-head attention, based on the aforementioned self-attention computation, Q, K, and V are split into multiple small matrices of the same size. Attention operations are conducted separately on each of these matrices, and the computed results are concatenated as the final output. The computational process is as follows:

\begin{matrix} F_{s} (Q, K, V) = [H_{1}, H_{2}, \dots, H_{h}] W^{o} \\ H_{i} = F_{A} {(XW}_{i}^{q} {, XW}_{i}^{k} {, XW}_{i}^{v}) \end{matrix}

Figure 4.

Self-attention computation process.

In the formula, $F_{s}$ represents the MSA module, while $W_{i}^{q} {, W}_{i}^{k} {, W}_{i}^{v}, W^{o}$ are all parameter matrices. $H_{i}$ denotes the i-th head of MSA, with a total of h heads.

Multi-layer perceptron (MLP)

The MLP consists of two fully connected layers and employs Gaussian error linear units (GELU) as the activation function. The structure of the MLP is as follows:

F_{p} (X) = W_{1} (ε (W_{2} X + b_{2})) + b_{1}

In the equation, $F_{p}$ represents the MLP module, where X is the input matrix of the MLP layer, $ε$ denotes the GELU activation function, and and $b_{2}$ are the parameter matrices and bias matrices of the two fully connected layers, respectively.

Both MSA and MLP employ skip connections and are preceded by layer normalization (LN). Therefore, the main process in the Transformer encoder can be summarized as follows:

{z^{'}}_{l} = F_{s} [LN (z_{l - 1})] + z_{l - 1}, l = 1, \dots, L

z_{l} = F_{p} [LN ({z^{'}}_{l})] + {z^{'}}_{l}

In the equation, $z_{l}$ represents the encoded features. In the last layer of the encoder, the first token from $z_{L}^{0}$ is extracted as the global feature representation of the image, which is then passed to a classifier for label prediction.

Token selection module

Typically, Lorenz-RR scatter plot images contain numerous irrelevant regions,²⁹ and focusing on discriminative local areas can significantly enhance classification effectiveness. Therefore, a token selection module is integrated into the conventional ViT architecture to capture locally important information, thereby strengthening the network's classification capability.

As the final output includes an abundance of global information, the Token Selection Module operates before the input to the encoder's penultimate layer. It selects tokens containing more discriminative features as input, while disregarding tokens containing redundant information.

The hidden features output from the (L-1)th layer are denoted as $T_{L - 1} = {[T}_{L - 1}^{0} {, T}_{L - 1}^{1} {, T}_{L - 1}^{2}, \dots {, T}_{L - 1}^{N}]$ , and the attention weights from previous layers can be represented as $a_{l} = [a_{l}^{0}, a_{l}^{1}, a_{l}^{2}, \dots, a_{l}^{k}]$ , where $l = 1, 2, 3, \dots, L - 1$ , and k represents the number of self-attention heads in the multi-head attention mechanism. Each attention weight for a self-attention head is expressed as $a_{l}^{i} = [a_{l}^{i_{0}}, a_{l}^{i_{1}}, a_{l}^{i_{2}}, \dots, a_{l}^{i_{N}}]$ , where $i = 0, 1, 2, \dots, k - 1$ . It has been pointed out in literature³⁰ that the recognizability of tokens in higher layers of the network is insufficient after multiple perceptron operations. Therefore, the multiplication of attention scores from earlier layers is performed:

a_{final}^{i} = \prod_{l = 1}^{L - 1} a_{l}

The $a_{f i n a l}$ integrates attention weights from the starting layer to the ending layer. Compared to the attention scores from individual layers, it is more effective in selecting tokens containing important discriminative regions. Therefore, among the h distinct attention heads with the highest final scores in $a_{f i n a l}$ , we select the maximum value, obtaining the corresponding h highest-score tokens in $T_{L - 1}$ . Finally, we concatenate these obtained tokens with the classified tokens to serve as the input for the subsequent module:

T_{choose} = {[T}_{L - 1}^{0} {; T}_{L - 1}^{\max 1} {, T}_{L - 1}^{\max 2}, \dots {, T}_{L - 1}^{maxh}]

$T_{c h o o s e}$ retains global information while allowing tokens with more discriminative features to be used as input for the final layer. It disregards other tokens that are irrelevant to discrimination, enabling the deep learning model to focus more on important regions and thereby enhancing the network's classification performance.

Grid optimization

To accelerate the convergence speed of the proposed network and prevent overfitting, this experiment incorporates the dropout optimization method proposed in literature³¹ into the fully connected layers. This optimization method involves dropping connections of some nodes in the hidden layers to prevent the reliance on specific combinations of features, thus enabling the model to learn more general patterns. The Adam optimizer is chosen for its simplicity, computational efficiency, and the interpretability of its hyperparameters.²³

Results

Ablation study

In order to analyze the impact of Vision Transformer, and the token selection module on the model, ablation experiments were conducted. The experimental results are presented in Table 1. As the results indicate, the proposed network with the ViT + Token Selection module achieved higher classification results than the baseline ViT, improving the accuracy by 1.5%.

Table 1.

Ablation experiments were conducted on both the model architecture and the token selection module.

Model	Top accuracy	F1-score
TSViT	97.09%	96.54%
ViT	95.5%	95.01%

In order to further improve the model accuracy, experiments were conducted on the batch size used during training.³² The batch size refers to the number of samples selected for a single training iteration, where the model optimizes parameters through backpropagation for each batch. The influence of different batch sizes on model classification accuracy was demonstrated by training the model with batch sizes of 1, 4, 16, and 32, respectively. According to the experimental results (Figure 5), using smaller batch sizes can increase the number of parameter optimization iterations during model training. This allows the model to achieve higher accuracy on small-scale datasets. Although this approach may increase training time, more frequent parameter updates enable the model to better adapt to the dataset's features, thereby enhancing its performance. Therefore, when dealing with small-scale datasets and aiming for higher accuracy, selecting smaller batch sizes for training the TSViT model can be considered.

Figure 5.

Comparative experiments between different batches.

In addition to validating the impact of the Token Selection module, we conducted comparative experiments against three established recognition networks: VGG16,³³ GoogleNet,³⁴ and AlexNet.³⁵ To investigate the direct influence of model architecture on computational efficiency, we focused on comparing the core parameters and the time performance of each architecture for both inference and training tasks, as detailed in Table 2. The analysis of the data reveals that a model's architectural design is the decisive factor in its time efficiency. For instance, models like 2D-CNN and GoogLeNet achieve high computational speeds due to their relatively simple designs, featuring stacked convolutional layers and global average pooling. In contrast, our proposed TSVIT method is a highly complex model designed to capture global information through its unique architectural parameters, including patch size, multiple stacked Transformer layers, and the multi-head self-attention mechanism. This architectural complexity is directly reflected in a higher computational cost, resulting in longer times for inference and training. Nevertheless, we contend that this increased time investment is a necessary trade-off for achieving superior performance. As subsequent experimental results will demonstrate, it is precisely these architectural features that endow our model with its exceptional feature learning capabilities, allowing it to attain the highest performance on the critical metric of classification accuracy.

Table 2.

Comparison of methodological differences and computation time.

Model / Type	Key Architectural Parameters	Inference Time	Training Time
2D-CNN	·Number of convolutional layers: 9 ·Kernel sizes: Mixed use of 5 × 5 and 3 × 3 ·Number of filters: 16, 32, 64 ·Pooling method: Global Average Pooling (GAP) ·Final classification layer: Softmax	5s	12min
GoogLeNet	·Building block: Inception module ·Bottleneck layer size: 1 × 1 convolution ·Parallel convolutional kernels: 1 × 1, 3 × 3, 5 × 5 ·Pooling method: Global Average Pooling ·Network depth: 22 layers	10s	29min
AlexNet	·Kernel sizes: Varied (11 × 11, 5 × 5, 3 × 3) ·Number of convolutional layers: 5 ·Number of fully-connected layers: 3 ·Activation function: ReLU ·Regularization: Dropout, LRN	14s	40min
TSVIT	·Patch size: 32 × 32 ·Hidden dimension (d): 768 ·Number of Transformer layers (L): 12 ·Number of multi-head attention heads (h): 12 ·MLP size: 3072	16s	45min
VGG16	·Kernel size: Uniform 3 × 3 ·Number of convolutional layers: 13 ·Number of fully-connected layers: 3 ·Network depth: 16 layers ·Pooling kernel size: Uniform 2 × 2	39s	78min

In order to provide a comprehensive reflection and evaluation of the performance of the proposed network, confusion matrices were used to illustrate the actual recognition performance of the four networks for each type of scatter plot. The results are shown in Figure 6, where 1, 2, 3, 4, 5, and 6 represent premature ventricular beats, supraventricular premature beats, atrial fibrillation, atrial fibrillation with ventricular premature beats, atrial flutter, and normal sinus rhythm (NSR) respectively. From Figure 6, it can be observed that all networks exhibit a higher rate of misclassification for scatter plot types 3 and 4. This is attributed to the similarity in morphology between atrial fibrillation (AF) and atrial fibrillation with ventricular premature beats. Overall, the TSViT model demonstrates good recognition capabilities for all six types of ECG scatter plots, with the majority of scatter plot categories being correctly identified.

Figure 6.

Confusion matrix results for different models. (a) TSViT; (b) VGG16; (c) AlexNet; (d) GoogleNet.

The precision, recall, and F1-score of each network were calculated using the confusion matrix as evaluation metrics to assess the recognition performance of the networks.

Table 3 presents the evaluation results for different models. It can be observed that the ECG scatter plot abnormal identification based on the TSViT model shows improvements in precision, recall, and F1-score compared to the VGG16, AlexNet, and GoogleNet models. Specifically, the precision increased by 9.57%, 16.39%, and 4.41% respectively, the recall increased by 8.34%, 15.73%, and 3.6% respectively, and the F1-score increased by 8.95%, 16.06%, and 4% respectively. Therefore, the proposed TSViT model further enhances the accuracy of ECG scatter plot recognition. Figures 7 and 8 show the original and marked images of the ECG scatter plot, respectively.

Comparative Experiments

Figure 7.

Original images.

Figure 8.

Marked images.

Table 3.

Evaluation results of different models.

Network	Precision	Recall	F1-score
TSViT	97.09%	96%	96.54%
VGG16	87.52%	87.66%	87.59%
AlexNet	80.70%	80.27%	80.48%
GoogleNet	92.68%	92.40%	92.54%

To gain a deeper understanding of the performance advantages of the proposed network over other recognition networks, the receiver operating characteristic (ROC) curve was employed to visualize their performance. The comparative method was a 2D-CNN-based approach for ECG scatter plot recognition, as proposed in the literature. The results are shown in Figure 9, which presents the ROC curves and their corresponding area under the curve (AUC) values for each network across the different classes.

Figure 9.

ROC curves and AUC values of the proposed method compared with state-of-the-art methods.

The ROC curve is plotted with the false positive rate (FPR) on the x-axis and the true positive rate (TPR) on the y-axis, allowing for a rapid assessment of a classifier's performance on ECG scatter plots at various thresholds. The closer the curve is to the top-left corner, the lower the misclassification rate and the higher the sensitivity. The AUC, representing the area under the ROC curve, is a metric used to measure the model's generalization ability and reflects the overall classification capability. A higher AUC value indicates better classification performance.

By analyzing both the ROC curves and the AUC values, it can be concluded that TSVIT demonstrates a superior performance in identifying abnormalities in ECG scatter plots compared to the existing method, exhibiting better robustness and generalization capabilities.

Conclusion

Building upon the success of vision transformer (ViT) in image classification tasks, the TSViT approach is proposed for abnormal classification of electrocardiogram (ECG) scatter plots. Considering that ECG scatter plot images often contain significant blank and irrelevant areas, a token selection module is introduced before the final layer of the transformer encoder to select discriminative tokens as input, thereby obtaining the output used for classification. Experimental results demonstrate that the TSViT model achieves superior performance on real ECG scatter plot datasets. In the future, research will focus on exploring lightweight vision transformer models for more precise ECG scatter plots classification, as well as developing applications based on relevant algorithms to facilitate their integration into clinical practice more effectively.

Footnotes

ORCID iD

Jing Cheng

Ethics statement

This study was approved by the Ethics Committee of The Second Affiliated Hospital of Anhui Medical University (Approval No. SL-YX2024-081), and written informed consent was obtained from all participants prior to the initiation of the study.

CRediT authorship contribution statement

WYT: Conceptualization, Methodology, Software, Writing – original draft, Writing – review & editing. XXY: Resources, Writing – original draft. GXY: Methodology, Writing – original draft. CQ: Investigation, Writing – original draft. YCY: Investigation, Writing – original draft. HF: Conceptualization, Software, Investigation, Validation, Writing – original draft, Writing – review & editing. CJ: Funding acquisition, Supervision, Validation, Writing – original draft, Writing – review & editing.

Ethical approval

Funding

The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. This research was supported by the Key Scientific Research Projects of Universities in Anhui Province(2023AH053177), Key Social Research Projects of Universities in Anhui Province(2023AH050722).

the Key Scientific Research Projects of Universities in Anhui Province, Key Social Research Projects of Universities in Anhui Province, (grant number 2023AH053177, 2023AH050722).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

Mensah

Roth

Fuster

. The global burden of cardiovascular diseases and risk factors. JACC 2019; 74: 2529–2532.

Mehta

Nazir

Trohman

, et al. Single-lead portable ECG devices: perceptions and clinical accuracy compared to conventional cardiac monitoring. J Electrocardiol 2015; 48: 710–716.

Esperer

Cohen

. Cardiac arrhythmias imprint specific signatures on lorenz plots. Ann Noninvasive Electrocardiol 2008; 13: 44–60.

Borracci

Montoya Pulvet

Ingino

, et al. Geometric patterns of time-delay plots from different cardiac rhythms and arrhythmias using short-term EKG signals. Clin Physiol Funct Imaging 2018; 38: 856–863.

Bao

Cao

Yan

, et al. Heart Rhythm Anomaly Recognition Based on ECG Lorenz Plot. In: 2022 4th International Conference on Natural Language Processing (ICNLP). Xi’an, China: IEEE, pp. 299–303.

Zhou

. ECG Classification using wavelet packet entropy and random forests. Entropy 2016; 18: 285.

Asl

Setarehdan

Mohebbi

. Support vector machine-based arrhythmia classification using reduced features of heart rate variability signal. Artif Intell Med 2008; 44: 51–64.

Elhaj

Salim

Harris

, et al. Arrhythmia recognition and classification using combined linear and nonlinear features of ECG signals. Comput Methods Programs Biomed 2016; 127: 52–63.

Liu

Wang

, et al. Deep learning in ECG diagnosis: a review. Knowl-Based Syst 2021; 227: 107187.

10.

Huang

Chen

Yao

, et al. ECG Arrhythmia classification using STFT-based spectrogram and convolutional neural network. IEEE Access 2019; 7: 92871–92880.

11.

Ullah

Anwar

Bilal

, et al. Classification of arrhythmia by using deep learning with 2-D ECG spectral image representation. Remote Sens 2020; 12: 1685.

12.

Xingxiu

Jianjun

Jing

HUA

. Arrhythmia classification based on CNN and bidirectional LSTM. Journal of Frontiers of Computer Science & Technology EBSCOhost 2021; 15: 2353.

13.

Petmezas

Haris

Stefanopoulos

, et al. Automated atrial fibrillation detection using a hybrid CNN-LSTM network on imbalanced ECG datasets. Biomed Signal Process Control 2021; 63: 102194.

14.

Dai

Yan

, et al. Study of cardiac arrhythmia classification based on convolutional neural network. Comput Sci Inf Syst 2020; 17: 445–458.

15.

Acharya

Hagiwara

, et al. A deep convolutional neural network model to classify heartbeats. Comput Biol Med 2017; 89: 389–396.

16.

Donida Labati

. Deep-ECG: convolutional neural networks for ECG biometric recognition. Pattern Recognit Lett 2019; 126: 78–85.

17.

Dosovitskiy

Beyer

Kolesnikov

, et al. An Image is Worth 16(16 Words: Transformers for Image Recognition at Scale. Epub ahead of print 3 June 2021. DOI: 10.48550/arXiv.2010.11929.

18.

Vaswani

Shazeer

Parmar

, et al. Attention is All you Need. In: Advances in Neural Information Processing Systems. Curran Associates, Inc., https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html (2017, accessed 24 September 2025).

19.

Chen

Zhou

. A transformer-based deep neural network for arrhythmia detection using continuous ECG signals. Comput Biol Med 2022; 144: 105325.

20.

Yan

Liang

Zhang

, et al. Fusing Transformer Model with Temporal Features for ECG Heartbeat Classification. In: 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 898–905.

21.

Jamil

Roy

. An efficient and robust phonocardiography (PCG)-based valvular heart diseases (VHD) detection framework using vision transformer (ViT). Comput Biol Med 2023; 158: 106734.

22.

Natarajan

Boverman

Chang

, et al. Convolution-Free Waveform Transformers for Multi-Lead ECG Classification. In: 2021 Computing in Cardiology (CinC), pp. 1–4.

23.

Kingma

and Adam. A Method for Stochastic Optimization. Epub Ahead Print 29 January 2017. DOI: 10.48550/arXiv.1412.6980.

24.

Yunping

Fangfang

, et al. The characteristics of RR–lorenz plot in persistent atrial fibrillation patients complicating with escape beats and rhythm. Chin J Cardiol 2014; 42: 481–483.

25.

Yiming

Anqin

DONG

Xiangdong

NIU

, et al. Characteristics of ECG lorenz scatter plot of muscle sleeve atrial arrhythmias. Chin Gen Pract 2021; 24: 3698.

26.

Chishaki

Takeshita

, et al. Different features of ventricular arrhythmias and the RR-interval dynamics in atrial fibrillation related to the patient’s clinical characteristics: an analysis using RR-interval plotting. J Electrocardiol 2004; 37: 207–217.

27.

Arnar

Mairesse

Boriani

, et al. Management of asymptomatic arrhythmias: a European Heart Rhythm Association (EHRA) consensus document, endorsed by the Heart Failure Association (HFA), Heart Rhythm Society (HRS), Asia Pacific Heart Rhythm Society (APHRS), Cardiac Arrhythmia Society of Southern Africa (CASSA), and Latin America Heart Rhythm Society (LAHRS).

28.

lan

jingjing

xin

. Value of ECG scatter plot in rapid diagnosis of arrhythmia. J Clin Intern Med 2022; 39: 326–328.

29.

Zhang

Zhao

. The clinical significance of Lorenz scatter plots in detecting ventricular parasystoles.

30.

Abnar

Zuidema

. Quantifying Attention Flow in Transformers, http://arxiv.org/abs/2005.00928 (2020, accessed 3 April 2024).

31.

Srivastava

Hinton

Krizhevsky

, et al. Dropout: A Simple Way to Prevent Neural Networks from Overﬁtting.

32.

Ioffe

Szegedy

. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.

33.

Simonyan

Zisserman

. Very Deep Convolutional Networks for Large-Scale Image Recognition, http://arxiv.org/abs/1409.1556 (2015, accessed 3 April 2024).

34.

Szegedy

Liu

Jia

, et al. Going Deeper With Convolutions. pp. 1–9.

35.

Krizhevsky

Sutskever

Hinton

. Imagenet classification with deep convolutional neural networks. Commun ACM 2017; 60: 84–90.

Abnormal classification of electrocardiogram scatter plot based on token selection vision transformer

Abstract

Objectives

Methods

Results

Conclusion

Keywords

Introduction

Methods

Dataset

Experimental setup

Data processing pipeline

Encoder

Multi-head attention (MSA)

Multi-layer perceptron (MLP)

Token selection module

Grid optimization

Results

Ablation study

Conclusion

Footnotes

ORCID iD

Ethics statement

CRediT authorship contribution statement

Funding

Declaration of conflicting interests

References