Deep Learning ECG Segmentation for Arrhythmias
RESEARCH ARTICLE
OPEN ACCESS
Abstract
Accurate delineation of key waveforms in an ECG is a critical step in extracting relevant features to support the diagnosis and treatment of heart conditions. Although deep learning based methods using segmentation models to locate P, QRS, and T waves have shown promising results, their ability to handle arrhythmias has not been studied in any detail. In this paper we investigate the effect of arrhythmias on delineation quality and develop strategies to improve performance in such cases. We introduce a U-Net-like segmentation model for ECG delineation with a particular focus on diverse arrhythmias. This is followed by a post-processing algorithm which removes noise and automatically determines the boundaries of P, QRS, and T waves. Our model has been trained on a diverse dataset and evaluated against the LUDB and QTDB datasets to show strong performance, with F1-scores exceeding 99% for QRS and T waves, and over 97% for P waves in the LUDB dataset. Furthermore, we assess various models across a wide array of arrhythmias and observe that models with a strong performance on standard benchmarks may still perform poorly on arrhythmias that are underrepresented in these benchmarks, such as tachycardias. We propose solutions to address this discrepancy.
Citation: Joung C, Kim M, Paik T, Kong S-H, Oh S-Y, Jeon WK, et al. (2024) Deep learning based ECG segmentation for delineation of diverse arrhythmias. PLoS ONE 19(6): e0303178. https://[Link]/10.1371/[Link].0303178
Editor: Mohamed Hammad, Menoufia University, EGYPT
Received: December 15, 2023
Accepted: April 20, 2024
Published: June 13, 2024
Peer Review History: PLOS recognizes the benefits of transparency in the peer review process; therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. The editorial history of this article is available here: [Link]
Hospital Institutional Data Access / Ethics Committee The contact information of the IRB of SNUH is as follows: Seoul National University Medical College/Seoul National University Hospital Medical Research Ethics Review Committee Tel: 82-02-2072-0694/2266 FAX: 82-02-3675-6824 03080, 101 Daehak-ro, Jongno-gu, Seoul, Republic of Korea [Link] introirb/_/singlecont/[Link] Email requests for data access can be sent to irb@[Link] or to Myung-Jin Cha, chamj81@[Link] The other data underlying the results presented in the study are available from Physionet, specifically QTDB, LUDB, MIT-BIH and PTB-XL. These public databases are available at [Link] content/qtdb/1.0.0/ [Link] ludb/1.0.1/ [Link] 0.0/ [Link]
Funding: This work was supported by the Korea Medical Device Development Fund grant funded by the Korea government (the Ministry of Science and ICT, the Ministry of Trade, Industry and Energy, the Ministry of Health & Welfare, the Ministry of Food and Drug Safety) (Project Number: 1711174270, RS-2021-KD000008), (JSH, JHJ). [Link] [Link]/eng/[Link] In addition, WK and OvK receive support from the National Research Foundation of Korea (NRF) Grant 2022R1A5A6000840 (joint), as well as (MSIP) [RS-2022-00165404 (WK), NRF2023005562 (Ovk)], funded by the Korean Government. https://[Link]/[Link] None of the funders played a role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: I have read the journal’s policy and the authors of this manuscript have the following competing interests: CJ, TP, WK and OvK received financial support through NRF grants 2022R1A5A6000840 as well as RS-2022-00165404, NRF2023005562, funded by the Korean Government. SHK, SYO, JHJ, JSH, WJK and MJC received financial support through grants 1711174270, RS-2021-KD000008 funded by the Korean government. In addition, SHK, SYO, JHJ, JSH, WJK and MJC are Stockholders of Medifarmsoft Co., Ltd.
Precise delineation, which involves identifying the onset and offset of these waves and not just the peaks, is hence critical.
As such, automatic delineation of ECGs has been an important and well-developed topic, starting with rule-based techniques for locating the QRS complex, to wavelet transform-based delineation, [2–4] and deep learning techniques. Wavelet transforms deliver state-of-the-art performance in the benchmark QT database (QTDB), [5]. However, as [6, 7] point out, rule-based approaches typically require the adjustment of a threshold value for high scores, which may limit their generalizability to other datasets. Deep learning offers an alternative as shown for example in [6–8]: Jimenez-Perez et al. [6] used a U-Net type architecture [9] to achieve delineation performance comparable to wavelet-based methods on QTDB, and Moskalenko et al. [8] reported higher delineation performance compared to wavelet-based algorithms on the Lobachevsky University Database (LUDB) [10].
Although deep learning based delineation has shown excellent performance on benchmarks, arrhythmias pose a particular challenge in two important ways. First of all, there is a lack of data, reflected in the fact that the benchmark datasets are quite small and their samples tend to have relatively low heart rates. As a result, the variety of arrhythmias for the purpose of testing delineation quality is limited despite the careful preparation of these databases. A second important obstacle is that many arrhythmias cause significant changes in the structural elements and morphological features of an ECG. These changes are particularly striking in case of the P wave, which usually has the lowest signal to noise ratio. For example, in atrial fibrillation (AFIB) and atrial flutter (AFL) the P wave is absent, and a fibrillatory signal or flutter wave is found instead. As noted in [11, 12], false P wave predictions during such events present a significant challenge for delineation algorithms in clinical practice. Other arrhythmias, such as atrioventricular (AV) block, affect not only the position of P waves in relation to the QRS complex, but also their occurrence. This can result in P waves and QRS complexes following independent rhythms. In all of these cases, the performance of a P wave delineation algorithm is affected adversely. For instance, Aziz et al. [13] report a considerable drop in sensitivity for P wave detection in the case of ECGs exhibiting arrhythmia.
In this paper, we investigate this two-fold problem in the setting of deep learning and develop remedies. Specifically, we evaluate the performance of deep learning models, and find that the performance drops markedly in the presence of certain arrhythmia, such as various forms of tachycardia. An example of such failure is shown in Fig 6. Because tachycardia are underrepresented in both QTDB and LUDB, this drop doesn’t affect the average performance of delineation in these benchmarks much, which explains why previous approaches still performed well on the benchmark tests. As remedies, we build on prior studies and devise a segmentation model with a U-Net like architecture to delineate ECG signals with diverse arrhythmias by training on a new dataset consisting of a large number of recordings with various arrhythmia types. In addition, we develop a segmentation model using a hybrid loss function that combines segmentation with the task of arrhythmia classification. This classification guided approach can ameliorate false P wave predictions for AFIB and AFL in short signals. A flow diagram of the model is shown in Fig 2.
The key contributions of this paper can be summarized as follows:
• Training a segmentation model that accurately delineates a chosen set of common arrhythmia types, achieved by using a diverse training set and employing a suitable post-processing strategy.
• Identifying common failure cases of segmentation models through separate validation on different arrhythmia types.
• Evaluating our model’s performance on benchmark datasets QTDB and LUDB, demonstrating generalizability by results comparable with previous research.
• Introducing a classification guided strategy to reduce false P wave predictions for AFIB and AFL in short signals.
Fig 1. A schematic representation of an ECG signal measured in lead I or lead II with the main complexes indicated.
The rest of the paper is structured as follows. The Related Work Section 2 provides a review
of relevant literature on ECG delineation and deep learning-based ECG analysis. The Methods
Section 3 outlines the databases used for this study and presents the proposed delineation algo-
rithm. The Results Section 4 presents performance evaluation metrics and reports experimen-
tal results. The Discussion Section 5 offers interpretations, implications, and discusses
limitations and future directions. Finally, the Conclusion Section 6 concludes the paper.
Fig 2. Flow diagram for ECG delineation: An ECG input signal is segmented by a U-Net like model using an optional classification branch, and
post-processed for noise, before producing final delineation results.
2 Related work
2.1 Traditional approaches for ECG delineation
Early works on ECG delineation were primarily focused on developing rule-based methods to
identify and locate the QRS complex. Pan and Tompkins [14] presented a seminal example of
detecting the QRS complex by utilizing slope, amplitude, and width information. Subsequently,
more advanced techniques have been employed to also identify the P and T waves.
These include digital signal processing such as the wavelet transform [2–4, 15], the Hilbert
transform [16, 17], and the phasor transform [18]. Additionally, classical machine learning
approaches like hidden Markov models [19, 20] and Gaussian mixture models [21] have also
been employed. Among these, wavelet-based methods have been widely cited as being the
state-of-the-art, based on their delineation performance on public datasets such as QTDB and
LUDB [4]. However, despite their effectiveness, traditional methods typically require manual
feature extraction or domain-specific knowledge, whereas wavelet-based algorithms demand
careful threshold selection for consistent results on different datasets.
By filtering out the segmentation output using the classification output, the number of false
positives is reduced. A similar approach was taken by Shuvo et al. [33], where a separate locali-
zer branch was added together with an additional classifier branch.
In the ECG literature, classification and segmentation tasks have remained separate for the
most part, while deep learning architectures have shown great success for both tasks [26]. In
our current work, we experiment with combining the two tasks by training an ECG segmenta-
tion model together with an additional arrhythmia classification learning objective. Previous
studies have demonstrated the effectiveness of convolutional neural networks for arrhythmia
classification. For example, Hannun et al. [22] trained a 34-layer convolutional neural network
for arrhythmia classification of single-lead ECG signals, showing performance comparable to
that of cardiologists. Ribeiro et al. [34] later used a residual network architecture, an architec-
ture first developed by He et al. [35] in the context of image classification, for the reliable diag-
nosis of 12-lead ECG signals. For a detailed review of deep learning applications in arrhythmia
classification, we refer to the systematic reviews conducted by Xiao et al. [36] and Ansari et al.
[37].
3 Methods
3.1 Data
For this study, we have used both internal and external datasets to develop and test our algo-
rithm. The internal database was used for training the segmentation model and assessing
delineation accuracy across diverse arrhythmias. The standard public datasets QTDB and
LUDB were used for external validation of our algorithm. The characteristics of these datasets
are summarized in Table 1 and elaborated upon in subsequent sections.
Table 1. Descriptions of signals and their annotations for each of the databases.

| Data Source | # Recordings | Duration | Sampling frequency | Leads | Boundary annotations |
|---|---|---|---|---|---|
| Internal Database | 1557 | 10 seconds | 500 Hz, 250 Hz | 2 (I, II) | P, QRS, T on/offsets |
| QTDB [5] | 105 | 15 minutes | 250 Hz | 2 | P, QRS on/offsets, T offsets |
| LUDB [10] | 200 | 10 seconds | 500 Hz | 12 | P, QRS, T on/offsets |
Initial selection of ECGs from the electrocardiography database was based on the presence
of common clinically significant cardiac arrhythmias. After reviewing ECG records, we
excluded ECGs where disagreement on the review result was found, as well as ECGs that were
too noisy to interpret reliably. The original automatic arrhythmia diagnosis in the database,
produced by a commercial interpretation product (the MUSE Cardiology Information System by GE)
and confirmed by an overreader, was then independently reviewed by two expert cardiologists.
A diagnosis was used in the analysis only when both readings were in agreement. After that, the onsets
and offsets for P, QRS, and T waves were manually annotated for each lead independently by a
cardiologist using a custom-made software tool. The annotation results were then confirmed
by another cardiologist. As a quality control measure, we include in the S1 Appendix statistics
on the difference between lead I and lead II annotations.
For each subject, the extracted data consisted of a recording with a duration of 10 seconds
for leads I and II with a sampling frequency of either 250Hz or 500Hz. The dataset was parti-
tioned into a training set and a test set. The training set comprised 1032 recordings and was
organized to include approximately 70% of recordings for each identified arrhythmia class.
The test set was composed of the remaining 525 recordings.
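A per-class partition of this kind can be sketched as follows. This is only an illustration of a stratified split, not the authors' procedure: the function name `stratified_split`, the random seed, and the rounding rule are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx) with about train_fraction of each class in train.

    `labels` holds one arrhythmia class label per recording (hypothetical input).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_train = round(train_fraction * len(members))  # rounding rule is an assumption
        train_idx.extend(members[:n_train])
        test_idx.extend(members[n_train:])
    return sorted(train_idx), sorted(test_idx)
```

Splitting within each class, rather than over the pooled recordings, keeps rare rhythm types present in both the training and the test set.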
Fig 3. Segmentation model architecture. Our architecture is similar to U-Net3+, but uses 1D convolutional blocks and has an additional classifier
branch.
softmax classifier for four classes: P wave, QRS complex, T wave, and none of these. This gives
four class probabilities for each time stamp.
Note that for all other convolutional layers, we use a kernel size of 9 and padding of 4. As
for the activation function, we use a leaky rectified linear unit with negative slope 0.01 for all
layers. More specific details can be found in our implementation, which is available at https://
[Link]/ckjoung/ecg-segmentation.
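As a quick check on these settings, and assuming stride 1 and no dilation (neither is stated in this section), a kernel size of k = 9 with padding p = 4 on each side leaves the signal length unchanged:

$$L_{\mathrm{out}} = L_{\mathrm{in}} + 2p - k + 1 = L_{\mathrm{in}} + 8 - 9 + 1 = L_{\mathrm{in}}$$

Under that assumption, each such layer produces one output value per input time stamp.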
3.7 Post-processing
The waveform boundaries are determined from the segmentation output through a post-pro-
cessing stage, which consists of the following three steps. First, we extract segments of each type
(P wave, QRS, T wave, none) by taking connected intervals where the probability of that type
outputted by the model is highest. As a second noise reduction step, we discard short connected
regions (of a duration less than 40 ms) and adjust the label based on the segmentation results of
the adjacent intervals. In particular, we adjust the label according to the following rule:
1. if the two intervals adjacent to a short region have the same label, we regard the short seg-
ment as having the same label, thereby gluing the two regions to a single segment;
2. if the labels of the adjacent intervals are different, we discard the short region and label it as
being none of the waveforms.
In the final step, we proceed by choosing the longest intervals labeled as P wave and T wave
between consecutive QRS intervals and obtain their onsets and offsets. It can of course happen
that there is no P wave, for example in the case of atrial fibrillation, or no T wave, which is very
rare. This procedure automatically removes noise and returns unambiguous results.
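The three post-processing steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class order `NONE, P, QRS, T`, the function names, the handling of short runs at the signal boundaries (labeled as none), and the use of the original neighbour labels when two short runs are adjacent are all choices made for the example. Waves before the first or after the last QRS interval are ignored here.

```python
NONE, P, QRS, T = 0, 1, 2, 3  # class order is an assumption of this sketch


def to_runs(labels):
    """Run-length encode per-sample labels into [label, start, end) runs."""
    runs, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            runs.append([labels[start], start, i])
            start = i
    return runs


def merge_runs(runs):
    """Glue neighbouring runs that carry the same label."""
    merged = []
    for label, start, end in runs:
        if merged and merged[-1][0] == label:
            merged[-1][2] = end
        else:
            merged.append([label, start, end])
    return merged


def clean_short_runs(runs, min_len):
    """Step 2: relabel runs shorter than min_len samples."""
    cleaned = []
    for i, (label, start, end) in enumerate(runs):
        if end - start < min_len:
            left = runs[i - 1][0] if i > 0 else None
            right = runs[i + 1][0] if i + 1 < len(runs) else None
            # Rule 1: equal neighbours -> adopt their label; rule 2: otherwise none.
            label = left if left is not None and left == right else NONE
        cleaned.append([label, start, end])
    return merge_runs(cleaned)


def delineate(probs, fs=500, min_ms=40):
    """Return (onset, offset) sample-index pairs per wave type; offsets are exclusive."""
    # Step 1: per-sample label with the highest predicted probability.
    labels = [max(range(len(p)), key=p.__getitem__) for p in probs]
    runs = clean_short_runs(to_runs(labels), int(min_ms * fs / 1000))
    qrs = [(s, e) for lab, s, e in runs if lab == QRS]
    waves = {"P": [], "QRS": qrs, "T": []}
    # Step 3: longest P and T run between consecutive QRS intervals.
    for (_, gap_start), (gap_end, _) in zip(qrs, qrs[1:]):
        for name, code in (("P", P), ("T", T)):
            cands = [(s, e) for lab, s, e in runs
                     if lab == code and s >= gap_start and e <= gap_end]
            if cands:
                waves[name].append(max(cands, key=lambda r: r[1] - r[0]))
    return waves
```

At 500 Hz the 40 ms threshold corresponds to 20 samples, so any run of fewer than 20 samples is relabeled before the boundaries are read off.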
apply batch normalization and dropout for regularization following the classification models
of [22, 34]. The arrhythmia classification is performed by the final fully connected layer with
softmax activation, whose output represents the probabilities of the signal belonging to either
an AFIB or an AFL episode or not. A final prediction is made using an argmax function. Note
that we have allowed the classification branch to take as input not just the feature of the last
encoder block, but of encoder blocks of all levels. This is done by an aggregation scheme which
works as follows. We first downsample the features of the first four encoder blocks to a size
equal to that of the last encoder block. The downsampling is done using an average pooling
layer. After the features have been resampled to the same shape, we concatenate the features to
get a single aggregated feature.
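The aggregation scheme can be illustrated with a small pure-Python sketch. It assumes that each encoder feature is a list of channels, each channel a list of values, and that every feature length is an integer multiple of the last block's length; the function names are hypothetical, and the real model operates on tensors rather than nested lists.

```python
def avg_pool_1d(feature, factor):
    """Average-pool a (channels x length) feature map by an integer factor."""
    return [
        [sum(row[i:i + factor]) / factor
         for i in range(0, len(row) - factor + 1, factor)]
        for row in feature
    ]


def aggregate(features):
    """Downsample every encoder feature to the last one's length, then stack channels."""
    target_len = len(features[-1][0])
    aggregated = []
    for feat in features:
        factor = len(feat[0]) // target_len  # assumes an exact integer ratio
        pooled = avg_pool_1d(feat, factor) if factor > 1 else feat
        aggregated.extend(pooled)            # concatenation along the channel axis
    return aggregated
```

The classifier then sees one feature map whose channel count is the sum of the channel counts of all encoder levels, at the temporal resolution of the deepest block.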
3.9 Training
We have trained the network from scratch with convolutional weights initialized as in He et al.
[42] using the Adam optimizer [43] with default parameters. The learning rate was initialized
to be 0.001 and set to follow a cosine annealing schedule. To increase the diversity of training
data, we applied data augmentation using transformations designed to mimic probable physio-
logical noise, such as baseline wander and powerline noise, as used in [44]. The equations for
these transformations are given as follows:
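• Baseline wander:
$$n_{\mathrm{blw}}(t) = \sum_{k=1}^{50} a_k \cos(2\pi t k \Delta f + \phi_k) \tag{1}$$
• Powerline noise:
$$n_{\mathrm{pln}}(t) = \sum_{k=1}^{3} a_k \cos(2\pi t k f_n + \phi_1) \tag{2}$$
where $\Delta f = 0.01$ Hz, $f_n = 50$ Hz, with $a_k$ and $\phi_k$ uniformly sampled from $[0, 1)$ and $[0, 2\pi)$.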
We have also randomly resized the input signal by a factor exp(α), where α is uniformly sampled
from [log 0.5, log 2], added random Gaussian noise with zero mean and standard deviation
0.01 mV, and applied a constant baseline shift by an offset sampled from a Gaussian distribution.
Fig 5 shows examples of the transformations used.
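The two noise models and the resize factor can be sketched as follows. This is a minimal illustration with hypothetical function names, not the authors' augmentation code. It follows Eq. (2) as printed, where all three harmonics share the single phase ϕ1, and it leaves out any amplitude scaling of the noise relative to the signal because the text does not specify one.

```python
import math
import random


def baseline_wander(t, rng, delta_f=0.01, n_terms=50):
    """Eq. (1): sum of 50 low-frequency cosines with random amplitude and phase."""
    a = [rng.random() for _ in range(n_terms)]                  # a_k in [0, 1)
    phi = [2 * math.pi * rng.random() for _ in range(n_terms)]  # phi_k in [0, 2*pi)
    return [
        sum(a[k - 1] * math.cos(2 * math.pi * ti * k * delta_f + phi[k - 1])
            for k in range(1, n_terms + 1))
        for ti in t
    ]


def powerline_noise(t, rng, f_n=50.0, n_terms=3):
    """Eq. (2) as printed: three harmonics of f_n sharing the single phase phi_1."""
    a = [rng.random() for _ in range(n_terms)]
    phi_1 = 2 * math.pi * rng.random()
    return [
        sum(a[k - 1] * math.cos(2 * math.pi * ti * k * f_n + phi_1)
            for k in range(1, n_terms + 1))
        for ti in t
    ]


def sample_resize_factor(rng):
    """exp(alpha) with alpha uniform on [log 0.5, log 2], so the factor lies in [0.5, 2]."""
    return math.exp(rng.uniform(math.log(0.5), math.log(2.0)))
```

Here `t` is a list of time points in seconds, for example `[i / 500 for i in range(5000)]` for a 10 second signal at 500 Hz, and `rng` is a `random.Random` instance. Sampling α on a logarithmic scale makes shrinking by a given factor as likely as stretching by the same factor.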
We adopt focal loss as introduced in [45] as our segmentation loss function. Focal loss mod-
ifies the standard cross-entropy loss by providing smaller weights to well-classified time
stamps, letting the model focus on regions that are difficult to classify. The focal loss general-
ized to our multi-class segmentation setting can be written in the following form:
$$L_{\mathrm{focal}} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C} (1 - \hat{y}_{n,c})^{\gamma} \, y_{n,c} \log \hat{y}_{n,c} \tag{3}$$
Here, $N$ is the number of time stamps and $C$ the number of classes; $\hat{y}_{n,c}$ denotes the predicted probability of time stamp $n$ belonging to class $c$, while $y_{n,c}$ is the $c$-th component of the one-hot vector $y_n$ of the true class label for time stamp $n$. In our experiments, we use the default value of γ = 1.0. When arrhythmia classification guidance (Section 3.8) is used, we apply the standard binary cross-entropy loss $L_{\mathrm{bce}}$ to the classification branch. This gives the overall loss function:
$$L_{\mathrm{total}} = L_{\mathrm{focal}} + \alpha L_{\mathrm{bce}}. \tag{4}$$
The additional trade-off parameter α can be adjusted to balance the effect of classification and
segmentation losses during training. For all our experiments, we used α = 1.
Fig 5. Examples of transformations used for data augmentation. (a) Original, (b) baseline wander, (c) baseline shift, (d) resize, (e) powerline noise
and (f) Gaussian noise.
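The segmentation loss and the combined objective can be written out in a few lines of plain Python. This is a sketch for checking the arithmetic, with hypothetical function names; it takes true labels as class indices, so that only the true-class term of the one-hot sum survives, and it clips probabilities at a small `eps` to avoid log(0), which is an assumption of the example rather than something stated in the text.

```python
import math


def focal_loss(y_true, y_pred, gamma=1.0, eps=1e-12):
    """Mean focal loss; y_true are class indices, y_pred per-time-stamp probabilities."""
    total = 0.0
    for c_true, probs in zip(y_true, y_pred):
        p = max(probs[c_true], eps)
        total += -((1.0 - p) ** gamma) * math.log(p)
    return total / len(y_true)


def binary_cross_entropy(y, p, eps=1e-12):
    """Standard binary cross-entropy for one signal-level label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def total_loss(l_focal, l_bce, alpha=1.0):
    """Combined objective: focal segmentation loss plus weighted classification loss."""
    return l_focal + alpha * l_bce
```

With γ = 0 the focal loss reduces to the usual cross-entropy. With γ = 1, a time stamp predicted with probability 0.9 for its true class contributes only a tenth of its cross-entropy term, which is how well-classified samples are down-weighted.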
We train and validate our model using single lead ECG signals. To prevent potential issues
arising from incomplete annotations for waveforms near the beginning and the end of a signal,
we proceed as in [8] to exclude the initial and final 2 seconds of our signals during the training
process. Hence, our model performs segmentation and classification using a signal of duration
6 seconds during training, and of 10 seconds during validation. While this scheme was
designed mainly due to its practicality, we note that ECG recordings of 5 or 10 seconds have
been shown to be successful for a CNN based arrhythmia classification [46]. We only use sig-
nals from leads I and II for training and validation of our model. Each input signal is resam-
pled to 500Hz.
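The cropping step amounts to simple index arithmetic, sketched below with a hypothetical helper. At 500 Hz a 10 second signal has 5000 samples, and removing 2 seconds (1000 samples) from each end leaves 3000 samples, that is, 6 seconds.

```python
def crop_for_training(signal, fs=500, margin_s=2.0):
    """Drop the first and last margin_s seconds of a signal sampled at fs Hz."""
    margin = int(round(margin_s * fs))
    if len(signal) <= 2 * margin:
        raise ValueError("signal is too short to crop both margins")
    return signal[margin:len(signal) - margin]
```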
4 Results
4.1 Evaluation metrics
In order to evaluate the performance of the proposed delineation algorithm, we compare the
ground truth annotations for the onsets and offsets of P, QRS, and T waves with the predicted
annotations. We follow the usual standard chosen by the Association for the Advancement of
Medical Instrumentation (AAMI) [47], which considers an onset or an offset to be correctly
detected if an algorithm locates the same type of annotation in a neighborhood of 150 ms.
Using this threshold value, we examine for each predicted point whether the prediction
correctly detects a point in the ground truth annotation.
If a ground truth annotation is correctly detected, we count a true positive (TP). In this case,
the error is measured as the time deviation of the predicted point from the manual annotation.
If there is no point of the ground truth annotation in the 150 ms neighborhood of the
prediction, then we count a false positive (FP). Once every prediction has been compared with the
manual labels, we count a false negative (FN) for each point of the ground truth annotation
that has not been matched to any prediction.
Based on this, we calculate the following evaluation metrics:
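• mean error m
• standard deviation of error σ
• sensitivity
$$Se = \frac{TP}{TP + FN} \tag{5}$$
• positive predictive value
$$PPV = \frac{TP}{TP + FP} \tag{6}$$
• F1-score
$$F1 = 2 \cdot \frac{Se \cdot PPV}{Se + PPV} \tag{7}$$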
Se indicates the algorithm’s ability to identify true positives among all ground truth annota-
tions, while PPV quantifies the algorithm’s precision in detecting annotations. Furthermore,
the F1-score, defined as the harmonic mean of Se and PPV, offers a unified assessment of the
algorithm’s performance. These metrics have been commonly used in the literature for the
evaluation of ECG delineation algorithms [3, 4, 8, 40], and we use them to evaluate perfor-
mance of our model.
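The counting procedure and the resulting metrics can be sketched as follows for a single annotation type, for example P onsets. This is an illustration under assumptions, not the authors' evaluation code: the greedy one-to-one matching to the nearest unmatched ground truth point and the use of the population standard deviation are choices made for the example, and the function names are hypothetical.

```python
def match_points(truth_ms, pred_ms, tol_ms=150.0):
    """Match predictions to ground truth points within tol_ms (greedy, one-to-one)."""
    unmatched = sorted(truth_ms)
    errors, fp = [], 0
    for p in sorted(pred_ms):
        cands = [t for t in unmatched if abs(p - t) <= tol_ms]
        if cands:
            nearest = min(cands, key=lambda t: abs(p - t))
            unmatched.remove(nearest)
            errors.append(p - nearest)   # signed error of a true positive
        else:
            fp += 1                      # no ground truth point nearby
    return errors, fp, len(unmatched)    # leftover ground truth points are FNs


def summarize(errors, fp, fn):
    """Compute mean error, standard deviation, Se, PPV and F1 from the counts."""
    tp = len(errors)
    se = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * se * ppv / (se + ppv) if se + ppv else 0.0
    mean = sum(errors) / tp if tp else 0.0
    std = (sum((e - mean) ** 2 for e in errors) / tp) ** 0.5 if tp else 0.0
    return {"m": mean, "sigma": std, "Se": se, "PPV": ppv, "F1": f1}
```

For example, with ground truth points at 100, 900 and 1700 ms and predictions at 110, 895 and 2500 ms, two predictions are true positives with errors of 10 ms and -5 ms, the prediction at 2500 ms is a false positive, and the point at 1700 ms is a false negative.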
Fig 6. Delineation of an ECG showing sinus tachycardia (PTB-XL ECG-ID: 857) using two different models: (a) a model trained on LUDB, which is
somewhat short on tachycardia samples, fails to detect the fairly obvious P waves; (b) a model trained on more diverse data with otherwise identical
settings, performs much better.
Fig 7. Delineation of an ECG showing sinus tachycardia and AVB1 (PTB-XL ECG-ID: 3337) using two different models: (a) a model trained on LUDB
delineates all QRS complexes and T waves, including the premature ventricular complex, correctly, but misses P waves that are in shorter RR intervals;
(b) a model trained on more diverse data with otherwise identical settings, finds all P waves.
arrhythmias, specifically AVB1 in Fig 7, both models perform well in detection of QRS com-
plexes and T waves, but model a) has trouble with identifying the P waves. As a final example,
Fig 8 presents the performance of the two models on a signal with bundle branch block and
premature ventricular complexes. In this case, both models detect all waves correctly according
to the standard of AAMI. However, model a) underestimates the width of the QRS complex: it
puts the S wave offset well before the J point. We note that this type of defect is only visible in
the mean error. The other metrics do not reveal this type of flaw. We will now check these
phenomena systematically by delineating the test set of the internal dataset, which contains a wide
range of arrhythmias. We point out that models a) and b) both perform well on QTDB (and
LUDB); performance on these benchmark sets is addressed in the next section.
To assess the model’s ability to handle signals with diverse arrhythmias, we measure the
F1-scores separately for each of the following arrhythmia types: normal sinus rhythm (NSR),
sinus tachycardia (ST), bundle branch block (BBB), first degree atrioventricular block (AVB1),
atrial fibrillation (AFIB), atrial flutter (AFL) and ventricular tachycardia (VT). We also exam-
ine how the arrhythmia distribution of the training set can affect the delineation performance.
For this, we train a separate segmentation model using LUDB as the only training set and com-
pare the resulting delineation performance. LUDB has often been used in previous studies [7,
29] for training a segmentation model for the purposes of delineation. Here, we follow the
same approach but test it on the internal dataset in order to measure performance for different
arrhythmias. For a reliable comparison, each evaluation is repeated 20 times and the average
score is reported.
Fig 8. Delineation of an ECG showing bundle branch block and premature ventricular complex (PTB-XL ECG-ID: 287) using two different models:
(a) a model trained on LUDB detects all waves correctly, but underestimates the width of all QRS complexes with the exception of the PVC; (b) a model
trained on more diverse data with otherwise identical settings, detects the onsets and offsets accurately.
Table 3. Arrhythmia dependence of onset and offset delineation performance on a test set comprised of diverse arrhythmia. The training data strongly affects the
models’ performance as highlighted in the bold-faced F1-scores: scores can drop more than 15%. All values are F1-scores (%).

| Training | Rhythm | P onset | P offset | QRS onset | QRS offset | T onset | T offset |
|---|---|---|---|---|---|---|---|
| Trained on LUDB (limited diversity) | NSR | 99.84 | 99.84 | 99.83 | 99.84 | 99.97 | 99.97 |
| | ST | 81.54 | 81.54 | 99.93 | 99.93 | 97.59 | 98.83 |
| | BBB | 98.89 | 98.89 | 99.94 | 99.94 | 99.89 | 99.94 |
| | AVB1 | 90.53 | 90.97 | 99.82 | 99.82 | 100.00 | 100.00 |
| | AFIB | - | - | 99.29 | 99.29 | 97.92 | 97.60 |
| | AFL | - | - | 99.21 | 99.21 | 92.67 | 93.00 |
| | VT | - | - | 91.04 | 91.11 | 78.78 | 78.63 |
| | All | 90.37 | 90.46 | 99.45 | 99.46 | 97.42 | 97.54 |
| Trained on diverse dataset | NSR | 99.69 | 99.69 | 99.78 | 99.81 | 99.95 | 99.95 |
| | ST | 97.19 | 97.19 | 99.91 | 99.91 | 99.90 | 99.94 |
| | BBB | 99.00 | 99.00 | 99.94 | 99.94 | 99.88 | 99.89 |
| | AVB1 | 95.93 | 95.93 | 99.84 | 99.84 | 100.00 | 100.00 |
| | AFIB | - | - | 99.54 | 99.54 | 99.56 | 99.54 |
| | AFL | - | - | 98.97 | 98.97 | 98.56 | 97.57 |
| | VT | - | - | 97.83 | 96.84 | 94.49 | 94.61 |
| | All | 96.47 | 96.46 | 99.71 | 99.69 | 99.67 | 99.63 |
Table 3 shows the F1-scores for the onset and offset delineation. From the results, we see
that the model trained on the internal dataset can accurately delineate signals of each of the
identified arrhythmia types. The F1-scores are mostly above 0.99, and all above 0.97 except for
VT and P waves for AVB1. By contrast, the model trained on LUDB shows a much higher vari-
ation across different arrhythmia types. For normal sinus rhythm, exceptional F1-scores (over
0.99) are achieved. However, the effect of arrhythmias on delineation accuracy is noticeable in
the F1-scores for P waves during ST and AVB1, and T waves during ST, AFIB, AFL, and VT.
Table 4. Comparison of delineation performance on QTDB and LUDB. For a direct comparison, we have considered the results of Moskalenko et al. [8] which uses single lead input, namely lead II. N/A: not applicable, N/R: not reported. This table shows that our model, trained on diverse arrhythmias, performs comparably to other recent models.

| Database | Method | Metric | P onset | P offset | QRS onset | QRS offset | T onset | T offset |
|---|---|---|---|---|---|---|---|---|
| QTDB | Di Marco et al. [40] | Se (%) | 98.15 | 98.15 | 100.0 | 100.0 | - | 99.77 |
| | | PPV (%) | 91.00 | 91.00 | N/A | N/A | - | 97.76 |
| | | m ± σ (ms) | -4.5 ± 13.4 | -2.5 ± 13.0 | -5.1 ± 7.2 | 0.9 ± 8.7 | - | 1.3 ± 18.6 |
| | Kalyakulina et al. [4] | Se (%) | 97.46 | 97.53 | 98.42 | 98.42 | - | 96.16 |
| | | PPV (%) | 97.86 | 97.93 | 98.24 | 98.24 | - | 94.87 |
| | | m ± σ (ms) | 3.5 ± 13.8 | 3.4 ± 12.7 | -5.1 ± 6.6 | 4.7 ± 9.5 | - | 13.4 ± 18.5 |
| | Chen et al. [7] | Se (%) | 99.58 | 99.78 | 100.0 | 100.0 | - | 98.63 |
| | | PPV (%) | N/R | N/R | N/A | N/A | - | N/R |
| | | m ± σ (ms) | -0.6 ± 20.9 | 4.9 ± 19.5 | 1.3 ± 11.4 | 3.8 ± 18.8 | - | 7.4 ± 32.5 |
| | Our Method | Se (%) | 96.51 | 96.55 | 100.0 | 100.0 | - | 97.50 |
| | | PPV (%) | 97.94 | 97.97 | N/A | N/A | - | 95.31 |
| | | m ± σ (ms) | 13.0 ± 16.1 | -3.3 ± 18.5 | 4.1 ± 11.2 | 2.8 ± 17.3 | - | -0.4 ± 35.1 |
| LUDB | Kalyakulina et al. [4] | Se (%) | 98.46 | 98.46 | 99.61 | 99.61 | - | 98.03 |
| | | PPV (%) | 96.41 | 96.41 | 99.87 | 99.87 | - | 98.84 |
| | | m ± σ (ms) | -2.7 ± 10.2 | 0.4 ± 11.4 | -8.1 ± 7.7 | 3.8 ± 8.8 | - | 5.7 ± 15.5 |
| | Sereda et al. [29] | Se (%) | 95.20 | 95.39 | 99.51 | 99.50 | 97.95 | 97.56 |
| | | PPV (%) | 82.66 | 82.59 | 98.17 | 97.96 | 94.81 | 94.96 |
| | | m ± σ (ms) | 2.7 ± 21.9 | -7.4 ± 28.6 | 2.6 ± 12.4 | -1.7 ± 14.1 | 8.4 ± 28.2 | -3.1 ± 28.2 |
| | Moskalenko et al. [8] | Se (%) | 98.61 | 98.59 | 99.99 | 99.99 | 99.32 | 99.40 |
| | | PPV (%) | 95.61 | 95.59 | 99.99 | 99.99 | 99.02 | 99.10 |
| | | m ± σ (ms) | -4.1 ± 20.4 | 3.7 ± 19.6 | 1.8 ± 13.0 | -0.2 ± 11.4 | -3.6 ± 28.0 | -4.1 ± 35.3 |
| | Our Method | Se (%) | 98.16 | 98.20 | 99.67 | 99.97 | 99.82 | 99.63 |
| | | PPV (%) | 96.39 | 96.36 | 99.29 | 99.59 | 99.66 | 99.42 |
| | | m ± σ (ms) | 7.4 ± 14.1 | -1.8 ± 9.9 | 6.1 ± 10.5 | 2.0 ± 10.7 | 3.0 ± 25.2 | 4.5 ± 24.4 |
PPV value. In fact, when there is no annotation, we cannot decide whether the waveform is
not present or the annotation is simply not included. To address this, we adopt the approach
from [3, 40] and treat an absent manual annotation on a predicted beat as a non-included
annotation. To ensure consistency with [4, 40], we select the lead with the lowest error for each
boundary point.
Table 5. Number of false positive P annotations for AFIB and AFL. The PPV and Se scores for the entire test set are shown for reference. The values are averaged over
20 runs.

| Model | AFIB (1437 beats): false positives, P onset | AFIB: false positives, P offset | AFL (540 beats): false positives, P onset | AFL: false positives, P offset | All (14418 beats): PPV (Precision), P onset | All: PPV, P offset | All: Se (Recall), P onset | All: Se, P offset |
|---|---|---|---|---|---|---|---|---|
| Trained w/o classification | 62.35 | 62.35 | 34.25 | 34.25 | 97.53 | 97.52 | 95.43 | 95.43 |
| Trained w/ classification | 13.85 | 13.85 | 1.85 | 1.9 | 98.70 | 98.69 | 95.31 | 95.31 |
Our method provides accurate delineation in all the presented examples, highlighting its
versatility in several aspects. First, with the exception of signal resampling, no additional signal
processing techniques were used to achieve the results. Second, due to the convolutional
nature of the segmentation model, the algorithm can accommodate signals of varying lengths.
This greatly enhances its utility, particularly in the context of Holter recordings containing
potential arrhythmias, allowing for the algorithm’s application to windows of sizes chosen for
convenience. Our PyTorch implementation segments and delineates an ECG record of 30 minutes
in 2 to 3 seconds on an Ubuntu machine with 64GB DRAM equipped with an NVIDIA
3080Ti with 12GB memory. The model itself uses a little under 20 × 10^6 parameters, and needs
about 80 MB of memory. This makes it suitable both for real-time analysis and for the
intended application, the analysis of long Holter recordings. Finally, it is worth noting that
no parameter tuning was necessary for the delineation when applied to the MIT-BIH arrhyth-
mia database.
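The stated memory footprint is consistent with the parameter count if each parameter is stored as a 32-bit float, which is an assumption here since the text does not name the data type:

$$20 \times 10^{6} \ \text{parameters} \times 4 \ \text{bytes per parameter} = 80 \times 10^{6} \ \text{bytes} = 80 \ \text{MB}$$

Since the model has a little under 20 × 10^6 parameters, the weights alone take a little under 80 MB.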
Fig 9. Segmentation results on the MIT-BIH arrhythmia database. (a) Atrial fibrillation in record 221. The small bumps are not misidentified as P
waves, and we have observed the same correct behavior in the presence of atrial flutter. (b) First degree atrioventricular block in record 228, with correct
detection of longer-than-normal PR intervals. (c) Bundle branch block in record 212, featuring a wide QRS complex. (d) Sinus tachycardia in record
209, with heart rate slightly over 100 bpm.
Fig 10. More segmentation results on the MIT-BIH arrhythmia database. (a) Normal sinus rhythm in record 101, with baseline oscillations and
noise. (b) The onset of an episode of atrial flutter in record 222. The early signal displays normal sinus rhythm with PAC, and P waves being detected.
Later, atrial flutter without P waves is observed. (c) An episode of loss of signal in record 232. (d) Ventricular trigeminy in record 201.
5 Discussion
5.1 Delineation of ECGs with arrhythmia
In Table 3 we see that there is a significant difference between the LUDB trained model and
the model trained on internal data with regard to ECGs with certain arrhythmia. There is
almost no performance difference for NSR and BBB, but for the arrhythmias that are not so
well-represented in LUDB, the difference is striking. For example, in LUDB, 15 recordings
represent signals with atrial fibrillation, while only three recordings with atrial flutter and four
recordings with sinus tachycardia are available [10]. The performance drops especially in the
latter case. Fig 6 shows that in some cases of tachycardia all P waves can be missed by an
improperly trained model. Similar problems can occur in AVB1. This brings us to another
problem: the LUDB trained model has a high number of false positive P waves for AFIB and
AFL. Without testing the model on a dataset that has a balanced distribution of arrhythmias, it
is difficult to identify such failure cases. Overall, the results from Table 3 highlight the impor-
tance of using a well-curated dataset that encompasses a broad range of arrhythmias com-
monly seen in clinical practice for developing and validating an ECG delineation algorithm.
For completeness, we reiterate that model a), trained on LUDB, still performs well on the
benchmark QTDB (and of course on LUDB) despite its poor performance on tachycardia.
The model trained on more diverse data performs much better in cases of arrhythmia
while retaining good performance on the standard benchmark tests, as we discuss next.
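The point that a pooled benchmark score can hide rhythm-specific failures can be illustrated with a minimal sketch. It assumes that per-record counts of true positive, false positive and false negative P wave detections are already available together with a rhythm label for each record. The function name `f1_by_rhythm` and all counts below are hypothetical and chosen only for illustration; they are not values from Table 3.

```python
from collections import defaultdict


def f1_by_rhythm(records):
    """Sum TP/FP/FN per rhythm label and return the F1-score per label.

    records: iterable of (rhythm_label, tp, fp, fn) tuples.
    """
    totals = defaultdict(lambda: [0, 0, 0])
    for rhythm, tp, fp, fn in records:
        totals[rhythm][0] += tp
        totals[rhythm][1] += fp
        totals[rhythm][2] += fn
    scores = {}
    for rhythm, (tp, fp, fn) in totals.items():
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); undefined when there is nothing to count.
        scores[rhythm] = 2 * tp / denom if denom else float("nan")
    return scores


# Hypothetical counts: many NSR beats, few tachycardia and AFIB beats.
toy = [
    ("NSR", 980, 5, 10),
    ("NSR", 1000, 3, 12),
    ("ST", 40, 5, 60),    # many missed P waves
    ("AFIB", 0, 30, 0),   # only false positive P waves
]

scores = f1_by_rhythm(toy)
# Pooling all records under one label mimics a single benchmark-wide score.
pooled = f1_by_rhythm([("all", tp, fp, fn) for _, tp, fp, fn in toy])["all"]

for rhythm, value in scores.items():
    print(f"{rhythm}: F1 = {value:.3f}")
print(f"pooled: F1 = {pooled:.3f}")
```

In this toy example the pooled score stays high because it is dominated by the NSR records, whereas the per-rhythm scores expose the failures on the underrepresented rhythms.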
performance in delineating P wave onsets and offsets, achieving a PPV of over 97.9%, outper-
forming the methods we compared against. In the case of LUDB, our method’s strength lies in
accurate T wave delineation, with both Se and PPV exceeding 99.4%, an improvement over
other methods. Taken together, these results underscore the consistent accuracy of our pro-
posed delineation algorithm across various waveforms. Our method’s weakest point is
observed in the standard deviation of error (σ), particularly noticeable for the T offset of
QTDB signals. In fact, we can observe from Table 4 that deep learning-based methods tend to
exhibit higher σ compared to wavelet-based methods. This also aligns with the observations of
Jimenez-Perez et al. [6], where their deep learning-based delineation also reported a σ larger
than 30ms for T offset delineation in QTDB. We also observe that the onset errors for P and
QRS are shifted positively while the standard deviation remains relatively similar to other
methods, which may partially be an artifact of the independent annotations for training and
test data.
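For readers who wish to reproduce this kind of comparison, the quantities discussed here (Se, PPV, mean error and σ) can be computed along the following lines. This is a minimal sketch, not the evaluation code used in this study. It assumes a tolerance window of 150 ms for matching predicted to reference boundaries, a signed error defined as predicted minus reference, a greedy nearest-neighbour matching, and the population standard deviation; all of these are assumptions of the sketch, and the function names and toy annotations are hypothetical.

```python
from statistics import mean, pstdev


def match_boundaries(ref_ms, pred_ms, tol_ms=150.0):
    """Greedily match each reference boundary to the nearest unused prediction.

    Returns (tp, fp, fn, errors), where errors holds the signed differences
    (predicted minus reference, in ms) of the matched pairs.
    """
    pred = sorted(pred_ms)
    used = [False] * len(pred)
    errors = []
    for r in sorted(ref_ms):
        best, best_dist = None, None
        for j, p in enumerate(pred):
            dist = abs(p - r)
            if not used[j] and dist <= tol_ms and (best_dist is None or dist < best_dist):
                best, best_dist = j, dist
        if best is not None:
            used[best] = True
            errors.append(pred[best] - r)
    tp = len(errors)
    return tp, len(pred) - tp, len(ref_ms) - tp, errors


def summarize(tp, fp, fn, errors):
    """Return (Se, PPV, mean error, sigma) for one boundary type."""
    nan = float("nan")
    se = tp / (tp + fn) if tp + fn else nan
    ppv = tp / (tp + fp) if tp + fp else nan
    m = mean(errors) if errors else nan
    sigma = pstdev(errors) if errors else nan
    return se, ppv, m, sigma


# Hypothetical onset annotations in milliseconds.
ref_onsets = [100.0, 900.0, 1700.0, 2500.0]
pred_onsets = [108.0, 912.0, 1690.0, 3200.0]

tp, fp, fn, errors = match_boundaries(ref_onsets, pred_onsets)
se, ppv, m, sigma = summarize(tp, fp, fn, errors)
print(f"Se={se:.2f} PPV={ppv:.2f} mean={m:.1f} ms sigma={sigma:.1f} ms")
```

A positive mean error in this convention means that the predicted boundaries tend to lie later than the reference annotations.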
It is worth noting that the comparable performance on the public datasets was
achieved by training exclusively on the internal dataset. This is important because it indicates the high
generalization ability of the proposed algorithm and of deep learning based methods in general.
As noted in [6, 8], the ability to handle unseen signals without additional tuning
of parameters is a key advantage of deploying a deep learning model over wavelet-based
methods. By training on a private dataset rather than on a portion of either QTDB or LUDB,
we have demonstrated that deep segmentation models can be applied effectively to
diverse scenarios.
Fig 11. Delineation of atrial fibrillation sample (PTB-XL ECG-ID: 5634) using a model trained with arrhythmia classification guidance (a)
without P wave suppression and (b) with P wave suppression.
post-processing step (Post-processing Section 3.7), where one P wave segment is selected per
RR interval. This design is particularly effective for mitigating noise, but limits applicability to
abnormal rhythms, such as second or third-degree atrioventricular blocks, where multiple P
waves may precede a QRS complex. Secondly, the assumption of non-overlapping waveforms
(P, QRS, and T) in the algorithm’s output is a further restriction; overlapping waveforms can
for example occur in first, second or third-degree atrioventricular blocks. We note that these
limitations are not unique to our algorithm but are common in deep learning based delinea-
tion approaches [7, 8].
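The first limitation can be made concrete with a small sketch of a one-P-wave-per-RR-interval rule. This is an illustration under stated assumptions, not the post-processing algorithm of Section 3.7: segments are given as sample indices, each candidate P segment carries a hypothetical confidence score, and the candidate with the highest score is kept. The actual selection criterion may differ.

```python
def select_one_p_per_rr(qrs, p_candidates):
    """Keep at most one P candidate between each pair of consecutive QRS complexes.

    qrs: list of (onset, offset) tuples sorted by time.
    p_candidates: list of (onset, offset, score) tuples.
    The highest-scoring candidate wins (an assumption of this sketch).
    """
    kept = []
    for (_, prev_offset), (next_onset, _) in zip(qrs, qrs[1:]):
        inside = [p for p in p_candidates
                  if p[0] >= prev_offset and p[1] <= next_onset]
        if inside:
            kept.append(max(inside, key=lambda p: p[2]))
    return kept


# Hypothetical segments (sample indices) and scores.
qrs = [(100, 140), (500, 540), (900, 940)]
p_candidates = [
    (300, 320, 0.30),  # low-score bump, plausibly noise
    (420, 460, 0.90),
    (650, 690, 0.80),  # would be a real, non-conducted P wave in a 2:1 AV block
    (810, 850, 0.85),
]

kept = select_one_p_per_rr(qrs, p_candidates)
print(kept)
```

In the first RR interval the rule discards a low-score candidate, which is the intended noise suppression. In the second interval it also discards a high-score candidate, which is exactly the behaviour that limits applicability to second or third-degree atrioventricular blocks.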
To address these limitations, future work may incorporate flexibility into a delineation algo-
rithm’s output by allowing for the detection of multiple P waves within RR intervals, imple-
menting multi-label classification techniques or separate models for QRS/T waves
(depolarization/repolarization of ventricles) and P waves (depolarization of atria). Moreover,
advanced data augmentation techniques should be investigated to accommodate other
arrhythmias for which collecting annotated data may be challenging or impractical. These
future directions aim to enhance delineation performance and widen its scope of application
in clinical settings.
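The difference between single-label and multi-label outputs mentioned above can be sketched as follows. The example assumes per-sample logits for a background channel and for the P, QRS and T channels; the logits are hypothetical, and in practice the two decoding schemes would belong to differently trained models. The sketch only shows why independent per-channel thresholds permit overlapping waveforms while arg-max decoding does not.

```python
import math

CHANNELS = ("bg", "P", "QRS", "T")


def exclusive_labels(rows):
    """Single-label decoding: the arg-max channel wins at each sample."""
    return [CHANNELS[max(range(len(row)), key=row.__getitem__)] for row in rows]


def multilabel_labels(rows, threshold=0.5):
    """Multi-label decoding: each wave channel is thresholded independently.

    The background channel is ignored here; a sample with no active wave
    channel is treated as background.
    """
    decoded = []
    for row in rows:
        active = {name for name, z in zip(CHANNELS[1:], row[1:])
                  if 1.0 / (1.0 + math.exp(-z)) >= threshold}
        decoded.append(active)
    return decoded


# Hypothetical logits (bg, P, QRS, T) for three samples.
rows = [
    (2.0, -3.0, -3.0, -2.0),  # baseline
    (-1.0, 1.5, -3.0, 2.0),   # P wave superimposed on a T wave
    (-2.0, -3.0, 3.0, -1.0),  # QRS complex
]

exclusive = exclusive_labels(rows)
multi = multilabel_labels(rows)
print(exclusive)
print(multi)
```

For the second sample the arg-max decoding reports only a T wave, while the multi-label decoding reports both P and T, which is the case that arises when waveforms overlap.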
6 Conclusion
One of the main challenges in ECG delineation is to accurately identify and delineate wave-
forms within irregular cardiac rhythms. This study aimed to develop a deep learning-based
segmentation model capable of detecting the onsets and offsets of P, QRS, and T waves in sig-
nals with potential arrhythmias. By evaluating on the internal dataset, we have highlighted the
impact of arrhythmias on delineation quality. We observed significant drops in F1-scores for
waveform boundary detection, particularly with arrhythmias such as ST, AVB1, AFIB, AFL,
and VT, with reductions of up to 15% in certain cases, emphasizing the need to account for
arrhythmias when developing and evaluating segmentation models for ECG analysis. To
address this, we experimented with training on a diverse dataset and employing a post-pro-
cessing strategy that can handle noise during the final delineation step. Furthermore, we
assessed generalization capability through experiments on the QTDB and LUDB datasets. Our
model demonstrated strong performance on the LUDB dataset, achieving Se and PPV scores
above 99% for QRS and T wave boundaries, and above 98% and 96% respectively for P waves,
showing comparable performance without direct training on LUDB signals. Overall, our study
shows a deep learning based segmentation model to be a versatile tool for delineation which
can be highly adaptive to various situations, while addressing the challenge of accurately
delineating waveforms in abnormal cardiac rhythms. Future research and development could
focus on broadening the scope of automatic delineation to encompass a wider range of
arrhythmias, through more manual annotations or advanced data augmentation techniques.
Supporting information
S1 Appendix.
(PDF)
Author Contributions
Conceptualization: Chankyu Joung, Mijin Kim.
Data curation: Mijin Kim, Seong-Ho Kong, Seung-Young Oh, Won Kyeong Jeon, Myung-Jin
Cha.
Formal analysis: Chankyu Joung.
Funding acquisition: Joong-Sik Hong, Woong Kook, Myung-Jin Cha.
Investigation: Chankyu Joung, Mijin Kim.
Methodology: Chankyu Joung, Mijin Kim, Myung-Jin Cha.
Project administration: Seung-Young Oh, Jae-hu Jeon.
Resources: Taejin Paik.
Software: Chankyu Joung, Taejin Paik.
Supervision: Seong-Ho Kong, Jae-hu Jeon, Joong-Sik Hong, Wan-Joong Kim, Woong Kook.
Validation: Chankyu Joung, Taejin Paik.
Visualization: Otto van Koert.
Writing – original draft: Chankyu Joung, Mijin Kim, Otto van Koert.
Writing – review & editing: Chankyu Joung, Taejin Paik, Woong Kook, Otto van Koert.
References
1. Gacek A, Pedrycz W, editors. ECG Signal Processing, Classification and Interpretation: A Comprehen-
sive Framework of Computational Intelligence. London: Springer London; 2012.
2. Li C, Zheng C, Tai C. Detection of ECG characteristic points using wavelet transforms. IEEE Transac-
tions on Biomedical Engineering. 1995; 42(1):21–28. [Link] PMID:
7851927
3. Martinez JP, Almeida R, Olmos S, Rocha AP, Laguna P. A Wavelet-Based ECG Delineator: Evaluation
on Standard Databases. IEEE Transactions on Biomedical Engineering. 2004; 51(4):570–581. https://
[Link]/10.1109/TBME.2003.821031 PMID: 15072211
4. Kalyakulina AI, Yusipov II, Moskalenko VA, Nikolskiy AV, Kozlov AA, Zolotykh NY, et al. Finding Mor-
phology Points of Electrocardiographic-Signal Waves Using Wavelet Analysis. Radiophysics and Quan-
tum Electronics. 2019; 61(8):689–703. [Link]
5. Laguna P, Mark RG, Goldberg A, Moody GB. A database for evaluation of algorithms for measurement
of QT and other waveform intervals in the ECG. In: Computers in Cardiology 1997. IEEE; 1997.
p. 673–676.
6. Jimenez-Perez G, Alcaine A, Camara O. Delineation of the electrocardiogram with a mixed-quality-
annotations dataset using convolutional neural networks. Scientific Reports. 2021; 11(1):863. https://
[Link]/10.1038/s41598-020-79512-7 PMID: 33441632
7. Chen Z, Wang M, Zhang M, Huang W, Gu H, Xu J. Post-processing refined ECG delineation based on
1D-UNet. Biomedical Signal Processing and Control. 2023; 79:104106. [Link]
2022.104106
8. Moskalenko V, Zolotykh N, Osipov G. Deep learning for ECG segmentation. In: Advances in Neural
Computation, Machine Learning, and Cognitive Research III. Springer International Publishing; 2020.
p. 246–254.
9. Ronneberger O, Fischer P, Brox T. U-Net: Convolutional Networks for Biomedical Image Segmentation.
In: Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015. Springer Interna-
tional Publishing; 2015. p. 234–241.
10. Kalyakulina AI, Yusipov II, Moskalenko VA, Nikolskiy AV, Kosonogov KA, Osipov GV, et al. LUDB: A
New Open-Access Validation Tool for Electrocardiogram Delineation Algorithms. IEEE Access. 2020;
8:186181–186190. [Link]
11. Saclova L, Nemcova A, Smisek R, Smital L, Vitek M, Ronzhina M. Reliable P wave detection in patho-
logical ECG signals. Scientific Reports. 2022; 12(1):6589. [Link]
PMID: 35449228
12. Hong J, Li HJ, Yang Cc, Han CL, Hsieh Jc. A clinical study on Atrial Fibrillation, Premature Ventricular
Contraction, and Premature Atrial Contraction screening based on an ECG deep learning model.
Applied Soft Computing. 2022; 126:109213. [Link]
13. Aziz S, Ahmed S, Alouini MS. ECG-based machine-learning algorithms for heartbeat classification. Sci-
entific Reports. 2021; 11(1):18738. [Link] PMID: 34548508
14. Pan J, Tompkins WJ. A Real-Time QRS Detection Algorithm. IEEE Transactions on Biomedical Engi-
neering. 1985; BME-32(3):230–236. [Link] PMID: 3997178
15. Sabherwal P, Agrawal M, Singh L. Independent detection of T-waves in single lead ECG signal using
Continuous Wavelet Transform. Cardiovasc Eng Technol. 2023; 14(2):167–181. [Link]
1007/s13239-022-00643-1 PMID: 36163602
16. Benitez DS, Gaydecki PA, Zaidi A, Fitzpatrick AP. A new QRS detection algorithm based on the Hilbert
transform. In: Computers in Cardiology 2000. Vol.27 (Cat. 00CH37163); 2000. p. 379–382.
17. Mukhopadhyay SK, Mitra M, Mitra S. Time plane ECG feature extraction using Hilbert transform, vari-
able threshold and slope reversal approach. In: 2011 International Conference on Communication and
Industrial Application; 2011. p. 1–4.
18. Martínez A, Alcaraz R, Rieta JJ. Automatic electrocardiogram delineator based on the Phasor Transform
of single lead recordings. In: 2010 Computing in Cardiology; 2010. p. 987–990.
19. Graja S, Boucher JM. Hidden Markov tree model applied to ECG delineation. IEEE Transactions on
Instrumentation and Measurement. 2005; 54(6):2163–2168. [Link]
20. Akhbari M, Shamsollahi MB, Sayadi O, Armoundas AA, Jutten C. ECG segmentation and fiducial point
extraction using multi hidden Markov model. Computers in Biology and Medicine. 2016; 79:21–29.
[Link] PMID: 27744177
21. Dubois R, Maison-Blanche P, Quenet B, Dreyfus G. Automatic ECG wave extraction in long-term
recordings using Gaussian mesa function models and nonlinear probability estimators. Computer Meth-
ods and Programs in Biomedicine. 2007; 88(3):217–233. [Link]
PMID: 17997186
22. Hannun AY, Rajpurkar P, Haghpanahi M, Tison GH, Bourn C, Turakhia MP, et al. Cardiologist-level
arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network.
Nature Medicine. 2019; 25(1):65–69. [Link] PMID: 30617320
23. De Melo Ribeiro H, Arnold A, Howard JP, Shun-Shin MJ, Zhang Y, Francis DP, et al. ECG-based real-
time arrhythmia monitoring using quantized deep neural networks: A feasibility study. Comput Biol Med.
2022; 143(105249):105249. [Link] PMID: 35091363
24. Sivapalan G, Nundy KK, Dev S, Cardiff B, John D. ANNet: A lightweight neural network for ECG anom-
aly detection in IoT edge sensors. IEEE Trans Biomed Circuits Syst. 2022; 16(1):24–35. [Link]
10.1109/TBCAS.2021.3137646 PMID: 34982689
25. Zhang Y, Liu S, He Z, Zhang Y, Wang C. A CNN model for cardiac arrhythmias classification based on
individual ECG signals. Cardiovasc Eng Technol. 2022; 13(4):548–557. [Link]
s13239-021-00599-8 PMID: 34981316
26. Hao W, Jingsu K. Investigating Deep Learning Benchmarks for Electrocardiography Signal Processing.
arXiv. 2022; p. 2204.04420.
27. Zeiler MD, Fergus R. Visualizing and Understanding Convolutional Networks. In: Fleet D, Pajdla T,
Schiele B, Tuytelaars T, editors. Computer Vision—ECCV 2014. Lecture Notes in Computer Science.
Springer International Publishing; 2014. p. 818–833.
28. Jimenez-Perez G, Alcaine A, Camara O. U-Net Architecture for the Automatic Detection and Delineation
of the Electrocardiogram. In: 2019 Computing in Cardiology (CinC); 2019. p. 1–4.
29. Sereda I, Alekseev S, Koneva A, Kataev R, Osipov G. ECG Segmentation by Neural Networks: Errors
and Correction. In: 2019 International Joint Conference on Neural Networks (IJCNN); 2019. p. 1–7.
30. Nurmaini S, Darmawahyuni A, Rachmatullah MN, Firdaus F, Sapitri AI, Tutuko B, et al. Robust electro-
cardiogram delineation model for automatic morphological abnormality interpretation. Scientific
Reports. 2023; 13(1):13736. [Link] PMID: 37612382
31. Li X, Cai W, Xu B, Jiang Y, Qi M, Wang M. SEResUTer: a deep learning approach for accurate ECG sig-
nal delineation and atrial fibrillation detection. Physiological Measurement. 2023; 44(12):125005.
[Link] PMID: 37827168
32. Huang H, Lin L, Tong R, Hu H, Zhang Q, Iwamoto Y, et al. UNet 3+: A Full-Scale Connected UNet for
Medical Image Segmentation. In: ICASSP 2020—2020 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE; 2020. p. 1055–1059.
33. Shuvo MB, Ahommed R, Reza S, Hashem MMA. CNL-UNet: A novel lightweight deep learning archi-
tecture for multimodal biomedical image segmentation with false output suppression. Biomedical Signal
Processing and Control. 2021; 70:102959. [Link]
34. Ribeiro AH, Ribeiro MH, Paixão GMM, Oliveira DM, Gomes PR, Canazart JA, et al. Automatic diagnosis
of the 12-lead ECG using a deep neural network. Nature Communications. 2020; 11(1):1760. https://
[Link]/10.1038/s41467-020-15432-4 PMID: 32273514
35. He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. In: 2016 IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR). IEEE; 2016. p. 770–778.
36. Xiao Q, Lee K, Mokhtar SA, Ismail I, Pauzi ALBM, Zhang Q, et al. Deep Learning-Based ECG Arrhyth-
mia Classification: A Systematic Review. Applied Sciences. 2023; 13(8):4964. [Link]
app13084964
37. Ansari Y, Mourad O, Qaraqe K, Serpedin E. Deep learning for ECG Arrhythmia detection and classifica-
tion: an overview of progress for period 2017–2023. Frontiers in Physiology. 2023; 14:1246746. https://
[Link]/10.3389/fphys.2023.1246746 PMID: 37791347
38. Moody GB, Mark RG. The impact of the MIT-BIH arrhythmia database. IEEE engineering in medicine
and biology magazine: the quarterly magazine of the Engineering in Medicine & Biology Society. 2001;
20(3):45–50. [Link] PMID: 11446209
39. Taddei A, Distante G, Emdin M, Pisani P, Moody GB, Zeelenberg C, et al. The European ST-T data-
base: standard for evaluating systems for the analysis of ST-T changes in ambulatory electrocardiogra-
phy. European heart journal. 1992; 13(9). [Link]
PMID: 1396824
40. Di Marco LY, Chiari L. A wavelet-based ECG delineation algorithm for 32-bit integer online processing.
BioMedical Engineering OnLine. 2011; 10(1):23. [Link] PMID:
21457580
41. Zhou Z, Siddiquee MMR, Tajbakhsh N, Liang J. UNet++: Redesigning Skip Connections to Exploit Mul-
tiscale Features in Image Segmentation. IEEE Transactions on Medical Imaging. 2020; 39(6):1856–
1867. [Link] PMID: 31841402
42. He K, Zhang X, Ren S, Sun J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on
ImageNet Classification. In: 2015 IEEE International Conference on Computer Vision (ICCV). IEEE;
2015. p. 1026–1034.
43. Kingma DP, Ba J. Adam: A Method for Stochastic Optimization. arXiv. 2017; p. 1412.6980.
44. Mehari T, Strodthoff N. Self-supervised representation learning from 12-lead ECG data. Computers in
Biology and Medicine. 2022; 141:105114. [Link] PMID:
34973584
45. Lin TY, Goyal P, Girshick R, He K, Dollar P. Focal Loss for Dense Object Detection. In: Proceedings of
the IEEE International Conference on Computer Vision (ICCV); 2017.
46. Fan X, Yao Q, Cai Y, Miao F, Sun F, Li Y. Multiscaled Fusion of Deep Convolutional Neural Networks
for Screening Atrial Fibrillation From Single Lead Short ECG Recordings. IEEE Journal of Biomedical
and Health Informatics. 2018; 22(6):1744–1753. [Link] PMID:
30106699
47. Association for the Advancement of Medical Instrumentation. Testing and reporting performance results
of cardiac rhythm and ST segment measurement algorithms. ANSI/AAMI EC38; 1998.
48. Wagner P, Strodthoff N, Bousseljot RD, Kreiseler D, Lunze FI, Samek W, et al. PTB-XL, a large publicly
available electrocardiography dataset. Sci Data. 2020; 7(1):154. [Link]
0495-6 PMID: 32451379