
Neural Computing and Applications
https://doi.org/10.1007/s00521-023-08956-5

ORIGINAL ARTICLE

A novel deep convolutional encoder–decoder network: application to moving object detection in videos

Avatharam Ganivada1 · Srinivas Yara1

Received: 17 May 2022 / Accepted: 15 August 2023


© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2023

Abstract
Moving object detection is one of the key applications of video surveillance. Deep convolutional neural networks have gained increasing attention in the field of video surveillance due to their effective feature learning ability. The performance of deep neural networks is often affected by the characteristics of videos, like poor illumination and inclement weather conditions, so it is important to design an innovative deep neural network architecture that can deal with such videos effectively. In particular, the convolutional layers of such a network need to be present in an appropriate number, and determining this number is important. In this study, we propose a customized deep convolutional encoder–decoder network, called CEDSegNet, for moving object detection in a video sequence. The CEDSegNet is based on SegNet, and the numbers of its encoder and decoder parts are chosen to be two. By customizing SegNet with two encoder and decoder parts, the proposed CEDSegNet improves detection performance while its parameters are reduced to an extent. The two encoder and decoder parts generate feature maps that preserve the fine details of object pixels in videos. The proposed CEDSegNet is tested on multiple video sequences of the CDNet 2012 dataset. The results obtained using CEDSegNet for moving object detection in video frames are interpreted qualitatively. Further, the performance of CEDSegNet is evaluated using several quantitative indices. Both the qualitative and quantitative results demonstrate that the performance of CEDSegNet is superior to the state-of-the-art network models VGG16, VGG19, ResNet18 and ResNet50.

Keywords Convolutional neural network · Deep learning · Object detection · Performance analysis · Video surveillance

Correspondence: Srinivas Yara, [email protected]

1 School of Computer and Information Sciences, University of Hyderabad, Hyderabad, Telangana 500046, India

1 Introduction

Surveillance involves observing many real activities. It is based on electronic equipment, like closed-circuit television (CCTV), which can gather information about activity from a long distance and copy or record a moving scene in the form of video. The videos of CCTV can be used to play back or broadcast the moving scene again. Surveillance video is used for the purpose of security in public places, public transportation (train platforms, airports), and various places such as shopping centers, banks, companies, money vending machines, etc. The surveillance video may often provide unusable fuzzy scenes and, due to this, it can be unreliable in observing multiple real-time scenes.

Intelligent video surveillance (IVS) is more efficient in observing moving scenes and it records a scene in a reliable manner. The IVS is widely used for protection and security in smart cities. The IVS covers a variety of computer vision tasks, like people tracking, traffic monitoring, and semantic annotation of videos. The IVS mainly consists of five categories, namely object acquisition [4], target detection [22], detection of abnormal activities [34], object classification and data analysis [3].

Detection of single or multiple objects in a video frame is an important application of video surveillance. There are several key issues with video surveillance, like object occlusions, similar appearance of illumination, scale change, plane rotations, and detecting potential object locations, which make object detection tasks very complex. A moving object in a video frame is accurately detected when the shape of the object and its pixel positions are correctly identified. In typical video frames, some pixels belong to multiple objects. Multiple objects can have different shapes.
The shape of an object is delineated by its pixel positions and boundary information. Several attempts have been made to design object detection methods that obtain an accurate shape of an object, like frame difference [27], background subtraction [30], and deep neural networks [12, 35]. The frame difference methods rely on threshold values; their performance degrades when the threshold criterion is not satisfied, and selecting a threshold that captures the criterion for detecting the shape of an object is difficult. Background subtraction methods first estimate the background of an object from multiple video frames. The difference between the estimated background frame and the current frame is then computed to identify the foreground object. Estimating the background from a video is a complex process when the video contains background-like effects such as illumination variation, and these methods may fail to handle such videos. Deep convolutional neural networks can avoid the issues with the aforesaid methods and are widely used for moving object detection in videos. Deep learning refers to learning features deeply from an input video frame. A class of deep convolutional neural networks for feature learning from a video sequence can be found in the investigations of [12, 23] and [15].

Deep convolutional neural networks consist of several convolutional and pooling layers which are arranged in a hierarchical fashion. The convolutional and pooling layers are categorized into multiple layers (blocks) of the encoder and decoder networks. The encoder–decoder networks generate feature maps with the actual spatial resolution during down-sampling and up-sampling. SegNet [1], VGG net [28], and its variants VGG16 and VGG19 are typical examples of deep convolutional neural network models. The architectures of SegNet and the VGG net are similar, but the depth of the former model (involving encoder and decoder networks) is less than that of the latter; SegNet is designed by removing the fully connected layers of the VGG net. The architecture of SegNet involves several encoder and decoder networks that grow the depth of the network. Due to the large depth of the network, its training process needs to be tuned with several parameters, which leads to a high computation time. Further, it is claimed that SegNet attains impressive performance for the segmentation of image datasets [1]. However, the architecture of SegNet has to be reconstructed for video, as a video is a collection of images. The important aspect here is to select an appropriate number of encoder and decoder networks. The encoder and decoder networks are responsible for producing high-resolution feature maps comprising the fine details of the features, so that an accurate shape of an object can be identified by its pixel positions. With a small adaptation of the encoder and decoder networks and of the input vector, a refined SegNet, called CEDSegNet, is developed. The novelty lies in choosing a suitable number of encoder and decoder parts and in the representation of the input vector (i.e., the network's input is generated newly). Further, an application of moving object detection in video frames using the proposed CEDSegNet is provided for the first time. The CEDSegNet is proposed for the purpose of improving the accuracy of moving object detection in a video. The proposed network also has the advantages of fewer parameters used for end-to-end learning and low computational time. The formation of the CEDSegNet is described as follows.

The proposed network has two parts, an encoder and a decoder, similar to the architecture designed in [1]. The depth of the network (SegNet) is reduced by modifying the encoder and decoder parts. Here, the number of encoder and decoder parts is set equal to 2. The total number of layers in both the encoder and decoder parts of the network is 31. The modified encoder–decoder SegNet is called CEDSegNet. The representation of the input vector of CEDSegNet is based on massive data samples (videos) which are generated through a data augmentation process under multiple scenarios. A data sample (video) with object regions and background regions is partitioned into different frames. These are further partitioned into blocks. These blocks are fed to the CEDSegNet along with the corresponding labeled blocks as its input. The CEDSegNet is trained end to end with large sets of masked frames (blocks) representing moving objects. The proposed network uses fewer fine-tuned parameters during training. The encoder–decoder networks enable the CEDSegNet to precisely segment objects, where the 8-connected component principle is used to examine the object region.

The paper is organized as follows. Section 2 provides a literature review of related deep neural networks. The preliminaries of the encoder and decoder parts of SegNet and its training process are provided in Sect. 3. Section 4 presents the design structure of the proposed CEDSegNet. Section 5 reports the experimental results and discussion of the proposed network in comparison with the state-of-the-art methods. Conclusions and further research directions of this study are presented in Sect. 6.

2 Literature survey

Deep convolutional neural networks are highly used, due to their powerful learning ability, in computer vision applications like segmentation, object detection, and moving object detection. We discuss different deep neural networks designed recently which are related to the present investigation.

An attempt has been made to design an architecture of fully convolutional neural networks (FCNs) for pixel-wise semantic segmentation [19].
The constituent parts of the FCNs are AlexNet [16], VGG net [28] and GoogLeNet [32]. There are skip connections in the FCNs that combine learned features. These FCNs are used for applications of semantic segmentation, like the segmentation of bridges and mountains. In [26], the authors develop a multi-scale deep encoder–decoder network for object detection in images. The input of the network is multiple images with varying sizes. The network generates saliency feature maps at different locations (scales) for detecting objects in images. A deep learning framework for object detection in image datasets is available in [17]. Here, deep features from visual objects, in terms of augmented categories, are extracted using a hierarchical feature model (HFM). The augmented categories are person categories like sitting, standing, and riding. A hierarchical ensemble classifier for object recognition and localization is designed, where a support vector classifier is used to calculate the confidence scores of each of the augmented categories. A detailed review of deep neural networks for object detection in images is provided in [11, 21]. All the above methods deal with object detection or semantic segmentation in images. A video is a collection of images. The following paragraphs present different methods of object detection in videos that are developed under the deep learning framework.

In [9], the authors develop a cross-attention Siamese network by incorporating a cross-attention module into a deep encoder–decoder network. This network generates high- and low-level feature maps that preserve both the semantic information and the local context of an object. The maps are fused using the cross-attention module to produce spatial-temporal features, which lead to accurate object detection results. In [13], one can find the concept of a tubelet, a sequence of bounding boxes in a video frame, which is used to design a tubelet proposal network (TPN). The TPN employs a spatial-temporal tubelet proposal model and an encoder–decoder LSTM model to generate bounding boxes within a video frame and classify tubelets into their respective categories. By utilizing tubelet information, the TPN achieves high object detection accuracy in videos. One can refer to [14] for a deep learning framework for object detection in a video. Here, the contextual and temporal information of an object extracted from the tubelet is incorporated into the R-CNN and faster R-CNN models to design the deep learning framework. Further, [12] provides a comprehensive literature review of various deep learning methods designed for video object detection, including analyses of their design processes and the modeling of feature maps.

Several deep-learning methods for moving object detection in videos have been developed. One such method is the deep convolutional neural network developed in [35]. Here, a coarse-grain detection method and a region extraction method are used. The coarse-grain detection method involves a low-pass filter, a filter-based object detection method, and mathematical morphology. While high-frequency noise is removed using the low-pass filter, the mathematical morphology suppresses the ill effects of noise. The filter-based method detects the motion of an object. The region extraction method identifies the moving regions of an object. A motion U-Net (MU-Net) for moving object detection in videos is designed in [25]. The network generates an output mask for moving objects, where features of pixel-level motion and object-level appearance cues are extracted. [24] explores the use of a histogram to design a motion saliency network for background estimation from video frames. The model generates temporal saliency maps from the estimated background for detecting moving objects. Another method, namely a multi-view 3D-CNN, for moving object recognition in videos is illustrated in [33]. The network uses a multi-view 3D-CNN with multiple convolutional layers and a pooling layer to generate a feature vector that accurately detects the boundaries of a moving object. A review of moving object detection in a video sequence based on a deep convolutional encoder–decoder network can be found in [18].

3 SegNet: preliminaries

SegNet is developed in [1]. It consists of two sub-networks, namely an encoder part and a decoder part, followed by a classification layer. There are 5 encoders and 5 decoders in the SegNet. Each encoder is designed with 2 convolutional layers, 2 batch normalization layers, and 2 ReLU activation layers, followed by a max-pooling layer. The convolutional layer extracts a feature from the input image using a set of kernels. The size of a kernel can be 7 × 7. During the convolution process, the kernel is moved over the whole input image with a stride of 1. The output of the convolution layer is input to the ReLU activation layer, which adds non-linearity to the input. The max pooling layer down-samples the output of the ReLU activation layer. In the decoder part, the convolutional layers, ReLU functions, max pooling layers, and convolutional kernels are the same as in the encoder part. The output of the final decoder (the fifth decoder) is fed into a SoftMax layer to classify pixels as objects or background. There are 64 kernels in each block of the convolution layer. We describe the layers in detail as follows.

3.1 Convolution layer

In the convolution layer, an input image is convolved with a filter or kernel matrix. During convolution, the filter is moved over the input image in both horizontal and vertical directions.
Convolutional layers or convolutional operations are applied not only to the input pixels or input images but also to the output of other convolutional layers. Their aim is to generate a feature map from the input image. The number of feature maps corresponds to the number of convolutions. Typically, the number of filters or kernels in the convolution layer is a power of two. The size of a feature map (F_map) generated by the convolution layer is calculated using

F_map = (((I − K + 2p)/s) + 1) × K_n    (1)

where F_map is the output feature map, I is the size of the input frame, K is the convolution kernel or filter size, K_n is the number of kernels used in the convolution process, s is the stride value and p is the padding value.

Figure 1 depicts how an input image I is convolved with a kernel K. We choose the size of the kernel as 3 × 3. We perform the convolution between an input frame I and a kernel K using Eq. 2, which is obtained from [29]:

F_map(p, q) = Σ_{i=0}^{columns} Σ_{j=0}^{rows} I(p − i, q − j) × K(i, j),    (2)

where F_map is the output feature map, i and j are the columns and rows of the kernel or filter, p and q are the columns and rows of the image, respectively, and K is the filter. The size (rows × columns) of K is typically chosen equal to 3. Here, 16 kernels (K_n) are required to generate a feature map F_map of size 4 × 4, based on the input matrix and kernel matrix in Fig. 1. The kernel is moved over the input image with a stride of 1.

Fig. 1 Convolutional operations

3.2 ReLU activation layer

There are different activation functions such as the tangent, sigmoid, and rectified linear unit (ReLU) functions. In the SegNet, the ReLU activation function is used. It converts any negative value into zero and returns the input unchanged otherwise. It also transforms the input into a non-linear form and is computationally efficient. The ReLU activation function can be defined as

ReLU(i) = i, if i > 0; 0, otherwise.    (3)

3.3 Pooling layer

The pooling layer reduces the spatial size of a feature map. In effect, the computational complexity and the number of parameters of the network are reduced. There are three types of pooling layers: max pooling, min pooling, and average pooling. The output size of max pooling is defined as

Max-Pooling = ((I − W)/s) + 1,    (4)

where I is the size of the feature map, W is the window size, and s is the stride value. The output of max pooling is sensitive to network initialization.

3.4 Loss function

The loss function calculates the loss value that is used to assess the performance of the network. Here, a cross-entropy loss is used to classify pixels at the SoftMax layer. It is defined as

Loss = − Σ_{i=1}^{c} t_i log(s_i),    (5)

where t_i is the actual value and s_i denotes the predicted output of the network for each class i.
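For illustration, the short NumPy sketch below evaluates Eqs. (1)–(5): the feature-map size of Eq. (1), a plain 3 × 3 valid convolution as in Eq. (2), the ReLU of Eq. (3), the pooled size of Eq. (4) and the cross-entropy of Eq. (5). The 6 × 6 input values, the averaging kernel and the helper names are illustrative assumptions, not taken from the paper's MATLAB implementation.

```python
import numpy as np

def conv_output_size(i, k, p, s):
    """Spatial size of a feature map, Eq. (1): ((I - K + 2p) / s) + 1."""
    return (i - k + 2 * p) // s + 1

def valid_conv2d(image, kernel, stride=1):
    """Plain 2-D convolution over an image, in the spirit of Eq. (2)."""
    k = kernel.shape[0]
    out = conv_output_size(image.shape[0], k, 0, stride)
    fmap = np.zeros((out, out))
    for p in range(out):
        for q in range(out):
            patch = image[p * stride:p * stride + k, q * stride:q * stride + k]
            fmap[p, q] = np.sum(patch * kernel)
    return fmap

def relu(x):
    """ReLU activation, Eq. (3): negative values become zero."""
    return np.maximum(0.0, x)

def pooled_size(i, w, s):
    """Output size of max pooling, Eq. (4): ((I - W) / s) + 1."""
    return (i - w) // s + 1

def cross_entropy(t, s_pred):
    """Pixel-wise cross-entropy loss, Eq. (5): -sum_i t_i log s_i."""
    return -np.sum(t * np.log(s_pred))

image = np.arange(36, dtype=float).reshape(6, 6)   # assumed 6x6 input, as in Fig. 1
kernel = np.ones((3, 3)) / 9.0                     # assumed 3x3 averaging kernel
fmap = relu(valid_conv2d(image, kernel))
print(fmap.shape)                                  # (4, 4): Eq. (1) with I=6, K=3, p=0, s=1
print(pooled_size(fmap.shape[0], 2, 2))            # 2: a 2x2 pool with stride 2 halves the map
print(cross_entropy(np.array([1.0, 0.0]),          # one-hot target (object vs. background)
                    np.array([0.9, 0.1])))         # predicted SoftMax probabilities
```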
3.5 Weight updation

The SegNet is trained through a stochastic gradient descent algorithm [2]. During training, the network parameters such as weights and biases are optimized. Equation 6 is used for the weight updation process:

θ_{i+1} = θ_i − α ∇E(θ_i),    (6)

where θ is a weight parameter, i is the iteration number, E(θ) is the loss function, ∇E(θ) is the gradient of the loss function, and α is the learning rate.

3.6 Learning parameters

The parameters are updated during the training of SegNet. The updated parameters can be found in [2]. The input layer provides the input image dimensions, and learning starts from the convolution layer of the CNN model. The number of parameters in a convolution layer depends on the kernel size (m × n), the number of filters in the previous layer (p), and the number of filters in the current layer (k). The parameters in a convolution layer can be calculated using Eq. 7:

N_conv = ((m × n × p) + 1) × k    (7)
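A minimal sketch of Eqs. (6) and (7) follows; the toy loss gradient and the example layer sizes are assumptions used only for illustration.

```python
import numpy as np

def sgd_step(theta, grad_E, lr):
    """One weight update of Eq. (6): theta_{i+1} = theta_i - alpha * grad E(theta_i)."""
    return theta - lr * grad_E(theta)

def conv_params(m, n, p, k):
    """Learnable parameters of a convolution layer, Eq. (7): ((m*n*p) + 1) * k,
    i.e. one bias per output kernel."""
    return ((m * n * p) + 1) * k

# Example: a 3x3 kernel over a 3-channel input producing 64 feature maps (assumed sizes).
print(conv_params(3, 3, 3, 64))                # 1792 parameters

# Example gradient for a toy quadratic loss E(theta) = ||theta||^2 / 2, so grad E = theta.
theta = np.array([0.5, -0.2])
print(sgd_step(theta, lambda t: t, lr=0.1))    # [0.45, -0.18]
```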
3.7 SoftMax layer
4.2.1 Architecture of customised encoder decoder SegNet
Using a ‘‘cross-entropy’’ loss function, the SoftMax layer (CEDSegNet)
or pixel classification layer performs the classification task
and it categorizes each pixel belonging to the background The proposed model is based on SegNet architecture. We
or the object region. design the architecture of the proposed model by reducing
the encoder and decoder parts of the SegNet model. Here,
we choose only two encoder and decoder parts. The SegNet
4 Methodology of the proposed model with the two encoders and decoder parts is called a cus-
tomized encoder–decoder SegNet (CEDSegNet). Its input
First, the proposed model, involving pre-processing and
input vector representation, is depicted in Fig. 2. The
architecture of CEDSegNet is then described. 1
https://optiviewusa.com/cctv-video-resolutions/.

123
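The block-wise input representation of Sects. 4.1–4.2 can be sketched as follows; the frame and label arrays are placeholders and the function names are ours, not from the paper's MATLAB code.

```python
import numpy as np

def frame_to_blocks(frame, block=16):
    """Partition a re-scaled H x W x 3 frame into non-overlapping block x block x 3 tiles."""
    h, w, _ = frame.shape
    tiles = [frame[y:y + block, x:x + block, :]
             for y in range(0, h, block) for x in range(0, w, block)]
    return np.stack(tiles)

def mask_to_blocks(label, block=16):
    """Partition the labeled frame (255 = object/foreground, 0 = background) into tiles
    and convert to per-pixel class indices (1 = object, 0 = background)."""
    h, w = label.shape
    tiles = [label[y:y + block, x:x + block]
             for y in range(0, h, block) for x in range(0, w, block)]
    return (np.stack(tiles) == 255).astype(np.uint8)

frame = np.zeros((240, 320, 3), dtype=np.uint8)   # re-scaled RGB frame (320 x 240)
label = np.zeros((240, 320), dtype=np.uint8)      # ground-truth mask with values {0, 255}
print(frame_to_blocks(frame).shape)               # (300, 16, 16, 3): 300 blocks per frame
print(mask_to_blocks(label).shape)                # (300, 16, 16)
```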
4.2.1 Architecture of the customised encoder–decoder SegNet (CEDSegNet)

The proposed model is based on the SegNet architecture. We design the architecture of the proposed model by reducing the encoder and decoder parts of the SegNet model. Here, we choose only two encoder and decoder parts. The SegNet with the two encoder and decoder parts is called a customized encoder–decoder SegNet (CEDSegNet). Its input vector representation is shown above. The architecture of CEDSegNet is shown in Fig. 3.

Fig. 3 Customized encoder–decoder SegNet (CEDSegNet)

The depth of the network (SegNet) is reduced by choosing two encoder and decoder parts. The encoder of the CEDSegNet has 2 convolutional layers, 2 batch normalization layers, and 2 ReLU activation layers, followed by a max-pooling layer. The structure of the encoder part is similar to SegNet [1]. The decoder part of the CEDSegNet is constructed with the same number of layers, involving convolution, batch normalization, ReLU activation, and max pooling, as in the encoder part. Following the encoder–decoder parts, a softmax layer is present in the proposed model. We perform the convolution process, ReLU activation, and max pooling to generate a feature map as explained in Sects. 3.1–3.3. The details of the layers involved in designing the architecture of CEDSegNet are provided in Tables 1 and 2.

The total number of layers in both the encoder and decoder parts of the CEDSegNet, including the input layer, is 31 (see Table 1).

Table 1 Details of different layers for CEDSegNet

Network name | Conv. layers | BN layers | ReLU layers | MaxPool layers | SoftMax layer | Classif. layer
CEDSegNet    | 8            | 8         | 8           | 4              | 1             | 1
Table 2 Details of the convolutional layer, size of kernels, and max pooling layer for CEDSegNet

Network   | Conv. layer i/p size | Conv. kernel size | Padding | Stride | No. of kernels | Max pooling kernel size | Padding | Stride
CEDSegNet | 16 × 16 × 3          | 3 × 3             | 1       | 1      | 64             | 2 × 2                   | 0       | 2
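The layer counts in Tables 1 and 2 can be read as the following PyTorch-style sketch with two encoder and two decoder stages. The use of SegNet-style max-unpooling in the decoder and a uniform width of 64 channels are our assumptions, since the paper does not spell out the per-layer channel widths; this is an illustrative reading, not the authors' MATLAB implementation.

```python
import torch
import torch.nn as nn

def conv_bn_relu(cin, cout):
    # conv (3x3, stride 1, padding 1) -> batch norm -> ReLU, as in Table 2
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=1, padding=1),
                         nn.BatchNorm2d(cout),
                         nn.ReLU(inplace=True))

class CEDSegNetSketch(nn.Module):
    """Two-encoder / two-decoder SegNet variant: 8 conv, 8 BN and 8 ReLU layers,
    4 pooling/unpooling stages, and a pixel-wise SoftMax over 2 classes."""
    def __init__(self, channels=64, num_classes=2):
        super().__init__()
        self.enc1 = nn.Sequential(conv_bn_relu(3, channels), conv_bn_relu(channels, channels))
        self.enc2 = nn.Sequential(conv_bn_relu(channels, channels), conv_bn_relu(channels, channels))
        self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)
        self.unpool = nn.MaxUnpool2d(2, stride=2)
        self.dec2 = nn.Sequential(conv_bn_relu(channels, channels), conv_bn_relu(channels, channels))
        self.dec1 = nn.Sequential(conv_bn_relu(channels, channels), conv_bn_relu(channels, num_classes))

    def forward(self, x):
        x = self.enc1(x)
        x, i1 = self.pool(x)          # 16x16 -> 8x8
        x = self.enc2(x)
        x, i2 = self.pool(x)          # 8x8 -> 4x4
        x = self.unpool(x, i2)        # 4x4 -> 8x8 using the stored pooling indices
        x = self.dec2(x)
        x = self.unpool(x, i1)        # 8x8 -> 16x16
        x = self.dec1(x)
        return torch.softmax(x, dim=1)   # per-pixel class probabilities

net = CEDSegNetSketch()
probs = net(torch.randn(1, 3, 16, 16))   # one 16 x 16 x 3 input block
print(probs.shape)                        # torch.Size([1, 2, 16, 16])
```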

Table 3 Details of different layers of the comparative methods

Network name | Conv. layers | BN layers | ReLU layers | MaxPool layers | Crop layers | Add. layers | Depth concat. layers | SoftMax layer | Classif. layer
VGG16        | 26           | 26        | 26          | 10             | –           | –           | –                    | 1             | 1
VGG19        | 32           | 32        | 32          | 10             | –           | –           | –                    | 1             | 1
ResNet18     | 31           | 28        | 25          | 1              | 2           | 8           | 2                    | 1             | 1
ResNet50     | 64           | 61        | 57          | 1              | 2           | 16          | 2                    | 1             | 1

We use an RGB input frame of size 16 × 16 × 3. A kernel of size 3 × 3, with stride 1 and padding 1, is used in the convolution process. The total number of kernels used for the proposed network model is 64. In the max pooling layer, a kernel of size 2 × 2 with stride 2 is used. These details are provided in Table 2.

The CEDSegNet is trained through the stochastic gradient descent method [2]. During training, the weight parameters of CEDSegNet are updated using Eq. 6 in Sect. 3.5. The learning parameters, which are required for training CEDSegNet, are calculated using Eq. 7 in Sect. 3.6. Here, we use a batch normalization process. The training data is split into batches, called mini-batches, of size 16. This has the advantages of fast training speed, avoiding the adverse effect of weight parameters, and getting high accuracy.

4.3 State-of-the-art methods used for comparison

In this section, we describe the structure of the state-of-the-art methods VGG16, VGG19, ResNet18, and ResNet50. Each method has its own network architecture designed with convolutional layers, batch normalization layers, max-pooling layers, ReLU activation layers, a softmax layer, and a classification layer. The ResNet models additionally involve crop layers, add layers and depth concatenation layers. There are five blocks in every method.

The details of the aforesaid layers for each method are shown in Table 3. For example, the VGG16 model has 26 convolutional layers, 26 batch normalization (BN) layers, 26 ReLU activation layers, and 10 max-pooling layers, which are followed by 1 softmax layer and 1 classification layer. The total number of layers in all the blocks of the VGG16 model, including an input layer, is 91. Similarly, the figures corresponding to VGG19, ResNet18, and ResNet50 can be found in the table. The total numbers of layers of VGG19, ResNet18, and ResNet50 are 109, 100 and 206, respectively.

Table 4 provides the details of the convolutional layer, kernel size (filter), and max pooling layer for all the methods. For the VGG16 model, in the convolutional layer, the size of the kernel is chosen as 3 × 3 with a stride of 1 and padding of 1. The max pooling layer has a filter size of 2 × 2, a stride of 2, and a padding of 0. Here, the size of the input vector is 224 × 224 × 3. The values in parentheses, 64, 128, 256, 512, and 512, represent the number of convolutional kernels in blocks 1, 2, 3, 4, and 5, respectively. Similar figures for the other methods can be seen in the table.

As an example, we provide a comparison of CEDSegNet with SegNet: (i) the number of layers in the proposed CEDSegNet is less than in SegNet; (ii) two encoders and decoders are used in the proposed CEDSegNet, whereas SegNet uses five encoders and five decoders; (iii) the total number of layers in both the encoder and decoder parts of the CEDSegNet, including the input layer, is 31, whereas for SegNet it is 73 (clearly, there are 20 conv. layers, 20 batch normalization layers, 20 ReLU layers, and 10 max pooling layers, followed by 1 SoftMax and 1 classification layer; the section above provides the corresponding details for the CEDSegNet); (iv) the kernel size for CEDSegNet is 3 × 3, whereas for SegNet the kernel size is 7 × 7 with stride 1 and padding 3. In max pooling, a filter of size 2 × 2 with stride 2 and padding 0 is chosen for SegNet; for the CEDSegNet, the details of the max pooling layer are given above.

In a similar way, we can also provide comparisons of the CEDSegNet with the above state-of-the-art methods, using the details of the layers and kernel functions in Tables 3 and 4.
Table 4 Details of the convolutional layer, convolutional kernels, and max pooling layer for the comparative methods

Networks | Conv. layer i/p size | Conv. kernel size | Padding | Stride | No. of conv. kernels (blocks 1–5) | Max pooling kernel size | Padding | Stride
VGG16    | 224 × 224 × 3        | 3 × 3             | 1       | 1      | (64, 128, 256, 512, 512)          | 2 × 2                   | 0       | 2
VGG19    | 224 × 224 × 3        | 3 × 3             | 1       | 1      | (64, 128, 256, 512, 512)          | 2 × 2                   | 0       | 2
ResNet18 | 224 × 224 × 3        | 7 × 7             | 3       | 2      | (64, 128, 256, 512, 512)          | 3 × 3                   | 1       | 2
ResNet50 | 224 × 224 × 3        | 7 × 7             | 3       | 2      | (64, 128, 256, 512, 512)          | 3 × 3                   | 1       | 2

5 Experimental results

The proposed CEDSegNet is coded in MATLAB R2020a on a system with an Intel(R) Core(TM) i5-4570T CPU processor @ 2.9 GHz and the Windows 10 operating system. The performance of CEDSegNet is discussed for the CDNet 2012 dataset.

5.1 Dataset

The dataset, CDNet 2012, can be downloaded from http://jacarini.dinf.usherbrooke.ca/dataset2012/. It comprises a diverse collection of 6 categories. In each category, there are between 4 and 6 video sequences. Out of these sequences, 4 real-life RGB video sequences, Boulevard, Pets2006, Pedestrians and Highway, with multiple effects (e.g., camera jitter and object motion blur), are considered. The details of these video sequences are shown in Table 5.

Table 5 Details of input video sequences

Input video sequences (RGB) | Total video frames | Resolution
Boulevard                   | 2500               | 352 × 240
Pets2006                    | 1200               | 720 × 576
Pedestrians                 | 1099               | 360 × 240
Highway                     | 1700               | 320 × 240

5.2 Indices to evaluate the performance of the proposed CEDSegNet

We use quantitative indices to evaluate the performance of CEDSegNet for detecting objects in videos. The indices are precision (Pr), recall (Re), specificity (Spe), false positive rate (FPR), false negative rate (FNR), percentage of wrong classification (PWC), F_Measure and MCC. The formula for every index is given in Table 6.

For example, the notion of PWC is based on true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). A true positive (TP) is an output pixel for which the proposed model correctly predicts the positive class (say, object). A false positive (FP) is an output pixel for which the proposed model wrongly predicts the positive class. Similarly, true negatives (TN) and false negatives (FN) are the output pixels of the proposed model which are predicted correctly and incorrectly as the negative class (background), respectively. In a similar way, the other notions can be explained.

5.3 Training of CEDSegNet and its hyperparameters

The CEDSegNet model is trained using the stochastic gradient descent algorithm (SGD) [2]. A batch normalization process [8], involving mini-batches instead of the entire training set, is used. For the training of CEDSegNet, we select four RGB video sequences and their corresponding labeled video sequences from the CDNet 2012 dataset. Each video sequence belongs to a particular category; the category of every video sequence is given in Table 5. The total number of frames in the four video sequences is 6499. Out of these, there are 4639 frames with ground truth, and these frames are used in the experiments. The training set and test set are selected in a manner similar to [10], as follows.

For the training set, out of the 4639 frames, 40 video frames and their labeled frames (40) are selected randomly. All these frames are re-scaled to a resolution of 320 × 240 × 3. We split every frame into 16 × 16 × 3 blocks. The total number of blocks in a frame is 300 ((320 × 240)/(16 × 16)). For 40 frames, the number of blocks is 12,000 (40 × 300). Similarly, we obtain 12,000 blocks for the 40 labeled frames corresponding to the video frames. Therefore, the training set consists of 24,000 blocks, and the size of each block is 16 × 16 × 3.

For the test set, 20 video frames are randomly selected out of the 4599 frames remaining after excluding the video frames and their labeled frames in the training set. The size of each frame is 320 × 240 × 3. The test frames are not divided into blocks.
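The training-set sizes quoted above follow directly from the block partition; a quick arithmetic check (the variable names are ours):

```python
frame_w, frame_h, block = 320, 240, 16
blocks_per_frame = (frame_w * frame_h) // (block * block)   # Eq.-style count: 76800 / 256
train_frames = 40
video_blocks = train_frames * blocks_per_frame               # blocks from the input frames
label_blocks = train_frames * blocks_per_frame               # blocks from the labeled frames
print(blocks_per_frame, video_blocks, video_blocks + label_blocks)   # 300 12000 24000
```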
Table 6 Evaluation metrics based on the confusion table

Name of metric                          | Formula
Recall                                  | TP / (TP + FN)
Specificity                             | TN / (TN + FP)
False positive rate                     | FP / (FP + TN)
False negative rate                     | FN / (TP + FN)
PWC                                     | 100 (FN + FP) / (TP + FP + FN + TN)
Precision                               | TP / (TP + FP)
F_Measure                               | 2 (Recall × Precision) / (Recall + Precision)
Matthews correlation coefficient (MCC)  | ((TN × TP) − (FN × FP)) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
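The indices in Table 6 can be computed directly from pixel-level counts; a compact sketch follows (the example counts are made up for illustration only):

```python
import math

def confusion_metrics(tp, fp, tn, fn):
    """Indices of Table 6 computed from pixel-level confusion counts."""
    recall      = tp / (tp + fn)
    specificity = tn / (tn + fp)
    fpr         = fp / (fp + tn)
    fnr         = fn / (tp + fn)
    pwc         = 100.0 * (fn + fp) / (tp + fp + fn + tn)
    precision   = tp / (tp + fp)
    f_measure   = 2 * recall * precision / (recall + precision)
    mcc = ((tn * tp) - (fn * fp)) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return dict(Re=recall, Spe=specificity, FPR=fpr, FNR=fnr,
                PWC=pwc, Pre=precision, F_Measure=f_measure, MCC=mcc)

print(confusion_metrics(tp=900, fp=150, tn=9000, fn=120))   # illustrative pixel counts
```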

The CEDSegNet involves a set of hyperparameters whose values need to be initialized for its training. The hyperparameters of CEDSegNet and the values chosen for its training are provided in Table 7. Using the 'Glorot' method [6], the weight parameters of the CEDSegNet are initialized. Cross-entropy is chosen as the loss function. The CEDSegNet involves L2 regularization and momentum parameters, which are chosen equal to 0.0001 and 0.9, respectively, as given in [1]. The number of epochs is set equal to 10. The value of the initial learning rate, α, is 0.1. Here, the network is initially trained for different values of the learning rate chosen between 0 and 1; then the value 0.1, for which the network attains the best performance, is fixed.

Using Eq. 7, the CEDSegNet requires about 0.12 million learning parameters in total, whereas these parameters for VGG16, VGG19, ResNet18, and ResNet50 are about 2.94 million, 4 million, 2.06 million, and 4.39 million, respectively. The quantitative and qualitative results of the CEDSegNet are discussed as follows.

5.4 Qualitative results of CEDSegNet

The performance of CEDSegNet for object detection, as compared to the ground truth, ResNet18, ResNet50, VGG16, and VGG19, for four video frames is qualitatively shown in Fig. 4. Here, the ground truth results are collected from the CDNet 2012 site (http://jacarini.dinf.usherbrooke.ca/dataset2012/).

The first three rows in the figure are related to the single object detection results. These are followed by the results of multi-object detection using the CEDSegNet in the fourth row. Each row corresponds to a video sequence belonging to a particular category. In the figure, the results of the other models, including the ground truth, are also provided for comparison with CEDSegNet.

The CEDSegNet detects the objects in the video frames as well-shaped, with the pixel positions and boundary information of the objects correctly identified, whereas ResNet18, ResNet50, VGG16, and VGG19 detect objects that are not well-shaped. The same information is reflected in the video sequences in Fig. 4. The performance of CEDSegNet is superior to ResNet18, ResNet50, VGG16, and VGG19. The ground truth results are slightly better than those of CEDSegNet, as expected. This description concerns single object detection in videos.

In the case of multiple objects in a video frame (fourth row), the CEDSegNet detects objects which are well-separated from the background, whereas, using ResNet18, ResNet50, VGG16, and VGG19, the detected objects are seen to be connected to each other. The detected objects using CEDSegNet appear to be close to those of the ground truth. In contrast, using the state-of-the-art models, the background pixels in overlapping regions are detected as objects; thereby the detected objects do not appear close to the ground truth results.

5.5 Results of CEDSegNet using quantitative indices

For every resultant video sequence belonging to boulevard, pets2006, pedestrians, or highway, the aforesaid specificity, FPR, FNR, PWC, precision, recall, F_Measure and MCC (Matthews correlation coefficient) are applied to assess the performance of the proposed CEDSegNet. As an example, we provide the values of the indices for a video sequence pertaining to pedestrians in Table 8. Here, the proposed method achieves balanced values of precision and recall. In the other methods, precision is much higher than recall. Considering precision only, our method yields flexible results in a few cases (comparisons); however, it obtains flexible performance in all cases (comparisons) corresponding to recall.
Table 7 Details of the hyperparameters

Name of the network  | Hyperparameter        | Value
Proposed CEDSegNet   | Optimizer             | SGD
                     | Loss function         | Cross-entropy
                     | Initial learning rate | 0.1
                     | Batch size            | 16
                     | Max epochs            | 10
                     | L2 regularization     | 1.0000e−04
                     | Momentum              | 0.9
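In a PyTorch re-implementation, the settings of Table 7 would translate roughly as below; the original model was trained in MATLAB R2020a, so this mapping is only indicative and the stand-in model is a placeholder.

```python
import torch

model = torch.nn.Conv2d(3, 2, 3, padding=1)           # placeholder for the CEDSegNet
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.1,                    # initial learning rate (Table 7)
                            momentum=0.9,              # momentum (Table 7)
                            weight_decay=1e-4)         # L2 regularization (Table 7)
criterion = torch.nn.CrossEntropyLoss()                # cross-entropy loss
batch_size, max_epochs = 16, 10                        # mini-batch size and epochs
```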

Fig. 4 Four different input video frames are shown in column (a). The performance of CEDSegNet (column c) is compared, for all the video frames, with the different state-of-the-art methods involving ground truth (column b), ResNet18 (column d), ResNet50 (column e), VGG16 (column f), and VGG19 (column g)

Table 8 Comparison of the performance of CEDSegNet with ResNet18, ResNet50, VGG16, and VGG19 for a video sequence (pedestrians) of the CDNet 2012 dataset

Networks  | Spe    | FPR     | FNR    | PWC    | Pre    | Rec    | F_Measure | MCC
CEDSegNet | 0.9969 | 0.0031  | 0.1302 | 0.5260 | 0.8335 | 0.8698 | 0.8513    | 0.8488
VGG19     | 0.9997 | 0.0003  | 0.7706 | 2.3617 | 0.956  | 0.2294 | 0.37      | 0.4622
ResNet18  | 0.9994 | 0.0005  | 0.6169 | 1.134  | 0.9231 | 0.3831 | 0.5415    | 0.5907
ResNet50  | 0.9991 | 0.00089 | 0.6984 | 1.5645 | 0.8791 | 0.3016 | 0.4491    | 0.5098
VGG16     | 0.9985 | 0.0015  | 0.8556 | 3.5814 | 0.7995 | 0.1444 | 0.2446    | 0.3306

It may be noted that a flexible F_Measure is attained when the model yields balanced values of precision and recall. These values support the quantitative results demonstrated in [31]. The value of F_Measure indicates that the proposed method is the best. For specificity, our method has slightly lower performance. The values of the indices indicate that the CEDSegNet outperforms the ResNet18, ResNet50, VGG16, and VGG19 models, where precision + recall is treated as one index.

As explained above, for every video sequence (boulevard, pets2006, pedestrians or highway), the scores of specificity, FPR, FNR, PWC, precision, recall, F_Measure and MCC (Matthews correlation coefficient) for CEDSegNet are obtained. The average scores over all the video sequences obtained for CEDSegNet, as compared to the ResNet18, ResNet50, VGG16 and VGG19 models [7], are reported in Table 9. It can be observed from Table 9 that the performance of the proposed CEDSegNet is superior to ResNet18, ResNet50, VGG16 and VGG19 with respect to FNR, PWC, F_Measure and MCC.
Table 9 Comparison of the performance of CEDSegNet with ResNet18, ResNet50, VGG16, and VGG19 for the CDNet 2012 dataset using quantitative measures, based on the average of the segmentation results of four video sequences

Networks  | Spe    | FPR    | FNR    | PWC    | Pre    | Rec    | F_Measure | MCC
CEDSegNet | 0.9889 | 0.0105 | 0.0625 | 1.1209 | 0.7215 | 0.9335 | 0.8086    | 0.8118
ResNet18  | 0.9903 | 0.0049 | 0.3051 | 1.3117 | 0.8470 | 0.6902 | 0.7246    | 0.7402
ResNet50  | 0.9925 | 0.0007 | 0.4053 | 2.0007 | 0.8937 | 0.6056 | 0.6958    | 0.7110
VGG16     | 0.9993 | 0.0006 | 0.6481 | 4.6038 | 0.9506 | 0.3228 | 0.5037    | 0.5318
VGG19     | 0.9995 | 0.0005 | 0.3784 | 1.3720 | 0.9708 | 0.6145 | 0.7346    | 0.7521

Further, the CEDSegNet performs better than the other methods in terms of recall. Here, the CEDSegNet provides balanced values of precision and recall, unlike ResNet18, ResNet50, VGG16 and VGG19.

Using specificity (Spe) and FPR, the results of the CEDSegNet are also better than those of the other methods except VGG19, which achieves slightly better performance. Although these two values (specificity and FPR) seem to be better for VGG19, the proposed CEDSegNet provides a low error rate in terms of FNR and PWC and a high classification score in terms of F_Measure and MCC (i.e., clear visibility of an object against the background in all the frames, as in Fig. 4).

Table 10 provides the number of layers and the training time in minutes for CEDSegNet, as compared to VGG16, VGG19, ResNet18, and ResNet50. The total number of layers for the proposed CEDSegNet is 31, whereas the numbers of layers for VGG16, VGG19, ResNet18, and ResNet50 are 91, 109, 100, and 206, respectively. These figures indicate that the size of CEDSegNet is smaller than that of VGG16, VGG19, ResNet18, and ResNet50. It may be noted that Sect. 4.2.1 details the proposed CEDSegNet, while the models VGG16, VGG19, ResNet18, and ResNet50 used for comparison are discussed, in detail, in Sect. 4.3.

The training time of CEDSegNet compared with VGG16, VGG19, ResNet18, and ResNet50 is also reported in Table 10. It is observed from the table that the training time (CPU time) for CEDSegNet is lower than for all the models except ResNet18. Here, ResNet18 has a slightly lower value due to the use of Atrous Separable Convolution based on DeepLabv3plus [5]. However, the CEDSegNet is better than VGG16, VGG19, ResNet18, and ResNet50 (see the results in Tables 8 and 9). The superiority of the CEDSegNet is achieved due to its newly generated input vector, reduced depth, and small set of learning parameters, whereas the state-of-the-art methods do not contain those details.

Table 10 Comparison of network size and CPU time for the proposed CEDSegNet with the state-of-the-art methods

Name of the network | Size (no. of layers) | Training time (in minutes)
Proposed CEDSegNet  | 31                   | 4.17
VGG16               | 91                   | 10.82
VGG19               | 109                  | 12.78
ResNet18            | 100                  | 2.32
ResNet50            | 206                  | 4.26

5.6 Test performance of CEDSegNet using cross-validation technique

As mentioned in the above section, we select 60 video frames, out of 4639 video frames, belonging to the four categories of video sequences, pedestrians, highway, pets2006, and boulevard. Out of the 60 video frames, 40 video frames are selected for the training set and 20 for the test set, and the performance of the CEDSegNet is found to be better than the state-of-the-art methods. We additionally use a cross-validation technique to establish the superiority of the proposed model. The training process of the model is explained as follows.

We use the k-fold cross-validation method to test the performance of the proposed CEDSegNet. Here, the value of k is typically chosen equal to 5. In the five-fold cross-validation method, we use the 60 video frames belonging to the four categories of video sequences, pedestrians, highway, pets2006, and boulevard. We split the 60 video frames into five folds, where each fold consists of 12 video frames drawn from all four categories. The first four folds are treated as a training set and the remaining fold is used as a test set. This process is repeated five times, considering each fold as a test set exactly once.

5.6.1 Qualitative results F_Measure and MCC to evaluate the performance of


CEDSegNet. The results of CEDSegNet for the 5 folds of
For typical fold 1, the qualitative results of CEDSegNet are test sets and their average values are present in Table 11.
shown in Fig. 5. The results belong to four categories of These are based on the average of all four video sequences.
video sequences, namely boulevard, pets2006, pedestrians, The proposed CEDSegNet is trained using a large input
and highway. The CEDSegNet’ss performance is qualita- dataset, consisting of approximately 28,800 blocks which
tively compared with the state-of-the-art network models enable the CEDSegNet to avoid the over-fitting issue. We
including ground truth. can observe from the table that the proposed network
For the pedestrians’ category, multiple objects corre- consistently achieves high accuracy across the five folds. It
sponding to ground truth are present in the figure. The is not seen in the performance of the state-of-the-art net-
detection result of the proposed CEDSegNet is similar to works and these models exhibit poor performance. Note
the ground truth. The proposed model provides well- that, the standard deviation (r) of CEDSegNet obtained
shaped objects like the ground truth. While the objects are from the set of experiments using 5-fold cross-validation, is
in moving condition, the CEDSegNet effectively detects quite low at 0.05 and 0.04 for F_Measure and MCC,
the objects with the information which is reflected in the respectively.
ground truth result. Whereas, using ResNet18, the objects Table 12 depicts the average results of five-folds using
detected are not in good shape which is quite different from CEDSegNet for all four video sequences (boulevard, Pet-
the ground truth. Similar detection results using ResNet50, s2006, pedestrians, or highway). The table also provides
VGG16, and VGG19 are also obtained and these are worse the results of ResNet18, ResNet50, VGG16, and VGG19
than the proposed CEDSegNet. For the remaining cate- used for comparison. Here, the values of PWC and FNR for
gories in the figure, the proposed CEDSegNet attains object CEDSegNet are lower than the VGG16, VGG19,
detection results that are very close to the ground truth. ResNet18, and ResNet50. The CEDSegNet achieves a low
Further, the CEDSegNet outperforms ResNet50, VGG16, error rate in terms of FNR, PWC comparing with the other
and VGG19 in single and multi-object detection cases. models. Using precision and recall, the proposed model
obtains balanced values which provide a better value of
5.6.2 Quantitative results F_Measure than the other models. The performance of the
CEDSegNet is superior to VGG16, VGG19, ResNet18, and
We use quantitative metrics, precision (Pre), recall (Rec), ResNet50 with respect to all the indices, except specificity
specificity (Spe), false positive rate (FPR), false negative and FPR where VGG19 is the best and our model is second
rate (FNR), percentage of wrong classification (PWC), the best. The justification for this is as follows. When the

Fig. 5 Comparative analysis of CEDSegNet (column c) with different methods, ground truth (column b), ResNet18 (column d), ResNet50
(column e), VGG16 (column f), VGG19 (column g). The results are based on a cross-validation technique

123
Table 11 Results of CEDSegNet for every fold using different quantitative indices

Fold               | Spe    | FPR     | FNR    | PWC    | Pre    | Rec    | F_Measure | MCC
1                  | 0.9885 | 0.0115  | 0.0272 | 1.1404 | 0.7791 | 0.9727 | 0.8626    | 0.8639
2                  | 0.9913 | 0.0086  | 0.284  | 1.4075 | 0.8263 | 0.7159 | 0.7388    | 0.7475
3                  | 0.9823 | 0.0176  | 0.0167 | 1.7327 | 0.6051 | 0.9832 | 0.7381    | 0.7583
4                  | 0.9865 | 0.0134  | 0.0329 | 1.3769 | 0.6909 | 0.9671 | 0.8024    | 0.8091
5                  | 0.9837 | 0.01621 | 0.041  | 1.6345 | 0.6811 | 0.9589 | 0.7934    | 0.7993
Avg. of 5 folds    | 0.9865 | 0.0135  | 0.0804 | 1.4584 | 0.7165 | 0.9195 | 0.7871    | 0.7956
Std. deviation (σ) | –      | –       | –      | –      | –      | –      | 0.0517    | 0.0462

Table 12 Average results of all five folds obtained using CEDSegNet as compared to the VGG16, VGG19, ResNet18, and ResNet50 models

Network model | Spe    | FPR    | FNR    | PWC     | Pre    | Rec    | F_Measure | MCC
CEDSegNet     | 0.9865 | 0.0135 | 0.0804 | 1.4584  | 0.7165 | 0.9195 | 0.7871    | 0.7956
ResNet18      | 0.9869 | 0.0131 | 0.5155 | 16.3815 | 0.6442 | 0.4838 | 0.3861    | 0.4281
ResNet50      | 0.9892 | 0.0107 | 0.5417 | 12.1059 | 0.7054 | 0.4582 | 0.4011    | 0.4399
VGG16         | 0.9986 | 0.0012 | 0.6831 | 13.1393 | 0.9671 | 0.3169 | 0.4386    | 0.4843
VGG19         | 0.9991 | 0.0009 | 0.5134 | 3.1873  | 0.9766 | 0.4771 | 0.6337    | 0.6592

We provide a ROC diagram, as discussed in [20], for the depiction of accuracy in Fig. 6. The ROC curve illustrates the trade-off between the true positive rate (TPR) and the false positive rate (FPR) at different classification thresholds. It is often used to evaluate and compare the performance of different classification models. The results of the network models for the four video sequences using the ROC curve reveal a distinct performance differentiation, as shown in Fig. 6. In the present investigation of the ROC curve, the proposed network, CEDSegNet, demonstrates a substantially higher AUC value (AUC = 0.98) compared to the other state-of-the-art models, indicating its discriminative capacity in accurately classifying the object pixels.

Fig. 6 Comparative analysis of CEDSegNet with the state-of-the-art networks
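The ROC analysis of Fig. 6 can be reproduced from per-pixel scores with scikit-learn; the label and score arrays below are random placeholders standing in for ground-truth masks and SoftMax outputs.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Per-pixel ground-truth labels (1 = object, 0 = background) and predicted object
# probabilities from the SoftMax layer; random values used here as placeholders.
y_true = np.random.randint(0, 2, size=10000)
y_score = np.clip(y_true * 0.7 + np.random.rand(10000) * 0.5, 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC =", auc(fpr, tpr))    # the paper reports AUC = 0.98 for CEDSegNet
```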
6 Conclusion

A customized encoder–decoder SegNet (CEDSegNet) for detecting objects in video surveillance is proposed in this study. A coarse-to-fine-grained moving object detection framework is presented with a novel encoder–decoder network, called CEDSegNet, and an application of object segmentation in videos using the proposed CEDSegNet is provided. The architecture of CEDSegNet is identical to that of SegNet; however, the number of convolutional layers of the encoder and decoder parts is modified and its input vector is generated newly. These modifications are made with the aim of improving the proposed network's performance. The performance of the proposed CEDSegNet is demonstrated on several video frames of the CDNet 2012 dataset, where it is used to detect single and multiple objects. The results of CEDSegNet, as compared to the state-of-the-art methods, are provided for four video sequences, as examples. The qualitative results demonstrate that the CEDSegNet is able to detect object regions similar to the ground truth results. Further, the CEDSegNet provides better results than ResNet18, ResNet50, VGG16, and VGG19. The significance of the proposed CEDSegNet is that it has the ability to identify accurate object shapes and boundary information, and it is efficient at handling pixels in multiple object regions.

We use different quantitative metrics to evaluate the object regions detected using the proposed CEDSegNet. The superiority of the proposed CEDSegNet over the state-of-the-art deep neural networks is obtained in terms of specificity, FPR, FNR, PWC, (precision + recall), F_Measure, and MCC. The proposed model has the advantages of fewer parameters and low computational cost; in the case of computational cost, our model is the second best. Our future research task would be to design a rough set-based deep neural network using the concepts of the rough set for efficient detection. Further, testing its effectiveness in segmenting objects in video sequences will be a part of future research work.

Availability of data A video database that supports the findings of this study is publicly available at http://jacarini.dinf.usherbrooke.ca/dataset2012.

Declarations

Conflict of interest The authors declare no conflict of interest.

References

1. Badrinarayanan V, Kendall A, Cipolla R (2017) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell 39(12):2481–2495
2. Bottou L (2010) Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT'2010, pp 177–186. Springer
3. Chang X, Yang Y (2016) Semisupervised feature analysis by mining correlations among multiple tasks. IEEE Trans Neural Netw Learn Syst 28(10):2294–2305
4. Chang X, Yu Y-L, Yang Y, Xing EP (2017) Semantic pooling for complex event analysis in untrimmed videos. IEEE Trans Pattern Anal Mach Intell 39(8):1617–1632. https://doi.org/10.1109/TPAMI.2016.2608901
5. Chen L-C, Zhu Y, Papandreou G, Schroff F, Adam H (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European conference on computer vision, pp 801–818
6. He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE international conference on computer vision, pp 1026–1034
7. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
8. Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. Int Conf Mach Learn 37:448–456
9. Ji Y, Zhang H, Jie Z, Ma L, Wu QJ (2020) CASNet: a cross-attention siamese network for video salient object detection. IEEE Trans Neural Netw Learn Syst 32(6):2676–2690
10. Jiang S, Lu X (2018) WeSamBE: a weight-sample-based method for background subtraction. IEEE Trans Circuits Syst Video Technol 28(9):2105–2115
11. Jiao L, Zhang F, Liu F, Yang S, Li L, Feng Z, Qu R (2019) A survey of deep learning-based object detection. IEEE Access 7:128837–128868
12. Jiao L, Zhang R, Liu F, Yang S, Hou B, Li L, Tang X (2021) New generation deep learning for video object detection: a survey. IEEE Trans Neural Netw Learn Syst
13. Kang K, Li H, Xiao T, Ouyang W, Yan J, Liu X, Wang X (2017) Object detection in videos with tubelet proposal networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 727–735
14. Kang K, Li H, Yan J, Zeng X, Yang B, Xiao T, Zhang C, Wang Z, Wang R, Wang X et al (2017) T-CNN: tubelets with convolutional neural networks for object detection from videos. IEEE Trans Circuits Syst Video Technol 28(10):2896–2907
15. Kompella A, Kulkarni RV (2021) A semi-supervised recurrent neural network for video salient object detection. Neural Comput Appl 33:2065–2083
16. Krizhevsky A, Sutskever I, Hinton GE (2017) ImageNet classification with deep convolutional neural networks. Commun ACM 60(6):84–90
17. Lee B, Erdenee E, Jin S, Rhee PK (2016) Efficient object detection using convolutional neural network-based hierarchical feature modeling. Signal Image Video Process 10(8):1503–1510
18. Lim LA, Keles HY (2018) Foreground segmentation using convolutional neural networks for multiscale feature encoding. Pattern Recogn Lett 112:256–262
19. Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)
20. Malebary SJ, Khan R, Khan YD (2021) ProtoPred: advancing oncological research through identification of proto-oncogene proteins. IEEE Access 9:68788–68797
21. Minaee S, Boykov YY, Porikli F, Plaza AJ, Kehtarnavaz N, Terzopoulos D (2021) Image segmentation using deep learning: a survey. IEEE Trans Pattern Anal Mach Intell
22. Muhammad K, Ahmad J, Baik SW (2018) Early fire detection using convolutional neural networks during surveillance for effective disaster management. Neurocomputing 288:30–42
23. Pal SK, Bhoumik D, Bhunia Chakraborty D (2020) Granulated deep learning and z-numbers in motion detection and object recognition. Neural Comput Appl 32(21):16533–16548
24. Patil PW, Murala S (2018) MSFgNet: a novel compact end-to-end deep network for moving object detection. IEEE Trans Intell Transp Syst 20(11):4066–4077
25. Rahmon G, Bunyak F, Seetharaman G, Palaniappan K (2021) Motion U-Net: multi-cue encoder-decoder network for motion segmentation. In: 2020 25th International conference on pattern recognition (ICPR), pp 8125–8132
26. Ren Q, Hu R (2018) Multi-scale deep encoder-decoder network for salient object detection. Neurocomputing 316:95–104
27. Shi G, Suo J, Liu C, Wan K, Lv X (2017) Moving target detection algorithm in image sequences based on edge detection and frame difference. In: 2017 IEEE 3rd information technology and mechatronics engineering conference (ITOEC), pp 740–744. IEEE
28. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
29. Singh SA, Meitei TG, Majumder S (2020) Short PCG classification based on deep learning. In: Deep learning techniques for biomedical and health informatics, pp 141–164. Elsevier
30. St-Charles P-L, Bilodeau G-A, Bergevin R (2014) SuBSENSE: a universal change detection method with local adaptive sensitivity. IEEE Trans Image Process 24(1):359–373
31. St-Charles P-L, Bilodeau G-A, Bergevin R (2016) Universal background subtraction using word consensus models. IEEE Trans Image Process 25(10):4768–4781
32. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–9
33. Wang D, Cui X, Chen X, Zou Z, Shi T, Salcudean S, Wang ZJ, Ward R (2021) Multi-view 3D reconstruction with transformers. In: 2021 IEEE/CVF international conference on computer vision (ICCV), pp 5702–5711. https://doi.org/10.1109/ICCV48922.2021.00567
34. Xiaojun C, Zhigang M, Yi Y, Zhiqiang Z, G, H. A. (2016) Bi-level semantic representation analysis for multimedia event detection. IEEE Trans Cybern 47(5):1180–1197
35. Zhu H, Yan X, Tang H, Chang Y, Li B, Yuan X (2020) Moving object detection with deep CNNs. IEEE Access 8:29729–29741

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
