A Novel Deep Convolutional Encoder–Decoder Network: Application to Moving Object Detection in Videos (2023)
https://doi.org/10.1007/s00521-023-08956-5
ORIGINAL ARTICLE
Abstract
Moving object detection is one of the key applications of video surveillance. Deep convolutional neural networks have gained increasing attention in the field of video surveillance due to their effective feature learning ability. The performance of deep neural networks is often affected by the characteristics of videos, such as poor illumination and inclement weather conditions. It is therefore important to design an innovative deep neural network architecture that handles such videos effectively. In particular, the network requires an appropriate number of convolutional layers, and determining this number is important. In this study, we propose a customized deep convolutional encoder–decoder network, called CEDSegNet, for moving object detection in a video sequence. The CEDSegNet is based on SegNet, and the number of its encoder and decoder parts is chosen to be two. By customizing SegNet with two encoder and decoder parts, the proposed CEDSegNet improves detection performance while reducing the number of parameters. The two encoder and decoder parts generate feature maps that preserve the fine details of object pixels in videos. The proposed CEDSegNet is tested on multiple video sequences of the CDNet 2012 dataset. The results obtained using CEDSegNet for moving object detection in the video frames are interpreted qualitatively. Further, the performance of CEDSegNet is evaluated using several quantitative indices. Both the qualitative and quantitative results demonstrate that the performance of CEDSegNet is superior to the state-of-the-art network models VGG16, VGG19, ResNet18 and ResNet50.
Keywords Convolutional neural network · Deep learning · Object detection · Performance analysis · Video surveillance
different shapes. The shape of an object is delineated by its pixel positions and boundary information. To obtain an accurate shape of an object, several object detection methods have been designed, such as frame difference [27], background subtraction [30], and deep neural networks [12, 35]. The frame difference methods rely on threshold values; their performance degrades when the threshold criterion is not satisfied, and selecting a threshold that captures the shape of an object is difficult. Background subtraction methods first estimate the background of an object from multiple video frames. The difference between the estimated background frame and the current frame is then computed to identify the foreground object. Estimating the background from a video is complex when the background varies, for example under illumination changes, and these methods may fail to handle such videos. Deep convolutional neural networks can avoid the issues with the aforesaid methods and are widely used for moving object detection in videos. Deep learning refers to learning features deeply from an input video frame. A class of deep convolutional neural networks for feature learning from a video sequence can be found in the investigations of [12, 23], and [15].

Deep convolutional neural networks consist of several convolutional and pooling layers which are arranged in a hierarchical fashion. The convolutional and pooling layers are grouped into multiple blocks of the encoder and decoder networks. The encoder–decoder networks generate feature maps at the original spatial resolution through down-sampling and up-sampling. SegNet [1], the VGG net [28], and its variants VGG16 and VGG19 are typical examples of deep convolutional neural network models. The architectures of SegNet and the VGG net are similar, but the depth of the former model (involving encoder and decoder networks) is less than that of the latter. SegNet is designed by removing the fully connected layers of the VGG net. The architecture of SegNet involves several encoder and decoder networks that grow the depth of the network. Due to this large depth, its training process needs to tune many parameters, which leads to a high computation time. Further, it is claimed that SegNet attains impressive performance for the segmentation of image datasets [1]. However, the architecture of SegNet has to be reconstructed for video, as a video is a collection of images. The important aspect here is to select an appropriate number of encoder and decoder networks. The encoder and decoder networks are responsible for producing high-resolution feature maps comprising the fine details of the features needed to identify an accurate shape of an object by its pixel positions. With a small adaptation of the encoder and decoder networks and the input vector, a sophisticated SegNet, called CEDSegNet, is developed. The novelty lies in choosing a suitable number of encoder and decoder parts and in the representation of the input vector (i.e., the network's input is generated anew). Further, an application of moving object detection in video frames using the proposed CEDSegNet is provided for the first time. The CEDSegNet is proposed for the purpose of improving the accuracy of moving object detection in a video. The proposed network also has the advantages of fewer parameters used for end-to-end learning and low computational time. The formation of the CEDSegNet is described as follows.

The proposed network has two parts, an encoder and a decoder, similar to the architecture designed in [1]. The depth of the network (SegNet) is reduced by modifying the encoder and decoder parts. Here, the number of encoder and decoder parts is set equal to 2. The total number of layers in the encoder and decoder parts of the network is 31. The modified encoder–decoder SegNet is called CEDSegNet. The representation of the input vector of CEDSegNet is based on a large number of data samples (videos) which are generated through a data augmentation process under multiple scenarios. A data sample (video), with its object and background regions, is partitioned into frames, which are further partitioned into blocks. These blocks are fed to the CEDSegNet, along with the corresponding labeled blocks, as its input. The CEDSegNet is trained end to end with large sets of masked frames (blocks) representing moving objects. The proposed network uses fewer fine-tuned parameters during training. The encoder–decoder networks enable the CEDSegNet to precisely segment objects, where the 8-connected component principle is used to examine the object region.

The paper is organized as follows. Section 2 provides a literature review of related deep neural networks. The preliminaries of the encoder and decoder parts of SegNet and its training process are provided in Sect. 3. Section 4 presents the design structure of the proposed CEDSegNet. Section 5 reports the experimental results and discussion of the proposed network in comparison with the state-of-the-art methods. Conclusions and further research directions of this study are presented in Sect. 6.

2 Literature survey

Deep convolutional neural networks are widely used, due to their powerful learning ability, in computer vision applications like segmentation, object detection, and moving object detection. We discuss different deep neural networks designed recently which are related to the present investigation.

An attempt is made to design an architecture of fully convolutional neural networks (FCNs) for pixel-wise
semantic segmentation [19]. The constituent parts of the FCNs are AlexNet [16], VGG net [28] and GoogLeNet [32]. Skip connections in the FCNs combine learned features. These FCNs are used for semantic segmentation applications, such as the segmentation of bridges and mountains. In [26], the authors develop a multi-scale deep encoder–decoder network for object detection in images. The input of the network is multiple images with varying sizes. The network generates saliency feature maps at different locations (scales) for detecting objects in images. A deep learning framework for object detection in image datasets is available in [17]. Here, deep features from visual objects, in terms of augmented categories, are extracted using a hierarchical feature model (HFM). The augmented categories are person categories like sitting, standing, and riding. A hierarchical ensemble classifier for object recognition and localization is designed, where a support vector classifier calculates the confidence scores of each of the augmented categories. A detailed review of deep neural networks for object detection in images is provided in [11, 21]. All the above methods deal with object detection or semantic segmentation in images. A video is a collection of images. The following paragraphs present different methods of object detection in videos that are developed under the deep learning framework.

In [9], the authors develop a cross-attention Siamese network by incorporating a cross-attention module into a deep encoder–decoder network. This network generates high- and low-level feature maps that preserve both the semantic information and the local context of an object. The maps are fused using the cross-attention module to produce spatial-temporal features, which lead to accurate object detection results. In [13], one can find the concept of a tubelet, a sequence of bounding boxes in a video frame, which is used to design a tubelet proposal network (TPN). The TPN employs a spatial-temporal tubelet proposal model and an encoder–decoder LSTM model to generate bounding boxes within a video frame and classify tubelets into their respective categories. By utilizing tubelet information, the TPN achieves high object detection accuracy in videos. One can refer to [14] for a deep learning framework for object detection in a video. Here, the contextual and temporal information of an object extracted from the tubelet is incorporated into the R-CNN and faster R-CNN models to design the deep learning framework. Further, [12] provides a comprehensive literature review of various deep learning methods designed for video object detection, including analyses of design processes and the modeling of feature maps.

Several deep-learning methods for moving object detection in videos have been developed. One such method is the deep convolutional neural network developed in [35]. Here, a coarse grain detection method and a region extraction method are used. The coarse grain detection method involves a low pass filter, a filter-based object detection method, and mathematical morphology. While high-frequency noise is removed using the low pass filter, the mathematical morphology suppresses the ill effects of noise. The filter-based method detects the motion of an object. The region extraction method identifies the moving regions of an object. A motion U-Net (MU-Net) for moving object detection in videos is designed in [25]. The network generates an output mask for the moving object, where pixel-level motion features and object-level appearance cues are extracted. [24] explores the use of a histogram to design a motion saliency network for background estimation from video frames. The model generates temporal saliency maps from the estimated background for detecting moving objects. Another method, namely multi-view 3D-CNN, for moving object recognition in videos is illustrated in [33]. The network uses a multi-view 3D-CNN with multiple convolutional layers and a pooling layer to generate a feature vector that accurately detects the boundaries of a moving object. A review of moving object detection in a video sequence based on a deep convolutional encoder–decoder network can be found in [18].

3 SegNet: preliminaries

SegNet is developed in [1]. It consists of two sub-networks, namely an encoder part and a decoder part, followed by a classification layer. There are 5 encoders and 5 decoders in SegNet. Each encoder is designed with 2 convolutional layers, 2 batch normalization layers, and 2 ReLU activation layers, followed by a max-pooling layer. The convolutional layer extracts features from the input image using a set of kernels. The size of a kernel can be 7 × 7. During the convolution process, the kernel is moved over the whole input image with a stride of 1. The output of the convolution layer is input to the ReLU activation layer, which adds non-linearity. The max-pooling layer then down-samples the output of the ReLU activation layer. In the decoder part, the convolutional layers, ReLU function, max-pooling layers, and convolutional kernels are the same as in the encoder part. The output of the final decoder (fifth decoder) is fed into a SoftMax layer to classify pixels as object or background. There are 64 kernels in each block of the convolution layer. We describe the layers in detail as follows.

3.1 Convolution layer

In the convolution layer, an input image is convolved with a filter or kernel matrix. During convolution, the filter is
moved over the input image in both the horizontal and vertical directions. Convolutional operations are applied not only to the input pixels or input images but also to the outputs of other convolutional layers. The aim is to generate a feature map from the input image. The number of feature maps corresponds to the number of convolutions. Typically, the number of filters or kernels in the convolution layer is a power of two. The size of a feature map (F_map) generated by the convolution layer is calculated using

F_map = (((I − K + 2p) / s) + 1) × K_n,   (1)

where F_map is an output feature map, I is the input frame size, K is the convolution kernel or filter size, K_n is the number of kernels used in the convolution process, s is the stride value and p is the padding value.

Figure 1 depicts how an input image I is convolved with a kernel K. We choose the size of the kernel as 3 × 3. We perform the convolution between an input frame I and a kernel K using Eq. 2, which is obtained from [29]:

F_map = Σ_{i=0}^{column} Σ_{j=0}^{row} I(p − i, q − j) K(i, j),   (2)

where F_map is the output feature map, i and j are the columns and rows of the kernel or filter, p and q are the columns and rows of the image, respectively, and K is a filter. The size (rows × columns) of K is typically chosen equal to 3 × 3. Here, 16 kernels (K_n) are required to generate a feature map F_map of size 4 × 4, based on the input matrix and kernel matrix in Fig. 1. The kernel is moved over the input image with a stride of 1.

3.2 ReLU activation layer

There are different activation functions, such as the tangent, sigmoid, and rectified linear unit (ReLU). In SegNet, the ReLU activation function is used. It converts any negative value into zero and returns the input value otherwise. It also transforms the input into a non-linear form and is computationally efficient. The ReLU activation function can be defined as

ReLU(i) = { i, if i > 0;  0, if i ≤ 0 }.   (3)

3.3 Pooling layer

The pooling layer reduces the spatial size of a feature map. In effect, the computational complexity and the number of parameters of the network are reduced. There are three types of pooling layers: max pooling, min pooling, and average pooling. The output size of max pooling is defined as

Max-Pooling = ((I − W) / s) + 1,   (4)

where I is the size of the feature map, W is the window size, and s is the stride value. The output of max pooling is sensitive to network initialization.

3.4 Loss function

The loss function calculates the loss value that is used to assess the performance of the network. Here, the cross-entropy is used as the loss function to classify pixels at the SoftMax layer. It is defined as

Loss = − Σ_{i}^{c} t_i log(s_i),   (5)

where t_i is the actual value and s_i denotes the predicted output of the network for each class i.

3.5 Weight updating

The SegNet is trained through a stochastic gradient descent algorithm [2]. During training, the network parameters
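To make Eqs. 1–5 concrete, the following is a minimal NumPy sketch of the layer computations described in Sects. 3.1–3.4. The frame size, kernel values and function names are illustrative assumptions rather than the authors' implementation.

```python
# Minimal NumPy sketch of the building blocks in Sects. 3.1-3.4.
# Shapes and values are illustrative; this is not the authors' code.
import numpy as np

def feature_map_size(I, K, p, s, Kn):
    """Eq. 1: spatial side of the output map and the number of maps."""
    side = (I - K + 2 * p) // s + 1
    return side, Kn

def conv2d(image, kernel, stride=1):
    """Eq. 2: valid 2-D convolution of a single-channel image with one kernel."""
    k = kernel.shape[0]
    out = (image.shape[0] - k) // stride + 1
    fmap = np.zeros((out, out))
    flipped = kernel[::-1, ::-1]          # convolution flips the kernel
    for r in range(out):
        for c in range(out):
            patch = image[r * stride:r * stride + k, c * stride:c * stride + k]
            fmap[r, c] = np.sum(patch * flipped)
    return fmap

def relu(x):
    """Eq. 3: keep positive responses, zero out the rest."""
    return np.maximum(x, 0.0)

def max_pool(fmap, window=2, stride=2):
    """Eq. 4: non-overlapping max pooling; output side = (I - W)/s + 1."""
    out = (fmap.shape[0] - window) // stride + 1
    pooled = np.zeros((out, out))
    for r in range(out):
        for c in range(out):
            pooled[r, c] = fmap[r * stride:r * stride + window,
                                c * stride:c * stride + window].max()
    return pooled

def cross_entropy(t, s):
    """Eq. 5: loss between a one-hot target t and predicted probabilities s."""
    return -np.sum(t * np.log(s + 1e-12))

# Example with a 6x6 frame and a 3x3 kernel (stride 1, no padding):
frame = np.random.rand(6, 6)
kernel = np.random.rand(3, 3)
fmap = max_pool(relu(conv2d(frame, kernel)))
print(feature_map_size(I=6, K=3, p=0, s=1, Kn=16))   # (4, 16): a 4x4 map per kernel
print(fmap.shape)                                    # (2, 2) after 2x2 max pooling
print(cross_entropy(np.array([1.0, 0.0]), np.array([0.9, 0.1])))
```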
The input vector representation is shown above. The architecture of CEDSegNet is shown in Fig. 3.

The depth of the network (SegNet) is reduced by choosing two encoder and decoder parts. The encoder of the CEDSegNet has 2 convolutional layers, 2 batch normalization layers, and 2 ReLU activation layers, followed by a max-pooling layer. The structure of the encoder part is similar to that of SegNet [1]. The decoder part of the CEDSegNet is constructed with the same number of layers, involving convolution, batch normalization, ReLU activation, and max pooling, as in the encoder part. Following the encoder–decoder parts, a softmax layer is present in the proposed model. We perform the convolution process, ReLU activation, and max pooling to generate a feature map as explained in Sects. 3.1–3.3. The details of the layers involved in designing the architecture of CEDSegNet are provided in Tables 1 and 2.

The total number of layers in both the encoder and decoder parts of the CEDSegNet, including the input layer, is 31 (see Table 1). We use an RGB input frame of size 16 × 16 × 3. A kernel of size 3 × 3 with stride 1 and padding 1 is used in the convolution process. The total number of kernels used in the proposed network model is 64. In the max pooling layer, a kernel of size 2 × 2 with stride 2 is used. These details are provided in Table 2.

Table 1 (fragment) Layer counts of the CEDSegNet: 8 convolutional, 8 batch normalization, 8 ReLU and 4 max-pooling layers, followed by 1 softmax and 1 classification layer.

Table 2 Details of the convolutional layer, size of kernels, and max pooling layer for CEDSegNet (i/p size 16 × 16 × 3; conv. kernel 3 × 3, padding 1, stride 1; 64 kernels; max pooling kernel 2 × 2, stride 2).

The CEDSegNet is trained through the stochastic gradient descent method [2]. During training, the weight parameters of CEDSegNet are updated using Eq. 6 in Sect. 3.5. The learning parameters required for training CEDSegNet are calculated using Eq. 7 in Sect. 3.6. Here, we use a batch normalization process. The training data is split into batches, called mini-batches, each of size 16. This has the advantages of fast training speed, avoiding adverse effects on the weight parameters, and attaining high accuracy.

4.3 State-of-the-art methods used for comparison

In this section, we describe the structure of the state-of-the-art methods VGG16, VGG19, ResNet18, and ResNet50. Each method has its own network architecture designed with convolutional layers, batch normalization layers, max-pooling layers, ReLU activation layers, a softmax layer, and a classification layer. The ResNet models additionally involve crop layers, addition layers, and depth concatenation layers. There are five blocks in every method.

The details of the aforesaid layers for each method are shown in Table 3. For example, the VGG16 model has 26 convolutional layers, 26 batch normalization (BN) layers, 26 ReLU activation layers, and 10 max-pooling layers, which are followed by 1 softmax layer and 1 classification layer. The total number of layers in all the blocks of the VGG16 model, including an input layer, is 91. Similarly, the figures corresponding to VGG19, ResNet18, and ResNet50 can be found in the table. The total numbers of layers of VGG19, ResNet18, and ResNet50 are 109, 100 and 206, respectively.

Table 4 provides the details of the convolutional layer, kernel size (filter), and max pooling layer for all the methods. For the VGG16 model, in the convolutional layer, the size of the kernel is chosen as 3 × 3 with a stride of 1 and padding of 1. The max pooling layer has a filter size of 2 × 2, a stride of 2, and a padding of 0. Here, the size of the input vector is 224 × 224 × 3. The values in parentheses, 64, 128, 256, 512, and 512, represent the number of convolutional kernels in blocks 1, 2, 3, 4, and 5, respectively. Similar figures for the other methods can be seen in the table.

As an example, we provide a comparison of CEDSegNet with SegNet: (i) the number of layers in the proposed CEDSegNet is less than in SegNet; (ii) two encoders and decoders are used in the proposed CEDSegNet, whereas SegNet uses five encoders and decoders; (iii) the total number of layers in both the encoder and decoder parts of the CEDSegNet, including the input layer, is 31, whereas for SegNet it is 73 — SegNet has 20 conv. layers, 20 batch normalization layers, 20 ReLU layers, and 10 max pooling layers, followed by 1 SoftMax and 1 classification layer, while the corresponding details for the CEDSegNet are provided above; (iv) the kernel size for CEDSegNet is 3 × 3, whereas for SegNet the kernel size is 7 × 7 with stride 1 and padding 3; in max pooling, a filter of size 2 × 2 with stride 2 and padding 0 is chosen for SegNet, and for the CEDSegNet the details of the max pooling layer are given above.

In a similar way, we can also provide comparisons of the CEDSegNet with the above state-of-the-art methods, using the details of layers and kernel functions in Tables 3 and 4.
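As an illustration of the layer counts in Table 1 and the kernel settings in Table 2, the following is a PyTorch sketch of a two-encoder/two-decoder SegNet-style network. The channel width of 64, the use of max-unpooling with stored pooling indices, and the final two-channel classifier are assumptions made for the sketch; it is not the authors' implementation.

```python
# PyTorch sketch of a 2-encoder/2-decoder SegNet-style network with the
# kernel settings of Table 2 (3x3 conv, pad 1, stride 1; 2x2 max pooling,
# stride 2).  Channel widths, the unpooling strategy and the final classifier
# are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class CEDSegNetSketch(nn.Module):
    def __init__(self, in_ch=3, num_classes=2, width=64):
        super().__init__()
        # Two encoder parts: (conv-BN-ReLU) x 2, then 2x2 max pooling.
        self.enc1 = nn.Sequential(conv_bn_relu(in_ch, width), conv_bn_relu(width, width))
        self.enc2 = nn.Sequential(conv_bn_relu(width, width), conv_bn_relu(width, width))
        self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)
        # Two decoder parts mirror the encoders; max-unpooling reuses the
        # pooling indices, as in SegNet.
        self.unpool = nn.MaxUnpool2d(2, stride=2)
        self.dec2 = nn.Sequential(conv_bn_relu(width, width), conv_bn_relu(width, width))
        self.dec1 = nn.Sequential(conv_bn_relu(width, width), conv_bn_relu(width, num_classes))

    def forward(self, x):
        x = self.enc1(x)
        x, idx1 = self.pool(x)
        x = self.enc2(x)
        x, idx2 = self.pool(x)
        x = self.dec2(self.unpool(x, idx2))
        x = self.dec1(self.unpool(x, idx1))
        # Per-pixel class scores; softmax gives object/background probabilities.
        return torch.softmax(x, dim=1)

# One 16x16x3 RGB block, as used for training in Sect. 5.3:
block = torch.randn(1, 3, 16, 16)
probs = CEDSegNetSketch()(block)
print(probs.shape)  # torch.Size([1, 2, 16, 16])
```

With the hyperparameters reported in Sect. 5.3, such a sketch would be trained with stochastic gradient descent (initial learning rate 0.1, momentum 0.9, L2 regularisation 0.0001) on mini-batches of 16 blocks.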
Table 4 Details of the convolutional layer, convolutional kernels, and max pooling layer for the comparative methods (columns: network; conv. layer i/p size, kernel size, padding, stride, no. of conv. kernels; max pooling kernel size, padding, stride).
5.1 Dataset

The dataset, CDNet 2012, can be downloaded from http://jacarini.dinf.usherbrooke.ca/dataset2012/. It comprises a diverse collection of 6 categories, each containing between 4 and 6 video sequences. Out of these sequences, 4 real-life RGB video sequences, Boulevard, Pets2006, Pedestrians and Highway, with multiple effects (e.g., camera jitter and object motion blur), are considered. The details of these video sequences are shown in Table 5.

5.2 Indices to evaluate the performance of the proposed CEDSegNet

We use quantitative indices to evaluate the performance of CEDSegNet for detecting objects in videos. The indices are precision (Pr), recall (Re), specificity (Spe), false positive rate (FPR), false negative rate (FNR), percentage of wrong classification (PWC), F_Measure and MCC. The formula for every index is given in Table 6.

For example, the notion of PWC is based on true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). A true positive (TP) is an output pixel for which the proposed model correctly predicts the positive class (say, object). A false positive (FP) is an output pixel for which the proposed model wrongly predicts the positive class. Similarly, true negatives (TN) and false negatives (FN) are the output pixels of the proposed model which are predicted correctly and incorrectly as the negative class (background), respectively. The other notions can be explained in a similar way.

5.3 Training of CEDSegNet and its hyperparameters

The CEDSegNet model is trained using the stochastic gradient descent (SGD) algorithm [2]. A batch normalization process [8], involving mini-batches instead of the entire training set, is used. For the training of CEDSegNet, we select four RGB video sequences and their corresponding labeled video sequences from the CDNet 2012 dataset. Each video sequence belongs to a particular category; the category of every video sequence is given in Table 5. The total number of frames in the four video sequences is 6499. Out of these, 4639 frames have ground truth, and these frames are used in the experiments. The training set and test set are selected similarly to [10]. The details are discussed as follows.

For the training set, 40 video frames and their labeled frames (40) are selected randomly out of the 4639 frames. All these frames are re-scaled to a resolution of 320 × 240 × 3. We split every frame into 16 × 16 × 3 blocks. The total number of blocks in a frame is 300 ((320 × 240)/(16 × 16)). For 40 frames, the number of blocks is 12,000 (40 × 300). Similarly, we obtain 12,000 blocks for the 40 labeled frames corresponding to the video frames. Therefore, the training set consists of 24,000 blocks, and the size of each block is 16 × 16 × 3.

For the test set, 20 video frames are randomly selected out of the 4599 frames remaining after excluding the training video frames and their labeled frames. The size of each frame is 320 × 240 × 3. The test frames are not divided into blocks.
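The block construction above can be sketched as follows; the tiling order and the function name are illustrative assumptions.

```python
# Sketch of the training-block construction in Sect. 5.3: a re-scaled 320x240
# RGB frame is tiled into non-overlapping 16x16 blocks (300 per frame).
import numpy as np

def frame_to_blocks(frame, block=16):
    """Split an (H, W, 3) frame into (N, block, block, 3) non-overlapping tiles."""
    h, w, _ = frame.shape
    tiles = []
    for r in range(0, h - block + 1, block):
        for c in range(0, w - block + 1, block):
            tiles.append(frame[r:r + block, c:c + block, :])
    return np.stack(tiles)

frame = np.zeros((240, 320, 3), dtype=np.uint8)   # one re-scaled video frame
mask = np.zeros((240, 320, 3), dtype=np.uint8)    # its labeled (ground-truth) frame
x_blocks = frame_to_blocks(frame)                 # 300 image blocks per frame
y_blocks = frame_to_blocks(mask)                  # 300 labeled blocks per frame
print(x_blocks.shape)                             # (300, 16, 16, 3)
# 40 frames -> 40 * 300 = 12,000 image blocks plus 12,000 labeled blocks,
# i.e. the 24,000 training blocks reported in the text.
```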
Table 6 Formulae of the quantitative indices

Specificity = TN / (TN + FP)
PWC = 100 (FN + FP) / (TP + FP + FN + TN)
Precision = TP / (TP + FP)
F_Measure = 2 (Recall × Precision) / (Recall + Precision)
MCC = ((TN × TP) − (FN × FP)) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
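The indices of Table 6 can be computed directly from the pixel-level counts TP, FP, TN and FN; the sketch below uses the standard definitions of recall, FPR and FNR for the indices whose rows are not reproduced above, and the helper name and counts are illustrative.

```python
# Sketch of the Table 6 indices computed from pixel-level counts TP, FP, TN, FN.
import math

def indices(tp, fp, tn, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # Re, also the true positive rate
    spe = tn / (tn + fp)               # specificity
    fpr = fp / (fp + tn)               # false positive rate
    fnr = fn / (fn + tp)               # false negative rate
    pwc = 100.0 * (fn + fp) / (tp + fp + fn + tn)
    f_measure = 2 * recall * precision / (recall + precision)
    mcc = ((tn * tp) - (fn * fp)) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return dict(Pre=precision, Rec=recall, Spe=spe, FPR=fpr,
                FNR=fnr, PWC=pwc, F_Measure=f_measure, MCC=mcc)

# Example: counts accumulated over all object/background pixels of a sequence.
print(indices(tp=9_000, fp=1_200, tn=88_000, fn=800))
```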
The CEDSegNet involves a set of hyperparameters whose values need to be initialized for its training. The hyperparameters of CEDSegNet and the values chosen for its training are provided in Table 7. The weight parameters of the CEDSegNet are initialized using the 'Glorot' method [6]. Cross-entropy is chosen as the loss function. The CEDSegNet involves L2 regularisation and momentum parameters, which are chosen equal to 0.0001 and 0.9, respectively, as given in [1]. The number of epochs is set to 10. The value of the initial learning rate, α, is 0.1: the network is initially trained for different values of the learning rate chosen between 0 and 1, and the value 0.1, for which the network attains the best performance, is fixed.

Using Eq. 7, the CEDSegNet requires about 0.12 million learning parameters in total, whereas these parameters for VGG16, VGG19, ResNet18, and ResNet50 are about 2.94 million, 4 million, 2.06 million, and 4.39 million, respectively. The quantitative and qualitative results of the CEDSegNet are discussed as follows.

5.4 Qualitative results of CEDSegNet

The performance of CEDSegNet for object detection, as compared to the ground truth, ResNet18, ResNet50, VGG16, and VGG19, for four video frames is shown qualitatively in Fig. 4. Here, the ground truth results are collected from http://jacarini.dinf.usherbrooke.ca/dataset2012/.

The first three rows in the figure are related to the single object detection results. These are followed by the results of multiple object detection using the CEDSegNet in the fourth row. Each row corresponds to a video sequence belonging to a particular category. In the figure, the results of the other models, including the ground truth, are also provided for comparison with CEDSegNet.

The CEDSegNet detects well-shaped objects in the video frames, where the pixel positions and boundary information of objects are correctly identified, whereas the objects detected by ResNet18, ResNet50, VGG16, and VGG19 are not well-shaped. The same information is reflected in the video sequences in Fig. 4. The performance of CEDSegNet is superior to ResNet18, ResNet50, VGG16, and VGG19. The ground truth results are slightly better than those of the CEDSegNet, as expected. This description concerns single object detection in videos.

In the case of multiple objects in a video frame (fourth row), the CEDSegNet detects objects which are well-separated from the background, whereas, using ResNet18, ResNet50, VGG16, and VGG19, the detected objects in the video sequences are seen to be connected to each other. The objects detected using CEDSegNet appear to be close to those of the ground truth. In contrast, using the state-of-the-art models, the background pixels in overlapping regions are detected as objects; thereby the detected objects do not appear close to the ground truth results.

5.5 Results of CEDSegNet using quantitative indices

For every resultant video sequence belonging to boulevard, pets2006, pedestrians, or highway, the aforesaid specificity, FPR, FNR, PWC, precision, recall, F_Measure and MCC (Matthews correlation coefficient) are applied to assess the performance of the proposed CEDSegNet. For example, we provide the values of the indices for a video sequence pertaining to pedestrians in Table 8. Here, the proposed method achieves balanced values of precision and recall. In the other methods, precision is much higher than recall. Considering precision only, our method yields flexible results in a few cases (comparisons). However, it obtains flexible performance in all cases
Fig. 4 Four different input video frames are shown in column (a). The performance of CEDSegNet (column c) is compared with the ground truth (column b) and the state-of-the-art methods ResNet18 (column d), ResNet50 (column e), VGG16 (column f) and VGG19 (column g) for all the video frames
(comparisons) corresponding to recall. It may be noted that a flexible F_Measure is attained when the model yields balanced values of precision and recall. These values support the quantitative results demonstrated in [31]. The value of F_Measure indicates that the proposed method is the best. For specificity, our method has slightly lower performance. The values of the indices indicate that the CEDSegNet outperforms the ResNet18, ResNet50, VGG16, and VGG19 models, where precision + recall is treated as one index.

As explained above, for every video sequence (boulevard, pets2006, pedestrians or highway), the scores of specificity, FPR, FNR, PWC, precision, recall, F_Measure and MCC (Matthews correlation coefficient) for CEDSegNet are obtained. The average scores over all the video sequences obtained for CEDSegNet, as compared to the ResNet18, ResNet50, VGG16 and VGG19 models [7], are reported in Table 9.

It can be observed from Table 9 that the performance of the proposed CEDSegNet is superior to ResNet18,
ResNet50, VGG16 and VGG19 with respect to FNR, PWC, F_Measure and MCC (Matthews correlation coefficient). Further, the CEDSegNet performs better than the other methods in terms of recall. Here, the CEDSegNet provides balanced values of precision and recall, unlike ResNet18, ResNet50, VGG16 and VGG19.

Using specificity (Spe) and FPR, the results of the CEDSegNet are also better than those of the other methods except for VGG19, which achieves slightly better performance. Although these two values (specificity (Spe) and FPR) seem to be better for VGG19, the proposed CEDSegNet provides a low error rate in terms of FNR and PWC and a high classification score in terms of F_Measure and MCC (i.e., clear visibility of an object from the background in all the frames, as in Fig. 4).

Table 10 provides the number of layers and the training time in minutes for CEDSegNet, as compared to VGG16, VGG19, ResNet18, and ResNet50. The total number of layers for the proposed CEDSegNet is 31, whereas the numbers of layers for VGG16, VGG19, ResNet18, and ResNet50 are 91, 109, 100, and 206, respectively. These figures indicate that the size of CEDSegNet is smaller than that of VGG16, VGG19, ResNet18, and ResNet50. It may be noted that Sect. 4.2.1 details the proposed CEDSegNet; the models VGG16, VGG19, ResNet18, and ResNet50 used for comparison are discussed in detail in Sect. 4.3.

The training time of CEDSegNet compared with VGG16, VGG19, ResNet18, and ResNet50 is reported in Table 10. It is observed from the table that the training time (CPU time) for CEDSegNet is lower than that of all the models except ResNet18. Here, ResNet18 has a slightly lower value due to the use of Atrous Separable Convolution based on Deeplabv3plus [5]. However, the CEDSegNet is better than VGG16, VGG19, ResNet18, and ResNet50 (please find the results in Tables 8 and 9). The superiority is achieved by the CEDSegNet due to its newly generated input vector, reduced depth, and small set of learning parameters, whereas the state-of-the-art methods do not contain those features.

5.6 Test performance of CEDSegNet using a cross-validation technique

As mentioned in the above section, we select 60 video frames, out of 4639 video frames, belonging to four categories of video sequences: pedestrians, highway, pets2006, and boulevard. Out of the 60 video frames, 40 video frames are selected for the training set and 20 for the test set. The performance of the CEDSegNet is found to be better than the state-of-the-art methods. We use a cross-validation technique to further establish the superiority of the proposed model. The training process of the model is explained as follows.

We use the k-fold cross-validation method to test the performance of the proposed CEDSegNet. Here, the value of k is chosen equal to 5. In the five-fold cross-validation method, we use the 60 video frames belonging to the four categories of video sequences, pedestrians, highway, pets2006, and boulevard. We split the 60 video frames into five folds, where each fold consists of 12 video frames covering all four categories. The first four folds are treated as the training set and the remaining fold is used as the test set. This process is repeated five times, considering each fold as the test set exactly once.
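The five-fold protocol described above can be sketched with scikit-learn's KFold; the frame identifiers are placeholders, and the per-category balancing of each fold is not shown.

```python
# Sketch of the five-fold protocol in Sect. 5.6: 60 frames, 5 folds of 12,
# each fold used exactly once as the test set.  Frame identifiers are
# placeholders; stratification by video category is not shown.
from sklearn.model_selection import KFold

frame_ids = [f"frame_{i:02d}" for i in range(60)]     # 60 selected video frames
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

for fold, (train_idx, test_idx) in enumerate(kfold.split(frame_ids), start=1):
    train_frames = [frame_ids[i] for i in train_idx]  # 48 frames (4 folds)
    test_frames = [frame_ids[i] for i in test_idx]    # 12 frames (1 fold)
    # train CEDSegNet on train_frames, evaluate the Table 6 indices on test_frames
    print(fold, len(train_frames), len(test_frames))
```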
Table 10 Comparison of network size and CPU time for the proposed CEDSegNet with the state-of-the-art methods

Name of the network     Size (no. of layers)    Training time (in minutes)
Proposed CEDSegNet      31                      4.17
VGG16                   91                      10.82
VGG19                   109                     12.78
ResNet18                100                     2.32
ResNet50                206                     4.26
Fig. 5 Comparative analysis of CEDSegNet (column c) with different methods, ground truth (column b), ResNet18 (column d), ResNet50
(column e), VGG16 (column f), VGG19 (column g). The results are based on a cross-validation technique
Table 11 Results of CEDSegNet for every fold using different quantitative indices

Fold                  Spe      FPR      FNR      PWC      Pre      Rec      F_Measure   MCC
1                     0.9885   0.0115   0.0272   1.1404   0.7791   0.9727   0.8626      0.8639
2                     0.9913   0.0086   0.2840   1.4075   0.8263   0.7159   0.7388      0.7475
3                     0.9823   0.0176   0.0167   1.7327   0.6051   0.9832   0.7381      0.7583
4                     0.9865   0.0134   0.0329   1.3769   0.6909   0.9671   0.8024      0.8091
5                     0.9837   0.0162   0.0410   1.6345   0.6811   0.9589   0.7934      0.7993
Avg. of 5 folds       0.9865   0.0135   0.0804   1.4584   0.7165   0.9195   0.7871      0.7956
Std. deviation (σ)    –        –        –        –        –        –        0.0517      0.0462
When background pixels are close to the boundary of an object, the VGG19 exhibits a tendency to prioritize identifying those pixels as object pixels. This leads to the model attaining high specificity and a low false positive rate.

We provide an ROC diagram, as discussed in [20], for the depiction of accuracy in Fig. 6. The ROC curve illustrates the trade-off between the true positive rate (TPR) and the false positive rate (FPR) at different classification thresholds. It is often used to evaluate and compare the performance of different classification models. The results of the network models for the four video sequences using the ROC curve reveal a distinct performance differentiation, as shown in Fig. 6. In the present investigation of the ROC curve, the proposed network, CEDSegNet, demonstrates a substantially higher AUC value (AUC = 0.98) compared to the other state-of-the-art models, indicating its discriminative capacity in accurately classifying the object pixels.

6 Conclusion

A customized encoder–decoder SegNet (CEDSegNet) for detecting objects in video surveillance is proposed in this study. A coarse-to-fine-grained moving object detection framework is presented with a novel encoder–decoder network, called CEDSegNet. An application of object
segmentation in videos using the proposed CEDSegNet is provided. The architecture of CEDSegNet is identical to that of SegNet; however, the number of convolutional layers of the encoder and decoder parts is modified, and its input vector is generated anew. Those modifications are made with the aim of improving the proposed network's performance. The performance of the proposed CEDSegNet is demonstrated on several video frames of the CDNet 2012 dataset, where it is used to detect single and multiple objects. The results of CEDSegNet, as compared to the state-of-the-art methods, are provided for four video sequences as examples. The qualitative results demonstrate that the CEDSegNet is able to detect object regions similar to the ground truth results. Further, the CEDSegNet provides better results than ResNet18, ResNet50, VGG16, and VGG19. The significance of the proposed CEDSegNet is that it has the ability to identify accurate object shapes and boundary information, and it is efficient in handling pixels in multiple object regions.

We use different quantitative metrics to evaluate the object regions detected using the proposed CEDSegNet. The superiority of the proposed CEDSegNet over the state-of-the-art deep neural networks is obtained in terms of specificity, FPR, FNR, PWC, (precision + recall), F_Measure, and MCC. The proposed model has the advantages of fewer parameters and low computational cost; in the case of computation cost, our model is the second best. Our future research task would be to design a rough set-based deep neural network, using the concepts of the rough set, for efficient detection. Further, testing its effectiveness in segmenting objects in video sequences will be a part of future research work.

Availability of data A video database that supports the findings of this study is publicly available at http://jacarini.dinf.usherbrooke.ca/dataset2012.

Declarations

Conflict of interest The authors declare no conflict of interest.

References

1. Badrinarayanan V, Kendall A, Cipolla R (2017) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell 39(12):2481–2495
2. Bottou L (2010) Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT'2010, Springer, pp 177–186
3. Chang X, Yang Y (2016) Semisupervised feature analysis by mining correlations among multiple tasks. IEEE Trans Neural Netw Learn Syst 28(10):2294–2305
4. Chang X, Yu Y-L, Yang Y, Xing EP (2017) Semantic pooling for complex event analysis in untrimmed videos. IEEE Trans Pattern Anal Mach Intell 39(8):1617–1632. https://doi.org/10.1109/TPAMI.2016.2608901
5. Chen L-C, Zhu Y, Papandreou G, Schroff F, Adam H (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European conference on computer vision, pp 801–818
6. He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE international conference on computer vision, pp 1026–1034
7. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
8. Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. Int Conf Mach Learn 37:448–456
9. Ji Y, Zhang H, Jie Z, Ma L, Wu QJ (2020) CASNet: a cross-attention siamese network for video salient object detection. IEEE Trans Neural Netw Learn Syst 32(6):2676–2690
10. Jiang S, Lu X (2018) WeSamBE: a weight-sample-based method for background subtraction. IEEE Trans Circuits Syst Video Technol 28(9):2105–2115
11. Jiao L, Zhang F, Liu F, Yang S, Li L, Feng Z, Qu R (2019) A survey of deep learning-based object detection. IEEE Access 7:128837–128868
12. Jiao L, Zhang R, Liu F, Yang S, Hou B, Li L, Tang X (2021) New generation deep learning for video object detection: a survey. IEEE Trans Neural Netw Learn Syst
13. Kang K, Li H, Xiao T, Ouyang W, Yan J, Liu X, Wang X (2017) Object detection in videos with tubelet proposal networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 727–735
14. Kang K, Li H, Yan J, Zeng X, Yang B, Xiao T, Zhang C, Wang Z, Wang R, Wang X et al (2017) T-CNN: tubelets with convolutional neural networks for object detection from videos. IEEE Trans Circuits Syst Video Technol 28(10):2896–2907
15. Kompella A, Kulkarni RV (2021) A semi-supervised recurrent neural network for video salient object detection. Neural Comput Appl 33:2065–2083
16. Krizhevsky A, Sutskever I, Hinton GE (2017) ImageNet classification with deep convolutional neural networks. Commun ACM 60(6):84–90
17. Lee B, Erdenee E, Jin S, Rhee PK (2016) Efficient object detection using convolutional neural network-based hierarchical feature modeling. Signal Image Video Process 10(8):1503–1510
18. Lim LA, Keles HY (2018) Foreground segmentation using convolutional neural networks for multiscale feature encoding. Pattern Recogn Lett 112:256–262
19. Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)
20. Malebary SJ, Khan R, Khan YD (2021) ProtoPred: advancing oncological research through identification of proto-oncogene proteins. IEEE Access 9:68788–68797
21. Minaee S, Boykov YY, Porikli F, Plaza AJ, Kehtarnavaz N, Terzopoulos D (2021) Image segmentation using deep learning: a survey. IEEE Trans Pattern Anal Mach Intell
22. Muhammad K, Ahmad J, Baik SW (2018) Early fire detection using convolutional neural networks during surveillance for effective disaster management. Neurocomputing 288:30–42
23. Pal SK, Bhoumik D, Bhunia Chakraborty D (2020) Granulated deep learning and z-numbers in motion detection and object recognition. Neural Comput Appl 32(21):16533–16548
24. Patil PW, Murala S (2018) MSFgNet: a novel compact end-to-end deep network for moving object detection. IEEE Trans Intell Transp Syst 20(11):4066–4077
25. Rahmon G, Bunyak F, Seetharaman G, Palaniappan K (2021) Motion U-Net: multi-cue encoder-decoder network for motion segmentation. In: 2020 25th International conference on pattern recognition (ICPR), pp 8125–8132
26. Ren Q, Hu R (2018) Multi-scale deep encoder-decoder network for salient object detection. Neurocomputing 316:95–104
27. Shi G, Suo J, Liu C, Wan K, Lv X (2017) Moving target detection algorithm in image sequences based on edge detection and frame difference. In: 2017 IEEE 3rd information technology and mechatronics engineering conference (ITOEC), IEEE, pp 740–744
28. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
29. Singh SA, Meitei TG, Majumder S (2020) Short PCG classification based on deep learning. In: Deep learning techniques for biomedical and health informatics, Elsevier, pp 141–164
30. St-Charles P-L, Bilodeau G-A, Bergevin R (2014) SuBSENSE: a universal change detection method with local adaptive sensitivity. IEEE Trans Image Process 24(1):359–373
31. St-Charles P-L, Bilodeau G-A, Bergevin R (2016) Universal background subtraction using word consensus models. IEEE Trans Image Process 25(10):4768–4781
32. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–9
33. Wang D, Cui X, Chen X, Zou Z, Shi T, Salcudean S, Wang ZJ, Ward R (2021) Multi-view 3D reconstruction with transformers. In: 2021 IEEE/CVF international conference on computer vision (ICCV), pp 5702–5711. https://doi.org/10.1109/ICCV48922.2021.00567
34. Chang X, Ma Z, Yang Y, Zeng Z, Hauptmann AG (2016) Bi-level semantic representation analysis for multimedia event detection. IEEE Trans Cybern 47(5):1180–1197
35. Zhu H, Yan X, Tang H, Chang Y, Li B, Yuan X (2020) Moving object detection with deep CNNs. IEEE Access 8:29729–29741

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.