
Distill-DBDGAN: Knowledge Distillation and Adversarial

Learning Framework for Defocus Blur Detection

SANKARAGANESH JONNA, MOUSHUMI MEDHI, and RAJIV RANJAN SAHAY,


Indian Institute of Technology Kharagpur, India

Defocus blur detection (DBD) aims to segment the blurred regions from a given image affected by defocus
blur. It is a crucial pre-processing step for various computer vision tasks. With the increasing popularity of
small mobile devices, there is a need for a computationally efficient method to detect defocus blur accurately.
We propose an efficient defocus blur detection method that estimates the probability of each pixel being
focused or blurred on resource-constrained devices. Despite remarkable advances made by recent deep
learning-based methods, they still suffer from several challenges such as background clutter, scale sensitivity,
low-contrast focused regions that are indistinguishable from out-of-focus blur, and especially high computational
cost and memory requirements. To address the first three challenges, we develop a novel deep network that
efficiently detects the blur map from the input blurred image. Specifically, we integrate multi-scale features in
the deep network to resolve scale ambiguities and simultaneously model the non-local structural corre-
lations in the high-level blur features. To handle the last two issues, we frame our DBD algorithm
to perform knowledge distillation by transferring information from the larger teacher network to a compact
student network. All the networks are adversarially trained in an end-to-end manner to enforce higher order
consistencies between the output and the target distributions. Experimental results demonstrate the state-of-
the-art performance of the larger teacher network, while our proposed lightweight DBD model imitates the
output of the teacher network without significant loss in accuracy. The codes, pre-trained model weights, and
the results will be made publicly available.

CCS Concepts: • Computing methodologies → Computer vision problems;

Additional Key Words and Phrases: Defocus blur detection, knowledge distillation, adversarial learning

ACM Reference format:


Sankaraganesh Jonna, Moushumi Medhi, and Rajiv Ranjan Sahay. 2023. Distill-DBDGAN: Knowledge Dis-
tillation and Adversarial Learning Framework for Defocus Blur Detection. ACM Trans. Multimedia Comput.
Commun. Appl. 19, 2s, Article 87 (February 2023), 26 pages.
https://doi.org/10.1145/3557897

1 INTRODUCTION
Defocus blur occurs commonly, and sometimes intentionally, in everyday photography when light rays from
scene points on objects not located at the camera's focus distance converge in front of or behind the image plane.

Sankaraganesh Jonna and Moushumi Medhi contributed equally to this research.


Authors’ address: S. Jonna, M. Medhi, and R. R. Sahay, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal,
India, 721302; email: [email protected].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2023 Association for Computing Machinery.
1551-6857/2023/02-ART87 $15.00
https://doi.org/10.1145/3557897


Fig. 1. A challenging example for DBD. We show the blur detection results for a blurry input image taken
from CUHK dataset [29]. The algorithms proposed in (b)–(g) introduce large incorrect detection regions. Our
proposed method predicts masks closest to the ground truth.

Defocus blur detection (DBD) aims at pixelwise identification of the out-of-
focus regions from an image. DBD has been an active area of research over the past few decades
due to its wide range of potential applications in several vision problems. Automatic detection
of the commonly encountered non-uniform blur with spatially varying point spread function is
a very complicated and challenging task. Common challenges involved are (i) precise detection
of the boundary between visually indistinguishable blurry smooth regions and in-focus smooth
regions in a partially blurred image, (ii) susceptibility of the degree of blur to image scales, (iii) low
detection accuracy, and (iv) runtime detection/inference speed.
Most of the existing blur region detection/classification techniques [6, 25] rely on traditional
handcrafted features that are often based on low-level defocus blur cues, such as gradient, fre-
quency, and contrast. They often fail to detect blur at large homogeneous or low-contrast regions.
Recently, deep convolutional neural network (DCNN)-based DBD methods [4, 33, 34, 37,
45–47] have successfully overcome several limitations of the traditional methods. Hence, several
algorithms have been proposed in this direction, starting from References [18, 25, 45] to the current
state-of-the-art (SOTA) methods [4, 33, 42–44, 47]. Nevertheless, the detection results of
several recent notable works contain many falsely labeled regions. The blur maps obtained using
handcrafted feature-based methods, namely discriminative blur detection features (DBDF) [29], local
binary patterns (LBP) [39], and high-frequency multi-scale fusion and sort transform of gra-
dient magnitudes (HiFST) [6], are shown in Figure 1(b)–(d), respectively. In Figure 1(e)–(g), we
show the defocus detection results generated from deep learning-based DBD methods CRLNet
[46], DeFusionNet [37], and depth distillation (DD) [4], respectively. Note that the results ob-
tained using both the handcrafted feature-based methods in Figure 1(b)–(d) and the CNN models
in Figure 1(e)–(g) suffer from poor quality boundaries of in-focus objects and erroneous detections.
Thus, despite the superior performance of DCNN-based methods for the DBD problem in recent
years, they still encounter several challenges concerning accurate localization of ramified blurred
structures and low-variance homogeneous focal regions. Moreover, the increasingly deep com-
plex models employed in the previous works require high computational and storage resources.
However, the abundance and popularity of small mobile devices in recent times demand more
memory/computationally efficient methods. Motivated by the above observations, in this work, we


Fig. 2. The overall pipeline of our approach. It consists of a larger teacher network T and a smaller student
network S along with multiple discriminators D 1 , D 2 , D 3 . The idea is to transfer knowledge from the teacher
network T to the student network S to improve the blur detection performance of the student network S.
Both the networks are trained using content losses together with adversarial losses (orange and purple).
Information is distilled across both the feature and output spaces (blue). The solid lines are the forward
paths; the dotted lines denote the backward paths. The pipeline allows significant reduction in memory cost
for the task of blur detection during inference.

explore technically, computationally, and economically feasible memory-efficient solutions with


high performance that address the above challenges on resource-constrained devices. For this pur-
pose, we focus on training an efficient blur detection student network (S) while leveraging the
knowledge of a pre-trained teacher network (T ). To address the above challenges, we also formu-
late the problem of DBD using supervised generative adversarial networks (GAN) [7, 19, 21]
that can better capture the high-level global as well as local semantic contexts concealed in smooth
homogeneous regions. Our goal is to improve the accuracy and efficiency of the DBD models us-
ing the adversaries that induce the generators to produce plausible results without adding to the
computational cost during inference. The results obtained from our proposed T and S models
corresponding to the blurred image shown in Figure 1(a) are presented in Figure 1(h) and (i), re-
spectively. The deep teacher network is designed such that multi-scale deep features (owing to the
sensitivity of blur to image scales) and self-attention (SA) guided feature maps are effectively
fused to accurately detect blur even in ambiguous blur regions. The overall idea of our proposed
DBD method is shown in Figure 2.
Our results demonstrate the robustness and efficiency of the proposed network that can effec-
tively handle the challenges involved in DBD with less computational load. We also conduct an
ablation study to analyze the contribution of different factors and components within the proposed
framework. Furthermore, we demonstrate some of the possible applications that benefit from our
blur detection method in Section 6.
Note that in this article, we use the terms “blur detection” and “blur segmentation”
interchangeably.
The contributions of this article are summarized as follows: (1) To the best of our knowledge,
the proposed method is the first attempt to present a knowledge distillation (KD) scheme for
the problem of DBD to improve the performance of lightweight models for blur detection. (2) We
investigate an end-to-end adversarial learning-based framework, namely Distill-DBDGAN, for
detecting spatially varying defocus blur from the input blurred image. (3) We provide qualitative and
quantitative comparison results for blur detection against SOTA techniques on publicly available


datasets CUHK [29], DUT [45], and SZU-Blur detection (SZU-BD) [31] along with the ablation
results. Finally, we provide a few applications of our proposed DBD method.

2 RELATED WORKS
Based on the image features, blur map segmentation techniques are broadly divided into two cat-
egories: methods based on handcrafted features and learning-based algorithms.

2.1 Handcrafted Features for Blur Detection


Shi et al. [29] combined kurtosis, power spectrum, and local filters to train a Bayes classifier to
perform blur/non-blur discrimination. Yi et al. [39] exploited the frequency of LBP for defocus blur
and non-blur region segmentation. The algorithm in Reference [30] learned a dictionary using both
sharp and blurred image patches. For a given input patch, the degree of Gaussian blur is estimated
by the number of dictionary atoms used to reconstruct that patch. Tang et al. [36] iteratively refined
a coarse blur map generated using logarithm of averaged spectrum residual. DCT coefficients were
used in Reference [6] to differentiate between sharp and blurry image regions.

2.2 Learning-based Methods for Blur Detection


Park et al. [25] combined handcrafted features with those obtained from a trained CNN for the
problem of defocus map estimation. In another work [11], a CNN was employed to produce mul-
tiscale blur likelihood maps that were later fused to generate the final blur detection map. Zhao
et al. [45] proposed a fully convolutional bottom-top-bottom network (BTBNet) for pixel-level
defocus blur detection from a given input image. However, the proposed BTBNet needs to generate
a pyramid of images and pass through a multi-stream network during both training and testing.
Zeng et al. [40] trained CNNs to learn locally relevant features and proposed an updating mech-
anism to refine the defocus blur detection result from coarse to dense. The method, being a local
feature-based approach, fails to discriminate large homogeneous regions. A multi-stream bottom
to top and top to bottom convolutional network was proposed in Reference [46] to handle multiple
scales. The low-level features of an end-to-end network were used as input to a cross-ensemble
network (CENet) [47] for defocus blur detection. Cun et al. [4] proposed a defocus blur detection
algorithm via DD. Relative depth information is distilled from a pre-trained depth estimation net-
work to leverage blur detection. Zhang et al. [42] employed a self-supervised training objective and
augmented data to inhibit semantic information. Tang et al. [34] proposed a deep convolutional
network that integrates the shallow and deep semantic features separately and recurrently refines
them in a cross-layer manner. Residual learning and refining modules were utilized by Tang et al.
[33, 35] for error correction and update. Li et al. [16] train a dual-branch network in which attention
maps generated from each branch are imposed on the other branch for joint focus
and defocus detection. Zhao et al. [43] proposed an adaptive ensemble network (AENet) and
an encoder-feature ensemble network (EFENet) to generate diverse results through multiple
detectors and diverse features with a single detector, respectively. A DBD mask was used in Ref-
erence [44] to create composite full clear and full blurred images that are fed to dual adversarial
discriminators (DAD), which in turn implicitly forces the deep generator network to improve
output accuracy. DAD tries to avert the possibility of a full or an empty DBD mask that may be
generated due to GAN instability in an unsupervised learning scenario. Unlike Reference [44], our
discriminators directly leverage the desired target distribution and therefore are better equipped
at addressing the homogeneous region-based uncertainties that may be triggered in Reference
[44] by the unsupervised setting. Zhai et al. [41] proposed a hierarchically residual feature refine-
ment network (HRFRNet) that refines a coarse DBD map in a top-down manner. Guo et al. [8]


designed an end-to-end convolutional neural network (HANUN) with channel attention and small
U-shaped networks embedded into the decoders for blur detection.
In contrast to the existing works, we formulate the blur detection problem using knowledge
distillation along with adversarial learning framework with an aim to create high-performance
models with fewer parameters and high inference speed.

3 PROPOSED METHODOLOGY
In this work, we design a lightweight blur detection student network that is trained in an adversar-
ial manner and produces comparable results to the state-of-the-art DCNN-based DBD methods.

3.1 Problem Formulation


The deep learning-based DBD problem is to learn a mapping function $\Phi(\cdot\,;\theta)$ that maps an observed defocus-blurred image $x$ to an output blur detection mask $\hat{y} \in [0, 1]$, i.e., $\Phi : \mathbb{R}^{C_i \times W \times H} \rightarrow \mathbb{R}^{1 \times W \times H}$.
Here $\theta$ and $C_i$ represent the weight parameters of the generative model and the input channel dimension, respectively, and $W$ and $H$ stand for the width and height of the input.
Mathematically, the problem can be written as $\hat{y} = \Phi(x; \theta)$. Given a set of $N$ training samples
$X = \{x_1, x_2, \ldots, x_N\}$, we learn the mapping function $\Phi(x_n; \theta)$, where $x_n$ is the $n$-th training sample,
such that it yields a set of DBD masks $\hat{Y} = \{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_N\}$. The optimal parameter estimates are
obtained by minimizing an objective function specific to the defocus blur detection task.

3.2 Knowledge Distillation


The overwhelming computational burden associated with deep networks for the task of DBD se-
verely limits the applicability of the algorithm in the real world [4, 43, 46, 47]. To address this limi-
tation associated with complex networks, we therefore focus on formulating the DBD problem in
the space of KD to develop a lightweight model. In this work, we have proposed a deep teacher
model (T ) that has achieved superior performance compared to the SOTA algorithms. However,
the current SOTA models and the proposed teacher model are not computation friendly. Conse-
quently, we have integrated the concept of KD for efficient training and inference. During the
training phase of our student model (S), we model knowledge transfer from teacher T to student S
at three different levels: (1) global consistency constraint, (2) feature constraint, and (3) adversarial
learning. The final student model S, trained using these constraints, performs at par with SOTA
methods with real-time performance.

3.3 Adversarial Learning for Defocus Blur Detection


Using adversarial learning [7, 20], we aim at optimizing our T and S networks along with two other
critics D 1 and D 2 , one for each of the two networks T and S, respectively, to enforce higher-order
consistencies between the learned defocus blur distribution and the true distribution via perceptual
modeling. Additionally, a third critic D 3 is employed for network S to adversarially distill knowl-
edge of the output distribution of the high-capacity T model to an S model. For the blur detection
task, we have adopted the relatively stable Least Squares Generative Adversarial Network
(LSGAN) [20], which leverages a least squares loss rather than the conventional cross-entropy loss.
It is to be noted that naively embedding existing blur detection models in an adversarial setting
may lead to divergence, mode collapse, and network instability due to the high sensitivity of GAN
training to hyperparameters and network designs, and a small change can easily lead to failure.
We attempt to resolve the differences by building a novel GAN architecture for blur detection.
3.3.1 GAN Training Scheme and Stability. We have attempted to build a GAN architecture for
blur detection from scratch by alternately training G 1 and D 1 of T network and G 2 and (D 2 , D 3 )


of the S network. In line with previous works [13, 22], we used batch normalization in the gener-
ators and spectral normalization in the discriminators for stable training. Spectral normalization
was introduced to stabilize GAN training [17] and was shown to outperform other regularization
techniques. In addition, the use of LeakyReLU instead of ReLU in the discriminators facilitates
a stronger backward flow of gradients for negative values from the discriminators to the corre-
sponding generators [26]. Our GAN training scheme with a two-timescale update rule [9] could
successfully avoid mode collapse using only a 1:1 balanced update interval between the generators
and the corresponding discriminators. In general, the optimization problem solved by our LSGAN
can be formulated as follows:
$$\min_{D_1} V_{\mathrm{LSGAN}}(D_1) = \tfrac{1}{2}\,\mathbb{E}_{r \sim p_{\mathrm{data}}(r)}\big[(D_1(r)-1)^2\big] + \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z(z)}\big[(D_1(G_1(z)))^2\big],$$
$$\min_{G_1} V_{\mathrm{LSGAN}}(G_1) = \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z(z)}\big[(D_1(G_1(z))-1)^2\big], \qquad (1)$$

$$\min_{D_2, D_3} V_{\mathrm{LSGAN}}(D_2, D_3) = \tfrac{1}{2}\,\mathbb{E}_{r \sim p_{\mathrm{data}}(r)}\big[(D_2(r)-1)^2\big] + \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z(z)}\big[(D_2(G_2(z)))^2\big] + \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z(z)}\big[(D_3(G_1(z))-1)^2\big] + \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z(z)}\big[(D_3(G_2(z)))^2\big],$$
$$\min_{G_2} V_{\mathrm{LSGAN}}(G_2) = \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z(z)}\big[(D_2(G_2(z))-1)^2\big] + \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z(z)}\big[(D_3(G_2(z))-1)^2\big]. \qquad (2)$$
Here $r$ and $z$ represent the real data variable and the generator input variable, respectively. Similarly, $p_{\mathrm{data}}(r)$ and $p_z(z)$ denote the real data distribution and the input distribution of the generator,
respectively. The goal is to seek generators that generate samples as close as possible to the real
data distribution $p_{\mathrm{data}}(r)$ from the given input distribution $p_z(z)$. Please refer to Sections 3.5.1
and 3.5.2 for details on the training objectives of the generators. The discriminators are discarded
during testing, and only the generators are employed for defocus blur detection. The training ob-
jectives of the discriminators are provided in Appendix A.
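
To make the alternating optimization concrete, the following minimal PyTorch sketch illustrates one update of the teacher pair (G1, D1) under Equations (1), (3), and (4). It assumes that G1 outputs a sigmoid blur map and that D1 accepts the blur map concatenated with the RGB input; the function, module, and optimizer names are illustrative and not part of the released code.

```python
import torch
import torch.nn.functional as F


def lsgan_d_loss(d_real, d_fake):
    # Least-squares discriminator loss (Eqs. (1)-(2)): real targets 1, fake targets 0.
    return 0.5 * ((d_real - 1).pow(2).mean() + d_fake.pow(2).mean())


def lsgan_g_loss(d_fake):
    # Least-squares generator loss: push the discriminator output toward 1.
    return 0.5 * (d_fake - 1).pow(2).mean()


def teacher_gan_step(G1, D1, opt_g1, opt_d1, x, y, lambda_adv=0.1):
    """One alternating update of the teacher pair (G1, D1).

    D1 is conditioned on the input image by channel-wise concatenation,
    mirroring the concatenation in Eq. (4). Names are illustrative.
    """
    # Discriminator update (generator frozen).
    with torch.no_grad():
        y_fake = G1(x)
    d_real = D1(torch.cat([y, x], dim=1))
    d_fake = D1(torch.cat([y_fake, x], dim=1))
    loss_d = lsgan_d_loss(d_real, d_fake)
    opt_d1.zero_grad()
    loss_d.backward()
    opt_d1.step()

    # Generator update: content loss (BCE, Eq. (3)) + adversarial term (Eq. (4)).
    y_hat = G1(x)
    loss_con = F.binary_cross_entropy(y_hat, y)
    loss_adv = lsgan_g_loss(D1(torch.cat([y_hat, x], dim=1)))
    loss_g = loss_con + lambda_adv * loss_adv
    opt_g1.zero_grad()
    loss_g.backward()
    opt_g1.step()
    return loss_d.item(), loss_g.item()
```

The 1:1 balanced update interval mentioned in Section 3.3.1 corresponds to calling this step once per mini-batch for both networks.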

3.4 Network Architectures


3.4.1 Teacher Network. The proposed architecture of the teacher network is shown in Figure 3.
We initialize the convolutional layers of the encoder network of the generator G 1 with the SE-
ResNeXt-101 [10] weights pre-trained on ImageNet [5] dataset for object recognition. We remove
the fully connected layer and the last pooling layer in the encoder module. Max pooling layers with
window size of 2 × 2 and stride of two pixels are employed to reduce the size of the input resolution
to half. The integration of multi-scale features is desirable to tackle the image scale sensitivity is-
sue associated with the DBD task. In recent years, several remarkable methods [45, 46] have taken
into account the influence of image scale on defocus blur detection and thereby proffered possible
solutions with extended capacity, including multi-scale input image pyramid [46], multi-scale loss
pyramid [37], and multi-model pyramid. Even though these classical strategies of resizing input
images into multiple scales or multi-pathway image propagation techniques have been success-
ful to a fairly good extent, they are resource hungry and are associated with high computational
complexity owing to their large number of parameters during both training and inference phases.
To encode multi-scale contextual information in the blur detection network without significantly
increasing the training computation cost, we incorporate the denseASPP (DASPP) [38] module
in the bottleneck layer. DASPP was implemented in Reference [38], where dense connections were
introduced into a cascade of atrous convolutional layers to compose a dense multi-scale feature
pyramid. The sequence of concatenating blur features from atrous convolutions with increasing


Fig. 3. The proposed architecture of the teacher network for defocus blur detection. The following notations
are used: C(d) = Dilated Conv2D with atrous rate d, C = Conv2D, T = Transposed Conv2D, SN = Spectral
Norm, BN = BatchNorm, LR = Leaky ReLU, and R = ReLU. We deploy pre-trained SE-ResNeXt-101 [10] as the
backbone of the teacher network. The backbone encoder of the blur detection map generator extracts the
feature representations from the given input blurry image x. The extracted features are further enhanced by
using DenseASPP [38] and self-attention modules at the intermediate layer. A decoder decodes these latent
high-level feature maps into pixelwise blur detection map ŷ (t ) . A discriminator distinguishes the generator
prediction ŷ (t ) from the ground truth y. Channel dimensions are shown in each block of the encoder, decoder,
and the discriminator.

dilation rates (sparse sampling rate) leads to an exceedingly large field of view that densely ensem-
bles the diverse multi-scale structural information. Moreover, capturing the long-range non-local
dependencies of a given blur pixel assists in the effective coordination of the pixel’s spatial infor-
mation to that of all the similar pixels within the entire image. We thereby propose to use a parallel
intra-attention or SA module at the bottleneck layer to fully utilize the non-local structural corre-
lations in the high-level blur features. The key decoder part consists of a combination of vanilla
convolutions, batch normalization, and upsampling layers. Transposed convolutional layers are
used to gradually upsample the spatial resolution of the extracted features until the higher input
resolution is attained. To further improve the localization accuracies/detection of small regions
and to avoid the gradient vanishing problem, we use skip connections [27] between the encoder
and the decoder modules. Hence, the decoder module recovers boundaries by integrating low-level
information from the encoder with high-level features from the decoder.
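
The text does not spell out the exact formulation of the SA module; the sketch below shows one plausible instantiation, a standard SAGAN-style self-attention block applied to the bottleneck features, which matches the stated goal of modeling non-local pairwise correlations among spatial positions. The reduction factor and residual weighting are assumptions.

```python
import torch
import torch.nn as nn


class SelfAttention2d(nn.Module):
    """SAGAN-style self-attention over spatial positions (one plausible
    instantiation of the intra-attention module at the bottleneck)."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // reduction, 1)
        self.key = nn.Conv2d(channels, channels // reduction, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight, blended in gradually

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (B, HW, C/r)
        k = self.key(x).flatten(2)                     # (B, C/r, HW)
        attn = torch.softmax(q @ k, dim=-1)            # pairwise affinities (B, HW, HW)
        v = self.value(x).flatten(2)                   # (B, C, HW)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x                    # residual connection to the input features
```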
The second module in the proposed teacher framework is the discriminator network D 1 . It con-
sists of seven convolutional blocks, each comprising a vanilla convolution of kernel size 4 × 4 and
stride 2, followed by spectral normalization [22]. LeakyReLU is used as an activation function in
all the convolutional blocks of D 1 . The number of output channels of the consecutive convolution
layers are 64, 128, 256, 256, 256, 256, and 256, respectively.
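
A direct PyTorch rendering of this description is given below. The four input channels assume that the blur map is concatenated with the RGB image (as in Eq. (4)), and the final 1-channel projection head is an assumption, since the text does not specify how the patch scores are produced.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm


def make_discriminator(in_channels=4):
    """Sketch of D1: seven 4x4/stride-2 convolution blocks with spectral
    normalization and LeakyReLU, channel widths 64-128-256-...-256."""
    channels = [64, 128, 256, 256, 256, 256, 256]
    layers, prev = [], in_channels
    for ch in channels:
        layers += [
            spectral_norm(nn.Conv2d(prev, ch, kernel_size=4, stride=2, padding=1)),
            nn.LeakyReLU(0.2, inplace=True),
        ]
        prev = ch
    # Assumed final projection to patch-level real/fake scores (not specified in the text).
    layers.append(spectral_norm(nn.Conv2d(prev, 1, kernel_size=4, stride=1, padding=1)))
    return nn.Sequential(*layers)
```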
3.4.2 Student Network. For the generator G 2 of the student network, we design a lightweight
encoder-decoder architecture for blur detection. We utilize EfficientNetB3 [32], pre-trained on Ima-
geNet data [5], to initialize the weights of the encoder structure of the generator G 2 . EfficientNetB3
belongs to a family of recently introduced EfficientNets [32] that are fast, light networks and have
established benchmarks in image classification. The term B3 indicates the size of the network that


scales up for every increment in the series from B0 to B7 and the corresponding increase in process-
ing power also leads to higher accuracies. In the ImageNet classification problem, EfficientNet-B3
exhibits top-1 accuracy of 81.6% with 12M parameters compared to 77.1% top-1 accuracy obtained
by EfficientNet-B0 with 5.3M parameters and 84.3% top-1 accuracy scored by EfficientNet-B7 with
66M parameters. We chose EfficientNet-B3 to achieve a tradeoff between network complexity and
performance. Mobile inverted bottleneck convolution [28] forms the basic building block of the
network. Our UNet decoder consists of five transposed convolution layers for upsampling, each
followed by two convolution blocks. The transposed convolution layers in the decoder use 2 ×
2 kernels with stride 2. The subsequent convolution blocks use 3 × 3 kernels with stride 1. A
sigmoid function is placed at the last layer to output blur detection score.
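
A sketch of one decoder stage consistent with this description is shown below; the BatchNorm/ReLU placement, the skip-connection fusion, and the channel counts are assumptions, since the text only fixes the kernel sizes and strides.

```python
import torch
import torch.nn as nn


class DecoderBlock(nn.Module):
    """One of the five upsampling stages of the student decoder: a 2x2/stride-2
    transposed convolution followed by two 3x3 convolution blocks."""

    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x, skip=None):
        x = self.up(x)
        if skip is not None:
            x = torch.cat([x, skip], dim=1)  # UNet-style fusion with encoder features
        return self.conv(x)


# Final 1x1 projection with a sigmoid produces the per-pixel blur score
# (the input channel count here is illustrative).
head = nn.Sequential(nn.Conv2d(16, 1, kernel_size=1), nn.Sigmoid())
```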
The architectures of the discriminator networks D 2 and D 3 of the student network are similar
to that of D 1 of the teacher network (see Section 3.4.1).
We capitalize on the knowledge of the teacher network to learn the student network by matching
the output distributions and the intermediate features of the teacher and the student networks that
is formulated through a knowledge distillation loss term mentioned in Section 3.5.

3.5 Objective Function


Our objective is to learn the parameters of the representation functions $G_1$ and $G_2$ that optimally
approximate the input-target dependency according to the loss functions $\mathcal{L}^{(t)}$ and $\mathcal{L}^{(s)}$, respectively (described below).
Note that the superscripts $t$ and $s$ correspond to the teacher T and the student S networks, respectively.
The parameters of the proposed models are obtained by optimizing $\mathcal{L}^{(t)}$ and $\mathcal{L}^{(s)}$, each containing both a content loss and an adversarial loss.
The objective function $\mathcal{L}^{(s)}$ for the student network S contains an additional knowledge distillation term $\mathcal{L}_{KD}$.
For a given spatially varying blurred image $x \in \mathbb{R}^{3 \times W \times H}$, where $W \times H$ is the spatial dimension, its ground truth (GT) blur map is denoted as $y \in \mathbb{R}^{1 \times W \times H}$.
3.5.1 Content Loss. The content loss is computed between the generated blur map $\hat{y}^{(n)} = G_l(x)$
and the ground truth blur map $y$. Here, $l = 1$ for $n = t$, and $l = 2$ for $n = s$. For the blur detection
task, we use the binary cross-entropy loss as our content loss, which is defined as
$$\mathcal{L}_{con}^{(n)} = -\frac{1}{W \times H}\sum_{j=1}^{W \times H}\Big[\, y_j \log \hat{y}_j^{(n)} + (1 - y_j)\log\big(1 - \hat{y}_j^{(n)}\big) \Big], \qquad (3)$$

where j is the pixel index of the image.


3.5.2 Adversarial Loss. The least-squares adversarial loss of the generator is computed from the
output of the discriminator network $D_k$ (with $k = 1$ for $n = t$ and $k = 2$ for $n = s$):
$$\mathcal{L}_{\mathrm{LSGAN}}^{(n)} = \frac{1}{2(W \times H)}\sum_{j=1}^{W \times H}\Big(D_k\big(\hat{y}_j^{(n)} \oplus x_j\big) - 1\Big)^2, \qquad (4)$$

where $\oplus$ denotes the concatenation operator.


3.5.3 Knowledge Distillation Loss. The knowledge distillation loss $\mathcal{L}_{KD}$ for the generator $G_2$
of the student network consists of a consistency term $\mathcal{L}_{cons}$, a feature affinity term $\mathcal{L}_{fa}$, and an
adversarial distillation term $\mathcal{L}_{KD\text{-}LSGAN}$, which are discussed below:
$$\mathcal{L}_{KD} = \alpha_1 \mathcal{L}_{cons} + \alpha_2 \mathcal{L}_{fa} + \alpha_3 \mathcal{L}_{KD\text{-}LSGAN}, \qquad (5)$$
where $\alpha_1$, $\alpha_2$, and $\alpha_3$ are the weights for adjusting the penalty terms of the distillation loss.


Consistency Loss: The consistency loss $\mathcal{L}_{cons}$ ensures consistency between the outputs of the T
and S networks. It is defined as follows:
$$\mathcal{L}_{cons} = \frac{1}{W \times H}\sum_{j=1}^{W \times H}\Big\|\, \hat{y}_j^{(t)} - \hat{y}_j^{(s)} \,\Big\|_1. \qquad (6)$$

Feature Affinity Loss: Our aim is to effectively transfer the short- and long-range correlations among
spatial locations of the rich feature maps $F^{(t)} \in \mathbb{R}^{C \times W_f \times H_f}$ of the teacher network to the feature
maps $F^{(s)} \in \mathbb{R}^{C' \times W_f \times H_f}$ of the student network. Here $C$ and $C'$ are the feature depths, and $W_f \times H_f$ is
the spatial dimension. Note that $C$ does not necessarily equal $C'$, as the dimensions of the computed
adjacency matrices do not depend on the feature channel dimensions. The feature maps $F^{(t)}$ and
$F^{(s)}$ are obtained from the mid-level layers of both networks. We build affinity graphs such
that the affinity functions defined on the nodes of the graph encode pairwise affinities between
the nodes and form the edges of the graph. First, we apply a max pooling operation on $F^{(t)}$ and
$F^{(s)}$ (resized by bilinear interpolation to match spatial dimensions) to obtain feature maps
$\check{F}^{(t)} \in \mathbb{R}^{C \times W_f' \times H_f'}$ and $\check{F}^{(s)} \in \mathbb{R}^{C' \times W_f' \times H_f'}$ ($\check{F}^{(n)} = \{\check{f}_{cj}^{(n)}\}$, $\forall c \in C$ or $C'$; $W_f' < W_f$, $H_f' < H_f$) comprising the
most activated pixels. The activations at each spatial position are independently normalized across
the channels to capture the structural information [15], resulting in feature maps $\tilde{F}^{(n)} = \{\tilde{f}_j^{(n)}\}$, where
$$\tilde{f}_j^{(n)} = \frac{\check{f}_j^{(n)}}{\sqrt{\sum_{c}\big(\check{f}_{cj}^{(n)}\big)^2} + \varepsilon}, \qquad (7)$$
and $\varepsilon$ is a small stability constant (e.g., $\varepsilon = 10^{-6}$) to avoid division by zero. When $n = t$, $c \in C$;
otherwise $c \in C'$.
If we assume that there are $W_f' \cdot H_f'$ entities, then we can define graph adjacency matrices $A^{(t)}$ and
$A^{(s)}$ ($A^{(n)} = \{a_{jk}^{(n)}\}$) for the affinity graphs, where the affinity weight $a_{jk}^{(n)}$ between the pair of
entities $j$ and $k$ is computed using an affinity function $O$ as
$$a_{jk}^{(n)} = O\big(\tilde{f}_j^{(n)}, \tilde{f}_k^{(n)}\big). \qquad (8)$$
The affinity function $O$ is formulated as
$$O(u, v) = \frac{u^{\top} v}{\|u\|_2\, \|v\|_2}. \qquad (9)$$
The feature vectors $\tilde{f}^{(n)}$ and the affinity weights $a^{(n)}$ constitute the nodes and the edges of the
affinity graph $\mathcal{G}^{(n)} = \langle \tilde{f}^{(n)}, a^{(n)} \rangle$, respectively. Aligning affinity graphs at different network depths
may improve detection accuracy but also incurs extra computational cost during training. Hence,
we apply feature affinity distillation at only one layer in our method.
The feature affinity distillation loss is given as follows:
$$\mathcal{L}_{fa} = \frac{1}{\big(W_f' \times H_f'\big)^2}\sum_{j=1}^{W_f' \times H_f'}\sum_{k=1}^{W_f' \times H_f'}\Big( a_{jk}^{(t)} - a_{jk}^{(s)} \Big)^2. \qquad (10)$$
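
The feature-affinity term of Equations (7)–(10) can be sketched as follows for batched PyTorch feature maps; the max-pooling window is an assumed hyperparameter, and the function name `feature_affinity_loss` is illustrative.

```python
import torch
import torch.nn.functional as F


def feature_affinity_loss(f_t, f_s, pool=4, eps=1e-6):
    """Sketch of the feature-affinity distillation loss (Eqs. (7)-(10)).

    f_t: teacher features (B, C, Hf, Wf); f_s: student features (B, C', Hs, Ws).
    Spatial sizes are matched by bilinear resizing, max pooling keeps the most
    activated responses, and pairwise cosine affinities are compared with a
    squared-error penalty.
    """
    # Match the spatial resolution of the student features to the teacher's.
    f_s = F.interpolate(f_s, size=f_t.shape[-2:], mode="bilinear", align_corners=False)

    def affinity(f):
        f = F.max_pool2d(f, kernel_size=pool)            # keep the most activated pixels
        f = f.flatten(2)                                 # (B, C, H'W')
        f = f / (f.pow(2).sum(dim=1, keepdim=True).sqrt() + eps)  # Eq. (7): L2-normalize over channels
        return f.transpose(1, 2) @ f                     # (B, H'W', H'W') cosine affinities, Eqs. (8)-(9)

    a_t, a_s = affinity(f_t), affinity(f_s)
    return (a_t - a_s).pow(2).mean()                     # Eq. (10): averaged over all entity pairs
```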

Adversarial Distillation Loss: Mathematically, the adversarial distillation loss is defined as
$$\mathcal{L}_{KD\text{-}LSGAN} = \frac{1}{2(W \times H)}\sum_{j=1}^{W \times H}\Big(D_3\big(\hat{y}_j^{(s)}\big) - 1\Big)^2. \qquad (11)$$


3.5.4 Joint Loss. The overall objective functions for the proposed blur detection algorithm are
formulated as follows:

Teacher Network:
$$\mathcal{L}^{(t)} = \lambda_1 \mathcal{L}_{con}^{(t)} + \lambda_2 \mathcal{L}_{\mathrm{LSGAN}}^{(t)}. \qquad (12)$$

Student Network:
$$\mathcal{L}^{(s)} = \gamma_1 \mathcal{L}_{con}^{(s)} + \gamma_2 \mathcal{L}_{\mathrm{LSGAN}}^{(s)} + \gamma_3 \mathcal{L}_{KD}, \qquad (13)$$
where $\lambda_1$, $\lambda_2$, $\gamma_1$, $\gamma_2$, and $\gamma_3$ are the weighting parameters for the loss terms.
The sensitivity analysis of the adversarial hyperparameters is provided in Appendix B.
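
Putting Equations (3)–(6), (10), (11), and (13) together, a minimal sketch of one student generator update is given below. The teacher's blur map and mid-level features are assumed to be precomputed under `torch.no_grad()`, the discriminator updates are omitted, `fa_loss_fn` stands for the feature-affinity sketch above, and G2 is assumed to expose its mid-level features alongside the blur map; all names are illustrative.

```python
import torch
import torch.nn.functional as F


def student_step(G2, D2, D3, x, y, teacher_out, teacher_feat, opt_g2,
                 fa_loss_fn, gammas=(1.0, 0.1, 1.0), alphas=(1.0, 0.1, 0.1)):
    """One generator update for the student, following Eq. (13)."""
    y_hat, student_feat = G2(x)                               # assumed to return (blur map, mid-level features)

    # Content loss, Eq. (3).
    l_con = F.binary_cross_entropy(y_hat, y)

    # Adversarial loss w.r.t. the conditional discriminator D2, Eq. (4).
    l_adv = 0.5 * (D2(torch.cat([y_hat, x], dim=1)) - 1).pow(2).mean()

    # Knowledge distillation terms, Eq. (5).
    l_cons = (y_hat - teacher_out).abs().mean()               # Eq. (6)
    l_fa = fa_loss_fn(teacher_feat, student_feat)             # Eq. (10)
    l_kd_adv = 0.5 * (D3(y_hat) - 1).pow(2).mean()            # Eq. (11)
    l_kd = alphas[0] * l_cons + alphas[1] * l_fa + alphas[2] * l_kd_adv

    # Joint student objective, Eq. (13).
    loss = gammas[0] * l_con + gammas[1] * l_adv + gammas[2] * l_kd
    opt_g2.zero_grad()
    loss.backward()
    opt_g2.step()
    return loss.item()
```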

4 IMPLEMENTATION DETAILS
We have implemented the entire pipeline on a machine containing an Nvidia Tesla K80 GPU with a
mini-batch size of 4. The two decoder networks and the three discriminator networks are initialized
by sampling from a zero-mean normal distribution with a standard deviation of 0.2. We set all the
biases to 0. We use ADAM optimizers [14] with initial learning rates of 0.0002, 0.0001, 0.0005,
0.00005, and 0.00005 corresponding to G 1 , D 1 , G 2 , D 2 , and D 3 , respectively. In Equation (5), α 1 , α 2
and α 3 are set to 1.0, 0.1, and 0.1, respectively. The parameters λ 1 and λ 2 in Equation (12) are set to
1.0 and 0.1, respectively. In Equation (13), γ 1 , γ 2 , and γ 3 are set to 1.0, 0.1, and 1.0, respectively. We
resize all the images to 320 × 320 during training and evaluation, similar to the previous algorithms.
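
For reference, the optimizer setup and loss weights quoted above can be collected as in the following sketch; the model objects are assumed to be constructed elsewhere.

```python
import torch


def build_optimizers(G1, D1, G2, D2, D3):
    """ADAM optimizers with the initial learning rates quoted above."""
    return {
        "G1": torch.optim.Adam(G1.parameters(), lr=2e-4),
        "D1": torch.optim.Adam(D1.parameters(), lr=1e-4),
        "G2": torch.optim.Adam(G2.parameters(), lr=5e-4),
        "D2": torch.optim.Adam(D2.parameters(), lr=5e-5),
        "D3": torch.optim.Adam(D3.parameters(), lr=5e-5),
    }


# Loss weights from Equations (5), (12), and (13).
ALPHAS = {"cons": 1.0, "fa": 0.1, "kd_lsgan": 0.1}
LAMBDAS = {"con": 1.0, "lsgan": 0.1}
GAMMAS = {"con": 1.0, "lsgan": 0.1, "kd": 1.0}
```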

5 EXPERIMENTAL RESULTS
5.1 Datasets
We evaluate the proposed framework on publicly available datasets, namely CUHK [29], DUT [45],
and SZU-BD datasets [31].
5.1.1 CUHK Dataset [29]. The database proposed in Reference [29] is a publicly available
dataset consisting of both out-of-focus and motion blurred images for benchmarking the perfor-
mance of blur segmentation algorithms. It contains 1,000 partially blurred images, of which 704
images contain out-of-focus blur, and the remaining 296 images are motion blurred. The ground
truth binary blur maps corresponding to all 1,000 images are labeled by humans. To provide a fair
comparison with the SOTA methods [4, 45–47], we used the same training–testing split as in the
above methods. The training set consists of 604 blurred images, and the remaining 100 images are
used for evaluation.
5.1.2 DUT Dataset [45]. Recently, the authors in Reference [45] proposed a dataset for eval-
uating the performance of contemporary blur segmentation methods. It consists of 500 partially
blurred images containing defocus blur. The dataset comes with ground truth binary blur maps
corresponding to all the 500 blurred images. Since the authors in References [4, 45, 47] had not
utilized this dataset for training, we have used the DUT dataset for evaluation purposes only. Al-
though the authors in Reference [46] proposed DUT training dataset consisting of 600 training
images, we have not considered DUT data for training our model.
5.1.3 SZU-BD Dataset [31]. The SZU-BD dataset is a relatively new benchmark dataset proposed
in Reference [31] and consists of 784 blurred images. The dataset also contains the
corresponding pixelwise annotated ground truth blur masks. Of these 784 blurred images, 709
contain defocus blur, while the remaining 75 contain motion blur only. Some of the images
were collected from the DUT blur dataset [45] and the MSRA10K salient object detection dataset [3]. The

Table 1. Quantitative Comparisons with 14 DBD Methods on CUHK [29] and DUT [45] Datasets in
Terms of F-measure ↑ and MAE ↓

Dataset  Metric     DBDF [29]  SS [36]  KSFV [24]  DHCF [25]  HiFST [6]  LBP [39]  BTBNet [45]  CRLNet [46]  CENet [47]
CUHK     F-measure  0.548      0.649    0.534      0.202      0.553      0.681     0.808        0.871        0.906
CUHK     MAE        0.309      0.259    0.301      0.498      0.221      0.183     0.106        0.083        0.061
DUT      F-measure  0.497      0.629    0.576      0.252      0.503      0.687     0.701        0.804        0.816
DUT      MAE        0.383      0.292    0.276      0.510      0.249      0.191     0.193        0.141        0.137

Dataset  Metric     DD [4]  HRFRNet [41]  DAD [44]  HANUN [8]  AENet [43]  EFENet [43]  Ours (T)  Ours (S)
CUHK     F-measure  0.879   0.921         0.884     0.970      0.910       0.914        0.926     0.918
CUHK     MAE        0.057   0.114         0.079     0.036      0.056       0.053        0.041     0.044
DUT      F-measure  0.828   0.945         0.794     0.920      0.831       0.854        0.901     0.894
DUT      MAE        0.107   0.103         0.153     0.107      0.114       0.094        0.068     0.071

Best in bold. Second best is underlined.

SZU-BD dataset was created mainly for testing purposes, with the image resolutions varying from
275 × 218 pixels to 500 × 468 pixels.

5.2 Evaluation Metrics


We use standard quantitative metrics, namely precision–recall (PR) curves, F-measure, mean
absolute error (MAE), and time in milliseconds (ms) or seconds (s), to evaluate the proposed T and
S models for defocus blur detection. The precision and recall values are obtained by thresholding
the detected blur maps (scaled between 0 and 255 and thresholded only for the computation of
precision and recall) and then comparing them with the ground truth maps. Note that here "threshold"
refers to the general threshold value used for the computation of the PR curve and does not indicate
any post-processing operation on the detected blur maps. Precision and recall are calculated as
$$\mathrm{Precision} = \frac{|B \cap G|}{|B|}, \qquad \mathrm{Recall} = \frac{|B \cap G|}{|G|},$$
where $|\cdot|$ counts the non-zero entries in a mask and $B$ and $G$ are the predicted and the ground truth blur maps, respectively.
PR curves demonstrate the relationship between precision and recall values over a dataset at various thresholds.
To assess the quality of blur detection, a combined F-measure metric is used, defined as
$$F\text{-}measure = \frac{(1 + \beta^2) \times \mathrm{Precision} \times \mathrm{Recall}}{\beta^2 \times \mathrm{Precision} + \mathrm{Recall}}.$$
As suggested by many previous works, we set $\beta^2 = 0.3$ to give more importance to precision. We show the F-measure curves over each dataset.
Given the blur map $B$ and the ground truth mask $G$ of spatial resolution $P \times Q$, the MAE is calculated as
$$\mathrm{MAE} = \frac{1}{P \times Q}\sum_{p \in P}\sum_{q \in Q}\big|B(p, q) - G(p, q)\big|.$$
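
A minimal NumPy implementation of these metrics for a single image is sketched below; a fixed threshold is used for brevity, whereas the reported PR curves sweep the threshold over the full range.

```python
import numpy as np


def dbd_metrics(pred, gt, beta2=0.3, threshold=0.5):
    """Precision, recall, F-measure (beta^2 = 0.3), and MAE for one blur map.

    pred: predicted blur map in [0, 1]; gt: binary ground-truth mask.
    """
    mae = np.abs(pred - gt).mean()
    b = (pred >= threshold).astype(np.float64)
    g = (gt >= 0.5).astype(np.float64)
    tp = (b * g).sum()                                   # |B ∩ G|
    precision = tp / (b.sum() + 1e-8)                    # |B ∩ G| / |B|
    recall = tp / (g.sum() + 1e-8)                       # |B ∩ G| / |G|
    f_measure = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
    return precision, recall, f_measure, mae
```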

5.3 Comparisons with the State-of-the-Art Methods


We compare our method with the handcrafted feature-based methods, including DBDF [29], KSFV
[24], spectral and spatial approach (SS) [36], LBP [39], and HiFST [6]. We use the original im-
plementation of these methods with recommended parameters. We have also compared with the
deep learning-based models such as BTBNet [45], CRLNet [46], CENet [47], DD [4], HRFRNet [41],
DAD (supervised) [44], HANUN [8], AENet [43], and EFENet [43] for CUHK [29] and DUT [45]
datasets. We provide visual comparisons for CUHK [29] and DUT [45] datasets in Figures 4 and 5,
respectively, with methods whose output results or codes along with pre-trained model weights
are publicly accessible. The proposed method performs well in various challenging cases (e.g., ho-
mogeneous regions, low-contrast in-focus regions, and cluttered background), yielding DBD maps
closest to the ground truth maps. Quantitative comparisons are provided in Table 1 with methods
whose output results or codes are publicly accessible or whose quantitative results are being re-
ported in their papers. We show the quantitative results in Table 1 using average F-measure ↑ and
average MAE ↓. ↑ means the higher the better, and ↓ denotes the lower the better. As can be seen
from Table 1, our teacher and student models achieve better performances in terms of F-measure
and MAE compared to most of the previous techniques [4, 43, 44, 46, 47] on both the datasets.

HRFRNet [41] achieves the best F-measure score of 0.945 on DUT dataset [45], but the large hier-
archical refinement network leads to a considerably larger model parameter size and thus limits
its applicability. Our proposed T and S models achieve F-measure scores closer to the second best
method of HANUN [8] on the DUT dataset [45]. Also, our method achieves the best MAE values
on the DUT dataset [45]. HANUN [8] achieves the best performance on the CUHK dataset [29]
on both the metrics, while our proposed T model achieves the second-best performance. The pro-
posed S model produces comparable results with fewer parameters. Beyond the number of model
parameters, considering the computational cost of the multiple nested attention blocks and the
nested U-shaped networks embedded in the decoders of Reference [8], the method inevitably
requires several computationally intensive operations across the network.
We also present the visual comparisons with defocus blur detection methods
[4, 6, 18, 24, 31, 36, 39, 43, 44] on the SZU-BD dataset [31] in Figure 6. Our method provides satis-
factory results in Figure 6 (k) and (l) for complex scenes with cluttered backgrounds (second row)
or scenes where the difference between the focused and blurred regions is not very pronounced
(fourth row). Our method also produces more precise boundaries (sixth row). We provide the quan-
titative comparisons in Table 2 for only the 709 defocused images in the SZU-BD dataset [31]. DD
[4] achieves the best F-measure and MAE values even though our method produces more visually
plausible results, as can be seen from the examples in Figure 6 (sixth and eighth rows). This can
be attributed to the sample types present in the dataset (e.g., Figure 6 (last row)), where the results
of DD are closer to the annotated ground truths. Our method yields the second-best scores. Note
that the results reported in Figures 4, 5, and 6, and Tables 1 and 2 are obtained directly from the
proposed network output. No postprocessing or refinement steps have been used for the results
shown in the above mentioned figures and tables. The overall good performance of our models is
in general due to the collective effective factors such as architectural choices, training strategies
that include knowledge distillation and adversarial learning schemes (described in detail in the
ablation study). We also provide comparison of the inference time in second (s) and model param-
eters in Tables 3 and 4, respectively. Our proposed S model improves the speed to ∼1.5× the speed
of the second-fastest model [44]. Also, our proposed student model has ∼4×, ∼1.6×, ∼11×, ∼1.4×
and ∼2.5× fewer parameters than SOTA methods DD [4], DAD [44], HRFRNet [41], HANUN [8],
and EFENet [43], respectively. In Tables 3 and 4, we report the inference time and model parameter
count for methods whose codes along with pre-trained model weights are publicly accessible or
whose computation speed/parameter values are reported in their papers. PR curves and F-measure
values of the SOTA algorithms and the proposed T and S models are shown in Figure 7. From the
PR curves and F-measure values in Figure 7, one can see that the proposed T and S models achieved
comparable performance over all the three datasets on all the evaluation metrics.

5.4 Ablation Study


5.4.1 Effectiveness of Adversarial Learning. The ablation study on adversarial learning frame-
work, shown in Table 5 (left: second and third rows, right: fifth and sixth rows), validates the
significance of adversarial training that provides stronger supervision for DBD. Adversarial learn-
ing boosts MAE performance of T net and S net by 4.65% and 13.63% on CUHK dataset and 8.1%
and 15.01% on DUT dataset, respectively.

5.4.2 Effectiveness of Knowledge Distillation. We show the effectiveness of KD for the problem
of DBD in Table 5 (right: second and fifth rows). By comparing the performances of the smaller
networks S trained with and without (w/o) KD, we can see that distillation improves the perfor-
mance of the proposed student network (Distill-DBDGAN) by transferring knowledge from the
larger teacher network. KD lowers the MAE values from 0.057 to 0.044 on CUHK dataset and from


Fig. 4. Visual comparison results of DBD maps on the CUHK dataset [29]. The GT maps are shown in the
last column. It can be seen that our methods consistently produce DBD maps closest to the ground truth
maps. Additional results on the CUHK dataset [29] are available in Appendix C.


Fig. 5. Visual comparison results of DBD maps on DUT dataset [45]. Our method produces the most visually
plausible DBD maps similar to the GT maps.
0.090 to 0.071 on DUT dataset. Our aim is to empower a lightweight network with the ability to per-
form on par with a larger network for DBD task. Several recent works [4, 43] have made gradual
improvements over existing deep learning techniques [46, 47] for DBD task. However, compli-
cated deep networks with larger inference time were employed to achieve the same. In our case,

Fig. 6. Visual comparison results of DBD maps on SZU-BD dataset [31] (GT is ground truth). Our mod-
els show good generalization capacity with similar or better performance than other state-of-the-art DBD
methods.

Table 2. Quantitative Comparisons with Nine DBD Methods on SZU Defocus Blur Detection
Dataset [31] in Terms of F-measure ↑ and MAE ↓

Metric     KSFV [24]  SS [36]  LBP [39]  HiFST [6]  EHS [18]  DPN [31]  DD [4]  DAD [44]  EFENet [43]  Ours (T)  Ours (S)
F-measure  0.841      0.877    0.912     0.899      0.939     0.958     0.972   0.916     0.968        0.969     0.968
MAE        0.273      0.224    0.160     0.217      0.126     0.078     0.055   0.172     0.073        0.064     0.065
Best in bold. Second best is underlined.
Table 3. Inference Time in Seconds (s) of Different Methods for Image Size 320 × 320
Metric DBDF [29] SS [36] DHCF [25] HiFST [6] LBP [39] BTBNet [45] CRLNet [46] DD [4] AENet [43] EFENet [43] DAD [44] Ours (T ) Ours (S)
Time (s) 45.45 0.7142 11.76 47.61 9.00 25.02 12.04 0.107 0.054 0.044 0.035 0.036 0.023

Best in bold. Second best is underlined.


Table 4. Model Parameters in Millions (M)

Method    DD [4]  DAD [44]  HRFRNet [41]  HANUN [8]  EFENet [43]  Ours (T)  Ours (S)
# Params  84.47M  34.89M    231.2M        29.75M     53.13M       119.41M   21.43M
Best in bold. Second best is underlined.

knowledge distillation manages to incrementally improve the detection performance of a
smaller network while boosting inference speed. From Tables 1 and 5 (right: fifth row), we can
observe that, compared to the larger EFENet [43], our lightweight S net, when trained without KD
but with only adversarial loss, has lower F-measure and higher MAE on CUHK dataset but still
exhibits 3.27% increase in F-measure and 4.25% decrease in MAE on the DUT dataset. However,
KD has further statistically improved the results compared to EFENet [43] by a 0.44% and 4.68%
increase in F-measure and a 16.98% and 24.47% decrease in MAE on CUHK and DUT datasets, re-
spectively, while simultaneously retaining the runtime efficiency. Similarly, compared to HRFRNet
[41], S net achieves lower MAE values when trained without KD but with only adversarial loss.
KD boosts the performances from 50% and 12.62% to 61.40% and 31.0% decrease in MAE values
on CUHK and DUT datasets, respectively. These performance gains are statistically considerable
from a relative standpoint and validate the overall viability of the technique for DBD task. We
also show the effect of the different KD loss terms in the second, third, and fourth rows (right) in
Table 5.
5.4.3 Teacher Net Architecture. We designed the teacher network through trial and error, starting
from a simple U-Net model and gradually adding and refining components in a composable
framework. We harness architectural blocks that can best leverage the structure of the blurry data. Table 5
(left: third and fourth rows) substantiates the effectiveness of SA and DASPP. To illustrate the in-
dividual contribution of each of the two modules, we present DBD results before/after applying
SA and DASPP for cluttered scenarios in Figure 8. The teacher model T−sa−da without SA and
DASPP is more prone to be affected by the negative influence of background clutters (BC), as
shown in Figure 8(b). Inclusion of either of the two modules significantly helps suppress BC, as
shown in Figure 8 (c) and (d). This is due to the capability of SA to capture the long-range non-
local structural correlations and enhance the model’s discriminative ability to separate the blurry
but heavily cluttered background from the focused region. However, by introducing multi-scale
dense connectivity, we capture the rich global contextual information and deduce the DBD masks
from an ensemble of local to global views with increasing receptive fields. This, in turn, assists
the model in reducing the misclassifications of small blur regions due to slightly blurred cluttered
background by considering the spatial continuity of the blurry/focused segments. Our final T net-
work, which integrates both the modules simultaneously, extracts more accurate high-level cues


Fig. 7. Precision, Recall, and F-measure plots on CUHK, DUT, and SZU-BD datasets.

Table 5. Ablation Analysis on CUHK [29] and DUT [45] Datasets Using F-measure ↑ and MAE ↓

Teacher ablation:
Model                              CUHK F-measure  CUHK MAE  DUT F-measure  DUT MAE
Proposed T net                     0.926           0.041     0.901          0.068
T net (w/o D1)                     0.917           0.043     0.889          0.074
T net (w/o D1, w/o SA, w/o DASPP)  0.915           0.047     0.882          0.087

Student ablation:
Model                                                           CUHK F-measure  CUHK MAE  DUT F-measure  DUT MAE
Proposed Distill-DBDGAN (w/ KD: Lcons, LKD-LSGAN, Lfa; w/ D2)   0.918           0.044     0.894          0.071
S net (w/ KD: Lcons, LKD-LSGAN; w/ D2)                          0.914           0.047     0.891          0.081
S net (w/ KD: Lcons; w/ D2)                                     0.911           0.049     0.887          0.084
S net (w/o KD, w/ D2)                                           0.902           0.057     0.882          0.090
S net (w/o KD, w/o D2)                                          0.887           0.066     0.858          0.106

We study the impact of different components in the proposed method. Best in bold.

and thereby results in better predictions for both the boundaries and blurred regions, as shown in
Figure 8(e). We have also observed that using an aggressively complex network for DBD task in a
limited dataset regime does not bring significant improvements during testing phase.
5.4.4 Model Performance vs. Model Size Tradeoff. We show the performances of the deep
teacher model and the lightweight smaller models along with the corresponding model


Fig. 8. DBD results for two test samples with cluttered backgrounds, taken from DUT dataset [45]. T−sa−da
stands for the proposed teacher network T without SA and DASPP modules. GT denotes ground truth. Back-
ground clutters are marked with red circles and ellipses. The green rectangles show erroneous detections
(w.r.t. GT).

Table 6. Ablation Analysis on CUHK [29] and DUT [45] Datasets Using F-measure ↑,
MAE ↓, Number of Model Parameters ↓, and Inference Time ↓

Model                 CUHK F-measure  CUHK MAE  DUT F-measure  DUT MAE  #Params, Time (ms)
Teacher               0.926           0.041     0.901          0.068    119.41M, 36.24 ms
Lightweight networks:
MobileNetV2-UNet      0.883           0.066     0.839          0.114    9.46M, 8.66 ms
EfficientNetB0-UNet   0.879           0.060     0.868          0.100    14.11M, 16.30 ms
EfficientNetB3-UNet   0.902           0.057     0.882          0.090    21.43M, 23.25 ms
EfficientNetB7-UNet   0.928           0.040     0.898          0.071    77.01M, 46.40 ms

We analyze the performances of different lightweight models for the DBD task. Best in bold.

parameters and inference speed in Table 6. To reduce training computations, the lightweight
models, shown in Table 6, were trained without KD for comparison purposes. We notice that
MobileNetV2-UNet, with 9.46M parameters, is the most compute-friendly model among the tested
models and has the lowest inference time of 8.66 ms. However, its performance is low, as observed from
the reported F-measure and MAE values. EfficientNetB0-UNet has a lower model size (14.11M)
than EfficientNetB3-UNet (21.43M). But the performance of EfficientNetB0-UNet drops on both
CUHK and DUT datasets compared to the performance of EfficientNetB3-UNet. EfficientNetB7-
UNet achieves the highest accuracy among the tested lightweight models, but it also has the
largest parameter size and inference time of 77.01M and 46.40 ms, respectively. Hence, we have
employed an EfficientNetB3-UNet-based framework for the student model, which exhibits satisfactory
performance in terms of MAE and F-measure values (shown in Table 1). Our proposed student
model has reasonable parameter size of 21.43M and average inference time of 23.25 ms.

6 APPLICATIONS
Here we explore some of the possible applications that benefit from blur detection: blur magnifi-
cation, focal boundary detection, and foreground-background segmentation.

6.1 Blur Magnification


We apply our method to perform defocus blur magnification [1] where we selectively amplify the
detected blurry regions. Blur magnification increases the aesthetic appeal of an image by produc-
ing a higher degree of blur in the out-of-focus regions. It thereby highlights the in-focus regions


Fig. 9. Demonstration of blur magnification. First row: (a) Partially blurred input image taken from CUHK
test dataset [29], images with magnified defocus blur in the background region obtained with the aid of blur
maps generated using (b) DD [4] and the proposed (c) T and (d) S networks. Second row: Zoomed regions
corresponding to the pink patches. Our approach preserves the in-focus pixels of the thumb region (zoomed
left patch) in the vicinity of the magnified blurred background. Pixels at the lower edge of the wristwatch
(zoomed right patch) are also more plausibly restored by our method.

Fig. 10. Focal boundary detection results. (a) Input defocus blurred image taken from CUHK test dataset
[29]. Results obtained by (b) Canny operator [23], (c) Sobel operator [12], (d) DD [4], proposed (e) T network,
and (f) S network.

in an image. We show examples of blur magnification in Figure 9 by using the detected blur maps
obtained by the proposed method and state-of-the-art DD technique [4]. In Figure 9(a), we show a
defocused image selected from CUHK test dataset [29]. We compare blur magnification achieved
using blur maps obtained by DD [4] and our method. Here we compute blur detection maps cor-
responding to Figure 9(a). The images with blur magnified in the background, obtained with the
aid of detected blur maps from DD [4], proposed teacher (T ), and student (S) models, are shown in
Figure 9(b), (c), and (d), respectively. The magnification result obtained by using the blur detection
output of DD [4] suffers from erroneous boundaries, as shown in the highlighted regions in the
second row. Looking at the highlighted regions, we can notice that our method achieves better de-
focus magnification where the focused pixels are kept intact at the boundaries of the foreground
object.
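
A toy sketch of this use case is given below: the detected blur map acts as a soft matte that blends a re-blurred copy of the image with the original. This only illustrates how the detected map drives the effect and is not the magnification procedure of [1]; it assumes the map encodes 1 for defocused pixels (invert it if the opposite convention is used).

```python
import cv2
import numpy as np


def magnify_blur(image, blur_map, sigma=7):
    """Blend a Gaussian-blurred copy of the image with the original,
    using the detected blur map as a soft alpha matte."""
    blurred = cv2.GaussianBlur(image, (0, 0), sigma)
    alpha = blur_map.astype(np.float32)[..., None]      # assumed: 1 = defocused region
    return (alpha * blurred + (1.0 - alpha) * image).astype(image.dtype)
```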

6.2 Focal Boundary Detection


Focal boundary detection from a blurry image is an important step in shape representation-based
algorithms. In Figure 10(a), we show an out-of-focus blurred image taken from the CUHK test
dataset [29]. Directly applying traditional gradient-based edge detectors like Canny [23], Sobel
[12] operators may not yield favorable results in partially blurred images. They are sensitive to
local patterns and textures and generate a lot of disconnected edges. Figure 10(d), (e), and (f) shows
the focal boundary detection results obtained directly from the defocus masks that are generated
by the deep model trained in Reference [4], the proposed T model, and S model, respectively.
Erroneous focal boundaries in Figure 10(d) are due to the intrinsic limitation of the model that


Fig. 11. Foreground-background segmentation results. (a) Input defocus blurred image taken from CUHK
test dataset [29]. Results obtained by (b) DD [4], proposed (c) T network, and (d) S network.

The erroneous focal boundaries in Figure 10(d) are due to the intrinsic limitation of the model, which
predicted an incorrect mask for the input blurry image. The superior quality of the defocus masks
produced by our proposed method helps to better capture the shape information near the focused
object boundaries.
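To make the comparison concrete, a focal boundary map can be read directly off a predicted defocus mask rather than off raw image gradients. The sketch below thresholds the mask and takes its morphological gradient; the 0.5 threshold and the 3×3 structuring element are our illustrative choices, not parameters taken from our pipeline.

```python
import cv2
import numpy as np

def focal_boundaries(defocus_mask: np.ndarray, thresh: float = 0.5) -> np.ndarray:
    """Extract focal boundaries from a predicted defocus mask in [0, 1]."""
    binary = (defocus_mask > thresh).astype(np.uint8)
    kernel = np.ones((3, 3), np.uint8)
    # Morphological gradient (dilation - erosion) leaves a thin band exactly
    # where the focused and defocused regions meet.
    boundary = cv2.morphologyEx(binary, cv2.MORPH_GRADIENT, kernel)
    return boundary * 255
```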

6.3 Foreground-Background Segmentation


Manual segmentation of a blurry image into foreground and background is tedious, time-
consuming, and often prone to errors due to intra- and inter-observer variability. To avoid manual
pixel-level annotations and to minimize user intervention in robust segmentation algorithms, one
needs to fully automate the initial segmentation process. Interestingly, the blur map detected us-
ing the proposed algorithm provides a useful mask to initialize the segmentation process. We first
extract the contours from the detected defocus mask and then derive a set of contour-constrained
superpixels from the contour map. The superpixels are subsequently labeled to obtain the final
foreground-background segmentation mask. We show a case of a complicated blurry image from
CUHK test dataset [29] in Figure 11(a) with large homogeneous focal regions. It would be a com-
plex task for a user to segment the image manually. However, by using the detected blur maps, we
can derive the initial segmentation outputs as shown in Figure 11(b), (c), and (d). The segmentation
map in Figure 11(b) is obtained by using the blur detection result of DD [4]. Notice that it exhibits
inaccurate segmentation results and poor boundary localization of the generated segments. How-
ever, the segmentation results in Figures 11(c) and (d), obtained by using the blur detection results
of our proposed teacher (T) and student (S) models, respectively, are of superior quality. In some cases,
the results obtained using the S model are even better than those obtained from the T model.
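A rough sketch of the superpixel-labeling idea described above is given below. It uses off-the-shelf SLIC superpixels as a stand-in for the contour-constrained superpixels of our pipeline and labels each superpixel by majority vote against the thresholded blur map; the threshold, the number of segments, and the 1 = blurred map convention are our assumptions.

```python
import numpy as np
from skimage.segmentation import slic

def init_segmentation_from_blur_map(image: np.ndarray, blur_map: np.ndarray,
                                    n_segments: int = 400,
                                    thresh: float = 0.5) -> np.ndarray:
    """Initial foreground/background mask derived from a detected blur map."""
    focused = blur_map < thresh                       # assumed: 1 = blurred
    superpixels = slic(image, n_segments=n_segments, compactness=10,
                       start_label=0)
    segmentation = np.zeros(focused.shape, dtype=np.uint8)
    for label in np.unique(superpixels):
        region = superpixels == label
        # Majority vote: a superpixel is foreground if most of its pixels are focused.
        if focused[region].mean() > 0.5:
            segmentation[region] = 1
    return segmentation
```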

7 FAILURE CASES
We identify some failure modes of our trained models across different experimental scenarios.
Our trained models may not always yield accurate results for transparent objects. Figure 12(a) il-
lustrates one such scenario. Since the glass object inherits multiple abstract textures and colors
from the blurry image background, it exhibits an appearance similar to its defocused surroundings
and is falsely labeled as blurred. We also show a failure example for a complex scene of focused
water drops with multi-colored reflections in Figure 12(b). The blur mask in this scene is hard to
detect because of the complex light paths with reflections. Our method is also prone to
failure when strong specular highlights are present in the image, as shown in Figure 12(c). Consid-
ering the physical constraints, we have not separately modeled reflection characteristics for
specular region detection and removal in our proposed framework, which is beyond the scope of this
article. In Figure 12(d), our model fails to distinguish the thin focused structure/strip from the
corresponding blurry background. The detection accuracy of our method suffers slightly in
such cases, especially when there is a high resemblance between the appearances of the two.

Fig. 12. Failure examples. The input images in panels (a) and (b) and panels (c) and (d) are taken from the
DUT and CUHK datasets, respectively. Our trained networks may fail in areas marked by the rectangles in (a)
transparent glass object, (b) falling water drops with complex light paths, (c) object with specular highlights,
and (d) thin focused structure with defocused background.

8 CONCLUSION
In this article, we proposed a KD strategy and an adversarial learning-based framework for the
challenging task of defocus blur detection. To the best of our knowledge, the proposed model is the first
application of a KD scheme within an adversarial framework for robust detection of defocus blur regions
from a single image. We leverage deep multi-scale and attention-guided features to accurately detect blur
in ambiguous regions. We provide qualitative and quantitative comparisons
with state-of-the-art defocus blur detection algorithms. In future studies, we plan to investigate
automatic segmentation and tracking of blurred regions in videos.
APPENDICES
A OBJECTIVE FUNCTIONS OF THE ADVERSARIAL DISCRIMINATORS
We minimize the following objective functions to train the discriminators $D_1$, $D_2$, and $D_3$:

$$\mathcal{L}_{D_1} = \frac{1}{2(W\times H)}\sum_{j=1}^{W\times H}\Big[D_1\big(\hat{y}_j^{(t)}\,\circledcirc\,x_j\big)\Big]^2 + \frac{1}{2(W\times H)}\sum_{j=1}^{W\times H}\Big[D_1\big(y_j\,\circledcirc\,x_j\big)-1\Big]^2, \tag{14}$$

$$\mathcal{L}_{D_2} = \frac{1}{2(W\times H)}\sum_{j=1}^{W\times H}\Big[D_2\big(\hat{y}_j^{(s)}\,\circledcirc\,x_j\big)\Big]^2 + \frac{1}{2(W\times H)}\sum_{j=1}^{W\times H}\Big[D_2\big(y_j\,\circledcirc\,x_j\big)-1\Big]^2, \tag{15}$$

$$\mathcal{L}_{D_3} = \frac{1}{2(W\times H)}\sum_{j=1}^{W\times H}\Big[D_3\big(\hat{y}_j^{(s)}\big)\Big]^2 + \frac{1}{2(W\times H)}\sum_{j=1}^{W\times H}\Big[D_3\big(\hat{y}_j^{(t)}\big)-1\Big]^2, \tag{16}$$

where $x$ is the input blurred image with spatial dimensions $W \times H$; $y$ is the ground-truth blur
mask; $\hat{y}^{(t)}$ and $\hat{y}^{(s)}$ are the outputs of the teacher model $G_1$ and the student model $G_2$, respectively;
$j$ is the pixel index of the image; and $\circledcirc$ denotes the concatenation operator.
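A minimal PyTorch rendering of Equations (14)–(16) is given below, assuming the discriminators output per-pixel score maps, that concatenation is performed along the channel dimension, and that generator outputs are detached while updating the discriminators (standard practice); the averaging over the W × H pixels absorbs the 1/(2(W × H)) factors. This is a sketch, not the exact training code.

```python
import torch

def discriminator_loss_d1(D1, x, y, y_hat_t):
    """Eq. (14): real = (GT mask, image), fake = (teacher prediction, image)."""
    fake = D1(torch.cat([y_hat_t.detach(), x], dim=1))
    real = D1(torch.cat([y, x], dim=1))
    return 0.5 * (fake ** 2).mean() + 0.5 * ((real - 1.0) ** 2).mean()

def discriminator_loss_d2(D2, x, y, y_hat_s):
    """Eq. (15): same as Eq. (14), with the student prediction as the fake sample."""
    fake = D2(torch.cat([y_hat_s.detach(), x], dim=1))
    real = D2(torch.cat([y, x], dim=1))
    return 0.5 * (fake ** 2).mean() + 0.5 * ((real - 1.0) ** 2).mean()

def discriminator_loss_d3(D3, y_hat_t, y_hat_s):
    """Eq. (16): D3 separates the student's map (fake) from the teacher's (real)."""
    fake = D3(y_hat_s.detach())
    real = D3(y_hat_t.detach())
    return 0.5 * (fake ** 2).mean() + 0.5 * ((real - 1.0) ** 2).mean()
```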
B SENSITIVITY ANALYSIS OF ADVERSARIAL HYPERPARAMETERS
Considering the presence of several hyperparameters in our proposed algorithm and the long training
time required, we adopted a systematic procedure to tune the sensitive hyperparameters in our
method.

Fig. 13. Effect of adversarial training hyperparameters on network performances when evaluated on the
CUHK and DUT datasets. We show plots of average MAE versus (a) λ 2 , versus (b) γ 2 , and versus (c) α 3 for
the CUHK test dataset and MAE versus (d) λ 2 , versus (e) γ 2 , and versus (f) α 3 for the DUT test dataset.

Since the hyperparameters related to the KD losses have a known effect on the model in the general
sense and are not as sensitive as the adversarial hyperparameters, we keep them constant and perform
a sensitivity analysis of the adversarial hyperparameters only.
We have objectively selected a set of candidate values within a range expected to exert a distinctive
influence on model performance. Figure 13 plots the effect of the hyperparameters λ2 (Equation (12)),
γ2 (Equation (13)), and α3 (Equation (5)) on network performance when training on the CUHK dataset
and evaluating on the CUHK and DUT datasets. The training parameter λ2 controls the contribution
of the adversarial loss for the generator of the T network, whereas γ2 and α3 control the same for the
generator of the S network. Each MAE value in Figure 13 is calculated by averaging over 30.2K
iterations for each hyperparameter setting. In Figure 13(a), we observe that as λ2 increases from 0.1
to 1.0, the average MAE changes negligibly; we therefore consider the performance on the CUHK
dataset to be relatively insensitive to λ2 in the range [0.1, 1.0]. The average MAE reaches its lowest
value (lower is better) at λ2 = 10, whereas for the DUT dataset in Figure 13(b), the MAE is lowest at
λ2 = 0.1 and then increases as λ2 grows. We then analyze the influence of the parameter γ2 on the
smaller S network by setting γ3 = 0 (Equation (13)) and keeping γ1 (Equation (13)) constant at 1.0
(as mentioned in Section 4). We observe from Figure 13(c) and (d) that, with increasing γ2, the
average MAE declines on both datasets until the best results are reached at γ2 = 10.0 and γ2 = 0.1
for the CUHK and DUT datasets, respectively. We conjecture that evaluation on the DUT test dataset,
which contains several obscure real blurred images and is 5× the size of the CUHK test set, ensures
a more unbiased analysis and provides greater confidence in the measured performance. Moreover,
since the DUT dataset is not used during training and is not directly related to the learning set (e.g.,
different acquisition settings), it provides a better assessment of the model's generalization ability.
Hence, we chose both λ2 and γ2 to be 0.1 for the final training session. To analyze the effect of α3,
we vary only α3 while keeping the other parameters γ1 and γ3 (Equation (13)) and α1 and α2
(Equation (5)) constant (as mentioned in Section 4), with γ2 = 0.1. In Figure 13(e) and (f), we notice
nearly the same effect of α3 on the performance of the student network as that shown in
Figure 13(c) and (d), respectively. In line with this initial conjecture, we adopt α3 = 0.1 for training
the final student network within the knowledge distillation scheme.


We observe that the optimal hyperparameter values for the CUHK dataset tend to be higher than those
for the DUT dataset, and there is a larger drop in MAE for the DUT dataset at a hyperparameter
value of 0.1 for all three hyperparameters considered.
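The sweep procedure used in this appendix can be summarized by the sketch below: one adversarial weight is varied over a small grid while the others stay fixed, and the configuration with the lowest held-out MAE is kept. Here `train_and_evaluate` is a hypothetical helper that trains a model with the given weights and returns the MAE averaged over the evaluation iterations; the candidate grid in the usage comment mirrors the ranges discussed above but is otherwise our assumption.

```python
import numpy as np

def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute error between a predicted blur map and the ground truth."""
    return float(np.mean(np.abs(pred - gt)))

def sweep(hparam_name, candidates, fixed_hparams, train_and_evaluate):
    """Evaluate one adversarial weight over a grid, all other weights held fixed."""
    results = {}
    for value in candidates:
        hparams = dict(fixed_hparams, **{hparam_name: value})
        results[value] = train_and_evaluate(hparams)   # average MAE for this setting
    best = min(results, key=results.get)               # lower MAE is better
    print(f"best {hparam_name} = {best} (average MAE = {results[best]:.4f})")
    return results

# Example (assumed grid; gamma_1 = 1.0 and gamma_3 = 0 as stated in the text):
# sweep("gamma_2", [0.1, 0.5, 1.0, 5.0, 10.0],
#       fixed_hparams={"gamma_1": 1.0, "gamma_3": 0.0},
#       train_and_evaluate=train_and_evaluate)
```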

C ADDITIONAL RESULTS
We provide additional results of our method along with other defocus blur detection methods
[4, 6, 8, 36, 39, 43–47] on the CUHK dataset [29] in Figure 14.

Fig. 14. Additional results of DBD maps on the CUHK dataset [29]. The GT maps are shown in the last
column.


D EXPERIMENTS ON IMAGES CAPTURED USING HANDHELD DEVICES


We verify the efficacy and versatility of our proposed DBD method via experiments on images
captured using handheld consumer devices such as smartphones and DSLR cameras. We have cap-
tured images from different scenes using a smartphone camera, as shown in Figure 15(a) and (d)
(first and second rows). The images in Figure 15(a) and (d) in the third and fourth rows were cap-
tured using a DSLR camera and are taken from the DFD dataset proposed in Reference [2]. The
two scenes along the same row in Figure 15 are similar but contain varying degrees of blur. We
show the results obtained from the proposed teacher and student models for the aforementioned
captured images in Figure 15(b) and (e) and Figure 15(c) and (f), respectively. The results demon-
strate the robust generalization ability of the proposed models to the different defocus effects present
in images captured using mobile phones. Our method is fast and lends itself to efficient deployment
of the trained models on commodity smartphones. The images taken from the DFD dataset
in Figure 15(a) and (d) (third and fourth rows) are complicated, and the blurring effect is not very
prominent in the first scene. Still, our models generate decent DBD maps for such challenging
scenarios.

Fig. 15. Visual results of DBD maps on images captured using a smartphone camera (first and second rows)
and DSLR camera (third and fourth rows). The images captured using a DSLR camera are taken from the
DFD dataset [2]. The ((a) and (d)) input images are captured from similar scenes (along the same row) with
varying degrees of blur. Note that the image in (d) is more blurred than the image in (a) along the same row.
DBD maps detected by our proposed ((b) and (e)) teacher network and ((c) and (f)) student networks.

REFERENCES
[1] Soonmin Bae and Frédo Durand. 2007. Defocus magnification. In Computer Graphics Forum, Vol. 26. 571–579.
[2] Marcela Carvalho, Bertrand Le Saux, Pauline Trouvé-Peloux, Andrés Almansa, and Frédéric Champagnat. 2018. Deep
depth from defocus: How can defocus blur improve 3D estimation using dense neural networks? In Proceedings of the
European Conference on Computer Vision (ECCV’18) Workshops.


[3] Ming-Ming Cheng, Niloy J. Mitra, Xiaolei Huang, Philip H. S. Torr, and Shi-Min Hu. 2014. Global contrast based salient
region detection. IEEE Trans. Pattern Anal. Mach. Intell. 37, 3 (2014), 569–582.
[4] X. Cun and C. M. Pun. 2020. Defocus blur detection via depth distillation. In Proceedings of the European Conference
Computer Vision (ECCV’20), Vol. 12358. 747–763.
[5] J. Deng, W. Dong, R. Socher, L. J. Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’09). 248–255.
[6] S. A. Golestaneh and L. J. Karam. 2017. Spatially-varying blur detection based on multiscale fused and sorted transform
coefficients of gradient magnitudes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR’17). 596–605.
[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. 2014.
Generative adversarial nets. In Proceedings of the Conference on Advances in Neural Information Processing Systems
(NeurIPS’14). 2672–2680.
[8] Wenliang Guo, Xiao Xiao, Yilong Hui, Wenming Yang, and Amir Sadovnik. 2021. Heterogeneous attention nested
U-shaped network for blur detection. IEEE Signal Process. Lett. 29 (2021), 140–144.
[9] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. GANs trained by
a two time-scale update rule converge to a local Nash equilibrium. In Proceedings of the Advances in Neural Information
Processing Systems (NeurIPS’17), Vol. 30.
[10] J. Hu, L. Shen, and G. Sun. 2018. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR’18). 7132–7141.
[11] R. Huang, W. Feng, M. Fan, L. Wan, and J. Sun. 2018. Multiscale blur detection by learning discriminative deep features.
Neurocomputing 285 (2018), 154–166.
[12] Zhang Jin-Yu, Chen Yan, and Huang Xian-Xiang. 2009. Edge detection of images based on improved Sobel operator
and genetic algorithms. In Proceedings of the International Conference on Image Analysis and Signal Processing. 31–35.
[13] Alexia Jolicoeur-Martineau. 2019. The relativistic discriminator: A key element missing from standard GAN. In Pro-
ceedings of the International Conference on Learning Representations (ICLR’19).
[14] Diederick P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the Interna-
tional Conference on Learning Representations (ICLR’15).
[15] Boyi Li, Felix Wu, Kilian Q. Weinberger, and Serge J. Belongie. 2019. Positional normalization. In Proceedings of the
Advances in Neural Information Processing Systems (NeurIPS’19). 1620–1632.
[16] Jinxing Li, Dandan Fan, Lingxiao Yang, Shuhang Gu, Guangming Lu, Yong Xu, and David Zhang. 2021. Layer-output
guided complementary attention learning for image defocus blur detection. IEEE Trans. Image Process. 30 (2021), 3748–
3763.
[17] Zinan Lin, Vyas Sekar, and Giulia Fanti. 2020. Why spectral normalization stabilizes GANs: Analysis and improve-
ments. Advances in Neural Information Processing Systems (NeurIPS) 34 (2020), 9652–9638.
[18] K. Ma, H. Fu, T. Liu, Z. Wang, and D. Tao. 2018. Deep blur mapping: Exploiting high-level semantics by deep neural
networks. IEEE Trans. Image Process. 27, 10 (2018), 5155–5166.
[19] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley. 2017. Least squares generative adversarial networks. In
Proceedings of the IEEE International Conference on Computer Vision (ICCV’17). 2813–2821.
[20] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley. 2019. On the effectiveness of least squares generative
adversarial networks. IEEE Trans. Pattern Anal. Mach. Intell. 41, 12 (2019), 2947–2960.
[21] Mehdi Mirza and Simon Osindero. 2014. Conditional generative adversarial nets. arXiv:1411.1784. Retrieved from
https://arxiv.org/abs/1411.1784.
[22] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. 2018. Spectral normalization for generative
adversarial networks. In Proceedings of the International Conference on Learning Representations (ICLR’18).
[23] Farzin Mokhtarian and Riku Suomela. 1998. Robust image corner detection through curvature scale space. IEEE Trans.
Pattern Anal. Mach. Intell. 20, 12 (1998), 1376–1381.
[24] Y. Pang, H. Zhu, X. Li, and X. Li. 2016. Classifying discriminative features for blur detection. IEEE Trans. Cybern. 46,
10 (2016), 2220–2227.
[25] J. Park, Y. W. Tai, D. Cho, and I. S. Kweon. 2017. A unified approach of multi-scale deep and hand-crafted features
for defocus estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17).
2760–2769.
[26] Alec Radford, Luke Metz, and Soumith Chintala. 2016. Unsupervised representation learning with deep convolutional
generative adversarial networks. In Proceedings of the International Conference on Learning Representations (ICLR’16),
Yoshua Bengio and Yann LeCun (Eds.).
[27] O. Ronneberger, P. Fischer, and T. Brox. 2015. U-Net: Convolutional networks for biomedical image segmentation. In
Proceedings of the Annual Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI’15).
234–241.


[28] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNetV2: In-
verted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni-
tion (CVPR’18).
[29] J. Shi, L. Xu, and J. Jia. 2014. Discriminative blur detection features. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR’14). 2965–2972.
[30] J. Shi, L. Xu, and J. Jia. 2015. Just noticeable defocus blur detection and estimation. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR’15). 657–665.
[31] Xiaoli Sun, Xiujun Zhang, Mingqing Xiao, and Chen Xu. 2020. Blur detection via deep pyramid network with recurrent
distinction enhanced modules. Neurocomputing 414 (2020), 278–290.
[32] M. Tan and Q. V. Le. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings
of the International Conference on Machine Learning (ICML’19).
[33] C. Tang, X. Liu, S. An, and P. Wang. 2021. BR²Net: Defocus blur detection via a bidirectional channel attention residual
refining network. IEEE Trans. Multimedia 23 (2021), 624–635.
[34] C. Tang, X. Liu, X. Zheng, W. Li, J. Xiong, L. Wang, A. Zomaya, and A. Longo. 2022. DeFusionNET: Defocus blur
detection via recurrently fusing and refining discriminative multi-scale deep features. IEEE Trans. Pattern Anal. Mach.
Intell. 44, 2 (2022), 955–968.
[35] Chang Tang, Xinwang Liu, Xinzhong Zhu, En Zhu, Kun Sun, Pichao Wang, Lizhe Wang, and Albert Zomaya. 2020.
R²MRF: Defocus blur detection via recurrently refining multi-scale residual features. In Proceedings of the AAAI Con-
ference on Artificial Intelligence, Vol. 34. 12063–12070.
[36] C. Tang, J. Wu, Y. Hou, P. Wang, and W. Li. 2016. A spectral and spatial approach of coarse-to-fine blurred image
region detection. IEEE Sign. Process. Lett. 23, 11 (2016), 1652–1656.
[37] Chang Tang, Xinzhong Zhu, Xinwang Liu, Lizhe Wang, and Albert Zomaya. 2019. DeFusionNET: Defocus blur detec-
tion via recurrently fusing and refining multi-scale deep features. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR’19).
[38] M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang. 2018. DenseASPP for semantic segmentation in street scenes. In Proceed-
ings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’18). 3684–3692.
[39] X. Yi and M. Eramian. 2016. LBP-Based segmentation of defocus blur. IEEE Trans. Image Process. 25, 4 (2016), 1626–1638.
[40] K. Zeng, Y. Wang, J. Mao, J. Liu, W. Peng, and N. Chen. 2019. A local metric for defocus blur detection based on CNN
feature learning. IEEE Trans. Image Process. 28, 5 (2019), 2107–2115.
[41] Yongping Zhai, Junhua Wang, Jinsheng Deng, Guanghui Yue, Wei Zhang, and Chang Tang. 2021. Global context
guided hierarchically residual feature refinement network for defocus blur detection. Sign. Process. 183 (2021), 107996.
[42] Ning Zhang and Junchi Yan. 2020. Rethinking the defocus blur detection problem and a real-time deep DBD model.
In Proceedings of the European Conference Computer Vision (ECCV’20). 617–632.
[43] Wenda Zhao, Xueqing Hou, You He, and Huchuan Lu. 2021. Defocus blur detection via boosting diversity of deep
ensemble networks. IEEE Trans. Image Process. 30 (2021), 5426–5438.
[44] Wenda Zhao, Cai Shang, and Huchuan Lu. 2021. Self-generated defocus blur detection via dual adversarial discrimi-
nators. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’21). 6933–6942.
[45] Wenda Zhao, Fan Zhao, Dong Wang, and Huchuan Lu. 2018. Defocus blur detection via multi-stream bottom-top-
bottom fully convolutional network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR’18).
[46] W. Zhao, F. Zhao, D. Wang, and H. Lu. 2019. Defocus blur detection via multi-stream bottom-top-bottom network.
IEEE Trans. Pattern Anal. Mach. Intell. 42, 8 (2019), 1884–1897.
[47] Wenda Zhao, Bowen Zheng, Qiuhua Lin, and Huchuan Lu. 2019. Enhancing diversity of defocus blur detectors via
cross-ensemble network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’19).

Received 27 January 2022; revised 30 June 2022; accepted 3 August 2022
