Doctoral Dissertation For Junyong Lee, PH.D., POSTECH
Junyong Lee (이 준 용)
2023
Learning Image and Video Restoration Using Auxiliary Data
(보조 데이터를 이용한 영상 및 비디오 복원 학습)
ABSTRACT
Most previous deep learning-based image and video restoration approaches focus on developing network architectures for better restoration quality. However, a network with improved capacity may easily overfit to a training dataset and fail to be fully exploited for restoring unseen real-world images or videos. In this dissertation, we propose to utilize auxiliary data to provide degradation-specific priors that allow the capacity of a restoration network to be fully exploited. We develop deep learning-based frameworks for defocus deblurring and video super-resolution tasks, for which we propose novel network architectures and training strategies that leverage the auxiliary data, allowing the networks to achieve state-of-the-art restoration quality.
We first propose a defocus map estimation network that directly estimates a defocus map containing the per-pixel blur amount of an input defocused image. To train the network, we present a dataset containing synthetic images with defocus maps. During training, we utilize a real-world blur detection dataset as auxiliary data to reduce the domain gap that occurs when a real-world defocused image is fed to the network trained with the synthetic dataset. Our method reports state-of-the-art defocus map estimation performance, and we show that leveraging defocus maps predicted by our
method can improve the defocus deblurring quality.
We then propose a single image defocus deblurring network that predicts per-pixel deblurring filters for flexible handling of spatially varying defocus blur. Due to its high flexibility, the network may easily overfit to a training dataset. During training, we prevent this by utilizing auxiliary disparity map estimation and reblurring tasks for the network to exploit defocus-specific priors about blur sizes and shapes, allowing robust single image defocus deblurring. Our network effectively removes spatially varying defocus blur and achieves state-of-the-art deblurring performance on real-world defocused images.
Next, we propose a reference-based video super-resolution (RefVSR) framework that super-resolves an ultra-wide low-resolution (LR) video utilizing wide-angle and telephoto videos as auxiliary references. To train our network, we propose the RealMCVSR dataset containing real-world ultra-wide, wide-angle, and telephoto video triplets concurrently captured with a smartphone, and we also present the training strategy fully utilizing video triplets in the proposed dataset.
Lastly, we present a memory network for the implicit reference utilization in the RefVSR task. Using ultra-wide LR features as queries, our memory network returns corresponding reference features, which are then utilized by a video super-resolution network for high-fidelity results. We also propose the test-time optimization strategy that fine-tunes the memory network for a specific reference video to memorize enhanced reference information. We show that reference features queried from the proposed memory network can be utilized across the entire region of an LR frame and help improve the final SR quality.
Contents
I. Introduction
1.1 Utilizing Auxiliary Data for Image & Video Restoration
1.2 Organization
3.4.4 Evaluation on CUHK and RTF Datasets
3.4.5 Defocus Deblurring
3.4.6 Applications
3.5 Discussion
5.4.1 Analysis on Reference Video Types
5.4.2 Ablation Study
5.4.3 Effect of Propagative Temporal Fusion
5.4.4 Effect of Bidirectional Scheme
5.4.5 Effect of Alignment Methods
5.4.6 Comparison on RealMCVSR Dataset
5.5 Discussion
References
I. Introduction
$$b = k * l + n, \qquad (1.1)$$
where $b$ is an observed degraded image, $l$ is the latent sharp image, $k$ is a degradation (blur) kernel, $*$ denotes convolution, and $n$ is noise.
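As a concrete illustration, the following is a minimal NumPy/SciPy sketch of this degradation model; the Gaussian kernel, noise level, and toy image below are illustrative assumptions rather than settings used in this dissertation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(latent, sigma_blur=2.0, sigma_noise=0.01, seed=0):
    """Synthesize b = k * l + n with a Gaussian blur kernel k and additive noise n."""
    rng = np.random.default_rng(seed)
    blurred = gaussian_filter(latent, sigma=sigma_blur)       # k * l
    noise = rng.normal(0.0, sigma_noise, size=latent.shape)   # n
    return np.clip(blurred + noise, 0.0, 1.0)                 # observed image b

l = np.random.rand(128, 128)   # a toy latent sharp image in [0, 1]
b = degrade(l)
```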
Figure 1.1: Images with degradation. Unexpected degradation is one of the most
annoying artifacts that most photographers want to avoid, as it may severely degrade
the visual quality of resulting images or video frames with irreversible information
loss. Left: defocus blur due to focus failure, right: low-resolution artifact caused by
the finite size of a camera sensor.
Outliers such as noise, saturation, and compression artifacts may also cause serious
artifacts in deconvolution results, which are very difficult to remove. Last but not
least, the spatially varying nature of real-world degradation (i.e., degradation in an
image may locally vary, e.g., the size of defocus blur varies upon depth) is another
factor that intensifies the difficulty of a deconvolution problem.
Recent advancements in deep learning have significantly resolved the aforemen-
tioned problems in conventional deconvolution-based restoration approaches. Thanks
to the nonlinear nature of a deep convolutional neural network, deep learning-based
restoration methods do not require complex degradation modeling or constraints as
conventional ones. A deep network trained with a restoration dataset implicitly learns
to model degradation and to remove it much more effectively than conventional
approaches, even with a simple network architecture [13, 14]. It also has been shown
that a network can effectively deal with nonlinear outliers such as noise and satura-
tion [13, 15], which were very difficult to handle with conventional methods.
Despite the success of previous deep learning-based restoration approaches, prob-
lems still exist. First, a deep network can easily overfit to a restoration dataset and often
fail for unseen real-world test cases. This is mainly due to difficulties in collecting a
restoration dataset containing degraded images or videos with every possible degrada-
tion in a real-world scenario. Second, network architectures proposed in previous deep
learning-based approaches are not specifically designed to fully reflect the characteristics of degradation. For example, in defocus deblurring, due to the spatially invariant nature of convolution operations, a naïvely designed network is not flexible enough to handle spatially varying defocus blur.
Most of the previous deep learning-based restoration approaches have focused on
improving training strategies and network architectures to resolve the aforementioned
problems. Regularization techniques such as dropout [16] and L1 / L2 weight regular-
ization [17] have been proposed for a network to be well generalized for unseen test
cases. Network architectures such as U-Net [18] and RNN [19], and components such as
attention [20] and dynamic convolution [21] are employed for better handling of de-
graded features. However, a network with improved capability also has a larger degree of freedom in its complexity, which may lead the network to easily overfit to
training data [22]. The overfitted network may not be fully exploited for a restoration
task and suffer from severe ill-posedness in handling real-world arbitrary degradation.
This dissertation presents deep learning-based techniques for image and video
restoration tasks: defocus deblurring and video super-resolution. Distinguished from
previous deep learning-based approaches, we concentrate on developing network ar-
chitectures and training strategies with a focus on leveraging not only the primary
restoration dataset but also additional auxiliary data. The main principle here is to pro-
vide degradation-specific priors extracted from auxiliary data to prevent a restoration
network from overfitting to primary training data and to generalize the network for its
computational capability to be fully involved in restoring unseen images or videos.
We first propose two defocus deblurring methods, one of which focuses on de-
focus map estimation, in which we leverage a synthetic dataset primarily for the es-
timation task and real-world defocused images as auxiliary data to train a network
for robust real-world defocus map estimation-based defocus deblurring. For the other
one, we propose an end-to-end learning-based approach for defocus deblurring, in
which we present a deblurring network specifically designed to flexibly handle spa-
tially varying and large defocus blur. However, the network may fail to remove unseen
defocus blur due to the improved flexibility, which leads the network to easily overfit
to a training dataset. We mitigate this problem by utilizing dual-pixel stereo images
as auxiliary data to provide defocus-specific priors for the network to predict the de-
blurring filters that accurately depict the nature of defocus blur. Moreover, we pro-
pose reference-based video super-resolution (RefVSR) methods designed to leverage
high-fidelity reference videos as auxiliary data. We present two RefVSR frameworks,
each of which equips a patch-based matching module for an explicit reference utiliza-
tion and a memory bank for an implicit reference utilization, respectively. We collect
multi-camera video triplets to train each framework for super-resolving an ultra-wide
video utilizing wide-angle or telephoto video frames as references. The following is
a detailed introduction to the contributions of this dissertation.
Deep Defocus Map Estimation Using Domain Adaptation A defocus map contains
the per-pixel defocus blur size of an image. A conventional strategy for defocus deblur-
ring [23, 24, 25, 26, 27, 28] is to estimate per-pixel blur kernels based on the estimated
defocus map and then perform non-blind deconvolution [29, 30, 31]. However, previ-
ous defocus map estimation approaches often fail because they heavily depend on blur
cues only around the edges of a defocused image. As the edges in a blurred image are
often ambiguous, it leads to inaccurate detection of blur amount. Also, blur estima-
tion on edges is inherently prone to errors, as a pixel at an object boundary with depth
discontinuity contains a mixture of different blurs in a defocused image [32].
In Chapter III, we propose the first end-to-end defocus map estimation network
(DMENet), which directly estimates a dense defocus map given a defocused image.
Unlike previous edge-based approaches, we train DMENet to densely estimate a defo-
cus map given a defocused image. To train the network, we collect a novel synthetic
depth-of-field dataset, SYNDOF, where each image is synthetically blurred with a
ground-truth depth map. Due to the synthetic nature of SYNDOF, the feature char-
acteristics of images in SYNDOF can differ from those of real defocused photos. To
address this gap, we utilize a real-defocused dataset as auxiliary data for domain adap-
tation that transfers the features of real-world defocused images into those of syn-
thetically blurred ones. Our DMENet consists of four subnetworks: blur estimation,
domain adaptation, content preservation, and sharpness calibration networks. The sub-
networks are connected to each other and jointly trained with their corresponding su-
pervisions in an end-to-end manner. Our method is evaluated on publicly available
blur estimation datasets, and we show state-of-the-art defocus map estimation quality.
Defocus Deblurring Using Iterative Filter Adaptive Network In Chapter IV, we propose an end-to-end single image defocus deblurring network equipped with our Iterative Filter Adaptive Network (IFAN), which predicts spatially-adaptive per-pixel deblurring filters for flexibly handling spatially varying and large defocus blur. During training, we leverage dual-pixel stereo images for an auxiliary defocus disparity estimation task, which provides strong defocus-specific priors about blur magnitudes
for IFAN. We also present a reblurring scheme as an auxiliary task, which provides
defocus-specific priors about blur shapes and sizes for IFAN to predict more accurate
deblurring filters. We show that both disparity map estimation and reblurring tasks
significantly boost the deblurring quality, and our method achieves state-of-the-art per-
formance both quantitatively and qualitatively on real-world images.
Reference-based Video Super-Resolution In Chapter V, we propose a RefVSR framework that super-resolves an ultra-wide video by explicitly utilizing a wide-angle or telephoto video as a reference. To train and evaluate the framework, we present the RealMCVSR dataset, which consists of ultra-wide, wide-angle, and telephoto videos concurrently taken from triple cameras of a smart-
phone. We also propose a two-stage training strategy fully utilizing video triplets in the
dataset for real-world 4× video super-resolution. We extensively evaluate our method,
and the result shows the state-of-the-art video super-resolution performance.
Restoration task | Primary data | Auxiliary data | Auxiliary priors | Effect
Defocus map estimation (Chapter III) | synthetic defocused images and defocus maps | real-world defocused images and blur detection labels | real-world defocused feature characteristics | domain gap reduction b/w synthetic and real defocused features
Defocus deblurring (Chapter IV) | defocused and all-in-focus images | dual-pixel stereo images | blur sizes and shapes | accurate & robust deblurring filter prediction
Explicit RefVSR (Chapter V) | ultra-wide videos | wide-angle and telephoto videos (explicit reference) | high-fidelity reference textures | high-quality 4×SR results transferred w/ reference textures
Implicit RefVSR (Chapter VI) | ultra-wide videos | wide-angle videos (implicit reference) | memorizable high-fidelity reference textures | high-quality 4×SR results transferred w/ reference textures
Table 1.1: We propose to utilize auxiliary data for various image and video restora-
tion tasks. In the table, “auxiliary priors” indicate degradation-specific priors extracted
from auxiliary data. Auxiliary priors guide a restoration network to avoid overfitting
to primary training data and to learn more robust degradation-specific operations, al-
lowing state-of-the-art restoration quality.
Table 1.1 summarizes the auxiliary data utilization and its effect on restoration
tasks proposed in this dissertation. In the table, we describe the primary and auxil-
iary data used in each restoration task. We also include degradation-specific priors
extracted from auxiliary data and their effect on the corresponding restoration task.
For defocus map estimation (Chapter III), the network fails to handle real-world
defocused images due to their domain gap from synthetic primary data used to train
the network. To reduce the domain gap, we use real-world defocused images with their
semi-labeled binary blur maps as auxiliary data to provide defocus-specific priors for
the network to robustly estimate defocus maps given real-world images.
For defocus deblurring (Chapter IV), we propose an end-to-end network that pre-
dicts per-pixel deblurring filters to flexibly deal with spatially varying defocus blur.
Due to the highly flexible architecture, the network easily overfits to the primary train-
ing data, and the deblurring filters fail to remove arbitrary and diverse blurs unexposed
during training. For robust deblurring performance, we leverage dual-pixel defocused
images to provide defocus-specific priors about blur sizes and shapes for the network
to predict deblurring filters that can be robustly applied to unseen defocused images.
For video super-resolution (Chapters V and VI), we propose reference-based
video super-resolution (RefVSR) networks that utilize wide-angle or telephoto videos
as auxiliary references for super-resolving LR ultra-wide videos. In Chapter V, we ex-
plicitly utilize the auxiliary reference videos, in which reference features are matched
and aligned to LR features. Aligned reference features are then directly utilized by
our RefVSR network for producing SR results explicitly transferred with high-fidelity
reference textures. However, reference textures may fail to be transferred to the results
for the unmatched regions between LR and reference video frames.
In Chapter VI, we present a memory network for the implicit utilization of auxil-
iary reference videos. We use reference videos to train a memory network to memorize
useful reference information. Using LR features as queries, our memory network re-
turns corresponding reference features, which are then utilized by a VSR network for
producing SR results transferred with reference textures across the entire region.
1.2 Organization
• Chapter III presents the defocus map estimation framework that uses real-world
defocused images as auxiliary data to reduce the domain gap between synthetic
and real-world features for accurate real-world defocus map estimation.
• Chapter VI proposes the memory network that constitutes reference features that
are queried and implicitly utilized as auxiliary data by a video super-resolution
network for improving the quality of results across the entire region.
II. Previous Work
2.1 Defocus Map Estimation
For defocus map estimation, most of the previous works first estimate blur amounts
around explicitly detected edges and then propagate them to the surrounding homoge-
neous regions. Zhuo et al. [42] and Karaali et al. [27] use image gradients as local
blur cues, and calculate the ratio of the blur cues between the edges of the original and
re-blurred images. Tang et al. [43] estimate a sparse blur map with spectrum contrast
near image edges. Shi et al. [44] utilize frequency-domain features, learned features,
and image gradients to estimate blur amounts. Shi et al. [23] adopt a sparse represen-
tation to detect just noticeable blurs, which cannot handle large blurs. Xu et al. [45]
use the rank of a local patch as a cue for blur amount. Park et al. [25] build feature
vectors consisting of hand-crafted features as well as deep features taken from a pre-
trained blur classification network, then feed the feature vectors to another network
to regress the blur amounts on edges. All these methods commonly rely on features
defined only around image edges, and so blur amounts interpolated from the edges for
homogeneous regions could be less accurate.
Recently, machine learning techniques have been utilized to densely estimate de-
focus maps. Andrès et al. [24] create a dataset where a ground-truth defocus map is
labeled with the radius of point-spread-function at each pixel which minimizes the er-
ror on a defocused image. They train regression tree fields to estimate the blur amount
of each pixel. However, the method cannot be easily generalized due to insufficient
training images and is not robust at pixels around depth boundaries where ground-truth
blur amounts cannot be accurately measured. Zhang et al. [46] create a dataset by man-
ually labeling each pixel of a defocused image into four levels of blur: high, medium,
low, and no blur, for training a CNN for a classification task. Their method shows
state-of-the-art performance for the blur classification task, but it cannot estimate the
exact blur amount, which is essential for applications such as deblurring.
2.2 Defocus Deblurring
Reference-based Super-Resolution (RefSR) Previous RefSR approaches [36, 37, 38, 39, 40, 41] have focused on establishing non-local correspondence between
LR and Ref features. For establishing correspondence, either offset-based matching
(optical flow [36] and deformable convolution [40]) or patch-based matching (patch-
match [34, 35, 54, 37], learnable patch-match [38, 39], learnable patch-match with
affine correction [41]) are employed.
Video Super-Resolution (VSR) Previous VSR methods have focused on how to ef-
fectively utilize highly related but unaligned LR frames in a video sequence. With
respect to how LR frames in video sequences are handled by a model, previous VSR
approaches can be categorized into either sliding window-based [55, 56, 57, 58, 59]
or recurrent framework-based [60, 61, 62, 63, 64] approaches. For handling unaligned
LR frames, warping using optical flow [55, 62, 64], patch-based correlation [59], and
deformable convolution [56, 58] have been employed.
Memory Networks A memory network stores information that may be useful for a target task [66]. Memory networks were originally developed and
adopted for natural language processing tasks [66, 67, 68]. Then, the property of stor-
ing features naturally extended memory networks to video tasks, such as video under-
standing [69], object tracking [70], and video object segmentation [71, 72]. However,
memory networks are rarely used for image and video restoration tasks. Recently,
Ji and Yao proposed a video deblurring method leveraging a memory network [73].
They stack temporal features in a memory bank and compute a spatio-temporal at-
tention between a target blurry feature and temporal features stacked in the memory.
The attention is then used as a query to retrieve possible sharp features in the memory
bank. However, the model requires large memory and computational complexities to
keep stacking the features in the memory and to compute the spatio-temporal attention
between the target blurry features and features stored in the memory.
In this dissertation (Chapter VI), distinguished from the previous memory net-
work used for video deblurring, we keep our memory bank in the memory network
at a fixed size and train the memory bank to memorize useful reference information.
The memory bank is composed of keys and their corresponding features, which we
call basis features, as their linear combinations are trained to constitute reference fea-
tures. To this end, we use queries extracted from reference frames and keys in the
memory network to compute non-local attention. Using non-local attention, we lin-
early combine the basis features to constitute reference features, which are then used
to reconstruct the reference frames. Then, for VSR, we use a target LR frame to extract
queries and constitute reference features from the memory network. Although the ref-
erence features are retrieved from a memory bank of a fixed size, they help reconstruct
high-quality SR results because these features convey high-fidelity reference textures.
Moreover, since our memory network learns general reference information, it can be
utilized as a Plug-and-Play module for a VSR framework and can be fine-tuned for a
specific reference video to memorize enhanced reference information.
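To make the idea concrete, below is a minimal PyTorch sketch of such a fixed-size memory bank; the slot count, feature dimension, and softmax read-out are illustrative assumptions, not the actual design used in Chapter VI.

```python
import torch
import torch.nn.functional as F

class MemoryBank(torch.nn.Module):
    """Fixed-size memory of learnable keys and basis features: queries attend over the
    keys (non-local attention), and reference features are read out as linear
    combinations of the basis features."""
    def __init__(self, num_slots=256, dim=64):
        super().__init__()
        self.keys = torch.nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.basis = torch.nn.Parameter(torch.randn(num_slots, dim) * 0.02)

    def forward(self, query):
        # query: (B, C, H, W) features extracted from an LR or reference frame
        B, C, H, W = query.shape
        q = query.flatten(2).transpose(1, 2)            # (B, H*W, C)
        attn = F.softmax(q @ self.keys.t(), dim=-1)     # attention over memory slots
        ref = attn @ self.basis                         # constituted reference features
        return ref.transpose(1, 2).reshape(B, C, H, W)
```

During training, queries from reference frames drive the memory to reconstruct reference features; at test time, queries from the target LR frame read out reference-like features from the same bank.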
III. Deep Defocus Map Estimation Using Domain
Adaptation
3.1 Motivation
A defocus map contains the amount of defocus blur or the size of the circle of
confusion (COC) per pixel for a defocus-blurred (in short, defocused) image. Esti-
mation of a defocus map from a defocused image can greatly facilitate high-level vi-
sual information processing, including saliency detection [74], depth estimation [42],
foreground/background separation [75], and deblurring [76]. A typical approach for
defocus map estimation first detects edges from a blurred image, then measures the
amounts of blur around the edges, and finally interpolates the estimated blur amounts
at edges to determine the blur amounts in homogeneous regions.
The previous edge-driven approach has a few limitations. First, the edges in a
blurred image are often ambiguous, leading to inaccurate detection. Second, blur es-
timation for edges is inherently prone to errors, as a pixel at an object boundary with
depth discontinuity contains a mixture of different COCs in a defocused image [32].
Third, this instability of blur estimation at edges would result in less reliable prediction
in homogeneous regions. That is, the blur amounts estimated at different parts of an
object boundary could be incoherent, and then their interpolation toward the homo-
geneous object interior would produce only smooth but less accurate blur estimation.
For example, the estimated blur amounts of an object with a single depth may not be
constant because the blur amounts separately measured at the opposite edges could not
be the same when the edges have different depth discontinuities with nearby objects.
In this dissertation, we present DMENet (Defocus Map Estimation Network), the
first end-to-end CNN framework, which directly estimates a defocus map given a defo-
cused image. Our work is distinguished from the previous ones for its clear definition
of which COC we try to estimate among the mixture of COCs, where we infer the COC
size of a pixel using the depth value in the corresponding pinhole image. The network
trained with our COC definition leads to a more robust estimation of blur amounts, es-
pecially at object boundaries. The network also better handles homogeneous regions
by enlarging its receptive field, so that object edges and interior information are used
together to resolve ambiguity. As a result, our network significantly improves the blur
estimation accuracy in the presence of mixtures of COCs.
To enable such network learning, a high-quality dataset is crucial. However, cur-
rently available datasets [44, 24] are not enough, as they are either for blur detec-
tion [44], instead of blur estimation, or of a small size [24]. To this end, we generate
a defocus-blur dataset, which we call “SYNDOF” dataset. It would be almost impos-
sible, even manually, to generate ground-truth defocus maps for defocused photos. So
we use pinhole image datasets, where each image is accompanied by a depth map, to
synthesize defocused images with corresponding ground-truth defocus maps.
One limitation of our dataset is that defocus blurs are synthetic, and there could
be domain difference [52] between the characteristics of real and synthetic defocused
images. To resolve this, we design our network to include domain adaptation, which
is capable of adapting the features of real defocused images to those of synthetic ones
so that the network can estimate the blur amounts of real images with the training of
defocus blur estimation using synthetic images.
To summarize, our contributions include:
• the first end-to-end CNN architecture that directly estimates accurate defocus
maps without edge detection,
• domain adaptation that enables learning through a synthetic dataset for real-
world defocused images.
Table 3.1: Composition of the SYNDOF dataset. "# samples" is the number of distinct source images and "# outputs" is the number of generated defocused images.

Datasets | # samples | # outputs | Type
MPI | 1,064 | 4,346 | synthetic
SYNTHIA | 896 | 3,680 | synthetic
Middlebury | 46 | 205 | real
Total | 2,006 | 8,231 |
We first collected both synthetic and real images with their associated depth maps.
We did not use 3D scene models to avoid time-consuming high-quality rendering.
Our images are from MPI Sintel Flow (MPI) [77], SYNTHIA [78], and Middlebury
Stereo 2014 (Middlebury) [79] datasets. MPI dataset is a collection of game scene
renderings, the SYNTHIA dataset contains synthetic road views, and the Middlebury
dataset consists of real indoor scene images with accurate depth measurements.
MPI and SYNTHIA datasets include sequences of similar scenes, and thus we
kept only dissimilar images in terms of peak-signal-to-noise-ratio (PSNR) and struc-
tural similarity index (SSIM), ending up with 2,006 distinct sample images in total.
Then, we repeated the process of randomly selecting an image from the sample set
to generate a defocused image with a random sampling of camera parameters and the
focal distance. The total number of defocused images we generated is 8,231. Table 3.1
shows the details.
Given the color-depth pairs, we generated defocused images using the thin-lens
model [80], a standard for defocus blur in computer graphics (Fig. 3.1). Let the focal
length be $F$ (mm), the object-space focal distance $S_1$ (mm), and the f-number $N$. The image-space focal distance is $f_1 = \frac{F S_1}{S_1 - F}$, and the aperture diameter is $D = \frac{F}{N}$. Then, the image-space COC diameter $c(x)$ of a 3D point located at the object distance $x$ is defined as:
$$c(x) = \alpha \frac{|x - S_1|}{x}, \quad \text{where} \;\; \alpha = \frac{f_1}{S_1} D. \qquad (3.1)$$

Figure 3.1: The thin-lens model used to generate defocus blur, with aperture diameter $D$, focal length $F$, object-space focal distance $S_1$, image-space focal distance $f_1$, and image-space COC diameter $c(x)$ for a point at object distance $x$.
To apply defocus blur to an image, we first extract the minimum and maximum
depth bounds, $x_{near}$ and $x_{far}$, from the depth map, respectively. Then, we randomly sample $S_1$ from the range $[x_{near}, x_{far}]$. When computing $c(x)$ using Eq. 3.1, we
only need α that abstracts physical parameters. In practice, x is not near zero (implying
very close to the lens), having a certain limit. To facilitate the meaningful yet random
generation of capture conditions, we limit the COC size up to $c_{max}$. Thereby, the upper bound of $\alpha$, denoted by $\alpha_{up}$, is:
$$\alpha_{up} = c_{max} \cdot \min\left(\frac{x_{far}}{|x_{far} - S_1|}, \frac{x_{near}}{|x_{near} - S_1|}\right). \qquad (3.2)$$
Now $\alpha$ is randomly sampled within $[0, \alpha_{up}]$. We then apply Gaussian blur to the image with kernel standard deviation $\sigma$, where we empirically define $\sigma(x) = \frac{c(x)}{4}$.
To blur an image based on the computed COC sizes, we first decompose the
image into discrete layers according to per-pixel depth values, where the maximum
number of layers is limited to 350. Then, we apply Gaussian blur to each layer with
σ(x), blurring both the image and mask of the layer. Finally, we alpha-blend blurred
layer images in the back-to-front order using the blurred masks as alpha values. In
addition to defocused images, we generate labels (i.e., defocus maps), which trivially
record σ(x) as the amounts of per-pixel blur. This layer-driven defocus blur is similar
to the algorithm of [81], but we bypass the matting step as we do not put different
depths into the same layer.
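The layered synthesis can be sketched as follows; this is a simplified, single-channel version with a fixed number of layers and a per-layer representative sigma, not the exact SYNDOF generation code.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def render_defocus(img, depth, S1, alpha, num_layers=32):
    """Blur each depth layer and its mask with sigma(x) = c(x) / 4 (Eq. 3.1), then
    alpha-blend the blurred layers in back-to-front order."""
    c = alpha * np.abs(depth - S1) / np.maximum(depth, 1e-6)   # per-pixel COC diameter
    sigma = c / 4.0                                            # per-pixel blur amount (the label)
    bins = np.linspace(depth.min(), depth.max(), num_layers + 1)
    out = np.zeros_like(img)
    acc = np.zeros_like(img)
    for i in range(num_layers - 1, -1, -1):                    # farthest layer first
        mask = ((depth >= bins[i]) & (depth <= bins[i + 1])).astype(img.dtype)
        if mask.sum() == 0:
            continue
        s = float((sigma * mask).sum() / mask.sum())           # representative sigma of the layer
        layer, m = gaussian_filter(img * mask, s), gaussian_filter(mask, s)
        out = layer + out * (1.0 - m)                          # composite nearer layer over farther
        acc = m + acc * (1.0 - m)
    return out / np.maximum(acc, 1e-6), sigma                  # defocused image, defocus map
```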
Our SYNDOF dataset enables a network to accurately estimate a defocus map due
to the following properties. First, our defocus map is densely (per-pixel and not binary)
labeled. The dense labels respect the scene structure, including object boundaries
and depth discontinuities, and resolve ambiguities in homogeneous regions. Second,
object positions in the original sharp image are used when pixels are labeled with
blur amounts in a defocused image. Then, even if the network encounters a mixture
of COCs (called the partial occlusion [82]), a blurry pixel is supervised to have the
COC size that it had in the sharp image. Note that the other COCs in the mixture are
irrelevant at the pixel, as they come from nearby foreground or hidden surfaces (not
revealed in the sharp image) [83]. This clarification of which COCs are estimated in a
defocus map is a drastic improvement over the previous studies.
Network Design Our DMENet has a novel architecture for estimating a defocus map
from a defocused image (Fig. 3.2). The network consists of four subnetworks: blur
estimation (B), domain adaptation (D), content preservation (C), and sharpness cali-
bration networks (S).
The blur estimation network B is the main component of our DMENet and super-
vised with ground-truth synthetic defocus maps from the SYNDOF dataset to predict
blur amounts given an image. To enable network B to measure the blur amounts on real
defocused images, we attach the domain adaptation network D to it, which minimizes
Figure 3.2: Architecture of DMENet. During training, we utilize all four subnetworks:
blur estimation (B), domain adaptation (D), content preservation (C) and sharpness
calibration (S) networks. They are jointly trained to learn blur amounts from synthetic
defocused images while minimizing the domain difference between synthetic and real
defocused images. For testing, we only utilize network B for estimating a defocus map
given a real defocused image.
domain differences between synthetic and real features. The content preservation net-
work C supplements network B to avoid a blurry output. The sharpness calibration
network S allows real domain features to induce correct sharpness in a defocus map
by informing network B whether the given real domain feature corresponds to a sharp
or blurred pixel.
Training Our ultimate goal is to train the blur estimation network B to estimate blur
amounts of real images. To achieve this, we jointly train networks B, D, and S pa-
rameterized by θB , θD , and θS , respectively, with three different training sets. Note
that network C is fixed during our training. $D_S = \{\ldots, \langle I_S^n, y^n\rangle, \ldots\}$ is a training set of synthetic defocused images with ground-truth defocus maps, where $I_S^n$ and $y^n$ are the $n$-th image and the corresponding defocus map, respectively. $D_R = \{\ldots, I_R^n, \ldots\}$ is a training set of real defocused images with no labels. Lastly, $D_B = \{\ldots, \langle I_B^n, b^n\rangle, \ldots\}$ is a training set of real defocused images $I_B^n$ with ground-truth binary blur maps $b^n$, where $b^n$ is labeled as sharp or blurred at each pixel.
Given the training datasets, we alternatingly train θB and θS with a loss Lg , and
θD with a loss Ld , following the common practice of adversarial training. Loss Lg is
defined as:
$$L_g = \frac{1}{|D_S|} \sum_{n=1}^{|D_S|} \left\{ L_B(I_S^n, y^n; \theta_B) + \lambda_C L_C(I_S^n, y^n; \theta_B) \right\} + \frac{1}{|D_B|} \sum_{n=1}^{|D_B|} \left\{ \lambda_{adv} L_{adv}(I_B^n; \theta_B) + \lambda_S L_S(I_B^n, b^n; \theta_B, \theta_S) \right\} + \frac{\lambda_{adv}}{|D_R|} \sum_{n=1}^{|D_R|} L_{adv}(I_R^n; \theta_B), \qquad (3.3)$$
where |D| is the number of elements in a set D. LB , LC , LS , and Ladv are blur
map loss, content preservation loss, sharpness calibration loss, and adversarial loss,
respectively, which will be discussed later. λC , λS , and λadv are hyper-parameters to
balance the loss terms. Loss Ld is defined as:
$$L_d = \frac{\lambda_D}{|D_S|} \sum_{n=1}^{|D_S|} L_D(I_S^n, 1; \theta_D) + \frac{\lambda_D}{|D_R| + |D_B|} \left( \sum_{n=1}^{|D_R|} L_D(I_R^n, 0; \theta_D) + \sum_{n=1}^{|D_B|} L_D(I_B^n, 0; \theta_D) \right), \qquad (3.4)$$
Synthetic defocused images with ground-truth defocus maps, ⟨IS , y⟩ ∈ DS , are used to train network B with the blur map loss LB and the content preservation loss LC , which suppresses blurriness in the prediction B(IS ) using the network C.
Real defocused images with binary blur maps, ⟨IB , b⟩ ∈ DB , are used to calibrate
sharpness measurement from domain transferred features. With the supervision of b,
the sharpness calibration loss LS guides network S to classify whether an estimated
defocus map B(IB ) has correct blur amounts, eventually calibrating network B to
estimate correct degrees of sharpness from domain transferred features.
Finally, IS ∈ DS , IB ∈ DB , and IR ∈ DR are used together to minimize domain
difference between features extracted from synthetic and real defocused images. For
images IS , the ground-truth domain labels are synthetic, while IB and IR are labeled
as real. We minimize the discriminator loss LD and the adversarial loss Ladv in an
adversarial way, in which we train network D to correctly classify the domains of
features from different inputs, while training network B to confuse D.
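A schematic PyTorch training step for this alternating scheme is sketched below; the network modules, loss callables, and weight dictionary are placeholders passed in by the caller, and their exact signatures are assumptions made purely for illustration.

```python
def train_step(net_B, net_S, net_D, opt_g, opt_d, batch, losses, w):
    """One alternating update: (theta_B, theta_S) with L_g (Eq. 3.3), then theta_D with L_d (Eq. 3.4)."""
    I_S, y, I_R, I_B, b = batch   # synthetic pair, real unlabeled, real with binary blur map

    # L_g: blur-map and content losses on synthetic data; adversarial and
    # sharpness-calibration losses on real data.
    L_g = (losses["B"](net_B, I_S, y) + w["C"] * losses["C"](net_B, I_S, y)
           + w["adv"] * losses["adv"](net_D, net_B, I_B)
           + w["S"] * losses["S"](net_S, net_B, I_B, b)
           + w["adv"] * losses["adv"](net_D, net_B, I_R))
    opt_g.zero_grad(); L_g.backward(); opt_g.step()

    # L_d: train the discriminator D to label synthetic features 1 and real features 0.
    L_d = w["D"] * (losses["D"](net_D, net_B, I_S, 1)
                    + losses["D"](net_D, net_B, I_R, 0)
                    + losses["D"](net_D, net_B, I_B, 0))
    opt_d.zero_grad(); L_d.backward(); opt_d.step()
    return L_g.item(), L_d.item()
```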
In the remaining section, we describe the four networks and their associated loss
functions in more detail.
The blur estimation network B is the core module in our DMENet. We adopt a
fully convolutional network (FCN) [50], which is based on the U-net architecture [33]
with slight changes. We initialize the encoder using a pre-trained VGG19 [84] for
better feature representations at the initial stage of training. The decoder uses up-
sampling convolution instead of deconvolution to avoid checkerboard artifacts [85].
We also apply scale-wise auxiliary loss at each up-sampling layer to guide the multi-
scale prediction of a defocus map. This structure induces our network not only to be
robust on various object scales but also to consider global and local contexts with large
receptive fields. After the last up-sampling layer of the decoder, we attach convolution
blocks with short skip connections to refine domain-adapted features.
We use mean squared error (MSE) for the loss function LB to estimate the over-
all structure of a defocus map and densely predict blur amounts in regions. Given a
synthetic defocused image IS of size W × H, LB is defined as:
$$L_B = \frac{1}{WH} \sum_{i=1}^{W} \sum_{j=1}^{H} \left( B(I_S; \theta_B)_{i,j} - y_{i,j} \right)^2 + \lambda_{aux} L_{aux}, \qquad (3.5)$$
where B(IS ; θB )i,j is the amount of blur of IS predicted by network B at pixel (i, j)
with learning parameters θB . yi,j is the corresponding ground-truth defocus value.
Laux is the scale-wise auxiliary loss defined as:
$$L_{aux} = \sum_{\ell=1}^{L_B} \frac{1}{W_\ell H_\ell} \sum_{i=1}^{W_\ell} \sum_{j=1}^{H_\ell} \left( \hat{B}_\ell(I_S; \theta_B, \theta_{aux})_{i,j} - y_{\ell,i,j} \right)^2, \qquad (3.6)$$
where $\hat{B}_\ell(I_S; \theta_B, \theta_{aux}) = A_\ell(B_\ell(I_S; \theta_B); \theta_{aux})$ is the output at the ℓ-th up-sampling
level of network B converted to a defocus map by a small auxiliary network Aℓ pa-
rameterized by θaux . Each auxiliary network Aℓ consists of two convolutional layers,
where the number of kernels in the first layer varies with level ℓ. λaux is a balance pa-
rameter. Wℓ × Hℓ is the size of a defocus map at the ℓ-th level. yℓ is the ground-truth
defocus map resized to Wℓ × Hℓ . LB is the number of up-sampling layers in B.
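A minimal PyTorch sketch of Eqs. (3.5)-(3.6) is shown below; the auxiliary predictions are assumed to be a list of lower-resolution defocus maps produced by the heads $A_\ell$, and the bilinear resizing of the ground truth is an illustrative choice.

```python
import torch.nn.functional as F

def blur_map_loss(pred, aux_preds, y, lam_aux=1.0):
    """L_B (Eq. 3.5) plus the scale-wise auxiliary loss L_aux (Eq. 3.6).
    pred: (B,1,H,W); aux_preds: list of (B,1,H_l,W_l); y: (B,1,H,W) ground truth."""
    loss = F.mse_loss(pred, y)                        # full-resolution MSE term
    aux = 0.0
    for p in aux_preds:                               # one term per up-sampling level
        y_l = F.interpolate(y, size=p.shape[-2:], mode="bilinear", align_corners=False)
        aux = aux + F.mse_loss(p, y_l)                # compare against the resized GT map
    return loss + lam_aux * aux
```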
Our domain adaptation network D compares the features of real and synthetic de-
focused images captured by the blur estimation network B. We use adversarial training
for network D so that the two domains have the same distributions in terms of extracted
features. In principle, D is a discriminator in the GAN framework [86], but in our case,
it makes the characteristics of the captured features of real and synthetic defocused
images indistinguishable. We design D as a CNN with four convolution layers, each
of which is followed by a batch normalization layer [87] and leaky rectified-linear-
unit (ReLU) activation [88].
Discriminator Loss We first train the network D as a discriminator to classify features
from synthetic and real domains with the discriminator loss LD , defined as:
where z is a label indicating whether the input feature comes from a real or synthetic
defocused image, i.e., whether the input image I is real or synthetic; z = 0 if the
feature is real and z = 1 otherwise. Blast (I, θB ) returns the feature maps of the last
up-sampling layer of B for image I. Note that here we only train the parameters of the
discriminator, θD .
Adversarial Loss We then train network B to minimize the domain difference be-
tween features of synthetic and real-world defocused images. Given a real-world de-
focused image IR , we define the adversarial loss Ladv for domain adaptation as:
Our blur estimation loss LB is an MSE loss and has the nature of producing blurry
outputs, as it takes the smallest value with the average of desirable targets [89]. To
reduce the artifact, we use a content preservation loss [90] that measures the distance
in a feature space ϕ, rather than in the image space itself. We define our content
preservation network C as the pre-trained VGG19 [84]. During training, network B is
optimized to minimize:
$$L_C = \frac{1}{W_\ell H_\ell} \sum_{i=1}^{W_\ell} \sum_{j=1}^{H_\ell} \left( \phi_\ell(B(I_S; \theta_B))_{i,j} - \phi_\ell(y)_{i,j} \right)^2, \qquad (3.9)$$
where Wℓ × Hℓ is the size of a feature map ϕℓ (·) at the last convolution layer in the
ℓ-th max pooling block of VGG19.
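This perceptual-style loss can be sketched in PyTorch as follows; the VGG19 cut-off index, the ImageNet weights tag, and the single-channel-to-RGB repetition are assumptions for illustration rather than the exact implementation.

```python
import torch
import torch.nn.functional as F
import torchvision

class ContentLoss(torch.nn.Module):
    """L_C (Eq. 3.9): MSE between VGG19 features of the predicted and GT defocus maps.
    Index 27 cuts the feature stack right before the 4th max-pooling layer."""
    def __init__(self, cutoff=27):
        super().__init__()
        vgg = torchvision.models.vgg19(weights="IMAGENET1K_V1").features[:cutoff]
        for p in vgg.parameters():
            p.requires_grad_(False)                   # network C is fixed during training
        self.phi = vgg.eval()

    def forward(self, pred, y):
        # defocus maps are single-channel; repeat to 3 channels for VGG (an assumption)
        return F.mse_loss(self.phi(pred.repeat(1, 3, 1, 1)),
                          self.phi(y.repeat(1, 3, 1, 1)))
```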
cross entropy loss for optimization:
$$L_S = \frac{1}{WH} \sum_{i=1}^{W} \sum_{j=1}^{H} \left( \frac{1}{1 + \exp\left(-S(B(I_B; \theta_B); \theta_S)_{i,j}\right)} - b_{i,j} \right)^2, \qquad (3.10)$$
3.4 Experiments
This section reports our experiments that assess the performance of DMENet in
generating defocus maps. We first summarize the setting of our experiments, then dis-
cuss the influence of reciprocal connections between the subnetworks, B, D, C, and S.
We then compare our results with the state-of-the-art methods on CUHK dataset [44]
and RTF dataset [24], followed by a few applications of our DMENet.
Training Details We use Adam [91] for optimizing our network. The network is
trained with a batch size of 4, and the learning rate is initially set to 0.0001 with an
exponential decay rate of 0.8 for every 20 epochs. Our model converged after around
60 epochs. The loss coefficients in Eqs. 3.3, 3.4 and 3.5 are set to: λadv = 1e−3,
λD = 1.0, λC = 1e−4, λS = 2e−2, and λaux = 1.0. We use ℓ = 4 for ϕℓ , which
indicates the last convolution layer before the fourth max-pooling layer in VGG19.
We jointly train all the networks in an end-to-end manner on a PC with an NVIDIA
GeForce TITAN-Xp (12GB).
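These settings map directly onto a standard optimizer and step-decay schedule, sketched below under the assumption that `params` collects the trainable parameters of networks B, D, and S.

```python
import torch

def make_optimizer(params):
    """Adam with learning rate 1e-4, decayed by a factor of 0.8 every 20 epochs."""
    optimizer = torch.optim.Adam(params, lr=1e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.8)
    return optimizer, scheduler
```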
(a) Input, (b) DMENetB , (c) DMENetBD , (d) DMENetBDC , (e) DMENetBDCS w/o Laux , (f) DMENetBDCS , (g) GT
Figure 3.3: Outputs generated with incremental additions of subnetworks in our net-
work. The top row shows the defocus maps estimated from a synthetic defocus-blur
image, and the bottom row shows the results given a real DOF image. We can observe
that each subnetwork improves the quality of the output. In (g), the ground-truths are
a defocus map in SYNDOF dataset (top) and a binary blur mask in the CUHK dataset
(bottom).
Dataset For synthetic defocused images IS used for training network B (Eq. 3.5), C
(Eq. 3.9), and D (Eq. 3.7), we use images of SYNDOF dataset. We limit the maximum
size cmax of COC to 28. For real defocused images IR for domain adaptation, we
used 2,200 real defocused images collected from Flickr and 504 images from CUHK
blur detection dataset [44]. For sharpness calibration, we also use the same 504 im-
ages from the CUHK dataset for real defocused images IB , which require binary blur
maps. During training, we augment all images with random flips, rotations, and crops.
For evaluation, we used 200 images of the CUHK dataset and 22 images of the RTF
dataset [24], which are not used for training.
Figure 3.4: Defocus maps generated with different convolutional filter sizes in the
sharpness calibration network S: 3 × 3 filter (middle) and 1 × 1 filter (right), given an
input image (left). The larger filter leads to a smudged defocus map.
Effect of Subnetworks Fig. 3.3 shows outputs generated with incremental additions of subnetworks. With the domain adaptation subnetwork, DMENetBD estimates the degree of a blur for a real image to some extent (Fig. 3.3c), yet with blurry output. Adding the content preservation subnetwork (DMENetBDC ) effectively removes blur artifacts from the estimated defocus map, enhancing the estimation in texture regions
(Fig. 3.3d). Finally, with the sharpness calibration subnetwork S, DMENetBDCS cor-
rectly classifies real-domain features corresponding to blurry or sharp regions (Fig. 3.3f).
We also compare results of DMENetBDCS with and without the scale-wise auxiliary
loss Laux (Eq. 3.6). Fig. 3.3e demonstrates that the network without the auxiliary
module generates a less clear and inaccurate defocus map.
This section reports further evaluations performed regarding a larger kernel size
and binarization for the sharpness calibration network S.
Larger Kernel The filter size of the convolutional layers is 1×1 in the network S. We
can use a larger kernel size, but it might degrade the accuracy of a generated defocus
map. With a larger kernel, the receptive field of S becomes larger, and then gradients
passed from S to the blur estimation network B would be propagated to larger regions
than the receptive fields of B. Fig. 3.4 shows an example of a degraded defocus map.
Datasets | DMENetBDC | DMENetBDCS
SYNDOF | 0.015 / 0.093 | 0.011 / 0.072
RTF | 0.019 / 0.159 | 0.012 / 0.088
Table 3.2: Errors of the defocus maps generated with and without the sharpness cal-
ibration network on the SYNDOF and RTF datasets. Mean Squared Error (MSE) /
Mean Absolute Error (MAE) are used as error metrics.
Binarization One concern that may arise is that the network S may binarize a defocus
map generated by the network B. The sharpness calibration loss (LS ) would try to
binarize the result if it is directly applied to network B. However, we attach network S
to B, allowing the gradients of loss LS to be flexibly applied to B. For example, assume
that B has estimated a perfect defocus map. If LS is directly applied to B, LS would
be non-zero and influence B towards a binary map. On the other hand, in our case, LS
is applied to S that is attached to B, and LS will be non-zero only when S does not
produce a relevant binary blur map. Consequently, S gives room to B for being trained
to correctly estimate a defocus map without being directly governed by LS .
To quantitatively show the effect of sharpness calibration on the accuracy, we
measured the errors of the defocus maps estimated with and without the network S,
using the ground truth defocus maps from the test sets of SYNDOF and RTF datasets.
In Table 3.2, the errors for both synthetic and real images are reduced with sharpness
calibration, which indirectly shows our framework avoids binarization degradation.
We compare our results with the state-of-the-art methods [42, 23, 25, 27, 46]. For
ours, we used the final model DMENetBDCS . To quantitatively assess the quality, we
measured the accuracy and precision-recall of each method for 200 test images from
the CUHK blur detection dataset. As the dataset contains only binary blur maps for the
ground-truths, we convert estimated defocus maps into binary blur maps. Following
the method of Park et al. [25], the threshold τ for binarization is determined by τ =
α vmax + (1 − α) vmin , where vmax and vmin are the maximum and minimum values in the estimated defocus map, respectively, and α = 0.3.

Figure 3.5: Blur map binarization accuracy (%) on the 200 CUHK test images for the compared methods (Zhuo et al. [42], Karaali et al. [27], Shi et al. [23], Park et al. [25]) and ours (DMENetBDCS); the plotted accuracies are 72.96, 76.54, 77.81, 84.08, and 87.35, with ours achieving the highest (87.35%).
Figure 3.6: Precision-recall curves of ours and the compared methods, computed by binarizing defocus maps with thresholds swept from vmin to vmax.
Figs. 3.5 and 3.6 show quantitative comparison results. Our network significantly
outperforms the previous methods in accuracy, which is the ratio of correctly classi-
fied pixels in a given image. Precision-recall curves also show the superiority of our
method in detecting blurred regions, where the curves are computed using defocus
maps binarized with different levels of τ that are adjusted from vmin to vmax .
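The binarization used for this evaluation can be sketched as follows; the convention that values above the threshold are labeled as blurred is an assumption consistent with larger defocus values indicating stronger blur.

```python
import numpy as np

def binarize(defocus_map, alpha=0.3):
    """Threshold tau = alpha * v_max + (1 - alpha) * v_min; pixels >= tau are labeled blurred."""
    v_min, v_max = float(defocus_map.min()), float(defocus_map.max())
    tau = alpha * v_max + (1.0 - alpha) * v_min
    return defocus_map >= tau

def accuracy(defocus_map, gt_blur_mask, alpha=0.3):
    """Ratio of pixels whose binarized label matches the ground-truth binary blur mask."""
    return float((binarize(defocus_map, alpha) == gt_blur_mask.astype(bool)).mean())
```

Sweeping alpha (and hence the threshold) from vmin to vmax yields the precision-recall curves.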
Fig. 3.7 visually compares results generated by our network against previous
methods, confirming the benefits of ours. First, our defocus maps show a more contin-
uous spectrum for the degrees of blur compared to others. In the first row of Fig. 3.7,
our results exhibit less noise and smoother transitions with depth changes. Second, our
network estimates more accurate blur for objects (e.g., human, sky), as it is trained to
consider scene contexts with a mixture of COCs at object boundaries and ground-truth
Figure 3.7: Qualitative comparison between DMENetBDCS and other methods: (a)
Inputs and the defocus maps estimated by (b) Zhuo et al. [42], (c) Shi et al. [23], (d)
Park et al. [25], (e) Karaali et al. [27], (f) ours, and (g) ground-truth binary blur masks.
blur amounts on object interiors. In the second row of the figure, our result shows
coherently labeled blur amounts while clearly respecting object boundaries. In the
third row, our method estimates consistent blur amounts both for the box surface and
the symbol on it, while some other methods differently handle the symbol due to its
strong edges. Lastly, our method is more robust in homogeneous regions. In the sec-
ond and fourth rows, our results show a little smudginess around some objects, but they
are still accurate in terms of relative depths. For instance, the sky should be farther
than the mountain, which is not necessarily preserved with other methods.
We also report qualitative results compared with the most recent approach [46], whose implementation has not been made publicly available yet. Fig. 3.8 shows that our model can handle a wider depth range of a scene. While our defocus map includes all the people who are located throughout the depth range in the scene, the result of [46] only deals with people within a narrow depth range.
In addition, we conducted an evaluation on RTF dataset [24], which consists of 22
Figure 3.8: Qualitative comparison with [46]. From left to right: input, defocus map
estimated by [46] and ours.
Table 3.3: Evaluation of defocus map estimation results on the RTF dataset in terms of mean squared error (MSE) and mean absolute error (MAE).
real defocused images and ground truth defocus maps labeled with radii of disc PSFs.
For all compared methods considering Gaussian PSF (including ours), we rescaled
defocus maps using a conversion function that authors of [24] provide, which maps
a Gaussian PSF into a disc PSF by measuring the closest fit. Our network shows the
state-of-the-art accuracy on the dataset (Table 3.3).
Our estimated defocus map can be naturally utilized for deblurring (Fig. 3.9).
From the estimated defocus map, we generate a Gaussian blur kernel for each pixel
with the estimated σ. We then use a non-blind image deconvolution technique lever-
aging a hyper-Laplacian prior [31]; to handle spatially varying blur, we applied deconvolution to each layer of images that are decomposed so that each image layer has the same blur amount according to an estimated defocus map. Then, we obtained the
final deblurred results by combining deconvolved layer images.
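A simplified sketch of this layered deblurring is given below; it substitutes Richardson-Lucy deconvolution for the hyper-Laplacian solver of [31] and uses a coarse layer quantization, so it only illustrates the overall procedure under those assumptions.

```python
import numpy as np
from skimage.restoration import richardson_lucy

def gaussian_psf(sigma):
    """2-D Gaussian PSF with standard deviation sigma."""
    r = max(1, int(3 * sigma))
    ax = np.arange(-r, r + 1)
    xx, yy = np.meshgrid(ax, ax)
    psf = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * max(sigma, 1e-3) ** 2))
    return psf / psf.sum()

def layered_deblur(img, defocus_map, num_layers=8, iters=20):
    """Decompose the image into layers of nearly constant blur, deconvolve each layer
    with its Gaussian PSF, and merge the per-layer results."""
    bins = np.linspace(defocus_map.min(), defocus_map.max(), num_layers + 1)
    out = img.copy()
    for i in range(num_layers):
        mask = (defocus_map >= bins[i]) & (defocus_map <= bins[i + 1])
        sigma = 0.5 * (bins[i] + bins[i + 1])
        if not mask.any() or sigma < 0.3:
            continue                                   # keep nearly sharp pixels as-is
        deconv = richardson_lucy(img, gaussian_psf(sigma), iters)
        out[mask] = deconv[mask]
    return out
```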
(a) Input (b) [27] (c) Ours
Figure 3.9: Qualitative comparison on defocus deblurring using defocus map esti-
mated by (b) Karaali et al. [27] and (c) our DMENetBDCS .
Figure 3.10: Defocus blur magnification using the defocus map estimated by
DMENetBDCS . From left to right: input and our blur magnification result.
3.4.6 Applications
Defocus Blur Magnification Given an input image and its estimated defocus map,
we can generate a magnified defocus-blur image (Fig. 3.10). We first estimate the blur
amount σi,j for each pixel using DMENetBDCS . Then, we blur each pixel using m · σi,j as the σ of the Gaussian blur kernel, where m is a magnifying scale (m = 8 in Fig. 3.10). We
used the same blur algorithm used for generating our SYNDOF dataset. The defocus
blur magnification result demonstrates the accuracy of our estimated defocus map.
Depth from Blur Even without the presence of precise parameters related to the op-
tical geometry (focus point, focal length, and aperture number), we can approximate
the pseudo-depth using a scaled defocus map in a limited yet common scenario (i.e.,
Figure 3.11: Depth from our defocus map estimated by DMENetBDCS . From left to
right: input, depth from our estimated defocus map, and ground-truth depth.
a focus point is at either depth znear or zfar ). We used a light-field dataset [92, 93] to
compare with the ground-truth depth map. Fig. 3.11 shows our estimated defocus map
can provide a good approximation for the depth map.
3.5 Discussion
Limitation The proposed network works best with LDR images, and strong highlights
(i.e., bokeh) may not be properly handled. We plan to improve the SYNDOF dataset by
including a more diverse and realistic DOF rendering technique (e.g., distributed ray
tracing [94]), and a more realistic optical model (e.g., a thick-lens or compound-lens
model).
IV. Defocus Deblurring Using Iterative Filter
Adaptive Network
4.1 Motivation
large blur [21].
In this chapter, we propose an end-to-end network embedded with our novel It-
erative Filter Adaptive Network (IFAN) for single image defocus deblurring. IFAN is
specifically designed for the effective handling of spatially varying and large defocus
blur. To handle the spatially varying nature of defocus blur, IFAN adopts an adap-
tive filter prediction scheme motivated by recent filter adaptive networks (FANs) [99,
21]. Specifically, IFAN does not directly predict pixel values but generates spatially-
adaptive per-pixel deblurring filters, which are then applied to features from an input
defocused image to generate deblurred features.
To efficiently handle large defocus blur that requires large receptive fields, IFAN
predicts stacks of small-sized separable filters instead of conventional filters, unlike
previous FANs. To apply predicted separable filters to features, we also propose a
novel Iterative Adaptive Convolution (IAC) layer that iteratively applies separable fil-
ters to features. As a result, IFAN significantly improves the deblurring quality at a
low computational cost in the presence of spatially varying and large defocus blur.
To further improve the single image deblurring quality, we train our network with
novel defocus-specific tasks: defocus disparity estimation and reblurring. The learning
of defocus disparity estimation exploits dual-pixel data, which provides stereo images
with a tiny baseline, whose disparities are proportional to defocus blur magnitudes
[100, 101, 18]. Leveraging dual-pixel stereo images, we train IFAN to predict the
disparity map from a single image so that it can also learn to predict blur magnitudes
more accurately.
On the other hand, the learning of reblurring task, which is motivated by the
reblur-to-deblur scheme in [102], utilizes deblurring filters predicted by IFAN for re-
blurring all-in-focus images. For accurate reblurring, IFAN needs to predict deblur-
ring filters that contain accurate information about the shapes and sizes of defocus
blur. During training, we introduce an additional network that inverts predicted de-
blurring filters to reblurring filters and reblurs the ground-truth all-in-focus image. We
then train IFAN to minimize the difference between the defocused input image and the
corresponding reblurred image. We experimentally show that both tasks significantly
boost the deblurring quality.
To verify the effectiveness of our method on diverse real-world images from dif-
ferent cameras, we extensively evaluate the method on several real-world datasets such
as the DPDD dataset [18], Pixel dual-pixel test set [18], and CUHK blur detection
dataset [44]. In addition, for quantitative evaluation, we present the Real Depth of
Field (RealDOF) test set that provides real-world defocused images and their ground-
truth all-in-focus images.
To summarize, our contributions include:
• Iterative Filter Adaptive Network (IFAN) that effectively handles spatially vary-
ing and large defocus blur,
• a novel training scheme that utilizes the learning of defocus disparity estimation
and reblurring, and

• the Real Depth of Field (RealDOF) test set that enables quantitative evaluation of single image defocus deblurring methods on real-world defocused images.
In this section, we first introduce the Iterative Adaptive Convolution (IAC) layer,
which forms the basis of our IFAN (Sec. 4.2.1). Then, we present our deblurring
network based on IFAN with detailed explanations of each component (Sec. 4.2.2).
Finally, we explain our training strategy exploiting the disparity estimation and reblur-
ring tasks (Sec. 4.2.3).
Filter Adaptive Networks FANs have been proposed to facilitate the spatially-adaptive
handling of features in various tasks [103, 104, 105, 106, 107, 108, 21, 99, 109, 110].
FANs commonly consist of two components: prediction of spatially-adaptive filters and transformation of features using the predicted filters, where the latter component is called filter adaptive convolution (FAC).

Figure 4.1: Overview of the filter adaptive convolution (FAC) layer.

Various FANs have been proposed and ap-
plied to different tasks, such as frame interpolation [103, 104, 105], denoising [106],
super-resolution [107, 108], semantic segmentation [109], and point cloud segmenta-
tion [110]. For the deblurring task, Zhang et al. [99] proposed pixel-recurrent adaptive
convolution for motion deblurring. However, their method requires massive computa-
tional costs as the recurrent neural network must run for each pixel. Zhou et al. [21]
proposed a novel filter adaptive convolution layer for frame alignment and video de-
blurring. However, handling large motion blur requires predicting large filters, which
results in huge computational costs. Our IAC is inspired by the filter adaptive convo-
lution (FAC) [21] that forms the basis of previous FANs. For a better understanding of
IAC, we first briefly review FAC.
Fig. 4.1 shows an overview of a FAC layer. A FAC layer takes a feature map e
and a map of spatially varying convolution filters F as input. e and F have the same
spatial size h×w. The input feature map e has c channels. At each spatial location in
F is a $ck^2$-dimensional vector representing $c$ convolution filters of size $k \times k$. For each spatial location $(x, y)$, the FAC layer generates $c$ convolution filters from F by reshaping the vector at $(x, y)$. Then, the layer applies the filters to the features in $e$ centered at $(x, y)$ in a channel-wise manner to generate an output feature map $\hat{e} \in \mathbb{R}^{h \times w \times c}$.
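A compact PyTorch sketch of such a FAC layer is given below; it uses `unfold` to gather k × k neighborhoods and is an illustrative re-implementation rather than the exact layer of [21].

```python
import torch
import torch.nn.functional as F

def fac(e, filters, k):
    """Filter adaptive convolution: apply a per-pixel, per-channel k x k filter to e.
    e: (B, C, H, W); filters: (B, C*k*k, H, W), reshaped from the predicted filter map F."""
    B, C, H, W = e.shape
    patches = F.unfold(e, kernel_size=k, padding=k // 2)   # (B, C*k*k, H*W)
    patches = patches.view(B, C, k * k, H, W)
    weights = filters.view(B, C, k * k, H, W)
    return (patches * weights).sum(dim=2)                  # weighted sum per pixel and channel
```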
Figure 4.2: Overview of the Iterative Adaptive Convolution (IAC) layer, which iteratively applies N sets of predicted separable filters (two 1-dim filters and a bias per set) to the input feature map.
For effectively handling spatially varying and large defocus blur, it is critical to
secure large receptive fields. However, while FAC facilitates spatially-adaptive pro-
cessing of features, increasing the filter size to cover wider receptive fields results in
huge memory consumption and computational cost. To resolve this limitation, we pro-
pose the IAC layer that iteratively applies small-sized separable filters to efficiently
enlarge the receptive fields with little computational overhead.
Fig. 4.2 shows an overview of our IAC layer. Similarly to FAC, IAC takes a
feature map e and a map of spatially varying filters F as input, whose spatial sizes are
the same. At each spatial location in F is an $Nc(2k+1)$-dimensional vector representing $N$ sets of filters $\{\mathbf{F}^1, \mathbf{F}^2, \cdots, \mathbf{F}^N\}$. The $n$-th filter set $\mathbf{F}^n$ has two 1-dim filters $\mathbf{f}_1^n$ and $\mathbf{f}_2^n$ of the sizes $k \times 1$ and $1 \times k$, respectively, and one bias vector $\mathbf{b}^n$. $\mathbf{f}_1^n$, $\mathbf{f}_2^n$, and $\mathbf{b}^n$ have $c$ channels. The IAC layer decomposes the vector in F at each location into filters and
bias vectors, and iteratively applies them to e in a channel-wise manner to produce an
output feature map ê.
Let $\hat{e}^n$ denote the $n$-th intermediate feature map after applying the $n$-th separable filters and bias, where $\hat{e}^0 = e$ and $\hat{e}^N = \hat{e}$. Then, the IAC layer computes $\hat{e}^n$ for $n \in \{1, \cdots, N\}$ as follows:
$$\hat{e}^n = \mathrm{LReLU}\left(\mathbf{f}_2^n * \left(\mathbf{f}_1^n * \hat{e}^{n-1}\right) + \mathbf{b}^n\right), \qquad (4.1)$$
where LReLU is the leaky rectified linear unit [111], and ∗ is the channel-wise convo-
lution operator that performs convolutions in a spatially-adaptive manner.
Separable filters in our IAC layer play a key role in resolving the limitation of the
FAC layer. Xu et al. [13] showed that a convolutional network with 1-dim filters can
successfully approximate a large inverse filter for the deconvolution task. Similarly,
our IAC layer secures larger receptive fields at much lower memory and computational
costs than the FAC layer by utilizing 1-dim filters instead of 2-dim convolutions. How-
ever, compared to dense 2-dim convolution filters in the FAC layer, our separable fil-
ters may not provide enough accuracy for deblurring filters. We handle this problem
by iteratively applying separable filters to fully exploit the nonlinear nature of a deep
network. Our iterative scheme also enables small-sized separable filters to be used for
establishing large receptive fields.
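The sketch below illustrates how an IAC-style layer could apply the predicted separable filters iteratively, following Eq. (4.1) as reconstructed above. The tensor layout, the packing order of the filter parameters, and the LReLU slope are assumptions for illustration, not the actual implementation.

import torch
import torch.nn.functional as F_nn

def iac_layer(e, filters, k, n_iter, slope=0.2):
    """Iterative adaptive convolution with per-pixel separable filters.

    e:       (B, c, h, w) input features.
    filters: (B, n_iter*c*(2k+1), h, w) per-pixel parameters; for each
             iteration: a k x 1 filter f1, a 1 x k filter f2, and a bias b,
             each with c channels (packing order is an assumption).
    """
    B, c, h, w = e.shape
    filters = filters.view(B, n_iter, c, 2 * k + 1, h, w)
    out = e
    for n in range(n_iter):
        f1 = filters[:, n, :, :k]           # (B, c, k, h, w): vertical k x 1 filter
        f2 = filters[:, n, :, k:2 * k]      # (B, c, k, h, w): horizontal 1 x k filter
        b = filters[:, n, :, 2 * k]         # (B, c, h, w): per-channel bias
        # vertical pass: channel-wise, per-pixel filtering of k x 1 neighborhoods
        col = F_nn.unfold(out, kernel_size=(k, 1), padding=(k // 2, 0)).view(B, c, k, h, w)
        out = (col * f1).sum(dim=2)
        # horizontal pass, then bias and nonlinearity
        row = F_nn.unfold(out, kernel_size=(1, k), padding=(0, k // 2)).view(B, c, k, h, w)
        out = F_nn.leaky_relu((row * f2).sum(dim=2) + b, negative_slope=slope)
    return out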
Fig. 4.3 shows an overview of our deblurring network based on IFAN. Our network takes a single defocused image I_B ∈ R^{H×W×3} as input and produces a deblurred output I_BS ∈ R^{H×W×3}, where H and W are the height and width of the image, respectively. The network is built upon a simple encoder-decoder architecture consisting of a feature extractor, a reconstructor, and an IFAN module in the middle. The feature extractor extracts defocused features e_B ∈ R^{h×w×c_e}, where h = H/8 and w = W/8, and feeds them to IFAN.
Figure 4.3: Proposed defocus deblurring network with the Iterative Filter Adaptive Network (IFAN). The feature extractor produces defocused features e_B, IFAN predicts deblurring filters F_deblur and applies them to e_B with IAC to obtain deblurred features e_BS, and the reconstructor (built with convolution, deconvolution, and residual blocks) restores the all-in-focus output I_BS.
IFAN removes defocus blur in the feature domain by predicting spatially varying deblurring filters and applying them to e_B using IAC. The deblurred features e_BS from IFAN are then passed to the reconstructor, which restores an all-in-focus image I_BS. In the following, we describe IFAN in more detail.
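As a structural illustration only, the following sketch assembles the pipeline described above into a single module. The specific layer choices (strided convolutions for the ×8 downsampling, transposed convolutions for reconstruction), the channel width, and the reduction of IFAN to a single filter-prediction convolution are assumptions; iac_layer refers to the sketch given earlier.

import torch
import torch.nn as nn

class DeblurNetSketch(nn.Module):
    """Illustrative encoder-IFAN-decoder pipeline (not the actual architecture)."""

    def __init__(self, ce=64, k=3, n_iter=17):
        super().__init__()
        # feature extractor: H x W x 3 -> (H/8) x (W/8) x ce
        self.extractor = nn.Sequential(
            nn.Conv2d(3, ce, 5, stride=2, padding=2), nn.LeakyReLU(0.2),
            nn.Conv2d(ce, ce, 5, stride=2, padding=2), nn.LeakyReLU(0.2),
            nn.Conv2d(ce, ce, 5, stride=2, padding=2), nn.LeakyReLU(0.2),
        )
        # stand-in for IFAN: predicts N sets of separable deblurring filters per pixel
        self.filter_predictor = nn.Conv2d(ce, n_iter * ce * (2 * k + 1), 3, padding=1)
        # reconstructor: (H/8) x (W/8) x ce -> H x W x 3
        self.reconstructor = nn.Sequential(
            nn.ConvTranspose2d(ce, ce, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.ConvTranspose2d(ce, ce, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.ConvTranspose2d(ce, 3, 4, stride=2, padding=1),
        )
        self.k, self.n_iter = k, n_iter

    def forward(self, I_B):
        e_B = self.extractor(I_B)                 # defocused features
        F_deblur = self.filter_predictor(e_B)     # per-pixel separable filters
        e_BS = iac_layer(e_B, F_deblur, self.k, self.n_iter)  # deblurred features (see earlier sketch)
        return self.reconstructor(e_BS)           # all-in-focus estimate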
4.2.3 Network Training
We train our network using the DPDD dataset with two defocus-specific tasks:
defocus disparity estimation and reblurring. In this section, we explain the training
data, our training strategy using each task, and our final loss function.
Dataset We use dual-pixel images from the DPDD dataset [18] to train our network. A dual-pixel image provides a pair of stereo images with a tiny baseline, whose disparities are proportional to defocus blur magnitudes. The DPDD dataset provides 500 dual-pixel images captured with a Canon EOS 5D Mark IV. Each dual-pixel image is provided in the form of a pair of left and right stereo images I_B^l and I_B^r, respectively. For each dual-pixel image, the dataset also provides a defocused image I_B, which is generated by merging I_B^l and I_B^r, and its corresponding ground-truth all-in-focus image I_S. The 500 dual-pixel images are split into training, validation, and testing sets containing 350, 74, and 76 scenes, respectively. Refer to [18] for more details on the DPDD dataset.

Disparity Estimation We train IFAN with an auxiliary defocus disparity estimation task using a disparity loss defined as:

L_disp = MSE(I_{B↓}^{r→l}, I_{B↓}^l),  (4.2)

where I_{B↓}^{r→l} denotes the downsampled right image warped to the left view using the disparity map predicted in IFAN, and I_{B↓}^l is the downsampled left image.
Figure 4.4: The auxiliary reblurring network. The reblurring network inverts the deblurring filters F_deblur predicted by IFAN into reblurring filters F_reblur, which are applied with IAC to the downsampled ground-truth image I_{S↓} to reproduce the downsampled defocused input I_{SB↓}.
Through L_disp, the filter encoder and disparity map estimator in IFAN are trained to predict an accurate disparity map and defocus magnitudes. Note that we utilize dual-pixel images only for training, and our trained network requires only a single defocused image as its input.
Reblurring We also train IFAN using a reblurring task. For learning to reblur, we introduce an auxiliary reblurring network. The reblurring network is attached at the end of IFAN and trained to invert the deblurring filters F_deblur into reblurring filters F_reblur (Fig. 4.4). Then, using F_reblur, the IAC layer reblurs a downsampled ground-truth image I_{S↓} ∈ R^{h×w×3} to reproduce a downsampled version of the defocused input image. For training IFAN as well as the reblurring network, we use a reblurring loss defined as:

L_reblur = MSE(I_{SB↓}, I_{B↓}),  (4.3)

where I_{SB↓} is the reblurred image obtained from I_{S↓} using F_reblur, and I_{B↓} is the downsampled input image. L_reblur induces IFAN to predict F_deblur containing valid information about blur shapes and sizes needed for accurate reblurring. Such information eventually improves the performance of the deblurring filters used for the final defocus deblurring. Note that we utilize the reblurring network only for training.
Loss Functions In addition to the disparity and reblurring losses, we use a deblurring loss defined as:

L_deblur = MSE(I_BS, I_S),  (4.4)

where I_BS is the deblurred output of our network and I_S is the ground-truth all-in-focus image. Our total loss function is then defined as L_total = L_deblur + L_disp + L_reblur. Each loss term affects different parts of our network. While L_deblur trains the feature extractor, IFAN, and the reconstructor, L_disp trains only the filter encoder and disparity map estimator in IFAN, and L_reblur trains both IFAN and the reblurring network. Note that we use the dual-pixel stereo images (I_B^l, I_B^r) only for training. Both L_deblur and L_reblur utilize I_B, while L_disp utilizes (I_B^l, I_B^r). In this way, we can fully utilize the DPDD dataset for training our network.
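The loss combination can be summarized with a short sketch. The MSE form of each term follows the equations above, while the tensor names are only illustrative placeholders.

import torch.nn.functional as F_nn

def training_losses(I_BS, I_S, I_SB_down, I_B_down, I_B_rl_down, I_B_l_down):
    """Sketch of the three training losses (all assumed to be MSE, as above).

    I_BS / I_S           : deblurred output and ground-truth all-in-focus image.
    I_SB_down / I_B_down : reblurred downsampled GT and downsampled defocused input.
    I_B_rl_down / I_B_l_down : right view warped to the left view and the left view
                               of the dual-pixel pair (both downsampled).
    """
    L_deblur = F_nn.mse_loss(I_BS, I_S)
    L_reblur = F_nn.mse_loss(I_SB_down, I_B_down)
    L_disp = F_nn.mse_loss(I_B_rl_down, I_B_l_down)
    return L_deblur + L_disp + L_reblur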
4.3 Experiments
FP+IAC  DME  RBN | PSNR↑  SSIM↑  MAE(×10⁻¹)↓  LPIPS↓
  –      –    –  | 24.88  0.753  0.416        0.289
  –      ✓    –  | 24.97  0.761  0.412        0.280
  ✓      –    –  | 25.07  0.765  0.406        0.271
  ✓      ✓    –  | 25.18  0.780  0.403        0.233
  ✓      –    ✓  | 25.28  0.780  0.400        0.245
  ✓      ✓    ✓  | 25.37  0.789  0.394        0.217

Table 4.1: Quantitative ablation study on the DPDD dataset [18]. FP, DME, and RBN indicate the filter predictor, disparity map estimator, and reblurring network, respectively. The first row corresponds to the baseline model; for a fair evaluation, the baseline replaces the components of our model with conventional convolution layers and residual blocks of similar parameter numbers and computational costs (baseline: 10.58 M parameters, 364.3 B MACs; full model: 10.48 M parameters, 362.9 B MACs).
(a) Input (b) baseline (c) D (d) F (e) FD (f) FR (g) FDR (h) GT
Figure 4.5: Qualitative results of an ablation study on the DPDD dataset [18]. The first and last columns show a defocused input image and its ground-truth all-in-focus image, respectively. Between the columns, the letters in each sub-caption indicate a combination of components (refer to Table 4.1): F denotes the filter predictor and the IAC layer, D the disparity map estimator, and R the reblurring network. The baseline is the model without F, D, and R. Images in the red and green boxes are zoomed-in cropped patches.
Explicit Deblurring Filter Prediction We first verify the effect of the filter predic-
tion scheme implemented using the filter predictor and IAC layer. Table 4.1 shows that
introducing the filter predictor and IAC layer to the baseline model increases the de-
blurring performance (the first and third rows in the table), confirming the advantage of
explicit pixel-wise filter prediction in flexible handling of spatially varying and large
defocus blur. In addition, compared to the gain (PSNR: 0.36% and LPIPS: 3.21%)
obtained when the disparity map estimator is embedded in the baseline model (the
second row in the table), there is more performance gain (PSNR: 0.44% and LPIPS:
16.31%) when the disparity map estimator is added to a model with the filter predictor
and IAC layer (the fourth row in the table). This observation validates that explicit uti-
lization of deblurring filters has more potential in absorbing extra defocus blur-specific
supervision provided by dual-pixel images. Fig. 4.5 shows a qualitative comparison.
As shown in the figure, the filter predictor and IAC layer substantially enhance the
deblurring quality ((b) and (c) vs. (d) and (e) in the figure).
Disparity Map Estimation and Reblurring We analyze the influence of the dis-
parity map estimator and reblurring network. Specifically, we compare the combina-
tions (filter predictor + IAC + disparity map estimator) and (filter predictor + IAC + reblurring network). Table 4.1 shows that the model with the disparity map estimator performs better than the model with the reblurring network in recovering textures (lower LPIPS), as the disparity map estimator helps more accurately estimate per-pixel blur amounts (Figs. 4.5e and f).
work better restores overall image contents (higher PSNR), as the reblurring network
guides deblurring filters to contain information about blur shapes and blur amounts.
We can also observe that the model with both disparity map estimator and reblurring
network achieves the best performance in every measure (Fig. 4.5g). This shows that
the disparity map estimator and reblurring network have synergistic effects, comple-
menting each other.
In this section, we compare our method with previous defocus map-based ap-
proaches as well as recent end-to-end learning-based approaches: Just Noticeable
Blur estimation (JNB) [23], Edge-Based Defocus Blur estimation (EBDB) [27], Defo-
cus Map Estimation Network (DMENet) [28], DPDNetS and DPDNetD [18]. Among
these, JNB, EBDB, and DMENet are defocus map-based approaches that first estimate
a defocus map and perform non-blind deconvolution. DPDNetS and DPDNetD are
end-to-end learning-based approaches that directly restore all-in-focus images. They
share the same network architecture, but DPDNetS takes a single defocused image as
input while DPDNetD takes a pair of dual-pixel stereo images.
For all the previous methods, we use the code provided by the authors (and, for the learning-based methods DMENet and DPDNet, the released model weights). For JNB, EBDB, and DMENet, we use the deconvolution method [31] to obtain all-in-focus images from the estimated defocus maps. For DPDNetS, we retrain the network on 8-bit images with the training code provided by the authors, as only a model trained with 16-bit images is publicly provided. We measure computational costs in terms of the number of
network parameters, the number of multiply-accumulate operations (MACs) computed
Evaluations on the DPDD Dataset [18] / Computational Costs
Model PSNR↑ SSIM↑ MAE(×10⁻¹)↓ LPIPS↓ Params (M) MACs (B) Time (Sec)
Input 23.89 0.725 0.471 0.349 - - -
JNB [23] 23.69 0.707 0.480 0.442 - - 105.8
EBDB [27] 23.94 0.723 0.468 0.402 - - 96.58
DMENet [28] 23.90 0.720 0.470 0.410 26.94 1172.5 77.70
DPDNetS [18] 24.03 0.735 0.461 0.279 35.25 989.8 0.462
DPDNetD [18] 25.23 0.787 0.401 0.224 35.25 991.4 0.474
Ours 25.37 0.789 0.394 0.217 10.48 362.9 0.014
Table 4.2: Quantitative comparison with previous defocus deblurring methods. All the
methods are evaluated using the code provided by the authors. JNB and EBDB are not
deep learning-based methods, so their parameter numbers and MACs are not available.
DPDNetD [18] takes not a single defocused image but dual-pixel stereo images as
input at test time. All the other methods, including ours, take a single defocused image
as input at test time.
on a 1280 × 720 image, and the average computation time computed on test images.
For JNB and EBDB, which are not learning-based methods, we measure only their
computation times.
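For reference, a minimal sketch of how parameter counts and average per-image computation times could be measured is given below. A random tensor of the stated size stands in for a 1280×720 test image, which is an assumption about the protocol; MACs would require a separate profiler and are not computed here.

import time
import torch

def count_params_and_time(model, size=(1, 3, 720, 1280), runs=20, device="cuda"):
    """Return the parameter count (in millions) and average forward time (seconds)."""
    n_params = sum(p.numel() for p in model.parameters()) / 1e6
    model = model.to(device).eval()
    x = torch.randn(size, device=device)
    with torch.no_grad():
        for _ in range(3):                  # warm-up iterations
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.time()
        for _ in range(runs):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
    return n_params, (time.time() - start) / runs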
We compare the methods on the DPDD test set [18]. Table 4.2 shows a quantita-
tive comparison. The previous defocus map-based methods show poor performance on the real-world blurred images in the DPDD test set, even lower than that of the input defocused images, due to their restrictive blur models. On the other hand, the recent
end-to-end approaches, DPDNetD and DPDNetS , achieve higher quality compared to
the previous methods. Our model outperforms DPDNetS by a significant gap with a
smaller computational cost. Moreover, although our model uses a single defocused
image, it outperforms DPDNetD as well, proving the effectiveness of our approach.
Fig. 4.6 shows a qualitative comparison. Due to their inaccurate defocus maps
and restricted blur models, the results of the defocus map-based methods have a large
amount of remaining blur (Fig. 4.6b). DPDNetS and DPDNetD produce better results than the previous methods, but still tend to produce artifacts and remaining blur
(a) Input (b) EBDB (c) DPDNetS (d) DPDNetD (e) Ours (f) GT
Figure 4.6: Qualitative comparison on the DPDD dataset [18]. The first and last
columns show defocused input images and their ground-truth all-in-focus images, re-
spectively. Between the columns, we show the deblurring results of different methods.
Note that DPDNetD requires a pair of dual-pixel stereo images as input, while other
methods, including ours, require only a single image at test time.
(Figs. 4.6c and 4.6d). On the other hand, our method shows more accurate deblur-
ring results (Fig. 4.6e), even with a single defocused input. In particular, compared to DPDNetD, Fig. 4.6 shows that our method better handles spatially varying blur (the
first row), large blur (the second row), and image structures as well as textures (the
third row).
As our method is trained on the DPDD training set, which is captured with a specific camera (a Canon EOS 5D Mark IV), a natural question is how well the model generalizes to images from different cameras. To answer this question, we evaluate the performance of our approach on other test sets.
Model PSNR↑ SSIM↑ MAE(×10⁻¹)↓ LPIPS↓
Input 22.33 0.633 0.513 0.524
JNB [23] 22.36 0.635 0.511 0.601
EBDB [27] 22.38 0.638 0.509 0.594
DMENet [28] 22.41 0.639 0.508 0.597
DPDNetS [18] 22.67 0.666 0.506 0.420
Ours 24.71 0.748 0.407 0.306

Table 4.3: Quantitative comparison on the RealDOF test set.
Figure 4.7: Qualitative comparison on the RealDOF test set. From left to right: a
defocused input image, deblurred results of DPDNetS [18] and our method, and a
ground-truth image.
RealDOF Test Set To quantitatively measure the performance of our method on real-
world defocus blur images, we prepare a new dataset named Real Depth of Field (Re-
alDOF) test set. RealDOF consists of 50 scenes. For each scene, the dataset provides a
pair of a defocused image and its corresponding all-in-focus image. To capture image pairs of the same scene with different depths of field, we built a dual-camera system with a beam splitter, as described in [117]. Specifically, our system consists of two cameras attached to a vertical rig with a beam splitter. We used two Sony a7R IV cameras, which do not support dual pixels, with Sony 135mm F1.8 lenses. The sys-
tem is also equipped with a multi-camera trigger to synchronize the camera shutters to
capture images simultaneously. The captured images are post-processed for geometric
and photometric alignments, similarly to [117].
Table 4.3 shows a quantitative comparison on the RealDOF test set. The table
(a) Input (b) DPDNetS (c) Ours
Figure 4.8: Qualitative comparison on the CUHK blur detection dataset [44]. From
left to right: a defocused input image, deblurred results of DPDNetS [18] and our
method.
shows that our model clearly improves the image quality, indicating that the model generalizes well to images from other cameras. Moreover, our model significantly out-
performs the previous state-of-the-art single image deblurring method, DPDNetS , by
more than 2 dB in terms of PSNR. Fig. 4.7 qualitatively compares our method and
DPDNetS . While the result of DPDNetS contains some amount of remaining blur,
ours looks much sharper with no remaining blur.
CUHK Blur Detection Dataset The CUHK blur detection dataset [44] provides 704
defocused images collected from the internet without ground-truth all-in-focus images.
Fig. 4.8 shows a qualitative comparison between DPDNetS and ours on the CUHK
dataset. The result shows that our method removes defocus blur and restores fine
details more successfully than DPDNetS .
Pixel Dual-Pixel Test Set The DPDD dataset [18] provides an additional test set con-
sisting of dual-pixel defocused images captured by a Google Pixel 4 smartphone cam-
(a) Input (b) DPDNetS [18] (c) DPDNetD [18] (d) Ours (e) OursD
Figure 4.9: Qualitative comparison on the Pixel Dual-Pixel dataset [18]. The first
columns show defocused input images, and for the other columns, we show deblurring
results of different methods. Images in the red and green boxes are zoomed-in cropped
patches.
era. Fig. 4.9 shows a qualitative comparison between DPDNetS, DPDNetD, and ours on the Pixel dual-pixel test set. The result shows that our method removes defocus blur and restores fine details more successfully than both DPDNet variants.
16-Bit Images Our final model is trained on 8-bit images, as most standard encodings
still rely on 8-bit images. Nonetheless, we also show the capability of our model in
handling high bit-depth images, as the final model of DPDNet is targeted for 16-bit
images. Table 4.4 shows a quantitative comparison on the DPDD dataset between
Evaluations on the DPDD Dataset [18]
Model PSNR↑ SSIM↑ MAE(×10⁻¹)↓ LPIPS↓
DPDNetD 25.23 0.787 0.401 0.224
OursD 25.99 0.804 0.373 0.207

Table 4.4: Quantitative comparison between DPDNetD and our model (OursD), both trained and tested on 16-bit images.
our model and DPDNetD, both trained and tested on 16-bit images. Our model outperforms DPDNetD on all the deblurring metrics.
N (RF) PSNR↑ SSIM↑ MAE(×10⁻¹)↓ LPIPS↓ Params (M) MACs (B)
8 (17) 25.19 0.777 0.404 0.246 9.44 347.9
17 (35) 25.37 0.789 0.394 0.217 10.48 362.9
26 (53) 25.39 0.788 0.393 0.215 11.52 377.9
35 (71) 25.42 0.789 0.391 0.213 12.56 392.8
44 (89) 25.45 0.792 0.389 0.206 13.60 407.8
Table 4.6: Deblurring performance and computational cost with respect to the number
of deblurring filters N evaluated on the DPDD dataset [18]. RF denotes the receptive
field size.
IAC vs. FAC Finally, we analyze the effectiveness of the proposed IAC compared
to FAC [21]. To compare IAC and FAC, we replace the IAC layers in both IFAN
and reblurring network in our final model with FAC layers. For the FAC layers, we
use k = 11 to match the computational cost to that of our final model for fairness.
Table 4.7 and Fig. 4.11 respectively show quantitative and qualitative comparisons
Module PSNR↑ SSIM↑ MAE(×10⁻¹)↓ LPIPS↓ Params (M) MACs (B)
FAC 25.18 0.778 0.406 0.249 10.51 363.4
IAC 25.37 0.789 0.394 0.217 10.48 362.9
Table 4.7: Quantitative comparison between the FAC [21] and IAC layers evaluated on
the DPDD dataset [18]. IAC indicates our model with IAC layers, while FAC indicates
a variant of our model whose IAC layers are replaced with FAC layers. We set the filter
size to 11×11 for the FAC layers for fairness in computational costs.
Figure 4.11: Qualitative comparison between the FAC and IAC layers evaluated on the
DPDD dataset [18]. FAC in (b) means our final model whose IAC layers are replaced
with FAC layers. IAC in (c) means our final model with IAC layers. The input blurred
image has a large defocus blur, so details in the red and green boxes are not visible.
Our final model with IAC shows better-restored details compared to the model with
FAC.
between IAC and FAC. The comparisons show that IAC outperforms FAC even with
fewer parameters and operations, as IAC is better in handling large defocus blur by
covering a much larger receptive field (35×35) on defocused features than the receptive
field (11×11) of FAC.
Effect of Noise Augmentation Level Table 4.8 shows the effect of the noise level used
to augment training images. For training each model in the table, defocused images are
randomly augmented with Gaussian noise, controlled by a random standard deviation
Evaluations on the DPDD Dataset [18]
σ PSNR↑ SSIM↑ MAE(×10⁻¹)↓ LPIPS↓
0.07 25.37 0.789 0.394 0.217
0.14 25.38 0.789 0.395 0.221
0.21 25.39 0.787 0.394 0.224
Table 4.8: Comparison between models trained with different noise augmentation lev-
els. σ indicates the standard deviation of Gaussian noise used to augment defocused
images in the training set.
within a range [0, σ]. We can infer from the table that compared to a model trained
with a low noise level, a model trained with a higher noise level is better at restoring
overall image contents (higher PSNR) but worse at recovering textures (higher LPIPS).
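A minimal sketch of the noise augmentation described above is shown below, assuming images are float tensors scaled to [0, 1]; the clamping step is an added assumption to keep values in a valid range.

import torch

def augment_with_noise(img, sigma_max=0.07):
    """Add Gaussian noise whose std is drawn uniformly from [0, sigma_max]."""
    sigma = torch.rand(1).item() * sigma_max
    return (img + torch.randn_like(img) * sigma).clamp(0.0, 1.0)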
4.4 Discussion
Limitations The proposed network is still limited in handling significantly large de-
focus blur (e.g., Fig. 4.11c has remaining blur). Our network works best with typical
isotropic defocus blur, and may not properly handle blur with irregular shapes (e.g., swirly bokeh, as in the first row of Fig. 4.12) or strong highlights (e.g., glitter bokeh, as in the second row).
(a) Input (b) DPDNetS (c) Ours
Figure 4.12: Failure cases. The input images are from the CUHK blur detection
dataset [44]. From left to right: a defocused input image, deblurred results of
DPDNetS [18] and our method.
V. Reference-Based Multi-Camera Video
Super-Resolution
5.1 Motivation
Recent mobile devices such as the Apple iPhone and Samsung Galaxy series are equipped with two or three asymmetric cameras, typically with different but fixed focal lengths. In a triple-camera setting, the ultra-wide, wide-angle, and telephoto cameras each have a different field of view (FoV) and optical zoom factor. One advantage of such a configuration is that, compared to an ultra-wide camera, a wide-angle camera captures a subject with more detail and higher resolution, and the advantage grows even further with a telephoto camera. A question that naturally follows is why not leverage the higher-resolution frames of a camera with a longer focal length to improve the resolution of the frames of a camera with a shorter focal length.
Utilizing a reference (Ref) image to reconstruct a high-resolution (HR) image
from a low-resolution (LR) image has been widely studied in previous reference-based
image super-resolution (RefSR) approaches [34, 35, 36, 37, 38, 39, 40, 41]. However, utilizing a Ref video for video super-resolution (VSR) has not yet been explored. In this dissertation, we expand RefSR to the VSR task and introduce reference-
based video super-resolution (RefVSR) that can be applied to videos captured in an
asymmetric multi-camera setting.
RefVSR inherits objectives of both RefSR and VSR tasks and utilizes a Ref video
for reconstructing an HR video from an LR video. Applying RefVSR for a video
captured in an asymmetric multi-camera setting requires consideration of the unique
relationship between LR and Ref frames in multi-camera videos. In the setting, a
pair of LR and Ref frames at each time step shares almost the same content in their
Figure 5.1: Comparison on 8K 4×SR video results from a real HD video between
state-of-the-art (SOTA) RefSR approach [41] and the proposed RefVSR approach. Our
method learns to super-resolve an LR video by utilizing relevant high-quality patches
of reference frames and robustly recovers sharp textures of both inside and outside the
overlapped FoV between the input ultra-wide and reference wide-angle frames (white
dashed box).
overlapped FoV (top and middle rows of the leftmost column in Fig. 5.1). Moreover, as a video exhibits motion, neighboring Ref frames might contain high-quality content useful for recovering the region outside the overlapped FoV (the bottom row of the leftmost column in Fig. 5.1).
For successful RefVSR in an asymmetric multi-camera setting, we take advan-
tage of temporal Ref frames in reconstructing regions both inside and outside the
overlapped FoV. In previous RefSR approaches [37, 38, 39, 41], global matching has
been a common choice for establishing non-local correspondence between a pair of
LR and Ref images. However, given a pair of LR and Ref video sequences, it is not
straightforward to directly apply global matching between an LR frame and multiple
Ref frames. To utilize as many frames as possible in the global matching for large
real-world videos (e.g., HD videos), we need a framework capable of managing Ref
frames in a memory-efficient way.
We propose the first end-to-end learning-based RefVSR network that can gen-
erally be applied for super-resolving an LR video using a Ref video. Our network
adopts a bidirectional recurrent pipeline [60, 61, 64] to recurrently align and propagate
Ref features that are fused with the features of LR frames. Our network is efficient in
terms of computation and memory consumption because the global matching needed
for aligning Ref features is performed only between a pair of LR and correspond-
ing Ref frames at each time step. Still, our network is capable of utilizing temporal
Ref frames, as the aligned Ref features are continuously fused and propagated in the
pipeline.
As a key component for managing Ref features in the pipeline, we propose a
propagative temporal fusion module that fuses and propagates only well-matched Ref
features. The module leverages the matching confidence computed during the global
matching between LR and Ref features as the guidance to determine well-matched
Ref features to be fused and propagated. The module also accumulates the matching
confidence throughout the pipeline and uses the accumulated value as guidance when
fusing the propagated temporal Ref features.
To train and validate our model, we present the first RefVSR dataset consisting
of 161 video triplets of ultra-wide, wide-angle, and telephoto videos simultaneously
captured with the triple cameras of a smartphone. Wide-angle and telephoto videos have the same frame size as ultra-wide videos, but their effective resolutions are 2× and 4× that of ultra-wide videos, respectively. With the RefVSR dataset, we train our network to
super-resolve an ultra-wide video 4× to produce an 8K video with the same resolution
as a telephoto video. To this end, we propose a two-stage training strategy that fully
utilizes video triplets in the proposed dataset. We show that, with our training strategy,
Figure 5.2: Overview of the proposed RefVSR network. For each time step, the network takes LR frames I_{t−1}^LR, I_t^LR, I_{t+1}^LR and Ref frames I_{t−1}^Ref, I_t^Ref, I_{t+1}^Ref, and the upsampling module U produces the SR frames I_{t−1}^SR, I_t^SR, I_{t+1}^SR.
• the first RefVSR framework with the focus on videos recorded in an asymmetric
multi-camera setting,
• the propagative temporal fusion module that effectively fuses and propagates
temporal Ref features,
• the RealMCVSR dataset, the first dataset for the RefVSR task, and
• the two-stage training strategy fully utilizing video triplets for real-world 4×VSR.
Fig. 5.2 shows an overview of the proposed network, which can generally be ap-
plied to a RefVSR task for super-resolving an LR video utilizing a Ref video. Our
network follows a typical bidirectional propagation scheme [60, 64], consisting of
bidirectional recurrent cells F_f and F_b, where the subscripts f and b indicate the forward and backward propagation branches, respectively (Fig. 5.3). Our network is distinguished from previous ones by additional inputs, intermediate features, and modules for utilizing a Ref video sequence.

Specifically, for a time step t, each recurrent cell F_f or F_b takes not only the low-resolution LR frames I_{t±1}^LR at the previous time step and I_t^LR at the current time step, but also a Ref frame I_t^Ref at the current time step. Each cell is also recurrently fed with aggregated LR and Ref features h_{t±1}^{f,b} and accumulated confidence maps c_{t±1}^{f,b} propagated from the previous time step. Here, the accumulated confidence maps are utilized for fusing well-matched Ref features later in each recurrent cell. Finally, each recurrent cell propagates the resulting features h_t^{f,b} and the accumulated matching confidences c_t^{f,b} to the next cell. Formally, we have:

(h_t^{f,b}, c_t^{f,b}) = F_{f,b}(I_t^LR, I_{t±1}^LR, I_t^Ref, h_{t±1}^{f,b}, c_{t±1}^{f,b}).  (5.1)
For reconstructing an SR result I_t^SR, the upsampling module U takes the intermediate features h_t^{f,b} and accumulated matching confidences c_t^{f,b} of both the forward and backward branches. The features are aggregated and upsampled with multiple convolution and pixel-shuffle [118] layers to produce I_t^SR. Mathematically, we have:

I_t^SR = U(h_t^f, h_t^b, c_t^f, c_t^b).  (5.2)
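A minimal sketch of such an upsampling module is given below. The channel width, the assumption that each confidence map has a single channel, and the exact layer ordering are illustrative, not the actual design.

import torch
import torch.nn as nn

class UpsamplerSketch(nn.Module):
    """Aggregate forward/backward features and confidences, then 4x upsample
    with convolution + pixel-shuffle layers (illustrative sketch)."""

    def __init__(self, ch=64):
        super().__init__()
        self.fuse = nn.Conv2d(2 * ch + 2, ch, 3, padding=1)   # [h_f, h_b, c_f, c_b]
        self.up = nn.Sequential(
            nn.Conv2d(ch, 4 * ch, 3, padding=1), nn.PixelShuffle(2), nn.LeakyReLU(0.1),
            nn.Conv2d(ch, 4 * ch, 3, padding=1), nn.PixelShuffle(2), nn.LeakyReLU(0.1),
            nn.Conv2d(ch, 3, 3, padding=1),
        )

    def forward(self, h_f, h_b, c_f, c_b):
        x = self.fuse(torch.cat([h_f, h_b, c_f, c_b], dim=1))
        return self.up(x)   # SR frame at 4x spatial resolution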
For the upsampling module U to accurately reconstruct I_t^SR, the intermediate features h_t^{f,b} should contain details integrated from both the LR and Ref frames in a video sequence. To this end, each recurrent cell F_f and F_b performs inter-frame alignment between the previous and current LR input frames, then aggregates and propagates the features (Sec. 5.2.2). To exploit multiple Ref frames, each recurrent cell aligns the
current Ref features to the current LR frame and fuses the aligned Ref features to the
aggregated features of the previous Ref, LR, and current LR frames using a reference
alignment and propagation module (Sec. 5.2.3). In this way, features of temporally
distant LR input and Ref frames can be recurrently integrated and propagated.
In each recurrent cell F_f and F_b (Fig. 5.3), we first use a flow estimation network S [119] to estimate the optical flow between the LR frame I_t^LR at the current time step and I_{t±1}^LR at the previous time step, and use the flow to align the propagated features h_{t±1}^{f,b} to I_t^LR. Then, using a residual block R, we aggregate the LR frame I_t^LR into the aligned features to obtain temporally aggregated features ĥ_t^{f,b}. Specifically, we have:

w_t^{f,b} = S(I_t^LR, I_{t±1}^LR),
h̃_t^{f,b} = warp(h_{t±1}^{f,b}, w_t^{f,b}),  (5.3)
ĥ_t^{f,b} = R^{f,b}(I_t^LR, h̃_t^{f,b}),
Figure 5.4: Reference Alignment and Propagation (RAP) module. A shared encoder ϕ embeds I_t^LR and I_t^Ref, and a cosine similarity matrix between the embeddings yields a matching index map p_t and a matching confidence map c_t. The reference alignment module produces Ref features aligned to the LR frame, and the propagative temporal fusion module fuses them with the propagated features ĥ_t^{f,b} and the accumulated matching confidence c̃_t^{f,b}, producing h_t^{f,b} and c_t^{f,b}.
where warp(·,·) denotes the warping operation, and w_t^{f,b} is the optical flow estimated by the flow estimation network S. Note that the temporally aggregated features ĥ_t^{f,b} contain details aggregated from multiple LR features, as well as temporal Ref features propagated from neighboring cells.
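The following sketch illustrates the temporal aggregation of Eq. (5.3). Here flow_net and res_block stand in for S and R, whose internals are not specified in this section, and the flow-based warping is a generic grid_sample implementation assumed for illustration.

import torch
import torch.nn.functional as F_nn

def flow_warp(feat, flow):
    """Backward-warp features (B, C, H, W) with an optical flow (B, 2, H, W)."""
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=feat.device, dtype=feat.dtype),
        torch.arange(W, device=feat.device, dtype=feat.dtype),
        indexing="ij",
    )
    grid_x = 2.0 * (xs + flow[:, 0]) / max(W - 1, 1) - 1.0
    grid_y = 2.0 * (ys + flow[:, 1]) / max(H - 1, 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1)       # (B, H, W, 2) in [-1, 1]
    return F_nn.grid_sample(feat, grid, align_corners=True)

def temporal_aggregation(I_t, I_prev, h_prev, flow_net, res_block):
    """Sketch of Eq. (5.3): estimate flow, warp propagated features, aggregate."""
    w_t = flow_net(I_t, I_prev)                        # flow from I_t to I_prev
    h_tilde = flow_warp(h_prev, w_t)                   # align h_{t±1} to I_t
    h_hat = res_block(torch.cat([I_t, h_tilde], dim=1))  # aggregate the current LR frame
    return h_hat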
We now describe the reference alignment and propagation module of each cell F_f and F_b, which fuses the current Ref frame I_t^Ref into the temporally aggregated features ĥ_t^{f,b}.
Our reference alignment and propagation module (Fig. 5.4) consists of three sub-modules: a cosine similarity module, a reference alignment module, and a propagative temporal fusion module. The cosine similarity module computes a cosine similarity matrix between the Ref frame I_t^Ref and the target LR frame I_t^LR and derives an index map p_t and a confidence map c_t needed by the other two sub-modules. The reference alignment module extracts a feature map from the current Ref frame I_t^Ref and warps the feature map to I_t^LR using the index map p_t. Then, the propagative temporal fusion module fuses the aligned Ref features with the temporally aggregated features ĥ_t^{f,b}. In the following, we describe each sub-module in more detail.
Cosine Similarity Module To compute an index map p_t and a confidence map c_t, we first embed I_t^LR and the downsampled Ref frame I_{t↓}^Ref into a feature space using a shared encoder ϕ [120].

The temporally aggregated features ĥ_t^{f,b} contain aggregated temporal Ref features propagated from neighboring recurrent cells. For a successful fusion, the propagative temporal fusion module has to fuse ĥ_t^{f,b} and the aligned Ref features h̃_t^Ref in a way that selects the Ref
Figure 5.5: Propagative temporal fusion module. The aligned Ref features h̃_t^Ref and the temporally aggregated features ĥ_t^{f,b} are fused under the guidance of the matching confidence c_t and the accumulated matching confidence c̃_t^{f,b}, producing h_t^{f,b}; the confidences are merged with a max operation into c_t^{f,b}.
features better aligned to the target frame so that well-matched Ref features can keep
propagating to the next cell. Otherwise, erroneous Ref features can be accumulated in
the pipeline, leading to blurry results.
However, a naïve fusion of the Ref features h̃_t^Ref is error-prone, as the matching is not necessarily accurate. Inspired by [38, 41], we thus perform feature fusion guided by the matching confidence c_t, which guides the fusion module to select only well-matched features in h̃_t^Ref. The fusion module also needs guidance for the propagated Ref features aggregated in ĥ_t^{f,b}. This guidance should accommodate the temporal information that coincides with the propagated Ref features maintained in the propagation pipeline. To this end, we accumulate matching confidences throughout the propagation pipeline: the accumulated matching confidence c_{t±1}^{f,b} propagated from the neighboring cell is aligned to the current frame by warping it with the optical flow w_t^{f,b}, yielding c̃_t^{f,b}, which serves as the guidance for the temporally aggregated features ĥ_t^{f,b} during the fusion.
For the fusion, we provide the matching confidence c_t computed between the current target and reference frames, and the aligned matching confidence c̃_t^{f,b} propagated from the neighboring recurrent cell, as guidance. The matching confidences are embedded with a convolution layer so that matching scores of neighboring patches are considered, providing more accurate guidance during the fusion [41]. Formally, the fusion process is defined as:

h_t^{f,b} = {conv([c_t, c̃_t^{f,b}]) ⊗ conv([h̃_t^Ref, ĥ_t^{f,b}])} + ĥ_t^{f,b},  (5.6)
c_t^{f,b} = max(c_t, c̃_t^{f,b}).  (5.7)
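A minimal sketch of the confidence-guided fusion of Eqs. (5.6) and (5.7) is given below; the channel width and the single-channel confidence maps are assumptions.

import torch
import torch.nn as nn

class PropagativeTemporalFusionSketch(nn.Module):
    """Confidence-guided fusion of aligned Ref features with temporally
    aggregated features (illustrative sketch of Eqs. (5.6)-(5.7))."""

    def __init__(self, ch=64):
        super().__init__()
        self.conf_conv = nn.Conv2d(2, ch, 3, padding=1)       # embeds [c_t, c~_t]
        self.feat_conv = nn.Conv2d(2 * ch, ch, 3, padding=1)  # embeds [h~Ref_t, h^_t]

    def forward(self, h_hat, h_ref, c_t, c_tilde):
        guidance = self.conf_conv(torch.cat([c_t, c_tilde], dim=1))
        fused = self.feat_conv(torch.cat([h_ref, h_hat], dim=1))
        h_out = guidance * fused + h_hat          # Eq. (5.6): gated residual fusion
        c_out = torch.maximum(c_t, c_tilde)       # Eq. (5.7): accumulate confidence
        return h_out, c_out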
The resulting 8K video has the same resolution as a telephoto video but is 16× larger in size.

It is worth noting that we use only a wide-angle video as a Ref video. While it may seem reasonable to use a telephoto video as an additional Ref video to achieve the resolution of a telephoto video, we found that it does not improve the super-resolution quality much, because a telephoto video covers only 1/16 of the area of an ultra-wide video.
Training our network to produce 8K videos is not trivial as there are no ground-
truth 8K videos. While we have wide-angle and telephoto videos, they neither cover
the entire area nor perfectly align with an ultra-wide video. To overcome this, we
propose a novel training strategy that fully exploits wide-angle and telephoto videos.
Our training strategy consists of pre-training and adaptation stages. In the pre-
training stage, we downsample ultra-wide and wide-angle videos 4×. We then train the
network to 4× super-resolve a downsampled ultra-wide video using a downsampled
wide-angle video as a reference. The training is done in a supervised manner using
the original ultra-wide video as the ground-truth. Finally, in the adaptation stage, we
fine-tune the network to adapt it to real-world videos of the original sizes. This stage
uses a telephoto video as supervision to train the network to recover high-frequency
details of a telephoto video. The following subsections describe each stage in more
detail.
In this stage, we train our network using two loss functions: a reconstruction
loss motivated by [123, 124, 41] and a multi-Ref fidelity loss. The reconstruction loss
minimizes the low- and high-frequency differences between a super-resolved ultra-
wide frame ItSR and the ground-truth ultra-wide frame ItHR . The reconstruction loss
ℓrec is defined as:
X
SR HR
ℓrec = ∥It,blur − It,blur ∥ + λrec δi (ItSR , ItHR ), (5.8)
i
where the subscript blur indicates filtering with a 3×3 Gaussian kernel with σ = 1.0, and λ_rec is a weight for the second term. δ_i(X, Y) = min_j D(x_i, y_j) is the contextual loss [123] that measures the distance between a pixel x_i in X and its most similar pixel y_j in Y under some feature distance measure D, e.g., a perceptual distance [123, 124, 125].
In the first term on the right-hand side in Eq. 5.8, filtering frames with Gaussian
kernels imposes results to follow low-frequency structures of a ground-truth ultra-wide
frame ItHR . The second term enforces the network to follow the high-frequency details
of ItHR . Note that in the second term, we use the contextual loss even for aligned pairs
ItSR and ItHR , as the loss is verified to be better in boosting the perceptual quality than
the perceptual loss [126] designed for aligned pairs [124].
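As a sketch of Eq. (5.8), the snippet below applies a 3×3 Gaussian filter to both frames and adds a contextual term. The ℓ1 form of the norm, the weight value, and the contextual_loss callable are assumptions standing in for the components described above.

import torch
import torch.nn.functional as F_nn

def gaussian_blur3(x, sigma=1.0):
    """3x3 Gaussian filtering applied per channel (depthwise convolution)."""
    coords = torch.tensor([-1.0, 0.0, 1.0])
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    g = (g / g.sum()).to(x)
    kernel = (g[:, None] * g[None, :]).repeat(x.shape[1], 1, 1, 1)  # (C, 1, 3, 3)
    return F_nn.conv2d(x, kernel, padding=1, groups=x.shape[1])

def reconstruction_loss(I_sr, I_hr, contextual_loss, lam_rec=0.1):
    """Low-frequency term on blurred frames plus a contextual (detail) term."""
    low_freq = F_nn.l1_loss(gaussian_blur3(I_sr), gaussian_blur3(I_hr))
    return low_freq + lam_rec * contextual_loss(I_sr, I_hr)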
To guide the network to take advantage of multiple Ref frames, we encourage Ref features to keep propagating from one cell to the next. Motivated by [41], we propose a multi-Ref fidelity loss. Given a super-resolved ultra-wide frame I_t^SR and ground-truth wide-angle frames I_{t∈Ω}^RefHR, the multi-Ref fidelity loss is defined as:

ℓ_Mfid = ( Σ_{t′∈Ω} Σ_i δ_i(I_t^SR, I_{t′}^RefHR) · c_{t′,i} ) / ( Σ_{t′∈Ω} Σ_i c_{t′,i} ),  (5.9)
where Ω = [t − (k−1)/2, ..., t + (k−1)/2] is the set of frame indices in a temporal window of size k. We use k = 7 in practice. Here, c_{t′,i} is the matching confidence used for weighting the distance δ_i(I_t^SR, I_{t′}^RefHR). Specifically, during training, pixels of I_t^SR with higher matching confidence c_{t′,i} are assigned larger weights for optimization. Eq. 5.9 enables our network to effectively utilize multiple Ref frames I_{t∈Ω}^Ref and keeps the details of multiple Ref frames flowing through the propagation pipeline. Our loss for the pre-training stage is defined as:

ℓ_pre = ℓ_rec(I_t^SR, I_t^HR) + λ_pre ℓ_Mfid(I_t^SR, I_{t∈Ω}^RefHR),  (5.10)
where λpre is a weight for the multi-Ref fidelity loss for the pre-training stage.
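The confidence-weighted loss of Eq. (5.9) can be sketched as follows. Here pixel_distance stands in for the per-pixel contextual distance δ_i, and the per-pixel output shape is an assumption.

import torch

def multi_ref_fidelity_loss(I_sr, refs, confidences, pixel_distance, eps=1e-8):
    """Confidence-weighted average of per-pixel distances between the SR frame
    and the ground-truth Ref frames in a temporal window (lists over Omega)."""
    num, den = 0.0, 0.0
    for I_ref, c in zip(refs, confidences):
        d = pixel_distance(I_sr, I_ref)   # (B, H, W) per-pixel distances d_i
        num = num + (d * c).sum()
        den = den + c.sum()
    return num / (den + eps)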
For adaptation, our network takes real-world ultra-wide frames I_t^UW and wide-angle frames I_t^Wide of HD resolution as LR and Ref frames, respectively. As in the pre-training stage, the adaptation stage separately handles the low and high frequencies of a super-resolved ultra-wide frame I_t^SR. However, as there is no ground-truth frame available for I_t^SR, we downsample I_t^SR and use the input ultra-wide frame I_t^UW as the supervision for recovering low-frequency structures. For recovering high-frequency details, we directly utilize telephoto frames I_{t∈Ω}^Tele as the supervision for the proposed multi-Ref fidelity loss ℓ_Mfid. The adaptation loss is defined as:
ℓ_8K = ∥I_{t↓,blur}^SR − I_{t,blur}^UW∥ + λ_8K ℓ_Mfid(I_t^SR, I_{t∈Ω}^Tele),  (5.11)
where λ_8K is a weight for the multi-Ref fidelity loss in the adaptation stage. The first term constrains our network to reconstruct the low-frequency structures of input ultra-wide frames, and the second term trains our network to transfer the finest high-frequency details of telephoto frames.
5.4 Experiments
Each video is saved in the MOV format using HEVC/H.265 encoding with the HD
resolution (1080×1920). The dataset contains triplets of 161 video clips with 23,107
frames in total. The video triplets are split into training, validation, and testing sets,
each of which has 137, 8, and 16 triplets with 19,426, 1,141, and 2,540 frames, respec-
tively.
I^Ref I^RefHR Patch size PSNR↑ SSIM↑ Params (M)
tele tele 32×32 29.81 0.893 4.277
wide tele 32×32 30.41 0.895 4.277
wide wide 32×32 30.36 0.897 4.277
dual dual 32×32 30.39 0.888 5.076
wide wide 64×64 31.68 0.914 4.277

Table 5.1: Quantitative comparison of models trained with different reference video types. In the top row, I^Ref, I^RefHR, and patch size indicate the input reference, the reference supervision, and the patch size used for the pre-training stage, respectively. 'wide', 'tele', and 'dual' indicate wide-angle, telephoto, and both wide-angle and telephoto videos, respectively. Only the pre-training stage is used for training the models.
Figure 5.6: Modified propagative temporal fusion module for handling dual reference
features.
We use the terms ‘input reference’ and ‘reference supervision’ to indicate a reference video that is fed into the network and a high-resolution video used for additional supervision, respectively. We denote them by I^Ref and I^RefHR, respectively.
To quantitatively analyze the effect of reference combinations, we prepare five models trained with only the pre-training stage (Table 5.1), where each model uses a different combination of wide-angle and telephoto videos for the input reference I^Ref and the reference supervision I^RefHR. We also prepare a model taking dual references, i.e., both wide-angle and telephoto videos. To this end, we modify the reference alignment and propagation module (Sec. 5.2.3) to separately obtain features h̃_t^Wide of a wide-angle frame and features h̃_t^Tele of a telephoto frame that are aligned to I_t^LR. Moreover, the propagative temporal fusion module is modified to take both h̃_t^Wide and h̃_t^Tele, and to utilize the confidence maps c_t^Wide and c_t^Tele computed during matching as guidance for the temporal Ref features during the fusion (Fig. 5.6). For training the model taking dual references, we use both wide-angle frames I_t^Wide and telephoto frames I_t^Tele for the proposed pre-training loss ℓ_pre (Eq. 5.10); specifically, ℓ_pre is modified so that both videos serve as the reference supervision I^RefHR.
We first verify that a wide-angle video is the best option for an input reference
I Ref . In Table 5.1, compared to the model utilizing a telephoto video as I Ref (the first
row of the table), the models using a wide-angle video (from second to fourth rows)
show much better SR performance. This is mainly due to the larger matching coverage
of a wide-angle frame on an ultra-wide frame (about 25%) than that of a telephoto
frame (about 6.25%). Larger matching coverage allows fine details of reference frames
to be widely transferred to a resulting SR frame, which contributes significantly to
reconstructing high-quality results.
Moreover, we verify that a wide-angle video is also the best choice for reference
supervision I RefHR needed for the pre-training stage. While it may look reasonable
to use a telephoto video as I RefHR to transfer the resolution of a telephoto video, we
Figure 5.7: Qualitative comparison on 8K 4×VSR results from models trained with
different reference video types for a supervision I RefHR in the adaptation stage. The
first column shows LR and Ref real-world HD inputs. The other columns show
zoomed-in cropped SR results of models taking wide-angle video as an input ref-
erence I Ref , but trained with different videos for the reference supervision I RefHR
(e.g., ‘wide-tele’ indicates that wide-angle and telephoto videos are used for I Ref and
I RefHR , respectively). Red and green boxes indicate the inside and outside of the over-
lapped FoV between LR and Ref frames, respectively.
found that it does not improve the SR quality much (the second vs. the third row). This is because both wide-angle and telephoto frames lose details when they are downsampled 2× and 4×, respectively, to match the content scale of the resulting SR frames when used as supervision in the pre-training stage.
Another natural question is why not utilize both wide-angle and telephoto videos as dual references. However, the SR performance of the model taking dual references is almost the same as that of the model taking a single wide-angle reference video (the third vs. the fourth row). This indicates that dual reference videos are not worth utilizing: the slight SR performance gain does not justify the extra memory and computational costs (58.1T and 71.5T MACs¹ for the single- and dual-reference models, respectively) needed for processing the additional reference video.
According to the analysis, we take advantage of the broad matching coverage of
a wide-angle video and use it as the input reference I Ref for both pre-training and
adaptation stages. Moreover, we use wide-angle videos as the reference supervision
I RefHR for the pre-training stage as we can have a larger training patch size (64×64 in
practice) than the patch size possible when a telephoto video is used for the supervision (32×32 at maximum, due to the small overlap between a telephoto frame and a downsampled LR frame), which boosts SR quality (the last row in the table).
In the adaptation stage, however, we can use a large patch size even when a telephoto video is used as the reference supervision I^RefHR, because downsampling is not required in the real-world scenario. We thus directly use a telephoto video as the reference supervision I^RefHR for the adaptation stage to take advantage of its finest details in reconstructing SR results from a real-world HD video. The benefit of
taking a telephoto video as supervision for the adaptation stage is qualitatively shown
in Fig. 5.7. In the figure, compared to the model trained with a wide-angle video as
reference supervision I RefHR (the third column in the figure), the model trained with
a telephoto video as I RefHR shows sharper and finer details (the last column).
To analyze the effect of each component of our model, we conduct ablation stud-
ies. First, we validate the effects of the propagative temporal fusion module (Eq. 5.6)
¹Computational costs are measured as the number of multiply-accumulate operations (MACs) computed on 1920×1080 frames.
ℓMfid PTF PSNR↑ SSIM↑ Params (M)
30.71 0.894 4.2768
✓ 31.31 0.913 4.2768
✓ ✓ 31.68 0.914 4.2772
Table 5.2: Quantitative ablation study. The first row corresponds to the baseline model.
ℓMfid and PTF indicate the models trained with Eq. 5.9 and propagative temporal fu-
sion module, respectively.
and multi-Ref fidelity loss ℓMfid (Eq. 5.9). To this end, we compare the stripped-out
baseline model with its two variants. The baseline model is trained with ℓrec and ℓMfid ,
but we set the temporal window size k = 1 for ℓMfid , indicating only a single ground-
truth Ref frame is used for computing the loss. Regarding the propagative temporal fusion module, we use a modified one for the baseline model, in which the fusion of Eq. 5.6 is performed without the matching-confidence guidance.
For the other variants, we recover the key components one by one from the baseline
model. For the variant with ℓMfid , we train the baseline model with ℓrec and ℓMfid with
window size k = 7. For the last variant, we attach the propagative temporal fusion
module. For quantitative and qualitative comparison, we compare pre-trained models
(Sec. 5.3.1) and their fine-tuned models (Sec. 5.3.2) on the proposed RealMCVSR test
set, respectively.
Table 5.2 shows quantitative results. The table indicates that compared to the
baseline model (the first row in the table), the model trained with ℓMfid (the second
row) shows much better VSR performance. The model additionally equipped with the
propagative temporal fusion module (the third row) achieves the best results in every
measure.
Fig. 5.8 shows a qualitative comparison. As shown in the figure, the model trained
Figure 5.8: Qualitative ablation study. The first column shows LR and Ref real-world HD inputs. The remaining columns show zoomed-in cropped 4×SR results of bicubic upsampling, the baseline model, the model with ℓ_Mfid, and the model with ℓ_Mfid+PTF (Table 5.2). Red and green boxes indicate the inside and outside of the overlapped FoV between LR and Ref frames, respectively.
with ℓMfid (the fourth column of the figure) enhances details inside (red box) and
outside (green box) the overlapped FoV much better compared to the results of the
baseline model (the third column). The result confirms that ℓMfid enforces temporal
Ref features to keep streaming through the propagation pipeline to be utilized in re-
constructing high-fidelity results. The model attached with the propagative temporal
fusion module shows accurately recovered structures and enhanced details for both
inside and outside the overlapped FoV (the last column). This demonstrates that the propagative temporal fusion module encourages well-matched Ref features to be fused and to flow through the propagation pipeline.
We also validate the effects of the proposed training strategy. Specifically, we
Figure 5.9: Qualitative comparison between the model pre-trained with ℓ_pre and the model fine-tuned with ℓ_8K on real-world HD inputs. From left to right: LR and Ref inputs, bicubic upsampling, ℓ_pre, and ℓ_8K.
qualitatively compare the model pre-trained with the pre-training loss ℓpre (Eq. 5.10)
and the model fine-tuned with the adaptation loss ℓ8K (Eq. 5.11). For comparison, we
show 8K VSR results given real-world HD videos. Note that in the real-world scenario,
there is no ground-truth available for a quantitative comparison. Fig. 5.9 shows the
results. The pre-trained model does not improve the details of a real-world input,
due to the domain gap between real-world inputs and downsampled inputs (the third
column). However, the fine-tuned model shows much higher fidelity results compared
to the pre-trained model (the last column), thanks to the adaptation stage that trains the
network to well adapt to real-world videos.
In this section, we analyze the effect of the propagative temporal fusion module
on the SR quality inside and outside the overlapped FoV between an LR frame and the
Figure 5.10: Effect of the proposed Propagative Temporal Fusion (PTF) module at t = 20, 40, 60, and 80. c_t is the confidence map computed when the input LR frame I_t^LR is matched with the Ref frame I_t^Ref at the current time step, and c̃_t^f is the accumulated matching confidence of the forward propagation branch. As can be seen in the figure, confidence values in c̃_t^f accumulate following the motion in the video. The red and green boxes show zoomed-in cropped patches, with and without PTF, from the regions inside and outside the overlapped FoV between I_t^LR and I_t^Ref, respectively. Note that the matching confidence maps are noisy due to HEVC/H.265 compression artifacts contained in the video frames.
corresponding Ref frame. The proposed propagative temporal fusion module performs fusion between the Ref features h̃_t^Ref at the current time step and the temporally aggregated features ĥ_t^{f,b} propagated from the previous step. During the fusion, the module utilizes the matching confidence c_t and the accumulated matching confidence c_{t±1}^{f,b} as guidance for h̃_t^Ref and ĥ_t^{f,b}, respectively.
The proposed propagative temporal fusion module improves the SR quality of the regions both inside and outside the overlapped FoV, as the module utilizes the accumulated matching confidence c_{t±1}^{f,b} during the fusion. This is because c_{t±1}^{f,b} provides a cue for the propagative temporal fusion module to select well-matched temporal Ref features aggregated in the temporally aggregated features ĥ_t^{f,b}.
Fig. 5.10 qualitatively demonstrates the effect of the propagative temporal fusion
module. For the evaluation, we prepare models with and without the propagative tem-
poral fusion module. For the model without the propagative temporal fusion module, we use a modified fusion module that does not utilize the accumulated matching confidence c_{t±1}^{f,b}; i.e., Eq. 5.6 is computed without the accumulated-confidence guidance.
In Fig. 5.10, the matching confidence c_t at the current time step shows high matching scores mainly concentrated in the region inside the overlapped FoV between an LR frame I_t^LR and a Ref frame I_t^Ref (the third row of the figure). However, in the accumulated matching confidence c̃_t^f, the matching scores spread out following the motion of the video (the fourth row). As we can observe from the figure, compared to the model with the modified fusion module that does not utilize c̃_t^f, the model with the propagative temporal fusion module restores more accurate structures and details in the region inside the overlapped FoV (red boxes in the figure), as c̃_t^f provides better-matched temporal Ref features during the fusion. Moreover, the model with the propagative temporal fusion module shows finer details in the reconstructed SR frames for the region outside the overlapped FoV (green boxes), as c̃_t^f guides temporal Ref features outside the overlapped region to be utilized during the fusion.
F_{f,b}  ℓ_Mfid  PTF | PSNR↑  SSIM↑  Params (M)
  –       –      –  | 30.07  0.890  4.2653
  –       ✓      ✓  | 31.02  0.906  4.2656
  ✓       –      –  | 30.71  0.894  4.2768
  ✓       ✓      ✓  | 31.68  0.914  4.2772

Table 5.3: Ablation study including the bidirectional branches. F_{f,b} indicates the model with bidirectional branches, ℓ_Mfid the model trained with the multi-Ref fidelity loss, and PTF the model with the propagative temporal fusion module.
Table 5.4: Effect of alignment modules for inter-frame alignment (Inter-frame) and
reference alignment (RA). OF and PM indicate the optical flow [119] and patch-match-
based alignment [41] methods, respectively. Our model adopts the combination in the
last row. MACs are computed on 256×256 frames.
The effect of the bidirectional scheme has been widely explored in previous VSR
works [60, 61, 64]. To validate the effect on our method, we conduct the ablation
study on the model with a unidirectional forward branch (Table 5.3), in addition to the
ablation study reported in Table 5.2 (Sec. 5.4.2). In the table, models with bidirectional
branches show better VSR quality than models with only a forward branch. Moreover,
the proposed components (ℓMfid and PTF in the table) improve the VSR performance
for both unidirectional and bidirectional schemes.
5.4.5 Effect of Alignment Methods
For the proposed RefVSR network, different alignment methods can be used for
inter-frame and reference alignments. For inter-frame alignment, resolving local dis-
parity is important [129], and we use flow-based alignment [119] that shows similar
VSR quality to patch-match-based alignment, but with much smaller computational
cost (2nd vs. 4th rows in Table 5.4). For reference alignment, establishing global cor-
respondence is important [59], and we adopt patch-match-based alignment [41] that
shows significant performance gain with slight computational overhead compared to
flow-based alignment (3rd vs. 4th rows in the table).
Model PSNR↑ SSIM↑ Params (M)
Bicubic 26.65 0.800 -

Table 5.5: Quantitative comparison with previous SR methods on the RealMCVSR test set.

We additionally train our models using the ℓ1 loss (indicated with -ℓ1) for a fair comparison with the previous models trained with pixel-based losses, such as ℓ1, ℓ2, and ℓ_ch (Charbonnier loss [132]), which are known to have an advantage in PSNR over perceptual losses [126].
In Table 5.5, while RefSR methods show a better performance than SISR meth-
ods, our methods outperform all previous ones. Interestingly, VSR methods outper-
form RefSR methods that are additionally fed with Ref frames. However, this no longer holds when we measure the performance on regions of the SR frame corresponding to different FoV ranges. Table 5.6 shows the results. For comparison, we measure the SR quality for the region inside the overlapped FoV (0%–50%) between an ultra-wide SR frame and a wide-angle Ref frame. For the outside of the overlapped FoV, we measure the SR performance for banded regions at different FoV ranges from the
PSNR / SSIM measured for regions in the indicated FoV range
Category | Model | 0%–50% | 50%–60% | 50%–70% | 50%–80% | 50%–90% | 50%–100% | Params (M)
- | Bicubic | 25.38 / 0.757 | 26.30 / 0.785 | 26.42 / 0.789 | 26.71 / 0.798 | 26.99 / 0.801 | 27.29 / 0.815 | -
SISR | RCAN-ℓ1 [130] | 29.77 / 0.895 | 30.69 / 0.908 | 30.86 / 0.910 | 31.17 / 0.914 | 31.50 / 0.918 | 31.80 / 0.921 | 15.89
RefSR | DCSR-ℓ1 [41] | 34.90 / 0.963 | 31.96 / 0.927 | 31.61 / 0.921 | 31.58 / 0.919 | 31.81 / 0.921 | 31.93 / 0.923 | 5.419
VSR | IconVSR-ℓch [64] | 32.79 / 0.946 | 33.43 / 0.949 | 33.60 / 0.950 | 33.89 / 0.951 | 34.19 / 0.953 | 34.40 / 0.953 | 7.255
RefVSR | Ours-ℓ1 | 36.02 / 0.971 | 34.59 / 0.958 | 34.31 / 0.956 | 34.23 / 0.954 | 34.40 / 0.955 | 34.50 / 0.954 | 4.277
RefVSR | Ours-IR-ℓ1 | 36.14 / 0.971 | 34.66 / 0.959 | 34.40 / 0.956 | 34.34 / 0.955 | 34.52 / 0.955 | 34.63 / 0.955 | 4.774

Table 5.6: Quantitative results measured with varying FoV ranges. The center 50% of the FoV in an ultra-wide SR frame overlaps with the FoV of a wide-angle reference frame. Here, 0%–50% indicates the region inside the overlapped FoV, and 50%–100% the region outside the overlapped FoV; 50%–60% means the banded region between the center 50% and 60% of an ultra-wide SR frame.
overlapped FoV (50%) to the full FoV (100%). In the table, DCSR [41] outperforms IconVSR [64] in the overlapped FoV (0%–50%) between the input and Ref frames, while IconVSR outperforms DCSR in the remaining regions. Our models exceed all other models in all regions.
Note that in Table 5.6, our models show a performance gap between regions in-
side (0%–50%) and outside (50%–100%) the overlapped FoV. However, compared to
the PSNR/SSIM gap of DCSR (8.5% / 4.2%), our models show a much smaller gap
(Ours-ℓ1 : 4.2% / 1.8% and Ours-IR-ℓ1 : 4.2% / 1.6%). The result implies the proposed
architecture effectively utilizes neighboring Ref features for recovering regions both
inside and outside of the overlapped FoV.
Figure 5.11: Qualitative comparison of 4×SR results on real-world HD videos. From left to right: LR and Ref inputs, (a) Bicubic, (c) RCAN [130], (d) DCSR [41], (e) IconVSR [64], and (f) Ours.

Fig. 5.11 shows a qualitative comparison on real-world HD videos. The results show that the non-reference-based SR methods, RCAN and Icon-
VSR, tend to over-exaggerate textures, while non-textured regions tend to be overly
smoothed out. The RefSR method, DCSR, shows better fidelity than RCAN and Icon-
VSR in the overlapped FoV (red box). However, DCSR tends to smooth out regions
outside the overlapped FoV (green box). Our method shows the best result compared
to the previous ones. Compared to DCSR, our model robustly reconstructs finer details
with balanced fidelity between regions inside and outside the overlapped FoV. More-
over, the details and textures reconstructed outside the FoV are more photo-realistic.
5.5 Discussion
We proposed the first RefVSR framework with a practical focus on videos cap-
tured from asymmetric multi-cameras. To efficiently utilize a Ref video, we adopted a
bidirectional recurrent framework and proposed the propagative temporal fusion mod-
ule to fuse and propagate Ref features well-matched to LR features. To train and validate our network, we also presented the RealMCVSR dataset and a two-stage training strategy that fully utilizes the video triplets in the dataset.
Figure 5.12: Failure cases. The first column shows LR real-world HD input frames.
The other columns show zoomed-in cropped patches of bicubic-upsampled LR and
Ref input frames and SR frames resulting from our method, corresponding to the red
box in an LR frame. In these examples, the LR and Ref input frames do not contain enough cues for accurate matching, and the structures and details in the results are not restored as well as those in the Ref frames.
Limitation Like previous RefSR methods [37, 38, 39, 41], our network consumes a considerable amount of memory when applying global matching between real-world HD frames.
Moreover, our network may fail to accurately utilize Ref frames when matching be-
tween LR and Ref frames is inaccurate. Fig. 5.12 qualitatively shows the failure cases.
In the figure, we show an LR patch (the second column), and its corresponding patches
from a Ref frame (the third column) and a resulting SR frame (the last column). We
can observe from the figure that when an LR frame does not contain enough cues
needed for accurate matching with a Ref frame (e.g., texture patterns), our model fails
to accurately utilize Ref patches for recovering an SR frame.
VI. Reference Memory-Based Multi-Camera
Video Super-Resolution
6.1 Motivation
In this chapter, we propose a reference memory network that takes LR features as queries and linearly combines memorized basis features to constitute reference features. For the reference features to convey high-quality
reference textures, the key and basis features in the memory network are required to
memorize useful reference information, for which we use the entire reference videos
provided in the RealMCVSR training set proposed in Chapter V. For super-resolution,
the retrieved reference features are implicitly utilized by a VSR framework in recon-
structing a high-fidelity SR frame.
Utilizing the memory network has a few merits. First, as the reference memory
network is trained to mine general reference features from various reference videos,
LR features of any region may be mapped to reference features and implicitly utilized by
a VSR network. Second, unlike the RefVSR framework proposed in Chapter V, our
reference memory network can be employed by any VSR framework as a plug-and-
play referencing module to increase its SR quality. Third, thanks to the modularity
of the reference memory network, it can easily be fine-tuned to contain video-specific
reference features. At test time, when a reference video is available in the multi-camera setting, we optimize the pre-trained memory network to learn video-
specific reference information, which enhances the fidelity of the final SR results.
We train and evaluate our model using the RealMCVSR dataset proposed in
Chapter V. As in Chapter V, the proposed model aims to 4× super-resolve an ultra-
wide video using a wide-angle video as a reference, in which we use 4× downsampled
ultra-wide and wide-angle videos as LR and Ref videos, respectively, and the original
ultra-wide video as ground-truth high-resolution video. We verify that the proposed
reference memory network can help improve the SR quality of a VSR framework.
Our contributions are summarized as follows: (1) a plug-and-play reference memory network that enables implicit utilization of reference videos in any VSR framework, (2) a training strategy based on a reference reconstruction task, together with a test-time optimization strategy for memorizing video-specific reference information, and (3) experiments on the RealMCVSR dataset verifying that the memory network improves the SR quality and fidelity balance of a backbone VSR network.
[Figure 6.1: Overview of the proposed reference memory-based VSR framework. The backbone VSR encoder takes temporal LR frames I_{t−n}^LR, ..., I_t^LR, ..., I_{t+n}^LR and produces SR features F_t^SR. In the reference memory network, the LR query encoder extracts an LR query Q_t^LR from I_t^LR, and the memory lookup retrieves reference features R_t^LR from the memory bank (keys, basis features). F_t^SR and R_t^LR are concatenated, fused by a residual block, and fed to the backbone VSR decoder to produce I_t^SR.]
Different from the RefVSR framework proposed in Chapter V that requires ar-
chitectural modification of a conventional VSR framework for explicit utilization of
reference video, the RefVSR framework proposed in this chapter simply inserts a
plug-and-play reference memory network into a VSR network without modifying its
internal architecture. In this section, although our reference memory network can be
inserted into any VSR framework (either sliding window-based [55, 56, 57, 58, 59] or
recurrent framework-based [60, 61, 62, 63, 64]), we use a sliding window-based VSR
framework in a simple encoder-decoder structure to describe our framework.
The overview of our framework is shown in Fig. 6.1. The backbone 4× VSR network takes temporal LR frames {I_{t'}^LR ∈ R^{c×h×w} | t' ∈ [t−n, ..., t, ..., t+n]} as inputs and super-resolves the target LR frame I_t^LR at time step t to produce its resulting SR frame I_t^SR. We insert our reference memory network between the encoder and decoder of the backbone VSR network. The reference memory network is composed of the LR query encoder and the memory bank, which consists of keys and basis features containing useful reference information memorized from reference videos. The reference memory network first extracts the LR query Q_t^LR from the target LR frame I_t^LR using the LR query encoder. Then, we retrieve reference features from the memory bank using its keys and the LR query Q_t^LR. To that end, we use the LR query Q_t^LR and the keys to compute non-local attention, which linearly combines the basis features to constitute the reference features R_t^LR. Note that the LR-queried reference features R_t^LR can implicitly be utilized by a VSR network for high-fidelity SR results, as R_t^LR are trained to carry high-quality reference information. For the backbone VSR network to utilize the reference features retrieved from the memory network, we concatenate the reference features R_t^LR with the SR features F_t^SR produced by the backbone VSR encoder. The concatenated features are then fed to a residual block composed of convolution and nonlinear activation layers. The residual block fuses the SR features F_t^SR and the LR-queried reference features R_t^LR. The fused features are then passed to the backbone VSR decoder, which produces the final SR frame I_t^SR. In the following, we describe the reference memory network in more detail.
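To make the data flow concrete, the following PyTorch-style sketch shows how a plug-and-play reference memory module could be wired between a backbone VSR encoder and decoder. All module names, layer choices, and feature sizes are illustrative assumptions for exposition, not the exact implementation; the memory module here is a stand-in whose lookup operation is detailed in the next part.

import torch
import torch.nn as nn

class MemoryStub(nn.Module):
    # Stand-in for the reference memory network; it only mimics the output shape
    # (b-channel reference features). The actual non-local lookup is sketched later.
    def __init__(self, b=512):
        super().__init__()
        self.proj = nn.Conv2d(3, b, 3, padding=1)

    def forward(self, lr_frame):             # lr_frame: (N, 3, h, w)
        return self.proj(lr_frame)           # R_t^LR: (N, b, h, w)

class MemoryBasedVSR(nn.Module):
    # Toy backbone: a single-frame encoder/decoder stands in for any VSR network;
    # the memory module is inserted between the encoder and the decoder.
    def __init__(self, feat=64, b=512, scale=4):
        super().__init__()
        self.encoder = nn.Conv2d(3, feat, 3, padding=1)
        self.memory = MemoryStub(b)
        self.residual = nn.Sequential(        # fuses F_t^SR and R_t^LR
            nn.Conv2d(feat + b, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 3, padding=1))
        self.decoder = nn.Sequential(
            nn.Conv2d(feat, 3 * scale * scale, 3, padding=1), nn.PixelShuffle(scale))

    def forward(self, lr_t):                  # lr_t: target LR frame I_t^LR, (N, 3, h, w)
        sr_feat = self.encoder(lr_t)          # F_t^SR
        ref_feat = self.memory(lr_t)          # R_t^LR queried from the memory network
        fused = sr_feat + self.residual(torch.cat([sr_feat, ref_feat], dim=1))
        return self.decoder(fused)            # I_t^SR, (N, 3, 4h, 4w)

# usage: sr = MemoryBasedVSR()(torch.randn(1, 3, 64, 64))  # -> (1, 3, 256, 256)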
[Figure 6.2: Detailed reference memory network. The LR query encoder extracts the LR query Q_t^LR from I_t^LR; a softmax over the non-local attention between Q_t^LR and the keys K linearly combines the basis features B into the reference features R_t. The non-local memory lookup operation is performed in parallel for each pixel Q_{t,x}^LR in the LR query Q_t^LR.]
Keeping a fixed number of key and basis feature pairs, rather than storing reference frames explicitly, bounds computation and memory costs. Nonetheless, the memory bank memorizes the temporal reference
information, as we train the memory network to learn to compose reference features
using various reference videos. For the reference features to convey high-quality refer-
ence textures, we use the entire reference videos in the training set for the key and basis
features in the memory bank to memorize useful reference information. Once they are
memorized, reference features queried by LR features will contain high-quality refer-
ence textures useful for super-resolving an LR frame.
Fig. 6.2 illustrates our reference memory network. The reference memory network maintains a memory bank consisting of keys K ∈ R^{k×m} and basis features B ∈ R^{m×b}, which are trainable tensors optimized to learn reference information during the training stage. Here, k and b are the numbers of channels for the keys K and basis features B, respectively, and m is the number of key and basis feature pairs. To query basis features, we first extract LR queries Q_t^LR ∈ R^{hw×k} from an LR query encoder by feeding an LR frame I_t^LR. Then, we compute non-local attention [135], att = Q_t^LR K ∈ R^{hw×m}, between the LR queries Q_t^LR and the keys K in the memory bank. Finally, we compute the reference features R_t^LR using the non-local attention, which is mathematically defined as:

R_t^LR = reshape(softmax(att) B), (6.1)

where softmax(·) performs the softmax operation along the second dimension of its input and reshape(·) is the reshape operation. Now, we propose the training strategy for training keys and basis features in the reference memory network to memorize useful reference information.
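A minimal sketch of the lookup in Eq. 6.1, assuming toy sizes for k, b, and m and treating the keys and basis features as plain tensors (in practice they are trainable parameters):

import torch

def memory_lookup(q, keys, basis):
    # q:     (hw, k)  LR queries Q_t^LR
    # keys:  (k, m)   keys K in the memory bank
    # basis: (m, b)   basis features B in the memory bank
    att = q @ keys                               # non-local attention, (hw, m)
    att = torch.softmax(att, dim=1)              # softmax over the m key/basis pairs
    return att @ basis                           # reference features, (hw, b)

# toy usage with assumed sizes
h, w, k, b, m = 4, 4, 128, 512, 512
q = torch.randn(h * w, k)
keys, basis = torch.randn(k, m), torch.randn(m, b)
ref = memory_lookup(q, keys, basis)              # (16, 512)
ref_map = ref.t().reshape(b, h, w)               # reshape(.) to a (b, h, w) feature map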
6.2.2 Training
[Figure 6.3 diagram: the backbone VSR encoder and decoder with the reference memory network (LR query encoder, memory lookup over keys and basis features, residual block) inserted in between, the RefVSR loss applied to I_t^SR, and the shaded reference reconstruction branch.]
Figure 6.3: Training strategy for the VSR network equipped with our reference mem-
ory network. The reference reconstruction task in the shaded region is used only during
training time.
Reference Video Memorization For the proposed memory network to memorize ref-
erence information, we employ the reference reconstruction task during training time.
The shaded region in Fig. 6.3 illustrates the reference reconstruction task. We first 2× downsample a Ref frame I_t^Ref, which is denoted as I_t^{Ref↓2×}, to match its resolution with the LR frame. I_t^{Ref↓2×} is fed to the Ref query encoder, which extracts the Ref query Q_t^Ref. Note that the weights of the Ref query encoder are not shared with the LR query encoder. Then, Q_t^Ref is used to look up reference features R_t^Ref from the memory network, using the same operation described in Eq. 6.1. The retrieved reference features are then fed to the Ref reconstructor, which produces a residual Ref image. Finally, we 2× upsample I_t^{Ref↓2×}, which we denote as I_t^{Ref↓2×↑2×}, and add it to the predicted residual Ref image to reconstruct the Ref frame I_t^{Ref′}. We implement the reconstructor to be as light as possible, as we want the reference features retrieved from the memory network to contain rich information that is enough to reconstruct I_t^{Ref′}, for which we use only a pixel-shuffle [118] and a convolution layer. For the reference reconstruction task, we use the reference memorization loss L_RMem that minimizes the difference between a reconstructed reference frame I_t^{Ref′} and the ground-truth reference frame I_t^Ref. Mathematically, the loss is defined as:

L_RMem = ‖I_t^{Ref′} − I_t^{Ref}‖. (6.2)

The key role of L_RMem is to train the keys K and basis features B in the memory network to memorize reference information needed to reconstruct reference frames. The keys K will be trained to properly map the Ref query Q_t^Ref to the basis features B, while the basis features will be trained to convey reference information, whose linear combination constitutes reference features R_t^Ref that can be reconstructed into a reference frame I_t^Ref.
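The reference reconstruction branch and L_RMem can be sketched as follows; the query encoder, the memory callable, the interpolation mode, and all sizes are assumptions for illustration, and the memory lookup is abstracted as any module mapping Ref queries to b-channel reference features.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RefReconstructor(nn.Module):
    # Deliberately light reconstructor: one convolution followed by a 2x pixel-shuffle.
    def __init__(self, b=512, scale=2):
        super().__init__()
        self.conv = nn.Conv2d(b, 3 * scale * scale, 3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, ref_feat):              # (N, b, h, w) -> residual Ref image (N, 3, 2h, 2w)
        return self.shuffle(self.conv(ref_feat))

def reference_memorization_loss(ref_frame, ref_query_encoder, memory, reconstructor):
    # ref_frame: ground-truth wide-angle frame I_t^Ref, (N, 3, H, W)
    ref_down = F.interpolate(ref_frame, scale_factor=0.5, mode='bicubic', align_corners=False)
    q = ref_query_encoder(ref_down)           # Ref query Q_t^Ref (not shared with the LR query encoder)
    ref_feat = memory(q)                      # R_t^Ref looked up from keys/basis (Eq. 6.1)
    residual = reconstructor(ref_feat)        # residual Ref image
    ref_up = F.interpolate(ref_down, scale_factor=2.0, mode='bicubic', align_corners=False)
    recon = ref_up + residual                 # reconstructed frame I_t^Ref'
    return F.l1_loss(recon, ref_frame)        # L_RMem (Eq. 6.2)

# toy usage with stand-in modules and assumed sizes
enc = nn.Conv2d(3, 128, 3, padding=1)
mem = nn.Conv2d(128, 512, 1)                  # stands in for the memory lookup
loss = reference_memorization_loss(torch.randn(1, 3, 64, 64), enc, mem, RefReconstructor())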
In addition to the reference reconstruction task, we jointly train the whole framework with the RefVSR task, which serves several roles. First, it trains the LR query encoder so that LR queries retrieve reference features useful for super-resolution. Second, it further optimizes the keys and basis features in the memory bank together with the reference reconstruction task. Third, the task trains the backbone VSR network to properly handle the reference features R_t^LR retrieved from the memory network. Specifically, the backbone network is trained to transform the reference features R_t^LR to be suited to the SR features F_t^SR, and to fuse the reference features with the SR features extracted from the VSR network. For the RefVSR task, we use the RefVSR loss L_RefVSR. The RefVSR loss minimizes the difference between a super-resolved ultra-wide frame I_t^SR and the ground-truth ultra-wide frame I_t^HR. Formally, the loss is defined as:

L_RefVSR = ‖I_t^{SR} − I_t^{HR}‖. (6.3)
Our total loss function to train the reference memory-based network is L_total = L_RMem + L_RefVSR. Note that each loss term updates different parts of our network. L_RMem trains the Ref query encoder, the keys and basis features in the memory network, and the Ref reconstructor. L_RefVSR trains the entire memory network, the residual block after the memory network, and the backbone VSR network.
Backbone and modules                      PSNR↑   SSIM↑   Params (M)
EDVR [56]                                 33.47   0.946   20.63
EDVR [56] + Ref Mem                       33.84   0.951   30.51
BasicVSR [64]                             33.66   0.951   4.851
BasicVSR [64] + Ref Mem                   34.04   0.954   4.849
BasicVSR [64] + Ref Mem + Test Opt        34.24   0.955   4.849
Table 6.1: Quantitative ablation study on our reference memory-based VSR framework with different backbone VSR networks. "Ref Mem" and "Test Opt" denote the proposed reference memory network and the test-time optimization strategy, respectively.
6.3 Experiment
[Figure 6.4 panels: LR↑ / Ref↓, BasicVSR [64], +Ref Mem, +Test Opt]
Figure 6.4: Qualitative ablation study on our reference memory-based VSR framework with different backbone VSR networks. "Ref Mem" and "Test Opt" denote the proposed reference memory network and the test-time optimization strategy, respectively.
The model additionally trained with the test-time optimization strategy shows further improved SR performance (the fourth vs. last rows in the table).
Fig. 6.4 shows the qualitative results. As shown in the figure, the backbone VSR network equipped with our reference memory network (the fourth column of the figure) shows better details inside (red box) and outside (green box) the overlapped FoV between LR and Ref frames compared to the backbone VSR model (the second column). The result confirms that the high-fidelity reference information memorized in the proposed reference memory network helps improve the SR quality of the backbone VSR net-
work. The model, additionally trained with the test time optimization strategy, shows
further enhanced structures and improved details for both regions inside and outside
the overlapped FoV (the last column). This demonstrates the modularity of the proposed memory network, which can take advantage of the RefVSR setting where a reference video is available at test time.
Model PSNR↑ SSIM↑ Params (M)
SISR
Bicubic 26.65 0.800 -
RCAN-ℓ1 [130] 31.07 0.915 15.89
Explicit patch match-based RefSR
TTSR-ℓ1 [38] 30.83 0.911 6.730
DCSR-ℓ1 [41] 32.43 0.933 5.419
VSR
EDVR-M-ℓch [56] 33.26 0.946 3.317
EDVR-ℓch [56] 33.47 0.948 20.63
BasicVSR-ℓch [64] 33.66 0.951 4.851
IconVSR-ℓch [64] 33.80 0.951 7.255
Explicit patch match-based RefVSR
MCVSR-IR-ℓ1 (Chapter V) 34.74 0.959 4.774
Proposed implicit reference memory-based VSR
Ours-ℓ1 (BasicVSR+Ref Mem+Test Opt) 34.24 0.955 4.849
Table 6.2: Quantitative evaluation on the RealMCVSR test set. For our model, we
attach the backbone VSR network, BasicVSR, with the proposed reference memory
network (Ref Mem) and apply test time optimization strategy (Test Opt).
For the compared models, we use our explicit RefVSR model trained in Chapter V and, for the other methods, the code provided by the authors. All compared models are trained
with pixel-based losses, either using ℓ1 or ℓch (Charbonnier loss [132]) loss functions.
For all experiments, ultra-wide HD frames and their 4× downsampled ones are used
as ground-truths and inputs, respectively.
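Under this evaluation protocol, a minimal sketch of preparing an LR/Ref/GT triple and measuring PSNR could look as follows; the bicubic downsampling call and the PSNR definition are standard but assumed, not copied from the actual evaluation code.

import torch
import torch.nn.functional as F

def prepare_eval_triple(ultra_wide, wide, scale=4):
    # ultra_wide: ground-truth HD ultra-wide frame (N, 3, H, W); wide: wide-angle frame.
    lr = F.interpolate(ultra_wide, scale_factor=1.0 / scale, mode='bicubic', align_corners=False)
    ref = F.interpolate(wide, scale_factor=1.0 / scale, mode='bicubic', align_corners=False)
    return lr, ref, ultra_wide                # LR input, Ref input, ground truth

def psnr(x, y, max_val=1.0):
    mse = F.mse_loss(x.clamp(0, max_val), y.clamp(0, max_val))
    return 10.0 * torch.log10(max_val ** 2 / mse)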
Model                                 PSNR / SSIM in the FoV range↑                  Fidelity balance↓       Params
                                      0%–50% (in)      50%–100% (out)                b/w in and out          (M)
Bicubic                               25.38 / 0.757    27.29 / 0.815                 7.0% / 7.1%             -
SISR: RCAN [130]                      29.77 / 0.895    31.80 / 0.921                 6.4% / 2.8%             15.89
RefSR: DCSR [41]                      34.90 / 0.963    31.93 / 0.923                 8.5% / 4.2%             5.419
VSR: BasicVSR [64]                    32.64 / 0.945    34.27 / 0.953                 4.8% / 0.8%             4.851
VSR: IconVSR [64]                     32.79 / 0.946    34.40 / 0.953                 4.7% / 0.7%             7.255
Explicit RefVSR: MCVSR-IR             36.14 / 0.971    34.63 / 0.955                 4.2% / 1.6%             4.774
Implicit RefVSR: Ours w/o Test Opt    34.20 / 0.951    34.64 / 0.961                 1.3% / 1.0%             4.849
Implicit RefVSR: Ours                 34.31 / 0.959    34.70 / 0.964                 1.1% / 0.6%             4.849
Table 6.3: Quantitative results measured with varying FoV range. The center 50% of
FoV in an ultra-wide SR frame is overlapped with the FoV of a wide-angle reference
frame. Here, 0%–50% indicates the region inside the overlapped FoV, and 50%–100%
is the region outside the overlapped FoV.
In Table 6.3, we measure SR performance for regions in the FoV range from the overlapped FoV (50%) to full FoV (100%). In
the table, the SISR methods show the worst SR quality for both regions inside and
outside the overlapped FoV (the first row). The RefSR method (the second row) shows
better SR performance than the VSR methods (the third row) for the region inside the
overlapped FoV, while the VSR methods show better quality for the rest of the regions.
Both explicit and the proposed implicit RefVSR approaches (the fourth and fifth rows)
show better SR performance than the other methods. The explicit RefVSR approach
shows higher SR quality than ours for the region inside the overlapped FoV, whereas
our implicit approach better restores the outside region.
Note that in Table 6.3, we report fidelity balance, for which we measure a perfor-
mance gap of PSNR and SSIM between the regions inside and outside the overlapped
FoV. In the table, our implicit approach shows the best fidelity balance, which implies
that the proposed memory network effectively memorizes useful reference information that is generally exploited for recovering regions both inside and outside the overlapped FoV.
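The fidelity balance can be reproduced from the per-region scores with a relative-gap definition; this particular formula is an assumption, chosen because it matches the numbers reported in Tables 5.6 and 6.3 (e.g., DCSR 8.5% and Ours w/o Test Opt 1.3% for PSNR).

def fidelity_balance(score_in, score_out):
    # Relative gap (%) between the scores measured inside and outside the overlapped FoV.
    return 100.0 * abs(score_in - score_out) / max(score_in, score_out)

print(round(fidelity_balance(34.90, 31.93), 1))   # DCSR PSNR gap -> 8.5
print(round(fidelity_balance(34.20, 34.64), 1))   # Ours w/o Test Opt PSNR gap -> 1.3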
[Figure: qualitative comparison on the RealMCVSR test set; panels: LR↑ / Ref↓, (a) Bicubic, (b) DCSR [41], (c) IconVSR [64], (d) MCVSR-IR, (e) Ours. Compared to the explicit RefVSR model, ours shows better SR quality for the region outside the overlapped FoV and better fidelity balance between the regions inside and outside the overlapped FoV.]
Training strategy for the reference reconstruction task        Reference reconstruction (PSNR / SSIM)    RefVSR (PSNR / SSIM)
Entire wide-angle videos                                        33.91 / 0.947                             34.04 / 0.954
+ Test Opt on sampled video-specific frames                     34.17 / 0.961                             34.12 / 0.955
+ Test Opt on entire video-specific frames                      34.35 / 0.964                             34.24 / 0.955
Table 6.4: Quantitative analysis on the training strategy for the reference reconstruc-
tion task. We compare three model variants of the proposed memory network trained
for the reference reconstruction task with different training strategies. We measure the
effect of the training strategies in terms of the reference reconstruction performance of
each memory network and the SR quality of VSR networks, each of which is embed-
ded with a different memory network. In the table, “Test Opt” indicates the proposed
test-time optimization strategy.
The baseline (first) model variant is the memory network trained without the test-time optimization strategy; it is trained solely with reference videos in the training set.
For the second model variant, we uniformly sample 20 frames from a target reference
video and use them to apply the test time optimization to the baseline model. For the
last variant, we apply the test time optimization to the baseline model using all frames
in a target reference video.
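A sketch of the test-time optimization loop, assuming the reference memory network exposes its trainable parts as parameters and reusing a reference memorization loss such as the one sketched earlier; the optimizer, learning rate, and iteration count are illustrative assumptions.

import torch

def test_time_optimize(memory_net, ref_frames, rmem_loss_fn, steps=200, lr=1e-4):
    # memory_net: pre-trained reference memory network (Ref query encoder, keys,
    #             basis features, reconstructor); only its parameters are updated.
    # ref_frames: frames of the target reference video, either a uniformly sampled
    #             subset or every frame, as in the second and third variants above.
    opt = torch.optim.Adam(memory_net.parameters(), lr=lr)
    memory_net.train()
    for step in range(steps):
        frame = ref_frames[step % len(ref_frames)]   # cycle through the reference frames
        loss = rmem_loss_fn(memory_net, frame)       # L_RMem on this frame
        opt.zero_grad()
        loss.backward()
        opt.step()
    return memory_net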
Table 6.4 shows the quantitative analysis. The baseline memory network trained
without the test time optimization strategy shows the worst reference reconstruction
quality (the first row in the table). The memory network trained with the test time
optimization using sampled frames in a target reference video shows better reference
reconstruction quality than the baseline model (the second row). The memory network
applied with test time optimization using every frame in a target reference video shows
the best reference reconstruction performance (the last row).
We also measure the effect of each memory network in terms of the SR quality
of the VSR network that embeds a memory network for implicit reference utilization
(the fourth and last column in the table). For the evaluation, we use BasicVSR [64]
as the backbone VSR network. The result indicates that the better the reference reconstruction quality of a memory network, the higher the SR performance of the VSR network.

Channel for keys K (k)    Channel for basis features B (b)    # of key and basis feature pairs (m)    Reference reconstruction (PSNR / SSIM)
64                        128                                 128                                     31.42 / 0.918
128                       128                                 128                                     32.82 / 0.932
256                       128                                 128                                     32.82 / 0.932
128                       128                                 256                                     33.00 / 0.931
128                       128                                 512                                     33.19 / 0.935
128                       128                                 1024                                    33.20 / 0.935
128                       512                                 128                                     33.19 / 0.935
128                       512                                 256                                     33.21 / 0.940
128                       512                                 512                                     33.67 / 0.942
128                       512                                 1024                                    34.35 / 0.965
128                       512                                 2048                                    34.39 / 0.964
Table 6.5: Effect of implementation parameters for the memory bank. We quantitatively compare the performance of the reference reconstruction task (Ref recon) of memory networks with different implementation parameters.
6.4 Discussion
We proposed the plug-and-play reference memory network for the implicit refer-
ence utilization in the video super-resolution task. Instead of explicitly feeding refer-
ence videos to the network, we equip a VSR framework with the reference memory
network that maps LR to reference features. For effective implicit mapping, we im-
plement the reference memory network to constitute reference features from a fixed
number of keys and basis reference features. We also proposed the test time optimiza-
tion strategy that fine-tunes the memory network to enhance the memorized reference
information specifically for a reference video. For super-resolution, the retrieved ref-
erence features are implicitly utilized by a VSR framework in reconstructing a high-
fidelity SR frame. In the experiment, we showed that the proposed reference memory
network helps improve the SR quality of a VSR network by providing high-quality ref-
erence information. The SR result of our framework showed the best fidelity balance,
which verifies that reference features retrieved from our memory network are generally
utilized for the entire region of an LR frame.
VII. Conclusion and Future Work
7.1 Summary
In Chapters III and IV, we focused on utilizing auxiliary data for defocus map estimation and single image defocus deblurring, respectively, and showed that the resulting networks analyze and restore defocused images more effectively than previous methods.
In Chapter V, we focused on utilizing multi-camera video triplets for the video
super-resolution (VSR) task, for which we proposed the first reference-based VSR
(RefVSR) framework that utilizes wide-angle and telephoto videos as auxiliary ref-
erences for super-resolving an ultra-wide video. Our model adopts a bidirectional
recurrent framework equipped with the propagative temporal fusion module and cost-
effectively utilizes reference video frames. To train and validate the network, we pro-
vided the RealMCVSR dataset consisting of real-world HD video triplets. An adapta-
tion training strategy is also proposed to fully utilize video triplets in the dataset, and
our RefVSR framework achieves the state-of-the-art 4×VSR performance.
In Chapter VI, we focused on the implicit utilization of temporal reference fea-
tures for the VSR task, for which we proposed a plug-and-play reference memory
network that constitutes reference features from a fixed number of keys and basis fea-
tures. For super-resolution, the memory network is inserted into a VSR network and
reference features queried from the memory network are utilized by the VSR network
to improve the SR quality. We also proposed the test-time optimization strategy to
fine-tune the memory network to memorize video-specific reference information. We
verify that reference features queried from the proposed reference memory network are
implicitly utilized across the entire region of a low-resolution frame and help improve
the final SR quality.
Physical Priors for Defocus Blur In Chapter III, we used real-world defocused im-
ages as auxiliary data to reduce the domain gap between synthetic and real-world de-
focused images. However, there still exists a remaining domain gap because auxiliary
data are leveraged as a semi-supervision. In Chapter IV, although we adopted auxil-
iary dual-pixel images to induce our network for predicting more accurate deblurring
filters, the results still showed remaining blur and often included ringing-like artifacts.
In future work, we would like to provide degradation priors directly related to defocus
blur. We are specifically planning to employ the depth map estimation task, in which
we expect to leverage a physically proportional relationship between depth and blur
amount, which will allow more accurate analysis and removal of defocus blur.
Hybrid Explicit and Implicit Reference Utilization In Chapters V and VI, we demon-
strated that utilizing auxiliary reference features helps improve the reconstruction qual-
ity of super-resolved results. While explicitly utilizing reference features brings a
much higher increase in the super-resolution quality for the regions where reference
features can be explicitly utilized (i.e., overlapping regions between multi-camera
videos), implicit utilization of reference information better enhanced the super-resolution
quality for the region where reference features could not be explicitly utilized (i.e.,
outside the overlapping regions). In future work, we are considering a hybrid refer-
ence utilization. Specifically, we plan to develop a network that may explicitly lever-
age reference features for the regions with strong matching confidence between low-
resolution and reference features (Chapter V). Otherwise, the network may implicitly
utilize reference features queried from the memory network (Chapter VI).
요약문 (Summary in Korean)
Defocus map estimation utilizing real-world defocused image data  Synthetic defocus map data
References
[1] T. Son, J. Kang, N. Kim, S. Cho, and S. Kwak. Urie: Universal image enhance-
ment for visual recognition in the wild. In Proc. ECCV, 2020.
[3] C. Seibold, A. Hilsmann, and P. Eisert. Model-based motion blur estimation for
the improvement of motion tracking. Computer Vision and Image Understand-
ing (CVIU), 160:45–56, 2017.
[4] S. Cho and S. Lee. Fast motion deblurring. ACM Trans. Graphics (TOG),
28(5):145:1–145:8, 2009.
[5] A. Levin, Y. Weiss, F. Durand, and W.T. Freeman. Understanding and evaluat-
ing blind deconvolution algorithms. In Proc. CVPR, 2009.
[8] J. Pan, D. Sun, H. Pfister, and M.-H. Yang. Blind image deblurring using dark
channel prior. In Proc. CVPR, 2016.
[9] A. Agrawal and R. Raskar. Resolving objects at higher resolution from a single
motion-blurred image. In Proc. CVPR, 2007.
[10] W. Zhang and W.-K. Cham. A single image based blind super-resolution ap-
proach. In Proc. ICIP, 2008.
[12] E. Faramarzi, D. Rajan, and M. P. Christensen. Unified blind method for multi-
image super-resolution and single/multi-image blur deconvolution. IEEE Trans.
Image Processing (TIP), 22(6):2101–2114, 2013.
[13] L. Xu, J. S. J. Ren, C. Liu, and J. Jia. Deep convolutional neural network for
image deconvolution. In Proc. NeurIPS, 2014.
[14] C. Dong, C. Change Loy, K. He, and X. Tang. Image super-resolution using
deep convolutional networks. In Proc. ECCV, 2014.
[15] H. Son and S. Lee. Fast non-blind deconvolution via regularized residual net-
works with long/short skip-connections. In Proc. ICCP, 2017.
[19] S. Nah, S. Son, and K. M. Lee. Recurrent neural networks with intra-frame
iterations for video deblurring. In Proc. CVPR, 2019.
[21] S. Zhou, J. Zhang, J. Pan, W. Zuo, H. Xie, and J. Ren. Spatio-temporal filter
adaptive network for video deblurring. In Proc. ICCV, 2019.
[23] J. Shi, L. Xu, and J. Jia. Just noticeable defocus blur detection and estimation.
In Proc. CVPR, 2015.
[25] J. Park, Y.-W. Tai, D. Cho, and I. Kweon. A unified approach of multi-scale
deep and hand-crafted features for defocus estimation. In Proc. CVPR, 2017.
[26] S. Cho and S. Lee. Convergence analysis of MAP based blur kernel estimation.
In Proc. ICCV, 2017.
[27] A. Karaali and C. Jung. Edge-based defocus blur estimation with adaptive scale
selection. IEEE Trans. Image Processing (TIP), 27(3):1126–1137, 2018.
[28] J. Lee, S. Lee, S. Cho, and S. Lee. Deep defocus map estimation using domain
adaptation. In Proc. CVPR, 2019.
[30] A. Levin, R. Fergus, F. Durand, and W. Freeman. Image and depth from a
conventional camera with a coded aperture. In Proc. SIGGRAPH, 2007.
[32] S. Lee, E. Eisemann, and H. Seidel. Real-time lens blur effects and focus con-
trol. ACM Trans. Graphics (TOG), 29(4):65:1–65:7, 2010.
[35] H. Zheng, M. Ji, L. Han, Z. Xu, H. Wang, Y. Liu, and L. Fang. Learn-
ing cross-scale correspondence and patch-based synthesis for reference-based
super-resolution. In Proc. BMVC, 2017.
[38] F. Yang, H. Yang, J. Fu, H. Lu, and B. Guo. Learning texture transformer
network for image super-resolution. In Proc. CVPR, 2020.
[39] Y. Xie, J. Xiao, M. Sun, C. Yao, and K. Huang. Feature representation mat-
ters: End-to-end learning for reference-based image super-resolution. In Proc.
ECCV, 2020.
[42] S. Zhuo and T. Sim. Defocus map estimation from a single image. Pattern
Recognition, 44(9):1852–1858, 2011.
[43] C. Tang, C. Hou, and Z. Song. Defocus map estimation from a single image via
spectrum contrast. Optics Letters, 38(10):1706–1708, 2013.
[44] J. Shi, L. Xu, and J. Jia. Discriminative blur detection features. In Proc. CVPR,
2014.
[45] G. Xu, Y. Quan, and H. Ji. Estimating defocus blur via rank of local patches. In
Proc. ICCV, 2017.
[50] J. Hoffman, D. Wang, F. Yu, and T. Darrell. Fcns in the wild: Pixel-level ad-
versarial and constraint-based adaptation. arXiv preprint, arXiv:1612.02649,
2016.
[51] Y. Chen, W. Chen, Y. Chen, B. Tsai, Y. F. Wang, and M. Sun. No more dis-
crimination: Cross city adaptation of road scene segmenters. In Proc. ICCV,
2017.
[52] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. Efros, and T. Dar-
rell. CyCADA: Cycle-consistent adversarial domain adaptation. In Proc. ICML,
2018.
[53] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsuper-
vised pixel-level domain adaptation with generative adversarial networks. In
Proc. CVPR, 2017.
[55] J. Caballero, C. Ledig, A. Aitken, A. Acosta, J. Totz, Z. Wang, and W. Shi. Real-
time video super-resolution with spatio-temporal networks and motion compen-
sation. In Proc. CVPR, 2017.
[56] X. Wang, K. C. K. Chan, K. Yu, C. Dong, and C. C. Loy. Edvr: Video restora-
tion with enhanced deformable convolutional networks. In Proc. CVPRW, 2019.
[57] S. Li, F. He, B. Du, L. Zhang, Y. Xu, and D. Tao. Fast spatio-temporal residual
network for video super-resolution. In Proc. CVPR, 2019.
[59] W. Li, X. Tao, T. Guo, L. Qi, J. Lu, and J. Jia. Mucan:
Multi-correspondence aggregation network for video super-resolution. In Proc.
ECCV, 2020.
[63] T. Isobe, X. Jia, S. Gu, S. Li, S. Wang, and Q. Tian. Video super-resolution with
recurrent structure-detail network. In Proc. ECCV, 2020.
[64] K. C. K. Chan, X. Wang, K. Yu, C. Dong, and C. C. Loy. Basicvsr: The search
for essential components in video super-resolution and beyond. In Proc. CVPR,
2021.
[67] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus. End-to-end memory net-
works. arXiv preprint, arXiv:1503.08895, 2015.
[69] S. Na, S. Lee, J. Kim, and G. Kim. A read-write memory network for movie
story understanding. In Proc. ICCV, 2017.
[70] T. Yang and A. B. Chan. Learning dynamic memory networks for object track-
ing. In Proc. ECCV, 2018.
[71] H. Seong, J. Hyun, and E. Kim. Kernelized memory network for video object
segmentation. In Proc. ECCV, 2020.
[72] H. K. Cheng, Y.-W. Tai, and C.-K. Tang. Rethinking space-time networks
with improved memory coverage for efficient video object segmentation. arXiv
preprint, arXiv:2106.05210, 2021.
[73] B. Ji and A. Yao. Multi-scale memory-based video deblurring. In Proc. CVPR,
2022.
[74] P. Jiang, H. Ling, J. Yu, and J. Peng. Salient region detection by ufo: Unique-
ness, focusness and objectness. In Proc. ICCV, 2013.
[76] C. Zhou and S. K. Nayar. What are good apertures for defocus deblurring? In
Proc. ICCP, 2009.
[77] J. Wulff, D. J. Butler, G. B. Stanley, and M. J. Black. Lessons and insights from
creating a synthetic optical flow benchmark. In Proc. ECCVW, 2012.
[80] M. Potmesil and I. Chakravarty. A lens and aperture camera model for synthetic
image generation. In Proc. SIGGRAPH, 1982.
[83] S. Lee, G. J. Kim, and S. Choi. Real-time depth-of-field rendering using point
splatting on per-pixel layers. Computer Graphics Forum, 27(7):1955–1962,
2008.
[84] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-
scale image recognition. arXiv preprint, arXiv:1409.1556, 2014.
[87] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network train-
ing by reducing internal covariate shift. In Proc. ICML, 2015.
[88] B. Xu, N. Wang, T. Chen, and M. Li. Empirical evaluation of rectified activa-
tions in convolutional network. arXiv preprint, arXiv:1505.00853, 2015.
[91] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proc.
ICLR, 2014.
[93] G. Wetzstein, D. Lanman, W. Heidrich, and R. Raskar. Layered 3d: Tomo-
graphic image synthesis for attenuation-based light field and high dynamic
range displays. ACM Trans. Graphics (TOG), 30(4):95:1–95:12, 2011.
[94] R. L. Cook, T. Porter, and L. Carpenter. Distributed ray tracing. ACM SIG-
GRAPH Computer Graphics, 18(3):137–145, 1984.
[95] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic
segmentation. In Proc. ICCV, 2015.
[96] L. Wang, D. Li, Y. Zhu, L. Tian, and Y. Shan. Dual super-resolution learning
for semantic segmentation. In Proc. CVPR, 2020.
[97] J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully
convolutional networks. In Proc. NeurIPS, 2016.
[98] J. Cao, H. Cholakkal, R. Anwer, F. Khan, Y. Pang, and L. Shao. D2det: Towards
high quality object detection and instance segmentation. In Proc. CVPR, 2020.
[99] J. Zhang, J. Pan, J. Ren, Y. Song, L. Bao, R. Lau, and M.-H. Yang. Dynamic
scene deblurring using spatially variant recurrent neural networks. In Proc.
CVPR, 2018.
[100] R. Garg, N. Wadhwa, S. Ansari, and J. Barron. Learning single camera depth
estimation using dual-pixels. In Proc. ICCV, 2019.
[102] H. Chen, J. Gu, O. Gallo, M.-Y. Liu, A. Veeraraghavan, and J. Kautz. Re-
blur2deblur: Deblurring videos via self-supervised learning. In Proc. ICCP,
2018.
[104] S. Niklaus, L. Mai, and F. Liu. Video frame interpolation via adaptive convolu-
tion. In Proc. CVPR, 2017.
[105] S. Niklaus, L. Mai, and F. Liu. Video frame interpolation via adaptive separable
convolution. In Proc. ICCV, 2017.
[107] Y. Jo, S. Oh, J. Kang, and S. Kim. Deep video super-resolution network using
dynamic upsampling filters without explicit motion compensation. In Proc.
CVPR, 2018.
[108] X. Wang, K. Yu, C. Dong, and C. C. Loy. Recovering realistic texture in image
super-resolution by deep spatial feature transform. In Proc. CVPR, 2018.
[111] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities
improve neural network acoustic models. In Proc. ICML, 2013.
[114] L. Liu, H. Jiang, W. Chen, X. Liu, J. Gao, and J. Han. On the variance of the
adaptive learning rate and beyond. In Proc. ICLR, 2020.
[115] Z. Wang, A. Bovik, H. R. Sheikh, and E. Simoncelli. Image quality assess-
ment: from error visibility to structural similarity. IEEE Trans. Image Process-
ing (TIP), 13(4):600–612, 2004.
[117] J. Rim, H. Lee, J. Won, and S. Cho. Real-world blur dataset for learning
and benchmarking deblurring algorithms. In Proc. ECCV, 2020.
[119] A. Ranjan and M. Black. Optical flow estimation using a spatial pyramid net-
work. In Proc. CVPR, 2017.
[120] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-
scale image recognition. In Proc. ICLR, 2015.
[122] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable
convolutional networks. In Proc. ICCV, 2017.
[123] R. Mechrez, I. Talmi, and L. Zelnik-Manor. The contextual loss for image
transformation with non-aligned data. In Proc. ECCV, 2018.
[125] X. Zhang, Q. Chen, R. Ng, and V. Koltun. Zoom to learn, learn to zoom. In
Proc. CVPR, 2019.
[126] J. Johnson, A. Alahi, and F. Li. Perceptual losses for real-time style transfer and
super-resolution. In Proc. ECCV, 2016.
[127] L. Liu, H. Jiang, W. Chen, X. Liu, J. Gao, and J. Han. On the variance of the
adaptive learning rate and beyond. In Proc. ICLR, 2020.
[128] I. Loshchilov and F. Hutter. Sgdr: Stochastic gradient descent with warm restarts. In
Proc. ICLR, 2017.
[130] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu. Image super-resolution
using very deep residual channel attention networks. In Proc. ECCV, 2018.
[132] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang. Deep laplacian pyramid
networks for fast and accurate super-resolution. In Proc. CVPR, 2017.
Acknowledgements
E-mail : [email protected]
Website : https://junyonglee.me
E DUCATION
POSTECH, Pohang, South Korea Mar. 2016 – Feb. 2023
- Ph.D. in Computer Science and Engineering
- Advisor: Prof. Seungyong Lee (Computer Graphics Lab)
- Dissertation: Learning Image and Video Restoration Using Auxiliary Data
Handong Global University, Pohang, South Korea Mar. 2012 – Feb. 2016
- B.S. in Computer Science and Electrical Engineering
- Graduated with Summa Cum Laude
P UBLICATIONS
Conferences
[C1] Jaesung Rim, Geonung Kim, Jungeon Kim, Junyong Lee, Seungyong Lee,
Sunghyun Cho, “Realistic blur synthesis for learning image deblurring”, Proc.
European Conference on Computer Vision (ECCV) 2022
[C2] Junyong Lee, Myeonghee Lee, Sunghyun Cho, Seungyong Lee, “Reference-
based video super-resolution using multi-camera video triplets”, Proc. IEEE
Computer Vision and Pattern Recognition (CVPR) 2022
[C3] Hyeongseok Son, Junyong Lee, Sunghyun Cho, Seungyong Lee, “Single im-
age defocus deblurring using kernel-sharing parallel atrous convolutions”, Proc.
IEEE International Conference on Computer Vision (ICCV) 2021
[C4] Junyong Lee, Hyeongseok Son, Jaesung Rim, Sunghyun Cho, Seungyong Lee,
“Iterative filter adaptive network for single image defocus deblurring”, Proc.
IEEE Computer Vision and Pattern Recognition (CVPR) 2021
[C5] Junyong Lee, Sungkil Lee, Sunghyun Cho, Seungyong Lee, “Deep defocus
map estimation using domain adaptation”, Proc. IEEE Computer Vision and
Pattern Recognition (CVPR) 2019
Journals
[J1] Hyeongseok Son*, Junyong Lee*, Sunghyun Cho, Seungyong Lee, “Real-time
video deblurring via lightweight motion compensation”, Computer Graphics
Forum (special issue on PG 2022), Vol. 41, No. 7, 2022 (*: equal contribution)
[J2] Hyeongseok Son, Junyong Lee, Sunghyun Cho, Seungyong Lee, “Recurrent
video deblurring with blur-invariant motion estimation and pixel volumes”,
ACM Transactions on Graphics (TOG), Vol. 40, No. 5, 2021 (presented at
SIGGRAPH 2021)
[J3] Junyong Lee, Hyeongseok Son, Gunhee Lee, Jonghyeop Lee, Sunghyun Cho,
Seungyong Lee, “Deep color transfer using histogram analogy”, The Visual
Computer (special issue on CGI 2020), Vol. 36, No. 10, 2020
[1] Myeonghee Lee, Junyong Lee, Seungyong Lee, “Optical flow estimation using
multi-camera”, IEIE 2022 (paper award)
[2] Gwangjin Ju, Soongjin Kim, Myeonghee Lee, Jooeun Son, Junyong Lee, Se-
ungyong Lee, “Computational photography softwares using deep learning”,
KCCV 2022 (demo)
[5] Jonghyeop Lee, Hyeongseok Son, Junyong Lee, Haeun Yoon, Sunghyun Cho,
Seungyong Lee, “Single panorama depth estimation using domain adaptation”,
JKCGS 2020
[6] Junyong Lee, Jonghyeop Lee, Seungyong Lee, “Deep learning-based video
stabilization”, Proc. IPIU 2020
[7] Junyong Lee, Hyeongseok Son, Jonghyeop Lee, Seungyong Lee, “Computa-
tional photography softwares using deep learning”, ICCV 2019 (demo)
[8] Jonghyeop Lee, Junyong Lee, Seungyong Lee, “Video stabilization using deep
learning-based optical flow”, Proc. KSC 2018
[10] Junyong Lee, Seungyong Lee, “Deep time transfer: hallucination of different
times of day using CNN”, Proc. KCGS 2016
[11] Junyong Lee, Seungyong Lee, “Hallucination from noon to night images using
CNN”, SIGGRAPH 2016 (poster)
PATENTS
[1] Seungyong Lee, Sunghyun Cho, Junyong Lee, “Method and recorded medium
for reference-based video super-resolution using multi-camera video triplets”,
KR (filed for a patent, 10-2022-0186476)
[2] Seungyong Lee, Hyeongseok Son, Sunghyun Cho, Junyong Lee, “Method and
recorded medium for inverse kernel-based defocus deblurring”, KR (filed for a
patent, 10-2022-0019174), US (filed for a patent, 17/974,383)
[3] Seungyong Lee, Junyong Lee, Hyeongseok Son, Sunghyun Cho, “Method and
apparatus for video restoration based on machine learning”, KR (filed for a
patent, 10-2020-0180046), US (filed for a patent, 17/497,824)
[4] Seungyong Lee, Hyeongseok Son, Sunghyun Cho, Junyong Lee, “Pixel volume-
based machine learning method for video quality enhancement and apparatus
using same”, KR (filed for a patent, 10-2020-0188668)
[5] Seungyong Lee, Sunghyun Cho, Junyong Lee, “Method for estimating defocus
map and apparatus thereof”, KR (issued, 10-2363049)
P ROFESSIONAL ACTIVITIES
Conference Reviewer
- IEEE Computer Vision and Pattern Recognition (CVPR)
- Eurographics (EG)
Journal Reviewer
- IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI)