Abstract— Recently, learning-based distortion rectification schemes have shown high efficiency. However, most of these methods only focus on a specific camera model with fixed parameters and thus fail to be extended to other models. To avoid such a disadvantage, we propose a model-free distortion rectification framework for the single-shot case, bridged by the distortion distribution map (DDM). Our framework is based on the observation that the pixel-wise distortion information is explicitly regular in a distorted image, despite different models having different types and numbers of distortion parameters. Motivated by this observation, instead of estimating the heterogeneous distortion parameters, we construct a distortion distribution map that intuitively indicates the global distortion features of a distorted image. In addition, we develop a dual-stream feature learning module that benefits from the advantages of both traditional methods, which leverage local handcrafted features, and learning-based methods, which focus on global semantic feature perception. Due to the sparsity of the handcrafted features, we discretize the features into a 2D point map and learn the structure with a network inspired by PointNet. Finally, a multimodal attention fusion module is designed to attentively fuse the local structural and global semantic features, providing hybrid features for a more reasonable scene recovery. The experimental results demonstrate the excellent generalization ability and superior performance of our method in both quantitative and qualitative evaluations, compared with the state-of-the-art methods.

Index Terms— Distortion rectification, model-free framework, dual-stream feature learning, deep learning.

Manuscript received July 23, 2019; revised December 7, 2019; accepted December 28, 2019. Date of publication January 17, 2020; date of current version January 30, 2020. This work was supported in part by the National Natural Science Foundation of China under Grant 61772066 and Grant 61972028 and in part by the Open Project Program of the State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, under Grant VRLAB2019B05. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Lisimachos P. Kondi. (Corresponding author: Chunyu Lin.)

Kang Liao, Chunyu Lin, and Yao Zhao are with the Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China, and also with the Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing 100044, China (e-mail: [email protected]; [email protected]; [email protected]).

Mai Xu is with the School of Electronic and Information Engineering, Beihang University, Beijing 100191, China (e-mail: [email protected]).

Digital Object Identifier 10.1109/TIP.2020.2964523

I. INTRODUCTION

DISTORTION rectification is a classical and essential technique of image processing with applications in various fields, such as structure from motion (SfM) [1]–[3] and scene understanding [4]–[6]. However, because of the unknown camera model and the limited view, it is challenging to recover the real geometric scene from a distorted image without any additional settings. Methods of distortion rectification can be divided into two categories: traditional vision methods and learning-based methods. The traditional vision methods remove the distortion based on handcrafted features [10]–[18]. However, these methods perform poorly on images captured by other cameras and cannot correct the distortion without manual intervention and scene constraints. The learning-based methods have shown promising results in fundamental image processing problems, such as salient object detection [19]–[21], image super-resolution [22]–[24], and depth estimation [25], [26]. Compared with these active research areas, distortion rectification has received little attention in deep learning [7]–[9]. The learning-based methods can be further divided into two sub-categories: parameter-based methods and parameter-free methods. The parameter-based methods [7], [8] estimate the distortion parameters using convolutional neural networks (CNNs) in terms of the one-parameter division and fisheye camera models, respectively. Our previous work, DR-GAN [9], learns the mapping function between the distorted and undistorted images rather than estimating parameters, achieving one-stage rectification for the even-order distortion model.

We find that learning-based methods have the following problems. (1) They heavily rely on the assumption of a specific camera model. Trained on a synthesized distorted image dataset based on that specific camera model, their networks cannot flexibly extend to other models. Besides, the ranges of the distortion parameters in their datasets are limited, leading to inferior rectification results even on the same camera model when a parameter is out of range. (2) In the parameter-based methods, the network is pretrained on ImageNet [27], which contains only natural images without distortion, and then directly estimates the parameters of a distorted image, in which the detailed and semantic features have been geometrically changed. Thus, neural networks struggle with feature extraction with regard to the distortion. Moreover, these methods suffer from an imbalanced problem during the training process because of the heterogeneousness of the distortion parameters. (3) In the parameter-free methods, the network lacks the supervision of the distortion information; thus, it cannot discriminate between similar geometric attributes. While these methods focus on mapping function learning and correct the distortion with higher efficiency, they cannot obtain the distortion parameters that are important for SfM and camera calibration. (4) Previous learning-based methods did not explicitly investigate the local structural features or
Fig. 2. Comparison of the previous learning-based methods and our proposed method for the distortion rectification. (a) The previous learning-based methods [7], [8], and [9] (from top to bottom) only focus on specific camera models: the one-parameter division model, the fisheye camera model, and the even-order distortion model, respectively. They cannot adaptively correct other types of distortion owing to the model and domain limitations. (b) By constructing the distortion distribution map (DDM) and unifying different domains, our model-free framework is capable of achieving various distortion rectifications.
cameras and require cumbersome manual operations. The line-based methods [13]–[18] achieve the distortion rectification using detected lines or curves, following the principle that a straight line has to be straight, regardless of the projection [30]. However, these methods obtain inferior results on images that lack lines, thus failing to extend to real scenes. All the above methods are built for specific camera models, showing poor generalization abilities to other camera models. In addition, with traditional vision methods, it is nearly impossible to automatically correct the distortion in a single image without manual interventions and scene limitations.

B. Learning-Based Distortion Rectification

Unlike traditional vision distortion rectification methods, few studies have focused on learning-based distortion rectification. Rong et al. [7] exploited CNNs to correct the radial distortion in an image for the first time. They trained their neural network using a synthesized distorted image dataset in terms of the one-parameter division model [31] and classified the different distortions into 401 categories. However, this method cannot correct a strong distortion because of the simplified camera model and the over-discrete range of distortion parameters. To address the above issue, Yin et al. [8] presented a multicontext collaborative network for the strong distortion rectification in fisheye images. Based on the fisheye camera model [32], they introduced a scene parsing network to provide semantic information for more accurate rectification. However, the rectification results heavily rely on the performance of the scene parsing network, and this network imposes increased memory and computation requirements on the rectification system. We refer to the aforementioned two methods as parameter-based methods, which directly estimate the distortion parameters from an image using CNNs. These methods suffer from an imbalanced problem when the networks regress the heterogeneous distortion parameters simultaneously. The other type of learning-based method is parameter-free. Liao et al. [9] transferred the distortion parameter estimation into mapping function learning and achieved one-stage rectification according to the even-order distortion model [33]. However, their learning process lacked the distortion information supervision, so that it could not intuitively discriminate similar distortions.

Importantly, neural networks exhibit poor transfer across different domains. All of the above networks are trained using synthesized distorted image datasets derived from specific camera models, which causes inferior results when the networks are tested on other camera models. Even for the same camera model, data that is out of the given parameter range will also confuse these networks. Therefore, until now, there has been a lack of a general model-free framework for single-shot distortion rectification.

III. PRELIMINARIES

In this section, we list a number of classical distortion camera models; then, we unify these models into the same domain using the proposed distortion distribution map.

A. Distortion Camera Models

Suppose that a point $p_d = [x_d, y_d]^T$ lies in the distorted image and the corresponding point $p_r = [x_r, y_r]^T$ lies in the rectified image. The general projection function $f$ of these two points can be expressed as a polynomial [32]:

$$f(x_d, y_d) = k_1 + k_2 r_d + k_3 r_d^2 + \cdots + k_N r_d^{N-1}, \qquad (1)$$

where $d_p = [k_1, k_2, k_3, \cdots, k_N]$ are the distortion parameters, and $r_d$ is the distance between $p_d$ and the distortion center $p_c = [x_c, y_c]^T$:

$$r_d = \sqrt{(x_d - x_c)^2 + (y_d - y_c)^2}. \qquad (2)$$
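As a concrete illustration of Eqs. 1 and 2, the following NumPy sketch evaluates the polynomial projection function at every pixel of an image grid, which is the kind of per-pixel distortion level that the proposed DDM records. The helper names and the parameter values are ours and purely illustrative, not taken from the paper.

```python
import numpy as np

def radial_distance(h, w, xc, yc):
    """Per-pixel distance r_d to the distortion center p_c = [xc, yc] (Eq. 2)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.sqrt((xs - xc) ** 2 + (ys - yc) ** 2)

def polynomial_distortion_level(rd, k):
    """Evaluate the polynomial of Eq. 1 at every pixel:
    f = k_1 + k_2*r_d + k_3*r_d^2 + ... + k_N*r_d^(N-1)."""
    return sum(ki * rd ** i for i, ki in enumerate(k))

# Example: a 256x256 image, distortion center at the image center, and
# illustrative (not calibrated) parameters k = [1.0, 0.0, -4e-6].
rd = radial_distance(256, 256, xc=127.5, yc=127.5)
ddm = polynomial_distortion_level(rd, [1.0, 0.0, -4e-6])
print(ddm.shape, float(ddm.min()), float(ddm.max()))
```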
Fig. 4. Overview of the proposed model-free distortion rectification framework. This framework consists of three modules: the SSD (semantics, structure, and distortion) learning module, the multimodal attention fusion module, and the distortion rectification module. In the SSD learning module, three special neural networks are designed for the perception of the semantic, structural, and distorted information. The SSD learning module can be regarded as a combination of the proposed dual-stream feature learning and distortion learning modules. To efficiently aggregate the multimodal features from the above networks, we fuse these features with an attention mechanism. Finally, we utilize the refined hybrid features to recover the real geometric shape. The skip connections are marked with gray dashed arrows.
DDM. Suppose that the distortion levels $[d_1, d_2, \cdots, d_n]$ at the locations $[r_1, r_2, \cdots, r_n]$ are available in the DDM. Then, the distortion parameters $[k_1, k_2, \cdots, k_n]$ can be obtained as follows, based on Eq. 10:

$$\begin{bmatrix} k_1 & k_2 & \cdots & k_n \end{bmatrix} = \begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_n \end{bmatrix}^{T} \begin{bmatrix} r_1^0 & r_1^1 & \cdots & r_1^{n-1} \\ r_2^0 & r_2^1 & \cdots & r_2^{n-1} \\ \vdots & \vdots & \ddots & \vdots \\ r_n^0 & r_n^1 & \cdots & r_n^{n-1} \end{bmatrix}^{-1}. \qquad (11)$$
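Because the matrix in Eq. 11 is a Vandermonde matrix, the parameters can be recovered numerically from a handful of (r_i, d_i) samples of the DDM. Below is a minimal NumPy sketch with purely illustrative sample values; a least-squares solve replaces the explicit inverse for numerical stability.

```python
import numpy as np

r = np.array([10.0, 40.0, 90.0, 160.0])        # sampled radii r_1..r_n
n = r.size
V = np.vander(r, n, increasing=True)           # rows [r_i^0, r_i^1, ..., r_i^(n-1)]
k_true = np.array([1.0, 2e-2, -3e-4, 1e-6])    # hypothetical parameters
d = V @ k_true                                 # distortion levels d_1..d_n (Eq. 10)
k = np.linalg.lstsq(V, d, rcond=None)[0]       # recover k from (r_i, d_i), as in Eq. 11
print(np.allclose(k, k_true, atol=1e-6))       # True
```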
IV. PROPOSED MODEL-FREE DISTORTION RECTIFICATION FRAMEWORK

In this section, we describe the proposed model-free distortion rectification framework in detail. The overall architecture of this framework is illustrated in Fig. 4 and includes three modules: the SSD (semantics, structure, and distortion) learning, multimodal attention fusion, and distortion rectification modules. Specifically, we first introduce the SSD learning module that perceives the semantic, structural, and distorted information of an image, achieved by three special neural networks. We then describe the multimodal attention fusion module that attentively aggregates the multimodal features. Finally, we explain the distortion rectification module and the training loss of the whole framework.

A. SSD Learning

1) Distortion Perception: Previous learning-based methods ignored the prior knowledge of the distortion in an image and directly estimated the distortion parameters using CNNs. However, the heterogeneousness of these parameters influences the performance of their learning models and causes an imbalanced problem during the regression. We will show the experiment related to this problem in Section V-B. Having observed this fact, we design a specific network to intuitively perceive the prior knowledge of the distortion, which is regular in a distorted image.

As mentioned in Section III-B, the DDM describes the global distortion features of a distorted image, which can unify the different domains of different camera models. Therefore, we first construct the DDM of a distorted image rather than roughly feeding the image into an existing network pretrained on ImageNet [27], which contains no distorted images. The distortion learner is a fully convolutional neural network such as U-Net [36], consisting of an encoder and a decoder, with skip connections between the encoder and decoder features at the same spatial resolution. Specifically, we have 5 hierarchies in the encoder. Each hierarchy has a convolutional layer with 3×3 kernels and 2 strides (2× downsample), followed by a BatchNormalization layer and LeakyReLU activation (α = 0.2). The number of filters per hierarchy is as follows: 64, 128, 256, 512, and 512. There are 5 hierarchies in the decoder part. At the beginning of each hierarchy, a bilinear upsampling layer is leveraged to increase the spatial dimension by a factor of 2, followed by convolutional, BatchNormalization, and LeakyReLU layers. Note that the DDM has a single channel, the same as a depth map; thus, the number of filters in the last convolutional layer equals 1.
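The distortion learner described above can be assembled in a few lines of PyTorch. The sketch below follows the stated hierarchy widths, strides, and activations; every detail the text does not fix (the decoder kernel size, the treatment of the single-channel output, and so on) is our assumption rather than the paper's specification.

```python
import torch
import torch.nn as nn

def enc_block(cin, cout):
    # 3x3 conv, stride 2 (2x downsample), BatchNorm, LeakyReLU(0.2)
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                         nn.BatchNorm2d(cout), nn.LeakyReLU(0.2))

def dec_block(cin, cout):
    # bilinear 2x upsampling, then conv, BatchNorm, LeakyReLU
    return nn.Sequential(nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                         nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.LeakyReLU(0.2))

class DistortionLearner(nn.Module):
    def __init__(self):
        super().__init__()
        widths = [64, 128, 256, 512, 512]
        self.encs = nn.ModuleList(
            [enc_block(c, w) for c, w in zip([3] + widths[:-1], widths)])
        # Each later decoder stage consumes the previous output plus a skip feature.
        self.decs = nn.ModuleList([dec_block(512, 512),
                                   dec_block(512 + 512, 256),
                                   dec_block(256 + 256, 128),
                                   dec_block(128 + 128, 64),
                                   dec_block(64 + 64, 1)])  # one output channel: the DDM

    def forward(self, x):
        skips = []
        for enc in self.encs:
            x = enc(x)
            skips.append(x)
        x = self.decs[0](skips[-1])
        for dec, skip in zip(self.decs[1:], reversed(skips[:-1])):
            x = dec(torch.cat([x, skip], dim=1))
        return x  # predicted DDM

ddm = DistortionLearner()(torch.randn(1, 3, 256, 256))
print(ddm.shape)  # torch.Size([1, 1, 256, 256])
```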
2) Distortion-Guided Semantics Extraction: Unlike the previous learning-based methods, we do not pretrain our model on ImageNet [27], as the domains of the natural image and the distorted image are different. Instead, we exploit the generated DDM to guide the semantics extraction of distorted images and train the semantics learner from scratch. Compared with
where $V_{hyb}$ is the hybrid feature that consists of the semantic and structural features. However, the types of these fused features are different, playing ambiguous roles in the distortion rectification; thus, such a simple operation for this fusion task would hinder the complementary message passing. To address the above issue, we introduce a multimodal attention fusion module to guide the interaction between the semantic and structural features. Because the interaction of different modal information is not always meaningful, the attention mechanism can help neural networks autonomously learn to focus on or to omit the message passing from other features. Specifically, the attention map $M$ can be constructed as follows:

$$M = \sigma(fc(V_{sem})), \qquad (14)$$

where $fc$ is a fully connected layer with 256 units, and the sigmoid function $\sigma$ is used to normalize the value of $M$. Subsequently, the hybrid feature $V_{hyb}$ derived from the multimodal attention fusion can be formulated as follows:

$$V_{hyb} = V_{sem} \oplus (M \otimes V_{str}), \qquad (15)$$

where $\otimes$ is element-wise multiplication, and $\oplus$ is concatenation of the multimodal features.

Compared with the plain fusion in Eq. 13, the proposed multimodal attention fusion module guides the meaningful message passing between the local structural feature and the global semantic feature. Thus, our framework learns to automatically select the valid structural information and further achieves better rectification results.
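A minimal PyTorch sketch of Eqs. 14 and 15 is given below: an attention map gates the structural feature, and the gated result is concatenated with the semantic feature. The feature dimensionality is an assumption; only the 256-unit fully connected layer is stated in the text.

```python
import torch
import torch.nn as nn

class MultimodalAttentionFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.fc = nn.Linear(dim, 256)   # fully connected layer with 256 units

    def forward(self, v_sem, v_str):
        m = torch.sigmoid(self.fc(v_sem))          # Eq. 14: M = sigma(fc(V_sem))
        return torch.cat([v_sem, m * v_str], -1)   # Eq. 15: V_sem concat (M * V_str)

fusion = MultimodalAttentionFusion()
v_hyb = fusion(torch.randn(1, 256), torch.randn(1, 256))
print(v_hyb.shape)  # torch.Size([1, 512]) -- hybrid feature V_hyb
```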
C. Distortion Remover

At the end of the proposed framework, we leverage the hybrid feature $V_{hyb}$ for the distortion rectification in a generation fashion. The architecture of this network can be treated as a decoder consisting of 8 hierarchies, where each hierarchy has an upsampling layer (2× upsample) followed by a convolutional layer with 4×4 kernels and 1 stride. The number of filters per convolutional layer is as follows: 512, 512, 512, 512, 256, 128, 64, and 3. Each convolutional layer is followed by a BatchNormalization layer using LeakyReLU activation (α = 0.2), except that the activation of the last convolutional layer is the Tanh function. To promote the complementary effect of low- and high-level features, we add skip connections between the semantics learner and the distortion remover at the same spatial resolution.
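A compact sketch of this decoder, under the stated filter counts and activations, might look as follows; the width of the input hybrid feature and the exact padding scheme are our assumptions, and the skip connections from the semantics learner are omitted for brevity.

```python
import torch
import torch.nn as nn

def remover_block(cin, cout, last=False):
    # 2x upsampling followed by a 4x4, stride-1 convolution.
    layers = [nn.Upsample(scale_factor=2, mode="nearest"),
              nn.Conv2d(cin, cout, 4, stride=1, padding="same")]
    # BatchNorm + LeakyReLU(0.2) everywhere except the final Tanh output.
    layers += [nn.Tanh()] if last else [nn.BatchNorm2d(cout), nn.LeakyReLU(0.2)]
    return nn.Sequential(*layers)

widths = [512, 512, 512, 512, 256, 128, 64, 3]   # eight hierarchies
remover = nn.Sequential(*[remover_block(c, w, last=(w == 3))
                          for c, w in zip([256] + widths[:-1], widths)])
out = remover(torch.randn(1, 256, 1, 1))          # hybrid feature as a 1x1 map
print(out.shape)  # torch.Size([1, 3, 256, 256]) -- rectified RGB image in [-1, 1]
```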
D. Training Loss

Given the input distorted image $I^d$, rectified image $I^r$, and ground truth of the undistorted image $I^u$, the training loss for our framework is a linear combination of four terms:

$$\mathcal{L} = \lambda_d \mathcal{L}_d + \lambda_r \mathcal{L}_r + \lambda_p \mathcal{L}_p^{l2h} + \lambda_a \mathcal{L}_a. \qquad (16)$$

1) Distortion Distribution Loss: As mentioned in Section III, the DDM intuitively indicates the global distortion features of a distorted image, guiding the semantic perception of neural networks. Suppose that the DDM generated by the distortion learner and the ground truth DDM are denoted as $\hat{M}$ and $M$, respectively. The distortion distribution loss can be defined as

$$\mathcal{L}_d = \frac{1}{WH}\sum_{x=1}^{W}\sum_{y=1}^{H} \|\hat{M}_{x,y} - M_{x,y}\|_1, \qquad (17)$$

where $W$ and $H$ are the width and height of the DDM, respectively.

2) Reconstruction Loss: When the proposed framework produces a rectified image, one straightforward way is to minimize the reconstruction loss between $I^r$ and $I^u$:

$$\mathcal{L}_r = \frac{1}{WH}\sum_{x=1}^{W}\sum_{y=1}^{H} \|I^r_{x,y} - I^u_{x,y}\|_1. \qquad (18)$$

3) Low-to-High Perceptual Loss: The reconstruction loss describes the difference between the generated image and the ground truth at the pixel level. However, it causes blur artifacts in the generated image. To solve this problem, we exploit a perceptual loss [41] at the feature level to preserve details of the predictions and make the rectified images sharper:

$$\mathcal{L}_p = \frac{1}{W_{i,j}H_{i,j}}\sum_{x=1}^{W_{i,j}}\sum_{y=1}^{H_{i,j}} \|\phi_{i,j}(I^r)_{x,y} - \phi_{i,j}(I^u)_{x,y}\|_2, \qquad (19)$$

where $W_{i,j}$ and $H_{i,j}$ are the width and height of the feature map $\phi_{i,j}$, which is derived from the $j$-th convolution (after activation) before the $i$-th maxpooling layer in the VGG19 network [42]. As suggested in [9], the low-to-high perceptual loss that jointly considers the shallow and deep feature maps produces more reasonable rectification results than a vanilla perceptual loss. Thus, we also implement the low-to-high perceptual loss in our framework:

$$\mathcal{L}_p^{l2h} = \lambda_l \mathcal{L}_p^{l} + (1 - \lambda_l)\mathcal{L}_p^{h}, \qquad (20)$$

where $\mathcal{L}_p^{l}$ and $\mathcal{L}_p^{h}$ are the low and high perceptual losses, respectively. In analogy to [9], we choose the activations from the $VGG_{1,2}$ and $VGG_{5,2}$ convolutional layers to obtain the low-level and high-level perceptual losses, respectively. Here, $\lambda_l$ is a factor to balance $\mathcal{L}_p^{l}$ and $\mathcal{L}_p^{h}$.

4) Adversarial Loss: The adversarial loss is widely used in the image generation task [43]–[45]; it helps to produce more reasonable and consistent rectified images. Therefore, we introduce a network $D$ to discriminate between the rectified image generated by the distortion remover and the ground truth. This network consists of 6 convolutional layers with 4×4 kernels and 2 strides. The number of filters per convolutional layer is 64, 128, 256, 512, 512, and 512. BatchNormalization and LeakyReLU activation (α = 0.2) are applied to each layer. After the last convolutional layer, there are two fully connected layers with 1024 and 1 units, predicting the input image as true or false. The adversarial loss can be defined as follows:

$$\mathcal{L}_a = -\sum_{n=1}^{N} \log D_{\theta_D}(I^r). \qquad (21)$$

Every module of our model-free framework is differentiable; thus, the whole learning model can be trained end-to-end.
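The four terms of Eq. 16 can be composed directly; the sketch below implements the L1 terms of Eqs. 17 and 18 and the generator-side adversarial term of Eq. 21, with phi_low and phi_high standing in for frozen VGG19 feature extractors (the VGG_{1,2} and VGG_{5,2} activations). The λ weights are placeholders rather than the paper's values, and the mean-squared term is our stand-in for the 2-norm of Eq. 19.

```python
import torch
import torch.nn.functional as F

def total_loss(ddm_pred, ddm_gt, img_rect, img_gt, d_fake,
               phi_low, phi_high, lam=(1.0, 1.0, 1.0, 0.01), lam_l=0.5):
    ld = F.l1_loss(ddm_pred, ddm_gt)                            # Eq. 17
    lr = F.l1_loss(img_rect, img_gt)                            # Eq. 18
    lp_low = F.mse_loss(phi_low(img_rect), phi_low(img_gt))     # Eq. 19, shallow features
    lp_high = F.mse_loss(phi_high(img_rect), phi_high(img_gt))  # Eq. 19, deep features
    lp_l2h = lam_l * lp_low + (1 - lam_l) * lp_high             # Eq. 20
    la = -torch.log(d_fake + 1e-8).mean()                       # Eq. 21
    ld_w, lr_w, lp_w, la_w = lam
    return ld_w * ld + lr_w * lr + lp_w * lp_l2h + la_w * la    # Eq. 16

# Illustrative call with identity "feature extractors" and random tensors.
x = torch.rand(1, 3, 64, 64)
print(total_loss(torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64),
                 x, torch.rand(1, 3, 64, 64), torch.rand(1, 1) * 0.5 + 0.25,
                 lambda t: t, lambda t: t).item())
```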
V. EXPERIMENTS

In this section, to demonstrate the distortion rectification performance of the proposed framework, we first describe the synthesized distorted image dataset used for training the neural networks and the implementation details. Then, we analyze the limitations of the previous learning-based methods from two aspects: the poor generalization ability for different camera models and the imbalanced problem during the parameter estimation. Moreover, we report an ablation study of the distortion rectification performance obtained with the different modules of the SSD learning introduced in Section IV-A. Finally, we compare the proposed method with the state-of-the-art methods in both the quantitative measurement and the visual appearance.
TABLE I
AN ABLATION STUDY OF THE DIFFERENT VARIANTS OF OUR MODEL-FREE FRAMEWORK

Fig. 7. The generalization limitation of the previous learning-based methods for the distortion rectification. We show three types of distorted images: the one-parameter division, even-order, and fisheye camera models, from top to bottom.
TABLE II
QUANTITATIVE MEASUREMENTS OF OUR PROPOSED METHOD AND THE STATE-OF-THE-ART METHODS, WHICH ARE EVALUATED USING PSNR AND SSIM
Fig. 10. Rectification results on the synthesized distorted images. For each result, we show the distorted image, the ground truth, the corrected results of the compared methods (Alemán-Flores [17], Santana-Cedrés [18], Rong [7], and Liao [9]), and the corrected result of our proposed method, from left to right.
the CNNs-based feature extractor cannot fully learn the local structural feature of a distorted image. In contrast, our structural learner, motivated by the point cloud learning network PointNet [28], is specially designed for handcrafted feature learning; thus, it learns more effective distortion information and achieves better rectification performance than CNNs_HFL. For the meaningful interaction of the local structural features and the global semantic features, we introduce a multimodal attention fusion module to help neural networks autonomously learn to focus on or to omit the message passing. Therefore, by fusing the local structural features and global semantic features with an attention mechanism, the SL+MAF method outperforms the SL method, which only fuses the different modal features with concatenation. On the other hand, by constructing the DDM of a distorted image, the distortion learner explicitly provides the prior knowledge of the distortion distribution to our framework, significantly improving the performance of the distortion rectification. Compared with the baselines that only use SL or DL, the baseline w/ SL+DL achieves better performance due to the comprehensive perception of both the structural and semantic information. Furthermore, the complete framework comprising the SSD learning and multimodal attention fusion modules obtains the best results in both PSNR and SSIM.
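For reference, a PointNet-style structural learner in the spirit the ablation describes could look like the sketch below: the sparse handcrafted features are discretized into a 2D point set and encoded by a shared per-point MLP followed by an order-invariant max-pool. The layer widths, the use of bare (x, y) coordinates, and all other details are our guesses; the paper's actual structural learner is specified on pages not reproduced here.

```python
import torch
import torch.nn as nn

class StructuralLearner(nn.Module):
    def __init__(self, out_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(            # shared MLP applied to every point
            nn.Conv1d(2, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, out_dim, 1))

    def forward(self, pts):                   # pts: (B, N, 2) point coordinates
        feat = self.mlp(pts.transpose(1, 2))  # (B, out_dim, N)
        return feat.max(dim=2).values         # symmetric pooling: global feature

v_str = StructuralLearner()(torch.rand(1, 512, 2))
print(v_str.shape)  # torch.Size([1, 256])
```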
D. Comparisons to the State-of-the-Art Methods

In this part, we compare the proposed method with the state-of-the-art methods, including the traditional vision methods of Alemán-Flores et al. [17] and Santana-Cedrés et al. [18], which are based on the one-parameter division model and the two-parameter division model, and the learning-based methods of Rong et al. [7] and Liao et al. [9], which are based on the one-parameter division model and the even-order model. These different methods are evaluated using the quantitative measurement and the visual appearance as follows.

1) Quantitative Measurement: To demonstrate a quantitative comparison with the state-of-the-art methods, we evaluate the rectified images obtained from the different methods using PSNR and SSIM. To be more specific, we exploit five test datasets to validate the performances of the different methods, i.e., the division (d), odd-order (o), even-order (e), and fisheye (f) camera models. The measurement results are demonstrated in Table II.
Fig. 11. Rectification results on the real distorted images. For each result, we show the distorted image, the ground truth, the corrected results of the compared methods (Alemán-Flores [17], Santana-Cedrés [18], Rong [7], and Liao [9]), and the corrected result of our proposed method, from left to right.
As a benefit of the proposed DDM that unifies the different camera models into the same domain, our proposed method can correct different types of distortion and achieves the best performance on all test datasets, in both PSNR and SSIM. Suffering from the strong assumption of a specific camera model, the state-of-the-art methods perform poorly on the distorted images derived from other camera models. Under the specific camera model, due to the excellent learning ability and the global semantic information analysis, our method significantly leads the traditional vision methods [17], [18] in the quantitative measurement. Compared with the learning-based methods [7], [9], which ignore the prior knowledge of the distortion and the local structural feature learning, our framework exhibits more promising rectification results because of its complete learning of the structure, semantics, and distortion.
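The evaluation protocol above can be reproduced with standard library routines; a small sketch with scikit-image, using random placeholder image pairs in place of the test datasets, is shown below.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(pairs):
    """Average PSNR and SSIM between rectified images and their ground truths."""
    psnrs, ssims = [], []
    for rect, gt in pairs:
        psnrs.append(peak_signal_noise_ratio(gt, rect, data_range=1.0))
        ssims.append(structural_similarity(gt, rect, channel_axis=-1, data_range=1.0))
    return float(np.mean(psnrs)), float(np.mean(ssims))

rng = np.random.default_rng(0)
pairs = [(rng.random((64, 64, 3)), rng.random((64, 64, 3))) for _ in range(4)]
print(evaluate(pairs))
```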
2) Visual Appearance: We further compare our method with the state-of-the-art methods in terms of visual appearance. First, we leverage the constructed synthesized distorted images to evaluate the above methods. The comparison results are illustrated in Fig. 10. Lacking global semantic perception, the traditional methods [17], [18] perform poorly on the distortion rectification under various scenes and obtain few reasonable results even when the distortions are not strong. Due to the specific assumption of the derived camera models, the learning-based methods [7], [9] obtain inferior rectifications when facing other models. By contrast, benefiting from the model-free framework and local-to-global learning, our method achieves the best results on various scenes and camera models, accurately correcting the curves that are supposed to be straight. Therefore, the proposed algorithm shows better generalization ability in practical distortion rectification.

For the robustness validation, we then compare the different methods using real distorted images captured by wide-angle lenses. As shown in Fig. 11, it is difficult to correct the distorted structures using the previous methods [7], [9], [17], [18], which display under-rectification and over-rectification results. Compared with the unsatisfactory corrections of the previous methods, our method achieves the best visual appearance on all the rectified results, excellently recovering the real scenes from the distorted geometric distributions.

VI. CONCLUSION AND FUTURE WORK

In this paper, we consider the challenging problem of single-shot distortion rectification and present a general framework. Compared with the previous learning-based methods that only focus on specific camera models, thus failing to be extended to other models, our framework is model-free and has better generalization ability. By constructing the distortion distribution map (DDM) of an image, we unify the different types of camera models into the same domain. Subsequently, the DDM is utilized to guide the semantic perception, eliminating the imbalanced problem during the multiple-parameter regression. Moreover, we propose a dual-stream feature learning structure to extract both the local handcrafted features and the global semantic features. For the meaningful interaction of different features, a multimodal attention fusion module is introduced. Experimental results demonstrate the excellent generalization ability of our framework. The proposed method significantly outperforms the state-of-the-art methods in both quantitative and qualitative evaluations. Our future work will consider self-supervised distortion rectification.

REFERENCES

[1] T. Collins and A. Bartoli, "Planar structure-from-motion with affine camera models: Closed-form solutions, ambiguities and degeneracy analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1237–1255, Jun. 2017.
[2] H. Guan and W. A. P. Smith, "Structure-from-motion in spherical video using the von Mises–Fisher distribution," IEEE Trans. Image Process., vol. 26, no. 2, pp. 711–723, Feb. 2017.
[3] M. Lee, J. Cho, and S. Oh, "Procrustean normal distribution for non-rigid structure from motion," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 7, pp. 1388–1400, Jul. 2017.
[4] J. L. Herrera, C. R. del-Blanco, and N. Garcia, "Automatic depth extraction from 2D images using a cluster-based learning framework," IEEE Trans. Image Process., vol. 27, no. 7, pp. 3288–3299, Jul. 2018.
[5] Y. Wang and W. Deng, "Generative model with coordinate metric learning for object recognition based on 3D models," IEEE Trans. Image Process., vol. 27, no. 12, pp. 5813–5826, Dec. 2018.
[6] M. Wang et al., "BiggerSelfie: Selfie video expansion with hand-held camera," IEEE Trans. Image Process., vol. 27, no. 12, pp. 5854–5865, Dec. 2018.
[7] J. Rong, S. Huang, Z. Shang, and X. Ying, "Radial lens distortion correction using convolutional neural networks trained with synthesized images," in Proc. Asian Conf. Comput. Vis., 2016, pp. 35–49.
[8] X. Yin, X. Wang, J. Yu, M. Zhang, P. Fua, and D. Tao, "FishEyeRecNet: A multi-context collaborative deep network for fisheye image rectification," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 469–484.
[9] K. Liao, C. Lin, Y. Zhao, and M. Gabbouj, "DR-GAN: Automatic radial distortion rectification using conditional GAN in real-time," IEEE Trans. Circuits Syst. Video Technol., to be published.
[10] S. Shah and J. Aggarwal, "Intrinsic parameter calibration procedure for a (high-distortion) fish-eye lens camera with distortion model and accuracy estimation," Pattern Recognit., vol. 29, no. 11, pp. 1775–1788, Nov. 1996.
[11] Z. Zhang, "Flexible camera calibration by viewing a plane from unknown orientations," in Proc. IEEE Int. Conf. Comput. Vis., vol. 1, Sep. 1999, pp. 666–673.
[12] X. Chen, J. Yang, and A. H. Waibel, "Calibration of a hybrid camera network," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2003.
[13] J. Barreto and H. Araujo, "Geometric properties of central catadioptric line images and their application in calibration," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 8, pp. 1327–1333, Aug. 2005.
[14] R. Melo, M. Antunes, J. P. Barreto, G. F. P. Fernandes, and N. Gonçalves, "Unsupervised intrinsic calibration from a single frame using a 'plumb-line' approach," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 537–544.
[15] R. Carroll, M. Agrawal, and A. Agarwala, "Optimizing content-preserving projections for wide-angle images," ACM Trans. Graph., vol. 28, no. 3, p. 43, 2009.
[16] F. Bukhari and M. N. Dailey, "Automatic radial distortion estimation from a single image," J. Math. Imag. Vis., vol. 45, no. 1, pp. 31–45, 2013.
[17] M. Alemán-Flores, L. Alvarez, L. Gomez, and D. Santana-Cedrés, "Automatic lens distortion correction using one-parameter division models," Image Process. Line, vol. 4, pp. 327–343, Nov. 2014.
[18] D. Santana-Cedrés et al., "An iterative optimization algorithm for lens distortion correction using two-parameter models," Image Process. Line, vol. 5, pp. 326–364, Dec. 2016.
[19] G. Li and Y. Yu, "Deep contrast learning for salient object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 478–487.
[20] Y. Fang, G. Ding, J. Li, and Z. Fang, "Deep3DSaliency: Deep stereoscopic video saliency detection model by 3D convolutional networks," IEEE Trans. Image Process., vol. 28, no. 5, pp. 2305–2318, May 2019.
[21] H. Lin, C. Lin, Y. Zhao, and A. Wang, "3D saliency detection based on background detection," J. Vis. Commun. Image Represent., vol. 48, pp. 238–253, Oct. 2017.
[22] J. Kim, J. K. Lee, and K. M. Lee, "Accurate image super-resolution using very deep convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 1646–1654.
[23] C. Ledig et al., "Photo-realistic single image super-resolution using a generative adversarial network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017, pp. 105–114.
[24] Y. Dong, C. Lin, Y. Zhao, C. Yao, and J. Hou, "Depth map up-sampling with texture edge feature via sparse representation," in Proc. Vis. Commun. Image Process., 2016, pp. 1–4.
[25] Z. Zhang, C. Xu, J. Yang, J. Gao, and Z. Cui, "Progressive hard-mining network for monocular depth estimation," IEEE Trans. Image Process., vol. 27, no. 8, pp. 3691–3702, Aug. 2018.
[26] L. Ge, H. Liang, J. Yuan, and D. Thalmann, "Robust 3D hand pose estimation from single depth images using multi-view CNNs," IEEE Trans. Image Process., vol. 27, no. 9, pp. 4422–4436, Sep. 2018.
[27] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[28] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, "PointNet: Deep learning on point sets for 3D classification and segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017, pp. 77–85.
[29] Y. Gao, C. Lin, Y. Zhao, X. Wang, S. Wei, and Q. Huang, "3-D surround view for advanced driver assistance systems," IEEE Trans. Intell. Transp. Syst., vol. 19, no. 1, pp. 320–328, Jan. 2018.
[30] F. Devernay and O. Faugeras, "Straight lines have to be straight," Mach. Vis. Appl., vol. 13, no. 1, pp. 14–24, Aug. 2001.
[31] D. Claus and A. W. Fitzgibbon, "A rational function lens distortion model for general cameras," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 1, Jul. 2005, pp. 213–219.
[32] J. Kannala and S. Brandt, "A generic camera model and calibration method for conventional, wide-angle, and fish-eye lenses," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 8, pp. 1335–1340, Aug. 2006.
[33] R. I. Hartley and S. B. Kang, "Parameter-free radial distortion correction with centre of distortion estimation," in Proc. IEEE Int. Conf. Comput. Vis., Jun. 2005, pp. 1834–1841.
[34] D. Scaramuzza, A. Martinelli, and R. Siegwart, "A toolbox for easily calibrating omnidirectional cameras," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Oct. 2006, pp. 5695–5701.
[35] A. W. Fitzgibbon, "Simultaneous linear estimation of multiple view geometry and lens distortion," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Aug. 2001.
[36] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assisted Intervent., 2015, pp. 234–241.
[37] J. F. Canny, "A computational approach to edge detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-8, no. 6, pp. 679–698, Nov. 1986.
[38] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, "PointNet++: Deep hierarchical feature learning on point sets in a metric space," in Proc. Neural Inf. Process. Syst., 2017.
[39] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas, "Volumetric and multi-view CNNs for object classification on 3D data," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 5648–5656.
[40] H. Su, S. Maji, E. Kalogerakis, and E. G. Learned-Miller, "Multi-view convolutional neural networks for 3D shape recognition," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2015, pp. 945–953.
[41] J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," in Proc. Eur. Conf. Comput. Vis., 2016.
[42] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, Apr. 2015.
[43] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017.
[44] J.-Y. Zhu et al., "Toward multimodal image-to-image translation," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 465–476.
[45] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. K. Singh, and M.-H. Yang, "Diverse image-to-image translation via disentangled representations," in Proc. Eur. Conf. Comput. Vis., 2018.
[46] T.-Y. Lin et al., "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis., 2014.
[47] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980. [Online]. Available: https://arxiv.org/abs/1412.6980
[48] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein GAN," CoRR, vol. abs/1701.07875, Dec. 2017.