IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 29, 2020

Model-Free Distortion Rectification Framework Bridged by Distortion Distribution Map

Kang Liao, Chunyu Lin, Yao Zhao, Senior Member, IEEE, and Mai Xu, Senior Member, IEEE

Abstract—Recently, learning-based distortion rectification schemes have shown high efficiency. However, most of these methods only focus on a specific camera model with fixed parameters, thus failing to be extended to other models. To avoid such a disadvantage, we propose a model-free distortion rectification framework for the single-shot case, bridged by the distortion distribution map (DDM). Our framework is based on an observation that the pixel-wise distortion information is explicitly regular in a distorted image, despite different models having different types and numbers of distortion parameters. Motivated by this observation, instead of estimating the heterogeneous distortion parameters, we construct a distortion distribution map that intuitively indicates the global distortion features of a distorted image. In addition, we develop a dual-stream feature learning module, benefiting from both the advantages of traditional methods that leverage local handcrafted features and learning-based methods that focus on global semantic feature perception. Due to the sparsity of handcrafted features, we discretize the features into a 2D point map and learn the structure with a network inspired by PointNet. Finally, a multimodal attention fusion module is designed to attentively fuse the local structural and global semantic features, providing hybrid features for more reasonable scene recovery. The experimental results demonstrate the excellent generalization ability and superior performance of our method in both quantitative and qualitative evaluations, compared with the state-of-the-art methods.

Index Terms—Distortion rectification, model-free framework, dual-stream feature learning, deep learning.

Manuscript received July 23, 2019; revised December 7, 2019; accepted December 28, 2019. Date of publication January 17, 2020; date of current version January 30, 2020. This work was supported in part by the National Natural Science Foundation of China under Grant 61772066 and Grant 61972028 and in part by the Open Project Program of State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, under Grant VRLAB2019B05. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Lisimachos P. Kondi. (Corresponding author: Chunyu Lin.) Kang Liao, Chunyu Lin, and Yao Zhao are with the Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China, and also with the Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing 100044, China (e-mail: [email protected]; [email protected]; [email protected]). Mai Xu is with the School of Electronic and Information Engineering, Beihang University, Beijing 100191, China (e-mail: [email protected]). Digital Object Identifier 10.1109/TIP.2020.2964523

I. INTRODUCTION

DISTORTION rectification is a classical and essential technique of image processing with applications in various fields, such as structure from motion (SfM) [1]–[3] and scene understanding [4]–[6]. However, because of the unknown camera model and limited view, it is challenging to recover the real geometric scene from a distorted image without any additional settings. Methods of distortion rectification can be divided into two categories: traditional vision methods and learning-based methods. The traditional vision methods remove the distortion based on handcrafted features [10]–[18]. However, these methods perform poorly on images captured by other cameras and cannot correct the distortion without manual intervention and scene constraints. The learning-based methods have shown promising results in fundamental image processing problems, such as salient object detection [19]–[21], image super-resolution [22]–[24], and depth estimation [25], [26]. Compared with these active research areas, distortion rectification has gained little attention in deep learning [7]–[9]. The learning-based methods can be further divided into two sub-categories: parameter-based methods and parameter-free methods. The parameter-based methods [7] and [8] estimate the distortion parameters using convolutional neural networks (CNNs) in terms of the one-parameter division and fisheye camera models, respectively. Our previous work, DR-GAN [9], learns the mapping function between the distorted and undistorted image rather than estimating parameters, achieving one-stage rectification for the even-order distortion model.

We find that learning-based methods have the following problems. (1) They heavily rely on the assumption of a specific camera model. Trained using a synthesized distorted image dataset based on the specific camera model, their networks cannot flexibly expand to other models. Besides, the ranges of distortion parameters are limited in their datasets, leading to an inferior rectification result even on the same camera model when the parameter is out of the range. (2) In the parameter-based methods, the network is pretrained on ImageNet [27], which only contains natural images without distortion, and then it directly estimates the parameters of a distorted image in which the detailed and semantic features were geometrically changed. Thus, neural networks struggle with feature extraction in regard to the distortion. Moreover, these methods suffer from an imbalanced problem during the training process because of the heterogeneousness of the distortion parameters. (3) In the parameter-free methods, the network lacks the supervision of the distortion information; thus, it cannot discriminate the difference between similar geometric attributes. While these methods focus on mapping function learning and correct the distortion with higher efficiency, they cannot obtain the distortion parameters that are important for SfM and camera calibration. (4) Previous learning-based methods did not explicitly investigate the local structural features or handcrafted features that are widely leveraged in the traditional vision methods.
However, these local features are crucial for the structural analysis and geometric recovery.

In this paper, we unify different camera models and further redefine the pipeline of single-shot distortion rectification. Specifically, we transfer the key algorithm from distortion parameter estimation into distortion distribution map (DDM) construction and geometry recovery. The proposed DDM intuitively describes the global distortion features of a distorted image, which is independent of the heterogeneous parameters and camera models. Each pixel in DDM has a value representing the distortion degree, which is similar to the semantic segmentation map and depth map. We show two examples that include the distorted image and DDM in Fig. 1. A comparison of the previous learning-based methods [7]–[9] and our proposed method for distortion rectification is shown in Fig. 2. Bridged by DDM, our framework can achieve model-free and parameter-free training; moreover, it is suitable for various camera models. More importantly, the imbalanced problem of multiple parameter regression is well addressed because of this new estimated target. As a benefit of the supervision of the estimated distortion information, this framework is able to more accurately recover the real scene. It is worth noticing that although our framework is parameter-free and pays no attention to distortion parameter estimation, we can easily calculate these parameters using the constructed DDM.

Fig. 1. Distortion distribution map (DDM) intuitively describes the global distortion feature of an image and is similar to a pixel-wise map such as the semantic segmentation map and depth map. Most importantly, DDM is independent of the camera models and distortion parameters, facilitating a model-free and parameter-free algorithm for general distortion rectification. We show two examples of the distorted image and DDM from left to right.

Moreover, we combine the advantages of the traditional vision and learning-based methods for distortion rectification. Specifically, traditional methods leverage local handcrafted features (such as points and curves) to calculate the distortion parameters; however, they cannot obtain global scene perception. Instead, learning-based methods utilize neural networks to learn the global semantic features while ignoring the local structural features that are crucial for the structural analysis and geometric recovery. Therefore, we present a dual-stream feature learning structure, reasoning with both the local structural and global semantic features. Because of the sparsity and grayscale attributes of handcrafted features, we design a novel network to efficiently learn the structural information in regard to the distortion, based on the point cloud learning network PointNet [28]. Subsequently, a multimodal attention fusion module is exploited to attentively fuse the local structural and global semantic features, guiding the meaningful interaction of different modal features. Finally, a module named distortion remover takes the hybrid features as input and recovers the real geometric distribution of a distorted image. Experimental results demonstrate the excellent generalization and robustness of our method, as well as a significant performance improvement compared with the state-of-the-art methods.

To the best of our knowledge, this is the first general model-free framework for single-shot distortion rectification that is capable of handling various types of distortions. The contributions of this paper can be summarized as follows:

1) We redefine the learning pipeline and present a general model-free and parameter-free framework for single-shot distortion rectification.

2) We construct DDM that intuitively indicates the global distortion features of an image. Bridged by DDM, the distortion parameter estimation problem is transferred into DDM construction. Furthermore, DDM can properly address the imbalanced problem in the parameter-based methods as well as provide the supervision of the distortion information and the omitted distortion parameters for parameter-free methods.

3) We present a dual-stream feature learning structure, reasoning with both the local structural and global semantic features. For the meaningful interaction between local and global features, a multimodal attention fusion module is introduced.

The remainder of the paper is organized as follows. Related work is reviewed in Section II. Preliminaries and the proposed model-free distortion rectification framework are presented in Section III and Section IV, respectively. Section V describes the experiments. The conclusion and future studies are discussed in Section VI.

II. RELATED WORK

Distortion rectification covers a broad range of topics in image processing and computer vision, including object recognition, motion analysis, and scene understanding. In this section, we briefly review the existing traditional vision methods and learning-based methods for distortion rectification.

A. Traditional Vision Distortion Rectification

Traditional methods for distortion rectification have focused on handcrafted feature detection and distortion parameter estimation. [10]–[12], [29] leveraged a planar calibration pattern to remove the distortion in images. This pattern has known metric points, corners, and blocks, which need to be captured from different views. However, these methods cannot remove the distortion in images captured by other

Fig. 2. Comparison of the previous learning-based methods and our proposed method for the distortion rectification. (a) The previous learning-based methods
[7], [8], and [9] (from the top to the bottom) only focus on specific camera models: the one-parameter division model, fisheye camera model, and even-order
distortion model, respectively. They cannot adaptively correct the other types of the distortion owing to the model and domain limitations. (b) By constructing
the distortion distribution map (DDM) and unifying different domains, our model-free framework is capable of achieving various distortion rectifications.

cameras and require cumbersome manual operations. The line-based methods [13]–[18] achieve distortion rectification using detected lines or curves, following the principle that a straight line has to be straight, no matter the way of the projection [30]. However, these methods obtain inferior results on images with a lack of lines, thus failing to extend to real scenes. All the above methods are built for specific camera models, showing poor generalization abilities to other camera models. In addition, with traditional vision methods, it is nearly impossible to automatically correct the distortion in a single image without manual interventions and scene limitations.

B. Learning-Based Distortion Rectification

In contrast to the traditional vision distortion rectification methods, few studies have focused on learning-based distortion rectification. Rong et al. [7] exploited CNNs to correct the radial distortion in an image for the first time. They trained their neural network using a synthesized distorted image dataset in terms of the one-parameter division model [31] and classified the different distortions into 401 categories. However, this method cannot correct a strong distortion because of the simplified camera model and over-discrete range of distortion parameters. To address the above issue, Yin et al. [8] presented a multicontext collaborative network for strong distortion rectification in fisheye images. Based on the fisheye camera model [32], they introduced a scene parsing network to provide semantic information for more accurate rectification. However, the rectification results heavily rely on the performance of the scene parsing network, and this network imposes increased memory and computation requirements on the rectification system. We refer to the aforementioned two methods as parameter-based methods, which directly estimate the distortion parameters from an image using CNNs. These methods suffer from an imbalanced problem when networks regress the heterogeneous distortion parameters simultaneously. The other type of learning-based method is parameter-free. Liao et al. [9] transferred the distortion parameter estimation into mapping function learning and achieved one-stage rectification according to the even-order distortion model [33]. However, their learning process lacked the distortion information supervision, so that it could not intuitively discriminate similar distortions.

Importantly, neural networks exhibit poor transfer abilities across different domains. All of the above networks are trained using synthesized distorted image datasets derived from specific camera models, which causes inferior results when the networks are tested on other camera models. Even for the same camera model, data that is out of the given parameter range will also confuse these networks. Therefore, until now, there has been no general model-free framework for single-shot distortion rectification.

III. PRELIMINARIES

In this section, we list a number of classical distortion camera models; then, we unify these models into the same domain using the proposed distortion distribution map.

A. Distortion Camera Models

Suppose that a point p_d = [x_d, y_d]^T is in the distorted image and a corresponding point p_r = [x_r, y_r]^T is in the rectified image. The general projection function f of these two points can be expressed as a polynomial [32]:

f(x_d, y_d) = k_1 + k_2 r_d + k_3 r_d^2 + \cdots + k_N r_d^N,   (1)

where d_p = [k_1, k_2, k_3, \cdots, k_N] are the distortion parameters, and r_d is the distance between p_d and the distortion center p_c = [x_c, y_c]^T:

r_d = \sqrt{(x_d - x_c)^2 + (y_d - y_c)^2}.   (2)
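As an illustration of Eqs. 1 and 2, the short NumPy sketch below evaluates the radial distance and the polynomial distortion term for every pixel of an image grid. The helper name, the grid size, and the truncation to four coefficients are choices of this sketch, not values from the paper.

```python
import numpy as np

def radial_polynomial(h, w, center, coeffs):
    """Evaluate r_d (Eq. 2) and f = k1 + k2*r + k3*r^2 + ... (Eq. 1,
    truncated to len(coeffs) terms) for every pixel of an h x w grid."""
    xc, yc = center
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    r = np.sqrt((xs - xc) ** 2 + (ys - yc) ** 2)        # Eq. 2
    f = sum(k * r ** i for i, k in enumerate(coeffs))   # Eq. 1
    return r, f

# Example: a 256x256 image with its distortion center at the image center.
r_d, f_val = radial_polynomial(256, 256, (128.0, 128.0), [1.0, 1e-3, -2e-6, 1e-9])
```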

Based on Eq. 1, various distortion camera models can be derived as follows:

x_r = x_d (1 + k_1 r_d^2 + k_2 r_d^4 + k_3 r_d^6 + k_4 r_d^8 + \cdots)
y_r = y_d (1 + k_1 r_d^2 + k_2 r_d^4 + k_3 r_d^6 + k_4 r_d^8 + \cdots),   (3)

x_r = x_d (1 + k_1 r_d + k_2 r_d^3 + k_3 r_d^5 + k_4 r_d^7 + \cdots)
y_r = y_d (1 + k_1 r_d + k_2 r_d^3 + k_3 r_d^5 + k_4 r_d^7 + \cdots).   (4)

These two models are named the even-order and odd-order polynomial model, respectively. They can be further expanded to the fisheye camera model:

x_r = x_d (k_1 r_d + k_2 r_d^3 + k_3 r_d^5 + k_4 r_d^7 + \cdots)
y_r = y_d (k_1 r_d + k_2 r_d^3 + k_3 r_d^5 + k_4 r_d^7 + \cdots).   (5)

As suggested in [34], for the unified omnidirectional camera model, the projection function f always satisfies the following condition:

\left. \frac{df}{dr} \right|_{r=0} = 0.   (6)

Therefore, k_2 = 0 in Eq. 1, and the unified omnidirectional camera model can be described as follows:

x_r = x_d (k_1 + k_3 r_d^2 + k_4 r_d^3 + \cdots)
y_r = y_d (k_1 + k_3 r_d^2 + k_4 r_d^3 + \cdots).   (7)

The above polynomial camera models are suitable for small distortions. However, more distortion parameters are required for a strong distortion. Fitzgibbon [35] presented the division model as an alternative camera model:

x_r = \frac{x_d}{1 + k_1 r_d^2 + k_2 r_d^4 + k_3 r_d^6 + k_4 r_d^8 + \cdots}
y_r = \frac{y_d}{1 + k_1 r_d^2 + k_2 r_d^4 + k_3 r_d^6 + k_4 r_d^8 + \cdots}.   (8)

While requiring fewer parameters for a strong distortion, the division model is still hard to fit approximately because of the limited information from a single image. To further simplify the difficulty of the distortion parameter estimation, [7], [16], [17] employed the one-parameter division model for single-image distortion rectification:

x_r = \frac{x_d}{1 + k_1 r_d^2}
y_r = \frac{y_d}{1 + k_1 r_d^2},   (9)

where the distortion parameter k_1 determines the type of the distortion. When k_1 is positive, it generates the pincushion distortion; when k_1 is negative, it generates the barrel distortion. Although the introduced camera models look similar, the values of each distortion parameter are quite different. Moreover, the state-of-the-art methods only consider several parameters to reduce the estimation difficulty. For example, the 4th-order polynomial model has been leveraged in [8], [9], [34], leading to underfitting with respect to the real camera model.

Fig. 3. DDM intuitively indicates the global distortion features of a distorted image, which is independent of the camera models and distortion parameters. Therefore, the different distorted images derived from different camera models can be unified into the same domain.

B. Distortion Distribution Map

As mentioned in Section II, the state-of-the-art methods focus on specific camera models and fail to extend to other camera models. Motivated by this fact, we construct DDM to unify the various camera models into the same domain.

Considering that D(x_i, y_j) is the value of a pixel (x_i, y_j) in DDM, we can obtain the expression as follows:

D(x_i, y_j) = \frac{x_r^i}{x_d^i} = \frac{y_r^j}{y_d^j} = k_1 + k_2 r_d + k_3 r_d^2 + \cdots,   (10)

where the value of a pixel in DDM is denoted as the distortion level, which is the ratio between the two coordinates in the distorted image and the rectified image. During the construction of DDM, we utilize the position (x̂, ŷ) of a pixel in the distorted image and the distortion parameters to calculate the distortion level of a point (x, y) in DDM, where x̂ = x and ŷ = y. Therefore, the width and height of DDM and the distorted image are equal, and they are point-to-point aligned. A larger absolute value of the distortion level means a stronger distortion of the image, which is jointly determined by the distortion parameters and the location of the pixel.

DDM intuitively indicates the global distortion features of a distorted image, which is model-free and parameter-free. Therefore, different distorted images derived from various camera models can be covered in this kind of map, as shown in Fig. 3. In addition, DDM is theoretically capable of covering all distortion parameters rather than an approximate fitting scheme such as [7]–[9], [16], [17]. Thus, as a benefit that all the distortion levels are homogeneous, DDM properly addresses the imbalanced problem in the heterogeneous parameter regression.

It is worth mentioning that our framework is parameter-free: it does not directly estimate the distortion parameters, but we can still easily calculate these parameters using the constructed DDM. Suppose that the distortion levels [d_1, d_2, \cdots, d_n] at the locations [r_1, r_2, \cdots, r_n] are available in the DDM. Then, the distortion parameters [k_1, k_2, \cdots, k_n] can be obtained as follows based on Eq. 10:

[k_1 \; k_2 \; \cdots \; k_n] = \begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_n \end{bmatrix}^T \begin{bmatrix} r_1^0 & r_1^1 & \cdots & r_1^{n-1} \\ r_2^0 & r_2^1 & \cdots & r_2^{n-1} \\ \vdots & \vdots & \ddots & \vdots \\ r_n^0 & r_n^1 & \cdots & r_n^{n-1} \end{bmatrix}^{-1}.   (11)
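To make Eqs. 10 and 11 concrete, the NumPy sketch below builds a DDM for a given set of polynomial coefficients and then recovers those coefficients from a few sampled distortion levels by solving the linear system in its column-vector form d = V k (equivalent in intent to Eq. 11). The function names, the centered distortion center, and the four-coefficient example are assumptions of this sketch, not the authors' released code.

```python
import numpy as np

def build_ddm(h, w, coeffs, center=None):
    """Distortion distribution map (Eq. 10): D(x, y) = k1 + k2*r + k3*r^2 + ...
    coeffs = [k1, ..., kn]; the DDM has the same size as the distorted image."""
    if center is None:
        center = ((w - 1) / 2.0, (h - 1) / 2.0)
    xc, yc = center
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    r = np.sqrt((xs - xc) ** 2 + (ys - yc) ** 2)
    return sum(k * r ** i for i, k in enumerate(coeffs))

def recover_parameters(radii, levels):
    """Invert Eq. 10 at n sampled radii: solve d = V k, where
    V[i, j] = r_i^j (j = 0..n-1) is the Vandermonde matrix of Eq. 11."""
    radii = np.asarray(radii, dtype=np.float64)
    V = np.vander(radii, N=len(radii), increasing=True)
    return np.linalg.solve(V, np.asarray(levels, dtype=np.float64))

# Round trip: build a DDM, sample distortion levels, recover [k1, ..., k4].
k_true = [1.0, 2e-3, -1e-5, 3e-8]
ddm = build_ddm(256, 256, k_true)
radii = np.array([10.0, 40.0, 80.0, 120.0])
levels = [sum(k * r ** i for i, k in enumerate(k_true)) for r in radii]
print(recover_parameters(radii, levels))   # ~ [1.0, 2e-3, -1e-5, 3e-8]
```

With more samples than coefficients, np.linalg.lstsq would give a least-squares fit instead of an exact solve.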

IV. PROPOSED MODEL-FREE DISTORTION RECTIFICATION FRAMEWORK

In this section, we demonstrate the proposed model-free distortion rectification framework in detail. The overall architecture of this framework is illustrated in Fig. 4, including three modules: SSD (semantics, structure, and distortion) learning, multimodal attention fusion, and distortion rectification. Specifically, we first introduce the SSD learning module that perceives the semantic, structural, and distorted information with respect to an image, achieved by three special neural networks. We then describe the multimodal attention fusion module that attentively aggregates the multimodal features. Finally, we explain the distortion rectification module and the training loss of the whole framework.

Fig. 4. Overview of the proposed model-free distortion rectification framework. This framework consists of three modules: SSD (semantics, structure, and distortion) learning, multimodal attention fusion, and distortion rectification. In the SSD learning module, there are three special neural networks designed for the perception of the semantic, structural, and distorted information. The SSD learning module can be regarded as a combination of the proposed dual-stream feature learning and distortion learning modules. To efficiently aggregate the multimodal features from the above networks, we fuse these features with an attention mechanism. Finally, we utilize the refined hybrid features to recover the real geometric shape. The skip connections are marked with gray dashed arrows.

A. SSD Learning

1) Distortion Perception: Previous learning-based methods ignored the prior knowledge of the distortion in an image and directly estimated the distortion parameters using CNNs. However, the heterogeneousness of these parameters influences the performance of their learning models and causes an imbalanced problem during the regression. We will show the experiment related to this problem in Section V-B. Having observed this fact, we design a specific network to intuitively perceive the prior knowledge of the distortion that is regular in a distorted image.

As mentioned in Section III-B, DDM describes the global distortion features of a distorted image, which can unify the different domains of different camera models. Therefore, we first construct the DDM of a distorted image rather than roughly feed the image into an existing network pretrained on ImageNet [27], which contains no distorted images. The distortion learner is a fully convolutional neural network such as the U-Net [36], consisting of an encoder and a decoder, with skip connections between the encoder and decoder features at the same spatial resolution. Specifically, we have 5 hierarchies in the encoder. Each hierarchy has a convolutional layer with 3×3 kernels and 2 strides (2× downsample), followed by a BatchNormalization layer using LeakyReLU activation (α = 0.2). The number of filters per hierarchy is as follows: 64, 128, 256, 512, and 512. There are 5 hierarchies in the decoder part. At the beginning of each hierarchy, a bilinear upsampling layer is leveraged to increase the spatial dimension by a factor of 2, followed by convolutional, BatchNormalization, and LeakyReLU layers. Note that the channel of the DDM is 1, the same as the depth map; thus, the number of filters in the last convolutional layer equals 1.
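A minimal PyTorch sketch of a distortion learner with this shape is given below. The module and variable names are placeholders, and details the paper does not pin down (padding, the exact placement of the skip connections in the decoder, the output activation) are assumptions of the sketch.

```python
import torch
import torch.nn as nn

def enc_block(c_in, c_out):
    # 3x3 conv, stride 2 (2x downsample), BatchNorm, LeakyReLU(0.2)
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                         nn.BatchNorm2d(c_out), nn.LeakyReLU(0.2))

def dec_block(c_in, c_out):
    # bilinear 2x upsample, 3x3 conv, BatchNorm, LeakyReLU(0.2)
    return nn.Sequential(nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                         nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.BatchNorm2d(c_out), nn.LeakyReLU(0.2))

class DistortionLearner(nn.Module):
    """U-Net-style network: RGB distorted image in, 1-channel DDM out."""
    def __init__(self):
        super().__init__()
        widths = [64, 128, 256, 512, 512]
        self.encoder = nn.ModuleList(
            [enc_block(c_in, c_out) for c_in, c_out in zip([3] + widths[:-1], widths)])
        # The decoder mirrors the encoder; skip connections double the input channels.
        self.decoder = nn.ModuleList(
            [dec_block(512, 512), dec_block(512 + 512, 512), dec_block(512 + 256, 256),
             dec_block(256 + 128, 128), dec_block(128 + 64, 64)])
        self.head = nn.Conv2d(64, 1, 3, padding=1)   # single-channel DDM

    def forward(self, x):
        skips = []
        for enc in self.encoder:
            x = enc(x)
            skips.append(x)
        skips = skips[:-1][::-1]                     # all but the deepest, reversed
        x = self.decoder[0](x)
        for dec, skip in zip(self.decoder[1:], skips):
            x = dec(torch.cat([x, skip], dim=1))
        return self.head(x)

ddm_pred = DistortionLearner()(torch.randn(1, 3, 256, 256))   # -> (1, 1, 256, 256)
```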

2) Distortion-Guided Semantics Extraction: Unlike the previous learning-based methods, we do not pretrain our model on ImageNet [27], as the domains of the natural image and distorted image are different. Instead, we exploit the generated DDM to guide the semantics extraction of distorted images and train the semantics learner from scratch. Compared with the distortion learner, the semantics learner only consists of an encoder that outputs the global semantic features of a distorted image. Specifically, we have 8 hierarchies in the encoder. Each hierarchy has a convolutional layer with 4×4 kernels and 2 strides (2× downsample), followed by a BatchNormalization layer using LeakyReLU activation (α = 0.2). The number of filters per hierarchy is as follows: 64, 128, 256, 512, 512, 512, 512, and 512. At the end of the encoder, the last convolutional layer is followed by a fully connected layer with 512 units, which outputs a feature vector in regard to the semantic information. As a benefit of the guidance of the estimated DDM, our framework can extract more efficient global semantic features for accurate distortion rectification. The related ablation study is presented in Section V-C.

3) Structure Analysis: Previous learning-based methods only considered the global semantic features of a distorted image, while ignoring the local handcrafted features that are important for the structural analysis and geometric recovery. Therefore, we design a special network to learn the local handcrafted features.

In analogy to the traditional vision methods, we also use Canny edges [37] as the original structural features, which are provided as a binary map. However, this map is very sparse (approximately 5% points in total) and has no RGB information; thus, the CNNs-based feature extractor is not suitable for this type of data. PointNet [28] and PointNet++ [38] are implemented for point cloud classification and segmentation, showing more promising performance on 3D sparse data learning than other networks [39], [40], and 3D CNNs. Motivated by this work, we propose a structure learner for the handcrafted feature learning. We first randomly discretize the Canny edge map into a 2D canny point map that contains N points in total: {P_i | i = 1, 2, ..., N}, where the coordinate of each point is (x_i, y_i). By contrast, the 3D point cloud can be formulated as {P̂_i | i = 1, 2, ..., M}, where each point P̂_i is a vector of its (x̂_i, ŷ_i, ẑ_i) coordinate, plus extra feature channels such as intensity and color. Following the universal continuous set function approximation in PointNet [28], we can transfer this principle into our structure learner:

f(P_1, P_2, ..., P_N) = γ( \max_{i=1,...,N} \{h(P_i)\} ),   (12)

where γ and h are typically multi-layer perceptrons (MLPs). We use the 1D convolutional layer as an MLP in the structure learner.

Fig. 5. Architecture of the structure learner for the structural feature learning. The structure learner takes the 2D canny points as the input, which include the original structural information of the distortion. Subsequently, there are two transform stages and three abstract stages. The transform stages randomly rotate the 2D canny points in the data and feature domains, and the abstract stages progressively extract the structural information of the 2D canny points into a feature vector.

There are two transform stages and three abstract stages in the structure learner. The transform stages consist of two 2D T-Nets [28] and randomly rotate the 2D canny points in the domains of data and features. Compared with the 3D T-Net in PointNet [28], we learn a rotation matrix with the size of 2×2 in the first transform stage. In the second transform stage, the size of the feature rotation matrix is 64×64. Each T-Net consists of three 1D convolutional layers with 1×1 kernels, a maxpooling layer, and three fully connected layers. The number of filters per 1D convolutional layer is 64, 128, and 1024. The numbers of units per fully connected layer in the first and second T-Net are 512, 256, and 4; and 512, 256, and 4096 (64×64), respectively. BatchNormalization layers and ReLU activations are employed for each layer, except for the last fully connected layer, in both T-Nets. Subsequently, the abstract stages consist of shallow and deep abstract modules, and a header module. The shallow module takes the 2D canny points transformed by the T-Net as the input and outputs a feature with size n×64. Then, the feature transformed by the feature T-Net is fed into the deep module to further generate a high-dimensional feature with size n×1024. The numbers of filters per 1D convolutional layer with 1×1 kernels for the shallow and deep modules are 64 and 64, and 64, 128, and 1024, respectively. The header module is utilized to produce a feature vector from the deep abstract module, including three fully connected layers with the following numbers of units: 1024, 512, and 256.

In contrast to the CNNs-based feature extractor that excels at dense RGB image learning, our proposed structure learner considers the specific attributes of the handcrafted features, i.e., the sparsity and grey values. This network takes the raw 2D canny points as the input and progressively extracts the structural features at the point level, inspired by PointNet [28], which is specially designed for 3D point cloud learning. Therefore, the structure learner effectively perceives the local structural features, providing information that is more beneficial for the distortion rectification. A performance comparison of the CNNs-based feature extractor and our proposed structure learner is presented in Section V-C.
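The PyTorch sketch below shows the overall shape of such a point-based structure learner: Canny points are sampled into an N×2 array and passed through shared 1D-convolution MLPs with a global max pooling, as in Eq. 12. The two T-Nets are omitted for brevity; the layer widths follow the numbers above, while the helper names, Canny thresholds, and point count are assumptions of the sketch.

```python
import cv2
import numpy as np
import torch
import torch.nn as nn

def sample_canny_points(image_bgr, n_points=1024, low=100, high=200):
    """Detect Canny edges and randomly sample n_points (x, y) coordinates."""
    edges = cv2.Canny(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY), low, high)
    ys, xs = np.nonzero(edges)
    idx = np.random.choice(len(xs), n_points, replace=len(xs) < n_points)
    return np.stack([xs[idx], ys[idx]], axis=1).astype(np.float32)   # (N, 2)

class StructureLearner(nn.Module):
    """PointNet-style learner (Eq. 12): shared per-point MLPs + global max pool.
    The 2x2 and 64x64 T-Nets described in the text are omitted in this sketch."""
    def __init__(self):
        super().__init__()
        def mlp(c_in, c_out):
            return nn.Sequential(nn.Conv1d(c_in, c_out, 1), nn.BatchNorm1d(c_out), nn.ReLU())
        self.shallow = nn.Sequential(mlp(2, 64), mlp(64, 64))                 # h: n x 64
        self.deep = nn.Sequential(mlp(64, 64), mlp(64, 128), mlp(128, 1024))  # n x 1024
        self.header = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(),
                                    nn.Linear(1024, 512), nn.ReLU(),
                                    nn.Linear(512, 256))                      # gamma

    def forward(self, pts):                 # pts: (B, N, 2)
        x = pts.transpose(1, 2)             # (B, 2, N) for Conv1d
        x = self.deep(self.shallow(x))      # (B, 1024, N)
        x = torch.max(x, dim=2).values      # symmetric max pooling over the points
        return self.header(x)               # (B, 256) structural feature vector

v_str = StructureLearner()(torch.randn(4, 1024, 2))   # -> (4, 256)
```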

B. Multimodal Attention Fusion

After the SSD learning, we can obtain the global semantic feature mixed with the distortion information and the local structural feature. One way to fuse these two features is using the concatenation operation. Suppose that the feature vector V_sem is generated by the semantics learner, and the feature vector V_str is generated by the structure learner. Then, the plain fusion by concatenation ⊕ can be expressed as follows:

V_hyb = V_sem ⊕ V_str,   (13)

where V_hyb is the hybrid feature that consists of the semantic and structural features. However, the types of these fused features are different, playing ambiguous roles in the distortion rectification. Thus, such a simple operation for this fusion task would hinder the complementary message passing. To address the above issue, we introduce a multimodal attention fusion module to guide the interaction between the semantic and structural features. Because the interaction of different modal information is not always meaningful, the attention mechanism can help neural networks autonomously learn to focus on or to omit message passing from other features. Specifically, the attention map M can be constructed as follows:

M = σ(fc(V_sem)),   (14)

where fc is a fully connected layer with 256 units, and the sigmoid function σ is used to normalize the value of M. Subsequently, the hybrid feature V_hyb derived from the multimodal attention fusion can be formulated as follows:

V_hyb = V_sem ⊕ (M ⊗ V_str),   (15)

where ⊗ is element-wise multiplication, and ⊕ is concatenation for the multimodal features.

Compared with the plain fusion in Eq. 13, the proposed multimodal attention fusion module guides the meaningful message passing between the local structural feature and the global semantic feature. Thus, our framework learns to automatically select the valid structural information and further achieves better rectification results.

C. Distortion Remover

At the end of the proposed framework, we leverage the hybrid feature V_hyb for the distortion rectification in a generation fashion. The architecture of this network can be treated as a decoder, consisting of 8 hierarchies, where each hierarchy has an upsampling layer (2× upsample), followed by a convolutional layer with 4×4 kernels and 1 stride. The number of filters per convolutional layer is as follows: 512, 512, 512, 512, 256, 128, 64, and 3. Each convolutional layer is followed by a BatchNormalization layer using LeakyReLU activation (α = 0.2), except that the activation of the last convolutional layer is the Tanh function. To promote the complementary effect of low and high level features, we add skip connections between the semantics learner and the distortion remover at the same spatial resolution.

D. Training Loss

Given the input distorted image I^d, rectified image I^r, and the ground truth of the undistorted image I^u, the training loss for our framework is a linear combination of four terms:

L = λ_d L_d + λ_r L_r + λ_p L_p^{l2h} + λ_a L_a.   (16)

1) Distortion Distribution Loss: As mentioned in Section III, DDM intuitively indicates the global distortion features of a distorted image, guiding the semantic perception of neural networks. Suppose that the DDM generated by the distortion learner and the ground truth DDM are denoted as M̂ and M, respectively. The distortion distribution loss can be defined as

L_d = \frac{1}{WH} \sum_{x=1}^{W} \sum_{y=1}^{H} \| M̂_{x,y} - M_{x,y} \|_1,   (17)

where W and H are the width and height of the DDM, respectively.

2) Reconstruction Loss: When the proposed framework produces a rectified image, one straightforward way is to minimize the reconstruction loss between I^r and I^u:

L_r = \frac{1}{WH} \sum_{x=1}^{W} \sum_{y=1}^{H} \| I^r_{x,y} - I^u_{x,y} \|_1.   (18)

3) Low-to-High Perceptual Loss: The reconstruction loss describes the difference between the generated image and the ground truth at the pixel level. However, it causes blur artifacts in the generated image. To solve this problem, we exploit a perceptual loss [41] at the feature level, to preserve details of the predictions and make rectified images sharper:

L_p = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \| φ_{i,j}(I^r)_{x,y} - φ_{i,j}(I^u)_{x,y} \|_2,   (19)

where W_{i,j} and H_{i,j} are the width and height of the feature map φ_{i,j}, which is derived from the j-th convolution (after activation) before the i-th maxpooling layer in the VGG19 network [42]. As suggested in [9], the low-to-high perceptual loss that jointly considers the shallow and deep feature maps produces more reasonable rectification results than a vanilla perceptual loss. Thus, we also implement the low-to-high perceptual loss in our framework:

L_p^{l2h} = λ_l L_p^l + (1 - λ_l) L_p^h,   (20)

where L_p^l and L_p^h are the low and high perceptual loss, respectively. In analogy to [9], we choose the activations from the VGG_{1,2} and VGG_{5,2} convolutional layers to obtain the low-level and high-level perceptual loss, respectively. Here, λ_l is a factor to balance L_p^l and L_p^h.

4) Adversarial Loss: The adversarial loss is widely used in the image generation task [43]–[45]; it helps to produce more reasonable and consistent rectified images. Therefore, we introduce a network D to discriminate between the rectified image generated by the distortion remover and the ground truth. This network consists of 6 convolutional layers with 4×4 kernels and 2 strides. The number of filters per convolutional layer is 64, 128, 256, 512, 512, and 512. BatchNormalization and LeakyReLU activation (α = 0.2) are applied to each layer. After the last convolutional layer, there are two fully connected layers with 1024 and 1 units, predicting the input image as true or false. The adversarial loss can be defined as follows:

L_a = - \sum_{n=1}^{N} \log D_{θ_D}(I^r).   (21)

Every module of our model-free framework is differentiable; thus, the whole learning model can be trained end-to-end.
distortion learner and the ground truth of DDM are denoted thus, the whole learning model can be end-to-end trained.

V. EXPERIMENTS

In this section, to demonstrate the distortion rectification performance of the proposed framework, we first describe the synthesized distorted image dataset for neural network training and the implementation details. Then, we analyze the limitations of previous learning-based methods from two aspects: the poor generalization ability to different camera models and the imbalanced problem during the parameter estimation. Moreover, we report an ablation study on the distortion rectification performance obtained with the different modules of the SSD learning introduced in Section IV-A. Finally, we compare the proposed method with the state-of-the-art methods, in both the quantitative measurement and the visual appearance.

A. Dataset and Implementation Details

We generate a synthesized distorted image dataset consisting of various camera models and distortion parameters, where each distorted image has a corresponding canny edge map, DDM, and rectified image. We introduced the various camera models containing different distortion parameters in Section III. In analogy to [7]–[9], [17], [18], our proposed method is also designed for the real scene case, including indoor and outdoor scenes. The original images of the dataset are derived from the MS-COCO dataset [46] and are resized to 256×256. Specifically, there are 16 types of distortion configurations in our dataset: the even-order model includes 2, 4, 6, and 8 distortion parameters; the odd-order model includes 2, 3, 5, and 7 distortion parameters; the division model includes 1, 2, 3, and 4 distortion parameters; and the fisheye model includes 3, 4, 5, and 6 distortion parameters. To the best of our knowledge, this is the first synthesized distorted image dataset covering various scenes, camera models, and distortion parameters. Considering the off-axis distortion component in real cameras, we randomly perturb the distortion center of a distorted image and calculate its corresponding DDM. In addition, we provide the handcrafted features for the local structure feature learning, which are ignored by the previous learning-based methods [7]–[9] but are crucial for the structural analysis and geometric recovery. Therefore, neural networks gain more diversity and information about the scenes and distortions using our presented dataset, and thus achieve a more excellent generalization ability for distortion rectification. Some samples of our synthesized distorted image dataset are illustrated in Fig. 6.

Fig. 6. Samples from the presented synthesized distorted image dataset. We show the distorted image, canny edges, DDM, and rectified image from left to right. Our dataset covers indoor and outdoor scenes as well as different types of camera models and distortion parameters. Moreover, the available DDM and handcrafted features provide rich distortion information and local structure features for more accurate distortion rectification.

We train our framework using the above synthesized distorted image dataset. For the optimization process, we leverage Adam [47] as the optimizer and set the basic learning rate of the framework to 1 × 10^{-4}. As mentioned in Section IV-D, to make the rectified image look more reasonable and consistent, we introduce a network D to adversarially discriminate the rectified image and the corresponding undistorted image. During the training process, we implement 5 gradient descent steps on D, following the principle of [48], for each iteration.
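A sketch of one training iteration under this schedule might look as follows. Here, total_loss refers to the composite objective sketched after Section IV-D, and the generator/discriminator interfaces, batch format, and discriminator loss are assumptions made only for illustration.

```python
import torch
import torch.nn.functional as F

def train_iteration(generator, discriminator, opt_g, opt_d, batch, total_loss, d_steps=5):
    """One iteration: several discriminator updates, then one generator update (Eq. 16)."""
    distorted, ddm_gt, undistorted = batch   # synthesized triplet from the dataset

    # 1) Several gradient descent steps on D per iteration, as in the implementation details.
    for _ in range(d_steps):
        with torch.no_grad():
            _, rectified = generator(distorted)
        d_real = discriminator(undistorted)
        d_fake = discriminator(rectified)
        loss_d = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
                 F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # 2) One update of the rectification framework with the composite loss.
    ddm_pred, rectified = generator(distorted)
    loss_g = total_loss(ddm_pred, ddm_gt, rectified, undistorted, discriminator(rectified))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()

# Optimizers would be created once with the learning rate reported above, e.g.:
# opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
# opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
```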

B. Analysis of Previous Learning-Based Methods

1) Performance on Other Camera Models: Previous learning-based methods make strong assumptions on specific camera models. For example, Rong et al. [7], Yin et al. [8], and Liao et al. [9] established their training datasets in terms of the one-parameter division, fisheye, and even-order camera models, respectively. However, these datasets contain no distorted images generated by other camera models, leading to inferior generalization abilities of the neural networks. To intuitively demonstrate the limitations of previous learning-based methods, we utilize various types of distorted images to test the rectification results. As shown in Fig. 7, we show three types of distorted image: the one-parameter division, even-order, and fisheye camera models, from top to bottom. Previous methods suffer from poor performance when they have to correct other types of distorted images, which do not exist in their constructed datasets based on a specific camera model. For instance, Rong et al. [7] aimed to correct the distorted image with respect to the one-parameter division camera model, and thus show an under-corrected result for a more complicated model such as the even-order model. Liao et al. [9] specially designed a network for the even-order camera model, performing poorly on the plain one-parameter division model with an over-corrected result. In addition, these two methods both obtain inferior rectification results for the fisheye model due to the significant difference of model domains. Therefore, previous methods cannot adaptively correct the other types of distortions, limiting their general application in practice.

Fig. 7. The generalization limitation of the previous learning-based methods for the distortion rectification. We show three types of distorted image: one-parameter division, even-order, and fisheye camera models from the top to bottom.

2) Imbalanced Parameters Regression: Previous learning-based methods [7], [8] estimate the distortion parameters of an image using CNNs; however, the global optimum of these parameters is not totally equal to the best rectification result due to the different roles of each parameter. As shown in Fig. 8, we consider the influences of the distortion parameters {k_1, k_2, k_3, k_4} on the global distortion distribution in regard to the even-order camera model. To be more specific, we draw different dl−r (distortion level–radius) curves in terms of Eq. 10 and select k_1 = k_2 = k_3 = k_4 = 500 after the quantization as the ground truth (GT). Then, we add a bias with the values of −400 and 400 on each distortion parameter, respectively. As the power of the radius grows, the distortion level becomes more sensitive to the change of the corresponding parameter in Eq. 10. For example, although we apply the same bias to each parameter, the bias of parameter k_4 mostly affects the global distortion distribution while k_1 shows the least influence. Therefore, it is not proper to simultaneously regress all parameters without any balance factor settings, which are crucial for recovering the real scene from the distorted image. We further conclude that searching for the global optimum of all estimated distortion parameters is not a reasonable choice for learning-based distortion rectification.

Fig. 8. Distortion level–radius (dl−r) curves in terms of the different roles of the distortion parameters. We show different biases with the values of −400 (a) and 400 (b) on each distortion parameter.

Moreover, we compare the different loss curves of the distortion parameters without and with the balanced setting, respectively. The comparison results are visualized in Fig. 9. As illustrated in Fig. 9(a), during the training process, the neural network pays more attention to the loss optimization of the distortion parameter k_1, making the losses of k_3 and k_4 hard to converge. After setting a balanced factor on k_4, all loss curves display better decreasing trends. However, manually choosing effective balanced factors for each distortion parameter is time consuming, as it requires comprehensively considering the different parameter ranges and distorted image datasets.

Fig. 9. Imbalanced problem during the heterogeneous distortion parameter regression. We show the original hybrid loss curve without balanced settings (a) and the balanced hybrid loss curve with respect to the distortion parameter k_4 (b).
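For readers who want to reproduce a Fig. 8-style sensitivity check, the sketch below plots the dl−r curves of Eq. 10 for the quantized ground truth k_1 = k_2 = k_3 = k_4 = 500 with a +400 bias applied to one parameter at a time. The radius range, normalization, and plotting details are arbitrary choices of the sketch, not the paper's exact setup.

```python
import numpy as np
import matplotlib.pyplot as plt

def distortion_level(r, ks):
    """dl(r) = k1 + k2*r + k3*r^2 + k4*r^3 (Eq. 10, truncated to four terms)."""
    return sum(k * r ** i for i, k in enumerate(ks))

r = np.linspace(0.0, 181.0, 200)        # pixel radius for a 256x256 image (sketch assumption)
k_gt = [500.0, 500.0, 500.0, 500.0]     # quantized ground truth from the text

plt.plot(r, distortion_level(r, k_gt), "k--", label="GT")
for idx in range(4):                    # bias each parameter in turn by +400
    k_biased = list(k_gt)
    k_biased[idx] += 400.0
    plt.plot(r, distortion_level(r, k_biased), label=f"k{idx + 1} + 400")
plt.xlabel("radius r"); plt.ylabel("distortion level"); plt.legend(); plt.show()
```

The higher-order curves diverge from the ground truth far more strongly, which is the imbalance the text describes.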

C. Ablation Study

To fully investigate the effects of the proposed SSD learning and multimodal attention fusion modules, we now introduce different variants of the proposed model-free framework. Firstly, we design a baseline that only contains the semantics learner and the distortion remover. Then, we add the CNNs-based handcrafted feature learner (CNNs_HFL), the structure learner (SL), the structure learner with the multimodal attention fusion (SL+MAF), the distortion learner (DL), the structure learner and distortion learner without the multimodal attention fusion (SL+DL), and the full SSD learning and multimodal attention fusion modules (Ours) into the baseline, respectively. The above variants of the proposed framework are trained with the same implementation details as discussed in Section V-A, and the rectified images are evaluated using PSNR and SSIM. As listed in Table I, due to the sparsity and grey values of the 2D handcrafted feature map, the CNNs-based feature extractor cannot fully learn the local structural feature of a distorted image. In contrast, our structure learner, motivated by the point cloud learning network PointNet [28], is specially designed for the handcrafted feature learning; thus it learns more effective distortion information and achieves better rectification performance than CNNs_HFL. For the meaningful interaction of the local structural features and global semantic features, we introduce a multimodal attention fusion module to help neural networks autonomously learn to focus on or to omit the message passing. Therefore, by fusing the local structural features and global semantic features with an attention mechanism, the SL+MAF method outperforms the SL method that only fuses the different modal features with concatenation. On the other hand, by constructing the DDM of a distorted image, the distortion learner explicitly provides the prior knowledge of the distortion distribution to our framework, significantly improving the performance of the distortion rectification. Compared with the baseline using only SL or DL, the baseline with SL+DL achieves better performance due to the comprehensive perception of both structural and semantic information. Furthermore, the complete framework comprising the SSD learning and multimodal attention fusion modules obtains the best results in both PSNR and SSIM.

TABLE I: An ablation study of the different variants of our model-free framework.

D. Comparisons to the State-of-the-Art Methods

In this part, we compare the proposed method with the state-of-the-art methods, including the traditional vision methods of Alemánflores et al. [17] and Santanacedrés et al. [18], which are based on the one-parameter division model and the two-parameter division model, as well as the learning-based methods of Rong et al. [7] and Liao et al. [9], which are based on the one-parameter division model and the even-order model. These different methods are evaluated using the quantitative measurement and the visual appearance as follows.

1) Quantitative Measurement: To demonstrate a quantitative comparison with the state-of-the-art methods, we evaluate the rectified images obtained from different methods using PSNR and SSIM. To be more specific, we exploit five test datasets to validate the performance of the different methods, i.e., the division (d), odd-order (o), even-order (e), and fisheye (f) camera models. The measurement results are demonstrated in Table II.

TABLE II: Quantitative measurements of our proposed method and the state-of-the-art methods, evaluated using PSNR and SSIM.

Fig. 10. Rectification results of the corrected synthetic distorted images. For each result, we show the distorted image, ground truth, corrected results of the compared methods: Alemánflores [17], Santanacedrés [18], Rong [7], and Liao [9], and corrected results of our proposed method, from left to right.
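For completeness, the per-image metrics used in this comparison can be computed as in the sketch below; scikit-image is one possible implementation (its function names are used here), and averaging over a full test set is left implicit.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(rectified, ground_truth):
    """PSNR and SSIM between one rectified image and its ground truth (uint8 RGB)."""
    psnr = peak_signal_noise_ratio(ground_truth, rectified, data_range=255)
    ssim = structural_similarity(ground_truth, rectified, channel_axis=-1, data_range=255)
    return psnr, ssim

# Example with random 256x256 RGB arrays standing in for real results.
gt = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
rect = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
print(evaluate_pair(rect, gt))
```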

Fig. 11. Rectification results of the corrected real distorted images. For each result, we show the distorted image, ground truth, and corrected results of the
compared methods: Alemánflores [17], Santanacedrés [18], Rong [7], and Liao [9], and corrected results of our proposed method, from left to right.

the different camera models into the same domain, our pro- results. Compared with the unsatisfactory corrections of previ-
posed method can correct different types of the distortion and ous methods, our method achieves the best visual appearance
achieves the best performance on all test datasets, in both on all the rectified results, excellently recovering the real
PSRN and SSIM. Suffering from the strong assumption of the scenes from the distorted geometric distributions.
specific camera model, the state-of-the-art methods perform
poorly on the distorted images derived from other camera VI. C ONCLUSION AND F UTURE W ORK
models. Under the specific camera model, due to the excellent
learning ability and the global semantic information analysis, In this paper, we consider the challenging problem of
our method significantly leads the traditional vision methods single-shot distortion rectification and further present a general
[17], [18] in the quantitative measurement. Compared with framework. Compared with the previous learning-based meth-
the learning-based methods [7], [9] that ignore the prior ods that only focus on the specific camera models thus failing
knowledge of the distortion and the local structural feature to be expanded to other models, our framework is model-
learning, our framework exhibits more promising rectification free and has better generalization ability. By constructing the
results because of the complete learning in regards to the distortion distribution map (DDM) of an image, we unify the
structure, semantics, and distortion. different types of camera models into the same domain. Sub-
2) Visual Appearance: We further compare our method sequently, DDM is utilized to guide the semantic perception,
with the state-of-the-art methods in the visual appearance. eliminating the imbalanced problem during the multiple para-
Firstly, we leverage the constructed synthesized distorted meters regression. Moreover, we propose a dual-stream feature
images to evaluate the above methods. The comparison results learning structure, to extract both the local handcrafted features
are illustrated in Fig. 10. Lacking of the global semantic and global semantic features. For the meaningful interaction
perception, traditional methods [17], [18] perform poorly of different features, a multimodal attention fusion module
on the distortion rectification under the various scenes and is introduced. Experimental results demonstrate the excellent
obtain few reasonable results when the distortions are not generalization ability of our framework. The proposed method
strong. Due to the specific assumption of the derived cam- significantly outperforms the state-of-the-art methods in both
era models, learning-based methods [7], [9] obtain inferior quantitative and qualitative evaluations. Our futher work will
rectifications facing to other models. By contrast, as benefit consider the self-supervised distortion rectification.
of the model-free framework and local-to-global learning, our
method achieves the best results on various scenes and camera R EFERENCES
models, accurately correcting the curves that are supposed to [1] T. Collins and A. Bartoli, “Planar structure-from-motion with affine
be straight. Therefore, the proposed algorithm shows better camera models: Closed-form solutions, ambiguities and degeneracy
generalization ability of the practical distortion rectifications. analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6,
pp. 1237–1255, Jun. 2017.
For the robustness validation, we then compare the different methods using real distorted images captured by wide-angle lenses. As shown in Fig. 11, it is difficult to correct the distorted structures using the previous methods [7], [9], [17], [18], which display under-rectification and over-rectification. By contrast, our method performs well on all the rectified results, excellently recovering the real scenes from the distorted geometric distributions.

VI. CONCLUSION AND FUTURE WORK

In this paper, we consider the challenging problem of single-shot distortion rectification and further present a general framework. Compared with the previous learning-based methods, which only focus on specific camera models and thus fail to be extended to other models, our framework is model-free and has better generalization ability. By constructing the distortion distribution map (DDM) of an image, we unify the different types of camera models into the same domain. Subsequently, the DDM is utilized to guide the semantic perception, eliminating the imbalance problem during the multiple-parameter regression. Moreover, we propose a dual-stream feature learning structure to extract both the local handcrafted features and the global semantic features. For the meaningful interaction of these different features, a multimodal attention fusion module is introduced. Experimental results demonstrate the excellent generalization ability of our framework: the proposed method significantly outperforms the state-of-the-art methods in both quantitative and qualitative evaluations. Our future work will consider self-supervised distortion rectification.
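As an aside, one much-simplified illustration of such a pixel-wise distortion map is sketched below; it reuses the geometry of the synthesis sketch above and tabulates, under the same illustrative division model, how far each pixel is displaced from its undistorted position. The exact definition of the DDM follows the paper body rather than this sketch.

import numpy as np

def distortion_distribution_map(h, w, lam):
    # Per-pixel displacement magnitude between a pixel's distorted position
    # and its undistorted position under a one-parameter division model; the
    # map is zero at the distortion centre and grows toward the border. Any
    # camera model can be tabulated into such a map, which is the sense in
    # which one map representation can cover many parameterizations.
    cx, cy = w / 2.0, h / 2.0
    scale = max(cx, cy)
    ys, xs = np.indices((h, w), dtype=np.float32)
    xn, yn = (xs - cx) / scale, (ys - cy) / scale
    r_d = np.sqrt(xn ** 2 + yn ** 2)
    r_u = r_d / (1.0 + lam * r_d ** 2)
    return np.abs(r_u - r_d)

# Example: ddm = distortion_distribution_map(256, 256, lam=-0.3)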
REFERENCES

[1] T. Collins and A. Bartoli, “Planar structure-from-motion with affine camera models: Closed-form solutions, ambiguities and degeneracy analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1237–1255, Jun. 2017.
[2] H. Guan and W. A. P. Smith, “Structure-from-motion in spherical video using the von Mises–Fisher distribution,” IEEE Trans. Image Process., vol. 26, no. 2, pp. 711–723, Feb. 2017.
[3] M. Lee, J. Cho, and S. Oh, “Procrustean normal distribution for non-rigid structure from motion,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 7, pp. 1388–1400, Jul. 2017.
[4] J. L. Herrera, C. R. del-Blanco, and N. Garcia, “Automatic depth extraction from 2D images using a cluster-based learning framework,” IEEE Trans. Image Process., vol. 27, no. 7, pp. 3288–3299, Jul. 2018.
[5] Y. Wang and W. Deng, “Generative model with coordinate metric learning for object recognition based on 3D models,” IEEE Trans. Image Process., vol. 27, no. 12, pp. 5813–5826, Dec. 2018.
[6] M. Wang et al., “BiggerSelfie: Selfie video expansion with hand-held camera,” IEEE Trans. Image Process., vol. 27, no. 12, pp. 5854–5865, Dec. 2018.
[7] J. Rong, S. Huang, Z. Shang, and X. Ying, “Radial lens distortion correction using convolutional neural networks trained with synthesized images,” in Proc. Asian Conf. Comput. Vis., 2016, pp. 35–49.
[8] X. Yin, X. Wang, J. Yu, M. Zhang, P. Fua, and D. Tao, “FishEyeRecNet: A multi-context collaborative deep network for fisheye image rectification,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 469–484.
[9] K. Liao, C. Lin, Y. Zhao, and M. Gabbouj, “DR-GAN: Automatic radial distortion rectification using conditional GAN in real-time,” IEEE Trans. Circuits Syst. Video Technol., to be published.
[10] S. Shah and J. Aggarwal, “Intrinsic parameter calibration procedure for a (high-distortion) fish-eye lens camera with distortion model and accuracy estimation,” Pattern Recognit., vol. 29, no. 11, pp. 1775–1788, Nov. 1996.
[11] Z. Zhang, “Flexible camera calibration by viewing a plane from unknown orientations,” in Proc. IEEE Int. Conf. Comput. Vis., vol. 1, Sep. 1999, pp. 666–673.
[12] X. Chen, J. Yang, and A. H. Waibel, “Calibration of a hybrid camera network,” in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2003.
[13] J. Barreto and H. Araujo, “Geometric properties of central catadioptric line images and their application in calibration,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 8, pp. 1327–1333, Aug. 2005.
[14] R. Melo, M. Antunes, J. P. Barreto, G. F. P. Fernandes, and N. Gonçalves, “Unsupervised intrinsic calibration from a single frame using a ‘plumb-line’ approach,” in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 537–544.
[15] R. Carroll, M. Agrawal, and A. Agarwala, “Optimizing content-preserving projections for wide-angle images,” ACM Trans. Graph., vol. 28, no. 3, p. 43, 2009.
[16] F. Bukhari and M. N. Dailey, “Automatic radial distortion estimation from a single image,” J. Math. Imag. Vis., vol. 45, no. 1, pp. 31–45, 2013.
[17] M. Alemán-Flores, L. Alvarez, L. Gomez, and D. Santana-Cedrés, “Automatic lens distortion correction using one-parameter division models,” Image Process. Line, vol. 4, pp. 327–343, Nov. 2014.
[18] D. Santana-Cedrés et al., “An iterative optimization algorithm for lens distortion correction using two-parameter models,” Image Process. Line, vol. 5, pp. 326–364, Dec. 2016.
[19] G. Li and Y. Yu, “Deep contrast learning for salient object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 478–487.
[20] Y. Fang, G. Ding, J. Li, and Z. Fang, “Deep3DSaliency: Deep stereoscopic video saliency detection model by 3D convolutional networks,” IEEE Trans. Image Process., vol. 28, no. 5, pp. 2305–2318, May 2019.
[21] H. Lin, C. Lin, Y. Zhao, and A. Wang, “3D saliency detection based on background detection,” J. Vis. Commun. Image Represent., vol. 48, pp. 238–253, Oct. 2017.
[22] J. Kim, J. K. Lee, and K. M. Lee, “Accurate image super-resolution using very deep convolutional networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 1646–1654.
[23] C. Ledig et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017, pp. 105–114.
[24] Y. Dong, C. Lin, Y. Zhao, C. Yao, and J. Hou, “Depth map up-sampling with texture edge feature via sparse representation,” in Proc. Vis. Commun. Image Process., 2016, pp. 1–4.
[25] Z. Zhang, C. Xu, J. Yang, J. Gao, and Z. Cui, “Progressive hard-mining network for monocular depth estimation,” IEEE Trans. Image Process., vol. 27, no. 8, pp. 3691–3702, Aug. 2018.
[26] L. Ge, H. Liang, J. Yuan, and D. Thalmann, “Robust 3D hand pose estimation from single depth images using multi-view CNNs,” IEEE Trans. Image Process., vol. 27, no. 9, pp. 4422–4436, Sep. 2018.
[27] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[28] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: Deep learning on point sets for 3D classification and segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017, pp. 77–85.
[29] Y. Gao, C. Lin, Y. Zhao, X. Wang, S. Wei, and Q. Huang, “3-D surround view for advanced driver assistance systems,” IEEE Trans. Intell. Transp. Syst., vol. 19, no. 1, pp. 320–328, Jan. 2018.
[30] F. Devernay and O. Faugeras, “Straight lines have to be straight,” Mach. Vis. Appl., vol. 13, no. 1, pp. 14–24, Aug. 2001.
[31] D. Claus and A. W. Fitzgibbon, “A rational function lens distortion model for general cameras,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 1, Jul. 2005, pp. 213–219.
[32] J. Kannala and S. Brandt, “A generic camera model and calibration method for conventional, wide-angle, and fish-eye lenses,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 8, pp. 1335–1340, Aug. 2006.
[33] R. I. Hartley and S. B. Kang, “Parameter-free radial distortion correction with centre of distortion estimation,” in Proc. IEEE Int. Conf. Comput. Vis., Jun. 2005, pp. 1834–1841.
[34] D. Scaramuzza, A. Martinelli, and R. Siegwart, “A toolbox for easily calibrating omnidirectional cameras,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Oct. 2006, pp. 5695–5701.
[35] A. W. Fitzgibbon, “Simultaneous linear estimation of multiple view geometry and lens distortion,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Aug. 2001.
[36] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assisted Intervent., 2015, pp. 234–241.
[37] J. F. Canny, “A computational approach to edge detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-8, no. 6, pp. 679–698, Nov. 1986.
[38] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “PointNet++: Deep hierarchical feature learning on point sets in a metric space,” in Proc. Neural Inf. Process. Syst., 2017.
[39] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas, “Volumetric and multi-view CNNs for object classification on 3D data,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 5648–5656.
[40] H. Su, S. Maji, E. Kalogerakis, and E. G. Learned-Miller, “Multi-view convolutional neural networks for 3D shape recognition,” in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2015, pp. 945–953.
[41] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in Proc. Eur. Conf. Comput. Vis., 2016.
[42] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, Apr. 2015.
[43] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017.
[44] J.-Y. Zhu et al., “Toward multimodal image-to-image translation,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 465–476.
[45] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. K. Singh, and M.-H. Yang, “Diverse image-to-image translation via disentangled representations,” in Proc. Eur. Conf. Comput. Vis., 2018.
[46] T.-Y. Lin et al., “Microsoft COCO: Common objects in context,” in Proc. Eur. Conf. Comput. Vis., 2014.
[47] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2014, arXiv:1412.6980. [Online]. Available: https://arxiv.org/abs/1412.6980
[48] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein GAN,” CoRR, vol. abs/1701.07875, Dec. 2017.