A Deep Approach To Image Matting Report
Abstract
In this project, we aim to solve the image matting problem. Image matting is a fundamental
computer vision problem of foreground extraction with many applications. This severely
under-constrained problem is challenging for traditional methods, and hence they perform
poorly: their estimation is based solely on the intensity values of the foreground and
background. Deep learning methods have tried to solve this problem by directly estimating
the alpha matte given the image and a trimap. These methods use the ground truth foreground
and background images for refining the alpha values, and hence solve a simpler problem under
the assumption that the network learns the high-level context required for prediction. In
our work, instead of using hand-crafted trimaps, 1) we train a neural network to generate
the trimaps, and 2) another neural network is trained to estimate the foreground and
background while predicting the alpha matte. This ensures consistency with the original
matting problem setup. The main part of our models is a deep convolutional encoder-decoder
network that takes one or more images as input and predicts the alpha matte.
1 Introduction
Matting, the problem of accurate foreground estimation in images and videos, has significant
practical importance. It is a key technology in image editing and film production, and effective
natural image matting methods can greatly improve current professional workflows. It necessitates
methods that handle real-world images in unconstrained scenes. Unfortunately, traditional matting
approaches do not generalize well to typical everyday scenes. This is partially due to the difficulty
of the problem. Traditionally, the matting problem is formulated as:
$$I_i = \alpha_i F_i + (1 - \alpha_i) B_i, \qquad \alpha_i \in [0, 1] \qquad (1)$$
where the RGB color at pixel $i$, $I_i$, is known and the foreground color $F_i$, background color $B_i$
and matte estimation $\alpha_i$ are unknown. This is clearly an under-constrained problem
with 7 unknown values per pixel but only 3 known values. This is one of the limitations of the
traditional approaches following equation 1. Equation 1 formulates the matting
problem as a linear combination of colours, and this makes the algorithms approach the problem
from the perspective of colour (often along with the spatial position of the pixels). Since they
depend primarily on colour, they are sensitive to situations where the foreground and background
colour distributions overlap. A second limitation is due to the focus on a very small dataset. Generating
ground truth for matting is very difficult, and the alphamatting.com dataset [5] made a significant
contribution to matting research by providing ground truth data. Unfortunately, it contains only
27 training images and 8 test images, most of which are objects in front of an image on a monitor.
As is the case with all datasets, especially small ones, at some point methods will over-fit to the
dataset and no longer generalize to real scenes. In this work, we present an approach aimed at
overcoming these limitations. Our method uses deep learning to directly compute the trimap given
an input RGB image. Using the trimap and the input image, it then computes the alpha matte.
Instead of relying primarily on color information, our network can learn the natural structure that is
present in alpha mattes. For example, hair and fur (which usually require matting) possess strong
structural and textural patterns. Other cases requiring matting (e.g. edges of objects, regions
of optical or motion blur, or semi-transparent regions) almost always have a common structure
or alpha profile that can be expected. While low-level features will not capture this structure,
deep networks are ideal for representing it.
Figure 1
Figure 2: An illustration of the SegNet architecture. There are no fully connected layers, so the
network is fully convolutional. A decoder upsamples its input using the pooling indices transferred
from its encoder to produce sparse feature maps. It then performs convolution with a trainable filter
bank to densify the feature maps. The final decoder output feature maps are fed to a soft-max
classifier for pixel-wise classification.
To train a model that will excel on natural images of unconstrained scenes, we need a much
larger dataset than is currently available. For this purpose we use the dataset created by Xu et
al. [7], in which images with objects on simple backgrounds were carefully extracted and composited
onto new background images to create a dataset with 43100 training images and 1000 test images.
In the following section we elaborate on the methodology followed. It includes a description
of the dataset used, followed by the neural network model designs that we experimented with.
Following this, we describe the training procedure in detail for each of the models.
Thereafter we present our experimental results and discussion.
2 Methodology
2.1 Dataset
The traditional matting benchmark on alphamatting.com [5] has been very successful in increasing
the pace of research in matting. However, due to the carefully controlled setting required to obtain
ground truth images, the dataset consists of only 27 training images and 8 testing images. This
dataset is not big enough to train a neural network, and it is also limited in diversity, restricted to
small-scale lab scenes with static objects. For the purpose of training our matting networks, we
use the matting dataset created by Xu et al. [7]. This is a large dataset created by compositing
real images onto new backgrounds. Foreground images with a simple background (Fig. 2a) are
chosen and their alpha mattes (Fig. 2b) are carefully created using Photoshop. Along with this, the
pure foreground colours (Fig. 2c) are also created. These are then treated as ground truth, and for
each alpha matte and foreground image, composite images are created by sampling N background
images from Pascal VOC [2] and MS COCO [3]. Both training and testing datasets are created
in the same way. The training dataset has 431 unique foreground objects and 43100 composite
images, while our testing dataset has 50 unique foreground objects and 1000 composite images, i.e.
each foreground is used to form N = 100 composite images in training and 20 in testing. The
trimaps for each composite image are created by randomly dilating the ground truth alpha matte.
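As an illustration of this compositing and trimap-generation procedure, a minimal Python sketch (using OpenCV/NumPy; the dilation kernel sizes and the 0/128/255 trimap encoding are our illustrative assumptions, not the exact values used to build the dataset) could look like:

```python
import cv2
import numpy as np

def composite(fg, bg, alpha):
    """Composite a foreground onto a background: I = alpha*F + (1-alpha)*B."""
    a = alpha.astype(np.float32)[..., None] / 255.0       # HxWx1 in [0, 1]
    return (a * fg.astype(np.float32)
            + (1.0 - a) * bg.astype(np.float32)).astype(np.uint8)

def make_trimap(alpha, min_width=1, max_width=25):
    """Create a trimap by randomly dilating/eroding the ground-truth alpha."""
    k = np.random.randint(min_width, max_width + 1)        # random structuring-element size
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (k, k))
    fg = (alpha == 255).astype(np.uint8)                   # definitely-foreground mask
    dilated = cv2.dilate((alpha > 0).astype(np.uint8), kernel)
    eroded = cv2.erode(fg, kernel)
    trimap = np.full_like(alpha, 128)                      # unknown region by default
    trimap[eroded == 1] = 255                              # definite foreground
    trimap[dilated == 0] = 0                               # definite background
    return trimap
```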
In comparison with previously existing matting datasets, this dataset has two main advantages.
1. It has many unique objects and covers several matting cases such as hair, fur, semi-transparency,
etc.
2. Many composite images have a similar mix of background and foreground colours along with
complex textures, which makes the dataset more practical and challenging.
Figure 3: The complete architecture of VGG16 [6]
2.2 Method
As in [7] our network too uses deep encoder-decoder network, which has achieved successes in
many other computer vision tasks such as image segmentation [1], boundary prediction[8] and hole
filling[4]. We experimented with different architectures modeled around encoder-decoder network.
In the following subsection we describe the basic encoder-decoder architecture. Thereafter we
present the details of the primary model as implemented in [7]. This is followed by the models
we implemented with minor modifications and the intuition behind them We present those minor
modifications and the intuition behind them in the current section
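To make the encoder-decoder idea concrete, the following is a minimal PyTorch sketch of a SegNet-style network that re-uses max-pooling indices for upsampling; the channel counts, depth and 4-channel input are illustrative and much smaller than the VGG16-based networks we actually train:

```python
import torch
import torch.nn as nn

class MiniEncoderDecoder(nn.Module):
    """A small SegNet-style encoder-decoder: max-pooling indices recorded by the
    encoder are reused by the decoder for non-linear upsampling."""
    def __init__(self, in_ch=4, out_ch=1):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True))
        self.enc2 = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True))
        self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)
        self.unpool = nn.MaxUnpool2d(2, stride=2)
        self.dec2 = nn.Sequential(nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True))
        self.dec1 = nn.Conv2d(64, out_ch, 3, padding=1)

    def forward(self, x):
        x, idx1 = self.pool(self.enc1(x))      # encode, keep pooling indices
        x, idx2 = self.pool(self.enc2(x))
        x = self.dec2(self.unpool(x, idx2))    # decode using the transferred indices
        x = self.unpool(x, idx1)
        return torch.sigmoid(self.dec1(x))     # single-channel prediction in [0, 1]
```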
Figure 4: a.) The network structure used in model-1 and model-2. b.) The network structure
used in model-4. The same structure is used in model-3, keeping the Image-2 slot empty.
where $\alpha_p^i$ is the output of the prediction layer at pixel $i$, thresholded between 0 and 1, $\alpha_g^i$ is the
ground truth alpha value at pixel $i$, and $\epsilon$ is a small value equal to $10^{-6}$ in our experiments.
The derivative $\partial L_\alpha^i / \partial \alpha_p^i$ can be written as
$$\frac{\partial L_\alpha^i}{\partial \alpha_p^i} = \frac{\alpha_p^i - \alpha_g^i}{\sqrt{(\alpha_p^i - \alpha_g^i)^2 + \epsilon^2}}.$$
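A sketch of this loss is given below (PyTorch; it assumes the $\epsilon$-smoothed absolute-difference form implied above, and the reduction to a per-image mean is our choice):

```python
import torch

def alpha_prediction_loss(alpha_pred, alpha_gt, eps=1e-6):
    """Per-pixel alpha prediction loss sqrt((a_p - a_g)^2 + eps^2), averaged over
    pixels; eps keeps the gradient well defined when a_p == a_g."""
    alpha_pred = alpha_pred.clamp(0.0, 1.0)   # threshold predictions to [0, 1]
    diff = alpha_pred - alpha_gt
    return torch.sqrt(diff * diff + eps * eps).mean()
```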
2.2.4 Method-3 (Prediction of Alpha matte with predicted Trimap-I)
Input: Here we have two neural networks. The first network takes the 3-channel composite
image as its input and generates the trimap, which is then concatenated with the composite image
and given as a 4-channel input to the second network. The second network uses this to produce a
single-channel alpha matte.
Network Structure: The network for this method consists of two encoder-decoder networks
(Fig. 4b). The architecture of each encoder-decoder network is similar to the one mentioned above.
Major Idea: The major idea is to automate the generation of the trimap. In the previous methods,
the trimap was generated by dilating and eroding the ground truth alpha matte; thus, at test time,
the input must include a trimap. To overcome this, we build a network that generates the trimap
given a composite image.
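The intended test-time behaviour of this two-stage pipeline can be sketched as follows (PyTorch; netT, netA and the label-to-channel scaling are illustrative placeholders for the two encoder-decoder networks):

```python
import torch

# netT: encoder-decoder mapping a 3-channel image to 3 trimap class scores.
# netA: encoder-decoder mapping the 4-channel (image + trimap) input to alpha.
def predict_alpha(netT, netA, image):
    """Run the two-stage pipeline of method-3 on a (1, 3, H, W) image tensor."""
    trimap_logits = netT(image)                          # (1, 3, H, W) class scores
    trimap = trimap_logits.argmax(dim=1, keepdim=True)   # (1, 1, H, W) labels {0, 1, 2}
    trimap = trimap.float() / 2.0                        # scale labels to [0, 1]
    x = torch.cat([image, trimap], dim=1)                # 4-channel input to the second net
    return netA(x)                                       # (1, 1, H, W) alpha matte
```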
Loss: The two networks are independent and each has its own loss.
• Network 1 : We formulate trimap prediction as a pixel-level classification task where the loss
$L_t^i$ at pixel $i$ is given by the cross-entropy between the ground truth trimap pixel class
and the predicted pixel class (see the sketch after this list). It can be represented as
$$L_t^i = -\sum_{c=1}^{3} y_c^i \log(\hat{y}_c^i) \qquad (5)$$
where $\hat{y}_c^i$ is the output of the softmax layer for class $c$ at pixel $i$ and $y_c^i$ is the corresponding
one-hot ground truth label. The three classes in the summation are the background, foreground
and uncertain regions.
• Network 2 : The alpha prediction loss $L_\alpha^i$ is the square difference between the ground truth
alpha value and the predicted alpha value at each pixel, as mentioned in equation 2.
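A minimal sketch of the pixel-wise cross-entropy loss for network 1 (PyTorch; the 0/1/2 integer class encoding is our assumption):

```python
import torch
import torch.nn.functional as F

def trimap_loss(trimap_logits, trimap_gt):
    """Pixel-wise cross-entropy over the three trimap classes.

    trimap_logits: (N, 3, H, W) raw scores from network 1.
    trimap_gt:     (N, H, W) integer labels, e.g. {0: background, 1: unknown, 2: foreground}.
    """
    return F.cross_entropy(trimap_logits, trimap_gt)
```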
2.3 Experiments
We did various experiments structured around using segnet[1]. We played around by modifying
inputs and loss functions and report all our results in this section. We used the Composition-1k
test set [7] that includes 1000 images and 50 unique foregrounds.
2.3.1 Preprocessing
The preprocessing part for all our methods mentioned in the above section 2.2 is the same.
Wherever we take the RGB image as input, we do mean subtraction on it with respect to the
dataset. Whenever the trimap is given as input, we crop a section of it centered around a pixel in
the uncertain area and send it to the network. The composite image relevant to the trimap is
cropped in the same way in the corresponding region. To ensure that our network becomes scale
invariant, we pick crop size at random from among 320, 480, 640.
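A sketch of this cropping step (NumPy; it assumes the uncertain region is marked as 128 in the trimap and omits the mean subtraction):

```python
import numpy as np

def random_crop(image, trimap, crop_sizes=(320, 480, 640)):
    """Crop a square region centered on a random pixel in the unknown trimap
    region; the image and trimap are cropped identically."""
    size = int(np.random.choice(crop_sizes))
    ys, xs = np.where(trimap == 128)                  # pixels in the uncertain region
    i = np.random.randint(len(ys))
    cy, cx = ys[i], xs[i]
    h, w = trimap.shape[:2]
    half = size // 2
    y0 = int(np.clip(cy - half, 0, max(h - size, 0)))  # keep the crop inside the image
    x0 = int(np.clip(cx - half, 0, max(w - size, 0)))
    return image[y0:y0 + size, x0:x0 + size], trimap[y0:y0 + size, x0:x0 + size]
```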
2.3.2 Results
We are reporting the training loss curves for all the methods we tried, followed by the results we
obtained. All the experiments have been carried out for a maximum of 20 epochs. Adam
Optimizer has been used with an initial learning rate of 10−5 . Weights have been initialized using
Xavier normal initialization.
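This training setup can be expressed as the following sketch (PyTorch; the two-layer model below is just a self-contained stand-in for the encoder-decoder networks of section 2.2):

```python
import torch
import torch.nn as nn

def xavier_init(m):
    """Xavier (Glorot) normal initialization for convolution weights."""
    if isinstance(m, nn.Conv2d):
        nn.init.xavier_normal_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

# Stand-in for one of the encoder-decoder networks described in section 2.2.
model = nn.Sequential(
    nn.Conv2d(4, 64, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 1, 3, padding=1),
)
model.apply(xavier_init)                                    # Xavier normal initialization
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)   # Adam, initial learning rate 1e-5
```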
For the trimap generation in method-4, as mentioned in section 2.2.5 above, we give two
RGB images to the network to generate the trimap. We tried two variations of this method.
• Method-4 variation 1: In this variation, we gave the network a composite image (foreground
on a relatively non-plain background) along with a simple image (foreground on a relatively
plain background).
• Method-4 variation 2: In the other variation, we give two composite images as input to the
network.
We observe that variation 1 gives us better trimaps compared to variation 2. We present them in
the following pages.
Figure 5: a.) Training loss for model-3 (2.2.4) for trimap generation. b.) Training loss for
variation-1 of model-4 (2.2.5) for trimap generation. c.) Training loss for variation-2 of
model-4 (2.2.5) for trimap generation.
Figure 6: Training loss for alpha matte generation of a.) model-1 (2.2.2) and b.) model-4 variation-1.
Figure 7: Generated trimaps. Columns: GT image, GT trimap, Model-3, Model-4 variation-1,
Model-4 variation-2.
Figure 8: Generated alpha mattes from model-1 (2.2.2) and model-4 (2.2.5). Columns: GT image,
GT alpha matte, Method-1, Method-4 variation-1. Extracted alpha mattes for two composite images
having the same foreground, for the different methods.
References
[1] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional
encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis
and machine intelligence, 39(12):2481–2495, 2017.
[2] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew
Zisserman. The pascal visual object classes (voc) challenge. International journal of computer
vision, 88(2):303–338, 2010.
[3] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan,
Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In
European conference on computer vision, pages 740–755. Springer, 2014.
[4] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros.
Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 2536–2544, 2016.
[5] Christoph Rhemann, Carsten Rother, Jue Wang, Margrit Gelautz, Pushmeet Kohli, and
Pamela Rott. A perceptually motivated online benchmark for image matting. In Computer
Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1826–1833.
IEEE, 2009.
[6] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556, 2014.
[7] Ning Xu, Brian Price, Scott Cohen, and Thomas Huang. Deep image matting. In Computer
Vision and Pattern Recognition (CVPR), 2017.
[8] Jimei Yang, Brian Price, Scott Cohen, Honglak Lee, and Ming-Hsuan Yang. Object contour
detection with a fully convolutional encoder-decoder network. In Computer Vision and
Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 193–202. IEEE, 2016.