
A Deep Approach to Image Matting

Manisha Padala, Nikhil Rayaprolu, Muthireddy Vamsidhar


20172145, 201501090, 20172144
September 25, 2023

Abstract
In this project, we aim to solve the image matting problem. Image matting is a fundamental
computer vision problem of foreground extraction with many applications. This severely
under-constrained problem is challenging for traditional methods, which therefore perform
poorly: their estimation is based solely on the intensity values of the foreground and
background. Deep learning methods have tried to solve this problem by directly estimating the
alpha matte given the image and a trimap. These methods use the ground-truth foreground and
background images for refining the alpha values, and hence solve a simpler problem, assuming
the network learns the high-level context required for prediction. In our work, instead of
using hand-crafted trimaps, 1) we train a neural network to generate the trimaps, and
2) another neural network is trained to estimate the foreground and background while
predicting the alpha matte, to ensure consistency with the original matting problem setup.
The main part of our models is a deep convolutional encoder-decoder network that takes one or
more images as input and predicts the alpha matte.

1 Introduction
Matting, the problem of accurate foreground estimation in images and videos, has significant
practical importance. It is a key technology in image editing and film production, and effective
natural image matting methods can greatly improve current professional workflows. This calls for
methods that handle real-world images in unconstrained scenes. Unfortunately, traditional matting
approaches do not generalize well to typical everyday scenes, partially due to the difficulty
of the problem. Traditionally, the matting problem is formulated as:

$$I_i = \alpha_i F_i + (1 - \alpha_i) B_i, \qquad \alpha_i \in [0, 1] \quad (1)$$

where the RGB colour at pixel $i$, $I_i$, is known, while the foreground colour $F_i$, the
background colour $B_i$ and the alpha estimate $\alpha_i$ are unknown. This is clearly an
under-constrained problem, with 7 unknown values per pixel but only 3 known values, and it is
one of the limitations of the traditional approaches that follow Equation (1). Equation (1)
formulates matting as a linear combination of colours, which makes the algorithms approach the
problem from the perspective of colour (often along with the spatial position of the pixels).
Since they depend primarily on colour, they are sensitive to situations where the foreground and
background colour distributions overlap. A second limitation is the focus on a very small
dataset. Generating ground truth for matting is very difficult, and the alphamatting.com
dataset [5] made a significant contribution to matting research by providing ground-truth data.
Unfortunately, it contains only 27 training images and 8 test images, most of which are objects
in front of an image on a monitor. As is the case with all datasets, especially small ones, at
some point methods will over-fit to the dataset and no longer generalize to real scenes. In this
work, we present an approach aimed at overcoming these limitations. Our method uses deep
learning to directly compute the trimap given an input RGB image; using the trimap and the input
image, it then computes the alpha matte. Instead of relying primarily on colour information, our
network can learn the natural structure that is present in alpha mattes. For example, hair and
fur (which usually require matting) possess strong structural and textural patterns. Other cases
requiring matting (e.g. edges of objects, regions of optical or motion blur, or semi-transparent
regions) almost always have a common structure or alpha profile that can be expected. While
low-level features will not capture this structure, deep networks are ideal for representing it.

Figure 1

Figure 2: An illustration of the SegNet architecture. There are no fully connected layers, and
hence it is fully convolutional. A decoder upsamples its input using the transferred pooling
indices from its encoder to produce sparse feature maps. It then performs convolution with a
trainable filter bank to densify the feature maps. The final decoder output feature maps are fed
to a soft-max classifier for pixel-wise classification.

To train a model that will excel on natural images in unconstrained scenes, we need a much
larger dataset than is currently available. For this purpose we use the dataset created by
Xu et al. [7], in which images with objects on simple backgrounds were carefully extracted and
composited onto new background images to create a dataset with 43,100 training images and
1,000 test images.
In the following section we elaborate on the methodology. It includes a description of the
dataset used, followed by the neural network model designs we experimented with. We then
describe our training procedure in detail for each model. Thereafter we present our experimental
results and discussion.

2 Methodology
2.1 Dataset
The traditional matting benchmark on alphamatting.com [5] has been very successful in increasing
the pace of research in matting. However, due to the carefully controlled setting required to
obtain ground-truth images, the dataset consists of only 27 training images and 8 testing images.
This dataset is not big enough to train a neural network, and it is also limited in diversity,
being restricted to small-scale lab scenes with static objects. For training our matting
networks, we use the matting dataset created by Xu et al. [7]. This is a large dataset created by
compositing real images onto new backgrounds. Foreground images with simple backgrounds (Fig. 2a)
are chosen and their alpha mattes (Fig. 2b) are carefully created using Photoshop. Along with
this, the pure foreground colours (Fig. 2c) are also created. These are then treated as ground
truth, and for each alpha matte and foreground image, composite images are created by sampling N
background images from Pascal VOC [2] and MS COCO [3]. Both the training and testing datasets are
created in the same way. The training dataset has 431 unique foreground objects and 43,100
composite images, while the testing dataset has 50 unique foreground objects and 1,000 composite
images, i.e. each foreground is used to form N = 100 composite images in training and 20 in
testing. The trimaps for each composite image are created by randomly dilating the ground-truth
alpha matte.
In comparison with previously existing matting datasets, this dataset has two main advantages.

1. It has many unique objects and covers several matting cases such as hair, fur and
semi-transparency.
2. Many composite images have a similar mix of background and foreground colours along with
complex textures, which makes the dataset more practical and challenging.
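
To make the compositing step concrete, the following is a minimal NumPy sketch of how
Equation (1) can be applied to place a ground-truth foreground onto a sampled background. The
function name composite_image and the assumption that all inputs are floats in [0, 1] are ours,
not part of the original pipeline; for each foreground/alpha pair this would be called with N
different backgrounds sampled from Pascal VOC or MS COCO.

```python
import numpy as np

def composite_image(foreground, alpha, background):
    """Composite a foreground onto a new background pixel-wise via Eq. (1):
    I = alpha * F + (1 - alpha) * B.

    foreground, background: float arrays of shape (H, W, 3) with values in [0, 1]
    alpha: float array of shape (H, W) with values in [0, 1]
    """
    a = alpha[..., None]  # add a channel axis so alpha broadcasts over the RGB channels
    return a * foreground + (1.0 - a) * background
```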

Figure 3: The complete architecture of VGG16

2.2 Method
As in [7] our network too uses deep encoder-decoder network, which has achieved successes in
many other computer vision tasks such as image segmentation [1], boundary prediction[8] and hole
filling[4]. We experimented with different architectures modeled around encoder-decoder network.
In the following subsection we describe the basic encoder-decoder architecture. Thereafter we
present the details of the primary model as implemented in [7]. This is followed by the models
we implemented with minor modifications and the intuition behind them We present those minor
modifications and the intuition behind them in the current section

2.2.1 Background Auto-Encoder


Auto-encoders learn to compress the data given at the input layer into a short code in a latent
space and then to uncompress that code into something that closely matches the original data.
This forces the network to reduce the dimension of the input. The initial layer might learn to
encode simple features like corners; the next layer analyzes the previous layer's output and
encodes less local features like the tip of a nose; the third might encode a whole nose, and so
on, until the final layer encodes the whole image into a latent code corresponding to the given
input image. In short, the input to the encoder network is transformed into downsampled feature
maps by successive convolutional and max pooling layers. The decoder network in turn uses
successive unpooling layers, which reverse the max pooling operation, and convolutional layers to
upsample the feature maps and produce the desired output. We use SegNet [1] as our network. It
uses the first 13 layers of VGG-16 [6] as the encoder and a mirrored version as the decoder.
VGG-16 is characterized by its simplicity, using only 3×3 convolutional layers stacked on top of
each other in increasing depth, with max pooling reducing the spatial size; its two
fully-connected layers of 4,096 nodes each and the final softmax classifier are dropped in
SegNet, which keeps only the convolutional part.
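
Below is a heavily reduced PyTorch sketch of this encoder-decoder idea, assuming a SegNet-style
design in which max-pooling indices computed in the encoder are reused by MaxUnpool2d in the
decoder. The class name TinySegNet, the channel widths and the depth are illustrative only and
are far smaller than the 13-layer VGG-16 encoder actually used in our models.

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    """3x3 convolution + batch norm + ReLU, the basic unit of the encoder and decoder."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class TinySegNet(nn.Module):
    """Two-stage encoder-decoder: pooling indices from the encoder are passed to
    the matching unpooling layers of the decoder (no fully-connected layers)."""

    def __init__(self, in_ch=4, out_ch=1):
        super().__init__()
        self.enc1 = conv_bn_relu(in_ch, 64)
        self.enc2 = conv_bn_relu(64, 128)
        self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)
        self.unpool = nn.MaxUnpool2d(2, stride=2)
        self.dec2 = conv_bn_relu(128, 64)
        self.dec1 = conv_bn_relu(64, 64)
        self.head = nn.Conv2d(64, out_ch, kernel_size=1)  # 1 channel for an alpha matte, 3 for trimap logits

    def forward(self, x):
        x = self.enc1(x)
        x, idx1 = self.pool(x)               # downsample and remember where the maxima were
        x = self.enc2(x)
        x, idx2 = self.pool(x)
        x = self.dec2(self.unpool(x, idx2))  # upsample using the stored indices
        x = self.dec1(self.unpool(x, idx1))
        return self.head(x)

# e.g. model = TinySegNet(in_ch=4, out_ch=1) for a composite image + trimap input and an alpha output
```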

2.2.2 Method-1 (Direct Alpha Matte Prediction)


Input: The input to the network in this method is a composite image along with its corresponding
trimap. They are concatenated along the channel dimension to form a 4-channel input, which is
given to the network to produce a single-channel alpha matte. The trimap in this method is
hand-crafted along with the alpha matte.
Network Structure: The network for this method consists of a single encoder-decoder network
(Fig. 4a) following the VGG-16 architecture. It takes the composite image along with its trimap
and produces the output alpha matte.
Loss: The loss used in this network is the alpha-prediction loss, the absolute difference between
the ground-truth and predicted alpha values at each pixel. However, due to the non-differentiable
property of absolute values, we use the following loss function $L_\alpha^i$ to approximate it.

Figure 4: (a) This network structure is used in model-1 and model-2. (b) This network structure
is used in model-4; the same structure is used in model-3, keeping the Image-2 slot empty.

$$L_\alpha^i = \sqrt{(\alpha_p^i - \alpha_g^i)^2 + \epsilon^2}, \qquad \alpha_p^i, \alpha_g^i \in [0, 1] \quad (2)$$

where $\alpha_p^i$ is the output of the prediction layer at pixel $i$, thresholded between 0 and
1, $\alpha_g^i$ is the ground-truth alpha value at pixel $i$, and $\epsilon$ is a small value
equal to $10^{-6}$ in our experiments. The derivative $\partial L_\alpha^i / \partial \alpha_p^i$
can be written as

$$\frac{\partial L_\alpha^i}{\partial \alpha_p^i} = \frac{\alpha_p^i - \alpha_g^i}{\sqrt{(\alpha_p^i - \alpha_g^i)^2 + \epsilon^2}} \quad (3)$$

The total loss is given by

$$L_{\text{overall}} = w_l \cdot L_\alpha \quad (4)$$


Since only the alpha values inside the unknown region of the trimap need to be inferred, we set
additional weights on the loss according to the pixel locations, which helps our network pay more
attention to the important areas. Specifically, $w_i = 1$ if pixel $i$ is inside the unknown
region of the trimap and $w_i = 0$ otherwise.
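
A possible PyTorch implementation of Eqs. (2) and (4) is sketched below. The assumption that the
unknown region of the trimap is marked with the value 128 (with 0 for background and 255 for
foreground) is ours; the exact encoding in our data pipeline may differ.

```python
import torch

def alpha_prediction_loss(alpha_pred, alpha_gt, trimap, eps=1e-6):
    """Approximate absolute-difference loss of Eq. (2), averaged only over the
    unknown region of the trimap (w_i = 1 inside the unknown region, 0 elsewhere).

    alpha_pred, alpha_gt: tensors of shape (B, 1, H, W) with values in [0, 1]
    trimap: tensor of the same shape with values 0 / 128 / 255 (assumed encoding)
    """
    alpha_pred = alpha_pred.clamp(0.0, 1.0)       # threshold predictions to [0, 1]
    weights = (trimap == 128).float()             # w_i: 1 inside the unknown region
    per_pixel = torch.sqrt((alpha_pred - alpha_gt) ** 2 + eps ** 2)
    return (weights * per_pixel).sum() / (weights.sum() + 1e-8)
```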

2.2.3 Method-2 (Prediction of Alpha Matte along with Foreground and Background)

Input: The inputs to the network in this method are an image along with its trimap. They are
concatenated along the channel dimension to form a 4-channel input, which is given to the network
to produce the foreground and background. These are then used to predict the alpha matte via
Equation (1), which is the output of the network.
Network Structure: The network for this method consists of a single encoder-decoder network
(Fig. 4a). The architecture of the encoder-decoder network is similar to the one mentioned above.
Major Idea: Note that this problem is significantly harder, since multiple estimates of the
foreground and background colours can result in the same alpha matte value: we need the alpha
matte to estimate the foreground and background and vice versa. Given the trimap, we estimate the
values only in the unknown region, as completely estimating the foreground and background is a
very heavy task.
Loss: The loss function used here has two major components (a sketch of both follows at the end
of this subsection).
• Alpha loss: the network parameters are updated to ensure correct foreground and background
prediction such that the resulting alpha matte is close to the ground truth. Here we use the same
loss (Eq. 2) as described in Method-1 (Section 2.2.2).
• RGB loss: given the foreground, background and alpha matte, we reconstruct the composite image
and take the mean squared error between the input composite image and the reconstructed one.
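
The two components could be realized along the following lines in PyTorch. Solving Eq. (1) for
alpha in the least-squares sense given the predicted foreground and background is one standard
choice and is an assumption on our part, as is the use of a plain MSE for the RGB term; the
function names are illustrative.

```python
import torch

def alpha_from_fg_bg(image, fg, bg, eps=1e-6):
    """Per-pixel least-squares solution of Eq. (1) for alpha given predicted F and B:
    alpha = ((I - B) . (F - B)) / (||F - B||^2 + eps), clamped to [0, 1].
    image, fg, bg: tensors of shape (B, 3, H, W) with values in [0, 1]."""
    numerator = ((image - bg) * (fg - bg)).sum(dim=1, keepdim=True)
    denominator = ((fg - bg) ** 2).sum(dim=1, keepdim=True) + eps
    return (numerator / denominator).clamp(0.0, 1.0)

def rgb_reconstruction_loss(image, fg, bg, alpha):
    """RGB loss: mean squared error between the input composite and the image
    recomposited from the predicted foreground, background and alpha."""
    recomposited = alpha * fg + (1.0 - alpha) * bg
    return torch.mean((image - recomposited) ** 2)
```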

2.2.4 Method-3 (Prediction of Alpha matte with predicted Trimap-I)
Input: Here we have two neural networks. The first network takes the 3-channel composite image as
its input and generates a trimap, which is then concatenated with the composite image and given
as a 4-channel input to the second network. The second network produces a single-channel alpha
matte.
Network Structure: The network for this method consists of two encoder-decoder networks
(Fig. 4b). The architecture of each encoder-decoder network is similar to the one mentioned above.
Major Idea: The major idea is to automate the generation of the trimap. In the previous methods,
the trimap was generated by dilating and eroding the ground-truth alpha matte, so at test time
the input is required to come with a trimap. To overcome this, we build a network which generates
the trimap given a composite image.
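
For reference, the conventional trimap generation from a ground-truth alpha matte mentioned above
can be sketched with OpenCV roughly as follows. The kernel-size range and the 0/128/255 encoding
are assumptions on our part.

```python
import cv2
import numpy as np

def trimap_from_alpha(alpha, kernel_size=None):
    """Build a trimap from a ground-truth alpha matte (uint8, values in [0, 255]):
    0 = background, 128 = unknown, 255 = foreground."""
    if kernel_size is None:
        kernel_size = np.random.randint(5, 30)                     # random band width (assumed range)
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    sure_fg = cv2.erode((alpha == 255).astype(np.uint8), kernel)   # shrink the sure foreground
    maybe_fg = cv2.dilate((alpha > 0).astype(np.uint8), kernel)    # grow the uncertain band
    trimap = np.zeros_like(alpha)
    trimap[maybe_fg == 1] = 128
    trimap[sure_fg == 1] = 255
    return trimap
```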
Loss: The two networks are independent and have their individual losses.
• Network 1: We formulate trimap prediction as a pixel-level classification task, where the loss
$L_t^i$ at pixel $i$ is the cross-entropy between the ground-truth trimap pixel class and the
predicted pixel class (see the sketch after this list). It can be written as

$$L_t^i = -\sum_{n=1}^{3} y_n^i \log(\hat{y}_n^i) \quad (5)$$

where $\hat{y}^i$ is the output of the softmax layer. The three classes in the summation are
background, foreground and the uncertain region.

• Network 2: The alpha-prediction loss $L_\alpha^i$ between the ground-truth and predicted alpha
values at each pixel, as given in Eq. (2).
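
In PyTorch, Eq. (5) maps directly onto the built-in cross-entropy, which applies log-softmax to
the raw network outputs internally, so the trimap network should output logits rather than
softmax probabilities. The 0/1/2 class encoding below is our assumption.

```python
import torch.nn as nn

# Classes assumed to be encoded as 0 = background, 1 = unknown, 2 = foreground.
trimap_criterion = nn.CrossEntropyLoss()

def trimap_loss(logits, target_classes):
    """logits: raw (pre-softmax) outputs of shape (B, 3, H, W);
    target_classes: int64 labels of shape (B, H, W)."""
    return trimap_criterion(logits, target_classes)
```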

2.2.5 Method-4 (Prediction of Alpha matte with predicted Trimap-II)


Input: This method again has two networks, as in Section 2.2.4. The input to the first network is
two images with the same foreground, concatenated along the channel dimension to form a 6-channel
input. The output is a trimap which, along with a composite image, is given as input to the
second network for alpha matte prediction.
Network Structure: The network for this method consists of two encoder-decoder networks
(Fig. 4b). The architecture of each encoder-decoder network is similar to the one mentioned above.
Major Idea: Generating the trimap as in the method above (Section 2.2.4) is extremely difficult.
To get a better prediction, we give more information to the network: specifically, we make it
easier for the network to identify the foreground by giving it two images with the same
foreground.
Loss: The loss functions are similar to the ones used in Method-3: one is the alpha-prediction
loss and the other is the trimap loss.

2.3 Experiments
We performed various experiments structured around SegNet [1], varying the inputs and loss
functions, and report all our results in this section. We used the Composition-1k test set [7],
which includes 1,000 images and 50 unique foregrounds.

2.3.1 Preprocessing
The preprocessing for all the methods mentioned in Section 2.2 is the same. Wherever we take the
RGB image as input, we subtract the dataset mean from it. Whenever the trimap is given as input,
we crop a section of it centered around a pixel in the uncertain region and feed it to the
network; the composite image is cropped in the same way in the corresponding region. To make our
network scale invariant, we pick the crop size at random from among 320, 480 and 640.
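
A rough NumPy sketch of this preprocessing is given below. The 128-valued unknown region and the
exact way the crop window is clipped to the image border are our assumptions.

```python
import numpy as np

CROP_SIZES = (320, 480, 640)

def preprocess(image, trimap, alpha, mean_rgb):
    """Mean-subtract the RGB image and crop all inputs around a random pixel
    in the uncertain region of the trimap, with a randomly chosen crop size.

    image: float array (H, W, 3); trimap, alpha: arrays (H, W); mean_rgb: dataset per-channel mean.
    """
    image = image - mean_rgb                          # mean subtraction w.r.t. the dataset
    size = int(np.random.choice(CROP_SIZES))          # random crop size for scale invariance
    ys, xs = np.where(trimap == 128)                  # pixels in the uncertain region (assumed 128)
    k = np.random.randint(len(ys))
    cy, cx = int(ys[k]), int(xs[k])
    h, w = trimap.shape
    y0 = int(np.clip(cy - size // 2, 0, max(h - size, 0)))
    x0 = int(np.clip(cx - size // 2, 0, max(w - size, 0)))
    window = (slice(y0, y0 + size), slice(x0, x0 + size))
    return image[window], trimap[window], alpha[window]
```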

2.3.2 Results
We report the training loss curves for all the methods we tried, followed by the results we
obtained. All experiments were run for a maximum of 20 epochs. The Adam optimizer was used with
an initial learning rate of $10^{-5}$, and weights were initialized with Xavier normal
initialization.
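
This training setup translates into PyTorch roughly as follows; the model name TinySegNet refers
back to the illustrative sketch in Section 2.2.1 and is not the exact model we trained.

```python
import torch
import torch.nn as nn

def init_xavier(module):
    """Xavier-normal initialization for convolutional and linear layers."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_normal_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# model = TinySegNet(in_ch=4, out_ch=1)     # e.g. the sketch from Section 2.2.1
# model.apply(init_xavier)                  # Xavier normal initialization
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
# for epoch in range(20):                   # at most 20 epochs
#     ...                                   # forward pass, loss, backward, optimizer.step()
```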
For trimap generation in Method-4, as mentioned in Section 2.2.5, we give two RGB images to the
network. We tried two variations of this method.
• Method-4, variation 1: we gave the network a composite image (foreground on a relatively
non-plain background) along with a simple image (foreground on a relatively plain background).
• Method-4, variation 2: we gave two composite images as input to the network.
We observe that variation 1 gives better trimaps than variation 2. We present them in the
following pages.


Figure 5: (a) Training loss for model-3 (Section 2.2.4) for trimap generation. (b) Training loss
for variation-1 of model-4 (Section 2.2.5) for trimap generation. (c) Training loss for
variation-2 of model-4 (Section 2.2.5) for trimap generation.


Figure 6: Training loss for alpha matte generation of (a) model-1 (Section 2.2.2) and (b) model-4
variation-1.

Figure 7: Generated trimaps from model-3 (Section 2.2.4) and model-4 (Section 2.2.5). Columns:
GT image, GT trimap, Model-3, Model-4 variation-1, Model-4 variation-2.

Figure 8: Generated alpha mattes from model-1 (Section 2.2.2) and model-4 (Section 2.2.5):
extracted alpha mattes for two composite images having the same foreground, for the different
methods. Columns: GT image, GT alpha matte, Method-1, Method-4 variation-1.

References
[1] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional
encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis
and machine intelligence, 39(12):2481–2495, 2017.
[2] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew
Zisserman. The pascal visual object classes (voc) challenge. International journal of computer
vision, 88(2):303–338, 2010.
[3] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan,
Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In
European conference on computer vision, pages 740–755. Springer, 2014.

[4] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros.
Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 2536–2544, 2016.
[5] Christoph Rhemann, Carsten Rother, Jue Wang, Margrit Gelautz, Pushmeet Kohli, and
Pamela Rott. A perceptually motivated online benchmark for image matting. In Computer
Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1826–1833.
IEEE, 2009.
[6] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556, 2014.

[7] Ning Xu, Brian Price, Scott Cohen, and Thomas Huang. Deep image matting. In Computer
Vision and Pattern Recognition (CVPR), 2017.
[8] Jimei Yang, Brian Price, Scott Cohen, Honglak Lee, and Ming-Hsuan Yang. Object contour
detection with a fully convolutional encoder-decoder network. In Computer Vision and
Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 193–202. IEEE, 2016.
