Learning To Compress Images and Videos
Li Cheng
[email protected]
S.V. N. Vishwanathan
[email protected]
Statistical Machine Learning, National ICT Australia
Research School of Information Sciences & Engineering, Australian National University, Canberra ACT 0200
Abstract
We present an intuitive scheme for lossy
color-image compression: Use the color information from a few representative pixels to
learn a model which predicts color on the
rest of the pixels. Now, storing the representative pixels and the image in grayscale
suffices to recover the original image. A similar scheme is also applicable for compressing videos, where a single model can be
used to predict color on many consecutive
frames, leading to better compression. Existing algorithms for colorization (the process of adding color to a grayscale image or video sequence) are tedious and require intensive human intervention. We bypass these
limitations by using a graph-based inductive
semi-supervised learning module for colorization, and a simple active learning strategy
to choose the representative pixels. Experiments on a wide variety of images and video
sequences demonstrate the efficacy of our algorithm.
1. Introduction
The explosive growth of the Internet, as witnessed
by the popularity of sites like YouTube and Google Images, has exponentially increased the amount of images and movies available for download. As more and
more visual data is being exchanged, there is an ever-increasing demand for better compression techniques
which will reduce network traffic. Typical compression
algorithms for images work in the frequency domain,
and use sophisticated techniques like wavelets. In the
case of video clips, these algorithms not only compress
each frame, but also use compression across frames in order to exploit temporal redundancy.
Appearing in Proceedings of the 24 th International Conference on Machine Learning, Corvallis, OR, 2007. Copyright
2007 by the author(s)/owner(s).
2. Semi-Supervised Learning
Semi-supervised learning refers to the problem of
learning from labeled and unlabeled data. It has been studied extensively; see Zhu (2005) for a survey. A typical graph-regularized formulation minimizes

J(f) = \lambda_H \|f\|_{\mathcal{H}}^2 + \frac{\lambda_G}{n^2} \|f\|_{\bar{G}}^2 + \frac{1}{m} \sum_{i=1}^{m} l(x_i, y_i, f),    (1)

where f denotes the vector

f = [f(x_1), \ldots, f(x_m), \ldots, f(x_n)]^\top,    (2)

and \bar{G} \in \mathbb{R}^{n \times n} is a function of the graph G which determines the specific form of regularization imposed, via

\|f\|_{\bar{G}}^2 = f^\top \bar{G} f.    (3)

Two choices of \bar{G} are particularly relevant to us:

\bar{G} = L,    (4)

where L denotes the graph Laplacian, and

\bar{G} = L^2.    (5)

By the representer theorem, the minimizer of (1) admits an expansion

f(\cdot) = \sum_{i=1}^{n} \alpha_i k(x_i, \cdot),    (6)

and for the square loss the coefficients can be obtained in closed form as

\alpha = \Big( J K + \lambda_H m I + \frac{\lambda_G m}{n^2} \bar{G} K \Big)^{-1} y,    (7)

where K is the kernel matrix with K_{ij} = k(x_i, x_j), J is the diagonal matrix which selects the m labeled examples, and y = [y_1, \ldots, y_m, 0, \ldots, 0]^\top.
In transductive learning one is given a (labeled) training set and an (unlabeled) test set. The idea of transduction is to perform predictions only for the given
test points (Chapelle et al., 2006). This is in contrast to inductive learning, where the goal is to output
a prediction function which is defined on the entire
space (Chapelle et al., 2006). In our context, this is a
key difference. While an inductive algorithm can easily be used to predict labels on closely related images,
transductive algorithms are unsuitable. This limits the applicability of transductive algorithms for video compression.
All the algorithms we discussed above are inductive, but it is easy to turn them into transductive algorithms. Let X denote the set of labeled and unlabeled points; then work with functions f : X \to \mathbb{R}, drop the regularization term \|f\|_{\mathcal{H}}^2 from the objective function (1), and minimize (Zhu, 2005; Belkin et al., 2006):

J(f) = \frac{\lambda_G}{n^2} \|f\|_{\bar{G}}^2 + \frac{1}{m} \sum_{i=1}^{m} l(x_i, y_i, f).    (8)
We note in passing that if we drop the regularization term \|f\|_{\mathcal{H}}^2 from the objective function but continue to work with f \in \mathcal{H}, i.e., f : X \to \mathbb{R}, we get an inductive algorithm whose prediction function is required to be smooth only on the observed examples (both labeled and unlabeled). In applications (like ours) where f is only used to predict on examples that are very similar to the observed examples, this suffices.
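As an illustration of this transductive recipe, the following minimal numpy sketch (our own, not the implementation used in the experiments; the function name, the value of the regularization constant, and the choice of \bar{G} = L^\top L are illustrative assumptions) minimizes a square-loss instance of the graph-regularized objective over all graph vertices at once:

```python
import numpy as np

def transductive_predict(W, labeled_idx, y_labeled, lam=1e-2):
    """Minimize (lam/n^2) * ||L f||^2 + (1/m) * sum_labeled (f_i - y_i)^2
    over the vector f of predictions on all n graph vertices.
    W is a row-normalized weight matrix (rows sum to one), so the
    degree matrix is I and the graph Laplacian is L = I - W."""
    n = W.shape[0]
    m = len(labeled_idx)
    L = np.eye(n) - W
    J = np.zeros((n, n))                 # diagonal selector of labeled vertices
    J[labeled_idx, labeled_idx] = 1.0
    y = np.zeros(n)
    y[labeled_idx] = np.asarray(y_labeled, dtype=float)
    # Setting the gradient to zero gives one linear system:
    #   ((lam/n^2) L'L + (1/m) J) f = (1/m) y
    A = (lam / n**2) * (L.T @ L) + (1.0 / m) * J
    return np.linalg.solve(A, (1.0 / m) * y)
```

On a connected graph with at least one labeled vertex the system matrix is invertible, since the quadratic term vanishes only on constant vectors.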
3. Colorization by Semi-Supervised
Learning
Colorization (the process of adding color to a grayscale image or video sequence) has attracted considerable research interest recently. Unfortunately, existing algorithms are tedious and labor-intensive. For our purposes, a particularly relevant algorithm is due to Levin et al. (2004), which we now present using notation consistent with the rest of this paper. Their approach minimizes

J(f) = \sum_{i=1}^{n} \Big( f(x_i) - \sum_{j : i \sim j} \omega_{ij} f(x_j) \Big)^2 + \sum_{i=1}^{m} \ell_\infty(f(x_i), y_i).    (9)
Here, f : X \to \mathbb{R} is a function that assigns color values to pixels, i \sim j means that pixel x_j is a neighbor of pixel x_i, and for each i the weights \omega_{ij} are nonnegative and sum to one. The predictor f is forced to take on user-specified values on all pixels where color information is available by the loss function \ell_\infty(f(x_i), y_i), which is 0 if f(x_i) = y_i and \infty otherwise. The weights \omega_{ij} are computed using a normalized radial basis function or a second-order polynomial, and take into account the similarities in intensity values.
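To make the weight construction concrete, here is a small sketch (our own; the bandwidth parameter sigma is a hypothetical choice, and other weighting schemes are possible) of normalized radial-basis-function weights computed from grayscale intensities:

```python
import numpy as np

def rbf_weights(center_intensity, neighbor_intensities, sigma=0.1):
    """Normalized RBF weights w_ij for one pixel: neighbors whose
    grayscale intensity is close to the center pixel's receive larger
    weight, and the weights are normalized to sum to one (so the
    degree matrix of the induced graph is the identity)."""
    d = np.asarray(neighbor_intensities, dtype=float) - center_intensity
    w = np.exp(-d**2 / (2.0 * sigma**2))
    return w / w.sum()
```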
To show that the above algorithm is a graph-based transductive semi-supervised learning algorithm, we begin by constructing a weighted adjacency matrix W such that W_{ij} = \omega_{ij}. Since \sum_{j} \omega_{ij} = 1, the degree matrix is D = I, and the graph Laplacian can be written as L = I - W. It is now easy to verify that

f^\top L^\top L f = \sum_{i=1}^{n} \Big( f(x_i) - \sum_{j : i \sim j} \omega_{ij} f(x_j) \Big)^2,

which coincides with f^\top L^2 f whenever W is symmetric.
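This identity between the quadratic form and the sum of squared neighborhood residuals is easy to confirm numerically; the following self-contained check (our own illustration, with a random row-normalized W so that D = I) does exactly that:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
W = rng.random((n, n))
np.fill_diagonal(W, 0.0)
W /= W.sum(axis=1, keepdims=True)      # rows sum to one, so D = I
L = np.eye(n) - W                      # graph Laplacian
f = rng.random(n)

lhs = (L @ f) @ (L @ f)                # f' L'L f = ||L f||^2
rhs = sum((f[i] - W[i] @ f) ** 2 for i in range(n))
assert np.isclose(lhs, rhs)
```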
a priori. Our algorithm, which we describe next, addresses all these issues.
5. Implementation Details
We measure reconstruction quality by the peak signal-to-noise ratio,

PSNR = 20 \log_{10} \frac{255}{\sqrt{MSE}},    (10)

where the mean squared error between the original image I and its reconstruction I' is

MSE = \frac{1}{n^2} \sum_{i,j=1}^{n} (I_{ij} - I'_{ij})^2.    (11)
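For concreteness, the PSNR and MSE computations translate directly into a few lines of numpy (our own helper, assuming 8-bit images; np.mean generalizes the 1/n^2 average to non-square images):

```python
import numpy as np

def psnr(original, reconstructed):
    """Peak signal-to-noise ratio in dB for 8-bit images."""
    diff = original.astype(float) - reconstructed.astype(float)
    mse = np.mean(diff ** 2)
    return 20.0 * np.log10(255.0 / np.sqrt(mse))
```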
6. Experiments
Our algorithm can work in two different modes: A
human-assisted mode, where the active learning module is switched off, and a completely automatic mode
which requires an oracle supplying ground truth. The
human-assisted mode is useful in situations where
ground truth is either not available or it is expensive to
label pixels. On the other hand, the automated mode
can be used for compression. We experiment with both
images and video and report our results below.
Human-Assisted Image Colorization Given a grayscale image whose color version is not known a priori, we can label a few pixels with color and hand them to our algorithm, which then learns the predictor f. The color image is revealed by applying f to the whole image. Figure 1 presents an experiment on an image (panel (a)), where, with the aid of the labeled pixels (panel (a), in color), our semi-supervised algorithm is able to produce a visually appealing color image (panel (b)).
Image Compression To test the efficacy of our image compression scheme we perform two experiments.
The aim of the first experiment is to show that our active learning approach outperforms humans in choosing pixels for labeling. Here we work with the color
image of a colony of bees on hive frames. The image
is of size 640 x 853 and is depicted in panel (a) of Figure 2. The corresponding grayscale image is depicted
in panel (b). On this image we asked a human volunteer to label certain pixels with color. The pixels
The random pixels chosen for labeling are depicted in panel (c), and the colorized image is depicted
in panel (d). Notice again that the predicted image
exhibits some artifacts (e.g. whitish color around the
forehead area). In contrast, panel (e) depicts the pixels chosen by our active learning approach, and the
corresponding colorized image is depicted in panel (f).
Our predicted image is visually indistinguishable from
the ground truth. Now the PSNR values are 38.41 and
40.95, and the numbers of pixels chosen are 2976 and
2766, for the random and active learning approaches
respectively. The evolution of the PSNR score with the
number of iterations is shown in panel (b) of Figure 3.
For the girl image, the figures are: 220641 bytes for
the color JPEG, 161136 bytes for the grayscale, and
2766 colored pixels, leading to a compression ratio of
0.781.
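As a sanity check, the ratio can be recomputed from the quoted byte counts; the figure of 4 bytes per stored color pixel (2 for color, 2 for location) is our assumption, chosen because it reproduces the reported ratio:

```python
color_jpeg = 220641      # bytes, color JPEG of the girl image
grayscale_jpeg = 161136  # bytes, grayscale JPEG
stored_pixels = 2766     # pixels kept with color, 4 bytes each (assumed)
ratio = (grayscale_jpeg + 4 * stored_pixels) / color_jpeg
print(f"{ratio:.3f}")    # prints 0.780, the reported 0.781 up to rounding
```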
Human-Assisted Video Colorization The aim of
this experiment is to show that the color predictor
learnt from a single frame can be successfully deployed
to predict color on many successive frames, without
any visible distortions.
Figure 3. The PSNR scores for (a) bees and (b) girl vs. the number of iterations of the active learning algorithm.
Figure 5. Human-assisted video compression example. See text for details.
7005 pixels. Each pixel for which we store color information requires 5 bytes of additional storage: 2 bytes for the color information (the luminance channel is already present in the grayscale image) and 3 bytes to encode its location. This adds a modest 35025 bytes (approx. 34 KB) of extra storage, leading to a compression ratio of 0.899.
is no reward for forgetting labels. Our algorithm iteratively queries for labels, but never forgets previously queried labels. It is possible, however, that we might be able to achieve the same PSNR values with far fewer pixels. Extending our algorithm to forget labels is part of our future research. Proving performance bounds for our algorithm and addressing non-stationary video sequences are also fertile areas of future research.
Acknowledgements
References

Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373-1396.

Belkin, M., Niyogi, P., & Sindhwani, V. (2006). Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. J. Mach. Learn. Res., 7, 2399-2434.

Chapelle, O., Schölkopf, B., & Zien, A., eds. (2006). Semi-Supervised Learning. Cambridge, MA: MIT Press.

Levin, A., Lischinski, D., & Weiss, Y. (2004). Colorization using optimization. In SIGGRAPH '04: ACM SIGGRAPH 2004 Papers, 689-694. New York, NY, USA: ACM Press.

Ren, X., & Malik, J. (2003). Learning a classification model for segmentation. In Proc. 9th Intl. Conf. Computer Vision, vol. 1, 10-17.

Schölkopf, B., & Smola, A. (2002). Learning with Kernels. Cambridge, MA: MIT Press.

Smola, A. J., & Kondor, I. R. (2003). Kernels and regularization on graphs. In B. Schölkopf & M. K. Warmuth, eds., Proc. Annual Conf. Computational Learning Theory, Lecture Notes in Comput. Sci., 144-158. Heidelberg, Germany: Springer-Verlag.

Zhang, T., & Ando, R. K. (2005). Graph based semi-supervised learning and spectral kernel design. Tech. Rep. RC23713, IBM T.J. Watson Research Center.

Zhu, X. (2005). Semi-supervised learning literature survey. Tech. Rep. 1530, Computer Sciences, University of Wisconsin-Madison. http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf