Visual Content Indexing and Retrieval With Psycho-Visual Models
Multimedia Systems and Applications
Series editor
Borko Furht, Florida Atlantic University, Boca Raton, USA
More information about this series at http://www.springer.com/series/6298
Jenny Benois-Pineau • Patrick Le Callet
Editors
Jenny Benois-Pineau
LaBRI UMR 5800, Univ. Bordeaux, CNRS, Bordeaux INP
Talence, France

Patrick Le Callet
LS2N, UMR CNRS 6004, Université de Nantes
Nantes Cedex 3, France
Since the early days of Pattern Recognition, researchers have tried to make computers imitate human perception and understanding of visual content. In the era of structural pattern recognition, algorithms for contour and skeleton extrapolation in binary images tried to link missing parts using the principles of optic illusions described by Marr and Hildreth.
Modeling of the Human Visual System (HVS) in the perception of visual digital content has attracted strong attention from the research community in relation to the development of image and video coding standards such as JPEG, JPEG2000, and MPEG-1,2. The main question was how strongly, and where in the image, the information could be compressed without a noticeable degradation in the decoded content, thus ensuring quality of experience for the users. Nevertheless, fundamental research at the border of signal processing, computer vision, and psycho-physics continued, and in 1998 the model of Itti, Koch and Niebur appeared, which has become the most popular model for the prediction of visual attention. They were interested in both pixel-wise saliency and the scan-path, the "static" and dynamic components. A tremendous number of saliency models for still images and video appeared during the 2000s, addressing both "low-level", bottom-up or stimuli-driven attention and high-level, "top-down", task-driven attention.
In parallel, the content-based image and video indexing and retrieval community (CBIR and CVIR) became strongly attached to the so-called "salient features" expressing signal singularities: corners, blobs, and spatio-temporal jams in video. Using the description of the neighbourhood of these singularities, we tried to describe, retrieve and classify visual content, addressing the classical tasks of visual information understanding: similarity search in images, and recognition of concepts, objects and actions. In recent years these two streams have met. Today we speak about "perceptual multimedia", "salient objects", and "interestingness" and try to incorporate this knowledge into our visual indexing and retrieval algorithms; we develop models of prediction of visual attention adapted to our particular indexing tasks. . . and we all use models of visual attention to drive recognition methods.
In this book we have tried to give a complete state of the art of this highly populated and rapidly expanding research trend: visual information indexing and retrieval with psycho-visual models. We hope that the book will be of interest to researchers as well as PhD and master's students and will serve as a good guide in this field.
We thank the French National Research Network GDR CNRS ISIS for its support of scientific exchanges during our workshops, and Souad Chaabouni and Boris Mansencal for their technical help in preparing the manuscript of the book.
Over the last two decades, perceptual computing has emerged as a major topic for both the signal processing and computer science communities. Given that many technologies produce signals for humans or process signals produced by humans, it is all the more important to consider perceptual aspects in the design loop. Whatever the use cases, perceptual approaches rely on perceptual models that are supposed to predict and/or mimic some aspects of the perceptual system.
Such models are not trivial to obtain. Their development implies a multidisciplinary approach encompassing, in addition to signal processing and computer science, neuroscience, psychology and physiology, to name a few. Perceptual modeling depends on the ability to identify the part of the system under study. In the case
of visual perception, some sub-parts of the human visual system are easier to identify than others, especially through psychophysics. With such approaches, reasonably adequate models have been successfully developed, mainly regarding the "low level" of human vision. The first-order approximation for contrast perception given by Weber's law is a good and classic example, but we have been able to go much further, developing models for masking effects, color perception, and receptive field theory. In the late 1990s, there were already quite advanced and practical perceptual models available to image processing engineers. Most of them, such as Just Noticeable Difference (JND) models, address the visibility of signals and more specifically the visible differences between two signals. This knowledge is naturally very useful for applications such as image quality prediction or image compression.
For years, these two applications have constituted a great playground for perceptual computing. They have probably pushed the evolution of perceptual models along with the development of new immersive technologies (increasing resolution, dynamic range, . . . ), leading not only to more advanced JND models [19] but also to the exploration of higher levels of visual perception.
Visual attention modeling is probably the best illustration of this trend, having concentrated massive efforts from both the signal processing and computer science communities over the last decade. From a few papers in the mid 2000s, it is now a major topic covering several sessions in major conferences. The strong effort on visual attention modeling can also be justified from the application angle. Knowing where humans are paying attention is very useful for the perceptual tweaking of many algorithms: interactive streaming, ROI compression, gentle advertising [17]. The visual content indexing and retrieval field is no exception, and many researchers have started to adopt visual attention modeling for their applications.
As the term Visual Attention has been used in a very wide sense, even more so in the community this book concerns, it requires some clarification. It is common to associate visual attention with eye gaze location. Nevertheless, eye gaze location does not necessarily fully reflect what human observers are paying attention to. One should first distinguish between overt and covert attention:
• Overt attention is usually associated with eye movements, mostly related to gaze fixations and saccades. It is easily observable nowadays with eye-tracker devices, which record gaze location over time.
• Covert attention: William James [13] explained that we are able to focus attention on peripheral locations of interest without moving the eyes. Covert attention is therefore independent of oculomotor commands. A good illustration is how a driver can remain fixating the road while simultaneously covertly monitoring road signs and lights.
Even if overt attention and covert attention are not independent, overt attention has been by far the more studied, mostly because it can be measured in a straightforward way using eye-tracking techniques. This is also one of the reasons why
Itti et al. [12] describe the neurological backbone behind top-down and bottom-up attention modeling as natural outcomes of the inferotemporal cortex and posterior parietal cortex based processing mechanisms, respectively.
Whatever the considered neurological model, it is more important in most of their uses to appreciate the relative weights to be used, or the mechanisms of interaction, between these top-down and bottom-up approaches. Schill et al. [27] highlighted that humans gaze at regions where further disambiguation of information is required. After the gaze is deployed towards such a region, it is the bottom-up features standing out through feature selection that help achieve this goal. The work in [23] also highlights some important aspects of free-viewing in this regard, where the variative relative weight of top-down versus bottom-up processing was examined as a function of time. While attention was initially found to be strongly bottom-up driven, there was a strong top-down effect in the range of 100–2000 ms. Later, however, the interaction between the two processes reaches an equilibrium state.
From the application angle addressed in this book, it is desirable to have models of visual attention. Despite their common goal of identifying the most relevant information in a visual scene, the type of relevance information predicted by visual attention models can be very different. While some of the models focus on the prediction of saliency-driven attention locations, others aim at predicting regions of interest (ROI) at an object level.
Several processes are thought to be involved in making the decision for an ROI, including attending to and selecting a number of candidate visual locations, recognizing the identity and a number of properties of each candidate, and finally evaluating these against intentions and preferences, in order to judge whether or not
an object or a region is interesting. Probably the most important difference between
eye movement recordings and ROI selections is related to the cognitive functions
they account for. It is very important to distinguish between three “attention”
processes as defined by Engelke and Le Callet [6]:
• Bottom-up Attention: exogenous process, mainly based on signal driven visual
attention, very fast, involuntary, task-independent.
• Top-down Attention: endogenous process, driven by higher cognitive factors (e.g.
interest), slower, voluntary, task-dependent, mainly subconscious.
• Perceived Interest: strongly related to endogenous top-down attention but involv-
ing conscious decision making about interest in a scene.
Eye tracking data are strongly driven by both bottom-up and top-down attention, whereas ROI selections can be assumed to be mainly driven by top-down attention and especially perceived interest. They are the result of a conscious selection of the ROI given a particular task, providing a level of perceived interest or perceptual importance. Consequently, from a conceptual point of view, it might be interesting to distinguish between two different types of perceptual relevance maps of visual content: Importance versus Salience maps. While salience refers to the pop-out effect of a certain feature, either temporal or spatial, importance maps indicate the perceived importance as it could be rated by human subjects. A saliency map is a probabilistic spatial signal that indicates the relative probability with which users regard a certain region. Importance maps, on the other hand, could be obtained by asking users to rate the importance of different objects in a scene.
As stated before, the terms visual attention and saliency can be found in the literature with various meanings. Whatever model is adopted, researchers should be cautious and check whether the selected model is designed to meet the requirements of the targeted application. Moreover, one should also carefully verify the data on which models have been validated. In the context of visual content indexing and retrieval applications, models touching concepts related to top-down saliency, ROI and perceived interest/importance seem the most appealing. Nevertheless, while practically useful, it is very rare that these concepts are explicitly referred to as such, including in some of the chapters in this book. The careful reader should be able to make this distinction when visual attention is concerned.
given pool of images [21]. The authors proposed to consider the saliency of images at two levels: the local level is the saliency of segmented regions, and the global level is the saliency defined by image edges. For querying the image database they select salient regions using the underlying Harel (GBVS) saliency map [10]. To select the salient regions to be used in a query, the authors use a statistic which is the mean saliency value across a region. Salient regions are selected according to the criterion of retrieval performance, by thresholding the histogram of this statistic over the whole image partition. The authors use various thresholding methods, including the well-known Otsu method [20]. The querying is fulfilled by computing the Earth Mover's Distance between the regions of the query image and the database image, with saliency weighting. The global saliency, expressed by the energy of contours, is also incorporated into the querying process. They conduct multiple tests on the COREL 1000 and SIVAL databases and show that taking saliency into account allows for better top-ranked results: more similar images are returned at the top of the rank list.
In chapter “Visual Saliency for the Visualization of Digital Paintings” the authors show how saliency maps can be used in a rather unusual application of visual content analysis, namely the creation of video clips from art paintings for the popularization of cultural heritage. They first build a saliency map completing Itti's model [12] with a saturation feature. Then the artist selects and weights salient regions interactively. The regions of interest (ROIs) are then ordered according to the central bias hypothesis. Finally, an oriented graph of salient regions is built. The graph edges express the order in which the regions will be visualized, and the edges are weighted with transition times in the visualization process, set manually by the artist. Several generated video clips were presented to eight naive users in a psycho-visual experiment with the task of scoring how well the proposed video animation clip reflects the content of the original painting. The results, measured by the mean opinion score (MOS) metric, show that, in the case of four-region visualization, the MOS values for randomly generated animation clips and those generated with the proposed method differ significantly, by up to 12%.
Finally, chapter “Predicting Interestingness of Visual Content” is devoted to the prediction of the interestingness of multimedia content, such as image, video and audio. The authors consider visual interestingness from a psychological perspective. It is expressed by two structures, "novelty-complexity" and "coping potential". The former indicates the interest shown by subjects for new and complex events, and the latter measures a subject's ability to discern the meaning of a certain event. From the content-driven, automatic perspective, the interestingness of content has been studied in a classical visual content indexing framework, selecting the most relevant image-based features within a supervised learning (SVM) approach [30]. Interestingness of media content is a perceptual and highly semantic notion that remains very subjective and dependent on the user and the context. The authors address this notion for the target application of a VOD system, propose a benchmark dataset and explore the relevance of different features, ranging from the most popular local features, such as densely sampled SIFT, to the latest CNN features extracted from the fully connected layer fc7 and the prob features of the AlexNet deep CNN [14]. The authors have conducted the evaluation of various methods
References
1. Agrawal, P., Girshick, R.B., Malik, J.: Analyzing the performance of multilayer neural networks
for object recognition. In: Computer Vision - ECCV 2014–13th European Conference, Zurich,
September 6–12 (2014), Proceedings, Part VII, pp. 329–344 (2014)
2. Alexe, B., Deselaers, T., Ferrari, V.: Measuring the objectness of image windows. IEEE Trans.
Pattern Anal. Mach. Intell. 34(11), 2189–2202 (2012)
3. Buswell, G.T.: How People Look at Pictures. University of Chicago Press, Chicago, IL (1935)
4. de Carvalho Soares, R., da Silva, I.R., Guliato, D.: Spatial locality weighting of features using
saliency map with a BoVW approach. In: International Conference on Tools with Artificial
Intelligence, 2012, pp. 1070–1075 (2012)
5. de San Roman, P.P., Benois-Pineau, J., Domenger, J.-P., Paclet, F., Cataert, D., de Rugy,
A.: Saliency driven object recognition in egocentric videos with deep CNN. CoRR,
abs/1606.07256 (2016)
6. Engelke, U., Le Callet, P.: Perceived interest and overt visual attention in natural images. Signal
Process. Image Commun. 39(Part B), 386–404 (2015). Recent Advances in Vision Modeling
for Image and Video Processing
7. Frieden, B.R.: Science from Fisher Information: A Unification, Cambridge edn. Cambridge
University Press, Cambridge (2004)
8. Girshick, R.B., Donahue, J., Darrell, T., Malik, J.: Region-based convolutional networks for
accurate object detection and segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 38(1),
142–158 (2016)
9. González-Díaz, I., Buso, V., Benois-Pineau, J.: Perceptual modeling in the problem of active
object recognition in visual scenes. Pattern Recogn. 56, 129–141 (2016)
10. Harel, J., Koch, C., Perona, P.: Graph-based visual saliency. In: Advances in Neural
Information Processing Systems, vol. 19, pp. 545–552. MIT, Cambridge (2007)
11. Harris, C., Stephens, M.: A combined corner and edge detector. In: Proceedings of the 4th
Alvey Vision Conference, pp. 147–151 (1988)
12. Itti, L., Koch, C.: Computational modelling of visual attention. Nat. Rev. Neurosci. 2(3),
194–203 (2001)
13. James, W.: The Principles of Psychology. Read Books, Vancouver, BC (2013)
14. Jiang, Y.-G., Dai, Q., Mei, T., Rui, Y., Chang, S.-F.: Super fast event recognition in internet
videos. IEEE Trans. Multimedia 177(8), 1–13 (2015)
15. Larson, M., Soleymani, M., Gravier, G., Jones, G.J.F.: The benchmarking initiative for
multimedia evaluation: MediaEval 2016. IEEE Multimedia 1(8), 93–97 (2017)
16. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis.
60, 91–110 (2004)
17. Le Meur, O., Le Callet, P.: What we see is most likely to be what matters: visual attention
and applications. In: 2009 16th IEEE International Conference on Image Processing (ICIP),
pp. 3085–3088 (2009)
18. Mikolajczyk, K., Schmid, C.: Indexing based on scale invariant interest points. In: Proceedings
of the 8th IEEE International Conference on Computer Vision, vol. 1, pp. 525–531 (2001)
19. Narwaria, M., Mantiuk, K.R., Da Silva, M.P., Le Callet, P.: HDR-VDP-2.2: a calibrated method
for objective quality prediction of high-dynamic range and standard images. J. Electron.
Imaging 24(1), 010501 (2015)
20. Otsu, N.: A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man
Cybern. 9(1), 62–66 (1979)
21. Papushoy, A., Bors, G.A.: Visual attention for content based image retrieval. In: 2015 IEEE
International Conference on Image Processing, ICIP 2015, Quebec City, QC, 27–30 September
2015, pp. 971–975
22. Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Lost in quantization: Improving
particular object retrieval in large scale image databases. In: 2008 IEEE Computer Society
Conference on Computer Vision and Pattern Recognition (CVPR 2008), Anchorage, Alaska,
24–26 June 2008
23. Rai, Y., Cheung, G., Le Callet, P.: Quantifying the relation between perceived interest and
visual salience during free viewing using trellis based optimization. In: 2016 International
Conference on Image, Video, and Multidimensional Signal Processing, vol. 9394, July 2016
24. Rayatdoost, S., Soleymani, M.: Ranking images and videos on visual interestingness by
visual sentiment features. In: Working Notes Proceedings of the MediaEval 2016 Workshop,
Hilversum, 20–21 October 2016, CEUR-WS.org
25. Ren, X., Gu, C.: Figure-ground segmentation improves handled object recognition in
egocentric video. In: IEEE Conference on Computer Vision and Pattern Recognition (2010)
26. Rosten, E., Drummond, T.: Fusing points and lines for high performance tracking. In:
Proceedings of the IEEE International Conference on Computer Vision, vol. 2, pp. 1508–1511
(2005)
27. Schill, K., Umkehrer, E., Beinlich, S., Krieger, G., Zetzsche, C.: Scene analysis with saccadic
eye movements: top-down and bottom-up modeling. J. Electron. Imaging 10(1), 152–160
(2001)
28. Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y.: Overfeat: integrated
recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229
(2013)
29. Sermanet, P., Kavukcuoglu, K., Chintala, S., LeCun, Y.: Pedestrian detection with unsupervised
multi-stage feature learning. In: 2013 IEEE Conference on Computer Vision and Pattern
Recognition, Portland, OR, June 23–28, pp. 3626–3633 (2013)
30. Soleymani, M.: The quest for visual interest. In: ACM International Conference on
Multimedia, New York, pp. 919–922 (2015)
31. Uijlings, J.R.R., Van de Sande, K.E.A., Gevers, T., Smeulders, A.W.M.: Selective search for
object recognition. Int. J. Comput. Vis. 104(2), 154–171 (2013)
32. Vig, E., Dorr, M., Cox, D.: Space-Variant Descriptor Sampling for Action Recognition Based
on Saliency and Eye Movements, pp. 84–97. Springer, Firenze (2012)
33. Wang, H., Schmid, C.: Action recognition with improved trajectories. In: Proceedings of the
IEEE International Conference on Computer Vision (2013)
34. Wang, H., Kläser, A., Schmid, C., Liu, C.-L.: Action recognition by dense trajectories. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pp. 3169–3176. IEEE, New York (2011)
35. Wang, H., Oneata, D., Verbeek, J., Schmid, C.: A robust and efficient video representation for
action recognition. Int. J. Comput. Vis. 219–38 (2016)
Perceptual Texture Similarity for Machine
Intelligence Applications
1 Introduction
Textures are a fundamental part of the visual scene. They are random structures often characterized by homogeneous properties such as color, orientation and regularity. They can appear as either static or dynamic, where static textures are limited to the spatial domain (like the texture images shown in Fig. 1), while dynamic textures involve both the spatial and temporal domains (Fig. 2).
Research on texture perception and analysis has a long history. There exist many approaches to model the human perception of textures, and also many tools to characterize texture. They have been used in several applications such
Fig. 2 Example of dynamic textures from DynTex Dataset [93]. First row represents the first
frame, and next rows are frames after respectively 2 s
The rest of the chapter is organized as follows: Sect. 2 discusses the meaning of texture in both technical and non-technical contexts. The details of texture perception, covering both static texture and motion perception, are given in Sect. 3. The models of texture similarity are reviewed in Sect. 4, with benchmarking tools in Sect. 5. The application of texture similarity models in image and video compression is discussed in Sect. 6, and the conclusion is given in Sect. 7.
2 What is Texture
Linguistically, the word texture deviates significantly from its technical meaning in computer vision and image processing. According to the Oxford dictionary [86], the word refers to one of the following:
1. The way a surface, substance or piece of cloth feels when you touch it
2. The way food or drink tastes or feels in your mouth
3. The way that different parts of a piece of music or literature are combined to create a final impression
Technically, however, visual texture has many other definitions, for example:
• We may regard texture as what constitutes a macroscopic region. Its structure
is simply attributed to pre-attentive patterns in which elements or primitives are
arranged according to placement order [110].
• Texture refers to the arrangement of the basic constituents of a material. In a
digital image, texture is depicted by spatial interrelationships between, and/or
spatial arrangement of the image pixels [2].
• Texture is a property that is statistically defined. A uniformly textured region
might be described as “predominantly vertically oriented”, “predominantly
small in scale”, “wavy”, “stubbly”, “like wood grain” or “like water” [58].
• We regard image texture as a two-dimensional phenomenon characterized by
two orthogonal properties: spatial structure (pattern) and contrast (the amount
of local image structure) [84].
• Images of real objects often do not exhibit regions of uniform and smooth
intensities, but variations of intensities with certain repeated structures or
patterns, referred to as visual texture [32].
• Textures, in turn, are characterized by the fact that the local dependencies
between pixels are location invariant. Hence the neighborhood system and the
accompanying conditional probabilities do not differ (much) between various
image loci, resulting in a stochastic pattern or texture [11].
• Texture images can be seen as a set of basic repetitive primitives characterized
by their spatial homogeneity [69].
• Texture images are spatially homogeneous and consist of repeated elements, often subject to some randomization in their location, size, color, orientation [95].
• Texture Movie:
1. Texture movies are obtained by filming a static texture with a moving camera
[119].
• Textured Motion:
1. Rich stochastic motion patterns which are characterized by the movement of a
large number of distinguishable or indistinguishable elements, such as falling
snow, flock of birds, river waves, etc. [122].
• Video Texture:
1. Video textures are defined as sequences of images that exhibit certain
stationarity properties with regularity exhibiting in both time and space [42].
It is also worth mentioning that in the context of component based video coding, textures are usually considered as detail-irrelevant regions, or more specifically, regions whose synthesis is not noticed by the observers [9, 108, 134].
As seen, there is no universal definition of the visual phenomenon of texture, and there is a large divergence between definitions of static and dynamic textures. Thus, for this work, we consider visual texture as:
A visual phenomenon, covering both spatial and temporal texture, where spatial textures refer to homogeneous regions of the scene composed of small elements (texels) arranged in a certain order; they might exhibit simple motion such as translation, rotation and zooming. On the other hand, temporal textures are textures that evolve over time, allowing both motion and deformation, with certain stationarity in space and time.
Static texture perception has attracted the attention of researchers for decades, and there is a large body of research papers dealing with this issue. Most of the studies attempt to understand how two textures can be visually discriminated, in an effortless cognitive action known as pre-attentive texture segregation.
Julesz studied this issue extensively. In his initial work in [51, 53], he posed the question of whether the human visual system is able to discriminate textures, generated by a statistical model, based on their kth-order statistics, and what the minimum value of k is beyond which pre-attentive discrimination is no longer possible. The order of statistics refers to the probability distribution of pixel values: the first order measures how often a pixel has a certain color (or luminance value), while the second order measures the probability of observing a given combination of colors at two pixels separated by a given distance, and the same can be generalized to higher-order statistics.
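As a concrete illustration of these definitions, the following sketch (an illustrative example, not code from the chapter) computes a first-order statistic (gray-level histogram) and a second-order statistic (co-occurrence of gray-level pairs at a fixed displacement) for a grayscale texture patch; the displacement vector and the number of gray levels are arbitrary choices.

import numpy as np

def first_order_stats(img, levels=8):
    """First-order statistics: how often each gray level occurs."""
    q = (img.astype(np.float64) / 256.0 * levels).astype(int).clip(0, levels - 1)
    hist = np.bincount(q.ravel(), minlength=levels)
    return hist / hist.sum()

def second_order_stats(img, dx=1, dy=0, levels=8):
    """Second-order statistics: joint distribution of gray-level pairs
    separated by the displacement (dx, dy)."""
    q = (img.astype(np.float64) / 256.0 * levels).astype(int).clip(0, levels - 1)
    h, w = q.shape
    a = q[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    b = q[max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)]
    cooc = np.zeros((levels, levels))
    np.add.at(cooc, (a.ravel(), b.ravel()), 1)
    return cooc / cooc.sum()

# Two textures with identical first-order but different second-order statistics
# would be distinguished only by the co-occurrence distribution.
texture = np.random.randint(0, 256, (64, 64))
print(first_order_stats(texture))
print(second_order_stats(texture, dx=1, dy=0))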
Fig. 3 Examples of pre-attentive texture discrimination. Each image is composed of two textures side-by-side. (a) and (b) are easily distinguishable textures because of differences in the first- and second-order statistics (resp.), while (c), which has identical first- and second-order but different third-order statistics, is not
Fig. 4 Example of two textures (side-by-side) having identical third-order statistics, yet pre-attentively distinguishable
the visual cortex, analyzes the input signal through a set of narrow frequency channels, resembling to some extent Gabor filtering [94]. Accordingly, different models of texture discrimination have been developed, based on Gabor filtering [85, 118], difference of offset Gaussians [65], etc. These models generally perform the following steps:
1. Multi-channel filtering
2. Non-linearity stage
3. Statistics in the resulting space
Texture perception models based on the multi-channel filtering approach are known as the back-pocket model (according to Landy [57, 58]). This model, shown in Fig. 5, consists of three fundamental stages: linear, non-linear, linear (LNL). The first linear stage accounts for the linear filtering of the multi-channel approach. It is followed by a non-linear stage, which is often a rectification. This stage is required because the filters usually have zero mean, so without rectification their responses over a region of uniform luminance would on average cancel out. The last stage is referred to as pooling, where a simple sum can give an attribute for a region such that it can be easily segmented or attached to a neighboring region. The LNL model is also occasionally called filter-rectify-filter (FRF), after the way it performs the segregation [98].
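To make the three stages concrete, the sketch below (an illustrative reconstruction, not code from the chapter) runs a small Gabor filter bank over an image, applies a rectifying non-linearity, and pools the rectified responses into per-channel statistics; the kernel construction, filter sizes, frequencies and orientations are arbitrary choices.

import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(freq, theta, sigma=4.0, size=21):
    """Odd-symmetric Gabor kernel tuned to spatial frequency `freq`
    (cycles/pixel) and orientation `theta` (radians)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.sin(2 * np.pi * freq * xr)

def lnl_features(img, freqs=(0.1, 0.2), thetas=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Linear filtering -> rectification -> pooling (back-pocket / FRF model)."""
    img = img.astype(np.float64)
    feats = []
    for f in freqs:
        for t in thetas:
            response = convolve2d(img, gabor_kernel(f, t), mode='same')  # linear stage
            rectified = np.abs(response)                                  # non-linearity
            feats.append(rectified.mean())                                # pooling
    return np.array(feats)

texture = np.random.rand(64, 64)
print(lnl_features(texture))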
Texture videos, compared to texture images, add the temporal dimension to the perceptual space. Thus, it is important to include the temporal properties of the visual system in order to understand their perception. For this reason, this subsection provides an overview of studies on motion perception.
The main unit responsible for motion perception is the visual cortex [40]. Generally, the functional units of the visual cortex responsible for motion processing can be grouped into two stages:
1. Motion Detectors
The motion detectors are the visual neurons whose firing rate increases when an object moves in front of the eye, especially within the foveal region. Several studies have shown that the primary visual cortex area (V1) is the place where motion detection happens [20, 83, 102, 116]. In V1, simple cell neurons are often modeled as spatio-temporal filters tuned to a specific spatial frequency, orientation and speed. On the other hand, complex cells perform some non-linearity on top of the simple cells (half/full-wave rectification, etc.).
The neurons of V1 are only responsive to signals having the preferred frequency-orientation-speed combination. Thus, there is still a lack of motion integration across all neurons. Besides, the filter response cannot cope with the aperture problem. As shown in Fig. 6, the signal in the middle of the figure shows a moving pattern with a certain frequency detected to be moving up, while it could actually be moving up-right or up-left. This is also true for the other signals in the figure.
2. Motion Extractors
The motion integration and aperture problems are solved at a higher level of the visual cortex, namely inside the extra-striate middle temporal (MT) area.
To our knowledge, a perceptual model that governs both static and dynamic textures does not exist. The main issue is that although extensive perceptual studies on texture images exist, texture videos have not yet been explored.
Looking at the hierarchy of the visual system in Fig. 7, we can differentiate two pathways after V1. The upper one is called the dorsal stream, while the lower one is called the ventral stream. The dorsal stream is responsible for motion analysis, while the ventral stream is mainly concerned with shape analysis. For this reason, the dorsal stream is known as the "where" stream, while the ventral one is known as the "what" stream [40].
One plausible assumption about texture perception is that texture has no shape. This means that visual texture processing is not in the ventral stream. Beside this, one can also assume that the type of motion is not a structured motion. Thus, it is not processed by the dorsal stream. Accordingly, the resulting texture perception model is only due to V1 processing. That is, the perceptual space is composed of proper modeling of V1 filters along with their non-linearity process. We consider this type of modeling as Bottom-Up Modeling.
Fig. 7 Hierarchy of the visual system: the dorsal ("where") stream originates in V1 (speed, direction, spatial and temporal filtering, orientation, color), while the ventral ("what") stream continues through V2 (edges, illusory edges, border ownership, color), V4 (angles, curvature, perceived color, kinetic contour, motion), TEO/PIT (simple shapes) and TE/AIT (complex shapes/body parts, object recognition, object invariance)
On the other hand, another assumption about texture perception can be made. Similar to the Julesz conjectures (Sect. 3.1), one can study different statistical models for understanding texture discrimination. This includes either higher-order models, or models of the same order in different spaces. One can also redefine what a texton is. These models impose different assumptions about the human visual system that do not consider the actual neural processing. We consider this type of modeling as Top-Down Modeling.
Texture similarity is a very special problem that requires a specific analysis of the texture signal. This is because two textures can look very similar even if there is a large pixel-wise difference. As shown in Fig. 8, each group of three textures looks similar overall, but there is still a large difference if one makes a point-by-point comparison. Thus, the human visual system does not compute similarity using pixel comparison, but rather considers the overall difference in the semantics. For this reason, simple difference metrics, such as the mean squared error, cannot accurately express texture (dis-)similarity, and proper models for measuring texture similarity have long been studied.
This is even more difficult in the case of dynamic textures: because the details change a lot over time, a point-wise comparison would fail to express the visual difference. In the following subsections, a review of the existing texture similarity models is provided, covering both static and dynamic textures.
Transform-based modeling has gained a lot of attention in several classical as well as recent approaches to texture similarity. This is because of its direct link with the neural processing of visual perception. As explained in Sect. 3, the neural mechanisms of both static texture and motion perception involve a kind of subband filtering process.
One of the early approaches to texture similarity was proposed by Manjunath et al. [67], in which the mean and standard deviation of the texture subbands (obtained by Gabor filtering) are compared and the similarity is assessed accordingly. Following this approach, many other similarity metrics have been defined in a similar way, using different filtering methods or different statistical measures. For example, the Kullback-Leibler divergence is used in [25] and [26]. Another approach uses the steerable pyramid filter [101] and considers the dominant orientation and scale [69].
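A minimal sketch of this family of metrics is given below; it reuses the Gabor filter bank idea of Sect. 3, and the kernel construction and the normalized L1 distance are illustrative choices, not the exact formulation of [67].

import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(freq, theta, sigma=4.0, size=21):
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    return np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.cos(2 * np.pi * freq * xr)

def subband_signature(img, freqs=(0.1, 0.2, 0.3), thetas=(0, np.pi/4, np.pi/2, 3*np.pi/4)):
    """Mean and standard deviation of the magnitude of each Gabor subband."""
    img = img.astype(np.float64)
    sig = []
    for f in freqs:
        for t in thetas:
            mag = np.abs(convolve2d(img, gabor_kernel(f, t), mode='same'))
            sig.extend([mag.mean(), mag.std()])
    return np.array(sig)

def texture_distance(img_a, img_b):
    """Normalized L1 distance between subband signatures (smaller = more similar)."""
    sa, sb = subband_signature(img_a), subband_signature(img_b)
    return np.sum(np.abs(sa - sb) / (np.abs(sa) + np.abs(sb) + 1e-12))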
Knowing the importance of subband statistics, Heeger et al. proposed to synthesize textures by matching the histogram of each subband of the original and synthesized textures. To overcome the problem of the irreversibility of Gabor filtering, they used the steerable pyramid filter [101]. The resulting synthesized textures were considerably similar to the originals, especially in the case of highly stochastic textures. The concept was extended by Portilla et al. [95], where a larger number of features defined in the subband domain are matched, resulting in a better quality of synthesis.
The significance of subband statistics has led to further investigation of texture similarity in that domain. Recently, a new class of similarity metrics, known as structural similarity, has been introduced. The structural texture similarity metric (STSIM) was first introduced in [137], then enhanced and further developed in [138, 140] and [64]. The basic idea behind these metrics is to decompose the texture using the steerable pyramid filter and to measure statistical features in that domain. The set of statistics of each subband contains the mean and variance. Besides, the cross-correlation between subbands is also considered. Finally, these features are fused to form a metric that has shown high performance in texture retrieval.
The filter-bank approach, which was applied to static textures, has also been used in dynamic texture modeling by several studies, although to a much smaller extent than for static textures. In [103], three-dimensional wavelet energies were used as texture features. A comparison of different wavelet-filtering-based approaches, including purely spatial, purely temporal and spatio-temporal wavelet filtering, is given in [30].
A relatively recent study on using the energies of Gabor filtering is found in [39]. The work is claimed to be inspired by the human visual system, as it resembles to some extent the V1 cortical processing (Sect. 3).
Beside this, there exists another series of papers, by Derpanis et al. [21, 22], employing another type of subband filtering, namely third derivatives of Gaussians tuned to a certain scale and orientation (in 3D space). The approach was used for texture representation and recognition, and also for dynamic scene understanding and action recognition [23].
The auto-regressive (AR) model has been widely used to model both static and dynamic textures, especially for texture synthesis purposes. In its simplest form, AR can be expressed as:

s(x, y, t) = \sum_{i=1}^{N} \varphi_i \, s(x + \Delta x_i, y + \Delta y_i, t + \Delta t_i) + n(x, y, t)    (1)

where s(x, y, t) represents the pixel value at the spatio-temporal position (x, y, t), \varphi_i are the model weights, \Delta x_i, \Delta y_i, \Delta t_i are the shifts that cover the neighboring pixels, and n(x, y, t) is the system noise, which is assumed to be white Gaussian noise.
The assumption behind AR is that each pixel is predictable from a set of its neighboring spatio-temporal pixels by means of a weighted summation, and the error is due to the model noise n(x, y, t). Examples of using this model for synthesis can be found in [4, 12, 55].
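As an illustration of Eq. (1), the sketch below (illustrative only, with an arbitrary causal neighborhood) estimates the AR weights of a texture sequence by least squares; the fitted weights can then serve as a feature vector, or the model can be run forward to synthesize new frames.

import numpy as np

def fit_spatiotemporal_ar(video, offsets):
    """Least-squares fit of s(x,y,t) = sum_i w_i * s(x+dx_i, y+dy_i, t+dt_i) + n.
    `video` has shape (T, H, W); `offsets` is a list of (dx, dy, dt) describing a
    causal neighborhood. Returns the weight vector w."""
    T, H, W = video.shape
    m = 2  # margin so that all shifted samples stay inside the volume
    rows, target = [], []
    for t in range(m, T):
        patch = video[t, m:H - m, m:W - m].ravel()
        preds = [video[t + dt, m + dy:H - m + dy, m + dx:W - m + dx].ravel()
                 for (dx, dy, dt) in offsets]
        rows.append(np.stack(preds, axis=1))
        target.append(patch)
    A = np.concatenate(rows)          # (num_samples, num_offsets)
    b = np.concatenate(target)
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

offsets = [(-1, 0, 0), (0, -1, 0), (0, 0, -1), (-1, -1, -1)]
video = np.random.rand(10, 32, 32)
print(fit_spatiotemporal_ar(video, offsets))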
The auto-regressive moving average (ARMA) model is an extension of the simple AR model that is elegantly suited to dynamic textures. It was first introduced by Soatto and Doretto [29, 104] for the purpose of dynamic texture recognition. The ARMA model is mathematically expressed as:

x(t + 1) = A x(t) + v(t),    y(t) = C x(t) + w(t)    (2)

where x(t) is a hidden state and y(t) is the output state, v(t) and w(t) are system noise (normally distributed), and A, C are the model weights, as in AR. The output state is the original frame of the image sequence. Comparing Eq. (2) with Eq. (1), it is clear that the model assumes that the hidden state x(t) follows an AR process, and the observed state is a weighted version of the hidden state with some added noise.
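A common way to identify the parameters of such a model, following the general approach of [29] but simplified here, is to obtain C from a PCA of the frames and A by least squares on the resulting state sequence; the sketch below assumes grayscale frames flattened into column vectors and an arbitrary state dimension.

import numpy as np

def fit_arma_dynamic_texture(frames, state_dim=10):
    """Identify (A, C) of  x(t+1) = A x(t) + v(t),  y(t) = C x(t) + w(t).
    `frames` has shape (T, H, W). Returns A, C and the state trajectory X."""
    T = frames.shape[0]
    Y = frames.reshape(T, -1).T.astype(np.float64)      # (pixels, T)
    Y = Y - Y.mean(axis=1, keepdims=True)
    U, S, Vt = np.linalg.svd(Y, full_matrices=False)
    C = U[:, :state_dim]                                 # observation matrix
    X = np.diag(S[:state_dim]) @ Vt[:state_dim]          # states, shape (state_dim, T)
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])             # least-squares transition matrix
    return A, C, X

frames = np.random.rand(30, 24, 24)
A, C, X = fit_arma_dynamic_texture(frames, state_dim=5)
print(A.shape, C.shape)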
Both AR and ARMA can be directly used to measure texture similarity by comparing the model parameters. In other words, the parameters can be considered as visual features with which to compare textures and express their similarity. This has been used in texture recognition, classification, segmentation and editing [27, 28]. Beyond that, the approach has been extended in several studies to synthesize similar textures, for example by working in the Fourier domain [1], by combining several ARMA models with transition probabilities [59], by using higher-order decompositions [18], and others [35, 131].
Although there is no direct link between texture perception and the auto-regressive models, we can still interpret their performance in terms of the Julesz conjectures (Sect. 3.1). The assumption behind these models is that textures look similar if they are generated by the same statistical model with a fixed set of parameters, while Julesz conjectured that textures look similar if they have the same first- and second-order statistics. Thus, these models can be understood as an extension of the conjecture, in which the condition for similarity is stated more precisely.
Recalling that textons are local conspicuous features (Sect. 3.1), a large body of research has been devoted to defining local features that can be used to measure texture similarity. One of the first approaches, and still very widely used, is the local binary pattern (LBP) approach [84]. This approach simply compares each pixel with each of its circular neighbors and assigns a binary digit (0 or 1) according to whether the neighbor value is larger or smaller than the center value. The resulting binary numbers are gathered in a histogram, and any histogram-based distance metric can be used.
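For concreteness, a minimal sketch of an 8-neighbor LBP histogram is given below; it uses the immediate 3×3 neighborhood rather than an interpolated circular one, which is a simplification of [84], and the chi-square comparison is only one possible histogram metric.

import numpy as np

def lbp_histogram(img):
    """Basic 8-neighbor local binary pattern histogram (3x3 neighborhood)."""
    img = img.astype(np.float64)
    center = img[1:-1, 1:-1]
    # Offsets of the 8 neighbors, ordered clockwise from the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center, dtype=np.int32)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        codes |= (neighbor >= center).astype(np.int32) << bit
    hist = np.bincount(codes.ravel(), minlength=256)
    return hist / hist.sum()

def lbp_distance(img_a, img_b):
    """Chi-square distance between LBP histograms."""
    ha, hb = lbp_histogram(img_a), lbp_histogram(img_b)
    return np.sum((ha - hb) ** 2 / (ha + hb + 1e-12))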
The approach has gained a lot of attention due to its simplicity and high performance. It was directly adapted to dynamic textures in two manners [136]: first, by considering the neighborhood to be cylindrical instead of circular, in the case of the Volume Local Binary Pattern (V-LBP); second, by performing three orthogonal LBPs on the xy, xt and yt planes, which is therefore called Three Orthogonal Planes LBP (LBP-TOP).
Several extensions of the basic LBP model have been proposed, for example a similarity metric for static textures known as the local radius index (LRI) [132, 133], which incorporates LBP along with other pixel-to-neighbor relationships. Besides, there is another method that utilizes Weber's law of sensation, known as the Weber Local Descriptor (WLD) [14].
Rather than restricting the neighborhood relationship to binary descriptors, other studies have also introduced ternary numbers [6, 46, 47] in what is known as the texture spectrum.
It is also worth mentioning that some studies consider textons as the result of a frequency analysis of texture patches. The study of Liu et al. [61] considered the marginal distribution of the filter-bank response as the "quantitative definition" of a texton. In contrast, textons are defined in [120] as the representation that results from codebook generation over frequency histograms.
Motion-based analysis and modeling of dynamic textures has been the subject of a large body of studies. This is because motion can be considered a very important visual cue, and also because the dynamic texture signal is mostly governed by motion statistics.
To elaborate on motion analysis, let us start with the basic assumption that we have an image patch I(x, y, t) at spatial position (x, y) and time t, and that this patch appears in the next frame shifted by (\Delta x, \Delta y). Mathematically:

I(x, y, t) = I(x + \Delta x, y + \Delta y, t + \Delta t)    (3)

Expanding the right-hand side as a Taylor series gives:

I(x + \Delta x, y + \Delta y, t + \Delta t) = I(x, y, t) + I_x \Delta x + I_y \Delta y + I_t \Delta t + \text{higher-order terms}    (4)

where I_x^{(n)}, I_y^{(n)} and I_t^{(n)} are the nth-order partial derivatives with respect to x, y and t. The equation can be further simplified by neglecting the terms of order higher than one, and it becomes:

I_x V_x + I_y V_y = -I_t    (5)
where V_x, V_y are the velocities in the x and y directions (V_x = \Delta x / \Delta t, and so on).
The solution of Eq. (5) is known as optical flow. However, further constraints are needed to solve the equation because of the number of unknowns. One of the constraints is smoothness, in which a patch is assumed to move with the same direction and speed between two frames. This is not usually the case for dynamic textures, in which the content can change considerably within a short time interval. Accordingly, there exists another formulation of the brightness constancy assumption that does not require the analytical solution. It is known as the normal flow: a flow vector that is normal to the spatial contours (parallel to the spatial gradient), whose amplitude is proportional to the temporal derivative. Mathematically, it is expressed as:

NF = \frac{I_t}{\sqrt{I_x^2 + I_y^2}} \, \bar{N}    (6)
where \Delta x_x and \Delta y_y are the partial derivatives of the shifts in x and y. Comparing this equation to Eq. (3), the model allows the brightness I to change over time to better cover the dynamic change inherent in dynamic textures. The model
has been used for detecting dynamic textures [3], in which regions satisfying this assumption are considered as dynamic textures. However, further extensions of this idea were not found.
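A minimal sketch of computing the normal flow of Eq. (6) from two consecutive frames is given below; the finite-difference derivatives are an illustrative choice.

import numpy as np

def normal_flow(frame0, frame1, eps=1e-6):
    """Normal flow for a pair of consecutive frames: signed magnitude
    I_t / sqrt(I_x^2 + I_y^2) and the unit spatial-gradient direction."""
    f0 = frame0.astype(np.float64)
    f1 = frame1.astype(np.float64)
    Ix = np.gradient(f0, axis=1)          # spatial derivative along x
    Iy = np.gradient(f0, axis=0)          # spatial derivative along y
    It = f1 - f0                          # temporal derivative
    grad_mag = np.sqrt(Ix**2 + Iy**2)
    magnitude = It / (grad_mag + eps)
    nx, ny = Ix / (grad_mag + eps), Iy / (grad_mag + eps)
    return magnitude, nx, ny

# Statistics of the normal flow field (mean magnitude, orientation histogram, ...)
# can then serve as features to characterize a dynamic texture.
f0, f1 = np.random.rand(64, 64), np.random.rand(64, 64)
mag, nx, ny = normal_flow(f0, f1)
print(mag.mean())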
4.5 Others
Along with the aforementioned models, there exist other approaches that cannot be straightforwardly put in one category. This is because the research on texture similarity is quite mature, but still very active.
One major approach for modeling texture and expressing similarity is fractal analysis. It can be simply understood as an analysis of measurements at different scales, which in turn reveals the relationship between them. For images, this can be implemented by measuring the energies of a Gaussian filter at different scales. The relationship is expressed in terms of the fractional dimension. Recent approaches to fractal analysis can be found in [126–128].
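The sketch below illustrates this idea under simple assumptions: it measures the energy of the image detail removed by Gaussian smoothing at increasing scales and estimates a scaling exponent as the slope of a log-log fit; the definition of energy and the scale range are illustrative choices, not a specific published estimator.

import numpy as np
from scipy.ndimage import gaussian_filter

def scaling_exponent(img, scales=(1, 2, 4, 8, 16)):
    """Estimate a fractal-like scaling exponent from multi-scale Gaussian energies."""
    img = img.astype(np.float64)
    energies = []
    for s in scales:
        detail = img - gaussian_filter(img, sigma=s)   # detail lost at scale s
        energies.append(np.mean(detail**2))
    # Slope of log(energy) versus log(scale) characterizes the scaling behavior.
    slope, _ = np.polyfit(np.log(scales), np.log(energies), 1)
    return slope

texture = np.random.rand(128, 128)
print(scaling_exponent(texture))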
Another notable approach is to use self-avoiding walks. Here, a traveler walks through the video pixels following a specified rule, with a memory storing the last steps. A histogram of the walks is then computed and used as a feature to characterize the texture (cf. [37, 38]).
Beside these, there also exist models that are based on the physical behavior of textures (especially dynamic textures). This includes models for fire [24], smoke [7] and water [70].
Although these models suit specific textural phenomena very well, they cannot be considered perceptual ones. This is because they are not meant to mimic the visual processing, but rather the physical source. For this reason, they are out of the scope of this book chapter.
After reviewing several approaches for assessing texture similarity (Sect. 4), the fundamental question here is how to compare these approaches and establish a benchmark platform in order to differentiate the behavior of each approach. This is of course not straightforward, and a reasonable construction of ground truth data is required.
Broadly speaking, comparison can be performed either subjectively or objectively, in other words, either by involving observers in a kind of psycho-physical test, or by testing the performance of the similarity approaches on a pre-labeled dataset. Both have advantages and disadvantages, which are explained here.
The subjective comparison is generally considered the most reliable one, because it directly deals with human judgment of similarity. However, several problems can be encountered in such a methodology. First, the selection and accuracy of the psycho-physical test: for example, a binary test can be the simplest for the subjects and would result in very accurate results, but such a test can be very slow to cover all the test conditions and might therefore not be suitable. Second, the budget and time limitations of subjective tests result in limited testing material. Thus, it is practically unfeasible to perform a large-scale comparison with subjective testing.
Accordingly, there exist few studies on the subjective evaluation of texture similarity models. For example, the subjective quality of synthesized textures was assessed and predicted in [42, 109], and an adaptive selection among synthesis algorithms was provided in [121]. The correlation of similarity metrics with subjective evaluation was also assessed in [5, 139].
As explained earlier, subjective evaluation suffers from test accuracy and budget-time limitations. One can also add the problem of irreproducibility, in which the subjective test results cannot be reproduced when the subjective test is repeated. There is also a certain amount of uncertainty in the results, which is usually reported in terms of confidence levels. To counter this, research in computer vision is usually led by objective evaluations.
leaded by objective evaluations.
One commonly used benchmarking procedure is to test the performance on
recognition task. For static textures, two large datasets of 425 and 61 homogeneous
texture images are cropped into 128x128 images with substantial point-wise
differences [140]. The common test is to perform a retrieval test, in which for a test
image if the retrieved image is from the correct image source then it is considered
as correct retrieval. This is performed for all of the images in the dataset, and the
retrieval rate is considered as the criteria to compare different similarity measure
approaches. For example, Table 1 provides the information about the performance
of different metrics. In this table, one can easily observe that simple point-wise
comparison metric like the Peak Signal to Noise Ratio (PSNR) provides the worst
performance.
For dynamic textures, a similar task is defined. Commonly, the task consists of classification on three datasets: the UCLA [100], DynTex [93] and DynTex++ [36] datasets. For each dataset, the same test conditions are commonly used. For example, DynTex++ contains 36 classes, each with 100 exemplar sequences. The test condition is to randomly assign 50% of the data for training and the rest for testing. The training data are used to train the models, and the recognition rate is reported on the test data. The procedure is repeated 20 times and the average value is retained. This is shown in Table 2.
As mentioned in Sect. 3.3, bottom-up approaches try to reproduce the neural processing of the human visual system. We have seen many transform-based models (Sect. 4.1) that showed good performance for measuring texture similarity. These models can also be used in an image/video compression scenario, such that the compression algorithm is tuned to provide the best rate-similarity trade-off instead of rate-distortion. By doing so, the compression relies on a perceptual similarity measure rather than a computational distortion metric. Consequently, this can perceptually enhance the compression performance.
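Conceptually, this amounts to replacing the distortion term in the usual rate-distortion cost used for mode decision by a dissimilarity derived from a perceptual metric. The fragment below is a schematic illustration of this idea only: perceptual_similarity stands for any texture similarity metric such as STSIM or LRI, and encode for any candidate coding operation; neither is an HEVC API.

def rate_similarity_mode_decision(block, candidate_modes, encode, perceptual_similarity, lam):
    """Choose the coding mode minimizing J = (1 - similarity) + lambda * rate.
    `encode(block, mode)` is assumed to return (reconstructed_block, rate_in_bits);
    `perceptual_similarity(a, b)` returns a value in [0, 1] (1 = identical)."""
    best_mode, best_cost = None, float("inf")
    for mode in candidate_modes:
        reconstructed, rate = encode(block, mode)
        dissimilarity = 1.0 - perceptual_similarity(block, reconstructed)
        cost = dissimilarity + lam * rate       # rate-similarity Lagrangian cost
        if cost < best_cost:
            best_mode, best_cost = mode, cost
    return best_mode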
In our previous studies [71, 73, 74], we used perceptual distortion metrics inside the state-of-the-art video compression standard known as High Efficiency Video Coding (HEVC [106]) and evaluated their performance. We used the two metrics STSIM and LRI (Sects. 4.1 and 4.3) as distortion measures
Fig. 9 Examples of decoded textures using the same QP. From left to right: Original texture,
compressed using HEVC with default metrics, with STSIM and with LRI
Fig. 10 Rate Distortion (using Gabor distance metric [67]) of the textures shown in Fig. 9. x-axes:
Bytes used to encode the texture, y-axes: distance to the original texture
not necessary to encode the patch but rather its copy index. The visual comparison
showed some point-wise difference, but high overall similarity.
Fig. 11 Dynamic texture synthesis approach for alternative frames [112]. E is a decoded picture and S is a synthesized one
Fig. 12 An example of visual comparison between the default compression and the method proposed in [112]. Left: original frame, middle: compressed frame with HEVC, right: synthesized frame at the decoder side
Fig. 14 Compressed texture with QP=27. Left: default encoder, right: LTS. Bitrate saving=9.756%
contents in B0, and will then select the block Bj such that Bj has the minimum rate and distortion. Thus, the algorithm tries to replace the contents while encoding by visually similar ones, such that the contents become easier to encode.
An example comparing the behavior of LTS against HEVC is shown in Fig. 14. Due to the simplification of the contents in LTS, one can achieve about 10% bitrate saving. On the other hand, there are also some visual artifacts due to this simplification. By carefully examining the differences in Fig. 14, we can see that some of the wall boundaries are eliminated by LTS. This is because encoding an edge costs more than a flat area, and thus LTS chooses to replace the edge by another possible synthesis that is easier to encode.
machine learning based estimation of the curve is performed and used to provide an improved compression result.
The other indirect use of texture similarity measures is to exploit the analysis tools and features from that domain in image and video compression. For example, in [76], the visual redundancies of dynamic textures are predicted by a set of features, such as the normal flow and the gray-level co-occurrence matrix. Similarly, the optimal rate-distortion parameter (the Lagrangian multiplier) can be predicted [63].
Beside texture synthesis based coding, there also exist several studies on perceptually optimizing the encoder based on texture properties. These studies generally fall into the category of noise shaping, where the coding noise (compression artifact) is distributed so as to minimize the perceived distortions. Examples can be found in [60, 107, 125, 129, 130]. Besides, textures are considered as non-salient areas, and less bitrate is spent on them [43, 44].
7 Conclusion
Acknowledgements This work was supported by the Marie Sklodowska-Curie under the
PROVISION (PeRceptually Optimized VIdeo CompresSION) project bearing Grant Number
608231 and Call Identifier: FP7-PEOPLE-2013-ITN.
References
1. Abraham, B., Camps, O.I., Sznaier, M.: Dynamic texture with fourier descriptors. In:
Proceedings of the 4th International Workshop on Texture Analysis and Synthesis, pp. 53–58
(2005)
2. Amadasun, M., King, R.: Textural features corresponding to textural properties. IEEE Trans.
Syst. Man Cybern. 19(5), 1264–1274 (1989)
3. Amiaz, T., Fazekas, S., Chetverikov, D., Kiryati, N.: Detecting regions of dynamic texture.
In: Scale Space and Variational Methods in Computer Vision, pp. 848–859. Springer, Berlin
(2007)
4. Bao, Z., Xu, C., Wang, C.: Perceptual auto-regressive texture synthesis for video coding.
Multimedia Tools Appl. 64(3), 535–547 (2013)
5. Ballé, J.: Subjective evaluation of texture similarity metrics for compression applications. In:
Picture Coding Symposium (PCS), 2012, pp. 241–244. IEEE, New York (2012)
6. Barcelo, A., Montseny, E., Sobrevilla, P.: Fuzzy texture unit and fuzzy texture spectrum for
texture characterization. Fuzzy Sets Syst. 158(3), 239–252 (2007)
7. Barmpoutis, P., Dimitropoulos, K., Grammalidis, N.: Smoke detection using spatio-temporal
analysis, motion modeling and dynamic texture recognition. In: 2013 Proceedings of the
22nd European Signal Processing Conference (EUSIPCO), pp. 1078–1082. IEEE, New York
(2014)
8. Beck, J.: Textural segmentation, second-order statistics, and textural elements. Biol. Cybern.
48(2), 125–130 (1983)
9. Bosch, M., Zhu, F., Delp, E.J.: An overview of texture and motion based video coding at
Purdue University. In: Picture Coding Symposium, 2009. PCS 2009, pp. 1–4. IEEE, New
York (2009)
10. Bradley, D.C., Goyal, M.S.: Velocity computation in the primate visual system. Nature Rev.
Neurosci. 9(9), 686–695 (2008)
11. Caenen, G., Van Gool, L.: Maximum response filters for texture analysis. In: Conference on
Computer Vision and Pattern Recognition Workshop, 2004. CVPRW’04, pp. 58–58. IEEE,
New York (2004)
12. Campbell, N., Dalton, C., Gibson, D., Oziem, D., Thomas, B.: Practical generation of
video textures using the auto-regressive process. Image Vis. Comput. 22(10), 819–827
(2004)
13. Chang, W.-H., Yang, N.-C., Kuo, C.-M., Chen, Y.-J., et al.: An efficient temporal texture
descriptor for video retrieval. In: Proceedings of the 6th WSEAS International Conference
on Signal Processing, Computational Geometry & Artificial Vision, pp. 107–112. World
Scientific and Engineering Academy and Society (WSEAS), Athens (2006)
14. Chen, J., Shan, S., He, C., Zhao, G., Pietikainen, M., Chen, X., Gao, W.: WLD: a robust local
image descriptor. IEEE Trans. Pattern Anal. Mach. Intell. 32(9), 1705–1720 (2010)
15. Chessa, M., Sabatini, S.P., Solari, F.: A systematic analysis of a V1–MT neural model for
motion estimation. Neurocomputing 173, 1811–1823 (2016)
16. Chetverikov, D., Péteri, R.: A brief survey of dynamic texture description and recognition. In:
Computer Recognition Systems, pp. 17–26. Springer, Berlin (2005)
17. Chubach, O., Garus, P., Wien, M.: Motion-based analysis and synthesis of dynamic textures.
In: Proceedings of International Picture Coding Symposium PCS ’16, Nuremberg. IEEE,
Piscataway (2016)
18. Costantini, R., Sbaiz, L., Süsstrunk, S.: Higher order SVD analysis for dynamic texture
synthesis. IEEE Trans. Image Process. 17(1), 42–52 (2008)
19. Crivelli, T., Cernuschi-Frias, B., Bouthemy, P., Yao, J.-F.: Motion textures: modeling,
classification, and segmentation using mixed-state Markov random fields. SIAM J. Image.
Sci. 6(4), 2484–2520 (2013)
20. David, S.V., Vinje, W.E., Gallant, J.L.: Natural stimulus statistics alter the receptive field
structure of V1 neurons. J. Neurosci. 24(31), 6991–7006 (2004)
21. Derpanis, K.G., Wildes, R.P.: Dynamic texture recognition based on distributions of
spacetime oriented structure. In: 2010 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 191–198. IEEE, New York (2010)
22. Derpanis, K.G., Wildes, R.P.: Spacetime texture representation and recognition based on a
spatiotemporal orientation analysis. IEEE Trans. Pattern Anal. Mach. Intell. 34(6), 1193–
1205 (2012)
23. Derpanis, K.G., Sizintsev, M., Cannons, K.J., Wildes, R.P.: Action spotting and recognition
based on a spatiotemporal orientation analysis. IEEE Trans. Pattern Anal. Mach. Intell. 35(3),
527–540 (2013)
24. Dimitropoulos, K., Barmpoutis, P., Grammalidis, N.: Spatio-temporal flame modeling and
dynamic texture analysis for automatic video-based fire detection. IEEE Trans. Circ. Syst.
Video Technol. 25(2), 339–351 (2015). doi:10.1109/TCSVT.2014.2339592
25. Do, M.N., Vetterli, M.: Texture similarity measurement using Kullback-Leibler distance on
wavelet subbands. In: 2000 International Conference on Image Processing, 2000. Proceed-
ings, vol. 3, pp. 730–733. IEEE, New York (2000)
26. Do, M.N., Vetterli, M.: Wavelet-based texture retrieval using generalized Gaussian density and
Kullback-Leibler distance. IEEE Trans. Image Process. 11(2), 146–158 (2002)
27. Doretto, G., Soatto, S.: Editable dynamic textures. In: 2003 IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, 2003. Proceedings, pp. II–137,
vol. 2. IEEE, New York (2003)
28. Doretto, G., Soatto, S.: Modeling dynamic scenes: an overview of dynamic textures. In:
Handbook of Mathematical Models in Computer Vision, pp. 341–355. Springer, Berlin (2006)
29. Doretto, G., Chiuso, A., Wu, Y.N., Soatto, S.: Dynamic textures. Int. J. Comput. Vis. 51(2),
91–109 (2003)
30. Dubois, S., Péteri, R., Ménard, M.: A comparison of wavelet based spatio-temporal decompo-
sition methods for dynamic texture recognition. In: Pattern Recognition and Image Analysis,
pp. 314–321. Springer, Berlin (2009)
31. Dumitras, A., Haskell, B.G.: A texture replacement method at the encoder for bit-rate
reduction of compressed video. IEEE Trans. Circuits Syst. Video Technol. 13(2), 163–175
(2003)
32. Fan, G., Xia, X.-G.: Wavelet-based texture analysis and synthesis using hidden Markov
models. IEEE Trans. Circuits Syst. I, Fundam. Theory Appl. 50(1), 106–120 (2003)
33. Fazekas, S., Chetverikov, D.: Dynamic texture recognition using optical flow features and
temporal periodicity. In: International Workshop on Content-Based Multimedia Indexing,
2007. CBMI’07, pp. 25–32. IEEE, New York (2007)
34. Fazekas, S., Amiaz, T., Chetverikov, D., Kiryati, N.: Dynamic texture detection based on
motion analysis. Int. J. Comput. Vis. 82(1), 48–63 (2009)
35. Ghadekar, P., Chopade, N.: Nonlinear dynamic texture analysis and synthesis model. Int.
J. Recent Trends Eng. Technol. 11(2), 475–484 (2014)
36. Ghanem, B., Ahuja, N.: Maximum margin distance learning for dynamic texture recognition.
In: European Conference on Computer Vision, pp. 223–236. Springer, Berlin (2010)
37. Goncalves, W.N., Bruno, O.M.: Dynamic texture analysis and segmentation using determin-
istic partially self-avoiding walks. Expert Syst. Appl. 40(11), 4283–4300 (2013)
38. Goncalves, W.N., Bruno, O.M.: Dynamic texture segmentation based on deterministic
partially self-avoiding walks. Comput. Vis. Image Underst. 117(9), 1163–1174 (2013)
39. Gonçalves, W.N., Machado, B.B., Bruno, O.M.: Spatiotemporal Gabor filters: a new method
for dynamic texture recognition (2012). arXiv preprint arXiv:1201.3612
40. Grill-Spector, K., Malach, R.: The human visual cortex. Annu. Rev. Neurosci. 27, 649–677
(2004)
41. Grossberg, S., Mingolla, E., Pack, C.: A neural model of motion processing and visual
navigation by cortical area MST. Cereb. Cortex 9(8), 878–895 (1999)
42. Guo, Y., Zhao, G., Zhou, Z., Pietikainen, M.: Video texture synthesis with multi-frame
LBP-TOP and diffeomorphic growth model. IEEE Trans. Image Process. 22(10), 3879–3891
(2013)
43. Hadizadeh, H.: Visual saliency in video compression and transmission. Ph.D. Dissertation,
Applied Sciences: School of Engineering Science (2013)
44. Hadizadeh, H., Bajic, I.V.: Saliency-aware video compression. IEEE Trans. Image Process.
23(1), 19–33 (2014)
45. Haindl, M., Filip, J.: Visual Texture: Accurate Material Appearance Measurement, Represen-
tation and Modeling. Springer Science & Business Media, London (2013)
46. He, D.-C., Wang, L.: Texture unit, texture spectrum, and texture analysis. IEEE Trans. Geosci.
Remote Sens. 28(4), 509–512 (1990)
47. He, D.-C., Wang, L.: Simplified texture spectrum for texture analysis. J. Commun. Comput.
7(8), 44–53 (2010)
48. Hubel, D.H., Wiesel, T.N.: Receptive fields and functional architecture of monkey striate
cortex. J. Physiol. 195(1), 215–243 (1968)
49. Jin, G., Zhai, Y., Pappas, T.N., Neuhoff, D.L.: Matched-texture coding for structurally lossless
compression. In: 2012 19th IEEE International Conference on Image Processing (ICIP),
pp. 1065–1068. IEEE, New York (2012)
50. Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG 16 WP 3 and ISO/IEC
JTC 1/SC 29/WG 11: High Efficiency Video Coding (HEVC) Test Model 16 (HM 16) Encoder
Description. Technical Report (2014)
51. Julesz, B.: Visual pattern discrimination. IRE Trans. Inf. Theory 8(2), 84–92 (1962)
52. Julesz, B.: Textons, the elements of texture perception, and their interactions. Nature
290(5802), 91–97 (1981)
53. Julész, B., Gilbert, E., Shepp, L., Frisch, H.: Inability of humans to discriminate between
visual textures that agree in second-order statistics-revisited. Perception 2(4), 391–405 (1973)
54. Julesz, B., Gilbert, E., Victor, J.D.: Visual discrimination of textures with identical third-order
statistics. Biol. Cybern. 31(3), 137–140 (1978)
55. Khandelia, A., Gorecha, S., Lall, B., Chaudhury, S., Mathur, M.: Parametric video com-
pression scheme using AR-based texture synthesis. In: Sixth Indian Conference on Com-
puter Vision, Graphics & Image Processing, 2008. ICVGIP’08. IEEE, New York (2008),
pp. 219–225
56. Kwatra, V., Essa, I., Bobick, A., Kwatra, N.: Texture optimization for example-based
synthesis. In: ACM Transactions on Graphics (TOG), vol. 24(3), pp. 795–802. ACM, New
York (2005)
57. Landy, M.S.: Texture Analysis and Perception. The New Visual Neurosciences, pp. 639–652.
MIT, Cambridge (2013)
58. Landy, M.S., Graham, N.: Visual perception of texture. Vis. Neurosci. 2, 1106–1118 (2004)
59. Li, Y., Wang, T., Shum, H.-Y.: Motion texture: a two-level statistical model for character
motion synthesis. In: ACM Transactions on Graphics (ToG), vol. 21(3), pp. 465–472. ACM,
New York (2002)
60. Liu, M., Lu, L.: An improved rate control algorithm of H.264/AVC based on human visual
system. In: Computer, Informatics, Cybernetics and Applications, pp. 1145–1151. Springer,
Berlin (2012)
61. Liu, X., Wang, D.: A spectral histogram model for texton modeling and texture discrimina-
tion. Vis. Res. 42(23), 2617–2634 (2002)
62. Liu, L., Fieguth, P., Guo, Y., Wang, X., Pietikäinen, M.: Local binary features for texture
classification: taxonomy and experimental study. Pattern Recogn. 62, 135–160 (2017)
63. Ma, C., Naser, K., Ricordel, V., Le Callet, P., Qing, C.: An adaptive lagrange multiplier
determination method for dynamic texture in HEVC. In: IEEE International Conference on
Consumer Electronics China. IEEE, New York (2016)
64. Maggioni, M., Jin, G., Foi, A., Pappas, T.N.: Structural texture similarity metric based on
intra-class variances. In: 2014 IEEE International Conference on Image Processing (ICIP),
pp. 1992–1996. IEEE, New York (2014)
65. Malik, J., Perona, P.: Preattentive texture discrimination with early vision mechanisms. JOSA
A 7(5), 923–932 (1990)
66. Maloney, L.T., Yang, J.N.: Maximum likelihood difference scaling. J. Vis. 3(8), 5 (2003)
67. Manjunath, B.S., Ma, W.-Y.: Texture features for browsing and retrieval of image data. IEEE
Trans. Pattern Anal. Mach. Intell. 18(8), 837–842 (1996)
68. Medathati, N.K., Chessa, M., Masson, G., Kornprobst, P., Solari, F.: Decoding mt motion
response for optical flow estimation: an experimental evaluation. Ph.D. Dissertation, INRIA
Sophia-Antipolis, France; University of Genoa, Genoa, Italy; INT la Timone, Marseille,
France; INRIA (2015)
69. Montoya-Zegarra, J.A., Leite, N.J., da S Torres, R.: Rotation-invariant and scale-invariant
steerable pyramid decomposition for texture image retrieval. In: SIBGRAPI 2007. XX
Brazilian Symposium on Computer Graphics and Image Processing, 2007, pp. 121–128.
IEEE, New York (2007)
70. Narain, R., Kwatra, V., Lee, H.-P., Kim, T., Carlson, M., Lin, M.C.: Feature-guided dynamic
texture synthesis on continuous flows,. In: Proceedings of the 18th Eurographics conference
on Rendering Techniques, pp. 361–370. Eurographics Association, Geneva (2007)
71. Naser, K., Ricordel, V., Le Callet, P.: Experimenting texture similarity metric STSIM for
intra prediction mode selection and block partitioning in HEVC. In: 2014 19th International
Conference on Digital Signal Processing (DSP), pp. 882–887. IEEE, New York (2014)
72. Naser, K., Ricordel, V., Le Callet, P.: Local texture synthesis: a static texture coding algorithm
fully compatible with HEVC. In: 2015 International Conference on Systems, Signals and
Image Processing (IWSSIP), pp. 37–40. IEEE, New York (2015)
73. Naser, K., Ricordel, V., Le Callet, P.: Performance analysis of texture similarity metrics in
HEVC intra prediction. In: Video Processing and Quality Metrics for Consumer Electronics
(VPQM) (2015)
74. Naser, K., Ricordel, V., Le Callet, P.: Texture similarity metrics applied to HEVC intra predic-
tion. In: The Third Sino-French Workshop on Information and Communication Technologies,
SIFWICT 2015 (2015)
75. Naser, K., Ricordel, V., Le Callet, P.: A foveated short term distortion model for perceptually
optimized dynamic textures compression in HEVC. In: 32nd Picture Coding Symposium
(PCS). IEEE, New York (2016)
76. Naser, K., Ricordel, V., Le Callet, P.: Estimation of perceptual redundancies of HEVC
encoded dynamic textures. In: 2016 Eighth International Conference on Quality of Multi-
media Experience (QoMEX), pp. 1–5. IEEE, New York (2016)
77. Naser, K., Ricordel, V., Le Callet, P.: Modeling the perceptual distortion of dynamic textures
and its application in HEVC. In: 2016 IEEE International Conference on Image Processing
(ICIP), pp. 3787–3791. IEEE, New York (2016)
78. Ndjiki-Nya, P., Wiegand, T.: Video coding using texture analysis and synthesis. In: Proceed-
ings of Picture Coding Symposium, Saint-Malo (2003)
79. Ndjiki-Nya, P., Makai, B., Blattermann, G., Smolic, A., Schwarz, H., Wiegand, T.: Improved
H.264/AVC coding using texture analysis and synthesis. In: 2003 International Conference on
Image Processing, 2003. ICIP 2003. Proceedings, vol. 3, pp. III–849. IEEE, New York (2003)
80. Ndjiki-Nya, P., Hinz, T., Smolic, A., Wiegand, T.: A generic and automatic content-based
approach for improved H.264/MPEG4-AVC video coding. In: IEEE International Conference
on Image Processing, 2005. ICIP 2005, vol. 2, pp. II–874. IEEE, New York (2005)
81. Ndjiki-Nya, P., Bull, D., Wiegand, T.: Perception-oriented video coding based on texture
analysis and synthesis. In: 2009 16th IEEE International Conference on Image Processing
(ICIP), pp. 2273–2276. IEEE, New York (2009)
82. Nelson, R.C., Polana, R.: Qualitative recognition of motion using temporal texture. CVGIP:
Image Underst. 56(1), 78–89 (1992)
83. Nishimoto, S., Gallant, J.L.: A three-dimensional spatiotemporal receptive field model
explains responses of area MT neurons to naturalistic movies. J. Neurosci. 31(41), 14551–
14564 (2011)
84. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant
texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 24(7),
971–987 (2002)
85. Ontrup, J., Wersing, H., Ritter, H.: A computational feature binding model of human texture
perception. Cogn. Process. 5(1), 31–44 (2004)
86. Oxford Dictionaries. [Online]. Available: http://www.oxforddictionaries.com
87. Pack, C., Grossberg, S., Mingolla, E.: A neural model of smooth pursuit control and motion
perception by cortical area MST. J. Cogn. Neurosci. 13(1), 102–120 (2001)
88. Pappas, T.N., Neuhoff, D.L., de Ridder, H., Zujovic, J.: Image analysis: focus on texture
similarity. Proc. IEEE 101(9), 2044–2057 (2013)
89. Peh, C.-H., Cheong, L.-F.: Synergizing spatial and temporal texture. IEEE Trans. Image
Process. 11(10), 1179–1191 (2002)
90. Perrone, J.A.: A visual motion sensor based on the properties of V1 and MT neurons. Vision
Res. 44(15), 1733–1755 (2004)
91. Perry, C.J., Fallah, M.: Feature integration and object representations along the dorsal stream
visual hierarchy. Front. Comput. Neurosci. 8, 84 (2014)
92. Péteri, R., Chetverikov, D.: Dynamic texture recognition using normal flow and texture
regularity. In: Pattern Recognition and Image Analysis, pp. 223–230. Springer, Berlin (2005)
93. Péteri, R., Fazekas, S., Huiskes, M.J.: Dyntex: a comprehensive database of dynamic textures.
Pattern Recogn. Lett. 31(12), 1627–1632 (2010)
94. Pollen, D.A., Ronner, S.F.: Visual cortical neurons as localized spatial frequency filters. IEEE
Trans. Syst. Man Cybern. SMC-13(5), 907–916 (1983)
95. Portilla, J., Simoncelli, E.P.: A parametric texture model based on joint statistics of complex
wavelet coefficients. Int. J. Comput. Vis. 40(1), 49–70 (2000)
96. Rahman, A., Murshed, M.: Real-time temporal texture characterisation using block-based
motion co-occurrence statistics. In: International Conference on Image Processing (2004)
97. Rahman, A., Murshed, M.: A motion-based approach for temporal texture synthesis. In:
TENCON 2005 IEEE Region 10, pp. 1–4. IEEE, New York (2005)
98. Rosenholtz, R.: Texture perception. Oxford Handbooks Online (2014)
99. Rust, N.C., Mante, V., Simoncelli, E.P., Movshon, J.A.: How MT cells analyze the motion of
visual patterns. Nature Neurosci. 9(11), 1421–1431 (2006)
100. Saisan, P., Doretto, G., Wu, Y.N., Soatto, S.: Dynamic texture recognition. In: CVPR 2001.
Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, 2001, vol. 2, pp. II–58. IEEE, New York (2001)
101. Simoncelli, E.P., Freeman, W.T., Adelson, E.H., Heeger, D.J.: Shiftable multiscale transforms.
IEEE Trans. Inf. Theory 38(2), 587–607 (1992)
102. Simoncelli, E.P., Heeger, D.J.: A model of neuronal responses in visual area MT. Vis. Res.
38(5), 743–761 (1998)
103. Smith, J.R., Lin, C.-Y., Naphade, M.: Video texture indexing using spatio-temporal wavelets.
In: 2002 International Conference on Image Processing. 2002. Proceedings, vol. 2, pp. II–437.
IEEE, New York (2002)
104. Soatto, S., Doretto, G., Wu, Y.N.: Dynamic textures. In: Eighth IEEE International
Conference on Computer Vision, 2001. ICCV 2001. Proceedings, vol. 2, pp. 439–446. IEEE,
New York (2001)
105. Solari, F., Chessa, M., Medathati, N.K., Kornprobst, P.: What can we expect from a V1-MT
feedforward architecture for optical flow estimation? Signal Process. Image Commun. 39,
342–354 (2015)
106. Sullivan, G.J., Ohm, J., Han, W.-J., Wiegand, T.: Overview of the high efficiency video coding
(HEVC) standard. IEEE Trans. Circuits Syst. Video Technol. 22(12), 1649–1668 (2012)
107. Sun, C., Wang, H.-J., Li, H., Kim, T.-H.: Perceptually adaptive Lagrange multiplier for rate-
distortion optimization in H.264. In: Future Generation Communication and Networking
(FGCN 2007), vol. 1, pp. 459–463. IEEE, New York (2007)
108. Sun, X., Yin, B., Shi, Y.: A low cost video coding scheme using texture synthesis. In: 2nd
International Congress on Image and Signal Processing, 2009. CISP’09, pp. 1–5. IEEE, New
York (2009)
109. Swamy, D.S., Butler, K.J., Chandler, D.M., Hemami, S.S.: Parametric quality assessment of
synthesized textures. In: Proceedings of Human Vision and Electronic Imaging (2011)
110. Tamura, H., Mori, S., Yamawaki, T.: Textural features corresponding to visual perception.
IEEE Trans. Syst. Man Cybern. 8(6), 460–473 (1978)
111. Thakur, U.S., Ray, B.: Image coding using parametric texture synthesis. In: 2016 IEEE 18th
International Workshop on Multimedia Signal Processing (MMSP), pp. 1–6 (2016)
112. Thakur, U., Naser, K., Wien, M.: Dynamic texture synthesis using linear phase shift inter-
polation. In: Proceedings of International Picture Coding Symposium PCS ’16, Nuremberg.
IEEE, Piscataway (2016)
113. Tiwari, D., Tyagi, V.: Dynamic texture recognition based on completed volume local binary
pattern. Multidim. Syst. Sign. Process. 27(2), 563–575 (2016)
114. Tiwari, D., Tyagi, V.: Dynamic texture recognition using multiresolution edge-weighted local
structure pattern. Comput. Electr. Eng. 11, 475–484 (2016)
115. Tiwari, D., Tyagi, V.: Improved Weber's law based local binary pattern for dynamic texture
recognition. Multimedia Tools Appl. 76, 1–18 (2016)
116. Tlapale, E., Kornprobst, P., Masson, G.S., Faugeras, O.: A neural field model for motion
estimation. In: Mathematical image processing, pp. 159–179. Springer, Berlin (2011)
117. Tuceryan, M., Jain, A.K.: Texture Analysis. The Handbook of Pattern Recognition and
Computer Vision, vol. 2, pp. 207–248 (1998)
118. Turner, M.R.: Texture discrimination by Gabor functions. Biol. Cybern. 55(2–3), 71–82
(1986)
119. Valaeys, S., Menegaz, G., Ziliani, F., Reichel, J.: Modeling of 2D+1 texture movies for video
coding. Image Vis. Comput. 21(1), 49–59 (2003)
120. van der Maaten, L., Postma, E.: Texton-based texture classification. In: Proceedings of
Belgium-Netherlands Artificial Intelligence Conference (2007)
121. Varadarajan, S., Karam, L.J.: Adaptive texture synthesis based on perceived texture regularity.
In: 2014 Sixth International Workshop on Quality of Multimedia Experience (QoMEX),
pp. 76–80. IEEE, New York (2014)
122. Wang, Y., Zhu, S.-C.: Modeling textured motion: particle, wave and sketch. In: Ninth IEEE
International Conference on Computer Vision, 2003. Proceedings, pp. 213–220. IEEE, New
York (2003)
123. Wang, L., Liu, H., Sun, F.: Dynamic texture classification using local fuzzy coding. In: 2014
IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 1559–1565. IEEE, New
York (2014)
124. Wei, L.-Y., Lefebvre, S., Kwatra, V., Turk, G.: State of the art in example-based texture syn-
thesis. In: Eurographics 2009, State of the Art Report, EG-STAR, pp. 93–117. Eurographics
Association, Geneva (2009)
125. Wong, C.-W., Au, O.C., Meng, B., Lam, K.: Perceptual rate control for low-delay video com-
munications. In: 2003 International Conference on Multimedia and Expo, 2003. ICME’03.
Proceedings, vol. 3, pp. III–361. IEEE, New York (2003)
126. Xu, Y., Quan, Y., Ling, H., Ji, H.: Dynamic texture classification using dynamic fractal
analysis. In: 2011 International Conference on Computer Vision, pp. 1219–1226. IEEE, New
York (2011)
127. Xu, Y., Huang, S., Ji, H., Fermüller, C.: Scale-space texture description on sift-like textons.
Comput. Vis. Image Underst. 116(9), 999–1013 (2012)
128. Xu, Y., Quan, Y., Zhang, Z., Ling, H., Ji, H.: Classifying dynamic textures via spatiotemporal
fractal analysis. Pattern Recogn. 48(10), 3239–3248 (2015)
129. Xu, L., et al.: Free-energy principle inspired video quality metric and its use in video coding.
IEEE Trans. Multimedia 18(4), 590–602 (2016)
130. Yu, H., Pan, F., Lin, Z., Sun, Y.: A perceptual bit allocation scheme for H.264. In: IEEE
International Conference on Multimedia and Expo, 2005. ICME 2005, p. 4. IEEE, New York
(2005)
131. Yuan, L., Wen, F., Liu, C., Shum, H.-Y.: Synthesizing dynamic texture with closed-loop linear
dynamic system. In: Computer Vision-ECCV 2004, pp. 603–616. Springer, Berlin (2004)
132. Zhai, Y., Neuhoff, D.L.: Rotation-invariant local radius index: a compact texture similarity
feature for classification. In: 2014 IEEE International Conference on Image Processing
(ICIP), pp. 5711–5715. IEEE, New York (2014)
133. Zhai, Y., Neuhoff, D.L., Pappas, T.N.: Local radius index-a new texture similarity feature. In:
2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
pp. 1434–1438. IEEE, New York (2013)
134. Zhang, F., Bull, D.R.: A parametric framework for video compression using region-based
texture models. IEEE J. Sel. Top. Sign. Proces. 5(7), 1378–1392 (2011)
135. Zhang, J., Tan, T.: Brief review of invariant texture analysis methods. Pattern Recogn. 35(3),
735–747 (2002)
136. Zhao, G., Pietikainen, M.: Dynamic texture recognition using local binary patterns with an
application to facial expressions. IEEE Trans. Pattern Anal. Mach. Intell. 29(6), 915–928
(2007)
137. Zhao, X., Reyes, M.G., Pappas, T.N., Neuhoff, D.L.: Structural texture similarity metrics for
retrieval applications. In: 15th IEEE International Conference on Image Processing, 2008.
ICIP 2008, pp. 1196–1199. IEEE, New York (2008)
138. Zujovic, J., Pappas, T.N., Neuhoff, D.L.: Structural similarity metrics for texture analysis and
retrieval. In: 2009 16th IEEE International Conference on Image Processing (ICIP). IEEE,
New York (2009)
139. Zujovic, J., Pappas, T.N., Neuhoff, D.L., van Egmond, R., de Ridder, H.: Subjective and
objective texture similarity for image compression. In: 2012 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), pp. 1369–1372. IEEE, New York
(2012)
140. Zujovic, J., Pappas, T.N., Neuhoff, D.L.: Structural texture similarity metrics for image
analysis and retrieval. IEEE Trans. Image Process. 22(7), 2545–2558 (2013)
Deep Saliency: Prediction of Interestingness
in Video with CNN
performs smooth pursuit of objects that are of interest to him. To learn more
about the history of the taxonomy of visual attention studies, we refer the reader
to the paper by Borji [1]. What is clear today is that any model trying to predict
human visual attention attractors in visual scenes needs to combine both bottom-up
and top-down components. Therefore, it is plausible that supervised machine
learning methods, which combine stimuli-driven feature measures with the capability
of prediction on the basis of previously seen data, will bring a satisfactory solution to this
complex problem. With the explosion of research on deep networks and their
proven efficiency, different models of visual attention have been proposed using this
supervised learning approach. Shen [42] proposed a deep learning model to extract
salient areas in images, which first learns the relevant characteristics of natural image
saliency and then predicts eye fixations on objects with semantic content. Simonyan [43]
defined a multi-class classification problem using a "task-dependent" visual experiment to
predict the saliency of image pixels. Vig [47] tackles the prediction of pixel saliency using
feature maps extracted from different architectures of a deep network. In [25], a
multi-resolution convolutional neural network model has been proposed using three
different scales of the raw images and the eye fixations as targets. In [22], three CNN
models are designed to predict saliency using a segmented input image. The authors
of [23, 34] propose to adopt an end-to-end solution, formulating saliency prediction as a
regression problem. In [24], a global saliency map is computed by summing all
intermediate saliency maps that are obtained by convolving the images with learned
filters and pooling their Gaussian-weighted responses at multiple scales. In [55],
class activation maps computed with average pooling were proposed to highlight the image
regions responsible for the predicted class. Deep Neural Network classifiers have become
the winners in the indexing of visual information; they show ever-increasing prediction
performance. This is why they have also become a methodological framework for the
prediction of saliency or interestingness of visual content.
In summary, this chapter makes the following contributions:
• To construct, from four benchmark datasets with ground-truth labels, the support for
studying the interestingness of areas in video frames.
• To incorporate the top-down “semantic” cues in the prediction of interestingness
in video, a Deep CNN architecture is proposed with a novel residual motion
feature.
A neural network is then a network whose nodes are formal neurons, and to define
a neural network, one needs to design its architecture (the number of hidden layers
and the number of nodes per layer, etc.) as well as to estimate its parameters once
the architecture is fixed. Figure 2 gives an example of such a network.
[Fig. 2: a formal neuron: the inputs x_1, ..., x_p are combined by a weighted sum and passed through an activation function f, producing the output y]
In order to extract the most important information for further analysis or exploitation
of image patches, a convolution with a fixed number of filters is applied. It is
necessary to determine the size of the convolution kernel to be applied to the input
image in order to highlight its areas. Two stages are conceptually necessary to create
a convolutional layer. The first refers to the convolution of the input image with
linear filters. The second consists in adding a bias term. Generally, the equation of
convolution can be written as (1):

$$X_j^l = f\Big(\sum_{i \in M_j} X_i^{l-1} * \omega_{ij}^l + B_j^l\Big) \qquad (1)$$

where $X_j^l$ is the $j$-th feature map of layer $l$, $M_j$ is the set of input maps connected to it, $\omega_{ij}^l$ are the convolution kernels, $B_j^l$ is the bias term, and $f$ is the activation function.
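As a concrete illustration, here is a minimal NumPy sketch of the forward pass of Eq. (1) for a single convolutional layer, using direct loops, a 'valid' convolution and ReLU as the activation f; the array shapes and names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def conv_layer_forward(X_prev, W, b, f=lambda z: np.maximum(z, 0.0)):
    """Eq. (1): X_j^l = f( sum_{i in M_j} X_i^{l-1} * w_ij^l + B_j^l ).

    X_prev : (n_in, H, W)        input feature maps X^{l-1}
    W      : (n_out, n_in, k, k) convolution kernels w_ij^l
    b      : (n_out,)            biases B_j^l
    Returns  (n_out, H-k+1, W-k+1) output maps X^l ('valid' convolution)."""
    n_in, H, Wd = X_prev.shape
    n_out, _, k, _ = W.shape
    out = np.zeros((n_out, H - k + 1, Wd - k + 1))
    for j in range(n_out):                            # one output map per filter bank j
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                patch = X_prev[:, y:y + k, x:x + k]   # all input maps i in M_j
                out[j, y, x] = np.sum(patch * W[j]) + b[j]
    return f(out)

# toy usage: 3 input maps of 8x8 pixels, 4 output maps, 3x3 kernels
rng = np.random.default_rng(0)
X0 = rng.standard_normal((3, 8, 8))
maps = conv_layer_forward(X0, rng.standard_normal((4, 3, 3, 3)), np.zeros(4))
print(maps.shape)  # (4, 6, 6)
```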
[Figure: layer stack of a convolutional network: Image, CONV, RELU, CONV, ..., FC]

Pooling reduces the computational complexity for the upper layers and summarizes
the outputs of neighboring groups of neurons from the same kernel map. It reduces
the size of each input feature map by the acquisition of a value for each receptive
field of neurons of the next layer. We use max-pooling, see Eq. (2):
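A standard formulation of max-pooling over a receptive field $R(x, y)$, which is presumably the form Eq. (2) takes up to notation, is

$$U_f^{pool}(x, y) = \max_{(x', y') \in R(x, y)} U_f(x', y') \qquad (2)$$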
The Rectified Linear Unit (ReLU for short) has become very popular in the last few
years. It computes the function $f(x) = \max(0, x)$. Thus, the activation is thresholded
at zero. It was found to accelerate the convergence of a very popular parameter
optimization method, stochastic gradient descent, compared to the sigmoid function.
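The network also uses a local normalization of the feature maps. A standard within-map local response normalization, consistent with the description that follows (sums over an $N \times N$ neighborhood, with strength controlled by $\alpha$ and $\beta$), would read

$$\tilde{U}_f^{x,y} = U_f^{x,y} \Big/ \Big(1 + \frac{\alpha}{N^2} \sum_{(x', y') \in \mathcal{N}(x, y)} \big(U_f^{x', y'}\big)^2\Big)^{\beta},$$

where $\mathcal{N}(x, y)$ denotes the $N \times N$ neighborhood of $(x, y)$; this is a sketch of the usual form, not necessarily the exact equation used by the authors.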
Here $U_f^{x,y}$ represents the value of the feature map at the $(x, y)$ coordinates, the sums
are taken in the neighborhood of $(x, y)$ of size $N \times N$, and $\alpha$ and $\beta$ regulate the
normalization strength.
Once the architecture of the network is fixed, the next step is to estimate its
parameters. In the next section, we explain how this can be done.
More formally, a sample $i$ from the training set is denoted $(x_1^i, x_2^i, \ldots, x_n^i, y^i)$ and
the response of the model is denoted $\hat{y}^i$.
There are many functions used to measure prediction errors. They are called
loss functions. A loss function quantifies the deviation of the output of
the model from the correct response. We are speaking here about “empirical loss”
functions [46], that is, the error computed on all available ground-truth training data.
Here we will briefly present one of them.
Back to the training set, the known response of each observation is encoded in a
one-hot label vector. More formally, given an observation $(x_1^i, x_2^i, \ldots, x_n^i, y^i)$, we
introduce a binary vector $L^i = (L_1^i, L_2^i, \ldots, L_k^i)$ such that if $y^i = c_j$ then $L_j^i = 1$ and
$\forall m \neq j,\ L_m^i = 0$. This is the function which ensures a “hard” coding of class labels.
2.2.2 Softmax
$$p_i = \frac{e^{y_i}}{\sum_{j=1}^{k} e^{y_j}} \qquad (4)$$
The softmax function is used in the last layer of multi-layer neural networks
which are trained under a cross-entropy regime (we will define this function in the next
paragraphs). When used for image recognition, the softmax computes the
estimated probabilities, for each input data point, of belonging to a class from a given
taxonomy.
2.2.3 Cross-Entropy
The cross-entropy loss function is expressed in terms of the result of the softmax
and the one-hot encoding. It is defined as follows:
$$D(S, L) = -\sum_{i=1}^{k} L_i \log(p_i) \qquad (5)$$
The definition of one-hot encoding and Eq. (5) mean that only the output of the
classifier corresponding to the correct class label is included in the cost.
50 S. Chaabouni et al.
To deal with the cross-entropy of the whole training set, we introduce the average
cross-entropy. This is simply the average value, over the whole set, of the cross-entropy
introduced in Eq. (5):

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} D(S^i, L^i) \qquad (6)$$
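The following NumPy sketch ties Eqs. (4)–(6) together: one-hot encoding, softmax, per-sample cross-entropy and its average over a batch. The toy scores and labels are illustrative only.

```python
import numpy as np

def one_hot(y, k):
    """Binary vectors L^i with L_j^i = 1 iff y^i = c_j (the 'hard' coding of labels)."""
    L = np.zeros((len(y), k))
    L[np.arange(len(y)), y] = 1.0
    return L

def softmax(scores):
    """Eq. (4): p_i = exp(y_i) / sum_j exp(y_j), row-wise and numerically stabilized."""
    z = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(P, L):
    """Eq. (5): D(S, L) = -sum_i L_i log(p_i); only the correct-class term contributes."""
    return -np.sum(L * np.log(P + 1e-12), axis=1)

# toy batch: 4 samples, 2 classes (salient / non-salient)
scores = np.array([[2.0, 0.5], [0.1, 1.3], [1.0, 1.0], [3.0, -1.0]])
labels = np.array([0, 1, 0, 0])
L = one_hot(labels, k=2)
P = softmax(scores)
avg_loss = cross_entropy(P, L).mean()     # Eq. (6): average cross-entropy over the set
print(np.round(P, 3), round(float(avg_loss), 3))
```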
The gradient descent algorithm is the simplest and most used algorithm to find the
parameters of the learning model under the assumption of convexity of the function
to minimize. There are mainly two versions of this algorithm: the first one acts
in a batch mode and the other in an on-line mode. The batch mode: when we aim
to minimize the loss function globally (this is why it is named batch), we first
initialize the parameters randomly and we iteratively minimize the loss function
by updating the parameters. This updating is done following the opposite direction
of the gradient of the loss function, which, locally, shows the highest slope of this
function. Hence, at iteration $t$, the new values of the weights $w^{(t+1)}$ are estimated
using the values of the weights at step $t$ and the gradient of the loss function
estimated at weight $w^{(t)}$:

$$\forall t \in \mathbb{N}, \quad w^{(t+1)} = w^{(t)} - \eta\, \nabla L\big(w^{(t)}\big), \qquad (7)$$

where $\eta \in \mathbb{R}^{+}$ is a positive real called the learning rate. One fundamental issue is how
to choose the learning rate. If this rate is too large, then we may obtain oscillations
around the minimum. If it is too small, then the convergence toward the minimum
will be too slow and in some cases it may never happen.
The on-line mode: when we are dealing with large sets of data, batch algorithms
are not useful anymore since they are not scalable. Much work has been done to
overcome this issue and to design on-line algorithms. These algorithms consider a
single example at each iteration and are shown to be more efficient in both time and
space complexity.
Among all the on-line algorithms, stochastic gradient descent (SGD for
short) is considered the most popular and the most used one. Many works have
proved its efficiency and its scalability.
The SGD algorithm is an iterative process which acts as follows: at each iteration
$t$, a training example $(x_1^t, x_2^t, \ldots, x_n^t, y^t)$ is chosen uniformly at random and is used to
update the weights following the opposite direction of the gradient of the loss function
on this example. The SGD algorithm belongs to the first-order methods, i.e., those that form
the parameter update on the basis of only first-order gradient information. First-order
methods, when used to solve convex optimization problems of large dimension, have been
shown to have a convergence speed that cannot be better than sublinear, of the order of
$t^{-1/2}$ [37], where $t$ is the number of iterations. This theoretical result implies that
first-order methods cannot be used to solve large-scale problems in an acceptable time
and with high accuracy.
Momentum is a method that helps accelerate SGD in the relevant direction. It
achieves this by adding a fraction of the update vector of the past time step to the
current update vector. The most popular is the method of Nesterov Momentum [32]:
$$\forall t \in \mathbb{N}, \quad y^{(t)} = w^{(t)} + \frac{t}{t+1}\big(w^{(t)} - w^{(t-1)}\big), \qquad w^{(t+1)} = y^{(t)} - \eta\, \nabla L\big(y^{(t)}\big). \qquad (8)$$
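A minimal NumPy sketch contrasting the two updates on a toy least-squares problem: plain SGD on a single random example per iteration (Eq. (7)) and the Nesterov-style momentum of Eq. (8). The learning rate and the momentum coefficient schedule are illustrative assumptions.

```python
import numpy as np

# toy least-squares problem: minimize mean (a_i . w - b_i)^2 over w
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 5))
b = A @ np.array([1.0, -2.0, 0.5, 3.0, -1.0]) + 0.01 * rng.standard_normal(200)

def grad_single(w, i):
    """Gradient of the squared error on a single training example i (used by SGD)."""
    return 2.0 * (A[i] @ w - b[i]) * A[i]

def grad_full(w):
    """Full-batch gradient (used here for the momentum variant)."""
    return 2.0 * A.T @ (A @ w - b) / len(b)

eta = 0.01   # learning rate (assumed)

# --- plain SGD: one uniformly chosen example per iteration, Eq. (7) on that sample ---
w = np.zeros(5)
for t in range(2000):
    i = rng.integers(len(b))
    w = w - eta * grad_single(w, i)

# --- Nesterov-style momentum, Eq. (8): look-ahead point, then gradient step there ---
w_prev = w_cur = np.zeros(5)
for t in range(500):
    y = w_cur + (t / (t + 1)) * (w_cur - w_prev)
    w_prev, w_cur = w_cur, y - eta * grad_full(y)

print(np.round(w, 2), np.round(w_cur, 2))   # both should approach [1, -2, 0.5, 3, -1]
```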
In data mining, noise has two main sources [56]. Different types of
measurement tools induce implicit errors that yield noisy labels in training data.
Besides, random errors introduced by experts or batch processes when the data
are gathered can produce noise as well. Data noise can adversely affect
the classification accuracy of classifiers trained on these data. In the study [33], four
supervised learners (naive Bayesian probabilistic classifier, the C4.5 decision tree,
the IBk instance-based learner and the SMO support vector machine) were selected
to compare the sensitivity with regard to different degrees of noise. A systematic
evaluation and analysis of the impact of class noise and attribute noise on the system
performance in machine learning was presented in [56].
The Deep CNNs use a stacking of different kinds of layers (convolution,
pooling, normalization, ...) that ensures the extraction of the features which lead to
the learning of the model. The training of deep CNN parameters is frequently done
with the stochastic gradient descent ‘SGD’ technique [16], see Sect. 2.2.5. For
simple supervised learning, the SGD method still remains the best learning algorithm
when the training set is large. With the wide propagation of convolutional neural
networks, and the massive labeled data needed to train them, studies
of the impact of noisy data were needed. A general framework to train CNNs with
only a limited number of clean labels and millions of noisy labels was introduced
in [49] in order to model the relationships between images, class labels and label
noises with a probabilistic graphical model and further integrate it into an end-to-end
deep learning system. In [39], substantial robustness to label noise of deep CNNs
was proposed using a generic way to handle noisy and incomplete labeling. This is
realized by augmenting the prediction objective with a notion of consistency.
The research focused on noise produced by random errors was published in [5].
It typically addresses a two-class classification problem: for each region in
an image/video plane it is necessary to give the confidence of being salient or not
for a human observer. One main contribution of this chapter is to identify how
data noise impacts the performance of deep networks in the problem of visual
saliency prediction. Here, to study the impact of the noise in ground truth labels,
two experiments on the large data set were conducted. In the first experiment, non-
salient windows were randomly selected in the image plane in a standard way,
just excluding the already selected salient windows. Nevertheless, in video, the dynamic
switching of attention to distractors or to smooth pursuit of moving objects makes
such a method fail. This policy of selection of non-salient areas yields random
errors. In the second experiment, the cinematographic production rule of thirds (3/3) was
used for non-salient patch selection, excluding the patches already defined as salient
areas in all the video frames and excluding the areas where the content producers
(photographers or cameramen) place important scene details. The results in [5] show
an increase in accuracy of the most efficient model of up to 8%, all other settings being
equal: the network architecture, optimization method, and input data configuration.
Transfer learning is a technique used in the field of machine learning that
increases the accuracy of learning, either across different tasks or within the
same task [52]. Training CNNs from scratch is relatively hard due to the insufficient
size of the available training datasets in real-world classification problems. Pre-training
a deep CNN and using it either as an initialization or as a fixed feature extractor is the heart
of the transfer method. Two famous scenarios of transfer learning with CNNs were
followed: (1) using the network as a fixed feature extractor after removing the last
fully connected layer; here only the linear classifier is trained on the new dataset;
(2) fine-tuning the weights of the pre-trained deep CNN by continuing the back-
propagation [52].
In the research of Bengio et al. [52] addressing the object recognition problem, the
authors show that the first layers of a Deep CNN learn characteristics similar to the
responses of Gabor filters regardless of the data set or task. Hence, in their transfer
learning scheme, just the first three convolutional layers already trained on one
training set are used as the initialization for parameter training on another training
set. The coefficients of deeper layers are left free for optimization, that is, they are
initialized randomly. Several studies have proven the power of this technique [31, 53].
Transfer learning with deep CNNs has shown its efficiency in different application domains,
such as saliency prediction [4] and person re-identification [8].
Now our question is how to predict the areas in natural video content which are of
interest to a human observer when he/she executes a free-viewing task on unknown
video content. Our task is to predict the interestingness “at a glance”; a precise
shape of the salient area in the image is not important. We still believe that the
“Rough Indexing Paradigm” [27], which means fast mining of visual information
with nearly pre-attentive vision, is of much interest in our era of “big” visual data.
Furthermore, such an AOI can be further used by object recognition methods. Hence
we consider a squared window in the image plane as the area-of-interest for a human
observer.
In order to train the model able to predict saliency of a given region in the image
plane, the training set has to be built to comprise salient and non-salient regions.
Salient regions-patches are selected on the basis of gaze fixation density maps which
are obtained during a psycho-visual experiment with cohorts of subjects. In this
work, the creation of the data set from available video databases, in order to train the
model with a Deep CNN, is realized under a specific policy that minimizes the noise
in the training data. The approach previously presented in [5] was followed.
Figure 4 below presents the group of salient and non-salient patches selected
under the proposed approach. The rows contain some examples taken from frames
of a set of video sequences “actioncliptrain” from the HOLLYWOOD1 data set. The
first line presents the map built on gaze fixations by the method of Wooding [48].
The second line describes the position of the selected patches: the yellow square is
the salient patch and the black one is labeled as non-salient patch. The third line
presents the group of salient patches on the left and non-salient patches on the right
for each frame.
1 Available at http://www.di.ens.fr/~laptev/actions/hollywood2/.
54 S. Chaabouni et al.
Fig. 4 Training data from HOLLYWOOD data set: (left) #frame176 of the ‘actioncliptrain299’,
(right) #frame210 of the ‘actioncliptrain254’
We define a squared patch $P$ of size $s \times s$ (in this work $s = 100$, adapted to the spatial
resolution of standard definition (SD) video data) in a video frame as a vector in
$\mathbb{R}^{s \times s \times n}$. Here $n$ stands for the quantity of primary feature maps serving as an input to
the deep CNN. If $n = 3$, just the color RGB planes are used as primary features in each
pixel. In the case when $n = 4$, the squared $L_2$ norm of the motion vector of each pixel,
normalized on the dataset, is added to the RGB planes as a new feature map. We define
patch “saliency” on the basis of its interest for subjects. The interest is measured by
the magnitude of a visual attention map built upon gaze fixations which are recorded
during a psycho-visual experiment using an eye-tracker. The fixation density maps
(FDM) are built by the method of Wooding [48]. Such a map $S(x, y)$ represents a
multi-Gaussian surface normalized by its global maximum.
A binary label is associated with the pixels $X$ of each patch $P_i$ using Eq. (9):
$$l(X) = \begin{cases} 1 & \text{if } S(x_{0,i}, y_{0,i}) \geq \theta_j \\ 0 & \text{otherwise} \end{cases} \qquad (9)$$
with $(x_{0,i}, y_{0,i})$ the coordinates of the patch center. A set of thresholds is selected,
starting from the global maximum value of the normalized FDM and then relaxing
the threshold values as in Eq. (10):

$$\theta_0 = \max\big(S(x, y), 0\big), \qquad \theta_{(j+1)} = \theta_j - \Delta\theta_j \qquad (10)$$
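A sketch of this labelling policy in NumPy: a patch is labelled salient when the fixation density map at its centre exceeds the current threshold, and the threshold is relaxed until enough salient patches are collected. The relaxation step and the stopping criterion are assumptions; only the general policy comes from the text.

```python
import numpy as np

def label_salient_patches(fdm, centers, n_wanted, delta=0.05):
    """fdm      : 2-D fixation density map S(x, y), normalized to a maximum of 1
       centers  : list of (x0, y0) patch centres
       n_wanted : minimal number of salient patches to collect
       Returns a {centre: label} dict with label 1 (salient) or 0, as in Eq. (9)."""
    theta = fdm.max()                       # theta_0 = max(S(x, y))
    labels = {c: 0 for c in centers}
    while theta > 0:
        for (x0, y0) in centers:
            if fdm[y0, x0] >= theta:        # Eq. (9): compare the centre value to the threshold
                labels[(x0, y0)] = 1
        if sum(labels.values()) >= n_wanted:
            break
        theta -= delta                      # Eq. (10): relax the threshold
    return labels

# toy usage: a synthetic Gaussian "fixation map" and a coarse grid of patch centres
yy, xx = np.mgrid[0:240, 0:320]
S = np.exp(-(((xx - 200) ** 2 + (yy - 120) ** 2) / (2 * 30.0 ** 2)))
grid = [(x, y) for x in range(50, 320, 100) for y in range(50, 240, 100)]
print(label_salient_patches(S, grid, n_wanted=2))
```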
Fig. 5 Policy of patch selection: an example and processing steps from GTEA data set
The ReLU operation is used due to its better performance in image classification
tasks compared to the sigmoid function, as it does not suppress high-frequency features.
The first pattern P1 is designed in such a manner that the ‘ReLU’ operation is introduced
after the ‘pooling’ one. In this research, the max-pooling operator was used. As the
‘pooling’ and ‘ReLU’ operations both compute a maximum, they are commutative.
Cascading ‘pooling’ before ‘ReLU’ can reduce the execution time, as the ‘pooling’ step
reduces the number of neurons or nodes (the ‘pooling’ operation is detailed in the
following section). In the two last patterns, stacking two convolutional layers before
the destructive pooling layer ensures the computation of more complex features that
will be more “expressive”.
In the ‘ChaboNet’ network, we used 32 kernels of size 12 × 12 for the
convolution layer of the first pattern P1. In the second pattern P2, 128 kernels
for each convolutional layer were used. In P2 the size of the kernels for the first
convolutional layer was chosen as 6 × 6 and for the second convolution layer a
kernel of 3 × 3 was used. Finally, 288 kernels of size 3 × 3 were used for
each convolution layer of the last pattern P3. Here we were inspired by the literature
[19, 42], where the size of the convolution kernels is either maintained constant or
decreases with the depth of the layers. This allows a progressive reduction of the highly
dimensional data before conveying them to the fully connected layers. The number
of convolution filters grows, on the contrary, to explore the richness of the
original data and highlight structural patterns. For the filter size, we made several
tests with the same values as in AlexNet [19], Shen’s network [42], LeNet [21],
Cifar [18] and finally, we retained a larger value of 12 × 12 in the first layer of the
pattern P1 as it yielded the best accuracy of prediction in our saliency classification
problem.
The kernel size of the pooling operation for both patterns P1 and P2 is set to
3 × 3. However, the pooling of the third pattern P3 is done with a size of 1 × 1.
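A hedged PyTorch sketch of this three-pattern layout is given below; the original model was implemented differently, and the padding, strides, fully connected sizes and the final two-class classifier are assumptions chosen only to make the sketch runnable on 100 × 100 × 4 input patches.

```python
import torch
import torch.nn as nn

class ChaboNetSketch(nn.Module):
    """Illustrative three-pattern layout (P1, P2, P3); not the authors' exact model."""

    def __init__(self, in_channels=4, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            # Pattern P1: one conv (32 kernels, 12x12), pooling placed *before* ReLU
            nn.Conv2d(in_channels, 32, kernel_size=12),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.ReLU(inplace=True),
            # Pattern P2: two stacked convs (128 kernels, 6x6 then 3x3), then 3x3 pooling
            nn.Conv2d(32, 128, kernel_size=6, padding=3),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.ReLU(inplace=True),
            # Pattern P3: two stacked convs (288 kernels, 3x3); a 1x1 pooling is a no-op, so it is omitted
            nn.Conv2d(128, 288, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(288, 288, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Infer the flattened feature size with a dummy forward pass (assumed 100x100 input)
        with torch.no_grad():
            n_flat = self.features(torch.zeros(1, in_channels, 100, 100)).numel()
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_flat, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, num_classes),   # soft-max is applied by the loss or at inference
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# toy usage: a batch of 8 patches of size 100x100 with RGB + residual-motion channels
model = ChaboNetSketch()
logits = model(torch.randn(8, 4, 100, 100))
probs = torch.softmax(logits, dim=1)       # probability of salient / non-salient
print(probs.shape)
```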
The generalization power of Deep CNN classifiers strongly depends on the quantity
of the data and on the coverage of data space in the training data set. In real-life
applications, e.g. prediction of benchmark models for studies of visual attention
of specific populations [6] or saliency prediction for visual quality assessment [2],
the database volumes are small. Hence, in order to predict saliency in these small
collections of videos, we use the transfer learning approach.
Our preliminary study on transfer learning, performed in the task of learning areas
that attract visual attention in natural video [4], showed the efficiency of weights
already learned on a large database for the training on small databases. In the
present work, we benchmark our transfer learning scheme designed for saliency
prediction against the popular approach proposed in [52]. The saliency prediction
task is different from the object recognition task. Thus our proposal is to initialize all
parameters in all layers of the network to train on a small data set with the best
trained model on a large data set. The following Eq. (13) expresses the transfer
of the classification knowledge obtained from the larger database to the new, smaller
database. Here the stochastic gradient descent with momentum is used, as in [16]:

$$V_{i+1} = m \cdot V_i - weightDecay \cdot \eta \cdot W_i - \eta \cdot \Big\langle \frac{\partial L}{\partial W}\Big|_{W_i} \Big\rangle_{D_i}, \qquad W_{i+1} = W_i + V_{i+1}, \quad W_0 = W_{n_0} \qquad (13)$$

with $m = 0.9$, $weightDecay = 0.00004$, and $W_{n_0}$ the parameters of the best model
pre-trained on the large data set. We set the initial value of the velocity $V_0$ to zero.
These parameter values are inspired by the values used in [16] and show
the best performances on a large training data set.
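In a modern framework, this transfer scheme amounts to loading the weights learned on the large database into an identically shaped network and continuing SGD with momentum and weight decay on the small target set, with all layers free to move. A sketch follows; the checkpoint file name, the learning rate and the hypothetical module holding the network class are placeholders, while the momentum and weight-decay values come from the text.

```python
import torch
from chabonet_sketch import ChaboNetSketch   # hypothetical module holding the class sketched above

model = ChaboNetSketch(in_channels=4, num_classes=2)

# Initialize *all* layers from the best model trained on the large (HOLLYWOOD) set,
# i.e. W_0 = W_{n_0} in Eq. (13); the checkpoint path is a placeholder.
state = torch.load("chabonet4k_hollywood_best.pth", map_location="cpu")
model.load_state_dict(state)

# SGD with momentum and weight decay as in Eq. (13): m = 0.9 and weightDecay = 4e-5
# come from the text; the learning rate is an assumed placeholder.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9, weight_decay=4e-5)
criterion = torch.nn.CrossEntropyLoss()

def fine_tune_step(batch, labels):
    """One update on the small target data set; every layer is free to adapt,
    in contrast with the scheme of [52] that re-trains only part of the network."""
    optimizer.zero_grad()
    loss = criterion(model(batch), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```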
If we have predicted, for each selected window in a given video frame, its “saliency”
or interest for a human observer, the natural question arises of how we can assess the
quality of this prediction. The quality of the trained model is evaluated on the validation
dataset when training the network. The accuracy gives the classification power of
the network for a given training iteration. But we cannot compare the classified
windows with a “manually” selected ground truth on a test set. First of all, it would
require a tedious annotation process, and secondly human annotation is not free of
errors: how to trace an “interesting window” of a given size? We are not focused
on any specific objects, hence this question would be difficult to answer for a human
annotator. His visual system instead gives the ground truth: humans fixate
areas which are of interest to them. Hence we now come back to image pixels in
order to be able to assess the quality of our prediction, comparing the saliency maps
predicted by the trained network with the Wooding maps built on gaze fixations.
Hence we will speak about Pixels of Interest (POI) or pixel-wise saliency maps.
The pixel-wise saliency map of each frame $F$ of the video is constructed using the
output value of the trained deep CNN model. The soft-max classifier, Eq. (4), which
takes the output of the third pattern P3 as input (see Fig. 6), gives the probability for
a patch of belonging to the salient class.
Hence, from each frame $F$ we select local regions having the same size as the training
patches (here $s = 100$). The output value of the soft-max classifier on each local
region defines the degree of saliency of this area. In the center of each local region a
Gaussian is applied, with a peak value of $10\,f(i)$ and a spread parameter $\sigma$ chosen as
a half-size of the patch. In this way, a sparse saliency map is predicted. If we slide
the patch on the input frame with a step of one pixel, then a dense saliency map
will be produced for the whole frame by the trained CNN. To avoid computational
overload, the sampling of windows to classify can be more sparse, e.g. with a stride of
5 pixels. Then the score values assigned to the centers are interpolated with Gaussians.
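A NumPy sketch of this map construction: windows are sampled with a given stride, scored by a classifier (here a dummy stand-in for the soft-max output of the trained network), and each score is splatted as a Gaussian centred on the window. The way overlapping Gaussians are combined and the normalization are design assumptions.

```python
import numpy as np

def predict_saliency_map(frame_shape, score_fn, patch=100, stride=5):
    """Build a pixel-wise saliency map: slide patch x patch windows with the given
    stride, score each window, and splat a Gaussian at every window centre."""
    H, W = frame_shape
    sal = np.zeros((H, W))
    sigma = patch / 2.0                                    # spread = half the patch size
    yy, xx = np.mgrid[0:H, 0:W]
    for y0 in range(patch // 2, H - patch // 2, stride):
        for x0 in range(patch // 2, W - patch // 2, stride):
            p = score_fn(x0, y0)                           # soft-max "salient" probability
            g = np.exp(-((xx - x0) ** 2 + (yy - y0) ** 2) / (2 * sigma ** 2))
            sal = np.maximum(sal, p * g)                   # combine overlapping responses
    return sal / (sal.max() + 1e-12)                       # normalize like an FDM

# toy usage with a dummy scorer that peaks near the frame centre
dummy = lambda x, y: float(np.exp(-((x - 160) ** 2 + (y - 120) ** 2) / (2 * 40.0 ** 2)))
smap = predict_saliency_map((240, 320), dummy, patch=100, stride=20)
print(smap.shape)
```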
To learn the model for prediction of visually interesting areas in the image plane, four
data sets were used: HOLLYWOOD [29, 30], the GTEA corpus [7], CRCNS [14] and
IRCCYN [3].
The HOLLYWOOD database contains 823 training videos and 884 videos for
the validation step. The number of subjects with recorded gaze fixations varies
according to each video, with up to 19 subjects. The spatial resolution of the videos
varies as well. In other terms, the HOLLYWOOD data set contains 229,825 frames
for training and 257,733 frames for validation. From the frames of the training set
we have extracted 222,863 salient patches and 221,868 non-salient patches. During
the validation phase, we have used 251,294 salient patches and 250,169 non-salient
patches respectively (Tables 1, 2, 3, and 4).
men. For three participants some problems occurred in the eye-tracking recording
process; these three records were thus excluded. From the 17 available videos of the
GTEA dataset, ten were selected for the training step, with a total number of 10,149
frames, and seven videos with 7840 frames were selected for the validation step.
The split of salient and non-salient patches, for a total of 19,910 at the training step
and 15,204 at the validation step, is presented in Table 5.
In the CRCNS2 data set [14], 50 videos of 640 × 480 resolution are available with
gaze recordings of up to eight different subjects. The database was split equally:
the training and validation sets contain 25 videos each. From the training set, we have
extracted 30,374 salient and 28,185 non-salient patches. From the validation set,
19,945 salient and 17,802 non-salient patches were extracted.
The IRCCYN [3] database is composed of 31 SD videos and gaze fixations of 37
subjects. These videos contain certain categories of attention attractors such as
high contrast and faces. However, videos with objects in motion are not frequent. Our
purpose, saliency prediction modelling the “smooth pursuit”, cannot be evaluated
by using all available videos of the IRCCyN data set. Videos that do not contain real
object motion were eliminated. Therefore, only SRC02, SRC03, SRC04, SRC05,
SRC06, SRC07, SRC10, SRC13, SRC17, SRC19, SRC23, SRC24 and SRC27 were retained.
2 Available at https://crcns.org/data-sets/eye/eye-1.
Table 5 Distribution of learning data: total number of salient and NonSalient patches selected
from each database
Datasets Training step Validation step
HOLLYWOOD SalientPatch 222,863 251,294
NonSalientPatch 221,868 250,169
Total 444,731 501,463
GTEA SalientPatch 9961 7604
NonSalientPatch 9949 7600
Total 19,910 15,204
CRCNS SalientPatch 30,374 19,945
Non-SalientPatch 28,185 17,802
Total 58,559 37,747
IRCCyN-MVT SalientPatch 2013 511
Non-SalientPatch 1985 506
Total 3998 1017
The network was implemented using a powerful graphics card (Tesla K40m) and
processor (2 × 14 cores). Therefore a sufficiently large batch of 256 patches was
used per iteration. After a fixed number of training iterations, a model validation
step was implemented: here the accuracy of the model at the current iteration was
computed on the validation data set.
To evaluate our deep network and to prove the importance of the addition of the
residual motion map, two models were created with the same parameter settings and
architecture of the network: the first one contains R, G and B primary pixel values
in patches, denoted as ChaboNet3k. The ChaboNet4k is the model using RGB
values and the normalized energy of residual motion as input data, see Sect. 3.2.
The following Fig. 7 illustrates the variations of the accuracy along iterations of
all the models tested for the database “HOLLYWOOD”. The results of learning
experiments on HOLLYWOOD data set yield the following conclusions: (1) when
adding residual motion as an input feature to RGB plane values, the accuracy is
improved by almost 2%. (2) The accuracy curve (Fig. 7a) shows that the best trained
model reached 80% accuracy at iteration #8690. The model obtained after
8690 iterations is used to predict saliency on the validation set of this database, and
to initialize the parameters when learning with transfer on the other data sets used.
Fig. 7 Training the network—accuracy vs iterations of ChaboNet3k and ChaboNet4k for all
tested databases. (a) Accuracy vs iterations Hollywood dataset. (b) Accuracy vs iterations
IRCCyN-MVT dataset. (c) Accuracy vs iterations CRCNS dataset. (d) Accuracy vs iterations
GTEA dataset
Two experiments were conducted with the same small data sets IRCCyN-MVT and
CRCNS, and the same definition of the network “ChaboNet”:
(1) Our method: start the training of all ChaboNet layers from the best model already
trained on the large HOLLYWOOD data set (see Sect. 3.5).
(2) Bengio’s method [52]: the first three convolutional layers are trained on the
HOLLYWOOD data set and then fine-tuned on the target data set; the other layers
are trained on the target data set with random initialization.
The following Fig. 8 illustrates the variations of the accuracy along iterations of
the two experiments performed with the data sets “CRCNS”, “IRCCyN-MVT” and
GTEA.
Fig. 8 Evaluation and comparison of our proposed method of transfer learning. (a) Comparison
on IRCCyN-MVT data set. (b) Comparison on CRCNS data set. (c) Comparison on GTEA data
set
After training and validation of the model on the HOLLYWOOD data set, we choose
the model obtained at the iteration #8690 having the maximum value of accuracy,
80.05%. This model will be used to predict the probability of a local region
to be salient. Hence, the final saliency map will be built. For the CRCNS data
set, the model obtained at the iteration #21984 with the accuracy of 69.73% is
Table 6 Accuracy results on HOLLYWOOD, IRCCyN-MVT, CRCNS and GTEA data sets
Columns (left to right): HOLLYWOOD ChaboNet3k, ChaboNet4k; IRCCyN-MVT ChaboNet3k, ChaboNet4k; CRCNS ChaboNet3k, ChaboNet4k; GTEA ChaboNet3k, ChaboNet4k
Interval-stabilization: –, –, [5584...6800], [8976...10288], [8702...26106], [11908...29770], [6630...12948], [12090...16458]
min(#iter): 50.11% (#0), 65.73% (#0), 89.94% (#5632), 90.72% (#9264), 64.51% (#8702), 65.73% (#11908), 86.46% (#7566), 89.80% (#9750)
max(#iter): 77.98% (#5214), 80.05% (#8690), 92.67% (#6544), 92.77% (#9664), 69.48% (#19923), 69.73% (#21984), 91.61% (#6786), 90.30% (#15678)
avg ± std: 77.30% ± 0.864, 78.73% ± 0.930, 91.84% ± 0.592, 92.24% ± 0.417, 67.89% ± 0.907, 68.61% ± 0.805, 90.78% ± 0.647, 90.13% ± 0.106
used to predict saliency. In the same manner, the model with the accuracy of
92.77% obtained at the iteration #9664 is used for the IRCCyN-MVT data set.
To evaluate our method of saliency prediction, its performance was compared with
the most popular saliency models from the literature. The following models were
chosen: the Itti and Koch spatial model [15]; Signature Sal [13] (the algorithm
introduces a simple image descriptor referred to as the image signature, performing
better than the Itti model); GBVS (the regularized spatial saliency model of Harel [12]);
and the spatio-temporal model of Seo [40] built upon optical flow.
In Tables 8 and 9 below, we show the comparison of the Deep CNN prediction of
pixel-wise saliency maps with the gaze fixations, and compare performance with
the most popular saliency prediction models (Signature Sal, GBVS, Seo). In Table 10,
we compare our ChaboNet4k model with the models of Itti, GBVS and
Seo. In Tables 8, 9, 10 and 11 the best performance figures are underlined.
The comparison is given in terms of the widely used AUC metric [20]. The mean
value of the metric is given together with the standard deviation for some videos. In
general it can be stated that the spatial models (Signature Sal, GBVS or Itti) performed
better in half of the tested videos. This is due to the fact that these videos contain
very contrasted areas in the video frames, which attract the human gaze, and do not
contain areas having an interesting residual motion. Nevertheless, the ChaboNet4k
model outperforms the Seo model, which uses motion features such as optical flow.
This shows that the use of a Deep CNN is definitely a promising way to predict visual
saliency in video scenes. However, for the IRCCyN-MVT data set (see Table 9), even
though videos without any motion were set aside, the gain of the proposed model is not
very clear due to the complexity of these visual scenes, such as the presence of strong
contrasts and faces.
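For reference, here is a minimal sketch of how an AUC score of this kind can be computed for one frame with scikit-learn, treating gaze-fixated pixels as positives and the predicted map values as scores; this is only one of several AUC variants used in the saliency literature and not necessarily the exact protocol of [20].

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def saliency_auc(pred_map, fixation_mask, n_negatives=1000, seed=0):
    """AUC of a predicted saliency map against binary gaze fixations:
    positives are the fixated pixels, negatives are randomly sampled pixels."""
    rng = np.random.default_rng(seed)
    pos = pred_map[fixation_mask > 0]                      # scores at fixated pixels
    flat = pred_map.ravel()
    neg = flat[rng.integers(0, flat.size, n_negatives)]    # scores at random pixels
    scores = np.concatenate([pos, neg])
    labels = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    return roc_auc_score(labels, scores)

# toy usage: a centred Gaussian prediction and a few synthetic "fixations" near the centre
yy, xx = np.mgrid[0:240, 0:320]
pred = np.exp(-((xx - 160) ** 2 + (yy - 120) ** 2) / (2 * 40.0 ** 2))
fix = np.zeros_like(pred); fix[110:130, 150:170] = 1
print(round(saliency_auc(pred, fix), 3))                   # close to 1.0 in this toy case
```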
In Table 11 below we show the comparison of the Deep CNN prediction of pixel-
wise saliency maps with the saliency maps built by Wooding's method on gaze
fixations, and also compare performance with the most popular saliency prediction
models from the literature. In general we can state that the spatial models (Signature
Sal, GBVS) perform better. Nevertheless, our 4k model outperforms that of Seo in
four of these seven examples. This confirms that the use of a Deep CNN is a viable
way to predict top-down visual saliency in video scenes.
Table 8 The comparison of the AUC metric of gaze fixations ('Gaze-fix') vs predicted saliency ('GBVS', 'SignatureSal' and 'Seo') and our ChaboNet4k for the videos from the HOLLYWOOD data set

VideoName     TotFrame=2248   Gaze-fix vs GBVS   Gaze-fix vs SignatureSal   Gaze-fix vs Seo   Gaze-fix vs ChaboNet4k
clipTest56        137         0.76 ± 0.115       0.75 ± 0.086               0.64 ± 0.116      0.77 ± 0.118
clipTest105       154         0.63 ± 0.169       0.57 ± 0.139               0.54 ± 0.123      0.69 ± 0.186
clipTest147       154         0.86 ± 0.093       0.90 ± 0.065               0.70 ± 0.103      0.81 ± 0.146
clipTest250       160         0.74 ± 0.099       0.69 ± 0.110               0.47 ± 0.101      0.71 ± 0.180
clipTest350        66         0.65 ± 0.166       0.68 ± 0.249               0.57 ± 0.124      0.72 ± 0.177
clipTest400       200         0.75 ± 0.127       0.67 ± 0.110               0.60 ± 0.106      0.71 ± 0.146
clipTest451       132         0.70 ± 0.104       0.59 ± 0.074               0.57 ± 0.068      0.63 ± 0.151
clipTest500       166         0.82 ± 0.138       0.84 ± 0.150               0.75 ± 0.152      0.84 ± 0.156
clipTest600       200         0.75 ± 0.131       0.678 ± 0.149              0.53 ± 0.108      0.71 ± 0.180
clipTest650       201         0.72 ± 0.106       0.74 ± 0.087               0.61 ± 0.092      0.70 ± 0.078
clipTest700       262         0.74 ± 0.128       0.76 ± 0.099               0.50 ± 0.059      0.78 ± 0.092
clipTest800       200         0.70 ± 0.096       0.75 ± 0.071               0.53 ± 0.097      0.66 ± 0.141
clipTest803       102         0.86 ± 0.106       0.87 ± 0.068               0.73 ± 0.148      0.88 ± 0.078
clipTest849       114         0.75 ± 0.155       0.91 ± 0.070               0.55 ± 0.122      0.74 ± 0.132
Table 9 The comparison of the AUC metric of gaze fixations ('Gaze-fix') vs predicted saliency ('GBVS', 'SignatureSal' and 'Seo') and our ChaboNet4k for the videos from the IRCCyN-MVT data set

VideoName   TotFramesNbr   Gaze-fix vs GBVS   Gaze-fix vs SignatureSal   Gaze-fix vs Seo   Gaze-fix vs ChaboNet4k
src02           37         0.68 ± 0.076       0.49 ± 0.083               0.44 ± 0.017      0.48 ± 0.073
src03           28         0.82 ± 0.088       0.87 ± 0.057               0.76 ± 0.091      0.70 ± 0.149
src04           35         0.79 ± 0.058       0.81 ± 0.029               0.59 ± 0.057      0.57 ± 0.135
src05           35         0.73 ± 0.101       0.67 ± 0.122               0.48 ± 0.071      0.53 ± 0.128
src06           36         0.85 ± 0.080       0.71 ± 0.151               0.73 ± 0.148      0.60 ± 0.180
src07           36         0.72 ± 0.070       0.73 ± 0.060               0.57 ± 0.060      0.55 ± 0.135
src10           33         0.87 ± 0.048       0.92 ± 0.043               0.82 ± 0.101      0.60 ± 0.173
src13           35         0.79 ± 0.103       0.75 ± 0.111               0.64 ± 0.144      0.52 ± 0.138
src17           42         0.55 ± 0.092       0.33 ± 0.099               0.45 ± 0.033      0.51 ± 0.098
src19           33         0.76 ± 0.094       0.68 ± 0.086               0.59 ± 0.117      0.75 ± 0.123
src23           40         0.76 ± 0.050       0.69 ± 0.070               0.58 ± 0.067      0.66 ± 0.105
src24           33         0.63 ± 0.071       0.58 ± 0.054               0.55 ± 0.059      0.50 ± 0.052
src27           33         0.59 ± 0.117       0.64 ± 0.091               0.52 ± 0.057      0.54 ± 0.106
Table 10 The comparison of the AUC metric of gaze fixations ('Gaze-fix') vs predicted saliency ('GBVS', 'IttiKoch' and 'Seo') and our ChaboNet4k for 5330 frames of CRCNS videos

VideoName       TotFrame=5298   Gaze-fix vs GBVS   Gaze-fix vs IttiKoch   Gaze-fix vs Seo   Gaze-fix vs ChaboNet4k
beverly03            479        0.78 ± 0.151       0.77 ± 0.124           0.66 ± 0.172      0.75 ± 0.153
gamecube02          1819        0.73 ± 0.165       0.74 ± 0.180           0.61 ± 0.179      0.78 ± 0.160
monica05             611        0.75 ± 0.183       0.73 ± 0.158           0.54 ± 0.156      0.80 ± 0.144
standard02           515        0.78 ± 0.132       0.72 ± 0.141           0.61 ± 0.169      0.70 ± 0.156
tv-announce01        418        0.60 ± 0.217       0.64 ± 0.203           0.52 ± 0.206      0.65 ± 0.225
tv-news04            486        0.78 ± 0.169       0.79 ± 0.154           0.61 ± 0.162      0.71 ± 0.158
tv-sports04          970        0.68 ± 0.182       0.69 ± 0.162           0.56 ± 0.193      0.75 ± 0.173
Table 11 The comparison of the AUC metric of gaze fixations ('Gaze-fix') vs predicted saliency ('GBVS', 'SignatureSal' and 'Seo') and our 4k_model for the videos from the GTEA dataset

VideoName                TotFrame=7693   Gaze-fix vs GBVS   Gaze-fix vs SignatureSal   Gaze-fix vs Seo   Gaze-fix vs 4k_model
S1_CofHoney_C1_undist        1099        0.811 ± 0.109      0.800 ± 0.091              0.578 ± 0.120     0.732 ± 0.157
S1_Pealate_C1_undist         1199        0.824 ± 0.099      0.846 ± 0.080              0.594 ± 0.139     0.568 ± 0.185
S1_Tea_C1_undist             1799        0.770 ± 0.127      0.816 ± 0.074              0.567 ± 0.135     0.745 ± 0.211
S2_Cheese_C1_undist           499        0.813 ± 0.116      0.766 ± 0.0138             0.552 ± 0.127     0.643 ± 0.218
S2_Coffee_C1_undist          1599        0.802 ± 0.098      0.720 ± 0.094              0.594 ± 0.116     0.636 ± 0.193
S3_Hotdog_C1_undist           699        0.768 ± 0.103      0.851 ± 0.088              0.585 ± 0.114     0.415 ± 0.145
S3_Peanut_C1_undist           799        0.757 ± 0.115      0.758 ± 0.135              0.519 ± 0.100     0.570 ± 0.162
3.7.5 Conclusion
References
1. Borji, A., Itti, L.: State-of-the-art in visual attention modeling. IEEE Trans. Pattern Anal.
Mach. Intell. 35(1), 185–207 (2013)
2. Boulos, F., Chen, W., Parrein, B., Le Callet, P.: Region-of-interest intra prediction for
H.264/AVC error resilience. In: IEEE International Conference on Image Processing, Cairo,
pp. 3109–3112 (2009)
3. Boulos, F., Chen, W., Parrein, B., Le Callet, P.: Region-of-interest intra prediction for
H.264/AVC error resilience. In: IEEE International Conference on Image Processing, Cairo,
pp. 3109–3112 (2009). https://hal.archives-ouvertes.fr/hal-00458957
4. Chaabouni, S., Benois-Pineau, J., Ben Amar, C.: Transfer learning with deep networks for
saliency prediction in natural video. In: 2016 IEEE International Conference on Image
Processing, ICIP 2016, vol. 91 (2016)
5. Chaabouni, S., Benois-Pineau, J., Hadar, O.: Prediction of visual saliency in video with deep CNNs. Proceedings of the SPIE Optical Engineering + Applications, pp. 9711Q-99711Q-14 (2016)
6. Chaabouni, S., Benois-Pineau, J., Tison, F., Ben Amar, C.: Prediction of visual attention with
Deep CNN for studies of neurodegenerative diseases. In: 14th International Workshop on
Content-Based Multimedia Indexing CBMI 2016, Bucharest, 15–17 June 2016
7. Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: 2011
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3281–3288 (2011)
8. Geng, M., Wang, Y., Xiang, T., Tian, Y.: Deep Transfer Learning for Person Re-identification.
CoRR abs/1611.05244 (2016). http://arxiv.org/abs/1611.05244
9. Girshick, R.B., Donahue, J., Darrell, T., Malik, J.: Region-based convolutional networks for
accurate object detection and segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 38(1),
142–158 (2016)
10. González-Díaz, I., Buso, V., Benois-Pineau, J.: Perceptual modeling in the problem of active
object recognition in visual scenes. Pattern Recogn. 56, 129–141 (2016)
11. Gygli, M., Soleymani, M.: Analyzing and predicting GIF interestingness. In: Proceedings of
the 2016 ACM on Multimedia Conference, MM ’16, pp. 122–126. ACM, New York (2016).
doi:10.1145/2964284.2967195. http://doi.acm.org/10.1145/2964284.2967195
12. Harel, J., Koch, C., Perona, P.: Graph-based visual saliency. In: Advances in Neural
Information Processing Systems, vol. 19, pp. 545–552. MIT Press, Cambridge (2007)
13. Hou, X., Harel, J., Koch, C.: Image signature: highlighting sparse salient regions. IEEE Trans.
Pattern Anal. Mach. Intell. 34(1), 194–201 (2012)
14. Itti, L.: CRCNS data sharing: eye movements during free-viewing of natural videos. In:
Collaborative Research in Computational Neuroscience Annual Meeting, Los Angeles, CA
(2008)
15. Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene
analysis. IEEE Trans. Pattern Anal. Mach. Intell. 20(11), 1254–1259 (1998)
16. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S.,
Darrell, T.: Caffe: convolutional architecture for fast feature embedding. In: Proceedings of
the ACM International Conference on Multimedia, MM ’14, Orlando, FL, 03–07 November,
2014, pp. 675–678 (2014)
17. Jiang, Y., Wang, Y., Feng, R., Xue, X., Zheng, Y., Yang, H.: Understanding and predicting
interestingness of videos. In: Proceedings of the Twenty-Seventh AAAI Conference on
Artificial Intelligence, AAAI’13, pp. 1113–1119. AAAI Press, Palo Alto (2013). http://dl.
acm.org/citation.cfm?id=2891460.2891615
18. Krizhevsky, A.: Learning multiple layers of features from tiny images. Ph.D. thesis, University
of Toronto (2009)
19. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional
neural networks. In: Pereira, F., Burges, C., Bottou, L., Weinberger, K. (eds.) Advances in
Neural Information Processing Systems, vol. 25, pp. 1097–1105. Curran Associates, Inc., Red
Hook (2012)
20. Le Meur, O., Baccino, T.: Methods for comparing scanpaths and saliency maps: strengths and
weaknesses. Behav. Res. Methods 45(1), 251–266 (2010)
21. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document
recognition. Proc. IEEE 86(11), 2278–2324 (1998)
22. Li, G., Yu, Y.: Visual saliency based on multiscale deep features. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 5455–5463 (2015)
23. Li, G., Yu, Y.: Deep contrast learning for salient object detection. In: IEEE Conference on Computer Vision and Pattern Recognition. 1603.01976 (2016)
24. Lin, Y., Kong, S., Wang, D., Zhuang, Y.: Saliency detection within a deep convolutional
architecture. In: Cognitive Computing for Augmented Human Intelligence: Papers from the
AAAI-14 Workshop, pp. 31–37 (2014)
25. Liu, N., Han, J., Zhang, D., Wen, S., Liu, T.: Predicting eye fixations using convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 362–370 (2015)
26. Mai, L., Le, H., Niu, Y., Liu, F.: Rule of thirds detection from photograph. In: 2011 IEEE
International Symposium on Multimedia (ISM), pp. 91–96 (2011)
27. Manerba, F., Benois-Pineau, J., Leonardi, R.: Extraction of foreground objects from MPEG2
video stream in rough indexing framework. In: Proceedings of the EI2004, Storage and
Retrieval Methods and Applications for Multimedia 2004, pp. 50–60 (2004). https://hal.
archives-ouvertes.fr/hal-00308051
28. Marat, S., Ho Phuoc, T., Granjon, L., Guyader, N., Pellerin, D., Guérin-Dugué, A.: Modelling
spatio-temporal saliency to predict gaze direction for short videos. Int. J. Comput. Vis. 82(3),
231–243 (2009)
29. Marszałek, M., Laptev, I., Schmid, C.: Actions in context. In: IEEE Conference on Computer
Vision & Pattern Recognition (2009)
30. Mathe, S., Sminchisescu, C.: Actions in the eye: dynamic gaze datasets and learnt saliency
models for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 37(7), 1408–1424
(2015)
31. Mesnil, G., Dauphin, Y., Glorot, X., Rifai, S., Bengio, Y., Goodfellow, I.J., Lavoie, E., Muller,
X., Desjardins, G., Warde-Farley, D., Vincent, P., Courville, A., Bergstra, J.: Unsupervised and
transfer learning challenge: a deep learning approach. In: JMLR W& CP: Proceedings of the
Unsupervised and Transfer Learning Challenge and Workshop, vol. 27, pp. 97–110 (2012)
32. Nesterov, Y.: A method of solving a convex programming problem with convergence rate O(1/k²). Sov. Math. Doklady 27, 372–376 (1983)
33. Nettleton, D.F., Orriols-Puig, A., Fornells, A.: A study of the effect of different types of noise
on the precision of supervised learning techniques. Artif. Intell. Rev. 33(4), 275–306 (2010).
http://dx.doi.org/10.1007/s10462-010-9156-z
34. Pan, J., Giró-i-Nieto, X.: End-to-end convolutional network for saliency prediction. In: IEEE Conference on Computer Vision and Pattern Recognition 1507.01422 (2015)
35. Pérez de San Roman, P., Benois-Pineau, J., Domenger, J.P., Paclet, F., Cataert, D., De
Rugy, A.: Saliency Driven Object recognition in egocentric videos with deep CNN. CoRR
abs/1606.07256 (2016). http://arxiv.org/abs/1606.07256
36. Pinto, Y., van der Leij, A.R., Sligte, I.G., Lamme, V.F., Scholte, H.S.: Bottom-up and top-down
attention are independent. J. Vis. 13(3), 16 (2013)
37. Polyak, B.: Introduction to Optimization (Translations Series in Mathematics and Engineer-
ing). Optimization Software, New York (1987)
38. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes in C: The
Art of Scientific Computing, 2nd edn. Cambridge University Press, New York (1992)
39. Reed, S., Lee, H., Anguelov, D., Szegedy, C., Erhan, D., Rabinovich, A.: Training deep neural
networks on noisy labels with bootstrapping. CoRR abs/1412.6596 (2014). http://arxiv.org/
abs/1412.6596
40. Seo, H.J., Milanfar, P.: Static and space-time visual saliency detection by self-resemblance. J.
Vis. 9(12), 15, 1–27 (2009)
41. Shen, J., Itti, L.: Top-down influences on visual attention during listening are modulated by
observer sex. Vis. Res. 65, 62–76 (2012)
42. Shen, C., Zhao, Q.: Learning to predict eye fixations for semantic contents using multi-layer
sparse network. Neurocomputing 138, 61–68 (2014)
43. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: visualising
image classification models and saliency maps. CoRR abs/1312.6034 (2013)
44. Treisman, A.M., Gelade, G.: A feature-integration theory of attention. Cogn. Psychol. 12(1),
97–136 (1980)
45. Uijlings, J., de Sande, K.V., Gevers, T., Smeulders, A.: Selective search for object recognition.
Int. J. Comput. Vis. 104(2), 154–171 (2013)
46. Vapnik, V.: Principles of risk minimization for learning theory. In: Moody, J.E., Hanson, S.J.,
Lippmann, R. (eds.) NIPS, pp. 831–838. Morgan Kaufmann, Burlington (1991)
47. Vig, E., Dorr, M., Cox, D.: Large-scale optimization of hierarchical features for saliency
prediction in natural images. In: Proceedings of the 2014 IEEE Conference on Computer
Vision and Pattern Recognition, CVPR ’14, pp. 2798–2805 (2014)
48. Wooding, D.S.: Eye movements of large populations: II. Deriving regions of interest, coverage,
and similarity using fixation maps. Behav. Res. Methods Instrum. Comput. 34(4), 518–528
(2002)
49. Xiao, T., Xia, T., Yang, Y., Huang, C., Wang, X.: Learning from massive noisy labeled data for
image classification. In: The IEEE Conference on Computer Vision and Pattern Recognition
(CVPR) (2015)
50. Xu, J., Mukherjee, L., Li, Y., Warner, J., Rehg, J.M., Singh, V.: Gaze-enabled egocentric video
summarization via constrained submodular maximization. In: Proceedings of the CVPR (2015)
51. Yoon, S., Pavlovic, V.: Sentiment flow for video interestingness prediction. In: Proceedings
of the 1st ACM International Workshop on Human Centered Event Understanding from
Multimedia, HuEvent 14, pp. 29–34. ACM, New York (2014). doi:10.1145/2660505.2660513
52. Yosinski, J., Clune, J., Bengio, Y., Lipson, H.: How transferable are features in deep neural
networks? In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K.
(eds.) Advances in Neural Information Processing Systems, vol. 27, pp. 3320–3328. Curran
Associates, Inc., Red Hook (2014)
53. Zeiler, M.D., Fergus, R.: Visualizing and Understanding Convolutional Networks. CoRR
abs/1311.2901 (2013)
54. Zen, G., de Juan, P., Song, Y., Jaimes, A.: Mouse activity as an indicator of interestingness in
video. In: Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval,
ICMR ’16, pp. 47–54. ACM, New York (2016). doi:10.1145/2911996.2912005. http://doi.acm.
org/10.1145/2911996.2912005
55. Zhou, Z., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for
discriminative localization. In: IEEE Conference on Computer Vision and Pattern Recognition
1512.04150 (2015)
56. Zhu, X., Wu, X.: Class noise vs. attribute noise: a quantitative study. Artif. Intell. Rev. 22(3),
177–210 (2004). doi:10.1007/s10462-004-0751-8
57. Hebb, D.O.: The Organisation of Behaviour: A Neurophysiological Theory, p. 379. Lawrence Erlbaum Associates, Mahwah (2002). ISBN:1-4106-1240-6. Originally published by Wiley, New York (1949)
58. Rosenblatt, F.: The perceptron: a probabilistic model for information storage and organization in the brain. Psychol. Rev. 65(6), 386–408 (1958)
Introducing Image Saliency Information
into Content Based Indexing and Emotional
Impact Analysis
S. Gbehounou
Jules SAS, Blagnac, France
e-mail: [email protected]

T. Urruty • F. Lecellier • C. Fernandez-Maloigne
Xlim, University of Poitiers, CNRS, Poitiers, France
e-mail: [email protected]; [email protected]; [email protected]

1 Introduction

Local feature detectors are widely used in the literature as the first step of many systems in the image processing domain and its various applications: retrieval, recognition, . . . Due to this widespread usage, many state-of-the-art papers are dedicated to local feature evaluation [35, 49]. Their principal aim is to define criteria to compare existing local features. They often considered using, among others, the repeatability . . . selection to more detectors and datasets. First, we propose in this chapter to evaluate keypoint detectors with respect to the visual saliency of their outputs. Our goal is to quantify the saliency of the local features detected using four usual local feature detectors: Harris, Harris-Laplace, DOG and FAST. To do this we used four visual attention models [16, 18, 26, 38]. Secondly, we study the impact of selecting local features according to their saliency on image retrieval.
The considered features are principally used for image indexing or classification based on their semantic content. However, it is also possible to measure other parameters, such as the emotional impact. The latter has several applications: film classification, road safety education, advertising or e-commerce, by selecting the image information appropriate to the situation. The extraction of emotional impact is an ambitious task since emotions are not only content related (textures, colours, shapes, objects, . . . ) but also depend on cultural and personal experiences. Before giving more details about emotion classification in the literature, one needs to define what an emotion is and how to classify them. There are two different approaches [28] to perform such a classification:
1. Discrete approach: the emotional process can be explained with a set of basic or fundamental emotions, innate and common to all humans (sadness, anger, happiness, disgust, fear, . . . ). There is no consensus about the nature and the number of these fundamental emotions.
2. Dimensional approach: in contrast to the previous one, emotions are considered as the result of a fixed number of concepts represented in a dimensional space. The dimensions can be pleasure, arousal or power and vary depending on the needs of the model. The advantage of these models is that they define a large number of emotions, but there are some drawbacks because some emotions may be confused or unrepresented in this kind of model.
In the literature, many research works are based on the discrete modeling of emotions, for example those of Paleari and Huet [43], Kaya and Epps [21], Wei et al. [58] or Ou et al. [40–42]. In this chapter, we choose an approach close to the dimensional one in order to obtain a classification into three different classes: "Unpleasant", "Neutral" and "Pleasant". Our goal is to summarize the emotions of low-semantic images, and since the number and nature of emotions in the discrete approach remain uncertain, the selection of specific ones may lead to an incorrect classification.
Emotion analysis is based on many factors. One of the first factors to consider is the link between colors and emotions [3, 5, 6, 31, 40–42, 58]. Several of those works have considered emotions associated with particular colors through the influence of culture, age, gender or social status. Most of the authors agree that there is a strong link between emotions and colors. As stated by Ou et al. [40], colors play an important role in decision-making, evoking different emotional feelings. Color emotion and color-pair emotion now form a well-established research area. Indeed, in a series of publications, Ou et al. [40–42] studied the relationship between emotions, preferences and colors, and established a model of emotions associated with colors from psycho-physical experiments.
Another part of the emotional impact analysis of images depends on the interpretation of facial expressions [43]. Emotions are then associated with facial features (such as eyebrows or lips). This seems to be the easiest way to predict emotions, since facial expressions are common to humans and the basic emotions are relatively easy for humans to evaluate (happiness, fear, sadness, surprise, . . . ). However, in this case the system detects the emotions carried by the images and not really the emotions felt by someone looking at these pictures, which can depend on their empathy or on the global content of the image (for example, a baby crying in the background of an image does not necessarily imply that the image is sad).
More recently, some authors have considered emotion recognition as a CBIR task [32, 53, 59]. They want to use the traditional techniques of image retrieval to extract the emotional impact of images. To perform such a task, they need to choose some image features, such as color, texture or shape descriptors, and combine them with a classification system. These two steps, after a learning step, allow the authors to predict the emotional impact of the images. For example, Wang and Yu [57] used the semantic description of colours to associate an emotional semantic with an image. Liu et al. [28] conclude on the usefulness of texture for emotion classification: they stated that oblique lines could be associated with dynamism and action; horizontal and vertical ones with calm and relaxation.
In the last part of this chapter, we evaluate some low-level features well adapted to object recognition and image retrieval [1, 20, 22, 29, 30, 39, 56] and conduct our study on two databases:
• A set of natural images that was assessed during subjective evaluations: Study of
Emotion on Natural image databaSE (SENSE) [8];
• A database considered as a reference on psychological studies of emotions:
International Affective Picture System (IAPS) [24].
The remainder of this chapter is structured as follows: we provide a brief description of the chosen detectors in Sect. 2. The visual attention models are described in Sect. 3. Then, we present the databases and the local feature detector settings in Sect. 4, and the findings of our study of local feature saliency in Sect. 5. The study conducted on the importance of salient pixels for image retrieval is explained in Sect. 6, followed by a discussion of the results in Sect. 7. Then, in Sect. 8, we extend our results to emotion classification. Finally, we conclude and present some future work in Sect. 9.
2 Local Feature Detectors

This section presents the four corner and blob detectors we chose for our study.
These detectors are widely used in many image processing frameworks [29, 30, 33, 47, 48, 56]. Note that the evaluation of region detectors such as MSER is not in the scope of this chapter, mostly due to the complexity of the detected areas: as these areas are not regularly shaped, it is difficult to define the saliency value linked to the detected regions.
In the following, we explain briefly the four corner and blob detectors selected
for our study.
1. The Harris detector is a corner detector proposed by Harris and Stephens in 1988 [17]. It is based on the auto-correlation matrix used by Moravec in 1977 [37] and measures, for each pixel, the intensity differences between a main window and windows shifted in different directions. In their improved version, Harris and Stephens proposed to use the matrix M defined by Eq. (1):
$$M(x, y) = \begin{bmatrix} \sum_{W} I_x(x_k, y_k)^2 & \sum_{W} I_x(x_k, y_k)\, I_y(x_k, y_k) \\ \sum_{W} I_x(x_k, y_k)\, I_y(x_k, y_k) & \sum_{W} I_y(x_k, y_k)^2 \end{bmatrix} \qquad (1)$$
The Harris detector is robust to rotation but suffers from scale changes [49].
2. The Harris-Laplace detector was proposed by Mikolajczyk and Schmid [33] and resolves the scale invariance problem of the Harris detector. Indeed, the points are first detected with a Harris function at multiple scales and then filtered according to a local measure: the Laplacian is used, and only points with a maximal response in the scale-space are kept.
3. The Difference of Gaussians (DOG) was used by Lowe in the SIFT (Scale-Invariant Feature Transform) algorithm [29] to approximate the Laplacian of Gaussian, whose kernel is particularly stable in scale-space [34]. The local maxima allow blob structures to be detected. This detector is robust to rotation and scale changes.
4. Features from Accelerated Segment Test (FAST) was introduced by Rosten and Drummond [47, 48] for real-time frame-rate applications. It is a high-speed feature detector based on the SUSAN (Smallest Univalue Segment Assimilating Nucleus) detector introduced by Smith and Brady [51]. For each pixel, a circular neighborhood with a fixed radius is defined, and only the 16 neighbors defined on the circle, as shown in Fig. 1, are handled. A pixel p is a local feature if at least 12 contiguous neighbors have an intensity below its value minus some threshold (a simplified sketch of this segment test is given below).
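The following Python sketch illustrates the simplified FAST segment test described above on an 8-bit grayscale NumPy array; it omits the high-speed pre-test and the non-maximum suppression of the full detector, and assumes the candidate pixel lies at least three pixels away from the image border.

```python
import numpy as np

# Offsets of the 16 pixels on the radius-3 Bresenham circle used by FAST,
# listed clockwise starting from the pixel directly above the candidate.
CIRCLE16 = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
            (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]

def fast_segment_test(gray, x, y, t=30, n=12):
    """Return True if (x, y) passes the simplified FAST test: at least `n`
    contiguous circle pixels are all brighter than I(x, y) + t or all darker
    than I(x, y) - t."""
    center = int(gray[y, x])
    ring = np.array([int(gray[y + dy, x + dx]) for dx, dy in CIRCLE16])
    for sign in (+1, -1):                        # brighter run, then darker run
        hits = sign * (ring - center) > t
        run = best = 0
        for h in np.concatenate([hits, hits]):   # doubled ring handles wrap-around
            run = run + 1 if h else 0
            best = max(best, run)
        if best >= n:
            return True
    return False
```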
3 Visual Saliency Models

In the last decades, many visual saliency frameworks have been published [10, 18,
25, 62]. Borji et al. [4] have proposed an interesting comparative study of 35 different models from the literature; they also mentioned the ambiguity around saliency and attention. Visual attention is a broad concept covering many topics (e.g., bottom-up/top-down, overt/covert, spatial/spatio-temporal). Saliency, on the other hand, has mainly referred to bottom-up processes that render certain image regions more conspicuous; for instance, image regions with features different from their surroundings (e.g., a single red dot among several blue dots).
Many visual saliency frameworks are inspired by psycho-visual features [18, 25], while others make use of several low-level features in different ways [10, 62]. The work of Itti et al. [18] can be considered a notable example of the bio-inspired models. An input image is processed by extracting three conspicuity maps based on low-level characteristics. These maps are representative of the three main human perceptual channels: color, intensity and orientation; they are then combined to generate the final saliency map, as described in Fig. 2.
We used the bio-inspired model proposed by Itti et al. to assess the saliency of our local features in our first study.¹
Figure 3b is the saliency map for Fig. 3a; the lighter pixels are the most salient ones.
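For illustration only, the sketch below shows how a normalized bottom-up saliency map can be obtained in Python with OpenCV's contrib saliency module (spectral-residual model); the experiments of this chapter rely instead on the Itti et al. and GBVS models computed with the software referenced in the footnote.

```python
import cv2
import numpy as np

def normalized_saliency(bgr_image):
    """Compute a bottom-up saliency map with OpenCV's spectral-residual model
    (requires opencv-contrib-python) and rescale it to [0, 1]."""
    model = cv2.saliency.StaticSaliencySpectralResidual_create()
    ok, sal = model.computeSaliency(bgr_image)
    if not ok:
        raise RuntimeError("saliency computation failed")
    sal = sal.astype(np.float32)
    return (sal - sal.min()) / (sal.max() - sal.min() + 1e-8)
```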
4 Experimental Setup
4.1 Databases
For the evaluation of the visual saliency of local features obtained with the four
detectors mentioned in the Sect. 2, we use the following image sets:
1. University of Kentucky Benchmark proposed by Nistér and Stewénius [39]. In
the remainder, we will refer to this dataset as “UKB” to simplify the reading of
this chapter. UKB is really interesting because it is a large benchmark composed
of 10,200 images grouped in sets of 4 images showing the same object. They
present interesting properties for image retrieval: changes of point of view,
illumination, rotation, etc.
¹ Our saliency values are computed using the Graph-Based Visual Saliency (GBVS) software http://www.klab.caltech.edu/~harel/share/gbvs.php, which also implements Itti et al.'s algorithm.
Fig. 2 (block diagram) Input image → linear filtering → feature maps → conspicuity maps → linear combinations → saliency map → attended location
2. PASCAL Visual Object Classes challenge 2012 [9] called PASCAL VOC2012.
This benchmark is composed of 17,125 images. They represent realistic scenes
and they are categorized in 20 object classes, e.g. person, bird, airplane, bottle,
chair and dining table.
3. The dataset proposed by Le Meur and Baccino [25] for saliency study which
contains 27 images. We will refer to this dataset as “LeMeur”.
4. The database introduced by Kootstra et al. [23], composed of 101 images and referred to as "Kootstra" in this chapter. It is also used for saliency model evaluation.
We decided to consider two image databases traditionally used for the study of
visual saliency in order to quantify a potential link between the ratio of local features
detected and the nature of the dataset.
For our second study concerning the filtering of local features detected based on
their visual saliency we use two databases:
1. UKB already described in this section;
2. Holidays provided by Jegou et al. [19]. This dataset is composed of 1491 images
with a large variety of scene types. There are 500 groups each representing a
distinct object or scene.
4.2 Local Feature Detector Settings

The different parameters chosen for the local feature detectors are the default ones.
The idea of this chapter is not to find the best parameters but to use those proposed by the authors, which can be considered as an average optimum. We use the OpenCV implementation of the Harris, FAST and DOG detectors; for the last one we consider the keypoints described by the SIFT algorithm. For the Harris-Laplace detector, we use the color descriptor software developed by van de Sande et al. [56].
In our experiments, we use k = 0.4 for the Harris detector. The Harris threshold is defined as 0.05 multiplied by the best corner quality C computed using Eq. (2). The neighborhood size is 3 × 3 and we use k = 0.64. For Harris-Laplace, the Harris threshold is set to 10⁻⁹ and the Laplacian threshold to 0.03. The DOG detector settings are the original values proposed by Lowe [29]. The threshold needed in the FAST algorithm to compare the intensity value of the nucleus with that of its neighbors is set to 30 in our experiments.
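As a rough illustration of this setup, the sketch below instantiates comparable detectors with OpenCV's Python bindings; the parameter values are default-style choices and do not reproduce the chapter's exact configuration (in particular, the Harris-Laplace keypoints of the chapter were obtained with the colorDescriptor software).

```python
import cv2
import numpy as np

def detect_keypoints(gray):
    """Illustrative detector setup; `gray` is an 8-bit grayscale NumPy array."""
    out = {}
    # cv2.cornerHarris returns the corner quality det(M) - k*trace(M)^2 per pixel;
    # keep pixels above 0.05 times the best corner quality (blockSize=3, ksize=3, k=0.04).
    quality = cv2.cornerHarris(np.float32(gray), 3, 3, 0.04)
    ys, xs = np.nonzero(quality > 0.05 * quality.max())
    out['Harris'] = list(zip(xs.tolist(), ys.tolist()))
    # FAST with the intensity threshold of 30 quoted in the text.
    out['FAST'] = cv2.FastFeatureDetector_create(threshold=30).detect(gray)
    # DoG keypoints via SIFT (cv2.SIFT_create requires OpenCV >= 4.4).
    out['DoG'] = cv2.SIFT_create().detect(gray)
    # Harris-Laplace only exists in the contrib xfeatures2d module, if available.
    if hasattr(cv2, 'xfeatures2d') and hasattr(cv2.xfeatures2d, 'HarrisLaplaceFeatureDetector_create'):
        out['Harris-Laplace'] = cv2.xfeatures2d.HarrisLaplaceFeatureDetector_create().detect(gray)
    return out
```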
5 Study of Local Feature Saliency

This section introduces our first evaluation study of local feature saliency. Our
aim is to compare the local feature detectors with respect to the ratio of visually salient features they produce. To do so, we need to find a threshold t in order to classify a local feature as salient.
The different visual saliency values that we obtained are normalized between 0 and 1. An intuitive threshold might therefore be 0.5. However, we preferred to define a threshold that preserves an easy recognition by humans of the scenes/different objects with a minimal number of pixels. We made a small user study to evaluate different threshold values. The images of Fig. 4 show the results for three values of the threshold: 0.3, 0.4 and 0.5. We chose the threshold equal to 0.4, as it is the minimal value for which most users in our study recognized the objects in most of the images. Thus we consider that a local feature is salient if the saliency at its position is greater than or equal to 0.4.
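The measurement performed in this section can be sketched as follows, assuming a list of (x, y) keypoint positions and a saliency map already normalized to [0, 1].

```python
import numpy as np

def salient_ratio(keypoints, saliency_map, t=0.4):
    """Fraction of detected keypoints whose normalized saliency value is >= t."""
    xs = np.array([int(round(x)) for x, _ in keypoints])
    ys = np.array([int(round(y)) for _, y in keypoints])
    values = saliency_map[ys, xs]          # saliency sampled at keypoint positions
    return float(np.mean(values >= t))
```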
Before studying the local feature saliency, we have tested if there is a significant
difference between the studied databases related to the ratio of salient pixels they
contain. Detailed results of our experiments are not presented here, however we
summarize them in Table 1. For this study we consider the median (the second quartile) and the inter-quartile interval (the difference between the third and the first quartiles).

Fig. 4 Figure 3a quantised with different saliency thresholds. (a) t = 0.3, (b) t = 0.4, (c) t = 0.5

Table 1 Distribution of the salient features for each detector and dataset. Bold values correspond to best scores
                        LeMeur   Kootstra   UKB     PascalVOC12   Average
Harris   Median          48.72     42.91    56.71       50         49.59
         Inter-quartile  37.41     28.34    40.18       30.87      34.2
FAST     Median          34.76     33.29    43.38       37.70      37.28
         Inter-quartile  21.66     21.89    37.98       25.75      26.82
DOG      Median          30.71     31.84    41.13       36.42      35.03
         Inter-quartile  12.53     21.01    33.50       22.45      22.37
H-L      Median          26.80     29.38    34.95       32.46      30.90
         Inter-quartile  14.25     21.04    30.55       20.05      21.47
H-L corresponds to the Harris-Laplace detector. The median and inter-quartile values are percentages

We notice that the LeMeur and Kootstra databases [23, 25], specially proposed for saliency studies, have on average more salient pixels. However, all four databases contain a lot of non-salient information. The highest median values are observed for the interval [0, 0.1]: >30% for LeMeur, >20% for Kootstra, >40% for UKB and 30% for Pascal VOC2012.
If we consider the average of the different medians, the Harris detector, with 49.59%, appears as the one that extracts the most salient features, regardless of the nature of the images of these bases. This could be explained by the fact that it measures intensity differences in the image space, which can be interpreted as a measure of contrast useful for visual saliency. The difference between the three other detectors is minimal. The results of Harris-Laplace and DoG could be explained by the scale changes they incorporate. Despite its good results for image retrieval and object categorization [61], the Harris-Laplace detector selects less salient local features.
[61], Harris-Laplace detector selects less salient local features.
Our study of local feature detector saliency confirms that they do not detect the
most salient information.2 These observations are comprehensible since the local
detectors used and the visual saliency models are not based on the same concept.
The fact that the Harris detector produces more salient corners is interesting. It may
advise to use Harris detector if any scale change invariant is needed for local feature
filtering.
In the following, we focus on Harris-Laplace and assess the importance of the local features according to their visual saliency for image retrieval on UKB. We no longer consider the previous threshold t = 0.4; the local features are simply ranked according to their saliency value.
6 Importance of Salient Local Features for Image Retrieval

In this section, we study the impact of filtering the local features according to their
saliency value before the image signature computation. To do so, we consider two
local descriptors:
1. Colour Moment Invariants (CMI) descriptor [36] that describes local features
obtained with one of the following detectors:
• Harris-Laplace detector;
• a dense detection scheme.
We choose to use the Bag of Visual Words representation [7, 50], which is widely used to create image signatures, to index the images of the UKB dataset. The visual codebook we have computed was introduced in our previous work [55], in which we proposed a random iterative visual word selection algorithm whose results are interesting for this descriptor using only 300 visual words. The visual codebook is computed with Pascal VOC2012 [9]. A minimal sketch of the BoVW signature computation is given after this list.
2. SIFT [29] and the local features are detected with:
• a grid-dense detection scheme for UKB dataset images;
• Hessian-Affine detector for INRIA Holidays images.3
For this descriptor, we also use the Bag of words model as visual signature
with:
• 10,000 visual words computed with K-means algorithm for UKB. The visual
codebook is computed with Pascal VOC2012 [9].
• 10,000 and 200,000 visual words used by Jegou et al. [19].
For the UKB retrieval results, a score of 4 means that the system returns all correct
neighbors for the 10,200 images. In fact, the database contains a set of four identical
images with different transformations. In this case the query is included in the score
calculation. Concerning Holidays, we compute the mean Average Precision (mAP)
as detailed in [46].
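A minimal sketch of this UKB scoring protocol, assuming a mapping from image identifiers to their group of four and the ranked result lists of the system, is given below.

```python
import numpy as np

def ukb_score(ranked_lists, group_of):
    """Every image belongs to a group of four pictures of the same object; the
    score of a query is the number of same-group images among its top-4 results
    (query included), so a perfect system reaches 4. `ranked_lists[q]` is the
    ranked result list of query q and `group_of[i]` maps an image id to its group."""
    scores = [sum(group_of[r] == group_of[q] for r in results[:4])
              for q, results in ranked_lists.items()]
    return float(np.mean(scores))
```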
As previously mentioned, we rank the local features according to their saliency values. For our study, we filtered local features with two different configurations:
• "More salient": the more salient features are removed;
• "Less salient": the less salient features are removed.
The image signature is then built with the residual local features after filtering. The full algorithm is presented in Fig. 5 (a code sketch of the filtering step is given below).
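The two filtering configurations can be sketched as follows; this is illustrative code, not the exact implementation of Fig. 5, and it assumes a normalized saliency map indexed as saliency_map[y, x].

```python
import numpy as np

def filter_by_saliency(keypoints, saliency_map, fraction, remove='least_salient'):
    """Rank the keypoints by the saliency value at their position, then drop
    `fraction` of either the most salient ('More salient' configuration) or the
    least salient ones ('Less salient' configuration)."""
    pts = [(int(round(x)), int(round(y))) for x, y in keypoints]
    sal = np.array([saliency_map[y, x] for x, y in pts])
    order = np.argsort(sal)                       # ascending saliency
    n_removed = int(fraction * len(pts))
    if remove == 'most_salient':
        kept = order[:len(pts) - n_removed]       # drop the top of the ranking
    else:
        kept = order[n_removed:]                  # drop the bottom of the ranking
    return [keypoints[i] for i in kept]
```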
The results for UKB are presented in Fig. 6a for CMI and Fig. 6b for SIFT. They clearly highlight the importance of salient local features for the retrieval. For example, removing 50% of the most salient features with SIFT induces a loss of retrieval accuracy of 8.25%, against 2.75% for the 50% least salient ones. The results are similar for CMI: a loss of 20% when filtering 50% of the most salient features and of 3.55% otherwise.
Whatever the descriptor, our findings go in the same direction as the previous ones for UKB: local features can be filtered according to their saliency without significantly affecting the retrieval results, and the most salient local features are very discriminative for an accurate retrieval. These conclusions are valid for the Harris-Laplace detector. We have tested these assumptions with another keypoint detection scheme: grid-dense quantization. Indeed, an increasing number of research works consider this feature selection approach [15, 45], which poses a problem: the large number of keypoints affects the efficiency of the retrieval. If the previous results are confirmed, then visual attention can be used to filter local keypoints regardless of the descriptor and the feature detector.
³ We used the descriptors provided by Jegou et al., available at http://lear.inrialpes.fr/people/jegou/data.php.
Fig. 5 The algorithm used to perform local feature filtering based on visual saliency. In the last step of the algorithm, the four images correspond to different percentages of removed features with the lowest saliency values
For our grid-dense scheme, we selected a pixel on a grid of 15 × 15 every 5 pixels, producing 11,000 local features per image of size 800 × 600.
The results for UKB are presented, for example, in Fig. 7a for CMI and Fig. 7b for SIFT.
Filtering dense local features with respect to their visual saliency values has the same impact as the previous filtering (Fig. 6). We can conclude that, using CMI on UKB, saliency filtering does not negatively impact the retrieval results as long as an adequate threshold is respected. Moreover, our results show that the precision score increases while deleting up to 50% of the least salient local features: +1.25 to +2.5%. This highlights that using too many non-salient keypoints has the same effect as introducing noise, leading to a small decrease in the retrieval precision. With a grid-dense selection and filtering by visual saliency value, CMI shows that salient local features are particularly important, insofar as the difference between the two curves in Fig. 7a is 19.25% for 30% and 31% for 50% of removed features.

Fig. 6 Local features detected by Harris-Laplace filtered according to their saliency value (curves: "More Salient" vs. "Less Salient" removed). K is the size of the visual codebook. (a) CMI: K = 300. (b) SIFT: K = 10,000

Fig. 7 Filtering dense selected local features according to their saliency value (curves: "More Salient" vs. "Less Salient" removed). (a) CMI: K = 300, (b) SIFT: K = 10,000

Our different results highlight the importance of salient local features for a correct retrieval on UKB, both with Harris-Laplace detection and dense selection. To validate the independence from the database, we conducted the same evaluations on Holidays. The impact on the retrieval is measured with the mean Average Precision (mAP) scores represented in Fig. 8. The observations are similar: the salient local features lead to better retrieval. The difference between deleting the less salient and the more salient features on Holidays (5%) is less important than that observed on UKB. This suggests that the importance of salient local features for retrieval depends on the database and the descriptor.
These first results obtained with Itti et al.'s model [18] are also confirmed with the three following models:
1. Graph-Based Visual Saliency (GBVS), a bottom-up visual saliency model proposed by Harel et al. [16];
2. Prediction of INterest point Saliency (PINS), proposed by Nauge et al. [38] and based on the correlation between interest points and gaze points;
3. the model of Yin Li et al. [26], based on conditional entropy.
Fig. 8 mAP scores after filtering local features according to their saliency values on Holidays
Let $C_1$ and $C_2$ be two curves; the area value $A$ is obtained with Eq. (3):

$A = \sum_{i \in I} \lvert C_1(i) - C_2(i) \rvert \qquad (3)$

Table 2 Computation of the area between the curves obtained when deleting the most salient and the less salient local features. Bold values correspond to best scores
Databases   Descriptors        Saliency model   Area value
UKB         CMI, K=300         Itti et al.      10.06
                               GBVS              9.11
                               PINS              8.95
                               Yin Li            8.85
            SIFT, K=10,000     Itti et al.       3.61
                               GBVS              4.34
                               PINS              3.54
                               Yin Li            3.32
Holidays    SIFT, K=10,000     Itti et al.       0.63
                               GBVS              1.65
                               PINS              0.40
                               Yin Li            0.51
            SIFT, K=200,000    Itti et al.       0.51
                               GBVS              2.28
                               PINS              1.20
                               Yin Li            0.62
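Equation (3) amounts to the following one-liner, assuming the two curves are sampled at the same removal percentages.

```python
import numpy as np

def area_between_curves(c1, c2):
    """Area criterion of Eq. (3): sum of absolute differences between two
    retrieval curves sampled at the same removal percentages i in I."""
    c1, c2 = np.asarray(c1, dtype=float), np.asarray(c2, dtype=float)
    return float(np.abs(c1 - c2).sum())

# e.g. area_between_curves(scores_more_salient_removed, scores_less_salient_removed)
```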
Fig. 9 The results obtained when deleting the less salient local features, for the four saliency models (Itti, GBVS, PINS, Yin Li). (a) UKB-CMI, (b) UKB-SIFT, (c) Holidays-SIFT 10,000, (d) Holidays-SIFT 200,000
Even if this study shows that the results are much better with GBVS, the main conclusion of our experiments is the importance of filtering local keypoints according to their saliency value.
7 Discussion
The different evaluations we have conducted about the impact of selecting local
features according to their visual saliency show that:
• this filtering does not significantly affect the results;
• the salient local features are important for image retrieval.
The results presented in this chapter confirm that visual saliency can be useful and helpful for a more accurate image retrieval. In particular, they highlight that it can easily enhance dense selection results:
• by deleting a certain proportion of the less salient keypoints;
• by reducing the quantity of local features without a negative impact on the retrieval score.
The different detectors studied do not extract an important quantity of salient information. This observation has, among others, two important outcomes:
1. Visual saliency usage should be an additional process in the whole image retrieval framework, both while indexing and while retrieving images; most of the available tools today do not include saliency information.
2. Adding more salient local features could create more accurate signatures for most usual images, improving at the same time the retrieval process.
We have just started our investigations on the second outcome by replacing the less salient local features detected with more salient ones. The first results of this new research direction were obtained on the UKB dataset using the CMI descriptor: we add salient local features from the dense CMI to the Harris-Laplace CMI. The results presented in Fig. 10 confirm our hypothesis.
Replacing the less salient local features with the most salient ones from dense detection seems to be a good compromise to use visual saliency in order to improve the retrieval. Indeed, the score increases by 3.75% with 20% of replaced keypoints. Of course, this improvement is small, but it shows that this direction is worth investigating further, as all results were improved.

Fig. 10 Replacing the less salient points detected by Harris-Laplace by the most salient ones selected with dense quantization (curves: replacement vs. removal without substitution)
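A sketch of this replacement strategy, under the same assumptions as the earlier filtering code (keypoints as (x, y) pairs, normalized saliency map), could look as follows.

```python
import numpy as np

def replace_least_salient(hl_keypoints, dense_keypoints, saliency_map, fraction):
    """Discard `fraction` of the least salient Harris-Laplace keypoints and
    substitute the same number of the most salient dense-grid keypoints
    (saliency is read at each keypoint position)."""
    def sal(points):
        return np.array([saliency_map[int(round(y)), int(round(x))] for x, y in points])

    n = int(fraction * len(hl_keypoints))
    keep_hl = np.argsort(sal(hl_keypoints))[n:]               # least salient HL dropped
    add_dense = np.argsort(sal(dense_keypoints))[::-1][:n]    # most salient dense kept
    return ([hl_keypoints[i] for i in keep_hl] +
            [dense_keypoints[i] for i in add_dense])
```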
8 Emotion Classification
In this section we apply the previously described image feature detectors to emotion classification. In order to perform such a classification, one needs to use a specific database constructed for this application. There are many datasets proposed in the literature . . .
We applied the same optimal parameters used by the authors to create the
thumbnails of the images of our database. This second database has been rated
by 1166 participants.
In this section we present our results for local and global feature evaluation for emotional impact recognition. First, we discuss the results obtained on SENSE and IAPS. Finally, we compare those obtained on IAPS with some baselines from the literature.
In a previous work [13], we have shown that CBIR local and global descriptors can be used to define the emotional impact of an image. In Table 3, we summarize the results obtained after classification into the Positive and Negative emotion classes for each descriptor. In this table:
• WA4 and WA5 respectively mean Wave Atoms Scale 4 and Wave Atoms Scale 5.
• CM denotes Color Moments and CMI, Color Moment Invariants.
• OpSIFT means OpponentSIFT.
For the results, we use the notation Dataset_Visual codebook to denote the different configurations we have tested. Thus, in the SENSE1_I configuration, the visual signatures (Bags of Visual Words) of the images of SENSE1 are computed using the visual vocabulary from IAPS. The different configurations allow us to determine whether or not the results depend on the image database used to create the visual dictionary.
The different features do not behave in the same way when predicting emotions in the different configurations tested. For example, SIFT has approximately the same results for negative and positive emotions on IAPS and SENSE regardless of the vocabulary changes. On the contrary, CMI and WA4, for example, seem more adequate for negative images, with at least 50%.
Overall, the visual dictionary has little impact on the behavior of the descriptors for classification on SENSE and IAPS. However, the CM descriptors, for example, are affected: the rate of recognized negative images is significantly higher with the codebook from IAPS (+70% for SENSE images and +20% for IAPS images). The opposite effect is observed for positive images: −34% for SENSE images and −17% for IAPS images. This illustrates very well the impact of the variability of the database. Indeed, IAPS contains a lot of negative images: the dictionary built with this dataset allows negative emotions to be better recognized. Building the visual dictionary with SENSE improves the recognition of positive images, since this base contains many of them. We also conclude that negative images are much easier to recognize in the two databases that we have chosen.
Table 4 Comparison of correct average classification rates on SENSE and IAPS before and after fusion with Majority Voting

Configuration   Class      Before fusion (%)   After fusion (%)
SENSE1_S        Negative        55.56               60
                Positive        54.17               57.29
                Average         54.86               57.55
SENSE1_I        Negative        64.44               90
                Positive        54.86               64.58
                Average         59.65               66.98
IAPS_S          Negative        61.75               75.41
                Positive        47.13               41.38
                Average         54.44               58.82
IAPS_I          Negative        65.58               77.05
                Positive        45.02               46.55
                Average         55.30               62.18
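The Majority Voting fusion reported in Table 4 can be sketched as follows for one image, assuming each descriptor outputs a class label (the descriptor names in the example are hypothetical); the exact fusion used in the chapter may handle ties differently.

```python
from collections import Counter

def majority_vote(predictions):
    """Fuse per-descriptor class predictions for one image by keeping the most
    frequent label (ties resolved by Counter's ordering)."""
    return Counter(predictions.values()).most_common(1)[0][0]

# Example with hypothetical descriptor outputs for one image:
print(majority_vote({'CM': 'Negative', 'CMI': 'Negative',
                     'SIFT': 'Positive', 'OpSIFT': 'Negative'}))   # -> 'Negative'
```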
In order to validate our approach, we chose to compare our results on IAPS to three different papers from the literature:
• Wei et al. [58] use a semantic description of the images for emotional classification. The authors chose a discrete modeling of emotions in eight classes: "Anger", "Despair", "Interest", "Irritation", "Joy", "Fun", "Pride" and "Sadness". The classification rates they obtained vary from 33.25% for the class "Pleasure" to 50.25% for "Joy".
• Liu et al. [28] have proposed a system based on color, texture and shape features and a set of semantic descriptors based on colors. Their results on IAPS are 54.70% on average after a fusion with the Theory of Evidence and 52.05% with MV fusion.
• Machajdik et al. [32] use color, texture, composition and content descriptors. They chose a discrete categorization in eight classes: "Amusement", "Anger", "Awe", "Contentment", "Disgust", "Excitement", "Fear" and "Sad". The average classification rates vary from 55 to 65%; the lowest rate is obtained for the class "Contentment" and the highest for the class "Awe".
If we compare our results with those three, we clearly see that our method competes with them, since we obtain classification rates from 54.44 to 62.18%. The goal of this study is thus achieved, proving that the use of classical CBIR features can improve the performance of emotional impact classification.
The results of the subjective evaluations on SENSE2 show that the evaluation of regions of interest is equivalent to the full-image evaluation [11]. So we decided to substitute the SENSE1 images with those used during SENSE2. The results presented here concern the local descriptors. Because of the variable sizes of the ROI images (from 3 to 100% of the size of the original images), we chose a grid-dense selection. For an effective comparison, we also consider a grid-dense selection for SENSE1. The average classification rates are shown in Fig. 11. We notice that for a majority of descriptors, limiting the informative area to the salient region improves the results. The results by local keypoint descriptor are summarized in Fig. 12.
An improvement is made for the negative and positive classes when using SENSE2. The usage of the regions of interest obtained with the visual saliency model improves the results for positive and negative images, especially for SIFT and OpponentSIFT: +10%. The previous conclusions about SIFT-based descriptors remain valid.

Fig. 11 Average classification rates obtained for SENSE2 and SENSE1 with a dense selection of local features

Fig. 12 Classification rates obtained for SENSE2 and SENSE1 for 2-class classification by feature (CM, CMI, SIFT, OpponentSIFT)
9 Conclusion

In this chapter we have evaluated the saliency of four local feature detectors:
Harris, Harris-Laplace, DOG and FAST. The threshold used to decide that a point is salient was fixed after conducting a user study; this threshold allows the different objects in the image to be easily recognized. We chose to study the behavior of local feature saliency on two databases used for image retrieval and categorization and on two others used for saliency studies. The observations are globally similar
whatever the saliency model used among the four we tested. These conclusions are consistent with previous studies from the literature and allow us to consider different perspectives, which include finding the right proportion for the filtering of the less salient local features without affecting the retrieval results. Another perspective of our study is to consider a top-down visual saliency model, as the four tested are bottom-up, and to compare the results.
Concerning emotional impact recognition, for SENSE2 we used a bounding box of the different salient areas; we think that a more precise region definition should be studied: defining several regions of interest per image and determining the emotion of each region. The final emotion of the image could then be a combination of the negative and positive areas, thereby taking up the idea of the harmony of a multi-colored image from Solli et al. [52]. The fusion method could be determined from subjective evaluations, to find the correct weighting between negative and positive "patches" forming the final emotional impact.
References
1. Abdel-Hakim, A.E., Farag, A.A.: CSIFT: A SIFT Descriptor with color invariant character-
istics. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (2006)
2. Bay, H., Tuytelaars, T., Van Gool, L.: Surf: speeded up robust features. Lecture Notes in
Computer Science, vol. 3951, pp. 404–417. Springer, Berlin (2006)
3. Beke, L., Kutas, G., Kwak, Y., Sung, G.Y., Park, D., Bodrogi, P.: Color preference of aged
observers compared to young observers. Color. Res. Appl. 33(5), 381–394 (2008)
4. Borji, A., Sihite, D., Itti, L.: Quantitative analysis of human-model agreement in visual saliency
modeling: A comparative study. IEEE Trans. Image Process. 22(1), 55–69 (2013)
5. Boyatziz, C., Varghese, R.: Children’s emotional associations with colors. J. Gen. Psychol.
155, 77–85 (1993)
6. Bradley, M.M., Codispoti, M., Sabatinelli, D., Lang, P.J.: Emotion and motivation ii: sex
differences in picture processing. Emotion 1(3), 300–319 (2001)
7. Csurka, G., Bray, C., Dance, C., Fan, L.: Visual categorization with bags of keypoints. In:
Workshop on Statistical Learning in Computer Vision, ECCV, pp. 1–22 (2004)
8. Denis, P., Courboulay, V., Revel, A., Gbehounou, S., Lecellier, F., Fernandez-Maloigne, C.:
Improvement of natural image search engines results by emotional filtering. EAI Endorsed
Trans. Creative Technologies 3(6), e4 (2016). https://hal.archives-ouvertes.fr/hal-01261237
9. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The pascal visual
object classes (VOC) challenge. Int. J. Comput. Vis. 88(2), 303–338 (2010)
10. Gao, K., Lin, S., Zhang, Y., Tang, S., Ren, H.: Attention model based sift keypoints filtration
for image retrieval. In: Proceedings of IEEE International Conference on Computer and
Information Science, pp. 191–196 (2008)
11. Gbèhounou, S., Lecellier, F., Fernandez-Maloigne, C., Courboulay, V.: Can Salient Interest
Regions Resume Emotional Impact of an Image?, pp. 515–522 Springer, Berlin (2013). doi:10.
1007/978-3-642-40261-6_62. http://dx.doi.org/10.1007/978-3-642-40261-6_62
12. Gbehounou, S., Lecellier, F., Fernandez-Maloigne, C.: Evaluation of local and global des-
criptors for emotional impact recognition. J. Vis. Commun. Image Represent. 38, 276–283
(2016)
13. Gbèhounou, S., Lecellier, F., Fernandez-Maloigne, C.: Evaluation of local and global des-
criptors for emotional impact recognition. J. Vis. Commun. Image Represent. 38(C), 276–283
(2016). doi:10.1016/j.jvcir.2016.03.009. http://dx.doi.org/10.1016/j.jvcir.2016.03.009
14. González-Díaz, I., Buso, V., Benois-Pineau, J.: Perceptual modeling in the problem of active
object recognition in visual scenes. Pattern Recognition 56, 129–141 (2016). doi:10.1016/j.
patcog.2016.03.007. http://dx.doi.org/10.1016/j.patcog.2016.03.007
15. Gordoa, A., Rodriguez-Serrano, J.A., Perronnin, F., Valveny, E.: Leveraging category-level
labels for instance-level image retrieval. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 3045–3052 (2012)
16. Harel, J., Koch, C., Perona, P.: Graph-based visual saliency. In: Advances in Neural
Information Processing Systems, pp. 545–552. MIT Press, Cambridge (2007)
17. Harris, C., Stephens, M.: A combined corner and edge detector. In: Proceedings of the 4th
Alvey Vision Conference, pp. 147–151 (1988)
18. Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene
analysis. IEEE Trans. Pattern Anal. Mach. Intell. 20(11), 1254–1259 (1998)
19. Jegou, H., Douze, M., Schmid, C.: Hamming embedding and weak geometric consistency for
large scale image search. In: Proceedings of the 10th European Conference on Computer
Vision: Part I, ECCV’08, pp. 304–317. Springer, Berlin (2008)
20. Jégou, H., Douze, M., Schmid, C., Pérez, P.: Aggregating local descriptors into a compact
image representation. In: Proceedings of the 23rd IEEE Conference on Computer Vision &
Pattern Recognition, pp. 3304–3311. IEEE Computer Society, New York (2010)
21. Kaya, N., Epps, H.H.: Color-emotion associations: Past experience and personal preference.
In: AIC Colors and Paints, Interim Meeting of the International Color Association (2004)
22. Ke, Y., Sukthankar, R.: PCA-SIFT: a more distinctive representation for local image des-
criptors. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and
Pattern Recognition, vol. 2, pp. 506–513 (2004)
23. Kootstra, G., de Boer, B., Schomaker, L.: Predicting eye fixations on complex visual stimuli
using local symmetry. Cogn. Comput. 3(1), 223–240 (2011)
24. Lang, P.J., Bradley, M.M., Cuthbert, B.N.: International affective picture system (IAPS):
affective ratings of pictures and instruction manual. technical report A-8. Technical Report,
University of Florida (2008)
25. Le Meur, O., Le Callet, P., Barba, D., Thoreau, D.: A coherent computational approach to
model bottom-up visual attention. IEEE Trans. Pattern Anal. Mach. Intell. 28(5), 802–817
(2006)
26. Li, Y., Zhou, Y., Yan, J., Niu, Z., Yang, J.: Visual saliency based on conditional entropy. Lecture
Notes in Computer Science, vol. 5994, pp. 246–257. Springer, Berlin (2010)
27. Liu, W., Xu, W., Li, L.: A tentative study of visual attention-based salient features for image
retrieval. In: Proceedings of the 7th World Congress on Intelligent Control and Automation,
pp. 7635–7639 (2008)
28. Liu, N., Dellandréa, E., Chen, L.: Evaluation of features and combination approaches for the
classification of emotional semantics in images. In: International Conference on Computer
Vision Theory and Applications (2011)
29. Lowe, D.G.: Object recognition from local scale-invariant features. In: International Confer-
ence on Computer Vision, vol. 2, pp. 1150–1157 (1999)
30. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60,
91–110 (2004)
31. Lucassen, M.P., Gevers, T., Gijsenij, A.: Adding texture to color: quantitative analysis of color
emotions. In: Proceedings of CGIV (2010)
32. Machajdik, J., Hanbury, A.: Affective image classification using features inspired by psychol-
ogy and art theory. In: Proceedings of the international conference on Multimedia, pp. 83–92
(2010)
33. Mikolajczyk, K., Schmid, C.: Indexing based on scale invariant interest points. In: Proceedings
of the 8th IEEE International Conference on Computer Vision, vol. 1, pp. 525–531 (2001)
34. Mikolajczyk, K., Schmid, C.: An affine invariant interest point detector. In: Computer Vision-
ECCV. Lecture Notes in Computer Science, vol. 2350, pp. 128–142. Springer, Berlin (2002)
35. Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir,
T., Van Gool, L.: A comparison of affine region detectors. Int. J. Comput. Vision 65(1-2),
43–72 (2005)
100 S. Gbehounou et al.
36. Mindru, F., Tuytelaars, T., Van Gool, L., Moons, T.: Moment invariants for recognition under
changing viewpoint and illumination. Comput. Vis. Image Underst. 94(1–3), 3–27 (2004)
37. Moravec, H.P.: Towards automatic visual obstacle avoidance. In: Proceedings of the 5th Inter-
national Joint Conference on Artificial Intelligence, vol. 2, pp. 584–584. Morgan Kaufmann,
San Francisco (1977)
38. Nauge, M., Larabi, M.C., Fernandez-Maloigne, C.: A statistical study of the correlation
between interest points and gaze points. In: Human Vision and Electronic Imaging, p. 12.
Burlingame (2012)
39. Nistér, D., Stewénius, H.: Scalable recognition with a vocabulary tree. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 2161–2168 (2006)
40. Ou, L.C., Luo, M.R., Woodcock, A., Wright, A.: A study of colour emotion and colour
preference. part i: Colour emotions for single colours. Color. Res. Appl. 29(3), 232–240 (2004)
41. Ou, L.C., Luo, M.R., Woodcock, A., Wright, A.: A study of colour emotion and colour
preference. part ii: Colour emotions for two-colour combinations. Color. Res. Appl. 29(4),
292–298 (2004)
42. Ou, L.C., Luo, M.R., Woodcock, A., Wright, A.: A study of colour emotion and colour
preference. Part iii: colour preference modeling. Color. Res. Appl. 29(5), 381–389 (2004)
43. Paleari, M., Huet, B.: Toward emotion indexing of multimedia excerpts. In: Proceedings on
Content-Based Multimedia Indexing, International Workshop, pp. 425–432 (2008)
44. Perreira Da Silva, M., Courboulay, V., Prigent, A., Estraillier, P.: Evaluation of preys/predators
systems for visual attention simulation. In: Proceedings of the International Conference on
Computer Vision Theory and Applications, pp. 275–282, INSTICC (2010)
45. Perronnin, F.: Universal and adapted vocabularies for generic visual categorization. IEEE
Trans. Pattern Anal. Mach. Intell. 30(7), 1243–1256 (2008)
46. Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with large vocabular-
ies and fast spatial matching. In: IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR), Minneapolis, MI (2007)
47. Rosten, E., Drummond, T.: Fusing points and lines for high performance tracking. In:
Proceedings of the IEEE International Conference on Computer Vision, vol. 2, pp. 1508–1511
(2005)
48. Rosten, E., Drummond, T.: Machine learning for high-speed corner detection. In: Proceedings
of the European Conference on Computer Vision, vol. 1, pp. 430–443 (2006)
49. Schmid, C., Mohr, R., Bauckhage, C.: Evaluation of interest point detectors. Int. J. Comput.
Vision 37(2), 151–172 (2000)
50. Sivic, J., Zisserman, A.: Video Google: A text retrieval approach to object matching in videos.
In: Proceedings of the International Conference on Computer Vision, pp. 1470–1477 (2003)
51. Smith, S.M., Brady, J.M.: Susan—a new approach to low level image processing. Int. J.
Comput. Vision 23(1), 45–78 (1997)
52. Solli, M., Lenz, R.: Color harmony for image indexing. In: Proceedings of the 12th
International Conference on Computer Vision Workshops, pp. 1885–1892 (2009)
53. Solli, M., Lenz, R.: Emotion related structures in large image databases. In: Proceedings of the
ACM International Conference on Image and Video Retrieval, pp. 398–405. ACM, New York
(2010)
54. Tuytelaars, T., Mikolajczyk, K.: Local invariant feature detectors: a survey. Found. Trends
Comput. Graph. Vis. 3(3), 177–280 (2008)
55. Urruty, T., Gbèhounou, S., Le, T.L., Martinet, J., Fernandez-Maloigne, C.: Iterative random
visual words selection. In: Proceedings of International Conference on Multimedia Retrieval,
ICMR’14, pp. 249–256. ACM, New York (2014)
56. van de Sande, K.E.A., Gevers, T., Snoek, C.G.M.: Evaluating color descriptors for object and
scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 32(9), 1582–1596 (2010)
57. Wang, W., Yu, Y.: Image emotional semantic query based on color semantic description. In:
Proceedings of the The 4th International Conference on Machine Leraning and Cybernectics,
vol. 7, pp. 4571–4576 (2005)
Image Saliency Information for Indexing and Emotional Impact Analysis 101
58. Wei, K., He, B., Zhang, T., He, W.: Image Emotional classification based on color semantic
description. Lecture Notes in Computer Science, vol. 5139, pp. 485–491. Springer, Berlin
(2008)
59. Yanulevskaya, V., Van Gemert, J.C., Roth, K., Herbold, A.K., Sebe, N., Geusebroek, J.M.:
Emotional valence categorization using holistic image features. In: Proceedings of the 15th
IEEE International Conference on Image Processing, pp. 101–104 (2008)
60. Zdziarski, Z., Dahyot, R.: Feature selection using visual saliency for content-based image
retrieval. In: Proceedings of the IET Irish Signals and Systems Conference, pp. 1–6 (2012)
61. Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C.: Local features and kernels for classifi-
cation of texture and object categories: a comprehensive study. Int. J. Comput. Vis. 73(2),
213–238 (2007)
62. Zhang, L., Tong, M.H., Marks, T.K., Shan, H., Cottrell, G.W.: Sun: A Bayesian framework for
saliency using natural statistics. J. Vis. 8(7), 1–20 (2008)
Saliency Prediction for Action Recognition

M. Dorr (Technical University Munich, Munich, Germany; e-mail: [email protected]) and E. Vig (German Aerospace Center, Oberpfaffenhofen, Germany; e-mail: [email protected])
Abstract Despite all recent progress in computer vision, humans are still far
superior to machines when it comes to the high-level understanding of complex
dynamic scenes. The apparent ease of human perception and action cannot be
explained by sheer neural computation power alone: Estimates put the transmission
rate of the optic nerve at only about 10 Mbit/s. One particularly effective strategy to
reduce the computational burden of vision in biological systems is the combination
of attention with space-variant processing, where only subsets of the visual scene are
processed in full detail at any one time. Here, we report on experiments that mimic
eye movements and attention as a preprocessing step for state-of-the-art computer
vision algorithms.
1 Introduction
The human brain is remarkably energy efficient and runs on about 10–15 W of
power, less than most laptop computers. By the standards of the animal kingdom,
however, the human brain is already quite big, and many species with much less
neural hardware nevertheless perceive and act in complex environments seemingly
without effort. In contrast to this, even supercomputers still struggle with the under-
standing of dynamic scenes, despite a highly active computer vision community,
its rapid progress, and the recent surge of bio-inspired, “deep” neural-network
architectures that have shattered many benchmarks. Computer vision performance
may have reached or even surpassed human performance in more abstract, static
object recognition scenarios such as handwritten character or traffic sign recognition
[9]; in fully dynamic, unconstrained environments, this has not been the case
yet. One particular processing strategy that biological agents employ to improve
efficiency is selective attention: at any given time, only a fraction of the entire
(visual) input is processed in full detail. In combination with efficient coding, this
allows humans to process complex visual inputs despite the limited transmission
bandwidth of the optic nerve that is estimated to be comparable to an Ethernet
link (10 Mbit/s) [21]. In humans, attention is also closely linked to eye movements,
which are executed several times per second and direct the high-resolution centre of
the retina to points of interest in the visual scene.
Because of the potential to reduce bandwidth requirements, models of attention
and eye movements, or saliency models, have long been and still are an active
research field [4]. For static images, state-of-the-art models have come close to
human performance (meaning the implicit, typically subconscious choice of where to
direct gaze), although there are at least two caveats: first, it is still a matter of debate
how to best quantify the similarity between machine predictions and actual human
eye movements [5]; second, the laboratory-based presentation of static images for
prolonged inspection is not a very accurate representation of real-world viewing
behaviour and thus might give rise to idiosyncratic viewing strategies [11].
The more challenging case of saliency for videos, however, has received less
attention (but see Chap. 3 in this book [8]). A likely contributor to this deficit has
been the lack of standardized benchmarks that make it easier to directly compare dif-
ferent models, and consequently improve upon them. Yet, video processing has high
computational cost and therefore could particularly benefit from attention-inspired
efficiency gains. One computer vision application of note is action recognition: out
of a set of (human) actions, which action is depicted by a short video clip? Current
approaches to this problem extract densely sampled descriptors from the whole
scene. While this provides full coverage, it also comes at high computational cost,
and descriptors from uninformative, non-salient image regions may even impair
classification performance.
In this chapter, we shall therefore extend previous work on saliency-based
descriptor pruning for action recognition [41, 42]; very similar, independently
developed work was published in [25, 26]. Since these original publications,
the state-of-the-art action recognition processing pipeline has been improved to
incorporate an implicit foreground saliency estimation step [48], and we shall
investigate whether an additional explicit pruning can further improve performance.
2 Related Work
For recent overviews of the highly active and very wide field of action recognition,
we point the reader to the surveys [12, 44].
Despite the recent success of deep learning methods for video-related tasks,
hand-crafted features are still indispensable when designing state-of-the-art action
recognition solutions. These methods typically rely on the—by now—standard
improved Dense Trajectories (iDT) [45, 46, 48] representation that aggregates local
spatio-temporal descriptors into a global video-level representation through the Bag
of Words (BoW) or the Fisher Vector encoding. Along trajectories located at high
motion contrast, rich appearance and motion descriptors, such as HOG, HOF, and
MBH are extracted. Several improvements have been proposed to this standard
pipeline, from stacking features extracted at temporally subsampled versions of
the video [22] to employing various descriptor- and representation-level fusion
methods [28]. Interestingly, by paying careful attention to details (of normalization,
and data- and spatio-temporal augmentations), [12] could show that the iDT pipeline
is on par with the state of the art, including recent deep methods, on five standard
benchmarks.
A number of more recent works explored deep architectures for action recogni-
tion. These methods aim at automatically learning discriminative representations
end-to-end and must therefore rely on vast training data sets, such as Sports-
1M [20], comprising more than a million YouTube videos. Notable deep
approaches include the extension of 2D convolutional neural networks (CNNs) to
the time domain [20], the Two-Stream architecture [35] with two separate CNNs
for appearance and motion modeling, as well as the combination of recurrent
and convolutional networks to encode the temporal evolution of actions [10].
Overall, however, end-to-end deep models only achieved marginal improvements
over the established hand-tuned baselines. For better performance, these methods
often complement their learned representations with dense trajectory features.
Alternatively, hybrid architectures leverage the representational power of deep per-
frame feature maps in a Bag of Words pipeline (e.g. TDD method [47]).
3 Methods
In this section, we shall describe materials and methods of the work underlying this
chapter. Because several of the analyses presented here are extensions of previous
work and thus have been described before, we will put particular emphasis on the
novel saliency measure based on smooth pursuit eye movements.
Even for a single human action alone, the space of possible scenes depicting that
particular action is incredibly large. Thus, any finite data set for action recognition
can only be a coarse approximation to real-world action recognition. Over the past
few years, several benchmark data sets have been made available with varying
difficulty and complexity.
For this chapter, we focus on the Hollywood2 data set of short excerpts from
professionally produced Hollywood movies [24]. This data set comprises 823
training and 884 test clips with overall about half a million frames and 100
billion pixels, and using the Dense Trajectories pipeline (see below), intermediate
processing steps require about half a terabyte of storage space. This makes it still
possible to handle this data set without very large-scale computing facilities, and
yet bandwidth gains by saliency-based descriptor pruning would be desirable. At
the same time, performance has not reached ceiling on this challenging data set yet,
after almost a decade of intense research.
Because of its popularity, there are two independent eye movement data sets that
were recorded for the Hollywood2 data set [25, 26, 41, 42], namely by Mathe et al.
and Vig et al. The two groups used different eye trackers to record gaze: the Mathe data
set was collected monocularly with an SMI iView X HiSpeed 1250 at 500 Hz, while
data collection for the Vig data set used an SR Research EyeLink 1000 at 1000 Hz
that tracked both eyes simultaneously. While this difference should, at least theoretically, have negligible consequences for the recorded gaze locations, the viewing distances
(60 and 75 cm, respectively) and screen sizes (38.4 and 41.3 deg, respectively)
also differed; previous work has shown an effect of stimulus size on saccade
behaviour [49].
Most importantly, however, tasks also subtly differed between the data sets: in
the Mathe data set, subjects either performed a free-viewing task or the same action
recognition task as in the original computer vision benchmark, where they had to
explicitly name the presented actions after each video clip. By comparison, the task
in the Vig data set was constrained to an intermediate degree, and subjects were
asked to silently identify the presented actions.
We therefore computed the Normalized Scanpath Saliency (NSS) [29] to check
for systematic differences in the spatio-temporal distribution of gaze in the two data
sets. For NSS, a fixation map is first created by superimposing spatio-temporal
Gaussians (128 × 128 pixels and 5 frames support, σ = 0.21) at each gaze
sample of one “reference” group (e.g. the Mathe data set). This fixation map is then
normalized to zero mean and unit standard deviation, and NSS is the average of this
fixation map’s values for all gaze samples of the other “test” group (e.g. the Vig
data set). Similar gaze patterns in both groups correspond to high NSS values, and
unrelated gaze patterns (chance) correspond to zero. For a comparative evaluation,
we computed NSS both within- and between data sets: if NSS between subsets of
e.g. Mathe is similar to NSS between Mathe and Vig, we can assume that both data
sets have little systematic difference. Because of the differing number of subjects,
we used gaze data from only one subject at a time to form the initial fixation map;
this was repeated for each subject and for up to 5 other subjects as “tests”.
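To make this measure concrete, the following is a minimal sketch of an NSS computation on one video, assuming gaze samples are given as integer (frame, y, x) indices into the video volume; the function name and the Gaussian widths are illustrative placeholders and not the parameters of the original evaluation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def nss(reference_gaze, test_gaze, shape, sigma=(1.0, 20.0, 20.0)):
    """Normalized Scanpath Saliency of `test_gaze` under a fixation map
    built from `reference_gaze`.

    Both gaze sets are (N, 3) arrays of (frame, y, x) indices into a video
    volume of the given `shape` = (frames, height, width).
    """
    # Build the fixation map: superimpose Gaussians at every reference
    # gaze sample (approximated here by separable Gaussian filtering).
    fixation_map = np.zeros(shape, dtype=np.float64)
    t, y, x = reference_gaze.T
    np.add.at(fixation_map, (t, y, x), 1.0)
    fixation_map = gaussian_filter(fixation_map, sigma=sigma)

    # Normalize the map to zero mean and unit standard deviation.
    fixation_map = (fixation_map - fixation_map.mean()) / fixation_map.std()

    # NSS is the mean normalized saliency at the test gaze samples.
    t, y, x = test_gaze.T
    return fixation_map[t, y, x].mean()
```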
We follow the standard (improved) Dense Trajectories pipeline from [45, 46, 48].
Based on optical flow fields, trajectories are computed first, and then descriptors
are extracted along these trajectories from densely sampled interest points. These
descriptors comprise the shape of the trajectory, HOG, HOF, and Motion Boundary Histograms (MBH). Mostly due to camera motion, many descriptors extracted by the original pipeline corresponded to trajectories of irrelevant background motion; the improved pipeline therefore estimates and compensates for camera motion to suppress such trajectories (implementation available at http://lear.inrialpes.fr/~wang/improved_trajectories).
It is well-established that human eye movements and attention show a bias towards
the centre of a visual stimulus [11, 19, 39, 40]. On the one hand, the central,
resting position of the eyeball requires the least energy expenditure by the six
ocular muscles. Under truly naturalistic free-viewing conditions, head movements
therefore typically immediately follow eye movements to reinstate the central
gaze position. On the other hand, the central position yields the best trade-off for
allocating the space-variant retinal resolution to the whole scene.
In Hollywood movies, this effect is typically further exacerbated by deliberate
staging of relevant persons and objects at the centre of the scene. As in previous
work, we therefore computed a central bias saliency measure that simply comprised
the distance to the screen centre.
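As an illustration, such a central-bias map can be generated in a few lines; here the distance to the centre is inverted and normalized so that larger values mean more salient, which is an assumed sign convention rather than a detail taken from the chapter:

```python
import numpy as np

def central_bias_map(height, width):
    """Saliency that decreases with distance from the frame centre:
    1 at the centre, 0 at the farthest corner."""
    y, x = np.mgrid[0:height, 0:width]
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    dist = np.hypot(y - cy, x - cx)
    return 1.0 - dist / dist.max()
```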
Both available gaze data sets were combined and in a first step, blinks and
blink-induced artefacts immediately before or after them were removed. For each
video, we then created a spatio-temporal fixation density map by superimposing
3D-Gaussians at each valid gaze sample position; these Gaussians had a support of
384 by 384 pixels and 5 temporal frames, and a σ of 80 pixels in space and one frame
in time. Subsequently, these gaze density maps were linearly normalized per frame
to [0, 255] and stored to disk as MPEG-4 video streams.
Note that for simplicity, we did not perform explicit fixation detection, and thus
included saccadic and smooth pursuit gaze samples as well. Because of the strong
spatio-temporal blurring, the effect of this inclusion should be negligible.
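A minimal sketch of this density-map construction is given below; it pools gaze from all observers, approximates the superimposed 3D Gaussians by separable Gaussian filtering, and omits blink removal and the MPEG-4 export (the array layout and names are assumptions):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaze_density_maps(gaze_samples, n_frames, height, width,
                      sigma_space=80.0, sigma_time=1.0):
    """Spatio-temporal gaze density, one map per frame.

    `gaze_samples` is an (N, 3) array of (frame, y, x) indices of valid
    gaze samples pooled over all observers.
    """
    density = np.zeros((n_frames, height, width), dtype=np.float32)
    t, y, x = gaze_samples.T
    np.add.at(density, (t, y, x), 1.0)

    # Spatio-temporal blur: sigma of 80 px in space and one frame in time.
    density = gaussian_filter(density,
                              sigma=(sigma_time, sigma_space, sigma_space))

    # Linearly normalize each frame to [0, 255]; frames without gaze stay 0.
    for f in range(n_frames):
        peak = density[f].max()
        if peak > 0:
            density[f] = density[f] / peak * 255.0
    return density
```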
Even with modern video-based eye trackers, gaze traces can still be noisy and exhibit severe artefacts [18]. This is especially
problematic during the detection of smooth pursuit eye movements because these
have relatively low speeds and thus are harder to distinguish from fixations than
high-speed saccades. However, hand-labelling on large-scale data sets such as
Hollywood2, with more than 100 subject-video hours available [25, 42], is not
practically feasible anymore. Recently, we thus have developed a novel algorithm
for the automatic classification of smooth pursuit eye movements [1] that employs
a simple trick to substantially improve classification performance compared to
state-of-the-art algorithms. Because smooth pursuit can only be performed in the
presence of a moving target, there typically should only be a few candidate locations
(objects) for smooth pursuit per scene or video frame. As discussed above, we
expect pursuit mainly on especially informative image regions, and thus it is likely
that more than one observer will follow any particular pursuit target. At the same
time, noise in the gaze signal should be independent across different observers.
Combining these two observations, we can assume that the likelihood of a smooth
pursuit is high whenever several observers exhibit similar dynamic gaze patterns
in the same spatio-temporal video location; conversely, slow shifts in the gaze
position of individual observers are more likely noise artefacts. This approach can
be summarized in the following steps (a rough code sketch follows the list):
1. Discard all those gaze traces that are clearly fixations (fully static) or saccades
(very high speed).
2. Cluster remaining gaze samples using the DBSCAN algorithm [13].
3. Discard clusters that comprise gaze data from fewer than four observers.
4. Post-process clusters; discard episodes of less than 50 ms duration.
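The sketch below illustrates this multi-observer clustering idea with scikit-learn's DBSCAN; the speed thresholds, DBSCAN parameters, and data layout are placeholders, and the published algorithm [1] should be consulted for the actual values and post-processing:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def detect_smooth_pursuit(samples, low_speed=2.0, high_speed=40.0,
                          eps=0.05, min_samples=10, min_observers=4):
    """Mark gaze samples that belong to candidate smooth pursuit episodes.

    `samples` is a dict of equally long arrays: x, y (normalized to [0, 1]),
    t (seconds), speed (deg/s), observer (int id). Returns a boolean mask.
    """
    # 1. Discard samples that are clearly fixations or saccades.
    candidate = (samples["speed"] > low_speed) & (samples["speed"] < high_speed)
    idx = np.flatnonzero(candidate)

    # 2. Cluster the remaining samples jointly in (x, y, t); with x, y in
    #    [0, 1] and t in seconds, eps implicitly couples space and time.
    features = np.column_stack((samples["x"][idx], samples["y"][idx],
                                samples["t"][idx]))
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(features)

    keep = np.zeros(len(samples["x"]), dtype=bool)
    for label in set(labels) - {-1}:          # -1 marks DBSCAN noise
        members = idx[labels == label]
        # 3. Require gaze data from at least `min_observers` observers.
        if len(np.unique(samples["observer"][members])) < min_observers:
            continue
        # 4. Discard episodes shorter than 50 ms.
        if samples["t"][members].max() - samples["t"][members].min() < 0.05:
            continue
        keep[members] = True
    return keep
```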
This algorithm has been evaluated against a hand-labelled subset of the GazeCom
data set of naturalistic videos and has shown dramatically improved precision
at better recall than state-of-the-art algorithms [1]. However, the GazeCom data
set comprises unstaged everyday outdoor scenes that lead to high eye movement
variability except for some ’hot spots’ [11]. By contrast, professionally cut video
material such as Hollywood movies is specifically designed to focus attention on
few ‘objects’ (typically, characters) of interest, and thus gaze patterns are highly
coherent [11, 15, 17, 27]. In principle, it would thus be possible that almost all
observers cluster in very few locations, and noise in the gaze traces might get
misclassified as smooth pursuit. However, visual inspection of a sample of detected
smooth pursuit episodes showed that this was typically not the case; nevertheless,
even subtle image motion such as a character tilting their head often evoked
corresponding gaze motion.
An additional complication compared to the empirical saliency based on fixations
is the sparsity of smooth pursuit: whereas in the case of static images, fixations
will make up about 90% of the viewing time (with the remainder mainly spent
on saccades), the occurrence of smooth pursuit heavily depends on the stimulus.
In the Hollywood2 video set, the smooth pursuit rate per video clip ranged from
0% to almost 50%, with a mean of about 15% when taking the recall rate of the detection algorithm into account.
Whereas a wealth of saliency models exists for static images [2], often even with
publicly available implementations, far fewer models are available for video
saliency. As in previous work, we here chose to use the geometrical invariants of the
structure tensor, which indicate the locally used degrees of freedom in a signal, or its
intrinsic dimensionality (inD). The invariants H, S, and K correspond to a change in
the signal in one, two, and three directions, respectively. They are computed based
on the structure tensor

$$
J = \omega * \begin{pmatrix}
f_x f_x & f_x f_y & f_x f_t \\
f_x f_y & f_y f_y & f_y f_t \\
f_x f_t & f_y f_t & f_t f_t
\end{pmatrix} \tag{1}
$$

and its invariants are

$$
H = \tfrac{1}{3}\,\operatorname{trace}(J), \qquad
S = M_{11} + M_{22} + M_{33}, \qquad
K = |J|,
$$

where ω denotes a spatio-temporal smoothing of the tensor components and the $M_{ii}$ are the minors of J. Since i1D regions (H) mainly comprise spatial edges
and we were rather interested in spatio-temporal features, we here used i2D and
i3D features (S and K). K has previously been shown to be more predictive of eye
movements in dynamic natural scenes [43], but is also sparser than S. Because we
are interested more in salient regions than very fine localization, we first computed
the invariants on the second spatial scale of the Hollywood2 video clips as described
in [43] and then additionally blurred the resulting saliency videos with a spatial
Gaussian with a support of 21 and σ = 4 pixels.
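A compact sketch of how these invariants can be computed per pixel with NumPy is given below; the gradient estimation, the smoothing window ω, and the scale handling are simplified placeholders for the multi-scale procedure of [43]:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def structure_tensor_invariants(video, smoothing=2.0):
    """Invariants H, S, K of the spatio-temporal structure tensor of a
    (frames, height, width) float array; ω is approximated here by
    Gaussian smoothing of the tensor components."""
    ft, fy, fx = np.gradient(video)
    grads = {"x": fx, "y": fy, "t": ft}

    # Smoothed products of partial derivatives (the entries of J).
    comps = {a + b: gaussian_filter(grads[a] * grads[b], smoothing)
             for a, b in ["xx", "xy", "xt", "yy", "yt", "tt"]}

    J = np.stack([
        np.stack([comps["xx"], comps["xy"], comps["xt"]], axis=-1),
        np.stack([comps["xy"], comps["yy"], comps["yt"]], axis=-1),
        np.stack([comps["xt"], comps["yt"], comps["tt"]], axis=-1),
    ], axis=-2)                                  # shape (T, H, W, 3, 3)

    H = np.trace(J, axis1=-2, axis2=-1) / 3.0    # change in one direction
    S = (comps["yy"] * comps["tt"] - comps["yt"] ** 2   # sum of minors,
         + comps["xx"] * comps["tt"] - comps["xt"] ** 2  # change in two
         + comps["xx"] * comps["yy"] - comps["xy"] ** 2) # directions
    K = np.linalg.det(J)                         # change in three directions
    return H, S, K
```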
Humans are social creatures, and thus depictions of humans and particularly faces
are very strong attractors of attention, regardless of bottom-up saliency [7, 23]. Our
analytical saliency measures S and K do not contain an explicit face or human
detector, and we therefore computed the distribution of both analytical and empirical
saliency values in regions where humans had been detected, relative to saliency
values at randomly sampled locations.
In analogy to the central bias analysis above, we also computed the distribution
of saliency values at the locations of extracted iDT descriptors.
Based on previous work, we hypothesized that salient regions are more informative
than non-salient regions, and thus discarding descriptors extracted in less-salient
regions should either improve action recognition performance, or enable us to main-
tain baseline performance with fewer descriptors and thus at reduced bandwidth. To
prune descriptors based on saliency, we used an equation loosely based on the CDF
of the Weibull function,
$$
F(x;\, \varepsilon, k, \lambda) =
\begin{cases}
1 - e^{-(x/\lambda)^{k}}, & x > 0 \\
\varepsilon, & \text{otherwise,}
\end{cases} \tag{2}
$$

where x is the raw saliency value (normalized from [0, 255] to [0, 1]), and k, λ > 0 are the shape and scale parameters, respectively; we here used k = 1 and λ = 0.001. ε is an additional parameter that allows for low-probability sampling outside salient regions in order to achieve broad coverage of the scene. Throughout this chapter, we use ε = 0.01 unless noted otherwise.
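A minimal sketch of this pruning step is shown below; it assumes the raw saliency value under each descriptor has already been looked up in the corresponding saliency video, and all names are illustrative:

```python
import numpy as np

def keep_probability(saliency, k=1.0, lam=0.001, eps=0.01):
    """Probability of keeping a descriptor given its raw saliency value
    (0-255), following the Weibull-CDF-based rule of Eq. (2)."""
    x = np.asarray(saliency, dtype=np.float64) / 255.0
    p = 1.0 - np.exp(-(x / lam) ** k)
    return np.where(x > 0, p, eps)

def prune_descriptors(descriptors, saliency_at_descriptor, rng=None):
    """Randomly keep descriptors with probability F(saliency)."""
    rng = rng or np.random.default_rng()
    p = keep_probability(saliency_at_descriptor)
    mask = rng.random(len(descriptors)) < p
    return descriptors[mask]
```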
4 Results
In this section, we report the results of the analyses described above.
The Hollywood2 data set exhibits central biases on several levels as evidenced by
Figs. 1 and 2. The spatial distribution of human detection bounding boxes is shown
in Fig. 1. Along the horizontal dimension, the distribution is roughly symmetrical
with a clear peak in the centre, likely reflecting the cinematographer’s preference
to frame relevant characters in the centre of the scene. The distribution along the
vertical dimension is also peaked at the centre, but shows a clear asymmetry for the
top and bottom parts of the screen because humans are literally footed in the ground
plane.
In the same figure, the spatial distribution of extracted improved Dense Trajec-
tory descriptors is also presented. These correspond to moving features in the scene
and while the distributions are also peaked near the screen centre, the central bias is not as clearly pronounced as for human detections, particularly along the horizontal dimension.

Fig. 1 Spatial distribution of improved Dense Trajectory descriptors and human bounding boxes in Hollywood2. Particularly for the locations of detected humans, there is a clear central bias along the horizontal axis (top); along the vertical axis, this bias is less pronounced and the distribution of bounding boxes is also skewed towards the bottom of the screen (bottom)
The analytical saliency measures S and K should also respond strongly to moving
image features. Nevertheless, their central bias is weaker than for iDT descriptors
(Fig. 2) and it is more in line with the biases exhibited by oculomotor behaviour,
empirical saliency (ES), and smooth pursuit (SP) saliency.
Surprisingly, the peaks of the distributions of empirical saliency are slightly
shifted left and down from the peaks of the image-based saliency measures and
the image centre, and smooth pursuit eye movements seem less prone to the central
bias than fixations.
Fig. 2 Marginal distributions along the horizontal (top) and vertical (bottom) image dimensions
of empirical and analytical saliency measures. The geometrical invariant K, which corresponds
to intrinsically three-dimensional features and is thus less redundant and more informative than
the i2D features represented by S, is more centrally biased than S. Among the eye movement-
based measures, smooth pursuit (SP) episodes are less centrally biased than regular empirical
saliency (ES)
Results for our evaluation of the similarity of the two independently collected
gaze data sets for Hollywood2 are shown in Fig. 3. The highest coherence across
observers can be found for the “active” condition in the Mathe data set (median NSS
score 6.21), where subjects explicitly performed the action recognition task. The
“free” condition constrained subjects less and showed very comparable NSS scores
as the Vig data set (median NSS scores of 5.05, and 4.65, respectively). Notably,
the “cross-free” condition, which assessed the similarity of subjects from the Vig
data set to those in the Mathe "free" condition and vice versa, had similar NSS scores (median 4.93) as the two within-group comparisons, indicating no substantial differences between the data sets despite the differences in hardware setup and protocol.

Fig. 3 Normalized Scanpath Saliency distribution for the two gaze data sets "Mathe" [25] and "Vig" [42]. Eye movements are most coherent in the active condition of the Mathe data set where subjects had to explicitly solve an action recognition task. The free-viewing condition has similar NSS scores as the Vig data set, where subjects were only told to look for the actions, but not to expressly report them. The inter-data set NSS distribution ("cross_free") is very similar to both intra-data set NSS distributions ("Mathe_free" and "Vig"), indicating that there is no fundamental difference between the eye movement data sets recorded by different groups and with different hardware setups
Figure 4 shows the relationship of both empirical and analytical saliency measures
on the one hand and human bounding boxes on the other hand. Because the
distributions of raw saliency values follow a power law and their strong peaks at zero make it hard to discern any differences, we here chose a different visualization: for each raw saliency value (or histogram bin) between 0 and 255, we plot the log-ratio of the number of pixels with that value within human bounding boxes versus the number of pixels with that value at random locations. If human bounding boxes and saliency were unrelated, we would therefore expect a flat line around zero (log of unit ratio). For all saliency measures, there is a clear trend towards higher ratios for higher saliency values, i.e. a systematic correlation between the presence of humans in the scene and their associated saliency. This effect is particularly strong for the empirical saliency measure based on smooth pursuit; the analytical saliency measures S and K capture bottom-up saliency such as edges and corners rather than semantics and are therefore less correlated.

Fig. 4 Relationship of saliency values and detected humans in Hollywood2. Shown here is the histogram of log-ratios of saliency values within the detected bounding boxes and at random locations. Higher saliency values have log-ratios greater than zero, indicating that humans are classified as salient by all measures. Eye movement-based measures capture the saliency of humans better than analytical measures, and smooth pursuit (SP) performs better than empirical saliency (ES) does
A similar analysis is shown in Fig. 5, but for the saliency values at the locations
of extracted descriptors. There is a systematic relationship between the analytical
saliency measures and descriptors, which is to be expected given that they both are functions of local image structure. However, there is also a strong effect on the empirical saliency measures; in other words, descriptors are doing a good job of selecting informative image regions (as determined by the attention of human observers) already.

Fig. 5 Relationship of saliency values and iDT descriptors in Hollywood2. Shown here is the histogram of log-ratios of saliency values at extracted feature locations and at random locations. The intrinsically three-dimensional measure K shows a stronger relation to iDT descriptors than the intrinsically two-dimensional measure S
Without descriptor pruning, the iDT pipeline with human detection and the Bag of
Words encoding resulted in a mean average precision of 62.04%. This is in line with
expectations based on [48] and the increased number of codebook vectors.
Fig. 6 Effect of saliency-based descriptor pruning for analytical saliency measures; dashed
horizontal line indicates baseline performance for the full descriptor set (=100%). For the image-
based measures S and K, performance quickly deteriorates below baseline; S may give a very
small benefit for a pruning of descriptors by about 10% only. However, central saliency, which
exploits the central bias in professionally directed Hollywood video clips, improves upon baseline
performance for moderate pruning levels and maintains baseline performance for a pruning by
almost 60%
As shown in Fig. 6, the central bias measure improved upon baseline performance with a substantially reduced set of descriptors (all descriptors in a radius of 0.3 around the image centre). Even with 44.4% of descriptors (radius 0.2), performance is still slightly above baseline (62.17%).
Fig. 7 Effect of saliency-based descriptor pruning for empirical saliency measures; dashed
horizontal line indicates baseline performance for the full descriptor set (=100%). Because of the
sparsity of smooth pursuit eye movements, more than two thirds of descriptors are pruned for
the “SP-pure” measure, and action recognition performance is substantially worse than baseline.
Augmenting the descriptor set by additionally sampling outside of the SP regions (“SP-mixed”)
brings performance back to baseline, but does not improve upon it. The empirical saliency measure
ES, which is based on raw gaze samples, performs better than SP, but yields only very little
improvement relative to baseline
Fig. 8 Performance for combinations of analytical saliency S with a central bias or human
detection. Performance decreases relative to the central bias alone
Figure 8 shows the performance of combined pruning criteria; results for the central bias measure are repeated from Fig. 6. A combination of human detections with analyt-
ical saliency S is actually slightly worse than S alone (62.14% at 86.9% retained
descriptors). Simultaneously pruning based on S and the “central” measure (with
radii 0.3 and 0.4, respectively) also reduces performance relative to the central
measure alone.
5 Discussion
the Hollywood2 data set, which we consider mostly artefacts. Truly unconstrained
video, such as that potentially encountered by robots or autonomous vehicles, thus
will likely pose future challenges.
Acknowledgements Our research was supported by the Elite Network Bavaria, funded by the
Bavarian State Ministry for Research and Education.
References
1. Agtzidis, I., Startsev, M., Dorr, M.: Smooth pursuit detection based on multiple observers.
In: Proceedings of the Ninth Biennial ACM Symposium on Eye Tracking Research &
Applications, ETRA’16, pp. 303–306. ACM, New York (2016)
2. Borji, A., Itti, L.: State-of-the-art in visual attention modeling. IEEE Trans. Pattern Anal.
Mach. Intell. 35(1), 185–207 (2013)
3. Buso, V., Benois-Pineau, J., González-Díaz, I.: Object recognition in egocentric videos with
saliency-based non uniform sampling and variable resolution space for features selection.
In: CVPR 2014 Egocentric (First-Person) Vision Workshop (2014)
4. Bylinskii, Z., Judd, T., Borji, A., Itti, L., Durand, F., Oliva, A., Torralba, A.: MIT Saliency
Benchmark (2016). http://saliency.mit.edu
5. Bylinskii, Z., Judd, T., Oliva, A., Torralba, A., Durand, F.: What do different evaluation metrics
tell us about saliency models? arXiv preprint arXiv:1604.03605 (2016)
6. Castelhano, M.S., Mack, M.L., Henderson, J.M.: Viewing task influences eye movement
control during active scene perception. J. Vis. 9(3), 6 (2009)
7. Cerf, M., Frady, P., Koch, C.: Faces and text attract gaze independent of the task: experimental
data and computer model. J. Vis. 9(12:10), 1–15 (2009)
8. Chaabouni, S., Benois-Pineau, J., Zemmari, A., Amar, C.B.: Deep saliency: prediction of
interestingness in video with CNN. In: Benois-Pineau, J., Le Callet, P. (eds.) Visual Content
Indexing and Retrieval with Psycho-Visual Models. Springer, Cham (2017)
9. Ciregan, D., Meier, U., Schmidhuber, J.: Multi-column deep neural networks for image
classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 3642–3649 (2012)
10. Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K.,
Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description.
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 2625–2634 (2015)
11. Dorr, M., Martinetz, T., Gegenfurtner, K., Barth, E.: Variability of eye movements when
viewing dynamic natural scenes. J. Vis. 10(10), 1–17 (2010)
12. de Souza, C.R., Gaidon, A., Vig, E., López, A.M.: Sympathy for the details: Dense trajectories
and hybrid classification architectures for action recognition. In: Proceedings of the European
Conference on Computer Vision, pp. 697–716. Springer, Cham (2016)
13. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters
in large spatial databases with noise. In: KDD Proceedings, vol. 96, pp. 226–231 (1996)
14. Feichtenhofer, C., Pinz, A., Wildes, R.P.: Dynamically encoded actions based on spacetime
saliency. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 2755–2764 (2015)
15. Goldstein, R.B., Woods, R.L., Peli, E.: Where people look when watching movies: Do all
viewers look at the same place? Comput. Biol. Med. 37(7), 957–964 (2007)
16. Harel, J., Koch, C., Perona, P.: Graph-based visual saliency. In: Advances in Neural Information Processing Systems, pp. 545–552. MIT Press, Cambridge (2007)
17. Hasson, U., Landesman, O., Knappmeyer, B., Vallines, I., Rubin, N., Heeger, D.J.: Neurocine-
matics: the neuroscience of film. Projections 2(1), 1–26 (2008)
18. Hooge, I., Holmqvist, K., Nyström, M.: The pupil is faster than the corneal reflection (CR): are
video based pupil-CR eye trackers suitable for studying detailed dynamics of eye movements?
Vis. Res. 128, 6–18 (2016)
19. Judd, T., Ehinger, K., Durand, F., Torralba, A.: Learning to predict where humans look.
In: Proceedings of IEEE International Conference on Computer Vision (ICCV), pp. 2106–
2113 (2009)
20. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video
classification with convolutional neural networks. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (2014)
21. Koch, K., McLean, J., Segev, R., Freed, M.A., Berry, M.J., II, Balasubramanian, V., Sterling, P.: How much the eye tells the brain. Curr. Biol. 16, 1428–1434 (2006)
22. Lan, Z., Lin, M., Li, X., Hauptmann, A.G., Raj, B.: Beyond Gaussian Pyramid: Multi-skip
feature stacking for action recognition. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 204–212 (2015)
23. Marat, S., Rahman, A., Pellerin, D., Guyader, N., Houzet, D.: Improving visual saliency by
adding ‘face feature map’ and ‘center bias’. Cogn. Comput. 5(1), 63–75 (2013)
24. Marszalek, M., Laptev, I., Schmid, C.: Actions in context. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pp. 2929–2936 (2009)
25. Mathe, S., Sminchisescu, C.: Dynamic eye movement datasets and learnt saliency models for
visual action recognition. In: Proceedings of the European Conference on Computer Vision,
pp. 842–856. Springer, Berlin (2012)
26. Mathe, S., Sminchisescu, C.: Actions in the eye: dynamic gaze datasets and learnt saliency
models for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 37(7), 1408–1424
(2015)
27. Mital, P.K., Smith, T.J., Hill, R., Henderson, J.M.: Clustering of gaze during dynamic scene
viewing is predicted by motion. Cogn. Comput. 3(1), 5–24 (2011)
28. Peng, X., Wang, L., Wang, X., Qiao, Y.: Bag of visual words and fusion methods for action
recognition. Comput. Vis. Image Underst. 150(C), 109–125 (2016)
29. Peters, R.J., Iyer, A., Itti, L., Koch, C.: Components of bottom-up gaze allocation in natural
images. Vis. Res. 45(8), 2397–2416 (2005)
30. Prest, A., Schmid, C., Ferrari, V.: Weakly supervised learning of interactions between humans
and objects. IEEE Trans. Pattern Anal. Mach. Intell. 34(3), 601–614 (2012)
31. Sapienza, M., Cuzzolin, F., Torr, P.H.: Learning discriminative space-time actions from weakly
labelled videos. In: Proceedings of the British Machine Vision Conference, vol. 2, p. 3 (2012)
32. Sapienza, M., Cuzzolin, F., Torr, P.H.: Learning discriminative space–time action parts from
weakly labelled videos. Int. J. Comput. Vis. 110(1), 30–47 (2014)
33. Shapovalova, N., Raptis, M., Sigal, L., Mori, G.: Action is in the eye of the beholder: eye-
gaze driven model for spatio-temporal action localization. In: Advances in Neural Information
Processing Systems, pp. 2409–2417 (2013)
34. Shi, F., Petriu, E., Laganiere, R.: Sampling strategies for real-time action recognition. In: Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2595–2602
(2013)
35. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in
videos. In: Advances in Neural Information Processing Systems, pp. 568–576 (2014)
36. Smith, T.J., Mital, P.K.: Attentional synchrony and the influence of viewing task on gaze
behavior in static and dynamic scenes. J. Vis. 13(8), 16–16 (2013)
37. Spering, M., Schütz, A.C., Braun, D.I., Gegenfurtner, K.R.: Keep your eyes on the ball: smooth
pursuit eye movements enhance prediction of visual motion. J. Neurophysiol. 105(4), 1756–
1767 (2011)
38. Sultani, W., Saleemi, I.: Human action recognition across datasets by foreground-weighted
histogram decomposition. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp. 764–771 (2014)
39. Tatler, B.W.: The central fixation bias in scene viewing: Selecting an optimal viewing position
independently of motor biases and image feature distributions. J. Vis. 7(14), 1–17 (2007).
http://journalofvision.org/7/14/4/
40. Tseng, P.H., Carmi, R., Cameron, I.G.M., Munoz, D.P., Itti, L.: Quantifying center bias of
observers in free viewing of dynamic natural scenes. J. Vis. 9(7), 1–16 (2009). http://
journalofvision.org/9/7/4/
41. Vig, E., Dorr, M., Cox, D.D.: Saliency-based selection of sparse descriptors for action
recognition. In: Proceedings of International Conference on Image Processing, pp. 1405–1408
(2012)
42. Vig, E., Dorr, M., Cox, D.D.: Space-variant descriptor sampling for action recognition based
on saliency and eye movements. In: Proceedings of the European Conference on Computer
Vision. LNCS, vol. 7578, pp. 84–97 (2012)
43. Vig, E., Dorr, M., Martinetz, T., Barth, E.: Intrinsic dimensionality predicts the saliency of
natural dynamic scenes. IEEE Trans. Pattern Anal. Mach. Intell. 34(6), 1080–1091 (2012)
44. Vrigkas, M., Nikou, C., Kakadiaris, I.A.: A review of human activity recognition methods.
Front. Robot. AI 2, 28 (2015)
45. Wang, H., Schmid, C.: Action recognition with improved trajectories. In: Proceedings of the
IEEE International Conference on Computer Vision (2013)
46. Wang, H., Kläser, A., Schmid, C., Liu, C.L.: Action recognition by dense trajectories.
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pp. 3169–3176. IEEE, New York (2011)
47. Wang, L., Qiao, Y., Tang, X.: Action recognition with trajectory-pooled deep-convolutional
descriptors. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 4305–4314 (2015)
48. Wang, H., Oneata, D., Verbeek, J., Schmid, C.: A robust and efficient video representation for
action recognition. Int. J. Comput. Vis. 119, 219–238 (2016)
49. von Wartburg, R., Wurtz, P., Pflugshaupt, T., Nyffeler, T., Lüthi, M., Müri, R.: Size matters:
Saccades during scene perception. Perception 36, 355–365 (2007)
50. Zhou, Y., Yu, H., Wang, S.: Feature sampling strategies for action recognition. arXiv preprint
arXiv:1501.06993 (2015)
51. Zitnick, L., Dollar, P.: Edge boxes: locating object proposals from edges. In: Proceedings of
the European Conference on Computer Vision (2014)
Querying Multiple Simultaneous Video Streams
with 3D Interest Maps
Abstract With proliferation of mobile devices equipped with cameras and video
recording applications, it is now common to observe multiple mobile cameras
filming the same scene at an event from a diverse set of view angles. These recorded
videos provide a rich set of data for someone to re-experience the event at a later
time. Not all the videos recorded, however, show a desirable view. Navigating
through a large collection of videos to find a video with a better viewing angle
can be time consuming. We propose a query-response interface in which users can
intuitively switch to another video with an alternate, better view by selecting a 2D region within a video as a query. The system then responds with another video
that has a better view of the selected region, maximizing the viewpoint entropy. The
key to our system is a lightweight 3D scene structure, also termed 3D interest map.
A 3D interest map is naturally an extension of saliency maps in the 3D space since
most users film what they find interesting from their respective viewpoints. A user
study with more than 35 users shows that our video query system achieves a suitable
compromise between accuracy and run-time.
1 Introduction
The proliferation of mobile devices, such as smartphones and tablets, that are
equipped with sensors, cameras, and networking capabilities has revolutionized the
way multimedia data are produced and consumed, and has posed new challenges as
well as led to new, novel applications.
We consider the case of public cultural performances (dance, singing, sport,
magic, theater, etc.) with a spectating crowd. It has become common for a spectator
to watch a performance and film it at the same time with a mobile camera. Figure 1
depicts such a scene with an open air stage and multiple filming cameras (six of
which are labelled). The recorded videos are often uploaded and shared via social
networks. For instance, a search for “Chingay Parade 2014” on YouTube returns
more than 3000 results; “New York Ball Drop 2014” returns more than 172,000
results.
While the large number of simultaneous videos capturing the same event provides a rich source of multimedia data captured from diverse angles for someone to
explore and experience the captured event, navigating from one video to another
is challenging. In this chapter, we focus on the problem of view switching: often, a camera filming the event does not have a perfect view at all times; for instance, the
object of interest on the stage might be occluded by another spectator. In this case, it
will be useful for the viewer to switch to another video capturing the same instance,
but with a better view.
Figure 2 illustrates the idea of our work. The first three images from the left show
snapshots (at the same time instance) of videos filmed simultaneously by mobile
devices of a song-and-dance performance. At that particular moment, the lead singer
(with a white bow on her head) is singing alone and other performers are dancing by
her side. The rightmost image in Fig. 2 shows a user interface on the mobile phone,
where the user is watching a video stream of a performance filmed from the right
side of the stage and is not able to view the lead singer clearly. Using our system, the user can request to switch to a better view of the lead singer, either by tapping on the lead singer on his screen (the blue circle) or by clicking and dragging a rectangle surrounding the lead singer on his screen (the red rectangle). This action generates a query that is sent to the server. The server then determines that, say, the video stream that corresponds to the first image from the left provides the best view of the lead singer, and switches to transmitting this stream to the user.

Fig. 1 A performance filmed by seven cameras (Jiku dataset): six of the cameras are located just around the scene, and the seventh is out of the scope of this image but its field of view is shown as a red rectangle (Color figure online)

Fig. 2 Images of three synchronized videos from the Fukuoka dataset (top row), and an example of two possible queries (bottom row) materialized by a red rectangle and a blue circle on the image (Color figure online)
It is important to note that this query is not interpreted as a content-based image
retrieval (CBIR) problem. We do not look for something similar in appearance to the queried region; rather, we want to make a 3D interpretation of the query. In our use case, the user might orient his mobile camera towards the important part of the scene while having a partially occluded or contaminated viewpoint.
While the user interface above is intuitive, we face a few questions. First, we need
to efficiently identify the regions containing the objects of interest in the videos. Our
approach exploits the fact that the cameras naturally film objects that are of interest
to the viewers. For instance, a camera usually zooms into or points at interesting
events on the stage. Thus, objects that appear in multiple cameras can be assumed
to be of high interest (e.g., the lead singer in Fig. 2).
Second, we need to relate what is captured in the different videos to tell if they
are depicting the same scene. Traditional content-based analysis would fail, due to
high variations in appearance of the same object captured from different angles. To
address this issue, we need to reconstruct the scene captured by the cameras in 3D,
using multiple view geometry. We limit the reconstruction to regions that are in the
intersection of the views of multiple cameras. As a result, 3D points around objects
of high interest naturally become denser.
Third, we need to perform the reconstruction and answer the query efficiently,
with a target response time of 1 s. To this end, we chose to perform a coarse yet
discriminative reconstruction, using a modified version of the multiview stereopsis
algorithm [7] that is less sensitive to camera parameters and is tuned to run in real-
time. We model these objects of interest and the associated clusters of 3D points as
a collection of 3D ellipsoids, providing a fairly coarse representation of the objects
in the scene, but that are sufficient to support our view switching application. For
instance, in Fig. 2, each 3D ellipsoid would fit each performer visible in all three
images. We also choose to use only simple features to identify potential objects of
interest. While these features alone may not lead to accurate inference of the object
of interest, we found that combining the features from multiple cameras improves
the accuracy.
Finally, we need to define and quantify what a "better" view means. In this work, we adopt the notion of viewpoint entropy [25] and switch to the video that depicts the content selected in the current video with the largest entropy.
The chapter is organized as follows. We first review the related work in Sect. 2. Section 3 introduces the 3D query principles. Section 4 describes how we reconstruct the set of 3D ellipsoids. The experiments and evaluation results are presented in Sect. 5. Finally, Sect. 6 concludes the chapter.
2 Related Work
Our literature review is broken down into two parts. We first review some recent work
on visual computing with mobile phones. Then we explain how the proposed system
based on multiple simultaneous video streams relates to several similar applications.
Visual computing on mobile phones equipped with cameras is moving in two direc-
tions: the visual computation power is increasing on-device and, simultaneously,
the tremendous number of deployed devices leads to large-scale camera networks.
The first aspect is perfectly illustrated by Tanskanen et al. [22] in which a complete
on-device live 3D reconstruction pipeline using monocular hand-held camera along
with inertial sensors is presented. In the second direction, structure-from-motion (SfM) algorithms can also be centralized, taking as input images acquired by many mobile cameras: the well-known SfM pipeline used in the Photo Tourism project, called Bundler [20], jointly reconstructs the 3D cameras and a set of 3D points from unordered image collections. An even more challenging set-up consists in synchronizing [18] many mobile phone cameras to make them cooperate in near real time. While we already enjoy many location-based services exploiting the
sensor network of mobile phones (real-time applications reporting flooded roads
for instance), we identify only a few efforts fusing simultaneous mobile phone
camera captures. In their work [9], Kansal and Zhao consider an audio-visual
sensor network that potentially exploits millions of deployed mobile phones. The
visual data from those mobile cameras help overcome the limitations of classical fixed sensing infrastructure.
Video clips recorded by multiple users during the same crowded event can be
used for several purposes. For instance, these videos can be automatically edited
to produce a new mashup video similar to how a TV director would switch
between different cameras to produce the show. Shresta et al. [19] address this
problem and select the cameras by maximizing video quality. They also explain
in [18] how to synchronize multiple cameras from the available multimedia content.
Saini et al. [16] expand this effort about video mashups with live applications in
mind. They try to make decisions about cameras without any reference to future
information. Their improved model jointly maximizes not only the video signal quality but also the diversity and quality of the view angles, as well as aesthetic criteria such as shot lengths and distances from the stage. These works add to the abundant
literature about best view selection as explained in the recent paper introducing the
Jiku mobile video dataset [17]. We also address this issue in our video switching
application. In the camera networks community, one often aims at selecting the
optimal viewpoint in order to control data redundancy while saving resources
[6, 24] or in order to identify view(s) with task-oriented criteria for sport [14] or
surveillance [12] applications. In this chapter, we follow a popular approach to this problem, selecting the most informative viewpoint [26]. Finally, our video query
system offers social media interactions during live events. Dezfulli et al. [5] also
investigate a live scenario for mobile video sharing during a soccer match. Their
prototype, called CoStream, has been designed to evaluate both the production and
the consumption of videos. Indeed, a close-up video showing a goal can be produced
by users located next to the goal in the stadium and be requested (e.g., consumed)
by some friends sitting in other aisles far away from the place where the goal
was scored. During CoStream design sessions, the participants stressed the need
of a pointing gesture for immediate interaction. Pointing their mobile device in the
In this section, we formally define what we mean by a 3D query and a better view. To simplify the explanation, we will present our algorithm in the context of a set of J images I_j, j = 1, ..., J, corresponding to video frames taken at the same time instant from J cameras filming the same scene from different angles. We denote the cameras as C_j and assume their projection matrices are known.
Let I_q be the image currently viewed by the user, in which the user will specify a region of interest (ROI) R_q. We call I_q the query image and R_q the query region.
For now, we assume that we have a set of K 3D shapes representing the interesting objects in the scene (how we obtain these will be described in the next section). We back-project R_q into 3D space, forming a query volume V_q that is a generalized viewing cone through R_q. We then compute the intersection between V_q and the 3D shapes.
After this step, the algorithm selects the subset O_q of 3D shapes that intersect with V_q (we consider that V_q intersects a 3D shape if more than 40% of the shape is within V_q). Note that it is possible to select a shape corresponding to an object that does not appear in I_q.
The set O_q represents the 3D shapes selected by the user through R_q and, ideally, corresponds to the objects in the scene that the user is interested in. What remains is for the algorithm to return an image that depicts these objects in the "best" way. To compute this, we use the notion of viewpoint entropy, inspired by Vázquez et al. [25].
For each image I_j, we compute its viewpoint entropy. We adapt the notion of viewpoint entropy by Vázquez et al. to handle a finite set of 3D shapes and the restricted region of background visible from I_j. The viewpoint entropy E(I_j) represents the amount of visual information about the selected 3D shapes O_q in I_j. Let A_o be the projected area of shape o on I_j, normalized between 0 and 1 with respect to the area of I_j. We define A_bg as the normalized area in I_j that is not covered by any shape (i.e., the background). We define the viewpoint entropy as
E(I_j) = -A_{bg} \log_2 A_{bg} - \sum_{o \in O_q} A_o \log_2 A_o        (1)
An image that depicts all the requested shapes with the same relative projected area has the highest entropy. Since the relative projected areas A_i form a probability distribution, at the maximum entropy (\log_2(|O_q| + 1)) the relative visibility of the background should also be comparable to the visibility of each shape. In practice, we do not reach this upper bound and simply maximize E(I_j) over j.
We return the image with the highest entropy as the result of our query. The
system then switches the video stream to the one corresponding to the resulting
image.
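The best-view selection of Eq. (1) reduces, in practice, to computing the entropy of the normalized projected areas in each candidate image and keeping the argmax. The following Python sketch is a minimal illustration (not the authors' implementation); it assumes the areas A_o and A_bg have already been measured, for instance by rendering the selected ellipsoids into each view.

```python
import numpy as np

def viewpoint_entropy(shape_areas, bg_area):
    """E(I_j) of Eq. (1): shape_areas are the normalized projected areas A_o
    of the selected shapes O_q, bg_area is the normalized background area A_bg."""
    terms = np.asarray(list(shape_areas) + [bg_area], dtype=float)
    terms = terms[terms > 0]          # treat 0 * log2(0) as 0
    return float(-np.sum(terms * np.log2(terms)))

def best_view(per_image_shape_areas, per_image_bg_area):
    """Return the index j of the candidate image maximizing E(I_j)."""
    entropies = [viewpoint_entropy(a, b)
                 for a, b in zip(per_image_shape_areas, per_image_bg_area)]
    return int(np.argmax(entropies))

# Example: best_view([[0.3, 0.2], [0.05, 0.01]], [0.5, 0.94]) returns 0,
# the image where the two selected shapes occupy comparable, larger areas.
```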
As clearly stated by Vázquez et al., the background term A_bg helps to handle various
zoom levels (or various distances between the cameras and the scene) among the J
candidate images. The use of the background visibility level gives nearer shapes a
higher entropy. In Fig. 3, a larger projection of the requested yellow shape increases
the entropy of the best viewpoint (by limiting the information brought by the
background).
Several efficient implementations can be used to compute E(I_j), ranging from immediate rendering on graphics hardware to specialized algebraic methods for 3D shapes and their images. We use an algebraic method whose details are beyond the scope of this chapter.
Figure 3 shows a small example illustrating this process. The small red square inside the image represents the query region R_q. The corresponding query volume V_q intersects two shapes, shown in blue and yellow, out of three 3D shapes. The
image with the highest entropy (0.75) is selected out of two candidate images.
(1) we need an automatic detection of the important objects a user may wish to query; (2) we
need a 3D reconstruction that makes those objects identifiable from any viewpoint;
(3) we need a very fast reconstruction to typically get a one-second-duration query-
response cycle.
the experiments, this method would lead to poor results as the 3D consistency of
the detected 2D regions of interest is not guaranteed! This is why we introduce an
alternative method based on the computation of the intersection of visibility cones,
in 3D. We therefore build a 3D mask, whose 2D re-projections we use as additional inputs for PMVS. This step is detailed in the following subsection.
Fig. 5 Results on the Brass Band dataset: on top, ellipses detected based on the central focus
assumption and used for the cones intersection. Below, reconstructed ellipsoids reprojected into
the images
positions of the cameras are the ones estimated for the Brass Band dataset (see also
Fig. 5). Let C_j be the ellipse associated with the UROI in image j and let Λ_j be the visual cone back-projecting C_j.
We first describe an algorithm for intersecting two visual cones, related to
cameras i and j.
1. In image i, generate R random points inside the ellipse C_i and back-project each of these points p as a 3D line L_p^i through p and the camera centre.
2. Compute the two intersection points where line L_p^i meets the PCoI Λ_j associated with image j.
3. If such intersection points exist, discretise the line segment whose endpoints are these two points into S points, so that image i yields R·S 3D points.
4. Repeat 1–3 by switching the roles of cameras i and j.
Given now a sequence of J views, we can apply the above algorithm to each of the J(J−1)/2 distinct image pairs that can be formed, in order to eventually obtain J(J−1)·R·S 3D points of interest for the sequence.
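The cone-intersection step can be made concrete as follows. The sketch below is a simplified illustration under assumptions not fixed by the text: the ellipse of image j is given as a 3×3 conic matrix C_j with a sign convention making the quadratic well-posed, and the cameras are given as 3×4 projection matrices. It back-projects one sampled point p of image i as a ray, intersects it with the cone Λ_j = P_j^T C_j P_j, and discretises the resulting segment into S points; looping over R sampled points and over both orderings of the pair (i, j) reproduces steps 1–4.

```python
import numpy as np

def back_project_pixel(P, u, v):
    """Camera centre and unit ray direction (world frame) for pixel (u, v)."""
    M, p4 = P[:, :3], P[:, 3]
    centre = -np.linalg.solve(M, p4)
    direction = np.linalg.solve(M, np.array([u, v, 1.0]))
    return centre, direction / np.linalg.norm(direction)

def ray_cone_segment(P_i, pixel, C_j, P_j, S=10):
    """Steps 1-3 for one sampled point: intersect the back-projected ray of
    `pixel` (camera i) with the visual cone of the ellipse C_j seen by camera j,
    then discretise the segment between the two intersections into S points."""
    Lam = P_j.T @ C_j @ P_j                      # 4x4 matrix of the visual cone
    c0, d = back_project_pixel(P_i, *pixel)
    X0, D = np.append(c0, 1.0), np.append(d, 0.0)
    a = D @ Lam @ D                              # quadratic X(t)^T Lam X(t) = 0
    b = 2.0 * X0 @ Lam @ D
    c = X0 @ Lam @ X0
    disc = b * b - 4.0 * a * c
    if abs(a) < 1e-12 or disc < 0:               # no (real) intersection
        return np.empty((0, 3))
    t1 = (-b - np.sqrt(disc)) / (2.0 * a)
    t2 = (-b + np.sqrt(disc)) / (2.0 * a)
    ts = np.linspace(min(t1, t2), max(t1, t2), S)
    return c0[None, :] + ts[:, None] * d[None, :]
```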
Scenario #2: Significant Apparent Motion
In that case, we assume that the interesting video objects are correlated with the
motion in the video. This is particularly true for examples such as sport or dance.
We start by computing a motion map, which we obtain by computing the distance in RGB color space between two consecutive images of a video. Many solutions exist and the reader's favorite one will be satisfactory. Since we handle many views simultaneously, the motion detection can fail for a few viewpoints without any significant impact on the final result. Our motion map computation is deliberately simple and fast. With our motion maps, we use an unsupervised 2D clustering of the moving pixels to obtain the elliptical regions of interest used for the cones intersection (see Fig. 6).
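A minimal motion-map sketch, assuming 8-bit RGB frames (the exact distance and the binarization threshold are left to the reader's preferred variant, as the text suggests):

```python
import numpy as np

def motion_map(frame_prev, frame_next):
    """Per-pixel Euclidean distance in RGB between two consecutive frames."""
    diff = frame_next.astype(np.float32) - frame_prev.astype(np.float32)
    return np.sqrt(np.sum(diff ** 2, axis=2))

def moving_pixels(frame_prev, frame_next, threshold=30.0):
    """Binary map of candidate moving pixels (threshold is an assumed value)."""
    return motion_map(frame_prev, frame_next) > threshold
```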
Fig. 6 Results on the Jiku dataset: on top, ellipses detected based on the motion and used for the
cones intersection. Below, reconstructed ellipsoids reprojected into the images
In this subsection we describe our method to obtain the final set of 3D ellipsoids, as
can be visualized on the third row of Figs. 5 and 6.
We run PMVS with all needed inputs: the images, the masks and the camera
projection matrices. We set the level parameter to 4, in order to lower the resolution
of the images and therefore obtain a sparse reconstruction.
At this point we get a 3D point cloud whose points should be concentrated on the important objects. It is then very intuitive to cluster the points and to associate
an ellipsoid with each cluster.
We use, once again, the Mean-Shift clustering algorithm [4] to cluster the
points. This choice is motivated by the fact that we do not know the number
of important objects that are present in the scene, so we need an unsupervised
clustering technique in order to automatically discover the number of objects. The Mean-Shift algorithm takes as input a parameter that defines a radius of search for
neighbouring points. This parameter is called the bandwidth, and because we aim at
separating objects that are in fact persons, we choose a bandwidth equal to 50 cm.
Finally, we need to create an ellipsoid to model each cluster of points. We set the ellipsoid center to the mean of the point coordinates. Then, by computing the eigenvectors of the covariance matrix of the point coordinates, we get the directions of the three axes of the ellipsoid.
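These two last steps (unsupervised clustering with a 50 cm bandwidth, then one ellipsoid per cluster from the mean and the covariance eigenvectors) can be sketched as below. The scaling of the semi-axis lengths from the eigenvalues is an assumption, since the text only specifies the axis directions.

```python
import numpy as np
from sklearn.cluster import MeanShift

def ellipsoids_from_point_cloud(points, bandwidth=0.5, radius_scale=2.0):
    """Cluster a sparse 3D point cloud (coordinates in metres) with Mean-Shift
    and fit one ellipsoid per cluster.  bandwidth=0.5 corresponds to the 50 cm
    search radius mentioned in the text; radius_scale is an assumed factor."""
    labels = MeanShift(bandwidth=bandwidth).fit_predict(points)
    ellipsoids = []
    for lbl in np.unique(labels):
        cluster = points[labels == lbl]
        if len(cluster) < 4:                    # too few points for a covariance
            continue
        centre = cluster.mean(axis=0)           # ellipsoid centre
        eigvals, eigvecs = np.linalg.eigh(np.cov(cluster, rowvar=False))
        radii = radius_scale * np.sqrt(np.maximum(eigvals, 0.0))
        ellipsoids.append({"centre": centre, "axes": eigvecs, "radii": radii})
    return ellipsoids
```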
The matrix of the visual cone back-projecting an ellipse C by a camera with projection matrix P is

Λ = P^T C P.

The line back-projecting a pixel p is represented by the matrix

L_p = P^T [p]_× P,

where p = (u, v, 1)^T is the homogeneous vector of pixel p with coordinates (u, v) and [p]_× is the order-3 skew-symmetric matrix associated with p [8, p. 581].
The two 3D intersection points A and B where line L_p meets the cone are given in closed form by a rank-2, order-4 symmetric matrix, proportional to AB^T + BA^T, computed from the dual Plücker matrix of L_p and the cone matrix; A and B are the homogeneous 4-vectors of points A and B. Due to lack of space, the proof is not given here, but it can be shown that (1) the two points are real if the two
non-zero eigenvalues of AB^T + BA^T have different signs, and complex otherwise, and (2) A and B can be obtained straightforwardly by a simple decomposition of AB^T + BA^T.
Now consider the special case of a circular UROI centered at the principal point. It can be shown that a visual cone through such a circle only depends on the camera centre and the optical axis (which passes through it). In this special case, errors on the rotation around the optical axis, which is encoded in the projection matrix, have no effect on the matrix of the visual cone.
4.5 Summary
To wrap up this section, we list here the steps required to obtain a lightweight
reconstruction of the scene.
• Apply the OpenCV blob tracker to find 2D Regions of Interest.
• Intersect the back-projections of the 2D ROIs, sample these cone intersections, and estimate a 3D mask.
• Project the 3D mask back into 2D masks, and apply PMVS with these masks on
a sub-sampled version of the images.
• Cluster the set of 3D points thus obtained using Mean-Shift, and then associate
one ellipsoid to every cluster.
5 Experiments
5.1 Datasets
moved around the scene and for which we also recorded accelerometer and
GPS values. The dataset depicts an outdoor scene, where eight musicians stand
close to each other and move according to the instrument they are playing. We drew concentric circles on the floor so that we could reconstruct the camera poses more easily.
Fukuoka This dataset consists of five high resolution video clips of a 5-min dance
show captured by fixed cameras. The video clips capture a complex scene with
multiple performers, wearing the same uniform, standing close to each other, and
often moving and occluding each other. The floor of the stage, however, is composed
of colored concentric circles, easing the camera parameter estimation process.
RagAndFlag The RagAndFlag dataset is the most challenging one. It consists of
a set of videos shot from seven cameras, including mobile devices, surrounding
an open air stage. The videos captured an event with many dancers moving in a
highly dynamic outdoor scene. The videos are shaky and have variable lighting.
Furthermore, the lack of information on camera calibration makes this dataset a
challenging input to our algorithms.
Table 1 Results obtained for the internal and external camera calibration on the Fukuoka and BrassBand datasets

Sequence | Fukuoka | BrassBand
Sensor | No | Yes
Orientation error (degree) | 0.13 | 0.48
Position error (%) | 1.45 | 5.74
Focal length error (%) | 1.59 | 5.91
Reprojection error (pixels) on 1920×1080 image | 2.25 | 2.15
Reprojection error (pixels) on 240×135 image (as input for PMVS level 3) | 0.28 | 0.27
Table 2 Frame offset between videos computed by Ke et al. [10] and obtained manually
Videos Offset given by Ke et al. [10] Manually defined offset
#1–#5 48 55
#2–#5 1578 1570
#3–#5 890 899
#4–#5 1241 1248
Fig. 7 Query on the Fukuoka dataset. On the left the queried region (in red) and on the right, the
top answer of both our algorithm and the users (Color figure online)
5.3 Queries
To evaluate the quality of the resulting video that the system provides in response to a query, we conducted a user study with the following setup.
Videos We selected 15 sequences, 1 from Fukuoka, 1 from RagAndFlag and 13
from BrassBand. Since we only have sensor data for the BrassBand dataset, most of the sequences used in the study are taken from this dataset.
Methodology For each sequence, one of the images is considered as the query
image. On this image, we define five spatial regions corresponding to the actual
query we want to evaluate. For each query, we ask the users which of the remaining
videos they think is the best answer to the query. In order to avoid any bias, the order
of presentation of the videos is randomized, as well as the order of the sequences
and the order of the queries.
We have set up a web-based interface, displaying the query (as a red rectangle
on one of the images) and the entire set of videos. Users were asked to look at the
query, identify the objects of interest that were queried and find, among the set of
videos, the one that they think shows the queried objects of interest the best. They
then voted for this video by clicking on it, and moved on to the next query.
Participants The study involved 35 participants (13 females, 22 males) with ages
ranging from 20 to 60 (average age is 30). Each of the 35 participants answered 75
queries, giving a total of 2625 votes.
Results We show some qualitative results in Figs. 7 and 8.
Fig. 8 Query on the BrassBand dataset. On the left the queried region (in red), in the middle the top answer of our algorithm and on the right the top answer of the users (Color figure online)
Table 3 Percentage of the users satisfied with the top two answers to the query
Method First answer (%) Second answer (%)
Best possible 56 78
Our method (real time) 44 62
PMVS 30 41
Our method (no time constraints) 49 68
Figure 7 shows a scenario where our algorithm performs exactly as the users did. The top answer of our algorithm gives a clear view of the queried region, which is the lead singer of the band. This answer was also chosen by all users but one.
Figure 8 shows a scenario where our algorithm is not consistent with the users' answers. A majority of users chose the rightmost image, as it allows all performers to be seen clearly without occlusion. Our algorithm, on the other hand, gave the middle image as the best answer. The main reason is that, even if some of the ellipsoids are occluded, the background weight is smaller than in the other images, which results in a larger viewpoint entropy.
We now introduce quantitative results, based on the user study. We say that a
user is satisfied by the answer to a query if the answer given by our algorithm is the
same as the one voted by the user during the user study. This hypothesis assumes
that users would be satisfied with one answer only, which is not true in general. For
example, in the BrassBand dataset, some of the cameras were located pretty close
to each other, which means that in some cases users would have been satisfied by
both videos as an answer to a query.
Table 3 shows the percentage of users that are satisfied by the top answer, and
the top two answers, to the query. The first row of Table 3 indicates the highest
percentage of users that it is possible to satisfy. These relatively low numbers are explained by the dispersion of the users' answers to the queries during the user study. There are often several suitable images that answer a query, and this fact is reflected in the users' choices.
Our results are shown in the second row of Table 3. Almost half of the users
(44%) would be satisfied on average by the answer provided by our algorithm, and
more than 60% would be satisfied by one of our top two answers. The remaining
users chose different views, which does not necessarily mean they would not have
been pleased with our answer. In fact, in the case of the Fukuoka and BrassBand
datasets, there are often multiple views that would give acceptable answers to a
query, as some cameras are close to each other.
It is common in CBIR to evaluate an algorithm by computing the Average Rank
of the answers to a query. To obtain this value, we rank the videos according to
the number of users’ votes for each query. Let rank.I; q/ be the rank of image I
associated to the query q. The Average Rank of the first answer to a query (or AR.1/)
is defined as
1 X
AR.1/ D rank.A1q ; q/ (2)
card.Q/ q2Q
where A1q is the first answer to query q from our algorithm, and Q is the set of all
queries that were evaluated during the user study.
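For concreteness, AR(k) of Eq. (2) can be computed from the study data as in the sketch below; the data structures our_answers and user_rank are hypothetical containers for, respectively, our ranked answers and the vote-based ranks.

```python
def average_rank(our_answers, user_rank, k=1):
    """AR(k) of Eq. (2): mean, over all queries q, of the users' rank of the
    k-th answer returned by the algorithm.
    our_answers[q]  -- ordered list of images returned for query q
    user_rank[q][i] -- rank of image i according to the users' votes for q"""
    ranks = [user_rank[q][our_answers[q][k - 1]] for q in our_answers]
    return sum(ranks) / len(ranks)
```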
Table 4 shows the Average Rank of the first four answers to the query from
various algorithms. The first row shows that our method performs well: AR(1) = 1.95 reflects that the first answer given by our algorithm is often either the best or the second-best one. Another interesting aspect of the result is that the average rank of each answer is correlated with the rank of the answers from the user study. Indeed, we observe that AR(1) < AR(2) < AR(3) < AR(4). This observation means that our algorithm can rank the possible answers in a way comparable to how real users would.
Our algorithm sacrifices some accuracy for efficiency. To study this effect, the
third row of the table shows the results of a query that is performed on a scene
reconstructed without time constraints, using the full resolution images but with
our masks. The results, as expected, are better than our method since no accuracy
is sacrificed. Note that the percentage of users satisfied by the two best answers
computed with this method (Table 3) is also quite high compared to the best possible
result.
Finally, the second row of Table 4 and the third row of Table 3 show the results of a query performed on a scene reconstructed by PMVS without any masks. Not only does the algorithm not run in real time, but the query results are also poor. This is due to the fact that PMVS reconstructs more objects than just those that are needed to perform the query. This result validates our mask construction step as a necessary contribution to the querying.
Table 4 Average Rank (AR) of the first four answers to the query implemented using our real-
time method (first line), PMVS (second line) and our method without time constraints (third line)
Method | AR(1) | AR(2) | AR(3) | AR(4)
Our method (real time) | 1.95 | 2.7 | 3.6 | 4.5
PMVS | 2.9 | 3.7 | 3.1 | 3.9
Our method (no time constraints) | 1.5 | 2.6 | 3.7 | 4.4
6 Conclusion
In this paper, we have investigated and validated a new interest-based and content-
based mechanism for users to switch between multiple cameras capturing the same
scene at the same time. We devised a query interface that allows a user to highlight
objects of interest in a video, in order to request video from another camera with a
closer, non-occluded, and interesting view of the objects. The answer to the query is
computed based on a 3D interest map [2] automatically inferred from the available
simultaneous video streams. Such a map can be reconstructed very efficiently with
our approach if the environment is sufficiently equipped to allow a reasonably precise calibration and synchronization of the cameras. During our user study, we have
shown that the system responses fit well with users’ expectations. We have also
seen that our original 3D interest map is a rich combination of the simultaneous
visual sources of information as well as a consistent generalization of the more
conventional 2D saliency maps.
In the future, we want to precisely evaluate each component of our system. In
order to respect the end-to-end delay, the 3D interest map reconstruction should
be performed in real-time on the server side. We expect a cycle between the
query and the video switch to be completed in about one or two seconds. The precise performances of both the cones intersection and the finer reconstruction technique must be known. The spatio-temporal coherence of the maps could also be taken into account to improve the computations. More generally, using more content
analysis for improving the masks and the maps would probably be interesting.
Finally, we also aim at further studying the ergonomics of the client interface: for
instance, it is not clear so far if we should provide the requester with an interactive
overview of the good available streams or simply switch to the best one.
References
1. Calvet, L., Gurdjos, P., Charvillat, V.: Camera tracking using concentric circle markers:
paradigms and algorithms. In: ICIP, pp. 1361–1364 (2012)
2. Carlier, A., Calvet, L., Nguyen, D.T.D., Ooi, W.T., Gurdjos, P., Charvillat, V.: 3D interest
maps from simultaneous video recordings. In: Proceedings of the 22nd ACM International
Conference on Multimedia, MM ’14, pp. 577–586 (2014)
3. Chandra, S., Chiu, P., Back, M.: Towards portable, high definition multi-camera video capture
using smartphone for tele-immersion. In: IEEE International Symposium on Multimedia
(2013)
4. Comaniciu, D., Meer, P.: Mean shift: a robust approach toward feature space analysis. IEEE
Trans. Pattern Anal. Mach. Intell. 24(5), 603–619 (2002)
5. Dezfuli, N., Huber, J., Olberding, S., Mühlhäuser, M.: CoStream: in-situ co-construction of
shared experiences through mobile video sharing during live events. In: CHI Extended
Abstracts, pp. 2477–2482 (2012)
6. Ercan, A.O., Yang, D.B., Gamal, A.E., Guibas, L.J.: Optimal placement and selection of
camera network nodes for target localization. In: IEEE DCOSS, pp. 389–404 (2006)
7. Furukawa, Y., Ponce, J.: Accurate, dense, and robust multiview stereopsis. IEEE Trans. Pattern
Anal. Mach. Intell. 32(8), 1362–1376 (2010)
8. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cam-
bridge University Press, Cambridge (2004). ISBN: 0521540518
9. Kansal, A., Zhao, F.: Location and mobility in a sensor network of mobile phones. In: ACM
Network and Operating Systems Support for Digital Audio and Video (2007)
10. Ke, Y., Hoiem, D., Sukthankar, R.: Computer vision for music identification. In: CVPR (1),
pp. 597–604 (2005)
11. Kim, J.S., Gurdjos, P., Kweon, I.S.: Geometric and algebraic constraints of projected concentric
circles and their applications to camera calibration. IEEE Trans. Pattern Anal. Mach. Intell.
27(4), 637–642 (2005)
12. Lee, H., Tessens, L., Morbée, M., Aghajan, H.K., Philips, W.: Sub-optimal camera selection in
practical vision networks through shape approximation. In: ACIVS, pp. 266–277 (2008)
13. Oliensis, J., Hartley, R.: Iterative extensions of the sturm/triggs algorithm: convergence and
nonconvergence. IEEE Trans. Pattern Anal. Mach. Intell. 29(12), 2217–2233 (2007)
14. Kelly, P., Ó Conaire, C., Kim, C., O'Connor, N.E.: Automatic camera selection for activity
monitoring in a multi-camera system for tennis. In: ACM/IEEE International Conference on
Distributed Smart Cameras (2009)
15. Pollefeys, M., Gool, L.J.V., Vergauwen, M., Verbiest, F., Cornelis, K., Tops, J., Koch, R.: Visual
modeling with a hand-held camera. Int. J. Comput. Vis. 59(3), 207–232 (2004)
16. Saini, M.K., Gadde, R., Yan, S., Ooi, W.T.: MoViMash: online mobile video mashup. In: ACM
Multimedia, pp. 139–148 (2012)
17. Saini, M., Venkatagiri, S.P., Ooi, W.T., Chan, M.C.: The Jiku mobile video dataset. In:
Proceedings of the Fourth Annual ACM SIGMM Conference on Multimedia Systems, MMSys,
Oslo (2013)
18. Shrestha, P., Barbieri, M., Weda, H., Sekulovski, D.: Synchronization of multiple camera
videos using audio-visual features. IEEE Trans. Multimedia 12(1), 79–92 (2010)
19. Shrestha, P., de With, P.H.N., Weda, H., Barbieri, M., Aarts, E.H.L.: Automatic mashup
generation from multiple-camera concert recordings. In: ACM Multimedia, pp. 541–550
(2010)
20. Snavely, N., Seitz, S.M., Szeliski, R.: Photo tourism: exploring photo collections in 3D. ACM
Trans. Graph. 25(3), 835–846 (2006)
21. Sturm, P.F.: Algorithms for plane-based pose estimation. In: CVPR, pp. 1706–1711 (2000)
22. Tanskanen, P., Kolev, K., Meier, L., Camposeco, F., Saurer, O., Pollefeys, M.: Live metric 3D
reconstruction on mobile phones. In: ICCV (2013)
23. Tatler, B.W.: The central fixation bias in scene viewing: selecting an optimal viewing position
independently of motor biases and image feature distributions. J. Vis. 7(14), 1–17 (2007)
24. Tessens, L., Morbée, M., Lee, H., Philips, W., Aghajan, H.K.: Principal view determination for
camera selection in distributed smart camera networks. In: IEEE ICDSC, pp. 1–10 (2008)
25. Vázquez, P.P., Feixas, M., Sbert, M., Heidrich, W.: Viewpoint selection using viewpoint
entropy. In: Proceedings of VMV 2001, pp. 273–280 (2001)
26. Wang, C., Shen, H.W.: Information theory in scientific visualization. Entropy 13(1), 254–273
(2011)
Information: Theoretical Model for Saliency
Prediction—Application to Attentive CBIR
1 Introduction
While machine vision systems are becoming increasingly powerful, in most regards
they are still far inferior to their biological counterparts. In humans, the mechanisms of evolution have generated the visual attention system, which selects the relevant information in order to reduce both cognitive load and scene-understanding ambiguity.
The most widespread visual attention theory is referred to as an early selection
model because irrelevant messages are filtered out before the stimulus information is
processed for meaning [3, 50, 53]. In this context, attention selects some information
in order not to overload our cognitive system. This is also the basic premise of a
large number of computational models of saliency and visual attention [23, 27, 41].
Besides these well known models, we would like to mention here the theory of
simplexity [1] which also places attention among the key mechanisms to simplify
complexity.
In this section, we have decided to focus our attention on a particular family of
computational visual attention models, those based on information theory. Directly
related to probabilistic theories, models based on information theory postulate that
the brain uses its attentional mechanisms to maximize the amount of information
extracted. Estimated locally, this amount of information can then be used to define image saliency. Different
approaches to compute this amount of information are available.
Gilles [24] proposes an explanation of salience in terms of local complexity,
which can be measured by the Shannon entropy of local attributes of the image.
Kadir [28] takes this definition and expands the model, using the maximum entropy to determine the scale of the salient features in a multi-scale analysis.
Bruce [4] proposes to use a measure of self-information to build non-linear
filtering operators, used to normalize singularity maps, before merging them, in a
similar architecture to that proposed in [27]. In [5, 6], he combines an independent
component analysis [42] and a measurement of self-information in order to obtain
an estimation of the salience of an image.
Mancas [36] proposes a very comprehensive approach to salience based on self-information. He presents models suited to different conditions: 1D (audio), 2D (images) and 2D+t (video). His approach also includes attention with or without a
priori information (top-down or bottom-up).
Finally, we want to mention the outlying works of Diamant [10–12]. He proposes
a new definition of information, derived from Kolmogorov’s complexity theory
and Chaitin’s notion of algorithmic information. He presents a unifying framework
for visual information processing, which explicitly accounts for the perceptual and
cognitive image processing peculiarities.
Interestingly, all of these approaches (except the last one) consider image and
pixels as isolated entities, linked by statistical bottom-up or top-down properties
without any biological plausibility. They only focus on salience and forget attention
(cognitive or computational).
In the next section, we present a framework which keeps the advantages of information-theoretic approaches and at the same time provides strong explanatory capacity: extreme physical information (EPI).
In EPI, observation is seen as an active process which extracts information from
a measured object to increase the knowledge of an observer (human or computer). In
this framework, salience and visual attention can be linked by a flow of information
coming from a scene to its cognitive or computational representation.
We propose that EPI provides an optimal way of extracting visual information.
Based on this assumption, EPI has already been used to optimally represent
information held in an image [8, 25]. In this chapter, we extend these previous
works by considering an open system, i.e. a system that can acquire or lose a priori information; for instance, the observation of an image or a video shot by an observer. Let us first briefly present the theoretical framework: the EPI.
Over the past 20 years, Roy B. Frieden has developed a very powerful and complete theory which proposes a unified approach to physics and exploratory analysis [19, 20]. In his work, both theory and applications are presented, from the derivation of the Schrödinger equation to the theory of social change and numerous other topics. Recently, EPI has been extended to universal cognitive models via a confident-information-first principle [58]. In this part, we briefly present the EPI principle.
The main objective of Frieden’s approach is to develop a theory of measurement
that incorporates the observer into the phenomenon under measurement based on
Fisher Information.
In our particular case of interest, where we have to deal with images and videos, if p(x) denotes the probability density function for the noise vector x intrinsic to the nD measurement and q the real probability amplitudes defined as p = q^2 (see [19] for a more complete and proper approach), I can be expressed as:

I[q] = 4 \sum_i \int dx_i \sum_v \left( \frac{\partial q_i}{\partial x_{iv}} \right)^2        (1)

where q_i \equiv q_i(x_i) is the i-th probability amplitude for the measure fluctuation x_i = (x_{i1}, \ldots, x_{ip}).
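As a purely numerical illustration of Eq. (1) (not part of the authors' derivation), the Fisher information of a single discretized 2D amplitude q = sqrt(p) can be estimated as four times the integrated squared gradient of q; the sketch below assumes p is a non-negative, normalized 2D array.

```python
import numpy as np

def fisher_information_2d(p, dx=1.0, dy=1.0):
    """Discrete estimate of Eq. (1) for one amplitude q = sqrt(p):
    I[q] = 4 * integral of |grad q|^2 over the 2D domain."""
    q = np.sqrt(np.clip(p, 0.0, None))
    gy, gx = np.gradient(q, dy, dx)     # derivatives along rows and columns
    return 4.0 * np.sum(gx ** 2 + gy ** 2) * dx * dy
```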
Using Fisher information instead of Shannon or Boltzmann information, and taking the observer into account, allows one to derive some of the best-known laws of physics: statistical mechanics, quantum mechanics, thermodynamics and quantum gravity [19]. Frieden defined a unique principle of physics, that of EPI. As previously mentioned, the main characteristic of this approach is the inclusion of the observer as an integral part of the measurement, and its main aim is to find the unknown system probabilities p.
The EPI principle is based upon the discovery that the Fisher information I contained in the data arises out of the effect under measurement, where it had the value J. Whereas information I characterizes the quality or efficiency of the data, the information J characterizes the effect under measurement, and it is
determined through an invariance principle (Fourier transform for instance). Thus,
any measurement arises from a flow of information:
J \to I        (2)
where J is named the bound information (in the sense of being bound to the source).
Frieden also defined a coefficient \kappa:

I - \kappa J = 0, \qquad 0 \le \kappa \le 1        (4)
I - J \equiv K = \mathrm{extrem}.        (5)
This quotation was the key point of our reasoning. In the next section, we propose
to use it to transpose the EPI principle used in an ecological scenario into the context
of visual information.
Frieden started by studying closed-system phenomena [19]. The fact that the system is closed is expressed by stating that the total number of each element is conserved, no matter what kind of molecule it may be a part of. There is no way to gain, lose or integrate a priori elements.
Yet, this closed world is obviously not adapted to visual observation, in which
both environment and interior mood may affect the process. Observation clearly
1 EPI used in an open system.
4.1 Hypothesis
In this part, we present the hypotheses on which we rely in order to apply the EPI framework.
• First of all, we assume that visual observation leads to a global and optimal
representation built from different and well-known features extracted by our eyes
(color, intensity, orientation, motion) [32].
• Secondly, our main assumption is that each measured or perceived feature (color, intensity and orientation) behaves like a population of mind particles [33]. Every feature is fed by the observed scene. Since the capacity of information that exists on the retina is obviously limited, we state that our mind contains N kinds of particles which represent the populations of the measured features. Let m_n, n = 1, \ldots, N, be the level of population of these particles; the total number of particles in the retina is

M \equiv \sum_n m_n
Considering these assumptions, we can reuse the work done in [20] concerning growth and transport processes.
The relative populations are defined as m_n / M \equiv p_n, and p = [p_1, \ldots, p_N].
• Our last hypothesis is to consider that our system involves four different populations, N = 4, with three families of populations that represent the basic inputs (color, intensity, orientation) and one population that represents interest; we can easily increase the number of particle types, which is a very interesting aspect of our approach.
– The three inputs we use here were proposed by Koch and Ullman [29]. They are related to the so-called feature integration theory, which explains human visual search strategies. Visual input is thus decomposed into a set of three features.
– The output, i.e. interest, is considered as a population that consumes low-level features [33]. In that work, the authors explain that interests compete for attention, which they consume, but interests are also consumed by the activity they engender. With such an approach, the authors modeled the development of the interest system.
Once these three hypotheses are made, we can reuse the work of Fath and Frieden [14, 15, 20] in order to derive a set of optimal equations that govern the observation process.
In this section, we do not present the entire computation of the optimal laws but only the main steps and results; interested readers can refer to [14, 15, 20].
For our discrete problem, the Fisher information at time t is given by Frieden [20]:

I(t) = \sum_n \frac{z_n^2}{p_n}, \qquad p_n \equiv p(n|t), \quad z_n \equiv \frac{dp_n}{dt} = p_n\,(g_n + d_n), \qquad n = 1, \ldots, 4,        (6)

where

g_n = \sum_k g_{nk}\, p_k, \quad g_{nk} \equiv g_{nk}((p), t), \qquad d_n = d_n((p), t), \quad n = 1, \ldots, 4.        (7)
The g_n, d_n are change coefficients called growth and depletion, respectively. These functions are assumed to be known and effectively drive the differential equations, acting as EPI source terms. In Sect. 3, we have seen that

I - \kappa J = 0, \qquad 0 \le \kappa \le 1.        (8)
\frac{2 z_n}{p_n} - \sum_m \frac{\partial J_m}{\partial z_n} = 0, \qquad n = 1, \ldots, 4.        (11)
\frac{z_n^2}{p_n} - \frac{1}{2} J_n = 0, \qquad n = 1, \ldots, 4        (13)
Combining Eqs. (11) and (13) and eliminating their common parameters p_n gives

p_n = \frac{2 z_n}{\sum_m \partial J_m / \partial z_n} = \frac{2 z_n^2}{J_n}        (14)
z_n \equiv \frac{1}{2} f_n(g, p, t)\, p_n        (17)
which finally leads to

\frac{dp_n}{dt} = \left[ g_n(p, t) + d_n(p, t) \right] p_n, \qquad n = 1, \ldots, 4,        (19)
where:
• p represents the optimal evolution process of interest,
• g_n represents a growth function,
• d_n represents a depletion function.
This equation is the optimal growth process equation [39]. Considering the previous hypotheses and Lesser's definitions presented above (interests compete for attention, which they consume; interests are consumed by the activity they engender), we can simplify these general equations in order to obtain the well-known Volterra-Lotka equations [14], where the growth function is known as the growth rate and the depletion as mortality. Hence, our goal of deriving an optimal evolution process to extract information via EPI has been met. This general equation is usually presented as a pair of first-order, nonlinear differential equations, presented next.
4.3 Solution
It is very interesting to mention that these results are totally coherent with the work of Lesser [33], which assumes that there exists a competition between different sources in our brain in order to determine what we focus on. The authors of this model propose that the mind is a noisy, far-from-equilibrium dynamical system of competing interests. The system comprises two spatially discretized differential equations similar to a chemical reaction-diffusion model. These equations belong to the same family as the Volterra-Lotka equations. In the next section, we
present how to exploit such a result to model visual attention.
Fig. 1 Competitive preys/predators attention model. Singularity maps are the resources that feed a set of preys, which are themselves eaten by predators. The maximum of the predators map represents the location of the focus of attention at time t (red circle) (Color figure online)
• in order to generalize this work to video, we can easily include a new population
of preys which represents the information included in motion.
This yields the following set of equations, modeling the evolution of the prey and predator populations on a two-dimensional map:
\frac{dC_{x,y}}{dt} = b\, C_{x,y} + f\, \Delta C_{x,y} - m_C\, C_{x,y} - s\, C_{x,y} I_{x,y}
\frac{dI_{x,y}}{dt} = s\, C_{x,y} I_{x,y} + s f\, \Delta P_{x,y} - m_I\, I_{x,y}        (21)
For each of the four conspicuity maps (color, intensity, orientation, motion), the evolution of the prey population C is governed by the following equation:

\frac{dC^n_{x,y}}{dt} = h\, C^{*n}_{x,y} + h f\, \Delta C^{*n}_{x,y} - m_C\, C^n_{x,y} - s\, C^n_{x,y} I_{x,y}        (23)

with C^{*n}_{x,y} = C^n_{x,y} + w\, (C^n_{x,y})^2 and n \in \{c, i, o, m\}, which means that this equation is valid for C^c, C^i, C^o and C^m, which represent respectively the color, intensity, orientation and motion populations.
C represents the curiosity generated by the image's intrinsic conspicuity. It is produced by a sum h of four factors, where:
• the image’s conspicuity SMn (with n 2 fc; i; o; mg) is generated using our real
time visual system, previously described in [43, 45]. Its contribution is inversely
proportional to a;
• a source of random noise R simulates the high level of noise that can be measured
when monitoring our brain activity [18]. Its importance is proportional to a. The
equations that model the evolution of our system become stochastic differential
equations. A high value for a gives some freedom to the attentional system, so it
can explore less salient areas. On the contrary, a lower value for a will constrain
the system to only visit high conspicuity areas;
• a Gaussian map G which simulates the central bias generally observed during
psycho-visual experiments [32, 49]. The importance of this map is modulated by g;
• the entropy e of the conspicuity map (color, intensity, orientation or motion). This map is normalized between 0 and 1. C is modulated by 1 − e in order to favor maps with a small number of local minima. Explained in terms of the predator-prey system, we favor the growth of the most organized populations (grouped in a small number of sites). This mechanism is the predator-prey equivalent of the feature map normalization presented above.
The population of predators I, which consumes the four kinds of preys, is governed by the following equation:

\frac{dI_{x,y}}{dt} = s\,(P_{x,y} + w I_{x,y}^2) + s f\, \Delta(P_{x,y} + w I_{x,y}^2) - m_I\, I_{x,y}        (25)

with P_{x,y} = \sum_{n \in \{c,i,o\}} C^n_{x,y}\, I_{x,y}.
As already mentioned, the positive feedback factor w reinforces the system dynamics and facilitates the emergence of chaotic behaviors by speeding up
saturation in some areas of the maps. Lastly, please note that curiosity C is consumed
by interest I, and that the maximum of the interest map I at time t is the location of
the focus of attention.
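To make the dynamics concrete, here is a minimal simulation sketch in the spirit of Eqs. (21)–(25): a single prey (curiosity) map fed by a conspicuity map and a predator (interest) map consuming it, both diffusing on the 2D grid. The parameter values and the way the conspicuity map enters the growth term are illustrative assumptions, not the authors' exact model.

```python
import numpy as np

def laplacian(m):
    """4-neighbour discrete Laplacian with replicated borders."""
    p = np.pad(m, 1, mode="edge")
    return p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4.0 * m

def attention_step(C, I, SM, b=0.05, f=0.2, s=0.05, mC=0.03, mI=0.06, dt=1.0):
    """One explicit-Euler step: the conspicuity map SM feeds the prey
    (curiosity) map C, which is consumed by the predator (interest) map I."""
    dC = b * SM + f * laplacian(C) - mC * C - s * C * I
    dI = s * C * I + s * f * laplacian(I) - mI * I
    return np.clip(C + dt * dC, 0.0, None), np.clip(I + dt * dI, 0.0, None)

# After iterating attention_step, the focus of attention at time t is the
# location of the maximum of the interest map I:
#   y, x = np.unravel_index(np.argmax(I), I.shape)
```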
In this part, we present the bottom-up attentional system that models the principle of human selective attention we derived. This model aims to determine the most relevant parts within the large amount of visual data. As we mentioned, it uses psychological theories like the "Feature Integration Theory" [51, 52] and the "Guided Search model" [57]. Based on these theories, four features have been used in computational models of attention: intensity, color, orientation and motion. The first complete implementation and verification of an attention model was proposed by
Itti et al. [27] and was applied to synthetic as well as natural scenes. Its main idea was to compute features and to fuse their saliencies into a representation which is usually called a saliency map. Our approach proposes to substitute the second part of Itti's model with our optimal competitive approach. The output of this algorithm is a saliency map S(I, t), computed as a temporal average of the focalizations obtained over a certain period of time t. The global architecture is presented in Fig. 2.
6 Validation
6.1 Benchmark
Measures were done on two image databases. The first one is proposed in [6].² It is made up of 120 color images which represent streets, gardens, vehicles or buildings, more or less salient. The second one, proposed in [32],³ contains 26 color images. They represent sport scenes, animals, buildings, indoor scenes or landscapes. For both databases, eye movement recordings were performed during a free-viewing task.
Regarding the numerous models that exist in the literature, we decided to compare our generic model against a larger set of algorithms. However, it is hard to make immediate comparisons between models. To alleviate this problem, Tilke Judd has proposed a benchmark data set containing 300 natural images with eye tracking data from 39 observers to compare model performances (http://saliency.mit.edu/). As she writes, this is the largest data set with so many viewers per image. She calculates the performance of ten models at predicting ground-truth fixations using three different metrics: a receiver operating characteristic, a similarity metric, and the Earth Mover's Distance. We downloaded the database, ran our model to create saliency maps for each image and submitted our maps. We present the results in Fig. 3. References for the algorithms can be found on the web page of the benchmark.
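As an illustration of the ROC-based part of such an evaluation (one of several possible variants; the benchmark's exact metric implementations are documented on its web page), a saliency map can be scored against a binary fixation map as follows.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def saliency_roc_auc(saliency_map, fixation_map):
    """Fixated pixels are positives, all other pixels negatives, and the
    saliency value is used as the prediction score."""
    y_true = (np.ravel(fixation_map) > 0).astype(int)
    return roc_auc_score(y_true, np.ravel(saliency_map).astype(float))
```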
Fig. 3 Comparison of several models of visual attention. The number inside parentheses is the rank of our method
2 Bruce database is available at http://www-sop.inria.fr/members/Neil.Bruce.
3 Le Meur database is available at http://www.irisa.fr/temics/staff/lemeur/visualAttention.
The domain of Content-Based Image Retrieval (CBIR) is considered one of the most challenging domains in computer vision, and it has been an active and fast-advancing research area over the last years. Most white-box retrieval methods⁴ are based on extracting points of interest using interest point detectors [54] such as Harris or Harris/Laplace, and describing them by multi-dimensional feature vectors using SIFT [35]. The set of these feature vectors is known as a bag-of-features [13], on which retrieval can be performed. Although these approaches have demonstrated a high efficiency, some weaknesses may be mentioned. The first limitation lies in the interest point detectors. Most of these detectors are based on geometric forms such as corners, blobs or junctions and consider that the interest of the image is correlated with the presence of such features. This constraint is well known as the semantic gap [47]. The second constraint mentioned in the literature concerns the SIFT descriptors. Although SIFT shows a high efficiency, scalability remains an important problem due to the large number of features generated for each image [17]. Many of them are outliers.
An alternative way of extracting regions of interest is derived from the visual attention domain. This domain has been investigated intensely over the last years and many models have been proposed [2]. In this section, we focus on bottom-up visual attention models [27, 44].
Recently, many works have been proposed to combine these domains, giving what we call "Attentive Content-Based Image Retrieval" (Attentive CBIR). This idea was introduced earlier in [40], which indicates that object recognition in human perception consists of two steps: "attentional process selects the region of interest and complex object recognition process restricted to these regions". Based on this definition, Walther [56] proposed an algorithm for image matching: his algorithm detects SIFT keypoints inside the attention regions. These regions determine a search area, whereas the matching is done on the SIFT keypoints. This approach
4 White Box Testing is a software testing method in which the internal structure/design/implementation of the item being tested is known to the tester.
was successful since they used very complex objects whose viewpoint does not change. Others, such as Frintrop and Jensfelt [22], applied SIFT descriptors directly to the attention regions. They applied their approach to robot localization: the robot has to determine its position in a known map by interpreting its sensor data, which was generated by a laser camera. Although this approach achieved an improvement in the detection rate for indoor environments, it fails in outdoor environments and open areas.
In this section, we hypothesize that attention can improve object recognition systems in terms of query run-time and information quality, since these models generate salient regions at large scales, taking the context information into account. This property of attentional models permits generating fewer salient points, regardless of the interest point detector. By contrast, these detectors extract regions of interest at small scales, resulting in several hundreds or thousands of points. This idea was presented previously by Frintrop [21], who indicated that the task of object recognition becomes easier if an attentional mechanism first cues the processing on regions of potential interest; this is for two reasons: first, it reduces the search space and hence the computational complexity. Second, most recognition and classification methods work best if the object occupies a dominant portion of the image.
Many challenges have been proposed to test the efficiency and robustness of recognition methods. One of the most popular challenges is the Visual Object Classes (VOC) Challenge. VOC was proposed for the first time in 2005 with one objective: recognizing objects from a number of visual object classes in realistic scenes [13]. Since then, it has been organized every year, integrating new constraints in order to provide a standardized database to researchers.
In 2005, twelve algorithms were proposed to compete in the challenge; it is interesting to mention that all algorithms were based on local feature detection. We propose a taxonomy in Table 1. Finally, INRIA-Zhang appeared to be the most efficient white-box method. We decided to take it as the reference algorithm for object recognition. The algorithm, shown in Fig. 5, consists of extracting an invariant image representation and classifying this data with non-linear support
vector machines (SVM) with a χ²-kernel. This algorithm can be divided into three parts:
1. Sparse image representation: this part extracts a set of SIFT keypoints K_Zhang(I) from an input image I(x, y). It consists of two steps:
• Interest point detectors: Zhang uses two complementary local region detectors to extract interesting image structures: the Harris-Laplace detector [37], dedicated to corner-like regions, and the Laplacian detector [34], dedicated to blob-like regions. These two detectors have been designed to be scale invariant.
with k(x_i, x) the kernel function value for the training sample x_i and the test sample x, α_i the learned weight coefficient for the training sample x_i, and b the learned threshold. Finally, to compute the efficiency of the algorithm, the SVM score has been considered as a confidence measure for a class.
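A sketch of this classification stage follows; it assumes bag-of-features histograms as input, and the kernel bandwidth gamma as well as the use of scikit-learn are illustrative choices, not Zhang's original implementation.

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def chi2_svm_confidence(train_hists, train_labels, test_hists, gamma=1.0):
    """Train a non-linear SVM with a chi-square kernel on the training
    histograms and return the decision values of the test histograms,
    used as per-class confidence scores."""
    K_train = chi2_kernel(train_hists, gamma=gamma)
    clf = SVC(kernel="precomputed").fit(K_train, train_labels)
    K_test = chi2_kernel(test_hists, train_hists, gamma=gamma)
    return clf.decision_function(K_test)
```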
The filtering process by itself consists of selecting the subset K_Filtered(I) of keypoints in K_Zhang(I) for which the mask M (computed from the saliency map H(I) and a threshold) is on. This subset K_Filtered(I) serves as input for the next parts of Zhang's algorithm for object recognition. In the following, we will verify whether Attentive CBIR can produce a meaningful enhancement.
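A minimal sketch of this filtering step, under the assumption that the binary mask is obtained by thresholding the saliency map at a fraction of its maximum (the exact binarization rule is not detailed here):

```python
import numpy as np

def filter_keypoints(keypoints, saliency_map, threshold=0.5):
    """Keep only the keypoints (x, y) that fall where the binarized
    saliency mask is on."""
    mask = saliency_map >= threshold * saliency_map.max()
    return [(x, y) for (x, y) in keypoints
            if mask[int(round(y)), int(round(x))]]
```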
To validate our hypothesis, we implemented our approach and evaluated it on the VOC 2005 database. The VOC challenge proposed two image subsets: the subset S1 with selected images and another subset S2 with randomly selected Google images. Thus, our approach can be applied independently during learning and during the test process. We evaluate the binary classification using the Receiver Operating Characteristic (ROC) curve [16]. With this curve, it is easy to observe the trade-off between two measures: the proportion of false positives, plotted on the x-axis, showing how many times a method says the object class is present when it is not; and the proportion of true positives, plotted on the y-axis, showing how many times a method says the object class is present when it is.
In Fig. 7, some ROC curves are shown. As can be seen in Table 2, the value of the threshold parameter has a great impact on the decimation of the keypoints: the higher it is, the fewer keypoints are kept. These curves present the results of our evaluation method for two computational attention models: Itti's and ours. The idea here is to develop two Attentive CBIR models and to test their efficiency, not to evaluate all the existing methods:
• P/P+Zhang: this system represents the combination of our model with Zhang's nominal algorithm.
• Itti+Zhang: this system represents the combination of Itti's model with Zhang's nominal implementation.
Fig. 7 ROC curve with and without our filter approach for two different classes
Fig. 8 ROC curve with and without our filter approach for the different classes-S1
we did exhaustively: we selected only the "best" and "worst" curves. They present results for each class with, respectively, Zhang's original score as reported in the challenge summary, our implementation of Zhang's algorithm without filtering, and several filtering levels. In this section, we have shown that Attentive CBIR can improve the query run-time and information quality in object recognition. To this end, we proposed our approach for selecting the most relevant SIFT keypoints according to human perception, using our visual attention system. Testing this approach on VOC 2005 demonstrated that we can maintain approximately the same performance by selecting only 40% of the SIFT keypoints. Based on this result, we propose this approach as a first step to solve problems related to the management of memory and query run-time for recognition systems based on white-box detectors.
easy to include some new features such as texture, and top-down information (skin, face, car, hand gesture [46], . . . ); actually, it is just a new kind of prey. We finally have presented new and complementary evaluations of our approach: an up-to-date benchmark and an Attentive CBIR application.
Nevertheless, as says the fox in The Little Prince: "It is only with the heart that one can see rightly; what is essential is invisible to the eye". . .
References
16. Fawcett, T.: An introduction to ROC analysis. Pattern Recogn. Lett. 27(8), 861–874
(2006). doi:10.1016/j.patrec.2005.10.010. http://linkinghub.elsevier.com/retrieve/pii/
S016786550500303X
17. Foo, J.J.: Pruning SIFT for scalable near-duplicate image matching. In: Australasian Database
Conference, Ballarat, p. 9 (2007)
18. Fox, M.D., Snyder, A.Z., Vincent, J.L., Raichle, M.E.: Intrinsic fluctuations within cortical
systems account for intertrial variability in human behavior. Neuron 56(1), 171–184 (2007).
doi:10.1016/j.neuron.2007.08.023
19. Frieden, B.R.: Physics from Fisher Information: A Unification. Cambridge University Press,
Cambridge (1998)
20. Frieden, B.R.: Science from Fisher Information: A Unification, Cambridge edn. Cambridge
University Press, Cambridge (2004). http://www.amazon.com/Science-Fisher-Information-
Roy-Frieden/dp/0521009111
21. Frintrop, S.: Towards attentive robots. Paladyn. J. Behav. Robot. 2, 64–70 (2011). doi:
10.2478/s13230-011-0018-4, http://dx.doi.org/10.2478/s13230-011-0018-4
22. Frintrop, S., Jensfelt, P.: Attentional landmarks and active gaze control for visual slam. IEEE
Trans. Robot. 24(5), 1054–1065 (2008). doi:10.1109/TRO.2008.2004977
23. Frintrop, S., Klodt, M., Rome, E.: A real-time visual attention system using integral images. In:
5th International Conference on Computer Vision Systems (ICVS). Applied Computer Science
Group, Bielefeld (2007)
24. Gilles, S.: Description and Experimentation of Image Matching Using Mutual Information.
Robotics Research Group, Oxford University (1996)
25. Histace, A., Ménard, M., Courboulay, V.: Selective image diffusion for oriented pattern
extraction. In: 4th International Conference on Informatics in Control, Automation and
Robotics (ICINCO), France (2008). http://hal.archives-ouvertes.fr/hal-00377679/en/
26. Histace, A., Ménard, M., Cavaro-ménard, C.: Selective diffusion for oriented pattern extrac-
tion: application to tagged cardiac MRI enhancement. Pattern Recogn. Lett. 30(15), 1356–1365
(2009). doi:10.1016/j.patrec.2009.07.012. http://dx.doi.org/10.1016/j.patrec.2009.07.012
27. Itti, L., Koch, C., Niebur, E., Others: A model of saliency-based visual attention for rapid scene
analysis. IEEE Trans. Pattern Anal. Mach. Intell. 20(11), 1254–1259 (1998)
28. Kadir, T., Brady, M.: Saliency, scale and image description. Int. J. Comput. Vis. 45(2), 83–105
(2001)
29. Koch, C., Ullman, S.: Shifts in selective visual attention: towards the underlying neural
circuitry. Hum. Neurobiol. 4(4), 219–227 (1985)
30. Kondor, R., Jebara, T.: A kernel between sets of vectors. Mach. Learn. 361–368 (2003)
31. Laaksonen, J.: PicSOM – content-based image retrieval with self-organizing maps. Pattern
Recogn. Lett. 21(13/14), 1199–1207 (2000). doi:10.1016/S0167-8655(00)00082-9. http://
linkinghub.elsevier.com/retrieve/pii/S0167865500000829
32. Le Meur, O., Le Callet, P., Dominique, B., Thoreau, D.: A coherent computational approach
to model bottom-up visual attention. IEEE Trans. Pattern Anal. Mach. Intell. 28(5), 802–817
(2006)
33. Lesser, M., Dinah, M.: Mind as a dynamical system: implications for autism. In: Psychobiology
of Autism: Current Research & Practice (1998). http://www.autismusundcomputer.de/mind.en.
html
34. Lindeberg, T.: Feature detection with automatic scale selection. Comput. Vis. 30(2), 96 (1998)
35. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis.
60(2), 91–110 (2004). doi:10.1023/B:VISI.0000029664.99615.94. http://www.springerlink.
com/openurl.asp?id=doi:10.1023/B:VISI.0000029664.99615.94
36. Mancas, M.: Computational attention: towards attentive computers. Ph.D., Faculté Polytech-
nique de Mons (2007)
37. Mikolajczyk, K.: Scale & affine invariant interest point detectors. Int. J. Comput. Vis.
60(1), 63–86 (2004). doi:10.1023/B:VISI.0000027790.02288.f2. http://www.springerlink.
com/openurl.asp?id=doi:10.1023/B:VISI.0000027790.02288.f2
38. Murray, J.D.: Mathematical Biology: An Introduction. Springer, Berlin/Heidelberg (2003)
39. Murray, J.D.: Mathematical Biology: Spatial Models and Biomedical Applications. Springer,
New York (2003)
40. Neisser, U.: Cognitive Psychology. Appleton-Century-Crofts, New York (1967)
41. Ouerhani, N., Hugli, H.: A model of dynamic visual attention for object tracking in natural
image sequences. Lecture Notes in Computer Science, pp. 702–709. Springer, Berlin (2003)
42. Park, S.J., An, K.H., Lee, M.: Saliency map model with adaptive masking based on independent
component analysis. Neurocomputing 49(1), 417–422 (2002)
43. Perreira Da Silva, M., Courboulay, V.: Implementation and evaluation of a computational
model of attention for computer vision. In: Developing and Applying Biologically-Inspired
Vision Systems: Interdisciplinary Concepts, pp. 273–306. IGI Global, Hershey (2012)
44. Perreira Da Silva, M., Courboulay, V., Estraillier, P.: Objective validation of a dynamical and
plausible computational model of visual attention. In: IEEE European Workshop on Visual
Information Processing, Paris, pp. 223–228 (2011)
45. Perreira Da Silva, M., Courboulay, V., Estraillier, P.: Une nouvelle mesure de complexité pour
les images basée sur l’attention visuelle. In: GRETSI, Bordeaux (2011)
46. Pisharady, P., Vadakkepat, P., Loh, A.: Attention based detection and recognition
of hand postures against complex backgrounds. Int. J. Comput. Vis. 1–17 (2012).
doi:10.1007/s11263-012-0560-5. http://dx.doi.org/10.1007/s11263-012-0560-5
47. Santini, S., Gupta, A., Jain, R.: Emergent semantics through interaction in image databases.
IEEE Trans. Knowl. Data Eng. 13(3), 337–351 (2001). doi:10.1109/69.929893. http://dx.doi.
org/10.1109/69.929893
48. Sivic, J., Russell, B.C., Efros, A.A., Zisserman, A., Freeman, W.T.: Discovering objects and
their location in images. In: Proceeding of the International Conference on Computer Vision,
vol. 1, pp. 370–377 (2005)
49. Tatler, B.W.: The central fixation bias in scene viewing: selecting an optimal viewing position
independently of motor biases and image feature distributions. J. Vis. 7, 1–17 (2007).
doi:10.1167/7.14.4
50. Treisman, A.: Strategies and models of selective attention. Psychol. Rev. 76, 282–299 (1969)
51. Treisman, A., Gelade, G.: A feature-integration theory of attention. Cogn. Psychol. 12(1),
97–136 (1980)
52. Treisman, A.M., Kanwisher, N.G.: Perceiving visually presented objects: recognition, aware-
ness, and modularity. Curr. Opin. Neurobiol. 8(2), 218–226 (1998). http://linkinghub.elsevier.
com/retrieve/pii/S0959438898801438
53. Tsotsos, J.K., Culhane, S.M., Kei Wai, W.Y., Lai, Y., Davis, N., Nuflo, F.: Modeling visual
attention via selective tuning. Artif. Intell. 78(1-2), 507–545 (1995)
54. Tuytelaars, T., Mikolajczyk, K.: Local invariant feature detectors: a survey. Found.
Trends Comput. Graph. Vis. 3(3), 177–280 (2007). doi:10.1561/0600000017. http://www.
nowpublishers.com/product.aspx?product=CGV&doi=0600000017
55. Volterra, V.: Variations and fluctuations of the number of individuals in animal species living
together. ICES J. Mar. Sci. 3(1), 3–51 (1928)
56. Walther, D.: Selective visual attention enables learning and recognition of multiple
objects in cluttered scenes. Comput. Vis. Image Underst. 100(1-2), 41–63 (2005).
doi:10.1016/j.cviu.2004.09.004
57. Wolfe, J.M., Cave, K.R., Franzel, S.L.: Guided search: an alternative to the feature integration
model for visual search. J. Exp. Psychol. Hum. Percept. Perform. 15(3), 419–433 (1989). http://
www.ncbi.nlm.nih.gov/pubmed/2527952
58. Zhao, X., Hou, Y., Song, D., Li, W.: Extending the extreme physical information to universal
cognitive models via a confident information first principle. Entropy 16(7), 3670–3688 (2014)
Image Retrieval Based on Query by Saliency
Content
1 Introduction
(histograms), which preserve the colour distributions but have the tendency to hide
information relating to smaller areas of interest that may carry a lot of importance.
In addition, in order to take full advantage of the retrieval by object, the user is
heavily involved in the database population. The later versions of QBIC included
the automatic segmentation of the foreground and background in order to improve
the retrieval.
CANDID (Comparison Algorithm for Navigating Digital Image Databases)
[25] image retrieval represented the global colour distribution in the image as a
probability density function (pdf) modelled as a Gaussian mixture model (GMM).
The idea originated in text document retrieval systems, where the similarity measure
was simply the dot product of two feature vectors [39]. For images, the local features
such as colour, texture and shape were computed for every pixel and then clustered
with the k-means algorithm which defined the GMM’s components and parameters.
The similarity measure was then based on the dot product, representing the cosine
of the angle between the two vectors. The background was considered as another
pdf which was subtracted from each signature during the similarity computation.
This method was applied in narrow image domains such as for retrieving aerial data
and medical greyscale images.
The Chabot [27] system combined the use of keywords and simple histograms
for the retrieval task. The system was highly interactive and utilized a relational
database that would eventually store around 500,000 images. For the retrieval
performance, the RGB colour histograms were quantised to 20 colours, which was
sufficient for qualitative colour definition during query with the keywords as the
primary search method. The MIT Photobook [32] system took an entirely different
approach to the retrieval of faces, shapes and textures. The system performed the
Karhunen-Loeve transform (KLT) on the covariance matrix of image differences
from the mean image of a given training set, while extracting the eigenvectors
corresponding to the largest eigenvalues. These vectors would represent the proto-
typical appearance of the object category and images can be efficiently represented
as a small set of coefficients. The similarity between objects is computed as a
Euclidean distance in the eigenspaces of the image representations. The VisualSEEk
[37] system combines image colour feature-based querying and spatial layouts. The
spatial organisation of objects and their relationships in an image are important
descriptive features that are ignored by simple colour histogram representation
methods. VisualSEEk identifies areas in a candidate image, whose colour histogram
is similar to that of the query.
Certain top-down, CBIR approaches employ machine learning techniques for
the relevance feedback such as the support vector machine (SVM) [4] or multiple
instance learning [33]. Image ranking for retrieval systems has been performed by
using integrated region matching (IRM) [44] and the Earth Mover’s Distance (EMD)
[23, 34]. Deep learning has lately emerged as a successful machine learning approach
to a variety of vision problems. The application of deep learning to CBIR was
discussed in [42, 45].
The focus of this work is to analyze and evaluate the effectiveness of a bottom-up
CBIR system that employs saliency to define the regions of interest and to perform
localised retrieval in the broad image domain. Visual saliency was considered for
CBIR in [14, 30] as well. This chapter is organized as follows. In Sect. 2 we
introduce the modelling framework for visual attention, while in Sect. 3 we present
the initial processing stages for the Query by Saliency Content Retrieval (QSCR)
methodology. The way saliency is taken into account by QSCR is explained in
Sect. 4. The ranking of images based on their content is outlined in Sect. 5. The
experimental results are provided in Sect. 6 and the conclusions of this study are
drawn in Sect. 7.
2 Visual Attention Modelling

Looking for a specific object of interest amidst many others, such as a book on
a shelf or a key on a keyboard, may be guided by prior knowledge, for example
of the title or the authors of that book. Top-down attention driven by memories
may even suppress bottom-up attention in order to reduce distraction by salient
regions. Recently, memorisation studies have been undertaken in order to identify
the reasoning behind the visual search [40, 41].
Visual attention is a diverse field of research and there are several models that
have been proposed. Visual attention can be defined as either space-based or object-
based. Spatial-based attention selects continuous spatial areas of interest, whereas
object-based attention considers whole objects as driving the human attention.
Object-based attention aims to address some of the disadvantages of spatial models
such as their imprecision in selecting certain non-salient areas. Spatial models
may select different parts of the same object as salient which means that the
attention focus is shifted from one point to another in the image, whereas object-
based attention considers a compact area of the image as the focus of attention.
Applications of spatial-based attention to Content Based Image Retrieval tasks have
been prevalent, whilst object-based attention has not received similar interest from
the Image Retrieval community.
One of the main computer vision tasks consists of image understanding which
leads to attempting to model or simulate the processing used by the human
brain. Computational attention models aim to produce saliency maps that identify
salient regions of the image. A saliency map relies on firstly finding the saliency
value for each pixel. Salient region identification approaches fall into two main
categories. The first category is based on purely computational principles such
as the detection of interest points. These are detected using corner detectors and
are robust under some image transformations, but are sensitive to image texture
and thus would generalize poorly. Textured regions contain more corners but they
are not necessarily more salient. Other computational approaches are based on
image complexity assuming that homogeneous regions have lower complexity than
regions of high variance. Some computational methods use either the coefficients
of the wavelet transform or the entropy of local intensities [24]. Once again, such
approaches assume that textured regions are more salient than others, which is
not necessarily true. A spectral approach was used in [19], while [17] proposed a
bottom-up model based on the maximisation of mutual information. A top-down
classification method was also proposed in [17], classifying areas as either
interesting or non-interesting.
The biologically influenced computational models of attention represent the
second category of saliency models. This category further splits into two sub-
categories: biologically plausible and biologically inspired. Biologically plausible
models are based on actual neurological processes occurring in the brain, whereas
biologically inspired models do not necessarily conform to the neurological model.
Generally, these models consist of three phases: the feature extraction, the compu-
tation of activation maps, and the normalization and recombination of the feature
maps into a single saliency map [20].
3 The Query by Saliency Content Retrieval Framework

Content based image retrieval (CBIR) involves using an image as a model or query
in order to search for similar images from a given pool of images. CBIR relies on the
image content as a base of information for search, whilst defining image similarity
remains a challenge in the context of human intent. In bottom-up computational
analysis of images, the content is considered as being represented by statistics of
image features. In this chapter we explain the Query by Saliency Content Retrieval
(QSCR) method, which considers that the visual attention is a determinant factor
which should be considered when initiating the image search. Firstly, we have a
training stage in which characteristic image features are extracted from image
regions corresponding to various categories of images from a training set. In the
retrieval stage we rank the images, which are available from a database, according
to a similarity measure. The scheme of the proposed QSCR system is provided in
Fig. 2. The main parts of the QSCR system consist of image segmentation, feature
extraction, saliency modelling and evaluating the distance in the feature space
between a query image and a sample image from the given pool of images [29].
3.1 Mean Shift Segmentation

The mean shift segmentation algorithm is a well known clustering algorithm relying
on kernel density estimation. This algorithm is a density mode-finding algorithm
[9, 10] without the need to estimate explicitly the probability density. A typical
kernel density estimator is given by
f(x) = \frac{c_{k,d}}{n h^{d}} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)    (1)
where n is the number of data points, h is the bandwidth parameter, d is the number
of dimensions, ck;d is a normalizing constant and K.x/ is the kernel. The multivariate
Gaussian function is considered as a kernel in this study. The mode of this density
estimate is defined by rf .x/ D 0. A density gradient estimator can be obtained
by taking the gradient of the density estimator. In case of multivariate Gaussian
estimator, it will be
\nabla f(x) = \frac{2 c_{k,d}}{n h^{d+2}} \sum_{i=1}^{n} (x_i - x)\, K\left(\frac{x - x_i}{h}\right)    (2)
and then derive a center updating vector called the mean shift vector:
m_{h,G}(x) = \frac{\sum_{i=1}^{n} x_i K\left(\frac{x - x_i}{h}\right)}{\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)} - x    (3)
Fig. 2 The query by saliency content retrieval (QSCR) system using visual attention
m_{h,G}(x) = \frac{h^2 c}{2}\, \frac{\nabla f(x)}{f(x)}    (4)
The mean shift algorithm stops when the mean shift vector becomes zero and consequently
there is no change in the cluster center defining the mode. When the algorithm starts
with too many initial clusters, several of these converge to the same mode and
consequently all but the ones corresponding to the real modes can be removed.
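As a minimal illustration of the mode-seeking iteration in (1)–(4), the following Python sketch moves a set of points uphill on a Gaussian kernel density estimate. It is not the full image segmenter considered here, which operates in a joint spatial–range domain and merges duplicate modes; the bandwidth h and the toy data are illustrative.

```python
import numpy as np

def mean_shift_modes(points, h, n_iter=100, tol=1e-5):
    """Move every point uphill on the kernel density estimate (cf. Eq. 3)."""
    modes = points.astype(float).copy()
    for _ in range(n_iter):
        shifted = np.empty_like(modes)
        for k, x in enumerate(modes):
            # Gaussian kernel weights K((x - x_i)/h)
            w = np.exp(-np.sum((x - points) ** 2, axis=1) / (2.0 * h ** 2))
            shifted[k] = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.abs(shifted - modes).max() < tol:   # mean shift vector is (nearly) zero
            modes = shifted
            break
        modes = shifted
    return modes

# toy usage: two well separated clusters converge to two modes
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
modes = mean_shift_modes(pts, h=0.5)
# points converging to the same maximum can then be merged into one cluster
print(np.unique(np.round(modes, 1), axis=0))
```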
3.2 Feature Extraction

Each image is resized and then segmented into regions as described in the previous
section. For each image region a characteristic feature vector is calculated, with
entries representing statistics of colour, contrast, texture information, the region
neighbourhood information and region’s centroid.
Firstly, six entries characterizing the colour are represented by the median values
as well as the standard deviations for the L*a*b* colour components calculated
from the segmented regions. The median estimator is well known as a robust
statistical estimator, whilst the variance represents the variation of that feature in
the image region. The L*a*b* is well known as a colour space defining the human
perception of colours. The Daubechies 4-tap filter (Db4) is used as a Discrete
Wavelet Transform (DWT) [26] function for characterizing texture in images.
The 4 in Db4 indicates the number of coefficients describing the filter, which has
two vanishing moments. A larger number of coefficients would be useful
when analysing signals with fractal properties which are also characterized by self-
similarity. Db4 wavelets are chosen due to their good localisation properties, very
good texture classification performance [6], high compactness, low complexity, and
efficient separation between image regions of high and low frequency. Moreover
Daubechies wavelet functions are able to capture smooth transitions and gradients
much better than the original Haar wavelets, which are not continuous and are
sensitive to noise. The lower level decompositions are up-scaled to the size of the
image by using bicubic interpolation and then by averaging the pixel values across
the three scales and for each direction. Three entries represent the texture energy
measured as the average of the absolute values of the DWT coefficients of the region
in the horizontal, vertical and oblique directions across the three image scales, [6].
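A possible computation of the three directional texture energies is sketched below with PyWavelets. It assumes a rectangular patch rather than an arbitrarily shaped region, and the multiscale up-scaling and averaging used by the authors may differ in detail.

```python
import numpy as np
import pywt

def texture_energy(patch, levels=3):
    """Average absolute Db4 detail coefficients per direction (H, V, D)."""
    coeffs = pywt.wavedec2(patch, "db4", level=levels)
    energy = np.zeros(3)
    for cH, cV, cD in coeffs[1:]:          # detail sub-bands at each scale
        energy += [np.abs(cH).mean(), np.abs(cV).mean(), np.abs(cD).mean()]
    return energy / levels                 # average across the three scales

# toy usage on a synthetic patch with a strong horizontal oscillation
patch = np.sin(np.linspace(0, 20, 64))[None, :] * np.ones((64, 1))
print(texture_energy(patch))
```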
The human visual system is more sensitive to contrast than to absolute brightness.
Generally, the contrast is defined as the ratio between the difference in local
brightness and the average brightness in a region. In order to increase its robustness,
the contrast is computed as the ratio between the inter-quartile range and the
median of the L* (luminance) component for each segmented region. The locations
of the centers for each region are calculated as the averages of pixel locations
from inside each compactly segmented region. These values are normalised by the
image dimension in order to obtain values in the interval [0,1]. By giving more
importance to the centroid locations, candidate images that best satisfy the spatial
layout of the query image can be retrieved. However, placing too much importance
on the location may actually decrease the precision and this is the reason why
this feature is considered, together with the region neighbourhood, as a secondary
feature, characterised by a lower weight in the calculation of the similarity measure.
The region neighbourhood can provide very useful information about the image
context. The neighbourhood consistency is represented by the differences between
the L*, a* and b* values of the given region and those of its most important
neighbouring regions located above, below, left and right, where the neighbouring
significance is indicated by the size of the boundary between two regions, [33].
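The remaining region entries can be assembled as in the following sketch, which computes the colour medians and standard deviations in L*a*b*, the robust contrast ratio and the normalised centroid for one region mask. The helper name region_features and the toy inputs are illustrative, not part of the published implementation.

```python
import numpy as np
from skimage.color import rgb2lab

def region_features(image_rgb, mask):
    """Colour, contrast and centroid entries for one segmented region (sketch)."""
    lab = rgb2lab(image_rgb)
    pix = lab[mask]                                    # (n_pixels, 3): L*, a*, b*
    colour = np.concatenate([np.median(pix, axis=0), pix.std(axis=0)])
    L = pix[:, 0]
    q75, q25 = np.percentile(L, [75, 25])
    contrast = (q75 - q25) / (np.median(L) + 1e-6)     # inter-quartile range / median
    ys, xs = np.nonzero(mask)
    h, w = mask.shape
    centroid = np.array([ys.mean() / h, xs.mean() / w])  # normalised to [0, 1]
    return np.concatenate([colour, [contrast], centroid])

# toy usage: a flat grey image with a square "region"
img = np.full((32, 32, 3), 0.5)
m = np.zeros((32, 32), dtype=bool); m[8:24, 8:24] = True
print(region_features(img, m).round(3))
```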
4 Saliency Modelling

Based on the assumption that salient regions capture semantic concepts of an image,
the goal of computing visual saliency is to detect such regions of interest so that
they can be used as a search query. Saliency maps must concisely represent salient
objects or regions of the image. In Fig. 3 we present an example of retrieving
the Translucent Bowl (TB) image from the SIVAL database without visual attention
models compared to when visual attention models are used, assuming identical image
features. As it can be observed, when using visual attention models, all of the first six
retrieved images, and eight out of the total of nine, correspond to the TB category,
whereas without visual attention models only the seventh retrieved image is from the
correct category and none of the other eight are.
Saliency is used to identify which image regions attract the human visual
attention and consequently should be considered in the image retrieval. Saliency
is defined in two ways: at the local and at the global image level [29]. The former is
defined by finding salient regions, while the latter is defined by the salient edges
in the entire image. The regions which are salient would have higher weights
when considering their importance for retrieval while the salient edges are used
as a constraint for evaluating the similarity of the query image to those from the
given pool of images as shown in Fig. 2.
4.1 Salient Edges

In order to capture the global salient properties of a given image we consider the
salient edges as in [14]. Firstly, the image is split into 16 × 16 pixel blocks, called
sub-images. Salient edges are represented by means of the MPEG-7 Edge Histogram
Descriptor (EHD) which is translation invariant. This represents the distribution
along four main directions as well as the non-directional edges occurring in the
image. Edges corresponding to each of these directions are firstly identified and
then their density is evaluated for each sub-image region. The EHD histogram is
represented by five values representing the mean of the bin counts for each edge
type across the 16 sub-images. Each value represents the evaluation of the statistics
for each of the edge orientations: vertical, horizontal, the two diagonal directions
at 45° and 135°, and the non-directional edges. The bin counts correspond to a specific
directional edge energy and consequently the mean is an estimate that would capture
it without any specific image location constraint.
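An EHD-like descriptor can be sketched as follows. The 2 × 2 edge filters, the activation threshold and the 4 × 4 grid of sub-images follow the MPEG-7 recipe only loosely and may differ from the exact implementation used here.

```python
import numpy as np
from scipy.ndimage import convolve

# 2x2 MPEG-7-style edge filters: vertical, horizontal, 45 deg, 135 deg, non-directional
FILTERS = np.array([
    [[1, -1], [1, -1]],
    [[1, 1], [-1, -1]],
    [[np.sqrt(2), 0], [0, -np.sqrt(2)]],
    [[0, np.sqrt(2)], [-np.sqrt(2), 0]],
    [[2, -2], [-2, 2]],
])

def edge_histogram(gray, thresh=0.1):
    """Mean density of each of the five edge types over the 16 sub-images."""
    responses = np.stack([np.abs(convolve(gray, f)) for f in FILTERS])
    winner = responses.argmax(axis=0)          # dominant edge type per pixel
    active = responses.max(axis=0) > thresh    # only sufficiently strong edges count
    h, w = gray.shape
    hist = np.zeros(5)
    for by in range(4):
        for bx in range(4):
            sub = (slice(by * h // 4, (by + 1) * h // 4),
                   slice(bx * w // 4, (bx + 1) * w // 4))
            n = active[sub].size
            for e in range(5):
                hist[e] += np.sum(active[sub] & (winner[sub] == e)) / n
    return hist / 16.0    # mean bin count per edge type across the 16 sub-images

# toy usage: a single vertical intensity edge dominates the first bin
img = np.zeros((64, 64)); img[:, 32:] = 1.0
print(edge_histogram(img).round(3))
```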
4.2 Computational Models of Saliency

Known computational models of visual saliency are the Itti-Koch (IK) [22], Graph-
Based Visual Saliency (GBVS) [18], which is the graph-based normalisation
of the Itti-Koch model, the Saliency Using Natural statistics (SUN) [47], and
the Frequency-Tuned Saliency (FTS) [1]. The first three methods produce low-
resolution saliency blur maps that do not provide clear salient region boundaries.
FTS, on the other hand, produces full resolution maps with clear boundaries,
however, unlike the first three methods it only uses colour information, so it may
fail to identify any salient regions when all objects in the image have the same
colour.
The Graph-Based Visual Saliency (GBVS) method [18] was chosen due to its
good results in modelling saliency in images. The GBVS saliency extraction method
is a computational approach to visual saliency based on the Itti-Koch model, but it
takes a different approach to the creation of activation maps and their normalisation.
Unlike the Itti-Koch model, which computes activation maps by center-surround
differences of image features [22], GBVS applies a graph-based approach [18].
Generally, saliency maps are created in three steps: feature vectors are extracted for
every pixel to create feature maps, then activation maps are computed, and finally
the activation maps are normalized and combined. The image is converted into a
representation suitable for the computation of the feature contrasts. Feature dyadic
Gaussian pyramids are produced at three image scales of 2:1, 3:1, and 4:1. Gaussian
pyramids are created for each channel of physiologically based DKL colour space
[12], which has similar properties to L*a*b*. Orientation maps are then produced
after applying Gabor filters at the orientations of {0, π/4, π/2, 3π/4} for
every scale of each colour channel. The outputs of these Gabor filters represent the
features which are then used as inputs in the GBVS algorithm.
In the first level of representation in GBVS, adjacency matrices are constructed
by connecting each pixel of the map to all the other pixels, excluding itself,
by using the following dissimilarity function w_1(M_x, M_y) between feature map values
corresponding to the pixels located at x and y:

w_1(M_x, M_y) = \left| \log \frac{M_x}{M_y} \right| \exp\left( -\frac{\| x - y \|^2}{2 \sigma^2} \right)    (5)
where σ ∈ [0.1D, 0.2D], with D representing the given map width. A Markov chain
is defined over this adjacency matrix, where the weights of outbound edges
are normalized to [0, 1], by assuming that graph nodes are states and edges
are transition probabilities. Computing the equilibrium distribution yields an
activation map, where large values are concentrated in areas of high activation and
thus indicate the saliency in the image. The resulting activation map is smoothed
and normalized. A new graph is constructed onto this activation map, with each
node connected to all the others, including itself, with the edge weights given by:
w_2(M_x, M_y) = A(x) \exp\left( -\frac{\| x - y \|^2}{2 \sigma^2} \right)    (6)
where A(x) corresponds to the activation map value at location x. The normalization
of the activation maps leads to emphasizing the areas of true dissimilarity, while
suppressing non-salient regions. The resulting saliency map for the entire image
is denoted as S(x) for each location x and represents the sum of the normalized
activation maps for each colour and each local orientation channel as provided by
the Gabor filters.
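A minimal sketch of the activation step of (5) is given below: the fully connected graph over a single feature map is turned into a Markov chain and its equilibrium distribution is obtained by power iteration. Here σ defaults to 0.15 of the map width, and the small epsilon that keeps the rows stochastic on flat maps is an implementation convenience, not part of the published model.

```python
import numpy as np

def gbvs_activation(feature_map, sigma=None, n_iter=200):
    """Activation map as the equilibrium distribution of the graph of Eq. (5)."""
    h, w = feature_map.shape
    if sigma is None:
        sigma = 0.15 * w                      # sigma in [0.1, 0.2] x map width
    ys, xs = np.mgrid[0:h, 0:w]
    pos = np.column_stack([ys.ravel(), xs.ravel()]).astype(float)
    M = feature_map.ravel().astype(float) + 1e-9
    # w1(x, y) = |log(M_x / M_y)| * exp(-||x - y||^2 / (2 sigma^2))
    dissim = np.abs(np.log(M[:, None] / M[None, :]))
    d2 = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(-1)
    W = dissim * np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                  # each node connects to all the others
    W = W + 1e-12                             # keeps rows stochastic on flat maps
    P = W / W.sum(axis=1, keepdims=True)      # outbound weights -> transition probabilities
    pi = np.full(W.shape[0], 1.0 / W.shape[0])
    for _ in range(n_iter):                   # power iteration for the equilibrium
        pi = pi @ P
    return pi.reshape(h, w)

# toy usage: a single "odd" cell attracts the equilibrium mass
fmap = np.ones((8, 8)); fmap[3, 4] = 5.0
act = gbvs_activation(fmap)
print(np.unravel_index(act.argmax(), act.shape))
```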
In Fig. 4 we show a comparison of saliency maps produced by four saliency algo-
rithms: Itti-Koch (IK) [22], Graph-Based Visual Saliency (GBVS) [18], Saliency
using Natural Statistics (SUN) [47], and Frequency-Tuned Saliency (FTS) [1]. It
can be seen that IK produces small highly focused peaks in saliency that tend to
concentrate on small areas of the object. The peaks are also spread across the image
spatially. This is because the Itti-Koch model was designed to detect areas to which
the focus of attention would be diverted. Because the peaks do not capture the whole
object, but rather small areas of it, it is insufficient for representing the semantic
concept of the image and is not suitable for retrieval purposes.
Fig. 4 Evaluation of saliency performance. Original images are in the first row, Itti-Koch saliency
maps are in the second row, GBVS maps are in the third row, SUN maps are in the fourth row, and
FTS maps are in the fifth row. Saliency maps are overlaid on the original images
The image selections produced by GBVS provide a good coverage of the salient
object by correctly evaluating the saliency. It has a good balance between coverage
and accuracy; the results sit in between those of IK and SUN. Unlike the other three methods,
GBVS provides a high-level understanding of the whole image and its environment,
in the sense that it does not get distracted by the local details, which may result in
false positives. It is able to achieve this because it models dissimilarity as a transition
probability between nodes of a graph, which means that most of the time, saliency
is provided by the nodes with the highest transition probability. It can be seen from
the mountain landscape image in the last column that saliency is correctly indicated
at the lake and sky, despite not having an evident object in that region of the image.
In the second image where the red bus fills most of the image, GBVS recognises
the area surrounding the door as most salient, compared to SUN algorithm, which
considers the whole image as salient. It appears that the SUN algorithm only works
well with the simplest of images, such as the third image showing a snowboarder
on snow. In the first image, showing a horse, which is only slightly more difficult,
the SUN algorithm correctly identifies the head of the horse, its legs, and
tail. However, it also selects the trees, which are not that relevant for the retrieval
of such images. The large amount of false positives, apparent bias, and lack of
precision makes SUN an unsuitable choice for retrieval in the broad image domain,
but perhaps it could prove itself useful in specialised applications. The FTS algorithm,
which represents a simple colour difference, only works well when there is a salient
colour object in the image, and the image itself has little colour variation, such
that the average colour value is close to that of the background. As it uses no
other features than the colour, it lacks the robustness of other methods, but works
extremely well when its conditions are met. As seen with the bus in the second
image, its downside is that it does not cover salient objects when there is a lot of
colour variation within, hence failing to capture the semantic concept. One of the
problems with local contrast-based saliency algorithms is that they may misinterpret
the negative space around true salient objects as the salient object.
GBVS is chosen for saliency computation in this study because of its robustness,
accuracy, and coverage. One downside is that it does not produce full resolution
saliency maps due to its computational complexity. During the up-scaling, blurred
boundaries are produced, which means that saliency spills into adjacent regions and
so marks them as salient, albeit to a smaller extent.
4.3 Selection of Salient Regions

In this study we segment the images using the mean-shift algorithm
described in Sect. 3.1 and we aim to identify which of the segmented regions are
salient. The purpose of the saliency maps is to select those regions that correspond
to the salient areas of the image, which are to be given a higher importance in
the querying procedure. For distinctive objects present in images, this represents
selecting the object’s regions, whereas for distinctive scenes it would come down
to selecting the object and its neighbouring regions. Several approaches have
been attempted to select an optimal threshold on the saliency energy of a region
as the sum of the saliencies of all its component pixels. An optimal threshold
would be the one that maximises the precision of retrieval rather than the one
that accurately selects regions corresponding to salient objects. This is somewhat
counter-intuitive, as one would think that specifying well-defined salient objects
would improve the precision, but due to the semantic gap, this is actually not always
the case. In the Blobworld image retrieval method [5], images are categorized as
distinctive scenes or distinctive objects, or both. However, it was remarked that when
considering CBIR in some image categories it would be useful to include additional
contextual information and not just the salient object.
Firstly, we consider selecting regions that contain a certain percentage of salient
pixels, where salient pixels are those defined by S(x) > \tau_p, with \tau_p a saliency
threshold. The average region saliency is calculated from the saliency of its component
pixels as:

S(r_i) = \sum_{x \in r_i} \frac{S(x)}{N_r}    (7)

where N_r denotes the number of pixels in the region.
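A direct transcription of (7), together with the hard pixel threshold τ_p, is sketched below. The 80th-percentile threshold in the usage example mirrors the top-20% pixel selection discussed later and is only illustrative.

```python
import numpy as np

def region_saliency(saliency_map, region_mask, tau_p):
    """Average pixel saliency of a region (Eq. 7) and its ratio of salient pixels."""
    s = saliency_map[region_mask]
    avg = s.sum() / s.size                 # Eq. (7): mean pixel saliency
    frac_salient = np.mean(s > tau_p)      # fraction of pixels above the threshold
    return avg, frac_salient

# toy usage: threshold at the 80th percentile of pixel saliency (top 20% of pixels)
rng = np.random.default_rng(1)
smap = rng.random((64, 64))
tau_p = np.percentile(smap, 80)
mask = np.zeros((64, 64), dtype=bool); mask[10:30, 10:30] = True
print(region_saliency(smap, mask, tau_p))
```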
Fig. 5 Comparison of saliency cut-offs (1) Original image, (2) GBVS saliency map (SM), (3)
Otsu’s method, (4) Top 40%, (5) Top 20%, (6) Cut-off at 0.61, (7) Cut-off at twice the average
saliency as in [1]
Fig. 6 Empirical cumulative distribution functions (CDF) calculated for all images from COREL
1000 database. (a) Pixel saliency. (b) Salient regions
points, respectively. The 60th percentile produces similar results to Otsu’s method,
but on some occasions includes too many background pixels as seen in the Beach
and Dinosaur categories images, shown in the last two images from Fig. 5. The 80th
percentile, representing the selection of the top 20% of data, shows a good balance
between the two criteria where it selects a smaller subset of the pixels identified by
Otsu’s method as in the case of the Elephant and Bus categories, from the third and
ninth images and at the same time captures a good amount of background pixels
as in the Architecture and Beach category images from the sixth and tenth images.
Obviously, it is impossible to guarantee that it will capture background information
for distinctive scenes and just the object for distinctive object images, and vice versa,
but at least this method is sufficiently flexible to do so. The next method is a simple
fixed cut-off value set at the pixel value of 155, which corresponds to 60% precision
and 40% recall. By looking at the CDF of salient pixels from COREL 1000 database,
shown in Fig. 6a, this value corresponds to the 90th percentile of saliency values and
so selects only the top 10% of the data. Only small portions of the image are selected
and in many cases this fails to capture the background regions, resulting in lower
performance. An example of this is seen in the image of the walker in the snow-
covered landscape image, the Horse and the Architecture category images from
second, fourth and sixth images from Fig. 5. In all these images, the most salient
object has very little semantic information differentiating it from the others. For
example, the walker is mostly black and very little useful information is actually
contained within that region; the horse is mostly white and this is insufficient to
close the semantic gap. Achieving a balance is difficult because a method that
selects the regions of distinctive objects may fail when the image is both a distinctive
object and a distinctive scene. An example of such a situation is the Horse category
from the fourth image, where selecting the white horse by itself is too ambiguous
as there are many similarly coloured regions, but by adding several background
regions improves performance greatly. On the other hand, the performance would
be reduced by including background regions when the image is in the category of
distinctive objects. The last method, which was used in [1], sets the threshold at
twice the average saliency for the image. This approximately corresponds to the
top 15% of salient pixels from the empirical cumulative distribution for COREL
1000. This produces similar maps to the selection of regions with 20% salient pixels,
except that it captures fewer surrounding pixels.
In the following we evaluate the segmented image region saliency by considering
only the pixels which are among the top 20% most salient, which provides the
best results according to the study from [29]. By considering a hard threshold
for selecting salient pixels, the saliency of the remaining pixels is set to
zero for further processing. We then apply the region mask to the saliency map
and consider the region saliency as given by the percentage of its salient pixels.
Next, we use the saliency characteristic from all regions in the image database to
construct an empirical CDF of salient regions, considering the mean-shift for the
image segmentation, as explained in Sect. 3.1. The empirical CDF of the salient
regions for the COREL 1000 database is shown in Fig. 6b. Now, we propose to
select the most salient regions by setting the second threshold at the first point
of inflexion in the CDF curve. This corresponds to the point where the gradient
of the CDF curve begins to decrease. We observe that this roughly corresponds
to the 35th percentile and thus our method considers the top 65% of most salient
regions in the given database. We have observed that this saliency region selection
threshold removes most of the regions with little saliency, while still considering
some background regions containing the background information necessary for the
retrieval of images of distinctive scenes. Such regions are suitable for describing the
contextual semantic information.
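The database-level region threshold can be sketched as follows. Approximating the first inflexion point of the empirical CDF by the first decrease of its slope is only one possible reading of the procedure, and the synthetic saliency values are illustrative.

```python
import numpy as np

def region_threshold_from_cdf(region_saliencies, grid=100):
    """Pick a region-saliency cut-off near the first inflexion of the empirical CDF."""
    vals = np.sort(np.asarray(region_saliencies))
    xs = np.linspace(vals.min(), vals.max(), grid)
    cdf = np.searchsorted(vals, xs, side="right") / vals.size
    slope = np.gradient(cdf, xs)
    idx = np.argmax(np.diff(slope) < 0)    # first point where the CDF slope decreases
    return xs[idx]

# toy usage: many low-saliency regions and a smaller group of highly salient ones
rng = np.random.default_rng(2)
sal = np.concatenate([rng.beta(2, 8, 300), rng.beta(6, 2, 200)])
thr = region_threshold_from_cdf(sal)
keep = sal > thr                           # regions retained for the query
print(round(thr, 3), keep.mean().round(2))
```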
The methods discussed above focus on selecting representative salient query
regions. In the QSCR system we would segment the query image and would assume
that all candidate images had been previously segmented as well. The saliency of
each region in both the candidate images and the query one would then be evaluated.
Once the salient query regions are determined, they could be matched with all the
regions in the candidate images. Another approach could evaluate the saliency in
both the query and the candidate images and the matching would be performed
only with the salient regions from the candidate images. This constrains the search
by reducing the number of relevant images because the query regions are searched
by using only the information from the salient regions of the candidate images.
Theoretically, this should improve both retrieval precision and computational speed,
but in practice, the results will depend on the distinctiveness of the salient regions
because the semantic gap would be stronger due to a lack of context.
5 Similarity Ranking
Given a query image, we rank all the available candidate images according to their
similarity with the query image. The aim here is to combine the inter-region distance
matrix with the salient edge information to rank the images by their similarity
while taking into account their saliency as well. The processing stages of image
segmentation, feature extraction and saliency evaluation, described in the previous
section, and shown in the chart from Fig. 2, are applied initially to all images from a
given database. Each region I_j, j = 1, …, N_I, from every image I is characterized by
a feature vector, and by its saliency, evaluated as described in the previous Section.
Meanwhile, the energy of salient edges is evaluated for entire images. The same
processing stages are applied on the query image Q, which is segmented into several
regions Q_i, i = 1, …, M, as described in Sect. 3.1. The similarity ranking becomes
a many-to-many region matching problem which takes into account the saliency
as well. Examples of many-to-many region matching algorithms are the Integrated
Region Matching (IRM) which was used in [44] for the SIMPLIcity image retrieval
algorithm and the Earth Mover's Distance (EMD) [34]. The EMD algorithm was
chosen in this study due to its properties of optimising many-to-many matches, and
this section of the algorithm is outlined in the lower part of the diagram from Fig. 2.
In the EMD algorithm, each image becomes a signature of feature vectors
characterising each region. A saliency driven similarity measure is used between
the query image Q and a given image I, represented as the weighted sum of the
EMD matching cost function, considering the local saliency, and the global image
saliency measure driven by the salient edges, [29]:
S(Q, I) = W_{EMD} \frac{EMD(Q, I)}{\alpha_{EMD}} + W_{EHD} \frac{\sum_{\theta} \left| EHD(\theta, Q) - EHD(\theta, I) \right|}{5\, \alpha_{E}}    (8)

The saliency-weighted distance between a query image region and a candidate image region is defined as:

D(Q_i, I_j) = \gamma(S_{Q_i}, S_{I_j}) \sqrt{ \beta_P \left( \lambda_{cl} d_{cl}^2(i,j) + \lambda_{te} d_{te}^2(i,j) + \lambda_{co} d_{co}^2(i,j) \right) + \beta_S \left( \lambda_{nn} d_{nn}^2(i,j) + \lambda_{cd} d_{cd}^2(i,j) \right) }    (9)
where Q_i, i = 1, …, M, are the regions from the query image Q and I_j, j = 1, …, N,
are the regions from the candidate retrieval image I. γ(S_{Q_i}, S_{I_j}) denotes the
joint saliency weight for Q_i and I_j. d_{cl}, d_{te} and d_{co} are the Euclidean
distances between the primary features, weighted by β_P, corresponding to the colour,
texture and contrast vectors, respectively. Meanwhile, d_{nn} and d_{cd} are the
Euclidean distances between the secondary features, weighted by β_S, characterizing
the colours of the nearest neighbouring regions and the centroid locations of the
regions Q_i and I_j, respectively. Each feature distance component is normalized to
the interval [0, 1] and is weighted according to its significance for retrieval by the
global weights β_P and β_S, modulating the significance of each category of features,
and by the individual weights λ_{cl}, λ_{te}, λ_{co}, λ_{nn} and λ_{cd}, weighting the
contribution of each feature. The selection of primary and secondary features, with
β_P > β_S and β_P + β_S = 1, was performed based on computational visual attention
studies [16, 46] and following extensive empirical experimentation.
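The combination of the local and global terms in (8) reduces to a few lines, sketched below; the weights W_EMD and W_EHD, the normalisers α and the helper name qscr_similarity are illustrative placeholders rather than the published settings.

```python
import numpy as np

def qscr_similarity(emd_qi, ehd_q, ehd_i, alpha_emd, alpha_e,
                    w_emd=0.7, w_ehd=0.3):
    """Ranking score of Eq. (8); smaller values indicate a better match.
    ehd_q, ehd_i: the five salient-edge energies of the query and candidate."""
    edge_term = np.sum(np.abs(np.asarray(ehd_q) - np.asarray(ehd_i))) / (5 * alpha_e)
    return w_emd * emd_qi / alpha_emd + w_ehd * edge_term

# toy usage with made-up values
print(round(qscr_similarity(0.12, [0.1, 0.2, 0.05, 0.05, 0.3],
                            [0.12, 0.18, 0.04, 0.07, 0.28],
                            alpha_emd=1.0, alpha_e=0.5), 4))
```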
The feature modelling for each segmented region is described in Sect. 3.2 and
distances are calculated between vectors of features characterizing image regions
from the database and those of the query image. The CIEDE2000 colour distance
was chosen for the colour components of the two vectors, because it provides
a better colour discrimination according to the CIE minimal perceptual colour
difference, [36]. The colour feature distance dcl is calculated as:
"
1 E00 .i; j/ 2
i;L
j;L 2
dcl2 D 3 C
6 CDis ˛L
#
i;a
j;a 2
i;b
j;b 2
C C (10)
˛a ˛b
where ΔE_{00}(i,j) represents the CIEDE2000 colour difference [36], calculated
between the median estimates of the L*, a*, b* colour components and normalized by
the largest colour distance C_{Dis}, while {σ_{x,c} | x ∈ {i, j}, c ∈ {L, a, b}} represent
the standard deviations of each colour component, and {α_c | c ∈ {L, a, b}} are their
corresponding 95th percentiles, calculated across the entire image database, and
are used as robust normalization factors. These values are used for normalization
because the cumulative distributions of these features, extracted from segmented
image regions, can be modelled by log-normal functions.
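Eq. (10) can be evaluated with the CIEDE2000 implementation available in scikit-image, as in the sketch below; the normalisers C_Dis and α are assumed to be precomputed over the database, and the numeric inputs are made up.

```python
import numpy as np
from skimage.color import deltaE_ciede2000

def colour_distance_sq(med_i, med_j, std_i, std_j, c_dis, alpha):
    """Squared colour distance of Eq. (10) between two regions.
    med_*: median (L*, a*, b*); std_*: per-channel standard deviations;
    c_dis: largest colour distance in the database; alpha: 95th-percentile
    normalisers of the standard deviations (all assumed precomputed)."""
    de00 = deltaE_ciede2000(np.asarray(med_i, float), np.asarray(med_j, float))
    terms = 3.0 * (de00 / c_dis) ** 2
    terms += np.sum(((np.asarray(std_i) - np.asarray(std_j)) / np.asarray(alpha)) ** 2)
    return terms / 6.0

# toy usage with made-up statistics and normalisers
print(colour_distance_sq(med_i=[50, 10, 10], med_j=[55, 5, 12],
                         std_i=[4, 2, 2], std_j=[6, 3, 1],
                         c_dis=100.0, alpha=[10.0, 8.0, 8.0]))
```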
The texture distance dte corresponds to the Euclidean distance between the
average of the absolute values of DWT coefficients corresponding to the horizontal,
vertical and oblique directions for the regions Qi and Ij , divided by their correspond-
ing standard deviations calculated across the entire image database. The contrast
difference dco is represented by the normalized Euclidean distance of the contrast
features for each region from I and Q, with respect to their neighbouring regions. For
the sake of robust normalization, the distances corresponding to the texture features
as well as those representing local contrast are divided by the 95th percentiles of
the empirical cumulative distribution function of their features, computed from
a representative image set. The neighbourhood characteristic difference dnn is
calculated as the average of the resulting 12 colour space distances to the four
nearest neighbouring regions from above, below, left and right, selected such that
they maximize the joint boundary in their respective direction. The centroid distance
d_cd is the Euclidean distance between the coordinates of the region centers.
The weight corresponding to the saliency, weighting the inter-region distances
between the two image regions Q_i and I_j in (9), is given by:

\gamma(S_{Q_i}, S_{I_j}) = \max\left( 1 - \frac{S_{Q_i} + S_{I_j}}{2},\; 0.1 \right)    (11)
where S_{Q_i} and S_{I_j} represent the saliency of the query image region Q_i and that
of the candidate retrieval image region I_j, where the saliency of each region is
calculated following the analysis from Sect. 4.3 and represents the ratio of salient
pixels in each region. It can be observed that the distance D(Q_i, I_j) is smaller when
the two regions Q_i and I_j are both salient. Eventually, for all regions from Q and I,
this results in a similarity matrix D(Q_i, I_j) which defines a set of inter-region
distances between each region Q_i, i = 1, …, M, from the query image Q and each region
I_j, j = 1, …, N, from the candidate retrieval image I. The resulting inter-region similarity matrix
acts as the ground distance matrix for the EMD algorithm.
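A sketch of the saliency-weighted ground distance of (9) together with the weight of (11) is given below; the β and λ values are illustrative placeholders, not the settings tuned by the authors.

```python
import numpy as np

def saliency_weight(s_q, s_i):
    """Joint saliency weight of Eq. (11): smaller when both regions are salient."""
    return max(1.0 - (s_q + s_i) / 2.0, 0.1)

def ground_distance(d_cl, d_te, d_co, d_nn, d_cd, s_q, s_i,
                    beta_p=0.7, beta_s=0.3,
                    lam=(1/3, 1/3, 1/3, 0.5, 0.5)):
    """Saliency-weighted inter-region distance of Eq. (9).
    The beta/lambda values are illustrative, not the published settings."""
    primary = lam[0] * d_cl**2 + lam[1] * d_te**2 + lam[2] * d_co**2
    secondary = lam[3] * d_nn**2 + lam[4] * d_cd**2
    return saliency_weight(s_q, s_i) * np.sqrt(beta_p * primary + beta_s * secondary)

# toy usage: two fairly salient, fairly similar regions
print(round(ground_distance(0.2, 0.1, 0.05, 0.3, 0.1, s_q=0.8, s_i=0.7), 4))
```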
The distance matrix D(Q_i, I_j) represents the cost of moving the earth energy
associated with the image regions from Q in order to fill the gaps of energy represented
by the image regions from I. A set of weights {w_{Q,i} | i = 1, …, M} is associated with
the amount of energy corresponding to a region in the query image, while
{w_{I,j} | j = 1, …, N} are the weights corresponding to the candidate image, representing the
size of an energy gap. All these weights represent the ratio of each segmented
region's area to that of the entire image. A unit of flow is defined as the transportation of a unit
of energy across a unit of ground distance. The EMD algorithm is an optimization
algorithm which minimizes the cost required for transporting the energy to a specific
energy gap, [34]:
\min \sum_{i=1}^{M} \sum_{j=1}^{N} f_{ij} D(Q_i, I_j)    (12)
The goal of the optimization procedure is to find the flow f_{ij} between the regions
Q_i and I_j such that the cost of matching the energy from a surplus area to a deficit
of energy area is minimized, subject to the standard transportation constraints [34]:

f_{ij} \geq 0, \quad 1 \leq i \leq M, \; 1 \leq j \leq N    (13)

\sum_{j=1}^{N} f_{ij} \leq w_{Q,i}, \quad 1 \leq i \leq M    (14)

\sum_{i=1}^{M} f_{ij} \leq w_{I,j}, \quad 1 \leq j \leq N    (15)

\sum_{i=1}^{M} \sum_{j=1}^{N} f_{ij} = \min\left( \sum_{i=1}^{M} w_{Q,i}, \; \sum_{j=1}^{N} w_{I,j} \right)    (16)

After solving this system by using linear programming, the EMD distance from
(8) is calculated by normalizing the cost required:
EMD(Q, I) = \frac{\sum_{i=1}^{M} \sum_{j=1}^{N} f_{ij} D(Q_i, I_j)}{\sum_{i=1}^{M} \sum_{j=1}^{N} f_{ij}}    (17)
This represents the normalized cost of matching the query image signature with that
of the most appropriate candidate retrieval image. The weights add up to unity only
when all image regions are used. We are removing non-salient image regions, and
consequently the weights would add up to a value less than one. Such signatures
enable partial matching which is essential for image retrieval where there is a high
likelihood of occlusion in the salient regions. The computational complexity of the
proposed QSCR is contained mostly in the feature extraction stage for the given
image database which is performed off-line. The computational complexity of the
optimization algorithm can be substantially reduced when thresholding the distances
calculated by EMD, as in [31].
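The transportation problem (12)–(16) can be solved directly with a generic linear-programming routine, as sketched below with scipy.optimize.linprog; dedicated EMD solvers or the thresholded variant of [31] are considerably faster in practice, and the toy signatures are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def emd(w_q, w_i, D):
    """Earth Mover's Distance between two signatures with ground distances D.
    w_q: (M,) region weights of the query; w_i: (N,) weights of the candidate."""
    M, N = D.shape
    c = D.ravel()                                    # minimise sum f_ij * D_ij
    # sum_j f_ij <= w_q[i]  and  sum_i f_ij <= w_i[j]
    A_ub = np.zeros((M + N, M * N))
    for i in range(M):
        A_ub[i, i * N:(i + 1) * N] = 1.0
    for j in range(N):
        A_ub[M + j, j::N] = 1.0
    b_ub = np.concatenate([w_q, w_i])
    # total flow equals the smaller of the two total weights (partial matching)
    A_eq = np.ones((1, M * N))
    b_eq = [min(w_q.sum(), w_i.sum())]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    assert res.success
    flow = res.x
    return float(flow @ c) / float(flow.sum())       # normalised cost, Eq. (17)

# toy usage: three query regions matched against two candidate regions
Dg = np.array([[0.2, 0.9], [0.8, 0.1], [0.5, 0.5]])
print(round(emd(np.array([0.5, 0.3, 0.2]), np.array([0.6, 0.4]), Dg), 4))
```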
6 Experimental Results
The retrieval performance is assessed, among other measures, by the rank-weighted
precision (WPR):

WPR = \frac{1}{N} \sum_{k=1}^{N} \frac{n_k}{k}    (18)
where N is the number of all retrieved images and nk is the number of matches in the
first k retrieved images. This measure gives more weight to matched items occurring
closer to the top of the list and takes into account both precision and ranks. Ranks
can be equated to recall because a higher WPR value means that relevant images
are closer to the top of the list, therefore the precision would be high at lower recall
values because more relevant images are retrieved. However the WPR measure is
sensitive to the ratio of positive and negative examples in the database, i.e. the total
number of relevant images out of the total number of candidate images.
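A direct transcription of the rank-weighted precision (18) is given below, assuming a binary relevance list given in ranked order.

```python
import numpy as np

def rank_weighted_precision(relevant):
    """WPR of Eq. (18): relevant is a 0/1 list over the ranked retrieved images."""
    relevant = np.asarray(relevant, dtype=float)
    n_k = np.cumsum(relevant)                  # matches within the first k results
    k = np.arange(1, relevant.size + 1)
    return float(np.mean(n_k / k))

# toy usage: hits at ranks 1, 2 and 5 out of six retrieved images
print(round(rank_weighted_precision([1, 1, 0, 0, 1, 0]), 3))
```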
Quantitative tests are performed by evaluating the average performance of the
proposed methodology across the whole databases, considering 300 queries for
COREL 1000, 600 queries for Flickr, and 750 for SIVAL. Across the graph legends
in this study, μ indicates the mean value for the measure represented, calculated
across all categories, followed by a ± which denotes the average of the spreads.
Following the analysis of various image saliency selection algorithms from Sect. 4.2
we use the Graph-Based Visual Saliency (GBVS) algorithm for selecting the saliency
in the context of the optimization algorithm, as described in Sect. 5. Using saliency
as a weight for the Euclidean distances of the feature vectors is compared against
the case when saliency is not used at all. The Area under the ROC curve (AUC)
results for COREL 1000 database are provided in Fig. 7. From this figure it can
be observed that saliency improves the performance in categories where salient
objects are prominent in the image such as Flowers, Horses, Dinosaurs, and
Fig. 8 Image retrieval when considering different saliency map thresholds. μ indicates the average
rank-weighted precision followed by the average of the corresponding standard deviations after the
± sign. Standard deviations are indicated for each image category in the plot as well
Fig. 9 Comparisons for various ways of selecting salient regions. (a) Rank-weighted average
precision (WPR) when selecting salient regions based on the percentile of salient pixels. (b) Area
under the ROC curve (AUC) when selecting salient regions based on the maximization of the
saliency entropy. (c) WPR when salient regions are extracted using a thresholded saliency map
for the average region saliency values is not suitable for retrieving the images from
Flower and Horse categories because it does not select enough background regions
to differentiate the red/yellow flowers from buses. In both of these plots it can
be observed that by selecting the top 65% salient regions outperforms the other
approaches. Another method for selecting salient regions consists of binarising the
saliency map using Otsu’s threshold proposed in [28], then choosing the salient
regions as those which have at least 80% of their pixels as salient. Figure 9c shows
that this method underperforms greatly when categories have a well-defined salient
object. This happens because this method selects just the salient object without
including any background regions, and since those categories are classified as
distinctive scenes, confusion occurs due to the semantic gap. On the other hand,
the proposed QSCR method considers only the top 65% salient regions, and this
was shown to be efficient in general-purpose image data sets, such as Corel and
Flickr databases. However, in the case of SIVAL database, which consists entirely
of distinctive objects with no semantic link to their backgrounds, salient regions are
considered when they are part of the top 40% most salient regions, due to the fact
that in this case the inclusion of background regions introduces false positives.
Salient edges are extracted as explained in Sect. 4.1 and are used in the final
image ranking evaluation measure from (8). The idea is that the region-to-region
matching EMD distance gives a localized representation of the image while the
salient edges provide a global view. Unlike the SEHD algorithm of [14], the QSCR
method decouples the edge histogram from its spatial domain by considering the
edge energy, corresponding to specific image feature orientations, calculated from
the entire image. SEHD uses a different image segmentation algorithm and different
selection of salient regions while performing the image ranking as in [7]. In Fig. 10
we compare the proposed salient edge retrieval approach, considering only the
global image saliency and not the local saliency, and SEHD image retrieval method
used in [14], using the average area under the ROC curve (AUC) as the comparison
criterion. The categories in which the proposed approach outperforms SEHD are
Fig. 10 Retrieval by salient edges: proposed salient edge representation compared with Feng's
SEHD approach
Fig. 11 Examples of extracted query information: (1) Original, (2) Image segments, (3) Saliency
map, (4) Salient edges, (5) Selected salient regions
Fig. 12 Extracting query information from images for seven image categories from COREL
database. The image columns indicate from left to right: original image, segmented regions, the
GBVS saliency map, salient edges and the salient regions
selection method includes the surrounding regions in the query. The precision-recall
(PR) curve corresponding to the query images is shown in Fig. 13b. Figure 14
shows a scenario where the number of positive examples in the category is much
smaller, and yet the AUC value is high, as it can be observed from Fig. 14b. This
means that if more positive examples were added to the database, then the precision
would improve. Because all images in the category are considered relevant and the
true number of positive examples is much lower, the curve underestimates the true
retrieval performance. The semantic gap is evident in the retrieval of this image as
the query regions contain ambiguous colours, resulting in a series of Horse and Food
category images as close matches. The results when retrieving the white horse in
natural habitat surroundings from Fig. 15 produces no false positives for the first 10
retrieved images, but after that creates some confusion with Africa (which basically
represents people), as well as with Flowers and Elephant categories.
Fig. 13 Retrieval performance for Architecture category from COREL 1000 database. (a) The
first line shows the query image, its saliency, selected salient regions and salient edge images while
the subsequent lines display the retrieved images in their order. In the next six lines are shown 30
retrieved images. (b) Precision-recall curve
A variety of good retrieval results are provided in Fig. 16a, b for the Bus
and Flower categories from COREL 1000 database, while Fig. 16c, d shows the
results for images from the Pepsi Can and Checkered Scarf categories from SIVAL
database. The last two examples of more specific image categories from SIVAL
database indicate very limited salient object confusion in the retrieval results.
Figure 17 compares the results for the proposed query by saliency content retrieval
(QSCR) algorithm with SIMPLIcity from [44] when applied to COREL 1000
database. The comparison uses the same performance measures as in [44], respec-
tively the average precision, average rank and average standard deviation of rank. As
it can be observed from Fig. 17, QSCR provides better results in 4 image categories
and worse in the other 6, according to the measures used. This is due to the fact that
SIMPLIcity uses very selective features which are appropriate for these 6 image
categories.
Figure 18 compares the results of QSCR with the two retrieval methods proposed
in [33] on the Flickr database when using AUC. The results of QSCR and ACCIO are
Fig. 14 Retrieval performance for Africa category from COREL 1000 database. (a) The first line
shows the query image, its saliency, selected salient regions and salient edge images while the
subsequent lines display the retrieved images in their order. In the next six lines are shown 30
retrieved images. (b) Precision-recall curve
broadly similar and vary from one image category to another. However, ACCIO
involves human intervention by confirming or rejecting the retrieved images, while
the approach described in this chapter is completely automatic. The salient edges
improve the performance when the images of a certain category contain salient
objects which are neither distinctive nor diverse enough. This is the case with the SB
category, where most of the photos depict people as salient objects set in a snowy
environment, and with the HB, FF and FI categories, where the images are mostly close-ups,
defined by mostly vertical edges.
Figure 19 provides the assessment of the retrieval results using AUC on SIVAL
database when considering the retrieval of five images for each category. In this
database, the objects have simple backgrounds and the saliency should highlight
the main object while excluding the background which is the same for other
image categories. Unlike in COREL 1000 and Flickr databases, the inclusion of
the background is detrimental to the retrieval performance in this database. In the
case of the images from SIVAL database we consider as salient those regions whose
saliency corresponds to the top 40% of salient regions instead of the top 65%
used for the other two databases.
Fig. 15 Retrieval performance for Horse category from COREL 1000 database. (a) The first line
shows the query image, its saliency, selected salient regions and salient edge images while the
subsequent lines display the retrieved images in their order. In the next six lines are shown 30
retrieved images. (b) Precision-recall curve
6.7 Discussion
Fig. 16 Retrieval performances for images from Corel database in (a) and (b) and from SIVAL
database in (c) and (d)
because the domain of the feature values is very small. The semantic gap is most
evident in this database because its images and ground truths were obtained by
performing keyword search on Flickr. In addition, most of the images contain
multiple salient areas, which combined with the deficiencies of computational
visual attention models, result in several strong responses, which ultimately end
up confusing the CBIR system.
The Corel database also has weaknesses. The categories are not entirely disjoint
and it is sometimes unclear how to judge the retrieval results. When attempting to
retrieve horses we may retrieve elephants as well because they are both animals
and have similar relationships with their surroundings. At the lowest semantic level,
this is incorrect as the retrieval is too general. Without keywords, if the user wished
to search for animals, it would not be possible to specify such a query because
“animal” is an abstract term. Such retrieval is only possible if the images are loosely
clustered.
By considering distances using (8) between each pair of images for half of
COREL 1000 database, an image classification matrix is produced, shown in Fig. 20.
It shows the categories where the semantic gap is most prominent. It can be seen that
Beach images are likely to get confused with Elephants, Landscapes, and Horses,
whereas Elephants get mostly confused with Horses and to a lesser extent with
Africa, Beaches, Architecture and Landscapes.
7 Conclusions
is little spatial variation of the salient object within images. A new salient region
selection method, that uses the cumulative distribution of the saliency values in the
database to select an appropriate threshold, was discussed in this chapter. An ideal
CBIR solution would incorporate a variety of search mechanisms and would select
the best choice dynamically, thus maximising its performance. The use of visual
attention models would be one of the mechanisms that a CBIR solution should
employ because it is indispensable in highly localized scenarios such as those found
in the SIVAL database, where a global ranking method would fail regardless of
the choices of features and distance metrics. This implies that systems must be able
to distinguish between images of distinctive objects or distinctive scenes, leading to
the idea of using visual attention when searching for image content. Little
work has been done before on such semantics-sensitive approaches to the retrieval
task and it would be of great benefit to future CBIR systems. In their current state,
computational models of visual attention are still basic because they operate on the
notion of contrasting features, so they cannot accurately identify salient objects in
complex images that are commonplace. Therefore, saliency models are the limiting
factor for the concept of retrieval by visually salient objects and image features. In
the future more reliable models of the human intent, such as those involving human
memorisation processes, should be considered for CBIR systems in order to provide
better retrieval results.
Visual Saliency for the Visualization
of Digital Paintings
Abstract Over the last 15 years, several applications have been developed for
digital cultural heritage in image processing, particularly in the area of digital
painting. In order to help preserve cultural heritage, this chapter surveys several
applications for digital paintings, such as restoration, authentication and style
analysis, and then focuses on visualization. For the visualization of digital paintings,
we present methods based on visual saliency and, in particular, we propose an
automatic visualization method driven by a saliency map. The proposed system
extracts regions of interest (ROI) from a digital painting and characterizes them.
These close-ups are then animated on the basis of the painting's characteristics and
the artist's or designer's aim, so as to obtain interesting short video clips. The
experimental results show the efficiency of our approach, and an evaluation based
on a Mean Opinion Score validates the proposed method.
1 Introduction
The two main objectives for cultural heritage services are to preserve paintings that
represent our past and to play an active role in spreading cultural knowledge [27].
For example, Giakoumis et al. presented an efficient method to detect and remove
cracks in digital paintings [8]. Another interesting activity is to analyze painting
styles and movements [28, 33]. This can be employed for artist identification, in order
to detect forgeries, or simply to characterize an artist's period or to study the evolution
of their style and technique. Specific work has also been developed to protect the content
for the secure transmission of high-resolution digital paintings [27]. A survey of
digital imaging for cultural heritage is presented in the book by Stanco et al. [7].
The authors present several techniques, algorithms and solutions for digital imaging
and computer graphics-driven cultural heritage preservation; in particular, several
new visualization tools are proposed [7].
In this chapter, we first present several applications for digital paintings, such
as restoration, authentication and style analysis; secondly, we present specific methods
to visualize digital paintings based on visual saliency and, in particular, we propose
an automatic digital painting visualization method. The main objective of this
method is to generate an animation of a digital painting which would be close to
what could be achieved manually by an artist. The three main steps are the detection
of regions of interest (ROI), the characterization of the ROI, and the ordering of the
ROI by building a path throughout the digital painting. Finally, we demonstrate
how videos can be generated by following the paths with custom trajectories. In the
first step, we propose to use the saliency map concept for ROI detection [13–15],
slightly modified and applied to a digital painting in order to provide a map of the
most valuable sites and animate them in a visualization process.
The saliency map concept was introduced by Koch and Ullman in 1985 [19].
These maps are supposed to represent salient regions in an image, i.e. regions that
would capture human attention. With this knowledge, the main task is to model the
stimuli that naturally lead the human brain to detect a ROI in a scene. Two
factors are distinguished in [23], which are bottom-up and top-down
factors. The first factor represents our natural instinctive sensorial attention based
on factors such as color, contrast, size of objects and luminosity. For example, this
mechanism allows us to detect a bright burning fire in a scene. The second factor
represents a more intelligent process based on the observer’s experience. We use
this mechanism when we are seeking certain kinds of objects. Many techniques
have been developed since the 1980s to generate such saliency maps, each of them
trying to combine speed and robustness efficiently. Three method groups can be
singled out [36]: local methods [13, 15] which only deal with certain areas, global
methods [1, 12, 37] which use the whole image to characterize ROI, and frequential
methods [10, 26] which use the local characteristics of the spatial frequency phase.
Although saliency methods are already widely used in many domains (e.g.
robotics, marketing etc.), digital paintings have not been widely studied in such
a way. This could be explained by the fact that understanding how the brain
processes visual information and how humans look at paintings is still an open problem.
Gaze-tracker technology has been used for this issue in [30], showing that, although
the salient regions of paintings play an important role, there is still a large variability
depending on the subjects' own interests, artistic appreciation and knowledge.
Conversely, the Subtle Gaze Direction (SGD) process is employed in [22] to manipulate
volunteers' gaze on paintings. The authors succeed in improving the legibility of
paintings; in particular, the reading order and the fixation time over panels were
improved by using SGD. In this
study, saliency methods are envisaged to improve the SGD, but are not currently
implemented. Only a few recent studies have estimated saliency in digital paintings.
In [4], the authors provide a simple saliency analysis method which helped them
categorize paintings by art movement. In [18], the authors provide several valuable
advancements for the study of paintings. First, they provide a large database of
paintings with a wide diversity of artists and styles. State-of-the-art methods are
used to categorize paintings. Moreover, the authors collected gaze fixation data for
a consistent subset of paintings and applied state-of-the-art saliency methods to the
same paintings. Among these, Itti's framework [15] was one of the top-rated
methods for correlating saliency maps with fixation maps.
The rest of this chapter is organized as follows: In Sect. 2, we present previous
methods applied on digital paintings for restoration, authentication and style
analysis. In Sect. 3 we present specific methods to visualize digital paintings based
on visual saliency. Finally, in Sect. 4 we conclude our experiments and discuss
possibilities for future work.
Even if image processing has been widely used in areas such as medical imaging,
robotics and security, fewer applications have been developed concerning digital
paintings. It is becoming increasingly important for experts or art historians to
analyse painting style or to authenticate paintings. However, some studies show that
image processing algorithms can perform as well as human experts on applications
dedicated to digital paintings [7]. Among them, the most common are: virtual
restoration (Sect. 2.1), content authentication (Sect. 2.2) and the analysis of painting
evolution in time and style (Sect. 2.3).
One of the most common processes consists in virtual painting restoration. For
example in [8] the authors propose a technique to remove cracks on digitalized
paintings. The first step detects cracks. In this process, cracks are mainly considered
as dark areas with an elongated shape, so their detection is performed only on
the luminance component of the image. It consists in filtering the image with the
difference between the image’s gray level morphological closing and the image
itself. The authors suggest that a similar process can be used to detect bright cracks
(like scratches on a photo) while replacing the morphological closing by an opening
and computing the difference between the image and its opening. This process gives
a gray level image with higher values for pixels that are most likely to belong to a
crack. The image is then thresholded in order to extract cracks
from the rest of the image. Several strategies are considered, from the simplest one, a
global threshold whose value is computed from the filtered image histogram, to
a more complex one, a spatially varying threshold. The authors observe that some
brush strokes may be misclassified as cracks, so they provided two techniques to
distinguish them. The first one is semi-automatic and relies on a region growing
algorithm where the seeds are chosen manually on pixels belonging to the class of
cracks. In this way, pixels corresponding to brush strokes, which are not 8-connected
to crack pixels, are removed from the resulting image. The second approach is
based on an analysis of the Hue and Saturation components of cracks and brush
strokes. A classifier based on a median radial basis function neural network is
trained to separate cracks from brush strokes. As explained by the authors, these two
approaches can be combined to give better results. Once cracks have been identified
they need to be filled. The authors propose two methods, one using order-statistic
filters based on the median, the other based on controlled anisotropic diffusion.
An illustration is presented in Fig. 1. This method gives good results, even if some
cracks remain (those not filtered by morphological tools, or when dark cracks occur
in dark areas). It also seems, in their examples, that some edges are degraded,
possibly because they are misinterpreted as cracks.
Fig. 1 (a) Zoom on the original painting containing cracks, (b) thresholded top-hat transform
containing mainly cracks, (c) the virtually restored painting with cracks filled with a median
filter [8]
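A minimal Python/OpenCV sketch of this detection-and-filling idea is given below; the elliptical kernel, the Otsu threshold and the median-filter size are illustrative choices and not the exact settings of [8].

```python
import cv2


def detect_and_fill_cracks(gray, kernel_size=5, median_size=7):
    """Top-hat style crack removal: a grey-level closing highlights dark
    elongated structures, thresholding keeps crack-like pixels, and a
    median-filtered copy of the image fills them in."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    closed = cv2.morphologyEx(gray, cv2.MORPH_CLOSE, kernel)
    top_hat = cv2.subtract(closed, gray)          # dark cracks become bright
    _, crack_mask = cv2.threshold(top_hat, 0, 255,
                                  cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    median = cv2.medianBlur(gray, median_size)
    restored = gray.copy()
    restored[crack_mask > 0] = median[crack_mask > 0]  # fill only crack pixels
    return crack_mask, restored
```

Bright cracks could be handled symmetrically, as suggested by the authors, by replacing the closing with an opening.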
Other inpainting algorithms have been used to virtually restore paintings, for
example Chinese paintings in [21, 39] or by using a more general restoration tool
presented in [5]. An evaluation of the quality of such restorations has been proposed
in [24] where 14 metrics were compared in order to evaluate 8 inpainting algorithms.
The results showed, first, that exemplar-based inpainting algorithms worked better
than partial differential equation ones, as the latter tend to add blur when filling
large areas; and, second, that there is no ideal metric, as the metrics are strongly image-dependent.
Detecting cracks may also have another use, for example in painting identification
and control of the evolution of ancient paintings. For example in [6] a part of the
study was about painting authentication. Indeed, the painting Mona Lisa was stolen
in 1911 and when the painting was returned to the Louvre Museum, they wanted to
know if the painting was the original or a copy. The main theory was that the crack
pattern is impossible to copy. So, based on three high resolution pictures taken at
different periods (one before the theft in 1880, and two after in 1937 and 2005)
the crack patterns were extracted in order to compare them and authenticate or not
the painting. The authors proposed a method whereby they removed the content
of the painting, leaving only the cracks. To do so, images were first filtered by an isotropic low-pass
filter that removed almost all JPEG compression artifacts and the grainy aspect of the
images. Then, in order to remove the craquelures two treatments were performed:
one to remove dark cracks based on a gray level morphological closing and one to
remove bright cracks based on an opening. These two processes provide a blurry
image without cracks. A histogram specification was then used to match the gray-level
distributions of the original image and the filtered one. Subtracting the
filtered image from the original one then provided an image mainly composed of cracks.
A simple edge extraction algorithm finally provides a binary image of cracks (see
Fig. 2). The three images of crack patterns were then geometrically rectified using
a homography, so that they could be aligned and compared. The minor differences between the
crack patterns allowed the authors to confirm that the returned painting was the original Mona Lisa.
It also provided information about the best storage conditions of the painting as
the cracks remained stable between 1937 and 2005.
Fig. 2 (a) Zoom on the eye of Mona Lisa, (b) the filtered version containing cracks, (c) its
edges [6]
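A rough sketch of this crack-pattern extraction pipeline is shown below, assuming a grayscale uint8 image; the Gaussian kernel, the structuring-element size and the Canny thresholds are assumptions rather than the values used in [6].

```python
import cv2
import numpy as np
from skimage.exposure import match_histograms


def extract_crack_pattern(gray, blur_ksize=5, morph_ksize=7):
    """Low-pass filtering, closing/opening to erase dark and bright cracks,
    histogram matching, subtraction, then edge extraction."""
    smooth = cv2.GaussianBlur(gray, (blur_ksize, blur_ksize), 0)
    k = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (morph_ksize, morph_ksize))
    no_dark = cv2.morphologyEx(smooth, cv2.MORPH_CLOSE, k)    # removes dark cracks
    no_cracks = cv2.morphologyEx(no_dark, cv2.MORPH_OPEN, k)  # removes bright cracks
    matched = match_histograms(no_cracks.astype(float),
                               smooth.astype(float)).astype(np.uint8)
    cracks_only = cv2.absdiff(smooth, matched)   # image mainly composed of cracks
    return cv2.Canny(cracks_only, 30, 90)        # binary crack edges
```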
Image processing has also been applied to the stylistic analysis of Vincent van Gogh's
paintings, both to date the paintings and to extract specific features that are characteristic
of van Gogh's style. Another approach, presented in [33], uses small patches of
texture (called textons) to characterize the brushstroke configuration. Their process
learns a codebook of textons from a large number of paintings (Fig. 4), then
a histogram is built representing the appearance frequency of each texton. The
analysis of the texton distribution allows van Gogh's paintings to be distinguished
from those of other contemporary artists. The authors claimed that the texton-based
approach is more suitable for texture analysis than classical image processing filters,
since the latter introduce a smoothing that may remove some important features.
Brushstrokes are not the only feature that helps to identify an artist or an artistic
movement. For example, in [28] the authors focus their work on the analysis of
pearls in paintings. The way pearls are represented reveals how nature was
perceived by the painter and also gives information on the contemporary knowledge of
optical theory. To analyse pearls, they used a spatiogram, an extension of the histogram
in which spatial information is kept: each bin of the histogram is associated with
three values, which are the bin count, its spatial mean and its spatial covariance. Four
new metrics are also defined to characterize pearls from their spatiograms,
for example the mean distance between the spatiogram bins' centers. Experiments
showed that these four metrics allowed artists to be distinguished only by observing their
technique for painting pearls.
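A second-order spatiogram can be sketched in a few lines of Python; the number of grey-level bins below is an arbitrary choice, and the four pearl metrics of [28] are not reproduced.

```python
import numpy as np


def spatiogram(gray, n_bins=16):
    """For each intensity bin, store the pixel count, the mean position and
    the spatial covariance of the pixels falling into that bin."""
    h, w = gray.shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    bins = np.minimum(gray.ravel().astype(int) * n_bins // 256, n_bins - 1)
    counts = np.zeros(n_bins)
    means = np.zeros((n_bins, 2))
    covs = np.zeros((n_bins, 2, 2))
    for b in range(n_bins):
        pts = coords[bins == b]
        counts[b] = len(pts)
        if len(pts) > 1:
            means[b] = pts.mean(axis=0)
            covs[b] = np.cov(pts, rowvar=False)
    return counts, means, covs
```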
Painting analysis also explores the way paintings are perceived by humans. For
example in [31, 34], a saliency map is used to model or interpret the way the human
visual system (HVS) perceives visual art. In [31], Itti and Koch's saliency model [15]
was compared to the gaze behavior of human observers. In [34], the algorithm
consists in a fuzzy C-means segmentation of the painting; then features such
as compacity, local contrast, edges and uniqueness are used to describe each part of
the segmented image. These criteria are combined into a saliency map using a
weighted sum. A subjective test with human observers showed that this saliency map
is relevant for characterizing the zones of interest of paintings
(see Fig. 5). Moreover, this result holds regardless of the art movement the paintings belong to.
In [38] a new metric called LuCo (for Luminance Contrast) is presented to
quantify the artist's intention to manipulate lighting contrast in order to draw visual attention
to specific regions. It can be considered as a visual saliency measure dedicated to
luminance features. A histogram containing Bayesian surprise values is computed
for a set of patches across the painting. The LuCo score is then computed as
a skewness measure on this histogram, high values of LuCo indicating deliberate lighting
effects.
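A heavily simplified stand-in for LuCo is sketched below: the Bayesian surprise computation of [38] is replaced by plain per-patch luminance standard deviation, so only the final skewness step matches the original metric.

```python
import numpy as np
from scipy.stats import skew


def luco_like_score(gray, patch=32):
    """Skewness of a per-patch luminance-contrast histogram: a high positive
    value means a few patches stand out strongly from the rest."""
    h, w = gray.shape
    values = [gray[y:y + patch, x:x + patch].std()
              for y in range(0, h - patch + 1, patch)
              for x in range(0, w - patch + 1, patch)]
    return skew(np.asarray(values, dtype=float))
```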
Fig. 5 (a) Aleksey Konstantinovich Tolstoy painting by Alexander Briullov, (b) resulting saliency
map using the algorithm presented in [34], (c) thresholded saliency map and (d) user-defined saliency
map for comparison
Fig. 6 Overview of the proposed automatic visualization method for digital paintings. The artist
intervenes by designing the final saliency map with custom weights, setting the number of regions
included in the visualization, and selecting the features used to order them
In this section, the proposed method consists of three main steps which are presented
in Fig. 6. First, a saliency map is created from an image so that ROI can be identified
by a custom thresholding of map values. Next, the ROI are characterized by a set
of features. Finally, the third step orders ROI visualization according to the artist’s
needs [17].
The saliency map used in our method is a linear combination of several feature maps
derived from color, intensity, saturation and orientation information, following the approach illustrated in Fig. 7.
Fig. 7 Scheme of the proposed saliency system (painting → linear filtering → feature maps C, I, S and O → linear combination → final saliency map S); n = 2(L − 4), where L is the number of levels of the multi-resolution pyramid
The linear filtering stage produces a total of T = 8n feature maps. For example, a 1024 × 1024 pixel painting will
produce a 9-level pyramid and 64 maps (n = 8).
Each of the T maps is normalized through a feature combination strategy which
relies on simulating local competition between neighboring salient locations. This
process proved more accurate than the other strategies evaluated in [14]. The saliency maps
C, I, S and O, illustrated in Fig. 7, are obtained by the addition of sub-maps
and are scaled to the size of the original image with bi-cubic interpolation. Finally,
the four maps are used to produce a final saliency map S. Weighting coefficients
ωi used for the feature combination are set up by artists, so that artwork can be
interpreted according to the individual's sensitivity. Therefore, we define
S = [ω1, ω2, ω3, ω4] · [C, I, S, O]^T.
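This weighted combination is straightforward to implement; the sketch below uses the artist-set weightings quoted later for The Cheat (0.1, 0.9, 0.4, 0.2), and the final rescaling to [0, 1] is our own addition.

```python
import numpy as np


def combine_saliency(C, I, S_map, O, weights=(0.1, 0.9, 0.4, 0.2)):
    """Linear combination of the colour, intensity, saturation and
    orientation maps into one saliency map, rescaled to [0, 1]."""
    w1, w2, w3, w4 = weights
    S = w1 * C + w2 * I + w3 * S_map + w4 * O
    S = S - S.min()
    return S / S.max() if S.max() > 0 else S
```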
Once the saliency map S is produced, salient regions which are chosen to guide
visualization paths and close-ups on digital paintings can be isolated. These regions
could be directly segmented by thresholding the S map values, but the results
obtained are not adapted to our objective as illustrated in Fig. 8a. Even if an optimal
threshold value can be defined with constraints (e.g. by minimizing a cost function),
thresholding does not allow strict control of the number of regions, their areas or the
other properties that we want in our framework. Therefore, we defined an adaptive
thresholding approach which is guided by N, the expected number of ROI, as well as
Amin and Amax, the minimum and maximum authorized areas, respectively.
Figure 8b shows that three regions can be found on the 1-dimensional profile by
using multiple thresholds; the second peak is omitted since it does not satisfy the area
constraints, in contrast to Fig. 8a.
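One possible reading of this adaptive thresholding is sketched below: the threshold is swept from high to low and connected regions are accepted as soon as their area falls inside [Amin, Amax], until N regions have been collected. The step count, the acceptance rule and the masking used to approximate the erasure of accepted regions are assumptions, not the exact procedure of the chapter.

```python
import numpy as np
from scipy import ndimage


def select_rois(saliency, n_regions=3, a_min=0.001, a_max=0.03, steps=200):
    """Adaptive multi-threshold selection of N salient regions whose areas
    (as fractions of the image) lie within [a_min, a_max]."""
    total = saliency.size
    accepted = np.zeros_like(saliency, dtype=bool)
    rois = []
    for t in np.linspace(saliency.max(), saliency.min(), steps):
        labels, n = ndimage.label((saliency >= t) & ~accepted)
        for lab in range(1, n + 1):
            region = labels == lab
            frac = region.sum() / total
            if a_min <= frac <= a_max:
                rois.append(region)
                accepted |= region            # erase accepted pixels
                if len(rois) == n_regions:
                    return rois
    return rois
```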
Isolated ROI are then labeled and characterized by a vector, denoted by F, composed
of 15 features. This description prepares the final ordering step so that artists will
choose and decide which combinations of features allow a proper ROI visualization
sequence. This essential step provides the artist with a set of relevant characteristics
which can be understood and used as an accurate guide. We have kept the four
following classes of descriptors:
• Shape-based descriptors: area, perimeter, circularity, elongation and 16:9 com-
pacity (expressing the space occupied by the ROI in a 16:9 bounding window),
• Position-based descriptors: center coordinates, orientation, absolute and relative
distance from the center of gravity of the image, and its cardinal position,
• Color-based descriptors: mean hue, mean saturation and mean value,
• Texture-based descriptors: selected Haralick features such as energy, contrast,
homogeneity and correlation [11].
Shape-based and position-based features usually involve simple morphological tools
on binary images (the thresholded image from S), such as principal component
analysis.
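The sketch below computes one representative descriptor per class (shape, position, colour, texture) with scikit-image; it is not the full 15-feature vector F of the chapter, and it assumes a single connected ROI mask and an RGB input image.

```python
import numpy as np
from skimage.measure import label, regionprops
from skimage.color import rgb2hsv
from skimage.feature import graycomatrix, graycoprops


def describe_roi(mask, rgb_image):
    """Shape (area, perimeter), position (centroid), colour (mean HSV) and
    Haralick texture features for one ROI given by a binary mask."""
    props = regionprops(label(mask.astype(int)))[0]
    mean_hsv = rgb2hsv(rgb_image)[mask > 0].mean(axis=0)
    gray = rgb_image[..., :3].mean(axis=2).astype(np.uint8)
    ys, xs = np.nonzero(mask)
    patch = gray[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    glcm = graycomatrix(patch, distances=[1], angles=[0], levels=256,
                        symmetric=True, normed=True)
    return {
        "area": props.area,
        "perimeter": props.perimeter,
        "centroid": props.centroid,
        "mean_hsv": mean_hsv,
        "contrast": graycoprops(glcm, "contrast")[0, 0],
        "energy": graycoprops(glcm, "energy")[0, 0],
    }
```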
Fig. 8 The proposed adaptive thresholding method allows control of the number of regions to
segment as well as of the regions' area properties. Examples of segmentation on a 1-dimensional
profile of a saliency map with: (a) a classic thresholding method (two threshold values), (b) the
proposed adaptive thresholding method
Fig. 9 Description of a sample region of interest (left) with various shape-based, position-based,
color-based and texture-based features (right)
where A = 1 and x = y = 1 in (3).
Results Analysis
The visual saliency system used in our method provides an appropriate identification
of salient regions in digital paintings. The proposed method has been applied to
several digital paintings. Figure 10 presents the feature-based saliency maps C
(Fig. 10b), I (Fig. 10c), S (Fig. 10d) and O (Fig. 10e) computed from the painting
Fig. 10 Examples of a saliency map obtained from (a) the original digital painting The Cheat by
Georges de La Tour, (b) color map C, (c) intensity map I, (d) saturation map S, (e) orientation
map O, (f) features combined in a final saliency map S with artist-set weightings of 0.1, 0.9, 0.4
and 0.2 (respectively for C, I, S and O)
Fig. 11 Results of the ROI segmentation on the saliency map presented in Fig. 10f with the
classic thresholding method. Different choices of threshold values are shown, respectively (a) 0.72,
(b) 0.40 and (c) 0.15
The Cheat by G. de La Tour. The maps I (Fig. 10c) and O (Fig. 10e) highlight
particularly well the players' faces, gazes and cards, whereas the C and S maps
provide supplementary salient details for the color and texture of the clothes. A
possible final saliency map S, illustrated in Fig. 10f, was built from an artist's
weightings of the previous maps (C, I, S, O), with artist-set weightings of 0.1, 0.9,
0.4 and 0.2 (respectively for C, I, S and O). This final map shows how the artist's
intentions can be conveyed to our system. Other results of saliency maps are provided
in Fig. 14b for four digital paintings with custom weights on the saliency features.
Then, from Fig. 10f, salient regions were isolated to produce the ROI sets
presented in Figs. 11 and 12. Figure 12 illustrates the advantages of our proposed
adaptive thresholding method over the classical thresholding approach illustrated
in Fig. 11. Controlling the number of segmented regions as well as their dimensions
is difficult when simply changing (manually or automatically) a single threshold
value on the saliency map S, as shown in Fig. 11a–c, where the number of regions
increases dramatically as the threshold value decreases. The proposed adaptive
thresholding provides regions with homogeneous sizes (we requested between 0.1
and 2% of the total image area) and with a controlled number (3, 5, 7 and 9 regions
were requested in Fig. 12a–d); these constraints are
essential for the proposed framework. Note that increasing the number of requested
regions produces a set of regions which includes the regions already found with
a smaller requested number. Only the order in which regions are detected changes
the shape of the regions, due to the erasure step of the proposed procedure (see
Sect. 3.2.1).
Such ROI were ordered according to scores defined by the artist's weighting of
the ROI descriptors, with weights ωi ∈ {ωi ∈ ℤ, −5 ≤ ωi ≤ 5}. Figure 13a presents
a possible corresponding transition graph for the regions segmented in Fig. 12d. This
path is defined by focusing only on the area, luminosity and entropy descriptors in
the scoring step (ω = 2, 1 and 4, respectively). Another example of a path, presented
in Fig. 13b, is defined by opting for the compacity (ω = 5), perimeter (ω = 2),
contrast (ω = 2) and correlation (ω = 1) descriptors.
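A minimal sketch of this scoring step is given below: ROIs are ordered by a weighted sum of min-max-normalised descriptors. The normalisation and the column layout of the feature matrix are our own assumptions, not part of the original framework.

```python
import numpy as np


def order_rois(feature_matrix, weights):
    """Order ROIs by a weighted sum of their normalised descriptors.
    feature_matrix: n_rois x n_features; weights: integer artist weights
    in [-5, 5], with zero for descriptors that are not used."""
    f = np.asarray(feature_matrix, dtype=float)
    span = np.maximum(f.max(axis=0) - f.min(axis=0), 1e-12)
    f = (f - f.min(axis=0)) / span
    scores = f @ np.asarray(weights, dtype=float)
    return np.argsort(-scores)   # visit ROIs from highest to lowest score

# e.g. weights of (2, 1, 4) on columns [area, luminosity, entropy] reproduce
# the kind of path shown in Fig. 13a.
```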
Fig. 12 Results of the ROI segmentation on the saliency map presented in Fig. 10f with the
proposed adaptive thresholding method. Different choices of the region number requested are
shown, respectively (a) 3, (b) 5, (c) 7 and (d) 9. Area constraints are set between 0.1 and 3%
of the total image area
Fig. 13 Salient regions selected by thresholding the saliency map S obtained in Fig. 12d, ordered
by expert guidance to obtain two transition graphs resulting from different weightings: (a) area,
ω = 2, luminosity, ω = 1, entropy, ω = 4; (b) compacity, ω = 5, perimeter, ω = 2, contrast,
ω = 2 and correlation, ω = 1. The path in (c) used the same weightings as the path in (a), and
the path in (d) used the same weightings as the path in (b), but with the consideration of the
centrality/proximity bias
Fig. 14 (a) Digital paintings used to generate the videos of our Mean Opinion Score study (from
top to bottom: The Cheat (G. de La Tour), Massacre of the Innocents (P.P. Rubens), The Anatomy
Lesson of Dr. Nicolaes Tulp (R. van Rijn) and the Mona Lisa (L. Da Vinci)). (b) The corresponding
saliency maps designed to obtain the segmentation. (c) The ordered regions forming visualization
paths
A statistically significant difference was found between the randomly-based videos
(MOS = 2.32 ± 1.03) and the saliency-based videos (MOS = 3.59 ± 0.99) by the
Wilcoxon signed rank test [25] (V = 3820, p-value < 0.001). This
result suggests that our approach is well suited for selecting important regions
to visualize digital paintings using a video method. Our framework performed
particularly well on these paintings because it accurately localizes essential parts
for the interpretation and comprehension of the paintings.
Note that the randomly-based and saliency-based videos generated with 4 regions
have the most separable MOS (V = 566.5, p-value < 0.001) compared to
the 3-region groups (V = 497, p-value < 0.001) and the 5-region groups
(V = 388.5, p-value < 0.05); see the density functions of the scores in
Fig. 15. These results may be explained by the fact that, in the 5-region groups,
most of the painting area is eventually shown by the randomly-based as well as the
saliency-based method, so that the MOS is less strongly separable between the two groups.
This tendency should increase with the number of regions used on the visualization
path. By contrast, the groups with 3 regions are less strongly separable because of
the potential proximity of the regions.
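The comparison of the two MOS populations can be reproduced with SciPy as shown below; the arrays here are illustrative placeholders, not the scores collected in the study.

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired mean opinion scores for the same paintings under the two
# visualisation methods (placeholder values only).
mos_random = np.array([2.1, 2.5, 1.9, 2.4, 2.7, 2.2])
mos_saliency = np.array([3.4, 3.8, 3.1, 3.7, 3.9, 3.5])

stat, p_value = wilcoxon(mos_random, mos_saliency)
print(stat, p_value)   # a small p-value indicates the two MOS groups differ
```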
4 Conclusion
In this chapter we have shown that several image processing methods have been
developed specifically for digital paintings. We first presented several methods
applied to digital paintings for restoration, authentication and style analysis, and
showed that saliency is well suited, for example, to painting analysis.
In the second part of the chapter, we presented specific methods to visualize
digital paintings based on visual saliency. In particular, we developed a method
which is able to automatically generate a video from a digital painting. The results
suggest that our approach is well suited for selecting important regions in order
to visualize digital paintings as videos. Our framework performed particularly
well on these paintings because it accurately localizes parts that are essential
for the interpretation and comprehension of the paintings.
We are convinced that it is very important to continue developing such approaches
in the future, in particular with the creation of online museum collections based
on virtual and augmented reality. To improve the quality of the automatic
generation of videos from digital paintings, it will clearly be necessary to take
gaze tracking into account.
Acknowledgements The authors would like to thank the volunteers who agreed to participate in
our opinion score campaign.
Fig. 15 Smoothed density functions of the opinion scores collected over 218 videos by 18
volunteers. Dashed lines indicate mean values. (a) All region numbers confounded, (b) 3-region,
(c) 4-region and (d) 5-region groups
References
1. Aziz, M.Z., Mertsching, B.: Fast and robust generation of feature maps for region-based visual
attention. IEEE Trans. Image Process. 17(5), 633–644 (2008)
2. Berezhnoy, I.J., Brevdo, E., Hughes, S.M., Daubechies, I., Li, J., Postma, E., Johnson, C.R.,
Hendriks, E., Wang, J.Z.: Image processing for artist identification. IEEE Signal Process. Mag.
25(4), 37–48 (2008)
3. Brevdo, E., Hughes S., Brasoveanu, A., Jafarpour, S., Polatkan, G., Daubechies, I.: Stylistic
analysis of paintings using wavelets and machine learning. In: Proceedings of the 17th
European Signal Processing Conference (EUSIPCO), Glasgow, Scotland (2009)
4. Condorovici, R.G., Vranceanu, R., Vertan, C.: Saliency map retrieval for artistic paintings
inspired from human understanding. In: Proceedings of SPAMEC (2011)
5. Corsini, M., De Rosa, A., Cappellini, V., Barni, M., Piva, A.: Artshop: an art-oriented image
processing tool for cultural heritage applications. J. Vis. Comput. Animat. 14, 149–158 (2003)
6. Druon, S., Comby, F.: La Joconde – Essai Scientifique. Extraction des craquelures, Christian
Lahanier, pp. 179–184 (2007)
7. Gallo, G., Stanco, F., Battiato, S.: Digital Imaging for Cultural Heritage Preservation: Analysis,
Restoration and Reconstruction of Ancient Artworks. Taylor and Francis, Boca Raton, FL
(2011)
8. Giakoumis, I., Nikolaidis, N., Pitas, I.: Digital image processing techniques for the detection
and removal of cracks in digitized paintings. IEEE Trans. Image Process. 15(1), 178–188
(2006)
9. Greenspan, H., Belongie, S., Goodman, R., Perona, P., Rakshit, S., Anderson, C.H.: Overcom-
plete steerable pyramid filters and rotation invariance. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 222–228 (1994)
10. Guo, C., Zhang, L.: A novel multiresolution spatiotemporal saliency detection model and its
applications in image and video compression. IEEE Trans. Image Process. 19(1), 185–198
(2010)
11. Haralick, R.M., Shanmugam, K., Dinstein, I.: Textural features for image classification. IEEE
Trans. Syst. Man Cybern. 3(6), 610–621 (1973)
Visual Saliency for the Visualization of Digital Paintings 231
12. Harel, J., Koch, C., Perona, P.: Graph-based visual saliency. In: Proceedings of NIPS,
pp. 545–552 (2006)
13. Itti, L.: Models of bottom-up and top-down visual attention. Ph.D. thesis, California Institute
of Technology (2000)
14. Itti, L., Koch, C.: A comparison of feature combination strategies for saliency-based visual
attention systems. J. Electron. Imaging 10, 161–169 (1999)
15. Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene
analysis. IEEE Trans. Pattern Anal. Mach. Intell. 20(11), 1254–1259 (1998)
16. ITU-T Recommendation (P.910): Subjective video quality assessment methods for multimedia
applications (2000)
17. Kennel, P., Puech, W., Comby, F.: Visualization framework of digital paintings based on visual
saliency for cultural heritage. Multimedia Tools Appl. 76(1), 561–575 (2017)
18. Khan, F.S., Beigpour, S., Weijer, J., Felsberg, M.: Painting-91: a large scale database for
computational painting categorization. Mach. Vis. Appl. 25(6), 1385–1397 (2014)
19. Koch, C., Ullman, S.: Shifts in selective visual attention: towards the underlying neural
circuitry. Hum. Neurobiol. 4, 219–227 (1985)
20. Li, J., Wang, J.Z.: Studying digital imagery of ancient paintings by mixtures of stochastic
models. IEEE Trans. Image Process. 13(3), 340 (2004)
21. Lu, L.-C., Shih, T.K., Chang, R.-C., Huang, H.-C.: Multi-layer inpainting on Chinese artwork.
In: Proceedings of IEEE International Conference on Multimedia and Expo (ICME) (2004)
22. McNamara, A., Booth, T., Sridharan, S., Caffey, S., Grimm, C., Bailey, R.: Directing gaze
in narrative art. In: Proceedings of the ACM Symposium on Applied Perception, SAP’12,
pp. 63–70. Association for Computing Machinery, New York (2012)
23. Niebur, E.: Saliency map. Scholarpedia 2(8), 2675 (2007)
24. Oncu, A.I., Deger, F., Hardeberg, J.Y.: Evaluation of Digital Inpainting Quality in the Context
of Artwork Restoration, pp. 561–570. Springer, Heidelberg (2012)
25. Oyeka, A., Ebuh, G.: Modified Wilcoxon signed-rank test. Open J. Stat. 2, 172–176 (2012)
26. Pei, S.-C., Ding, J.-J., Chang, J.-H.: Efficient implementation of quaternion Fourier transform,
convolution, and correlation by 2-D complex FFT. IEEE Trans. Signal Process. 49(11),
2783–2797 (2001)
27. Pitzalis, D., Aitken, G., Autrusseau, F., Babel, M., Cayre, F., Puech, W.: TSAR: secure transfer
of high resolution art images. In: Proceedings of EVA’08 Florence, Electronic Imaging and the
Visual Art, France (2008)
28. Platiša, L., Cornelis, B., Ružić, T., Pižurica, A., Dooms, A., Martens, M., De Mey, M., Daubechies,
I.: Spatiogram features to characterize pearls in paintings. In: Proceedings of IEEE ICIP’2011
(2011)
29. Postma, E.O., Berezhnoy, I.E., van den Herik, H.J.: Authentic: computerized brushstroke
analysis. In: Proceedings of the 2005 IEEE International Conference on Multimedia and Expo,
ICME, Amsterdam, The Netherlands, 6–9 July 2005
30. Quiroga, R.Q., Pedreira, C.: How do we see art: an eye-tracker study. Front. Hum. Neurosci.
5, 98 (2011)
31. Redies, C., Fuchs, I., Ansorge, U., Leder, H.: Salience in paintings: bottom-up influences on
eye fixations. Cogn. Comput. 3(1), 25–36 (2011)
32. Shida, B., van de Weijer, J., Khan, F.S., Felsberg, M.: Painting-91: a large scale database for
computational painting categorization. Mach. Vis. Appl. 25, 1385–1397 (2014)
33. van der Maaten, L., Postma, E.: Identifying the real van Gogh with brushstroke textons. White
paper, Tilburg University, Feb 2009
34. Vertan, C., Condorovici, R.G., Vrânceanu, R.: Saliency map retrieval for artistic paintings
inspired from human understanding. In: Proceedings of SPAMEC 2011, Cluj-Napoca, Roma-
nia, pp. 101–104 (2011)
35. Wolfe, J.M., Horowitz, T.S.: What attributes guide the deployment of visual attention and how
do they do it? Nat. Rev. Neurosci. 5(6), 495–501 (2004)
36. Wu, B., Xu, L., Zeng, L., Wang, Z., Wang, Y.: A unified framework for spatiotemporal salient
region detection. EURASIP J. Image Video Process. 2013(1), 16 (2013)
232 P. Kennel et al.
37. Xu, L., Li, H., Zeng, L., Ngan, K.N.: Saliency detection using joint spatial-color constraint and
multi-scale segmentation. J. Vis. Commun. Image Represent. 24(4), 465–476 (2013)
38. Yang, S., Cheung, G., Le Callet, P., Liu, J., Guo, Z.: Computational modeling of artistic
intention: quantify lighting surprise for painting analysis. In: Proceedings of the Eighth
International Conference on Quality of Multimedia Experience (2016)
39. Zeng, Y.-C., Pei, S.-C., Chang, C.-H.: Virtual restoration of ancient Chinese paintings using
color contrast enhancement and lacuna texture synthesis. IEEE Trans. Image Process. (Special
Issue on Image Processing for Cultural Heritage) 13(3), 416–429 (2004)
40. Zöllner, F.: Leonardo’s Portrait of Mona Lisa del Giocondo. Gazette des Beaux-Arts 121(1),
115–138 (1993)
Predicting Interestingness of Visual Content
Abstract The ability of multimedia data to attract and keep people’s interest
for longer periods of time is gaining more and more importance in the fields of
information retrieval and recommendation, especially in the context of the ever
growing market value of social media and advertising. In this chapter we introduce
a benchmarking framework (dataset and evaluation tools) designed specifically
for assessing the performance of media interestingness prediction techniques. We
release a dataset which consists of excerpts from 78 movie trailers of Hollywood-
like movies. These data are annotated by human assessors according to their degree
of interestingness. A real-world use scenario is targeted, namely interestingness is
defined in the context of selecting visual content for illustrating a Video on Demand
(VOD) website. We provide an in-depth analysis of the human aspects of this task,
i.e., the correlation between perceptual characteristics of the content and the actual
data, as well as of the machine aspects by overviewing the participating systems of
the 2016 MediaEval Predicting Media Interestingness campaign. After discussing
the state-of-the-art achievements, valuable insights, current capabilities and future
challenges are presented.
1 Introduction
1 http://www.multimediaeval.org/.
2 http://www.multimediaeval.org/mediaeval2016/mediainterestingness/.
3 http://www.technicolor.com.
The prediction and detection of multimedia data interestingness has been analyzed
in the literature from the human perspective, involving psychological studies, and
also from the computational perspective, where machines are taught to replicate the
human process. Content interestingness has gained importance with the increasing
popularity of social media, on-demand video services and recommender systems.
These different research directions try to create a general model for human interest,
go beyond the subjectivity of interestingness and detect some objective features that
appeal to the majority of subjects. In the following, we present an overview of these
directions.
Besides the vast literature of psychological studies, the concept of visual inter-
estingness has been studied from the perspective of automatic, machine-based
approaches. The idea is to replicate human capabilities via computational means.
For instance, the authors in [23] studied a large set of attributes: RGB values,
GIST features [50], spatial pyramids of SIFT histograms [39], colorfulness [17],
complexity, contrast and edge distributions [35], arousal [46] and composition of
parts [6] to model different cues related to interestingness. They investigated the
role of these cues in varying contexts of viewing: different datasets were used,
from arbitrarily selected and very different images (weak context) to images coming
from similar webcam streams (strong context). They found that the concept of
“unusualness”, defined as the degree of novelty of a certain image when compared
to the whole dataset, was related to interestingness, in case of a strong context.
Unusualness was calculated by clustering performed on the images using Local
Outlier Factor [8] with RGB values, GIST and SIFT as features, composition of
parts and complexity interpreted as the JPEG image size. In the case of a weak context,
personal preferences of the user, modeled with pixel values, GIST, SIFT and Color
Histogram features and classified with a ν-SVR (Support Vector Regression with an
RBF kernel), performed best. Continuing this work, the author in [61] noticed
that a regression with sparse approximation of data performed better with the
features defined by Gygli et al. [23] than the SVR approach.
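A minimal scikit-learn sketch of such an interestingness regressor is given below; the random feature matrix stands in for GIST/SIFT/colour descriptors, and the hyper-parameters are illustrative rather than those tuned in [23].

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import NuSVR

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 512)), rng.uniform(size=200)
X_test = rng.normal(size=(20, 512))

# nu-SVR with an RBF kernel, trained on image descriptors against
# interestingness scores.
model = make_pipeline(StandardScaler(), NuSVR(kernel="rbf", nu=0.5, C=1.0))
model.fit(X_train, y_train)
scores = model.predict(X_test)   # predicted interestingness per image
```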
Another approach [19] selected three types of attributes for determining image
interestingness: compositional, image content and sky-illumination. The composi-
tional attributes were: rule of thirds, low depth of field, opposing colors and salient
objects; the image content attributes were: the presence of people, animals and faces,
indoor/outdoor classifiers; and finally the sky-illumination attributes consisted of
scene classification as cloudy, clear or sunset/sunrise. Classification of interesting
content is performed with Support Vector Machines (SVM). As baseline, the authors
used the low-level attributes proposed in [35], namely average hue, color, contrast,
brightness, blur and simplicity interpreted as the distribution of edges; Naïve
Bayes and SVM were used for classification. Results show that high-level attributes tend to
perform better than the baseline. However, the combination of the two was able to
achieve even better results.
Other approaches focused on subcategories of interestingness. For instance, the
authors in [27] determined “social interestingness” based on social media ranking
and “visual interestingness” via crowdsourcing. The Pearson correlation coefficient
between these two subcategories had low values, e.g., 0.015 to 0.195, indicating
that there is a difference between what people share on social networks and what
has a high pure visual interest. The features used for predicting these concepts were
color descriptors determined on the HSV color space, texture information via Local
Binary Patterns, saliency [25] and edge information captured with Histogram of
Oriented Gradients.
Individual frame interestingness was calculated by the authors in [43]. They
used web photo collections of interesting landmarks from Flickr as estimators of
human interest. The proposed approach involved calculating a similarity measure
between each frame from YouTube travel videos and the Flickr image collection
of the landmarks presented in the videos, used as interesting examples. SIFT
features were computed, and the number of features shared between the frame
and the baseline image collection, together with the similarity of their spatial
arrangement, were the components that determined the interestingness measure. Finally, the authors
showed that their algorithm achieved the desired results, tending to classify full
images of the landmarks as interesting.
Another interesting approach is the one proposed in [31]. The authors used audio,
visual and high-level features for predicting video shot interestingness, e.g., color
histograms, SIFT [45], HOG [15, 68], SSIM Self-Similarities [55], GIST [50],
MFCC [63], Spectrogram SIFT [34], Audio-Six, Classemes [64], ObjectBank [41]
and the 14 photographic styles described in [48]. The system was trained via
Joachims’ Ranking SVM [33]. The final results showed that audio and visual
features performed well, and that their fusion performed even better on the two
user-annotated datasets used, giving a final accuracy of 78.6% on the 1200 Flickr
videos and 71.7% on the 420 YouTube videos. Fusion with the high-level attributes
provided a better result only on the Flickr dataset, with an overall precision of 79.7
and 71.4%.
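The ranking step can be approximated by the standard pairwise reduction of a ranking SVM, shown below with a linear kernel; this is a generic sketch, not the exact implementation of [33] used in [31].

```python
import numpy as np
from sklearn.svm import LinearSVC


def fit_rank_svm(X, ordered_pairs, C=1.0):
    """Pairwise reduction: for each pair (i, j) where shot i is more
    interesting than shot j, X[i] - X[j] becomes a positive example and the
    reversed difference a negative one. Returns a weight vector w such that
    w @ x scores a shot."""
    diffs, labels = [], []
    for i, j in ordered_pairs:
        diffs.append(X[i] - X[j]); labels.append(1)
        diffs.append(X[j] - X[i]); labels.append(-1)
    clf = LinearSVC(C=C).fit(np.asarray(diffs), np.asarray(labels))
    return clf.coef_.ravel()

# Ranking new shots: scores = X_new @ w, then sort by decreasing score.
```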
Low- and high-level features were used in [22] to detect the most interesting
frames in image sequences. The selected low-level features were: raw pixel values,
color histogram, HOG, GIST and image self-similarity. The high-level features
were grouped in several categories: emotion predicted from raw pixel values [66],
complexity defined as the size of the compressed PNG image, novelty computed
through a Local Outlier Factor [8] and a learning feature computed using a
ν-SVR classifier with an RBF kernel on the GIST features. Each one of these features
performed above the baseline (i.e., random selection), and their combination also
showed improvements over each individual one. The tests were performed on a
database consisting of 20 image sequences, each containing 159 color images taken
from various webcams and surveillance scenarios, and the final results for the com-
bination of features gave an average precision score of 0.35 and a Top3 score of 0.59.
A critical point to build and evaluate any machine learning system is the availability
of labeled data. Although the literature for automatic interestingness prediction is
still at its early stages, there are some attempts to construct evaluation data. In
the following, we introduce the most relevant initiatives.
Many of the authors have chosen to create their own datasets for evaluating
their methods. Various sources of information were used, mainly coming from
social media, e.g., Flickr [19, 27, 31, 43, 61], Pinterest [27] and YouTube [31, 43].
The data consisted of the results returned by search queries. Annotations were
determined either automatically, by exploiting the available social media metadata
and statistics such as Flickr’s “interestingness measure” in [19, 31], or manually, via
crowdsourcing in [27] or local human assessors in [31].
The authors in [19] used a dataset composed of 40,000 images, and kept the top
10%, ordered according to the Flickr interestingness score, as positive interesting
examples and the last 10% as negative, non interesting examples. Half of this dataset
was used for training and half for testing. The same top and last 10% of Flickr results
was used in [31], generating 1200 videos retrieved with 15 keyword queries, e.g.,
“basketball”, “beach”, “bird”, “birthday”, “cat”, “dancing”. In addition to these,
the authors in [31] also used 30 YouTube advertisement videos from 14 categories,
such as “accessories”, “clothing&shoes”, “computer&website”, “digital products”,
“drink”. The videos had an average duration of 36 s and were annotated by human
assessors, thus generating a baseline interestingness score.
Apart from the individual datasets, there were also initiatives of grouping several
datasets of different compositions. The authors in [23] associated an internal
context with the data: a strong context dataset proposed in [22], where the images
in 20 publicly available webcam streams are consistently related to one another,
thus generating a collection of 20 image sequences each containing 159 images; a
weak context dataset introduced in [50] which consists of 2688 fixed size images
grouped in 8 scene categories: “coast”, “mountain”, “forest”, “open country”,
“street”, “inside city”, “tall buildings” and “highways”; and a no context dataset
which consists of the 2222 image memorability dataset proposed in [29, 30], with
no context or story behind the pictures.
This section describes the Predicting Media Interestingness Task, which was
proposed in the context of the 2016 MediaEval international evaluation campaign.
It covers the task definition (Sect. 3.1), the description of the provided
data with its annotations (Sect. 3.2), and the evaluation protocol (Sect. 3.3).
As mentioned in the previous section, the video and image subtasks are based on
a common dataset, which consists of Creative Commons trailers of Hollywood-like
movies, so as to allow redistribution. The dataset, its annotations, and accompanying
features, as described in the following subsections, are publicly available.4
The use of trailers, instead of full movies, has several motivations. Firstly, it is
the need for content that can be freely and publicly distributed, as opposed
to, e.g., full movies, which have much stronger restrictions on distribution. Basically,
each copyrighted movie would require an individual permission for distribution.
Secondly, using full movies is not practically feasible for the highly demanding
segmentation and annotation steps with limited time and resources, as the number
of images/video excerpts to process is enormous, in the order of millions. Finally,
running on full movies, even if the aforementioned problems were solved, would not
allow for a high diversification of the content, as only a few movies could
have been used. Trailers allow a larger number of movies to be selected and thus
diversify the content.
Trailers are by definition representative of the main content and quality of the full
movies. However, it is important to note that trailers are already the result of some
manual filtering of the movie to find the most interesting scenes, but without spoiling
the movie's key elements. In practice, most trailers also contain less interesting or
slower paced shots to balance their content. We therefore believe that this is a good
compromise for the practicality of the data/task.
The proposed dataset is split into development data, intended for designing and
training the algorithms, which is based on 52 trailers, and testing data, which is used
for the actual evaluation of the systems and is based on 26 trailers.
The data for the video subtask was created by segmenting the trailers into video
shots. The same video shots were also used for the image subtask, but here each shot
is represented by a single key-frame image. The task is thus to classify the shots, or
key-frames, of a particular trailer, into interesting and non interesting samples.
4 http://www.technicolor.com/en/innovation/scientific-community/scientific-data-sharing/interestingness-dataset.
Video shot segmentation was carried out manually using a custom-made software
tool. Here we define a video shot as a continuous video sequence recorded between
a turn-on and a turn-off of the camera. For an edited video sequence, a shot is
delimited between two video transitions. Typical video transitions include sharp
transitions or cuts (direct concatenation of two shots), and gradual transitions
like fades (gradual disappearance/appearance of a frame to/from a black frame)
and dissolves (gradual transformation of one frame into another). In the process,
we discarded movie credits and title shots. Whenever possible, gradual transitions
were treated as shots of their own, presumed to be very uninteresting. In a few
cases, shots in between two gradual transitions were too short to be segmented.
In that case, they were merged with their surrounding transitions, resulting in one
single shot.
The segmentation process resulted in 5054 shots for the development dataset, and
2342 shots for the test dataset, with an average duration of 1 s in each case. These
shots were used for the video subtask. For the image subtask, we extracted a single
key-frame for each shot. The key-frame was chosen as the middle frame, as it is
likely to capture the most representative information of the shot.
All video shots and key-frames were manually annotated in terms of interestingness
by human assessors. The annotation process was performed separately for the video
and image subtasks, to allow us to study the correlation between the two. Indeed we
would like to answer the question: Does image interestingness automatically imply
video interestingness, and vice versa?
A dedicated web-based tool was developed to assist the annotation process. The
tool has been released as free and open source software, so that others can benefit
from it and contribute improvements.5
We use the following annotation protocol. Instead of asking annotators to assign
an interestingness value to each shot/key-frame, we used a pair-wise comparison
protocol where the annotators were asked to select the more interesting shot/key-
frame from a pair of examples taken from the same trailer. Annotators were provided
with the clips for the shots and the images for the key-frames, presented side by
side. Also, they were informed about the Video on Demand use case and asked to consider that "the selected video excerpts/key-frames should be suitable in
terms of helping a user to make his/her decision about whether he/she is interested
in watching a movie”. Figure 1 illustrates the pair-wise decision stage of the user
interface.
5. https://github.com/mvsjober/pair-annotate.
The choice of a pair-wise annotation protocol instead of direct rating was based
on our previous experience with annotating multimedia for affective content and
interestingness [3, 10, 60]. Assigning a rating is a cognitively very demanding task,
requiring the annotator to understand, and constantly keep in mind, the full range
of the interestingness scale [70]. Making a single comparison is a much easier task
as one only needs to compare the interestingness of two items, and not consider
the full range. Directly assigning a rating value is also problematic since different
annotators may use different ranges, and even for the same annotator the values may
not be easily interpreted [51]. For example, is an increase from 0.3 to 0.4 the same
as the one from 0.8 to 0.9? Finally, it has been shown that pairwise comparisons are
less influenced by the order in which the annotations are displayed than with direct
rating [71].
However, annotating all possible pairs is not feasible due to the sheer number of comparisons required. For instance, n shots/key-frames would require n(n-1)/2 comparisons for full coverage. Instead, we adopted the adaptive square design method [40], where the shots/key-frames are placed in a square design and only pairs in the same row or column are compared. This reduces the number of comparisons to n(√n - 1). For example, for n = 100 we only need 900 comparisons instead of 4950 (full coverage). Finally, the Bradley-Terry-Luce (BTL) model [7] was used to convert the paired comparison data to a scalar value.
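The two ingredients of this protocol can be sketched as follows, under our own simplifying assumptions (this is an illustration, not the task's actual implementation): enumerating the row/column pairs of a square design, and converting the collected win counts into scalar scores with a basic iterative maximum-likelihood fit of the BTL model.

```python
# Sketch under simplifying assumptions: n is a perfect square and every listed
# pair is annotated; the real protocol is adaptive and spread over several rounds.
import math
import numpy as np

def square_design_pairs(n):
    """Items 0..n-1 laid out in a sqrt(n) x sqrt(n) square; only items sharing
    a row or a column are compared, giving n * (sqrt(n) - 1) pairs."""
    side = math.isqrt(n)
    assert side * side == n, "this sketch assumes n is a perfect square"
    grid = np.arange(n).reshape(side, side)
    pairs = []
    for r in range(side):
        for a in range(side):
            for b in range(a + 1, side):
                pairs.append((grid[r, a], grid[r, b]))  # same row
                pairs.append((grid[a, r], grid[b, r]))  # same column
    return pairs

def btl_scores(wins, iters=100):
    """wins[i, j] = number of annotators preferring item i over item j;
    returns one scalar score per item (higher = more preferred)."""
    n = wins.shape[0]
    games = wins + wins.T                  # comparisons collected per pair
    p = np.ones(n)
    for _ in range(iters):                 # classic iterative MLE update
        for i in range(n):
            denom = np.sum(games[i] / (p[i] + p), where=games[i] > 0)
            if denom > 0:
                p[i] = wins[i].sum() / denom
        p /= p.sum()                       # fix the arbitrary scale
    return p

print(len(square_design_pairs(100)))       # 900 comparisons instead of 4950
```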
We modified the adaptive square design setup so that comparisons were taken
by many users simultaneously until all the required pairs had been covered. For the
rest, we proceeded according to the scheme in [40]:
1. Initialization: shots/key-frames are randomly assigned positions in the square
matrix;
2. Perform a single annotation round according to the shot/key-frame pairs given
by the square (across rows, columns);
Apart from the data and its annotations, to broaden the targeted communities, we
also provide some pre-computed content descriptors, namely:
Dense SIFT descriptors are computed following the original work in [45], except that the local frame patches are densely sampled instead of being selected by interest point detectors. A codebook of 300 codewords is used in the quantization process, with a spatial pyramid of three layers [39].
HoG descriptors, i.e., Histograms of Oriented Gradients [15], are computed over densely sampled patches. Following [68], HoG descriptors in a 2 × 2 neighborhood are concatenated to form a descriptor of higher dimension.
LBP i.e., Local Binary Patterns as proposed in [49].
GIST is computed based on the output energy of several Gabor-like filters (eight
orientations and four scales) over a dense frame grid like in [50].
Color histogram computed in the HSV space (Hue-Saturation-Value).
MFCC computed over 32 ms time-windows with 50% overlap. The cepstral
vectors are concatenated with their first and second derivatives.
CNN features i.e., the fc7 layer (4096 dimensions) and prob layer (1000
dimensions) of AlexNet [32].
Mid level face detection and tracking features obtained by face tracking-by-
detection in each video shot via a HoG detector [15] and the correlation tracker
proposed in [16].
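As an example of how such an audio descriptor can be computed, the sketch below uses librosa with the window length and overlap stated above; the audio file name, the number of cepstral coefficients, and the choice of librosa itself are assumptions rather than details of the released features.

```python
# MFCC over 32 ms windows with 50% overlap, concatenated with first and second
# derivatives; 13 coefficients is an assumption, and the audio track is assumed
# to have been extracted to a WAV file beforehand.
import librosa
import numpy as np

y, sr = librosa.load("trailer_00.wav", sr=None)
n_fft = int(0.032 * sr)        # 32 ms analysis window
hop = n_fft // 2               # 50% overlap
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
d1 = librosa.feature.delta(mfcc, order=1)   # first derivative
d2 = librosa.feature.delta(mfcc, order=2)   # second derivative
features = np.vstack([mfcc, d1, d2]).T      # one 39-dimensional vector per frame
```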
The 2016 Predicting Media Interestingness Task received more than 30 registrations, and in the end 12 teams from 9 countries submitted runs (see Fig. 2). The task attracted a lot of interest from the community, which shows the importance of this topic.
Tables 1 and 2 provide an overview of the official results for the two subtasks
(video and image interestingness prediction). A total of 54 runs were received,
6. http://trec.nist.gov/trec_eval/.
equally distributed between the two subtasks. As a general conclusion, the achieved
MAP values were low, which proves again the challenging nature of this problem.
Slightly higher values were obtained for image interestingness prediction.
To serve as a baseline for comparison, we generated a random ranking run, i.e., samples were ranked randomly five times and the average MAP was taken. Compared to this baseline, the results of the image subtask clearly confirm the merit of the submitted systems, as almost all of them score above the baseline. For the video subtask, on the other hand, the value range is smaller and a few systems did worse than the baseline. In the
following we present the participating systems and analyze the achieved results in
detail.
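For reference, the sketch below shows a simplified version of this evaluation logic (the official scoring relied on trec_eval): per-trailer average precision is computed with scikit-learn and averaged into MAP, and the random baseline is obtained by averaging over five random rankings. The data structures and names are hypothetical.

```python
# Simplified stand-in for the official scoring: MAP over trailers, plus a
# random-ranking baseline averaged over five shuffles.
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(labels, scores):
    """labels: dict trailer_id -> 0/1 array; scores: dict trailer_id -> float array."""
    aps = [average_precision_score(labels[t], scores[t])
           for t in labels if labels[t].any()]   # skip trailers without positives
    return float(np.mean(aps))

def random_baseline(labels, repeats=5, seed=0):
    rng = np.random.default_rng(seed)
    maps = []
    for _ in range(repeats):
        random_scores = {t: rng.random(len(y)) for t, y in labels.items()}
        maps.append(mean_average_precision(labels, random_scores))
    return float(np.mean(maps))
```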
Table 3 Overview of the characteristics of the submitted systems for predicting image interestingness

Team | Features | Classification technique
BigVid [69] | denseSIFT+CNN+Style Attributes+SentiBank | SVM (run4); Regularized DNN (run5)
ETH-CVL [67] | DNN-based | Visual Semantic Embedding Model
HKBU [44] | ColorHist+denseSIFT+GIST+HOG+LBP (run1); features from run1 + dimension reduction (run2) | Nearest neighbor and SVR
HUCVL [21] | CNN (run1, run3); MemNet (run2) | MLP (run1, run2); Deep triplet network (run3)
LAPI [14] | ColorHist+GIST (run1); denseSIFT+GIST (run2) | SVM
MLPBOON [52] | CNN, PCA for dimension reduction | Logistic regression
RUC [12] | GIST+LBP+CNN prob (run1); ColorHist+GIST+CNN prob (run2); ColorHist+GIST+LBP+CNN prob (run3) | Random Forest (run1, run2); SVM (run3)
Technicolor [56] | CNN (AlexNet fc7) | SVM (run1); MLP (run2)
TUD-MMC [42] | Face-related ColorHist (run1); Face-related ColorHist+Face area (run2) | Normalized histogram-based confidence score (NHCS) (run1); NHCS+Normalized face area score (run2)
UIT-NII [38] | CNN (AlexNet+VGG) (run1); CNN (VGG)+GIST+HOG+DenseSIFT (run2) | SVM with late fusion
UNIGECISA [53] | Multilingual visual sentiment ontology (MVSO)+CNN | Linear regression
Table 4 Overview of the characteristics of the submitted systems for predicting video interestingness

Team | Features | Classification technique | Multi-modality
BigVid [69] | denseSIFT, CNN, Style Attributes, SentiBank | SVM (run1); Regularized DNN (run2); SVM/Ranking-SVM (run3) | No
ETH-CVL [67] | DNN-based | Video2GIF (run1); Video2GIF+Visual Semantic Embedding Model (run2) | Text+Visual
HKBU [44] | ColorHist+denseSIFT+GIST+HOG+LBP (run1); features from run1 + dimension reduction (run2) | Nearest neighbor and SVR | No
LAPI [14] | GIST+CNN prob (run3); ColorHist+CNN (run4); denseSIFT+CNN prob (run5) | SVM | No
RUC [12] | Acoustic statistics + GIST (run4); MFCC with Fisher Vector encoding + GIST (run5) | SVM | Audio+Visual
Technicolor [56] | CNN+MFCC | LSTM-ResNet + MLP (run3); proposed RNN-based model (run4, run5) | Audio+Visual
TUD-MMC [42] | ColorHist (run1); ColorHist+Face area (run2) | Normalized histogram-based confidence score (NHCS) (run3); NHCS+Normalized face area score (run4) | No
UIT-NII [38] | CNN (AlexNet)+MFCC (run3); CNN (VGG)+GIST (run4) | SVM with late fusion | Audio+Visual
UNIFESP [1] | Histogram of motion patterns (HMP) [2] | Majority voting of pairwise ranking methods: Ranking SVM, RankNet, RankBoost, ListNet | No
UNIGECISA [53] | MVSO+CNN (run2); baseline visual features [18] (run3); emotionally-motivated audio feature (run4) | SVR (run2); SPARROW (run3, run4) | Audio+Visual
RUC [12] (Renmin University, China): investigated the use of CNN features and the AlexNet probabilistic layer (referred to as CNN prob), together with hand-crafted visual features including Color Histogram, GIST, LBP, HOG, and dense SIFT. The classifiers were SVM and Random Forest. They found that semantic-level features, i.e., CNN prob, and low-level appearance features are complementary. However, concatenating CNN features with hand-crafted features did not bring any improvement. This finding is coherent with the statement of the MLPBOON team [52]. For predicting video interestingness, the audio modality offered better performance than the visual modality, and the early fusion of the two modalities could further boost the performance.
Technicolor [56] (Technicolor R&D France, co-organizer of the task): used
CNN features as visual features (for both the image and video subtasks), and
MFCC as audio feature (for the video subtask) and investigated the use of both
SVM and different Deep Neural Networks (DNN) as classification techniques.
For the image subtask, a simple system with CNN features and SVM resulted
in the best MAP, 0.2336. For the video subtask, multi-modality, in the form of a mid-level fusion of audio and visual features, was taken into account within the DNN
framework. Additionally, a novel DNN architecture based on multiple Recurrent
Neural Networks (RNN) was proposed for modeling the temporal aspect of the
video, and a resampling/upsampling technique was used to deal with the unbalanced
dataset.
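A minimal sketch of this kind of pipeline, assuming pre-computed AlexNet fc7 features stored in NumPy files (hypothetical file names), could look as follows; it is an illustration in the spirit of the system described above, not the team's actual code.

```python
# CNN (fc7) features + SVM for image interestingness, with the signed distance
# to the decision boundary used as the confidence score for ranking.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_train = np.load("fc7_dev.npy")      # (n_dev_keyframes, 4096) AlexNet fc7 features
y_train = np.load("labels_dev.npy")   # 0 = non interesting, 1 = interesting
X_test = np.load("fc7_test.npy")

clf = make_pipeline(StandardScaler(),
                    SVC(kernel="rbf", C=1.0, class_weight="balanced"))
clf.fit(X_train, y_train)

scores = clf.decision_function(X_test)   # interestingness confidence per key-frame
ranking = np.argsort(-scores)            # most interesting first
```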
TUD-MMC [42] (Delft University of Technology, Netherlands): investigated the MAP values obtained on the development set by swapping the ground-truth annotations of the image and video subtasks, i.e., submitting the video ground-truth to the image subtask and the image ground-truth to the video subtask. They concluded that the correlation between the image interestingness and video interestingness concepts is low. Their simple visual features took into account human face information (color and size) in the image and video, with the assumption that clear human faces should attract the viewer's attention and thus make the image/video more interesting. One of their submitted runs, which was purely rule-based, obtained the best MAP value of 0.2336 for the image subtask.
UIT-NII [38] (University of Science, Vietnam; University of Information Tech-
nology, Vietnam; National Institute of Informatics, Japan): used SVM to predict
three different scores given the three types of input features: (1) low-level visual
features provided by the organizers [18], (2) CNN features (AlexNet and VGG),
and (3) MFCC as audio feature. Late fusion of these scores was used for computing
the final interestingness levels. Interestingly, their system tends to output a higher
rank on images of beautiful women. Furthermore, they found that images from dark
scenes were often considered as more interesting.
UNIFESP [1] (Federal University of Sao Paulo, Brazil): participated only in the
video subtask. Their approach was based on combining learning-to-rank algorithms for predicting the interestingness of videos using their visual content only. For this purpose, Histograms of Motion Patterns (HMP) [2] were used. A simple majority voting scheme was used to combine four pairwise machine-learned rankers: Ranking SVM, RankNet, RankBoost, and ListNet.
This section provides an in-depth analysis of the results and discusses the global
trends found in the submitted systems.
Low-Level vs. High-Level Description The conventional low-level visual fea-
tures, such as dense SIFT, GIST, LBP, and Color Histogram, were still used by many of the systems for both image and video interestingness prediction [12, 14,
38, 44, 69]. However, deep features like CNN features (i.e., Alexnet fc7 or VGG)
have become dominant and are exploited by the majority of the systems. This
shows the effectiveness and popularity of deep learning. Some teams investigated
the combination of hand crafted features with deep features, i.e., conventional and
CNN features. A general finding is that such a combination did not really bring
any benefit to the prediction results [12, 44, 52]. Some systems combined low-level
features with some high-level attributes such as emotional expressions, human faces,
CNN visual concept predictions [12, 69]. In this case, the resulting conclusion was
that low-level appearance features and semantic-level features are complementary,
as the combination in general offered better prediction results.
Standard vs. Deep Learning-Based Classification As it can be seen in Tables 3
and 4, SVM was used by a large number of systems for both prediction tasks. In addition, regression techniques such as linear regression, logistic
regression, and support vector regression were also widely reported. Contrary to
CNN features, which were widely used by most of the systems, deep learning classification techniques were investigated less (see [21, 56, 67, 69] for image interestingness and [56, 67, 69] for video interestingness). This may be due to the fact that the datasets are not large enough to justify a deep learning approach; conventional classifiers were preferred here.
Use of External Data Some systems investigated the use of external data to
improve the results. For instance, Flickr images with social-driven interestingness
labels were used for model selection in the image interestingness subtask by the
Technicolor team [56]. The HUCVL team [21] submitted a run with a fine-tuning of
the MemNet model, which was trained for image memorability prediction. Although
memorability and interestingness are not the same concept, the authors expected
that fine-tuning a model related to an intrinsic property of images could be helpful
in learning better high-level features for image interestingness prediction. The ETH-
CVL team [67] exploited movie titles, as textual side information related to movies,
for both subtasks. In addition, ETH-CVL also investigated the use of the deep
RankNet model, which was trained on the Video2GIF dataset [24], and the Visual
Semantic Embedding model, which was trained on the MSR Clickture dataset [28].
Dealing with Small and Unbalanced Data As the development data provided for
the two subtasks are not very large, some systems, e.g., [1, 56], used the whole
image and video development sets for training when building the final models. To
cope with the imbalance of the two classes in the dataset, the Technicolor team [56]
proposed to use classic resampling and upsampling strategies so that the positive
samples are used multiple times during training.
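A simple oversampling scheme in that spirit is sketched below; the function name and target ratio are assumptions made for illustration.

```python
# Oversample the minority (interesting) class by duplicating positive samples.
import numpy as np

def oversample_positives(X, y, target_ratio=1.0, seed=0):
    """Resample positives with replacement until #positives ~= target_ratio * #negatives."""
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    resampled_pos = rng.choice(pos, size=int(target_ratio * len(neg)), replace=True)
    idx = rng.permutation(np.concatenate([neg, resampled_pos]))
    return X[idx], y[idx]
```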
Multi-Modality Specific to video interestingness, multi-modal approaches were
exploited by half of the teams for at least one of their runs, as shown in Table 4.
Four teams combined audio and visual information [12, 38, 53, 56], and one team
combined text with visual information [67]. The fusion of modalities was done
either at the early stage [12, 53], middle stage [56], or late stage [38] in the
processing workflows. Note that the combination of text and visual information was
also reported in [67] for image interestingness prediction. The general finding here
was that multi-modality brings benefits to the prediction results.
Temporal Modeling for Video Though the temporal aspect is an important
property of a video, most systems did not actually exploit any temporal modeling
for video interestingness prediction. They mainly considered a video as a sequence
of frames and a global video descriptor was computed simply by averaging frame
image descriptors over each shot. As an example, the HKBU team [44] treated each frame as a separate image, and calculated the average and standard deviation of their features over all frames in a shot to build a global feature vector for each video. Only two teams incorporated temporal modeling in their submitted systems, namely Technicolor [56], who used long short-term memory (LSTM) networks in their deep
learning-based framework, and ETH-CVL [67] who used 3D convolutional neural
networks (C3D) in their video highlight detector, trained on the Video2GIF dataset.
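The frame-pooling strategy used by most systems can be written in a few lines; the sketch below follows the mean/standard-deviation pooling reported by HKBU and assumes a per-shot array of frame descriptors as input.

```python
# Collapse the per-frame descriptors of one shot into a single shot-level vector.
import numpy as np

def pool_shot_descriptor(frame_features):
    """frame_features: (n_frames, dim) array of per-frame descriptors."""
    mean = frame_features.mean(axis=0)
    std = frame_features.std(axis=0)
    return np.concatenate([mean, std])   # 2 * dim shot descriptor
```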
The purpose of this section is to give some insights on the characteristics of the
produced data, i.e., the dataset and its annotations.
In general, the overall results obtained during the 2016 campaign show low values
for MAP (see Figs. 1 and 2), especially for the video interestingness prediction
subtask. To provide a point of comparison, we give examples of MAP values obtained by other multi-modal tasks from the literature. Of course, these were obtained on datasets which are fundamentally different from ours, both in terms of the data and of the use case scenario. A direct comparison is therefore not possible; nevertheless, these values give an idea of the current classification capabilities for video:
• ILSVRC 2015, Object Detection with provided training data, 200
fully labeled categories, best MAP is 0.62; Object Detection from videos with
provided training data, 30 fully labeled categories, best MAP is 0.67;
• TRECVID 2015, Semantic indexing of concepts such as: airplane, kitchen, flags,
etc., best MAP is 0.37;
• TRECVID 2015, Multi-modal event detection, e.g., somebody cooking on an
outdoor grill, best MAP is less than 0.35.
Although these values are higher than the MAP obtained for the Predicting Media Interestingness Task, it must be noted that for the more difficult tasks, such as multi-modal event detection, the difference in performance is not that large, especially given that the proposed challenge is far more subjective than the tasks we are referring to.
Nevertheless, we may wonder, especially for the video interestingness subtask, whether the quality of the dataset/annotations partly affects the prediction performance. Firstly, although the dataset size is sufficient for classic learning techniques and required a huge annotation effort, it may not be sufficient for deep learning, with only several thousand samples for each subtask.
Furthermore, the dataset may be considered highly unbalanced, with 8.3% and 9.6% of interesting content for the development set and test set, respectively. Trying to cope with the dataset's imbalance has been shown to increase the performance of some systems [56, 57]. This leads to the conclusion that, although this imbalance reflects
reality, i.e., interesting content corresponds to only a small part of the data, it makes
the task even more difficult, as systems will have to take this characteristic into
account.
Finally, in Sect. 3.2, we explained that the final annotations were determined with
an iterative process which required the convergence of the results. Due to limited
time and human resources, this process was limited to five rounds. More rounds
would certainly have resulted in better convergence of the inter-annotator ratings.
To have an idea of the subjective quality of the ground-truth rankings, Figs. 3
and 4 illustrate some image examples for the image interestingness subtask together
with the rankings obtained by one of the best systems and the second worst
performing system, for both interesting and non interesting images.
The figures show that results obtained by the best system for the most interesting
images are coherent with the selection proposed by the ground-truth, whereas the
second worst performing system offers more images at the top ranks which do not
really contain any information, e.g., black or uniform frames, frames with blur, or objects and persons only partially visible.

Fig. 3 Examples of interesting images from different videos of the test set. Images are ranked from left to right by decreasing interestingness. (a) Interesting images according to the ground-truth. (b) Interesting images selected by the best system. (c) Interesting images selected by the second worst performing system (Color figure online)

Fig. 4 Examples of non interesting images from different videos of the test set. Images are ranked from left to right by increasing interestingness. (a) Non interesting images according to the ground-truth. (b) Non interesting images selected by the best system (Color figure online)
These facts converge to the idea that both the provided ground-truth and the best
working systems have managed to capture the interestingness of images. It also
confirms that the obtained MAP values, although quite low, nevertheless correspond
to real differences in the interestingness prediction performance.
The observation of the images which were classified as non interesting (Fig. 4) is
also a source of interesting insights. According to the ground-truth and also to the
best performing systems, non interesting images tend to be those mostly uniform,
of low quality or without meaningful information. The amount of information con-
tained in the non interesting images then increases with the level of interestingness.
Note that we do not show here the images classified as non interesting by the second
worst performing system, as we did for the interesting images, because there were too few of them (7 images out of 25 videos in this example) to draw any conclusion.
We also calculated Krippendorff's alpha (α), a measure of inter-observer agreement [26, 37], and obtained α = 0.059 for image interestingness and α = 0.063 for video interestingness. These values would indicate that there is no inter-observer agreement. However, as our method (by design) produced very few duplicate comparisons, it is not clear whether this result is reliable.
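For reference, such an agreement check can be reproduced by recoding each pairwise judgement as an (annotator, pair, choice) triple and feeding the triples to an off-the-shelf implementation; the sketch below uses NLTK's agreement module with its default nominal distance, which is an assumption and may differ from the exact computation used for the values above.

```python
# Krippendorff's alpha over pairwise judgements recoded as categorical labels.
from nltk.metrics.agreement import AnnotationTask

judgements = [
    ("annotator_1", "trailer00_pair_03", "left"),
    ("annotator_2", "trailer00_pair_03", "left"),
    ("annotator_3", "trailer00_pair_03", "right"),
    # ... one (annotator, pair, choice) triple per collected comparison
]
task = AnnotationTask(data=judgements)
print("Krippendorff's alpha:", task.alpha())
```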
As a last insight, it is worth noting that the two experienced teams [53, 67], i.e., the two teams that had worked on predicting content interestingness before the MediaEval benchmark, did not achieve particularly good results on either subtask, especially on the image subtask. This raises the question of the generalization ability of their systems to different types of content, unless the difference in performance comes from their choice of a different use case as working context. If the latter, this seems to show that different use cases correspond to different interpretations of the interestingness concept.
Fig. 5 Representation of image rankings vs. video rankings from the ground-truth for several
videos of the development set. (a) Video 0, (b) Video 4, (c) Video 7, (d) Video 10, (e) Video 14,
(f) Video 51
This conclusion is in line with what was found in [42] where the authors
investigated the assessment of the ground-truth ranking of the image subtask against
the ground-truth ranking of the video subtask and vice-versa. The MAP value achieved by the video ground-truth on the image subtask was 0.1747, while the image ground-truth on the video subtask achieved 0.1457, i.e., in the range of, or even lower than, the random baseline in both cases. Videos obviously contain more information
than a single image, which can be conveyed by other channels such as audio and
motion, for example. Because of this additional information, a video might be
globally considered as interesting while one single key-frame extracted from the
same video will be considered as non interesting. This can explain, in some cases,
the observed discrepancy between image and video interestingnesses.
Trying to infer some potential links between the interestingness concept and perceptual content characteristics, we studied how low-level characteristics such as shot length, average luminance, blur, and the presence of high-quality faces influence the interestingness prediction of images and videos.
A first qualitative study of the sets of interesting and non interesting images in the development and test sets shows that most uniformly black and very blurry images were classified as non interesting. So were the majority of images with no real information, i.e., close-ups of ordinary objects, partly cut faces or objects, etc., as can be seen in Fig. 4.
Figure 6 shows the distributions of interestingness values for both the development and test sets in the video interestingness subtask, compared to the distributions of interestingness values restricted to shots with less than 10 frames. In both cases, the distributions of the short shots can simply be superimposed on the complete distributions, meaning that shot length does not seem to influence the interestingness of video segments, even for very short durations. On the contrary, Fig. 7 shows the same two types of distributions, but for the image interestingness subtask, this time assessing the influence of the average luminance value on interestingness. Here, the distributions of interestingness levels for the images with low average luminance seem to be slightly shifted toward lower interestingness levels. This might lead us to the conclusion that low average luminance values tend to decrease the interestingness level of a given image, contrary to the conclusion in [38].

Fig. 6 Video interestingness and shot length: distribution of interestingness levels (in blue, all shots considered; in green, shots with length smaller than 10 frames). (a) Development set, (b) test set (Color figure online)

Fig. 7 Image interestingness and average luminance: distribution of interestingness levels (in blue, all key-frames considered; in green, key-frames with luminance values lower than 25). (a) Development set, (b) test set (Color figure online)
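The luminance analysis above can be approximated with a short script that computes each key-frame's mean gray level with OpenCV and splits the ground-truth interestingness values at the threshold of 25 used in Fig. 7; the data structures below are hypothetical.

```python
# Mean luminance per key-frame, and the two interestingness distributions
# (all key-frames vs. dark key-frames) compared in Fig. 7.
import cv2
import numpy as np

def mean_luminance(image_path):
    """Average 8-bit gray level of a key-frame, used as a luminance proxy."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    return float(gray.mean())

def luminance_split(interestingness, threshold=25.0):
    """interestingness: dict mapping key-frame path -> ground-truth level."""
    all_levels = np.array(list(interestingness.values()))
    dark_levels = np.array([lvl for path, lvl in interestingness.items()
                            if mean_luminance(path) < threshold])
    return all_levels, dark_levels   # plot/compare the two distributions
```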
We also investigated some potential correlation between the presence of high-
quality faces in frames and the interestingness level. By high-quality faces, we mean rather large faces with no motion blur, either frontal or in profile, without closed eyes or funny expressions. This last mid-level characteristic was assessed manually by counting the number of high-quality faces present in both the interesting and non interesting images for the image interestingness subtask. The proportion of high-quality faces in the development set was found to be 48.2% for the set of images annotated as interesting and 33.9% for the set of images annotated as non interesting. For the test set, 56.0% of the interesting images and 36.7% of the non interesting images contain high-quality faces. The difference in favor of the interesting sets tends to prove that
this characteristic has a positive influence on the interestingness assessment. This
was confirmed by the results obtained by TUD-MMC team [42] who based their
system only on the detection of these high quality faces and achieved the best MAP
value for the image subtask.
As a general conclusion, we may say that perceptual quality plays an important
role when assessing the interestingness of images, although it is not the only clue to
assess the interestingness of content. Among other semantic objects, the presence
of good quality human faces seems to be correlated with interestingness.
completely different tasks. This has been more or less shown in the literature; however, in those cases, the images and videos were not chosen to be correlated. Therefore, a future perspective might be to separate the two subtasks, while focusing on more representative data for each.
Acknowledgements We would like to thank Yu-Gang Jiang and Baohan Xu from the Fudan
University, China, and Hervé Bredin, from LIMSI, France for providing the features that
accompany the released data, and Frédéric Lefebvre, Alexey Ozerov and Vincent Demoulin for
their valuable inputs to the task definition. We also would like to thank our anonymous annotators
for their contribution to building the ground-truth for the datasets. Part of this work was funded
under project SPOTTER PN-III-P2-2.1-PED-2016-1065, contract 30PED/2017.
References
1. Almeida, J.: UNIFESP at MediaEval 2016 Predicting Media Interestingness Task. In:
Proceedings of the MediaEval Workshop, Hilversum (2016)
2. Almeida, J., Leite, N.J., Torres, R.S.: Comparison of video sequences with histograms of
motion patterns. In: IEEE ICIP International Conference on Image Processing, pp. 3673–3676
(2011)
3. Baveye, Y., Dellandréa, E., Chamaret, C., Chen, L.: Liris-accede: a video database for affective
content analysis. IEEE Trans. Affect. Comput. 6(1), 43–55 (2015)
4. Berg, A.C., Berg, T.L., Daume, H., Dodge, J., Goyal, A., Han, X., Mensch, A., Mitchell, M.,
Sood, A., Stratos, K., et al.: Understanding and predicting importance in images. In: IEEE
CVPR International Conference on Computer Vision and Pattern Recognition, pp. 3562–3569.
IEEE, Providence (2012)
5. Berlyne, D.E.: Conflict, Arousal and Curiosity. Mc-Graw-Hill, New York (1960)
6. Boiman, O., Irani, M.: Detecting irregularities in images and in video. Int. J. Comput. Vis.
74(1), 17–31 (2007)
7. Bradley, R.A., Terry, M.E.: Rank analysis of incomplete block designs: the method of paired
comparisons. Biometrika 39(3-4), 324–345 (1952)
8. Breunig, M.M., Kriegel, H.P., Ng, R.T., Sander, J.: Lof: identifying density-based local outliers.
In: ACM Sigmod Record, vol. 29, pp. 93–104. ACM, New York (2000)
9. Bulling, A., Roggen, D.: Recognition of visual memory recall processes using eye movement
analysis. In: Proceedings of the 13th international conference on Ubiquitous Computing, pp.
455–464. ACM, New York (2011)
10. Chamaret, C., Demarty, C.H., Demoulin, V., Marquant, G.: Experiencing the interestingness
concept within and between pictures. In: Proceeding of SPIE, Human Vision and Electronic
Imaging (2016)
11. Chen, A., Darst, P.W., Pangrazi, R.P.: An examination of situational interest and its sources.
Br. J. Educ. Psychol. 71(3), 383–400 (2001)
12. Chen, S., Dian, Y., Jin, Q.: RUC at MediaEval 2016 Predicting Media Interestingness Task. In:
Proceedings of the MediaEval Workshop, Hilversum (2016)
13. Chu, S.L., Fedorovskaya, E., Quek, F., Snyder, J.: The effect of familiarity on perceived
interestingness of images. In: Proceedings of SPIE, vol. 8651, pp. 86511C–86511C-12 (2013). doi:10.1117/12.2008551, http://dx.doi.org/10.1117/12.2008551
14. Constantin, M.G., Boteanu, B., Ionescu, B.: LAPI at MediaEval 2016 Predicting Media
Interestingness Task. In: Proceedings of the MediaEval Workshop, Hilversum (2016)
Predicting Interestingness of Visual Content 263
15. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: IEEE CVPR
International Conference on Computer Vision and Pattern Recognition (2005)
16. Danelljan, M., Hager, G., Khan, F.S., Felsberg, M.: Accurate scale estimation for robust visual
tracking. In: British Machine Vision Conference (2014)
17. Datta, R., Joshi, D., Li, J., Wang, J.Z.: Studying aesthetics in photographic images using a
computational approach. In: IEEE ECCV European Conference on Computer Vision, pp. 288–
301. Springer, Berlin (2006)
18. Demarty, C.H., Sjöberg, M., Ionescu, B., Do, T.T., Wang, H., Duong, N.Q.K., Lefebvre, F.:
Mediaeval 2016 Predicting Media Interestingness Task. In: Proceedings of the MediaEval
Workshop, Hilversum (2016)
19. Dhar, S., Ordonez, V., Berg, T.L.: High level describable attributes for predicting aesthetics
and interestingness. In: IEEE International Conference on Computer Vision and Pattern
Recognition (2011)
20. Elazary, L., Itti, L.: Interesting objects are visually salient. J. Vis. 8(3), 3–3 (2008)
21. Erdogan, G., Erdem, A., Erdem, E.: HUCVL at MediaEval 2016: predicting interesting key
frames with deep models. In: Proceedings of the MediaEval Workshop, Hilversum (2016)
22. Grabner, H., Nater, F., Druey, M., Gool, L.V.: Visual interestingness in image sequences.
In: ACM International Conference on Multimedia, pp. 1017–1026. ACM, New York (2013).
doi:10.1145/2502081.2502109, http://doi.acm.org/10.1145/2502081.2502109
23. Gygli, M., Grabner, H., Riemenschneider, H., Nater, F., van Gool, L.: The interestingness of
images. In: ICCV International Conference on Computer Vision (2013)
24. Gygli, M., Song, Y., Cao, L.: Video2gif: automatic generation of animated gifs from video.
CoRR abs/1605.04850 (2016). http://arxiv.org/abs/1605.04850
25. Harel, J., Koch, C., Perona, P.: Graph-based visual saliency. In: Advances in Neural
Information Processing Systems, pp. 545–552 (2006)
26. Hayes, A.F., Krippendorff, K.: Answering the call for a standard reliability measure for coding
data. Commun. Methods Meas. 1(1), 77–89 (2007). doi:10.1080/19312450709336664, http://dx.doi.org/10.1080/19312450709336664
27. Hsieh, L.C., Hsu, W.H., Wang, H.C.: Investigating and predicting social and visual image
interestingness on social media by crowdsourcing. In: 2014 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), pp. 4309–4313. IEEE, Providence (2014)
28. Hua, X.S., Yang, L., Wang, J., Wang, J., Ye, M., Wang, K., Rui, Y., Li, J.: Clickage:
towards bridging semantic and intent gaps via mining click logs of search engines. In: ACM
International Conference on Multimedia (2013)
29. Isola, P., Parikh, D., Torralba, A., Oliva, A.: Understanding the intrinsic memorability of
images. In: Advances in Neural Information Processing Systems, pp. 2429–2437 (2011)
30. Isola, P., Xiao, J., Torralba, A., Oliva, A.: What makes an image memorable? In: IEEE CVPR
International Conference on Computer Vision and Pattern Recognition, pp. 145–152. IEEE,
Providence (2011)
31. Jiang, Y.G., Wang, Y., Feng, R., Xue, X., Zheng, Y., Yan, H.: Understanding and predicting
interestingness of videos. In: AAAI Conference on Artificial Intelligence (2013)
32. Jiang, Y.G., Dai, Q., Mei, T., Rui, Y., Chang, S.F.: Super fast event recognition in internet
videos. IEEE Trans. Multimedia 177(8), 1–13 (2015)
33. Joachims, T.: Optimizing search engines using clickthrough data. In: ACM SIGKDD
international conference on Knowledge discovery and data mining, pp. 133–142. ACM, New
York (2002)
34. Ke, Y., Hoiem, D., Sukthankar, R.: Computer vision for music identification. In: IEEE CVPR
International Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 597–604.
IEEE, Providence (2005)
35. Ke, Y., Tang, X., Jing, F.: The design of high-level features for photo quality assessment. In:
IEEE CVPR International Conference on Computer Vision and Pattern Recognition, vol. 1, pp.
419–426. IEEE, Providence (2006)
264 C.-H. Demarty et al.
36. Khosla, A., Raju, A.S., Torralba, A., Oliva, A.: Understanding and predicting image memora-
bility at a large scale. In: International Conference on Computer Vision (ICCV) (2015)
37. Krippendorff, K.: Content Analysis: An Introduction to Its Methodology, 3rd edn. Sage,
Thousand Oaks (2013)
38. Lam, V., Do, T., Phan, S., Le, D.D., Satoh, S., Duong, D.: NII-UIT at MediaEval 2016 Pre-
dicting Media Interestingness Task. In: Proceedings of the MediaEval Workshop, Hilversum
(2016)
39. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: spatial pyramid matching for
recognizing natural scene categories. In: IEEE CVPR International Conference on Computer
Vision and Pattern Recognition, pp. 2169–2178 (2006)
40. Li, J., Barkowsky, M., Le Callet, P.: Boosting paired comparison methodology in measuring
visual discomfort of 3dtv: performances of three different designs. In: Proceedings of SPIE
Electronic Imaging, Stereoscopic Displays and Applications, vol. 8648 (2013)
41. Li, L.J., Su, H., Fei-Fei, L., Xing, E.P.: Object bank: a high-level image representation for scene
classification & semantic feature sparsification. In: Advances in Neural Information Processing
Systems, pp. 1378–1386 (2010)
42. Liem, C.: TUD-MMC at MediaEval 2016 Predicting Media Interestingness Task. In:
Proceedings of the MediaEval Workshop, Hilversum (2016)
43. Liu, F., Niu, Y., Gleicher, M.: Using web photos for measuring video frame interestingness.
In: Proceedings of the International Joint Conference on Artificial Intelligence, pp. 2058–2063
(2009)
44. Liu, Y., Gu, Z., Cheung, Y.M.: Supervised manifold learning for media interestingness
prediction. In: Proceedings of the MediaEval Workshop, Hilversum (2016)
45. Lowe, D.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60,
91–110 (2004)
46. Machajdik, J., Hanbury, A.: Affective image classification using features inspired by psychol-
ogy and art theory. In: ACM International Conference on Multimedia, pp. 83–92. ACM, New
York (2010). doi:10.1145/1873951.1873965, http://doi.acm.org/10.1145/1873951.1873965
47. McCrae, R.R.: Aesthetic chills as a universal marker of openness to experience. Motiv. Emot.
31(1), 5–11 (2007)
48. Murray, N., Marchesotti, L., Perronnin, F.: Ava: a large-scale database for aesthetic visual anal-
ysis. In: IEEE CVPR International Conference on Computer Vision and Pattern Recognition,
pp. 2408–2415. IEEE, Providence (2012)
49. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant
texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 24(7),
971–987 (2002)
50. Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial
envelope. Int. J. Comput. Vis. 42, 145–175 (2001)
51. Ovadia, S.: Ratings and rankings: reconsidering the structure of values and their measurement.
Int. J. Soc. Res. Methodol. 7(5), 403–414 (2004). doi:10.1080/1364557032000081654, http://dx.doi.org/10.1080/1364557032000081654
52. Parekh, J., Parekh, S.: The MLPBOON Predicting Media Interestingness System for MediaE-
val 2016. In: Proceedings of the MediaEval Workshop, Hilversum (2016)
53. Rayatdoost, S., Soleymani, M.: Ranking images and videos on visual interestingness by visual
sentiment features. In: Proceedings of the MediaEval Workshop, Hilversum (2016)
54. Schaul, T., Pape, L., Glasmachers, T., Graziano, V., Schmidhuber, J.: Coherence progress:
a measure of interestingness based on fixed compressors. In: International Conference on
Artificial General Intelligence, pp. 21–30. Springer, Berlin (2011)
55. Shechtman, E., Irani, M.: Matching local self-similarities across images and videos. In: 2007
IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE, Providence
(2007)
56. Shen, Y., Demarty, C.H., Duong, N.Q.K.: Technicolor@MediaEval 2016 Predicting Media
Interestingness Task. In: Proceedings of the MediaEval Workshop, Hilversum (2016)
Predicting Interestingness of Visual Content 265
57. Shen, Y., Demarty, C.H., Duong, N.Q.K.: Deep learning for multimodal-based video interest-
ingness prediction. In: IEEE International Conference on Multimedia and Expo, ICME’17
(2017)
58. Silvia, P.J.: What is interesting? Exploring the appraisal structure of interest. Emotion 5(1), 89
(2005)
59. Silvia, P.J., Henson, R.A., Templin, J.L.: Are the sources of interest the same for everyone?
using multilevel mixture models to explore individual differences in appraisal structures.
Cognit. Emot. 23(7), 1389–1406 (2009)
60. Sjöberg, M., Baveye, Y., Wang, H., Quang, V.L., Ionescu, B., Dellandréa, E., Schedl, M.,
Demarty, C.H., Chen, L.: The mediaeval 2015 affective impact of movies task. In: Proceedings
of the MediaEval Workshop, CEUR Workshop Proceedings (2015)
61. Soleymani, M.: The quest for visual interest. In: ACM International Conference on Multime-
dia, pp. 919–922. New York, NY, USA (2015). doi:10.1145/2733373.2806364, http://doi.acm.org/10.1145/2733373.2806364
62. Spain, M., Perona, P.: Measuring and predicting object importance. Int. J. Comput. Vis. 91(1),
59–76 (2011)
63. Stein, B.E., Stanford, T.R.: Multisensory integration: current issues from the perspective of the
single neuron. Nat. Rev. Neurosci. 9(4), 255–266 (2008)
64. Torresani, L., Szummer, M., Fitzgibbon, A.: Efficient object category recognition using
classemes. In: IEEE ECCV European Conference on Computer Vision, pp. 776–789. Springer,
Berlin (2010)
65. Turner, S.A. Jr, Silvia, P.J.: Must interesting things be pleasant? A test of competing appraisal
structures. Emotion 6(4), 670 (2006)
66. Valdez, P., Mehrabian, A.: Effects of color on emotions. J. Exp. Psychol. Gen. 123(4), 394
(1994)
67. Vasudevan, A.B., Gygli, M., Volokitin, A., Gool, L.V.: Eth-cvl @ MediaEval 2016: Textual-
visual embeddings and video2gif for video interestingness. In: Proceedings of the MediaEval
Workshop, Hilversum (2016)
68. Xiao, J., Hays, J., Ehinger, K., Oliva, A., Torralba, A.: Sun database: large-scale scene
recognition from abbey to zoo. In: IEEE CVPR International Conference on Computer Vision
and Pattern Recognition, pp. 3485–3492 (2010)
69. Xu, B., Fu, Y., Jiang, Y.G.: BigVid at MediaEval 2016: predicting interestingness in images
and videos. In: Proceedings of the MediaEval Workshop, Hilversum (2016)
70. Yang, Y.H., Chen, H.H.: Ranking-based emotion recognition for music organization and
retrieval. IEEE Trans. Audio Speech Lang. Process. 19(4), 762–774 (2011)
71. Yannakakis, G.N., Hallam, J.: Ranking vs. preference: a comparative study of self-reporting.
In: International Conference on Affective Computing and Intelligent Interaction, pp. 437–446.
Springer, Berlin (2011)
Glossary
BoVW Bag-of-Visual-Words
BoW Bag-of-Words
CBIR Content-Based Image Retrieval
CNN Convolutional Neural Networks
CVIR Content-Based Video Retrieval
HVS Human Visual System