Information Fusion in Content Based Image Retrieval - A Comprehensive Overview
Information Fusion
journal homepage: www.elsevier.com/locate/inffus
Article history:
Received 8 September 2016
Revised 30 December 2016
Accepted 4 January 2017
Available online 5 January 2017

Keywords:
Information fusion
Content based image retrieval

Abstract

An ever increasing part of communication between persons involves the use of pictures, due to the cheap availability of powerful cameras on smartphones, and the cheap availability of storage space. The rising popularity of social networking applications such as Facebook, Twitter, Instagram, and of instant messaging applications, such as WhatsApp and WeChat, is the clear evidence of this phenomenon, due to the opportunity of sharing in real-time a pictorial representation of the context each individual is living in. The media rapidly exploited this phenomenon, using the same channel, either to publish their reports, or to gather additional information on an event through the community of users. While the real-time use of images is managed through metadata associated with the image (i.e., the timestamp, the geolocation, tags, etc.), their retrieval from an archive might be far from trivial, as an image bears a rich semantic content that goes beyond the description provided by its metadata. It turns out that after more than 20 years of research on Content-Based Image Retrieval (CBIR), the giant increase in the number and variety of images available in digital format is challenging the research community. It is quite easy to see that any approach aiming at facing such challenges must rely on different image representations that need to be conveniently fused in order to adapt to the subjectivity of image semantics. This paper offers a journey through the main information fusion ingredients that a recipe for the design of a CBIR system should include to meet the demanding needs of users.
© 2017 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.inffus.2017.01.003
1566-2535/© 2017 Elsevier B.V. All rights reserved.
L. Piras, G. Giacinto / Information Fusion 37 (2017) 50–60 51
Of course, the description of images through such low- and mid-level features is not always directly related to the common perception that the user has of an image. For a human being, indeed, an image can be seen as the representation of different concepts, either related to physical characteristics such as shapes, colors, textures, etc., or related to emotions and memories. From a computer's perspective, an image is simply a set of pixels with different "colors" and different intensities.

The early papers on CBIR are, by now, almost twenty-five years old [7], but while the results attained so far allowed achieving some relevant milestones [5], we are still facing issues for which an acceptable solution is far from being devised. One of the most relevant issues is related to the type of features used to handle the "content" of an image, which is usually represented through low-level features that describe the colors, shapes, and edges found in an image with numerical values. This means that, when two images are compared to find similarities, what is actually computed are the similarities with respect to the intrinsic features of the images, such as the presence of objects with a given shape, and/or the dominance of a given color, etc. It can be easily seen, however, that the effectiveness of the search in this case is limited to a small subset of semantic concepts.

For example, when the similarity between an image of an orange and an image of a lemon has to be measured, you might not always be satisfied by the result [8,9]. A retrieval system based on this level of description of image content may respond either with a very high or a very low value of similarity. It is not so difficult to see that a shape-based retrieval system would evaluate the two images as being similar, while a retrieval system based on color would not.

These early findings paved the way for exploring CBIR systems based on the fusion of different features, either by employing weighted similarity measures, where different features are weighted according to their relevance to the task at hand, or by combining different similarity functions, where image similarity is first computed separately for each feature, and then their values are combined.

In the past years, there have been many attempts to bridge the gap between the high-level features, those perceived by human beings which identify the semantic information related to the images, and the low-level ones that are used in the searches. This difference in perception is widely known in the CBIR field as the semantic gap. In order to capture such subjectivity, image retrieval tools may employ Relevance Feedback [10,11] mechanisms.

Relevance Feedback (RF) techniques involve the user in the process of refining the search. In a CBIR task in which RF is employed, the user submits to the system a query image which is an example of the pictures of interest. The system then assigns to each image in the database a score according to a similarity measure between each image and the query. The top k best-scored images are returned to the user, who labels them as being relevant or not, so that the system can consider all relevant images as additional examples to better specify the query, and the non-relevant ones as examples of images that the user is not interested in. With the availability of this new additional information, the system can improve the quality of the search results, by providing a larger number of relevant images in the next iteration.

It can be easily seen that this iterative and interactive procedure can benefit from the availability of multiple image representations, as the retrieval system can exploit the different similarity concepts embedded in the available representations, and adapt the search towards the user's interests. Consequently, the fusion of multiple image representations for content-based image retrieval tasks has been mainly addressed within the relevance feedback paradigm, as it provides a way to estimate the relevance of each feature and similarity measure with respect to the task at hand. Moreover, the relevance feedback paradigm can be employed to enable browse-to-search mechanisms, where the user does not have a specific target in mind, and the feedback captures the most relevant features that drive the browsing experience towards the images of interest [12,13].

In addition to the use of different sets of visual features, it can be easily seen that the effectiveness of a visual retrieval system can be improved by combining information from different modalities, i.e., from different types of content. For example, if we consider Web pages, they usually contain both images and text. Even if the relationship between the surrounding text and images varies greatly, with much of the text being redundant and/or unrelated to the visual content, a large amount of information about an image can be found in the textual context of the Web pages [14]. Several works indeed proved that such data can be effectively combined within traditional CBIR systems to improve the quality of the retrieval results [15–17].

This paper will introduce the reader to the major approaches proposed in the literature for fusing information in visual retrieval tasks. Section 2 describes the basic concepts behind the techniques proposed in the content based image retrieval field. Section 3 summarizes the main categories in which fusion approaches can be classified, each fusion approach being extensively addressed in Sections 4–7. Pros and cons of the described strategies, conclusions and future research perspectives are drawn in Sections 8 and 9, respectively.

2. Architecture of a CBIR system

The design of a content-based image retrieval system requires a clear planning of the goal of the system [5]. Since the images in an archive are of different types, are obtained by different acquisition techniques, and exhibit different content, the search for specific concepts is definitely a hard task. It is easy to see that the more the scope of the archive is limited, and the content to be searched clearly defined, the more easily the task can be managed [3]. On the other hand, the design of a general purpose multimedia retrieval engine is a challenging task, as the system should be capable of adapting to different semantic contents, and different intents of the users.

A number of content-based retrieval systems tailored to specific applications, usually referred to as narrow domain systems, have been proposed to date. Some of them are related to sport events, as the aspect of the scene is fixed, camera positions are known in advance, and the movements of the players and other objects (e.g., a ball) can be modeled [5]. Other applications are related to medical analysis, as the type of images, and the objects to look for, can be precisely defined [18].

The description of the content of a specific image can be provided in multiple ways. First of all, an image can be described in terms of its properties provided in textual form (e.g., creator, content type, keywords, etc.). This is the model used by Digital Libraries, where standard descriptors are defined, and guidelines for defining appropriate values are proposed. However, apart from descriptors such as the file format, the size of the image, etc., other keywords are typically provided by domain experts. In the case of very narrow-domain systems, it is possible to agree on an ontology that helps describe standard scenarios. On the other hand, when multimedia content is shared on the web, different users may assign the same keyword to different contents, as well as assign different keywords to the same content [19]. Thus, more complex ontologies and reasoning systems are required to correctly assess the similarity between images [20].

Low-level and medium-level content-based features [5,9] have been proposed in analogy with the possible way in which the human brain assesses the similarity between visual contents. While
weighted combination of the input values, while the pooling layer reduces the output of the convolutional layer (see Fig. 2). At each level, the raw features of the images are weighed and refined with respect to the previous level, to produce a better representation of the images. The hierarchical architecture of CNNs could be seen as a 'natural' way of combining multiple feature modalities, taking advantage of the training algorithms embedded in its hidden layers [59,60]. Unfortunately, despite the great attention paid by researchers to deep learning approaches for image classification and recognition, there is still a limited amount of work that specifically focuses on CBIR applications [32,61], where the goal is not to retrieve the images of the most probable class(es), but to retrieve the most similar images. This perspective affects both the learning phase and the output processing phase, as outlined by the few seminal works to date.

In the next sections, the different fusion strategies will be extensively addressed, and, for each strategy, several approaches will be discussed in order to provide an overview of the different techniques proposed in the past years for fusing multiple information in image retrieval tasks.

4. Feature weighting for early fusion

If the retrieval task is modeled as a classification task, where a pool of images described by a set of low-level features is assigned to a set of labels, so that images with the same labels are considered to be similar to each other, then a new feature space that captures the semantic similarity can be extracted by techniques such as PCA. However, to attain reliable results, the number of training images should be large, and the labels should be reliable. This is hardly the case in image retrieval from large datasets. Conversely, instead of formulating the problem in terms of a transformation of the feature space to discover the hidden relationships between relevant images, an alternative solution that has been widely investigated consists in using feature selection strategies, or feature weighting approaches, which can be considered to perform a soft selection strategy. In fact, by following this paradigm, non-discriminative features will receive a weight close to zero. As mentioned above, the idea comes from the observation that the effectiveness of CBIR techniques strongly depends on the choice of the set of visual features.

However, no matter how suitably the features have been designed for the task at hand, the set of retrieved images often fits the user's needs only partly. This is because, in general, the exact intent of the user's query cannot be fully captured even when multiple images are used for querying the archive. As a consequence, it is not possible to choose "a priori" the subset of features that is best suited to a user's query. The basic idea behind weighting mechanisms is that the exploitation of relevance feedback from the user implicitly defines which images should be considered similar to each other. For example, in a metric space, relevance feedback information can be exploited by modifying the similarity measure so that similar images (i.e., relevant images) are represented as neighbors of each other, and non-relevant images do not fall within the neighborhood of relevant images.

More formally, feature weighting mechanisms can be formulated as follows. An image I is represented as I = I(F), where F is a set of low-level feature spaces f_i, such as color, texture, etc. Each feature space f_i can be modeled by several representations f_{ij}, e.g., color histogram, color moments, etc. Each representation f_{ij} is itself a vector with multiple components

f_{ij} = ( f_{ij1}, \ldots, f_{ijh}, \ldots, f_{ijk}, \ldots, f_{ijL} ),    (1)

where L is the vector length. To each level f_i, f_{ij}, and f_{ijk} it is possible to associate a set of weights, denoted with w_i, w_{ij}, and w_{ijk}, aimed at representing the effectiveness of each feature for the query at hand. For example, for a given feature representation f_{ij}, the similarity between two images I_A and I_B can be computed by the "weighted" Minkowski metric [10]:

S_{f_{ij}} = \left( \sum_{k=1}^{L} w_{ijk} \, \left| I_A(f_{ijk}) - I_B(f_{ijk}) \right|^p \right)^{1/p}    (2)

with p >= 1. The majority of papers that addressed the problem of weight estimation followed a "probabilistic" approach. In [62] the authors proposed to estimate the weights using the inverse of the standard deviation of the values of a feature component computed over a "class" of relevant images. The rationale behind this proposal is that if a certain component of the feature vector takes similar values for all the relevant images, the component is relevant to the query; on the contrary, if all relevant images have different values for that component, then the component is not relevant. In [63], a "local" measure of relevance has been proposed to estimate feature weights (Probabilistic Feature Relevance Learning, PFRL). The estimation followed a least-squares approach, that is, a certain feature is more relevant to the query if it contributes more to the reduction of the prediction error. A different approach is used in [64], where the features with maximum balanced information gain obtained from the entropy of the set of labeled images have been selected. In [8], the weights have been estimated with the goal of privileging those feature spaces
where the set of relevant images forms a compact cluster or, in terms of probabilities, by assigning more importance to features for which the relevant examples have a high likelihood, and less importance to features for which the non-relevant examples have a low likelihood. Finally, in [65] the authors proposed a dynamic feature weighting approach that exploits intra-cluster and inter-cluster information to represent the descriptive and discriminative properties of the features according to the labels given by the user.

The same estimation procedure could be used not only to weigh the components of each individual feature space, but also to estimate the weights to be associated to subsets of components. The idea stems from the fact that the feature vectors can be decomposed into "sub-vectors", each sub-vector describing a different part of the image or a specific characteristic. Therefore, by assigning a greater or a lower weight to one of them, it is possible to better adapt the search to the concept the user is looking for. In [66] a different point of view with respect to the usual probabilistic approach has been proposed. The weights associated to a given feature have been estimated so that they reflect the capability of representing nearest-neighbor relationships according to the user's choices. This method is tailor-made for retrieval techniques based on the nearest-neighbor paradigm, and the same algorithm can be used either to weight each component of one feature space, to weigh different subsets of feature values within the same feature space, or to weight different feature spaces.

Another approach that exploits the feedback from the user to assign a larger importance to features related to similar images, and less importance to other features, has been proposed in [67]. The rationale behind this approach can be explained by observing that, if the variance of the images relevant to the query is large along a given axis of the feature space, any value on this axis could be acceptable to the user, i.e., the value of the corresponding feature is irrelevant with respect to the user's needs, and therefore this axis should be given a low weight, and vice-versa. In that paper, the authors formulated the relevance feedback approach as a minimization problem whose solutions are the optimal query and a weight matrix used to define the distance metric between images. In [68], Rui and Huang improved the algorithm described in [67] by proposing a hierarchical model in which each image is represented by a set of different features. In that work, in addition to estimating weights related to each feature representation (inter-feature weighting), the different components of each feature representation are also weighted (intra-feature weighting).

Recently, the relevance feedback paradigm has also been exploited to improve the retrieval capabilities of CNNs by modifying the weights of the convolutional layers according to the feedback of the user, following an early fusion approach where the internal layers of the networks are seen as implicit feature representations of the input images [69].

5. Representation by multi-feature spaces for late fusion

The previous section showed that the combination of multiple image representations (colors, shapes, textures, etc.) by early fusion approaches can effectively cope with the reduced inter-class variation that is experienced by resorting to just one type of features. As a drawback, the use of multiple image representations with a high number of components increases the computational cost of retrieval techniques. As a consequence, the response time of the system might become an issue for interactive applications (e.g., web searching). Over the years, the pattern recognition community proposed a number of solutions for fusing the information from different feature spaces through the combination of the output of different pattern classifiers [70]. The most popular and effective techniques for output combination are based on late fusion techniques, such as the mean rule, the maximum rule, the minimum rule, and weighted means.

In the field of content-based image retrieval, similar approaches can be employed by considering the value of similarity between images as the output of a classifier. In particular, combination approaches have been proposed for fusing different feature representations, where the appropriate similarity metric is computed in each feature space, and then all the similarities are fused through a weighted sum [10].

In the same spirit, in [71], a large set of highly selective visual features has been used, where each feature was highly selective for a small percentage of images, and, at the same time, only a few features were selective for the set of relevant images. In this way, after the choice of the most selective features for a given query, each image in the archive can be evaluated very rapidly, by discarding all other features.

Artificial neural networks have been used in [72], where self-organizing maps (SOMs) are employed to measure the similarity between images. This approach aims at mapping the sequence of the queries based on the user's responses during the retrieval process. A separate SOM is trained for each feature vector type; then the system adapts to the user's preferences by returning more images from those SOMs where the responses of the user have been most densely mapped.

More recently, Arevalillo-Herráez [73] proposed a different probabilistic strategy to combine similarity measures. The authors considered a subjective similarity judgement given by users on a fixed set of images and related it to a measure of similarity, then combined the different values of similarity evaluated in different feature spaces. The different feature representations can be combined by fusing all the similarity metrics through a weighted sum [10,66]. The main issue for this kind of approach is to increase the performances in terms of Precision, Recall, and Average Precision [74], while limiting the increase of the processing time.

The issue of combining different feature representations is also relevant when relevance feedback mechanisms are used. In this case, at each iteration, similarities have to be computed by exploiting relevance feedback information, for example by resorting to Nearest-Neighbor or Support Vector Machine [75] techniques. In this view, an approach that has been proposed in the pattern recognition field to classify patterns represented by a set of similarity measures is the so-called "dissimilarity space". This approach is based on the creation of a new feature space where patterns are represented in terms of their (dis)similarities to some reference prototypes. The dimension of this space does not depend on the dimensions of the low-level features employed, but on the number of reference prototypes used to compute the dissimilarities, and on the number of dissimilarity measures employed. If we denote with P = \{p_1, \ldots, p_P\} the set of prototypes, and the dissimilarity measure between an image I_i and one of the prototypes p_j as d(I_i, p_j), then the image I_i can be represented in the dissimilarity space as follows:

I_i^P = [\, d(I_i, p_1), \ldots, d(I_i, p_P) \,].    (3)

This representation can be easily extended to take into account multiple dissimilarity measures by stacking the corresponding dissimilarity vectors.

This technique has been employed to exploit relevance feedback in the content-based image retrieval field [76,77], where the set of relevant images plays the role of reference prototypes. In addition, dissimilarity spaces have also been proposed for image retrieval to exploit information from different multi-modal characteristics [78]. Furthermore, Piras and Giacinto [56] propose another use of the dissimilarity representation for improving the performances of relevance feedback approaches based on the Nearest-Neighbor approach [79]. Instead of computing (dis)similarities by using dif-
ferent prototypes (e.g., the relevant images) and a single feature space, the authors propose to compute similarities by using just one prototype and multiple feature representations. Each image is thus represented by a very compact vector that summarizes different low-level characteristics, and allows images that are relevant to the user's goals to be represented as near points.

In the past years, the combination of multi-feature spaces for image retrieval tasks has also been proposed through the use of CNNs [80] to learn both the metrics for each feature space and the combination function for the different feature representations.

6. Fusing different relevance feedback approaches

The relevance feedback paradigm has been introduced to refine retrieval results, both to overcome inaccuracies in textual information, and to bridge the semantic gap between the low-level image descriptors and the user semantics. The user is actively involved in the retrieval process, as she is asked to label a set of retrieved images as being relevant or not [81] with respect to her interests. In general, the approaches proposed in the literature to exploit the RF paradigm can be divided into two groups. One group of techniques exploits relevance feedback by modifying some parameters of the search, either by computing a new query vector in the feature space [62], by choosing a more suitable similarity measure, or by using a weighted distance [66,82]. Another group of approaches is based on the formulation of RF in terms of a pattern classification task, using popular learning algorithms such as SVMs [83], neural networks and self-organizing maps [72,75,84], and using the relevant and non-relevant image sets for training purposes.

One of the first techniques to be employed for RF in CBIR tasks, and one that is still used in a number of image retrieval applications, is based on the so-called query shifting paradigm [62]. This technique has been developed in the text retrieval field, and is represented by the Rocchio formula [85]:

Q_{opt} = \frac{1}{N_R} \sum_{i \in D_R} D_i - \frac{1}{N_T - N_R} \sum_{i \in D_N} D_i    (4)

where D_R and D_N are the sets of relevant and non-relevant images, respectively, N_R is the number of images in D_R, N_T is the total number of documents, and D_i is the representation of an image in the feature space. This approach is motivated by the assumption that the query may lie in a region of the feature space that is in some way "far" from the images that are relevant to the user. On the contrary, according to Eq. (4), the optimal query should lie near the Euclidean center of the relevant images and "far" from the non-relevant images.

Relevance feedback has also been formulated in terms of a pattern classification task using neural networks, self-organizing maps (SOMs) [72], or approaches based on SVMs. The latter have been widely used to model the concepts behind the set of relevant images, and adjust the search accordingly [75,84]. However, it is worth noting that in many practical CBIR settings it is usually difficult to produce a high-level generalization of a "class" of objects, as the number of available relevant and non-relevant samples may be too small, and the concept of "class" is variable, due to the subjectivity of the definition of similarity between images. This kind of problem has been partially mitigated by the use of the active learning paradigm [86], where the system is trained not only with the most relevant images according to the user's judgement, but also with the most informative images that allow driving the search into more promising regions of the feature space [87,88].

For a given image database, and for different users, the best performances might be provided by different relevance feedback approaches. This behavior can be easily seen if we model the set formed by each query image and the associated positive feedback samples as a "class" of a classification problem. For each "class" of query images, one relevance feedback technique might be better than other RF approaches.

According to this behavior, Yin et al. [89] proposed a combination of multiple relevance feedback strategies. The proposed combination integrates three relevance feedback techniques, namely Query Vector Modification [85], Feature Relevance Estimation [62,63], and Bayesian Inference [90], and dynamically selects the most appropriate technique for a particular query, or even for a particular iteration, by evaluating the retrieval precision of each approach.

In [91], a different approach is followed. The authors proposed to employ the Support Vector Machine ensemble technique to construct a group-based relevance feedback algorithm, by assuming the data as coming from multiple positive classes and one negative class, i.e., the problem was modeled as an (x+1)-class classification problem. An SVM ensemble was also proposed in [92,93] to address the unbalanced learning issue, whereas the authors of [94] suggested to use a set of one-class classifiers based on the Information Bottleneck framework [95].

Apart from the above mentioned papers, there have not been other significant investigations on the potentialities of combining different relevance feedback approaches. The vast majority of papers that propose the use of classifier ensembles for content based image retrieval tasks are based on a single approach for relevance feedback, where different instances are created either by training on different "classes" of images, or on different bags of relevant/non-relevant images, in order to improve the performance of that particular approach.

7. Multimodal retrieval

While more than 20 years have passed since the first proposal of a system that allowed the user to combine the textual information contained in an HTML document along with the attached image, with the information in image metadata (i.e., its width, height, the file size, type, etc.), and with the number of faces in the image [96,97], the paradigm of combining multimodal features has never ceased to arouse interest in researchers [42,98,99]. The roots of this approach lie in the fact that the performance of a content-based image retrieval (CBIR) system is inherently constrained by low-level features, and it cannot give satisfactory retrieval results when users' high-level concepts are not easily expressed by low-level features [100]. Keywords have been used to assist content-based image retrieval tasks according to two main approaches: their use as additional features, or their use to seed a text-based query [101].

The first approach combines keywords with low-level features of the images in order to use a combined input space. Many works followed this path, as in [100], where the authors proposed an algorithm for learning the keyword similarity matrix during user interaction, namely word association via relevance feedback (WARF). They assume that the images in the database have textual annotations in terms of short phrases or keywords that can come from keyword spotting in the surrounding HTML text of Web pages, manual annotation, and so forth. To combine the use of low-level visual features with keywords, they convert the keyword annotations of each image into a vector, where each component is related to the presence or to the probability of a certain keyword in a specific image.

Several other researchers have addressed this problem from different points of view. Sclaroff et al. [102] proposed to combine textual and visual statistics in a single index vector, where textual statistics are captured in vector form using a latent semantic approach based on the text in the containing HTML document, while visual features are captured in vector form using color and orientation histograms. Barnard and Forsyth [103] proposed a method that
organizes image databases using both image features and associated text by integrating the two types of information during model construction. The system learns the relationships between the image features and semantics by modeling the statistics of word and feature occurrence and co-occurrence. In [104], the authors proposed an approach based on associating a fuzzy membership function with the distribution of the features' distances, and assigning a degree of worthiness to each feature based on its relative average performance. The memberships and the feature weights are then aggregated to produce a confidence value that can be used to rank the retrieved images. The basic idea is to assign high membership values to distances that are relatively low, and low membership values to relatively large distances. The membership functions are designed according to the distribution of the distances within each feature for a small set of training images. In particular, the features' membership values and their relevance weights have been combined according to two distinct approaches: the first one is linear and is based on a simple weighted combination, the second one is non-linear and is based on the discrete Choquet integral [105].

Linear combination models have been widely used in multimedia information retrieval for combining textual and visual features, even if obtaining an effective system is not straightforward, due to the difficulty of estimating the weights of the different modalities. In [106] the authors proposed an approach based on Fisher Linear Discriminant Analysis, aimed at learning the weights for multimedia documents composed of text and images. In particular, the authors reformulate the task of learning the combination parameters as a dimensionality reduction problem in a binary classification context, i.e., finding the linear combination that best separates relevant and non-relevant documents.

More recently, alternative ways to combine different modalities have also been proposed. In [54] the authors use the Balanced Iterative Reducing and Clustering (BIRCH) algorithm [107] on textual and visual descriptors to diversify the results obtained by the search. In order to combine textual and visual information, they first build a clustering tree based on textual information, and then refine the resulting tree by replacing the text features with the visual ones. In particular, for each node of the tree, its center and radius are recomputed based on the visual feature vectors instead of the former textual feature vectors.

In [108], a novel scheme for online multi-modal distance metric learning (OMDML) is investigated, which learns distance metrics from multi-modal data, or multiple types of features, via an online learning scheme. The key idea of OMDML is to learn an optimized separate distance metric for each individual modality, and to find an optimal combination of the diverse distance metrics on multiple modalities.

Another kind of approach is based on the use of keywords to seed a query, and then employs both keywords and visual features to conduct query refinement [109,110]. In [111] the authors proposed a framework that performs relevance feedback both on keywords representing the images' semantic contents, through a semantic network, and on low-level feature vectors such as color, texture, and shape. In [101] the authors proposed a multimodal learning approach that uses images' semantic labels to guide a concept-dependent, active-learning process. The system is based on the definition of the complexity of a concept, and then adjusts the sampling strategy from which images are to be selected and labeled, to improve the capability of the concept learning. The idea behind this Concept Dependent Active Learning (CDAL) approach is to address the scarcity problem by using keywords to seed a query. According to this approach, the user can use a keyword to describe the target concept, and the images that are annotated with that keyword are added to the initial pool. If the number of images with matching keywords is small, the system can perform query expansion using a thesaurus to obtain related words that have matching images in the dataset.

Some approaches propose to perform separate visual/textual queries in parallel, and then take the union/intersection of the two retrieved lists, as in [112], where the results are combined using a weighted sum of the scores given by each retrieval system. A linear weighted combination is also used in [113], where a relevance feedback system that refines its results after each iteration using late fusion methods is proposed. It also allows the user to dynamically tune the amount of textual and visual information to be used for retrieving similar images. Other systems perform the two queries sequentially and use one modality to filter the search space for the other modality, as in [114]. In that paper, the authors proposed an asymmetric multimedia fusion strategy that exploits the complementarity of the textual and visual features. The scheme consists of a textual prefiltering step, which reduces the collection for the visual retrieval step.

In [115], the authors presented iLike, an image search engine that integrates both textual and visual features to improve the retrieval performance. The system aims to bridge the semantic gap by capturing the meaning of each text term in the visual feature space, and re-weights visual features according to their significance with respect to the query terms. The system is able to infer the "visual meanings" behind the textual queries and provide a visual thesaurus, which is generated from the statistical similarity between the visual space representations of textual terms.

The paper by Ngiam et al. [59] can be considered one of the first attempts to learn and combine features over multiple modalities exploiting the deep learning paradigm. That paper presents a series of tasks for multimodal learning, and shows how to train deep networks that learn features to address each of the proposed tasks. For the sake of clarity, it is worth noting that the authors did not focus their proposal on the multimodal domain, properly speaking, but rather on the so-called cross-modal domain. In the multimodal fusion setting, data from all modalities is available during feature learning, during system training, and for each test pattern. In the cross-modality setting, data from multiple modalities is available only during feature learning, while, during the training and test phases, only data from a single modality is provided [59]. Accordingly, cross-modal retrieval refers to the search paradigm where information in one modality can be retrieved using the other available modalities, such as searching images using text and vice-versa [116]. In [60], the authors propose a Deep Boltzmann Machine (DBM) [117] model for learning multimodal data representations, where the key idea is to learn a joint density model over the space of multimodal inputs. More recently, Wang et al. [118] proposed a 5-layer neural network to learn a joint model for semantically correlating multiple features from different modalities, where deep learning features are used as the image representation and topic features as the text representation.

8. Discussion

The above sections clearly showed that, for image retrieval tasks, the use of different representations is essential for capturing the multiple concepts that each image can be associated to by different persons. How to model the fusion of multiple representations according to the context in which the retrieval system is expected to be used is a hard task, as the increase in the number of parameters controlling the fusion mechanisms goes along with the availability of large training datasets, and with solid assumptions on their ranges and relationships, to ensure that robust estimations are produced. While nowadays a large quantity of data is available for training purposes, the way in which they are processed to produce personalized results, and avoid biases, still requires a careful design of the learning architecture and algorithm [23].
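Several of the approaches surveyed above ([112,113], and the linear combination models of [106]) boil down to a weighted sum of per-modality scores. The following minimal sketch illustrates this late-fusion step; the function names and the min-max normalization choice are illustrative, not taken from the cited papers:

```python
import numpy as np

def minmax(scores):
    """Rescale a score vector to [0, 1] so that modalities are comparable."""
    s = np.asarray(scores, dtype=float)
    rng = s.max() - s.min()
    return (s - s.min()) / rng if rng > 0 else np.zeros_like(s)

def late_fusion(text_scores, visual_scores, alpha=0.5):
    """Weighted sum of normalized per-modality scores; alpha plays the role
    of the user-tunable textual/visual balance discussed in [113]."""
    return alpha * minmax(text_scores) + (1 - alpha) * minmax(visual_scores)

# Toy example: scores for 4 database images from two independent engines.
text = [0.9, 0.1, 0.4, 0.0]
visual = [10.0, 30.0, 20.0, 0.0]
fused = late_fusion(text, visual, alpha=0.7)
ranking = np.argsort(-fused)  # best-first image indices
```

Because each modality is normalized before mixing, the single parameter `alpha` is the only quantity to be estimated (or exposed to the user), which matches the simplicity argument made throughout this section.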
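The distance-based fuzzy weighting of [104] can be sketched in the same spirit: low distances map to high memberships, and per-feature memberships are aggregated by a weighted sum. The exponential membership function below is a simplified stand-in for the distribution-fitted memberships of the paper, and the non-linear Choquet-integral variant is omitted:

```python
import numpy as np

def memberships(distances):
    """Map distances to [0, 1] confidences: low distance -> high membership.
    A simple exponential decay scaled by the mean distance stands in for
    fitting the actual per-feature distance distribution."""
    d = np.asarray(distances, dtype=float)
    scale = d.mean() or 1.0
    return np.exp(-d / scale)

def fuse(feature_distances, weights):
    """Linear (weighted-sum) aggregation of per-feature memberships."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    mu = np.vstack([memberships(d) for d in feature_distances])
    return w @ mu  # one confidence value per database image

# Two feature spaces (e.g., color and texture): distances of 3 images to a query.
color_d = [0.2, 0.8, 0.5]
texture_d = [10.0, 5.0, 40.0]
conf = fuse([color_d, texture_d], weights=[0.6, 0.4])
best = int(np.argmax(conf))
```

The per-feature weights (here fixed to 0.6/0.4 for illustration) are the quantities that [104] derives from the relative average performance of each feature.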
So, while the deep learning paradigm is now regarded as an effective solution for many classification and similarity retrieval tasks, previous works on information fusion for image retrieval provide a set of guidelines on the design of the learning architecture and the learning function to fully exploit the potentialities of this popular paradigm.

• Early fusion by feature weighting. Feature weighting allows building retrieval systems where the designer has full control of the processing steps, and the reasons behind the output of the system can be traced back to the importance given to different image representations. Moreover, relevance feedback mechanisms can be implemented to modify the weights according to the users' needs. The algorithm to estimate the weights should be as simple as possible, as the choice of the objective function and the related estimation procedure could drive the system to produce biased and unexpected results. On the other hand, for narrow-domain applications, this mechanism could prove to be a winning one, as the weights can be tailored to the domain of the images at hand. If no automatic estimation algorithm is used, then it could be difficult for casual users to understand the effect of the weight values on the final results.
• Late fusion by multi-feature spaces. From a conceptual point of view, this is one of the most promising approaches, as it avoids dealing directly with the fusion of different image representations: fusion is performed at a later stage, where the outputs of different systems can be regarded as new features to be combined. This approach is actually taken as a paradigm to implement multi-modal and cross-modal retrieval systems employing deep learning architectures [116,118], as different independent retrieval systems are seen as feature transformation functions, producing a new feature space for a second-layer retrieval function. Again, the main research issues in the implementation of such an approach are the training of the different systems, and how changes in one level propagate to the next levels and to the output.
• Fusing different relevance feedback approaches. The exploitation of relevance feedback information can be carried out according to different approaches, in terms of the assumptions on the underlying distribution of relevant images. As the different relevance feedback approaches proposed in the literature reflect the richness in the way the similarity between images can be modeled, their fusion might provide the system with an additional level of flexibility in capturing the user's needs. Again, the estimation of the parameters of the fusion mechanism should be kept as simple as possible: the amount of information produced during the feedback iterations is quite limited and, consequently, the number of parameters to be estimated should be small.

9. Conclusion

The cheap availability of cameras embedded in portable devices, and the availability of almost unlimited storage space, unleashed the natural tendency of people to communicate through images, for the richness of their semantic content and the immediacy of the message conveyed. This vast amount of visual information can be searched through textual queries related to the geolocation, timestamp, tags and labels provided by the users, with the shortcoming that these textual descriptors capture the semantics of images only partly, due to the richness of the semantic content of an image compared to the subjectivity of image tagging and labeling. To this end, the query-by-content paradigm allows searching for images beyond the purpose of their first use, by leveraging the extraction of multiple descriptors to allow associating each image to different semantic concepts. This paper aimed at providing an overview of the techniques that can be used to fuse different descriptors, both in terms of the components of the processing pipeline in which the fusion takes place, and in terms of the techniques that can be used to estimate the parameters of the query mechanism to adapt to the needs and goals of the target application, and to the interests of the users involved. While the past 20+ years of research in the field allowed a number of milestones to be reached, so that image classification and retrieval functions are now available in consumer products, the steep increase in the number of images stored, and the consequent requests for more advanced functionalities by different categories of users, are making the old challenges even harder:

• Labeled data is needed in order to design the system and estimate the parameters. However, the reliability of the labeling process clearly affects the quality of the performance of the system. For this reason, benchmarks that help researchers develop new approaches and evaluate the related performances are, now more than ever, crucial. In this line of reasoning, it is worth noting that evaluation campaigns such as ImageCLEF (http://imageclef.org/2016) and ImageNET (http://image-net.org/challenges/LSVRC/2016/), since 2003 and 2010, respectively, created a number of publicly-accessible evaluation resources. Still, the creation of public datasets suited to test the performances of CBIR systems in real settings remains one of the main issues in this field. How can labeling be improved, and how can unlabeled data, as well as partially labeled data, be exploited to incrementally improve the system performances?
• How can the implicit feedback provided by the user be exploited when browsing the private archive, or searching through the media content shared within his/her social network?
• How can easy-to-use interfaces be designed that allow users to interact with the system in an intuitive way, so that labels and feedback are provided in a non-ambiguous way?
• How can fusion approaches be designed that are tailored to vertical applications or scenarios, e.g., the media industry, fashion, design, forensics, etc.?
• More in general, principled approaches providing design guidelines are still needed, as the vast majority of papers support the proposal of new techniques and algorithms through experimental evaluation and trial-and-error procedures.

The availability of more computing power, especially through the 'cloud computing' paradigm, the large popularity of deep learning approaches, as well as the interest of a variety of actors, both from the research community and from an increasing number of companies, will allow addressing the above issues with novel approaches that will leverage cooperation and knowledge sharing.

Acknowledgement

This work has been supported by the Regional Administration of Sardinia (RAS), Italy, within the project BS2R - Beyond Social Semantic Recommendation (POR FESR 2007/2013 - PIA 2013).

References

[1] V. Cisco, The Zettabyte Era: Trends and Analysis, Whitepaper, 2015.
[2] C. Kofler, M. Larson, A. Hanjalic, User intent in multimedia search: a survey of the state of the art and future challenges, ACM Comput. Surv. 49 (2) (2016) 36:1–36:37. http://doi.acm.org/10.1145/2954930.
[3] A.W.M. Smeulders, M. Worring, S. Santini, A. Gupta, R. Jain, Content-based image retrieval at the end of the early years, IEEE Trans. Pattern Anal. Mach. Intell. 22 (12) (2000) 1349–1380.
[4] M.S. Lew, N. Sebe, C. Djeraba, R. Jain, Content-based multimedia information retrieval: state of the art and challenges, ACM Trans. Multimed. Comput. Commun. Appl. 2 (1) (2006) 1–19.
[5] R. Datta, D. Joshi, J. Li, J.Z. Wang, Image retrieval: ideas, influences, and trends of the new age, ACM Comput. Surv. 40 (2) (2008) 1–60.
[6] B. Thomee, A Picture is Worth a Thousand Words: Content-based Image Retrieval Techniques, Ph.D. thesis, Leiden University, The Netherlands, 2010.
[7] T. Kato, Database architecture for content-based image retrieval, in: Image Storage and Retrieval Systems (SPIE), vol. 1662, SPIE, 1992, pp. 112–123.
[8] M.L. Kherfi, D. Ziou, Relevance feedback for CBIR: a new approach based on probabilistic feature weighting with positive and negative examples, IEEE Trans. Image Process. 15 (4) (2006) 1017–1030.
[9] T. Pavlidis, Limitations of Content-based Image Retrieval, Technical report, Stony Brook University, 2008.
[10] Y. Rui, T.S. Huang, Relevance feedback techniques in image retrieval, in: M. Lew (Ed.), Principles of Visual Information Retrieval, Springer-Verlag, London, 2001, pp. 219–258.
[11] X.S. Zhou, T.S. Huang, Relevance feedback in image retrieval: a comprehensive review, Multimed. Syst. 8 (6) (2003) 536–544.
[12] S. Lu, T. Mei, J. Wang, J. Zhang, Z. Wang, S. Li, Browse-to-search: interactive exploratory search with visual entities, ACM Trans. Inf. Syst. 32 (4) (2014) 18:1–18:27.
[13] R. Tronci, L. Piras, G. Giacinto, Performance evaluation of relevance feedback for image retrieval by "real-world" multi-tagged image datasets, Int. J. Multimed. Data Eng. Manag. 3 (1) (2012) 1–16.
[14] A. Gilbert, L. Piras, J. Wang, F. Yan, E. Dellandréa, R.J. Gaizauskas, M. Villegas, K. Mikolajczyk, Overview of the ImageCLEF 2015 scalable image annotation, localization and sentence generation task, in: L. Cappellato, N. Ferro, G.J.F. Jones, E. SanJuan (Eds.), Proceedings of Conference and Labs of the Evaluation Forum, CLEF, Toulouse, France, September 8–11, 2015, Vol. 1391 of CEUR Workshop Proceedings, CEUR-WS.org, 2015. http://ceur-ws.org/Vol-1391/inv-pap6-CR.pdf.
[15] A. Torralba, R. Fergus, W.T. Freeman, 80 million tiny images: a large data set for nonparametric object and scene recognition, IEEE Trans. Pattern Anal. Mach. Intell. 30 (11) (2008) 1958–1970. http://dx.doi.org/10.1109/TPAMI.2008.128.
[16] J. Weston, S. Bengio, N. Usunier, Large scale image annotation: learning to rank with joint word-image embeddings, Mach. Learn. 81 (1) (2010) 21–35. http://dx.doi.org/10.1007/s10994-010-5198-3.
[17] X. Wang, L. Zhang, M. Liu, Y. Li, W. Ma, ARISTA – image search to annotation on billions of web photos, in: Proceedings of the Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, San Francisco, CA, USA, 2010, pp. 2987–2994. http://dx.doi.org/10.1109/CVPR.2010.5540046.
[18] H. Müller, P.D. Clough, T. Deselaers, B. Caputo (Eds.), ImageCLEF, Experimental Evaluation in Visual Information Retrieval, Springer, 2010. http://dx.doi.org/10.1007/978-3-642-15181-1.
[19] T. Li, T. Mei, S. Yan, I.-S. Kweon, C. Lee, Contextual decomposition of multi-label images, in: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2009, pp. 2270–2277.
[20] M. Bertini, A.D. Bimbo, G. Serra, C. Torniai, R. Cucchiara, C. Grana, R. Vezzani, Dynamic pictorially enriched ontologies for digital video libraries, IEEE Multimed. 16 (2) (2009) 42–51. http://dx.doi.org/10.1109/MMUL.2009.25.
[21] S.A. Chatzichristofis, Y.S. Boutalis, CEDD: color and edge directivity descriptor: a compact descriptor for image indexing and retrieval, in: A. Gasteratos, M. Vincze, J.K. Tsotsos (Eds.), Proceedings of International Conference on Computer Vision Systems, ICVS, 5008, Springer, 2008, pp. 312–322.
[22] J. Sivic, A. Zisserman, Efficient visual search for objects in videos, Proc. IEEE 96 (4) (2008) 548–566.
[23] N. Cristianini, A different way of thinking, New Sci. 232 (3101) (2016) 39–43. http://www.sciencedirect.com/science/article/pii/S026240791632190X.
[24] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: P.L. Bartlett, F.C.N. Pereira, C.J.C. Burges, L. Bottou, K.Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012, December 3–6, 2012, Lake Tahoe, Nevada, United States, 2012, pp. 1106–1114. http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.
[25] Y.L. Cun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, L.D. Jackel, Handwritten digit recognition with a back-propagation network, in: Proceedings of the 2nd International Conference on Neural Information Processing Systems, NIPS'89, MIT Press, Cambridge, MA, USA, 1989, pp. 396–404. http://dl.acm.org/citation.cfm?id=2969830.2969879.
[26] Y. Bengio, A. Courville, P. Vincent, Representation learning: a review and new perspectives, IEEE Trans. Pattern Anal. Mach. Intell. 35 (8) (2013) 1798–1828. http://dx.doi.org/10.1109/TPAMI.2013.50.
[27] L. Deng, A tutorial survey of architectures, algorithms, and applications for deep learning, APSIPA Trans. Signal Inf. Process. 3 (2014). https://www.cambridge.org/core/services/aop-cambridge-core/content/view/023B6ADF962FA37F8EC684B209E3DFAE/S2048770313000097a.pdf/div-class-title-a-tutorial-survey-of-architectures-algorithms-and-applications-for-deep-learning-div.pdf.
[28] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, T. Darrell, DeCAF: a deep convolutional activation feature for generic visual recognition, in: Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21–26 June 2014, Vol. 32 of JMLR Workshop and Conference Proceedings, JMLR.org, 2014, pp. 647–655. http://jmlr.org/proceedings/papers/v32/donahue14.html.
[29] M.D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, Fleet et al. 2014, pp. 818–833. http://dx.doi.org/10.1007/978-3-319-10590-1_53.
[30] M. Oquab, L. Bottou, I. Laptev, J. Sivic, Learning and transferring mid-level image representations using convolutional neural networks, in: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR '14, IEEE Computer Society, Washington, DC, USA, 2014, pp. 1717–1724. http://dx.doi.org/10.1109/CVPR.2014.222.
[31] A.S. Razavian, H. Azizpour, J. Sullivan, S. Carlsson, CNN features off-the-shelf: an astounding baseline for recognition, in: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPRW '14, IEEE Computer Society, Washington, DC, USA, 2014, pp. 512–519. http://dx.doi.org/10.1109/CVPRW.2014.131.
[32] A. Babenko, A. Slesarev, A. Chigorin, V.S. Lempitsky, Neural codes for image retrieval, Fleet et al. 2014, pp. 584–599. http://dx.doi.org/10.1007/978-3-319-10590-1_38.
[33] F. Perronnin, J. Sánchez, T. Mensink, Improving the Fisher kernel for large-scale image classification, in: Proceedings of the 11th European Conference on Computer Vision: Part IV, ECCV'10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 143–156. http://dl.acm.org/citation.cfm?id=1888089.1888101.
[34] E. Valveny, Leveraging category-level labels for instance-level image retrieval, in: Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), CVPR '12, IEEE Computer Society, Washington, DC, USA, 2012, pp. 3045–3052. http://dl.acm.org/citation.cfm?id=2354409.2354782.
[35] H. Jégou, A. Zisserman, Triangulation embedding and democratic aggregation for image search, in: Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23–28, 2014, IEEE Computer Society, 2014, pp. 3310–3317. http://dx.doi.org/10.1109/CVPR.2014.417.
[36] H. Chen, A socio-technical perspective of museum practitioners' image-using behaviors, Electron. Libr. 25 (1) (2007) 18–35. http://dx.doi.org/10.1108/02640470710729092.
[37] B. Girod, V. Chandrasekhar, D.M. Chen, N.M. Cheung, R. Grzeszczuk, Y. Reznik, G. Takacs, S.S. Tsai, R. Vedantham, Mobile visual search, IEEE Signal Process. Mag. 28 (4) (2011) 61–76.
[38] O. Marques, Visual information retrieval: the state of the art, IT Prof. 18 (4) (2016) 7–9.
[39] J. Xiao, J. Hays, K.A. Ehinger, A. Oliva, A. Torralba, SUN database: large-scale scene recognition from abbey to zoo, in: Proceedings of the Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, San Francisco, CA, USA, 13–18, 2010, pp. 3485–3492. http://dx.doi.org/10.1109/CVPR.2010.5539970.
[40] N. Bhowmik, V.R. González, V. Gouet-Brunet, H. Pedrini, G. Bloch, Efficient fusion of multidimensional descriptors for image retrieval, in: Proceedings of the 2014 IEEE International Conference on Image Processing, ICIP, 2014, pp. 5766–5770.
[41] C. Snoek, M. Worring, A.W.M. Smeulders, Early versus late fusion in semantic video analysis, in: H. Zhang, T. Chua, R. Steinmetz, M.S. Kankanhalli, L. Wilcox (Eds.), Proceedings of the 13th ACM International Conference on Multimedia, Singapore, November 6–11, 2005, ACM, 2005, pp. 399–402. http://doi.acm.org/10.1145/1101149.1101236.
[42] P.K. Atrey, M.A. Hossain, A. El-Saddik, M.S. Kankanhalli, Multimodal fusion for multimedia analysis: a survey, Multimed. Syst. 16 (6) (2010) 345–379. http://dx.doi.org/10.1007/s00530-010-0182-0.
[43] J. Yu, Z. Qin, T. Wan, X. Zhang, Feature integration analysis of bag-of-features model for image retrieval, Neurocomputing 120 (2013) 355–364. Image Feature Detection and Description. http://www.sciencedirect.com/science/article/pii/S0925231213003020.
[44] D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. 60 (2) (2004) 91–110.
[45] T. Ojala, M. Pietikäinen, T. Mäenpää, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, IEEE Trans. Pattern Anal. Mach. Intell. 24 (7) (2002) 971–987. http://dx.doi.org/10.1109/TPAMI.2002.1017623.
[46] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), 20–26 June 2005, IEEE Computer Society, San Diego, CA, USA, 2005, pp. 886–893. http://dx.doi.org/10.1109/CVPR.2005.177.
[47] P.A.S. Kimura, J.M.B. Cavalcanti, P.C. Saraiva, R.d.S. Torres, M.A. Gonçalves, Evaluating retrieval effectiveness of descriptors for searching in large image databases, J. Inf. Data Manag. 2 (3) (2011) 305–320. http://seer.lcc.ufmg.br/index.php/jidm/article/view/161.
[48] J. Yue, Z. Li, L. Liu, Z. Fu, Content-based image retrieval using color and texture fused features, Math. Comput. Model. 54 (3–4) (2011) 1121–1127. Mathematical and Computer Modeling in Agriculture (CCTA 2010). http://www.sciencedirect.com/science/article/pii/S0895717710005352.
[49] H.J. Escalante, C.A. Hernández, L.E. Sucar, M. Montes, Late fusion of heterogeneous methods for multimedia image retrieval, in: Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, MIR '08, ACM, New York, NY, USA, 2008, pp. 172–179. http://doi.acm.org/10.1145/1460096.1460125.
[50] R.d.S. Torres, A.X. Falcão, M.A. Gonçalves, J.P. Papa, B. Zhang, W. Fan, E.A. Fox, A genetic programming framework for content-based image retrieval, Pattern Recognit. 42 (2) (2009) 283–292. Learning Semantics from Multimedia Content. http://www.sciencedirect.com/science/article/pii/S0031320308001623.
[51] W. Zhang, Z. Qin, T. Wan, Image scene categorization using multi-bag-of-features, in: Proceedings of the 2011 International Conference on Machine Learning and Cybernetics (ICMLC), 4, 2011, pp. 1804–1808.
[52] L. Piras, R. Tronci, G. Giacinto, Diversity in ensembles of codebooks for visual concept detection, in: A. Petrosino (Ed.), Proceedings of International Conference on Image Analysis and Processing, ICIAP (2), 8157, Springer, 2013, pp. 399–408.
[53] D. Picard, N. Thome, M. Cord, An efficient system for combining complementary kernels in complex visual categorization tasks, in: Proceedings of the 2010 IEEE International Conference on Image Processing, 2010, pp. 3877–3880.
[54] D. Dang-Nguyen, L. Piras, G. Giacinto, G. Boato, F.G.B.D. Natale, A hybrid approach for retrieving diverse social images of landmarks, in: Proceedings of the 2015 IEEE International Conference on Multimedia and Expo, ICME 2015, Turin, Italy, June 29 – July 3, 2015, IEEE, 2015, pp. 1–6. http://dx.doi.org/10.1109/ICME.2015.7177486.
[55] Y. Cao, H. Zhang, Y. Gao, X. Xu, J. Guo, Matching image with multiple local features, in: Proceedings of the 2010 20th International Conference on Pattern Recognition, ICPR, 2010, pp. 519–522.
[56] L. Piras, G. Giacinto, Dissimilarity representation in multi-feature spaces for image retrieval, in: G. Maino, G.L. Foresti (Eds.), Proceedings of 16th International Conference on Image Analysis and Processing, ICIAP 2011, Ravenna, Italy, September 14–16, 2011, Part I, 6978, Springer, 2011, pp. 139–148. http://dx.doi.org/10.1007/978-3-642-24085-0_15.
[57] D.A. Lisin, M.A. Mattar, M.B. Blaschko, E.G. Learned-Miller, M.C. Benfield, Combining local and global image features for object class recognition, in: Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR'05, 2005, pp. 47–47.
[58] V. Risojević, Z. Babić, Fusion of global and local descriptors for remote sensing image classification, IEEE Geosci. Remote Sens. Lett. 10 (4) (2013) 836–840.
[59] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, A.Y. Ng, Multimodal deep learning, in: L. Getoor, T. Scheffer (Eds.), Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 – July 2, 2011, Omnipress, 2011, pp. 689–696.
[60] N. Srivastava, R. Salakhutdinov, Multimodal learning with deep Boltzmann machines, J. Mach. Learn. Res. 15 (1) (2014) 2949–2980.
[61] J. Wan, D. Wang, S.C.H. Hoi, P. Wu, J. Zhu, Y. Zhang, J. Li, Deep learning for content-based image retrieval: a comprehensive study, in: Proceedings of the 22nd ACM International Conference on Multimedia, MM '14, ACM, New York, NY, USA, 2014, pp. 157–166.
[78] E. Bruno, N. Moënne-Loccoz, S. Marchand-Maillet, Learning user queries in multimodal dissimilarity spaces, in: M. Detyniecki, J.M. Jose, A. Nürnberger, C.J. van Rijsbergen (Eds.), Adaptive Multimedia Retrieval, 3877, Springer, 2005, pp. 168–179.
[79] G. Giacinto, A nearest-neighbor approach to relevance feedback in content based image retrieval, in: Proceedings of the 6th ACM International Conference on Image and Video Retrieval, CIVR '07, ACM, New York, NY, USA, 2007, pp. 456–463.
[80] P. Wu, S.C. Hoi, H. Xia, P. Zhao, D. Wang, C. Miao, Online multimodal deep similarity learning with application to image retrieval, in: Proceedings of the 21st ACM International Conference on Multimedia, MM '13, ACM, New York, NY, USA, 2013, pp. 153–162. http://doi.acm.org/10.1145/2502081.2502112.
[81] B. Thomee, M.S. Lew, Interactive search in image retrieval: a survey, Int. J. Multimed. Inf. Retr. 1 (1) (2012) 71–86.
[82] Y. Rui, T.S. Huang, S. Mehrotra, Relevance feedback: a power tool in interactive content-based image retrieval, IEEE Trans. Circuits Syst. Video Technol. 8 (5) (1998) 644–655.
[83] S. Liang, Z. Sun, Sketch retrieval and relevance feedback with biased SVM classification, Pattern Recognit. Lett. 29 (12) (2008) 1733–1741. http://www.sciencedirect.com/science/article/pii/S0167865508001621.
[84] Y. Chen, X.S. Zhou, T. Huang, One-class SVM for learning in image retrieval, in: Proceedings of International Conference on Image Processing, ICIP, 1, 2001, pp. 34–37.
[85] J.J. Rocchio, Relevance feedback in information retrieval, in: G. Salton (Ed.), The SMART Retrieval System – Experiments in Automatic Document Processing, Prentice Hall, Englewood Cliffs, New Jersey, 1971, pp. 313–323.
[86] D.A. Cohn, L.E. Atlas, R.E. Ladner, Improving generalization with active learning, Mach. Learn. 15 (2) (1994) 201–221.
[87] S.C.H. Hoi, R. Jin, J. Zhu, M.R. Lyu, Semisupervised SVM batch mode active learning with applications to image retrieval, ACM Trans. Inf. Syst. 27 (3) (2009) 16:1–16:29.
[88] S. Tong, E.Y. Chang, Support vector machine active learning for image retrieval, in: Proceedings of ACM Multimedia, 2001, pp. 107–118.
[89] P.-Y. Yin, B. Bhanu, K.-C. Chang, A. Dong, Integrating relevance feedback techniques for image retrieval using reinforcement learning, IEEE Trans. Pattern Anal. Mach. Intell. 27 (10) (2005) 1536–1551.
[90] I.J. Cox, M.L. Miller, T.P. Minka, T.V. Papathomas, P.N. Yianilos, The Bayesian image retrieval system, PicHunter: theory, implementation, and psychophysical experiments, IEEE Trans. Image Process. 9 (1) (2000) 20–37.
[91] C.-H. Hoi, M.R. Lyu, Group-based relevance feedback with support vector
[62] Y. Rui, T.S. Huang, S. Mehrotra, Content-based image retrieval with relevance machine ensembles, in: Proceedings of International Conference on Pattern
feedback in MARS, in: Proceedings of International Conference on Image Pro- Recognition, ICPR, 3, 2004, pp. 874–877.
cessing Proceedings, 1997, pp. 815–818. [92] D. Tao, X. Tang, X. Li, X. Wu, Asymmetric bagging and random subspace for
[63] J. Peng, B. Bhanu, S. Qing, Probabilistic feature relevance learning for content- support vector machines-based relevance feedback in image retrieval, IEEE
based image retrieval, Comput. Vis. Image Underst. 75 (1/2) (1999) 150–164. Trans. Pattern Anal. Mach. Intell. 28 (2006) 1088–1099.
[64] Y. Wu, A. Zhang, Interactive pattern analysis for relevance feedback in multi- [93] Y. Rao, P. Mundur, Y. Yesha, Fuzzy SVM ensembles for relevance feedback
media information retrieval, Multimed. Syst. 10 (1) (2004) 41–55. in image retrieval, in: H. Sundaram, M.R. Naphade, J.R. Smith, Y. Rui (Eds.),
[65] E. Guldogan, M. Gabbouj, Feature selection for content-based image retrieval, Proceedings of International Conference on Image and Video Retrieval, CIVR,
Signal Image Video Process. 2 (3) (2008) 241–250. 4071, Springer, 2006, pp. 350–359.
[66] L. Piras, G. Giacinto, Neighborhood-based feature weighting for relevance [94] Y. Tu, G. Li, H. Dai, Integrating local one-class classifiers for image retrieval,
feedback in content-based retrieval, in: Proceedings of Workshop on Image in: X. Li, O.R. Zaïane, Z. Li (Eds.), Proceedings of Advanced Data Mining and
Analysis for Multimedia Interactive Services, WIAMIS, IEEE Computer Society, Applications, ADMA, 4093, Springer, 2006, pp. 213–222.
2009, pp. 238–241. [95] K. Crammer, G. Chechik, A needle in a haystack: local one-class optimiza-
[67] Y. Ishikawa, R. Subramanya, C. Faloutsos, Mindreader: Querying databases tion, in: C.E. Brodley (Ed.), Proceedings of International Conference on Ma-
through multiple examples, in: Proceedings of the 24th Very Large Data Bases chine Learning, ICML, 69, ACM, 2004.
Conference, 1998, pp. 433–438. [96] R.K. Srihari, Automatic indexing and content-based retrieval of captioned im-
[68] Y. Rui, T. Huang, Optimizing learning in image retrieval, in: Proceedings of ages, Computer 28 (9) (1995) 49–56.
Computer Vision and Pattern Recognition, 20 0 0, vol. 1, 20 0 0, pp. 236–243. [97] C. Frankel, M.J. Swain, V. Athitsos, Webseer: An Image Search Engine for the
[69] M. Tzelepi, A. Tefas, Relevance feedback in deep convolutional neural net- World Wide Web, Technical report, Chicago, IL, USA, 1996.
works for content based image retrieval, in: Proceedings of the 9th Hellenic [98] A. Depeursinge, H. Müller, Fusion Techniques for Combining Textual and Vi-
Conference on Artificial Intelligence, SETN ’16, ACM, New York, NY, USA, 2016, sual Information Retrieval, Springer, Berlin, Heidelberg, 2010. 95–114. http:
pp. 27:1–27:7. //dx.doi.org/10.1007/978- 3- 642- 15181- 1_6.
[70] L.I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley, [99] D. Rafailidis, S. Manolopoulou, P. Daras, A unified framework for multi-
2004. modal retrieval, Pattern Recognit. 46 (12) (2013) 3358–3370. http://www.
[71] K. Tieu, P.A. Viola, Boosting Image Retrieval, in: Proceedings of Conference sciencedirect.com/science/article/pii/S0031320313002471.
on Computer Vision and Pattern Recognition, CVPR, IEEE Computer Society, [100] X.S. Zhou, T.S. Huang, Unifying keywords and visual contents in image
20 0 0, pp. 1228–1235. retrieval, IEEE Multimed. 9 (2) (2002) 23–33. http://dx.doi.org/10.1109/93.
[72] J. Laaksonen, M. Koskela, E. Oja, Picsom-self-organizing image retrieval with 998050.
mpeg-7 content descriptors, IEEE Trans. Neural Netw. 13 (4) (2002) 841–853. [101] K.-S. Goh, E.Y. Chang, W.-C. Lai, Multimodal concept-dependent active learn-
[73] M. Arevalillo-Herráez, J. Domingo, F.J. Ferri, Combining similarity measures ing for image retrieval, in: Proceedings of the 12th Annual ACM International
in content-based image retrieval, Pattern Recognit. Lett. 29 (16) (2008) Conference on Multimedia, Multimedia ’04, ACM, New York, NY, USA, 2004,
2174–2181. pp. 564–571. http://doi.acm.org/10.1145/1027527.1027664.
[74] H. Müller, W. Müller, D. Squire, S. Marchand-Maillet, T. Pun, Performance [102] S. Sclaroff, M.L. Cascia, S. Sethi, Unifying textual and visual cues for content-
evaluation in content-based image retrieval: overview and proposals, Pattern based image retrieval on the world wide web, Comput. Vis. Image Underst.
Recognit. Lett. 22 (5) (2001) 593–601. 75 (1–2) (1999) 86–98. http://dx.doi.org/10.1006/cviu.1999.0765.
[75] L. Zhang, F. Lin, B. Zhang, Support vector machine learning for image retrieval, [103] K. Barnard, D. Forsyth, Learning the semantics of words and pictures, in: Pro-
in: Proceedings of International Conference on Image Processing, ICIP, 2, 2001, ceedings. Eighth IEEE International Conference on Computer Vision, ICCV, 2,
pp. 721–724. 2001, pp. 408–415. http://dx.doi.org/10.1109/ICCV.2001.937654.
[76] G. Giacinto, F. Roli, Dissimilarity representation of images for relevance feed- [104] H. Frigui, J. Caudill, A.C.B. Abdallah, Fusion of multi-modal features for ef-
back in content-based image retrieval, in: P. Perner, A. Rosenfeld (Eds.), Pro- ficient content-based image retrieval, in: Proceedings of IEEE International
ceedings of Machine Learning and Data Mining, MLDM, 2734, Springer, 2003, Conference on Fuzzy Systems (IEEE World Congress on Computational Intel-
pp. 202–214. ligence), FUZZ-IEEE 2008, 2008, pp. 1992–1998.
[77] G.P. Nguyen, M. Worring, A.W.M. Smeulders, Similarity learning via dissim- [105] H. Tahani, J.M. Keller, Information fusion in computer vision using the fuzzy
ilarity space in cbir, in: J.Z. Wang, N. Boujemaa, Y. Chen (Eds.), Multimedia integral, IEEE Trans. Syst. Man Cybern. 20 (3) (1990) 733–741.
Information Retrieval, ACM, 2006, pp. 107–116.
60 L. Piras, G. Giacinto / Information Fusion 37 (2017) 50–60