
Models of Semantic Representation with Visual Attributes

Carina Silberer1 , Vittorio Ferrari2 , Mirella Lapata1


1 Institute for Language, Cognition and Computation, 2 Institute of Perception, Action and Behaviour

School of Informatics, University of Edinburgh, 10 Crichton Street, Edinburgh EH8 9AB


[email protected], [email protected], [email protected]

Abstract

We consider the problem of grounding the meaning of words in the physical world and focus on the visual modality which we represent by visual attributes. We create a new large-scale taxonomy of visual attributes covering more than 500 concepts and their corresponding 688K images. We use this dataset to train attribute classifiers and integrate their predictions with text-based distributional models of word meaning. We show that these bimodal models give a better fit to human word association data compared to amodal models and word representations based on hand-crafted norming data.

1 Introduction

Recent years have seen increased interest in grounded language acquisition, where the goal is to extract representations of the meaning of natural language tied to the physical world. The language grounding problem has assumed several guises in the literature such as semantic parsing (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Kate and Mooney, 2007; Lu et al., 2008; Börschinger et al., 2011), mapping natural language instructions to executable actions (Branavan et al., 2009; Tellex et al., 2011), associating simplified language to perceptual data such as images or video (Siskind, 2001; Roy and Pentland, 2002; Gorniak and Roy, 2004; Yu and Ballard, 2007), and learning the meaning of words based on linguistic and perceptual input (Bruni et al., 2012b; Feng and Lapata, 2010; Johns and Jones, 2012; Andrews et al., 2009; Silberer and Lapata, 2012).

In this paper we are concerned with the latter task, namely constructing perceptually grounded distributional models. The motivation for models that do not learn exclusively from text is twofold. From a cognitive perspective, there is mounting experimental evidence suggesting that our interaction with the physical world plays an important role in the way we process language (Barsalou, 2008; Bornstein et al., 2004; Landau et al., 1998). From an engineering perspective, the ability to learn representations for multimodal data has many practical applications including image retrieval (Datta et al., 2008) and annotation (Chai and Hung, 2008), text illustration (Joshi et al., 2006), object and scene recognition (Lowe, 1999; Oliva and Torralba, 2007; Fei-Fei and Perona, 2005), and robot navigation (Tellex et al., 2011).

One strand of research uses feature norms as a stand-in for sensorimotor experience (Johns and Jones, 2012; Andrews et al., 2009; Steyvers, 2010; Silberer and Lapata, 2012). Feature norms are obtained by asking native speakers to write down attributes they consider important in describing the meaning of a word. The attributes represent perceived physical and functional properties associated with the referents of words. For example, apples are typically green or red, round, shiny, smooth, crunchy, tasty, and so on; dogs have four legs and bark, whereas chairs are used for sitting. Feature norms are instrumental in revealing which dimensions of meaning are psychologically salient; however, their use as a proxy for people's perceptual representations can itself be problematic (Sloman and Ripps, 1998; Zeigenfuse and Lee, 2010). The number and types of attributes generated can vary substantially as a function of the amount of time devoted to each concept. It is not entirely clear how people generate attributes and whether all of these are important for representing concepts. Finally, multiple participants are required to create a representation for each concept, which limits elicitation studies to a small number of concepts and the scope of any computational model based on feature norms.

Another strand of research focuses exclusively on the visual modality, even though the grounding problem could involve auditory, motor, and haptic modalities as well. This is not entirely surprising. Visual input represents a major source of data from which humans can learn semantic representations of linguistic and non-linguistic communicative actions (Regier, 1996). Furthermore, since images are ubiquitous, visual data can be gathered far more easily than some of the other modalities. Distributional models that integrate the visual modality have been learned from texts and images (Feng and Lapata, 2010; Bruni et al., 2012b) or from ImageNet (Deng et al., 2009), e.g., by exploiting the fact that images in this database are hierarchically organized according to WordNet synsets (Leong and Mihalcea, 2011). Images are typically represented on the basis of low-level features such as SIFT (Lowe, 2004), whereas texts are treated as bags of words.

Our work also focuses on images as a way of physically grounding the meaning of words. We, however, represent them by high-level visual attributes instead of low-level image features. Attributes are not concept or category specific (e.g., animals have stripes and so do clothing items; balls are round, and so are oranges and coins), and thus allow us to express similarities and differences across concepts more easily. Furthermore, attributes allow us to generalize to unseen objects; it is possible to say something about them even though we cannot identify them (e.g., it has a beak and a long tail). We show that this attribute-centric approach to representing images is beneficial for distributional models of lexical meaning. Our attributes are similar to those provided by participants in norming studies; importantly, however, they are learned from training data (a database of images and their visual attributes) and thus generalize to new images without additional human involvement.

In the following we describe our efforts to create a new large-scale dataset that consists of 688K images that match the same concrete concepts used in the feature norming study of McRae et al. (2005). We derive a taxonomy of 412 visual attributes and explain how we learn attribute classifiers following recent work in computer vision (Lampert et al., 2009; Farhadi et al., 2009). Next, we show that this attribute-based image representation can be usefully integrated with textual data to create distributional models that give a better fit to human word association data over models that rely on human generated feature norms.

2 Related Work

Grounding semantic representations with visual information is an instance of multimodal learning. In this setting the data consists of multiple input modalities with different representations and the learner's objective is to extract a unified representation that fuses the modalities together. The literature describes several successful approaches to multimodal learning using different variants of deep networks (Ngiam et al., 2011; Srivastava and Salakhutdinov, 2012) and data sources including text, images, audio, and video.

Special-purpose models that address the fusion of distributional meaning with visual information have also been proposed. Feng and Lapata (2010) represent documents and images by a common multimodal vocabulary consisting of textual words and visual terms which they obtain by quantizing SIFT descriptors (Lowe, 2004). Their model is essentially Latent Dirichlet Allocation (LDA; Blei et al., 2003) trained on a corpus of multimodal documents (i.e., BBC news articles and their associated images). Meaning in this model is represented as a vector whose components correspond to word-topic distributions. A related model has been proposed by Bruni et al. (2012b) who obtain distinct representations for the textual and visual modalities. Specifically, they extract a visual space from images contained in the ESP-Game data set (von Ahn and Dabbish, 2004) and a text-based semantic space from a large corpus collection totaling approximately two billion words. They concatenate the two modalities and subsequently project them to a lower-dimensionality space using Singular Value Decomposition (Golub et al., 1981).

Traditionally, computer vision algorithms describe visual phenomena (e.g., objects, scenes, faces, actions) by giving each instance a categorical label (e.g., cat, beer garden, Brad Pitt, drinking). The ability to describe images by their attributes allows generalization to new instances for which there are no training examples available. Moreover, attributes can transcend category and task boundaries and thus provide a generic description of visual data.
Initial work (Ferrari and Zisserman, 2007) focused on simple color and texture attributes (e.g., blue, stripes) and showed that these can be learned in a weakly supervised setting from images returned by a search engine when using the attribute as a query. Farhadi et al. (2009) were among the first to use visual attributes in an object recognition task. Using an inventory of 64 attribute labels, they developed a dataset of approximately 12,000 instances representing 20 objects from the PASCAL Visual Object Classes Challenge 2008 (Everingham et al., 2008). Visual semantic attributes (e.g., hairy, four-legged) were used to identify familiar objects and to describe unfamiliar objects when new images and bounding box annotations were provided. Lampert et al. (2009) showed that attribute-based representations can be used to classify objects when there are no training examples of the target classes available. Their dataset contained over 30,000 images representing 50 animal concepts and used 85 attributes from the norming study of Osherson et al. (1991). Attribute-based representations have also been applied to the tasks of face detection (Kumar et al., 2009), action identification (Liu et al., 2011), and scene recognition (Patterson and Hays, 2012).

The use of visual attributes in models of distributional semantics is novel to our knowledge. We argue that they are advantageous for two reasons. Firstly, they are cognitively plausible; humans employ visual attributes when describing the properties of concept classes. Secondly, they occupy the middle ground between non-linguistic low-level image features and linguistic words. Attributes crucially represent image properties; however, by being words themselves, they can be easily integrated in any text-based distributional model, thus eschewing known difficulties with rendering images into word-like units.

A key prerequisite in describing images by their attributes is the availability of training data for learning attribute classifiers. Although our database shares many features with previous work (Lampert et al., 2009; Farhadi et al., 2009) it differs in focus and scope. Since our goal is to develop distributional models that are applicable to many words, it contains a considerably larger number of concepts (i.e., more than 500) and attributes (i.e., 412) based on a detailed taxonomy which we argue is cognitively plausible and beneficial for image and natural language processing tasks. Our experiments evaluate a number of models previously proposed in the literature and in all cases show that the attribute-based representation brings performance improvements over just using the textual modality. Moreover, we show that automatically computed attributes are comparable and in some cases superior to those provided by humans (e.g., in norming studies).

Attribute Categories         Example Attributes
color patterns (25)          is red, has stripes
diet (35)                    eats nuts, eats grass
shape size (16)              is small, is chubby
parts (125)                  has legs, has wheels
botany; anatomy (25; 78)     has seeds, has flowers
behavior (in)animate (55)    flies, waddles, pecks
texture material (36)        made of metal, is shiny
structure (3)                2 pieces, has pleats

Table 1: Attribute categories and examples of attribute instances. Parentheses denote the number of attributes per category.

3 The Attribute Dataset

Concepts and Images We created a dataset of images and their visual attributes for the nouns contained in McRae et al.'s (2005) feature norms. The norms cover a wide range of concrete concepts including animate and inanimate things (e.g., animals, clothing, vehicles, utensils, fruits, and vegetables) and were collected by presenting participants with words and asking them to list properties of the objects to which the words referred. To avoid confusion, in the remainder of this paper we will use the term attribute to refer to properties of concepts and the term feature to refer to image features, such as color or edges.

Images for the concepts in McRae et al.'s (2005) production norms were harvested from ImageNet (Deng et al., 2009), an ontology of images based on the nominal hierarchy of WordNet (Fellbaum, 1998). ImageNet has more than 14 million images spanning 21K WordNet synsets. We chose this database due to its high coverage and the high quality of its images (i.e., cleanly labeled and high resolution). McRae et al.'s norms contain 541 concepts out of which 516 appear in ImageNet [1] and are represented by 688K images overall. The average number of images per concept is 1,310, with the most popular being closet (2,149 images) and the least popular prune (5 images).

[1] Some words had to be modified in order to match the correct synset, e.g., tank (container) was found as storage tank.

bear      behavior          eats, walks, climbs, swims, runs
          diet              drinks water, eats anything
          shape size        is tall, is large
          anatomy           has mouth, has head, has nose, has tail, has claws, has jaws, has neck, has snout, has feet, has tongue
          color patterns    is black, is brown, is white

eggplant  botany            has skin, has seeds, has stem, has leaves, has pulp
          color patterns    purple, white, green, has green top
          shape size        is oval, is long
          texture material  is shiny

bike      behavior          rolls
          parts             has step through frame, has fork, has 2 wheels, has chain, has pedals, has gears, has handlebar, has bell, has brakes, has seat, has spokes
          texture material  made of metal
          color patterns    different colors, is black, is red, is grey, is silver

Table 2: Human-authored attributes for bear, eggplant, and bike.

The images depicting each concept were randomly partitioned into a training, development, and test set. For most concepts the development set contained a maximum of 100 images and the test set a maximum of 200 images. Concepts with less than 800 images in total were split into 1/8 test and development set each, and 3/4 training set. The development set was used for devising and refining our attribute annotation scheme. The training and test sets were used for learning and evaluating, respectively, attribute classifiers (see Section 4).

Attribute Annotation Our aim was to develop a set of visual attributes that are both discriminating and cognitively plausible, i.e., humans would generally use them to describe a concrete concept. As a starting point, we thus used the visual attributes from McRae et al.'s (2005) norming study. Attributes capturing other primary sensory information (e.g., smell, sound), functional/motor properties, or encyclopaedic information were not taken into account. For example, is purple is a valid visual attribute for an eggplant, whereas a vegetable is not, since it cannot be visualized. Collating all the visual attributes in the norms resulted in a total of 673 which we further modified and extended during the annotation process explained below.

The annotation was conducted on a per-concept rather than a per-image basis (as for example in Farhadi et al. (2009)). For each concept (e.g., bear or eggplant), we inspected the images in the development set and chose all McRae et al. (2005) visual attributes that applied. If an attribute was generally true for the concept, but the images did not provide enough evidence, the attribute was nevertheless chosen and labeled with <no evidence>. For example, a plum has a pit, but most images in ImageNet show plums where only the outer part of the fruit is visible. Attributes supported by the image data but missing from the norms were added. For example, has lights and has bumper are attributes of cars but are not included in the norms. Attributes were grouped in eight general classes shown in Table 1. Annotation proceeded on a category-by-category basis, e.g., first all food-related concepts were annotated, then animals, vehicles, and so on. Two annotators (both co-authors of this paper) developed the set of attributes for each category. One annotator first labeled concepts with their attributes, and the other annotator reviewed the annotations, making changes if needed. Annotations were revised and compared per category in order to ensure consistency across all concepts of that category.

Our methodology is slightly different from Lampert et al. (2009) in that we did not simply transfer the attributes from the norms to the concepts in question but refined and extended them according to the visual data. There are several reasons for this. Firstly, it makes sense to select attributes corroborated by the images. Secondly, by looking at the actual images, we could eliminate errors in McRae et al.'s (2005) norms. For example, eight study participants erroneously thought that a catfish has scales. Thirdly, during the annotation process, we normalized synonymous attributes (e.g., has pit and has stone) and attributes that exhibited negligible variations in meaning (e.g., has stem and has stalk). Finally, our aim was to collect an exhaustive list of visual attributes for each concept which is consistent across all members of a category. This is unfortunately not the case in McRae et al.'s norms. Participants were asked to list up to 14 different properties that describe a concept. As a result, the attributes of a concept denote the set of properties humans consider most salient. For example, both lemons and oranges have pulp. But the norms provide this attribute only for the second concept.

On average, each concept was annotated with 19 attributes; approximately 14.5 of these were not part of the semantic representation created by McRae et al.'s (2005) participants for that concept even though they figured in the representations of other concepts. Furthermore, on average two McRae et al. attributes per concept were discarded. Examples of concepts and their attributes from our database [2] are shown in Table 2.

[2] Available from http://homepages.inf.ed.ac.uk/mlap/index.php?page=resources.
4 Attribute-based Classification

Following previous work (Farhadi et al., 2009; Lampert et al., 2009) we learned one classifier per attribute (i.e., 350 classifiers in total) [3]. The training set consisted of 91,980 images (with a maximum of 350 images per concept). We used an L2-regularized L2-loss linear SVM (Fan et al., 2008) to learn the attribute predictions. We adopted the training procedure of Farhadi et al. (2009) [4]. To learn a classifier for a particular attribute, we used all images in the training data. Images of concepts annotated with the attribute were used as positive examples, and the rest as negative examples. The data was randomly split into a training and validation set of equal size in order to find the optimal cost parameter C. The final SVM for the attribute was trained on the entire training data, i.e., on all positive and negative examples.

[3] We only trained classifiers for attributes corroborated by the images and excluded those labeled with <no evidence>.
[4] http://vision.cs.uiuc.edu/attributes/
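To make this setup concrete, the sketch below shows one plausible way to train such per-attribute classifiers with scikit-learn's LinearSVC, which wraps the same LIBLINEAR solver; the feature extraction step is omitted, and the candidate values for the cost parameter C and all variable names are illustrative assumptions rather than the exact configuration used in the paper.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC  # L2-regularized linear SVM backed by LIBLINEAR


def train_attribute_classifier(X, y, costs=(0.01, 0.1, 1.0, 10.0)):
    """Train one binary classifier for a single visual attribute.

    X: (n_images, n_features) image descriptors (e.g., color/texture/HOG/edge histograms).
    y: (n_images,) 1 if the image's concept is annotated with the attribute, else 0.
    """
    # Split the attribute's training data in half to select the cost parameter C.
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)
    best_C, best_acc = costs[0], -1.0
    for C in costs:
        clf = LinearSVC(C=C, penalty="l2", loss="squared_hinge")
        clf.fit(X_tr, y_tr)
        acc = clf.score(X_val, y_val)
        if acc > best_acc:
            best_C, best_acc = C, acc
    # Retrain on all positive and negative examples with the selected C.
    final = LinearSVC(C=best_C, penalty="l2", loss="squared_hinge")
    final.fit(X, y)
    return final


# One classifier per attribute over the same image collection, e.g.:
# classifiers = {a: train_attribute_classifier(X_train, labels[a]) for a in attributes}
```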
The SVM learners used the four different feature types proposed in Farhadi et al. (2009), namely color, texture, visual words, and edges. Texture descriptors were computed for each pixel and quantized to the nearest 256 k-means centers. Visual words were constructed with a HOG spatial pyramid. HOG descriptors were quantized into 1000 k-means centers. Edges were detected using a standard Canny detector and their orientations were quantized into eight bins. Color descriptors were sampled for each pixel and quantized to the nearest 128 k-means centers. Shapes and locations were represented by generating histograms for each feature type for each cell in a grid of three vertical and horizontal blocks. Our classifiers used 9,688 features in total. Table 3 shows their predictions for three test images.

sandals     has 2 pieces, has pointed end, has strap, has thumb, has buckles, has heels, has shoe laces, has soles, is black, is brown, is white, made of leather, made of rubber

squirrel    climbs, climbs trees, crawls, hops, jumps, eats, eats nuts, is small, has bushy tail, has 4 legs, has head, has neck, has nose, has snout, has tail, has claws, has eyes, has feet, has toes

motorcycle  diff colours, has 2 legs, has 2 wheels, has windshield, has floorboard, has stand, has tank, has mudguard, has seat, has exhaust pipe, has frame, has handlebar, has lights, has mirror, has step-through frame, is black, is blue, is red, is white, made of aluminum, made of steel

Table 3: Attribute predictions for sandals, squirrel, and motorcycle.

Note that attributes are predicted on an image-by-image basis; our task, however, is to describe a concept w by its visual attributes. Since concepts are represented by many images we must somehow aggregate their attributes into a single representation. For each image i_w ∈ I_w of concept w, we output an F-dimensional vector containing prediction scores score_a(i_w) for attributes a = 1, ..., F. We transform these attribute vectors into a single vector p_w ∈ [0,1]^{1×F} by computing the centroid of all vectors for concept w. The vector is normalized to obtain a probability distribution over attributes given w:

    p_w = \frac{\left( \sum_{i_w \in I_w} \mathrm{score}_a(i_w) \right)_{a=1,\dots,F}}{\sum_{a=1}^{F} \sum_{i_w \in I_w} \mathrm{score}_a(i_w)}    (1)

We additionally impose a threshold δ on p_w by setting each entry less than δ to zero.
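The sketch below illustrates the aggregation of Equation (1), assuming the per-image prediction scores have already been mapped to non-negative values so that the normalization yields a probability distribution; the function and argument names are illustrative.

```python
import numpy as np


def concept_attribute_vector(image_scores, delta=0.0):
    """Aggregate per-image attribute scores into a concept vector p_w (Eq. 1).

    image_scores: (n_images, F) array of non-negative attribute prediction scores,
                  one row per image of the concept.
    delta:        threshold; entries of p_w smaller than delta are zeroed out.
    """
    summed = image_scores.sum(axis=0)   # sum of score_a(i_w) over the concept's images, per attribute
    p_w = summed / summed.sum()         # normalize to a distribution over the F attributes
    p_w[p_w < delta] = 0.0              # impose the threshold delta
    return p_w


# Example: a concept depicted by 5 images with F = 4 attributes.
p = concept_attribute_vector(np.random.rand(5, 4), delta=0.1)
```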
Figure 1 shows the results of the attribute prediction on the test set on the basis of the computed centroids; specifically, we plot recall against precision based on threshold δ [5]. Table 4 shows the 10 nearest neighbors for five example concepts from our dataset. Again, we measure the cosine similarity between a concept and all other concepts in the dataset when these are represented by their visual attribute vector p_w.

[5] Threshold values ranged from 0 to 0.9 with 0.1 stepsize.

[Figure 1: Attribute classifier performance for different thresholds δ (test set); the plot shows precision against recall.]

Concept    Nearest Neighbors
boat       ship, sailboat, yacht, submarine, canoe, whale, airplane, jet, helicopter, tank (army)
rooster    chicken, turkey, owl, pheasant, peacock, stork, pigeon, woodpecker, dove, raven
shirt      blouse, robe, cape, vest, dress, coat, jacket, skirt, camisole, nightgown
spinach    lettuce, parsley, peas, celery, broccoli, cabbage, cucumber, rhubarb, zucchini, asparagus
squirrel   chipmunk, raccoon, groundhog, gopher, porcupine, hare, rabbit, fox, mole, emu

Table 4: Ten most similar concepts computed on the basis of averaged attribute vectors and ordered according to cosine similarity.

5 Attribute-based Semantic Models

We evaluated the effectiveness of our attribute classifiers by integrating their predictions with traditional text-only models of semantic representation. These models have been previously proposed in the literature and were also described in a recent comparative study (Silberer and Lapata, 2012).

We represent the visual modality by attribute vectors computed as shown in Equation (1). The linguistic environment is approximated by textual attributes. We used Strudel (Baroni et al., 2010) to obtain these attributes for the nouns in our dataset. Given a list of target words, Strudel extracts weighted word-attribute pairs from a lemmatized and pos-tagged text corpus (e.g., eggplant–cook-v, eggplant–vegetable-n). The weight of each word-attribute pair is a log-likelihood ratio score expressing the pair's strength of association. In our experiments we learned word-attribute pairs from a lemmatized and pos-tagged (2009) dump of the English Wikipedia [6]. In the remainder of this section we will briefly describe the models we used in our study and how the textual and visual modalities were fused to create a joint representation.

[6] The corpus can be downloaded from http://wacky.sslmit.unibo.it/doku.php?id=corpora.

Concatenation Model Variants of this model were originally proposed in Bruni et al. (2011) and Johns and Jones (2012). Let T ∈ R^{N×D} denote a term-attribute co-occurrence matrix, where each cell records a weighted co-occurrence score of a word and a textual attribute. Let P ∈ [0,1]^{N×F} denote a visual matrix, representing a probability distribution over visual attributes for each word. A word's meaning can then be represented by the concatenation of its normalized textual and visual vectors.
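A minimal sketch of this fusion is given below; since the text does not spell out the normalization, L2-normalizing each modality before concatenation is an assumption.

```python
import numpy as np


def concatenate_modalities(t_vec, p_vec):
    """Concatenation model: join the normalized textual and visual vectors of a word.

    t_vec: textual (Strudel) attribute vector of the word.
    p_vec: visual attribute vector p_w of the word.
    """
    t_norm = t_vec / (np.linalg.norm(t_vec) + 1e-12)  # assumed L2 normalization
    p_norm = p_vec / (np.linalg.norm(p_vec) + 1e-12)
    return np.concatenate([t_norm, p_norm])
```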
Canonical Correlation Analysis The second model uses Canonical Correlation Analysis (CCA; Hardoon et al., 2004) to learn a joint semantic representation from the textual and visual modalities. Given two random variables x and y (or two sets of vectors), CCA can be seen as determining two sets of basis vectors in such a way that the correlation between the projections of the variables onto these bases is mutually maximized (Borga, 2001). In effect, the representation-specific details pertaining to the two views of the same phenomenon are discarded and the underlying hidden factors responsible for the correlation are revealed. The linguistic and visual views are the same as in the simple concatenation model just explained. We use a kernelized version of CCA (Hardoon et al., 2004) that first projects the data into a higher-dimensional feature space and then performs CCA in this new feature space. The two kernel matrices are K_T = TT' and K_P = PP'. After applying CCA we obtain two matrices projected onto l basis vectors: T̃ ∈ R^{N×l}, resulting from the projection of the textual matrix T onto the new basis, and P̃ ∈ R^{N×l}, resulting from the projection of the corresponding visual attribute matrix. The meaning of a word is then represented by T̃ or P̃.
tween the human cue-associate probabilities and
topic components from D with the extended LDA
the automatically derived similarity values.8
model gives two sets of parameters: word prob-
abilities given components PW (wi |X = xc ) for wi , Parameter Settings In order to integrate the vi-
i = 1, ..., n, and attribute probabilities given com- sual attributes with the models described in Sec-
ponents PA (ak |X = xc ) for ak , k = 1, ..., F. For ex- tion 5 we must select the appropriate threshold
ample, most of the probability mass of a compo- value δ (see Eq. (1)). We optimized this value
nent x would be reserved for the words shirt, coat, on the development set and obtained best results
dress and the attributes has 1 piece, has seams, with δ = 0. We also experimented with thresh-
made of material and so on. olding the attribute prediction scores and with ex-
Word meaning in this model is represented by cluding attributes with low precision. In both
the distribution PX|W over the learned compo- cases, we obtained best results when using all at-
nents. Assuming a uniform distribution over com- tributes. We could apply CCA to the vectors rep-
ponents xc in D , PX|W can be approximated as: resenting each image separately and then compute
P(wi |xc )P(xc ) P(wi |xc ) a weighted centroid on the projected vectors. We
PX=xc |W =wi = ≈ (2) refrained from doing this as it involves additional
P(wi ) C
∑ P(wi |xl ) parameters and assumes input different from the
l=1
other models. We measured the similarity between
where C is the total number of components.
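Since the prior over components is assumed uniform, the approximation in Equation (2) boils down to renormalizing the word-given-component probabilities over the C components, as the short sketch below illustrates (array names are illustrative).

```python
import numpy as np


def component_distribution(word_index, p_word_given_component):
    """Approximate P(X = x_c | W = w_i) from P(w_i | x_c) under a uniform prior (Eq. 2).

    p_word_given_component: (C, n_words) array holding P(w_i | x_c).
    """
    col = p_word_given_component[:, word_index]  # P(w_i | x_c) for all components c
    return col / col.sum()                       # normalize over the C components
```

The resulting C-dimensional distribution is the word's meaning representation in the attribute-topic model.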
In our work, the training data is a corpus D of textual attributes (rather than documents). Each attribute is represented as a bag-of-concepts, i.e., words demonstrating the property expressed by the attribute (e.g., vegetable-n is a property of eggplant, spinach, carrot). For some of these concepts, our classifiers predict visual attributes. In this case, the concepts are paired with one of their visual attributes. We sample attributes for a concept w from their distribution given w (Eq. (1)).

6 Experimental Setup

Evaluation Task We evaluated the distributional models presented in Section 5 on the word association norms collected by Nelson et al. (1998) [7]. These were established by presenting a large number of participants with a cue word (e.g., rice) and asking them to name an associate word in response (e.g., Chinese, wedding, food, white). For each cue, the norms provide a set of associates and the frequencies with which they were named. We can thus compute the probability distribution over associates for each cue. Analogously, we can estimate the degree of similarity between a cue and its associates using our models. The norms contain 63,619 unique cue-associate pairs. Of these, 435 pairs were covered by McRae et al. (2005) and our models. We also experimented with 1,716 pairs that were not part of McRae et al.'s study but belonged to concepts covered by our attribute taxonomy (e.g., animals, vehicles), and were present in our corpus and ImageNet. Using correlation analysis (Spearman's ρ), we examined the degree of linear relationship between the human cue-associate probabilities and the automatically derived similarity values [8].

[7] From http://w3.usf.edu/FreeAssociation/.
[8] Previous work (Griffiths et al., 2007) which also predicts word association reports how many times the word with the highest score under the model was the first associate in the human norms. This evaluation metric assumes that there are many associates for a given cue, which unfortunately is not the case in our study, which is restricted to the concepts represented in our attribute taxonomy.
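A sketch of this evaluation protocol is shown below; the cosine similarity and Spearman's ρ follow the description in the text, while the data structures (a dictionary of word vectors and one of human cue-associate probabilities) are illustrative assumptions.

```python
import numpy as np
from scipy.stats import spearmanr


def cosine(u, v):
    """Cosine of the angle between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))


def correlate_with_norms(pairs, vectors, human_probs):
    """Spearman's rho between human cue-associate probabilities and model similarities.

    pairs:       list of (cue, associate) pairs covered by the model.
    vectors:     dict mapping each word to its vector under a given model.
    human_probs: dict mapping (cue, associate) to its probability in the Nelson norms.
    """
    gold = [human_probs[pair] for pair in pairs]
    pred = [cosine(vectors[cue], vectors[assoc]) for cue, assoc in pairs]
    rho, pvalue = spearmanr(gold, pred)
    return rho, pvalue
```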
Parameter Settings In order to integrate the visual attributes with the models described in Section 5 we must select the appropriate threshold value δ (see Eq. (1)). We optimized this value on the development set and obtained best results with δ = 0. We also experimented with thresholding the attribute prediction scores and with excluding attributes with low precision. In both cases, we obtained best results when using all attributes. We could apply CCA to the vectors representing each image separately and then compute a weighted centroid on the projected vectors. We refrained from doing this as it involves additional parameters and assumes input different from the other models. We measured the similarity between two words using the cosine of the angle. For the attribute-topic model, the number of predefined components C was set to 10. In this model, similarity was measured as defined by Griffiths et al. (2007). The underlying idea is that word association can be expressed as a conditional distribution.

With regard to the textual attributes, we obtained a 9,394-dimensional semantic space after discarding word-attribute pairs with a log-likelihood ratio score less than 19 [9]. We also discarded attributes co-occurring with less than two different words.

[9] Baroni et al. (2010) use a similar threshold of 19.51.

7 Results

Our experiments were designed to answer four questions: (1) Do visual attributes improve the performance of distributional models? (2) Are there performance differences among different models, i.e., are some models better suited to the integration of visual information? (3) How do computational models fare against gold standard norming data? (4) Does the attribute-based representation bring advantages over more conventional approaches based on raw image features?

Our results are broken down into seen (Table 5) and unseen (Table 6) concepts. The former are known to the attribute classifiers and form part of our database, whereas the latter are unknown and are not included in McRae et al.'s (2005) norms. We report the correlation coefficients we obtain when human-derived cue-associate probabilities (Nelson et al., 1998) are compared against the simple concatenation model (Concat), CCA, and Andrews et al.'s (2009) attribute-topic model (TopicAttr). We also report the performance of a distributional model that is based solely on the output of our attribute classifiers, i.e., without any textual input (VisAttr), and conversely the performance of a model that uses textual information only (i.e., Strudel attributes) without any visual input (TextAttr). The results are displayed as a correlation matrix so that inter-model correlations can also be observed.

            Nelson  Concat  CCA   TopicAttr  TextAttr
Concat      0.24
CCA         0.30    0.72
TopicAttr   0.26    0.55    0.28
TextAttr    0.21    0.80    0.83  0.34
VisAttr     0.23    0.65    0.52  0.40       0.39

Table 5: Correlation matrix for seen Nelson et al. (1998) cue-associate pairs and five distributional models. All correlation coefficients are statistically significant (p < 0.01, N = 435).

            Nelson  Concat  CCA   TopicAttr  TextAttr
Concat      0.11
CCA         0.15    0.66
TopicAttr   0.17    0.69    0.48
TextAttr    0.11    0.65    0.25  0.39
VisAttr     0.13    0.57    0.87  0.57       0.34

Table 6: Correlation matrix for unseen Nelson et al. (1998) cue-associate pairs and five distributional models. All correlation coefficients are statistically significant (p < 0.01, N = 1,716).

As can be seen in Table 5 (second column), two modalities are in most cases better than one when evaluating model performance on seen data. Differences in correlation coefficients between models with two versus one modality are all statistically significant (p < 0.01 using a t-test), with the exception of Concat when compared against VisAttr. It is also interesting to note that TopicAttr is the least correlated model when compared against other bimodal models or single modalities. This indicates that the latent space obtained by this model is most distinct from its constituent parts (i.e., visual and textual attributes). Perhaps unsurprisingly, Concat, CCA, VisAttr, and TextAttr are also highly intercorrelated.

On unseen pairs (see Table 6), Concat fares worse than CCA and TopicAttr, achieving similar performance to TextAttr. CCA and TopicAttr are significantly better than TextAttr and VisAttr (p < 0.01). This indicates that our attribute classifiers generalize well beyond the concepts found in our database and can produce useful visual information even on unseen images. Compared to Concat and CCA, TopicAttr obtains a better fit with the human association norms on the unseen data.

To answer our third question, we obtained distributional models from McRae et al.'s (2005) norms and assessed how well they predict Nelson et al.'s (1998) word-associate similarities. Each concept was represented as a vector with dimensions corresponding to attributes generated by participants of the norming study. Vector components were set to the (normalized) frequency with which participants generated the corresponding attribute when presented with the concept. We measured the similarity between two words using the cosine coefficient. Table 7 presents results for different model variants which we created by manipulating the number and type of attributes involved. The first model uses the full set of attributes present in the norms (All Attributes). The second model (Text Attributes) uses all attributes but those classified as visual (e.g., functional, encyclopaedic). The third model (Visual Attributes) considers solely visual attributes.
Models             Seen
All Attributes     0.28
Text Attributes    0.20
Visual Attributes  0.25

Table 7: Model performance on seen Nelson et al. (1998) cue-associate pairs; models are based on gold human generated attributes (McRae et al., 2005). All correlation coefficients are statistically significant (p < 0.01, N = 435).

Models     Seen  Unseen
Concat     0.22  0.10
CCA        0.26  0.15
TopicAttr  0.23  0.19
TextAttr   0.20  0.08
VisAttr    0.21  0.13
MixLDA     0.16  0.11

Table 8: Model performance on a subset of Nelson et al. (1998) cue-associate pairs. Seen are concepts known to the attribute classifiers and covered by MixLDA (N = 85). Unseen are concepts covered by LDA but unknown to the attribute classifiers (N = 388). All correlation coefficients are statistically significant (p < 0.05).

We observe a similar trend as with our computational models. Taking visual attributes into account increases the fit with Nelson's (1998) association norms, whereas visual and textual attributes on their own perform worse. Interestingly, CCA's performance is comparable to the All Attributes model (see Table 5, second column), despite using automatic attributes (both textual and visual). Furthermore, visual attributes obtained through our classifiers (see Table 5) achieve a marginally lower correlation coefficient against human generated ones (see Table 7).

Finally, to address our last question, we compared our approach against Feng and Lapata (2010) who represent visual information via quantized SIFT features. We trained their MixLDA model on their corpus consisting of 3,361 BBC news documents and corresponding images (Feng and Lapata, 2008). We optimized the model parameters on a development set consisting of cue-associate pairs from Nelson et al. (1998), excluding the concepts in McRae et al. (2005). We used a vocabulary of approximately 6,000 words. The best performing model on the development set used 500 visual terms and 750 topics and the association measure proposed in Griffiths et al. (2007). The test set consisted of 85 seen and 388 unseen cue-associate pairs that were covered by our models and MixLDA.

Table 8 reports correlation coefficients for our models and MixLDA against human probabilities. All attribute-based models significantly outperform MixLDA on seen pairs (p < 0.05 using a t-test). MixLDA performs on a par with the concatenation model on unseen pairs; however, CCA, TopicAttr, and VisAttr are all superior. Although these comparisons should be taken with a grain of salt, given that MixLDA and our models are trained on different corpora (MixLDA assumes that texts and images are collocated, whereas our images do not have collateral text), they seem to indicate that attribute-based information is indeed beneficial.

8 Conclusions

In this paper we proposed the use of automatically computed visual attributes as a way of physically grounding word meaning. Our results demonstrate that visual attributes improve the performance of distributional models across the board. On a word association task, CCA and the attribute-topic model give a better fit to human data when compared against simple concatenation and models based on a single modality. CCA consistently outperforms the attribute-topic model on seen data (it is in fact slightly better than a model that uses gold standard human generated attributes), whereas the attribute-topic model generalizes better on unseen data (see Tables 5, 6, and 8). Since the attribute-based representation is general and text-based, we argue that it can be conveniently integrated with any type of distributional model or indeed other grounded models that rely on low-level image features (Bruni et al., 2012a; Feng and Lapata, 2010).

In the future, we would like to extend our database to actions and show that this attribute-centric representation is useful for more applied tasks such as image description generation and object recognition. Finally, we have only scratched the surface in terms of possible models for integrating the textual and visual modality. Interesting frameworks which we plan to explore are deep belief networks and Bayesian non-parametrics.
References

M. Andrews, G. Vigliocco, and D. Vinson. 2009. Integrating Experiential and Distributional Data to Learn Semantic Representations. Psychological Review, 116(3):463–498.

M. Baroni, B. Murphy, E. Barbu, and M. Poesio. 2010. Strudel: A Corpus-Based Semantic Model Based on Properties and Types. Cognitive Science, 34(2):222–254.

L. W. Barsalou. 2008. Grounded Cognition. Annual Review of Psychology, 59:617–845.

D. M. Blei, A. Y. Ng, and M. I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, March.

M. Borga. 2001. Canonical Correlation - a Tutorial, January.

M. H. Bornstein, L. R. Cote, S. Maital, K. Painter, S.-Y. Park, L. Pascual, M. G. Pêcheux, J. Ruel, P. Venuti, and A. Vyt. 2004. Cross-linguistic Analysis of Vocabulary in Young Children: Spanish, Dutch, French, Hebrew, Italian, Korean, and American English. Child Development, 75(4):1115–1139.

B. Börschinger, B. K. Jones, and M. Johnson. 2011. Reducing Grounded Learning Tasks To Grammatical Inference. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1416–1425, Edinburgh, UK.

S.R.K. Branavan, H. Chen, L. S. Zettlemoyer, and R. Barzilay. 2009. Reinforcement Learning for Mapping Instructions to Actions. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 82–90, Suntec, Singapore.

E. Bruni, G. Tran, and M. Baroni. 2011. Distributional Semantics from Text and Images. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, pages 22–32, Edinburgh, UK.

E. Bruni, G. Boleda, M. Baroni, and N. Tran. 2012a. Distributional Semantics in Technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 136–145, Jeju Island, Korea.

Elia Bruni, Jasper Uijlings, Marco Baroni, and Nicu Sebe. 2012b. Distributional semantics with eyes: Using image analysis to improve computational representations of word meaning. In Proceedings of the 20th ACM International Conference on Multimedia, pages 1219–1228, New York, NY.

C. Chai and C. Hung. 2008. Automatically Annotating Images with Keywords: A Review of Image Annotation Systems. Recent Patents on Computer Science, 1:55–68.

R. Datta, D. Joshi, J. Li, and J. Z. Wang. 2008. Image Retrieval: Ideas, Influences, and Trends of the New Age. ACM Computing Surveys, 40(2):1–60.

J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 248–255, Miami, Florida.

M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. 2008. The PASCAL Visual Object Classes Challenge 2008 (VOC2008) Results. http://www.pascal-network.org/challenges/VOC/voc2008/workshop.

R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin. 2008. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9:1871–1874.

A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. 2009. Describing Objects by their Attributes. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1778–1785, Miami Beach, Florida.

Li Fei-Fei and Pedro Perona. 2005. A Bayesian Hierarchical Model for Learning Natural Scene Categories. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 524–531, San Diego, California.

C. Fellbaum, editor. 1998. WordNet: an Electronic Lexical Database. MIT Press.

Yansong Feng and Mirella Lapata. 2008. Automatic image annotation using auxiliary text information. In Proceedings of ACL-08: HLT, pages 272–280, Columbus, Ohio.

Y. Feng and M. Lapata. 2010. Visual Information in Semantic Representation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 91–99, Los Angeles, California. ACL.

V. Ferrari and A. Zisserman. 2007. Learning Visual Attributes. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 433–440. MIT Press, Cambridge, Massachusetts.

G. H. Golub, F. T. Luk, and M. L. Overton. 1981. A block Lanczos method for computing the singular values and corresponding singular vectors of a matrix. ACM Transactions on Mathematical Software, 7:149–169.

P. Gorniak and D. Roy. 2004. Grounded Semantic Composition for Visual Scenes. Journal of Artificial Intelligence Research, 21:429–470.

T. L. Griffiths, M. Steyvers, and J. B. Tenenbaum. 2007. Topics in Semantic Representation. Psychological Review, 114(2):211–244.

D. R. Hardoon, S. R. Szedmak, and J. R. Shawe-Taylor. 2004. Canonical Correlation Analysis: An Overview with Application to Learning Methods. Neural Computation, 16(12):2639–2664.

Brendan T. Johns and Michael N. Jones. 2012. Perceptual Inference through Global Lexical Similarity. Topics in Cognitive Science, 4(1):103–120.

D. Joshi, J.Z. Wang, and J. Li. 2006. The Story Picturing Engine—A System for Automatic Text Illustration. ACM Transactions on Multimedia Computing, Communications, and Applications, 2(1):68–89.

R. J. Kate and R. J. Mooney. 2007. Learning Language Semantics from Ambiguous Supervision. In Proceedings of the 22nd Conference on Artificial Intelligence, pages 895–900, Vancouver, Canada.

N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. 2009. Attribute and Simile Classifiers for Face Verification. In Proceedings of the IEEE 12th International Conference on Computer Vision, pages 365–372, Kyoto, Japan.

C. H. Lampert, H. Nickisch, and S. Harmeling. 2009. Learning To Detect Unseen Object Classes by Between-Class Attribute Transfer. In Computer Vision and Pattern Recognition, pages 951–958, Miami Beach, Florida.

B. Landau, L. Smith, and S. Jones. 1998. Object Perception and Object Naming in Early Development. Trends in Cognitive Science, 27:19–24.

C. Leong and R. Mihalcea. 2011. Going Beyond Text: A Hybrid Image-Text Approach for Measuring Word Relatedness. In Proceedings of 5th International Joint Conference on Natural Language Processing, pages 1403–1407, Chiang Mai, Thailand.

J. Liu, B. Kuipers, and S. Savarese. 2011. Recognizing Human Actions by Attributes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3337–3344, Colorado Springs, Colorado.

D. G. Lowe. 1999. Object Recognition from Local Scale-invariant Features. In Proceedings of the International Conference on Computer Vision, pages 1150–1157, Corfu, Greece.

D. Lowe. 2004. Distinctive Image Features from Scale-invariant Keypoints. International Journal of Computer Vision, 60(2):91–110.

W. Lu, H. T. Ng, W.S. Lee, and L. S. Zettlemoyer. 2008. A Generative Model for Parsing Natural Language to Meaning Representations. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 783–792, Honolulu, Hawaii.

K. McRae, G. S. Cree, M. S. Seidenberg, and C. McNorgan. 2005. Semantic Feature Production Norms for a Large Set of Living and Nonliving Things. Behavior Research Methods, 37(4):547–559.

D. L. Nelson, C. L. McEvoy, and T. A. Schreiber. 1998. The University of South Florida Word Association, Rhyme, and Word Fragment Norms.

Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. 2011. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning, pages 689–696, Bellevue, Washington.

A. Oliva and A. Torralba. 2007. The Role of Context in Object Recognition. Trends in Cognitive Sciences, 11(12):520–527.

D. N. Osherson, J. Stern, O. Wilkie, M. Stob, and E. E. Smith. 1991. Default Probability. Cognitive Science, 2(15):251–269.

G. Patterson and J. Hays. 2012. SUN Attribute Database: Discovering, Annotating and Recognizing Scene Attributes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2751–2758, Providence, Rhode Island.

Terry Regier. 1996. The Human Semantic Potential. MIT Press, Cambridge, Massachusetts.

D. Roy and A. Pentland. 2002. Learning Words from Sights and Sounds: A Computational Model. Cognitive Science, 26(1):113–146.

C. Silberer and M. Lapata. 2012. Grounded Models of Semantic Representation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1423–1433, Jeju Island, Korea.

J. M. Siskind. 2001. Grounding the Lexical Semantics of Verbs in Visual Perception using Force Dynamics and Event Logic. Journal of Artificial Intelligence Research, 15:31–90.

S. A. Sloman and L. J. Ripps. 1998. Similarity as an Explanatory Construct. Cognition, 65:87–101.

Nitish Srivastava and Ruslan Salakhutdinov. 2012. Multimodal learning with deep boltzmann machines. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems, pages 2231–2239, Lake Tahoe, Nevada.

M. Steyvers. 2010. Combining feature norms and text data with topic models. Acta Psychologica, 133(3):234–342.

S. Tellex, T. Kollar, S. Dickerson, M. R. Walter, A. Gopal Banerjee, S. Teller, and N. Roy. 2011. Understanding Natural Language Commands for Robotic Navigation and Manipulation. In Proceedings of the 25th National Conference on Artificial Intelligence, pages 1507–1514, San Francisco, California.
Luis von Ahn and Laura Dabbish. 2004. Labeling images with a computer game. In Proceedings of the Human Factors in Computing Systems Conference, pages 319–326, Vienna, Austria.

C. Yu and D. H. Ballard. 2007. A Unified Model of Early Word Learning Integrating Statistical and Social Cues. Neurocomputing, 70:2149–2165.

M. D. Zeigenfuse and M. D. Lee. 2010. Finding the Features that Represent Stimuli. Acta Psychologica, 133(3):283–295.

J. M. Zelle and R. J. Mooney. 1996. Learning to Parse Database Queries Using Inductive Logic Programming. In Proceedings of the 13th National Conference on Artificial Intelligence, pages 1050–1055, Portland, Oregon.

L. S. Zettlemoyer and M. Collins. 2005. Learning to Map Sentences to Logical Form: Structured Classification with Probabilistic Categorial Grammars. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pages 658–666, Edinburgh, UK.
