[Table 2 (excerpt): examples of concepts and their attributes, grouped by attribute class.
botany: has skin, has seeds, has stem, has leaves, has pulp
color patterns: purple, white, green, has green top
shape size: is oval, is long
texture material: is shiny
behavior: rolls
parts: has step-through frame, has fork, has 2 wheels, has chain, has pedals, has gears, has handlebar, has bell, has brakes, has seat, has spokes
texture material: made of metal
color patterns: different colors, is black, is red, is grey, is silver]
the most popular being closet (2,149 images) and the least popular prune (5 images). The images depicting each concept were randomly partitioned into a training, development, and test set. For most concepts the development set contained a maximum of 100 images and the test set a maximum of 200 images. Concepts with less than 800 images in total were split into 1/8 test and development set each, and 3/4 training set. The development set was used for devising and refining our attribute annotation scheme. The training and test sets were used for learning and evaluating, respectively, attribute classifiers (see Section 4).
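A minimal sketch of this partitioning rule follows; the exact handling of the 800-image threshold, the random seeding, and the function name are assumptions for illustration rather than the procedure actually used.

import random

def split_concept_images(image_ids, seed=0):
    # Randomly partition one concept's images into training, development,
    # and test sets. Assumption: concepts with at least 800 images get up to
    # 100 development and 200 test images; smaller concepts are split into
    # 1/8 development, 1/8 test, and 3/4 training.
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    if n >= 800:
        n_dev, n_test = 100, 200
    else:
        n_dev = n_test = n // 8
    dev = ids[:n_dev]
    test = ids[n_dev:n_dev + n_test]
    train = ids[n_dev + n_test:]
    return train, dev, test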
Attribute Annotation
Our aim was to develop a set of visual attributes that are both discriminating and cognitively plausible, i.e., humans would generally use them to describe a concrete concept. As a starting point, we thus used the visual attributes from McRae et al.'s (2005) norming study. Attributes capturing other primary sensory information (e.g., smell, sound), functional/motor properties, or encyclopaedic information were not taken into account. For example, is purple is a valid visual attribute for an eggplant, whereas a vegetable is not, since it cannot be visualized. Collating all the visual attributes in the norms resulted in a total of 673 which we further modified and extended during the annotation process explained below.

The annotation was conducted on a per-concept rather than a per-image basis (as for example in Farhadi et al. (2009)). For each concept (e.g., bear or eggplant), we inspected the images in the development set and chose all McRae et al. (2005) visual attributes that applied. If an attribute was generally true for the concept, but the images did not provide enough evidence, the attribute was nevertheless chosen and labeled with <no evidence>. For example, a plum has a pit, but most images in ImageNet show plums where only the outer part of the fruit is visible. Attributes supported by the image data but missing from the norms were added. For example, has lights and has bumper are attributes of cars but are not included in the norms. Attributes were grouped in eight general classes shown in Table 1. Annotation proceeded on a category-by-category basis, e.g., first all food-related concepts were annotated, then animals, vehicles, and so on. Two annotators (both co-authors of this paper) developed the set of attributes for each category. One annotator first labeled concepts with their attributes, and the other annotator reviewed the annotations, making changes if needed. Annotations were revised and compared per category in order to ensure consistency across all concepts of that category.
Our methodology is slightly different from Lampert et al. (2009) in that we did not simply transfer the attributes from the norms to the concepts in question but refined and extended them according to the visual data. There are several reasons for this. Firstly, it makes sense to select attributes corroborated by the images. Secondly, by looking at the actual images, we could eliminate errors in McRae et al.'s (2005) norms. For example, eight study participants erroneously thought that a catfish has scales. Thirdly, during the annotation process, we normalized synonymous attributes (e.g., has pit and has stone) and attributes that exhibited negligible variations in meaning (e.g., has stem and has stalk). Finally, our aim was to collect an exhaustive list of visual attributes for each concept which is consistent across all members of a category. This is unfortunately not the case in McRae et al.'s norms. Participants were asked to list up to 14 different properties that describe a concept. As a result, the attributes of a concept denote the set of properties humans consider most salient. For example, both lemons and oranges have pulp, but the norms provide this attribute only for the second concept.

On average, each concept was annotated with 19 attributes; approximately 14.5 of these were not part of the semantic representation created by McRae et al.'s (2005) participants for that concept even though they figured in the representations of other concepts. Furthermore, on average two McRae et al. attributes per concept were discarded. Examples of concepts and their attributes from our database2 are shown in Table 2.

[Table 2 (further excerpt), attribute lists for additional concepts:
has 2 pieces, has pointed end, has strap, has thumb, has buckles, has heels, has shoe laces, has soles, is black, is brown, is white, made of leather, made of rubber
climbs, climbs trees, crawls, hops, jumps, eats, eats nuts, is small, has bushy tail, has 4 legs, has head, has neck, has nose, has snout, has tail, has claws, has eyes, has feet, has toes
diff colours, has 2 legs, has 2 wheels, has windshield, has floorboard, has stand, has tank, has mudguard, has seat, has exhaust pipe, has frame, has handlebar, has lights, has mirror, has step-through frame, is black, is blue, is red, is white, made of aluminum, made of steel]

4 Attribute-based Classification

Following previous work (Farhadi et al., 2009; Lampert et al., 2009) we learned one classifier per attribute (i.e., 350 classifiers in total).3 The training set consisted of 91,980 images (with a maximum of 350 images per concept). We used an L2-regularized L2-loss linear SVM (Fan et al., 2008) to learn the attribute predictions. We adopted the training procedure of Farhadi et al. (2009).4 To learn a classifier for a particular attribute, we used all images in the training data. Images of concepts annotated with the attribute were used as positive examples, and the rest as negative examples. The data was randomly split into a training and validation set of equal size in order to find the optimal cost parameter C. The final SVM for the attribute was trained on the entire training data, i.e., on all positive and negative examples.
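As an illustration of this per-attribute training procedure, the sketch below uses scikit-learn's LinearSVC (which wraps LIBLINEAR, i.e., an L2-regularized L2-loss linear SVM) rather than the original Farhadi et al. (2009) code; the candidate C values and the function name are assumptions.

from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC  # LIBLINEAR-backed L2-regularized L2-loss linear SVM

def train_attribute_classifier(X, y, C_grid=(0.01, 0.1, 1.0, 10.0)):
    # X: feature vectors of all training images; y[i] = 1 if image i belongs
    # to a concept annotated with the attribute, else 0 (negative example).
    # Half of the data is held out to select the cost parameter C.
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)
    best_C = max(C_grid, key=lambda C: LinearSVC(C=C).fit(X_tr, y_tr).score(X_val, y_val))
    # The final classifier is retrained on the entire training data with the selected C.
    return LinearSVC(C=best_C).fit(X, y)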
The SVM learners used the four different feature types proposed in Farhadi et al. (2009), namely color, texture, visual words, and edges. Texture descriptors were computed for each pixel and quantized to the nearest 256 k-means centers. Visual words were constructed with a HOG spatial pyramid. HOG descriptors were quantized into 1000 k-means centers. Edges were detected using a standard Canny detector and their orientations were quantized into eight bins. Color descriptors were sampled for each pixel and quantized to the nearest 128 k-means centers. Shapes and locations were represented by generating histograms for each feature type for each cell in a grid of three vertical and horizontal blocks. Our classifiers used 9,688 features in total. Table 3 shows their predictions for three test images.
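The spatial pooling underlying these histograms can be sketched as a generic bag-of-quantized-descriptors with one histogram per grid cell; the 3x3 grid, the per-cell normalization, and all names below are illustrative assumptions rather than the exact Farhadi et al. (2009) recipe.

import numpy as np

def grid_histograms(assignments, xs, ys, width, height, n_centers, n_rows=3, n_cols=3):
    # assignments: index of the nearest k-means center for each descriptor;
    # xs, ys: pixel coordinates of those descriptors in the image.
    # Builds one histogram of quantized descriptors per grid cell and
    # concatenates them into a single feature vector.
    rows = np.minimum((np.asarray(ys) * n_rows) // height, n_rows - 1)
    cols = np.minimum((np.asarray(xs) * n_cols) // width, n_cols - 1)
    feats = []
    for r in range(n_rows):
        for c in range(n_cols):
            in_cell = np.asarray(assignments)[(rows == r) & (cols == c)]
            hist = np.bincount(in_cell, minlength=n_centers).astype(float)
            feats.append(hist / max(hist.sum(), 1.0))  # normalize each cell histogram
    return np.concatenate(feats)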
Note that attributes are predicted on an image-by-image basis; our task, however, is to describe a concept w by its visual attributes. Since concepts are represented by many images we must somehow aggregate their attributes into a single representation. For each image i_w ∈ I_w of concept w, we output an F-dimensional vector containing prediction scores score_a(i_w) for attributes a = 1, ..., F. We transform these attribute vectors into a single vector p_w ∈ [0, 1]^{1×F} by computing the centroid of all vectors for concept w. The vector is normalized to obtain a probability distribution over attributes.
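A minimal sketch of this aggregation step is given below; how raw SVM scores are mapped into [0, 1] before averaging is not fully specified above, so the clipping is a stand-in assumption.

import numpy as np

def concept_attribute_vector(image_scores):
    # image_scores: array of shape (num_images, F) holding score_a(i_w) for
    # every image of concept w. Returns the centroid over images, renormalized
    # so that the entries sum to 1 (a probability distribution over attributes).
    scores = np.asarray(image_scores, dtype=float)
    centroid = scores.mean(axis=0)           # centroid of all per-image vectors
    centroid = np.clip(centroid, 0.0, None)  # assumption: scores already mapped to non-negative values
    return centroid / max(centroid.sum(), 1e-12)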
2 Available from http://homepages.inf.ed.ac.uk/mlap/index.php?page=resources.
3 We only trained classifiers for attributes corroborated by the images and excluded those labeled with <no evidence>.
4 http://vision.cs.uiuc.edu/attributes/

[Figure: precision plot (only the axis label "Precision" and tick values are recoverable).]
[Table (excerpt): concepts and their nearest neighbors.
boat: ship, sailboat, yacht, submarine, canoe, whale, airplane, jet, helicopter, tank (army)
rooster: chicken, turkey, owl, pheasant, peacock, stork, ...]
Brendan T. Johns and Michael N. Jones. 2012. Perceptual Inference through Global Lexical Similarity. Topics in Cognitive Science, 4(1):103–120.
D. Joshi, J. Z. Wang, and J. Li. 2006. The Story Picturing Engine—A System for Automatic Text Illustration. ACM Transactions on Multimedia Computing, Communications, and Applications, 2(1):68–89.
R. J. Kate and R. J. Mooney. 2007. Learning Language Semantics from Ambiguous Supervision. In Proceedings of the 22nd Conference on Artificial Intelligence, pages 895–900, Vancouver, Canada.
N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. 2009. Attribute and Simile Classifiers for Face Verification. In Proceedings of the IEEE 12th International Conference on Computer Vision, pages 365–372, Kyoto, Japan.
C. H. Lampert, H. Nickisch, and S. Harmeling. 2009. Learning To Detect Unseen Object Classes by Between-Class Attribute Transfer. In Computer Vision and Pattern Recognition, pages 951–958, Miami Beach, Florida.
B. Landau, L. Smith, and S. Jones. 1998. Object Perception and Object Naming in Early Development. Trends in Cognitive Science, 27:19–24.
C. Leong and R. Mihalcea. 2011. Going Beyond Text: A Hybrid Image-Text Approach for Measuring Word Relatedness. In Proceedings of 5th International Joint Conference on Natural Language Processing, pages 1403–1407, Chiang Mai, Thailand.
J. Liu, B. Kuipers, and S. Savarese. 2011. Recognizing Human Actions by Attributes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3337–3344, Colorado Springs, Colorado.
D. G. Lowe. 1999. Object Recognition from Local Scale-invariant Features. In Proceedings of the International Conference on Computer Vision, pages 1150–1157, Corfu, Greece.
D. Lowe. 2004. Distinctive Image Features from Scale-invariant Keypoints. International Journal of Computer Vision, 60(2):91–110.
W. Lu, H. T. Ng, W. S. Lee, and L. S. Zettlemoyer. 2008. A Generative Model for Parsing Natural Language to Meaning Representations. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 783–792, Honolulu, Hawaii.
Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. 2011. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning, pages 689–696, Bellevue, Washington.
A. Oliva and A. Torralba. 2007. The Role of Context in Object Recognition. Trends in Cognitive Sciences, 11(12):520–527.
D. N. Osherson, J. Stern, O. Wilkie, M. Stob, and E. E. Smith. 1991. Default Probability. Cognitive Science, 2(15):251–269.
G. Patterson and J. Hays. 2012. SUN Attribute Database: Discovering, Annotating and Recognizing Scene Attributes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2751–2758, Providence, Rhode Island.
Terry Regier. 1996. The Human Semantic Potential. MIT Press, Cambridge, Massachusetts.
D. Roy and A. Pentland. 2002. Learning Words from Sights and Sounds: A Computational Model. Cognitive Science, 26(1):113–146.
C. Silberer and M. Lapata. 2012. Grounded Models of Semantic Representation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1423–1433, Jeju Island, Korea.
J. M. Siskind. 2001. Grounding the Lexical Semantics of Verbs in Visual Perception using Force Dynamics and Event Logic. Journal of Artificial Intelligence Research, 15:31–90.
S. A. Sloman and L. J. Ripps. 1998. Similarity as an Explanatory Construct. Cognition, 65:87–101.
Nitish Srivastava and Ruslan Salakhutdinov. 2012. Multimodal learning with deep Boltzmann machines. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems, pages 2231–2239, Lake Tahoe, Nevada.
M. Steyvers. 2010. Combining feature norms and text data with topic models. Acta Psychologica, 133(3):234–342.
S. Tellex, T. Kollar, S. Dickerson, M. R. Walter, A. Gopal Banerjee, S. Teller, and N. Roy. 2011. Understanding Natural Language Commands for Robotic Navigation and Manipulation. In Proceedings of the 25th National Conference on Artificial Intelligence, pages 1507–1514, San Francisco, California.
Luis von Ahn and Laura Dabbish. 2004. Labeling images with a computer game. In Proceedings of the Human Factors in Computing Systems Conference, pages 319–326, Vienna, Austria.
C. Yu and D. H. Ballard. 2007. A Unified Model of Early Word Learning Integrating Statistical and Social Cues. Neurocomputing, 70:2149–2165.
M. D. Zeigenfuse and M. D. Lee. 2010. Finding the Features that Represent Stimuli. Acta Psychologica, 133(3):283–295.
J. M. Zelle and R. J. Mooney. 1996. Learning to Parse Database Queries Using Inductive Logic Programming. In Proceedings of the 13th National Conference on Artificial Intelligence, pages 1050–1055, Portland, Oregon.
L. S. Zettlemoyer and M. Collins. 2005. Learning to Map Sentences to Logical Form: Structured Classification with Probabilistic Categorial Grammars. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pages 658–666, Edinburgh, UK.