Cats and Dogs: A Summary
This paper addresses the fine-grained image classification problem of classifying 37 breeds of cats and dogs: 12 cat breeds and 25 dog breeds. Classification between breeds of the same species is particularly challenging, since robust models must learn the subtle differences in fur texture and other phenotypic details present in pet images. Most famous computer-vision datasets and classification benchmarks at the time of publication, e.g. ImageNet, contained classes with little visual similarity to one another, making classification on them comparatively easy. Beyond the literature of the time identifying breed classification as a challenging task, one practical application is reporting a pet's breed to its owner, many of whom are misinformed about their pet's breed.
The paper makes the following contributions: 1) it introduces a new cats-and-dogs dataset, the Oxford-IIIT dataset; 2) it provides a model for species-only, or species-plus-breed, classification; and 3) it demonstrates that the model breaks the ASIRRA challenge with considerable accuracy.
The Oxford-IIIT dataset, with 12 cat breeds and 25 dog breeds, is annotated with a breed label, and a bounding box for the animal's head is provided. Further, a pixel-level segmentation of the animal in each picture is also given, making the dataset very promising and easily usable for further research. Performance is measured as average per-class classification accuracy (the mean of the diagonal of the row-normalized confusion matrix).
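This metric can be sketched in a few lines; the toy 2x2 confusion matrix below is hypothetical, used only to illustrate the computation:

```python
import numpy as np

# Mean per-class accuracy: normalize each row of the confusion matrix
# (rows = true class, columns = predicted class), then average the diagonal.
C = np.array([[8, 2],
              [1, 9]], dtype=float)   # hypothetical 2-class confusion matrix
per_class = np.diag(C) / C.sum(axis=1)
mean_acc = per_class.mean()
print(per_class, mean_acc)  # [0.8 0.9] 0.85
```

Averaging per class, rather than over all samples, prevents well-represented breeds from dominating the score.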
The proposed model combines shape (the pet's face) and appearance (a bag-of-words model) to classify. It is a two-part model: 1) Shape model: a 'deformable part model' (DPM) that captures the face 'shape' from an image; this information becomes part of the feature descriptor. 2) Appearance model: a bag-of-words model that captures the animal's fur-texture information.
Size is an important indicator of animal breed, but the lack of an absolute scale reference in the dataset forces the model to rely instead on shape and on the appearance of fur as features.
Shape model: as in a standard deformable part model, a root part is connected by springs to eight smaller parts. Histogram of Oriented Gradients (HOG) filters are used to capture the local distribution of image edges. Dynamic programming finds the best compromise between matching parts to the image and minimally distorting the springs. The DPM is used to detect distinctive components of the animal body; the head annotations are used to learn one DPM for cat faces and one for dog faces.
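The HOG filters underlying the DPM can be illustrated with a minimal single-cell sketch; this is a simplified stand-in (random patch, one cell, unsigned orientations), not the paper's actual filter pipeline:

```python
import numpy as np

# Minimal HOG-style descriptor for one cell of a grayscale patch:
# bin gradient orientations, weighted by gradient magnitude.
rng = np.random.default_rng(2)
img = rng.random((8, 8))                       # hypothetical 8x8 patch

gy, gx = np.gradient(img)                      # image gradients
mag = np.hypot(gx, gy)                         # gradient magnitude
ang = np.mod(np.arctan2(gy, gx), np.pi)        # unsigned orientation in [0, pi)

bins = 9
idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)
hist = np.bincount(idx.ravel(), weights=mag.ravel(), minlength=bins)
hist /= np.linalg.norm(hist) + 1e-12           # simplified block normalization
print(hist.shape)  # (9,)
```

A full HOG descriptor tiles the image into many such cells and normalizes over overlapping blocks; the DPM scores these features for the root and the eight part filters.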
Appearance model: SIFT descriptors are extracted, then quantized against a vocabulary of 4,000 visual words learned by running k-means on randomly sampled features from the data. The quantized SIFT features are pooled into a spatial histogram of dimension 4,000 x the number of spatial bins. After normalization, an SVM with an exponential chi-squared kernel is used for classification. Different variants of the spatial histogram can be obtained by placing the spatial bins in accordance with different features of the pet; three spatial layouts for computing image descriptors are shown, and spatial histograms computed on separate spatial components are concatenated to obtain the final image descriptor. The first layout uses the whole image plus a 2x2 subdivision into four sub-regions. The second uses the head as one spatial tile and the remaining image as a single second tile. The last combines the descriptors of the first two with the foreground object (the pet) divided into five spatial bins and the background as a single bin without spatial subdivision. The resulting feature vectors grow to 20,000, 28,000, and 48,000 dimensions respectively.
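The pooling step can be sketched as follows; the word assignments and keypoint coordinates here are random placeholders standing in for real quantized SIFT output:

```python
import numpy as np

# Pool visual-word counts into a spatial histogram.
# Hypothetical inputs: `words` = visual-word index per keypoint,
# `xy` = keypoint coordinates normalized to [0, 1).
rng = np.random.default_rng(0)
V = 4000                                   # vocabulary size from the paper
words = rng.integers(0, V, size=500)
xy = rng.random((500, 2))

def spatial_histogram(words, xy, grid):
    """Count words per (grid x grid) tile, then L1-normalize."""
    gx, gy = grid
    hist = np.zeros((gx * gy, V))
    tile = (xy[:, 0] * gx).astype(int) * gy + (xy[:, 1] * gy).astype(int)
    for t, w in zip(tile, words):
        hist[t, w] += 1
    hist = hist.ravel()
    return hist / max(hist.sum(), 1)

# First layout: whole image (1 bin) + 2x2 subdivision (4 bins)
# -> 5 bins x 4,000 words = 20,000 dimensions.
desc = np.concatenate([spatial_histogram(words, xy, (1, 1)),
                       spatial_histogram(words, xy, (2, 2))])
print(desc.shape)  # (20000,)
```

The other layouts are built the same way, swapping in head/body or foreground/background tiles for the grid tiles.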
Automatic segmentation: the foreground and background regions, containing the pet and the rest of the scene respectively, are obtained using grab-cut segmentation. An SVM classifier assigns a confidence score to each superpixel, and these scores are used to initialize grab-cut. Berkeley's ultrametric contour map is used to obtain the superpixels of the image; a color histogram of the image provides the feature map, and a SIFT-BoW histogram is computed on it. 65% segmentation accuracy is achieved.
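Segmentation accuracy of this kind is typically measured as foreground overlap; a minimal sketch with two tiny hypothetical binary masks:

```python
import numpy as np

# Overlap (intersection over union) between a predicted foreground
# mask and the ground-truth mask; toy 2x3 masks for illustration.
pred = np.array([[1, 1, 0],
                 [1, 0, 0]], dtype=bool)
gt   = np.array([[1, 0, 0],
                 [1, 1, 0]], dtype=bool)
iou = np.logical_and(pred, gt).sum() / np.logical_or(pred, gt).sum()
print(iou)  # 0.5
```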
Two approaches are adopted to run the model: 1) a flat approach and 2) a hierarchical approach. The flat approach fits both parts of the model simultaneously, while the hierarchical approach first uses shape features to decide whether the image shows a cat or a dog, then uses the bag-of-words texture model to predict the breed. A VLFeat BoW classifier on the dataset is used as the baseline.
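The control flow of the hierarchical approach can be sketched as below; the decision functions are hypothetical stubs standing in for the trained shape and appearance SVMs:

```python
import numpy as np

def species_clf(x):
    # Stand-in for the shape-based species SVM.
    return "cat" if x[0] > 0.5 else "dog"

# Stand-ins for the per-species breed classifiers (12 cat, 25 dog breeds).
breed_clfs = {
    "cat": lambda x: f"cat_breed_{int(x[1] * 12)}",
    "dog": lambda x: f"dog_breed_{int(x[1] * 25)}",
}

def hierarchical_predict(x):
    # Stage 1: species from shape; stage 2: breed from appearance,
    # using only the classifier for the predicted species.
    species = species_clf(x)
    return species, breed_clfs[species](x)

print(hierarchical_predict(np.array([0.9, 0.3])))  # ('cat', 'cat_breed_3')
```

The flat approach would instead train a single 37-way classifier over the combined features.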
Pet-family discrimination: i) Shape only: the face detectors are run and their scores fed to a linear SVM to discriminate, giving 94.21% accuracy. ii) Appearance only: a non-linear SVM is trained on the spatial histograms of visual words. Accuracy depends on the spatial-bin layout used and increases as more bins are added, which in turn increases the feature-vector size. Using ground-truth segmentation rather than the automatic segmentation adds about 1% accuracy. iii) Shape and appearance: the linear kernel applied to the shape scores is summed with the exponential chi-squared kernel on appearance to combine the two sources of information, yielding 95.37% accuracy.
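The kernel combination can be sketched in numpy; the score and histogram matrices below are random placeholders, and the chi-squared implementation is a straightforward reading of the standard formula rather than the paper's exact code:

```python
import numpy as np

def exp_chi2_kernel(X, Y, gamma=1.0):
    # Exponential chi-squared kernel on non-negative histograms:
    # K(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i)).
    d = np.zeros((X.shape[0], Y.shape[0]))
    for i, x in enumerate(X):
        num = (x - Y) ** 2
        den = x + Y
        d[i] = np.where(den > 0, num / np.where(den > 0, den, 1), 0.0).sum(axis=1)
    return np.exp(-gamma * d)

rng = np.random.default_rng(1)
shape_scores = rng.random((6, 2))          # hypothetical detector scores
app_hists = rng.random((6, 8))             # hypothetical BoW histograms
app_hists /= app_hists.sum(axis=1, keepdims=True)

# Summing a linear kernel (shape) with the exp-chi2 kernel (appearance)
# gives a single valid kernel for the combined SVM.
K = shape_scores @ shape_scores.T + exp_chi2_kernel(app_hists, app_hists)
print(K.shape)  # (6, 6)
```

Summing two positive-definite kernels yields another positive-definite kernel, which is why the combination can be dropped directly into a standard SVM solver.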
Similarly, breed-classification accuracy increases as larger feature vectors are used and as the segmentation becomes more accurate. Results are shown in the table below.
Improvements: the paper was published in 2012, when its image segmentation achieved only 65% accuracy. Multiple better image-segmentation models have since been proposed. Since we are told that classification accuracy increases with segmentation accuracy, using a better segmentation model should increase classification accuracy.