


Learning to Hash with Binary Deep Neural Network

Thanh-Toan Do, Anh-Dzung Doan, and Ngai-Man Cheung

Singapore University of Technology and Design, Singapore, Singapore

{thanhtoan_do, dung_doan, ngaiman_cheung}@sutd.edu.sg

Abstract. This work proposes deep network models and learning algorithms for unsupervised and supervised binary hashing. Our novel network design constrains one hidden layer to directly output the binary codes. This addresses a challenging issue in some previous works: optimizing non-smooth objective functions due to binarization. Moreover, we incorporate independence and balance properties in the direct and strict forms in the learning. Furthermore, we include the similarity preserving property in our objective function. Our resulting optimization with these binary, independence, and balance constraints is difficult to solve. We propose to attack it with alternating optimization and careful relaxation. Experimental results on three benchmark datasets show that our proposed methods compare favorably with the state of the art.

Keywords: Learning to hash · Neural network · Discrete optimization

1 Introduction
We are interested in learning binary hash codes for large scale visual search.
Two main difficulties with large scale visual search are efficient storage and
fast searching. An attractive approach for handling these difficulties is binary
hashing, where each original high-dimensional vector x ∈ R^D is mapped to a very compact binary vector b ∈ {−1, 1}^L, where L ≪ D.
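For intuition (our illustration, not part of the paper), a minimal NumPy sketch of why such codes are attractive: ±1 codes can be packed into bits and compared with XOR and bit counts, so both storage and search are cheap. The helper names are our own.

```python
import numpy as np

def pack_codes(B):
    """Pack an (L, m) matrix of {-1, +1} codes into uint8 bytes, one column per sample."""
    bits = (B > 0).astype(np.uint8)           # map {-1, +1} -> {0, 1}
    return np.packbits(bits, axis=0)          # shape (ceil(L/8), m)

def hamming_distances(query_packed, db_packed):
    """Hamming distance from one packed query code to every packed database code."""
    xor = np.bitwise_xor(query_packed[:, None], db_packed)
    return np.unpackbits(xor, axis=0).sum(axis=0)

# toy usage: L = 32 bits, a database of 5 codes, one query
rng = np.random.default_rng(0)
B_db = np.sign(rng.standard_normal((32, 5)))
b_q = np.sign(rng.standard_normal((32, 1)))
print(hamming_distances(pack_codes(b_q)[:, 0], pack_codes(B_db)))
```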
Hashing methods can be divided into two categories: data-independent and
data-dependent. Methods in data-independent category [1–4] rely on random
projections for constructing hash functions. Methods in data-dependent category
use the available training data to learn the hash functions in unsupervised [5–9]
or supervised manner [10–15]. The review of data-independent/data-dependent
hashing methods can be found in recent surveys [16–18].
One difficult problem in hashing is to deal with the binary constraint on
the codes. Specifically, the outputs of the hash functions have to be binary. In
general, this binary constraint leads to an NP-hard mixed-integer optimization
problem. To handle this difficulty, most aforementioned methods relax the con-
straint during the learning of hash functions. With this relaxation, the continuous
codes are learned first. Then, the codes are binarized (e.g., with thresholding).

© Springer International Publishing AG 2016
B. Leibe et al. (Eds.): ECCV 2016, Part V, LNCS 9909, pp. 219–234, 2016.
DOI: 10.1007/978-3-319-46454-1_14

This relaxation greatly simplifies the original binary constrained problem. How-
ever, the solution can be suboptimal, i.e., the binary codes resulting from thresh-
olded continuous codes could be inferior to those that are obtained by including
the binary constraint in the learning.
Furthermore, a good hashing method should produce binary codes with the
properties [5]: (i) similarity preserving, i.e., (dis)similar inputs should likely have
(dis)similar binary codes; (ii) independence, i.e., different bits in the binary codes
are independent of each other; (iii) balance, i.e., each bit has a 50% chance of being 1 or −1. The direct incorporation of the independence and balance properties can complicate the learning. Previous work has used some relaxation to work around the problem [6,19,20], but there may be some performance degradation.

1.1 Related Work


Our work is inspired by a few recent successful hashing methods which define
hash functions as a neural network [19,21,22]. We propose an improved design
to address their limitations. In Semantic Hashing [21], the model is formed by a
stack of Restricted Boltzmann Machine, and a pretraining step is required. This
model does not consider the independence and balance of the codes. In Binary
Autoencoder [22], a linear autoencoder is used as hash functions. As this model
only uses one hidden layer, it may not well capture the information of inputs.
Extending [22] with multiple, nonlinear layers is not straightforward because of the binary constraint. They also do not consider the independence and balance of the codes. In Deep Hashing [19], a deep neural network is used as the hash function. However, this model does not fully take similarity preservation into account. They also apply some relaxation in arriving at the independence and balance of the codes, and this may degrade the performance.
In order to handle the binary constraint, Semantic Hashing [21] first solves
the relaxed problem by discarding the constraint and then thresholds the solved
continuous solution. In Deep Hashing (DH) [19], the output of the last layer, H^{(n)}, is binarized by the sgn function. They include a term in the objective function to reduce this binarization loss: (sgn(H^{(n)}) − H^{(n)}). Solving the objective function of DH [19] is difficult because the sgn function is non-differentiable. The authors in [19] work around this difficulty by assuming that the sgn function is differentiable everywhere. In Binary Autoencoder (BA) [22], the outputs of the
hidden layer are passed into a step function to binarize the codes. Incorporating
the step function in the learning leads to a non-smooth objective function and the
optimization is NP-complete. To handle this difficulty, they use binary SVMs to
learn the model parameters in the case when there is only a single hidden layer.

1.2 Contribution
In this work, we first propose a novel deep network model and learning algorithm for unsupervised hashing. In order to achieve binary codes, instead of involving the sgn or step function as in [19,22], our proposed network design constrains one layer to directly output the binary codes (hence the network is called Binary Deep Neural Network).

Table 1. Notations and their corresponding meanings.

Notation    Meaning
X           X = {x_i}_{i=1}^m ∈ R^{D×m}: set of m training samples; each column of X corresponds to one sample
B           B = {b_i}_{i=1}^m ∈ {−1, +1}^{L×m}: binary code of X
L           Number of bits in the output binary code to encode a sample
n           Number of layers (including input and output layers)
s_l         Number of units in layer l
f^{(l)}     Activation function of layer l
W^{(l)}     W^{(l)} ∈ R^{s_{l+1}×s_l}: weight matrix connecting layer l+1 and layer l
c^{(l)}     c^{(l)} ∈ R^{s_{l+1}}: bias vector for units in layer l+1
H^{(l)}     H^{(l)} = f^{(l)}(W^{(l−1)} H^{(l−1)} + c^{(l−1)} 1_{1×m}): output values of layer l; convention: H^{(1)} = X
1_{a×b}     Matrix with a rows, b columns and all elements equal to 1

Moreover, we propose to directly incorporate the independence and balance properties without relaxing them. Furthermore,
we include the similarity preserving in our objective function. The resulting
optimization with these binary and direct constraints is NP-hard. We propose
to attack this challenging problem with alternating optimization and careful
relaxation. To enhance the discriminative power of the binary codes, we then
extend our method to supervised hashing by leveraging the label information
such that the binary codes preserve the semantic similarity between samples.
The solid experiments on three benchmark datasets show the improvement of
the proposed methods over state-of-the-art hashing methods.
The remainder of this paper is organized as follows. Section 2 and Sect. 3
present and evaluate the proposed unsupervised hashing method, respectively.
Section 4 and Sect. 5 present and evaluate the proposed supervised hashing
method, respectively. Section 6 concludes the paper.

2 Unsupervised Hashing with Binary Deep Neural Network (UH-BDNN)
2.1 Formulation of UH-BDNN
We summarize the notations in Table 1. In our work, the hash functions are defined by a deep neural network. In our proposed design, we use different activation functions in different layers. Specifically, we use the sigmoid function as the activation function for layers 2, · · · , n − 2, and the identity function as the activation function for layer n − 1 and layer n. Our idea is to learn the network such that the output values of the penultimate layer (layer n − 1) can be used as the binary codes.
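To make the architecture concrete, here is a minimal NumPy sketch (our illustration, not the authors' code) of the forward pass implied by Table 1 and the activation choices above; the function name and argument layout are our own.

```python
import numpy as np

def forward(X, Ws, cs):
    """Forward pass of the described architecture (a sketch, not the authors' code).

    X  : (D, m) data matrix, one sample per column (H^(1) = X).
    Ws : [W^(1), ..., W^(n-1)], with W^(l) of shape (s_{l+1}, s_l).
    cs : [c^(1), ..., c^(n-1)], with c^(l) of shape (s_{l+1}, 1).

    Sigmoid activations are used when producing layers 2..n-2 and the identity
    when producing layers n-1 and n, matching the design above.
    """
    Hs = [X]                              # H^(1) = X
    n_weighted = len(Ws)                  # n - 1 weighted layers
    for l, (W, c) in enumerate(zip(Ws, cs), start=1):
        Z = W @ Hs[-1] + c                # c broadcasts over the m columns (c^(l) 1_{1xm})
        if l <= n_weighted - 2:           # producing layers 2 .. n-2: sigmoid
            H = 1.0 / (1.0 + np.exp(-Z))
        else:                             # producing layers n-1 and n: identity
            H = Z
        Hs.append(H)
    return Hs                             # Hs[-2] is H^(n-1), the (relaxed) code layer
```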

Fig. 1. The illustration of our network (D = 4, L = 2). In our proposed network design,
the outputs of layer n − 1 are constrained to {−1, 1} and are used as the binary codes.
During training, these codes are used to reconstruct the input samples at the final
layer.

We introduce constraints in the learning algorithm such that the output values at layer n − 1 have the following desirable properties: (i) belonging to {−1, 1}; (ii) similarity preserving; (iii) independent; and (iv) balanced. Figure 1 illustrates our network for the case D = 4, L = 2.
Let us start with the first two properties of the codes, i.e., belonging to {−1, 1} and similarity preserving. To achieve binary codes having these two properties, we propose to optimize the following constrained objective function:

  \min_{W,c} J = \frac{1}{2m} \left\| X - \left( W^{(n-1)} H^{(n-1)} + c^{(n-1)} 1_{1\times m} \right) \right\|^2 + \frac{\lambda_1}{2} \sum_{l=1}^{n-1} \left\| W^{(l)} \right\|^2          (1)

  s.t.  H^{(n-1)} \in \{-1, 1\}^{L\times m}          (2)


The constraint (2) is to ensure the first property. As the activation function for the last layer is the identity function, the term W^{(n-1)} H^{(n-1)} + c^{(n-1)} 1_{1\times m} is the output of the last layer. The first term
of (1) makes sure that the binary code gives a good reconstruction of X. It
is worth noting that the reconstruction criterion has been used as an indirect
way for preserving the similarity in state-of-the-art unsupervised hashing meth-
ods [6,21,22], i.e., it encourages (dis)similar inputs to map to (dis)similar binary
codes. The second term is a regularization that tends to decrease the magnitude
of the weights, and this helps to prevent overfitting. Note that in our proposed
design, we constrain to directly output the binary codes at one layer, and this
avoids the difficulties with the sgn/step function such as non-differentiability.
On the other hand, our formulation with (1) under the binary constraint (2)
is very difficult to solve. It is a mixed-integer problem which is NP-hard. We
propose to attack the problem using alternating optimization by introducing an
auxiliary variable. Using the auxiliary variable B, we reformulate the objective
function (1) under constraint (2) as

  \min_{W,c,B} J = \frac{1}{2m} \left\| X - W^{(n-1)} B - c^{(n-1)} 1_{1\times m} \right\|^2 + \frac{\lambda_1}{2} \sum_{l=1}^{n-1} \left\| W^{(l)} \right\|^2          (3)

  s.t.  B = H^{(n-1)}          (4)

        B \in \{-1, 1\}^{L\times m}          (5)
The benefit of introducing the auxiliary variable B is that we can decom-
pose the difficult constrained optimization problem (1) into two sub-optimization
problems. Then, we can iteratively solve the optimization by using alternating
optimization with respect to (W, c) and B while holding the other fixed. We
will discuss the details of the alternating optimization in a moment. Using the
idea of the quadratic penalty method [23], we relax the equality constraint (4)
by solving the following constrained objective function
  \min_{W,c,B} J = \frac{1}{2m} \left\| X - W^{(n-1)} B - c^{(n-1)} 1_{1\times m} \right\|^2 + \frac{\lambda_1}{2} \sum_{l=1}^{n-1} \left\| W^{(l)} \right\|^2 + \frac{\lambda_2}{2m} \left\| H^{(n-1)} - B \right\|^2          (6)

  s.t.  B \in \{-1, 1\}^{L\times m}          (7)


The third term in (6) measures the (equality) constraint violation. By setting
the penalty parameter λ2 sufficiently large, we penalize the constraint violation
severely, thereby forcing the minimizer of the penalty function (6) closer to the
feasible region of the original constrained function (3).
Now let us consider the two remaining properties of the codes, i.e., indepen-
dence and balance. Unlike previous works which use some relaxation or approx-
imation on the independence and balance properties [6,19,20], we propose to
encode these properties strictly and directly based on the binary outputs of our
layer n − 1 (alternatively, we could constrain the independence and balance on B; this, however, makes the optimization very difficult). Specifically, we encode the independence and balance properties of the codes by having the fourth and the fifth term, respectively, in the following constrained objective function:

1 

2 λ n−1
 1
 
 (l) 2
min J = X − W(n−1) B − c(n−1) 11×m  + W 
W,c,B 2m 2
l=1
   2  2
λ2  (n−1) 
λ3  1 (n−1) (n−1) T 
  + λ4  
2
+ H − B + H (H ) − I H (n−1)
1m×1 
2m 2 m  2m
(8)

s.t. B ∈ {−1, 1}L×m (9)
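For reference, a small NumPy sketch (our illustration) of how the value of objective (8) can be evaluated term by term; the function name and argument layout are our own.

```python
import numpy as np

def uh_bdnn_objective(X, B, H_code, W_last, c_last, Ws, lambdas):
    """Value of objective (8) (a sketch under the notation of Table 1).

    X (D, m) data; B (L, m) auxiliary binary codes; H_code = H^(n-1) (L, m);
    W_last = W^(n-1) (D, L); c_last = c^(n-1) (D, 1); Ws = all weight matrices;
    lambdas = (lambda_1, lambda_2, lambda_3, lambda_4).
    """
    l1, l2, l3, l4 = lambdas
    m = X.shape[1]
    L = B.shape[0]
    recon = X - W_last @ B - c_last                                 # c_last acts as c 1_{1xm}
    J = np.sum(recon ** 2) / (2 * m)                                # reconstruction
    J += l1 / 2 * sum(np.sum(W ** 2) for W in Ws)                   # weight decay
    J += l2 / (2 * m) * np.sum((H_code - B) ** 2)                   # penalty on B = H^(n-1)
    J += l3 / 2 * np.sum((H_code @ H_code.T / m - np.eye(L)) ** 2)  # independence
    J += l4 / (2 * m) * np.sum((H_code @ np.ones((m, 1))) ** 2)     # balance
    return J
```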


Problem (8) under constraint (9) is our final formulation. Before discussing how to solve it, let us present the differences between our work and the recent deep learning-based hashing models Deep Hashing [19] and Binary Autoencoder [22].
The first important difference between our model and Deep Hashing [19]
/ Binary Autoencoder [22] is the way to achieve the binary codes. Instead of

involving the sgn or step function as in [19,22], we constrain the network to


directly output the binary codes at one layer. Other differences are presented as
follows.
Comparison to Deep Hashing (DH) [19]: the deep model of DH is learned by the
following formulation:
  \min_{W,c} J = \frac{1}{2} \left\| sgn(H^{(n)}) - H^{(n)} \right\|^2 - \frac{\alpha_1}{2m} tr\left( H^{(n)} (H^{(n)})^T \right) + \frac{\alpha_2}{2} \sum_{l=1}^{n-1} \left\| W^{(l)} (W^{(l)})^T - I \right\|^2 + \frac{\alpha_3}{2} \sum_{l=1}^{n-1} \left( \left\| W^{(l)} \right\|^2 + \left\| c^{(l)} \right\|^2 \right)

The DH model does not have the reconstruction layer. They apply the sgn function to the outputs of the top layer of the network to obtain the binary codes. The first term aims to minimize the quantization loss of applying the sgn function to the outputs of the top layer. The balance and independence properties are contained in the second and the third terms [19]. It is worth noting that minimizing DH's objective function is difficult due to the non-differentiability of the sgn function. The authors work around this difficulty by assuming that the sgn function is differentiable everywhere.
Contrary to DH, we propose a different model design. In particular, our model encourages similarity preservation by having the reconstruction layer in the network. For the balance property, they maximize tr(H^{(n)} (H^{(n)})^T). According to [20], maximizing this term is only an approximation in arriving at the balance property. In our objective function, the balance property is directly enforced on the codes by the term \| H^{(n-1)} 1_{m\times 1} \|^2. For the independence property, DH uses a relaxed orthogonality constraint \| W^{(l)} (W^{(l)})^T - I \|^2, i.e., a constraint on the network weights W. On the contrary, we (once again) directly constrain the codes using \| \frac{1}{m} H^{(n-1)} (H^{(n-1)})^T - I \|^2. Incorporating the strict constraints can lead to better performance.
Comparison to Binary Autoencoder (BA) [22]: the differences between our model
and BA are quite clear. BA as described in [22] is a shallow linear autoencoder
network with one hidden layer. The BA’s hash function is a linear transformation
of the input followed by the step function to obtain the binary codes. In BA, by
treating the encoder layer as binary classifiers, they use binary SVMs to learn the
weights of the linear transformation. On the contrary, our hash function is defined
by multiple, hierarchical layers of nonlinear and linear transformations. It is not
clear if the binary SVMs approach in BA can be used to learn the weights in our
deep architecture with multiple layers. Instead, we use alternating optimization
to derive a backpropagation algorithm to learn the weights in all layers. Another
difference is that our model ensures the independence and balance of the binary
codes while BA does not. Note that independence and balance properties may
not be easily incorporated in their framework, as these would complicate their
objective function and the optimization problem may become very difficult to
solve.

2.2 Optimization
In order to solve (8) under constraint (9), we propose to use alternating optimization over (W, c) and B.
(W, c) step. When fixing B, the problem becomes an unconstrained optimization. We use the L-BFGS [24] optimizer with backpropagation to solve it. The gradients of the objective function J in (8) w.r.t. the different parameters are computed as follows.
At l = n − 1, we have

  \frac{\partial J}{\partial W^{(n-1)}} = \frac{-1}{m} \left( X - W^{(n-1)} B - c^{(n-1)} 1_{1\times m} \right) B^T + \lambda_1 W^{(n-1)}          (10)

  \frac{\partial J}{\partial c^{(n-1)}} = \frac{-1}{m} \left( \left( X - W^{(n-1)} B \right) 1_{m\times 1} - m\, c^{(n-1)} \right)          (11)

For other layers, let us define

  \Delta^{(n-1)} = \left[ \frac{\lambda_2}{m} \left( H^{(n-1)} - B \right) + \frac{2\lambda_3}{m} \left( \frac{1}{m} H^{(n-1)} (H^{(n-1)})^T - I \right) H^{(n-1)} + \frac{\lambda_4}{m} H^{(n-1)} 1_{m\times m} \right] \odot f'^{(n-1)}(Z^{(n-1)})          (12)

  \Delta^{(l)} = \left( (W^{(l)})^T \Delta^{(l+1)} \right) \odot f'^{(l)}(Z^{(l)}), \quad \forall l = n-2, \cdots, 2          (13)

where \odot denotes the Hadamard product and Z^{(l)} = W^{(l-1)} H^{(l-1)} + c^{(l-1)} 1_{1\times m}, l = 2, \cdots, n.
Then, \forall l = n-2, \cdots, 1, we have

  \frac{\partial J}{\partial W^{(l)}} = \Delta^{(l+1)} (H^{(l)})^T + \lambda_1 W^{(l)}          (14)

  \frac{\partial J}{\partial c^{(l)}} = \Delta^{(l+1)} 1_{m\times 1}          (15)
B step. When fixing (W, c), we can rewrite problem (8) as

  \min_{B} J = \left\| X - W^{(n-1)} B - c^{(n-1)} 1_{1\times m} \right\|^2 + \lambda_2 \left\| H^{(n-1)} - B \right\|^2          (16)

  s.t.  B \in \{-1, 1\}^{L\times m}          (17)


We adapt the recent discrete cyclic coordinate descent method [15] to iteratively solve B, i.e., row by row. The advantage of this method is that if we fix L − 1 rows of B and only solve for the remaining row, we can achieve a closed-form solution for that row.
Let V = X − c^{(n-1)} 1_{1\times m} and Q = (W^{(n-1)})^T V + \lambda_2 H^{(n-1)}. For k = 1, · · · , L, let w_k be the k-th column of W^{(n-1)}; W_1 be the matrix W^{(n-1)} excluding w_k; q_k be the k-th column of Q^T; b_k^T be the k-th row of B; and B_1 be the matrix B excluding b_k^T. We have the closed form for b_k^T as

  b_k^T = sgn(q_k^T - w_k^T W_1 B_1)          (18)
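A NumPy sketch (our illustration) of this B step; the warm start and the number of sweeps over the rows are our own choices, since the paper only specifies the row-by-row update (18).

```python
import numpy as np

def b_step(X, H_code, W_last, c_last, lam2, n_sweeps=3):
    """B step of UH-BDNN via cyclic coordinate descent over the rows of B (a sketch).

    Implements the closed-form row update (18): with V = X - c^(n-1) 1_{1xm} and
    Q = (W^(n-1))^T V + lambda_2 H^(n-1), each row is b_k^T = sgn(q_k^T - w_k^T W_1 B_1).
    """
    L, m = H_code.shape
    V = X - c_last                               # c_last (D, 1) broadcasts over the m columns
    Q = W_last.T @ V + lam2 * H_code             # (L, m)
    B = np.sign(H_code)
    B[B == 0] = 1                                # keep entries in {-1, +1}
    for _ in range(n_sweeps):
        for k in range(L):
            w_k = W_last[:, k]                   # k-th column of W^(n-1)
            W1 = np.delete(W_last, k, axis=1)    # W^(n-1) without w_k
            B1 = np.delete(B, k, axis=0)         # B without its k-th row
            b_k = np.sign(Q[k, :] - w_k @ W1 @ B1)
            b_k[b_k == 0] = 1
            B[k, :] = b_k
    return B
```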
The proposed UH-BDNN method is summarized in Algorithm 1. In Algorithm 1, B^{(t)} and (W, c)^{(t)} are the values of B and {W^{(l)}, c^{(l)}}_{l=1}^{n-1} at iteration t.

Algorithm 1. Unsupervised Hashing with Binary Deep Neural Network (UH-BDNN)

Input:
  X = {x_i}_{i=1}^m ∈ R^{D×m}: training data; L: code length; T: maximum iteration number; n: number of layers; {s_l}_{l=2}^n: number of units of layers 2 → n (note: s_{n−1} = L, s_n = D); λ_1, λ_2, λ_3, λ_4.
Output:
  Parameters {W^{(l)}, c^{(l)}}_{l=1}^{n−1}

1: Initialize B^{(0)} ∈ {−1, 1}^{L×m} using ITQ [6].
2: Initialize {c^{(l)}}_{l=1}^{n−1} = 0_{s_{l+1}×1}. Initialize {W^{(l)}}_{l=1}^{n−2} by taking the top s_{l+1} eigenvectors of the covariance matrix of H^{(l)}. Initialize W^{(n−1)} = I_{D×L}.
3: Fix B^{(0)}, compute (W, c)^{(0)} with the (W, c) step, using the initialized {W^{(l)}, c^{(l)}}_{l=1}^{n−1} (line 2) as the starting point for L-BFGS.
4: for t = 1 → T do
5:   Fix (W, c)^{(t−1)}, compute B^{(t)} with the B step.
6:   Fix B^{(t)}, compute (W, c)^{(t)} with the (W, c) step, using (W, c)^{(t−1)} as the starting point for L-BFGS.
7: end for
8: Return (W, c)^{(T)}
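Putting the pieces together, a schematic driver mirroring Algorithm 1 (our illustration). The helpers `itq_init`, `init_weights`, and `w_c_step` are hypothetical: `itq_init` stands for the ITQ initialization of B^(0), `init_weights` for the PCA-style initialization of line 2, and `w_c_step` for an L-BFGS solve (e.g. via scipy.optimize.minimize with method="L-BFGS-B") over the objective and gradients sketched above.

```python
# Schematic driver for Algorithm 1 (hypothetical helper names, see the note above).
def train_uh_bdnn(X, L, T, lambdas):
    B = itq_init(X, L)                                  # line 1: B^(0) from ITQ
    Ws, cs = init_weights(X, L)                         # line 2: PCA-style initialization
    Ws, cs = w_c_step(X, B, Ws, cs, lambdas)            # line 3: first (W, c) step
    for _ in range(T):                                  # lines 4-7
        H_code = forward(X, Ws, cs)[-2]                 # current H^(n-1)
        B = b_step(X, H_code, Ws[-1], cs[-1], lambdas[1])
        Ws, cs = w_c_step(X, B, Ws, cs, lambdas)
    return Ws, cs                                       # line 8: learned hash parameters
```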

3 Evaluation of Unsupervised Hashing with Binary Deep Neural Network (UH-BDNN)
This section evaluates the proposed UH-BDNN and compares it to the following
state-of-the-art unsupervised hashing methods: Spectral Hashing (SH) [5], Iter-
ative Quantization (ITQ) [6], Binary Autoencoder (BA) [22], Spherical Hashing
(SPH) [8], K-means Hashing (KMH) [7]. For all compared methods, we use the
implementations and the suggested parameters provided by the authors.

3.1 Dataset, Evaluation Protocol, and Implementation Note


Dataset. CIFAR10 [25] dataset consists of 60,000 images of 10 classes. The
training set (also used as database for retrieval) contains 50,000 images. The
query set contains 10,000 images. Each image is represented by an 800-dimensional feature vector extracted by PCA from the 4096-dimensional CNN feature produced by AlexNet [26].
MNIST [27] dataset consists of 70,000 handwritten digit images of 10 classes.
The training set (also used as database for retrieval) contains 60,000 images. The
query set contains 10,000 images. Each image is represented by a 784 dimensional
gray-scale feature vector by using its intensity.
SIFT1M [28] dataset contains 128-dimensional SIFT vectors [29]. There are 1M vectors used as the database for retrieval, 100K vectors for training (separate from the retrieval database), and 10K vectors for queries.
Evaluation protocol. We follow the standard setting in unsupervised hashing [6–8,22], using Euclidean nearest neighbors as the ground truths for queries. The number of ground truths is set as in [22]: for the CIFAR10 and MNIST datasets, for each query, we use its 50 Euclidean nearest neighbors as ground truths; for the large-scale dataset SIFT1M, for each query, we use its 10,000 Euclidean nearest neighbors as ground truths.

Fig. 2. mAP comparison between UH-BDNN and state-of-the-art unsupervised hashing methods on CIFAR10, MNIST, and SIFT1M.

Table 2. Precision at Hamming distance r = 2 comparison between UH-BDNN and state-of-the-art unsupervised hashing methods on CIFAR10, MNIST, and SIFT1M.

            CIFAR10                      MNIST                        SIFT1M
L           8     16    24     32        8     16    24     32        8     16     24     32
UH-BDNN     0.55  5.79  22.14  18.35     0.53  6.80  29.38  38.50     4.80  25.20  62.20  80.55
BA [22]     0.55  5.65  20.23  17.00     0.51  6.44  27.65  35.29     3.85  23.19  61.35  77.15
ITQ [6]     0.54  5.05  18.82  17.76     0.51  5.87  23.92  36.35     3.19  14.07  35.80  58.69
SH [5]      0.39  4.23  14.60  15.22     0.43  6.50  27.08  36.69     4.67  24.82  60.25  72.40
SPH [8]     0.43  3.45  13.47  13.67     0.44  5.02  22.24  30.80     4.25  20.98  47.09  66.42
KMH [7]     0.53  5.49  19.55  15.90     0.50  6.36  25.68  36.24     3.74  20.74  48.86  76.04

To measure the performance of the methods, we use two evaluation metrics which have been used in the state of the art [6,19,22]: (1) mean Average Precision (mAP); and (2) precision at Hamming radius 2 (precision@2), which measures the precision of retrieved images having Hamming distance to the query of at most 2 (if no image satisfies this, we report zero precision). Note that as computing mAP is slow on the large-scale dataset SIFT1M, we consider the top 10,000 returned neighbors when computing mAP.
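For clarity, a small sketch (our illustration) of the precision@2 metric for a single query, following the zero-precision convention above:

```python
import numpy as np

def precision_at_radius(hamming_dist, relevant, radius=2):
    """precision@2-style metric for one query (a sketch of the protocol above).

    hamming_dist : (N,) Hamming distances from the query to all database items.
    relevant     : (N,) boolean ground-truth relevance (e.g. Euclidean nearest neighbors).
    Returns 0 when no item falls within the radius, as stated above.
    """
    retrieved = hamming_dist <= radius
    if not retrieved.any():
        return 0.0
    return float(relevant[retrieved].mean())
```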
Implementation note. In our deep model, we use n = 5 layers. The parameters λ_1, λ_2, λ_3 and λ_4 are empirically set by cross validation to 10^{-5}, 5 × 10^{-2}, 10^{-2} and 10^{-6}, respectively. The maximum iteration number T is empirically set to 10. The numbers of units in hidden layers 2, 3, 4 are empirically set to [90 → 20 → 8], [90 → 30 → 16], [100 → 40 → 24] and [120 → 50 → 32] for 8, 16, 24 and 32 bits, respectively.
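As a concrete reading of this configuration (our interpretation, not stated in this exact form in the paper), the 32-bit UH-BDNN on the 800-dimensional CIFAR10 features would use the following layer sizes, with layer 4 acting as the code layer and layer 5 reconstructing the input:

```python
# n = 5 layers for 32-bit codes on 800-dimensional input features (CIFAR10 setting, our reading):
# s_1 = 800 (input), s_2 = 120, s_3 = 50, s_4 = L = 32 (code layer), s_5 = D = 800 (reconstruction)
layer_sizes = [800, 120, 50, 32, 800]
```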

3.2 Retrieval Results

Figure 2 and Table 2 show the comparative mAP and precision at Hamming radius 2 (precision@2), respectively. We find the following observations are consistent for all three datasets. In terms of mAP, the proposed UH-BDNN is comparable to or outperforms the other methods at all code lengths. The improvement is clearer at high code lengths, i.e., L = 24, 32. The mAP of UH-BDNN consistently

outperforms that of binary autoencoder (BA) [22], which is the current state-
of-the-art unsupervised hashing method. In terms of precision@2, UH-BDNN is
comparable to other methods at low L, i.e., L = 8, 16. At L = 24, 32, UH-BDNN
significantly outperforms other methods.
Comparison with Deep Hashing (DH) [19]: As the implementation of DH is not available, we set up the experiments on CIFAR10 and MNIST similarly to [19] to make a fair comparison. For each dataset, we randomly sample 1,000 images, 100 per class, as the query set; the remaining images are used as the training/database set. Following [19], for CIFAR10, each image is represented by a 512-D GIST descriptor [30]. The ground truths of queries are based on their class labels. Similarly to [19], we report comparative results in terms of mAP and the precision at Hamming radius r = 2. The comparative results are presented in Table 3. Table 3 clearly shows that the proposed UH-BDNN outperforms DH [19] at all code lengths, in both mAP and precision at Hamming radius 2.

Table 3. Comparison with Deep Hashing (DH) [19]. The results of DH are cited from [19].

            CIFAR10                              MNIST
            mAP             precision@2          mAP             precision@2
L           16      32      16      32           16      32      16      32
DH [19]     16.17   16.62   23.33   15.77        43.14   44.97   66.10   73.29
UH-BDNN     17.83   18.52   24.97   18.85        45.38   47.21   69.13   75.26

4 Supervised Hashing with Binary Deep Neural Network (SH-BDNN)
In order to enhance the discriminative power of the binary codes, we extend
UH-BDNN to supervised hashing by leveraging the label information. There are
several approaches proposed to leverage the label information, leading to differ-
ent criteria on binary codes. In [10,31], binary codes are learned such that they
minimize the Hamming distance among within-class samples, while maximizing
the Hamming distance among between-class samples. In [15], the binary codes
are learned such that they are optimal for linear classification.
In this work, in order to leverage the label information, we follow the approach proposed in Kernel-based Supervised Hashing (KSH) [11]. The benefit of this approach is that it directly encourages the Hamming distances between binary codes of within-class samples to be equal to 0, and the Hamming distances between binary codes of between-class samples to be equal to L. In other words, it tries to perfectly preserve the semantic similarity. To achieve this goal, it enforces that the Hamming distance between learned binary codes has to highly correlate with the pre-computed pairwise label matrix.
In general, the network structure of SH-BDNN is similar to that of UH-BDNN, except that the reconstruction layer (the last layer) of UH-BDNN is removed.

The layer n − 1 in UH-BDNN becomes the last layer in SH-BDNN. All desir-
able properties, i.e. semantic similarity preserving, independence, and balance,
in SH-BDNN are constrained on the outputs of its last layer.

4.1 Formulation of SH-BDNN


We define the pairwise label matrix S as

  S_{ij} = \begin{cases} 1 & \text{if } x_i \text{ and } x_j \text{ are of the same class} \\ -1 & \text{if } x_i \text{ and } x_j \text{ are not of the same class} \end{cases}          (19)
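A one-line NumPy sketch (our illustration) of constructing S from a label vector:

```python
import numpy as np

def pairwise_label_matrix(y):
    """Pairwise label matrix S of (19): S_ij = 1 for same-class pairs, -1 otherwise."""
    y = np.asarray(y).reshape(-1, 1)
    return np.where(y == y.T, 1.0, -1.0)
```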

To achieve the semantic similarity preserving property, we learn the binary codes such that the Hamming distance between learned binary codes highly correlates with the matrix S, i.e., we want to minimize the quantity \| \frac{1}{L} (H^{(n)})^T H^{(n)} - S \|^2. In addition, to achieve the independence and balance properties of the codes, we want to minimize the quantities \| \frac{1}{m} H^{(n)} (H^{(n)})^T - I \|^2 and \| H^{(n)} 1_{m\times 1} \|^2.
Following the same reformulation and relaxation as in UH-BDNN (Sect. 2.1), we solve the following constrained optimization, which ensures the binary constraint, the semantic similarity preserving, the independence, and the balance properties of the codes:

  \min_{W,c,B} J = \frac{1}{2m} \left\| \frac{1}{L} (H^{(n)})^T H^{(n)} - S \right\|^2 + \frac{\lambda_1}{2} \sum_{l=1}^{n-1} \left\| W^{(l)} \right\|^2 + \frac{\lambda_2}{2m} \left\| H^{(n)} - B \right\|^2 + \frac{\lambda_3}{2} \left\| \frac{1}{m} H^{(n)} (H^{(n)})^T - I \right\|^2 + \frac{\lambda_4}{2m} \left\| H^{(n)} 1_{m\times 1} \right\|^2          (20)

  s.t.  B \in \{-1, 1\}^{L\times m}          (21)
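As with the unsupervised case, a small NumPy sketch (our illustration) of evaluating objective (20) term by term; the function name and argument layout are our own.

```python
import numpy as np

def sh_bdnn_objective(S, B, H_out, Ws, lambdas):
    """Value of objective (20) (a sketch); H_out is the last-layer output H^(n)."""
    l1, l2, l3, l4 = lambdas
    L, m = H_out.shape
    J = np.sum((H_out.T @ H_out / L - S) ** 2) / (2 * m)          # semantic similarity
    J += l1 / 2 * sum(np.sum(W ** 2) for W in Ws)                 # weight decay
    J += l2 / (2 * m) * np.sum((H_out - B) ** 2)                  # binary-code penalty
    J += l3 / 2 * np.sum((H_out @ H_out.T / m - np.eye(L)) ** 2)  # independence
    J += l4 / (2 * m) * np.sum((H_out @ np.ones((m, 1))) ** 2)    # balance
    return J
```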


(20) under constraint (21) is our formulation for supervised hashing. The main
difference in formulation between UH-BDNN (8) and SH-BDNN (20) is that
the reconstruction term preserving the neighbor similarity in UH-BDNN (8) is
replaced by the term preserving the label similarity in SH-BDNN (20).

4.2 Optimization
In order to solve (20) under constraint (21), we alternately optimize over (W, c) and B.
(W, c) step. When fixing B, (20) becomes an unconstrained optimization. We use the L-BFGS [24] optimizer with backpropagation to solve it. The gradients of the objective function J in (20) w.r.t. the different parameters are computed as follows.
Let us define

  \Delta^{(n)} = \left[ \frac{1}{mL} H^{(n)} \left( V + V^T \right) + \frac{\lambda_2}{m} \left( H^{(n)} - B \right) + \frac{2\lambda_3}{m} \left( \frac{1}{m} H^{(n)} (H^{(n)})^T - I \right) H^{(n)} + \frac{\lambda_4}{m} H^{(n)} 1_{m\times m} \right] \odot f'^{(n)}(Z^{(n)})          (22)

Algorithm 2. Supervised Hashing with Binary Deep Neural Network (SH-BDNN)

Input:
  X = {x_i}_{i=1}^m ∈ R^{D×m}: training data; Y ∈ R^{m×1}: training label vector; L: code length; T: maximum iteration number; n: number of layers; {s_l}_{l=2}^n: number of units of layers 2 → n (note: s_n = L); λ_1, λ_2, λ_3, λ_4.
Output:
  Parameters {W^{(l)}, c^{(l)}}_{l=1}^{n−1}

1: Compute the pairwise label matrix S using (19).
2: Initialize B^{(0)} ∈ {−1, 1}^{L×m} using ITQ [6].
3: Initialize {c^{(l)}}_{l=1}^{n−1} = 0_{s_{l+1}×1}. Initialize {W^{(l)}}_{l=1}^{n−1} by taking the top s_{l+1} eigenvectors of the covariance matrix of H^{(l)}.
4: Fix B^{(0)}, compute (W, c)^{(0)} with the (W, c) step, using the initialized {W^{(l)}, c^{(l)}}_{l=1}^{n−1} (line 3) as the starting point for L-BFGS.
5: for t = 1 → T do
6:   Fix (W, c)^{(t−1)}, compute B^{(t)} with the B step.
7:   Fix B^{(t)}, compute (W, c)^{(t)} with the (W, c) step, using (W, c)^{(t−1)} as the starting point for L-BFGS.
8: end for
9: Return (W, c)^{(T)}

where V = \frac{1}{L} (H^{(n)})^T H^{(n)} - S.

  \Delta^{(l)} = \left( (W^{(l)})^T \Delta^{(l+1)} \right) \odot f'^{(l)}(Z^{(l)}), \quad \forall l = n-1, \cdots, 2          (23)

where \odot denotes the Hadamard product and Z^{(l)} = W^{(l-1)} H^{(l-1)} + c^{(l-1)} 1_{1\times m}, l = 2, \cdots, n.
Then, \forall l = n-1, \cdots, 1, we have

  \frac{\partial J}{\partial W^{(l)}} = \Delta^{(l+1)} (H^{(l)})^T + \lambda_1 W^{(l)}          (24)

  \frac{\partial J}{\partial c^{(l)}} = \Delta^{(l+1)} 1_{m\times 1}          (25)
∂c(l)
B step. When fixing (W, c), we can rewrite problem (20) as

  \min_{B} J = \left\| H^{(n)} - B \right\|^2          (26)

  s.t.  B \in \{-1, 1\}^{L\times m}          (27)

It is easy to see that the optimal solution of (26) under constraint (27) is B = sgn(H^{(n)}).
The proposed SH-BDNN method is summarized in Algorithm 2. In the algorithm, B^{(t)} and (W, c)^{(t)} are the values of B and {W^{(l)}, c^{(l)}}_{l=1}^{n-1} at iteration t.

5 Evaluation of Supervised Hashing with Binary Deep Neural Network (SH-BDNN)
This section evaluates the proposed SH-BDNN and compares it to state-of-the-art supervised hashing methods: Supervised Discrete Hashing (SDH) [15], ITQ-CCA [6], Kernel-based Supervised Hashing (KSH) [11], and Binary Reconstructive Embedding (BRE) [14].

Fig. 3. mAP comparison between SH-BDNN and state-of-the-art supervised hashing methods on CIFAR10 and MNIST.

Table 4. Precision at Hamming distance r = 2 comparison between SH-BDNN and state-of-the-art supervised hashing methods on CIFAR10 and MNIST.

              CIFAR10                        MNIST
L             8      16     24     32        8      16     24     32
SH-BDNN       54.12  67.32  69.36  69.62     84.26  94.67  94.69  95.51
SDH [15]      31.60  62.23  67.65  67.63     36.49  93.00  93.98  94.43
ITQ-CCA [6]   49.14  65.68  67.47  67.19     54.35  79.99  84.12  84.57
KSH [11]      44.81  64.08  67.01  65.76     68.07  90.79  92.86  92.41
BRE [14]      23.84  41.11  47.98  44.89     37.67  69.80  83.24  84.61

For all compared methods, we use the implementations and the suggested parameters provided by the authors.

5.1 Dataset, Evaluation Protocol, and Implementation Note

Dataset. We evaluate and compare the methods on the CIFAR10 and MNIST datasets. The descriptions of these datasets are presented in Sect. 3.1.
Evaluation protocol. Following the literature [6,11,15], we report the retrieval results using two metrics: (1) mean Average Precision (mAP) and (2) precision at Hamming radius 2 (precision@2).
Implementation note. The network configuration is the same as that of UH-BDNN, except that the final layer is removed. The values of the parameters λ_1, λ_2, λ_3 and λ_4 are empirically set by cross validation to 10^{-3}, 5, 1 and 10^{-4}, respectively. The maximum iteration number T is empirically set to 5.
Following the settings in ITQ-CCA [6] and SDH [15], all training samples are used in the learning for these two methods. For SH-BDNN, KSH [11] and BRE [14], where label information is leveraged by the pairwise label matrix, we randomly select 3,000 training samples from each class and use them for learning. The ground truths of queries are defined by the class labels from the datasets.

Table 5. Comparison between SH-BDNN and the CNN-based hashing methods DSRH [32] and DRSCH [33] on CIFAR10. The results of DSRH and DRSCH are cited from [33].

              mAP                            precision@2
L             16     24     32     48        16     24     32     48
SH-BDNN       64.30  65.21  66.22  66.53     56.87  58.67  58.80  58.42
DRSCH [33]    61.46  62.19  62.87  63.05     52.34  53.07  52.31  52.03
DSRH [32]     60.84  61.08  61.74  61.77     50.36  52.45  50.37  49.38

5.2 Retrieval Results

On the CIFAR10 dataset, Fig. 3(a) and Table 4 clearly show that the proposed SH-BDNN outperforms all compared methods by a fair margin at all code lengths in both mAP and precision@2.
On the MNIST dataset, Fig. 3(b) and Table 4 show that the proposed SH-BDNN significantly outperforms the current state-of-the-art SDH at low code length, i.e., L = 8. When L increases, SH-BDNN and SDH [15] achieve similar performance. In comparison to the remaining methods, i.e., KSH [11], ITQ-CCA [6] and BRE [14], SH-BDNN outperforms these methods by a large margin in both mAP and precision@2.
Comparison with CNN-based hashing methods [32,33]: We compare our proposed SH-BDNN to the recent CNN-based supervised hashing methods: Deep Semantic Ranking Hashing (DSRH) [32] and Deep Regularized Similarity Comparison Hashing (DRSCH) [33]. Note that the focus of [32,33] is different from ours: in [32,33], the authors focus on a framework in which the image features and hash codes are jointly learned by combining CNN layers (image feature extraction) and a binary mapping layer into a single model. On the other hand, our work focuses only on the binary mapping layer, given some image features. In [32,33], their binary mapping layer only applies a simple operation, i.e., an approximation of the sgn function (logistic [32], tanh [33]), on CNN features to achieve the approximated binary codes. Our SH-BDNN advances [32,33] in the way the image features are mapped to the binary codes (which is our main focus). Given the image features (i.e., pre-trained CNN features), we apply multiple transformations on these features; we constrain one layer to directly output the binary codes, without involving the sgn function. Furthermore, our learned codes ensure good properties, i.e., independence and balance, while DRSCH [33] does not consider such properties, and DSRH [32] only considers the balance of the codes.
We strictly follow the comparison setting in [32,33]. In [32,33], when comparing their CNN-based hashing to other non-CNN-based hashing methods, the authors use pre-trained CNN features (e.g., AlexNet [26], DeCAF [34]) as input for the other methods. Following that setting, we use AlexNet features [26] as input for SH-BDNN. We set up the experiments on CIFAR10 similarly to [33]: the query set contains 10K images (1K images per class) randomly sampled from the dataset; the remaining 50K images are used as the training set; in the testing

step, each query image is searched within the query set itself by applying the
leave-one-out procedure.
The comparative results between the proposed SH-BDNN and DSRH [32],
DRSCH [33], presented in Table 5, clearly show that at the same code length,
the proposed SH-BDNN outperforms [32,33] in both mAP and precision@2.

6 Conclusion
We propose UH-BDNN and SH-BDNN for unsupervised and supervised hashing. Our network designs constrain one layer to directly produce the binary codes. Our models ensure good properties of the codes: similarity preservation, independence and balance. Solid experimental results on three benchmark datasets show that the proposed methods compare favorably with the state of the art.

References
1. Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: VLDB (1999)
2. Kulis, B., Grauman, K.: Kernelized locality-sensitive hashing for scalable image
search. In: ICCV (2009)
3. Raginsky, M., Lazebnik, S.: Locality-sensitive binary codes from shift-invariant
kernels. In: NIPS (2009)
4. Kulis, B., Jain, P., Grauman, K.: Fast similarity search for learned metrics. PAMI
31(2), 2143–2157 (2009)
5. Weiss, Y., Torralba, A., Fergus, R.: Spectral hashing. In: NIPS (2008)
6. Gong, Y., Lazebnik, S.: Iterative quantization: a procrustean approach to learning
binary codes. In: CVPR (2011)
7. He, K., Wen, F., Sun, J.: K-means hashing: an affinity-preserving quantization
method for learning binary compact codes. In: CVPR (2013)
8. Heo, J.P., Lee, Y., He, J., Chang, S.F., Yoon, S.E.: Spherical hashing. In: CVPR (2012)
9. Kong, W., Li, W.J.: Isotropic hashing. In: NIPS (2012)
10. Strecha, C., Bronstein, A.M., Bronstein, M.M., Fua, P.: LDAHash: improved
matching with smaller descriptors. PAMI 34(1), 66–78 (2012)
11. Liu, W., Wang, J., Ji, R., Jiang, Y.G., Chang, S.F.: Supervised hashing with kernels. In: CVPR (2012)
12. Norouzi, M., Fleet, D.J., Salakhutdinov, R.: Hamming distance metric learning.
In: NIPS (2012)
13. Lin, G., Shen, C., Shi, Q., van den Hengel, A., Suter, D.: Fast supervised hashing
with decision trees for high-dimensional data. In: CVPR (2014)
14. Kulis, B., Darrell, T.: Learning to hash with binary reconstructive embeddings. In:
NIPS (2009)
15. Shen, F., Shen, C., Liu, W., Tao Shen, H.: Supervised discrete hashing. In: CVPR
(2015)
16. Wang, J., Liu, W., Kumar, S., Chang, S.: Learning to hash for indexing big data
- a survey. CoRR (2015)
17. Wang, J., Shen, H.T., Song, J., Ji, J.: Hashing for similarity search: a survey. CoRR
(2014)

18. Grauman, K., Fergus, R.: Learning binary hash codes for large-scale image search.
In: Cipolla, R., Battiato, S., Farinella, G.M. (eds.) Machine Learning for Computer
Vision. SCI, vol. 411, pp. 55–93. Springer, Heidelberg (2013)
19. Erin Liong, V., Lu, J., Wang, G., Moulin, P., Zhou, J.: Deep hashing for compact
binary codes learning. In: CVPR (2015)
20. Wang, J., Kumar, S., Chang, S.: Semi-supervised hashing for large-scale search.
PAMI 34(12), 2393–2406 (2012)
21. Salakhutdinov, R., Hinton, G.E.: Semantic hashing. Int. J. Approximate Reasoning
50(7), 969–978 (2009)
22. Carreira-Perpinan, M.A., Raziperchikolaei, R.: Hashing with binary autoencoders.
In: CVPR (2015)
23. Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. World Scientific, New
York (2006). Chap. 17
24. Liu, D.C., Nocedal, J.: On the limited memory BFGS method for large scale optimization. Math. Program. 45, 503–528 (1989)
25. Krizhevsky, A.: Learning multiple layers of features from tiny images. Technical
report, University of Toronto (2009)
26. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: convolutional architecture for fast feature embedding (2014). arXiv preprint: arXiv:1408.5093
27. Lecun, Y., Cortes, C.: The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/
28. Jégou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor
search. PAMI 33(1), 117–128 (2011)
29. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60(2),
91–110 (2004)
30. Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation
of the spatial envelope. IJCV 42(3), 145–175 (2001)
31. Nguyen, V.A., Lu, J., Do, M.N.: Supervised discriminative hashing for compact
binary codes. In: ACM MM (2014)
32. Zhao, F., Huang, Y., Wang, L., Tan, T.: Deep semantic ranking based hashing for
multi-label image retrieval. In: CVPR (2015)
33. Zhang, R., Lin, L., Zhang, R., Zuo, W., Zhang, L.: Bit-scalable deep hashing
with regularized similarity learning for image retrieval and person re-identification.
IEEE Trans. Image Process. 24(12), 4766–4779 (2015)
34. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.:
DeCAF: a deep convolutional activation feature for generic visual recognition. In:
ICML (2014)
