Learning to Hash with Binary Deep Neural Network
Abstract. This work proposes deep network models and learning algo-
rithms for unsupervised and supervised binary hashing. Our novel net-
work design constrains one hidden layer to directly output the binary
codes. This addresses a challenging issue in some previous works: opti-
mizing non-smooth objective functions due to binarization. Moreover,
we incorporate independence and balance properties in the direct and
strict forms in the learning. Furthermore, we include similarity preserv-
ing property in our objective function. Our resulting optimization with
these binary, independence, and balance constraints is difficult to solve.
We propose to attack it with alternating optimization and careful relax-
ation. Experimental results on three benchmark datasets show that our
proposed methods compare favorably with the state of the art.
1 Introduction
We are interested in learning binary hash codes for large scale visual search.
Two main difficulties with large scale visual search are efficient storage and
fast searching. An attractive approach for handling these difficulties is binary
hashing, where each original high dimensional vector x ∈ R^D is mapped to a
very compact binary vector b ∈ {−1, 1}^L, where L ≪ D.
Hashing methods can be divided into two categories: data-independent and
data-dependent. Methods in data-independent category [1–4] rely on random
projections for constructing hash functions. Methods in data-dependent category
use the available training data to learn the hash functions in unsupervised [5–9]
or supervised manner [10–15]. The review of data-independent/data-dependent
hashing methods can be found in recent surveys [16–18].
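For concreteness, the following is a minimal sketch of a data-independent hash function built from random projections, in the spirit of LSH [1]; the function name and the Gaussian projection choice are our illustrative assumptions, not a cited implementation.

import numpy as np

def random_projection_hash(X, L, seed=0):
    # X: (num_samples, D) real-valued features; returns codes in {-1, +1} of length L.
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], L))  # random projection directions
    return np.where(X @ W >= 0, 1, -1)        # one bit per projection, thresholded at zero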
One difficult problem in hashing is to deal with the binary constraint on
the codes. Specifically, the outputs of the hash functions have to be binary. In
general, this binary constraint leads to an NP-hard mixed-integer optimization
problem. To handle this difficulty, most aforementioned methods relax the con-
straint during the learning of hash functions. With this relaxation, the continuous
codes are learned first. Then, the codes are binarized (e.g., with thresholding).
This relaxation greatly simplifies the original binary constrained problem. How-
ever, the solution can be suboptimal, i.e., the binary codes resulting from thresh-
olded continuous codes could be inferior to those that are obtained by including
the binary constraint in the learning.
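To make the relax-then-binarize recipe concrete, here is a small sketch of the two-stage approach criticized above, with a PCA projection standing in for any learned continuous embedding (our illustration; the suboptimality comes from the post-hoc thresholding step).

import numpy as np

def relax_then_binarize(X, L):
    # Stage 1 (continuous relaxation): project onto the top-L PCA directions.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    continuous_codes = Xc @ Vt[:L].T
    # Stage 2 (post-hoc binarization): simple thresholding, which may be suboptimal.
    return np.where(continuous_codes >= 0, 1, -1)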
Furthermore, a good hashing method should produce binary codes with the following properties [5]: (i) similarity preserving, i.e., (dis)similar inputs should likely have (dis)similar binary codes; (ii) independence, i.e., different bits in the binary codes are independent of each other; (iii) balance, i.e., each bit has a 50% chance of being 1 or −1. The direct incorporation of the independence and balance properties can complicate the learning. Previous work has used some relaxation to work around the problem [6,19,20], but this can come with some performance degradation.
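The balance and independence properties can be checked empirically on a learned code matrix; the sketch below (ours, with illustrative names) reports the per-bit mean, which should be near zero for balanced bits, and the deviation of the bit correlation matrix from the identity, which should be small for independent bits.

import numpy as np

def code_statistics(codes):
    # codes: (num_samples, L) matrix with entries in {-1, +1}.
    num_samples, L = codes.shape
    balance = codes.mean(axis=0)                   # ~0 per bit if +1/-1 are equally likely
    correlation = codes.T @ codes / num_samples    # ~identity if bits are uncorrelated
    independence_gap = np.linalg.norm(correlation - np.eye(L))
    return balance, independence_gap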
1.2 Contribution

In this work, we first propose a novel deep network model and learning algorithm for unsupervised hashing. In order to achieve binary codes, instead of involving the sgn or step function as in [19,22], our proposed network design constrains one layer to directly output the binary codes (hence the network is called Binary Deep Neural Network).
Notation | Meaning
X | X = {x_i}_{i=1}^m ∈ R^{D×m}: set of m training samples; each column of X corresponds to one sample
B | B = {b_i}_{i=1}^m ∈ {−1, +1}^{L×m}: binary codes of X
L | Number of bits in the output binary code to encode a sample
n | Number of layers (including input and output layers)
s_l | Number of units in layer l
f^{(l)} | Activation function of layer l
W^{(l)} | W^{(l)} ∈ R^{s_{l+1}×s_l}: weight matrix connecting layer l + 1 and layer l
c^{(l)} | c^{(l)} ∈ R^{s_{l+1}}: bias vector for units in layer l + 1
H^{(l)} | H^{(l)} = f^{(l)}(W^{(l−1)} H^{(l−1)} + c^{(l−1)} 1_{1×m}): output values of layer l; convention: H^{(1)} = X
1_{a×b} | Matrix with a rows, b columns, and all elements equal to 1
Fig. 1. The illustration of our network (D = 4, L = 2). In our proposed network design,
the outputs of layer n − 1 are constrained to {−1, 1} and are used as the binary codes.
During training, these codes are used to reconstruct the input samples at the final
layer.
The values at layer n − 1 have the following desirable properties: (i) belonging to
{−1, 1}; (ii) similarity preserving; (iii) independence; and (iv) balance. Figure 1
illustrates our network for the case D = 4, L = 2.
Let us start with the first two properties of the codes, i.e., belonging to {−1, 1}
and similarity preserving. To achieve binary codes having these two properties,
we propose to optimize the following constrained objective function:
\min_{W,c} J = \frac{1}{2m} \left\| X - \left( W^{(n-1)} H^{(n-1)} + c^{(n-1)} 1_{1\times m} \right) \right\|^2 + \frac{\lambda_1}{2} \sum_{l=1}^{n-1} \left\| W^{(l)} \right\|^2    (1)

subject to the binary constraint on the outputs of layer n − 1. By introducing an auxiliary variable B for the binary codes of layer n − 1, the problem is reformulated as

\min_{W,c,B} J = \frac{1}{2m} \left\| X - W^{(n-1)} B - c^{(n-1)} 1_{1\times m} \right\|^2 + \frac{\lambda_1}{2} \sum_{l=1}^{n-1} \left\| W^{(l)} \right\|^2    (3)

Relaxing the equality between B and H^{(n−1)} into a penalty term, and adding the independence and balance terms, leads to the final objective, which is solved under the binary constraint on B:
\min_{W,c,B} J = \frac{1}{2m} \left\| X - W^{(n-1)} B - c^{(n-1)} 1_{1\times m} \right\|^2 + \frac{\lambda_1}{2} \sum_{l=1}^{n-1} \left\| W^{(l)} \right\|^2 + \frac{\lambda_2}{2m} \left\| H^{(n-1)} - B \right\|^2 + \frac{\lambda_3}{2} \left\| \frac{1}{m} H^{(n-1)} (H^{(n-1)})^T - I \right\|^2 + \frac{\lambda_4}{2m} \left\| H^{(n-1)} 1_{m\times 1} \right\|^2    (8)
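As a reading aid, the following numpy sketch evaluates the objective (8); it is our illustration, not the authors' implementation, and assumes H is the output of layer n − 1 (an L × m matrix), B the auxiliary binary codes, W a list of the weight matrices W^{(1)}, ..., W^{(n−1)}, and c_top the bias c^{(n−1)} as a column vector.

import numpy as np

def uh_bdnn_objective(X, H, B, W, c_top, lambdas):
    l1, l2, l3, l4 = lambdas
    m = X.shape[1]
    L = H.shape[0]
    recon = X - W[-1] @ B - c_top @ np.ones((1, m))           # reconstruction from binary codes
    J = np.sum(recon ** 2) / (2 * m)
    J += (l1 / 2) * sum(np.sum(Wl ** 2) for Wl in W)          # weight decay over all layers
    J += (l2 / (2 * m)) * np.sum((H - B) ** 2)                # binarization (quantization) term
    J += (l3 / 2) * np.sum((H @ H.T / m - np.eye(L)) ** 2)    # independence term
    J += (l4 / (2 * m)) * np.sum((H @ np.ones((m, 1))) ** 2)  # balance term
    return J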
Comparison to Deep Hashing (DH) [19]: DH's model does not have the reconstruction layer. They apply the sgn function to the outputs at the top layer of the network to obtain the binary codes. The first term aims to minimize the quantization loss when applying the sgn function to the outputs at the top layer. The balance and independence properties are contained in the second and third terms [19]. It is worth noting that minimizing DH's objective function is difficult due to the non-differentiability of the sgn function. The authors work around this difficulty by assuming that the sgn function is differentiable everywhere.
Contrary to DH, we propose a different model design. In particular, our model encourages similarity preserving by having the reconstruction layer in the network. For the balance property, DH maximizes tr(H^{(n)} (H^{(n)})^T). According to [20], maximizing this term is only an approximation to achieving the balance property. In our objective function, the balance property is directly enforced on the codes by the term ||H^{(n−1)} 1_{m×1}||^2. For the independence property, DH uses a relaxed orthogonality constraint ||W^{(l)} (W^{(l)})^T − I||^2, i.e., a constraint on the network weights W. On the contrary, we (once again) directly constrain the codes using ||(1/m) H^{(n−1)} (H^{(n−1)})^T − I||^2. Incorporating the strict constraints can lead to better performance.
Comparison to Binary Autoencoder (BA) [22]: The differences between our model
and BA are quite clear. BA, as described in [22], is a shallow linear autoencoder
network with one hidden layer. The BA's hash function is a linear transformation
of the input followed by the step function to obtain the binary codes. In BA, by
treating the encoder layer as binary classifiers, they use binary SVMs to learn the
weights of the linear transformation. On the contrary, our hash function is defined
by multiple, hierarchical layers of nonlinear and linear transformations. It is not
clear if the binary SVMs approach in BA can be used to learn the weights in our
deep architecture with multiple layers. Instead, we use alternating optimization
to derive a backpropagation algorithm to learn the weights in all layers. Another
difference is that our model ensures the independence and balance of the binary
codes while BA does not. Note that independence and balance properties may
not be easily incorporated in their framework, as these would complicate their
objective function and the optimization problem may become very difficult to
solve.
2.2 Optimization
In order to solve (8) under constraint (9), we propose to use alternating opti-
mization over (W, c) and B.
(W, c) step. When fixing B, the problem becomes an unconstrained optimization.
We use the L-BFGS [24] optimizer with backpropagation to solve it. The gradients of
the objective function J (8) w.r.t. the different parameters are computed as follows.
At l = n − 1, we have

\frac{\partial J}{\partial W^{(n-1)}} = \frac{-1}{m} \left( X - W^{(n-1)} B - c^{(n-1)} 1_{1\times m} \right) B^T + \lambda_1 W^{(n-1)}    (10)

\frac{\partial J}{\partial c^{(n-1)}} = \frac{-1}{m} \left( \left( X - W^{(n-1)} B \right) 1_{m\times 1} - m c^{(n-1)} \right)    (11)

For the other layers, let Z^{(l)} = W^{(l-1)} H^{(l-1)} + c^{(l-1)} 1_{1\times m} denote the pre-activation of layer l, and define

\Delta^{(n-1)} = \left( \frac{\lambda_2}{m} \left( H^{(n-1)} - B \right) + \frac{2\lambda_3}{m} \left( \frac{1}{m} H^{(n-1)} (H^{(n-1)})^T - I \right) H^{(n-1)} + \frac{\lambda_4}{m} H^{(n-1)} 1_{m\times m} \right) \odot f'^{(n-1)}(Z^{(n-1)})    (12)

\Delta^{(l)} = \left( (W^{(l)})^T \Delta^{(l+1)} \right) \odot f'^{(l)}(Z^{(l)}), \quad \forall l = n-2, \dots, 2    (13)
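A compact numpy sketch of the backpropagated errors (12)-(13) is given below; it is our illustration, under the assumption that W and Z are dicts keyed by layer index and fprime is the elementwise derivative of the activation function.

import numpy as np

def uh_bdnn_deltas(H_top, B, Z, W, fprime, lambda2, lambda3, lambda4, n):
    L, m = H_top.shape
    term = (lambda2 / m) * (H_top - B)                                     # binarization part
    term += (2 * lambda3 / m) * (H_top @ H_top.T / m - np.eye(L)) @ H_top  # independence part
    term += (lambda4 / m) * H_top @ np.ones((m, m))                        # balance part
    deltas = {n - 1: term * fprime(Z[n - 1])}                              # eq. (12)
    for l in range(n - 2, 1, -1):                                          # eq. (13): l = n-2, ..., 2
        deltas[l] = (W[l].T @ deltas[l + 1]) * fprime(Z[l])
    return deltas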
Fig. 2. mAP comparison (y-axis: mAP; x-axis: number of bits L) of UH-BDNN, BA, ITQ, SH, SPH, and KMH on (a) CIFAR10, (b) MNIST, and (c) SIFT1M.
We evaluate with two metrics that have been used in the state of the art [6,19,22] to measure the performance of hashing methods: (1) mean Average Precision (mAP); (2) precision of Hamming radius 2 (precision@2), which measures precision on retrieved images having Hamming distance ≤ 2 to the query (if no image satisfies this, we report zero precision). Note that, as computing mAP is slow on the large SIFT1M dataset, we consider the top 10,000 returned neighbors when computing mAP.
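For reference, a minimal sketch of the precision@2 metric described above (our code, with illustrative names; ground-truth relevance is assumed to be given as a boolean matrix):

import numpy as np

def hamming_distances(query_codes, db_codes):
    # codes: (num, L) arrays with entries in {-1, +1}
    L = query_codes.shape[1]
    return (L - query_codes @ db_codes.T) // 2      # inner product -> Hamming distance

def precision_at_radius(query_codes, db_codes, relevant, radius=2):
    # relevant: boolean (num_queries, num_db) ground-truth matrix
    dist = hamming_distances(query_codes, db_codes)
    precisions = []
    for q in range(query_codes.shape[0]):
        retrieved = dist[q] <= radius
        # report zero precision when no image falls within the radius
        precisions.append(relevant[q, retrieved].mean() if retrieved.any() else 0.0)
    return float(np.mean(precisions))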
Implementation note. In our deep model, we use n = 5 layers. The parameters λ1, λ2, λ3 and λ4 are empirically set by cross validation to 10^{-5}, 5 × 10^{-2}, 10^{-2} and 10^{-6}, respectively. The max iteration number T is empirically set to 10. The numbers of units in hidden layers 2, 3, 4 are empirically set to [90 → 20 → 8], [90 → 30 → 16], [100 → 40 → 24] and [120 → 50 → 32] for 8, 16, 24 and 32 bits, respectively.
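Spelled out as a configuration (a sketch; we assume the first and last layers of UH-BDNN have size D, since the final layer reconstructs the input), the settings above are:

# Hidden layers 2, 3, 4 for each code length L; input layer 1 and output layer 5 have size D.
ARCHITECTURES = {
    8:  [90, 20, 8],
    16: [90, 30, 16],
    24: [100, 40, 24],
    32: [120, 50, 32],
}
HYPERPARAMS = {"lambda1": 1e-5, "lambda2": 5e-2, "lambda3": 1e-2, "lambda4": 1e-6, "T": 10}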
Figure 2 and Table 2 show the comparative mAP and precision of Hamming radius 2 (precision@2), respectively. We find the following observations are consistent across all three datasets. In terms of mAP, the proposed UH-BDNN is comparable to or outperforms the other methods at all code lengths. The improvement is clearer at higher code lengths, i.e., L = 24, 32. The mAP of UH-BDNN consistently outperforms that of Binary Autoencoder (BA) [22], which is the current state-of-the-art unsupervised hashing method. In terms of precision@2, UH-BDNN is comparable to the other methods at low L, i.e., L = 8, 16. At L = 24, 32, UH-BDNN significantly outperforms the other methods.
Comparison with Deep Hashing (DH) [19]: As the implementation of DH is not available, we set up experiments on CIFAR10 and MNIST similar to [19] to make a fair comparison. For each dataset, we randomly sample 1,000 images, 100 per class, as the query set; the remaining images are used as the training/database set. Following [19], for CIFAR10, each image is represented by a 512-D GIST descriptor [30]. The ground truths of queries are based on their class labels. Similar to [19], we report comparative results in terms of mAP and the precision of Hamming radius r = 2. The comparative results are presented in Table 3, which clearly shows that the proposed UH-BDNN outperforms DH [19] at all code lengths, in both mAP and precision of Hamming radius.
Table 3. Comparison with Deep Hashing (DH) [19]. The results of DH are cited
from [19].
              CIFAR10                      MNIST
              mAP           precision@2   mAP           precision@2
L             16     32     16     32     16     32     16     32
DH [19]       16.17  16.62  23.33  15.77  43.14  44.97  66.10  73.29
UH-BDNN       17.83  18.52  24.97  18.85  45.38  47.21  69.13  75.26
4 Supervised Hashing with Binary Deep Neural Network (SH-BDNN)

The layer n − 1 in UH-BDNN becomes the last layer in SH-BDNN. All desirable properties, i.e., semantic similarity preserving, independence, and balance, in SH-BDNN are constrained on the outputs of its last layer.
To achieve the semantic similarity preserving property, we learn the binary codes such that the Hamming distance between learned binary codes highly correlates with the semantic similarity matrix S, i.e., we want to minimize the quantity ||(1/L) (H^{(n)})^T H^{(n)} − S||^2. In addition, to achieve the independence and balance properties of the codes, we want to minimize the quantities ||(1/m) H^{(n)} (H^{(n)})^T − I||^2 and ||H^{(n)} 1_{m×1}||^2.
Following the same reformulation and relaxation as in UH-BDNN (Sect. 2.1), we solve the following constrained optimization problem, which ensures the binary constraint and the semantic similarity preserving, independence, and balance properties of the codes:

\min_{W,c,B} J = \frac{1}{2m} \left\| \frac{1}{L} (H^{(n)})^T H^{(n)} - S \right\|^2 + \frac{\lambda_1}{2} \sum_{l=1}^{n-1} \left\| W^{(l)} \right\|^2 + \frac{\lambda_2}{2m} \left\| H^{(n)} - B \right\|^2 + \frac{\lambda_3}{2} \left\| \frac{1}{m} H^{(n)} (H^{(n)})^T - I \right\|^2 + \frac{\lambda_4}{2m} \left\| H^{(n)} 1_{m\times 1} \right\|^2    (20)
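For illustration only, one common way to build such a label-based similarity matrix S (an assumption on our part; the paper defines S earlier, outside the text reproduced here) sets S_ij = 1 when samples i and j share a class label and S_ij = −1 otherwise, which matches the range of (1/L)(H^{(n)})^T H^{(n)}:

import numpy as np

def build_similarity_matrix(labels):
    # labels: (m,) integer class labels; S[i, j] = 1 if same class, else -1.
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    return np.where(same, 1.0, -1.0)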
4.2 Optimization

In order to solve (20) under constraint (21), we alternately optimize over (W, c) and B.

(W, c) step. When fixing B, (20) becomes an unconstrained optimization. We use the L-BFGS [24] optimizer with backpropagation to solve it. The gradients of the objective function J (20) w.r.t. the different parameters are computed as follows.
Let us define

\Delta^{(n)} = \left( \frac{1}{mL} H^{(n)} \left( V + V^T \right) + \frac{\lambda_2}{m} \left( H^{(n)} - B \right) + \frac{2\lambda_3}{m} \left( \frac{1}{m} H^{(n)} (H^{(n)})^T - I \right) H^{(n)} + \frac{\lambda_4}{m} H^{(n)} 1_{m\times m} \right) \odot f'^{(n)}(Z^{(n)})    (22)

where V = \frac{1}{L} (H^{(n)})^T H^{(n)} - S, and

\Delta^{(l)} = \left( (W^{(l)})^T \Delta^{(l+1)} \right) \odot f'^{(l)}(Z^{(l)}), \quad \forall l = n-1, \dots, 2    (23)
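Analogously to the UH-BDNN case, a numpy sketch (ours, not the authors' code) of the top-layer error Δ^{(n)} in (22), where H is the last-layer output (L × m) and fprime the elementwise derivative of the activation:

import numpy as np

def sh_bdnn_top_delta(H, B, S, Z, fprime, lambda2, lambda3, lambda4):
    L, m = H.shape
    V = H.T @ H / L - S                                        # V = (1/L) H^T H - S
    term = (H @ (V + V.T)) / (m * L)                           # similarity-preserving part
    term += (lambda2 / m) * (H - B)                            # binarization part
    term += (2 * lambda3 / m) * (H @ H.T / m - np.eye(L)) @ H  # independence part
    term += (lambda4 / m) * H @ np.ones((m, m))                # balance part
    return term * fprime(Z)                                    # eq. (22)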
Fig. 3. mAP comparison (y-axis: mAP; x-axis: number of bits L) of SH-BDNN, SDH, ITQ-CCA, KSH, and BRE on (a) CIFAR10 and (b) MNIST.
Table 4. Precision of Hamming radius 2 (precision@2) on CIFAR10 and MNIST.

              CIFAR10                      MNIST
L             8      16     24     32     8      16     24     32
SH-BDNN       54.12  67.32  69.36  69.62  84.26  94.67  94.69  95.51
SDH [15]      31.60  62.23  67.65  67.63  36.49  93.00  93.98  94.43
ITQ-CCA [6]   49.14  65.68  67.47  67.19  54.35  79.99  84.12  84.57
KSH [11]      44.81  64.08  67.01  65.76  68.07  90.79  92.86  92.41
BRE [14]      23.84  41.11  47.98  44.89  37.67  69.80  83.24  84.61
We compare SH-BDNN against the supervised hashing methods SDH [15], ITQ-CCA [6], KSH [11], and Binary Reconstructive Embedding (BRE) [14]. For all compared methods, we use the implementation and the suggested parameters provided by the authors.
Table 5. Comparison with CNN-based hashing methods DSRH [32] and DRSCH [33] on CIFAR10.

              mAP                          precision@2
L             16     24     32     48      16     24     32     48
SH-BDNN       64.30  65.21  66.22  66.53   56.87  58.67  58.80  58.42
DRSCH [33]    61.46  62.19  62.87  63.05   52.34  53.07  52.31  52.03
DSRH [32]     60.84  61.08  61.74  61.77   50.36  52.45  50.37  49.38
On the CIFAR10 dataset, Fig. 3(a) and Table 4 clearly show that the proposed SH-BDNN outperforms all compared methods by a fair margin at all code lengths, in both mAP and precision@2.
On the MNIST dataset, Fig. 3(b) and Table 4 show that the proposed SH-BDNN significantly outperforms the current state-of-the-art SDH at low code length, i.e., L = 8. When L increases, SH-BDNN and SDH [15] achieve similar performance. In comparison to the remaining methods, i.e., KSH [11], ITQ-CCA [6], and BRE [14], SH-BDNN outperforms them by a large margin in both mAP and precision@2.
Comparison with CNN-based hashing methods [32,33]: We compare our proposed SH-BDNN to the recent CNN-based supervised hashing methods Deep Semantic Ranking Hashing (DSRH) [32] and Deep Regularized Similarity Comparison Hashing (DRSCH) [33]. Note that the focus of [32,33] is different from ours: in [32,33], the authors focus on a framework in which the image features and hash codes are jointly learned by combining CNN layers (image feature extraction) and a binary mapping layer into a single model. On the other hand, our work focuses only on the binary mapping layer, given some image features. In [32,33], the binary mapping layer only applies a simple operation, i.e., an approximation of the sgn function (logistic [32], tanh [33]), to the CNN features to obtain the approximate binary codes. Our SH-BDNN advances over [32,33] in the way the image features are mapped to the binary codes (which is our main focus). Given the image features (i.e., pre-trained CNN features), we apply multiple transformations to these features; we constrain one layer to directly output the binary codes, without involving the sgn function. Furthermore, our learned codes ensure good properties, i.e., independence and balance, while DRSCH [33] does not consider such properties and DSRH [32] only considers the balance of the codes.
We strictly follow the comparison setting in [32,33]. In [32,33], when comparing their CNN-based hashing to other non-CNN-based hashing methods, the authors use pre-trained CNN features (e.g., AlexNet [26], DeCAF [34]) as input for the other methods. Following that setting, we use AlexNet features [26] as input for SH-BDNN. We set up the experiments on CIFAR10 similar to [33], i.e., the query set contains 10K images (1K images per class) randomly sampled from the dataset; the remaining 50K images are used as the training set; in the testing step, each query image is searched within the query set itself by applying the leave-one-out procedure.
The comparative results between the proposed SH-BDNN and DSRH [32],
DRSCH [33], presented in Table 5, clearly show that at the same code length,
the proposed SH-BDNN outperforms [32,33] in both mAP and precision@2.
6 Conclusion
We propose UH-BDNN and SH-BDNN for unsupervised and supervised hashing.
Our network designs constrain one layer to directly produce the binary codes. Our
models ensure good properties for the codes: similarity preserving, independence, and
balance. Solid experimental results on three benchmark datasets show that the
proposed methods compare favorably with the state of the art.
References
1. Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hash-
ing. In: VLDB (1999)
2. Kulis, B., Grauman, K.: Kernelized locality-sensitive hashing for scalable image
search. In: ICCV (2009)
3. Raginsky, M., Lazebnik, S.: Locality-sensitive binary codes from shift-invariant
kernels. In: NIPS (2009)
4. Kulis, B., Jain, P., Grauman, K.: Fast similarity search for learned metrics. PAMI
31(12), 2143–2157 (2009)
5. Weiss, Y., Torralba, A., Fergus, R.: Spectral hashing. In: NIPS (2008)
6. Gong, Y., Lazebnik, S.: Iterative quantization: a procrustean approach to learning
binary codes. In: CVPR (2011)
7. He, K., Wen, F., Sun, J.: K-means hashing: an affinity-preserving quantization
method for learning binary compact codes. In: CVPR (2013)
8. Heo, J.P., Lee, Y., He, J., Chang, S.F., Yoon, S.E.: Spherical hashing. In: CVPR
(2012)
9. Kong, W., Li, W.J.: Isotropic hashing. In: NIPS (2012)
10. Strecha, C., Bronstein, A.M., Bronstein, M.M., Fua, P.: LDAHash: improved
matching with smaller descriptors. PAMI 34(1), 66–78 (2012)
11. Liu, W., Wang, J., Ji, R., Jiang, Y.G., Chang, S.F.: Supervised hashing with ker-
nels. In: CVPR (2012)
12. Norouzi, M., Fleet, D.J., Salakhutdinov, R.: Hamming distance metric learning.
In: NIPS (2012)
13. Lin, G., Shen, C., Shi, Q., van den Hengel, A., Suter, D.: Fast supervised hashing
with decision trees for high-dimensional data. In: CVPR (2014)
14. Kulis, B., Darrell, T.: Learning to hash with binary reconstructive embeddings. In:
NIPS (2009)
15. Shen, F., Shen, C., Liu, W., Tao Shen, H.: Supervised discrete hashing. In: CVPR
(2015)
16. Wang, J., Liu, W., Kumar, S., Chang, S.: Learning to hash for indexing big data
- a survey. CoRR (2015)
17. Wang, J., Shen, H.T., Song, J., Ji, J.: Hashing for similarity search: a survey. CoRR
(2014)
18. Grauman, K., Fergus, R.: Learning binary hash codes for large-scale image search.
In: Cipolla, R., Battiato, S., Farinella, G.M. (eds.) Machine Learning for Computer
Vision. SCI, vol. 411, pp. 55–93. Springer, Heidelberg (2013)
19. Erin Liong, V., Lu, J., Wang, G., Moulin, P., Zhou, J.: Deep hashing for compact
binary codes learning. In: CVPR (2015)
20. Wang, J., Kumar, S., Chang, S.: Semi-supervised hashing for large-scale search.
PAMI 34(12), 2393–2406 (2012)
21. Salakhutdinov, R., Hinton, G.E.: Semantic hashing. Int. J. Approximate Reasoning
50(7), 969–978 (2009)
22. Carreira-Perpinan, M.A., Raziperchikolaei, R.: Hashing with binary autoencoders.
In: CVPR (2015)
23. Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Springer, New York (2006). Chap. 17
24. Liu, D.C., Nocedal, J.: On the limited memory BFGS method for large scale opti-
mization. Math. Program. 45, 503–528 (1989)
25. Krizhevsky, A.: Learning multiple layers of features from tiny images. Technical
report, University of Toronto (2009)
26. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadar-
rama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding
(2014). arXiv preprint: arXiv:1408.5093
27. Lecun, Y., Cortes, C.: The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/
28. Jégou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor
search. PAMI 33(1), 117–128 (2011)
29. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60(2),
91–110 (2004)
30. Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation
of the spatial envelope. IJCV 42(3), 145–175 (2001)
31. Nguyen, V.A., Lu, J., Do, M.N.: Supervised discriminative hashing for compact
binary codes. In: ACM MM (2014)
32. Zhao, F., Huang, Y., Wang, L., Tan, T.: Deep semantic ranking based hashing for
multi-label image retrieval. In: CVPR (2015)
33. Zhang, R., Lin, L., Zhang, R., Zuo, W., Zhang, L.: Bit-scalable deep hashing
with regularized similarity learning for image retrieval and person re-identification.
IEEE Trans. Image Process. 24(12), 4766–4779 (2015)
34. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.:
DeCAF: a deep convolutional activation feature for generic visual recognition. In:
ICML (2014)