Deep Clustering via Joint Convolutional Autoencoder Embedding and Relative Entropy Minimization
Kamran Ghasedi Dizaji†, Amirhossein Herandi‡, Cheng Deng], Weidong Cai\, Heng Huang†∗

† Electrical and Computer Engineering, University of Pittsburgh, USA
‡ Computer Science and Engineering, University of Texas at Arlington, USA
] School of Electronic Engineering, Xidian University, China
\ School of Information Technologies, University of Sydney, Australia
[email protected], [email protected], [email protected]
[email protected], [email protected]
Abstract
Image clustering is one of the most important computer vision applications, which has been extensively studied in literature. However, current clustering methods mostly suffer from lack of efficiency and scalability when dealing with large-scale and high-dimensional data. In this paper, we propose a new clustering model, called DEeP Embedded RegularIzed ClusTering (DEPICT), which efficiently maps data into a discriminative embedding subspace and precisely predicts cluster assignments. DEPICT generally consists of a multinomial logistic regression function stacked on top of a multi-layer convolutional autoencoder. We define a clustering objective function using relative entropy (KL divergence) minimization, regularized by a prior for the frequency of cluster assignments. An alternating strategy is then derived to optimize the objective by updating parameters and estimating cluster assignments. Furthermore, we employ the reconstruction loss functions in our autoencoder, as a data-dependent regularization term, to prevent the deep embedding function from overfitting. In order to benefit from end-to-end optimization and eliminate the necessity for layer-wise pretraining, we introduce a joint learning framework to minimize the unified clustering and reconstruction loss functions together and train all network layers simultaneously. Experimental results indicate the superiority and faster running time of DEPICT in real-world clustering tasks, where no labeled data is available for hyper-parameter tuning.

Figure 1: Visualization to show the discriminative capability of embedding subspaces using MNIST-test data. (a) The space of raw data. (b) The embedding subspace of non-joint DEPICT using standard stacked denoising autoencoder (SdA). (c) The embedding subspace of joint DEPICT using our joint learning approach (MdA).

∗ Corresponding Author. This work (accepted at ICCV 2017) was partially supported by U.S. NIH R01 AG049371, NSF IIS 1302675, IIS 1344152, DBI 1356628, IIS 1619308, IIS 1633753.

1. Introduction

Clustering is one of the fundamental research topics in machine learning and computer vision research, and it has gained significant attention for discriminative representation of data points without any need for supervisory signals. The clustering problem has been extensively studied in various applications; however, the performance of standard clustering algorithms is adversely affected when dealing with high-dimensional data, and their time complexity dramatically increases when working with large-scale datasets. Tackling the curse of dimensionality, previous studies often initially project data into a low-dimensional manifold, and then cluster the embedded data in this new subspace [37, 45, 52]. Handling large-scale datasets, there are also several studies which select only a subset of data points to accelerate the clustering process [42, 22, 20].

However, dealing with real-world image data, existing clustering algorithms suffer from different issues: 1) Using inflexible hand-crafted features, which do not depend on the input data distribution; 2) Using shallow and linear embedding functions, which are not able to capture the non-linear nature of data; 3) Non-joint embedding and clustering processes, which do not result in an optimal embedding subspace for clustering; 4) Complicated clustering algorithms that require tuning the hyper-parameters using labeled data, which is not feasible in real-world clustering tasks.
To address the mentioned challenging issues, we propose a new clustering algorithm, called deep embedded regularized clustering (DEPICT), which exploits the advantages of both discriminative clustering methods and deep embedding models. DEPICT generally consists of two main parts, a multinomial logistic regression (soft-max) layer stacked on top of a multi-layer convolutional autoencoder. The soft-max layer along with the encoder pathway can be considered as a discriminative clustering model, which is trained using relative entropy (KL divergence) minimization. We further add a regularization term based on a prior distribution for the frequency of cluster assignments. The regularization term penalizes unbalanced cluster assignments and prevents allocating clusters to outlier samples.

Although this deep clustering model is flexible enough to discriminate the complex real-world input data, it can easily get stuck in non-optimal local minima during training and result in undesirable cluster assignments. In order to avoid overfitting the deep clustering model to spurious data correlations, we utilize the reconstruction loss function of autoencoder models as a data-dependent regularization term for training parameters.

In order to benefit from a joint learning framework for embedding and clustering, we introduce a unified objective function including our clustering and auxiliary reconstruction loss functions. We then employ an alternating approach to efficiently update the parameters and estimate the cluster assignments. It is worth mentioning that in the standard learning approach for training a multi-layer autoencoder, the encoder and decoder parameters are first pretrained layer-wise using the reconstruction loss, and the encoder parameters are then fine-tuned using the objective function of the main task [48]. However, it has been argued that the non-joint fine-tuning step may overwrite the encoder parameters entirely and consequently cancel out the benefit of the layer-wise pretraining step [68]. To avoid this problem and achieve optimal joint learning results, we simultaneously train all of the encoder and decoder layers together along with the soft-max layer. To do so, we sum up the squared error reconstruction loss functions between the decoder and their corresponding (clean) encoder layers and add them to the clustering loss function.

Figure 1 demonstrates the importance of our joint learning strategy by comparing different data representations of MNIST-test data points [17] using principal component analysis (PCA) visualization. The first figure indicates the raw data representation; the second one shows the data points in the embedding subspace of non-joint DEPICT, in which the model is trained using the standard layer-wise stacked denoising autoencoder (SdA); the third one visualizes the data points in the embedding subspace of joint DEPICT, in which the model is trained using our multi-layer denoising autoencoder learning approach (MdA). As shown, joint DEPICT using the MdA learning approach provides a significantly more discriminative embedding subspace compared to non-joint DEPICT using the standard SdA learning approach.

Moreover, experimental results show that DEPICT achieves superior or competitive results compared to the state-of-the-art algorithms on the image benchmark datasets while having faster running times. In addition, we compared different learning strategies for DEPICT, and confirm that our joint learning approach has the best results. It should also be noted that DEPICT does not require any hyper-parameter tuning using supervisory signals, and consequently is a better candidate for real-world clustering tasks. Thus, we summarize the advantages of DEPICT as:

• Providing a discriminative non-linear embedding subspace via the deep convolutional autoencoder;
• Introducing an end-to-end joint learning approach, which unifies the clustering and embedding tasks, and avoids layer-wise pretraining;
• Achieving superior or competitive clustering results on high-dimensional and large-scale datasets with no need for hyper-parameter tuning using labeled data.

2. Related Works

There is a large number of clustering algorithms in literature, which can be grouped into different perspectives, such as hierarchical [10, 54, 65], centroid-based [21, 4, 28, 2], graph-based [41, 29, 51, 26], sequential (temporal) [12, 40, 39, 69, 38], regression model based [8, 50], and subspace clustering models [1, 11, 7, 27]. In another sense, they are generally divided into two subcategories, generative and discriminative clustering algorithms. The generative algorithms like K-means and Gaussian mixture model [5] explicitly represent the clusters using geometric properties of the feature space, and model the categories via the statistical distributions of input data. Unlike the generative clustering algorithms, the discriminative methods directly identify the categories using their separating hyperplanes regardless of data distribution. Information theoretic [19, 3, 15], max-margin [67, 58], and spectral graph [25] algorithms are examples of discriminative clustering models. Generally it has been argued that the discriminative models often have better results compared to their generative counterparts, since they have fewer assumptions about the data distribution and directly separate the clusters, but their training can suffer from overfitting or getting stuck in undesirable local minima [15, 25, 33]. Our DEPICT algorithm is also a discriminative clustering model, but it benefits from the auxiliary reconstruction task of the autoencoder to alleviate these issues in training of our discriminative clustering algorithm.
There are also several studies regarding the combination of clustering with feature embedding learning. Ye et al. introduced a kernelized K-means algorithm, denoted by DisKmeans, where embedding to a lower dimensional subspace via linear discriminant analysis (LDA) is jointly learned with K-means cluster assignments [62]. [49] proposed a new method to simultaneously conduct both clustering and feature embedding/selection tasks to achieve better performance. But these models suffer from having shallow and linear embedding functions, which cannot represent the non-linearity of real-world data.

A joint learning framework for updating code books and estimating image clusters was proposed in [57] while SIFT features are used as input data. A deep structure, named TAGnet, was introduced in [52], where two layers of sparse coding followed by a clustering algorithm are trained with an alternating learning approach. Similar work is presented in [53], which formulates a joint optimization framework for discriminative clustering and feature extraction using sparse coding. However, the inference complexity of sparse coding forces the model in [53] to reduce the dimension of input data with PCA and the model in [52] to use an approximate solution. Hand-crafted features and dimension reduction techniques degrade the clustering performance by neglecting the distribution of input data.

Tian et al. learned a non-linear embedding of the affinity graph using a stacked autoencoder, and then obtained the clusters in the embedding subspace via K-means [45]. Trigeorgis et al. extended semi non-negative matrix factorization (semi-NMF) to stacked multi-layer (deep) semi-NMF to capture the abstract information in the top layer. Afterwards, they run K-means over the embedding subspace for cluster assignments [46]. More recently, Xie et al. employed the denoising stacked autoencoder learning approach, and first pretrained the model layer-wise and then fine-tuned the encoder pathway stacked by a clustering algorithm using Kullback-Leibler divergence minimization [56]. Unlike these models that require layer-wise pretraining as well as non-joint embedding and clustering learning, DEPICT utilizes an end-to-end optimization for training all network layers simultaneously using the unified clustering and reconstruction loss functions.

Yang et al. introduced a new clustering model, named JULE, based on a recurrent framework, where data is represented via a convolutional neural network and embedded data is iteratively clustered using an agglomerative clustering algorithm [60]. They derived a unified loss function consisting of the merging process for agglomerative clustering and updating the parameters of the deep representation. While JULE achieved good results using the joint learning approach, it requires tuning of a large number of hyper-parameters, which is not practical in real-world clustering tasks. In contrast, our model does not need any supervisory signals for hyper-parameter tuning.

3. Deep Embedded Regularized Clustering

In this section, we first introduce the clustering objective function and the corresponding optimization algorithm, which alternates between estimating the cluster assignments and updating model parameters. Afterwards, we show the architecture of DEPICT and provide the joint learning framework to simultaneously train all network layers using the unified clustering and reconstruction loss functions.

3.1. DEPICT Algorithm

Let us consider the clustering task of N samples, X = [x_1, ..., x_N], into K categories, where each sample x_i ∈ R^{d_x}. Using the embedding function ϕ_W : X → Z, we are able to map raw samples into the embedding subspace Z = [z_1, ..., z_N], where each z_i ∈ R^{d_z} has a much lower dimension compared to the input data (i.e. d_z ≪ d_x). Given the embedded features, we use a multinomial logistic regression (soft-max) function f_θ : Z → Y to predict the probabilistic cluster assignments as follows.

    p_ik = P(y_i = k | z_i, Θ) = exp(θ_k^T z_i) / Σ_{k'=1}^{K} exp(θ_{k'}^T z_i),    (1)

where Θ = [θ_1, ..., θ_K] ∈ R^{d_z×K} are the soft-max function parameters, and p_ik indicates the probability of the i-th sample belonging to the k-th cluster.

In order to define our clustering objective function, we employ an auxiliary target variable Q to refine the model predictions iteratively. To do so, we first use Kullback-Leibler (KL) divergence to decrease the distance between the model prediction P and the target variable Q.

    L = KL(Q ∥ P) = (1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} q_ik log(q_ik / p_ik).    (2)

In order to avoid degenerate solutions, which allocate most of the samples to a few clusters or assign a cluster to outlier samples, we aim to impose a regularization term to the target variable. To this end, we first define the empirical label distribution of target variables as:

    f_k = P(y = k) = (1/N) Σ_i q_ik,    (3)

where f_k can be considered as the soft frequency of cluster assignments in the target distribution. Using this empirical distribution, we are able to enforce our preference for having balanced assignments by adding the following KL
divergence to the loss function.

    L = KL(Q ∥ P) + KL(f ∥ u)    (4)
      = (1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} q_ik log(q_ik / p_ik) + Σ_{k=1}^{K} f_k log(f_k / u_k)
      = (1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} [ q_ik log(q_ik / p_ik) + q_ik log(f_k / u_k) ],

where u is the uniform prior for the empirical label distribution. While the first term in the objective minimizes the distance between the target and model prediction distributions, the second term balances the frequency of clusters in the target variables. Utilizing the balanced target variables, we can force the model to have more balanced predictions (cluster assignments) P indirectly. It is also simple to change the prior from the uniform distribution to any arbitrary distribution in the objective function if there is any extra knowledge about the frequency of clusters.
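To make the two terms of Eq. (4) concrete, the following NumPy sketch (our illustration, not code from the paper) evaluates the regularized clustering loss for given target and prediction matrices, assuming a uniform prior u_k = 1/K.

```python
import numpy as np

def depict_clustering_loss(Q, P, eps=1e-12):
    """Regularized clustering loss of Eq. (4): KL(Q || P) + KL(f || u).

    Q, P: (N, K) arrays of target and predicted cluster probabilities,
    each row summing to one; the prior u is uniform, u_k = 1/K.
    """
    N, K = Q.shape
    kl_qp = np.sum(Q * (np.log(Q + eps) - np.log(P + eps))) / N   # KL(Q || P)
    f = Q.mean(axis=0)                                            # Eq. (3): soft cluster frequencies
    u = np.full(K, 1.0 / K)                                       # uniform prior
    kl_fu = np.sum(f * (np.log(f + eps) - np.log(u)))             # balance regularizer KL(f || u)
    return kl_qp + kl_fu
```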
An alternating learning approach is utilized to optimize the objective function. Using this approach, we estimate the target variables Q via fixed parameters (expectation step), and update the parameters while the target variables Q are assumed to be known (maximization step). The problem to infer the target variable Q has the following objective:

    min_Q (1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} [ q_ik log(q_ik / p_ik) + q_ik log(f_k / u_k) ],    (5)

where the target variables are constrained to Σ_k q_ik = 1. This problem can be solved using first order methods, such as gradient descent, projected gradient descent, and Nesterov optimal method [24], which only require the objective function value and its (sub)gradient at each iteration. In the following equation, we show the partial derivative of the objective function with respect to the target variables.

    ∂L/∂q_ik ∝ log(q_ik f_k / p_ik) + q_ik / Σ_{i'=1}^{N} q_{i'k} + 1,    (6)

Investigating this problem more carefully, we approximate the gradient in Eq. (6) by removing the second term, since the number of samples N is often big enough to ignore the second term. Setting the gradient equal to zero, we are now able to compute the closed form solution for Q accordingly.

    q_ik = [ p_ik / (Σ_{i'} p_{i'k})^{1/2} ] / Σ_{k'} [ p_{ik'} / (Σ_{i'} p_{i'k'})^{1/2} ],    (7)
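The expectation step therefore reduces to two matrix operations: soft-max prediction (Eq. (1)) followed by the square-root column normalization of Eq. (7). A minimal NumPy sketch (our own, with assumed array shapes):

```python
import numpy as np

def soft_assignments(Z, Theta):
    """Eq. (1): soft-max cluster probabilities.

    Z: (N, d_z) embedded features; Theta: (d_z, K) soft-max weights.
    """
    logits = Z @ Theta
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def target_distribution(P):
    """Eq. (7): closed-form auxiliary targets Q.

    Each column of P is divided by the square root of its (soft) cluster
    size, penalizing large clusters, and rows are re-normalized to sum to one.
    """
    weight = P / np.sqrt(P.sum(axis=0, keepdims=True))
    return weight / weight.sum(axis=1, keepdims=True)
```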
For the maximization step, we update the network parameters ψ = {Θ, W} using the estimated target variables with the following objective function.

    min_ψ −(1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} q_ik log p_ik,    (8)

Interestingly, this problem can be considered as a standard cross entropy loss function for classification tasks, and the parameters of soft-max layer Θ and embedding function W can be efficiently updated by backpropagating the error.
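In a framework with automatic differentiation, the maximization step of Eq. (8) is just cross-entropy with soft targets; for instance, a PyTorch-style sketch (our notation: `logits` holds the scores θ_k^T z_i and `Q` the fixed targets) could read:

```python
import torch.nn.functional as F

# logits: (N, K) pre-soft-max scores; Q: (N, K) fixed target distribution
loss = -(Q * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
loss.backward()   # gradients flow into Theta and the encoder weights W
```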
3.2. DEPICT Architecture

In this section, we extend our general clustering loss function using a denoising autoencoder. The deep embedding function is useful for capturing the non-linear nature of input data; however, it may overfit to spurious data correlations and get stuck in undesirable local minima during training. To avoid this overfitting, we employ autoencoder structures and use the reconstruction loss function as a data-dependent regularization for training the parameters. Therefore, we design DEPICT to consist of a soft-max layer stacked on top of a multi-layer convolutional autoencoder. Due to the promising performance of strided convolutional layers in [32, 63], we employ convolutional layers in our encoder and strided convolutional layers in the decoder pathways, and avoid deterministic spatial pooling layers (like max-pooling). Strided convolutional layers allow the network to learn its own spatial upsampling, providing a better generation capability.

Unlike the standard learning approach for denoising autoencoders, which contains layer-wise pretraining and then fine-tuning, we simultaneously learn all of the autoencoder and soft-max layers. As shown in Figure 2, DEPICT consists of the following components:

1) The corrupted feedforward (encoder) pathway maps the noisy input data into the embedding subspace using a few convolutional layers followed by a fully connected layer. The following equation indicates the output of each layer in the noisy encoder pathway.

    z̃^l = Dropout[ g(W_e^l z̃^{l−1}) ],    (9)

where z̃^l are the noisy features of the l-th layer, Dropout is a stochastic mask function that randomly sets a subset of its inputs to zero [44], g is the activation function of convolutional or fully connected layers, and W_e^l indicates the weights of the l-th layer in the encoder. Note that the first layer features, z̃^0, are equal to the noisy input data, x̃.

2) Following the corrupted encoder, the decoder pathway reconstructs the input data through a fully connected layer and multiple strided convolutional layers as follows,

    ẑ^{l−1} = g(W_d^l ẑ^l),    (10)

where ẑ^l is the l-th reconstruction layer output, and W_d^l shows the weights for the l-th layer of the decoder. Note that the input reconstruction, x̂, is equal to ẑ^0.
Figure 2: Architecture of DEPICT for CMU-PIE dataset. DEPICT consists of a soft-max layer stacked on top of a multi-
layer convolutional autoencoder. In order to illustrate the joint learning framework, we consider the following four pathways
for DEPICT: Noisy (corrupted) encoder, Decoder, Clean encoder and Soft-max layer. The clustering loss function, LE , is
applied on the noisy pathway, and the reconstruction loss functions, L2 , are between the decoder and clean encoder layers.
The output size of convolutional layers, kernel sizes, strides (S), paddings (P) and crops (C) are also shown.
3) The clean feedforward (encoder) pathway shares its weights with the corrupted encoder, and infers the clean embedded features. The following equation shows the outputs of the clean encoder, which are used in the reconstruction loss functions and in obtaining the final cluster assignments.

    z^l = g(W_e^l z^{l−1}),    (11)

where z^l is the clean output of the l-th layer in the encoder. Note that the first layer features z^0 are equal to the input data x.

4) Given the top layer of the corrupted and clean encoder pathways as the embedding subspace, the soft-max layer obtains the cluster assignments using Eq. (1).

Note that we compute target variables Q using the clean pathway, and model prediction P̃ via the corrupted pathway. Hence, the clustering loss function KL(Q ∥ P̃) forces the model to have invariant features with respect to noise. In other words, the model is assumed to have a dual role: a clean model, which is used to compute the more accurate target variables; and a noisy model, which is trained to achieve noise-invariant predictions.

As a crucial point, the DEPICT algorithm provides a joint learning framework that optimizes the soft-max and autoencoder parameters together.

    min_ψ −(1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} q_ik log p̃_ik + (1/N) Σ_{i=1}^{N} Σ_{l=0}^{L−1} (1/|z_i^l|) ‖z_i^l − ẑ_i^l‖_2^2,    (12)

where |z_i^l| is the output size of the l-th hidden layer (input for l = 0), and L is the depth of the autoencoder model.

Algorithm 1: DEPICT Algorithm
1: Initialize Q using a clustering algorithm
2: while not converged do
3:   min_ψ −(1/N) Σ_{i,k} q_ik log p̃_ik + (1/N) Σ_{i,l} (1/|z_i^l|) ‖z_i^l − ẑ_i^l‖_2^2
4:   p_ik^(t) ∝ exp(θ_k^T z_i^L)
5:   q_ik^(t) ∝ p_ik / (Σ_{i'} p_{i'k})^{1/2}
6: end while
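To illustrate how the four pathways and the unified objective of Eq. (12) fit together, here is a deliberately simplified PyTorch-style sketch. It is our own illustration under stated assumptions (fully connected layers in place of the convolutional and strided convolutional layers of Figure 2, tanh in place of leaky ReLU, and our own variable names) and not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepictSketch(nn.Module):
    """Shared-weight noisy/clean encoders, a decoder, and a soft-max layer."""

    def __init__(self, dims=(784, 500, 10), drop=0.1):
        super().__init__()
        # enc[l]: dims[l] -> dims[l+1];  dec[l]: dims[l+1] -> dims[l]
        self.enc = nn.ModuleList([nn.Linear(a, b) for a, b in zip(dims[:-1], dims[1:])])
        self.dec = nn.ModuleList([nn.Linear(b, a) for a, b in zip(dims[:-1], dims[1:])])
        self.theta = nn.Linear(dims[-1], dims[-1])      # soft-max layer (d_z = K here)
        self.drop = drop

    def encode(self, x, noisy):
        """Activations z^0 .. z^L: Eq. (9) on the noisy path, Eq. (11) on the clean one."""
        h = F.dropout(x, self.drop, training=True) if noisy else x
        feats = [h]
        for layer in self.enc:
            h = torch.tanh(layer(h))                    # g(W_e^l z^{l-1})
            if noisy:
                h = F.dropout(h, self.drop, training=True)
            feats.append(h)
        return feats

    def decode(self, z_top):
        """Reconstructions ẑ^0 .. ẑ^L from the top embedding, Eq. (10)."""
        recons = [z_top]
        h = z_top
        for layer in reversed(self.dec):
            h = torch.tanh(layer(h))
            recons.append(h)
        return recons[::-1]                             # index l gives ẑ^l

def joint_loss(model, x, Q):
    """Unified objective of Eq. (12) for one mini-batch with fixed targets Q."""
    noisy = model.encode(x, noisy=True)                 # z̃^0 .. z̃^L
    clean = model.encode(x, noisy=False)                # z^0 .. z^L (same weights)
    recon = model.decode(noisy[-1])                     # ẑ^0 .. ẑ^L
    logits = model.theta(noisy[-1])
    clustering = -(Q * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    # mse_loss with mean reduction matches the 1/(N |z^l|) normalization in Eq. (12)
    reconstruction = sum(F.mse_loss(recon[l], clean[l]) for l in range(len(clean) - 1))
    return clustering + reconstruction

# One alternating round of Algorithm 1 (sketch):
#   with torch.no_grad():
#       P = F.softmax(model.theta(model.encode(x, noisy=False)[-1]), dim=1)
#       Q = P / P.sum(dim=0, keepdim=True).sqrt()       # Eq. (7)
#       Q = Q / Q.sum(dim=1, keepdim=True)
#   loss = joint_loss(model, x, Q)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```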
The benefit of joint learning frameworks for training multi-layer autoencoders is also reported in semi-supervised classification tasks [34, 68]. However, DEPICT is different from previous studies, since it is designed for the unsupervised clustering task; it also does not require the max-pooling switches used in the stacked what-where autoencoder (SWWAE) [68], or the lateral (skip) connections between encoder and decoder layers used in the ladder network [34]. Algorithm 1 shows a brief description of the DEPICT algorithm.

4. Experiments

In this section, we first evaluate DEPICT¹ in comparison with state-of-the-art clustering methods on several benchmark
image datasets. Then, the running speed of the best clustering models is compared. Moreover, we examine different learning approaches for training DEPICT. Finally, we analyze the performance of the DEPICT model on semi-supervised classification tasks.

¹ Our code is available at https://github.com/herandy/DEPICT

Datasets: In order to show that DEPICT works well with various kinds of datasets, we have chosen the following handwritten digit and face image datasets. Considering that clustering tasks are fully unsupervised, we concatenate the training and testing samples when applicable. MNIST-full: A dataset containing a total of 70,000 handwritten digits with 60,000 training and 10,000 testing samples, each being a 28 by 28 monochrome image [17]. MNIST-test: A dataset which only consists of the testing part of MNIST-full data. USPS: A handwritten digits dataset from the USPS postal service, containing 11,000 samples of 16 by 16 images. CMU-PIE: A dataset including 32 by 32 face images of 68 people with 4 different expressions [43]. Youtube-Face (YTF): Following [60], we choose the first 41 subjects of the YTF dataset. Faces inside images are first cropped and then resized to 55 by 55 sizes [55]. FRGC: Using the 20 randomly selected subjects in [60] from the original dataset, we collect 2,462 face images. Similarly, we first crop the face regions and resize them into 32 by 32 images. Table 1 provides a brief description of each dataset.

Dataset      # Samples   # Classes   # Dimensions
MNIST-full   70,000      10          1×28×28
MNIST-test   10,000      10          1×28×28
USPS         11,000      10          1×16×16
FRGC         2,462       20          3×32×32
YTF          10,000      41          3×55×55
CMU-PIE      2,856       68          1×32×32

Table 1: Dataset Descriptions

Clustering Metrics: We have used two of the most popular evaluation criteria widely used for clustering algorithms, accuracy (ACC) and normalized mutual information (NMI). The best mapping between cluster assignments and true labels is computed using the Hungarian algorithm [16] to measure accuracy. NMI calculates the normalized measure of similarity between two labels of the same data [59]. Results of NMI do not change by permutations of clusters (classes), and they are normalized to have [0, 1] range, with 0 meaning no correlation and 1 exhibiting perfect correlation.
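Both metrics are available off the shelf; the small sketch below (our own, using scipy and scikit-learn rather than any code from the paper) shows how ACC can be computed with the Hungarian algorithm and NMI with a single library call.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC: best one-to-one mapping between clusters and labels (Hungarian algorithm)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                                  # co-occurrence counts
    rows, cols = linear_sum_assignment(-cost)            # maximize matched samples
    return cost[rows, cols].sum() / len(y_true)

# NMI is permutation-invariant and lies in [0, 1]:
# nmi = normalized_mutual_info_score(y_true, y_pred)
```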
4.1. Evaluation of Clustering Algorithm

Alternative Models: We compare our clustering model, DEPICT, with several baseline and state-of-the-art clustering algorithms, including K-means, normalized cuts (N-Cuts) [41], self-tuning spectral clustering (SC-ST) [64], large-scale spectral clustering (SC-LS) [6], graph degree linkage-based agglomerative clustering (AC-GDL) [65], agglomerative clustering via path integral (AC-PIC) [66], spectral embedded clustering (SEC) [30], local discriminant models and global integration (LDMGI) [61], NMF with deep model (NMF-D) [46], task-specific clustering with deep model (TSC-D) [52], deep embedded clustering (DEC) [56], and joint unsupervised learning (JULE) [60].

Implementation Details: We use a common architecture for DEPICT and avoid tuning any hyper-parameters using the labeled data in order to provide a practical algorithm for real-world clustering tasks. For all datasets, we consider two convolutional layers followed by a fully connected layer in the encoder and decoder pathways. While for all convolutional layers the feature map size is 50 and the kernel size is about 5×5, the dimension of the embedding subspace is set equal to the number of clusters in each dataset. We also pick the proper stride, padding and crop to have an output size of about 10×10 in the second convolutional layer. Inspired by [32], we consider leaky rectified linear unit (leaky ReLU) non-linearity [23] as the activation function of convolutional and fully connected layers, except in the last layer of the encoder and the first layer of the decoder, which have tanh non-linearity functions. Consequently, we normalize the image intensities to be in the range of [−1, 1]. Moreover, we set the learning rate and dropout to 10^−4 and 0.1 respectively, and adopt Adam as our optimization method with the default hyper-parameters β1 = 0.9, β2 = 0.999, ε = 1e−08 [13]. The weights of convolutional and fully connected layers are all initialized by the Xavier approach [9]. Since the clustering assignments in the first iterations are random and not reliable for the clustering loss, we first train DEPICT without the clustering loss function for a while, then initialize the clustering assignment q_ik by clustering the embedding subspace features via simple algorithms like K-means or AC-PIC.
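For reference, these optimizer and initialization settings translate directly into framework calls; a hedged PyTorch-style snippet (assuming a `model` such as the sketch in Section 3.2, which is our illustration rather than the released code):

```python
import torch.nn as nn
import torch.optim as optim

def init_weights(m):
    # Xavier initialization for convolutional and fully connected layers
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d, nn.Linear)):
        nn.init.xavier_uniform_(m.weight)

model.apply(init_weights)
optimizer = optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-8)
```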
Quantitative Comparison: We run DEPICT and the other clustering methods on each dataset. We followed the implementation details for DEPICT and report the average results from 5 runs. For the rest, we present the best reported results either from their original papers or from [60]. For unreported results on specific datasets, we run the released code with the hyper-parameters mentioned in the original papers; these results are marked by (∗) on top. But, when the code is not publicly available, or running the released code is not practical, we put dash marks (-) instead of the corresponding results. Moreover, we mention the number of hyper-parameters that are tuned using supervisory signals (labeled data) for each algorithm. Note that this number only shows the quantity of hyper-parameters, which are set differently for various datasets for better performance.

Table 2 reports the clustering metrics, normalized mutual information (NMI) and accuracy (ACC), of the algorithms on the aforementioned datasets. As shown, DEPICT outperforms other algorithms on four datasets and achieves
Method     MNIST-full   MNIST-test   USPS         FRGC         YTF          CMU-PIE      # tuned
           NMI   ACC    NMI   ACC    NMI   ACC    NMI   ACC    NMI   ACC    NMI   ACC    HPs
K-means 0.500∗ 0.534∗ 0.501∗ 0.547∗ 0.450∗ 0.460∗ 0.287∗ 0.243∗ 0.776∗ 0.601∗ 0.432∗ 0.223∗ 0
N-Cuts 0.411 0.327 0.753 0.304 0.675 0.314 0.285 0.235 0.742 0.536 0.411 0.155 0
SC-ST 0.416 0.311 0.756 0.454 0.726 0.308 0.431 0.358 0.620 0.290 0.581 0.293 0
SC-LS 0.706 0.714 0.756 0.740 0.681 0.659 0.550 0.407 0.759 0.544 0.788 0.549 0
AC-GDL 0.017 0.113 0.844 0.933 0.824 0.867 0.351 0.266 0.622 0.430 0.934 0.842 1
AC-PIC 0.017 0.115 0.853 0.920 0.840 0.855 0.415 0.320 0.697 0.472 0.902 0.797 0
SEC 0.779∗ 0.804∗ 0.790∗ 0.815∗ 0.511∗ 0.544∗ - - - - - - 1
LDMGI 0.802∗ 0.842∗ 0.811∗ 0.847∗ 0.563∗ 0.580∗ - - - - - - 1
NMF-D 0.152∗ 0.175∗ 0.241∗ 0.250∗ 0.287∗ 0.382∗ 0.259∗ 0.274∗ 0.562∗ 0.536∗ 0.920∗ 0.810∗ 0
TSC-D 0.651 0.692 - - - - - - - - - - 2
DEC 0.816∗ 0.844∗ 0.827∗ 0.859∗ 0.586∗ 0.619∗ 0.505∗ 0.378∗ 0.446∗ 0.371∗ 0.924∗ 0.801∗ 1
JULE-SF 0.906 0.959 0.876 0.940 0.858 0.922 0.566 0.461 0.848 0.684 0.984 0.980 3
JULE-RC 0.913 0.964 0.915 0.961 0.913 0.950 0.574 0.461 0.848 0.684 1.00 1.00 3
DEPICT 0.917 0.965 0.915 0.963 0.927 0.964 0.610 0.470 0.802 0.621 0.974 0.883 0
Table 2: Clustering performance of different algorithms on image datasets based on accuracy (ACC) and normalized mutual
information (NMI). The numbers of tuned hyper-parameters (# tuned HPs) using the supervisory signals are also shown for
each algorithm. The results of alternative models are reported from original papers, except the ones marked by (∗) on top,
which are obtained by us running the released code. We put dash marks (-) for the results that are not practical to obtain.
competitive results on the remaining two. It should be noted that we think hyper-parameter tuning using supervisory signals is not feasible in real-world clustering tasks, and hence DEPICT is a significantly better clustering algorithm compared to the alternative models in practice. For example,

(Figure: run time in seconds, logarithmic scale, comparing JULE-RC, JULE-SF, JULE-RC(fast), JULE-SF(fast), and DEPICT.)
in which all of the autoencoder layers are retrained after the pretraining step, only using the reconstruction of input layer while data is not corrupted by noise. The fine-tuning step is also done after the retraining step. 3) Our learning approach (MdA), in which the whole model is trained simultaneously using the joint reconstruction loss functions from all layers along with the clustering objective function.

Furthermore, we also examine the effect of clustering loss (through error back-prop) in constructing the embedding subspace. To do so, we train a similar multi-layer convolutional autoencoder (Deep-ConvAE) only using the reconstruction loss function to generate the embedding subspace. Then, we run the best shallow clustering algorithm (AC-PIC) on the embedded data. Hence, this model (Deep-ConvAE+AC-PIC) differs from DEPICT in the sense that its embedding subspace is only constructed using the reconstruction loss and does not involve the clustering loss.

Table 3 indicates the results of DEPICT and Deep-ConvAE+AC-PIC when using the different learning approaches. As expected, DEPICT trained by our joint learning approach (MdA) consistently outperforms the other alternatives on all datasets. Interestingly, the MdA learning approach shows promising results for the Deep-ConvAE+AC-PIC model, where only reconstruction losses are used to train the embedding subspace. Thus, our learning approach is an efficient strategy for training autoencoder models due to its superior results and fast end-to-end training.
4.4. Semi-Supervised Classification Performance

Representation learning in an unsupervised manner or using a small number of labeled data has recently attracted great attention. Due to the potential of our model in learning a discriminative embedding subspace, we evaluate DEPICT in a semi-supervised classification task. Following the semi-supervised experiment settings [34, 68], we train our model using a small random subset of the MNIST training dataset as labeled data and the remaining as unlabeled data. The classification error of DEPICT is then computed using the MNIST-test dataset, which is not seen during training. Compared to our unsupervised learning approach, we only utilize the clusters corresponding to each labeled data point in the training process. In particular, only for labeled data, the cluster labels (assignments) are set using the best-map technique from the original classification labels once, and then they will be fixed during the training step.

Model          100          1000         3000
T-SVM [47]     16.81        5.38         3.45
CAE [36]       13.47        4.77         3.22
MTC [35]       12.03        3.64         2.57
PL-DAE [18]    10.49        3.46         2.69
AtlasRBF [31]  8.10         3.68         -
M1+M2 [14]     3.33±0.14    2.40±0.05    2.18±0.04
SWWAE [68]     8.71±0.34    2.83±0.10    2.10±0.22
Ladder [34]    1.06±0.37    0.84±0.08    -
DEPICT         2.65±0.35    2.10±0.11    1.91±0.06

Table 4: Comparison of DEPICT and several semi-supervised classification models on the MNIST dataset with different numbers of labeled data.

Table 4 shows the error results for several semi-supervised classification models using different numbers of labeled data. Surprisingly, DEPICT achieves comparable results with the state-of-the-art, despite the fact that the semi-supervised classification models use 10,000 validation samples to tune their hyper-parameters, while DEPICT only employs the labeled training data (e.g. 100 samples) and does not tune any hyper-parameters. Although DEPICT is not mainly designed for classification tasks, it outperforms several models including SWWAE [68], M1+M2 [14], and AtlasRBF [31], and has comparable results with the complicated Ladder network [34]. These results further confirm the discriminative quality of the embedding features of DEPICT.

5. Conclusion

In this paper, we proposed a new deep clustering model, DEPICT, consisting of a soft-max layer stacked on top of a multi-layer convolutional autoencoder. We employed a regularized relative entropy loss function for clustering, which leads to balanced cluster assignments. Adopting our autoencoder reconstruction loss function enhanced the embedding learning. Furthermore, a joint learning framework was introduced to train all network layers simultaneously and avoid layer-wise pretraining. Experimental results showed that DEPICT is a good candidate for real-world clustering tasks, since it achieved superior or competitive results compared to alternative methods while having faster running speed and not needing hyper-parameter tuning. Efficiency
of our joint learning approach was also confirmed in clustering and semi-supervised classification tasks.

References

[1] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications, volume 27. ACM, 1998.
[2] B. Bahmani, B. Moseley, A. Vattani, R. Kumar, and S. Vassilvitskii. Scalable k-means++. Proceedings of the VLDB Endowment, 5(7):622–633, 2012.
[3] D. Barber and F. V. Agakov. Kernelized infomax clustering. In Advances in Neural Information Processing Systems (NIPS), pages 17–24, 2005.
[4] J. C. Bezdek, R. Ehrlich, and W. Full. FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences, 10(2-3):191–203, 1984.
[5] C. Biernacki, G. Celeux, and G. Govaert. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(7):719–725, 2000.
[6] X. Chen and D. Cai. Large scale spectral clustering with landmark-based representation. In AAAI, 2011.
[7] H. Gao, F. Nie, X. Li, and H. Huang. Multi-view subspace clustering. International Conference on Computer Vision (ICCV 2015), pages 4238–4246, 2015.
[8] H. Gao, X. Wang, and H. Huang. New robust clustering model for identifying cancer genome landscapes. IEEE International Conference on Data Mining (ICDM 2016), pages 151–160, 2016.
[9] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, volume 9, pages 249–256, 2010.
[10] K. A. Heller and Z. Ghahramani. Bayesian hierarchical clustering. In Proceedings of the 22nd International Conference on Machine Learning (ICML). ACM, 2005.
[11] K. Kailing, H.-P. Kriegel, and P. Kröger. Density-connected subspace clustering for high-dimensional data. In Proceedings of the 2004 SIAM International Conference on Data Mining, pages 246–256. SIAM, 2004.
[12] E. Keogh, S. Chu, D. Hart, and M. Pazzani. An online algorithm for segmenting time series. In Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on, pages 289–296. IEEE, 2001.
[13] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[14] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems (NIPS), pages 3581–3589, 2014.
[15] A. Krause, P. Perona, and R. G. Gomes. Discriminative clustering by regularized information maximization. In Advances in Neural Information Processing Systems (NIPS), pages 775–783, 2010.
[16] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[18] D.-H. Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, page 2, 2013.
[19] H. Li, K. Zhang, and T. Jiang. Minimum entropy clustering and applications to gene expression analysis. In Computational Systems Bioinformatics Conference (CSB 2004), pages 142–151. IEEE, 2004.
[20] Y. Li, F. Nie, H. Huang, and J. Huang. Large-scale multi-view spectral clustering via bipartite graph. Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI 2015), 2015.
[21] S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
[22] D. Luo, C. Ding, and H. Huang. Consensus spectral clustering. ICDE, 2010.
[23] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, 2013.
[24] Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
[25] A. Y. Ng, M. I. Jordan, Y. Weiss, et al. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems (NIPS), 2:849–856, 2002.
[26] F. Nie, C. Deng, H. Wang, X. Gao, and H. Huang. New l1-norm relaxations and optimizations for graph clustering. Thirtieth AAAI Conference on Artificial Intelligence (AAAI 2016), 2016.
[27] F. Nie and H. Huang. Subspace clustering via new discrete group structure constrained low-rank model. 25th International Joint Conference on Artificial Intelligence (IJCAI), pages 1874–1880, 2016.
[28] F. Nie, X. Wang, and H. Huang. Clustering and projected clustering via adaptive neighbor assignment. The 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2014), pages 977–986, 2014.
[29] F. Nie, X. Wang, M. I. Jordan, and H. Huang. The constrained Laplacian rank algorithm for graph-based clustering. In AAAI, pages 1969–1976. Citeseer, 2016.
[30] F. Nie, Z. Zeng, I. W. Tsang, D. Xu, and C. Zhang. Spectral embedded clustering: A framework for in-sample and out-of-sample spectral clustering. IEEE Transactions on Neural Networks, 22(11):1796–1808, 2011.
[31] N. Pitelis, C. Russell, and L. Agapito. Semi-supervised learning using an unsupervised atlas. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 565–580. Springer, 2014.
[32] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[33] R. Raina, Y. Shen, A. Y. Ng, and A. McCallum. Classification with hybrid generative/discriminative models. In Advances in Neural Information Processing Systems (NIPS), volume 16, 2003.
[34] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems (NIPS), pages 3546–3554, 2015.
[35] S. Rifai, Y. N. Dauphin, P. Vincent, Y. Bengio, and X. Muller. The manifold tangent classifier. In Advances in Neural Information Processing Systems (NIPS), pages 2294–2302, 2011.
[36] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 833–840, 2011.
[37] V. Roth and T. Lange. Feature selection in clustering problems. In Advances in Neural Information Processing Systems (NIPS), pages 473–480, 2003.
[38] N. Sadoughi and C. Busso. Retrieving target gestures toward speech driven animation with meaningful behaviors. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pages 115–122. ACM, 2015.
[39] N. Sadoughi, Y. Liu, and C. Busso. MSP-AVATAR corpus: Motion capture recordings to study the role of discourse functions in the design of intelligent virtual agents. In Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on, 2015.
[40] M. E. Sargin, Y. Yemez, E. Erzin, and A. M. Tekalp. Analysis of head gesture and prosody patterns for prosody-driven head-gesture animation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(8):1330–1345, 2008.
[41] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
[42] H. Shinnou and M. Sasaki. Spectral clustering for a large data set by reducing the similarity matrix size. In LREC, 2008.
[43] T. Sim, S. Baker, and M. Bsat. The CMU pose, illumination, and expression (PIE) database. In Automatic Face and Gesture Recognition, 2002. Proceedings. Fifth IEEE International Conference on, pages 46–51. IEEE, 2002.
[44] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[45] F. Tian, B. Gao, Q. Cui, E. Chen, and T.-Y. Liu. Learning deep representations for graph clustering. In AAAI, pages 1293–1299, 2014.
[46] G. Trigeorgis, K. Bousmalis, S. Zafeiriou, and B. Schuller. A deep semi-NMF model for learning hidden representations. In ICML, pages 1692–1700, 2014.
[47] V. Vapnik. Statistical learning theory, volume 1. Wiley, New York, 1998.
[48] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371–3408, 2010.
[49] D. Wang, F. Nie, and H. Huang. Unsupervised feature selection via unified trace ratio formulation and k-means clustering (track). European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2014), pages 306–321, 2014.
[50] H. Wang, F. Nie, and H. Huang. Multi-view clustering and feature learning via structured sparsity. The 30th International Conference on Machine Learning (ICML 2013), pages 352–360, 2013.
[51] X. Wang, F. Nie, and H. Huang. Structured doubly stochastic matrix for graph based clustering. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1245–1254, 2016.
[52] Z. Wang, S. Chang, J. Zhou, M. Wang, and T. S. Huang. Learning a task-specific deep architecture for clustering. In Proceedings of the 2016 SIAM International Conference on Data Mining, pages 369–377. SIAM, 2016.
[53] Z. Wang, Y. Yang, S. Chang, J. Li, S. Fong, and T. S. Huang. A joint optimization framework of sparse coding and discriminative clustering. In International Joint Conference on Artificial Intelligence (IJCAI), volume 1, page 4, 2015.
[54] C. K. Williams. An MCMC approach to hierarchical mixture modelling. In Advances in Neural Information Processing Systems (NIPS), pages 680–686, 1999.
[55] L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched background similarity. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 529–534. IEEE, 2011.
[56] J. Xie, R. Girshick, and A. Farhadi. Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning (ICML), 2016.
[57] P. Xie and E. P. Xing. Integrating image clustering and codebook learning. In AAAI, pages 1903–1909, 2015.
[58] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum margin clustering. In Advances in Neural Information Processing Systems (NIPS), pages 1537–1544, 2004.
[59] W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 267–273. ACM, 2003.
[60] J. Yang, D. Parikh, and D. Batra. Joint unsupervised learning of deep representations and image clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5147–5156, 2016.
[61] Y. Yang, D. Xu, F. Nie, S. Yan, and Y. Zhuang. Image clustering using local discriminant models and global integration. IEEE Transactions on Image Processing, 19(10):2761–2773, 2010.
[62] J. Ye, Z. Zhao, and M. Wu. Discriminative k-means for clustering. In Advances in Neural Information Processing Systems (NIPS), pages 1649–1656, 2008.
[63] R. Yeh, C. Chen, T. Y. Lim, M. Hasegawa-Johnson, and M. N. Do. Semantic image inpainting with perceptual and contextual losses. arXiv preprint arXiv:1607.07539, 2016.
[64] L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. In Advances in Neural Information Processing Systems (NIPS), volume 17, page 16, 2004.
[65] W. Zhang, X. Wang, D. Zhao, and X. Tang. Graph degree linkage: Agglomerative clustering on a directed graph. In European Conference on Computer Vision, pages 428–441. Springer, 2012.
[66] W. Zhang, D. Zhao, and X. Wang. Agglomerative clustering via maximum incremental path integral. Pattern Recognition, 46(11):3056–3065, 2013.
[67] B. Zhao, F. Wang, and C. Zhang. Efficient multiclass maximum margin clustering. In ICML, pages 1248–1255, 2008.
[68] J. Zhao, M. Mathieu, R. Goroshin, and Y. LeCun. Stacked what-where auto-encoders. arXiv preprint arXiv:1506.02351, 2015.
[69] F. Zhou, F. De la Torre, and J. K. Hodgins. Hierarchical aligned cluster analysis for temporal clustering of human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(3):582–596, 2013.

A. Architecture of Convolutional Autoencoder Networks

In this paper, we have two convolutional layers plus one fully connected layer in both the encoder and decoder pathways for all datasets. In order to have the same size outputs for corresponding convolutional layers in the decoder and encoder, which is necessary for calculating the reconstruction loss functions, the kernel size, stride and padding (crop in the decoder) are varied for different datasets. Moreover, the number of fully connected features (outputs) is chosen equal to the number of clusters for each dataset. Table 5 represents the detailed architecture of the convolutional autoencoder networks for each dataset.

B. Visualization of learned embedding subspace

In this section, we visualize the learned embedding subspace (top encoder layer) in different stages using the first two principal components. The embedding representations are shown in three stages: 1) the initial stage, where the parameters are randomly initialized with GlorotUniform; 2) the intermediate stage before adding L_E, where the parameters are trained only using the reconstruction loss functions; 3) the final stage, where the parameters are fully trained using both the clustering and reconstruction loss functions. Figure 4 illustrates the three stages of embedding features for the MNIST-full, MNIST-test, and USPS datasets, and Figure 5 shows the three stages for the FRGC, YTF, and CMU-PIE datasets.
Table 5: Architecture of deep convolutional autoencoder networks. Conv1, Conv2 and Fully represent the specifications of
the first and second convolutional layers in encoder and decoder pathways and the stacked fully connected layer.
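The 2-D views in Figures 4 and 5 are straightforward to reproduce for any trained encoder; a minimal sketch (ours, assuming an (N, d_z) feature matrix `Z` and integer `labels` as inputs) using scikit-learn PCA:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_embedding(Z, labels, title):
    """Project top-encoder features onto their first two principal components."""
    Z2 = PCA(n_components=2).fit_transform(Z)
    plt.scatter(Z2[:, 0], Z2[:, 1], c=labels, s=2, cmap="tab10")
    plt.title(title)
    plt.show()
```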
Figure 4: Embedding features in different learning stages on the MNIST-full, MNIST-test, and USPS datasets. Three stages, including the initial stage, the intermediate stage before adding the clustering loss, and the final stage, are shown for all datasets (panels (a)–(i)).
Figure 5: Embedding features in different learning stages on the FRGC, YTF and CMU-PIE datasets. Three stages, including the initial stage, the intermediate stage before adding the clustering loss, and the final stage, are shown for all datasets (panels (a)–(i)).