
Knowledge-Based Systems 258 (2022) 109984

journal homepage: www.elsevier.com/locate/knosys

Transformer-based Dynamic Fusion Clustering Network



Chunchun Zhang a, Yaliang Zhao a,∗, Jinke Wang a,b

a Henan Key Laboratory of Big Data Analysis and Processing, School of Computer and Information Engineering, Henan University, Kaifeng 475000, PR China
b School of Software, Henan University, Kaifeng 475000, PR China
∗ Corresponding author. E-mail address: [email protected] (Y. Zhao).

Article history: Received 6 April 2022; Received in revised form 15 September 2022; Accepted 4 October 2022; Available online 10 October 2022

Keywords: Deep clustering; Dynamic attention mechanism; Transformer network; Self-supervised learning; Feature fusion

Abstract

Clustering is an advanced task in machine learning. Numerous studies have improved clustering performance by integrating deep learning into clustering technology. However, some limitations still exist in current deep clustering research: (1) there is no dynamic fusion mechanism to help multiple deep learning networks jointly train node information; (2) data structure embedding methods are not mature enough, and as the number of layers in the deep learning network deepens, the ability to learn data representations decreases, resulting in low performance. In contrast to these clustering methods, we propose the Transformer-based Dynamic Fusion Clustering Network (TDCN), a novel deep clustering network built mainly on the Transformer architecture that successfully addresses these issues and improves clustering performance. Specifically, a new dynamic attention mechanism is used to fuse the features of the Transformer and autoencoder (AE) networks in TDCN. In order to obtain data structural information, a new transformation operation G is designed; G varies according to the characteristics of the source data and helps represent the data structure. In addition, TDCN stacks multi-layer, multi-scale heterogeneous networks to learn node representations and further integrates information at different scales through specific modules, so as to facilitate efficient extraction of information. The whole deep clustering network is trained by a dual self-supervision mechanism. Experiments indicate that our model can achieve comparable or even better performance than the state-of-the-art methods on five datasets.

© 2022 Elsevier B.V. All rights reserved.
https://doi.org/10.1016/j.knosys.2022.109984

1. Introduction

Clustering is a fundamental problem in many data-driven application domains, which helps classify data samples into different groups based on their characteristics and structural similarities. In recent years, numerous clustering approaches have been successfully applied in many fields, including object detection [1], group activity recognition [2], text question answering [3], face and landmark recognition [4], and so on. Clustering performance largely depends on the quality of the data representation, while traditional clustering methods [5,6] mainly focus on feature transformation or clustering independently. Therefore, traditional clustering methods cannot fully learn the data representation, which results in poor performance on clustering tasks.

With the rapid growth of deep learning, great efforts have been devoted to establishing a framework for the unified training of node representation information and clustering information by designing new deep neural networks. Hinton et al. [7] proposed an AE network to learn low-dimensional codes, which performs better than Principal Component Analysis (PCA) and has become a classical deep clustering algorithm. Xie et al. [8] proposed Deep Embedding Clustering (DEC) to learn the mapping from the data space to a low-dimensional feature space; additionally, it established a joint optimization space to optimize the clustering objective. Mukherjee et al. [9] proposed deep clustering based on generative adversarial networks (ClusterGAN), which samples latent variables from a mixture of one-hot encoded variables and continuous latent variables and combines a reverse network (projecting the data into the latent space) with a cluster-specific loss, thereby implementing clustering in the latent space. These methods only pay attention to the characteristics of the data itself and thus ignore the data structural information, which is as important as the feature information for the clustering representation.

Since the great breakthrough of graph neural networks, a large number of researchers have realized the importance of data structure for data representation learning. In recent research, multiple deep neural networks are used to construct a unified architecture that extracts data features and captures the underlying data structure simultaneously. To reveal the data structure, [10–12] construct a K-Nearest Neighbor (KNN) graph, setting different K values according to the dataset. Further, in these studies, more than one type of deep learning network is employed: one is used to extract features from the data, and another is used to capture the data's underlying structure. They fuse the data representations learned by the different networks with a certain mechanism.


Although these efforts have achieved preferable performance, there are still several challenging issues: (1) the network most commonly used in clustering networks to capture data structure features is the graph convolutional network (GCN) [13], which has some limitations, such as over-smoothing; (2) the existing approaches to fusing different deep learning networks lack an efficient dynamic information fusion and processing mechanism.

To address the above issues and improve the performance of deep clustering, this paper proposes a new deep clustering network named Transformer-based Dynamic Fusion Clustering Network (TDCN). Inspired by the high performance of the Transformer on graph representation learning [14–16], TDCN uses Transformer network variants and an AE to construct a unified architecture that extracts data features and captures the underlying data structure simultaneously. Specifically, TDCN mainly consists of two modules: a multi-network feature fusion module (TDCN-M) and a multi-scale heterogeneous fusion module (TDCN-S). TDCN-M mainly performs feature fusion of the AE and Transformer modules; the feature information extracted by each layer of the AE is input into a Transformer module for further feature–structure fusion. TDCN-S mainly performs feature fusion of the Transformer modules to obtain multivariate information at different scales. Both TDCN-M and TDCN-S adopt the dynamic graph attention mechanism [17] for fusion. In TDCN, each Transformer module has multiple multi-head attention layers, feed-forward networks, and task-based MLP layers for a better clustering-friendly representation. To enhance the local information of the Transformer network, we introduce a Gaussian prior probability [18] in each self-attention layer. Moreover, to capture the latent structure of the data, we design a transformation operation G to input the fused node features into the network in a certain form. The contributions of this paper are listed as follows:

• To the best of our knowledge, this is one of the first works to utilize AE and Transformer networks to build a joint training architecture in a deep clustering network. We show that the encoder of the Transformer network can enhance the performance of TDCN, because TDCN can stack multiple self-attention layers to deeply mine node information at the same scale. Moreover, TDCN will not cause over-smoothing as the number of model layers deepens.
• A new dynamic attention mechanism is adopted to fuse the AE and Transformer, which is more expressive than the attention mechanism used in AGCN [12]. Meanwhile, a Gaussian prior probability is utilized in the Transformer network to enhance local information.
• Extensive experiments on five datasets indicate that our TDCN architecture is competitive with or exceeds the state-of-the-art methods.

2. Related work

To make the paper self-contained, we introduce some related work on deep clustering.

With the increasing popularity of deep learning, deep clustering has been intensively studied in machine learning. It jointly learns clustering-friendly representations by building a joint optimization space of deep learning and clustering. For instance, DEC [8] iteratively optimizes the Kullback–Leibler divergence between soft labels and an auxiliary target distribution to improve clustering assignment and feature representation. IDEC [19] applies an incomplete AE based on the DEC network architecture and preserves local structural information through a reconstruction loss; after that, it can learn clustering features suitable for local structure preservation. Based on these two pioneering works, deep learning is widely used in various clustering algorithms. Affeldt et al. [20] construct a model combining spectral clustering and a deep autoencoder network, which can effectively solve the robustness issues of deep clustering caused by factor parameters. Furthermore, Peng et al. [21] propose a new subspace clustering algorithm based on deep learning; in this method, a new data affinity matrix can be inferred by simultaneously satisfying the sparsity principle of SSC and the nonlinear characteristics of neural networks. Guo et al. [22] adopt autonomous learning and data alignment in deep clustering: the method fine-tunes the self-encoding structure iteratively and updates the cluster centers by aligning samples, so that the robustness and applicability of the deep clustering algorithm can be further improved. These works have achieved progressive results and further promoted the research of deep clustering.

However, in these works, the clustering performance reaches its limit because data structures are ignored. Some new clustering methods with kernel learning, adversarial training, attention, and convolution mechanisms have been proposed to exploit the structural features of the data. For example, a unified framework for graph construction and low-rank kernel learning is designed in [23]; consensus kernels and graphs iteratively enhance each other to improve clustering performance. ARGA [24] introduces adversarial regularization for graph embedding to learn the topology and feature representation of data in vector spaces. DAEGC [25], based on GAT [26], proposes to construct a proximity matrix and reconstruct the graph structure by inner-product operations; it is one of the first works to introduce an attention mechanism in deep clustering. SDCN [10] reveals the structure of the underlying data by constructing a KNN graph and then utilizes a GCN to learn the graph embedding. A novel clustering network fusion module was first proposed in DFCN [11]; it fuses the representations learned by the AE and a graph autoencoder (GAE) in a specific way, and a graph convolution-like operation is used to fuse structural and feature information. Inspired by SDCN [10], AGCN [12] utilizes an attention mechanism to combine AE and GCN representations; it also proposes multi-scale modules to capture the information of different neural network layers.

Although existing deep clustering methods have significantly improved clustering performance, they still have some limitations: (1) the attention mechanism used to fuse representations learned by multiple network architectures is not efficient enough; (2) the weaknesses of the GCN and GAE architectures, such as over-smoothing, are ignored when using them to learn graph embeddings. In conclusion, the existing deep clustering networks cannot accurately extract much of the useful information, so there is still space for improvement.

3. Methodology

In this section, the Transformer-based dynamic fusion clustering network (TDCN) is discussed in detail. The network is mainly composed of TDCN-M and TDCN-S modules, as shown in Fig. 1. We first input the data features into the AE and the Transformer through the transformation operation G. Then the output features generated by each layer of the AE and the features extracted by the corresponding Transformer module are input into TDCN-M. After several TDCN-M modules, the features extracted by the TDCN-M modules are finally input to the TDCN-S module for multi-scale information fusion. Moreover, a dual self-supervision mechanism is used in the neural network training process. We describe the modules of our proposed model in detail below.
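As a concrete illustration of how the transformation operation G can be obtained for non-graph data (as described in Sections 3.1.1 and 4.4, G is either a normalized adjacency matrix built from a KNN graph or an identity matrix, depending on the dataset), the following is a minimal sketch. The function name build_g and the symmetric normalization used here are our own illustrative assumptions, not the authors' released code.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def build_g(x, k=5, use_identity=False):
    """Illustrative transformation operation G (assumption, not the authors' code).

    x: (n, d) feature matrix of a non-graph dataset.
    Returns an (n, n) matrix: either the identity (TDCN_I variant) or a
    symmetrically normalized adjacency of a KNN graph, D^-1/2 (A + I) D^-1/2.
    """
    n = x.shape[0]
    if use_identity:
        return np.eye(n)
    a = kneighbors_graph(x, n_neighbors=k, mode="connectivity").toarray()
    a = np.maximum(a, a.T)          # symmetrize the KNN graph
    a = a + np.eye(n)               # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
```

For graph datasets the same normalization would simply be applied to the given adjacency matrix instead of a KNN graph.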

Fig. 1. The overall architecture of TDCN. TDCN contains two modules TDCN-M and TDCN-S. TDCN-M is used to extract data features, and TDCN-S is used to fuse
multi-scale information. X is raw data, and X̂ denotes the predicted value output by Decoder. H(l) represents the output representation of the lth layer encoding of
AE, Z(l) represents the output of the lth Transformer module. F represents the dynamic fusion mechanism, G represents the transformation operation. The two blue
dotted lines represent the dual self-supervision mechanism.

Fig. 2. The overall architecture of the Transformer module.

3.1. TDCN-M

To sufficiently explore node attribute information, the multi-network feature fusion module (TDCN-M) is proposed. This module utilizes a dynamic attention mechanism to fuse the AE and Transformer networks to help extract the node features.

3.1.1. AE module
AE is a basic network architecture commonly used in deep clustering for data feature extraction. AE consists of an encoder module and a decoder module. In this paper, we employ the encoder module of the AE to learn data representations, and the decoder module is used to perform feature reconstruction. The input of the first encoder layer differs from that of the other encoder layers and is given as

H^(0) = GX,   (1)

where X denotes the raw data and G can be a normalized adjacency matrix or an identity matrix, depending on the dataset. The two modules of the AE can respectively be denoted as

H^(l) = σ(W_e^(l) H^(l−1) + b_e^(l)),   (2)
Ĥ^(h) = σ(W_d^(h) Ĥ^(h−1) + b_d^(h)),   (3)

where H^(l) ∈ R^{n×d_l} is the feature learned by the lth layer of the encoder module, Ĥ^(h) is the output of the hth layer of the decoder module, W_e^(l) and b_e^(l) denote the learnable weight and the learnable bias in the encoder layer, and W_d^(h) and b_d^(h) are the learnable parameters in the decoder layer. σ is a non-linear activation function; here we use ReLU. To narrow the gap between the node features reconstructed by the decoder and the original node features, we construct the loss function

L_R = ∥X − X̂∥_F^2,   (4)

where X̂ is the final output of the decoder.
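A minimal PyTorch sketch of such an AE (Eqs. (1)–(4)) is given below. The layer sizes follow the 500-500-2000-10 setting reported later in Section 4.3, but the class itself is an illustrative assumption rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class AE(nn.Module):
    """Illustrative autoencoder for Eqs. (1)-(4); dims follow the 500-500-2000-10 setting."""
    def __init__(self, d_in, dims=(500, 500, 2000, 10)):
        super().__init__()
        enc, dec = [], []
        sizes = [d_in, *dims]
        for i in range(len(sizes) - 1):                 # encoder layers, Eq. (2), with ReLU
            enc += [nn.Linear(sizes[i], sizes[i + 1]), nn.ReLU()]
        for i in reversed(range(len(sizes) - 1)):       # mirrored decoder layers, Eq. (3)
            dec.append(nn.Linear(sizes[i + 1], sizes[i]))
            if i != 0:                                  # keep the reconstruction layer linear
                dec.append(nn.ReLU())
        self.encoder, self.decoder = nn.Sequential(*enc), nn.Sequential(*dec)

    def forward(self, gx):
        h = self.encoder(gx)       # latent representation of H^(0) = GX
        x_hat = self.decoder(h)    # reconstruction X_hat
        return h, x_hat

# Reconstruction loss of Eq. (4): L_R = ||X - X_hat||_F^2
# loss_r = torch.sum((x - x_hat) ** 2)
```

In the full model, each intermediate encoder output H^(l) would also be exposed so that it can be fused with the corresponding Transformer output (Section 3.1.3).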
3.1.2. Transformer module
Based on the success of the Transformer in natural language processing, a large number of researchers have applied Transformer networks beyond sequences to domains such as images [27] and graphs [28] to learn data representations, and all of them have achieved great success. Thus, in order to make the clustering performance more expressive, TDCN uses the Transformer model for further fusion of node information. In this paper, we only use the Transformer encoder. The Transformer architecture used in the paper is shown in Fig. 2. The Transformer module mainly consists of three parts: an input layer, M encoder blocks, and N task-based MLP layers.

Input. First of all, the input node feature Z^(l−1) ∈ R^{n×d_(l−1)} is embedded into a d-dimensional hidden feature space via a linear projection:

Z^(l−1) = W_0 Z^(l−1) + b_0,   (5)

where W_0 and b_0 are the parameters of the linear projection layer, and Z^(l−1) ∈ R^{n×d} is the input of the encoder blocks.
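The three-part structure just described (linear input projection, M encoder blocks, N task-based MLP layers) can be sketched as follows. This skeleton uses PyTorch's stock encoder layers purely for illustration and omits the Gaussian bias term introduced in the next paragraph; the class and parameter names are our own assumptions.

```python
import torch.nn as nn

class TransformerModule(nn.Module):
    """Illustrative skeleton: input projection -> M encoder blocks -> N-layer MLP head.
    Standard nn.TransformerEncoderLayer is used here; the paper's Gaussian-biased
    attention (Eq. (7)) would require a custom attention layer and is omitted."""
    def __init__(self, d_in, d_out, d_hidden=64, n_heads=8, m_blocks=2, n_mlp=1):
        super().__init__()
        self.proj = nn.Linear(d_in, d_hidden)                  # input layer, Eq. (5)
        block = nn.TransformerEncoderLayer(d_model=d_hidden, nhead=n_heads,
                                           dim_feedforward=2 * d_hidden,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=m_blocks)  # M encoder blocks, Eq. (8)
        mlp, dims = [], [d_hidden] * n_mlp + [d_out]
        for i in range(len(dims) - 1):                         # task-based MLP layers, Eq. (9)
            mlp += [nn.Linear(dims[i], dims[i + 1]), nn.ELU()]
        self.mlp = nn.Sequential(*mlp)

    def forward(self, z):                      # z: (n, d_in) node features as one sequence
        z = self.proj(z).unsqueeze(0)          # add a batch dimension: (1, n, d_hidden)
        z = self.blocks(z).squeeze(0)
        return self.mlp(z)                     # (n, d_out), matched to the AE layer width
```

The defaults (hidden dimension 64, 8 heads, 1–2 MLP layers, 2–6 encoder blocks) follow the settings reported in Section 4.3.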

Encoder Block. The encoder block mainly includes a multi-head attention module, a feed-forward network module, and residual connections. The multi-head attention module is mainly used to capture global information at different locations from different subspaces. The residual connections are used to alleviate the problems of gradient vanishing and weight matrix degradation. However, ACT [29] found that as the encoder layers go deeper, the features become similar, because each feature in the Transformer carries information about every other feature. Thus, we introduce a Gaussian bias term [18] in self-attention to strengthen the local information in the Transformer model.

The standard attention function is shown in Eq. (6). The matrices Q, K, V are obtained by linear transformations of the Transformer input feature. The elements of matrix Q and matrix K first undergo a dot-product operation, and the weight matrix is then obtained through the nonlinear softmax function. Finally, a matrix product is performed between the weight matrix and V to get the attention score:

Attention(Q, K, V) = softmax(QK^⊤ / √d_k) V,   (6)

where d_k = d is the dimension of K, and the dimensions of Q and V are equal to that of K.

To address the issue that self-attention networks are not good at capturing the local dependencies of data, we introduce a Gaussian bias term. It is directly multiplied by the attention matrix, which helps nodes focus more on the important nodes around them. The improved attention mechanism is given by

Attention(Q, K, V) = Σ_j φ(d_{i,j}) softmax(g(Q, K)) V
 = Σ_j [exp(−|w d_{i,j}^2 + b|) exp(g(Q, K)) / Σ_j exp(−|w d_{i,j}^2 + b|) exp(g(Q, K))] V
 = Σ_j softmax(−|w d_{i,j}^2 + b| + g(Q, K)) V,   (7)

where g(Q, K) = QK^⊤ / √d_k, d_k is the dimension of K, d_{i,j} is the distance between tokens, and w > 0 and b < 0 are scalar parameters.

In this block, the multi-head attention module is first used to capture node information in different spaces. Then we apply a residual connection to the multi-head attention module, and the normalized result is passed into the fully connected feed-forward network layer. After that, another residual connection is applied in the encoding block. The multi-head module and the feed-forward network are the same as in the classical Transformer encoder described in [30]. A block in the encoder blocks is expressed as

Z′^(m) = LN(MHA(Z^(m−1)) + Z^(m−1)),
Z^(m) = LN(FFN(Z′^(m)) + Z′^(m)),  s.t. m = 1, …, M,   (8)

where Z^(m−1) ∈ R^{n×d} is the output of the (m−1)th encoding block, MHA and FFN denote the multi-head attention module and feed-forward network of the classical Transformer model, respectively, and LN denotes the LayerNorm operation. Finally, the output of the M encoder blocks is the input of the MLP layers.

Task-based MLP Layers. Inspired by GT [14], our model also uses MLP layers, consisting of multiple linear layers and nonlinear layers. The MLP layers can change the model output dimension according to the task; in this way, the Transformer module can perform feature fusion with the AE model at the corresponding dimension. For the nonlinear activation function, we utilize the exponential linear unit (ELU) [31]. A layer in the MLP layers can be represented as follows:

Z′^(n) = W^(n−1) Z^(n−1) + b^(n−1),
Z^(n) = ELU(Z′^(n)),  s.t. n = 1, …, N,   (9)

where W^(n−1) and b^(n−1) are the parameters of the linear layer, and Z^(n) is the output of the nth layer in the MLP layers.

3.1.3. Dynamic fusion module
The dynamic fusion module is mainly used to combine the feature information and structural information of the data. The TDCN-M module utilizes the AE network to extract data feature information and the Transformer network to extract underlying data structure information, then applies a dynamic attention mechanism to obtain the weights of the two networks. After that, the transformation operation G feeds the fused representation into the next Transformer network.

Firstly, dynamic attention is used to calculate the weights of the data representations learned by the two networks. Suppose Z^(l) is the output of the lth Transformer network and H^(l) is the output of the lth AE layer. The output of each AE layer and the output of the corresponding Transformer block are aggregated by a concatenation operation ∥. After that, softmax and LeakyReLU are applied to compute the weights between Z^(l) and H^(l); in LeakyReLU, the negative input slope is set to 0.2. Here, W^(l) ∈ R^{2d_l×2} and a^⊤ are learnable parameters, and ℓ2 denotes L2-normalization. The whole formula to obtain the weight matrix is as follows:

Ψ_l = ℓ2(softmax(a^⊤ LeakyReLU([Z^(l) ∥ H^(l)] W^(l)))).   (10)

Secondly, a Hadamard product operation is applied to learn a new representation Z′^(l). In Eq. (10), the attention weight matrix Ψ_l ∈ R^{n×2} is formed of ψ_{l,1} and ψ_{l,2}, which are the weight vectors for Z^(l) and H^(l), respectively. The new representation Z′^(l) is the input of the next Transformer network and can be expressed as

Z′^(l) = (ψ_{l,1} ⊙ Z^(l)) + (ψ_{l,2} ⊙ H^(l)).   (11)

Finally, the obtained matrix Z′^(l) ∈ R^{n×d_l} is input to the next Transformer network after the transformation operation G. The input of the (l + 1)th Transformer network can be represented as

Z^(l+1) = Transformer(G Z′^(l)).   (12)

3.2. TDCN-S

To further explore node representations, TDCN-S is performed to fuse multi-scale information. Combining the Transformer with multi-scale information has been widely used in the field of vision [32]. TDCN-S first performs dynamic attention fusion on the Transformer networks with different output dimensions. After the transformation operation G, the clustering representation is obtained by the final Transformer network.

Firstly, a concatenation operation is performed to fuse the extracted features of different dimensions and obtain a new combined representation Z′ ∈ R^{n×(d_1+⋯+d_l+d_l)}:

Z′ = [Z^(1) ∥ Z^(2) ∥ ⋯ ∥ Z^(l) ∥ Z^(l+1)],  s.t. l = 1, …, L,   (13)

where Z^(l) is the output of the lth Transformer network and Z^(l+1) = H^(l) ∈ R^{n×d_l} is the final layer of the AE network.
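The dynamic attention fusion used in TDCN-M (Eqs. (10)–(12)), and reused in the same spirit for the multi-scale weights below, can be sketched as follows. This is an illustrative reading of the formulas under our own naming; in particular, the small learnable map standing in for a^⊤ is an assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicFusion(nn.Module):
    """Illustrative fusion of a Transformer output Z^(l) and an AE output H^(l), Eqs. (10)-(11)."""
    def __init__(self, d):
        super().__init__()
        self.w = nn.Linear(2 * d, 2, bias=False)   # W^(l) in R^{2 d_l x 2}
        self.a = nn.Linear(2, 2, bias=False)       # stand-in for the learnable vector a^T (assumption)
        self.leaky = nn.LeakyReLU(0.2)             # negative slope 0.2, as in the paper

    def forward(self, z, h):
        # Eq. (10): weights from the concatenated representations, softmax then l2-normalization
        psi = F.softmax(self.a(self.leaky(self.w(torch.cat([z, h], dim=1)))), dim=1)
        psi = F.normalize(psi, p=2, dim=1)
        # Eq. (11): per-node weighted combination of the two sources
        return psi[:, 0:1] * z + psi[:, 1:2] * h

# The fused output would then be passed through G and the next Transformer, Eq. (12):
# z_next = transformer_module(g @ DynamicFusion(d)(z_l, h_l))
```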

Then the dynamic attention mechanism is used to acquire the weight matrix of the multi-scale information. The weight matrix can be expressed as

Ψ = ℓ2(softmax(a^⊤ LeakyReLU(Z′ W))),   (14)

where Ψ = [ψ_1 ∥ ψ_2 ∥ ⋯ ∥ ψ_l ∥ ψ_{l+1}] is composed of the weight vectors of the different scales.

After that, to explore the multi-scale information, the weight matrix Ψ and the new representation Z′ are combined by the Hadamard product to obtain the fused features Z′, which can be represented as

Z′ = [ψ_1 ⊙ Z^(1) ∥ ψ_2 ⊙ Z^(2) ∥ ⋯ ∥ ψ_l ⊙ Z^(l) ∥ ψ_{l+1} ⊙ Z^(l+1)].   (15)

Finally, the transformed Z′ is input into the final clustering layer. The finally obtained Z is

Z = Transformer(G Z′),   (16)

where Z ∈ R^{n×k} and k is the cluster number.
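The multi-scale step of Eqs. (13)–(16) can be sketched as follows; this is a hedged illustration under our own naming (in particular, the attn module standing in for Eq. (14) is an assumption), not the authors' code.

```python
import torch
import torch.nn.functional as F

def multi_scale_fuse(z_list, attn, final_transformer, g):
    """Illustrative TDCN-S step, Eqs. (13)-(16).

    z_list: per-scale outputs [Z^(1), ..., Z^(L+1)]; attn: a module mapping the
    concatenated features to one weight per scale (stand-in for Eq. (14));
    final_transformer and g: the last Transformer network and the operation G.
    """
    z_cat = torch.cat(z_list, dim=1)                              # Eq. (13): Z' = [Z^(1) || ... || Z^(L+1)]
    psi = F.normalize(F.softmax(attn(z_cat), dim=1), p=2, dim=1)  # Eq. (14): per-scale weights
    weighted = [psi[:, i:i + 1] * z for i, z in enumerate(z_list)]
    z_prime = torch.cat(weighted, dim=1)                          # Eq. (15): weighted concatenation
    return final_transformer(g @ z_prime)                         # Eq. (16): Z = Transformer(G Z')

# attn could be, e.g., nn.Sequential(nn.Linear(sum_of_scale_dims, num_scales), nn.LeakyReLU(0.2)).
```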
3.3. Dual self-supervised module

Clustering is an unsupervised task, and its performance mainly depends on the data representation. In particular, for jointly trained network architectures, the composition of the loss function plays a crucial role. In recent research [10,11], the loss function of deep clustering is usually composed of a non-clustering loss function and a clustering loss function. The non-clustering loss function L_R and the clustering loss function L_C are usually fused as

L = α L_R + β L_C,   (17)

where α and β are constant coefficients of the loss function.

In this paper, the non-clustering loss function is obtained from the AE network; in other words, it is the reconstruction loss given in Eq. (4). By adding the AE reconstruction loss, the AE can successfully learn useful representations [33].

The Kullback–Leibler (KL) divergence is applied for the clustering loss function. The KL divergence measures the divergence between two probability distributions. Firstly, to obtain the clustering loss function, we introduce the soft cluster assignment distribution Q, which is obtained by using Student's t-distribution [34] as the kernel to normalize the similarity between the final output features of the AE and the cluster centers. The calculation is as follows:

q_{ij} = (1 + ∥h_i − µ_j∥^2 / ν)^{−(ν+1)/2} / Σ_{j′} (1 + ∥h_i − µ_{j′}∥^2 / ν)^{−(ν+1)/2},   (18)

where h_i is the ith row of H^(l) and µ_j is the jth cluster center. The constant ν is the degree of freedom of the Student's t-distribution, which is usually set to 1 in the experiments.

Then, to improve the cohesion of the clusters, this paper introduces an auxiliary target P proposed by [8]. Compared to the cluster assignment distribution Q, P has a stricter probability distribution, requiring nodes to be assigned with high confidence. It is calculated as

p_{ij} = (q_{ij}^2 / Σ_i q_{ij}) / Σ_{j′} (q_{ij′}^2 / Σ_i q_{ij′}),   (19)

where p_{ij} is an element of P and q_{ij} is an element of Q.

In neural network training, the KL divergence is used to help the soft label probability distribution Q approach the auxiliary target P, thereby improving the quality of the clustering representation. Moreover, to align the final output Z of TDCN with the output H of the AE network, we also use the KL divergence to make Z approximate the auxiliary target P. In this way, the value of Z is constrained by P in the training process. The specific calculation is as follows:

L_cluQ = KL(P ∥ Q) = Σ_i Σ_j p_{ij} log(p_{ij} / q_{ij}),
L_cluZ = KL(P ∥ Z) = Σ_i Σ_j p_{ij} log(p_{ij} / z_{ij}).   (20)

The final loss function of TDCN can be represented as

L = L_R + λ_1 L_cluQ + λ_2 L_cluZ,   (21)

where λ_1 > 0 and λ_2 > 0 are the coefficients used to balance the AE network and the Transformer network, respectively.

In practice, the network tends to become stable after several iterations. Thus, we obtain the predicted cluster label c_i from the output Z as

c_i = argmax_j z_{ij},   (22)

where z_{ij} is an element of the final network output Z.

The training process of the Transformer-based dynamic fusion clustering network (TDCN) is presented in Algorithm 1.

Algorithm 1 Training Process of TDCN
Input: Input data X; cluster number k; parameters of the AE network; parameters of the Transformer network; parameters λ_1, λ_2; maximum iteration number T; transformation operation G
Output: Clustering results C
1: Initialize: H^(0) = GX, Z^(0) = GX;
2: Initialize the parameters learned by the pre-trained autoencoder;
3: for iteration j = 1 to T do
4:   Calculate each layer of the AE: H^(1), H^(2), …, H^(l);
5:   for layer i = 1 to l − 1 do
6:     Combine the ith layer representation of the AE and the ith output of the Transformer network as Z′^(i) = (ψ_{i,1} ⊙ Z^(i)) + (ψ_{i,2} ⊙ H^(i));
7:     Generate the new data representation Z^(i+1) = Transformer(G Z′^(i));
8:   end for
9:   Merge the output of each Transformer with the output of the final layer of the AE via Eqs. (13) to (15);
10:  Generate the final representation Z = Transformer(G Z′);
11:  Input the final layer features H^(l) of the encoder into the decoder for reconstruction;
12:  Calculate the final loss function of TDCN;
13:  Update the parameters of the proposed TDCN;
14: end for
15: Acquire the clustering results C from the output feature Z by Eq. (22);
16: return C;

3.4. Computational complexity analysis

In our work, we set n as the number of samples, d_l as the feature dimension of the lth layer, E as the number of edges in the graph, d as the hidden dimension of the Transformer, and k as the number of clusters. TDCN-M mainly consists of the AE and Transformer modules. For the AE module, the time complexity is O_1 = O(n Σ_{l=2}^{L} d_{l−1} d_l). For the Transformer module constructed in the paper, the time complexity O_2 is dominated by the multi-head attention module; thus, we only analyze a single Transformer module's multi-head attention component, whose time complexity is O_2 = O(n d^2 + n^2 d). Before the AE module and the Transformer module, we apply the transformation operation G, whose time complexity is O_3 = O(|E| d_l). Hence, the time complexity of TDCN-M is O_M = O_1 + O_2 + O_3. For the TDCN-S part, the time complexity is O_s = O(Σ_{l=1}^{L−1} d_l) + O(Σ_{l=1}^{L+1} d_l (L + 1)), where d_{L+1} = d_L. We also further analyze the complexity of the self-supervised module, which is O_c = O(nk + n log n). In summary, the time complexity of the overall algorithm is O = O_M + O_s + O_c.
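For concreteness, the clustering objective of Eqs. (18)–(22) can be sketched as follows. This is an illustrative reading with ν = 1 and batch-averaged KL terms, assuming z contains row-normalized (softmax) cluster probabilities; it is not the authors' released code.

```python
import torch
import torch.nn.functional as F

def soft_assignment(h, mu, nu=1.0):
    """Eq. (18): Student's t-kernel similarity between AE features h (n, d) and centers mu (k, d)."""
    dist2 = torch.cdist(h, mu) ** 2
    q = (1.0 + dist2 / nu) ** (-(nu + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    """Eq. (19): sharpened auxiliary target P."""
    weight = q ** 2 / q.sum(dim=0)
    return weight / weight.sum(dim=1, keepdim=True)

def tdcn_loss(x, x_hat, q, z, lambda1, lambda2):
    """Eq. (21): L = L_R + lambda1 * KL(P || Q) + lambda2 * KL(P || Z)."""
    p = target_distribution(q).detach()              # P treated as a fixed target per step
    loss_r = torch.sum((x - x_hat) ** 2)             # Eq. (4), reconstruction loss
    loss_q = F.kl_div(q.log(), p, reduction="batchmean")   # Eq. (20), KL(P || Q)
    loss_z = F.kl_div(z.log(), p, reduction="batchmean")   # Eq. (20), KL(P || Z)
    return loss_r + lambda1 * loss_q + lambda2 * loss_z

# Eq. (22): predicted labels from the final output Z
# labels = z.argmax(dim=1)
```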

4. Experiments

4.1. Benchmark datasets

Five popular benchmark datasets are employed to verify the efficiency of our model in the experiments. They include three non-graph datasets (USPS [35], Reuters [36], HHAR [37]) and two graph datasets (ACM [38], CiteSeer). The basic information of these five datasets is shown in Table 1.

Table 1
The description of the datasets.
Dataset   Type    Samples  Classes  Dimension
USPS      Image   9298     10       256
HHAR      Record  10,299   6        561
Reuters   Text    10,000   4        2000
ACM       Graph   3025     3        1870
CiteSeer  Graph   3327     6        3703

• USPS: Handwritten digit dataset. It includes grayscale pictures of handwritten digits in 10 categories; a subset of 9298 images is used in the experiments.
• HHAR: Heterogeneous human activity recognition dataset, an extension of the HAR dataset. It contains 10,299 human activity records from smartphones and smart devices, divided into 6 categories according to the record characteristics.
• Reuters: A multi-category dataset of news topics containing 46 different topics. In the experiments, 10,000 samples are used for clustering; their labels mainly belong to the four root categories corporate/industrial, markets, economics, and government/social.
• ACM (http://dl.acm.org/): ACM is a paper network in which nodes connected by edges represent papers by the same author. The papers are selected using keywords such as KDD, SIGMOD, SIGCOMM, and MobiCOMM, and are grouped into three categories (databases, wireless communications, data mining) according to the field of research.
• CiteSeer (http://citeseerx.ist.psu.edu/index): CiteSeer is a citation network that focuses primarily on providing a way to retrieve literature through citation links. Its topics mainly involve the following domains: agents, databases, artificial intelligence, information retrieval, machine language, and HCI.

4.2. Baseline methods

The proposed method TDCN is compared with nine state-of-the-art deep clustering methods, which are mainly divided into the following three categories:

• AE-based methods: AE [7], DEC [8], IDEC [19].
• GCN-based methods: GAE and VGAE [39], ARGA [24], SDCN [10], AGCN [12].
• Attention-based methods: DAEGC [25].

The AE-based methods utilize the AE network for feature extraction and K-means for clustering. Among them, DEC guides model training by designing clustering objectives, and IDEC adds feature reconstruction on top of DEC to further improve the clustering performance. In the GCN-based methods, graph embedding is introduced to learn the structural representation of the data; SDCN and AGCN establish a joint training framework that integrates DEC and GCN in a certain way. The attention-based method utilizes an attention network for node representation, and the clustering objective is also used in the network training process.

Evaluation metrics: We leverage four commonly used clustering evaluation metrics to evaluate model performance: Accuracy (ACC), Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), and macro F1-score (F1). Higher values of these four metrics represent better clustering performance.
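These four metrics can be computed with standard tooling; the sketch below is our own illustration (not the authors' evaluation script) and uses the Hungarian algorithm to map predicted clusters to ground-truth labels before computing ACC and F1, as is common in deep clustering evaluation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import (adjusted_rand_score, f1_score,
                             normalized_mutual_info_score)

def cluster_metrics(y_true, y_pred):
    """ACC, NMI, ARI and macro F1 for a clustering result (illustrative sketch, 0-indexed labels)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                              # co-occurrence counts
    rows, cols = linear_sum_assignment(-cost)        # best cluster-to-label mapping
    mapping = dict(zip(rows, cols))
    y_mapped = np.array([mapping[p] for p in y_pred])
    return {
        "ACC": float((y_mapped == y_true).mean()),
        "NMI": normalized_mutual_info_score(y_true, y_pred),
        "ARI": adjusted_rand_score(y_true, y_pred),
        "F1": f1_score(y_true, y_mapped, average="macro"),
    }
```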
4.3. Implementation details

Our proposed method TDCN is implemented on Google Colab with a Tesla P100-PCIE-16 GB GPU, and the deep learning environment is PyTorch. For a fair comparison, the network settings used in the experiments are consistent with SDCN [10] and AGCN [12]. Thus, the dimensions of the AE are set as 500-500-2000-10, and the MLP output dimensions of the Transformer modules are also set as 500, 500, 2000, and 10, respectively. The hidden dimension of the Transformer network is set to 64, the same as in GT [14], and the attention module adopts an 8-head attention mechanism. The graph datasets adopted in the paper are relatively small, so shallow layers can achieve comparable results; as a result, for the graph datasets the number of MLP layers is set to 1 and the number of encoder blocks is set to 2 for each Transformer network. According to Table 1, the non-graph datasets are large, hence deeper structures are required to represent the data; compared with the graph datasets, the number of MLP layers is set to 2 and the number of encoder blocks for the Transformer networks is 6 for the non-graph datasets. Furthermore, the parameters of the Gaussian bias term in the encoding block are adjusted according to the dataset.

The training process of TDCN mainly includes two parts: AE pre-training and whole-network training of TDCN. First, we iteratively train the AE module for 30 epochs with a learning rate of 0.001. Then, iterative training of TDCN is performed: the number of iterations is greater than 200 for the non-graph datasets and is set to 100 for the graph datasets. The learning rate is 0.001 for the ACM and USPS datasets and 0.0001 for the other datasets. The coefficients λ1 and λ2 of the loss function are set to different values according to the dataset: for USPS, λ1 and λ2 are set to {1000, 1000}; for HHAR, they are {5, 0.8}; for Reuters, they are set to {10, 10}; for the graph datasets, λ1 is 0.1 and λ2 is 0.01.

For all benchmark methods, we directly cite the results reported in AGCN [12]. With the same settings as the other benchmark methods, TDCN is run 10 times, and we report the mean value and standard deviation of the 10 runs to reduce the influence of random parameters on the model.

4.4. Clustering results

Table 2
Clustering results on five datasets (mean ± std). Red and blue fonts indicate the best and second-best results, respectively.
Dataset Metric AE DEC IDEC GAE VGAE DAEGC ARGA SDCN AGCN TDCN TDCNI
ACC 71.04 ± 0.03 73.31 ± 0.17 76.22 ± 0.12 63.10 ± 0.33 56.19 ± 0.72 73.55 ± 0.40 66.80 ± 0.70 78.08 ± 0.19 80.98 ± 0.28 80.35 ± 1.07 78.27 ± 1.11
USPS NMI 67.53 ± 0.03 70.58 ± 0.25 75.56 ± 0.06 60.69 ± 0.58 51.08 ± 0.37 71.12 ± 0.24 61.60 ± 0.30 79.51 ± 0.27 79.64 ± 0.32 80.39 ± 0.21 77.6 ± 0.19
ARI 58.83 ± 0.05 63.70 ± 0.27 67.86 ± 0.12 50.30 ± 0.55 40.96 ± 0.59 63.33 ± 0.34 51.10 ± 0.60 71.84 ± 0.24 73.61 ± 0.43 73.47 ± 1.09 70.46 ± 1.3
F1 69.74 ± 0.03 71.82 ± 0.21 74.63 ± 0.10 61.84 ± 0.43 53.63 ± 1.05 72.45 ± 0.49 66.10 ± 1.20 76.98 ± 0.18 77.61 ± 0.38 77.94 ± 0.36 75.66 ± 0.25
ACC 68.69 ± 0.31 69.39 ± 0.25 71.05 ± 0.36 62.33 ± 1.01 71.30 ± 0.36 76.51 ± 2.19 63.30 ± 0.80 84.26 ± 0.17 88.11 ± 0.43 88.32 ± 0.12 87.18 ± 0.13
HHAR NMI 71.42 ± 0.97 72.91 ± 0.39 74.19 ± 0.39 55.06 ± 1.39 62.95 ± 0.36 69.10 ± 2.28 57.10 ± 1.40 79.90 ± 0.09 82.44 ± 0.62 82.24 ± 0.4 80.34 ± 0.38
ARI 60.36 ± 0.88 61.25 ± 0.51 62.83 ± 0.45 42.63 ± 1.63 51.47 ± 0.73 60.38 ± 2.15 44.70 ± 1.00 72.84 ± 0.09 77.07 ± 0.66 77.22 ± 0.24 75.31 ± 0.2
F1 66.36 ± 0.34 67.29 ± 0.29 68.63 ± 0.33 62.64 ± 0.97 71.55 ± 0.29 76.89 ± 2.18 61.10 ± 0.90 82.58 ± 0.08 88.00 ± 0.53 88.12 ± 0.15 86.85 ± 0.19
ACC 74.90 ± 0.21 73.58 ± 0.13 75.43 ± 0.14 54.40 ± 0.27 60.85 ± 0.23 65.50 ± 0.13 56.20 ± 0.20 77.15 ± 0.21 79.30 ± 1.07 75.71 ± 1.39 81.5 ± 1.09
Reuters NMI 49.69 ± 0.29 47.50 ± 0.34 50.28 ± 0.17 25.92 ± 0.41 25.51 ± 0.22 30.55 ± 0.29 28.70 ± 0.30 50.82 ± 0.21 57.83 ± 1.01 46.98 ± 0.39 59.28 ± 0.97
ARI 49.55 ± 0.37 48.44 ± 0.14 51.26 ± 0.21 19.61 ± 0.22 26.18 ± 0.36 31.12 ± 0.18 24.50 ± 0.40 55.36 ± 0.37 60.55 ± 1.78 50.53 ± 1.6 62.46 ± 1.94
F1 60.96 ± 0.22 64.25 ± 0.22 63.21 ± 0.12 43.53 ± 0.42 57.14 ± 0.17 61.82 ± 0.13 51.10 ± 0.20 65.48 ± 0.08 66.16 ± 0.64 62.42 ± 0.26 66.68 ± 0.23
ACC 81.83 ± 0.08 84.33 ± 0.76 85.12 ± 0.52 84.52 ± 1.44 84.13 ± 0.22 86.94 ± 2.83 86.10 ± 1.20 90.45 ± 0.18 90.59 ± 0.15 90.82 ± 0.19 86.71 ± 0.17
ACM NMI 49.30 ± 0.16 54.54 ± 1.51 56.61 ± 1.16 55.38 ± 1.92 53.20 ± 0.52 56.18 ± 4.15 55.70 ± 1.40 68.31 ± 0.25 68.38 ± 0.45 69.2 ± 0.5 58.61 ± 0.35
ARI 54.64 ± 0.16 60.64 ± 1.87 62.16 ± 1.50 59.46 ± 3.10 57.72 ± 0.67 59.35 ± 3.89 62.90 ± 2.10 73.91 ± 0.40 74.20 ± 0.38 74.84 ± 0.46 64.75 ± 0.4
F1 82.01 ± 0.08 84.51 ± 0.74 85.11 ± 0.48 84.65 ± 1.33 84.17 ± 0.23 87.07 ± 2.79 86.10 ± 1.20 90.42 ± 0.19 90.58 ± 0.17 90.8 ± 0.19 86.62 ± 0.18
ACC 57.08 ± 0.13 55.89 ± 0.20 60.49 ± 1.42 61.35 ± 0.80 60.97 ± 0.36 64.54 ± 1.39 56.90 ± 0.70 65.96 ± 0.31 68.79 ± 0.23 69.1 ± 0.53 60.57 ± 0.35
CiteSeer NMI 27.64 ± 0.08 28.34 ± 0.30 27.17 ± 2.40 34.63 ± 0.65 32.69 ± 0.27 36.41 ± 0.86 34.50 ± 0.80 38.71 ± 0.32 41.54 ± 0.30 41.61 ± 0.65 33.04 ± 0.16
ARI 29.31 ± 0.14 28.12 ± 0.36 25.70 ± 2.65 33.55 ± 1.18 33.13 ± 0.53 37.78 ± 1.24 33.40 ± 1.50 40.17 ± 0.43 43.79 ± 0.31 43.91 ± 0.56 33.76 ± 0.29
F1 53.80 ± 0.11 52.62 ± 0.17 61.62 ± 1.39 57.36 ± 0.82 57.70 ± 0.49 62.20 ± 1.32 54.80 ± 0.80 63.62 ± 0.24 62.37 ± 0.21 62.3 ± 0.31 56.61 ± 0.13

Table 3
Ablation experiments on five datasets. Dynamic attention is the attention mechanism used in this paper; static attention is the attention mechanism used by AGCN. ✓ indicates that the module is used by the network. Bold font indicates the best performance.
Datasets Gaussian bias Dynamic attention Static attention ACC NMI ARI F1
✓ 80.21 ± 0.36 80.4 ± 0.16 73.34 ± 0.4 78.09 ± 0.24
USPS ✓ 79.05 ± 0.75 80.29 ± 0.30 72.69 ± 0.51 77.53 ± 0.37
✓ ✓ 80.35 ± 1.07 80.39 ± 0.21 73.47 ± 1.09 77.94 ± 0.36
✓ 88.37 ± 0.12 82.14 ± 0.31 77.28 ± 0.22 88.17 ± 0.14
HHAR ✓ 87.46 ± 0.66 81.93 ± 0.45 76.07 ± 0.85 87.2 ± 0.65
✓ ✓ 88.32 ± 0.12 82.24 ± 0.40 77.22 ± 0.24 88.12 ± 0.15
✓ 81.35 ± 1.01 59.27 ± 0.79 62.29 ± 1.78 66.65 ± 0.23
Reuters ✓ 80.29 ± 0.43 58.29 ± 0.48 61.21 ± 0.59 66.65 ± 0.22
✓ ✓ 81.5 ± 1.09 59.28 ± 0.96 62.46 ± 1.93 66.68 ± 0.22
✓ 90.77 ± 0.15 69.1 ± 0.35 74.72 ± 0.36 90.74 ± 0.15
ACM ✓ 90.71 ± 0.22 68.93 ± 0.48 74.57 ± 0.52 90.68 ± 0.22
✓ ✓ 90.82 ± 0.19 69.2 ± 0.5 74.84 ± 0.46 90.8 ± 0.19
✓ 69.05 ± 0.48 41.52 ± 0.6 43.87 ± 0.6 62.28 ± 0.31
CiteSeer ✓ 69.15 ± 0.45 41.82 ± 0.77 43.9 ± 0.53 62.13 ± 0.32
✓ ✓ 69.1 ± 0.53 41.61 ± 0.65 43.91 ± 0.56 62.3 ± 0.31

The clustering performance of TDCN and the benchmark networks on the five datasets is shown in Table 2. According to the different settings of the transformation operation G, TDCN is divided into two categories: TDCN and TDCNI. The transformation operation G of TDCN is the normalized adjacency matrix, while for TDCNI it is the identity matrix. In addition, in TDCN the non-graph datasets obtain the data structure by constructing a KNN graph. From the experimental results, the observations are as follows:

• For each evaluation metric, our models TDCN and TDCNI generally achieve comparable or better performance than the state-of-the-art models. Table 2 shows that the performance of the AE-based deep clustering algorithms is lower than that of the GCN-based deep clustering algorithms on the different datasets. The reason is that the AE-based deep clustering algorithms do not utilize the potential structure information of the data. Hence, TDCN embeds the data structure information through the graph and efficiently integrates the data structure information with the data feature information. It can be seen that the clustering performance of TDCN is higher than that of the baseline methods and TDCNI on most datasets.
• The clustering performance of GCN-based clustering methods such as GAE and VGAE is limited by over-smoothing. TDCN integrates the AE network and the Transformer network to establish a joint training framework and achieves better clustering performance. AGCN utilizes an attention mechanism to fuse the AE and GCN networks; our model improves the attention mechanism between modules on this basis, which further improves the clustering performance.
• In Table 2, TDCNI achieves excellent performance on the text dataset Reuters. Text clustering needs to analyze multiple kinds of semantic information; however, the KNN graph constructed by TDCN contains some noise that affects the semantic information. Thus, the performance of TDCNI is better than that of TDCN. It is obvious that by choosing an appropriate transformation operation G, our model can also achieve impressive performance in text clustering. In conclusion, based on the superior performance of TDCNI on Reuters, there is large space for improving our model in text clustering.

4.5. Ablation experiments

To demonstrate the efficiency of the Transformer network in the model, we conduct ablation experiments. Besides, some experiments are applied to the Gaussian bias term in the Transformer module and the dynamic attention mechanism between modules to analyze the effect of strengthening local attention on TDCN. The ablation experiments are performed using the best-performing variant of TDCN and TDCNI. In Table 3, the dynamic attention mechanism is the attention mechanism used in this paper, and the static attention is the attention mechanism used in AGCN [12]. In addition, we also confirm that there is no over-smoothing in TDCN and that the model is insensitive to the parameter k of the KNN graph.

Analysis of the Transformer encoder network in TDCN. TDCN is mainly formed of Transformer modules and the AE network. To verify that the Transformer encoder network has a promoting effect on TDCN, we performed ablation experiments on it, as shown in Fig. 3. We can observe that the Transformer network improves the clustering performance on the datasets as a whole. The Transformer network improves each evaluation metric on the HHAR dataset by about 3%–4%, which greatly improves the clustering performance.

Fig. 3. Clustering results of TDCN module variants. TDCN-MLP represents a model that replaces the Transformer encoder network in the Transformer module of
TDCN-M and TDCN-S with a multilayer perceptron.

Fig. 4. Clustering results with different layers of TDCN in USPS.

Fig. 5. Clustering results with different k.

According to the comparison results for the USPS and CiteSeer datasets in Fig. 3 and Table 2, the embedding way of the data has a greater impact on the clustering results of these two datasets. To a certain extent, we verify that the Transformer encoder network boosts the performance of TDCN. As a consequence, TDCN can capture deep node information within a scale by stacking Transformer encoder layers in each TDCN-M module.

Analysis of the Gaussian bias in the Transformer module. A Gaussian bias term is incorporated into each Transformer network in TDCN and TDCNI to help the Transformer network capture local information. From the data in the first and third rows of each dataset in Table 3, it can be seen that, under the same conditions, the clustering performance of the model with the Gaussian bias is generally higher than that without it. The Gaussian bias term improves the evaluation metrics of most datasets by about 0.1% on the mean of the 10 runs, which verifies the validity of the Gaussian bias term.

Analysis of the inter-module dynamic attention. Our model utilizes a dynamic attention mechanism to fuse the AE and Transformer networks, as shown in Eqs. (10) and (11). We define the attention mechanism adopted by AGCN [12] as the static attention mechanism. The first and second rows of each dataset in Table 3 make clear that the dynamic attention mechanism we adopted improves the overall performance of the model

more than the static attention. All five datasets improve by nearly 1% on ACC. This means that the dynamic attention mechanism can better capture the local features of the data.

Analysis of the layers of TDCN. In order to analyze the effect of the TDCN-M modules in the TDCN model, we conducted experiments on the USPS dataset, setting the number of TDCN-M modules to {0, 1, 2, 3}. Fig. 4 reveals that increasing the number of TDCN-M modules steadily improves the USPS clustering results. In other words, deepening the number of layers in the TDCN network does not cause over-smoothing. Therefore, compared to GCN-based methods, TDCN is more suitable for improving the clustering performance of large-scale data.

Analysis of different k. In the TDCN model, we construct KNN graphs for the non-graph datasets. To observe the effect of k on the experiments, we performed ablation experiments with k = {1, 3, 5, 10} for HHAR and USPS, respectively. Fig. 5 illustrates that for different k values, the four evaluation metrics are almost equal. As a result, our model is insensitive to k.

5. Conclusion

In this paper, a novel deep clustering method named the Transformer-based Dynamic Fusion Clustering Network (TDCN) is proposed; it is the first work to fuse the AE network and Transformer network for clustering tasks. TDCN obtains the latent structure of the data by the transformation operation G and utilizes the dynamic attention mechanism between modules to efficiently fuse the structural information and feature information of the data. The dynamic attention mechanism we incorporate can significantly improve the clustering accuracy compared to the attention mechanisms of existing work. Furthermore, the Transformer model adopted by TDCN can effectively improve the performance of the model. TDCN deepens the model through the TDCN-M and TDCN-S modules, which effectively solves the performance degradation caused by deepening the layers of current deep clustering networks. Compared with the GCN commonly used in deep clustering networks, TDCN does not cause over-smoothing and can learn the clustering representation to the greatest extent. Experiments show that TDCN can achieve comparable or even higher performance than the state-of-the-art methods on all five datasets. In the future, on the one hand, we will explore more efficient transformation operations G for node embedding to achieve better performance in deep clustering; on the other hand, we will apply our model to multi-view clustering.

CRediT authorship contribution statement

Chunchun Zhang: Conceptualization, Methodology, Software, Investigation, Formal analysis, Writing – original draft. Yaliang Zhao: Conceptualization, Funding acquisition, Resources, Supervision, Writing – review & editing. Jinke Wang: Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (No. 61802112) and the Technology Development Plan Project in Henan Province of China (No. 212102310302).

References

[1] P. Tang, X. Wang, S. Bai, W. Shen, X. Bai, W. Liu, A. Yuille, PCL: Proposal cluster learning for weakly supervised object detection, IEEE Trans. Pattern Anal. Mach. Intell. 42 (2018) 176–191.
[2] S. Li, Q. Cao, L. Liu, K. Yang, S. Liu, J. Hou, S. Yi, Groupformer: Group activity recognition with clustered spatial-temporal transformer, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 13668–13677.
[3] S. Wang, L. Zhou, Z. Gan, Y.-C. Chen, Y. Fang, S. Sun, Y. Cheng, J. Liu, Cluster-former: Clustering-based sparse transformer for question answering, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP, 2021, pp. 3958–3968.
[4] X.-B. Nguyen, D.T. Bui, C.N. Duong, T.D. Bui, K. Luu, Clusformer: A transformer based clustering approach to unsupervised large-scale face and visual landmark recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10847–10856.
[5] M. Du, S. Ding, H. Jia, Study on density peaks clustering based on k-nearest neighbors and principal component analysis, Knowl.-Based Syst. 99 (2016) 135–145.
[6] T. Hofmann, B. Schölkopf, A.J. Smola, Kernel methods in machine learning, Ann. Statist. 36 (2008) 1171–1220.
[7] G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313 (2006) 504–507.
[8] J. Xie, R. Girshick, A. Farhadi, Unsupervised deep embedding for clustering analysis, in: International Conference on Machine Learning, 2016, pp. 478–487.
[9] S. Mukherjee, H. Asnani, E. Lin, S. Kannan, ClusterGAN: Latent space clustering in generative adversarial networks, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2019, pp. 4610–4617.
[10] D. Bo, X. Wang, C. Shi, M. Zhu, E. Lu, P. Cui, Structural deep clustering network, in: Proceedings of the Web Conference, 2020, pp. 1400–1410.
[11] W. Tu, S. Zhou, X. Liu, X. Guo, Z. Cai, E. Zhu, J. Cheng, Deep fusion clustering network, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2021, pp. 9978–9987.
[12] Z. Peng, H. Liu, Y. Jia, J. Hou, Attention-driven graph clustering network, in: Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 935–943.
[13] T.N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, in: 5th International Conference on Learning Representations, ICLR, 2017.
[14] V.P. Dwivedi, X. Bresson, A generalization of transformer networks to graphs, 2020, arXiv preprint arXiv:2012.09699.
[15] D. Kreuzer, D. Beaini, W. Hamilton, V. Létourneau, P. Tossou, Rethinking graph transformers with spectral attention, Adv. Neural Inf. Process. Syst. 34 (2021) 21618–21629.
[16] C. Ying, T. Cai, S. Luo, S. Zheng, G. Ke, D. He, Y. Shen, T.-Y. Liu, Do transformers really perform badly for graph representation? Adv. Neural Inf. Process. Syst. 34 (2021) 28877–28888.
[17] S. Brody, U. Alon, E. Yahav, How attentive are graph attention networks? 2021, arXiv preprint arXiv:2105.14491.
[18] M. Guo, Y. Zhang, T. Liu, Gaussian transformer: A lightweight approach for natural language inference, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2019, pp. 6489–6496.
[19] X. Guo, L. Gao, X. Liu, J. Yin, Improved deep embedded clustering with local structure preservation, in: IJCAI, 2017, pp. 1753–1759.
[20] S. Affeldt, L. Labiod, M. Nadif, Spectral clustering via ensemble deep autoencoder learning (SC-EDAE), Pattern Recognit. 108 (2020) 107522.
[21] X. Peng, J. Feng, J.T. Zhou, Y. Lei, S. Yan, Deep subspace clustering, IEEE Trans. Neural Netw. Learn. Syst. 31 (12) (2020) 5509–5521.
[22] X. Guo, X. Liu, E. Zhu, X. Zhu, M. Li, X. Xu, J. Yin, Adaptive self-paced deep clustering with data augmentation, IEEE Trans. Knowl. Data Eng. 32 (9) (2019) 1680–1693.
[23] Z. Kang, L. Wen, W. Chen, Z. Xu, Low-rank kernel learning for graph-based clustering, Knowl.-Based Syst. 163 (2019) 510–517.
[24] S. Pan, R. Hu, S.-f. Fung, G. Long, J. Jiang, C. Zhang, Learning graph embedding with adversarial training methods, IEEE Trans. Cybern. 50 (2019) 2475–2487.
[25] C. Wang, S. Pan, R. Hu, G. Long, J. Jiang, C. Zhang, Attributed graph clustering: A deep attentional embedding approach, in: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI, 2019, pp. 3670–3676.
[26] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, Y. Bengio, Graph attention networks, in: 6th International Conference on Learning Representations, ICLR, 2018.
[27] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, 2020, arXiv preprint arXiv:2010.11929.
[28] J. Zhang, H. Zhang, C. Xia, L. Sun, Graph-BERT: Only attention is needed for learning graph representations, 2020, arXiv preprint arXiv:2001.05140.


[29] M. Zheng, P. Gao, R. Zhang, K. Li, X. Wang, H. Li, H. Dong, End-to-end object detection with adaptive clustering transformer, 2020, arXiv preprint arXiv:2011.09315.
[30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Adv. Neural Inf. Process. Syst. 30 (2017).
[31] D.-A. Clevert, T. Unterthiner, S. Hochreiter, Fast and accurate deep network learning by exponential linear units (ELUs), in: 4th International Conference on Learning Representations, ICLR, 2016.
[32] J. Ke, Q. Wang, Y. Wang, P. Milanfar, F. Yang, MUSIQ: Multi-scale image quality transformer, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 5148–5157.
[33] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P.-A. Manzagol, L. Bottou, Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion, J. Mach. Learn. Res. 11 (12) (2010).
[34] L. Van der Maaten, G. Hinton, Visualizing data using t-SNE, J. Mach. Learn. Res. 9 (11) (2008).
[35] Y. Le Cun, O. Matan, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, L. Jacket, H.S. Baird, Handwritten zip code recognition with multilayer networks, in: [1990] Proceedings. 10th International Conference on Pattern Recognition, 1990, pp. 35–40.
[36] D.D. Lewis, Y. Yang, T. Russell-Rose, F. Li, RCV1: A new benchmark collection for text categorization research, J. Mach. Learn. Res. 5 (2004) 361–397.
[37] A. Stisen, H. Blunck, S. Bhattacharya, T.S. Prentow, M.B. Kjærgaard, A. Dey, T. Sonne, M.M. Jensen, Smart devices are different: Assessing and mitigating mobile sensing heterogeneities for activity recognition, in: Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems, 2015, pp. 127–140.
[38] X. Wang, H. Ji, C. Shi, B. Wang, Y. Ye, P. Cui, P.S. Yu, Heterogeneous graph attention network, in: The World Wide Web Conference, 2019, pp. 2022–2032.
[39] T.N. Kipf, M. Welling, Variational graph auto-encoders, 2016, arXiv preprint arXiv:1611.07308.
