Contrastive Learning With Semantic Consistency Constraint

Keywords: Representation learning; Contrastive learning; Semantic consistency

Abstract: Contrastive representation learning (CL) can be viewed as an anchor-based learning paradigm that learns representations by maximizing the similarity between an anchor and positive samples while reducing the similarity with negative samples. A randomly adopted data augmentation strategy generates the positive and negative samples, resulting in semantic inconsistency in the learning process. The randomness may introduce additional disturbances to the original sample, thereby reversing the sample identity. Moreover, the negative-sample demarcation strategy leaves samples that are semantically similar to the anchor among the negatives; these are called false negative samples. Therefore, CL's maximization and reduction processes cause distractors to be incorporated into the learned feature representation. In this paper, we propose a novel Semantic Consistency Regularization (SCR) method to alleviate this problem. Specifically, we introduce a new regularization term, the pairwise subspace distance, to constrain the consistency of distributions across different views. Furthermore, we propose a divide-and-conquer strategy to ensure that the proposed SCR scales well to large mini-batches. Empirically, results across multiple small and large benchmark datasets demonstrate that SCR outperforms state-of-the-art methods. Code is available at https://github.com/PaulGHJ/SCR.git.
2. Related work
… of the learning training model [44]. The Wasserstein distance [45–47] provides a meaningful notion of closeness and is capable of handling non-overlapping probability distributions on low-dimensional manifolds. Research has demonstrated that the generalization of the KL divergence, the JS divergence, and the Wasserstein distance can be limited by the influence of sample complexity. As shown in Fig. 2, the Sliced Wasserstein distance (SWD) [45] calculates the average Wasserstein distance of all marginal measures obtained by uniform random slicing into sets of 1-dimensional distributions with equal weight. The Max Sliced-Wasserstein Distance (Max-SWD) [48] finds a single linear projection that maximizes the distance of the probability measures in the projected space. The Max Generalized Sliced-Wasserstein Distance (Max-GSWD) [49] uses non-linear projections to find the critical projection direction, reducing the computational cost induced by the projection operations.

[Fig. 2. Several variants based on the Wasserstein distance are used to measure distributional differences, taking a 2-dimensional distribution as an example. Only the selection of the projection direction from the high-dimensional space to the 1-dimensional space is emphasized: (a) SWD randomly selects a large number of projection directions; (b) Max-SWD looks for a single special linear projection direction; (c) the pairwise subspace distance selects the respective principal component directions of the augmented views.]
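To make the slicing idea concrete, the following is a minimal PyTorch sketch of SWD as described above; the function name and the Monte-Carlo slice count are illustrative assumptions, not code from any of the cited implementations.

```python
import torch

def sliced_wasserstein(x, y, n_slices=128):
    # x, y: (n, d) samples from two distributions with equal sample counts.
    # Each random unit direction induces two 1-D distributions; their
    # Wasserstein-2 distance reduces to comparing sorted projections.
    theta = torch.randn(n_slices, x.size(1))
    theta = theta / theta.norm(dim=1, keepdim=True)  # uniform random unit directions
    proj_x = (x @ theta.T).sort(dim=0).values        # (n, n_slices) sorted projections
    proj_y = (y @ theta.T).sort(dim=0).values
    return ((proj_x - proj_y) ** 2).mean()           # equal-weight average over slices
```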
3. Methodology

Given a training dataset, we consider a mini-batch sampled from the dataset, denoted by $X = \{x_i\}_{i=1}^{n}$. Taking SimCLR as the baseline method, the general framework of contrastive learning consists of the following parts: (1) two augmented views, defined as $X^1 = T^1(X)$ and $X^2 = T^2(X)$, where $T^1$ and $T^2$ represent different data augmentation strategies, and $T^1(X)$ and $T^2(X)$ transform each sample to obtain the corresponding augmented samples $X^1 = \{x_i^1\}_{i=1}^{n}$, $X^2 = \{x_i^2\}_{i=1}^{n}$; (2) an encoder $f_\theta$ that encodes the augmented views to obtain the corresponding representations $y_i^1 = f_\theta(x_i^1)$ and $y_i^2 = f_\theta(x_i^2)$; (3) a projection head $g_\xi$ that maps the learned representations to the embedding spaces $Z^1$ and $Z^2$, so that $z_i^1 = g_\xi(y_i^1) \in Z^1$ and $z_i^2 = g_\xi(y_i^2) \in Z^2$. Typical CL methods employ noise contrastive estimation (NCE) as the loss function, which can be formulated as:

$$\mathcal{L}_{NCE} = \mathbb{E}_{z_i \sim Z^1 \cup Z^2}\left[ -\log \frac{e^{\mathrm{sim}(z_i, z_i^{+})/\tau}}{e^{\mathrm{sim}(z_i, z_i^{+})/\tau} + \sum_{z_j} e^{\mathrm{sim}(z_i, z_j)/\tau}} \right] \tag{1}$$

where $\tau$ represents the temperature hyper-parameter and $\mathrm{sim}(\cdot,\cdot)$ represents the similarity between samples (i.e., the cosine similarity). In objective (1), CL regards $z_i$ as the anchor, the pair $\{z_i, z_i^{+}\}$ as the positive pair, and the pairs $\{z_i, z_j\}$ as negatives, where $z_j \in Z^1 \cup Z^2 \setminus \{z_i, z_i^{+}\}$.
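For concreteness, here is a minimal PyTorch sketch of Eq. (1) in the common NT-Xent form used by SimCLR; the naming and batching details are illustrative assumptions, not code from the authors' repository.

```python
import torch
import torch.nn.functional as F

def nce_loss(z1, z2, tau=0.5):
    # z1, z2: (n, d) embeddings of the two views; rows i of z1 and z2
    # form the positive pair for anchor i.
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # cosine similarity via dot products
    sim = z @ z.T / tau                                  # (2n, 2n) similarity logits
    sim.fill_diagonal_(float('-inf'))                    # drop self-similarity terms
    # Anchor i's positive sits n rows away in the concatenated batch.
    pos = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, pos)
```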
3.1. Paired subspace distance

This subsection introduces the proposed paired subspace distance (PSD) for measuring the distance between two augmented view distributions. The PSD is defined in the embedded space $Z$ and aims to align the two distributions based on the eigenvalues of the covariance matrices of the two distributed data. Let $D_Z^1$ and $D_Z^2$ represent the distributions corresponding to the two augmented view datasets $Z^1$ and $Z^2$ in the embedded space $Z$.

Motivation. Traditional contrastive learning methods, such as SimCLR, can be regarded as an $n$-classification problem, where each sample is treated as an individual class within a batch. Positive pairs are formed by different augmented views of the same sample and assigned the same label. Wang and Isola pointed out that contrastive learning actually obeys two key principles, alignment and uniformity [41]. Alignment is achieved by enforcing positive pairs to share similar representations in a low-dimensional space, that is, instance-level alignment, which ignores the semantic correlation between views. The learned representation originates from the semantic information shared by different views, but an uncontrollable data augmentation strategy, such as cropping, makes the shared semantics incomplete, so that only the information in the common area is learned. We propose semantic-level alignment to tackle this problem of instance-level alignment methods. Although the augmented views differ in the input space, the distributions of the two views in the embedding space should be naturally aligned. By aligning the two augmented views of a batch, we expect the semantic distribution to contain more image semantics, providing semantic complementarity.

Some studies [21,22,50,51] have pointed out that the principal component directions, or orthogonal basis, of the samples are closely related to their semantic information. The eigenvectors are obtained by matrix factorization of the covariance matrix of the training data. The first $k$ eigenvectors selected represent the $k$ directions with the largest variance of the data distribution. According to information theory, the larger the variance of a data distribution, the higher the information entropy, indicating a larger amount of information. Therefore, we can regard the projection of samples onto each eigenvector direction as semantics understood at the level of feature representations. Thus, we achieve semantic alignment by forcing the distributions of different views of a batch to be aligned separately in all principal component directions.

We denote $Z^v = [z_1^v, \ldots, z_i^v, \ldots, z_n^v]^T$, $v = 1, 2$. The two covariance matrices $Z^1_{cov}$ and $Z^2_{cov}$ can be calculated as $Z^1_{cov} = (Z^1)^T Z^1$ and $Z^2_{cov} = (Z^2)^T Z^2$. In the embedded space $Z$, the eigenvectors of a covariance matrix can be obtained by its orthogonal decomposition. Thus, we have:

$$Z^1_{cov} = U^1 S^1 (U^1)^T, \qquad Z^2_{cov} = U^2 S^2 (U^2)^T \tag{2}$$

where $U^1 = [u_1^1, \ldots, u_l^1, \ldots, u_k^1]$ and $U^2 = [u_1^2, \ldots, u_l^2, \ldots, u_k^2]$ represent the orthonormal eigenvector matrices; $u_l^1$ is the $l$-th orthogonal basis of view 1, and likewise $u_l^2$ of view 2; $S^1$ and $S^2$ are the eigenvalue matrices corresponding to the eigenvector matrices; and $k$ indicates that we finally select $k$ principal component directions of each view. $U^1$ and $U^2$ refer to the projection directions of the principal components. Denoting $\tilde{u}_{2l-1} = u_l^1$ and $\tilde{u}_{2l} = u_l^2$, we can write $U = \{\tilde{u}_1, \tilde{u}_2, \ldots, \tilde{u}_{2k-1}, \tilde{u}_{2k}\}$.
We regard the projection directions as the coordinate primitive. For the samples in $Z^1$ and $Z^2$, we can obtain a new representation under this coordinate primitive:

$$p_1^1, \ldots, p_l^1, \ldots, p_{2k}^1 \qquad p_1^2, \ldots, p_l^2, \ldots, p_{2k}^2 \tag{3}$$

where $p_l^1 = Z^1 \tilde{u}_l$ and $p_l^2 = Z^2 \tilde{u}_l$ are the projection representations of views 1 and 2 in the $l$-th projection direction, and $\tilde{u}_l \in U$. We can also write $p_l^1 = [z_{l,1}^1, \ldots, z_{l,i}^1, \ldots, z_{l,n}^1]$ and $p_l^2 = [z_{l,1}^2, \ldots, z_{l,i}^2, \ldots, z_{l,n}^2]$. $Z_l^1 = \{z_{l,j}^1\}_{j=1}^{n}$ is the projection set of $Z^1$ in the $l$-th principal component direction, and $Z_l^2 = \{z_{l,i}^2\}_{i=1}^{n}$ is the projection set of $Z^2$ in the $l$-th principal component direction.

Next, we calculate the disparity in the distributions of the two views by analyzing their differences along the corresponding principal component directions. Considering the existence of non-overlapping low-dimensional manifolds, we design a new method to measure the difference between $p_l^1$ and $p_l^2$. Define $\tilde{p}_l^v$ as the ordered vector representation of $p_l^v$ in the form:

$$\tilde{p}_l^v = [\tilde{z}_{l,1}^v, \ldots, \tilde{z}_{l,n}^v] \tag{4}$$

where $\tilde{z}_{l,1}^v \geq \ldots \geq \tilde{z}_{l,i}^v \geq \ldots \geq \tilde{z}_{l,n}^v$, $\tilde{z}_{l,i}^v \in Z_l^v$, and $v \in \{1, 2\}$. As shown in Fig. 3, in the $l$-th principal component direction, the difference between view 1 and view 2 can be calculated by the following formula:

$$\mathrm{dist}\left(p_l^1, p_l^2\right) = \left\| \tilde{p}_l^1 - \tilde{p}_l^2 \right\|_2^2 \tag{5}$$

[Fig. 3. The distribution measure of view 1 and view 2 in the direction of the $l$-th principal component. Orange and blue represent the projections of the view-1 and view-2 samples in the $l$-th principal component direction, respectively.]
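A minimal PyTorch sketch of Eqs. (2)–(5) follows. The covariance construction and sorting mirror the definitions above; how the per-direction distances are aggregated into the semantic consistency loss of Eq. (6) is not reproduced here, so the mean over the $2k$ directions is an assumption.

```python
import torch

def paired_subspace_distance(z1, z2, k=10):
    # z1, z2: (n, d) embeddings of the two augmented views.
    def top_k_directions(z):
        cov = z.T @ z                              # covariance matrix, Eq. (2)
        _, eigvecs = torch.linalg.eigh(cov)        # eigenvalues in ascending order
        return eigvecs[:, -k:]                     # k leading principal directions

    u = torch.cat([top_k_directions(z1), top_k_directions(z2)], dim=1)  # (d, 2k)
    p1 = (z1 @ u).sort(dim=0, descending=True).values  # ordered projections, Eq. (4)
    p2 = (z2 @ u).sort(dim=0, descending=True).values
    dist = ((p1 - p2) ** 2).sum(dim=0)                 # per-direction distance, Eq. (5)
    return dist.mean()                                 # aggregation assumed (Eq. (6))
```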
Although PSD, like SWD and Max-SWD, compares distributions through one-dimensional projections, there are significant differences. As shown in Fig. 2, the idea of SWD is to obtain multiple one-dimensional distributions of the high-dimensional probability distributions through random transformations and to combine all sliced Wasserstein distances with equal weights. The effectiveness of SWD depends on the number and quality of the slices; the importance of projections cannot be guaranteed by random slicing [45]. Max-SWD [48] finds a linear projection that maximizes the distance of the probability measures in the projected space, which does not reveal more semantic information about the two distributions or their spatial structure. By choosing the principal component projections of the samples in a single view rather than all views, the projection bias caused by the view difference can be reduced, and the semantic information within the view can be preserved to a greater extent.

3.2. Semantic consistency regularization

Lastly, we integrate the proposed PSD into the contrastive loss as a regularization term. The objective function of contrastive learning with semantic consistency regularization (SCR) is defined as follows:

$$\min_f \; loss = loss_{cl} + \lambda \cdot loss_{sc} \tag{7}$$

where $loss_{cl}$ is a contrastive loss function such as the NCE loss, $loss_{sc}$ is the semantic consistency loss given by Eq. (6), and $\lambda$ is the regularization parameter that controls the relative importance of the two loss terms. Note that $loss_{cl}$ can also be another loss function used in contrastive learning. The learning framework of the proposed SCR is shown in Fig. 4, and the complete training process is shown in Algorithm 1.
Fig. 4. The framework of Contrastive Learning with Semantic Consistency. Given a mini-batch $X = \{x_1, \ldots, x_i, \ldots, x_n\}$, we construct data pairs using two different augmentations $T^1$ and $T^2$. We then extract features $Y^1$, $Y^2$ from the augmented datasets using a shared encoder $f$, and $g$ is a three-layer multi-linear network that projects the learned representations into the low-dimensional spaces $Z^1$, $Z^2$. Finally, we calculate the contrastive loss and the paired subspace distance.
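Combining the two sketches above, one SCR training step under Eq. (7) might look as follows; `scr_step` and its defaults are hypothetical, with $\lambda = 10^{-2}$ chosen only as a placeholder (cf. the sensitivity study in Table 7).

```python
def scr_step(x1, x2, f, g, optimizer, lam=1e-2, tau=0.5):
    # Encode and project both augmented views, then apply Eq. (7).
    z1, z2 = g(f(x1)), g(f(x2))
    loss = nce_loss(z1, z2, tau) + lam * paired_subspace_distance(z1, z2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```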
3.3. Divide-and-conquer strategy

When faced with a large mini-batch in the training phase, the computational cost of solving the eigenvalue problem of the covariance matrix becomes particularly large. To solve this problem, we propose a simple and effective strategy called the divide-and-conquer strategy. Suppose the batch size is $N$ and two augmented views are employed. In the embedding space, we divide the large batch into $M$ parts and define the sample set of the $m$-th part of the $v$-th view as $Z_m^v$, $m \in \{1, \ldots, M\}$, $v \in \{1, 2\}$. We obtain the $l$-th eigenvector of the $m$-th part of the $v$-th view for each part separately, $u_{l,m}^v$, $l \in \{1, \ldots, k\}$. Then, we add all eigenvectors into the final set of eigenvectors $U = \{u_{l,m}^v\}$. Further, we obtain the projection of each view on these principal component directions, $p_{l,m}^1 = Z^1 u_{l,m}^v$ and $p_{l,m}^2 = Z^2 u_{l,m}^v$. Finally, we use Eqs. (3)–(6) to calculate the difference in the distributions of the different views.

The divide-and-conquer strategy can be seen as a trick for reducing the time complexity of the singular value decomposition. We measure the time spent on one batch by SimCLR+SCR on the CIFAR-100 dataset with batch size 512 and $k = 10$. We randomly divide the samples of each augmented view into two parts in the embedding space and obtain their principal component directions separately; that is, we get 40 projection directions. We then calculate the distribution difference of each view in these 40 directions. Table 1 shows the change in time and accuracy with and without this strategy; the small loss in accuracy is acceptable given the time saved. A sketch of the strategy follows Table 1.

Table 1
Effect of divide-and-conquer on experimental accuracy and time spent. "w" and "w/o" represent using or not using the strategy, respectively.

         top-1    Time
w        64.78    18 s
w/o      65.02    25 s
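The sketch below illustrates the strategy: each view is split into M random parts, the top-k right singular vectors of each part are computed (an SVD of a smaller matrix than the full batch), and all 2Mk directions are pooled before projecting the full views. The function name and the random partitioning details are assumptions.

```python
import torch

def psd_divide_and_conquer(z1, z2, m=2, k=10):
    # z1, z2: (n, d) view embeddings; each view is split into m parts.
    def pooled_directions(z):
        dirs = []
        for part in torch.chunk(z[torch.randperm(z.size(0))], m):
            # SVD of the sub-batch: right singular vectors give its
            # principal component directions at reduced cost.
            _, _, vh = torch.linalg.svd(part, full_matrices=False)
            dirs.append(vh[:k].T)
        return torch.cat(dirs, dim=1)                  # (d, m*k)

    u = torch.cat([pooled_directions(z1), pooled_directions(z2)], dim=1)
    p1 = (z1 @ u).sort(dim=0, descending=True).values  # full views projected on all
    p2 = (z2 @ u).sort(dim=0, descending=True).values  # 2*m*k pooled directions
    return ((p1 - p2) ** 2).sum(dim=0).mean()
```

With m = 2 and k = 10, the two views yield the 40 projection directions mentioned in the text.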
4. Experiments

Benchmark Datasets. We evaluate our proposed method on classification tasks in computer vision using six datasets:

• The CIFAR-10 dataset [52] consists of 60,000 color images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images.
• The CIFAR-100 dataset is similar to CIFAR-10, except that it has 100 classes with 600 images each. There are 500 training images and 100 test images per class. These 100 classes are further grouped into 20 superclasses.
• The STL-10 dataset [53] is a subset of labeled examples from ImageNet. There are 5,000 labeled training images (500 per class) and 8,000 labeled test images (800 per class). An additional 100,000 unlabeled images are provided for unsupervised learning.
• The Tiny ImageNet dataset [54] is composed of 100,000 images in 200 classes. Each color image is downsized to 64 × 64 resolution. Each class has 500 training images, 50 validation images, and 50 test images.
• The ImageNet-100 dataset [55] is a subset of the ImageNet-1K dataset and contains 100 random classes, with 500 images per class sampled for training and 50 images per class for validation.
• The ImageNet-1K dataset [55] spans 1,000 object classes and contains 1,281,167 training images, 50,000 validation images, and 100,000 test images.

Data Augmentation. Each input image generates two corresponding augmented views in all experiments. The augmentation strategy comprises the following image transformations: random cropping, resizing, horizontal flipping, color jittering, conversion to grayscale, Gaussian blurring, and solarization. The probabilities for generating the two positive samples are as follows: horizontal mirroring with probability 0.5; color jittering with configuration (0.4, 0.4, 0.2, 0.1) with probability 0.8; Gaussian blurring with probabilities 1.0 and 0.1; and solarization with probabilities 0.1 and 0.2. A sketch of this pipeline is given below.
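A plausible torchvision composition of this pipeline is sketched here; the crop size, grayscale probability, blur kernel, and solarization threshold are assumptions, as they are not specified above.

```python
from torchvision import transforms

def make_view(size=96, blur_p=1.0, solarize_p=0.1):
    return transforms.Compose([
        transforms.RandomResizedCrop(size),           # random cropping + resizing
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomApply(
            [transforms.ColorJitter(0.4, 0.4, 0.2, 0.1)], p=0.8),
        transforms.RandomGrayscale(p=0.2),            # probability assumed
        transforms.RandomApply(
            [transforms.GaussianBlur(kernel_size=9, sigma=(0.1, 2.0))], p=blur_p),
        transforms.RandomSolarize(threshold=128, p=solarize_p),
        transforms.ToTensor(),
    ])

# Asymmetric probabilities for the two views, as stated above.
t1 = make_view(blur_p=1.0, solarize_p=0.1)
t2 = make_view(blur_p=0.1, solarize_p=0.2)
```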
Architecture and Optimization. For small and medium datasets, we use a ResNet-18 as the encoder function $f_\theta$, followed by a projection head $g_\xi$, for all experiments, where the projection head is a three-layer linear network. The output of the encoder serves as the learned representation for downstream tasks in the testing phase. For the ImageNet-100 and ImageNet-1K datasets, all methods are implemented with a ResNet-50.

For small and medium datasets, we use the LARS optimizer [56] to train the encoder $f_\theta$ for 200 epochs with a mini-batch size of 128. A base learning rate of 8 with a cosine decay schedule without restarts is applied to all datasets. After a learning-rate warm-up period of 10 epochs, the cosine decay schedule reduces the learning rate by a factor of 1,000. For ImageNet-1K, we use Stochastic Gradient Descent (SGD) with a momentum of 0.9 to minimize our objective function. The self-supervised pre-training on ImageNet-1K uses a single machine with eight A100 GPUs. The models are trained for 100/200/400 epochs with a cosine annealing schedule and Automatic Mixed Precision.

Linear evaluation. We employ the widely adopted linear evaluation protocol to evaluate the trained model. First, we freeze the learned encoder and use the Adam optimizer [57] to train a supervised linear classifier on the dataset for 200 epochs, where the linear classifier is a fully connected layer that replaces the projection head $g_\xi$ used in the training phase. The learning rate of the optimizer decays from $3 \times 10^{-1}$ to $3 \times 10^{-5}$, and the weight decay is set to $10^{-6}$. Finally, the linear classifier's top-1 (or top-5) classification accuracy is used as the final evaluation criterion.
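A sketch of this protocol, assuming a cosine-style decay between the two stated learning rates (the exact schedule is not specified above):

```python
import torch
from torch import nn

def linear_eval(encoder, feat_dim, num_classes, loader, epochs=200):
    # Freeze the pre-trained encoder; only the linear head is optimized.
    encoder.eval()
    for p in encoder.parameters():
        p.requires_grad = False

    clf = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.Adam(clf.parameters(), lr=3e-1, weight_decay=1e-6)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs, eta_min=3e-5)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                feats = encoder(x)        # frozen representations
            loss = criterion(clf(feats), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()
    return clf
```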
4.1. Comparison with the state of the art

Table 3
Classification accuracy (top-1%) on ImageNet-100. All results are based on a ResNet-50 encoder.

Methods    top-1    Methods*        top-1
SimCLR     70.15    SimCLR + SCR    72.56
DCL        72.12    DCL + SCR       75.14
BT         69.89    BT + SCR        69.94
HCL        73.45    HCL + SCR       75.89
This section investigates the effects of perturbations introduced through data augmentation strategies on the experimental results. We set different perturbations and observe the change in classification accuracy. We employ the same data augmentation strategy and the same parameter settings in each set of experiments, except for the Gaussian blur. The probability of Gaussian blur is set in {0.2, 0.5, 0.8}, with different standard deviations $\sigma$. Table 9 shows the results of SimCLR+SCR on the STL-10 dataset. The experimental results show that the classification accuracy varies only slightly across the different perturbation settings.

Table 9
The influence of different degrees of disturbance introduced by data augmentation on the experimental results.

σ \ p    0.2      0.5      0.8
2        76.85    77.07    77.21
5        76.42    76.90    76.83
8        75.93    75.61    76.14

Table 7
Parametric sensitivities of the number of projection directions k and the hyper-parameter λ.

k \ λ    10^0     10^-1    10^-2    10^-3
1        51.15    75.12    77.94    76.31
5        57.13    75.49    78.52    76.84
10       53.88    72.98    76.14    73.15
20       51.02    69.93    76.67    72.73
30       50.13    67.37    75.89    70.76

4.7. Comparison of distribution metrics

In this paper, we adopt the proposed PSD as a measure of distribution difference for constraining semantic consistency. We further compare PSD with existing distribution metrics to investigate the performance on the classification task. The methods used for comparison include the KL divergence, the JS divergence, SWD, Max-SWD, and SCR*. SCR* is a variant of SCR that does not order the samples in the principal component directions. In practical applications, since the distributions of the two augmented views are not completely non-overlapping in the projection space, the KL divergence and the JS divergence can be used to measure the difference between the views. We choose SimCLR as our benchmark method. Table 10 shows all the experimental results with different distribution constraints on the CIFAR-10 dataset. The results show that distribution constraints between views can improve the performance of the model on downstream tasks. Moreover, the +SCR method outperforms the other distribution methods. In addition, the results of +SCR* are lower than those of +SCR, illustrating the necessity of sorting.

Table 10
Top-1 and top-5 accuracy (%) of various distribution constraints on the CIFAR-10 dataset, trained and tested for 200 epochs.

Methods             top-1    top-5
SimCLR              78.86    99.21
SimCLR + KL         80.52    99.34
SimCLR + JS         80.61    99.32
SimCLR + SWD        80.43    99.30
SimCLR + Max-SWD    80.32    99.38
SimCLR + SCR*       80.19    99.21
SimCLR + SCR        81.43    99.43
[Table 8. Comparison of different data augmentations using the ResNet-18 backbone on CIFAR-10; Method+ refers to method+SCR. Columns: ID; data augmentations (horizontal flip, rotate, random crop, random gray, color jitter, mixup); methods (SimCLR+, DCL+, HCL+).]

5. Conclusion

In this paper, we propose a semantic consistency regularization (SCR) method to relieve the semantic inconsistency problem in contrastive learning. Unlike existing metrics for measuring distribution distance, we introduce a novel metric called the Paired Subspace Distance (PSD), which calculates the distribution distance between two corresponding views. The proposed PSD can be integrated into the standard framework of contrastive learning methods. Experimental results show that SCR outperforms previous methods on self-supervised and semi-supervised classification tasks on datasets of multiple scales.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgement

This work was supported by the National Key R&D Program of China (2021YFB3500700), NSFC Grant 62172026, the National Social Science Fund of China 22&ZD153, and SKLSDE.

References

[1] X. Su, L. Si, W. Qiang, J. Yu, F. Wu, C. Zheng, F. Sun, Intriguing property and counterfactual explanation of gan for remote sensing image generation, arXiv preprint arXiv:2303.05240, 2023.
[2] D. Chen, J. Hu, W. Qiang, X. Wei, E. Wu, Rethinking skip connection model as a learnable markov chain, arXiv preprint arXiv:2209.15278, 2022.
[3] W. Qiang, J. Zhang, L. Zhen, L. Jing, Robust weighted linear loss twin multi-class support vector regression for large-scale classification, Signal Process. 170 (2020) 107449.
[4] W. Qiang, H. Zhang, J. Zhang, L. Jing, Tsvm-m3: twin support vector machine based on multi-order moment matching for large-scale multi-class classification, Appl. Soft Comput. 128 (2022) 109506.
[5] W. Qiang, J. Li, B. Su, J. Fu, H. Xiong, J.-R. Wen, Meta attention-generation network for cross-granularity few-shot learning, Int. J. Comput. Vis. 131 (2023) 1211–1233.
[6] T. Chen, S. Kornblith, M. Norouzi, G. Hinton, A simple framework for contrastive learning of visual representations, in: International Conference on Machine Learning, PMLR, 2020, pp. 1597–1607.
[7] K. He, H. Fan, Y. Wu, S. Xie, R. Girshick, Momentum contrast for unsupervised visual representation learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
[8] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar, et al., Bootstrap your own latent: a new approach to self-supervised learning, Adv. Neural Inf. Proces. Syst. 33 (2020) 21271–21284.
[9] Z. Wen, Y. Li, Toward understanding the feature learning process of self-supervised contrastive learning, in: International Conference on Machine Learning, PMLR, 2021, pp. 11112–11122.
[10] M. Patacchiola, A.J. Storkey, Self-supervised relational reasoning for representation learning, Adv. Neural Inf. Proces. Syst. 33 (2020) 4003–4014.
[11] S. Arora, H. Khandeparkar, M. Khodak, O. Plevrakis, N. Saunshi, A theoretical analysis of contrastive unsupervised representation learning, in: 36th International Conference on Machine Learning, ICML 2019, International Machine Learning Society (IMLS), 2019, pp. 9904–9923.
[12] R. Hadsell, S. Chopra, Y. LeCun, Dimensionality reduction by learning an invariant mapping, in: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), vol. 2, IEEE, 2006, pp. 1735–1742.
[13] A.V.D. Oord, Y. Li, O. Vinyals, Representation learning with contrastive predictive coding, arXiv preprint arXiv:1807.03748, 2018.
[14] A. Dosovitskiy, J.T. Springenberg, M. Riedmiller, T. Brox, Discriminative unsupervised feature learning with convolutional neural networks, Adv. Neural Inf. Proces. Syst. 27 (2014).
[15] Y. Li, P. Hu, Z. Liu, D. Peng, J.T. Zhou, X. Peng, Contrastive clustering, in: 2021 AAAI Conference on Artificial Intelligence (AAAI), 2021.
[16] J.D. Robinson, C.-Y. Chuang, S. Sra, S. Jegelka, Contrastive learning with hard negative samples, in: International Conference on Learning Representations, 2020.
[17] Z. Wu, Y. Xiong, S.X. Yu, D. Lin, Unsupervised feature learning via non-parametric instance discrimination, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3733–3742.
[18] M. Wu, M. Mosse, C. Zhuang, D. Yamins, N. Goodman, Conditional negative sampling for contrastive learning of visual representations, in: International Conference on Learning Representations, 2020.
[19] C.-Y. Chuang, J. Robinson, Y.-C. Lin, A. Torralba, S. Jegelka, Debiased contrastive learning, Adv. Neural Inf. Proces. Syst. 33 (2020) 8765–8775.
[20] W. Qiang, J. Li, C. Zheng, B. Su, H. Xiong, Interventional contrastive learning with meta semantic regularizer, in: International Conference on Machine Learning, PMLR, 2022, pp. 18018–18030.
[21] S.T. Dumais, et al., Latent semantic analysis, Annu. Rev. Inf. Sci. Technol. 38 (2004) 188–230.
[22] D.I. Martin, M.W. Berry, Mathematical foundations behind latent semantic analysis, in: Handbook of Latent Semantic Analysis, 2007, pp. 35–55.
[23] H. Abdi, L.J. Williams, Principal component analysis, Wiley Interdiscip. Rev. Comput. Stat. 2 (2010) 433–459.
[24] H. Bao, L. Dong, F. Wei, Beit: Bert pre-training of image transformers, arXiv preprint arXiv:2106.08254, 2021.
[25] K. Purohit, M. Suin, A. Rajagopalan, V.N. Boddeti, Spatially-adaptive image restoration using distortion-guided networks, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2309–2319.
[26] X. Guo, H. Yang, D. Huang, Image inpainting via conditional texture and structure dual generation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 14134–14143.
[27] X. Zhan, X. Pan, B. Dai, Z. Liu, D. Lin, C.C. Loy, Self-supervised scene de-occlusion, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3784–3792.
[28] G. Larsson, M. Maire, G. Shakhnarovich, Learning representations for automatic colorization, in: European Conference on Computer Vision, Springer, 2016, pp. 577–593.
[29] R.D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, Y. Bengio, Learning deep representations by mutual information estimation and maximization, in: International Conference on Learning Representations, 2018.
[30] Y. Tian, D. Krishnan, P. Isola, Contrastive multiview coding, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, Springer, 2020, pp. 776–794.
[31] Y.-H.H. Tsai, H. Zhao, M. Yamada, L.-P. Morency, R.R. Salakhutdinov, Neural methods for point-wise dependency estimation, Adv. Neural Inf. Proces. Syst. 33 (2020) 62–72.
[32] X. Chen, K. He, Exploring simple siamese representation learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15750–15758.
[33] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, A. Joulin, Unsupervised learning of visual features by contrasting cluster assignments, Adv. Neural Inf. Proces. Syst. 33 (2020) 9912–9924.
[34] J. Li, P. Zhou, C. Xiong, S. Hoi, Prototypical contrastive learning of unsupervised representations, in: International Conference on Learning Representations, 2020.
[35] J. Zhang, K. Ma, Rethinking the augmentation module in contrastive learning: Learning hierarchical augmentation invariance with expanded views, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16650–16659.
[36] T. Xiao, X. Wang, A.A. Efros, T. Darrell, What should not be contrastive in contrastive learning, in: International Conference on Learning Representations, 2020.
[37] J. Li, W. Qiang, C. Zheng, B. Su, H. Xiong, Metaug: Contrastive learning via meta feature augmentation, arXiv preprint arXiv:2203.05119, 2022.
[38] Y. Wang, J. Lin, J. Zou, Y. Pan, T. Yao, T. Mei, Improving self-supervised learning with automated unsupervised outlier arbitration, Adv. Neural Inf. Proces. Syst. 34 (2021) 27617–27630.
[39] S. Ge, S. Mishra, C.-L. Li, H. Wang, D. Jacobs, Robust contrastive learning using negative samples with diminished semantics, Adv. Neural Inf. Proces. Syst. 34 (2021) 27356–27368.
[40] S. Chen, G. Niu, C. Gong, J. Li, J. Yang, M. Sugiyama, Large-margin contrastive learning with distance polarization regularizer, in: International Conference on Machine Learning, PMLR, 2021, pp. 1673–1683.
[41] T. Wang, P. Isola, Understanding contrastive representation learning through alignment and uniformity on the hypersphere, in: International Conference on Machine Learning, PMLR, 2020, pp. 9929–9939.
[42] A. Ermolov, A. Siarohin, E. Sangineto, N. Sebe, Whitening for self-supervised representation learning, in: International Conference on Machine Learning, PMLR, 2021, pp. 3015–3024.
[43] C.-Y. Chuang, R.D. Hjelm, X. Wang, V. Vineet, N. Joshi, A. Torralba, S. Jegelka, Y. Song, Robust contrastive learning against noisy views, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16670–16681.
[44] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, A.C. Courville, Improved training of wasserstein gans, Adv. Neural Inf. Proces. Syst. 30 (2017).
[45] S. Kolouri, G.K. Rohde, H. Hoffmann, Sliced wasserstein distance for learning gaussian mixture models, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3427–3436.
[46] W. Qiang, J. Li, C. Zheng, B. Su, H. Xiong, Robust local preserving and global aligning network for adversarial domain adaptation, IEEE Trans. Knowl. Data Eng. (2021).
[47] W. Qiang, J. Li, C. Zheng, B. Su, Auxiliary task guided mean and covariance alignment network for adversarial domain adaptation, Knowl.-Based Syst. 223 (2021) 107066.
[48] I. Deshpande, Y.-T. Hu, R. Sun, A. Pyrros, N. Siddiqui, S. Koyejo, Z. Zhao, D. Forsyth, A.G. Schwing, Max-sliced wasserstein distance and its use for gans, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 10648–10656.
[49] S. Kolouri, K. Nadjahi, U. Simsekli, R. Badeau, G. Rohde, Generalized sliced wasserstein distances, Adv. Neural Inf. Proces. Syst. 32 (2019).
[50] X. Chen, S. Wang, J. Wang, M. Long, Representation subspace distance for domain adaptation regression, in: International Conference on Machine Learning, PMLR, 2021, pp. 1749–1759.
[51] G.H. Golub, C.F. Van Loan, Matrix Computations, JHU Press, 2013.
[52] A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images, 2009.
[53] A. Coates, A. Ng, H. Lee, An analysis of single-layer networks in unsupervised feature learning, in: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, 2011, pp. 215–223.
[54] Y. Le, X. Yang, Tiny imagenet visual recognition challenge, CS 231N 7, 2015, p. 3.
[55] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., Imagenet large scale visual recognition challenge, Int. J. Comput. Vis. 115 (2015) 211–252.
[56] Y. You, I. Gitman, B. Ginsburg, Large batch training of convolutional networks, arXiv preprint arXiv:1708.03888, 2017.
[57] D.P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980, 2014.
[58] J. Zbontar, L. Jing, I. Misra, Y. LeCun, S. Deny, Barlow twins: Self-supervised learning via redundancy reduction, in: International Conference on Machine Learning, PMLR, 2021, pp. 12310–12320.