
SWIN Transformer: A Unifying Step Between Computer Vision and Natural Language Processing

Renu Khandelwal
Nov 8, 2021 · 8 min read

A new Swin vision transformer capable of serving as a general-purpose backbone for computer vision.
Here we will explore step by step why transformers are used for computer vision, what the challenges were in adapting transformers from language to vision, what SWIN Transformer is, how it works, and how SWIN solves some of the difficulties in adapting transformers for vision.

Why Transformer for Computer Vision?
Computer vision is dominated by Convolutional Neural Networks (CNNs), composed of multiple building blocks such as convolutional layers, pooling layers, and fully connected layers, which serve as the backbone network for various vision tasks such as image classification, object detection, and segmentation.

For Natural Language Processing (NLP), the Transformer is the most common and prevalent architecture. Transformers use a self-attention mechanism designed for sequence modeling and transduction tasks such as translation and text summarization. Transformers with self-attention have been an enormous success due to their ability to easily model long-range dependencies in data.

The Transformer's colossal success in NLP has led researchers to explore its usage in computer vision, starting with Vision Transformers and now with the Shifted Window Transformer (SWIN Transformer).

What were the challenges in adapting Transformers from language to vision?

Challenges in adapting Transformers from language to vision stem from differences between the two domains:

Images have large variations in the scale of visual entities; however, transformer-based models use tokens of a fixed scale.

Images have a much higher resolution of pixels compared to the words in a text passage.

The computational complexity of self-attention is quadratic in image size.

What is SWIN Transformer?

!"#$%&'()#*'+,-(./0*-$+/1&/(#-(0("#&/0/2"#203
%/0*-$+/1&/(,"+-&(/&4/&-&*%0%#+*(#-(2+145%&'(67
-"#$%#*8(,#*'+,-(67(3#1#%#*8(-&3$90%%&*%#+*(%+(*+*9
+:&/3044#*8(3+203(,#*'+,-(,"#3&(03-+(033+,#*8($+/
2/+--9,#*'+,(2+**&2%#+*

@GH,%I7(#=A-75"7%J@-$7B"K%@+6#%I7(#=A-75"7K%L6"7(7B'6B(*%M6=6-#%I7(#=A-75"7%$=6#D%@'6A>")%G6#)-+=N

The hierarchical architecture builds feature maps by merging image patches in deeper layers, giving it the flexibility to model visual entities at various scales. It has linear computational complexity with respect to image size because self-attention is computed within each local non-overlapping window, while the shifted windows still allow for cross-window connections.

The ability to model various scales of visual entities and a linear computational complexity instead of a quadratic one help the SWIN Transformer serve as a general-purpose backbone for any vision task.

How does SWIN Transformer work?

A key design element of Swin Transformer is the shift of the window partition between consecutive self-attention layers.

Swin Transformer splits an input RGB image into non-overlapping patches using a patch splitting module. Each patch is treated as a token, and its feature is set as a concatenation of the raw RGB pixel values.

The shifted windows bridge the windows of the preceding layers, providing
connections among the different windows, which helps to enhance the modeling
power. The shifted window strategy is also efficient as all the query patches
within the windows share the same key set, which helps with faster memory
access.

Swin Transformer Architecture

SWIN Transformer Architecture (Swin-T)

Swin Transformer splits the RGB image into non-overlapping patches using a patch partitioning module. Each patch is treated as a token, and its feature is set as a concatenation of the raw RGB pixels.

For the CIFAR-100 dataset, with an image size of 32 × 32 × 3, the image can be split into patches of size 4 × 4. The feature dimension of each patch will be 4 × 4 × 3 = 48.

A linear embedding layer is applied on this raw-valued feature to project it to an arbitrary dimension, denoted by C.
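As a rough sketch (not the official implementation), the patch partition and linear embedding can be expressed with a single strided convolution; patch_size=4 and embed_dim=C=96 (the Swin-T value) are assumptions here.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and linearly embed them.

    A 4x4 patch of an RGB image has 4*4*3 = 48 raw values; the linear
    embedding projects each patch to an arbitrary dimension C.
    """
    def __init__(self, patch_size=4, in_chans=3, embed_dim=96):
        super().__init__()
        # A conv with kernel_size == stride == patch_size is equivalent to
        # flattening each patch and applying a shared linear layer.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, C, H/4, W/4)
        x = x.flatten(2).transpose(1, 2)       # (B, num_patches, C)
        return self.norm(x)

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                            # torch.Size([1, 3136, 96])
```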

Several Swin Transformer blocks with modified self-attention computation are applied to these patch tokens. The Swin Transformer block is built by replacing the standard multi-head self-attention (MSA) module in a Transformer block with a module based on shifted windows.

A Swin Transformer block consists of a shifted window-based MSA module, followed by a 2-layer Multi-Layer Perceptron (MLP) with Gaussian Error Linear Unit (GELU) non-linearity, a high-performing activation function that weights inputs by their value rather than gating inputs by their sign as ReLU does. A LayerNorm (LN) layer is applied before each MSA module and each MLP, and a residual connection is applied after each module. Layer normalization normalizes the distributions of intermediate layers.
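A minimal sketch of this block structure (LayerNorm before each sub-module, residual connection after each, and a 2-layer GELU MLP). The window partitioning and shifting are omitted, and standard multi-head attention stands in for the (S)W-MSA module purely for illustration.

```python
import torch
import torch.nn as nn

class SwinBlockSketch(nn.Module):
    """Skeleton of a Swin Transformer block: attention + 2-layer GELU MLP,
    each preceded by LayerNorm and followed by a residual connection."""
    def __init__(self, dim=96, num_heads=3, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # placeholder for (S)W-MSA
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),                                   # GELU non-linearity
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):                                # x: (B, L, C)
        shortcut = x
        y = self.norm1(x)                                # LN before attention
        x = shortcut + self.attn(y, y, y, need_weights=False)[0]  # residual after attention
        x = x + self.mlp(self.norm2(x))                  # LN before MLP, residual after MLP
        return x
```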

Stage 1 consists of the linear embedding layer followed by Swin Transformer blocks.

The number of tokens is reduced using the patch merging layers as the
network gets deeper to produce a hierarchical representation.

In Stage 2, a patch merging layer concatenates the features of each group of 2 × 2 neighboring patches and applies a linear layer on the 4C-dimensional concatenated features, with the output dimension set to 2C. This reduces the number of tokens by a factor of 4. This process is repeated in Stage 3 and Stage 4.
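A hedged sketch of the patch merging operation described above: features of each 2 × 2 group of neighboring patches are concatenated (giving 4C channels) and a linear layer reduces them to 2C, quartering the number of tokens.

```python
import torch
import torch.nn as nn

class PatchMergingSketch(nn.Module):
    """Concatenate each 2x2 group of neighboring patches (4C features)
    and project with a linear layer, reducing tokens by a factor of 4."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x, H, W):                 # x: (B, H*W, C)
        B, L, C = x.shape
        x = x.view(B, H, W, C)
        x0 = x[:, 0::2, 0::2, :]                # top-left patch of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]                # bottom-left
        x2 = x[:, 0::2, 1::2, :]                # top-right
        x3 = x[:, 1::2, 1::2, :]                # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1) # (B, H/2, W/2, 4C)
        x = x.view(B, -1, 4 * C)
        return self.reduction(self.norm(x))     # (B, H*W/4, 2C)
```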

SWIN Transformer (Source: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows)

The patch merging in Stages 2, 3, and 4 jointly produces a hierarchical representation similar to the feature maps of a CNN, enabling the Swin Transformer to serve as the backbone for different vision tasks.

Self-attention in non-overlapping windows computes self-attention within local windows, giving computational complexity that is linear in the input image size. The standard transformer for vision conducts global self-attention, which has quadratic complexity with respect to the number of tokens, leading to an intensive computational cost.

(1) MSA: quadratic in image dimension; (2) W-MSA: linear in image dimension, where M is fixed to 7 by default
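For reference, these are the two complexity expressions from the Swin Transformer paper, for a feature map of h × w patches with channel dimension C and window size M:

$$\Omega(\text{MSA}) = 4\,h w C^{2} + 2\,(h w)^{2} C \tag{1}$$

$$\Omega(\text{W-MSA}) = 4\,h w C^{2} + 2\,M^{2} h w\,C \tag{2}$$

The first term is shared; the second term is quadratic in hw for global MSA but linear in hw for W-MSA when M is fixed.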

The window-based self-attention module lacks connections across windows, which limits its modeling power. This limitation is overcome in Swin Transformer with a shifted window partitioning approach, which alternates between two partitioning configurations in consecutive Swin Transformer blocks.

The first Swin Transformer block module uses a regular window partitioning strategy, whereas the next block adopts a shifted window partitioning strategy.

The first, regular window partitioning starts from the top-left pixel. As shown above, the 8 × 8 feature map is evenly partitioned into 2 × 2 windows of size 4 × 4 (M = 4). Self-attention is computed within each window.
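A small sketch of this regular window partitioning; the function name and shapes follow the reference implementation's convention, but this is an illustrative version rather than the official code.

```python
import torch

def window_partition(x, window_size=4):
    """Split a (B, H, W, C) feature map into non-overlapping
    (window_size x window_size) windows for local self-attention."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)
    return windows                                   # (num_windows * B, M, M, C)

feat = torch.randn(1, 8, 8, 96)                      # the 8 x 8 feature map from the example
print(window_partition(feat, window_size=4).shape)   # torch.Size([4, 4, 4, 96]) -> 2 x 2 windows
```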

The next module adopts a windowing configuration in which the window partitioning is shifted relative to the preceding layer, resulting in new windows in layer l+1. The self-attention computation in the new windows crosses the boundaries of the previous windows in layer l, providing connections between neighboring non-overlapping windows. This is found to be effective in image classification, object detection, and semantic segmentation.

Efficient batch computation for shifted window partitioning using a cyclic approach
Shifted window partitioning generates more windows, and some of the windows will be smaller than M × M. A naive solution is to pad the smaller windows to the size M × M and mask out the padded values when computing attention. Still, this naive solution results in increased computation, which is not efficient.

A better workaround is an efficient batch computation approach that cyclic-shifts the feature map towards the top-left direction.

RAA6B6"#>%9(>B'%B-5C$>(>6-#%(CC7-(B'%A-7%="*A:(>>"#>6-#%6#%='6A>")%+6#)-+%C(7>6>6-#6#D

After the cyclic shift, a batched window may be composed of several sub-
windows that are not adjacent in the feature maps; hence a masking
mechanism is employed to limit self-attention computation within each
sub-window. With the cyclic shift, the number of batched windows remains
the same as that of the regular window partitioning and is computationally
efficient too.
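A hedged sketch of this cyclic shift: torch.roll displaces the feature map towards the top-left by half the window size, the regular window partition is reused, and an attention mask would then keep sub-windows that were not adjacent before the shift from attending to each other (the mask construction is omitted here for brevity).

```python
import torch

M, shift = 4, 2                                    # window size and shift (= M // 2)
x = torch.randn(1, 8, 8, 96)                       # (B, H, W, C) feature map

# Cyclic shift towards the top-left: rows/columns wrap around, so the number of
# batched windows stays the same as with regular window partitioning.
shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))

# Regular window partition applied to the shifted map.
B, H, W, C = shifted.shape
wins = (shifted.view(B, H // M, M, W // M, M, C)
               .permute(0, 1, 3, 2, 4, 5)
               .reshape(-1, M * M, C))             # (num_windows * B, M*M, C)

# ... masked self-attention within each window goes here ...

# Reverse: merge the windows back and undo the cyclic shift.
merged = (wins.reshape(B, H // M, W // M, M, M, C)
              .permute(0, 1, 3, 2, 4, 5)
              .reshape(B, H, W, C))
out = torch.roll(merged, shifts=(shift, shift), dims=(1, 2))
```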

Relative position bias
Swin Transformers include a relative position bias B when calculating self-attention:

Attention(Q, K, V) = SoftMax(QK^T / √d + B) V

where Q, K, V are the query, key, and value matrices, d is the query/key dimension, M² is the number of patches in a window, and B is the relative position bias.
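A simplified sketch of this computation, with the bias B kept as a directly learnable (M², M²) parameter per head rather than the paper's smaller (2M−1) × (2M−1) table indexed by relative offsets.

```python
import torch
import torch.nn as nn

class WindowAttentionSketch(nn.Module):
    """Attention(Q, K, V) = SoftMax(Q K^T / sqrt(d) + B) V within one window.
    Simplification: B is a learnable (M^2, M^2) matrix per head instead of the
    relative-position bias table used in the paper."""
    def __init__(self, dim=96, window_size=4, num_heads=3):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        n = window_size * window_size
        self.bias = nn.Parameter(torch.zeros(num_heads, n, n))   # relative position bias B

    def forward(self, x):                            # x: (num_windows*B, M*M, C)
        Bn, N, C = x.shape
        qkv = self.qkv(x).reshape(Bn, N, 3, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]             # each (Bn, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale + self.bias   # add B to the scores
        x = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(Bn, N, C)
        return self.proj(x)
```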

Using relative position bias significantly improves performance over transformers that use absolute position embedding.

!"#$%&'()*+,'-.'%?.'+,'-()@.
Task: SWIN Transformer performance for image classification task

Dataset: ImageNet-1K, containing 1.28M training images and 50K validation images across 1,000 classes.

Training details: Swin Transformer uses

AdamW optimizer

300 epochs

Cosine decay learning rate scheduler with 20 epochs of linear warm-up, an initial learning rate of 0.001, and a weight decay of 0.05

Batch size of 1024

Most of the common augmentation and regularization strategies (a minimal optimizer and scheduler sketch follows this list)
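A hedged sketch of such a training configuration in PyTorch. The model and training loop are placeholders, and SequentialLR is just one possible way to combine 20 linear warm-up epochs with cosine decay.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = nn.Linear(48, 1000)                        # placeholder for a Swin model
epochs, warmup_epochs = 300, 20

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=1e-3, total_iters=warmup_epochs),   # linear warm-up
        CosineAnnealingLR(optimizer, T_max=epochs - warmup_epochs),          # cosine decay
    ],
    milestones=[warmup_epochs],
)

for epoch in range(epochs):
    # ... one pass over the ImageNet-1K training set with batch size 1024 ...
    scheduler.step()
```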

(a) Regular ImageNet-1K trained models

Method            Image size  #Params  FLOPs    Throughput (img/s)  Top-1 acc. (%)
RegNetY-4G [48]   224²        21M      4.0G     1156.7              80.0
RegNetY-8G [48]   224²        39M      8.0G     591.6               81.7
RegNetY-16G [48]  224²        84M      16.0G    334.7               82.9
EffNet-B3 [58]    300²        12M      1.8G     732.1               81.6
EffNet-B4 [58]    380²        19M      4.2G     349.4               82.9
EffNet-B5 [58]    456²        30M      9.9G     169.1               83.6
EffNet-B6 [58]    528²        43M      19.0G    96.9                84.0
EffNet-B7 [58]    600²        66M      37.0G    55.1                84.3
ViT-B/16 [20]     384²        86M      55.4G    85.9                77.9
ViT-L/16 [20]     384²        307M     190.7G   27.3                76.5
DeiT-S [63]       224²        22M      4.6G     940.4               79.8
DeiT-B [63]       224²        86M      17.5G    292.3               81.8
DeiT-B [63]       384²        86M      55.4G    85.9                83.1
Swin-T            224²        29M      4.5G     755.2               81.3
Swin-S            224²        50M      8.7G     436.9               83.0
Swin-B            224²        88M      15.4G    278.1               83.5
Swin-B            384²        88M      47.0G    84.7                84.5

Swin surpasses DeiT architectures with similar complexities. Compared to CNN models like RegNet and EfficientNet, Swin Transformer achieves a slightly better speed-accuracy trade-off.

87.3 top-1 accuracy on ImageNet-1K for image classification

58.7 box AP and 51.1 mask AP on COCO test-dev for dense prediction tasks such as object detection

53.5 mIoU on ADE20K val for semantic segmentation

Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones.

Conclusion
Computer vision is poised for a shift from CNNs to Transformers as the generic backbone architecture for various vision tasks. The trend started with the Vision Transformer (ViT), which globally models spatial relationships on non-overlapping image patches with standard Transformer encoders.

Swin Transformer produces a hierarchical feature representation with linear complexity with respect to the image size. A key element of Swin Transformer is the shifted window-based local self-attention computed on non-overlapping windows. Swin Transformers achieve a better speed-accuracy trade-off compared to other vision models. Swin introduces the inductive biases of locality, hierarchical feature representation, and translation invariance, which enable it to serve as a general-purpose backbone for various image recognition tasks.

Swin Transformer's strong performance on various vision problems will help unify the modeling of vision and language.

References:
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows by Ze
Liu

https://github.com/microsoft/Swin-Transformer/blob/2622619f70760b60a42b996f5fcbe7c9d2e7ca57/models/swin_transformer.py#L458

Video Swin Transformer

Self-Attention with Relative Position Representations

Training data-efficient image transformers & distillation through attention

Keras documentation: Image classification with Swin Transformers (Rishit Dagli, keras.io)
