SWIN Transformer: A Unifying Step Between Computer Vision and Natural Language Processing | by Renu
Nov 8, 2021 · 8 min read · Member-only

A new Swin vision transformer capable of serving as a general-purpose backbone for computer vision.
Here we will explore, step by step, why transformers are used for computer vision, what the challenges were in adapting transformers from language to vision, what the SWIN Transformer is, how it works, and how SWIN solves some of the difficulties of ViT.
Why Transformer for Computer Vision?
Computer vision is dominated by Convolutional Neural Networks (CNNs), composed of multiple building blocks such as convolutional layers, pooling layers, and fully connected layers, which serve as the backbone network for various vision tasks such as image classification, object detection, and segmentation.

For Natural Language Processing (NLP), the Transformer is the most common and prevalent architecture. Transformers use a self-attention mechanism designed for sequence modeling and transduction tasks such as translation and text summarization. Transformers with self-attention have been an enormous success due to their ability and ease of modeling long-range dependencies in data.
The Transformer's colossal success in NLP has led researchers to explore its usage in computer vision, starting with the Vision Transformer (ViT) and now with the Shifted Window Transformer (SWIN Transformer).
!"#$%&'()#*'+,-(./0*-$+/1&/(#-(0("#&/0/2"#203
%/0*-$+/1&/(,"+-&(/&4/&-&*%0%#+*(#-(2+145%&'(67
-"#$%#*8(,#*'+,-(67(3#1#%#*8(-&3$90%%&*%#+*(%+(*+*9
+:&/3044#*8(3+203(,#*'+,-(,"#3&(03-+(033+,#*8($+/
2/+--9,#*'+,(2+**&2%#+*
@GH,%I7(#=A-75"7%J@-$7B"K%@+6#%I7(#=A-75"7K%L6"7(7B'6B(*%M6=6-#%I7(#=A-75"7%$=6#D%@'6A>")%G6#)-+=N
The ability to model visual entities at various scales, together with linear rather than quadratic computational complexity, helps SWIN Transformers serve as a general-purpose backbone for any vision task.
The shifted windows bridge the windows of the preceding layer, providing connections among the different windows that enhance the modeling power. The shifted window strategy is also efficient: all the query patches within a window share the same key set, which helps with faster memory access.
Swin Transformer Architecture

SWIN Transformer Architecture, Swin-T (Source: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows)
The Swin Transformer splits the RGB image into non-overlapping patches using a patch-partitioning module. Each patch is treated as a token, and its feature is set as a concatenation of the raw RGB pixel values.

For the CIFAR-100 dataset, with an image size of 32 × 32 × 3, the image can be split into patches of size 4 × 4. The feature dimension of each patch is then 4 × 4 × 3 = 48.
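The patch partitioning above can be sketched with a few array reshapes. This is a minimal illustration (not the library implementation), assuming a 32 × 32 × 3 input and a patch size of 4 as in the CIFAR-100 example:

```python
import numpy as np

# Split a 32 x 32 x 3 image into non-overlapping 4 x 4 patches,
# then flatten each patch into a token feature vector.
patch_size = 4
image = np.random.rand(32, 32, 3)  # H x W x C

H, W, C = image.shape
patches = image.reshape(H // patch_size, patch_size,
                        W // patch_size, patch_size, C)
patches = patches.transpose(0, 2, 1, 3, 4)           # (8, 8, 4, 4, 3)
tokens = patches.reshape(-1, patch_size * patch_size * C)

print(tokens.shape)  # (64, 48): 8*8 = 64 tokens, each 4*4*3 = 48-dimensional
```

Each of the 64 rows is one token whose 48 features are the concatenated raw RGB values of its patch.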
In Stage 1, a linear embedding is applied to the raw patch features, followed by Swin Transformer blocks.

As the network gets deeper, the number of tokens is reduced by patch-merging layers to produce a hierarchical representation.

In Stage 2, the patch-merging layer concatenates the features of each group of 2 × 2 neighboring patches and applies a linear layer to the 4C-dimensional concatenated features. This reduces the number of tokens by a factor of 4. The process is repeated in Stage 3 and Stage 4.
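The merging step above can be sketched as follows. This is a hedged numpy sketch, not the official implementation; the output width of 2C follows the Swin paper's patch-merging layer:

```python
import numpy as np

def patch_merge(x, out_proj):
    """Concatenate each 2x2 group of neighboring tokens and project.

    x: (H, W, C) token grid; out_proj: (4*C, 2*C) linear weight.
    Returns an (H/2, W/2, 2*C) grid: 4x fewer tokens, wider features.
    """
    # Gather the four members of every 2x2 neighborhood -> 4C features.
    merged = np.concatenate([x[0::2, 0::2], x[1::2, 0::2],
                             x[0::2, 1::2], x[1::2, 1::2]], axis=-1)
    return merged @ out_proj  # linear layer on the 4C-dim features

C = 96                              # assumed stage-1 embedding dim (Swin-T)
x = np.random.rand(8, 8, C)
w = np.random.rand(4 * C, 2 * C)
y = patch_merge(x, w)
print(y.shape)  # (4, 4, 192): 64 tokens reduced to 16, C doubled
```

Note how the token count drops by 4 while the channel dimension only doubles, which is what builds the hierarchical (pyramid-like) representation.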
@GH,%I7(#=A-75"7%J@-$7B"K%@+6#%I7(#=A-75"7K%L6"7(7B'6B(*%M6=6-#%I7(#=A-75"7%$=6#D%@'6A>")%G6#)-+=N
J3N:8@?0%O$()7(>6B%>-%65(D"%)65"#=6-#0%J1N:%G:8@?%*6#"(7%>-%65(D"%)65"#=6-#%+'"7"%8%6=%A6P")%>-%Q%9;%)"A($*>
The first Swin Transformer block in each pair uses a regular window-partitioning strategy, whereas the next block adopts a shifted window-partitioning strategy.

Regular window partitioning starts from the top-left pixel. As shown above, the 8 × 8 feature map is evenly partitioned into 2 × 2 windows of size 4 × 4 (M = 4). Self-attention is computed within each window.
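The regular partitioning can be sketched the same way as patch splitting, just at the window level. A minimal sketch, assuming an 8 × 8 feature map with C = 96 channels and M = 4:

```python
import numpy as np

# Partition an 8 x 8 feature map into non-overlapping M x M windows (M = 4).
M = 4
feature_map = np.random.rand(8, 8, 96)   # H x W x C

H, W, C = feature_map.shape
windows = feature_map.reshape(H // M, M, W // M, M, C)
windows = windows.transpose(0, 2, 1, 3, 4).reshape(-1, M * M, C)

# 2 x 2 = 4 windows, each a sequence of M*M = 16 tokens; self-attention
# is then computed independently within each window.
print(windows.shape)  # (4, 16, 96)
```

Because attention never crosses a window boundary here, the cost grows linearly with the number of windows rather than quadratically with the full image size.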
Efficient batch computation for shifted window partitioning using a cyclic approach
Shifted window partitioning generates more windows, and some of them are smaller than M × M. A naive solution is to pad the smaller windows to size M × M and mask out the padded values when computing attention, but this increases computation and is inefficient.
RAA6B6"#>%9(>B'%B-5C$>(>6-#%(CC7-(B'%A-7%="*A:(>>"#>6-#%6#%='6A>")%+6#)-+%C(7>6>6-#6#D
After the cyclic shift, a batched window may be composed of several sub-
windows that are not adjacent in the feature maps; hence a masking
mechanism is employed to limit self-attention computation within each
sub-window. With the cyclic shift, the number of batched windows remains
the same as that of the regular window partitioning and is computationally
efficient too.
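The cyclic shift itself is cheap; a sketch of the idea (assuming it can be modeled with `np.roll`, which matches `torch.roll` used in the reference code) shows that the map is rolled toward the top-left before windowing and rolled back afterward:

```python
import numpy as np

# Cyclic shift: roll the feature map up and left by M // 2 before window
# partitioning, so shifted windows can be batched exactly like regular
# ones; a mask then blocks attention between sub-windows that were not
# actually adjacent in the original map.
M = 4
x = np.arange(64).reshape(8, 8)

shifted = np.roll(x, shift=(-(M // 2), -(M // 2)), axis=(0, 1))

# Rolling back restores the original map, so the shift adds no real cost.
restored = np.roll(shifted, shift=(M // 2, M // 2), axis=(0, 1))
print(np.array_equal(restored, x))  # True
```

The window count after the roll equals the regular-partition window count, which is why the batched computation stays efficient.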
Relative position bias
Swin Transformers include a relative position bias B when calculating self-attention:

Attention(Q, K, V) = SoftMax(QKᵀ/√d + B)V

where Q, K, and V are the query, key, and value matrices; d is the query/key dimension; and M² is the number of patches in a window.
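The formula above can be checked with a small numpy sketch. The shapes follow the definitions in the text (M² tokens per window, dimension d); the random B here stands in for the learned bias table:

```python
import numpy as np

def window_attention(Q, K, V, B):
    """Self-attention within one window, with relative position bias B.

    Q, K, V: (M*M, d) query/key/value matrices for a window of M*M patches;
    B: (M*M, M*M) relative position bias added to the attention logits.
    """
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d) + B     # QK^T / sqrt(d) + B
    # Numerically stable softmax over the key axis.
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                    # SoftMax(...) V

M, d = 4, 32
Q, K, V = (np.random.rand(M * M, d) for _ in range(3))
B = np.random.rand(M * M, M * M)
out = window_attention(Q, K, V, B)
print(out.shape)  # (16, 32): one d-dimensional output per patch in the window
```

Because B depends only on the relative positions of patches inside a window, the same learned table is reused by every window.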
!"#$%&'()*+,'-.'%?.'+,'-()@.
For the image classification task, SWIN Transformer models were trained on regular ImageNet-1K using the AdamW optimizer for 300 epochs. For dense prediction tasks such as object detection, SWIN achieves 58.7 box AP and 51.1 mask AP on COCO test-dev.
Conclusion
Computer vision is poised for a shift from CNNs to Transformers as the generic backbone architecture for various vision tasks. The trend started with the Vision Transformer (ViT), which globally models spatial relationships on non-overlapping image patches with standard Transformer encoders.
Swin Transformer's strong performance on various vision problems will unify the
modeling of vision and language.
References:

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, by Ze Liu et al.

https://github.com/microsoft/Swin-Transformer/blob/2622619f70760b60a42b996f5fcbe7c9d2e7ca57/models/swin_transformer.py#L458

Keras documentation: Image classification with Swin Transformers, by Rishit Dagli (keras.io)