Masked-attention Mask Transformer for Universal Image Segmentation (Mask2Former)
Bowen Cheng^{1,2}*  Ishan Misra^1  Alexander G. Schwing^2  Alexander Kirillov^1  Rohit Girdhar^1
^1 Facebook AI Research (FAIR)   ^2 University of Illinois at Urbana-Champaign (UIUC)
https://bowenc0221.github.io/mask2former
* Work done during an internship at Facebook AI Research.
Abstract
1. Introduction
Image segmentation studies the problem of grouping pixels. Different semantics for grouping pixels, e.g., category or instance membership, have led to different types of segmentation tasks, such as panoptic, instance or semantic segmentation. While these tasks differ only in semantics, current methods develop specialized architectures for each task. Per-pixel classification architectures based on Fully Convolutional Networks (FCNs) [37] are used for semantic segmentation, while mask classification architectures [5, 24], which predict a set of binary masks each associated with a single category, dominate instance-level segmentation. Although such specialized architectures [6, 10, 24, 37] have advanced each individual task, they lack the flexibility to generalize to the other tasks. For example, FCN-based architectures struggle at instance segmentation, leading to the evolution of different architectures for instance segmentation compared to semantic segmentation. Thus, duplicate research and (hardware) optimization effort is spent on each specialized architecture for every task.

To address this fragmentation, recent work [14, 62] has attempted to design universal architectures that are capable of addressing all segmentation tasks with the same architecture (i.e., universal image segmentation). These architectures are typically based on an end-to-end set prediction objective (e.g., DETR [5]), and successfully tackle multiple tasks without modifying the architecture, loss, or training procedure. Note that universal architectures are still trained separately for different tasks and datasets, albeit with the same architecture. In addition to being flexible, universal architectures have recently shown state-of-the-art results on semantic and panoptic segmentation [14]. However, recent work still focuses on advancing specialized architectures [20, 39, 45], which raises the question: why haven't universal architectures replaced specialized ones?

Although existing universal architectures are flexible enough to tackle any segmentation task, as shown in Figure 1, in practice their performance lags behind the best specialized architectures. For instance, the best reported
performance of universal architectures [14, 62] is currently lower (by more than 9 AP) than the SOTA specialized architecture for instance segmentation [6]. Beyond the inferior performance, universal architectures are also harder to train. They typically require more advanced hardware and a much longer training schedule. For example, training MaskFormer [14] takes 300 epochs to reach 40.1 AP, and it can only fit a single image on a GPU with 32G memory. In contrast, the specialized Swin-HTC++ [6] obtains better performance in only 72 epochs. Both the performance and training efficiency issues hamper the deployment of universal architectures.

In this work, we propose a universal image segmentation architecture named Masked-attention Mask Transformer (Mask2Former) that outperforms specialized architectures across different segmentation tasks, while still being easy to train on every task. We build upon a simple meta architecture [14] consisting of a backbone feature extractor [25, 36], a pixel decoder [33] and a Transformer decoder [51]. We propose key improvements that enable better results and efficient training. First, we use masked attention in the Transformer decoder, which restricts the attention to localized features centered around predicted segments, which can be either objects or regions depending on the specific semantic used for grouping. Compared to the cross-attention used in a standard Transformer decoder, which attends to all locations in an image, our masked attention leads to faster convergence and improved performance. Second, we use multi-scale high-resolution features, which help the model segment small objects/regions. Third, we propose optimization improvements such as switching the order of self- and cross-attention, making query features learnable, and removing dropout, all of which improve performance without additional compute. Finally, we save 3× training memory without affecting performance by calculating the mask loss on a few randomly sampled points. These improvements not only boost model performance, but also make training significantly easier, making universal architectures more accessible to users with limited compute.

We evaluate Mask2Former on three image segmentation tasks (panoptic, instance and semantic segmentation) using four popular datasets (COCO [35], Cityscapes [16], ADE20K [65] and Mapillary Vistas [42]). For the first time, on all these benchmarks, our single architecture performs on par with or better than specialized architectures. Mask2Former sets the new state-of-the-art of 57.8 PQ on COCO panoptic segmentation [28], 50.1 AP on COCO instance segmentation [35] and 57.7 mIoU on ADE20K semantic segmentation [65] using the exact same architecture.

2. Related Work

Specialized semantic segmentation architectures typically treat the task as a per-pixel classification problem. FCN-based architectures [37] independently predict a category label for every pixel. Follow-up methods find context to play an important role for precise per-pixel classification and focus on designing customized context modules [7, 8, 63] or self-attention variants [21, 26, 45, 55, 61, 64].

Specialized instance segmentation architectures are typically based upon "mask classification." They predict a set of binary masks, each associated with a single class label. The pioneering work, Mask R-CNN [24], generates masks from detected bounding boxes. Follow-up methods either focus on detecting more precise bounding boxes [4, 6], or on finding new ways to generate a dynamic number of masks, e.g., using dynamic kernels [3, 49, 56] or clustering algorithms [11, 29]. Although the performance has been advanced in each task, these specialized innovations lack the flexibility to generalize from one task to the other, leading to duplicated research effort. For instance, although multiple approaches have been proposed for building feature pyramid representations [33], as we show in our experiments, BiFPN [47] performs better for instance segmentation while FaPN [39] performs better for semantic segmentation.

Panoptic segmentation has been proposed to unify the semantic and instance segmentation tasks [28]. Architectures for panoptic segmentation either combine the best of specialized semantic and instance segmentation architectures into a single framework [11, 27, 31, 60] or design novel objectives that treat semantic regions and instance objects equally [5, 52]. Despite these new architectures, researchers continue to develop specialized architectures for different image segmentation tasks [20, 45]. We find panoptic architectures usually only report performance on the single panoptic segmentation task [52], which does not guarantee good performance on other tasks (Figure 1). For example, panoptic segmentation does not measure architectures' abilities to rank predictions as instance segmentation does. Thus, we refrain from referring to architectures that are only evaluated on panoptic segmentation as universal architectures. Instead, here, we evaluate our Mask2Former on all studied tasks to guarantee generalizability.

Universal architectures have emerged with DETR [5], which shows that mask classification architectures with an end-to-end set prediction objective are general enough for any image segmentation task. MaskFormer [14] shows that mask classification based on DETR not only performs well on panoptic segmentation but also achieves state-of-the-art results on semantic segmentation. K-Net [62] further extends set prediction to instance segmentation. Unfortunately, these architectures fail to replace specialized models, as their performance on particular tasks or datasets is still worse than the best specialized architecture (e.g., MaskFormer [14] cannot segment instances well). To our knowledge, Mask2Former is the first architecture that outperforms state-of-the-art specialized architectures on all considered tasks and datasets.
3. Masked-attention Mask Transformer
We now present Mask2Former. We first review a meta architecture for mask classification that Mask2Former is built upon. Then, we introduce our new Transformer decoder with masked attention, which is the key to better convergence and results. Lastly, we propose training improvements that make Mask2Former efficient and accessible.

Figure 2. Mask2Former overview. Mask2Former adopts the same meta architecture as MaskFormer [14] with a backbone, a pixel decoder and a Transformer decoder. We propose a new Transformer decoder with masked attention instead of the standard cross-attention (Section 3.2.1). To deal with small objects, we propose an efficient way of utilizing high-resolution features from a pixel decoder by feeding one scale of the multi-scale feature to one Transformer decoder layer at a time (Section 3.2.2). In addition, we switch the order of self- and cross-attention (i.e., our masked attention), make query features learnable, and remove dropout to make computation more effective (Section 3.2.3). Note that positional embeddings and predictions from intermediate Transformer decoder layers are omitted in this figure for readability.

3.1. Mask classification preliminaries

Mask classification architectures group pixels into N segments by predicting N binary masks, along with N corresponding category labels. Mask classification is sufficiently general to address any segmentation task by assigning different semantics, e.g., categories or instances, to different segments. However, the challenge is to find good representations for each segment. For example, Mask R-CNN [24] uses bounding boxes as the representation, which limits its application to semantic segmentation. Inspired by DETR [5], each segment in an image can be represented as a C-dimensional feature vector ("object query") and can be processed by a Transformer decoder trained with a set prediction objective. A simple meta architecture consists of three components: a backbone that extracts low-resolution features from an image; a pixel decoder that gradually upsamples low-resolution features from the output of the backbone to generate high-resolution per-pixel embeddings; and finally a Transformer decoder that operates on image features to process object queries. The final binary mask predictions are decoded from the per-pixel embeddings with the object queries. One successful instantiation of such a meta architecture is MaskFormer [14], and we refer readers to [14] for more details.
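To make the meta architecture concrete, the sketch below shows how binary masks are decoded from per-pixel embeddings and object queries via a dot product. It is a minimal illustration under our own naming assumptions (the module signatures and head names are placeholders, not the MaskFormer/Mask2Former API):

```python
import torch

def meta_architecture_forward(image, backbone, pixel_decoder,
                              transformer_decoder, class_head, mask_head):
    """Sketch of the mask-classification meta architecture (Section 3.1)."""
    features = backbone(image)                              # low-resolution image features
    per_pixel_emb, ms_features = pixel_decoder(features)    # (C, H, W) embeddings + multi-scale maps
    queries = transformer_decoder(ms_features)              # (N, C) object queries

    class_logits = class_head(queries)                      # (N, K+1) one category per query
    mask_embed = mask_head(queries)                         # (N, C)
    # Each binary mask is a dot product between a query embedding
    # and the per-pixel embeddings.
    mask_logits = torch.einsum("nc,chw->nhw", mask_embed, per_pixel_emb)
    return class_logits, mask_logits
```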
3.2. Transformer decoder with masked attention

Mask2Former adopts the aforementioned meta architecture, with our proposed Transformer decoder (Figure 2, right) replacing the standard one. The key components of our Transformer decoder include a masked attention operator, which extracts localized features by constraining cross-attention to within the foreground region of the predicted mask for each query, instead of attending to the full feature map. To handle small objects, we propose an efficient multi-scale strategy to utilize high-resolution features. It feeds successive feature maps from the pixel decoder's feature pyramid into successive Transformer decoder layers in a round-robin fashion. Finally, we incorporate optimization improvements that boost model performance without introducing additional computation. We now discuss these improvements in detail.

3.2.1 Masked attention

Context features have been shown to be important for image segmentation [7, 8, 63]. However, recent studies [22, 46] suggest that the slow convergence of Transformer-based models is due to the global context in the cross-attention layer, as it takes many training epochs for cross-attention to learn to attend to localized object regions [46]. We hypothesize that local features are enough to update query features and that context information can be gathered through self-attention. For this, we propose masked attention, a variant of cross-attention that only attends within the foreground region of the predicted mask for each query.

Standard cross-attention (with residual path) computes

$$X_l = \mathrm{softmax}(Q_l K_l^{\mathsf{T}}) V_l + X_{l-1}. \qquad (1)$$

Here, $l$ is the layer index, $X_l \in \mathbb{R}^{N \times C}$ refers to the $N$ $C$-dimensional query features at the $l$-th layer, and $Q_l = f_Q(X_{l-1}) \in \mathbb{R}^{N \times C}$. $X_0$ denotes the input query features to the Transformer decoder. $K_l, V_l \in \mathbb{R}^{H_l W_l \times C}$ are the image features under the transformations $f_K(\cdot)$ and $f_V(\cdot)$ respectively, and $H_l$ and $W_l$ are the spatial resolution of the image features that we introduce in Section 3.2.2. $f_Q$, $f_K$ and $f_V$ are linear transformations.
Our masked attention modulates the attention matrix via

$$X_l = \mathrm{softmax}(\mathcal{M}_{l-1} + Q_l K_l^{\mathsf{T}}) V_l + X_{l-1}. \qquad (2)$$

Moreover, the attention mask $\mathcal{M}_{l-1}$ at feature location $(x, y)$ is

$$\mathcal{M}_{l-1}(x, y) = \begin{cases} 0 & \text{if } M_{l-1}(x, y) = 1 \\ -\infty & \text{otherwise.} \end{cases} \qquad (3)$$

Here, $M_{l-1} \in \{0, 1\}^{N \times H_l W_l}$ is the binarized output (thresholded at 0.5) of the resized mask prediction of the previous, $(l-1)$-th, Transformer decoder layer. It is resized to the same resolution as $K_l$. $M_0$ is the binary mask prediction obtained from $X_0$, i.e., before feeding query features into the Transformer decoder.
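The following is a minimal single-head sketch of Eq. (2)–(3) in PyTorch-style code; the tensor layout and the empty-mask safeguard are our own assumptions and are not taken from the released implementation:

```python
import torch
import torch.nn.functional as F

def masked_attention(X_prev, K, V, mask_pred, f_Q):
    """Single-head sketch of masked attention (Eq. 2-3).

    X_prev:    (N, C)    query features from the previous decoder layer
    K, V:      (HW, C)   image features after the linear maps f_K / f_V
    mask_pred: (N, H, W) mask probabilities predicted by the previous layer,
               already resized to the resolution of K
    f_Q:       linear map producing queries from X_prev
    """
    N, HW = X_prev.shape[0], K.shape[0]

    # Eq. (3): 0 inside the binarized (>= 0.5) predicted foreground, -inf elsewhere.
    foreground = mask_pred.flatten(1) >= 0.5               # (N, HW)
    attn_mask = torch.zeros(N, HW, device=X_prev.device)
    attn_mask[~foreground] = float("-inf")
    # Safeguard (ours, not stated in the paper): if a predicted mask is empty,
    # fall back to full attention so the softmax stays well defined.
    attn_mask[~foreground.any(dim=-1)] = 0.0

    # Eq. (2): masked cross-attention with a residual path.
    # (A real implementation would also use multi-head attention and 1/sqrt(C) scaling.)
    attn = F.softmax(f_Q(X_prev) @ K.T + attn_mask, dim=-1)
    return attn @ V + X_prev
```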
3.2.2 High-resolution features

High-resolution features improve model performance, especially for small objects [5]. However, they are computationally demanding. Thus, we propose an efficient multi-scale strategy to introduce high-resolution features while controlling the increase in computation. Instead of always using the high-resolution feature map, we utilize a feature pyramid which consists of both low- and high-resolution features and feed one resolution of the multi-scale feature to one Transformer decoder layer at a time.

Specifically, we use the feature pyramid produced by the pixel decoder with resolutions 1/32, 1/16 and 1/8 of the original image. For each resolution, we add both a sinusoidal positional embedding $e_{\mathrm{pos}} \in \mathbb{R}^{H_l W_l \times C}$, following [5], and a learnable scale-level embedding $e_{\mathrm{lvl}} \in \mathbb{R}^{1 \times C}$, following [66]. We use these, from the lowest resolution to the highest resolution, for the corresponding Transformer decoder layers, as shown in Figure 2, left. We repeat this 3-layer Transformer decoder L times, so our final Transformer decoder has 3L layers. More specifically, the first three layers receive feature maps of resolution $H_1 = H/32$, $H_2 = H/16$, $H_3 = H/8$ and $W_1 = W/32$, $W_2 = W/16$, $W_3 = W/8$, where H and W are the original image resolution. This pattern is repeated in a round-robin fashion for all following layers.
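A minimal sketch of the round-robin feeding schedule described above, assuming a decoder-layer callable `layer(queries, features)` and a pyramid ordered from lowest to highest resolution; the names are illustrative only:

```python
def decode(queries, pyramid, layers):
    """Round-robin multi-scale decoding (Section 3.2.2).

    queries: (N, C) query features
    pyramid: list of image feature maps ordered [1/32, 1/16, 1/8]
    layers:  the 3L Transformer decoder layers
    """
    for i, layer in enumerate(layers):
        features = pyramid[i % len(pyramid)]   # cycle 1/32 -> 1/16 -> 1/8 -> 1/32 ...
        queries = layer(queries, features)     # masked attention, self-attention, FFN
    return queries
```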
3.2.3 Optimization improvements

A standard Transformer decoder layer [51] consists of three modules that process query features in the following order: a self-attention module, a cross-attention module and a feed-forward network (FFN). Moreover, query features ($X_0$) are zero initialized before being fed into the Transformer decoder and are associated with learnable positional embeddings. Furthermore, dropout is applied to both residual connections and attention maps.

To optimize the Transformer decoder design, we make the following three improvements. First, we switch the order of self- and cross-attention (our new "masked attention") to make computation more effective: query features to the first self-attention layer are image-independent and do not carry signals from the image, so applying self-attention to them is unlikely to enrich information. Second, we make query features ($X_0$) learnable as well (we still keep the learnable query positional embeddings), and learnable query features are directly supervised before being used in the Transformer decoder to predict masks ($M_0$). We find these learnable query features function like a region proposal network [43] and have the ability to generate mask proposals. Finally, we find dropout is not necessary and usually decreases performance. We thus completely remove dropout in our decoder.

3.3. Improving training efficiency

One limitation of training universal architectures is the large memory consumption due to high-resolution mask prediction, which makes them less accessible than the more memory-friendly specialized architectures [6, 24]. For example, MaskFormer [14] can only fit a single image on a GPU with 32G memory. Motivated by PointRend [30] and Implicit PointRend [13], which show that a segmentation model can be trained with its mask loss calculated on K randomly sampled points instead of the whole mask, we calculate the mask loss with sampled points in both the matching and the final loss calculation. More specifically, in the matching loss that constructs the cost matrix for bipartite matching, we uniformly sample the same set of K points for all prediction and ground truth masks. In the final loss between predictions and their matched ground truths, we sample different sets of K points for different pairs of prediction and ground truth using importance sampling [30]. We set K = 12544, i.e., 112 × 112 points. This new training strategy effectively reduces training memory by 3×, from 18GB to 6GB per image, making Mask2Former more accessible to users with limited computational resources.
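As a rough illustration of the point-sampled mask loss for the matching stage, the sketch below evaluates a binary cross-entropy cost on K uniformly sampled points instead of the full masks. The grid_sample-based sampling and the function name are our assumptions rather than the paper's released code, and the importance-sampling variant used for the final loss is omitted:

```python
import torch
import torch.nn.functional as F

def point_sampled_matching_cost(pred_logits, gt_masks, num_points=12544):
    """Uniform point sampling for the bipartite-matching cost (Section 3.3).

    pred_logits: (N, H, W) predicted mask logits
    gt_masks:    (M, H, W) ground-truth binary masks
    Returns an (N, M) cost matrix of BCE losses evaluated on the same K points.
    """
    # Sample the same K random point coordinates for every mask
    # (grid_sample expects coordinates in [-1, 1]).
    coords = torch.rand(1, num_points, 1, 2, device=pred_logits.device) * 2 - 1

    def sample(masks):
        # (B, 1, H, W) sampled at K points -> (B, K)
        return F.grid_sample(masks[:, None],
                             coords.expand(len(masks), -1, -1, -1),
                             align_corners=False).squeeze(1).squeeze(-1)

    pred_pts = sample(pred_logits)         # (N, K)
    gt_pts = sample(gt_masks.float())      # (M, K)

    # Pairwise BCE between every prediction and every ground truth,
    # averaged over the sampled points.
    cost = F.binary_cross_entropy_with_logits(
        pred_pts[:, None].expand(-1, len(gt_pts), -1),
        gt_pts[None].expand(len(pred_pts), -1, -1),
        reduction="none").mean(-1)         # (N, M)
    return cost
```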
| method | backbone | query type | epochs | PQ | PQ^Th | PQ^St | AP^Th_pan | mIoU_pan | #params. | FLOPs | fps |
| DETR [5] | R50 | 100 queries | 500+25 | 43.4 | 48.2 | 36.3 | 31.1 | - | - | - | - |
| MaskFormer [14] | R50 | 100 queries | 300 | 46.5 | 51.0 | 39.8 | 33.0 | 57.8 | 45M | 181G | 17.6 |
| Mask2Former (ours) | R50 | 100 queries | 50 | 51.9 | 57.7 | 43.0 | 41.7 | 61.7 | 44M | 226G | 8.6 |
| DETR [5] | R101 | 100 queries | 500+25 | 45.1 | 50.5 | 37.0 | 33.0 | - | - | - | - |
| MaskFormer [14] | R101 | 100 queries | 300 | 47.6 | 52.5 | 40.3 | 34.1 | 59.3 | 64M | 248G | 14.0 |
| Mask2Former (ours) | R101 | 100 queries | 50 | 52.6 | 58.5 | 43.7 | 42.6 | 62.4 | 63M | 293G | 7.2 |
| Max-DeepLab [52] | Max-L | 128 queries | 216 | 51.1 | 57.0 | 42.2 | - | - | 451M | 3692G | - |
| MaskFormer [14] | Swin-L† | 100 queries | 300 | 52.7 | 58.5 | 44.0 | 40.1 | 64.8 | 212M | 792G | 5.2 |
| K-Net [62] | Swin-L† | 100 queries | 36 | 54.6 | 60.2 | 46.0 | - | - | - | - | - |
| Mask2Former (ours) | Swin-L† | 200 queries | 100 | 57.8 | 64.2 | 48.1 | 48.6 | 67.4 | 216M | 868G | 4.0 |

Table 1. Panoptic segmentation on COCO panoptic val2017 with 133 categories. Mask2Former consistently outperforms MaskFormer [14] by a large margin with different backbones on all metrics. Our best model outperforms the prior state-of-the-art MaskFormer by 5.1 PQ and K-Net [62] by 3.2 PQ. Backbones pre-trained on ImageNet-22K are marked with †.

4. Experiments

We demonstrate that Mask2Former is an effective architecture for universal image segmentation through comparisons with specialized state-of-the-art architectures on standard benchmarks. We evaluate our proposed design decisions through ablations on all three tasks. Finally, we show Mask2Former generalizes beyond the standard benchmarks, obtaining state-of-the-art results on four datasets.

Datasets. We study Mask2Former using four widely used image segmentation datasets that support semantic, instance and panoptic segmentation: COCO [35] (80 "things" and 53 "stuff" categories), ADE20K [65] (100 "things" and 50 "stuff" categories), Cityscapes [16] (8 "things" and 11 "stuff" categories) and Mapillary Vistas [42] (37 "things" and 28 "stuff" categories). Panoptic and semantic segmentation tasks are evaluated on the union of "things" and "stuff" categories, while instance segmentation is only evaluated on the "things" categories.

Evaluation metrics. For panoptic segmentation, we use the standard PQ (panoptic quality) metric [28]. We further report AP^Th_pan, which is the AP evaluated on the "thing" categories using instance segmentation annotations, and mIoU_pan, which is the mIoU for semantic segmentation obtained by merging instance masks from the same category, of the same model trained only with panoptic segmentation annotations. For instance segmentation, we use the standard AP (average precision) metric [35]. For semantic segmentation, we use mIoU (mean Intersection-over-Union) [19].

4.1. Implementation details

We adopt settings from [14] with the following differences:

Pixel decoder. Mask2Former is compatible with any existing pixel decoder module. In MaskFormer [14], FPN [33] is chosen as the default for its simplicity. Since our goal is to demonstrate strong performance across different segmentation tasks, we use the more advanced multi-scale deformable attention Transformer (MSDeformAttn) [66] as our default pixel decoder. Specifically, we use 6 MSDeformAttn layers applied to feature maps with resolution 1/8, 1/16 and 1/32, and use a simple upsampling layer with a lateral connection on the final 1/8 feature map to generate the feature map of resolution 1/4 as the per-pixel embedding. In our ablation study, we show that this pixel decoder provides the best results across different segmentation tasks.

Transformer decoder. We use our Transformer decoder proposed in Section 3.2 with L = 3 (i.e., 9 layers total) and 100 queries by default. An auxiliary loss is added to every intermediate Transformer decoder layer and to the learnable query features before the Transformer decoder.

Loss weights. We use the binary cross-entropy loss (instead of the focal loss [34] used in [14]) and the dice loss [41] for our mask loss: $\mathcal{L}_{\text{mask}} = \lambda_{\text{ce}} \mathcal{L}_{\text{ce}} + \lambda_{\text{dice}} \mathcal{L}_{\text{dice}}$. We set $\lambda_{\text{ce}} = 5.0$ and $\lambda_{\text{dice}} = 5.0$. The final loss is a combination of mask loss and classification loss, $\mathcal{L}_{\text{mask}} + \lambda_{\text{cls}} \mathcal{L}_{\text{cls}}$, and we set $\lambda_{\text{cls}} = 2.0$ for predictions matched with a ground truth and 0.1 for the "no object," i.e., predictions that have not been matched with any ground truth.

Post-processing. We use the exact same post-processing as [14] to acquire the expected output format for panoptic and semantic segmentation from pairs of binary masks and class predictions. Instance segmentation requires additional confidence scores for each prediction. We multiply the class confidence and the mask confidence (i.e., the averaged foreground per-pixel binary mask probability) for a final confidence.
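A small sketch of the final instance-confidence computation described above; the function name and the 0.5 binarization threshold are our assumptions:

```python
import torch

def instance_confidence(class_prob, mask_logits):
    """Final per-query confidence = class confidence * mask confidence,
    where mask confidence is the average foreground probability inside
    the binarized predicted mask."""
    mask_prob = mask_logits.sigmoid()                       # (N, H, W)
    fg = (mask_prob > 0.5).float()
    mask_conf = (mask_prob * fg).flatten(1).sum(1) / (fg.flatten(1).sum(1) + 1e-6)
    return class_prob * mask_conf                           # (N,)
```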
4.2. Training settings

Panoptic and instance segmentation. We use Detectron2 [57] and follow the updated Mask R-CNN [24] baseline settings¹ for the COCO dataset. More specifically, we use the AdamW [38] optimizer and the step learning rate schedule. We use an initial learning rate of 0.0001 and a weight decay of 0.05 for all backbones. A learning rate multiplier of 0.1 is applied to the backbone, and we decay the learning rate at 0.9 and 0.95 fractions of the total number of training steps by a factor of 10. If not stated otherwise, we train our models for 50 epochs with a batch size of 16. For data augmentation, we use the large-scale jittering (LSJ) augmentation [18, 23] with a random scale sampled from the range 0.1 to 2.0, followed by a fixed-size crop to 1024×1024. We use the standard Mask R-CNN inference setting, where we resize an image with shorter side to 800 and longer side up to 1333. We also report FLOPs and fps. FLOPs are averaged over 100 validation images (COCO images have varying sizes). Frames-per-second (fps) is measured on a V100 GPU with a batch size of 1 by taking the average runtime on the entire validation set, including post-processing time.

¹ https://github.com/facebookresearch/detectron2/blob/main/MODEL_ZOO.md#new-baselines-using-large-scale-jitter-and-longer-training-schedule
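The optimizer and learning-rate schedule described above map directly to standard PyTorch components. The sketch below is our simplified rendering (the parameter-group construction and the "backbone" name prefix are assumptions, not the Detectron2 configuration itself):

```python
import torch

def build_optimizer_and_scheduler(model, max_iters):
    """Sketch of the COCO training schedule: AdamW + step LR decay."""
    backbone_params, other_params = [], []
    for name, p in model.named_parameters():
        (backbone_params if name.startswith("backbone") else other_params).append(p)

    optimizer = torch.optim.AdamW(
        [{"params": other_params, "lr": 1e-4},
         {"params": backbone_params, "lr": 1e-4 * 0.1}],   # 0.1x multiplier on the backbone
        weight_decay=0.05)

    # Decay the learning rate by 10x at 90% and 95% of training.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[int(0.9 * max_iters), int(0.95 * max_iters)], gamma=0.1)
    return optimizer, scheduler
```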
Semantic segmentation. We follow the same settings as [14] to train our models, except: 1) a learning rate multiplier of 0.1 is applied to both CNN and Transformer backbones, instead of only to CNN backbones as in [14], and 2) both ResNet and Swin backbones use an initial learning rate of 0.0001 and a weight decay of 0.05, instead of using different learning rates as in [14].

| method | backbone | query type | epochs | AP | AP_S | AP_M | AP_L | AP_boundary | #params. | FLOPs | fps |
| MaskFormer [14] | R50 | 100 queries | 300 | 34.0 | 16.4 | 37.8 | 54.2 | 23.0 | 45M | 181G | 19.2 |
| Mask R-CNN [24] | R50 | dense anchors | 36 | 37.2 | 18.6 | 39.5 | 53.3 | 23.1 | 44M | 201G | 15.2 |
| Mask R-CNN [18, 23, 24] | R50 | dense anchors | 400 | 42.5 | 23.8 | 45.0 | 60.0 | 28.0 | 46M | 358G | 10.3 |
| Mask2Former (ours) | R50 | 100 queries | 50 | 43.7 | 23.4 | 47.2 | 64.8 | 30.6 | 44M | 226G | 9.7 |
| Mask R-CNN [24] | R101 | dense anchors | 36 | 38.6 | 19.5 | 41.3 | 55.3 | 24.5 | 63M | 266G | 10.8 |
| Mask R-CNN [18, 23, 24] | R101 | dense anchors | 400 | 43.7 | 24.6 | 46.4 | 61.8 | 29.1 | 65M | 423G | 8.6 |
| Mask2Former (ours) | R101 | 100 queries | 50 | 44.2 | 23.8 | 47.7 | 66.7 | 31.1 | 63M | 293G | 7.8 |
| QueryInst [20] | Swin-L† | 300 queries | 50 | 48.9 | 30.8 | 52.6 | 68.3 | 33.5 | - | - | 3.3 |
| Swin-HTC++ [6, 36] | Swin-L† | dense anchors | 72 | 49.5 | 31.0 | 52.4 | 67.2 | 34.1 | 284M | 1470G | - |
| Mask2Former (ours) | Swin-L† | 200 queries | 100 | 50.1 | 29.9 | 53.9 | 72.1 | 36.2 | 216M | 868G | 4.0 |

Table 2. Instance segmentation on COCO val2017 with 80 categories. Mask2Former outperforms strong Mask R-CNN [24] baselines on both the AP and AP_boundary [12] metrics while training with 8× fewer epochs. Our best model is also competitive with the state-of-the-art specialized instance segmentation model on COCO and has higher boundary quality. For a fair comparison, we only consider single-scale inference and models trained using only COCO train2017 data. Backbones pre-trained on ImageNet-22K are marked with †.

4.3. Main results

Panoptic segmentation. We compare Mask2Former with state-of-the-art models for panoptic segmentation on the COCO panoptic [28] dataset in Table 1. Mask2Former consistently outperforms MaskFormer by more than 5 PQ across different backbones while converging 6× faster. With the Swin-L backbone, our Mask2Former sets a new state-of-the-art of 57.8 PQ, outperforming the existing state-of-the-art [14] by 5.1 PQ and the concurrent work K-Net [62] by 3.2 PQ. Mask2Former even outperforms the best ensemble models with extra training data in the COCO challenge (see Appendix A.1 for test set results).

Beyond the PQ metric, our Mask2Former also achieves higher performance on two other metrics compared to DETR [5] and MaskFormer: AP^Th_pan, which is the AP evaluated on the 80 "thing" categories using instance segmentation annotations, and mIoU_pan, which is the mIoU evaluated on the 133 categories for semantic segmentation converted from panoptic segmentation annotations. This shows Mask2Former's universality: trained only with panoptic segmentation annotations, it can be used for instance and semantic segmentation.

Instance segmentation. We compare Mask2Former with state-of-the-art models on the COCO [35] dataset in Table 2. With a ResNet [25] backbone, Mask2Former outperforms a strong Mask R-CNN [24] baseline that uses large-scale jittering (LSJ) augmentation [18, 23], while requiring 8× fewer training iterations. With the Swin-L backbone, Mask2Former outperforms the state-of-the-art HTC++ [6]. Although we only observe a +0.6 AP improvement over HTC++, the Boundary AP [12] improves by 2.1, suggesting that our predictions have better boundary quality thanks to the high-resolution mask predictions. Note that for a fair comparison, we only consider single-scale inference and models trained with only COCO train2017 data.

With a ResNet-50 backbone, Mask2Former improves over MaskFormer on small objects by 7.0 AP_S, while overall the highest gains come from large objects (+10.6 AP_L). The performance on AP_S still lags behind other state-of-the-art models. Hence, there remains room for improvement on small objects, e.g., by using dilated backbones as in DETR [5], which we leave for future work.

| method | backbone | crop size | mIoU (s.s.) | mIoU (m.s.) |
| MaskFormer [14] | R50 | 512 | 44.5 | 46.7 |
| Mask2Former (ours) | R50 | 512 | 47.2 | 49.2 |
| Swin-UperNet [36, 58] | Swin-T | 512 | - | 46.1 |
| MaskFormer [14] | Swin-T | 512 | 46.7 | 48.8 |
| Mask2Former (ours) | Swin-T | 512 | 47.7 | 49.6 |
| MaskFormer [14] | Swin-L† | 640 | 54.1 | 55.6 |
| FaPN-MaskFormer [14, 39] | Swin-L-FaPN† | 640 | 55.2 | 56.7 |
| BEiT-UperNet [2, 58] | BEiT-L† | 640 | - | 57.0 |
| Mask2Former (ours) | Swin-L† | 640 | 56.1 | 57.3 |
| Mask2Former (ours) | Swin-L-FaPN† | 640 | 56.4 | 57.7 |

Table 3. Semantic segmentation on ADE20K val with 150 categories. Mask2Former consistently outperforms MaskFormer [14] by a large margin with different backbones (all Mask2Former models use MSDeformAttn [66] as pixel decoder, except Swin-L-FaPN which uses FaPN [39]). Our best model outperforms the best specialized model, BEiT [2]. We report both single-scale (s.s.) and multi-scale (m.s.) inference results. Backbones pre-trained on ImageNet-22K are marked with †.

Semantic segmentation. We compare Mask2Former with state-of-the-art models for semantic segmentation on the ADE20K [65] dataset in Table 3. Mask2Former outperforms MaskFormer [14] across different backbones, suggesting that the proposed improvements boost semantic segmentation results even where [14] was already state-of-the-art. With Swin-L as backbone and FaPN [39] as pixel decoder, Mask2Former sets a new state-of-the-art of 57.7 mIoU. We also report test set results in Appendix A.3.

4.4. Ablation studies

We now analyze Mask2Former through a series of ablation studies using a ResNet-50 backbone [25]. To test the generality of the proposed components for universal image segmentation, all ablations are performed on three tasks.
| ablation | AP | PQ | mIoU | FLOPs |
| Mask2Former (ours) | 43.7 | 51.9 | 47.2 | 226G |
| − masked attention | 37.8 (-5.9) | 47.1 (-4.8) | 45.5 (-1.7) | 213G |
| − high-resolution features | 41.5 (-2.2) | 50.2 (-1.7) | 46.1 (-1.1) | 218G |

(a) Masked attention and high-resolution features (from the efficient multi-scale strategy) lead to the most gains. More detailed ablations are in Table 4c and Table 4d. We remove one component at a time.

| ablation | AP | PQ | mIoU | FLOPs |
| Mask2Former (ours) | 43.7 | 51.9 | 47.2 | 226G |
| − learnable query features | 42.9 (-0.8) | 51.2 (-0.7) | 45.4 (-1.8) | 226G |
| − cross-attention first | 43.2 (-0.5) | 51.6 (-0.3) | 46.3 (-0.9) | 226G |
| − remove dropout | 43.0 (-0.7) | 51.3 (-0.6) | 47.2 (-0.0) | 226G |
| − all 3 components above | 42.3 (-1.4) | 50.8 (-1.1) | 46.3 (-0.9) | 226G |

(b) Optimization improvements increase the performance without introducing extra compute. Following DETR [5], query features are zero-initialized when not learnable. We remove one component at a time.

Table 4. Mask2Former ablations. We perform ablations on three tasks: instance (AP on COCO val2017), panoptic (PQ on COCO panoptic val2017) and semantic (mIoU on ADE20K val) segmentation. FLOPs are measured on COCO instance segmentation.
| training annotations | COCO (PQ / AP / mIoU) | ADE20K (PQ / AP / mIoU) | Cityscapes (PQ / AP / mIoU) |
| panoptic | 51.9 / 41.7 / 61.7 | 39.7 / 26.5 / 46.1 | 62.1 / 37.3 / 77.5 |
| instance | - / 43.7 / - | - / 26.4 / - | - / 37.4 / - |
| semantic | - / - / 61.5 | - / - / 47.2 | - / - / 79.4 |

Table 7. Limitations of Mask2Former on (a) COCO, (b) ADE20K and (c) Cityscapes. Although a single Mask2Former can address any segmentation task, we still need to train it on different tasks. Across three datasets we find Mask2Former trained with panoptic annotations performs slightly worse than the exact same model trained specifically for instance and semantic segmentation with the corresponding data.
References

[1] Pablo Arbeláez, Jordi Pont-Tuset, Jonathan T Barron, Ferran Marques, and Jitendra Malik. Multiscale combinatorial grouping. In CVPR, 2014.
[2] Hangbo Bao, Li Dong, and Furu Wei. BEiT: BERT pre-training of image transformers. arXiv, 2021.
[3] Daniel Bolya, Chong Zhou, Fanyi Xiao, and Yong Jae Lee. YOLACT++: Better real-time instance segmentation, 2019.
[4] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In CVPR, 2018.
[5] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
[6] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In CVPR, 2019.
[7] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. PAMI, 2018.
[8] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587, 2017.
[9] Liang-Chieh Chen, Huiyu Wang, and Siyuan Qiao. Scaling wide residual networks for panoptic segmentation. arXiv:2011.11675, 2020.
[10] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
[11] Bowen Cheng, Maxwell D Collins, Yukun Zhu, Ting Liu, Thomas S Huang, Hartwig Adam, and Liang-Chieh Chen. Panoptic-DeepLab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In CVPR, 2020.
[12] Bowen Cheng, Ross Girshick, Piotr Dollár, Alexander C Berg, and Alexander Kirillov. Boundary IoU: Improving object-centric image segmentation evaluation. In CVPR, 2021.
[13] Bowen Cheng, Omkar Parkhi, and Alexander Kirillov. Pointly-supervised instance segmentation. arXiv, 2021.
[14] Bowen Cheng, Alexander G. Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. In NeurIPS, 2021.
[15] François Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
[16] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[17] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[18] Xianzhi Du, Barret Zoph, Wei-Chih Hung, and Tsung-Yi Lin. Simple training strategies and model scaling for object detection. arXiv:2107.00057, 2021.
[19] Mark Everingham, SM Ali Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes challenge: A retrospective. IJCV, 2015.
[20] Yuxin Fang, Shusheng Yang, Xinggang Wang, Yu Li, Chen Fang, Ying Shan, Bin Feng, and Wenyu Liu. Instances as queries. In ICCV, 2021.
[21] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In CVPR, 2019.
[22] Peng Gao, Minghang Zheng, Xiaogang Wang, Jifeng Dai, and Hongsheng Li. Fast convergence of DETR with spatially modulated co-attention. In ICCV, 2021.
[23] Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D Cubuk, Quoc V Le, and Barret Zoph. Simple copy-paste is a strong data augmentation method for instance segmentation. In CVPR, 2021.
[24] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
[25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[26] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. CCNet: Criss-cross attention for semantic segmentation. In ICCV, 2019.
[27] Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollár. Panoptic feature pyramid networks. In CVPR, 2019.
[28] Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollár. Panoptic segmentation. In CVPR, 2019.
[29] Alexander Kirillov, Evgeny Levinkov, Bjoern Andres, Bogdan Savchynskyy, and Carsten Rother. InstanceCut: from edges to instances with multicut. In CVPR, 2017.
[30] Alexander Kirillov, Yuxin Wu, Kaiming He, and Ross Girshick. PointRend: Image segmentation as rendering. In CVPR, 2020.
[31] Yanwei Li, Hengshuang Zhao, Xiaojuan Qi, Yukang Chen, Lu Qi, Liwei Wang, Zeming Li, Jian Sun, and Jiaya Jia. Fully convolutional networks for panoptic segmentation with point-based supervision. arXiv:2108.07682, 2021.
[32] Zhiqi Li, Wenhai Wang, Enze Xie, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, Tong Lu, and Ping Luo. Panoptic SegFormer. arXiv:2109.03814, 2021.
[33] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[34] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, 2017.
[35] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[36] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv:2103.14030, 2021.
[37] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[38] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
[39] Shihua Huang, Zhichao Lu, Ran Cheng, and Cheng He. FaPN: Feature-aligned pyramid network for dense image prediction. arXiv, 2021.
[40] Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, and Jingdong Wang. Conditional DETR for fast training convergence. In ICCV, 2021.
[41] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In 3DV, 2016.
[42] Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulò, and Peter Kontschieder. The Mapillary Vistas dataset for semantic understanding of street scenes. In CVPR, 2017.
[43] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
[44] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
[45] Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation. In ICCV, 2021.
[46] Zhiqing Sun, Shengcao Cao, Yiming Yang, and Kris M Kitani. Rethinking transformer-based set prediction for object detection. In ICCV, 2021.
[47] Mingxing Tan, Ruoming Pang, and Quoc V Le. EfficientDet: Scalable and efficient object detection. In CVPR, 2020.
[48] Andrew Tao, Karan Sapra, and Bryan Catanzaro. Hierarchical multi-scale attention for semantic segmentation. arXiv:2005.10821, 2020.
[49] Zhi Tian, Chunhua Shen, and Hao Chen. Conditional convolutions for instance segmentation. In ECCV, 2020.
[50] Jasper RR Uijlings, Koen EA Van De Sande, Theo Gevers, and Arnold WM Smeulders. Selective search for object recognition. IJCV, 2013.
[51] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[52] Huiyu Wang, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. MaX-DeepLab: End-to-end panoptic segmentation with mask transformers. In CVPR, 2021.
[53] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, and Bin Xiao. Deep high-resolution representation learning for visual recognition. PAMI, 2019.
[54] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. PVTv2: Improved baselines with pyramid vision transformer. arXiv:2106.13797, 2021.
[55] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.
[56] Xinlong Wang, Rufeng Zhang, Tao Kong, Lei Li, and Chunhua Shen. SOLOv2: Dynamic and fast instance segmentation. In NeurIPS, 2020.
[57] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
[58] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In ECCV, 2018.
[59] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. SegFormer: Simple and efficient design for semantic segmentation with transformers. In NeurIPS, 2021.
[60] Yuwen Xiong, Renjie Liao, Hengshuang Zhao, Rui Hu, Min Bai, Ersin Yumer, and Raquel Urtasun. UPSNet: A unified panoptic segmentation network. In CVPR, 2019.
[61] Yuhui Yuan, Lang Huang, Jianyuan Guo, Chao Zhang, Xilin Chen, and Jingdong Wang. OCNet: Object context for semantic segmentation. IJCV, 2021.
[62] Wenwei Zhang, Jiangmiao Pang, Kai Chen, and Chen Change Loy. K-Net: Towards unified image segmentation. In NeurIPS, 2021.
[63] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, 2017.
[64] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In CVPR, 2021.
[65] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In CVPR, 2017.
[66] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. In ICLR, 2021.
Appendix

We first provide more results for Mask2Former with different backbones as well as test-set performance on standard benchmarks (Appendix A): we use COCO panoptic [28] for panoptic, COCO [35] for instance, and ADE20K [65] for semantic segmentation. Then, we provide more detailed results on additional datasets (Appendix B). Finally, we provide additional ablation studies (Appendix C) and visualizations of Mask2Former predictions for all three segmentation tasks (Appendix D).

A. Additional results

Here, we provide more results of Mask2Former with different backbones on COCO panoptic [28] for panoptic segmentation, COCO [35] for instance segmentation and ADE20K [65] for semantic segmentation. More specifically, for each benchmark, we evaluate Mask2Former with ResNet [25] with 50 and 101 layers, as well as the Swin [36] Tiny, Small, Base and Large variants as backbones. We use ImageNet [44] pre-trained checkpoints to initialize the backbones.

A.1. Panoptic segmentation

In Table I, we report Mask2Former with various backbones on COCO panoptic val2017. Mask2Former outperforms all existing panoptic segmentation models with various backbones. Our best model sets a new state-of-the-art of 57.8 PQ.

In Table II, we further report the best Mask2Former model on the test-dev set. Note that Mask2Former, trained only with the standard train2017 data, achieves the absolute new state-of-the-art performance on both the validation and test set. Mask2Former even outperforms the best COCO competition entry, which uses extra training data and test-time augmentation.

A.2. Instance segmentation

In Table III, we report Mask2Former results obtained with various backbones on COCO val2017. Mask2Former outperforms the best single-scale model, HTC++ [6, 36]. Note that it is non-trivial to do multi-scale inference for instance-level segmentation tasks without introducing complex post-processing like non-maximum suppression. Thus, we only compare Mask2Former with other single-scale inference models. We believe multi-scale inference can further improve Mask2Former's performance, and it remains interesting future work.

In Table IV, we further report the best Mask2Former model on the test-dev set. Mask2Former achieves the absolute new state-of-the-art performance on both the validation and test set. On the one hand, Mask2Former is extremely good at segmenting large objects: we even outperform the challenge winner (which uses extra training data, model ensembles, etc.) on AP_L by a large margin without any bells-and-whistles. On the other hand, the poor performance on small objects leaves room for further improvement in the future.

A.3. Semantic segmentation

In Table V, we report Mask2Former results obtained with various backbones on ADE20K val. Mask2Former outperforms all existing semantic segmentation models with various backbones. Our best model sets a new state-of-the-art of 57.7 mIoU.

In Table VI, we further report the best Mask2Former model on the test set. Following [14], we train Mask2Former on the union of the ADE20K train and val sets with an ImageNet-22K pre-trained checkpoint and use multi-scale inference. Mask2Former is able to outperform previous state-of-the-art methods on all metrics.

B. Additional datasets

We study Mask2Former on three image segmentation tasks (panoptic, instance and semantic segmentation) using four datasets. Here we report additional results on Cityscapes [16], ADE20K [65] and Mapillary Vistas [42], as well as more detailed training settings.

B.1. Cityscapes

Cityscapes is an urban egocentric street-view dataset with high-resolution images (1024 × 2048 pixels). It contains 2975 images for training, 500 images for validation and 1525 images for testing, with a total of 19 classes.

Training settings. For all three segmentation tasks, we use a crop size of 512 × 1024, a batch size of 16 and train all models for 90k iterations. During inference, we operate on the whole image (1024 × 2048). Other implementation details largely follow Section 4.1 (panoptic and instance segmentation follow the semantic segmentation training settings), except that we use 200 queries for panoptic and instance segmentation models with the Swin-L backbone. All other backbones and semantic segmentation models use 100 queries.

Results. In Table VII, we report Mask2Former results obtained with various backbones on Cityscapes for the three segmentation tasks and compare it with other state-of-the-art methods that do not use extra data. For panoptic segmentation, Mask2Former with Swin-L backbone outperforms the state-of-the-art Panoptic-DeepLab [11] with SWideRNet [9] using single-scale inference. For semantic segmentation, Mask2Former with Swin-B backbone outperforms the state-of-the-art SegFormer [59].
| method | backbone | search space | epochs | PQ | PQ^Th | PQ^St | AP^Th_pan | mIoU_pan | #params. | FLOPs |
| CNN backbones | | | | | | | | | | |
| DETR [5] | R50 | 100 queries | 500+25 | 43.4 | 48.2 | 36.3 | 31.1 | - | - | - |
| DETR [5] | R101 | 100 queries | 500+25 | 45.1 | 50.5 | 37.0 | 33.0 | - | - | - |
| K-Net [62] | R50 | 100 queries | 36 | 47.1 | 51.7 | 40.3 | - | - | - | - |
| Panoptic SegFormer [32] | R50 | 400 queries | 50 | 50.0 | 56.1 | 40.8 | - | - | 47M | 246G |
| MaskFormer [14] | R50 | 100 queries | 300 | 46.5 | 51.0 | 39.8 | 33.0 | 57.8 | 45M | 181G |
| MaskFormer [14] | R101 | 100 queries | 300 | 47.6 | 52.5 | 40.3 | 34.1 | 59.3 | 64M | 248G |
| Mask2Former (ours) | R50 | 100 queries | 50 | 51.9 | 57.7 | 43.0 | 41.7 | 61.7 | 44M | 226G |
| Mask2Former (ours) | R101 | 100 queries | 50 | 52.6 | 58.5 | 43.7 | 42.6 | 62.4 | 63M | 293G |
| Transformer backbones | | | | | | | | | | |
| Max-DeepLab [52] | Max-S | 128 queries | 216 | 48.4 | 53.0 | 41.5 | - | - | 62M | 324G |
| Max-DeepLab [52] | Max-L | 128 queries | 216 | 51.1 | 57.0 | 42.2 | - | - | 451M | 3692G |
| Panoptic SegFormer [32] | PVTv2-B5 [54] | 400 queries | 50 | 54.1 | 60.4 | 44.6 | - | - | 101M | 391G |
| K-Net [62] | Swin-L† | 100 queries | 36 | 54.6 | 60.2 | 46.0 | - | - | - | - |
| MaskFormer [14] | Swin-T | 100 queries | 300 | 47.7 | 51.7 | 41.7 | 33.6 | 60.4 | 42M | 179G |
| MaskFormer [14] | Swin-S | 100 queries | 300 | 49.7 | 54.4 | 42.6 | 36.1 | 61.3 | 63M | 259G |
| MaskFormer [14] | Swin-B | 100 queries | 300 | 51.1 | 56.3 | 43.2 | 37.8 | 62.6 | 102M | 411G |
| MaskFormer [14] | Swin-B† | 100 queries | 300 | 51.8 | 56.9 | 44.1 | 38.5 | 63.6 | 102M | 411G |
| MaskFormer [14] | Swin-L† | 100 queries | 300 | 52.7 | 58.5 | 44.0 | 40.1 | 64.8 | 212M | 792G |
| Mask2Former (ours) | Swin-T | 100 queries | 50 | 53.2 | 59.3 | 44.0 | 43.3 | 63.2 | 47M | 232G |
| Mask2Former (ours) | Swin-S | 100 queries | 50 | 54.6 | 60.6 | 45.7 | 44.7 | 64.2 | 69M | 313G |
| Mask2Former (ours) | Swin-B | 100 queries | 50 | 55.1 | 61.0 | 46.1 | 45.2 | 65.1 | 107M | 466G |
| Mask2Former (ours) | Swin-B† | 100 queries | 50 | 56.4 | 62.4 | 47.3 | 46.3 | 67.1 | 107M | 466G |
| Mask2Former (ours) | Swin-L† | 200 queries | 100 | 57.8 | 64.2 | 48.1 | 48.6 | 67.4 | 216M | 868G |

Table I. Panoptic segmentation on COCO panoptic val2017 with 133 categories. Mask2Former outperforms all existing panoptic segmentation models by a large margin with different backbones on all metrics. Our best model sets a new state-of-the-art of 57.8 PQ. Besides PQ for panoptic segmentation, we also report AP^Th_pan (the AP evaluated on the 80 "thing" categories using instance segmentation annotation) and mIoU_pan (the mIoU evaluated on the 133 categories for semantic segmentation converted from panoptic segmentation annotation) of the same model trained for panoptic segmentation (note: we train all our models with panoptic segmentation annotation only). Backbones pre-trained on ImageNet-22K are marked with †.

Table II. Panoptic segmentation on COCO panoptic test-dev with 133 categories. Mask2Former, without any bells-and-whistles, outperforms the challenge winner (which uses extra training data, model ensembles, etc.) on the test-dev set. We only train our model on the COCO train2017 set with an ImageNet-22K pre-trained checkpoint.
| method | backbone | search space | epochs | AP | AP_S | AP_M | AP_L | AP_boundary | #params. | FLOPs |
| CNN backbones | | | | | | | | | | |
| Mask R-CNN [24] | R50 | dense anchors | 36 | 37.2 | 18.6 | 39.5 | 53.3 | 23.1 | 44M | 201G |
| Mask R-CNN [24] | R50 | dense anchors | 400 | 42.5 | 23.8 | 45.0 | 60.0 | 28.0 | 46M | 358G |
| Mask R-CNN [24] | R101 | dense anchors | 36 | 38.6 | 19.5 | 41.3 | 55.3 | 24.5 | 63M | 266G |
| Mask R-CNN [24] | R101 | dense anchors | 400 | 43.7 | 24.6 | 46.4 | 61.8 | 29.1 | 65M | 423G |
| Mask2Former (ours) | R50 | 100 queries | 50 | 43.7 | 23.4 | 47.2 | 64.8 | 30.6 | 44M | 226G |
| Mask2Former (ours) | R101 | 100 queries | 50 | 44.2 | 23.8 | 47.7 | 66.7 | 31.1 | 63M | 293G |
| Transformer backbones | | | | | | | | | | |
| QueryInst [20] | Swin-L† | 300 queries | 50 | 48.9 | 30.8 | 52.6 | 68.3 | 33.5 | - | - |

Table III. Instance segmentation on COCO val2017 with 80 categories. Mask2Former outperforms strong Mask R-CNN [24] baselines with 8× fewer training epochs on both the AP and AP_boundary [12] metrics. Our best model is also competitive with the state-of-the-art specialized instance segmentation model on COCO and has higher boundary quality. For a fair comparison, we only consider single-scale inference and models trained using only COCO train2017 data. Backbones pre-trained on ImageNet-22K are marked with †.

Table IV. Instance segmentation on COCO test-dev with 80 categories. Mask2Former is extremely good at segmenting large objects: we even outperform the challenge winner (which uses extra training data, model ensembles, etc.) on AP_L by a large margin without any bells-and-whistles. We only train our model on the COCO train2017 set with an ImageNet-22K pre-trained checkpoint.
C. Additional ablation studies

We perform additional ablation studies of Mask2Former using the same settings that we used in the main paper: a single ResNet-50 backbone [25].

C.1. Convergence analysis

We train Mask2Former for 12, 25, 50 and 100 epochs with either standard scale augmentation (Standard Aug.) [57] or the more recent large-scale jittering augmentation (LSJ Aug.) [18, 23]. As shown in Figure IV, Mask2Former converges in 25 epochs using standard augmentation and almost converges in 50 epochs using large-scale jittering augmentation. This shows that Mask2Former with our proposed Transformer decoder converges faster than models using the standard Transformer decoder: e.g., DETR [5] and MaskFormer [14] require 500 epochs and 300 epochs respectively.

Figure I. Masked attention analysis. (Panel (b): cumulative attention weights on foreground (fg) and background (bg) regions for different resolutions.)

C.2. Masked attention analysis

We quantitatively and qualitatively analyzed the COCO panoptic model with the R50 backbone. First, we visual-
| method | backbone | crop size | mIoU (s.s.) | mIoU (m.s.) | #params. | FLOPs |
| CNN backbones | | | | | | |
| MaskFormer [14] | R50 | 512 × 512 | 44.5 | 46.7 | 41M | 53G |
| MaskFormer [14] | R101 | 512 × 512 | 45.5 | 47.2 | 60M | 73G |
| Mask2Former (ours) | R50 | 512 × 512 | 47.2 | 49.2 | 44M | 71G |
| Mask2Former (ours) | R101 | 512 × 512 | 47.8 | 50.1 | 63M | 90G |
| Transformer backbones | | | | | | |
| Swin-UperNet [36, 58] | Swin-L† | 640 × 640 | - | 53.5 | 234M | 647G |
| FaPN-MaskFormer [14, 39] | Swin-L† | 640 × 640 | 55.2 | 56.7 | - | - |
| BEiT-UperNet [2, 58] | BEiT-L† | 640 × 640 | - | 57.0 | 502M | - |
| MaskFormer [14] | Swin-T | 512 × 512 | 46.7 | 48.8 | 42M | 55G |

Table V. Semantic segmentation on ADE20K val with 150 categories. Mask2Former consistently outperforms MaskFormer [14] by a large margin with different backbones (all Mask2Former models use MSDeformAttn [66] as pixel decoder, except Swin-L-FaPN which uses FaPN [39]). Our best model outperforms the best specialized model, BEiT [2], with less than half of the parameters. We report both single-scale (s.s.) and multi-scale (m.s.) inference results. Backbones pre-trained on ImageNet-22K are marked with †.

Table VI. Semantic segmentation on ADE20K test with 150 categories. Mask2Former outperforms previous state-of-the-art methods on all three metrics: pixel accuracy (P.A.), mIoU, as well as the final test score (the average of P.A. and mIoU). We train our model on the union of the ADE20K train and val sets with an ImageNet-22K pre-trained checkpoint following [14] and use multi-scale inference.

ize the last three attention maps of our model using cross-attention (Figure Ia, top) and masked attention (Figure Ia, bottom) for a single query that predicts the "cat." With cross-attention, the attention map spreads over the entire image and the region with the highest response is outside the "cat" region.
| method | backbone | panoptic model: PQ (s.s.) | PQ (m.s.) | AP^Th_pan | mIoU_pan | instance model: AP | AP50 | semantic model: mIoU (s.s.) | mIoU (m.s.) |
| Panoptic-DeepLab [11] | R50 | 60.3 | - | 32.1 | 78.7 | - | - | - | - |
| Panoptic-DeepLab [11] | X71 [15] | 63.0 | 64.1 | 35.3 | 80.5 | - | - | - | - |
| Panoptic-DeepLab [11] | SWideRNet [9] | 66.4 | 67.5 | 40.1 | 82.2 | - | - | - | - |
| Panoptic FCN [31] | Swin-L† | 65.9 | - | - | - | - | - | - | - |
| Segmenter [45] | ViT-L† | - | - | - | - | - | - | - | 81.3 |
| SETR [64] | ViT-L† | - | - | - | - | - | - | - | 82.2 |
| SegFormer [59] | MiT-B5 | - | - | - | - | - | - | - | 84.0 |
| Mask2Former (ours) | R50 | 62.1 | - | 37.3 | 77.5 | 37.4 | 61.9 | 79.4 | 82.2 |
| Mask2Former (ours) | R101 | 62.4 | - | 37.7 | 78.6 | 38.5 | 63.9 | 80.1 | 81.9 |
| Mask2Former (ours) | Swin-T | 63.9 | - | 39.1 | 80.5 | 39.7 | 66.9 | 82.1 | 83.0 |
| Mask2Former (ours) | Swin-S | 64.8 | - | 40.7 | 81.8 | 41.8 | 70.4 | 82.6 | 83.6 |
| Mask2Former (ours) | Swin-B† | 66.1 | - | 42.8 | 82.7 | 42.0 | 68.8 | 83.3 | 84.5 |
| Mask2Former (ours) | Swin-L† | 66.6 | - | 43.6 | 82.9 | 43.7 | 71.4 | 83.3 | 84.3 |

Table VII. Image segmentation results on Cityscapes val. We report both single-scale (s.s.) and multi-scale (m.s.) inference results for PQ and mIoU. All other metrics are evaluated with single-scale inference. Since Mask2Former is an end-to-end model, we only use single-scale inference for instance-level segmentation tasks to avoid the need for further post-processing (e.g., NMS).

Table VIII. Image segmentation results on ADE20K val. Mask2Former is competitive with specialized models on ADE20K. Panoptic segmentation models use single-scale inference by default; multi-scale numbers are marked with ∗. For semantic segmentation, we report both single-scale (s.s.) and multi-scale (m.s.) inference results.
Number of queries. We ablate the number of queries for the three image segmentation tasks in Table Xa. For instance and semantic segmentation, using 100 queries achieves the best performance, while using 200 queries can further improve panoptic segmentation results. As panoptic segmentation is a combination of instance and semantic segmentation, it has more segments per image than the other two tasks. This ablation suggests that picking the number of queries for Mask2Former may depend on the number of segments per image for a particular task or dataset.

Learnable queries. An object query consists of two parts: object query features and object query positional embeddings. Object query features are only used as the initial input to the Transformer decoder and are updated through decoder layers, whereas query positional embeddings are added to query features in every Transformer decoder layer when computing the attention weights. In DETR [5], query features are zero-initialized and query positional embeddings are learnable. Furthermore, there is no direct supervision on these query features before feeding them into the Transformer (since they are zero vectors). In our Mask2Former, we still make the query positional embeddings learnable. In addition, we make the query features learnable as well and directly apply losses on these learnable query features before feeding them into the Transformer decoder. In Table Xb, we compare our learnable query features with the zero-initialized query features in DETR. We find it is important to directly supervise object queries even before feeding them into the Transformer decoder. Learnable queries without supervision perform similarly to zero-initialized queries in DETR.

C.4. MaskFormer vs. Mask2Former

Mask2Former builds upon the same meta architecture as MaskFormer [14] with two major differences: 1) we use more advanced training parameters, summarized in Table XIa; and 2) we propose a new Transformer decoder with masked attention, instead of using the standard Transformer decoder, as well as some optimization improvements, summarized in Table XIb. To better understand Mask2Former's improvements over MaskFormer, we perform ablation studies on the training parameter improvements and the Transformer decoder improvements in isolation.

In Table XIc, we study our new training parameters. We find that MaskFormer benefits from our new training parameters as well.
| method | backbone | panoptic model: PQ | mIoU_pan | semantic model: mIoU (s.s.) | mIoU (m.s.) |
| Panoptic-DeepLab [11] | ensemble | 42.2∗ | 58.7∗ | - | - |
| Panoptic-DeepLab [11] | SWideRNet [9] | 43.7 | 59.4 | - | - |
| Panoptic-DeepLab [11] | SWideRNet [9] | 44.8∗ | 60.0∗ | - | - |
| Panoptic FCN [31] | Swin-L† | 45.7 | - | - | - |
| MaskFormer [14] | R50 | - | - | 53.1 | 55.4 |
| HMSANet [48] | HRNet [53] | - | - | - | 61.1 |
| Mask2Former (ours) | R50 | 36.3 | 50.7 | 57.4 | 59.0 |
| Mask2Former (ours) | Swin-L† | 45.5 | 60.8 | 63.2 | 64.7 |

Table IX. Image segmentation results on Mapillary Vistas val. Mask2Former is competitive with specialized models on Mapillary Vistas. Panoptic segmentation models use single-scale inference by default; multi-scale numbers are marked with ∗. For semantic segmentation, we report both single-scale (s.s.) and multi-scale (m.s.) inference results.
D. Visualization

We visualize sample predictions of the Mask2Former model with the Swin-L [36] backbone on three tasks: the COCO panoptic val2017 set for panoptic segmentation (57.8 PQ) in Figure V, the COCO val2017 set for instance segmentation (50.1 AP) in Figure VI, and the ADE20K validation set for semantic segmentation (57.7 mIoU) in Figure VII.
Figure IV. Convergence analysis. We train Mask2Former with different numbers of epochs using either standard scale augmentation (Standard Aug.) [57] or the more recent large-scale jittering augmentation (LSJ Aug.) [18, 23]. Mask2Former converges in 25 epochs using standard augmentation and almost converges in 50 epochs using large-scale jittering augmentation. Using LSJ also improves performance with longer training (i.e., with more than 25 epochs).
Table X. Analysis of object queries. Table Xa: ablation on number of queries. Table Xb: ablation on using learnable queries.
| model | training params. | AP (COCO) | PQ (COCO) | mIoU (ADE20K) |
| MaskFormer | MaskFormer | 34.0 | 46.5 | 44.5 |
| MaskFormer | Mask2Former | 37.8 (+3.8) | 48.2 (+1.7) | 45.3 (+0.8) |

(c) Improvements from better training parameters.

| Transformer decoder | pixel decoder | AP (COCO) | PQ (COCO) | mIoU (ADE20K) |
| MaskFormer | FPN | 37.8 | 48.2 | 45.3 |
| Mask2Former | FPN | 41.5 (+3.7) | 50.7 (+2.5) | 45.6 (+0.3) |

(d) Improvements from better Transformer decoder.

Table XI. MaskFormer vs. Mask2Former. Table XIa and Table XIb provide an in-depth comparison between the MaskFormer and Mask2Former settings. Table XIc: MaskFormer benefits from our new training parameters as well. Table XId: comparison between MaskFormer and our Mask2Former with the exact same backbone, pixel decoder and training parameters. The improvements come solely from a better Transformer decoder.
Figure V. Visualization of panoptic segmentation predictions on the COCO panoptic dataset: Mask2Former with Swin-L backbone which
achieves 57.8 PQ on the validation set. First and third columns: ground truth. Second and fourth columns: prediction. Last row shows
failure cases.
Figure VI. Visualization of instance segmentation predictions on the COCO dataset: Mask2Former with Swin-L backbone which achieves
50.1 AP on the validation set. First and third columns: ground truth. Second and fourth columns: prediction. Last row shows failure
cases. We show predictions with confidence scores greater than 0.5.
Figure VII. Visualization of semantic segmentation predictions on the ADE20K dataset: Mask2Former with Swin-L backbone which
achieves 57.7 mIoU (multi-scale) on the validation set. First and third columns: ground truth. Second and fourth columns: prediction.
Last row shows failure cases.