Masked-attention Mask Transformer for Universal Image Segmentation (Mask2Former)
Bowen Cheng^{1,2}*  Ishan Misra^1  Alexander G. Schwing^2  Alexander Kirillov^1  Rohit Girdhar^1
^1 Facebook AI Research (FAIR)   ^2 University of Illinois at Urbana-Champaign (UIUC)
https://bowenc0221.github.io/mask2former
* Work done during an internship at Facebook AI Research.
Abstract
1. Introduction
Image segmentation studies the problem of grouping pixels. Different semantics for grouping pixels, e.g., category or instance membership, have led to different types of segmentation tasks, such as panoptic, instance or semantic segmentation. While these tasks differ only in semantics, current methods develop specialized architectures for each task. Per-pixel classification architectures based on Fully Convolutional Networks (FCNs) [37] are used for semantic segmentation, while mask classification architectures [5, 24], which predict a set of binary masks each associated with a single category, dominate instance-level segmentation. Although such specialized architectures [6, 10, 24, 37] have advanced each individual task, they lack the flexibility to generalize to the other tasks. For example, FCN-based architectures struggle at instance segmentation, leading to the evolution of different architectures for instance segmentation compared to semantic segmentation. Thus, duplicate research and (hardware) optimization effort is spent on each specialized architecture for every task.

To address this fragmentation, recent work [14, 62] has attempted to design universal architectures that are capable of addressing all segmentation tasks with the same architecture (i.e., universal image segmentation). These architectures are typically based on an end-to-end set prediction objective (e.g., DETR [5]), and successfully tackle multiple tasks without modifying the architecture, loss, or training procedure. Note that universal architectures are still trained separately for different tasks and datasets, albeit with the same architecture. In addition to being flexible, universal architectures have recently shown state-of-the-art results on semantic and panoptic segmentation [14]. However, recent work still focuses on advancing specialized architectures [20, 39, 45], which raises the question: why haven't universal architectures replaced specialized ones?

Although existing universal architectures are flexible enough to tackle any segmentation task, as shown in Figure 1, in practice their performance lags behind the best specialized architectures. For instance, the best reported
performance of universal architectures [14, 62] is currently lower (by more than 9 AP) than the SOTA specialized architecture for instance segmentation [6]. Beyond the inferior performance, universal architectures are also harder to train. They typically require more advanced hardware and a much longer training schedule. For example, training MaskFormer [14] takes 300 epochs to reach 40.1 AP, and it can only fit a single image on a GPU with 32G memory. In contrast, the specialized Swin-HTC++ [6] obtains better performance in only 72 epochs. Both the performance and training efficiency issues hamper the deployment of universal architectures.

In this work, we propose a universal image segmentation architecture named Masked-attention Mask Transformer (Mask2Former) that outperforms specialized architectures across different segmentation tasks, while still being easy to train on every task. We build upon a simple meta architecture [14] consisting of a backbone feature extractor [25, 36], a pixel decoder [33] and a Transformer decoder [51]. We propose key improvements that enable better results and efficient training. First, we use masked attention in the Transformer decoder, which restricts the attention to localized features centered around predicted segments, which can be either objects or regions depending on the specific semantic used for grouping. Compared to the cross-attention used in a standard Transformer decoder, which attends to all locations in an image, our masked attention leads to faster convergence and improved performance. Second, we use multi-scale high-resolution features, which help the model segment small objects/regions. Third, we propose optimization improvements such as switching the order of self- and cross-attention, making query features learnable, and removing dropout, all of which improve performance without additional compute. Finally, we save 3× training memory without affecting performance by calculating the mask loss on a few randomly sampled points. These improvements not only boost model performance, but also make training significantly easier, making universal architectures more accessible to users with limited compute.

We evaluate Mask2Former on three image segmentation tasks (panoptic, instance and semantic segmentation) using four popular datasets (COCO [35], Cityscapes [16], ADE20K [65] and Mapillary Vistas [42]). For the first time, on all these benchmarks, our single architecture performs on par with or better than specialized architectures. Mask2Former sets the new state-of-the-art of 57.8 PQ on COCO panoptic segmentation [28], 50.1 AP on COCO instance segmentation [35] and 57.7 mIoU on ADE20K semantic segmentation [65] using the exact same architecture.

2. Related Work

Specialized semantic segmentation architectures typically treat the task as a per-pixel classification problem. FCN-based architectures [37] independently predict a category label for every pixel. Follow-up methods find context to play an important role for precise per-pixel classification and focus on designing customized context modules [7, 8, 63] or self-attention variants [21, 26, 45, 55, 61, 64].

Specialized instance segmentation architectures are typically based upon "mask classification." They predict a set of binary masks, each associated with a single class label. The pioneering work, Mask R-CNN [24], generates masks from detected bounding boxes. Follow-up methods either focus on detecting more precise bounding boxes [4, 6], or on finding new ways to generate a dynamic number of masks, e.g., using dynamic kernels [3, 49, 56] or clustering algorithms [11, 29]. Although the performance has been advanced in each task, these specialized innovations lack the flexibility to generalize from one task to the other, leading to duplicated research effort. For instance, although multiple approaches have been proposed for building feature pyramid representations [33], as we show in our experiments, BiFPN [47] performs better for instance segmentation while FaPN [39] performs better for semantic segmentation.

Panoptic segmentation has been proposed to unify the semantic and instance segmentation tasks [28]. Architectures for panoptic segmentation either combine the best of specialized semantic and instance segmentation architectures into a single framework [11, 27, 31, 60] or design novel objectives that treat semantic regions and instance objects equally [5, 52]. Despite these new architectures, researchers continue to develop specialized architectures for different image segmentation tasks [20, 45]. We find panoptic architectures usually only report performance on the single panoptic segmentation task [52], which does not guarantee good performance on other tasks (Figure 1). For example, panoptic segmentation does not measure architectures' abilities to rank predictions as instance segmentation does. Thus, we refrain from referring to architectures that are only evaluated on panoptic segmentation as universal architectures. Instead, here, we evaluate our Mask2Former on all studied tasks to guarantee generalizability.

Universal architectures have emerged with DETR [5], which shows that mask classification architectures with an end-to-end set prediction objective are general enough for any image segmentation task. MaskFormer [14] shows that mask classification based on DETR not only performs well on panoptic segmentation but also achieves state-of-the-art results on semantic segmentation. K-Net [62] further extends set prediction to instance segmentation. Unfortunately, these architectures fail to replace specialized models, as their performance on particular tasks or datasets is still worse than the best specialized architecture (e.g., MaskFormer [14] cannot segment instances well). To our knowledge, Mask2Former is the first architecture that outperforms state-of-the-art specialized architectures on all considered tasks and datasets.
3. Masked-attention Mask Transformer
We now present Mask2Former. We first review a meta architecture for mask classification that Mask2Former is built upon. Then, we introduce our new Transformer decoder with masked attention, which is the key to better convergence and results. Lastly, we propose training improvements that make Mask2Former efficient and accessible.

Figure 2. Mask2Former overview. Mask2Former adopts the same meta architecture as MaskFormer [14] with a backbone, a pixel decoder and a Transformer decoder. We propose a new Transformer decoder with masked attention instead of the standard cross-attention (Section 3.2.1). To deal with small objects, we propose an efficient way of utilizing high-resolution features from a pixel decoder by feeding one scale of the multi-scale feature to one Transformer decoder layer at a time (Section 3.2.2). In addition, we switch the order of self- and cross-attention (i.e., our masked attention), make query features learnable, and remove dropout to make computation more effective (Section 3.2.3). Note that positional embeddings and predictions from intermediate Transformer decoder layers are omitted in this figure for readability.

3.1. Mask classification preliminaries

Mask classification architectures group pixels into N segments by predicting N binary masks, along with N corresponding category labels. Mask classification is sufficiently general to address any segmentation task by assigning different semantics, e.g., categories or instances, to different segments. However, the challenge is to find good representations for each segment. For example, Mask R-CNN [24] uses bounding boxes as the representation, which limits its application to semantic segmentation. Inspired by DETR [5], each segment in an image can be represented as a C-dimensional feature vector ("object query") and can be processed by a Transformer decoder trained with a set prediction objective. A simple meta architecture consists of three components: a backbone that extracts low-resolution features from an image; a pixel decoder that gradually upsamples low-resolution features from the output of the backbone to generate high-resolution per-pixel embeddings; and finally a Transformer decoder that operates on image features to process object queries. The final binary mask predictions are decoded from the per-pixel embeddings with the object queries. One successful instantiation of such a meta architecture is MaskFormer [14], and we refer readers to [14] for more details.
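To make the meta architecture concrete, the sketch below shows how binary masks are decoded from per-pixel embeddings and object queries via a dot product. It is a minimal illustration under our own naming assumptions (the module signatures and head names are placeholders, not the MaskFormer/Mask2Former API):

```python
import torch

def meta_architecture_forward(image, backbone, pixel_decoder,
                              transformer_decoder, class_head, mask_head):
    """Sketch of the mask-classification meta architecture (Section 3.1)."""
    features = backbone(image)                              # low-resolution image features
    per_pixel_emb, ms_features = pixel_decoder(features)    # (C, H, W) embeddings + multi-scale maps
    queries = transformer_decoder(ms_features)              # (N, C) object queries

    class_logits = class_head(queries)                      # (N, K+1) one category per query
    mask_embed = mask_head(queries)                         # (N, C)
    # Each binary mask is a dot product between a query embedding
    # and the per-pixel embeddings.
    mask_logits = torch.einsum("nc,chw->nhw", mask_embed, per_pixel_emb)
    return class_logits, mask_logits
```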
3.2. Transformer decoder with masked attention

Mask2Former adopts the aforementioned meta architecture, with our proposed Transformer decoder (Figure 2, right) replacing the standard one. The key components of our Transformer decoder include a masked attention operator, which extracts localized features by constraining cross-attention to within the foreground region of the predicted mask for each query, instead of attending to the full feature map. To handle small objects, we propose an efficient multi-scale strategy to utilize high-resolution features. It feeds successive feature maps from the pixel decoder's feature pyramid into successive Transformer decoder layers in a round-robin fashion. Finally, we incorporate optimization improvements that boost model performance without introducing additional computation. We now discuss these improvements in detail.

3.2.1 Masked attention

Context features have been shown to be important for image segmentation [7, 8, 63]. However, recent studies [22, 46] suggest that the slow convergence of Transformer-based models is due to the global context in the cross-attention layer, as it takes many training epochs for cross-attention to learn to attend to localized object regions [46]. We hypothesize that local features are enough to update query features and that context information can be gathered through self-attention. For this, we propose masked attention, a variant of cross-attention that only attends within the foreground region of the predicted mask for each query.

Standard cross-attention (with residual path) computes

$$X_l = \mathrm{softmax}(Q_l K_l^{\mathsf{T}}) V_l + X_{l-1}. \qquad (1)$$

Here, $l$ is the layer index, $X_l \in \mathbb{R}^{N \times C}$ refers to the $N$ $C$-dimensional query features at the $l$-th layer, and $Q_l = f_Q(X_{l-1}) \in \mathbb{R}^{N \times C}$. $X_0$ denotes the input query features to the Transformer decoder. $K_l, V_l \in \mathbb{R}^{H_l W_l \times C}$ are the image features under the transformations $f_K(\cdot)$ and $f_V(\cdot)$ respectively, and $H_l$ and $W_l$ are the spatial resolution of the image features that we introduce in Section 3.2.2. $f_Q$, $f_K$ and $f_V$ are linear transformations.
Our masked attention modulates the attention matrix via

$$X_l = \mathrm{softmax}(\mathcal{M}_{l-1} + Q_l K_l^{\mathsf{T}}) V_l + X_{l-1}. \qquad (2)$$

Moreover, the attention mask $\mathcal{M}_{l-1}$ at feature location $(x, y)$ is

$$\mathcal{M}_{l-1}(x, y) = \begin{cases} 0 & \text{if } M_{l-1}(x, y) = 1 \\ -\infty & \text{otherwise.} \end{cases} \qquad (3)$$

Here, $M_{l-1} \in \{0, 1\}^{N \times H_l W_l}$ is the binarized output (thresholded at 0.5) of the resized mask prediction of the previous, $(l-1)$-th, Transformer decoder layer. It is resized to the same resolution as $K_l$. $M_0$ is the binary mask prediction obtained from $X_0$, i.e., before feeding query features into the Transformer decoder.
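The following is a minimal single-head sketch of Eq. (2)–(3) in PyTorch-style code; the tensor layout and the empty-mask safeguard are our own assumptions and are not taken from the released implementation:

```python
import torch
import torch.nn.functional as F

def masked_attention(X_prev, K, V, mask_pred, f_Q):
    """Single-head sketch of masked attention (Eq. 2-3).

    X_prev:    (N, C)    query features from the previous decoder layer
    K, V:      (HW, C)   image features after the linear maps f_K / f_V
    mask_pred: (N, H, W) mask probabilities predicted by the previous layer,
               already resized to the resolution of K
    f_Q:       linear map producing queries from X_prev
    """
    N, HW = X_prev.shape[0], K.shape[0]

    # Eq. (3): 0 inside the binarized (>= 0.5) predicted foreground, -inf elsewhere.
    foreground = mask_pred.flatten(1) >= 0.5               # (N, HW)
    attn_mask = torch.zeros(N, HW, device=X_prev.device)
    attn_mask[~foreground] = float("-inf")
    # Safeguard (ours, not stated in the paper): if a predicted mask is empty,
    # fall back to full attention so the softmax stays well defined.
    attn_mask[~foreground.any(dim=-1)] = 0.0

    # Eq. (2): masked cross-attention with a residual path.
    # (A real implementation would also use multi-head attention and 1/sqrt(C) scaling.)
    attn = F.softmax(f_Q(X_prev) @ K.T + attn_mask, dim=-1)
    return attn @ V + X_prev
```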
3.2.2 High-resolution features

High-resolution features improve model performance, especially for small objects [5]. However, they are computationally demanding. Thus, we propose an efficient multi-scale strategy to introduce high-resolution features while controlling the increase in computation. Instead of always using the high-resolution feature map, we utilize a feature pyramid which consists of both low- and high-resolution features and feed one resolution of the multi-scale feature to one Transformer decoder layer at a time.

Specifically, we use the feature pyramid produced by the pixel decoder with resolutions 1/32, 1/16 and 1/8 of the original image. For each resolution, we add both a sinusoidal positional embedding $e_{\mathrm{pos}} \in \mathbb{R}^{H_l W_l \times C}$, following [5], and a learnable scale-level embedding $e_{\mathrm{lvl}} \in \mathbb{R}^{1 \times C}$, following [66]. We use these, from the lowest resolution to the highest resolution, for the corresponding Transformer decoder layers, as shown in Figure 2, left. We repeat this 3-layer Transformer decoder L times, so our final Transformer decoder has 3L layers. More specifically, the first three layers receive feature maps of resolution $H_1 = H/32$, $H_2 = H/16$, $H_3 = H/8$ and $W_1 = W/32$, $W_2 = W/16$, $W_3 = W/8$, where H and W are the original image resolution. This pattern is repeated in a round-robin fashion for all following layers.
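A minimal sketch of the round-robin feeding schedule described above, assuming a decoder-layer callable `layer(queries, features)` and a pyramid ordered from lowest to highest resolution; the names are illustrative only:

```python
def decode(queries, pyramid, layers):
    """Round-robin multi-scale decoding (Section 3.2.2).

    queries: (N, C) query features
    pyramid: list of image feature maps ordered [1/32, 1/16, 1/8]
    layers:  the 3L Transformer decoder layers
    """
    for i, layer in enumerate(layers):
        features = pyramid[i % len(pyramid)]   # cycle 1/32 -> 1/16 -> 1/8 -> 1/32 ...
        queries = layer(queries, features)     # masked attention, self-attention, FFN
    return queries
```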
3.2.3 Optimization improvements

A standard Transformer decoder layer [51] consists of three modules that process query features in the following order: a self-attention module, a cross-attention module and a feed-forward network (FFN). Moreover, query features ($X_0$) are zero initialized before being fed into the Transformer decoder and are associated with learnable positional embeddings. Furthermore, dropout is applied to both residual connections and attention maps.

To optimize the Transformer decoder design, we make the following three improvements. First, we switch the order of self- and cross-attention (our new "masked attention") to make computation more effective: query features to the first self-attention layer are image-independent and do not carry signals from the image, so applying self-attention to them is unlikely to enrich information. Second, we make query features ($X_0$) learnable as well (we still keep the learnable query positional embeddings), and learnable query features are directly supervised before being used in the Transformer decoder to predict masks ($M_0$). We find these learnable query features function like a region proposal network [43] and have the ability to generate mask proposals. Finally, we find dropout is not necessary and usually decreases performance. We thus completely remove dropout in our decoder.

3.3. Improving training efficiency

One limitation of training universal architectures is the large memory consumption due to high-resolution mask prediction, which makes them less accessible than the more memory-friendly specialized architectures [6, 24]. For example, MaskFormer [14] can only fit a single image on a GPU with 32G memory. Motivated by PointRend [30] and Implicit PointRend [13], which show that a segmentation model can be trained with its mask loss calculated on K randomly sampled points instead of the whole mask, we calculate the mask loss with sampled points in both the matching and the final loss calculation. More specifically, in the matching loss that constructs the cost matrix for bipartite matching, we uniformly sample the same set of K points for all prediction and ground truth masks. In the final loss between predictions and their matched ground truths, we sample different sets of K points for different pairs of prediction and ground truth using importance sampling [30]. We set K = 12544, i.e., 112 × 112 points. This new training strategy effectively reduces training memory by 3×, from 18GB to 6GB per image, making Mask2Former more accessible to users with limited computational resources.
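As a rough illustration of the point-sampled mask loss for the matching stage, the sketch below evaluates a binary cross-entropy cost on K uniformly sampled points instead of the full masks. The grid_sample-based sampling and the function name are our assumptions rather than the paper's released code, and the importance-sampling variant used for the final loss is omitted:

```python
import torch
import torch.nn.functional as F

def point_sampled_matching_cost(pred_logits, gt_masks, num_points=12544):
    """Uniform point sampling for the bipartite-matching cost (Section 3.3).

    pred_logits: (N, H, W) predicted mask logits
    gt_masks:    (M, H, W) ground-truth binary masks
    Returns an (N, M) cost matrix of BCE losses evaluated on the same K points.
    """
    # Sample the same K random point coordinates for every mask
    # (grid_sample expects coordinates in [-1, 1]).
    coords = torch.rand(1, num_points, 1, 2, device=pred_logits.device) * 2 - 1

    def sample(masks):
        # (B, 1, H, W) sampled at K points -> (B, K)
        return F.grid_sample(masks[:, None],
                             coords.expand(len(masks), -1, -1, -1),
                             align_corners=False).squeeze(1).squeeze(-1)

    pred_pts = sample(pred_logits)         # (N, K)
    gt_pts = sample(gt_masks.float())      # (M, K)

    # Pairwise BCE between every prediction and every ground truth,
    # averaged over the sampled points.
    cost = F.binary_cross_entropy_with_logits(
        pred_pts[:, None].expand(-1, len(gt_pts), -1),
        gt_pts[None].expand(len(pred_pts), -1, -1),
        reduction="none").mean(-1)         # (N, M)
    return cost
```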
| method | backbone | query type | epochs | PQ | PQ^Th | PQ^St | AP^Th_pan | mIoU_pan | #params. | FLOPs | fps |
| DETR [5] | R50 | 100 queries | 500+25 | 43.4 | 48.2 | 36.3 | 31.1 | - | - | - | - |
| MaskFormer [14] | R50 | 100 queries | 300 | 46.5 | 51.0 | 39.8 | 33.0 | 57.8 | 45M | 181G | 17.6 |
| Mask2Former (ours) | R50 | 100 queries | 50 | 51.9 | 57.7 | 43.0 | 41.7 | 61.7 | 44M | 226G | 8.6 |
| DETR [5] | R101 | 100 queries | 500+25 | 45.1 | 50.5 | 37.0 | 33.0 | - | - | - | - |
| MaskFormer [14] | R101 | 100 queries | 300 | 47.6 | 52.5 | 40.3 | 34.1 | 59.3 | 64M | 248G | 14.0 |
| Mask2Former (ours) | R101 | 100 queries | 50 | 52.6 | 58.5 | 43.7 | 42.6 | 62.4 | 63M | 293G | 7.2 |
| Max-DeepLab [52] | Max-L | 128 queries | 216 | 51.1 | 57.0 | 42.2 | - | - | 451M | 3692G | - |
| MaskFormer [14] | Swin-L† | 100 queries | 300 | 52.7 | 58.5 | 44.0 | 40.1 | 64.8 | 212M | 792G | 5.2 |
| K-Net [62] | Swin-L† | 100 queries | 36 | 54.6 | 60.2 | 46.0 | - | - | - | - | - |
| Mask2Former (ours) | Swin-L† | 200 queries | 100 | 57.8 | 64.2 | 48.1 | 48.6 | 67.4 | 216M | 868G | 4.0 |

Table 1. Panoptic segmentation on COCO panoptic val2017 with 133 categories. Mask2Former consistently outperforms MaskFormer [14] by a large margin with different backbones on all metrics. Our best model outperforms the prior state-of-the-art MaskFormer by 5.1 PQ and K-Net [62] by 3.2 PQ. Backbones pre-trained on ImageNet-22K are marked with †.

4. Experiments

We demonstrate that Mask2Former is an effective architecture for universal image segmentation through comparisons with specialized state-of-the-art architectures on standard benchmarks. We evaluate our proposed design decisions through ablations on all three tasks. Finally, we show Mask2Former generalizes beyond the standard benchmarks, obtaining state-of-the-art results on four datasets.

Datasets. We study Mask2Former using four widely used image segmentation datasets that support semantic, instance and panoptic segmentation: COCO [35] (80 "things" and 53 "stuff" categories), ADE20K [65] (100 "things" and 50 "stuff" categories), Cityscapes [16] (8 "things" and 11 "stuff" categories) and Mapillary Vistas [42] (37 "things" and 28 "stuff" categories). Panoptic and semantic segmentation tasks are evaluated on the union of "things" and "stuff" categories, while instance segmentation is only evaluated on the "things" categories.

Evaluation metrics. For panoptic segmentation, we use the standard PQ (panoptic quality) metric [28]. We further report AP^Th_pan, which is the AP evaluated on the "thing" categories using instance segmentation annotations, and mIoU_pan, which is the mIoU for semantic segmentation obtained by merging instance masks from the same category, of the same model trained only with panoptic segmentation annotations. For instance segmentation, we use the standard AP (average precision) metric [35]. For semantic segmentation, we use mIoU (mean Intersection-over-Union) [19].

4.1. Implementation details

We adopt settings from [14] with the following differences:

Pixel decoder. Mask2Former is compatible with any existing pixel decoder module. In MaskFormer [14], FPN [33] is chosen as the default for its simplicity. Since our goal is to demonstrate strong performance across different segmentation tasks, we use the more advanced multi-scale deformable attention Transformer (MSDeformAttn) [66] as our default pixel decoder. Specifically, we use 6 MSDeformAttn layers applied to feature maps with resolution 1/8, 1/16 and 1/32, and use a simple upsampling layer with a lateral connection on the final 1/8 feature map to generate the feature map of resolution 1/4 as the per-pixel embedding. In our ablation study, we show that this pixel decoder provides the best results across different segmentation tasks.

Transformer decoder. We use our Transformer decoder proposed in Section 3.2 with L = 3 (i.e., 9 layers total) and 100 queries by default. An auxiliary loss is added to every intermediate Transformer decoder layer and to the learnable query features before the Transformer decoder.

Loss weights. We use the binary cross-entropy loss (instead of the focal loss [34] used in [14]) and the dice loss [41] for our mask loss: $\mathcal{L}_{\text{mask}} = \lambda_{\text{ce}} \mathcal{L}_{\text{ce}} + \lambda_{\text{dice}} \mathcal{L}_{\text{dice}}$. We set $\lambda_{\text{ce}} = 5.0$ and $\lambda_{\text{dice}} = 5.0$. The final loss is a combination of mask loss and classification loss, $\mathcal{L}_{\text{mask}} + \lambda_{\text{cls}} \mathcal{L}_{\text{cls}}$, and we set $\lambda_{\text{cls}} = 2.0$ for predictions matched with a ground truth and 0.1 for the "no object," i.e., predictions that have not been matched with any ground truth.

Post-processing. We use the exact same post-processing as [14] to acquire the expected output format for panoptic and semantic segmentation from pairs of binary masks and class predictions. Instance segmentation requires additional confidence scores for each prediction. We multiply the class confidence and the mask confidence (i.e., the averaged foreground per-pixel binary mask probability) for a final confidence.
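A small sketch of the final instance-confidence computation described above; the function name and the 0.5 binarization threshold are our assumptions:

```python
import torch

def instance_confidence(class_prob, mask_logits):
    """Final per-query confidence = class confidence * mask confidence,
    where mask confidence is the average foreground probability inside
    the binarized predicted mask."""
    mask_prob = mask_logits.sigmoid()                       # (N, H, W)
    fg = (mask_prob > 0.5).float()
    mask_conf = (mask_prob * fg).flatten(1).sum(1) / (fg.flatten(1).sum(1) + 1e-6)
    return class_prob * mask_conf                           # (N,)
```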
4.2. Training settings

Panoptic and instance segmentation. We use Detectron2 [57] and follow the updated Mask R-CNN [24] baseline settings¹ for the COCO dataset. More specifically, we use the AdamW [38] optimizer and the step learning rate schedule. We use an initial learning rate of 0.0001 and a weight decay of 0.05 for all backbones. A learning rate multiplier of 0.1 is applied to the backbone, and we decay the learning rate at 0.9 and 0.95 fractions of the total number of training steps by a factor of 10. If not stated otherwise, we train our models for 50 epochs with a batch size of 16. For data augmentation, we use the large-scale jittering (LSJ) augmentation [18, 23] with a random scale sampled from the range 0.1 to 2.0, followed by a fixed-size crop to 1024×1024. We use the standard Mask R-CNN inference setting, where we resize an image with shorter side to 800 and longer side up to 1333. We also report FLOPs and fps. FLOPs are averaged over 100 validation images (COCO images have varying sizes). Frames-per-second (fps) is measured on a V100 GPU with a batch size of 1 by taking the average runtime on the entire validation set, including post-processing time.

¹ https://github.com/facebookresearch/detectron2/blob/main/MODEL_ZOO.md#new-baselines-using-large-scale-jitter-and-longer-training-schedule
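The optimizer and learning-rate schedule described above map directly to standard PyTorch components. The sketch below is our simplified rendering (the parameter-group construction and the "backbone" name prefix are assumptions, not the Detectron2 configuration itself):

```python
import torch

def build_optimizer_and_scheduler(model, max_iters):
    """Sketch of the COCO training schedule: AdamW + step LR decay."""
    backbone_params, other_params = [], []
    for name, p in model.named_parameters():
        (backbone_params if name.startswith("backbone") else other_params).append(p)

    optimizer = torch.optim.AdamW(
        [{"params": other_params, "lr": 1e-4},
         {"params": backbone_params, "lr": 1e-4 * 0.1}],   # 0.1x multiplier on the backbone
        weight_decay=0.05)

    # Decay the learning rate by 10x at 90% and 95% of training.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[int(0.9 * max_iters), int(0.95 * max_iters)], gamma=0.1)
    return optimizer, scheduler
```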
Semantic segmentation. We follow the same settings as [14] to train our models, except: 1) a learning rate multiplier of 0.1 is applied to both CNN and Transformer backbones, instead of only to CNN backbones as in [14], and 2) both ResNet and Swin backbones use an initial learning rate of 0.0001 and a weight decay of 0.05, instead of using different learning rates as in [14].

| method | backbone | query type | epochs | AP | AP_S | AP_M | AP_L | AP_boundary | #params. | FLOPs | fps |
| MaskFormer [14] | R50 | 100 queries | 300 | 34.0 | 16.4 | 37.8 | 54.2 | 23.0 | 45M | 181G | 19.2 |
| Mask R-CNN [24] | R50 | dense anchors | 36 | 37.2 | 18.6 | 39.5 | 53.3 | 23.1 | 44M | 201G | 15.2 |
| Mask R-CNN [18, 23, 24] | R50 | dense anchors | 400 | 42.5 | 23.8 | 45.0 | 60.0 | 28.0 | 46M | 358G | 10.3 |
| Mask2Former (ours) | R50 | 100 queries | 50 | 43.7 | 23.4 | 47.2 | 64.8 | 30.6 | 44M | 226G | 9.7 |
| Mask R-CNN [24] | R101 | dense anchors | 36 | 38.6 | 19.5 | 41.3 | 55.3 | 24.5 | 63M | 266G | 10.8 |
| Mask R-CNN [18, 23, 24] | R101 | dense anchors | 400 | 43.7 | 24.6 | 46.4 | 61.8 | 29.1 | 65M | 423G | 8.6 |
| Mask2Former (ours) | R101 | 100 queries | 50 | 44.2 | 23.8 | 47.7 | 66.7 | 31.1 | 63M | 293G | 7.8 |
| QueryInst [20] | Swin-L† | 300 queries | 50 | 48.9 | 30.8 | 52.6 | 68.3 | 33.5 | - | - | 3.3 |
| Swin-HTC++ [6, 36] | Swin-L† | dense anchors | 72 | 49.5 | 31.0 | 52.4 | 67.2 | 34.1 | 284M | 1470G | - |
| Mask2Former (ours) | Swin-L† | 200 queries | 100 | 50.1 | 29.9 | 53.9 | 72.1 | 36.2 | 216M | 868G | 4.0 |

Table 2. Instance segmentation on COCO val2017 with 80 categories. Mask2Former outperforms strong Mask R-CNN [24] baselines on both the AP and AP_boundary [12] metrics while training with 8× fewer epochs. Our best model is also competitive with the state-of-the-art specialized instance segmentation model on COCO and has higher boundary quality. For a fair comparison, we only consider single-scale inference and models trained using only COCO train2017 data. Backbones pre-trained on ImageNet-22K are marked with †.

4.3. Main results

Panoptic segmentation. We compare Mask2Former with state-of-the-art models for panoptic segmentation on the COCO panoptic [28] dataset in Table 1. Mask2Former consistently outperforms MaskFormer by more than 5 PQ across different backbones while converging 6× faster. With the Swin-L backbone, our Mask2Former sets a new state-of-the-art of 57.8 PQ, outperforming the existing state-of-the-art [14] by 5.1 PQ and the concurrent work K-Net [62] by 3.2 PQ. Mask2Former even outperforms the best ensemble models with extra training data in the COCO challenge (see Appendix A.1 for test set results).

Beyond the PQ metric, our Mask2Former also achieves higher performance on two other metrics compared to DETR [5] and MaskFormer: AP^Th_pan, which is the AP evaluated on the 80 "thing" categories using instance segmentation annotations, and mIoU_pan, which is the mIoU evaluated on the 133 categories for semantic segmentation converted from panoptic segmentation annotations. This shows Mask2Former's universality: trained only with panoptic segmentation annotations, it can be used for instance and semantic segmentation.

Instance segmentation. We compare Mask2Former with state-of-the-art models on the COCO [35] dataset in Table 2. With a ResNet [25] backbone, Mask2Former outperforms a strong Mask R-CNN [24] baseline that uses large-scale jittering (LSJ) augmentation [18, 23], while requiring 8× fewer training iterations. With the Swin-L backbone, Mask2Former outperforms the state-of-the-art HTC++ [6]. Although we only observe a +0.6 AP improvement over HTC++, the Boundary AP [12] improves by 2.1, suggesting that our predictions have better boundary quality thanks to the high-resolution mask predictions. Note that for a fair comparison, we only consider single-scale inference and models trained with only COCO train2017 data.

With a ResNet-50 backbone, Mask2Former improves over MaskFormer on small objects by 7.0 AP_S, while overall the highest gains come from large objects (+10.6 AP_L). The performance on AP_S still lags behind other state-of-the-art models. Hence, there remains room for improvement on small objects, e.g., by using dilated backbones as in DETR [5], which we leave for future work.

| method | backbone | crop size | mIoU (s.s.) | mIoU (m.s.) |
| MaskFormer [14] | R50 | 512 | 44.5 | 46.7 |
| Mask2Former (ours) | R50 | 512 | 47.2 | 49.2 |
| Swin-UperNet [36, 58] | Swin-T | 512 | - | 46.1 |
| MaskFormer [14] | Swin-T | 512 | 46.7 | 48.8 |
| Mask2Former (ours) | Swin-T | 512 | 47.7 | 49.6 |
| MaskFormer [14] | Swin-L† | 640 | 54.1 | 55.6 |
| FaPN-MaskFormer [14, 39] | Swin-L-FaPN† | 640 | 55.2 | 56.7 |
| BEiT-UperNet [2, 58] | BEiT-L† | 640 | - | 57.0 |
| Mask2Former (ours) | Swin-L† | 640 | 56.1 | 57.3 |
| Mask2Former (ours) | Swin-L-FaPN† | 640 | 56.4 | 57.7 |

Table 3. Semantic segmentation on ADE20K val with 150 categories. Mask2Former consistently outperforms MaskFormer [14] by a large margin with different backbones (all Mask2Former models use MSDeformAttn [66] as pixel decoder, except Swin-L-FaPN which uses FaPN [39]). Our best model outperforms the best specialized model, BEiT [2]. We report both single-scale (s.s.) and multi-scale (m.s.) inference results. Backbones pre-trained on ImageNet-22K are marked with †.

Semantic segmentation. We compare Mask2Former with state-of-the-art models for semantic segmentation on the ADE20K [65] dataset in Table 3. Mask2Former outperforms MaskFormer [14] across different backbones, suggesting that the proposed improvements boost semantic segmentation results even where [14] was already state-of-the-art. With Swin-L as backbone and FaPN [39] as pixel decoder, Mask2Former sets a new state-of-the-art of 57.7 mIoU. We also report test set results in Appendix A.3.

4.4. Ablation studies

We now analyze Mask2Former through a series of ablation studies using a ResNet-50 backbone [25]. To test the generality of the proposed components for universal image segmentation, all ablations are performed on three tasks.
| ablation | AP | PQ | mIoU | FLOPs |
| Mask2Former (ours) | 43.7 | 51.9 | 47.2 | 226G |
| − masked attention | 37.8 (-5.9) | 47.1 (-4.8) | 45.5 (-1.7) | 213G |
| − high-resolution features | 41.5 (-2.2) | 50.2 (-1.7) | 46.1 (-1.1) | 218G |

(a) Masked attention and high-resolution features (from the efficient multi-scale strategy) lead to the most gains. More detailed ablations are in Table 4c and Table 4d. We remove one component at a time.

| ablation | AP | PQ | mIoU | FLOPs |
| Mask2Former (ours) | 43.7 | 51.9 | 47.2 | 226G |
| − learnable query features | 42.9 (-0.8) | 51.2 (-0.7) | 45.4 (-1.8) | 226G |
| − cross-attention first | 43.2 (-0.5) | 51.6 (-0.3) | 46.3 (-0.9) | 226G |
| − remove dropout | 43.0 (-0.7) | 51.3 (-0.6) | 47.2 (-0.0) | 226G |
| − all 3 components above | 42.3 (-1.4) | 50.8 (-1.1) | 46.3 (-0.9) | 226G |

(b) Optimization improvements increase the performance without introducing extra compute. Following DETR [5], query features are zero-initialized when not learnable. We remove one component at a time.

Table 4. Mask2Former ablations. We perform ablations on three tasks: instance (AP on COCO val2017), panoptic (PQ on COCO panoptic val2017) and semantic (mIoU on ADE20K val) segmentation. FLOPs are measured on COCO instance segmentation.
| training annotations | COCO (PQ / AP / mIoU) | ADE20K (PQ / AP / mIoU) | Cityscapes (PQ / AP / mIoU) |
| panoptic | 51.9 / 41.7 / 61.7 | 39.7 / 26.5 / 46.1 | 62.1 / 37.3 / 77.5 |
| instance | - / 43.7 / - | - / 26.4 / - | - / 37.4 / - |
| semantic | - / - / 61.5 | - / - / 47.2 | - / - / 79.4 |

Table 7. Limitations of Mask2Former on (a) COCO, (b) ADE20K and (c) Cityscapes. Although a single Mask2Former can address any segmentation task, we still need to train it on different tasks. Across three datasets we find Mask2Former trained with panoptic annotations performs slightly worse than the exact same model trained specifically for instance and semantic segmentation with the corresponding data.
References

[1] Pablo Arbeláez, Jordi Pont-Tuset, Jonathan T Barron, Ferran Marques, and Jitendra Malik. Multiscale combinatorial grouping. In CVPR, 2014.
[2] Hangbo Bao, Li Dong, and Furu Wei. BEiT: BERT pre-training of image transformers. arXiv, 2021.
[3] Daniel Bolya, Chong Zhou, Fanyi Xiao, and Yong Jae Lee. YOLACT++: Better real-time instance segmentation, 2019.
[4] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In CVPR, 2018.
[5] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
[6] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In CVPR, 2019.
[7] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. PAMI, 2018.
[8] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587, 2017.
[9] Liang-Chieh Chen, Huiyu Wang, and Siyuan Qiao. Scaling wide residual networks for panoptic segmentation. arXiv:2011.11675, 2020.
[10] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
[11] Bowen Cheng, Maxwell D Collins, Yukun Zhu, Ting Liu, Thomas S Huang, Hartwig Adam, and Liang-Chieh Chen. Panoptic-DeepLab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In CVPR, 2020.
[12] Bowen Cheng, Ross Girshick, Piotr Dollár, Alexander C Berg, and Alexander Kirillov. Boundary IoU: Improving object-centric image segmentation evaluation. In CVPR, 2021.
[13] Bowen Cheng, Omkar Parkhi, and Alexander Kirillov. Pointly-supervised instance segmentation. arXiv, 2021.
[14] Bowen Cheng, Alexander G. Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. In NeurIPS, 2021.
[15] François Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
[16] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[17] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[18] Xianzhi Du, Barret Zoph, Wei-Chih Hung, and Tsung-Yi Lin. Simple training strategies and model scaling for object detection. arXiv:2107.00057, 2021.
[19] Mark Everingham, SM Ali Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes challenge: A retrospective. IJCV, 2015.
[20] Yuxin Fang, Shusheng Yang, Xinggang Wang, Yu Li, Chen Fang, Ying Shan, Bin Feng, and Wenyu Liu. Instances as queries. In ICCV, 2021.
[21] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In CVPR, 2019.
[22] Peng Gao, Minghang Zheng, Xiaogang Wang, Jifeng Dai, and Hongsheng Li. Fast convergence of DETR with spatially modulated co-attention. In ICCV, 2021.
[23] Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D Cubuk, Quoc V Le, and Barret Zoph. Simple copy-paste is a strong data augmentation method for instance segmentation. In CVPR, 2021.
[24] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
[25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[26] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. CCNet: Criss-cross attention for semantic segmentation. In ICCV, 2019.
[27] Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollár. Panoptic feature pyramid networks. In CVPR, 2019.
[28] Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollár. Panoptic segmentation. In CVPR, 2019.
[29] Alexander Kirillov, Evgeny Levinkov, Bjoern Andres, Bogdan Savchynskyy, and Carsten Rother. InstanceCut: from edges to instances with multicut. In CVPR, 2017.
[30] Alexander Kirillov, Yuxin Wu, Kaiming He, and Ross Girshick. PointRend: Image segmentation as rendering. In CVPR, 2020.
[31] Yanwei Li, Hengshuang Zhao, Xiaojuan Qi, Yukang Chen, Lu Qi, Liwei Wang, Zeming Li, Jian Sun, and Jiaya Jia. Fully convolutional networks for panoptic segmentation with point-based supervision. arXiv:2108.07682, 2021.
[32] Zhiqi Li, Wenhai Wang, Enze Xie, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, Tong Lu, and Ping Luo. Panoptic SegFormer. arXiv:2109.03814, 2021.
[33] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[34] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, 2017.
[35] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[36] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv:2103.14030, 2021.
[37] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[38] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
[39] Shihua Huang, Zhichao Lu, Ran Cheng, and Cheng He. FaPN: Feature-aligned pyramid network for dense image prediction. arXiv, 2021.
[40] Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, and Jingdong Wang. Conditional DETR for fast training convergence. In ICCV, 2021.
[41] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In 3DV, 2016.
[42] Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulò, and Peter Kontschieder. The Mapillary Vistas dataset for semantic understanding of street scenes. In CVPR, 2017.
[43] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
[44] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
[45] Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation. In ICCV, 2021.
[46] Zhiqing Sun, Shengcao Cao, Yiming Yang, and Kris M Kitani. Rethinking transformer-based set prediction for object detection. In ICCV, 2021.
[47] Mingxing Tan, Ruoming Pang, and Quoc V Le. EfficientDet: Scalable and efficient object detection. In CVPR, 2020.
[48] Andrew Tao, Karan Sapra, and Bryan Catanzaro. Hierarchical multi-scale attention for semantic segmentation. arXiv:2005.10821, 2020.
[49] Zhi Tian, Chunhua Shen, and Hao Chen. Conditional convolutions for instance segmentation. In ECCV, 2020.
[50] Jasper RR Uijlings, Koen EA Van De Sande, Theo Gevers, and Arnold WM Smeulders. Selective search for object recognition. IJCV, 2013.
[51] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[52] Huiyu Wang, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. MaX-DeepLab: End-to-end panoptic segmentation with mask transformers. In CVPR, 2021.
[53] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, and Bin Xiao. Deep high-resolution representation learning for visual recognition. PAMI, 2019.
[54] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. PVTv2: Improved baselines with pyramid vision transformer. arXiv:2106.13797, 2021.
[55] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.
[56] Xinlong Wang, Rufeng Zhang, Tao Kong, Lei Li, and Chunhua Shen. SOLOv2: Dynamic and fast instance segmentation. In NeurIPS, 2020.
[57] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
[58] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In ECCV, 2018.
[59] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. SegFormer: Simple and efficient design for semantic segmentation with transformers. In NeurIPS, 2021.
[60] Yuwen Xiong, Renjie Liao, Hengshuang Zhao, Rui Hu, Min Bai, Ersin Yumer, and Raquel Urtasun. UPSNet: A unified panoptic segmentation network. In CVPR, 2019.
[61] Yuhui Yuan, Lang Huang, Jianyuan Guo, Chao Zhang, Xilin Chen, and Jingdong Wang. OCNet: Object context for semantic segmentation. IJCV, 2021.
[62] Wenwei Zhang, Jiangmiao Pang, Kai Chen, and Chen Change Loy. K-Net: Towards unified image segmentation. In NeurIPS, 2021.
[63] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, 2017.
[64] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In CVPR, 2021.
[65] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In CVPR, 2017.
[66] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. In ICLR, 2021.
Appendix

We first provide more results for Mask2Former with different backbones as well as test-set performance on standard benchmarks (Appendix A): we use COCO panoptic [28] for panoptic, COCO [35] for instance, and ADE20K [65] for semantic segmentation. Then, we provide more detailed results on additional datasets (Appendix B). Finally, we provide additional ablation studies (Appendix C) and visualizations of Mask2Former predictions for all three segmentation tasks (Appendix D).

A. Additional results

Here, we provide more results of Mask2Former with different backbones on COCO panoptic [28] for panoptic segmentation, COCO [35] for instance segmentation and ADE20K [65] for semantic segmentation. More specifically, for each benchmark, we evaluate Mask2Former with ResNet [25] with 50 and 101 layers, as well as the Swin [36] Tiny, Small, Base and Large variants as backbones. We use ImageNet [44] pre-trained checkpoints to initialize the backbones.

A.1. Panoptic segmentation

In Table I, we report Mask2Former with various backbones on COCO panoptic val2017. Mask2Former outperforms all existing panoptic segmentation models with various backbones. Our best model sets a new state-of-the-art of 57.8 PQ.

In Table II, we further report the best Mask2Former model on the test-dev set. Note that Mask2Former, trained only with the standard train2017 data, achieves the absolute new state-of-the-art performance on both the validation and test set. Mask2Former even outperforms the best COCO competition entry, which uses extra training data and test-time augmentation.

A.2. Instance segmentation

In Table III, we report Mask2Former results obtained with various backbones on COCO val2017. Mask2Former outperforms the best single-scale model, HTC++ [6, 36]. Note that it is non-trivial to do multi-scale inference for instance-level segmentation tasks without introducing complex post-processing like non-maximum suppression. Thus, we only compare Mask2Former with other single-scale inference models. We believe multi-scale inference can further improve Mask2Former's performance, and it remains interesting future work.

In Table IV, we further report the best Mask2Former model on the test-dev set. Mask2Former achieves the absolute new state-of-the-art performance on both the validation and test set. On the one hand, Mask2Former is extremely good at segmenting large objects: we even outperform the challenge winner (which uses extra training data, model ensembles, etc.) on AP_L by a large margin without any bells-and-whistles. On the other hand, the poor performance on small objects leaves room for further improvement in the future.

A.3. Semantic segmentation

In Table V, we report Mask2Former results obtained with various backbones on ADE20K val. Mask2Former outperforms all existing semantic segmentation models with various backbones. Our best model sets a new state-of-the-art of 57.7 mIoU.

In Table VI, we further report the best Mask2Former model on the test set. Following [14], we train Mask2Former on the union of the ADE20K train and val sets with an ImageNet-22K pre-trained checkpoint and use multi-scale inference. Mask2Former is able to outperform previous state-of-the-art methods on all metrics.

B. Additional datasets

We study Mask2Former on three image segmentation tasks (panoptic, instance and semantic segmentation) using four datasets. Here we report additional results on Cityscapes [16], ADE20K [65] and Mapillary Vistas [42], as well as more detailed training settings.

B.1. Cityscapes

Cityscapes is an urban egocentric street-view dataset with high-resolution images (1024 × 2048 pixels). It contains 2975 images for training, 500 images for validation and 1525 images for testing, with a total of 19 classes.

Training settings. For all three segmentation tasks, we use a crop size of 512 × 1024, a batch size of 16 and train all models for 90k iterations. During inference, we operate on the whole image (1024 × 2048). Other implementation details largely follow Section 4.1 (panoptic and instance segmentation follow the semantic segmentation training settings), except that we use 200 queries for panoptic and instance segmentation models with the Swin-L backbone. All other backbones and semantic segmentation models use 100 queries.

Results. In Table VII, we report Mask2Former results obtained with various backbones on Cityscapes for the three segmentation tasks and compare it with other state-of-the-art methods that do not use extra data. For panoptic segmentation, Mask2Former with Swin-L backbone outperforms the state-of-the-art Panoptic-DeepLab [11] with SWideRNet [9] using single-scale inference. For semantic segmentation, Mask2Former with Swin-B backbone outperforms the state-of-the-art SegFormer [59].
| method | backbone | search space | epochs | PQ | PQ^Th | PQ^St | AP^Th_pan | mIoU_pan | #params. | FLOPs |
| CNN backbones | | | | | | | | | | |
| DETR [5] | R50 | 100 queries | 500+25 | 43.4 | 48.2 | 36.3 | 31.1 | - | - | - |
| DETR [5] | R101 | 100 queries | 500+25 | 45.1 | 50.5 | 37.0 | 33.0 | - | - | - |
| K-Net [62] | R50 | 100 queries | 36 | 47.1 | 51.7 | 40.3 | - | - | - | - |
| Panoptic SegFormer [32] | R50 | 400 queries | 50 | 50.0 | 56.1 | 40.8 | - | - | 47M | 246G |
| MaskFormer [14] | R50 | 100 queries | 300 | 46.5 | 51.0 | 39.8 | 33.0 | 57.8 | 45M | 181G |
| MaskFormer [14] | R101 | 100 queries | 300 | 47.6 | 52.5 | 40.3 | 34.1 | 59.3 | 64M | 248G |
| Mask2Former (ours) | R50 | 100 queries | 50 | 51.9 | 57.7 | 43.0 | 41.7 | 61.7 | 44M | 226G |
| Mask2Former (ours) | R101 | 100 queries | 50 | 52.6 | 58.5 | 43.7 | 42.6 | 62.4 | 63M | 293G |
| Transformer backbones | | | | | | | | | | |
| Max-DeepLab [52] | Max-S | 128 queries | 216 | 48.4 | 53.0 | 41.5 | - | - | 62M | 324G |
| Max-DeepLab [52] | Max-L | 128 queries | 216 | 51.1 | 57.0 | 42.2 | - | - | 451M | 3692G |
| Panoptic SegFormer [32] | PVTv2-B5 [54] | 400 queries | 50 | 54.1 | 60.4 | 44.6 | - | - | 101M | 391G |
| K-Net [62] | Swin-L† | 100 queries | 36 | 54.6 | 60.2 | 46.0 | - | - | - | - |
| MaskFormer [14] | Swin-T | 100 queries | 300 | 47.7 | 51.7 | 41.7 | 33.6 | 60.4 | 42M | 179G |
| MaskFormer [14] | Swin-S | 100 queries | 300 | 49.7 | 54.4 | 42.6 | 36.1 | 61.3 | 63M | 259G |
| MaskFormer [14] | Swin-B | 100 queries | 300 | 51.1 | 56.3 | 43.2 | 37.8 | 62.6 | 102M | 411G |
| MaskFormer [14] | Swin-B† | 100 queries | 300 | 51.8 | 56.9 | 44.1 | 38.5 | 63.6 | 102M | 411G |
| MaskFormer [14] | Swin-L† | 100 queries | 300 | 52.7 | 58.5 | 44.0 | 40.1 | 64.8 | 212M | 792G |
| Mask2Former (ours) | Swin-T | 100 queries | 50 | 53.2 | 59.3 | 44.0 | 43.3 | 63.2 | 47M | 232G |
| Mask2Former (ours) | Swin-S | 100 queries | 50 | 54.6 | 60.6 | 45.7 | 44.7 | 64.2 | 69M | 313G |
| Mask2Former (ours) | Swin-B | 100 queries | 50 | 55.1 | 61.0 | 46.1 | 45.2 | 65.1 | 107M | 466G |
| Mask2Former (ours) | Swin-B† | 100 queries | 50 | 56.4 | 62.4 | 47.3 | 46.3 | 67.1 | 107M | 466G |
| Mask2Former (ours) | Swin-L† | 200 queries | 100 | 57.8 | 64.2 | 48.1 | 48.6 | 67.4 | 216M | 868G |

Table I. Panoptic segmentation on COCO panoptic val2017 with 133 categories. Mask2Former outperforms all existing panoptic segmentation models by a large margin with different backbones on all metrics. Our best model sets a new state-of-the-art of 57.8 PQ. Besides PQ for panoptic segmentation, we also report AP^Th_pan (the AP evaluated on the 80 "thing" categories using instance segmentation annotation) and mIoU_pan (the mIoU evaluated on the 133 categories for semantic segmentation converted from panoptic segmentation annotation) of the same model trained for panoptic segmentation (note: we train all our models with panoptic segmentation annotation only). Backbones pre-trained on ImageNet-22K are marked with †.

Table II. Panoptic segmentation on COCO panoptic test-dev with 133 categories. Mask2Former, without any bells-and-whistles, outperforms the challenge winner (which uses extra training data, model ensembles, etc.) on the test-dev set. We only train our model on the COCO train2017 set with an ImageNet-22K pre-trained checkpoint.
| method | backbone | search space | epochs | AP | AP_S | AP_M | AP_L | AP_boundary | #params. | FLOPs |
| CNN backbones | | | | | | | | | | |
| Mask R-CNN [24] | R50 | dense anchors | 36 | 37.2 | 18.6 | 39.5 | 53.3 | 23.1 | 44M | 201G |
| Mask R-CNN [24] | R50 | dense anchors | 400 | 42.5 | 23.8 | 45.0 | 60.0 | 28.0 | 46M | 358G |
| Mask R-CNN [24] | R101 | dense anchors | 36 | 38.6 | 19.5 | 41.3 | 55.3 | 24.5 | 63M | 266G |
| Mask R-CNN [24] | R101 | dense anchors | 400 | 43.7 | 24.6 | 46.4 | 61.8 | 29.1 | 65M | 423G |
| Mask2Former (ours) | R50 | 100 queries | 50 | 43.7 | 23.4 | 47.2 | 64.8 | 30.6 | 44M | 226G |
| Mask2Former (ours) | R101 | 100 queries | 50 | 44.2 | 23.8 | 47.7 | 66.7 | 31.1 | 63M | 293G |
| Transformer backbones | | | | | | | | | | |
| QueryInst [20] | Swin-L† | 300 queries | 50 | 48.9 | 30.8 | 52.6 | 68.3 | 33.5 | - | - |

Table III. Instance segmentation on COCO val2017 with 80 categories. Mask2Former outperforms strong Mask R-CNN [24] baselines with 8× fewer training epochs on both the AP and AP_boundary [12] metrics. Our best model is also competitive with the state-of-the-art specialized instance segmentation model on COCO and has higher boundary quality. For a fair comparison, we only consider single-scale inference and models trained using only COCO train2017 data. Backbones pre-trained on ImageNet-22K are marked with †.

Table IV. Instance segmentation on COCO test-dev with 80 categories. Mask2Former is extremely good at segmenting large objects: we even outperform the challenge winner (which uses extra training data, model ensembles, etc.) on AP_L by a large margin without any bells-and-whistles. We only train our model on the COCO train2017 set with an ImageNet-22K pre-trained checkpoint.
C. Additional ablation studies

We perform additional ablation studies of Mask2Former using the same settings that we used in the main paper: a single ResNet-50 backbone [25].

C.1. Convergence analysis

We train Mask2Former for 12, 25, 50 and 100 epochs with either standard scale augmentation (Standard Aug.) [57] or the more recent large-scale jittering augmentation (LSJ Aug.) [18, 23]. As shown in Figure IV, Mask2Former converges in 25 epochs using standard augmentation and almost converges in 50 epochs using large-scale jittering augmentation. This shows that Mask2Former with our proposed Transformer decoder converges faster than models using the standard Transformer decoder: e.g., DETR [5] and MaskFormer [14] require 500 epochs and 300 epochs respectively.

Figure I. Masked attention analysis. (Panel (b): cumulative attention weights on foreground (fg) and background (bg) regions for different resolutions.)

C.2. Masked attention analysis

We quantitatively and qualitatively analyzed the COCO panoptic model with the R50 backbone. First, we visual-
| method | backbone | crop size | mIoU (s.s.) | mIoU (m.s.) | #params. | FLOPs |
| CNN backbones | | | | | | |
| MaskFormer [14] | R50 | 512 × 512 | 44.5 | 46.7 | 41M | 53G |
| MaskFormer [14] | R101 | 512 × 512 | 45.5 | 47.2 | 60M | 73G |
| Mask2Former (ours) | R50 | 512 × 512 | 47.2 | 49.2 | 44M | 71G |
| Mask2Former (ours) | R101 | 512 × 512 | 47.8 | 50.1 | 63M | 90G |
| Transformer backbones | | | | | | |
| Swin-UperNet [36, 58] | Swin-L† | 640 × 640 | - | 53.5 | 234M | 647G |
| FaPN-MaskFormer [14, 39] | Swin-L† | 640 × 640 | 55.2 | 56.7 | - | - |
| BEiT-UperNet [2, 58] | BEiT-L† | 640 × 640 | - | 57.0 | 502M | - |
| MaskFormer [14] | Swin-T | 512 × 512 | 46.7 | 48.8 | 42M | 55G |

Table V. Semantic segmentation on ADE20K val with 150 categories. Mask2Former consistently outperforms MaskFormer [14] by a large margin with different backbones (all Mask2Former models use MSDeformAttn [66] as pixel decoder, except Swin-L-FaPN which uses FaPN [39]). Our best model outperforms the best specialized model, BEiT [2], with less than half of the parameters. We report both single-scale (s.s.) and multi-scale (m.s.) inference results. Backbones pre-trained on ImageNet-22K are marked with †.

Table VI. Semantic segmentation on ADE20K test with 150 categories. Mask2Former outperforms previous state-of-the-art methods on all three metrics: pixel accuracy (P.A.), mIoU, as well as the final test score (the average of P.A. and mIoU). We train our model on the union of the ADE20K train and val sets with an ImageNet-22K pre-trained checkpoint following [14] and use multi-scale inference.

ize the last three attention maps of our model using cross-attention (Figure Ia, top) and masked attention (Figure Ia, bottom) for a single query that predicts the "cat." With cross-attention, the attention map spreads over the entire image and the region with the highest response is outside the "cat" region.
| method | backbone | panoptic model: PQ (s.s.) | PQ (m.s.) | AP^Th_pan | mIoU_pan | instance model: AP | AP50 | semantic model: mIoU (s.s.) | mIoU (m.s.) |
| Panoptic-DeepLab [11] | R50 | 60.3 | - | 32.1 | 78.7 | - | - | - | - |
| Panoptic-DeepLab [11] | X71 [15] | 63.0 | 64.1 | 35.3 | 80.5 | - | - | - | - |
| Panoptic-DeepLab [11] | SWideRNet [9] | 66.4 | 67.5 | 40.1 | 82.2 | - | - | - | - |
| Panoptic FCN [31] | Swin-L† | 65.9 | - | - | - | - | - | - | - |
| Segmenter [45] | ViT-L† | - | - | - | - | - | - | - | 81.3 |
| SETR [64] | ViT-L† | - | - | - | - | - | - | - | 82.2 |
| SegFormer [59] | MiT-B5 | - | - | - | - | - | - | - | 84.0 |
| Mask2Former (ours) | R50 | 62.1 | - | 37.3 | 77.5 | 37.4 | 61.9 | 79.4 | 82.2 |
| Mask2Former (ours) | R101 | 62.4 | - | 37.7 | 78.6 | 38.5 | 63.9 | 80.1 | 81.9 |
| Mask2Former (ours) | Swin-T | 63.9 | - | 39.1 | 80.5 | 39.7 | 66.9 | 82.1 | 83.0 |
| Mask2Former (ours) | Swin-S | 64.8 | - | 40.7 | 81.8 | 41.8 | 70.4 | 82.6 | 83.6 |
| Mask2Former (ours) | Swin-B† | 66.1 | - | 42.8 | 82.7 | 42.0 | 68.8 | 83.3 | 84.5 |
| Mask2Former (ours) | Swin-L† | 66.6 | - | 43.6 | 82.9 | 43.7 | 71.4 | 83.3 | 84.3 |

Table VII. Image segmentation results on Cityscapes val. We report both single-scale (s.s.) and multi-scale (m.s.) inference results for PQ and mIoU. All other metrics are evaluated with single-scale inference. Since Mask2Former is an end-to-end model, we only use single-scale inference for instance-level segmentation tasks to avoid the need for further post-processing (e.g., NMS).

Table VIII. Image segmentation results on ADE20K val. Mask2Former is competitive with specialized models on ADE20K. Panoptic segmentation models use single-scale inference by default; multi-scale numbers are marked with ∗. For semantic segmentation, we report both single-scale (s.s.) and multi-scale (m.s.) inference results.
Number of queries. We ablate the number of queries for the three image segmentation tasks in Table Xa. For instance and semantic segmentation, using 100 queries achieves the best performance, while using 200 queries can further improve panoptic segmentation results. As panoptic segmentation is a combination of instance and semantic segmentation, it has more segments per image than the other two tasks. This ablation suggests that picking the number of queries for Mask2Former may depend on the number of segments per image for a particular task or dataset.

Learnable queries. An object query consists of two parts: object query features and object query positional embeddings. Object query features are only used as the initial input to the Transformer decoder and are updated through decoder layers, whereas query positional embeddings are added to query features in every Transformer decoder layer when computing the attention weights. In DETR [5], query features are zero-initialized and query positional embeddings are learnable. Furthermore, there is no direct supervision on these query features before feeding them into the Transformer (since they are zero vectors). In our Mask2Former, we still make the query positional embeddings learnable. In addition, we make the query features learnable as well and directly apply losses on these learnable query features before feeding them into the Transformer decoder. In Table Xb, we compare our learnable query features with the zero-initialized query features in DETR. We find it is important to directly supervise object queries even before feeding them into the Transformer decoder. Learnable queries without supervision perform similarly to zero-initialized queries in DETR.

C.4. MaskFormer vs. Mask2Former

Mask2Former builds upon the same meta architecture as MaskFormer [14] with two major differences: 1) we use more advanced training parameters, summarized in Table XIa; and 2) we propose a new Transformer decoder with masked attention, instead of using the standard Transformer decoder, as well as some optimization improvements, summarized in Table XIb. To better understand Mask2Former's improvements over MaskFormer, we perform ablation studies on the training parameter improvements and the Transformer decoder improvements in isolation.

In Table XIc, we study our new training parameters. We find that MaskFormer benefits from our new training parameters as well.
| method | backbone | panoptic model: PQ | mIoU_pan | semantic model: mIoU (s.s.) | mIoU (m.s.) |
| Panoptic-DeepLab [11] | ensemble | 42.2∗ | 58.7∗ | - | - |
| Panoptic-DeepLab [11] | SWideRNet [9] | 43.7 | 59.4 | - | - |
| Panoptic-DeepLab [11] | SWideRNet [9] | 44.8∗ | 60.0∗ | - | - |
| Panoptic FCN [31] | Swin-L† | 45.7 | - | - | - |
| MaskFormer [14] | R50 | - | - | 53.1 | 55.4 |
| HMSANet [48] | HRNet [53] | - | - | - | 61.1 |
| Mask2Former (ours) | R50 | 36.3 | 50.7 | 57.4 | 59.0 |
| Mask2Former (ours) | Swin-L† | 45.5 | 60.8 | 63.2 | 64.7 |

Table IX. Image segmentation results on Mapillary Vistas val. Mask2Former is competitive with specialized models on Mapillary Vistas. Panoptic segmentation models use single-scale inference by default; multi-scale numbers are marked with ∗. For semantic segmentation, we report both single-scale (s.s.) and multi-scale (m.s.) inference results.
D. Visualization

We visualize sample predictions of the Mask2Former model with the Swin-L [36] backbone on three tasks: the COCO panoptic val2017 set for panoptic segmentation (57.8 PQ) in Figure V, the COCO val2017 set for instance segmentation (50.1 AP) in Figure VI, and the ADE20K validation set for semantic segmentation (57.7 mIoU) in Figure VII.
Figure IV. Convergence analysis. We train Mask2Former with different numbers of epochs using either standard scale augmentation (Standard Aug.) [57] or the more recent large-scale jittering augmentation (LSJ Aug.) [18, 23]. Mask2Former converges in 25 epochs using standard augmentation and almost converges in 50 epochs using large-scale jittering augmentation. Using LSJ also improves performance with longer training (i.e., with more than 25 epochs).
Table X. Analysis of object queries. Table Xa: ablation on number of queries. Table Xb: ablation on using learnable queries.
| model | training params. | AP (COCO) | PQ (COCO) | mIoU (ADE20K) |
| MaskFormer | MaskFormer | 34.0 | 46.5 | 44.5 |
| MaskFormer | Mask2Former | 37.8 (+3.8) | 48.2 (+1.7) | 45.3 (+0.8) |

(c) Improvements from better training parameters.

| Transformer decoder | pixel decoder | AP (COCO) | PQ (COCO) | mIoU (ADE20K) |
| MaskFormer | FPN | 37.8 | 48.2 | 45.3 |
| Mask2Former | FPN | 41.5 (+3.7) | 50.7 (+2.5) | 45.6 (+0.3) |

(d) Improvements from better Transformer decoder.

Table XI. MaskFormer vs. Mask2Former. Table XIa and Table XIb provide an in-depth comparison between the MaskFormer and Mask2Former settings. Table XIc: MaskFormer benefits from our new training parameters as well. Table XId: comparison between MaskFormer and our Mask2Former with the exact same backbone, pixel decoder and training parameters. The improvements come solely from a better Transformer decoder.
Figure V. Visualization of panoptic segmentation predictions on the COCO panoptic dataset: Mask2Former with Swin-L backbone which
achieves 57.8 PQ on the validation set. First and third columns: ground truth. Second and fourth columns: prediction. Last row shows
failure cases.
Figure VI. Visualization of instance segmentation predictions on the COCO dataset: Mask2Former with Swin-L backbone which achieves
50.1 AP on the validation set. First and third columns: ground truth. Second and fourth columns: prediction. Last row shows failure
cases. We show predictions with confidence scores greater than 0.5.
Figure VII. Visualization of semantic segmentation predictions on the ADE20K dataset: Mask2Former with Swin-L backbone which
achieves 57.7 mIoU (multi-scale) on the validation set. First and third columns: ground truth. Second and fourth columns: prediction.
Last row shows failure cases.