Semantic Segmentation for High-resolution Remote
Sensing Images Based on the Mask2Former Model
Yicheng Qiao
Beijing Sport University
2020011020@bsu.edu.cn
Wei Liu
University of ****
******.edu
Pengyun Wang Bin Liang
***
teacher Zhang
teacher Yang
Abstract— With the development of remote sensing, semantic
segmentation of high-resolution remote sensing images (RSIs)
is increasingly essential. At the same time, the characteristics
of objects in RSIs, such as large size, variation in object scales,
and complex details, make it necessary to capture both
long-range context and local information. FCN-based methods such
as FCN and FastFCN lack the ability to capture long-range
dependencies because of the limited receptive field of CNNs.
In contrast, the self-attention mechanism in Transformer models
is remarkably capable of capturing long-range context. One of
the most outstanding Transformer models is Mask2Former
(Masked-attention Mask Transformer), which adopts the mask
classification method. Concretely, mask classification generates
one or more masks for each category to perform fine-grained
segmentation, which makes it especially suitable for handling
the large within-class and small between-class variance of RSIs.
Extensive experimental results show that Mask2Former obtains
better results in semantic segmentation of high-resolution RSIs
on the ISPRS Potsdam dataset than a CNN-based method (FCN) and
other state-of-the-art Transformer-based methods (SegFormer,
Segmenter, Swin Transformer). Extensive ablation studies
conducted on the Potsdam dataset verify the contribution of each
component or optimization strategy in Mask2Former.
TABLE OF CONTENTS
1. INTRODUCTION
2. METHOD
3. EXPERIMENT
4. CONCLUSION
5. ACKNOWLEDGMENT
REFERENCES
1. INTRODUCTION
Semantic segmentation, or image segmentation, is the task
of clustering together parts of an image that belong to the
same object class. It is a form of pixel-level prediction, as
each pixel in an image is classified according to a category.
Semantic segmentation is a fundamental and challenging
task in computer vision. Along with the development of
remote sensing technology, the accuracy of image
processing algorithms for remote sensing systems is of
particular importance. Semantic segmentation of very fine
resolution remote sensing images (RSIs) plays a crucial role
in various urban applications, including environmental
assessment, vehicle monitoring, and land cover mapping,
and it also contributes to the refinement of image
processing algorithms for imaging remote sensing systems.
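As a minimal illustration of the pixel-level prediction described above, the following sketch assigns each pixel the class with the highest score. The score values, the three-class setup, and the function name `segment` are assumptions chosen for illustration; they are not taken from the model or dataset used in this paper.

```python
def segment(scores):
    """Return a label map from per-class score maps.

    scores[c][y][x] is the (hypothetical) score of class c at pixel (y, x).
    Each pixel receives the index of its highest-scoring class.
    """
    num_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [
        [max(range(num_classes), key=lambda c: scores[c][y][x])
         for x in range(width)]
        for y in range(height)
    ]


# Toy 2x3 image with made-up scores for three classes.
scores = [
    [[0.90, 0.20, 0.10], [0.80, 0.10, 0.30]],  # class 0
    [[0.05, 0.70, 0.20], [0.10, 0.60, 0.30]],  # class 1
    [[0.05, 0.10, 0.70], [0.10, 0.30, 0.40]],  # class 2
]
label_map = segment(scores)
print(label_map)  # [[0, 1, 2], [0, 1, 2]]
```

In practice the score maps would come from a segmentation network rather than being written by hand; only the final per-pixel decision is shown here.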
For the semantic segmentation task of RSI, there are several
key challenges. Firstly, RSI is characterized by large size,
978-1-6654-9032-0/23/$31.00 ©2023 IEEE
diverse object scales, and complex detail. These characteristics
make it necessary to capture both long-range context and local
information. Existing CNN-based algorithms
lack the ability to model long-range context in RSI.
Secondly, existing supervised learning methods require a
large amount of labeled data, which is often labour-intensive to produce.
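The limited long-range modeling ability of CNNs noted in the first challenge can be illustrated with the standard receptive-field recurrence for stacked convolutions. The sketch below is illustrative only: the layer configurations and the helper name `receptive_field` are assumptions, and it computes the theoretical receptive field, ignoring dilation and the fact that the effective receptive field is usually smaller.

```python
def receptive_field(layers):
    """Theoretical receptive field (in input pixels) of stacked conv layers.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    Uses the recurrence r <- r + (k - 1) * j, j <- j * s, where j is the
    cumulative stride ("jump") of the current feature map.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


# Hypothetical stacks of 3x3 convolutions.
rf_stride1 = receptive_field([(3, 1)] * 5)  # five stride-1 layers
rf_stride2 = receptive_field([(3, 2)] * 5)  # five stride-2 layers
print(rf_stride1, rf_stride2)  # 11 63
```

Even with downsampling, a single output location in this toy stack covers only a few dozen input pixels per side, whereas self-attention can relate any two positions in one layer.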
Following the introduction of the Vision Transformer (ViT
[1]) in the field of computer vision, Transformer-based models
have shown excellent results and are gradually becoming
a standard paradigm for semantic segmentation tasks.
Transformer-based models are also well suited to
remote sensing image processing. One of the
models that has excelled in various benchmarks is MaskFormer [2],
which uses a mask classification approach that is particularly
well suited to the large intra-class variance and
small inter-class variance of RSIs. Mask classification
involves predicting a set of binary masks and assigning a class
to each mask. However, given the long-range dependencies
and extremely high resolution that characterize RSIs, directly
transferring MaskFormer to semantic segmentation of RSIs
still suffers from the following problems: (1) The attention
area of individual MaskFormer tokens is limited by a local
window, which impairs prediction on RSIs containing
long-range contextual information. (2) MaskFormer's
original pixel decoder upsamples directly using
nearest-neighbor interpolation, which is still relatively
coarse for accurate segmentation of fine-resolution RSIs that
require multi-scale semantic features. In this paper, we use
the SeMask-Mask2Former (Masked-attention Mask Transformer)
model to overcome these problems. It makes the following
improvements to MaskFormer. Firstly, Mask2Former
adds a masked attention mechanism to the Transformer decoder,
which restricts cross-attention to the foreground region of the
mask predicted for each query. This is suitable for processing
the large size and complex details of RSIs. Secondly, Mask2Former
feeds multi-scale features to the decoder for attention,
which is very helpful for improving the segmentation of small
objects and regions. In addition, Mask2Former adopts
optimizations that improve performance without extra
computation, such as
switching the order of self-attention and cross-attention in
the decoder, making query features learnable, and removing
dropout. Finally, Mask2Former reduces training memory by 3×
without affecting performance by calculating mask losses
at K randomly sampled points. These four improvements increase
both the segmentation accuracy and the computational
efficiency of semantic segmentation of RSIs.
Extensive ablation studies conducted on the Potsdam [3] dataset
verify the contribution of each component or optimization
strategy in Mask2Former.
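To make the mask classification idea concrete, the sketch below turns a set of predicted masks and per-mask class scores into a semantic label map. It assumes a MaskFormer-style inference rule, in which each pixel's class score is the sum over masks of class probability times mask probability, with the extra "no object" class dropped. All logits, shapes, and the function name `semantic_inference` are hypothetical and chosen for illustration; this is not the implementation used in this paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def semantic_inference(class_logits, mask_logits):
    """Combine Q mask predictions into an H x W label map.

    class_logits: [Q][C + 1], last entry is the assumed "no object" class.
    mask_logits:  [Q][H][W], one binary-mask logit map per query.
    """
    num_queries = len(mask_logits)
    num_classes = len(class_logits[0]) - 1
    # Class probabilities per query, with the "no object" entry removed.
    class_probs = [softmax(row)[:num_classes] for row in class_logits]
    height, width = len(mask_logits[0]), len(mask_logits[0][0])
    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            pixel_scores = [
                sum(class_probs[q][c] * sigmoid(mask_logits[q][y][x])
                    for q in range(num_queries))
                for c in range(num_classes)
            ]
            row.append(max(range(num_classes), key=pixel_scores.__getitem__))
        labels.append(row)
    return labels


# Three hypothetical queries on a 1x3 image with two classes.
# Queries 0 and 2 both predict class 0 but cover different pixels,
# showing that one class may be represented by more than one mask.
class_logits = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [5.0, 0.0, 0.0]]
mask_logits = [
    [[4.0, -4.0, -4.0]],
    [[-4.0, 4.0, -4.0]],
    [[-4.0, -4.0, 4.0]],
]
labels = semantic_inference(class_logits, mask_logits)
print(labels)  # [[0, 1, 0]]
```

Because the class decision is attached to a whole mask rather than to each pixel independently, visually different regions of the same class can be captured by separate masks and still receive the same label.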
The main contributions of our work can be summarized as
follows:
(1) We apply the SeMask-Mask2Former model to the
semantic segmentation task of RSI, and we adopt the
Mask2Former [4] mask classification method in our network
architecture, considering that
mask classification is particularly suitable for handling both