Research on Semantic Segmentation of High-Resolution Remote Sensing Images Based on Mask2Former

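"This document studies semantic segmentation of high-resolution remote sensing images with the Mask2Former model. The authors discuss why this task matters and the challenges that current methods face on such images, including object size, scale variation, and complex detail. The paper highlights the limits of convolutional neural networks (CNNs) in capturing long-range context, where the self-attention mechanism of Transformer models has a clear advantage, in particular Mask2Former, which uses mask classification for fine-grained segmentation."

This document focuses on semantic segmentation of high-resolution remote sensing images. Semantic segmentation is a key task in computer vision: it assigns every pixel of an image to a category in order to understand the image content. As remote sensing technology advances, accurate analysis of high-resolution remote sensing images is increasingly important for applications such as environmental monitoring, urban planning, and disaster response.

Methods based on fully convolutional networks, such as FCN and FastFCN, are widely used in remote sensing image processing, but the limited receptive field of CNNs leaves them weak at capturing global context. This is a problem for remote sensing images, where objects are often large, vary widely in shape and scale, and carry rich local detail.

To address this, researchers have turned to Transformer models, and to Mask2Former in particular. Transformers are known for their self-attention mechanism, which captures long-range dependencies in an image. Mask2Former adds mask classification, which can generate one or more masks for a given category and so yields finer semantic segmentation. The approach captures global information while still handling local detail, which suits remote sensing images.

The paper likely discusses the architecture and working principles of Mask2Former in detail, including how self-attention learns context and how mask classification refines the segmentation. It likely also describes the experiments, including dataset choice, evaluation metrics, and comparison with existing methods, to show the advantage of Mask2Former on this task. Overall, the document studies how Transformer models, and Mask2Former in particular, can improve semantic segmentation of high-resolution remote sensing images, which matters for understanding and improving remote sensing image analysis.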

Semantic Segmentation for High-resolution Remote Sensing Images Based on the Mask2Former Model

Yicheng Qiao

Beijing Sport University

2020011020@bsu.edu.cn

Wei Liu

University of ****

******.edu

Pengyun Wang Bin Liang

***

teacher Zhang

teacher Yang

Abstract— With the development of remote sensing, semantic segmentation of high-resolution remote sensing images (RSIs) is increasingly essential. At the same time, the characteristics of objects in RSIs, such as large size, variation in object scales, and complex details, make it necessary to capture both long-range context and local information. FCN-based methods such as FCN and FastFCN lack the ability to capture long-range dependencies, due to the limited receptive field of CNNs. However, the self-attention mechanism in Transformer models has a remarkable capability for capturing long-range context. One of the most outstanding Transformer models is Mask2Former (Masked-attention Mask Transformer), which adopts the mask classification method. Concretely, mask classification generates one or more masks for each category to perform fine-grained segmentation, which makes it especially suitable for handling the large within-class and small between-class variance of RSIs. Extensive experimental results show that Mask2Former obtains better results in semantic segmentation of high-resolution RSIs on the ISPRS Potsdam dataset than CNN-based methods (FCN) and other state-of-the-art transformer-based methods (SegFormer, Segmenter, Swin Transformer). Extensive ablation studies conducted on the Potsdam dataset verify the contribution of each component or optimization strategy in Mask2Former.

TABLE OF CONTENTS

1. INTRODUCTION

2. METHOD

3. EXPERIMENT

4. CONCLUSION

5. ACKNOWLEDGMENT

REFERENCES

1. INTRODUCTION

Semantic segmentation, or image segmentation, is the task of clustering together parts of an image that belong to the same object class. It is a form of pixel-level prediction, as each pixel in an image is classified according to a category. Semantic segmentation is a fundamental and challenging task in computer vision. Along with the development of remote sensing technology, the accuracy of image processing algorithms for remote sensing systems is of particular importance. Semantic segmentation of very fine resolution remote sensing images (RSIs) plays a crucial role in various urban applications, including assessment of the environment, monitoring of vehicles, and mapping of land cover, and it also contributes to the refinement of image processing algorithms for imaging remote sensing systems.
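As a minimal sketch of what pixel-level prediction means, the snippet below takes a score for every class at every pixel and assigns each pixel the class with the highest score. The function name `label_map` and the toy scores are illustrative assumptions, not part of the method described in this paper.

```python
def label_map(scores):
    """Return the per-pixel argmax class.

    scores[c][y][x] is the score of class c at pixel (y, x).
    """
    num_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [
        [max(range(num_classes), key=lambda c: scores[c][y][x]) for x in range(width)]
        for y in range(height)
    ]


# Toy 2 x 3 image with two classes (values are made up for illustration).
toy_scores = [
    [[0.9, 0.2, 0.4], [0.6, 0.1, 0.3]],  # class 0
    [[0.1, 0.8, 0.6], [0.4, 0.9, 0.7]],  # class 1
]
print(label_map(toy_scores))  # [[0, 1, 1], [0, 1, 1]]
```

Every pixel is labelled independently here, which is the per-pixel formulation that FCN-style models follow.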

For the semantic segmentation task of RSIs, there are several key challenges. Firstly, RSIs are characterized by large size, diverse object scales, and complex detail. These characteristics make it necessary to capture both long-range context and local information. Existing CNN-based algorithms lack the ability to model long-range context in RSIs. Secondly, existing supervised learning methods require a large amount of labeled data, and labeling is often labour-intensive.
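To make the receptive-field limitation of CNNs concrete, the sketch below applies the standard recurrence for stacked convolutions: each layer widens the receptive field by (kernel - 1) times the current stride product. The layer stacks are illustrative assumptions and do not describe any network evaluated in this paper.

```python
def receptive_field(layers):
    """Side length, in input pixels, seen by one output unit.

    layers is a list of (kernel_size, stride) pairs, applied in order.
    """
    field, jump = 1, 1  # jump = distance in input pixels between adjacent units
    for kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride
    return field


print(receptive_field([(3, 1)] * 5))  # 11: five 3x3 convolutions, stride 1
print(receptive_field([(3, 2)] * 5))  # 63: five 3x3 convolutions, stride 2
```

Even with aggressive downsampling, a handful of layers covers only tens of pixels per side, whereas self-attention relates any two positions within a single layer.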

Following the introduction of the Vision Transformer (ViT [1]) in the field of computer vision, transformer-based models have shown excellent results and are gradually becoming a standard paradigm for semantic segmentation tasks. Transformer-based models are also well suited to remote sensing image processing. One model that has excelled in various benchmarks is MaskFormer [2], which uses a mask classification approach that is particularly well suited to the large intra-class variance and small inter-class variance of RSIs. Mask classification involves predicting a set of binary masks and assigning a class to each mask. However, given the long-range dependencies and extremely high resolution that characterize RSIs, directly transferring MaskFormer to semantic segmentation of RSIs still suffers from the following problems: (1) The attention area of individual MaskFormer tokens is limited by a local window, which impairs prediction on RSIs that contain long-range contextual information. (2) MaskFormer's original pixel decoder upsamples directly using nearest-neighbor interpolation, which is still relatively coarse for accurate segmentation of fine-resolution RSIs that require multi-scale semantic features.

In this paper, we use the SeMask-Mask2Former (Masked-attention Mask Transformer) model to overcome these problems. It makes the following improvements to MaskFormer. Firstly, Mask2Former adds a masked attention mechanism to the Transformer decoder, which limits cross-attention to the predicted mask regions; this is suitable for processing the large size and complex details of RSIs. Secondly, Mask2Former feeds multi-scale features to the decoder for attention, which helps with small objects and regions. In addition, Mask2Former adopts optimizations that improve performance without extra computation, such as switching the order of self-attention and cross-attention in the decoder, making query features learnable, and removing dropout. Finally, Mask2Former reduces training memory by 3× without affecting performance by calculating mask losses at K randomly sampled points. These four improvements raise both the segmentation accuracy and the computational efficiency of semantic segmentation for RSIs. Extensive ablation studies conducted on the Potsdam [3] dataset verify the contribution of each component or optimization strategy in Mask2Former.
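The two central ideas in this paragraph, masked attention and mask classification, can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the authors' implementation: it handles a single query with a single attention head, the 0.5 threshold and the fallback to full attention for an empty mask are assumptions of the sketch, the class probabilities are assumed to already exclude any "no object" class, and the names `masked_attention_weights` and `semantic_inference` are hypothetical.

```python
import math


def masked_attention_weights(query, keys, prev_mask_probs, threshold=0.5):
    """Attention weights of one query over all pixels, limited to its predicted mask.

    query: list of d floats; keys: one list of d floats per pixel;
    prev_mask_probs: the query's mask probability per pixel from the previous layer.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    allowed = [p >= threshold for p in prev_mask_probs]
    if not any(allowed):
        # Assumption of this sketch: an empty mask falls back to full attention.
        allowed = [True] * len(keys)
    peak = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


def semantic_inference(class_probs, mask_probs):
    """Turn N (class distribution, mask) pairs into one label per pixel.

    class_probs[i][c]: probability that mask i belongs to class c.
    mask_probs[i][p]: probability that pixel p belongs to mask i.
    """
    num_classes = len(class_probs[0])
    num_pixels = len(mask_probs[0])
    labels = []
    for p in range(num_pixels):
        scores = [
            sum(cp[c] * mp[p] for cp, mp in zip(class_probs, mask_probs))
            for c in range(num_classes)
        ]
        labels.append(max(range(num_classes), key=scores.__getitem__))
    return labels


# Toy input: 4 pixels, 2 queries, 2 classes (all values made up).
weights = masked_attention_weights(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 0.0]],
    prev_mask_probs=[0.9, 0.8, 0.7, 0.1],
)
labels = semantic_inference(
    class_probs=[[0.9, 0.1], [0.2, 0.8]],
    mask_probs=[[0.9, 0.8, 0.1, 0.2], [0.1, 0.3, 0.9, 0.7]],
)
print(weights[3])  # 0.0: the pixel outside the predicted mask gets no attention
print(labels)      # [0, 0, 1, 1]
```

In the toy input, the last pixel has the largest raw attention score but lies outside the predicted mask, so it receives zero weight. In the second function, several masks may vote for the same class, which is how mask classification can represent one category with more than one mask.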

The main contributions of our work can be summarized as follows:

(1) We apply the SeMask-Mask2Former model to the semantic segmentation task of RSIs, and we adopt the Mask2Former [4] mask classification method in our network architecture, considering that mask classification is particularly suitable for handling both
