3.1. Overview
For regular semantic segmentation, the scene to be segmented contains a variety of objects of diverse scales, seen under different lighting and from different viewpoints. In remote sensing images, however, the samples are captured from essentially the same shooting angle and distance, so the boundary problem is more than just a multi-scale and multi-angle problem. A remote sensing image contains many different types of land cover. In general, each type of land cover has its own spectral and structural characteristics, which appear as differences in brightness, pixel values, or spatial variation in the image. Owing to the complexity of the composition, nature, distribution, and imaging conditions of surface features, remote sensing images exhibit the phenomena of “same object, different spectrum” and “same spectrum, different object”. In addition, “mixed pixels”, in which two or more kinds of land cover fall within a single pixel or the instantaneous field of view, make recognition in remote sensing images even more complex. All of these factors can affect the accuracy of the result. To deal with this, our proposed method enhances the aggregated channel and spatial features separately, thus improving the feature representation for remote sensing segmentation.
Our method can be used with any semantic segmentation model, such as U-Net or PSP-Net. Taking PSP-Net as an example, its basic structure is shown in Figure 1 [22]. The input image (a) is fed into a Convolutional Neural Network (CNN) to obtain the feature map of the last convolutional layer (b). A pyramid parsing module (c) is then applied to obtain representations of different sub-regions, followed by upsampling and concatenation to form the final feature representation, which contains both local and global context information. Finally, a convolutional layer produces the per-pixel prediction (d) from this representation.
Figure 1. Overview of PSP-Net.
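For illustration, the pyramid parsing step described above can be sketched as follows. This is a minimal PyTorch sketch of a pyramid pooling module, not the exact PSP-Net implementation; the bin sizes (1, 2, 3, 6), the channel counts, and the class name PyramidPoolingModule are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPoolingModule(nn.Module):
    """Minimal sketch of a pyramid pooling module (PPM).

    Pools the backbone feature map at several bin sizes, reduces each pooled
    map with a 1x1 convolution, upsamples it back to the input resolution,
    and concatenates everything with the original features.
    """
    def __init__(self, in_channels=2048, bin_sizes=(1, 2, 3, 6)):
        super().__init__()
        out_channels = in_channels // len(bin_sizes)  # channels per pyramid branch (assumed)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(bin_size),                       # pool to bin_size x bin_size
                nn.Conv2d(in_channels, out_channels, kernel_size=1),  # reduce channel dimension
                nn.BatchNorm2d(out_channels),
                nn.ReLU(inplace=True),
            )
            for bin_size in bin_sizes
        ])

    def forward(self, x):
        h, w = x.shape[2:]
        pyramid = [x]  # keep the original (local) features
        for branch in self.branches:
            # upsample each pooled sub-region representation back to h x w
            pyramid.append(F.interpolate(branch(x), size=(h, w),
                                         mode='bilinear', align_corners=False))
        return torch.cat(pyramid, dim=1)  # local + multi-scale context
```

In PSP-Net, this concatenated representation is then passed to a final convolutional classifier to obtain the per-pixel prediction.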
The general structure of DPA-PSP-Net is shown in Figure 2. We employed a pretrained ResNet50 [48] with a dilated strategy [38] as the backbone. Following the structure of ResNet50, the proposed framework consists of four residual blocks, a Pyramid Pooling Module (PPM), a channel attention module, and a spatial attention module. As in PSP-Net, we removed the down-sampling operations in the last two residual blocks and employed dilated convolutions instead, so the final feature map is at 1/8 of the scale of the input image. Given an input image with a size of 256 px × 256 px, we used ResNet50 to obtain the feature map F1, while the spatial attention module produced the spatial attention weighting factor Ws. F1 was also fed into the PPM and the channel attention module, respectively, to obtain the feature map F2 (after up-sampling) and the channel attention weighting factor Wc. Finally, F2 was multiplied by Wc and Ws to obtain the channel attention-weighted feature map FC and the spatial attention-weighted feature map FS. Then, FC and FS were aggregated to produce the final output.
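To make the data flow concrete, the following is a minimal sketch of the forward pass described above, assuming the attention modules and the PPM are supplied as submodules. The names (backbone, ppm, channel_attention, spatial_attention, DPAHead) and the way FC and FS are aggregated (element-wise addition) are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DPAHead(nn.Module):
    """Sketch of the DPA-PSP-Net data flow described in the text.

    backbone          : dilated ResNet50 producing F1 at 1/8 input scale
                        (e.g. a 256 x 256 input gives a 32 x 32 feature map)
    spatial_attention : produces the spatial weighting factor Ws from F1
    ppm               : produces the feature map F2 from F1
    channel_attention : produces the channel weighting factor Wc from F1
    """
    def __init__(self, backbone, ppm, channel_attention, spatial_attention,
                 ppm_out_channels, num_classes):
        super().__init__()
        self.backbone = backbone
        self.ppm = ppm
        self.channel_attention = channel_attention
        self.spatial_attention = spatial_attention
        self.classifier = nn.Conv2d(ppm_out_channels, num_classes, kernel_size=1)

    def forward(self, x):
        f1 = self.backbone(x)            # F1, at 1/8 of the input resolution
        ws = self.spatial_attention(f1)  # Ws: assumed shape (N, 1, H, W)
        wc = self.channel_attention(f1)  # Wc: assumed shape (N, C, 1, 1)
        f2 = self.ppm(f1)                # F2 after pyramid pooling and up-sampling
        fc = f2 * wc                     # channel attention-weighted feature map FC
        fs = f2 * ws                     # spatial attention-weighted feature map FS
        out = self.classifier(fc + fs)   # aggregate the two paths (addition assumed)
        # upsample back to the input resolution for the per-pixel prediction
        return F.interpolate(out, size=x.shape[2:],
                             mode='bilinear', align_corners=False)
```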