Multi-modal Remote Sensing Image Description
Based on Word Embedding and Self-Attention
Mechanism
1st Yuan Wang
dept. College of Information Science and Engineering
Xinjiang University
Urumqi, China
107551601496@stu.xju.edu.cn

2nd Kuerban Alifu
College of Software
Xinjiang University
Urumqi, China
ghalipk@xju.edu.cn

3rd Hongbing Ma
dept. Electronic Engineering
Tsinghua University
Beijing, China
hbma@tsinghua.edu.cn

4th Junli Li
dept. Xinjiang Institute of Ecology and Geography
Chinese Academy of Sciences
Urumqi, China
lijl@ms.xjb.ac.cn

5th Umut Halik
dept. College of Resource and Environment Sciences
Xinjiang University
Urumqi, China
halik@xjb.edu.cn

6th Yalong Lv
dept. College of Information Science and Engineering
Xinjiang University
Urumqi, China
lvyalong_gdd@163.com
Abstract—When describing and identifying objects in microwave images, traditional multi-modal models are relatively weak at capturing complex image content, and the sentences they generate are comparatively simple. In this paper, a multi-modal remote sensing semantic description and recognition method based on the self-attention mechanism, combined with the Ngram2vec word embedding technique, is proposed. First, Ngram2vec is used to mine the semantic information and contextual features between the pixel to be identified and its adjacent pixels within the neighborhood window. Second, a self-attention mechanism is introduced to further learn the internal structural information of all pixels in the neighborhood window and generate a multi-dimensional representation. Finally, to avoid the loss of information transmitted between layers, densely connected networks (DenseNets) are used to integrate the information flow, and a multi-layer independently recurrent neural network is added between the densely connected modules to alleviate the vanishing-gradient problem. Experimental results show that the proposed method outperforms traditional deep learning methods in image description and recognition.
Keywords—Remote sensing imagery; Word embedding; Densely connected network; Independent recurrent neural network; Vanishing gradient
INTRODUCTION
With the continuous progress of remote sensing
technology and the excellent application of deep learning in
many fields such as natural language processing, image
generation, target detection and speech recognition, new ideas
related to the semantic description of remote sensing images
and object recognition have emerged. However, compared
with natural images, remote sensing images are characterised
by ambiguous semantics. Therefore, an important research
topic is how to use multi-modal models and natural language processing technology to generate precise and concise natural sentences that describe the complex content of remote sensing images.
RELEVANT RESEARCH
In recent years, owing to the continuous progress of satellite technology, intelligent processing of remote sensing images has attracted considerable attention. Although the content of remote sensing images is complex and image description is a challenging task, researchers in China and abroad have designed numerous methods for natural image description generation. Mou L [1] decodes natural image representations into natural language sentences by combining traditional hand-crafted features with a recurrent neural network (RNN). Although good classification results have been achieved, hand-crafted features require manual threshold setting, which makes it difficult to meet large-scale application needs. To avoid manual threshold setting, Jangtjik K A [2] uses deep convolutional features instead of hand-crafted features, decodes the deep features with a long short-term memory network (LSTM), and generates corresponding natural language sentences to describe natural images. However, the generated sentences are too simple to fully describe the complex content of an image. Jia Y [3] and Sahadun N A [4] introduce retrieval-based and object-detection-based methods to further decode the objects in a natural image into precise natural language sentences, thereby improving description accuracy. Although these methods have been successful in describing natural images, they cannot effectively describe remote sensing images because of the complexity of object semantics in such images. Therefore, some researchers have studied the description of remote sensing images and the generation of natural sentences from them. Jia H L [5] and Cheng G [6] propose deep multi-modal neural network models to analyse the semantics of high-resolution remote sensing images. Frameworks for remote sensing image description based on convolutional neural networks are proposed by Yao Y [7] and Chen J [8]. However, these methods all rely on a convolutional neural network (CNN) to represent images and use predefined templates with sequences in a recurrent neural network (RNN) to generate corresponding