728 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 20, NO. 3, MARCH 2018
methods that are based on deep neural networks and the limitations of these methods. Second, we briefly introduce attention-based image caption approaches. Finally, we introduce some works on object detection that are related to our proposed method.
Deep neural network-based image caption: With the successful application of deep neural networks to image recognition and machine translation, the task of automatically generating image descriptions has also made significant progress. There exist several effective methods [13], [14], [30]–[33] based on deep neural networks.
As mentioned above, these approaches treat generating an image description as a translation process. They directly translate an image into a sentence by utilizing the encoder-decoder framework [28], which was originally introduced for the machine translation task. In general, this paradigm first uses a deep CNN as the encoder, which encodes an image into a static representation, and then uses an RNN as the decoder, which decodes this static representation into a meaningful sentence. The generated sentence should be grammatically correct and describe the content of the image as well as possible.
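The encoder-decoder paradigm described above can be sketched as follows. This is a minimal illustration with toy dimensions and randomly initialized weights, not any of the cited models: a stand-in "CNN" maps the image to one static feature vector, and a stand-in "RNN" greedily emits a word index at each step conditioned on that fixed vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions chosen for illustration only.
FEAT_DIM, HID_DIM, VOCAB = 8, 6, 5

def cnn_encode(image):
    """Stand-in for a deep CNN encoder: image -> one static feature vector."""
    W = rng.standard_normal((FEAT_DIM, image.size))
    return np.tanh(W @ image.ravel())

def rnn_decode(feature, max_len=4):
    """Stand-in for an RNN decoder: at each step, update the hidden state
    from the (fixed) image feature and greedily pick a word index."""
    Wh = rng.standard_normal((HID_DIM, HID_DIM))
    Wf = rng.standard_normal((HID_DIM, FEAT_DIM))
    Wo = rng.standard_normal((VOCAB, HID_DIM))
    h = np.zeros(HID_DIM)
    words = []
    for _ in range(max_len):
        h = np.tanh(Wh @ h + Wf @ feature)
        words.append(int(np.argmax(Wo @ h)))  # greedy decoding
    return words

image = rng.standard_normal((4, 4))
caption = rnn_decode(cnn_encode(image))
```

Note that the decoder sees only the single static vector at every step; this is precisely the property that causes the object-missing and misprediction problems discussed below.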
To address the task of image description, Mao et al. [32] propose a multimodal RNN (m-RNN) model which can also be used for image and sentence retrieval. The proposed m-RNN additionally utilizes a multimodal layer to connect the language model and the CNN. Similarly, Karpathy et al. [33] propose an alignment model based on a multimodal embedding layer. This alignment model can align segments of a sentence with the regions of the corresponding image that they describe. Replacing the basic RNN with LSTM, a more powerful RNN variant, Vinyals et al. [31] propose an end-to-end model named NIC that combines a deep CNN with an LSTM for this problem. Furthermore, to address the problem of "drifting away" from or "losing track" of the image content, Jia et al. [14] propose the gLSTM model, an extension of LSTM. This model utilizes semantic information extracted from the image, along with the whole image, as input to generate image descriptions. Donahue et al. [13] propose the Long-term Recurrent Convolutional Network (LRCN), which combines convolutional layers and long-range temporal recursion for visual recognition and description.
However, as shown in Fig. 1, the above-mentioned approaches may suffer from the problems of object missing and misprediction, because those methods encode the whole image into a static global feature vector. To overcome these problems, in this paper we propose to integrate object-level features with image-level features for generating image captions via the widely used attention mechanism. In the next section, we briefly review related works based on the attention mechanism.
Attention mechanism in image caption and machine translation: Recently, the attention mechanism has been widely used and proven to be important and effective in natural language processing [27] and computer vision [30], [37], [44], [45]. In essence, the attention mechanism assigns positive weights to different parts of the input to indicate the importance of these parts.
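The essence stated above can be made concrete with a small sketch (generic soft attention, not the exact formulation of any cited method): each part is scored against a query, the scores are normalized with a softmax so that the weights are positive and sum to one, and the weighted sum yields a context vector.

```python
import numpy as np

def softmax(x):
    # Subtracting the max is a standard numerical-stability trick.
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(parts, query):
    """Score each part against the query, normalize to positive weights,
    and return the weights together with the weighted sum (context)."""
    scores = parts @ query      # relevance of each part to the query
    weights = softmax(scores)   # positive, sums to 1
    context = weights @ parts   # context vector: weighted sum of parts
    return weights, context

# Three "parts" (e.g., region features) and one query vector.
parts = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([2.0, 0.0])
weights, context = attend(parts, query)
```

Parts that align with the query receive larger weights, so the context vector is dominated by the most relevant parts; this is the mechanism that the works below apply to words, image regions, video frames, and concept proposals.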
The attention mechanism was originally introduced for the machine translation task [27], in which Bahdanau et al. exploit a BRNN with an attention mechanism. This approach is able to automatically search for the parts of the source sentence that are most relevant to a target word. Subsequently, the attention mechanism was introduced into image/video understanding tasks. Xu et al. [30] explore two kinds of attention mechanisms for image caption generation, i.e., soft attention and hard attention, and analyze via visualization how the attention mechanism works in the process of generating image captions. In [37], Yao et al. address the video caption task by capturing the global temporal structure among video frames with a temporal attention mechanism based on a soft-alignment method. This temporal attention mechanism makes the model dynamically focus on the key frames that are most relevant to the predicted word. ATT [44] utilizes semantic concepts to improve performance. This method first obtains semantic concept proposals via different approaches, such as k-NN and multi-label ranking, and then integrates these concept proposals into one vector via the attention mechanism. The integrated vector is finally used to guide the language model to generate the description.
Different from the soft/hard attention method [30] and the ATT method [44], our proposed GLA method integrates local representations at the object level with global representations at the image level through the attention mechanism, aiming to address the aforementioned problems of object missing and misprediction; these problems cannot be avoided by methods that use only global image-level features. Moreover, instead of considering the semantic concepts or attributes used in ATT [44], we directly apply image visual features with the attention mechanism to image caption generation. RA [45] proposes a complicated pipeline to obtain important regions from selective search region proposals [46] and combines them with scene-specific contexts to generate image captions. Compared with the ATT and RA methods, our GLA method is simpler, and its performance is much better than that of the RA method.
Object Detection: With the great success achieved by deep learning technology, object detection has also made significant progress. R-CNN [47] stands out as one of the notable landmarks in the progress of object detection. It takes advantage of high-quality region proposals (generated by the selective search method [46]) and CNN features. This pipeline mainly contains four procedures: (1) extracting region proposals that are likely to contain objects via region proposal methods; (2) extracting CNN features of these region proposals via CNNs; (3) classifying these proposals with a classifier trained on CNN features; (4) localizing the objects via bounding box regression.
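The four-stage pipeline above can be summarized schematically. In the sketch below, every function is a hypothetical placeholder standing in for one R-CNN stage (none of these are the real implementations); only the data flow between the stages reflects the described pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)

def propose_regions(image, n=3):
    """(1) Region proposals (e.g., from selective search) as (x, y, w, h)."""
    return [tuple(int(v) for v in rng.integers(1, 8, size=4)) for _ in range(n)]

def extract_cnn_features(image, boxes, dim=4):
    """(2) One CNN feature vector per cropped proposal (placeholder)."""
    return [rng.standard_normal(dim) for _ in boxes]

def classify(features, n_classes=2):
    """(3) Classifier trained on CNN features, applied per proposal."""
    W = rng.standard_normal((n_classes, len(features[0])))
    return [int(np.argmax(W @ f)) for f in features]

def regress_boxes(boxes, features):
    """(4) Bounding-box regression: refine each proposal (placeholder shift)."""
    return [(x + 1, y + 1, w, h) for (x, y, w, h) in boxes]

image = rng.standard_normal((8, 8))
boxes = propose_regions(image)
feats = extract_cnn_features(image, boxes)
labels = classify(feats)
refined = regress_boxes(boxes, feats)
```

Because each stage runs separately per proposal, the overall pipeline is slow; the end-to-end methods discussed next fold stages (2)-(4) into a single network.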
However, this kind of framework is time-consuming due to its four distinct steps. To reduce the computation time and improve detection accuracy, methods such as SPP-Net [48], Fast R-CNN [49] and DeepID-Net [8] have been developed. These methods integrate the last three steps into one end-to-end framework which can simultaneously perform classification and bounding box regression.
Although these improved methods boost the performance of object detection via the end-to-end part, the region proposal generation process remains separate from it, and this process is the most time-consuming. Thus, some real-time methods, such as YOLO [9], Faster R-CNN [6] and SSD [50], are