proposed a simple and scalable algorithm that combines region proposals with convolutional neural networks (CNNs) for accurate object detection and semantic segmentation. Saurabh et al. [27] proposed deep classification nets for semantic segmentation based on depth CNN features and RGB CNN features. Noh et al. [28]
proposed a semantic segmentation algorithm by learning a de-
convolution network. Long et al. [29] built a fully convolutional
network (FCN) to be trained end-to-end and pixels-to-pixels. The
proposed FCN outperformed the state-of-the-art methods in se-
mantic segmentation. These DL-based approaches have three disadvantages. First, they take several hours to several days to train or fine-tune a network and require much more online time than traditional methods. Second, a massive number of training samples is required to construct a robust model. Third, high-end graphics cards, such as the GTX Titan X GPU, are needed, which places demanding requirements on the hardware. To show the effectiveness of the proposed method, we compare it with several state-of-the-art DL-based methods.
Unsupervised learning-based automatic annotation has also been widely studied. Lu et al. [30] proposed context-based multi-label annotation, which mainly uses context to transfer keywords and simultaneously propagates several keywords to the test image. The results are good, but the method is time-consuming. Jamieson et al. [31] proposed a method that learns the appearance of target models from visual patterns and language cues; it combines the typical appearance of each target model with its corresponding name into a name marker and annotates similar objects in the test images.
The main idea of searching-based automatic annotation methods is to mine the semantic descriptions associated with similar images. This type of method requires no training sample set and is not restricted to a predefined vocabulary, so the process is simple and generally consists of only two steps: searching and mining. Research on this category of methods is relatively scarce, and implementations are rarely used in applications. Wang et al. [32] proposed a model-free image annotation method that mines search results with visual and semantic similarity to produce the final semantic annotation; it is robust to exceptions.
The main contribution of this paper is two-fold. First, 2D and 3D features are extracted from superpixels and aggregated into a representative feature vector with high discriminative power; this feature aggregation improves the robustness of the appearance model and outperforms each individual feature. Second, we propose a novel aggregated boosting decision forest, which we call the ABDF algorithm, to build the classifier; it uses an aggregated splitting strategy and grows trees breadth-first instead of depth-first. To obtain more accurate segmentation results, Graph-Cuts are then applied to tune and correct minor errors, as sketched below. The proposed methodology achieves better segmentation accuracy and robustness than existing state-of-the-art semantic annotation methods, with comparable computational efficiency.
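The paper only names Graph-Cuts as the refinement step; the following is a minimal sketch of a binary graph-cut cleanup using the PyMaxflow library, where the per-pixel probability map prob and the constant smoothness weight are illustrative assumptions rather than the actual energy terms used here.

    # Minimal binary graph-cut refinement sketch (PyMaxflow). The data
    # terms below, derived from a hypothetical per-pixel foreground
    # probability map, and the constant smoothness weight are
    # assumptions; the paper's Graph-Cuts energy is not restated here.
    import numpy as np
    import maxflow

    def refine_labels(prob, smoothness=2.0):
        """prob: H x W array of P(pixel = object) from the classifier."""
        g = maxflow.Graph[float]()
        nodes = g.add_grid_nodes(prob.shape)
        g.add_grid_edges(nodes, smoothness)   # pairwise smoothness term
        eps = 1e-6                            # guard against log(0)
        g.add_grid_tedges(nodes,
                          -np.log(1.0 - prob + eps),  # t-links: data terms
                          -np.log(prob + eps))        # from the probabilities
        g.maxflow()
        return g.get_grid_segments(nodes)     # boolean refined label map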
3. Architecture of the proposed method
The architecture of the proposed method is illustrated in Fig. 1.
The main steps are summarized as follows:
Step 1: Superpixel segmentation. SLIC is used to segment the frames of the training video sequence into superpixels, as sketched below.
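As an illustration, Step 1 could be implemented with the SLIC routine in scikit-image; the superpixel count and compactness below are placeholder values, since the paper does not restate its settings here.

    # Illustrative implementation of Step 1 with scikit-image's SLIC.
    # The superpixel count and compactness are placeholder values.
    from skimage.io import imread
    from skimage.segmentation import slic

    frame = imread("frame_000.png")      # one frame of the training video
    superpixels = slic(frame,
                       n_segments=500,   # desired number of superpixels
                       compactness=10.0, # color vs. spatial proximity trade-off
                       start_label=0)
    # superpixels[y, x] is the superpixel index of pixel (y, x)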
Step 2: Feature extraction and fusion. The camera motion and 3D scene structure are recovered using the automatic tracking system, and the depth maps are then recovered based on a bundle optimization framework [33]. The 2D and 3D features of the superpixels are then extracted. Although the segmented superpixels have different sizes, their features have the same dimension. The 2D and 3D features are normalized and fused based on a continuous feature fusion strategy.
Step 3: Semantic annotation. The superpixels are classified according to the ABDF model, which is described in Algorithm 1. We train 100 weak classifiers in parallel in our model.
Algorithm 1. ABDF algorithm.
Input: Training dataset $F_1 = \{(X_i, Y_i)\}$, with feature set $X_i$ and class label $Y_i \in \{1, \ldots, J\}$, $i = 1, 2, \ldots, N$; the maximum tree depth $N_m$; the number of trees $N_r$
Input: Feature set $X_k$ of the testing dataset $F_2$, $k = 1, 2, \ldots, N$
Output: Class label set $Y_k$ of $F_2$
1: Initialize the root nodes and the weights $w_i^1 = 1/N$;
2: for $i = 1$ to $N_m$ do
3:   Check the stopping criteria for all nodes at depth $i$
4:   for all $j = 1$ to $N_r$ do in parallel
5:     Route the sample sets $S_1$ and $S_2$ to the left and right child nodes according to Eq. (8).
6:     Compute the local scores $A(S_1)$ and $A(S_2)$ according to Eq. (10).
7:     Compute the probability distribution $p(j \mid F_1)$ according to Eq. (14);
8:     Compute $p_1(F_1)$, $p_2(F_1)$ and $p_3(F_1)$ according to Eqs. (11), (12), (13).
9:     Determine the best splitting function according to Eq. (9) and learn a weak classifier $\varphi_i(X; D_i)$.
10:   end for
11:   Update the model $\Gamma$ according to Eq. (16).
12:   Update the weights $w_i^{i+1}$ according to Eq. (15).
13: end for
14: $Y_k^{*} = \operatorname*{argmax}_{Y_k \in \{1, \ldots, J\}} \frac{1}{N_r} \sum p(Y_k^{*} \mid X_k)$, where the sum runs over the $N_r$ trees and $p(Y_k^{*} \mid X_k)$ is the class probability distribution returned by $\Gamma$.
15: Set $m = 2$ and update $Y_k$ according to Eq. (17).
4. Appearance feature model construction based on 2D–3D
multi-feature fusion
Features describe the most important attributes of an image. For image segmentation and recognition, using only 2D or only 3D features to annotate target objects would result in semantic ambiguity. Moreover, street view images usually contain complex objects that may partially occlude each other. To address these problems, we construct the appearance model from a combination of 2D and 3D features and depth information. The proposed appearance model is built on superpixels [34]. For each superpixel, cues about object motion, color and texture are used to extract the 3D and 2D features.
Suppose the appearance model is represented as
$$A = (T, L), \tag{1}$$
where $T$ denotes the 3D feature vector, $L$ denotes the 2D feature vector, and $A$ denotes the feature vector after the concatenation of these two sub-vectors.
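A minimal sketch of this concatenation for per-superpixel features follows; the L2 normalization is one plausible reading of the continuous feature fusion strategy of Step 2, and the variable names are illustrative.

    # Minimal sketch of Eq. (1): per-superpixel 3D and 2D feature vectors
    # are normalized and concatenated into the appearance descriptor A.
    # The L2 normalization is an assumption, not the paper's stated rule.
    import numpy as np

    def appearance_model(T, L, eps=1e-8):
        """T: N x d3 matrix of 3D features, L: N x d2 matrix of 2D
        features, one row per superpixel. Returns A = (T, L), a fixed
        dimension regardless of superpixel size."""
        T = T / (np.linalg.norm(T, axis=1, keepdims=True) + eps)
        L = L / (np.linalg.norm(L, axis=1, keepdims=True) + eps)
        return np.hstack([T, L])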
To extract 3D features, we use our automatic tracking system to
recover camera motion as well as 3D scene structure from videos
or image sequences [33]. For a given video sequence, we first use
the SFM method to recover the camera parameters. Then, the
disparity map for each frame is initialized independently. After
initialization, bundle optimization is performed to refine the