proposed a simple and scalable algorithm that combines region proposals with convolutional neural networks (CNNs) for accurate object detection and semantic segmentation. Saurabh et al. [27] proposed deep classification nets for semantic segmentation based on depth CNN features and RGB CNN features. Noh et al. [28]
proposed a semantic segmentation algorithm by learning a de-
convolution network. Long et al. [29] built a fully convolutional
network (FCN) to be trained end-to-end and pixels-to-pixels. The
proposed FCN outperformed the state-of-the-art methods in se-
mantic segmentation. These DL-based approaches have three disadvantages. First, they take several hours to several days to train or fine-tune a network and require much more online time than traditional methods. Second, a massive number of training samples is required to construct a robust model. Third, high-end graphics cards, such as the GTX Titan X GPU, are needed, which places demanding requirements on the hardware. To show the effectiveness of the proposed method, we compare it with several state-of-the-art DL-based methods.
Unsupervised learning-based automatic annotation has also been widely studied. Lu et al. [30] proposed context-based multi-label annotation, which mainly uses context to transfer keywords and simultaneously propagates several keywords to the test image. The results are good, but the method is time-consuming. Jamieson et al. [31] proposed a method that learns the appearance of target models from visual patterns and language cues; it combines the typical appearance of each target model with its corresponding name into a name marker and annotates similar objects in the test images.
The main idea of searching-based automatic annotation methods is to mine the semantic descriptions associated with similar images. This type of method requires no training sample set and is not restricted to a predefined vocabulary, so the process is simple and generally consists of only two steps: searching and mining. Research on this category of methods is relatively scarce, and implementations are rarely used in applications. Wang et al. [32] proposed a model-free image annotation method that mines search results with visual and semantic similarity to produce the final semantic annotation; it is robust to exceptions.
The main contribution of this paper is two-fold. First, 2D and 3D features are extracted from superpixels and aggregated into a representative feature vector with high discriminative power; this feature aggregation improves the robustness of the appearance model and outperforms each individual feature. Second, we propose a novel aggregated boosting decision forest, which we call the ABDF algorithm, to build the classifier; it uses an aggregated splitting strategy and grows trees breadth-first instead of depth-first. To obtain more accurate segmentation results, Graph-Cuts are then applied to tune and correct minor errors, as sketched below. The proposed methodology achieves better segmentation accuracy and robustness than existing state-of-the-art semantic annotation methods, with comparable computational efficiency.
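The paper only names Graph-Cuts as the refinement step; the following is a minimal sketch of a binary graph-cut cleanup using the PyMaxflow library, where the per-pixel probability map prob and the constant smoothness weight are illustrative assumptions rather than the actual energy terms used here.

    # Minimal binary graph-cut refinement sketch (PyMaxflow). The data
    # terms below, derived from a hypothetical per-pixel foreground
    # probability map, and the constant smoothness weight are
    # assumptions; the paper's Graph-Cuts energy is not restated here.
    import numpy as np
    import maxflow

    def refine_labels(prob, smoothness=2.0):
        """prob: H x W array of P(pixel = object) from the classifier."""
        g = maxflow.Graph[float]()
        nodes = g.add_grid_nodes(prob.shape)
        g.add_grid_edges(nodes, smoothness)   # pairwise smoothness term
        eps = 1e-6                            # guard against log(0)
        g.add_grid_tedges(nodes,
                          -np.log(1.0 - prob + eps),  # t-links: data terms
                          -np.log(prob + eps))        # from the probabilities
        g.maxflow()
        return g.get_grid_segments(nodes)     # boolean refined label map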
3. Architecture of the proposed method
The architecture of the proposed method is illustrated in Fig. 1.
The main steps are summarized as follows:
Step 1: Superpixel segmentation. SLIC is used to segment the frames of the training video sequence into superpixels, as sketched below.
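As an illustration, Step 1 could be implemented with the SLIC routine in scikit-image; the superpixel count and compactness below are placeholder values, since the paper does not restate its settings here.

    # Illustrative implementation of Step 1 with scikit-image's SLIC.
    # The superpixel count and compactness are placeholder values.
    from skimage.io import imread
    from skimage.segmentation import slic

    frame = imread("frame_000.png")      # one frame of the training video
    superpixels = slic(frame,
                       n_segments=500,   # desired number of superpixels
                       compactness=10.0, # color vs. spatial proximity trade-off
                       start_label=0)
    # superpixels[y, x] is the superpixel index of pixel (y, x)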
Step 2: Feature extraction and fusion. The camera motion and 3D scene structure are recovered using the automatic tracking system, and the depth maps are then recovered based on a bundle optimization framework [33]. The 2D and 3D features of the superpixels are then extracted. Although the segmented superpixels have different sizes, their features have the same dimension. The 2D and 3D features are normalized and fused based on a continuous feature fusion strategy.
Step 3: Semantic annotation. The superpixels are classified according to the ABDF model, which is described in Algorithm 1. We train 100 weak classifiers in parallel in our model.
Algorithm 1. ABDF algorithm.
Input: Training dataset $F_1 = \{(X_i, Y_i)\}$, with feature set $X_i$ and class label $Y_i \in \{1, \ldots, J\}$, $i = 1, 2, \ldots, N$; the maximum tree depth $N_m$; the number of trees $N_r$
Input: Feature set $X_k$ of the testing dataset $F_2$, $k = 1, 2, \ldots, N$
Output: Class label set $Y_k$ of $F_2$
1: Initialize the root nodes and the weights $w_i^1 = 1/N$;
2: for $i = 1$ to $N_m$ do
3:   Check the stopping criteria for all nodes at depth $i$
4:   for all $j = 1$ to $N_r$ do in parallel
5:     Route the sample sets $S_1$ and $S_2$ to the left and right child nodes according to Eq. (8).
6:     Compute the local scores $A(S_1)$ and $A(S_2)$ according to Eq. (10).
7:     Compute the probability distribution $p(j \mid F_1)$ according to Eq. (14);
8:     Compute $p_1(F_1)$, $p_2(F_1)$ and $p_3(F_1)$ according to Eqs. (11), (12), (13).
9:     Determine the best splitting function according to Eq. (9) and learn a weak classifier $\varphi_i(X; D_i)$.
10:   end for
11:   Update the model $\Gamma$ according to Eq. (16).
12:   Update the weights $w_i^{i+1}$ according to Eq. (15).
13: end for
14: $Y_k^{*} = \operatorname*{argmax}_{Y_k \in \{1, \ldots, J\}} \frac{1}{N_r} \sum p(Y_k^{*} \mid X_k)$, where the sum runs over the $N_r$ trees and $p(Y_k^{*} \mid X_k)$ is the class probability distribution returned by $\Gamma$.
15: Set $m = 2$ and update $Y_k$ according to Eq. (17).
4. Appearance feature model construction based on 2D–3D
multi-feature fusion
Features describe the most important attributes of an image. For image segmentation and recognition, using only 2D or only 3D features to annotate target objects would result in semantic ambiguity. Moreover, street view images usually contain complex objects that may partially occlude each other. To address these problems, we construct the appearance model from a combination of 2D and 3D features and depth information. The proposed appearance model is built on superpixels [34]. For each superpixel, cues about object motion, color and texture are used to extract the 3D and 2D features.
Suppose the appearance model is represented as
$$A = (T, L), \tag{1}$$
where $T$ denotes the 3D feature vector, $L$ denotes the 2D feature vector, and $A$ denotes the feature vector after the concatenation of these two sub-vectors.
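A minimal sketch of this concatenation for per-superpixel features follows; the L2 normalization is one plausible reading of the continuous feature fusion strategy of Step 2, and the variable names are illustrative.

    # Minimal sketch of Eq. (1): per-superpixel 3D and 2D feature vectors
    # are normalized and concatenated into the appearance descriptor A.
    # The L2 normalization is an assumption, not the paper's stated rule.
    import numpy as np

    def appearance_model(T, L, eps=1e-8):
        """T: N x d3 matrix of 3D features, L: N x d2 matrix of 2D
        features, one row per superpixel. Returns A = (T, L), a fixed
        dimension regardless of superpixel size."""
        T = T / (np.linalg.norm(T, axis=1, keepdims=True) + eps)
        L = L / (np.linalg.norm(L, axis=1, keepdims=True) + eps)
        return np.hstack([T, L])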
To extract 3D features, we use our automatic tracking system to
recover camera motion as well as 3D scene structure from videos
or image sequences [33]. For a given video sequence, we first use
the SFM method to recover the camera parameters. Then, the
disparity map for each frame is initialized independently. After
initialization, bundle optimization is performed to refine the