“analytic” features. This removes the need to perform a costly
projection step when computing dense feature maps.
Significant variations in shape and appearance, such as
those caused by extreme viewpoint changes, are not well
captured by a 2D deformable model. Aspect graphs [31] are
a classical formalism for capturing significant changes that
are due to viewpoint variation. Mixture models provide a
simpler alternative approach. For example, it is common to
use multiple templates to encode frontal and side views of
faces and cars [36]. Mixture models have been used to
capture other aspects of appearance variation as well, such
as when there are multiple natural subclasses in an object
category [5].
Matching a deformable model to an image is a difficult
optimization problem. Local search methods require initi-
alization near the correct solution [2], [7], [43]. To guarantee
a globally optimal match, more aggressive search is needed.
One popular approach for part-based models is to restrict
part locations to a small set of possible locations returned by
an interest point detector [1], [18], [42]. Tree (and star)
structured pictorial structure models [9], [15], [19] allow for
the use of dynamic programming and generalized distance
transforms to efficiently search over all possible object
configurations in an image, without restricting the possible
locations for each part. We use these techniques for
matching our models to images.
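To make the dynamic-programming idea concrete, here is a hedged sketch of the 1D generalized distance transform that underlies matching for star-structured models. This naive version is O(n²); the actual algorithm used in such systems runs in linear time, but the quantity computed is the same: the best part score at each root-relative position, discounted by a quadratic deformation cost. The function name and the scalar `cost` parameter are illustrative, not from the paper.

```python
def distance_transform_1d(scores, cost=1.0):
    """Naive generalized distance transform (illustrative O(n^2) sketch).

    For each output position p, returns
        max over q of scores[q] - cost * (p - q)**2,
    i.e. the best placement q for a part, penalized quadratically
    for landing away from its ideal position p.
    """
    n = len(scores)
    out = []
    for p in range(n):
        out.append(max(scores[q] - cost * (p - q) ** 2 for q in range(n)))
    return out
```

In a star model, the root score at each location is the root filter response plus one distance-transformed part score per part, so the search over all part placements avoids enumerating configurations explicitly.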
Part-based deformable models are parameterized by the
appearance of each part and a geometric model capturing
spatial relationships among parts. For generative models,
one can learn model parameters using maximum likelihood
estimation. In a fully supervised setting, training images are
labeled with part locations and models can often be learned
using simple methods [9], [15]. In a weakly supervised
setting, training images may not specify locations of parts.
In this case, one can simultaneously estimate part locations
and learn model parameters with EM [2], [18], [42].
Discriminative training methods select model parameters
so as to minimize the mistakes of a detection algorithm on a
set of training images. Such approaches directly optimize
the decision boundary between positive and negative
examples. We believe that this is one reason for the success
of simple models trained with discriminative methods, such
as the Viola-Jones [41] and Dalal-Triggs [10] detectors. It has
been more difficult to train part-based models discrimina-
tively, though strategies exist [4], [23], [32], [34].
Latent SVMs are related to hidden CRFs [32]. However,
in a latent SVM, we maximize over latent part locations as
opposed to marginalizing over them, and we use a hinge
loss rather than log loss in training. This leads to an efficient
coordinate-descent style algorithm for training, as well as a
data-mining algorithm that allows for learning with very
large data sets. A latent SVM can be viewed as a type of
energy-based model [27].
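To make the contrast with hidden CRFs concrete, the latent SVM scores an example by maximizing over latent values and trains with a hinge loss. As a notational sketch (with weights β, joint feature map Φ, and latent value set Z(x), following the usual latent SVM formulation):

```latex
f_\beta(x) = \max_{z \in Z(x)} \beta \cdot \Phi(x, z)

L(\beta) = \frac{1}{2}\|\beta\|^2 \;+\; C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i f_\beta(x_i)\bigr)
```

A hidden CRF would instead marginalize (sum) over z and minimize a log loss; replacing the sum with a max and the log loss with a hinge is what yields the coordinate-descent training described above.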
A latent SVM is equivalent to the MI-SVM formulation of
multiple instance learning (MIL) in [3], but we find the
latent variable formulation more natural for the problems
we are interested in.¹ A different MIL framework was
previously used for training object detectors with weakly
labeled data in [40].
Our method for data-mining hard examples during
training is related to working set methods for SVMs (e.g.,
[25]). The approach described here requires relatively few
passes through the complete set of training examples and is
particularly well suited for training with very large data
sets, where only a fraction of the examples can fit in RAM.
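As a hedged sketch of this style of data mining, the loop below alternates between training on a bounded cache of examples and refreshing the cache by shrinking away easy negatives and scanning the full negative set for margin violators. The interfaces `train_svm` and `score`, the cache size, and the fixed round count are assumptions for illustration, not the paper's exact procedure.

```python
def train_with_hard_negatives(train_svm, score, positives, negatives,
                              cache_size=1000, rounds=5):
    """Hard-negative mining sketch with a bounded in-memory cache.

    train_svm(pos, neg) returns a trained model; score(model, x) is
    its decision value. A negative x is "hard" when score(model, x)
    exceeds -1, i.e. it violates the SVM margin for label -1.
    """
    cache = negatives[:cache_size]
    model = None
    for _ in range(rounds):
        model = train_svm(positives, cache)
        # Shrink: drop easy negatives that score well below the margin.
        cache = [x for x in cache if score(model, x) > -1]
        # Grow: one pass over the full negative set for margin violators.
        for x in negatives:
            if len(cache) >= cache_size:
                break
            if score(model, x) > -1 and x not in cache:
                cache.append(x)
    return model
```

Only the cache must fit in memory; the full negative set is touched once per round, which is why few passes over the data suffice.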
The use of context for object detection and recognition
has received increasing attention in recent years. Some
methods (e.g., [39]) use low-level holistic image features to
define likely object hypotheses. The method in [22] uses a
coarse but semantically rich representation of a scene,
including its 3D geometry, estimated using a variety of
techniques. Here, we define the context of an image using
the results of running a variety of object detectors in the
image. The idea is related to [33] where a CRF was used to
capture co-occurrences of objects, although we use a very
different approach to capture this information.
A preliminary version of our system was described in
[17]. The system described here differs from the one in [17]
in several ways: we introduce mixture models; we optimize
the true latent SVM objective function using stochastic
gradient descent, whereas in [17] we used an SVM package
to optimize a heuristic approximation of the objective; we
use new features that are both lower dimensional and more
informative; and we now postprocess detections via
bounding box prediction and context rescoring.
3 MODELS
All of our models involve linear filters that are applied to
dense feature maps. A feature map is an array whose entries
are d-dimensional feature vectors computed from a dense
grid of locations in an image. Intuitively, each feature vector
describes a local image patch. In practice, we use a variation
of the HOG features from [10], but the framework described
here is independent of the specific choice of features.
A filter is a rectangular template defined by an array of
d-dimensional weight vectors. The response, or score, of a
filter F at a position (x, y) in a feature map G is the "dot
product" of the filter and a subwindow of the feature map
with top-left corner at (x, y):

$$\sum_{x', y'} F[x', y'] \cdot G[x + x', y + y'].$$
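As a minimal sketch, the filter response can be computed directly from this definition. Here a feature map is represented as nested lists of d-dimensional feature vectors; this representation is illustrative (a real implementation would use dense arrays).

```python
def filter_score(F, G, x, y):
    """Score of filter F at position (x, y) in feature map G.

    F and G are 2D grids of d-dimensional feature vectors
    (lists of lists of lists). Returns the sum over (x', y') of
    the dot product F[x'][y'] . G[x + x'][y + y'].
    """
    h, w = len(F), len(F[0])
    total = 0.0
    for xp in range(h):
        for yp in range(w):
            f = F[xp][yp]
            g = G[x + xp][y + yp]
            total += sum(fi * gi for fi, gi in zip(f, g))
    return total
```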
We would like to define a score at different positions and
scales in an image. This is done using a feature pyramid
which specifies a feature map for a finite number of scales
in a fixed range. In practice, we compute feature pyramids
by computing a standard image pyramid via repeated
smoothing and subsampling, and then computing a feature
map from each level of the image pyramid. Fig. 3 illustrates
the construction.
The scale sampling in a feature pyramid is determined by a
parameter λ defining the number of levels in an octave. That
is, λ is the number of levels we need to go down in the
pyramid to get to a feature map computed at twice the
resolution of another one. In practice, we have used λ = 5 in
training and λ = 10 at test time. Fine sampling of scale space is
important for obtaining high performance with our models.
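A short sketch of the arithmetic: with λ levels per octave, going down λ levels doubles the resolution, so consecutive levels differ by a factor of 2^(1/λ). The function below lists the resulting scale factors; the number of octaves is an illustrative parameter, not a value from the paper.

```python
def pyramid_scales(num_octaves=3, levels_per_octave=5):
    """Scale factors for a feature pyramid.

    levels_per_octave is the paper's lambda: lambda levels span one
    octave, so adjacent levels are separated by a factor of
    2 ** (1 / lambda). Scale 1.0 is the original image resolution.
    """
    step = 2 ** (1.0 / levels_per_octave)
    n = num_octaves * levels_per_octave
    return [step ** -i for i in range(n + 1)]
```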
The system in [10] uses a single filter to define an object
model. That system detects objects by computing the score
of the filter at each position and scale of a HOG feature
pyramid and thresholding the scores.
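The single-filter detection scheme can be sketched as an exhaustive scan with a threshold. The `score_at` callback stands in for the filter response at a pyramid position (e.g. the "dot product" defined above); its name and the level-size representation are assumptions for this sketch.

```python
def detect_single_filter(score_at, levels, threshold):
    """Single-filter detection sketch: score every placement, keep
    those at or above a threshold.

    score_at(level, x, y) -> filter response at that position;
    levels[l] = (h, w), the grid of valid filter placements in
    pyramid level l. Returns (level, x, y, score) tuples.
    """
    detections = []
    for l, (h, w) in enumerate(levels):
        for x in range(h):
            for y in range(w):
                s = score_at(l, x, y)
                if s >= threshold:
                    detections.append((l, x, y, s))
    return detections
```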
1630 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 32, NO. 9, SEPTEMBER 2010
1. We defined a latent SVM in [17] before realizing the relationship to
MI-SVM.