From Templates to Grammars: Breakthroughs of Structured Models in Object Detection


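This document is Ross B. Girshick's doctoral dissertation, "From Rigid Templates to Grammars: Object Detection with Structured Models," submitted in 2012 to the Department of Computer Science at the University of Chicago. Its central question is how to move beyond traditional rigid template matching toward richer structured models that make object detection more accurate and robust.

The dissertation first sets out its goal: addressing key challenges of object detection in computer vision, such as the diversity and complexity of objects, the flexibility of image representations, and the scarcity of annotated data. The author stresses that, around 2011, object detection was shifting from reliance on fixed templates toward learning deeper structure.

On image representation, the dissertation examines how structured models can handle shape variation and deformation. It introduces object detection grammars, models that capture part-whole relationships so that a detector can adapt to changes in viewpoint and pose. It also discusses isolated deformation grammars, a specially designed grammar structure for handling local, scene-independent variation in object features.

The dissertation then analyzes how weak supervision, meaning training data that is only partially labeled (for example, without annotations of what the parts are or where they are located), can be used to train these structured models and reduce dependence on large amounts of precisely annotated data. This was an important research direction at the time because it made the models more practical and better suited to large-scale datasets in real applications.

The dissertation also covers mixture models, which combine several component models to capture the diversity within an object category and so improve detection accuracy, and cascaded detection, a staged pipeline that progressively filters and refines candidate regions to improve detection speed and accuracy. Finally, the author outlines the software implementation, including the tools and techniques developed to support research on structured models, which give later researchers a framework to build on.

Beyond describing the state of the art in object detection at the time, the dissertation proposes structured models (such as object detection grammars and mixture models) and validates experimentally that they improve detection performance. It helped lay the groundwork for the later deep learning era, in particular the rise of convolutional neural networks (CNNs) in object detection.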

may share almost no shape similarity with another view of the object.

Subcategories. The notion of an object category is inherently fuzzy. Most categories that we would like to detect (for practical applications) are not defined purely by visual similarity. Take, for instance, the category airplane. This category is largely defined by function (airplanes are designed to fly) and composition (airplanes have engines, wings, a fuselage, etc.). However, the compositional elements may be visually very dissimilar (propeller vs. jet engines), they may be arranged in different spatial configurations, and occur with differing counts. Several reasonable subcategories, each of which is visually more consistent than the supercategory, might be: fighter jet, jumbo passenger jet, single-engine propeller plane, and biplane.

Composition. Most object classes are compositional in nature. For example, people are often composed with other objects, such as a marching band member wrapped in a tuba. Other objects may be composed of a variable number of subobjects, such as the repeated carriages in a train. The compositional nature of objects leads to a combinatorial problem: an object detection system needs to cope with the exponentially large space of combinations.

1.3 Contributions

We present several models for category-level object detection. Each model in this sequence builds on the structures and methods employed by the previous models, while staying within the framework of discriminatively trained grammar models. Along the way, we increase representational capacity, develop new machine learning techniques, and focus on efficient computation.

In the development of our models, we make extensive use of large image datasets for quantitative evaluation. In particular, we use recent releases of the PASCAL VOC Challenge datasets [23, 26, 27]. These datasets are critical to the development of our methods because they are challenging enough to reveal the performance benefits of richer models. This is in contrast to other standard datasets, such as the INRIA Person Dataset [17], for which detection performance has neared a saturation point, and there is very little to gain by increasing model sophistication.

1.3.1 The context circa 2011

In the 2007 PASCAL VOC Challenge, the top-performing system achieved a mean AP of 21% [34]. By the 2011 Challenge, the winning system had pushed this metric to 41% [27]. The methods described in this dissertation account for two-thirds of the performance gain between 2007 and 2011. The remaining one-third of the gap is due to adding extensions (e.g., more image features, more contextual features, and richer deformation models) to the framework and software developed here [78].

Our point of departure, the system that won the 2007 PASCAL VOC detection task, is the discriminatively trained deformable parts model first presented by Felzenszwalb, McAllester, and Ramanan in [34]. Their work builds on the successful combination of histogram of oriented gradients (HOG) features with linear SVM training, proposed by Dalal and Triggs in [18]. The Dalal and Triggs model is a sliding-window detector implemented as a single HOG filter. This filter can be thought of as a template that models the global appearance of an object category. We refer to this type of global appearance filter as a “root filter.”
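To make the idea of a root filter concrete, the following is a minimal sketch of dense sliding-window scoring with a single linear filter. It is an illustration under stated assumptions, not the Dalal and Triggs implementation: the function name `score_root_filter`, the array shapes, and the random arrays standing in for HOG features and a learned filter are all hypothetical.

```python
import numpy as np

def score_root_filter(feature_map, root_filter):
    """Score every window of the feature map against one linear filter.

    feature_map: (H, W, D) grid of feature cells (stand-in for HOG cells).
    root_filter: (h, w, D) template; the score of a window is the dot
    product between the template and the features under it.
    """
    H, W, D = feature_map.shape
    h, w, d = root_filter.shape
    assert d == D, "filter and features must share the feature dimension"
    scores = np.empty((H - h + 1, W - w + 1))
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            window = feature_map[y:y + h, x:x + w, :]
            scores[y, x] = np.sum(window * root_filter)
    return scores

# Hypothetical data: random noise with the template planted at cell (5, 7).
rng = np.random.default_rng(0)
features = rng.standard_normal((12, 16, 31))
template = rng.standard_normal((4, 6, 31))
features[5:9, 7:13, :] += template

scores = score_root_filter(features, template)
best = tuple(int(i) for i in np.unravel_index(np.argmax(scores), scores.shape))
print("best window (row, col):", best)
```

A real detector would repeat this scoring at every level of a feature pyramid so that objects of different sizes can be found; the sketch covers a single scale only.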

The part-based model in [34] extends the Dalal and Triggs detector by adding higher resolution local part filters that can translate relative to the lower resolution root filter. Additionally, because their model is trained with weak supervision (it is assumed that the training data does not include labels for what the parts are or where they are located), they extend SVM to handle latent variables. In their part-based model, the latent variables describe the spatial configuration of the parts. The resulting discriminative training formalism, named latent SVM (LSVM), is mathematically equivalent to the MI-SVM formulation of

1.3.4 Mixture models

Both the Dalal and Triggs rigid HOG detector, and its part-based extension, use the size and shape of a single root filter to hypothesize detection windows in images. Consider, for a moment, the space of detection windows produced by these detectors. This space only includes detections that have the same aspect ratio as the root filter. This limitation is problematic since there may be a large proportion of object instances from the target category that have very different bounding box aspect ratios. When using an evaluation metric such as the intersection-over-union overlap measure employed by the PASCAL evaluation protocol, outputting detections with a single aspect ratio can severely limit a detector’s recall.
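The effect of a fixed aspect ratio on recall can be checked with a small calculation. The sketch below assumes a hypothetical detector that can only output square windows and a ground-truth box that is three times wider than tall; the box coordinates and helper names are illustrative. The PASCAL VOC protocol counts a detection as correct only when its intersection-over-union with the ground truth exceeds 0.5.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def best_square_iou(gt, steps=200):
    """Best IoU that a square window centered on gt can reach.

    Centering maximizes the intersection for any given side length, so
    only the side length needs to be searched.
    """
    cx, cy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    longest = max(gt[2] - gt[0], gt[3] - gt[1])
    best = 0.0
    for i in range(1, steps + 1):
        half = longest * i / steps / 2
        square = (cx - half, cy - half, cx + half, cy + half)
        best = max(best, iou(gt, square))
    return best

wide_gt = (0.0, 0.0, 300.0, 100.0)  # hypothetical 3:1 ground-truth box
best_overlap = best_square_iou(wide_gt)
print(f"best IoU for a square window: {best_overlap:.3f}")
print("counts as a correct detection:", best_overlap > 0.5)
```

Under these assumptions no square window can pass the 0.5 threshold, however well it is placed and scaled, so every such instance is a guaranteed miss.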

Aside from limited recall, the single-root-filter approach is also problematic from the viewpoint of how best to model the appearance and geometry of an object category. Using a single root filter (either with or without parts) requires that a single model must capture all aspects, poses, and structural variation present within a category. This is likely an unreasonable requirement. For example, a bicycle viewed from the front has a substantially different appearance as compared to a bicycle viewed from the side. These two aspects are so visually dissimilar that they do not share any (nonprimitive) parts. The model visualizations in [34], and reproduced in Figure 1.2, reveal that the learned filters appear to be superpositions of multiple aspects. Intuitively, modeling their appearance jointly produces an averaging effect in the filter weights, leading to less specific filters.

Mixture models. The first model that we present builds on the deformable parts models of [34] in a way that helps overcome the aforementioned issues. As in previous work on mixture models [62, 5], the key idea is to decompose a visually complex object category into a set of simpler subcategories. Our goal is to automatically learn subcategory models that span variation in aspect and pose, and where the range of appearance within each subcategory is simple enough that it can be modeled effectively by a single deformable parts model.
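One way to picture detection with such a mixture is to score each window under every component and keep the best one. The sketch below is a simplification under stated assumptions: each component is reduced to a single linear filter (no parts, no deformation costs), and the function name, shapes, and random data are hypothetical rather than taken from the dissertation.

```python
import numpy as np

def detect_with_mixture(feature_map, component_filters):
    """Return (score, component index, (row, col)) of the best window.

    component_filters: list of (h, w, D) filters. Filters may differ in
    height and width, so each component proposes windows with its own
    aspect ratio.
    """
    H, W, _ = feature_map.shape
    best = (-np.inf, None, None)
    for idx, filt in enumerate(component_filters):
        h, w, _ = filt.shape
        for y in range(H - h + 1):
            for x in range(W - w + 1):
                window = feature_map[y:y + h, x:x + w, :]
                score = float(np.sum(window * filt))
                if score > best[0]:
                    best = (score, idx, (y, x))
    return best

# Hypothetical two-component mixture: a tall filter and a wide filter.
rng = np.random.default_rng(1)
tall = rng.standard_normal((6, 3, 8))
wide = rng.standard_normal((3, 6, 8))

# Low-level noise with a "wide" instance planted at cell (4, 3).
features = 0.1 * rng.standard_normal((10, 12, 8))
features[4:7, 3:9, :] += wide

score, component, location = detect_with_mixture(features, [tall, wide])
print("winning component:", component, "at", location)
```

Because the winning component determines the window shape, a mixture can output detections with several aspect ratios, which a single root filter cannot.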

Figure 1.2: Models with a single root filter: (a) car, (b) bicycle. Images reproduced from [34].

In our formulation, we treat the subcategory labels as latent information, analogous to the latent cluster labels used when fitting mixture models with expectation maximization (EM). In contrast to existing approaches, such as EM, we use nonprobabilistic, discriminative frameworks (latent SVM and weak-label structural SVM) to learn mixtures. Returning to the bicycle example, our system automatically learns subcategories for: side, 45° angle, and front/rear views.
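As a sketch of what treating the subcategory label as latent means in a latent SVM, the usual formulation can be written as follows. The notation is an assumption borrowed from the standard latent SVM literature rather than from this excerpt: $\beta$ is the parameter vector, $\Phi(x, z)$ the features of example $x$ under latent value $z$, $Z(x)$ the set of allowed latent values (here $z$ would include the choice of mixture component), $y_i \in \{-1, +1\}$ the labels, and $C$ a regularization constant.

```latex
f_\beta(x) = \max_{z \in Z(x)} \beta \cdot \Phi(x, z),
\qquad
L_D(\beta) = \frac{1}{2}\lVert \beta \rVert^2
  + C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i f_\beta(x_i)\bigr)
```

Training with this kind of objective typically alternates between choosing the best latent value for each positive example and optimizing $\beta$ with those choices held fixed, which plays a role loosely similar to the hard assignment of cluster labels in EM-style fitting.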

Latent orientation. For many object categories, photographs taken by people typically capture category instances with significant variation in out-of-plane rotation. Empirically, we find that our method for learning subcategories tends to cluster instances that have similar out-of-plane rotations (modulo 180°) together. In the horse category, for example, one of the learned clusters corresponds to side views. Unfortunately this cluster includes both left and right-facing horses, and because their appearance is represented jointly, the result is a model ideally suited to detect two-headed horses (see Figure 1.3 top). To address this problem, we enrich our models by treating orientation within a subcategory as a latent,
