Figure 3: The training procedure of our object recognition pipeline. As positive learning examples we use the ground truth. As negatives
we use examples that have a 20-50% overlap with the positive examples. We iteratively add hard negatives using a retraining phase.
by our selective search that have an overlap of 20% to 50% with a positive example. To avoid near-duplicate negative examples, a negative example is excluded if it has more than 70% overlap with another negative. To keep the number of initial negatives per class below 20,000, we randomly drop half of the negatives for the classes car, cat, dog and person. Intuitively, this set of examples can be seen as difficult negatives which are close to the positive examples. This means they are close to the decision boundary and are therefore likely to become support vectors even when the complete set of negatives would be considered. Indeed, we found that this selection of training examples gives reasonably good initial classification models.
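To make this selection concrete, the sketch below filters candidate boxes into the initial negative set using the 20-50% and 70% overlap thresholds described above. It is a minimal illustration, not the authors' code: the names (select_initial_negatives, overlap, candidates) are ours, and the overlap function is assumed to be supplied by the caller.

import random

def select_initial_negatives(candidates, positives, overlap,
                             lo=0.2, hi=0.5, dup_thresh=0.7,
                             drop_half=False):
    """Pick initial negative windows as described above.

    candidates: selective-search boxes from images of one class
    positives:  ground-truth boxes of that class
    overlap:    function(box_a, box_b) -> intersection-over-union in [0, 1]
    """
    negatives = []
    for box in candidates:
        # Keep only boxes whose best overlap with a positive is 20-50%.
        best = max((overlap(box, p) for p in positives), default=0.0)
        if not (lo <= best <= hi):
            continue
        # Skip near-duplicates: >70% overlap with a negative already kept.
        if any(overlap(box, n) > dup_thresh for n in negatives):
            continue
        negatives.append(box)
    # For frequent classes (car, cat, dog, person) the text randomly drops
    # half of the negatives to keep the initial set below ~20,000 examples.
    if drop_half:
        negatives = random.sample(negatives, len(negatives) // 2)
    return negatives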
Then we enter a retraining phase to iteratively add hard negative examples (e.g. [12]): we apply the learned models to the training set using the locations generated by our selective search. For each negative image we add the highest scoring location. As our initial training set already yields good models, our models converge in only two iterations.
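The retraining loop can be sketched as follows; train_svm and score are placeholders for the classifier training and scoring of Section 4 (hypothetical names, not the paper's API), and each round adds the single highest-scoring location per negative image, as described above.

def retrain_with_hard_negatives(train_svm, score, positives, negatives,
                                negative_images, boxes_per_image, n_rounds=2):
    """Iterative hard-negative mining (cf. [12]); names are illustrative.

    train_svm(pos, neg) -> model                  (hypothetical trainer)
    score(model, image, box) -> classifier score  (hypothetical scorer)
    boxes_per_image[image] -> selective-search locations of that image
    negatives is a list of (image, box) pairs and is grown in place.
    """
    model = train_svm(positives, negatives)
    for _ in range(n_rounds):  # two iterations suffice according to the text
        for image in negative_images:
            # Add the single highest-scoring location of each negative image.
            hardest = max(boxes_per_image[image],
                          key=lambda b: score(model, image, b))
            negatives.append((image, hardest))
        model = train_svm(positives, negatives)
    return model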
For the test set, the final model is applied to all locations generated by our selective search. The windows are sorted by classifier score, while windows which have more than 30% overlap with a higher scoring window are considered near-duplicates and are removed.
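This near-duplicate removal is a greedy pass over score-sorted windows; a minimal sketch, assuming a supplied overlap function:

def remove_near_duplicates(windows, scores, overlap, max_overlap=0.3):
    """Keep windows in descending score order, dropping any window that
    overlaps an already-kept, higher-scoring window by more than 30%."""
    order = sorted(range(len(windows)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(overlap(windows[i], windows[j]) <= max_overlap for j in kept):
            kept.append(i)
    return [windows[i] for i in kept]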
5 Evaluation
In this section we evaluate the quality of our selective search. We divide our experiments into four parts, each spanning a separate subsection:
Diversification Strategies. We experiment with a variety of colour spaces, similarity measures, and thresholds of the initial regions, all of which were detailed in Section 3.2. We seek a trade-off between the number of generated object hypotheses, computation time, and the quality of object locations. We do this in terms of bounding boxes. This results in a selection of complementary techniques which together serve as our final selective search method.
Quality of Locations. We test the quality of the object location hypotheses resulting from the selective search.
Object Recognition. We use the locations of our selective search
in the Object Recognition framework detailed in Section 4.
We evaluate performance on the Pascal VOC detection chal-
lenge.
An upper bound of location quality. We investigate how well our object recognition framework performs when using an object hypothesis set of “perfect” quality. How does this compare to the locations that our selective search generates?
To evaluate the quality of our object hypotheses we define the Average Best Overlap (ABO) and Mean Average Best Overlap (MABO) scores, which slightly generalise the measure used in [9]. To calculate the Average Best Overlap for a specific class c,
we calculate the best overlap between each ground truth annotation $g^c_i \in G^c$ and the object hypotheses $L$ generated for the corresponding image, and average:

$$\mathrm{ABO} = \frac{1}{|G^c|} \sum_{g^c_i \in G^c} \max_{l_j \in L} \mathrm{Overlap}(g^c_i, l_j). \qquad (7)$$
The Overlap score is taken from [11] and measures the area of the intersection of two regions divided by the area of their union:

$$\mathrm{Overlap}(g^c_i, l_j) = \frac{\mathrm{area}(g^c_i) \cap \mathrm{area}(l_j)}{\mathrm{area}(g^c_i) \cup \mathrm{area}(l_j)}. \qquad (8)$$
Analogously to Average Precision and Mean Average Precision, Mean Average Best Overlap is now defined as the mean ABO over all classes.
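A compact sketch of Equations 7 and 8 and of MABO, assuming axis-aligned boxes given as (x1, y1, x2, y2) tuples and hypotheses indexed per image; the function and variable names are ours, not the paper's:

def box_overlap(a, b):
    """Equation 8: intersection area over union area of two boxes,
    each given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def average_best_overlap(ground_truths, hypotheses):
    """Equation 7: for one class, average over its ground truth boxes of the
    best overlap with the hypotheses of the corresponding image.
    ground_truths: list of (image_id, box); hypotheses: dict image_id -> boxes."""
    best = [max(box_overlap(g, l) for l in hypotheses[img])
            for img, g in ground_truths]
    return sum(best) / len(best)

def mean_average_best_overlap(ground_truths_per_class, hypotheses):
    """MABO: the mean ABO over all classes."""
    scores = [average_best_overlap(gts, hypotheses)
              for gts in ground_truths_per_class.values()]
    return sum(scores) / len(scores)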
Other work often uses the recall derived from the Pascal Overlap Criterion to measure the quality of the boxes [1, 16, 34]. This criterion considers an object to be found when the Overlap of Equation 8 is larger than 0.5. However, in many of our experiments we obtain a recall between 95% and 100% for most classes, making this measure too insensitive for this paper. Nevertheless, we do report this measure when comparing with other work.
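For reference, this recall reduces to the fraction of ground truth boxes whose best Overlap exceeds 0.5; a sketch reusing box_overlap from above:

def pascal_recall(ground_truths, hypotheses, threshold=0.5):
    """Fraction of ground truth boxes covered by at least one hypothesis
    with Overlap (Equation 8) above the Pascal threshold of 0.5."""
    hits = sum(1 for img, g in ground_truths
               if max(box_overlap(g, l) for l in hypotheses[img]) > threshold)
    return hits / len(ground_truths)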
To avoid overfitting, we perform the diversification strategies experiments on the Pascal VOC 2007 TRAIN+VAL set. Other experiments are done on the Pascal VOC 2007 TEST set. Additionally, our object recognition system is benchmarked on the Pascal VOC 2010 detection challenge, using the independent evaluation server.
5.1 Diversification Strategies
In this section we evaluate a variety of strategies to obtain good quality object location hypotheses using a reasonable number of boxes computed within a reasonable amount of time.
5.1.1 Flat versus Hierarchy
In the description of our method we claim that using a full hierarchy is more natural than using multiple flat partitionings by changing a threshold.
(a) Trade-off between number of object locations and the Pascal Recall criterion. [Plot: recall (0.5 to 1) versus number of object boxes (0 to 3000) for Harzallah et al., Vedaldi et al., Alexe et al., Carreira and Sminchisescu, Endres and Hoiem, Selective search Fast, and Selective search Quality.]
(b) Trade-off between number of object locations and the MABO score. [Plot: Mean Average Best Overlap (0.5 to 1) versus number of object boxes (0 to 3000) for Alexe et al., Carreira and Sminchisescu, Endres and Hoiem, Selective search Fast, and Selective search Quality.]
Figure 4: Trade-off between quality and quantity of the object hypotheses in terms of bounding boxes on the Pascal 2007 TEST set. The dashed lines are for those methods whose quantity is expressed as the number of boxes per class. In terms of recall, “Fast” selective search has the best trade-off. In terms of Mean Average Best Overlap, the “Quality” selective search is comparable with [4, 9] yet is much faster to compute and continues to higher numbers of boxes, resulting in a higher final MABO of 0.879.
(a) Bike: 0.863 (b) Cow: 0.874 (c) Chair: 0.884 (d) Person: 0.882 (e) Plant: 0.873
Figure 5: Examples of locations for objects whose Best Overlap score is around our Mean Average Best Overlap of 0.879. The green boxes are the ground truth. The red boxes are created using the “Quality” selective search.