Int J Comput Vis (2013) 104:154–171 157
Fig. 2 Two examples of our selective search showing the necessity of different scales. On the left we find many objects at different scales. On the
right we necessarily find the objects at different scales as the girl is contained by the tv
3 Selective Search
In this section we detail our selective search algorithm for
object recognition and present a variety of diversification
strategies to deal with as many image conditions as possible.
A selective search algorithm is s ubject to the following design
considerations:
Capture All Scales. Objects can occur at any scale within
the image. Furthermore, some objects have less clear
boundaries then other objects. Therefore, in selective
search all object scales have to be taken into account,
as illustrated in Fig. 2. This is most naturally achieved by
using an hierarchical algorithm.
Diversification. There is no single optimal strategy to
group regions together. As observed earlier in Fig. 1,
regions may f orm an object because of only colour, only
texture, or because parts are enclosed. Furthermore, light-
ing conditions such as shading and the colour of the light
may influence how regions form an object. Therefore
instead of a single strategy which works well in most
cases, we want to have a diverse set of strategies to deal
with all cases.
Fast to Compute. The goal of selective search is to yield
a set of possible object locations for use in a practical
object recognition framework. The creation of this set
should not become a computational bottleneck, hence
our algorithm should be reasonably fast.
3.1 Selective Search by Hierarchical Grouping
We take a hierarchical grouping algorithm to form the basis
of our selective search. Bottom-up grouping is a popu-
lar approach to segmentation (Comaniciu and Meer 2002;
Felzenszwalb and Huttenlocher 2004), hence we adapt it for
selective search. Because the process of grouping itself is
hierarchical, we can naturally generate locations at all scales
by continuing the grouping process until the whole image
becomes a single region. This satisfies the condition of cap-
turing all scales.
As regions can yield richer information than pixels, we
want to use region-based features whenever possible. To get
a set of small starting regions which ideally do not span
multiple objects, we use the fast method of Felzenszwalb
and Huttenlocher (2004), which Arbeláez et al. (2011) found
well-suited for such purpose.
Our grouping procedure now works as follows. We first
use (Felzenszwalb and Huttenlocher 2004) to create initial
regions. Then we use a greedy algorithm to iteratively group
regions together: First the similarities between all neighbour-
ing regions are calculated. The two most similar regions are
grouped together, and new similarities are calculated between
the resulting region and its neighbours. The process of group-
ing the most similar r egions is repeated until the whole image
becomes a single region. The general method is detailed in
Algorithm 1.
For the similarity s(r
i
, r
j
) between region r
i
and r
j
we
want a variety of complementary measures under the con-
straint that they are fast to compute. In effect, this means that
the similarities should be based on features that can be prop-
agated through the hierarchy, i.e. when merging region r
i
and
r
j
into r
t
, the features of region r
t
need to be calculated from
the features of r
i
and r
j
without accessing the image pixels.
3.2 Diversification Strategies
The second design criterion for selective search is to diver-
sify the sampling and create a set of complementary strategies
whose locations are combined afterwards. We diversify our
selective search (1) by using a variety of colour spaces with
123