Bags of Binary Words for Fast Place Recognition in
Image Sequences
Dorian Gálvez-López and Juan D. Tardós, Member, IEEE
Abstract—We propose a novel method for visual place recognition using
bag of words obtained from FAST+BRIEF features. For the first time, we
build a vocabulary tree that discretizes a binary descriptor space, and
use the tree to speed up correspondences for geometrical verification.
We present competitive results with no false positives in very different
datasets, using exactly the same vocabulary and settings. The whole
technique, including feature extraction, requires 22 ms per frame in a sequence with 26 300 images, one order of magnitude faster than previous approaches.
Index Terms—Place Recognition, Bag of Words, SLAM, Computer
Vision.
I. INTRODUCTION
One of the most significant requirements for long-term visual
SLAM (Simultaneous Localization and Mapping) is robust place
recognition. After an exploratory period, when areas that have not been observed for a long time are revisited, standard matching algorithms fail. When loop closures are robustly detected, they provide correct data association to obtain consistent maps. The same methods used for loop detection can be used for robot relocalization after tracking is lost, due for example to sudden motions, severe occlusions or motion blur. In [1]
we concluded that, for small environments, map-to-image methods achieve good performance, but for large environments, image-to-image (or appearance-based) methods such as FAB-MAP [2] scale better. The basic technique consists of building a database from the images collected online by the robot, so that the most similar one can be retrieved when a new image is acquired. If the two images are similar enough, a loop closure is detected.
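For concreteness, the following sketch shows this matching loop in its simplest form; the ImageDatabase class, the cosine similarity score and the alpha threshold are illustrative placeholders rather than the representation and scoring actually used in this paper.

```python
# Minimal sketch of appearance-based loop detection. The class name,
# the cosine similarity score and the alpha threshold are illustrative
# placeholders, not the exact representation used in this paper.
import numpy as np

class ImageDatabase:
    def __init__(self, alpha=0.3):
        self.vectors = []      # bag-of-words vectors of previously seen images
        self.alpha = alpha     # minimum similarity to report a loop candidate

    def query_and_add(self, v):
        """Return the most similar stored image and whether it passes the
        similarity threshold, then store the new vector."""
        best_idx, best_score = -1, 0.0
        for i, w in enumerate(self.vectors):
            score = float(np.dot(v, w)) / (np.linalg.norm(v) * np.linalg.norm(w) + 1e-12)
            if score > best_score:
                best_idx, best_score = i, score
        self.vectors.append(v)
        return best_idx, best_score, best_score >= self.alpha
```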
In recent years, many algorithms that exploit this idea have appeared [2]–[6], basing the image matching on comparing images as numerical vectors in the bag-of-words space [7]. Bags of words result in very effective and quick image matchers [8], but they are not a perfect solution for closing loops, due mainly to perceptual aliasing [6]. For this reason, a verification step is performed afterwards by checking that the matching images are geometrically consistent, which requires feature correspondences. The bottleneck of loop closure algorithms is usually the extraction of features, which is around ten times more expensive in computation cycles than the rest of the steps. This may cause SLAM algorithms to run in two decoupled threads: one to perform the main SLAM functionality, and the other just to detect loop closures, as in [5].
In this paper, we present a novel algorithm to detect loops and establish point correspondences between images in real time, with
a conventional CPU and a single camera. Our approach is based on
bag of words and geometrical check, with several important novelties
that make it much faster than current approaches. The main speed
improvement comes from the use of a slightly modified version of
the BRIEF descriptor [9] with FAST keypoints [10], as explained in
Section III. The BRIEF descriptor is a binary vector where each bit
is the result of an intensity comparison between a given pair of pixels
around the keypoint.
This research has been partly funded by the European Union under project RoboEarth FP7-ICT-248942, the Dirección General de Investigación of Spain under projects DPI2009-13710 and DPI2009-07130, and the Ministerio de Educación (scholarship FPU-AP2008-02272).
The authors are with the Instituto de Investigación en Ingeniería de Aragón (I3A), Universidad de Zaragoza, María de Luna 1, 50018 Zaragoza, Spain. {dorian, tardos}@unizar.es.
Although BRIEF descriptors are hardly invariant to scale and rotation, our experiments show that they are very robust
for loop closing with planar camera motions, the usual case in mobile
robotics, offering a good compromise between distinctiveness and
computation time.
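To illustrate how such a descriptor is computed, the sketch below builds a BRIEF-like binary string from a fixed random test pattern; the pattern generation, the patch handling and the absence of pre-smoothing are simplifications with respect to [9], and the keypoint is assumed to lie far from the image border.

```python
# BRIEF-like descriptor sketch: each bit encodes one intensity comparison
# between a pair of pixels around the keypoint. The random pattern and the
# absence of Gaussian pre-smoothing are simplifications of [9].
import numpy as np

def make_test_pattern(n_bits=256, patch=31, seed=0):
    rng = np.random.default_rng(seed)
    # n_bits pairs of pixel offsets (x1, y1, x2, y2) inside the patch
    return rng.integers(-(patch // 2), patch // 2 + 1, size=(n_bits, 4))

def brief_descriptor(image, kp_x, kp_y, pattern):
    # bit i = 1 iff the intensity at the first point of pair i is smaller
    # than at the second (keypoint assumed far enough from the border)
    bits = np.empty(len(pattern), dtype=np.uint8)
    for i, (x1, y1, x2, y2) in enumerate(pattern):
        bits[i] = image[kp_y + y1, kp_x + x1] < image[kp_y + y2, kp_x + x2]
    return np.packbits(bits)

def hamming_distance(d1, d2):
    # distance between two binary descriptors: XOR followed by a bit count
    return int(np.unpackbits(np.bitwise_xor(d1, d2)).sum())
```

Comparing two such descriptors reduces to a bitwise XOR followed by a bit count, which is what makes matching binary features so cheap.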
We introduce a bag of words that discretizes a binary space, and
augment it with a direct index, in addition to the usual inverse index,
as explained in Section IV. To the best of our knowledge, this is the
first time a binary vocabulary is used for loop detection. The inverse
index is used for fast retrieval of images potentially similar to a given
one. We show a novel use of the direct index to efficiently obtain
point correspondences between images, speeding up the geometrical
check during the loop verification.
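The following sketch illustrates both index structures with deliberately simple data types; the vocabulary itself is reduced to a word_of stub (the real one is a tree that discretizes the binary descriptor space), the inverse-index weights are plain counts, and the direct index is keyed by leaf words for simplicity.

```python
# Sketch of the inverse and direct indices used by the image database.
# The vocabulary is a stub with word_of(), and the weights are plain
# counts; both are simplifications of the structures used in the paper.
from collections import defaultdict

class BinaryBoWDatabase:
    def __init__(self, vocabulary):
        self.vocab = vocabulary                  # object with word_of(descriptor) -> word id
        self.inverse_index = defaultdict(list)   # word id -> [(image id, weight)]
        self.direct_index = {}                   # image id -> {word id: [feature indices]}

    def add(self, image_id, descriptors):
        words = defaultdict(list)
        for feat_idx, d in enumerate(descriptors):
            words[self.vocab.word_of(d)].append(feat_idx)
        for w, feats in words.items():
            self.inverse_index[w].append((image_id, len(feats)))
        self.direct_index[image_id] = dict(words)

    def candidate_images(self, descriptors):
        """Inverse index: visit only images that share at least one word."""
        votes = defaultdict(int)
        for d in descriptors:
            for image_id, weight in self.inverse_index[self.vocab.word_of(d)]:
                votes[image_id] += weight
        return sorted(votes, key=votes.get, reverse=True)

    def candidate_correspondences(self, image_a, image_b):
        """Direct index: compare only features that fell into the same word,
        instead of matching all features against all."""
        da, db = self.direct_index[image_a], self.direct_index[image_b]
        return [(i, j) for w in set(da) & set(db) for i in da[w] for j in db[w]]
```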
The complete loop detection algorithm is detailed in Section V.
Similarly to our previous work [5,6], to decide that a loop has been
closed, we verify the temporal consistency of the image matches
obtained. One of the novelties in this paper is a technique to prevent images collected in the same place from competing with each other when the database is queried. We achieve this by grouping together, during the matching, those images that depict the same place.
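A simplified version of this grouping is sketched below: query results whose frame indices are close in time are merged into a single group that is scored as a whole; the max_gap criterion is an illustrative stand-in for the actual grouping rule described in Section V.

```python
# Sketch of grouping query results that were collected close in time, so
# that images of the same place reinforce one group instead of competing
# with each other. The max_gap criterion is illustrative only.
def group_matches(matches, max_gap=3):
    """matches: list of (frame index, score) returned by the database query.
    Returns one dict per group with its frame interval and accumulated score."""
    groups = []
    for frame, score in sorted(matches):
        if groups and frame - groups[-1]["last"] <= max_gap:
            g = groups[-1]
            g["last"] = frame
            g["score"] += score
            if score > g["best_score"]:
                g["best"], g["best_score"] = frame, score
        else:
            groups.append({"first": frame, "last": frame, "score": score,
                           "best": frame, "best_score": score})
    return groups
```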
Section VI contains the experimental evaluation of our work, in-
cluding a detailed analysis of the relative merits of the different parts
in our algorithm. We present comparisons between the effectiveness
of BRIEF and two versions of SURF features [11], the descriptor
most used for loop closing. We also analyze the performance of the
temporal and geometrical consistency tests for loop verification. We
finally present the results achieved by our technique after evaluating it in five public datasets with trajectories between 0.7 and 4 km long. We demonstrate that we can run the whole loop detection procedure, including the feature extraction, in 52 ms at most in a sequence with 26 300 images (22 ms on average), outperforming previous techniques by more than one order of magnitude.
A preliminary version of this work was presented in [12]. In the
current paper we enhance the direct index technique and extend the
experimental evaluation of our approach. We also report results in
new datasets and make a comparison with the state-of-the-art FAB-
MAP 2.0 algorithm [13].
II. RELATED WORK
Place recognition based on appearance has received great attention
in the robotics community because of the excellent results achieved
[4,5,13,14]. An example of this is the FAB-MAP system [13], which
detects loops with an omnidirectional camera, obtaining recall rates of 48.4% and 3.1%, with no false positives, in trajectories 70 km and 1000 km long, respectively. FAB-MAP represents images with a bag
of words, and uses a Chow Liu tree to learn offline the words’
co-visibility probability. FAB-MAP has become the gold standard for loop detection, but its robustness decreases when the
images depict very similar structures for a long time, which can
be the case when using frontal cameras [5]. In the work of Angeli
et al. [4], two visual vocabularies (for appearance and color) are
created online in an incremental fashion. The two bag-of-words
representations are used together as input of a Bayesian filter that
estimates the matching probability between two images, taking into
account the matching probability of previous cases. In contrast to
these probabilistic approaches, we rely on a temporal consistency
check to consider previous matches and enhance the reliability of
the detections. This technique has proven successful in our previous
works [5,6]. Our work also differs from those above in that we use a bag of binary words for the first time, and in that we propose a technique to prevent images collected close in time and depicting the same place from competing with each other during the matching, so that we can work at a higher frequency.
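As a rough illustration of such a temporal consistency check (the parameters and bookkeeping are placeholders, not the exact test of Section V), a candidate is only accepted when the most recent queries also matched frames close to it:

```python
# Sketch of a temporal consistency check: a loop candidate is accepted
# only if the previous k queries matched nearby frames as well. The
# parameters k and max_drift are illustrative placeholders.
def temporally_consistent(history, candidate_frame, k=3, max_drift=5):
    """history: matched frame indices of the most recent queries (or None
    when a query returned no acceptable match)."""
    recent = history[-k:]
    if len(recent) < k or any(m is None for m in recent):
        return False
    # require the recent matches to form a sequence close to the candidate
    frames = recent + [candidate_frame]
    return all(abs(a - b) <= max_drift for a, b in zip(frames, frames[1:]))
```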