be normalized into a fixed orientation and size, hence scale- and orientation-invariant local descriptors can be extracted afterward. Based on SIFT, other similar floating-point descriptors such as SURF [1], PCA-SIFT [12], and the gradient location and orientation histogram (GLOH) [21] have been proposed. According to the results reported in [21], SIFT and one of its extensions, GLOH, generally outperform the other descriptors.
Beyond the aforementioned floating-point descriptors, many efforts have been made in recent years to design efficient and compact alternatives to SIFT and SURF. Some researchers propose low-bitrate descriptors such as BRIEF [4], BRISK [14], CHoG [5], Edge-SIFT [43], and ORB [26], which are fast both to build and to match.
Each bit of BRIEF [4] is computed from the sign of a simple intensity-difference test between a pair of points sampled from the image patch. Despite its clear advantage in speed, BRIEF suffers in terms of reliability and robustness, as it has limited tolerance to image distortions and transformations.
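As a rough illustration, the sketch below builds a BRIEF-like bit string from pixel-pair tests; the random sampling pattern, descriptor length, and function names are illustrative assumptions, not the exact design of [4].

```python
import numpy as np

def brief_like_descriptor(patch, num_bits=256, seed=0):
    """Toy BRIEF-style descriptor: each bit is the sign of an intensity
    difference between a pair of pixels; the pair pattern is fixed by the
    seed so that every patch is tested at the same locations."""
    rng = np.random.default_rng(seed)
    h, w = patch.shape
    ys = rng.integers(0, h, size=(num_bits, 2))
    xs = rng.integers(0, w, size=(num_bits, 2))
    bits = patch[ys[:, 0], xs[:, 0]] < patch[ys[:, 1], xs[:, 1]]
    return np.packbits(bits.astype(np.uint8))  # e.g., 256 bits -> 32 bytes

def hamming(d1, d2):
    """Binary descriptors are matched by Hamming distance (XOR + popcount)."""
    return int(np.unpackbits(d1 ^ d2).sum())
```

Packing the bits keeps the descriptor compact, and the XOR-based distance is what makes matching such binary descriptors fast in practice.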
The BRISK descriptor [14] first efficiently detects interest points in the scale-space pyramid with a detector inspired by FAST [25] and AGAST [19]. Given the detected keypoints, the BRISK descriptor is then composed as a binary string by concatenating the results of simple brightness comparison tests. The adopted detector provides location, scale, and orientation clues for each interest point; hence BRISK achieves orientation and scale invariance. The ORB descriptor is built on BRIEF but is extracted with a novel corner detector, oFAST [26], and is more robust to rotation and image noise. The authors demonstrate that ORB is significantly faster than SIFT while performing as well in many situations [26]. Compared with SIFT, despite their advantage in speed, these compact descriptors show limitations in descriptive power, robustness, or generality.
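For concreteness, here is a short usage sketch with OpenCV's ORB implementation; the image file names are placeholders and the parameter values are illustrative.

```python
import cv2

img1 = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)      # placeholder path
img2 = cv2.imread("reference.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder path

orb = cv2.ORB_create(nfeatures=1000)  # oFAST keypoints + oriented BRIEF bits
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# ORB descriptors are binary, so they are compared with the Hamming norm.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(len(matches), "cross-checked ORB matches")
```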
B. Local Feature Based Image Matching and Retrieval
Local descriptors like SIFT [18], SURF [1], ORB [26], etc., were originally proposed for image matching. As discussed by Lowe in [18], SIFT descriptors can be matched with two criteria: 1) the cosine similarity of matched SIFT descriptors should be larger than a threshold, e.g., 0.83; and 2) the similarity ratio between the closest match and the second-closest match should be larger than a threshold, e.g., 1.3. However, these two criteria ignore the spatial clues among keypoints, which have been proven important for visual matching. Some works improve the accuracy of local feature matching by verifying the spatial consistency among matched local features, i.e., correctly matched local descriptors should satisfy a consistent spatial relationship. As discussed in [11], the matching accuracy of SIFT, SURF, and PCA-SIFT is substantially improved with RANSAC [7], which estimates an affine transformation between two images and rejects the matched local descriptors that violate this transformation; a sketch combining both steps is given below.
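The following sketch chains the two steps with OpenCV calls: a nearest-to-second-nearest distance-ratio test (the common form of criterion 2 above) followed by RANSAC-based affine verification. The 0.8 ratio and the 5-pixel reprojection threshold are illustrative assumptions.

```python
import cv2
import numpy as np

def match_and_verify(kp1, des1, kp2, des2, ratio=0.8):
    """kp*/des* as returned by, e.g., cv2.SIFT_create().detectAndCompute()."""
    knn = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
    # Keep a match only if it is clearly better than the runner-up.
    good = [p[0] for p in knn
            if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    if len(good) < 3:  # an affine model needs at least 3 correspondences
        return []
    src = np.float32([kp1[m.queryIdx].pt for m in good])
    dst = np.float32([kp2[m.trainIdx].pt for m in good])
    # RANSAC fits an affine transform and flags matches that violate it.
    _, inliers = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC,
                                      ransacReprojThreshold=5.0)
    if inliers is None:
        return []
    return [m for m, ok in zip(good, inliers.ravel()) if ok]
```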
Matching raw local descriptors is expensive to compute because each image may contain over one thousand high-dimensional descriptors. Meanwhile, exploiting geometric relationships with full geometric verification like RANSAC [7] improves retrieval precision, but is computationally expensive. Therefore, raw SIFT and RANSAC are not ideal for large-scale applications. An efficient solution is to quantize local descriptors into visual words and generate a bag-of-visual-words (BoW) model, as in the sketch below. However, it has been illustrated that a single visual word cannot preserve the spatial information in images, which has been proven important for visual matching and recognition; many works are therefore conducted to combine multiple visual words with spatial information [30], [42], [40], [47], [46], [45], [41], [6].
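A minimal sketch of this quantization step, assuming a k-means vocabulary built with scikit-learn; the vocabulary size and clustering method are illustrative choices, not those of any cited work.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(training_descriptors, num_words=1000):
    """Cluster sampled local descriptors; the centers act as visual words."""
    return MiniBatchKMeans(n_clusters=num_words,
                           random_state=0).fit(training_descriptors)

def bow_vector(vocabulary, image_descriptors):
    """Quantize each descriptor to its nearest visual word and count."""
    words = vocabulary.predict(image_descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters)
    return hist / max(hist.sum(), 1)  # L1-normalized BoW histogram
```

Images are then compared through their normalized histograms rather than raw descriptors, which is what makes retrieval scale to large collections.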
Combining visual words with spatial information may be achieved, for example, by using feature pursuit algorithms such as AdaBoost [32], as demonstrated by Liu et al. [17]. Visual word correlograms and correlations [27], adapted from the color correlogram, are utilized to model the spatial relationships among visual words for object recognition. In [38], visual words are bundled, and the corresponding image indexing and visual word matching algorithms are proposed for large-scale near-duplicate image retrieval. In [47], the spatial configurations of visual words are recorded in the index and then utilized for spatial verification during online retrieval. In a recent work [6], Chu et al. utilize a spatial graph model to estimate the spatial consistency among matched visual words during online retrieval; hence, the visual words violating the spatial consistency can be effectively identified and discarded. Visual phrases preserve extra spatial information by involving multiple visual words, and thus generally present more advantages in large-scale image retrieval [8], [42], [45]. For example, the descriptive visual phrase (DVP) is generated in [42] by grouping and selecting two nearby visual words. Similarly, the visual phrases proposed in [45] are extracted by discovering multiple spatially stable visual words. Generally, considering visual words in groups rather than as single visual words yields stronger discriminative power; a toy sketch of such grouping follows.
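The sketch below pairs nearby visual words into second-order phrases; the pairing radius and the unordered word-id pair representation are illustrative assumptions rather than the exact grouping rules of [42] or [45].

```python
import numpy as np
from itertools import combinations

def second_order_phrases(points_xy, word_ids, radius=30.0):
    """Pair every two keypoints within `radius` pixels into a phrase,
    represented as an order-independent pair of visual word ids."""
    pts = np.asarray(points_xy, dtype=np.float32)
    phrases = set()
    for i, j in combinations(range(len(pts)), 2):
        if np.linalg.norm(pts[i] - pts[j]) <= radius:
            phrases.add(tuple(sorted((word_ids[i], word_ids[j]))))
    return phrases
```

Matching two such phrases requires both word ids to agree, which illustrates the rigidity discussed next.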
Notwithstanding the success of existing visual phrase features, they are still limited in flexibility and repeatability. Currently, each visual phrase is commonly treated as a whole cell, and two phrases are matched only if they contain identical single visual words [42], [45]. Firstly, this means only visual phrases containing the same number of visual words can be considered for matching. Secondly, because of quantization error, visually similar descriptors might be quantized into different visual words. Such quantization error is aggregated in visual word combinations and makes visual phrase matches less likely to occur: roughly speaking, if each single word is quantized consistently with probability p, a phrase of k words matches with probability of only about p^k.
Different from existing visual phrase features, the proposed multi-order visual phrase captures rich spatial clues and allows more flexible matching without sacrificing the repeatability of the BoW model. Thus, it is superior in the aspects of flexibility, quantization error, and efficiency. Our approach also differs from existing spatial verification methods in that it does not compute graph models [6] or require multiple iterations of matrix operations as in [47], and it finishes in a cascade manner. Hence, it has the potential to achieve better efficiency. Moreover, our approach does not simply keep or discard a candidate match, but reasonably estimates a confidence of correctness for it. This helps to improve the recall rate and decrease the quantization error. In the following section, we introduce how we extract multi-order visual phrases from SIFT descriptors and binary ORB descriptors.