GIFT: A Real-time and Scalable 3D Shape Search Engine
Song Bai^1   Xiang Bai^1   Zhichao Zhou^1   Zhaoxiang Zhang^2   Longin Jan Latecki^3
^1 Huazhong University of Science and Technology
^2 CAS Center for Excellence in Brain Science and Intelligence Technology, CASIA
^3 Temple University
{songbai,xbai,zzc}@hust.edu.cn   zhaoxiang.zhang@ia.ac.cn   latecki@temple.edu
Abstract
Projective analysis is an important approach to 3D shape retrieval, since human visual perception of 3D shapes relies on various 2D observations from different viewpoints. Although multiple informative and discriminative views are utilized, most projection-based retrieval systems suffer from heavy computational cost and thus cannot satisfy the basic scalability requirement of search engines.
In this paper, we present a real-time 3D shape search engine based on the projective images of 3D shapes. The real-time property of our search engine results from the following aspects: (1) efficient projection and view feature extraction using GPU acceleration; (2) the first inverted file, referred to as F-IF, is utilized to speed up the procedure of multi-view matching; (3) the second inverted file (S-IF), which captures a local distribution of 3D shapes in the feature manifold, is adopted for efficient context-based re-ranking. As a result, for each query the retrieval task can be finished within one second, despite the unavoidable I/O overhead. We name the proposed 3D shape search engine, which combines GPU acceleration and Inverted File (Twice), GIFT. Besides its high efficiency, GIFT also significantly outperforms state-of-the-art methods in retrieval accuracy on various shape benchmarks and competitions.
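To make the role of an inverted file in this pipeline concrete, the following minimal Python sketch indexes quantized view features so that, at query time, only database shapes sharing codewords with the query's views are scored, rather than exhaustively comparing every view pair. The class, method, and variable names (e.g., InvertedFile, add_shape) are illustrative assumptions and do not correspond to the actual GIFT implementation.

from collections import defaultdict

class InvertedFile:
    def __init__(self):
        # codeword id -> list of (shape_id, view_id) postings
        self.postings = defaultdict(list)

    def add_shape(self, shape_id, view_codewords):
        # Index one database shape given the codeword assigned to each of its views.
        for view_id, codeword in enumerate(view_codewords):
            self.postings[codeword].append((shape_id, view_id))

    def query(self, query_codewords):
        # Vote only for shapes that share at least one codeword with the query views,
        # instead of comparing every (query view, database view) pair.
        votes = defaultdict(int)
        for codeword in query_codewords:
            for shape_id, _view_id in self.postings.get(codeword, ()):
                votes[shape_id] += 1
        # More shared codewords ~ lower matching cost; rank candidates accordingly.
        return sorted(votes.items(), key=lambda kv: -kv[1])

# Toy usage: two database shapes whose views quantize to codewords 3/7/7 and 1/2/3,
# queried with a shape whose views quantize to codewords 7 and 3.
index = InvertedFile()
index.add_shape("shape_A", [3, 7, 7])
index.add_shape("shape_B", [1, 2, 3])
print(index.query([7, 3]))   # [('shape_A', 3), ('shape_B', 1)]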
1. Introduction
3D shape retrieval is a fundamental issue in computer
vision and pattern recognition. With the rapid develop-
ment of large scale public 3D repositories, e.g., Google 3D
Warehouse or TurboSquid, and large scale shape bench-
marks, e.g., ModelNet [39], SHape REtrieval Contest
(SHREC) [14, 31], the scalability of 3D shape retrieval al-
gorithms becomes increasingly important for practical ap-
plications. However, the efficiency issue has been more or less ignored in previous works, even though enormous efforts have been devoted to retrieval effectiveness, that is, to designing informative and discriminative features [12, 2, 17, 6, 40, 15, 18] that boost retrieval accuracy. As suggested in [14], many of these algorithms do not scale up to large 3D shape databases due to their high time complexity.
Meanwhile, since human visual perception of 3D shapes depends upon 2D observations, projective analysis has long been a basic and inherent tool in the 3D shape domain, with applications to segmentation [38], matching [24], reconstruction, etc. Specifically in 3D shape retrieval, projection-based methods demonstrate impressive performance. Especially in recent years, the success of planar image representations [7, 35, 43] has made it easier to describe 3D models using depth or silhouette projections.
Generally, a typical 3D shape search engine comprises the following four components (see also Fig. 1):
1. Projection rendering. With a 3D model as input, the
output of this component is a collection of projec-
tions. Most methods set an array of virtual cameras
at pre-defined viewpoints to capture views. These viewpoints can be the vertices of a dodecahedron [4],
located on the unit sphere [35], or around the lateral
surface of a cylinder [24]. In most cases, pose nor-
malization [22] is needed for the sake of invariance to
translation, rotation and scale changes.
2. View feature extraction. The role of this component is
to obtain multiple view representations, which largely affects the retrieval quality. A widely-used paradigm is the Bag-of-Words (BoW) model [7], since it has shown its superiority as a natural image descriptor. However, in order to achieve better performance, many features [14] are of extremely high dimension. As a consequence, raw descriptor extraction (e.g., SIFT [20]), quantization, and distance calculation are all time-consuming.
3. Multi-view matching. This component establishes the
correspondence between two sets of view features and returns the matching cost between the two 3D models. Since
at least a set-to-set matching strategy [25, 26, 27, 16, 9]
is required, this stage suffers from high time complex-
ity even when using the simplest Hausdorff matching.
Hence, the usage of algorithms incorporated with some