II. RELATED WORK
Early approaches to 3D shape recognition focus on the design of handcrafted shape features. In recent years, inspired by the success of machine learning (particularly deep learning) in computer vision, many learning-based approaches have been proposed to learn adaptive shape descriptors from 3D data. Our work follows this line of research. In this section, we first give a brief review of handcrafted features, followed by a more detailed review of learning-based methods.
A. HANDCRAFTED FEATURES
Classic shape descriptors, such as statistical moments [29], the Fourier descriptor [29], [30] and the eigenvalue descriptor [31], are devoted to the global description of shapes. These methods are sensitive to non-rigid transformations and topological changes. To overcome this weakness, local geometric descriptors were proposed as building blocks for global shape features, e.g. spin images [32], shape context [33] and mesh HOG [34]. Nevertheless, such descriptors are not robust to local geometric deformations or perturbations. Recently, diffusion-based approaches, which enjoy strong robustness to isometric deformations and small perturbations of the surface, have emerged as a promising direction for shape description. These methods model the geometric structure of a shape with a diffusion process, and the shape descriptors are built upon the associated diffusion operators, e.g. the discrete Laplace-Beltrami operator [35] and the heat kernel signature [36], [37].
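To make the diffusion idea concrete, the heat kernel signature [36] describes a surface point $x$ by the amount of heat remaining at $x$ at time $t$ after a unit heat source is placed at $x$ at time zero. In terms of the eigenvalues $\lambda_i$ and eigenfunctions $\phi_i$ of the Laplace-Beltrami operator, it is the diagonal of the heat kernel:
\[
\mathrm{HKS}(x,t) \;=\; k_t(x,x) \;=\; \sum_{i \ge 0} e^{-\lambda_i t}\,\phi_i(x)^2 .
\]
Since isometric deformations preserve the Laplace-Beltrami spectrum, the descriptor is invariant to them by construction.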
B. LEARNING-BASED METHODS
Embracing recent advances of deep learning and neural networks (NNs) in image classification (a task analogous to 3D shape recognition) [16], [17], most learning-based methods for 3D shape recognition are built upon NN architectures. Three formats of 3D data are mainly used in NN-based methods: points, voxels and views. According to the data format of the NN's input, the NN-based methods can be classified into three categories: point-cloud based methods, voxel-based methods, and view-based methods. We focus more on the view-based methods in the literature review, as our method belongs to this type. It is noted that there is a group of learning-based approaches that take mesh surfaces as input by generalizing CNNs to non-Euclidean geometries (e.g. spectral CNNs [38], anisotropic CNNs [39]) or by using handcrafted features of objects as input (e.g. [40]). These approaches are devoted to matching tasks, without published results on standard shape recognition benchmarks. Thus, we omit this group of approaches in our literature review. It is also noted that a few approaches use two or more sources of 3D data for further improvement; e.g. both voxels and views are used in [41]. Last but not least, there are also some approaches built upon other machine learning techniques, e.g. multi-hypergraph learning [42].
1) POINT-CLOUD BASED METHODS
In contrast to image data, which is row-column indexed, a point cloud (except for those computed from depth images) is generally a set of point coordinates with irregular organization and unordered structure, which prevents the direct use of traditional image CNNs in point-cloud based methods. To address this fundamental challenge arising from the raw data, new NN architectures are needed. A pioneering work is PointNet [43], a permutation-invariant deep architecture that learns a spatial-encoding representation for each point and combines them into a global descriptor. In [44], the PointNet architecture is extended to a hierarchical version called PointNet++, which aims at better exploiting local structures of shapes by applying PointNet recursively on a nested grouping of the input point cloud. Another work with the same purpose of exploiting local shape structure is [45], which uses kernel correlation and graph pooling. The grouping scheme in PointNet++ implicitly exploits the spatial distribution of points. For explicit exploitation, the kd-Net [46] builds a kd-tree on the input point set and runs hierarchical feature extraction from the leaves to the root. Due to the non-overlapping partition produced by the kd-tree, the kd-Net lacks the overlapping receptive fields that are useful for recognition. To address this issue, Li et al. [47] proposed to replace the kd-tree with a self-organizing map (SOM) and perform k-NN search from points to SOM nodes, by which the receptive field overlap can be controlled. Instead of directly dealing with point sets in the network, Simonovsky and Komodakis [20] proposed to structure the point cloud as a graph and apply a graph CNN to process the graph-structured data.
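To illustrate the permutation-invariance principle shared by these point-set networks, the following minimal sketch (in PyTorch; layer sizes are hypothetical and far smaller than those in PointNet [43]) lifts each point with a shared MLP and aggregates with a symmetric max-pooling, so that reordering the input points cannot change the output:

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Minimal permutation-invariant point-set encoder (illustrative only;
    layer sizes are hypothetical, not those of PointNet [43])."""

    def __init__(self, feat_dim=128, num_classes=40):
        super().__init__()
        # Shared per-point MLP: applied identically to every point.
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, pts):                 # pts: (batch, num_points, 3)
        f = self.point_mlp(pts)             # (batch, num_points, feat_dim)
        g, _ = f.max(dim=1)                 # symmetric max-pool over points
        return self.classifier(g)           # point order is irrelevant

# Permutation check: shuffling the points leaves the output unchanged.
pts = torch.rand(2, 1024, 3)
net = TinyPointNet()
perm = torch.randperm(1024)
assert torch.allclose(net(pts), net(pts[:, perm]), atol=1e-5)
```

The aggregation is the key design choice: any symmetric function (max, sum or mean) over the point dimension yields permutation invariance; max-pooling is the choice made in PointNet.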
2) VOXEL-BASED METHODS
Voxels of 3D objects are a straightforward extension of pixels of 2D images, by which an object shape is represented as a volumetric binary occupancy grid. Unlike point cloud data, voxels are well indexed; thus, image-based CNNs can be easily extended to handle voxelized data. A seminal work can be traced back to 3D-ShapeNet [18], a volumetric convolutional deep belief network which expresses a 3D shape as a probability distribution of binary variables on a voxel grid.
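As a minimal sketch of this representation (with an arbitrary resolution and a normalization of our own choosing, not the preprocessing of [18]), a point cloud can be converted into a binary occupancy grid by scaling it into the unit cube and marking every cell that contains at least one point:

```python
import numpy as np

def voxelize(points, resolution=32):
    """Convert an (N, 3) point cloud into a binary occupancy grid.

    Illustrative sketch only; real pipelines also handle orientation,
    padding and surface sampling density.
    """
    # Normalize the cloud into the unit cube [0, 1]^3.
    mins = points.min(axis=0)
    extent = np.ptp(points, axis=0).max()   # uniform scale keeps aspect ratio
    normalized = (points - mins) / (extent + 1e-9)

    # Map coordinates to integer cell indices and clamp to the grid.
    idx = np.clip((normalized * resolution).astype(int), 0, resolution - 1)

    grid = np.zeros((resolution,) * 3, dtype=np.uint8)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1  # mark occupied cells
    return grid

# Example: 2048 random samples -> a 32^3 binary occupancy grid.
grid = voxelize(np.random.rand(2048, 3))
print(grid.shape, grid.sum())               # (32, 32, 32), occupied-cell count
```

The cubic growth of such grids with resolution is what motivates the shallow architectures and sparsity-exploiting designs discussed next.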
Another early attempt is VoxNet [19], which uses a shallow volumetric CNN with a volumetric probabilistic occupancy grid representation. The VoxNet architecture is combined with an orientation estimation task in [48] for performance improvement. Since volumetric representations can easily become computationally intractable as resolution grows, the above voxel-based NNs have to be shallow. To make use of the power of deep learning, Brock et al. [49] proposed a deep voxel-based CNN architecture which can be trained effectively and efficiently. With the same purpose, Riegler et al. [50] proposed to exploit the sparsity of voxelized data to enable deeper networks without reducing resolution. To analyse the shape distribution of 3D objects, [51] uses a VAE (variational auto-encoder) to reconstruct full 3D shapes from voxelized single views. With the latent variables learned by the VAE,