aggregate the information captured from different orientations.
To speed up deep learning from voxels, Wang et al. [24] proposed O-CNN to learn global features based on a novel octree data structure. To learn local features from voxels, Han et al. [12] proposed a novel voxelization permutation strategy to eliminate the effect of rotation and orientation ambiguity on the 3D surface. Although voxel-based methods have the advantage of generating 3D shapes, they not only incur heavy computational cost but also require 3D shapes to be aligned. In addition, such methods usually discriminate shapes worse than the view-based methods described below.
C. View-Based Methods
Light Field Descriptor (LFD) [25] is the pioneer view-based
3D descriptor, which employs features of 2D silhouettes
in multiple views of 3D shapes. Instead of aggregating
multi-view information into global features, LFD evaluates the
dissimilarity between two shapes by comparing the 2D features of their corresponding view sets in a greedy way. Using the same strategy, GIFT [5] measures the difference between two shapes by the Hausdorff distance between their corresponding view sets. To bridge 2D sketches and 3D shapes for shape retrieval, it was further proposed to learn barycentric representations of 3D shapes from multiple views [26].
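As a rough illustration, the set-to-set matching behind such retrieval can be sketched as follows; this minimal Python version (with hypothetical names and plain Euclidean distances over per-view features) only mirrors the spirit of the Hausdorff comparison in GIFT, not its exact features or metric.

import numpy as np

def view_set_distance(views_a, views_b):
    # views_a, views_b: (num_views, feat_dim) arrays of per-view features.
    # Pairwise Euclidean distances between every view of A and every view of B.
    diff = views_a[:, None, :] - views_b[None, :, :]
    pairwise = np.linalg.norm(diff, axis=-1)   # (n_a, n_b)
    # Directed distances: each view is matched to its closest view in the other set.
    a_to_b = pairwise.min(axis=1).max()
    b_to_a = pairwise.min(axis=0).max()
    # The Hausdorff distance is the larger of the two directed distances.
    return max(a_to_b, b_to_a)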
DeepPano [6] was proposed to learn features from
PANORAMA views using CNN, where a PANORAMA view
can be regarded as the seamless aggregation of multiple views
captured on a circle. To eliminate the effect of rotation about
the up-oriented direction, row-wise max pooling was intro-
duced in DeepPano. With pose normalization, Sfikas et al. [27]
used CNN to learn 3D features from multiple PANORAMA
views which were stacked together in a consistent order.
Similarly, Sinha et al. [28] proposed to learn 3D features from another hand-crafted representation, the geometry image. In addition, RotationNet [29] was proposed to learn global features by treating pose labels as latent variables, which are optimized to self-align in an unsupervised manner.
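The row-wise max pooling used in DeepPano can be illustrated with a minimal sketch; assuming a panorama feature map whose width axis corresponds to the viewing angle around the upright axis, taking the max along that axis makes the output invariant to circular shifts, i.e., to rotation about that axis (the function name and tensor layout here are hypothetical):

import torch

def row_wise_max_pool(feature_map):
    # feature_map: (batch, channels, height, width) tensor, where width
    # sweeps the viewing angle around the upright axis. A rotation of the
    # shape circularly shifts the columns, and the max over columns is
    # unaffected by any such shift.
    return feature_map.max(dim=-1).values   # (batch, channels, height)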
Recently, Su et al. [3] proposed Multi-View CNN to learn
3D global features from multiple views. To describe a 3D
shape by multiple views, the content information within
multiple views is aggregated into the global feature through
max pooling. Max pooling is also employed to aggregate multiple views when learning local features for shape segmentation or correspondence [4]. To employ more content
information in each view, Savva et al. [30] concatenated all
view features for hierarchical abstraction in the CNN-based
model. By decomposing a view sequence into a set of view
pairs, Johns et al. [31] classified each view pair independently,
and then, learned an object classifier by weighting the contri-
bution of each view pair, which allowed 3D shape recognition
over arbitrary camera trajectories. To perform pooling more
efficiently, Wang et al. [8] proposed dominant set clustering to cluster the views taken from each shape, where pooling is performed within each cluster.
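The view pooling shared by these methods reduces to an element-wise max across per-view CNN features; a minimal sketch (dimensions hypothetical) makes both its order invariance and its information loss visible, since only one view can contribute each output element:

import torch

def max_pool_views(view_features):
    # view_features: (num_views, feat_dim) tensor, one row of CNN
    # features per rendered view. The element-wise max over the view
    # axis yields a single global descriptor that is unaffected by the
    # order (and hence the rotation) of the views, but discards the
    # non-maximal content of every view and all inter-view structure.
    return view_features.max(dim=0).values   # (feat_dim,)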
Although pooling resolves the effect of rotation of 3D
shapes, it still suffers from two kinds of information loss,
i.e., the content information of almost all views and the
spatial information among the views. The spatial information
between pairwise views is also disregarded by the view pair
decomposition [31]. Savva et al. [30] compensated for these two kinds of loss by concatenating all views; however, concatenation is sensitive to the position of the first view.
To resolve the aforementioned issues, SeqViews2SeqLabels
is proposed to learn 3D features via aggregating sequential
views with an RNN. The RNN-based aggregation not only preserves the content information of all views and the spatial information among the views, but also enables learning the semantics of the view sequence, which makes the features robust to the position of the first view.
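The core of such aggregation can be sketched by feeding per-view CNN features to a recurrent encoder in capture order; the layer sizes and names below are hypothetical, and the actual SeqViews2SeqLabels encoder is described in Section III:

import torch.nn as nn

class SeqViewEncoder(nn.Module):
    # Minimal sketch of RNN-based view aggregation. Unlike max pooling,
    # the recurrent state accumulates the content of every view and the
    # sequential (spatial) relations among them.
    def __init__(self, feat_dim=4096, hidden_dim=512):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)

    def forward(self, view_features):
        # view_features: (batch, num_views, feat_dim), in capture order.
        outputs, last_hidden = self.gru(view_features)
        # last_hidden: (1, batch, hidden_dim); its final state serves as
        # the aggregated shape feature in this sketch.
        return last_hidden.squeeze(0)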
D. CNN-RNN Based and RNN-RNN Based Models
SeqViews2SeqLabels is similar to CNN-RNN based and
RNN-RNN based models. Different from multiple views, Miyagi and Aono [32] employed multiple voxel slices to learn 3D global features. They used a CNN to extract the feature of each voxel slice, and then used an RNN for slice aggregation, where a softmax layer was employed to conduct 3D shape classification. Using a two-layer RNN, Le et al. [33]
proposed a CNN-RNN model to segment 3D shapes, where
multiple edge images were predicted to estimate the different
parts on a 3D shape. In addition, RNN-RNN based models,
especially seq2seq models, were originally proposed for text
understanding. Due to their powerful learning ability, they
have been successfully employed for image and speech under-
standing, such as scene text recognition [34], [35], image
caption generation [36] and speech recognition [37]. The
models in [34]–[36] were proposed to recognize what is in a single image. For example, [34] and [35] focused on recognizing the characters in an image, while [36] focused on recognizing the concepts in an image. Different from these tasks, SeqViews2SeqLabels recognizes what a sequence of multiple views represents. This difference makes the involved attention
play different roles. In our method, we use attention to highlight the views with characteristics distinctive of each shape class and to suppress the views with ambiguous appearance. Thus, our attention weights are computed at the image level.
In the methods of [35] and [36], attention is used to highlight
the parts with a specified meaning in an image, although mul-
tiple feature maps are involved. Thus, their attention weights
are computed at the part level. To represent the characteristics of each shape class at each step of the decoder, we propose a novel attention mechanism that differs from the ones employed in [35] and [37].
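The distinction between image-level and part-level attention can be made concrete with a small sketch; this is not the attention mechanism of Section III but a generic soft-attention scorer (all names and the scoring layer are hypothetical), shown only to illustrate that each weight here covers a whole view rather than a spatial location inside one image:

import torch.nn.functional as F

def image_level_attention(view_features, decoder_state, proj):
    # view_features: (num_views, feat_dim), one feature per whole view.
    # decoder_state: (hidden_dim,), the current decoder hidden state.
    # proj: a torch.nn.Linear(feat_dim, hidden_dim) scoring layer.
    scores = proj(view_features) @ decoder_state   # (num_views,)
    weights = F.softmax(scores, dim=0)             # one weight per view
    context = weights @ view_features              # (feat_dim,)
    return context, weights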
III. SEQVIEWS2SEQLABELS
In this section, SeqViews2SeqLabels is introduced in detail.
First, we provide an overview and then describe the key
elements, including capturing sequential views, view feature
extraction, the encoder-RNN, the decoder-RNN, and the atten-
tion mechanism in the subsequent five subsections.