well labeled images in over 22,000 object categories. By utilizing
ImageNet to train object detectors, Lai et al. [2012] demonstrated that the resulting detectors can reliably label objects in 3D
scenes. While the amount of available 3D data continues to grow,
it is unlikely that it will ever come close to matching the volume
of image data. Moreover, compared to 2D images, 3D shapes are
inherently more difficult to acquire and process, requiring more ef-
fort to label and analyze. Our work demonstrates how the advantages of image data, namely its sheer volume and relative ease of processing, can be exploited to address challenges arising in the segmentation of 3D shapes.
Projective shape analysis. Treating a 3D shape as a collection
of 2D projections rendered from multiple directions is not new to
computer graphics. Murase and Nayar [1995] recognize an object
by matching its appearance with a large set of 2D images obtained
automatically by rendering 3D models under varying poses and illu-
minations. Lindstrom and Turk [2000] compute an image-space er-
ror metric from these projections to guide mesh simplification. Cyr
and Kimia [2001] generate projections from selected view direc-
tions and use them to identify 3D objects and their poses. Sketch-
or image-based 3D shape retrieval [Eitz et al. 2012] compares ob-
ject projections with query images or user-drawn sketches in 2D.
Similarities among 2D shapes can be evaluated using techniques such as the light field descriptor (LFD) [Chen et al. 2003] and cross-correlation [Makadia and
Daniilidis 2010]. Liu and Zhang [2007] embed a 3D mesh into the
spectral domain, turning the 3D segmentation problem into a con-
tour analysis one. 3D reconstruction from multi-view images is one
of the most fundamental problems in computer vision. Our work
applies projective analysis to a new application: semantic segmen-
tation of 3D shapes. Specifically, we fuse labeled segmentations
learned from back-projected 2D labels to obtain a coherent seman-
tic labeling of a 3D object.
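To make this fusion step concrete, below is a minimal sketch of one plausible fusion rule, weighted per-face majority voting over back-projected view labels. The array layout and the convention that -1 marks a face invisible in a view are our own assumptions, and this is not the paper's actual formulation, which seeks a coherent labeling across the surface.

    import numpy as np

    def fuse_view_labels(face_labels, view_weights=None):
        # face_labels: (V, F) integer array; entry [v, f] is the label
        # back-projected onto mesh face f from view v, or -1 if face f
        # is not visible in that view (a convention assumed here).
        # view_weights: optional (V,) array of per-view confidences.
        V, F = face_labels.shape
        n_labels = int(face_labels.max()) + 1
        w = np.ones(V) if view_weights is None else np.asarray(view_weights, float)
        votes = np.zeros((F, n_labels))
        for v in range(V):
            visible = np.nonzero(face_labels[v] >= 0)[0]
            votes[visible, face_labels[v, visible]] += w[v]
        # Each face takes the label with the largest accumulated vote;
        # a coherent labeling would further smooth votes across
        # adjacent faces rather than decide each face independently.
        return votes.argmax(axis=1)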
Image and shape hybrid processing. 3D shape reconstruction
often benefits from utilizing available 2D data, e.g., from registered
photographs, to improve the quality of 3D scans [Li et al. 2011]. On
the other hand, leveraging a priori 3D geometry of a given object
category can alleviate the ill-posed nature of image analysis from
single photographs. Chang et al. [2009] and Pepik et al. [2012]
combine the representational power of 3D objects with 2D object
category detectors to estimate viewpoints. Xu et al. [2011] take a
data-driven approach for photo-inspired 3D shape creation, where
the best matching 3D candidate is deformed to fit the silhouette of
the object captured in a single photograph. In our work, we also
take a hybrid approach where the semantics of 3D shapes is guided
by constraints learned via projective shape analysis.
Image retrieval. Measuring image similarity for retrieval is ex-
tensively studied in computer vision; see [Xiao et al. 2010] for
a systematic study of image features for scene retrieval. Well-
known distance measures between 2D shapes include Hamming
distance, Hausdorff distance [Baddeley 1992], shock graph edit
distance [Klein et al. 2001], distance between Fourier descrip-
tors [Chen et al. 2003], inner distance shape context [Ling and Ja-
cobs 2007], and context-sensitive shape similarity [Bai et al. 2010].
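As a concrete point of reference for one of these measures, the symmetric Hausdorff distance between two 2D shapes represented as contour point sets can be computed directly; a minimal sketch using SciPy:

    import numpy as np
    from scipy.spatial.distance import directed_hausdorff

    def hausdorff_2d(a, b):
        # a, b: (N, 2) and (M, 2) arrays of 2D contour points. The
        # symmetric distance is the larger of the two directed terms.
        return max(directed_hausdorff(a, b)[0], directed_hausdorff(b, a)[0])

    # Example: samplings of concentric circles with radii 1.0 and 1.1.
    t = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
    circle = np.stack([np.cos(t), np.sin(t)], axis=1)
    print(hausdorff_2d(circle, 1.1 * circle))  # ~0.1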
In contrast to previous attempts, we not only retrieve a 2D shape but also infer a semantic labeling of its interior. Unlike existing contour-based methods [Ling and Jacobs 2007], our region-
based analysis allows shape retrieval and label transfer to be con-
ducted in a coherent manner. Moreover, our image retrieval is not
cross-category, but within-category, with the goal of finding shapes
with similar topological features to guide part-aware label trans-
fer. To properly evaluate the differences between the corresponding parts of two shapes, we implicitly warp one shape to match the other before computing dissimilarity using a topology-aware Hausdorff distance measure.

Figure 3: Region-based matching via warp alignment. Both the labeled images (left column) and the query projection (middle column) are cut into axis-aligned slabs. Each labeled image is then warped to match the query projection. The dissimilarity is measured on the warp-aligned shapes, allowing the matching to favor the shape with a similar topology (top row) over the one with parts at similar scales and positions (bottom row). Note that although the bottom chair is visually more similar, the top chair is more useful for labeling the armrest area of the query projection.
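For intuition, here is a rough sketch of this slab-based matching under simplifying assumptions of our own: equal-height horizontal slabs cut from the bounding box, nearest-neighbor resampling as the warp, and plain pixel disagreement standing in for the topology-aware Hausdorff measure:

    import numpy as np

    def crop_to_foreground(mask):
        # Crop a binary mask to the bounding box of its foreground.
        ys, xs = np.nonzero(mask)
        return mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

    def resize_nn(mask, shape):
        # Nearest-neighbor resampling of a mask to (height, width);
        # this plays the role of the per-slab warp.
        h, w = shape
        ys = np.arange(h) * mask.shape[0] // h
        xs = np.arange(w) * mask.shape[1] // w
        return mask[np.ix_(ys, xs)]

    def slab_warp_dissimilarity(labeled, query, n_slabs=4):
        # Cut both cropped masks into n_slabs horizontal slabs, warp
        # each labeled slab onto the corresponding query slab, and
        # report the fraction of pixels on which the warp-aligned
        # shapes disagree.
        labeled = crop_to_foreground(labeled)
        query = crop_to_foreground(query)
        mismatched, total = 0, 0
        for i in range(n_slabs):
            l0 = i * labeled.shape[0] // n_slabs
            l1 = (i + 1) * labeled.shape[0] // n_slabs
            q0 = i * query.shape[0] // n_slabs
            q1 = (i + 1) * query.shape[0] // n_slabs
            q_slab = query[q0:q1]
            l_slab = resize_nn(labeled[l0:l1], q_slab.shape)
            mismatched += np.count_nonzero(l_slab != q_slab)
            total += q_slab.size
        return mismatched / total

Because each slab is stretched independently, an exemplar whose parts merely sit at different heights or scales can still align well, which is the effect Figure 3 illustrates.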
Image label transfer. Semantic label transfer is another core
problem in computer vision. Existing approaches can be classified as learning-based or non-parametric. The former learn a model for each object category; a successful example is TextonBoost [Shotton et al. 2006], which trains a conditional random field (CRF) model. A drawback of learning-based methods is that they do not scale well with the number of object categories. With
the emergence of large image databases, non-parametric methods
have demonstrated their advantages. Given an input image, Liu et
al. [2011a] first retrieve its nearest neighbors from a large database
using GIST matching [Oliva and Torralba 2001]; they then transfer annotations from each of these neighbors and integrate them via dense correspondences estimated with SIFT flow [Liu et al. 2011b]. Compared to learning-based approaches, this method has few parameters and allows more images and/or new categories to be added without requiring additional training. When the set of anno-
tated images is small, Zhang et al. [2010] and Chen et al. [2012]
further learn an object model from the retrieved nearest neighbors
to improve the performance of label transfer. Our approach incor-
porates the same nearest neighbor idea, but instead of performing
label transfer within the whole image domain, we compute seman-
tic labeling for the interior of the 2D shape only. This provides
us additional constraints for obtaining a better labeling result. In
addition, almost all existing dense correspondence estimation ap-
proaches [Liu et al. 2011b; Berg et al. 2005; Leordeanu and Hebert
2005; Duchenne et al. 2011] rely on local intensity patterns and are
unsuitable for transferring labels to textureless 2D projections.
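For illustration, the following is a minimal sketch of such nearest-neighbor label transfer restricted to the shape interior; every name here is hypothetical, a downsampled-mask descriptor stands in for GIST, and a whole-shape nearest-neighbor resize stands in for dense correspondence:

    import numpy as np

    def mask_descriptor(mask, size=16):
        # Crude retrieval descriptor: a downsampled copy of the binary
        # mask, standing in for GIST or any stronger image feature.
        ys = np.arange(size) * mask.shape[0] // size
        xs = np.arange(size) * mask.shape[1] // size
        return mask[np.ix_(ys, xs)].astype(float).ravel()

    def transfer_labels(query_mask, database):
        # database: list of (exemplar_mask, label_map) pairs, where
        # label_map is an integer image with 0 marking background.
        q = mask_descriptor(query_mask)
        _, labels = min(database,
                        key=lambda e: np.linalg.norm(mask_descriptor(e[0]) - q))
        # Warp the winning exemplar's labels onto the query frame ...
        h, w = query_mask.shape
        ys = np.arange(h) * labels.shape[0] // h
        xs = np.arange(w) * labels.shape[1] // w
        warped = labels[np.ix_(ys, xs)]
        # ... and keep them only inside the query shape, the extra
        # constraint a 2D projection provides over a full photograph.
        return np.where(query_mask, warped, 0)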
3 Overview
Our image-driven shape analysis is based on a dataset of pre-labeled images that captures semantic knowledge about the relevant
class of shapes. The input is a 3D mesh model, possibly non-
manifold, incomplete, or self-intersecting. The 3D shape and the
labeled images belong to the same semantic class. We assume that
both the input and the objects captured in the labeled images are
in their upright orientations. In practice, we found the assumption
to hold for the vast majority of the data, e.g., almost all chair im-
ages found on Google. We apply our multi-view shape matching