3D Pose Estimation and 3D Model Retrieval for Objects in the Wild

Alexander Grabner¹   Peter M. Roth¹   Vincent Lepetit²,¹

¹ Institute of Computer Graphics and Vision, Graz University of Technology, Austria
² Laboratoire Bordelais de Recherche en Informatique, University of Bordeaux, France

{alexander.grabner,pmroth,lepetit}@icg.tugraz.at
Abstract
We propose a scalable, efficient and accurate approach to retrieve 3D models for objects in the wild. Our contribution is twofold. We first present a 3D pose estimation approach for object categories which significantly outperforms the state of the art on Pascal3D+. Second, we use the estimated pose as a prior to retrieve 3D models which accurately represent the geometry of objects in RGB images. For this purpose, we render depth images from 3D models under our predicted pose and match learned image descriptors of RGB images against those of rendered depth images using a CNN-based multi-view metric learning approach. In this way, we are the first to report quantitative results for 3D model retrieval on Pascal3D+, where our method chooses the same models as human annotators for 50% of the validation images on average. In addition, we show that our method, which was trained purely on Pascal3D+, retrieves rich and accurate 3D models from ShapeNet given RGB images of objects in the wild.
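The following Python sketch (PyTorch) is a minimal illustration of the retrieval step described above, not the authors' implementation; the network architecture, descriptor dimension, image sizes, and all names are assumptions. Two small CNNs embed an RGB query crop and depth renderings of candidate 3D models, rendered under the pose predicted for the query, into a common descriptor space, and the model whose descriptor is most similar to the query is retrieved. In the actual method, the two networks would be trained with a multi-view metric learning loss so that matching RGB/depth pairs lie close in descriptor space; here the weights are random, so the output is for illustration only.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DescriptorNet(nn.Module):
    # Small CNN mapping an image (RGB: 3 channels, depth: 1 channel)
    # to an L2-normalized descriptor. The architecture is an assumption.
    def __init__(self, in_channels, descriptor_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, descriptor_dim)

    def forward(self, x):
        x = self.features(x).flatten(1)
        return F.normalize(self.fc(x), dim=1)  # unit-length descriptors

rgb_net = DescriptorNet(in_channels=3)    # embeds real RGB crops
depth_net = DescriptorNet(in_channels=1)  # embeds rendered depth images

# One RGB query and depth renderings of 10 candidate 3D models,
# all rendered under the 3D pose predicted for the query object.
rgb_query = torch.randn(1, 3, 224, 224)
depth_renderings = torch.randn(10, 1, 224, 224)

q = rgb_net(rgb_query)            # descriptor of shape (1, 128)
d = depth_net(depth_renderings)   # descriptors of shape (10, 128)

# Descriptors are unit length, so the dot product equals cosine similarity;
# the candidate 3D model with the highest similarity is retrieved.
similarity = (q @ d.t()).squeeze(0)
best_model = similarity.argmax().item()
print("retrieved 3D model index:", best_model)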
1. Introduction
Retrieving 3D models for objects in 2D images, as shown in Fig. 1, is extremely useful for 3D scene understanding, augmented reality applications, and tasks like object grasping or object tracking. Recently, the emergence of large databases of 3D models such as ShapeNet [3] initiated substantial interest in this topic and motivated research on matching 2D images of objects against 3D models. However, there is no straightforward way to compare 2D images and 3D models, since they have considerably different representations and characteristics.
Figure 1: Given an RGB image (top), we predict a 3D pose and a 3D model for objects of different categories (bottom).

One approach to address this problem is to project 3D models onto 2D images, which is known as rendering [24]. This converts the task to comparing 2D images, which is, however, still challenging, because the appearance of objects in real images and synthetic renderings can significantly differ. In general, the geometry and texture of available 3D models do not exactly match those of objects in real images. Therefore, recent approaches [2, 10, 23, 28] use convolutional neural networks (CNNs) [7, 8, 22] to extract features from images which are partly invariant to these variations. In particular, these methods compute image descriptors from real RGB images and from synthetic RGB images generated by rendering 3D models under multiple poses. While this allows them to train a single CNN purely on synthetic data, there are two main disadvantages:
First, there is a significant domain gap between real and synthetic RGB images: real images are affected by complex lighting, uncontrolled degradation, and natural backgrounds, which makes it hard to render photo-realistic images from the available 3D models. Therefore, using a single CNN for feature extraction from both domains is limited in performance, and even domain adaptation [13] does not fully account for the different characteristics of real and synthetic images.