Object Recognition and Full Pose Registration from a Single Image for
Robotic Manipulation
Alvaro Collet Dmitry Berenson Siddhartha S. Srinivasa Dave Ferguson
Abstract— Robust perception is a vital capability for robotic manipulation in unstructured scenes. In this context, full pose estimation of relevant objects in a scene is a critical step towards the introduction of robots into household environments. In this paper, we present an approach for building metric 3D models of objects using local descriptors from several images. Each model is optimized to fit a set of calibrated training images, thus obtaining the best possible alignment between the 3D model and the real object. Given a new test image, we match the local descriptors to our stored models online, using a novel combination of the RANSAC and Mean Shift algorithms to register multiple instances of each object. A robust initialization step allows for arbitrary rotation, translation and scaling of objects in the test images. The resulting system provides markerless 6-DOF pose estimation for complex objects in cluttered scenes. We provide experimental results demonstrating orientation and translation accuracy, as well as a physical implementation in which the pose output is used by an autonomous robot to grasp objects in highly cluttered scenes.
I. INTRODUCTION
Autonomous robots operating in human environments
present some extremely challenging research topics in path
planning and dynamic perception, among others. Whether it
is in the workplace or in a household, a common characteristic is the lack of static surroundings: people walk around,
tables and chairs are moved, objects are left in different
places. In order to successfully navigate in, and interact
with, such an environment, accurate and robust dynamic
perception is a must. In particular, an object recognition
system that provides accurate 6-DOF pose is very important
for performing complex manipulation tasks.
The object recognition and registration system we propose handles arbitrarily complex non-planar objects, is fully
automatic and based on natural (marker-free) features of
a single image. It is robust to outliers, partial occlusions,
changes in illumination, scale and rotation. It is able to detect
multiple objects and multiple instances of the same object
in a single image, and provide accurate pose estimation
for every instance. Using a calibrated camera, it is able to
localize each object in the robot’s coordinate frame to enable
on-line manipulation, as shown in Fig. 1.
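As a minimal illustration of this last step (not code from the paper), the sketch below composes a hypothetical object pose estimated in the camera frame with a known camera-to-robot extrinsic calibration to express the object in the robot's coordinate frame. All names and numeric values here are assumptions for illustration only.

```python
import numpy as np

def to_homogeneous(R, t):
    """Pack a 3x3 rotation and a 3-vector translation into a 4x4 transform."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Hypothetical extrinsic calibration: pose of the camera in the robot frame.
T_robot_camera = to_homogeneous(np.eye(3), np.array([0.1, 0.0, 1.2]))

# Hypothetical output of the recognition system: object pose in the camera frame.
T_camera_object = to_homogeneous(np.eye(3), np.array([0.0, 0.0, 0.5]))

# Composing the two expresses the object in the robot's coordinate frame,
# which is the quantity a manipulation planner consumes.
T_robot_object = T_robot_camera @ T_camera_object
print(T_robot_object[:3, 3])  # object position in the robot frame
```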
A. Collet and D. Berenson are with The Robotics Institute, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213, USA. {acollet, dberenson}@cs.cmu.edu

S. Srinivasa and D. Ferguson are with Intel Research Pittsburgh, 4720 Forbes Ave., Suite 410, Pittsburgh, PA 15213, USA. {siddhartha.srinivasa, dave.ferguson}@intel.com

Fig. 1. Object grasping in a cluttered scene through pose estimation performed with a single image. (top left) Scene observed by the robot's camera, used for object recognition and pose estimation; coordinate frames show the pose of each object. (top right) Virtual environment reconstructed after running the pose estimation algorithm; each object is represented with simple geometry. (bottom) Our robot platform in the process of grasping an object, using only the pose information from this algorithm.

Our system takes the core algorithm of Gordon and Lowe [1] and extends it with a model alignment step that enables accurate localization (Section III-B), an automatic initialization step for pose registration, and the combination of RANSAC [2] with Mean Shift [3] clustering to greatly improve the efficiency of recognizing multiple instances of the same object. All of these contributions make the algorithm suitable for robotic manipulation of objects in cluttered scenes, using only a single input image.
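The details of this combination appear later in the paper; the sketch below conveys only the general idea, under our own assumptions: 2D match locations are grouped with Mean Shift (here scikit-learn's implementation), and a RANSAC pose solver (OpenCV's solvePnPRansac, standing in for the paper's pose estimator) is run independently inside each cluster, so each instance of the object yields its own 6-DOF pose. The bandwidth and minimum-match thresholds are illustrative, not the authors' values.

```python
import numpy as np
import cv2
from sklearn.cluster import MeanShift

def register_instances(pts_3d, pts_2d, K, bandwidth=60.0):
    """Sketch: Mean Shift groups matches by image location, then RANSAC
    estimates one 6-DOF pose per cluster (one per object instance).

    pts_3d: Nx3 model points matched against the test image.
    pts_2d: Nx2 corresponding image locations of those matches.
    K:      3x3 camera intrinsics matrix.
    """
    labels = MeanShift(bandwidth=bandwidth).fit_predict(pts_2d)
    poses = []
    for label in np.unique(labels):
        idx = np.where(labels == label)[0]
        if len(idx) < 6:  # too few matches to attempt a pose
            continue
        ok, rvec, tvec, inliers = cv2.solvePnPRansac(
            pts_3d[idx].astype(np.float64),
            pts_2d[idx].astype(np.float64),
            K, None)
        if ok and inliers is not None and len(inliers) >= 6:
            poses.append((rvec, tvec))  # one pose hypothesis per instance
    return poses
```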
To accomplish these goals, the system we propose uses SIFT [4] to extract local descriptors from natural features. As in [1], the system is separated into an off-line object modelling stage and an on-line recognition and registration stage. In the modelling stage, a sequence of images of an object is taken from different viewpoints using a camera with no pose information. The object is then segmented in each training image, either manually or automatically. Next, SIFT features are extracted for each image and matched across the entire sequence.
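As a rough sketch of this extraction-and-matching step (our own stand-in, not the authors' implementation), the snippet below extracts SIFT features with OpenCV and matches consecutive training images using Lowe's ratio test. Matching only consecutive pairs and the 0.8 ratio threshold are simplifying assumptions.

```python
import cv2

def match_training_sequence(images, ratio=0.8):
    """Sketch: SIFT features per training image, matched between
    consecutive pairs with Lowe's ratio test."""
    sift = cv2.SIFT_create()
    feats = [sift.detectAndCompute(img, None) for img in images]  # (keypoints, descriptors)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    pair_matches = []
    for (kp_a, des_a), (kp_b, des_b) in zip(feats, feats[1:]):
        good = []
        for pair in matcher.knnMatch(des_a, des_b, k=2):
            # Ratio test: keep only distinctive correspondences.
            if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
                good.append((kp_a[pair[0].queryIdx].pt, kp_b[pair[0].trainIdx].pt))
        pair_matches.append(good)
    return pair_matches
```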
Using a structure-from-motion bundle adjustment algorithm [5] described in