Fig. 3. Multi-functional gripper with a retractable mechanism that enables
quick and automatic switching between suction (pink) and grasping (blue).
The “known” objects are provided to the
system at training time, both as physical objects and as
representative product images (images of objects available
on the web), while the “novel” objects are provided only at
test time in the form of representative product images.
Overall approach. The system follows a grasp-first-then-
recognize workflow. For each pick-and-place operation, it
first uses FCNs to infer the pixel-wise affordances of four
different grasping primitive actions, ranging from suction
to parallel-jaw grasps (Section IV). It then selects the
grasping primitive action with the highest affordance, picks
up one object, isolates it from the clutter, holds it up in
front of the cameras, recognizes its category, and places it
in the appropriate bin.
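In pseudocode, one pick-and-place cycle might be organized as
in the following sketch; the helper names (infer_affordances,
execute_primitive, recognize, place_in_bin) are hypothetical
placeholders for the components described above, not the
system's actual API:

    import numpy as np

    PRIMITIVES = ["suction-down", "suction-side", "grasp-down", "flush-grasp"]

    def pick_and_place_once(rgbd_views, infer_affordances, execute_primitive,
                            recognize, place_in_bin):
        # One dense affordance map (H x W) per grasping primitive.
        maps = {p: infer_affordances(rgbd_views, p) for p in PRIMITIVES}

        # Select the primitive and pixel with the globally highest affordance.
        primitive = max(PRIMITIVES, key=lambda p: maps[p].max())
        pixel = np.unravel_index(maps[primitive].argmax(),
                                 maps[primitive].shape)

        # Grasp at that pixel, isolate the object from the clutter, hold it
        # up to the recognition cameras, then classify and place it.
        if execute_primitive(primitive, pixel):
            category = recognize()
            place_in_bin(category)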
Although the object recognition algorithm is trained only on
known objects, it is able to recognize novel objects through
a learned cross-domain image matching embedding between
observed images of held objects and product images (Section
V).
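A minimal sketch of this matching step, assuming the learned
embedding networks have already mapped both image domains into
a shared feature space (the function and variable names here
are illustrative, not the paper's implementation):

    import numpy as np

    def recognize_by_matching(observed_feat, product_feats, labels):
        """Nearest-neighbor category recognition in a shared embedding space.

        observed_feat: (D,) embedding of the image of the held object.
        product_feats: (N, D) embeddings of the product images.
        labels:        N category labels, one per product image.
        """
        # Cosine similarity between the held object and every product image.
        a = observed_feat / np.linalg.norm(observed_feat)
        b = product_feats / np.linalg.norm(product_feats, axis=1, keepdims=True)
        similarities = b @ a
        return labels[int(np.argmax(similarities))]

Because product images of novel objects only need to be embedded
at test time, no re-training is required when the object set
changes.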
Advantages. This system design has several advantages.
First, the affordance-based grasping algorithm is model-free,
agnostic to object identities, and generalizes to novel
objects without re-training. Second, the category recognition
algorithm works without task-specific data collection or re-
training for novel objects, which makes it scalable for appli-
cations in warehouse automation and service robots, where
the range of observed object categories is large and dynamic.
Third, our grasping framework supports multiple grasping
modes with a multi-functional gripper and thus handles a
wide variety of objects. Finally, the entire processing pipeline
requires only a few forward passes through deep networks
and thus executes quickly (Table II).
System setup. Our system features a 6DOF ABB IRB
1600id robot arm next to four picking work-cells. The robot
arm’s end-effector is a multi-functional gripper with two
fingers for parallel-jaw grasps and a retractable suction cup
(Fig. 3). This gripper was designed to function in cluttered
environments: finger and suction cup length are specifically
chosen such that the bulk of the gripper body does not
need to enter the cluttered space. Each work-cell has a
storage bin and four statically mounted RealSense SR300
RGB-D cameras (Fig. 2): two cameras overlooking the
storage bins are used to infer grasp affordances, while the
other two pointing towards the robot gripper are used to
recognize objects in the gripper. Although our experiments
were performed with this setup, the system was designed to
Fig. 4. Multiple motion primitives for suction and grasping to ensure
successful picking for a wide variety of objects in any orientation.
(Panels: suction down, suction side, grasp down, flush grasp.)
be flexible for picking and placing between any number of
reachable work-cells and camera locations. Furthermore, all
manipulation and recognition algorithms in this paper were
designed to be easily adapted to other system setups.
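As a rough illustration of how the camera roles in one work-cell
could be organized (a hypothetical configuration sketch, not the
system's actual naming or calibration data):

    # Hypothetical layout of one picking work-cell: four statically
    # mounted RGB-D cameras, split between the two roles described above.
    WORKCELL = {
        "storage_bin": "bin-0",
        "cameras": {
            "bin-view-0":     "affordance",   # overlooks the storage bin
            "bin-view-1":     "affordance",   # overlooks the storage bin
            "gripper-view-0": "recognition",  # points at the robot gripper
            "gripper-view-1": "recognition",  # points at the robot gripper
        },
    }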
IV. MULTI-AFFORDANCE GRASPING
The goal of the first step in our system is to robustly
grasp objects from a cluttered scene without relying on their
object identities or poses. To this end, we define a set of
four grasping primitive actions that are complementary to
each other in terms of utility across different object types and
scenarios – empirically maximizing the variety of objects and
orientations that can be picked with at least one primitive.
Given RGB-D images of the cluttered scene at test time, we
infer the dense pixel-wise affordances for all four primitives.
A task planner then selects and executes the primitive with
the highest affordance (more details of this planner can be
found in the Appendix).
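As a toy illustration of this step, the sketch below runs one
small fully convolutional network per primitive over a dummy
RGB-D input and lets the planner pick the highest-scoring
primitive; the 6-channel input encoding and the tiny architecture
are stand-ins, not the networks used in the paper:

    import torch
    import torch.nn as nn

    class AffordanceFCN(nn.Module):
        """Minimal FCN: maps an RGB-D image to a dense affordance map."""
        def __init__(self, in_channels=6):  # e.g. RGB + 3-channel depth encoding
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 1, 1),  # one affordance logit per pixel
            )

        def forward(self, x):
            return torch.sigmoid(self.net(x))  # affordances in [0, 1]

    # One forward pass per primitive yields four dense affordance maps.
    primitives = ["suction-down", "suction-side", "grasp-down", "flush-grasp"]
    models = {p: AffordanceFCN().eval() for p in primitives}
    rgbd = torch.rand(1, 6, 480, 640)  # dummy RGB-D input
    with torch.no_grad():
        maps = {p: m(rgbd)[0, 0] for p, m in models.items()}
    best = max(primitives, key=lambda p: maps[p].max().item())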
A. Grasping Primitives
We define four grasping primitives to achieve robust
picking for typical household objects. Fig. 4 shows example
motions for each primitive. Each is implemented
as a set of guarded moves, with collision avoidance and
quick success or failure feedback mechanisms: for suction,
this comes from flow sensors; for grasping, this comes from
contact detection via force feedback from sensors below
the work-cell. Robot arm motion planning is automatically
executed within each primitive with stable IK solves [26].
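A sketch of what one such guarded move could look like for
suction, with hypothetical move_arm and read_flow_sensor
interfaces standing in for the motion and sensing layers (the
grasping primitives would poll force feedback instead):

    import time

    def guarded_suction_move(move_arm, read_flow_sensor,
                             timeout=5.0, seal_threshold=0.5):
        """Approach with the suction cup and report success via airflow.

        move_arm:         starts the collision-checked approach motion.
        read_flow_sensor: returns normalized airflow in [0, 1]; airflow
                          drops once a suction seal forms on the object.
        """
        move_arm()
        deadline = time.time() + timeout
        while time.time() < deadline:
            if read_flow_sensor() < seal_threshold:
                return True   # seal detected: quick success feedback
            time.sleep(0.01)
        return False          # no seal within timeout: quick failure feedback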
These primitives are as follows:
Suction down grasps objects vertically with a vacuum
gripper. This primitive is particularly robust for objects
with large and flat suctionable surfaces (e.g. boxes, books,
wrapped objects), and performs well in heavy clutter.
Suction side grasps objects from the side by approaching
with a vacuum gripper tilted at an angle. This primitive is
well suited to thin and flat objects resting against walls, which
may not have suctionable surfaces from the top.
Grasp down grasps objects vertically using the two-finger
parallel-jaw gripper. This primitive is complementary to
the suction primitives in that it is able to pick up objects
with smaller, irregular surfaces (e.g. small tools, deformable
objects), or objects made of semi-porous materials that prevent a
good suction seal (e.g. cloth).