Fig. 2: Detecting and executing grasps: From left to right: Our system obtains an RGB-D image from a Kinect mounted on the robot,
and searches over a large space of possible grasps, for which some candidates are shown. For each of these, it extracts a set of raw features
corresponding to the color and depth images and surface normals, then uses these as inputs to a deep network which scores each rectangle.
Finally, the top-ranked rectangle is selected and the corresponding grasp is executed using the parameters of the detected rectangle and the
surface normal at its center. Red and green lines correspond to the gripper plates; blue regions in the RGB-D features indicate masked-out pixels.
may use different subsets of the modalities. In this work, we will present a structured regularization method that guides the learning algorithm to select such subsets without imposing hard constraints on network structure.
Structured Learning and Structured Regularization: Several approaches have been proposed that use a specially designed regularization function to impose structure on a set of learned parameters without directly enforcing it.
Jalali et al. [26] used a group regularization function in the
multitask learning setting, where one set of features is used for
multiple tasks. This function applies high-order regularization
separately to particular groups of parameters. Their function
regularized the number of features used for each task in a set of
multi-class classification tasks solved by softmax regression.
Intuitively, this encodes the belief that only some subset of
the input features will be useful for each task, but this set of
useful features might vary between tasks.
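As a concrete illustration (our notation; a representative group penalty, not the exact function of [26] nor the one we introduce later), such a regularizer can be written as

\[
R(W) = \lambda \sum_{g \in \mathcal{G}} \max_{i \in g} |w_i|,
\]

where each group $g \in \mathcal{G}$ collects the parameters tying one input feature to all tasks. The max charges a group once as soon as any of its members becomes nonzero, so most features are switched off entirely; combining such a block term with an element-wise $\ell_1$ penalty, as in [26], additionally allows the set of active features to differ from task to task.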
A few works have also explored the use of structured regularization in deep learning. The Topographic ICA algorithm [24] is a feature-learning approach that applies a similar
penalty term to feature activations, but not to the weights
themselves. Coates and Ng [8] investigate the problem of
selecting receptive fields, i.e., subsets of the input features
to be used together in a higher-level feature. The structure
of the network is learned first, then fixed before learning the
parameters of the network.
III. DEEP LEARNING FOR GRASP DETECTION:
SYSTEM AND MODEL
In this work, we will present an algorithm for robotic grasp
detection from a single RGB-D view. Our approach is based on machine learning, but distinguishes itself from previous approaches by learning not only the weights used to rank prospective grasps, but also the features used to rank them, which were previously hand-engineered.
We will do this using deep learning methods, learning a
set of RGB-D features which will be extracted from each
candidate grasp, then used to score that grasp. Our approach
will include a structured multimodal regularization method
which improves the quality of the features learned from
RGB-D data without constraining network structure.
In our system for robotic grasping, as shown in Fig. 2, the
robot first obtains an RGB-D image of the scene containing
objects to be grasped. A small deep network is used to score
potential grasps in this image, and a small candidate set of the
top-ranked grasps is provided to a larger deep network, which
yields a single best-ranked grasp.
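A minimal sketch of this two-pass cascade (illustrative Python; `score_small`, `score_large`, and the feature matrix stand in for the two trained networks and the extracted rectangle features, and are not the paper's implementation):

```python
import numpy as np

def detect_grasp(score_small, score_large, features, k=100):
    """Two-pass detection: a small, fast scorer prunes the candidate
    rectangles; a larger, more accurate scorer re-ranks the survivors.
    score_small/score_large map an (N, D) feature matrix to N scores;
    features holds one row per candidate rectangle. Returns the index
    of the best-ranked candidate."""
    coarse = score_small(features)       # cheap scores for all candidates
    top_k = np.argsort(coarse)[-k:]      # keep only the k best
    fine = score_large(features[top_k])  # expensive re-ranking
    return int(top_k[np.argmax(fine)])

# Toy usage: random linear scorers stand in for the two networks.
rng = np.random.default_rng(0)
feats = rng.standard_normal((1000, 24))
w_s, w_l = rng.standard_normal(24), rng.standard_normal(24)
best = detect_grasp(lambda X: X @ w_s, lambda X: X @ w_l, feats)
```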
In this work, we will represent potential grasps using
oriented rectangles in the image plane as seen on the left in
Fig. 2, with one pair of parallel edges corresponding to the
robotic gripper [28]. Each rectangle is thus parameterized by
the X and Y coordinates of its upper-left corner, its width,
height, and orientation in the image plane, giving a five-dimensional search space for potential grasps. Grasps will be
ranked based on features extracted from the RGB-D image
region contained inside their corresponding rectangle, aligned
to the gripper plates, as seen in the center of Fig. 2.
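For concreteness, the five rectangle parameters can be packaged as follows (a hypothetical helper with our own field names, not code from the system):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GraspRectangle:
    """Oriented grasp rectangle in the image plane: (x, y) is the
    upper-left corner in pixels, width spans the gripper opening
    between the plates, height spans the plate extent, and theta is
    the in-plane rotation in radians."""
    x: float
    y: float
    width: float
    height: float
    theta: float

    def corners(self) -> np.ndarray:
        """Four corners, rotating about the upper-left corner."""
        c, s = np.cos(self.theta), np.sin(self.theta)
        R = np.array([[c, -s], [s, c]])
        local = np.array([[0.0, 0.0], [self.width, 0.0],
                          [self.width, self.height], [0.0, self.height]])
        return local @ R.T + np.array([self.x, self.y])
```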
To translate a rectangle such as that shown on the right in Fig. 2 into a gripper pose for grasping, we find the point with
the minimum depth inside the central third (horizontally) of
the rectangle. We then use the averaged surface normal around
this point to determine the approach vector for the gripper.
The orientation of the detected rectangle is translated to a
rotation around this vector to orient the gripper. We use the
X-Y coordinates of the rectangle center along with the depth
of the closest point to determine a grasping point in the robot’s
coordinate frame. We compute a pre-grasp position by shifting
10 cm back from the grasping point along this approach vector
and position the gripper at this point. We then approach the
object along the approach vector and grasp it.
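The following sketch mirrors these steps under simplifying assumptions: the depth image has already been registered to per-pixel 3-D points and outward-pointing unit normals in the robot frame, and `central_mask` marks the central third of the detected rectangle (all names are ours, not the paper's code):

```python
import numpy as np

def rect_to_grasp(depth, points, normals, central_mask, center_px,
                  pregrasp_offset=0.10, nbhd=5):
    """depth: (H, W); points, normals: (H, W, 3) in the robot frame;
    central_mask: (H, W) bool; center_px: (row, col) rectangle center."""
    # 1. Closest point: minimum depth inside the rectangle's central third.
    d = np.where(central_mask, depth, np.inf)
    r, c = np.unravel_index(int(np.argmin(d)), d.shape)

    # 2. Approach vector: average the surface normals in a small window
    #    around that point, then re-normalize.
    win = normals[max(r - nbhd, 0):r + nbhd + 1,
                  max(c - nbhd, 0):c + nbhd + 1]
    approach = win.reshape(-1, 3).mean(axis=0)
    approach /= np.linalg.norm(approach)

    # 3. Grasp point: X-Y of the rectangle center, depth of the closest
    #    point. (The rectangle's orientation is separately mapped to a
    #    rotation about the approach vector; omitted here.)
    grasp = points[center_px].copy()
    grasp[2] = points[r, c][2]

    # 4. Pre-grasp: back off 10 cm along the outward approach vector.
    pregrasp = grasp + pregrasp_offset * approach
    return grasp, pregrasp, approach
```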
Using a standard feature learning approach such as a sparse auto-encoder [21], a deep network can be trained for the problem of grasping-rectangle recognition (i.e., does a given rectangle in image space correspond to a valid robotic grasp?).
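For reference, a one-layer sparse auto-encoder objective in a common generic form (we use an L1 activation penalty and ReLU units for brevity; [21] gives the original formulation, and all names and hyperparameters here are illustrative):

```python
import numpy as np

def sparse_ae_loss(W_enc, b_enc, W_dec, b_dec, X, lam=1e-3):
    """Squared reconstruction error plus an L1 penalty on the hidden
    activations, driving most of them to zero (sparse features).
    X: (N, D) input patches; W_enc: (D, H); W_dec: (H, D)."""
    H = np.maximum(X @ W_enc + b_enc, 0.0)   # hidden activations
    X_hat = H @ W_dec + b_dec                # linear reconstruction
    recon = np.mean(np.sum((X - X_hat) ** 2, axis=1))
    return recon + lam * np.mean(np.sum(np.abs(H), axis=1))

# Toy evaluation with random weights and data.
rng = np.random.default_rng(0)
X = rng.standard_normal((32, 49))
W_enc, W_dec = rng.standard_normal((49, 16)), rng.standard_normal((16, 49))
loss = sparse_ae_loss(W_enc, np.zeros(16), W_dec, np.zeros(49), X)
```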