PointNetGPD: Detecting Grasp Configurations from Point Sets
Hongzhuo Liang¹†, Xiaojian Ma²†, Shuang Li¹, Michael Görner¹, Song Tang¹, Bin Fang², Fuchun Sun²∗, Jianwei Zhang¹
Abstract— In this paper, we propose an end-to-end grasp
evaluation model to address the challenging problem of localizing robot grasp configurations directly from the point cloud.
Compared to recent grasp evaluation metrics that are based on
handcrafted depth features and a convolutional neural network
(CNN), our proposed PointNetGPD is lightweight and can directly process the 3D point cloud that lies within the gripper for grasp evaluation. Taking the raw point cloud as input, our
proposed grasp evaluation network can capture the complex
geometric structure of the contact area between the gripper
and the object even if the point cloud is very sparse. To further
improve our proposed model, we generate a larger-scale grasp dataset with 350k real point clouds and grasps on the YCB object set for training. The performance of the proposed model
is quantitatively measured both in simulation and on robotic
hardware. Experiments on object grasping and clutter removal
show that our proposed model generalizes well to novel objects
and outperforms state-of-the-art methods. Code and video are
available at https://lianghongzhuo.github.io/PointNetGPD.
I. INTRODUCTION
Planning a grasp under uncertainty is a difficult task
in robotics. For a robot that operates in the real world,
uncertainty may arise from various sources. In this paper, we mainly concentrate on the uncertainty brought by imprecise and deficient sensing. Such uncertainty is usually associated with the sensor used for robotic perception [1]. To address this problem, a grasping model
that can work with raw sensor input is needed. Some recent
advances suggest using deep neural networks trained on large-scale grasp datasets labeled by humans [2], [3] or by grasping outcomes collected on robotic hardware [4], [5] to plan grasps directly from sensor input such as images [6] or point clouds [7]. Such work yields promising results across a wide variety of objects, sensors, and robots, and the resulting models generalize well to novel objects that are not
present in the training set. However, most of the current
methods still rely on 2D (image) or 2.5D (depth map) input;
some grasping models even require complex hand-crafted
features [8] before they can process the data, while very few take 3D geometric information into consideration [9]. Intuitively, whether a grasp is successful
or not is always related to how the robot (gripper) interacts
with the object surface in 3D space; thus the lack of geometry analysis could entail side effects in grasp planning, especially when accurate and complete sensing is not available.
†These two authors contributed equally. This work was done when Hongzhuo Liang was visiting Tsinghua University.
¹TAMS (Technical Aspects of Multimodal Systems), Department of Informatics, Universität Hamburg
²Tsinghua National Laboratory for Information Science and Technology (TNList), State Key Lab on Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University
∗Corresponding author, e-mail: fcsun@tsinghua.edu.cn
[Fig. 1 panels: Robot Initial State, Grasp Candidates Generation, Grasp Dataset, Quality Evaluation with PointNet, Best Grasp, Executed Grasp.]
Fig. 1. An illustration of our proposed PointNetGPD for detecting reliable grasp configurations from point sets. Taking raw sensor input from a common RGB-D camera, we first convert the depth map into a point cloud; then several grasp candidates are sampled using essential geometry information as heuristics or constraints. For each candidate, the point cloud within the gripper is cropped, transformed into the local gripper coordinates, and finally fed into our grasp quality evaluation network. The grasp with the highest score is executed. Our model is trained on a large-scale grasp dataset based on the YCB [10] object set.
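As a rough sketch of this pipeline (not our actual implementation), the following Python pseudo-implementation scores every sampled candidate with a point-set quality network and keeps the best one; the candidate sampler, the gripper-frame cropping, and the quality network are passed in as callables because their concrete implementations are omitted here, and all names are illustrative:

import numpy as np

def plan_best_grasp(cloud, sample_candidates, crop_to_gripper, quality_net):
    """Pick the highest-scoring grasp among sampled candidates (sketch only).

    cloud:             (N, 3) array of points converted from the depth map.
    sample_candidates: callable, cloud -> list of 6D grasp candidate poses.
    crop_to_gripper:   callable, (cloud, grasp) -> points inside the gripper
                       closing region, expressed in the gripper's local frame.
    quality_net:       callable, point set -> scalar grasp quality score.
    """
    best_grasp, best_score = None, -np.inf
    for grasp in sample_candidates(cloud):
        local_points = crop_to_gripper(cloud, grasp)
        if len(local_points) == 0:
            continue  # the gripper would close on empty space; skip
        score = float(quality_net(local_points))
        if score > best_score:
            best_grasp, best_score = grasp, score
    return best_grasp  # the grasp that the robot executes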
To tackle these unsolved issues, inspired by the recent
work of PointNet [11] that directly operates on point sets
for 3D object classification and segmentation, in this work,
we propose a method for detecting reliable grasp configurations directly from the point cloud.
As illustrated in Figure 1, PointNetGPD provides an effective
pipeline to generate and evaluate grasp configurations. Compared with previous grasp detection methods that depend on multi-view CNNs [8] or 3D CNNs [12], our approach does not require projecting the point cloud onto multiple 2D images or rasterizing it into dense 3D volumes. As a result, it preserves most of the geometric information in the original point cloud and infers grasp quality more efficiently.
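For illustration, a PointNet-style evaluator could be as small as the PyTorch sketch below: a shared per-point MLP followed by symmetric max pooling consumes the raw points inside the gripper closing region without any projection or voxelization. The layer sizes are placeholders and do not reproduce our exact architecture.

import torch
import torch.nn as nn

class GraspQualityNet(nn.Module):
    """PointNet-style grasp quality evaluator (illustrative sketch)."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        # Shared per-point MLP, implemented as 1x1 convolutions over the points.
        self.point_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.ReLU(),
        )
        # Classifier on the pooled, order-invariant global feature.
        self.head = nn.Sequential(
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (batch, 3, num_points) in the local gripper frame
        per_point = self.point_mlp(points)            # (batch, 1024, num_points)
        global_feature = per_point.max(dim=2).values  # symmetric max pooling
        return self.head(global_feature)              # (batch, num_classes) logits

Because the max pooling is permutation invariant, such an evaluator is insensitive to point ordering and tolerates moderate sparsity in the cropped cloud.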
Recent success in deep neural network based grasp detection methods [3], [6] emphasizes the importance of training
on large-scale datasets. To further improve the performance
of the proposed grasp detection method, we build a grasp dataset with 350k real point clouds captured by depth cameras, parallel-jaw grasps, and analytic grasp metrics over
a subset of the YCB [10] object set. Different from other
grasp datasets like Dex-Net [3], we provide fine-grained
scores for each grasp instead of binary labels. Specifically,
given a 6D grasp pose and the CAD model of an object, we perform a force-closure [13] analysis and a frictionless grasp wrench space (GWS) [14] analysis on the grasp to obtain these scores. Quantitative scores make more flexible label assignment possible during training, which could also
improve the performance of our grasp quality evaluation network.
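As a sketch of what such flexible label assignment could look like (the thresholds and the simple averaging of the two metrics below are illustrative assumptions, not our published labeling scheme), continuous quality scores can be binned into discrete training labels whose boundaries are adjustable without regenerating the dataset:

import numpy as np

def scores_to_labels(force_closure_scores, gws_scores, thresholds=(0.4, 0.6)):
    """Convert per-grasp quality scores into discrete training labels (sketch)."""
    # Combine the two quality metrics; the combination rule is a placeholder.
    quality = 0.5 * (np.asarray(force_closure_scores, dtype=np.float64)
                     + np.asarray(gws_scores, dtype=np.float64))
    labels = np.zeros(quality.shape, dtype=np.int64)  # 0 = poor grasp
    labels[quality >= thresholds[0]] = 1              # 1 = acceptable grasp
    labels[quality >= thresholds[1]] = 2              # 2 = robust grasp
    return labels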