Object-oriented methods
Object-oriented methods, in their canonical form, try to superimpose a predefined rigid 3D model of the target object onto its matching geometry in the perceived scene. The match reveals the pose of the target object in the scene, which is then leveraged to derive corresponding grasp poses. Numerous studies utilizing distinct sensory data address the matching with various techniques for defining the 3D models and performing the registration [6].
Sun et al. [7] matched the segmented 3D point cloud against a primitive geometric model of the target to derive the registration. After a rough pose matching with RANSAC [8], the method refines the pose with the iterative closest point (ICP) algorithm. Its accuracy depends substantially on the matching quality of RANSAC and is limited by the representativeness of primitive models. To tackle the inability of RANSAC-based methods to scale to large databases, a shape completion framework was proposed in [9] and simplified in [10] to enable grasp estimation; if the shape and texture of the perceived object are complete, object-oriented methods can be more accurate. A 3D convolutional neural network (CNN) was trained on a dataset of over 440,000 3D exemplars to learn to complete a segmented point cloud. The completion generalizes to new objects, allowing previously unseen items to be grasped. Yet it still performs grasp planning with GraspIt! [11] in an out-of-context manner, making it an object-oriented method.
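To make the coarse-to-fine registration idea concrete, the sketch below shows only the ICP refinement stage, assuming a coarse pose (e.g. from a RANSAC-based matcher, not implemented here) is already available; the function names and parameters are illustrative and not taken from [7].

```python
import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(src, dst):
    """Least-squares rotation R and translation t mapping src onto dst (Kabsch/SVD)."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, c_dst - R @ c_src

def icp_refine(model_pts, scene_pts, init_T=np.eye(4), iters=30, tol=1e-6):
    """Refine an initial pose init_T (assumed to come from a coarse RANSAC match)
    by iterative closest point against the segmented scene cloud."""
    T = init_T.copy()
    tree = cKDTree(scene_pts)
    prev_err = np.inf
    for _ in range(iters):
        moved = model_pts @ T[:3, :3].T + T[:3, 3]
        dists, idx = tree.query(moved)            # closest scene point per model point
        R, t = best_rigid_transform(moved, scene_pts[idx])
        step = np.eye(4)
        step[:3, :3], step[:3, 3] = R, t
        T = step @ T                              # compose the incremental correction
        err = dists.mean()
        if abs(prev_err - err) < tol:
            break
        prev_err = err
    return T
```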
Another line of research completes the perceived geometry and estimates its pose via multi-view fusion [12]. Methods of this kind can alleviate factors that degrade perception, such as poor lighting conditions, clutter, and occlusions. However, precise estimation generally requires an accurate computer-aided design (CAD) model [13,14] of the target object.
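As a rough illustration of the fusion step only, the sketch below merges partial point clouds captured from several known camera poses into one frame; real pipelines such as [12] typically add voxel/TSDF fusion and outlier filtering, which are omitted, and the helper name and arguments are hypothetical.

```python
import numpy as np

def fuse_views(partial_clouds, camera_poses):
    """Fuse partial point clouds (each Nx3, in its own camera frame) into the world
    frame, given 4x4 camera-to-world poses, e.g. from robot kinematics or calibration."""
    fused = []
    for pts, T in zip(partial_clouds, camera_poses):
        world_pts = pts @ T[:3, :3].T + T[:3, 3]   # rigid transform into world frame
        fused.append(world_pts)
    return np.concatenate(fused, axis=0)
```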
Scene-oriented methods
Scene-oriented approaches pursue an understanding of the
whole scene [16]. Methods of this kind generalize to new objects and environments and react dynamically to changes in the environment [17–20].
Grasping new objects in unknown (complex) scenes is a
challenging problem in the field of robotics [21]. In recent
years, end-to-end grasp estimation methods addressing this problem have thrived. These methods deal with objects in context (the scene) and can thus be described as scene-oriented grasp estimation. They take images or point clouds as input and
produce viable grasp poses as output. This idea originated
in the work of Saxena et al. [22], which enables the robot to
grasp objects it has never seen before. The algorithm neither requires nor tries to build or complete a 3D model of the object. Instead, given two (or more) images of an object, it uses a model trained with supervised learning to identify a few grasp points at which to place the gripper. This sparse set of points is then triangulated to obtain a 3D location at which to attempt a grasp.
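The triangulation step can be illustrated with standard linear (DLT) triangulation from two calibrated views; the grasp point detector itself and the probabilistic multi-view reasoning of [22] are not shown, and the projection matrices are assumed to come from camera calibration.

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one grasp point from its pixel coordinates
    x1, x2 in two views with known 3x4 projection matrices P1, P2 (assumed calibrated)."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]     # homogeneous -> Euclidean 3D grasp location
```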
Subsequently, Zeng et al. [23] proposed using multi-view RGB-D data with self-supervised, data-driven learning to obtain the grasping poses of objects. The system estimates an object's 6-DOF grasping poses reliably in a variety of scenes and adapts to the scene. Zapata-Impata et al. [24] proposed an optimal grasp estimation method for 3D point clouds based on a partial, single-view observation of unknown objects. This approach is flexible and stable when working with objects in ever-changing scenes, but is limited to non-cluttered environments. Mousavian et al. [25] introduced 6-DOF GraspNet for generating diverse grasps for unknown objects. The method leverages a trained variational auto-encoder (VAE) to sample multiple grasps for an object, and also presents a refinement scheme that moves the gripper closer to a successful grasp pose. Wang et al. [26] proposed a method for robotic grasping of both rigid and soft objects; it generates the grasping pose directly along the object's central axis without relying on a CAD model. An ambidextrous grasping framework was proposed in [4] as a significant extension of previous versions of the Dex-Net research. The approach learns grasping policies by training on a set of grippers with a domain-randomized dataset and geometric analysis models. Wu et al. [27] proposed an end-to-end Grasp Proposal Network (GPNet) that predicts a diverse set of 6-DOF grasps for an unseen object observed from a single, unknown camera view. GPNet builds on a key design in its grasp proposal module that defines anchors of grasp centres at discrete but regular 3D grid corners, making it flexible enough to support either precise or diverse grasp predictions.
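A minimal sketch of the grid-anchor idea follows, assuming anchors are placed at the corners of a regular grid over the observed point cloud's bounding box; the actual resolution and anchor parameterization in [27] may differ.

```python
import numpy as np

def grid_anchors(cloud, cells_per_axis=10):
    """Place candidate grasp-centre anchors at the corners of a regular 3D grid
    spanning the observed point cloud, in the spirit of GPNet's proposal module.
    The grid resolution is a free parameter; the value used in [27] may differ."""
    lo, hi = cloud.min(axis=0), cloud.max(axis=0)
    axes = [np.linspace(lo[d], hi[d], cells_per_axis + 1) for d in range(3)]
    xs, ys, zs = np.meshgrid(*axes, indexing="ij")
    return np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)   # (N, 3) anchor centres
```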
Chu et al. [28] presented a grasping detection system to
predict grasp candidates for novel objects in RGB-D images.
Tests on the Cornell grasping dataset and a self-collected multi-object, multi-grasp dataset demonstrated the effectiveness of the design. Ten Pas et al. [29] generated grasp
hypotheses that do not require a precise segmentation of the
object. They proposed incorporating prior knowledge about
object categories to increase grasp classification accuracy.
Since the algorithm does not segment the objects, it can detect
grasps that treat multiple objects as a single atomic object.
Liang et al. [30] proposed an end-to-end grasp evaluation
model (PointNetGPD) to address the challenging problem of
localizing robot grasp configurations directly from the point
cloud. It is lightweight and directly processes the 3D points located within the gripper closing region for grasp evaluation. In [31,32], the Generative Grasping Convolutional Neural Network (GG-CNN) was presented as a grasp synthesis model that directly generates grasp poses from a depth image on a pixel-wise basis, instead of sampling and classifying individual grasp candidates as other deep learning techniques do.
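The pixel-wise read-out can be sketched as follows, assuming the network has already produced per-pixel quality, angle, and width maps; the released GG-CNN implementations also smooth these maps and rescale the predicted width, which is omitted here, and the function name and intrinsics arguments are illustrative.

```python
import numpy as np

def best_grasp_from_maps(quality, angle, width, depth, fx, fy, cx, cy):
    """Read out one grasp from GG-CNN-style pixel-wise maps: take the pixel with the
    highest predicted quality, look up the angle and width there, and back-project the
    pixel into 3D with the depth image and pinhole intrinsics (a simplified sketch)."""
    v, u = np.unravel_index(np.argmax(quality), quality.shape)
    z = depth[v, u]
    x = (u - cx) * z / fx          # pinhole back-projection to camera coordinates
    y = (v - cy) * z / fy
    return {"position": (x, y, z),
            "rotation": angle[v, u],     # in-plane gripper rotation (radians)
            "width": width[v, u]}        # predicted gripper opening
```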