Real-time object detection and robotic manipulation
for agriculture using a YOLO-based learning
approach
Hongyu Zhao, Zezhi Tang∗, Zhenhong Li, Yi Dong, Yuancheng Si, Mingyang Lu, George Panoutsos
Abstract—The optimisation of harvesting processes for commonly cultivated crops is of great importance to agricultural industrialisation. The use of machine vision has enabled the automated identification of crops and improved harvesting efficiency, but challenges remain. This study presents a new framework that combines two separate convolutional neural network (CNN) architectures to simultaneously accomplish crop detection and harvesting (robotic manipulation) in a simulated environment. Crop images in the simulated environment are subjected to random rotations, cropping, and brightness and contrast adjustments to create augmented images for dataset generation. The You Only Look Once (YOLO) algorithmic framework is employed with traditional rectangular bounding boxes (R-Bbox) for crop localisation. The proposed method then feeds the acquired image data to a visual geometry group (VGG) model to determine the grasping positions for robotic manipulation.
Index Terms—Deep learning, YOLOv3-dense, robot grasping.
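As a concrete illustration of the augmentation step described in the abstract, the following is a minimal sketch of such a pipeline. It assumes a torchvision-based implementation; the library choice, the image file name, and all parameter values are illustrative assumptions rather than details taken from this paper.

import torchvision.transforms as T
from PIL import Image

# Random rotation, random crop, and brightness/contrast jitter, as in the
# abstract; all parameter values below are assumed for illustration.
augment = T.Compose([
    T.RandomRotation(degrees=30),                 # random angle in [-30, 30] degrees
    T.RandomResizedCrop(size=416),                # random crop resized to 416x416, a common YOLO input size
    T.ColorJitter(brightness=0.4, contrast=0.4),  # random brightness and contrast adjustments
])

image = Image.open("crop_sample.png").convert("RGB")  # hypothetical source image
augmented = [augment(image) for _ in range(8)]        # several augmented variants per source image

Note that for a detection dataset the geometric transforms (rotation and cropping) must also be applied to the bounding-box annotations, which is typically handled by detection-aware augmentation tooling rather than image-only transforms such as these.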
I. INTRODUCTION
The progression of automation can be observed on a global scale across several industries, and agricultural production is likewise being modernised and automated. The implementation of mechanised techniques in agriculture has facilitated the automation of diverse processes, resulting in enhanced efficiency in agricultural production [1]. Nevertheless, crop harvesting remains inadequately automated, with conventional robots encountering challenges in accurately perceiving crops and successfully executing the grasp.
H. Zhao is with the Department of Physics, Imperial College, London,
United Kingdom (email: hz2623@ic.ac.uk)
Z. Tang and G. Panoutsos are with the Department of Automatic Control
and Systems Engineering, University of Sheffield, Sheffield, S1 3JD, United
Kingdom (emails: zezhi.tang@sheffield.ac.uk, g.panoutsos@sheffield.ac.uk)
Z. Li is with the Department of Electrical and Electronic Engineering, University of Manchester, Manchester, United Kingdom (email: zhenhong.li@manchester.ac.uk)
Y. Dong is with the Department of Electronics and Computer Science, University of Southampton, Southampton, United Kingdom (email: yi.dong@soton.ac.uk)
Y. Si is with the Department of Economics, Fudan University, Shanghai,
200433, China (email: siyuancheng@fudan.edu.cn)
M. Lu is with the Center for Nondestructive Evaluation (CNDE), Iowa State
University, Ames, IA 50011, United States (email: mingylu@iastate.edu)
*Corresponding author
*© 2024 IEEE. Personal use of this material is permitted. Permission from
IEEE must be obtained for all other uses, in any current or future media,
including reprinting/republishing this material for advertising or promotional
purposes, creating new collective works, for resale or redistribution to servers
or lists, or reuse of any copyrighted component of this work in other works.
Traditional machines have faced challenges in harvesting crops, and manual labour is time-consuming and raises production costs; robots can therefore contribute to increased agricultural productivity [2]. On industrial production lines, robots typically perform specific roles within a production task, such as manipulating and placing products at a fixed location or executing specific steps within a specialised process [3]. Deploying robots in agriculture, however, requires enhanced object detection and grasping capabilities, so research on robot recognition and grasping techniques is necessary. Grasping in particular holds significant importance in automation, as the majority of automated systems rely on the precise and effective gripping of a designated object. A wide range of algorithms currently exists for object recognition in conjunction with robotic grasping.
The mask region-based convolutional neural network (Mask-RCNN) algorithm is employed to segment the scene and perform geometric stereo-matching in order to accurately determine the location of the object of interest in the camera's field of view; the robot manipulator then grasps the target object efficiently [4]. The grasp region-based convolutional neural network (GR-ConvNet) algorithm can generate grasping poses from RGB images; using n-channel images of the scene, it addresses the challenge of planning and executing grasps for a robot that is unfamiliar with the items in its environment [5].
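To make this detect-then-grasp pattern concrete, the following is a minimal sketch of the detection stage using torchvision's COCO-pretrained Mask R-CNN as an off-the-shelf stand-in; it is not the implementation of [4], and the placeholder input frame and confidence threshold are assumptions.

import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Load a COCO-pretrained Mask R-CNN as a generic instance detector.
model = maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

frame = torch.rand(3, 480, 640)    # placeholder for a normalised RGB camera frame
with torch.no_grad():
    output = model([frame])[0]     # dict with 'boxes', 'labels', 'scores', 'masks'

keep = output["scores"] > 0.8      # confidence threshold (assumed value)
boxes = output["boxes"][keep]      # candidate object locations for the grasp stage

In a pipeline such as [4], detections of this kind are then stereo-matched to recover the 3-D position of the object before the manipulator executes the grasp.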
The YOLO method is a computer vision technique for object recognition, renowned for its remarkable real-time detection capabilities, and it continues to be optimised and enhanced. As discussed in [6], YOLO effectively addresses the significant challenge of domain shift commonly seen in traditional target detection approaches, enabling the generalised underwater object detector (GUOD) to achieve commendable performance across diverse underwater settings. The authors of [7] propose a solution to the challenge of image detection on datasets with limited samples; their approach uses the real-time capabilities of YOLO, together with techniques such as transfer learning and data augmentation, to enhance detection rates and speed. In [8], YOLO is acknowledged for addressing the difficulty of accurately de-