transposing (Sørensen et al., 2017), mirroring (Dyrmann et al., 2016a),
translations and perspective transform (Sladojevic et al., 2016), adap-
tations of objects’ intensity in an object detection problem (Steen et al.,
2016) and a PCA augmentation technique (Bargoti and Underwood,
2016).
Papers involving simulated data performed additional augmenta-
tion techniques such as varying the HSV channels and adding random
shadows (Dyrmann et al., 2016b) or adding simulated roots to soil
images (Douarre et al., 2016).
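As an illustration of the augmentation step, the following is a minimal Python sketch (using the Keras ImageDataGenerator API; all parameter values are hypothetical and not taken from any of the surveyed papers) of the kinds of geometric and intensity transformations listed above:

```python
# Illustrative sketch of common image augmentations (rotations, translations,
# mirroring, simple intensity/colour variation). Parameter values are
# hypothetical, not settings reported by any surveyed paper.
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=90,        # random rotations
    width_shift_range=0.1,    # horizontal translations
    height_shift_range=0.1,   # vertical translations
    horizontal_flip=True,     # mirroring
    vertical_flip=True,
    channel_shift_range=0.1)  # simple intensity variation

# images: (n_samples, height, width, channels) in [0, 1]; labels: (n_samples,)
images = np.random.rand(8, 128, 128, 3).astype("float32")
labels = np.arange(8)

# Each batch drawn from the generator contains randomly augmented copies.
batch_images, batch_labels = next(augmenter.flow(images, labels, batch_size=4))
print(batch_images.shape)  # (4, 128, 128, 3)
```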
4.6. Technical details
From a technical perspective, almost half of the research works (17 papers,
42%) employed popular CNN architectures such as AlexNet, VGG16
and Inception-ResNet. From the rest, 14 papers developed their own
CNN models, 2 papers adopted a first-order Differential Recurrent
Neural Network (DRNN) model, 5 papers preferred to use a Long
Short-Term Memory (LSTM) model (Gers et al., 2000), one paper used
deep belief networks (DBN) and one paper employed a hybrid of PCA
with auto-encoders (AE). Some of the CNN approaches combined their
model with a classifier at the output layer, such as logistic regression
(Chen et al., 2014), Support Vector Machines (SVM) (Douarre et al.,
2016), linear regression (Chen et al., 2017), Large Margin Classifiers
(LMC) (Xinshao and Cheng, 2015) and macroscopic cellular automata
(Song et al., 2016).
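To make this hybrid approach concrete, the following is a minimal sketch (assuming scikit-learn and stubbed-out feature vectors, not the actual pipelines of the cited papers) of training an SVM classifier on features extracted by a CNN:

```python
# Sketch: train an SVM on feature vectors extracted by a CNN, in the spirit of
# combining a deep feature extractor with a classical classifier at the output.
# The CNN feature extraction is stubbed out with random data for brevity.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

n_samples, n_features = 200, 512                       # hypothetical sizes
cnn_features = np.random.rand(n_samples, n_features)   # e.g. penultimate-layer activations
labels = np.random.randint(0, 2, size=n_samples)       # binary toy labels

X_train, X_test, y_train, y_test = train_test_split(
    cnn_features, labels, test_size=0.2, random_state=0)

svm = SVC(kernel="linear")                # linear SVM on top of the deep features
svm.fit(X_train, y_train)
print("Test accuracy:", svm.score(X_test, y_test))
```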
Regarding the frameworks used, all the works that employed some
well-known CNN architecture also used a DL framework, with Caffe
being the most popular (13 papers, 32%), followed by TensorFlow (2
papers) and deeplearning4j (1 paper). Ten research works developed
their own software, while some authors decided to build their own
models on top of Caffe (5 papers), Keras/Theano (5 papers), Keras/
TensorFlow (4 papers), Pylearn2 (1 paper), MatConvNet (1 paper) and
the Deep Learning Matlab Toolbox (1 paper). A possible reason for the wide
use of Caffe is that it incorporates various CNN architectures and data-
sets, which its users can then employ easily and with little extra effort.
Most of the studies divided their dataset into training and
testing/verification sets, typically at a ratio of 80–20 or 90–10.
In addition, various learning rates have been recorded, from 0.001
(Amara et al., 2017) and 0.005 (Mohanty et al., 2016) up to 0.01
(Grinblat et al., 2016). The learning rate determines the size of the
weight updates at each training step, i.e. how quickly a network
learns. Higher values help the solver avoid getting stuck in local minima,
which can reduce performance significantly. A general approach used
by many of the evaluated papers is to start out with a high learning rate
and progressively lower it as training goes on. We note that the appropriate
learning rate depends strongly on the network architecture.
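As an illustration of this "start high, then decay" strategy, the sketch below uses a Keras exponential-decay schedule; the initial rate of 0.01 matches the upper end of the values reported above, while the decay settings are purely illustrative:

```python
# Sketch of a decaying learning-rate schedule: start at a relatively high rate
# (0.01) and lower it as training progresses. Decay parameters are illustrative.
import tensorflow as tf

schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01,
    decay_steps=1000,    # every 1000 training steps...
    decay_rate=0.9)      # ...multiply the learning rate by 0.9

optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)

# Effective learning rate after a given number of training steps:
for step in (0, 1000, 5000):
    print(step, float(schedule(step)))
```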
Moreover, most of the research works that incorporated popular DL
architectures took advantage of transfer learning (Pan and Yang, 2010),
which concerns leveraging the already existing knowledge of some re-
lated task or domain in order to increase the learning efficiency of the
problem under study by fine-tuning pre-trained models. When it is not
possible to train a network from scratch, e.g. because the training
dataset is small or the network is a complex multi-task one, the network
can instead be at least partially initialized with weights from
another pre-trained model. A common transfer learning technique is the
use of pre-trained CNNs, i.e. CNN models that have already been
trained on some relevant dataset, possibly with a different number of
classes. These models are then adapted to the particular challenge and
dataset. This method was followed (among others) in Lu et al. (2017),
Douarre et al. (2016), Reyes et al. (2015), Bargoti and Underwood
(2016), Steen et al. (2016), Lee et al. (2015), Sa et al. (2016), Mohanty
et al. (2016), Christiansen et al. (2016) and Sørensen et al. (2017), for
the VGG16, DenseNet, AlexNet and GoogLeNet architectures.
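The fine-tuning procedure can be sketched as follows (a minimal Keras example with VGG16 pre-trained on ImageNet; the number of target classes and the added classification head are hypothetical, not the exact configurations of the cited works):

```python
# Sketch of transfer learning with a pre-trained CNN: keep the convolutional
# base of VGG16 (trained on ImageNet), replace the classification head with one
# sized for the new dataset, and fine-tune. num_classes is a placeholder.
import tensorflow as tf

num_classes = 10  # hypothetical number of classes in the target dataset

base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False  # freeze pre-trained weights (later layers can be unfrozen to fine-tune)

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])

model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=..., validation_split=0.2)
```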
4.7. Outputs
Finally, concerning the 31 papers that involved classification, the
number of classes used by the authors ranged from 2 (Lu et al., 2017; Pound
et al., 2016; Douarre et al., 2016; Milioto et al., 2017) up to 1000
(Reyes et al., 2015). A large number of classes was observed in Luus
et al. (2015) (21 land-use classes), Rebetez et al. (2016) (22 different
crops plus soil), Lee et al. (2015) (44 plant species) and Xinshao and
Cheng (2015) (91 classes of common weeds found in agricultural
fields). In these papers, the number of model outputs was equal to
the number of classes. Each output represented the probability that the
input image, segment, blob or pixel belonged to the corresponding
class, and the model picked the class with the highest probability as its
prediction.
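In code, this class-probability output can be sketched as follows (plain NumPy, with a hypothetical number of classes and random logits standing in for the network's final layer):

```python
# Sketch: a classification model has one output per class; each output is a
# probability, and the predicted class is the one with the highest probability.
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))   # subtract max for numerical stability
    return e / e.sum()

num_classes = 5                           # placeholder (2 to 1000 in the surveyed papers)
logits = np.random.randn(num_classes)     # stand-in for the network's raw output scores
probabilities = softmax(logits)

predicted_class = int(np.argmax(probabilities))
print(probabilities, "->", predicted_class)
```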
Of the remaining 9 papers, 2 predicted fruit counts
(scalar value as output) (Rahnemoonfar and Sheppard, 2017; Chen
et al., 2017), 2 identified regions of fruits in the image (multiple
bounding boxes) (Bargoti and Underwood, 2016; Sa et al., 2016), 2
predicted animal growth (scalar value) (Demmers et al., 2010, 2012),
one predicted weather conditions (scalar value) (Sehgal et al., 2017),
one predicted a crop yield index (scalar value) (Kuwata and Shibasaki, 2015) and
one predicted the percentage of soil moisture content (scalar value)
(Song et al., 2016).
4.8. Performance metrics
Regarding methods used to evaluate performance, various metrics
have been employed by the authors, each being specific to the model
used in each study. Table 1 lists these metrics, together with their de-
finition/description, and the symbol we use to refer to them in this
survey. In some papers where the authors referred to accuracy without
specifying its definition, we assumed they referred to classification
accuracy (CA, the first metric listed in Table 1). From this point onwards,
we use the term "DL performance" to denote a model's score on one of
the performance metrics listed in Table 1.
CA was the most popular metric used (24 papers, 60%), followed by
F1 (10 papers, 25%). Some papers included RMSE (4 papers), IoU (3
papers), RFC (Chen et al., 2017; Rahnemoonfar and Sheppard, 2017) or
others. Some works used a combination of metrics to evaluate their
efforts. We note that some papers employing CA, F1, P or R used IoU to
decide whether a model's prediction should count as correct (Bargoti
and Underwood, 2016; Sa et al., 2016; Steen et al., 2016; Christiansen
et al., 2016; Dyrmann et al., 2017). In these cases, a minimum threshold
was put on the IoU, and any prediction above this threshold was
considered a positive.
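The IoU-thresholding step can be sketched as follows (the 0.5 threshold is a common choice, not necessarily the value used in the cited papers):

```python
# Sketch: compute the intersection over union (IoU) of a predicted and a
# ground-truth bounding box, and count the prediction as a positive only if
# the IoU exceeds a minimum threshold. Boxes are (x_min, y_min, x_max, y_max).

def iou(box_a, box_b):
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / float(area_a + area_b - intersection)

predicted = (10, 10, 60, 60)       # toy example boxes
ground_truth = (20, 20, 70, 70)

score = iou(predicted, ground_truth)
is_positive = score >= 0.5         # illustrative minimum IoU threshold
print(round(score, 3), is_positive)
```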
We note that in some cases, a trade-off can exist between metrics.
For example, in a weed detection problem (Milioto et al., 2017), it
might be desirable to have a high R to eliminate most weeds, but not
eliminating crops is of critical importance, hence a lower P might be
acceptable.
4.9. Overall performance
We note that it is difficult, if not impossible, to compare results across
papers, as different metrics are employed for different tasks, con-
sidering different models, datasets and parameters. Hence, the reader
should treat our comments in this section with some caution.
In 19 out of the 24 papers that involved CA as a metric, accuracy
was high (i.e. above 90%), indicating good performance. The highest
CA has been observed in the works of Hall et al. (2015), Pound et al.
(2016), Chen et al. (2014), Lee et al. (2015), Minh et al. (2017), Potena
et al. (2016) and Steen et al. (2016), with values of 98% or more, con-
stituting remarkable results. From the 10 papers using F1 as a metric, 5
had values higher than 0.90, with the highest F1 observed in Mohanty
et al. (2016) and Minh et al. (2017), with values higher than 0.99.
The works of Dyrmann et al. (2016a), Rußwurm and Körner (2017),
Ienco et al. (2017), Mortensen et al. (2016), Rebetez et al. (2016),
Christiansen et al. (2016) and Yalcin (2017) were among the ones with
the lowest CA (i.e. 73–79%) and/or F1 scores (i.e. 0.558–0.746),