available, (3) RS data, especially hyperspectral data, form very large data cubes, whereas many
successful DL algorithms are tuned for small RGB image patches, (4) RS data gathered via light
detection and ranging (LiDAR) have insufficient DL literature (the data are point clouds, not
images), and (5) the best architecture is usually unknown a priori, which means a grid search
(which can be very time consuming) or random methods such as those discussed in Ref. 50 are
required for optimization. Chapter 8 of Ref. 23 discusses optimization techniques for training DL
models. A thorough discussion of these techniques is beyond the scope of this paper; however,
we list some common methods typically used to train DL systems.
Goodfellow et al. (Ref. 23) point out in Sec. 8.5.4 that there is no current consensus on the best
training/optimization algorithm. For the interested reader, the survey paper of Schaul et al.
51
provides results for many optimization algorithms over a large variety of tasks. CNNs are typ-
ically trained using stochastic gradient descent (SGD), SGD with momentum,
52
AdaGrad,
53
RMSProp,
54
and ADAM.
55
For details on the pros and cons of these algorithms, refer to
Secs. 8.3 and 8.5 of Ref. 23. There are also second-order methods, and these are discussed
in Sec. 8.6 of Ref. 23. A good history of DL is provided in Ref. 56, and training is discussed
in Sec. 5.24. Further discussions in this paper can be found in open questions 7 (Sec. 4.7),
8 (Sec. 4.8), and 9 (Sec. 4.9).
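To make two of these first-order update rules concrete, the following sketch implements single SGD-with-momentum and ADAM steps in NumPy and applies them to a toy quadratic. The learning rates and moment coefficients are illustrative defaults, not values prescribed by the cited works:

```python
import numpy as np

def sgd_momentum(w, grad, vel, lr=0.05, beta=0.9):
    """One SGD-with-momentum step: vel accumulates an exponentially
    decaying average of past gradients (Ref. 52 style update)."""
    vel = beta * vel - lr * grad
    return w + vel, vel

def adam(w, grad, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """One ADAM step: per-parameter step sizes from bias-corrected
    first (m) and second (v) moment estimates."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)   # bias-corrected second moment
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize the toy objective f(w) = w^2 (gradient 2w) with each rule.
w1, vel = 5.0, 0.0
w2, m, v = 5.0, 0.0, 0.0
for t in range(1, 201):
    w1, vel = sgd_momentum(w1, 2.0 * w1, vel)
    w2, m, v = adam(w2, 2.0 * w2, m, v, t)
print(w1, w2)  # both end up near the minimum at 0
```

In a real CNN the same update is applied elementwise to every weight tensor, with the gradient computed by backpropagation on a mini-batch.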
AEs can be trained with optimization algorithms similar to those used for CNNs. Some special
AEs, such as the marginalized DAE in Ref. 25, have closed-form solutions. DBNs can be trained
using greedy layer-wise training, as shown by Hinton et al. (Ref. 57) and Bengio et al. (Ref. 58).
Salakhutdinov and Hinton (Ref. 59) developed an improved pretraining method for DBNs and
DBMs by doubling or halving the weights (see the paper for more details).
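The greedy layer-wise idea can be sketched compactly: train one shallow unsupervised layer at a time, freeze it, and feed its features to the next layer. The sketch below uses simple denoising AEs with tied weights rather than the RBMs of Refs. 57 and 58, and all layer sizes, noise levels, and learning rates are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_dae_layer(X, n_hidden, noise=0.3, lr=0.1, epochs=50):
    """Train one denoising-AE layer (tied weights, squared error) on X
    and return (W, b_hidden) so its encoder can feed the next layer."""
    n_in = X.shape[1]
    W = rng.normal(0.0, 0.1, (n_in, n_hidden))
    bh = np.zeros(n_hidden)
    bo = np.zeros(n_in)
    for _ in range(epochs):
        Xn = X * (rng.random(X.shape) > noise)  # masking noise
        H = sigmoid(Xn @ W + bh)                # encode corrupted input
        R = sigmoid(H @ W.T + bo)               # decode with tied W
        dZo = (R - X) * R * (1 - R)             # reconstruct the CLEAN X
        dZh = (dZo @ W) * H * (1 - H)
        W -= lr * (Xn.T @ dZh + dZo.T @ H) / len(X)  # tied-weight gradient
        bh -= lr * dZh.sum(0) / len(X)
        bo -= lr * dZo.sum(0) / len(X)
    return W, bh

# Greedy stacking: each layer is trained on the previous layer's output.
X = rng.random((256, 32))
layers, H = [], X
for n_hidden in (16, 8):
    W, bh = train_dae_layer(H, n_hidden)
    layers.append((W, bh))
    H = sigmoid(H @ W + bh)  # frozen features feed the next layer
```

After pretraining, the stacked encoders are typically fine-tuned end-to-end with a supervised objective.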
Computation comparisons are complex and highly dependent on factors such as the training
architecture, the computer system (and GPUs), the way the architectures move data onto and
off of the GPUs, the particular settings of the optimization algorithm (e.g., the mini-batch
size), the learning rate, etc., and, of course, on the data itself. It is very difficult to know
a priori what the complexities will be; this question remains open in current DL knowledge.
The survey paper by Shi et al. (Ref. 60) investigates the performance of common DL tools
(see Sec. 2.3.5), such as Caffe and TensorFlow, to help users understand these tools’ speed,
capabilities, and limitations. They discovered that GPUs are critical to speeding up DL
algorithms, whereas multicore systems do not scale linearly after about 8 cores. The GTX1080
(and now 1080Ti) GPUs performed the best among the GPUs they tested.
RNNs can be difficult to train due to the exploding gradient problem (Ref. 61). To overcome
this issue, Pascanu et al. (Ref. 62) developed a gradient-clipping strategy to more effectively
train RNNs. Martens and Sutskever (Ref. 63) developed a Hessian-free RNN optimization
scheme with damping and tested it on very challenging data sets.
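Gradient clipping itself is a few lines of code. The sketch below rescales the whole gradient when its global L2 norm exceeds a threshold, in the spirit of the strategy of Ref. 62; the threshold value and the toy "exploded" gradient are illustrative:

```python
import numpy as np

def clip_by_global_norm(grads, threshold=5.0):
    """Rescale all gradient arrays together when their global L2 norm
    exceeds the threshold, so the update direction is preserved but
    its magnitude is capped."""
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if norm > threshold:
        grads = [g * (threshold / norm) for g in grads]
    return grads

# A gradient that "exploded" through a long unrolled RNN is scaled
# back so a single SGD step cannot catapult the weights.
exploded = [np.full((4, 4), 100.0), np.full(4, 100.0)]
clipped = clip_by_global_norm(exploded)
norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm)  # clipped global norm sits at the threshold, 5.0
```

Because only the norm is capped, the relative proportions among the weight gradients are unchanged, which is why clipping works better in practice than truncating each element independently.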
Last, the survey paper of Deng (Ref. 2) discusses DL architectures and gives many references
for training DL systems. The survey paper of Bengio et al. (Ref. 64) on unsupervised FL also
discusses various learning and optimization strategies.
2.3.4 Big data
Every day, approximately 350 million images are uploaded to Facebook (Ref. 45), Wal-Mart
collects approximately 2.5 petabytes of data per day (Ref. 45), and the National Aeronautics
and Space Administration (NASA) is actively streaming 1.73 gigabytes of spacecraft-borne
observation data for active missions alone (Ref. 65). IBM reports that 2.5 quintillion bytes of
data are now generated every day, which means that “90% of the data in the world today has
been created in the last two years alone” (Ref. 66).
The point is that an unprecedented amount of (varying quality) data exists due to
technologies such as RS, smartphones, and inexpensive data storage. In times past, researchers
used tens to hundreds, maybe thousands, of training samples, but nothing on the order of
magnitude available today. In areas such as CV, high data volume and variety are at the heart
of advancements in performance, meaning reported results are a reflection of advances in both
data and machine learning.
To date, a number of approaches have been explored relative to large-scale deep networks
(e.g., hundreds of layers) and big data (e.g., high volume of data). For example, Raina et al.
(Ref. 67)
Ball, Anderson, and Chan: Comprehensive survey of deep learning in remote sensing: theories. . .
Journal of Applied Remote Sensing, Vol. 11(4), 042609-9, Oct–Dec 2017