a hidden layer $h = (h_1, h_2, \ldots, h_M)$ that carries the latent fea-
ture representation. The connections between the nodes are bi-
directional, so given an input vector x one can obtain the latent
feature representation h and also vice versa. As such, the RBM is
a generative model, and we can sample from it and generate new
data points. In analogy to physical systems, an energy function is
defined for a particular state ( x, h ) of input and hidden units:
$E(x, h) = -h^{T} W x - c^{T} x - b^{T} h$, (9)
with c and b the bias terms of the visible and hidden units, respectively. The probability of the ‘state’ of the system
is defined by passing the energy to an exponential and normaliz-
ing:
$p(x, h) = \frac{1}{Z} \exp\{-E(x, h)\}$. (10)
Computing the partition function Z is generally intractable. How-
ever, conditional inference in the form of computing h conditioned
on x or vice versa is tractable and results in a simple formula:
$P(h_j \mid x) = \frac{1}{1 + \exp\{-b_j - W_j x\}}$. (11)
Since the network is symmetric, a similar expression holds for
$P(x_i \mid h)$.
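To make Eqs. (9)–(11) concrete, the following is a minimal NumPy sketch of conditional inference and one alternating Gibbs sampling step in an RBM. The layer sizes, the random initialization of W, and the zero bias vectors are illustrative assumptions rather than trained parameters.

import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 784, 256

# Illustrative (untrained) parameters: W couples visible and hidden units,
# c is the visible bias and b the hidden bias, as in Eq. (9).
W = rng.normal(scale=0.01, size=(n_hidden, n_visible))
c = np.zeros(n_visible)
b = np.zeros(n_hidden)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_h_given_x(x):
    # P(h_j = 1 | x) = sigmoid(b_j + W_j x), cf. Eq. (11)
    return sigmoid(b + W @ x)

def p_x_given_h(h):
    # The symmetric expression for the visible units, P(x_i = 1 | h)
    return sigmoid(c + W.T @ h)

def gibbs_step(x):
    # One alternating Gibbs step x -> h -> x', i.e. sampling from the generative model
    h = (rng.random(n_hidden) < p_h_given_x(x)).astype(float)
    x_new = (rng.random(n_visible) < p_x_given_h(h)).astype(float)
    return x_new, h

x0 = rng.integers(0, 2, size=n_visible).astype(float)  # a random binary input
x1, h1 = gibbs_step(x0)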
DBNs ( Bengio et al., 2007; Hinton et al., 2006 ) are essentially
SAEs where the AE layers are replaced by RBMs. Training of the
individual layers is, again, done in an unsupervised manner. Final
fine-tuning is performed by adding a linear classifier to the top
layer of the DBN and performing a supervised optimization.
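As an illustration only, such a DBN-style stack can be sketched with scikit-learn's BernoulliRBM: each RBM is trained, unsupervised, on the representation produced by the layer below it, and a linear classifier is then trained on top. The layer sizes, learning rates, and toy data below are assumptions, and a full DBN would additionally fine-tune all layers with backpropagation rather than only training the top classifier.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

# Toy data in [0, 1]; in practice these would be (patches of) medical images.
X = np.random.rand(200, 64)
y = np.random.randint(0, 2, size=200)

# Greedy layer-wise unsupervised training of the two RBMs, followed by
# supervised training of the logistic regression on the top-level features.
dbn = Pipeline([
    ("rbm1", BernoulliRBM(n_components=128, learning_rate=0.05, n_iter=10)),
    ("rbm2", BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
dbn.fit(X, y)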
2.6.3. Variational auto-encoders and generative adversarial networks
Recently, two novel unsupervised architectures were intro-
duced: the variational auto-encoder (VAE) ( Kingma and Welling,
2013 ) and the generative adversarial network (GAN) ( Goodfellow
et al., 2014 ). There are no peer-reviewed papers applying these
methods to medical images yet, but applications in natural images
are promising. We will elaborate on their potential in the discus-
sion.
2.7. Hardware and software
One of the main contributors to the steep rise of deep learn-
ing papers has been the widespread availability of GPUs and GPU-
computing libraries (CUDA, OpenCL). GPUs are highly parallel com-
puting engines, which have an order of magnitude more execution
threads than central processing units (CPUs). With current hard-
ware, deep learning on GPUs is typically 10–30 times faster than
on CPUs.
Next to hardware, the other driving force behind the popularity
of deep learning methods is the wide availability of open-source
software packages. These libraries provide efficient GPU implemen-
tations of important operations in neural networks, such as con-
volutions, allowing the user to implement ideas at a high level
rather than worrying about efficient implementations. At the time
of writing, the most popular packages were (in alphabetical order):
• Caffe (Jia et al., 2014). Provides C++ and Python interfaces, developed by graduate students at UC Berkeley.
• Tensorflow (Abadi et al., 2016). Provides C++ and Python interfaces, developed by Google and used by Google Research.
• Theano (Bastien et al., 2012). Provides a Python interface, developed by the MILA lab in Montreal.
• Torch (Collobert et al., 2011). Provides a Lua interface and is used by, among others, Facebook AI Research.
There are third-party packages written on top of one or more
of these frameworks, such as Lasagne ( https://github.com/Lasagne/
Lasagne ) or Keras ( https://keras.io/ ). It goes beyond the scope of
this paper to discuss all these packages in detail.
3. Deep learning uses in medical imaging
3.1. Classification
3.1.1. Image/exam classification
Image or exam classification was one of the first areas in which
deep learning made a major contribution to medical image analy-
sis. In exam classification, one typically has one or multiple images
(an exam) as input with a single diagnostic variable as output (e.g.,
disease present or not). In such a setting, every diagnostic exam is
a sample and dataset sizes are typically small compared to those
in computer vision (e.g., hundreds/thousands vs. millions of sam-
ples). The popularity of transfer learning for such applications is
therefore not surprising.
Transfer learning is essentially the use of pre-trained networks
(typically on natural images) to try to work around the (perceived)
requirement of large data sets for deep network training. Two
transfer learning strategies were identified: (1) using a pre-trained
network as a feature extractor and (2) fine-tuning a pre-trained
network on medical data. The former strategy has the extra ben-
efit of not requiring one to train a deep network at all, allow-
ing the extracted features to be easily plugged in to existing im-
age analysis pipelines. Both strategies are popular and have been
widely applied. However, few authors perform a thorough investigation into which strategy gives the best result. The two papers that
do, Antony et al. (2016) and Kim et al. (2016a) , offer conflicting re-
sults. In the case of Antony et al. (2016) , fine-tuning clearly outper-
formed feature extraction, achieving 57.6% accuracy in multi-class
grade assessment of knee osteoarthritis versus 53.4%. Kim et al.
(2016a), however, showed that using a CNN as a feature extractor outperformed fine-tuning in cytopathology image classification accuracy (70.5% versus 69.1%). If any guidance can be given as to which strategy might be most successful, we would refer the reader to
two recent papers, published in high-ranking journals, which fine-
tuned a pre-trained version of Google’s Inception v3 architecture
on medical data and achieved (near) human expert performance
( Esteva et al., 2017; Gulshan et al., 2016 ). As far as the authors are
aware, such results have not yet been achieved by simply using
pre-trained networks as feature extractors.
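As an illustration of the two strategies, the following is a hedged sketch using Keras (one of the third-party packages mentioned below) with an ImageNet pre-trained Inception v3 network. The input size, number of target classes, number of frozen layers, and optimizer settings are illustrative assumptions, not choices taken from the cited studies.

import numpy as np
from keras.applications.inception_v3 import InceptionV3, preprocess_input
from keras.layers import Dense
from keras.models import Model

# Strategy (1): pre-trained network as a fixed feature extractor.
base = InceptionV3(weights="imagenet", include_top=False, pooling="avg")
dummy_batch = preprocess_input(np.random.rand(4, 299, 299, 3) * 255.0)
features = base.predict(dummy_batch)  # 2048-D features for a classical pipeline

# Strategy (2): fine-tuning the pre-trained network on medical data.
outputs = Dense(2, activation="softmax")(base.output)  # e.g. disease present / absent
model = Model(inputs=base.input, outputs=outputs)
for layer in base.layers[:-30]:  # optionally keep early layers frozen
    layer.trainable = False
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(train_images, train_labels, ...)  # supervised training on the medical data set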
With respect to the type of deep networks that are commonly
used in exam classification, a timeline similar to computer vision
is apparent. The medical imaging community initially focused on
unsupervised pre-training and network architectures like SAEs and
RBMs. The first papers applying these techniques for exam clas-
sification appeared in 2013 and focused on neuroimaging. Brosch
and Tam (2013) , Plis et al. (2014) , Suk and Shen (2013) , and Suk
et al. (2014) applied DBNs and SAEs to classify patients as having
Alzheimer’s disease based on brain Magnetic Resonance Imaging
(MRI). Recently, a clear shift towards CNNs can be observed. Out
of the 47 papers published on exam classification in 2015, 2016,
and 2017, 36 use CNNs, 5 are based on AEs, and 6 on RBMs.
The application areas of these methods are very diverse, ranging
from brain MRI to retinal imaging and digital pathology to lung
computed tomography (CT).
In the more recent papers using CNNs, authors also often train
their own network architectures from scratch instead of using
pre-trained networks. Menegola et al. (2016) performed some ex-
periments comparing training from scratch to fine-tuning of pre-
trained networks and showed that fine-tuning worked better given
a small data set of around 1000 images of skin lesions. However, these experiments are too small in scale to draw any general conclusions.
Three papers used an architecture leveraging the unique at-
tributes of medical data: two use 3D convolutions ( Hosseini-Asl
et al., 2016; Payan and Montana, 2015) instead of 2D to classify patients as having Alzheimer's disease; Kawahara et al. (2016b) applied a CNN-