can throw at the problem, the faster we can train a given network. However, some of us may only
have one GPU when working through this book. That raises the questions:
• Is using just one GPU a fruitless exercise?
• Is reading through this chapter a waste of time?
• Was purchasing the ImageNet Bundle a poor investment?
The answer to all of these questions is a resounding no – you are in good hands, and the knowledge you learn here will be applicable to your own deep learning projects. However, you do need
to manage your expectations and realize you are crossing a threshold, one that separates educational
deep learning problems from advanced, real-world applications.
You are now entering the world of state-of-the-art deep learning, where experiments can take days, weeks, or in rare cases even months to complete – this timeline is completely normal.
Regardless of whether you have one GPU or eight GPUs, you’ll be able to replicate the performance of
the networks detailed in this chapter, but again, keep in mind the caveat of time. The more GPUs
you have, the faster the training will be. If you have a single GPU, don’t be frustrated – simply be
patient and understand this is part of the process. The primary goal of the ImageNet Bundle is to
provide you with actual case studies and detailed information on how to train state-of-the-art deep
neural networks on the challenging ImageNet dataset (along with a few additional applications). Whether you have one GPU or eight, you’ll be able to learn from these case studies and use this knowledge in your own applications.
For readers using a single GPU, I highly recommend spending most of your time training
AlexNet and SqueezeNet on the ImageNet dataset. These networks are shallower and can be trained much faster on a single GPU system (on the order of 3-6 days for AlexNet and 7-10 days
for SqueezeNet, depending on your machine). Deeper Convolutional Neural Networks such as
GoogLeNet can also be trained on a single GPU but can take up to 7-14 days.
Smaller variations of ResNet can be trained on a single GPU as well, but for the deeper version covered in this book, I would recommend multiple GPUs.
The only network architecture I do not recommend attempting to train using one GPU is
VGGNet – not only can it be a pain to tune the network hyperparameters (as we’ll see later in this
book), but the network is extremely slow due to its depth and number of fully-connected nodes. If
you decide to train VGGNet from scratch, keep in mind that it can take up to 14 days to train the
network, even using four GPUs.
Again, as I mentioned earlier in this section, you are now crossing the threshold from deep
learning practitioner to deep learning expert. The datasets we are examining are large and challenging – and the networks we will train on these datasets are deep. As depth increases, so does the computation required to perform the forward and backward pass. Take a second now to set your expectations: these are not experiments you can leave running overnight and gather the results of the next morning – they will take longer to run.
This is a fact that every deep
learning researcher must accept.
But even if you are training your own state-of-the-art deep learning models on a single GPU,
don’t fret. The same techniques we use for multiple GPUs can also be applied to single GPUs. The
sole purpose of the ImageNet Bundle is to give you the knowledge and experience you need to successfully apply deep learning to your own projects.
3.2 Performance Gains Using Multiple GPUs
In an ideal world, if a single epoch for a given dataset and network architecture takes N seconds to complete on a single GPU, then we would expect the same epoch with two GPUs to complete in N/2 seconds. However, this expectation rarely holds in practice. Training performance is heavily dependent on the PCIe bus on your system, the specific architecture you are training, the number of layers in the network, and whether your network is bound by computation or by communication.
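To make the gap between the ideal N/2 case and reality concrete, below is a minimal sketch that models the time per epoch as a compute term that divides across GPUs plus a communication term for gradient synchronization. The estimated_epoch_time function, the 15% communication overhead, and the one-hour single-GPU epoch are all hypothetical values chosen purely for illustration, not measurements from any particular system or network.

# A minimal sketch (hypothetical numbers, not benchmarks) of why doubling
# the number of GPUs rarely halves the time per epoch.

def estimated_epoch_time(single_gpu_time, num_gpus, comm_overhead=0.15):
    # the portion of the epoch that parallelizes cleanly across GPUs
    compute_time = single_gpu_time / num_gpus

    # a simple model of the per-epoch cost of synchronizing gradients over
    # the PCIe bus; this term grows toward a fixed fraction of the
    # single-GPU time as more GPUs must communicate, so it caps the speedup
    sync_time = single_gpu_time * comm_overhead * (num_gpus - 1) / num_gpus

    return compute_time + sync_time

# assume (hypothetically) that one epoch takes an hour on a single GPU
single_gpu_time = 3600.0

for num_gpus in (1, 2, 4, 8):
    t = estimated_epoch_time(single_gpu_time, num_gpus)
    print("{} GPU(s): {:.0f}s per epoch, {:.2f}x speedup".format(
        num_gpus, t, single_gpu_time / t))

With these made-up numbers, two GPUs yield roughly a 1.7x speedup rather than 2x, and eight GPUs yield roughly 4x rather than 8x – the kind of sub-linear scaling you should expect in practice.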