Training Deeper Models by GPU Memory
Optimization on TensorFlow
Chen Meng¹, Minmin Sun², Jun Yang¹, Minghui Qiu², Yang Gu¹
¹Alibaba Group, Beijing, China
²Alibaba Group, Hangzhou, China
{mc119496, minmin.smm, muzhuo.yj, minghui.qmh, gy104353}@alibaba-inc.com
Abstract
With the advent of big data, readily available GPGPUs, and progress in neural network
modeling techniques, training deep learning models on GPUs has become a popular
choice. However, due to the inherent complexity of deep learning models and the
limited memory resources on modern GPUs, training deep models is still a non-trivial
task, especially when the model size is too big for a single GPU. In this paper,
we propose a general dataflow-graph-based GPU memory optimization strategy,
i.e., "swap-out/in", which utilizes host memory as a bigger memory pool to overcome
the limitation of GPU memory. Meanwhile, to optimize the memory-consuming
sequence-to-sequence (Seq2Seq) models, dedicated optimization strategies are
also proposed. These strategies are integrated into TensorFlow seamlessly without
accuracy loss. In extensive experiments, significant reductions in memory usage
are observed. The maximum training batch size can be increased by 2 to 30 times given
a fixed model and system configuration.
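To make the swap-out/in idea concrete, the following is a minimal sketch of the pattern using the TensorFlow 1.x graph API. It is only an illustration of the concept under assumed tensor shapes; it is not the paper's integrated, graph-rewriting implementation. A large activation produced on the GPU is copied to host memory (swap-out) and copied back to the device only where a later op consumes it (swap-in), so it need not occupy GPU memory in between.

import tensorflow as tf  # TensorFlow 1.x graph mode

with tf.device('/gpu:0'):
    x = tf.random_normal([256, 4096])
    w = tf.get_variable('w', shape=[4096, 4096])
    h = tf.nn.relu(tf.matmul(x, w))          # large activation on the GPU

with tf.device('/cpu:0'):
    h_host = tf.identity(h)                  # swap-out: device-to-host copy

with tf.device('/gpu:0'):
    y = tf.reduce_sum(tf.identity(h_host))   # swap-in: host-to-device copy

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(y)

In a real training graph the swap-in would be scheduled just before the backward-pass op that reuses the activation, which is what a graph-level strategy can automate.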
1 Introduction
Recently, deep learning has played an increasingly important role in various applications [1][2][3][4][5].
The essential logic of training deep learning models involves parallel linear algebra calculations,
which are well suited for GPUs. However, due to physical constraints, a GPU usually has far less
device memory than host memory. The latest high-end NVIDIA P100 GPU is equipped with 12–16 GB
of device memory, while a CPU server has 128 GB of host memory. Meanwhile, the trend for deep
learning models is toward "deeper and wider" architectures. For example, ResNet [6] consists of up
to 1001 neuron layers, and a Neural Machine Translation (NMT) model consists of 8 layers using the
attention mechanism [7][8]; most of the layers in an NMT model are sequential ones unrolled
horizontally, which brings non-negligible memory consumption.
In short, the gap between limited GPU device memory capacity and increasing model complexity
makes memory optimization a necessity. In the following, the major constituents of memory usage
in the deep learning training process are presented.
Feature maps.
For deep learning models, a feature map is the intermediate output of a layer,
generated in the forward pass and required for gradient calculation during the backward phase.
Figure 1 shows the curve of ResNet-50's memory footprint for one mini-batch training iteration
on the ImageNet dataset. The peak of the curve gradually emerges with the accumulation of feature
maps. The size of a feature map is typically determined by the batch size and the model architecture
(for CNNs, the stride size and the number of output channels; for RNNs, the gate number, time-step
length and hidden size). Feature maps that are no longer needed are de-allocated, which causes the
curve to decline.
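As a rough, back-of-envelope illustration of how batch size and architecture determine feature-map memory (the layer shape and batch size below are assumed for the example, not measurements from the paper), the size of a single map can be computed as follows:

def feature_map_bytes(batch, height, width, channels, dtype_bytes=4):
    # Memory of one NHWC float32 feature map.
    return batch * height * width * channels * dtype_bytes

# First conv output of a ResNet-50-style network on 224x224 input:
# a stride-2 7x7 convolution with 64 output channels yields a
# 112x112x64 map per example.
print('%.1f MiB' % (feature_map_bytes(32, 112, 112, 64) / 2**20))  # ~98.0 MiB

Roughly 98 MiB for a single layer's activations at batch size 32; summed over many layers and kept alive until the backward pass, such maps dominate the peak of the curve in Figure 1.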
For complex model training, users have to adjust the batch size or even redesign their model
architectures to work around the "Out of Memory" issue. Although with model parallelism [9], one