Training Deeper Models by GPU Memory
Optimization on TensorFlow
Chen Meng¹, Minmin Sun², Jun Yang¹, Minghui Qiu², Yang Gu¹
¹Alibaba Group, Beijing, China
²Alibaba Group, Hangzhou, China
{mc119496, minmin.smm, muzhuo.yj, minghui.qmh, gy104353}@alibaba-inc.com
Abstract
With the advent of big data, readily available GPGPUs, and progress in neural network
modeling techniques, training deep learning models on GPUs has become a popular
choice. However, due to the inherent complexity of deep learning models and the
limited memory resources on modern GPUs, training deep models is still a non-trivial
task, especially when the model size is too big for a single GPU. In this paper,
we propose a general dataflow-graph-based GPU memory optimization strategy,
i.e., "swap-out/in", which utilizes host memory as a bigger memory pool to overcome
the limitation of GPU memory. Meanwhile, to optimize the memory-consuming
sequence-to-sequence (Seq2Seq) models, dedicated optimization strategies are
also proposed. These strategies are integrated into TensorFlow seamlessly without
accuracy loss. In extensive experiments, significant reductions in memory usage
are observed. The maximum training batch size can be increased by 2 to 30 times given
a fixed model and system configuration.
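To make the swap-out/in idea concrete, the following is a minimal sketch of the pattern using the TensorFlow 1.x graph API. It is only an illustration of the concept under assumed tensor shapes; it is not the paper's integrated, graph-rewriting implementation. A large activation produced on the GPU is copied to host memory (swap-out) and copied back to the device only where a later op consumes it (swap-in), so it need not occupy GPU memory in between.

import tensorflow as tf  # TensorFlow 1.x graph mode

with tf.device('/gpu:0'):
    x = tf.random_normal([256, 4096])
    w = tf.get_variable('w', shape=[4096, 4096])
    h = tf.nn.relu(tf.matmul(x, w))          # large activation on the GPU

with tf.device('/cpu:0'):
    h_host = tf.identity(h)                  # swap-out: device-to-host copy

with tf.device('/gpu:0'):
    y = tf.reduce_sum(tf.identity(h_host))   # swap-in: host-to-device copy

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(y)

In a real training graph the swap-in would be scheduled just before the backward-pass op that reuses the activation, which is what a graph-level strategy can automate.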
1 Introduction
Recently, deep learning has played an increasingly important role in various applications [1][2][3][4][5].
The essential logic of training deep learning models involves parallel linear algebra calculations,
which are well suited for GPUs. However, due to physical constraints, a GPU usually has far less
device memory than host memory. The latest high-end NVIDIA P100 GPU is equipped with 12–16 GB
of device memory, while a CPU server has 128 GB of host memory. Meanwhile, the trend for deep
learning models is toward "deeper and wider" architectures. For example, ResNet [6] consists of up
to 1001 neuron layers, and a Neural Machine Translation (NMT) model consists of 8 layers using the
attention mechanism [7][8]; most of the layers in an NMT model are sequential ones unrolled
horizontally, which brings non-negligible memory consumption.
In short, the gap between limited GPU device memory capacity and increasing model complexity
makes memory optimization a necessity. In the following, the major constituents of memory usage
in the deep learning training process are presented.
Feature maps.
For deep learning models, a feature map is the intermediate output of a layer,
generated in the forward pass and required for gradient calculation during the backward phase.
Figure 1 shows the curve of ResNet-50's memory footprint for one mini-batch training iteration
on the ImageNet dataset. The peak of the curve gradually emerges with the accumulation of feature
maps. The size of a feature map is typically determined by the batch size and the model architecture
(for CNNs, the stride size and the number of output channels; for RNNs, the gate number, time-step
length and hidden size). Feature maps that are no longer needed are de-allocated, which causes the
curve to decline.
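As a rough, back-of-envelope illustration of how batch size and architecture determine feature-map memory (the layer shape and batch size below are assumed for the example, not measurements from the paper), the size of a single map can be computed as follows:

def feature_map_bytes(batch, height, width, channels, dtype_bytes=4):
    # Memory of one NHWC float32 feature map.
    return batch * height * width * channels * dtype_bytes

# First conv output of a ResNet-50-style network on 224x224 input:
# a stride-2 7x7 convolution with 64 output channels yields a
# 112x112x64 map per example.
print('%.1f MiB' % (feature_map_bytes(32, 112, 112, 64) / 2**20))  # ~98.0 MiB

Roughly 98 MiB for a single layer's activations at batch size 32; summed over many layers and kept alive until the backward pass, such maps dominate the peak of the curve in Figure 1.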
For complex model training, users have to adjust the batch size or even redesign their model
architectures to work around the "Out of Memory" issue. Although with model parallelism [9], one