justification of adopting DNN models for CTR prediction.
• Hashing reduces the accuracy. Even with k = 2^34, the test AUC drops by 0.7%.
• Hash+DNN is a good combination for replacing LR. Compared to the original baseline LR model, we can reduce the number of nonzero weights from 31B to merely 14.6M without affecting the accuracy.
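As a rough illustration of the hashing trick behind the Hash+DNN rows (this is only a sketch, not the OP+OSRP algorithm), the Python snippet below maps raw sparse feature IDs into k = 2^b buckets that index a compact embedding table. The function names and the choice of hash are ours for illustration.

```python
# Illustrative sketch of the hashing trick for sparse CTR features, assuming
# raw feature IDs come from a huge space and must be folded into k buckets.
# This is NOT the OP+OSRP algorithm; it only shows why the embedding table
# shrinks from billions of weights to a few million rows.
import hashlib

def hash_bucket(feature_id: str, num_buckets: int) -> int:
    """Deterministically map a raw feature ID into [0, num_buckets)."""
    digest = hashlib.md5(feature_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % num_buckets

def hash_example(raw_feature_ids, num_buckets=2 ** 24):
    """Convert one example's raw feature IDs into embedding-table indices."""
    return [hash_bucket(fid, num_buckets) for fid in raw_feature_ids]

# Example: with k = 2^24 buckets, the DNN's embedding table needs about 16.8M
# rows regardless of how many distinct raw features exist; hash collisions are
# the price, which is why test AUC degrades as k shrinks (see Table 2).
print(hash_example(["query=flowers", "ad_id=12345"], num_buckets=2 ** 24))
```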
Table 2 summarizes the experiments on the web search ads data. The trend is essentially similar to that in Table 1. The main difference is that we cannot propose to use Hash+DNN for the web search ads CTR models, because it would reduce the accuracy of the current DNN-based models and consequently affect the company's revenue.
Table 2. OP+OSRP for Web Search Sponsored Ads Data

                        # Nonzero Weights    Test AUC
  Baseline LR           199,359,034,971      0.7458
  Baseline DNN                               0.7670
  Hash+DNN (k = 2^32)   3,005,012,154        0.7556
  Hash+DNN (k = 2^31)   1,599,247,184        0.7547
  Hash+DNN (k = 2^30)   838,120,432          0.7538
  Hash+DNN (k = 2^29)   433,267,303          0.7528
  Hash+DNN (k = 2^28)   222,780,993          0.7515
  Hash+DNN (k = 2^27)   114,222,607          0.7501
  Hash+DNN (k = 2^26)   58,517,936           0.7487
  Hash+DNN (k = 2^24)   15,410,799           0.7453
  Hash+DNN (k = 2^22)   4,125,016            0.7408
Summary. This section summarizes our effort on developing effective hashing methods for ads CTR models. The work was done in 2015 and we never attempted to publish it. The proposed algorithm, OP+OSRP, still retains some novelty to date, although it clearly combines several previously known ideas. The experiments are encouraging because they show that one can store the DNN model on a single machine and still achieve a noticeable increase in AUC compared to the original (large) LR model. However, for the main ads CTR model used in web search, which brings in the majority of the revenue, we observe that the test accuracy always drops as soon as we hash the input data. This is not acceptable under the current business model, because even a 0.1% decrease in AUC would result in a noticeable decrease in revenue.

Therefore, this report helps explain why, in this paper, we introduce the distributed hierarchical GPU parameter server to train massive-scale CTR models in a lossless fashion.
3 DISTRIBUTED HIERARCHICAL
PARAMETER SERVER OVERVIEW
In this section, we present an overview of the distributed hierarchical parameter server and describe its main modules from a high-level view. Figure 2 illustrates the proposed hierarchical parameter server architecture. It contains three major components: the HBM-PS, the MEM-PS, and the SSD-PS.
Figure 2. Hierarchical parameter server architecture.
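To make the three-tier division concrete, here is a minimal, assumption-laden sketch of how a parameter lookup might cascade from GPU HBM to host memory to SSD. The class and method names are ours; the real HBM-PS, MEM-PS, and SSD-PS involve GPU hash tables, RDMA, and file-based stores rather than Python dictionaries.

```python
# Hypothetical sketch of a three-tier parameter lookup (HBM -> MEM -> SSD).
# All names are illustrative and do not reflect the actual implementation.
class HierarchicalPS:
    def __init__(self, hbm_ps, mem_ps, ssd_ps):
        self.hbm_ps = hbm_ps   # hot parameters cached in GPU high-bandwidth memory
        self.mem_ps = mem_ps   # working parameters of recent batches in host DRAM
        self.ssd_ps = ssd_ps   # the full parameter set materialized on SSDs

    def pull(self, key):
        """Return the parameter for `key`, promoting it to the upper tiers."""
        if key in self.hbm_ps:
            return self.hbm_ps[key]
        if key in self.mem_ps:
            value = self.mem_ps[key]
        else:
            value = self.ssd_ps[key]      # cold parameter loaded from SSD
            self.mem_ps[key] = value      # stage it in host memory
        self.hbm_ps[key] = value          # cache it in GPU memory for training
        return value
```

For instance, `HierarchicalPS({}, {}, {"w42": [0.1]}).pull("w42")` would fault the parameter in from the SSD tier and leave copies in both upper tiers.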
Workflow. Algorithm 1 depicts the training workflow of the distributed hierarchical parameter server. Training data batches are streamed into main memory through a network file system such as HDFS (line 2). Our distributed training framework follows the data-parallel paradigm (Li et al., 2014; Cui et al., 2014; 2016; Luo et al., 2018): each node is responsible for processing its own training batches, and different nodes receive different training data from HDFS. Each node then identifies the union of the parameters referenced in the currently received batch and pulls these parameters from the local MEM-PS/SSD-PS (line 3) and the remote MEM-PS (line 4). The local MEM-PS loads the locally owned parameters stored on the local SSD-PS into memory and requests the remote parameters from other nodes over the network.
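The sketch below is our paraphrase of this pull step (lines 3 and 4), not Algorithm 1 itself: a node deduplicates the parameter keys referenced by the current batch and routes each key to the local or a remote MEM-PS. The `feature_keys` field, the hash-based ownership rule, and the `local_pull`/`remote_pull` callables are assumptions for illustration.

```python
# Hypothetical paraphrase of the pull step: collect the union of parameter
# keys referenced by the current batch, then pull each key from its owner.
from collections import defaultdict

def referenced_keys(batch):
    """Union of sparse parameter keys referenced by all examples in the batch."""
    keys = set()
    for example in batch:
        keys.update(example["feature_keys"])
    return keys

def pull_parameters(batch, node_id, num_nodes, local_pull, remote_pull):
    """Pull every referenced parameter from the local or a remote MEM-PS."""
    by_owner = defaultdict(list)
    for key in referenced_keys(batch):
        # Assumed ownership rule; a production system would use a stable hash,
        # not Python's per-process salted hash().
        by_owner[hash(key) % num_nodes].append(key)

    pulled = {}
    for owner, keys in by_owner.items():
        if owner == node_id:
            pulled.update(local_pull(keys))          # local MEM-PS (backed by SSD-PS)
        else:
            pulled.update(remote_pull(owner, keys))  # remote MEM-PS over the network
    return pulled
```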
After all the referenced parameters are loaded into memory, they are partitioned and transferred to the HBM-PS on the GPUs. To effectively utilize the limited GPU memory, the parameters are partitioned in a non-overlapping fashion: each parameter is stored on exactly one GPU.
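A minimal sketch of such a non-overlapping partition follows, assuming parameters are assigned to GPUs by a simple hash of the key; the actual placement policy is not specified at this level of detail and the names are ours.

```python
# Hypothetical sketch of partitioning the pulled working parameters across
# GPUs without replication: each key lives on exactly one GPU's HBM-PS shard.
def partition_to_gpus(pulled_params, num_gpus):
    """Split {key: value} into per-GPU shards; every key goes to one GPU only."""
    shards = [dict() for _ in range(num_gpus)]
    for key, value in pulled_params.items():
        shards[hash(key) % num_gpus][key] = value   # assumed placement rule
    return shards

def owner_gpu(key, num_gpus):
    """A worker on any GPU can locate a parameter's single owner GPU."""
    return hash(key) % num_gpus

# A worker thread on GPU i reads shards[i] directly; for a key owned by GPU j,
# it fetches from (and pushes updates back to) GPU j over NVLink.
```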
When a worker thread on one GPU requires a parameter stored on another GPU, it fetches the parameter directly from the remote GPU and pushes its updates back to that GPU through the high-speed inter-GPU interconnect NVLink (Foley & Danskin, 2017). In addition, each data batch is sharded into multiple mini-batches and sent to the GPU worker threads (lines 5-10). Many recent machine learning system studies (Ho et al., 2013; Chilimbi et al., 2014; Cui et al., 2016; Alistarh et al., 2018) suggest that parameter staleness among workers in data-parallel systems leads to slower convergence. In our proposed system, a mini-batch contains thousands of examples, and one GPU worker thread is responsible for processing a few thousand mini-batches. An