GPU Unified Memory Management: Optimizing a Hierarchical Page Eviction Policy


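"HPE: Hierarchical Page Eviction Policy for Unified Memory in GPUs" is a research paper on managing unified memory in GPUs (graphics processing units) through a hierarchical page eviction policy. Unified memory makes GPU programming easier and allows memory oversubscription, but it also incurs high overhead on page faults. When GPU memory reaches its capacity, choosing which page to evict is the key challenge. The most widely used eviction policy is least recently used (LRU); more advanced replacement policies include Re-Reference Interval Prediction (RRIP) and CLOCK-Pro. These policies are inefficient under thrashing access patterns, in which pages are replaced frequently. They do not fully account for the characteristics of GPU workloads, and performance suffers as a result.

The authors, Qi Yu, Bruce Childers (IEEE member), Libo Huang, Cheng Qian, and Zhiying Wang (all IEEE members), propose a new method named HPE (Hierarchical Page Eviction Policy) to address these problems. HPE builds a hierarchical memory management scheme that optimizes page eviction to suit the parallel computation and memory access patterns specific to GPUs.

The core of HPE is to organize memory into levels, and different levels may use different eviction policies. For example, a lower level may favor quickly evicting rarely accessed pages, while a higher level may give more protection to pages that were accessed frequently in the recent past, which reduces page faults. Such a design can better handle the bursty and locality-driven access behavior of GPUs and reduce unnecessary data migration, improving overall system performance.

The paper may also analyze in depth how HPE compares with other policies under different workloads, and validate experimentally its advantages in reducing page fault overhead, improving memory utilization, and speeding up applications. Finally, the authors may discuss potential improvements to HPE and directions for future research, such as combining it with machine learning techniques to further optimize page prediction and eviction decisions.

This work offers an innovative approach to GPU memory management and is relevant to the broad use of GPUs in high-performance computing, deep learning, and other data-intensive applications. By applying a hierarchical page eviction policy, HPE is expected to lower the performance loss caused by page faults and raise the overall computational efficiency of the GPU.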

0278-0070 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCAD.2019.2944790, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

[Fig. 2 panels: I: Streaming Access Pattern; II: Thrashing Access Pattern; III: Part Repetitive Access Pattern; IV: Most Repetitive Access Pattern; V: Repetitive-Thrashing Access Pattern; VI: Region Moving Access Pattern. The page-sequence notation shown in each panel is not recoverable from the extracted text.]
Fig. 2: Representative access patterns found in selected GPU workloads (Type I and Type II refer to [15]).

3) runtime error: cfd, huffman, mummergpu, nn from Rodinia, lbm (long), histo (large), tpacf (large) from Parboil and FDTD-2D, GRAMSCHM from Polybench;

4) duplicated access patterns with other selected applications: 3DCONV (with 2DCONV), and ATAX, GESUMMV (with MVT).

TABLE I: Configuration of the Simulated System.

GPU Arch.              NVIDIA GTX-480 Fermi-like
GPU Cores              15 cores, 1.4GHz
Private L1 cache       16KB, 4-way associative, LRU
Private L1 TLB         128-entry per SM, single port, 1-cycle latency, LRU, support hit under miss
Shared L2 cache        1.5MB total, 128KB/DRAM channel, 8-way associative, LRU
Shared L2 TLB          512-entry, 16-way associative, LRU, 10-cycle latency, 2 ports
DRAM                   GDDR5, 12-channel, FR-FCFS scheduler, 177GB/s aggregate
CPU-GPU interconnect   16GB/s, 20 µs page fault service time
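As a rough sanity check on the Table I figures, the short sketch below recomputes the total L2 size from the per-channel size and compares the raw link transfer time of one page with the 20 µs fault service time. The 4 KB page size and the decimal reading of "GB" are assumptions made only for this illustration; neither is stated in the table, and the table does not say whether the 20 µs figure includes the transfer itself.

```python
# Values copied from Table I.
l2_per_channel_kb = 128
dram_channels = 12
link_gb_per_s = 16          # CPU-GPU interconnect bandwidth
fault_service_us = 20       # page fault service time

# 12 channels x 128 KB should reproduce the stated 1.5 MB total.
l2_total_mb = l2_per_channel_kb * dram_channels / 1024

# ASSUMPTION: 4 KB pages and 1 GB = 1e9 bytes (not given in the table).
page_bytes = 4 * 1024
transfer_us = page_bytes / (link_gb_per_s * 1e9) * 1e6

# How many raw page transfers fit into one fault service time.
ratio = fault_service_us / transfer_us

print(f"L2 total: {l2_total_mb} MB")
print(f"one-page transfer: {transfer_us:.3f} us, service/transfer ratio: {ratio:.1f}")
```

Under these assumptions the fixed fault service time is far larger than the time the link needs to move a single page, which is consistent with the emphasis on reducing the number of faults and evictions.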

TABLE II: Workload Characteristics.

Access pattern   Benchmark suite   Application (abbreviation)
Type I           Rodinia           hotspot (HOT), leukocyte (LEU)
Type I           Parboil           cutcp (CUT)
Type I           Polybench         2DCONV (2DC), GEMM (GEM)
Type II          Rodinia           srad_v2 (SRD), hotspot3D (HSD)
Type II          Parboil           mri-q (MRQ), stencil (STN)
Type III         Rodinia           pathfinder (PAT), dwt2d (DWT), backprop (BKP), kmeans (KMN)
Type III         Parboil           sad (SAD)
Type IV          Rodinia           nw (NW), bfs (BFS)
Type IV          Polybench         MVT (MVT)
Type V           Rodinia           heartwall (HWL)
Type V           Parboil           sgemm (SGM), histo (HIS), spmv (SPV)
Type VI          Rodinia           b+tree (B+T), hybridsort (HYB)

A. Access Pattern Types

Different GPU workloads have different access patterns. To better understand when existing eviction policies (mentioned in Section I) behave well or poorly, we studied application access patterns from Rodinia, Parboil, and Polybench. We find six representative patterns, which are shown in Fig. 2.

In this figure, $a_i$ denotes a virtual page; $a_i^{N}$ means $a_i$ is referenced $N$ times ($N$ refers to frequency, the same below). $(a_1, a_2, \ldots, a_k)$ denotes a temporal sequence of references to $k$ unique virtual pages and $(a_1^{N_1}, a_2^{N_2}, \ldots, a_k^{N_k})$ represents a temporal sequence with different reference frequencies. A temporal sequence repeating $N$ times with the same frequency and with different frequencies is denoted as $(a_1, a_2, \ldots, a_k)^{N}$ and $(a_1^{N_1}, a_2^{N_2}, \ldots, a_k^{N_k})^{N}$, respectively.
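To make the notation concrete, the following minimal sketch expands a sequence of pages, a per-page reference frequency, and a repeat count into a flat reference trace. The function name `expand` and the page labels are hypothetical, chosen only for illustration. As a simplification, the repeated references to one page are emitted back to back, whereas real workloads may interleave them with references to other pages.

```python
from typing import List, Sequence


def expand(pages: Sequence[str], freqs: Sequence[int], repeats: int = 1) -> List[str]:
    """Expand a sequence with per-page frequencies, repeated `repeats` times.

    Simplification: the freqs[i] references to pages[i] are consecutive.
    """
    if len(pages) != len(freqs):
        raise ValueError("pages and freqs must have the same length")
    # One pass over the sequence: each page appears freqs[i] times.
    one_pass = [p for p, n in zip(pages, freqs) for _ in range(n)]
    # The whole pass is then repeated `repeats` times.
    return one_pass * repeats


pages = ["a1", "a2", "a3"]
once = expand(pages, [1, 1, 1])               # every page referenced once
repeated = expand(pages, [1, 1, 1], repeats=2)  # the same sequence, twice
mixed = expand(pages, [2, 1, 3])              # different frequencies per page
print(once, repeated, mixed, sep="\n")
```

Here `once` corresponds to a sequence in which every page is referenced a single time, `repeated` to the same sequence repeating, and `mixed` to a sequence with different reference frequencies.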

Types I and II are simple access patterns, where all pages are referenced the same number of times (1 or N). However, there are complex access patterns in which virtual pages are referenced a different number of times and with different probability, such as types III, IV, V, and VI. Type III is a temporal sequence where some of the virtual pages are referenced multiple times with some probability. Type IV represents a temporal sequence in which most virtual pages are referenced multiple times. Type V is a combination of type II and type IV: a temporal sequence repeats N times; in each iteration, most virtual pages are referenced multiple times. In type VI, virtual pages are divided into kn address regions, and in each region, pages are referenced multiple times for a certain duration of time. The application then continues to access pages in the next region. For types III, IV, V, and VI, references to different pages usually interleave with each other (in terms of reference order).
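The region-moving behavior described for type VI can be sketched as a small trace generator. This is an illustration under assumptions rather than the paper's workload model: the function name, the equal-sized regions, and the fixed number of sweeps per region are all hypothetical choices.

```python
from typing import List


def region_moving_trace(num_pages: int, num_regions: int, rounds: int) -> List[int]:
    """Visit regions one after another; sweep each region's pages `rounds` times."""
    if num_regions <= 0 or num_pages % num_regions != 0:
        raise ValueError("num_pages must be a multiple of a positive num_regions")
    size = num_pages // num_regions
    trace: List[int] = []
    for r in range(num_regions):
        region = list(range(r * size, (r + 1) * size))
        for _ in range(rounds):
            # Sweeping the region repeatedly interleaves references to its pages.
            trace.extend(region)
        # After the sweeps, the trace moves on and never returns to this region.
    return trace


trace = region_moving_trace(num_pages=6, num_regions=3, rounds=2)
print(trace)  # [0, 1, 0, 1, 2, 3, 2, 3, 4, 5, 4, 5]
```

In this sketch each page is referenced several times while its region is active, and a region is not revisited once the trace has moved past it.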

B. Limitations of LRU and RRIP

From Fig. 2, we can infer that LRU performs well for types I and VI, but poorly for type II. For type II, LRU fails to preserve part of the working set in the GPU memory. RRIP is expected to perform well for type II; however, due to “instant thrashing”, RRIP has limited speedup over LRU. For types III, IV, and V, it is difficult to determine whether these eviction policies perform well or poorly, because performance is influenced by several factors, such as which pages are referenced multiple times and when the pages are referenced.
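The weakness of LRU on a thrashing pattern can be reproduced with a few lines of simulation. The sketch below is a textbook LRU fault counter, not the simulator used in the paper; the function name `lru_faults` and the four-page trace are assumptions for illustration.

```python
from collections import OrderedDict
from typing import Hashable, Iterable


def lru_faults(trace: Iterable[Hashable], capacity: int) -> int:
    """Count page faults for an LRU-managed memory holding `capacity` pages."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    resident: "OrderedDict[Hashable, None]" = OrderedDict()
    faults = 0
    for page in trace:
        if page in resident:
            resident.move_to_end(page)        # hit: mark as most recently used
        else:
            faults += 1
            if len(resident) >= capacity:
                resident.popitem(last=False)  # evict the least recently used page
            resident[page] = None
    return faults


thrashing = list(range(4)) * 3                 # 4 pages swept 3 times
fits = lru_faults(thrashing, capacity=4)       # whole working set is resident
oversub = lru_faults(thrashing, capacity=3)    # only 3 of the 4 pages fit
print(fits, oversub)  # 4 12
```

When all four pages fit, only the four cold faults occur. With room for three pages, LRU always evicts the page that the sweep needs next, so all 12 references fault and none of the working set is reused.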

To show the limitations of LRU and RRIP, we conducted simulations under an oversubscription rate of 75%, which means only 75% of the application footprint fits in the GPU memory. As a baseline, we use an offline eviction policy, similar to Belady’s MIN algorithm [27], to explore the upper bound of performance. We call this policy “Ideal”. We normalize the evictions of LRU and RRIP to Ideal. Fig. 3 shows the result. We make three observations. First, for type II, RRIP incurs significant thrashing for SRD and HSD. Despite outperforming LRU for MRQ and STN, RRIP evicts 50% more pages than Ideal. Second, LRU performs well for type I (except for GEM) and type VI, while RRIP performs poorly for type
