0278-0070 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCAD.2019.2944790, IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems
3
(
)
I: Streaming Access Pattern ( )
II: Thrashing Access Pattern ( )
III: Part Repetitive Access Pattern
(
)
IV: Most Repetitive Access Pattern
(
)
V: Repetitive-Thrashing Access Pattern
(
)
VI: Region Moving Access Pattern
(
)
Fig. 2: Representative access patterns found in selected GPU workloads (Type I and Type II refer to [15]).
3) runtime error: cfd, huffman, mummergpu, nn from Ro-
dinia, lbm (long), histo (large), tpacf (large) from Parboil
and FDTD-2D, GRAMSCHM from Polybench;
4) duplicated access patterns with other selected ap-
plications: 3DCONV (with 2DCONV), and ATAX,
GESUMMV (with MVT).
TABLE I: Configuration of the Simulated System.
GPU Arch. NVIDIA GTX-480 Fermi-like
GPU Cores 15 cores, 1.4GHz
Private L1 cache 16KB, 4-way associative, LRU
Private L1 TLB 128-entry per SM, single port, 1-cycle la-
tency, LRU, support hit under miss
Shared L2 cache 1.5MB total, 128KB/DRAM channel, 8-way
associative, LRU
Shared L2 TLB 512-entry, 16-associative, LRU, 10-cycle la-
tency, 2 ports
DRAM GDDR5, 12-channel, FR-FCFS scheduler,
177GB/s aggregate
CPU-GPU interconnect 16GB/s, 20 µs page fault service time
TABLE II: Workload Characteristics.
access pattern benchmark suite application and Abbr.
Type I
Rodinia hotspot (HOT), leukocyte (LEU)
Parboil cutcp (CUT)
Polybench 2DCONV (2DC), GEMM (GEM)
Type II
Rodinia srad v2 (SRD), hotspot3D (HSD)
Parboil mri-q (MRQ), stencil (STN)
Type III
Rodinia
pathfinder (PAT), dwt2d (DWT),
backprop (BKP), kmeans (KMN)
Parboil sad (SAD)
Type IV
Rodinia nw (NW), bfs (BFS)
Polybench MVT (MVT)
Type V
Rodinia heartwall (HWL)
Parboil
sgemm (SGM), histo (HIS),
spmv (SPV)
Type VI Rodinia b+tree (B+T), hybridsort (HYB)
A. Access Pattern Types
Different GPU workloads have different access patterns. To
better understand when existing eviction policies (mentioned
in Section I) behave well or poorly, we studied application
access patterns from Rodinia, Parboil, and Polybench. We find
six representative patterns, which are shown in Fig. 2.
In this figure, a
i
denotes a virtual page; a
N
i
i
means a
i
is referenced N
i
times (refer to frequency, the same below).
(a
1
, a
2
, ..., a
k
) denotes a temporal sequence of references to
k unique virtual pages and (a
N
1
1
, a
N
2
2
, ..., a
N
k
k
) represents a
temporal sequence with different reference frequencies. A
temporal sequence repeating N times with the same frequency
and with different frequencies is denoted as (a
1
, a
2
, ..., a
k
)
N
and (a
N
1
1
, a
N
2
2
, ..., a
N
k
k
)
N
, respectively.
Type I and II are simple access patterns, where all pages
are referenced the same number of times (1 or N ). However,
there are complex access patterns in which virtual pages are
referenced a different number of times and with different
probability, such as type III, IV, V, and VI. Type III is a
temporal sequence where parts of virtual pages are referenced
multiple times with some probability. Type IV represents a
temporal sequence in which most virtual pages are referenced
multiple times. Type V is a combination of type II and type IV:
a temporal sequence repeats N times; in each iteration, most
virtual pages are referenced multiple times. In type VI, virtual
pages are divided into kn address regions, and in each region,
pages are referenced multiple times for a certain duration of
time. The application then continues to access pages in the next
region. For type III, IV, V, and VI, different page references
usually intersect with each other (in terms of reference order).
B. Limitations of LRU and RRIP
From Fig. 2, we can infer that LRU performs well for type
I and VI, but poorly for type II. For this type, LRU fails
to preserve some of the working set in the GPU memory.
RRIP is expected to perform well for type II; however, due to
“instant thrashing”, RRIP has limited speedup over LRU. For
type III, IV, and V, it is difficult to determine whether these
eviction policies perform well or poorly, because performance
is influenced by several factors, such as which pages are
referenced multiple times and when the pages are referenced.
To show the limitations of LRU and RRIP, we conducted
simulations under an oversubscription rate of 75%, which
means only 75% of application footprint fits in the GPU
memory. As a baseline, we use an offline eviction policy to
explore the upper bound of performance, which is similar to
Belady’s MIN algorithm [27]. We call this policy “Ideal”.
We normalize evictions of LRU and RRIP to Ideal. Fig. 3
shows the result. We make three observations. First, for type II,
RRIP incurs significant thrashing for SRD and HSD. Despite
outperforming LRU for MRQ and STN, RRIP evicts 50% more
pages than Ideal. Second, LRU performs well for type I (except
for GEM) and type VI, while RRIP performs poorly for type