Debunking the 100X GPU vs. CPU Myth:
An Evaluation of Throughput Computing on CPU and GPU
Victor W. Lee†, Changkyu Kim†, Jatin Chhugani†, Michael Deisher†, Daehyun Kim†, Anthony D. Nguyen†, Nadathur Satish†, Mikhail Smelyanskiy†, Srinivas Chennupaty⋆, Per Hammarlund⋆, Ronak Singhal⋆ and Pradeep Dubey†
victor.w.lee@intel.com
†Throughput Computing Lab, Intel Corporation
⋆Intel Architecture Group, Intel Corporation
ABSTRACT
Recent advances in computing have led to an explosion in the amount of data being generated. Processing the ever-growing data in a timely manner has made throughput computing an important aspect for emerging applications. Our analysis of a set of important throughput computing kernels shows that there is an ample amount of parallelism in these kernels which makes them suitable for today's multi-core CPUs and GPUs. In the past few years there have been many studies claiming that GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such a large performance difference comes from, we perform a rigorous performance analysis and find that, after applying optimizations appropriate for both CPUs and GPUs, the performance gap between an Nvidia GTX280 processor and the Intel Core i7-960 processor narrows to only 2.5X on average. In this paper, we discuss optimization techniques for both CPU and GPU, analyze what architecture features contributed to performance differences between the two architectures, and recommend a set of architectural features which provide significant improvement in architectural efficiency for throughput kernels.
Categories and Subject Descriptors
C.1.4 [Processor Architecture]: Parallel architectures; C.4 [Performance of Systems]: Design studies; D.3.4 [Software]: Processors—Optimization
General Terms
Design, Measurement, Performance
Keywords
CPU architecture, GPU architecture, Performance analysis, Performance measurement, Software optimization, Throughput Computing
1. INTRODUCTION
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
ISCA’10, June 19–23, 2010, Saint-Malo, France.
Copyright 2010 ACM 978-1-4503-0053-7/10/06 ...$10.00.
The past decade has seen a huge increase in digital content, as more documents are being created in digital form than ever before. Moreover, the web has become the medium of choice for storing and delivering information such as stock market data, personal records, and news. Soon, the amount of digital data will exceed exabytes (10^18) [31]. This massive amount of data makes storing, cataloging, processing, and retrieving information challenging. A new class of applications has emerged across different domains such as databases, games, video, and finance that can process this huge amount of data to distill and deliver appropriate content to users. A distinguishing feature of these applications is that they have plenty of data-level parallelism: the data can be processed independently and in any order on different processing elements for a similar set of operations such as filtering, aggregating, and ranking. This feature, together with a processing deadline, defines throughput computing applications. Going forward, as digital data continues to grow rapidly, throughput computing applications are essential for delivering appropriate content to users in a reasonable duration of time.
Two major computing platforms are deemed suitable for this new class of applications. The first is the general-purpose CPU (central processing unit), which is capable of running many types of applications and has recently provided multiple cores to process data in parallel. The second is the GPU (graphics processing unit), which is designed for graphics processing with many small processing elements. The massive processing capability of the GPU has lured some programmers to explore general-purpose computing with it, giving rise to the GPGPU field [3, 33].
Fundamentally, CPUs and GPUs are built on very different philosophies. CPUs are designed for a wide variety of applications and to provide fast response times to a single task. Architectural advances such as branch prediction, out-of-order execution, and superscalar execution (in addition to frequency scaling) have been responsible for performance improvement. However, these advances come at the price of increased complexity/area and power consumption. As a result, mainstream CPUs today can pack only a small number of processing cores on the same die to stay within the power and thermal envelopes. GPUs, on the other hand, are built specifically for rendering and other graphics applications that have a large degree of data parallelism (each pixel on the screen can be processed independently). Graphics applications are also latency tolerant (the processing of each pixel can be delayed as long as frames are processed at interactive rates). As a result, GPUs can trade off single-thread performance for increased parallel processing. For instance,
GPUs can switch from processing one pixel to another when long