Debunking the 100X GPU vs. CPU Myth:
An Evaluation of Throughput Computing on CPU and GPU
Victor W. Lee†, Changkyu Kim†, Jatin Chhugani†, Michael Deisher†, Daehyun Kim†, Anthony D. Nguyen†, Nadathur Satish†, Mikhail Smelyanskiy†, Srinivas Chennupaty⋆, Per Hammarlund⋆, Ronak Singhal⋆ and Pradeep Dubey†
victor.w.lee@intel.com
†Throughput Computing Lab, Intel Corporation
⋆Intel Architecture Group, Intel Corporation
ABSTRACT
Recent advances in computing have led to an explosion in the amount of data being generated. Processing the ever-growing data in a timely manner has made throughput computing an important aspect for emerging applications. Our analysis of a set of important throughput computing kernels shows that there is an ample amount of parallelism in these kernels which makes them suitable for today's multi-core CPUs and GPUs. In the past few years there have been many studies claiming that GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such a large performance difference comes from, we perform a rigorous performance analysis and find that, after applying optimizations appropriate for both CPUs and GPUs, the performance gap between an Nvidia GTX280 processor and the Intel Core i7-960 processor narrows to only 2.5X on average. In this paper, we discuss optimization techniques for both CPU and GPU, analyze what architecture features contributed to performance differences between the two architectures, and recommend a set of architectural features which provide significant improvement in architectural efficiency for throughput kernels.
Categories and Subject Descriptors
C.1.4 [Processor Architecture]: Parallel architectures; C.4 [Performance of Systems]: Design studies; D.3.4 [Software]: Processors—Optimization
General Terms
Design, Measurement, Performance
Keywords
CPU architecture, GPU architecture, Performance analysis, Performance measurement, Software optimization, Throughput Computing
1. INTRODUCTION
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
ISCA’10, June 19–23, 2010, Saint-Malo, France.
Copyright 2010 ACM 978-1-4503-0053-7/10/06 ...$10.00.
The past decade has seen a huge increase in digital content, as more documents are being created in digital form than ever before. Moreover, the web has become the medium of choice for storing and delivering information such as stock market data, personal records, and news. Soon, the amount of digital data will exceed exabytes (10^18) [31]. This massive amount of data makes storing, cataloging, processing, and retrieving information challenging. A new class of applications has emerged across different domains such as databases, games, video, and finance that can process this huge amount of data to distill and deliver appropriate content to users. A distinguishing feature of these applications is that they have plenty of data-level parallelism: the data can be processed independently and in any order on different processing elements for a similar set of operations such as filtering, aggregating, and ranking. This feature, together with a processing deadline, defines throughput computing applications. Going forward, as digital data continues to grow rapidly, throughput computing applications are essential for delivering appropriate content to users in a reasonable duration of time.
Two major computing platforms are deemed suitable for this new class of applications. The first is the general-purpose CPU (central processing unit), which is capable of running many types of applications and has recently provided multiple cores to process data in parallel. The second is the GPU (graphics processing unit), which is designed for graphics processing with many small processing elements. The massive processing capability of the GPU has lured some programmers to explore general-purpose computing with it, giving rise to the GPGPU field [3, 33].
Fundamentally, CPUs and GPUs are built on very different philosophies. CPUs are designed for a wide variety of applications and to provide fast response times to a single task. Architectural advances such as branch prediction, out-of-order execution, and superscalar execution (in addition to frequency scaling) have been responsible for performance improvement. However, these advances come at the price of increased complexity/area and power consumption. As a result, mainstream CPUs today can pack only a small number of processing cores on the same die to stay within the power and thermal envelopes. GPUs, on the other hand, are built specifically for rendering and other graphics applications that have a large degree of data parallelism (each pixel on the screen can be processed independently). Graphics applications are also latency tolerant (the processing of each pixel can be delayed as long as frames are processed at interactive rates). As a result, GPUs can trade off single-thread performance for increased parallel processing. For instance,
GPUs can switch from processing one pixel to another when long