GPU Parallel Programming: CUDA and the Tesla Architecture in Practice


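"Programming Massively Parallel Processors: A Hands-on Approach." Parallel programming is a key technique in modern high-performance computing. It improves computational efficiency by executing many tasks at the same time. Advances in hardware have greatly increased the parallel processing capability of computers, and programs written for parallel processors can exploit these resources to compute faster and more efficiently. This book, Programming Massively Parallel Processors: A Hands-on Approach, by David Kirk and Wen-mei Hwu, is devoted to programming techniques for massively parallel processors. It focuses on one widely available kind of parallel hardware, the graphics processing unit (GPU). GPUs were originally designed to process graphics and video data, but their highly parallel architecture makes them an excellent platform for general parallel computing. Many desktop and laptop computers now include one.

The book introduces CUDA, a C-based data-parallel programming language designed for NVIDIA GPUs. CUDA gives programmers direct access to the GPU hardware, so they can put its large number of cores to work on computation. The authors explain the basic concepts and architecture of CUDA. They also discuss in depth what kinds of data-parallel problems run well on heterogeneous CPU-GPU systems. To make parallel programming more concrete, the book includes two detailed case studies. In them, straightforward CUDA code runs 10 to 15 times faster than CPU-only C programs, and carefully optimized CUDA versions reach speedups of 45 to 105 times. These results clearly demonstrate what parallel programming can deliver.

The book closes with a look ahead at possible directions for parallel computing technology. As GPUs and other parallel processors continue to evolve, further performance gains and new application areas can be expected. Parallel programming is central to high-performance computing. It is also an effective way to solve the demanding computational problems of fields such as big data and artificial intelligence. By learning parallel programming technologies such as CUDA, developers can make better use of modern hardware and build more efficient applications.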

The Design Document

Once the students decide on a project and form a team, they are required to submit a design document for the project. This helps them think through the project steps before they jump into it. The ability to do such planning will be important to their later career success. The design document should discuss the background and motivation for the project, application-level objectives and potential impact, main features of the end application, an overview of their design, an implementation plan, their performance goals, a verification plan and acceptance test, and a project schedule.

The teaching assistants hold a project clinic for final project teams during the week before the class symposium. This clinic helps ensure that students are on track and have identified potential roadblocks early in the process. Student teams are asked to come to the clinic with an initial draft of the following three versions of their application: (1) the best-performing CPU sequential code, with SSE2 and other optimizations that establish a strong serial baseline for their speedup comparisons; (2) the best-performing CUDA parallel code, which is the main output of the project; and (3) a version of CPU sequential code that is based on the same algorithm as version 2, using single precision. Students use this third version to characterize the overhead of the parallel algorithm in terms of the extra computation it involves.

Student teams are asked to be prepared to discuss the key ideas used in each version of the code, any floating-point precision issues, any comparison against previous results on the application, and the potential impact on the field if they achieve tremendous speedup. In our experience, the optimal time for the clinic is 1 week before the class symposium. An earlier time typically results in less mature projects and less meaningful sessions. A later time does not give students sufficient time to revise their projects according to the feedback.

The Project Report

Students are required to submit a project report on their team’s key findings. Six lecture slots are combined into a whole-day class symposium. During the symposium, students use presentation slots proportional to the size of their teams. In their presentations, the students highlight the best parts of their project report for the benefit of the whole class. The presentation accounts for a significant part of students’ grades. Each student must answer questions directed to them individually, so that different grades can be assigned to individuals in the same team. The symposium is a major opportunity for students to learn to produce a concise presentation that


Acknowledgments

We especially acknowledge Ian Buck, the father of CUDA, and John Nickolls, the lead architect of the Tesla GPU Computing Architecture. Their teams created an excellent infrastructure for this course. Ashutosh Rege and the NVIDIA DevTech team contributed to the original slides and content used in the ECE498AL course. Bill Bean, Simon Green, Mark Harris, Manju Hedge, Nadeem Mohammad, Brent Oster, Peter Shirley, Eric Young, and Cyril Zeller provided review comments and corrections to the manuscripts. Nadeem Mohammad organized the NVIDIA review efforts and also helped to plan Chapter 11 and Appendix B. Calisa Cole helped with the cover. Nadeem’s heroic efforts have been critical to the completion of this book.

We also thank Jensen Huang for providing a great amount of financial and human resources for developing the course. Tony Tamasi’s team contributed heavily to the review and revision of the book chapters. Jensen also took the time to read the early drafts of the chapters and gave us valuable feedback. David Luebke has facilitated the GPU computing resources for the course. Jonah Alben has provided valuable insight. Michael Shebanow and Michael Garland have given guest lectures and contributed materials.

John Stone and Sam Stone in Illinois contributed much of the base material for the case study and OpenCL chapters. John Stratton and Chris Rodrigues contributed some of the base material for the computational thinking chapter. I-Jui “Ray” Sung, John Stratton, Xiao-Long Wu, and Nady Obeid contributed to the lab material and helped to revise the course material as they volunteered to serve as teaching assistants on top of their research. Laurie Talkington and James Hutchinson helped to dictate early lectures that served as the base for the first five chapters. Mike Showerman helped build two generations of GPU computing clusters for the course. Jeremy Enos worked tirelessly to ensure that students had a stable, user-friendly GPU computing cluster to work on their lab assignments and projects.

We acknowledge Dick Blahut, who challenged us to create the course in Illinois. His constant reminder that we needed to write the book helped keep us going. Beth Katsinas arranged a meeting between Dick Blahut and NVIDIA Vice President Dan Vivoli. Through that gathering, Blahut was introduced to David and challenged David to come to Illinois and create the course with Wen-mei.

We also thank Thom Dunning of the University of Illinois and Sharon Glotzer of the University of Michigan, Co-Directors of the multiuniversity Virtual School of Computational Science and Engineering, for graciously

