Chapter 2 is all about the methodology and the design patterns that can be employed in the development of parallel and multicore software. Both work decomposition patterns and program structure patterns are examined.
• Programming with threads and processes: Dealing explicitly with the individual paths of execution in the form of threads or processes is the most elementary form of parallel programming. In this part we examine how this paradigm is used to program CPUs (with C++11 threads), GPUs (with CUDA and OpenCL), and even clusters of networked machines (using MPI).
C++11 threads have been a long-awaited addition to the C++ standard, establishing a firm foundation for cross-platform, high-performance, parallel software development for CPUs. Chapter 3 covers the C++11 facilities, along with commonly used synchronization mechanisms such as semaphores and monitors. Frequently encountered design patterns, such as producers–consumers and readers–writers, are also explained thoroughly and applied in a range of examples.
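As a taste of what this chapter covers, here is a minimal producers–consumers sketch built from standard C++11 facilities only (one producer, one consumer, and an unbounded shared queue; the names buffer and notEmpty are illustrative, not the book's):

    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <thread>

    std::queue<int> buffer;            // shared buffer
    std::mutex m;                      // protects buffer
    std::condition_variable notEmpty;

    void producer() {
        for (int i = 0; i < 5; ++i) {
            std::lock_guard<std::mutex> lock(m);
            buffer.push(i);
            notEmpty.notify_one();     // wake a waiting consumer
        }
    }

    void consumer() {
        for (int i = 0; i < 5; ++i) {
            std::unique_lock<std::mutex> lock(m);
            notEmpty.wait(lock, [] { return !buffer.empty(); });
            std::cout << buffer.front() << '\n';
            buffer.pop();
        }
    }

    int main() {
        std::thread p(producer), c(consumer);
        p.join();
        c.join();
        return 0;
    }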
Chapter 4 is dedicated to shared-memory parallel data structures and how we can ensure correctness when multiple threads operate concurrently on a program's data.
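A first taste of the problem is the classic lost-update scenario: the toy sketch below (not the book's code) uses a std::mutex to make concurrent increments of a shared counter safe:

    #include <mutex>
    #include <thread>
    #include <vector>

    // Without the mutex, concurrent ++value calls could interleave
    // and lose updates; the lock serializes them.
    struct SafeCounter {
        int value = 0;
        std::mutex m;
        void increment() {
            std::lock_guard<std::mutex> lock(m);
            ++value;
        }
    };

    int main() {
        SafeCounter c;
        std::vector<std::thread> workers;
        for (int i = 0; i < 4; ++i)
            workers.emplace_back([&c] {
                for (int j = 0; j < 100000; ++j) c.increment();
            });
        for (auto &t : workers) t.join();
        return c.value == 400000 ? 0 : 1;  // always 0 (success)
    }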
In Chapter 5 we cover MPI, the de facto standard for distributed-memory parallel programming. MPI provides the foundation for utilizing multiple disjoint multicore machines as a single virtual platform, and it is designed to scale from a single shared-memory multicore machine to a million-node supercomputer. The features covered include point-to-point, collective, and one-sided communication. A section is dedicated to the Boost.MPI library, as it simplifies the use of MPI, although it is not yet feature-complete.
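The flavor of MPI's point-to-point communication can be conveyed in a few lines (a minimal sketch using the standard C bindings; run with, e.g., mpirun -np 2):

    #include <mpi.h>
    #include <iostream>

    int main(int argc, char *argv[]) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {                 // rank 0 sends...
            int payload = 42;
            MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {          // ...and rank 1 receives
            int payload;
            MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            std::cout << "Rank 1 received " << payload << '\n';
        }
        MPI_Finalize();
        return 0;
    }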
GPU software development is covered in great detail, including kernel design, memory management, grid-block/index space configurations, and optimization techniques. CUDA (Chapter 6) and OpenCL (Chapter 7) are each examined both in isolation and in combination with other platforms such as C++11 threads and MPI.
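The following minimal CUDA sketch (illustrative only; the kernel scale and its launch configuration are not taken from the book) hints at what kernel design and grid-block configuration entail:

    #include <cuda_runtime.h>

    // Each thread handles one array element; the launch configuration
    // below maps a 1D grid of blocks onto the index space.
    __global__ void scale(float *v, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index
        if (i < n) v[i] *= a;                           // guard overshoot
    }

    int main() {
        const int N = 1024;
        float *d = nullptr;
        cudaMalloc(&d, N * sizeof(float));     // device memory management
        cudaMemset(d, 0, N * sizeof(float));
        scale<<<(N + 255) / 256, 256>>>(d, 2.0f, N);  // 4 blocks x 256 threads
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }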
• High-level parallel programming: Parallel software suffers from high development and maintenance costs. Some of this burden can be alleviated by utilizing tools that handle the more esoteric details of “how” and “where” to execute costly computations.
The OpenMP standard in its latest incarnation (v5.0) addresses these problems by requiring only “hints” from the programmer, while also allowing both CPUs and GPUs to be targeted. There are still complications that need to be addressed, such as loop-carried dependencies, which are also examined in Chapter 8.
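As a small illustration of this “hints” approach (a sketch, not an excerpt from Chapter 8), a single pragma parallelizes the loop below, with the reduction clause resolving the loop-carried dependency on sum:

    #include <cstdio>

    int main() {
        const int N = 1000000;
        double sum = 0.0;
        // The pragma is the only change to the sequential loop. The
        // reduction clause gives each thread a private partial sum,
        // combined at the end, removing the dependency on sum.
        #pragma omp parallel for reduction(+ : sum)
        for (int i = 1; i <= N; ++i)
            sum += 1.0 / i;
        printf("Harmonic sum: %f\n", sum);
        return 0;
    }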
OpenMP’s design philosophy is to take advantage of multi- and many-core hardware while requiring minimal alterations to the source code of a sequential program. The Qt library, covered in Chapter 9, offers another solution to this design problem by supporting high-level abstractions in the form of map, filter, and reduce operations that can be applied to collections of data, without the need to instantiate or manage any threads.
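For instance, Qt Concurrent's blocking variants of these operations can be chained without a single explicit thread (a minimal sketch; square and isEven are illustrative helpers, not the book's code):

    #include <QtConcurrent>
    #include <QList>

    int square(int x) { return x * x; }
    bool isEven(int x) { return x % 2 == 0; }

    int main() {
        QList<int> data{1, 2, 3, 4, 5, 6};
        // map: square every element across a pool of worker threads
        QList<int> squares = QtConcurrent::blockingMapped(data, square);
        // filter: keep only the even squares, again with no explicit threads
        QList<int> evens = QtConcurrent::blockingFiltered(squares, isEven);
        return evens.size();  // 3, i.e. {4, 16, 36}
    }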