Returning to Little's Law, we note that it assumes full bandwidth utilization, i.e., that all 64 bytes transferred with each memory block are useful bytes actually requested by the application, not bytes transferred merely because they belong to the same memory block. Whenever any amount of data is accessed, even a single byte, the entire 64-byte block containing it is transferred. To ensure that all transferred bytes are useful, accesses must be coalesced: requests from different threads must be presented to the memory management unit (MMU) in such a way that they can be packed into accesses that use an entire 64-byte block. If, for example, the MMU can only find 10 threads reading 10 4-byte words from the same block, 40 bytes are actually used and 24 are discarded. Clearly, coalescing is extremely important for achieving high memory utilization, and it is much easier when the access pattern is regular and contiguous. The experimental results in Figure 1.1b confirm that random-access memory bandwidth is significantly lower than in the coalesced case. A more comprehensive treatment of memory architecture, coalescing, and optimization techniques can be found in NVIDIA's CUDA Programming Guide [7].
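To make the contrast concrete, the following sketch (an illustration under assumed kernel names, not code from this chapter) shows two CUDA kernels reading the same array: one in which consecutive threads read consecutive words, and one in which each thread's 4-byte read falls in a different 64-byte block.

// Coalesced: consecutive threads read consecutive 4-byte words, so each
// warp's requests pack into full 64-byte memory transactions.
__global__ void coalescedRead(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: consecutive threads read words 16 apart (64 bytes), so each
// 4-byte read drags in its own 64-byte block; at most 4 of every 64
// transferred bytes are useful.
__global__ void stridedRead(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(i * 16) % n];
}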
1.3 SEARCHING LARGE DATA SETS
The problem of searching is not theoretically difficult in itself, but quickly searching a large data set
offers an exquisite example of how finding the right combination of algorithms and data structures for
a specific architecture can dramatically improve performance.
1.3.1 Data Structures
Searching for specific values in unsorted data leaves no choice but to scan the entire data set. The time to search an unsorted data set in memory is therefore determined by sequential memory bandwidth, because the access pattern consists of nothing but sequential reads. Performance is identical to that of coalesced reads (Figure 1.1a), which peak at about 150 GB/s on the GPU when using many thread blocks and many threads per block. Searching sorted data or indexes, on the other hand, can be implemented much more efficiently.
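Such a brute-force scan can be written so that its reads are naturally coalesced. The following kernel is a minimal sketch under assumed names (not the chapter's implementation), using a grid-stride loop so that consecutive threads always read consecutive elements.

// Linear search over unsorted data. Consecutive threads read consecutive
// elements, so the kernel runs at sequential (coalesced) read bandwidth.
__global__ void linearSearch(const int *data, int n, int key, int *result)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        if (data[i] == key)
            *result = i;   // *result initialized to -1 by the host;
    }                      // any matching position is acceptable here
}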
Databases and many other applications use indexes, stored as sorted lists or B-trees, to acceler-
ate searches. For example, searching a sorted list using binary search¹ requires O(log₂(n)) memory accesses, as opposed to O(n) using linear search, for a data set of size n. However, the data access
pattern of binary search is not amenable to caching and prefetching, as each iteration's memory access is data dependent and distant (Figure 1.2). Although each access incurs the full memory latency, this approach is orders of magnitude faster on sorted data. For example, assuming a memory latency of 500 clock cycles, searching a 512 MB data set of 32-bit integers (128M elements) in the worst case takes log₂(128M) × 500 cc = 27 × 500 cc = 13,500 cc, as opposed to millions of cycles when scanning the entire data set. B-trees²
¹Binary search compares the search key with the (pivot) element in the middle of a sorted data set. Based on whether the search key is larger than, smaller than, or equal to the pivot element, the algorithm then searches the upper or lower half of the data set, or returns the current location if the search key was found.
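As a concrete illustration of this procedure, the sketch below runs one binary search per GPU thread; the kernel name, parameter layout, and the convention of returning -1 for a missing key are assumptions for illustration, not the chapter's implementation.

// One binary search per thread over a sorted array; each of the
// O(log2(n)) iterations reads a data-dependent, distant location.
__global__ void binarySearch(const int *sorted, int n,
                             const int *keys, int *results, int numKeys)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numKeys) return;

    int key = keys[t], lo = 0, hi = n - 1, pos = -1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;     // pivot element in the middle
        if (sorted[mid] < key)
            lo = mid + 1;                 // continue in the upper half
        else if (sorted[mid] > key)
            hi = mid - 1;                 // continue in the lower half
        else { pos = mid; break; }        // found: record its location
    }
    results[t] = pos;
}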
²Searching a B-tree can be implemented by comparing the search key with the elements in a node in ascending order, starting at the root. When an element larger than the search key is found, the search takes the corresponding branch to the child node, which contains only elements in the same range as the search key: smaller than the current element and larger than the previous one. When an element equals the search key, its position is returned.
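For comparison, a minimal host-side sketch of this node-by-node procedure might look as follows; the node layout and branching factor are assumptions for illustration, not the chapter's data structure.

#define ORDER 8                    // branching factor (assumed for the sketch)

struct BTreeNode {
    int numKeys;                   // number of valid keys in this node
    int keys[ORDER - 1];           // keys stored in ascending order
    BTreeNode *children[ORDER];    // children[i]: subtree with keys between
};                                 // keys[i-1] and keys[i]; NULL in leaves

const int *btreeSearch(const BTreeNode *node, int key)
{
    while (node) {
        int i = 0;
        while (i < node->numKeys && node->keys[i] < key)
            i++;                               // ascending scan within node
        if (i < node->numKeys && node->keys[i] == key)
            return &node->keys[i];             // position of the match
        node = node->children[i];              // descend into the branch
    }                                          // whose range brackets the key
    return nullptr;                            // key not present
}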