such as GPU memory access optimization, occupancy
maximization, and substitution matrix customization.
These techniques have significantly sped up PSSE.
Careful analysis of the data pipelines of PSSE shows
that the computation of PSSE can be decomposed into
three computation kernels: Permutation, Alignment, and
Fitting. Permutation and Alignment comprise the over-
whelming majority (more than 99.8%) of the overall
execution time [35]. Therefore, optimization effort should
be focused on these two kernels to achieve high perfor-
mance. We also observe that Permutation exhibits a high
degree of data independence, which is naturally suitable
for single-instruction, multiple-thread (SIMT) architec-
tures [38] and can therefore be mapped very well to the
task parallelism model of the GPU. Moreover, even though
the Alignment task suffers from data dependencies, we
show that with careful optimizations it can be heavily
accelerated using GPUs.
Design
GPU memory access optimization
It is especially important to optimize global memory
access, as its bandwidth is low and its latency is hun-
dreds of clock cycles [38]. Moreover, global memory
coalescing is the most critical optimization for GPU
programming [39]. Since the kernels of PSSE usually
work over large numbers of sequences that reside in
global memory, performance depends heavily on hiding
memory latency. When a GPU kernel accesses global
memory, all threads in groups of 32 (i.e., a warp)
access a bank of memory at one time. A batch of mem-
ory accesses is considered coalesced when the data
requested by a warp of threads are located at contiguous
memory addresses. For example, if the data requested by
the threads within a warp are located at 32 consecutive
memory addresses (such that the i-th address is accessed
by the i-th thread), the memory can be read in a single
access, and this memory access operation runs up to 32
times faster. If the memory access is not coalesced, it is
divided into multiple reads and hence serialized [37].
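To make this concrete, the following minimal CUDA sketch (our illustration, not code from the paper; the kernel names and the stride parameter are hypothetical) contrasts a coalesced read, where the i-th thread reads the i-th address, with a strided read that a warp cannot coalesce:

```cuda
// Coalesced: consecutive threads read consecutive addresses,
// so a warp's 32 requests are served by one memory transaction.
__global__ void coalescedRead(const unsigned char *in,
                              unsigned char *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];  // address gap between warp neighbors = 1 byte
}

// Non-coalesced: consecutive threads read addresses `stride` bytes
// apart, so the warp's requests split into multiple transactions.
__global__ void stridedRead(const unsigned char *in,
                            unsigned char *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(size_t)i * stride];  // serialized reads
}
```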
After permutation, if the sequence $s_2$ and its N per-
muted copies were stored contiguously one after
another in global memory, the intuitive memory lay-
out would be as shown in Figure 1(a). Note that we
need one byte (uchar) to store each amino acid residue.
Moreover, the GPU can read four bytes of data (packed
as the CUDA built-in vector data type uchar4) from
global memory to registers in one instruction. To achieve
high parallelism of global memory access, uchar4 is used
to store the permuted sequences. Dummy amino acid
symbols are padded at the end to make the length of
each sequence a multiple of 4.
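As a host-side illustration of the padding step (a sketch of our own; the helper name and the '@' dummy-residue symbol are assumptions, since the paper does not specify the dummy symbol):

```cuda
#include <cstring>

// Pad a residue string to a multiple of 4 bytes so it can later be
// read from global memory as uchar4 values. The '@' sentinel for
// dummy residues is our assumption, not from the paper.
void packSequence(const unsigned char *seq, int len,
                  unsigned char *dst /* padded output buffer */)
{
    int paddedLen = (len + 3) & ~3;           // round up to multiple of 4
    memcpy(dst, seq, len);                    // copy real residues
    memset(dst + len, '@', paddedLen - len);  // append dummy residues
}
```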
Considering inter-task parallelism, where each thread
works on the alignment of one of the permuted copies
of $s_2$ to $s_1$, in this layout the gap between the memory
accesses of neighboring threads is at least the length
of the sequence. For example, in the intuitive layout, if
thread $T_0$ accesses the first residue of the first permuted
copy (i.e., 'R') and thread $T_1$ accesses the first residue of
the second permuted copy (i.e., 'E'), the gap between
the accessed data is n. This results in non-coalesced mem-
ory reads (i.e., serialized reads), which significantly dete-
riorates performance.
We therefore reorganize the layout of the sequence data
in memory as an aligned structure of arrays to obtain
coalesced reads, as shown in Figure 1(b). In the optimized
layout, the characters (at a granularity of 4 bytes) that lie
at the same index in different permuted sequences stay
at neighboring positions. Then, if the first uchar4 of the
first permuted sequence (i.e., 'REGN') is requested by
thread $T_0$, the first uchar4 of the second permuted
sequence (i.e., 'ARNE') is requested by $T_1$, and so on.
This results in a warp of threads reading consecutive
memory (each thread reads 4 bytes) in a single access.
Thus the global memory access is coalesced, and high
performance is achieved.
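The corresponding read in the optimized structure-of-arrays layout (again a sketch of our own with hypothetical names) lets a warp of 32 threads fetch 128 consecutive bytes in one transaction:

```cuda
// Optimized layout: the j-th uchar4 of every permuted copy is
// stored contiguously across the N copies, so thread t reads its
// 4 residues right next to those of threads t-1 and t+1.
__device__ uchar4 loadChunkCoalesced(
    const uchar4 *seqs, int numCopies, int t, int j)
{
    return seqs[(size_t)j * numCopies + t];  // consecutive uchar4s per warp
}
```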
As the sequences remain unchanged during the align-
ment, they can be treated as read-only data, which
can be bound to texture memory. For such read patterns,
texture memory fetches are a better alternative to global
memory reads because of the texture cache, which
can further improve performance.
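A minimal sketch of such a binding, using the legacy CUDA texture reference API that was current at the time of this work (all identifiers are ours):

```cuda
// Read-only permuted sequences bound to a 1D texture so that
// fetches go through the texture cache.
texture<uchar4, cudaTextureType1D, cudaReadModeElementType> seqTex;

__global__ void alignKernel(int numCopies, int chunks)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    for (int j = 0; j < chunks; ++j) {
        uchar4 r = tex1Dfetch(seqTex, j * numCopies + t);  // cached fetch
        // ... consume r.x, r.y, r.z, r.w in the alignment ...
    }
}

// Host side: bind the device buffer holding the packed sequences.
// cudaBindTexture(0, seqTex, d_seqs, totalBytes);
```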
Occupancy maximization
Hiding global memory latency is very important for
achieving high performance on the GPU. This can be
done by creating enough threads to keep the CUDA
cores occupied while many other threads are
waiting for global memory accesses [39]. GPU occu-
pancy, as defined below, is a metric that indicates how
effectively the hardware is kept busy:
$$\mathrm{Occupancy} = (B \times T_{num}) / T_{max} \qquad (2)$$

where $T_{max}$ is the maximum number of resident threads
that can be launched on a streaming multiprocessor
(SM), which is a constant for a specific GPU; $T_{num}$ is
the number of active threads per block; and $B$ is the
number of active blocks per SM.
$B$ also depends on the GPU's physical limitations (e.g.,
the amount of registers, shared memory, and threads
supported by each model). It is given as follows:

$$B = \min(B_{user}, B_{reg}, B_{shr}, B_{hw}) \qquad (3)$$

where $B_{hw}$ is the hardware limit (only 8 blocks are
allowed per SM), and $B_{reg}$ and $B_{shr}$ are the potential
numbers of blocks permitted by the register and shared
memory usage, respectively.
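As a worked example of equations (2) and (3) (our own numbers, assuming a Fermi-class GPU with $T_{max} = 1536$ resident threads per SM, consistent with the 8-block-per-SM limit above):

```cuda
#include <stdio.h>

// Evaluate equations (2) and (3) for a hypothetical configuration.
// The resource limits below are assumptions for a Fermi-class GPU,
// not values given in the paper.
int main(void)
{
    int T_max  = 1536;  // max resident threads per SM (hardware)
    int T_num  = 192;   // active threads per block (chosen by us)
    int B_user = 8, B_reg = 8, B_shr = 8, B_hw = 8;

    int B = B_user;                                  // equation (3)
    if (B_reg < B) B = B_reg;
    if (B_shr < B) B = B_shr;
    if (B_hw  < B) B = B_hw;

    double occupancy = (double)(B * T_num) / T_max;  // equation (2)
    printf("B = %d, occupancy = %.2f\n", B, occupancy);  // B = 8, 1.00
    return 0;
}
```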