such as GPU memory access optimization, occupancy
maximization, and substitution matrix customization.
These techniques have significantly sped up PSSE.
Careful analysis of the data pipelines of PSSE shows
that the computation of PSSE can be decomposed into
three computation kernels: Permutation, Alignment, and
Fitting. Permutation and Alignment comprise the over-
whelming majority (more than 99.8%) of the overall
execution time [35]. Therefore, optimization effort should
be focused on these two kernels to achieve high perfor-
mance. We also observe that Permutation exhibits a high
degree of data independence, which is naturally suitable
for single-instruction, multiple-thread (SIMT) architec-
tures [38] and can therefore be mapped very well to the
task parallelism model of the GPU. Moreover, even though
the Alignment task suffers from data dependencies, we
show that with careful optimizations it can be heavily
accelerated using GPUs.
Design
GPU memory access optimization
It is especially important to optimize global memory
access, as its bandwidth is low and its latency is hun-
dreds of clock cycles [38]. Moreover, global memory
coalescing is the most critical optimization for GPU
programming [39]. Since the kernels of PSSE usually
work over large numbers of sequences that reside in
global memory, performance depends heavily on hiding
memory latency. When a GPU kernel accesses global
memory, all threads in groups of 32 (i.e., a warp)
access a bank of memory at one time. A batch of mem-
ory accesses is considered coalesced when the data
requested by a warp of threads are located at contiguous
memory addresses. For example, if the data requested by
the threads within a warp are located at 32 consecutive
memory addresses (such that the i-th address is accessed
by the i-th thread), the memory can be read in a single
access, and this memory access operation runs up to 32
times faster. If the memory access is not coalesced, it is
divided into multiple reads and hence serialized [37].
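To make this concrete, the following minimal CUDA sketch (our illustration, not code from the paper; the kernel names and the stride parameter are hypothetical) contrasts a coalesced read, where the i-th thread reads the i-th address, with a strided read that a warp cannot coalesce:

```cuda
// Coalesced: consecutive threads read consecutive addresses,
// so a warp's 32 requests are served by one memory transaction.
__global__ void coalescedRead(const unsigned char *in,
                              unsigned char *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];  // address gap between warp neighbors = 1 byte
}

// Non-coalesced: consecutive threads read addresses `stride` bytes
// apart, so the warp's requests split into multiple transactions.
__global__ void stridedRead(const unsigned char *in,
                            unsigned char *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(size_t)i * stride];  // serialized reads
}
```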
After permutation, if the sequence $s_2$ and its N per-
muted copies were stored contiguously one after
another in global memory, the intuitive memory lay-
out would be as shown in Figure 1(a). Note that we
need one byte (uchar) to store each amino acid residue.
Moreover, the GPU can read four bytes of data (packed
as the CUDA built-in vector data type uchar4) from
global memory to registers in one instruction. To achieve
high parallelism of global memory access, uchar4 is used
to store the permuted sequences. Dummy amino acid
symbols are padded at the end to make the length of
each sequence a multiple of 4.
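As a host-side illustration of the padding step (a sketch of our own; the helper name and the '@' dummy-residue symbol are assumptions, since the paper does not specify the dummy symbol):

```cuda
#include <cstring>

// Pad a residue string to a multiple of 4 bytes so it can later be
// read from global memory as uchar4 values. The '@' sentinel for
// dummy residues is our assumption, not from the paper.
void packSequence(const unsigned char *seq, int len,
                  unsigned char *dst /* padded output buffer */)
{
    int paddedLen = (len + 3) & ~3;           // round up to multiple of 4
    memcpy(dst, seq, len);                    // copy real residues
    memset(dst + len, '@', paddedLen - len);  // append dummy residues
}
```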
Considering inter-task parallelism, where each thread
works on the alignment of one of the permuted copies
of $s_2$ to $s_1$, in this layout the gap between the memory
accesses of neighboring threads is at least the length
of the sequence. For example, in the intuitive layout, if
thread $T_0$ accesses the first residue of the first permuted
copy (i.e., 'R') and thread $T_1$ accesses the first residue of
the second permuted copy (i.e., 'E'), the gap between
the accessed data is n. This results in non-coalesced mem-
ory reads (i.e., serialized reads), which significantly dete-
riorates performance.
We therefore reorganize the layout of the sequence data
in memory as an aligned structure of arrays to obtain
coalesced reads, as shown in Figure 1(b). In the optimized
layout, the characters (at a granularity of 4 bytes) that lie
at the same index in different permuted sequences stay
at neighboring positions. Then, if the first uchar4 of the
first permuted sequence (i.e., 'REGN') is requested by
thread $T_0$, the first uchar4 of the second permuted
sequence (i.e., 'ARNE') is requested by $T_1$, and so on.
This results in a warp of threads reading consecutive
memory (each thread reads 4 bytes) in a single access.
Thus the global memory access is coalesced, and high
performance is achieved.
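The corresponding read in the optimized structure-of-arrays layout (again a sketch of our own with hypothetical names) lets a warp of 32 threads fetch 128 consecutive bytes in one transaction:

```cuda
// Optimized layout: the j-th uchar4 of every permuted copy is
// stored contiguously across the N copies, so thread t reads its
// 4 residues right next to those of threads t-1 and t+1.
__device__ uchar4 loadChunkCoalesced(
    const uchar4 *seqs, int numCopies, int t, int j)
{
    return seqs[(size_t)j * numCopies + t];  // consecutive uchar4s per warp
}
```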
As the sequences remain unchanged during the align-
ment, they can be treated as read-only data, which
can be bound to texture memory. For such read patterns,
texture memory fetches are a better alternative to global
memory reads because of the texture cache, which
can further improve performance.
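A minimal sketch of such a binding, using the legacy CUDA texture reference API that was current at the time of this work (all identifiers are ours):

```cuda
// Read-only permuted sequences bound to a 1D texture so that
// fetches go through the texture cache.
texture<uchar4, cudaTextureType1D, cudaReadModeElementType> seqTex;

__global__ void alignKernel(int numCopies, int chunks)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    for (int j = 0; j < chunks; ++j) {
        uchar4 r = tex1Dfetch(seqTex, j * numCopies + t);  // cached fetch
        // ... consume r.x, r.y, r.z, r.w in the alignment ...
    }
}

// Host side: bind the device buffer holding the packed sequences.
// cudaBindTexture(0, seqTex, d_seqs, totalBytes);
```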
Occupancy maximization
Hiding global memory latency is very important for
achieving high performance on the GPU. This can be
done by creating enough threads to keep the CUDA
cores occupied while many other threads are
waiting for global memory accesses [39]. GPU occu-
pancy, as defined below, is a metric that indicates how
effectively the hardware is kept busy:
$$\mathrm{Occupancy} = (B \times T_{num}) / T_{max} \qquad (2)$$

where $T_{max}$ is the maximum number of resident threads
that can be launched on a streaming multiprocessor
(SM), which is a constant for a specific GPU; $T_{num}$ is
the number of active threads per block; and $B$ is the
number of active blocks per SM.
$B$ also depends on the GPU's physical limitations (e.g.,
the amount of registers, shared memory, and threads
supported by each model). It is given as follows:

$$B = \min(B_{user}, B_{reg}, B_{shr}, B_{hw}) \qquad (3)$$

where $B_{hw}$ is the hardware limit (only 8 blocks are
allowed per SM), and $B_{reg}$ and $B_{shr}$ are the potential
numbers of blocks permitted by the register and shared
memory usage, respectively.
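As a worked example of equations (2) and (3) (our own numbers, assuming a Fermi-class GPU with $T_{max} = 1536$ resident threads per SM, consistent with the 8-block-per-SM limit above):

```cuda
#include <stdio.h>

// Evaluate equations (2) and (3) for a hypothetical configuration.
// The resource limits below are assumptions for a Fermi-class GPU,
// not values given in the paper.
int main(void)
{
    int T_max  = 1536;  // max resident threads per SM (hardware)
    int T_num  = 192;   // active threads per block (chosen by us)
    int B_user = 8, B_reg = 8, B_shr = 8, B_hw = 8;

    int B = B_user;                                  // equation (3)
    if (B_reg < B) B = B_reg;
    if (B_shr < B) B = B_shr;
    if (B_hw  < B) B = B_hw;

    double occupancy = (double)(B * T_num) / T_max;  // equation (2)
    printf("B = %d, occupancy = %.2f\n", B, occupancy);  // B = 8, 1.00
    return 0;
}
```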