Fig. 3. The performance of Mandelbrot can be increased by forcing
uniformity for more branches. However, if software overhead is added to
ensure branch uniformity, increasing the number of affected branches
increases overhead and can even result in degraded performance. (Axes:
% forced uniform branches vs. performance increase; series: no software
overhead, software overhead.)
Fig. 4. While eliminating control divergence can increase performance,
blindly forcing branch uniformity can result in degraded output quality.
(Axes: % forced uniform branches vs. % mismatched output bytes; series:
Mandelbrot and Julia output quality degradation.)
significant. Figure 3 shows the potential performance increase
(runtime reduction) if control divergence can be eliminated for
a fraction of the static branches in Mandelbrot (from 0% to
100% of branches). When the fraction is less than 100%, the
branches are chosen uniformly at random. Control divergence
is preempted by changing the source code to vote within a
warp on the condition of a branch and forcing all threads in
the warp to take the same (majority) direction at the branch
(details in Section III). Experiments were run natively on an
NVIDIA GeForce GTX 480 GPU (details in Section VI).
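As a rough sketch (not the paper's Section III implementation;
the function and kernel names below are illustrative), such voting
can be expressed with CUDA warp vote intrinsics. The __ballot_sync
and __activemask forms shown are from CUDA 9+; on the Fermi-era
GTX 480 used here, the older __ballot(cond) intrinsic would serve
the same role:

__device__ bool herd_branch(bool cond) {
    // Each lane votes on its condition; bit i of 'votes' is lane i's vote.
    unsigned active = __activemask();               // lanes that reached this point
    unsigned votes  = __ballot_sync(active, cond);  // warp-wide ballot
    // Herd the warp: every lane follows the majority direction (ties: taken).
    return 2 * __popc(votes) >= __popc(active);
}

__global__ void herded_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    bool cond = data[i] > 0.0f;     // per-lane condition (illustrative)
    if (herd_branch(cond)) {        // condition is now warp-uniform, so the
        data[i] *= 2.0f;            // branch cannot diverge; minority lanes
    } else {                        // simply take the "wrong" direction
        data[i] *= 0.5f;
    }
}

The vote itself is the software overhead measured in Figure 3: a few
extra instructions at every herded branch, whether or not that branch
would actually have diverged.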
While only 10% of dynamic instructions in Mandelbrot are
branches, and less than 1% of branches diverge, performance
can potentially be increased by 31% by eliminating control
divergence. As the no-software-overhead performance series in
Figure 3 demonstrates, performance increases for Mandelbrot
as control divergence is eliminated for more branches. Figure 4
shows that the quality of the Mandelbrot output set degrades
by less than 2%, even when divergence has been eliminated
for all static branches. This shows that for certain error-tolerant
applications, it may be possible to obtain significant
performance benefits from eliminating control divergence for
minimal output quality degradation. A quick look at the
Julia output set, however, also suggests that an indiscriminate
selection of branches for herding may result in significant
output quality degradation for several applications. Therefore,
any implementation of branch herding needs to carefully select
the branches to target. Figure 5 shows visual representations
of the Mandelbrot and Julia output sets as the percentage
of forced uniform branches increases from 20% to 100% in
increments of 40%.
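For concreteness, the Fig. 4 metric (% mismatched output bytes)
amounts to a byte-wise comparison of the herded output against a
reference, non-herded output. A minimal host-side sketch (the helper
name is hypothetical) is:

#include <cstddef>

// Hypothetical helper for the Fig. 4 quality metric: the percentage of
// output bytes that differ between the reference run and the herded run.
double percent_mismatched_bytes(const unsigned char* ref,
                                const unsigned char* herded,
                                size_t nbytes) {
    size_t mismatched = 0;
    for (size_t i = 0; i < nbytes; ++i)
        if (ref[i] != herded[i]) ++mismatched;
    return 100.0 * static_cast<double>(mismatched)
                 / static_cast<double>(nbytes);
}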
The software-overhead performance series of Figure 3
demonstrates another important consideration for any technique
that eliminates control divergence. Since the fraction of
divergent branches in a program may be small (in this case,
less than 1%), an indiscriminate application of a technique
to all branches may result in significant overhead that
diminishes or even eliminates the performance gains that result
from reduced divergence. This result reinforces the conclusion
that care should be exercised in selecting the branches to
target for elimination of control divergence. Also, a
low-overhead mechanism for eliminating control divergence may
enable significantly more benefits. The result also confirms
that naïve implementations of techniques to eliminate control
divergence may actually decrease performance in some scenarios.

Fig. 5. Progression of Mandelbrot (top) and Julia (bottom) images from 20%
to 100% forced branch uniformity in 40% intervals.
B. Memory Divergence
Like control divergence, memory divergence occurs when
threads in the same warp exhibit different behavior. In the
GPU, a load operation for a warp is implemented as a
collection of scalar loads, where each thread potentially loads
from a different address. When a load is issued, the SM
sets up destination registers and corresponding scoreboard
entries for each thread in the warp. The load then exits the
pipeline, potentially before any of the individual thread loads
have finished. When all the memory requests corresponding
to the warp load have finished, the destination vector register
is marked as ready. Instructions that depend on the load must
stall if any lanes of the destination vector register are not ready.
Memory divergence occurs when the memory requests for
some threads finish before those of other threads in the same
warp [9]. Individual threads that delay in finishing their loads
prevent the SM from issuing any dependent instructions from
that warp, even though other threads are ready to execute.
Memory divergence may occur for two reasons. (1) The time
to complete each memory request depends on several factors,
including which DRAM bank the target memory resides in,
contention in the interconnect network, and availability of
resources (such as MSHRs) in the memory controller. (2)
Since the target data for a collection of memory requests
made by a warp may reside in different levels of the memory
hierarchy, the individual memory operations may take different
lengths of time to complete.
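As an assumed illustration (not taken from the paper), an indirect
gather exhibits both effects: each lane's load targets a different
address, so some lanes may hit in cache while others go all the way
to DRAM, and the dependent add cannot issue until the slowest lane's
load returns:

__global__ void gather_add(const float* __restrict__ src,
                           const int*   __restrict__ idx,
                           float* dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Each lane loads from a potentially different address; the
        // per-lane requests may be serviced by different levels of the
        // memory hierarchy and finish at different times.
        float v = src[idx[i]];
        // Dependent instruction: the warp cannot issue this until the
        // destination register is ready for all lanes, so the slowest
        // lane determines when the warp becomes ready again.
        dst[i] = v + 1.0f;
    }
}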
Most GPU architectures do not implement out-of-order
execution due to its relative complexity and hardware cost.
Rather, GPUs hide long-latency stalls by multithreading
instructions from a pool of ready warps. Providing each SM
with enough ready warps ensures that long-latency stalls are
not exposed. Memory divergence delays the time when
a warp may execute the next dependent instruction, cutting
into the pool of ready warps and potentially exposing stalls