Note from the Lead Developer of Intel Threading Building Blocks
Starting in 2004, I chaired a study group inside Intel that drafted the initial Threading Building Blocks proposal. Despite an early commitment to a library-only
solution, we drew much of our inspiration from new languages and extensions for
parallel programming. Our goal was to borrow as many good ideas as we could put
into library form. Sticking to a library was a requirement so that Threading Building
Blocks could slip easily into existing C++ programming environments.
C++ makes the library approach practical because it is designed for writing libraries.
Stroustrup cites a 10X reduction in line count for the Booch components written in
C++ versus Ada. Perhaps C++ will be even more powerful in the future. For example, the addition of lambda functions (see Chapter 12) would simplify the mechanics of using the Threading Building Blocks parallel_for.
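As a concrete sketch of that simplification (using the lambda syntax that was still only a proposal when this note was written), the loop body of a parallel_for can be written inline instead of as a hand-coded function object class:

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    #include <cstddef>

    // Scale an array in parallel. The lambda captures 'a' and 'factor'
    // by value and is applied to each subrange the scheduler creates.
    void scale(float a[], std::size_t n, float factor) {
        tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n),
            [=](const tbb::blocked_range<std::size_t>& r) {
                for (std::size_t i = r.begin(); i != r.end(); ++i)
                    a[i] *= factor;
            });
    }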
A library-only solution is not perfect. We had to leave out some features that really
require compiler support. For example, data-parallel array operations such as those in Fortran 90, ZPL, and NESL were deemed impractical because they rely heavily on
optimizing compilers. There have been some C++ libraries such as POOMA that do
some of the optimizations via template metaprogramming, but the complexity of
such libraries is high. Parallel functional programming is another powerful paradigm, but alas, it requires significant compiler support.
Several systems were particularly influential in the development of Threading Building Blocks; these (and others) are listed in the bibliography of this book.
The Chare Kernel (now Charm++) showed the advantages of breaking a program
into many small tasks. In particular, distributing load is simplified. By analogy, it's a lot easier to distribute many small objects evenly among several cases than a few large ones.
Cilk showed the power of combining a scheduling technique called task stealing with
recursive tasks. Recursion is often slower than iteration for serial programming, but
it turns out that recursive parallelism has some big advantages over iterative parallelism with respect to load balancing and cache reuse. Cache reuse is critical because
restructuring for cache sometimes improves program speed by 2X or more, possibly
delivering better improvement than multithreading alone. Fortunately, the Cilk
approach tends to steer programmers to solutions that both are parallel and have
good cache behavior.
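For illustration, here is a sketch of that recursive style. It uses C++ lambdas and the tbb::parallel_invoke convenience, both of which postdate this note, but the shape of the computation is the one a Cilk-style task-stealing scheduler exploits: each half of the split becomes a task an idle thread can steal.

    #include <tbb/parallel_invoke.h>
    #include <cstddef>

    // Divide-and-conquer sum. Below the cutoff (an illustrative grain
    // size), plain serial code runs; above it, the two halves are
    // independent tasks available for stealing.
    float parallel_sum(const float a[], std::size_t n) {
        if (n <= 1000) {
            float s = 0;
            for (std::size_t i = 0; i < n; ++i)
                s += a[i];
            return s;
        }
        float left, right;
        tbb::parallel_invoke(
            [&] { left  = parallel_sum(a, n / 2); },
            [&] { right = parallel_sum(a + n / 2, n - n / 2); });
        return left + right;
    }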
The C++ Standard Template Library (STL) showed how a library could be both
generic and efficient. As we gain experience, we’re learning how to be more generic.
STAPL showed how to bring generic binding of algorithms to containers into the
parallel world, by substituting the fundamentally sequential STL iterator with parallel recursive ranges (pRanges in STAPL). This enabled parallel algorithms to operate
on parallel containers and opened up the ability to apply parallel recursive ranges to
multidimensional spaces (e.g., blocked_range2d), and even reuse (some would say abuse) them to write a parallel quicksort.
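For example, a sketch (again assuming lambdas; the transpose is illustrative, not from the book) of a two-dimensional loop over a blocked_range2d. The range is split recursively in both dimensions, so each task works on a cache-friendly tile:

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range2d.h>
    #include <cstddef>

    // Transpose a rows x cols matrix. blocked_range2d recursively
    // splits the 2D iteration space; each subrange is one tile.
    void transpose(float* dst, const float* src,
                   std::size_t rows, std::size_t cols) {
        tbb::parallel_for(tbb::blocked_range2d<std::size_t>(0, rows, 0, cols),
            [=](const tbb::blocked_range2d<std::size_t>& r) {
                for (std::size_t i = r.rows().begin(); i != r.rows().end(); ++i)
                    for (std::size_t j = r.cols().begin(); j != r.cols().end(); ++j)
                        dst[j * rows + i] = src[i * cols + j];
            });
    }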