
because it creates a load imbalance, the schedule clause can change the default distribution scheme. For
instance, schedule(dynamic, c) creates chunks of size c, and idle threads grab the next available chunk.
OpenMP also provides mechanisms to describe the visibility of data to the threads (“scoping”). It is important
to tell the OpenMP compiler what data must remain in the shared memory domain and what data needs to be
private to the individual threads. OpenMP defines clauses that can be added to the directives to control the scope
of variables. The shared clause keeps a variable in the shared space, while the private clause creates a
thread-private copy of a variable. Shared scope is the default for variables that are declared outside of a parallel
region. In Figure 1, this applies to the variables A, B, C, and the constant N. Private variables can store
different values for different threads, as needed by the loop counter i in the example. Private copies are by
default created uninitialized; firstprivate can be used to initialize them with the value the variable has outside
of the parallel region.
The latest version 4.0 of the OpenMP API specification not only includes minor bug fixes and improvements
to existing features, it now supports a good share of the features introduced with Fortran 2003. OpenMP affinity
defines a common way to express thread affinity to execution units of the hardware. Version 4.0 also comes
with major feature enhancements, some of which will be discussed in more detail here. Task groups improve
tasking by providing a better way to express synchronization of a set of tasks and to handle cancellation,
which allows parallel execution to be stopped. SIMD pragmas extend thread-parallel execution to data-parallel
SIMD machine instructions, while user-defined reductions let programmers specify arbitrary reduction operations.
Possibly the biggest addition to OpenMP is support for offloading computation to coprocessor devices.
Talk the Talk, Task the Task
The growing number of cores (and threads) makes it harder to fully utilize them with traditional worksharing
constructs for parallel loops. Irregular algorithms, such as recursions and traversals of graphs, require a completely
different approach to parallelism. Task-based models blend well with the requirements of these algorithms,
since tasks can be created in a much more flexible way.
An OpenMP task may be treated as a small package that consists of a piece of code to be executed and all the
data needed for its execution. A task is created through the #pragma omp task directive, which marks
a piece of code and data for concurrent execution. The OpenMP runtime system takes care of mapping the
created tasks to the threads of a parallel region. It may defer a task for later execution by adding it to a task
queue, or it may execute the task immediately.
Figure 2 shows a task-parallel version of a very simple, brute-force Sudoku* solver. The idea of the algorithm is:
1. Find an empty field without a number
2. Insert a number
3. Check the Sudoku board
4. If the solution is invalid, try the next possible number
5. If the solution is valid, go to the next field and start over
For more information regarding performance and optimization choices in Intel® software products, visit http://software.intel.com/en-us/articles/optimization-notice.