AMD ACCELERATED PARALLEL PROCESSING
1-2 Chapter 1: OpenCL Architecture and AMD Accelerated Parallel Processing
Copyright © 2013 Advanced Micro Devices, Inc. All rights reserved.
1.2 OpenCL Overview
The OpenCL programming model consists of producing complicated task graphs
from data-parallel execution nodes.
In a given data-parallel execution, commonly known as a kernel launch, a
computation is defined in terms of a sequence of instructions that executes at
each point in an N-dimensional index space. It is a common, though by no
means required, formulation of an algorithm that each computation index
maps to an element in an input data set.
The OpenCL data-parallel programming model is hierarchical. The hierarchical
subdivision can be specified in two ways:
• Explicitly - the developer defines the total number of work-items to execute
in parallel, as well as the division of work-items into specific work-groups.
• Implicitly - the developer specifies the total number of work-items to execute
in parallel, and OpenCL manages the division into work-groups.
OpenCL's API also supports the concept of a task dispatch. This is equivalent to
executing a kernel on a compute device with a work-group and NDRange
containing a single work-item. Parallelism is expressed using vector data types
implemented by the device, enqueuing multiple tasks, and/or enqueuing native
kernels developed using a programming model orthogonal to OpenCL.
wavefronts and work-groups
Wavefronts and work-groups are two concepts relating to compute
kernels that provide data-parallel granularity. A wavefront executes a
number of work-items in lock step relative to each other. Sixteen work-
items are executed in parallel across the vector unit, and the whole
wavefront is covered over four clock cycles. It is the lowest level at
which flow control can have an effect: if two work-items inside a
wavefront take divergent paths of flow control, all work-items in the
wavefront execute both paths.
Grouping is a higher-level granularity of data parallelism that is enforced
in software, not hardware. Synchronization points in a kernel guarantee
that all work-items in a work-group reach that point (barrier) in the code
before the next statement is executed.
Work-groups are composed of wavefronts. Best performance is attained
when the group size is an integer multiple of the wavefront size.
local data store (LDS)
The LDS is a high-speed, low-latency memory private to each compute
unit. It is a full gather/scatter model: a work-group can write anywhere
in its allocated space. This model is unchanged for the AMD Radeon™
HD 7XXX series. The constraints of the current LDS model are:
• The LDS size is allocated per work-group. Each work-group specifies
how much of the LDS it requires. The hardware scheduler uses this
information to determine which work-groups can share a compute unit.
• Data can only be shared within work-items in a work-group.
• Memory accesses outside of the work-group result in undefined
behavior.