Using the cuFFT API
cuFFT Library User's Guide DU-06707-001_v11.4|11
the same switch. Note that multiple GPU execution is not guaranteed to solve a given size
problem in a shorter time than single GPU execution.
The multiple GPU extensions to cuFFT are built on the extensible cuFFT API. The general
steps in defining and executing a transform with this API are:
‣ cufftCreate() - create an empty plan, as in the single GPU case
‣ cufftXtSetGPUs() - define which GPUs are to be used
‣ Optional: cufftEstimate{1d,2d,3d,Many}() - estimate the sizes of the work areas required. These are the same functions used in the single GPU case although the definition of the argument workSize reflects the number of GPUs used.
‣ cufftMakePlan{1d,2d,3d,Many}() - create the plan. These are the same functions used in the single GPU case although the definition of the argument workSize reflects the number of GPUs used.
‣ Optional: cufftGetSize{1d,2d,3d,Many}() - refined estimate of the sizes of the work areas required. These are the same functions used in the single GPU case although the definition of the argument workSize reflects the number of GPUs used.
‣ Optional: cufftGetSize() - check workspace size. This is the same function used in the single GPU case although the definition of the argument workSize reflects the number of GPUs used.
‣ Optional: cufftXtSetWorkArea() - do your own workspace allocation.
‣ cufftXtMalloc() - allocate descriptor and data on the GPUs
‣ cufftXtMemcpy() - copy data to the GPUs
‣ cufftXtExecDescriptorC2C()/cufftXtExecDescriptorZ2Z() - execute the plan
‣ cufftXtMemcpy() - copy data from the GPUs
‣ cufftXtFree() - free any memory allocated with cufftXtMalloc()
‣ cufftDestroy() - free cuFFT plan resources
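The steps above can be sketched as a minimal single-precision 1D C2C transform spread across two GPUs. This is an illustrative sketch, not a complete program: device IDs 0 and 1 are assumed to exist, and error checking is abbreviated to early returns.

```cpp
// Sketch of the extensible multi-GPU cuFFT workflow (assumes two GPUs, IDs 0 and 1).
#include <cufftXt.h>
#include <cstdlib>
#include <vector>

int main() {
    const int nx = 1024, batch = 1;

    // 1. Create an empty plan, as in the single GPU case.
    cufftHandle plan;
    if (cufftCreate(&plan) != CUFFT_SUCCESS) return EXIT_FAILURE;

    // 2. Define which GPUs are to be used (assumed device IDs).
    int gpus[2] = {0, 1};
    if (cufftXtSetGPUs(plan, 2, gpus) != CUFFT_SUCCESS) return EXIT_FAILURE;

    // 3. Create the plan; workSize is an array with one entry per GPU.
    size_t workSize[2];
    if (cufftMakePlan1d(plan, nx, CUFFT_C2C, batch, workSize) != CUFFT_SUCCESS)
        return EXIT_FAILURE;

    // 4. Allocate the descriptor and device data on the GPUs.
    std::vector<cufftComplex> host(nx, cufftComplex{1.0f, 0.0f});
    cudaLibXtDesc *desc;
    if (cufftXtMalloc(plan, &desc, CUFFT_XT_FORMAT_INPLACE) != CUFFT_SUCCESS)
        return EXIT_FAILURE;

    // 5. Copy data to the GPUs, 6. execute in place, 7. copy data back.
    cufftXtMemcpy(plan, desc, host.data(), CUFFT_COPY_HOST_TO_DEVICE);
    cufftXtExecDescriptorC2C(plan, desc, desc, CUFFT_FORWARD);
    cufftXtMemcpy(plan, host.data(), desc, CUFFT_COPY_DEVICE_TO_HOST);

    // 8. Free the descriptor memory and the plan resources.
    cufftXtFree(desc);
    cufftDestroy(plan);
    return EXIT_SUCCESS;
}
```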
2.8.1. Plan Specification and Work Areas
In the single GPU case a plan is created by a call to cufftCreate() followed by a call to
cufftMakePlan*(). For multiple GPUs, the GPUs to use for execution are identified by a call
to cufftXtSetGPUs(); this call must occur after the call to cufftCreate() and before the
call to cufftMakePlan*().
Note that when cufftMakePlan*() is called for a single GPU, the work area is on that GPU. In
a multiple GPU plan, the returned work area has multiple entries, one value per GPU; that is,
workSize points to a size_t array with one entry per GPU. Also, strides and batches apply to
the entire plan across all GPUs associated with the plan.
Once a plan is locked by a call to cufftMakePlan*(), different descriptors may be specified in
calls to cufftXtExecDescriptor*() to execute the plan on different data sets, but the new
descriptors must use the same GPUs in the same order.
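For instance, two descriptors allocated from the same locked plan necessarily reside on the same GPUs in the same order, so each can be passed to cufftXtExecDescriptor*() in turn. A sketch, assuming devices 0 and 1 and omitting error checking:

```cpp
// Sketch: reusing one locked multi-GPU plan on two data sets via two descriptors.
#include <cufftXt.h>

int main() {
    cufftHandle plan;
    cufftCreate(&plan);
    int gpus[2] = {0, 1};                 // assumed device IDs
    cufftXtSetGPUs(plan, 2, gpus);

    size_t workSize[2];                   // one entry per GPU
    cufftMakePlan1d(plan, 1024, CUFFT_C2C, 1, workSize);  // plan is now locked

    // Two independent data sets, each with its own descriptor.
    cudaLibXtDesc *descA, *descB;
    cufftXtMalloc(plan, &descA, CUFFT_XT_FORMAT_INPLACE);
    cufftXtMalloc(plan, &descB, CUFFT_XT_FORMAT_INPLACE);
    // ... fill descA and descB with different data via cufftXtMemcpy() ...

    // Both descriptors came from this plan, so they use the same GPUs
    // in the same order, as the locked plan requires.
    cufftXtExecDescriptorC2C(plan, descA, descA, CUFFT_FORWARD);
    cufftXtExecDescriptorC2C(plan, descB, descB, CUFFT_FORWARD);

    cufftXtFree(descA);
    cufftXtFree(descB);
    cufftDestroy(plan);
    return 0;
}
```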
As in the single GPU case, cufftEstimate{1d,2d,3d,Many}() and
cufftGetSize{1d,2d,3d,Many}() give estimates of the work area sizes required for a
multiple GPU plan; in this case workSize points to a size_t array, one entry per GPU.
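A sketch of querying these per-GPU work sizes for a two-GPU plan, both before the plan is made (refined estimate) and after it is locked (actual workspace); device IDs 0 and 1 and a transform size of 4096 are assumed:

```cpp
// Sketch: per-GPU work area sizes for a two-GPU plan (assumed device IDs 0 and 1).
#include <cufftXt.h>
#include <cstdio>

int main() {
    cufftHandle plan;
    cufftCreate(&plan);
    int gpus[2] = {0, 1};
    cufftXtSetGPUs(plan, 2, gpus);

    // workSize arguments point to a size_t array, one entry per GPU.
    size_t refined[2], workSize[2], locked[2];

    // Refined estimate, available before the plan is made.
    cufftGetSize1d(plan, 4096, CUFFT_C2C, 1, refined);

    // Making the plan fills in the same per-GPU array shape.
    cufftMakePlan1d(plan, 4096, CUFFT_C2C, 1, workSize);

    // Check the workspace sizes of the locked plan.
    cufftGetSize(plan, locked);

    for (int i = 0; i < 2; ++i)
        printf("GPU %d: %zu bytes\n", gpus[i], locked[i]);

    cufftDestroy(plan);
    return 0;
}
```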