NVIDIA Fermi Architecture: Overview of the Next-Generation CUDA Compute Whitepaper

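Updated 2024-08-01 · PDF, 1.07 MB

"NVIDIA's Fermi compute architecture whitepaper describes the company's next-generation CUDA compute and graphics architecture, named after the physicist Enrico Fermi. The whitepaper covers the history of GPU computing, the key features of the Fermi architecture, and updates to the CUDA programming model, all aimed at improving high-performance computing and graphics processing."

In its history of GPU computing, the whitepaper discusses the G80 architecture, an early NVIDIA milestone that marked the point at which GPUs began to support general-purpose computing. G80 introduced the CUDA (Compute Unified Device Architecture) programming model, which lets programmers use the GPU's parallel processing power for non-graphics computation.

The core highlights of Fermi, NVIDIA's next-generation CUDA architecture, include:

1. Third-generation Streaming Multiprocessor (SM): each SM contains 32 CUDA cores, four times as many as in GT200 (the 512-core figure refers to the full chip of 16 SMs, not to a single SM). The design targets scientific and high-performance computing workloads.
2. 16 load/store units and 4 special function units per SM: the load/store units handle memory access, and the special function units execute transcendental operations such as sine, cosine, reciprocal, and square root.
3. Stronger double-precision floating point: up to 8x the peak double-precision performance of GT200, which matters for scientific and engineering applications that need high-precision results.
4. Dual warp scheduler: each SM schedules and dispatches instructions from two independent warps at the same time, which raises utilization of the execution units.
5. 64 KB of configurable shared memory and L1 cache: the on-chip RAM in each SM can be partitioned between shared memory and L1 cache, reducing traffic to global memory.
6. Second-generation Parallel Thread Execution (PTX) ISA: the updated instruction set supports richer language features such as C++ and is optimized for OpenCL and DirectCompute.
7. Unified address space: local, shared, and global memory are addressed through one address space, which makes pointers usable everywhere and enables full C++ support.
8. Improved conditional execution through predication: short conditional code can be executed under predicates instead of branches, which reduces branch overhead.
9. Memory subsystem innovations: the NVIDIA Parallel DataCache hierarchy combines a configurable L1 cache with a unified L2 cache for faster data access.
10. First GPU with ECC memory support: ECC protects data integrity and guards data-sensitive applications against memory errors.
11. Fast atomic memory operations: read-modify-write operations used for synchronization and shared updates in multithreaded code run much faster.
12. GigaThread engine: the chip-level thread scheduler offers 10x faster application context switching, concurrent kernel execution, out-of-order thread block execution, and dual overlapped memory transfer engines, which helps keep the GPU busy.

These features make Fermi well suited to scientific computing, physical simulation, and other compute-intensive workloads. Through the CUDA programming interface, developers can use them to write efficient parallel code that accelerates such tasks.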

• Improve Double Precision Performance: while single precision floating point performance was on the order of ten times the performance of desktop CPUs, some GPU computing applications desired more double precision performance as well.
• ECC support: ECC allows GPU computing users to safely deploy large numbers of GPUs in datacenter installations, and also ensure data-sensitive applications like medical imaging and financial options pricing are protected from memory errors.
• True Cache Hierarchy: some parallel algorithms were unable to use the GPU's shared memory, and users requested a true cache architecture to aid them.
• More Shared Memory: many CUDA programmers requested more than 16 KB of SM shared memory to speed up their applications.
• Faster Context Switching: users requested faster context switches between application programs and faster graphics and compute interoperation.
• Faster Atomic Operations: users requested faster read-modify-write atomic operations for their parallel algorithms.
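To make the shared-memory and atomic-operation requests more concrete, here is a minimal CUDA sketch of the kind of read-modify-write pattern they refer to: a 256-bin byte histogram. The kernel name `histogram256`, the helper `run_histogram`, and the launch dimensions are illustrative assumptions, not taken from the whitepaper. The call to `cudaFuncSetCacheConfig` is only a hint that asks the runtime to favor shared memory on devices where the on-chip RAM split is configurable.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical kernel: each block accumulates a private histogram in shared
// memory with atomicAdd, then merges it into the global result.
__global__ void histogram256(const unsigned char *data, size_t n,
                             unsigned int *bins)
{
    __shared__ unsigned int local[256];

    // Zero the per-block bins.
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        local[i] = 0;
    __syncthreads();

    // Grid-stride loop: every thread handles several input bytes.
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += stride)
        atomicAdd(&local[data[i]], 1u);   // read-modify-write in shared memory
    __syncthreads();

    // Merge the block's counts into global memory, again atomically.
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        atomicAdd(&bins[i], local[i]);
}

// Hypothetical host helper: copies the input to the device, runs the kernel,
// and copies the 256 counts back into host_bins.
cudaError_t run_histogram(const unsigned char *host_data, size_t n,
                          unsigned int *host_bins)
{
    const size_t bins_bytes = 256 * sizeof(unsigned int);
    unsigned char *d_data = nullptr;
    unsigned int *d_bins = nullptr;

    cudaError_t err = cudaMalloc(&d_data, n);
    if (err == cudaSuccess) err = cudaMalloc(&d_bins, bins_bytes);
    if (err == cudaSuccess)
        err = cudaMemcpy(d_data, host_data, n, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemset(d_bins, 0, bins_bytes);
    // Hint only: prefer the larger shared-memory partition for this kernel.
    if (err == cudaSuccess)
        err = cudaFuncSetCacheConfig(histogram256, cudaFuncCachePreferShared);
    if (err == cudaSuccess) {
        histogram256<<<64, 256>>>(d_data, n, d_bins);  // assumed launch shape
        err = cudaGetLastError();
    }
    // This blocking copy also waits for the kernel to finish.
    if (err == cudaSuccess)
        err = cudaMemcpy(host_bins, d_bins, bins_bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    cudaFree(d_bins);
    return err;
}
```

A caller would pass a host byte buffer and a 256-element `unsigned int` array to `run_histogram` and check the returned `cudaError_t`. Keeping the hot counters in shared memory is a common design choice because it confines most atomic traffic to the SM and leaves only one global update per bin per block.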

With these requests in mind, the Fermi team designed a processor that greatly increases raw compute horsepower, and through architectural innovations, also offers dramatically increased programmability and compute efficiency. The key architectural highlights of Fermi are:

• Third Generation Streaming Multiprocessor (SM)
  o 32 CUDA cores per SM, 4x over GT200
  o 8x the peak double precision floating point performance over GT200
  o Dual Warp Scheduler simultaneously schedules and dispatches instructions from two independent warps
  o 64 KB of RAM with a configurable partitioning of shared memory and L1 cache
• Second Generation Parallel Thread Execution ISA
  o Unified Address Space with Full C++ Support
  o Optimized for OpenCL and DirectCompute
  o Full IEEE 754-2008 32-bit and 64-bit precision
  o Full 32-bit integer path with 64-bit extensions
  o Memory access instructions to support transition to 64-bit addressing
  o Improved Performance through Predication
• Improved Memory Subsystem
  o NVIDIA Parallel DataCache hierarchy with Configurable L1 and Unified L2 Caches
  o First GPU with ECC memory support
  o Greatly improved atomic memory operation performance
• NVIDIA GigaThread Engine
  o 10x faster application context switching
  o Concurrent kernel execution
  o Out of Order thread block execution
  o Dual overlapped memory transfer engines
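
Concurrent kernel execution and overlapped memory transfers, as listed above, are typically reached from CUDA code through streams. The following is a minimal sketch under stated assumptions: the `scale` kernel and the helper `scale_in_two_streams` are hypothetical names, the two-way split is arbitrary, and whether copies and kernels really overlap depends on the device and on the host buffer being page-locked (for example, allocated with `cudaMallocHost`).

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical kernel: multiplies each element by a constant factor.
__global__ void scale(float *x, size_t n, float factor)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= factor;
}

// Hypothetical helper: splits the buffer into two halves and gives each half
// its own stream, so that the copy for one half may overlap with the kernel
// or the copy-back for the other. `pinned` is assumed to be page-locked host
// memory holding n floats. Per-call error checks are shortened for brevity.
cudaError_t scale_in_two_streams(float *pinned, size_t n, float factor)
{
    const int kStreams = 2;
    cudaStream_t streams[kStreams];
    float *d_buf = nullptr;

    cudaError_t err = cudaMalloc(&d_buf, n * sizeof(float));
    if (err != cudaSuccess) return err;
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

    const size_t half = (n + 1) / 2;
    for (int s = 0; s < kStreams; ++s) {
        size_t offset = (size_t)s * half;
        if (offset >= n) continue;
        size_t count = (n - offset < half) ? (n - offset) : half;

        // Host-to-device copy, kernel, and device-to-host copy are queued in
        // order within one stream; different streams are free to overlap.
        cudaMemcpyAsync(d_buf + offset, pinned + offset, count * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        unsigned int blocks = (unsigned int)((count + 255) / 256);
        scale<<<blocks, 256, 0, streams[s]>>>(d_buf + offset, count, factor);
        cudaMemcpyAsync(pinned + offset, d_buf + offset, count * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    // Wait for both streams and pick up any launch or execution error.
    err = cudaDeviceSynchronize();
    if (err == cudaSuccess) err = cudaGetLastError();

    for (int s = 0; s < kStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_buf);
    return err;
}
```

The work is queued per stream rather than per operation type so that each half's copy, kernel, and copy-back stay ordered with respect to each other while the two halves remain independent of one another.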
