NVIDIA Fermi Architecture: A New Chapter in CUDA Computing


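"The NVIDIA Fermi Compute Architecture whitepaper describes NVIDIA's next-generation CUDA compute and graphics architecture, code-named 'Fermi'. The architecture is a milestone in GPU computing, designed to deliver the parallel processing power and advanced features that high-performance computing and graphics applications need."

Fermi is a major upgrade over G80, NVIDIA's early GPU computing architecture, which established the GPU as a platform for scientific computing and graphics rendering. Fermi aims to raise compute capability further, particularly for double precision floating point, and to refine the CUDA programming model.

At the hardware level, Fermi introduces the third-generation Streaming Multiprocessor (SM). Each SM has 32 CUDA cores, for 512 cores across the 16 SMs of a full chip, along with 16 load/store units and 4 special function units. Double precision receives particular attention because scientific computing depends on it. A dual warp scheduler issues instructions from two independent warps (groups of 32 threads) at the same time, which keeps more of the execution units busy. Each SM also has 64 KB of on-chip memory that can be partitioned between shared memory and L1 cache, reducing trips to global memory.

On the software side, Fermi supports the second-generation Parallel Thread Execution (PTX) ISA, which provides a unified address space with full C++ support and is optimized for OpenCL and DirectCompute. It offers full IEEE 754-2008 precision for both 32-bit and 64-bit floating point, and predication improves the performance of conditional code.

In the memory subsystem, the NVIDIA Parallel DataCache hierarchy combines a configurable L1 cache with a unified L2 cache. Fermi is also the first GPU to support ECC (error-correcting code) memory, which protects data integrity, and its atomic memory operations are much faster than on earlier GPUs. The GigaThread engine adds faster application context switching and concurrent kernel execution.

Overall, Fermi is a major step forward for GPU computing: it raises hardware performance and gives software developers richer tools and interfaces, broadening the use of GPU computing in science, engineering, and graphics rendering.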
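
The resource counts are easier to check as arithmetic. The sketch below is a minimal, hypothetical C++ model: `FermiConfig`, `total_cuda_cores`, and `l1_cache_kb` are illustrative names, not part of any NVIDIA API. It assumes the whitepaper's figures of 16 SMs with 32 CUDA cores each, and a 64 KB per-SM memory that can be split as 48 KB shared memory with 16 KB L1 cache, or the reverse.

```cpp
#include <cassert>

// Figures taken from the Fermi whitepaper; the struct itself is hypothetical.
struct FermiConfig {
    int sm_count = 16;           // streaming multiprocessors on a full chip
    int cores_per_sm = 32;       // CUDA cores per SM
    int warp_size = 32;          // threads per warp
    int on_chip_kb_per_sm = 64;  // shared memory and L1 cache combined
};

// Total CUDA cores on the chip: SM count times cores per SM.
int total_cuda_cores(const FermiConfig& cfg) {
    return cfg.sm_count * cfg.cores_per_sm;
}

// L1 cache size left over once the shared-memory share is chosen.
// Only the two splits described in the whitepaper are accepted.
int l1_cache_kb(const FermiConfig& cfg, int shared_kb) {
    assert(shared_kb == 16 || shared_kb == 48);
    return cfg.on_chip_kb_per_sm - shared_kb;
}
```

With the default values, `total_cuda_cores` gives 512, which is the chip-wide core count rather than a per-SM figure, and `l1_cache_kb(cfg, 48)` gives 16.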

• Improve Double Precision Performance: while single precision floating point performance was on the order of ten times the performance of desktop CPUs, some GPU computing applications desired more double precision performance as well.
• ECC support: ECC allows GPU computing users to safely deploy large numbers of GPUs in datacenter installations, and also ensures that data-sensitive applications like medical imaging and financial options pricing are protected from memory errors.
• True Cache Hierarchy: some parallel algorithms were unable to use the GPU’s shared memory, and users requested a true cache architecture to aid them.
• More Shared Memory: many CUDA programmers requested more than 16 KB of SM shared memory to speed up their applications.
• Faster Context Switching: users requested faster context switches between application programs and faster graphics and compute interoperation.
• Faster Atomic Operations: users requested faster read-modify-write atomic operations for their parallel algorithms.
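
To make "read-modify-write atomic operation" concrete, here is a minimal host-side sketch in C++. It is an analogy only: on the GPU the corresponding primitive would be a CUDA atomic such as `atomicAdd`, while this example assumes `std::atomic` as a stand-in. The `AtomicHistogram` class is a hypothetical name introduced for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: a fixed-size histogram whose bins can be
// incremented safely by several threads at once.
class AtomicHistogram {
public:
    explicit AtomicHistogram(std::size_t num_bins) : bins_(num_bins) {
        assert(num_bins > 0);
        for (auto& bin : bins_) bin.store(0, std::memory_order_relaxed);
    }

    // Atomic read-modify-write: fetch_add reads the bin, adds one, and
    // writes the result back as a single indivisible step, so concurrent
    // increments of the same bin are never lost.
    void add(std::size_t value) {
        bins_[value % bins_.size()].fetch_add(1, std::memory_order_relaxed);
    }

    unsigned count(std::size_t bin) const {
        return bins_[bin].load(std::memory_order_relaxed);
    }

private:
    std::vector<std::atomic<unsigned>> bins_;
};
```

Histograms are a typical case where many threads update a few shared counters, which is why the cost of each atomic update matters for parallel algorithms of this kind.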

With these requests in mind, the Fermi team designed a processor that greatly increases raw compute horsepower, and through architectural innovations, also offers dramatically increased programmability and compute efficiency. The key architectural highlights of Fermi are:

• Third Generation Streaming Multiprocessor (SM)
    o 32 CUDA cores per SM, 4x over GT200
    o 8x the peak double precision floating point performance over GT200
    o Dual Warp Scheduler simultaneously schedules and dispatches instructions from two independent warps
    o 64 KB of RAM with a configurable partitioning of shared memory and L1 cache
• Second Generation Parallel Thread Execution ISA
    o Unified Address Space with Full C++ Support
    o Optimized for OpenCL and DirectCompute
    o Full IEEE 754-2008 32-bit and 64-bit precision
    o Full 32-bit integer path with 64-bit extensions
    o Memory access instructions to support transition to 64-bit addressing
    o Improved Performance through Predication
• Improved Memory Subsystem
    o NVIDIA Parallel DataCache hierarchy with Configurable L1 and Unified L2 Caches
    o First GPU with ECC memory support
    o Greatly improved atomic memory operation performance
• NVIDIA GigaThread Engine
    o 10x faster application context switching
    o Concurrent kernel execution
    o Out of Order thread block execution
    o Dual overlapped memory transfer engines
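
One practical consequence of the IEEE 754-2008 support listed above is the fused multiply-add (FMA) operation, which the standard defines as computing a*b+c with a single rounding at the end. The following is a minimal host-side sketch in C++, assuming `std::fma` as a stand-in for the GPU instruction; the two function names are hypothetical.

```cpp
#include <cmath>

// Multiply, then add, rounding twice: once after the product and once
// after the sum. The volatile temporary forces the product to be rounded
// to double and keeps the compiler from fusing the two steps.
double multiply_then_add(double a, double b, double c) {
    volatile double product = a * b;
    return product + c;
}

// Fused multiply-add: a * b + c is evaluated exactly and rounded once.
double fused_multiply_add(double a, double b, double c) {
    return std::fma(a, b, c);
}
```

For a = b = 1 + 2^-27 and c = -(1 + 2^-26), `multiply_then_add` returns 0 because the 2^-54 term of the product is rounded away before the addition, while `fused_multiply_add` keeps it and returns 2^-54.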
