NVIDIA GPU Turing Architecture Explained: Viewing the GPU Architecture

WP-08608-001_v1.1 | August 2017

NVIDIA TESLA V100 GPU

ARCHITECTURE

THE WORLD’S MOST ADVANCED DATA CENTER GPU

TABLE OF CONTENTS 
Introduction to the NVIDIA Tesla V100 GPU Architecture
Tesla V100: The AI Computing and HPC Powerhouse
Key Features
Extreme Performance for AI and HPC
NVIDIA GPUs: The Fastest and Most Flexible Deep Learning Platform
Deep Learning Background
GPU-Accelerated Deep Learning
GV100 GPU Hardware Architecture In-Depth
Extreme Performance and High Efficiency
Volta Streaming Multiprocessor
Tensor Cores
Enhanced L1 Data Cache and Shared Memory
Simultaneous Execution of FP32 and INT32 Operations
Compute Capability
NVLink: Higher Bandwidth, More Links, More Features
More Links, Faster Links
More Features
HBM2 Memory Architecture
ECC Memory Resiliency
Copy Engine Enhancements
Tesla V100 Board Design
GV100 CUDA Hardware and Software Architectural Advances
Independent Thread Scheduling
Prior NVIDIA GPU SIMT Models
Volta SIMT Model
Starvation-Free Algorithms
Volta Multi-Process Service
Unified Memory and Address Translation Services
Cooperative Groups
Conclusion
Appendix A: NVIDIA DGX-1 with Tesla V100
NVIDIA DGX-1 System Specifications
DGX-1 Software
Appendix B: NVIDIA DGX Station - A Personal AI Supercomputer for Deep Learning
Preloaded with the Latest Deep Learning Software
Kickstarting AI Initiatives
Appendix C: Accelerating Deep Learning and Artificial Intelligence with GPUs
Deep Learning in a Nutshell
NVIDIA GPUs: The Engine of Deep Learning
Training Deep Neural Networks
Inferencing Using a Trained Neural Network

LIST OF FIGURES 
Figure 1. NVIDIA Tesla V100 SXM2 Module with Volta GV100 GPU
Figure 2. New Technologies in Tesla V100
Figure 3. Tesla V100 Provides a Major Leap in Deep Learning Performance with New Tensor Cores
Figure 4. Volta GV100 Full GPU with 84 SM Units
Figure 5. Volta GV100 Streaming Multiprocessor (SM)
Figure 6. cuBLAS Single Precision (FP32)
Figure 7. cuBLAS Mixed Precision (FP16 Input, FP32 Compute)
Figure 8. Tensor Core 4x4 Matrix Multiply and Accumulate
Figure 9. Mixed Precision Multiply and Accumulate in Tensor Core
Figure 10. Pascal and Volta 4x4 Matrix Multiplication
Figure 11. Comparison of Pascal and Volta Data Cache
Figure 12. Hybrid Cube Mesh NVLink Topology as used in DGX-1 with V100
Figure 13. V100 with NVLink Connected GPU-to-GPU and GPU-to-CPU
Figure 14. Second Generation NVLink Performance
Figure 15. HBM2 Memory Speedup on V100 vs P100
Figure 16. Tesla V100 Accelerator (Front)
Figure 17. Tesla V100 Accelerator (Back)
Figure 18. NVIDIA Tesla V100 SXM2 Module - Stylized Exploded View
Figure 19. Deep Learning Methods Developed Using CUDA
Figure 20. SIMT Warp Execution Model of Pascal and Earlier GPUs
Figure 21. Volta Warp with Per-Thread Program Counter and Call Stack
Figure 22. Volta Independent Thread Scheduling
Figure 23. Programs Use Explicit Synchronization to Reconverge Threads in a Warp
Figure 24. Doubly Linked List with Fine-Grained Locks
Figure 25. Software-based MPS Service in Pascal vs Hardware-Accelerated MPS Service in Volta
Figure 26. Volta MPS for Inference
Figure 27. Two Phases of a Particle Simulation
Figure 28. NVIDIA DGX-1 Server