Tutorial: Demystifying GPU Architectures for Deep Learning – Part 1
Tutorial: Demystifying GPU Architecture
- CUDA: Programming Model
- C++ interface; people usually say "program in CUDA" (bindings also exist for Python, MATLAB, and other languages)
- abstracts away most of the inner workings of GPUs
- CUDA Core:
- Hierarchy: Grid - Block - Thread, with individual threads executed on CUDA cores
- All threads within a block execute the same instructions and all of them run on the same SM
- CUDA Kernels: special functions that describe how each thread handles its data
- kernels are executed by "launching" them on the GPU
- example: matrix multiplication; on a CPU this is a nested for loop
- in CUDA you instead specify the computation that each CUDA thread should perform (sketch below)
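A minimal sketch of what such a kernel and its launch might look like (the name `matmul_naive`, sizes, and launch configuration are illustrative, not from the tutorial):

```cpp
// Illustrative sketch: each thread computes one element of C = A * B (square N x N matrices).
__global__ void matmul_naive(const float* A, const float* B, float* C, int N) {
    // Map the 2-D grid/block hierarchy onto matrix coordinates.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)          // the inner loop a CPU would run for every (row, col)
            acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}

// Launch: a grid of blocks, each block a 16x16 tile of threads.
// dim3 block(16, 16);
// dim3 grid((N + 15) / 16, (N + 15) / 16);
// matmul_naive<<<grid, block>>>(A, B, C, N);
```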
- Streaming Multiprocessor (SM, essentially a microarchitecture concept): a sophisticated processor within the GPU which contains the hardware and software for orchestrating the execution of hundreds of CUDA threads
- divides blocks of threads into ‘warps’ (groups of 32 threads)
- “Thread Manager”
- Memory hierarchy: per-thread local memory, per-block shared memory, and global memory (video RAM)
- shared memory (roughly the counterpart of a CPU cache)
- shared memory is located physically close to the CUDA cores and fetching data from shared memory is at least 10 times faster than from global memory
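A sketch of how shared memory is typically used, assuming square matrices whose size is a multiple of the tile width; the kernel name and tile size are illustrative:

```cpp
// Each block stages 16x16 tiles of A and B in fast on-chip shared memory,
// so each global-memory element is read once per tile instead of once per thread.
#define TILE 16
__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];   // visible to all threads in this block
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < N / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();               // wait until the whole tile is loaded
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               // wait before overwriting the tile
    }
    C[row * N + col] = acc;
}
```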
- Read-only cache
- read-only cache located physically close to the CUDA cores and shared within a warp (a group of threads, smaller than a block)
- data stored in this memory does not change during the course of kernel execution
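A small illustrative sketch of how read-only data is usually routed through this cache in CUDA C++, via `const __restrict__` pointers or the `__ldg()` intrinsic (the kernel itself is hypothetical):

```cpp
// Hinting that 'in' never changes during the kernel, so loads can go through
// the read-only data cache. __ldg() forces this explicitly; with
// 'const float* __restrict__' the compiler can often do it automatically.
__global__ void scale(const float* __restrict__ in, float* out, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = s * __ldg(&in[i]);
}
```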
- Registers: allow a thread to store local copies of variables that are visible only to that thread.
- each SM has a fixed, limited number of registers.
- register allocation is handled by the compiler (nvcc)
- the only thing a programmer can control is the number and size of local variables.
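An illustrative sketch (hypothetical kernel): local scalars normally live in registers, and `__launch_bounds__` or the nvcc flag `-maxrregcount=N` are the indirect knobs that bound per-thread register usage:

```cpp
// Local scalars like 'x' and 'acc' are normally register-resident and private to one thread.
// __launch_bounds__(256) tells the compiler the block size, letting it budget registers;
// compiling with  nvcc -maxrregcount=32 ...  caps registers per thread globally.
__global__ void __launch_bounds__(256) axpy(const float* x_in, float* y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = x_in[i];       // register-resident local copy
        float acc = a * x + y[i];
        y[i] = acc;
    }
}
```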
- Unified memory (UM):
- software functionality in CUDA
- lets the programmer see all available memory in the system as one large unified whole
- effectively gives you a reasonable default memory-management optimisation for free
- Code example: a dense layer (full sketch after this list)
- use unified memory: declare matrices A, B, C, D
- define the CUDA kernel matmul_kernel
- initialize the matrices and write values
- define how to split up the computation into blocks and threads
- launch the CUDA kernel
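A compact sketch of those steps, assuming a square N x N layer with weights A, input B, and bias C (all names and sizes are illustrative):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// D = A * B + C : one thread per output element (weights A, input B, bias C).
__global__ void dense_kernel(const float* A, const float* B, const float* C,
                             float* D, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];
        D[row * N + col] = acc + C[row * N + col];
    }
}

int main() {
    const int N = 1024;
    float *A, *B, *C, *D;
    size_t bytes = (size_t)N * N * sizeof(float);

    // Unified memory: one pointer usable from both CPU and GPU code.
    cudaMallocManaged(&A, bytes);
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);
    cudaMallocManaged(&D, bytes);

    // Initialize on the host; the runtime migrates pages to the GPU on demand.
    for (int i = 0; i < N * N; ++i) { A[i] = 1.0f; B[i] = 2.0f; C[i] = 0.5f; }

    // Split the computation into 16x16-thread blocks covering the NxN output.
    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    dense_kernel<<<grid, block>>>(A, B, C, D, N);
    cudaDeviceSynchronize();                 // wait for the kernel to finish

    printf("D[0] = %f\n", D[0]);             // expected 2*N + 0.5
    cudaFree(A); cudaFree(B); cudaFree(C); cudaFree(D);
    return 0;
}
```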
- AI-specific parts of CUDA
- overview of the microarchitecture
- Tensor core: specialized for Multiply Accumulate (MAC) operations
- closed source; how they are implemented internally is not public (the WMMA sketch below is the public programming interface)
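What is public is the WMMA API (`nvcuda::wmma` in `<mma.h>`) for driving Tensor Cores from CUDA C++. A minimal sketch of a single 16x16x16 tile multiply, assuming FP16 inputs with an FP32 accumulator (requires compute capability 7.0+):

```cpp
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp cooperatively multiplies one 16x16x16 tile on the Tensor Cores:
// C (fp32) = A (fp16) * B (fp16), i.e. a fused multiply-accumulate over the tile.
__global__ void wmma_tile(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);            // 16 = leading dimension
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);   // the MAC step runs on Tensor Cores
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```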
- cuDNN:
- cuDNN, a C++ library built on top of CUDA which provides highly optimised routines for frequently used operations in deep learning.
- the full CUDA library is too complex and low-level for most programmers to use directly
- The only case for direct CUDA usage is if you are trying to implement a custom layer or if you want to merge a few layers for computational efficiency.
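A minimal sketch of the cuDNN calling pattern (here a ReLU activation forward pass; error checking is omitted and the wrapper function `relu_forward` is hypothetical):

```cpp
#include <cudnn.h>

// Typical cuDNN workflow: create a handle, describe the tensors,
// then call an optimized routine (here: ReLU forward on an NCHW float tensor).
void relu_forward(const float* d_x, float* d_y, int n, int c, int h, int w) {
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    cudnnTensorDescriptor_t desc;
    cudnnCreateTensorDescriptor(&desc);
    cudnnSetTensor4dDescriptor(desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, c, h, w);

    cudnnActivationDescriptor_t act;
    cudnnCreateActivationDescriptor(&act);
    cudnnSetActivationDescriptor(act, CUDNN_ACTIVATION_RELU, CUDNN_NOT_PROPAGATE_NAN, 0.0);

    const float alpha = 1.0f, beta = 0.0f;
    cudnnActivationForward(handle, act, &alpha, desc, d_x, &beta, desc, d_y);

    cudnnDestroyActivationDescriptor(act);
    cudnnDestroyTensorDescriptor(desc);
    cudnnDestroy(handle);
}
```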
Survey of GPU and FPGA Platforms
- Typical spec numbers, using the GV100 (the core die inside the V100, not including the memory) as an example:
- SM Count:80
- Tensor Cores: 640 (8 per SM)
- L1 cache: 128 KB per SM
- Full-card specs (Tesla V100; "Tesla" is the scientific/data-centre compute card line):
- Clock speed: 1245 MHz
- Memory: 32 GB, bandwidth 1134 GB/s
- FP32 compute: 16.35 TFLOPS
- for the core die's name, a smaller number means a higher-end chip (e.g. an x100 die sits above an x102 or x104)
- Typical card spec comparison, GeForce series:
- PCIe is the ordinary host interface; a better form factor is SXM (used in DGX/HGX servers, required for NVLink)
- DGX vs. HGX: DGX is a complete machine with the CPUs already configured; with HGX you configure the rest of the system yourself
- NVIDIA architectures: roughly one new generation per year:
- Improvements in each generation:
- Volta (vs. Pascal): split FP32 and INT32 computation into separate CUDA cores, and introduced Tensor Cores
- Turing: cut the FP64 cores, added Ray Tracing cores, and Tensor Cores gained INT8 and INT4 support
- Ampere: upgraded the Tensor Cores, adding support for TF32 (NVIDIA's TensorFloat-32 format), BF16 (Google's bfloat16), and FP64 (only FP32 is not supported on the Tensor Cores; FP32 is handled by the CUDA cores), plus 2:4 structured sparsity
- Deep learning data formats:
- overview of supported data precisions (see the summary table after this list):
- Hopper: Tensor Core throughput improved a further 2x, with FP8 support added
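A small summary, combining the precisions mentioned above with their standard bit layouts (sign / exponent / mantissa):

| Format | Bits (sign / exp / mantissa) | Tensor Core support first added |
| --- | --- | --- |
| FP32 | 1 / 8 / 23 | handled by CUDA cores, not Tensor Cores |
| FP16 | 1 / 5 / 10 | Volta |
| INT8 / INT4 | 8-bit / 4-bit integer | Turing |
| TF32 | 1 / 8 / 10 | Ampere |
| BF16 | 1 / 8 / 7 | Ampere |
| FP64 | 1 / 11 / 52 | Ampere |
| FP8 (E4M3 / E5M2) | 1/4/3 and 1/5/2 | Hopper |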
- NVLink: direct GPU-to-GPU communication, with bandwidth up to 7x that of PCIe Gen5:
- NVSwitch: a switch that gives every connected GPU full NVLink bandwidth to every other GPU:
- NVSwitch System: networking multiple NVSwitches together
- NVIDIA Compute Capability
- SM_xx: the "compute capability" number, identifying the Streaming Multiprocessor version (the xx matches the compute capability, e.g. sm_70 for 7.0)
- the SM is a small system in its own right (similar to an entire DPU)
- both Tensor Cores and CUDA cores (FP32, INT32, FP64) are contained inside it
- it also contains caches and SFUs (special function units)
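A quick way to check a device's compute capability and SM count from code (device 0 assumed); kernels are then compiled for a matching SM version with nvcc's `-arch` flag:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Query the compute capability (prop.major.minor, e.g. 7.0 for Volta)
// and the number of SMs on device 0.
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Compute capability: %d.%d, SMs: %d\n",
           prop.major, prop.minor, prop.multiProcessorCount);
    return 0;
}
// Compile for a matching SM version, e.g.:  nvcc -arch=sm_70 query.cu
```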
- Multi-Instance GPU (MIG): single-card virtualization; one physical GPU is partitioned into multiple isolated GPU instances so that several workloads can share a card
HBM vs. DDR: HBM is roughly equivalent to several DRAM dies stacked on top of one another
- Xilinx FPGA:
- Product families: Artix, Kintex, Virtex, ALVEO (the most resources)
- ACAP: adds hardened vector units onto the FPGA (the counterpart of Tensor Cores):
- includes an AI Engine with its own independent 1 GHz clock
- the memory units on the device can be DDR (most parts) or HBM