Tutorial: Demystifying GPU Architectures for Deep Learning – Part 1
Tutorial: Demystifying GPU Architecture
- CUDA: Programming Model
- C++ interface; people usually say "program in CUDA" (bindings also exist for Python, MATLAB, and other languages)
- abstracts away most of the inner workings of GPUs
- CUDA Core:
- Hierarchy: Grid - Block - Thread, with individual threads executed on CUDA cores
- All threads within a block execute the same instructions and all of them run on the same SM
- CUDA Kernels: special functions that describe how each thread handles its data
- kernels are executed by "launching" them on the GPU
- example: matrix multiplication; on a CPU this is a nested for loop
- in CUDA you instead specify the computation that each CUDA thread should perform (sketch below)
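A minimal sketch of what such a kernel and its launch might look like (the name `matmul_naive`, sizes, and launch configuration are illustrative, not from the tutorial):

```cpp
// Illustrative sketch: each thread computes one element of C = A * B (square N x N matrices).
__global__ void matmul_naive(const float* A, const float* B, float* C, int N) {
    // Map the 2-D grid/block hierarchy onto matrix coordinates.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)          // the inner loop a CPU would run for every (row, col)
            acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}

// Launch: a grid of blocks, each block a 16x16 tile of threads.
// dim3 block(16, 16);
// dim3 grid((N + 15) / 16, (N + 15) / 16);
// matmul_naive<<<grid, block>>>(A, B, C, N);
```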
- Streaming Multiprocessor (SM, essentially a microarchitecture concept): a sophisticated processor within the GPU which contains the hardware and software for orchestrating the execution of hundreds of CUDA threads
- divides blocks of threads into ‘warps’ (groups of 32 threads)
- “Thread Manager”
- Memory hierarchy: per-thread local memory, per-block shared memory, and global memory (video RAM)
- shared memory (roughly the counterpart of a CPU cache)
- shared memory is located physically close to the CUDA cores and fetching data from shared memory is at least 10 times faster than from global memory
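A sketch of how shared memory is typically used, assuming square matrices whose size is a multiple of the tile width; the kernel name and tile size are illustrative:

```cpp
// Each block stages 16x16 tiles of A and B in fast on-chip shared memory,
// so each global-memory element is read once per tile instead of once per thread.
#define TILE 16
__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];   // visible to all threads in this block
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < N / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();               // wait until the whole tile is loaded
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               // wait before overwriting the tile
    }
    C[row * N + col] = acc;
}
```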
- Read-only cache
- read-only cache located physically close to the CUDA cores and shared within a warp (a group of threads, smaller than a block)
- data stored in this memory does not change during the course of kernel execution
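A small illustrative sketch of how read-only data is usually routed through this cache in CUDA C++, via `const __restrict__` pointers or the `__ldg()` intrinsic (the kernel itself is hypothetical):

```cpp
// Hinting that 'in' never changes during the kernel, so loads can go through
// the read-only data cache. __ldg() forces this explicitly; with
// 'const float* __restrict__' the compiler can often do it automatically.
__global__ void scale(const float* __restrict__ in, float* out, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = s * __ldg(&in[i]);
}
```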
- Registers: allow a thread to store local copies of variables that are visible only to that thread.
- each SM has a fixed, limited number of registers.
- register allocation is handled by the compiler (nvcc)
- the only thing a programmer can control is the number and size of local variables.
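An illustrative sketch (hypothetical kernel): local scalars normally live in registers, and `__launch_bounds__` or the nvcc flag `-maxrregcount=N` are the indirect knobs that bound per-thread register usage:

```cpp
// Local scalars like 'x' and 'acc' are normally register-resident and private to one thread.
// __launch_bounds__(256) tells the compiler the block size, letting it budget registers;
// compiling with  nvcc -maxrregcount=32 ...  caps registers per thread globally.
__global__ void __launch_bounds__(256) axpy(const float* x_in, float* y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = x_in[i];       // register-resident local copy
        float acc = a * x + y[i];
        y[i] = acc;
    }
}
```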
- Unified memory (UM):
- software functionality in CUDA
- lets the programmer see all available memory in the system as one large unified whole
- effectively gives you a reasonable default memory-management optimisation for free
- Code example: a dense layer (full sketch after this list)
- use unified memory: declare matrices A, B, C, D
- define the CUDA kernel matmul_kernel
- initialize the matrices and write values
- define how to split up the computation into blocks and threads
- launch the CUDA kernel
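A compact sketch of those steps, assuming a square N x N layer with weights A, input B, and bias C (all names and sizes are illustrative):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// D = A * B + C : one thread per output element (weights A, input B, bias C).
__global__ void dense_kernel(const float* A, const float* B, const float* C,
                             float* D, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];
        D[row * N + col] = acc + C[row * N + col];
    }
}

int main() {
    const int N = 1024;
    float *A, *B, *C, *D;
    size_t bytes = (size_t)N * N * sizeof(float);

    // Unified memory: one pointer usable from both CPU and GPU code.
    cudaMallocManaged(&A, bytes);
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);
    cudaMallocManaged(&D, bytes);

    // Initialize on the host; the runtime migrates pages to the GPU on demand.
    for (int i = 0; i < N * N; ++i) { A[i] = 1.0f; B[i] = 2.0f; C[i] = 0.5f; }

    // Split the computation into 16x16-thread blocks covering the NxN output.
    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    dense_kernel<<<grid, block>>>(A, B, C, D, N);
    cudaDeviceSynchronize();                 // wait for the kernel to finish

    printf("D[0] = %f\n", D[0]);             // expected 2*N + 0.5
    cudaFree(A); cudaFree(B); cudaFree(C); cudaFree(D);
    return 0;
}
```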
- AI-specific parts of CUDA
- overview of the microarchitecture
- Tensor core: specialized for Multiply Accumulate (MAC) operations
- closed source; how they are implemented internally is not public (the WMMA sketch below is the public programming interface)
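What is public is the WMMA API (`nvcuda::wmma` in `<mma.h>`) for driving Tensor Cores from CUDA C++. A minimal sketch of a single 16x16x16 tile multiply, assuming FP16 inputs with an FP32 accumulator (requires compute capability 7.0+):

```cpp
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp cooperatively multiplies one 16x16x16 tile on the Tensor Cores:
// C (fp32) = A (fp16) * B (fp16), i.e. a fused multiply-accumulate over the tile.
__global__ void wmma_tile(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);            // 16 = leading dimension
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);   // the MAC step runs on Tensor Cores
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```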
- cuDNN:
- cuDNN, a C++ library built on top of CUDA which provides highly optimised routines for frequently used operations in deep learning.
- the full CUDA library is too complex and low-level for most programmers to use directly
- The only case for direct CUDA usage is if you are trying to implement a custom layer or if you want to merge a few layers for computational efficiency.
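A minimal sketch of the cuDNN calling pattern (here a ReLU activation forward pass; error checking is omitted and the wrapper function `relu_forward` is hypothetical):

```cpp
#include <cudnn.h>

// Typical cuDNN workflow: create a handle, describe the tensors,
// then call an optimized routine (here: ReLU forward on an NCHW float tensor).
void relu_forward(const float* d_x, float* d_y, int n, int c, int h, int w) {
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    cudnnTensorDescriptor_t desc;
    cudnnCreateTensorDescriptor(&desc);
    cudnnSetTensor4dDescriptor(desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, c, h, w);

    cudnnActivationDescriptor_t act;
    cudnnCreateActivationDescriptor(&act);
    cudnnSetActivationDescriptor(act, CUDNN_ACTIVATION_RELU, CUDNN_NOT_PROPAGATE_NAN, 0.0);

    const float alpha = 1.0f, beta = 0.0f;
    cudnnActivationForward(handle, act, &alpha, desc, d_x, &beta, desc, d_y);

    cudnnDestroyActivationDescriptor(act);
    cudnnDestroyTensorDescriptor(desc);
    cudnnDestroy(handle);
}
```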
Survey of GPU and FPGA Platforms
- Typical spec numbers, using the GV100 (the core die inside the V100, not including the memory) as an example:
- SM Count:80
- Tensor Cores: 640 (8 per SM)
- L1 cache: 128 KB per SM
- Full-card specs (Tesla V100; "Tesla" is the scientific/data-centre compute card line):
- Clock speed: 1245 MHz
- Memory: 32 GB, bandwidth 1134 GB/s
- FP32 compute: 16.35 TFLOPS
- for the core die's name, a smaller number means a higher-end chip (e.g. an x100 die sits above an x102 or x104)
- Typical card spec comparison, GeForce series:
- PCIe is the ordinary host interface; a better form factor is SXM (used in DGX/HGX servers, required for NVLink)
- DGX vs. HGX: DGX is a complete machine with the CPUs already configured; with HGX you configure the rest of the system yourself
- NVIDIA architectures: roughly one new generation per year:
- Improvements in each generation:
- Volta (vs. Pascal): split FP32 and INT32 computation into separate CUDA cores, and introduced Tensor Cores
- Turing: cut the FP64 cores, added Ray Tracing cores, and Tensor Cores gained INT8 and INT4 support
- Ampere: upgraded the Tensor Cores, adding support for TF32 (NVIDIA's TensorFloat-32 format), BF16 (Google's bfloat16), and FP64 (only FP32 is not supported on the Tensor Cores; FP32 is handled by the CUDA cores), plus 2:4 structured sparsity
- Deep learning data formats:
- overview of supported data precisions (see the summary table after this list):
- Hopper: Tensor Core throughput improved a further 2x, with FP8 support added
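A small summary, combining the precisions mentioned above with their standard bit layouts (sign / exponent / mantissa):

| Format | Bits (sign / exp / mantissa) | Tensor Core support first added |
| --- | --- | --- |
| FP32 | 1 / 8 / 23 | handled by CUDA cores, not Tensor Cores |
| FP16 | 1 / 5 / 10 | Volta |
| INT8 / INT4 | 8-bit / 4-bit integer | Turing |
| TF32 | 1 / 8 / 10 | Ampere |
| BF16 | 1 / 8 / 7 | Ampere |
| FP64 | 1 / 11 / 52 | Ampere |
| FP8 (E4M3 / E5M2) | 1/4/3 and 1/5/2 | Hopper |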
- NVLink: direct GPU-to-GPU communication, with bandwidth up to 7x that of PCIe Gen5:
- NVSwitch: a switch that gives every connected GPU full NVLink bandwidth to every other GPU:
- NVSwitch System: networking multiple NVSwitches together
- NVIDIA Compute Capability
- SM_xx: the "compute capability" number, identifying the Streaming Multiprocessor version (the xx matches the compute capability, e.g. sm_70 for 7.0)
- the SM is a small system in its own right (similar to an entire DPU)
- both Tensor Cores and CUDA cores (FP32, INT32, FP64) are contained inside it
- it also contains caches and SFUs (special function units)
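A quick way to check a device's compute capability and SM count from code (device 0 assumed); kernels are then compiled for a matching SM version with nvcc's `-arch` flag:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Query the compute capability (prop.major.minor, e.g. 7.0 for Volta)
// and the number of SMs on device 0.
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Compute capability: %d.%d, SMs: %d\n",
           prop.major, prop.minor, prop.multiProcessorCount);
    return 0;
}
// Compile for a matching SM version, e.g.:  nvcc -arch=sm_70 query.cu
```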
- Multi-Instance GPU (MIG): single-card virtualization; one physical GPU is partitioned into multiple isolated GPU instances so that several workloads can share a card
HBM vs. DDR: HBM is roughly equivalent to several DRAM dies stacked on top of one another
- Xilinx FPGA:
- Product families: Artix, Kintex, Virtex, ALVEO (the most resources)
- ACAP: adds hardened vector units onto the FPGA (the counterpart of Tensor Cores):
- includes an AI Engine with its own independent 1 GHz clock
- the memory units on the device can be DDR (most parts) or HBM