GPU相关知识 & CUDA学习

Know your weapon.

Posted by tianchen on May 18, 2023

Tutorial of Demystifying GPU Architectures For Deep Learning – Part 1

Tutorial: Demystifying GPU Architecture

  • CUDA: Programming Model
    • C++ interface,一般大家说program in CUDA (也存在着python,matlab等其他语言的binding)
    • abstracts away most of the inner workings of GPUs
  • CUDA Core:
    • 层级结构:cuda core - Grid - Block - Thread
      • All threads within a block execute the same instructions and all of them run on the same SM
  • CUDA Kernels: special function用于描述thread如何handle data
    • ‘launching’ CUDA kernels.
    • example: Matrix Multiplication, on CPU, for loop;
      • pecify the computation that each CUDA thread should perform
  • Streaming Multiprocessor (SM,性质是一个micro architecture): a sophisticated processor within the GPU which contains hardware and software for orchestrating the execution of hundreds of CUDA threads
    • divides blocks of threads into ‘warps’
    • “Thread Manager”
  • 存储层次: thread的local memory,一个block共享的shared mamory,以及global memory (VideoRAM)
    • shared memory (CPU cache的对应)
      • shared memory is located physically close to the CUDA cores and fetching data from shared memory is at least 10 times faster than from global memory
    • Read-only cache
      • read-only cache located physically close to CUDA cores and is shared within a warp(group of threads, smaller than block)
      • data stored in this memory does not change during the course of kernel execution
    • Register: Registers allow threads to store local copies of variable which are visible to only that one thread.
      • each SM has a fixed, limited number of registers.
      • register allocation is handled by the compiler (nvcc)
      • the only thing a programmer can control is the number and size of local variables.
    • Unified memory (UM):
      • software functionality in CUDA
      • see all available memory in the system as one large unified whole.
      • 相当于帮你完成了默认的memory management优化
  • Code Example: a dense layer
    • use unified memory: declare matrice A,B,C,D
    • define cuda core matmul_kernel
    • initialize matrice and write value
    • define how to split up computation into blocks and threads
    • launch the cuad kernel
  • AI specific in CUDA
    • micro architecture的介绍
  • Tensor core: specialized for Multiply Accumulate (MAC) operations
    • 闭源,不知道里面如何实现
  • 详细的Hoppler架构解读

  • cuDNN:
    • cuDNN, a C++ library built on top of CUDA which provides highly optimised routines for frequently used operations in deep learning.
    • the full CUDA library is too complex and low-level for most programmers to use directly
    • The only case for direct CUDA usage is if you are trying to implement a custom layer or if you want to merge a few layers for computational efficiency.

GPU和FPGA平台调研

  • 典型数据指标:以GV100(V100内部的核心SoC,不包含显存)为例子:
    • SM Count:80
    • Tensor Cores:64
    • L1 Cache:128KB
    • 全卡(Tesla V100, tesla指的是科学计算卡的系列)的性质:
      • Clock Speed: 1245MB
      • Memory:32GB,带宽 1134GB/s
      • FP32算力:16.35 TFLOPs - 核心芯片的数字越小越好:
    • 典型卡指标对比(Geforce)系列:
        • PCIE接口是普通的机器接口,还有更好的接口叫做SXM(DGX/HGX云服务器专用,NVLink需要)
        • DGX v.s. HGX: DGX是整机,把CPU配好了;HGX可以自行配置
  • Nvidia架构:约每年更新一版:
    • 每代有改进:
      • Volta相比Pascal架构,分离FP32和INT32计算的CUDA Core,并且出现了Tensor Core
      • Turing架构,砍FP64的Core,增加了Ray Traing Core,TensorCore增加了INT8和INT4支持
      • Ampere架构,升级了tensor core,新增TF32(google的tensor格式),BF16,FP64的支持(只不支持FP32了,FP32用cuda core做),还增加了2:4结构化稀疏。
        • Deep Learning数据格式:
        • 数据精度支持一览表:
      • Hopper,Tensor Core进一步提升2x吞吐率,支持FP8格式
  • NVLink:GPU之间的直接通信,带宽达到PCIe Gen5的7倍:
    • NVSwitch:交换机,为每个链接的GPU提供等同于NVLink的带宽:
    • NVSwitch System:交换机组网
  • Nvidia Compute Copability “计算兼容性”
  • SM_xx: “计算能力”,代表了Stream Multiprocessor的版本(和计算兼容性的数字一直)
    • 小系统(类似整个DPU)
    • Tensor Core和CUDA Core (FP32,INT32,FP64) 都在这个里面包含
    • 还包含了Cache和SFU(special function)
    • Multi-instance GPU(MIG):单卡虚拟化,在多个GPU上实现一个任务
  • HBM v.s. DDR: 相当于多个DDR堆叠起来

  • Xilinx FPGA:
    • 系列:Artix,KinTex,Virtex,ALVEO:资源更多
    • ACAP:往FPGA上放向量单元的硬核(对标Tensor Core):
      • 有AI Engine,1G独立时钟
    • 上面的Memory单元:可以放DDR(大部分)也可以放HBM