ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation

1Tsinghua University, 2Infinigence AI, 3Microsoft, 4Shanghai Jiao Tong University

TL;DR

We introduce ViDiT-Q, a quantization method specialized for diffusion transformers. For popular large-scale video and image generation models (e.g., Open-Sora, Latte, PixArt-α, PixArt-Σ), ViDiT-Q achieves W8A8 quantization without metric degradation, and W4A8 without notable visual quality degradation.

Video comparison: FP16 vs. ViDiT-Q W8A8 vs. Baseline W8A8.


Abstract

Diffusion transformers (DiTs) have exhibited remarkable performance in visual generation tasks, such as generating realistic images or videos based on textual instructions. However, larger model sizes and multi-frame processing for video generation lead to increased computational and memory costs, posing challenges for practical deployment on edge devices. Post-Training Quantization (PTQ) is an effective method for reducing memory costs and computational complexity.

When quantizing diffusion transformers, we find that existing diffusion quantization methods designed for U-Nets struggle to preserve quality. After analyzing the major challenges of quantizing diffusion transformers, we design an improved quantization scheme, "ViDiT-Q" (Video and Image Diffusion Transformer Quantization), to address these issues.

Furthermore, we identify that highly sensitive layers and timesteps hinder quantization at lower bit-widths. To tackle this, we extend ViDiT-Q with a novel metric-decoupled mixed-precision quantization method (ViDiT-Q-MP).

We validate the effectiveness of ViDiT-Q across a variety of text-to-image and text-to-video models. While baseline quantization methods fail at W8A8 and produce unreadable content at W4A8, ViDiT-Q achieves lossless W8A8 quantization. ViDiT-Q-MP achieves W4A8 with negligible visual quality degradation, yielding 2.5x memory savings and a 1.5x latency speedup.

Motivation

Existing Diffusion Quantization Methods Face Challenges for DiTs

When applying existing diffusion quantization methods tailored for U-Net-based CNNs to DiTs, we observe notable quality degradation. "Unnatural content" appears under W8A8 quantization (e.g., unexpected fins on the turtle), and under W4A8 quantization the quantized model fails completely and produces blank images. To address these issues, we design an improved quantization scheme, "ViDiT-Q". To further reduce the bit-width to W6A6 and W4A8, we design a mixed-precision version, "ViDiT-Q-MP".

Comparison: FP16 vs. Baseline W8A8 ("unnatural contents") vs. Baseline W4A8 ("blank frames").
Experimental Results Image.

Identify the Key Issue for DiT Quantization: Dynamic Data Variance

We conduct preliminary experiments to investigate the reasons for quantization failure by visualizing the data distribution. We conclude that the key challenge unique to DiT quantization lies in data variance at multiple levels: existing quantization methods adopt fixed, coarse-grained quantization parameters, which struggle to handle such highly variant data. We summarize the data variance at the following four levels (see the code sketch after the list):

  • Token-wise Variance: In DiTs, the activation is composed of visual tokens (spatial and temporal tokens for video generation). Notable differences exist between tokens.
  • CFG-wise Variance: For conditional generation, classifier-free guidance (CFG) performs two separate forward passes, with and without the control signal. We observe notable differences between the conditional and unconditional parts.
  • Timestep-wise Variance: Diffusion models iterate over multiple denoising timesteps. We observe notable variance in activations across timesteps.
  • Channel-wise Variance: For both weights and activations, we observe significant differences across channels.
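
The multi-level variance can be inspected directly from the dynamic ranges of an activation tensor at different granularities. Below is a minimal sketch (not the paper's analysis code), assuming a linear layer's activation of shape (batch, tokens, channels); the synthetic outlier channels are purely illustrative.

import torch

def dynamic_ranges(act: torch.Tensor):
    """Report how much the min/max range varies at different granularities."""
    # Per-tensor range: what a single, coarse quantization scale has to cover.
    tensor_range = act.max() - act.min()
    # Per-token ranges: one range per visual token (token-wise variance).
    token_range = act.amax(dim=-1) - act.amin(dim=-1)            # (batch, tokens)
    # Per-channel ranges: one range per channel (channel-wise variance).
    channel_range = act.amax(dim=(0, 1)) - act.amin(dim=(0, 1))  # (channels,)
    return {
        "tensor": tensor_range.item(),
        "token_min/max": (token_range.min().item(), token_range.max().item()),
        "channel_min/max": (channel_range.min().item(), channel_range.max().item()),
    }

# Example: a synthetic activation with a few exaggerated channels,
# mimicking the channel-wise outliers described above.
act = torch.randn(2, 1024, 1152)
act[..., :4] *= 20.0
print(dynamic_ranges(act))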

Experimental Results Image.

Methodology

To address the data variance at the above four levels, we design an improved quantization scheme, ViDiT-Q, which provides corresponding solutions: channel balancing, token-wise quantization, and dynamic quantization. Furthermore, to preserve quality under lower bit-widths, we introduce a metric-decoupled mixed-precision method to develop ViDiT-Q-MP.
Experimental Results Image.

(1) Improved Quantization Scheme: ViDiT-Q

Token-wise Quantization

The primary distinction between DiTs and previous U-Net-based models is the feature extraction operator (convolution vs. linear). U-Net employs convolution layers that perform local aggregation of neighboring pixel features; these pixel features must therefore share the same quantization parameters (tensor-wise quantization parameters). For DiTs' linear layers, the computation for each visual token is independent, so token-wise quantization parameters are applicable.
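
As a concrete illustration, the sketch below implements token-wise symmetric quantization of such a linear layer's activation in PyTorch. It is a minimal illustrative version, not the released ViDiT-Q kernel, and the (batch, tokens, channels) shape is an assumption.

import torch

def quantize_per_token(act: torch.Tensor, n_bits: int = 8):
    """Symmetric uniform quantization with one scale per visual token.

    act: (batch, tokens, channels) activation of a DiT linear layer.
    """
    qmax = 2 ** (n_bits - 1) - 1
    # One scale per token: each token's channels are quantized independently.
    scale = act.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax  # (b, n, 1)
    q = torch.clamp(torch.round(act / scale), -qmax - 1, qmax)
    return q, scale  # dequantize with q * scale

act = torch.randn(2, 1024, 1152)
q, s = quantize_per_token(act)
err = (q * s - act).abs().mean().item()
print(f"mean abs quantization error: {err:.4f}")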

Dynamic Quantization

The "CFG-wise variance" and "Timestep-wise variance" are two unique challenge for diffusion models. Inspired by prior language model quantization methods, we propose to use the "Dynamic Quantization" that calculates the quantization parameters online for activation. It naturally solves the CFG-wise and timestep-wise variance.

Timestep-wise Channel Balancing

The channel-wise variance can be alleviated with channel balancing techniques (e.g., SmoothQuant), which introduce a channel-wise balancing mask "s" that divides the weight and multiplies the activation, migrating the difficulty of weight quantization to the activation. For DiT quantization, however, we find that this still yields unsatisfactory performance. We further discover that the activation's channel-wise distribution changes significantly across timesteps. We therefore make channel balancing "timestep-aware" and propose using different masks for different timesteps.
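
The sketch below illustrates the idea under stated assumptions: a SmoothQuant-style per-channel factor s, divided into the weight and multiplied into the activation so the linear layer's output is unchanged, is calibrated separately for each timestep. The alpha heuristic and the function names are illustrative, not the released implementation.

import torch

def calibrate_balance_masks(weight, calib_acts_per_t, alpha=0.5, eps=1e-8):
    """weight: (out, in); calib_acts_per_t: {timestep: (tokens, in) activation}."""
    w_max = weight.abs().amax(dim=0)                      # per input channel
    masks = {}
    for t, act in calib_acts_per_t.items():
        a_max = act.abs().amax(dim=0)                     # per input channel, at timestep t
        # SmoothQuant-style trade-off between weight and activation ranges.
        masks[t] = (w_max.clamp(min=eps) ** alpha) / (a_max.clamp(min=eps) ** (1 - alpha))
    return masks

def balanced_linear(x, weight, s_t):
    """Mathematically equivalent to x @ weight.T, but with balanced channel ranges."""
    return (x * s_t) @ (weight / s_t).T

# Tiny equivalence check (before any quantization is applied).
w = torch.randn(64, 32)
acts = {t: torch.randn(128, 32) * (1 + t) for t in range(3)}
masks = calibrate_balance_masks(w, acts)
x = acts[1]
print(torch.allclose(balanced_linear(x, w, masks[1]), x @ w.T, atol=1e-4))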

Experimental Results Image.

(2) Metric Decoupled Mixed Precision: ViDiT-Q-MP

Motivation for Mixed Precision

For lower bit-width quantization, we find that ViDiT-Q still exhibits degradation. We discover that quantization is "bottlenecked" by a few highly sensitive layers: leaving a single such layer unquantized improves the output from blank images to readable content.

Experimental Results Image.

Motivation for Metric Decouple

An intuitive solution for highly sensitive layers is to assign them a higher bit-width. However, we observe that a simple Mean-Squared-Error (MSE)-based quantization sensitivity measurement lacks precision. We introduce an improved, metric-decoupled sensitivity analysis method that considers quantization's influence on multiple aspects of generation quality.
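
As a hedged sketch of what such an analysis could look like (the exact metrics, thresholds, and bit choices used by ViDiT-Q-MP are not reproduced here), the code below quantizes one layer at a time and judges the degradation under several decoupled metrics rather than a single MSE score, assigning a higher bit-width to layers that are sensitive under any of them.

def assign_bit_widths(layers, generate, metrics, low_bits=4, high_bits=8, tol=0.05):
    """layers: names of quantizable layers; generate(quant_cfg) -> samples;
    metrics: {name: fn(samples) -> score}, higher is better."""
    ref = generate({name: None for name in layers})          # full-precision reference
    ref_scores = {m: fn(ref) for m, fn in metrics.items()}

    plan = {}
    for name in layers:
        # Quantize only this layer to the low bit-width, keep the rest in FP16.
        cfg = {n: (low_bits if n == name else None) for n in layers}
        scores = {m: fn(generate(cfg)) for m, fn in metrics.items()}
        # Relative degradation per metric, each judged on its own scale.
        drops = {m: (ref_scores[m] - scores[m]) / max(abs(ref_scores[m]), 1e-8)
                 for m in metrics}
        sensitive = any(d > tol for d in drops.values())
        plan[name] = high_bits if sensitive else low_bits
    return plan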

Experimental Results Image.

Experiments and Analysis

🎬 Text2Video - Comprehensive Benchmark: VBench

VBench is a comprehensive benchmark suite for video generation. We evaluate the quantized model's generation quality from various perspectives, as shown below.

Experimental Results Image.

ViDiT-Q W8A8 quantization achieves performance comparable to FP16. W4A8-MP quantization incurs only minor performance decay and outperforms the baseline W8A8 quantization.

Experimental Results Image.

We provide some qualitative video examples as follows:

Video comparison 1: FP16 vs. ViDiT-Q W8A8 vs. Baseline W8A8 ("ear suddenly appears").
Video comparison 2: FP16 vs. ViDiT-Q W8A8 vs. Baseline W8A8 ("jitter and color shift").
Video comparison 3: FP16 vs. ViDiT-Q W8A8 vs. Baseline W8A8 ("content changes").

🎬 Text2Video & Class Condition generation on UCF-101

We apply ViDiT-Q to the Open-Sora and Latte models on the UCF-101 dataset.

Experimental Results Image.

🎬 Text2Video - Comparison of Quantization Methods

We compare existing quantization techniques for the Open-Sora model on its example prompts.

Experimental Results Image.
Experimental Results Image.
W8A8: PTQ4DM vs. Q-Diffusion vs. ViDiT-Q.
W6A6: Naive PTQ vs. PTQ4DM vs. ViDiT-Q.
W4A8: PTQ4DM vs. SmoothQuant vs. ViDiT-Q.

🖼️ Text2Image Generation on COCO Annotations

We apply ViDiT-Q to the PixArt-α and PixArt-Σ models on the COCO annotations.

Experimental Results Image.
Experimental Results Image.

🧭 Efficiency Improvement

We evaluate the efficiency improvement of ViDiT-Q quantization with hardware profiling. Due to the lack of an existing open-source dynamic-quantization GPU kernel, we measure latency with standard INT GPU kernels and account for the overhead that dynamic quantization introduces. ViDiT-Q achieves a 2-3x memory compression rate and a 1.4x latency speedup with W8A8/W4A8.

Attention map visualization.

Ablation Studies

We conduct ablation studies on Open-Sora text-to-video generation with W4A8 quantization.

Experimental Results Image.
Ablation comparison: Naive PTQ vs. Dynamic Quant vs. Dynamic Quant + Channel Balance vs. Dynamic Quant + Timestep-aware Channel Balance vs. Dynamic Quant + Timestep-aware CB + Mixed Precision vs. FP16.

Ablation of Timestep-aware Channel Balancing

We provide more examples of comparison between naive and timestep-aware channel balancing.

Examples: naive vs. timestep-aware channel balancing (two comparisons).

BibTeX

@misc{zhao2024viditq,
      title={ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation}, 
      author={Tianchen Zhao and Tongcheng Fang and Enshu Liu and Wan Rui and Widyadewi Soedarmadji and Shiyao Li and Zinan Lin and Guohao Dai and Shengen Yan and Huazhong Yang and Xuefei Ning and Yu Wang},
      year={2024},
      eprint={2406.02540},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
    }