PAROAttention: Pattern-Aware ReOrdering for Efficient Sparse and Quantized Attention
in Visual Generation Models

1Tsinghua University, 2Bytedance
Corresponding author
TL;DR
  • We propose PAROAttention: a simple yet effective approach to improving the efficiency of attention in visual generative models. By taking the novel alternative route of "reorganizing" the attention pattern through Pattern-Aware token ReOrdering, it simultaneously improves both attention **sparsification and quantization**, achieving superior performance preservation and hardware efficiency. PAROAttention reaches 20% density (5x sparsity) and INT4 quantization of both the QK and PV computation on mainstream video (CogVideoX, Wan) and image (Flux) generation models, speeding up attention computation by 3-10x and delivering 2-4x end-to-end acceleration while maintaining generation quality.
Teaser.

The remaining content further elaborates the idea of PAROAttention as follows:

  • We analyze key challenges in attention sparsity and low-bit quantization for visual generation models, identifying their common issue as "diverse and scattered" attention patterns.
  • Based on the "locality" prior in visual feature extraction, we show that these patterns arise from the tokenization process, which disrupts 3D spatial adjacency. The diverse patterns actually describe local aggregations in 3D space, which can be unified into a "regular and block-based" attention pattern.
  • We design a simple "token reorder" method to convert diverse attention patterns into a unified, hardware-friendly block-wise pattern, together with tailored sparsification and quantization strategies that optimize performance and hardware efficiency. With an efficient CUDA kernel implementation, PAROAttention reaches 20% density with INT4 quantization on mainstream video (CogVideoX, Wan) and image (Flux) generation models, delivering both attention-level and end-to-end acceleration while maintaining generation quality.
Video and Image Demos

Generation Quality

FP16 Baseline
1x Speedup.
PAROAttn (30% Sparse + INT8)
5.72x Speedup.
SpargeAttn (50% Sparse)
1.67x Speedup.
SparseVideoGen (30% Sparse)
1.42x Speedup.
(* without timestep skipping in original implementation)
Teaser Image 0
FP16 Baseline
1x Speedup.
Teaser Image 1
PAROAttn (50% Sparse + INT4)
7.61x Speedup.
Teaser Image 2
DiTFastAttn (50% Sparse)
1.38x Speedup.
Teaser Image 3
SpargeAttn (50% Sparse)
1.61x Speedup.

Hardware Speedup

We provide an efficient CUDA implementation of PAROAttention, including kernel fusion and prefetching to minimize overhead. We compare against existing attention sparsification and quantization methods and demonstrate PAROAttention's superior performance in both hardware efficiency and generation quality.



Preliminary Analysis

Key Challenges for Attention Sparsification and Quantization

To find a unified solution for attention sparsification and quantization, we analyze the key challenges in optimizing attention efficiency through sparsity and low-bit quantization. We conclude that the key challenges arise from the diverse and scattered visual attention patterns (shown on the left side of the figure below).

  • Sparse attention: must both preserve algorithmic performance and deliver real hardware efficiency.
    • For algorithmic performance: it is essential not to discard large attention values during sparsification. Visual attention patterns contain diverse structures (diagonal, vertical, block-wise, etc.) that change dynamically across timesteps and control signals, making it difficult to design a mask that captures all important values.
    • For hardware efficiency: the sparse mask must be "structured" so that entire blocks of computation can be skipped, which is what yields actual hardware gains. Arbitrary, irregular sparsity requires additional indexing and leads to fragmented computational loads. In particular, since FlashAttention computes attention block by block, sparsification must align with those blocks. However, the patterns in visual attention maps often do not match FlashAttention's blocks (for diagonal patterns, each block contains only a few large values), so the scattered patterns make structured sparsity hard to achieve and limit the attainable speedup.
  • Low-bit quantization: the key issue is minimizing quantization loss.
    • Existing work (e.g., ViDiT-Q) has identified the primary source of low-bit quantization error as the data variation within quantization groups. For attention quantization, block-wise quantization groups must be chosen to match FlashAttention. However, the "diagonal" visual attention patterns place outliers on the diagonal elements within these block-wise groups, causing large intra-group variation. (A toy numerical illustration is given after the figure below.)
Figure: diverse and scattered visual attention patterns and the challenges they pose for sparsification and quantization.
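To make these two failure modes concrete, the toy example below (our own illustration, not the paper's code; all sizes are hypothetical) builds a synthetic multi-diagonal attention map, tiles it into FlashAttention-style blocks, and shows that keeping only the densest quarter of blocks discards most of the attention mass, while the occupied blocks exhibit a large max-to-mean ratio, i.e., strong intra-group variation.

```python
# Toy illustration of why multi-diagonal attention patterns are hostile to
# block-wise sparsification and block-wise quantization (synthetic data).
import torch

N, B = 128, 32                                   # tokens and block size (hypothetical)
attn = torch.full((N, N), 1e-6)
idx = torch.arange(N)
for off in (0, 32, 64):                          # strided diagonals from disrupted 3D adjacency
    attn[idx, (idx + off) % N] = 1.0             # wrap-around for simplicity
attn = attn / attn.sum(-1, keepdim=True)         # row-normalized, like post-softmax maps

blocks = attn.reshape(N // B, B, N // B, B).permute(0, 2, 1, 3)    # (4, 4, B, B) tiles
mass = blocks.sum((-1, -2))                                        # attention mass per block
incoherence = blocks.amax((-1, -2)) / blocks.abs().mean((-1, -2))  # max / mean per block

top = mass.flatten().topk(mass.numel() // 4)     # keep only the densest 25% of blocks
print("mass kept by the densest 25% of blocks:",
      round((top.values.sum() / mass.sum()).item(), 2))            # ~0.33: most mass is lost
print("max/mean incoherence inside those blocks:",
      round(incoherence.flatten()[top.indices].mean().item(), 1))  # ~32: strong outliers
```

In other words, no small set of blocks covers the diagonals, yet inside every occupied block only a thin line of entries is large: exactly the combination that hurts both structured sparsity and group-wise quantization.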

How to Understand the Diverse Patterns? The Perspective of "Locality"

To address these challenges, we seek to "reorganize" the attention maps so that they exhibit more uniform and concentrated patterns. Inspired by the locality prior of visual feature extraction (e.g., the design principles of CNNs and Swin Transformer, and Hubel & Wiesel's biological experiments), we further analyze the cause of the diversity of visual attention patterns and find that the diverse visual attention patterns all describe local aggregation in 3D space.

As shown in the figure below, during tokenization the original 3D space (F: frame count; H, W: height and width of each frame) is flattened into a 1D token sequence in the default [F, H, W] order. This flattening spaces out pixels that are adjacent in 3D space (except along the last dimension W, which remains contiguous in memory). Tokens separated by a fixed stride therefore describe neighboring pixels along a particular dimension of the 3D space. Consequently, multi-diagonal attention patterns actually describe "local aggregation along other dimensions," and they can be converted into block-wise patterns of local aggregation simply by reordering the token sequence, i.e., by making the locally aggregated dimension the contiguous one (e.g., [F, H, W] -> [F, W, H]).
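Conceptually, the reordering is just a fixed permutation of the token indices derived from the (F, H, W) grid. The sketch below is our own illustration (shapes and the helper name are assumptions, not taken from the released code); it permutes a token sequence from [F, H, W] order into [F, W, H] order and restores the original order afterwards.

```python
import torch

def reorder_tokens(x, F, H, W, perm=(0, 2, 1)):
    """x: (batch, F*H*W, dim) in [F, H, W] order -> permuted order plus the inverse index."""
    N = F * H * W
    order = torch.arange(N, device=x.device).reshape(F, H, W).permute(perm).reshape(-1)
    inverse = torch.empty_like(order)
    inverse[order] = torch.arange(N, device=x.device)   # inverse permutation for later
    return x[:, order], inverse

# usage: permute Q/K/V before attention, restore the output afterwards
F_, H_, W_ = 13, 30, 45                       # assumed CogVideoX-like latent grid
q = torch.randn(1, F_ * H_ * W_, 64)
q_perm, inv = reorder_tokens(q, F_, H_, W_)   # [F, W, H] order for a head that aggregates along H
# ... run (sparse, quantized) attention on the permuted sequence ...
assert torch.equal(q_perm[:, inv], q)         # the inverse index recovers the original order
```

Because the permutation is fixed per head, it can be folded into the write addresses of the preceding operator, which is the kernel-fusion trick described later in the CUDA System Design section.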

We further verify that each attention head consistently exhibits local aggregation along a particular dimension across different scenarios. As a result, we can design an appropriate token reordering scheme for each head to transform the diverse and scattered attention patterns into uniform, hardware-friendly block-wise patterns, significantly easing both the sparsification and the quantization of attention. This approach exploits locality on the algorithm side (better numerical locality of visual feature extraction) and aligns it with locality on the hardware side (better memory and computation locality), achieving both strong performance retention and hardware efficiency gains.

Figure: tokenization disrupts 3D spatial adjacency; token reordering recovers block-wise local aggregation.

Methodology

The workflow of our approach is shown in the figure below. For the main bottlenecks in attention computation, namely the two large-scale matrix multiplications (QK and PV), we apply both sparsification and quantization, significantly reducing their hardware overhead. The token reordering scheme for each attention head and the corresponding sparse mask are determined offline from a small amount of calibration data, introducing almost no additional overhead during inference. At inference time, we only need to skip the attention blocks indicated by the sparse mask and quantize the remaining blocks to low bit-width, block by block.

Figure: overview of the PAROAttention workflow.

Pattern-Aware Token ReOrder (PARO)

We observe that each attention head consistently exhibits local aggregation along a particular dimension across different scenarios. Therefore, we can select an appropriate token reordering offline for each head to transform its attention map into a block-wise pattern of local aggregation. We adopt a special case of reordering, dimension permutation, which already yields good results. For the [F, H, W] features of video generation models, we enumerate the six possible permutations for each attention head and select the optimal one offline to obtain the desired data distribution. Since sparsification and quantization favor different data distributions, we design separate selection criteria for each and combine them into the final criterion:

  • Sparsification perspective: To minimize the loss caused by structured sparsity, the goal is to ensure as many blocks as possible are fully sparse (Block Sparse).
  • Quantization perspective: To reduce quantization loss caused by large intra-block data distribution differences, the goal is to make the data distribution within the block as uniform as possible (Block Uniform).

As shown in the figure below, sparsification and quantization impose different distribution requirements on the attention map, so both must be combined to find a reordering that suits the two. After appropriate reordering, the attention map exhibits a block-wise, more concentrated distribution, suitable for both sparsification and quantization (a simplified sketch of the offline selection follows the figure).

Figure: reordering selection criteria for sparsification (block-sparse) and quantization (block-uniform).
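A minimal sketch of how this offline selection could be implemented, under our own assumed scoring functions and weighting (the paper's exact criteria may differ): each of the six dimension permutations is applied to a calibration attention map, every permutation is scored by how many blocks are nearly empty (block-sparse) and how uniform the occupied blocks are (block-uniform), and the best-scoring permutation is kept for that head.

```python
import itertools
import torch

def block_view(attn, B):
    """(N, N) attention map -> (nb, nb, B, B) tiles."""
    N = attn.shape[-1]
    return attn.reshape(N // B, B, N // B, B).permute(0, 2, 1, 3)

def permute_map(attn, shape, perm):
    """Apply the same token permutation to the rows and columns of an attention map."""
    order = torch.arange(attn.shape[-1]).reshape(shape).permute(perm).reshape(-1)
    return attn[order][:, order]

def score(attn, B=64, eps=1e-3):
    blocks = block_view(attn, B)
    mass = blocks.sum((-1, -2))
    block_sparse = (mass < eps).float().mean()              # more skippable blocks -> better sparsity
    occupied = blocks[mass >= eps]
    block_uniform = -(occupied.amax((-1, -2)) /
                      (occupied.abs().mean((-1, -2)) + 1e-8)).mean()  # lower incoherence -> better quantization
    return block_sparse + 0.1 * block_uniform               # the weighting here is a hypothetical choice

def select_permutation(calib_attn, shape, B=64):
    """Pick the best of the six [F, H, W] dimension permutations for one head, offline."""
    return max(itertools.permutations(range(3)),
               key=lambda p: score(permute_map(calib_attn, shape, p), B))

# toy usage with an assumed small latent grid (F, H, W) = (4, 16, 16), i.e. 1024 tokens
attn = torch.rand(1024, 1024).softmax(-1)
print(select_permutation(attn, shape=(4, 16, 16), B=64))
```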

Attention Sparsification

Existing sparse attention schemes fall into two categories: (1) dynamic schemes (e.g., SpargeAttention), which generate sparse masks online from the attention values; and (2) static schemes (e.g., DiTFastAttn), which generate sparse masks offline. Although the token reordering scheme (PARO) designed in this paper benefits both, we analyze the advantages and disadvantages of each and ultimately choose static sparsification as the main approach for PAROAttention. The detailed analysis is as follows:

  • Dynamic Sparsification (Dynamic Approach):
    • For performance preservation: although dynamic schemes naturally adapt to dynamically changing patterns, they must generate sparse masks online from the pre-softmax attention values. These values are relatively uniform and do not exhibit clear patterns, making accurate pattern identification difficult.
    • For hardware efficiency: dynamic sparsification introduces the extra overhead of computing the sparse mask online, and this overhead trades off against the accuracy of the mask prediction: an accurate mask requires a relatively large amount of additional computation, and the prediction step generally requires sophisticated CUDA kernels to deliver real efficiency gains.

    In summary, at low density (aggressive sparsity), dynamic sparsification faces a bottleneck in both accuracy and efficiency. We therefore adopt a static sparsification scheme in this paper.

  • Static Sparsification (Static Approach):
    • For performance preservation: static sparse masks are generally hard to adapt to diverse and dynamically changing attention patterns, so static schemes typically suffer larger performance losses than dynamic ones. However, PAROAttention's attention-map reorganization transforms the diverse, dynamically changing patterns into regular and consistent ones, resolving this key challenge of static sparsification. By further exploiting the more distinct post-softmax attention maps, PAROAttention achieves better performance retention than the dynamic approach.
    • For hardware efficiency: although the online cost of generating sparse masks is avoided, offline sparse masks introduce additional memory overhead. We address this with system-level optimizations (see the "CUDA System Design" section below).

    After reordering, the attention map exhibits a uniform and concentrated block-wise pattern. We can therefore compute the sum of the attention values within each block offline and apply a threshold to decide whether a block should be skipped (the threshold controls the density). The sparse mask is obtained entirely offline, introducing no additional overhead during inference. As shown in the figure below, thanks to the pre-unified attention patterns, PAROAttention avoids the complex and restrictive mask designs of existing static schemes and yields sparse masks that fit the original attention map well (a minimal sketch of this offline mask generation follows the figure).

Figure: sparse masks of PAROAttention compared with existing static sparsification schemes.
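Below is a minimal sketch of this offline mask generation (our own formulation; the function name and the density-targeting rule are assumptions). Per-block sums of post-softmax calibration attention maps are thresholded so that only the densest fraction of blocks, matching the target density, is kept.

```python
import torch

def build_block_mask(calib_attn, block=64, density=0.2):
    """calib_attn: (heads, N, N) post-softmax maps averaged over calibration prompts.
    Returns a boolean (heads, nb, nb) mask: True = compute this block, False = skip it."""
    H, N, _ = calib_attn.shape
    nb = N // block
    mass = calib_attn.reshape(H, nb, block, nb, block).sum((2, 4))  # per-block attention mass
    k = max(1, int(density * nb * nb))                              # blocks kept per head
    thresh = mass.flatten(1).topk(k, dim=1).values[:, -1]           # per-head threshold
    return mass >= thresh[:, None, None]

# the mask is computed once offline; at inference the kernel simply skips every
# (query-block, key-block) pair whose entry is False, with no online prediction.
```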

Quantization Design

For low-bit quantization, the key factor determining quantization loss is the data variation within each quantization group. The literature typically measures this with incoherence, defined as the ratio of the maximum value to the mean absolute value within the group (x.max() / x.abs().mean()). After appropriate token reordering, the large variation within attention-map blocks is substantially alleviated, enabling quantization to lower bit-widths (a small numerical example follows the figure below).

Figure: incoherence of attention-map blocks before and after token reordering.
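The snippet below (our own illustration, not the paper's kernels) computes this incoherence metric and applies a per-group symmetric INT4 quantizer to two toy 64x64 blocks with the same mean value: one dominated by a diagonal, as in the original token order, and one uniform, as after reordering. The diagonal block forces a large quantization step that wipes out every off-diagonal value, while the uniform block is quantized almost losslessly.

```python
import torch

def incoherence(x):
    """max / mean ratio inside a quantization group, as used in the text above."""
    return x.max() / x.abs().mean()

def quantize_int4(x):
    """Symmetric per-group INT4: one scale per group, integer levels in [-7, 7]."""
    scale = x.abs().max() / 7.0
    return torch.clamp(torch.round(x / scale), -7, 7) * scale   # dequantized values

diag_block = torch.full((64, 64), 1e-4)
diag_block.fill_diagonal_(0.5)                                  # diagonal outliers inside the block
unif_block = torch.full((64, 64), diag_block.mean().item())     # same mean, uniform distribution

for name, blk in [("diagonal block", diag_block), ("uniform block (after reorder)", unif_block)]:
    rel_err = ((quantize_int4(blk) - blk).abs() / blk.abs()).mean()
    print(f"{name}: incoherence={incoherence(blk).item():.1f}, "
          f"mean per-element INT4 error={rel_err.item():.2f}")
```

On this toy example the diagonal block loses essentially all of its off-diagonal information (mean per-element error close to 1), while the uniform block is represented exactly.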

CUDA System Design

Minimizing Additional Overhead: PAROAttention introduces two main sources of additional overhead, listed below. We apply targeted system-level optimizations to minimize both.

  • Online Token Reordering Overhead: although the reordering scheme is chosen offline, the permutation itself (a dimension permutation) must be applied online. To avoid a standalone data-movement pass through GPU global memory just for the permutation, we fuse the reordering into the preceding operator (kernel fusion) by only modifying its write address order. The resulting overhead is negligible.
  • Memory Overhead of Static Sparse Masks: since attention maps are large, the sparse masks determined offline can occupy gigabytes of GPU memory. To reduce this, we adopt a prefetching strategy on a separate CUDA stream: only the sparse mask of the current layer is resident on the GPU at any time, reducing the additional memory overhead to a few megabytes (see the sketch after this list).
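A simplified PyTorch-level sketch of the prefetching idea from the second bullet above (the actual implementation lives in the CUDA kernels; the class name and the pinned-memory layout are our assumptions): the per-layer masks stay in pinned host memory, and the mask for the next layer is copied to the GPU on a side stream while the current layer's attention runs.

```python
import torch

class MaskPrefetcher:
    """Keep offline block masks in pinned host memory and copy layer i+1's mask to the
    GPU on a side stream while layer i's attention runs on the default stream."""
    def __init__(self, masks_cpu):
        self.masks = [m.pin_memory() for m in masks_cpu]   # one (heads, nb, nb) bool mask per layer
        self.stream = torch.cuda.Stream()
        self.next_gpu = None

    def prefetch(self, layer):
        if layer < len(self.masks):
            with torch.cuda.stream(self.stream):
                self.next_gpu = self.masks[layer].cuda(non_blocking=True)

    def get(self, layer):
        torch.cuda.current_stream().wait_stream(self.stream)    # make sure the async copy finished
        mask = self.next_gpu
        mask.record_stream(torch.cuda.current_stream())         # guard allocator reuse across streams
        self.prefetch(layer + 1)                                 # start copying the next layer's mask
        return mask

# usage sketch: call prefetcher.prefetch(0) once before the first layer, then
# mask = prefetcher.get(i) inside layer i; only one layer's mask lives on the GPU.
```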

Compatibility: both the sparsification and the quantization in PAROAttention operate block by block, making them directly compatible with FlashAttention. Since reordering and sparse mask generation are completed offline, no finely tuned CUDA kernel optimizations are required: it suffices to support skipping entire block computations on top of FlashAttention, which makes the approach broadly applicable across scenarios.


Experiments and Analysis

Algorithm Performance Preservation

This paper evaluates multiple metrics on mainstream video (CogVideoX) and image (Flux) generation models, including:

  • Video Quality Metrics: CLIPSIM measures semantic consistency; VQA measures video quality; FlowScore measures temporal consistency.
  • Differences from Floating-Point Generation: FVD-FP16 measures differences in feature space, PSNR/CosSim measure differences in pixel space, and SSIM measures structural similarity.

The typical experimental conclusions are summarized as follows:

  1. Other baseline sparse schemes cause noticeable quality loss (content changes, blurring, etc.) even at a relatively high density (50%). In contrast, PAROAttention's sparsification produces results very close to the floating-point output even at the more aggressive 20% density, and its overall metrics surpass those of the baseline schemes running at 50% density.
  2. The token reordering scheme PARO is not limited to static sparsification: it can be directly combined with dynamic schemes such as SpargeAttention to improve generation quality. PARO combined with SpargeAttention at 30% density matches the generation quality of SpargeAttention alone at 50% density, improving the speedup from 1.67x to 2.22x.
  3. Compared to SageAttentionV2 (QK INT4, PV FP8), PAROAttention's quantization scheme can further quantize PV to INT4 without any loss in precision.
  4. The sparsification and quantization schemes of PAROAttention can be applied together. The most aggressive configuration (50% + INT4) reduces attention latency by nearly 10x relative to floating point, while maintaining algorithmic performance on par with baseline methods that achieve only around 2x latency reduction.

Experimental Results: CogVideoX

Quantitative Results

Qualitative Comparison: CogVideoX

Qualitative Comparison

Qualitative Comparison: Flux

Ablation Studies

Hardware Efficiency Speedup

We further analyze the hardware efficiency and the system-level optimization techniques, with the key experimental conclusions summarized as follows:

  1. PAROAttention's sparsification achieves both superior algorithmic performance retention and higher efficiency. Taking 50% density as an example, PAROAttention reaches a 1.73x attention speedup, surpassing SpargeAttention (1.67x) and SparseVideoGen (1.42x) under the same setting. This is because the static scheme introduces almost no additional overhead, whereas the baselines' online sparse mask generation/selection costs roughly an extra 6% to 10%, which becomes more noticeable at lower densities.
  2. The speedup ratio of PAROAttention is close to the theoretical limit (50% density, theoretical 2x, actual 1.73x), highlighting the hardware-friendly nature of the approach.
  3. The additional overhead of PAROAttention has been effectively reduced and is controlled within 1% of the total system cost.

Latency Speedup

Latency Speedup 1
Latency Speedup 2

Future Insights

In summary, this paper focuses on the "locality" of visual generation tasks. Through a simple yet effective token reordering operation, we leverage the locality of visual feature extraction (better numerical locality) and align it with the locality of hardware computation (better memory and computation locality), obtaining improvements in both algorithmic performance retention and hardware efficiency. PAROAttention is designed primarily to optimize inference efficiency, but the idea of using token reordering to better exploit the locality of feature extraction extends beyond inference optimization. The observation that different attention heads autonomously learn local aggregation along different dimensions may inspire improvements in training methods, image parameterization, the design of three-dimensional positional encodings, and, further, the construction of visual backbone models with well-chosen inductive biases.

BibTeX

@misc{zhao2024viditq,
      title={ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation}, 
      author={Tianchen Zhao and Tongcheng Fang and Enshu Liu and Wan Rui and Widyadewi Soedarmadji and Shiyao Li and Zinan Lin and Guohao Dai and Shengen Yan and Huazhong Yang and Xuefei Ning and Yu Wang},
      year={2024},
      eprint={2406.02540},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
    }