CodedVTR: Codebook-based Sparse Voxel Transformer with Geometric Guidance

1Tsinghua University, 2Peking University, 3Shanghai Qizhi Institute




Transformers have gained much attention by outperforming convolutional neural networks in many 2D vision tasks. However, they are known to have generalization problems and to rely on massive-scale pre-training and sophisticated training techniques. When applied to 3D tasks, the irregular data structure and limited data scale further complicate the transformer's application.

We propose CodedVTR (Codebook-based Voxel TRansformer), which improves data efficiency and generalization ability for 3D sparse voxel transformers. On the one hand, we propose the codebook-based attention that projects an attention space into its subspace represented by the combination of “prototypes” in a learnable codebook. It regularizes attention learning and improves generalization. On the other hand, we propose geometry-aware self-attention that utilizes geometric information (geometric pattern, density) to guide attention learning.

CodedVTR could be embedded into existing sparse convolution-based methods, and bring consistent performance improvements for indoor and outdoor 3D semantic segmentation tasks.


Transformer's Generalization Issue

Transformers are known to have generalization issues: they require large-scale pre-training and cannot be directly trained on smaller datasets without overfitting. As the ViT paper states, "When directly trained on the ImageNet, ViT yields modest accuracies of a few points below ResNets of comparable size". When introducing transformers into the 3D domain, the generalization issue is aggravated by 3D data's relatively restricted scale and unique properties (sparsity, varying geometric shapes).

Codebook-based Self-Attention

Instead of directly learning the mapping from the activation space to the attention weight space, we construct a learnable codebook and use a weighted sum of the codebook elements to approximate the attention weights. Note that both the codebook elements and the mixing weights are learnable. This projects the attention learning space onto a subspace represented by weighted sums of a few attention weight "prototypes" (the codebook elements). Restricting the attention learning space in this way acts as a regularization and improves generalization.
(Interestingly, codebook-based self-attention can be viewed as an intermediate state between transformer and convolution: when the codebook has only one element, it degenerates to convolution, and when the codebook has an infinite number of elements, it behaves as a standard transformer.)
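The mechanism above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's implementation: the names (`CodebookAttention`, `select`) and the dense `(N, M, C)` neighbor layout are assumptions, whereas the real model operates on sparse voxels. The key point it shows is that the network predicts only K mixing coefficients per voxel, and the attention weights over the M neighbors are a weighted sum of K learnable prototypes.

```python
import torch
import torch.nn as nn

class CodebookAttention(nn.Module):
    """Hypothetical sketch of codebook-based self-attention.

    Shapes (all assumed for illustration): N voxels, M neighbors per
    voxel, K codebook prototypes, C feature channels. Instead of freely
    learning N x M attention weights, the module predicts K mixing
    coefficients per voxel and combines K learnable "prototype"
    attention patterns of length M, restricting the attention space.
    """
    def __init__(self, channels: int, num_neighbors: int, num_prototypes: int):
        super().__init__()
        # learnable codebook: K prototype attention patterns over M neighbors
        self.codebook = nn.Parameter(torch.randn(num_prototypes, num_neighbors))
        # maps a voxel feature to K prototype-selection logits
        self.select = nn.Linear(channels, num_prototypes)

    def forward(self, feats: torch.Tensor, neighbor_feats: torch.Tensor) -> torch.Tensor:
        # feats: (N, C); neighbor_feats: (N, M, C)
        coeff = torch.softmax(self.select(feats), dim=-1)       # (N, K) mixing weights
        attn = torch.softmax(coeff @ self.codebook, dim=-1)     # (N, M) attention weights
        # weighted sum of neighbor features -> (N, C)
        return (attn.unsqueeze(-1) * neighbor_feats).sum(dim=1)
```

With `num_prototypes=1` the attention pattern is the same for every voxel, which is the convolution-like limit mentioned above; growing K recovers more of the free attention space.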

Codebook-based self-attention image.

Geometry-aware Self-Attention

Considering the unique properties of 3D data (distinct geometric shapes and varying densities), we carefully design distinctive geometric shapes and assign them to the attention spans of the codebook elements. We also use geometric information to directly guide attention learning, encouraging the attention to choose the codebook element whose sparse pattern is similar to that of the input activation.

Geometry attention image.

Experiments and Visualization


CodedVTR improves performance on both indoor and outdoor 3D semantic segmentation tasks. Note that our CodedVTR block is compatible with existing sparse convolution-based methods (e.g., SPVCNN).

Experimental Results Image.

Visualization of Attention Map

One of the well-known issues of transformer optimization is "attention collapse": the attention map tends to become uniform in the later layers. When we introduce codebook-based attention alone, the codebook elements also tend to become similar. After employing geometric guidance, the attention maps become distinctive.

Attention map visualization.

Visualization of Geometry Guidance

Voxels on walls and corners tend to choose "vertical-cross"-shaped codebook elements, while voxels on floors and desks tend to choose "plane"-shaped ones. Remote voxels with low density favor codebook elements with larger receptive fields.

Visualization of geometry.


@inproceedings{zhao2022codedvtr,
      title={CodedVTR: Codebook-based Sparse Voxel Transformer with Geometric Guidance},
      author={Tianchen Zhao and Niansong Zhang and Xuefei Ning and He Wang and Li Yi and Yu Wang},
      booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      year={2022}
}