3D related

View things from another dimension

Posted by tianchen on November 29, 2020



  • pcd = o3d.io.read_point_cloud("1.ply")
    • type Geometry.PointCloud
    • 一般point cloud直接np.floor(coords / voxel_size) - 就可以获得quantize之后的coords
      • 这样相当于直接把对应的点移动到了空间的mesh上(Sub-Nearest),还有多种quantize的方法
  • Mesh
    • with vertices`
    • ?: differenece between “mesh” “grid” & “voxel”


  • Couple of data formats:
    1. Lattice/Mesh: graphics当中常用的一种3d数据存储格式,实质上是一个graph;含了vertex和edge,两者上面都可以有feature
    2. Voxel: 空间中的小方格,但是其实可以内嵌点云或者是Feature
    3. Grid本身就是2D image的形式,每个pixel一个value
  • canonicalization:
    • pointnet中的T-Net(然而在后续被发现没有多少用)是做这个,目的是让feature/input空间变得规整,有点类似于BN
    • 类似于normalization,但是不会scaling,会有translation以及rotation


3-d model dataset

  • ModelNet40
    • Generated
    • (9843/2468)
    • Subset ModelNet10 - (3991/908)
    • 大概相当于CIFAR10
  • ShapeNet
    • CIFAR-100
    • 220k 3,135 classes
    • ShapeNetCore - 51,300 - 55 class
    • ShapeNetAug - extend from ShapeNetCore - 26k models - 573k labels - 24 classes
    • ShapeNetSeg - 12k - 270 class


3-d outdoor

ScanNet Dataset Preparation

  1. Download 通过官方发Email获得邮件下载script 下载了ply文件 python download-scannet.py -o ./ --type _vh_clean_2.ply(clean_2是downsample之后的ply),还有要下载(_vh_clean_2.labels.ply,_vh_clean_2.0.010000.segs.json) 文件夹结构如下:

  2. Preprocess


  • 修改SpatioTemporalSegmentation/lib/datasets/preprocessing/scannet.py 中的

  • ScanNet官方git-repo下载对应的filelist,放到对应目录(train/test目录也放一份)

  • 对该repository还需要修改读取文件后缀

  • 采用train命令测试,将scannet_process/train/作为datasetpath输入

  • tasks - Semantic Scene Understanding(SSU)
    • fundamental vision tasks: det, seg, pose estimation (High-level understanding, instance-level)
    • base tasks: object registeration(low-level, point-level)

  • forms of data & conversion
    1. RGB-D: color and depth are complementary modal, how to apply feature fuse is key *2.5-D solution: 2-D image of [R,G,B,D]
    • both RGB and depth to form colorized point cloud(6-channel point cloud)RGBXYZ

MinkowskiEngine Doc

Features & Terminology

  • Sparse Tensor(Point Cloud dara are sparse)

  • also support negative coordinates(?)
  • generalized convolution
    • since the in/out is sparse, we must know how the non-zero elment in input maps to the output, kernel_map
  • sparse tensor
    • use COOrdinate list format to represent a sparse tensor
    • 一个大的coord matrix C,以及feature vector F
  • tensor stride
  • coordinate manager
    • generate new set of output coordinates with different order(conv pooling)
    • SparseTensor中的coords_man
      • 当进行inplace运算的时候,参与计算的各个tensor需要共享coords_manager,可以直接丢coord_key就可以不用输入coords了
    • coordeinate key - the hash to index the unordered coordinate manager(share the same memory)
    • caches the kernel_map and cur coordinates
      • reuses the coordinate instead of recomputing the order for series of convs
  • kernel_map
    • [[I],[O]]
  • SparseTensor initial requires coord with additional batch indice
    • use ME.utils.sparse_collate / ME.utils.batched_cordinates
    • 上面的函数的作用就只是给一个batch内的所有的coords加上一个新的dim,BATCH,值为batch_idx
      • 从原本的[num_points, num_dim] -> [num_points, num_dim+1(batch_idx)]
    • 经过这样打包之后的coord和feature才能去initialize SparseTensor
coords0, feats0 = to_sparse_coo(data_batch_0)
coords1, feats1 = to_sparse_coo(data_batch_1)
coords, feats = ME.utils.sparse_collate(
    coords=[coords0, coords1], feats=[feats0, feats1])
  • discretize the continous coordinates
sinput = ME.SparseTensor(
    feats=torch.from_numpy(colors), # Convert to a tensor
	    coords=ME.utils.batched_coordinates([coordinates / voxel_size]),  # coordinates must be defined in a integer grid. If the scale
		    quantization_mode=ME.SparseTensorQuantizationMode.UNWEIGHTED_AVERAGE  # when used with continuous coordinates, average features in the same coordinate
logits = model(sinput).slice(sinput)
  • quantize the coords in dataset
    • use ME.utils.sparse_quantize, could use return_index=True
  • defining dataloader
    • define the colate_fn to convert the input to proper output
    • collate_fn = ME.utils.batch_sparse_collate / SparseCollation()
      • collation - 整理校对

in examples/training & examples/sparse_tensor_basics

  • data of getitem() of the dataset, before using the ME.utils.batch_sparse_collate
    • discrete-coords - [num_point, dim_coord]
    • out_feature - [num_point, dim_feature]
    • label - [num_point]
      • contains [N_CLASS] choices of elements
  • process of the ME.MinkowskiConvolution


Deep Learning for 3D Point Clouds: A Survey

  • 3 tasks: Shape Classification, Object Det & Tracking, Semantic Segmentation
  • Data source: LiDAR, RGBD-camera, 3D-Scanners
  • Data form: depth images, point cloud. meshes, volumeetric grids
    • point cloud: no discretization(Better representation)
  • Challenges: 1. small scale dataset; 2. high-dimension; 3. unstructured natureA
  • Available Datasets:
    • ModelNet
    • ScanObjectNN
    • ShapeNet
    • ParNet
    • S3DIS
    • ScanNet
    • Semantic3D
    • ApooloCar3D
    • KITTI
  • Evaluation Metrics:
    • classification: Overall Acc(Acc of all test instances), meanAcc*(mAcc - acc for all classes)
    • Det: AP(Average Precision): area-under precision-recall curve
    • Track: precision & success
    • Segmentation: meanIOU(meanIntersectionOverUnion), meanClassAccuracy
  • 3D Classification
    • Multi-view - convert unstructured point cloud to 2d images
      • project into multi 2d views, then fuse these features
      • Key: how to aggregate the multi-view feature
      • MVCNN: maxpool multiview feature, which causes information loss
      • MHBN: harmonized bilinear pooling
      • Relation network to discover relations between group of views
      • View-GCN: directed graph, graph nodes as multi-views
    • Volumetric - convert into 3d volumetric form
      • Key: 1st voxelize into 3d grids, then apply a 3DCNN
        • Preprocess + CNN: how to find good preprocess
      • Problems: unable to scale well to dense 3D-Data
        • OctTree is sometimes introduced
      • VoxNet: volumetric occupacy network
      • Deep Belief 3D ShapeNets:
      • OctNet: use hierarchical octree to gen a bit-string representation
      • PointGrid: integrate point and grid representatio
      • 3DmFV: 3D grids further processed with 3D modified Fisher Vector
    • Point-based - directly without voxelization / projection
      • no explicit loss, more popular
      • category: 1. point-wise MLP 2. CNN-based 3. graph-based 4. hier-data structure 5. others
      • MLP:
        • have permutation invariance with symmetric function
        • PointNet:
        • DeepSets: summing all representation than transformation
        • PointNet++: hierarchical network to capture geometric from neighbourhood - (sampling/grouping/PointNet-based)
        • MoNet: PointNet-like
        • PAT(Point Attention Transform): represent point with its abs position and relative position with neighbours, then group shuffle attention used for get relations, then gumbel-set-sample used for learn hier fetature.
        • PointWeb: improve feature by Adaptive feature adjustment
        • SRN(Structural Relation Net): learn structure feature
        • SRINet: project point cloud to find rotation invariant features, then use pointnet backbone, then graph-based aggregation
        • PointASNL: adaptive sampling - furthertest point sampling methods + local-nonlocal module
      • Conv-based:
        • Continuous Conv:
          • the weights for neighboring points are related to the spatial distribution with respect to the center point
            • on spherical harmonic
          • conv could be viewed as a wieghted sum over a given subset
          • RS-CNN
          • DensePoint
          • KPConv
          • ConvPoint
          • PointConv
          • MCCNN
        • Discrete Conv:
          • the weights for neighbouring points are related to offsets with respeqct to center poin
      • Graph-based - each point is a vertex, the directed edge represents the neighbourhood
        • On Space-division
          • conv is MLP over spatial neighbours, pooling is adopted produce coarse graph
        • On Spectral-division
          • conv as spetral filtering, applying mult on graph laplacian matrix eigenvectors
      • Hier Data Structure
        • apply on Hier-data structure(kd-tree & octTree)
        • feature aggregation from leaf to root node
    • Summary:
      • pointwise MLP often serve as a basic building-block
      • CNN & GNN are promising directions
        • how to handle irregular data structure is key
      • Efficiency is often a problem
  • 3D Detection: - output the 3D Bounding Boxes
    • Taxnomy:
        1. Region-Proposal based(2-Stage)
          • Mult-view based methods
          • Segmentation-based methods(Use segmentation to remove background points, then raise high-quality proposals)
          • Frustem-based: generate 2-d proposal, then transform to frustem(几何体) 3d ones
        1. Single-Shot
          • BEV-based (Bird-View)
          • Discreticized: for voxels
          • Point-based: diirectly on raw 3d pointcloud
    • Problems:
      • how to efficiently fuse multi-modality feature
      • extract robust representation
      • long-range detection poor
      • how to exploit texture information
    • Summary:
      • currentlly 2-stage outperform single by a large margin
  • 3D Tracking - give the location of the obkject at the 1st frame, estimate its state in subsequent frames
    • use the rich geometric information to overcome drawbacks of of image tracking like: occlusion, illumination, variation
    • 3D version of Siamese
  • 3D Scnen Flow Estimation
    • Given 2 point cloud, measure its movement, the 3d version of optical flow
    • point-based most rich information, however no explicit neighbourhood contains in the point representation

  • representations forms
    • Projection/Discretization based could leverage the 2-d network architecture, dealing with structured data form. hoewver, information loss
  • Current Problems & Future Directions
    1. imbalanced segmentation
    2. Dense point clouds
    3. spatial-temporal information for dynamic point clouds


  • tasks:
    • 3D Shape Classification
    • Segmantions (Part/Semantic)
  • Methods:
    • Volumetric-based - 体素化
      • 转变为voxel,会损失一定的信息
      • 细的时候计算复杂度很高
      • 体素化的example,找一个大1x1立方体,切成小块
    • Multiview - 3D to 2D
    • Point-based: per point [x,y,z+N(feature)]
  • PointNet(pointwise-MLP) - directly apply deep learning on points)
    • point properties: unordered & invariant to geometric transform * facing orderless: symmetric computing - 具有对称性的运算
    • T-Net: 学习出一个线性变换,为了摆正,目的是做对齐
      • T-Net + MLP
    • 没有点之间的联系
  • PointNet++
    • 在pointnet之前加上了一个sampling和grouping(非常关键) * sample - 最远点采样 * grouping - K-近邻 * 相当于降采样
    • seg任务中是一个插值
    • sample的可能改进方向
      • non-uniform sampling
      • 会有hier的分支
  • KPConv
    • 一个⚪的卷积核,里面固定的位置有一些点,实际卷的时候将对应的点放在圆心,按照距离吧其他的weight加权求和生成一个新的weight与x乘
  • PointCNN
    • Concat(MLP(position),color)x(MLP(x) as transform)
    • Point + Voxel
      • 2条支路,voxel本身也可以看成是一个sample,voxel先体素化然后卷积,然后再反体素化
    • Voxel-based有很大冗余
    • 本身没有降采样
  • Grid-CNN

Review: deep learning on 3D point clouds

  • mainly focus on directly taking in the point cloud
  • Challenges:
    1. Irregualr: dense and sparse regions
    2. Unordered: within the same set, no order
    3. Unstructured: No grid, each point independent
  • Structured Grid-based Learning
    • analogy from Conv on 2-d images
      1. Voxel-based: discretize into binary grids(has point or not)
      • high memory consumption due to sparsity 2. Multiview-based: squash the 3-d model into many 2-d grids(as images) 3. Lattices-based
  • Directly operate on pc
  • pointnet:
    • plain MLP, no local information
    • simple max-pooling
  • how2 get local information: sample/group & mapping - often approximated by a MLP
    • sample: random sample, K-farthest sampling(sample M times, each points is the farthest points of former), uniform sampling, gumbel sampling
    • group: K-nearest sampling group them into a local branch
    • mapping: mlp+max-pool to retain symmetric for orderless
  • donot extract local info
    • pointnet++: hier applu pointnet
    • VoxelNet: sample T points for each voxel, use their location mean(centroid) into FCN
    • Pointwise Convolution
  • explore local info
    • PointCNN: start from pointnet++, apply X-transform on K-nearest points, permute inputs
    • GeoCNN: weighted feature aggregation based on their distance from centroids
    • PAT: gumbel sampling for sample, pointnet for computing + multi-head attention
      • permutation invariant & robust ot outliers
  • Some dont follow MLP: thinks it neglects the prior of geometry of PC & large param size:
    • Step func + Taylor expansion
  • graph-based: input graph {V,E} V points, E as a {V,V} represent edges

YiLi: 3D Deep Geometrical Process

  • Sensing to perception:
    • segmentation -> assign attribute
  • 3-d datasets for objects
    • Rich attributes* - semantic category, pose, part, optical material
    • ShapeNet - CAD models - >3M models; 4k Object classes
    • ShapeNetCore - spatially aligned and with pose label
    • PartNet
  • 3-d dataste for scne undestanding
    • SceneNet
    • ScanNet
    • Waymo / Kitti / Apollo
  • 3-d dataset representations:
    • multiview-image
    • depth map
    • volumetric occupacy
    • point cloud (*)
    • polygon mesh (*)
  • GeoNet - CVPR2019
    • capture the surface topology representation - geodesic-aware
    • (task of point cloud upsampling?complete)
    • geodestic distance?
      • shortest path along the 2d surface
    • a very good example of how to acquire the low-level features of the point cloud(i didnt see in many pointCNN)
  • Learning on graph/mesh:
  • SyncSpecCNN:
    • Previous work: SpectralCNN - mult in the spectrum, apply IFFT to get the convolution in time division
      • fourier bases corressbond to local features
    • this paper: sync the basis
    • follow-up work: TextureNet
      • orientation - 方向?
      • 之前不带方向性(由于IFFT)在空域上一般是没有方向性
      • kernel-weights orientational symmetric
  • Tasks:
  • 3-d instance detection & segmentation:

  • GSPN:
    • RPN: region proposal, 3-d region proposal fails
      • 3-d box may not be suitable, free-form shape proposal
    • Objects of the same class has true physical scales, less afftected by lighting/acculusion
    • VAE - generate the object conditioned at the input scene
  • Differnet between indoor / outdoor
    • indoor: dense / outdoor: sparse Lidar
    • outdoor: has domain gap
      • view it as a geometrical completion task
  • More finer tasks into parts:
  • Primitive Fitting:
    • 用一些基础件去贴合一个3d model
    • SPFM
      • make differentiable Pattern for base elements
  • Motion Segmentation - articulated - 链接式的结构(比如✂)
    • how points change from the 1st state to 2nd state
    • category / domain invariance?
    • hard to annotate
    • discover correspondences between
    • (Pose Estimation?)
  • Future Dicrections:
    • 2-d surface manifold in 3d (Non-encludian)
    • various representations - explicit representation(?)
      • potentially combine them?
    • Multi-modal - in autonomous driving
    • 3-d generative model
    • Embedded AI


  • PointNet++

  • MultiscaleGrouping -
    • dropout inputs points for each instance
    • generate differnet extend of sparsity for input
    • during training, using all points
  • Multi-resolution grouping:(computationaly efficient)
    • 在处理之前和之后的embedding上分别做grouping,并且将两个输出concat起来

  • KPConv

  • Point cloud - Sparse & Unordered
    • similar but also different from grid: (same) - spatially localized
  • Methods towards the point cloud:
    1. Grid-based project sparse 3D data on regular shape(voxelize?), so conv could be defined more easily
      • voxelization is projection in eculidean space
    2. Directly Apply MLP on the points
      • original pointnet dont model the local representation
      • other hierachical methods use MLP as conv, author argue hard to converge
    3. Graph Conv
      • conv on graphh is equivalent to mul on its spectral representation
      • represent edge connections instead of edge relative positions
      • combines feature on local patches, however, invariant to deformations(变形-bad)
    4. Other Point Conv
      • PointwiseCNN: kernel weights at voxel bin(more like grid-based methdos)
      • SpiderCNN: family of polynominal function applied weighting for each neighbour, spatially inconsistent
      • FlexCNN: linear functions
      • PCNN: also use point to carry kernel weights
  • Also Propose the deformabel version of this conv
    • rigid / deformable
    • with learnabel local shift
    • extra regularization is applied to avoid empty set
  • Favors radius neighbour instead of KNN during grouping
    • consistent spherical domain
  • reception field is a ball
    • inside K points cntains kernel weights
  • Aside from weight, a correlation is also applied:
    • use linear correlation max(0, 1-   y_i - x_k   /\sigma)A
    • didnt use gaussian correlation for simplicity
  • “fitting reg” - penalize the kernel point and the cloest neighbour
  • “reulsive reg” - penalize points have overlap region
  • Deformable

  • Point cloud registration task?
    • 点云配准,对齐两个点云空间的变换(感觉有点pose estimation的意味)

  • PointCNN

  • X-Conv
    • F^* = [MLP(p_{center} - p_i), feature]
    • X = MLP(p_center - p_i) of dim [k,K]
    • Y = X*F^*
    • nearby points projected/aggregated into representative point P(sampled)
  • Architectural resemble the grdid-based CNN
    • KxK local patched -> K neighbouring points around representative points
    • X-conv instead of conv


  • Voxel-based methods - regular and good memory locality / need high reso to not lose information -> big memory consumption
  • Point-based methods - irregular memory access / huge dynamic overhead
  • Combine both - PVConv(point-voxel conv)
    • disentangle fine-grained feature transformation / coarse-grained neighbourhood aggergation(low reso voxel grids)
  • Method:
    1. normalize the coords 1st before voxelization
    2. Voxelization: average all the features falls into the grid box
    3. Feature aggregation: 3-d volumetric conv
    4. Devoxelization: (plain) - nearest-neighbour inerpolation, will result in all points in 1 voxel share the same value
      • so tri-linear interpolation
    5. Also leverage a MLP to extract point-based feature

  • PointContrast

  • 首先有一个case study,参照别人用fully-supervised(用全label是为了表示transfer的upper bound) ShapeNet(单个类别的生成的3d模型)做预训练,然后迁移到S3DIS segmentation

  • Dataset - downstream tasks

  • FCGF-based

FullyConvNetwork Arch(U-Net full-resolution output,同时输入为整个pc,不需要crop成多个块,保留了信息) 学习方式是Point-level的Metric Learning task是domain-specific的pointcloud registration task

rigid transform - rotation, translation, scaling

  • Loss


P是所有positive match pairs,只用到了positive的pair,而没有用到非positive的pair;用的dictionary learning的思想,key对应上的是positive,对应不上的是neg。外面套了一个softmax,会更加stable

  • PointContrast完整的说明了Unsupervised Pre-training的作用,

此外,对ScanNet本身也有acc的增益。即使不trasnfer Unsupervised Pretraining也有作用(ScanNetv2 to 本身4),以及transfer的Unsupervised对比supervised不差太多(ScanNetV2->S3DIS)

  • Architecture

Sparse Residue U-Net(SR U-Net) - Res34

?: pointcontrast的preprocess,如何获得point pair 对某个scene x,从两个view获取x1,x2,subsample every 25 frames, 将两个frame align到同一个世界坐标(collect 2 point clouds in a pair of at least 30% overlap?)

  • Exp


Sun-RGBD Detection task - modify archietcture to VoteNet

  • Ideas:

use holistic scene is not working, 不再从两个view分别做transform,而直接用reconstructed point cloud去apply两个transform,会明显掉点。

  • FCGF-Fully Convolutional Geometric Features

  • 🔑 Key:
    • uses metric learning loss + fullyConv Backbone
  • 🎓 Source:
  • 🌱 Motivation:
    • Towards the low-level geometric feature extraction problem - task as registration/reconstruction & tracking
    • earlier works use limited reception field - 3DCNN
  • 💊 Methodology:
    • Metric Learning:
      • the hardest contrastive - hardest triplets
      • 2 Constraints: similar features should be close and dissimilar ones should be a margin away
      • use hard negative mining
  • 📐 Exps:
    • Datesets: 3D match on KITTI, find GPS gt is noisy, use ICP to refine the alignment
    • Generate point pairs, then sample the positive/negative pair
  • 💡 Ideas:

  1. Prpose efficient 3d block: SPVConv(Sparse-Point-Voxel Convolution)
  2. Design automl flow(search space)
  • 🎓 Source: MIT HAN

  • 🌱 Motivation:

  • 💊 Methodology:

  1. SPVConv

Revisit Point-Voxel Conv(Proposed in the PVCNN): Point-Voxel Convolution:(Coarse Voxelization) - originally proposed to reduce memory consumption(reduce the irregular memory access / improve locality) However, some small instance occupy very few voxels, cant be learnt well(could use fewer piece sliding window but huge cost)

Revisit Sparse Conv(Minkowski Net): its core idea: skip the non-activated region(1st finding the active map between input/output points) However, have to agressively downsample to gain good reception field, then the reso will be too coarse

These methods sacrifeices the resolution to gain efficiency

use point-voxel conv:

keep the sparse voxelized tensor(only store non-zero voxels), and the full point cloud, transform between them with voxelization/devoxelization Voxelization: 3d coordinates - floor(x/v) (v is the voxel size); the feature are the means of all points fallen into em feature aggregation with sparse-voxel-conv with residue SparseConvolution, then transform(devoxelize) back to point representation: interpolate em with 8 neighbour grids

  1. NAS Design Space: fine-grained channel num & Elastic network depth

not coarsed grained channel size(expansion ratio in ResNet) O(n) channel space: crop the first c channels sampled since the 3d model is more memory-bounded, search for depth, sampled depth N, only n layers(the deeper layers may be poorly trained, applied progressive shrinking to avoid) small kernel matters, but not supported yet, keep all to be 3.

Evolutionary searching.

  • 📐 Exps:

task: Semantic KIITI Seg, then transfer to Det

  • 💡 Ideas:
  1. since point cloud has sparse nature, using dense volumetric methods are inefficient(some efforts already made: octree to reduce memory footprint, MinkowskiNet proposes Sparse Convolution)
  2. 3d medical data understaning resembles more with 2-d image, but not like 3d scene unserstanding on point cloud

mainly focus on using temporal information 3d + video however true contribution - Minkowski Engine & MinkowskiNet - Sparse Voxel Conv

  • 🎓 Source: Christopher Choy - StanfordVL

  • 🌱 Motivation:
  • 💊 Methodology:

  • Generalized SparseConv

  • 📐 Exps:

Seg on ScanNet & S3DIS

improvement much

  • 💡 Ideas:

  • 🔑 Key:
  • 🎓 Source:
  • 🌱 Motivation:
  • 💊 Methodology:
  • 📐 Exps:
  • 💡 Ideas: