All about that binary

Binarize em.

Posted by tianchen on May 8, 2020

Papers

  • BinaryConnect: Training Deep Neural Networks with binary weights during propagations
    • 2015
    • Matthieu Courbariaux, Yoshua Bengio, Jean-Pierre David
  • Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1
    • 2016
    • Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, Yoshua Bengio
  • XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks
    • 2016
    • Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi
  • Convolutional Networks for Fast, Energy-Efficient Neuromorphic Computing
    • 2016
    • Steven K. Esser, Paul A. Merolla, John V. Arthur, Andrew S. Cassidy, Rathinakumar Appuswamy, Alexander Andreopoulos, David J. Berg, Jeffrey L. McKinstry, Timothy Melano, Davis R. Barch, Carmelo di Nolfo, Pallab Datta, Arnon Amir, Brian Taba, Myron D. Flickner, Dharmendra S. Modha
  • Ternary Weight Networks
    • 2016
    • Fengfu Li, Bo Zhang, Bin Liu
  • DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients
    • 2016
    • Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, Yuheng Zou
  • Flexible Network Binarization with Layer-wise Priority
    • 2017
    • Lixue Zhuang, Yi Xu, Bingbing Ni, Hongteng Xu
  • ReBNet: Residual Binarized Neural Network
    • 2017
    • Mohammad Ghasemzadeh, Mohammad Samragh, Farinaz Koushanfar
  • Towards Accurate Binary Convolutional Neural Network
    • 2017
    • Xiaofan Lin, Cong Zhao, Wei Pan
  • Training a Binary Weight Object Detector by Knowledge Transfer for Autonomous Driving
    • 2018
    • Jiaolong Xu, Peng Wang, Heng Yang, Antonio M. López
  • Self-Binarizing Networks
    • 2019
    • Fayez Lahoud, Radhakrishna Achanta, Pablo Márquez-Neila, Sabine Süsstrunk
  • Latent Weights Do Not Exist: Rethinking Binarized Neural Network Optimization
    • 2019
    • Koen Helwegen, James Widdicombe, Lukas Geiger, Zechun Liu, Kwang-Ting Cheng, Roeland Nusselder
  • Least squares binary quantization of neural networks
    • 2020
    • Hadi Pouransari, Oncel Tuzel
  • Widening and Squeezing: Towards Accurate and Efficient QNNs
    • 2020
    • Chuanjian Liu, Kai Han, Yunhe Wang, Hanting Chen, Chunjing Xu, Qi Tian
  • Bi-Real Net: Enhancing the Performance of 1-bit CNNs With Improved Representational Capability and Advanced Training Algorithm
    • 2018
    • Adds an extra shortcut
      • the real-valued output of the conv or BN, before binarization, is connected to the consecutive block
    • replaces STE with a new tight approximation to correct the gradient (see the sketch after this entry)
    • magnitude-aware gradient
    • starts from a full-precision pretrained model, with clip replacing ReLU as the activation
    • points out two problems in training BNNs
      • non-differentiability of sign
      • gradients are relatively too small to flip a weight's sign
        • keeping a copy of real-valued (latent) weights addresses this
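
A minimal PyTorch sketch of the "tight approximation" idea as I read it: the forward pass is still sign, but the backward pass uses the derivative of a piecewise polynomial instead of the plain STE. This is an illustration, not the authors' code; the exact polynomial and the magnitude-aware scaling are in the paper.

```python
import torch

class ApproxSign(torch.autograd.Function):
    """Sign in the forward pass; the backward pass uses the derivative of a
    piecewise-polynomial approximation of sign (2 - 2|x| inside [-1, 1],
    zero outside), in the spirit of Bi-Real Net's tight approximation."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        surrogate = (2 - 2 * x.abs()).clamp(min=0)  # triangle-shaped surrogate gradient
        return grad_output * surrogate

# usage: a_bin = ApproxSign.apply(a_real)
```
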
  • BinaryDenseNet: Developing an Architecture for Binary Neural Networks
    • ResNetE (ResNet with extra shortcuts) - used in Bi-Real Net - only 1 conv per shortcut, but 2 shortcuts in 1 ResNet block
      • also uses a full-precision downsample
    • downsample conv
      • 2x smaller feature map & double the channel count
      • for CIFAR, 0.4-1% acc gain but double the model size; for ImageNet, 2.2% acc gain for 3.5-4 MB extra
      • so the bigger the base model, the more a full-precision downsample pays off
    • scaling factors in the train-from-scratch scenario can lead to an accuracy drop
    • no bottleneck / more shortcuts
    • the choice of a full-precision vs. binary downsample op is linked to the reduction rate (full precision can afford a bigger reduction rate)
    • splitting blocks: halve the growth rate and double the number of blocks (depth)
      • may cost more memory during training, but is not a problem at inference
  • Training Competitive Binary Neural Networks from Scratch
    • 2018
    • Stresses that many current methods (not so current any more) rely on 2-stage training or a full-precision model
      • so they propose a relatively simple training method that works directly from scratch
    • also the first to propose the binary + DenseNet combination
    • discusses Bi-Real Net
      • a binarized ResNet with additional shortcuts
        • at the cost of one extra real-valued addition
        • shortcuts the activation before the Sign function to after the BN
      • a change of gradient computation
      • complex training strategy, finetuning from full-precision
    • slightly modifies STE (a plain-STE sketch follows this entry)
    • no weight decay
      • since gradient cancelling already prevents the weights from growing too large
    • how to choose the scaling factor
      • it has been shown that a filter-wise scaling factor on the weights in the forward pass is ineffective, while making the gradient magnitude-aware is useful
      • scaling factor for activations
      • argues that learning a useful scaling factor is hard because of BN
      • but notes it may be different when starting from a pretrained model
    • use as many shortcuts as possible
    • fewer shortcuts - huge information loss
      • whether to use the full-precision reduction depends on the reduction rate (full precision can afford a higher reduction rate, saving channels)
      • full-precision downsampling often yields a relatively minor acc gain at a big parameter cost
    • downsampling layer - smaller feature map but more channels - extremely sensitive
    • the preservation of information & information flow is the key to a good BNN
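
For reference, a minimal sketch of the plain STE baseline with "gradient cancelling" (gradients are zeroed once a latent weight leaves [-1, 1]), which is what makes weight decay unnecessary here; the paper's own STE modification differs in the details.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Plain straight-through estimator with gradient cancelling: sign forward,
    and gradients only pass where the latent real-valued weight is in [-1, 1],
    which already keeps the latent weights from growing arbitrarily large."""

    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.where(w >= 0, torch.ones_like(w), -torch.ones_like(w))

    @staticmethod
    def backward(ctx, grad_output):
        w, = ctx.saved_tensors
        return grad_output * (w.abs() <= 1).to(grad_output.dtype)
```
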
  • XNORNet++
    • BMVC
    • in the original version, the scaling factor for the activation feature map has to be recomputed for every input, which is computationally expensive
    • fuses the weight & activation scaling factors into one (see the sketch after this entry)
    • (feels quite trivial and empirical)
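
A sketch of the fused-scale idea as I understand it: instead of the analytic weight scale plus a per-input activation scale, a single learnable per-output-channel factor is trained with the rest of the network (the paper also considers other factorizations). The layer below is a hypothetical illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def binarize_ste(t):
    # sign in the forward pass, identity (straight-through) gradient via detach
    b = torch.where(t >= 0, torch.ones_like(t), -torch.ones_like(t))
    return t + (b - t).detach()

class BinaryConvLearnedScale(nn.Module):
    """Binary conv with one learnable scale per output channel standing in for
    both the analytic weight scale and the per-input activation scale."""

    def __init__(self, in_ch, out_ch, k=3, stride=1, padding=1):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        self.scale = nn.Parameter(torch.ones(1, out_ch, 1, 1))  # learned by backprop
        self.stride, self.padding = stride, padding

    def forward(self, x):
        y = F.conv2d(binarize_ste(x), binarize_ste(self.weight),
                     stride=self.stride, padding=self.padding)
        return y * self.scale
```
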
  • MeliusNet: Can Binary Neural Networks Achieve MobileNet-level Accuracy?
    • 2020
    • Joseph Bethge, Christian Bartz, Haojin Yang, Ying Chen, Christoph Meinel
    • essentially proposes a new block
      • to increase feature quality and capacity
    • related work: birealnet & bi-densenet
      • adapt architectures from full-precision networks to the binary setting
      • notes that they did not test whether their tricks help only binary nets or also fp32 ones
    • points out that existing BNN improvements include
      • increasing the number of channels
      • multiple binary bases
      • whereas this paper takes an architectural view, proposing alternating dense blocks to increase feature capacity (a sketch follows this entry)
    • argues the main losses in a BNN are
      • weight-level error (quantization error) - quality of the feature
      • reduced representational power at the activation level - fine-grained differences become hard to distinguish - capacity of the feature
    • tries to reduce the number of fp32 ops
      • dense block: compute 64 channels from the input features and concatenate them, increasing feature capacity
      • improvement block: process the whole input feature map to produce 64 channels, shortcut-added to the last 64 channels
      • 4 stages, with 1x1 convs as channel converters (kept full-precision) - groups can also be used to reduce their cost
      • reduces the cost of the (fp32) stem
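
A rough PyTorch sketch of how I read the two alternating blocks (the BinConv3x3 below is a hypothetical stand-in for whatever binary conv + BN the paper actually uses): the dense block concatenates 64 new channels to grow capacity, and the improvement block adds 64 channels computed from the whole input onto the last 64 channels to improve their quality.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def binarize_ste(t):
    b = torch.where(t >= 0, torch.ones_like(t), -torch.ones_like(t))
    return t + (b - t).detach()  # sign forward, straight-through backward

class BinConv3x3(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_ch)
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, 3, 3) * 0.01)

    def forward(self, x):
        x = self.bn(x)
        return F.conv2d(binarize_ste(x), binarize_ste(self.weight), padding=1)

class DenseBlock(nn.Module):
    """Capacity: concatenate 64 freshly computed channels to the input."""
    def __init__(self, in_ch, growth=64):
        super().__init__()
        self.conv = BinConv3x3(in_ch, growth)

    def forward(self, x):
        return torch.cat([x, self.conv(x)], dim=1)

class ImprovementBlock(nn.Module):
    """Quality: add 64 channels computed from the whole input onto the
    last 64 channels, leaving the remaining channels untouched."""
    def __init__(self, in_ch, improve=64):
        super().__init__()
        self.improve = improve
        self.conv = BinConv3x3(in_ch, improve)

    def forward(self, x):
        head, tail = x[:, :-self.improve], x[:, -self.improve:]
        return torch.cat([head, tail + self.conv(x)], dim=1)
```
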
  • IRNet-Forward and Backward Information Retention for Accurate Binary Neural Networks
    • IRNet(Information Retention Network)
    • two main techniques, targeting the forward and backward passes respectively
      • Libra Parameter Binarization - minimizes the quantization error and information loss (a sketch follows this entry)
        • balance & standardize the weights
      • Error Decay Estimator (EDE)
    • minimizing only the quantization error ||A - Q(A)|| does not always work
      • Objective Function: Min(Q-error) + Max(Binary Entropy)
        • Bernoulli Distribution
      • for more stable training, subtract the mean from w and normalize it
      • additionally introduces an optimal bit-shift scalar
        • << s / >> s denote a left / right shift
      • activations are binarized with the plain Sign
      • tackles the problem from an information view: balance first, then binarize, to retain enough information
        • Libra-PB attains the maximum information entropy under a Bernoulli distribution
    • EDE: the backward pass approximates sign with a progressively sharpened function, keeping gradient information early and approaching sign late
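
A minimal sketch of how I read Libra-PB (layer-wise here for brevity; the paper applies it per layer/filter and pairs it with EDE in the backward pass, which is replaced by a plain straight-through pass below):

```python
import torch

def libra_pb(w, eps=1e-8):
    """Balance (zero-mean) and standardize the latent weights, then binarize
    with a power-of-two scale (the "bit-shift" scalar), keeping the scaling
    shift-friendly. Plain straight-through backward for brevity; the paper's
    backward uses the Error Decay Estimator instead."""
    w_std = (w - w.mean()) / (w.std() + eps)                # balance & standardize
    s = torch.round(torch.log2(w_std.abs().mean() + eps))   # power-of-two scale (~log2 of mean magnitude)
    b = torch.where(w_std >= 0, torch.ones_like(w), -torch.ones_like(w))
    q = b * (2.0 ** s)
    return w_std + (q - w_std).detach()                     # straight-through to the latent weights
```
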
  • ReActNet: Towards Precise Binary Neural Network with Generalized Activation Functions
  • motivation: binary networks are sensitive to activation distribution variations - Distribution Matters
    • small variant in activation affect the semantic feature representation
  • basic block - MobileNetV1 with modification, double input feature map for reduction block to ensure the i/o channel of binary conv stay the same
  • replaces the depthwise/pointwise convs with vanilla 3x3 & 1x1 convs
  • unlike previous works that use a full-precision downsampling layer (since the i/o shapes differ) - uses avg_pool on the shortcut to match the spatial dims, and concatenates 2 blocks fed with the same input to handle the channel difference (double the weights, clone the activation)
  • only the 1st conv and the last fc are not binarized
  • activation functions with learnable thresholds (sketched after these notes)
    • original sign function -> 1 if x>\alpha else -1
    • coefficients are channel-wise
    • gamma shifts the input, beta is the slope on the negative side, zeta shifts the output
    • introduces only 4×channel_num extra params; gradients follow from the simple chain rule
  • Distribution Loss
    • the KL divergence between the real-valued output and the binary ones
    • kinda like KD?
    • unlike previous work, which enforces matching on intermediate feature activations
  • training trick
    • in step 1, the teacher is pre-trained
    • 2 steps: first train a net with binary activations but real-valued weights from scratch, then inherit the weights and finetune with both weights and activations binarized
    • all using the proposed distribution loss instead of cross-entropy
    • in code, cross-entropy with label-smoothing was defined but not used
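
A sketch of the generalized activations described above, with channel-wise learnable parameters (straight-through backward for RSign here for brevity; ReActNet itself uses a smoother surrogate):

```python
import torch
import torch.nn as nn

class RSign(nn.Module):
    """Sign with a channel-wise learnable threshold alpha."""
    def __init__(self, channels):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x):
        shifted = x - self.alpha                       # alpha learns where to cut
        b = torch.where(shifted >= 0, torch.ones_like(x), -torch.ones_like(x))
        return shifted + (b - shifted).detach()        # straight-through backward

class RPReLU(nn.Module):
    """PReLU with channel-wise learnable shifts: gamma shifts the input,
    beta is the slope on the negative side, zeta shifts the output."""
    def __init__(self, channels):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.full((1, channels, 1, 1), 0.25))
        self.zeta = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x):
        s = x - self.gamma
        return torch.where(s >= 0, s, self.beta * s) + self.zeta
```
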

Training Details

the commonly used BNN training flow

BATS

Ideas

  • Binarization is essentially quantization pushed to the extreme, so it faces problems similar to quantization, with a few differences
    • the overflow tradeoff disappears, since there is only one quantization interval
    • likewise there is no rounding operation
  • The remaining errors are: the forward quantization error relative to the full-precision counterpart - Forward Error
    • and the gradient problems that prevent convergence to a good solution - Gradient Error
  • The main approaches to these two problems:
  • Forward error - essentially tuning (globally) so that, within the limited representational capacity, the network approximates its full-precision counterpart as closely as possible
    • scaling factors (at finer granularity) - see the small demo after this list
    • adjustable/learnable nonlinear functions - PACT
    • gradual/iterative/recursive schemes (alternately optimizing the binarization function and the parameters) - somewhat like the idea of quantization-aware training
    • introducing some randomness (my hunch: rather hand-wavy, unclear how well it works) - in the spirit of stochastic rounding (like dithering color depth) - Half-wave Gaussian
    • increasing the model's representational capacity (bigger models, stacking)
      • linear combinations (multiple binary bases) - ABC-Net
      • KD
      • Wide
  • Backward error - essentially correcting/compensating the error introduced into the gradients
    • loss modification - the most naive version is regularization: add some oddly shaped term to pull the weights
      • a step up is the family of differentiable methods that route the loss into each component to assist optimization
      • more learnable parameters - LQ-Net
    • correcting STE
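
A tiny NumPy illustration of the forward-error point: a single optimal scaling factor (alpha = mean|w|, the closed-form L2 optimum used by XNOR-Net) never increases the quantization error relative to plain sign.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=10_000)        # stand-in for a real-valued weight tensor

plain  = np.sign(w)                # naive binarization
alpha  = np.abs(w).mean()          # L2-optimal scalar scale
scaled = alpha * np.sign(w)

print("error without scale:", np.linalg.norm(w - plain))
print("error with scale:   ", np.linalg.norm(w - scaled))
# finer granularity (per-filter / per-channel alphas) reduces the error further
```
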

As for which of these methods really works, honestly it would take a close comparison of each paper's implementation details plus experiments of my own; I have not done that yet, so the above is just speculation.


  • Right now, which structures actually suit binary networks still seems rather unclear
    • (sounds very straightforward, but how to go further? arguably DARTS-like methods are exactly trying to answer this)
  • one survey also mentions transferring binary nets

  • Could it be that different modules at different positions need different nonlinear functions? How could that be folded into the search (and derived)?

  • Binary Neural Networks: A Survey
    • Problems: severe information loss & discontinuity
    • Explainability perspective
      • some filters are redundant (play a similar role)
    • one trick to measure whether a layer matters: switch it from full precision to binary and watch the accuracy decay
      • the traditional understanding of layer sensitivity is to keep the first and last layers in full precision
    • beneficial for understanding the structure of NNs
    • ensembles can help train BNNs, but may over-fit
    • Taxonomy: naive binarization / optimized binarization
      • the latter adds extra treatment of the quantization error or the loss function
    • Preliminary
      • BP: STE with clamp(-1,1)
    • Category
      • several directions
        • Minimize Q Error
          • Introducing Scale Factor(XNOR)
          • Hash Map with Scaling Factor(BWHN)
          • Train with Quantized Gradient, improved XNOR(DoReFa)
          • More channels, improved XNOR (WRPN)
          • Group and gradually quantize
          • Recursive approximation (HORQ - higher-order residual quantization)
            • not one-step quantization; a linear combination over the recursive steps
          • A similar linear combination of binary bases - ABC-Net
          • 2 Step Quantization
            1. only quantize activation with learnable function
            2. then fix function and quantize weight (learn scale factor)
          • PACT
            • learnable upper bound
          • LQ-Net (Learned Quantization)
            • jointly training the NN and the quantizer
            • supports arbitrary bit-widths
          • XNORNet++
            • fuses the weight & activation scaling factors into one, learned via backprop
        • Improved Loss func
          • mainly focuses on accommodating binarization globally (whereas the Q-error line above is more local)
          • LAB - loss-aware binarization (quasi-Newton)
          • Problems
            • degeneration
            • saturation
            • gradient mismatch
          • INQ (Incremental Network Quantization)
            • gradually enlarges the quantized group of weights
          • Apprentice Network
          • DBNN - Distillation BinaryNN
        • Reduce Gradient Error
          • fixing STE
            • the most naive version stops updating a parameter once it leaves [-1, 1] (?)
            • the core idea is a soft, adjustable sign function (so STE is no longer needed) - see the sketch after this list
          • Bi-real: approximate sign function (a piecewise quadratic)
          • CBCB - Circulant BNN
          • BNN+
          • HWGQ - Half-Wave Gaussian Quantization
          • Differentiable Soft Quantization
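
A generic sketch of the "soft, adjustable sign" family (the individual papers differ in the exact shape and schedule): replace sign with a differentiable surrogate whose steepness is annealed upward, so the gradient mismatch shrinks as training proceeds.

```python
import torch

def soft_sign(x, k):
    # tanh(k*x) -> sign(x) as k grows; gradients stay informative for small k
    return torch.tanh(k * x)

def k_at_epoch(epoch, k0=1.0, growth=1.1):
    # toy exponential annealing schedule for the steepness
    return k0 * growth ** epoch
```
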

BNN+NAS

  • Searching for Accurate Binary Neural Architectures
  • Huawei Noah - ICCV19 W
  • builds on WRPN-style uniform channel expansion
  • only searches for the width (channels), achieving higher acc with fewer FLOPs
    • encodes the channel counts into the search space, with EA as the optimizer (a toy encoding is sketched after this entry)
  • the architecture stays the same as the original fp32 model
  • DoReFa-Net forward
    • except the first and last layers
  • 4 is the empirical upper bound of the expansion ratio
    • [0.25, 0.5, 1, 2, 3, 4]
  • Learning Architectures for Binary Networks
  • GIST(South Korea)
  • seems like eccv …
  • claims to match SOTA methods without many tricks, just by scaling up
  • cell-based, proposed a new cell template composed of binary operations
  • first experiment: directly applying the XNOR binarization scheme to architectures searched by DARTS and friends
    • works poorly (as one would expect)
  • novel search objective - diversity regularization
  • SS design
    • should be robust to quantization error
    • dilated conv and normal conv behave the same w.r.t. Q-error; both are relatively robust to it
    • separable convs have a large Q-error
    • zeroise - outputs zero; originally used to model the absence of a shortcut connection
      • essentially because the post-binarization error is sometimes so large that it exceeds the error of simply outputting zero
      • keep this op to reduce Q-error, not merely as a placeholder (?)
      • a probability controls whether zeroise is included
  • Cell Template Design
    • unstable gradients
    • forces skips between different cells - inter-cell skip connections (less quantization error)
  • Diversity Regularizer
    • like other papers, they find that parametric ops are rarely selected early on
    • exponentially annealed entropy regularizer (a sketch follows this list)
  • looks like a method assembled from hotfixes, but the execution is fairly solid
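
One plausible form of the exponentially annealed entropy regularizer (the exact schedule and sign convention in the paper may differ): reward a high-entropy op distribution early in the search so parametric ops are not starved, and let the term decay away later.

```python
import math
import torch

def diversity_regularizer(arch_params, step, total_steps, lam0=1.0):
    """Entropy bonus on the per-edge op distribution, exponentially annealed."""
    lam = lam0 * math.exp(-5.0 * step / total_steps)        # decays towards 0
    probs = torch.softmax(arch_params, dim=-1)               # (num_edges, num_ops)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=-1).mean()
    return -lam * entropy   # subtracting entropy from the loss encourages diversity

# usage: loss = task_loss + diversity_regularizer(alpha, step, total_steps)
```
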

  • Binarized Neural Architecture Search
  • Beihang Univ
  • Darts foundation
  • channel sampling / operation space reduction
    • abandons less promising operations
  • basically PC-DARTS + binary, restated…

  • CP-NAS: Child-Parent Neural Architecture Search for Binary Neural Networks
    • sample without replacement
    • each of the K ops is sampled once per round, looped K times, then the search space is reduced
  • Pair Optimization for binary
    • minimize distribution error between fp32 and binary
    • minimize the intra-class feature distance at the output
  • BATS: Binary ArchitecTure Search
  • Cambridge
  • argues that directly porting NAS to the binary domain causes serious problems, so several measures are needed to alleviate them
    • a binarized search space
    • a search strategy (to control and stabilize the search)
      • temperature-based
  • binarization scheme
    • follows XNOR-Net - but the scaling factor is obtained via backprop rather than analytically
  • search space
    • first notes that a good search space gives decent results even with random search
    • argues that depthwise convs are already compact and thus harder to binarize - same for bottlenecks
      • uses a large group count to approximate depthwise
      • they tried adding a channel shuffle after each grouped conv, but it did not help much, since the groups are usually large
  • Search strategy - various measures to make DARTS converge stably
    • found that the search quickly collapses onto real-valued ops (e.g. pooling and skip-connect), which are more effective early on
    • solved with a temperature that makes the whole distribution spikier (see the sketch after this entry)
  • 2-stage search, since training binary nets is itself a bit harder
    • weights real, activations binarized
    • personally this feels a bit shaky… though not necessarily…
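
A minimal sketch of the temperature idea: scale the architecture logits before the softmax so the op mixture becomes spikier, which is what keeps the search from collapsing onto the real-valued ops; the actual temperature schedule is a design choice of the paper.

```python
import torch

def arch_weights(alpha, temperature):
    # temperature < 1 sharpens the DARTS-style op mixture for one edge
    return torch.softmax(alpha / temperature, dim=-1)

alpha = torch.randn(8)             # toy architecture logits for one edge, 8 candidate ops
print(arch_weights(alpha, 1.0))    # standard softmax
print(arch_weights(alpha, 0.2))    # spikier mixture
```
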
  • APQ: Joint Search for Network Architecture, Pruning and Quantization Policy
  • MIT Han
  • unifies architecture, pruning, and quantization in one framework
  • a huge search space, handled with a quantization-aware accuracy predictor
    • training it needs {FP, QUANT} accuracy pairs, which requires designing quantization-aware finetuning
    • transfers knowledge from the FP32 predictor to the quantized predictor, significantly improving sample efficiency
    • the backbone is evolution + predictor + shared weights (OFA)
  • OFA training - progressively distill smaller subnets sampled from the OFA
    • MobileNet V2 base
    • to handle OFA subnets becoming inaccurate when the search space is too large
  • Quantization Predictor
    • Arch and Quantize Policy encoding
  • Binarizing MobileNet via Evolution-based Searching

  • 🔑 Key:
    1. find a balanced binary MobileNet, mainly in the group-conv domain
    2. weight sharing
  • 🎓 Source:
  • 🌱 Motivation:
  • 💊 Methodology:
    • BinaryScheme
      • scaling factor and backprop like XNOR-Net
      • enhanced shortcuts like MoBiNet & Bi-Real Net
      • polynomial differentiable approximation like Bi-Real Net
      • only weights are binarized at train/test
    • Flow
      1. pre-training
      2. sample grouping strategy and EA
      3. determine the strategy and train from scratch
    • module modification
  • 📐 Exps:
  • 💡 Ideas:
    • depth-wise + point-wise = depth separable conv
      • when binarizing the channel-(depth-)wise convs, fewer binary values are accumulated per output and the precision is too low, so training cannot converge (see the sketch after this list)
      • so group conv could be a surrogate
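
A back-of-the-envelope illustration of that last point: the number of ±1 products accumulated into one output value, which is what gets tiny for depthwise convs and recovers with moderate group counts.

```python
def accum_length(in_ch, k, groups):
    # binary products summed per output position of a (grouped) k x k conv
    return (in_ch // groups) * k * k

in_ch, k = 256, 3
print("depthwise  :", accum_length(in_ch, k, groups=in_ch))  # 9 terms
print("groups = 16:", accum_length(in_ch, k, groups=16))     # 144 terms
print("dense      :", accum_length(in_ch, k, groups=1))      # 2304 terms
```
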