Papers

GDAS - Searching for A Robust Neural Architecture in Four GPU Hours

Problems

自己总结理解的一些darts主要存在的问题

complicated and coupled

搜索过程中向Shortcut坍缩
- (DARTS+) 通过早停来避免，类似的还有(Understanding & Robustifying)设计了基于hessian矩阵的早停机制
- 但是有人认为不让searching converge，会导致search unstable, sensitive to initialization，通过实验结果说明很多时候不能得到好的结果，要好几run才能得到好的
本身用op之间的softmax的不合理
- (FairDarts/GOLDNAS)一系列工作表示这个op之间竞争不合理，改用sigmoid
- (GDAS/SNAS/FBNet)一系列工作选用Gumbel采样，或者是每次forward只更新某个op的weight(体现为gumbel-hard采样)
- 区分带参数以及不带参数的op，分别竞争
SS本身会造成搜索不稳定 - 做SearchSpace Reduction - Progressive Shrinking/Pruning
- (CPNAS) - prune掉topK的op
- 当weight的alpha小于threshold的时候，prune掉
gradient error on approximation - 是理论存在的，被认为是造成collapse的主要原因
- 用pertubation as regularization
- mixed-level optimization

param-free的op更容易converge，陷入了local minima
- 最简单的解决方案就是在开始fix alpha
- 另外一种是sampling
- partitian param / param-free
- op-level dropout

derived-based

搜索过程完之后None-Op占主要，derive的时候需要删除none
- (自己经验)当采用Gumbel采样之后，alpha中none并不处于主导，可以带none进行derive
derive的时候discretize与原本的alpha差距大
- (FairDarts)通过加上01loss，鼓励alpha更加偏向0/1
- 在搜索过程中就直接用离散的架构(Gumbel hard)
- 将最后做discrete转化为gradual eliminating op
8-cell搜索，20-cell final之间的cell数目的gap
- (P-darts)搜索中逐渐增加cell的数目

others

这篇survey将weight-sharing方法与标准的individual evaluation的误差总结为了以下4点：
1. mathematical-error: 比如darts这类方法对于alpha的梯度衡量存在明显误差
2. supernet sub-optimza: supernet的训练容易陷入由alpha所决定的local optima
  - ? - 这个是否可以一定程度上归因于Imbalance training
3. discretization loss: derive的过程中存在的误差
4. deployment-gap: 由于size或者是训练超参数所存在的误差
  - 8cell-20cell的个人理解也应该归属于这一类

Training Settings

search space

主要有两大类，一类是原始darts所用的基于eNAS的SS，另一类是FBNet(ProxylessNAS)所用的block-wise的MobileNet搜索空间

8 ops
- 3x3 5x5 separable conv
- 3x3 5x5 dilated conv
- 3x3 max/avergae pooling
- zero / shortcut
N=7 nodes
- output node as depthwise concat
- cell k’s 1st and 2nd node are equal to cell k-1, k-2’s output (insert 1x1 conv for channel align)
Relu-conv-bn
separable-conv applied twice
1/3 & 2/3 depth of arch is reduction cell
- stride=2 op nearest to input

trainer

split the CIFAR-10 set - 50% train / 50% val
For weights - SGD - 0.025/0.9/3.e-4
zero init \alpha
For alpha - 3.e-4/(0.5,0.999)/1.e-3

Others

Xie’s Talk - Full space NAS

Darts的问题所在
- 主要在Bi-level Optimizatoin
  - 很难找到当前架构上的最优参数，假设不成立
- Discrete Loss会让母网络的精度大量下降
替代Bilevel optimization
- alpha与weight的数量差距很大，jointly training的话会导致优化不平衡(所以需要采用bilevel来分离)
- 在单阶段中加入正则化去拉 jj.jj.adual pruning，防止超网络去疯狂生长，让所有的参数都留下来,加一个flops约束来
删除不合理的
- 每条边上只能保持一个操作
- 每个Cell只能和之前的两个Cell链接
- Normal以及Reduction Cell必须一样

1-level optimization

原本darts使用1-level只有3.5%左右，甚至比随机samole的3.4左右还要低
认为原本one-level不行的原因是因为task不够困难，在search过程中对数据集做一个augmentation来增强困难

Darts related

diving into grads

Papers

Problems

derived-based

others

Training Settings

search space

trainer

Others

Xie’s Talk - Full space NAS

1-level optimization

CATALOG

FEATURED TAGS

FRIENDS

Papers

Problems

search-related

weight-adaption-related

derived-based

others

Training Settings

search space

trainer

Others

Xie’s Talk - Full space NAS

1-level optimization

CATALOG

FEATURED TAGS

FRIENDS