What is ResNet, and how are SOTA CNNs designed?

Plus, perhaps, some of my own understanding of convolution design.

Posted by tianchen on April 29, 2020

Different Convs

Reference: Zhihu - "Convs Explained"

  1. Normal Conv (needs no further explanation)
  2. 1x1 Conv / Pointwise Conv
    • First used in NiN, later used in GoogLeNet
    • Can be used to control the number of filters
    • Aggregates information across channels
  3. Spatial & Cross-Channel Conv
    • Used in the Inception block
    • First a 1x1 conv to limit the number of channels, then a 3x3 conv, and finally the features are fused by concatenation
  4. Grouped Conv
    • First in AlexNet, applied in ResNeXt
    • Split the feature maps into groups, convolve within each group, and concat the results to reduce computation
      • A cardinality parameter controls the number of groups; as the cardinality grows, the computation shrinks
      • Grouping makes the filters less correlated with each other, which acts as a mild regularizer
      • The per-group operations can also be parallelized
    • Shuffle Group Conv
      • First in ShuffleNet
      • Channel shuffle promotes information flow between layers
        • (In the ShuffleNet figure: left is the bottleneck, middle is the regular shuffle block, right is the strided shuffle block)
  5. Separable Conv
    • 5.1 Depth-wise Separable Conv
      • Applied in Xception and the MobileNet family
      • Depthwise + Pointwise, i.e., a per-channel conv followed by a 1x1 conv (see the sketch after this list)
    • 5.2 Spatially Separable Conv
      • Decompose an NxN kernel into 1xN and Nx1
      • The problem is some information loss (not every filter can be factorized this way)
      • Flatten Conv is similar
        • Also a factorization, a 3-d decomposition to reduce computation
        • I recall an EDA paper at MICRO or ISCA on searching for how to unroll this
  6. Dilated Conv
    • Used for fusing neighboring information at multiple scales
  7. Deformable Conv
    • A conv with a learnable shape: learn an offset for each sampling point, so the kernel is no longer a rigid square
  8. Attention
    • 8.1 SE Block
      • The idea is that channels differ in importance; an auxiliary module scores each channel and the scores are multiplied with the feature map, giving different channels different weights
      • Attention along the channel dimension
    • 8.2 CBAM
      • Adds spatial-wise attention on top of the channel-wise attention
      • The two modules are cascaded
  • Summary
    • Replace a single kernel size with multiple kernel sizes (see the Inception family)
    • Replace fixed-size convs with deformable convs (see DCN)
    • Add plenty of 1x1 convs or pointwise grouped convolutions to reduce computation (see NIN, ShuffleNet)
    • Channel-wise weighting (see SENet)
    • Replace normal convs with depthwise separable convs (see MobileNet)
    • Use grouped convolutions (see ResNeXt)
    • Grouped conv + channel shuffle (see ShuffleNet)
    • Use residual connections (see ResNet)
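
A quick sketch in PyTorch (purely illustrative) of the depthwise separable factorization from 5.1: a per-channel 3x3 conv (groups = in_channels) followed by a 1x1 pointwise conv produces the same output shape as a normal 3x3 conv, with far fewer parameters; the channel counts below are arbitrary.

import torch
import torch.nn as nn

cin, cout = 32, 64
normal = nn.Conv2d(cin, cout, 3, padding=1, bias=False)
depthwise = nn.Conv2d(cin, cin, 3, padding=1, groups=cin, bias=False)  # one 3x3 kernel per channel
pointwise = nn.Conv2d(cin, cout, 1, bias=False)                        # mixes information across channels

x = torch.randn(1, cin, 56, 56)
print(normal(x).shape, pointwise(depthwise(x)).shape)                  # same output shape

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(normal), count(depthwise) + count(pointwise))              # 18432 vs 288 + 2048 = 2336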

Reference: Zhihu - "ResNeXt Explained"

ResNeXt

Split-Transform-Merge

  • A fully-connected layer (over every neuron of the previous layer) is itself a split-transform(activation)-merge structure
  • The Inception module is also such a split-transform-merge structure

Change Inception

  • The ResNeXt authors argue that the inside of an Inception V4 module needs too much manual, deliberate design (too many hyperparameters to tune)
    • Reasonable and yet not entirely: this view seems to give up the benefit of multiple receptive field sizes
  • So they give every branch the same topology and only control how many such branches there are through the cardinality (in mathematics the word also denotes the "size" of a set)
  • Very similar to Inception V4, except that Inception concats first and then applies the 1x1 conv, while ResNeXt applies a 1x1 conv per branch first and then concats

Group Conv

  • A compromise between a normal conv and a depth-wise conv
    • Neither one kernel per channel, nor one set of kernels over all channels (see the sketch below)
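
A small sketch (illustrative only, not code from the paper) of why the split-transform-merge view and a grouped conv coincide: 32 parallel 4-to-4 branches, concatenated, have exactly the same connectivity pattern as a single conv with groups=32 over 128 channels.

import torch
import torch.nn as nn

x = torch.randn(1, 128, 14, 14)

# explicit split-transform-merge: 32 branches, each convolving its own 4-channel slice
branches = nn.ModuleList(nn.Conv2d(4, 4, 3, padding=1, bias=False) for _ in range(32))
out_split = torch.cat([b(chunk) for b, chunk in zip(branches, x.chunk(32, dim=1))], dim=1)

# the same connectivity expressed as one grouped convolution
grouped = nn.Conv2d(128, 128, 3, padding=1, groups=32, bias=False)
print(out_split.shape, grouped(x).shape)  # both torch.Size([1, 128, 14, 14])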

Why ResNeXt Works

  • It introduces cardinality; in essence these are just different groups, and each group corresponds to a different sub-sample of the features, so the network can learn more diverse representations
  • Some people have pointed out that multi-cardinality plays much the same role as multi-head attention
    • Each one takes part in a different representation subspace

WRN (Wide ResNet)

Yet another classic ResNet variant

  • "Wide" simply means more channels in the feature maps…
  • Naming convention: WRN-n (number of conv layers)-k (widening factor)-B(block structure)
  • After a massive set of experiments the authors conclude:
    • Adding dropout works better
    • Sometimes going even deeper actually hurts
    • B(3,3) is a good structure, meaning two 3x3 convs per block (a minimal sketch follows this list)
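
A minimal sketch of such a B(3,3) wide block, assuming the pre-activation ordering with dropout between the two convs as described in the paper; the class and argument names are my own.

import torch
import torch.nn as nn

class WideBasicBlock(nn.Module):
    def __init__(self, in_planes, planes, stride=1, dropout=0.3):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_planes)
        self.conv1 = nn.Conv2d(in_planes, planes, 3, stride=stride, padding=1, bias=False)
        self.dropout = nn.Dropout(dropout)                 # dropout sits between the two 3x3 convs
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, 3, padding=1, bias=False)
        self.shortcut = (nn.Conv2d(in_planes, planes, 1, stride=stride, bias=False)
                         if stride != 1 or in_planes != planes else nn.Identity())

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))          # BN-ReLU-conv (pre-activation)
        out = self.conv2(torch.relu(self.bn2(self.dropout(out))))
        return out + self.shortcut(x)

# WRN-28-10: (28 - 4) / 6 = 4 blocks per stage, stage widths 16*10, 32*10, 64*10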

ShakeShake

  • Essentially a regularization method; not very mainstream, but shown to work well on small networks
  • Multi-branch (3 paths), in the same spirit as the parallel branches in ResNet
  • Forward shake
    • During the forward pass the two residual branches are multiplied by a and (1 - a) as a random disturbance (see the sketch after this list)
    • A similar idea already exists in the original Dropout (this method can be seen as a soft dropout?)
    • It can be combined with BN for better results, as a blog post mentions
      • That post discusses the incompatibility between Dropout and BN
      • The idea: Dropout adds a variance shift to the network, so the variance BN normalized with during training is wrong at inference time (training and inference become inconsistent)
      • The WRN authors add an fc followed by BN after the dropout, which mitigates the problem
      • Solutions:
        • The simplest and most brute-force: put dropout after BN
        • Some use Gaussian dropout, or dropout drawn from a uniform distribution (quite similar to the method in the shake-shake paper)
  • Backward shake
    • Also disturb the gradients during backprop
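
A minimal shake-shake sketch (my own simplification; the paper also distinguishes shake/even/keep modes and per-image vs per-batch sampling): the forward pass mixes the two residual branches with a random alpha, and the backward pass redistributes the gradient with an independent random beta.

import torch
from torch.autograd import Function

class ShakeShake(Function):
    @staticmethod
    def forward(ctx, x1, x2, training=True):
        # random convex combination of the two branches; use 0.5 at inference time
        alpha = torch.rand(1, device=x1.device) if training else torch.tensor(0.5, device=x1.device)
        return alpha * x1 + (1 - alpha) * x2

    @staticmethod
    def backward(ctx, grad_out):
        # an independent random factor disturbs the backward pass as well
        beta = torch.rand(1, device=grad_out.device)
        return beta * grad_out, (1 - beta) * grad_out, None

# usage inside a two-branch residual block: out = x + ShakeShake.apply(branch1(x), branch2(x))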

Reading the ResNet Source Code

Since ResNet gets used so often, here is a walkthrough of its source code.

  • Large blocks of code ahead
def conv3x3(in_planes, out_planes, stride=1, groups=1, dilation=1):
    """3x3 convolution with padding"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
                     padding=dilation, groups=groups, bias=False, dilation=dilation)


def conv1x1(in_planes, out_planes, stride=1):
    """1x1 convolution"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False)
  • The simplest convs, wrapped in thin helpers, since only 3x3 and 1x1 are used internally
    • They call nn.Conv2d from torch.nn (a Module whose forward is equivalent to nn.functional.conv2d)
    • Input args:
      • in_planes / out_planes: the number of input and output channels, quite intuitive
      • kernel_size
      • stride (default 1)
      • padding (default 0)
      • bias (default True), set to False here
        • the bias is added directly onto each output channel
      • groups (describes the connections between input and output), set to 1 here
        • both in_planes and out_planes must be divisible by groups
        • groups = 1: a normal convolution
        • groups = 2: equivalent to two conv layers, each processing half of the input channels and producing half of the output channels, which are then concatenated (cutting the connections between the two halves of the input channels)
        • groups = in_channels: each input channel is convolved with its own (out_channels / in_channels) kernels and all the results are concatenated (see the small check after this list)
      • dilation: default 1
      • dilation, stride, and the like can each be an int or a tuple
        • a tuple covers the case where the horizontal and vertical values differ
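
A small sanity check (illustrative only) of how groups changes things: nn.Conv2d stores its weight as (out_channels, in_channels // groups, kH, kW), so larger groups means fewer parameters.

import torch
import torch.nn as nn

x = torch.randn(1, 8, 16, 16)
for g in (1, 2, 8):
    conv = nn.Conv2d(8, 16, kernel_size=3, padding=1, groups=g, bias=False)
    print(g, conv.weight.shape, conv(x).shape)
# groups=1 -> weight (16, 8, 3, 3); groups=2 -> (16, 4, 3, 3); groups=8 -> (16, 1, 3, 3)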

class BasicBlock(nn.Module):
    expansion = 1
    __constants__ = ['downsample']

    def __init__(self, inplanes, planes, stride=1, downsample=None, groups=1,
                 base_width=64, dilation=1, norm_layer=None):
        super(BasicBlock, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        if groups != 1 or base_width != 64:
            raise ValueError('BasicBlock only supports groups=1 and base_width=64')
        if dilation > 1:
            raise NotImplementedError("Dilation > 1 not supported in BasicBlock")
        # Both self.conv1 and self.downsample layers downsample the input when stride != 1
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = norm_layer(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = norm_layer(planes)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)

        return out

  • One of the two basic building blocks: BasicBlock
    • Some of the design is redundant in order to stay compatible with many configurations
    • Only supports groups=1 and base_width=64
    • norm_layer and downsample look configurable
      • What downsample actually is gets decided in _make_layer (see the small usage sketch below)
      • The default norm_layer is None, which maps to BatchNorm2d
    • There is also a class attribute expansion, defaulting to 1 (and apparently it can only be 1 here)
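
A small usage sketch (purely illustrative, reusing the classes above) of why downsample is needed whenever the stride or the channel count changes: the identity branch has to match out in shape before the addition.

import torch
import torch.nn as nn

block = BasicBlock(64, 64)                       # same shape in and out, identity works directly
print(block(torch.randn(1, 64, 56, 56)).shape)   # torch.Size([1, 64, 56, 56])

down = nn.Sequential(conv1x1(64, 128, stride=2), nn.BatchNorm2d(128))
block2 = BasicBlock(64, 128, stride=2, downsample=down)
print(block2(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 128, 28, 28])
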
class Bottleneck(nn.Module):
    expansion = 4
    __constants__ = ['downsample']

    def __init__(self, inplanes, planes, stride=1, downsample=None, groups=1,
                 base_width=64, dilation=1, norm_layer=None):
        super(Bottleneck, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        width = int(planes * (base_width / 64.)) * groups
        # Both self.conv2 and self.downsample layers downsample the input when stride != 1
        self.conv1 = conv1x1(inplanes, width)
        self.bn1 = norm_layer(width)
        self.conv2 = conv3x3(width, width, stride, groups, dilation)
        self.bn2 = norm_layer(width)
        self.conv3 = conv1x1(width, planes * self.expansion)
        self.bn3 = norm_layer(planes * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)

        return out
  • The other basic building block: Bottleneck
    • expansion is 4
      • Used for the channel count of the last conv1x1 inside the block
        • Expanding from width to planes * expansion
    • width = int(planes * (base_width / 64.)) * groups
      • This is also the output channel count of the first conv1x1 (a worked example follows this list)
      • The conv3x3 in the middle keeps width as its channel count
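
A worked example of the width math, assuming the standard ResNeXt-50 32x4d settings for the first stage (planes = 64, expansion = 4); the Bottleneck class above is reused as-is.

import torch

# plain ResNet-50:  base_width=64, groups=1  -> width = int(64 * 64/64) * 1  = 64
# ResNeXt-50 32x4d: base_width=4,  groups=32 -> width = int(64 * 4/64)  * 32 = 128
# so the block becomes conv1x1(in, 128) -> conv3x3(128, 128, groups=32) -> conv1x1(128, 256)
blk = Bottleneck(256, 64, groups=32, base_width=4)
print(blk(torch.randn(1, 256, 56, 56)).shape)  # torch.Size([1, 256, 56, 56])
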
class ResNet(nn.Module):

    def __init__(self, block, layers, num_classes=1000, zero_init_residual=False,
                 groups=1, width_per_group=64, replace_stride_with_dilation=None,
                 norm_layer=None):
        super(ResNet, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        self._norm_layer = norm_layer

        self.inplanes = 64
        self.dilation = 1
        if replace_stride_with_dilation is None:
            # each element in the tuple indicates if we should replace
            # the 2x2 stride with a dilated convolution instead
            replace_stride_with_dilation = [False, False, False]
        if len(replace_stride_with_dilation) != 3:
            raise ValueError("replace_stride_with_dilation should be None "
                             "or a 3-element tuple, got {}".format(replace_stride_with_dilation))
        self.groups = groups
        self.base_width = width_per_group
        self.conv1 = nn.Conv2d(3, self.inplanes, kernel_size=7, stride=2, padding=3,
                               bias=False)
        self.bn1 = norm_layer(self.inplanes)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2,
                                       dilate=replace_stride_with_dilation[0])
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2,
                                       dilate=replace_stride_with_dilation[1])
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2,
                                       dilate=replace_stride_with_dilation[2])
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

        # Zero-initialize the last BN in each residual branch,
        # so that the residual branch starts with zeros, and each residual block behaves like an identity.
        # This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677
        if zero_init_residual:
            for m in self.modules():
                if isinstance(m, Bottleneck):
                    nn.init.constant_(m.bn3.weight, 0)
                elif isinstance(m, BasicBlock):
                    nn.init.constant_(m.bn2.weight, 0)

    def _make_layer(self, block, planes, blocks, stride=1, dilate=False):
        norm_layer = self._norm_layer
        downsample = None
        previous_dilation = self.dilation
        if dilate:
            self.dilation *= stride
            stride = 1
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                conv1x1(self.inplanes, planes * block.expansion, stride),
                norm_layer(planes * block.expansion),
            )

        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample, self.groups,
                            self.base_width, previous_dilation, norm_layer))
        self.inplanes = planes * block.expansion
        for _ in range(1, blocks):
            layers.append(block(self.inplanes, planes, groups=self.groups,
                                base_width=self.base_width, dilation=self.dilation,
                                norm_layer=norm_layer))

        return nn.Sequential(*layers)

    def _forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)

        return x

    # Allow for accessing forward method in a inherited class
    forward = _forward
  • Two of the constructor args are worth a closer look
    • zero_init_residual
      • Relevant code:
      
      if zero_init_residual:
          for m in self.modules():
              if isinstance(m, Bottleneck):
                  nn.init.constant_(m.bn3.weight, 0)
              elif isinstance(m, BasicBlock):
                  nn.init.constant_(m.bn2.weight, 0)
      
      
      • Sets the weight of the last BN in each block (i.e., each residual branch) to zero
        • So every block initially behaves like a pure identity mapping
    • replace_stride_with_dilation
      • First, note that ResNet has almost no pooling (fine, I stand corrected, there are a couple of pools, they are just not the main mechanism); downsampling is basically achieved with stride-2 convolutions
        • Equivalent to a conv followed by pooling
      • This parameter indicates whether dilated convs should replace the stride-2 convs (a usage sketch follows the init snippet below)
      • It is passed as a tuple of three True/False values
  • For 224x224 ImageNet inputs
    • inplanes is set to 64, i.e., the initial 7x7 conv outputs 64 channels
  • Initialization
    • kaiming_normal_ for all convs
    • For all BatchNorm (or possibly GroupNorm) layers, the standard init: weight = 1, bias = 0
    • Worth borrowing:

for m in self.modules():
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
    elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
        nn.init.constant_(m.weight, 1)
        nn.init.constant_(m.bias, 0)
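
A usage sketch of replace_stride_with_dilation built directly on the ResNet class quoted above; [False, True, True] is my example choice (a common setting for segmentation backbones, not something this code prescribes): layers 3 and 4 are dilated instead of strided, so the output stride stays at 8 instead of 32.

import torch

model = ResNet(Bottleneck, [3, 4, 6, 3], replace_stride_with_dilation=[False, True, True])
x = torch.randn(1, 3, 224, 224)
x = model.maxpool(model.relu(model.bn1(model.conv1(x))))
for layer in (model.layer1, model.layer2, model.layer3, model.layer4):
    x = layer(x)
print(x.shape)  # torch.Size([1, 2048, 28, 28]); with the default strides this would be [1, 2048, 7, 7]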



def _make_layer(self, block, planes, blocks, stride=1, dilate=False):
    norm_layer = self._norm_layer
    downsample = None
    previous_dilation = self.dilation
    if dilate:
        self.dilation *= stride
        stride = 1
    if stride != 1 or self.inplanes != planes * block.expansion:
        downsample = nn.Sequential(
            conv1x1(self.inplanes, planes * block.expansion, stride),
            norm_layer(planes * block.expansion),
        )

    layers = []
    layers.append(block(self.inplanes, planes, stride, downsample, self.groups,
                        self.base_width, previous_dilation, norm_layer))
    self.inplanes = planes * block.expansion
    for _ in range(1, blocks):
        layers.append(block(self.inplanes, planes, groups=self.groups,
                            base_width=self.base_width, dilation=self.dilation,
                            norm_layer=norm_layer))

    return nn.Sequential(*layers)

  • The _make_layer function is the key piece
    • It first creates the downsample layer (when the shape changes) that is handed to the first block of the stage
      • Inside it is a conv1x1 + BN (projecting to planes * expansion channels)
    • It then builds a list called layers
      • The first element is a block going from inplanes to planes
      • After that, inplanes is updated to planes * expansion, so the remaining blocks go from planes * expansion to planes (and expand back to planes * expansion at their output)
    • It returns an nn.Sequential
  • The body of ResNet is just these four layers plus a head
  • In actual use
    • Everything is determined by the arguments passed when constructing ResNet
      • block is the block type: BasicBlock for ResNet-18/34, Bottleneck from ResNet-50 onward
      • layers describes the size of each layer: a list of 4 elements, one per layer, e.g. [2, 2, 2, 2] for ResNet-18
self.layer1 = self._make_layer(block, 64, layers[0])
self.layer2 = self._make_layer(block, 128, layers[1], stride=2,
                               dilate=replace_stride_with_dilation[0])
self.layer3 = self._make_layer(block, 256, layers[2], stride=2,
                               dilate=replace_stride_with_dilation[1])
self.layer4 = self._make_layer(block, 512, layers[3], stride=2,
                               dilate=replace_stride_with_dilation[2])

  • There is one more layer of wrapping
def _resnet(arch, block, layers, pretrained, progress, **kwargs):
    model = ResNet(block, layers, **kwargs)
    if pretrained:
        state_dict = load_state_dict_from_url(model_urls[arch],
                                              progress=progress)
        model.load_state_dict(state_dict)
    return model
  • Finally, the individual resnet constructors (a quick usage sketch follows the snippet below)
def resnet18(pretrained=False, progress=True, **kwargs):
    r"""ResNet-18 model from
    `"Deep Residual Learning for Image Recognition" <https://arxiv.org/pdf/1512.03385.pdf>`_
    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
        progress (bool): If True, displays a progress bar of the download to stderr
    """
    return _resnet('resnet18', BasicBlock, [2, 2, 2, 2], pretrained, progress,
                   **kwargs)
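
A quick usage sketch, assuming the torchvision-style definitions quoted above are in scope:

import torch

model = resnet18(pretrained=False)
logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 1000])

# the same ResNet class also covers ResNeXt, e.g. ResNeXt-50 32x4d:
resnext50 = ResNet(Bottleneck, [3, 4, 6, 3], groups=32, width_per_group=4)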

Other ResNet Variants

Hand-crafted architecture design has also kept advancing; set against NAS, manually designed structures can still reach strong performance.

ResNeSt: Split-Attention Networks

  • 🔑 Key:
    • New backbone for det & semantic-seg
    • Split Attention across feature map groups (within a block)
      • A new ResNet Cell
  • 🎓 Source:
  • 🌱 Motivation:
    • A new resnet cell, plug and play with ResNet
      • helps downstream tasks (like detection or segmentation)
    • ResNets are designed for image classification, with a focus on depthwise & group-wise conv; however, for downstream tasks cross-channel information is critical
      • Small receptive field, no cross-channel interaction
    • create a versatile backbone with universally improved feature representation
  • 💊 Methodology:
    • Feature maps are split into groups (the number of groups is the cardinality)
      • Radix is the number of splits within each group
    • Split Attention
      • element-wise sum across multiple splits
      • H,W average pooling for gathering global contextual information
      • Channel-wise weighted soft fusion
      • finally for groups, concat them
      • split block output with a shortcut
        • if strided, apply an extra transformation to the shortcut so that the shapes match
    • Related with other attention methods
      • SE (Squeeze-and-Excitation) uses global context to predict channel-wise attention factors
        • With radix r = 1, a Split-Attention block reduces to SE-style attention; the only difference is that
        • SE is employed on the whole block regardless of groups, while ResNeSt applies it within each cardinal group
      • SKNet: feature fusion between network branches
        • (the authors say this can be inefficient and hard to scale to larger numbers of groups)
  • 📐 Exps:
    • 3x3 Max pooling instead of strided conv
    • Plus all the training tricks (slightly concerning for fair comparison, though we could adopt them too)
  • 💡 Ideas:
    • SENet
      • Channel-wise aggregation of the global context, then a learned set of weights (a minimal sketch follows this list)
        • The excitation is actually an fc
    • SKNet
      • Split - Fuse - Select
        • two group conv branches with different kernel size
        • Fuse is a squeeze and excitation block
        • a softmax across the two branches gives the channel weights; each branch is multiplied by its weights and the results are added
      • Sacrifice a little flops for better perf.
        • Combine Attention with larger kernel
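
To make the global-context -> channel-weights -> reweighting idea above concrete, a minimal SE block sketch (the reduction ratio of 16 and all names are my own assumptions):

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)           # squeeze: global spatial context per channel
        self.fc = nn.Sequential(                      # excitation: two fc layers written as 1x1 convs
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.fc(self.pool(x))                     # per-channel weights in (0, 1)
        return x * w                                  # reweight the feature map channel-wise

print(SEBlock(64)(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])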

DenseNet

  • Gao Huang (THU)
  • Formulation
    • Normal residual: x_l = H(x_{l-1}) + x_{l-1}
    • Dense: x_l = H([x_0, x_1, …, x_{l-1}]): each layer is connected to all of the preceding layers (see the sketch after this list)
      • via concat
  • Parameters
    • growth rate k: the l-th layer has k_0 + k*(l-1) input channels
      • can be as small as 12
    • a 1x1 conv as bottleneck, with 4k output channels
    • compression ratio: fewer channels at the transition layers (where pooling is applied)
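
A minimal dense-layer sketch (assuming the BN-ReLU-Conv ordering and the 4k bottleneck from the paper; names are mine): each layer consumes the concatenation of all earlier feature maps and emits k new channels.

import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    def __init__(self, in_channels, growth_rate=12):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, 4 * growth_rate, 1, bias=False),   # 1x1 bottleneck with 4k channels
            nn.BatchNorm2d(4 * growth_rate),
            nn.ReLU(inplace=True),
            nn.Conv2d(4 * growth_rate, growth_rate, 3, padding=1, bias=False),
        )

    def forward(self, x):
        return torch.cat([x, self.body(x)], dim=1)    # x_l = [x_0, ..., x_{l-1}, H(.)]

layer = DenseLayer(24, growth_rate=12)
print(layer(torch.randn(1, 24, 32, 32)).shape)        # torch.Size([1, 36, 32, 32]): 24 + k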

CondenseNet

  • Gao Huang (THU)
  • Learned group conv + dense connectivity
  • Targets mobile / high efficiency: learn which connections actually matter
    • DenseNet's full connectivity + a learned group conv that removes the relatively useless connections
  • Normal conv & group conv
    • Naively using group conv in the 1x1 layers causes an accuracy drop
  • For DenseNet, even if a feature map is irrelevant to every group in the current layer, it still reaches later layers (because of the identity/concat connections)
  • Implementation
    • 2 stages
      • condensing: sparsity-inducing regularization; prune the filter connections with low absolute weight (while keeping the same sparsity pattern within each group, so the result stays compatible with standard group conv)
      • optimizing: then continue normal training with the mask applied (a rough sketch follows)
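
A very rough sketch of the condensing idea (my own simplification, not the paper's exact procedure): zero out the lowest-magnitude input connections of a 1x1 conv so that every output group keeps the same number of inputs, then keep training with the mask re-applied.

import torch
import torch.nn as nn

conv = nn.Conv2d(16, 32, kernel_size=1, bias=False)        # weight shape: (32, 16, 1, 1)
groups, keep_ratio = 4, 0.5

with torch.no_grad():
    w = conv.weight.abs().view(groups, 32 // groups, 16)   # split the output filters into groups
    score = w.sum(dim=1)                                    # importance of every input for each group
    keep = score.topk(int(16 * keep_ratio), dim=1).indices  # inputs each group keeps
    mask = torch.zeros(groups, 16).scatter_(1, keep, 1.0)
    mask = mask.repeat_interleave(32 // groups, dim=0).view(32, 16, 1, 1)
    conv.weight.mul_(mask)                                  # prune the low-magnitude connections
# in the "optimizing" stage the mask would be re-applied after every update step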