Different Convs
Based on the Zhihu post on convolution variants (Convs详解).
- Normal Conv (needs no elaboration; the figure in the referenced post illustrates it well)
- 1x1 Conv/Pointwise Conv
- First appeared in NiN, later used in GoogLeNet
- Can be used to control the number of filters/channels
- Integrates information across channels
- Spatial&Cross Channel Conv
- Used in the Inception Block
- A 1x1 conv first limits the number of channels, then a 3x3 conv is applied, and finally the features are fused by concatenation
- Grouped Conv
- First appeared in AlexNet, later applied in ResNeXt
- Split the feature channels into groups, convolve within each group, and concat the results, reducing computation
- A cardinality parameter controls the number of groups; as the cardinality increases, computation decreases
- Grouped convolution decorrelates the filters, which gives a mild regularization effect
- The per-group computations can also be parallelized
- ShuffleGroupConv
- 1st in ShuffleNet
- Channel shuffle promotes information flow between layers
- In the reference figure: left, the bottleneck; middle, the standard shuffle block; right, the strided shuffle block
- Separable Conv
- Depthwise Separable Conv
- Applied in Xception and the MobileNet family
- Depthwise + pointwise, i.e. a per-channel conv followed by a 1x1 conv (see the sketch after this list)
- Spatially Separable Conv
- Factors an NxN conv into a 1xN followed by an Nx1
- The problem is some loss of expressiveness (not every filter can be decomposed this way)
- A related idea is the Flattened Conv
- Also a factorization: the 3-D filter is decomposed to reduce computation
- I recall a MICRO or ISCA paper from the EDA side that searches for how to unroll such decompositions
- Dilated Conv
- Used to fuse neighborhood information at multiple scales
- Deformable Conv
- The goal is a conv whose shape is learnable: an offset is learned for every sampling point, so the kernel is no longer a rigid square
- Attention
- SE Block
- The idea is that channels differ in importance: an auxiliary module scores each channel, and the scores are multiplied into the feature map, assigning each channel a different weight
- Attention along the channel dimension
- CBAM
- Adds spatial-wise attention on top of channel-wise attention
- The two modules are cascaded
- Summary
- Replace a single kernel size with multiple kernel sizes (cf. the Inception series)
- Replace fixed-shape convolutions with deformable convolutions (cf. DCN)
- Use plenty of 1x1 convolutions or pointwise grouped convolutions to cut computation (cf. NiN, ShuffleNet)
- Weight the channels (cf. SENet)
- Replace standard convolutions with depthwise separable convolutions (cf. MobileNet)
- Use grouped convolutions (cf. ResNeXt)
- Grouped convolution + channel shuffle (cf. ShuffleNet)
- Use residual connections (cf. ResNet)
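The minimal PyTorch sketch below (my own illustration, not from the referenced posts; the 32 -> 64 channel sizes are arbitrary) shows how several of the variants above are expressed with plain nn.Conv2d arguments, plus a channel-shuffle helper.

import torch
import torch.nn as nn

x = torch.randn(1, 32, 56, 56)

# 1x1 / pointwise conv: mixes information across channels and controls the channel count
pointwise = nn.Conv2d(32, 64, kernel_size=1)

# grouped conv: 4 groups, each convolving 8 of the 32 input channels (fewer params / FLOPs)
grouped = nn.Conv2d(32, 64, kernel_size=3, padding=1, groups=4)

# depthwise separable conv = per-channel 3x3 (depthwise) followed by a 1x1 (pointwise)
depthwise_separable = nn.Sequential(
    nn.Conv2d(32, 32, kernel_size=3, padding=1, groups=32),
    nn.Conv2d(32, 64, kernel_size=1),
)

# spatially separable conv: factor a 3x3 into a 1x3 followed by a 3x1
spatially_separable = nn.Sequential(
    nn.Conv2d(32, 64, kernel_size=(1, 3), padding=(0, 1)),
    nn.Conv2d(64, 64, kernel_size=(3, 1), padding=(1, 0)),
)

# channel shuffle (ShuffleNet): interleave channels across groups so information can flow
def channel_shuffle(t, groups):
    b, c, h, w = t.shape
    return t.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

for m in (pointwise, grouped, depthwise_separable, spatially_separable):
    print(m(x).shape)                      # torch.Size([1, 64, 56, 56]) for each
print(channel_shuffle(x, groups=4).shape)  # torch.Size([1, 32, 56, 56])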
Based on the Zhihu post explaining ResNeXt (ResNeXT详解).
ResNeXt
ResNet + Inception
- Essentially group convolution
Split-Transform-Merge
- An fc layer (for every neuron of the previous layer) is itself a split-transform(-activation)-merge structure
- The Inception module is also such a split-transform-merge structure
Changes to Inception
- The ResNeXt authors argue that the Inception module in Inception V4 requires too much internal hand design, which is artificial and deliberate (too many hyper-parameters to tune)
- Reasonable yet debatable: a uniform topology seemingly gives up multi-scale receptive fields
- So every module adopts the same topology, and a single cardinality (in mathematics the word denotes the size of a set) controls how many parallel branches there are
- Very similar to Inception V4, except Inception concats first and then applies a 1x1 conv, whereas ResNeXt applies the per-branch 1x1 convs first and then concats
Group Conv
- A compromise between standard convolution and depthwise convolution
- Neither one kernel per channel, nor one kernel over all channels (see the numerical check after this section)
Why ResNeXt works
- Introducing cardinality is essentially just grouping; each group corresponds to a different sub-sampling of the channels, so more diverse representations can be learned
- Some have pointed out that multi-cardinality plays much the same role as multi-head attention
- Each head/group takes part in a different representation subspace
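As a quick numerical check of the claim that split-transform-merge branches are just a grouped convolution, the sketch below (my own, arbitrary sizes) copies the weights of a groups=2 convolution into two independent branches and verifies that the outputs match.

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(1, 8, 16, 16)

grouped = nn.Conv2d(8, 8, kernel_size=3, padding=1, groups=2, bias=False)

# split-transform-merge: two branches, each seeing half of the input channels
branch1 = nn.Conv2d(4, 4, kernel_size=3, padding=1, bias=False)
branch2 = nn.Conv2d(4, 4, kernel_size=3, padding=1, bias=False)
with torch.no_grad():
    branch1.weight.copy_(grouped.weight[:4])   # grouped weight shape: [8, 4, 3, 3]
    branch2.weight.copy_(grouped.weight[4:])

y_grouped = grouped(x)
y_merged = torch.cat([branch1(x[:, :4]), branch2(x[:, 4:])], dim=1)
print(torch.allclose(y_grouped, y_merged, atol=1e-6))  # True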
WRN(Wide ResNet)
Another classic ResNet variant
- "Wide" simply means more channels per feature map
- Naming convention: WRN-n(number of conv layers)-k(widening factor)-B(block structure)
- After extensive experiments the authors conclude:
- Adding dropout works better
- Sometimes going too deep actually hurts
- B(3,3) is a good structure, meaning two 3x3 convolutions per block
ShakeShake
- Essentially a regularization method; not very mainstream, but shown to work well on small networks
- Multi-branch (3 paths counting the shortcut), in the same spirit as ResNet's parallel paths
- Forward shake (a minimal sketch follows this list)
- During the forward pass the two residual branches are multiplied by a and (1-a) as a disturbance
- A similar idea already appears in the original dropout work (this method can be seen as a soft dropout?)
- It can be combined with BN for better results; a blog post covers this
- That blog post discusses the incompatibility between dropout and BN
- The argument: dropout introduces a variance shift, so the variance that BN normalized with during training is wrong at inference (training and inference become inconsistent)
- The WRN authors added an fc layer between dropout and BN, which alleviates the problem
- Solutions:
- The simplest and crudest: put dropout after BN
- Some use Gaussian dropout, i.e. a continuous (rather than binary) drop coefficient, which is quite similar to the method in the shake-shake paper
- Backward shake
- The back-propagated gradients are also disturbed, with an independently drawn coefficient
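A minimal sketch of the forward shake (my own simplification): only the forward-pass disturbance is shown; the paper's backward shake would additionally need a custom autograd function that draws an independent coefficient for the gradients.

import torch
import torch.nn as nn

class ShakeShakeBlock(nn.Module):
    """Two residual branches mixed with a random coefficient (forward shake only)."""
    def __init__(self, channels):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels),
            )
        self.branch1, self.branch2 = branch(), branch()

    def forward(self, x):
        if self.training:
            # one random mixing coefficient per sample
            alpha = torch.rand(x.size(0), 1, 1, 1, device=x.device)
        else:
            alpha = 0.5  # use the expectation at test time
        return x + alpha * self.branch1(x) + (1 - alpha) * self.branch2(x)

print(ShakeShakeBlock(16)(torch.randn(2, 16, 8, 8)).shape)  # torch.Size([2, 16, 8, 8])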
ResNet Source Code Walkthrough
Since ResNet gets used so often, here is a walkthrough of the torchvision ResNet source.
- Warning: large code blocks ahead
import torch
import torch.nn as nn

# excerpted from torchvision.models.resnet
def conv3x3(in_planes, out_planes, stride=1, groups=1, dilation=1):
"""3x3 convolution with padding"""
return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
padding=dilation, groups=groups, bias=False, dilation=dilation)
def conv1x1(in_planes, out_planes, stride=1):
"""1x1 convolution"""
return nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False)
- The simplest conv helpers, just thin wrappers, since only 3x3 and 1x1 kernels are used internally
- They call nn.Conv2d from torch.nn (a Module whose forward is equivalent to nn.functional.conv2d)
- Input args
- in_planes / out_planes: numbers of input and output channels, straightforward
- kernel_size
- stride (default 1)
- padding (default 0)
- bias (default True), set to False here because a BN layer follows
- The bias is added directly to each output channel
- groups (describes the connections between input and output channels), set to 1 here by default
- Both in_planes and out_planes must be divisible by groups
- groups = 1: a standard convolution
- groups = 2: equivalent to two convolution layers, each processing half of the input channels and producing half of the output channels, with the results concatenated (this cuts the connections between the two halves of the input channels)
- groups = in_channels: each input channel is convolved with its own (out_channels / in_channels) kernels, and all results are concatenated (depthwise convolution; see the weight-shape check after this list)
- dilation: default 1
- dilation, stride, etc. can be an int or a tuple
- A tuple describes the case where the horizontal and vertical values differ
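A quick check (my own, arbitrary sizes) of how groups changes the weight shape of nn.Conv2d: each output channel only sees in_channels / groups input channels.

import torch.nn as nn

# groups=1: every output channel sees all 8 input channels
print(nn.Conv2d(8, 16, 3, padding=1, groups=1, bias=False).weight.shape)  # [16, 8, 3, 3]
# groups=2: two independent convs, each over 4 input channels
print(nn.Conv2d(8, 16, 3, padding=1, groups=2, bias=False).weight.shape)  # [16, 4, 3, 3]
# groups=8 (= in_channels): depthwise, each input channel convolved on its own
print(nn.Conv2d(8, 16, 3, padding=1, groups=8, bias=False).weight.shape)  # [16, 1, 3, 3]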
class BasicBlock(nn.Module):
expansion = 1
__constants__ = ['downsample']
def __init__(self, inplanes, planes, stride=1, downsample=None, groups=1,
base_width=64, dilation=1, norm_layer=None):
super(BasicBlock, self).__init__()
if norm_layer is None:
norm_layer = nn.BatchNorm2d
if groups != 1 or base_width != 64:
raise ValueError('BasicBlock only supports groups=1 and base_width=64')
if dilation > 1:
raise NotImplementedError("Dilation > 1 not supported in BasicBlock")
# Both self.conv1 and self.downsample layers downsample the input when stride != 1
self.conv1 = conv3x3(inplanes, planes, stride)
self.bn1 = norm_layer(planes)
self.relu = nn.ReLU(inplace=True)
self.conv2 = conv3x3(planes, planes)
self.bn2 = norm_layer(planes)
self.downsample = downsample
self.stride = stride
def forward(self, x):
identity = x
out = self.conv1(x)
out = self.bn1(out)
out = self.relu(out)
out = self.conv2(out)
out = self.bn2(out)
if self.downsample is not None:
identity = self.downsample(x)
out += identity
out = self.relu(out)
return out
- BasicBlock, one of the two basic building blocks
- Some redundant arguments exist to stay compatible with the other block type
- Only groups=1 and base_width=64 are supported; norm_layer and downsample look configurable (what downsample actually is comes from _make_layer)
- The default norm_layer is None, which maps to BatchNorm2d
- There is also an expansion class attribute that defaults to 1 (and for BasicBlock effectively can only be 1)
class Bottleneck(nn.Module):
expansion = 4
__constants__ = ['downsample']
def __init__(self, inplanes, planes, stride=1, downsample=None, groups=1,
base_width=64, dilation=1, norm_layer=None):
super(Bottleneck, self).__init__()
if norm_layer is None:
norm_layer = nn.BatchNorm2d
width = int(planes * (base_width / 64.)) * groups
# Both self.conv2 and self.downsample layers downsample the input when stride != 1
self.conv1 = conv1x1(inplanes, width)
self.bn1 = norm_layer(width)
self.conv2 = conv3x3(width, width, stride, groups, dilation)
self.bn2 = norm_layer(width)
self.conv3 = conv1x1(width, planes * self.expansion)
self.bn3 = norm_layer(planes * self.expansion)
self.relu = nn.ReLU(inplace=True)
self.downsample = downsample
self.stride = stride
def forward(self, x):
identity = x
out = self.conv1(x)
out = self.bn1(out)
out = self.relu(out)
out = self.conv2(out)
out = self.bn2(out)
out = self.relu(out)
out = self.conv3(out)
out = self.bn3(out)
if self.downsample is not None:
identity = self.downsample(x)
out += identity
out = self.relu(out)
return out
- Bottleneck, the other of the two basic building blocks
- expansion is 4
- Used for the channel count of the last internal conv1x1, which expands from width to planes * expansion
- width = int(planes * (base_width / 64.)) * groups
- This is also the output channel count of the first conv1x1
- The 3x3 conv in between keeps width as its channel count (a worked example follows this list)
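A quick worked example of the width formula (the values for ResNeXt-50 32x4d follow the paper's naming; the helper function name is mine):

def bottleneck_width(planes, base_width=64, groups=1):
    return int(planes * (base_width / 64.)) * groups

# plain ResNet-50, first stage (planes=64): 1x1 -> 64, 3x3 -> 64, 1x1 -> 256
print(bottleneck_width(64))                            # 64
# ResNeXt-50 32x4d (groups=32, width_per_group=4): 1x1 -> 128, grouped 3x3 -> 128, 1x1 -> 256
print(bottleneck_width(64, base_width=4, groups=32))   # 128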
class ResNet(nn.Module):
def __init__(self, block, layers, num_classes=1000, zero_init_residual=False,
groups=1, width_per_group=64, replace_stride_with_dilation=None,
norm_layer=None):
super(ResNet, self).__init__()
if norm_layer is None:
norm_layer = nn.BatchNorm2d
self._norm_layer = norm_layer
self.inplanes = 64
self.dilation = 1
if replace_stride_with_dilation is None:
# each element in the tuple indicates if we should replace
# the 2x2 stride with a dilated convolution instead
replace_stride_with_dilation = [False, False, False]
if len(replace_stride_with_dilation) != 3:
raise ValueError("replace_stride_with_dilation should be None "
"or a 3-element tuple, got {}".format(replace_stride_with_dilation))
self.groups = groups
self.base_width = width_per_group
self.conv1 = nn.Conv2d(3, self.inplanes, kernel_size=7, stride=2, padding=3,
bias=False)
self.bn1 = norm_layer(self.inplanes)
self.relu = nn.ReLU(inplace=True)
self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
self.layer1 = self._make_layer(block, 64, layers[0])
self.layer2 = self._make_layer(block, 128, layers[1], stride=2,
dilate=replace_stride_with_dilation[0])
self.layer3 = self._make_layer(block, 256, layers[2], stride=2,
dilate=replace_stride_with_dilation[1])
self.layer4 = self._make_layer(block, 512, layers[3], stride=2,
dilate=replace_stride_with_dilation[2])
self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
self.fc = nn.Linear(512 * block.expansion, num_classes)
for m in self.modules():
if isinstance(m, nn.Conv2d):
nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
nn.init.constant_(m.weight, 1)
nn.init.constant_(m.bias, 0)
# Zero-initialize the last BN in each residual branch,
# so that the residual branch starts with zeros, and each residual block behaves like an identity.
# This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677
if zero_init_residual:
for m in self.modules():
if isinstance(m, Bottleneck):
nn.init.constant_(m.bn3.weight, 0)
elif isinstance(m, BasicBlock):
nn.init.constant_(m.bn2.weight, 0)
def _make_layer(self, block, planes, blocks, stride=1, dilate=False):
norm_layer = self._norm_layer
downsample = None
previous_dilation = self.dilation
if dilate:
self.dilation *= stride
stride = 1
if stride != 1 or self.inplanes != planes * block.expansion:
downsample = nn.Sequential(
conv1x1(self.inplanes, planes * block.expansion, stride),
norm_layer(planes * block.expansion),
)
layers = []
layers.append(block(self.inplanes, planes, stride, downsample, self.groups,
self.base_width, previous_dilation, norm_layer))
self.inplanes = planes * block.expansion
for _ in range(1, blocks):
layers.append(block(self.inplanes, planes, groups=self.groups,
base_width=self.base_width, dilation=self.dilation,
norm_layer=norm_layer))
return nn.Sequential(*layers)
def _forward(self, x):
x = self.conv1(x)
x = self.bn1(x)
x = self.relu(x)
x = self.maxpool(x)
x = self.layer1(x)
x = self.layer2(x)
x = self.layer3(x)
x = self.layer4(x)
x = self.avgpool(x)
x = torch.flatten(x, 1)
x = self.fc(x)
return x
# Allow for accessing forward method in a inherited class
forward = _forward
- Two of the constructor args are interesting to configure:
- zero_init_residual
- Relevant code:
if zero_init_residual:
    for m in self.modules():
        if isinstance(m, Bottleneck):
            nn.init.constant_(m.bn3.weight, 0)
        elif isinstance(m, BasicBlock):
            nn.init.constant_(m.bn2.weight, 0)
- Sets the weight of the last BN in each block (i.e. each residual branch) to 0
- So at initialization every block acts as a pure identity mapping
- replace_stride_with_dilation
- First, note that ResNet has (almost) no pooling layers (alright, I stand corrected: there are a few, just not the main mechanism); down-sampling is mostly achieved with stride-2 convolutions
- Equivalent to convolving and then pooling
- This argument controls whether dilated convolutions replace those stride-2 convolutions
- It is passed as a tuple/list of three True/False values (one per layer2/3/4)
- For 224x224 ImageNet inputs:
- inplanes is set to 64, i.e. the stem 7x7 conv outputs 64 channels
- Initialization:
- All convs use kaiming_normal_
- All BatchNorm (or possibly GroupNorm) layers get the standard init: weight set to 1, bias set to 0
- A pattern worth learning from:
for m in self.modules():
if isinstance(m, nn.Conv2d):
nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
nn.init.constant_(m.weight, 1)
nn.init.constant_(m.bias, 0)
def _make_layer(self, block, planes, blocks, stride=1, dilate=False):
norm_layer = self._norm_layer
downsample = None
previous_dilation = self.dilation
if dilate:
self.dilation *= stride
stride = 1
if stride != 1 or self.inplanes != planes * block.expansion:
downsample = nn.Sequential(
conv1x1(self.inplanes, planes * block.expansion, stride),
norm_layer(planes * block.expansion),
)
layers = []
layers.append(block(self.inplanes, planes, stride, downsample, self.groups,
self.base_width, previous_dilation, norm_layer))
self.inplanes = planes * block.expansion
for _ in range(1, blocks):
layers.append(block(self.inplanes, planes, groups=self.groups,
base_width=self.base_width, dilation=self.dilation,
norm_layer=norm_layer))
return nn.Sequential(*layers)
- The _make_layer function is the key piece
- When the stride is not 1 or the channel count changes, it creates a downsample branch for the first block of the stage
- The downsample is a conv1x1 + BN (mapping to planes * expansion channels)
- It then builds a list called layers
- The first element is a block going from inplanes to planes, carrying the stride and the downsample
- inplanes is then updated to planes * expansion, and the remaining blocks take planes * expansion input channels while keeping planes as the internal width (each outputting planes * expansion again)
- Returns an nn.Sequential of the blocks
- The body of ResNet is these four stages plus a stem and a head (a shape check follows the code below)
- In actual use:
- Everything is determined by the arguments passed when the ResNet is constructed
- block is the block class: BasicBlock for resnet18/34, Bottleneck from resnet50 onward
- layers gives the depth of each stage, passed as a list of 4 elements; for resnet18 it is [2, 2, 2, 2]
self.layer1 = self._make_layer(block, 64, layers[0])
self.layer2 = self._make_layer(block, 128, layers[1], stride=2,
dilate=replace_stride_with_dilation[0])
self.layer3 = self._make_layer(block, 256, layers[2], stride=2,
dilate=replace_stride_with_dilation[1])
self.layer4 = self._make_layer(block, 512, layers[3], stride=2,
dilate=replace_stride_with_dilation[2])
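A small sanity check of the stage shapes on a 224x224 input, using the stock torchvision resnet50 (the shapes in the comments are what the strides and expansion factor imply):

import torch
from torchvision.models import resnet50

model = resnet50().eval()
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    x = model.maxpool(model.relu(model.bn1(model.conv1(x))))  # [1, 64, 56, 56]
    x = model.layer1(x); print(x.shape)   # torch.Size([1, 256, 56, 56])
    x = model.layer2(x); print(x.shape)   # torch.Size([1, 512, 28, 28])
    x = model.layer3(x); print(x.shape)   # torch.Size([1, 1024, 14, 14])
    x = model.layer4(x); print(x.shape)   # torch.Size([1, 2048, 7, 7])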
- There is one more level of wrapping:
def _resnet(arch, block, layers, pretrained, progress, **kwargs):
model = ResNet(block, layers, **kwargs)
if pretrained:
state_dict = load_state_dict_from_url(model_urls[arch],
progress=progress)
model.load_state_dict(state_dict)
return model
- Finally come the individual ResNet constructors (the ResNeXt / Wide ResNet variants, shown after the snippet, are built the same way):
def resnet18(pretrained=False, progress=True, **kwargs):
r"""ResNet-18 model from
`"Deep Residual Learning for Image Recognition" <https://arxiv.org/pdf/1512.03385.pdf>`_
Args:
pretrained (bool): If True, returns a model pre-trained on ImageNet
progress (bool): If True, displays a progress bar of the download to stderr
"""
return _resnet('resnet18', BasicBlock, [2, 2, 2, 2], pretrained, progress,
**kwargs)
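The same _resnet helper is reused for the ResNeXt and Wide ResNet variants discussed earlier; to the best of my recollection, torchvision simply overrides the groups / width_per_group kwargs, roughly like this:

def resnext50_32x4d(pretrained=False, progress=True, **kwargs):
    # cardinality 32, per-group width 4 -> Bottleneck width = 128 in the first stage
    kwargs['groups'] = 32
    kwargs['width_per_group'] = 4
    return _resnet('resnext50_32x4d', Bottleneck, [3, 4, 6, 3],
                   pretrained, progress, **kwargs)

def wide_resnet50_2(pretrained=False, progress=True, **kwargs):
    # WRN-style widening: double the per-group width
    kwargs['width_per_group'] = 64 * 2
    return _resnet('wide_resnet50_2', Bottleneck, [3, 4, 6, 3],
                   pretrained, progress, **kwargs)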
Other ResNet Variants
Hand-designing strong model structures keeps progressing as well; as a counterpoint to NAS, carefully hand-crafted structures can still reach excellent performance.
ResNeSt: Split-Attention Networks
- 🔑 Key:
- New backbone for detection & semantic segmentation
- Split Attention across feature map groups (within a block)
- A new ResNet Cell
- 🎓 Source:
- 🌱 Motivation:
- A new ResNet cell, plug-and-play with ResNet
- Helps downstream tasks (like detection or segmentation)
- ResNet was designed for image classification, and recent work focuses on depthwise & group-wise conv; however, for downstream tasks cross-channel information is critical
- Small receptive field, no cross-channel interaction
- Creates a versatile backbone with universally improved feature representations
- 💊 Methodology:
- Split the feature maps into groups (the number of groups is the cardinality)
- Radix is the number of splits within each group
- Split Attention
- Element-wise sum across the splits
- Average pooling over H, W to gather global contextual information
- Channel-wise weighted soft fusion
- Finally the groups are concatenated
- The split-attention block output is combined with a shortcut
- If strided, an extra transformation is applied before the addition
- Relation to other attention methods
- SE (Squeeze-and-Excitation) uses global context to predict channel-wise attention factors
- When the radix r = 1, ResNeSt reduces to SENet; the only difference is:
- SE is applied over the whole block regardless of groups, while ResNeSt applies it within each group
- SKNet fuses features between network branches
- (the authors argue this can be inefficient and hard to scale to larger groups)
- 📐 Exps:
- 3x3 Max pooling instead of strided conv
- All the training tricks (a little concerning for fair comparison, though we could adopt them too)
- 💡 Ideas:
- SENet
- Channel-wise aggregation gathers the global context, then a set of weights is learned
- The excitation is actually an fc (a minimal sketch follows this list)
- SKNet
- Split - Fuse - Select
- Two grouped-conv branches with different kernel sizes
- Fuse is a squeeze-and-excitation-style block
- Two softmaxes produce the channel weights; the branches are weighted by them and then added
- Sacrifices a few FLOPs for better performance
- Combines attention with larger kernels
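A minimal SE block sketch (my own, following the Squeeze-and-Excitation paper's description: global average pooling, two fc layers with a reduction ratio, sigmoid, channel re-weighting):

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global context per channel
        self.fc = nn.Sequential(             # excitation: bottlenecked fc
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # re-weight each channel

print(SEBlock(64)(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])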
DenseNet
- From Gao Huang (Tsinghua)
- Formulation
- Plain residual: x_l = H(x_{l-1}) + x_{l-1}
- Dense: x_l = H([x_0, x_1, ..., x_{l-1}]); every layer is connected to all the layers before it
- Features are combined by concatenation rather than addition
- Parameters
- Growth rate k: the l-th layer has k_0 + k*(l-1) input channels
- k can be as small as 12
- A 1x1 conv serves as a bottleneck, with 4k output channels
- Compression ratio: fewer channels at the transition layers (where pooling is applied); a minimal dense-layer sketch follows this list
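A minimal sketch of one dense layer with the 1x1 bottleneck (my own; layer sizes arbitrary), showing the concat-based connectivity and the k_0 + k*(l-1) growth of input channels:

import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """BN-ReLU-1x1(4k) -> BN-ReLU-3x3(k), output concatenated onto the input."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, 4 * growth_rate, 1, bias=False),  # 1x1 bottleneck, 4k channels
            nn.BatchNorm2d(4 * growth_rate), nn.ReLU(inplace=True),
            nn.Conv2d(4 * growth_rate, growth_rate, 3, padding=1, bias=False),
        )

    def forward(self, x):
        return torch.cat([x, self.net(x)], dim=1)  # concat, not add

k = 12
block = nn.Sequential(*[DenseLayer(64 + i * k, k) for i in range(4)])
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 112, 32, 32]) = 64 + 4*12 channels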
CondenseNet
- Also from Gao Huang (Tsinghua)
- Learned group convolution + dense connectivity
- Targets mobile deployment and high efficiency; learns which connections actually matter
- DenseNet's dense connectivity + a learned group conv that removes the relatively useless connections
- Normal Conv & Group Conv
- Naively putting group conv into the 1x1 layers causes an accuracy drop
- In DenseNet, even a feature map that is unrelated to every group in the current layer still reaches later layers (because of the concatenated shortcuts)
- Implementation
- Two stages (a toy masking sketch follows this list)
- Condensing: sparsity-inducing regularization; connections whose filters have small absolute value are pruned (the sparsity pattern is kept identical within each group so the result maps onto a standard group conv)
- Optimizing: normal training continues with the mask applied
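A toy illustration of the two-stage idea (my simplification, not CondenseNet's actual group-wise procedure): score the 1x1-conv connections by weight magnitude, freeze a binary mask, and keep training with the mask applied.

import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Conv2d(64, 128, kernel_size=1, bias=False)

# "condensing": keep only the strongest half of the (out, in) connections
with torch.no_grad():
    importance = conv.weight.abs().view(128, 64)
    threshold = importance.flatten().kthvalue(importance.numel() // 2).values
    mask = (importance > threshold).float().view_as(conv.weight)

# "optimizing": train as usual, but always apply the frozen mask to the weights
def masked_forward(x):
    return F.conv2d(x, conv.weight * mask)

print(masked_forward(torch.randn(1, 64, 8, 8)).shape)  # torch.Size([1, 128, 8, 8])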