Overview: Following convolutional neural networks, Transformers have pushed image recognition forward and become another dominant force in vision. It has recently been argued that this superiority should be credited to the self-attention architecture itself. This article questions that belief through a careful study of Transformer designs and proposes three highly effective architecture designs; combining these components yields a pure CNN architecture whose robustness matches or even exceeds that of Transformers.
The recent success of Vision Transformers has shaken the decade-long dominance of convolutional neural networks (CNNs) in image recognition. Specifically, in terms of robustness to out-of-distribution samples, recent research finds that Transformers are inherently more robust than CNNs regardless of the training setup, and this superiority is largely attributed to their self-attention-like architecture itself. In this paper, the authors question this belief by carefully examining the design of Transformers. Their findings lead to three highly effective architecture designs for boosting robustness, yet simple enough to be implemented in a few lines of code, namely:
a) Patchifying input images;
b) Enlarging the kernel size;
c) Reducing the number of activation and normalization layers.
Bringing these components together, the authors are able to build pure CNN architectures, without any attention-like operations, that are as robust as, or even more robust than, Transformers. They hope this work helps the community better understand the design of robust neural architectures. Code: https://github.com/UCSC-VLAA/RobustCNN
A Transformer block is itself already a compound design. Moreover, Transformers contain many other layers (e.g., the patch embedding layer), so the relationship between robustness and Transformers' individual architectural elements remains unclear.
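To make this concrete, below is a minimal sketch (an illustration, not the paper's code) of a standard ViT encoder block: even before counting the patch embedding, a single block already interleaves two LayerNorms, a self-attention module, an MLP with its own activation, and two residual connections.

import torch.nn as nn

class ViTBlockSketch(nn.Module):
    """Minimal standard ViT encoder block: norm -> attention -> residual,
    then norm -> MLP (with activation) -> residual."""
    def __init__(self, dim=384, num_heads=6, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):  # x: (batch, tokens, dim)
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x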
First, patchifying images into non-overlapping patches substantially improves out-of-distribution robustness; more interestingly, regarding the choice of patch size, the larger the better.
Second, although small convolution kernels are a popular design choice, the authors observe that adopting larger kernels (e.g., going from 3×3 to 7×7 or even 11×11) is necessary to secure model robustness on out-of-distribution samples.
Finally, inspired by recent work, the authors note that reducing the number of normalization layers and activation functions benefits out-of-distribution robustness; moreover, with fewer normalization layers, training can be sped up by as much as 23%. (A minimal sketch of all three changes follows this list.)
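As referenced above, here is a minimal sketch of all three changes applied to a ResNet-style network; the channel counts and patch size are illustrative assumptions, not the authors' exact configuration.

import torch
import torch.nn as nn

# 1) Patchify stem: replace the 7x7/stride-2 conv + max-pool with a single
#    non-overlapping convolution (kernel_size == stride == patch size).
patchify_stem = nn.Conv2d(3, 96, kernel_size=16, stride=16)

# 2) Enlarge the kernel: use a large depth-wise convolution (groups == channels)
#    so that going from 3x3 to 11x11 stays cheap in FLOPs.
large_dw_conv = nn.Conv2d(96, 96, kernel_size=11, padding=5, groups=96)

# 3) Reduce activation/normalization layers: keep a single BatchNorm and a
#    single ReLU per block instead of one after every convolution.
block = nn.Sequential(
    nn.Conv2d(96, 96, kernel_size=1),
    large_dw_conv,
    nn.BatchNorm2d(96),
    nn.ReLU(inplace=True),
    nn.Conv2d(96, 96, kernel_size=1),
)

x = torch.randn(1, 3, 224, 224)
print(block(patchify_stem(x)).shape)  # torch.Size([1, 96, 14, 14])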
The first block is a depth-wise ResNet bottleneck block, in which the 3×3 convolution is replaced by a 3×3 depth-wise convolution.
The second block is an inverted depth-wise ResNet bottleneck block, in which the hidden dimension is 4× the input dimension.
The third block builds on the second, moving the depth-wise convolution up to the position it occupies in ConvNeXt.
The fourth block also builds on the second, instead moving the depth-wise convolution down. (Schematic layer orderings for all four blocks are sketched below.)
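The essential difference among the four blocks is where the depth-wise convolution sits and how wide the hidden dimension is. The following schematic shows the layer orderings only (strides, downsampling, normalization, activations, and residual connections are omitted); it is an illustration under assumed channel widths, not the repository's actual block code.

import torch.nn as nn

def dw(c, k=7):
    # depth-wise convolution: groups equals channels
    return nn.Conv2d(c, c, kernel_size=k, padding=k // 2, groups=c)

def pw(cin, cout):
    # point-wise (1x1) convolution
    return nn.Conv2d(cin, cout, kernel_size=1)

d = 96
# Block 1 (ResNet DW): classic bottleneck, 3x3 conv replaced by depth-wise conv.
block1 = nn.Sequential(pw(4 * d, d), dw(d), pw(d, 4 * d))
# Block 2 (Inverted DW): hidden dimension is 4x the input dimension.
block2 = nn.Sequential(pw(d, 4 * d), dw(4 * d), pw(4 * d, d))
# Block 3 (Up-Inverted DW): depth-wise conv moved up, as in ConvNeXt.
block3 = nn.Sequential(dw(d), pw(d, 4 * d), pw(4 * d, d))
# Block 4 (Down-Inverted DW): depth-wise conv moved down to the end.
block4 = nn.Sequential(pw(d, 4 * d), pw(4 * d, d), dw(d))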
Stylized-ImageNet, which contains synthetic images with conflicting shape and texture cues;
ImageNet-C, with a variety of common image corruptions;
ImageNet-R, which contains natural renditions of ImageNet object classes with different textures and local image statistics;
ImageNet-Sketch, which includes sketch images of the same ImageNet classes collected online.
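ImageNet-R and ImageNet-Sketch ship in an ImageNet-style folder layout, so out-of-distribution top-1 accuracy can be measured with a plain evaluation loop like the hypothetical one below (`model` and `data_dir` are placeholders; note that ImageNet-R covers a 200-class subset, so in practice the logits must additionally be remapped to that subset).

import torch
from torchvision import datasets, transforms

def evaluate_top1(model, data_dir, batch_size=64, device='cuda'):
    # Standard ImageNet eval preprocessing.
    tf = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
    loader = torch.utils.data.DataLoader(
        datasets.ImageFolder(data_dir, tf), batch_size=batch_size)
    model.eval().to(device)
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images.to(device)).argmax(dim=1)
            correct += (preds.cpu() == labels).sum().item()
            total += labels.numel()
    return correct / total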
In this section, the authors aim to mimic the behavior of self-attention blocks by enlarging the kernel size of the depth-wise convolutions. As shown in Figure 3, they experiment with kernel sizes of 5, 7, 9, 11, and 13 and evaluate performance on the different robustness benchmarks. The results show that larger kernels generally bring better clean accuracy and stronger robustness, although the gains gradually saturate once the kernel becomes too large.
Note that using (standard) convolutions with larger kernels leads to a significant increase in computation. For example, directly changing the kernel size in ResNet-50 from 3 to 5 already yields a model with 7.4 GFLOPs in total, much larger than its Transformer counterpart.
With depth-wise convolutions, however, increasing the kernel size from 3 to 13 typically adds only about 0.3 GFLOPs, which is small relative to the 4.6 GFLOPs of DeiT-S.
The only exception is ResNet-Inverted-DW: because of the large channel dimension in its inverted bottleneck, increasing the kernel size from 3 to 13 adds 1.4 GFLOPs, making the comparison somewhat unfair. Incidentally, using a patchify stem with a large patch size offsets the extra computation incurred by large kernels.
As a result, the final models remain at the same scale as DeiT-S, even when multiple proposed designs are combined.
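A back-of-the-envelope count makes the FLOPs argument explicit; the feature-map shapes below are illustrative, not the paper's exact accounting.

# MAC counts for one conv layer on an HxW feature map.
# Standard conv:   H * W * C_in * C_out * k * k
# Depth-wise conv: H * W * C * k * k   (each channel convolved separately)
def conv_macs(h, w, cin, cout, k, depthwise=False):
    per_position = cin * k * k if depthwise else cin * cout * k * k
    return h * w * per_position

# Example: a 384-channel 14x14 feature map, kernel 3 -> 13.
std_3  = conv_macs(14, 14, 384, 384, 3)
std_13 = conv_macs(14, 14, 384, 384, 13)
dw_3   = conv_macs(14, 14, 384, 384, 3,  depthwise=True)
dw_13  = conv_macs(14, 14, 384, 384, 13, depthwise=True)
print(f'standard conv: {std_3/1e6:.1f}M -> {std_13/1e6:.1f}M MACs per layer')
print(f'depth-wise   : {dw_3/1e6:.1f}M -> {dw_13/1e6:.1f}M MACs per layer')
# The depth-wise increase is roughly C_out times smaller, which is why the
# whole network grows by only ~0.3 GFLOPs when going from 3 to 13.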
Knowledge distillation is a technique for training a weaker student model by transferring knowledge from a stronger teacher model. Typically, the student can reach performance similar to, or even better than, the teacher's through distillation. However, directly applying knowledge distillation to make ResNet-50 (the student) learn from DeiT-S (the teacher) is largely ineffective at enhancing robustness.
Surprisingly, when the roles are switched, the student DeiT-S significantly outperforms the teacher ResNet-50 on a range of robustness benchmarks, leading to the conclusion that the key to DeiT's strong robustness lies in its architecture and therefore cannot be transferred to ResNet through knowledge distillation.
To investigate further, the authors repeat these experiments using models that combine all three proposed architectural designs as the students, with DeiT-S as the teacher. As the table shows, with the help of the architectural components introduced by ViT, the resulting Robust-ResNet family now consistently outperforms DeiT on out-of-distribution samples. In contrast, although the baseline models achieve good performance on clean ImageNet, they are not as robust as the teacher DeiT.
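For reference, here is a minimal sketch of the standard soft-label distillation objective (Hinton-style KL divergence between temperature-softened logits, mixed with cross-entropy); the exact recipe used in the experiments may differ.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Hinton-style knowledge distillation: KL between temperature-softened
    teacher and student distributions, mixed with the usual cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction='batchmean',
    ) * (T * T)  # rescale gradients after temperature softening
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard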
from collections import OrderedDict
from functools import partial
import torch
import torch.nn as nn
from timm.data import IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD
from .helpers import build_model_with_cfg
from .layers import SelectAdaptivePool2d, AvgPool2dSame
from .layers import RobustResNetDWBlock, RobustResNetDWInvertedBlock, RobustResNetDWUpInvertedBlock, RobustResNetDWDownInvertedBlock
from .registry import register_model
__all__ = ['RobustResNet'] # model_registry will add each entrypoint fn to this
def _cfg(url='', **kwargs):
return {
'url': url,
'num_classes': 1000, 'input_size': (3, 224, 224), 'pool_size': (7, 7),
'crop_pct': 0.875, 'interpolation': 'bicubic',
'mean': IMAGENET_DEFAULT_MEAN, 'std': IMAGENET_DEFAULT_STD,
'first_conv': 'stem.0', 'classifier': 'head.fc',
**kwargs
}
default_cfgs = dict(
small=_cfg(),
base=_cfg(),
)
def get_padding(kernel_size, stride, dilation=1):
padding = ((stride - 1) + dilation * (kernel_size - 1)) // 2
return padding
def downsample_conv(
in_channels, out_channels, kernel_size, stride=1, dilation=1, first_dilation=None, norm_layer=None):
norm_layer = norm_layer or nn.BatchNorm2d
kernel_size = 1 if stride == 1 and dilation == 1 else kernel_size
first_dilation = (first_dilation or dilation) if kernel_size > 1 else 1
p = get_padding(kernel_size, stride, first_dilation)
return nn.Sequential(*[
nn.Conv2d(
in_channels, out_channels, kernel_size, stride=stride, padding=p, dilation=first_dilation, bias=True),
norm_layer(out_channels)
])
def downsample_avg(
in_channels, out_channels, kernel_size, stride=1, dilation=1, first_dilation=None, norm_layer=None):
norm_layer = norm_layer or nn.BatchNorm2d
avg_stride = stride if dilation == 1 else 1
if stride == 1 and dilation == 1:
pool = nn.Identity()
else:
avg_pool_fn = AvgPool2dSame if avg_stride == 1 and dilation > 1 else nn.AvgPool2d
pool = avg_pool_fn(2, avg_stride, ceil_mode=True, count_include_pad=False)
return nn.Sequential(*[
pool,
nn.Conv2d(in_channels, out_channels, 1, stride=1, padding=0, bias=True),
norm_layer(out_channels)
])
class Stage(nn.Module):
def __init__(
self, block_fn, in_chs, chs, stride=2, depth=2, dp_rates=None, layer_scale_init_value=1.0,
            norm_layer=nn.BatchNorm2d, act_layer=partial(nn.ReLU, inplace=True),
avg_down=False, down_kernel_size=1, mlp_ratio=4., inverted=False, **kwargs):
super().__init__()
blocks = []
dp_rates = dp_rates or [0.] * depth
for block_idx in range(depth):
stride_block_idx = depth - 1 if block_fn == RobustResNetDWDownInvertedBlock else 0
current_stride = stride if block_idx == stride_block_idx else 1
downsample = None
if inverted:
if in_chs != chs or current_stride > 1:
down_kwargs = dict(
in_channels=in_chs, out_channels=chs, kernel_size=down_kernel_size,
stride=current_stride, norm_layer=norm_layer)
downsample = downsample_avg(**down_kwargs) if avg_down else downsample_conv(**down_kwargs)
else:
if in_chs != int(mlp_ratio * chs) or current_stride > 1:
down_kwargs = dict(
in_channels=in_chs, out_channels=int(mlp_ratio * chs), kernel_size=down_kernel_size,
stride=current_stride, norm_layer=norm_layer)
downsample = downsample_avg(**down_kwargs) if avg_down else downsample_conv(**down_kwargs)
if downsample is not None:
assert block_idx in [0, depth - 1]
blocks.append(block_fn(
indim=in_chs, dim=chs, drop_path=dp_rates[block_idx], layer_scale_init_value=layer_scale_init_value,
mlp_ratio=mlp_ratio,
norm_layer=norm_layer, act_layer=act_layer,
stride=current_stride,
downsample=downsample,
**kwargs,
))
in_chs = int(chs * mlp_ratio) if not inverted else chs
self.blocks = nn.Sequential(*blocks)
def forward(self, x):
x = self.blocks(x)
return x
class RobustResNet(nn.Module):
    r""" RobustResNet
    A PyTorch impl of the Robust-ResNet family from `Can CNNs Be More Robust Than Transformers?`
    (official code: https://github.com/UCSC-VLAA/RobustCNN)
Args:
in_chans (int): Number of input image channels. Default: 3
num_classes (int): Number of classes for classification head. Default: 1000
depths (tuple(int)): Number of blocks at each stage. Default: [3, 3, 9, 3]
dims (tuple(int)): Feature dimension at each stage. Default: [96, 192, 384, 768]
drop_rate (float): Head dropout rate
drop_path_rate (float): Stochastic depth rate. Default: 0.
layer_scale_init_value (float): Init value for Layer Scale. Default: 1e-6.
head_init_scale (float): Init scaling value for classifier weights and biases. Default: 1.
"""
def __init__(
self, block_fn, in_chans=3, num_classes=1000, global_pool='avg', output_stride=32,
patch_size=16, stride_stage=(3, ),
depths=(3, 3, 9, 3), dims=(96, 192, 384, 768), layer_scale_init_value=1e-6,
head_init_scale=1., head_norm_first=False,
norm_layer=nn.BatchNorm2d, act_layer=partial(nn.ReLU, inplace=True),
drop_rate=0., drop_path_rate=0., mlp_ratio=4., block_args=None,
):
super().__init__()
assert block_fn in [RobustResNetDWBlock, RobustResNetDWInvertedBlock, RobustResNetDWUpInvertedBlock, RobustResNetDWDownInvertedBlock]
self.inverted = True if block_fn != RobustResNetDWBlock else False
assert output_stride == 32
self.num_classes = num_classes
self.drop_rate = drop_rate
self.feature_info = []
block_args = block_args or dict()
print(f'using block args: {block_args}')
assert patch_size == 16
self.stem = nn.Conv2d(in_chans, dims[0], kernel_size=patch_size, stride=patch_size)
curr_stride = patch_size
self.stages = nn.Sequential()
dp_rates = [x.tolist() for x in torch.linspace(0, drop_path_rate, sum(depths)).split(depths)]
prev_chs = dims[0]
stages = []
# 4 feature resolution stages, each consisting of multiple residual blocks
for i in range(4):
stride = 2 if i in stride_stage else 1
curr_stride *= stride
chs = dims[i]
stages.append(Stage(
block_fn, prev_chs, chs, stride=stride,
depth=depths[i], dp_rates=dp_rates[i], layer_scale_init_value=layer_scale_init_value,
norm_layer=norm_layer, act_layer=act_layer, mlp_ratio=mlp_ratio,
inverted=self.inverted, **block_args)
)
prev_chs = int(mlp_ratio * chs) if not self.inverted else chs
self.feature_info += [dict(num_chs=prev_chs, reduction=curr_stride, module=f'stages.{i}')]
self.stages = nn.Sequential(*stages)
assert curr_stride == output_stride
self.num_features = prev_chs
self.norm_pre = nn.Identity()
self.head = nn.Sequential(OrderedDict([
('global_pool', SelectAdaptivePool2d(pool_type=global_pool)),
# ('norm', norm_layer(self.num_features)),
('flatten', nn.Flatten(1) if global_pool else nn.Identity()),
('drop', nn.Dropout(self.drop_rate)),
('fc', nn.Linear(self.num_features, num_classes) if num_classes > 0 else nn.Identity())
]))
self.resnet_init_weights()
def resnet_init_weights(self):
for n, m in self.named_modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                if m.bias is not None:  # blocks may use bias-free convolutions
                    nn.init.zeros_(m.bias)
elif isinstance(m, nn.BatchNorm2d):
nn.init.ones_(m.weight)
nn.init.zeros_(m.bias)
def get_classifier(self):
return self.head.fc
def reset_classifier(self, num_classes=0, global_pool='avg'):
        # pool -> flatten -> drop -> fc (mirrors the head built in __init__, which has no norm layer)
        self.num_classes = num_classes
        self.head = nn.Sequential(OrderedDict([
            ('global_pool', SelectAdaptivePool2d(pool_type=global_pool)),
('flatten', nn.Flatten(1) if global_pool else nn.Identity()),
('drop', nn.Dropout(self.drop_rate)),
('fc', nn.Linear(self.num_features, num_classes) if num_classes > 0 else nn.Identity())
]))
def forward_features(self, x):
x = self.stem(x)
x = self.stages(x)
x = self.norm_pre(x)
return x
def forward(self, x):
x = self.forward_features(x)
x = self.head(x)
return x
def _create_robust_resnet(variant, pretrained=False, **kwargs):
model = build_model_with_cfg(
RobustResNet, variant, pretrained,
default_cfg=default_cfgs[variant],
feature_cfg=dict(out_indices=(0, 1, 2, 3), flatten_sequential=True),
**kwargs)
return model
@register_model
def robust_resnet_dw_small(pretrained=False, **kwargs):
'''
4.49GFLOPs and 38.6MParams
'''
assert not pretrained, 'no pretrained models!'
model_args = dict(block_fn=RobustResNetDWBlock, depths=(3, 4, 12, 3), dims=(96, 192, 384, 768),
block_args=dict(kernel_size=11, padding=5),
patch_size=16, stride_stage=(3,),
norm_layer=nn.BatchNorm2d, act_layer=partial(nn.ReLU, inplace=True),
**kwargs)
model = _create_robust_resnet('small', pretrained=pretrained, **model_args)
return model
@register_model
def robust_resnet_inverted_dw_small(pretrained=False, **kwargs):
'''
4.59GFLOPs and 33.6MParams
'''
assert not pretrained, 'no pretrained models!'
model_args = dict(block_fn=RobustResNetDWInvertedBlock, depths=(3, 4, 14, 3), dims=(96, 192, 384, 768),
block_args=dict(kernel_size=7, padding=3),
patch_size=16, stride_stage=(3,),
norm_layer=nn.BatchNorm2d, act_layer=partial(nn.ReLU, inplace=True),
**kwargs)
model = _create_robust_resnet('small', pretrained=pretrained, **model_args)
return model
@register_model
def robust_resnet_up_inverted_dw_small(pretrained=False, **kwargs):
'''
4.43GFLOPs and 34.4MParams
'''
assert not pretrained, 'no pretrained models!'
model_args = dict(block_fn=RobustResNetDWUpInvertedBlock, depths=(3, 4, 14, 3), dims=(96, 192, 384, 768),
block_args=dict(kernel_size=11, padding=5),
patch_size=16, stride_stage=(3,),
norm_layer=nn.BatchNorm2d, act_layer=partial(nn.ReLU, inplace=True),
**kwargs)
model = _create_robust_resnet('small', pretrained=pretrained, **model_args)
return model
@register_model
def robust_resnet_down_inverted_dw_small(pretrained=False, **kwargs):
'''
4.55GFLOPs and 24.3MParams
'''
assert not pretrained, 'no pretrained models!'
model_args = dict(block_fn=RobustResNetDWDownInvertedBlock, depths=(3, 4, 15, 3), dims=(96, 192, 384, 768),
block_args=dict(kernel_size=11, padding=5),
patch_size=16, stride_stage=(2,),
norm_layer=nn.BatchNorm2d, act_layer=partial(nn.ReLU, inplace=True),
**kwargs)
model = _create_robust_resnet('small', pretrained=pretrained, **model_args)
return model
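Assuming the companion block implementations (`RobustResNetDWBlock` and friends) from the repository are importable, the registered entrypoints can be instantiated through timm and smoke-tested as follows; this usage sketch is not part of the original file.

import timm
import torch

# The @register_model decorators above make the entrypoints visible to timm
# once this module has been imported.
model = timm.create_model('robust_resnet_dw_small', num_classes=1000)
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    logits = model(x)
print(logits.shape)  # torch.Size([1, 1000])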