Reshaping before as_strided for optimisation

Problem description

import numpy as np
from numpy.lib.stride_tricks import as_strided

def forward(x, f, s):
    B, H, W, C = x.shape  # e.g. 64, 16, 16, 3
    Fh, Fw, C, _ = f.shape  # e.g. 4, 4, 3, 3
    # C is redeclared to emphasise that the dimension is the same

    Sh, Sw = s  # e.g. 2, 2

    # Six dimensions, one per stride passed below
    strided_shape = B, 1 + (H - Fh) // Sh, 1 + (W - Fw) // Sw, Fh, Fw, C

    x = as_strided(x, strided_shape, strides=(
        x.strides[0], Sh * x.strides[1], Sw * x.strides[2],
        x.strides[1], x.strides[2], x.strides[3]))

    # print(x.flags, f.flags)

    # The reshaping changes the einsum from 'wxyijk,ijkd->wxyd' to 'wxyz,zd->wxyd'
    f = f.reshape(-1, f.shape[-1])
    x = x.reshape(*x.shape[:3], -1)  # Bottleneck!

    return np.einsum('wxyz,zd->wxyd', x, f, optimize='optimal')

(By contrast, the variant without the reshaping just uses return np.einsum('wxyijk,ijkd->wxyd', x, f).)
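As a sanity check that the two variants really compute the same thing, here is a minimal sketch on small random inputs. The helper name `windows` and the array sizes are assumptions made for illustration; they are not part of the original code.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided


def windows(x, fh, fw, sh, sw):
    """Hypothetical helper: (B, out_h, out_w, fh, fw, C) strided view of x, no copy."""
    b, h, w, c = x.shape
    shape = (b, 1 + (h - fh) // sh, 1 + (w - fw) // sw, fh, fw, c)
    s0, s1, s2, s3 = x.strides
    return as_strided(x, shape, strides=(s0, sh * s1, sw * s2, s1, s2, s3),
                      writeable=False)


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 8, 3))   # assumed small input
f = rng.standard_normal((4, 4, 3, 5))   # assumed small filter bank
xs = windows(x, 4, 4, 2, 2)

# Variant without reshaping: contract the three window axes directly.
direct = np.einsum('wxyijk,ijkd->wxyd', xs, f)

# Variant with reshaping: flatten the window axes first, then contract one axis.
flat = np.einsum('wxyz,zd->wxyd',
                 xs.reshape(*xs.shape[:3], -1),
                 f.reshape(-1, f.shape[-1]))

same = np.allclose(direct, flat)
print(direct.shape, same)
```

Both paths flatten the (Fh, Fw, C) axes in the same C order, which is why the results agree up to floating-point rounding.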

For reference, here are the flags of x and f before the reshaping:

x.flags:

C_CONTIGUOUS : False
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False


f.flags:

C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False

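Interestingly, the major bottleneck in the routine is not the einsum but the reshaping (flattening) of x. I understand that f does not suffer from this problem because its memory is C-contiguous, so the reshape amounts to a quick internal change without touching the data. But since x is not C-contiguous (and does not own its data, for that matter), the reshape is far more costly because it has to copy the data, frequently fetching elements that are not cache-aligned. This in turn results from the as_strided call performed on x: the strides have to be altered in a way that disturbs the natural ordering. (FYI, as_strided is incredibly fast and should be fast no matter what strides are passed to it.)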
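The view-versus-copy difference described above can be observed directly with np.shares_memory. This is a minimal sketch with assumed small sizes, not a benchmark:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

x = np.arange(2 * 6 * 6 * 3, dtype=np.float64).reshape(2, 6, 6, 3)
f = np.arange(4 * 4 * 3 * 3, dtype=np.float64).reshape(4, 4, 3, 3)

# 4x4 windows with stride 2 over a 6x6 image give a 2x2 grid of windows.
s0, s1, s2, s3 = x.strides
xs = as_strided(x, (2, 2, 2, 4, 4, 3),
                strides=(s0, 2 * s1, 2 * s2, s1, s2, s3), writeable=False)

f_flat = f.reshape(-1, f.shape[-1])       # C-contiguous input: reshape returns a view
x_flat = xs.reshape(*xs.shape[:3], -1)    # overlapping windows: reshape must copy

f_is_view = np.shares_memory(f, f_flat)
x_is_view = np.shares_memory(x, x_flat)
print(f_is_view, x_is_view)
print(x.nbytes, x_flat.nbytes)
```

Because the windows overlap, the flattened copy is larger than the original array, so the cost of the reshape grows with the filter size as well as with the image size.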

Is there a way to achieve the same result without incurring such a bottleneck? Perhaps by reshaping x before using as_strided?

Also note that for nearly 100% of applications: B: [1-64], H, W: [1-60], C: [1-8], Fh, Fw: [1-12]

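I have also included some graphs here of the timings on my device as the tensor dimensions B (batch size) and H, W (image size) vary. As you can see, the variant with the reshape is already competitive with TensorFlow.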

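Edit: an interesting finding. The reshaping algorithm beats the non-reshaping one by a factor of 5 on the CPU, but when I use the GPU (i.e. CuPy instead of NumPy), both algorithms are equally fast (about twice as fast as TensorFlow's forward pass).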

Solution

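Reshaping the strided array is somewhat expensive for the reason you mention (a non-contiguous array has to be copied), but not as expensive as you think. np.einsum may actually be the bottleneck in your application, depending on the tensor sizes. As mentioned in Convolutional layer in Python using Numpy, np.tensordot can be a good replacement for np.einsum.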
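To make the suggested replacement concrete, here is a sketch of the question's forward pass written with np.tensordot. The function name `forward_tensordot` and the demo arrays are assumptions made for illustration:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided


def forward_tensordot(x, f, s):
    """Hypothetical tensordot variant of the question's forward()."""
    B, H, W, C = x.shape
    Fh, Fw, _, _ = f.shape
    Sh, Sw = s
    shape = (B, 1 + (H - Fh) // Sh, 1 + (W - Fw) // Sw, Fh, Fw, C)
    s0, s1, s2, s3 = x.strides
    xs = as_strided(x, shape, strides=(s0, Sh * s1, Sw * s2, s1, s2, s3),
                    writeable=False)
    # axes=3 contracts the last three axes of xs with the first three axes of f.
    return np.tensordot(xs, f, axes=3)


rng = np.random.default_rng(0)
x_demo = rng.standard_normal((2, 16, 16, 3))
f_demo = rng.standard_normal((4, 4, 3, 3))

out = forward_tensordot(x_demo, f_demo, (2, 2))

# Cross-check against the einsum formulation on the same strided view.
s0, s1, s2, s3 = x_demo.strides
xs_demo = as_strided(x_demo, (2, 7, 7, 4, 4, 3),
                     strides=(s0, 2 * s1, 2 * s2, s1, s2, s3), writeable=False)
ref = np.einsum('wxyijk,ijkd->wxyd', xs_demo, f_demo)
matches = np.allclose(out, ref)
print(out.shape, matches)
```

Note that np.tensordot reshapes its operands internally before calling a matrix product, so the copy of the strided view still happens; it is just no longer written out explicitly.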

As a simple example:

x = np.arange(64*221*221*3).reshape((64, 221, 221, 3))
f = np.arange(4*4*3*5).reshape((4, 4, 3, 5))
s = (2, 2)

B, H, W, C = x.shape  # 64, 221, 221, 3
Fh, Fw, C, _ = f.shape  # 4, 4, 3, 5
Sh, Sw = s  # 2, 2
strided_shape = B, 1 + (H - Fh) // Sh, 1 + (W - Fw) // Sw, Fh, Fw, C
print(strided_shape)
# (64, 109, 109, 4, 4, 3)
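The shape formula and the stride tuple are easy to get wrong, so it can help to check them on a tiny array. The sketch below uses separate, assumed names (`x_small`, `win`) so it does not clobber the variables above, and verifies that each strided window equals the corresponding plain slice:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

x_small = np.arange(2 * 9 * 9 * 3).reshape(2, 9, 9, 3)
fh, fw, sh, sw = 4, 4, 2, 2
b, h, w, c = x_small.shape

# Number of window positions along each spatial axis: 1 + (9 - 4) // 2 = 3.
out_h, out_w = 1 + (h - fh) // sh, 1 + (w - fw) // sw

s0, s1, s2, s3 = x_small.strides
win = as_strided(x_small, (b, out_h, out_w, fh, fw, c),
                 strides=(s0, sh * s1, sw * s2, s1, s2, s3), writeable=False)

# Window (i, j) should be the fh x fw patch starting at row i*sh, column j*sw.
windows_ok = all(
    np.array_equal(win[:, i, j],
                   x_small[:, i * sh:i * sh + fh, j * sw:j * sw + fw, :])
    for i in range(out_h)
    for j in range(out_w)
)
print(win.shape, windows_ok)
```

The same formula gives 1 + (221 - 4) // 2 = 109 positions per axis for the 221x221 example.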

With the variables initialised, we can time the individual parts of the code:

%timeit x_strided = as_strided(x,strided_shape,strides=(x.strides[0],Sh * x.strides[1],Sw * x.strides[2],x.strides[1],x.strides[2],x.strides[3]),)
>>> 7.11 µs ± 118 ns per loop (mean ± std. dev. of 7 runs,100000 loops each)

%timeit f_reshaped = f.reshape(-1,f.shape[-1])
>>> 450 ns ± 7.43 ns per loop (mean ± std. dev. of 7 runs,1000000 loops each)

%timeit x_reshaped = x_strided.reshape(*x_strided.shape[:3],-1) # Bottleneck!
>>> 94.6 ms ± 896 µs per loop (mean ± std. dev. of 7 runs,10 loops each)

# einsum without reshape
%timeit np.einsum('wxy...,...d->wxyd',x_strided,f,optimize='optimal')
>>> 809 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs,1 loop each)

# einsum with reshape
%%timeit
f_reshaped = f.reshape(-1,f.shape[-1])
x_reshaped = x_strided.reshape(*x_strided.shape[:3],-1) # Bottleneck!
k = np.einsum('wxyz,zd->wxyd',x_reshaped,f_reshaped,optimize='optimal')
>>> 549 ms ± 3.05 ms per loop (mean ± std. dev. of 7 runs,1 loop each)

# tensordot without reshape
%timeit k = np.tensordot(x_strided, f, axes=3)
>>> 271 ms ± 4.89 ms per loop (mean ± std. dev. of 7 runs,1 loop each)

# tensordot with reshape
%%timeit
f_reshaped = f.reshape(-1, f.shape[-1])
x_reshaped = x_strided.reshape(*x_strided.shape[:3], -1) # Bottleneck!
k = np.tensordot(x_reshaped, f_reshaped, axes=(3, 0))
>>> 266 ms ± 3.15 ms per loop (mean ± std. dev. of 7 runs,1 loop each)

I get similar results with the tensor sizes from your code (i.e. 64, 16, 16, 3 and 4, 4, 3, 3).

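As you can see, the reshape operation carries an overhead, but it makes the matrix operations faster because the data is contiguous. Note that results will vary with CPU speed, CPU architecture/generation, etc.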
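Since the numbers depend on the machine, it may be worth re-running the comparison locally. The following is a rough sketch that uses the standard timeit module instead of the IPython %timeit magic; the array sizes are deliberately small assumptions so the script finishes quickly, and the timings it prints are not comparable to the figures above.

```python
import timeit

import numpy as np
from numpy.lib.stride_tricks import as_strided

# Assumed small sizes; increase them to approach the example above.
x = np.random.default_rng(0).standard_normal((8, 32, 32, 3))
f = np.random.default_rng(1).standard_normal((4, 4, 3, 5))
Sh, Sw = 2, 2

B, H, W, C = x.shape
Fh, Fw, _, D = f.shape
shape = (B, 1 + (H - Fh) // Sh, 1 + (W - Fw) // Sw, Fh, Fw, C)
s0, s1, s2, s3 = x.strides
x_strided = as_strided(x, shape, strides=(s0, Sh * s1, Sw * s2, s1, s2, s3),
                       writeable=False)
f_flat = f.reshape(-1, D)


def flatten():
    # The copy discussed above: flatten the (Fh, Fw, C) window axes.
    return x_strided.reshape(*shape[:3], -1)


candidates = {
    "einsum, no reshape": lambda: np.einsum('wxyijk,ijkd->wxyd', x_strided, f, optimize='optimal'),
    "einsum, reshape": lambda: np.einsum('wxyz,zd->wxyd', flatten(), f_flat, optimize='optimal'),
    "tensordot, no reshape": lambda: np.tensordot(x_strided, f, axes=3),
    "tensordot, reshape": lambda: np.tensordot(flatten(), f_flat, axes=(3, 0)),
}

results = {}
for name, fn in candidates.items():
    # Best of 5 repeats of 3 calls each, reported per call in milliseconds.
    results[name] = min(timeit.repeat(fn, repeat=5, number=3)) / 3 * 1e3
    print(f"{name:24s} {results[name]:8.3f} ms")
```

Taking the minimum over repeats reduces noise from other processes; the reshape is included inside the timed call for the reshaping variants so that its cost is counted.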

