如何在 Fortran 中通过 BLAS 加速高阶张量收缩的重塑？

问题描述

如果我有张量收缩 A[a,b] * B[b,c,d] = C[a,d] 如果我使用BLAS，我想我需要DGemm（假设真实值），那么我可以

首先将张量 B[b,d] 重塑为 D[b,e] where e = c*d,
DGemm，A[a,b] * D[b,e] = E[a,e]
将E[a,e]改造成C[a,d]

问题是，reshape 没有那么快 :( 我在 Fortran: Which method is faster to change the rank of arrays? (Reshape vs. Pointer) 中看到了讨论，在上面的链接中，作者遇到了一些错误信息，除了reshape本身。

因此，我在问是否有一个方便的解决方案。

解决方法

[我在维度的大小前面加上了字母 n 以避免在下面混淆张量和张量的大小]

正如评论中所讨论的，没有必要重塑。 Dgemm 没有张量的概念，它只知道数组。它所关心的只是这些数组在内存中以正确的顺序排列。由于 Fortran 是列主要的，如果您使用 3 维数组来表示问题中的 3 维张量 B，它将在内存中与用于表示 2 维的 2 维数组完全相同张量 D。就矩阵 mult 而言，您现在需要做的就是获得形成正确长度的结果的点积。这使您得出结论，如果您告诉 dgemm B 的前导暗度为 nb，第二个暗度为 nc*nd，您将得到正确的结果。这导致我们

ian@eris:~/work/stack$ gfortran --version
GNU Fortran (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
Copyright (C) 2017 Free Software Foundation,Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

ian@eris:~/work/stack$ cat reshape.f90
Program reshape_for_blas

  Use,Intrinsic :: iso_fortran_env,Only :  wp => real64,li => int64

  Implicit None

  Real( wp ),Dimension( :,:    ),Allocatable :: a
  Real( wp ),:,: ),Allocatable :: b
  Real( wp ),Allocatable :: c1,c2
  Real( wp ),Allocatable :: d
  Real( wp ),Allocatable :: e

  Integer :: na,nb,nc,nd,ne
  
  Integer( li ) :: start,finish,rate

  Write( *,* ) 'na,nd ?'
  Read( *,* ) na,nd
  ne = nc * nd
  Allocate( a ( 1:na,1:nb ) ) 
  Allocate( b ( 1:nb,1:nc,1:nd ) ) 
  Allocate( c1( 1:na,1:nd ) ) 
  Allocate( c2( 1:na,1:nd ) ) 
  Allocate( d ( 1:nb,1:ne ) ) 
  Allocate( e ( 1:na,1:ne ) ) 

  ! Set up some data
  Call Random_number( a )
  Call Random_number( b )

  ! With reshapes
  Call System_clock( start,rate )
  d = Reshape( b,Shape( d ) )
  Call dgemm( 'N','N',na,ne,1.0_wp,a,Size( a,Dim = 1 ),&
                                            d,Size( d,&
                                    0.0_wp,e,Size( e,Dim = 1 ) )
  c1 = Reshape( e,Shape( c1 ) )
  Call System_clock( finish,rate )
  Write( *,* ) 'Time for reshaping method ',Real( finish - start,wp ) / rate
  
  ! Direct
  Call System_clock( start,rate )
  Call dgemm( 'N',&
                                            b,Size( b,&
                                            0.0_wp,c2,Size( c2,Dim = 1 ) )
  Call System_clock( finish,* ) 'Time for straight  method ',wp ) / rate

  Write( *,* ) 'Difference between result matrices ',Maxval( Abs( c1 - c2 ) )

End Program reshape_for_blas
ian@eris:~/work/stack$ cat in
40 50 60 70
ian@eris:~/work/stack$ gfortran -std=f2008 -Wall -Wextra -fcheck=all reshape.f90  -lblas
ian@eris:~/work/stack$ ./a.out < in
 na,nd ?
 Time for reshaping method    1.0515256000000001E-002
 Time for straight  method    5.8608790000000003E-003
 Difference between result matrices    0.0000000000000000     
ian@eris:~/work/stack$ gfortran -std=f2008 -Wall -Wextra  reshape.f90  -lblas
ian@eris:~/work/stack$ ./a.out < in
 na,nd ?
 Time for reshaping method    1.3585931000000001E-002
 Time for straight  method    1.6730429999999999E-003
 Difference between result matrices    0.0000000000000000

也就是说，我认为值得注意的是，重塑的开销是 O(N^2) 而矩阵乘法的时间是 O(N^3)。因此，对于大型矩阵，由于重塑导致的百分比开销将趋于零。现在代码性能不是唯一的考虑因素，代码的可读性和可维护性也很重要。因此，如果您发现 reshape 方法更具可读性，并且您使用的矩阵足够大以至于开销不重要，那么您很可能会使用 reshapes，因为在这种情况下，代码可读性可能比性能更重要。您的来电。

arrays arrays arrays blas fortran reshape