我的Julia循环/去矢量化代码出了什么问题

问题描述

我正在使用Julia 1.0。请考虑以下代码：

using LinearAlgebra
using Distributions

## create random data
const data = rand(Uniform(-1,2),100000,2)

function test_function_1(data)
    theta = [1 2]
    coefs = theta * data[:,1:2]'
    res   = coefs' .* data[:,1:2]
    return sum(res,dims = 1)'
end

function test_function_2(data)
    theta   = [1 2]
    sum_all = zeros(2)
    for i = 1:size(data)[1]
        sum_all .= sum_all + (theta * data[i,1:2])[1] *  data[i,1:2]
    end
    return sum_all
end

第一次运行后，我给它计时

julia> @time test_function_1(data)
  0.006292 seconds (16 allocations: 5.341 MiB)
2×1 Adjoint{Float64,Array{Float64,2}}:
 150958.47189289227
 225224.0374366073

julia> @time test_function_2(data)
  0.038112 seconds (500.00 k allocations: 45.777 MiB,15.61% gc time)
2-element Array{Float64,1}:
 150958.4718928927
 225224.03743660534

test_function_1在分配和速度上都明显优越，但是test_function_1并未进行矢量化处理。我希望test_function_2的表现更好。请注意，两个功能的作用相同。

我有一种直觉，这是因为在test_function_2中，我使用sum_all .= sum_all + ...，但是我不确定为什么会出现问题。我可以得到提示吗？

解决方法

因此，首先让我评论一下，如果我想使用循环，我将如何编写您的函数：

Position + 2

接下来，这是三个选项的基准比较：

function test_function_3(data)
    theta   = (1,2)
    sum_all = zeros(2)
    for row in eachrow(data)
        sum_all .+= dot(theta,row) .*  row
    end
    return sum_all
end

接下来，如果您在循环中显式实现julia> @benchmark test_function_1($data) BenchmarkTools.Trial: memory estimate: 5.34 MiB allocs estimate: 16 -------------- minimum time: 1.953 ms (0.00% GC) median time: 1.986 ms (0.00% GC) mean time: 2.122 ms (2.29% GC) maximum time: 4.347 ms (8.00% GC) -------------- samples: 2356 evals/sample: 1 julia> @benchmark test_function_2($data) BenchmarkTools.Trial: memory estimate: 45.78 MiB allocs estimate: 500002 -------------- minimum time: 16.316 ms (7.44% GC) median time: 16.597 ms (7.63% GC) mean time: 16.845 ms (8.01% GC) maximum time: 34.050 ms (4.45% GC) -------------- samples: 297 evals/sample: 1 julia> @benchmark test_function_3($data) BenchmarkTools.Trial: memory estimate: 96 bytes allocs estimate: 1 -------------- minimum time: 777.204 μs (0.00% GC) median time: 791.458 μs (0.00% GC) mean time: 799.505 μs (0.00% GC) maximum time: 1.262 ms (0.00% GC) -------------- samples: 6253 evals/sample: 1，则可以更快一点：

dot

要了解差异，让我们看一下您的代码的这一行：

julia> function test_function_4(data)
           theta   = (1,2)
           sum_all = zeros(2)
           for row in eachrow(data)
               @inbounds sum_all .+= (theta[1]*row[1]+theta[2]*row[2]) .*  row
           end
           return sum_all
       end
test_function_4 (generic function with 1 method)

julia> @benchmark test_function_4($data)
BenchmarkTools.Trial: 
  memory estimate:  96 bytes
  allocs estimate:  1
  --------------
  minimum time:     502.367 μs (0.00% GC)
  median time:      502.547 μs (0.00% GC)
  mean time:        505.446 μs (0.00% GC)
  maximum time:     806.631 μs (0.00% GC)
  --------------
  samples:          9888
  evals/sample:     1

让我们计算一下您在此表达式中执行的内存分配：

sum_all .= sum_all + (theta * data[i,1:2])[1] *  data[i,1:2]

因此，您可以看到在循环的每次迭代中您分配了五次。分配很昂贵。您可以在基准测试中看到这一点，您在此过程中分配了5000002：

sum_all .= sum_all + # allocation of a new vector as a result of addition (theta * # allocation of a new vector as a result of multiplication data[i,1:2] # allocation of a new vector via getindex )[1] * # allocation of a new vector as a result of multiplication data[i,1:2] # allocation of a new vector via getindex的1个分配
sum_all的1个分配
500000个分配（5 * 100000）

此外，您执行的索引类似于theta，该索引执行边界检查，这也是一个很小的成本（但与分配相比是微不足道的）。

现在在功能data[i,1:2]中，我使用test_function_3。这次我也得到了eachrow(data)矩阵的行，但是它们作为视图（而不是新矩阵）返回，因此循环内没有发生分配。接下来，我再次使用data函数来避免先前由矩阵乘法引起的分配（我将dot的{{1}}从theta更改为Tuple，然后是{{1 }}快一点，但这是次要的。最后，我写了Matrix，在这种情况下，广播了所有操作，因此Julia可以进行广播融合（再次-没有分配发生）。

在dot中，我只是将um_all .+= dot(theta,row) .* row替换为展开循环，因为我们知道我们有两个要计算其点积的元素。实际上，如果您完全展开所有内容并使用test_function_4，它将变得更快：

dot

因此，您可以看到，这种方法比@simd快100倍。 julia> function test_function_5(data) theta = (1,2) s1 = 0.0 s2 = 0.0 @inbounds @simd for i in axes(data,1) r1 = data[i,1] r2 = data[i,2] mul = theta[1]*r1 + theta[2]*r2 s1 += mul * r1 s2 += mul * r2 end return [s1,s2] end test_function_5 (generic function with 1 method) julia> @benchmark test_function_5($data) BenchmarkTools.Trial: memory estimate: 96 bytes allocs estimate: 1 -------------- minimum time: 22.721 μs (0.00% GC) median time: 23.146 μs (0.00% GC) mean time: 24.306 μs (0.00% GC) maximum time: 100.109 μs (0.00% GC) -------------- samples: 10000 evals/sample: 1仍然相对较快，并且是完全通用的，所以通常我可能会写类似test_function_1的东西，除非我真的需要非常快，并且知道我的数据的大小是固定的且很小的。 / p>

julia vectorization

我的Julia循环/去矢量化代码出了什么问题

问题描述

解决方法

相关问答