需要加快python中numpy数组的操作一些代码时间备注

问题描述

我正在通过将 Cplex 作为库导入 Python 来求解整数编程模型。假设优化问题具有以下形式的约束 (Ax = b)： x0+x1+x1+x3 = 1

此约束中 x 变量的索引为 0、1、1 和 3。它们存储在一个列表中：indices=[0,1,3] 这些变量的系数也存储在另一个列表coeff = [1,1]

Cplex 不能接受重复的索引，所以约束应该是这样的：

x0+2x1+x3 = 1

所以这两个列表应该像这样更新：

indices=[0,3]
 coeff = [1,2,1]

我有这个函数，它将索引和系数作为两个参数，并为我提供操作列表：

def manipulate(indices,coeff):
    u = np.unique(indices)
    sums  = { ui:np.sum([  coeff[c[0]] for c in np.argwhere(indices == ui) ]) for     ui in u }
    return list(sums.keys()),list(sums.values())

因此，manipulate([0,3],[1,1]) 返回 ([0,1])。

我的问题是当我有这么多变量时，列表的长度可能是一百万，而我有数百万个这样的约束。当我使用 cplex 解决我的优化问题时，程序变得非常慢。我跟踪了每个函数所花费的时间，并意识到我的代码中最耗时的部分是这些计算，我猜这是因为使用了 numpy。我需要让这个函数更有效率，希望能减少执行时间。你能和我分享关于如何改变函数操作的任何想法和建议吗？

非常感谢，

解决方法

如果不使用基于本机代码的扩展，可能总会有妥协：

Numpy / Vectorization 方法会遗漏基于哈希的算法，恕我直言，将遭受算法复杂性的缺点（例如，需要排序；需要进行多次传递……）
基于 Python 的基于哈希的方法会受到缓慢循环的影响。

尽管如此，我确实认为，您的方法在某种程度上接近两全其美，并且您可能会有所收获。

一些代码

from time import perf_counter as pc
from collections import defaultdict
import numpy as np
np.random.seed(0)

def manipulate(indices,coeff):
    u = np.unique(indices)
    sums  = { ui:np.sum([  coeff[c[0]] for c in np.argwhere(indices == ui) ]) for     ui in u }
    return list(sums.keys()),list(sums.values())

# this assumes pre-sorted indices
def manipulate_alt(indices,coeff):
  unique,indices = np.unique(indices,return_index=True)
  cum_coeffs = np.cumsum(coeff)
  bla = cum_coeffs[indices-1]
  blub = np.roll(bla,-1)
  bluab = np.ediff1d(blub,to_begin=blub[0])

  return unique.tolist(),bluab.tolist()

def manipulate_pure_py(indices,coeff):
  final = defaultdict(int)
  n = len(indices)
  for i in range(n):
    final[indices[i]] += coeff[i]

  return list(final.keys()),list(final.values())

# BAD NON-SCIENTIFIC BENCHMARK
# ----------------------------

ITERATIONS = 10
SIZE = 1000000

accu_time_manipulate = 0.0
accu_time_manipulate_alt = 0.0
accu_time_manipulate_py = 0.0

for i in range(ITERATIONS):
  indices = np.sort(np.random.randint(0,10000,size=SIZE))
  coeffs = np.random.randint(1,100,size=SIZE)

  start = pc()
  sol = manipulate(indices,coeffs)
  end = pc()
  accu_time_manipulate += (end-start)

  start = pc()
  sol_alt = manipulate_alt(indices,coeffs)
  end = pc()
  accu_time_manipulate_alt += (end-start)

  start = pc()
  sol_py = manipulate_pure_py(indices,coeffs)
  end = pc()
  accu_time_manipulate_py += (end-start)

  assert sol[0] == sol_alt[0]
  assert sol[1] == sol_alt[1]

  assert sol[0] == sol_py[0]
  assert sol[1] == sol_py[1]


print(accu_time_manipulate)
print(accu_time_manipulate_alt)
print(accu_time_manipulate_py)

时间

164.34614480000005
0.24998690000001744
8.751806900000059

备注

基准很糟糕
我 100% 不相信基于 numpy 的 hack
只需使用简单的基于字典的方法（或走本机代码路线）
（取决于建模任务：您也可以在构建模型时结合此合并：直接进行，不要延迟到后期处理）

cplex integer-programming numpy performance performance performance python

需要加快python中numpy数组的操作 一些代码时间备注

问题描述

解决方法

一些代码

时间

备注

需要加快python中numpy数组的操作一些代码时间备注