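Unique identifier for a NumPy array?

Problem description

I have a 2D array A that is used by a method of one of the classes in my code, and later I need to check whether an array passed to other methods holds the same values.

The obvious solution is to save A as a class attribute, but since A can become very large, I want to avoid storing it as an attribute to prevent memory problems.

What I would like to do is save some kind of unique identifier for this array and check against that later. My first idea was to use id(A), but that identifies the object rather than the array contents, so if I have some B = A.copy(), it will have a different id.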
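A small sketch of the id() problem described above, assuming NumPy is installed. The variable names same_object and same_values are only illustrative.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])
B = A.copy()

# id() identifies the Python object, not the values it holds.
# A and B are both alive here, so their ids are guaranteed to differ.
same_object = id(A) == id(B)
# The contents, however, are equal element by element.
same_values = np.array_equal(A, B)

print(same_object, same_values)
```

So id() cannot tell that B carries the same data as A, which is why a content-based identifier is needed.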

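Another idea is to save some sparse version of A, for example by sampling a few random indices and checking for equality, but that seems messier and more involved than what I need.

Does anyone have any suggestions?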

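Solution

Use a hash function, for example SHA-256 from the hashlib module. Below is an example that generates a hash-based ID. The array_id() function returns an identifier string for the array that is unique for practical purposes (collisions are theoretically possible but extremely unlikely with SHA-256); with SHA-256 the string has a fixed length of 64 hexadecimal characters. Arrays with the same contents produce the same ID, and changing even a small part of the array yields a completely different ID.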
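As a quick illustration of these properties, here is a minimal standard-library sketch. The helper name simple_array_id is hypothetical and is not the array_id() function from this answer; it hashes only the raw bytes with SHA-256.

```python
import hashlib
import numpy as np

def simple_array_id(a):
    # Hypothetical helper: SHA-256 over the raw element bytes of the array.
    return hashlib.sha256(a.tobytes()).hexdigest().upper()

a = np.array([[1, 2, 3], [4, 5, 6]])
b = a.copy()       # same contents, different object
c = a.copy()
c[0, 0] = 7        # a single element changed

print(len(simple_array_id(a)))                    # SHA-256 hex digest length
print(simple_array_id(a) == simple_array_id(b))   # equal contents
print(simple_array_id(a) == simple_array_id(c))   # modified contents
```

Only the ID string has to be stored, so the class does not need to keep a reference to the large array.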

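Note that arrays of different types may produce different results. For example, if two integer arrays hold the same values but one has dtype np.int32 and the other np.int64, you will get different IDs. In that case, simply convert the arrays to one common type, e.g. res_id = array_id(a.astype(np.int64)). Different types do not always mean different hash IDs, though: if all integers are non-negative and less than 2^31, np.int32 and np.uint32 arrays give the same hash, because their byte representations are identical.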
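The dtype effects described above can be checked with a short sketch. The digest helper is a hypothetical stand-in that hashes only the raw bytes; it is not the answer's array_id().

```python
import hashlib
import numpy as np

def digest(a):
    # Hypothetical helper: hash of the raw bytes only, dtype is not included.
    return hashlib.sha256(a.tobytes()).hexdigest()

values = [1, 2, 3]
i32 = np.array(values, dtype=np.int32)
i64 = np.array(values, dtype=np.int64)
u32 = np.array(values, dtype=np.uint32)

# 4 bytes per element versus 8 bytes per element: the byte strings differ.
print(digest(i32) == digest(i64))
# Non-negative values below 2**31 have identical bytes in int32 and uint32.
print(digest(i32) == digest(u32))
# Converting to a common type first makes the IDs agree.
print(digest(i32.astype(np.int64)) == digest(i64))
```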

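So, if you want equal numbers to always produce the same ID, always convert the array to a common type first, e.g. array_id(a.astype(common_type)), where common_type is, for instance, np.int64 for all integer types and np.float64 for all floating-point types. Conversely, if you want different types to always produce different results, include the type name in the hash, e.g. hashlib.sha256(str(a.dtype).encode('ascii') + a.tobytes()).hexdigest().upper().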
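The two strategies can be put side by side in a small sketch. The helper names value_id and typed_id are hypothetical and exist only for this illustration.

```python
import hashlib
import numpy as np

def value_id(a, common_type):
    # Hypothetical helper: equal numbers give equal IDs, whatever the original dtype.
    return hashlib.sha256(a.astype(common_type).tobytes()).hexdigest().upper()

def typed_id(a):
    # Hypothetical helper: the dtype name is hashed together with the data,
    # so arrays of different dtypes get different IDs.
    return hashlib.sha256(str(a.dtype).encode('ascii') + a.tobytes()).hexdigest().upper()

i32 = np.array([1, 2, 3], dtype=np.int32)
u32 = np.array([1, 2, 3], dtype=np.uint32)

print(value_id(i32, np.int64) == value_id(u32, np.int64))  # same numbers
print(typed_id(i32) == typed_id(u32))                      # different dtypes
```

Which behavior is right depends on whether the later check should treat the dtype as part of the array's identity.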

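In the code below, passing include_dtype = True includes the data type in the ID computation, and include_shape = True includes the shape as well. The algo parameter (sha256 or xxhash) selects the hash algorithm to use.

The code requires a one-time installation of the modules it uses, via the command python -m pip install numpy xxhash.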


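# Needs: python -m pip install numpy xxhash
def array_id(a, *, include_dtype=False, include_shape=False, algo='xxhash'):
    """Return an upper-case hex ID computed from the contents of NumPy array a."""
    data = bytes()
    if include_dtype:
        data += str(a.dtype).encode('ascii')
    data += b','  # separators are always added, even when a field is left out
    if include_shape:
        data += str(a.shape).encode('ascii')
    data += b','
    data += a.tobytes()  # raw element bytes in C order (this copies the data)
    if algo == 'sha256':
        import hashlib
        return hashlib.sha256(data).hexdigest().upper()
    elif algo == 'xxhash':
        import xxhash
        return xxhash.xxh3_64(data).hexdigest().upper()
    else:
        # An explicit exception is still raised when Python runs with -O,
        # unlike a bare assert.
        raise ValueError(f'unknown algo: {algo!r}')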

# Test
import timeit
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
print(array_id(a))
print(array_id(a, include_shape=True))
print(array_id(a, include_shape=True, include_dtype=True))

# Speed Measure
a = np.ones((10000, 10000), dtype=np.uint32)
for algo in ['sha256', 'xxhash']:
    print(algo, round(timeit.timeit(lambda: array_id(a, algo=algo), number=1), 3), 'sec')

Output:

17A96F5E5826D66A
E201378DF28CB280
0FDFAE47334C986A
sha256 3.774 sec
xxhash 1.356 sec
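
Since the question is motivated by memory use, it may be worth noting that a.tobytes() creates a full copy of the array data before hashing. The following is a sketch of one possible variant, not part of the code above, that feeds the hash in chunks from a byte view instead. The name array_id_chunked and the chunk_bytes parameter are hypothetical, and the sketch assumes a plain numeric dtype.

```python
import hashlib
import numpy as np

def array_id_chunked(a, chunk_bytes=1 << 20):
    # Hypothetical helper: SHA-256 over the array bytes without a.tobytes().
    a = np.ascontiguousarray(a)           # copies only if a is not C-contiguous
    flat = a.reshape(-1).view(np.uint8)   # 1-D byte view of the same memory
    h = hashlib.sha256()
    for start in range(0, flat.size, chunk_bytes):
        # hashlib accepts any contiguous bytes-like object, including this slice.
        h.update(flat[start:start + chunk_bytes])
    return h.hexdigest().upper()

a = np.arange(12, dtype=np.uint32).reshape(3, 4)
reference = hashlib.sha256(a.tobytes()).hexdigest().upper()
print(array_id_chunked(a) == reference)
```

For a C-contiguous array this produces the same digest as hashing a.tobytes() with SHA-256, while the peak extra memory is bounded by the chunk size rather than the array size.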
