Accessing the underlying array of a Memory<T>

Problem description

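In my application, I need to go through the content of a file to generate hashes of fixed-size chunks of that file. The end goal is to implement Amazon Glacier's Tree Hash algorithm, and I copied the code almost verbatim from their documentation.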

The problem occurs when I run the following code through SonarQube:

    byte[] buff = new byte[Mio];
    int bytesRead;

    while ((bytesRead = await inputStream.ReadAsync(buff, 0, Mio)) > 0) {
        // Process the bytes read
    }

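I get a Roslyn issue on the while loop. The issue reads "Change the 'ReadAsync' method call to use the 'Stream.ReadAsync(Memory<byte>, CancellationToken)' overload". According to the description, this is because the methods that use the Memory class are more efficient than the ones that use plain arrays.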

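That may be true when the class can be used from end to end. The problem is that I need to feed the data to the ComputeHash method of HashAlgorithm, and it has no overload that accepts a Memory. That means I have to use the ToArray method of Memory, which makes a copy of the data. That does not sound efficient to me.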

I know a Memory instance can be created by passing an existing array to its constructor, like this:

    byte[] buff = new byte[Mio];
    Memory<byte> memory = new Memory<byte>(buff);
    int bytesRead;

    while ((bytesRead = await inputStream.ReadAsync(memory)) > 0) {
        // Use `buff` to access the bytes
    }

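But the documentation is unclear on whether the array passed to the constructor is actually used as the underlying storage of the Memory instance.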

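So, here are my questions:

How can I feed the data from a Memory directly to a HashAlgorithm instance? I mean any instance of a class derived from HashAlgorithm, not specifically the SHA256 algorithm. Unlike Glacier, my implementation is not limited to SHA256.
Is the data stored in a Memory instance also accessible through the array that was used to create it?
Is there another way to access the data stored in a Memory instance as an array, without copying it?
Failing that, how can I dismiss an external issue (a Roslyn warning in this case) in SonarQube? I do not get the drop-down menu to change its status that I get for normal Sonar issues.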

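EDIT: Adding more information about how the code works. It is the first part of AWS's example of computing a Glacier Tree Hash, the part that computes the initial hashes of the 1 MiB chunks of the file.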

Here is the content of the while loop above:


// Constructor of the class
// The class implements IDisposable to properly dispose of the Algorithm field
// Constructor is called like this
// `using TreeHash treeHash = new TreeHash(System.Security.Cryptography.SHA512.Create());`
public TreeHash(HashAlgorithm algo) {
    this.Algorithm = algo;
}


// Chunk hash generation loop
// first part of the tree hash algorithm
    byte[][] chunkHashes = new byte[numChunks][];

    byte[] buff = new byte[Mio];
    int bytesRead;
    int idx = 0;

    while ((bytesRead = await inputStream.ReadAsync(buff, 0, Mio)) > 0) {
        chunkHashes[idx++] = this.ComputeHash(buff,bytesRead);
    }


// Quick wrapper around the hash algorithm
// Also used by the second part of the tree hash computation
private byte[] ComputeHash(byte[] data, int count) => this.Algorithm.ComputeHash(data, 0, count);

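I use the non-prefixed versions of the hash algorithms by default, but I can probably switch to the Managed versions. The method can be made non-async if needed.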

Solution

The following should work. It uses MemoryPool<byte> to obtain an IMemoryOwner<byte>, from which we get our scratch buffer. We need a Memory<byte> to pass to the ReadAsync call, so we pass the Memory property of the IMemoryOwner<byte>.

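We then refactor the code to use the HashAlgorithm.TryComputeHash method, which accepts a ReadOnlySpan<byte> as the source and a Span<byte> as the destination. We do allocate a new array for each hash (rather than using ArrayPool), because you want to keep/store those arrays.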

byte[][] chunkHashes = new byte[numChunks][]; 

using var memory = MemoryPool<byte>.Shared.Rent(Mio);

int bytesRead;
int idx = 0; 

while ((bytesRead = await inputStream.ReadAsync(memory.Memory,CancellationToken.None)) > 0) 
{ 
   var tempBuff = new byte[(int)Math.Ceiling(this.Algorithm.HashSize/8.0)];
   if (this.Algorithm.TryComputeHash(memory.Memory.Span[..bytesRead] /*1*/,tempBuff,out var hashWritten)) 
   {
      chunkHashes[idx++] = hashWritten == tempBuff.Length ? tempBuff : tempBuff[..hashWritten] /*2*/;
   } 
   else
      throw new Exception("buffer not big enough");
}

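For the source, we pass the Span property of the Memory<byte> buffer, which is again retrieved from the IMemoryOwner<byte>.Memory property. We slice it to the appropriate length based on the number of bytes read. The Span<byte> we pass as the destination must be at least as large as the hash itself; the algorithm's HashSize property gives that size in bits (not bytes). Since an implementation could (though I think it unlikely) use a size that is not a multiple of 8, we take the ceiling of the division so that it rounds up if necessary. We do not need to call AsSpan because there is an implicit conversion from T[].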

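I believe* that the final number of bytes written will always match the length derived from HashSize. If/when it does, we simply use the original array. Otherwise, we need to slice it to the correct length based on the number of hash bytes written.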

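If the buffer is not big enough, TryComputeHash returns false and we throw an exception. I am fairly sure this will not happen to us, because we explicitly compute the size from HashSize, but we handle the case anyway as a matter of good practice.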
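To tie the explanation above together, here is a minimal, self-contained sketch of the same idea as a top-level program. It is not the code from the AWS sample: `HashChunksAsync`, its `chunkSize` parameter and the `ChunkSize` constant (standing in for `Mio`) are hypothetical names chosen for illustration. It also adds two precautions that go beyond the loop above: `MemoryPool<byte>.Shared.Rent` only guarantees a buffer of at least the requested size, so the buffer is sliced to the exact chunk size, and a single `ReadAsync` call may return fewer bytes than requested, so each chunk is filled in an inner loop.

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using System.Threading;
using System.Threading.Tasks;

const int ChunkSize = 1024 * 1024; // hypothetical stand-in for the question's `Mio`

// Usage: two full chunks plus a 10-byte tail, held in memory for the demo.
using var sha = SHA256.Create();
using var stream = new MemoryStream(new byte[ChunkSize * 2 + 10]);
byte[][] chunkHashes = await HashChunksAsync(stream, sha, ChunkSize);
Console.WriteLine(chunkHashes.Length); // 3

// Hashes `input` in pieces of `chunkSize` bytes with any HashAlgorithm,
// without copying the bytes that were read.
async Task<byte[][]> HashChunksAsync(
    Stream input, HashAlgorithm algorithm, int chunkSize, CancellationToken ct = default)
{
    var hashes = new List<byte[]>();
    using IMemoryOwner<byte> owner = MemoryPool<byte>.Shared.Rent(chunkSize);

    // Rent may hand back a larger buffer, so slice it to the exact chunk size.
    Memory<byte> buffer = owner.Memory[..chunkSize];

    while (true)
    {
        // Keep reading until the chunk is full or the stream ends.
        int filled = 0;
        while (filled < chunkSize)
        {
            int read = await input.ReadAsync(buffer[filled..], ct);
            if (read == 0) break; // end of stream
            filled += read;
        }
        if (filled == 0) break; // nothing left to hash

        // HashSize is in bits; round up to whole bytes.
        var hash = new byte[(algorithm.HashSize + 7) / 8];
        if (!algorithm.TryComputeHash(buffer.Span[..filled], hash, out int written))
            throw new InvalidOperationException("Destination buffer too small for the hash.");

        hashes.Add(written == hash.Length ? hash : hash[..written]);
    }

    return hashes.ToArray();
}
```

A list is used here instead of a pre-sized `byte[numChunks][]` only so the sketch does not need to know the stream length up front; with a known `numChunks` the original array works just as well.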

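I have passed CancellationToken.None, but you can supply your own token. I also used the Range syntax instead of calling Slice explicitly. If that is not available to you, or you just do not like the look of it, you can spell it out: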

/*1*/ memory.Memory.Span.Slice(0,bytesRead)
/*2*/ tempBuff.AsSpan(0,hashWritten).ToArray()
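As a small illustration of the two equivalences above (the variable names are made up for this sketch): on a span, both the range and `Slice` produce a view over the same bytes without copying, whereas on an array a range allocates a new array, just like `AsSpan(...).ToArray()`.

```csharp
using System;

byte[] data = { 1, 2, 3, 4, 5 };
Memory<byte> memory = data;
int count = 3;

// Both forms view the same three bytes of `data`; neither copies.
ReadOnlySpan<byte> viaRange = memory.Span[..count];
ReadOnlySpan<byte> viaSlice = memory.Span.Slice(0, count);
Console.WriteLine(viaRange.SequenceEqual(viaSlice)); // True

// On an array, a range allocates a new array, like AsSpan(...).ToArray().
byte[] copyViaRange = data[..count];
byte[] copyViaToArray = data.AsSpan(0, count).ToArray();
Console.WriteLine(ReferenceEquals(copyViaRange, data)); // False
```

That copy is why the longer loop above only slices `tempBuff` when the number of bytes written differs from its length.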

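There are a few assumptions we could probably make:

Assume HashSize is always a multiple of 8
Assume HashSize always equals the number of bytes written, and do not slice the final array
Assume we always supply a big enough buffer (per the above, it would be exactly the required size) and drop the if and the Exception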

while ((bytesRead = await inputStream.ReadAsync(memory.Memory,CancellationToken.None)) > 0)
{
   var tempBuff = new byte[this.Algorithm.HashSize/8];
   _ = this.Algorithm.TryComputeHash(memory.Memory.Span[..bytesRead], tempBuff, out _);
   chunkHashes[idx++] = tempBuff;
}

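* Unfortunately, I cannot say with 100% certainty that these are all valid assumptions. Most of the implementations I have looked at have a Debug.Assert verifying that the buffer size and the number of bytes written are the same, so I think they are reasonable. That said, I would personally stick with the more verbose option.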

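You will also notice that I removed your ComputeHash function. That is not to say you cannot still use it, but I leave converting it to this Try-based pattern with Memory<> as an exercise for the reader.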
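On the question of whether the array passed to the Memory<byte> constructor is the underlying storage, the following minimal sketch (an illustration, not part of the code above) suggests that it is: a write through the Memory shows up in the array. For a Memory whose origin is not known, `MemoryMarshal.TryGetArray` can expose the backing array when there is one; it returns false for memory that is not array-backed, so the result has to be checked.

```csharp
using System;
using System.Runtime.InteropServices;

byte[] buff = new byte[4];
Memory<byte> memory = new Memory<byte>(buff);

// The constructor wraps the array rather than copying it,
// so a write through the Memory is visible in `buff`.
memory.Span[0] = 42;
Console.WriteLine(buff[0]); // 42

// TryGetArray hands back the backing array (if any) as an ArraySegment.
if (MemoryMarshal.TryGetArray<byte>(memory, out ArraySegment<byte> segment))
{
    Console.WriteLine(ReferenceEquals(segment.Array, buff)); // True
    // segment.Array, segment.Offset and segment.Count match the shape of
    // HashAlgorithm.ComputeHash(byte[] buffer, int offset, int count).
}
```

So in the second snippet of the question, reading from `buff` after `ReadAsync(memory)` involves no copy.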
