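How do I add a header to a compressed string with gzip in Python?

Problem description

I am trying to compress a string in Python the same way a specific piece of C# code does, but I get a different result. It seems I have to add a header to the compressed result, but I don't know how to add a header to a compressed string in Python. This is the C# line I don't know how to write in Python: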

memoryStream.Read(compressedBytes,CompressedMessageHeaderLength,(int)memoryStream.Length);

This is the whole runnable C# code:

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

namespace Rextester
{
    /// <summary>Handles compressing and decompressing API requests and responses.</summary>
    public class Compression
    {
        #region Member Variables
        /// <summary>The compressed message header length.</summary>
        private const int CompressedMessageHeaderLength = 4;
        #endregion

        #region Methods
        /// <summary>Compresses the XML string.</summary>
        /// <param name="data">The XML string to compress.</param>
        public static string CompressData(string data)
        {
            using (MemoryStream memoryStream = new MemoryStream())
            {
                byte[] plainBytes = Encoding.UTF8.GetBytes(data);

                using (GZipStream zipStream = new GZipStream(memoryStream,CompressionMode.Compress,leaveOpen: true))
                {
                    zipStream.Write(plainBytes, 0, plainBytes.Length);
                }

                memoryStream.Position = 0;

                byte[] compressedBytes = new byte[memoryStream.Length + CompressedMessageHeaderLength];

                // Add the header, which is the length of the uncompressed message.
                Buffer.BlockCopy(
                    BitConverter.GetBytes(plainBytes.Length), 0, compressedBytes, 0, CompressedMessageHeaderLength
                );

                // Copy the compressed data in after the header.
                memoryStream.Read(compressedBytes, CompressedMessageHeaderLength, (int)memoryStream.Length);

                string compressedXml = Convert.ToBase64String(compressedBytes);

                return compressedXml;
            }
        }
        
 
        #endregion
    }

    public class Program
    {
        public static void Main(string[] args)
        {
            //Your code goes here
            string data = "Hello World!";
            Console.WriteLine(  Compression.CompressData(data) );
            // result would be DAAAAB+LCAAAAAAABADzSM3JyVcIzy/KSVEEAKMcKRwMAAAA

        }
    }
}

This is the Python code I wrote:

data = 'Hello World!'

import gzip
import base64
print(base64.b64encode(gzip.compress(data.encode('utf-8'))))

# I expect DAAAAB+LCAAAAAAABADzSM3JyVcIzy/KSVEEAKMcKRwMAAAA
# but I get H4sIACwuuWAC//NIzcnJVwjPL8pJUQQAoxwpHAwAAAA=

Solution

You can use to_bytes to convert the length of the encoded string:

import sys

enc = data.encode('utf-8')
zipped = gzip.compress(enc)
print(base64.b64encode(len(enc).to_bytes(4, sys.byteorder) + zipped))  # sys.byteorder can be replaced with a fixed value such as 'little'

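Also, gzip.compress(enc) seems to produce a slightly different result than its C# counterpart (so the overall result differs too), but that should not be a problem: decompression should handle both correctly.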
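To check the claim that decompression handles either variant, the framing can be reversed from Python. The following is a minimal sketch; the function name decode_message is made up for illustration, and little-endian byte order is an assumption about the machine that produced the message.

```python
import base64
import gzip


def decode_message(b64_text: str, byteorder: str = "little") -> str:
    """Reverse the framing: base64 -> 4-byte length header + gzip stream -> text.

    decode_message is a hypothetical helper, not part of any library.
    """
    raw = base64.b64decode(b64_text)
    # The first 4 bytes hold the length of the uncompressed data.
    expected_len = int.from_bytes(raw[:4], byteorder)
    plain = gzip.decompress(raw[4:])
    if len(plain) != expected_len:
        raise ValueError("length header does not match decompressed size")
    return plain.decode("utf-8")


# The C# output quoted in the question decodes from Python as well.
print(decode_message("DAAAAB+LCAAAAAAABADzSM3JyVcIzy/KSVEEAKMcKRwMAAAA"))
```

If this prints the original text for both the C# output and the Python output, the byte-level difference between the two is probably harmless for your use case.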


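One thing I would start with is that the C# code is not well suited to cross-platform use. The byte order of the length header depends on the underlying architecture, because BitConverter.GetBytes returns the bytes in whatever order the architecture uses.

However, with C# we probably mean Windows, and probably Intel as well, so little endian is very likely.

So what you need to do is prepend the length of the original data, in little-endian order, to the compressed data. Exactly 4 bytes.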

bdata = data.encode('utf-8')
compressed = gzip.compress(bdata)
header = len(bdata).to_bytes(4,'little')

Then you need to concatenate and convert to base64:

print(base64.b64encode(header + compressed))
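Put together, the steps above might look like the sketch below. The names compress_data and HEADER_LENGTH are chosen here to mirror the C# code and are not an existing API; the little-endian order is the assumption discussed above.

```python
import base64
import gzip

HEADER_LENGTH = 4  # mirrors CompressedMessageHeaderLength in the C# code


def compress_data(data: str) -> str:
    """Hypothetical Python counterpart of the C# CompressData method."""
    plain = data.encode("utf-8")
    # Length of the uncompressed bytes, little endian, exactly 4 bytes.
    header = len(plain).to_bytes(HEADER_LENGTH, "little")
    return base64.b64encode(header + gzip.compress(plain)).decode("ascii")


encoded = compress_data("Hello World!")
raw = base64.b64decode(encoded)
print(raw[:4])   # b'\x0c\x00\x00\x00' -> 12, the length of "Hello World!"
print(raw[4:6])  # b'\x1f\x8b' -> the gzip magic bytes
```

The result starts with the same "DAAAAB+L" prefix as the C# output. The characters after that can still differ, because the gzip header that follows carries a timestamp and other fields that the two implementations fill in differently.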

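As others have mentioned, the fact that you put that header into the C# version makes a difference.

Also, note that the gzip process can be done in several ways. For example, in C# you can specify a CompressionLevel of Optimal, Fastest, or NoCompression. See: https://docs.microsoft.com/en-us/dotnet/api/system.io.compression.compressionlevel?view=net-5.0

I am not familiar enough with Python to say how it handles gzip compression by default (perhaps Fastest in C# gives a more or less aggressive algorithm than Python's).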
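On the Python side, gzip.compress takes a compresslevel argument from 0 to 9, with 9 as the default. A short sketch for comparing levels follows; the choice of levels 0, 1, and 9 is only an illustration, and the mtime argument (which fixes the timestamp in the gzip header so runs are repeatable) assumes Python 3.8 or newer.

```python
import gzip

plain = "Hello World!".encode("utf-8")

# Compress the same bytes at a few levels; 9 is the default for gzip.compress.
packed_by_level = {
    level: gzip.compress(plain, compresslevel=level, mtime=0)
    for level in (0, 1, 9)
}

for level, packed in packed_by_level.items():
    # Every level must decompress back to the same input.
    assert gzip.decompress(packed) == plain
    print(level, len(packed), packed.hex())
```

Level 0 stores the data uncompressed, much like NoCompression in C#. The bytes may differ from level to level, but each output decompresses to the same input.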

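Here is your C# code with the header length set to 0, producing output at 3 CompressionLevels. Note that the string values it outputs are "very close" to what you got in Python.

You should also ask whether it really matters that the values differ. Is it enough that you can encode and decode?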

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public class Program
{
    public static void Main()
    {
        string data = "Hello World!";
        Console.WriteLine(  Compression.CompressData(data,CompressionLevel.Fastest) );
        Console.WriteLine(  Compression.CompressData(data,CompressionLevel.NoCompression) );
        Console.WriteLine(  Compression.CompressData(data,CompressionLevel.Optimal) );
        // result would be DAAAAB+LCAAAAAAABADzSM3JyVcIzy/KSVEEAKMcKRwMAAAA
        // but I get       H4sIACwuuWAC//NIzcnJVwjPL8pJUQQAoxwpHAwAAAA=
    }
}

public class Compression
    {
        #region Member Variables
        /// <summary>The compressed message header length.</summary>
        private const int CompressedMessageHeaderLength = 0; // changed to zero
        #endregion

        #region Methods
        /// <summary>Compresses the XML string.</summary>
        /// <param name="data">The XML string to compress.</param>
        public static string CompressData(string data,CompressionLevel compressionLevel)
        {
            using (MemoryStream memoryStream = new MemoryStream())
            {
                byte[] plainBytes = Encoding.UTF8.GetBytes(data);

                using (GZipStream zipStream = new GZipStream(memoryStream,compressionLevel,leaveOpen: true))
                {
                    zipStream.Write(plainBytes, 0, plainBytes.Length);
                }

                memoryStream.Position = 0;

                byte[] compressedBytes = new byte[memoryStream.Length + CompressedMessageHeaderLength];

                // Add the header, which is the length of the uncompressed message (zero bytes long here).
                Buffer.BlockCopy(
                    BitConverter.GetBytes(plainBytes.Length), 0, compressedBytes, 0, CompressedMessageHeaderLength
                );

                // Copy the compressed data in after the header.
                memoryStream.Read(compressedBytes, CompressedMessageHeaderLength, (int)memoryStream.Length);

                string compressedXml = Convert.ToBase64String(compressedBytes);

                return compressedXml;
            }
        }
        
 
        #endregion
    }

Output:

H4sIAAAAAAAEA/NIzcnJVwjPL8pJUQQAoxwpHAwAAAA=
H4sIAAAAAAAEAwEMAPP/SGVsbG8gV29ybGQhoxwpHAwAAAA=
H4sIAAAAAAAAAA/NIzcnJVwjPL8pJUQQAoxwpHAwAAAA=

At: https://dotnetfiddle.net/TI8gwM
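To see exactly where the "very close" strings differ, the fixed 10-byte gzip header (magic, compression method, flags, modification time, extra flags, OS id, as laid out in RFC 1952) can be unpacked from each one. This is a rough sketch: the helper name split_gzip is made up for illustration, and the two inputs are the Fastest output listed above and the Python output from the question.

```python
import base64
import struct


def split_gzip(b64_text: str):
    """Hypothetical helper: return (header fields, remaining bytes) of a gzip stream."""
    raw = base64.b64decode(b64_text)
    # Fixed 10-byte gzip header, little endian, no padding.
    magic, method, flags, mtime, xfl, os_id = struct.unpack("<2sBBIBB", raw[:10])
    fields = {"magic": magic.hex(), "method": method, "flags": flags,
              "mtime": mtime, "xfl": xfl, "os": os_id}
    return fields, raw[10:]


csharp_hdr, csharp_body = split_gzip("H4sIAAAAAAAEA/NIzcnJVwjPL8pJUQQAoxwpHAwAAAA=")
python_hdr, python_body = split_gzip("H4sIACwuuWAC//NIzcnJVwjPL8pJUQQAoxwpHAwAAAA=")

print(csharp_hdr)
print(python_hdr)
print(csharp_body == python_body)  # True
```

For these two strings only the timestamp, extra-flags, and OS fields differ. The compressed payload and the trailer (CRC-32 and length) are byte-for-byte identical, which is why either side can decompress the other's output.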