Comparing two data buffers before compressing with Boost's gzip_compressor, to know which buffer will compress better

Question

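I have a specific task at hand: I have two binary data buffers of exactly the same size (say, a few megabytes each), and I am compressing them with Boost's boost::iostreams::gzip_compressor. The buffers' contents differ (but are similar), and I would like to estimate which of the two will compress better. Of course, I could actually perform the compression and know for sure, but ideally I would like to avoid that. I tried applying the classic "entropy" formula (the one used in textbooks as an introduction to data compression) to the data to get such an estimate, but my results do not agree with the actual gzip results. Here is my estimation code in C++: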

#include <cmath>    // log2
#include <cstddef>  // size_t
#include <cstring>  // memset

long getEntropySize (const unsigned char *ptr,size_t size) 
{
    long * table = new long [256];

    memset (table,0,256 * sizeof(long)); // zero the byte-frequency table

    for (size_t i=0; i < size; i++)
    {
        int v = ptr[i];

        table[v]++;
    }

    long total_bits = 0;

    for (size_t i=0; i < 256; i++)
    {
        if (table[i] == 0) continue;

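        double p = (double)size / table[i]; // inverse of this byte's relative frequency

        double bits = log2 (p); // ideal code length for this byte value, in bits

        total_bits += (bits * table[i]); // the sum is converted back to long, dropping the fraction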
    }

    delete[] table;

    return ((total_bits + 7) / 8);
}
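For reference, the quantity this function computes can be written out as follows. This is a reading of the code above, not a formula stated in the post: N stands for size, n_i for table[i], and the floor reflects the final integer division (the per-step truncation of total_bits is ignored here).

```latex
\[
  p_i = \frac{n_i}{N}, \qquad
  H = -\sum_{i\,:\,n_i > 0} p_i \log_2 p_i
    = \frac{1}{N} \sum_{i\,:\,n_i > 0} n_i \log_2 \frac{N}{n_i}
\]
\[
  \text{estimated bytes}
    = \left\lfloor \frac{N H + 7}{8} \right\rfloor
    = \left\lfloor \frac{1}{8}\left( \sum_{i\,:\,n_i > 0} n_i \log_2 \frac{N}{n_i} + 7 \right) \right\rfloor
\]
```

H is the order-0 (per-byte) entropy in bits per byte, so it depends only on how often each byte value occurs, not on the order in which the bytes appear.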

So, can I do better than this function?

Answer

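"The above function gets it right in about 75% of cases. But in the other 25%, gzip compresses one buffer better than the other while the function tells me otherwise."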

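I'd say that is a pretty good success rate if your test is much cheaper than compressing. Of course, it depends on your input data. If your input data is always roughly uniformly random, you will not get a good classifier, because by definition the compression margin is too small to guess reliably.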
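To make the point about uniformly random input concrete, here is a minimal sketch that applies a byte-frequency entropy estimate to two synthetic buffers. The names entropyBytes, uniformRandomBuffer and skewedBuffer are hypothetical helpers invented for this illustration, and the 90% ratio is an arbitrary assumption. It only exercises the estimator and does not call gzip.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Order-0 (byte-frequency) entropy estimate in bytes, accumulated in double.
inline double entropyBytes(const std::vector<std::uint8_t>& buf) {
    std::size_t counts[256] = {};
    for (std::uint8_t b : buf) ++counts[b];

    double bits = 0.0;
    for (std::size_t n : counts) {
        if (n == 0) continue;
        bits += static_cast<double>(n) *
                std::log2(static_cast<double>(buf.size()) / static_cast<double>(n));
    }
    assert(bits >= 0.0);  // every term is non-negative because n <= buf.size()
    return bits / 8.0;
}

// Hypothetical test data: every byte value equally likely.
inline std::vector<std::uint8_t> uniformRandomBuffer(std::size_t size, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> dist(0, 255);
    std::vector<std::uint8_t> buf(size);
    for (auto& b : buf) b = static_cast<std::uint8_t>(dist(gen));
    return buf;
}

// Hypothetical test data: about 90% zero bytes, the rest uniformly random.
inline std::vector<std::uint8_t> skewedBuffer(std::size_t size, unsigned seed) {
    std::mt19937 gen(seed);
    std::bernoulli_distribution mostlyZero(0.9);
    std::uniform_int_distribution<int> dist(0, 255);
    std::vector<std::uint8_t> buf(size);
    for (auto& b : buf) b = mostlyZero(gen) ? 0 : static_cast<std::uint8_t>(dist(gen));
    return buf;
}
```

For a 1 MiB buffer from uniformRandomBuffer, the estimate lands very close to the buffer size itself, so two such buffers differ by a margin too small to rank reliably. A buffer from skewedBuffer gives a much smaller estimate. Keep in mind that this estimate ignores repeated substrings, which DEFLATE's LZ77 stage exploits, so it can disagree with gzip even on clearly compressible data.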

Analyzing your code

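I look at your code and immediately see the dynamic allocation. Depending on your compiler, standard version and optimization flags, that allocation may get optimized away (see "Optimization of raw new[]/delete[] vs std::vector"), but note that this does not actually seem to happen on the latest GCC/Clang/MSVC in C++20 mode.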

So you should probably express your intent in the code and profile it to identify the bottlenecks.
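As a rough illustration of what such a measurement could look like, here is a sketch that times a heap-allocated frequency table against a stack-allocated one with std::chrono. The functions distinctBytesHeap, distinctBytesStack and averageNanoseconds are hypothetical names made up for this example. A hand-rolled loop like this is only a first approximation; a real profiler or a benchmarking library will give more trustworthy numbers.

```cpp
#include <array>
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical variant A: heap-allocated frequency table, as in the question.
inline std::size_t distinctBytesHeap(const unsigned char* ptr, std::size_t size) {
    long* table = new long[256]();  // () value-initializes every counter to 0
    for (std::size_t i = 0; i < size; ++i) ++table[ptr[i]];
    std::size_t distinct = 0;
    for (std::size_t i = 0; i < 256; ++i) distinct += (table[i] != 0) ? 1 : 0;
    delete[] table;
    return distinct;
}

// Hypothetical variant B: the same work with a stack-allocated table.
inline std::size_t distinctBytesStack(const unsigned char* ptr, std::size_t size) {
    std::array<long, 256> table{};  // {} zero-initializes every counter
    for (std::size_t i = 0; i < size; ++i) ++table[ptr[i]];
    std::size_t distinct = 0;
    for (long count : table) distinct += (count != 0) ? 1 : 0;
    return distinct;
}

// Calls `work` `iterations` times and returns the mean wall-clock time per call
// in nanoseconds.
template <typename F>
double averageNanoseconds(F&& work, int iterations) {
    assert(iterations > 0);
    using clock = std::chrono::steady_clock;
    volatile std::size_t sink = 0;  // storing results here discourages dead-code elimination
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i) sink = sink + work();
    const auto stop = clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count() / iterations;
}
```

A call such as averageNanoseconds([&] { return distinctBytesHeap(data, size); }, 1000) can then be compared against the same call with distinctBytesStack. For buffers of several megabytes, the counting loop is likely to dominate either way, which is exactly the kind of thing a measurement settles.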

To make things more generic, more expressive and potentially more efficient (I have not profiled this), I'd suggest a few changes:

#include <algorithm> // std::fill_n
#include <array>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <iterator>  // std::data, std::size
#include <span>

template <typename Ch   = std::uint8_t,typename T,std::size_t Extent = std::dynamic_extent>
static inline constexpr auto getEntropySize(std::span<Ch const,Extent> buf,std::span<T,256> table) {
    static_assert(sizeof(Ch) == 1);
    std::fill_n(table.data(),table.size(),0);

    for (std::uint8_t v : buf) { // implicit conversion to unsigned chars
        ++table[v];
    }

    T total_bits = 0;

    for (auto bucket: table) {
        if (bucket) {
            double p    = static_cast<double>(buf.size()) / bucket;
            double bits = std::log2(p);

            total_bits += (bits * bucket); // with an integral T, the sum is converted back to T, dropping the fraction
        }
    }

    return (total_bits + 7) / 8;
}

template <typename Buf>
static inline constexpr auto getEntropySize(Buf const& buf) {
    std::array<long,256> table;
    return getEntropySize(std::span(std::data(buf),std::size(buf)),std::span(table));
}

#include <cstdio>
#include <string>
#include <string_view>
#include <vector>
int main() {
    using namespace std::literals;
    std::printf("%ld\n",getEntropySize("xxxxxxxxxxxx")); // includes trailing NUL
    std::printf("%ld\n",getEntropySize("xxxxxxxxxxxx"s));
    std::printf("%ld\n",getEntropySize("xxxxxxxxxxxx"sv));

    std::printf("%ld\n",getEntropySize("Hello world!"));
    std::printf("%ld\n",getEntropySize("Hello world!"sv));
    std::printf("%ld\n",getEntropySize("Hello world!"s));

    std::printf("%ld\n",getEntropySize(std::vector<unsigned char>{
        0xec,0x48,0xf0,0x77,0xf7,0xd1,0xd0,0x08,0xa8,0x4b,0x1d,0x61,0x24,0xe8,0x16,0xe1,0x09,0x9a,0x65,0x94,0xe7,0xd3,0xa4,0xa7,0x1a,0x29,0x15,0x59,0x79,0x4e,0x19,0x17,0xfd,0x0a,0x34}));
}

See it live on Compiler Explorer for all three compilers.

Bonus

You could use a few bit tricks to get a cheaper log2 implementation: Live On GCC and Clang

See, for example, "Fast computing of log2 for 64-bit integers", std::countl_zero, etc.
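As one possible shape for such a trick, here is a sketch of an integer floor(log2) built on std::countl_zero. The name floor_log2 is a hypothetical helper, not something from the linked code, and the shift-loop fallback for pre-C++20 compilers is an addition made for portability.

```cpp
#include <cassert>
#include <cstdint>
#if __has_include(<bit>)
#include <bit>  // std::countl_zero (C++20)
#endif

// Returns floor(log2(x)), i.e. the index of the highest set bit.
// Precondition: x > 0.
constexpr unsigned floor_log2(std::uint64_t x) {
    assert(x > 0);
#if defined(__cpp_lib_bitops)
    // A 64-bit value with k leading zero bits has its highest set bit at index 63 - k.
    return 63u - static_cast<unsigned>(std::countl_zero(x));
#else
    // Fallback before C++20: count how many times x can be halved before reaching 0.
    unsigned result = 0;
    while (x >>= 1) ++result;
    return result;
#endif
}
```

This yields only the integer part of the logarithm, so floor_log2(8) and floor_log2(9) both give 3. Used directly in the entropy sum, that rounding can be off by almost a full bit per symbol, so a usable estimator would still need to recover some fractional precision, for example with a small lookup table on the bits below the leading one.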
