8-bit Pearson hash implementation produces very uneven values

Problem description

I'm implementing a Pearson hash to build a lightweight dictionary structure for a C project that needs to pair filenames with file data - I want the good constant-time lookup properties of a hash table. I'm not a math expert, so I looked around for good text hashes; Pearson came up, with claims that it's efficient and has good distribution. I tested my implementation and found that no matter how I vary the table size or the maximum filename length, the hash is distributed very poorly - for example, 18 of 50 buckets are left empty. I trust that Wikipedia wouldn't lie, and yes, I know I could just download a third-party hash table implementation, but I'd really like to know why my version doesn't work.

In the code below (a function that inserts a value into the table), "csstring" is the filename - the string to be hashed, "cLen" is the string's length, "pData" is a pointer to the data being inserted into the table, and "pTable" is the table structure. The initial condition cHash = cLen - csstring[0] is something I found by experiment to slightly improve uniformity. I should add that I'm testing the table with completely random strings (ASCII values generated with rand()) whose lengths fall within a given range - this makes it easy to generate and test a large number of values.

typedef struct StaticStrTable {
    unsigned int nRepeats;
    unsigned char nBuckets;
    unsigned char nMaxCollisions;

    void** pBuckets;
} StaticStrTable;

static const unsigned char cPerm256[256] = {
    227,117,238,33,25,165,107,226,132,88,84,68,217,237,228,58,52,147,46,197,191,119,211,218,139,196,153,170,77,175,22,193,83,66,182,151,99,11,144,104,233,166,34,177,14,194,51,30,121,102,49,222,210,199,122,235,72,13,156,38,145,137,78,65,176,94,163,95,59,92,114,243,204,224,43,185,168,244,203,28,124,248,105,10,87,115,161,138,223,108,192,6,186,101,16,39,134,123,200,190,195,178,164,9,251,245,73,162,71,7,239,62,69,209,159,3,45,247,19,174,149,61,57,146,234,189,15,202,89,111,207,31,127,215,198,231,4,181,154,64,125,24,93,152,37,116,160,113,169,255,44,36,70,225,79,250,12,229,230,76,167,118,232,142,212,98,82,252,130,23,29,236,86,240,32,90,67,126,8,133,85,20,63,47,150,135,100,103,173,184,48,143,42,54,129,242,18,187,106,254,53,120,205,155,216,219,172,21,253,5,221,40,27,2,179,74,17,55,183,56,50,110,201,109,249,128,112,75,220,214,140,246,213,136,148,97,35,241,60,188,180,206,80,91,96,157,81,171,141,131,158,1,208,26,41
};

void InsertStaticStrTable(char* csstring, unsigned char cLen, void* pData, StaticStrTable* pTable) {
    unsigned char cHash = cLen - csstring[0];

    for (int i = 0; i < cLen; ++i) cHash ^= cPerm256[cHash ^ csstring[i]];
    
    unsigned short cTableIndex = cHash % pTable->nBuckets;
    long long* pBucket = pTable->pBuckets[cTableIndex];
    
    // Inserts the data and records how many collisions there are - it may look weird, as the way I decided to pack the data into the table buffer is very compact and arbitrary.
    // It won't affect the hash though, which is the key issue!

    for (int i = 0; i < pTable->nMaxCollisions; ++i) {
        if (i == 1) {
            pTable->nRepeats++;
        }

        long long* pSlotID = pBucket + (i << 1);

        if (pSlotID[0] == 0) {
            pSlotID[0] = (long long)csstring; /* cast: the key pointer is stored in an integer slot */
            pSlotID[1] = (long long)pData;

            break;
        }
    }
}
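For comparison, the textbook Pearson hash as described on Wikipedia *replaces* the state with the table lookup on each step (`h = T[h ^ c]`) rather than XOR-folding the lookup back into the state (`h ^= T[...]`) as the code above does. Both variants stay within 8 bits, but the difference is worth keeping in mind when comparing results against the published algorithm. Below is a minimal sketch of the textbook form; `PearsonHash` and `InitPermutation` are illustrative names of my own, and the affine table here merely stands in for a randomly shuffled permutation like `cPerm256`:

```c
#include <assert.h>
#include <stddef.h>

/* Textbook Pearson hash: the state is *replaced* by the table lookup each
 * step.  T must be a permutation of 0..255; an affine map with an odd
 * multiplier is a bijection mod 256, so it serves as a simple stand-in
 * for a randomly shuffled table such as cPerm256. */
static unsigned char T[256];

static void InitPermutation(void) {
    for (int i = 0; i < 256; ++i)
        T[i] = (unsigned char)((167 * i + 13) & 255);
}

static unsigned char PearsonHash(const char* s, size_t len) {
    unsigned char h = 0;
    for (size_t i = 0; i < len; ++i)
        h = T[h ^ (unsigned char)s[i]]; /* note: '=', not '^=' */
    return h;
}
```

The result is a single byte; for a table with `nBuckets` buckets you would then reduce it with `h % nBuckets`, as the question's code does.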

Solution

FYI (this is not an answer; I just needed the formatting): these are only single runs of the simulation, YMMV.

Distributing 50 elements randomly over 50 bins:


kalender_size=50 nperson = 50
E/cell| Ncell | frac   |  Nelem   |  frac  |h/cell|  hops  | Cumhops
----+---------+--------+----------+--------+------+--------+--------
  0:       18 (0.360000)        0 (0.000000)     0        0        0
  1:       18 (0.360000)       18 (0.360000)     1       18       18
  2:       10 (0.200000)       20 (0.400000)     3       30       48
  3:        4 (0.080000)       12 (0.240000)     6       24       72
----+---------+--------+----------+--------+------+--------+--------
  4:       50                  50                1.440000         72

The same, distributing 365 people over a birthday calendar (ignoring leap days ...):


kalender_size=356 nperson = 356
E/cell| Ncell | frac   |  Nelem   |  frac  |h/cell|  hops  | Cumhops
----+---------+--------+----------+--------+------+--------+--------
  0:      129 (0.362360)        0 (0.000000)     0        0        0
  1:      132 (0.370787)      132 (0.370787)     1      132      132
  2:       69 (0.193820)      138 (0.387640)     3      207      339
  3:       19 (0.053371)       57 (0.160112)     6      114      453
  4:        6 (0.016854)       24 (0.067416)    10       60      513
  5:        1 (0.002809)        5 (0.014045)    15       15      528
----+---------+--------+----------+--------+------+--------+--------
  6:      356                 356                1.483146        528

For N items over N slots, the expected number of empty slots equals the expected number of slots with a single item in them. The expected density of both is 1/e.
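The 18/50 = 0.36 empty buckets the question reports is exactly this 1/e ≈ 0.3679 prediction. A quick way to convince yourself is to throw N uniformly random balls into N bins and count the empty bins; a minimal sketch (the name `EmptyFraction` and the inline 64-bit LCG are my own choices, used instead of rand() so the result does not depend on the platform's RAND_MAX):

```c
#include <stdlib.h>

/* Throw n uniformly random balls into n bins and return the fraction of
 * bins left empty.  For a uniform distribution this converges to
 * 1/e ~ 0.3679 as n grows -- the same ~36% the 18/50 run above shows. */
static double EmptyFraction(int n, unsigned long long seed) {
    unsigned long long s = seed;
    int* counts = (int*)calloc((size_t)n, sizeof(int));
    if (!counts) return -1.0;

    for (int i = 0; i < n; ++i) {
        /* 64-bit LCG (Knuth's MMIX constants); the high bits are used */
        s = s * 6364136223846793005ULL + 1442695040888963407ULL;
        counts[(s >> 33) % (unsigned long long)n]++;
    }

    int empty = 0;
    for (int i = 0; i < n; ++i)
        if (counts[i] == 0) ++empty;

    double frac = (double)empty / (double)n;
    free(counts);
    return frac;
}
```

With n = 100000 the empty fraction lands within a fraction of a percent of 1/e on any reasonable seed.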

The final number (1.483146) is the number of ->next pointer traversals per element found (when using a chained hash table). Any near-optimal hash function will end up close to 1.5.
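Where that 1.5 comes from: at load factor 1 the chain lengths K are approximately Poisson(1), and a chain of length k costs 1 + 2 + ... + k = k(k+1)/2 hops to find each of its k elements once (that is the "h/cell" column in the tables above). So the expected hops per element is E[K(K+1)/2] / E[K] = (E[K^2] + E[K]) / 2 = (2 + 1) / 2 = 1.5. A small numeric check, summing the Poisson pmf (the name `ExpectedHopsPerElement` is mine; the sum is truncated at k = 50, where the tail is negligible):

```c
#include <math.h>

/* With chain lengths K ~ Poisson(1), compute E[K(K+1)/2] / E[K]:
 * the expected ->next hops per element found in a chained hash table.
 * Analytically this is 1.5; here the pmf is summed term by term. */
static double ExpectedHopsPerElement(void) {
    double hops = 0.0, elems = 0.0;
    double pmf = exp(-1.0);              /* P(K = 0) for lambda = 1 */
    for (int k = 0; k <= 50; ++k) {
        hops  += pmf * (double)k * (double)(k + 1) / 2.0;
        elems += pmf * (double)k;
        pmf   /= (double)(k + 1);        /* P(K = k+1) = P(K = k) / (k+1) */
    }
    return hops / elems;
}
```

The second table above gives 528 cumulative hops for 356 elements, i.e. 528/356 ≈ 1.483, just under this 1.5 limit.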