Pattern matching with at most k mismatches using binary search and hashing

Problem description

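I am trying to write a program that uses hashing (via a rolling hash) and binary search to find all positions in a text where a pattern occurs with at most k mismatches. The role binary search plays here is explained well by Sergei Tsaturian on the Coursera discussion forum: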

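For example, say you want to find 2 mismatches between the strings aabaabba and aaaaaaaa. First you look for the first mismatch. That is the first position i such that the strings are identical on [0, i-1] but differ at position i (in our case it should be 2). You can find it with binary search, checking at each step whether the strings are identical on [0, i-1]. If they are, move to the right half; otherwise move to the left half. Once you have found the first mismatch, consider the parts of the strings after it (aabba and aaaaa). Now you want to find the first mismatch in those (this should give position 5 in the original strings), and you can use another binary search for that. So in total you need up to k binary searches to find k mismatches.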
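The idea in the quoted explanation can be sketched as follows. This is only an illustration, not the forum author's code: the function names are made up here, and plain slice comparison stands in for the constant-time hash comparison a real solution would use.

```python
def first_mismatch(a, b, lo):
    """Return the smallest index j >= lo with a[j] != b[j], or len(a) if none.

    Assumes len(a) == len(b). Slice equality is a stand-in for comparing
    substring hashes.
    """
    left, hi = lo, len(a)  # the answer always lies in [left, hi]
    while left < hi:
        mid = (left + hi) // 2
        # Compare the range [left, mid], with mid itself included.
        if a[left:mid + 1] == b[left:mid + 1]:
            left = mid + 1  # no mismatch up to mid, go right
        else:
            hi = mid        # a mismatch exists at or before mid, go left
    return left


def mismatch_positions(a, b, k):
    """Collect up to k mismatch positions using one binary search per mismatch."""
    positions = []
    lo = 0
    while len(positions) < k:
        j = first_mismatch(a, b, lo)
        if j == len(a):
            break  # no further mismatches
        positions.append(j)
        lo = j + 1  # continue with the part after the mismatch
    return positions


print(mismatch_positions("aabaabba", "aaaaaaaa", 2))  # [2, 5]
```

Because the compared range includes mid, a zero-length comparison never happens and the loop ends with left pointing at the mismatch itself.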

I have tested my rolling hash and the other helper functions in other programs, so they work correctly. It is the binary search that I am stuck on.

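My binary search uses 4 pointers: start (the usual low pointer), end (the high pointer), mid, and id. The only difference from a standard binary search is id, which marks the start of the part of the substring I am currently considering, so that I can easily get its hash value. When start becomes greater than end, the index of the mismatch is start-1, so the id pointer then moves past the mismatch and the part of the string before it, and a conditional checks whether the mismatch is at the end of the substring. However, this approach fails when matching "cab" against "ccc" with at most 1 mismatch: it returns True even though the answer is clearly False.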

My function:

def find_num_matches(p1, p2, t1, t2, m1, m2, x, k, len_p, i):
    """
    Uses binary search to find the number of mismatches. It finds the left-most
    mismatch, then looks at the substring hash after that index position, and
    continues until the number of mismatches is more than k or the start pointer
    is greater than or equal to end.
    """
    start = 0
    for mismatch in range(k):
        # print(f'start: {start},id: {id}')
        id = start
        end = len_p - 1
        while start <= end:
            mid = start + (end - start) // 2
            # Hashes of the pattern and of the text window, starting at id,
            # with length mid - start.
            p_h1, p_h2 = (get_hash_value(p1, id, mid - start),
                          get_hash_value(p2, id, mid - start))
            s_h1, s_h2 = (get_hash_value(t1, i + id, mid - start),
                          get_hash_value(t2, i + id, mid - start))
            if p_h1 == s_h1 and p_h2 == s_h2:  # move to the right half, no mismatch yet
                start = mid + 1
            else:
                end = mid - 1
        # when the loop exits, start - 1 is the index of the mismatch.
        if start == len_p:  # found the last mismatch
            return True
    return False

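The get_hash_value function simply returns the hash of a substring of either the text or the pattern. Adding i to the start index for the text just accounts for where the current window begins in the whole text.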
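For readers who want something concrete, a substring-hash helper of this kind is often built on a table of prefix hashes. The sketch below is an assumption about how such a helper could look, not the get_hash_value from the question: the names prefix_hashes and substring_hash, the modulus M, and the base X are all chosen here for illustration.

```python
M = 1_000_000_007  # assumed prime modulus
X = 263            # assumed polynomial base


def prefix_hashes(s, m=M, x=X):
    """h[j] is the polynomial hash of s[:j]; h[0] is the hash of the empty string."""
    h = [0] * (len(s) + 1)
    for idx, ch in enumerate(s):
        h[idx + 1] = (h[idx] * x + ord(ch)) % m
    return h


def substring_hash(h, start, length, m=M, x=X):
    """Hash of s[start:start + length], derived from the prefix table h."""
    # pow(x, length, m) could be replaced by a precomputed table of powers.
    return (h[start + length] - h[start] * pow(x, length, m)) % m


text_h = prefix_hashes("abcdfe")
pat_h = prefix_hashes("abd")
i = 0  # start of the current window in the text
print(substring_hash(pat_h, 0, 2) == substring_hash(text_h, i + 0, 2))  # True: "ab" vs "ab"
print(substring_hash(pat_h, 0, 3) == substring_hash(text_h, i + 0, 3))  # False: "abd" vs "abc"
```

Note that a length of 0 always hashes to 0 in this scheme, so two empty substrings always compare equal.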

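The variable i is passed in by the function that calls the binary search. When matching the pattern "abd" against the text "abcdfe", every 3-letter window of the text is tried, so a loop runs over range(0, len(text) - len(pattern) + 1) and i is the counter of that loop. It indicates where the current window starts in the whole text.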
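The outer loop described here can be sketched with a brute-force mismatch count in place of the binary search. It is slow, but it makes a handy reference to test a faster version against. The function name is hypothetical.

```python
def naive_positions(text, pattern, k):
    """Return every start index i where text[i:i+len(pattern)] differs from
    pattern in at most k positions, counting mismatches character by character."""
    n = len(pattern)
    hits = []
    for i in range(len(text) - n + 1):  # i is the start of the current window
        window = text[i:i + n]
        mismatches = sum(a != b for a, b in zip(pattern, window))
        if mismatches <= k:
            hits.append(i)
    return hits


print(naive_positions("abcdfe", "abd", 1))  # [0]
```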

If you have any suggestions on how I can fix my approach and correct the bug, please let me know. Thanks!

Edit:

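One edge case that breaks is comparing "caa" with "ccc" for k = 1. In the first iteration, start = 0, end = 2, mid = 1, id = 0, so the hashes of length mid-start = 1 starting at index id are compared, i.e. 'c' and 'c'. Since these match, start becomes mid + 1 = 2. In the second iteration, start = 2, end = 2, mid = 2, id = 0, so the hashes compared have length mid-start = 0. Those hashes (or the lack of them) match, and start becomes mid + 1, which is 3. Now start > end, so the loop exits and assumes that the character at index start-1 is the left-most mismatch.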
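The trace above can be reproduced with a small stand-alone sketch. It mirrors only the inner loop of the first search, uses slice equality instead of hashes, and uses a hypothetical name (anchor plays the role of id).

```python
def trace_first_search(pattern, window):
    """Run the inner loop as posted and record (start, end, mid, length, same)
    for every step. Returns the recorded steps and the final value of start."""
    start, end = 0, len(pattern) - 1
    anchor = start  # corresponds to `id` in the question
    steps = []
    while start <= end:
        mid = start + (end - start) // 2
        length = mid - start  # can be 0, which compares two empty strings
        same = pattern[anchor:anchor + length] == window[anchor:anchor + length]
        steps.append((start, end, mid, length, same))
        if same:
            start = mid + 1
        else:
            end = mid - 1
    return steps, start


steps, final_start = trace_first_search("caa", "ccc")
for step in steps:
    print(step)
print(final_start)  # 3, which equals len("caa"), so the posted function returns True
```

The second step has length 0, so the character at mid is never actually examined, and start runs past the real mismatch at index 1.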

Solution

It seems you never check mid itself. I think you should change the start and end updates:

if p_h1 == s_h1 and p_h2 == s_h2:  # move to the right half, no mismatch yet
    start = mid
else:
    end = mid
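One caveat: with the loop condition `while start <= end` left unchanged, updates of the form `start = mid` / `end = mid` never make start exceed end, so the loop condition has to change as well. Below is one possible complete variant, offered as a sketch rather than as the answerer's exact intent. It compares the range up to and including mid, and it runs k + 1 searches so that a (k+1)-th mismatch can be detected. The helper names and the constants M1, M2, X are assumptions made for this example.

```python
M1, M2, X = 1_000_000_007, 1_000_000_009, 263  # assumed moduli and base


def prefix_hashes(s, m):
    """h[j] is the polynomial hash of s[:j] modulo m."""
    h = [0] * (len(s) + 1)
    for idx, ch in enumerate(s):
        h[idx + 1] = (h[idx] * X + ord(ch)) % m
    return h


def sub_hash(h, start, length, m):
    """Hash of the substring of the given length that begins at start."""
    return (h[start + length] - h[start] * pow(X, length, m)) % m


def at_most_k_mismatches(p_tabs, t_tabs, i, len_p, k):
    """True if the pattern and text[i:i+len_p] differ in at most k positions."""
    start = 0
    for _ in range(k + 1):  # the (k+1)-th mismatch found means failure
        lo, hi = start, len_p  # next mismatch, if any, lies in [lo, hi]
        while lo < hi:
            mid = (lo + hi) // 2
            length = mid - start + 1  # range [start, mid], mid included
            same = all(
                sub_hash(p, start, length, m) == sub_hash(t, i + start, length, m)
                for p, t, m in zip(p_tabs, t_tabs, (M1, M2))
            )
            if same:
                lo = mid + 1
            else:
                hi = mid
        if lo == len_p:
            return True  # no further mismatch in the window
        start = lo + 1   # skip past the mismatch just found
    return False


def find_positions(text, pattern, k):
    """All window starts where the pattern matches with at most k mismatches."""
    p_tabs = [prefix_hashes(pattern, m) for m in (M1, M2)]
    t_tabs = [prefix_hashes(text, m) for m in (M1, M2)]
    n = len(pattern)
    return [i for i in range(len(text) - n + 1)
            if at_most_k_mismatches(p_tabs, t_tabs, i, n, k)]


print(find_positions("ccc", "caa", 1))     # []
print(find_positions("abcdfe", "abd", 1))  # [0]
```

As with any hash-based comparison, equal hashes only indicate equal substrings with high probability; using two moduli reduces the chance of a false match but does not eliminate it.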

binary-search hash hashmap pattern-matching python
