给定两个字符串,我想识别从最长到最短的所有常见子字符串.
我想删除任何“子”子字符串.例如,’1234’的任何子串都不会包含在’12345’和’51234’之间的匹配中.
string1 = '51234' string2 = '12345' result = ['1234','5']
我想找到longest common substring,然后递归地找到左/右最长的子串.但是,我不希望在找到之后删除一个公共子字符串.例如,下面的结果在中间共享一个6:
string1 = '12345623456' string2 = '623456' result = ['623456','23456']
最后,我需要针对数千个字符串的固定列表检查一个字符串.我不确定是否有一个聪明的步骤可以用来散列这些字符串中的所有子串.
以前的答案:
在这个thread中,发现一个动态编程解决方案需要O(nm)时间,其中n和m是字符串的长度.我对更有效的方法感兴趣,它将使用后缀树.
背景:
我正在用旋律片段创作歌曲旋律.有时,组合设法生成在现有的一行中匹配太多音符的旋律.
我可以使用字符串相似性度量,例如编辑距离,但相信与旋律有很小差异的曲调是独特且有趣的.不幸的是,这些曲调与连续复制旋律的许多音符的歌曲具有相似的相似度.
解决方法
让我们从树开始吧
from collections import defaultdict def identity(x): return x class TreeReprMixin(object): def __repr__(self): base = dict(self) return repr(base) class PrefixTree(TreeReprMixin,defaultdict): ''' A hash-based Prefix or Suffix Tree for testing for sequence inclusion. This implementation works for any slice-able sequence of hashable objects,not just strings. ''' def __init__(self): defaultdict.__init__(self,PrefixTree) self.labels = set() def add(self,sequence,label=None): layer = self if label is None: label = sequence if label: layer.labels.add(label) for i in range(len(sequence)): layer = layer[sequence[i]] if label: layer.labels.add(label) return self def add_ngram(self,label=None): if label is None: label = sequence for i in range(1,len(sequence) + 1): self.add(sequence[:i],label) def __contains__(self,sequence): layer = self j = 0 for i in sequence: j += 1 if not dict.__contains__(layer,i): break layer = layer[i] return len(sequence) == j def depth_in(self,sequence): layer = self count = 0 for i in sequence: if not dict.__contains__(layer,i): print "Breaking" break else: layer = layer[i] count += 1 return count def subsequences_of(self,sequence): layer = self for i in sequence: layer = layer[i] return layer.labels def __iter__(self): return iter(self.labels) class SuffixTree(PrefixTree): ''' A hash-based Prefix or Suffix Tree for testing for sequence inclusion. This implementation works for any slice-able sequence of hashable objects,SuffixTree) self.labels = set() def add_ngram(self,label=None): if label is None: label = sequence for i in range(len(sequence)): self.add(sequence[i:],label=label)
要填充树,您将使用.add_ngram方法.
下一部分有点棘手,因为您正在寻找并发遍历字符串同时跟踪树坐标.为了解决所有这些问题,我们需要一些在树上运行的函数和一个查询字符串
def overlapping_substrings(string,tree,solved=None): if solved is None: solved = PrefixTree() i = 1 last = 0 matching = True solutions = [] while i < len(string) + 1: if string[last:i] in tree: if not matching: matching = True else: i += 1 continue else: if matching: matching = False solutions.append(string[last:i - 1]) last = i - 1 i -= 1 i += 1 if matching: solutions.append(string[last:i]) for solution in solutions: if solution in solved: continue else: solved.add_ngram(solution) yield solution def slide_start(string): for i in range(len(string)): yield string[i:] def seek_subtree(tree,sequence): # Find the node of the search tree which # is found by this sequence of items node = tree for i in sequence: if i in node: node = node[i] else: raise KeyError(i) return node def find_all_common_spans(string,tree): # We can keep track of solutions to avoid duplicates # and incomplete prefixes using a Prefix Tree seen = PrefixTree() for substring in slide_start(string): # Drive generator forward list(overlapping_substrings(substring,seen)) # Some substrings are suffixes of other substrings which you do not # want compress = SuffixTree() for solution in sorted(seen.labels,key=len,reverse=True): # A substrings may be a suffix of another substrings,but that substrings # is actually a repeating pattern. If a solution is # a repeating pattern,`not solution in seek_subtree(tree,solution)` will tell us. # Otherwise,discard the solution if solution in compress and not solution in seek_subtree(tree,solution): continue else: compress.add_ngram(solution) return compress.labels def search(query,corpus): tree = SuffixTree() if isinstance(corpus,SuffixTree): tree = corpus else: for elem in corpus: tree.add_ngram(elem) return list(find_all_common_spans(query,tree))
所以现在做你想做的事,做到这一点:
search("12345",["51234"]) search("623456",["12345623456"])
如果有什么不清楚的地方,请告诉我,我会尽力澄清.