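Is there a way to compute a Levenshtein-style edit distance character by character between two string columns?

Problem description

I have two datasets, dataset1 and dataset2, which share a common column named SAX that holds string objects.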

dataset1=
         SAX
0    gangsyu
1    zicobgm
2    eerptow
3    cqbsynt
4    zvmqben
..       ...
475  rfikekw
476  bnbzvqx
477  rsuhgax
478  ckhloio
479  lbzujtw

480 rows × 2 columns

and

dataset2=
         SAX
0    gdmgsyu
1    zifgbgm
2    esdftow
3    cqtjgnt
4    znweben
..       ...
475  rfikekw
476  bnbzvqx
477  rsuhgax
478  ckhloio
479  lbzujtw

480 rows × 2 columns

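The output I need is a single column holding the total number of edits/changes required to turn SAX (dataset1) into SAX (dataset2). How far a character changes is what I count as an "edit/change" (example below).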

Taking the first row of SAX from dataset1 and dataset2 and comparing:
"gangsyu" and "gdmgsyu"

First character "g" is a match, so move on.
Second character is not a match; it takes 3 edits to change "a" to "d".
Third character is not a match; it takes 1 edit to change "n" to "m".
The rest of the characters match.
I want the column to be the sum of the edits/changes, which is 3 + 1 = 4 (shown below).
dataset3= 
     sum_edits
0    4 (for the example shown right above)
1    0
2    1
3    2
4    0
..       ...
475  3
476  0
477  8
478  1
479  4

480 rows × 2 columns

Is there a function or method that does this? Any help would be appreciated. Thanks.

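Solution

There is a library that makes computing the Levenshtein distance easy (python-Levenshtein).

If you are willing to use that library, you can simply iterate over the datasets and compute distance(item_dataset_1,item_dataset_2).

However, in your example it is not clear to me why you count the change from a to d as 3 edits. The Levenshtein distance would count it as 1 edit.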
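For reference, here is a minimal pure-Python sketch of the standard unit-cost Levenshtein distance. It is not the python-Levenshtein implementation, and the function name `levenshtein` is only illustrative. It shows that any substitution, including "a" to "d", costs exactly 1.

```python
def levenshtein(s, t):
    """Classic dynamic-programming Levenshtein distance with unit costs."""
    # prev[j] is the distance between the previous prefix of s and t[:j]
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty string
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


print(levenshtein("a", "d"))              # 1
print(levenshtein("gangsyu", "gdmgsyu"))  # 2: two substitutions, 1 each
```

So for the first row the Levenshtein distance is 2, not the 4 the question expects.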

Is this what you mean? I'm not entirely sure.

from collections import Counter

foo = ["gangsyu","zicobgm","eerptow","cqbsynt","zvmqben"]
bar = ["gdmgsyu","zifgbgm","esdftow","cqtjgnt","znweben"]

edits = 0

for word1,word2 in zip(foo,bar):
    # Count how often each letter occurs in each word
    x = Counter(word1)
    y = Counter(word2)
    # x - y keeps the letters that occur more often in word1 than in word2;
    # len() counts the distinct letters left over
    edits += len(x - y)

# Outputs 13
print(edits)

Counter is great: it counts the items in a list/tuple (or any iterable) and builds a dictionary of counts from them.
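A small sketch of what the Counter subtraction above computes for the first pair of words. Note that it compares letter counts only and ignores positions, which may or may not be what the question wants.

```python
from collections import Counter

x = Counter("gangsyu")
y = Counter("gdmgsyu")

print(x["g"])        # 2, because "g" occurs twice
diff = x - y         # subtraction keeps only positive counts
print(sorted(diff))  # ['a', 'n']
print(len(diff))     # 2

# Positions are ignored: the same letters in a different order give no difference
print(len(Counter("ab") - Counter("ba")))  # 0
```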

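Edit (after clarification)

The change from "a" to "d" counts as 3 edits because "d" is 3 letters after "a" in the alphabet, and "n" to "m" counts as 1 because they are 1 letter apart, right? I was confused about what that meant.

You could set up a letter index like this: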

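alphabet_index = {
    "a":1,"b":2,"c":3,"d":4,"e":5,"f":6,"g":7,"h":8,"i":9,"j":10,"k":11,"l":12,"m":13,
    "n":14,"o":15,"p":16,"q":17,"r":18,"s":19,"t":20,"u":21,"v":22,"w":23,"x":24,"y":25,"z":26,
}

Then use a nested for loop to compare each letter of each word; when two letters differ, pass both to a function that returns the positive difference between their alphabet indices.

foo = ["gangsyu","zicobgm","eerptow","cqbsynt","zvmqben"]
bar = ["gdmgsyu","zifgbgm","esdftow","cqtjgnt","znweben"]

def func(char1,char2):
    # Absolute distance between the two letters' positions in the alphabet
    x = alphabet_index[char1.lower()]
    y = alphabet_index[char2.lower()]
    return abs(x - y)

edits = 0
for word1,word2 in zip(foo,bar):
    # Compare the two words position by position
    for char1,char2 in zip(word1,word2):
        if char1 != char2:
            changes = func(char1,char2)
            edits += changes

# Outputs 128
print(edits)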

There may be a better way to do this.
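One possibly simpler variant, sketched under the assumption that every string contains only ASCII letters and that paired words have equal length: `ord()` already gives consecutive code points for "a" to "z", so no lookup table is needed. The helper name `sum_edits` is hypothetical, chosen to match the column name in the question. This version also keeps one value per row instead of a single grand total.

```python
def sum_edits(word1, word2):
    """Sum of alphabet-position differences, compared position by position."""
    if len(word1) != len(word2):
        raise ValueError("words must have the same length")
    # abs(ord(a) - ord(b)) is 0 for matching letters, so no explicit check is needed
    return sum(abs(ord(a) - ord(b)) for a, b in zip(word1.lower(), word2.lower()))


foo = ["gangsyu", "zicobgm", "eerptow", "cqbsynt", "zvmqben"]
bar = ["gdmgsyu", "zifgbgm", "esdftow", "cqtjgnt", "znweben"]

per_row = [sum_edits(a, b) for a, b in zip(foo, bar)]
print(per_row)       # [4, 11, 38, 45, 30]
print(sum(per_row))  # 128
```

With pandas, the same list comprehension over `zip(dataset1["SAX"], dataset2["SAX"])` could be assigned to a new column, assuming both frames have the same number of rows in the same order.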

"gangsyu" and "gdmgsyu"

First character "g" is a match,so move on.
Second character is not a match,it takes 3 edits to change "a" to "d". 
Third character is not a match,it takes 1 edit to change "n" to "m"
Rest of the characters match.

This is not Levenshtein distance but a variant of Hamming distance, although the latter only considers whether characters match or not.
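To make the relationship concrete, here is a rough sketch in which both measures share one position-by-position loop and differ only in the per-character cost. The names `positional_distance`, `mismatch`, and `alphabet_gap` are illustrative, not from any library, and the gap cost assumes plain ASCII letters.

```python
def positional_distance(s, t, cost):
    """Sum a per-character cost over aligned positions of two equal-length strings."""
    if len(s) != len(t):
        raise ValueError("strings must have equal length")
    return sum(cost(a, b) for a, b in zip(s, t))


def mismatch(a, b):
    # Plain Hamming cost: 1 if the characters differ, 0 otherwise
    return int(a != b)


def alphabet_gap(a, b):
    # Weighted cost from the question: distance between letters in the alphabet
    return abs(ord(a) - ord(b))


print(positional_distance("gangsyu", "gdmgsyu", mismatch))      # 2
print(positional_distance("gangsyu", "gdmgsyu", alphabet_gap))  # 4
```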
