Problem description
I have two datasets, dataset1 and dataset2, which share a common column named SAX that holds string objects.
dataset1=
SAX
0 gangsyu
1 zicobgm
2 eerptow
3 cqbsynt
4 zvmqben
.. ...
475 rfikekw
476 bnbzvqx
477 rsuhgax
478 ckhloio
479 lbzujtw
480 rows × 2 columns
and
dataset2=
SAX
0 gdmgsyu
1 zifgbgm
2 esdftow
3 cqtjgnt
4 znweben
.. ...
475 rfikekw
476 bnbzvqx
477 rsuhgax
478 ckhloio
479 lbzujtw
480 rows × 2 columns
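I need the output to be a column containing the sum of the edits/changes required to turn SAX (dataset1) into SAX (dataset2). The variation is basically what I consider an "edit/change" (example shown below).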
Taking the first row of SAX from dataset1 and dataset2 and comparing:
"gangsyu" and "gdmgsyu"
The first character "g" is a match, so move on.
The second character is not a match; it takes 3 edits to change "a" to "d".
The third character is not a match; it takes 1 edit to change "n" to "m".
The rest of the characters match.
I want the column to be the sum of the edits/changes, which is 3 + 1 = 4 (shown below).
dataset3=
sum_edits
0 4 (for the example shown right above)
1 0
2 1
3 2
4 0
.. ...
475 3
476 0
477 8
478 1
479 4
480 rows × 2 columns
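Solutions
There is a library that makes it easy to compute the Levenshtein distance (python-Levenshtein).
If you are willing to use that library, you can simply iterate over the datasets and compute distance(item_dataset_1, item_dataset_2).
However, in your example it is not clear to me why you count the change from a to d as 3 edits. Under Levenshtein distance it should count as 1 edit.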
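To make the point about Levenshtein distance concrete, here is a minimal pure-Python sketch of the standard dynamic-programming approach. The function name levenshtein is only illustrative; the python-Levenshtein library ships its own, faster implementation. Every substitution costs 1 regardless of which letters are involved, so the first row of the example comes out as 2 rather than 4.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # previous[j] is the distance between the already-processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


# "a" -> "d" and "n" -> "m" each count as a single substitution
print(levenshtein("gangsyu", "gdmgsyu"))  # 2
```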
Is this what you mean? I'm not entirely sure.
from collections import Counter

foo = ["gangsyu", "zicobgm", "eerptow", "cqbsynt", "zvmqben"]
bar = ["gdmgsyu", "zifgbgm", "esdftow", "cqtjgnt", "znweben"]

edits = 0
for word1, word2 in zip(foo, bar):
    # Count how often each letter occurs in each word
    x = Counter(word1)
    y = Counter(word2)
    # x - y keeps only the letters that occur more often in word1 than in word2
    edits += len(x - y)

# Outputs 13
print(edits)
Counter is great: it counts the items in a list/tuple and builds a dictionary from them.
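As a small illustration of what the Counter subtraction above does (a sketch added for clarity, not part of the original answer): subtracting one Counter from another keeps only the letters whose count is higher on the left-hand side. Note that letter positions are ignored entirely, so this counts differing letters rather than how far apart they are in the alphabet.

```python
from collections import Counter

x = Counter("gangsyu")
y = Counter("gdmgsyu")

# Letters that occur more often in "gangsyu" than in "gdmgsyu"
leftover = x - y
print(leftover)       # Counter({'a': 1, 'n': 1})
print(len(leftover))  # 2

# Position is ignored: an anagram leaves nothing behind
print(len(Counter("abc") - Counter("cba")))  # 0
```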
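Edit (after clarification)
The change from "a" to "d" counts as 3 edits because "d" is 3 letters after "a" in the alphabet, and "n" to "m" counts as 1 because they are 1 letter apart, right? I was confused about what that meant.
You could set up an alphabet index like this: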
alpahbetindex = {
"a":1,"b":2,"c":3,"d":4,"e":5,"f":6,"g":7,"h":8,"i":9,"j":10,"k":11,"l":12,"m":13,"n":14,"o":15,"p":16,"q":17,"r":18,"s":19,"t":20,"u":21,"v":22,"w":23,"x":24,"y":25,"z":26,}
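Then have this nested for loop compare each letter in each word and, if they differ, pass the two letters to a function that returns the positive difference between them based on the alphabet index.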
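foo = ["gangsyu", "zicobgm", "eerptow", "cqbsynt", "zvmqben"]
bar = ["gdmgsyu", "zifgbgm", "esdftow", "cqtjgnt", "znweben"]

def func(char1, char2):
    # Look up each letter's position in the alphabet (case-insensitive)
    x = alpahbetindex[char1.lower()]
    y = alpahbetindex[char2.lower()]
    # Positive difference between the two positions
    return max(x, y) - min(x, y)

edits = 0
for word1, word2 in zip(foo, bar):
    # Compare the two words position by position (assumes equal length)
    for i in range(len(word1)):
        if word1[i] != word2[i]:
            changes = func(word1[i], word2[i])
            edits += changes

# Outputs 128
print(edits)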
There is probably a better way to do this.
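One possible alternative, sketched under the assumption that every string contains only ASCII letters and that paired strings have the same length: Python's built-in ord returns each character's code point, so the explicit alphabet dictionary is not strictly needed. The function name letter_shift_distance is made up for this example.

```python
def letter_shift_distance(word1: str, word2: str) -> int:
    """Sum of alphabet-position differences between letters at the same index."""
    if len(word1) != len(word2):
        raise ValueError("words must have the same length")
    return sum(
        abs(ord(c1) - ord(c2))
        for c1, c2 in zip(word1.lower(), word2.lower())
    )


foo = ["gangsyu", "zicobgm", "eerptow", "cqbsynt", "zvmqben"]
bar = ["gdmgsyu", "zifgbgm", "esdftow", "cqtjgnt", "znweben"]

# One value per row instead of a single running total
sum_edits = [letter_shift_distance(w1, w2) for w1, w2 in zip(foo, bar)]
print(sum_edits)       # [4, 11, 38, 45, 30]
print(sum(sum_edits))  # 128
```

If the data lives in pandas DataFrames, as the question's output suggests, the same list comprehension over zip(dataset1["SAX"], dataset2["SAX"]) could presumably be assigned to a new sum_edits column.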
"gangsyu" and "gdmgsyu"
First character "g" is a match, so move on.
Second character is not a match; it takes 3 edits to change "a" to "d".
Third character is not a match; it takes 1 edit to change "n" to "m".
Rest of the characters match.
This is not Levenshtein distance but rather a variation of Hamming distance, although the latter only considers whether characters match or not.
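For comparison, a minimal sketch of plain Hamming distance (the function name hamming is illustrative): it counts the positions at which two equal-length strings differ, without weighting by how far apart the letters are.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(c1 != c2 for c1, c2 in zip(a, b))


# Two mismatched positions, versus the weighted total of 4 asked for in the question
print(hamming("gangsyu", "gdmgsyu"))  # 2
```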