Comparing strings in two pandas columns

Question

I am trying to measure the similarity between two columns of a pandas DataFrame:

Text1                                                                             All
Performance results achieved by the approaches submitted to this Challenge.       The six top approaches and three others outperform the strong baseline.
Accuracy is one of the basic principles of perfectionist.                             Where am I?

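I want to compare 'Performance results ...' with 'The six...', and 'Accuracy is one...' with 'Where am I?'. The first row should have a higher similarity between the two columns, because they share some words. The second row should be 0, because the two columns have no words in common.

To compare the two columns I used SequenceMatcher, as follows: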

from difflib import SequenceMatcher

ratio = SequenceMatcher(None,df.Text1,df.All).ratio()

But passing df.Text1 and df.All seems to be wrong.

Can you tell me why?

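Solution

  • SequenceMatcher is not designed to work on pandas Series.
  • You can .apply a function to each row instead.
  • SequenceMatcher Examples
    • With isjunk=None, nothing is treated as junk, not even spaces.
    • With isjunk=lambda y: y == " ", spaces are treated as junk.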
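To make the first point concrete, here is a minimal sketch of what happens when whole columns are passed. SequenceMatcher treats each argument as a single sequence, so the elements it compares are entire cell strings rather than characters. Plain lists stand in for the two columns here (an assumption, to keep the sketch free of pandas), and the variable names are illustrative only.

```python
from difflib import SequenceMatcher

# Plain lists standing in for df.Text1 and df.All (assumption for illustration)
text1 = [
    'Performance results achieved by the approaches submitted to this Challenge.',
    'Accuracy is one of the basic principles of perfectionist.',
]
all_ = [
    'The six top approaches and three others outperform the strong baseline.',
    'Where am I?',
]

# Whole columns: each element is an entire string, no two strings are identical,
# so nothing matches and only a single number comes back for the whole column.
whole_column_ratio = SequenceMatcher(None, text1, all_).ratio()

# Row by row: each pair of strings is compared character by character.
per_row_ratios = [SequenceMatcher(None, a, b).ratio() for a, b in zip(text1, all_)]

print(whole_column_ratio)
print(per_row_ratios)
```

The row-by-row comparison is what the .apply call with axis=1 does on the DataFrame.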
from difflib import SequenceMatcher
import pandas as pd

data = {'Text1': ['Performance results achieved by the approaches submitted to this Challenge.','Accuracy is one of the basic principles of perfectionist.'],'All': ['The six top approaches and three others outperform the strong baseline.','Where am I?']}

df = pd.DataFrame(data)

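# isjunk=lambda y: y == " " (spaces are treated as junk)
# Columns are accessed by label; positional x[0] on a labelled row is deprecated in recent pandas.
df['ratio'] = df[['Text1','All']].apply(lambda x: SequenceMatcher(lambda y: y == " ", x['Text1'], x['All']).ratio(), axis=1)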

# display(df)
                                                                         Text1                                                                      All     ratio
0  Performance results achieved by the approaches submitted to this Challenge.  The six top approaches and three others outperform the strong baseline.  0.356164
1                    Accuracy is one of the basic principles of perfectionist.                                                              Where am I?  0.088235

# isjunk=None (nothing is treated as junk)
df['ratio'] = df[['Text1','All']].apply(lambda x: SequenceMatcher(None, x['Text1'], x['All']).ratio(), axis=1)

# display(df)
                                                                         Text1                                                                      All     ratio
0  Performance results achieved by the approaches submitted to this Challenge.  The six top approaches and three others outperform the strong baseline.  0.410959
1                    Accuracy is one of the basic principles of perfectionist.                                                              Where am I?  0.117647
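Both outputs give the second row a non-zero ratio, because the strings are compared character by character, whereas the question expected 0 when no words are shared. One possible way to get closer to that, not part of the original answer, is to pass SequenceMatcher lists of words instead of strings. The helper name word_ratio is hypothetical, and lowercasing and splitting on whitespace is a simplifying assumption (punctuation stays attached to words).

```python
from difflib import SequenceMatcher


def word_ratio(a: str, b: str) -> float:
    """Similarity of two strings compared word by word (hypothetical helper)."""
    # Lowercase and split on whitespace so the sequence elements are words,
    # not characters. Punctuation is left attached to the neighbouring word.
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


rows = [
    ('Performance results achieved by the approaches submitted to this Challenge.',
     'The six top approaches and three others outperform the strong baseline.'),
    ('Accuracy is one of the basic principles of perfectionist.',
     'Where am I?'),
]

ratios = [word_ratio(a, b) for a, b in rows]
print(ratios)
```

On the DataFrame this could be applied as df.apply(lambda x: word_ratio(x['Text1'], x['All']), axis=1). The second row then scores 0.0 because the two cells share no words, while the first row stays above 0.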
