Running a fuzzywuzzy ratio over a single pandas column

Problem description

I have a number of full-name examples:

datafile.csv:
fullname,dob
Jerry Smith,21/01/2010
Morty Smith,18/06/2008
Rick Sanchez,27/04/1993
Jery Smith,27/12/2012
Morti Smith,13/03/2012

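I am trying to use fuzz.ratio to check whether any of the names in the ['fullname'] column are similar to each other, but the code takes a very long time, mainly because of the nested for loops.

Example code: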

import pandas as pd
from fuzzywuzzy import fuzz

dataframe = pd.read_csv('datafile.csv')
_list = []
for row1 in dataframe['fullname']:
    for row2 in dataframe['fullname']:
        x = fuzz.ratio(row1,row2)
        if x > 90:
            _list.append([row1,row2,x])

print(_list)

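Is there a better way to iterate over a single pandas column to get the ratios for potentially duplicated data?

Thanks,
Jim

Solutions

You can first build the table of fuzzy scores: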

import pandas as pd
from io import StringIO
from fuzzywuzzy import fuzz

data = StringIO("""
Jerry Smith
Morty Smith
Rick Sanchez
Jery Smith
Morti Smith
""")

df = pd.read_csv(data,names=['full_name'])

for index,row in df.iterrows():
    df[row['full_name']] = df['full_name'].apply(lambda x:fuzz.ratio(row['full_name'],x))

print(df.to_string())

Output:

      full_name  Jerry Smith  Morty Smith  Rick Sanchez  Jery Smith  Morti Smith
0   Jerry Smith          100           73            26          95           64
1   Morty Smith           73          100            26          76           91
2  Rick Sanchez           26           26           100          27           35
3    Jery Smith           95           76            27         100           67
4   Morti Smith           64           91            35          67          100

Then find the best matches for a chosen name:

data_rows = df[df['Jerry Smith'] > 90]
print(data_rows)

Output:

     full_name  Jerry Smith  Morty Smith  Rick Sanchez  Jery Smith  Morti Smith
0  Jerry Smith          100           73            26          95           64
3   Jery Smith           95           76            27         100           67

import pandas as pd
from io import StringIO
from fuzzywuzzy import process

s = """full_name,dob
Jerry Smith,21/01/2010
Morty Smith,18/06/2008
Rick Sanchez,27/04/1993
Jery Smith,27/12/2012
Morti Smith,13/03/2012"""

df = pd.read_csv(StringIO(s))

# 1 - use fuzzywuzzy.process.extract inside a list comprehension
# 2 - you still have to iterate once, but this avoids apply, which can be very slow
# 3 - convert the list comprehension results to a dataframe
# Note: results are limited to one match per row; adjust limit as you see fit.
# Each row is compared against every other row (its own index is excluded).
df2 = pd.DataFrame(
    [
        process.extract(df['full_name'][i], df[~df.index.isin([i])]['full_name'], limit=1)[0]
        for i in range(len(df))
    ],
    index=df.index,
    columns=['match_name', 'match_percent', 'match_index'],
)
# join the new dataframe to the original
final = df.join(df2)
print(final)


      full_name         dob   match_name  match_percent  match_index
0   Jerry Smith  21/01/2010   Jery Smith             95            3
1   Morty Smith  18/06/2008  Morti Smith             91            4
2  Rick Sanchez  27/04/1993  Morti Smith             43            4
3    Jery Smith  27/12/2012  Jerry Smith             95            0
4   Morti Smith  13/03/2012  Morty Smith             91            1
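
The "best match other than itself" idea can also be sketched with only the standard library. This is an illustration under assumptions, not the code above: `simple_ratio` and `best_matches` are hypothetical helper names, and `difflib.SequenceMatcher` is a stand-in scorer rather than the default scorer used by `process.extract`, so the scores may differ from the table above.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Stand-in scorer (assumption): similarity as an integer from 0 to 100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_matches(names, scorer=simple_ratio):
    """For each name, return (name, best_other_name, score, index_of_best)."""
    results = []
    for i, name in enumerate(names):
        # score the name against every other entry, skipping itself
        candidates = [(scorer(name, other), j) for j, other in enumerate(names) if j != i]
        score, j = max(candidates, key=lambda c: c[0])
        results.append((name, names[j], score, j))
    return results


names = ["Jerry Smith", "Morty Smith", "Rick Sanchez", "Jery Smith", "Morti Smith"]
for row in best_matches(names):
    print(row)
```

Passing a different callable as `scorer` (for example `fuzz.ratio`, if fuzzywuzzy is installed) should work the same way, since the helper only assumes a function of two strings that returns a number.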

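Your comparison method does twice the work, because the fuzz ratio between "Jerry Smith" and "Morty Smith" is the same as the ratio between "Morty Smith" and "Jerry Smith".

If you only iterate over the remaining sub-array for each name, this finishes much faster.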

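import pandas as pd
from fuzzywuzzy import fuzz, process

dataframe = pd.read_csv('datafile.csv')
_list = []
for i_dataframe in range(len(dataframe) - 1):
    comparison_fullname = dataframe['fullname'][i_dataframe]
    # compare only against the rows that come after the current one
    remaining = dataframe['fullname'][i_dataframe + 1:]
    # limit=None returns every match instead of only the default top 5
    matches = process.extract(comparison_fullname, remaining, scorer=fuzz.ratio, limit=None)
    for match in matches:
        entry_fullname, entry_score = match[0], match[1]  # a Series also yields its index as a third item
        if entry_score >= 90:
            _list.append((comparison_fullname, entry_fullname, entry_score))
print(_list)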

This avoids any duplicated work.
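
The same "each pair only once" idea can be sketched with `itertools.combinations`, which yields every unordered pair exactly once and never pairs an item with itself. This is a minimal sketch under assumptions: `simple_ratio` and `similar_pairs` are hypothetical helper names, and `difflib.SequenceMatcher` stands in for `fuzz.ratio` so that the example runs without extra packages.

```python
from difflib import SequenceMatcher
from itertools import combinations


def simple_ratio(a, b):
    """Stand-in scorer (assumption): similarity as an integer from 0 to 100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def similar_pairs(names, threshold=90, scorer=simple_ratio):
    """Score each unordered pair once and keep those at or above threshold."""
    pairs = []
    for a, b in combinations(names, 2):
        score = scorer(a, b)
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs


names = ["Jerry Smith", "Morty Smith", "Rick Sanchez", "Jery Smith", "Morti Smith"]
n = len(names)
print(n * n, "comparisons in the nested loop")          # 25
print(n * (n - 1) // 2, "comparisons with combinations")  # 10
print(similar_pairs(names))
```

For five names the nested loop scores 25 pairs while the pairwise version scores 10, and the gap grows quadratically with the number of rows.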

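There are generally two things that help you improve performance:

1. Reduce the number of comparisons
2. Use a faster way to match the strings

In your implementation you perform many unnecessary comparisons, because you always compare A with B and later compare B with A again. You also compare A with A, which always scores 100. So you can cut the number of comparisons by more than 50%. Since you only want to keep matches with a score above 90, that information can also be used to speed up each comparison. This cannot be done in FuzzyWuzzy, but it can be done in RapidFuzz (I am the author). RapidFuzz implements the same algorithms as FuzzyWuzzy with a fairly similar interface, but with many performance improvements.

Your code could be implemented in the following way to apply both changes, which should be much faster. When testing on my machine, your code ran in about 12 seconds, while this improved version needed only 1.7 seconds.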
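
To illustrate why knowing the threshold in advance can save work, here is a small standard-library sketch. It is not how RapidFuzz is implemented; it only shows the general idea of rejecting a pair early when a cheap upper bound already falls below the cutoff. `ratio_with_cutoff` is a hypothetical helper, and it relies on `difflib.SequenceMatcher`, whose `real_quick_ratio()` and `quick_ratio()` methods are documented as upper bounds on `ratio()`.

```python
from difflib import SequenceMatcher


def ratio_with_cutoff(a, b, score_cutoff=90):
    """Return the 0-100 similarity, or 0 when it is below score_cutoff."""
    matcher = SequenceMatcher(None, a, b)
    cutoff = score_cutoff / 100
    # Both calls are cheap upper bounds on ratio(), so a pair that fails
    # either bound can be rejected without running the full comparison.
    if matcher.real_quick_ratio() < cutoff or matcher.quick_ratio() < cutoff:
        return 0
    ratio = matcher.ratio()
    return round(100 * ratio) if ratio >= cutoff else 0


print(ratio_with_cutoff("Jerry Smith", "Jery Smith"))    # close pair, keeps its score
print(ratio_with_cutoff("Jerry Smith", "Rick Sanchez"))  # poor pair, reported as 0
```

Returning 0 for rejected pairs mirrors the way the answer's code treats the result of `fuzz.ratio(..., score_cutoff=90)` as falsy for bad matches.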

import pandas as pd
from io import StringIO
from rapidfuzz import fuzz

# generate a bigger list of examples to show the performance benefits
s = "fullname,dob"
s+='''
Jerry Smith,13/03/2012'''*500

dataframe = pd.read_csv(StringIO(s))

# only create the data series once
full_names = dataframe['fullname']
_list = []
# index + 1 below relies on the default 0..n-1 index created by read_csv
for index, row1 in full_names.items():
    # skip elements that are already compared
    for row2 in full_names.iloc[index + 1:]:
        # use a score_cutoff to improve the runtime for bad matches
        score = fuzz.ratio(row1, row2, score_cutoff=90)
        if score:
            _list.append([row1, row2, score])

csv fuzzywuzzy pandas python similarity
