Problem description
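I have two data frames: Df_Address, with 347k distinct addresses, and Df_Project, with 24k records and the columns Project_Id, Project_Start_Date and Project_Address.
I want to check whether each Project_Address has a fuzzy match in Df_Address. If there is a match, I want to extract the corresponding Project_Id and Project_Start_Date. Below is the code I am trying: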
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

Df_Address = pd.read_csv("Cantractor_Addresses.csv")
Df_Project = pd.read_csv("Project_info.csv")
#address = list(Df_Project["Project_Address"])

def fuzzy_match(x, choices, cutoff):
    print(x)
    return process.extractOne(
        x, choices=choices, score_cutoff=cutoff
    )

Matched = Df_Address["Address"].apply(
    fuzzy_match, args=(
        Df_Project["Project_Address"], 80
    )
)
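The result for each address has the form
('matched_string', score)
So it gives me the similar string and its score, but I also need to extract Project_Id and Project_Start_Date. Since the data volume is large, can someone help me do this with parallel processing?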
Solution
You can convert the tuples into a data frame and then join it to the base data frame.
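import pandas as pd

Df_Address = pd.DataFrame({'address': ['abc', 'cdf'], 'random_stuff': [100, 200]})
# Stand-in for the (matched_string, score) tuples returned by the fuzzy match
Matched = (('abc', 10), ('cdf', 20))
# Turn the tuples into a two-column data frame
dist = pd.DataFrame(Matched)
dist.columns = ['address', 'distance']
# Left join so that every row of Df_Address is kept
final = Df_Address.merge(dist, how='left', on='address')
print(final)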
Output:
  address  random_stuff  distance
0     abc           100        10
1     cdf           200        20
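The merge above only attaches the score. For the other two parts of the question (carrying Project_Id and Project_Start_Date along with each match, and spreading the work over several processes), here is a minimal sketch, not a tuned implementation. It makes several assumptions: it uses the standard library's difflib.SequenceMatcher as a stand-in scorer instead of fuzzywuzzy so that it runs without extra packages, it works on plain lists of dictionaries (which could be obtained with Df_Project.to_dict("records")), and the function names, the demo rows, and the workers and chunk_size values are all hypothetical.

```python
from concurrent.futures import ProcessPoolExecutor
from difflib import SequenceMatcher
from functools import partial


def similarity(a, b):
    """Score two strings from 0 to 100 (stand-in for a fuzzywuzzy scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def best_project(address, projects, cutoff):
    """Return the best-scoring project record for one address, or None."""
    best, best_score = None, -1
    for project in projects:
        score = similarity(address, project["Project_Address"])
        if score > best_score:
            best, best_score = project, score
    if best is None or best_score < cutoff:
        return None
    # Keep the whole project record so Project_Id and
    # Project_Start_Date travel with the match.
    return {"Address": address, "score": best_score, **best}


def match_chunk(addresses, projects, cutoff):
    """Match a list of addresses; this is the unit of work per process."""
    return [best_project(a, projects, cutoff) for a in addresses]


def match_parallel(addresses, projects, cutoff=80, workers=4, chunk_size=1000):
    """Split the addresses into chunks and match them in worker processes."""
    chunks = [addresses[i:i + chunk_size]
              for i in range(0, len(addresses), chunk_size)]
    worker = partial(match_chunk, projects=projects, cutoff=cutoff)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of the chunks
        return [row for chunk in pool.map(worker, chunks) for row in chunk]


if __name__ == "__main__":
    # Hypothetical demo rows, not data from the question
    demo_projects = [
        {"Project_Id": 1, "Project_Start_Date": "2020-01-01",
         "Project_Address": "12 Main Street"},
        {"Project_Id": 2, "Project_Start_Date": "2021-06-15",
         "Project_Address": "45 Oak Avenue"},
    ]
    demo_addresses = ["12 main street", "45 Oak Avenue", "zzz"]
    for row in match_parallel(demo_addresses, demo_projects,
                              workers=2, chunk_size=2):
        print(row)
```

Addresses with no match at or above the cutoff come back as None, so they can be filtered out before building a data frame from the remaining dictionaries. The project list is sent to the workers along with each chunk, so larger chunks are probably the better choice with 24k projects. If you stay with fuzzywuzzy, process.extractOne also accepts a dictionary of choices and then returns the matching key together with the string and the score, which gives you a handle for looking up the other columns.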