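Problem description
My DataFrame (df) has some NaN entries in the new column "s_score", and I can exclude them by using func(x). That is, running document_path_similarity() produces NaN for some rows, which prevents most_similar_docs() from running unless I go through func(x) first. D1 and D2 are df columns holding string data.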
df
   Quality  D1                                 D2
0  1        Ms Stewart,the chief executive...  Ms Stewart,61,its chief executive
1  1        After more than two years' det...  After more than two years in
def most_similar_docs():
    def func(x):
        try:
            return document_path_similarity(x['D1'],x['D2'])
        except:
            return np.nan
    df['s_score'] = df.apply(func,axis=1)
My attempts below result in either "ValueError: max() arg is an empty sequence" or a SyntaxError.
df['s_scores'] = df.apply(lambda x: document_path_similarity(x.D1,x.D2),axis=1)
paraphrases['s_scores'] = paraphrases.apply(lambda x: document_path_similarity(x.D1,axis=1 if np.isnan(x))
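Solution
I don't think there is anything wrong with your pandas code. What I did find is that similarity_score() fails because it tries to take the max of an empty list. I made the list non-empty by always appending a score of zero. This is the first time I have looked at this library, so please don't assume my patch is a high-quality one.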
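The failure can be reproduced without the library at all. The snippet is only a sketch: best_score() and best_score_padded() are illustrative names, not functions from the question, and the lists stand in for the per-word similarity scores.

```python
# An empty list models the case where no usable similarity score was found.
try:
    max([])
    raised = False
except ValueError:
    # This is the error the question reports.
    raised = True


def best_score_padded(scores):
    """Append a 0 so max() always has at least one element (the patch used here)."""
    return max(scores + [0])


def best_score(scores):
    """Same effect using the built-in default argument of max()."""
    return max(scores, default=0)


print(raised, best_score_padded([]), best_score([]), best_score([0.2, 0.5]))
```

Both forms return 0 for an empty list and leave non-empty lists of non-negative scores unchanged. Whether 0 is the right score for "nothing comparable" is a judgment call.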
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""   Quality  D1  D2
0  1  Ms Stewart,the chief executive...  Ms Stewart,61,its chief executive
1  1  After more than two years' det...  After more than two years in """),sep=r"\s\s+",engine="python")

def similarity_score(s1,s2):
    list1 = []
    for a in s1:
        # patch: +[0] at the end so max() never receives an empty list
        list1.append(max([i.path_similarity(a) for i in s2 if i.path_similarity(a) is not None]+[0]))
    output = sum(list1)/len(list1)
    return output

# document_path_similarity() comes from the question's own code (not shown here)
df = df.assign(
    s_scores=lambda x: x.apply(lambda r: document_path_similarity(r.D1,r.D2),axis=1)
)
print(df.to_string(index=False))
Output
Quality D1 D2 s_scores
1 Ms Stewart,its chief executive 0.838889
1 After more than two years' det... After more than two years in 0.912500
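If you would rather keep the try/except wrapper and let some rows come out as NaN, those rows can be dropped afterwards instead of inside the apply call. This is a minimal sketch under stated assumptions: fake_similarity() is a hypothetical stand-in for document_path_similarity() (a simple word-overlap ratio that raises on empty text), and demo is made-up data, not the DataFrame from the question.

```python
import numpy as np
import pandas as pd


def fake_similarity(a, b):
    # Hypothetical stand-in for document_path_similarity(); not the real scorer.
    # Raises ValueError on empty input to mimic the max() failure.
    words_a, words_b = set(a.split()), set(b.split())
    if not words_a or not words_b:
        raise ValueError("nothing to compare")
    return len(words_a & words_b) / len(words_a | words_b)


def safe_score(row):
    # Same idea as func(x) in the question: turn a failure into NaN.
    try:
        return fake_similarity(row["D1"], row["D2"])
    except ValueError:
        return np.nan


demo = pd.DataFrame({"D1": ["a b c", ""], "D2": ["a b d", "x y"]})
demo["s_score"] = demo.apply(safe_score, axis=1)

# Keep only the rows that actually received a score.
scored = demo.dropna(subset=["s_score"])
print(scored)
```

Filtering after the fact keeps the scoring step and the clean-up step separate, which avoids squeezing a condition into the lambda.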