How can I rewrite this code as an apply-lambda expression?

Problem description

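My dataframe (df) ends up with some NaN entries in the new column "s_score", which I can guard against by using func(x). That is, document_path_similarity() fails on some rows, and that prevents most_similar_docs() from running unless I go through func(x) first. D1 and D2 are df columns containing string data.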

df
Quality D1                                  D2
0   1   Ms Stewart,the chief executive...  Ms Stewart,61,its chief executive 
1   1   After more than two years' det...   After more than two years in 
def most_similar_docs():

    def func(x):
        try:
            return document_path_similarity(x['D1'],x['D2'])
        except:
            return np.nan
    df['s_score'] = df.apply(func,axis=1)

Is there a way to rewrite this code as a one-liner?

My attempts below result in either "ValueError: max() arg is an empty sequence" or a SyntaxError.

df['s_scores'] = df.apply(lambda x: document_path_similarity(x.D1,x.D2),axis=1)
paraphrases['s_scores'] = paraphrases.apply(lambda x: document_path_similarity(x.D1,axis=1 if np.isnan(x))

Solution

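I don't think there is anything wrong with your pandas code. What I did find is that similarity_score() fails because it tries to take the max of an empty list. I patched it by appending a zero score so that the list is never empty. This is the first time I have looked at this library, so please don't assume my patch is a high-quality one.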
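The failure mode can be reproduced without the library. The following is only a minimal sketch: the score lists are made up rather than real path_similarity() results, and best_score / best_score_default are hypothetical names. It shows that max() raises ValueError on an empty list, and that either appending [0] (the patch used in similarity_score() in this answer) or passing the built-in default=0 argument avoids the error.

```python
def best_score(scores):
    """Largest non-None score, or 0 when no valid score exists ([0] patch)."""
    valid = [s for s in scores if s is not None]
    return max(valid + [0])


def best_score_default(scores):
    """Same result, using the default= argument of the built-in max()."""
    valid = (s for s in scores if s is not None)
    return max(valid, default=0)


# max() on an empty list is what triggers the ValueError from the question
try:
    max([])
except ValueError as err:
    print("max([]) failed:", err)

print(best_score([None, None]))          # no valid scores, falls back to 0
print(best_score([0.25, None, 0.5]))     # ignores None, picks the largest
print(best_score_default([None, None]))  # same fallback via default=0
```

Both variants treat "no comparable score" as a similarity of zero, which is a design choice of the patch, not something the library prescribes.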

import io
import pandas as pd

# Rebuild the sample frame; columns are separated by two or more whitespace characters
df = pd.read_csv(io.StringIO("""  Quality  D1                                  D2
0   1   Ms Stewart,the chief executive...  Ms Stewart,61,its chief executive 
1   1   After more than two years' det...   After more than two years in """),sep=r"\s\s+",engine="python")

def similarity_score(s1,s2):
    list1 = []
    for a in s1:
        # patch +[0] at end so never finding max of empty list
        list1.append(max([i.path_similarity(a) for i in s2 if i.path_similarity(a) is not None]+[0]))
    output = sum(list1)/len(list1)
    return output

df = df.assign(
    s_scores=lambda x: x.apply(lambda r: document_path_similarity(r.D1,r.D2),axis=1)
)

print(df.to_string(index=False))

Output

 Quality                                 D1                                 D2  s_scores
       1  Ms Stewart,the chief executive...  Ms Stewart,61,its chief executive  0.838889
       1  After more than two years' det...       After more than two years in  0.912500
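
If you would rather keep the question's behaviour of recording NaN for rows that fail, note that a lambda cannot contain try/except, but it can call a wrapped function. A minimal sketch under stated assumptions: nan_on_error is a hypothetical helper (not part of pandas or NLTK), and fake_similarity is a stand-in for document_path_similarity so that the example runs on its own.

```python
import math
from functools import wraps


def nan_on_error(func):
    """Wrap func so that a ValueError yields NaN instead of propagating."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except ValueError:
            return math.nan
    return wrapper


def fake_similarity(d1, d2):
    # Stand-in only: 1.0 if any word of d1 occurs in d2, else 0.0.
    # Like the real function, it calls max() and so raises ValueError
    # when there is nothing to compare (d1 has no words).
    words2 = set(d2.split())
    return max(1.0 if w in words2 else 0.0 for w in d1.split())


safe_similarity = nan_on_error(fake_similarity)

# Plain tuples stand in for dataframe rows (D1, D2)
rows = [("chief executive", "its chief executive"), ("", "anything")]
scores = [safe_similarity(d1, d2) for d1, d2 in rows]
print(scores)
```

With the real function, the same wrapper should allow a one-liner along the lines of df['s_score'] = df.apply(lambda r: nan_on_error(document_path_similarity)(r.D1, r.D2), axis=1). Catching only ValueError, rather than using a bare except, keeps unrelated bugs visible.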