Pandas: calculate overlapping words between rows only if values in another column match (issue with multiple instances)

Problem description

I have a dataframe that looks like the one below, but with many more rows:

import pandas as pd

# Note: as posted, 'intent' has 5 entries but 'Sent' and 'key_words' have 7 (pandas requires equal lengths)
data = {
    'intent': ['order_food', 'order_food', 'order_taxi', 'order_call', 'order_taxi'],
    'Sent': ['i need hamburger', 'she wants sushi', 'i need a cab', 'call me at 6', 'she called me', 'order call', 'i would like a new taxi'],
    'key_words': [['need', 'hamburger'], ['want', 'sushi'], ['need', 'cab'], ['call', '6'], ['call'], ['order', 'call'], ['new', 'taxi']],
}

df = pd.DataFrame(data, columns=['intent', 'Sent', 'key_words'])

I computed the word overlap (the intersection step of a Jaccard similarity) using the code below (not my own solution):

def lexical_overlap(doc1, doc2):
    # treat each document as a set of unique words
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)

    # words that appear in both documents
    intersection = words_doc1.intersection(words_doc2)
    return intersection
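
Note that `lexical_overlap` returns only the shared words, not a score. As a rough sketch, a Jaccard score could be derived from the same two sets by dividing the size of the intersection by the size of the union. The name `jaccard_similarity` and the choice to score two empty inputs as 0 are assumptions made for illustration, not part of the original code:

```python
def jaccard_similarity(doc1, doc2):
    """Return |A & B| / |A | B| for two iterables of words (hypothetical helper)."""
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)
    union = words_doc1 | words_doc2
    if not union:
        return 0.0  # assumption: two empty documents score 0
    return len(words_doc1 & words_doc2) / len(union)


# one shared word ('need') out of three distinct words
score = jaccard_similarity(['need', 'hamburger'], ['need', 'cab'])
```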

I modified the code given by @Amit Amola to compare the overlapping words between every pair of rows and build a dataframe from the results:

from itertools import combinations

overlapping_word_list = []

# data_new is assumed to hold the sentence in column 0 and its keywords in column 1
for i, j in combinations(range(len(data_new)), 2):
    overlapping_word_list.append(
        f"the shared keywords between {data_new.iloc[i, 0]} and {data_new.iloc[j, 0]} "
        f"sentences are: {lexical_overlap(data_new.iloc[i, 1], data_new.iloc[j, 1])}"
    )
# creating an overlap dataframe
banking_overlapping_words_per_sent = pd.DataFrame(overlapping_word_list, columns=['overlapping_list'])

@gold_cy's answer helped me, and I made a few changes to it to get the output I wanted:

for intent in df.intent.unique():
    # loc returns a DataFrame but we need just the column
    rows = df.loc[df.intent == intent,['intent','key_words','Sent']].values.tolist()
    combos = combinations(rows,2)
    for combo in combos:
        x,y = rows
        overlap = lexical_overlap(x[1],y[1])
        print(f"Overlap of intent ({x[0]}) for ({x[2]}) and ({y[2]}) is {overlap}")

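The problem is that when there are more than two instances of the same intent, I get this error: ValueError: too many values to unpack (expected 2)

I don't know how to handle this for the larger number of examples in my dataset.

Solution

Is this what you want?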

from itertools import combinations

items_to_consider = []
# Each row becomes (intent, sentence, keyword set); compare every pair of rows
rows = zip(df.intent.values, df.Sent.values, map(set, df.key_words.values))
for (intent1, sent1, kw1), (intent2, sent2, kw2) in combinations(rows, 2):
    if intent1 != intent2:
        continue  # only compare rows that share an intent
    intersect = kw1.intersection(kw2)
    if len(intersect) > 0:
        items_to_consider.append([intent1, sent1, sent2, intersect])


for intent, sent1, sent2, intersect in items_to_consider:
    for item in intersect:
        # report a keyword only if it literally appears in both sentences
        if item in sent1 and item in sent2:
            print(f"Overlap of intent ({intent}) for ({sent1}) and ({sent2}) is {item}")
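
As for the error itself: in the loop from the question, `x,y = rows` unpacks the whole list of rows for the intent rather than the current pair, so it only works when an intent has exactly two rows. With three or more rows, Python raises `ValueError: too many values to unpack (expected 2)`. Below is a minimal sketch of that same loop with the pair unpacked from `combinations` instead. The small frame is illustrative sample data only (which intent belongs to which sentence is an assumption), and the `results` list is a hypothetical addition for collecting the output:

```python
from itertools import combinations

import pandas as pd

# Illustrative sample only: the intent assigned to each sentence is an assumption.
df = pd.DataFrame({
    'intent': ['order_call', 'order_call', 'order_call', 'order_taxi', 'order_taxi'],
    'Sent': ['call me at 6', 'she called me', 'order call',
             'i need a cab', 'i would like a new taxi'],
    'key_words': [['call', '6'], ['call'], ['order', 'call'],
                  ['need', 'cab'], ['new', 'taxi']],
})


def lexical_overlap(doc1, doc2):
    # words present in both keyword lists
    return set(doc1) & set(doc2)


results = []  # hypothetical container: (intent, sentence 1, sentence 2, overlap)
for intent in df.intent.unique():
    rows = df.loc[df.intent == intent, ['intent', 'key_words', 'Sent']].values.tolist()
    # unpack each pair produced by combinations, not the full list of rows
    for x, y in combinations(rows, 2):
        overlap = lexical_overlap(x[1], y[1])
        results.append((x[0], x[2], y[2], overlap))
        print(f"Overlap of intent ({x[0]}) for ({x[2]}) and ({y[2]}) is {overlap}")
```

Because each pair is unpacked separately, the number of rows per intent no longer matters: an intent with three rows simply yields three pairs.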