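Matching lists of words against a text column in a DataFrame

Question

I have 2 DataFrames. The first has a column of text data (over 10k rows), and the second holds keywords (nearly 100 lists).

DataFrame 1: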

           Text
a white house cat plays in garden
cat is a domestic species of small carnivorous mammal
cat is walking in behind white house
yellow banana is healthy

DataFrame 2:

ID     Keywords
1    ['cat','white']
2    ['garden','white','cat']
3    ['domestic','mammal']

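I want to add a column to DataFrame 1 holding the ID from DataFrame 2 whose keyword list matches the largest number of words in the text. If two or more IDs tie for the maximum, join those IDs together. In some cases no word matches at all; in that case, put 'No Match'.

Output: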

           Text                                                ID
a white house cat plays in garden                              2
cat is a domestic species of small carnivorous mammal          3
cat is walking in behind white house                           1,2
yellow banana is healthy                                       'No Match'

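Answer

This will work. For each text, it builds a list containing the number of matching words for every keyword list, then looks up the IDs whose count equals the maximum of that list.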
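To make the counting step concrete, here is a minimal sketch that traces it for a single sentence without pandas. The `keywords` dict and the variable names are illustrative only; the data is copied from the question's sample.

```python
# Keyword lists keyed by ID, copied from the question's sample data.
keywords = {1: ['cat', 'white'], 2: ['garden', 'white', 'cat'], 3: ['domestic', 'mammal']}

text = 'cat is walking in behind white house'
words = set(text.split(' '))

# Number of keywords from each list that occur in the sentence.
counts = {kw_id: len(words & set(kws)) for kw_id, kws in keywords.items()}

# IDs whose count equals the maximum, provided at least one word matched.
best = max(counts.values())
tied_ids = [kw_id for kw_id, n in counts.items() if n == best and best > 0]

print(counts)    # {1: 2, 2: 2, 3: 0}
print(tied_ids)  # [1, 2]
```

IDs 1 and 2 both match two words ('cat' and 'white'), so they tie and are reported together.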

import pandas as pd
import ast

df1 = pd.DataFrame(['a white house cat plays in garden','cat is a domestic species of small carnivorous mammal','cat is walking in behind white house','yellow banana is healthy'],columns=['Text'])
df2 = pd.DataFrame([ { "ID": 1,"Keywords": "['cat','white']" },{ "ID": 2,"Keywords": "['garden','white','cat']" },{ "ID": 3,"Keywords": "['domestic','mammal']" } ])
df2['Keywords'] = df2['Keywords'].apply(ast.literal_eval)

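def get_ids(text):
    # Count how many keywords from each list appear among the words of the text.
    words = set(text.split(" "))
    matches = [len(words & set(keywords)) for keywords in df2['Keywords']]
    # Keep the IDs whose count equals the maximum, as long as something matched.
    best = max(matches)
    matches_ids = [df2['ID'][index] for index, val in enumerate(matches) if val == best and best > 0]
    return ",".join(str(x) for x in matches_ids) if matches_ids else "No Match"
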
df1['ID'] = df1['Text'].apply(get_ids)

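Result:

	Text	ID
0	a white house cat plays in garden	2
1	cat is a domestic species of small carnivorous mammal	3
2	cat is walking in behind white house	1,2
3	yellow banana is healthy	No Match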
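With over 10k rows and around 100 keyword lists, it may be worth converting the keyword lists to sets once rather than on every call. The sketch below is one possible variant, not part of the original answer. The name `match_ids` and the `keyword_sets` dict are hypothetical, and it assumes that matching should ignore case and punctuation and that the keywords are stored in lower case.

```python
import re

# Keyword lists from the question's sample, converted to sets once up front.
keyword_sets = {
    1: {'cat', 'white'},
    2: {'garden', 'white', 'cat'},
    3: {'domestic', 'mammal'},
}

def match_ids(text, keyword_sets):
    """Return the comma-joined IDs with the most keyword hits, or 'No Match'."""
    # Assumption: lower-case the text and split on non-word characters,
    # so 'Garden.' still matches the keyword 'garden'.
    words = set(re.findall(r"\w+", text.lower()))
    counts = {kw_id: len(words & kws) for kw_id, kws in keyword_sets.items()}
    best = max(counts.values(), default=0)
    if best == 0:
        return 'No Match'
    return ','.join(str(kw_id) for kw_id, n in counts.items() if n == best)

print(match_ids('A white house cat plays in garden.', keyword_sets))  # 2
print(match_ids('yellow banana is healthy', keyword_sets))            # No Match
```

With the DataFrames from the answer, `keyword_sets` could presumably be built as `dict(zip(df2['ID'], df2['Keywords'].apply(set)))` and the function applied with `df1['Text'].apply(lambda t: match_ids(t, keyword_sets))`.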

dataframe keyword-search nlp pandas python-3.x