How do I add a new column to a Python pandas DataFrame by searching for keywords from given lists?

Question

I want to add a new column to a DataFrame based on identified keywords.

This is the current data (DataFrame name = df):

    Topic                   Count
0   This is Python          39
1   This is sql             6
2   This is Paython Pandas  98
3   import tkinter          81
4   Learning Python         94
5   sql Working             85
6   Pandas and Work         67
7   This is Pandas          30
8   Computer                20
9   Mobile Work             55
10  Smart Mobile            69

The output I want is as follows:

    Topic                   Count       Groups
0   This is Python          39          Python_Group
1   This is sql             6           sql_Group
2   This is Paython Pandas  98          Python_Group
3   import tkinter          81          Python_Group
4   Learning Python         94          Python_Group
5   sql Working             85          sql_Group
6   Pandas and Work         67          Python_Group
7   This is Pandas          30          Python_Group
8   Computer                20          Devices_Group
9   Mobile Work             55          Devices_Group
10  Smart Mobile            69          Devices_Group

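How the Groups column values are determined

The groups are created from keywords identified in the Topic column. If a particular word is found in Topic, the row must be assigned the corresponding group name.

Keyword lists for the Topic column: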

Python_Group = ['Python','Pandas','tkinter']
sql_Group = ['sql','Select']
Devices_Group = ['Computer','Mobile']

I have tried the following code:

df['Groups'] = [
    'Python Group' if "Python" in x 
    else 'Python Group' if "Pandas" in x
    else 'Python Group' if "tkinter" in x
    else 'sql Group' if "sql" in x
    else 'Devices Group' if "Computer" in x
    else 'Devices Group' if "Mobile" in x
    else '000' 
    for x in df['Topic']]
print(df)

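The code above gives me the output I want, but I would like to make it shorter and faster. The DataFrame has 2MM+ records, and writing 1k+ lines of code to define the groups by hand would be very difficult.

Is there a way to make use of the keyword lists for the Topic column? Any custom function that could help with this iteration would be welcome.

Code 2: After consulting experts on Stack Overflow, I tried the following alternative code: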

d = pd.read_excel('Map.xlsx').to_dict('list')
keyword_groups = {x:k for k,v in d.items() for x in v}
pat = '({})'.format('|'.join(keyword_groups))   #This line is giving an error
df['Groups'] = (df['Topic'].str.extract(pat,expand=False)
               .map(keyword_groups)
               .fillna('000'))

Error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-131-543675c0b403> in <module>
      3 
      4 keyword_groups = {x:k for k,v in d.items() for x in v}
----> 5 pat = '({})'.format('|'.join(keyword_groups))
      6 pat

TypeError: sequence item 5: expected str instance, float found

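Thanks for your help.

Solution

You can use np.select for this. np.select takes three arguments: a list of conditions, a list of corresponding results, and a default value that is used when none of the conditions match. When several conditions match the same row, the first one in the list wins.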

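import numpy as np

Python_Group = ['Python', 'Pandas', 'tkinter']
SQL_Group = ['SQL', 'Select']
Devices_Group = ['Computer', 'Mobile']

# One boolean mask per group: True where Topic contains any of the group's keywords
conditions = [
    df['Topic'].str.contains('|'.join(Python_Group)),
    df['Topic'].str.contains('|'.join(SQL_Group)),
    df['Topic'].str.contains('|'.join(Devices_Group)),
]

# Group names, in the same order as the conditions
results = ["Python_Group", "SQL_Group", "Devices_Group"]

# Rows that match no condition get the default '000'
df['Groups'] = np.select(conditions, results, '000')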
#output:
    Topic                   Count   Groups
0   This is Python          39      Python_Group
1   This is SQL             6       SQL_Group
2   This is Paython Pandas  98      Python_Group
3   import tkinter          81      Python_Group
4   Learning Python         94      Python_Group
5   SQL Working             85      SQL_Group
6   Pandas and Work         67      Python_Group
7   This is Pandas          30      Python_Group
8   Computer                20      Devices_Group
9   Mobile Work             55      Devices_Group
10  Smart Mobile            69      Devices_Group
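
Note that str.contains is case-sensitive by default, so the keyword 'SQL' would not match the lowercase 'sql' in the question's original data. The following is a minimal sketch of the same np.select idea, assuming case-insensitive matching is acceptable. The `groups` dict is an illustrative name, not part of the original answer, and the conditions are built in a loop so that adding a group needs no new code.

```python
import numpy as np
import pandas as pd

# The sample data from the question (lowercase 'sql' kept as posted)
df = pd.DataFrame({
    'Topic': ['This is Python', 'This is sql', 'This is Paython Pandas',
              'import tkinter', 'Learning Python', 'sql Working',
              'Pandas and Work', 'This is Pandas', 'Computer',
              'Mobile Work', 'Smart Mobile'],
    'Count': [39, 6, 98, 81, 94, 85, 67, 30, 20, 55, 69],
})

# Illustrative mapping of group name -> keywords (an assumption of this sketch)
groups = {
    'Python_Group': ['Python', 'Pandas', 'tkinter'],
    'SQL_Group': ['SQL', 'Select'],
    'Devices_Group': ['Computer', 'Mobile'],
}

# One boolean mask per group; case=False lets 'SQL' match 'sql'
conditions = [
    df['Topic'].str.contains('|'.join(words), case=False)
    for words in groups.values()
]

# Dict order is preserved, so group names line up with the conditions
df['Groups'] = np.select(conditions, list(groups), default='000')
print(df)
```

Because np.select checks the conditions in order, a topic containing keywords from two groups is assigned to whichever group appears first in `groups`.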

Another approach would be to maintain your groups and keywords in a dict:

d = {'Python_Group': ['Python', 'Pandas', 'tkinter'],
     'SQL_Group': ['SQL', 'Select'],
     'Devices_Group': ['Computer', 'Mobile']}

From here, you can easily invert it into a "keyword: group" dict:

keyword_groups = {x: k for k, v in d.items() for x in v}
# {'Python': 'Python_Group',
#  'Pandas': 'Python_Group',
#  'tkinter': 'Python_Group',
#  'SQL': 'SQL_Group',
#  'Select': 'SQL_Group',
#  'Computer': 'Devices_Group',
#  'Mobile': 'Devices_Group'}

pat = '({})'.format('|'.join(keyword_groups))

df['Groups'] = (df['Topic'].str.extract(pat,expand=False)
               .map(keyword_groups)
               .fillna('000'))

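This uses Series.str.extract to find the keywords with a regular expression, map to translate each matched keyword into its group, and fillna to catch any rows that match no group. The result: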

                     Topic  Count          Groups
0           This is Python     39    Python_Group
1              This is SQL      6       SQL_Group
2   This is Paython Pandas     98    Python_Group
3           import tkinter     81    Python_Group
4          Learning Python     94    Python_Group
5              SQL Working     85       SQL_Group
6          Pandas and Work     67    Python_Group
7           This is Pandas     30    Python_Group
8                 Computer     20   Devices_Group
9              Mobile Work     55   Devices_Group
10            Smart Mobile     69   Devices_Group
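
The TypeError in the question ("expected str instance, float found") is consistent with NaN values in the mapping. When the columns of the spreadsheet have different lengths, pandas pads the shorter ones with NaN, which is a float, and '|'.join cannot join it. The following is a minimal sketch of one way around this, assuming that is the cause. The `mapping` DataFrame is a hypothetical stand-in for the contents of Map.xlsx, and the re.escape call is an added precaution in case a keyword contains regex metacharacters.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical stand-in for pd.read_excel('Map.xlsx'):
# shorter columns are padded with NaN (a float)
mapping = pd.DataFrame({
    'Python_Group': ['Python', 'Pandas', 'tkinter'],
    'SQL_Group': ['SQL', 'Select', np.nan],
    'Devices_Group': ['Computer', 'Mobile', np.nan],
})

d = mapping.to_dict('list')

# Keep only real strings so that '|'.join never sees a float NaN
keyword_groups = {x: k for k, v in d.items() for x in v if isinstance(x, str)}

# Escape each keyword before building the alternation pattern
pat = '({})'.format('|'.join(re.escape(x) for x in keyword_groups))

# Small illustrative frame; 'Cooking' matches no keyword
df = pd.DataFrame({'Topic': ['This is Python', 'SQL Working',
                             'Smart Mobile', 'Cooking']})

df['Groups'] = (df['Topic'].str.extract(pat, expand=False)
                .map(keyword_groups)
                .fillna('000'))
print(df)
```

Unlike np.select, str.extract returns the leftmost keyword found in each string, so when a topic contains keywords from several groups, its position in the text decides the group rather than the order of the groups.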
