Problem description
I want to add a new column to a DataFrame based on identified keywords.
This is the current data (DataFrame name = df):
Topic Count
0 This is Python 39
1 This is sql 6
2 This is Paython Pandas 98
3 import tkinter 81
4 Learning Python 94
5 sql Working 85
6 Pandas and Work 67
7 This is Pandas 30
8 Computer 20
9 Mobile Work 55
10 Smart Mobile 69
The output I want is as follows:
Topic Count Groups
0 This is Python 39 Python_Group
1 This is sql 6 sql_Group
2 This is Paython Pandas 98 Python_Group
3 import tkinter 81 Python_Group
4 Learning Python 94 Python_Group
5 sql Working 85 sql_Group
6 Pandas and Work 67 Python_Group
7 This is Pandas 30 Python_Group
8 Computer 20 Devices_Group
9 Mobile Work 55 Devices_Group
10 Smart Mobile 69 Devices_Group
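How the Groups column values are identified:
The groups are created based on the keywords found in the Topic column. If a specific word is found in Topic, the corresponding group name must be assigned.
Keyword lists for the Topic column: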
Python_Group = ['Python','Pandas','tkinter']
sql_Group = ['sql','Select']
Devices_Group = ['Computer','Mobile']
I have already tried the following code:
df['Groups'] = [
'Python Group' if "Python" in x
else 'Python Group' if "Pandas" in x
else 'Python Group' if "tkinter" in x
else 'sql Group' if "sql" in x
else 'Devices Group' if "Computer" in x
else 'Devices Group' if "Mobile" in x
else '000'
for x in df['Topic']]
print(df)
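The code above gives me the output I want, but I would like something shorter and faster: the DataFrame has roughly 2MM+ records, and writing 1k+ lines of code to define the groupings is very difficult for me.
Is there a way to make use of the keyword lists that belong to the Topic column?
Or
is there a custom function that could help me with this iteration?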
d = pd.read_excel('Map.xlsx').to_dict('list')
keyword_groups = {x:k for k,v in d.items() for x in v}
pat = '({})'.format('|'.join(keyword_groups)) #This line is giving an error
df['Groups'] = (df['Topic'].str.extract(pat,expand=False)
.map(keyword_groups)
.fillna('000'))
Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-131-543675c0b403> in <module>
3
4 keyword_groups = {x:k for k,v in d.items() for x in v}
----> 5 pat = '({})'.format('|'.join(keyword_groups))
6 pat
TypeError: sequence item 5: expected str instance,float found
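Thanks for your help.
Solution
You can use np.select for this. np.select takes 3 arguments: a list of conditions, a list of results (one per condition), and a default value that is used where no condition matches.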
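import numpy as np

Python_Group = ['Python', 'Pandas', 'tkinter']
SQL_Group = ['SQL', 'Select']
Devices_Group = ['Computer', 'Mobile']

# One boolean mask per group: True where Topic contains any of the group's keywords
conditions = [
    df['Topic'].str.contains('|'.join(Python_Group)),
    df['Topic'].str.contains('|'.join(SQL_Group)),
    df['Topic'].str.contains('|'.join(Devices_Group)),
]
# Group names, in the same order as the conditions
results = ["Python_Group", "SQL_Group", "Devices_Group"]

# Rows that match no condition get the default '000'
df['Groups'] = np.select(conditions, results, '000')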
#output:
Topic Count Groups
0 This is Python 39 Python_Group
1 This is SQL 6 SQL_Group
2 This is Paython Pandas 98 Python_Group
3 import tkinter 81 Python_Group
4 Learning Python 94 Python_Group
5 SQL Working 85 SQL_Group
6 Pandas and Work 67 Python_Group
7 This is Pandas 30 Python_Group
8 Computer 20 Devices_Group
9 Mobile Work 55 Devices_Group
10 Smart Mobile 69 Devices_Group
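With many groups, writing one condition per group by hand gets long. A minimal sketch, assuming the groups are kept in a dict: the names `groups` and `assign_groups` are hypothetical and not part of the answer above, and `case=False` is an added assumption so that both 'sql' and 'SQL' match. The condition list is built in a loop and passed to np.select as before:

```python
import re

import numpy as np
import pandas as pd

# Hypothetical mapping of group name -> keywords (same lists as in the answer).
groups = {
    "Python_Group": ["Python", "Pandas", "tkinter"],
    "SQL_Group": ["SQL", "Select"],
    "Devices_Group": ["Computer", "Mobile"],
}


def assign_groups(topics, groups, default="000"):
    """Return one group name per topic; the first matching group wins."""
    conditions = [
        # re.escape keeps keywords literal; na=False treats missing topics as no match.
        topics.str.contains("|".join(map(re.escape, words)), case=False, na=False)
        for words in groups.values()
    ]
    return np.select(conditions, list(groups), default=default)


# Small sample in the shape of the question's data, plus one unmatched row.
df = pd.DataFrame({
    "Topic": ["This is Python", "This is sql", "Pandas and Work", "Smart Mobile", "Other"],
    "Count": [39, 6, 67, 69, 1],
})
df["Groups"] = assign_groups(df["Topic"], groups)
print(df)
```

Because np.select takes the first condition that is true, the order of the dict decides which group a topic gets when it contains keywords from more than one group.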
Another approach is to maintain your groups and keywords in a dict:
d = {'Python_Group': ['Python', 'Pandas', 'tkinter'], 'SQL_Group': ['SQL', 'Select'], 'Devices_Group': ['Computer', 'Mobile']}
From here, you can easily turn it into a "keyword: group" dict:
keyword_groups = {x: k for k, v in d.items() for x in v}
# {'Python': 'Python_Group',
#  'Pandas': 'Python_Group',
#  'tkinter': 'Python_Group',
#  'SQL': 'SQL_Group',
#  'Select': 'SQL_Group',
#  'Computer': 'Devices_Group',
#  'Mobile': 'Devices_Group'}
Then you can use Series.str.extract with a regular expression to find these keywords, and map to assign each one to the correct group. Use fillna to catch the rows that match no group.
pat = '({})'.format('|'.join(keyword_groups))
df['Groups'] = (df['Topic'].str.extract(pat, expand=False)
                .map(keyword_groups)
                .fillna('000'))
[out]
Topic Count Groups
0 This is Python 39 Python_Group
1 This is SQL 6 SQL_Group
2 This is Paython Pandas 98 Python_Group
3 import tkinter 81 Python_Group
4 Learning Python 94 Python_Group
5 SQL Working 85 SQL_Group
6 Pandas and Work 67 Python_Group
7 This is Pandas 30 Python_Group
8 Computer 20 Devices_Group
9 Mobile Work 55 Devices_Group
10 Smart Mobile 69 Devices_Group
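The `TypeError: sequence item 5: expected str instance, float found` reported in the question most likely comes from missing values: when the spreadsheet columns have different lengths, `pd.read_excel` pads the shorter ones with NaN, which is a float, and `'|'.join` rejects it. A minimal sketch, assuming that is the cause; an in-memory DataFrame stands in for `Map.xlsx`, and the helper name `build_keyword_groups` is hypothetical. It drops the NaN padding per column and escapes each keyword before joining:

```python
import re

import numpy as np
import pandas as pd

# Stand-in for pd.read_excel('Map.xlsx'): shorter columns are padded with NaN.
mapping = pd.DataFrame({
    "Python_Group": ["Python", "Pandas", "tkinter"],
    "SQL_Group": ["SQL", "Select", np.nan],
    "Devices_Group": ["Computer", "Mobile", np.nan],
})


def build_keyword_groups(mapping):
    """Return {keyword: group}, skipping the NaN padding in each column."""
    return {
        str(word): group
        for group in mapping.columns
        for word in mapping[group].dropna()
    }


keyword_groups = build_keyword_groups(mapping)
# re.escape keeps keywords literal in case they contain regex metacharacters.
pat = "({})".format("|".join(map(re.escape, keyword_groups)))

df = pd.DataFrame({"Topic": ["This is Python", "SQL Working", "Smart Mobile", "Other"]})
df["Groups"] = (
    df["Topic"].str.extract(pat, expand=False)
    .map(keyword_groups)
    .fillna("000")
)
print(df)
```

If the real file has a different layout (for example one keyword per row with a group column), the dict would need to be built differently, but the NaN filtering step stays the same.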