Efficient Cartesian product algorithm: partial matching between Pandas DataFrame columns

Problem description

I have two DataFrames:

df1

name
xyz limited
abc private
lmn limited
pqrlimited
abc def xyz limited
abc private limited

df2

flag   tag
E    private
A    limited

The desired output is:

Output:

name                 flag  tag
xyz limited          A     limited
abc private          E     private
lmn limited          A     limited
pqrlimited           A     limited
abc def xyz limited  A     limited
abc private limited  A     limited
abc private limited  E     private

My code:

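import pandas as pd

# Add a constant key to both frames so that merging on it yields the Cartesian product
df1['tmp'] = 1
df2['tmp'] = 1

df3 = pd.merge(df1, df2, on=['tmp'])
df3 = df3.drop('tmp', axis=1)

# Keep only the pairs whose tag occurs as a substring of the name
df3 = df3[df3.apply(lambda x: x['tag'] in x['name'], axis=1)]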

In practice, however, both DataFrames contain millions of records. Can anyone suggest the most efficient way to solve this?

Solution

Use split together with merge:

Updated solution:

# Take the second space-separated word of each name as the tag
df1['tag'] = df1['name'].str.split(' ',expand=True)[1]
df1.merge(df2)
# or
df1['flag'] = df1['tag'].map(df2.set_index('tag')['flag'])
# or, if the strings are not separated by spaces:
df1['tag'] = df1['name'].str.findall('|'.join(set(df2['tag'].tolist()))).str[0]
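
Note that .str[0] keeps only the first match per name, so 'abc private limited' ends up with a single tag, while the desired output in the question lists both. A minimal sketch of one way to keep every match, assuming the sample frames below stand in for the real data and that tags should be treated as literal text (hence re.escape, which is an addition not present in the answer): collect all matches with findall, expand them with explode, then merge.

```python
import re
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({"name": ["xyz limited", "abc private", "lmn limited",
                             "pqrlimited", "abc def xyz limited",
                             "abc private limited"]})
df2 = pd.DataFrame({"flag": ["E", "A"], "tag": ["private", "limited"]})

# One alternation over all tags, escaped so they match literally
pattern = "|".join(re.escape(t) for t in df2["tag"].unique())

# findall returns a list of matched tags per name; explode gives one row per tag
matches = df1.assign(tag=df1["name"].str.findall(pattern)).explode("tag")

# Names without any tag become NaN after explode; drop them before merging
result = matches.dropna(subset=["tag"]).merge(df2, on="tag", how="inner")
print(result)
```

This avoids building the full Cartesian product, but a name that contains the same tag twice would appear twice in the result, and a very large number of distinct tags makes the regular expression itself large.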

You can do it like this:

regx = '|'.join(df2['tag'])
df1['tag'] = df1['name'].str.extract(f'({regx})')
df1['flag'] = df1['tag'].map(df2.set_index('tag')['flag'])
print(df1)

Output:

                  name      tag flag
0          xyz limited  limited    A
1          abc private  private    E
2          lmn limited  limited    A
3           pqrlimited  limited    A
4  abc def xyz limited  limited    A
5  abc private limited  private    E

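Details:

  • Build a regular expression from the list of tags found in df2
  • Extract those tags from the name column of df1
  • Map the extracted tags to the flag values in df2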
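
The steps above can be sketched end to end as a small function. The function name tag_names and two details are assumptions added here, not part of the answer: tags are passed through re.escape so that characters such as '.' match literally, and they are sorted longest first so that, if one tag is a prefix of another, the longer one is tried first at a given position.

```python
import re
import pandas as pd

def tag_names(names: pd.DataFrame, tags: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: attach the first matching tag and its flag to each name."""
    # Step 1: build one regex from the tags (escaped, longest first)
    ordered = sorted(tags["tag"].unique(), key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(t) for t in ordered) + ")"

    out = names.copy()
    # Step 2: extract the leftmost matching tag from each name (NaN if none)
    out["tag"] = out["name"].str.extract(pattern, expand=False)
    # Step 3: look up the flag for each extracted tag
    lookup = tags.drop_duplicates("tag").set_index("tag")["flag"]
    out["flag"] = out["tag"].map(lookup)
    return out

# Sample data copied from the question
df1 = pd.DataFrame({"name": ["xyz limited", "abc private", "lmn limited",
                             "pqrlimited", "abc def xyz limited",
                             "abc private limited"]})
df2 = pd.DataFrame({"flag": ["E", "A"], "tag": ["private", "limited"]})

out = tag_names(df1, df2)
print(out)
```

As in the answer, only one tag is kept per name, so 'abc private limited' is tagged 'private' only.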
