Iterate over two DataFrames and take the mode of specific columns

Problem description

Given df1 (containing the most-sold and least-sold products of each shop):

id   most_sold_A  most_sold_B  most_sold_C  least_sold_A  least_sold_B  least_sold_C
1     1             0           0             0            1             1
2     0             1           0             1            0             0
3     0             1           1             1            0             0

and df2 (containing the distance between each pair of shops):

id1   id2   distance 
1     2      0.5
1     3      3.0
2     3      0.2

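The resulting DataFrame should

  1. For each shop id, find the shop ids within a distance of 1k
  2. Take the mode of the most-sold product among all competitors within 1k
  3. Take the mode of the least-sold product among all competitors within 1k

Resulting df: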

id   most_sold_A  most_sold_B  most_sold_C  least_sold_A  least_sold_B  least_sold_C    /
1     1             0           0             0            1             1
2     0             1           0             1            0             0
3     0             1           1             1            0             0

most_sold_competition_within_1k   least_sold_competition_within_1k
B                                    A
[A,B,C]                              [A,C]
B                                    A

Edit

df1 = pd.DataFrame([[1,1,0,0,0,1,1],[2,0,1,0,1,0,0],[3,0,1,1,1,0,0]],columns = ["id","most_sold_A","most_sold_B","most_sold_C","least_sold_A","least_sold_B","least_sold_C"])
df2 = pd.DataFrame([[1,2,0.5],[1,3,3.0],[2,3,0.2]],columns = ["id1","id2","distance"])

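Solution

I came up with something, but I think it can be optimized further. The idea is to first filter the competitors within range, then join them onto df1 and use .apply() to compute the result: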

import numpy as np
import pandas as pd

# Sample data taken from the tables in the question
df1 = pd.DataFrame([[1,1,0,0,0,1,1],[2,0,1,0,1,0,0],[3,0,1,1,1,0,0]],columns = ["id","most_sold_A","most_sold_B","most_sold_C","least_sold_A","least_sold_B","least_sold_C"])
df2 = pd.DataFrame([[1,2,0.5],[1,3,3.0],[2,3,0.2]],columns = ["id1","id2","distance"])

# Append the swapped pairs so that every shop shows up in id1
df2 = pd.concat([df2,df2[["id2","id1","distance"]].rename(columns = {"id2":"id1","id1":"id2"})]).reset_index()[["id1","id2","distance"]]
df2["id2"] = df2["id2"].astype(str)
# Keep pairs closer than 1k and collect each shop's competitors as a comma-separated string
df2 = df2[df2["distance"]<1][["id1","id2"]].groupby("id1").agg({'id2': ','.join}).reset_index()

df3 = pd.merge(df1,df2,how = 'left',left_on="id",right_on="id1")

most_cols = [col for col in df3.columns if 'most' in col]
least_cols = [col for col in df3.columns if 'least' in col]

def flagged(shop_id, cols):
    # Names of the columns in cols that are set to 1 for the given shop
    row = df3.loc[df3["id"]==int(shop_id), cols]
    return row.columns[row.values[0] == 1].tolist()

# For every shop, list the flagged products of each competitor in range
# (this assumes every shop has at least one competitor, otherwise id2 is NaN)
df3["most_sold_competition_within_1k"] = df3.apply(lambda x: [flagged(elem, most_cols) for elem in x["id2"].split(",")],axis = 1)
df3["least_sold_competition_within_1k"] = df3.apply(lambda x: [flagged(elem, least_cols) for elem in x["id2"].split(",")],axis = 1)

df3 = df3[["id"]+most_cols+least_cols+["most_sold_competition_within_1k","least_sold_competition_within_1k"]]

df3

Output:

    id  most_sold_A  most_sold_B  most_sold_C  least_sold_A  least_sold_B  least_sold_C  most_sold_competition_within_1k            least_sold_competition_within_1k
0   1   1            0            0            0             1             1             [[most_sold_B]]                            [[least_sold_A]]
1   2   0            1            0            1             0             0             [[most_sold_B,most_sold_C],[most_sold_A]]  [[least_sold_A],[least_sold_B,least_sold_C]]
2   3   0            1            1            1             0             0             [[most_sold_B]]                            [[least_sold_A]]
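
The nested lists above still have to be reduced to a mode, and the row-wise .apply() can be avoided. The following is only a sketch of one possible optimization, not the method used above: it merges each competitor's flags onto the pair table and counts them with groupby. The helper name `top_products` is hypothetical, and keeping all tied products is an assumption about how ties should be handled.

```python
import pandas as pd

# Sample data copied from the question's tables
df1 = pd.DataFrame(
    [[1, 1, 0, 0, 0, 1, 1], [2, 0, 1, 0, 1, 0, 0], [3, 0, 1, 1, 1, 0, 0]],
    columns=["id", "most_sold_A", "most_sold_B", "most_sold_C",
             "least_sold_A", "least_sold_B", "least_sold_C"],
)
df2 = pd.DataFrame([[1, 2, 0.5], [1, 3, 3.0], [2, 3, 0.2]],
                   columns=["id1", "id2", "distance"])

# List every pair in both directions, then keep the pairs closer than 1k
pairs = pd.concat(
    [df2, df2.rename(columns={"id1": "id2", "id2": "id1"})],
    ignore_index=True,
)
pairs = pairs[pairs["distance"] < 1]

# Attach each competitor's flags and count them per shop
flag_cols = [c for c in df1.columns if c != "id"]
counts = (
    pairs.merge(df1, left_on="id2", right_on="id")
         .groupby("id1")[flag_cols]
         .sum()
)

def top_products(shop_id, prefix):
    """Product letters with the highest count (assumption: ties are all kept)."""
    if shop_id not in counts.index:
        return []  # no competitor in range
    row = counts.loc[shop_id]
    row = row[[c for c in row.index if c.startswith(prefix)]]
    if row.max() == 0:
        return []
    return [c[len(prefix):] for c in row.index if row[c] == row.max()]

result = df1.copy()
result["most_sold_competition_within_1k"] = [
    top_products(i, "most_sold_") for i in result["id"]
]
result["least_sold_competition_within_1k"] = [
    top_products(i, "least_sold_") for i in result["id"]
]
print(result)
```

With the sample data, shop 2 has shops 1 and 3 in range, and each product is flagged exactly once among them, so this tie rule returns all three letters for that shop.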

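The "tricky" part seems to be finding the relevant competitors for each shop. I'm sure there are more elegant solutions, but a simple and straightforward one is: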

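def find_competitors(x,df2):
    # All shop ids that appear in a pair involving shop x
    shops = np.unique(df2[(df2.id1==x.id) | (df2.id2 == x.id)][['id1','id2']])
    # Drop the shop itself, leaving only its competitors
    competitors = np.delete(shops,np.argwhere(shops == x.id))
    return competitors

# df2 here is the original pairwise distance table from the question
df2 = df2[df2.distance<=1]
df1['competitors'] = df1.apply(lambda x: find_competitors(x,df2),axis=1)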

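Now that you have the relevant competitors for each shop, you can simply iterate over each shop's competitors to answer the other two questions (the competitors' most-sold and least-sold products). I hope this is clear enough.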
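
The iteration described here could look roughly like the sketch below. It is a self-contained illustration, not part of the original answer: the helpers `competitors_of` and `mode_products` are hypothetical names, the sample data is copied from the question, and returning every tied product is an assumption.

```python
from collections import Counter

import pandas as pd

# Sample data copied from the question's tables
df1 = pd.DataFrame(
    [[1, 1, 0, 0, 0, 1, 1], [2, 0, 1, 0, 1, 0, 0], [3, 0, 1, 1, 1, 0, 0]],
    columns=["id", "most_sold_A", "most_sold_B", "most_sold_C",
             "least_sold_A", "least_sold_B", "least_sold_C"],
)
df2 = pd.DataFrame([[1, 2, 0.5], [1, 3, 3.0], [2, 3, 0.2]],
                   columns=["id1", "id2", "distance"])

near = df2[df2["distance"] <= 1]   # pairs within 1k
flags = df1.set_index("id")        # look up a shop's flags by its id

def competitors_of(shop_id):
    """Ids at the other end of every near pair that involves shop_id."""
    as_id1 = near.loc[near["id1"] == shop_id, "id2"]
    as_id2 = near.loc[near["id2"] == shop_id, "id1"]
    return sorted(set(as_id1.tolist()) | set(as_id2.tolist()))

def mode_products(shop_id, prefix):
    """Most frequently flagged product letter(s) among the competitors."""
    tally = Counter()
    for comp in competitors_of(shop_id):      # iterate over the competitors
        row = flags.loc[comp]
        for col in row.index:
            if col.startswith(prefix) and row[col] == 1:
                tally[col[len(prefix):]] += 1
    if not tally:
        return []                             # no competitor or no flags
    best = max(tally.values())
    return sorted(p for p, n in tally.items() if n == best)

most = {int(s): mode_products(s, "most_sold_") for s in df1["id"]}
least = {int(s): mode_products(s, "least_sold_") for s in df1["id"]}
print(most)
print(least)
```

An explicit loop like this is easy to follow on small data; for many shops, a merge followed by a groupby would likely scale better.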

Update

To find the competitors' least/most sold products, you can use:

most_cols = [col for col in df1.columns if 'most' in col]

def find_competitors_by_metric(x,metric_cols):
    # Rows of this shop's competitors, restricted to the metric columns
    competitors_metric = df1[df1.id.isin(x.competitors)][metric_cols]
    # Names of the columns flagged by at least one competitor
    return competitors_metric.loc[:, competitors_metric.any()].columns.tolist()

most_for_competitors = df1.apply(lambda x: find_competitors_by_metric(x,most_cols),axis=1)

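Now you can pass this function whichever metric columns you want to compute for a shop's competitors (assuming those columns exist in the DataFrame).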
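
Note that the function above returns every column flagged by at least one competitor, which is not necessarily the mode. If the most frequent product is what is wanted, the flags can be counted instead. A minimal sketch, assuming ties are all returned; `mode_for_competitors` is a hypothetical helper, and the competitor ids are passed in explicitly to keep the example self-contained:

```python
import pandas as pd

# Sample data copied from the question's first table
df1 = pd.DataFrame(
    [[1, 1, 0, 0, 0, 1, 1], [2, 0, 1, 0, 1, 0, 0], [3, 0, 1, 1, 1, 0, 0]],
    columns=["id", "most_sold_A", "most_sold_B", "most_sold_C",
             "least_sold_A", "least_sold_B", "least_sold_C"],
)

def mode_for_competitors(df, competitor_ids, metric_cols):
    """Metric columns flagged most often among the given competitors."""
    # Per-column number of competitors that have the flag set
    counts = df.loc[df["id"].isin(competitor_ids), metric_cols].sum()
    if counts.empty or counts.max() == 0:
        return []  # no competitors, no columns, or nothing flagged
    return counts.index[(counts == counts.max()).to_numpy()].tolist()

most_cols = [col for col in df1.columns if "most" in col]
least_cols = [col for col in df1.columns if "least" in col]

# Shop 1 has only shop 2 within 1k; shop 2 has shops 1 and 3
print(mode_for_competitors(df1, [2], most_cols))
print(mode_for_competitors(df1, [2], least_cols))
print(mode_for_competitors(df1, [1, 3], most_cols))
```

The same helper works for any group of 0/1 metric columns, which keeps it in line with passing the metric columns as a parameter.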