Problem description
Given df1 (which contains the most-sold and least-sold products for each shop):
id most_sold_A most_sold_B most_sold_C least_sold_A least_sold_B least_sold_C
1 1 0 0 0 1 1
2 0 1 0 1 0 0
3 0 1 1 1 0 0
and df2 (which contains the distance between two shops):
id1 id2 distance
1 2 0.5
1 3 3.0
2 3 0.2
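The resulting dataframe should:
- for each shop id, find the shop ids within a distance of 1k
- take the mode of the most-sold products across all competitors within 1k
- take the mode of the least-sold products across all competitors within 1k
This should produce the following df: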
id most_sold_A most_sold_B most_sold_C least_sold_A least_sold_B least_sold_C most_sold_competition_within_1k least_sold_competition_within_1k
1 1 0 0 0 1 1 B A
2 0 1 0 1 0 0 [A,B,C] [A,C]
3 0 1 1 1 0 0 B A
Edit
df1 = pd.DataFrame([[1,1,0,0,0,1,1],[2,0,1,0,1,0,0],[3,0,1,1,1,0,0]],columns = ["id","most_sold_A","most_sold_B","most_sold_C","least_sold_A","least_sold_B","least_sold_C"])
df2 = pd.DataFrame([[1,2,0.5],[1,3,3.0],[2,3,0.2]],columns = ["id1","id2","distance"])
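Solution
I came up with something, but I think it can be optimized further. The idea is to first filter the competitors within range, then join them onto df1 and use .apply() to compute the results: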
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[1,1,0,0,0,1,1],[2,0,1,0,1,0,0],[3,0,1,1,1,0,0]],columns = ["id","most_sold_A","most_sold_B","most_sold_C","least_sold_A","least_sold_B","least_sold_C"])
df2 = pd.DataFrame([[1,2,0.5],[1,3,3.0],[2,3,0.2]],columns = ["id1","id2","distance"])

# Mirror each pair so that every shop appears in id1, whichever side it was listed on
df2 = pd.concat([df2,df2[["id2","id1","distance"]].rename(columns = {"id2":"id1","id1":"id2"})]).reset_index()[["id1","id2","distance"]]
df2["id2"] = df2["id2"].astype(str)
# Keep pairs closer than 1 and collect each shop's competitors as a comma-separated string
df2 = df2[df2["distance"]<1][["id1","id2"]].groupby("id1").agg({'id2': ','.join}).reset_index()
df3 = pd.merge(df1,df2,how = 'left',left_on="id",right_on="id1")
most_cols = [col for col in df3.columns if 'most' in col]
least_cols = [col for col in df3.columns if 'least' in col]
# For every competitor of a shop, pick the columns that are flagged with 1
df3["most_sold_competition_within_1k"] = df3.apply(lambda x: [df3[df3["id"]==int(elem)][most_cols].columns[[df3[df3["id"]==int(elem)][most_cols].values == 1][0][0]] for elem in x["id2"].split(",")],axis = 1)
df3["least_sold_competition_within_1k"] = df3.apply(lambda x: [df3[df3["id"]==int(elem)][least_cols].columns[[df3[df3["id"]==int(elem)][least_cols].values == 1][0][0]] for elem in x["id2"].split(",")],axis = 1)
df3 = df3[["id"]+most_cols+least_cols+["most_sold_competition_within_1k","least_sold_competition_within_1k"]]
df3
Output:
id most_sold_A most_sold_B most_sold_C least_sold_A least_sold_B least_sold_C most_sold_competition_within_1k least_sold_competition_within_1k
0 1 1 0 0 0 1 1 [[most_sold_B]] [[least_sold_A]]
1 2 0 1 0 1 0 0 [[most_sold_B,most_sold_C],[most_sold_A]] [[least_sold_A],[least_sold_B,least_sold_C]]
2 3 0 1 1 1 0 0 [[most_sold_B]] [[least_sold_A]]
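Since the answer above says there is room for optimization, here is one possible sketch (not the original author's code) that replaces the nested row-wise lookups with a merge and a groupby. It assumes the sample data from the question and counts how often each product is flagged among the competitors within range, then keeps the most frequent ones. The names `pairs`, `counts` and `modal_products` are hypothetical and introduced only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(
    [[1, 1, 0, 0, 0, 1, 1], [2, 0, 1, 0, 1, 0, 0], [3, 0, 1, 1, 1, 0, 0]],
    columns=["id", "most_sold_A", "most_sold_B", "most_sold_C",
             "least_sold_A", "least_sold_B", "least_sold_C"],
)
df2 = pd.DataFrame([[1, 2, 0.5], [1, 3, 3.0], [2, 3, 0.2]],
                   columns=["id1", "id2", "distance"])

# Mirror the pairs so every shop shows up in id1, then keep distances below 1.
mirrored = df2.rename(columns={"id1": "id2", "id2": "id1"})
pairs = pd.concat([df2, mirrored], ignore_index=True)
pairs = pairs[pairs["distance"] < 1]

# Join each competitor's 0/1 flags and add them up per shop.
flag_cols = [c for c in df1.columns if c != "id"]
counts = pairs.merge(df1, left_on="id2", right_on="id").groupby("id1")[flag_cols].sum()

def modal_products(row, prefix):
    """Hypothetical helper: product labels with the highest count (ties kept)."""
    sub = row[[c for c in row.index if c.startswith(prefix)]]
    if sub.max() == 0:
        return []
    return [c[len(prefix):] for c in sub[sub == sub.max()].index]

most = {shop: modal_products(row, "most_sold_") for shop, row in counts.iterrows()}
least = {shop: modal_products(row, "least_sold_") for shop, row in counts.iterrows()}

result = df1.copy()
result["most_sold_competition_within_1k"] = result["id"].map(most)
result["least_sold_competition_within_1k"] = result["id"].map(least)
print(result)
```

Ties are kept, so a shop whose competitors disagree receives every tied product. A shop with no competitor in range would get NaN in both new columns.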
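It seems the "tricky" part is finding the relevant competitors for each shop. I'm sure there are more elegant solutions, but a simple and straightforward one is: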
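def find_competitors(x,df2):
    # Every shop id that shares a row with this shop in df2 (either column)
    shops = np.unique(df2[(df2.id1==x.id) | (df2.id2 == x.id)][['id1','id2']])
    # Drop the shop itself, leaving only its competitors
    competitors = np.delete(shops,np.argwhere(shops == x.id))
    return competitors

# Keep only the pairs within range before looking up competitors
df2 = df2[df2.distance<=1]
df1['competitors'] = df1.apply(lambda x: find_competitors(x,df2),axis=1)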
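Now that you have the relevant competitors for each shop, you can simply iterate over each shop's competitors to answer the other two questions (the competitors' most-sold and least-sold products). I hope this is clear enough.
Update
To find the competitors' most-sold or least-sold products, you can use: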
most_cols = [col for col in df1.columns if 'most' in col]

def find_competitors_by_metric(x,metric_cols):
    competitors_metric = df1[df1.id.isin(x.competitors)][metric_cols]
    return competitors_metric.T[competitors_metric.any()].T.columns

most_for_competitors = df1.apply(lambda x: find_competitors_by_metric(x,most_cols),axis=1)
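Now you can pass the function whichever metric columns you want to compute for a shop's competitors (assuming those metrics exist in the dataframe).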
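As a usage sketch, the two functions can be run end to end on the sample data from the question, once with the "most" columns and once with the "least" columns. This version returns plain lists and filters with a boolean mask instead of np.delete; both are simplifications made for this sketch rather than part of the original answer, and the `least_cols` list is an assumed counterpart to `most_cols`.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    [[1, 1, 0, 0, 0, 1, 1], [2, 0, 1, 0, 1, 0, 0], [3, 0, 1, 1, 1, 0, 0]],
    columns=["id", "most_sold_A", "most_sold_B", "most_sold_C",
             "least_sold_A", "least_sold_B", "least_sold_C"],
)
df2 = pd.DataFrame([[1, 2, 0.5], [1, 3, 3.0], [2, 3, 0.2]],
                   columns=["id1", "id2", "distance"])

def find_competitors(x, df2):
    # All shop ids that share a row with this shop, minus the shop itself.
    shops = np.unique(df2[(df2.id1 == x["id"]) | (df2.id2 == x["id"])][["id1", "id2"]])
    return shops[shops != x["id"]].tolist()

def find_competitors_by_metric(x, metric_cols):
    # Columns flagged with 1 by at least one competitor of this shop.
    competitors_metric = df1[df1["id"].isin(x["competitors"])][metric_cols]
    return competitors_metric.columns[competitors_metric.any().to_numpy()].tolist()

# Collect the metric columns before any result columns are added.
most_cols = [col for col in df1.columns if "most" in col]
least_cols = [col for col in df1.columns if "least" in col]

df2 = df2[df2.distance <= 1]
df1["competitors"] = df1.apply(lambda x: find_competitors(x, df2), axis=1)
df1["most_sold_competition_within_1k"] = df1.apply(
    lambda x: find_competitors_by_metric(x, most_cols), axis=1)
df1["least_sold_competition_within_1k"] = df1.apply(
    lambda x: find_competitors_by_metric(x, least_cols), axis=1)
print(df1[["id", "competitors", "most_sold_competition_within_1k",
           "least_sold_competition_within_1k"]])
```

Note that this collects every product flagged by any competitor, which is a union rather than a strict mode; counting the flags per product would be needed to rank them.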