Constructing a DataFrame with groupby

Problem description

My data frame looks like this:

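                date    id     pct_change
12355258    2010-07-28  60059   0.210210
12355265    2010-07-28  60060   0.592000
12355282    2010-07-29  60059   0.300273
12355307    2010-07-29  60060   0.481982
12355330    2010-07-28  60076   0.400729

I want to rewrite it with the columns "target", "source", and "weights", where "target" and "source" are both values of "id", and "weights" is the number of days on which "target" and "source" both changed price. The result should look like this:

target  source  weights
60059   60060   2
60059   60076   1
60060   60076   1

My goal is to build a networkx graph from this data frame.

I tried using groupby and for loops (which performed very badly).

I feel I am missing one small step in the groupby, but I can't tell what it is.

Thanks for your help.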

Solution

The idea is to first use pivot_table to get True wherever an id has a pct_change value on a given date.

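# First pivot: True where an id has any non-zero pct_change on a date
# ('any' as a string avoids the warning newer pandas raises for the builtin;
# astype(bool) guards against an object dtype after fill_value)
df_ = df.pivot_table(index='id',columns='date',values='pct_change',aggfunc='any',fill_value=False).astype(bool)
print(df_)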
date  2010-07-28 2010-07-29
id                         
60059       True       True
60060       True       True
60076       True      False

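Then you can use combinations from itertools to create all possible pairs of ids, use them to select rows, and apply the & operator to find where both ids are True on the same date. Summing along the columns gives the weights column. Assign this column to a data frame built from the two lists of pair members.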

# get all combinations of ids
from itertools import combinations
a,b = map(list,zip(*combinations(df_.index,2)))

res = (pd.DataFrame({'target':a,'source':b})
         .assign(weights=(df_.loc[a].to_numpy()
                          &df_.loc[b].to_numpy()
                         ).sum(axis=1))
      )
print(res)
   target  source  weights
0   60059   60060        2
1   60059   60076        1
2   60060   60076        1
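
To make the & and sum step concrete, here is a minimal sketch for a single pair. The two boolean rows for ids 60059 and 60076 are copied by hand from the pivot table printed above, and the variable names are only illustrative:

```python
import numpy as np

# Rows of the boolean pivot table, one entry per date (2010-07-28, 2010-07-29).
# Values are copied by hand from the printed table; names are illustrative.
row_60059 = np.array([True, True])
row_60076 = np.array([True, False])

# Element-wise AND is True only on dates where both ids have a value.
both_active = row_60059 & row_60076

# Summing the booleans counts those shared dates, which is the edge weight.
weight = int(both_active.sum())
print(weight)
```

The answer's code does the same thing for every pair at once by stacking the selected rows into two 2-D arrays.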

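Note: don't forget to replace index='id' in pivot_table with the name of your own categorical column. Otherwise your computer will most likely be unable to handle the operations that follow and may crash.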

Try this:

import itertools as itt

import pandas as pd

ids = df.id.unique()

# Square matrix of shared-date counts, initialised to zero
WeightDf = pd.DataFrame(0, index=ids, columns=ids)

def weigh(ID):
    # Dates on which ID has a row
    IdDates = set(df.loc[df.id == ID, 'date'])
    for i in ids:
        # Weight = number of dates shared by ID and i
        WeightDf.at[ID, i] = len(IdDates & set(df.loc[df.id == i, 'date']))

for ID in ids:
    weigh(ID)
print(WeightDf)

# Flatten the upper triangle of the matrix into an edge list
result = pd.DataFrame(
    [{'Id1': i1, 'Id2': i2, 'Weight': WeightDf.loc[i1, i2]}
     for i1, i2 in itt.combinations(ids, 2)]
)

print(result)

I have seen many variations of this use case: generating combinations.

import io
import itertools

import pandas as pd

df = pd.read_csv(io.StringIO("""                date    id     pct_change
12355258    2010-07-28  60059   0.210210
12355265    2010-07-28  60060   0.592000
12355282    2010-07-29  60059   0.300273
12355307    2010-07-29  60060   0.481982
12355330    2010-07-28  60076   0.400729"""),sep=r"\s+")

# generate combinations of two... edge case when a group has only one member
# tuple of itself to itself
dfx = (df.groupby('date').agg({"id": lambda s: list(itertools.combinations(list(s),2))
                               if len(list(s))>1 else [tuple(list(s)*2)]})
    .explode("id")
     .groupby("id").agg({"id":"count"})
     .rename(columns={"id":"weights"})
     .reset_index()
     .assign(target=lambda dfa: dfa["id"].apply(lambda s: s[0]),source=lambda dfa: dfa["id"].apply(lambda s: s[1]))
     .drop(columns="id")
)

print(dfx.to_string(index=False))

Output

 weights  target  source
       2   60059   60060
       1   60059   60076
       1   60060   60076
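
The same per-date pair counting can be sketched with the standard library only, which is handy as a cross-check on small inputs. This is an illustration rather than a replacement for the pandas version: the (date, id) rows are hard-coded from the question's example, the variable names are made up, and unlike the code above it emits no self-pair for a date that has a single id.

```python
from collections import Counter, defaultdict
from itertools import combinations

# (date, id) pairs copied from the example data frame in the question.
rows = [
    ("2010-07-28", 60059),
    ("2010-07-28", 60060),
    ("2010-07-29", 60059),
    ("2010-07-29", 60060),
    ("2010-07-28", 60076),
]

# Collect the set of ids seen on each date.
ids_by_date = defaultdict(set)
for date, id_ in rows:
    ids_by_date[date].add(id_)

# Count every unordered pair once per date.
# Sorting keeps (a, b) and (b, a) from being counted as different pairs.
pair_counts = Counter()
for ids_on_date in ids_by_date.values():
    pair_counts.update(combinations(sorted(ids_on_date), 2))

for (target, source), weight in sorted(pair_counts.items()):
    print(target, source, weight)
```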

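This SO link ended up giving a faster answer to my problem, one that works for a large number of ids. It is closer to the groupby + value_counts approach I had been trying before.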

Here is the code, for the convenience of future readers:

from itertools import combinations

import pandas as pd

def combine(batch):
    """Combine all ids seen on one date into pairs"""
    # Sorting makes (a, b) and (b, a) count as the same pair across dates
    return pd.Series(list(combinations(sorted(set(batch)),2)))

# Count how many dates each pair shares
edges = df.groupby('date')['id'].apply(combine).value_counts()

c = ['source','target']
L = edges.index.values.tolist()
edges = pd.DataFrame(L,columns=c).join(edges.rename('weights').reset_index(drop=True))
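
Since the stated goal is a networkx graph, an edge list in this shape can probably be loaded along the following lines. This is only a sketch: it assumes networkx is installed, hard-codes the three example edges instead of reusing the computed frame, and assumes the count column is named weights.

```python
import networkx as nx
import pandas as pd

# Edge list in the source/target/weights shape; values are the question's example.
edges = pd.DataFrame(
    {
        "source": [60059, 60059, 60060],
        "target": [60060, 60076, 60076],
        "weights": [2, 1, 1],
    }
)

# Build an undirected graph and keep the count as the "weights" edge attribute.
G = nx.from_pandas_edgelist(
    edges, source="source", target="target", edge_attr="weights"
)

print(G.number_of_nodes(), G.number_of_edges())
print(G[60059][60060]["weights"])
```

networkx functions that take edge weights into account generally look for an attribute named weight by default, so pass weight="weights" where a function accepts that argument, or name the column weight instead.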
