python – Sort dates in a DataFrame based on a column, keeping the values of the other columns, using Pandas

I have a dataset like this (extra here stands for several additional columns):

>>> df = pd.DataFrame({'id_police':['p123','p123','p123','b123','b123'],
                   'dateeffe':['24/01/2018','24/11/2017','25/02/2018','24/02/2018','24/02/2018'],
                   'date_fin':['23/03/2018','23/12/2017','26/03/2018','25/02/2018','25/02/2018'],
                   'prime':[0,20,10,20,30],
                   'prime2':[0,30,10,20,0],
                   'extra':[12,12,13,15,20],
                   ...
})
###
  id_police    dateeffe    date_fin  prime  prime2  extra  ...
0      p123  24/01/2018  23/03/2018      0       0     12  ...
1      p123  24/11/2017  23/12/2017     20      30     12  ...
2      p123  25/02/2018  26/03/2018     10      10     13  ...
3      b123  24/02/2018  25/02/2018     20      20     15  ...
4      b123  24/02/2018  25/02/2018     30       0     20  ...

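Within each id_police I want to sort the rows by date (for example 2017 first, then 2018, ...). In addition, wherever dateeffe and date_fin are duplicated I must keep the maximum prime and prime2, as in rows 3 & 4, which share the same id_police.

This is the expected output: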

  id_police    dateeffe    date_fin  prime  prime2  extra  ...
0      p123  24/11/2017  23/12/2017     20      30     12  ...
1      p123  24/01/2018  23/03/2018      0       0     12  ...
2      p123  25/02/2018  26/03/2018     10      10     13  ...
3      b123  24/02/2018  25/02/2018     30      20     15  ...

To find the maximum prime and prime2 I used this:

df = df.groupby(['id_police','dateeffe','date_fin'],as_index=False).agg({'prime':'max','prime2':'max'})

This is what I tried, but it groups everything together and I lose the extra columns...

df1 = df.sort_values(['dateeffe','date_fin']).groupby('id_police', as_index=False).apply(lambda x: x)

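I have searched everywhere. Thank you in advance for your help!

Solution:

I came up with a solution based on a two-step groupby.

To make sorting by date inside groupby work, let's start by
converting both date columns to datetime: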

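# The dates are written day/month/year, so state the format explicitly
df.dateeffe = pd.to_datetime(df.dateeffe, format='%d/%m/%Y')
df.date_fin = pd.to_datetime(df.date_fin, format='%d/%m/%Y')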
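
To see why the conversion matters, here is a small sketch (assuming the dates are day/month/year strings, as in the sample data; the variable names are only illustrative). Sorting the raw strings compares them character by character, so the year is effectively ignored:

```python
import pandas as pd

dates = pd.Series(['24/01/2018', '24/11/2017', '25/02/2018'])

# Sorted as plain text: '24/01/...' comes before '24/11/...' regardless of year.
as_text = dates.sort_values().tolist()

# Sorted as datetimes: real chronological order, then formatted back for display.
parsed = pd.to_datetime(dates, format='%d/%m/%Y')
as_dates = parsed.sort_values().dt.strftime('%d/%m/%Y').tolist()

print(as_text)
print(as_dates)
```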

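The second part is copied from the solution in the text: it builds the
dictionary of aggregation functions (a smart approach, there is no need to do it any other way):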

d = {'prime': 'max', 'prime2': 'max'}
d1 = dict.fromkeys(df.columns.difference(
    ['id_police', 'dateeffe', 'date_fin', 'prime', 'prime2']), 'first')
d.update(d1)
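
As a quick check of what this dictionary ends up containing, the sketch below builds it for a small frame. The column extra2 is hypothetical, added only to show that every column outside the keys and the two prime columns gets 'first':

```python
import pandas as pd

df = pd.DataFrame({'id_police': ['p123', 'b123'],
                   'dateeffe': ['24/01/2018', '24/02/2018'],
                   'date_fin': ['23/03/2018', '25/02/2018'],
                   'prime': [0, 20],
                   'prime2': [0, 20],
                   'extra': [12, 15],
                   'extra2': ['a', 'b']})  # hypothetical second extra column

# Explicit rules for the two columns that need the maximum.
d = {'prime': 'max', 'prime2': 'max'}

# Every remaining non-key column keeps the first value found in its group.
d1 = dict.fromkeys(df.columns.difference(
    ['id_police', 'dateeffe', 'date_fin', 'prime', 'prime2']), 'first')
d.update(d1)

print(d)
```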

Then let's define a function containing the second-step groupby, which applies
the aggregation functions above:

def fn(xx):
    return xx.groupby(['dateeffe', 'date_fin'], as_index=False).agg(d)

All that remains is the actual computation, i.e. the first-step groupby,
which applies the second-step groupby defined above:

df.groupby('id_police', sort=False).apply(fn)\
    .reset_index(level=1, drop=True).reset_index()
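
Put together, the steps above can be sketched end to end as follows. This assumes a single extra column and day/month/year date strings; the name result is only illustrative. Depending on the pandas version, groupby.apply may emit a deprecation warning about grouping columns, which does not affect the result here because fn never uses id_police:

```python
import pandas as pd

df = pd.DataFrame({'id_police': ['p123', 'p123', 'p123', 'b123', 'b123'],
                   'dateeffe': ['24/01/2018', '24/11/2017', '25/02/2018', '24/02/2018', '24/02/2018'],
                   'date_fin': ['23/03/2018', '23/12/2017', '26/03/2018', '25/02/2018', '25/02/2018'],
                   'prime': [0, 20, 10, 20, 30],
                   'prime2': [0, 30, 10, 20, 0],
                   'extra': [12, 12, 13, 15, 20]})

# Step 0: real datetimes, so that groupby sorts chronologically.
df['dateeffe'] = pd.to_datetime(df['dateeffe'], format='%d/%m/%Y')
df['date_fin'] = pd.to_datetime(df['date_fin'], format='%d/%m/%Y')

# Aggregation rules: max for the prime columns, first for everything else.
d = {'prime': 'max', 'prime2': 'max'}
d.update(dict.fromkeys(df.columns.difference(
    ['id_police', 'dateeffe', 'date_fin', 'prime', 'prime2']), 'first'))

def fn(xx):
    # Second-step groupby: sorts by both dates and merges duplicated date pairs.
    return xx.groupby(['dateeffe', 'date_fin'], as_index=False).agg(d)

# First-step groupby: keeps the original order of id_police.
result = (df.groupby('id_police', sort=False).apply(fn)
            .reset_index(level=1, drop=True).reset_index())

print(result)
```

The date columns of result are datetimes; if the original text form is wanted for display, something like result['dateeffe'].dt.strftime('%d/%m/%Y') converts them back.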

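Note the difference between the two groupby calls:

- The first-step groupby contains sort=False, so the original
order of id_police is kept.
- The second-step groupby has no sort argument, so this
grouping takes care of sorting by both dates.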
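
The effect of the sort argument can be seen in isolation with a tiny sketch (the frame and variable names are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({'id_police': ['p123', 'p123', 'b123'],
                   'prime': [0, 20, 30]})

# sort=False: groups appear in order of first occurrence.
kept_order = df.groupby('id_police', sort=False)['prime'].max().index.tolist()

# Default (sort=True): groups are ordered by key.
sorted_order = df.groupby('id_police')['prime'].max().index.tolist()

print(kept_order)
print(sorted_order)
```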

The two calls to reset_index deserve some explanation:

df.groupby('id_police', sort=False).apply(fn) produces a DataFrame with
the following MultiIndex:

id_police  
p123      0
          1
          2
b123      0

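So the first reset_index removes level 1 (0, 1, 2, 0) entirely
(drop=True).

The second reset_index, by contrast, turns the only remaining
index level (p123, p123, p123, b123) into a regular column and
creates a default index (consecutive numbers starting from 0).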
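
The two calls can be tried on a hand-built frame with the same kind of MultiIndex. This is only a sketch; the names nested, step1 and step2 are illustrative:

```python
import pandas as pd

# Mimic the index produced by the nested groupby: (id_police, position in group).
idx = pd.MultiIndex.from_arrays(
    [['p123', 'p123', 'p123', 'b123'], [0, 1, 2, 0]],
    names=['id_police', None])
nested = pd.DataFrame({'prime': [20, 0, 10, 30]}, index=idx)

# First call: discard level 1 (the per-group counter) without keeping it as a column.
step1 = nested.reset_index(level=1, drop=True)

# Second call: move id_police from the index into a regular column.
step2 = step1.reset_index()

print(step2)
```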
