Groupby on DataFrame columns with pandas

I have this data:

member_id  application_name  active_seconds
   192180             Opera               6
   192180             Opera               7
   192180            Chrome             243
  5433112            Chrome              52
  5433112             Opera              34
  5433112            Chrome             465

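I need to group it by member_id and application_name, and for each group get the number of times the application was used and the sum of active_seconds.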

I tried

print(df.groupby(['member_id', 'application_name']).count())

but the result only contains active_seconds, and

print(df.groupby(['member_id', 'application_name'])['active_seconds'].count())

does not work correctly either.
What am I doing wrong?

Solution:

I think you need aggregate (agg), with a different function for each column:

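# Parentheses allow the method chain to span several lines.
# len counts the rows in each group; 'sum' totals active_seconds.
df1 = (df.groupby(['member_id', 'application_name'])
         .agg({'application_name': len, 'active_seconds': 'sum'}))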

print (df1)
                            active_seconds  application_name
member_id application_name                                  
192180    Chrome                       243                 1
          Opera                         13                 2
5433112   Chrome                       517                 2
          Opera                         34                 1
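
To see why the attempts in the question fall short, it may help to compare count(), size() and sum() on the sample data. count() returns the number of non-null values in each remaining column, so after grouping on the other two columns only active_seconds is left, and it holds a row count rather than a total. The following is a minimal sketch; the variable names g, counts, sizes and totals are illustrative and not from the original answer.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'member_id': [192180, 192180, 192180, 5433112, 5433112, 5433112],
    'application_name': ['Opera', 'Opera', 'Chrome', 'Chrome', 'Opera', 'Chrome'],
    'active_seconds': [6, 7, 243, 52, 34, 465],
})

g = df.groupby(['member_id', 'application_name'])['active_seconds']

counts = g.count()  # non-null values per group (what the question computed)
sizes = g.size()    # rows per group, including rows with missing values
totals = g.sum()    # total active_seconds per group

print(counts)
print(totals)
```

With no missing values, count() and size() agree, so either one gives the number of uses per application; only sum() gives the total seconds.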

If you need reset_index, rename the columns first; otherwise it fails with ValueError: cannot insert application_name, already exists:

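# Rename the aggregated columns before reset_index so that they do not
# clash with the index level that is also called application_name.
df1 = (df.groupby(['member_id', 'application_name'])
         .agg({'application_name': len, 'active_seconds': 'sum'})
         .rename(columns={'active_seconds': 'count_sec', 'application_name': 'sum_app'})
         .reset_index())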

print (df1)
   member_id application_name  count_sec  sum_app
0     192180           Chrome        243        1
1     192180            Opera         13        2
2    5433112           Chrome        517        2
3    5433112            Opera         34        1
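
As a possible alternative, named aggregation (available in pandas 0.25 and later) sets the output column names directly, so no rename step is needed before reset_index. This is a sketch rather than part of the original answer: the output names total_seconds and n_rows are made up for illustration, and both aggregations read active_seconds so that no grouping key has to be aggregated.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'member_id': [192180, 192180, 192180, 5433112, 5433112, 5433112],
    'application_name': ['Opera', 'Opera', 'Chrome', 'Chrome', 'Opera', 'Chrome'],
    'active_seconds': [6, 7, 243, 52, 34, 465],
})

df2 = (df.groupby(['member_id', 'application_name'])
         .agg(total_seconds=('active_seconds', 'sum'),  # sum of seconds per group
              n_rows=('active_seconds', 'size'))        # number of rows per group
         .reset_index())

print(df2)
```

Because the new column names differ from the index level names, reset_index has nothing to collide with.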

Timings:

In [208]: %timeit df.groupby(['member_id', 'application_name']).agg({'application_name':len, 'active_seconds':sum}).rename(columns={'active_seconds':'count_sec','application_name':'sum_app'}).reset_index()
10 loops, best of 3: 93.6 ms per loop

In [209]: %timeit (f1(df))
10 loops, best of 3: 127 ms per loop

Code for testing:

import pandas as pd

df = pd.DataFrame({'member_id': {0: 192180, 1: 192180, 2: 192180, 3: 5433112, 4: 5433112, 5: 5433112}, 
                   'active_seconds': {0: 6, 1: 7, 2: 243, 3: 52, 4: 34, 5: 465}, 
                   'application_name': {0: 'Opera', 1: 'Opera', 2: 'Chrome', 3: 'Chrome', 4: 'Opera', 5: 'Chrome'}})
print (df)
#   active_seconds application_name  member_id
#0               6            Opera     192180
#1               7            Opera     192180
#2             243           Chrome     192180
#3              52           Chrome    5433112
#4              34            Opera    5433112
#5             465           Chrome    5433112

df = pd.concat([df]*1000).reset_index(drop=True)
print (len(df))
#6000

df1 = df.groupby(['member_id', 'application_name']).agg({'application_name':len, 'active_seconds':sum}).rename(columns={'active_seconds':'count_sec','application_name':'sum_app'}).reset_index() 
print (df1)

def f1(df):
    a = (df.groupby(['member_id', 'application_name'])['active_seconds'].sum() )
    b = (df.groupby(['member_id', 'application_name']).size())
    return (pd.concat([a,b], axis=1, keys=['count_sec','sum_app']).reset_index())

print (f1(df))
#   member_id application_name  count_sec  sum_app
#0     192180           Chrome     243000     1000
#1     192180            Opera      13000     2000
#2    5433112           Chrome     517000     2000
#3    5433112            Opera      34000     1000
#   member_id application_name  count_sec  sum_app
#0     192180           Chrome     243000     1000
#1     192180            Opera      13000     2000
#2    5433112           Chrome     517000     2000
#3    5433112            Opera      34000     1000
