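Problem description
I have a JSON array in which each entry has an id, a start time, and an end time. I want to calculate the average time users are active. Some entries may have only a start time and no end time. Sample data: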
data = [{"id":1,"stime":"2020-09-21T06:25:36Z","etime": "2020-09-22T09:25:36Z"},{"id":2,"stime":"2020-09-22T02:24:36Z","etime": "2020-09-23T07:25:36Z"},{"id":3,"stime":"2020-09-20T06:25:36Z","etime": "2020-09-24T09:25:36Z"},{"id":4,"stime":"2020-09-23T06:25:36Z","etime": "2020-09-29T09:25:36Z"}]
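My approach is to take the difference between the start time and the end time of each entry, sum all the differences, and then divide by the total number of ids.
Sample code: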
import dateutil.parser

# Parse the ISO 8601 strings into timezone-aware datetime objects
date_s_time = '2020-09-21T06:25:36Z'
date_e_time = '2020-09-22T09:25:36Z'
d1 = dateutil.parser.parse(date_s_time)
d2 = dateutil.parser.parse(date_e_time)
# Subtracting two datetimes yields a timedelta directly,
# so no strftime/strptime round trip is needed
diff1 = d2 - d1
print("Difference 1:", diff1)

date_s_time2 = '2020-09-20T06:25:36Z'
date_e_time2 = '2020-09-28T02:25:36Z'
d3 = dateutil.parser.parse(date_s_time2)
d4 = dateutil.parser.parse(date_e_time2)
diff2 = d4 - d3
print("Difference 2:", diff2)

print("total", diff1 + diff2)
# Parentheses make the division apply to the sum, not only to diff2
print((diff1 + diff2) / 2)
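The approach described above (take each difference, sum them, divide by the count) can be sketched over the whole list with the standard library alone. This is only an illustration under stated assumptions: the helper name `average_active_time` and the constant `ISO_FORMAT` are hypothetical, and entries without an `etime` are simply skipped, which is one possible policy rather than the required one.

```python
from datetime import datetime, timedelta

# Matches strings such as 2020-09-21T06:25:36Z (all timestamps assumed UTC)
ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

def average_active_time(records):
    """Mean of (etime - stime) over the records that have both fields."""
    durations = []
    for record in records:
        if not record.get("etime"):
            continue  # assumption: ignore sessions with no end time
        start = datetime.strptime(record["stime"], ISO_FORMAT)
        end = datetime.strptime(record["etime"], ISO_FORMAT)
        durations.append(end - start)
    if not durations:
        return None  # nothing to average
    # sum() needs a timedelta start value, since the default start is int 0
    return sum(durations, timedelta()) / len(durations)

data = [
    {"id": 1, "stime": "2020-09-21T06:25:36Z", "etime": "2020-09-22T09:25:36Z"},
    {"id": 2, "stime": "2020-09-22T02:24:36Z", "etime": "2020-09-23T07:25:36Z"},
    {"id": 3, "stime": "2020-09-20T06:25:36Z", "etime": "2020-09-24T09:25:36Z"},
    {"id": 4, "stime": "2020-09-23T06:25:36Z", "etime": "2020-09-29T09:25:36Z"},
]
print(average_active_time(data))
```

Dividing by `len(durations)` rather than by the number of ids keeps skipped entries from dragging the average down.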
Solution
You can use the pandas library.
from datetime import datetime
import dateutil.parser
import pandas as pd
data = [{"id":1,"stime":"2020-09-21T06:25:36Z","etime": "2020-09-22T09:25:36Z"},{"id":1,"stime":"2020-09-22T02:24:36Z","etime": "2020-09-23T07:25:36Z"},{"id":1,"stime":"2020-09-20T06:25:36Z","etime": "2020-09-24T09:25:36Z"},{"id":1,"stime":"2020-09-23T06:25:36Z"}]
(assuming your last row has no end time)
Now you can create a pandas DataFrame from the data:
df = pd.DataFrame(data)
df
It looks like this:
   id                 stime                 etime
0   1  2020-09-21T06:25:36Z  2020-09-22T09:25:36Z
1   1  2020-09-22T02:24:36Z  2020-09-23T07:25:36Z
2   1  2020-09-20T06:25:36Z  2020-09-24T09:25:36Z
3   1  2020-09-23T06:25:36Z                   NaN
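Now we want to map the stime and etime columns so that the strings are converted to datetime objects, and fill the NaN with something meaningful. If no end time exists, we could use the current time: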
df = df.fillna(datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%SZ'))
df['etime'] = df['etime'].map(dateutil.parser.parse)
df['stime'] = df['stime'].map(dateutil.parser.parse)
Alternatively, if you want to drop the rows that have no etime, replace the fillna call above with
df = df.dropna()
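As a variation on the conversion step above, the parsing can also be done with `pd.to_datetime`, which turns a missing value into `NaT` instead of raising. The following is a minimal sketch, not the answer's original method: the two-row frame is a cut-down stand-in for the data, and `fallback_end` is a hypothetical fixed timestamp used in place of the current time so the result is reproducible.

```python
import pandas as pd

# Two rows in the same shape as the data above; the second has no etime.
df = pd.DataFrame([
    {"id": 1, "stime": "2020-09-21T06:25:36Z", "etime": "2020-09-22T09:25:36Z"},
    {"id": 1, "stime": "2020-09-23T06:25:36Z"},
])

# Vectorized parsing; a missing etime becomes NaT.
df["stime"] = pd.to_datetime(df["stime"], utc=True)
df["etime"] = pd.to_datetime(df["etime"], utc=True)

# Hypothetical stand-in for "now". In real use this could be
# pd.Timestamp.now(tz="UTC").
fallback_end = pd.Timestamp("2020-09-24T20:05:42Z")

# Option 1: treat open sessions as ending at the fallback time.
filled = df.assign(etime=df["etime"].fillna(fallback_end))
# Option 2: ignore open sessions; subset limits the check to etime.
dropped = df.dropna(subset=["etime"])

print((filled["etime"] - filled["stime"]).mean())
print((dropped["etime"] - dropped["stime"]).mean())
```

Because the columns are parsed first, either option can be chosen afterwards without worrying about the parser receiving a NaN.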
With the current time filled in, df becomes:
   id                      stime                      etime
0   1  2020-09-21 06:25:36+00:00  2020-09-22 09:25:36+00:00
1   1  2020-09-22 02:24:36+00:00  2020-09-23 07:25:36+00:00
2   1  2020-09-20 06:25:36+00:00  2020-09-24 09:25:36+00:00
3   1  2020-09-23 06:25:36+00:00  2020-09-24 20:05:42+00:00
Finally, subtract the two columns:
df['tdiff'] = df['etime'] - df['stime']
We get:
   id                      stime                      etime            tdiff
0   1  2020-09-21 06:25:36+00:00  2020-09-22 09:25:36+00:00  1 days 03:00:00
1   1  2020-09-22 02:24:36+00:00  2020-09-23 07:25:36+00:00  1 days 05:01:00
2   1  2020-09-20 06:25:36+00:00  2020-09-24 09:25:36+00:00  4 days 03:00:00
3   1  2020-09-23 06:25:36+00:00  2020-09-24 20:05:42+00:00  1 days 13:40:06
The mean of this column is:
df['tdiff'].mean()
Output: Timedelta('2 days 00:10:16.500000')
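If the average is needed as a plain number, for example in hours, the resulting Timedelta can be converted with `total_seconds()`. The sketch below rebuilds the four tdiff values from the table by hand (an assumption made for reproducibility, since the last one depended on the current time) and checks the mean; the names `mean_active` and `mean_hours` are illustrative.

```python
import pandas as pd

# The four differences shown in the tdiff column above.
tdiff = pd.Series(pd.to_timedelta([
    "1 days 03:00:00",
    "1 days 05:01:00",
    "4 days 03:00:00",
    "1 days 13:40:06",
]))

mean_active = tdiff.mean()                         # a pandas Timedelta
mean_hours = mean_active.total_seconds() / 3600    # the same value in hours

print(mean_active)
print(round(mean_hours, 2))
```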