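Problem description
TL;DR: I want to right-align this df, overwriting the NaNs / shifting them to the left:
In [6]: series.str.split(':', expand=True)
Out[6]:
        0       1       2
0       1  25.842    <NA>
1    <NA>    <NA>    <NA>
2       0  15.413    <NA>
3  54.154    <NA>    <NA>
4       3       2  06.284
so that the data is contiguous and the right-most columns are filled:
      0     1       2
0     0     1  25.842  # 0 or NA
1  <NA>  <NA>    <NA>  # this NA should remain
2     0     0  15.413
3     0     0  54.154
4     3     2  06.284
What I'm actually trying to do:
I have a Pandas Series of durations/timedeltas, roughly in H:M:S format, but sometimes the "H" or "H:M" part is missing, so I can't just pass it to timedelta or datetime. What I want is to convert them to seconds, which I've done, but it seems a bit convoluted: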
In [1]: import pandas as pd
   ...:
   ...: series = pd.Series(['1:25.842', pd.NA, '0:15.413', '54.154', '3:2:06.284'], dtype='string')
   ...: t = series.str.split(':')  # not using `expand` helps for the next step
   ...: t
Out[1]:
0       [1, 25.842]
1              <NA>
2       [0, 15.413]
3          [54.154]
4    [3, 2, 06.284]
dtype: object
In [2]: # reverse it so seconds are first; and NA's are just empty
   ...: rows = [i[::-1] if i is not pd.NA else [] for i in t]
In [3]: smh = pd.DataFrame.from_records(rows).astype('float')
   ...: # left-aligned is okay since it's continuous Secs->Mins->Hrs
   ...: smh
Out[3]:
        0    1    2
0  25.842  1.0  NaN
1     NaN  NaN  NaN
2  15.413  0.0  NaN
3  54.154  NaN  NaN
4   6.284  2.0  3.0
If I don't do this fillna(0) step, it generates NaN in the seconds conversion later:
In [4]: smh.iloc[:, 1:] = smh.iloc[:, 1:].fillna(0)  # NaN's in first col = NaN from data; so leave
   ...: # convert to seconds
   ...: smh.iloc[:, 0] + smh.iloc[:, 1] * 60 + smh.iloc[:, 2] * 3600
Out[4]:
0       85.842
1          NaN
2       15.413
3       54.154
4    10926.284
dtype: float64
^ expected end result.
(Alternatively, I could write a small Python-only function that splits on ':' and then converts based on how many values each list has.)
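The Python-only alternative mentioned in the last sentence could look roughly like the sketch below. The function name `duration_to_seconds` is made up for illustration, and missing values are represented by `None` here rather than `pd.NA` so that the example has no pandas dependency; both are assumptions, not part of the original question.

```python
import math

def duration_to_seconds(text):
    """Convert a '[H:][M:]S' string to seconds; return NaN for missing input.

    Hypothetical helper: the name and the use of None for missing values
    are assumptions made for this sketch.
    """
    if text is None:
        return math.nan
    parts = [float(p) for p in text.split(':')]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unexpected duration format: {text!r}")
    # Walk the fields from the right: seconds, then minutes, then hours.
    # zip() stops at the shorter input, so missing leading fields count as 0.
    return sum(value * weight
               for value, weight in zip(reversed(parts), (1, 60, 3600)))

samples = ['1:25.842', None, '0:15.413', '54.154', '3:2:06.284']
seconds = [duration_to_seconds(s) for s in samples]
```

With a pandas Series, something like `series.map(...)` with a `pd.isna` check in place of `is None` would probably be needed, and it is unlikely to beat the vectorised versions on large inputs.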
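Solution
Let's try right-justifying the dataframe with numpy. The basic idea is to sort the dataframe along axis=1 so that the NaN values come before the non-NaN values, while keeping the order of the non-NaN values intact:
i = np.argsort(np.where(df.isna(), -1, 0), 1)
df[:] = np.take_along_axis(df.values, i, axis=1)
     0    1       2
0  NaN  1.0  25.842
1  NaN  NaN     NaN
2  NaN  0.0  15.413
3  NaN  NaN  54.154
4  3.0  2.0   6.284
To get the total seconds, you can multiply the right-justified dataframe by [3600, 60, 1] and take the sum along axis=1.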
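A minimal end-to-end sketch of the right-justify-then-weight idea, under a few assumptions that go beyond the answer: the split columns are converted to floats before sorting, `kind='stable'` is passed to `np.argsort` so that the order of the non-NaN values is guaranteed rather than incidental, and `min_count=1` keeps an all-NaN row as NaN instead of 0. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

series = pd.Series(['1:25.842', pd.NA, '0:15.413', '54.154', '3:2:06.284'],
                   dtype='string')

# Split into columns and convert to a plain float64 array (missing -> NaN).
parts = series.str.split(':', expand=True).apply(pd.to_numeric, errors='coerce')
values = parts.to_numpy(dtype='float64', na_value=np.nan)

# Sort each row on the "is present" mask: False (NaN) sorts before True,
# and a stable sort keeps the present values in their original order.
order = np.argsort(~np.isnan(values), axis=1, kind='stable')
right_aligned = np.take_along_axis(values, order, axis=1)

# Weight hours, minutes, seconds and sum; min_count=1 keeps all-NaN rows NaN.
weighted = pd.DataFrame(right_aligned, index=series.index).mul([3600, 60, 1])
total_seconds = weighted.sum(axis=1, min_count=1)
print(total_seconds)
```

Sorting on a boolean mask rather than on the values themselves is what keeps `1` and `25.842` in their original order; sorting the values directly would reorder them.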
,
You could tackle the problem earlier by padding the series with '0:', as shown below:
# setup
import pandas as pd
series = pd.Series(['1:25.842', pd.NA, '0:15.413', '54.154', '3:2:06.284'], dtype='string')
# build a '0:' prefix for each missing leading field (a full H:M:S value has two ':')
counts = 2 - series.str.count(':')
pad = pd.Series(['0:' * c if pd.notna(c) and c > 0 else '' for c in counts], dtype='string')
# apply padding
res = pad.str.cat(series)
t = res.str.split(':', expand=True)
print(t)
Output
      0     1       2
0     0     1  25.842
1  <NA>  <NA>    <NA>
2     0     0  15.413
3     0     0  54.154
4     3     2  06.284
,
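1. Using the sort-the-NAs approach in Shubham's answer, I came up with this, leveraging Pandas apply and Python sorted:
series = pd.Series(['1:25.842', pd.NA, '0:15.413', '54.154', '3:2:06.284'], dtype='string')
df = series.str.split(':', expand=True)
# key for sorted is `pd.notna`, so False(0) sorts before True(1)
df.apply(sorted, axis=1, key=pd.notna, result_type='broadcast')
(And then multiply as needed.) But it's very slow, see below.
2. By pre-padding with '0:'s as in Dani's answer, I can then directly create pd.Timedeltas and get their total_seconds:
res = ...  # from answer
pd.to_timedelta(res, errors='coerce').map(lambda x: x.total_seconds())
(But on ~10k rows it's faster to do an expand-split and then multiply + sum.)
Performance caveats, with 10k rows of data:
Initial code/attempt in my question, with row reversal (so maybe I'll stick with it):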
%%timeit
t = series.str.split(':')
rows = [i[::-1] if i is not pd.NA else [] for i in t]
smh = pd.DataFrame.from_records(rows).astype('float')
smh.mul([1, 60, 3600]).sum(axis=1, min_count=1)
# 14.3 ms ± 310 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Numpy argsort + take_along_axis:
%%timeit
df = series.str.split(':', expand=True)
i = np.argsort(np.where(df.isna(), -1, 0), axis=1)
df[:] = np.take_along_axis(df.values, i, axis=1)
df.apply(pd.to_numeric, errors='coerce').mul([3600, 60, 1]).sum(axis=1, min_count=1)
# 30.1 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Pre-padding:
%%timeit
counts = 2 - series.str.count(':')
pad = pd.Series(['0:' * c if pd.notna(c) else '' for c in counts], dtype='string')
res = pad.str.cat(series)
t = res.str.split(':', expand=True)
t.apply(pd.to_numeric, errors='coerce').mul([3600, 60, 1]).sum(axis=1, min_count=1)
# 48.3 ms ± 607 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Pre-padding, timedeltas + total_seconds:
%%timeit
counts = 2 - series.str.count(':')
pad = pd.Series(['0:' * c if pd.notna(c) else '' for c in counts], dtype='string')
res = pad.str.cat(series)
pd.to_timedelta(res, errors='coerce').map(lambda x: x.total_seconds())
# 183 ms ± 9.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Pandas apply + Python sorted: very slow.
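The 10k-row data behind these timings is not shown, so the numbers cannot be reproduced exactly. The sketch below shows one way a comparable measurement could be set up outside IPython. It assumes the five sample values are simply tiled 2,000 times, wraps the row-reversal approach in a hypothetical `row_reversal` function, and uses `isinstance(i, list)` instead of the `pd.NA` identity check as a small robustness tweak.

```python
import timeit

import pandas as pd

# Assumption: tile the five sample values to get 10,000 rows.
sample = ['1:25.842', pd.NA, '0:15.413', '54.154', '3:2:06.284']
series = pd.Series(sample * 2000, dtype='string')

def row_reversal(s):
    """Row-reversal approach from the question, wrapped for timing."""
    t = s.str.split(':')
    # Reverse each split so seconds come first; missing rows become empty.
    rows = [i[::-1] if isinstance(i, list) else [] for i in t]
    smh = pd.DataFrame.from_records(rows).astype('float')
    return smh.mul([1, 60, 3600]).sum(axis=1, min_count=1)

result = row_reversal(series)

# Best of 3 repeats, 5 calls each, reported per call.
best = min(timeit.repeat(lambda: row_reversal(series), number=5, repeat=3)) / 5
print(f"row reversal: {best * 1e3:.1f} ms per call")
```

The other variants can be wrapped in the same way and passed to `timeit.repeat`; absolute numbers will differ by machine and pandas version, so only the relative ordering is worth comparing.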