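Convert partial H:M:S durations with missing parts to seconds; or right-align non-NA data

Problem description

TL;DR: I want to right-align this df, overwriting the NaNs/shifting them to the left, so that the contiguous data fills the rightmost columns: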

In [6]: series.str.split(':',expand=True)
Out[6]:
        0       1       2
0       1  25.842    <NA>
1    <NA>    <NA>    <NA>
2       0  15.413    <NA>
3  54.154    <NA>    <NA>
4       3       2  06.284

The goal is this layout, with the contiguous data filled in as the rightmost columns:

      0     1       2
0     0     1  25.842   # 0 or NA
1  <NA>  <NA>    <NA>   # this NA should remain
2     0     0  15.413
3     0     0  54.154
4     3     2  06.284

What I'm really trying to do:

I have a Pandas Series of durations/timedeltas, roughly in H:M:S format, but sometimes the "H" or "H:M" parts are missing, so I can't just pass it to timedelta or datetime. What I want is to convert them to seconds, which I've done, but it seems a bit convoluted:

In [1]: import pandas as pd
   ...:
   ...: series = pd.Series(['1:25.842', pd.NA, '0:15.413', '54.154', '3:2:06.284'], dtype='string')
   ...: t = series.str.split(':')  # not using `expand` helps for the next step
   ...: t
Out[1]:
0       [1, 25.842]
1              <NA>
2       [0, 15.413]
3          [54.154]
4    [3, 2, 06.284]
dtype: object

In [2]: # reverse it so seconds are first; and NA's are just empty
   ...: rows = [i[::-1] if i is not pd.NA else [] for i in t]

In [3]: smh = pd.DataFrame.from_records(rows).astype('float')
   ...: # left-aligned is okay since it's continuous Secs->Mins->Hrs
   ...: smh
Out[3]:
        0    1    2
0  25.842  1.0  NaN
1     NaN  NaN  NaN
2  15.413  0.0  NaN
3  54.154  NaN  NaN
4   6.284  2.0  3.0

If I don't do this fillna(0) step, then it generates NaN's for the seconds conversion later:

In [4]: smh.iloc[:,1:] = smh.iloc[:,1:].fillna(0)  # NaN's in first col = NaN from data; so leave
   ...: # convert to seconds
   ...: smh.iloc[:,0] + smh.iloc[:,1] * 60 + smh.iloc[:,2] * 3600
Out[4]:
0       85.842
1          NaN
2       15.413
3       54.154
4    10926.284
dtype: float64

^ Expected end result.

(Alternatively, I could write a small Python-only function to split on ':', and then convert based on how many values each list has.)
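The Python-only alternative mentioned above could look roughly like the sketch below. The function name hms_to_seconds is made up for illustration. Rather than branching on how many parts each value has, it folds the parts from left to right (multiply the running total by 60, then add the next part), which handles one, two or three parts the same way.

```python
import math

import pandas as pd


def hms_to_seconds(text):
    """Convert 'S', 'M:S' or 'H:M:S' text to seconds (hypothetical helper)."""
    if pd.isna(text):
        return math.nan  # missing input stays missing
    total = 0.0
    for part in text.split(':'):
        # Each step shifts the running total up one unit (x60) and adds the next part.
        total = total * 60 + float(part)
    return total


series = pd.Series(['1:25.842', pd.NA, '0:15.413', '54.154', '3:2:06.284'],
                   dtype='string')

# A plain list comprehension avoids depending on how Series.map treats NA values.
seconds = pd.Series([hms_to_seconds(x) for x in series],
                    index=series.index, dtype='float64')
print(seconds)
```

This runs one Python-level call per row, so it is unlikely to beat the vectorised approaches on large inputs, but it does not need any alignment or padding step.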

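Solutions

Let's try right-aligning the dataframe with numpy. The basic idea is to sort the dataframe along axis=1 so that the NaN values come before the non-NaN values, while keeping the order of the non-NaN values unchanged:

i = np.argsort(np.where(df.isna(), -1, 0), 1)
df[:] = np.take_along_axis(df.values, i, axis=1)

     0    1       2
0  NaN  1.0  25.842
1  NaN  NaN     NaN
2  NaN  0.0  15.413
3  NaN  NaN  54.154
4  3.0  2.0   6.284

To get the total seconds, you can multiply the right-aligned dataframe by [3600, 60, 1] and sum along axis=1.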
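For reference, here is a minimal end-to-end sketch of the right-align-then-weight idea, starting from the sample series in the question. Two details are assumptions rather than part of the answer: the parts are converted to float64 before sorting, and kind='stable' is passed to np.argsort, because NumPy only guarantees that equal keys keep their relative order for the stable sort kinds. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

series = pd.Series(['1:25.842', pd.NA, '0:15.413', '54.154', '3:2:06.284'],
                   dtype='string')
parts = series.str.split(':', expand=True)

# Convert every part to a plain float; missing parts become NaN.
values = parts.apply(pd.to_numeric, errors='coerce').astype('float64').to_numpy()

# Sort key per cell: False for NaN, True for real values. A stable argsort
# moves the NaNs to the left and keeps the real values in their original order.
order = np.argsort(~np.isnan(values), axis=1, kind='stable')
aligned = np.take_along_axis(values, order, axis=1)

# Weight the rightmost columns as hours, minutes, seconds. The slice covers
# the case where no row has an hours part and there are fewer than 3 columns.
weights = [3600, 60, 1][-aligned.shape[1]:]

# min_count=1 keeps rows that are entirely NaN as NaN instead of 0.
seconds = pd.DataFrame(aligned, index=series.index).mul(weights).sum(axis=1, min_count=1)
print(seconds)
```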

You could tackle the problem earlier by padding the series with '0:', as shown below:

# setup
import pandas as pd

series = pd.Series(['1:25.842',pd.NA,'0:15.413','54.154','3:2:06.284'],dtype='string')

# create a padding of 0 series
counts = 2 - series.str.count(':')
pad = pd.Series(['0:' * c if pd.notna(c) and c > 0 else '' for c in counts],dtype='string')

# apply padding
res = pad.str.cat(series)

t = res.str.split(':',expand=True)
print(t)

Output

      0     1       2
0     0     1  25.842
1  <NA>  <NA>    <NA>
2     0     0  15.413
3     0     0  54.154
4     3     2  06.284

1. Using the sort-the-NAs approach in Shubham's answer, I came up with this, leveraging Pandas apply and Python sorted:

series = pd.Series(['1:25.842', pd.NA, '0:15.413', '54.154', '3:2:06.284'], dtype='string')
df = series.str.split(':', expand=True)
# key for sorted is `pd.notna`, so False(0) sorts before True(1)
df.apply(sorted, axis=1, key=pd.notna, result_type='broadcast')

(Then multiply as needed.) But it's very slow, see below.

2. By pre-padding with '0:'s as in Dani's answer, I can then directly create pd.Timedelta values and get their total_seconds:

res = ...  # from answer

pd.to_timedelta(res, errors='coerce').map(lambda x: x.total_seconds())

(But doing the expand-split and then multiply + sum is faster at ~10k rows.)


Performance caveats, with 10k rows of data:

Initial code/attempt in my question, with row reversal (so maybe I'll stick with it):

%%timeit
t = series.str.split(':')
rows = [i[::-1] if i is not pd.NA else [] for i in t]
smh = pd.DataFrame.from_records(rows).astype('float')
smh.mul([1, 60, 3600]).sum(axis=1, min_count=1)

# 14.3 ms ± 310 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Numpy argsort + take_along_axis:

%%timeit
df = series.str.split(':', expand=True)
i = np.argsort(np.where(df.isna(), -1, 0), axis=1)
df[:] = np.take_along_axis(df.values, i, axis=1)
df.apply(pd.to_numeric, errors='coerce').mul([3600, 60, 1]).sum(axis=1, min_count=1)

# 30.1 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Pre-padding:

%%timeit
counts = 2 - series.str.count(':')
pad = pd.Series(['0:' * c if pd.notna(c) else '' for c in counts], dtype='string')
res = pad.str.cat(series)
t = res.str.split(':', expand=True)
t.apply(pd.to_numeric, errors='coerce').mul([3600, 60, 1]).sum(axis=1, min_count=1)

# 48.3 ms ± 607 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Pre-padding, timedeltas + total_seconds:

%%timeit
counts = 2 - series.str.count(':')
pad = pd.Series(['0:' * c if pd.notna(c) else '' for c in counts], dtype='string')
res = pad.str.cat(series)
pd.to_timedelta(res, errors='coerce').map(lambda x: x.total_seconds())

# 183 ms ± 9.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Pandas apply + Python sorted (very slow).
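To rerun this kind of comparison outside IPython, a small timeit harness along the following lines may help. It is only a sketch: the post does not say how the 10k-row series was built, so repeating the 5-row sample 2,000 times is an assumption, and the function names reversed_rows and pre_padded are made up here. The isinstance check is a slightly more defensive stand-in for the original `is not pd.NA` test.

```python
import timeit

import pandas as pd

# Assumption: 10k rows obtained by repeating the 5-row sample 2,000 times.
sample = ['1:25.842', pd.NA, '0:15.413', '54.154', '3:2:06.284']
series = pd.Series(sample * 2000, dtype='string')


def reversed_rows(s):
    """Question's approach: reverse each split row so seconds come first."""
    t = s.str.split(':')
    rows = [i[::-1] if isinstance(i, list) else [] for i in t]
    smh = pd.DataFrame.from_records(rows).astype('float')
    return smh.mul([1, 60, 3600]).sum(axis=1, min_count=1)


def pre_padded(s):
    """Pre-padding approach: prepend '0:' until every row has three parts."""
    counts = 2 - s.str.count(':')
    pad = pd.Series(['0:' * c if pd.notna(c) else '' for c in counts],
                    dtype='string')
    t = pad.str.cat(s).str.split(':', expand=True)
    numeric = t.apply(pd.to_numeric, errors='coerce')
    return numeric.mul([3600, 60, 1]).sum(axis=1, min_count=1)


for func in (reversed_rows, pre_padded):
    # Best of 3 repeats, 3 calls each, reported per call.
    best = min(timeit.repeat(lambda: func(series), number=3, repeat=3)) / 3
    print(f'{func.__name__}: {best * 1e3:.1f} ms per call')
```

Absolute numbers will differ by machine and pandas version, so only the relative ordering is worth comparing against the figures above.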