问题描述
我有一个时间序列数据帧,我想从中生成序列。
Time A B C D
2019-06-17 08:45:00 12089.89 12089.89 12087.71 12087.71
2019-06-17 08:46:00 12087.91 NaN 12087.71 12087.91
2019-06-17 08:47:00 12088.21 12088.21 12084.21 12085.21
2019-06-17 08:48:00 12085.09 12090.21 12084.91 12089.41
2019-06-17 08:49:00 12089.71 12090.21 12087.21 12088.21
2019-06-17 08:50:00 12504.11 12504.11 12504.11 12504.11
2019-06-17 08:51:00 12504.11 NaN 12503.11 12503.11
2019-06-17 08:52:00 12504.11 12504.11 12503.11 12503.11
2019-06-17 08:53:00 12503.61 12503.61 12503.61 12503.61
2019-06-17 08:54:00 12503.61 12503.61 12503.11 12503.11
预期结果是: (如您所见,样本越来越长,总是从头开始,而在下一行结束。 注意:为了说明,我只保留了索引。他们不必一定要在结果中。)
[
[2019-06-17 08:45:00 12089.89 12089.89 12087.71 12087.71 ],[2019-06-17 08:45:00 12089.89 12089.89 12087.71 12087.71
2019-06-17 08:46:00 12087.91 NaN 12087.71 12087.91 ],[2019-06-17 08:45:00 12089.89 12089.89 12087.71 12087.71
2019-06-17 08:46:00 12087.91 NaN 12087.71 12087.91
2019-06-17 08:47:00 12088.21 12088.21 12084.21 12085.21 ],[2019-06-17 08:45:00 12089.89 12089.89 12087.71 12087.71
2019-06-17 08:46:00 12087.91 NaN 12087.71 12087.91
2019-06-17 08:47:00 12088.21 12088.21 12084.21 12085.21
2019-06-17 08:48:00 12085.09 12090.21 12084.91 12089.41 ],[2019-06-17 08:45:00 12089.89 12089.89 12087.71 12087.71
2019-06-17 08:46:00 12087.91 NaN 12087.71 12087.91
2019-06-17 08:47:00 12088.21 12088.21 12084.21 12085.21
2019-06-17 08:48:00 12085.09 12090.21 12084.91 12089.41
2019-06-17 08:49:00 12089.71 12090.21 12087.21 12088.21 ],[2019-06-17 08:45:00 12089.89 12089.89 12087.71 12087.71
2019-06-17 08:46:00 12087.91 NaN 12087.71 12087.91
2019-06-17 08:47:00 12088.21 12088.21 12084.21 12085.21
2019-06-17 08:48:00 12085.09 12090.21 12084.91 12089.41
2019-06-17 08:49:00 12089.71 12090.21 12087.21 12088.21
2019-06-17 08:50:00 12504.11 12504.11 12504.11 12504.11 ],[2019-06-17 08:45:00 12089.89 12089.89 12087.71 12087.71
2019-06-17 08:46:00 12087.91 NaN 12087.71 12087.91
2019-06-17 08:47:00 12088.21 12088.21 12084.21 12085.21
2019-06-17 08:48:00 12085.09 12090.21 12084.91 12089.41
2019-06-17 08:49:00 12089.71 12090.21 12087.21 12088.21
2019-06-17 08:50:00 12504.11 12504.11 12504.11 12504.11
2019-06-17 08:51:00 12504.11 NaN 12503.11 12503.11 ],[2019-06-17 08:45:00 12089.89 12089.89 12087.71 12087.71
2019-06-17 08:46:00 12087.91 NaN 12087.71 12087.91
2019-06-17 08:47:00 12088.21 12088.21 12084.21 12085.21
2019-06-17 08:48:00 12085.09 12090.21 12084.91 12089.41
2019-06-17 08:49:00 12089.71 12090.21 12087.21 12088.21
2019-06-17 08:50:00 12504.11 12504.11 12504.11 12504.11
2019-06-17 08:51:00 12504.11 NaN 12503.11 12503.11
2019-06-17 08:52:00 12504.11 12504.11 12503.11 12503.11 ],[2019-06-17 08:45:00 12089.89 12089.89 12087.71 12087.71
2019-06-17 08:46:00 12087.91 NaN 12087.71 12087.91
2019-06-17 08:47:00 12088.21 12088.21 12084.21 12085.21
2019-06-17 08:48:00 12085.09 12090.21 12084.91 12089.41
2019-06-17 08:49:00 12089.71 12090.21 12087.21 12088.21
2019-06-17 08:50:00 12504.11 12504.11 12504.11 12504.11
2019-06-17 08:51:00 12504.11 NaN 12503.11 12503.11
2019-06-17 08:52:00 12504.11 12504.11 12503.11 12503.11
2019-06-17 08:53:00 12503.61 12503.61 12503.61 12503.61 ],[2019-06-17 08:45:00 12089.89 12089.89 12087.71 12087.71
2019-06-17 08:46:00 12087.91 NaN 12087.71 12087.91
2019-06-17 08:47:00 12088.21 12088.21 12084.21 12085.21
2019-06-17 08:48:00 12085.09 12090.21 12084.91 12089.41
2019-06-17 08:49:00 12089.71 12090.21 12087.21 12088.21
2019-06-17 08:50:00 12504.11 12504.11 12504.11 12504.11
2019-06-17 08:51:00 12504.11 NaN 12503.11 12503.11
2019-06-17 08:52:00 12504.11 12504.11 12503.11 12503.11
2019-06-17 08:53:00 12503.61 12503.61 12503.61 12503.61
2019-06-17 08:54:00 12503.61 12503.61 12503.11 12503.11 ]
]
该怎么做?
解决方法
我会使用np.asarray
或DataFrame.values
,速度更快
arr = np.asarray(df)
#arr = df.values
result = [arr[:i] for i in range(1,df.shape[0]+1)]
或
arr = np.asarray(df)
#arr = df.values
result = list(map(lambda i: arr[:i],range(1,df.shape[0]+1)))
%%timeit
arr = np.asarray(df)
result = list(map(lambda i: arr[:i],df.shape[0]+1)))
24 µs ± 1 µs per loop (mean ± std. dev. of 7 runs,10000 loops each)
没有np.asarray
%%timeit
result = list(map(lambda i: df[:i],df.shape[0]+1)))
935 µs ± 37.9 µs per loop (mean ± std. dev. of 7 runs,1000 loops each)
您也可以尝试DataFrame.shift
,但这很慢
%%timeit
n = df.shape[0]
[df.shift(i).dropna().values for i in reversed(range(n))]
15.8 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs,100 loops each)
编辑
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(500000,4))
%%timeit
arr = np.asarray(df)
[arr[:i] for i in range(1,df.shape[0]+1)]
213 ms ± 16.6 ms per loop (mean ± std. dev. of 7 runs,1 loop each)