如何有效地创建从数据帧到每个索引的序列?

问题描述

我有一个时间序列数据帧,我想从中生成序列。

Time                           A            B           C             D                                                               
2019-06-17 08:45:00     12089.89     12089.89    12087.71      12087.71      
2019-06-17 08:46:00     12087.91          NaN    12087.71      12087.91      
2019-06-17 08:47:00     12088.21     12088.21    12084.21      12085.21      
2019-06-17 08:48:00     12085.09     12090.21    12084.91      12089.41      
2019-06-17 08:49:00     12089.71     12090.21    12087.21      12088.21     
2019-06-17 08:50:00     12504.11     12504.11    12504.11      12504.11     
2019-06-17 08:51:00     12504.11          NaN    12503.11      12503.11    
2019-06-17 08:52:00     12504.11     12504.11    12503.11      12503.11      
2019-06-17 08:53:00     12503.61     12503.61    12503.61      12503.61      
2019-06-17 08:54:00     12503.61     12503.61    12503.11      12503.11      

预期结果是: (如您所见,样本越来越长,总是从头开始,而在下一行结束。 注意:为了说明,我只保留了索引。他们不必一定要在结果中。)

[                                                                                  
   [2019-06-17 08:45:00     12089.89     12089.89    12087.71      12087.71  ],[2019-06-17 08:45:00     12089.89     12089.89    12087.71      12087.71     
   2019-06-17 08:46:00     12087.91          NaN    12087.71      12087.91   ],[2019-06-17 08:45:00     12089.89     12089.89    12087.71      12087.71     
   2019-06-17 08:46:00     12087.91          NaN    12087.71      12087.91    
   2019-06-17 08:47:00     12088.21     12088.21    12084.21      12085.21   ],[2019-06-17 08:45:00     12089.89     12089.89    12087.71      12087.71     
   2019-06-17 08:46:00     12087.91          NaN    12087.71      12087.91    
   2019-06-17 08:47:00     12088.21     12088.21    12084.21      12085.21      
   2019-06-17 08:48:00     12085.09     12090.21    12084.91      12089.41   ],[2019-06-17 08:45:00     12089.89     12089.89    12087.71      12087.71     
   2019-06-17 08:46:00     12087.91          NaN    12087.71      12087.91    
   2019-06-17 08:47:00     12088.21     12088.21    12084.21      12085.21      
   2019-06-17 08:48:00     12085.09     12090.21    12084.91      12089.41     
   2019-06-17 08:49:00     12089.71     12090.21    12087.21      12088.21    ],[2019-06-17 08:45:00     12089.89     12089.89    12087.71      12087.71     
   2019-06-17 08:46:00     12087.91          NaN    12087.71      12087.91    
   2019-06-17 08:47:00     12088.21     12088.21    12084.21      12085.21      
   2019-06-17 08:48:00     12085.09     12090.21    12084.91      12089.41     
   2019-06-17 08:49:00     12089.71     12090.21    12087.21      12088.21     
   2019-06-17 08:50:00     12504.11     12504.11    12504.11      12504.11    ],[2019-06-17 08:45:00     12089.89     12089.89    12087.71      12087.71     
   2019-06-17 08:46:00     12087.91          NaN    12087.71      12087.91    
   2019-06-17 08:47:00     12088.21     12088.21    12084.21      12085.21      
   2019-06-17 08:48:00     12085.09     12090.21    12084.91      12089.41     
   2019-06-17 08:49:00     12089.71     12090.21    12087.21      12088.21     
   2019-06-17 08:50:00     12504.11     12504.11    12504.11      12504.11     
   2019-06-17 08:51:00     12504.11          NaN    12503.11      12503.11    ],[2019-06-17 08:45:00     12089.89     12089.89    12087.71      12087.71     
   2019-06-17 08:46:00     12087.91          NaN    12087.71      12087.91    
   2019-06-17 08:47:00     12088.21     12088.21    12084.21      12085.21      
   2019-06-17 08:48:00     12085.09     12090.21    12084.91      12089.41     
   2019-06-17 08:49:00     12089.71     12090.21    12087.21      12088.21     
   2019-06-17 08:50:00     12504.11     12504.11    12504.11      12504.11     
   2019-06-17 08:51:00     12504.11          NaN    12503.11      12503.11    
   2019-06-17 08:52:00     12504.11     12504.11    12503.11      12503.11    ],[2019-06-17 08:45:00     12089.89     12089.89    12087.71      12087.71     
   2019-06-17 08:46:00     12087.91          NaN    12087.71      12087.91    
   2019-06-17 08:47:00     12088.21     12088.21    12084.21      12085.21      
   2019-06-17 08:48:00     12085.09     12090.21    12084.91      12089.41     
   2019-06-17 08:49:00     12089.71     12090.21    12087.21      12088.21     
   2019-06-17 08:50:00     12504.11     12504.11    12504.11      12504.11     
   2019-06-17 08:51:00     12504.11          NaN    12503.11      12503.11    
   2019-06-17 08:52:00     12504.11     12504.11    12503.11      12503.11    
   2019-06-17 08:53:00     12503.61     12503.61    12503.61      12503.61     ],[2019-06-17 08:45:00     12089.89     12089.89    12087.71      12087.71     
   2019-06-17 08:46:00     12087.91          NaN    12087.71      12087.91    
   2019-06-17 08:47:00     12088.21     12088.21    12084.21      12085.21      
   2019-06-17 08:48:00     12085.09     12090.21    12084.91      12089.41     
   2019-06-17 08:49:00     12089.71     12090.21    12087.21      12088.21     
   2019-06-17 08:50:00     12504.11     12504.11    12504.11      12504.11     
   2019-06-17 08:51:00     12504.11          NaN    12503.11      12503.11    
   2019-06-17 08:52:00     12504.11     12504.11    12503.11      12503.11    
   2019-06-17 08:53:00     12503.61     12503.61    12503.61      12503.61     
   2019-06-17 08:54:00     12503.61     12503.61    12503.11      12503.11  ]
                                                                               ]

该怎么做?

解决方法

我会使用np.asarrayDataFrame.values,速度更快

arr = np.asarray(df)
#arr = df.values
result = [arr[:i] for i in range(1,df.shape[0]+1)]

arr = np.asarray(df)
#arr = df.values
result = list(map(lambda i: arr[:i],range(1,df.shape[0]+1)))

%%timeit
arr = np.asarray(df)
result = list(map(lambda i: arr[:i],df.shape[0]+1)))
24 µs ± 1 µs per loop (mean ± std. dev. of 7 runs,10000 loops each)

没有np.asarray

%%timeit
result = list(map(lambda i: df[:i],df.shape[0]+1)))
935 µs ± 37.9 µs per loop (mean ± std. dev. of 7 runs,1000 loops each)

您也可以尝试DataFrame.shift,但这很慢

%%timeit
n = df.shape[0]
[df.shift(i).dropna().values for i in reversed(range(n))]
15.8 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs,100 loops each)

编辑

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(500000,4))


%%timeit
arr = np.asarray(df)
[arr[:i] for i in range(1,df.shape[0]+1)]

213 ms ± 16.6 ms per loop (mean ± std. dev. of 7 runs,1 loop each)

相关问答

错误1:Request method ‘DELETE‘ not supported 错误还原:...
错误1:启动docker镜像时报错:Error response from daemon:...
错误1:private field ‘xxx‘ is never assigned 按Alt...
报错如下,通过源不能下载,最后警告pip需升级版本 Requirem...