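Problem description
I have the following code which, for a sorted Pandas DataFrame, groups by one column and creates two new columns: one built from the previous 4 rows of the group plus the current row, and another built from the next (future) row in the group.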
import pandas as pd

data_test = {'nr': [1, 1, 1, 1, 1, 6, 6, 6, 6, 6, 6, 6],
             'val': [11, 12, 13, 14, 15, 61, 62, 63, 64, 65, 66, 67]}
df_test = pd.DataFrame(data_test, columns=['nr', 'val'])
print(df_test)
This produces the following frame:
nr val
0 1 11
1 1 12
2 1 13
3 1 14
4 1 15
5 6 61
6 6 62
7 6 63
8 6 64
9 6 65
10 6 66
11 6 67
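Now I have the following code, which groups by 'nr' and builds, for each row, a column containing the previous 4 values of 'val' in the group plus the current value. It also builds an additional column holding, for each row, the next value of 'val' in the group.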
df_test['past4'] = df_test.groupby(['nr'])['val'].transform(lambda x: x.shift(4).fillna(0))
df_test['past3'] = df_test.groupby(['nr'])['val'].transform(lambda x: x.shift(3).fillna(0))
df_test['past2'] = df_test.groupby(['nr'])['val'].transform(lambda x: x.shift(2).fillna(0))
df_test['past1'] = df_test.groupby(['nr'])['val'].transform(lambda x: x.shift(1).fillna(0))
df_test['future'] = df_test.groupby(['nr'])['val'].transform(lambda x: x.shift(-1).fillna(0))
df_test['amounts'] = df_test[['past4','past3','past2','past1','val']].values.tolist()
df_test.drop(columns=['past4', 'past3', 'past2', 'past1'], inplace=True)
df_test
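    nr  val  future  amounts
0    1   11      12  [0, 0, 0, 0, 11]
1    1   12      13  [0, 0, 0, 11, 12]
2    1   13      14  [0, 0, 11, 12, 13]
3    1   14      15  [0, 11, 12, 13, 14]
4    1   15       0  [11, 12, 13, 14, 15]
5    6   61      62  [0, 0, 0, 0, 61]
6    6   62      63  [0, 0, 0, 61, 62]
7    6   63      64  [0, 0, 61, 62, 63]
8    6   64      65  [0, 61, 62, 63, 64]
9    6   65      66  [61, 62, 63, 64, 65]
10   6   66      67  [62, 63, 64, 65, 66]
11   6   67       0  [63, 64, 65, 66, 67]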
I am sure I should be able to build the list column named 'amounts' more easily, possibly in a single line. How can I do that?
Solution
Use a custom function to create the nested lists, for example:
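import numpy as np

def f(x):
    # list comprehension with shift by 4, 3, 2, 1, 0
    L = [x['val'].shift(i).fillna(0) for i in range(4, -1, -1)]
    # shift in the other direction for the 'future' column
    x['future'] = x['val'].shift(-1).fillna(0).astype(int)
    # column filled by lists (one 5-element list per row)
    x['amounts'] = pd.Series(np.array(L).astype(int).T.tolist(), index=x.index)
    return x

# group_keys=False keeps the original index; newer pandas versions would otherwise prepend 'nr' as an index level
df_test = df_test.groupby(['nr'], group_keys=False).apply(f)
print(df_test)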
    nr  val  future  amounts
0    1   11      12  [0, 0, 0, 0, 11]
1    1   12      13  [0, 0, 0, 11, 12]
2    1   13      14  [0, 0, 11, 12, 13]
3    1   14      15  [0, 11, 12, 13, 14]
4    1   15       0  [11, 12, 13, 14, 15]
5    6   61      62  [0, 0, 0, 0, 61]
6    6   62      63  [0, 0, 0, 61, 62]
7    6   63      64  [0, 0, 61, 62, 63]
8    6   64      65  [0, 61, 62, 63, 64]
9    6   65      66  [61, 62, 63, 64, 65]
10   6   66      67  [62, 63, 64, 65, 66]
11   6   67       0  [63, 64, 65, 66, 67]
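If a single expression is the goal, one possible variant (a sketch, not part of the answer above) relies on `groupby(...).shift` accepting a `fill_value` argument, so no custom function is needed. The sample frame is rebuilt under the illustrative name `df` so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'nr':  [1, 1, 1, 1, 1, 6, 6, 6, 6, 6, 6, 6],
    'val': [11, 12, 13, 14, 15, 61, 62, 63, 64, 65, 66, 67],
})

g = df.groupby('nr')['val']
# next value within each group; rows without a successor get 0
df['future'] = g.shift(-1, fill_value=0)
# place the columns shifted by 4, 3, 2, 1, 0 side by side, giving one 5-element list per row
df['amounts'] = np.column_stack(
    [g.shift(i, fill_value=0) for i in range(4, -1, -1)]
).astype(int).tolist()
print(df)
```

Because the shifts are computed per group, the zero padding restarts at the first row of every 'nr' value, which is the same behaviour as the custom function.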
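Moving your block into a function makes the code more modular and lightweight.
In this particular example we pass reversed(range(5)) as shift_values, which represents the list [4, 3, 2, 1, 0].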
import pandas as pd

data_test = {'nr': [1, 1, 1, 1, 1, 6, 6, 6, 6, 6, 6, 6],
             'val': [11, 12, 13, 14, 15, 61, 62, 63, 64, 65, 66, 67]}
df_test = pd.DataFrame(data_test, columns=['nr', 'val'])

def generate_past(df, shift_values):
    # one shifted copy of 'val' per shift value, computed within each 'nr' group
    serie = pd.DataFrame([df.groupby('nr')['val'].transform(lambda x: x.shift(shift_value).fillna(0)) for shift_value in shift_values])
    # transpose so that each row holds the shifted values of one original row
    return serie.T.values.tolist()

df_test['future'] = df_test.groupby(['nr'])['val'].transform(lambda x: x.shift(-1).fillna(0))
df_test['amounts'] = generate_past(df_test, reversed(range(5)))
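One detail worth knowing about the call above: `reversed(range(5))` is a one-shot iterator, so it can be consumed only once. The short sketch below shows what it expands to and why a list or a plain `range` is the safer argument when the same shift values are reused; the variable names are only illustrative.

```python
shift_values = reversed(range(5))
first_pass = list(shift_values)   # the iterator is consumed here
second_pass = list(shift_values)  # nothing is left on the second pass

print(first_pass)   # [4, 3, 2, 1, 0]
print(second_pass)  # []

# a range with a negative step describes the same values and can be iterated repeatedly
reusable = range(4, -1, -1)
print(list(reusable), list(reusable))
```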
You can try it this way (same as jezrael's answer) but without apply. It is not a great approach, because I am building a new DataFrame.
groups = []
for _, grp in df_test.groupby('nr'):
    grp = grp.reset_index(drop=True)
    grp['future'] = grp['val'].shift(-1).fillna(0).astype(int)
    # shifting by n-1, ..., 0 and keeping the last 5 values yields one window per row
    grp['amounts'] = pd.Series([grp['val'].shift(i).fillna(0).values[-5:] for i in range(len(grp) - 1, -1, -1)])
    groups.append(grp)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df_new = pd.concat(groups, ignore_index=True)
df_new:
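    nr  val  future  amounts
0    1   11      12  [0.0, 0.0, 0.0, 0.0, 11.0]
1    1   12      13  [0.0, 0.0, 0.0, 11.0, 12.0]
2    1   13      14  [0.0, 0.0, 11.0, 12.0, 13.0]
3    1   14      15  [0.0, 11.0, 12.0, 13.0, 14.0]
4    1   15       0  [11, 12, 13, 14, 15]
5    6   61      62  [0.0, 0.0, 0.0, 0.0, 61.0]
6    6   62      63  [0.0, 0.0, 0.0, 61.0, 62.0]
7    6   63      64  [0.0, 0.0, 61.0, 62.0, 63.0]
8    6   64      65  [0.0, 61.0, 62.0, 63.0, 64.0]
9    6   65      66  [61.0, 62.0, 63.0, 64.0, 65.0]
10   6   66      67  [62.0, 63.0, 64.0, 65.0, 66.0]
11   6   67       0  [63, 64, 65, 66, 67]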
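The mix of floats and integers in the lists above comes from `shift`: the positions it vacates become NaN, which turns an integer column into floats before `fillna(0)` runs, whereas a shift by 0 leaves the integers untouched. A minimal sketch, assuming a small illustrative Series `s`, of how the `fill_value` argument avoids the detour through NaN:

```python
import pandas as pd

s = pd.Series([11, 12, 13])

# NaN is introduced first, so the result is float64 even after fillna
via_fillna = s.shift(1).fillna(0)
# the gap is filled directly with the given integer
via_fill_value = s.shift(1, fill_value=0)

print(via_fillna.tolist())
print(via_fill_value.tolist())
print(via_fillna.dtype, via_fill_value.dtype)
```

Using `fill_value` inside the loop would therefore be one way to get uniformly typed lists, if that matters for later processing.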