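Problem description
I have a simple dataset with only two columns (year and oil price). I now need to reshape the data so that a Keras LSTM layer accepts it as its input_shape.
My code looks like this; I mainly need help with the area marked in yellow. I think I need to change or convert X_train and X_test first (to arrays, normalization, etc.), but everything I tried only produced errors...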
Solution
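I think your code will work if you keep X_train and X_test as two-dimensional DataFrames. So your problem should be solved if you define
X_train = train[["Year"]]
X_test = test[["Year"]]
After that, you can define the LSTM architecture as you did in the question.
Alternatively, one solution is to reshape X_train with X_train.reshape((X_train.shape[0],1,1)) and omit the input_shape argument in the first LSTM layer. The input to an LSTM layer always has the shape (batch size, timesteps, features). See more about this here.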
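To make the (batch size, timesteps, features) layout concrete, here is a minimal NumPy-only sketch. The array x is a made-up stand-in for the scaled "Year" column, not the asker's actual data; it only shows how a 1-D array becomes the 3-D shape described above.

```python
import numpy as np

# Hypothetical stand-in for a scaled "Year" column: 5 samples in a 1-D array.
x = np.linspace(0.0, 1.0, 5)
print(x.shape)

# An LSTM layer expects (batch size, timesteps, features).
# Here every sample is one timestep carrying one feature.
x_lstm = x.reshape((x.shape[0], 1, 1))
print(x_lstm.shape)
```

The values are unchanged by the reshape; only the number of axes differs.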
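Another thing to consider is scaling the data to a small range, e.g. [0,1], so that training converges smoothly, since we don't want the weights to go wild during updates. How much this matters also depends on the specific implementation/application.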
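As a rough illustration of what scaling to [0,1] means, the sketch below applies min-max scaling by hand with NumPy. The helper min_max_scale is hypothetical and written only for this illustration; the prices are the first few values from the dataset used in the full example.

```python
import numpy as np

def min_max_scale(values):
    """Scale a 1-D array to [0, 1]; also return the min and max used."""
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo), lo, hi

# First few oil prices from the example dataset.
prices = np.array([0.49, 1.05, 3.15, 8.06, 6.59])

scaled, lo, hi = min_max_scale(prices)

# Undo the scaling to get back to the original units.
restored = scaled * (hi - lo) + lo
```

Keeping lo and hi around is what makes it possible to map predictions back to the original price scale afterwards.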
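You may need to play with the knobs (hyperparameters such as activation functions, dropouts, units, batch size, etc.) to get better predictive performance.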
Here is a complete example with comments:
# imports
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense,LSTM,Dropout
from keras.activations import relu
# data setup
years = [i for i in range(1861,2021)]
oil = [0.49,1.05,3.15,8.06,6.59,3.74,2.41,3.63,3.64,3.86,4.34,1.83,1.17,1.35,2.56,2.42,1.19,0.86,0.95,0.78,0.84,0.88,0.71,0.67,0.94,0.87,0.56,0.64,1.36,1.18,0.79,0.91,1.29,0.96,0.8,0.62,0.73,0.72,0.7,0.61,0.74,0.81,1.1,1.56,1.98,2.01,3.07,1.73,1.61,1.34,1.43,1.68,1.88,1.3,1.27,0.65,0.97,1.09,1.13,1.02,1.14,1.2,1.21,1.12,1.9,1.99,1.78,1.71,1.93,2.08,1.8,2.24,2.48,3.29,11.58,11.53,12.8,13.92,14.02,31.61,36.83,35.93,32.97,29.55,28.78,27.56,14.43,18.43503937,14.9238417,18.22611328,23.72582031,20.0009144,19.32083658,16.97163424,15.81762646,17.01667969,20.66848837,19.09258755,12.71566148,17.97007782,28.49544922,24.44389105,25.02325581,28.83070313,38.265,54.52108949,65.1440625,72.38907843,97.25597276,61.67126482,79.4955336,111.2555976,111.6697024,108.6585178,98.94600791,52.38675889,43.73416996,54.19244048,71.31005976,64.21057312,41.83834646]
data = pd.DataFrame(np.vstack((years,oil)).T,columns = ["Year","Oil Crude Price ($)"]).astype({'Year': int})
# train percentage,thus test percentage = 1 - train_split
train_split = 0.8
# scaler for inputs and outputs
scaler = MinMaxScaler()
# scaling data between 0 and 1
data_scaled = scaler.fit_transform(data.values)
# splitting data into train set 0.8 * 160 = first 128 rows
X_train = data_scaled[:int(train_split * len(data_scaled)),0]
y_train = data_scaled[:int(train_split * len(data_scaled)),1]
# splitting data into test set 0.2 * 160 = last 32 rows
X_test = data_scaled[int(train_split * len(data_scaled)):,0]
y_test = data_scaled[int(train_split * len(data_scaled)):,1]
# sanity check,adding rows in X_train and X_test MUST add to total rows in data
assert len(X_train) + len(X_test) == len(data)
# reshaping inputs for LSTM: (batch size, timesteps, features)
X_train_lstm = X_train.reshape((X_train.shape[0],1,1))
X_test_lstm = X_test.reshape((X_test.shape[0],1,1))
# building model with several LSTM,dropouts,and dense layers
model = Sequential()
model.add(LSTM(units = 512,return_sequences = True))
model.add(Dropout(0.2))
model.add(LSTM(units = 128,return_sequences = True))
model.add(Dropout(0.2))
model.add(LSTM(units = 64,return_sequences = True))
model.add(Dropout(0.2))
model.add(LSTM(units = 32,return_sequences = True))
model.add(Dropout(0.2))
model.add(LSTM(units = 16))
model.add(Dropout(0.2))
model.add(Dense(units = 1))
# compiling model with the adam optimizer and mse loss
model.compile(optimizer="adam",loss="mse")
# training model for 100 epochs and 20 samples per batch
history = model.fit(X_train_lstm,y_train,epochs=100,batch_size = 20,verbose=1)
# making predictions using test set
y_pred_scaled = model.predict(X_test_lstm)
# the scaler was fitted on two columns,so x is put back next to y before inverting
def original_scale(scaler,x,y):
    return scaler.inverse_transform(np.concatenate((x.reshape((x.shape[0],1)),y),axis=1))
# transforming values back to original scale
y_pred = original_scale(scaler,X_test,y_pred_scaled)[:,1] # predicted price
y_test = data.values[int(train_split * data_scaled.shape[0]):,1] # actual price
y_test_years = data.values[int(train_split * data_scaled.shape[0]):,0]
# wrapping up putting results together in a dataframe
output = pd.DataFrame(data = np.vstack((y_test_years,y_test,y_pred)).T,columns = ["Year","Oil Crude Price ($)","Predicted Oil Crude Price ($)"]).astype({'Year': int})
print(output)
Output:
Year Oil Crude Price ($) Predicted Oil Crude Price ($)
0 1989 18.226113 22.428251
1 1990 23.725820 23.613170
2 1991 20.000914 24.847811
3 1992 19.320837 26.132980
4 1993 16.971634 27.469316
5 1994 15.817626 28.857338
6 1995 17.016680 30.297413
7 1996 20.668488 31.789748
8 1997 19.092588 33.334372
9 1998 12.715661 34.931136
10 1999 17.970078 36.579701
11 2000 28.495449 38.279516
12 2001 24.443891 40.029859
13 2002 25.023256 41.829785
14 2003 28.830703 43.678103
15 2004 38.265000 45.573485
16 2005 54.521089 47.514343
17 2006 65.144063 49.498909
18 2007 72.389078 51.525253
19 2008 97.255973 53.591203
20 2009 61.671265 55.694481
21 2010 79.495534 57.832594
22 2011 111.255598 60.002952
23 2012 111.669702 62.202827
24 2013 108.658518 64.429348
25 2014 98.946008 66.679673
26 2015 52.386759 68.950699
27 2016 43.734170 71.239459
28 2017 54.192440 73.542838
29 2018 71.310060 75.857814
30 2019 64.210573 78.181272
31 2020 41.838346 80.510184
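The original_scale helper in the example pads the predictions with the year column because the scaler was fitted on both columns at once, so inverse_transform expects two columns. As a rough sketch of what that step does, assuming plain min-max scaling and using made-up numbers (data, col_min and col_max are hypothetical), the same idea looks like this in NumPy:

```python
import numpy as np

# Made-up two-column data: [year, price].
data = np.array([[2018.0, 70.0], [2019.0, 64.0], [2020.0, 42.0]])

# Per-column min-max scaling, fitted on both columns together.
col_min = data.min(axis=0)
col_max = data.max(axis=0)
scaled = (data - col_min) / (col_max - col_min)

# Pretend these are scaled price predictions with shape (n, 1).
y_pred_scaled = scaled[:, [1]]

# Pad with the scaled year column so the array has the two columns
# the scaling was fitted on, then invert and keep only the price.
padded = np.concatenate((scaled[:, [0]], y_pred_scaled), axis=1)
restored = padded * (col_max - col_min) + col_min
y_pred = restored[:, 1]
```

Only the second column of the restored array is of interest; the year column is there purely to satisfy the two-column shape.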
Package versions:
keras 2.4.3
numpy 1.19.2
pandas 1.1.5
scikit-learn 0.23.2
tensorflow 2.4.1