Am I using one-hot encoding correctly in my machine learning model?

Problem description

I am building a Python model with RandomForestRegressor from the sklearn package. It has 14 features and 1 label. Each column holds 10000 rows, so the array shapes are Feature: (10000,14) and Label: (10000,1).

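13 of the 14 features are strings, so I apply OneHotEncoder from sklearn.preprocessing to those 13 string features (the remaining feature is a float). Below I show just one feature as an example: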

from numpy import array
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# One of the features: the BIC code of a bank (e.g. "HANDSESS"), a string column with a limited set of values
values = array(df['receiver_bic'])

# integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)

# binary encode (scikit-learn 1.2 and later name this parameter sparse_output instead of sparse)
onehot_encoder = OneHotEncoder(sparse=False, categories='auto')
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
receiver_bic_onehot = onehot_encoder.fit_transform(integer_encoded)

Shape of the final array receiver_bic_onehot: (10000,622)
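Since scikit-learn 0.20, OneHotEncoder accepts string columns directly, so the LabelEncoder step is probably not needed, and several columns can be encoded in one call. A minimal sketch, assuming a toy DataFrame (the `sender_bic` column and all values are invented for illustration and are not from the original data):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data standing in for the real DataFrame (values are invented).
df = pd.DataFrame({
    "receiver_bic": ["HANDSESS", "NDEASESS", "HANDSESS", "SWEDSESS"],
    "sender_bic": ["ESSESESS", "ESSESESS", "DABASESX", "ESSESESS"],  # hypothetical column
})

# Encode several string columns in one call; no LabelEncoder step is needed.
# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
onehot = encoder.fit_transform(df[["receiver_bic", "sender_bic"]])

# The default output is a SciPy sparse matrix; densify only if it fits in memory.
onehot_dense = onehot.toarray()
print(onehot_dense.shape)  # (4, 5): 3 receiver categories + 2 sender categories
```

Keeping the sparse matrix instead of calling `toarray()` is usually the cheaper choice when a column has hundreds of categories.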

After applying the same processing to each of the 13 string features, I get one-hot encoded features with the following shapes:

# Shapes of the 13 one-hot encoded features
(10000,622), (10000,397), (10000,325), (10000,331), (10000,319), (10000,235), (10000,24), (10000,4), (10000,196), (10000,78), (10000,118), (10000,128), (10000,55)
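As a quick check of the arithmetic, the 13 widths listed above can be summed and combined with the single numeric feature. The variable names here are only illustrative:

```python
# Widths of the 13 one-hot encoded features, as listed above.
onehot_widths = [622, 397, 325, 331, 319, 235, 24, 4, 196, 78, 118, 128, 55]

total_onehot = sum(onehot_widths)   # number of one-hot columns
total_columns = total_onehot + 1    # plus the single numeric feature
print(total_onehot, total_columns)  # 2832 2833
```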

Finally, I collect these features into X:

X=np.c_[OneHot_Feature_1,OneHot_Feature_2,...,OneHot_Feature_13,Numeric_Feature_14]

y = df[target_col] # Target column

X = np.array(X) # Converting Feature and Target to numpy arrays
y = np.array(y)

# Split dataset into training set and test set
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=1)

The resulting array shapes are:

Training Features Shape: (7000,2833)
Training Labels Shape: (7000,1)
Testing Features Shape: (3000,2833)
Testing Labels Shape: (3000,1)

Before using them in the model, I scale the features X with StandardScaler():

scaler = StandardScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)

Finally, I feed these arrays into the RandomForestRegressor model:

est_RFR = RandomForestRegressor(n_estimators=10) 
est_RFR = est_RFR.fit(X_train,y_train.ravel()) # ravel() is needed to convert the (n,1) shape into (n,)
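The manual steps above (encode each column, concatenate, split, fit) can also be expressed as a single scikit-learn Pipeline with a ColumnTransformer. The sketch below is one possible arrangement, not the only one. It assumes synthetic data: the `currency` and `amount` columns, the category values, and the random target are all made up. The StandardScaler step is left out on the assumption that tree-based models, which split on thresholds, generally do not benefit from feature scaling.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the real data (column names and values are invented).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "receiver_bic": rng.choice(["HANDSESS", "NDEASESS", "SWEDSESS"], size=n),
    "currency": rng.choice(["SEK", "EUR"], size=n),
    "amount": rng.normal(size=n),
})
y = rng.normal(size=n)

categorical_cols = ["receiver_bic", "currency"]
numeric_cols = ["amount"]

# One-hot encode the string columns and pass the numeric column through unchanged.
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("numeric", "passthrough", numeric_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestRegressor(n_estimators=10, random_state=1)),
])

# Split the raw DataFrame; the encoder is then fitted on the training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, random_state=1
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions.shape)
```

Fitting the encoder inside the pipeline means categories that appear only in the test split are handled by `handle_unknown="ignore"` instead of raising an error.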

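My questions:

Is the process above for applying OneHotEncoder to multiple features correct?
Even if it is correct, X.shape was (10000,14) before OneHotEncoder and is (10000,2833) after it. My intuition says that 2833 columns instead of 14 is a lot of feature columns to feed the model. Is there a more suitable approach?
I am trying to convert the one-hot encoded values back to their original values with inverted = label_encoder.inverse_transform([argmax(receiver_bic_onehot[:,:])]). However, print(inverted) outputs only a single original value rather than the whole column. How should I write this code?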
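On the last question, a likely cause is that argmax without an axis argument flattens the array and returns a single index. A minimal sketch of recovering the whole column, assuming a small invented array of BIC codes in place of the real column:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Invented sample values standing in for df['receiver_bic'].
values = np.array(["HANDSESS", "NDEASESS", "HANDSESS", "SWEDSESS"])

label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values).reshape(-1, 1)
onehot_encoder = OneHotEncoder(categories="auto")
receiver_bic_onehot = onehot_encoder.fit_transform(integer_encoded).toarray()

# axis=1 returns the position of the 1 in every row, i.e. one integer label per row.
row_indices = np.argmax(receiver_bic_onehot, axis=1)
inverted = label_encoder.inverse_transform(row_indices)
print(inverted)  # ['HANDSESS' 'NDEASESS' 'HANDSESS' 'SWEDSESS']
```

This works because LabelEncoder produces contiguous integers starting at 0, so the column position found by argmax equals the integer label.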


machine-learning one-hot-encoding python random-forest scikit-learn