Understanding how OneHotEncoder works - why do I get several 1s in a column?

Question

I am performing one-hot encoding inside an sklearn pipeline:

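from sklearn.compose import make_column_transformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from xgboost import XGBClassifier

# Scale the numeric columns and one-hot encode the 'country' column
preprocess = make_column_transformer(
    (MinMaxScaler(), numeric_cols),
    (OneHotEncoder(), ['country']),
)

param_grid = {
    'xgbclassifier__learning_rate': [0.01, 0.005, 0.001],
}

model = make_pipeline(preprocess, XGBClassifier())

# Initialize the grid search model
# (the iid argument was removed in scikit-learn 0.24, so it is omitted here)
model = GridSearchCV(model, param_grid=param_grid, scoring='roc_auc', verbose=1, refit=True, cv=3)
model.fit(X_train, y_train)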

Then, to see how the countries are one-hot encoded, I run the following (I know there are two countries):

pd.DataFrame(preprocess.fit_transform(X_test))

The result is:

A few questions:

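1. Correct me if I'm wrong, but I thought one-hot encoding turns each value into a series of all 0s and a single 1. Why am I getting several 1s in one column?
2. When I call model.predict(X_test), does it apply the transformations that were defined in the pipeline during training?
3. How do I retrieve the feature names when I call fit_transform?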

Answer

To help you better understand (1), i.e. how OHE works:

Suppose you have a single column of categorical data:

import pandas as pd

df = pd.DataFrame({"categorical": ["a", "b", "a"]})
print(df)
  categorical
0           a
1           b
2           a

Then you get exactly one 1 per row (this always holds for a single column of categorical data), but not necessarily one 1 per column:

from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
ohe.fit(df)
ohe_out = ohe.transform(df).toarray()
# get_feature_names_out needs scikit-learn >= 1.0; older releases used get_feature_names
# ohe_df = pd.DataFrame(ohe_out, columns=ohe.get_feature_names_out(df.columns))
ohe_df = pd.DataFrame(ohe_out, columns=ohe.get_feature_names_out(["categorical"]))
print(ohe_df)
   categorical_a  categorical_b
0            1.0            0.0
1            0.0            1.0
2            1.0            0.0
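
One way to check this for yourself is to sum the encoded matrix along each axis. This is only a minimal sketch on the same toy frame; the names row_sums and col_sums are mine, picked for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"categorical": ["a", "b", "a"]})

# Encode the single categorical column and convert the sparse result to a dense array
encoded = OneHotEncoder().fit_transform(df).toarray()

# Each row holds exactly one 1, because each row has exactly one category
row_sums = encoded.sum(axis=1)

# Each column counts how many rows have that category, so it can exceed 1
col_sums = encoded.sum(axis=0)

print(row_sums)
print(col_sums)
```

Here every entry of row_sums is 1, while the entry of col_sums for "a" is 2 because "a" appears in two rows. That is all "several 1s in a column" means: several rows share that category.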

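If you add more data columns, e.g. a numeric column, the rule still holds for each original column (one 1 within that column's group of encoded columns), but no longer for the row as a whole: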

df = pd.DataFrame({"categorical": ["a", "b", "a"], "nums": [0, 1, 0]})
print(df)
  categorical  nums
0           a     0
1           b     1
2           a     0

ohe.fit(df)
ohe_out = ohe.transform(df).toarray()
# get_feature_names_out needs scikit-learn >= 1.0; older releases used get_feature_names
ohe_df = pd.DataFrame(ohe_out, columns=ohe.get_feature_names_out(["categorical", "nums"]))
print(ohe_df)
   categorical_a  categorical_b  nums_0  nums_1
0            1.0            0.0     1.0     0.0
1            0.0            1.0     0.0     1.0
2            1.0            0.0     1.0     0.0
