Problem description
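import pandas as pd
# col3 and col4 are completed to six values so the frame builds and matches the encoded output below; the added 1, 9 and 'B' are assumed.
e = pd.DataFrame({
    'col1': ['A', 'A', 'B', 'W', 'F', 'C'], 'col2': [2, 1, 9, 8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3], 'col4': ['a', 'B', 'c', 'D', 'e', 'F'],
})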
Here I encoded the data with sklearn.preprocessing.LabelEncoder, using the following lines of code:
x = list(e.columns)
# Import the label encoder
from sklearn import preprocessing
# The label_encoder object maps each distinct value to an integer code.
label_encoder = preprocessing.LabelEncoder()
for i in x:
    # Encode the labels in column i.
    e[i] = label_encoder.fit_transform(e[i])
print(e)
But this also encodes the numeric columns of type int.
Encoded dataset:
   col1  col2  col3  col4
0     0     1     0     3
1     0     0     1     0
2     1     5     5     4
3     4     4     4     1
4     3     3     2     5
5     2     2     3     2
How can I fix this?
Solution
A very simple option is to encode only the columns that hold string values. For example, adjust the code to:
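import pandas as pd
from sklearn import preprocessing
# col3 and col4 are completed to six values so the frame builds; the added 1, 9 and 'B' are assumed.
e = pd.DataFrame({
    'col1': ['A', 'A', 'B', 'W', 'F', 'C'], 'col2': [2, 1, 9, 8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3], 'col4': ['a', 'B', 'c', 'D', 'e', 'F'],
})
label_encoder = preprocessing.LabelEncoder()
for col in e.columns:
    # Only object-dtype (string) columns are encoded; numeric columns are left alone.
    if e[col].dtype == 'O':
        e[col] = label_encoder.fit_transform(e[col])
print(e)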
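One thing to watch with the loop above: the single label_encoder is refit on every column, so afterwards it only remembers the classes of the last column it saw. Below is a minimal sketch that keeps one encoder per column so the codes can be mapped back later. The `encoders` dict and the two-column frame are illustrative names and data, not part of the original post, and `is_numeric_dtype` is used here in place of the `dtype == 'O'` check.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'col1': ['A', 'A', 'B', 'W', 'F', 'C'],
    'col2': [2, 1, 9, 8, 7, 4],
})

encoders = {}  # hypothetical helper: column name -> fitted LabelEncoder
for col in df.columns:
    if not is_numeric_dtype(df[col]):
        encoders[col] = LabelEncoder()
        df[col] = encoders[col].fit_transform(df[col])

# Numeric columns are untouched; encoded columns can be decoded again.
decoded = encoders['col1'].inverse_transform(df['col1'])
print(df)
print(list(decoded))
```

Keeping the fitted encoders around also lets you apply the same mapping to new data with `transform` instead of fitting again.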
Or, better still:
import pandas as pd
from sklearn import preprocessing
def encode_labels(ser):
    # Encode string (object-dtype) columns; return any other column unchanged.
    if ser.dtype == 'O':
        return label_encoder.fit_transform(ser)
    else:
        return ser
label_encoder = preprocessing.LabelEncoder()
e = pd.DataFrame({
    'col1': ['A','F']
})
e_encoded = e.apply(encode_labels)
print(e_encoded)
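Filtering on column type and adapting the preprocessing accordingly is the right idea, and the most effective way to do it is with scikit-learn's column transformers and pipelines.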
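import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector
# OrdinalEncoder replaces LabelEncoder here: LabelEncoder is meant for the target y
# and fails inside a ColumnTransformer, which calls fit_transform(X, y).
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
Example 1: apply transformers by column name
my_transformer1 = ColumnTransformer(
    [
        ('transform_name_for_col1', OrdinalEncoder(), ['col1']),
        ('transformer_name_for_col2_and_col3', StandardScaler(), ['col2', 'col3']),
    ]
)
Example 2: apply transformers by column type
my_transformer2 = ColumnTransformer(
    [
        ('transform_name_categories', OrdinalEncoder(), make_column_selector(dtype_include=object)),
        ('transformer_name_for_numerical', StandardScaler(), make_column_selector(dtype_include=np.number)),
    ]
)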
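To see type-based selection run end to end while leaving the integer column untouched, which is what the question asks for, here is a minimal sketch. It assumes OrdinalEncoder for the string column (LabelEncoder is designed for the target y rather than for feature columns) and relies on `remainder='passthrough'` to forward every unselected column unchanged. The frame and the step name are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    'col1': ['A', 'A', 'B', 'W', 'F', 'C'],
    'col2': [2, 1, 9, 8, 7, 4],
})

by_dtype = ColumnTransformer(
    [
        # Every non-numeric column gets integer codes.
        ('categories', OrdinalEncoder(), make_column_selector(dtype_exclude=np.number)),
    ],
    # Columns not selected above (the numeric ones) are passed through as-is.
    remainder='passthrough',
)

# The result is a NumPy array: encoded columns first, passed-through columns after.
result = by_dtype.fit_transform(df)
print(result)
```

Note that the output is an array rather than a DataFrame, and that the column order follows the transformer list, not the original frame.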
Naturally, you can swap the encoder and StandardScaler for any transformers you prefer, including custom ones:
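from sklearn.base import BaseEstimator, TransformerMixin

class MyCustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        # set any hyperparameters here
        pass
    def fit(self, X, y=None):
        # learn whatever transform() needs from X
        return self
    def transform(self, X, y=None):
        # return the transformed data; this placeholder returns X unchanged
        return X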
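As a concrete instance of the skeleton above, here is a small sketch of a custom transformer that lower-cases string values, so that 'a' and 'A' would later receive the same code. The class name `LowercaseStrings` and the sample column are hypothetical and chosen only for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class LowercaseStrings(BaseEstimator, TransformerMixin):
    """Hypothetical example transformer: lower-cases every string value."""

    def fit(self, X, y=None):
        # Nothing to learn; fit only has to return self.
        return self

    def transform(self, X, y=None):
        # Work on a copy so the caller's DataFrame is left unchanged.
        X = pd.DataFrame(X).copy()
        for col in X.columns:
            X[col] = X[col].map(lambda v: v.lower() if isinstance(v, str) else v)
        return X

letters = pd.DataFrame({'col4': ['a', 'B', 'c', 'D', 'e', 'F']})
# fit_transform is inherited from TransformerMixin.
lowered = LowercaseStrings().fit_transform(letters)
print(lowered['col4'].tolist())
```

Because it follows the fit/transform contract, a class like this can be placed in a ColumnTransformer just like the built-in transformers.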
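Going further, I suggest using a scikit-learn Pipeline to combine different preprocessing steps by column and/or column type, which gives you more flexibility.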
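A minimal sketch of that combination, under a few assumptions: the numeric branch chains a SimpleImputer and a StandardScaler (both picked only to show two steps in sequence), the string branch uses OrdinalEncoder, and one missing value is added to the sample data so the imputer has something to do. None of these specific choices come from the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

df = pd.DataFrame({
    'col1': ['A', 'A', 'B', 'W', 'F', 'C'],
    'col2': [2, 1, np.nan, 8, 7, 4],  # one missing value, added for illustration
})

# Steps in a Pipeline run in order: fill gaps first, then scale.
numeric_steps = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

# Each branch is applied only to the columns its selector picks.
preprocess = ColumnTransformer([
    ('categories', OrdinalEncoder(), make_column_selector(dtype_exclude=np.number)),
    ('numbers', numeric_steps, make_column_selector(dtype_include=np.number)),
])

prepared = preprocess.fit_transform(df)
print(prepared)
```

The same `preprocess` object can itself be the first step of a larger Pipeline that ends in a model, so the preprocessing is fitted only on training data.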
See the course details here: