Using LabelEncoder in an sklearn Pipeline raises: fit_transform() takes 2 positional arguments but 3 were given

Problem description

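I've been trying to run some ML code, but I keep stumbling at the debugging stage after running my pipeline. I've looked around on various forums without much luck. I found some people saying that you can't use LabelEncoder in a pipeline. I'm not sure whether that's true. If anyone has any insight on the matter, I'd be very happy to hear it.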

I keep getting this error:

TypeError: fit_transform() takes 2 positional arguments but 3 were given

So I'm not sure whether the problem comes from me or from Python. Here is my code:

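import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier
# Assumption: TargetEncoder and CatBoostEncoder come from the category_encoders package
from category_encoders import CatBoostEncoder, TargetEncoder

data = pd.read_csv("ks-projects-201801.csv",index_col="ID",parse_dates=["deadline","launched"],infer_datetime_format=True)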

var = list(data)

data = data.drop(labels=[1014746686,1245461087,1384087152,1480763647,330942060,462917959,69489148])
missing = [i for i in var if data[i].isnull().any()]
data = data.dropna(subset=missing,axis=0)
le = LabelEncoder()
oe = OrdinalEncoder()
oh = OneHotEncoder()
y = [i for i in var if i=="state"]
y = data[var.pop(8)]

p,p.index = pd.Series(le.fit_transform(y)),y.index
q = pd.read_csv("y.csv",index_col="ID")["0"]
label_y = le.fit_transform(y)

x = data[var]

obj_feat = x.select_dtypes(include="object")
dat_feat = x.select_dtypes(include="datetime64[ns]")
dat_feat = dat_feat.assign(dmonth=dat_feat.deadline.dt.month.astype("int64"),dyear = dat_feat.deadline.dt.year.astype("int64"),lmonth=dat_feat.launched.dt.month.astype("int64"),lyear=dat_feat.launched.dt.year.astype("int64"))
dat_feat = dat_feat.drop(labels=["deadline","launched"],axis=1)
num_feat = x.select_dtypes(include=["int64","float64"])

u = dict(zip(list(obj_feat),[len(obj_feat[i].unique()) for i in obj_feat]))
le_obj = [i for i in u if u[i]<10]
oh_obj = [i for i in u if u[i]<20 and u[i]>10]
te_obj = [i for i in u if u[i]>20 and u[i]<25]
cb_obj = [i for i in u if u[i]>100]

# Pipeline time
#Impute and encode

strat = ["constant","most_frequent","mean","median"]
sc = StandardScaler()
oh_unk = "ignore"
encoders = [LabelEncoder(),OneHotEncoder(handle_unknown=oh_unk),TargetEncoder(),CatBoostEncoder()]

#num_trans = Pipeline(steps=[("imp",SimpleImputer(strategy=strat[2])),
num_trans = Pipeline(steps=[("sc",sc)])
#obj_imp = Pipeline(steps=[("imp",SimpleImputer(strategy=strat[1]))])
oh_enc = Pipeline(steps=[("oh_enc",encoders[1])])
te_enc = Pipeline(steps=[("te_enc",encoders[2])])
cb_enc = Pipeline(steps=[("cb_enc",encoders[0])])

trans = ColumnTransformer(transformers=[
                                        ("num",num_trans,list(num_feat)+list(dat_feat)),
                                        #("obj",obj_imp,list(obj_feat)),
                                        ("onehot",oh_enc,oh_obj),
                                        ("target",te_enc,te_obj),
                                        ("catboost",cb_enc,cb_obj)
                                        ])

models = [RandomForestClassifier(random_state=0),KNeighborsClassifier(),DecisionTreeClassifier(random_state=0)]

model = models[2]

print("Check 4")

# Chaining it all together
run = Pipeline(steps=[("Transformation",trans),("Model",model)])

x = pd.concat([obj_feat,dat_feat,num_feat],axis=1)
print("Check 5")
run.fit(x,p)

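It runs fine until run.fit raises the error. I'd love to hear any advice anyone might have, and any possible ways to solve this problem would be greatly appreciated! Thank you.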

Solution

The problem is the same as the one identified in this answer, but with a LabelEncoder in your case. LabelEncoder's fit_transform method takes:

def fit_transform(self,y):
    """Fit label encoder and return encoded labels
    ...

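Pipeline expects all of its transformers to accept three positional arguments, fit_transform(self, X, y), whereas LabelEncoder's method accepts only two.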
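
The mismatch can be reproduced without a Pipeline at all. The following is a minimal sketch, assuming a recent scikit-learn; the arrays are hypothetical toy data, not the Kickstarter dataset from the question. It calls fit_transform the way a Pipeline would, with both X and y:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

X = np.array(["a", "b", "a", "c"])  # hypothetical toy feature column
y = np.array([0, 1, 0, 1])          # hypothetical toy target

encoder = LabelEncoder()

# LabelEncoder is written for targets: one array in, encoded labels out.
encoded = encoder.fit_transform(X)

# A Pipeline or ColumnTransformer passes both X and y to each transformer,
# which is one positional argument more than LabelEncoder accepts.
try:
    encoder.fit_transform(X, y)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(encoded)
print(error_message)
```

The exact wording of the message may differ slightly between Python versions, but it should be the same TypeError about positional arguments shown in the question.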

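You could write a custom transformer as in the answer mentioned above, but LabelEncoder should not be used as a feature transformer: it is designed to encode the target y, not the input X. An extensive explanation of why can be found in the related discussion "LabelEncoder for categorical features?". So I'd recommend not using LabelEncoder here and, if the number of features gets too high, using one of the other Bayesian encoders instead, such as TargetEncoder, which is already in your list of encoders.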
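
One way to apply this advice is sketched below. It is an illustration under stated assumptions rather than a drop-in fix: the DataFrame, its column names, and the choice of OrdinalEncoder and OneHotEncoder for the feature columns are all hypothetical stand-ins for the question's data and encoder lists. LabelEncoder is kept outside the pipeline and applied only to the target, while encoders that accept (X, y) handle the features.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (LabelEncoder, OneHotEncoder,
                                   OrdinalEncoder, StandardScaler)
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for the Kickstarter frame.
X = pd.DataFrame({
    "category": ["Music", "Film", "Music", "Games", "Film", "Games"],
    "currency": ["USD", "GBP", "USD", "EUR", "USD", "GBP"],
    "goal": [1000.0, 5000.0, 250.0, 12000.0, 800.0, 3000.0],
})
y_raw = pd.Series(["failed", "successful", "failed",
                   "successful", "failed", "successful"])

# LabelEncoder stays outside the pipeline and encodes the target only.
y = LabelEncoder().fit_transform(y_raw)

# Feature columns use transformers whose fit_transform accepts (X, y).
trans = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["goal"]),
    ("ordinal",
     OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
     ["currency"]),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["category"]),
])

run = Pipeline(steps=[("Transformation", trans),
                      ("Model", DecisionTreeClassifier(random_state=0))])
run.fit(X, y)
predictions = run.predict(X)
print(predictions)
```

For high-cardinality columns, the "ordinal" or "onehot" entry could be swapped for a target-based encoder such as TargetEncoder, which also accepts y during fitting.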