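Problem description
I have a training set that I obtained with sklearn's train_test_split. I now want to use GridSearchCV to create 10 splits and find the AUC score for each value of d from 2 to 10. Then I want to find the value that gives the best AUC score.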
import sklearn.metrics
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier

# X_train and y_train come from train_test_split (NumPy arrays)
min_samples_list = list(range(2, 10))
tree_para = [{'min_samples_leaf': min_samples_list}]
cv = KFold(n_splits=10)
decisionTreeClassifier = DecisionTreeClassifier(random_state=0)
clf = GridSearchCV(decisionTreeClassifier, tree_para, cv=10)
fold_accuracy = []
for train_index, valid_index in cv.split(X_train):
    train_x, test_x = X_train[train_index], X_train[valid_index]
    train_y, test_y = y_train[train_index], y_train[valid_index]
    model = clf.fit(train_x, train_y)
    # This is the call that fails with the feature-count error
    predicted_probs = model.predict([train_y])
    fold_accuracy.append(sklearn.metrics.accuracy_score(predicted_probs, test_y))
best_parameters = clf.best_params_
print(best_parameters)
print("Accuracy per fold: ", fold_accuracy, "\n")
print("Average accuracy: ", sum(fold_accuracy) / len(fold_accuracy))
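Solution
In scikit-learn, every model's predict() method takes the input values X, not the target values y (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV.predict). That is why you ran into a feature-count problem when predicting: model.predict([train_y]) hands the model a single row whose "features" are the training labels. Pass the held-out features instead, i.e. model.predict(test_x).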
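
As a rough illustration of the point above, here is a minimal sketch of one way to get a mean AUC per min_samples_leaf value without a manual fold loop. It assumes that letting GridSearchCV run the 10-fold split itself with scoring="roc_auc" is acceptable for the original goal. The synthetic data from make_classification, the shuffle=True setting, and the names search and mean_auc are assumptions made for the example, not part of the question.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption); replace with your own X and y.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# range(2, 11) covers 2 through 10 inclusive.
param_grid = {"min_samples_leaf": list(range(2, 11))}

# Shuffling is an assumption here; it makes it very likely that every
# validation fold contains both classes, which roc_auc needs.
cv = KFold(n_splits=10, shuffle=True, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # score each fold by AUC instead of accuracy
    cv=cv,
)
search.fit(X_train, y_train)

# Mean cross-validated AUC for each candidate value.
mean_auc = {
    params["min_samples_leaf"]: score
    for params, score in zip(
        search.cv_results_["params"], search.cv_results_["mean_test_score"]
    )
}
for leaf, auc in sorted(mean_auc.items()):
    print(f"min_samples_leaf={leaf}: mean AUC={auc:.3f}")
print("Best parameters:", search.best_params_)

# predict() and predict_proba() take features (X), never targets (y).
test_labels = search.predict(X_test)
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print("Held-out AUC:", round(test_auc, 3))
```

With refit left at its default, search behaves like the best estimator after fitting, so search.predict(X_test) uses the tree refitted on all of X_train with the selected min_samples_leaf.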