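Problem description
I am trying to scale my data within the cross-validation folds of an MLENS Super Learner pipeline. When I use StandardScaler in the pipeline (as shown below), I receive the following warning:
/miniconda3/envs/r_env/lib/python3.7/site-packages/mlens/parallel/_base_functions.py:226: MetricWarning: [pipeline-1.mlpclassifier.0.2] Could not score pipeline-1.mlpclassifier. Details: ValueError("Classification metrics can't handle a mix of binary and continuous-multioutput targets") (name, inst_name, exc), MetricWarning)
Note that when I omit StandardScaler(), the warning disappears, but the data is not scaled.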
```python
from sklearn.datasets import load_breast_cancer

breast_cancer_data = load_breast_cancer()
X = breast_cancer_data['data']
y = breast_cancer_data['target']

from sklearn.model_selection import train_test_split
X, X_val, y, y_val = train_test_split(X, y, test_size=.3, random_state=0)

from sklearn.base import BaseEstimator

class RFBasedFeatureSelector(BaseEstimator):
    def __init__(self, n_estimators):
        self.n_estimators = n_estimators
        self.selector = None

    def fit(self, X, y):
        # Fit a random forest, then keep features above the importance threshold
        clf = RandomForestClassifier(n_estimators=self.n_estimators, random_state=RANDOM_STATE, class_weight='balanced')
        clf = clf.fit(X, y)
        self.selector = SelectFromModel(clf, prefit=True, threshold=0.001)
        return self

    def transform(self, X):
        if self.selector is None:
            raise AttributeError('The selector attribute has not been assigned. You cannot call transform before first calling fit or fit_transform.')
        return self.selector.transform(X)

    def fit_transform(self, X, y):
        self.fit(X, y)
        return self.transform(X)

N_FOLDS = 5
RF_ESTIMATORS = 1000
N_ESTIMATORS = 1000
RANDOM_STATE = 42

from mlens.metrics import make_scorer
from sklearn.metrics import roc_auc_score, balanced_accuracy_score
# Defined here but not used below; the ensemble is given balanced_accuracy_score directly
accuracy_scorer = make_scorer(balanced_accuracy_score, average='micro', greater_is_better=True)

from mlens.ensemble.super_learner import SuperLearner
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel

ensemble = SuperLearner(folds=N_FOLDS, shuffle=True, random_state=RANDOM_STATE, n_jobs=10, scorer=balanced_accuracy_score, backend="multiprocessing")

preprocessing1 = {'pipeline-1': [StandardScaler()]}
preprocessing2 = {'pipeline-1': [RFBasedFeatureSelector(N_ESTIMATORS)]}

estimators = {
    'pipeline-1': [
        RandomForestClassifier(n_estimators=RF_ESTIMATORS, class_weight='balanced'),
        MLPClassifier(hidden_layer_sizes=(10, 10, 10), activation='relu', solver='sgd', max_iter=5000),
    ]
}

ensemble.add(estimators, preprocessing2, preprocessing1)
ensemble.add_meta(LogisticRegression(solver='liblinear', class_weight='balanced'))
ensemble.fit(X, y)
yhat = ensemble.predict(X_val)
balanced_accuracy_score(y_val, yhat)
```
>Error text: /miniconda3/envs/r_env/lib/python3.7/site-packages/mlens/parallel/_base_functions.py:226: MetricWarning: [pipeline-1.mlpclassifier.0.2] Could not score pipeline-1.mlpclassifier. Details:
ValueError("Classification metrics can't handle a mix of binary and continuous-multioutput targets")
(name, inst_name, exc), MetricWarning)
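Solution
In your call to the add method, you are currently passing the preprocessing steps as two separate arguments. add takes a single preprocessing argument, so the second dictionary is bound to a different parameter instead of being applied as a preprocessing step. You can combine the steps into one list instead: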
preprocessing = {'pipeline-1': [RFBasedFeatureSelector(N_ESTIMATORS), StandardScaler()]}
ensemble.add(estimators, preprocessing)
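To illustrate why two separate dictionaries misbehave, here is a minimal sketch that does not use mlens at all. `add_layer` is a hypothetical stand-in, and its parameter order (`estimators`, `preprocessing`, `proba`) is an assumption modelled on my reading of the mlens 0.2.x `add` signature, so please check it against your installed version. The placeholder strings stand in for the real transformers and estimators.

```python
def add_layer(estimators, preprocessing=None, proba=False):
    """Hypothetical stand-in for a layer-adding method (not mlens code).

    It only reports how its arguments were bound.
    """
    return {
        "preprocessing": preprocessing,
        # Any non-empty dict is truthy, so a stray positional argument
        # landing here would behave like proba=True.
        "proba": bool(proba),
    }


# Placeholder names instead of real transformer/estimator objects.
selector_steps = {"pipeline-1": ["feature_selector"]}
scaler_steps = {"pipeline-1": ["standard_scaler"]}
estimators = {"pipeline-1": ["random_forest", "mlp"]}

# Two separate dictionaries: the second one is bound to `proba`,
# so the scaler never becomes a preprocessing step.
separate = add_layer(estimators, selector_steps, scaler_steps)

# One combined dictionary: both steps run in order for 'pipeline-1'.
combined_steps = {
    "pipeline-1": selector_steps["pipeline-1"] + scaler_steps["pipeline-1"]
}
combined = add_layer(estimators, combined_steps)

print(separate)
print(combined)
```

If that assumption about the signature holds, it would also explain the warning: the base learners would emit class probabilities, which a label-based metric such as balanced_accuracy_score cannot score.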
See the documentation for the add method here: https://mlens.readthedocs.io/en/0.1.x/source/mlens.ensemble.super_learner/
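For context, the ValueError quoted in the warning comes from scikit-learn's classification metrics, which raise it when the true labels are binary but the predictions are a 2-D array of continuous values such as class probabilities. The following is a minimal sketch, assuming scikit-learn and NumPy are installed; the labels and probabilities are made up for illustration and are not taken from the breast cancer data.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

# Made-up binary ground truth.
y_true = np.array([0, 1, 1, 0])

# Hard class labels: this is what the metric expects.
y_labels = np.array([0, 1, 0, 0])

# Per-class probabilities: a 2-D float array, which scikit-learn
# classifies as a "continuous-multioutput" target.
y_proba = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
    [0.7, 0.3],
])

# Scoring hard labels works.
label_score = balanced_accuracy_score(y_true, y_labels)

# Scoring probabilities raises the ValueError seen in the warning.
try:
    balanced_accuracy_score(y_true, y_proba)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(label_score)
print(error_message)
```

So if you see this message from an ensemble scorer, it is worth checking whether the layer is producing probabilities rather than class labels.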