Model evaluation error with cross-validation - average_precision_score

Problem description

I ran the following random forest grid search with balanced_accuracy as the scoring metric:

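from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score,f1_score,precision_score,recall_score

# define the parameter grid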
param_grid = [
        {'criterion': ['gini','entropy'],# try different purity metrics in building the trees
         'max_depth': [2,5,8,10,15,20],# vary the max_depth of the trees in the ensemble
        'n_estimators': [10,50,100,200],# vary the number of trees in the ensemble
        'max_samples': [0.4,0.7,0.9]}     # vary how many samples each tree is built with
]

# setup the Random Forest model with all arguments as default
model = RandomForestClassifier()

# pass the model and the param_grid to the grid search,and use 5 folds with 'balanced_accuracy' as the scoring measure
grid_search = GridSearchCV(model,param_grid,cv = 5,scoring = 'balanced_accuracy')

# fit the grid search to the training set
grid_search.fit(X_smote,y_smote)

# return best model
rf_best = grid_search.best_estimator_

# return the hyperparameter values of the best model
print(grid_search.best_params_)

# use the best model to make predictions on the test set
y_pred = rf_best.predict(X_test)

# compute the test set accuracy of the best model
print("accuracy: ",accuracy_score(y_test,y_pred))
print("f1: ",f1_score(y_test,y_pred,pos_label='Listed'))
print("precision: ",precision_score(y_test,y_pred,pos_label='Listed'))
print("recall: ",recall_score(y_test,y_pred,pos_label='Listed'))

This produces the following scores:


{'criterion': 'gini','max_depth': 20,'max_samples': 0.7,'n_estimators': 100}
accuracy:  0.6547231270358306
f1:  0.7612612612612613
precision:  0.9260273972602739
recall:  0.6462715105162524

I want to use the average_precision scoring parameter instead, because it suits my use case better, so I updated the code to the following:

from sklearn.metrics import average_precision_score
# define the parameter grid
param_grid = [
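        {'criterion': ['gini', ...
# [listing truncated in the source; it resumes at the scoring argument]
... scoring = 'average_precision')

# fit the grid search to the training set
grid_search.fit(X_smote, ...
# [listing truncated in the source; it ends with the fragment below]
... pos_label='Listed'))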

But I get the following error:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\metrics\_ranking.py in average_precision_score(y_true,y_score,average,pos_label,sample_weight)
    211         if len(present_labels) == 2 and pos_label not in present_labels:
    212             raise ValueError("pos_label=%r is invalid. Set it to a label in "
--> 213                              "y_true." % pos_label)
    214     average_precision = partial(_binary_uninterpolated_average_precision,
    215                                 pos_label=pos_label)

ValueError: pos_label=1 is invalid. Set it to a label in y_true.

Why can't I use average_precision in my code the same way I used balanced_accuracy? Is there something else I should be doing?

Solution

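I don't know what your dataset looks like, so I can't say exactly where the error in your code comes from. The question also contains a lot of code that isn't needed to reproduce the problem.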
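One plausible reading of the traceback is that the target holds string labels such as 'Listed', while average_precision_score defaults to pos_label=1, which is then not among the labels in y_true. The sketch below reproduces that message on made-up data and shows two ways around it. The 'Delisted' class name and the scores are invented for illustration and are not taken from the question.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Hypothetical data: 'Listed' comes from the question, 'Delisted' is an
# invented name for the other class, and the scores are made up.
y_true = np.array(['Listed', 'Delisted', 'Listed', 'Listed', 'Delisted'])
y_score = np.array([0.9, 0.2, 0.7, 0.3, 0.4])  # pretend scores for 'Listed'

# With string labels and the default pos_label=1, the metric refuses to run.
try:
    average_precision_score(y_true, y_score)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Way 1: tell the metric which label is the positive class.
ap_named = average_precision_score(y_true, y_score, pos_label='Listed')

# Way 2: encode the target as 0/1 so the default pos_label=1 is valid.
y_binary = (y_true == 'Listed').astype(int)
ap_binary = average_precision_score(y_binary, y_score)

print(ap_named, ap_binary)
```

Both ways should give the same value, since they describe the same positive class.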

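If the goal is to use the average precision score as you describe, you can use make_scorer. Assuming your labels are binary 0/1, it works as in the following example: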

from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = [
        {'criterion': ['gini','entropy'],'max_depth': [2,5],'n_estimators': [200],'max_samples': [0.8]}]


X,y = make_blobs(n_samples=[80,20],centers=None,n_features=5,cluster_std = 3.5,random_state=0)     

model = RandomForestClassifier(random_state=42)
grid_search_acc = GridSearchCV(model,param_grid,cv = 5,scoring = 'balanced_accuracy')

grid_search_acc.fit(X,y)

grid_search_acc.best_score_
0.75625

Balanced accuracy works. To make the same search work with average precision:

from sklearn.metrics import average_precision_score,make_scorer
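# wrap average_precision_score as a scorer; pos_label=1 marks the positive class
ap_score = make_scorer(average_precision_score,greater_is_better=True,pos_label=1)

grid_search_prec = GridSearchCV(model,param_grid,cv = 5,scoring = ap_score)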
grid_search_prec.fit(X,y)

grid_search_prec.best_score_
0.9333333333333332
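If the real target uses string labels, as pos_label='Listed' in the question suggests, one option is to encode it as 0/1 before the search and then pass the built-in 'average_precision' scoring string. The following is a minimal sketch on the same synthetic blobs, not a drop-in fix for the original data: the 'Delisted' class name is invented, and n_estimators is reduced to 20 only to keep the run short.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real data (80 negatives, 20 positives).
X, y_int = make_blobs(n_samples=[80, 20], centers=None, n_features=5,
                      cluster_std=3.5, random_state=0)

# Hypothetical string target: 'Listed' is from the question,
# 'Delisted' is an invented name for the other class.
y_str = np.where(y_int == 1, 'Listed', 'Delisted')

# Encode the positive class as 1 so the default pos_label=1 is valid.
y = (y_str == 'Listed').astype(int)

param_grid = [{'criterion': ['gini', 'entropy'],
               'max_depth': [2, 5],
               'n_estimators': [20],   # assumption: kept small for speed
               'max_samples': [0.8]}]

model = RandomForestClassifier(random_state=42)
grid_search_ap = GridSearchCV(model, param_grid, cv=5,
                              scoring='average_precision')
grid_search_ap.fit(X, y)

best_ap = grid_search_ap.best_score_
print(grid_search_ap.best_params_, best_ap)
```

One difference worth knowing: the 'average_precision' string scores the model from its predicted probabilities or decision values, whereas make_scorer with default settings passes the hard predict() output to the metric, so the two approaches can report different numbers on the same model.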