XGBoost plot_importance shows F score values > 100

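Question

I have plotted the XGBoost feature importance for all the features in my model, as shown in the figure below. As you can see, the F score values in the plot are not normalized (they are not in the range 0 to 100). Please let me know if you have any idea why. Do I need to pass a parameter to the plot_importance function to normalize them?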

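Answer

The feature importance that plot_importance draws is determined by its importance_type parameter, which defaults to weight. There are three options: weight, gain and cover. None of them is a percentage, though.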

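From the documentation for this method:

importance_type (str, default "weight") – How the importance is calculated: either "weight", "gain", or "cover"

"weight" is the number of times a feature appears in a tree
"gain" is the average gain of splits which use the feature
"cover" is the average coverage of splits which use the feature, where coverage is defined as the number of samples affected by the split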
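To make the three definitions concrete, here is a minimal sketch in plain Python. It does not call XGBoost at all: the split records and the importance helper are hypothetical, made up only to show why none of these scores is bounded by 100 or sums to a fixed total.

```python
from collections import defaultdict

# Hypothetical split records (feature, gain, cover). In a real model these
# would come from the trained trees; the numbers here are invented.
splits = [
    ("f0", 4.0, 10.0),
    ("f0", 2.0, 30.0),
    ("f1", 9.0, 50.0),
]


def importance(splits, importance_type="weight"):
    """Return {feature: score} following the quoted definitions."""
    if importance_type not in ("weight", "gain", "cover"):
        raise ValueError(f"unknown importance_type: {importance_type!r}")

    counts = defaultdict(int)
    gain_sum = defaultdict(float)
    cover_sum = defaultdict(float)
    for feature, gain, cover in splits:
        counts[feature] += 1
        gain_sum[feature] += gain
        cover_sum[feature] += cover

    if importance_type == "weight":
        # Number of times the feature is used in a split.
        return dict(counts)

    # "gain" and "cover" are averages over the splits that use the feature.
    totals = gain_sum if importance_type == "gain" else cover_sum
    return {feature: totals[feature] / counts[feature] for feature in counts}


weight_scores = importance(splits, "weight")
gain_scores = importance(splits, "gain")
cover_scores = importance(splits, "cover")
```

With weight, a feature used in hundreds of splits simply gets a score in the hundreds, which is how the plot ends up with values above 100.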

So, long story short: there is no trivial solution for what you want.

Workaround

The model's feature_importances_ attribute is already normalized the way you want. You can plot it yourself, but it will be a handcrafted chart.
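As far as I can tell, that normalization amounts to dividing each raw score by the total of all scores. The sketch below shows the arithmetic in plain Python. The normalize helper and the raw counts are hypothetical; the counts were chosen so that the resulting fractions match the toy array shown further down.

```python
def normalize(scores, scale=1.0):
    """Rescale {feature: raw_score} so the values sum to `scale`."""
    total = sum(scores.values())
    if total == 0:
        # No splits at all: avoid dividing by zero.
        return {name: 0.0 for name in scores}
    return {name: scale * value / total for name, value in scores.items()}


# Hypothetical raw "weight" counts. With a fitted model, a dict of this
# shape can be obtained from
# best_model.get_booster().get_score(importance_type="weight").
raw = {"f0": 204, "f1": 182, "f2": 274, "f3": 290}

fractions = normalize(raw)          # values sum to 1
percentages = normalize(raw, 100)   # values sum to 100
```

The second call gives the 0 to 100 scale asked about in the question. To my knowledge plot_importance also accepts a plain dict of scores in place of a model, so a normalized dict like this might be passed to it directly, but check that against your installed version.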

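First, make sure you set the classifier's importance_type parameter to one of the options listed above (the constructor's default is gain, so if you don't change it you will see a discrepancy with what plot_importance draws).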

import xgboost as xgb

best_model = xgb.XGBClassifier(importance_type='weight')

After that, you can try something along these lines:

import pandas as pd

# The model must already be fitted before this attribute is available
best_model.feature_importances_
# In my toy example: array([0.21473685, 0.19157895, 0.28842106, 0.30526316], dtype=float32)

best_model.feature_importances_.sum()
#  1.0

# Build a simple dataframe with the feature importances
# You can change the naming fN to something more human readable
fs = len(best_model.feature_importances_)
df = pd.DataFrame(
    list(zip([f"f{n}" for n in range(fs)], best_model.feature_importances_)),
    columns=['Features', 'Feature Importance'],
)
df = df.set_index('Features').sort_values('Feature Importance')

# Build horizontal bar chart
ax = df.plot.barh(color='red', alpha=0.5, grid=True, legend=False, title='Feature importance', figsize=(15, 5))

# Annotate bar chart, adapted from this SO answer:
# https://stackoverflow.com/questions/25447700/annotate-bars-with-values-on-pandas-bar-plots
for p, value in zip(ax.patches, df['Feature Importance']):
    # Place the label, formatted to two decimals, just past the end of each bar
    ax.annotate(f"{value:.2f}", (p.get_width() * 1.005, p.get_y() * 1.005))

With this approach I get a chart like the following, which is quite close to the original one:
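If you would rather label the bars with percentages and real column names than with fractions and f0..fN, something like the following could prepare the labels. It is only a sketch: percent_labels is a hypothetical helper, the column names are invented, and the importances are the toy values from the snippet above.

```python
def percent_labels(names, importances, digits=1):
    """Pair names with importances, sort ascending, format as percentages."""
    if len(names) != len(importances):
        raise ValueError("names and importances must have the same length")
    pairs = sorted(zip(names, importances), key=lambda pair: pair[1])
    return [(name, f"{100 * value:.{digits}f}%") for name, value in pairs]


# Hypothetical column names, in the same order as the model's features.
names = ["age", "income", "tenure", "balance"]
# Toy importances copied from the example above.
importances = [0.21473685, 0.19157895, 0.28842106, 0.30526316]

rows = percent_labels(names, importances)
```

The ascending order matches sort_values('Feature Importance'), so the formatted strings line up with the bars in the same order they are drawn.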

scikit-learn xgbclassifier xgboost