Problem description
I am working with a Kaggle dataset, available at the link below.
https://www.kaggle.com/anmolkumar/health-insurance-cross-sell-prediction
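Given the class imbalance in the data, I am running a balanced random forest classifier. However, the code below gives me 100% accuracy, recall, and precision, so something must be wrong.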
import pandas as pd
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# import data
path = 'raw Data/'
df_train = pd.read_csv(path + 'train.csv')
df_train.head(3)
# Separate features from label
X = df_train.drop(columns=['Response'])
y = df_train['Response']
# Get dummy variables
X = pd.get_dummies(df_train, columns=['Gender', 'Region_Code', 'Vehicle_Age', 'Vehicle_Damage', 'Policy_Sales_Channel'], drop_first=True)
# Split data: 80% train, 20% test, stratified on the label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
# Run model
brfc = BalancedRandomForestClassifier(n_estimators=500,random_state=0).fit(X_train,y_train)
print("F1 score for Balanced Random Forest Classifier is ",f1_score(y_test,brfc.predict(X_test)))
Solution
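One likely cause, offered as a diagnosis rather than a confirmed fix: the `pd.get_dummies` call in the question is given `df_train` rather than `X`, so the `Response` column that was just dropped ends up back in the feature matrix and the forest can read the label directly. The sketch below reproduces this on a tiny made-up frame (the values are invented and only a few of the dataset's columns are used) and shows the encoding applied to `X` instead.

```python
import pandas as pd

# Tiny made-up frame standing in for train.csv (values are illustrative only).
df_train = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Vehicle_Age": ["< 1 Year", "1-2 Year", "> 2 Years", "1-2 Year"],
    "Annual_Premium": [2630.0, 33536.0, 38294.0, 28619.0],
    "Response": [1, 0, 1, 0],
})
categorical = ["Gender", "Vehicle_Age"]

# Separate features from label, as in the question.
X = df_train.drop(columns=["Response"])
y = df_train["Response"]

# As in the question: encoding df_train brings the label back into the features.
X_leaky = pd.get_dummies(df_train, columns=categorical, drop_first=True)

# Encoding X keeps the label out of the feature matrix.
X_clean = pd.get_dummies(X, columns=categorical, drop_first=True)

leaky_has_label = "Response" in X_leaky.columns
clean_has_label = "Response" in X_clean.columns
print(leaky_has_label, clean_has_label)  # True False
```

If this is indeed the cause, passing `X` to `pd.get_dummies` in the original script should bring the scores down to something more plausible. The `id` column looks like a plain row identifier, so it is probably also worth dropping before fitting.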