Mapping - feature importance and label classification

Problem description

I have a dataset of 200 rows related to baking vanilla pound cakes, with 27 features as listed below. The label caketaste measures how good the baked cake is, defined as bad (0), neutral (1) or good (2).

Features = cake_id,flour_g,butter_g,sugar_g,salt_g,eggs_count,bakingpowder_g,milk_ml,water_ml,vanillaextract_ml,lemonzest_g,mixingtime_min,bakingtime_min,preheattime_min,coolingtime_min,bakingtemp_c,preheattemp_c,color_red,color_green,color_blue,traysize_small,traysize_medium,traysize_large,milktype_lowfat,milktype_skim,milktype_whole,trayshape.

Label = caketaste ["bad","neutral","good"]

My tasks are to find:
a) the 5 most important features that influence the label outcome;
b) the values of those 5 most important features that contribute to a "good" classification of the label.

I can solve task a) with sklearn (Python) by fitting the data with RandomForestClassifier() and then using its feature_importances_ attribute to determine the 5 most important features, which turn out to be mixingtime_min, bakingtime_min, preheattemp_c, sugar_g and flour_g.

What method can be used to solve task b)? I am trying to answer the following research question: which values of mixingtime_min, bakingtime_min, flour_g, sugar_g and preheattemp_c statistically contribute to a good caketaste (good: 2)?

Minimal, complete and verifiable example:

```python
#################################################################
# a) Libraries
#################################################################
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MaxAbsScaler
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
import time

#################################################################
# b) Data Loading
#################################################################
df = pd.read_excel("poundcake.xlsx", sheet_name="Sheet0", engine='openpyxl')

#################################################################
# c) Analyzing Dataframe
#################################################################
# Getting dataframe details, e.g. columns, total entries, data types etc.
print("\n<Syntax>: df.info()")
df.info()

# Getting the first 5 rows of the dataframe
print("\n<Syntax>: df.head()")
df.head()

#################################################################
# d) Data Visualization
#################################################################
# Scatter plot: cake_id vs caketaste
fig = plt.figure()
ax = fig.add_axes([0, 0, 1, 1])
ax.scatter(df["cake_id"], df["caketaste"], color='r')
ax.set_xlabel('cake_id')
ax.set_ylabel('caketaste')
ax.set_title('scatter plot')
plt.show()

#################################################################
# e) Feature selection
#################################################################
# Note:
# scikit-learn models cannot work directly with categorical (string) data.
# Categorical variables need to be converted to numeric types before building a model.
categorical_columns = ["trayshape"]
numerical_columns = ["flour_g", "butter_g", "sugar_g", "salt_g", "eggs_count",
                     "bakingpowder_g", "milk_ml", "water_ml", "vanillaextract_ml",
                     "lemonzest_g", "mixingtime_min", "bakingtime_min",
                     "preheattime_min", "coolingtime_min", "bakingtemp_c",
                     "preheattemp_c", "color_red", "color_green", "color_blue",
                     "traysize_small", "traysize_medium", "traysize_large",
                     "milktype_lowfat", "milktype_skim", "milktype_whole"]

#################################################################
# f) Dataset (Train Test Split)
#
#                  (Dataset)
# ┌──────────────────────────┬────────────┐
# │         Training         │    Test    │
# └──────────────────────────┴────────────┘
#################################################################
# Features
X = df[categorical_columns + numerical_columns]

# Prediction target
y = df["caketaste"]

# Break off a test set from the training data. Default: train_size=0.75, test_size=0.25
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=42)

#################################################################
# g) Column Transformer
#################################################################
categorical_encoder = OneHotEncoder(handle_unknown='ignore')

# Mean might not be suitable; remove rows instead?
numerical_pipe = Pipeline([
    ('imp', SimpleImputer(strategy='mean'))
])

preprocessing = ColumnTransformer(
    [('cat', categorical_encoder, categorical_columns),
     ('num', numerical_pipe, numerical_columns)])

#################################################################
# h) Pipelines
#################################################################
# RF: builds multiple decision trees and merges (bagging) them together
# to get a more accurate and stable prediction (averaging).
pipe_xxx_xxx_rfo = Pipeline([
    ('pre', preprocessing),
    ('scl', None),
    ('pca', None),
    ('clf', RandomForestClassifier(random_state=42))
])

pipe_abs_xxx_rfo = Pipeline([
    ('pre', preprocessing),
    ('scl', MaxAbsScaler()),
    ('pca', None),
    ('clf', RandomForestClassifier(random_state=42))
])

#################################################################
# i) Hyper-Parameter Tuning
#################################################################
parameters_rfo = {
    'clf__n_estimators': [100],
    'clf__criterion': ['gini'],
    'clf__min_samples_split': [2, 5],
    'clf__min_samples_leaf': [1, 2]
}

parameters_rfo_bk = {
    'clf__n_estimators': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 1000],
    'clf__criterion': ['gini', 'entropy'],
    'clf__min_samples_split': [5, 10, 15, 25, 30],
    'clf__min_samples_leaf': [1, 2, 3, 4, 5]
}

#################################################################
# j) Grid Search
#################################################################
# scoring can be 'accuracy' or, for MAE, 'neg_mean_absolute_error'
scr = 'accuracy'

grid_xxx_xxx_rfo = GridSearchCV(pipe_xxx_xxx_rfo, param_grid=parameters_rfo, scoring=scr, cv=5, refit=True)
grid_abs_xxx_rfo = GridSearchCV(pipe_abs_xxx_rfo, param_grid=parameters_rfo, scoring=scr, cv=5, refit=True)

print("Pipeline setup.... Complete")

#################################################################
# Machine Learning Models Evaluation Algorithm
#################################################################
grids = [grid_xxx_xxx_rfo, grid_abs_xxx_rfo]
grid_dict = {0: 'RandomForestClassifier',
             1: 'RandomForestClassifier with MaxAbsScaler'}

# Fit the grid search objects
print('Performing model optimizations...\n')

best_test_scr = -999999999999999  # sentinel so the first model is always selected
best_clf = 0
for idx, grid in enumerate(grids):
    start_time = time.time()
    print('*' * 100)
    print('\nEstimator: %s' % grid_dict[idx])
    # Fit grid search
    grid.fit(X_train, y_train)
    # Calculate the scores once and reuse them
    test_scr = grid.score(X_test, y_test)
    train_scr = grid.score(X_train, y_train)
    # Track the model with the highest test score
    if test_scr > best_test_scr:
        best_test_scr = test_scr
        best_rf = grid
        best_clf = idx
        print("..........................this model is better. SELECTED")
    print("Best params       : %s" % grid.best_params_)
    print("Training accuracy : %s" % train_scr)
    print("Test accuracy     : %s" % test_scr)
    print("Modeling time     : %s" % time.strftime("%H:%M:%S", time.gmtime(time.time() - start_time)))

print('\nClassifier with best test set score: %s' % grid_dict[best_clf])

#################################################################
# k) Feature Importance using Gini Importance, a.k.a. Mean
#    Decrease in Impurity (MDI)
# Note:
# 1. Calculates each feature's importance as the sum over the splits (across
#    all trees) that include the feature, proportionally to the number of
#    samples it splits.
# 2. Biased towards high-cardinality (i.e. numerical) features.
#################################################################
ohe = best_rf.best_estimator_.named_steps['pre'].named_transformers_['cat']
feature_names = ohe.get_feature_names(input_features=categorical_columns)  # get_feature_names_out() on scikit-learn >= 1.0
feature_names = np.r_[feature_names, numerical_columns]

tree_feature_importances = best_rf.best_estimator_.named_steps['clf'].feature_importances_
sorted_idx = tree_feature_importances.argsort()

# Figure: Top Features
count = -28
y_ticks = np.arange(0, abs(count))
fig, ax = plt.subplots()
ax.barh(y_ticks[count:], tree_feature_importances[sorted_idx][count:])
ax.set_yticks(y_ticks[count:])
ax.set_yticklabels(feature_names[sorted_idx][count:], fontsize=7)
ax.set_title("Random Forest Feature Importance from Mean Decrease in Impurity (MDI)")
fig.tight_layout()
plt.show()
```
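Since the MDI note above points out that impurity-based importances are biased toward high-cardinality features, the permutation_importance function that is imported (but unused) above could serve as a cross-check. A minimal sketch, assuming the fitted best_rf, X_test and y_test from the code above:

```python
# Sketch only: permutation importance on the held-out test set, as a
# complement to the MDI ranking (assumes best_rf, X_test, y_test above).
perm = permutation_importance(best_rf.best_estimator_, X_test, y_test,
                              n_repeats=10, random_state=42)
perm_sorted_idx = perm.importances_mean.argsort()[::-1]
for col_idx in perm_sorted_idx:
    print(f"{X_test.columns[col_idx]:>20s}: "
          f"{perm.importances_mean[col_idx]:.4f} +/- {perm.importances_std[col_idx]:.4f}")
```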

Possible expected result:

mixingtime_min = [5, 15] AND
bakingtime_min = [50, 51, 52, 53, 54, 55] AND
flour_g = [150, 160, 170, 180] AND
sugar_g = [200, 250] AND
preheattemp_c = [150, 170]

A result like the one above would basically conclude that if someone wants a good-tasting cake, they should bake it with 150-180 g of flour and 200-250 g of sugar, mix the dough for 5-15 minutes, and then bake it for 50-55 minutes in an oven preheated to 150-170 °C.
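As a purely descriptive illustration (not part of the approach above), one could also filter the rows labelled good and inspect the distribution of the top-5 features. A minimal sketch, assuming the df loaded in the example:

```python
# Illustration only: descriptive statistics of the top-5 features for cakes
# rated "good" (caketaste == 2), using the df from the example above.
# (Adjust the filter to == "good" if the label is stored as text.)
top_features = ["mixingtime_min", "bakingtime_min", "flour_g", "sugar_g", "preheattemp_c"]
good_cakes = df[df["caketaste"] == 2]
print(good_cakes[top_features].describe(percentiles=[0.05, 0.25, 0.75, 0.95]))
```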

Any advice or pointers would be much appreciated.

Questions

Could you guide me on how to approach this research question?
Are there any libraries in sklearn, or elsewhere, that can extract this information?
Any additional information such as confidence intervals, outliers, etc. would be a bonus.

Data (poundcake.xlsx):

Solution

A very simple solution would be to run a decision tree classifier on your data and visualize the tree with the Graphviz library; here is the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html. Once the code has generated the dot file, you can also visualize it in WebGraphviz. The outcome of this exercise could well be the range values you expect.
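A minimal sketch of that suggestion, assuming the X_train, y_train, preprocessing, categorical_columns and numerical_columns from the question's code (the tree depth and file name are arbitrary choices for illustration):

```python
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# One-hot encode / impute the features, then fit a shallow tree so the
# resulting rules stay readable.
X_enc = preprocessing.fit_transform(X_train)
ohe = preprocessing.named_transformers_['cat']
feat_names = list(ohe.get_feature_names(input_features=categorical_columns)) + numerical_columns
# (use get_feature_names_out() on scikit-learn >= 1.0)

tree_clf = DecisionTreeClassifier(max_depth=4, random_state=42)
tree_clf.fit(X_enc, y_train)

# Export the tree to a dot file; render it with Graphviz or paste the file
# contents into WebGraphviz. The split thresholds along paths ending in
# "good" leaves give candidate value ranges for the important features.
export_graphviz(
    tree_clf,
    out_file="caketaste_tree.dot",
    feature_names=feat_names,
    class_names=["bad", "neutral", "good"],
    filled=True,
    rounded=True,
)
```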