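Problem description
I have a dataset (200 rows) about baking vanilla pound cake, with 27 features as listed below. The label caketaste measures how good the baked cake is and takes the values bad (0), neutral (1) and good (2).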
Features = cake_id,flour_g,butter_g,sugar_g,salt_g,eggs_count,bakingpowder_g,milk_ml,water_ml,vanillaextract_ml,lemonzest_g,mixingtime_min,bakingtime_min,preheattime_min,coolingtime_min,bakingtemp_c,preheattemp_c,color_red,color_green,color_blue,traysize_small,traysize_medium,traysize_large,milktype_lowfat,milktype_skim,milktype_whole,trayshape.
Label = caketaste ["bad","neutral","good"]
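My task is to find:
a) the 5 most important features that influence the label outcome;
b) the values of those 5 most important features that contribute to a "good" classification of the label.
I can solve a) with sklearn (Python) by fitting the data with RandomForestClassifier() and then reading its feature_importances_ attribute to determine the 5 most important features, namely mixingtime_min, bakingtime_min, preheattemp_c, sugar_g and flour_g.
What approach can be used to solve task b)? The research question I am trying to answer is stated after the example.
Minimal, complete and verifiable example: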
#################################################################
# a) Libraries
#################################################################
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MaxAbsScaler
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
import time
#################################################################
# b) Data Loading
#################################################################
df = pd.read_excel("poundcake.xlsx",sheet_name="Sheet0",engine='openpyxl')
#################################################################
# c) Analyzing Dataframe
#################################################################
#Getting dataframe details e.g columns,total entries,data types etc
print("\n<Syntax>: df.info()")
df.info()
#Getting the 1st 5 lines in the dataframe
print("\n<Syntax>: df.head()")
print(df.head()) #print explicitly: a bare df.head() shows nothing when run as a script
#################################################################
# d) Data Visualization
#################################################################
#Scatterplot cake_id vs caketaste
fig=plt.figure()
ax=fig.add_axes([0,0,1,1]) #[left,bottom,width,height] as fractions of the figure
ax.scatter(df["cake_id"],df["caketaste"],color='r')
ax.set_xlabel('cake_id')
ax.set_ylabel('caketaste')
ax.set_title('scatter plot')
plt.show()
#################################################################
# e) Feature selection
#################################################################
#Note:
#Machine learning models cannot work well with categorical (string) data,specifically scikit-learn.
#Need to convert the categorical variables into numeric types before building a machine learning model.
categorical_columns = ["trayshape"]
numerical_columns = ["flour_g","butter_g","sugar_g","salt_g","eggs_count","bakingpowder_g","milk_ml","water_ml","vanillaextract_ml","lemonzest_g","mixingtime_min","bakingtime_min","preheattime_min","coolingtime_min","bakingtemp_c","preheattemp_c","color_red","color_green","color_blue","traysize_small","traysize_medium","traysize_large","milktype_lowfat","milktype_skim","milktype_whole"]
#################################################################
# f) Dataset (Train Test Split)
#
# (Dataset)
# ┌──────────────────────────────────────────┐
# ┌──────────────────────────┬────────────┐
# | Training │ Test │
# └──────────────────────────┴────────────┘
#################################################################
# Features - all categorical and numerical columns
X = df[categorical_columns + numerical_columns]
# Prediction target
y = df["caketaste"]
# Break off test set from training data. sklearn default: train_size=0.75,test_size=0.25; here 0.8/0.2
X_train,X_test,y_train,y_test = train_test_split(X,y,train_size=0.8,test_size=0.2,random_state=42)
#################################################################
# Pipeline
#################################################################
#######################
# g) Column Transformer
#######################
categorical_encoder = OneHotEncoder(handle_unknown='ignore')
#Mean might not be suitable,Remove rows?
numerical_pipe = Pipeline([
('imp',SimpleImputer(strategy='mean'))
])
preprocessing = ColumnTransformer(
[('cat',categorical_encoder,categorical_columns),('num',numerical_pipe,numerical_columns)])
#####################
# Pipeline definitions
#####################
#RF: builds multiple decision trees and merges (bagging) them together
#to get a more accurate and stable prediction (averaging).
#A step set to None is skipped (no scaling / no PCA).
pipe_xxx_xxx_rfo = Pipeline([
    ('pre',preprocessing),('scl',None),('pca',None),('clf',RandomForestClassifier(random_state=42))
])
#Assumed reconstruction of the truncated original: same steps,with MaxAbsScaler as the scaler.
pipe_abs_xxx_rfo = Pipeline([
    ('pre',preprocessing),('scl',MaxAbsScaler()),('pca',None),('clf',RandomForestClassifier(random_state=42))
])
#################################################################
# h) Hyper-Parameter Tuning
#################################################################
parameters_rfo = {
'clf__n_estimators':[100],'clf__criterion':['gini'],'clf__min_samples_split':[2,5],'clf__min_samples_leaf':[1,2]
}
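#Backup (wider) grid,not used below. The original line was truncated: the
#'clf__min_samples_leaf' key and its leading value 1 are assumed by analogy with parameters_rfo.
parameters_rfo_bk = {
    'clf__n_estimators':[10,20,30,40,50,60,70,80,90,100,1000],'clf__criterion':['gini','entropy'],'clf__min_samples_split':[5,10,15,25,30],'clf__min_samples_leaf':[1,2,3,4,5]
}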
#########################
# i) GridSearch setup
#########################
# scoring can be used as 'accuracy' or for MAE use 'neg_mean_absolute_error'
scr='accuracy'
grid_xxx_xxx_rfo = GridSearchCV(pipe_xxx_xxx_rfo,param_grid=parameters_rfo,scoring=scr,cv=5,refit=True)
#param_grid is required; the same grid,scoring and cv as above are assumed here
grid_abs_xxx_rfo = GridSearchCV(pipe_abs_xxx_rfo,param_grid=parameters_rfo,scoring=scr,cv=5,refit=True)
print("Pipeline setup.... Complete")
###################################################
# Machine Learning Models Evaluation Algorithm
###################################################
grids = [grid_xxx_xxx_rfo,grid_abs_xxx_rfo]
grid_dict = {0: 'RandomForestClassifier',1: 'RandomForestClassifier with MaxAbsScaler'}
# Fit the grid search objects
print('Performing model optimizations...\n')
best_test_scr = float("-inf") #lower than any achievable accuracy
best_clf = 0
best_rf = None
for idx,grid in enumerate(grids):
    start_time = time.time()
    print('*' * 100)
    print('\nestimator: %s' % grid_dict[idx])
    # Fit grid search
    grid.fit(X_train,y_train)
    #Calculate the score once and use when needed
    test_scr = grid.score(X_test,y_test)
    train_scr = grid.score(X_train,y_train)
    # Track best (highest test accuracy) model
    if test_scr > best_test_scr:
        best_test_scr = test_scr
        best_train_scr = train_scr
        best_rf = grid
        best_clf = idx
        print("..........................this model is better. SELECTED")
    #Report the scores of the model fitted in this iteration
    print("Best params : %s" % grid.best_params_)
    print("Training accuracy : %s" % train_scr)
    print("Test accuracy : %s" % test_scr)
    print("Modeling time : %s" % time.strftime("%H:%M:%S",time.gmtime(time.time() - start_time)))
print('\nClassifier with best test set score: %s' % grid_dict[best_clf])
#########################################################################################
# j) Feature Importance using Gini Importance or Mean Decrease in Impurity (MDI)
# Note:
# 1. Calculates each feature importance as the sum over the number of splits (across
#    all trees) that include the feature,proportionally to the number of samples it splits.
# 2. Biased towards high-cardinality features,e.g. numerical variables
#########################################################################################
ohe = (best_rf.best_estimator_.named_steps['pre'].named_transformers_['cat'])
#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = ohe.get_feature_names_out(categorical_columns)
feature_names = np.r_[feature_names,numerical_columns]
tree_feature_importances = (best_rf.best_estimator_.named_steps['clf'].feature_importances_)
sorted_idx = tree_feature_importances.argsort()
# figure: Top Features
count=-len(feature_names) #negative count = number of top features to plot (here: all; use -5 for the top 5)
y_ticks = np.arange(0,abs(count))
fig,ax = plt.subplots()
ax.barh(y_ticks,tree_feature_importances[sorted_idx][count:])
#Set tick positions first,then their labels
ax.set_yticks(y_ticks)
ax.set_yticklabels(feature_names[sorted_idx][count:],fontsize=7)
ax.set_title("Random Forest Tree's Feature Importance from Mean Decrease in Impurity (MDI)")
fig.tight_layout()
plt.show()
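The note in section j) flags that MDI is biased towards high-cardinality features. One common cross-check is permutation importance on held-out data, which the script imports but never calls. The following is only a minimal sketch on synthetic data, not on the pound cake file: `X_demo`, `y_demo` and `demo_names` are made-up stand-ins in which one column drives the label and two columns are pure noise.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 400
# Synthetic stand-in (assumption): column 0 decides the label, columns 1-2 are noise.
X_demo = rng.normal(size=(n, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)
demo_names = np.array(["signal", "noise_a", "noise_b"])  # hypothetical names

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(Xtr, ytr)

# Shuffle one column at a time on the held-out set and record the drop in accuracy.
result = permutation_importance(
    forest, Xte, yte, n_repeats=10, random_state=42, scoring="accuracy"
)

# Rank features from most to least important by mean accuracy drop.
order = result.importances_mean.argsort()[::-1]
for i in order:
    print(f"{demo_names[i]:<8} {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```

With the pipeline from the example, the same call would presumably take `best_rf.best_estimator_`, `X_test` and `y_test`; importances are then reported per input column of `X_test` rather than per one-hot encoded column.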
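Research question: what ranges of values of mixingtime_min, bakingtime_min, flour_g, sugar_g and preheattemp_c contribute, statistically, to a good caketaste (good: 2)?
Possible expected result:
mixingtime_min = [5,15] AND
bakingtime_min = [50,51,52,53,54,55] AND
flour_g = [150,160,170,180] AND
sugar_g = [200,250] AND
preheattemp_c = [150,170]
The result above basically concludes that anyone who wants a good-tasting cake needs to bake it with 150-180 g of flour and 200-250 g of sugar, mix the dough for 5-15 minutes, and then bake it for 50-55 minutes in an oven preheated to 150-170 °C.
I hope you can advise.
Questions
Could you guide me on how to approach this research question?
Is there a library in sklearn, or any other library, that can provide this information?
Any additional information, such as confidence intervals or outliers, would be a bonus.
Data (poundcake.xlsx):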
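Solution
A very simple approach is to fit a decision tree classifier on your data and visualize the tree with the Graphviz library; see the documentation at https://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html. You can also take the dot file that the code produces and visualize it in WebGraphviz. The split thresholds along the paths that end in a "good" leaf may give you the value ranges you are looking for.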
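To make the decision-tree suggestion concrete, here is a minimal sketch. It does not use poundcake.xlsx: the `demo` frame and its labelling rule are invented for illustration, so the printed thresholds say nothing about real cakes. It uses `export_text` to print the rules as plain text, and `export_graphviz` with `out_file=None` to get the dot source as a string that can be pasted into WebGraphviz.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_graphviz, export_text

rng = np.random.default_rng(0)
n = 200
top5 = ["mixingtime_min", "bakingtime_min", "flour_g", "sugar_g", "preheattemp_c"]

# Synthetic stand-in for the five top features (assumption, not the real data).
demo = pd.DataFrame({
    "mixingtime_min": rng.integers(1, 31, n),
    "bakingtime_min": rng.integers(40, 71, n),
    "flour_g": rng.integers(100, 251, n),
    "sugar_g": rng.integers(150, 301, n),
    "preheattemp_c": rng.integers(120, 221, n),
})

# Hypothetical labelling rule, used only so the tree has a pattern to recover.
good = demo["mixingtime_min"].between(5, 15) & demo["bakingtime_min"].between(50, 55)
bad = ~good & (demo["preheattemp_c"] > 200)
demo["caketaste"] = np.where(good, 2, np.where(bad, 0, 1))

# Fit an unrestricted tree; each leaf is a box of feature ranges with one class.
tree = DecisionTreeClassifier(random_state=0).fit(demo[top5], demo["caketaste"])

# Text rules: follow the branches that end in "class: 2" to read off the ranges.
rules = export_text(tree, feature_names=top5, max_depth=tree.get_depth())
print(rules)

# Dot source as a string, suitable for Graphviz or WebGraphviz.
dot_source = export_graphviz(
    tree, out_file=None, feature_names=top5,
    class_names=["bad", "neutral", "good"], filled=True,
)
```

On real, noisy data it is probably worth capping `max_depth` or raising `min_samples_leaf` so that the rules stay readable. The thresholds describe one greedily grown tree; they are not confidence intervals.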