Problem description
I am trying to run a classification model with XGBoost inside the Tidymodels framework. I can already run the model and the results are acceptable. As part of an improvement effort, I am trying PCA to get better results. I engineered some features, which produced the following df:
A tibble: 6 x 32
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 PAY_6
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 90000 2 3 3 3 1 1 1 1 1 1
2 4 50000 2 3 2 4 1 1 1 1 1 1
3 6 50000 1 2 3 4 1 1 1 1 1 1
4 8 100000 2 3 3 1 1 1 1 1 1 1
5 10 20000 1 4 3 4 1 1 1 1 1 1
6 11 200000 2 4 3 3 1 1 2 1 1 1
# … with 20 more variables: BILL_AMT1 <dbl>, BILL_AMT2 <dbl>, BILL_AMT3 <dbl>,
#   BILL_AMT4 <dbl>, BILL_AMT5 <dbl>, BILL_AMT6 <dbl>, PAY_AMT1 <chr>, PAY_AMT2 <chr>,
#   PAY_AMT3 <chr>, PAY_AMT4 <chr>, PAY_AMT5 <chr>, PAY_AMT6 <chr>, default <fct>,
#   PAY_AMT <dbl>, lim_bal1 <dbl>, lim_bal2 <dbl>, lim_bal3 <dbl>, lim_bal4 <dbl>,
#   lim_bal5 <dbl>, lim_bal6 <dbl>
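Side note on the data: PAY_AMT1 through PAY_AMT6 come through as <chr>. They are dropped by the recipe below, but if they were ever to be kept as predictors they would need converting first; a minimal sketch, assuming dplyr is loaded:

```r
# convert the character PAY_AMT* columns to numeric (sketch; the derived
# PAY_AMT <dbl> column is unaffected by a second as.numeric())
cr_tr <- cr_tr %>%
  mutate(across(starts_with("PAY_AMT"), as.numeric))
```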
This training set is then used in the following recipe:
mas_rec <- recipe(default ~ ., data = cr_tr) %>%
  step_select(-c(PAY_AMT1, PAY_AMT2, PAY_AMT3, PAY_AMT4,
                 PAY_AMT5, PAY_AMT6, ID)) %>%
  step_impute_bag(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_predictors())
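For context, here is a variant of the same recipe with the PCA step spelled out. num_comp = 5 is step_pca()'s default (which matches the five PCs in the juiced output further down); using all_numeric_predictors() in that step is my assumption, since svd() needs numeric input. A sketch, not a confirmed fix:

```r
# variant of the recipe above: only numeric predictors enter the PCA step,
# and the number of retained components is made explicit
mas_rec <- recipe(default ~ ., data = cr_tr) %>%
  step_select(-c(PAY_AMT1, PAY_AMT2, PAY_AMT3, PAY_AMT4,
                 PAY_AMT5, PAY_AMT6, ID)) %>%
  step_impute_bag(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), num_comp = 5)
```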
This recipe can be prepped with the code below:
tic()
mod1_prep <- mas_rec %>%
  check_missing(all_predictors()) %>%
  prep()
mod1_prep
toc()

juice(mod1_prep)
summary(mod1_prep) %>% arrange(role)
The output is as follows:
> juice(mod1_prep)
# A tibble: 16,168 x 6
default PC1 PC2 PC3 PC4 PC5
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 N -0.971 -0.270 -0.999 0.803 -0.901
2 N -0.222 -0.141 -0.908 -0.726 -0.771
3 N -0.213 -0.170 -0.944 0.881 -0.102
4 N -1.77 -0.523 -0.547 1.39 -0.569
5 N -1.69 -0.468 -0.345 -0.138 -1.46
6 N -1.25 -0.293 0.133 0.224 -1.19
7 N -1.05 -0.164 0.510 -0.189 1.91
8 N -0.963 -0.171 -1.22 -0.337 2.18
9 N 0.780 0.130 -1.12 1.48 0.633
10 N -1.63 -0.512 -0.531 1.77 0.372
# … with 16,158 more rows
> summary(mod1_prep) %>% arrange(role)
# A tibble: 6 x 4
variable type role source
<chr> <chr> <chr> <chr>
1 default nominal outcome original
2 PC1 numeric predictor derived
3 PC2 numeric predictor derived
4 PC3 numeric predictor derived
5 PC4 numeric predictor derived
6 PC5 numeric predictor derived
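To sanity-check the PCA step itself, the tidy() method from recipes can report the variance each component captures; a sketch, assuming step_pca() is step number 4 of the prepped recipe above:

```r
# per-component variance from the prepped PCA step (step 4 of the recipe)
tidy(mod1_prep, number = 4, type = "variance")
```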
The model is as follows:
cr_boost <- boost_tree(
  mtry = tune(),
  trees = 1000,
  min_n = tune(),
  tree_depth = tune(),
  learn_rate = tune(),
  loss_reduction = tune(),
  sample_size = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
The rest of the workflow is as follows:
# Parallel Processing ----
cores <- detectCores() - 1
registerDoParallel(cores = cores)

# Re-sampling with cross validation ----
tree_folds <- vfold_cv(cr_tr)
control <- control_grid(save_pred = TRUE)

# 3. Create a Work-flow ----
cr_wf <- workflow() %>%
  add_recipe(mod1_prep) %>%
  add_model(cr_boost)
cr_wf

# Set the Grid Space for the model ----
xg_boost_grid <- grid_latin_hypercube(
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), cr_tr),
  learn_rate(),
  size = 30
)
The tuning process is as follows:
xg_boost_tnd <- tune_grid(
  cr_wf,
  resamples = tree_folds,
  grid = xg_boost_grid,
  control = control
)
print("It's done, check the results below")
xg_boost_tnd
toc()
When I run this, I get the following message:
Warning message:
This tuning result has notes. Example notes on model fitting include:
preprocessor 1/1: Error in svd(x, nu = 0, nv = k): a dimension is zero
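To try to localize where the preprocessing collapses, I can run the recipe on a single resample outside tune_grid(); a minimal debugging sketch (the fold index is chosen arbitrarily):

```r
# prep and bake the recipe on the analysis set of the first fold, then check
# the resulting dimensions; a zero here would explain the svd() error
fold1_data <- analysis(tree_folds$splits[[1]])
prep(mas_rec, training = fold1_data) %>%
  bake(new_data = NULL) %>%
  dim()
```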
Any suggestions would be greatly appreciated.