Sign of variable importance from vip is the opposite of what's expected with glmnet / tidymodels

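Problem description

I'm using lasso regression to classify some text as either related or unrelated to AI. When I compute variable importance with vip and tidymodels, the sign is the opposite of what I expect: words like "machine", "learning", and "algorithm" have negative signs.

Sorry for the lack of a reprex, but here is my code: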

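library(tidymodels)
library(textrecipes) # step_tokenize(), step_tokenfilter(), step_tfidf()
library(themis)      # step_upsample() in current releases (older recipes versions exported it)
library(vip)         # vi()
library(doParallel)  # makePSOCKcluster(), registerDoParallel()

fy21_raw %>%
    sample_n(5)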

# A tibble: 5 x 3
#  prog_title     text     artificial_intel
#  <chr>          <chr>    <fct>           
#1 Advanced Batt~ "ABMS l~ not             
#2 Energy Effici~ "This e~ not             
#3 Development o~ "This P~ artificial_intel
#4 Unmanned Logi~ "This U~ artificial_intel
#5 FY 2020 SBIR/~ "Fundin~ not 

# Note: the artificial_intel column is a factor with 2 levels: "artificial_intel" and "not"

set.seed(123)
budget_split <- initial_split(fy21_raw, strata = artificial_intel)
budget_train <- training(budget_split)
budget_test  <- testing(budget_split)

set.seed(234)
budget_folds <- vfold_cv(budget_train, strata = artificial_intel, v = 5)

budget_rec <- recipe(artificial_intel ~ ., data = budget_train) %>% # update dv with actual name
    update_role(prog_title, new_role = "id") %>%
    step_tokenize(text) %>%
    step_tokenfilter(text, max_tokens = 1000) %>%
    step_upsample(artificial_intel) %>% # update dv with actual name
    step_tfidf(text) %>%
    step_normalize(recipes::all_predictors())

budget_wf <- workflow() %>%
    add_recipe(budget_rec)

lasso_spec <- logistic_reg(penalty = 0.1, mixture = 1) %>%
    set_mode("classification") %>%
    set_engine("glmnet")

all_cores <- parallel::detectCores(logical = FALSE)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)

set.seed(1234)
lasso_res <- budget_wf %>%
    add_model(lasso_spec) %>%
    fit_resamples(
        resamples = budget_folds,
        metrics = metric_set(roc_auc, accuracy, sens, spec),
        # control_resamples() is the control function meant for fit_resamples()
        control = control_resamples(save_pred = TRUE, pkgs = c("textrecipes"))
    )

set.seed(123)
budget_imp <- budget_wf %>%
    add_model(lasso_spec) %>%
    fit(budget_train) %>%
    pull_workflow_fit() %>%
    vi()

# A tibble: 1,000 x 3
#   Variable              Importance Sign 
#   <chr>                      <dbl> <chr>
# 1 tfidf_text_machine        -6.82  NEG  
# 2 tfidf_text_artificial     -5.84  NEG  
# 3 tfidf_text_learning       -3.69  NEG

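Is it computing importance relative to the "not" outcome instead of "artificial_intel"?

Solution

From the glmnet vignette:

Note that for "binomial" models, results are returned only for the class corresponding to the second level of the factor response.

So if you want the coefficients to have the expected sign, the positive class has to be the second factor level for glmnet. In your data the levels are "artificial_intel" and "not", in that order, so glmnet is modeling the "not" class, which is why the signs come out flipped. If you use glmnet together with yardstick, keep in mind that yardstick treats the first factor level as the event by default, so you will also need to set options(yardstick.event_first = FALSE).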
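To make the level-order effect concrete, here is a minimal sketch on made-up data. It uses base R's glm(family = binomial) as a stand-in for glmnet, on the assumption that only the convention matters here: both treat the second factor level as the event. The data, the predictor x, and the variable names are hypothetical and are not taken from the question.

```r
# Hypothetical toy data: larger x makes "artificial_intel" more likely.
set.seed(1)
n <- 200
x <- rnorm(n)
y_chr <- ifelse(x + rnorm(n) > 0, "artificial_intel", "not")

# factor() orders levels alphabetically, so "not" ends up as the second level.
y_default <- factor(y_chr)
levels(y_default)

# A binomial fit models the probability of the second level ("not"),
# so the coefficient on x has the opposite sign to what one might expect.
coef_default <- coef(glm(y_default ~ x, family = binomial))[["x"]]

# Make "not" the reference (first) level so "artificial_intel" becomes second.
y_releveled <- relevel(y_default, ref = "not")
levels(y_releveled)

# Same model, same magnitude, flipped sign.
coef_releveled <- coef(glm(y_releveled ~ x, family = binomial))[["x"]]

c(default = coef_default, releveled = coef_releveled)
```

In the workflow from the question, the equivalent step would presumably be to relevel artificial_intel in fy21_raw before calling initial_split(). Once the positive class is the second level, the yardstick metrics need to be told the same thing. Besides the global option mentioned above, more recent yardstick releases appear to favor passing event_level = "second" to the individual metric functions, so it is worth checking the documentation for the installed version.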
