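Problem description
I am using lasso regression to classify some text as AI-related or not. When I compute variable importance with vip and tidymodels, the sign is the opposite of what I expect: words like "machine", "learning", and "algorithm" come out with a negative sign.
Sorry for the lack of a reprex, but here is my code: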
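# Packages assumed from the functions used below
library(tidymodels)   # rsample, recipes, parsnip, workflows, tune, yardstick
library(textrecipes)  # step_tokenize(), step_tokenfilter(), step_tfidf()
library(themis)       # step_upsample() in current releases
library(vip)          # vi()
library(doParallel)   # makePSOCKcluster(), registerDoParallel()

fy21_raw %>%
  sample_n(5)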
# A tibble: 5 x 3
# prog_title text artificial_intel
# <chr> <chr> <fct>
#1 Advanced Batt~ "ABMS l~ not
#2 Energy Effici~ "This e~ not
#3 Development o~ "This P~ artificial_intel
#4 Unmanned Logi~ "This U~ artificial_intel
#5 FY 2020 SBIR/~ "Fundin~ not
# Note: the artificial_intel column is a factor with 2 levels: "artificial_intel" and "not"
set.seed(123)
budget_split <- initial_split(fy21_raw, strata = artificial_intel)
budget_train <- training(budget_split)
budget_test <- testing(budget_split)
set.seed(234)
budget_folds <- vfold_cv(budget_train, strata = artificial_intel, v = 5)
budget_rec <- recipe(artificial_intel ~ ., data = budget_train) %>% # update dv with actual name
  update_role(prog_title, new_role = "id") %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens = 1000) %>%
  step_upsample(artificial_intel) %>% # update dv with actual name
  step_tfidf(text) %>%
  step_normalize(recipes::all_predictors())
budget_wf <- workflow() %>%
  add_recipe(budget_rec)
lasso_spec <- logistic_reg(penalty = 0.1, mixture = 1) %>%
  set_mode("classification") %>%
  set_engine("glmnet")
all_cores <- parallel::detectCores(logical = FALSE)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)
set.seed(1234)
lasso_res <- budget_wf %>%
  add_model(lasso_spec) %>%
  fit_resamples(
    resamples = budget_folds,
    metrics = metric_set(roc_auc, accuracy, sens, spec),
    # fit_resamples() pairs with control_resamples() rather than control_grid()
    control = control_resamples(save_pred = TRUE, pkgs = c("textrecipes"))
  )
set.seed(123)
budget_imp <- budget_wf %>%
  add_model(lasso_spec) %>%
  fit(budget_train) %>%
  pull_workflow_fit() %>%
  vi()
# A tibble: 1,000 x 3
# Variable Importance Sign
# <chr> <dbl> <chr>
# 1 tfidf_text_machine -6.82 NEG
# 2 tfidf_text_artificial -5.84 NEG
# 3 tfidf_text_learning -3.69 NEG
Is it computing importance relative to the "not" outcome rather than "artificial_intel"?
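Solution
From the glmnet vignette:
Note that for "binomial" models, results are returned only for the class corresponding to the second level of the factor response.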
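The behavior described in the vignette can be checked on toy data. The sketch below is only an illustration: the simulated predictors (`machine`, `noise1`, `noise2`), the sample size, and the `lambda` value are made up for the example and are not taken from the question's data. It fits the same lasso logistic model twice, once with the default alphabetical level order and once with the levels reversed, and compares the sign of the coefficient on `machine`.

```r
library(glmnet)

set.seed(1)
n <- 200

# Hypothetical predictors: "machine" drives the outcome, the others are noise.
x <- matrix(
  rnorm(n * 3),
  ncol = 3,
  dimnames = list(NULL, c("machine", "noise1", "noise2"))
)

# Larger values of "machine" make the AI class more likely.
is_ai <- rbinom(n, 1, plogis(2 * x[, "machine"])) == 1
y <- factor(ifelse(is_ai, "artificial_intel", "not"))

# Default levels are alphabetical: "artificial_intel" first, "not" second,
# so glmnet models the probability of "not".
levels(y)
fit_default <- glmnet(x, y, family = "binomial", alpha = 1, lambda = 0.05)
coef_default <- coef(fit_default)["machine", 1]

# Make "not" the reference level so "artificial_intel" becomes the second level.
y_releveled <- relevel(y, ref = "not")
fit_releveled <- glmnet(x, y_releveled, family = "binomial", alpha = 1, lambda = 0.05)
coef_releveled <- coef(fit_releveled)["machine", 1]

# Same data, same model; only the level order differs.
c(default = coef_default, releveled = coef_releveled)
```

With the default ordering the coefficient on `machine` should be negative, because the model describes the log-odds of "not"; after releveling it should be positive. This matches the NEG signs in the question's `vi()` output.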
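So if you want the coefficient signs to point the expected way, the positive class must be the second factor level when fitting with glmnet. If you use glmnet together with yardstick, keep in mind that yardstick treats the first factor level as the event by default, so you need to set options(yardstick.event_first = FALSE).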
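As a rough sketch of the yardstick side, the toy predictions below (invented for the example) have "artificial_intel" as the second level, as glmnet needs. To my knowledge, recent yardstick releases deprecate the global `yardstick.event_first` option in favor of a per-call `event_level` argument, so that argument is used here; treat this as an assumption about your installed version and check its documentation.

```r
library(yardstick)
library(tibble)

lvls <- c("not", "artificial_intel")  # positive class is the second level

# Hypothetical predictions, made up for illustration only.
preds <- tibble(
  truth    = factor(c("not", "not", "artificial_intel", "artificial_intel"), levels = lvls),
  estimate = factor(c("not", "artificial_intel", "artificial_intel", "artificial_intel"), levels = lvls)
)

# Default: the first level ("not") is treated as the event.
sens_first <- sens(preds, truth = truth, estimate = estimate)

# Treat the second level ("artificial_intel") as the event instead.
sens_second <- sens(preds, truth = truth, estimate = estimate, event_level = "second")

sens_first
sens_second
```

The two calls report different sensitivities because they disagree about which class counts as the event. When metrics are computed inside `fit_resamples()`, the control object (for example `control_resamples(event_level = "second")`, assuming a tune version that supports it) appears to be the place to pass the same choice.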