Building a nested logistic regression model using caret, glmnet and nested cross-validation

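Problem description

My problem

I want to build a logistic regression model with a high AUC to predict a binary variable.

I would like to use the following approach (if it is feasible):

Use an elastic net model (glmnet) to reduce the number of predictors and find the best hyperparameters (alpha and lambda)
Combine the output of this model (a simple linear combination) with one additional predictor (the opinion of a super doctor, superdoc) in a logistic regression model (= finalmodel), similar to what is described on page 26 of: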

Afshar P, Mohammadi A, Plataniotis KN, Oikonomou A, Benali H. From handcrafted to deep-learning-based cancer radiomics: challenges and opportunities. IEEE Signal Processing Magazine 2019; 36: 132-60. Available here
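Written out, the two-stage model described above has the following form. The symbols are introduced here for illustration only and are not taken from the cited paper: $x_j$ are the predictors kept by glmnet, $\hat\beta$ are its coefficients at the tuned alpha and lambda, and $\gamma$ are the coefficients of the final logistic regression.

```latex
\begin{aligned}
\eta_{\text{glmnet}} &= \hat\beta_0 + \sum_{j} \hat\beta_j x_j \\
\operatorname{logit} P(y = 1) &= \gamma_0 + \gamma_1 \,\text{superdoc} + \gamma_2 \,\eta_{\text{glmnet}}
\end{aligned}
```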

Example data

As example data, I have a data set with many numeric predictors and a binary (pos/neg) outcome (diabetes).

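# libraries
library(tidyverse)
library(caret)
library(glmnet)
library(mlbench)
library(broom) # tidy(), used below; not attached by library(tidyverse)
library(pROC)  # roc(), used below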

# get example data
data(PimaIndiansDiabetes,package="mlbench")
data <- PimaIndiansDiabetes

# add the super doctor's opinion to the data
set.seed(2323)
data %>% 
  rowwise() %>% 
  mutate(superdoc=case_when(diabetes=="pos" ~ as.numeric(sample(0:2,1)),TRUE~ 0)) -> data

# separate the data in a training set and test set
train.data <- data[1:550,]
test.data <- data[551:768,]

Created on 2021-03-14 by the reprex package (v1.0.0)

What I have already tried

# train the model (without the superdoc's opinion)
set.seed(2323)
model <- train(
  diabetes ~ ., data = train.data %>% select(-superdoc), method = "glmnet",
  # note the capital P: the argument is classProbs, not classprobs
  trControl = trainControl("cv", number = 10, classProbs = TRUE,
                           savePredictions = TRUE, summaryFunction = twoClassSummary),
  tuneLength = 10, metric = "ROC" # ROC metric is in twoClassSummary
)


# extract the coefficients for the best alpha and lambda  
coef(model$finalModel,model$finalModel$lambdaOpt) -> coeffs
tidy(coeffs) %>% tibble() -> coeffs

coef.interc = coeffs %>% filter(row=="(Intercept)") %>% pull(value)
coef.pregnant = coeffs %>% filter(row=="pregnant") %>% pull(value)
coef.glucose = coeffs %>% filter(row=="glucose") %>% pull(value)
coef.pressure = coeffs %>% filter(row=="pressure") %>% pull(value)
coef.mass = coeffs %>% filter(row=="mass") %>% pull(value)
coef.pedigree = coeffs %>% filter(row=="pedigree") %>% pull(value)
coef.age = coeffs %>% filter(row=="age") %>% pull(value)


# combine the model with the superdoc's opinion in a logistic regression model
finalmodel = glm(diabetes ~ superdoc + I(coef.interc + coef.pregnant*pregnant + coef.glucose*glucose + coef.pressure*pressure + coef.mass*mass + coef.pedigree*pedigree + coef.age*age),family=binomial,data=train.data)


# make predictions on the test data
predict(finalmodel,test.data,type="response") -> predictions


# check the AUC of the model in the test data
roc(test.data$diabetes,predictions,ci=TRUE) 
#> Setting levels: control = neg,case = pos
#> Setting direction: controls < cases
#> 
#> Call:
#> roc.default(response = test.data$diabetes,predictor = predictions,ci = TRUE)
#> 
#> Data: predictions in 145 controls (test.data$diabetes neg) < 73 cases (test.data$diabetes pos).
#> Area under the curve: 0.9345
#> 95% CI: 0.8969-0.9721 (DeLong)

Created on 2021-03-14 by the reprex package (v1.0.0)
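The hand-copied coefficients above break as soon as a different seed selects different predictors. A minimal sketch of an alternative, assuming the same data layout as in the question: ask glmnet for its linear predictor directly with predict(..., type = "link") at the tuned lambda. The helper name glmnet_score is hypothetical, tuneLength is reduced to 5 only to keep the run short, and superdoc is drawn in one vectorised call, so its values will not match the ones in the question.

```r
# Sketch: build the glmnet score with predict(type = "link")
# instead of copying coefficients by hand.
library(caret)
library(glmnet)
library(mlbench)

data(PimaIndiansDiabetes, package = "mlbench")
dat <- PimaIndiansDiabetes
set.seed(2323)
# same rule as in the question (0 for "neg", random 0..2 for "pos"), vectorised
dat$superdoc <- ifelse(dat$diabetes == "pos",
                       sample(0:2, nrow(dat), replace = TRUE), 0)
train.data <- dat[1:550, ]
test.data  <- dat[551:768, ]
predictors <- setdiff(names(dat), c("diabetes", "superdoc"))

set.seed(2323)
model <- train(
  x = train.data[, predictors], y = train.data$diabetes,
  method = "glmnet", metric = "ROC", tuneLength = 5,
  trControl = trainControl("cv", number = 10, classProbs = TRUE,
                           summaryFunction = twoClassSummary)
)

# hypothetical helper: linear predictor of the tuned glmnet fit
glmnet_score <- function(fit, newdata) {
  as.numeric(predict(fit$finalModel,
                     newx = as.matrix(newdata[, predictors]),
                     s = fit$bestTune$lambda, type = "link"))
}
train.data$score <- glmnet_score(model, train.data)
test.data$score  <- glmnet_score(model, test.data)

# second stage: logistic regression on superdoc plus the glmnet score
finalmodel  <- glm(diabetes ~ superdoc + score, family = binomial, data = train.data)
predictions <- predict(finalmodel, test.data, type = "response")
```

Because superdoc is non-zero only for positive cases, glm() may warn about fitted probabilities of 0 or 1; that is a property of the simulated predictor, not of this shortcut.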

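What I am unsure about...

I think that to find the most accurate model and avoid overfitting, I have to use nested cross-validation (as I learned here and here). However, I do not know how to do that. Currently, every time I use a different set.seed, different predictors are selected and I get different AUCs. Can this be mitigated by using nested cross-validation correctly?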
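For reference, nested cross-validation with caret can be sketched roughly as follows. This is an illustration under assumptions, not a recipe from the sources linked above: the outer loop uses 5 folds from createFolds(), the inner loop is caret's own 5-fold tuning of alpha and lambda, and superdoc is left out to keep the example short. The outer loop produces a performance estimate for the whole tuning procedure; it does not return one final model.

```r
# Sketch of nested CV: inner loop tunes glmnet, outer loop estimates AUC.
library(caret)
library(glmnet)
library(mlbench)
library(pROC)

data(PimaIndiansDiabetes, package = "mlbench")
dat <- PimaIndiansDiabetes

set.seed(2323)
outer_folds <- createFolds(dat$diabetes, k = 5) # list of held-out row indices

outer_auc <- sapply(outer_folds, function(test_idx) {
  outer_train <- dat[-test_idx, ]
  outer_test  <- dat[test_idx, ]
  # inner loop: alpha and lambda are tuned on the outer training part only
  fit <- train(
    diabetes ~ ., data = outer_train, method = "glmnet",
    metric = "ROC", tuneLength = 5,
    trControl = trainControl("cv", number = 5, classProbs = TRUE,
                             summaryFunction = twoClassSummary)
  )
  # outer loop: score the tuned model on rows it has never seen
  p <- predict(fit, outer_test, type = "prob")[, "pos"]
  as.numeric(pROC::auc(pROC::roc(outer_test$diabetes, p,
                                 levels = c("neg", "pos"),
                                 direction = "<", quiet = TRUE)))
})

outer_auc       # one AUC per outer fold
mean(outer_auc) # estimate for the tuning procedure as a whole
```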

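Update 1

I have just learned that nested CV does not help you obtain the most accurate model. The actual problem is that, in the second code sample above, I get varying coefficients with different set.seed values. I am in fact facing the same problem as described here: Extract the coefficients for the best tuning parameters of a glmnet model in caret

One of the posted solutions is to use repeated CV to reduce this variation. Unfortunately, I could not get it to run.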

Update 2

Using "repeatedcv" solved my problem. Repeated CV, not nested CV, did the trick!

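# arguments other than "repeatedcv" and repeats = 10 are restored from the
# first train() call above; number = 10 is an assumption carried over from it
model <- train(
  diabetes ~ ., data = train.data %>% select(-superdoc), method = "glmnet",
  trControl = trainControl("repeatedcv", number = 10, repeats = 10, classProbs = TRUE,
                           savePredictions = TRUE, summaryFunction = twoClassSummary),
  tuneLength = 10, metric = "ROC" # ROC metric is in twoClassSummary
)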
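To check whether repeated CV actually steadies the coefficients on a given data set, one can refit under a few seeds and compare the spread. A minimal sketch, assuming the training rows of the question without superdoc; the helper tuned_coefs, the three seeds and tuneLength = 5 are arbitrary choices made for illustration. A smaller standard deviation in the repeatedcv column would indicate more stable coefficients, but nothing guarantees that outcome for every coefficient.

```r
# Sketch: compare seed-to-seed spread of tuned glmnet coefficients
# under plain 10-fold CV and under 10 x 10-fold repeated CV.
library(caret)
library(glmnet)
library(mlbench)

data(PimaIndiansDiabetes, package = "mlbench")
train.data <- PimaIndiansDiabetes[1:550, ]

# hypothetical helper: coefficients at the tuned lambda for one seed
tuned_coefs <- function(seed, ctrl_method, repeats = NA) {
  set.seed(seed)
  fit <- train(
    diabetes ~ ., data = train.data, method = "glmnet",
    metric = "ROC", tuneLength = 5,
    trControl = trainControl(ctrl_method, number = 10, repeats = repeats,
                             classProbs = TRUE,
                             summaryFunction = twoClassSummary)
  )
  # named numeric vector: intercept plus one entry per predictor
  as.matrix(coef(fit$finalModel, fit$bestTune$lambda))[, 1]
}

seeds <- c(1, 2, 3)
coefs_cv  <- sapply(seeds, tuned_coefs, ctrl_method = "cv")
coefs_rcv <- sapply(seeds, tuned_coefs, ctrl_method = "repeatedcv", repeats = 10)

# standard deviation of each coefficient across the seeds
round(cbind(cv = apply(coefs_cv, 1, sd),
            repeatedcv = apply(coefs_rcv, 1, sd)), 4)
```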

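Solution

Thanks to @missuse, I was able to solve my problem:

Cross-validation does not help to obtain the most accurate model. This misunderstanding (which was also mine) is discussed nicely in the blog post: The "Cross-Validation - Train/Predict" misunderstanding

The seed-dependent variation of glmnet's predictor coefficients in small data sets can be mitigated by repeated cross-validation (i.e. "repeatedcv" in caret::trainControl, as described in the comments)

Stacked learners (in my case, glmnet stacked with glm) are usually built from the out-of-fold predictions of the lower-level learners. This can be done with the mlr3 package, as described in this blog post: here. Since this was not the initial question, I opened a new question: Tuning a stacked learner.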
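The out-of-fold idea in the last point can also be sketched with caret alone. This is a substitute illustration, not the mlr3 approach from the linked blog post. It assumes the data layout of the question, draws superdoc in one vectorised call (so its values differ from the question's), and uses savePredictions = "final" so that model$pred contains, for the winning alpha and lambda, the prediction each training row received while it was held out.

```r
# Sketch: stack glm on top of glmnet using out-of-fold predictions.
library(caret)
library(glmnet)
library(mlbench)

data(PimaIndiansDiabetes, package = "mlbench")
dat <- PimaIndiansDiabetes
set.seed(2323)
dat$superdoc <- ifelse(dat$diabetes == "pos",
                       sample(0:2, nrow(dat), replace = TRUE), 0)
train.data <- dat[1:550, ]
test.data  <- dat[551:768, ]
predictors <- setdiff(names(dat), c("diabetes", "superdoc"))

set.seed(2323)
model <- train(
  x = train.data[, predictors], y = train.data$diabetes,
  method = "glmnet", metric = "ROC", tuneLength = 5,
  trControl = trainControl("cv", number = 10, classProbs = TRUE,
                           savePredictions = "final",
                           summaryFunction = twoClassSummary)
)

# one out-of-fold probability per training row, put back in row order
oof <- model$pred[order(model$pred$rowIndex), ]
train.data$p_glmnet <- oof$pos
# test rows are scored by the final glmnet fit on all training rows
test.data$p_glmnet <- predict(model, test.data[, predictors], type = "prob")[, "pos"]

# level-2 learner; qlogis() moves the probability back to the logit scale
stacked <- glm(diabetes ~ superdoc + qlogis(p_glmnet),
               family = binomial, data = train.data)
predictions <- predict(stacked, test.data, type = "response")
```

Feeding the second stage with out-of-fold predictions means the glm never sees a glmnet score that was fitted on the same row, which is the leakage the stacking literature tries to avoid.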