Problem description
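I am using ranger to fit a random forest. As the evaluation metric, I use the ROC AUC score from cvAUC. After predicting, when I try to compute the AUC score, I get the error: "Format of predictions is invalid. It couldn't be coerced to a list." I think this happens because the predictions carry a Levels part, which lists the unique levels of the predictions, but I cannot get rid of that part. The minimal reproducible example below throws the error: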
library(ranger)
library(caret)
install.packages("cvAUC")
library(cvAUC)
# Columns for training set
cat.column <- c("cat","dog","monkey","shark","seal")
num.column <- c(1,2,5,7,9)
class <- c(0,1,0,1,1)  # assumed values; one label per row is required
train.set <- data.frame(num.column,cat.column,class)
# Columns for test set (values partly assumed for illustration; one entry per row)
cat.column <- c("cat","elephant-shrew","monkey","dog","seal")
num.column <- c(11,6,8,3,4)
class <- c(1,0,1,0,1)
test.set <- data.frame(num.column,cat.column,class)
# Drop the target variable from the test set
target.test <- test.set["class"]
test.set <- test.set[,!names(test.set) %in% "class"]
# Fit random forest
rf = ranger(formula = as.factor(class) ~ .,data = train.set,verbose = FALSE)
# Get predictions
pred <- predict(rf,test.set)
predictions <- pred$predictions
# Get AUC score (this call raises the error, because predictions is passed as a factor)
auc <- AUC(as.factor(predictions),as.factor(unlist(target.test)),label.ordering = NULL)
cat(auc)
Solution
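You get the error because AUC expects a numeric vector, not a factor. In this example there is a second issue: in the test set, the column cat.column contains a new level (elephant-shrew) that does not occur in the training set. It is best to declare all the values a variable can take across both the training and test sets.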
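The factor-versus-numeric point can be checked in base R alone. The snippet below is a minimal sketch, not the ranger output itself: `predictions` is a hypothetical stand-in for `pred$predictions`, built by hand so the example runs without any packages.

```r
# Hypothetical stand-in for pred$predictions: a factor of predicted classes
predictions <- factor(c("1", "0", "1", "1", "0"), levels = c("0", "1"))

print(predictions)        # the printout ends with a "Levels: 0 1" line
is.numeric(predictions)   # FALSE: a factor is not a numeric vector

# Option 1: integer level codes (1 for level "0", 2 for level "1")
codes <- as.numeric(predictions)

# Option 2: recover the original labels as numbers (0 and 1)
labels01 <- as.numeric(as.character(predictions))
```

Either conversion should be usable for an AUC computation, since AUC depends only on how the scores rank the observations and both options rank them identically. Option 2 is the safer habit whenever the actual numeric values matter.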
library(caret)
library(cvAUC)
library(ranger)
# Columns for training set
cat.column <- c("cat","dog","monkey","shark","seal")
num.column <- c(1,2,5,7,9)
class <- factor(c(0,1,0,1,1),levels = c(0,1))  # assumed values; one label per row is required
train.set <- data.frame(num.column,cat.column,class,stringsAsFactors = FALSE)
# Columns for test set (values partly assumed for illustration; one entry per row)
cat.column <- c("cat","elephant-shrew","monkey","dog","seal")
num.column <- c(11,6,8,3,4)
class <- factor(c(1,0,1,0,1),levels = c(0,1))
test.set <- data.frame(num.column,cat.column,class,stringsAsFactors = FALSE)
# Drop the target variable from the test set
target.test <- test.set["class"]
test.set <- test.set[,!names(test.set) %in% "class"]
# Fit random forest
rf = ranger(formula = class ~ .,data = train.set,verbose = FALSE)
# Get predictions
pred <- predict(rf,test.set)
predictions <- pred$predictions
# Get AUC score; as.numeric() turns the factor into its integer level codes (1, 2)
auc <- AUC(as.numeric(predictions),target.test$class,label.ordering = NULL)
cat(auc)
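As you can see, I changed the data preparation steps slightly. First, if your class column is the outcome of a classification task, it is better to coerce it to a factor as early as possible. Second, if the training and test sets do not share all the values of a character variable (in your example, the column cat.column contains elephant-shrew, which is not in the training set), it is better to treat that variable as character (in this case you can use stringsAsFactors = FALSE to keep character variables as characters).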
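If you prefer factors over character columns, the advice about declaring all possible values can be sketched in base R as follows. This is only an illustration under assumptions: `train.cat`, `test.cat`, `all.levels`, `train.f` and `test.f` are hypothetical names, and the test values other than elephant-shrew are made up.

```r
# Hypothetical category values for the two sets
train.cat <- c("cat", "dog", "monkey", "shark", "seal")
test.cat  <- c("cat", "elephant-shrew", "monkey", "dog", "seal")

# Values that occur only in the test set
setdiff(test.cat, train.cat)   # "elephant-shrew"

# Declare every value seen in either set as a level of both factors
all.levels <- union(train.cat, test.cat)
train.f <- factor(train.cat, levels = all.levels)
test.f  <- factor(test.cat,  levels = all.levels)

# Both factors now share one encoding
identical(levels(train.f), levels(test.f))   # TRUE
```

Sharing the levels only keeps the encoding consistent between the two sets. The model still has no training rows for elephant-shrew, so it cannot learn anything specific about that value.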