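Problem description
I am using the caret package and trying its rpart method. Interestingly, I can fit the model with the plain rpart package, but as soon as I go through caret it no longer works. What puzzles me is that I have seen caret used this way on various websites, for example with the Boston data.
I cannot tell whether I have implemented the model incorrectly or am missing something here. For rpart_tree2 (below), I get the following message: "In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, : There were missing values in resampled performance measures."
I know I could also specify repeatedcv, but that has no effect on the message.
Below is an MWE (I tried to keep it as simple as possible):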
library(caret)
library(rpart)
library(MASS)  # the Boston data set ships with MASS
data("Boston")
index <- sample(nrow(Boston),nrow(Boston)*0.75)
Boston.train <- Boston[index,]
Boston.test <- Boston[-index,]
rpart_tree1 <- rpart(medv ~ .,data = Boston.train)
rpart_tree2 <- train(medv ~.,data = Boston.train,method = "rpart")
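Solution
The warning is not a problem.
With larger cp values, the tree produced in some resamples has no splits. When a tree has no splits, the predicted value is the mean of the training outcome values. Because the predictions then have no variance, the cor function issues a warning and returns NA. That function is used to compute Rsquared, so for those resamples Rsquared is NA. In other words it is missing, which is exactly what the warning says.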
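To see the no-split case in isolation, here is a minimal sketch that fits rpart directly with a deliberately extreme cp. The value 0.99 and the name stump are my own choices for illustration, not taken from the question, and it assumes the rpart and MASS packages are installed.

```r
library(rpart)
library(MASS)   # provides the Boston data

# An extreme cp (assumed value, chosen only to force the effect):
# no split improves the fit enough, so only the root node remains.
stump <- rpart(medv ~ ., data = Boston, cp = 0.99)

nrow(stump$frame)        # number of nodes in the fitted tree
unique(predict(stump))   # the distinct predicted values
mean(Boston$medv)        # mean of the training outcome, for comparison
```

With a root-only tree every prediction equals the training mean, so the predictions have no variance for cor to work with.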
Example:
library(caret)
library(rpart)
library(MASS)
data(Boston)
set.seed(1)
index <- sample(nrow(Boston),nrow(Boston)*0.75)
Boston.train <- Boston[index,]
Boston.test <- Boston[-index,]
Lower cp values do not produce the warning:
rpart_tree2 <- train(medv ~.,data = Boston.train,method = "rpart",tuneGrid = data.frame(cp = c(0.01,0.05,0.1)))
When I specify a higher cp and a particular seed:
set.seed(111)
rpart_tree3 <- train(medv ~.,data = Boston.train,method = "rpart",tuneGrid = data.frame(cp = c(0.4)),trControl = trainControl(savePredictions = TRUE))
Warning message:
In nominalTrainWorkflow(x = x,y = y,wts = weights,info = trainInfo,:
There were missing values in resampled performance measures.
To inspect the problem:
rpart_tree3$resample
RMSE Rsquared MAE Resample
1 7.530482 0.4361392 5.708437 Resample01
2 7.334995 0.2350619 5.392867 Resample02
3 7.178178 0.3971089 5.511530 Resample03
4 6.369189 0.2798907 4.851146 Resample04
5 7.550175 0.3344412 5.566677 Resample05
6 7.019099 0.4270561 5.160572 Resample06
7 7.197384 0.4530680 5.665177 Resample07
8 7.206760 0.3447690 5.290300 Resample08
9 7.408748 0.4553087 5.513998 Resample09
10 7.241468 0.4119979 5.452725 Resample10
11 7.562511 0.3967082 5.768643 Resample11
12 7.347378 0.3861702 5.225532 Resample12
13 7.124039 0.4039857 5.599800 Resample13
14 7.151013 0.3301835 5.490676 Resample14
15 6.518536 0.3835073 4.938662 Resample15
16 10.008008 NA 7.174290 Resample16
17 7.018742 0.4431380 5.379823 Resample17
18 7.454669 0.3888220 6.000062 Resample18
19 6.745457 0.3772237 5.175481 Resample19
20 6.864304 0.4179276 5.089924 Resample20
21 7.238874 0.2378432 5.234752 Resample21
22 7.581736 0.3707839 5.543641 Resample22
23 7.236317 0.3431725 5.278693 Resample23
24 7.232241 0.4196955 5.518907 Resample24
25 6.641846 0.3664023 4.683834 Resample25
We can see that the problem occurs in Resample16.
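Rather than scanning the table by eye, the affected resamples can be picked out with is.na. The sketch below uses a small data frame, res, holding three rows copied from the output above; res and bad are names I made up for illustration.

```r
# Three rows copied from the resample table above
res <- data.frame(
  RMSE     = c(6.518536, 10.008008, 7.018742),
  Rsquared = c(0.3835073, NA, 0.4431380),
  Resample = c("Resample15", "Resample16", "Resample17")
)

# Keep only the resamples whose Rsquared is missing
bad <- res[is.na(res$Rsquared), ]
bad$Resample
```

On the fitted model, the same filter would presumably be applied to rpart_tree3$resample in place of res.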
library(tidyverse)
rpart_tree3$pred %>%
filter(Resample == "Resample16") -> for_cor
head(for_cor)
pred obs rowIndex cp Resample
1 21.87018 15.6 1 0.4 Resample16
2 21.87018 22.3 3 0.4 Resample16
3 21.87018 13.4 6 0.4 Resample16
4 21.87018 12.7 10 0.4 Resample16
5 21.87018 18.6 11 0.4 Resample16
6 21.87018 19.0 13 0.4 Resample16
We can see that pred is the same in every row of Resample16:
cor(for_cor$pred,for_cor$obs,use = "pairwise.complete.obs")
[1] NA
Warning message:
In cor(for_cor$pred,for_cor$obs,use = "pairwise.complete.obs") :
the standard deviation is zero
To see how Rsquared is computed in caret, look at the source of postResample. It is essentially cor(pred,obs)^2.
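As a rough illustration of that formula, the sketch below wraps cor(pred, obs)^2 in a helper. The name rsq_sketch is hypothetical and this is not caret's actual source; the second set of predictions is made up purely to show the contrast with constant ones.

```r
# Hypothetical helper mirroring the idea cor(pred, obs)^2;
# not caret's actual implementation.
rsq_sketch <- function(pred, obs) {
  # suppressWarnings hides the "standard deviation is zero" warning
  r <- suppressWarnings(cor(pred, obs, use = "pairwise.complete.obs"))
  r^2
}

# obs values from the first six Resample16 rows shown earlier
obs <- c(15.6, 22.3, 13.4, 12.7, 18.6, 19.0)

# Constant predictions, as in Resample16
rsq_sketch(rep(21.87018, length(obs)), obs)

# Made-up predictions that vary, for comparison
rsq_sketch(obs + c(1, -1, 2, -2, 0.5, -0.5), obs)
```

The first call gives NA because the constant predictions have zero standard deviation, while the second gives an ordinary value between 0 and 1.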