Why do I keep getting an error when trying to undersample my data?

Problem description

I am trying to sample my data, but I keep getting this error:

Error in sample.int(length(x),size,replace,prob) : 
  cannot take a sample larger than the population when 'replace = FALSE'
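
For context, base R's sample() raises this message whenever more items are requested than the population contains and replace = FALSE. A minimal sketch with made-up values (the vector below is hypothetical, not the nn data):

```r
# Hypothetical toy population of 5 items (not the real data)
population <- 1:5

# Asking for 10 items without replacement cannot succeed;
# tryCatch captures the error message instead of stopping the script
result <- tryCatch(
  sample(population, size = 10, replace = FALSE),
  error = function(e) conditionMessage(e)
)
print(result)

# With replacement, the same request is valid
ok <- sample(population, size = 10, replace = TRUE)
print(length(ok))
```

So somewhere inside the call, the number of rows being requested from one class presumably exceeds the number of rows available in that class.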

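Essentially, I have an imbalanced dataset of parts that either pass or fail a test. Only about 1% of the parts fail, and I want to run some models to predict those failures. I want to balance the data first, and this is what I tried: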

# Split the data into training and test
library(caTools)
passes<-nn[grepl(0,nn$nfail),] # get all passes
fails<-nn[!grepl(0,nn$nfail),] # get all fails
splitp <- sample.split(passes$nfail,SplitRatio = 0.75)
splitf<- sample.split(fails$nfail,SplitRatio = 0.75)
trainp <- subset(passes,splitp == TRUE) # get 75% of passes for training
testp <- subset(passes,splitp == FALSE) # get 25% of passes for testing
trainf <- subset(fails,splitf == TRUE) # get 75% of fails for training
testf <- subset(fails,splitf == FALSE) # get 25% of fails for testing
train <- rbind(trainp,trainf) # combine training passes and fails
test<- rbind(testp,testf)

library(ROSE)
# Undersampling
data_balanced_under <- ovun.sample(nfail ~ .,data = train,method = "under",N = 2*nrow(trainf),seed = 1)$data
table(data_balanced_under$nfail)

# Oversampling
data_balanced_over <- ovun.sample(nfail ~ .,data = train,method = "over",N = 2*nrow(trainp),seed=1)$data
table(data_balanced_over$nfail)

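I basically split the dataset nn into training and test sets.

I want to undersample the training set so that it has twice as many rows as there are training fails (half passes and half fails). That is certainly much smaller than the population, so I don't understand the error.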
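
The target described here, every fail kept plus an equal number of randomly chosen passes, can be sketched in base R without ROSE. The data frame toy_train and its column x are made up for illustration. The sketch only shows the intended class counts and does not reproduce what ovun.sample does internally:

```r
set.seed(1)

# Hypothetical stand-in for the training set: 990 passes (0) and 10 fails (1)
toy_train <- data.frame(
  x = rnorm(1000),
  nfail = c(rep(0, 990), rep(1, 10))
)

# Row indices of each class
pass_rows <- which(toy_train$nfail == 0)
fail_rows <- which(toy_train$nfail != 0)

# Class counts that any requested sample size has to be compatible with
print(table(toy_train$nfail))

# Keep every fail and draw an equally sized set of passes without replacement
keep_pass <- sample(pass_rows, size = length(fail_rows), replace = FALSE)
toy_under <- toy_train[c(keep_pass, fail_rows), ]

print(table(toy_under$nfail))
```

Running table() on the real train set and comparing the counts with the requested N is a cheap way to check whether the request is feasible.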

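I also tried oversampling the data. That code runs, but it doesn't give me what I want. I want the oversampled set to have equal numbers of fails and passes (each equal to the number of passes in the training set). That doesn't happen: only about 5% of the new set are passes.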
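
For comparison, the oversampling target (all passes kept, fails resampled with replacement until the two classes are the same size) can be sketched in base R. As before, toy_train and x are hypothetical, and this is not a description of how ovun.sample chooses its rows:

```r
set.seed(1)

# Hypothetical stand-in for the training set: 990 passes (0) and 10 fails (1)
toy_train <- data.frame(
  x = rnorm(1000),
  nfail = c(rep(0, 990), rep(1, 10))
)

# Row indices of each class
pass_rows <- which(toy_train$nfail == 0)
fail_rows <- which(toy_train$nfail != 0)

# Resample fails with replacement until there are as many as passes
extra_fail <- sample(fail_rows, size = length(pass_rows), replace = TRUE)
toy_over <- toy_train[c(pass_rows, extra_fail), ]

print(table(toy_over$nfail))
```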

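I would appreciate any help. Note that all of the training and test datasets have the number of rows I expect, and they all have the same number of columns.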
