Training and tuning a random forest classifier in R under hardware constraints

Problem description

Apologies if this is not the right place to ask. I am using R caret on a desktop with 32 GB RAM and a 4-core CPU to train a random forest multi-class (8-class) classifier on experimental data. However, RStudio keeps complaining that it cannot allocate a vector of 9 GB. So, to be able to run the CV folds and some grid search, I had to cut the training set all the way down to 1% of the data. As a result, my model accuracy is ~50%, and the features it selects are not great either; only 2 of the 8 classes are reliably distinguished. It may of course be that I simply do not have any good features. But I would at least like to first train and tune the model on a reasonably large training set. Is there any solution that could help? Or is there somewhere I can upload the data and train the model remotely? I am so new to this that I do not know whether cloud-based services could help me. Any pointers would be much appreciated.
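Before renting cloud hardware, it may be worth shrinking caret's own memory footprint. A minimal sketch below, with two assumptions not in the original post: the third-party `ranger` package (a C++ random forest backend that caret supports and that typically needs far less RAM than `randomForest`), and the `returnData`/`trim` options of `trainControl()`, which stop caret from keeping extra copies of the data inside the fitted object. The built-in `iris` data stands in for the real table:

```r
# Sketch: memory-leaner RF training in caret, illustrated on iris.
# Assumes the 'caret' and 'ranger' packages are installed.
library(caret)

control <- trainControl(method = "cv", number = 5,
                        returnData = FALSE,  # don't keep a copy of the training data in the fit
                        trim = TRUE)         # strip components not needed for predict()

# method = "ranger" is a drop-in RF; its grid tunes mtry, splitrule and min.node.size
model <- train(x = iris[, -5], y = iris$Species,
               method = "ranger",
               trControl = control,
               tuneGrid = expand.grid(mtry = 1:3,
                                      splitrule = "gini",
                                      min.node.size = 1),
               num.trees = 200)
print(model)
```

`ranger` also trains markedly faster than the classic `randomForest` engine, which matters once the grid grows beyond a handful of `mtry` values.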

Edit: I have uploaded the data table and my code, in case it is my buggy code that is messing things up.

Here is the link to the data: https://drive.google.com/file/d/1wScYKd7J-KlRvvDxHAmG3_If1o5yUimy/view?usp=sharing

And here is my code:

#load libraries
library(data.table)
library(caret)
library(caTools)
library(e1071)

#read the data in
df.raw <- fread("CLL_merged_sampled_same_ctrl_40percent.csv", header = TRUE, data.table = FALSE)
#get the useful data
#subset and get rid of useless labels
df.1 <- subset(df.raw,select = c(18:131))
df <- subset(df.1,select = -c(2:4))

#As I want to build an RF model to classify drug treatments,
# make the treatmentsum column a factor
#there should be 7 levels
df$treatmentsum <- as.factor(df$treatmentsum)
df$treatmentsum

#find near-zero-variance features
#I did not remove them, just flagged them
nzv <- nearZeroVar(df[-1],saveMetrics= TRUE)
nzv[nzv$nzv==TRUE,]
possible.nzv.flagged <- nzv[nzv$nzv=="TRUE",]
write.csv(possible.nzv.flagged,"Near Zero Features flagged.CSV",row.names = TRUE)


 #identify correlated features
 df.Cor <- cor(df[-1])
 highCorr <- sum(abs(df.Cor[upper.tri(df.Cor)]) > .99)
 highlyCor <- findCorrelation(df.Cor,cutoff = .99,verbose = TRUE)

 #Get rid of strongly correlated features
 #note: findCorrelation() indexed df[-1], so shift by 1 to keep the label column
 filtered.df <- df[, -(highlyCor + 1)]

 str(filtered.df)
 #identify linear dependencies

 linear.combo <- findLinearCombos(filtered.df[-1])
 linear.combo #no linear ones detected

 #split data into training and test sets
 #Here is my problem: I want to use 80% of the data for training,
 #but on my computer I can only use 0.002

 set.seed(123)
 split <- sample.split(filtered.df$treatmentsum, SplitRatio = 0.8)
 training_set <- subset(filtered.df, split == TRUE)
 test_set <- subset(filtered.df, split == FALSE)

 #scaling numeric data
 #leave the first column (labels) out
 #scale the test set with the training set's centers and SDs,
 #not its own, to avoid information leakage
 train_scaled <- scale(training_set[-1])
 training_set[-1] <- train_scaled
 test_set[-1] <- scale(test_set[-1],
                       center = attr(train_scaled, "scaled:center"),
                       scale = attr(train_scaled, "scaled:scale"))
 training_set[1]


 #build RF
 #use cross-validation for model training
 #I can't use repeated CV as it fails on my machine
 #I set up a grid search for tuning

 control <- trainControl(method = "cv", number = 10, verboseIter = TRUE, search = "grid")

 #default mtry below is around 10
 #mtry <- sqrt(ncol(training_set))
 #I used mtry 1:12 to run, but I wanted to test more; limited again by the machine

 tunegrid <- expand.grid(.mtry = 1:20)
 #when x and y are passed directly, the data=, type= and maximize= arguments are not needed
 model <- train(x = training_set[, -1], y = as.factor(training_set[, 1]),
                method = "rf", trControl = control, metric = "Accuracy",
                importance = TRUE, ntree = 800, tuneGrid = tunegrid)

  print(model)
  plot(model)

  prediction2 <- predict(model, test_set[, -1])

  #positive = "1" only applies to two-class problems, so it is dropped here
  cm <- confusionMatrix(prediction2, as.factor(test_set[, 1]))
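One detail worth double-checking in the scaling step: the test set should be standardized with the training set's means and SDs rather than its own, or the two sets end up on slightly different scales. caret's `preProcess()` expresses this idiom directly; a sketch using the built-in `iris` data as a stand-in:

```r
# Sketch: fit the scaling on the training rows only, then apply it to both sets.
library(caret)

set.seed(123)
idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_x <- iris[idx, -5]
test_x  <- iris[-idx, -5]

pp <- preProcess(train_x, method = c("center", "scale"))  # learns means/SDs from training data
train_scaled <- predict(pp, train_x)
test_scaled  <- predict(pp, test_x)  # reuses the training means/SDs
```

The same `pp` object can also be passed to `train()` via its `preProcess` argument, in which case caret refits the scaling inside every CV fold for you.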
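On the "repeated CV fails on my machine" point: caret can spread CV folds across the 4 cores with a `doParallel` backend, which speeds tuning up without changing results. A sketch, again on `iris` and assuming the `doParallel` package is installed; note that each worker holds its own copy of the data, so with tight RAM fewer workers can actually be safer:

```r
# Sketch: parallel cross-validation in caret via doParallel.
library(caret)
library(doParallel)

cl <- makePSOCKcluster(2)   # 2 workers; more workers = more total RAM used
registerDoParallel(cl)

control <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
model <- train(x = iris[, -5], y = iris$Species,
               method = "rf", trControl = control,
               tuneGrid = expand.grid(.mtry = 1:2), ntree = 100)

stopCluster(cl)             # always release the workers when done
registerDoSEQ()             # drop back to sequential execution
```

With 32 GB of RAM and the 9 GB allocation failures described above, 2 workers is a deliberately conservative choice; bump it up only after watching actual memory use.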
