Problem Description
Apologies if this isn't the right place to ask. I am using a desktop machine with 32 GB of RAM and a 4-core CPU to train a random-forest multiclass (8-class) classifier on experimental data with R's caret package. However, RStudio keeps complaining that it cannot allocate a 9 GB vector, so to run the CV folds and some grid search I had to cut the training set all the way down to 1% of the data. As a result, my model accuracy is ~50%, and the selected features are not great either; only 2 of the 8 classes are reliably distinguished. Of course, it may simply be that I don't have any good features, but I would at least like to first train and tune the model on a reasonably large portion of the training data. Is there any solution that could help? Or is there somewhere I could upload the data and train it remotely? I am so new to this that I don't know whether cloud-based services could help me. Any pointers would be much appreciated.
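As a quick diagnostic (not from the original post), base R can report how large the loaded data frame actually is and what R is currently holding, which helps tell whether the 9 GB allocation comes from the data itself or from copies made during training. A minimal sketch, assuming the CSV file named later in the post:

```r
# Sketch: check the memory footprint of the data before training.
# Uses only base R; the file name is the one from the post.
df.raw <- read.csv("CLL_merged_sampled_same_ctrl_40percent.csv")
print(object.size(df.raw), units = "Gb")  # size of the raw data frame in memory
gc()  # summary of R's current memory usage (used / max used)
```

If the data frame itself is small, the allocation failure is likely coming from intermediate copies inside the resampling/tuning loop rather than from the data.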
Edit: I have uploaded the data table and my code, in case it is my buggy code that is messing things up.
Here is the link to the data: https://drive.google.com/file/d/1wScYKd7J-KlRvvDxHAmG3_If1o5yUimy/view?usp=sharing
And here is my code:
#load libraries
library(data.table)
library(caret)
library(caTools)
library(e1071)
#read the data in
df.raw <- fread("CLL_merged_sampled_same_ctrl_40percent.csv", header = TRUE, data.table = FALSE)
#get the useful data
#subset and get rid of useless labels
df.1 <- subset(df.raw,select = c(18:131))
df <- subset(df.1,select = -c(2:4))
#As I want to build a RF model to classify drug treatments
# make the treatmentsum column a factor
#there should be 7 levels
df$treatmentsum <- as.factor(df$treatmentsum)
df$treatmentsum
#find near-zero-variance features
#I did not remove them. Just flagged them
nzv <- nearZeroVar(df[-1],saveMetrics= TRUE)
nzv[nzv$nzv==TRUE,]
possible.nzv.flagged <- nzv[nzv$nzv=="TRUE",]
write.csv(possible.nzv.flagged,"Near Zero Features flagged.CSV",row.names = TRUE)
#identify correlated features
df.Cor <- cor(df[-1])
highCorr <- sum(abs(df.Cor[upper.tri(df.Cor)]) > .99)
highlyCor <- findCorrelation(df.Cor,cutoff = .99,verbose = TRUE)
#Get rid of strongly correlated features
filtered.df<- df[,-highlyCor]
str(filtered.df)
#identify linear dependencies
linear.combo <- findLinearCombos(filtered.df[-1])
linear.combo #no linear ones detected
#split data into training and test sets
#Here is my problem: I want to use 80% of the data for training,
#but on my machine I can only use 0.002
set.seed(123)
split <- sample.split(filtered.df$treatmentsum,SplitRatio = 0.8)
training_set <- subset(filtered.df,split==TRUE)
test_set <- subset(filtered.df,split==FALSE)
#scaling numeric data
#leave the first (label) column out
#note: strictly speaking, the test set should be scaled with the
#training-set means and SDs rather than with its own
training_set[-1] = scale(training_set[-1])
test_set[-1] = scale(test_set[-1])
training_set[1]
#build RF
#use Cross validation for model training
#I can't use repeated CV as it fails on my machine
#I set a grid search for tuning
control <- trainControl(method="cv",number=10,verboseIter = TRUE,search = 'grid')
#default mtry (below) is around 10
#mtry <- sqrt(ncol(training_set))
#I ran mtry 1:12, but I wanted to test more; limited again by the machine
tunegrid <- expand.grid(.mtry = (1:20))
#note: with the x/y interface, train() takes no data= argument, and
#randomForest infers classification from the factor response
model <- train(x = training_set[,-1], y = as.factor(training_set[,1]),
               method = "rf", trControl = control, metric = "Accuracy",
               maximize = TRUE, importance = TRUE, ntree = 800,
               tuneGrid = tunegrid)
print(model)
plot(model)
prediction2 <- predict(model,test_set[,-1])
#positive= only applies to binary outcomes, so it is dropped here
cm <- confusionMatrix(prediction2, as.factor(test_set[,1]))
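One commonly suggested workaround for memory pressure with caret's `rf` backend (not something the original post tried) is to swap in the `ranger` package, a faster and more memory-efficient random-forest implementation that caret supports directly. A minimal sketch, assuming the same `training_set` layout with the class label in column 1; `ranger`'s tuning grid in caret uses `mtry`, `splitrule`, and `min.node.size` instead of `mtry` alone:

```r
# Sketch: same 10-fold CV and mtry sweep, but with the ranger backend.
library(caret)
library(ranger)

control <- trainControl(method = "cv", number = 10, verboseIter = TRUE)
# ranger's caret grid needs all three columns
tunegrid <- expand.grid(mtry = 1:20,
                        splitrule = "gini",
                        min.node.size = 1)
model <- train(x = training_set[,-1], y = as.factor(training_set[,1]),
               method = "ranger", trControl = control,
               metric = "Accuracy", tuneGrid = tunegrid,
               num.trees = 800,           # passed through to ranger()
               importance = "impurity")   # ranger takes a string, not TRUE
```

The ranger backend also trains its trees in parallel across cores by default, which should help on a 4-core machine.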