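Problem description
I have a table of ticket information. One column is the ticket number, three more columns are free-form text fields containing multiple English words, and the last column (the classification) holds the group the ticket is assigned to.
For simplicity I just use Text### as the cell values, but in reality each of the Field1, Field2 and Field3 columns holds several sentences made up of multiple English words.
The data looks like the table below. The same table contains rows whose correct group has already been identified, as well as tickets still waiting to be assigned to a group.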
Ticket ID | Field1 | Field2 | Field3 | Group | DataOneY |
---|---|---|---|---|---|
00000001 | Text101 | Text102 | Text103 | B | B |
00000002 | Text101 | Text102 | Text103 | A | A |
00000003 | Text101 | Text102 | Text103 | B | B |
00000004 | Text101 | Text102 | Text103 | B | B |
00000005 | Text101 | Text102 | Text103 | C | C |
........ | ....... | ....... | ....... | ...... | ........ |
00000789 | Text101 | Text102 | Text103 | ||
00001232 | Text101 | Text102 | Text103 | ||
00012988 | Text101 | Text102 | Text103 | ||
........ | ....... | ....... | ....... | ...... | ........ |
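The task at hand is to predict the group assignment with an SVM, based on the already-labelled data, using the words from all the free-form text fields.
So I built a VCorpus and a DTM, and then started building my training, test and prediction data frames.
The tSparse data frame looks like this (Ticket ID is used as the row names):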
(Ticket ID) | Word1 | Word2 | Word3 | ..... | WordN | Group | DataOneY |
---|---|---|---|---|---|---|---|
00000001 | 0 | 1 | 0 | ..... | 2 | B | B |
00000001 | 1 | 1 | 3 | ..... | 0 | B | B |
00000002 | 0 | 1 | 0 | ..... | 1 | B | B |
00000103 | 2 | 3 | 3 | ..... | 0 | B | B |
00000084 | 0 | 1 | 0 | ..... | 0 | B | B |
.... | ... | ... | ... | ..... | ... | ... | ... |
00001249 | 0 | 1 | 0 | ..... | 2 | ||
00023232 | 0 | 2 | 2 | ..... | 1 | ||
00000098 | 4 | 1 | 0 | ..... | 1 | ||
.... | ... | ... | ... | ..... | ... | ... | ... |
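A data frame of this shape could be built roughly as follows. This is only a sketch under assumptions: the question does not show the corpus code, so the `tickets` data frame, its sample text and the single `removePunctuation` cleaning step are hypothetical, and the `tm` package is assumed to be installed. Only the outcome column is attached here, not a second copy of it.

```r
library(tm)

# Hypothetical tickets; the real Field1..Field3 hold several sentences each.
tickets <- data.frame(
  TicketID = c("00000001", "00000002", "00000789"),
  Field1   = c("Printer not working", "Cannot reset password", "Printer offline again"),
  Field2   = c("Office floor two", "Account locked", "Office floor three"),
  Field3   = c("Needs urgent repair", "User called twice", "Low priority"),
  Group    = c("B", "A", ""),
  stringsAsFactors = FALSE
)

# Merge the three free-text fields into one document per ticket.
docs   <- paste(tickets$Field1, tickets$Field2, tickets$Field3)
corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, removePunctuation)

# Rows are tickets, columns are terms, cells are term counts.
dtm <- DocumentTermMatrix(corpus)

# Convert to a plain data frame and use the ticket ID as row names.
tSparse <- as.data.frame(as.matrix(dtm))
rownames(tSparse) <- tickets$TicketID

# Attach the outcome once; unassigned tickets keep an empty string.
tSparse$dataOneY <- tickets$Group
```

On real data the term list is usually pruned (for example with `removeSparseTerms`) before converting to a data frame, since a dense matrix with one column per word grows quickly.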
buildDocCorpus <- reactive({
  #build the VCorpus and DTM
  #Build general dataframe with predictions to split train and test
  tSparse1_r <- tSparse %>% filter(tSparse$dataOneY != "")
  #make sure output column is a factor
  tSparse1_r$dataOneY <- factor(tSparse1_r$dataOneY)
  #Split into training and test dataframes (sets)
  trainSparse <- stratified(tSparse1_r, "dataOneY", .9, keep.rownames = TRUE)
  #make sure trainSparse is a dataframe and use ticket id as index (row names)
  trainSparse <- as.data.frame(trainSparse)
  rownames(trainSparse) <- trainSparse$rn
  trainSparse$rn <- NULL
  #create test dataframe by selecting tickets whose ID doesn't appear in training
  testSparse <- subset(tSparse1_r, !(rownames(tSparse1_r) %in% rownames(trainSparse)))
  #build predict set with rows that don't have a group assigned
  PredictSparse1 <- tSparse %>% filter(dataOneY == "" | (is.na(dataOneY)))
  PredictSparse1 <- subset(PredictSparse1, select = -c(dataOneY))
  return(
    list(
      trainSparse = trainSparse,
      testSparse = testSparse,
      PredictSparse = PredictSparse1
    )
  )
})
cfMtxSVM <- function(mymode) {
  #browser()
  mymode <- toString(mymode)
  bdc <- buildDocCorpus()
  trainSparse <- bdc$trainSparse
  if (mymode == "test") {
    mySparse <- bdc$testSparse
  } else if (mymode == "predict") {
    mySparse <- bdc$PredictSparse
  }
  #subset.test <- test[filt,]
  #rf =randomForest(dataOneY~ .,data=trainSparse)
  #PredictRF = predict(rf,newdata = mySparse)
  #
  trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
  svm_Linear <- train(
    dataOneY ~ .,
    data = trainSparse,
    method = "svmLinear",
    trControl = trctrl,
    preProcess = c("center", "scale"),
    tuneLength = 10
  )
  test_svm1 <- predict(svm_Linear, newdata = mySparse)
  #test_svm
  return(
    list(
      testOneY = mySparse$dataOneY,
      test_svm = test_svm1,
      trainSparse = trainSparse
    )
  )
}
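The train/test split in buildDocCorpus relies on `stratified()`, which is not part of base R (it presumably comes from the splitstackshape package). As a rough illustration of what that step is meant to achieve, here is a base R sketch that samples 90% of the ticket IDs within each class and keeps the rest for testing. The `labelled` data frame and its contents are made up for the example.

```r
set.seed(42)  # make the random split reproducible

# Hypothetical labelled data: ticket IDs as row names, one word-count column.
labelled <- data.frame(
  Word1     = rpois(40, lambda = 1),
  dataOneY  = factor(rep(c("A", "B", "C"), times = c(10, 20, 10))),
  row.names = sprintf("%08d", 1:40)
)

# Sample 90% of the ticket IDs within each class.
ids_by_class <- split(rownames(labelled), labelled$dataOneY)
train_ids <- unlist(
  lapply(ids_by_class, function(ids) sample(ids, size = round(0.9 * length(ids)))),
  use.names = FALSE
)

# Training rows are the sampled IDs; test rows are everything else.
trainSparse <- labelled[train_ids, , drop = FALSE]
testSparse  <- labelled[setdiff(rownames(labelled), train_ids), , drop = FALSE]

# Class counts in the training set.
print(table(trainSparse$dataOneY))
```

Selecting by row name on both sides keeps the two sets disjoint as long as the ticket IDs are unique.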
When I run the program like this:
tb1 <- cfMtxSVM(mymode = toString("predict"))
I get the following error:
Warning: Error in model.frame.default: factor Group has new level
[No stack trace available]
Of course, both the Group and DataOneY columns hold no value (NA) in the prediction data set.
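The message itself comes from `model.frame`, which refuses new data containing a factor value that was not present when the model was fitted. The base R sketch below reproduces the same kind of error on toy data. It uses `lm` purely as a stand-in for the SVM, and it assumes the unassigned tickets carry an empty string in Group, as the `dataOneY == ""` filter in the code above suggests; the data values are invented.

```r
# Toy training data in which Group is (unintentionally) used as a predictor.
train <- data.frame(
  Word1 = c(0, 1, 2, 0, 3, 1),
  Group = factor(c("A", "B", "B", "A", "C", "C"))
)
fit <- lm(Word1 ~ Group, data = train)

# An unassigned ticket: Group is an empty string, a value never seen in training.
unassigned <- data.frame(Group = factor(""))

# predict() rebuilds the model frame from newdata and stops on the unseen level.
result <- tryCatch(
  predict(fit, newdata = unassigned),
  error = function(e) e
)
print(conditionMessage(result))
```

Any model fitted through a formula that includes Group would be expected to hit the same check when it predicts on such rows.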
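From what I have found, it seems I need to assign levels to the Group column in the prediction data set. These are all the attempts I have made, and every one of them returned an error: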
#Attempt 1: Remove both output columns
#PredictSparse1<-subset(PredictSparse1,select = -c(Group,dataOneY))
#Attempt 2: Make PredictSpare Group column a factor
#PredictSparse1$Group<-factor(PredictSparse1$Group)
#Attempt 3: copy Levels from trainSparse to PredictSparse
#levels(PredictSparse1$Group) <- levels(trainSparse$Group)
#Attempt 4: Like 3 but making it factor
#PredictSparse1$Failure_Mode <- factor(
# PredictSparse1$Failure_Mode,levels = levels(trainSparse$Failure_Mode)
#)
#Attempt 5: Manually specify levels and add NA that is in output column
lvls <- c('A','B','C')
PredictSparse1$Group <- sapply(PredictSparse1$Group,factor,levels=lvls)
PredictSparse1$Group <- addNA(PredictSparse1$Group)
#Attempt 6: Same as 5 but for the three datasets (train,test and predict)
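I am at my wits' end. Could you please explain how to fix the "has new level" error?
In case it helps, I also ran RandomForest on exactly the same training, test and prediction data sets, and it ran fine every time, except when I had first applied one of the level fixes above, in which case it crashed as well.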
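Solution
Rookie mistake!
dataOneY and Group are copies of each other, so I actually had data leakage in the model. Because Group was still being used as a predictor, its empty values in the prediction set counted as a level the model had never seen, which is what triggered the error.
After removing Group from the training and test data sets and re-running the model training, I was able to get correct results from the SVM predict.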
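A minimal sketch of that fix, using small made-up stand-ins for the three data sets: it first checks that Group really duplicates the outcome, then drops it everywhere before training. The helper `drop_cols` is hypothetical, and dropping Group from the prediction set as well is an extra precaution rather than something the fix strictly requires.

```r
# Toy stand-ins for the three data sets from the question.
trainSparse <- data.frame(
  Word1 = c(0, 1, 2), Word2 = c(1, 0, 1),
  Group = c("A", "B", "B"), dataOneY = factor(c("A", "B", "B"))
)
testSparse <- data.frame(
  Word1 = c(1, 0), Word2 = c(2, 1),
  Group = c("B", "A"), dataOneY = factor(c("B", "A"))
)
PredictSparse <- data.frame(
  Word1 = c(0, 4), Word2 = c(1, 1),
  Group = c("", "")
)

# Leakage check: is Group just a copy of the outcome column?
is_copy <- identical(
  as.character(trainSparse$Group),
  as.character(trainSparse$dataOneY)
)

# Hypothetical helper: drop the named columns if they are present.
drop_cols <- function(df, cols) df[, setdiff(names(df), cols), drop = FALSE]

trainSparse   <- drop_cols(trainSparse, "Group")
testSparse    <- drop_cols(testSparse, "Group")
PredictSparse <- drop_cols(PredictSparse, "Group")

# Every predictor used in training must also exist in the prediction set.
predictors <- setdiff(names(trainSparse), "dataOneY")
stopifnot(all(predictors %in% names(PredictSparse)))
```

With Group gone, the `train()` and `predict()` calls from the question should work as written, since the model only ever sees word-count columns.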