r - Why does my SVM function fail on the prediction dataset when the test and training datasets work?

Problem description

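I have a table of ticket information. One column is the ticket number, three more are free-form text fields that each contain multiple English words, and the last column (the classification) holds the group the ticket was assigned to.

For simplicity I show Text### as the cell values, but in reality each of the Field1, Field2, and Field3 columns holds several sentences made up of multiple English words.

The data looks like the following. The same table contains rows where the correct group is already identified and some tickets that are still waiting to be assigned to a group.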

Ticket   Field1  Field2  Field3  Group  dataOneY
00000001 Text101 Text102 Text103 B      B
00000002 Text101 Text102 Text103 A      A
00000003 Text101 Text102 Text103 B      B
00000004 Text101 Text102 Text103 B      B
00000005 Text101 Text102 Text103 C      C
........ ....... ....... ....... ...... ........
00000789 Text101 Text102 Text103
00001232 Text101 Text102 Text103
00012988 Text101 Text102 Text103
........ ....... ....... ....... ...... ........

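The task at hand is to predict the group assignment with an SVM, based on the earlier data and using the words from all of the free-form text fields.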

So I built the VCorpus and the DTM, and then started building my training, test, and prediction data frames.
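For readers unfamiliar with the term, here is a minimal sketch of what a document-term matrix holds. The question builds it with a VCorpus and a DTM (the tm package); this toy version uses only base R so that it runs without extra packages, and the ticket texts are hypothetical, not taken from the question.

```r
# Hypothetical tickets; the texts and labels are made up for illustration.
tickets <- data.frame(
  Ticket   = c("00000001", "00000002", "00000789"),
  Field1   = c("printer is offline", "reset my password", "printer jams again"),
  Field2   = c("office printer", "password expired", "paper jam"),
  dataOneY = c("B", "A", ""),
  stringsAsFactors = FALSE
)

# One document per ticket: all free-text fields pasted together.
docs   <- tolower(paste(tickets$Field1, tickets$Field2))
tokens <- strsplit(docs, "[^a-z]+")
vocab  <- sort(unique(unlist(tokens)))

# Count how often each vocabulary word occurs in each document.
counts <- vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
)
dtm <- t(counts)                      # rows = tickets, columns = words
colnames(dtm) <- vocab

# Data frame with the ticket ID as row name and the label as last column.
tSparse <- as.data.frame(dtm)
rownames(tSparse) <- tickets$Ticket
tSparse$dataOneY  <- tickets$dataOneY
print(tSparse)
```

Every column except the label is a word count, which is why any extra non-word column left in the data frame ends up as a predictor under a `dataOneY ~ .` formula.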

The tSparse data frame looks like this (the ticket ID is used as the row name):

         Word1 Word2 Word3 ..... WordN Group dataOneY
00000001 0     1     0     ..... 2     B     B
00000001 1     1     3     ..... 0     B     B
00000002 0     1     0     ..... 1     B     B
00000103 2     3     3     ..... 0     B     B
00000084 0     1     0     ..... 0     B     B
....     ...   ...   ...   ..... ...   ...   ...
00001249 0     1     0     ..... 2
00023232 0     2     2     ..... 1
00000098 4     1     0     ..... 1
....     ...   ...   ...   ..... ...   ...   ...

buildDocCorpus <- reactive({
    #build the VCorpus and DTM
    #Build general dataframe with predictions to split train and test
    tSparse1_r<-tSparse%>%filter(tSparse$dataOneY!="")

    #make sure output column is a factor
    tSparse1_r$dataOneY<-factor(tSparse1_r$dataOneY)
    #Split into training and test dataframes (sets)
    trainSparse <- stratified(tSparse1_r,"dataOneY",.9,keep.rownames=TRUE)
    #make sure trainSparse is a dataframe and use ticket id as index (row names)
    trainSparse <- as.data.frame(trainSparse)
    rownames(trainSparse) <- trainSparse$rn
    trainSparse$rn <- NULL
    #create test dataframe by selecting tickets whose ID doesn't appear in training
    testSparse = subset(tSparse1_r,!(rownames(tSparse1_r) %in% rownames(trainSparse)))
    #build predict set with rows that don't have a group assigned
    PredictSparse1<-tSparse%>%filter(dataOneY==""|(is.na(dataOneY)))
    PredictSparse1<-subset(PredictSparse1,select = -c(dataOneY))
    return(
          list(
            trainSparse = trainSparse,testSparse = testSparse,PredictSparse = PredictSparse1
          )
        )
      })

cfMtxSVM <- function(mymode){
    #browser()
    mymode = toString(mymode)
    bdc <- buildDocCorpus()
    trainSparse <- bdc$trainSparse
    if(mymode == "test"){
      mySparse <- bdc$testSparse
    }
    else if (mymode == "predict"){
      mySparse <- bdc$PredictSparse
    }


    #subset.test <- test[filt,]
    #rf =randomForest(dataOneY~ .,data=trainSparse)
    #PredictRF = predict(rf,newdata = mySparse)
    #
    trctrl <- trainControl(method = "repeatedcv",number = 10,repeats = 3)
    svm_Linear <- train(dataOneY ~.,data = trainSparse,method = "svmLinear",trControl=trctrl,preProcess = c("center","scale"),tuneLength = 10)
    test_svm1 <- predict(svm_Linear,newdata = mySparse)

    #test_svm
    return(
      list(
        testOneY = mySparse$dataOneY,test_svm = test_svm1,trainSparse = trainSparse
      )
    )
  }

When I run the program like this:

tb1 <- cfMtxSVM(mymode =  toString("predict"))

I get the following error:

Warning: Error in model.frame.default: factor Group has new level 

[No stack trace available]

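Of course, the Group and dataOneY columns are both N/A in the prediction dataset.

From what I have found, it seems I need to assign levels to the Group column in the prediction dataset. These are all the attempts I have made, and every one of them returns an error: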

#Attempt 1: Remove both output columns
#PredictSparse1<-subset(PredictSparse1,select = -c(Group,dataOneY))

#Attempt 2: Make PredictSparse Group column a factor
#PredictSparse1$Group<-factor(PredictSparse1$Group)

#Attempt 3: copy Levels from trainSparse to PredictSparse
#levels(PredictSparse1$Group) <- levels(trainSparse$Group)

#Attempt 4: Like 3 but making it factor
#PredictSparse1$Failure_Mode <- factor(
#  PredictSparse1$Failure_Mode,levels = levels(trainSparse$Failure_Mode)
#)

#Attempt 5: Manually specify levels and add NA that is in output column
lvls <- c('A','B','C')
PredictSparse1$Group <-  sapply(PredictSparse1$Group,factor,levels=lvls)
PredictSparse1$Group <- addNA(PredictSparse1$Group)

#Attempt 6: Same as 5 but for the three datasets (train,test and predict)

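I am at my wits' end. Could you please explain how to fix the "has new level" error?

If it helps, I also ran RandomForest on exactly the same training, test, and prediction datasets, and it ran fine every time, except when I had first tried to fix the level error, in which case it crashed too.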

Solution

Rookie mistake!

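dataOneY and Group are duplicates of each other, so I actually had data leakage in the model: with the formula dataOneY ~ ., Group was being used as a predictor.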
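Why the leaked column produces this particular error can be reproduced with a small sketch. It assumes, as the empty level name at the end of the error message suggests, that Group is an empty string for the unassigned tickets. lm with a made-up numeric response (score) stands in for the SVM purely so the example runs in base R; the failing check is the same model.frame check named in the error. All data here is hypothetical.

```r
# Hypothetical training data in which Group is a copy of the label.
train <- data.frame(
  Word1    = c(0, 1, 2, 0, 1, 3),
  Group    = factor(c("A", "B", "C", "A", "B", "C")),
  dataOneY = factor(c("A", "B", "C", "A", "B", "C")),
  score    = c(1.0, 2.0, 3.0, 1.5, 2.5, 3.5)  # stand-in numeric response
)

# A quick leakage check: is a predictor identical to the label?
is_duplicate <- all(as.character(train$Group) == as.character(train$dataOneY))
print(is_duplicate)

# Fit a stand-in model that, like `dataOneY ~ .`, uses Group as a predictor.
fit <- lm(score ~ Word1 + Group, data = train)

# Unassigned tickets: Group is "", a level the model never saw in training.
new_tickets <- data.frame(Word1 = c(1, 2), Group = factor(c("", "")))

# Capture the error message instead of stopping the script.
msg <- tryCatch(
  {
    predict(fit, newdata = new_tickets)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

The test set does not hit this because its Group values are all levels that also occur in training; only the unassigned rows carry a value the model has never seen.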

After removing Group from the training and test datasets and re-running the model training, the SVM predict call returned results correctly.
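A minimal sketch of that fix, under the assumption that Group is the only leaked column: drop it once from the full data frame before splitting, so that the training, test, and prediction sets all share the same predictors. The data and the leak_cols and label_col helper names are hypothetical; the model call itself stays as in the question.

```r
label_col <- "dataOneY"
leak_cols <- "Group"   # assumption: the only column duplicating the label

# Hypothetical stand-in for the tSparse data frame from the question.
tSparse <- data.frame(
  Word1    = c(0, 1, 0, 2, 0, 4),
  Word2    = c(1, 1, 1, 3, 2, 1),
  Group    = c("B", "A", "B", "C", "", ""),
  dataOneY = c("B", "A", "B", "C", "", ""),
  row.names = c("00000001", "00000002", "00000003",
                "00000005", "00000789", "00001232"),
  stringsAsFactors = FALSE
)

# 1. Remove the leaked column before any splitting happens.
tSparse <- tSparse[, setdiff(names(tSparse), leak_cols), drop = FALSE]

# 2. Labelled rows feed the train/test split; the label becomes a factor.
has_label <- !is.na(tSparse[[label_col]]) & tSparse[[label_col]] != ""
labelled  <- tSparse[has_label, , drop = FALSE]
labelled[[label_col]] <- factor(labelled[[label_col]])

# 3. Unlabelled rows form the prediction set, without the label column.
PredictSparse <- tSparse[!has_label, setdiff(names(tSparse), label_col),
                         drop = FALSE]

# 4. Sanity check: the prediction set has exactly the model's predictors.
stopifnot(setequal(names(PredictSparse),
                   setdiff(names(labelled), label_col)))

# From here, split `labelled` into training and test sets and call
# train(dataOneY ~ ., ...) and predict(...) as in the question.
print(names(PredictSparse))
```

Dropping the column at the source, rather than in each subset separately, avoids the situation in the question where the prediction set and the training set ended up with different columns.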