Question: NAs created in the appended column when appending a column as a factor to a data frame in R

Problem description

I am new to R and am trying to learn ML using the caret package.

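Problem: after creating dummies and removing the NZV variables, when I add the Y (predicted) variable back to the df as a factor, NAs are created in that column (see steps 5-6). So how do I get the Y variable into the final df as a factor?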

1. Data (bank marketing response data from UCI / Kaggle)

str(data)
'data.frame':   4119 obs. of  21 variables:
 $ age           : int  30 39 25 38 47 32 32 41 31 35 ...
 $ job           : Factor w/ 12 levels "admin.","blue-collar",..: 2 8 8 8 1 8 1 3 8 2 ...
 $ marital       : Factor w/ 4 levels "divorced","married",..: 2 3 2 2 2 3 3 2 1 2 ...
 $ education     : Factor w/ 8 levels "basic.4y","basic.6y",..: 3 4 4 3 7 7 7 7 6 3 ...
 $ default       : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 1 1 1 2 1 2 ...
 $ housing       : Factor w/ 3 levels "no",..: 3 1 3 2 3 1 3 3 1 1 ...
 $ loan          : Factor w/ 3 levels "no",..: 1 1 1 2 1 1 1 1 1 1 ...
 $ contact       : Factor w/ 2 levels "cellular","telephone": 1 2 2 2 1 1 1 1 1 2 ...
 $ month         : Factor w/ 10 levels "apr","aug","dec",..: 7 7 5 5 8 10 10 8 8 7 ...
 $ day_of_week   : Factor w/ 5 levels "fri","mon","thu",..: 1 1 5 1 2 3 2 2 4 3 ...
 $ duration      : int  487 346 227 17 58 128 290 44 68 170 ...
 $ campaign      : int  2 4 1 3 1 3 4 2 1 1 ...
 $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
 $ previous      : int  0 0 0 0 0 2 0 0 1 0 ...
 $ poutcome      : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 1 2 2 1 2 ...
 $ emp.var.rate  : num  -1.8 1.1 1.4 1.4 -0.1 -1.1 -1.1 -0.1 -0.1 1.1 ...
 $ cons.price.idx: num  92.9 94 94.5 94.5 93.2 ...
 $ cons.conf.idx : num  -46.2 -36.4 -41.8 -41.8 -42 -37.5 -37.5 -42 -42 -36.4 ...
 $ euribor3m     : num  1.31 4.86 4.96 4.96 4.19 ...
 $ nr.employed   : num  5099 5191 5228 5228 5196 ...
 $ y             : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

2. Save the X and Y variables

Y = subset(data,select = y)
X = subset(data,select = -y)

dim(X)
dim(Y)
[1] 4119   20
[1] 4119    1

3. Created dummies

library(caret)   # provides dummyVars() and nearZeroVar()
library(dplyr)   # provides %>% and mutate()

pp_dummy <- dummyVars(y ~ .,data = data)

data <- predict(pp_dummy,newdata = data)

data <- data.frame(data)

4. Removed variables using near-zero variance

nzv_list <- nearZeroVar(data) %>% 
            as.vector()

data <- data[,-nzv_list ]

str(data)
'data.frame':   4119 obs. of  44 variables:
 $ age                          : num  30 39 25 38 47 32 32 41 31 35 ...
 $ job.admin.                   : num  0 0 0 0 1 0 1 0 0 0 ...
 $ job.blue.collar              : num  1 0 0 0 0 0 0 0 0 1 ...
 $ job.management               : num  0 0 0 0 0 0 0 0 0 0 ...
 $ job.services                 : num  0 1 1 1 0 1 0 0 1 0 ...
 $ job.technician               : num  0 0 0 0 0 0 0 0 0 0 ...
 $ marital.divorced             : num  0 0 0 0 0 0 0 0 1 0 ...
 $ marital.married              : num  1 0 1 1 1 0 0 1 0 1 ...
 $ marital.single               : num  0 1 0 0 0 1 1 0 0 0 ...
 $ education.basic.4y           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ education.basic.6y           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ education.basic.9y           : num  1 0 0 1 0 0 0 0 0 1 ...
 $ education.high.school        : num  0 1 1 0 0 0 0 0 0 0 ...
 $ education.professional.course: num  0 0 0 0 0 0 0 0 1 0 ...
 $ education.university.degree  : num  0 0 0 0 1 1 1 1 0 0 ...
 $ default.no                   : num  1 1 1 1 1 1 1 0 1 0 ...
 $ default.unknown              : num  0 0 0 0 0 0 0 1 0 1 ...
 $ housing.no                   : num  0 1 0 0 0 1 0 0 1 1 ...
 $ housing.yes                  : num  1 0 1 0 1 0 1 1 0 0 ...
 $ loan.no                      : num  1 1 1 0 1 1 1 1 1 1 ...
 $ loan.yes                     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ contact.cellular             : num  1 0 0 0 1 1 1 1 1 0 ...
 $ contact.telephone            : num  0 1 1 1 0 0 0 0 0 1 ...
 $ month.apr                    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ month.aug                    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ month.jul                    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ month.jun                    : num  0 0 1 1 0 0 0 0 0 0 ...
 $ month.may                    : num  1 1 0 0 0 0 0 0 0 1 ...
 $ month.nov                    : num  0 0 0 0 1 0 0 1 1 0 ...
 $ day_of_week.fri              : num  1 1 0 1 0 0 0 0 0 0 ...
 $ day_of_week.mon              : num  0 0 0 0 1 0 1 1 0 0 ...
 $ day_of_week.thu              : num  0 0 0 0 0 1 0 0 0 1 ...
 $ day_of_week.tue              : num  0 0 0 0 0 0 0 0 1 0 ...
 $ day_of_week.wed              : num  0 0 1 0 0 0 0 0 0 0 ...
 $ duration                     : num  487 346 227 17 58 128 290 44 68 170 ...
 $ campaign                     : num  2 4 1 3 1 3 4 2 1 1 ...
 $ previous                     : num  0 0 0 0 0 2 0 0 1 0 ...
 $ poutcome.failure             : num  0 0 0 0 0 1 0 0 1 0 ...
 $ poutcome.nonexistent         : num  1 1 1 1 1 0 1 1 0 1 ...
 $ emp.var.rate                 : num  -1.8 1.1 1.4 1.4 -0.1 -1.1 -1.1 -0.1 -0.1 1.1 ...
 $ cons.price.idx               : num  92.9 94 94.5 94.5 93.2 ...
 $ cons.conf.idx                : num  -46.2 -36.4 -41.8 -41.8 -42 -37.5 -37.5 -42 -42 -36.4 ...
 $ euribor3m                    : num  1.31 4.86 4.96 4.96 4.19 ...
 $ nr.employed                  : num  5099 5191 5228 5228 5196 ...

5. Problem: appending y to data as a factor produces a column of NAs.

data$y <- as.factor(Y)

str(data)
'data.frame':   4119 obs. of  45 variables:
 $ age                          : num  30 39 25 38 47 32 32 41 31 35 ...
 $ job.admin.                   : num  0 0 0 0 1 0 1 0 0 0 ...
 $ job.blue.collar              : num  1 0 0 0 0 0 0 0 0 1 ...
 $ job.management               : num  0 0 0 0 0 0 0 0 0 0 ...
 $ job.services                 : num  0 1 1 1 0 1 0 0 1 0 ...
 $ job.technician               : num  0 0 0 0 0 0 0 0 0 0 ...
 $ marital.divorced             : num  0 0 0 0 0 0 0 0 1 0 ...
 $ marital.married              : num  1 0 1 1 1 0 0 1 0 1 ...
 $ marital.single               : num  0 1 0 0 0 1 1 0 0 0 ...
 $ education.basic.4y           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ education.basic.6y           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ education.basic.9y           : num  1 0 0 1 0 0 0 0 0 1 ...
 $ education.high.school        : num  0 1 1 0 0 0 0 0 0 0 ...
 $ education.professional.course: num  0 0 0 0 0 0 0 0 1 0 ...
 $ education.university.degree  : num  0 0 0 0 1 1 1 1 0 0 ...
 $ default.no                   : num  1 1 1 1 1 1 1 0 1 0 ...
 $ default.unknown              : num  0 0 0 0 0 0 0 1 0 1 ...
 $ housing.no                   : num  0 1 0 0 0 1 0 0 1 1 ...
 $ housing.yes                  : num  1 0 1 0 1 0 1 1 0 0 ...
 $ loan.no                      : num  1 1 1 0 1 1 1 1 1 1 ...
 $ loan.yes                     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ contact.cellular             : num  1 0 0 0 1 1 1 1 1 0 ...
 $ contact.telephone            : num  0 1 1 1 0 0 0 0 0 1 ...
 $ month.apr                    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ month.aug                    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ month.jul                    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ month.jun                    : num  0 0 1 1 0 0 0 0 0 0 ...
 $ month.may                    : num  1 1 0 0 0 0 0 0 0 1 ...
 $ month.nov                    : num  0 0 0 0 1 0 0 1 1 0 ...
 $ day_of_week.fri              : num  1 1 0 1 0 0 0 0 0 0 ...
 $ day_of_week.mon              : num  0 0 0 0 1 0 1 1 0 0 ...
 $ day_of_week.thu              : num  0 0 0 0 0 1 0 0 0 1 ...
 $ day_of_week.tue              : num  0 0 0 0 0 0 0 0 1 0 ...
 $ day_of_week.wed              : num  0 0 1 0 0 0 0 0 0 0 ...
 $ duration                     : num  487 346 227 17 58 128 290 44 68 170 ...
 $ campaign                     : num  2 4 1 3 1 3 4 2 1 1 ...
 $ previous                     : num  0 0 0 0 0 2 0 0 1 0 ...
 $ poutcome.failure             : num  0 0 0 0 0 1 0 0 1 0 ...
 $ poutcome.nonexistent         : num  1 1 1 1 1 0 1 1 0 1 ...
 $ emp.var.rate                 : num  -1.8 1.1 1.4 1.4 -0.1 -1.1 -1.1 -0.1 -0.1 1.1 ...
 $ cons.price.idx               : num  92.9 94 94.5 94.5 93.2 ...
 $ cons.conf.idx                : num  -46.2 -36.4 -41.8 -41.8 -42 -37.5 -37.5 -42 -42 -36.4 ...
 $ euribor3m                    : num  1.31 4.86 4.96 4.96 4.19 ...
 $ nr.employed                  : num  5099 5191 5228 5228 5196 ...
 $ y                            : Factor w/ 1 level "1:2": NA NA NA NA NA NA NA NA NA NA ...

6. If I append Y as is, it does not create NAs immediately, but when I convert it to a factor it gives NA:

data$y <- Y # as.factor(Y)
data <- data %>% mutate(y = as.factor(y))
str(data)

(Update)

7. If I do not convert it to a factor, I always have to use pull(data$y) instead of just data$y. Example below:

subsets <- c(7,10,12,15,20)
control <- rfeControl(functions = rfFuncs,method = "cv",verbose = FALSE)
system.time(
  RFE_res <- rfe(x = data[,1:44],# subset(train,select = -y)
                 y = pull(data$y),
                 sizes = subsets,
                 rfeControl = control
  )
)

How can I avoid pull(data$y) and just use data$y?

Solution

This has nothing to do with pull().

You cannot convert a data.frame into a vector with as.factor(), even if it has only one column:

X = subset(iris,select=-Species)
Y = subset(iris,select=Species)

as.factor(Y)
Species 
   <NA> 
Levels: 1:3

.valid.factor(Y)
[1] "factor levels must be \"character\""

levels(Y)
NULL
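
One way to see what is going on is to check what kind of object Y actually is. The following is a minimal sketch using only base R and the built-in iris data; the name col is illustrative and not part of the original code.

```r
# subset(..., select = ) returns a data.frame, even for a single column
Y <- subset(iris, select = Species)

class(Y)       # "data.frame"
is.list(Y)     # TRUE: a data.frame is a list of columns
is.factor(Y)   # FALSE: the factor is the column inside Y, not Y itself

# The column itself is the factor that carries the levels we want
col <- Y$Species
class(col)     # "factor"
levels(col)    # "setosa" "versicolor" "virginica"
length(col)    # 150
```

So the levels live on the column, and as.factor() has to be given that column rather than the data.frame that wraps it.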

You need to pull the column out of the data.frame:

X$y = as.factor(Y$Species)
# or X %>% mutate(y = as.factor(Y$Species))

> str(X)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ y           : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
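
The same distinction probably explains why step 7 of the question needed pull(): assigning the one-column data.frame Y directly stores a data.frame inside the column, not a vector. The sketch below illustrates this with iris and base R only; the names nested and flat are made up for the example.

```r
X <- subset(iris, select = -Species)
Y <- subset(iris, select = Species)   # one-column data.frame

# Assigning the data.frame itself nests it inside the new column
nested <- X
nested$y <- Y
is.data.frame(nested$y)   # TRUE: nested$y is not a plain vector

# Extracting the column first gives an ordinary factor column
flat <- X
flat$y <- Y[[1]]          # equivalent to Y$Species here
is.factor(flat$y)         # TRUE
anyNA(flat$y)             # FALSE
nlevels(flat$y)           # 3
```

With the column stored as a plain factor, data$y should be usable directly wherever a vector is expected, for example as the y argument of rfe(), without wrapping it in pull().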