Scaling only some of the numeric columns in the train and test sets of a mixed data frame

Problem description

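The following code scales the train and test sets. Because Col6 and Col7 must not be scaled, they are removed from the original data before the remaining numeric columns are scaled: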

library(tidyverse)

# NOTE: as posted, Col1 and Col4 to Col7 list fewer than nine values;
# data.frame() needs the column lengths to match (nine rows here) before it will run.
Data_Frame <- data.frame(
  Col1 = c("A1", "A1", "A2", "A3", "A3"),
  Col2 = c("2011-03-11", "2014-08-21", "2016-01-17", "2017-06-30", "2018-07-11", "2018-11-28", "2019-09-04", "2020-02-29", "2020-07-12"),
  Col3 = c("2018-10-22", "2019-05-24", "2020-12-25", "2018-10-12", "2019-09-24", "2020-12-19", "2018-10-22", "2019-06-14", "2020-12-20"),
  Col4 = c(4, 12, 2, 1, 4, 75, 44),
  Col5 = c(7.81, 6.45, 3, 5, 2),
  Col6 = c(1, 1),
  Col7 = c(2),
  Col8 = c(7.77, 6, 8.4, -11.23, 3.5, 7.2, 15, 100, 22.22)
)

# randomly split data in r
sample_size = floor(0.8*nrow(Data_Frame))
set.seed(777)
picked = sample(seq_len(nrow(Data_Frame)),size = sample_size)
Train_Set = Data_Frame[picked,]
Test_Set = Data_Frame[-picked,]

# Remove columns Col6 and Col7,which will not be scaled
Train <- Train_Set %>% dplyr::select(- c(Col6,Col7))
Test <- Test_Set %>% dplyr::select(- c(Col6,Col7))

# Scale Train,collect mean and sd to scale in Test
Train_Scale <- Train %>% dplyr::mutate_if(is.numeric,~scale(.) %>% as.vector)
num_cols <- names(which(sapply(Train,is.numeric)))
scale_params <- attributes(scale(Train[,num_cols]))[c("scaled:center","scaled:scale")]

# Scale Test with the scales of Train
Test_Scale <- Test
Test_Scale[,num_cols] = scale(Test_Scale[,num_cols],center=scale_params[[1]],scale=scale_params[[2]])

What I tried

varnames <- c('Col6','Col7')
index <- names(Train_Set) %in% varnames
Train_Scale_Check <- Train_Set[,!index] %>% dplyr::mutate_if(is.numeric,~scale(.) %>% as.vector)

This works, but it drops Col6 and Col7 from the data frame.

Also,

Train_Scale_Check <- Train_Set %>% dplyr::mutate_if(is.numeric,!index,~scale(.) %>% as.vector)

throws the following error:

Error: expecting a one sided formula, a function, or a function name.
Run `rlang::last_error()` to see where the error occurred.

rlang::last_error()
<error/rlang_error>
expecting a one sided formula, a function, or a function name.
Backtrace:
 1. dplyr::mutate_if(...)
 2. dplyr:::manip_if(...)
 3. dplyr:::as_fun_list(.funs, .env, ..., .caller = .caller)
 4. dplyr:::map(...)
 5. base::lapply(.x, .f, ...)
 6. dplyr:::FUN(X[[i]], ...)
Run `rlang::last_trace()` to see the full context.
> rlang::last_trace()
<error/rlang_error>
expecting a one sided formula, a function, or a function name.
Backtrace:
    x
 1. \-dplyr::mutate_if(...)
 2.   \-dplyr:::manip_if(...)
 3.     \-dplyr:::as_fun_list(.funs, .env, ..., .caller = .caller)
 4.       \-dplyr:::map(...)
 5.         \-base::lapply(.x, .f, ...)
 6.           \-dplyr:::FUN(X[[i]], ...)
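
The error comes down to argument order: mutate_if() is declared as mutate_if(.tbl, .predicate, .funs, ...), so the logical vector !index ends up in .funs, which is neither a formula nor a function. The predicate also only sees column contents, not column names, so it cannot exclude columns by name. As a minimal sketch of one way around this within the same family of verbs, the names of the columns to scale can be computed first and handed to mutate_at(). The data frame toy and its values are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data; Col6 and Col7 are numeric but must stay unscaled
toy <- data.frame(
  Col1 = c("A1", "A2", "A3", "A4"),
  Col4 = c(4, 12, 2, 1),
  Col6 = c(1, 0, 1, 0),
  Col7 = c(2, 3, 2, 5)
)
varnames <- c("Col6", "Col7")

# Names of the numeric columns, minus the ones to leave alone
to_scale <- setdiff(names(which(sapply(toy, is.numeric))), varnames)

# mutate_at() accepts a character vector of names, so the exclusion is by name
toy_scaled <- toy %>%
  mutate_at(to_scale, ~ as.vector(scale(.)))
```

Col6 and Col7 stay in the result, in their original positions and with their original values.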

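Is there a simple way to keep Col6 and Col7 in the Train_Set and Test_Set datasets without scaling them? The long-winded way would be to extract Col6 and Col7 into separate data frames, run the code at the top, and finally bind the Col6 and Col7 data frames back on.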
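
For comparison, the long-winded route might look roughly like the sketch below. It runs on made-up data (toy is hypothetical, not the data frame from the question). Note that bind_cols() appends the kept columns at the end, so the original column order is not preserved.

```r
library(dplyr)

# Hypothetical toy data standing in for Train_Set
toy <- data.frame(
  Col1 = c("A1", "A2", "A3", "A4"),
  Col4 = c(4, 12, 2, 1),
  Col6 = c(1, 0, 1, 0),
  Col7 = c(2, 3, 2, 5),
  Col8 = c(7.77, 6, 8.4, -11.23)
)

# 1. Set the columns that must not be scaled aside
kept <- toy %>% select(Col6, Col7)

# 2. Scale the numeric columns of what is left
scaled <- toy %>%
  select(-c(Col6, Col7)) %>%
  mutate_if(is.numeric, ~ as.vector(scale(.)))

# 3. Bind the untouched columns back on (they end up last)
rejoined <- bind_cols(scaled, kept)
```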

Solution

The following solved the problem (thanks to @27 ϕ 9 for the suggestion).

Scale the training set on the required columns only (ignoring Col6 and Col7):

varnames <- c('Col6','Col7')
index <- names(Train_Set) %in% varnames
Train_Scale <- Train_Set %>%  mutate(across(where(is.numeric) & -all_of(varnames),~scale(.) %>% as.vector))
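
To see what the column selection does, here is a small self-contained sketch on made-up data (toy and its values are hypothetical). It writes the exclusion as !all_of(varnames), the tidyselect complement operator, rather than with the minus sign; where() inside across() assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data; Col6 and Col7 are numeric but must stay unscaled
toy <- data.frame(
  Col1 = c("A1", "A2", "A3", "A4"),
  Col4 = c(4, 12, 2, 1),
  Col6 = c(1, 0, 1, 0),
  Col7 = c(2, 3, 2, 5)
)
varnames <- c("Col6", "Col7")

# Select columns that are numeric AND not listed in varnames, then scale them
toy_scaled <- toy %>%
  mutate(across(where(is.numeric) & !all_of(varnames), ~ as.vector(scale(.x))))

# Col4 is standardized; Col6 and Col7 keep their values and positions
toy_scaled
```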

Collect the scaling parameters (mean and sd) from the training set:

num_cols <- names(which(sapply(subset(Train_Set,select=-c(Col6,Col7)),is.numeric)))
scale_params <- attributes(scale(Train_Set[,num_cols]))[c("scaled:center","scaled:scale")]
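
What gets collected here are the two attributes that scale() attaches to its result: scaled:center holds the column means and scaled:scale holds the column standard deviations. A minimal sketch on a made-up matrix (m is hypothetical):

```r
# Hypothetical numeric matrix with two named columns
m <- cbind(Col4 = c(4, 12, 2, 1), Col8 = c(7.77, 6, 8.4, -11.23))
s <- scale(m)

# Pull out the two attributes, as in the snippet above
params <- attributes(s)[c("scaled:center", "scaled:scale")]

# Compare the stored values with the column means and standard deviations
center_ok <- all.equal(params[["scaled:center"]], colMeans(m))
scale_ok <- all.equal(params[["scaled:scale"]], apply(m, 2, sd))
```

Because the list is indexed by position later on, the order of the two names in the subsetting vector matters: element 1 is the center, element 2 the scale.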

Apply those scales to the test data:

Test_Scale <- Test_Set
Test_Scale[,num_cols] = scale(Test_Scale[,num_cols],center=scale_params[[1]],scale=scale_params[[2]])
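
As a quick sanity check of this step, the sketch below applies training parameters to a made-up test set and compares the result with the manual (x - mean) / sd calculation. The train and test data frames here are hypothetical stand-ins for Train_Set and Test_Set.

```r
# Hypothetical train and test data; Col6 must stay unscaled
train <- data.frame(Col4 = c(4, 12, 2, 1),
                    Col6 = c(1, 0, 1, 0),
                    Col8 = c(7.77, 6, 8.4, -11.23))
test <- data.frame(Col4 = c(75, 44),
                   Col6 = c(1, 1),
                   Col8 = c(3.5, 7.2))
num_cols <- c("Col4", "Col8")

# Mean and sd come from the training data only
scale_params <- attributes(scale(train[, num_cols]))[c("scaled:center", "scaled:scale")]

# Scale the test columns with the training mean and sd
test_scaled <- test
test_scaled[, num_cols] <- scale(test[, num_cols],
                                 center = scale_params[[1]],
                                 scale = scale_params[[2]])

# Manual calculation for one column, for comparison
manual <- (test$Col4 - mean(train$Col4)) / sd(train$Col4)
check <- all.equal(as.numeric(test_scaled$Col4), manual)
```

Unlike the training set, the scaled test columns will generally not have mean 0 and sd 1, since they are centered and scaled with the training statistics.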

data-preprocessing dataframe dplyr r standardized