R中一组列的热编码 数据

问题描述

我正在尝试对R中的df列的子集进行热编码,

一种热编码是将分类变量转换成可以提供给ML算法的形式的过程,通过将字符串列转换为二进制列中的每个字符串,可以更好地进行预测。

假设我们有一个如下所示的df:

mes          work_location  birth_place
01/01/2000      China           Chile
01/02/2000      Mexico           Japan
01/03/2000      China            Chile
01/04/2000      China           Argentina
01/05/2000      USA              Poland
01/06/2000      Mexico           Poland
01/07/2000      USA              Finland
01/08/2000      USA              Finland
01/09/2000      Japan             norway
01/10/2000      Japan             Kenia
01/11/2000      Japan              Mali
01/12/2000      India              Mali

这是热编码的代码

## function to hot-encode ##
columna_dummy <- function(df,columna) {
  df %>% 
    mutate_at(columna,~paste(columna,eval(as.symbol(columna)),sep = "_")) %>% 
    mutate(valor = 1) %>% 
    spread(key = columna,value = valor,fill = 0)
}

## selecting columns ##
columnas <- c("work_location","birth_place")

## applying loop to repeat columna_dummy function for each df column ##


for(i in 1:length(columnas)){
    new_dataset <- columna_dummy(df,i)
   } 

控制台输出

Error: Problem with `mutate()` input `mes`.
x objeto '1' no enconTrado
i Input `mes` is `(structure(function (...,.x = ..1,.y = ..2,. = ..1) ...`.
Run `rlang::last_error()` to see where the error occurred.
Called from: signal_abort(cnd) 

mes是日期类列,但是不包含在原子向量列中 仍然会引发上述错误

对于所选字符串df列中的每个字符串,预期输出应类似于以下内容

(我无法添加每列,但work_location_China是一个示例 列的外观)

mes          work_location  birth_place    work_location_China   
01/01/2000      China           Chile              1
01/02/2000      Mexico           Japan             0
01/03/2000      China            Chile             1
01/04/2000      China           Argentina          1
01/05/2000      USA              Poland            0
01/06/2000      Mexico           Poland            0
01/07/2000      USA              Finland           0
01/08/2000      USA              Finland           0
01/09/2000      Japan             norway           0
01/10/2000      Japan             Kenia            0
01/11/2000      Japan              Mali            0
01/12/2000      India              Mali            0

还有其他方法可以应用此循环吗?

解决方法

我们在传递字符串时,可以选择select列(select可以带引号/不带引号),创建1s列(“ valor”)和行号列( 'rn'),然后从'long'改成'wide'(pivot_wider

library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
columna_dummy <- function(df,columna) {
      df %>% 
            select(columna) %>%
            mutate(valor = 1,rn = row_number()) %>%
            pivot_wider(names_from = all_of(columna),values_from = valor,values_fill = 0) %>%
            select(-rn)
   }

-测试

对于多个列,一个选项是用map遍历感兴趣的列名,应用该函数并用_dfc绑定它们,然后绑定到原始数​​据集({{1} })

bind_cols

-输出

out <- imap_dfc(setNames(c("work_location","birth_place"),c("work_location","birth_place")),~ {
     nm1 <- as.character(.y)
    columna_dummy(df = df,columna = .x) %>% 
      rename_all(~ str_c(nm1,.,sep="_"))
       }) %>%            
      bind_cols(df,.)

数据

head(out,2)
#         mes work_location birth_place work_location_China work_location_Mexico work_location_USA work_location_Japan
#1 01/01/2000         China       Chile                   1                    0                 0                   0
#2 01/02/2000        Mexico       Japan                   0                    1                 0                   0
#  work_location_India birth_place_Chile birth_place_Japan birth_place_Argentina birth_place_Poland birth_place_Finland
#1                   0                 1                 0                     0                  0                   0
#2                   0                 0                 1                     0                  0                   0
#  birth_place_Norway birth_place_Kenia birth_place_Mali
#1                  0                 0                0
#2                  0                 0                0
,

通过使用purrr库,我解决了这个问题:

## data ##

df <- structure(list(mes = c("01/01/2000","01/02/2000","01/03/2000","01/04/2000","01/05/2000","01/06/2000","01/07/2000","01/08/2000","01/09/2000","01/10/2000","01/11/2000","01/12/2000"),work_location = c("China","Mexico","China","USA","Japan","India"),birth_place = c("Chile","Chile","Argentina","Poland","Finland","Norway","Kenia","Mali","Mali")),class = "data.frame",row.names = c(NA,-12L))

## function to hot-encode ##

columna_dummy <- function(df,columna) {
  df %>% 
    mutate_at(columna,~paste(columna,eval(as.symbol(columna)),sep = "_")) %>% 
    mutate(valor = 1) %>% 
    spread(key = columna,value = valor,fill = 0)
}

## vector of columns ##

columnas <- c("work_location","birth_place")


## hot_encoded_dataset ##
library(purrr)

hot_encoded_dataset <- purrr :: map(columnas,columna_dummy,df = df) %>% 
  reduce(inner_join)