Importing data with non-standard blank entries in R

Problem description

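I am trying to import a data set for use in R (with the tidyverse). Unfortunately it is a government data set, which almost always means some odd conventions. In this case, when an observation has no value for a given variable, the text string ND is entered instead.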

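I would rather not open it in Excel and track down every ND to replace it with a blank cell (even with find and replace), because that obviously makes my code harder to reproduce. But if I don't, then when I import the data with read_csv some of my column types don't work (for example, I can't simply parse a column as a double). Is there a way to replace all of these ND entries with a "standard" NA during import?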

I have included my code below.

Apologies if this has a simple answer.

Thanks


> #Load libraries
> library(tidyverse)
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.2.1     v purrr   0.3.3
v tibble  2.1.3     v dplyr   0.8.3
v tidyr   1.0.0     v stringr 1.4.0
v readr   1.3.1     v forcats 0.4.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

> 
> #Import data
> 
> #From march 2020,REGION_GEOG and REGION_GEOG_CODE fields removed 
> GP.prac.Mar20 <- read_csv(
+   "Data/GPWorkforcePracticeLevel/17. General Practice march 2020 Practice level.csv",
+   col_types = cols(
+     .default = col_double(),
+     PRAC_CODE = col_character(),
+     PRAC_NAME = col_character(),
+     CCG_CODE = col_character(),
+     CCG_NAME = col_character(),
+     PCN_CODE = col_character(),
+     PCN_NAME = col_character(),
+     STP_CODE = col_character(),
+     STP_NAME = col_character(),
+     REGION_CODE = col_character(),
+     REGION_NAME = col_character(),
+     HEE_REGION_CODE = col_character(),
+     HEE_REGION_NAME = col_character(),
+     CONTRACT = col_character(),
+     GP_SOURCE = col_character(),
+     NURSE_SOURCE = col_character(),
+     DPC_SOURCE = col_character(),
+     ADMIN_SOURCE = col_character()
+   )
+ )


|=================================================================| 100%   11 MB
Warning: 91380 parsing failures.
row                    col expected actual                                                                               file
  9 TOTAL_DPC_HC           a double     ND 'Data/GPWorkforcePracticeLevel/17. General Practice march 2020 Practice level.csv'
  9 TOTAL_DPC_DISPENSER_HC a double     ND 'Data/GPWorkforcePracticeLevel/17. General Practice march 2020 Practice level.csv'
  9 TOTAL_DPC_HCA_HC       a double     ND 'Data/GPWorkforcePracticeLevel/17. General Practice march 2020 Practice level.csv'
  9 TOTAL_DPC_PHLEB_HC     a double     ND 'Data/GPWorkforcePracticeLevel/17. General Practice march 2020 Practice level.csv'
  9 TOTAL_DPC_PHARMA_HC    a double     ND 'Data/GPWorkforcePracticeLevel/17. General Practice march 2020 Practice level.csv'
... ...................... ........ ...... ..................................................................................
See problems(...) for more details.

> 

Solution

Try reading the file with data.table's fread(), setting na.strings = appropriately:

library( data.table )

#without na.strings set
data.table::fread( 
"col1,col2
ND,test" )

#    col1 col2
# 1:   ND test

#with na.strings set
data.table::fread( 
  "col1,col2
  ND,test",na.strings = "ND" )

#    col1 col2
# 1:   NA test
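
The question is really about column types, so here is a minimal sketch of the same idea applied to a numeric column. The three-row sample is hypothetical (only the column names are borrowed from the question), and it assumes a data.table version that supports the text = argument of fread().

```r
library(data.table)

# Hypothetical sample standing in for the real CSV;
# "ND" marks a missing value in an otherwise numeric column.
dt <- fread(
  text = c(
    "PRAC_CODE,TOTAL_DPC_HC",
    "A81001,3",
    "A81002,ND",
    "A81003,5"
  ),
  na.strings = "ND"
)

# With ND treated as missing, the column is detected as numeric
# rather than falling back to character.
is.numeric(dt$TOTAL_DPC_HC)
is.na(dt$TOTAL_DPC_HC)
```

For a file on disk the call would look the same, with the file path as the first argument in place of text =.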

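In the end I found that as long as you know which string stands for missing values, you can specify it at import time with na =, for example read_csv("path", na = "ND").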
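
One thing to watch with the readr approach: na = replaces readr's default of c("", "NA") rather than adding to it, so it may be safer to list the defaults alongside "ND". Below is a minimal sketch using a hypothetical three-row sample written to a temporary file; the column names come from the question, but the values are made up.

```r
library(readr)

# Hypothetical sample standing in for the real CSV: one "ND" entry
# and one genuinely empty cell in a numeric column.
sample_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "PRAC_CODE,TOTAL_DPC_HC",
    "A81001,3",
    "A81002,ND",
    "A81003,"
  ),
  sample_path
)

gp_sample <- read_csv(
  sample_path,
  # Keep readr's default NA markers and add the data set's own "ND"
  na = c("", "NA", "ND"),
  col_types = cols(
    .default = col_double(),
    PRAC_CODE = col_character()
  )
)

# The column parses as double, and both the ND entry and the
# empty cell come through as NA.
is.double(gp_sample$TOTAL_DPC_HC)
sum(is.na(gp_sample$TOTAL_DPC_HC))
```

The same na = vector can be dropped into the original read_csv() call without changing the col_types specification.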