Problem description
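I am trying to import a dataset to use in R (with the tidyverse). Unfortunately it is a government dataset, which almost always means some odd conventions. In this case, when an observation has no value for a given variable, ND is entered as a text string.
I would rather not open the file in Excel and replace every ND with a blank cell by hand (even with find and replace), because that would make my code hard to reproduce. But if I leave the file alone, some column types fail when I import the data with read_csv (for example, columns containing ND cannot be parsed cleanly as doubles). Is there a way to replace all of these ND entries with a "standard" NA during import?
I have included my code below.
Apologies if this has a simple answer.
Thanks.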
> #Load libraries
> library(tidyverse)
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.2.1 v purrr 0.3.3
v tibble 2.1.3 v dplyr 0.8.3
v tidyr 1.0.0 v stringr 1.4.0
v readr 1.3.1 v forcats 0.4.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
>
> #Import data
>
> #From march 2020,REGION_GEOG and REGION_GEOG_CODE fields removed
> GP.prac.Mar20 <- read_csv(
+   "Data/GPWorkforcePracticeLevel/17. General Practice march 2020 Practice level.csv",
+   col_types = cols(
+     .default = col_double(),
+     PRAC_CODE = col_character(),
+     PRAC_NAME = col_character(),
+     CCG_CODE = col_character(),
+     CCG_NAME = col_character(),
+     PCN_CODE = col_character(),
+     PCN_NAME = col_character(),
+     STP_CODE = col_character(),
+     STP_NAME = col_character(),
+     REGION_CODE = col_character(),
+     REGION_NAME = col_character(),
+     HEE_REGION_CODE = col_character(),
+     HEE_REGION_NAME = col_character(),
+     CONTRACT = col_character(),
+     GP_SOURCE = col_character(),
+     NURSE_SOURCE = col_character(),
+     DPC_SOURCE = col_character(),
+     ADMIN_SOURCE = col_character()
+   )
+ )
|=================================================================| 100% 11 MB
Warning: 91380 parsing failures.
row col expected actual file
9 TOTAL_DPC_HC a double ND 'Data/GPWorkforcePracticeLevel/17. General Practice march 2020 Practice level.csv'
9 TOTAL_DPC_DISPENSER_HC a double ND 'Data/GPWorkforcePracticeLevel/17. General Practice march 2020 Practice level.csv'
9 TOTAL_DPC_HCA_HC a double ND 'Data/GPWorkforcePracticeLevel/17. General Practice march 2020 Practice level.csv'
9 TOTAL_DPC_PHLEB_HC a double ND 'Data/GPWorkforcePracticeLevel/17. General Practice march 2020 Practice level.csv'
9 TOTAL_DPC_PHARMA_HC a double ND 'Data/GPWorkforcePracticeLevel/17. General Practice march 2020 Practice level.csv'
... ...................... ........ ...... ..................................................................................
See problems(...) for more details.
>
Solution
Try reading the file with data.table's fread(), with na.strings = set appropriately:
library( data.table )
#without na.strings set
data.table::fread(
"col1,col2
ND,test" )
# col1 col2
# 1: ND test
#with na.strings set
data.table::fread(
"col1,col2
ND,test",na.strings = "ND" )
# col1 col2
# 1: NA test
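In the end I found that, as long as you know which string the file uses for missing values, you can specify it at import time with the na = argument, so: read_csv("path", na = "ND")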
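One detail worth checking: readr's default is na = c("", "NA"), so passing na = "ND" on its own replaces that default rather than adding to it. Below is a minimal sketch of combining na with the col_types specification from the question. The three-row file and its values are made up for illustration; only the column names are borrowed from the question.

```r
library(readr)

# Hypothetical miniature of the real file, written to a temporary path.
# The values are invented; the column names come from the question.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("PRAC_CODE,TOTAL_DPC_HC,TOTAL_DPC_HCA_HC",
    "A81001,3,ND",
    "A81002,ND,"),
  csv_path
)

gp <- read_csv(
  csv_path,
  # Keep readr's default missing-value strings and add "ND" to them.
  na = c("", "NA", "ND"),
  col_types = cols(
    .default = col_double(),
    PRAC_CODE = col_character()
  )
)

# No parsing failures should remain, and ND cells come back as NA.
problems(gp)
is.na(gp$TOTAL_DPC_HC)
```

If the real file also uses other placeholders for missing data, they can be added to the same na vector.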