Problem description
I have made some progress cleaning my data:
df1 <- data.frame(
  ID = c("18.1010-2.570322", "171114-238509", "140808-3481906", "18055656193",
         "180625-378224", "190903-2793831 / -9311442 / -6810125", "190808-625-6692",
         "190 807 - 7941125", "1807298087721Roland", "19060881t1676"),
  True_ID = c("181010-2570322", "171114-2385039", "190808-4381906", "180556-5619343",
              "180625-3782242", "190903-2793831 190903-9311442 190903-6810125",
              "190808-6256692", "190807-7941125", "180729-8087721", "190608-8112676")
)
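A true value looks like this: 190312-4184811. So there is a pattern: the first six digits are a date, e.g. 19 = 2019, 03 = March and 12 = the day. The other seven digits are random. I have already cleaned up many useless patterns, but here I don't know how to handle so many different patterns.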
library(stringr)  # str_extract()
library(tidyr)    # unite()
library(dplyr)    # %>%

# data_file and its IP_P column are my full data set (not defined in this post).
# Each pattern matches one combination of digit counts before and after the hyphen.
a = str_extract(data_file$IP_P, "(^|[ ])[:digit:]{6}\\-[:digit:]{7}([ ]|$)")
b = str_extract(data_file$IP_P, "(^|[ ])[:digit:]{5}\\-[:digit:]{7}([ ]|$)")
c = str_extract(data_file$IP_P, "(^|[ ])[:digit:]{4}\\-[:digit:]{7}([ ]|$)")
d = str_extract(data_file$IP_P, "(^|[ ])[:digit:]{6}\\-[:digit:]{6}([ ]|$)")
e = str_extract(data_file$IP_P, "(^|[ ])[:digit:]{6}\\-[:digit:]{5}([ ]|$)")
f = str_extract(data_file$IP_P, "(^|[ ])[:digit:]{6}\\-[:digit:]{4}([ ]|$)")
g = str_extract(data_file$IP_P, "(^|[ ])[:digit:]{6}\\-[:digit:]{8}([ ]|$)")
h = str_extract(data_file$IP_P, "(^|[ ])[:digit:]{6}\\-[:digit:]{9}([ ]|$)")
data_file["Extracted_i"] = NA
data1 <- data.frame(a, b, c, d, e, f, g, h)
# Paste all matches into one column, then keep only digits, dots and hyphens
data1 <- data1 %>% unite("z", a:h, remove = FALSE)
data_file["Extracted_i"] = gsub("[^0-9\\.\\-]", "", data1$z)
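Solution
Isn't it just a matter of removing all non-digit characters to get a string of digits only, then pasting the first 6 characters and the next 6 characters together with a "-" in between?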
paste(substr(gsub("\\D", "", df1$ID), 1, 6), substr(gsub("\\D", "", df1$ID), 7, 12), sep = "-")
#> [1] "181010-257032" "171114-238509" "140808-348190" "180556-56193"
#> [5] "180625-378224" "190903-279383" "190808-625669" "190807-794112"
#> [9] "180729-808772" "190608-811676"
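Note that the true IDs carry seven digits after the hyphen, while characters 7 to 12 keep only six. A minimal sketch, assuming the seventh digit should be retained whenever the raw value contains it, takes characters 7 to 13 instead. The names `ids`, `digits` and `cleaned` are illustrative, and `ids` is just a handful of the raw values from `df1$ID`.

```r
# A few of the raw IDs from df1 in the question
ids <- c("18.1010-2.570322", "190808-625-6692", "190 807 - 7941125",
         "1807298087721Roland", "171114-238509")

# Strip every non-digit character
digits <- gsub("\\D", "", ids)

# First six digits (the date part), then up to seven further digits
cleaned <- paste(substr(digits, 1, 6), substr(digits, 7, 13), sep = "-")
cleaned
# "181010-2570322" "190808-6256692" "190807-7941125" "180729-8087721" "171114-238509"
```

The last value stays one digit short because the raw entry only contains twelve digits; no amount of reformatting can recover a digit that was never recorded.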
We can also use gsub to capture the characters as groups and refer to those capture groups with backreferences (\\1, \\2) in the replacement.
gsub("^(.{1,6})(.{1,6}).*", "\\1-\\2", gsub("\\D+", "", df1$ID))
#[1] "181010-257032" "171114-238509" "140808-348190" "180556-56193"
#[5] "180625-378224" "190903-279383" "190808-625669" "190807-794112"
#[9] "180729-808772" "190608-811676"
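Either approach still leaves entries that cannot be a real ID, for instance when the raw value is missing digits or the date part is impossible. A rough sketch of a sanity check follows, assuming a valid ID is six date digits (yymmdd), a hyphen, and seven more digits as described in the question. `is_valid_id` is a hypothetical helper, not part of any package.

```r
# Hypothetical helper: TRUE when x has the shape yymmdd-ddddddd and the
# first six digits parse as a real calendar date
is_valid_id <- function(x) {
  shape_ok <- grepl("^\\d{6}-\\d{7}$", x)
  # as.Date() returns NA for impossible dates such as month 05, day 56
  date_ok <- !is.na(as.Date(substr(x, 1, 6), format = "%y%m%d"))
  shape_ok & date_ok
}

is_valid_id(c("190312-4184811", "180556-5619343", "171114-238509"))
# TRUE, FALSE, FALSE
```

Rows flagged FALSE could then be reviewed by hand rather than silently passed along as cleaned.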