How to identify rows containing non-letter characters in R using tidyverse/regex

Problem description

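I have a data frame containing strings that represent "full names". Some are complete, normal full names; others are not "complete" or "accurate", judging by the non-letter characters they contain.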

Example data frame:

Full name
----------

Mikki Clancy
Hermsdorfer,Mark (retired)
CSP,PSECU Lan Unit (typo)
Clifton Gurlen
G�mez,Oscar Prieto
Sj�¶strand,Anders
Lisa Terry
Meloy,Wilson {old}
Gregory Stevens
Charles Gruenberg

df <- structure(
  list(
    Full_name = c("Jane Clancy", "Hermsdorfer,Mark (retired)", "CSP,PSECU Lan Unit (typo)",
                  "Clif Gurlen", "G�mez,Oscar Prieto", "Sj�¶strand,Anders", "Liza Terry",
                  "Meloy,Will {old}", "Garret Stevens", "Charly Ruenberg"),
    Group = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
  ),
  class = "data.frame",
  row.names = c(NA, -10L)
)

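The requirement is to subset the full data frame to the strings that contain non-alphabetic characters (for example, from the values above: '{}', '()', '&', '�').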

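The desired output is the column of names containing these characters, together with the total row count, so that I can calculate what percentage of the full data frame is "incomplete" or "inaccurate".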

Not Complete Full name
----------------------

Hermsdorfer,Mark (retired)
CSP,PSECU Lan Unit (typo)
G�mez,Oscar Prieto
Sj�¶strand,Anders
Meloy,Wilson {old}

Solution

To cover letters more comprehensively than a plain A-Z range, I borrowed the regex from this question about matching letters.

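library(dplyr)
# \p{L} matches any Unicode letter; perl = TRUE is required for \p{...} classes.
# The negated class flags any character that is neither a letter nor a space.
df %>% mutate(
  has_non_letters = grepl("[^\\p{L} ]", names, perl = TRUE)
)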
#                          names has_non_letters
# 1                 Mikki Clancy           FALSE
# 2  Hermsdorfer,Mark (retired)            TRUE
# 3   CSP,PSECU Lan Unit (typo)            TRUE
# 4               Clifton Gurlen           FALSE
# 5   G<U+FFFD>mez,Oscar Prieto            TRUE
# 6         Sj�¶strand,Anders            TRUE
# 7                   Lisa Terry           FALSE
# 8          Meloy,Wilson {old}            TRUE
# 9              Gregory Stevens           FALSE
# 10           Charles Gruenberg           FALSE

I'll leave the extra summarizing to you: you can sum or mean the TRUE/FALSE values as you like.
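The summarizing step could be sketched roughly as follows. This is only an illustration on a small hypothetical subset of the names (the column name `names` follows the answer's data, and `n_flagged`, `pct_flagged`, and `not_complete` are names made up for this example): `sum()` counts the TRUE values, and `mean()` gives their share.

```r
library(dplyr)

# Hypothetical four-row sample; the column name `names` matches the answer's data.
df <- data.frame(names = c(
  "Mikki Clancy", "Hermsdorfer,Mark (retired)", "Lisa Terry", "Meloy,Wilson {old}"
))

flagged <- df %>%
  mutate(has_non_letters = grepl("[^\\p{L} ]", names, perl = TRUE))

# TRUE is treated as 1 and FALSE as 0 in arithmetic.
n_flagged   <- sum(flagged$has_non_letters)         # number of flagged rows
pct_flagged <- 100 * mean(flagged$has_non_letters)  # flagged rows as a percentage

# Only the rows containing a character that is neither a letter nor a space.
not_complete <- flagged %>% filter(has_non_letters)
```

On this sample, two of the four rows are flagged, so `pct_flagged` is 50.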


Using this data:

df <- data.frame(names = c(
  "Mikki Clancy", "Hermsdorfer,Mark (retired)", "CSP,PSECU Lan Unit (typo)",
  "Clifton Gurlen", "G�mez,Oscar Prieto", "Sj�¶strand,Anders", "Lisa Terry",
  "Meloy,Wilson {old}", "Gregory Stevens", "Charles Gruenberg"
))

We can use str_detect:

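library(dplyr)
library(stringr)
# Keep rows containing any character other than an ASCII letter, a comma, or a space.
# The space inside the class matters: without it, every "First Last" name would match.
df %>% 
   filter(str_detect(Full_name, "[^A-Za-z, ]+"))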
                    Full_name Group
1 Hermsdorfer,Mark (retired)     b
2  CSP,PSECU Lan Unit (typo)     c
3         G�mez,Oscar Prieto     e
4        Sj�¶strand,Anders     f
5           Meloy,Will {old}     h
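One difference between the two patterns is worth illustrating: an ASCII-only class such as `[A-Za-z]` treats correctly encoded accented letters as non-letters, while `\p{L}` accepts them. The sketch below uses a hypothetical vector with properly encoded names (written with `\u` escapes so the example does not depend on file encoding); the variable names are invented for this example.

```r
library(stringr)

# Hypothetical, correctly encoded names: \u00f3 is "ó" and \u00f6 is "ö".
x <- c("G\u00f3mez,Oscar Prieto", "Sj\u00f6strand,Anders", "Meloy,Will {old}", "Liza Terry")

# Flags anything outside ASCII letters, comma, and space (accented letters included).
ascii_only <- str_detect(x, "[^A-Za-z, ]")

# Flags anything outside Unicode letters, comma, and space (accented letters pass).
any_letter <- str_detect(x, "[^\\p{L}, ]")

data.frame(x, ascii_only, any_letter)
```

Here `ascii_only` flags the first three names, while `any_letter` flags only the one with braces. Which behaviour is wanted depends on whether accented names should count as "complete".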