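Problem description
I need your help: I keep getting the same error no matter which method I try. I want to replace special characters such as "áéíóúÁÉÍÓÚýÝ", "àèìòùÀÈÌÒÙ", "âêîôûÂÊÎÔÛ", "ãõÃÕñÑ", "äëïöüÄËÏÖÜÿ", "çÇ" with plain characters such as "aeIoUAEIoUXX", "aeIoUAEIoU", "AEIoUAEIU", "XX" in a data frame. Thank you!
First I tried this: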
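library(stringr)  # str_to_upper(), str_trim(), str_replace_all(); also provides %>%
# Replace symbols with "X", upper-case, trim, drop stray accent marks, then strip accents.
trata <- function(Campo) {
  Campo <- Campo %>%
    chartr('ÇÆ£ØÞß&@Ð', 'XXXXXXXXX', .) %>%
    str_to_upper(locale = "es") %>%
    str_trim(side = "both") %>%
    str_replace_all("['´`^]", "") %>%
    chartr('ÁÉÍÓÚÀÈÌÒÙÄËÏÖÜÂÊÎÔÛÅÃÕÑ', 'AEIoUAEIoUAEIoUAEIoUAAOX', .)
  return(Campo)
}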
trataRS <- function(Campo) {
  Campo <- Campo %>%
    chartr('ÇÆ£ØÞßÐ', 'XXXXXXXXX', .) %>%
    chartr('ÁÉÍÓÚÀÈÌÒÙÄËÏÖÜÂÊÎÔÛÅÃÕ', 'AEIoUAEIoUAEIoUAEIoUAAO', .)
  return(Campo)
}
Then I apply these functions to my columns:
Base$paterno_originador<-trata(Base$paterno_originador)
Base$razon_originador <- trataRS(Base$razon_originador)
But I get this error:
Error in chartr("ÇÆ£ØÞßÐ", "XXXXXXXXX", .) : invalid input 'HÉCTOR' in 'utf8towcs'
So I tried another approach, found in a post by @Alexandre_Lima:
rm_accent <- function(str, pattern = "all") {
  if (!is.character(str))
    str <- as.character(str)
  pattern <- unique(pattern)
  if (any(pattern == "Ç"))
    pattern[pattern == "Ç"] <- "ç"
  # Accented characters, grouped by accent type
  symbols <- c(
    acute = "áéíóúÁÉÍÓÚýÝ", grave = "àèìòùÀÈÌÒÙ", circunflex = "âêîôûÂÊÎÔÛ",
    tilde = "ãõÃÕñÑ", umlaut = "äëïöüÄËÏÖÜÿ", cedil = "çÇ"
  )
  # Replacements, matched position by position
  nudeSymbols <- c(
    acute = "aeIoUAEIoUyY", grave = "aeIoUAEIoU", circunflex = "AEIoUAEIoU",
    tilde = "AOAOXX", umlaut = "AEIoUAEIoUX", cedil = "XX"
  )
  accentTypes <- c("´", "`", "^", "~", "¨", "ç")
  if (any(c("all", "al", "a", "todos", "t", "to", "tod", "todo") %in% pattern)) # option: remove all
    return(chartr(paste(symbols, collapse = ""), paste(nudeSymbols, collapse = ""), str))
  for (i in which(accentTypes %in% pattern))
    str <- chartr(symbols[i], nudeSymbols[i], str)
  return(str)
}
But I got a similar error:
Error in chartr(paste(symbols,collapse = ""),:
invalid input 'RUÍZ' in 'utf8towcs'
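I am including this to show you the encodings. The values in this column that contain special characters are marked as UTF-8:
Encoding(Base$nombre_originador)
[1] "unknown" "UTF-8" "unknown" "UTF-8"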
Solution
The fix for the "invalid input in 'utf8towcs'" error is to set the encoding when you import the .csv file into R.
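To see why the declared encoding matters, here is a minimal sketch of one plausible way the error arises: Latin-1 bytes that end up marked as UTF-8. The byte values and the variable names (`x`, `bad`, `y`) are illustrative assumptions, not taken from the original data.

```r
# "HÉCTOR" as Latin-1 bytes: É is the single byte 0xC9.
x <- rawToChar(as.raw(c(0x48, 0xC9, 0x43, 0x54, 0x4F, 0x52)))

Encoding(x)   # "unknown": nothing tells R how to interpret the bytes
validUTF8(x)  # FALSE: 0xC9 followed by "C" is not a valid UTF-8 sequence

# Wrongly declaring the bytes as UTF-8 makes chartr() fail.
bad <- x
Encoding(bad) <- "UTF-8"
res <- try(chartr("\u00c9", "E", bad), silent = TRUE)
inherits(res, "try-error")  # TRUE

# Declaring the real encoding and converting once fixes it.
Encoding(x) <- "latin1"
y <- enc2utf8(x)
validUTF8(y)              # TRUE
chartr("\u00c9", "E", y)  # "HECTOR"
```

Running validUTF8() on a column that Encoding() reports as "UTF-8" is a quick way to check whether the mark matches the actual bytes.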
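- When you import the file with read.csv() or read.delim(), specify encoding = "UTF-8" or encoding = "Latin-1". I tried "Latin-1" and that solved it. (R's documentation spells the Latin-1 value as "latin1".)
- You may also want to check what your system encoding is and match it. You can do this with Sys.getlocale() (and set it with Sys.setlocale()). For example, on my system: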
Sys.getlocale()
[1] "en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8"
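As a complement to Sys.getlocale(), the short sketch below queries only the character-type category and asks l10n_info() whether the session is running in a UTF-8 locale. The variable names `ctype` and `info` are illustrative, and the output depends on your system.

```r
# Locale category that governs character encoding
ctype <- Sys.getlocale("LC_CTYPE")
ctype

# l10n_info() reports whether the native encoding is multibyte / UTF-8
info <- l10n_info()
info[["UTF-8"]]  # TRUE when the session's native encoding is UTF-8
```

If this reports a non-UTF-8 session while the file is UTF-8 (or the other way round), declaring the file's encoding at import becomes especially important.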
An example
data <- read.delim("input/data/data.txt",sep=";",encoding = "Latin-1",stringsAsFactors = F )
data <- read.csv("input/data/data.csv",stringsAsFactors = F )
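The whole round trip can be sketched on a throwaway file, assuming the source file is Latin-1 encoded. The file contents, the column name `nombre`, and the use of colClasses = "character" are assumptions made for illustration; the sketch uses the "latin1" spelling that R documents for Encoding().

```r
# Write a tiny Latin-1 CSV; accented bytes are given explicitly so the
# example does not depend on the encoding of this script.
tmp <- tempfile(fileext = ".csv")
writeBin(
  c(charToRaw("nombre\nH"), as.raw(0xC9), charToRaw("CTOR\nRU"),
    as.raw(0xCD), charToRaw("Z\n")),
  tmp
)

# encoding = "latin1" marks the strings as Latin-1 on import.
Base <- read.csv(tmp, encoding = "latin1", colClasses = "character")
Encoding(Base$nombre)

# Convert to UTF-8 once; chartr() can then process the column.
Base$nombre <- enc2utf8(Base$nombre)
cleaned <- chartr("\u00c9\u00cd", "EI", Base$nombre)
cleaned  # "HECTOR" "RUIZ"

unlink(tmp)
```

The same pattern should apply to a real file: declare the encoding the file actually uses, convert once, and only then run trata() or rm_accent() on the columns.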
Best regards