Remove all punctuation except underscores between words, using POSIX character classes

Problem description

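I want to use R to remove all underscores except those between words. In the end, the code should remove underscores at the start or end of a word. The result should be "hello_world and hello_world". I want to use the pre-built character classes. Right now, I have learned how to exclude particular characters with the code below, but I don't know how to use the word-boundary sequences.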

test<-"hello_world and _hello_world_"
gsub("[^_[:^punct:]]","",test,perl=T)

Solution

You can use

gsub("[^_[:^punct:]]|_+\\b|\\b_+","",test,perl=TRUE)

See the regex demo.

Details:

[^_[:^punct:]] - any punctuation character other than _
| - or
_+\b - one or more _ at the end of a word
| - or
\b_+ - one or more _ at the start of a word
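To see what each alternative contributes, the parts of the pattern can be run separately. This is only an illustrative sketch: the sample string `x`, with its extra comma and exclamation mark, is made up for the demonstration and does not come from the question.

```r
# Hypothetical sample input (not from the question): adds "," and "!"
x <- "hello_world, and _hello_world_!"

# First alternative alone: strips punctuation other than "_"
only_punct <- gsub("[^_[:^punct:]]", "", x, perl = TRUE)

# Second and third alternatives alone: strip "_" at the start or end of a word
only_edges <- gsub("_+\\b|\\b_+", "", x, perl = TRUE)

# Full pattern: all three alternatives in a single pass
combined <- gsub("[^_[:^punct:]]|_+\\b|\\b_+", "", x, perl = TRUE)

print(only_punct)  # "hello_world and _hello_world_"
print(only_edges)  # "hello_world, and hello_world!"
print(combined)    # "hello_world and hello_world"
```

The underscore inside hello_world survives because _ is itself a word character, so there is no word boundary between it and the neighbouring letters.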

A non-regex approach is to split the string on spaces and apply trimws to each piece with its whitespace argument set to _, i.e.

paste(sapply(strsplit(test,' '),function(i)trimws(i,whitespace = '_')),collapse = ' ')
#[1] "hello_world and hello_world"
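If the same clean-up is needed for several strings, the split-and-trim idea could be wrapped in a small helper along these lines. The function name `trim_word_underscores` and the second sample string are hypothetical, and the sketch assumes R 3.6.0 or later, where trimws accepts the whitespace argument.

```r
# Hypothetical helper (name is an assumption, not from the answer):
# split each string on single spaces, trim "_" from both ends of each
# piece, then glue the pieces back together.
trim_word_underscores <- function(x) {
  pieces <- strsplit(x, " ", fixed = TRUE)
  vapply(
    pieces,
    function(w) paste(trimws(w, whitespace = "_"), collapse = " "),
    character(1)
  )
}

# The second element is a made-up input to show runs of underscores
res <- trim_word_underscores(c("hello_world and _hello_world_", "__a_b__ c"))
print(res)  # "hello_world and hello_world" "a_b c"
```

Using vapply with character(1) keeps one result per input string, which the original one-liner does not guarantee for vectors longer than one.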

You can use:

test <- "hello_world and _hello_world_"
output <- gsub("(?<![^\\W])_|_(?![^\\W])", "", test, perl=TRUE)
output

[1] "hello_world and hello_world"

Explanation of the regex:

(?<![^\\W])  assert that what precedes is a non word character OR the start of the input
_            match an underscore to remove
|            OR
_            match an underscore to remove, followed by
(?![^\\W])   assert that what follows is a non word character OR the end of the input
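One consequence of the explanation above is that each match is exactly one underscore, and _ itself counts as a word character. A quick comparison on a made-up input with doubled underscores (not from the question) suggests how this differs from a word-boundary pattern that uses _+:

```r
# Hypothetical input with doubled underscores (not from the question)
y <- "__hello_world__"

# Lookaround version: only the outermost underscore on each side qualifies,
# because the inner one has another "_" (a word character) next to it
lookaround <- gsub("(?<![^\\W])_|_(?![^\\W])", "", y, perl = TRUE)

# Word-boundary version with _+: each match is a whole run of underscores
boundary <- gsub("_+\\b|\\b_+", "", y, perl = TRUE)

print(lookaround)  # "_hello_world_"
print(boundary)    # "hello_world"
```

For the input in the question, which has single underscores only, both patterns give the same result.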

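We can remove every underscore that has a word boundary on either side, using positive lookbehind and lookahead to find such underscores. To remove underscores at the start and end of the string, we use trimws.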

test<-"hello_world and _hello_world_"
gsub("(?<=\\b)_|_(?=\\b)","",trimws(test,whitespace = '_'),perl = TRUE)
#[1] "hello_world and hello_world"
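Splitting the call into its two steps may make it easier to follow what each part does. This is just the same code unrolled; the intermediate names `trimmed` and `stepwise` are introduced here for illustration.

```r
test <- "hello_world and _hello_world_"

# Step 1: trimws strips underscores at the very start and end of the string
trimmed <- trimws(test, whitespace = "_")

# Step 2: the lookarounds strip underscores that touch a word boundary
stepwise <- gsub("(?<=\\b)_|_(?=\\b)", "", trimmed, perl = TRUE)

print(trimmed)   # "hello_world and _hello_world"
print(stepwise)  # "hello_world and hello_world"
```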
