R - Remove words that start with an uppercase letter from the strings in a character vector

Question

I have a character vector, df:

df <- c("hello goodbye Delete Me","Another Sentence good program","hello world The End")

I want this:

c("hello goodbye","good program","hello world")

I tried:

df <- grep("^[A-Z]",df,invert = TRUE,value = TRUE)

But this drops every element that starts with an uppercase letter, instead of removing only the capitalized words:

c("hello goodbye Delete Me","hello world The End")

How can I do this?

Answers

You can use:

trimws(gsub('[A-Z]\\w+','',df))
#[1] "hello goodbye" "good program"  "hello world"

You can use the following regex pattern and replace each match with a single space:

\s*[A-Z]\w+\s*

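This captures every word that starts with an uppercase letter, together with any whitespace on either side of it. The outer call to trimws() removes any whitespace that the replacement leaves at the start or end of the string.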

x <- c("nice to meet You however","cat Ran away","Cat","Dog")
trimws(gsub('\\s*[A-Z]\\w+\\s*',' ',x))

[1] "nice to meet however" "cat away"             ""                    
[4] ""
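To see why this answer replaces each match with a single space rather than an empty string, here is a small comparison. It is only a sketch using base R; the variable names dropped and spaced are illustrative and not part of the original answers.

```r
x <- "nice to meet You however"

# Removing only the word leaves both surrounding spaces behind.
dropped <- trimws(gsub("[A-Z]\\w+", "", x))

# Consuming the surrounding whitespace and putting back one space
# keeps exactly one separator between the remaining words.
spaced <- trimws(gsub("\\s*[A-Z]\\w+\\s*", " ", x))

print(dropped)  # "nice to meet  however" (two spaces before "however")
print(spaced)   # "nice to meet however"
```

Note that [A-Z]\w+ needs at least two characters, so a one-letter capitalized word such as "I" is not matched by either call.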

How about this, which extracts the first two space-separated words of each element:

library(stringr)
str_extract(df,"[^ ]+ [^ ]+")

Output:

[1] "hello goodbye"    "Another Sentence" "hello world"

You can use any of the following three solutions:

df <- c("hello goodbye Delete Me","Another Sentence good program","hello world The End","an iPhone","Ещё Одно слово")

## Base R gsub with default TRE regex engine:
trimws(gsub("\\s*\\b[[:upper:]][[:alpha:]]*\\b","",df))

## Base R gsub with PCRE regex engine:
trimws(gsub("(*UCP)\\s*\\b\\p{Lu}\\p{L}*\\b","",df,perl=TRUE))

## stringr::str_replace_all with ICU regex engine:
library(stringr)
str_trim(str_replace_all(df,"\\s*\\b\\p{Lu}\\p{L}*\\b",""))

The output of all three is [1] "hello goodbye" "good program" "hello world" "an iPhone" "слово". Note that the word boundaries are essential for handling words such as iPhone correctly.
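The effect of the word boundaries can be checked with a short comparison. This is a sketch rather than part of the original answer: it uses perl = TRUE with ASCII-only input so the behaviour does not depend on the locale, and the names without_boundary and with_boundary are made up for illustration.

```r
x <- c("an iPhone", "hello Delete Me")

# Without \b, the match can start in the middle of a word,
# so "Phone" is cut out of "iPhone".
without_boundary <- trimws(gsub("[[:upper:]][[:alpha:]]*", "", x, perl = TRUE))

# With \b on both sides, only whole words that begin with an
# uppercase letter are removed.
with_boundary <- trimws(gsub("\\s*\\b[[:upper:]][[:alpha:]]*\\b", "", x, perl = TRUE))

print(without_boundary)  # "an i"      "hello"
print(with_boundary)     # "an iPhone" "hello"
```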

See the online R demo. The PCRE regex demo also shows how the regex works and lets you watch the internals of the regex engine.

Regex details:

\s* - zero or more whitespace characters
\b - a word boundary
[[:upper:]] / \p{Lu} - any Unicode uppercase letter
[[:alpha:]]* / \p{L}* - zero or more letters
\b - a word boundary

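(*UCP) in the PCRE regex makes \b and shorthand classes such as \s Unicode-aware, so they behave correctly around non-ASCII letters.

trimws is needed to remove any leading/trailing whitespace that remains after the replacement.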
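To inspect what the pattern actually matches before it is replaced, one option is to pull the matches out with regmatches() and gregexpr(). This is a minimal sketch on one of the question's strings; it uses perl = TRUE, and the names pattern, matches and cleaned are chosen here for illustration only.

```r
x <- "hello goodbye Delete Me"
pattern <- "\\s*\\b[[:upper:]][[:alpha:]]*\\b"

# Each match includes the whitespace in front of the capitalized word.
matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
print(matches)  # " Delete" " Me"

# Deleting those matches leaves the lowercase words; trimws() guards
# against leftover whitespace at either end.
cleaned <- trimws(gsub(pattern, "", x, perl = TRUE))
print(cleaned)  # "hello goodbye"
```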
