Regular expression: filter for a keyword (1) between hyphens or (2) at the end of a sentence

Problem description

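I need help with a RegEx filter!

I have a list of keywords and many lines that should be checked against them. In this example, the keyword "-book-" can be (1) in the middle of a sentence or (2) at the end, in which case the trailing hyphen is absent.

I need a RegEx expression that identifies both "-book-" and "-book". I do not want it to match keywords such as "-booking-".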

library(dplyr)
keywords = c( "-album-","-book-","-castle-")                 
search_terms = paste(keywords,collapse ="|")                
number = c(1:5)
sentences = c("the-best-album-in-shop","this-book-is-fantastic","that-is-the-best-book","spacespacespace","unwanted-sentence-with-booking")   
data = data.frame(number,sentences)

output = data %>% filter(.,grepl( search_terms,sentences) )

# Current output:
 number              sentences
1      1 the-best-album-in-shop
2      2 this-book-is-fantastic

# DESIRED output:
  number              sentences
1      1 the-best-album-in-shop
2      2 this-book-is-fantastic
3      3  that-is-the-best-book

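Solution

The -book- pattern matches the whole word book only when it has a hyphen on both the left and the right, so it misses book at the end of a sentence.

To match the whole word book with a hyphen on either the left or the right, you need an alternation: \bbook-|-book\b. The word boundary \b prevents a match inside booking, because there is no boundary between k and i.

So you can use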

keywords = c( "-album-","\\bbook-","-book\\b","-castle-" )
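
As a quick check, the alternation can be run against the sample data from the question. This is only a minimal sketch in base R: the names `pattern` and `matched` are illustrative, and `perl = TRUE` is an assumption made here to get PCRE semantics for `\b` (the answer itself relies on the default engine).

```r
# Sample data from the question
sentences <- c("the-best-album-in-shop", "this-book-is-fantastic",
               "that-is-the-best-book", "spacespacespace",
               "unwanted-sentence-with-booking")
data <- data.frame(number = 1:5, sentences)

# "book" gets two entries: hyphen on the right, or hyphen on the left
keywords <- c("-album-", "\\bbook-", "-book\\b", "-castle-")
pattern <- paste(keywords, collapse = "|")

# Keep only the rows whose sentence matches any keyword
matched <- data[grepl(pattern, data$sentences, perl = TRUE), ]
print(matched)  # rows 1, 2 and 3 are kept; "booking" is rejected
```

Note that "book" now needs two entries in the vector, so every keyword that may end a sentence has to be listed twice.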

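You can also do this (assuming keywords is still the original vector from the question, i.e. "-album-", "-book-", "-castle-"):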

subset(data,grepl(paste0(sprintf("%s?\\b",keywords),collapse = "|"),sentences))

  number              sentences
1      1 the-best-album-in-shop
2      2 this-book-is-fantastic
3      3  that-is-the-best-book

Note that this only checks for -book- in the middle (1) or at the end (2) of a sentence, not at the beginning.
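
To make that limitation concrete, here is a hedged sketch of what the `sprintf` call builds and how it behaves on a sentence that starts with the keyword. The test strings and the `anchored` variant are hypothetical additions, not part of the original answer, and `perl = TRUE` is again an assumption.

```r
keywords <- c("-album-", "-book-", "-castle-")

# Each keyword gets an optional trailing hyphen followed by a word boundary
pattern <- paste0(sprintf("%s?\\b", keywords), collapse = "|")
cat(pattern, "\n")  # -album-?\b|-book-?\b|-castle-?\b

# Hypothetical test strings: keyword at the end, at the start, and inside "booking"
tests <- c("that-is-the-best-book", "book-of-the-year", "with-booking")
hits <- grepl(pattern, tests, perl = TRUE)
print(hits)  # TRUE FALSE FALSE: the leading hyphen is still required

# One possible variant: strip the hyphens and require a hyphen or a
# string edge on each side of the bare word
cores <- gsub("^-|-$", "", keywords)  # "album" "book" "castle"
anchored <- paste0("(^|-)(", paste(cores, collapse = "|"), ")(-|$)")
hits_anchored <- grepl(anchored, tests, perl = TRUE)
print(hits_anchored)  # TRUE TRUE FALSE
```

Whether the start-of-sentence case should match at all depends on the data; the question only asks for the middle and end positions.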

Another solution you could consider:

library(stringr)
data %>% 
  filter(str_detect(sentences,regex("-castle-|-album-|-book$|-book-\\w{1,}")))
#   number              sentences
# 1      1 the-best-album-in-shop
# 2      2 this-book-is-fantastic
# 3      3  that-is-the-best-book
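
To see which branch of that pattern fires for each sentence, the matched text can be extracted with base R. This is just a sketch without stringr; `perl = TRUE` and the variable name `m` are assumptions made for the illustration.

```r
sentences <- c("the-best-album-in-shop", "this-book-is-fantastic",
               "that-is-the-best-book", "spacespacespace",
               "unwanted-sentence-with-booking")

# Same pattern as in the str_detect() call above
pattern <- "-castle-|-album-|-book$|-book-\\w{1,}"

# regmatches() returns the matched substring for each sentence that matches
m <- regmatches(sentences, regexpr(pattern, sentences, perl = TRUE))
print(m)  # "-album-" "-book-is" "-book"
```

The `-book-\\w{1,}` branch needs at least one word character after the second hyphen, and `-book$` covers the end of the sentence. Unlike the earlier approaches, the keywords are hard-coded here rather than built from the `keywords` vector.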
