Regex filter for a keyword followed by (1) a hyphen or (2) the end of the sentence

Problem description

I need help with a regex filter.

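I have a list of keywords and many lines that should be checked against them. In this example, the keyword "-book-" can appear (1) in the middle of a sentence or (2) at the end, in which case the trailing hyphen is absent.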

I need a regular expression that matches both "-book-" and "-book". I do not want it to match keywords such as "-booking-".

library(dplyr)
keywords = c( "-album-","-book-","-castle-")                 
search_terms = paste(keywords,collapse ="|")                
number = c(1:5)
sentences = c("the-best-album-in-shop","this-book-is-fantastic","that-is-the-best-book","spacespacespace","unwanted-sentence-with-booking")   
data = data.frame(number,sentences)  
output = data %>% filter(.,grepl( search_terms,sentences) )               
# Current output:
 number              sentences
1      1 the-best-album-in-shop
2      2 this-book-is-fantastic
# DESIRED output:
  number              sentences
1      1 the-best-album-in-shop
2      2 this-book-is-fantastic
3      3  that-is-the-best-book

Solution

As written, the -book- pattern only matches the whole word book when it has a hyphen on both the left and the right.

To match the whole word book with a hyphen on either the left or the right, you need an alternation: \bbook-|-book\b

So you can use

keywords = c( "-album-","\\bbook-","-book\\b","-castle-" ) 
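For illustration, here is a minimal sketch of how that modified keyword vector could be applied to the sample data from the question. It uses only base R (no dplyr), and the variable name hits is introduced here for the example; it is not part of the original answer.

```r
# Sample data from the question
sentences <- c("the-best-album-in-shop", "this-book-is-fantastic",
               "that-is-the-best-book", "spacespacespace",
               "unwanted-sentence-with-booking")
data <- data.frame(number = 1:5, sentences)

# "book" is split into two alternatives: hyphen on the right, or hyphen on the left
keywords <- c("-album-", "\\bbook-", "-book\\b", "-castle-")
search_terms <- paste(keywords, collapse = "|")

# hits is a hypothetical helper name: TRUE where any alternative matches
hits <- grepl(search_terms, data$sentences)
data[hits, ]
```

Rows 1 to 3 should be kept, while "unwanted-sentence-with-booking" is dropped: there, "book" is followed by "i", so neither the hyphen nor the word boundary \b can match.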

You can also do this, using the original keywords vector from the question:

subset(data,grepl(paste0(sprintf("%s?\\b",keywords),collapse = "|"),sentences))

  number              sentences
1      1 the-best-album-in-shop
2      2 this-book-is-fantastic
3      3  that-is-the-best-book

Note that this only finds -book- in the middle (1) or at the end (2) of a sentence, not at the beginning.
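To make that behaviour concrete, the sketch below builds the same pattern and checks it against a few strings. The string "book-of-the-year" is a made-up example (not from the question) that starts with the keyword; the names pattern, tests and result are also introduced here for illustration.

```r
# Original keywords from the question
keywords <- c("-album-", "-book-", "-castle-")

# Append "?\\b" to each keyword: the trailing hyphen becomes optional,
# and a word boundary must follow
pattern <- paste0(sprintf("%s?\\b", keywords), collapse = "|")

tests <- c("this-book-is-fantastic",          # keyword in the middle
           "that-is-the-best-book",           # keyword at the end
           "book-of-the-year",                # hypothetical: keyword at the start
           "unwanted-sentence-with-booking")  # longer word, should not match

result <- grepl(pattern, tests)
result
# [1]  TRUE  TRUE FALSE FALSE
```

The third string is rejected because every alternative still requires the leading hyphen. The fourth is rejected because no word boundary exists between "book" and "ing".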


Another solution you could consider:

library(stringr)
data %>% 
  filter(str_detect(sentences,regex("-castle-|-album-|-book$|-book-\\w{1,}")))
#   number              sentences
# 1      1 the-best-album-in-shop
# 2      2 this-book-is-fantastic
# 3      3  that-is-the-best-book
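The pattern above is written out by hand. As a rough sketch, a similar pattern could be generated from the keyword vector instead, assuming every keyword should be allowed to end in either a hyphen or the end of the string. This is a variation rather than the answer's exact regex: it uses base R grepl instead of stringr so that it runs without extra packages, it does not require a word character after the trailing hyphen, and the names stems, pattern and keep are hypothetical.

```r
sentences <- c("the-best-album-in-shop", "this-book-is-fantastic",
               "that-is-the-best-book", "spacespacespace",
               "unwanted-sentence-with-booking")
data <- data.frame(number = 1:5, sentences)

keywords <- c("-album-", "-book-", "-castle-")

# Drop the trailing hyphen, then require either a hyphen or end of string
stems <- sub("-$", "", keywords)
pattern <- paste0(stems, "(-|$)", collapse = "|")
# pattern is now: -album(-|$)|-book(-|$)|-castle(-|$)

keep <- grepl(pattern, data$sentences)
data[keep, ]
```

With the sample data this should keep rows 1 to 3, the same rows as the hand-written pattern, and new keywords can be added to the vector without editing the regex.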