Extracting a specified number of words after a string in R

Problem description

In the example below, I am trying to extract the 4 words that follow the string "source:".

library(stringr)

x <- data.frame(end = c("source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general,liver and dairy products;","source: Eggs,liver,certain fish species such as sardines,certain mushroom species such as shiitake","source: Leafy green vegetables such as spinach; egg yolks; liver"))


x$source = str_extract(x$end,'[^source: ](.)*')

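With the code above, I can extract all of the text after "source:" into a new column. Is there a way, using stringr or any other package, to extract only the first four words after "source:"?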

Solution

You can use:

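# [,\\s]+ accepts a comma, whitespace, or both after each word, so "Eggs,liver" counts as two words
trimws(stringr::str_extract(x$end,'(?<=source:\\s)(\\w+[,\\s]+){4}'))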
#[1] "from animal origin as"       "Eggs,liver,certain fish"   
#    "Leafy green vegetables such"

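(?<=...) is a positive lookbehind: the match must be immediately preceded by 'source:' and one whitespace character, but that prefix is not part of the extracted text.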
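
To see what the lookbehind contributes, here is a minimal sketch comparing the same pattern with and without it. The sample string and the variable names are made up for illustration.

```r
library(stringr)

# Illustrative input (not taken from the data frame above)
s <- "source: Leafy green vegetables"

# Without a lookbehind, the prefix is part of the match
with_prefix <- str_extract(s, "source:\\s\\w+")
with_prefix
#[1] "source: Leafy"

# With a lookbehind, the prefix must be present but is not returned
without_prefix <- str_extract(s, "(?<=source:\\s)\\w+")
without_prefix
#[1] "Leafy"
```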

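The group then captures four "words", each together with the comma or whitespace that follows it; trimws() removes the trailing space.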
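
If the number of words needs to vary, the same idea can be wrapped in a small function. This is only a sketch: first_n_words() is a hypothetical helper, not part of stringr, and it assumes that words are runs of \w characters separated by commas and/or whitespace. Unlike the one-liner above, it leaves the last word without its trailing separator, so no trimws() is needed.

```r
library(stringr)

# Hypothetical helper: return the first `n` words after "source: ".
# Assumption: words are \w+ runs separated by commas and/or whitespace.
first_n_words <- function(text, n = 4) {
  # n - 1 words with their separators, then one final word on its own
  pattern <- sprintf("(?<=source:\\s)(?:\\w+[,\\s]+){%d}\\w+", as.integer(n) - 1L)
  str_extract(text, pattern)
}

demo <- c("source: from animal origin as Vitamin A",
          "source: Eggs,liver,certain fish species",
          "no marker here")

first_n_words(demo, 2)
#[1] "from animal" "Eggs,liver"  NA
```

Elements that lack the marker, or that have fewer than n words after it, come back as NA rather than a partial match.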
