Problem description
In the example below, I am trying to extract the 4 words that follow the string "source:".
library(stringr)

x <- data.frame(end = c(
  "source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general, liver and dairy products;",
  "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake",
  "source: Leafy green vegetables such as spinach; egg yolks; liver"
))

x$source <- str_extract(x$end, '[^source: ](.)*')
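When I run the code above, all of the text after "source:" is extracted into the new column. I would like to know whether there is a way, using stringr or any other package, to extract only the first four words after "source:".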
Solution
You can use:
trimws(stringr::str_extract(x$end, '(?<=source:\\s)(\\w+,?\\s){4}'))
#[1] "from animal origin as"       "Eggs, liver, certain fish"
#    "Leafy green vegetables such"
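?<= is a positive lookbehind: the match must be immediately preceded by 'source:' and one whitespace character, and that prefix is not included in the result.
We then capture 4 "words", each made of word characters, an optional comma, and one trailing whitespace character. trimws() removes the trailing space left by the fourth word.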
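
Since the question also allows other approaches, here is a minimal sketch of an alternative that avoids the quantified group: strip the "source:" prefix with str_remove() and then take words 1 to 4 with stringr::word(), which splits on single spaces by default. The vector name first_four is only illustrative, and the sample strings assume a space after each comma, as the {4} pattern above requires. Unlike the regex, this does not return NA when one of the first four words contains a hyphen or other punctuation, so it may or may not be what you want.

```r
library(stringr)

# Sample data from the question (assumed: commas are followed by a space)
x <- data.frame(end = c(
  "source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general, liver and dairy products;",
  "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake",
  "source: Leafy green vegetables such as spinach; egg yolks; liver"
))

# Drop the leading "source:" and any whitespace after it
body <- str_remove(x$end, "^source:\\s*")

# Keep the first four space-separated words (first_four is an illustrative name)
first_four <- word(body, start = 1, end = 4)

print(first_four)
# [1] "from animal origin as"       "Eggs, liver, certain fish"
# [3] "Leafy green vegetables such"
```

The result can be assigned to x$source in the same way as in the question.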