stringr extract regular expression not working as expected

Problem description

Suppose arg is as follows:

\tLUKE\r\n\t\t\t(over comlink)\r\n\t\tGood luck,Lando\r\n\r\n\t\t\t\tLANDO\r\n\t\t\t(into comlink)\r\n\t\tWhen we find Jabba the Hut and \r\n\t\tthat bounty hunter,we'll contact \r\n\t\tyou.\r\n\r\n

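I am trying to use the code below to extract every substring of arg that starts with "\t|\n|\r" followed by several uppercase letters and ends with "\r\n\r\n", but I get no matches: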

str_extract_all(arg,"(\t|\n|\r)[A-Z]{1}.*?[A-Z]{2}(\r\n\t\t\t).*?(?=(\r\n\r\n))")

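I expect this code to return "\tLUKE\r\n\t\t\t(over comlink)\r\n\t\tGood luck,Lando\r\n\r\n" and "\tLANDO\r\n\t\t\t(into comlink)\r\n\t\tWhen we find Jabba the Hut and \r\n\t\tthat bounty hunter,we'll contact \r\n\t\tyou.\r\n\r\n".

When I drop the positive lookahead at the end, the rest of the pattern matches, and I get "\tLUKE\r\n\t\t\t" and "\tLANDO\r\n\t\t\t" back, as expected.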

str_extract_all(arg,"(\t|\n|\r)[A-Z]{1}.*?[A-Z]{2}(\r\n\t\t\t).*?")

What am I missing here?

Solution

If you do not need the captured values afterwards, you can omit the capture groups. Also, {1} is redundant and can be removed.
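
As a small illustration of both points, the sketch below compares a capturing group, a non-capturing group, and a plain character class on a shortened input. The string x and the variable names are made up for this example; str_match() is used only because it shows the captured groups as extra columns.

```r
library(stringr)

# Hypothetical shortened input, not the full arg from the question.
x <- "\tLUKE"

# Capturing group: str_match() returns the full match plus one column per group.
grouped <- str_match(x, "(\\t|\\n|\\r)[A-Z]+")

# Non-capturing group: same match, no extra column.
non_capturing <- str_match(x, "(?:\\t|\\n|\\r)[A-Z]+")

# A character class expresses the same alternation without any group.
char_class <- str_extract(x, "[\\t\\n\\r][A-Z]+")

# {1} changes nothing: both patterns match exactly one uppercase letter.
same_single <- identical(str_extract(x, "[A-Z]{1}"), str_extract(x, "[A-Z]"))

print(ncol(grouped))        # 2 columns: full match and group 1
print(ncol(non_capturing))  # 1 column: full match only
print(same_single)          # TRUE
```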

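A .*? at the very end of the pattern will not match anything: the quantifier is non-greedy, and nothing follows it that would force it to consume any characters.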
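
The behavior of a trailing lazy quantifier can be checked on a short string. This is only a sketch; the input "abc123" and the variable names are invented for the illustration.

```r
library(stringr)

x <- "abc123"

# A lazy quantifier at the very end is never forced to consume anything,
# so it matches the empty string and only the letters are returned.
lazy_trailing <- str_extract(x, "[a-z]+.*?")

# Once something follows it (here the end-of-string anchor), the lazy
# quantifier has to expand until that following part can match.
lazy_anchored <- str_extract(x, "[a-z]+.*?$")

print(lazy_trailing)  # "abc"
print(lazy_anchored)  # "abc123"
```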

To make the pattern less strict, you can use quantifiers instead of specifying the exact number of tabs and newlines.
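
For instance, the two speaker lines in arg are indented with one and four tabs respectively, so a pattern that hard-codes a single tab at the start of the line only fits the first one. A minimal sketch, with the vector lines invented for the illustration:

```r
library(stringr)

# The two speaker lines from arg, taken as separate strings.
lines <- c("\tLUKE", "\t\t\t\tLANDO")

# Exactly one tab before the name: only the first line fits.
exact_count <- str_detect(lines, "^\\t[A-Z]+$")

# One or more tabs before the name: both lines fit.
quantified <- str_detect(lines, "^\\t+[A-Z]+$")

print(exact_count)  # TRUE FALSE
print(quantified)   # TRUE TRUE
```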

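To prevent unnecessary backtracking, you can match the line that consists only of uppercase characters, and then match all following lines that do not consist only of uppercase characters.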

^[^\S\r\n]+[A-Z]+(?:\r?\n(?![^\S\r\n]*[A-Z]+$).*)*

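^ Start of a line (with the (?m) flag; otherwise start of the string)
[^\S\r\n]+ Match 1+ whitespace characters other than line breaks
[A-Z]+ Match 1+ uppercase characters
(?: Non-capturing group
- \r?\n(?![^\S\r\n]*[A-Z]+$) Match a line break and assert that the next line does not consist only of an uppercase word
- .* If the assertion holds, match the whole line
)* Close the group and repeat it 0+ times to match all such lines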

Example using (?m) for multiline mode

library(stringr)

arg <- "\tLUKE\r\n\t\t\t(over comlink)\r\n\t\tGood luck,Lando\r\n\r\n\t\t\t\tLANDO\r\n\t\t\t(into comlink)\r\n\t\tWhen we find Jabba the Hut and \r\n\t\tthat bounty hunter,we'll contact \r\n\t\tyou.\r\n\r\n"
str_extract_all(arg,"(?m)^[^\\S\\r\\n]+[A-Z]+(?:\\r?\\n(?![^\\S\\r\\n]*[A-Z]+$).*)*")

Output

[[1]]
[1] "\tLUKE\r\n\t\t\t(over comlink)\r\n\t\tGood luck,Lando\r\n"                                                                                
[2] "\t\t\t\tLANDO\r\n\t\t\t(into comlink)\r\n\t\tWhen we find Jabba the Hut and \r\n\t\tthat bounty hunter,we'll contact \r\n\t\tyou.\r\n\r\n"
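
Both results keep the line breaks that follow the last text line. If those are not wanted, one possible follow-up is to trim them after extraction. The sketch below assumes that only trailing \r and \n characters should be dropped; the names pattern, blocks, and trimmed are made up for the example.

```r
library(stringr)

arg <- "\tLUKE\r\n\t\t\t(over comlink)\r\n\t\tGood luck,Lando\r\n\r\n\t\t\t\tLANDO\r\n\t\t\t(into comlink)\r\n\t\tWhen we find Jabba the Hut and \r\n\t\tthat bounty hunter,we'll contact \r\n\t\tyou.\r\n\r\n"

# Same pattern as in the example above.
pattern <- "(?m)^[^\\S\\r\\n]+[A-Z]+(?:\\r?\\n(?![^\\S\\r\\n]*[A-Z]+$).*)*"
blocks <- str_extract_all(arg, pattern)[[1]]

# Drop any run of carriage returns / line feeds at the very end of each block.
trimmed <- str_remove(blocks, "[\\r\\n]+$")

print(trimmed)
```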

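By default, the dot (.) does not match line breaks (see, for example, the dotall option in help(stri_opts_regex)), which is why the .*? part does not capture what you want. You can enable this behavior with the (?s) flag: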

str_extract_all(arg,"(?s)(\t|\n|\r)[A-Z]{1,}(\r\n\t\t\t).*?(?=\r\n\r\n)")

[[1]]
[1] "\tLUKE\r\n\t\t\t(over comlink)\r\n\t\tGood luck,Lando"                                                                      
[2] "\tLANDO\r\n\t\t\t(into comlink)\r\n\t\tWhen we find Jabba the Hut and \r\n\t\tthat bounty hunter,we'll contact \r\n\t\tyou."
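
The effect of the flag is easier to see on a short two-line string. The sketch below uses an invented input and also shows stringr's regex() helper with dotall = TRUE, which should be equivalent to the inline (?s) flag.

```r
library(stringr)

# Hypothetical two-line input with a Windows-style line break.
x <- "first\r\nsecond"

# Without dotall, the dot stops at the line break.
no_dotall <- str_extract(x, "first.*")

# Inline flag: the dot now also matches line breaks.
inline_flag <- str_extract(x, "(?s)first.*")

# Same option set through stringr's regex() helper.
via_regex <- str_extract(x, regex("first.*", dotall = TRUE))

print(no_dotall)                # "first"
print(identical(inline_flag, x))  # TRUE
print(identical(via_regex, x))    # TRUE
```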
