Surprising but correct behavior of a greedy subexpression in a positive lookbehind assertion

Problem description

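Note:

The observed behavior is correct, but may be surprising at first; it was to me, and I suspect it may be to others too, though probably not to those intimately familiar with regex engines.
The repeatedly suggested duplicate, Regex lookahead,lookbehind and atomic groups, contains general information about look-around assertions, but does not address the specific misconception at hand, as discussed in more detail in the comments below.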

A greedy (and therefore, by definition, variable-width) subexpression inside a positive lookbehind assertion can exhibit surprising behavior.

The examples use PowerShell for convenience, but the behavior applies to the .NET regex engine in general:

This command works as I intuitively expect:

# OK:
#     The subexpression matches greedily from the start up to and
#     including the last "_", and, by including the matched string ($&)
#     in the replacement string, effectively inserts "|" there - and only there.
PS> 'a_b_c' -replace '^.+_','$&|'
a_b_|c

The following command uses a positive lookbehind assertion, (?<=...), and is seemingly equivalent, but is not:

# CORRECT, but SURPRISING:
#   Use a positive lookbehind assertion to *seemingly* match
#   only up to and including the last "_", and insert a "|" there.
PS> 'a_b_c' -replace '(?<=^.+_)','|'
a_|b_|c  # !! *multiple* insertions were performed

Why isn't this equivalent? Why were multiple insertions performed?

Solution

tl;dr：

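Inside a lookbehind assertion, a greedy subexpression in effect also behaves non-greedily (under global matching, in addition to behaving greedily), because every prefix of the input string is considered.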

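My problem was that I hadn't considered that, with a lookbehind assertion, every character position in the input string must be checked for whether the text preceding it, up to that point, matches the subexpression inside the lookbehind assertion.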

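Combined with the always-global replacement that PowerShell's -replace operator performs (that is, all possible matches are replaced), this resulted in multiple insertions: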

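That is, when the text to the left of the character position currently being considered is examined, the greedy, anchored subexpression ^.+_ legitimately matches twice: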

First, when a_ is the text to the left.
Then, when a_b_ is the text to the left.

Therefore, | was inserted twice.
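
The per-position check described above can be sketched by hand. This is only an illustration of the reasoning, not how the .NET engine is actually implemented: the variable names are made up for the example, and \z is added so that the pattern has to consume the whole prefix, which is what the lookbehind effectively requires at each position.

```powershell
# Sketch: emulate what the lookbehind (?<=^.+_) has to verify at every position.
$inputString = 'a_b_c'   # sample input from the question

# For each position 0..Length, take the text to the LEFT of that position and
# test whether that prefix, as a whole, matches ^.+_ (\z pins the match to the
# end of the prefix).
$matchingPositions = foreach ($pos in 0..$inputString.Length) {
    $prefix = $inputString.Substring(0, $pos)
    if ($prefix -match '^.+_\z') { $pos }
}

# Positions at which the lookbehind succeeds: 2 (after "a_") and 4 (after "a_b_").
$matchingPositions -join ', '
```

Each of these positions is a separate zero-width match, so a global replacement inserts | at both of them.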

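By contrast, without a lookbehind assertion, the greedy expression ^.+_ by definition matches only once, through to the last _, because it is applied to the entire input string only.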
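
The difference can also be made visible by listing the matches directly with the .NET [regex]::Matches() method instead of replacing. A small sketch, using the same sample string as the question (the variable names are only for illustration):

```powershell
# Without a lookbehind: one match that consumes everything through the last "_".
$greedy = [regex]::Matches('a_b_c', '^.+_')

# With the lookbehind: zero-width matches, one per position whose preceding
# text satisfies ^.+_ .
$lookbehind = [regex]::Matches('a_b_c', '(?<=^.+_)')

$greedy.Count                                  # 1
$greedy[0].Value                               # a_b_
$lookbehind.Count                              # 2
$lookbehind | ForEach-Object { $_.Index }      # 2 and 4
```

Since -replace substitutes every match, the single match in the first case yields one insertion, while the two empty matches in the second case yield two.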

regex regex-greedy regex-lookarounds