regex - Why does strsplit use positive lookahead and lookbehind assertion matches differently?

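Common sense and a sanity check using gregexpr() indicate that the lookbehind and lookahead assertions below should each match at exactly one position in testString: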

testString <- "text XX text"
BB  <- "(?<= XX )"
FF  <- "(?= XX )"

as.vector(gregexpr(BB,testString,perl=TRUE)[[1]])
# [1] 9
as.vector(gregexpr(FF,testString,perl=TRUE)[[1]])
# [1] 5

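strsplit(), however, uses those match positions differently. It splits testString at one position when given the lookbehind assertion, but at two positions when given the lookahead assertion, and the second of those looks incorrect.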

strsplit(testString,BB,perl=TRUE)
# [[1]]
# [1] "text XX " "text"    

strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "text"    " "       "XX text"

I have two questions: (Q1) What is going on here? And (Q2) how can strsplit() be made to behave better?

Update: Theodore Lytras' excellent answer explains what is going on, and so addresses (Q1). My answer builds on his to identify a remedy, addressing (Q2).

I am not sure whether this qualifies as a bug, because I believe it is the expected behaviour based on the R documentation. From ?strsplit:

The algorithm applied to each input string is

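repeat {
    if the string is empty
        break.
    if there is a match
        add the string to the left of the match to the output.
        remove the match and all to the left of it.
    else
        add the string to the output.
        break.
}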

Note that this means that if there is a match at the beginning of
a (non-empty) string, the first element of the output is "", but
if there is a match at the end of the string, the output is the
same as with the match removed.

The problem is that lookahead (and lookbehind) assertions are zero-length. For example, in this case:

FF <- "(?=funky)"
testString <- "take me to funky town"

gregexpr(FF,testString,perl=TRUE)
# [[1]]
# [1] 12
# attr(,"match.length")
# [1] 0
# attr(,"useBytes")
# [1] TRUE

strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "take me to " "f"           "unky town"

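What happens is that the lone lookahead (?=funky) matches at position 12. So the first piece is the string up to position 11 (everything to the left of the match), and it is removed from the string together with the match, which, however, has zero length.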

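Now the remaining string is "funky town", and the lookahead matches at position 1. This time there is nothing to remove, because there is nothing to the left of the match and the match itself has zero length. Taken literally, the algorithm would be stuck in an infinite loop. Apparently R resolves this by splitting off a single character, which incidentally is the documented behaviour when splitting on an empty regex (argument split = ""). After that, the remaining string is "unky town", which is returned as the last piece because it contains no match.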

Lookbehinds are no problem, because each match is split off and removed from the remaining string, so the algorithm never gets stuck.
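To make the walk-through above concrete, here is a rough R emulation of the documented loop for a single string, including the apparent "peel off one character" fallback for an empty match at the very start. The function name emulate_strsplit is made up for this illustration, and the fallback rule is inferred from the observed output rather than taken from the documentation, so treat it as a sketch and not as R's actual implementation.

```r
# Hypothetical helper: emulates the loop from ?strsplit for one string,
# plus an assumed special case for an empty match at position 1.
emulate_strsplit <- function(x, split) {
  out <- character(0)
  while (nchar(x) > 0) {
    m     <- regexpr(split, x, perl = TRUE)
    start <- as.integer(m)
    if (start == -1L) {
      # No match: the rest of the string is the final piece.
      out <- c(out, x)
      break
    }
    end <- start + attr(m, "match.length") - 1L  # last character of the match
    if (end > 0L) {
      # Usual case: keep what is left of the match, then drop it and the match.
      out <- c(out, substr(x, 1L, start - 1L))
      x   <- substr(x, end + 1L, nchar(x))
    } else {
      # Empty match at position 1: nothing to remove, so split off one character.
      out <- c(out, substr(x, 1L, 1L))
      x   <- substr(x, 2L, nchar(x))
    }
  }
  out
}

emulate_strsplit("take me to funky town", "(?=funky)")
# pieces: "take me to ", "f", "unky town"
emulate_strsplit("text XX text", "(?<= XX )")
# pieces: "text XX ", "text"
```

With the lookbehind, the zero-length match sits to the right of the text it depends on, so removing "the match and all to the left of it" also removes what made it match, and the loop moves on.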

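Admittedly, this behaviour looks odd at first glance. Behaving otherwise, however, would violate the zero-length assumption for lookaheads. Given that the strsplit algorithm is documented, I believe this does not meet the definition of a bug.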
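As for (Q2), one workaround that follows from this explanation is to put a one-character lookbehind, (?<=.), in front of the lookahead. The combined assertion can no longer match at the very start of the remaining string, so the single-character fallback is never triggered. The snippet below is a sketch of that idea. The wrapper split_before is a hypothetical name, not a base R function, and this is not necessarily the remedy the asker settled on.

```r
testString <- "text XX text"

# (?<=.) requires at least one character before the split point, so the
# assertion cannot match again at position 1 of the remaining string.
strsplit(testString, "(?<=.)(?= XX )", perl = TRUE)[[1]]
# pieces: "text", " XX text"

# Hypothetical convenience wrapper: split before `pattern`, keeping it.
split_before <- function(x, pattern) {
  strsplit(x, paste0("(?<=.)(?=", pattern, ")"), perl = TRUE)
}

split_before("take me to funky town", "funky")[[1]]
# pieces: "take me to ", "funky town"
```

A side effect of this design is that a match at the very beginning of the input is not split off either, which is usually what is wanted when the delimiter is kept.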
