How do I match words that are not immediately followed by a keyword, including words that have the keyword on neither side?

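Question

I am trying to find the words that do not come immediately before 'the'.

I used a positive lookbehind, (?<=the\W), to get the words that come right after the keyword 'the'. However, I cannot capture 'people' and 'that', because this logic does not apply to them.

I cannot handle words that have the keyword 'the' neither before nor after them (for example, 'that' and 'people' in the sentence).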

import re

p = re.compile(r'(?<=the\W)\w+')
m = p.findall('the part of the fair that attracts the most people is the fireworks')

print(m)

The current output is:

['part', 'fair', 'most', 'fireworks']

Edit:

Thanks for all the help below. Using the suggestions from the comments, I managed to update my code.

p = re.compile(r"\b(?!the)(\w+)(\W\w+\Wthe)?")
m = p.findall('the part of the fair that attracts the most people is the fireworks')

This gets me closer to the output I need.

Updated output:

[('part', ' of the'), ('fair', ''), ('that', ' attracts the'), ('most', ''), ('people', ' is the'), ('fireworks', '')]

I only need the strings ('part', 'fair', 'that', 'most', 'people', 'fireworks'). Any suggestions?

Answers

> I am trying to find the words that do not come immediately before 'the'.

Note that the code below does not use re.

words = 'the part of the fair that attracts the most people is the fireworks'
words_list = words.split()
words_not_before_the = []
for idx,w in enumerate(words_list):
    if idx < len(words_list)-1 and words_list[idx + 1] != 'the':
        words_not_before_the.append(w)
words_not_before_the.append(words_list[-1])
print(words_not_before_the)

Output:

['the', 'part', 'the', 'fair', 'that', 'the', 'most', 'people', 'the', 'fireworks']
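The loop above keeps every occurrence of 'the' itself, because 'the' is never directly followed by another 'the' in this sentence. If the keyword should be dropped as well, a minimal sketch could look like the following. The names sentence, keyword, and result are only illustrative and do not come from the answer above.

```python
sentence = 'the part of the fair that attracts the most people is the fireworks'
keyword = 'the'  # hypothetical name for the word to filter on

words = sentence.split()
result = []
for idx, word in enumerate(words):
    # The last word has no successor, so it can never precede the keyword.
    next_word = words[idx + 1] if idx + 1 < len(words) else None
    # Keep a word only if it is not the keyword and is not followed by it.
    if word != keyword and next_word != keyword:
        result.append(word)

print(result)
# ['part', 'fair', 'that', 'most', 'people', 'fireworks']
```

This stays free of re, like the loop above, and only adds the extra comparison against the keyword.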

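Try working around it: instead of searching for the words that do not come immediately before 'the', remove every word that does come immediately before 'the', together with 'the' itself.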

import re
test = "the part of the fair that attracts the most people is the fireworks"
pattern = r"\s\w*\sthe|the\s"
print(re.sub(pattern,"",test))

Output: part fair that most people fireworks

Using a regular expression:
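Two things may be worth adjusting in the substitution approach. The question asks for a list rather than a string, and the the\s alternative has no word boundary, so it would also match the tail of a word such as "bathe". A possible variant, offered as a sketch and not as the answerer's own code, anchors the keyword with \b and splits the result. The name remaining is illustrative.

```python
import re

test = "the part of the fair that attracts the most people is the fireworks"

# First alternative: a word (plus leading whitespace) directly followed by "the".
# Second alternative: a standalone "the" that has no word in front of it.
pattern = r"\s*\b\w+\s+the\b|\bthe\b"

remaining = re.sub(pattern, "", test).split()
print(remaining)
# ['part', 'fair', 'that', 'most', 'people', 'fireworks']
```

Calling split() with no argument also discards the leading space that the substitution leaves behind.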

import re
# Remove each word that is directly followed by ' the', along with that ' the'
m = re.sub(r'\b(\w+)\b the', '', 'the part of the fair that attracts the most people is the fireworks')
print([word for word in m.split(' ') if not word.isspace() and word])

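Output:

['the', 'part', 'fair', 'that', 'most', 'people', 'fireworks']

> I am trying to find the words that do not come immediately before 'the'.

Try this: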

import re

# The capture group (\w+) matches a word that is followed by a word, followed by the word "the"
p = re.compile(r'(\w+)\W\w+\Wthe')
m = p.findall('the part of the fair that attracts the most people is the fireworks')
print(m)

Output:

['part', 'that', 'people']

I finally solved it. Thanks, everyone!

p = re.compile(r"\b(?!the)(\w+)(?:\W\w+\Wthe)?")
m = p.findall('the part of the fair that attracts the most people is the fireworks')
print(m)

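I made the trailing group non-capturing by adding '?:', so findall returns only the word captured by the first group.

Output:

['part', 'fair', 'that', 'most', 'people', 'fireworks']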
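One caveat with the pattern above: (?!the) rejects any word that merely starts with "the", such as "there", and a word directly before 'the' is only skipped when the optional group happens to swallow it (in 'of the fair', for instance, 'of' is still returned). A possible alternative, sketched here under the assumption that matching should be case-sensitive and that only the exact word "the" counts, uses two negative lookaheads instead:

```python
import re

sentence = 'the part of the fair that attracts the most people is the fireworks'

# (?!the\b)    : the word itself is not exactly "the"
# (?!\W+the\b) : the word is not immediately followed by "the"
p = re.compile(r"\b(?!the\b)\w+\b(?!\W+the\b)")

print(p.findall(sentence))
# ['part', 'fair', 'that', 'most', 'people', 'fireworks']
```

Because the pattern has no capture groups, findall returns the matched words directly as strings.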
