Regex to remove repeated phrases in a multi-line string

Problem description

What the problem is:

I have a multi-line text, for example:

1: This is test string for my app. d
2: This is test string for my app.
3: This is test string for my app. abcd
4: This is test string for my app.
5: This is test string for my app.
6: This is test string for my app.
7: This is test string for my app. d
8: This is test string for my app.
9: This is test string for my app.
10: This is another string.

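The line numbers here are only for better visualization; they are not part of the text itself.

What I have tried:

I tried two different regular expressions (the flags are always i, g and m):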

^([^\r\n]*)$(.*?)(?:(?:\r?\n|\r)\1)+$

See here: regexr.com/5nklg

and

^(.*)(?:\r?\n|\r)(?=[\s\S]*^\1$)

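See here: regexr.com/5nkla

The two produce different output. Both are good, but neither is perfect.

What I want to achieve:

Remove all repeated phrases in the text, but keep one of them. For example, keep the first "This is test string for my app." here: starting from line 1, match the same phrase on lines 2-9, and keep line 10 as it is.

Keeping the last matching phrase instead of the first would also work for me. In that case the matches would be on lines 1-8, with lines 9 and 10 kept.

Is there a way to do this with a regular expression?

FYI: later I will use the regex in Python to remove the duplicates: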

re.sub(r"^(.*)(?:\r?\n|\r)(?=[\s\S]*^\1$)","",my_text,flags=re.MULTILINE)
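To make the behaviour of that call concrete, here is a minimal sketch that runs it on the sample text. The names `DUPLICATE_LINE`, `my_text` and `deduplicated` are illustrative, and the sample is assumed to use `\n` line breaks with no trailing newline. Because the lookahead only accepts a later line that is identical as a whole (`^\1$`), this pattern removes whole-line duplicates and keeps the last copy of each; it does not touch a phrase that is only part of a line.

```python
import re

# Pattern from the question: match a whole line plus its line break when an
# identical line occurs somewhere later in the text.
DUPLICATE_LINE = re.compile(
    r"^(.*)(?:\r?\n|\r)(?=[\s\S]*^\1$)", flags=re.MULTILINE
)

# Sample text from the question, without the visual line numbers
# (assumption: "\n" line breaks, no trailing newline).
my_text = "\n".join([
    "This is test string for my app. d",
    "This is test string for my app.",
    "This is test string for my app. abcd",
    "This is test string for my app.",
    "This is test string for my app.",
    "This is test string for my app.",
    "This is test string for my app. d",
    "This is test string for my app.",
    "This is test string for my app.",
    "This is another string.",
])

# Every line that has an identical later copy is deleted, so the last copy
# of each distinct line survives.
deduplicated = DUPLICATE_LINE.sub("", my_text)
print(deduplicated)
```

On this sample, the surviving lines are the original lines 3, 7, 9 and 10: lines 1 and 7 are identical, so only line 7 remains, and the bare phrase remains only on line 9.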

Edit: by "phrase" I mean, let's say, 3 or more words. So match any repetition that is longer than 2 words. The expected output after the first sub is therefore:

This is test string for my app. d  //from line 1
This is test string for my app.    //from line 2
abcd                               //from line 3
This is another string.            //from line 10
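One possible reading of "keep the first occurrence, remove the rest" at phrase level can be sketched as follows. This is only a sketch under assumptions: the phrase is passed in explicitly rather than detected automatically, the helper name `keep_first_phrase` is hypothetical, and lines that become empty are dropped. It does not reproduce the listing above exactly; on the sample text it keeps line 1, reduces line 3 to "abcd" and line 7 to "d", and keeps line 10.

```python
import re


def keep_first_phrase(text: str, phrase: str) -> str:
    """Remove every occurrence of `phrase` except the first one.

    Hypothetical helper: the phrase must be known in advance. Lines that
    are empty after the removal (including originally blank lines) are
    dropped from the result.
    """
    # re.escape makes characters such as "." in the phrase match literally.
    pattern = re.compile(re.escape(phrase), flags=re.IGNORECASE)
    seen = False
    kept_lines = []

    def replace(match: re.Match) -> str:
        nonlocal seen
        if not seen:
            seen = True
            return match.group(0)  # keep the first occurrence unchanged
        return ""                  # delete every later occurrence

    for line in text.splitlines():
        cleaned = pattern.sub(replace, line).strip()
        if cleaned:
            kept_lines.append(cleaned)
    return "\n".join(kept_lines)


sample = "\n".join([
    "This is test string for my app. d",
    "This is test string for my app.",
    "This is test string for my app. abcd",
    "This is test string for my app. d",
    "This is another string.",
])
print(keep_first_phrase(sample, "This is test string for my app."))
```

Processing the text line by line with a replacement callback keeps the "first occurrence" bookkeeping out of the pattern itself, which is awkward to express in a single regular expression.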

Thanks in advance!


eda python regex