Finding repeated sentences in text

Question

I would like to know how to detect repetition within a single string. I have a list of sentences like this:

my_list=["do you want pizza for dinner? Do you want pizza for dinner?","I like pizza","I have no money I have no money"]

I want to create a pandas DataFrame in which a row gets the value 1 if its sentence is repeated within the text, and 0 otherwise.

Something like this:

Text                                                              Repeated?
do you want pizza for dinner? Do you want pizza for dinner?            1
I like pizza                                                           0
I have no money I have no money                                        1

I was thinking of something like this:

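from collections import Counter

# my_list is a list, so each sentence has to be split separately
for sentence in my_list:
    counts = Counter(sentence.split())  # word -> number of occurrences
    for word in sorted(counts):
        print(f'"{word}" is repeated {counts[word]} time(s).')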

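and then comparing the total number of words in each sentence with the number of unique words. But I don't think this is a good way to do it. Do you know of another way to get the expected result?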
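For reference, the word-counting idea described above might look like the minimal sketch below. The helper name `has_repeated_word` is hypothetical, and lowercasing the text before splitting is an assumption made so that "do" and "Do" count as the same word. The last call shows why this is only a rough heuristic: any sentence that reuses a single word is flagged, even if no sentence is repeated.

```python
my_list = [
    "do you want pizza for dinner? Do you want pizza for dinner?",
    "I like pizza",
    "I have no money I have no money",
]

def has_repeated_word(sentence):
    """Return 1 if any word occurs more than once in the sentence, else 0."""
    words = sentence.lower().split()   # assumption: ignore case
    # Fewer unique words than total words means at least one word repeats.
    return int(len(set(words)) < len(words))

flags = [has_repeated_word(s) for s in my_list]
print(flags)

# False positive: one repeated word, but no repeated sentence.
print(has_repeated_word("I think I like pizza"))
```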

Answer

You can use a regular expression for this task (regex101):

import re
import pandas as pd

my_list=["do you want pizza for dinner? Do you want pizza for dinner?","I like pizza","I have no money I have no money"]
df = pd.DataFrame({'Text': my_list})

r = re.compile(r'(.+)\s*\1$',flags=re.I)
df['Repeated'] = df['Text'].apply(lambda x: bool(r.match(x))).astype(int) 
print(df)

Prints:

                                                Text  Repeated
0  do you want pizza for dinner? Do you want pizz...         1
1                                       I like pizza         0
2                    I have no money I have no money         1
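To make the pattern easier to read, here is a sketch of the same regular expression written with `re.VERBOSE` so that each part carries a comment. The names `REPEATED` and `is_repeated` are hypothetical and used only for illustration; the pattern itself is the one from the answer.

```python
import re

# Same pattern as r'(.+)\s*\1$' with re.I, spelled out piece by piece.
REPEATED = re.compile(
    r"""
    (.+)    # group 1: the candidate sentence (one or more characters)
    \s*     # optional whitespace between the two copies
    \1      # the text of group 1 again (case-insensitive because of re.I)
    $       # the second copy must reach the end of the string
    """,
    flags=re.I | re.X,
)

def is_repeated(text):
    """Return 1 if text consists of a chunk followed by the same chunk, else 0."""
    # re.match anchors at the start, so the two copies must span the whole string.
    return int(REPEATED.match(text) is not None)

print(is_repeated("I have no money I have no money"))
print(is_repeated("I like pizza"))
```

Because the match is anchored at both ends, the string has to be exactly two copies of the same text. A sentence that appears three times, such as "abc abc abc", is therefore not flagged by this pattern.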
