How can I use Levenshtein distance to find the beginnings of sentences in a text file that are similar to given sentences?

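Problem description

I need to find where each sentence from an array begins in a text file, but the sentences as they appear in the file may differ slightly from the ones in the array.

I was thinking of using Levenshtein distance to compare the sentences, but the problem is: what should I compare them with? The file is large, and each sentence is at most one line long.

Here is my current code, which does a simple exact comparison with no similarity distance.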

import re
import pandas as pd

data = pd.read_excel("./excel_file_with_the_sentences.xlsx")
df = pd.DataFrame(data, columns=['Année', 'Journal', 'A_Sommaire', 'Numero'])
# print(df)

# Keep only the 2018 rows, sorted by issue number.
# sort_values returns a new frame, so the filtered result is not modified in place.
jo = df.query("Année == 2018").sort_values(by=['Numero'])
# "A_Sommaire" contains the sentences; the other fields are only there to filter and sort
print(jo["A_Sommaire"])
print(len(jo))
#################################################################################

file_path = "./the_file_with_the_text.txt"

# Read the whole text file; the context manager closes it afterwards
with open(file_path) as file:
    txt = file.read()
##################################################################################

titles = list(jo["A_Sommaire"])
print(titles)
beginnings = []
for title in titles:
    # Here I get an iterator over the exact matches of the title in the text,
    # and I want to change it so that it searches for the first "similar"
    # title or sentence instead.
    # Note: the title is used as a regex pattern, so characters such as "." or "("
    # in a title are read as regex syntax rather than literally.
    beginning = re.finditer(title, txt, flags=re.MULTILINE)
    beginnings.append([b.start() for b in beginning])

print(beginnings)

This is the result:

[[],[],[13898],[17136],[17645],[18743],[19886],[21010],[22165],[26885],[31049],[33333],[35260],[37339],[39760],[41822],[45880],[54839],[]]

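This is incomplete. Normally there should be no empty lists, because every sentence in the Excel file should appear at least once in the text file.

So my question is: how can I use Levenshtein distance, or any other method, to find the beginnings of all the sentences in the text?

Note that the files are too large to share even as an example, sorry about that.

Solution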

No confirmed solution to this problem has been posted yet.
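Pending a confirmed answer, one possible approach is sketched below. It addresses the "what do I compare with?" question by assuming that each sentence sits on a line of its own in the text file, so every title can be compared with every line. The function names `levenshtein` and `find_similar_starts` and the `max_ratio` threshold of 0.2 are illustrative choices, not something taken from the question, and the threshold would need tuning on the real data.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def find_similar_starts(titles, text, max_ratio=0.2):
    """For each title, list the offsets in `text` of the lines that are close to it.

    Assumption: each sentence occupies one whole line. A line counts as a match
    when its edit distance to the title is at most max_ratio times the longer
    of the two lengths. The offset is that of the start of the raw line.
    """
    # Precompute (offset of line start, stripped line content) once.
    lines = []
    offset = 0
    for raw in text.splitlines(keepends=True):
        lines.append((offset, raw.strip()))
        offset += len(raw)

    results = []
    for title in titles:
        title = title.strip()
        hits = []
        for start, line in lines:
            longest = max(len(title), len(line))
            if longest == 0:
                continue
            # Cheap filter: the distance is never smaller than the length difference.
            if abs(len(title) - len(line)) > max_ratio * longest:
                continue
            if levenshtein(title, line) <= max_ratio * longest:
                hits.append(start)
        results.append(hits)
    return results
```

With the variables from the question, `beginnings = find_similar_starts(titles, txt)` would give a list of the same shape as the one printed above. The pure-Python distance is slow on large files; the length filter removes most lines early, and a compiled library such as RapidFuzz could replace `levenshtein` if speed becomes a problem.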


dataframe levenshtein-distance python regex similarity