Problem description
I have extracted text from a PDF and am trying to split it into sentences. A representative sample of the text:
"This is multiline text without any delimiter need to \n be considered as one sentence \n Whereas this sentence is one liner \n Slash n or first char capital is not option as sentences of \n Dhiraj's sample can contain first letter capital even its not a new sentence"
The result should be:
["This is multiline text without any delimiter need to be considered as one sentence","Whereas this sentence is one liner","Slash n or first char capital is not option as sentences of Dhiraj's sample can contain first letter capital even its not a new sentence"]
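My temporary solution is to take the maximum sentence length, treat a line of that length as part of a multiline sentence, and remove the \n there. However, it is not reliable.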
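Solution
There is a solution, but it requires some manual work:
- Create a list of proper nouns, find every proper noun in the text using that list, and convert each one to lowercase with a search method.
- Then write your main code block, which splits the text into sentences at each word that begins with a capital letter.
- Finally, use the list of names to re-capitalize the names in the text.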
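
The three steps above could be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the function name `split_sentences` and the hand-written `proper_nouns` list are assumptions made for the example, and it assumes the `\n` in the sample are real newline characters.

```python
import re


def split_sentences(text, proper_nouns):
    """Split delimiter-free text into sentences at capitalized words.

    `proper_nouns` is a hand-maintained list (an assumption of this sketch).
    """
    # Step 1: lowercase every known proper noun so it cannot trigger a split.
    for name in proper_nouns:
        text = re.sub(r"\b" + re.escape(name) + r"\b", name.lower(), text)

    # Collapse newlines and repeated whitespace into single spaces.
    text = " ".join(text.split())

    # Step 2: split at whitespace that is followed by an uppercase letter.
    sentences = re.split(r"\s+(?=[A-Z])", text)

    # Step 3: restore the original capitalization of the proper nouns.
    restored = []
    for sentence in sentences:
        for name in proper_nouns:
            sentence = re.sub(
                r"\b" + re.escape(name.lower()) + r"\b", name, sentence
            )
        restored.append(sentence)
    return restored


sample = (
    "This is multiline text without any delimiter need to \n"
    " be considered as one sentence \n"
    " Whereas this sentence is one liner \n"
    " Slash n or first char capital is not option as sentences of \n"
    " Dhiraj's sample can contain first letter capital even its not a new sentence"
)

for sentence in split_sentences(sample, ["Dhiraj"]):
    print(sentence)
```

On the sample text this yields the three sentences listed in the question. The approach has limits that follow from the steps themselves: a sentence that begins with a listed proper noun is merged into the previous one, a name that is also an ordinary word gets capitalized everywhere in the last step, and any capitalized word missing from the list (an acronym, or "I") starts a new sentence.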