How can I extract data from a text file as sentences, where a sentence is defined as the lines of data between blank lines?

Problem description

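The data is in a text file, and I want to group it into sentences. A sentence is defined as a run of consecutive lines, each containing at least one character. Blank lines separate the lines of data, so I want blank lines to mark the start and end of each sentence. Is there a way to do this with a list comprehension?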

An example from the text file. The data looks like this:

This is the
first sentence.

This is a really long sentence
and it just keeps going across many
rows there will not necessarily be 
punctuation
or consistency in word length
the only difference in ending sentence
is the next row will be blank

here would be the third sentence
as 
you see
the blanks between rows of data 
help define what a sentence is

this would be sentence 4
i want to pull data
from text file
as such (in sentences) 
where sentences are defined with
blank records in between

this would be sentence 5 since blank row above it
and continues but ends because blank row(s) below it

Solution

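You can read the whole file into a single string with file_as_string = file_object.read(). Splitting that string on empty lines is equivalent to splitting on two consecutive newline characters, so we can write sentences = file_as_string.split("\n\n"). Finally, you will probably want to remove the newlines that remain inside each sentence. A list comprehension does this by replacing each newline with a space (replacing it with an empty string would glue the last word of one line to the first word of the next): sentences = [s.replace('\n', ' ') for s in sentences]

Altogether, this gives:

# file_object is an already opened text file
file_as_string = file_object.read()
sentences = file_as_string.split("\n\n")
sentences = [s.replace('\n', ' ') for s in sentences]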
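
As a minimal sketch, the steps above can be wrapped in a function and tried on an in-memory string. The name `split_sentences` and the sample text are illustrative assumptions, not part of the original answer. Joining with single spaces and skipping empty pieces are also choices made here, so that words do not run together and a run of three or more newlines does not produce an empty sentence.

```python
def split_sentences(text):
    """Split text into sentences separated by an empty line (hypothetical helper)."""
    sentences = []
    for block in text.split("\n\n"):
        # str.split() with no arguments splits on any run of whitespace, so
        # line breaks and trailing spaces collapse into single spaces.
        sentence = " ".join(block.split())
        if sentence:  # skip pieces that contain no text at all
            sentences.append(sentence)
    return sentences


sample = "This is the\nfirst sentence.\n\nas \nyou see\n\n\nlast one\n"
print(split_sentences(sample))
```

Note that this still treats only truly empty lines as separators: a line that contains just spaces or tabs would not split two sentences.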

You can also do this quite effectively with a regular-expression split.

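If you only want to split on blank lines, including lines that contain nothing but spaces or tabs, use this pattern to match such a line: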

^[ \t]*$
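
To see what this pattern matches, here is a small sketch using Python's `re` module. With `re.MULTILINE`, `^` and `$` anchor at the start and end of every line, so the pattern matches only lines that are empty or hold nothing but spaces and tabs. The helper name `blank_line_numbers` and the sample text are assumptions for illustration.

```python
import re

# The pattern from the answer; re.MULTILINE makes ^ and $ anchor at each line.
BLANK_LINE = re.compile(r"^[ \t]*$", flags=re.MULTILINE)


def blank_line_numbers(text):
    """Return the 1-based numbers of the lines the pattern treats as blank."""
    # The line number of a match is the count of newlines before it, plus one.
    return [text.count("\n", 0, m.start()) + 1 for m in BLANK_LINE.finditer(text)]


sample = "first line\n\nsecond sentence\n \t \nthird sentence"
print(blank_line_numbers(sample))  # [2, 4]
```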


In Python, you can do the following:

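import re

# fn is the path to the text file
with open(fn) as f_in:
    # split at each newline that is followed by a blank (or whitespace-only) line
    sentences = re.split(r'\r?\n^[ \t]*$', f_in.read(), flags=re.M)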

If you also want to remove the individual \n characters inside each sentence:

with open(fn) as f_in:
    # collapse the line breaks inside each sentence into single spaces
    sentences = [re.sub(r'[ \t]*(?:\r?\n)+', ' ', s).strip()
                 for s in re.split(r'\r?\n^[ \t]*$', f_in.read(), flags=re.M)]
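
Putting the regex steps together, here is a sketch under a few assumptions: the text is already in memory rather than read from a file, the function name `sentences_from_text` is hypothetical, and empty pieces (which can come from several blank lines in a row or from a trailing newline) are dropped as an extra step.

```python
import re


def sentences_from_text(text):
    """Split on blank lines, then collapse each sentence onto one line."""
    # Split at a newline that is followed by an empty or space/tab-only line.
    pieces = re.split(r"\r?\n^[ \t]*$", text, flags=re.MULTILINE)
    sentences = []
    for piece in pieces:
        # Replace each run of line breaks (and any spaces before it) with one space.
        joined = re.sub(r"[ \t]*(?:\r?\n)+", " ", piece).strip()
        if joined:  # drop pieces that contain no text
            sentences.append(joined)
    return sentences


sample = (
    "This is the\n"
    "first sentence.\n"
    "\n"
    "here would be the third sentence\n"
    "as \n"
    "you see\n"
    "  \n"
    "\n"
    "this would be sentence 4\n"
)
print(sentences_from_text(sample))
```

Unlike splitting on "\n\n", this version also treats a line containing only spaces or tabs as a sentence boundary.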