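Problem description
I need help removing sections from this text file (https://www.gutenberg.org/files/768/768.txt) on Google Colab. I need the text to start after "ccx074@pglaf.org" and end before "END OF THE PROJECT GUTENBERG EBOOK WUTHERING HEIGHTS" so that the total word count is accurate. I also need help saving the list of words as a file so that I can get the correct word count. My code so far is listed below.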
# download and install pyspark in Colab
!pip install -q pyspark
# download Wuthering Heights, by Emily Bronte
!wget -q https://www.gutenberg.org/files/768/768.txt
import os.path
baseDir = os.path.join('data')
inputPath = os.path.join('/content/768.txt')
fileName = os.path.join(baseDir, inputPath)
with open('/content/768.txt', 'r') as f:
    for line in f:
        for word in line.split():
            print(word)
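Solution
The simplest way is probably to use two "for line in f:" loops inside the same "with open(filename) as f:" block: one to read lines up to the start text, and another to read lines up to the end text. If you break out of the first for loop when you find a line containing the start text, the next for loop carries on with the same iterator. That means it continues from the position already reached instead of restarting from the top of the file. The else clause on each loop runs only when that loop finishes without hitting break, so it reports a marker that was never found.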
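To see the resume-where-you-left-off behaviour in isolation, here is a minimal sketch that uses an in-memory text stream instead of the real file. The sample text, the marker strings, and the helper name lines_between are made up for illustration and are not part of the original answer.

```python
import io

# Hypothetical sample text standing in for the downloaded file.
sample = io.StringIO(
    "header line\n"
    "START MARKER\n"
    "first body line\n"
    "second body line\n"
    "END MARKER\n"
    "footer line\n"
)

def lines_between(f, start_text, end_text):
    """Return the lines strictly between the start and end marker lines."""
    # First loop: consume lines until the start marker has been read.
    for line in f:
        if start_text in line:
            break
    # Second loop: the same file object keeps its position, so this
    # picks up on the line right after the start marker.
    body = []
    for line in f:
        if end_text in line:
            break
        body.append(line.rstrip("\n"))
    return body

body = lines_between(sample, "START MARKER", "END MARKER")
print(body)  # ['first body line', 'second body line']
```

Neither marker line ends up in the result, because each one is consumed by the loop that detects it.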
filename = '768.txt'
start_text = 'ccx074@pglaf.org'
end_text = 'END OF THE PROJECT GUTENBERG EBOOK WUTHERING HEIGHTS'
with open(filename) as f:
    # skip the header bit
    for line in f:
        if start_text in line:
            break
    else:
        print("end of file reached without seeing the start text")
    # count the words - finish when we get to the end line
    count = 0
    for line in f:
        if end_text in line:
            break
        count += len(line.strip().split())
    else:
        print("end of file reached without seeing the end text")
print(count, 'words')
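The question also asks about saving the list of words to a file, which the code above does not cover. One possible sketch follows, using the same two-loop idea. The helper names extract_words and save_words, the output file name, and the short demo lines are all assumptions made for illustration; only the two marker strings come from the post.

```python
import os
import tempfile

def extract_words(lines, start_text, end_text):
    """Collect the words found between the start and end marker lines."""
    it = iter(lines)  # one shared iterator, so the second loop resumes
    for line in it:
        if start_text in line:
            break
    words = []
    for line in it:
        if end_text in line:
            break
        words.extend(line.split())
    return words

def save_words(words, path):
    """Write one word per line to the given path."""
    with open(path, "w", encoding="utf-8") as out:
        out.write("\n".join(words))

# Made-up demo lines standing in for the real book text.
demo_lines = [
    "some header text\n",
    "contact: ccx074@pglaf.org\n",
    "one two three\n",
    "four five\n",
    "END OF THE PROJECT GUTENBERG EBOOK WUTHERING HEIGHTS\n",
    "some license text\n",
]
words = extract_words(
    demo_lines,
    "ccx074@pglaf.org",
    "END OF THE PROJECT GUTENBERG EBOOK WUTHERING HEIGHTS",
)
out_path = os.path.join(tempfile.gettempdir(), "words_demo.txt")
save_words(words, out_path)
print(len(words), "words")  # 5 words
```

For the real file, the same function should accept an open file object directly, for example by calling extract_words(f, start_text, end_text) inside "with open('768.txt') as f:", since a file object yields its lines when iterated.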