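Removing passages and counting words

Problem description

I need help removing passages from this text file (https://www.gutenberg.org/files/768/768.txt) in Google Colab. I need the text to start after "[email protected]" and to end before "END OF THE PROJECT GUTENBERG EBOOK WUTHERING HEIGHTS", so that the total word count is accurate. I also need help saving the list as a file so that I can get the correct word count. My code so far is listed below.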

# download and install pyspark in Colab
!pip install -q pyspark

# download Wuthering Heights, by Emily Bronte
!wget -q https://www.gutenberg.org/files/768/768.txt

import os.path

baseDir = os.path.join('data')
inputPath = os.path.join('/content/768.txt')
fileName = os.path.join(baseDir, inputPath)

# print every whitespace-separated word in the file, one per line
with open('/content/768.txt', 'r') as f:
    for line in f:
        for word in line.split():
            print(word)

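Solution

The simplest way is probably to use two `for line in f:` loops inside the same `with open(filename) as f:` block: one to read lines up to the start text, and the other to read lines up to the end text. If you `break` out of the first `for` loop on finding the line that matches the start text, the next `for` loop continues with the same iterator. That means it carries on from the position already reached rather than starting again from the top of the file.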

filename = '768.txt'

start_text = '[email protected]'
end_text = 'END OF THE PROJECT GUTENBERG EBOOK WUTHERING HEIGHTS'

with open(filename) as f:

    # skip the header bit
    for line in f:
        if start_text in line:
            break
    else:
        print("end of file reached without seeing the start text")
        
    # count the words - finish when we get to the end line
    count = 0
    for line in f:
        if end_text in line:
            break
        count += len(line.strip().split())
    else:
        print("end of file reached without seeing the end text")

print(count,'words')
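
To address the part of the question about keeping the words as a list, the same two-loop idea can be wrapped in a function that returns the words instead of only counting them. This is a minimal sketch, not part of the original answer: the function name `words_between`, the marker strings, and the in-memory sample text are all hypothetical and chosen only so the example runs without downloading the book.

```python
import io


def words_between(lines, start_text, end_text):
    """Return the words on the lines strictly between the two marker lines.

    `lines` is any iterable of lines, for example an open file object.
    """
    # One shared iterator: the second loop resumes where the first stopped.
    it = iter(lines)

    # Skip everything up to and including the line containing start_text.
    for line in it:
        if start_text in line:
            break
    else:
        raise ValueError("start text not found")

    # Collect words until the line containing end_text is reached.
    words = []
    for line in it:
        if end_text in line:
            break
        words.extend(line.split())
    else:
        raise ValueError("end text not found")

    return words


# Hypothetical stand-in for the downloaded file.
sample = io.StringIO(
    "header line\n"
    "START MARKER\n"
    "one two three\n"
    "\n"
    "four five\n"
    "END MARKER\n"
    "footer line\n"
)

words = words_between(sample, "START MARKER", "END MARKER")
print(len(words), "words")
```

With a real file, the same function could be called as `words_between(f, start_text, end_text)` inside the `with open(filename) as f:` block, and the resulting list could then be written out, for instance one word per line with `'\n'.join(words)`. Raising `ValueError` instead of printing is a design choice made here so that a missing marker cannot silently produce a count of zero.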