Removing text and counting words

Problem description

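I need help removing passages from this text file (https://www.gutenberg.org/files/768/768.txt) on Google Colab. I need the text to start after "ccx074@pglaf.org" and end before "END OF THE PROJECT GUTENBERG EBOOK WUTHERING HEIGHTS" so that the total word count is accurate. I also need help saving the list as a file so that I can get the correct word count. My code so far is listed below.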

# download and install pyspark in Colab
!pip install -q pyspark

# download Wuthering Heights, by Emily Bronte
!wget -q https://www.gutenberg.org/files/768/768.txt

import os.path
baseDir = os.path.join('data')
inputPath = os.path.join('/content/768.txt')
fileName = os.path.join(baseDir, inputPath)

# print every word in the file, one per line
with open('/content/768.txt', 'r') as f:
    for line in f:
        for word in line.split():
            print(word)

Solution

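The simplest way is probably to use two for line in f: loops inside the same with open(filename) as f: block: one that reads lines up to the start text, and another that reads lines up to the end text. If you break out of the first for loop when you find a line matching the start text, the next for loop continues with the same iterator. That means it carries on from the position already reached instead of starting again at the top of the file.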
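To see the shared-iterator behavior in isolation, here is a minimal sketch on an in-memory iterator. The sample lines and the START/END markers are made up for illustration. A file object is its own iterator, which is why the same thing happens with f; a plain list would need the explicit iter() call used here.

```python
# Made-up sample lines standing in for the contents of a file.
lines = iter(["header", "START", "one two", "three", "END", "footer"])

# First loop: consume lines up to and including the start marker.
for line in lines:
    if line == "START":
        break

# Second loop: resumes right after the marker, because both loops
# pull from the same iterator object.
body = []
for line in lines:
    if line == "END":
        break
    body.append(line)

print(body)
```

The same pattern applied to the downloaded file: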

filename = '768.txt'

start_text = 'ccx074@pglaf.org'
end_text = 'END OF THE PROJECT GUTENBERG EBOOK WUTHERING HEIGHTS'

with open(filename) as f:

    # skip the header bit
    for line in f:
        if start_text in line:
            break
    else:
        print("end of file reached without seeing the start text")
        
    # count the words - finish when we get to the end line
    count = 0
    for line in f:
        if end_text in line:
            break
        count += len(line.strip().split())
    else:
        print("end of file reached without seeing the end text")

print(count,'words')
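The question also asks about saving the word list to a file, which the code above does not cover. One possible sketch, reusing the same two-loop idea, is shown below. The helper names extract_words and save_words, the output file name, and the sample lines are all hypothetical and only serve the illustration.

```python
import os
import tempfile


def extract_words(lines, start_text, end_text):
    """Return the words between the line containing start_text
    and the line containing end_text (both marker lines excluded)."""
    it = iter(lines)  # share one iterator between the two loops

    # skip everything up to and including the start marker line
    for line in it:
        if start_text in line:
            break

    # collect words until the end marker line is reached
    words = []
    for line in it:
        if end_text in line:
            break
        words.extend(line.split())
    return words


def save_words(words, path):
    """Write one word per line to path."""
    with open(path, "w", encoding="utf-8") as out:
        out.write("\n".join(words))


# Made-up sample standing in for the real file contents.
sample = [
    "Contact: ccx074@pglaf.org\n",
    "\n",
    "I have just returned from a visit\n",
    "to my landlord.\n",
    "END OF THE PROJECT GUTENBERG EBOOK\n",
    "license text\n",
]

words = extract_words(sample, "ccx074@pglaf.org", "END OF THE PROJECT")
out_path = os.path.join(tempfile.gettempdir(), "words_sample.txt")
save_words(words, out_path)
print(len(words), "words")
```

For the real file, the open file object can be passed in place of the sample list, for example extract_words(f, start_text, end_text) inside the with open(filename) as f: block. Writing one word per line keeps the saved file easy to count and inspect later.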
