Display the word count for each word

Problem description

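I am having trouble producing the top 15 word counts (the count for each word) for the document Wuthering Heights (https://www.gutenberg.org/files/768/768.txt) on Google Colab. It must include only the words that come after "[email protected]" and before "END OF THE PROJECT GUTENBERG EBOOK WUTHERING HEIGHTS". This is the code I tried.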

file = open('768.txt','r+')
wordcount = {}
for word in file.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] +=1
for k,v in wordcount.items():
    print(k,v)

Solution

You can use a regular expression to find the required substring:

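import re

start = '[email protected]'
end = 'END OF THE PROJECT GUTENBERG EBOOK WUTHERING HEIGHTS'

# Read the whole file; the with-block closes it afterwards.
with open('768.txt', 'r') as file:
    text = file.read()

# re.escape keeps characters such as '.' in the markers literal;
# re.S lets '.' match newlines so the match can span many lines.
m = re.findall(re.escape(start) + '(.*?)' + re.escape(end), text, flags=re.S)[0]

# Count each whitespace-separated word.
wordcount = {}
for word in m.split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
for k, v in wordcount.items():
    print(k, v)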

Sample output:

WUTHERING 1
HEIGHTS 1
CHAPTER 34
I 3215
1801.--I 1
have 594
just 72
returned 39
from 476
...
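
To see what the non-greedy pattern combined with re.S does, here is a minimal sketch on a made-up string. The markers START and END, the sample text, and the helper name extract_between are assumptions for illustration only, not the markers in the Gutenberg file.

```python
import re

def extract_between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    # re.escape treats the markers literally; re.S lets '.' cross newlines.
    pattern = re.escape(start) + r'(.*?)' + re.escape(end)
    match = re.search(pattern, text, flags=re.S)
    return match.group(1) if match else None

sample = "header\nSTART\nfirst line\nsecond line\nEND\nfooter"
body = extract_between(sample, "START", "END")
print(body.split())
```

Using re.search and checking for None avoids the IndexError that indexing an empty findall result with [0] would raise when a marker is missing.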

However, you can also count the words with the standard library's Counter, for example:

from collections import Counter
print(Counter(m.split()))

#Counter({'the': 4273,'and': 4189,'to': 3436,...})

Edit: to get the counts sorted:

sorted(Counter(m.split()).items(),key=lambda x:x[1])

Or from highest to lowest:

sorted(Counter(m.split()).items(),key=lambda x:x[1],reverse=True)
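
Since the question asks only for the top 15 words, Counter.most_common(n) may be simpler than sorting every item. A minimal sketch follows; the sample text and the helper name top_words are made up for illustration, and with the code above you would pass m instead of sample.

```python
from collections import Counter

def top_words(text, n=15):
    """Return the n most frequent whitespace-separated words as (word, count) pairs."""
    return Counter(text.split()).most_common(n)

sample = "the cat and the dog and the bird"
for word, count in top_words(sample, 2):
    print(word, count)
```

Note that this still counts "The" and "the" separately and leaves punctuation attached to words, just like the plain split() above.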

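With the help of string.punctuation and operator.itemgetter, this might be one way to do it. It should get you close. Note that removing punctuation strips the sentence endings (. ! ?) so that you get clean words. It also removes apostrophes, which you may not want.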

from collections import Counter
from string import punctuation
from operator import itemgetter

d = Counter()

with open('wuthering_heights.txt','r') as f:
    opening = False

    for line in f:
        if line.startswith('[email protected]'):
            opening = True
        if not opening:
            continue
        if line.startswith('CHAPTER'): # don't count chapter headings
            continue
        if line.startswith('***END OF THE PROJECT GUTENBERG EBOOK'):
            break
        
        line = line.strip()
        if len(line) == 0:
            continue
        
        # clean out punctuation
        line = line.translate(str.maketrans('','',punctuation))
        
        d.update(line.lower().split())

        

print('different words count', len(d))
#print(d.most_common(15))

for word,count in reversed(sorted(d.items(),key=itemgetter(1))):
    print(word,count)
    if count < 290:
        break

This prints:

different words count 10098
and 4693
the 4552
i 3530
to 3476
a 2301
of 2221
he 1922
you 1712
her 1544
in 1459
his 1419
it 1284
she 1269
that 1188
was 1124
my 1098
me 1047
not 932
as 931
him 917
for 836
on 809
with 804
at 783
be 724
had 687
but 673
is 649
have 629
from 485
by 451
would 442
if 440
heathcliff 413
your 404
no 384
said 368
so 357
were 354
linton 340
catherine 333
an 317
we 311
mr 309
or 307
when 307
out 305
what 301
are 295
this 290
they 283
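
As noted, deleting everything in string.punctuation also removes apostrophes, so "don't" becomes "dont". Deleting also glues together words that are separated only by punctuation, such as "1801.--I" in the opening line. One possible variation, which is not what the code above does, is to map punctuation to spaces and leave the apostrophe alone. This is a sketch; the names TO_SPACES and clean_words are assumptions for illustration.

```python
from string import punctuation

# Every punctuation character except the apostrophe is turned into a space.
chars = punctuation.replace("'", "")
TO_SPACES = str.maketrans(chars, " " * len(chars))

def clean_words(line):
    """Lowercase a line, replace punctuation (except apostrophes) with spaces, and split."""
    return line.translate(TO_SPACES).lower().split()

print(clean_words("Don't go, Heathcliff!"))
print(clean_words("1801.--I have just returned"))
```

In the loop above, d.update(clean_words(line)) could stand in for the translate and split steps; the resulting counts would differ slightly from the output shown.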
