Problem description
The index I need to build must have the following form:
{'the': defaultdict(int,{'a_and_c': 875,'all_well': 736,'as_you': 698,'com_err': 445,'coriolan': 1130,'cymbelin': 973,'dream': 565,'hamlet': 1146,...
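For reference, a minimal sketch of how a nested defaultdict produces this word -> play -> count shape. The (play, word) pairs below are made-up sample data for illustration, not counts from the corpus.

```python
from collections import defaultdict

# word -> play -> count; a missing word or play starts at 0 automatically
index = defaultdict(lambda: defaultdict(int))

# hypothetical (play, word) observations, for illustration only
samples = [('hamlet', 'the'), ('hamlet', 'the'), ('dream', 'the'), ('hamlet', 'witch')]
for play, word in samples:
    index[word][play] += 1

print(dict(index['the']))  # {'hamlet': 2, 'dream': 1}
```

Because both levels are defaultdicts, no key needs to be created before it is incremented.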
%matplotlib inline
from collections import Counter,defaultdict,OrderedDict
from bs4 import BeautifulSoup
import os
from tqdm import tqdm_notebook
import glob
import nltk
import zipfile
import math
import pandas as pd
import sys
import itertools
def loadShakespeare():
    if 'shaks200.zip' in os.listdir():
        return 'shaks200.zip'
    elif os.path.exists('../../data/Week1/'):
        return '../../data/Week1/shaks200.zip'
    elif os.path.exists('../../../data/Week1/'):
        return '../../../data/Week1/shaks200.zip'
def index_collection(shaks200):
    # zipfile lets us read each member without extracting the archive to disk
    archive = zipfile.ZipFile(shaks200, 'r')  # use the path passed in, not a hard-coded name
    namelist = [x for x in archive.namelist() if '.xml' in x]
    MyIndex = defaultdict(lambda: defaultdict(int))  # word -> {file: count}
    for infile in tqdm_notebook(namelist):  # loop over each file (tqdm_notebook is the name imported above)
        f = archive.open(infile)
        # tokenizing f and counting its words is the step that is still missing
    return MyIndex
%time Shakespeare = index_collection(loadShakespeare())
Shakespeare['the'],Shakespeare['witch']
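The words in each file have to be tokenized before any word can be counted. As you can see, the occurrences of the words "the" and "witch" must be counted for each file and stored, in order, in the defaultdict.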
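One possible way to fill in the missing tokenize-and-count step is sketched below. It is a sketch under stated assumptions, not a definitive answer. It swaps in plain regular expressions where the imports suggest BeautifulSoup and nltk were intended, it assumes the XML members decode as UTF-8, it lowercases every word, and it takes the key for each play from the file name without its directory and extension. The name `index_archive` is hypothetical.

```python
import os
import re
import zipfile
from collections import defaultdict

TAG = re.compile(r'<[^>]+>')   # crude markup stripper, a stand-in for BeautifulSoup
WORD = re.compile(r'[a-z]+')   # simple word pattern, a stand-in for an nltk tokenizer


def index_archive(zip_path):
    """Return {word: {play_name: count}} for every .xml member of the archive."""
    index = defaultdict(lambda: defaultdict(int))
    with zipfile.ZipFile(zip_path, 'r') as archive:
        for member in archive.namelist():
            if not member.endswith('.xml'):
                continue
            # assumption: 'some/dir/hamlet.xml' should be indexed as 'hamlet'
            play = os.path.splitext(os.path.basename(member))[0]
            # assumption: the files are UTF-8; undecodable bytes are replaced
            text = archive.read(member).decode('utf-8', errors='replace')
            # drop the tags, then lowercase so 'The' and 'the' count together
            text = TAG.sub(' ', text).lower()
            for word in WORD.findall(text):
                index[word][play] += 1
    return index
```

With this in place, `index_archive(loadShakespeare())['the']` would give the per-play counts for "the", and `sorted(index['the'].items())` lists them alphabetically by play name. The regex tokenizer splits on apostrophes and hyphens, so the counts may differ from those of an nltk tokenizer.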