How do I convert a zip folder into an inverted index?

Problem description

The index I have to build must have the following form:

{'the': defaultdict(int,{'a_and_c': 875,'all_well': 736,'as_you': 698,'com_err': 445,'coriolan': 1130,'cymbelin': 973,'dream': 565,'hamlet': 1146,...

I have the following code:



%matplotlib inline
from collections import Counter,defaultdict,OrderedDict
from bs4 import BeautifulSoup
import os
from tqdm import notebook  # the loop below calls notebook.tqdm(...)
import glob
import nltk
import zipfile
import math
import pandas as pd
import sys
import itertools


def loadShakespeare():
    if 'shaks200.zip' in os.listdir():
        return 'shaks200.zip'
    elif os.path.exists('../../data/Week1/'):
        return '../../data/Week1/shaks200.zip'
    elif os.path.exists('../../../data/Week1/'):
        return '../../../data/Week1/shaks200.zip'


def index_collection(shaks200):
    # With zipfile we can read the member files without extracting the archive
    archive = zipfile.ZipFile(shaks200, 'r')  # use the path passed in, not a hard-coded name
    namelist = [x for x in archive.namelist() if '.xml' in x]
    MyIndex = defaultdict(lambda: defaultdict(int)) # initialize MyIndex
    for infile in notebook.tqdm(namelist): # loop over each file
        f = archive.open(infile)
        
        
    return MyIndex

%time Shakespeare = index_collection(loadShakespeare())


Shakespeare['the'],Shakespeare['witch']

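The text of each file has to be tokenized before any word can be counted. As the last line shows, words such as "the" and "witch" must be counted per file and sorted into the defaultdict.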
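The tokenize-then-count step described above could be sketched roughly as follows. This is only a minimal illustration, not a confirmed solution: the function name `build_inverted_index`, the regex-based tag stripping, and the regex tokenizer (used here as a simple stand-in for `nltk.word_tokenize` or BeautifulSoup parsing) are all assumptions, and the document key is assumed to be the file name without its extension, as in the desired output. The small in-memory archive at the end is made-up demo data, not the real `shaks200.zip`.

```python
import io
import os
import re
import zipfile
from collections import Counter, defaultdict

# Assumption: a crude regex is enough to drop XML tags for counting purposes.
TAG_RE = re.compile(r'<[^>]+>')
# Assumption: a simple regex tokenizer stands in for nltk.word_tokenize.
TOKEN_RE = re.compile(r"[a-z']+")


def build_inverted_index(zip_source):
    """Return {term: {document_name: count}} for every .xml member of the archive."""
    index = defaultdict(lambda: defaultdict(int))
    with zipfile.ZipFile(zip_source, 'r') as archive:
        for member in archive.namelist():
            if not member.endswith('.xml'):
                continue
            # 'plays/hamlet.xml' -> 'hamlet'
            doc = os.path.splitext(os.path.basename(member))[0]
            with archive.open(member) as f:
                raw = f.read().decode('utf-8', errors='replace')
            # Strip markup, lowercase, then tokenize and count per document.
            text = TAG_RE.sub(' ', raw).lower()
            for term, count in Counter(TOKEN_RE.findall(text)).items():
                index[term][doc] += count
    return index


# Hypothetical demo archive built in memory so the sketch runs on its own.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    zf.writestr('plays/hamlet.xml', '<PLAY><LINE>The play is the thing</LINE></PLAY>')
    zf.writestr('plays/macbeth.xml', '<PLAY><LINE>The witch and the other witch</LINE></PLAY>')
    zf.writestr('plays/README.txt', 'the the the')  # skipped: not an .xml member
buffer.seek(0)

demo_index = build_inverted_index(buffer)
print(dict(demo_index['the']))
print(dict(demo_index['witch']))
```

With the real archive, the same function could be called as `build_inverted_index(loadShakespeare())`. The counts will depend on the tokenizer, so they may not match the figures in the expected output exactly.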

Solution

No verified solution has been posted for this problem yet.


indexing python tokenize zipfile