How do I update a .mm (Matrix Market) file with a new corpus of documents?

Question

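I'm looking for a way to update an existing corpus with new documents using gensim. Here, I built a dictionary from the existing corpus and created a bag-of-words representation from it. I then serialized that to a .mm file and saved it to local disk. Now I want to update the existing .mm file with new documents, so that I keep an up-to-date representation of the corpus that can be used for document similarity on unseen data. How should I go about this? What is the correct way to update the corpus? Also, I know that documents can be added to the dictionary, but not to the .mm file.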

from gensim import corpora,models,similarities
from gensim.parsing.preprocessing import STOPWORDS

tweets = [
    ['human','interface','computer'],['survey','user','computer','system','response','time','survey'],['eps','system'],['system','human','eps'],['user','time'],['trees'],['graph','trees'],['minors','survey']
]

dictionary = corpora.Dictionary(tweets)
dictionary.save('tweets.dict')  # store the dictionary,for future reference

dictionary = corpora.Dictionary.load('tweets.dict')
print(f'Length of previous dict = {len(dictionary)},tokens = {dictionary.token2id}')
raw_corpus = [dictionary.doc2bow(t) for t in tweets]
corpora.MmCorpus.serialize('tweets.mm',raw_corpus)  # store to disk
print("Save the vectorized corpus as a .mm file")

corpus = corpora.MmCorpus('tweets.mm') # loading saved .mm file
print(corpus)

new_docs = [
["user","response","system"],["trees","minor","surveys"]
]

# how to add this new_docs corpus to tweets.mm

Is it possible to update tweets.mm? And is doing so recommended?

Answer

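There is no direct way to update a .mm corpus on disk. Instead, I'd suggest going back to the tokenized documents and re-processing everything from scratch: extend the tweets list with the contents of new_docs, update the dictionary, and serialize the corpus again. That way you make sure the dictionary (which maps words to ids) does not go out of sync with the corpus.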
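As a rough illustration of why an in-place append is awkward: a Matrix Market coordinate file records the number of rows, columns, and non-zero entries on the line right after its banner, so adding documents changes totals at the top of the file. The sketch below is plain Python, not gensim's writer. `to_matrix_market` is a hypothetical helper, and treating documents as rows and term ids as columns is an assumption made for the example.

```python
def to_matrix_market(bow_corpus, num_terms):
    """Render a bag-of-words corpus as Matrix Market coordinate text.

    bow_corpus is a list of documents, each a list of (term_id, count)
    pairs with 0-based term ids (the shape that doc2bow returns).
    This is a hypothetical helper for illustration only.
    """
    entries = []
    for doc_no, bow in enumerate(bow_corpus, start=1):  # rows are 1-based
        for term_id, count in sorted(bow):
            entries.append(f"{doc_no} {term_id + 1} {count}")  # columns are 1-based
    banner = "%%MatrixMarket matrix coordinate real general"
    size_line = f"{len(bow_corpus)} {num_terms} {len(entries)}"
    return "\n".join([banner, size_line, *entries]) + "\n"


old = [[(0, 1), (1, 1)], [(1, 2)]]
new = old + [[(2, 1)]]  # one extra document that uses a new term id

old_text = to_matrix_market(old, num_terms=2)
new_text = to_matrix_market(new, num_terms=3)

print(old_text.splitlines()[1])  # 2 2 3
print(new_text.splitlines()[1])  # 3 3 4
```

Every number on the size line changes after a single document is added, which is one reason rewriting the whole file is the simpler option.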

I would create the following function to handle the update:

import itertools

def update_corpus(tweets,new_docs,dict_path):
    dictionary = corpora.Dictionary.load(dict_path)
    print(f'Length of previous dict = {len(dictionary)},tokens = {dictionary.token2id}')
    dictionary.add_documents(new_docs)  # new tokens get new ids; existing ids are kept
    dictionary.save(dict_path)
    print(f'Length of updated dict = {len(dictionary)},tokens = {dictionary.token2id}')
    # re-vectorize the old and the new documents with the updated dictionary
    full_corpus = itertools.chain(tweets,new_docs)
    raw_corpus = [dictionary.doc2bow(t) for t in full_corpus]
    corpora.MmCorpus.serialize('tweets.mm',raw_corpus)  # overwrite the file on disk
    print("Save the vectorized corpus as a .mm file")
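To see why the dictionary is extended before the documents are vectorized again, here is a toy, pure-Python sketch. `toy_doc2bow` and `toy_add_documents` are hypothetical stand-ins written for this example, not gensim functions. They only mimic two behaviours: tokens missing from the mapping are skipped when a document is vectorized, and newly added tokens get fresh ids while existing ids stay the same.

```python
from collections import Counter


def toy_doc2bow(token2id, tokens):
    """Toy stand-in for doc2bow: count known tokens, skip unknown ones."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


def toy_add_documents(token2id, docs):
    """Toy stand-in for add_documents: give each unseen token the next free id."""
    for doc in docs:
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)


token2id = {"system": 0, "trees": 1, "user": 2}
new_doc = ["trees", "minor", "surveys"]

# With the stale mapping, "minor" and "surveys" are dropped without any error.
before = toy_doc2bow(token2id, new_doc)

old_ids = dict(token2id)
toy_add_documents(token2id, [new_doc])

# After the update, all three tokens are represented.
after = toy_doc2bow(token2id, new_doc)

print(before)  # [(1, 1)]
print(after)   # [(1, 1), (3, 1), (4, 1)]
```

Because the old ids are unchanged, vectors built from the earlier documents stay consistent with vectors built from the new ones.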

Note that there is no need to load the dictionary right after creating and saving it, so you can remove the following line:

dictionary = corpora.Dictionary.load('tweets.dict')