Problem description
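I am looking for a way to update an existing corpus with new documents using gensim. I built a dictionary from the existing corpus and created bag-of-words vectors from it, then serialized those to a .mm file on my local disk. Now I want to update the existing .mm file with new documents so that I keep an up-to-date representation of the corpus, which I can then use for document similarity on unseen data. How should I go about this? What is the right way to update the corpus? I know that documents can be added to the dictionary, but not to the .mm file.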
from gensim import corpora,models,similarities
from gensim.parsing.preprocessing import STOPWORDS
tweets = [
['human','interface','computer'],['survey','user','computer','system','response','time','survey'],['eps','system'],['system','human','eps'],['user','time'],['trees'],['graph','trees'],['minors','survey']
]
dictionary = corpora.Dictionary(tweets)
dictionary.save('tweets.dict') # store the dictionary,for future reference
dictionary = corpora.Dictionary.load('tweets.dict')
print(f'Length of previous dict = {len(dictionary)},tokens = {dictionary.token2id}')
raw_corpus = [dictionary.doc2bow(t) for t in tweets]
corpora.MmCorpus.serialize('tweets.mm',raw_corpus) # store to disk
print("Save the vectorized corpus as a .mm file")
corpus = corpora.MmCorpus('tweets.mm') # loading saved .mm file
print(corpus)
new_docs = [
["user","response","system"],["trees","minor","surveys"]
]
# how to add this new_docs corpus to tweets.mm
Is it possible to update tweets.mm? And is doing so even recommended?
Solution
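There is no direct way to update a .mm corpus on disk. Instead, I suggest that you re-process the corpus from scratch by extending the tweets list with the contents of new_docs and writing the .mm file again. This way you ensure that the dictionary (which maps words to ids) does not get out of sync with the corpus.
I would create the following function to handle the update: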
import itertools

def update_corpus(tweets, new_docs, dict_path):
    # Load the saved dictionary and extend it with tokens from the new documents
    dictionary = corpora.Dictionary.load(dict_path)
    print(f'Length of previous dict = {len(dictionary)}, tokens = {dictionary.token2id}')
    dictionary.add_documents(new_docs)
    dictionary.save(dict_path)
    print(f'Length of updated dict = {len(dictionary)}, tokens = {dictionary.token2id}')
    # Re-vectorize the old and new documents together with the updated dictionary
    full_corpus = itertools.chain(tweets, new_docs)
    raw_corpus = [dictionary.doc2bow(t) for t in full_corpus]
    corpora.MmCorpus.serialize('tweets.mm', raw_corpus)  # overwrite the .mm file on disk
    print("Save the vectorized corpus as a .mm file")
Note that there is no need to load the dictionary right after creating and saving it, so you can remove the following line:
dictionary = corpora.Dictionary.load('tweets.dict')