Either one of corpus_file or corpus_iterable value must be provided when training a word2vec model

Problem description

I have just started using the word2vec model, and I want to build clusters from my question data.

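As I understand it, to build the clusters I have to:

1. Create a word embedding model
2. Get the word vectors from the model
3. Create sentence vectors from the word vectors
4. Cluster the question data with KMeans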

To get the word2vec word vectors, one article says:

def get_word2vec(tokenized_sentences):
    print("Getting word2vec model...")
    model = Word2Vec(tokenized_sentences,min_count=1)
    return model.wv

After that, I only need to create the sentence vectors and run KMeans.
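For reference, the "sentence vector" step is often done by averaging the word vectors of each tokenized text. The following is only a minimal sketch under that assumption; the helper name sentence_vector and the toy vectors are hypothetical and come from neither article. The vectors argument can be any mapping that supports "in" and indexing, such as a plain dict or the model.wv object returned above.

```python
def sentence_vector(tokens, vectors, dim):
    """Average the vectors of the tokens that have a vector.

    Returns a zero vector of length `dim` when no token is known.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) walks the vectors column by column.
    return [sum(column) / len(known) for column in zip(*known)]


# Toy 2-dimensional vectors, made up for illustration only.
toy_vectors = {"year": [1.0, 3.0], "old": [3.0, 5.0]}
averaged = sentence_vector(["year", "old", "unknown"], toy_vectors, dim=2)
empty = sentence_vector([], toy_vectors, dim=2)
```

The resulting list of sentence vectors is what a KMeans implementation would then take as its input matrix.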

Another article says that after getting the word2vec model I have to build the vocabulary and then train the model, and only then create the sentence vectors and run KMeans:

def get_word2vec_model(tokenized_sentences):
    start_time = time.time()
    print("Getting word2vec model...")
    model = Word2Vec(tokenized_sentences,sg=1,window=window_size,vector_size=size,min_count=min_count,workers=workers,epochs=epochs,sample=0.01)
    log_total_time(start_time)
    return model 


def get_word2vec_model_vector(model):
    start_time = time.time()
    print("Training...")
#     model = Word2Vec(tokenized_sentences,min_count=1)
    model.build_vocab(sentences=shuffle_corpus(tokenized_sentences),update=True)
    # Training the model
    for i in tqdm(range(5)):
        model.train(sentences=shuffle_corpus(tokenized_sentences),epochs=50,total_examples=model.corpus_count)
    log_total_time(start_time)
    return model.wv

def shuffle_corpus(sentences):
    shuffled = list(sentences)
    random.shuffle(shuffled)
    return shuffled

This is what my tokenized_sentences looks like:

8857                                     [,year,old]
11487     [,birthday,canada,cant,share,job,friend]
20471                       [,chat,people,also,talk]
5877                                           [,found]

Q1) The second approach gives the following error:

---> 54     model.build_vocab(sentences=shuffle_corpus(tokenized_sentences),update=True)
     55     # Training the model
     56     for i in tqdm(range(5)):

~\AppData\Local\Programs\Python\python38\lib\site-packages\gensim\models\word2vec.py in build_vocab(self,corpus_iterable,corpus_file,update,progress_per,keep_raw_vocab,trim_rule,**kwargs)
    477 
    478         """
--> 479         self._check_corpus_sanity(corpus_iterable=corpus_iterable,corpus_file=corpus_file,passes=1)
    480         total_words,corpus_count = self.scan_vocab(
    481             corpus_iterable=corpus_iterable,progress_per=progress_per,trim_rule=trim_rule)

~\AppData\Local\Programs\Python\python38\lib\site-packages\gensim\models\word2vec.py in _check_corpus_sanity(self,passes)
   1484         """Checks whether the corpus parameters make sense."""
   1485         if corpus_file is None and corpus_iterable is None:
-> 1486             raise TypeError("Either one of corpus_file or corpus_iterable value must be provided")
   1487         if corpus_file is not None and corpus_iterable is not None:
   1488             raise TypeError("Both corpus_file and corpus_iterable must not be provided at the same time")

TypeError: Either one of corpus_file or corpus_iterable value must be provided

and

Q2) Is it necessary to build the vocabulary and then train on the data? Or is getting the model the only thing I need to do?

Solution

Instead of calling model.build_vocab(sentences=shuffle_corpus(tokenized_sentences),update=True)

replace the parameter name sentences with corpus_iterable. As long as your iterable works correctly, either of the following should run:

model.build_vocab(shuffle_corpus(tokenized_sentences),update=True)

or

model.build_vocab(corpus_iterable=shuffle_corpus(tokenized_sentences),update=True)

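It needs a list of lists for training, so try to provide the data in that format. Also, try to clean your data. I don't think the empty entries in your token lists are a good idea, although I have not tried it myself. Everything else stays the same. Just follow the official documentation on FastText training and you should be able to move on. It applies to Word2Vec as well, but that page has more explanation.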
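The cleanup could look something like the sketch below. It assumes the empty entries are empty or whitespace-only strings, which is a guess based on how the sample rows print; clean_corpus is a hypothetical helper name, not part of Gensim.

```python
def clean_corpus(tokenized_sentences):
    """Return a list of token lists with blank tokens and empty texts removed."""
    cleaned = []
    for tokens in tokenized_sentences:
        # Keep only tokens that still contain something after stripping whitespace.
        kept = [token.strip() for token in tokens if token and token.strip()]
        if kept:
            cleaned.append(kept)
    return cleaned


raw = [["", "year", "old"], [""], ["", "found"]]
cleaned = clean_corpus(raw)
```

Because the function only iterates over its input, it should also accept a pandas Series of lists and hand back the plain list of lists that training expects.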

Note: the example you followed was written for an older version, which is why the sentences= argument raises an error.

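Q2: Building the vocabulary. It is clearly necessary to build the vocabulary; otherwise, how would the model know what a, the, book, reader and so on are? Every word needs a corresponding number, and that is what the vocabulary is for. If you are working with data that contains many OOV (out-of-vocabulary) words, try FastText.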

One nice property is that, after seeing Astronomer and geology, it can produce an embedding for astrology even though it has never seen that word once.

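In recent versions of Gensim, the parameter name sentences has been replaced in methods such as build_vocab(). (It often misled people into thinking that each text had to be a proper sentence, rather than just a list of tokens.)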

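You should specify the corpus either as corpus_iterable (if it is something like a Python list or another re-iterable sequence) or as corpus_file (if it is a single file on disk that is already split into texts by newlines and into tokens by spaces).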
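If you want to try the corpus_file route, a file in that layout can be produced with something like the sketch below. write_corpus_file is a hypothetical helper and the temporary path is only for illustration.

```python
import os
import tempfile


def write_corpus_file(tokenized_sentences, path):
    """Write one text per line, with tokens separated by single spaces."""
    with open(path, "w", encoding="utf-8") as handle:
        for tokens in tokenized_sentences:
            handle.write(" ".join(tokens) + "\n")
    return path


corpus = [["year", "old"], ["chat", "people", "also", "talk"]]
path = write_corpus_file(corpus, os.path.join(tempfile.mkdtemp(), "corpus.txt"))
```

The resulting path would then be passed as corpus_file=path in place of corpus_iterable=..., never both at once, as the second TypeError in the traceback shows.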

Separately:

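You probably don't need the complexity of re-shuffling your corpus repeatedly. (If your data source has large clumps of certain word types in certain ranges, for example all examples of word A appear in an early range of texts and all examples of word B appear in a late range, then one shuffle before starting may help, so that all words are equally likely to appear early and late in the corpus.)

Calling .train() multiple times is almost always a mistake. It leads to confusion about the training process and to mismanagement of the decay of the learning rate alpha. For details, see this answer about the related algorithm Doc2Vec: https://stackoverflow.com/a/62801053/130288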

gensim machine-learning python word2vec
