KeyError: word ' ' not in vocabulary (Word2Vec)

Problem description

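I am working on a Python project and using Word2Vec to recommend products. The code works fine on a dataset containing 19401 entries, but whenever I pass a product's ID, I get the error "KeyError: word '1077' not in vocabulary". I don't know how to fix this, since I know very little about it and am still learning. Please help me solve this!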

purchases_train = []

for i in tqdm(product_train):
    temp = train_df[train_df["Clothing ID"] == i]["Review Text"].tolist()
    purchases_train.append(temp)


purchases_val = []

for i in tqdm(validation_df['Clothing ID'].unique()):
    temp = validation_df[validation_df["Clothing ID"] == i]["Review Text"].tolist()
    purchases_val.append(temp)



model = Word2Vec(window = 10,sg = 1,hs = 0,negative = 10,# for negative sampling
                 alpha=0.03,min_count= 1,min_alpha=0.0007,seed = 14)


model.build_vocab(purchases_train,progress_per=200)
model.train(purchases_train,total_examples = model.corpus_count,epochs=10,report_delay=1)

# save word2vec model
model.save("word2vec_2.model")


model.init_sims(replace=True)

# extract all vectors
X = model[model.wv.vocab]

products = train_df[["Clothing ID","Review Text"]]

# remove duplicates
products.drop_duplicates(inplace=True,subset='Clothing ID',keep="last")

# create product-ID and product-description dictionary
products_dict = products.groupby('Clothing ID')['Review Text'].apply(list).to_dict()


def similar_products(v,n = 6):
    
    # extract most similar products for the input vector
    ms = model.similar_by_vector(v,topn= n+1)[1:]
    
    # extract name and similarity score of the similar products
    new_ms = []
    for j in ms:
        pair = (products_dict[j[0]][0],j[1])
        new_ms.append(pair)
        
    return new_ms


similar_products(model['1077'])

Solution

If you get the error "word '847' not in vocabulary", then you can be sure that the token '847' was not present in your training data.

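If you think it was there, look at your data to check whether it really is. Note that in the code above, each training "sentence" is a list of Review Text strings for one Clothing ID, so the tokens the model learns appear to be whole review texts rather than product IDs.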
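One way to check is to scan the training corpus directly for the token. The following is a minimal sketch, not the asker's code: `token_in_corpus` and `token_counts` are hypothetical helpers, and the small `corpus` list is made-up data standing in for `purchases_train`, shaped like the question's input (one list of review texts per product).

```python
from collections import Counter


def token_counts(corpus):
    """Count how often each token occurs across all training sentences."""
    counts = Counter()
    for sentence in corpus:
        counts.update(sentence)
    return counts


def token_in_corpus(corpus, token):
    """Return True if `token` is an element of at least one training sentence."""
    return any(token in sentence for sentence in corpus)


# Made-up stand-in for purchases_train: each "sentence" is a list of review texts.
corpus = [
    ["Love this dress", "Runs small"],
    ["Great fabric"],
]

# Is the product ID a token anywhere in the corpus?
print(token_in_corpus(corpus, "1077"))
# Is a whole review text a token?
print(token_in_corpus(corpus, "Great fabric"))
# Which tokens does the corpus actually contain?
print(token_counts(corpus).most_common(3))
```

The same check works on any list of token lists, so it can be run against the real `purchases_train` before training.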

If your code needs to be able to do something useful with words that are not in the training data, you should extend it to either:

(1) Check whether the word is present before trying to fetch its word vector

    if '847' in model.wv:
        similar_products(model.wv['847'])
    else:
        # do something else
        ...

...or...

(2) Catch the KeyError and do something else when it is raised.
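Option (2) might look like the sketch below. It is illustrative only: `safe_lookup` is a hypothetical helper, and a plain dict stands in for the trained model's vectors so the snippet runs without gensim. With a real model you would pass `model.wv` instead, which also raises `KeyError` for unknown words.

```python
def safe_lookup(vectors, token, default=None):
    """Return the vector for `token`, or `default` if it is not in the vocabulary."""
    try:
        return vectors[token]
    except KeyError:
        # The token was never seen during training; fall back instead of crashing.
        return default


# A plain dict standing in for model.wv (an assumption for illustration).
fake_vectors = {"847": [0.1, 0.2, 0.3]}

for product_id in ("847", "1077"):
    vector = safe_lookup(fake_vectors, product_id)
    if vector is None:
        print(f"{product_id}: not in vocabulary, skipping recommendations")
    else:
        print(f"{product_id}: found vector {vector}")
```

Compared with option (1), catching the exception avoids a second lookup. Which style to prefer is largely a matter of taste.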