Word2vec in a pandas DataFrame

Problem description

I am trying to use word2vec to check the similarity between two columns in each row of a dataset.

For example:

Sent1                                     Sent2
It is a sunny day                         Today the weather is good. It is warm outside
What people think about democracy         In ancient times, Greeks were the first to propose democracy
I have never played tennis                I do not know who Roger Feder is

To apply word2vec, I am considering the following:

import numpy as np

words1 = sentence1.split(' ')
words2 = sentence2.split(' ')
# The meaning of the sentence can be interpreted as the average of its words
sentence1_meaning = word2vec(words1[0])
count = 1
for w in words1[1:]:
    sentence1_meaning = np.add(sentence1_meaning, word2vec(w))
    count += 1
sentence1_meaning /= count

sentence2_meaning = word2vec(words2[0])
count = 1
for w in words2[1:]:
    sentence2_meaning = np.add(sentence2_meaning, word2vec(w))
    count += 1
sentence2_meaning /= count

# Similarity is the cosine between the vectors
similarity = np.dot(sentence1_meaning, sentence2_meaning) / (
    np.linalg.norm(sentence1_meaning) * np.linalg.norm(sentence2_meaning)
)

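However, this only works for two sentences that are not in a pandas DataFrame.

Can you tell me what I need to do to apply word2vec to check the similarity between Sent1 and Sent2 when they are columns of a pandas DataFrame? I would like a new column with the results.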

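Solution

I don't have a trained word2vec model available, so I'll show how to do what you want with a fake word2vec, turning word vectors into sentence vectors via tf-idf weights.

Step 1. Prepare the data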

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({"sentences": ["this is a sentence", "this is another sentence"]})

tfidf = TfidfVectorizer()
# Dense (n_sentences, n_vocab) array of tf-idf weights
tfidf_matrix = tfidf.fit_transform(df.sentences).toarray()
vocab = tfidf.vocabulary_  # maps each word to its column index in tfidf_matrix
vocab
{'this': 3, 'is': 1, 'sentence': 2, 'another': 0}

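Step 2. Fake a word2vec (as big as our vocab)

# One random 300-dimensional vector per vocabulary word
word2vec = np.random.randn(len(vocab), 300)

Step 3. Compute a column containing the sentence vectors (sent2vec):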

# word2vec here contains vectors in the same order as in vocab
sent2vec_matrix = np.dot(tfidf_matrix, word2vec)
df["sent2vec"] = sent2vec_matrix.tolist()
df

                  sentences                                           sent2vec
0        this is a sentence  [-2.098592110459085, 1.4292324332403232, -1.10...
1  this is another sentence  [-1.7879436822159966, 1.680865619703155, -2.00...
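
The matrix product in this step is a weighted sum: each sentence vector is the sum of its words' vectors, each scaled by that word's tf-idf weight. A minimal sketch to illustrate this, using made-up weights and small random vectors (the names `weights`, `word_vectors` and `manual_first` are hypothetical and not part of the answer above):

```python
import numpy as np

# Hypothetical toy values: 2 sentences x 4 vocabulary words (tf-idf-like weights)
weights = np.array([[0.0, 0.5, 0.5, 0.7],
                    [0.6, 0.4, 0.4, 0.5]])

# 4 words x 3 dimensions, rows in the same order as the weight columns
rng = np.random.default_rng(0)
word_vectors = rng.standard_normal((4, 3))

# Matrix product: one sentence vector per row
sentence_vectors = weights @ word_vectors

# The same computation spelled out for the first sentence
manual_first = sum(weights[0, j] * word_vectors[j] for j in range(4))

print(np.allclose(sentence_vectors[0], manual_first))
```

This is why the row order of the word-vector matrix has to match the column order of the tf-idf matrix: otherwise each weight would be applied to the wrong word's vector.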

Step 4. Compute the similarity matrix

from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(df["sent2vec"].tolist())
similarity
array([[1.        , 0.76557098],
       [0.76557098, 1.        ]])
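
Step 4 compares every sentence in one column with every other. The question instead asks for one score per row, comparing Sent1 with Sent2. A minimal sketch of that row-wise cosine, assuming each column has already been turned into sentence vectors by some method such as step 3 (the column names `vec1`, `vec2`, `similarity` and the numbers are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sentence vectors, one pair per row
df_pairs = pd.DataFrame({
    "vec1": [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]],
    "vec2": [[2.0, 0.0], [1.0, 0.0], [3.0, 0.0]],
})

a = np.array(df_pairs["vec1"].tolist())
b = np.array(df_pairs["vec2"].tolist())

# Row-wise cosine: dot product of each pair divided by the product of their norms
dots = (a * b).sum(axis=1)
norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
df_pairs["similarity"] = dots / norms

print(df_pairs["similarity"].tolist())
```

Rows with an all-zero vector would divide by zero here, so a real pipeline would probably need to decide how to score those.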

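To make this work with your real word2vec, you need to adjust step 2 slightly so that word2vec contains the vectors for all words in vocab in the same order (ordered by the vocab value, or alphabetically, which gives the same result).

In your case, it should be: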

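# Assumes word2vec is a dict-like lookup (word -> vector) that contains every vocab word
sorted_vocab = sorted(vocab)  # alphabetical order matches the column indices in vocab
sorted_word2vec = []
for word in sorted_vocab:
    sorted_word2vec.append(word2vec[word])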
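
A trained model will often be missing some of the words in vocab, in which case the lookup above would fail. One possible way around it, sketched here with a plain dictionary standing in for the model (`toy_model`, `dim` and `aligned` are hypothetical names, and using zero vectors for unknown words is an assumption, not something the answer prescribes):

```python
import numpy as np

# Vocabulary as produced in step 1: word -> column index
vocab = {"this": 3, "is": 1, "sentence": 2, "another": 0}

# Hypothetical stand-in for a trained model: any word -> vector mapping
toy_model = {
    "this": np.array([1.0, 0.0]),
    "is": np.array([0.0, 1.0]),
    "sentence": np.array([1.0, 1.0]),
    # "another" is deliberately missing
}
dim = 2

# Place each known vector in the row given by its vocab index;
# words missing from the model keep a zero vector (an assumption)
aligned = np.zeros((len(vocab), dim))
for word, column in vocab.items():
    if word in toy_model:
        aligned[column] = toy_model[word]

print(aligned)
```

Indexing by the vocab value rather than sorting keeps the rows aligned with the tf-idf columns, so `aligned` could take the place of the fake matrix in step 3.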