Text similarity using WMD within the same time period

Problem description

I have a dataset

       Title                                                Year
0   Sport,there will be a match between United and Tottenham ...   2020
1   Forecasting says that it will be cold next week                 2019
2   Sport,Mourinho is approaching the anniversary at Tottenham     2020
3   Sport,Tottenham are sixth favourites for the title behind Arsenal. 2020
4   Pochettino says clear-out of fringe players at Tottenham is inevitable.     2018
... ... ...

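I want to study text similarity within the same year rather than across the whole dataset. To find the most similar texts, I am using Word Mover's Distance (WMD). For two texts it would be: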

import gensim

word2vec_model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
distance = word2vec_model.wmdistance("string 1".split(), "string 2".split())

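However, I need to compute this distance between all sentences from the same year, so that each row of the DataFrame gets a list of the texts most similar to it. Could you tell me how to apply the wmdistance function across the texts published in the same year, so that I get the most similar texts within the same period for each one?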

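Solution

Build a distance matrix for each group and then pick the minimum. This gives you the index of the single closest document within a given year. If you want the n closest documents or something similar, you should be able to modify this code easily.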
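
As a small illustration of the "pick the minimum" step on its own, here is a sketch that uses made-up distances for three hypothetical documents. The values and the variable names condensed, masked and nearest_pos are assumptions for the example, not output from the word2vec model.

```python
import numpy as np
from scipy.spatial.distance import squareform

# Condensed pairwise distances in pdist order: d(0,1), d(0,2), d(1,2).
# These numbers are invented for illustration only.
condensed = np.array([0.9, 0.4, 0.7])

# Expand to a symmetric 3x3 matrix with zeros on the diagonal.
sq = squareform(condensed)

# Replace zeros with infinity so a document is never its own nearest neighbour.
# Note that this also hides exact duplicates whose distance is 0.
masked = np.where(sq == 0, np.inf, sq)

# Position (within the group) of the closest other document for each row.
nearest_pos = np.argmin(masked, axis=1)
```

The function that follows applies this same selection to each year's titles, with wmdistance supplying the distances.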

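import numpy as np
from scipy.spatial.distance import pdist, squareform

def nearest_doc(group):
    # Pairwise WMD between the tokenized titles of one year, as a square matrix
    sq = squareform(pdist(group.to_numpy()[:, None], metric=lambda x, y: word2vec_model.wmdistance(x[0].split(), y[0].split())))

    # Mask zero distances (the diagonal) so a title is not matched with itself
    return group.index.to_numpy()[np.argmin(np.where(sq == 0, np.inf, sq), axis=1)]

df['nearest_doc'] = df.groupby('Year')['Title'].transform(nearest_doc)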

Result:

Title   Year    nearest_doc
0   Sport,there will be a match between United an...   2020    3
1   Forecasting says that it will be cold next week     2019    1
2   Sport,Mourinho is approaching the anniversary...   2020    3
3   Sport,Tottenham are sixth favourites for the ...   2020    2
4   Pochettino says clear-out of fringe players at...   2018    4
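
The answer mentions that the code can be adapted to return the n closest documents. One possible way to do that is sketched below. The names nearest_n_docs and token_distance, the nearest_docs column and the toy titles are all assumptions made for this example. token_distance is a simple Jaccard distance used only so the sketch runs without the word2vec file; with the real model you would pass word2vec_model.wmdistance as the distance argument instead.

```python
import numpy as np
import pandas as pd

def token_distance(a, b):
    """Hypothetical stand-in for word2vec_model.wmdistance: Jaccard distance on token sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return 1.0 - len(sa & sb) / len(union) if union else 0.0

def nearest_n_docs(group, n=2, distance=token_distance):
    """For each title in one year's group, list the index labels of its n closest titles."""
    tokens = [title.lower().split() for title in group]
    labels = group.index.to_numpy()
    size = len(tokens)
    # Infinity on the diagonal keeps a document from being its own neighbour.
    dist = np.full((size, size), np.inf)
    for i in range(size):
        for j in range(i + 1, size):
            dist[i, j] = dist[j, i] = distance(tokens[i], tokens[j])
    # Sort each row by distance (closest first) and keep at most n other documents.
    order = np.argsort(dist, axis=1, kind="stable")[:, : min(n, size - 1)]
    return pd.Series([labels[row].tolist() for row in order], index=group.index)

# Toy data, invented for the example.
df = pd.DataFrame({
    "Title": ["a b c", "x y z", "a b d", "a q r"],
    "Year": [2020, 2019, 2020, 2020],
})

# Process each year separately, then align the results back on the original index.
parts = [nearest_n_docs(g, n=2) for _, g in df.groupby("Year")["Title"]]
df["nearest_docs"] = pd.concat(parts)
```

A year with a single title gets an empty list here rather than its own index, which is a design choice of this sketch and differs from the single-nearest version above.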