How to interpret cosine similarity output in Python

Problem description

Python beginner here. I have a pandas DataFrame df with the columns userID, weight, SEI, and name.

# libraries
   import numpy as np
   import pandas as pd
   from sklearn.metrics.pairwise import cosine_similarity
    
# dataframe
   userID    weight    SEI        name
   3         125.0     0.562140   263
   4         254.0     0.377294   869
   5         451.0     0.872896   196
   1429      451.0     0.872896   196
   5         129.0     0.569432   582
   ...       ...       ...        ...

#output
   cosine_similarity(df)

   array([[1., 0.98731894, 0.75370844, ..., 0.33814175, 0.33700687, 0.24443919],
          [0.98731894, 1., 0.63987877, 0.35037059, 0.34963404, 0.23870279],
          [0.75370844, 0.16648431, 0.16403693, 0.17438159],
          ...

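The person with userID 3 has a weight of 125.0 and an SEI of 0.562140. The person with name 263 has a weight of 125.0 and an SEI of 0.562140. (I had to apply a label encoder to the name column, because I could not run the cosine similarity function without changing that column's data type. Hopefully this does not affect the end goal?)

The goal is to match the values in the userID column to the values in the name column, using cosine similarity across all rows. For that, I just need some guidance on interpreting the output. All I know is that the higher the cosine value, the greater the similarity.

Thanks for your help!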

Solution

Make it easier on yourself and group by two columns:

# Column names follow the question's DataFrame (userID, weight, SEI, name).
result1 = df.sort_values('weight')
result2 = (result1.groupby(['userID', 'SEI'])
           .apply(lambda g: cosine_similarity(g['weight'].values.reshape(1, -1),
                                              g['name'].values.reshape(1, -1))[0][0])
           .rename('CosSim').reset_index())
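
On reading the matrix from the question itself: when cosine_similarity is called with a single argument, it compares every row with every other row, so entry [i, j] is the similarity between row i and row j. Here is a minimal sketch of that reading, assuming a toy DataFrame built from the first three rows shown in the question. The names sim_df, masked, and closest are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy data: the first three rows shown in the question (assumption: the
# label-encoded name column is passed in as a plain integer feature).
df = pd.DataFrame({
    "userID": [3, 4, 5],
    "weight": [125.0, 254.0, 451.0],
    "SEI": [0.562140, 0.377294, 0.872896],
    "name": [263, 869, 196],
})

# With one argument, every row is compared with every other row,
# so the result has shape (n_rows, n_rows).
sim = cosine_similarity(df)

# sim[i, j] is the similarity between row i and row j of df.
# The diagonal is 1 (each row compared with itself) and the matrix is symmetric.
sim_df = pd.DataFrame(sim, index=df.index, columns=df.index)
print(sim_df.round(4))

# For each row, find the most similar *other* row by masking the diagonal.
masked = sim.copy()
np.fill_diagonal(masked, -np.inf)
closest = masked.argmax(axis=1)
print(closest)
```

The off-diagonal values for these three rows (about 0.9873, 0.7537, and 0.6399) agree with the leading entries of the output in the question, which suggests that userID and the encoded name were both treated as numeric features there. Because a label encoder assigns arbitrary integers, those columns can dominate the similarity without carrying any real distance information, so it may be worth restricting the comparison to genuinely numeric columns such as weight and SEI.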