Ranking one vector's similarity against a very large DataFrame of vectors in pandas

Problem description

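Goal: I am trying to create an ordered list of items, ranked by how close each item is to a test item.

I have 1 test item with 10 attributes and 250,000 items with 10 attributes each. I want a list that orders all 250,000 items. For example, if the resulting list is [10, 50, 21, 11, 10000, ...], then the item at index 10 is closest to my test item, the item at index 50 is the second closest, and so on.

What I have tried works for smaller DataFrames but not for larger ones: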

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = np.random.rand(4,4)

#4 items with the first being the test
#0.727048   0.113704    0.886672    0.0345438
#0.496636   0.678949    0.0627973   0.547752
#0.641021   0.498811    0.628728    0.575058
#0.760778   0.955595    0.646792    0.126714 

#creates the cosine similarity matrix 
winner = cosine_similarity(similarity_matrix) 

#I just need the first row,how similar each item is to the test,I'm excluding how similar the test is to the test 
winner = np.argsort(winner[0:1,1:])

#I want to reverse the order and add one so the list matches the original index    
winner = np.flip(winner) +1

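Unfortunately, with 250,000 items I get the following error: "MemoryError: Unable to allocate 339. GiB for an array with shape (250000, 250000) and data type float64"

I really only need the first row, not the whole 250000 x 250000 matrix. Is there another way to do this?

Solution

Compute the similarity row by row, for example: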

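test = np.array([[1,2,3]])
#every row must have the same number of columns as test
big_matrix = np.array([[1,2,3],[2,3,4]])

#calculate the similarity of each row to test and flatten the results to 1-D
result = np.array([cosine_similarity(test,row.reshape(1,-1)) for row in big_matrix]).reshape(-1)
#argsort is ascending, so reverse it to put the most similar row first
winner = np.argsort(result)[::-1]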
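
The loop above calls cosine_similarity once per row, which may be slow for 250,000 rows. As a rough sketch of what is being computed, the same similarities can be written directly in NumPy as a dot product divided by the product of the norms. This is a substitute for the scikit-learn call, not what the answer itself uses; the helper name cosine_to_one and the third row of big_matrix are made up for illustration, and the code assumes no row is all zeros.

```python
import numpy as np

def cosine_to_one(items, test):
    """Cosine similarity between every row of `items` and one vector `test`."""
    items = np.asarray(items, dtype=float)
    test = np.asarray(test, dtype=float).ravel()
    # Dot product of each row with the test vector: shape (n_items,)
    dots = items @ test
    # Product of each row's norm with the test vector's norm
    norms = np.linalg.norm(items, axis=1) * np.linalg.norm(test)
    # Assumes no all-zero vectors (otherwise this divides by zero)
    return dots / norms

test = np.array([1, 2, 3])
big_matrix = np.array([[1, 2, 3], [2, 3, 4], [3, 2, 1]])

similarities = cosine_to_one(big_matrix, test)
# Row indices ordered from most to least similar to the test vector
ranking = np.argsort(similarities)[::-1]
```

Only one value per row is ever stored, so memory grows linearly with the number of items rather than quadratically.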

If you call cosine_similarity with a second argument, it computes only the similarities against that second array.
An example with random vectors:

x = np.random.rand(5,2)

With one argument:

cosine_similarity(x)
array([[1.        , 0.95278802, 0.93496787, 0.45860786, 0.62841819],
       [0.95278802, 1.        , 0.99853581, 0.70677904, 0.8349406 ],
       [0.93496787, 0.99853581, 1.        , 0.74401257, 0.86348853],
       [0.45860786, 0.70677904, 0.74401257, 1.        , 0.979448  ],
       [0.62841819, 0.8349406 , 0.86348853, 0.979448  , 1.        ]])

With the first vector passed as the second argument:

cosine_similarity(x,[x[0]])
array([[1.        ],[0.95278802],[0.93496787],[0.45860786],[0.62841819]])
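
Applied to the sizes in the question, the two-argument call might look like the sketch below. The random data, the fixed seed, and the names test_item, items, similarities and ranking are placeholders for illustration, not taken from the original post.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)        # fixed seed, illustrative data only
test_item = rng.random((1, 10))       # 1 test item with 10 attributes
items = rng.random((250_000, 10))     # 250,000 items with 10 attributes

# Result has shape (250000, 1): one similarity per item,
# so the 250000 x 250000 matrix is never built
similarities = cosine_similarity(items, test_item).ravel()

# Item indices ordered from most to least similar to the test item
ranking = np.argsort(similarities)[::-1]
```

Here ranking[0] is the index of the item closest to the test item, ranking[1] the second closest, and so on, which is the ordering the question asks for.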

If memory is still insufficient, you can compute the similarities in chunks:

chunks = 4
np.concatenate(
    [cosine_similarity(i,[x[0]]) for i in np.array_split(x,chunks)]
)
array([[1.        ],[0.95278802],[0.93496787],[0.45860786],[0.62841819]])
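
The chunked computation can be wrapped into a small helper that also returns the ranking. This is only a sketch: the function name rank_in_chunks and its chunks parameter are hypothetical, and the right number of chunks depends on the available memory.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def rank_in_chunks(items, test_item, chunks=4):
    """Rank rows of `items` by cosine similarity to `test_item`, one chunk at a time."""
    parts = [
        # Each call returns a (chunk_size, 1) array, flattened here to 1-D
        cosine_similarity(part, test_item).ravel()
        for part in np.array_split(items, chunks)
    ]
    similarities = np.concatenate(parts)
    # Indices ordered from most to least similar
    return np.argsort(similarities)[::-1], similarities

rng = np.random.default_rng(0)   # illustrative data only
x = rng.random((5, 2))
ranking, sims = rank_in_chunks(x, x[:1], chunks=4)
```

Because np.array_split preserves row order, the concatenated similarities line up with the original row indices, so the ranking matches what a single unchunked call would give.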