Find the row in a pandas DataFrame most similar to a user input

Problem description

I want to find the row in my dataset that is most similar to a user's input.

My dataset looks like this:

This is the user input:

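I tried many distance metrics with scipy and sklearn (Euclidean, Hamming, city block, correlation, cosine, ...), but none of them gave good results.

My dataset has shape (400, 70). Of the 70 features, 25 are binary and 45 are continuous.

Here is my Python code: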

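from scipy.spatial import distance

# Euclidean distance from every row of raw_data to the single user row
raw_data['distance'] = distance.cdist(raw_data, raw_user.values.reshape(1, -1), metric='euclidean')

# Sort the rows of the dataframe by the 'distance' column
raw_data = raw_data.sort_values(by='distance')
print(raw_data.distance)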

The result is:

155    3.047796e+09
177    3.047797e+09
162    3.047797e+09
23     3.047797e+09
192    3.047797e+09
       ...     
72     3.047931e+09
104    3.047931e+09
Name: distance, Length: 203, dtype: float64

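If you have another approach or technique for solving this problem, please feel free to suggest it. Thanks.

Solution

You shouldn't use the plain Euclidean distance here, because the features in your raw data vary on very different scales: binary features differ by at most 1 unit, while continuous features can differ by much more. The large-scale features therefore dominate the distance. I suggest a standardized Euclidean distance to measure the similarity between records instead. Try this: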

# Standard deviation of each column
columnWiseStandardDeviation = raw_data.std()

# Normalized Euclidean distance: each delta between a column of raw_data and
# the corresponding raw_user value is divided by that column's standard
# deviation. The normalized deltas are then squared and summed per row, and
# the square root of the sum is taken. ("Normalized Euclidean distance" is my
# own name for it in this context, not a standard term.)
# .values gives NumPy arrays, which can handle the differently shaped operands
# in the binary '-' operation.
distance = ((((raw_data.values - raw_user.values)/columnWiseStandardDeviation.values)**2).sum(axis=1))**0.5

# Get the record closest to the user input.
# df.iloc has to be used here because distance is a plain NumPy array and
# does not carry the index of the original dataframe.
closestRecord = raw_data.iloc[distance.argmin()]
print(closestRecord)
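
As a cross-check, the scaled distance above appears to coincide with what SciPy calls the standardized Euclidean distance (`metric='seuclidean'` in `cdist`) when `V` is set to the per-column sample variance. The sketch below uses a made-up four-row frame (the column names `is_member` and `income` and all values are illustrative assumptions, not the asker's data) to compare the two and to show how the raw Euclidean distance is dominated by the large-scale column.

```python
import numpy as np
import pandas as pd
from scipy.spatial import distance

# Toy data, assumed for illustration: one binary and one large-scale feature.
raw_data = pd.DataFrame({
    "is_member": [0, 1, 1, 0],
    "income": [50_000.0, 50_400.0, 90_000.0, 20_000.0],
})
raw_user = pd.DataFrame({"is_member": [1], "income": [50_000.0]})
user = raw_user.values.reshape(1, -1)

# Raw Euclidean distance: differences in income swamp the binary feature.
raw_dist = distance.cdist(raw_data.values, user, metric="euclidean").ravel()

# Manual standardized distance, as in the answer (pandas std() uses ddof=1).
manual = ((((raw_data.values - user) / raw_data.std().values) ** 2).sum(axis=1)) ** 0.5

# SciPy's built-in metric; V is the per-column variance (pandas var() also uses ddof=1).
scipy_dist = distance.cdist(
    raw_data.values, user, metric="seuclidean", V=raw_data.var().values
).ravel()

print("manual and SciPy agree:", np.allclose(manual, scipy_dist))
print("closest row (raw Euclidean):", raw_dist.argmin())
print("closest row (standardized):", scipy_dist.argmin())
```

On this toy frame the raw distance prefers the row whose income matches exactly even though the binary flag differs, whereas the standardized distance prefers the row with the matching flag and a small income gap.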

Since I don't have the actual data, I generated a dataframe of random numbers to test the script:

import random

import pandas as pd

rows, cols = 50, 10
# Per-column upper bounds, so that the columns have different scales
_m = [5*random.randint(1, cols) for c in range(cols)]
print(_m)

# Random test dataframe: column i holds integers in [0, _m[i]]
df = pd.DataFrame(data={i: [random.randint(0, _m[i]) for j in range(rows)] for i in range(cols)})
print(df)

columnWiseStandardDeviation = df.std()
print(columnWiseStandardDeviation.values)

# Random single-row "user input" drawn from the same ranges
df1 = pd.DataFrame(data=[[random.randint(0, _m[i]) for i in range(cols)]])
distance = ((((df.values - df1.values)/columnWiseStandardDeviation.values)**2).sum(axis=1))**0.5
print(df1)

# All rows as (position, distance) pairs, nearest first
print(sorted(list(enumerate(distance)), key=lambda d: d[1]))
print('Closest Record: ', df.iloc[distance.argmin()].values)
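
One practical wrinkle with real data is that a column with zero standard deviation (for example, a binary feature that is constant across the dataset) would cause a division by zero. The sketch below wraps the same idea in a helper that skips such columns and returns the k nearest rows rather than only the single closest one. The function name `most_similar_rows`, its parameters, and the test data are hypothetical, and dropping constant columns is an assumption about what is sensible here, not part of the original answer.

```python
import numpy as np
import pandas as pd


def most_similar_rows(data: pd.DataFrame, user_row: pd.Series, k: int = 5) -> pd.DataFrame:
    """Return the k rows of `data` nearest to `user_row` by standardized Euclidean distance."""
    std = data.std()
    # Assumption: constant columns (std == 0) carry no information, so they
    # are skipped to avoid dividing by zero.
    usable = std[std > 0].index
    # DataFrame - Series aligns the Series index with the dataframe columns.
    deltas = (data[usable] - user_row[usable]) / std[usable]
    dist = np.sqrt((deltas ** 2).sum(axis=1))
    # Attach the distances and keep the k smallest, nearest first.
    return data.assign(distance=dist).nsmallest(k, "distance")


# Made-up data: a binary flag, a large-scale continuous value, a constant column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "flag": rng.integers(0, 2, size=50),
    "value": rng.normal(1000.0, 250.0, size=50),
    "constant": 1,
})
user = df.iloc[7].copy()  # reuse an existing row as the "user input"
top = most_similar_rows(df, user, k=3)
print(top)
```

Because the user row here is a copy of row 7, that row should come back first with a distance of zero. Unlike the array-based version, this one keeps the original dataframe index on the result.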

euclidean-distance python scikit-learn scipy similarity
