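Problem description
I want to find the rows in a dataset that are most similar to a user's input.
My dataset looks like this:
This is the user input:
I have tried many distance metrics from scipy and sklearn (Euclidean, Hamming, city block, correlation, cosine, ...), but none of them gave good results.
My dataset has shape (400, 70); of the 70 features, 25 are binary and 45 are continuous.
Here is my Python code: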
from scipy.spatial import distance

raw_data['distance'] = distance.cdist(raw_data, raw_user.values.reshape(1, -1), metric='euclidean')
# Sort the rows of the dataframe by the 'distance' column
raw_data = raw_data.sort_values(by='distance')
print(raw_data.distance)
The result is:
155 3.047796e+09
177 3.047797e+09
162 3.047797e+09
23 3.047797e+09
192 3.047797e+09
...
72 3.047931e+09
104 3.047931e+09
Name: distance, Length: 203, dtype: float64
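If you know another method or technique for solving this problem, please feel free to suggest it. Thanks.
Solution
You should not use the plain Euclidean distance here, because the features in your raw data vary on very different scales: binary features differ by at most 1 unit, while the continuous features can differ by much more, so they dominate the distance. I therefore suggest a standardized Euclidean distance to measure the similarity between records. Try the following: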
import numpy as np

# Per-column standard deviation, used to put every feature on a comparable scale.
# If a 'distance' column was added to raw_data earlier, drop it first so that
# raw_data and raw_user have the same columns.
columnWiseStandardDeviation = raw_data.std()
# A constant column has zero standard deviation; use 1 instead to avoid dividing by zero.
columnWiseStandardDeviation = columnWiseStandardDeviation.replace(0, 1)

# Normalized Euclidean distance (my name for it in this context; it is commonly
# called the standardized Euclidean distance):
# the deltas between each row of raw_data and raw_user are divided by the
# column-wise standard deviations, then squared, summed per row, and square-rooted.
# .values gives NumPy arrays, which broadcast the single user row against
# every row of raw_data in the '-' operation.
deltas = (raw_data.values - raw_user.values) / columnWiseStandardDeviation.values
distance = np.sqrt((deltas ** 2).sum(axis=1))

# Get the record closest to the user input.
# distance is a plain NumPy array without the original dataframe index,
# so df.iloc (positional indexing) has to be used here.
closestRecord = raw_data.iloc[distance.argmin()]
print(closestRecord)
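The same distance is also available in SciPy as the `seuclidean` metric of `cdist`, which takes the per-column variances through its `V` argument. The sketch below uses a small made-up dataframe (the column names `flag`, `income`, and `age` and all the values are hypothetical, not from the question) and checks that it agrees with the hand-written formula:

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Hypothetical stand-in data: one binary column and two continuous columns
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "flag": [0, 1, 0, 1, 1, 0],
    "income": rng.uniform(1e4, 1e5, size=6),
    "age": rng.uniform(18, 80, size=6),
})
user = pd.DataFrame({"flag": [1], "income": [5.0e4], "age": [40.0]})

# Per-column sample variance (ddof=1 matches the pandas .std() default)
variances = data.var(ddof=1).values

# 'seuclidean' divides each squared difference by the matching variance
d_scipy = cdist(data.values, user.values, metric="seuclidean", V=variances).ravel()

# The same quantity computed by hand, as in the snippet above
d_manual = np.sqrt((((data.values - user.values) / data.std().values) ** 2).sum(axis=1))

# Position (not index label) of the closest row
closest_position = int(d_scipy.argmin())
print(data.iloc[closest_position])
```

Passing `V` explicitly keeps the scaling tied to the dataset alone, rather than whatever default the metric would compute.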
Since I don't have the actual data, I generated a dataframe of random numbers to test the script:
import random
import pandas as pd

rows, cols = 50, 10
# Random upper bound per column, so the columns end up on different scales
_m = [5 * random.randint(1, cols) for c in range(cols)]
print(_m)
# Random integer dataset: column i holds values between 0 and _m[i]
df = pd.DataFrame(data={i: [random.randint(0, _m[i]) for j in range(rows)] for i in range(cols)})
print(df)
columnWiseStandardDeviation = df.std()
print(columnWiseStandardDeviation.values)
# A single random "user input" row drawn from the same ranges
df1 = pd.DataFrame(data=[[random.randint(0, _m[i]) for i in range(cols)]])
distance = ((((df.values - df1.values) / columnWiseStandardDeviation.values) ** 2).sum(axis=1)) ** 0.5
print(df1)
# (row position, distance) pairs sorted from nearest to farthest
print(sorted(enumerate(distance), key=lambda d: d[1]))
print('Closest Record: ', df.iloc[distance.argmin()].values)
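If you want the several most similar rows rather than only the closest one, one possible variant is to standardize the columns with scikit-learn's `StandardScaler` and query a `NearestNeighbors` index. This is a sketch on made-up data (the `scales` array and `n_neighbors=5` are arbitrary choices for illustration). Note that `StandardScaler` divides by the population standard deviation (ddof=0) while pandas `.std()` uses ddof=1; that only rescales all distances by the same constant, so the ranking is unchanged.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical data: 50 rows, 10 columns on very different scales
scales = np.array([1, 5, 10, 50, 100, 500, 1e3, 5e3, 1e4, 1e5])
X = rng.uniform(0, 1, size=(50, 10)) * scales
user_row = rng.uniform(0, 1, size=(1, 10)) * scales

# Fit the scaler on the dataset only, then apply it to both dataset and user row
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
user_scaled = scaler.transform(user_row)

# Plain Euclidean distance on standardized columns
nn = NearestNeighbors(n_neighbors=5, metric="euclidean").fit(X_scaled)
distances, positions = nn.kneighbors(user_scaled)

print(positions[0])   # row positions of the 5 nearest rows, nearest first
print(distances[0])   # their standardized distances, in ascending order
```

With a dataframe, `df.iloc[positions[0]]` would return the matching rows.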