Problem description
Here is my code. Note that this is only a toy dataset; my real data has roughly 1000 entries in each table.
import pandas as pd
import numpy as np
import sklearn.neighbors
locations_stores = pd.DataFrame({
'city_A' : ['City1','City2','City3','City4',],'latitude_A': [ 56.361176,56.34061,56.374749,56.356624],'longitude_A': [ 4.899779,4.871195,4.893847,4.912281]
})
locations_neigh = pd.DataFrame({
'neigh_B': ['Neigh1','Neigh2','Neigh3','Neigh4','Neigh5'],'latitude_B' : [ 53.314,53.318,53.381,53.338,53.7364],'longitude_B': [ 4.955,4.975,4.855,4.873,4.425]
})
/some calc code here/
##df_dist_long.loc[df_dist_long.sort_values('dist(km)').groupby('neigh_B')['city_A'].min()]##
df_dist_long.to_csv('dist.csv',float_format='%.2f')
When I add df_dist_long.loc[df_dist_long.sort_values('dist(km)').groupby('neigh_B')['city_A'].min()], I get this error:
File "C:\Python\python38\lib\site-packages\pandas\core\groupby\groupby.py",line 656,in wrapper
raise ValueError
ValueError
Without that line, the output looks like this:
city_A neigh_B dist(km)
0 City1 Neigh1 6.45
1 City2 Neigh1 6.42
2 City3 Neigh1 7.93
3 City4 Neigh1 5.56
4 City1 Neigh2 8.25
5 City2 Neigh2 6.67
6 City3 Neigh2 8.55
7 City4 Neigh2 8.92
8 City1 Neigh3 7.01 ..... and so on
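What I want is another table that keeps only the city closest to each neighborhood. For example, for "Neigh1", City4 is the nearest (smallest distance). So I would like a table like the following: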
city_A neigh_B dist(km)
0 City4 Neigh1 5.56
1 City3 Neigh2 4.32
2 City1 Neigh3 7.93
3 City2 Neigh4 3.21
4 City4 Neigh5 4.56
5 City5 Neigh6 6.67
6 City3 Neigh7 6.16
..... and so on
It does not matter if city names repeat; I just want to save the closest pairs to another CSV. How should I implement this?
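Solution
If you only want the nearest city for each neighborhood, you do not need to compute the full distance matrix.
Here is a working code example, although the output I get differs from yours. Perhaps the latitudes/longitudes are off.
I used your data: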
import pandas as pd
import numpy as np
import sklearn.neighbors
locations_stores = pd.DataFrame({
'city_A' : ['City1','City2','City3','City4',],'latitude_A': [ 56.361176,56.34061,56.374749,56.356624],'longitude_A': [ 4.899779,4.871195,4.893847,4.912281]
})
locations_neigh = pd.DataFrame({
'neigh_B': ['Neigh1','Neigh2','Neigh3','Neigh4','Neigh5'],'latitude_B' : [ 53.314,53.318,53.381,53.338,53.7364],'longitude_B': [ 4.955,4.975,4.855,4.873,4.425]
})
Then I created a BallTree that can be queried:
from sklearn.neighbors import BallTree
import numpy as np
stores_gps = locations_stores[['latitude_A','longitude_A']].values
neigh_gps = locations_neigh[['latitude_B','longitude_B']].values
tree = BallTree(stores_gps,leaf_size=15,metric='haversine')
For each neighborhood we want the single closest (k=1) city/store:
distance,index = tree.query(neigh_gps,k=1)
earth_radius = 6371
distance_in_km = distance * earth_radius
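The multiplication by the Earth radius works because the haversine metric returns the great-circle distance as a central angle in radians on a unit sphere. As a rough sketch of the same formula in plain numpy (the function name `haversine_km` and the constant name are illustrative and not part of the original answer; inputs are assumed to be in degrees and are converted inside the function):

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, same value as used in the answer


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    # The trigonometric formula needs radians, so convert first.
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    central_angle = 2 * np.arcsin(np.sqrt(a))  # radians on the unit sphere
    return central_angle * EARTH_RADIUS_KM


# City1 to Neigh1, using the toy coordinates from the question
d = haversine_km(56.361176, 4.899779, 53.314, 4.955)
print(round(float(d), 2))
```

Because it uses numpy ufuncs, the function also accepts arrays, which makes it handy for spot-checking individual pairs against whatever the tree returns.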
We can build a DataFrame of the results like this:
pd.DataFrame({
'Neighborhood' : locations_neigh.neigh_B,'Closest_city' : locations_stores.city_A[ np.array(index)[:,0] ].values,'Distance_to_city' : distance_in_km[:,0]
})
This gives me:
Neighborhood Closest_city Distance_to_city
0 Neigh1 City2 19112.334106
1 Neigh2 City2 19014.154744
2 Neigh3 City2 18851.168702
3 Neigh4 City2 19129.555188
4 Neigh5 City4 15498.181486
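Since our outputs differ, something still needs correcting. I first guessed at swapped latitude/longitude, but a more likely cause is that scikit-learn's haversine metric expects [latitude, longitude] in radians, while the code above passes degrees, so the distances shown are not valid kilometres; converting both arrays with np.radians before building and querying the tree should address that. Either way, this is the approach you want, especially for your data size.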
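A minimal sketch of the same BallTree query with the coordinates converted to radians first, as scikit-learn's haversine metric expects. The output column names follow the question's table; the variable names `stores_rad`, `neigh_rad` and `nearest` are illustrative. With the toy coordinates the stores sit about three degrees of latitude north of the neighborhoods, so the distances come out in the hundreds of kilometres rather than the single digits shown in the question.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

locations_stores = pd.DataFrame({
    'city_A': ['City1', 'City2', 'City3', 'City4'],
    'latitude_A': [56.361176, 56.34061, 56.374749, 56.356624],
    'longitude_A': [4.899779, 4.871195, 4.893847, 4.912281],
})
locations_neigh = pd.DataFrame({
    'neigh_B': ['Neigh1', 'Neigh2', 'Neigh3', 'Neigh4', 'Neigh5'],
    'latitude_B': [53.314, 53.318, 53.381, 53.338, 53.7364],
    'longitude_B': [4.955, 4.975, 4.855, 4.873, 4.425],
})

EARTH_RADIUS_KM = 6371.0

# The haversine metric expects [latitude, longitude] in radians.
stores_rad = np.radians(locations_stores[['latitude_A', 'longitude_A']].to_numpy())
neigh_rad = np.radians(locations_neigh[['latitude_B', 'longitude_B']].to_numpy())

tree = BallTree(stores_rad, leaf_size=15, metric='haversine')
# distance and index both have shape (n_neighborhoods, 1) for k=1.
distance, index = tree.query(neigh_rad, k=1)

nearest = pd.DataFrame({
    'neigh_B': locations_neigh['neigh_B'].to_numpy(),
    'city_A': locations_stores['city_A'].to_numpy()[index[:, 0]],
    'dist(km)': distance[:, 0] * EARTH_RADIUS_KM,  # radians -> km
})
print(nearest)
```

To save the pairs as the question asks, something like `nearest.to_csv('nearest.csv', float_format='%.2f', index=False)` would do; the file name is only a placeholder.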
Edit: for the full matrix, use
from sklearn.metrics import DistanceMetric  # on scikit-learn < 1.0: from sklearn.neighbors import DistanceMetric
dist = DistanceMetric.get_metric('haversine')
earth_radius = 6371
haversine_distances = dist.pairwise(np.radians(stores_gps),np.radians(neigh_gps) )
haversine_distances *= earth_radius
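This gives you the full matrix, but be aware that for larger inputs it will take a long time and you can expect to hit memory limits.
With the matrix laid out as above (one row per store, one column per neighborhood), numpy's np.argmin(haversine_distances,axis=0) gives a result similar to the BallTree query: for each neighborhood, the index of the nearest store, which can be used just as in the BallTree example.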
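If the long table df_dist_long from the question already exists, the closest row per neighborhood can also be picked directly from it. The line in the question takes the minimum of city_A (the city names) rather than of the distance, and then hands those names to .loc, which expects row labels. A sketch using groupby(...).idxmin() instead, with the distance rows printed in the question as sample data (the name `df_closest` is illustrative):

```python
import pandas as pd

# Sample rows copied from the output shown in the question.
df_dist_long = pd.DataFrame({
    'city_A': ['City1', 'City2', 'City3', 'City4'] * 2,
    'neigh_B': ['Neigh1'] * 4 + ['Neigh2'] * 4,
    'dist(km)': [6.45, 6.42, 7.93, 5.56, 8.25, 6.67, 8.55, 8.92],
})

# idxmin returns, for each neighborhood, the row label of its smallest distance.
closest_rows = df_dist_long.groupby('neigh_B')['dist(km)'].idxmin()

# .loc with those row labels keeps exactly one row per neighborhood.
df_closest = df_dist_long.loc[closest_rows].reset_index(drop=True)
print(df_closest)
```

This still requires the full long table in memory, so for the real data of about 1000 x 1000 pairs the tree query is the lighter option.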