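Improving the efficiency of applying a haversine function to all rows of two Pandas datasets

Problem description

I am using pandas to calculate the distances between schools in two datasets from their latitude and longitude. I am using the following haversine function: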

from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # convert all four decimal-degree inputs to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r

The datasets look like this:

    gr_offer  sch_code     lat       lon        distance_to_sec_school_km  nearest_sec_school_code
0   G.1-4     S0306020712  11.09990  37.987301  NaN                        NaN
1   G.1-4     S0401140392  7.15509   38.586300  NaN                        NaN
2   G.1-4     S0406150452  9.40269   41.964401  NaN                        NaN

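The second dataset looks similar but offers different grades. For each school in dataset 1, I calculate the distance to every school in dataset 2 and fill in two columns of the first dataset: 1) distance_to_sec_school_km and 2) nearest_sec_school_code. I then take the first dataset as the output. My code seems to run correctly, but it takes a little over an hour to run. I would like to make it more efficient.

Any suggestions would be greatly appreciated!

See the function below: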

def calculate_distance(df1, df2):
    import time
    startTime = time.time()

    i = 0
    count = 0
    # two variables to store the results from the distance function.
    nearest_gps = 0
    nearest_code = 0

    # loop to calculate distance between primary and secondary
    while i < len(df1):
        if count < len(df2):
            distance = haversine(df1['lon'].iloc[i], df1['lat'].iloc[i], df2['lon'].iloc[count], df2['lat'].iloc[count])
            if nearest_gps == 0 and count < len(df2): # i.e. first iteration.
                nearest_gps = distance
                nearest_code = df2['sch_code'].iloc[count]
                count += 1
            elif distance < nearest_gps and count < len(df2): # shortest distance replaced
                nearest_gps = distance
                nearest_code = df2['sch_code'].iloc[count]
                count += 1
            else:
                count += 1
        else:
            # all of df2 has been scanned for row i: store the result in one
            # .loc call (chained df1[col].iloc[i] = ... may write to a copy)
            df1.loc[df1.index[i], 'distance_to_sec_school_km'] = nearest_gps
            df1.loc[df1.index[i], 'nearest_sec_school_code'] = nearest_code
            i += 1
            count = 0
            nearest_gps = 0
            nearest_code = 0

    executionTime = (time.time() - startTime)
    print('Execution time in seconds: ' + str(executionTime))
    df1.to_csv('edited_csv.csv', index=False, encoding='utf-8')
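
A common way to speed up a nearest-point search like the one described above is to replace the per-row Python loop with NumPy array operations. The following is only a minimal sketch under stated assumptions, not a measured or verified fix for the original data: it assumes NumPy is available, that both frames use the lat, lon and sch_code columns shown in the question, and that df2 is not empty. The names haversine_matrix, nearest_school and chunk_size are invented here for illustration, and the demo coordinates are made up.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371  # same radius as in the question's haversine


def haversine_matrix(lon1, lat1, lon2, lat2):
    """Pairwise great-circle distances in km.

    lon1/lat1 have shape (n,), lon2/lat2 have shape (m,), all in decimal
    degrees. Returns an (n, m) array.
    """
    lon1, lat1, lon2, lat2 = (np.radians(np.asarray(a, dtype=float))
                              for a in (lon1, lat1, lon2, lat2))
    # Broadcast (n, 1) against (1, m) to cover every pair at once.
    dlon = lon2[None, :] - lon1[:, None]
    dlat = lat2[None, :] - lat1[:, None]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat1)[:, None] * np.cos(lat2)[None, :] * np.sin(dlon / 2) ** 2)
    # Clip guards against rounding pushing a slightly outside [0, 1].
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


def nearest_school(df1, df2, chunk_size=1000):
    """Return a copy of df1 with the two result columns filled in.

    chunk_size (hypothetical parameter) bounds memory use: only a
    (chunk_size, len(df2)) block of distances exists at any time.
    """
    out = df1.copy()
    codes = df2['sch_code'].to_numpy()
    lon2 = df2['lon'].to_numpy()
    lat2 = df2['lat'].to_numpy()
    lon1 = out['lon'].to_numpy()
    lat1 = out['lat'].to_numpy()
    best_dist = np.empty(len(out))
    best_code = np.empty(len(out), dtype=object)
    for start in range(0, len(out), chunk_size):
        stop = start + chunk_size
        block = haversine_matrix(lon1[start:stop], lat1[start:stop], lon2, lat2)
        idx = block.argmin(axis=1)  # position of the closest df2 school per row
        best_dist[start:stop] = block[np.arange(len(idx)), idx]
        best_code[start:stop] = codes[idx]
    out['distance_to_sec_school_km'] = best_dist
    out['nearest_sec_school_code'] = best_code
    return out


# Tiny demo with made-up coordinates (not data from the question).
primary = pd.DataFrame({'sch_code': ['P1'], 'lat': [0.0], 'lon': [0.0]})
secondary = pd.DataFrame({'sch_code': ['A', 'B'],
                          'lat': [0.0, 0.0], 'lon': [1.0, 5.0]})
result = nearest_school(primary, secondary)
print(result[['sch_code', 'nearest_sec_school_code', 'distance_to_sec_school_km']])
```

Unlike the loop in the question, this sketch returns a new frame instead of modifying df1 in place, so writing the CSV is left to the caller. Lowering chunk_size trades some speed for lower memory use when both datasets are large.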

haversine pandas performance