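Problem description
I have two large files, each with latitude and longitude columns. I am trying to find all pairs of records that lie within 500 meters of each other.
I followed the steps below.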
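1. First, I joined the two DataFrames on abs(lat1-lat2) < 0.01 and abs(long1-long2) < 0.01:
import org.apache.spark.sql.functions.{abs, broadcast}
val joindf = df1.join(broadcast(df2), abs(df1("lat1") - df2("lat2")) < 0.01 && abs(df1("long1") - df2("long2")) < 0.01)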
// This effectively performs a cross join and keeps the row pairs inside a 0.01-degree box (about 1.1 km in latitude).
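2. Next, I calculated the distance (basically by calling a custom function that takes lat1, lat2, long1, long2 and returns the distance in meters).
3. After that, I added a filter condition such as distance < 500.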
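The custom distance function from step 2 is not shown in the question. As a rough stand-in, a haversine implementation could look like the sketch below. The object name `Haversine`, the function name `distanceMeters`, and the spherical Earth radius are assumptions for illustration; only the argument order (lat1, lat2, long1, long2) is taken from the description above.

```scala
object Haversine {
  // Mean Earth radius in meters (spherical approximation, an assumption of this sketch).
  val EarthRadiusM: Double = 6371000.0

  // Great-circle distance in meters between (lat1, long1) and (lat2, long2),
  // all given in decimal degrees. The argument order follows the question.
  def distanceMeters(lat1: Double, lat2: Double, long1: Double, long2: Double): Double = {
    val dLat = math.toRadians(lat2 - lat1)
    val dLon = math.toRadians(long2 - long1)
    val a = math.pow(math.sin(dLat / 2), 2) +
      math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
        math.pow(math.sin(dLon / 2), 2)
    2 * EarthRadiusM * math.asin(math.sqrt(a))
  }
}
```

The filter in step 3 would then keep only pairs for which `Haversine.distanceMeters(...) < 500`. In Spark this function would typically be wrapped with `udf(...)` before being applied to columns.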
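These steps work fine on a small dataset but not on large data: step 1 alone takes about 4 days.
Can this be solved with a geospatial approach? I have read the documentation, but I am new to Spark. Please help.
A sample of the DataFrames looks like this.
DataFrame 1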
Tehsil      district    Type                Code    POP   V_Lat      V_Long
Tulsipur    Balrampur   NON Census Village  0       0     27.594705  82.334491
Tulsipur    Balrampur   NON Census Village  0       0     27.605287  82.34746
Tulsipur    Balrampur   NON Census Village  0       0     27.573511  82.336592
Tulsipur    Balrampur   NON Census Village  0       0     27.582564  82.355718
Tulsipur    Balrampur   NON Census Village  0       0     27.57687   82.322748
Tulsipur    Balrampur   NON Census Village  0       0     27.583982  82.344223
Tulsipur    Balrampur   NON Census Village  0       0     27.577273  82.330141
Tulsipur    Balrampur   NON Census Village  0       0     27.569862  82.326575
Tulsipur    Balrampur   Village             173435  2702  27.584897  82.353102
Tulsipur    Balrampur   Village             173434  2552  27.592387  82.330867
Tulsipur    Balrampur   Village             173436  3506  27.5734    82.340243
Tulsipur    Balrampur   Village             173431  1693  27.599005  82.345086
Haidergarh  Bara Banki  NON Census Village  0       0     26.579465  81.461515
DataFrame 2
UE Purwa Dhanauti Haidergarh Bara Banki NON Census Village 0 0 26.568228 81.471936
UE Purwa Lachhmansingh Haidergarh Bara Banki NON Census Village 0 0 26.569711 81.478505
Solution
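One possible direction, offered as a sketch and not a verified fix for this dataset: the join condition in step 1 contains no equality, so Spark cannot hash or sort on it and ends up comparing every row of one side with every row of the other. A common workaround is to bucket both sides into a grid of 0.01-degree cells, replicate one side into its 8 neighbouring cells, and join on the cell id so that Spark can use an ordinary equi-join. Two values that differ by less than 0.01 always fall in the same or an adjacent cell, so no candidate pair is lost. The object name `GridJoin`, its helper names, and the `cell_lat` / `cell_lon` columns are hypothetical; the column names `lat1`, `long1`, `lat2`, `long2` are taken from the code in the question.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

object GridJoin {
  // Cell size in degrees, chosen to match the 0.01 threshold used in step 1.
  val CellDeg: Double = 0.01

  // Adds integer grid-cell columns derived from the latitude/longitude columns.
  def withCell(df: DataFrame, latCol: String, lonCol: String): DataFrame =
    df.withColumn("cell_lat", floor(col(latCol) / CellDeg))
      .withColumn("cell_lon", floor(col(lonCol) / CellDeg))

  // Replicates every row into its own cell and the 8 surrounding cells.
  def withNeighbourCells(df: DataFrame, latCol: String, lonCol: String): DataFrame = {
    val offsets = array(lit(-1), lit(0), lit(1))
    withCell(df, latCol, lonCol)
      .withColumn("d_lat", explode(offsets))
      .withColumn("d_lon", explode(offsets))
      .withColumn("cell_lat", col("cell_lat") + col("d_lat"))
      .withColumn("cell_lon", col("cell_lon") + col("d_lon"))
      .drop("d_lat", "d_lon")
  }

  // Equi-join on the cell id, then re-apply the original bounding-box condition.
  def candidatePairs(df1: DataFrame, df2: DataFrame): DataFrame = {
    val left = withCell(df1, "lat1", "long1")
    val right = withNeighbourCells(df2, "lat2", "long2")
    left.join(right, Seq("cell_lat", "cell_lon"))
      .where(abs(col("lat1") - col("lat2")) < CellDeg &&
        abs(col("long1") - col("long2")) < CellDeg)
  }
}
```

The exact distance function from step 2 and the 500-meter filter from step 3 would then run only on the rows returned by `GridJoin.candidatePairs(df1, df2)`. Replicating one side ninefold costs memory and shuffle volume, so it makes sense to replicate the smaller DataFrame. The 0.01-degree cell is wide enough for a 500-meter radius at the latitudes in the sample data (around 27°N), but would need to grow at high latitudes, where a degree of longitude covers less ground.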