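Spark: how to join only within partitions

Problem description

I have two large DataFrames. Each row carries latitude/longitude data. My goal is to join the two DataFrames and find all pairs of points that lie within a given distance of each other, e.g. 100 m.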

df1: (id,lat,lon,geohash7)
df2: (id,lat,lon,geohash7)

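I want to partition df1 and df2 on geohash7 and then join only within partitions. I want to avoid joins across partitions to reduce the amount of computation.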

df1 = df1.repartition(200,"geohash7")
df2 = df2.repartition(200,"geohash7")

# dist(...) is expected to return a Column holding the distance in metres
df_merged = df1.join(
    df2,
    (df1["geohash7"] == df2["geohash7"])
    & (dist(df1["lat"], df1["lon"], df2["lat"], df2["lon"]) < 100),
)

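So basically: join on geohash7, then make sure the distance between the two points is less than 100 m. The problem is that Spark actually ends up cross joining all the data. How can I make it join only within partitions, rather than across partitions?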
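To make the intended semantics concrete, here is a minimal pure-Python sketch of "compare only rows that share a geohash7 key". It is not Spark code and not a fix for the query plan. The question never defines `dist`, so the haversine formula used here is an assumption, as are the helper name `join_within_buckets`, the tuple layout `(id, lat, lon, geohash7)`, and the cell labels `"cellA"` / `"cellB"`, which are placeholders rather than real geohashes.

```python
import math
from collections import defaultdict

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def dist(lat1, lon1, lat2, lon2):
    """Haversine distance in metres between two points given in degrees.

    Assumption: the question's dist() is a great-circle distance like this.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def join_within_buckets(rows1, rows2, max_dist_m=100.0):
    """Pair rows that share a geohash7 key and lie closer than max_dist_m.

    Each row is a tuple (id, lat, lon, geohash7). Rows with different keys
    are never compared, which is the behaviour the question asks for.
    """
    # Group the right-hand side by key so each left row only sees its own bucket.
    buckets = defaultdict(list)
    for row in rows2:
        buckets[row[3]].append(row)

    matches = []
    for id1, lat1, lon1, key in rows1:
        for id2, lat2, lon2, _ in buckets.get(key, []):
            if dist(lat1, lon1, lat2, lon2) < max_dist_m:
                matches.append((id1, id2))
    return matches


rows1 = [("a", 52.0, 13.0, "cellA")]
rows2 = [
    ("x", 52.0005, 13.0, "cellA"),  # same cell, roughly 56 m away
    ("y", 52.0005, 13.0, "cellB"),  # same distance, but a different cell
    ("z", 52.01, 13.0, "cellA"),    # same cell, roughly 1.1 km away
]
print(join_within_buckets(rows1, rows2))  # [('a', 'x')]
```

Note that `y` is dropped even though it is as close as `x`: joining on a single cell key cannot see pairs that straddle a cell border, so a join keyed only on geohash7 may miss some points that are within 100 m.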


apache-spark apache-spark-sql partitioning