How to remove duplicates from a Cartesian product in Spark

Problem description

I created a cross join of a set of words in Spark to compare their similarity. However, I am trying to get rid of the duplicate entries, since (word1, word2) and (word2, word1) have the same score. I have the following table:

+-------+-------+-------+
|  col1 | col2  | score |
+-------+-------+-------+
| word1 | word1 |   1   |
| word1 | word2 | 0.345 |
| word1 | word3 | 0.432 |
| word2 | word1 | 0.345 |
| word2 | word2 |   1   |
| word2 | word3 | 0.543 |
| word3 | word1 | 0.432 |
| word3 | word2 | 0.543 |
| word3 | word3 |   1   |
+-------+-------+-------+

Ideally, I would like a result like this, where no comparison is repeated:

+-------+-------+-------+
|  col1 | col2  | score |
+-------+-------+-------+
| word1 | word1 |   1   |
| word1 | word2 | 0.345 |
| word1 | word3 | 0.432 |
| word2 | word2 |   1   |
| word2 | word3 | 0.543 |
| word3 | word3 |   1   |
+-------+-------+-------+
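
For context, here is a minimal sketch of how a table like this might be produced. The crossJoin call pairs every word with every other word; the similarity function is a hypothetical placeholder, not part of the original question:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# One row per word.
words = spark.createDataFrame([("word1",), ("word2",), ("word3",)], ["col1"])

# Cross join the word list with itself; every unordered pair shows up
# twice, once as (a, b) and once as (b, a).
pairs = words.crossJoin(words.withColumnRenamed("col1", "col2"))

# "similarity" is a hypothetical scoring function standing in for
# whatever produced the scores above:
# scored = pairs.withColumn("score", similarity(F.col("col1"), F.col("col2")))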

Solution

Combine col1 and col2 into an array and sort it alphabetically with sort_array. After sorting, both (word1, word2) and (word2, word1) map to the same array [word1, word2], so calling .distinct() removes the duplicates. You can then unpack the array back into col1 and col2:

from pyspark.sql import functions as F
from pyspark.sql.functions import sort_array

# Sort each (col1, col2) pair so that both orderings become the same
# array, deduplicate, then unpack the array back into two columns.
df.withColumn("sorted_list", sort_array(F.array([F.col("col1"), F.col("col2")]))) \
    .select("sorted_list", "score").distinct() \
    .select(F.col("sorted_list")[0].alias("col1"),
            F.col("sorted_list")[1].alias("col2"),
            "score").show()

Output:

+-----+-----+-----+
| col1| col2|score|
+-----+-----+-----+
|word1|word1|  1.0|
|word1|word2|0.345|
|word1|word3|0.432|
|word2|word2|  1.0|
|word2|word3|0.543|
|word3|word3|  1.0|
+-----+-----+-----+
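
Since the scores here are symmetric, an alternative sketch that skips the array round-trip is to keep only the rows where col1 sorts at or before col2. A filter is a narrow transformation, so this also avoids the shuffle that distinct() incurs:

# Keep one orientation of each pair; works because score(a, b) == score(b, a).
df.filter(F.col("col1") <= F.col("col2")).show()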
