How to find similar rows in a PySpark DataFrame?

Problem description

I have a Spark DataFrame that looks like this:

+-------+-----------------------------+
|user_id|       profile_features      |
+-------+-----------------------------+
|   100 |  [0.0,0.33..,0.66..,...|
|   101 |  [0.42..,0.15..,0.57..,...|
|   102 |  [0.33..,0.0,0.25..,...|
|   103 |  [0.15..,...|
|   104 |  [0.0,...|
+-------+-----------------------------+

How can I find the users most similar to a given user by user_id? I was thinking of multiplying the given user's feature vector with those of the other rows (user_ids) to compute their similarity, then sorting the resulting table somehow and returning the top N user_ids. If that is the right approach, how can it be implemented in PySpark? (A sketch of this idea follows below.)
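
The multiply-and-sort idea can be expressed with a cosine-similarity UDF. This is a minimal sketch, not a definitive implementation: it assumes the DataFrame is named df, that profile_features is an array&lt;double&gt; column, and that names like target_id and top_n are illustrative placeholders.

import math
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

target_id = 100   # the user to compare against (illustrative value)
top_n = 5         # how many similar users to return (illustrative value)

# Pull the target user's feature vector onto the driver.
# (Assumes exactly one row matches target_id.)
target = df.filter(F.col("user_id") == target_id) \
           .select("profile_features").first()[0]

@F.udf(DoubleType())
def cosine_sim(v):
    # Cosine similarity between this row's vector and the target vector.
    dot = sum(a * b for a, b in zip(v, target))
    norms = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in target))
    return float(dot / norms) if norms else 0.0

# Score every other user, sort by similarity, and keep the top N.
most_similar = (df.filter(F.col("user_id") != target_id)
                  .withColumn("similarity", cosine_sim("profile_features"))
                  .orderBy(F.col("similarity").desc())
                  .limit(top_n))
most_similar.show()

For large datasets, a plain UDF over all rows may be slow; Spark ML's LSH estimators (e.g. BucketedRandomProjectionLSH with approxNearestNeighbors) offer an approximate alternative.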

Solution

Pass in the columns you want to check for duplicates, group by them, and count: combinations whose count is greater than 1 are duplicates; otherwise they are unique records.

Attribute1 = ["user_id","profile_features"]

Selected_Col_Groupby = Data.select(Attribute1).groupBy(Attribute1).count()

Rule_Flag = Selected_Col_Groupby.withColumn('RuleCol',f.when(Selected_Col_Groupby["count"] > 1,1).otherwise(0)).drop("count")

Duplicates = Rule_Flag.filter('RuleCol'+" == 1")
Not_Duplicates = Rule_Flag.filter('RuleCol'+" == 0")
