Question
I have a Spark dataframe like the one below:
+-------+-----------------------------+
|user_id| profile_features |
+-------+-----------------------------+
| 100 | [0.0,0.33..,0.66..,...|
| 101 | [0.42..,0.15..,0.57..,...|
| 102 | [0.33..,0.0,0.25..,...|
| 103 | [0.15..,...|
| 104 | [0.0,...|
+-------+-----------------------------+
Given a user_id, how can I find the users most similar to that user? I was thinking of multiplying the given user's vector with those of the other rows (user_ids) to get their similarities, then somehow sorting the resulting table and returning the top N user_ids. If that is the right approach, how would I implement it in pyspark?
Solution
Pass the columns in which you want to look for duplicates, group by them, and count. Rows whose count is greater than 1 are duplicates; otherwise they are unique records:
import pyspark.sql.functions as f

# Columns to check for duplicates
Attribute1 = ["user_id", "profile_features"]

# Group by the selected columns and count occurrences of each combination
Selected_Col_Groupby = Data.select(Attribute1).groupBy(Attribute1).count()

# Flag each combination: 1 = appears more than once (duplicate), 0 = unique
Rule_Flag = Selected_Col_Groupby.withColumn(
    "RuleCol", f.when(Selected_Col_Groupby["count"] > 1, 1).otherwise(0)
).drop("count")

Duplicates = Rule_Flag.filter("RuleCol == 1")
Not_Duplicates = Rule_Flag.filter("RuleCol == 0")
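The approach the question proposes (multiply the given user's feature vector against every other row, sort, take the top N) amounts to a cosine-similarity ranking. As a minimal pure-Python sketch of that computation, with illustrative names (`cosine_similarity`, `top_n_similar` are not from the original post); in Spark the same scoring could be applied per row with a UDF over a `crossJoin` against the target user's vector:

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (||a|| * ||b||); returns 0.0 if either vector is all zeros
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_n_similar(target_id, rows, n=2):
    # rows: list of (user_id, feature_vector) pairs
    target_vec = dict(rows)[target_id]
    scored = [
        (uid, cosine_similarity(target_vec, vec))
        for uid, vec in rows
        if uid != target_id
    ]
    # sort by similarity, highest first, and keep the top n user_ids
    scored.sort(key=lambda t: t[1], reverse=True)
    return [uid for uid, _ in scored[:n]]

# Toy vectors shaped like the truncated ones in the question
rows = [
    (100, [0.0, 0.33, 0.66]),
    (101, [0.42, 0.15, 0.57]),
    (102, [0.33, 0.0, 0.25]),
]
print(top_n_similar(100, rows, n=2))  # → [101, 102]
```

For a large number of users, a full pairwise comparison is quadratic; Spark MLlib's approximate nearest-neighbor tooling (e.g. LSH) is the usual way to avoid that, though the sketch above is enough for a single target user against a modest table.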