pyspark 数据帧上的 Reduce 和 Lambda

问题描述

以下是来自 https://graphframes.github.io/graphframes/docs/_site/user-guide.html

的示例

我唯一困惑的是条件函数中“lit(0)”的目的 如果这个“lit(0)”意味着输入“cnt”?如果是,为什么在 ["ab","bc","cd"] 之后?

from pyspark.sql.functions import col,lit,when
from pyspark.sql.types import IntegerType
from graphframes.examples import Graphs
from functools import reduce

chain4 = g.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[cd]->(d)")

chain4.show()

sumFriends = lambda cnt,relationship: when(relationship == "friend",cnt+1).otherwise(cnt)

condition = reduce(lambda cnt,e: sumFriends(cnt,col(e).relationship),["ab","cd"],lit(0))

chainWith2Friends2 = chain4.where(condition >= 2)
chainWith2Friends2.show()

解决方法

lit(0)reduce 语句的 initializer。您需要使用 sumFriends 初始化 cnt = 0 计数器才能开始计数。

condition = reduce(lambda cnt,e: sumFriends(cnt,col(e).relationship),["ab","bc","cd"],lit(0))

# should be equivalent to

condition = sumFriends(lit(0),col("ab").relationship)
condition = sumFriends(condition,col("bc").relationship)
condition = sumFriends(condition,col("cd").relationship)