PySpark: applying a function to a list of columns with map

Problem description

I have the following list, which contains some column names from the DataFrame df:

stringList = ['A','B','C']

I want to count the distinct values in each of these columns. I came across the following code, but it does not seem to work:

from pyspark.sql.functions import col, countDistinct

distinctList = []
def countDistinctCats(colName):
  count = df.agg(countDistinct(colName)).collect()
  distinctList.append(count)

# Apply function on every column
map(countDistinctCats,stringList)
print(distinctList)

However, the following approach seems to work, although it is very slow:

result = map(lambda x: df.agg(countDistinct(col(x))).collect(),stringList)
print(list(result))

Why does the first approach (the one that calls countDistinctCats through map) not work?

Solution
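A likely explanation, assuming the code runs under Python 3: map() returns a lazy iterator, so countDistinctCats is never called unless something consumes that iterator. The variant that wraps the result in list() forces the calls, which would explain why it produces output. The sketch below illustrates this behaviour without Spark; the dict `data` and the snake_case names are stand-ins introduced here for illustration, not part of the original question.

```python
# Stand-in data: a dict of columns instead of a Spark DataFrame (assumption,
# used only so the example runs without a Spark session).
data = {"A": [1, 1, 2], "B": [3, 3, 3], "C": [4, 5, 6]}
string_list = ["A", "B", "C"]

distinct_list = []

def count_distinct_cats(col_name):
    # Side effect: append the distinct count for one column.
    distinct_list.append(len(set(data[col_name])))

# In Python 3, map() only builds a lazy iterator; nothing is called yet.
lazy = map(count_distinct_cats, string_list)
after_map_only = list(distinct_list)   # snapshot: still empty

# Consuming the iterator is what actually runs the function.
for _ in lazy:
    pass
after_consuming = list(distinct_list)

# A plain loop or comprehension is the clearer way to get the same result.
counts = {c: len(set(data[c])) for c in string_list}

print(after_map_only)   # []
print(after_consuming)  # [2, 1, 3]
print(counts)           # {'A': 2, 'B': 1, 'C': 3}
```

In the PySpark case, the equivalent fix would be to iterate over stringList with an explicit for loop or a list comprehension rather than relying on map() for side effects. The slowness is probably due to launching one Spark job per column; passing one countDistinct expression per column to a single df.agg call would typically avoid that, although this has not been measured here.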


apache-spark apache-spark-sql pyspark python