Problem description
Is there a more efficient way to combine Spark DataFrames without using a for loop? In this post, the answer uses for loops, which seems to take a long time when you have many DataFrames. The code is as follows:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
spark = SparkSession.builder\
    .appName("DynamicFrame")\
    .getOrCreate()
df01 = spark.createDataFrame([(1,2,3),(9,5,6)],("C1","C2","C3"))
df02 = spark.createDataFrame([(11,12,13),(10,15,16)],("C2","C3","C4"))
df03 = spark.createDataFrame([(111,112),(110,115)],("C4","C5"))
dataframes = [df01,df02,df03]
# Create a list of all the column names and sort them
cols = set()
for df in dataframes:
    for x in df.columns:
        cols.add(x)
cols = sorted(cols)
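As an aside, the two nested loops that collect the column names can be collapsed into a single expression with `set().union(*...)`. A minimal sketch, using plain lists in place of each `df.columns`:

```python
# Hypothetical column lists standing in for df01.columns, df02.columns, df03.columns.
columns_per_df = [["C1", "C2", "C3"], ["C2", "C3", "C4"], ["C4", "C5"]]

# set().union(*iterables) merges all the name lists into one set of unique names.
cols = sorted(set().union(*columns_per_df))
print(cols)  # ['C1', 'C2', 'C3', 'C4', 'C5']
```

With real DataFrames this would read `cols = sorted(set().union(*(df.columns for df in dataframes)))`.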
# Create a dictionary with all the dataframes
dfs = {}
for i, d in enumerate(dataframes):
    new_name = 'df' + str(i)  # New name for the key, the dataframe is the value
    dfs[new_name] = d
    # Loop through all column names. Add the missing columns to the dataframe (with value 0)
    for x in cols:
        if x not in d.columns:
            dfs[new_name] = dfs[new_name].withColumn(x, lit(0))
    dfs[new_name] = dfs[new_name].select(cols)  # Use 'select' to get the columns sorted
# Now put it all together with a loop (union)
result = dfs['df0']  # Take the first dataframe, add the others to it
dfs_to_add = list(dfs.keys())  # List of all the dataframes in the dictionary
dfs_to_add.remove('df0')  # Remove the first one, because it is already in the result
for x in dfs_to_add:
    result = result.union(dfs[x])
result.show()
Can some kind of recursive technique be used?
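Instead of recursion, the whole pad-columns-then-union procedure can be expressed as a fold with `functools.reduce`. A minimal sketch of that idea, using plain dicts (column name → list of values) in place of Spark DataFrames so it runs without a Spark session:

```python
from functools import reduce

# Hypothetical stand-ins for the DataFrames above: dicts mapping column -> rows.
df01 = {"C1": [1, 9], "C2": [2, 5], "C3": [3, 6]}
df02 = {"C2": [11, 10], "C3": [12, 15], "C4": [13, 16]}
frames = [df01, df02]

# Union of all column names across frames.
cols = sorted(set().union(*frames))

def pad(frame, cols):
    """Add any missing column, filled with 0 (one 0 per existing row)."""
    n_rows = len(next(iter(frame.values())))
    return {c: frame.get(c, [0] * n_rows) for c in cols}

def union(a, b):
    """Concatenate two frames that share the same column set."""
    return {c: a[c] + b[c] for c in a}

# Fold all padded frames into one, no explicit loop.
result = reduce(union, (pad(f, cols) for f in frames))
print(result["C1"])  # [1, 9, 0, 0]
```

The same fold applies directly to DataFrames: `reduce(lambda a, b: a.union(b), dfs.values())` replaces the final loop, and on Spark 3.1+ `unionByName(other, allowMissingColumns=True)` can replace the manual column padding as well, though it fills the missing columns with null rather than 0.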