Question
For example, I have the following data:
data = [(1, 1, 1, 10), (1, 1, 2, 20), (2, 1, 3, 15), (2, 0, 1, 47), (3, 0, 2, 28), (3, 0, 3, 17)]
df=spark.createDataFrame(data).toDF("ID","Target","features","value1")
df.show()
+---+------+--------+------+
| ID|Target|features|value1|
+---+------+--------+------+
| 1| 1| 1| 10|
| 1| 1| 2| 20|
| 2| 1| 3| 15|
| 2| 0| 1| 47|
| 3| 0| 2| 28|
| 3| 0| 3| 17|
+---+------+--------+------+
I want to transform the data, grouped by ID, into:
1 1:10 2:20
1 3:15 1:47
0 2:28 3:17
So each row represents one ID: the first value is the Target, followed by features:value1 pairs.
Could you provide any sample code or suggestions?
Thanks a lot!
Solution
You can group the data by ID (and possibly also by Target?), collect each group into a list with collect_list, and then use a combination of transform and concat_ws to format each list into the desired shape:
from pyspark.sql import functions as F

df = spark.createDataFrame(data).toDF("ID", "Target", "features", "value1") \
    .groupBy("ID", "Target") \
    .agg(F.collect_list(F.struct("features", "value1")).alias("feature_value")) \
    .withColumn("feature_value", F.expr("transform(feature_value, x -> concat_ws(':', x.features, x.value1))")) \
    .withColumn("feature_value", F.concat_ws(" ", F.col("feature_value"))) \
    .withColumn("result", F.concat_ws(" ", F.col("Target"), F.col("feature_value"))) \
    .select("result")
Result:
+-----------+
| result|
+-----------+
|0 2:28 3:17|
|1 1:10 2:20|
|1 3:15 1:47|
+-----------+
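For reference (this is an illustration, not part of the original answer), the same grouping and formatting logic can be sketched in plain Python, without Spark, by grouping rows by ID and taking each group's first Target. On the sample data this reproduces the output shown in the question:

```python
# Plain-Python sketch of the same transformation (no Spark needed):
# group rows by ID, keep the first Target seen per ID,
# and join the "features:value1" pairs with spaces.
data = [(1, 1, 1, 10), (1, 1, 2, 20), (2, 1, 3, 15),
        (2, 0, 1, 47), (3, 0, 2, 28), (3, 0, 3, 17)]

groups = {}  # ID -> (first Target, list of "features:value1" strings)
for id_, target, feature, value in data:
    groups.setdefault(id_, (target, []))[1].append(f"{feature}:{value}")

results = [f"{target} {' '.join(pairs)}" for target, pairs in groups.values()]
print(results)  # ['1 1:10 2:20', '1 3:15 1:47', '0 2:28 3:17']
```

Note that this sketch groups by ID only and keeps the first Target per group, which matches the asker's desired output; the Spark answer above groups by both ID and Target, so the two can differ when one ID has multiple Target values.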