How do I add or replace a row at a specific index in a PySpark DataFrame?

Problem description

I want to add the list L1 as the row at the first index. How can I append a row at a specific index in a PySpark DataFrame?

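from pyspark.sql import functions as F, Window
from pyspark.sql.types import StructType, StructField, StringType, FloatType

L1=['na',5.6,2.4]

data=[('fr',8.8,6.6),('nr',4.4,2.5),('cc',2.3,3.9)]
data_schema=[StructField('loc',StringType(),True),StructField('col',FloatType(),True),StructField('io',FloatType(),True)]
final=StructType(fields=data_schema)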


df=spark.createDataFrame(data,schema=final)

df=df.withColumn("idx",F.row_number().over(Window.orderBy('col'))) 

df.show()
+---+----+---+---+
|loc| col| io|idx|
+---+----+---+---+
| fr| 8.8|6.6|  1|
| nr| 4.4|2.5|  2|
| cc| 2.3|3.9|  3|
+---+----+---+---+

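Solution

You can drop the existing row at that position by filtering on idx != 1, then use union to add the new row with idx set to 1. Note that union matches columns by position, not by name, and the result is not sorted by idx unless you order it explicitly: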

from pyspark.sql import functions as F,Window

L1 = ['na',5.6,2.4]
data = [('fr',8.8,6.6),('nr',4.4,2.5),('cc',2.3,3.9)]

df = spark.createDataFrame(data,['loc','col','io'])

df2 = df.withColumn(
    "idx",F.row_number().over(Window.orderBy('loc'))
).filter('idx != 1').union(spark.createDataFrame([L1 + [1]]))

df2.show()
+---+---+---+---+
|loc|col| io|idx|
+---+---+---+---+
| fr|8.8|6.6|  2|
| nr|4.4|2.5|  3|
| na|5.6|2.4|  1|
+---+---+---+---+
