Converting PySpark to Spark Scala

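Question

Dear experts,

I am building a dynamic fixed-width file reader whose schema comes from a JSON file. I am using Scala, since most of our existing code is already written in Scala.

While searching, I found exactly the code I need, but it is written in PySpark. Could you help convert it to the equivalent Spark Scala code, especially the dictionary part and the loop part?

Main reference: Read fixed width file using schema from json file in pyspark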

SchemaFile.json
===========================
{"Column":"id","From":"1","To":"3"}
{"Column":"date","From":"4","To":"8"}
{"Column":"name","From":"12","To":"3"}
{"Column":"salary","From":"15","To":"5"}

# Raw strings keep the backslashes in the Windows paths literal.
File = spark.read\
    .format("csv")\
    .option("header","false")\
    .load(r"C:\Temp\samplefile.txt")

# .json() already selects the JSON reader; "header" is a CSV-only option.
SchemaFile = spark.read\
    .json(r'C:\Temp\schemaFile\schema.json')
    
# A list comprehension yields a real list on both Python 2 and Python 3.
sfDict = [row.asDict() for row in SchemaFile.collect()]
print(sfDict)
# [{'Column': u'id', 'From': u'1', 'To': u'3'},
#  {'Column': u'date', 'From': u'4', 'To': u'8'},
#  {'Column': u'name', 'From': u'12', 'To': u'3'},
#  {'Column': u'salary', 'From': u'15', 'To': u'5'}]

from pyspark.sql.functions import substring
File.select(
    *[
        substring(
            str='_c0',pos=int(row['From']),len=int(row['To'])
        ).alias(row['Column']) 
        for row in sfDict
    ]
).show()

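Answer

Check the code below. It builds one `substring(value, From, To) as Column` expression string per schema row and passes them all to `selectExpr`. Note that `To` is used as a length, not as an end position, just as in the PySpark code.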

scala> df.show(false)
+--------------------+
|value               |
+--------------------+
|00120181120xyz12341 |
|00220180203abc56792 |
|00320181203pqr25483 |
+--------------------+
scala> schema.show(false)
+------+----+---+
|Column|From|To |
+------+----+---+
|id    |1   |3  |
|date  |4   |8  |
|name  |12  |3  |
|salary|15  |5  |
+------+----+---+
scala> :paste
// Entering paste mode (ctrl-D to finish)

val columns = schema
.withColumn("id",lit(1))
.groupBy($"id")
.agg(collect_list(concat(lit("substring(value,"),$"from",lit(","),$"to",lit(") as "),$"column")).as("data"))
.withColumn("data",explode($"data"))
.select($"data")
.map(_.getAs[String](0))
.collect

// Exiting paste mode, now interpreting.

columns: Array[String] = Array(substring(value,1,3) as id, substring(value,4,8) as date, substring(value,12,3) as name, substring(value,15,5) as salary)
scala> df.selectExpr(columns:_*).show(false)
+---+--------+----+------+
|id |date    |name|salary|
+---+--------+----+------+
|001|20181120|xyz |12341 |
|002|20180203|abc |56792 |
|003|20181203|pqr |25483 |
+---+--------+----+------+
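
Since the question asks specifically about the dictionary and loop parts, here is a rough sketch of a more literal translation of the PySpark list comprehension. It uses the Column API instead of expression strings. The object name `FixedWidthSketch` and the helper `fixedWidthColumns` are made up for illustration. It assumes the schema DataFrame has string columns `Column`, `From` and `To` (as Spark would infer from the JSON above) and that the data sits in a single column named `value`.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, substring}

object FixedWidthSketch {
  // Hypothetical helper: collect the (small) schema to the driver and
  // build one substring Column per schema row, like the Python comprehension.
  def fixedWidthColumns(schema: DataFrame, source: String = "value"): Array[Column] =
    schema.collect().map { row =>
      substring(
        col(source),
        row.getAs[String]("From").toInt, // 1-based start position
        row.getAs[String]("To").toInt    // used as a length, as in the original code
      ).as(row.getAs[String]("Column"))
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("fixed-width-sketch")
      .getOrCreate()
    import spark.implicits._

    // Inline stand-ins for the sample file and schema file from the question.
    val df = Seq("00120181120xyz12341", "00220180203abc56792", "00320181203pqr25483")
      .toDF("value")
    val schema = Seq(
      ("id", "1", "3"), ("date", "4", "8"), ("name", "12", "3"), ("salary", "15", "5")
    ).toDF("Column", "From", "To")

    df.select(fixedWidthColumns(schema): _*).show(false)
    spark.stop()
  }
}
```

With real files, `spark.read.text(...)` would give the single `value` column and `spark.read.json(...)` the schema DataFrame. Collecting the schema is fine here because it only holds one row per output column.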
