Problem description

Using Scala, I want to expand the range strings in an input DataFrame like the one below into the values they cover.
Input
scala> val r_df = Seq((1,"1 to 6"),(2,"44/1 to 3")).toDF("id","range")
r_df: org.apache.spark.sql.DataFrame = [id: int, range: string]
scala> r_df.show
+---+---------+
| id| range|
+---+---------+
| 1| 1 to 6|
| 2|44/1 to 3|
+---+---------+
For-loop UDF

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.functions.{col, udf}

// Collect every integer from data1 up to data2 (inclusive) into an array
val survey_to1 = udf((data1: Int, data2: Int) => {
  val arr = new ArrayBuffer[Int]()
  for (i <- data1 to data2) {
    arr += i
  }
  arr
})

r_df4.withColumn("new", survey_to1(col("new1"), col("new3"))).show(false)
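(The question does not show how r_df4 was built. A plausible construction, assumed here for completeness, splits range on whitespace into the new1/new2/new3 columns seen below:)

// Hypothetical reconstruction of r_df4: split "range" on spaces
import org.apache.spark.sql.functions.{col, split}

val parts = split(col("range"), " ")
val r_df4 = r_df
  .withColumn("new1", parts.getItem(0))
  .withColumn("new2", parts.getItem(1))
  .withColumn("new3", parts.getItem(2))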
Applying the for-loop UDF above to the DataFrame only works for "1 to 6"; for "44/1 to 3", new1 holds "44/1", which presumably cannot be cast to Int, so new comes out null:
+---+---------+----+----+----+-------------+
|id |range    |new1|new2|new3|new          |
+---+---------+----+----+----+-------------+
|1  |1 to 6   |1   |to  |6   |[1,2,3,4,5,6]|
|2  |44/1 to 3|44/1|to  |3   |null         |
+---+---------+----+----+----+-------------+
Expected output
+---+---------+----+----+----+----------------+
|id |range    |new1|new2|new3|new             |
+---+---------+----+----+----+----------------+
|1  |1 to 6   |1   |to  |6   |[1,2,3,4,5,6]   |
|2  |44/1 to 3|44/1|to  |3   |[44/1,44/2,44/3]|
+---+---------+----+----+----+----------------+
Solution

Use a regex tailored to those specific string patterns:
import org.apache.spark.sql.functions.udf

// Group 1 captures the "from" part ("1" or "44/1"), group 2 the "to" digit
val pattern = "([0-9.]{2}/[0-9.]{1}|[0-9.]{1}) to ([0-9.]{1})".r

def createArray = udf { str: String =>
  val pattern(from, _to) = str          // extract both capture groups
  val parts = from.split("/")           // e.g. "44/1" -> Array("44", "1")
  (parts.last.toInt to _to.toInt).toArray
    .map(el => if (parts.length > 1) s"${parts(0)}/$el" else el.toString)
}
val r_df = Seq((1,"1 to 6"),(2,"44/1 to 3")).toDF("id","range")
r_df.withColumn("array",createArray($"range")).show(false)
which gives:
+---+---------+----------------+
|id |range    |array           |
+---+---------+----------------+
|1  |1 to 6   |[1,2,3,4,5,6]   |
|2  |44/1 to 3|[44/1,44/2,44/3]|
+---+---------+----------------+
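Note that the extractor val pattern(from, _to) = str throws a scala.MatchError for any row that does not match the regex, which fails the whole job. A minimal defensive sketch (not part of the original answer) returns null for unparseable rows instead:

// Sketch: match explicitly so unparseable rows yield null instead of throwing
def createArraySafe = udf { str: String =>
  str match {
    case pattern(from, _to) =>
      val parts = from.split("/")
      (parts.last.toInt to _to.toInt).toArray
        .map(el => if (parts.length > 1) s"${parts(0)}/$el" else el.toString)
    case _ => null
  }
}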
To also support strings of the form "3a to 5a", just update the regex:
val pattern = "([0-9.]{2}/[0-9.]{1}|[0-9.]{1})[a-zA-Z0-9_]* to ([0-9.]{1})[a-zA-Z0-9_]*".r
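The ids in the output below suggest an input like the following (this driver code is an assumption, not shown in the original answer):

// Hypothetical input matching the ids in the output below
val r_df2 = Seq((1, "1 to 6"), (3, "3a to 5a")).toDF("id", "range")
r_df2.withColumn("array", createArray($"range")).show(false)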
For example (the letter suffixes fall outside the capture groups, so they are dropped):
+---+---------+-------------+
|id |range    |array        |
+---+---------+-------------+
|1  |1 to 6   |[1,2,3,4,5,6]|
|3  |3a to 5a |[3,4,5]      |
+---+---------+-------------+