Range pattern using a for loop

Problem description

Using Scala, I want to expand each range string in the input DataFrame below into the list of values it covers.

Input
    scala> val r_df = Seq((1,"1 to 6"),(2,"44/1 to 3")).toDF("id","range")
    r_df: org.apache.spark.sql.DataFrame = [id: int,range: string]


scala> r_df.show
+---+---------+
| id|    range|
+---+---------+
|  1|   1 to 6|
|  2|44/1 to 3|
+---+---------+

for-loop UDF

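import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.functions.{col, udf}

// Collects every integer from data1 to data2 (inclusive).
val survey_to1 = udf((data1: Int, data2: Int) => {
  val arr = new ArrayBuffer[Int]()
  for (i <- data1 to data2) {
    arr += i
  }
  arr
})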




r_df4.withColumn("new",survey_to1(col("new1"),col("new3"))).show(false)

Applying the for-loop UDF above to the DataFrame only works for the "1 to 6" row; the "44/1 to 3" row comes back as null:

+---+---------+----+----+----+------------------+
|id |range    |new1|new2|new3|new               |
+---+---------+----+----+----+------------------+
|1  |1 to 6   |1   |to  |6   |[1, 2, 3, 4, 5, 6]|
|2  |44/1 to 3|44/1|to  |3   |null              |
+---+---------+----+----+----+------------------+
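
A likely explanation for the null, assuming Spark's default (non-ANSI) cast behaviour: survey_to1 expects Int arguments, but new1 holds the string "44/1", which is not a valid integer, so the UDF never receives a usable start value. The plain-Scala sketch below needs no Spark and shows the same parsing failure in isolation. The names `ParseCheck` and `parseEndpoint` are hypothetical and used only for illustration.

```scala
import scala.util.Try

object ParseCheck {
  // Hypothetical helper: try to read a range endpoint as a plain integer.
  // Returns None instead of throwing when the text is not a valid Int.
  def parseEndpoint(s: String): Option[Int] = Try(s.trim.toInt).toOption
}
```

Here `ParseCheck.parseEndpoint("1")` returns `Some(1)`, while `ParseCheck.parseEndpoint("44/1")` returns `None`, because the "44/" prefix is not part of an integer. The prefix therefore has to be split off before the numeric range is built.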

Expected output

+---+---------+----+----+----+------------------+
|id |range    |new1|new2|new3|new               |
+---+---------+----+----+----+------------------+
|1  |1 to 6   |1   |to  |6   |[1, 2, 3, 4, 5, 6]|
|2  |44/1 to 3|44/1|to  |3   |[44/1,44/2,44/3]  |
+---+---------+----+----+----+------------------+

Solution

Use a regular expression that matches those specific string patterns:

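import org.apache.spark.sql.functions.udf

// Group 1 is the range start: either "NN/N" (prefix/number) or a single character.
// Group 2 is the single-character range end. Each [0-9.] accepts a digit or a dot.
val patern = "([0-9.]{2}/[0-9.]{1}|[0-9.]{1}) to ([0-9.]{1})".r

def createArray = udf { str: String =>
  // A string that does not match the pattern raises a scala.MatchError here.
  val patern(from, _to) = str
  val parts = from.split("/")
  // Expand from the number after the optional "NN/" prefix up to the end value,
  // then put the prefix back on each element.
  (parts.last.toInt to _to.toInt).toArray
    .map(el => if (parts.length > 1) parts(0) + "/" + el else el.toString)
}

// In spark-shell, spark.implicits._ (needed for toDF and $"...") is already imported.
val r_df = Seq((1,"1 to 6"),(2,"44/1 to 3")).toDF("id","range")
r_df.withColumn("array",createArray($"range")).show(false)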

This gives:

+---+---------+------------------+
|id |range    |array             |
+---+---------+------------------+
|1  |1 to 6   |[1, 2, 3, 4, 5, 6]|
|2  |44/1 to 3|[44/1, 44/2, 44/3]|
+---+---------+------------------+
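
To try the expansion logic outside Spark, the same idea can be sketched as a plain Scala function. This is a minimal sketch rather than the answer's exact code: `RangeExpand` and `expandRange` are hypothetical names, and returning `None` for input that does not match (instead of letting a `MatchError` propagate) is an assumption about the desired behaviour.

```scala
import scala.util.Try

object RangeExpand {
  // Same pattern as in the answer: "NN/N" or a single character,
  // then " to ", then a single-character end.
  private val RangePattern = "([0-9.]{2}/[0-9.]{1}|[0-9.]{1}) to ([0-9.]{1})".r

  // Hypothetical helper: expands "1 to 6" or "44/1 to 3" into its elements.
  // Returns None when the string does not match the pattern or when a
  // captured part (for example ".") is not a valid integer.
  def expandRange(str: String): Option[Seq[String]] = str match {
    case RangePattern(from, to) =>
      val parts = from.split("/")
      Try {
        (parts.last.toInt to to.toInt).map { el =>
          // Re-attach the "NN/" prefix when the start had one.
          if (parts.length > 1) parts(0) + "/" + el else el.toString
        }
      }.toOption
    case _ => None
  }
}
```

With this sketch, `RangeExpand.expandRange("44/1 to 3")` yields the elements 44/1, 44/2 and 44/3, and a string such as "abc" yields `None`. Inside Spark it could presumably be wrapped as `udf((s: String) => RangeExpand.expandRange(s).orNull)`, so that rows that do not match produce null instead of failing the job.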

To also support strings of the form "3a to 5a", just update the regular expression as follows:

val patern = "([0-9.]{2}/[0-9.]{1}|[0-9.]{1})[a-zA-Z0-9_]* to ([0-9.]{1})[a-zA-Z0-9_]*".r

For example:

+---+---------+------------------+
|id |range    |array             |
+---+---------+------------------+
|1  |1 to 6   |[1, 2, 3, 4, 5, 6]|
|2  |44/1 to 3|[44/1, 44/2, 44/3]|
|3  |3a to 5a |[3, 4, 5]         |
+---+---------+------------------+
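
Note that the patterns above limit each number to a fixed number of characters (`{1}` or `{2}`), so a range such as "10 to 12" is not parsed as intended. One possible generalisation, assuming multi-digit numbers and an optional alphabetic suffix should be accepted, uses `\d+` instead. The names `GeneralRange` and `expand` are hypothetical, and dropping the suffix from the output simply mirrors the "3a to 5a" example above.

```scala
import scala.util.Try

object GeneralRange {
  // Optional "prefix/" (digits), a multi-digit start with an optional letter
  // suffix, " to ", then a multi-digit end with an optional letter suffix.
  private val Pattern = """(?:(\d+)/)?(\d+)[a-zA-Z_]* to (\d+)[a-zA-Z_]*""".r

  // Hypothetical helper: returns None when the string does not match
  // or when a number does not fit in an Int.
  def expand(str: String): Option[Seq[String]] = str match {
    case Pattern(prefix, from, to) =>
      Try {
        (from.toInt to to.toInt).map { n =>
          // An unmatched optional group is null, so wrap it in Option.
          Option(prefix).fold(n.toString)(p => s"$p/$n")
        }
      }.toOption
    case _ => None
  }
}
```

Under these assumptions, "10 to 12" expands to 10, 11, 12; "3a to 5a" expands to 3, 4, 5; and "44/1 to 3" still expands to 44/1, 44/2, 44/3.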