Spark broadcast variable Map returns null values

Problem description

I am using Java 8 with Spark v2.4.1.

I am trying to use a broadcast variable Map as a lookup table, as shown below:

Input data:

+-----+-----+-----+
|code1|code2|code3|
+-----+-----+-----+
|1    |7    |  5  |
|2    |7    |  4  |
|3    |7    |  3  |
|4    |7    |  2  |
|5    |7    |  1  |
+-----+-----+-----+

Expected output:

+-----+-----+-----+
|code1|code2|code3|
+-----+-----+-----+
|1    |7    |51   |
|2    |7    |41   |
|3    |7    |31   |
|4    |7    |21   |
|5    |7    |11   |
+-----+-----+-----+

The code I am currently using, including the different variants I have tried:

Map<Integer, Integer> lookup_map = new HashMap<>();
lookup_map.put(1, 11);
lookup_map.put(2, 21);
lookup_map.put(3, 31);
lookup_map.put(4, 41);
lookup_map.put(5, 51);

JavaSparkContext javaSparkContext = JavaSparkContext.fromSparkContext(sparkSession.sparkContext());
Broadcast<Map<Integer, Integer>> lookup_mapBcVar = javaSparkContext.broadcast(lookup_map);

Dataset<Row> resultDs = dataDs
  .withColumn("floor_code3", floor(col("code3")))
  .withColumn("floor_code3_int", floor(col("code3")).cast(DataTypes.IntegerType))
  .withColumn("map_code3", lit(((Map<Integer, Integer>) lookup_mapBcVar.getValue()).get(col("floor_code3_int"))))
  .withColumn("five", lit(((Map<Integer, Integer>) lookup_mapBcVar.getValue()).get(5)))
  .withColumn("five_lit", lit(((Map<Integer, Integer>) lookup_mapBcVar.getValue()).get(lit(5).cast(DataTypes.IntegerType))));

Output of the current code, printed with the following:

resultDs.printSchema();                       
resultDs.show();
            
root
 |-- code1: integer (nullable = true)
 |-- code2: integer (nullable = true)
 |-- code3: double (nullable = true)
 |-- floor_code3: long (nullable = true)
 |-- floor_code3_int: integer (nullable = true)
 |-- map_code3: null (nullable = true)
 |-- five: integer (nullable = false)
 |-- five_lit: null (nullable = true)

+-----+-----+-----+-----------+---------------+---------+----+--------+
|code1|code2|code3|floor_code3|floor_code3_int|map_code3|five|five_lit|
+-----+-----+-----+-----------+---------------+---------+----+--------+
|    1|    7|  5.0|          5|              5|     null|  51|    null|
|    2|    7|  4.0|          4|              4|     null|  51|    null|
|    3|    7|  3.0|          3|              3|     null|  51|    null|
|    4|    7|  2.0|          2|              2|     null|  51|    null|
|    5|    7|  1.0|          1|              1|     null|  51|    null|
+-----+-----+-----+-----------+---------------+---------+----+--------+

To recreate the input data:

List<String[]> stringAsList = new ArrayList<>();
stringAsList.add(new String[] { "1", "7", "5" });
stringAsList.add(new String[] { "2", "7", "4" });
stringAsList.add(new String[] { "3", "7", "3" });
stringAsList.add(new String[] { "4", "7", "2" });
stringAsList.add(new String[] { "5", "7", "1" });

JavaSparkContext sparkContext = new JavaSparkContext(sparkSession.sparkContext());
JavaRDD<Row> rowRDD = sparkContext.parallelize(stringAsList).map((String[] row) -> RowFactory.create(row));

StructType schema = DataTypes
  .createStructType(new StructField[] {
    DataTypes.createStructField("code1", DataTypes.StringType, false),
    DataTypes.createStructField("code2", DataTypes.StringType, false),
    DataTypes.createStructField("code3", DataTypes.StringType, false)
  });

Dataset<Row> dataDf = sparkSession.sqlContext().createDataFrame(rowRDD, schema).toDF();

Dataset<Row> dataDs = dataDf
  .withColumn("code1", col("code1").cast(DataTypes.IntegerType))
  .withColumn("code2", col("code2").cast(DataTypes.IntegerType))
  .withColumn("code3", col("code3").cast(DataTypes.IntegerType));

What am I doing wrong?

A Scala notebook that reproduces the same behavior is here:

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1165111237342523/3062033079132966/7035720262824085/latest.html

Solution

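lit() and col() return a Column, but map.get needs the actual int key. The get call in the question runs once on the driver with a Column object as the key, finds no matching entry, and returns null, which lit then turns into a null column. Look the value up per row instead, for example: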

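    import java.util
    import org.apache.spark.broadcast.Broadcast
    import org.apache.spark.sql.DataFrame
    import spark.implicits._ // assumes a SparkSession named `spark` is in scope

    // Sample data: one integer column named "sentiment"
    val df: DataFrame = spark.sparkContext.parallelize(Range(0, 10000), 4).toDF("sentiment")
    val map = new util.HashMap[Int, Int]()
    map.put(1, 1)
    map.put(2, 2)
    map.put(3, 3)
    // Ship the lookup map to the executors once
    val bf: Broadcast[util.HashMap[Int, Int]] = spark.sparkContext.broadcast(map)
    // Look up each row's actual Int value on the executors
    df.rdd.map(x => {
      val num = x.getInt(0)
      (num, bf.value.get(num)) // keys missing from the map come back as 0 (null unboxed to Int)
    }).toDF("key", "add_key").show(false)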
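
The null result in the question can be reproduced without Spark. The following is a minimal plain-Java sketch, not Spark code: FakeColumn is a hypothetical stand-in for org.apache.spark.sql.Column, and the class and method names are made up for illustration. It shows that Map.get accepts any Object, so passing a column expression compiles but matches no Integer key, while looking up each row's actual value returns the expected entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BroadcastLookupDemo {

    // Hypothetical stand-in for Spark's Column: it names a column
    // expression but carries no per-row value.
    static final class FakeColumn {
        final String name;
        FakeColumn(String name) { this.name = name; }
    }

    // Same lookup table as in the question.
    static Map<Integer, Integer> buildLookup() {
        Map<Integer, Integer> lookup = new HashMap<>();
        lookup.put(1, 11);
        lookup.put(2, 21);
        lookup.put(3, 31);
        lookup.put(4, 41);
        lookup.put(5, 51);
        return lookup;
    }

    // Mirrors the failing call: Map.get(Object) compiles with any argument,
    // but no Integer key equals a FakeColumn, so the result is null.
    static Integer lookupWithColumnObject(Map<Integer, Integer> lookup) {
        return lookup.get(new FakeColumn("floor_code3_int"));
    }

    // Mirrors the per-row approach: the key is each row's actual integer value.
    static List<Integer> lookupPerRow(Map<Integer, Integer> lookup, List<Integer> code3Values) {
        List<Integer> result = new ArrayList<>();
        for (Integer value : code3Values) {
            result.add(lookup.get(value));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> lookup = buildLookup();
        System.out.println(lookupWithColumnObject(lookup));                       // prints null
        System.out.println(lookupPerRow(lookup, Arrays.asList(5, 4, 3, 2, 1)));  // prints [51, 41, 31, 21, 11]
    }
}
```

In Spark the per-row lookup has to run on the executors, for example inside an RDD map as in the Scala snippet above, so that the key is a row value rather than a column expression.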