I am trying to define a UDF in Spark (2.0) from a string containing a Scala function definition. Here is the snippet:
    val universe: scala.reflect.runtime.universe.type = scala.reflect.runtime.universe
    import universe._
    import scala.reflect.runtime.currentMirror
    import scala.tools.reflect.ToolBox

    val toolBox = currentMirror.mkToolBox()
    val f = udf(toolBox.eval(toolBox.parse("(s:String) => 5")).asInstanceOf[String => Int])
    sc.parallelize(Seq("1","5")).toDF.select(f(col("value"))).show
    Caused by: java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
        at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133)
        at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1305)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2024)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
        at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
        at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:85)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
But when I define the UDF directly as:

    val f = udf((s: String) => 5)

it works fine.
Solution
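As Giovanny observed, the problem is that the class loaders are different (you can investigate this further by calling .getClass.getClassLoader on any object). Then, when the workers try to deserialize your reflected function, all hell breaks loose.

Here is a solution that does not involve any class loader hackery. The idea is to move the reflection step to the workers. We end up having to redo the reflection step, but only once per worker. I think this is pretty much optimal: even if you did the reflection only once on the master node, you would still have to do some work on each worker to get it to recognize the function.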
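To see the class loader difference outside Spark, a minimal sketch along these lines may help. It assumes scala-reflect and scala-compiler are on the classpath (the toolbox needs both); the object name `ClassLoaderCheck` and its members are made up for illustration. The reflected function is compiled at runtime, so its class would be expected to come from a different loader than the one that loaded the ahead-of-time compiled function.

```scala
import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox

// Hypothetical helper, not part of Spark or of the original post.
object ClassLoaderCheck {
  // Function compiled at runtime through the reflection toolbox.
  lazy val reflected: String => Int = {
    val toolBox = currentMirror.mkToolBox()
    toolBox.eval(toolBox.parse("(s: String) => 5")).asInstanceOf[String => Int]
  }

  // The same function, compiled ahead of time with the rest of the program.
  val compiled: String => Int = (s: String) => 5

  def main(args: Array[String]): Unit = {
    // Print the loader responsible for each function's class.
    println(s"reflected: ${reflected.getClass.getClassLoader}")
    println(s"compiled:  ${compiled.getClass.getClassLoader}")
  }
}
```

If the two printed loaders differ, that matches the diagnosis above: a worker JVM has no loader that knows the class generated by the toolbox on the driver.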
    val f = udf {
      new Function1[String,Int] with Serializable {
        import scala.reflect.runtime.universe._
        import scala.reflect.runtime.currentMirror
        import scala.tools.reflect.ToolBox

        lazy val toolBox = currentMirror.mkToolBox()
        lazy val func = {
          println("reflected function") // triggered at every worker
          toolBox.eval(toolBox.parse("(s:String) => 5")).asInstanceOf[String => Int]
        }

        def apply(s: String): Int = func(s)
      }
    }
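Then, calling sc.parallelize(Seq("1","5")).toDF.select(f(col("value"))).show works just fine.

Feel free to comment out the println; it is just a simple way of counting how many times the reflection happens. In spark-shell --master 'local' it happens only once, but in spark-shell --master 'local[2]' it happens twice.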
How it works
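The UDF gets evaluated immediately, but it is never used until it reaches the worker nodes, so the lazy values toolBox and func are only evaluated on the workers. Furthermore, since they are lazy, they are evaluated only once per worker.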
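The lazy-value behaviour can be reproduced without Spark. The sketch below is an illustration under assumptions, not Spark's actual code path: it uses plain Java serialization to stand in for shipping the function to workers, and the names `LazyInitDemo`, `LazyFn`, `initCount` and `roundTrip` are invented for the example. Each deserialized copy plays the role of one worker's copy of the function.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.util.concurrent.atomic.AtomicInteger

object LazyInitDemo {
  // Counts how many times the lazy initializer runs (stands in for the println).
  val initCount = new AtomicInteger(0)

  // Same shape as the UDF body: a serializable function whose real
  // implementation is built lazily on first use.
  class LazyFn extends (String => Int) with Serializable {
    lazy val func: String => Int = {
      initCount.incrementAndGet()
      (s: String) => 5
    }
    def apply(s: String): Int = func(s)
  }

  // Serialize and deserialize a value, roughly as if sending it to another JVM.
  def roundTrip[T](value: T): T = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val original = new LazyFn          // "driver" copy, never invoked
    val copyA = roundTrip(original)    // first "worker" copy
    val copyB = roundTrip(original)    // second "worker" copy
    println(s"initializations before any call: ${initCount.get}")
    copyA("1"); copyA("5")             // second call reuses the cached func
    copyB("1")
    println(s"initializations after calls: ${initCount.get}")
  }
}
```

Because `func` is still unevaluated when the object is serialized, nothing built on the sending side travels with it; each copy builds its own function on first call and then reuses it.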