I have a DataFrame with the following schema:
    root
     |-- label: string (nullable = true)
     |-- features: struct (nullable = true)
     |    |-- feat1: string (nullable = true)
     |    |-- feat2: string (nullable = true)
     |    |-- feat3: string (nullable = true)
I can filter the DataFrame on the nested field using
    val data = rawData
      .filter(!(rawData("features.feat1") <=> "100"))
but I am unable to drop the nested column using
    val data = rawData
      .drop("features.feat1")
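Is there something I am doing wrong here? I also tried (unsuccessfully) drop(rawData("features.feat1")), though it does not make much sense to do so.
Thanks in advance,
Nikhil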
Solution
This is just a programming exercise, but you can try something like this:
    import org.apache.spark.sql.{DataFrame, Column}
    import org.apache.spark.sql.types.{StructType, StructField}
    import org.apache.spark.sql.{functions => f}
    import scala.util.Try

    case class DFWithDropFrom(df: DataFrame) {
      // Look up the top-level field named `source`; fails if it does not exist.
      def getSourceField(source: String): Try[StructField] = {
        Try(df.schema.fields.filter(_.name == source).head)
      }

      // Fails if the field is not a struct.
      def getType(sourceField: StructField): Try[StructType] = {
        Try(sourceField.dataType.asInstanceOf[StructType])
      }

      // Rebuild the struct from the fields that should be kept.
      def genOutputCol(names: Array[String], source: String): Column = {
        f.struct(names.map(x => f.col(source).getItem(x).alias(x)): _*)
      }

      // Replace `source` with a struct lacking the `toDrop` fields;
      // return the DataFrame unchanged if any step above fails.
      def dropFrom(source: String, toDrop: Array[String]): DataFrame = {
        getSourceField(source)
          .flatMap(getType)
          .map(_.fieldNames.diff(toDrop))
          .map(genOutputCol(_, source))
          .map(df.withColumn(source, _))
          .getOrElse(df)
      }
    }
Example usage:
    scala> case class features(feat1: String, feat2: String, feat3: String)
    defined class features

    scala> case class record(label: String, features: features)
    defined class record

    scala> val df = sc.parallelize(Seq(record("a_label", features("f1", "f2", "f3")))).toDF
    df: org.apache.spark.sql.DataFrame = [label: string, features: struct<feat1:string,feat2:string,feat3:string>]

    scala> DFWithDropFrom(df).dropFrom("features", Array("feat1")).show
    +-------+--------+
    |  label|features|
    +-------+--------+
    |a_label| [f2,f3]|
    +-------+--------+

    scala> DFWithDropFrom(df).dropFrom("foobar", Array("feat1")).show
    +-------+----------+
    |  label|  features|
    +-------+----------+
    |a_label|[f1,f2,f3]|
    +-------+----------+

    scala> DFWithDropFrom(df).dropFrom("features", Array("foobar")).show
    +-------+----------+
    |  label|  features|
    +-------+----------+
    |a_label|[f1,f2,f3]|
    +-------+----------+
Add an implicit conversion and you're good to go.
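One way the implicit conversion might look is sketched below. It is only an illustration and makes a few assumptions: it uses an implicit class rather than an `implicit def`, the names `DropFromSyntax` and `DropFromOps` are hypothetical, the drop logic is inlined so the block stands on its own, and it uses `getField` to read the struct members. Unlike the case class above, it drops the whole column when every field of the struct is removed.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, struct}
import org.apache.spark.sql.types.StructType

// Hypothetical holder object; import DropFromSyntax._ to enable df.dropFrom(...)
object DropFromSyntax {

  // Implicit class: lets dropFrom be called directly on any DataFrame in scope.
  implicit class DropFromOps(val df: DataFrame) extends AnyVal {

    def dropFrom(source: String, toDrop: Array[String]): DataFrame =
      df.schema.fields.find(_.name == source).map(_.dataType) match {
        // `source` exists and is a struct: rebuild it from the remaining fields.
        case Some(st: StructType) =>
          val kept: Array[Column] =
            st.fieldNames.diff(toDrop).map(n => col(source).getField(n).alias(n))
          // Assumption of this sketch: if nothing is left, drop the column itself.
          if (kept.isEmpty) df.drop(source)
          else df.withColumn(source, struct(kept: _*))
        // `source` is missing or not a struct: return the DataFrame unchanged.
        case _ => df
      }
  }
}
```

With `import DropFromSyntax._` in scope, the first call in the example above could then be written as `df.dropFrom("features", Array("feat1"))`.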