Reading an XML file into a DataFrame

Question

I have an XML file in the following format.

<nt:vars>
<nt:var id="1.3.0" type="TimeStamp"> 89:19:00.01</nt:var>
<nt:var id="1.3.1" type="OBJECT ">1.9.5.67.2</nt:var>
<nt:var id="1.3.9" type="STRING">AB-CD-EF</nt:var>
</nt:vars>

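I built a DataFrame on top of it with the code below. It shows 3 rows and retrieves the id and type fields, but it does not show the actual values, i.e. 89:19:00.01, 1.9.5.67.2 and AB-CD-EF.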

spark.read.format("xml").option("roottag","nt:vars").option("rowTag","nt:var").load("/FileStore/tables/POC_DB.xml").show()

Is there another option I need to add to the line above to bring in the values? Any help would be appreciated.

Answer

You can specify nt:vars as the rowTag instead:

df = spark.read.format("xml").option("rowTag","nt:vars").load("file.xml")

df.printSchema()
root
 |-- nt:var: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _id: string (nullable = true)
 |    |    |-- _type: string (nullable = true)

df.show(truncate=False)
+-------------------------------------------------------------------------------------------+
|nt:var                                                                                     |
+-------------------------------------------------------------------------------------------+
|[[ 89:19:00.01,1.3.0,TimeStamp],[1.9.5.67.2,1.3.1,OBJECT ],[AB-CD-EF,1.3.9,STRING]]|
+-------------------------------------------------------------------------------------------+

To get the values as separate rows, you can explode the array of structs:

from pyspark.sql import functions as F

df.select(F.explode('nt:var')).show(truncate=False)
+------------------------------+
|col                           |
+------------------------------+
|[ 89:19:00.01,1.3.0,TimeStamp]|
|[1.9.5.67.2,1.3.1,OBJECT ]    |
|[AB-CD-EF,1.3.9,STRING]       |
+------------------------------+

Or, if you only want the values:

df.select(F.explode('nt:var._VALUE')).show()
+------------+
|         col|
+------------+
| 89:19:00.01|
|  1.9.5.67.2|
|    AB-CD-EF|
+------------+

apache-spark apache-spark-sql apache-spark-xml pyspark