Problem Description
I am using the code below to read from a REST API in PySpark, write the response out as a JSON document, and save the file to Azure Data Lake Gen2. The code works fine when the response contains no blank data, but when I try to pull back all of the data I run into the following error.
Error message: ValueError: Some of types cannot be determined after inferring
Code:
import requests
from pyspark.sql import Row

# Pull the records from the REST API.
response = requests.get('https://apiurl.com/demo/api/v3/data', auth=('user', 'password'))
data = response.json()

# Build a dataframe by turning each JSON object into a Row.
df = spark.createDataFrame([Row(**i) for i in data])
df.show()
df.write.mode("overwrite").json("wasbs://<file_system>@<storage-account-name>.blob.core.windows.net/demo/data")
Response:
[
{
"ProductID": "156528","ProductType": "Home Improvement","Description": "","SaleDate": "0001-01-01T00:00:00","UpdateDate": "2015-02-01T16:43:18.247"
},{
"ProductID": "126789","ProductType": "Pharmacy","UpdateDate": "2015-02-01T16:43:18.247"
}
]
I tried to fix it with a schema as shown below.
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([StructField("ProductID", StringType(), True),
                     StructField("ProductType", StringType(), True),
                     StructField("Description", StringType(), True),
                     StructField("SaleDate", StringType(), True),
                     StructField("UpdateDate", StringType(), True)])
df = spark.createDataFrame([[None, None, None, None, None]], schema=schema)
df.show()
I am not sure how to create the dataframe and write the data to a JSON document.
Solution
You can pass the data and schema variables to spark.createDataFrame() and Spark will build the dataframe for you. Because the schema declares a type for every column, Spark skips type inference entirely, which is the step that fails when a column's sampled values are all missing or null.
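For context, this ValueError is raised whenever every sampled value for a field is None, since Spark then has no concrete type to infer. A minimal sketch with a hypothetical record (not your API data):

from pyspark.sql import Row
# "SaleDate" is None in every row, so Spark cannot infer its type and raises:
# ValueError: Some of types cannot be determined after inferring
spark.createDataFrame([Row(ProductID="156528", SaleDate=None)])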
Example:
from pyspark.sql.types import StructType, StructField, StringType
data=[
{
"ProductID": "156528","ProductType": "Home Improvement","Description": "","SaleDate": "0001-01-01T00:00:00","UpdateDate": "2015-02-01T16:43:18.247"
},{
"ProductID": "126789","ProductType": "Pharmacy","UpdateDate": "2015-02-01T16:43:18.247"
}
]
schema = StructType([StructField("ProductID", StringType(), True),
                     StructField("ProductType", StringType(), True),
                     StructField("Description", StringType(), True),
                     StructField("SaleDate", StringType(), True),
                     StructField("UpdateDate", StringType(), True)])
df = spark.createDataFrame(data,schema=schema)
df.show()
#+---------+----------------+-----------+-------------------+--------------------+
#|ProductID|     ProductType|Description|           SaleDate|          UpdateDate|
#+---------+----------------+-----------+-------------------+--------------------+
#|   156528|Home Improvement|           |0001-01-01T00:00:00|2015-02-01T16:43:...|
#|   126789|        Pharmacy|       null|               null|2015-02-01T16:43:...|
#+---------+----------------+-----------+-------------------+--------------------+
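Here data is hard-coded for illustration; in your code it comes from response.json() exactly as in the question. The resulting dataframe can then be written out with the same call you already had:

# Write the dataframe to the lake as JSON (placeholders as in the question).
df.write.mode("overwrite").json("wasbs://<file_system>@<storage-account-name>.blob.core.windows.net/demo/data")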