Nested JSON from a REST API to a PySpark DataFrame

Problem description

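I am trying to build a data pipeline that requests data from a REST API. The output is a nicely nested JSON file. I want to read the JSON file into a PySpark DataFrame. This works fine when I save the file locally and use the following code: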

from pyspark.sql import *
from pyspark.sql.functions import *

spark = SparkSession\
    .builder\
    .appName("jsontest")\
    .getOrCreate()

raw_df = spark.read.json(r"my_json_path", multiLine=True)

But when I try to create a PySpark DataFrame directly after making the API request, I get the following error:

[Image: error when trying to create a PySpark DataFrame]

I use the following code to make the REST API call and convert the response to a PySpark DataFrame:

apiCallHeaders = {'Authorization': 'Bearer ' + bearer_token}
apiCallResponse = requests.get(data_url,headers=apiCallHeaders,verify=True)
json_rdd = spark.sparkContext.parallelize(apiCallResponse.text)
raw_df = spark.read.json(json_rdd)

Below is part of the response output:

{"networks":[{"href":"/v2/networks/velobike-moscow","id":"velobike-moscow","name":"Velobike"},{"href":"/v2/networks/bycyklen","id":"bycyklen","name":"Bycyklen"},{"href":"/v2/networks/nu-connect","id":"nu-connect","name":"Nu-Connect"},{"href":"/v2/networks/baerum-bysykkel","id":"baerum-bysykkel","name":"Bysykkel"},{"href":"/v2/networks/bysykkelen","id":"bysykkelen","name":"Bysykkelen"},{"href":"/v2/networks/onroll-a-rua","id":"onroll-a-rua","name":"Onroll"},{"href":"/v2/networks/onroll-albacete","id":"onroll-albacete",{"href":"/v2/networks/onroll-alhama-de-murcia","id":"onroll-alhama-de-murcia",{"href":"/v2/networks/onroll-almunecar","id":"onroll-almunecar",{"href":"/v2/networks/onroll-antequera","id":"onroll-antequera",{"href":"/v2/networks/onroll-aranda-de-duero","id":"onroll-aranda-de-duero","name":"Onroll"}

I hope my question makes sense and that someone can help.

Thanks in advance!

Solution

Following this answer, you can add the following lines:

import os
import sys

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

To run your code, you also have to add [ ] around the response text passed to parallelize. A string is an iterable of characters, so without the brackets the RDD holds one element per character instead of one element holding the whole JSON document:

json_rdd = spark.sparkContext.parallelize([apiCallResponse.text])
raw_df = spark.read.json(json_rdd)
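To make the effect of the brackets concrete, here is a minimal sketch. The names `sample_text`, `as_rdd_elements` and `dataframe_from_response` are hypothetical and only serve the illustration; `sample_text` is a shortened stand-in for `apiCallResponse.text`, and `spark` is assumed to be an existing SparkSession like the one created in the question.

```python
import json

# Hypothetical, shortened stand-in for apiCallResponse.text
sample_text = (
    '{"networks":[{"href":"/v2/networks/bycyklen",'
    '"id":"bycyklen","name":"Bycyklen"}]}'
)


def as_rdd_elements(response_text):
    """Wrap the whole JSON document in a one-element list."""
    return [response_text]


def dataframe_from_response(spark, response_text):
    """Build a DataFrame from a JSON response string.

    `spark` is assumed to be an existing SparkSession. Each RDD element
    must be one complete JSON document, hence the list wrapper.
    """
    json_rdd = spark.sparkContext.parallelize(as_rdd_elements(response_text))
    return spark.read.json(json_rdd)


# Iterating over a str yields single characters; this is what ends up
# in the RDD when the bare string is passed to parallelize.
chars = list(sample_text)

# Wrapped in a list there is exactly one element, and it parses as JSON.
elements = as_rdd_elements(sample_text)
parsed = json.loads(elements[0])
```

With a running session, this would be called as `raw_df = dataframe_from_response(spark, apiCallResponse.text)`. The resulting `networks` column is an array of structs, so a further step such as `explode` from `pyspark.sql.functions` would probably be needed to get one row per network.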