Problem Description
Other articles describe how to set the configuration (Spark and Hadoop) for a Spark job so that it can write to a GCS bucket.
If I run the following code from IntelliJ:
package com.test.migration

import org.apache.spark.sql.{SaveMode, SparkSession}

object DFToGcslite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .master("local[*]")
      .appName("DFToGcslite")
      .config("spark.hadoop.google.cloud.auth.service.account.enable", true)
      .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "src/main/resources/test-storage-318320-d3aa6f895415.json")
      .getOrCreate()

    import spark.implicits._
    val sc = spark.sparkContext
    sc.hadoopConfiguration.set("fs.defaultFS", "gs://test-csv-write/")

    (0 to 100)
      .toDF
      .write
      .mode(SaveMode.Append)
      .parquet("outputs01")
  }
}
it writes to my GCS bucket perfectly.
But when I compile the jar and run it on the cluster:
/usr/local/bin/spark-submit --class com.test.migration.CSVToGCS --master local /Users/adam.mac/Desktop/csv_to_gcs/target/scala-2.11/CSVToGCS-assembly-0.0.1.jar
changing `.master("local[*]")`
to `.master("yarn")`,
it fails with:
Exception in thread "main" org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "gs"
build.sbt:
name := "CSVToGCS"
version := "0.0.1"
scalaVersion := "2.11.8"
val sparkVersion = "2.4.0"
libraryDependencies ++= Seq(
  "com.typesafe" % "config" % "1.3.1",
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-yarn" % "2.4.0" % "provided",
  "org.apache.hadoop" % "hadoop-common" % "2.7.3",
  "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop3-2.0.0"
)
I have also tried setting these configurations:
sc.hadoopConfiguration.set("fs.gs.impl","com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
sc.hadoopConfiguration.set("fs.AbstractFileSystem.gs.impl","com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
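For reference, the same Hadoop properties can also be supplied on the spark-submit command line via `--conf spark.hadoop.*` instead of being hard-coded, which keeps them consistent between local and cluster runs. A sketch, reusing the class name, jar path, and keyfile from the question (the keyfile path must be readable on the machine where the driver runs):

```shell
/usr/local/bin/spark-submit \
  --class com.test.migration.CSVToGCS \
  --master yarn \
  --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
  --conf spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS \
  --conf spark.hadoop.google.cloud.auth.service.account.enable=true \
  --conf spark.hadoop.google.cloud.auth.service.account.json.keyfile=src/main/resources/test-storage-318320-d3aa6f895415.json \
  /Users/adam.mac/Desktop/csv_to_gcs/target/scala-2.11/CSVToGCS-assembly-0.0.1.jar
```

Note that these flags only register the filesystem classes; the GCS connector jar itself must still be on the driver and executor classpaths for them to resolve.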
but I get the same result. I suspect my configuration is wrong somewhere, but then how does the code work when I simply run the class from IntelliJ?
Solution
Here is how I resolved the error, with help from the discussion in this GitHub issue: https://github.com/GoogleCloudDataproc/hadoop-connectors/issues/323#issuecomment-597353458. Since we are running Hadoop 2.6, we need to use gcs-connector-hadoop2-2.0.1.jar,
available here.
Once I placed the jar in $SPARK_HOME/jars/, the code ran perfectly!
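The fix above can be scripted roughly as follows. This is a sketch assuming the jar is fetched from Maven Central's standard layout for com.google.cloud.bigdataoss artifacts — verify the exact URL and the shaded-jar naming against the repository before relying on it:

```shell
# Download the Hadoop 2 build of the GCS connector (shaded jar, so its
# dependencies do not clash with the cluster's); URL is an assumption
# based on Maven Central's layout.
curl -fLO https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/hadoop2-2.0.1/gcs-connector-hadoop2-2.0.1-shaded.jar

# Put it on Spark's classpath so the "gs" scheme resolves on the cluster.
cp gcs-connector-hadoop2-2.0.1-shaded.jar "$SPARK_HOME/jars/"
```

Copying into $SPARK_HOME/jars/ makes the connector available to every job on that machine; an alternative is passing the jar per-job with `--jars`.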