How to properly configure the gcs-connector in a local environment

Problem description

I am trying to configure the gcs-connector in my Scala project, but I always get java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found

Here is my project configuration:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val sparkConf = new SparkConf()
  .set("spark.executor.memory","4g")
  .set("spark.executor.cores","2")
  .set("spark.driver.memory","4g")
  .set("temporaryGcsBucket","some-bucket")

val spark = SparkSession.builder()
  .config(sparkConf)
  .master("spark://spark-master:7077")
  .getOrCreate()

val hadoopConfig = spark.sparkContext.hadoopConfiguration
hadoopConfig.set("fs.gs.auth.service.account.enable","true")
hadoopConfig.set("fs.gs.auth.service.account.json.keyfile","./path-to-key-file.json")
hadoopConfig.set("fs.gs.impl","com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hadoopConfig.set("fs.AbstractFileSystem.gs.impl","com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")

I tried to set up the gcs-connector in both of the following ways:

.set("spark.jars.packages","com.google.cloud.bigdataoss:gcs-connector:hadoop2-2.1.6")
.set("spark.driver.extraClasspath",":/home/celsomarques/Desktop/gcs-connector-hadoop2-2.1.6.jar")

But neither of them loaded the specified class onto the classpath.

Can you point out what I am doing wrong?

Solution

The following configuration worked:

val sparkConf = new SparkConf()
  .set("spark.executor.memory","4g")
  .set("spark.executor.cores","2")
  .set("spark.driver.memory","4g")

val spark = SparkSession.builder()
  .config(sparkConf)
  .master("local")
  .getOrCreate()
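
For reference, here is a minimal sketch of how the whole setup might look in local mode, assuming the connector is added as a build.sbt dependency (the shaded classifier is an assumption made here to sidestep dependency conflicts, and the bucket name and key-file path are placeholders carried over from the question). In local mode the driver runs inside the application JVM, so whatever sbt puts on the classpath is presumably what makes GoogleHadoopFileSystem resolvable here, unlike the spark://spark-master:7077 setup above. Note also that the Spark property is spelled spark.driver.extraClassPath (capital P), which may be why the second attempt in the question did not take effect.

// build.sbt (sketch): put the connector on the application classpath
libraryDependencies += "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop2-2.1.6" classifier "shaded"

// Application code (sketch)
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val sparkConf = new SparkConf()
  .set("spark.executor.memory","4g")
  .set("spark.executor.cores","2")
  .set("spark.driver.memory","4g")

val spark = SparkSession.builder()
  .config(sparkConf)
  .master("local")
  .getOrCreate()

// Same GCS wiring as in the question; the key file path is a placeholder
val hadoopConfig = spark.sparkContext.hadoopConfiguration
hadoopConfig.set("fs.gs.auth.service.account.enable","true")
hadoopConfig.set("fs.gs.auth.service.account.json.keyfile","./path-to-key-file.json")
hadoopConfig.set("fs.gs.impl","com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hadoopConfig.set("fs.AbstractFileSystem.gs.impl","com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")

With that in place, paths such as gs://some-bucket/... should resolve through the fs.gs.impl mapping above.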