Problem description
import org.apache.spark.sql.SparkSession

// Local SparkSession for the example
val spark: SparkSession = SparkSession.builder()
  .master("local[1]")
  .appName("HDFStoAWSExample")
  .getOrCreate()

// S3A credentials and endpoint
spark.sparkContext
  .hadoopConfiguration.set("fs.s3a.access.key", "ACCESS_KEY")
spark.sparkContext
  .hadoopConfiguration.set("fs.s3a.secret.key", "SECRET_KEY")
spark.sparkContext
  .hadoopConfiguration.set("fs.s3a.endpoint", "s3.amazonaws.com")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")

// Read a CSV from HDFS, show it, then write it to S3 as Parquet
val hdfsCSV = spark.read.option("header", true).csv("hdfs://localhost:19000/testCSV.csv")
hdfsCSV.show()
hdfsCSV.write.parquet("s3a://test/parquet/abcCSV")
with this simple sbt file:
name := "spark-amazon-s3-parquet"
scalaVersion := "2.12.12"
val sparkVersion = "3.0.1"
libraryDependencies += "log4j" % "log4j" % "1.2.17"
libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion
libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "3.3.0"
libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "3.3.0"
updateOptions := updateOptions.value.withCachedResolution(true)
Now, when I try to write the Parquet file, it complains about a missing class or method, for example org/apache/hadoop/tracing/SpanReceiverHost (full stack trace at the end).
I tried version 2.7.3 of hadoop-common and hadoop-aws, but then S3 complained with 400 Bad Request (same code as before, only the common and aws versions changed in the sbt file).
Does anyone know what is going on with hadoop-common and hadoop-aws?
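One way to see which Hadoop version actually wins on the classpath (it should match the hadoop-aws/hadoop-common versions declared in sbt) is a small diagnostic sketch like the following, using org.apache.hadoop.util.VersionInfo from hadoop-common:
// Diagnostic sketch: prints the Hadoop version visible to the JVM at runtime.
import org.apache.hadoop.util.VersionInfo
println(s"Hadoop version on classpath: ${VersionInfo.getVersion}")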
Full stack trace:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/tracing/SpanReceiverHost
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:634)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3354)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3403)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3371)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:477)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:723)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:553)
at HDFStoAWSExample$.delayedEndpoint$HDFStoAWSExample$1(HDFStoAWSExample.scala:16)
at HDFStoAWSExample$delayedInit$body.apply(HDFStoAWSExample.scala:3)
at scala.Function0.apply$mcV$sp(Function0.scala:39)
at scala.Function0.apply$mcV$sp$(Function0.scala:39)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
at scala.App.$anonfun$main$1$adapted(App.scala:80)
at scala.collection.immutable.List.foreach(List.scala:431)
at scala.App.main(App.scala:80)
at scala.App.main$(App.scala:78)
at HDFStoAWSExample$.main(HDFStoAWSExample.scala:3)
at HDFStoAWSExample.main(HDFStoAWSExample.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.tracing.SpanReceiverHost
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 28 more
PS: There is nothing wrong with my Hadoop configuration; I can read and write to it just fine.
Solution
As described here, you may need to provide hadoop-client as a dependency as well.
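For example, a minimal sbt sketch (the 3.3.0 version below is an assumption, chosen to match the hadoop-common and hadoop-aws versions already declared in the build above):
// Keep hadoop-client in lockstep with hadoop-common and hadoop-aws
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "3.3.0"
Keeping all Hadoop artifacts on the same version avoids mixing Hadoop 2.x and 3.x classes on the classpath, which is what a NoClassDefFoundError like the one above typically points to.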