Problem description
Hi, I have a Hive external table that uses AWS Glue as its data catalog. EMR can access the Glue catalog; I have verified this from the Hive console. But when I try to access the Hive table from Spark in a Scala program using .enableHiveSupport(), I get an error:
val spark = SparkSession.builder.appName("Spark hive app")
  .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
  .enableHiveSupport()
  .getOrCreate()
spark.catalog.setCurrentDatabase("testDb")
spark.sql("set hive.msck.path.validation=ignore")
spark.sql("MSCK REPAIR TABLE test_table")
spark.sql("select * from test_table limit 10")
spark.stop()
I want to connect to the Glue metastore, but the library somehow tries to find a metastore on localhost, which seems to be causing the problem. Is there a hive.metastore.uris value for AWS Glue?
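I am also not certain the key set on the builder even reaches the Hadoop configuration that the Hive client reads; if it has to be passed through, a variant with the spark.hadoop. prefix (purely a sketch, not verified) would look like this:
// Sketch, not verified: keys prefixed with "spark.hadoop." are copied into the
// Hadoop Configuration, which is where the Hive client looks this property up.
val spark = SparkSession.builder.appName("Spark hive app")
  .config("spark.hadoop.hive.metastore.client.factory.class",
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
  .enableHiveSupport()
  .getOrCreate()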
EMR version = emr-5.30.1
Applications = Hive 2.3.6, Presto 0.232, Spark 2.4.5
I have enabled "Use AWS Glue Data Catalog for table metadata".
Below is my build.sbt:
name := "test"
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.1"
libraryDependencies += "com.amazonaws" % "aws-java-sdk" % "1.7.4"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.7.3"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.4.1"
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case "git.properties" => MergeStrategy.last
  case x => MergeStrategy.first
}
mainClass in assembly := Some("com.std.test")
assemblyJarName in assembly := "test.jar"
assemblyShadeRules in assembly ++= Seq(
  ShadeRule.rename("org.apache.hadoop.**" -> "my_conf.@1")
    .inLibrary("org.apache.hadoop" % "hadoop-aws" % "2.7.3")
    .inProject
)
Detailed error log:
20/08/16 16:40:50 INFO metastore: Trying to connect to metastore with URI thrift://ip-172-31-39-192.ap-south-1.compute.internal:9083
20/08/16 16:40:50 WARN metastore: Failed to connect to the metastore Server...
20/08/16 16:40:50 INFO metastore: Waiting 1 seconds before next connection attempt.
20/08/16 16:40:51 WARN Hive: Failed to access metastore. This class should not accessed in runtime.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:166)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:185)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:118)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:271)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:404)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:306)
at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:215)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:215)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:215)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:214)
at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)
at org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:141)
at org.apache.spark.sql.internal.SharedState.globalTempViewManager(SharedState.scala:136)
at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55)
at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager$lzycompute(SessionCatalog.scala:91)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager(SessionCatalog.scala:91)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.setCurrentDatabase(SessionCatalog.scala:258)
at org.apache.spark.sql.execution.command.SetDatabaseCommand.run(databases.scala:59)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
at org.apache.spark.sql.execution.SQLExecution$.org$apache$spark$sql$execution$SQLExecution$$executeQuery$1(SQLExecution.scala:83)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1$$anonfun$apply$1.apply(SQLExecution.scala:94)
at org.apache.spark.sql.execution.QueryExecutionMetrics$.withMetrics(QueryExecutionMetrics.scala:141)
at org.apache.spark.sql.execution.SQLExecution$.org$apache$spark$sql$execution$SQLExecution$$withMetrics(SQLExecution.scala:178)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:93)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:200)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:92)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:195)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:643)
at com.quickheal.PartitionHandler$.main(PartitionHandler.scala:44)
at com.quickheal.PartitionHandler.main(PartitionHandler.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1523)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
... 60 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
... 66 more
Caused by: MetaException(message:Could not connect to meta store using any of the URIs provided. Most recent failure: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused (Connection refused)
at org.apache.thrift.transport.TSocket.open(TSocket.java:226)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:420)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:236)
at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:166)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:185)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:118)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:271)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:404)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:306)
at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:215)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:215)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:215)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:214)
at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)
at org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:141)
at org.apache.spark.sql.internal.SharedState.globalTempViewManager(SharedState.scala:136)
at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55)
at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager$lzycompute(SessionCatalog.scala:91)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager(SessionCatalog.scala:91)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.setCurrentDatabase(SessionCatalog.scala:258)
at org.apache.spark.sql.execution.command.SetDatabaseCommand.run(databases.scala:59)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
at org.apache.spark.sql.execution.SQLExecution$.org$apache$spark$sql$execution$SQLExecution$$executeQuery$1(SQLExecution.scala:83)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1$$anonfun$apply$1.apply(SQLExecution.scala:94)
at org.apache.spark.sql.execution.QueryExecutionMetrics$.withMetrics(QueryExecutionMetrics.scala:141)
at org.apache.spark.sql.execution.SQLExecution$.org$apache$spark$sql$execution$SQLExecution$$withMetrics(SQLExecution.scala:178)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:93)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:200)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:92)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:195)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:643)
at com.quickheal.PartitionHandler$.main(PartitionHandler.scala:44)
at com.quickheal.PartitionHandler.main(PartitionHandler.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:607)
at org.apache.thrift.transport.TSocket.open(TSocket.java:221)
... 74 more
)
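The URI in the first log line is just the cluster-local default, so the Glue factory setting clearly never takes effect. A small diagnostic sketch (assuming it is run inside the same session) to print what the Hive client will actually read:
// Diagnostic sketch: inspect the metastore-related settings the session resolves.
println(spark.sparkContext.hadoopConfiguration.get("hive.metastore.uris"))
println(spark.sparkContext.hadoopConfiguration.get("hive.metastore.client.factory.class"))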
Solution
There were some problems in build.sbt; changing the sbt file fixed the issue for me.
Updated build.sbt (reconstructed here from the changes listed at the end):
name := "test"
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.1" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.4.1" % "provided"
libraryDependencies += "com.amazonaws" % "aws-java-sdk" % "1.7.4"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.7.3"
mainClass in assembly := Some("com.std.test")
assemblyJarName in assembly := "test.jar"
Spark code (again a reconstruction: identical to the question's, minus the factory-class config line):
val spark = SparkSession.builder.appName("Spark hive app")
  .enableHiveSupport()
  .getOrCreate()
spark.catalog.setCurrentDatabase("testDb")
spark.sql("set hive.msck.path.validation=ignore")
spark.sql("MSCK REPAIR TABLE test_table")
spark.sql("select * from test_table limit 10")
spark.stop()
Changes made:
Marked the Spark sbt dependencies as "provided" (so the assembly jar no longer bundles its own Spark/Hive classes, and the EMR-provided ones, which are already wired to the Glue client factory, are picked up instead)
The config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") setting is not needed
Removed the merge strategy
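As a sanity check after these changes (a sketch; the database and table names are the ones from the question), listing databases should now return entries from the Glue Data Catalog rather than failing against a local metastore:
// Sanity-check sketch: with the EMR-provided Spark/Hive classes on the classpath,
// these queries should hit the Glue Data Catalog, not a localhost thrift metastore.
spark.sql("show databases").show()
spark.sql("select * from testDb.test_table limit 10").show()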
Hope this helps someone.