Problem description
I am trying to retrieve data from a Greenplum database and display it with PySpark. This is the code I have implemented.
import pyspark
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("spkapp") \
    .master("local[*]") \
    .config("spark.debug.maxToStringFields", "100") \
    .config("spark.sql.broadcastTimeout", "36000") \
    .config("spark.network.timeout", "600s") \
    .config("spark.executor.cores", "1") \
    .getOrCreate()

gscpythonoptions = {
    "url": "jdbc:postgresql://localhost:5432/db_name",
    "user": "my_user",
    "password": "",
    "dbschema": "public"
}

gpdf_swt = spark.read.format("greenplum").options(**gscpythonoptions, dbtable="products", partitionColumn="id").load()

gpdf_swt.printSchema()
gpdf_swt.show()
But when I run my Python file with spark-submit, it gives me the error below.
20/12/30 21:23:33 ERROR TaskSetManager: Task 2 in stage 0.0 Failed 1 times; aborting job
Traceback (most recent call last):
File "/home/credit_card/summary_table_creation2Test.py",line 38,in <module>
gpdf_swt.count()
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py",line 524,in show
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",line 1257,in __call__
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py",line 63,in deco
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",line 328,in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o84.show.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 0.0 Failed 1 times,most recent failure: Lost task 2.0 in stage 0.0 (TID 2,localhost,executor driver): java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at io.pivotal.greenplum.spark.jdbc.Jdbc$.getdistributedTransactionId(Jdbc.scala:500)
at io.pivotal.greenplum.spark.externaltable.GreenplumRowIterator.<init>(GreenplumRowIterator.scala:100)
at io.pivotal.greenplum.spark.GreenplumRDD.compute(GreenplumRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This is my spark-submit command.
/usr/local/spark/bin/spark-submit --driver-class-path /root/greenplum/greenplum-spark_2.11-1.6.2.jar summary_table_creation
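A side note on the submit command: --driver-class-path only puts the connector jar on the driver's classpath. In local[*] mode the driver and executor share one JVM, so that is enough here; on a real cluster the jar would normally also be shipped to the executors, either with the --jars flag or via the standard spark.jars setting. A minimal sketch of the latter, assuming the same jar path as in the command above:

from pyspark.sql import SparkSession

# Sketch: also distribute the Greenplum-Spark connector jar to executors.
# spark.jars makes the listed jars available on every executor when the
# SparkContext starts; the path below is copied from the spark-submit command.
spark = SparkSession.builder \
    .appName("spkapp") \
    .config("spark.jars", "/root/greenplum/greenplum-spark_2.11-1.6.2.jar") \
    .getOrCreate()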
Any help in overcoming this error is appreciated.
Edit: My Greenplum version is 6.4.0. There is a similar question here, but its solution only applies to Greenplum versions above 6.7.1.
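If upgrading Greenplum (or the connector) is not an option, one possible fallback is to read the table through Spark's generic jdbc data source instead of the greenplum format, since Greenplum speaks the PostgreSQL wire protocol. This is only a sketch: it assumes a PostgreSQL JDBC driver jar (e.g. postgresql-42.x.jar, name hypothetical) is on the classpath, and it pulls the data through the master in a single connection rather than using the connector's parallel transfer, so it may be slow for large tables. It does, however, avoid the connector code path that raises None.get.

# Hedged fallback: generic JDBC read through the Greenplum master.
# Reuses the SparkSession `spark` created earlier in this post.
gpdf_jdbc = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/db_name") \
    .option("dbtable", "public.products") \
    .option("user", "my_user") \
    .option("password", "") \
    .option("driver", "org.postgresql.Driver") \
    .load()

gpdf_jdbc.printSchema()
gpdf_jdbc.show()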