Big data with Spark and CouchDB

Problem description

I am using Spark 2.4.0 together with "org.apache.bahir - spark-sql-cloudant - 2.4.0". I have to download all JSON documents from CouchDB to HDFS.

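 import org.apache.spark.storage.StorageLevel

 // Load the whole "demo" database through the Bahir Cloudant connector
 val df = spark
  .read
  .format("org.apache.bahir.cloudant")
  .load("demo")
 // Keep the loaded data in memory and spill to disk when it does not fit
 df.persist(StorageLevel.MEMORY_AND_DISK)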

 df
  .write
  .partitionBy("year","month","day")
  .mode("append")
  .parquet("...")

The total size is 160 GB (more than 13 million documents). After running for about 5 minutes, the Spark job fails with this error:

Caused by: com.cloudant.client.org.lightcouch.CouchDbException: Error retrieving server response

Increasing the timeout did not help; the job still fails, only later. Is there any way around this?

Solution

Querying through a different endpoint helped me: using _changes instead of _all_docs.
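The answer does not show how the endpoint was switched. A minimal sketch, assuming the connector's documented `cloudant.endpoint` setting is passed through the Spark session configuration, could look like the following. The host, protocol, credentials and the `ChangesEndpointSketch` object name are placeholders for illustration, and the output path is taken from the first program argument because the original post does not give it.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical driver object; not part of the original post.
object ChangesEndpointSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: the target HDFS path is supplied as the first argument.
    val outputPath = args(0)

    val spark = SparkSession.builder()
      .appName("couchdb-to-hdfs")
      // Placeholder connection settings; replace with real values.
      .config("cloudant.protocol", "http")
      .config("cloudant.host", "couchdb.example.com:5984")
      .config("cloudant.username", "user")
      .config("cloudant.password", "secret")
      // Read via the _changes feed instead of the default _all_docs.
      .config("cloudant.endpoint", "_changes")
      .getOrCreate()

    // Same read and write as in the question, now backed by _changes.
    val df = spark.read
      .format("org.apache.bahir.cloudant")
      .load("demo")

    df.write
      .partitionBy("year", "month", "day")
      .mode("append")
      .parquet(outputPath)

    spark.stop()
  }
}
```

As far as the Bahir documentation describes it, the _changes endpoint loads the database as a single snapshot in one partition, while _all_docs reads in parallel ranges. That difference may explain why one worked where the other failed here, but the post itself does not confirm the cause.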

apache-bahir apache-spark apache-spark-sql cloudant couchdb