Problem description
I have a Delta table created with Spark 3.x and Delta 0.7.x:
data = spark.range(0,5)
data.write.format("delta").mode("overwrite").save("tmp/delta-table")
# add some more files
data = spark.range(20,100)
data.write.format("delta").mode("append").save("tmp/delta-table")
df = spark.read.format("delta").load("tmp/delta-table")
df.show()
Now a lot of files have been generated in the log (in many cases, the Parquet files are too small).
%ls tmp/delta-table
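A quick way to count the backing Parquet files from inside Spark, rather than shelling out, is a sketch like this (input_file_name is standard PySpark):

from pyspark.sql.functions import input_file_name

# Count the distinct Parquet files behind the current snapshot.
spark.read.format("delta").load("tmp/delta-table") \
     .select(input_file_name()).distinct().count()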
I want to compact them:
df.createGlobalTempView("my_delta_table")
spark.sql("OPTIMIZE my_delta_table ZORDER BY (id)")
This fails with:
ParseException:
mismatched input 'OPTIMIZE' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0)
== SQL ==
OPTIMIZE my_delta_table ZORDER BY (id)
^^^
Questions:
- How can I make this (OPTIMIZE) work without the query failing? (A compaction sketch follows below.)
- Is there a more native API than invoking text-based SQL?
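For background: open-source Delta Lake 0.7.x did not ship the OPTIMIZE command at all (it was a Databricks-platform feature at the time), so Spark's SQL parser rejects the keyword before Delta ever sees the statement. What the open-source documentation of that era did describe is manual bin-packing: read the table, repartition it into fewer files, and overwrite it with dataChange = false so streaming consumers do not treat the rewrite as new data. A minimal sketch, where num_files = 4 is an arbitrary target, not a value from the original post:

path = "tmp/delta-table"
num_files = 4  # hypothetical target; tune to your data volume

(spark.read.format("delta").load(path)
      .repartition(num_files)          # shuffle the many small files into a few partitions
      .write.format("delta")
      .option("dataChange", "false")   # mark the commit as a rewrite rather than new data
      .mode("overwrite")
      .save(path))

Since this is plain DataFrame API, it also speaks to the second question, at least for bin-packing; as far as I know, Z-ordering had no open-source equivalent in 0.7.x.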
Note:
Spark is started like this:
import pyspark
from pyspark.sql import SparkSession
spark = pyspark.sql.SparkSession.builder.appName("MyApp") \
.config("spark.jars.packages","io.delta:delta-core_2.12:0.7.0") \
.config("spark.sql.extensions","io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog","org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.getOrCreate()
from delta.tables import *
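For completeness: later open-source releases did add the command (OPTIMIZE around Delta 1.2, ZORDER BY around Delta 2.0, going by the release notes). With such a version on the classpath, the statement can target the table by path, so no view is needed; note also that a global temp view would have to be referenced as global_temp.my_delta_table, and OPTIMIZE operates on tables rather than views anyway. A sketch assuming a sufficiently new delta package:

# Assumes a Delta Lake release that ships OPTIMIZE/ZORDER (not 0.7.0).
spark.sql("OPTIMIZE delta.`tmp/delta-table`")                 # bin-packing only
spark.sql("OPTIMIZE delta.`tmp/delta-table` ZORDER BY (id)")  # with Z-ordering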
Workaround
No effective workaround has been found for this problem yet.
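That said, one detail worth pairing with the compaction sketch above: rewriting with dataChange = false leaves the old small files on disk, and they are only physically deleted by VACUUM, whose Python API did exist in 0.7.x. A sketch, noting that the zero-hour retention is for demos only and unsafe if anything might still read older snapshots:

from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, "tmp/delta-table")

# Default retention is 7 days; shorter values require disabling the safety check.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
dt.vacuum(0)  # demo only: immediately delete files no longer in the latest snapshot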