Why does Databricks Delta copy unmodified rows even when a merge updates nothing?

Question

When I run the following query:

merge into test_records t
using (
  select id, 'senior developer' title, country from test_records where country = 'Brazil'
) u
on t.id = u.id
when matched and (t.id <> u.id) then -- this is just to be sure that nothing will get updated
  update set t.title = u.title, t.updated_at = Now()
when not matched then
  insert (id, title, country, created_at, updated_at) values (id, title, country, Now(), Now());

When I then run DESCRIBE HISTORY on the target table, I still see the following operation metrics:

{"numTargetRowsCopied": "2", "numTargetRowsDeleted": "0", "numTargetFilesAdded": "1", "numTargetRowsInserted": "0", "numTargetRowsUpdated": "0", "numOutputRows": "2", "numSourceRows": "2", "numTargetFilesRemoved": "1"}
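One way to read these numbers, as a rough sketch rather than anything Delta-specific: the snippet below parses the metrics exactly as reported and adds up the copied, updated and inserted rows. The variable names are made up for illustration, and the fact that the sum equals numOutputRows is only an observation about this sample, not a documented guarantee.

```python
import json

# Operation metrics as reported by the table history in the question.
raw = (
    '{"numTargetRowsCopied": "2", "numTargetRowsDeleted": "0", '
    '"numTargetFilesAdded": "1", "numTargetRowsInserted": "0", '
    '"numTargetRowsUpdated": "0", "numOutputRows": "2", '
    '"numSourceRows": "2", "numTargetFilesRemoved": "1"}'
)

# Lower-case the keys so the lookups do not depend on exact capitalisation,
# and convert the string values to integers.
metrics = {key.lower(): int(value) for key, value in json.loads(raw).items()}

# Rows written to the new file: copied + updated + inserted.
written = (
    metrics["numtargetrowscopied"]
    + metrics["numtargetrowsupdated"]
    + metrics["numtargetrowsinserted"]
)

print(written, metrics["numoutputrows"])  # prints "2 2"
```

In other words, one file was removed and one file was added, and both rows in the new file are plain copies.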

In the Spark UI, I see: [screenshot]

So the unmodified rows get rewritten for no (?) reason. Why is that?

Answer

Disclaimer: I have been looking into this, and I can only show you what the code looks like, not tell you why it works this way.

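A MERGE INTO like this one, in which no rows are matched, resolves to the MergeIntoCommand logical command (which is executed on the driver). All of its performance metrics, including numTargetRowsCopied, are defined in that command.

Following that metric leads to writeAllChanges.

The interesting part of this code is that it picks the join type to be either rightOuter or fullOuter. Set the logging level of the org.apache.spark.sql.delta.commands.MergeIntoCommand logger to DEBUG to see what happens internally in the logs.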
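For the DEBUG logging step, a minimal configuration sketch. It assumes a runtime that still reads a Log4j 1.x properties file; newer Spark releases use Log4j 2, whose configuration syntax differs, and where this file lives on your cluster is not covered here.

```properties
# Assumption: Log4j 1.x properties syntax.
# Raises only the MergeIntoCommand logger to DEBUG and leaves the root level alone.
log4j.logger.org.apache.spark.sql.delta.commands.MergeIntoCommand=DEBUG
```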

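Another very interesting thing is that the metrics are computed inside UDFs (!), including the one for numTargetRowsCopied.

Finally, the lines where that UDF is executed and the metric is incremented carry a particularly interesting comment: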

// Target row did not match any source row, so just copy it to the output

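I think that says it all. A row that no clause applies to simply increments the numTargetRowsCopied metric. I guess you have two rows in the target table, don't you?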
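The behaviour described above can be imitated with a small toy model. This is not Delta's implementation: the function simulate_merge, the sample rows and the rule "every row of a rewritten file that no clause changes is copied" are assumptions made purely for illustration, using plain Python dictionaries in place of a join.

```python
def simulate_merge(target_rows, source_rows):
    """Toy model of rewriting one target file during a merge (not Delta's real code)."""
    metrics = {
        "numTargetRowsCopied": 0,
        "numTargetRowsUpdated": 0,
        "numTargetRowsInserted": 0,
    }
    # Assumes ids are unique on both sides.
    source_by_id = {row["id"]: row for row in source_rows}
    target_ids = {row["id"] for row in target_rows}
    output = []

    for target in target_rows:
        source = source_by_id.get(target["id"])
        # WHEN MATCHED AND (t.id <> u.id): can never hold,
        # because the ON condition already requires t.id = u.id.
        if source is not None and target["id"] != source["id"]:
            output.append({**target, "title": source["title"]})
            metrics["numTargetRowsUpdated"] += 1
        else:
            # No clause changed this row, so it is written out unchanged.
            output.append(dict(target))
            metrics["numTargetRowsCopied"] += 1

    for source in source_rows:
        # WHEN NOT MATCHED: source rows without a target counterpart are inserted.
        if source["id"] not in target_ids:
            output.append(dict(source))
            metrics["numTargetRowsInserted"] += 1

    return output, metrics


# Hypothetical data: two rows for Brazil, mirroring numSourceRows = 2 in the question.
target = [
    {"id": 1, "title": "developer", "country": "Brazil"},
    {"id": 2, "title": "analyst", "country": "Brazil"},
]
source = [{**row, "title": "senior developer"} for row in target]

rewritten, metrics = simulate_merge(target, source)
print(metrics)
```

With these inputs the model reports two copied rows and no updates or inserts, which is the same shape as the metrics in the question: the file is rewritten, but its contents do not change.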

apache-spark-sql databricks delta-lake