ClickHouse JOIN with a condition

Question

I found something strange. The query:

SELECT *
FROM progress as pp
ALL LEFT JOIN links as ll USING (viewId)
WHERE viewId = 'a776a2f2-16ad-448a-858d-891e68bec9a8' 

Result: 0 rows in set. Elapsed: 5.267 sec. Processed 8.62 million rows, 484.94 MB (1.64 million rows/s., 92.08 MB/s.)

The modified query:

SELECT *
FROM
  (SELECT *
   FROM progress
   WHERE viewId = 'a776a2f2-16ad-448a-858d-891e68bec9a8') AS p ALL
LEFT JOIN
  (SELECT *
   FROM links
   WHERE viewId = toUUID('a776a2f2-16ad-448a-858d-891e68bec9a8')) AS l ON p.viewId = l.viewId;

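Result: 0 rows in set. Elapsed: 0.076 sec. Processed 4.48 million rows, 161.35 MB (58.69 million rows/s., 2.12 GB/s.)

But it looks dirty.

Shouldn't the WHERE clause be used to optimize the query based on the condition?

What is the correct way to write this query, and what should I do if the condition is a WHERE ... IN?

Then I tried adding one more join: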

SELECT *
FROM
  (SELECT videoUuid AS contentUuid,viewId
   FROM
     (SELECT *
      FROM progress
      WHERE viewId = 'a776a2f2-16ad-448a-858d-891e68bec9a8') p ALL
   LEFT JOIN
     (SELECT *
      FROM links
      WHERE viewId = toUUID('a776a2f2-16ad-448a-858d-891e68bec9a8')) USING `viewId`) ALL
LEFT JOIN `MetaInfo` USING `viewId`,`contentUuid`;

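Considering that I only want to join three tables, each filtered by a condition, into a single row, the result is still slow:

0 rows in set. Elapsed: 1.747 sec. Processed 9.13 million rows, 726.55 MB (5.22 million rows/s., 415.85 MB/s.)

Answer

At the moment, CH does not cope well with multi-join queries (DB star schema), and the query optimizer is not yet good enough to rely on completely.

So you need to state explicitly how the query should be "executed", by using subqueries instead of joins.

Consider this test query: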

SELECT table_01.number AS r
FROM numbers(87654321) AS table_01
  INNER JOIN numbers(7654321) AS table_02 ON (table_01.number = table_02.number)
  INNER JOIN numbers(654321) AS table_03 ON (table_02.number = table_03.number)
  INNER JOIN numbers(54321) AS table_04 ON (table_03.number = table_04.number)
WHERE r = 54320
/*
┌─────r─┐
│ 54320 │
└───────┘

1 rows in set. Elapsed: 6.261 sec. Processed 96.06 million rows, 768.52 MB (15.34 million rows/s., 122.74 MB/s.)
*/

Let's rewrite it using subqueries to speed it up considerably:

SELECT number AS r
FROM numbers(87654321)
WHERE r = 54320 AND number IN (
  SELECT number AS r
  FROM numbers(7654321)
  WHERE r = 54320 AND number IN (
    SELECT number AS r
    FROM numbers(654321)
    WHERE r = 54320 AND number IN (
      SELECT number AS r
      FROM numbers(54321)
      WHERE r = 54320
    )
  )
)
/*
┌─────r─┐
│ 54320 │
└───────┘

1 rows in set. Elapsed: 0.481 sec. Processed 96.06 million rows, 768.52 MB (199.69 million rows/s., 1.60 GB/s.)
*/
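
One caveat with the IN rewrite, sketched here with the same numbers() table function (the sizes 10 and 5 and the value 3 are arbitrary choices for illustration): IN only filters rows of the outer table, so columns from the inner SELECT are not available in the result.

```sql
-- IN keeps the matching rows of the outer table;
-- nothing from the inner SELECT is returned as a column.
SELECT number AS r
FROM numbers(10)
WHERE number IN (
  SELECT number
  FROM numbers(5)
  WHERE number = 3
);
```

If the query needs columns from the other tables, as the progress/links query in the question does, pre-filtering each side in a subquery before joining is probably the closer fit.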

There are other ways to optimize JOINs as well.

Some useful references:

Altinity webinar: Tips and tricks every ClickHouse user should know

Altinity webinar: Secrets of ClickHouse Query Performance

Shouldn't the WHERE clause optimize the query based on the condition?

This kind of optimization is not implemented yet.

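This is expected behavior. According to the CH docs (https://clickhouse.tech/docs/en/sql-reference/statements/select/join/#performance): "When running a JOIN, there is no optimization of the order of execution in relation to other stages of the query. The join (a search in the right table) is run before filtering in WHERE and before aggregation."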
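
To make the quoted behavior concrete, here is a minimal sketch using numbers(); the table sizes and the value 54320 are assumptions chosen for illustration, not taken from the question's schema. Both queries should return the same single row, but the second one filters each side before the join runs, so the join only has to handle one row per side.

```sql
-- Filter written after the join: per the quoted docs,
-- the join runs first and WHERE is applied to its output.
SELECT l.number AS n
FROM numbers(1000000) AS l
INNER JOIN numbers(1000000) AS r ON l.number = r.number
WHERE l.number = 54320;

-- Same result, with each side reduced to one row before the join.
SELECT l.number AS n
FROM (SELECT number FROM numbers(1000000) WHERE number = 54320) AS l
INNER JOIN (SELECT number FROM numbers(1000000) WHERE number = 54320) AS r
  ON l.number = r.number;
```

Newer ClickHouse releases may push such predicates down on their own, so how much the manual rewrite helps depends on the version in use.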