How to run small queries efficiently in ClickHouse

Problem description

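Our deployment has one thousand shards. Inserts go through a Distributed table whose sharding key is jumpConsistentHash(colX, 1000). When I query rows with colX=... and set send_logs_level='trace', I can see that the query is sent to every shard and executed on each of them. This limits our QPS (queries per second). The ClickHouse documentation states: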

SELECT queries are sent to all the shards and work regardless of how data is distributed across the shards (they can be distributed completely randomly).
When you add a new shard, you don’t have to transfer the old data to it.
You can write new data with a heavier weight – the data will be distributed slightly unevenly, but queries will work correctly and efficiently.

You should be concerned about the sharding scheme in the following cases:

* Queries are used that require joining data (IN or JOIN) by a specific key. If data is sharded by this key, you can use local IN or JOIN instead of GLOBAL IN or GLOBAL JOIN, which is much more efficient.
* A large number of servers is used (hundreds or more) with a large number of small queries (queries of individual clients - websites, advertisers, or partners).
In order for the small queries to not affect the entire cluster, it makes sense to locate data for a single client on a single shard.
Alternatively, as we’ve done in Yandex.Metrica, you can set up bi-level sharding: divide the entire cluster into “layers”, where a layer may consist of multiple shards.
Data for a single client is located on a single layer, but shards can be added to a layer as necessary, and data is randomly distributed within them.
Distributed tables are created for each layer, and a single shared distributed table is created for global queries.

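For small queries like ours (the second bullet above), there seems to be a solution, but it is not clear to me. Does this mean that when I query with the predicate colX=..., I need to find the "layer" that contains its rows and then run the query against the Distributed table for that layer?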
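The client-side routing described in that question could look roughly like the sketch below. It maps a colX value to a layer and picks that layer's Distributed table. The layer count and the table names (`events_layer_N`) are hypothetical, and the hash is the published jump consistent hash algorithm by Lamping and Veach; whether it returns exactly the same values as ClickHouse's jumpConsistentHash is an assumption that should be verified against a real server before relying on it.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # emulate unsigned 64-bit arithmetic


def jump_consistent_hash(key: int, num_buckets: int) -> int:
    """Published jump consistent hash: maps key to a bucket in [0, num_buckets)."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    key &= MASK64
    b, j = -1, 0
    while j < num_buckets:
        b = j
        # 64-bit linear congruential step, as in the published algorithm
        key = (key * 2862933555777941757 + 1) & MASK64
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b


NUM_LAYERS = 10  # hypothetical number of layers


def layer_table_for(col_x: int) -> str:
    """Return the (hypothetical) per-layer Distributed table for a colX value."""
    layer = jump_consistent_hash(col_x, NUM_LAYERS)
    return f"events_layer_{layer}"


if __name__ == "__main__":
    print(layer_table_for(12345))
```

With this kind of routing the application, not ClickHouse, decides which layer to query, so the mapping used at insert time and at query time has to be the same function.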

Is there a way to run these small queries against the global Distributed table instead?
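One option that may be worth testing for this is the ClickHouse setting optimize_skip_unused_shards, which lets a SELECT on a Distributed table skip shards that cannot match a sharding-key condition in WHERE/PREWHERE. It only gives correct results if the data really was written according to the table's sharding key. The sketch below just builds such a query string; the table name `events_all` is hypothetical, and whether the setting removes the QPS limit in this particular deployment has not been verified here.

```python
def build_point_query(table: str, col_x: int) -> str:
    """Build a point lookup on colX that asks ClickHouse to skip unused shards.

    Assumes `table` is a Distributed table whose sharding key depends only on colX.
    """
    # int() keeps the interpolated value numeric; the table name is trusted input here
    return (
        f"SELECT * FROM {table} WHERE colX = {int(col_x)} "
        "SETTINGS optimize_skip_unused_shards = 1"
    )


if __name__ == "__main__":
    print(build_point_query("events_all", 42))
```

Running the resulting query with send_logs_level='trace' again would show whether it still fans out to every shard.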
