问题描述
在我们的部署中,有一千个分片。插入是通过带有分片 jumpConsistentHash(colX,1000)
的分布式表完成的。当我使用 colX=...
查询行并打开 send_logs_level='trace'
时,我看到查询被发送到所有分片并在每个分片上执行。这限制了我们的 QPS(每秒查询数)。检查 Clickhouse document,它指出:
SELECT queries are sent to all the shards and work regardless of how data is distributed across the shards (they can be distributed completely randomly).
When you add a new shard,you don’t have to transfer the old data to it.
You can write new data with a heavier weight – the data will be distributed slightly unevenly,but queries will work correctly and efficiently.
You should be concerned about the sharding scheme in the following cases:
* Queries are used that require joining data (IN or JOIN) by a specific key. If data is sharded by this key,you can use local IN or JOIN instead of GLOBAL IN or GLOBAL JOIN,which is much more efficient.
* A large number of servers is used (hundreds or more) with a large number of small queries (queries of individual clients - websites,advertisers,or partners).
In order for the small queries to not affect the entire cluster,it makes sense to locate data for a single client on a single shard.
Alternatively,as we’ve done in Yandex.Metrica,you can set up bi-level sharding: divide the entire cluster into “layers”,where a layer may consist of multiple shards.
Data for a single client is located on a single layer,but shards can be added to a layer as necessary,and data is randomly distributed within them.
distributed tables are created for each layer,and a single shared distributed table is created for global queries.
对于像我们这样的小查询(上面的第二个项目),似乎有一个解决方案,但我不清楚这一点。这是否意味着在使用谓词 colX=...
查询特定查询时,我需要找到包含其行的相应“层”,然后在该层的相应分布式表上进行查询?
解决方法
暂无找到可以解决该程序问题的有效方法,小编努力寻找整理中!
如果你已经找到好的解决方法,欢迎将解决方案带上本链接一起发送给小编。
小编邮箱:dio#foxmail.com (将#修改为@)