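Problem description
Postgres uses a much heavier Seq Scan on the table tracking even though an index is available. The first query was the original attempt; it uses a Seq Scan and is therefore slow. I tried to force an Index Scan with an inner select, but Postgres converted it back into effectively the same query with nearly the same runtime. Finally, I copied the list of ids from the inner select of the second query to build the third query. This time Postgres used the Index Scan, which reduced the runtime dramatically. The third query is not viable in a production environment. What would cause Postgres to use the last query plan?
(VACUUM was run on both tables.)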
Tables
- tracking (worker_id, localdatetime), total records: 118664105
- project_worker (id, project_id), total records: 12935
Index
- CREATE INDEX tracking_worker_id_localdatetime_idx ON public.tracking USING btree (worker_id, localdatetime)
Queries
SELECT worker_id,localdatetime FROM tracking t JOIN project_worker pw ON t.worker_id = pw.id WHERE project_id = 68475018
Hash Join (cost=29185.80..2638162.26 rows=19294218 width=16) (actual time=16.912..18376.032 rows=177681 loops=1)
  Hash Cond: (t.worker_id = pw.id)
  -> Seq Scan on tracking t (cost=0.00..2297293.86 rows=118716186 width=16) (actual time=0.004..8242.891 rows=118674660 loops=1)
  -> Hash (cost=29134.80..29134.80 rows=4080 width=8) (actual time=16.855..16.855 rows=2102 loops=1)
        Buckets: 4096 Batches: 1 Memory Usage: 115kB
        -> Seq Scan on project_worker pw (cost=0.00..29134.80 rows=4080 width=8) (actual time=0.004..16.596 rows=2102 loops=1)
              Filter: (project_id = 68475018)
              Rows Removed by Filter: 10833
Planning Time: 0.192 ms
Execution Time: 18382.698 ms
SELECT worker_id,localdatetime FROM tracking t WHERE worker_id IN (SELECT id FROM project_worker WHERE project_id = 68475018 LIMIT 500)
Hash Semi Join (cost=6905.32..2923969.14 rows=27733254 width=24) (actual time=19.715..20191.517 rows=20530 loops=1)
  Hash Cond: (t.worker_id = project_worker.id)
  -> Seq Scan on tracking t (cost=0.00..2296948.27 rows=118698327 width=24) (actual time=0.005..9184.676 rows=118657026 loops=1)
  -> Hash (cost=6899.07..6899.07 rows=500 width=8) (actual time=1.103..1.103 rows=500 loops=1)
        Buckets: 1024 Batches: 1 Memory Usage: 28kB
        -> Limit (cost=0.00..6894.07 rows=500 width=8) (actual time=0.006..1.011 rows=500 loops=1)
              -> Seq Scan on project_worker (cost=0.00..28982.65 rows=2102 width=8) (actual time=0.005..0.968 rows=500 loops=1)
                    Filter: (project_id = 68475018)
                    Rows Removed by Filter: 4493
Planning Time: 0.224 ms
Execution Time: 20192.421 ms
SELECT worker_id,localdatetime FROM tracking t WHERE worker_id IN (322016383,316007840,...,285702579)
Index Scan using tracking_worker_id_localdatetime_idx on tracking t (cost=0.57..4766798.31 rows=21877360 width=24) (actual time=0.079..29.756 rows=22112 loops=1)
  Index Cond: (worker_id = ANY ('{322016383,285702579}'::bigint[]))
Planning Time: 1.162 ms
Execution Time: 30.884 ms
(The ... stands in for the 500 id entries used in the query.)
The same query run on another set of 500 ids:
Index Scan using tracking_worker_id_localdatetime_idx on tracking t (cost=0.57..4776714.91 rows=21900980 width=24) (actual time=0.105..5528.109 rows=117838 loops=1)
  Index Cond: (worker_id = ANY ('{286237712,286237844,216724213}'::bigint[]))
Planning Time: 2.105 ms
Execution Time: 5534.948 ms
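Solution
The distribution of worker_id within tracking seems to be very skewed. For one thing, one instance of query 3 returns over 5 times as many rows as the other instance (117838 versus 22112). In addition, the estimated number of rows is 100 to 1000 times higher than the actual number. This can certainly lead to bad plans (although it is unlikely to be the whole story).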
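The size of the misestimate can be read straight off the plans in the question. As a rough sketch, the following divides the planner's row estimate by the actual row count for the top node of each plan. All numbers are copied from the EXPLAIN ANALYZE output above; the column aliases are made up for illustration.

```sql
-- Estimated rows / actual rows for the top node of each plan.
SELECT round(19294218 / 177681.0) AS query1_ratio,   -- about 109
       round(27733254 / 20530.0)  AS query2_ratio,   -- about 1351
       round(21877360 / 22112.0)  AS query3a_ratio,  -- about 989
       round(21900980 / 117838.0) AS query3b_ratio;  -- about 186
```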
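What is the actual number of distinct values for worker_id within tracking?
select count(distinct worker_id) from tracking;
What does the planner think this value is?
select n_distinct from pg_stats where tablename='tracking' and attname='worker_id';
If those values are far apart and you force the planner to use a more reasonable value, do the plans change?
alter table tracking alter column worker_id set (n_distinct = <real value>); analyze tracking;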
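One detail that can make this comparison confusing: pg_stats.n_distinct is stored as a negative number when the planner tracks it as a fraction of the row count rather than as an absolute value. The sketch below converts it to an absolute estimate before comparing. It assumes the table lives in the public schema (as the index definition in the question suggests); the output column names are hypothetical.

```sql
-- Planner's view: convert a negative (fractional) n_distinct into an absolute number.
SELECT s.n_distinct,
       CASE WHEN s.n_distinct < 0
            THEN -s.n_distinct * c.reltuples
            ELSE s.n_distinct
       END AS planner_distinct_estimate
FROM pg_stats AS s
JOIN pg_class AS c ON c.oid = 'public.tracking'::regclass
WHERE s.schemaname = 'public'
  AND s.tablename  = 'tracking'
  AND s.attname    = 'worker_id';

-- Exact count for comparison; this reads all ~118 million rows, so expect it to be slow.
SELECT count(DISTINCT worker_id) AS actual_distinct FROM tracking;
```

If the two numbers differ by an order of magnitude or more, setting n_distinct by hand as described above is worth trying.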
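If you want to nudge PostgreSQL towards a nested loop join, try the following:
- Create an index on tracking that can be used for an index-only scan: CREATE INDEX ON tracking (worker_id) INCLUDE (localdatetime); Make sure that tracking is VACUUMed often, so that an index-only scan is effective.
- Reduce random_page_cost and increase effective_cache_size so that the optimizer prices index scans lower (but don't use insane values).
- Make sure that your estimates on project_worker are accurate: ALTER TABLE project_worker ALTER project_id SET STATISTICS 1000; ANALYZE project_worker;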
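The cost settings can be tried out for a single transaction before changing anything globally. This is a minimal sketch: the two values are illustrative assumptions, not recommendations, and should be chosen to match the actual storage and available memory.

```sql
BEGIN;

-- Illustrative values only; pick numbers that fit your hardware.
SET LOCAL random_page_cost = 1.1;
SET LOCAL effective_cache_size = '8GB';

-- Re-run the original query and check whether the plan changes.
EXPLAIN (ANALYZE, BUFFERS)
SELECT worker_id, localdatetime
FROM tracking t
JOIN project_worker pw ON t.worker_id = pw.id
WHERE project_id = 68475018;

ROLLBACK;  -- SET LOCAL is reverted when the transaction ends
```

Because SET LOCAL only lasts until the end of the transaction, other sessions keep using the server's configured values while you experiment.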