Problem description
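I have two tables, one for profiles and one for the employment status of each profile. The two tables have a one-to-one relationship, and a profile may have no employment status at all. The table schemas are as follows (irrelevant columns removed for clarity):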
create type employment_status as enum ('claimed', 'approved', 'denied');

create table if not exists profiles
(
    id bigserial not null
        constraint profiles_pkey
            primary key
);

create table if not exists employments
(
    id bigserial not null
        constraint employments_pkey
            primary key,
    status employment_status not null,
    profile_id bigint not null
        constraint fk_rails_d95865cd58
            references profiles
            on delete cascade
);

create unique index if not exists index_employments_on_profile_id
    on employments (profile_id);
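With these tables, I was asked to list all unemployed profiles. An unemployed profile is defined as one that either has no employment record or has an employment whose status is anything other than "approved". My first attempt was this query: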
SELECT * FROM "profiles"
LEFT JOIN employments ON employments.profile_id = profiles.id
WHERE employments.status != 'approved'
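The assumption here was that every profile would be listed together with its employment, and that I could then filter the rows with the WHERE condition. Any profile without an employment record would have a NULL employment status and would therefore pass the condition. However, this query does not return the profiles that have no employment.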
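The missing rows follow from SQL's three-valued logic: comparing NULL with != yields NULL rather than true, and WHERE keeps only rows for which the condition is true. A minimal sketch, assuming PostgreSQL and the schema above; rewriting the filter with IS DISTINCT FROM is one possible fix shown for illustration, not something the original query used:

```sql
-- A NULL status compared with <> gives NULL, not true, so WHERE drops the row.
SELECT NULL::employment_status <> 'approved'               AS plain_comparison, -- NULL
       NULL::employment_status IS DISTINCT FROM 'approved' AS null_safe;        -- true

-- IS DISTINCT FROM treats NULL as an ordinary value, so profiles without
-- an employment row survive the filter.
SELECT profiles.*
FROM profiles
LEFT JOIN employments ON employments.profile_id = profiles.id
WHERE employments.status IS DISTINCT FROM 'approved';
```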
After some research, I found this post, which explains why it does not work, and I transformed my query accordingly:
SELECT *
FROM profiles
LEFT JOIN employments ON profiles.id = employments.profile_id and employments.status != 'approved';
This one actually works. However, my ORM generates a slightly different query, which does not work:
SELECT profiles.* FROM "profiles"
LEFT JOIN employments ON employments.profile_id = profiles.id AND employments.status != 'approved'
The only difference is the SELECT clause. To understand why such a small change makes such a difference, I ran EXPLAIN ANALYZE on all three queries:
EXPLAIN ANALYZE SELECT * FROM "profiles"
LEFT JOIN employments ON employments.profile_id = profiles.id
WHERE employments.status != 'approved'
Hash Join (cost=14.28..37.13 rows=846 width=452) (actual time=0.025..0.027 rows=2 loops=1)
  Hash Cond: (e.profile_id = profiles.id)
  -> Seq Scan on employments e (cost=0.00..20.62 rows=846 width=68) (actual time=0.008..0.009 rows=2 loops=1)
        Filter: (status <> 'approved'::employment_status)
        Rows Removed by Filter: 1
  -> Hash (cost=11.90..11.90 rows=190 width=384) (actual time=0.007..0.007 rows=8 loops=1)
        Buckets: 1024 Batches: 1 Memory Usage: 12kB
        -> Seq Scan on profiles (cost=0.00..11.90 rows=190 width=384) (actual time=0.003..0.004 rows=8 loops=1)
Planning Time: 0.111 ms
Execution Time: 0.053 ms
EXPLAIN ANALYZE SELECT *
FROM profiles
LEFT JOIN employments ON profiles.id = employments.profile_id and employments.status != 'approved';
Hash Right Join (cost=14.28..37.13 rows=846 width=452) (actual time=0.036..0.042 rows=8 loops=1)
  Hash Cond: (employments.profile_id = profiles.id)
  -> Seq Scan on employments (cost=0.00..20.62 rows=846 width=68) (actual time=0.005..0.005 rows=2 loops=1)
        Filter: (status <> 'approved'::employment_status)
        Rows Removed by Filter: 1
  -> Hash (cost=11.90..11.90 rows=190 width=384) (actual time=0.015..0.015 rows=8 loops=1)
        Buckets: 1024 Batches: 1 Memory Usage: 12kB
        -> Seq Scan on profiles (cost=0.00..11.90 rows=190 width=384) (actual time=0.010..0.011 rows=8 loops=1)
Planning Time: 0.106 ms
Execution Time: 0.108 ms
EXPLAIN ANALYZE SELECT profiles.* FROM "profiles"
LEFT JOIN employments ON employments.profile_id = profiles.id AND employments.status != 'approved'
Seq Scan on profiles (cost=0.00..11.90 rows=190 width=384) (actual time=0.006..0.007 rows=8 loops=1)
Planning Time: 0.063 ms
Execution Time: 0.016 ms
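The first and second query plans are almost identical, except that one uses a hash join and the other a hash right join, while the last query does not even perform the join or evaluate the condition.
For comparison, here is a fourth variant that keeps only the key match in the ON clause and tests for a missing employment row in WHERE: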
EXPLAIN ANALYZE SELECT profiles.* FROM profiles
LEFT JOIN employments ON employments.profile_id = profiles.id
WHERE (employments.id IS NULL OR employments.status != 'approved')
Hash Right Join (cost=14.28..35.02 rows=846 width=384) (actual time=0.021..0.026 rows=7 loops=1)
  Hash Cond: (employments.profile_id = profiles.id)
  Filter: ((employments.id IS NULL) OR (employments.status <> 'approved'::employment_status))
  Rows Removed by Filter: 1
  -> Seq Scan on employments (cost=0.00..18.50 rows=850 width=20) (actual time=0.002..0.003 rows=3 loops=1)
  -> Hash (cost=11.90..11.90 rows=190 width=384) (actual time=0.011..0.011 rows=8 loops=1)
        Buckets: 1024 Batches: 1 Memory Usage: 12kB
        -> Seq Scan on profiles (cost=0.00..11.90 rows=190 width=384) (actual time=0.007..0.008 rows=8 loops=1)
Planning Time: 0.104 ms
Execution Time: 0.049 ms
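A LEFT JOIN keeps every row of the left table regardless of the ON condition; the condition only decides which right-hand rows get attached. That is consistent with the second plan returning one row per profile (rows=8). A small self-contained sketch with made-up inline data (the CTE names p and e are hypothetical stand-ins for profiles and employments, with status as plain text):

```sql
-- Hypothetical data: profile 1 is approved, profile 2 is denied,
-- profile 3 has no employment row.
WITH p(id) AS (VALUES (1), (2), (3)),
     e(profile_id, status) AS (VALUES (1, 'approved'), (2, 'denied'))
SELECT p.id, e.status
FROM p
LEFT JOIN e ON e.profile_id = p.id AND e.status <> 'approved'
ORDER BY p.id;
-- 1 | NULL    (the approved row fails the ON condition, but profile 1 is kept)
-- 2 | denied
-- 3 | NULL
```

Because no left-hand row is ever removed this way, a filter that must exclude profiles has to live in WHERE (or in an anti-join), not in ON.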
My questions about this are:
Edit:
With the following sample data, the intended query should return profiles 2 and 3.
insert into profiles values (1);
insert into profiles values (2);
insert into profiles values (3);
insert into employments (profile_id,status) values (1,'approved');
insert into employments (profile_id,status) values (2,'denied');
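One way to check the expected result against this sample data is a NOT EXISTS anti-join, which expresses the filter directly instead of through an outer join. A sketch, assuming the schema above and only these five inserts in otherwise empty tables:

```sql
-- Keep every profile that has no approved employment row.
SELECT profiles.id
FROM profiles
WHERE NOT EXISTS (
    SELECT 1
    FROM employments
    WHERE employments.profile_id = profiles.id
      AND employments.status = 'approved'
)
ORDER BY profiles.id;
-- returns 2 and 3
```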
Solution
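There must be a unique or primary key constraint on employments.profile_id (or it must be a view with an appropriate DISTINCT clause), so that the optimizer knows that at most one row in employments is related to a given row in profiles.
If that is the case and you don't use any columns of employments in the SELECT list, the optimizer deduces that the join is redundant and need not be computed, which makes for a simpler and faster execution plan.
See the comment on join_is_removable in src/backend/optimizer/plan/analyzejoins.c.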
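The effect can be observed with plain EXPLAIN. A sketch assuming the schema from the question; the plan shapes described in the comments follow the plans quoted in the question and may differ with other data, statistics, or PostgreSQL versions:

```sql
-- No employments column is used outside the join and profile_id is unique,
-- so the planner can drop the join and scan only profiles.
EXPLAIN
SELECT profiles.*
FROM profiles
LEFT JOIN employments
       ON employments.profile_id = profiles.id
      AND employments.status <> 'approved';

-- Referencing an employments column in the select list makes the join
-- necessary again, so it should stay in the plan.
EXPLAIN
SELECT profiles.*, employments.status
FROM profiles
LEFT JOIN employments
       ON employments.profile_id = profiles.id
      AND employments.status <> 'approved';
```

Removing the join does not change the result: with at most one matching employments row per profile, the LEFT JOIN yields exactly one row per profile whether or not it is executed.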