PostgreSQL count distinct: improving performance

Problem description

I have a database on Google Cloud SQL. It contains a simple table that looks like this:

url_id user_id

url_id is a string containing an integer, and user_id is a 14-character string. I have an index on url_id:

CREATE INDEX index_test ON table1 (url_id);

The query I want to run counts the distinct user_id values whose url_id is not in a given list of IDs. I do it like this:

 SET work_mem='4GB';
 select count(*) from (select distinct afficheW from table1 where url_id != '1880' and url_id != '2022' and url_id != '1963' and url_id != '11' and url_id != '32893' and url_id != '19' ) t ;

Result:

 count  
---------
 1242298
(1 row)

Time: 2118,917 ms

The table contains 1.8 million rows. Is there any way to make this type of query faster?

Solutions

Try writing it as:

select count(distinct afficheW)
from table1
where url_id not in (1880, 2022, 1963, 11, 32893, 19);

(This assumes that url_id really is a number, not a string.)

Then add an index on url_id.

That said, counting more than a million items from a table in under two seconds is not bad.

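Unless your WHERE condition eliminates most of the rows and you can use a partial index, the most promising index would be on (affichew, url_id). That way the query can use an index-only scan, filter on url_id without visiting the table, and fetch the rows already in the right order to apply uniqueness to them, without having to sort or hash.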
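As a rough sketch of that suggestion: the index name below is made up, and the literals are quoted because the question says url_id is stored as a string. Whether the planner actually chooses an index-only scan depends on the PostgreSQL version, the table statistics, and how recently the table was vacuumed, so it is worth checking the plan rather than assuming it.

```sql
-- Hypothetical index name; afficheW comes first so rows arrive ordered by it.
CREATE INDEX table1_affichew_url_id_idx ON table1 (afficheW, url_id);

-- Index-only scans rely on an up-to-date visibility map.
VACUUM ANALYZE table1;

-- Inspect the plan and the real run time of the query from the question.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM (
    SELECT DISTINCT afficheW
    FROM table1
    WHERE url_id NOT IN ('1880', '2022', '1963', '11', '32893', '19')
) t;
```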

Also, writing it with NOT IN is slightly faster than using a list of ANDed != conditions.
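For the table as described in the question, where url_id is a string, the NOT IN list would need quoted literals, since PostgreSQL has no operator for comparing text with integer. A minimal version, assuming the column type is text or varchar:

```sql
-- Same filter as the chain of != conditions in the question,
-- written as NOT IN with string literals.
SELECT count(DISTINCT afficheW)
FROM table1
WHERE url_id NOT IN ('1880', '2022', '1963', '11', '32893', '19');
```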

You could try a single-level count distinct query here:

select count(distinct afficheW)
from table1
where url_id != '1880' and url_id != '2022' and url_id != '1963' and
      url_id != '11' and url_id != '32893' and url_id != '19';

This at least avoids the explicit outer count query, which does not need to be there.

An alternative is to use GROUP BY instead of DISTINCT:

select
    afficheW,count(*)
from
    table1
where
    url_id not in (1880,32893,19)
group by afficheW;

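In this case you will most likely need a single multicolumn index on afficheW and url_id (as @jjanes and @GordonLinoff suggest). I think url_id should be the first column of this multicolumn index, because you have an explicit condition on it in the WHERE clause.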
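A sketch of this variant, with a made-up index name and quoted literals because the question stores url_id as a string. Note that the GROUP BY query returns one row per afficheW, so to get the single number asked for in the question it would have to be wrapped in an outer count:

```sql
-- Hypothetical index name; url_id comes first, as argued above.
CREATE INDEX table1_url_id_affichew_idx ON table1 (url_id, afficheW);

-- Count the groups to obtain the number of distinct afficheW values.
SELECT count(*)
FROM (
    SELECT afficheW
    FROM table1
    WHERE url_id NOT IN ('1880', '2022', '1963', '11', '32893', '19')
    GROUP BY afficheW
) t;
```

Which column order wins is best settled by comparing EXPLAIN ANALYZE output for both indexes on the real data.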

If the performance of this query is critical, you can use a partial index on afficheW, restricted to the rows where url_id satisfies your WHERE clause.
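A partial index along those lines could look like the following. The index name is hypothetical, and this only pays off if the list of excluded IDs is fixed, because the index predicate is baked in at creation time.

```sql
-- Hypothetical index name; only rows passing the filter are indexed.
CREATE INDEX table1_affichew_partial_idx
    ON table1 (afficheW)
    WHERE url_id NOT IN ('1880', '2022', '1963', '11', '32893', '19');

-- The query repeats the same predicate so the planner can match it
-- against the index definition.
SELECT count(*)
FROM (
    SELECT DISTINCT afficheW
    FROM table1
    WHERE url_id NOT IN ('1880', '2022', '1963', '11', '32893', '19')
) t;
```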

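Like @GordonLinoff, I am also assuming that url_id is numeric (or perhaps it should be, to save disk space and improve performance), and I am also using not in (...) as a more readable way of writing multiple != conditions.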
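If converting the column is an option, one possible way to do it is sketched below. It assumes every existing url_id value is a valid integer (the statement fails otherwise), and it rewrites the table under an exclusive lock, so it is better tried on a copy first.

```sql
-- Convert url_id from a string type to integer in place.
-- Indexes on the column are rebuilt as part of the statement.
ALTER TABLE table1
    ALTER COLUMN url_id TYPE integer
    USING url_id::integer;

-- After the conversion, unquoted numeric literals work as written above.
SELECT count(DISTINCT afficheW)
FROM table1
WHERE url_id NOT IN (1880, 2022, 1963, 11, 32893, 19);
```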

See also:

On column order in multicolumn indexes (with benchmarks): Multicolumn index and performance
