Problem description
Adding a GiST index actually seems to make a K-nearest-neighbor (KNN) ORDER BY query on a cube column in PostgreSQL worse. Why is that, and what can we do about it?
Here is what I mean. In a PostgreSQL database I have a table whose DDL is create table sample (id serial primary key, title text, embedding cube), where the embedding column holds an embedding vector of title obtained with a Google language model. The cube data type is provided by the cube extension, which I have installed. Incidentally, these are titles of Wikipedia articles. In any case, there are 1 million records. I then perform a KNN query with the query below. It defines distance using the Euclidean distance operator <->, but results are similar with the other two metrics. It does an ORDER BY and applies a LIMIT to find the 10 Wikipedia articles with the most "similar" titles (the most similar being the target title itself). Everything works correctly.
select sample.title,sample.embedding <-> cube('(0.18936706,-0.12455666,-0.31581765,0.0192692,-0.07364611,0.07851536,0.0290586,-0.02582532,-0.03378124,-0.10564457,-0.03903799,0.08668878,-0.15357816,-0.17793414,-0.01826405,0.01969068,0.11386908,0.1555583,0.09368557,0.13697313,-0.05610929,-0.06536788,-0.12212707,0.26356605,-0.06004387,-0.01966437,-0.1250324,-0.16645767,-0.13525756,0.22482251,-0.1709727,0.28966117,-0.07927769,-0.02498624,-0.10018375,-0.10923951,0.04770213,0.11573371,0.04619929,0.05216618,0.19176421,0.12948817,0.08719034,-0.16109011,-0.02411379,-0.05638905,-0.37334979,0.31225419,0.0744801,0.27044332)') distance from sample order by distance limit 10;
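For intuition, the <-> operator between two point cubes is plain Euclidean distance, which is easy to sanity-check on a toy example (a sketch assuming the cube extension is installed; the real embeddings are 50-dimensional):

```sql
-- cube('(x, y, z)') builds a zero-volume cube, i.e. a point.
-- Distance from (0,0,0) to (3,4,0) is sqrt(3^2 + 4^2) = 5.
SELECT cube('(0, 0, 0)') <-> cube('(3, 4, 0)') AS distance;
-- distance = 5
```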
What puzzles me, however, is that if I put a GiST index on the embedding column, query performance actually gets worse. After adding the index, the query plan changes as expected, in the sense that it uses the index. But... the query gets slower!
This seems to contradict the cube documentation, which states:
In addition, a cube GiST index can be used to find nearest neighbors using the metric operators <->, <#>, and <=> in ORDER BY clauses, for example
SELECT c FROM test ORDER BY c <-> cube(array[0.5,0.5,0.5]) LIMIT 1
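For reference, the three metric operators the documentation mentions can each drive an index-assisted ORDER BY; a sketch using the documentation's own test table:

```sql
-- Euclidean (straight-line) distance
SELECT c FROM test ORDER BY c <-> cube(array[0.5, 0.5, 0.5]) LIMIT 1;
-- Taxicab (L1, "city block") distance
SELECT c FROM test ORDER BY c <#> cube(array[0.5, 0.5, 0.5]) LIMIT 1;
-- Chebyshev (L-inf, maximum coordinate) distance
SELECT c FROM test ORDER BY c <=> cube(array[0.5, 0.5, 0.5]) LIMIT 1;
```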
With the index, the plan and results are:

                                QUERY PLAN
-----------------------------------------------------------------------------
Limit (cost=0.41..6.30 rows=10 width=29)
-> Index Scan using sample_embedding_idx on sample (cost=0.41..589360.33 rows=999996 width=29)
Order By: (embedding <-> '(0.18936706,0.27044332)'::cube)
(3 rows)
title | distance
----------------------+--------------------
david petrarca | 0.5866321762629475
david adamski | 0.5866321762629475
richard ansdell | 0.6239883862603475
linda darke | 0.6392124797481789
ilias tsiliggiris | 0.6996660649119893
watson,jim | 0.7059481479504834
sk radni%c4%8dki | 0.71718948226995
burnham,pa | 0.7384858030758069
arthur (europa-park) | 0.7468462897336924
ivan kecojevic | 0.7488206082281348
(10 rows)
Time: 1226.457 ms (00:01.226)
After dropping the index:

                                QUERY PLAN
-----------------------------------------------------------------------------
Limit (cost=74036.32..74037.48 rows=10 width=29)
-> Gather Merge (cost=74036.32..171264.94 rows=833330 width=29)
Workers Planned: 2
-> Sort (cost=73036.29..74077.96 rows=416665 width=29)
Sort Key: ((embedding <-> '(0.18936706,0.27044332)'::cube))
-> Parallel Seq Scan on sample (cost=0.00..64032.31 rows=416665 width=29)
(6 rows)
title | distance
----------------------+--------------------
david petrarca | 0.5866321762629475
david adamski | 0.5866321762629475
richard ansdell | 0.6239883862603475
linda darke | 0.6392124797481789
ilias tsiliggiris | 0.6996660649119893
 watson,jim           | 0.7059481479504834
 sk radni%c4%8dki     | 0.71718948226995
 burnham,pa           | 0.7384858030758069
arthur (europa-park) | 0.7468462897336924
ivan kecojevic | 0.7488206082281348
(10 rows)
Time: 381.419 ms
Note:
- With index: 1226.457 ms
- Without index: 381.419 ms
Very puzzling behavior! All of this is documented in a GitHub repo so that others can try it. I will add documentation on how the embedding vectors were generated, but that should not be necessary, since the quickstart shows that the precomputed embeddings can be downloaded from my Google Drive folder.
Addendum
The output of explain (analyze, buffers) was requested in the comments below. Here it is.
pgbench=# create index on sample using gist (embedding) include (title);
CREATE INDEX
Time: 51966.315 ms (00:51.966)
pgbench=#
QUERY PLAN
-----------------------------------------------------------------------------
Limit (cost=0.41..4.15 rows=10 width=29) (actual time=3215.956..3216.667 rows=10 loops=1)
Buffers: shared hit=1439 read=87004 written=7789
-> Index Only Scan using sample_embedding_title_idx on sample (cost=0.41..373768.39 rows=999999 width=29) (actual time=3215.932..3216.441 rows=10 loops=1)
Order By: (embedding <-> '(0.18936706,0.27044332)'::cube)
Heap Fetches: 0
Buffers: shared hit=1439 read=87004 written=7789
Planning:
Buffers: shared hit=14 read=6 dirtied=2
Planning Time: 0.432 ms
Execution Time: 3316.266 ms
(10 rows)
Time: 3318.333 ms (00:03.318)
pgbench=# drop index sample_embedding_title_idx;
DROP INDEX
Time: 182.324 ms
pgbench=#
QUERY PLAN
-----------------------------------------------------------------------------
Limit (cost=74036.35..74037.52 rows=10 width=29) (actual time=6052.845..6057.210 rows=10 loops=1)
Buffers: shared hit=70 read=58830
-> Gather Merge (cost=74036.35..171265.21 rows=833332 width=29) (actual time=6052.825..6057.021 rows=10 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=70 read=58830
-> Sort (cost=73036.33..74077.99 rows=416666 width=29) (actual time=6002.928..6003.019 rows=8 loops=3)
Sort Key: ((embedding <-> '(0.18936706,0.27044332)'::cube))
Sort Method: top-N heapsort Memory: 26kB
Buffers: shared hit=70 read=58830
Worker 0: Sort Method: top-N heapsort Memory: 26kB
Worker 1: Sort Method: top-N heapsort Memory: 26kB
-> Parallel Seq Scan on sample (cost=0.00..64032.33 rows=416666 width=29) (actual time=0.024..3090.103 rows=333333 loops=3)
Buffers: shared read=58824
Planning:
Buffers: shared hit=3 read=3 dirtied=1
Planning Time: 0.129 ms
Execution Time: 6057.388 ms
(18 rows)
Time: 6053.284 ms (00:06.053)
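As an alternative to dropping and recreating the index between runs, the two plans can be compared in the same session with PostgreSQL's standard planner toggles (a sketch; enable_indexscan and enable_indexonlyscan are session-level settings, and the elided cube literal stands in for the full 50-dimensional vector used above):

```sql
-- Keep the index, but disable index scans for this session so the
-- planner falls back to the parallel seq scan + top-N sort plan.
SET enable_indexscan = off;
SET enable_indexonlyscan = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT title, embedding <-> cube('(0.18936706, ..., 0.27044332)') AS distance
FROM sample
ORDER BY distance
LIMIT 10;  -- substitute the full 50-dimensional literal from the query above

RESET enable_indexscan;
RESET enable_indexonlyscan;
```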