postgresql 三元组查询不使用索引

snvhrwxg  于 2023-08-04  发布在  PostgreSQL
关注(0)|答案(1)|浏览(135)

我正在尝试使用pg_trgm扩展在Postgres中进行批量相似性查询。我在列上设置了一个GIN索引,但似乎从未使用过。此外,文档似乎并不意味着GiST索引也会有很大帮助。
在我的查询中,我有一堆名字,我想做一个模糊匹配,看看数据库中的名字是相似的:
表:

CREATE TABLE people(
  id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  name TEXT NOT NULL
);

字符串
表尺寸:

SELECT COUNT(*) FROM public.people;
-- 51436173


索引:

CREATE INDEX trgm_idx ON public.people USING GIST (name public.gist_trgm_ops);


我的查询需要一组名称来搜索:

EXPLAIN (ANALYZE, VERBOSE)
SELECT id, people.name, keys.name, similarity(people.name, keys.name)
FROM people JOIN (VALUES ('alice'), ('bob'), ('charlie')) keys(name)
ON similarity(people.name, keys.name) >= 0.75;


解释:

+----------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                        |
+----------------------------------------------------------------------------------------------------------------------------------+
|Nested Loop  (cost=0.00..3742858.40 rows=51427344 width=67) (actual time=3415.321..383741.953 rows=73 loops=1)                    |
|  Output: people.id, people.name, "*VALUES*".column1, similarity(people.name, "*VALUES*".column1)                                 |
|  Join Filter: (similarity(people.name, "*VALUES*".column1) >= '0.75'::double precision)                                          |
|  Rows Removed by Join Filter: 154308446                                                                                          |
|  ->  Seq Scan on public.people  (cost=0.00..914354.44 rows=51427344 width=31) (actual time=1.921..9403.147 rows=51436173 loops=1)|
|        Output: people.id, people.name                                                                                            |
|  ->  Materialize  (cost=0.00..0.05 rows=3 width=32) (actual time=0.000..0.000 rows=3 loops=51436173)                             |
|        Output: "*VALUES*".column1                                                                                                |
|        ->  Values Scan on "*VALUES*"  (cost=0.00..0.04 rows=3 width=32) (actual time=0.075..0.081 rows=3 loops=1)                |
|              Output: "*VALUES*".column1                                                                                          |
|Planning Time: 3.580 ms                                                                                                           |
|JIT:                                                                                                                              |
|  Functions: 6                                                                                                                    |
|  Options: Inlining true, Optimization true, Expressions true, Deforming true                                                     |
|  Timing: Generation 12.076 ms, Inlining 9.587 ms, Optimization 24.373 ms, Emission 15.302 ms, Total 61.338 ms                    |
|Execution Time: 383756.056 ms                                                                                                     |
+----------------------------------------------------------------------------------------------------------------------------------+


内联VALUES表将包含所有我想模糊搜索的名字。我遇到的问题是查询非常慢,并且对整个people表进行并行顺序扫描。我的问题是如何让Postgres在大多数查询中使用GIN(或GiST索引)?我已经对表进行了VACUUM ANALYZE,该表包含了几百万行。
编辑:来自Laurenz Albe的回答的质疑和解释:

BEGIN;
SET LOCAL enable_seqscan = off;
SET LOCAL pg_trgm.similarity_threshold = 0.75;
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, people.name, keys.name, similarity(keys.name, people.name)
FROM people
    JOIN (VALUES ('alice'), ('bob'), ('charlie'), ('dani')) keys(name)
      ON people.name % keys.name;
COMMIT;

+--------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                            |
+--------------------------------------------------------------------------------------------------------------------------------------+
|Nested Loop  (cost=4758.47..3668792.68 rows=2058049 width=67) (actual time=2419.249..32730.893 rows=75 loops=1)                       |
|  Buffers: shared hit=13120 read=144894                                                                                               |
|  ->  Values Scan on "*VALUES*"  (cost=0.00..0.05 rows=4 width=32) (actual time=0.461..0.745 rows=4 loops=1)                          |
|  ->  Bitmap Heap Scan on people  (cost=4758.47..910766.76 rows=514512 width=31) (actual time=5615.752..8182.340 rows=19 loops=4)     |
|        Recheck Cond: (name % "*VALUES*".column1)                                                                                     |
|        Rows Removed by Index Recheck: 1122368                                                                                        |
|        Heap Blocks: exact=102969 lossy=33951                                                                                         |
|        Buffers: shared hit=13120 read=144894                                                                                         |
|        ->  Bitmap Index Scan on trgm_idx2  (cost=0.00..4629.84 rows=514512 width=0) (actual time=710.005..710.007 rows=37618 loops=4)|
|              Index Cond: (name % "*VALUES*".column1)                                                                                 |
|              Buffers: shared hit=12793 read=8286                                                                                     |
|Planning:                                                                                                                             |
|  Buffers: shared read=1                                                                                                              |
|Planning Time: 17.039 ms                                                                                                              |
|JIT:                                                                                                                                  |
|  Functions: 5                                                                                                                        |
|  Options: Inlining true, Optimization true, Expressions true, Deforming true                                                         |
|  Timing: Generation 38.873 ms, Inlining 75.855 ms, Optimization 163.283 ms, Emission 156.729 ms, Total 434.740 ms                    |
|Execution Time: 32787.630 ms                                                                                                          |
+--------------------------------------------------------------------------------------------------------------------------------------+

laximzn5

laximzn51#

如果你想使用索引,你必须重写查询:

BEGIN;

SET LOCAL pg_trgm.similarity_threshold = 0.75;

SELECT id, people.name, keys.name, similarity(people.name, keys.name)
FROM people
   JOIN (VALUES ('alice'), ('bob'), ('charlie')) keys(name)
      ON people.name % keys.name; 

COMMIT;

字符串

相关问题