我正在尝试使用pg_trgm扩展在Postgres中进行批量相似性查询。我在列上设置了一个GIN索引,但似乎从未使用过。此外,文档似乎并不意味着GiST索引也会有很大帮助。
在我的查询中,我有一堆名字,我想做一个模糊匹配,看看数据库中的名字是相似的:
表:
CREATE TABLE people(
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
name TEXT NOT NULL
);
字符串
表尺寸:
SELECT COUNT(*) FROM public.people;
-- 51436173
型
索引:
CREATE INDEX trgm_idx ON public.people USING GIST (name public.gist_trgm_ops);
型
我的查询需要一组名称来搜索:
EXPLAIN (ANALYZE, VERBOSE)
SELECT id, people.name, keys.name, similarity(people.name, keys.name)
FROM people JOIN (VALUES ('alice'), ('bob'), ('charlie')) keys(name)
ON similarity(people.name, keys.name) >= 0.75;
型
解释:
+----------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN |
+----------------------------------------------------------------------------------------------------------------------------------+
|Nested Loop (cost=0.00..3742858.40 rows=51427344 width=67) (actual time=3415.321..383741.953 rows=73 loops=1) |
| Output: people.id, people.name, "*VALUES*".column1, similarity(people.name, "*VALUES*".column1) |
| Join Filter: (similarity(people.name, "*VALUES*".column1) >= '0.75'::double precision) |
| Rows Removed by Join Filter: 154308446 |
| -> Seq Scan on public.people (cost=0.00..914354.44 rows=51427344 width=31) (actual time=1.921..9403.147 rows=51436173 loops=1)|
| Output: people.id, people.name |
| -> Materialize (cost=0.00..0.05 rows=3 width=32) (actual time=0.000..0.000 rows=3 loops=51436173) |
| Output: "*VALUES*".column1 |
| -> Values Scan on "*VALUES*" (cost=0.00..0.04 rows=3 width=32) (actual time=0.075..0.081 rows=3 loops=1) |
| Output: "*VALUES*".column1 |
|Planning Time: 3.580 ms |
|JIT: |
| Functions: 6 |
| Options: Inlining true, Optimization true, Expressions true, Deforming true |
| Timing: Generation 12.076 ms, Inlining 9.587 ms, Optimization 24.373 ms, Emission 15.302 ms, Total 61.338 ms |
|Execution Time: 383756.056 ms |
+----------------------------------------------------------------------------------------------------------------------------------+
型
内联VALUES表将包含所有我想模糊搜索的名字。我遇到的问题是查询非常慢,并且对整个people
表进行并行顺序扫描。我的问题是如何让Postgres在大多数查询中使用GIN(或GiST索引)?我已经对表进行了VACUUM ANALYZE,该表包含了几百万行。
编辑:来自Laurenz Albe的回答的质疑和解释:
BEGIN;
SET LOCAL enable_seqscan = off;
SET LOCAL pg_trgm.similarity_threshold = 0.75;
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, people.name, keys.name, similarity(keys.name, people.name)
FROM people
JOIN (VALUES ('alice'), ('bob'), ('charlie'), ('dani')) keys(name)
ON people.name % keys.name;
COMMIT;
+--------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN |
+--------------------------------------------------------------------------------------------------------------------------------------+
|Nested Loop (cost=4758.47..3668792.68 rows=2058049 width=67) (actual time=2419.249..32730.893 rows=75 loops=1) |
| Buffers: shared hit=13120 read=144894 |
| -> Values Scan on "*VALUES*" (cost=0.00..0.05 rows=4 width=32) (actual time=0.461..0.745 rows=4 loops=1) |
| -> Bitmap Heap Scan on people (cost=4758.47..910766.76 rows=514512 width=31) (actual time=5615.752..8182.340 rows=19 loops=4) |
| Recheck Cond: (name % "*VALUES*".column1) |
| Rows Removed by Index Recheck: 1122368 |
| Heap Blocks: exact=102969 lossy=33951 |
| Buffers: shared hit=13120 read=144894 |
| -> Bitmap Index Scan on trgm_idx2 (cost=0.00..4629.84 rows=514512 width=0) (actual time=710.005..710.007 rows=37618 loops=4)|
| Index Cond: (name % "*VALUES*".column1) |
| Buffers: shared hit=12793 read=8286 |
|Planning: |
| Buffers: shared read=1 |
|Planning Time: 17.039 ms |
|JIT: |
| Functions: 5 |
| Options: Inlining true, Optimization true, Expressions true, Deforming true |
| Timing: Generation 38.873 ms, Inlining 75.855 ms, Optimization 163.283 ms, Emission 156.729 ms, Total 434.740 ms |
|Execution Time: 32787.630 ms |
+--------------------------------------------------------------------------------------------------------------------------------------+
型
1条答案
按热度按时间laximzn51#
如果你想使用索引,你必须重写查询:
字符串