我有一个1000多万的PostgreSQL(v16)表,存储了一个人的地址详细信息。我必须找到与输入地址相似并超过阈值的地址。
我们已经尝试了模块pg_trgm
,但是大量的匹配记录需要时间。
select *
from address
where ARRAY['val'] && pentagram
and similarity(address,<input_string>) > THRESHOLD
字符串
address
:地址列pentagram
:具有GIN索引的列,其标记长度为3。(使用show_tgrm()
生成。)- 在
ARRAY['val']
中,我们放置了一个长度为3的标记,它是为输入字符串生成的。 - 我们正在为输入字符串生成的每个标记执行上述查询。
- 我们正在整理来自每个令牌查询的地址。
explain(costs,buffers,verbose,analyze)
select similarity('no 25 nethaji cly 9th cross st wst velachery velchry chennai tn 600042',address)
from address
where ARRAY['vel'] <@ pentagram;
Bitmap Heap Scan on public.address (cost=2296.03..416822.75 rows=398205 width=4) (actual time=136.191..13181.323 rows=398415 loops=1)
Output: similarity('no 25 nethaji cly 9th cross st wst velachery velchry chennai tn 600042'::text, complete_address)
Recheck Cond: ('{vel}'::text[] <@ address.pentagram)
Heap Blocks: exact=357144
Buffers: shared hit=357254
-> Bitmap Index Scan on pentgram_idx (cost=0.00..2196.48 rows=398205 width=0) (actual time=57.975..57.976 rows=398415 loops=1)
Index Cond: (address.pentagram @> '{vel}'::text[])
Buffers: shared hit=110
Query Identifier: 7037675894901615674
Planning:
Buffers: shared hit=1
Planning Time: 0.158 ms
JIT:
Functions: 4
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 0.370 ms, Inlining 0.000 ms, Optimization 0.573 ms, Emission 6.775 ms, Total 7.718 ms
Execution Time: 13216.714 ms
(17 rows)
explain(costs,buffers,verbose,analyze)
select similarity('no 25 nethaji cly 9th cross st wst velachery velchry chennai tn 600042', complete_address)
from address where ARRAY['vel'] <@ pentagram;
Bitmap Heap Scan on public.address (cost=2296.03..416822.75 rows=398205 width=4) (actual time=140.682..6224.179 rows=398415 loops=1)
Output: similarity('no 25 nethaji cly 9th cross st wst velachery velchry chennai tn 600042'::text, complete_address)
Recheck Cond: ('{vel}'::text[] <@ address.pentagram)
Heap Blocks: exact=357144
Buffers: shared hit=357254
-> Bitmap Index Scan on pentgram_idx (cost=0.00..2196.48 rows=398205 width=0) (actual time=64.489..64.490 rows=398415 loops=1)
Index Cond: (address.pentagram @> '{vel}'::text[])
Buffers: shared hit=110
Query Identifier: 7037675894901615674
Planning:
Buffers: shared hit=1
Planning Time: 0.225 ms
JIT:
Functions: 4
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 0.629 ms, Inlining 0.000 ms, Optimization 0.525 ms, Emission 5.670 ms, Total 6.824 ms
Execution Time: 6237.051 ms
(17 rows)
如何在最短时间内找到所有相似的地址?
1条答案
按热度按时间3wabscal1#
你所做的非常奇怪。通常你只是直接在text列上构建索引,而不是显式地存储单个三元组。show_tgrm()基本上只是一个调试工具,我从来没有在生产代码中使用过它。
您显示的查询可能与以下内容相同:
字符串
它应该使用索引
on address using gin (address gin_trgm_ops)
尽管我认为更常见的做法是选择一个你想要的阈值,然后使用%运算符,而不是从查询中选择一个子字符串。