postgresql Postgres查询查找相似字符串

vs91vp4v  于 2024-01-07  发布在  PostgreSQL
关注(0)|答案(1)|浏览(207)

我有一个1000多万的PostgreSQL(v16)表,存储了一个人的地址详细信息。我必须找到与输入地址相似并超过阈值的地址。
我们已经尝试了模块pg_trgm,但是大量的匹配记录需要时间。

select *
from address
where ARRAY['val'] && pentagram
and similarity(address,<input_string>) > THRESHOLD

字符串

  • address:地址列
  • pentagram:具有GIN索引的列,其标记长度为3。(使用show_tgrm()生成。)
  • ARRAY['val']中,我们放置了一个长度为3的标记,它是为输入字符串生成的。
  • 我们正在为输入字符串生成的每个标记执行上述查询。
  • 我们正在整理来自每个令牌查询的地址。
explain(costs,buffers,verbose,analyze)
    select similarity('no 25 nethaji cly 9th cross st wst velachery velchry chennai tn 600042',address)
    from address
    where ARRAY['vel'] <@ pentagram;
Bitmap Heap Scan on public.address  (cost=2296.03..416822.75 rows=398205 width=4) (actual time=136.191..13181.323 rows=398415 loops=1)
   Output: similarity('no 25 nethaji cly 9th cross st wst velachery velchry chennai tn 600042'::text, complete_address)
   Recheck Cond: ('{vel}'::text[] <@ address.pentagram)
   Heap Blocks: exact=357144
   Buffers: shared hit=357254
   ->  Bitmap Index Scan on pentgram_idx  (cost=0.00..2196.48 rows=398205 width=0) (actual time=57.975..57.976 rows=398415 loops=1)
         Index Cond: (address.pentagram @> '{vel}'::text[])
         Buffers: shared hit=110
Query Identifier: 7037675894901615674
Planning:
   Buffers: shared hit=1
Planning Time: 0.158 ms
JIT:
   Functions: 4
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 0.370 ms, Inlining 0.000 ms, Optimization 0.573 ms, Emission 6.775 ms, Total 7.718 ms
Execution Time: 13216.714 ms
(17 rows)
explain(costs,buffers,verbose,analyze)
    select similarity('no 25 nethaji cly 9th cross st wst velachery velchry chennai tn 600042', complete_address)
    from address where ARRAY['vel'] <@ pentagram;
Bitmap Heap Scan on public.address  (cost=2296.03..416822.75 rows=398205 width=4) (actual time=140.682..6224.179 rows=398415 loops=1)
   Output: similarity('no 25 nethaji cly 9th cross st wst velachery velchry chennai tn 600042'::text, complete_address)
   Recheck Cond: ('{vel}'::text[] <@ address.pentagram)
   Heap Blocks: exact=357144
   Buffers: shared hit=357254
   ->  Bitmap Index Scan on pentgram_idx  (cost=0.00..2196.48 rows=398205 width=0) (actual time=64.489..64.490 rows=398415 loops=1)
         Index Cond: (address.pentagram @> '{vel}'::text[])
         Buffers: shared hit=110
Query Identifier: 7037675894901615674
Planning:
   Buffers: shared hit=1
Planning Time: 0.225 ms
JIT:
   Functions: 4
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 0.629 ms, Inlining 0.000 ms, Optimization 0.525 ms, Emission 5.670 ms, Total 6.824 ms
Execution Time: 6237.051 ms
(17 rows)

如何在最短时间内找到所有相似的地址?

3wabscal

3wabscal1#

你所做的非常奇怪。通常你只是直接在text列上构建索引,而不是显式地存储单个三元组。show_tgrm()基本上只是一个调试工具,我从来没有在生产代码中使用过它。
您显示的查询可能与以下内容相同:

select similarity('no 25 nethaji cly 9th cross st wst velachery velchry chennai tn 600042',address) 
  from address where address like '%vel%';

字符串
它应该使用索引on address using gin (address gin_trgm_ops)
尽管我认为更常见的做法是选择一个你想要的阈值,然后使用%运算符,而不是从查询中选择一个子字符串。

相关问题