postgresql 在IMDB数据中快速搜索DVD标题

vsnjm48y  于 2023-06-22  发布在  PostgreSQL
关注(0)|答案(1)|浏览(108)

我想将我的DVD收藏链接到IMDb标题。为此,我将title_basics tsv作为一个表导入Postgres数据库。数据在这里:
https://datasets.imdbws.com/title.basics.tsv.gz
这个表有近1000万行。
然后,我在primaryTitle列上创建一个gin索引:
CREATE INDEX t_gin ON title_basics USING gin(primarytitle gin_trgm_ops);
下一步是通过以下方式查找最佳匹配标题:

SELECT  tconst, primarytitle
FROM title_basics
ORDER BY similarity(primarytitle, 'The Shape of Water 2017') DESC
limit 10
;

此查询需要大约30秒。所以,我想知道有没有什么方法可以提高性能?
下面是查询的执行计划:

═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
 Limit  (cost=284857.53..284858.70 rows=10 width=34) (actual time=27975.751..27976.745 rows=10 loops=1)
   Buffers: shared hit=693 read=142357
   I/O Timings: read=377.006
   ->  Gather Merge  (cost=284857.53..1248938.24 rows=8262972 width=34) (actual time=27975.750..27976.742 rows=10 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=693 read=142357
         I/O Timings: read=377.006
         ->  Sort  (cost=283857.50..294186.22 rows=4131486 width=34) (actual time=27970.078..27970.079 rows=8 loops=3)
               Sort Key: (similarity(primarytitle, 'The Shape of Water 2017'::text)) DESC
               Sort Method: top-N heapsort  Memory: 26kB
               Buffers: shared hit=693 read=142357
               I/O Timings: read=377.006
               Worker 0:  Sort Method: top-N heapsort  Memory: 26kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 26kB
               ->  Parallel Seq Scan on title_basics  (cost=0.00..194577.58 rows=4131486 width=34) (actual time=0.047..26837.286 rows=3305189 loops=3)
                     Buffers: shared hit=577 read=142357
                     I/O Timings: read=377.006
 Planning:
   Buffers: shared hit=3
 Planning Time: 0.123 ms
 Execution Time: 27976.770 ms
(22 rows)

我用的是PostgreSQL 14.3 on aarch64-unknown-linux-gnu, compiled by gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-6), 64-bit

q5iwbnjs

q5iwbnjs1#

使用函数形式不会得到三元组索引的使用情况。你需要使用操作形式。
你可以构建gist三元组索引,然后使用KNN实现:

ORDER BY primarytitle <-> 'The Shape of Water 2017'
LIMIT 10

或者你可以对你愿意容忍的不同程度设定一个截止值,并对结果进行排序,如果符合你截止值的事情较少,则接受少于10个结果:

WHERE primarytitle % 'The Shape of Water 2017'
ORDER BY primarytitle <-> 'The Shape of Water 2017'
LIMIT 10

这可以使用gin或gist索引结构。
我倾向于使用第二种方法,因为我通常发现gin比gist更适合三元组索引。

相关问题