postgresql 在IMDB数据中快速搜索DVD标题

vsnjm48y 于 2023-06-22 发布在 PostgreSQL

关注(0)|答案(1)|浏览(107)

我想将我的DVD收藏链接到IMDb标题。为此，我将title_basics tsv作为一个表导入Postgres数据库。数据在这里：
https://datasets.imdbws.com/title.basics.tsv.gz
这个表有近1000万行。
然后，我在primaryTitle列上创建一个gin索引：
CREATE INDEX t_gin ON title_basics USING gin(primarytitle gin_trgm_ops);
下一步是通过以下方式查找最佳匹配标题：

SELECT  tconst, primarytitle
FROM title_basics
ORDER BY similarity(primarytitle, 'The Shape of Water 2017') DESC
limit 10
;

此查询需要大约30秒。所以，我想知道有没有什么方法可以提高性能？
下面是查询的执行计划：

═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
 Limit  (cost=284857.53..284858.70 rows=10 width=34) (actual time=27975.751..27976.745 rows=10 loops=1)
   Buffers: shared hit=693 read=142357
   I/O Timings: read=377.006
   ->  Gather Merge  (cost=284857.53..1248938.24 rows=8262972 width=34) (actual time=27975.750..27976.742 rows=10 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=693 read=142357
         I/O Timings: read=377.006
         ->  Sort  (cost=283857.50..294186.22 rows=4131486 width=34) (actual time=27970.078..27970.079 rows=8 loops=3)
               Sort Key: (similarity(primarytitle, 'The Shape of Water 2017'::text)) DESC
               Sort Method: top-N heapsort  Memory: 26kB
               Buffers: shared hit=693 read=142357
               I/O Timings: read=377.006
               Worker 0:  Sort Method: top-N heapsort  Memory: 26kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 26kB
               ->  Parallel Seq Scan on title_basics  (cost=0.00..194577.58 rows=4131486 width=34) (actual time=0.047..26837.286 rows=3305189 loops=3)
                     Buffers: shared hit=577 read=142357
                     I/O Timings: read=377.006
 Planning:
   Buffers: shared hit=3
 Planning Time: 0.123 ms
 Execution Time: 27976.770 ms
(22 rows)

我用的是PostgreSQL 14.3 on aarch64-unknown-linux-gnu, compiled by gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-6), 64-bit

postgresql

来源：https://stackoverflow.com/questions/76496765/fast-searching-for-dvd-title-in-imdb-data

1条答案

按热度按时间

q5iwbnjs1#

使用函数形式不会得到三元组索引的使用情况。你需要使用操作形式。
你可以构建gist三元组索引，然后使用KNN实现：

ORDER BY primarytitle <-> 'The Shape of Water 2017'
LIMIT 10

或者你可以对你愿意容忍的不同程度设定一个截止值，并对结果进行排序，如果符合你截止值的事情较少，则接受少于10个结果：

WHERE primarytitle % 'The Shape of Water 2017'
ORDER BY primarytitle <-> 'The Shape of Water 2017'
LIMIT 10

这可以使用gin或gist索引结构。
我倾向于使用第二种方法，因为我通常发现gin比gist更适合三元组索引。

赞(0）回复(0）举报 2023-06-22

我来回答

postgresql 在IMDB数据中快速搜索DVD标题

1条答案

相关问题

热门标签

最新问答