我想将我的DVD收藏链接到IMDb标题。为此,我将title_basics tsv作为一个表导入Postgres数据库。数据在这里:
https://datasets.imdbws.com/title.basics.tsv.gz
这个表有近1000万行。
然后,我在primaryTitle列上创建一个gin索引:CREATE INDEX t_gin ON title_basics USING gin(primarytitle gin_trgm_ops);
下一步是通过以下方式查找最佳匹配标题:
SELECT tconst, primarytitle
FROM title_basics
ORDER BY similarity(primarytitle, 'The Shape of Water 2017') DESC
limit 10
;
此查询需要大约30秒。所以,我想知道有没有什么方法可以提高性能?
下面是查询的执行计划:
═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Limit (cost=284857.53..284858.70 rows=10 width=34) (actual time=27975.751..27976.745 rows=10 loops=1)
Buffers: shared hit=693 read=142357
I/O Timings: read=377.006
-> Gather Merge (cost=284857.53..1248938.24 rows=8262972 width=34) (actual time=27975.750..27976.742 rows=10 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=693 read=142357
I/O Timings: read=377.006
-> Sort (cost=283857.50..294186.22 rows=4131486 width=34) (actual time=27970.078..27970.079 rows=8 loops=3)
Sort Key: (similarity(primarytitle, 'The Shape of Water 2017'::text)) DESC
Sort Method: top-N heapsort Memory: 26kB
Buffers: shared hit=693 read=142357
I/O Timings: read=377.006
Worker 0: Sort Method: top-N heapsort Memory: 26kB
Worker 1: Sort Method: top-N heapsort Memory: 26kB
-> Parallel Seq Scan on title_basics (cost=0.00..194577.58 rows=4131486 width=34) (actual time=0.047..26837.286 rows=3305189 loops=3)
Buffers: shared hit=577 read=142357
I/O Timings: read=377.006
Planning:
Buffers: shared hit=3
Planning Time: 0.123 ms
Execution Time: 27976.770 ms
(22 rows)
我用的是PostgreSQL 14.3 on aarch64-unknown-linux-gnu, compiled by gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-6), 64-bit
1条答案
按热度按时间q5iwbnjs1#
使用函数形式不会得到三元组索引的使用情况。你需要使用操作形式。
你可以构建gist三元组索引,然后使用KNN实现:
或者你可以对你愿意容忍的不同程度设定一个截止值,并对结果进行排序,如果符合你截止值的事情较少,则接受少于10个结果:
这可以使用gin或gist索引结构。
我倾向于使用第二种方法,因为我通常发现gin比gist更适合三元组索引。