Optimizing a PostgreSQL query when joining a large table

rryofs0p · asked on 2022-11-04 · PostgreSQL

The bounty expires in 7 days. Answers to this question are eligible for a +500 reputation bounty. Johnny Metz is looking to draw more attention to this question.

Suppose I have the following PostgreSQL database schema:

Group:
  id: int

Task:
  id: int
  created_at: datetime
  group: FK Group
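
In SQL terms, a minimal sketch of this schema (the column types are assumptions, and "group" has to be quoted because GROUP is a reserved word):

CREATE TABLE "group" (
    id int PRIMARY KEY
);

CREATE TABLE task (
    id         int PRIMARY KEY,
    created_at timestamp,                          -- assumed type for "datetime"
    group_id   int NOT NULL REFERENCES "group" (id)
);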

I'm using the following materialized view to calculate the number of tasks per group and the most recent Task.created_at value:

CREATE MATERIALIZED VIEW group_statistics AS (
SELECT
    "group".id AS group_id,    -- "group" must be quoted: GROUP is a reserved word
    MAX(task.created_at) AS latest_task_created_at,
    COUNT(task.id) AS task_count
FROM "group"
    LEFT OUTER JOIN task ON ("group".id = task.group_id)
GROUP BY "group".id
);

The Task table currently has 20 million rows, so refreshing this materialized view takes a long time (20-30 seconds). We've also been seeing some short but severe database performance problems ever since we started refreshing the materialized view every 10 minutes, even with CONCURRENTLY:

REFRESH MATERIALIZED VIEW CONCURRENTLY group_statistics;
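
Note that CONCURRENTLY only works when the materialized view has a unique index covering all rows; a minimal sketch of one, assuming group_id is unique per row (the index name is made up):

CREATE UNIQUE INDEX group_statistics_group_id_idx ON group_statistics (group_id);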

Is there a better way to calculate these values? Note that they don't need to be exact; approximate values are totally fine, e.g. latest_task_created_at can be 10-20 minutes stale.
I'm thinking of caching these values on every write to the Task table, either in Redis or in PostgreSQL itself, along the lines of the sketch below.
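
A minimal sketch of the PostgreSQL variant of that idea, using a trigger-maintained cache table; the names group_statistics_cache, bump_group_statistics and task_stats_trg are hypothetical, and DELETEs/UPDATEs on task would still need their own handling:

CREATE TABLE group_statistics_cache (
    group_id               int PRIMARY KEY,
    task_count             bigint NOT NULL DEFAULT 0,
    latest_task_created_at timestamp
);

-- Upsert one row per group on every task INSERT.
CREATE FUNCTION bump_group_statistics() RETURNS trigger AS $$
BEGIN
    INSERT INTO group_statistics_cache AS c (group_id, task_count, latest_task_created_at)
    VALUES (NEW.group_id, 1, NEW.created_at)
    ON CONFLICT (group_id) DO UPDATE
    SET task_count             = c.task_count + 1,
        -- GREATEST ignores NULLs in Postgres, so the first row per group works too
        latest_task_created_at = GREATEST(c.latest_task_created_at, EXCLUDED.latest_task_created_at);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER task_stats_trg
AFTER INSERT ON task
FOR EACH ROW EXECUTE FUNCTION bump_group_statistics();

Reads then become a primary-key lookup on group_statistics_cache instead of a 20-million-row aggregate; the trade-off is extra write overhead and contention on hot groups.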

Update

People are asking for the execution plan. EXPLAIN doesn't work on REFRESH, but I ran EXPLAIN on the actual query. Note that it's on a different data model than the theoretical one above: in this case, Database is Group and Record is Task. Also note that I'm on PostgreSQL 12.10.

EXPLAIN (analyze, buffers, verbose)
SELECT
    store_database.id as database_id,
    MAX(store_record.updated_at) AS latest_record_updated_at,
    COUNT(store_record.id) AS record_count
FROM store_database
    LEFT JOIN store_record ON (store_database.id = store_record.database_id)
GROUP BY store_database.id;

Output:

HashAggregate  (cost=1903868.71..1903869.22 rows=169 width=32) (actual time=18227.016..18227.042 rows=169 loops=1)
"  Output: store_database.id, max(store_record.updated_at), count(store_record.id)"
  Group Key: store_database.id
  Buffers: shared hit=609211 read=1190704
  I/O Timings: read=3385.027
  ->  Hash Right Join  (cost=41.28..1872948.10 rows=20613744 width=40) (actual time=169.766..14572.558 rows=20928339 loops=1)
"        Output: store_database.id, store_record.updated_at, store_record.id"
        Inner Unique: true
        Hash Cond: (store_record.database_id = store_database.id)
        Buffers: shared hit=609211 read=1190704
        I/O Timings: read=3385.027
        ->  Seq Scan on public.store_record  (cost=0.00..1861691.23 rows=20613744 width=40) (actual time=0.007..8607.425 rows=20928316 loops=1)
"              Output: store_record.id, store_record.key, store_record.data, store_record.created_at, store_record.updated_at, store_record.database_id, store_record.organization_id, store_record.user_id"
              Buffers: shared hit=609146 read=1190704
              I/O Timings: read=3385.027
        ->  Hash  (cost=40.69..40.69 rows=169 width=16) (actual time=169.748..169.748 rows=169 loops=1)
              Output: store_database.id
              Buckets: 1024  Batches: 1  Memory Usage: 16kB
              Buffers: shared hit=65
              ->  Index Only Scan using store_database_pkey on public.store_database  (cost=0.05..40.69 rows=169 width=16) (actual time=0.012..0.124 rows=169 loops=1)
                    Output: store_database.id
                    Heap Fetches: 78
                    Buffers: shared hit=65
Planning Time: 0.418 ms
JIT:
  Functions: 14
"  Options: Inlining true, Optimization true, Expressions true, Deforming true"
"  Timing: Generation 2.465 ms, Inlining 15.728 ms, Optimization 92.852 ms, Emission 60.694 ms, Total 171.738 ms"
Execution Time: 18229.600 ms

Note the long execution time. Sometimes it takes 5-10 minutes to run. I'd love to get this down to a few seconds at most.


vnjpjtjt · answer #1

Getting an exact count per database (group) is expensive. There is no silver bullet for that in Postgres. You might be able to work with an estimate for the whole table (sketched after the link below), but estimates per group are not as easily obtained. See:

  • Fast way to discover the row count of a table in PostgreSQL
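
A minimal sketch of that whole-table estimate, reading the planner's statistics from pg_class (cheap, but approximate, and per table only, not per group):

SELECT reltuples::bigint AS estimated_rows
FROM   pg_class
WHERE  oid = 'public.store_record'::regclass;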

That said, your query can still be sped up substantially:

SELECT d.id AS database_id
     , r.latest_record_updated_at
     , r.record_count
FROM   store_database d
LEFT   JOIN (
   SELECT r.database_id
        , max(r.updated_at) AS latest_record_updated_at
        , count(*) AS record_count
   FROM   store_record r
   GROUP  BY r.database_id   -- was missing; required with the aggregates
   ) r ON d.id = r.database_id;

This aggregates store_record in a derived table before joining, and uses the slightly faster count(*), which is equivalent here (store_record.id is never NULL).

If you don't need the count, only latest_record_updated_at, then this query delivers it in next to no time:

SELECT d.id
    , (SELECT r.updated_at
       FROM   store_record r
       WHERE  r.database_id = d.id
       ORDER  BY r.updated_at DESC NULLS LAST
       LIMIT  1) AS latest_record_updated_at
FROM   store_database d;

Combine that with a matching index! Ideally:

CREATE INDEX store_record_database_id_idx ON store_record (database_id, updated_at DESC NULLS LAST);
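
To verify the index is picked up, you can run the fast query under EXPLAIN; a sketch (expect the subplan to use store_record_database_id_idx, ideally as an Index Only Scan, executed once per store_database row):

EXPLAIN (ANALYZE, BUFFERS)
SELECT d.id
     , (SELECT r.updated_at
        FROM   store_record r
        WHERE  r.database_id = d.id
        ORDER  BY r.updated_at DESC NULLS LAST
        LIMIT  1) AS latest_record_updated_at
FROM   store_database d;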

See:

  • Optimize GROUP BY query to retrieve latest row per user
