PostgreSQL index/partition optimization

cgfeq70w  posted on 2023-10-18 in PostgreSQL

We have an important table "client" in a PostgreSQL 13 database with about 5 million rows. It is linked to another table, "company", which has 2,000 rows. One company accounts for 2 million client rows, three companies have about 200,000 rows each, and all the other companies have fewer.
Retrieving clients is very slow (on the order of 10-20 seconds) and I am trying to optimize it.
The company_id column is obviously indexed, but PostgreSQL does not use the index effectively and ends up scanning the whole table.
The basic query (13 seconds):

select distinct c1_0.id,c1_0.date_creation,c1_0.date_modification,c1_0.company_id,c1_0.nom,c1_0.prenom,c1_0.uuid 
from client c1_0 
join company e1_0 on e1_0.id=c1_0.company_id 
where e1_0.comp_name in ('MYCOMPANY');

Here is the explain plan:

Unique  (cost=49805.26..50015.42 rows=2627 width=775) (actual time=12272.378..13245.548 rows=3246176 loops=1)
  ->  Sort  (cost=49805.26..49811.82 rows=2627 width=775) (actual time=12272.376..12559.093 rows=3246176 loops=1)
        Sort Key: c1_0.id,c1_0.date_creation,c1_0.date_modification,c1_0.company_id,c1_0.nom,c1_0.prenom,c1_0.uuid
        Sort Method: external merge  Disk: 615776kB
        ->  Nested Loop  (cost=265.28..49656.05 rows=2627 width=775) (actual time=166.906..9478.238 rows=3246176 loops=1)
              ->  Index Scan using company_nom_key on company e1_0  (cost=0.28..8.30 rows=1 width=8) (actual time=1.102..1.105 rows=1 loops=1)
                    Index Cond: ((numero_licence)::text = 'MYCOMPANY'::text)
              ->  Bitmap Heap Scan on client c1_0  (cost=265.00..49476.70 rows=17106 width=775) (actual time=165.798..8588.485 rows=3246176 loops=1)
                    Recheck Cond: (company_id = e1_0.id)
                    Rows Removed by Index Recheck: 1180926
                    Heap Blocks: exact=35248 lossy=99166
                    ->  Bitmap Index Scan on uc_client_company_id_carte_fidelite  (cost=0.00..260.73 rows=17106 width=0) (actual time=161.176..161.176 rows=3246177 loops=1)
                          Index Cond: (company_id = e1_0.id)
Planning Time: 3.078 ms
Execution Time: 13354.253 ms

Now, if I run the same query with the company_id directly, it is much faster (150-300 ms):

select distinct c1_0.id,c1_0.date_creation,c1_0.date_modification,c1_0.company_id,c1_0.nom,c1_0.prenom,c1_0.uuid 
    from client c1_0 
    join company e1_0 on e1_0.id=c1_0.company_id 
    where e1_0.id in (12345);

Here is the explain plan:

Unique  (cost=1001.17..1289653.17 rows=3248642 width=775) (actual time=188.930..6268.170 rows=3246176 loops=1)
  ->  Nested Loop  (cost=1001.17..1046005.02 rows=3248642 width=775) (actual time=188.928..4811.113 rows=3246176 loops=1)
        ->  Gather Merge  (cost=1000.89..1005388.70 rows=3248642 width=775) (actual time=188.910..4020.417 rows=3246176 loops=1)
              Workers Planned: 2
              Workers Launched: 2
              ->  Incremental Sort  (cost=0.86..629414.78 rows=1353601 width=775) (actual time=126.966..3203.508 rows=1082059 loops=3)
                    Sort Key: c1_0.id,c1_0.date_creation,c1_0.date_modification,c1_0.company_id,c1_0.nom,c1_0.prenom,c1_0.uuid
                    Presorted Key: c1_0.id
                    Full-sort Groups: 31567  Sort Method: quicksort  Average Memory: 33kB  Peak Memory: 33kB
                    Worker 0:  Full-sort Groups: 34935  Sort Method: quicksort  Average Memory: 33kB  Peak Memory: 33kB
                    Worker 1:  Full-sort Groups: 34943  Sort Method: quicksort  Average Memory: 33kB  Peak Memory: 33kB
                    ->  Parallel Index Scan using pk_client on client c1_0  (cost=0.43..568502.73 rows=1353601 width=775) (actual time=117.699..2689.785 rows=1082059 loops=3)
                          Filter: (company_id = 12345)
                          Rows Removed by Filter: 856619
        ->  Materialize  (cost=0.28..8.30 rows=1 width=8) (actual time=0.000..0.000 rows=1 loops=3246176)
              ->  Index Only Scan using pk_company on company e1_0  (cost=0.28..8.30 rows=1 width=8) (actual time=0.009..0.010 rows=1 loops=1)
                    Index Cond: (id = 206852)
                    Heap Fetches: 1
Planning Time: 0.173 ms
JIT:
  Functions: 15
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 5.137 ms, Inlining 59.856 ms, Optimization 167.256 ms, Emission 110.837 ms, Total 343.087 ms
Execution Time: 6349.227 ms

I cannot always run a first query to fetch the company id (or ids) and then a second select on client using those ids (the whole thing lives in a Java application using JPA). Removing the DISTINCT certainly helps, but it is mandatory because of joins in some of the other queries.
I also tried partitioning the client table and ran into essentially the same problem: when filtering on comp_name, PostgreSQL scans all partitions; when filtering directly on the company id, it picks the right partition.
I created a table client that keeps the id (for global uniqueness + foreign keys) and another table client_partitioned; the two share the same ids.
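A sketch of that layout, assuming LIST partitioning on company_id (the partition names, column list, and the value 12345 are illustrative, not the poster's actual DDL):

```sql
-- client keeps the global primary key; client_partitioned holds the
-- same rows, split by company_id (columns abbreviated for the sketch).
CREATE TABLE client_partitioned (
    id                bigint NOT NULL,
    company_id        bigint NOT NULL,
    nom               text,
    prenom            text,
    uuid              uuid,
    date_creation     timestamptz,
    date_modification timestamptz
) PARTITION BY LIST (company_id);

-- One partition per big company, plus a default for the small ones.
CREATE TABLE client_partitioned_mycompany
    PARTITION OF client_partitioned FOR VALUES IN (12345);
CREATE TABLE client_partitioned_default
    PARTITION OF client_partitioned DEFAULT;
```

Partition pruning on a LIST-partitioned table needs the company_id value at plan (or execution) time, which is exactly what the join on comp_name hides from the planner.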

Limit  (cost=197182.76..197197.96 rows=20 width=777) (actual time=7771.630..7831.954 rows=20 loops=1)
  ->  Unique  (cost=197182.76..199179.52 rows=2627 width=777) (actual time=7745.582..7805.904 rows=20 loops=1)
        ->  Nested Loop  (cost=197182.76..198975.93 rows=2627 width=777) (actual time=7745.580..7805.887 rows=20 loops=1)
              ->  Gather Merge  (cost=197182.33..197488.29 rows=2627 width=777) (actual time=7745.531..7805.786 rows=20 loops=1)
                    Workers Planned: 2
                    Workers Launched: 2
                    ->  Sort  (cost=196182.31..196185.04 rows=1095 width=777) (actual time=7253.342..7253.493 rows=217 loops=3)
                          Sort Key: c1_0.id,c1_0.date_creation,c1_0.date_modification,c1_0.company_id,c1_0.nom,c1_0.prenom,c1_0.uuid
                          Sort Method: external merge  Disk: 178904kB
                          Worker 0:  Sort Method: external merge  Disk: 263768kB
                          Worker 1:  Sort Method: external merge  Disk: 173232kB
                          ->  Hash Join  (cost=8.31..196127.03 rows=1095 width=777) (actual time=718.867..3399.692 rows=1082059 loops=3)
                                Hash Cond: (c1_0.company_id = e1_0.id)
                                ->  Parallel Append  (cost=0.00..189745.25 rows=2423347 width=777) (actual time=13.854..3123.482 rows=1938678 loops=3)
                                      ->  Parallel Seq Scan on client_partitioned_mycompany c1_0_3  (cost=0.00..99961.73 rows=1352573 width=776) (actual time=6.855..2401.764 rows=1082059 loops=3)
                                      ->  Parallel Seq Scan on client_partitioned_default c1_0_6  (cost=0.00..33698.71 rows=465571 width=774) (actual time=20.721..783.996 rows=1117370 loops=1)
                                      ->  Parallel Seq Scan on client_partitioned_mycompany2 c1_0_4  (cost=0.00..20913.48 rows=296648 width=776) (actual time=0.147..309.226 rows=355978 loops=2)
                                      ->  Parallel Seq Scan on client_partitioned_mycompany3 c1_0_2  (cost=0.00..18261.16 rows=248316 width=774) (actual time=0.766..398.383 rows=595958 loops=1)
                                      ->  Parallel Seq Scan on client_partitioned_mycompany4 c1_0_5  (cost=0.00..3233.04 rows=59804 width=821) (actual time=0.346..71.407 rows=101666 loops=1)
                                      ->  Parallel Seq Scan on client_partitioned_mycompany5 c1_0_1  (cost=0.00..1560.40 rows=25240 width=840) (actual time=0.320..28.769 rows=42908 loops=1)
                                ->  Hash  (cost=8.30..8.30 rows=1 width=8) (actual time=0.587..0.588 rows=1 loops=3)
                                      Buckets: 1024  Batches: 1  Memory Usage: 9kB
                                      ->  Index Scan using company_comp_name_key on company e1_0  (cost=0.28..8.30 rows=1 width=8) (actual time=0.516..0.517 rows=1 loops=3)
                                            Index Cond: ((numero_licence)::text = 'MYCOMPANY'::text)
              ->  Index Only Scan using pk_client on client c1_1  (cost=0.43..0.57 rows=1 width=8) (actual time=0.003..0.003 rows=1 loops=20)
                    Index Cond: (id = c1_0.id)
                    Heap Fetches: 20
Planning Time: 0.892 ms
JIT:
  Functions: 69
  Options: Inlining false, Optimization false, Expressions true, Deforming true
  Timing: Generation 10.378 ms, Inlining 0.000 ms, Optimization 5.020 ms, Emission 59.649 ms, Total 75.048 ms
Execution Time: 7851.928 ms

vs.

Limit  (cost=7122.35..7246.57 rows=20 width=776) (actual time=29.735..37.386 rows=20 loops=1)
  ->  Unique  (cost=7122.35..4039695.63 rows=649236 width=776) (actual time=29.734..37.384 rows=20 loops=1)
        ->  Nested Loop  (cost=7122.35..3796232.43 rows=3246176 width=776) (actual time=29.732..37.368 rows=20 loops=1)
              ->  Nested Loop  (cost=7122.07..3755646.93 rows=3246176 width=776) (actual time=29.713..37.341 rows=20 loops=1)
                    ->  Gather Merge  (cost=7121.64..1618527.96 rows=3246176 width=776) (actual time=29.677..37.248 rows=20 loops=1)
                          Workers Planned: 2
                          Workers Launched: 2
                          ->  Incremental Sort  (cost=6121.61..1242838.68 rows=1352573 width=776) (actual time=0.232..2.783 rows=72 loops=3)
                                Sort Key: c1_0.id,c1_0.date_creation,c1_0.date_modification,c1_0.company_id,c1_0.nom,c1_0.prenom,c1_0.uuid
                                Presorted Key: c1_0.nom
                                Full-sort Groups: 1  Sort Method: quicksort  Average Memory: 38kB  Peak Memory: 38kB
                                Worker 0:  Full-sort Groups: 4  Sort Method: quicksort  Average Memory: 34kB  Peak Memory: 34kB
                                Worker 1:  Full-sort Groups: 4  Sort Method: quicksort  Average Memory: 33kB  Peak Memory: 33kB
                                ->  Parallel Index Scan using client_partitioned_mycompany_nom_idx1 on client_partitioned_mycompany c1_0  (cost=0.43..395532.62 rows=1352573 width=776) (actual time=0.019..2.580 rows=104 loops=3)
                                      Filter: (company_id = 206852)
                    ->  Index Only Scan using pk_client on client c1_1  (cost=0.43..0.66 rows=1 width=8) (actual time=0.004..0.004 rows=1 loops=20)
                          Index Cond: (id = c1_0.id)
                          Heap Fetches: 20
              ->  Materialize  (cost=0.28..8.30 rows=1 width=8) (actual time=0.001..0.001 rows=1 loops=20)
                    ->  Index Only Scan using pk_company on company e1_0  (cost=0.28..8.30 rows=1 width=8) (actual time=0.015..0.016 rows=1 loops=1)
                          Index Cond: (id = 206852)
                          Heap Fetches: 1
Planning Time: 0.503 ms
Execution Time: 37.507 ms

Sometimes we also add filters (on name, last name, ...; I removed some columns from the queries shown), but that does not change the results much.
Thanks for your help.

kkbh8khc

There are two simple things you can do:

  • Remove the unnecessary DISTINCT, which will save you around three seconds
  • Increase work_mem until the bitmap heap scan no longer reports lossy heap blocks
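For the second point, a session-level sketch (the value is illustrative; note that the first plan's external merge sort also spilled about 600 MB to disk, so the sort may benefit as well):

```sql
-- Raise work_mem for the current session only (value is illustrative).
SET work_mem = '256MB';

-- Re-run the query with EXPLAIN (ANALYZE, BUFFERS) and check that the
-- "Heap Blocks" line of the Bitmap Heap Scan no longer shows "lossy".
```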

If you want to keep the query as it is, another thing you could consider is clustering the big table (note that CLUSTER takes an exclusive lock while it runs, and the physical ordering is not maintained for rows written afterwards):

CLUSTER client USING uc_client_company_id_carte_fidelite;

But perhaps the best and simplest solution is to split the query in two: first fetch the company's id from company, then query client with that company_id. That way the second query knows exactly what to expect and can choose the best plan. The problem with the current query is that at planning time PostgreSQL does not know whether this is the big company_id or not.
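Sketched in plain SQL (the literal 12345 stands in for whatever the first query returns):

```sql
-- Step 1: resolve the company id (cheap, single row).
SELECT id FROM company WHERE comp_name = 'MYCOMPANY';

-- Step 2: query client with the literal id. The planner now sees the
-- actual company_id and can use its per-value statistics for it.
SELECT DISTINCT c1_0.id, c1_0.date_creation, c1_0.date_modification,
       c1_0.company_id, c1_0.nom, c1_0.prenom, c1_0.uuid
FROM client c1_0
WHERE c1_0.company_id = 12345;  -- value returned by step 1
```

From JPA this is simply two repository calls instead of one joined query.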
