postgresql 优化Postgres删除孤立记录

9bfwbjaz 于 2023-11-18 发布在 PostgreSQL

关注(0)|答案(5)|浏览(159)

请看以下两张表：

Table "public.contacts"
       Column       |            Type             |                       Modifiers                       | Storage  | Stats target | Description
--------------------+-----------------------------+-------------------------------------------------------+----------+--------------+-------------
 id                 | integer                     | not null default nextval('contacts_id_seq'::regclass) | plain    |              |
 created_at         | timestamp without time zone | not null                                              | plain    |              |
 updated_at         | timestamp without time zone | not null                                              | plain    |              |
 external_id        | integer                     |                                                       | plain    |              |
 email_address      | character varying           |                                                       | extended |              |
 first_name         | character varying           |                                                       | extended |              |
 last_name          | character varying           |                                                       | extended |              |
 company            | character varying           |                                                       | extended |              |
 industry           | character varying           |                                                       | extended |              |
 country            | character varying           |                                                       | extended |              |
 region             | character varying           |                                                       | extended |              |
 ext_instance_id    | integer                     |                                                       | plain    |              |
 title              | character varying           |                                                       | extended |              |
Indexes:
    "contacts_pkey" PRIMARY KEY, btree (id)
    "index_contacts_on_ext_instance_id_and_external_id" UNIQUE, btree (ext_instance_id, external_id)

字符串
和

Table "public.members"
        Column         |            Type             |                             Modifiers                              | Storage  | Stats target | Description
-----------------------+-----------------------------+--------------------------------------------------------------------+----------+--------------+-------------
 id                    | integer                     | not null default nextval('members_id_seq'::regclass)               | plain    |              |
 step_id               | integer                     |                                                                    | plain    |              |
 contact_id            | integer                     |                                                                    | plain    |              |
 rule_id               | integer                     |                                                                    | plain    |              |
 request_id            | integer                     |                                                                    | plain    |              |
 sync_id               | integer                     |                                                                    | plain    |              |
 status                | integer                     | not null default 0                                                 | plain    |              |
 matched_targeted_rule | boolean                     | default false                                                      | plain    |              |
 external_fields       | jsonb                       |                                                                    | extended |              |
 imported_at           | timestamp without time zone |                                                                    | plain    |              |
 campaign_id           | integer                     |                                                                    | plain    |              |
 ext_instance_id       | integer                     |                                                                    | plain    |              |
 created_at            | timestamp without time zone |                                                                    | plain    |              |
Indexes:
    "members_pkey" PRIMARY KEY, btree (id)
    "index_members_on_contact_id_and_step_id" UNIQUE, btree (contact_id, step_id)
    "index_members_on_campaign_id" btree (campaign_id)
    "index_members_on_step_id" btree (step_id)
    "index_members_on_sync_id" btree (sync_id)
    "index_members_on_request_id" btree (request_id)
    "index_members_on_status" btree (status)

型
主键和members.contact_id都有索引。
我需要删除任何没有相关members的contact。大约有3 MM contact和25 MM member记录。
我正在尝试以下两个查询：

查询一：

DELETE FROM "contacts"
WHERE  "contacts"."id" IN (SELECT "contacts"."id" 
                           FROM   "contacts" 
                                  LEFT OUTER JOIN members 
                                               ON 
                                  members.contact_id = contacts.id 
                           WHERE  members.id IS NULL);

DELETE 0
Time: 173033.801 ms

-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Delete on contacts  (cost=2654306.79..2654307.86 rows=1 width=18) (actual time=188717.354..188717.354 rows=0 loops=1)
   ->  Nested Loop  (cost=2654306.79..2654307.86 rows=1 width=18) (actual time=188717.351..188717.351 rows=0 loops=1)
         ->  HashAggregate  (cost=2654306.36..2654306.37 rows=1 width=16) (actual time=188717.349..188717.349 rows=0 loops=1)
               Group Key: contacts_1.id
               ->  Hash Right Join  (cost=161177.46..2654306.36 rows=1 width=16) (actual time=188717.345..188717.345 rows=0 loops=1)
                     Hash Cond: (members.contact_id = contacts_1.id)
                     Filter: (members.id IS NULL)
                     Rows Removed by Filter: 26725870
                     ->  Seq Scan on members  (cost=0.00..1818698.96 rows=25322396 width=14) (actual time=0.043..160226.686 rows=26725870 loops=1)
                     ->  Hash  (cost=105460.65..105460.65 rows=3205265 width=10) (actual time=1962.612..1962.612 rows=3196180 loops=1)
                           Buckets: 262144  Batches: 4  Memory Usage: 34361kB
                           ->  Seq Scan on contacts contacts_1  (cost=0.00..105460.65 rows=3205265 width=10) (actual time=0.011..950.657 rows=3196180 loops=1)
         ->  Index Scan using contacts_pkey on contacts  (cost=0.43..1.48 rows=1 width=10) (never executed)
               Index Cond: (id = contacts_1.id)
 Planning time: 0.488 ms
 Execution time: 188718.862 ms

型

查询二：

DELETE FROM contacts 
WHERE  NOT EXISTS (SELECT 1 
                   FROM   members c 
                   WHERE  c.contact_id = contacts.id); 

DELETE 0
Time: 170871.219 ms

-------------------------------------------------------------------------------------------------------------------------------------------------------------
 Delete on contacts  (cost=2258873.91..2954594.50 rows=1895601 width=12) (actual time=177523.034..177523.034 rows=0 loops=1)
   ->  Hash Anti Join  (cost=2258873.91..2954594.50 rows=1895601 width=12) (actual time=177523.029..177523.029 rows=0 loops=1)
         Hash Cond: (contacts.id = c.contact_id)
         ->  Seq Scan on contacts  (cost=0.00..105460.65 rows=3205265 width=10) (actual time=0.018..1068.357 rows=3196180 loops=1)
         ->  Hash  (cost=1818698.96..1818698.96 rows=25322396 width=10) (actual time=169587.802..169587.802 rows=26725870 loops=1)
               Buckets: 262144  Batches: 32  Memory Usage: 36228kB
               ->  Seq Scan on members c  (cost=0.00..1818698.96 rows=25322396 width=10) (actual time=0.052..160081.880 rows=26725870 loops=1)
 Planning time: 0.901 ms
 Execution time: 177524.526 ms

型
正如您所看到的，即使不删除任何记录，两个查询也显示出类似的性能，耗时约3分钟。
服务器磁盘I/O峰值达到100%，因此我假设数据溢出到磁盘，因为在contacts和members上都执行了顺序扫描。
服务器是EC2 r3.large（15 GB RAM）。
有什么想法我可以做什么来优化这个查询？

更新#1：

在对两个表运行vacuum analyze并确保enable_mergejoin设置为on之后，查询时间没有差异：

DELETE FROM contacts 
WHERE  NOT EXISTS (SELECT 1 
                   FROM   members c 
                   WHERE  c.contact_id = contacts.id); 

-------------------------------------------------------------------------------------------------------------------------------------------------------------
 Delete on contacts  (cost=2246088.17..2966677.08 rows=1875003 width=12) (actual time=209406.342..209406.342 rows=0 loops=1)
   ->  Hash Anti Join  (cost=2246088.17..2966677.08 rows=1875003 width=12) (actual time=209406.338..209406.338 rows=0 loops=1)
         Hash Cond: (contacts.id = c.contact_id)
         ->  Seq Scan on contacts  (cost=0.00..105683.28 rows=3227528 width=10) (actual time=0.008..1010.643 rows=3227462 loops=1)
         ->  Hash  (cost=1814029.74..1814029.74 rows=24855474 width=10) (actual time=198054.302..198054.302 rows=27307060 loops=1)
               Buckets: 262144  Batches: 32  Memory Usage: 37006kB
               ->  Seq Scan on members c  (cost=0.00..1814029.74 rows=24855474 width=10) (actual time=1.132..188654.555 rows=27307060 loops=1)
 Planning time: 0.328 ms
 Execution time: 209408.040 ms

型

更新二：

PG版本：

PostgreSQL 9.4.4 on x86_64-pc-linux-gnu, compiled by x86_64-pc-linux-gnu-gcc (Gentoo Hardened 4.5.4 p1.0, pie-0.4.7) 4.5.4, 64-bit

型
关系大小：

Table         |  Size   | External Size
-----------------------+---------+---------------
 members               | 23 GB   | 11 GB
 contacts              | 944 MB  | 371 MB

型
设定值：

work_mem
----------
 64MB

 random_page_cost
------------------
 4

型

更新三：

批量试验似乎对I/O使用率没有帮助（仍然达到100%），而且尽管使用了基于索引的计划，似乎也没有按时改进。

DO $do$ 
BEGIN 
  FOR i IN 57..668 
  LOOP 
    DELETE 
    FROM   contacts 
    WHERE  contacts.id IN 
           ( 
                           SELECT          contacts.id 
                           FROM            contacts 
                           left outer join members 
                           ON              members.contact_id = contacts.id 
                           WHERE           members.id IS NULL 
                           AND             contacts.id >= (i    * 10000) 
                           AND             contacts.id < ((i+1) * 10000));
END LOOP;END $do$;

型
我不得不在Time: 1203492.326 ms之后终止查询，并且在整个查询运行期间磁盘I/O保持在100%。我还尝试了1，000和5，000块，但没有看到性能的任何提高。
注意：使用57..668范围是因为我知道这些是现有的联系人ID。（例如min(id)和max(id)）

postgresql

来源：https://stackoverflow.com/questions/43307730/optimize-postgres-deletion-of-orphaned-records

5条答案

按热度按时间

yquaqz181#

有什么想法我可以做什么来优化这个查询？
你的查询是完美的。我会使用NOT EXISTS变体。
你的指数index_members_on_contact_id_and_step_id也很好：

Is a composite index also good for queries on the first field?

下面我们来看看BRIN指数。
您可以优化服务器、表和索引配置。
由于您几乎不更新或删除任何行，根据您的评论，请关注优化 * 读取 * 性能。

1.升级Postgres版本

您提供了：
服务器是EC2 r3.large（15 GB RAM）。
并且：
PostgreSQL 9.4.4
您的版本严重过时。* 至少 * 升级到最新的次要版本。更好的是，升级到当前的主要版本。Postgres 9.5和9.6为大数据带来了重大改进-这正是您所需要的。
Consider the versioning policy of the project.的
Amazon allows you to upgrade!的

2.完善表统计

在基本顺序扫描中，预期行计数和实际行计数之间出现意外的10%不匹配：
对成员c进行序列扫描（成本=0.00..1814029.74行=24855474宽度=10）（实际时间=1.132..188654.555行=27307060循环=1）
一点也不夸张，但仍然不应该在这个查询中发生。表示您可能必须调整autovacuum设置-对于非常大的表，可能需要按表调整。
更有问题的是：
哈希反连接（成本=2246088.17..2966677.08行=1875003宽度=12）（实际时间=209406.338..209406.338行=0循环=1）
Postgres期望找到1875003行要删除，而实际上找到了0行。这是意外的。可能大幅增加members.contact_id和contacts.id上的统计目标可以帮助减少差距，这可能会允许更好的查询计划。请参阅：

避免PostgreSQL有时选择一个糟糕的查询计划

3.避免表和索引膨胀

在members中，大约25 MM行占用了23 GB --几乎是每行1 kb，这对于您提供的表定义来说似乎太多了（即使您提供的总大小应该包括索引）：

4 bytes  item identifier

24        tuple header
 8        null bitmap
36        9x integer
16        2x ts
 1        1x bool
??        1x jsonb

字符串
参见：

理解Postgres行大小

这是每行89个字节-或者更少的一些NULL值-几乎没有任何对齐填充，所以最大96个字节，加上您的**jsonb**列。

要么jsonb列非常大，这将使我建议将数据规范化为单独的列或单独的表。考虑：

如何对JSONB类型的列执行更新操作
或者你的表是臃肿的，这可以用VACUUM FULL ANALYZE解决，或者，当它在：

CLUSTER members USING index_members_on_contact_id_and_step_id;
VACUUM members;

型
但是这两种方法都需要一个独占锁，你说你负担不起。pg_repack可以在没有独占锁的情况下做到这一点。请参阅：

VACUUM returning disk space to operating system的

即使我们考虑到索引大小，你的表似乎太大了：你有7个小索引，每行36 - 44字节，没有膨胀，更少的是NULL值，所以总共< 300字节。
无论哪种方式，请考虑为表members设置更激进的autovacuum。

和/或停止膨胀表开始。你更新了很多行吗？你更新了很多特定的列吗？也许是jsonb列？你可以把它移到一个单独的（1：1）表，只是为了停止膨胀主表的死元组-并阻止autovacuum完成它的工作。

4.尝试BRIN索引

Block range indexes需要Postgres9.5或更高版本，并 * 显著 * 减少索引大小。我在初稿中过于乐观。如果每个contact.id在members中有 * 许多 * 行，那么BRIN索引对于您的用例是 * 完美 * 的-在物理集群您的表至少一次之后（有关CLUSTER命令，请参阅③）。在这种情况下，Postgres可以快速排除整个数据页。但您的数字表明，每个contact.id只有大约8行，因此数据页通常包含多个值，这就抵消了很多效果。这取决于你的数据分布的实际细节...
另一方面，元组的大小大约是1 kb，所以每个数据页只有8行（通常是8 kb）。
但您需要先升级您的服务器版本，参见①。

CREATE INDEX members_contact_id_brin_idx ON members USING BRIN (contact_id);

型

赞(0）回复(0）举报 2023-11-18

k75qkfdt2#

解决这类问题的一种方法是将其分成更小的块。

DELETE FROM "contacts"
WHERE  "contacts"."id" IN (
    SELECT id
    FROM contacts
    LEFT OUTER JOIN members ON members.contact_id = contacts.id 
    WHERE members.id IS NULL
        AND id >= 1 AND id < 1000
);
DELETE FROM "contacts"
WHERE  "contacts"."id" IN (
    SELECT id
    FROM contacts
    LEFT OUTER JOIN members ON members.contact_id = contacts.id 
    WHERE members.id IS NULL
        AND id >= 1001 AND id < 2000
);

字符串
重复上述步骤。尝试不同的块大小，为你的数据集找到一个最佳的块大小，它使用最少的查询，同时将它们全部保存在内存中。
当然，您可能希望使用plpgsql或您喜欢的任何脚本语言编写脚本。

赞(0）回复(0）举报 2023-11-18

qyyhg6bp3#

更新计划器使用的统计信息，并将enable_mergejoin设置为on：

vacuum analyse members;
vacuum analyse contacts;
set enable_mergejoin to on;

字符串
你应该得到一个类似于下面的查询计划：

explain analyse
delete from contacts 
where not exists (
    select 1 
    from members c 
    where c.contact_id = contacts.id); 

                               QUERY PLAN
----------------------------------------------------------------------
 Delete on contacts
   ->  Merge Anti Join
         Merge Cond: (contacts.id = c.contact_id)
         ->  Index Scan using contacts_pkey on contacts
         ->  Index Scan using members_contact_id_idx on members c

型

赞(0）回复(0）举报 2023-11-18

b4wnujal4#

下面是另一个可以尝试的变体：

DELETE FROM contacts
USING contacts c
LEFT JOIN members m
ON c.id = m.contact_id
WHERE m.contact_id IS NULL;

字符串
它使用此处所述的从连接查询中删除的技术。
我不能保证这样做是否会更快，但可能是因为避免了子查询。

赞(0）回复(0）举报 2023-11-18

lg40wkob5#

在where子句中使用子查询会花费很多时间，你应该使用with和using，这样会快很多......

with 
    c_not_member as (
     -- here extarct the id of contacts that not in members
     SELECT 
        c.id
     FROM  contacts c LEFT JOIN  members m on c.id = m.contact_id
     WHERE  
        -- to get the contact that don't exist in member just
        -- use condition in a field on member that cannot be null
        -- in this case you have id
        m.id is null
        -- the only case when m.id is null is when c.id does not have m.contact_id maching c.id
        -- in another way c.id doesn't exists in m.contact_id 
                   )
DELETE FROM contacts all_c using c_not_member WHERE all_c.id = not_member.id ;

字符串

赞(0）回复(0）举报 2023-11-18

我来回答

postgresql 优化Postgres删除孤立记录

查询一：

查询二：

更新#1：

更新二：

更新三：

5条答案

1.升级Postgres版本

2.完善表统计

3.避免表和索引膨胀

4.尝试BRIN索引

相关问题

热门标签

最新问答