PostgreSQL: create a table of unique texts from millions of documents

Posted by apeeds0o on 2023-08-04 in PostgreSQL


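I have a table with a text column that holds millions of text documents of varying length (from a few KB to a few MB). Many of them are duplicates.
I want to create a table that holds each unique text together with its md5 hash. The table looks like this: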

create table txt_document
(
    id serial primary key,
    doc_hash uuid not null unique,
    doc_text text
)
;

create index idx_txt_document__hash on txt_document(doc_hash);

To copy all the documents into the table, I simply run the following:

with txt_data as (
    select html_text
    from many_texts
    where
        html_text is not null
)
insert into txt_document(doc_hash, doc_text)
select md5(html_text)::uuid
    , html_text
from txt_data
on conflict do nothing
;

The execution plan looks like this:

QUERY PLAN
----------------------------------------------------------------------------------
 Insert on txt_document  (cost=0.00..2829451.85 rows=0 width=0)
   Conflict Resolution: NOTHING
   ->  Seq Scan on many_texts  (cost=0.00..2829451.85 rows=10438262 width=52)
         Filter: (html_text IS NOT NULL)
(4 rows)

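However, this query seems to run forever without ever putting any load on the CPU.
Are there any strategies to speed up an operation like this?
My PostgreSQL version is 14.x.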

postgresql

Source: https://stackoverflow.com/questions/76769149/create-a-table-of-unique-texts-from-millions-of-documents

2 Answers

mgdq6dx1 (answer #1)

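To achieve this, the md5 has to be computed for every record in the source table anyway. Here is the approach I would consider in plain SQL:
1. Add a GENERATED md5 hash column to the source table: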

alter table many_texts
   add column html_text_md5 varchar(32)
     generated always as (md5(html_text)) stored;

2. (Optional) Create an index on the md5 hash:

create index on many_texts (html_text_md5) where html_text_md5 is not null;

3. Use that column to insert only unique rows into the target table:

insert into txt_document(doc_hash, doc_text)
   select distinct on (html_text_md5)
     html_text_md5::uuid, html_text
   from many_texts
   where html_text_md5 is not null
   order by html_text_md5;

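4. Optionally, instead of (3), you can try to speed up the insert with 16 parallel workers, each running the statement below in its own session with a different leading hex digit: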

insert into txt_document(doc_hash, doc_text)
   select distinct on (html_text_md5)
     html_text_md5::uuid, html_text
   from many_texts
   where html_text_md5 LIKE '0%'  -- 0,...,9,a,...,f
   order by html_text_md5;
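
The 16-worker variant only helps if the sixteen prefix-restricted statements actually run at the same time in separate sessions. A minimal Python sketch of that fan-out follows. `run_statement` is a hypothetical callable (not part of the answer) that is assumed to open its own database connection, execute one statement, and commit. The thread pool and the function names are illustrative choices too. Because each prefix selects a disjoint set of hashes, no two workers try to insert the same `doc_hash`.

```python
from concurrent.futures import ThreadPoolExecutor

HEX_DIGITS = "0123456789abcdef"

# Same statement as in the answer, with the leading hex digit left open.
INSERT_TEMPLATE = """
insert into txt_document(doc_hash, doc_text)
   select distinct on (html_text_md5)
     html_text_md5::uuid, html_text
   from many_texts
   where html_text_md5 like '{prefix}%'
   order by html_text_md5;
"""


def build_statements():
    """Return one INSERT per leading hex digit of the md5 hash (16 in total)."""
    return [INSERT_TEMPLATE.format(prefix=digit) for digit in HEX_DIGITS]


def run_partitions(run_statement, max_workers=16):
    """Run every partition's INSERT concurrently and return how many ran.

    run_statement is a hypothetical callable supplied by the caller; it is
    assumed to open its own session, execute the SQL text, and commit.
    """
    statements = build_statements()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for all workers and re-raises any worker exception.
        list(pool.map(run_statement, statements))
    return len(statements)
```

Passing `print` as `run_statement` is a quick way to inspect the generated statements before pointing the function at a real database.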


Answered 2023-08-04

vsikbqxv (answer #2)

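My idea is to split the work into two steps. First, write a Python script that deduplicates the data; then, in a second script, create the table from the "clean" data.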
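
A minimal sketch of the deduplication step, assuming the documents can be streamed into Python as strings. The names `doc_hash` and `unique_documents` are illustrative, not from the answer. Each text is keyed by the MD5 of its UTF-8 bytes, which is expected to match PostgreSQL's `md5(html_text)::uuid` in a UTF-8 database; that equivalence is an assumption worth checking on a few rows first.

```python
import hashlib
import uuid


def doc_hash(text):
    """Return the MD5 of the text's UTF-8 bytes as a UUID."""
    return uuid.UUID(hex=hashlib.md5(text.encode("utf-8")).hexdigest())


def unique_documents(texts):
    """Yield (hash, text) once per distinct hash, skipping None values."""
    seen = set()
    for text in texts:
        if text is None:
            continue
        key = doc_hash(text)
        if key not in seen:
            seen.add(key)
            yield key, text
```

Only the hashes are kept in memory, not the texts, so the generator can feed a second script that loads the pairs into `txt_document`.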

Answered 2023-08-04