postgresql 有没有办法用SQL找到并删除几乎相似的记录？

a14dhokn 于 2023-02-04 发布在 PostgreSQL

关注(0)|答案(1)|浏览(113)

我在Postgres DB中有一个表，它有很多几乎相同的行。

1. 00Zicky_-_San_Pedro_Danilo_Vigorito_Remix
2. 00Zicky_-_San_Pedro__Danilo_Vigorito_Remix__
3. 0101_-_Try_To_Say__Strictlyjaz_Unit_Future_Rmx__
4. 0101_-_Try_To_Say__Strictlyjaz_Unit_Future_Rmx_
5. 01_-_Digital_Excitation_-_Brothers_Gonna_Work_it_Out__Piano_Mix__
6. 01_-_Digital_Excitation_-_Brothers_Gonna_Work_it_Out__Piano_Mix__

我想写一个小golang脚本来删除重复，但也许SQL可以做到这一点？
表格定义：

\d+ songs
                                                                       Table "public.songs"
    Column     |            Type             | Collation | Nullable |                Default                 | Storage  | Compression | Stats target | Description
---------------+-----------------------------+-----------+----------+----------------------------------------+----------+-------------+--------------+-------------
 song_id       | integer                     |           | not null | nextval('songs_song_id_seq'::regclass) | plain    |             |              |
 song_name     | character varying(250)      |           | not null |                                        | extended |             |              |
 fingerprinted | smallint                    |           |          | 0                                      | plain    |             |              |
 file_sha1     | bytea                       |           |          |                                        | extended |             |              |
 total_hashes  | integer                     |           | not null | 0                                      | plain    |             |              |
 date_created  | timestamp without time zone |           | not null | now()                                  | plain    |             |              |
 date_modified | timestamp without time zone |           | not null | now()                                  | plain    |             |              |
Indexes:
    "pk_songs_song_id" PRIMARY KEY, btree (song_id)
Referenced by:
    TABLE "fingerprints" CONSTRAINT "fk_fingerprints_song_id" FOREIGN KEY (song_id) REFERENCES songs(song_id) ON DELETE CASCADE
Access method: heap

尝试了几种方法来查找重复项，但这些方法仅搜索精确的相似性。

postgresql

来源：https://stackoverflow.com/questions/75316444/any-way-to-find-and-delete-almost-similar-records-with-sql

1条答案

按热度按时间

qyuhtwio1#

不存在本质上 A几乎= B 的算子。（这里有full text search，但这似乎有点过分了。）如果唯一的差异是-和_的数量，那么就去掉它们，然后比较得到的差异。如果它们相等，那么其中一个是重复的。你可以使用replace()函数来删除它们。例如：（参见demo）

delete  
  from songs s2 
 where exists ( select null 
                  from songs s1 
                 where s1.song_id < s2.song_id
                   and replace(replace(s1.name, '_',''),'-','') = 
                       replace(replace(s2.name, '_',''),'-','')
              );

如果你的表很大，这将不会很快，但函数索引可能会有所帮助：

create index song_name_idx on songs
        (replace(replace(name, '_',''),'-',''));

赞(0）回复(0）举报 2023-02-04

我来回答

postgresql 有没有办法用SQL找到并删除几乎相似的记录？

1条答案

相关问题

热门标签

最新问答