How to Find Duplicates Over Multiple Columns Using SQL Server

r6hnlfcb  于 2023-03-28  发布在  SQL Server
关注(0)|答案(1)|浏览(223)

Is it possible to find duplicates in a SQL table across multiple columns in a way that only requires a match by one of the columns?

For example, lets say I have a table with the following Schema: ID, C1, C2

My goal is to return a new table with a column called "Group ID" which is the ID of the group the record lives. A given record should exist in the group if there is a match by at least one of C1 or C2 .

For example, consider this table

Create Table #Target

(ID Varchar(MAX),C1 Varchar(MAX),C2 Varchar(MAX))

Insert INTO #Target

Values

('1','p1','e1'),
('2','p2','e2'),
('3','p1','e2'),
('4','p3','e3'),
('5','p3','e4'),
('6','p4','e4')
IDC1C2
1p1e1
2p2e2
3p1e2
4p3e3
5p3e4
6p4e4

The desired output would look something like
| ID | C1 | C2 | GID |
| ------------ | ------------ | ------------ | ------------ |
| 1 | p1 | e1 | G1 |
| 2 | p2 | e2 | G1 |
| 3 | p1 | e2 | G1 |
| 4 | p3 | e3 | G2 |
| 5 | p3 | e4 | G2 |
| 6 | p4 | e4 | G2 |

When trying to come up with a solution I tried grouping by each column individually like this:

#1 Group By C1 and Assign Unique ID (C1GID) to each Group

#2 Group By C2 and Assign Unique ID (C2GID) to each Group

Now the problem I am having is how to group by C1GID OR C2GID . I don't know how to group rows that share at least one of these column values in common.

Update: I am getting closer. I am able to generate a set of IDs that have a match by at least one of the columns using this

SELECT *, CAST(NULL AS INT) AS ID_To INTO #t FROM ( VALUES ('1','p1','e1'), ('2','p2','e2'), ('3','p1','e2'), ('4','p3','e3'), ('5','p3','e4'), ('6','p4','e4') ) t (ID,C1,C2)

Select ID1, STRING_AGG(ID3, ', ') + ',' + STRING_AGG(ID2, ', ') as Groups
FROM (
    select 
    t1.ID as ID1,
    t1.C1 as t1C1,
    t1.C2 as t1C2,

    t2.ID as ID2,
    t2.C1 as t2C1,
    t2.C2 as t2C2,

    t3.ID as ID3,
    t3.C1 as t3C1,
    t3.C2 as t3C2
    from #t t1
    LEFT JOIN (
        Select * From #t
    ) t2 ON t1.C1 = t2.C1 
    LEFT JOIN (
        Select * From #t
    ) t3 ON t1.C2 = t3.C2
    WHERE t1.C1 = t2.C1 OR t1.C2 = t3.C2
) Groups 
GROUP BY ID1

Image here: Output

I am trying to find a way to re-group these by finding an intersection between the values in the Groups column. For example its clear from the result that 1,2,3 belong together, and 4,5,6 belong together because their groups column has a common ID.

kgsdhlau

kgsdhlau1#

Try this one:

with firstc1s(gid, id,c1) as (
    select row_number() over(order by id) as gid, id, c1 from (
        select min(id) as id, c1
        from #Target 
        group by c1
    ) x
)
,cte(lvl, gid, id, c1, c2, p) as (
    select 1, f.gid, f.id, d.c1, d.c2, concat('/', f.id, ' ')
    from firstc1s f
        join #Target d on d.id = f.id
        
    union all
    
    select lvl+1, c.gid, d.id, d.c1, d.c2, concat(c.p , '/' , d.id, ' ')
    from cte c
    join #Target d on  
        (
            c.c1 = d.c1 or c.c1 = d.c2
            or 
            c.c2 = d.c1 or c.c2 = d.c2
        )
    where charindex(concat('/' , d.id, ' '), c.p) = 0
)
select 
    dense_rank() over(order by gid) as gid, id,c1,c2 
from (
    select min(gid) as gid, id, c1, c2 from cte
    group by id, c1, c2
) x
order by gid, id, c1, c2
;

https://dbfiddle.uk/-nIlBCer

相关问题