Creating a single ID for any combination of matches between three different identifiers

ttcibm8c  于 2023-02-28  发布在  其他
关注(0)|答案(1)|浏览(91)

I have customer data coming from several different systems, with three potential unique identifiers. We need to create a single new ID that can be used to link ANY match between the three identifiers.

The data is in the following structure:
| Primary Key | CustomerID1 | CustomerID2 | CustomerID3 |
| ------------ | ------------ | ------------ | ------------ |
| 1 | Alpha | Dog | Jeans |
| 2 | Alpha | Cat | Shirt |
| 3 | Beta | Dog | Dress |
| 4 | Gamma | Bear | Jeans |
| 5 | Alpha | Dog | Jeans |
| 6 | Epsilon | Bird | Boots |

In the end, keys 1-5 should all be linked based on the bolded matches while 6 should have its own ID. So the desired output would be a new ID generated for just 2 customers from this list and then a second table with all the IDs & IDTypes linked to that new ID (but open to changing the end result structure so long as we generate 1 ID for each customer):

New ID
Key1
Key2
New IDIDIDType
Key1AlphaCustomerID1
Key1BetaCustomerID1
Key1GammaCustomerID1
Key1DogCustomerID2
Key1CatCustomerID2
Key1BearCustomerID2
Key1JeansCustomerID3
Key1ShirtCustomerID3
Key1DressCustomerID3
Key2EpsilonCustomerID1
Key2BirdCustomerID2
Key2BootsCustomerID3

I've tried creating a list of all possible matches like the below but I'm not sure how to consolidate that down into a single ID for each customer.

DROP TABLE IF EXISTS #Test
CREATE TABLE #Test (PrimaryKey int, CustomerID1 varchar(15), CustomerID2 varchar(15), CustomerID3 varchar(15))
INSERT INTO #Test VALUES
     (1,'Alpha','Dog','Jeans')
    ,(2,'Alpha','Cat','Shirt')
    ,(3,'Beta','Dog','Dress')
    ,(4,'Gamma','Bear','Jeans')
    ,(5,'Alpha','Dog','Jeans')
    ,(6,'Epsilon','Bird','Boots')

SELECT
     t1.PrimaryKey
    ,t2.Primarykey
FROM #Test t1
JOIN #Test t2 on t2.PrimaryKey != t1.PrimaryKey and t1.CustomerID1 = t2.CustomerID1
UNION
SELECT
     t1.PrimaryKey
    ,t2.Primarykey
FROM #Test t1
JOIN #Test t2 on t2.PrimaryKey != t1.PrimaryKey and t1.CustomerID2 = t2.CustomerID2
UNION
SELECT
     t1.PrimaryKey
    ,t2.Primarykey
FROM #Test t1
JOIN #Test t2 on t2.PrimaryKey != t1.PrimaryKey and t1.CustomerID3 = t2.CustomerID3

I feel like the solution is obvious but I'm stuck so any help is appreciated! Thank you!

vfh0ocws

vfh0ocws1#

It's a bit tricky to do in one select (at least for me). I usually do something like this:

SELECT  *, CAST(NULL AS INT) AS ID_To
INTO #t
FROM
(
    VALUES  (1, N'Alpha', N'Dog', N'Jeans')
    ,   (2, N'Alpha', N'Cat', N'Shirt')
    ,   (3, N'Beta', N'Dog', N'Dress')
    ,   (4, N'Gamma', N'Bear', N'Jeans')
    ,   (5, N'Alpha', N'Dog', N'Jeans')
    ,   (5, N'Alpha', N'Bonanza', N'Boots')
    ,   (6, N'Epsilon', N'Bird', N'Boots')
    ,   (7, N'zz', N'dog', N'Bird')
    ,   (8, N'zzz', N'bye', N'hi')
    ,   (9, N'zzzz', N'bear', N'hi ho silver')
) t (ID,CustomerID1,CustomerID2,CustomerID3)

WHILE @@rowcount > 0
BEGIN
    UPDATE  t2
    SET ID_to = ISNULL(t.ID_To, t.ID)
    FROM    #t t
    LEFT JOIN #t tTo
        ON  tTo.ID = t.ID_To
    CROSS APPLY (
        VALUES (t.Customerid1), (t.Customerid2), (t.Customerid3)
        ) v(externalId) 
    CROSS JOIN #t t2
    CROSS APPLY (
        VALUES (t2.Customerid1), (t2.Customerid2), (t2.Customerid3)
        ) v2(externalId) 
    WHERE   ISNULL(t.id_To, t.id) < ISNULL(t2.ID_to, t2.id)
    AND v.externalId = v2.externalId
    AND t.ID <> t2.ID
END

SELECT  *
FROM    #t

SELECT  ISNULL(ID_TO, ID) AS groups
FROM    #t
GROUP BY ISNULL(ID_TO, ID)

To explain the code:

  1. I create two columns in your table containing row ID (ID) and eventual match ID (ID_to). This is so we can consolidate how every customer is matched against other customers
  2. Then i do a loop. In the loop i try to match the keys between two different customers. What i want to do is to match every customer against another customerID1,2,3 (AND v.externalId = v2.externalId )
  3. To make sure the loop finish i only want to match if the new id is lower than current one ISNULL(t.id_To, t.id) < ISNULL(t2.ID_to, t2.id). If the matching customer is already matched to someone else i take his ID_to. This ensure that if there's a long chain of matching IDs, we will follow the chain all the way.
  4. The WHILE @@ROWCOUNT > 0 trick is quite useful to loop stuff until there's something to do. But it's important to actually have a breaking condition otherwise the loop runs forever.

You can put the select inside the while loop to track what's going on with the #t-table

When the loop is done, every customer should either has ID_to = NULL, meaning he is the "master", or ID_to > 0 which means this customer is matched against another master customer

相关问题