SQL Server Split a String by Seperator AND Length into multiple columns

hpcdzsge  于 2023-03-22  发布在  其他
关注(0)|答案(3)|浏览(157)

I'm doing a migration for a customer and setting up the data (cleaning, filtering, setting up views for mig jobs, etc.) on SQL Server 2017 (SSMS 18).

Now I'm stuck on following problem:

From the source system, I'll get a string containing customer addresses, with a length from 10-120 characters.

The destination system has max. 5 address columns (Address_1, Address_2, etc.) with a length of 35 characters (not changeable).

My first attempt was to simply cut it like Address_1 = 1-35 characters, Address_2 = 35-70, etc.

But the customer is not satisfied with cutting the string in the middle of a word.

Challenge is to cut the string after the last "Space" before exceeding 35 characters and continue like this.

Some example - given a source address

'Cecilia Chapman 711-2880 Nulla St. Mankato Mississippi 96522 (257) 563-7401'

I would like to get:

ADDRESS_1 = 'Cecilia Chapman 711-2880 Nulla'
ADDRESS_2 = 'St. Mankato Mississippi 96522'
ADDRESS_3 = '(257) 563-7401'`

Note: there is no pattern in the source system. It's a free text field, filled from different employees around the world.

Above is sample data from google since I'm not allowed to share address data of my customer.

What I already tried

CASE
    WHEN
        LEN(REPLACE(REPLACE(REPLACE(OE.ADDR,CHAR(10),' '),CHAR(13),' '),';',',')) > 35
        THEN LEFT(REPLACE(REPLACE(REPLACE(OE.ADDR,CHAR(10),' '),CHAR(13),' '),';',','),35 - CHARINDEX(REVERSE(' '), REVERSE(LEFT(OE.ADDR,35))))
        ELSE REPLACE(REPLACE(REPLACE(OE.ADDR,CHAR(10),' '),CHAR(13),' '),';',',')
END AS ADDRESS1,

CASE
    WHEN
        LEN(REPLACE(REPLACE(REPLACE(OE.ADDR,CHAR(10),' '),CHAR(13),' '),';',',')) > 35
        THEN SUBSTRING(REPLACE(REPLACE(REPLACE(OE.ADDR,CHAR(10),' '), CHAR(13),' '), ';', ','),37 - CHARINDEX(REVERSE('  '), REVERSE(LEFT(OE.ADDR, 35))), 35-CHARINDEX(REVERSE(' '), REVERSE(SUBSTRING(OE.ADDR, 37 - CHARINDEX(REVERSE(' '), REVERSE(LEFT(OE.ADDR, 35))), 35))))
END AS ADDRESS2,

CASE
WHEN
        LEN(REPLACE(REPLACE(REPLACE(OE.ADDR,CHAR(10),' '),CHAR(13),' '),';',',')) > (35 - CHARINDEX(REVERSE(' '), REVERSE(LEFT(OE.ADDR,35)))) + (35-CHARINDEX(REVERSE(' '), REVERSE(SUBSTRING(OE.ADDR,35 - CHARINDEX(REVERSE(' '), REVERSE(LEFT(OE.ADDR,35))),35))))
THEN
        SUBSTRING(REPLACE(REPLACE(REPLACE(OE.ADDR,CHAR(10),' '),CHAR(13),' '),';',','), (36 - CHARINDEX(REVERSE(' '), REVERSE(LEFT(OE.ADDR,35)))) + (36-CHARINDEX(REVERSE(' '), REVERSE(SUBSTRING(OE.ADDR,37 - CHARINDEX(REVERSE(' '), REVERSE(LEFT(OE.ADDR,35))),35)))) ,35 - CHARINDEX(REVERSE(' '), REVERSE(SUBSTRING(OE.ADDR, (36 - CHARINDEX(REVERSE(' '), REVERSE(LEFT(OE.ADDR,35)))) + (36-CHARINDEX(REVERSE(' '), REVERSE(SUBSTRING(OE.ADDR,37 - CHARINDEX(REVERSE(' '), REVERSE(LEFT(OE.ADDR,35))),35)))),35))))
ELSE
''
END AS ADDRESS3,`

I also tried to split it and thought about some "WHILE < 35" to put it back together, but the whole thing is giving me a headache.

SELECT ROW_ID, OE.ADDR, VALUE 
FROM IDTB_ORG_EXT OE
CROSS APPLY STRING_SPLIT(OE.ADDR, ' ')
ORDER BY LEN(OE.ADDR) DESC
tnkciper

tnkciper1#

You're likely going to need to achieve this with recursion, where you'll need to "loop" through each individual string, aggregating the length and "resetting" with you hit a value greater than 35. For a single value this won't be awfully slow, but if you need to do this on a large data set, it will likely be quite (awfully) slow.

Using this method gives you the following:

DECLARE @YourString varchar(8000) = 'Cecilia Chapman 711-2880 Nulla St. Mankato Mississippi 96522 (257) 563-7401';
WITH Split AS(
    SELECT *
    FROM STRING_SPLIT(@YourString,' ',1)),
Grps as (
    SELECT value,
           LEN(value) +1 AS Length,
           LEN(value) +1 AS RunningLength,
           ordinal,
           1 AS Grp
    FROM Split
    WHERE ordinal = 1
    UNION ALL
    SELECT S.value,
           LEN(S.value) +1 AS Length,
           CASE WHEN LEN(S.value) +1 + g.RunningLength > 35 THEN LEN(S.value) +1 ELSE LEN(S.value) +1 + g.RunningLength END AS RunningLength,
           S.ordinal,
           CASE WHEN LEN(S.value) +1 + g.RunningLength > 35 THEN G.Grp + 1 ELSE G.Grp END AS Grp
    FROM Split S
         JOIN Grps G ON S.ordinal = G.ordinal + 1)
SELECT STRING_AGG(value,' ') WITHIN GROUP (ORDER BY ordinal)
FROM Grps
GROUP BY Grp;

I use STRING_SPLIT and the ordinal parameter here, however, it was then noted that you are using 2017 after I wrote the solution in my environment, which does not support the ordinal parameter. You will need to use a different splitter function that provides oridinal positions, such as a JSON splitter, or delimitedsplit8K_LEAD .

Note that this solution does not give you the expected results, as your description and the result you expect don't match. The result this gives is (with the value lengths added):
| Value | Value Length |
| ------------ | ------------ |
| Cecilia Chapman 711-2880 Nulla St. | 34 |
| Mankato Mississippi 96522 (257) | 31 |
| 563-7401 | 8 |

If you need these in columns, rather than rows, you'll need to then pivot the data. I use conditional aggregation here:

CREATE TABLE dbo.YourTable (YourID int IDENTITY,
                            YourString varchar(8000))
INSERT INTO dbo.YourTable (YourString)
VALUES('Cecilia Chapman 711-2880 Nulla St. Mankato Mississippi 96522 (257) 563-7401'),
      ('Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.');

WITH Split AS(
    SELECT YT.YourID,
           YT.YourString,
           DS8KL.Item AS value,
           DS8KL.ItemNumber AS ordinal
    FROM dbo.YourTable YT
         --Demonstrates solution using a different function
         CROSS APPLY dbo.DelimitedSplit8K_LEAD(YT.YourString,' ') DS8KL),
Grps as (
    SELECT YourID,
           value,
           LEN(value) +1 AS Length,
           LEN(value) +1 AS RunningLength,
           ordinal,
           1 AS Grp
    FROM Split
    WHERE ordinal = 1
    UNION ALL
    SELECT S.YourID,
           S.value,
           LEN(S.value) +1 AS Length,
           CASE WHEN LEN(S.value) +1 + g.RunningLength > 35 THEN LEN(S.value) +1 ELSE LEN(S.value) +1 + g.RunningLength END AS RunningLength,
           S.ordinal,
           CASE WHEN LEN(S.value) +1 + g.RunningLength > 35 THEN G.Grp + 1 ELSE G.Grp END AS Grp
    FROM Split S
         JOIN Grps G ON S.YourID = G.YourID
                    AND  S.ordinal = G.ordinal + 1),
StringAggs AS(
    SELECT YourID,
           STRING_AGG(value,' ') WITHIN GROUP (ORDER BY ordinal) AS Address,
           Grp
    FROM Grps
    GROUP BY YourID,
             Grp)
SELECT YourID,
       MAX(CASE Grp WHEN 1 THEN Address END) AS Address1,
       MAX(CASE Grp WHEN 2 THEN Address END) AS Address2,
       MAX(CASE Grp WHEN 3 THEN Address END) AS Address3
FROM StringAggs SA
GROUP BY YourID;
YourIDAddress1Address2Address3
1Cecilia Chapman 711-2880 Nulla St.Mankato Mississippi 96522 (257)563-7401
2Lorem ipsum dolor sit amet,consectetur adipiscing elit, seddo eiusmod tempor incididunt ut
iyfjxgzm

iyfjxgzm2#

When asking questions like this providing example DDL/DML is important. Failing to do so will directly affect the quality of answer you get, as well as the votes your question receives. I prefer to use table variables for this, as there's less to clean up. Consider:

DECLARE @Address TABLE (ID INT IDENTITY, Address NVARCHAR(MAX));
INSERT INTO @Address (Address) VALUES 
('John Smith 123 Anystreet Road Newburgh New York 12354 (201) 555-1234'),
('Cecilia Chapman 711-2880 Nulla St. Mankato Mississippi 96522 (257) 563-7401');

Text/String manipulation is not something a RDBMS is really good at, but in a pinch it can be done. Consider:

SELECT ID, LEFT(Address,35-CHARINDEX(' ',REVERSE(LEFT(Address,35)))) AS Address1, 
LEFT(RIGHT(Address,LEN(Address)-(35-CHARINDEX(' ',REVERSE(LEFT(Address,35))))-1),CHARINDEX('(',RIGHT(Address,LEN(Address)-(35-CHARINDEX(' ',REVERSE(LEFT(Address,35))))-1))-1) AS RemaningAddress,
RIGHT(RIGHT(Address,LEN(Address)-(35-CHARINDEX(' ',REVERSE(LEFT(Address,35))))-1),1+LEN(RIGHT(Address,LEN(Address)-(35-CHARINDEX(' ',REVERSE(LEFT(Address,35))))-1))-CHARINDEX('(',RIGHT(Address,LEN(Address)-(35-CHARINDEX(' ',REVERSE(LEFT(Address,35))))-1))) AS Phone
  FROM @Address

There's a few things going on here. First we're grabbing the left 35 characters (per your description) and then reversing them to look for the last (but first in the reversed string) instance of a white space. Once we know where that is we know where to terminate the first split. Then we use the same technique, but this time to take all the characters right of that position. Finally we look for a open paren in the remaining (right side) string so we can also split out the phone number.

IDAddress1RemaningAddressPhone
1John Smith 123 Anystreet RoadNewburgh New York 12354(201) 555-1234
2Cecilia Chapman 711-2880 Nulla St.Mankato Mississippi 96522(257) 563-7401
6jygbczu

6jygbczu3#

Without using SPLIT_STRING, I came up with this. You can pass in the input string, the delimiter to split the string, and the cut off length that each substring cannot exceed. You can also pass in a label if you don't want the output rows to get labelled as "address n".

CREATE FUNCTION dbo.udf_VariableStringSplit
(@InputString VARCHAR(4000)
,@SubstringLabel VARCHAR(50)
,@Delimiter CHAR(1)
,@CutOff INT
)

RETURNS @t TABLE (SubstringName VARCHAR(50), TheSubstring VARCHAR(255))

AS

BEGIN

SET @InputString = RTRIM(LTRIM(@InputString))

;WITH -- GET RECORDSET CONTAINING EACH SUBSTRING WITH START AND END POSITIONS
a AS (
    SELECT n=0, i=-1, j=0 
    UNION ALL 
    SELECT n+1, j, CHARINDEX(@delimiter, @InputString, j+1
    ) 
    FROM a 
    WHERE j > i
    )
,b AS ( -- GET THE SUBSTRING VALUES
    SELECT n, i, IIF(n>1,@delimiter,'')+SUBSTRING(@InputString, i+1, IIF(j>0, j, LEN(@InputString)+1)-i-1) SubString 
    FROM a 
    WHERE i >= 0
    )
,c AS ( -- GET THE RUNNING LENGTH OF THE STRING AND CREATE A CUT OFF GROUP BY DIVIDING THE RUNNING LENGTH BY THE CUT OFF VALUE PARAMETER
    SELECT n,SubString,i
    , SUM(LEN(SubString)) OVER(ORDER BY n ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS RunningLength  
    ,(SUM(LEN(SubString)) OVER(ORDER BY n ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) ) / @cutoff AS StringGroup
    FROM b
    )
,d AS ( -- GET A ROW NUMBER FOR EACH SUBSTRING IN EACH GROUP
    SELECT n,SubString,i,RunningLength,StringGroup, ROW_NUMBER() OVER (PARTITION BY StringGroup ORDER BY n) AS GroupRowNum
    FROM c
)
,e
AS -- RECURSIVELY PUT THE SUBSTRINGS WITHIN EACH GROUP BACK TOGETHER
(
    SELECT 
        0 as n
        ,CAST('' AS VARCHAR(255)) as [SubString]
        ,d.StringGroup
        ,d.RunningLength
        ,d.GroupRowNum
    FROM d
    WHERE 1=1
    AND GroupRowNum = 1
    UNION ALL
    SELECT 
        e.n+1
        ,CAST(e.[SubString] +d.[SubString] AS VARCHAR(255))
        ,e.StringGroup
        ,e.RunningLength
        ,e.GroupRowNum + 1
    FROM 
    d JOIN e e ON d.StringGroup = e.StringGroup 
    WHERE e.GroupRowNum = d.GroupRowNum

)

,f AS -- FOR EACH GROUP, GET A ROW NUMBER FOR EACH CONCATENATED ROW AS WE ONLY WANT THE LAST ONE
(
    SELECT e.*
    ,ROW_NUMBER() OVER (PARTITION BY e.StringGroup ORDER BY e.GroupRowNum DESC) as ConcatRowNum
    FROM e
)
,g AS -- ADD A LABEL TO EACH ROW AND ONLY RETURN THE LAST CONCATENATION PER GROUP
(
  SELECT 
  @substringlabel+CAST(ROW_NUMBER() OVER (ORDER BY RunningLength) AS VARCHAR(10)) AS SubStringName
  ,f.[SubString]
  FROM f
  WHERE f.concatrownum = 1
)
INSERT INTO @t(SubstringName,TheSubstring)
SELECT * FROM g

RETURN

END

to run

DECLARE @TheString VARCHAR(255) = 'Cecilia Chapman 711-2880 Nulla St. Mankato Mississippi 96522 (257) 563-7401'
DECLARE @SubstringLabel VARCHAR(50) = 'Address'
DECLARE @Delimiter CHAR(1) = ' '
DECLARE @Cutoff INT = 30

SELECT * from dbo.udf_VariableStringSplit(@thestring,@substringlabel,@delimiter,@cutoff)

or CROSS APPLY it to a table of existing addresses/string

SELECT 
TheString
,StringSplit.*
FROM MyTable
CROSS APPLY dbo.udf_VariableStringSplit(MyTable.MyString,@substringlabel,@delimiter,@cutoff)StringSplit

相关问题