SQL Server Using T-SQL Xquery to extract data from XML with low-ascii represented like "&#x1E"? Or just use T-SQL?

yzxexxkh  于 2023-03-07  发布在  其他
关注(0)|答案(3)|浏览(117)

I have strings inside XML that contain low-order ASCII. Specifically, the EDI contains special characters like char(28), char(29), char(30) (aka File Separator/Group Separator/Record Separator). The way the devs encode it in the XML string, these characters get changed to "", where the x1C is the hex representation (char(28), file separator). We then use a CLR to convert it back to the original text to store in a table, with the low-ascii characters in it (we don't store the XML, just the data within, as fields)

However, I need to do this in Azure SQL DB, which doesn't support CLR.

I'm unsure if the best/fastest way to do this is to write the XQuery to extract it in the correct form, or just use XQuery "value" and post-process it with T-SQL: nested REPLACE, CROSS/OUTER APPLY with REPLACE, or even possibly STUFF (TRANSLATE won't work because it's multiple characters).

Looking online, I can see where other versions of XQuery support fn:replace or bin:decode-string, but it doesn't appear that SQL Server's XQuery supports these methods, so I'm trying to figure out how I can do this ("replace value of", maybe?).

DECLARE @xml_edi XML  = '<Bundle><RawData>ABCDE&amp;#x1E;&amp;#x1C;FGHIJK&amp;#x1D;LMNOP</RawData></Bundle>' 
SELECT x.y.value('./RawData[1]','varchar(max)')
from @xml_edi.nodes('/Bundle')x(y)

Expected results:

ABCDEFGHIJKLMNOP  --note that this shows all 3 as the same character, that's a limitation of the browser/editor

(ABCDE, then char(30) & char(28), then FGHIJK, then the low-ascii character 29, then LMNOP)

Some links I found while writing this:

mnowg1ta

mnowg1ta1#

The characters 28, 29, and 30 are not valid in XML 1.0, and this is the case whether they are represented directly or as character references ( &#28; , etc).

In XML 1.1 you can use these characters provided you write them as character references. But not many people support XML 1.1.

So your problem is that the data you think of as XML actually isn't, which makes it rather hard to process.

fsi0uk1n

fsi0uk1n2#

This is probably far from efficient, and it also assumes that no other escape sequences can appear in the varchar value obtained from the XML, such as having &amp;amp; due to having a double escaped ampersand ( & ). If, however, that is true, then we can likely do some varbinary (ab)use.

First, I get the value of the text from the XML. Then I split that string into a single character on each row, with use of GENERATE_SERIES and SUBSTRING and work out if the character is in an escape sequence by counting the number of amerpsand ( & ) and semicolon ( ; ) characters with different windows.

Then I check if there is a semicolon in that group; if there is, it's an escape group. Finally, I reaggregate the string, omitting certain characters in the escape sequences, however, I aggregate the string in a binary form; converting non-escape sequences to a binary value (such as '41' for 'a' ) while leaving escaped sequences as they were. I then convert that value to an actual varbinary and then CONVERT it back to a varchar . Yuck, but it works:

DECLARE @xml_edi XML  = '<Bundle><RawData>ABCDE&amp;#x1E;&amp;#x1C;FGHIJK&amp;#x1D;LMNOP</RawData></Bundle>';

WITH EscapedStrings AS(
    SELECT x.b.value('(./RawData/text())[1]','varchar(max)') AS EscapedString
    FROM @xml_edi.nodes('/Bundle')x(b)),
Groups AS (
    SELECT ES.EscapedString,
           SS.C,
           GS.Value,
           COUNT(CASE SS.C WHEN '&' THEN 1 END) OVER (PARTITION BY ES.EscapedString ORDER BY GS.Value
                                                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) + 
           COUNT(CASE SS.C WHEN ';' THEN 1 END) OVER (PARTITION BY ES.EscapedString ORDER BY GS.Value
                                                      ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS Grp
    FROM EscapedStrings ES
         CROSS APPLY GENERATE_SERIES(CONVERT(bigint,1),LEN(ES.EscapedString),CONVERT(bigint,1)) GS
         CROSS APPLY (VALUES(CONVERT(char(1),SUBSTRING(ES.EscapedString,GS.value,1))))SS(C)),
EscapeGroups AS (
    SELECT G.EscapedString,
           G.C,
           G.Grp,
           G.Value,
           COUNT(CASE G.C WHEN ';' THEN 1 END) OVER (PARTITION BY G.EscapedString, G.Grp) AS EscapeGroup
    FROM Groups G)
SELECT CONVERT(varchar(MAX),CONVERT(varbinary(MAX),STRING_AGG(CASE EG.EscapeGroup WHEN 1 THEN EG.C
                                                                                  ELSE CONVERT(varchar(2),CONVERT(varbinary(1),EG.C),2)
                                                              END,'') WITHIN GROUP (ORDER BY EG.Value),2))
FROM EscapeGroups EG
WHERE NOT (EG.EscapeGroup = 1 AND EG.C IN ('&','#','x',';'))
GROUP BY EG.EscapedString;

I intentionally have partition and groups on (what is in effect) the value of the RawData element is so that if this needs to be used against a column in a table, it scales. Though, if 2 (or more) elements have the same value, then a different (unique) identifier would need to be used.

mwg9r5ms

mwg9r5ms3#

As Larnu suggested, here's the solutions in my edit posted as a separate answer.

DECLARE @xml_edi XML  = '<Bundle><RawData>ABCDE&amp;#x1E;&amp;#x1C;FGHIJK&amp;#x1D;LMNOP</RawData></Bundle>' 
--nested replace
SELECT REPLACE(REPLACE(REPLACE(REPLACE(x.y.value('./RawData[1]','nvarchar(max)'),'&#x1C;',CHAR(28)),'&#x1D;',CHAR(29)),'&#x1E;', char(30)),'&#x1F;',CHAR(31))
from @xml_edi.nodes('/Bundle')x(y)

--nested replace; cleaner? (uses https://bertwagner.com/posts/how-to-eliminate-ugly-nested-replace-functions/)
SELECT 
    s.Colors
FROM
    (SELECT x.y.value('./RawData[1]','nvarchar(max)') AS Colors FROM @xml_edi.nodes('/Bundle')x(y) ) c
    CROSS APPLY (SELECT REPLACE(c.Colors,'&#x1C;',CHAR(28)) AS Colors) r
    CROSS APPLY (SELECT REPLACE(r.Colors,'&#x1D;',CHAR(29)) AS Colors) g
    CROSS APPLY (SELECT REPLACE(g.Colors,'&#x1E;',CHAR(30)) AS Colors) b
    CROSS APPLY (SELECT REPLACE(b.Colors,'&#x1F;',CHAR(31)) AS Colors) s

相关问题