I have strings inside XML that contain low-order ASCII. Specifically, the EDI contains special characters like char(28), char(29), char(30) (aka File Separator/Group Separator/Record Separator). The way the devs encode it in the XML string, these characters get changed to "", where the x1C is the hex representation (char(28), file separator). We then use a CLR to convert it back to the original text to store in a table, with the low-ascii characters in it (we don't store the XML, just the data within, as fields)
However, I need to do this in Azure SQL DB, which doesn't support CLR.
I'm unsure if the best/fastest way to do this is to write the XQuery to extract it in the correct form, or just use XQuery "value" and post-process it with T-SQL: nested REPLACE, CROSS/OUTER APPLY with REPLACE, or even possibly STUFF (TRANSLATE won't work because it's multiple characters).
Looking online, I can see where other versions of XQuery support fn:replace or bin:decode-string, but it doesn't appear that SQL Server's XQuery supports these methods, so I'm trying to figure out how I can do this ("replace value of", maybe?).
DECLARE @xml_edi XML = '<Bundle><RawData>ABCDE&#x1E;&#x1C;FGHIJK&#x1D;LMNOP</RawData></Bundle>'
SELECT x.y.value('./RawData[1]','varchar(max)')
from @xml_edi.nodes('/Bundle')x(y)
Expected results:
ABCDEFGHIJKLMNOP --note that this shows all 3 as the same character, that's a limitation of the browser/editor
(ABCDE, then char(30) & char(28), then FGHIJK, then the low-ascii character 29, then LMNOP)
Some links I found while writing this:
- Function to convert hex to ascii in xquery
- SQL Nested Replace
- https://www.sqlshack.com/overview-of-the-sql-replace-function/ (TRANSLATE)
- https://dba.stackexchange.com/questions/201851/replace-character-without-using-looped-replace-function (interesting use of STUFF)
- https://bertwagner.com/posts/how-to-eliminate-ugly-nested-replace-functions/ (CROSS APPLY)
3条答案
按热度按时间mnowg1ta1#
The characters 28, 29, and 30 are not valid in XML 1.0, and this is the case whether they are represented directly or as character references (

, etc).In XML 1.1 you can use these characters provided you write them as character references. But not many people support XML 1.1.
So your problem is that the data you think of as XML actually isn't, which makes it rather hard to process.
fsi0uk1n2#
This is probably far from efficient, and it also assumes that no other escape sequences can appear in the
varchar
value obtained from the XML, such as having&amp;
due to having a double escaped ampersand (&
). If, however, that is true, then we can likely do some varbinary (ab)use.First, I get the value of the text from the XML. Then I split that string into a single character on each row, with use of
GENERATE_SERIES
andSUBSTRING
and work out if the character is in an escape sequence by counting the number of amerpsand (&
) and semicolon (;
) characters with different windows.Then I check if there is a semicolon in that group; if there is, it's an escape group. Finally, I reaggregate the string, omitting certain characters in the escape sequences, however, I aggregate the string in a binary form; converting non-escape sequences to a binary value (such as
'41'
for'a'
) while leaving escaped sequences as they were. I then convert that value to an actualvarbinary
and thenCONVERT
it back to avarchar
. Yuck, but it works:I intentionally have partition and groups on (what is in effect) the value of the
RawData
element is so that if this needs to be used against a column in a table, it scales. Though, if 2 (or more) elements have the same value, then a different (unique) identifier would need to be used.mwg9r5ms3#
As Larnu suggested, here's the solutions in my edit posted as a separate answer.