SQL Server Compare two strings with dash issue

cbjzeqam 于 2023-11-16 发布在其他

关注(0)|答案(3)|浏览(107)

From the T-SQL code below you can see that dash - on the 3rd place of a nvarchar string does not equal dash on 3rd place of a varchar string.

if substring(N'BV‐D',3,1) <> substring('BV‐D',3,1)
    select substring(N'BV‐D',3,1) as nchar
            , substring('BV‐D',3,1) as char
            , ASCII(substring(N'BV‐D',3,1)) as ncode
            , ASCII(substring('BV‐D',3,1)) as code
            , 'Non-Equal' as status

output:

nchar	char	ncode	code	status
‐	-	45	45	Non-Equal

If I will retype '-' or add N, or remove N in both cases strings will become equal. ASCII code is 45 in both cases.

Why are they not equal?

Initially it was not typed manually, but copied from different sources.

sql-server

来源：https://stackoverflow.com/questions/77372646/compare-two-strings-with-dash-issue

3条答案

按热度按时间

bqjvbblv1#

Because they aren't equal. You have a string which contains non-ASCII charcters that you're putting into a varchar and then (implicitly) converting that value back to an nvarchar . By that point it's too late and information has been lost.

Let's look at these values from a varbinary perspective:

SELECT CONVERT(varbinary(10),N'BV‐D') AS NvarcharBinary,
       CONVERT(varbinary(10),'BV‐D') AS VarcharBinary,
       CONVERT(varbinary(10),CONVERT(nvarchar(5),'BV‐D')) AS VarcharNvarcharBinary;

This gives the results:

NvarcharBinary	VarcharBinary	VarcharNvarcharBinary
0x4200560010204400	0x42562D44	0x420056002D004400

Notice the third character for the first string has the representation 0x1020 . Such a character cannot be represented in a varchar , and when you attempt to store it in one it is implicitly converted to a similar character, 0x2D . Then when you convert that back to an nvarchar you get 0x2D00 . (Note, not all characters are converted to similar characters. Those that aren't are replaced with ? .)

Unsurprisingly, 0x2D00 and 0x1020 are not the same, and so the strings are not equal.

赞(0）回复(0）举报 2023-11-16

sqyvllje2#

This is the ASCII-45 HYPHEN-MINUS. May I comment that this is the more common hyphen. It's what appears if I press the hyphen key in my US keyboard.

This is the Unicode-8208 HYPHEN, well beyong the ASCII Range. It is what your 'BV‐D' string contains. If you don't use a unicode string with N , sql server converts it to the similar ASCII-45 character. Still, they differ, and that's the reason you get an inequality.

赞(0）回复(0）举报 2023-11-16

km0tfn4u3#

That's not the dash you assume it is, it's a different Unicode character. The query is using the wrong function though, ASCII , which does not return the actual Unicode code.

If you try

select UNICODE(N'‐') as Hyphen,
       UNICODE(N'-') as Minus,
       ASCII('‐') As Ascii_Dash,
       ASCII(N'-') as Ascii_Minus

You'll get

Hyphen  Minus   Ascii_Dash  Ascii_Minus
8208    45      45          45

8208 is the Hyphen character . That's the character generated by eg HTML's &dash; , used to separate words and syllable's. It's not the same as the minus character on our keyboards.

There are a lot of dash characters used in typography and hence Unicode. Full width, half width, breaking or non-breaking etc.

Fixing this

You can use TRANSLATE to map individual characters :

select translate(N'BV‐D', N'‐', N'-')

This trick is used in SQL Server to normalize equivalent Unicode characters because T-SQL doesn't have a NORMALIZE function. This wouldn't help here anyway, because HYPHEN-MINUS and HYPHEN aren't considered equivalent. You'd have to add all possible options though, as the example in the linked question shows :

select translate(last_name, 'éêëèàäçïîìôöòûù', 'eeeeaaciiiooouu') from employees

赞(0）回复(0）举报 2023-11-16

我来回答

SQL Server Compare two strings with dash issue

3条答案

相关问题

热门标签

最新问答