Find non-ASCII characters in varchar columns using SQL Server

iq0todco · posted on 2024-01-05 in SQL Server

How can rows with non-ASCII characters be returned using SQL Server?
If you can show how to do it for one column, that would be great.

I am doing something like this now, but it is not working:

    select *
    from Staging.APARMRE1 as ar
    where ar.Line like '%[^!-~ ]%'

For extra credit, if it can span all varchar columns in a table, that would be outstanding! In this solution, it would be nice to return three columns:

  • The identity field for that record. (This will allow the whole record to be reviewed with another query.)
  • The column name
  • The text with the invalid character
    Id  | FieldName | InvalidText
    ----+-----------+-------------------
    25  | LastName  | Solís
    56  | FirstName | François
    100 | Address1  | 123 Ümlaut street

Invalid characters would be any outside the range of SPACE (32 decimal) through ~ (127 decimal).
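A likely reason the attempt above returns nothing is collation: under an accent- and case-insensitive collation, characters such as 'é' can compare equal to letters inside the character class and escape the `[^...]` test. A hedged sketch of the same query forced to a binary collation (same table and column names as above):

```sql
-- Sketch, not a verified fix: force a binary collation so the character
-- class is evaluated by code point rather than by linguistic equivalence.
select *
from Staging.APARMRE1 as ar
where ar.Line like '%[^ -~]%' collate Latin1_General_BIN;
```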


tvmytwxo1#

Here is a solution for the single-column search using PATINDEX.
It also displays the StartPosition, InvalidCharacter and ASCII code.

    select line,
           patindex('%[^ !-~]%' COLLATE Latin1_General_BIN, Line) as [Position],
           substring(line, patindex('%[^ !-~]%' COLLATE Latin1_General_BIN, Line), 1) as [InvalidCharacter],
           ascii(substring(line, patindex('%[^ !-~]%' COLLATE Latin1_General_BIN, Line), 1)) as [ASCIICode]
    from staging.APARMRE1
    where patindex('%[^ !-~]%' COLLATE Latin1_General_BIN, Line) > 0
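PATINDEX as used here reports only the first invalid character per row. A hedged extension to list every occurrence pairs the string with a row-number source (the sys.all_objects trick and the 4000-row cap are assumptions; any tally/numbers table works):

```sql
-- Sketch: one output row per invalid character, assuming values are at
-- most 4000 characters and sys.all_objects has enough rows to tally.
select ar.Line,
       n.n as Position,
       substring(ar.Line, n.n, 1) as InvalidCharacter,
       unicode(substring(ar.Line, n.n, 1)) as CodePoint
from staging.APARMRE1 as ar
join (select top (4000) row_number() over (order by (select null)) as n
      from sys.all_objects) as n
    on n.n <= len(ar.Line)
where substring(ar.Line, n.n, 1) like '[^ !-~]' collate Latin1_General_BIN
order by ar.Line, n.n;
```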

tuwxkamq2#

I've been running this bit of code with success:

    declare @UnicodeData table (
        data nvarchar(500)
    )

    insert into @UnicodeData
    values (N'Horse�')
         , (N'Dog')
         , (N'Cat')

    select data
    from @UnicodeData
    where data collate LATIN1_GENERAL_BIN != cast(data as varchar(max))

Which works well for known columns.
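One caveat (my reading, not part of the original answer): this trick flags characters that cannot be represented in the column's code page, not everything above 127. The cast maps non-convertible characters to '?', which then fails the binary comparison, while a character like 'é' that exists in the Latin1 code page converts cleanly and slips through. A small sketch of the mechanism:

```sql
-- Sketch, assuming a Latin1 default code page: only a value containing a
-- character missing from the code page fails the binary comparison.
select v.data,
       flagged = case when v.data collate Latin1_General_BIN
                           != cast(v.data as varchar(max))
                      then 1 else 0 end
from (values (N'Δelta'), (N'résumé'), (N'plain')) as v(data);
-- Under a Latin1 code page, only N'Δelta' would be flagged.
```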

For extra credit, I wrote this quick script to search all nvarchar columns in a given table for Unicode characters.

    declare
        @sql varchar(max) = ''
       ,@table sysname = 'mytable' -- enter your table here

    ;with ColumnData as (
        select
            RowId = row_number() over (order by c.COLUMN_NAME)
           ,c.COLUMN_NAME
           ,ColumnName = '[' + c.COLUMN_NAME + ']'
           ,TableName = '[' + c.TABLE_SCHEMA + '].[' + c.TABLE_NAME + ']'
        from INFORMATION_SCHEMA.COLUMNS c
        where c.DATA_TYPE = 'nvarchar'
          and c.TABLE_NAME = @table
    )
    select @sql = @sql
        + 'select FieldName = ''' + c.ColumnName + ''', InvalidCharacter = [' + c.COLUMN_NAME + '] '
        + 'from ' + c.TableName + ' '
        + 'where ' + c.ColumnName + ' collate LATIN1_GENERAL_BIN != cast(' + c.ColumnName + ' as varchar(max)) '
        + case when c.RowId <> (select max(RowId) from ColumnData) then ' union all ' else '' end
        + char(13)
    from ColumnData c

    -- check
    -- print @sql
    exec (@sql)

I'm not a fan of dynamic SQL, but it does have its uses for exploratory queries like this.


ztyzrc3y3#

try something like this:

    DECLARE @YourTable table (PK int, col1 varchar(20), col2 varchar(20), col3 varchar(20));
    INSERT @YourTable VALUES (1, 'ok','ok','ok');
    INSERT @YourTable VALUES (2, 'BA'+char(182)+'D','ok','ok');
    INSERT @YourTable VALUES (3, 'ok',char(182)+'BAD','ok');
    INSERT @YourTable VALUES (4, 'ok','ok','B'+char(182)+'AD');
    INSERT @YourTable VALUES (5, char(182)+'BAD','ok',char(182)+'BAD');
    INSERT @YourTable VALUES (6, 'BAD'+char(182),'B'+char(182)+'AD','BAD'+char(182)+char(182)+char(182));

    --if you have a Numbers table use that, otherwise make one using a CTE
    WITH AllNumbers AS
    (   SELECT 1 AS Number
        UNION ALL
        SELECT Number+1
        FROM AllNumbers
        WHERE Number < 1000
    )
    SELECT
        pk, 'Col1' AS BadValueColumn, CONVERT(varchar(20), col1) AS BadValue
        --make the XYZ in convert(varchar(XYZ), ...) the largest value of col1, col2, col3
    FROM @YourTable y
        INNER JOIN AllNumbers n ON n.Number <= LEN(y.col1)
    WHERE ASCII(SUBSTRING(y.col1, n.Number, 1)) < 32 OR ASCII(SUBSTRING(y.col1, n.Number, 1)) > 127
    UNION
    SELECT
        pk, 'Col2' AS BadValueColumn, CONVERT(varchar(20), col2) AS BadValue
    FROM @YourTable y
        INNER JOIN AllNumbers n ON n.Number <= LEN(y.col2)
    WHERE ASCII(SUBSTRING(y.col2, n.Number, 1)) < 32 OR ASCII(SUBSTRING(y.col2, n.Number, 1)) > 127
    UNION
    SELECT
        pk, 'Col3' AS BadValueColumn, CONVERT(varchar(20), col3) AS BadValue
    FROM @YourTable y
        INNER JOIN AllNumbers n ON n.Number <= LEN(y.col3)
    WHERE ASCII(SUBSTRING(y.col3, n.Number, 1)) < 32 OR ASCII(SUBSTRING(y.col3, n.Number, 1)) > 127
    ORDER BY 1
    OPTION (MAXRECURSION 1000);

OUTPUT:

    pk          BadValueColumn BadValue
    ----------- -------------- --------------------
    2           Col1           BAD
    3           Col2           BAD
    4           Col3           BAD
    5           Col1           BAD
    5           Col3           BAD
    6           Col1           BAD
    6           Col2           BAD
    6           Col3           BAD¶¶¶

    (8 row(s) affected)

ufj5ltwl4#

This script searches for non-ASCII characters in one column. It generates a string of all valid characters, here code points 32 to 127. Then it searches for rows that don't match the list:

    declare @str varchar(256);  -- 96 characters, each prefixed by the escape character,
                                -- need 192 bytes; varchar(128) would silently truncate
    declare @i int;
    set @str = '';
    set @i = 32;
    while @i <= 127
    begin
        set @str = @str + '|' + char(@i);
        set @i = @i + 1;
    end;

    select col1
    from YourTable
    where col1 like '%[^' + @str + ']%' escape '|';
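Since the accepted characters here are one contiguous run (code points 32 through 127), the loop can be avoided entirely by writing the class as a range; a binary collation matters so the range is evaluated by code point (a sketch, using the same hypothetical YourTable/col1 names):

```sql
-- Sketch: space through char(127) as a single range, compared by code point.
select col1
from YourTable
where col1 collate Latin1_General_BIN like '%[^ -' + char(127) + ']%';
```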

cidc1ykv5#

Running the various solutions on some real-world data (12M rows, varchar length ~30, around 9k dodgy rows, no full-text index in play), the patindex solution is the fastest, and it also selects the most rows.

(I pre-ran km. to set the cache to a known state, ran the 3 processes, and finally ran km again; the last 2 runs of km gave times within 2 seconds.)

patindex solution by Gerhard Weiss -- Runtime 0:38, returns 9144 rows

    select dodgyColumn from myTable fcc
    WHERE patindex('%[^ !-~]%' COLLATE Latin1_General_BIN, dodgyColumn) > 0

the substring-numbers solution by MT. -- Runtime 1:16, returned 8996 rows

    select dodgyColumn from myTable fcc
    INNER JOIN dbo.Numbers32k dn ON dn.number < len(fcc.dodgyColumn)
    WHERE ASCII(SUBSTRING(fcc.dodgyColumn, dn.Number, 1)) < 32
       OR ASCII(SUBSTRING(fcc.dodgyColumn, dn.Number, 1)) > 127

udf solution by Deon Robertson -- Runtime 3:47, returns 7316 rows

    select dodgyColumn
    from myTable
    where dbo.udf_test_ContainsNonASCIIChars(dodgyColumn, 1) = 1

ruoxqz4g6#

There is a user-defined function available on the web, 'Parse Alphanumeric'. Google "UDF parse alphanumeric" and you should find the code for it. This user-defined function removes all characters that don't fit between 0-9, a-z, and A-Z.

    Select *
    from Staging.APARMRE1 ar
    where dbo.udf_parsealpha(ar.last_name) <> ar.last_name

That should bring back any records that have a last_name with invalid chars for you... though your bonus-points question is a bit more of a challenge. I think a case statement could handle it. This is a bit of pseudocode; I'm not entirely sure it would work.

    Select id,
           case when dbo.udf_parsealpha(ar.last_name)  <> ar.last_name  then 'last name'
                when dbo.udf_parsealpha(ar.first_name) <> ar.first_name then 'first name'
                when dbo.udf_parsealpha(ar.Address1)   <> ar.Address1   then 'Address1'
           end,
           case when dbo.udf_parsealpha(ar.last_name)  <> ar.last_name  then ar.last_name
                when dbo.udf_parsealpha(ar.first_name) <> ar.first_name then ar.first_name
                when dbo.udf_parsealpha(ar.Address1)   <> ar.Address1   then ar.Address1
           end
    from Staging.APARMRE1 ar
    where dbo.udf_parsealpha(ar.last_name)  <> ar.last_name
       or dbo.udf_parsealpha(ar.first_name) <> ar.first_name
       or dbo.udf_parsealpha(ar.Address1)   <> ar.Address1

I wrote this in the forum post box, so I'm not quite sure it will function as-is, but it should be close. I'm also not quite sure how it will behave if a single record has two fields with invalid chars.

As an alternative, you should be able to change the from clause away from a single table and into a subquery that looks something like:

    select id, fieldname, value from (
        Select id, 'last_name' as fieldname, last_name as value
        from Staging.APARMRE1 ar
        Union
        Select id, 'first_name' as fieldname, first_name as value
        from Staging.APARMRE1 ar
        ---(and repeat unions for each field)
    ) as t  -- the derived table needs an alias
    where dbo.udf_parsealpha(value) <> value

The benefit here is that for every column you only need to extend the union, whereas the case-statement version of this script repeats each comparison three times per column.
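Another way to get the same shape without repeating the unions (a sketch using the question's table and the udf_parsealpha function from this answer): CROSS APPLY (VALUES ...) unpivots the columns in one pass, so each column is listed exactly once:

```sql
-- Sketch: unpivot the three columns, then filter once on the unpivoted value.
select ar.id, v.fieldname, v.value
from Staging.APARMRE1 as ar
cross apply (values ('last_name',  ar.last_name),
                    ('first_name', ar.first_name),
                    ('Address1',   ar.Address1)) as v(fieldname, value)
where dbo.udf_parsealpha(v.value) <> v.value;
```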


o3imoua47#

To find which field has invalid characters:

    SELECT * FROM Staging.APARMRE1 FOR XML AUTO, TYPE

You can test it with this query:

    SELECT top 1 'char 31: ' + char(31) + ' (hex 0x1F)' field
    from sysobjects
    FOR XML AUTO, TYPE

The result will be:

Msg 6841, Level 16, State 1, Line 3 FOR XML could not serialize the data for node 'field' because it contains a character (0x001F) which is not allowed in XML. To retrieve this data using FOR XML, convert it to binary, varbinary or image data type and use the BINARY BASE64 directive.

This is very useful when you write XML files and get invalid-character errors when validating them.


qnyhuwrf8#

Here is a UDF I built to detect columns with extended ASCII characters. It is quick, and you can extend the character set you want to check. The second parameter allows you to switch between checking anything outside the standard character set or allowing an extended set:

    create function [dbo].[udf_ContainsNonASCIIChars]
    (
        @string nvarchar(4000),
        @checkExtendedCharset bit
    )
    returns bit
    as
    begin
        declare @pos int = 1;  -- SUBSTRING positions are 1-based; starting at 0
                               -- would skip checking the last character
        declare @char varchar(1);
        declare @return bit = 0;

        while @pos <= len(@string)
        begin
            select @char = substring(@string, @pos, 1)

            if ascii(@char) < 32 or ascii(@char) > 126
            begin
                if @checkExtendedCharset = 1
                begin
                    if ascii(@char) not in (9,124,130,138,142,146,150,154,158,160,170,176,180,181,183,184,185,186,192,193,194,195,196,197,199,200,201,202,203,204,205,206,207,209,210,211,212,213,214,216,217,218,219,220,221,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,248,249,250,251,252,253,254,255)
                    begin
                        select @return = 1;
                        select @pos = len(@string) + 1  -- found one; stop scanning
                    end
                    else
                    begin
                        select @pos = @pos + 1
                    end
                end
                else
                begin
                    select @return = 1;
                    select @pos = len(@string) + 1
                end
            end
            else
            begin
                select @pos = @pos + 1
            end
        end

        return @return;
    end

USAGE:

    select Address1
    from PropertyFile_English
    where dbo.udf_ContainsNonASCIIChars(Address1, 1) = 1

mm5n2pyu9#

I took Gerhard Weiss's script and made it my own, as my business scenario was pretty simple and I only needed to update 10K rows.

    --CONTAINS() function requires a full text catalog
    --interesting to me is that CONTAINS() can find the ASCII characters, but REPLACE() cannot
    WITH cte_AsciiCharacterList AS
    (
        select LineID
             , LineSyntax
             , Position = patindex('%[^ !-~]%' COLLATE Latin1_General_BIN, LineSyntax)
             , InvalidCharacter = substring(LineSyntax, patindex('%[^ !-~]%' COLLATE Latin1_General_BIN, LineSyntax), 1)
             , ASCIICode = ascii(substring(LineSyntax, patindex('%[^ !-~]%' COLLATE Latin1_General_BIN, LineSyntax), 1))
             , LineSyntaxPart1 = SUBSTRING(LineSyntax, 1, patindex('%[^ !-~]%' COLLATE Latin1_General_BIN, LineSyntax) - 1)
             , LineSyntaxPart2 = SUBSTRING(LineSyntax, patindex('%[^ !-~]%' COLLATE Latin1_General_BIN, LineSyntax) + 3, 10000)
             , NewLineSyntax = SUBSTRING(LineSyntax, 1, patindex('%[^ !-~]%' COLLATE Latin1_General_BIN, LineSyntax) - 1) + '-' + SUBSTRING(LineSyntax, patindex('%[^ !-~]%' COLLATE Latin1_General_BIN, LineSyntax) + 3, 10000)
        FROM stg.TableName
        WHERE Contains(LineSyntax, '"ΓÇö"')
    )
    UPDATE x  -- update through the alias so the join target is unambiguous
    SET LineSyntax = y.NewLineSyntax
    FROM stg.TableName x
    INNER JOIN cte_AsciiCharacterList y on x.LineID = y.LineID
