SQL Server Full-Text Search Issue: Inconsistent Results Between CONTAINS and LIKE Queries with word 'za'

mznpcxlj  于 2023-08-02  发布在  SQL Server
关注(0)|答案(1)|浏览(146)

For the first time, I'm using full-text search. I have SQL Server 2022 Standard installed, and I've set up full-text search. I created a full-text catalog, [fullTextCatalog_CjeniciArtikli] , with ACCENT_SENSITIVITY = OFF .

I have a table named CjeniciArtikli with a column, Naziv , which I've created a full-text index for. The language is set to Croatian, and accent_sensitive is set to false.

When I execute the query below, it doesn't return any results, although it should:

SELECT *
FROM CjeniciArtikli
WHERE CONTAINS(Naziv, 'orink AND tinta AND za AND canon');

However, when I use the LIKE operator with a similar query, it returns 64 records, which is correct:

SELECT *
FROM CjeniciArtikli
WHERE Naziv LIKE '%orink%' AND Naziv LIKE '%tinta%' AND Naziv LIKE '%za%' AND Naziv LIKE '%canon%';

If I modify the first query by removing the word 'za', it returns the records as expected (just like when using LIKE), even though some of those records contain the word 'za'. For example, the first result returned is

Orink tinta za Canon, CLI-521GY, siva, OR-CCLI521G with chi

which contains word 'za'.

SELECT *
FROM CjeniciArtikli
WHERE CONTAINS(Naziv, 'orink AND tinta AND canon'); --works

When I tried searching for just the word 'za', it didn't return any results:

SELECT *
FROM CjeniciArtikli
WHERE CONTAINS(Naziv, 'za');

However, when I used LIKE, it returned 5,877 records:

SELECT *
FROM CjeniciArtikli
WHERE Naziv LIKE '% za %';

I initially thought the issue might be due to two-letter words, but when I tested a query with the word 'sa', it worked without any problems.

I haven't been able to find a solution on the internet or through AI language models, including ChatGPT.

If anyone knows what might be causing this issue, I would greatly appreciate any assistance. I've tried several resources, but without success.

vlju58qv

vlju58qv1#

You can get some insight into this from running

DECLARE @system_stoplist_id INT = 0;

DECLARE @lcid INT = (SELECT lcid
   FROM   sys.syslanguages
   WHERE  alias = 'Croatian');

SELECT *
FROM   sys.dm_fts_parser ('orink AND tinta AND za AND canon', @lcid, @system_stoplist_id, 0)

This returns

keywordgroup_idphrase_idoccurrencespecial_termdisplay_termexpansion_typesource_term
0x006F00720069006E006B101Exact Matchorink0orink
0x00740069006E00740061201Exact Matchtinta0tinta
0x007A0061301Noise Wordza0za
0x00630061006E006F006E401Exact Matchcanon0canon

So by default "za" is treated as a stop word because it is so common (in Croatian).

If you need this not to be treated as a stop word then you will need to customise the stop words used rather than just accepting the system default list.

TBH I find it somewhat surprising that a CONTAINS query with AND stop_word would cause it to return no results rather than just have that component ignored but I can repro this behaviour too.

I guess it does make sense because for every row it is "Unknown" whether or not "za" exists in the value and TRUE AND UNKNOWN is UNKNOWN - not TRUE

The CONTAINS query does at least print out
Informational: The full-text search condition contained noise word(s).

to the messages tab to somewhat alert you.

相关问题