Given a SQL Server table with one column [SearchableDescription] included in the full text search catalog/index with example data:
- apple banana cherry
- apple banana cherry grape
- apple banana cherry grape yam
- apple banana cherry grape yam zucchini
- ...
Then we search using ContainsTable() like:
declare
@aboutPredicate nvarchar(4000) =
N'IsAbout(
cherry weight (0.5),
grape weight (0.5)
)';
select *
from TheTable t
join ContainsTable(TheTable, SearchableDescription, @aboutPredicate) ct
-- Key is a reserved word in T-SQL, so the column name must be bracketed
on ct.[Key] = t.RowId
The problem is with the [Rank] column output. When the input weights are >= 0.3, as shown above, rows containing both cherry and grape are often given a lower rank than rows containing only one of cherry or grape.
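To see this effect side by side, a diagnostic query along the following lines may help. It is only a sketch: it reuses the TheTable, RowId and SearchableDescription names from the question, and the HasCherry and HasGrape columns are hypothetical helpers based on a plain LIKE substring match, which is not the same thing as full-text word matching.

```sql
-- Sketch only: assumes the table, key column and full-text index from the question.
declare @aboutPredicate nvarchar(4000) =
    N'IsAbout(cherry weight (0.5), grape weight (0.5))';

select
    t.RowId,
    ct.[Rank],
    -- Hypothetical helper flags: substring match, not full-text word match
    case when t.SearchableDescription like N'%cherry%' then 1 else 0 end as HasCherry,
    case when t.SearchableDescription like N'%grape%'  then 1 else 0 end as HasGrape
from TheTable t
join ContainsTable(TheTable, SearchableDescription, @aboutPredicate) ct
    on ct.[Key] = t.RowId
order by ct.[Rank] desc;
```

Rows where both flags are 1 but that sort below rows with a single flag are the cases in question.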
However, if the weights are lowered to about 0.1, like:
declare
@aboutPredicate nvarchar(4000) =
N'IsAbout(
cherry weight (0.1),
grape weight (0.1)
)';
Then the rows containing both search terms are ranked highest.
I recall there being a word-uniqueness component built into the rank calculation, but I am surprised it could affect the result so much that it outweighs the fact that a word was matched. In the real test case most of our words and search terms are quite unique (part numbers, technical family names, etc.), so this still seems to happen even when both terms are unique.
I would like to understand the reason behind this behavior. And if using small input weights is acceptable, are there any disadvantages to it?
Update: Weights that are too small (<= 0.05) also cause the same issue. The most important thing appears to be the balance of the weights. For our case (which may not be typical) each weight must be within ~10% of the other for rows with both words present to be ranked above rows where just one word is present. The 10% tolerance applies in either direction (+/-) to either word (the words likely have similar document/corpus frequencies). Even with the weights being equal, too high or too low a value still causes the issue.
1 Answer
Term Frequency-Inverse Document Frequency (TF-IDF) ranking model:
The ranking algorithm used in full-text search is based on a combination of factors, one of which is the frequency of the search terms in a document compared to their frequency in the entire set of documents being searched. The impact of this can be counter-intuitive, e.g.:
Understanding TF-IDF (Term Frequency-Inverse Document Frequency)
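As a rough illustration of the inverse-document-frequency idea, the sketch below computes a textbook IDF, ln(N / document frequency), over the four sample rows from the question. This is an assumption-laden toy: the @docs table variable and its column names are made up for the example, STRING_SPLIT requires SQL Server 2016 or later (compatibility level 130), and the formula is the generic textbook one, not a reproduction of what ContainsTable actually computes for [Rank].

```sql
-- Toy corpus: the four sample rows from the question (names are hypothetical).
declare @docs table (DocId int, Body nvarchar(200));
insert into @docs (DocId, Body) values
    (1, N'apple banana cherry'),
    (2, N'apple banana cherry grape'),
    (3, N'apple banana cherry grape yam'),
    (4, N'apple banana cherry grape yam zucchini');

-- Total number of documents, as float so the division below is not integer division.
declare @n float = (select count(*) from @docs);

-- Textbook IDF per term: ln(total documents / documents containing the term).
select
    s.value                         as Term,
    count(distinct d.DocId)         as DocFrequency,
    log(@n / count(distinct d.DocId)) as Idf
from @docs d
cross apply string_split(d.Body, N' ') s
group by s.value
order by Idf desc;
```

In this toy corpus cherry appears in every row, so its textbook IDF is ln(4/4) = 0, while grape appears in three of four rows and gets ln(4/3). Under a pure TF-IDF view a term that occurs everywhere contributes little to distinguishing documents, which is the kind of counter-intuitive effect described above.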