SQL Server Too many NOT LIKE statements and WAAAAYYYY too many Wildcards

yb3bgrhw  于 2023-04-28  发布在  其他
关注(0)|答案(2)|浏览(125)

At my new job where I am NOT a SQL expert, I am trying to make a sql query in SSMS. I am trying to find all transactions that have 'insurance' or 'ins' in the Description field. So my code looks something like this (I have simplified/obfuscated it)

SELECT
a.Customer,
b.Amount,
c.Description

FROM
    table1 as a
    INNER JOIN table2 as b ON a.CustKey = b.AmtKey
    INNER JOIN table 3 as c ON a.CustKey = c.DescKey
    
WHERE
(
(c.Description LIKE '%insurance%' OR c.Description LIKE '%ins%')
AND
(
b.Description NOT LIKE '%domains%' AND b.Description NOT LIKE '%vitamins%' AND 
b.Description NOT LIKE '%Gainsville%' AND b.Description NOT LIKE '%circle k%'
b.Description NOT LIKE '%insane%'
))

So the Description field is an nvarchar(250) data type. It generally looks something like this "ALFA MUTUAL INS EFT PYMTS PPD SMITH JOHN" or "VISA/DEBIT-CARD POS STATE FARM INSURANCE 8009566310 IL 8675309 0102" but there are many other different ways that it looks. And some even have locations in the Description.

My problem is that '%ins%' is pulling in many results that are not relevant. For example, it will pull in all transactions that were made at a business called "Insane Cookies". So I add NOT LIKE '%insane%'. But than it will include ALL transactions that were made by John Smith Atkins who purchased insurance but also made purchases at Circle K gas stations. I do NOT want to exclude the 'insurance' purchases, but I do want to exclude his gas station purchases. So I add NOT LIKE '%circle k%'. This has made me create a very long list of NOT LIKE statements (about 48 NOT LIKE '%value%' statements)

Because the Description field is essentially a string, I cannot just use 'ins' or 'insurance'. I have to use the wildcard characters. My query actually works but I have sooooo many NOT LIKE statements and each NOT LIKE statements has two wildcard characters that it is bogging down the system. They (DBAs) are instructing me to modify my query. I knew this would happen. I tried to ask for help because so many NOT LIKE statements + wildcards = poor performance, but "apparently" everyone who could help is too busy. So I'm here.

This is my first time posting/asking for help on here. So I apologize if I overstated or understated my predicament. Feel free to ask and I will provide clarification. I'd appreciate ANY help.

TLDR: what are alternatives methods to a long list of NOT LIKE + wildcards?

Edit: change "or" in NOT LIKE area to "and". I have it as "and" in my query.

rryofs0p

rryofs0p1#

Parsing out text to get what you're after will never be 100% correct. If someone enters INS meaning Immigration & Naturalization Services you're going to get a false positive you can't ever manage. The best you can do is "Default" an option; but rely on user intervention to set a flag/value to know it's insurance related.

Knowing this there may be a way to make your query more performant using SET based processing; avoiding a Not like.

Essentially get all the possible INS records in a set; then exclude all the ones that match your not like criteria but using a like instead.

In a perfect world you'd have some sort of record indicator to indicate this was insurance related and key off it instead. Are you sure there's not another way to identify insurance related records besides scraping text?

-- Get Set of data which we think is "Insurance" related

SELECT
a.Customer,
b.Amount,
c.Description
FROM table1 as a
INNER JOIN table2 as b ON a.CustKey = b.AmtKey
INNER JOIN table 3 as c ON a.CustKey = c.DescKey
WHERE (c.Description LIKE '%insurance%' 
   OR c.Description LIKE '%ins%')   --this is inclusive of the former so just use it.

EXCEPT  --aka MINUS

--Now exclude these similar records containing descriptions we want to ignore. 
-- (false positives) Note: like is generally faster than "not like"   so by treating 
-- these as data sets and excluding one from the other, you may see a performance gain.  
-- But you'll always have false positives in your results since you can't have a list 
-- that will always exclude EVERY variant of ins.  Your best bet is to find or maintain 
-- a field/table where someone controls this association. For example, you're 100% stuck if someone puts in INS (meaning Immigration & Naturalization Services'

SELECT
a.Customer,
b.Amount,
c.Description
FROM table1 as a
INNER JOIN table2 as b ON a.CustKey = b.AmtKey
INNER JOIN table 3 as c ON a.CustKey = c.DescKey
WHERE (b.Description LIKE '%domains%' 
  OR b.Description LIKE '%vitamins%' 
  OR b.Description LIKE '%Gainsville%' 
  OR b.Description LIKE '%circle k%' 
  OR b.Description LIKE '%insane%')

Alternatively you could add a computed (persisted?) column to the database which evaluates the text for the desired string excluding the undesired strings in a case expression pushing the analysis/performance to the save of a record vs the read. This field would simply contain a 1 if insurance related 0 if not and have an evaluation using a case expression something like:

where description like '%INS%' 
  and (Description not like '%...%' 
    or Description not like '%....%'...) 
 then 1 else 0 end

but again no matter what, you'll never hit 100% success.

sz81bmfz

sz81bmfz2#

To get rid of the NOT LIKE list, you need to think about word boundaries, as @siggemannen also pointed out:

  1. If you would want to find only the words "insurance" and "ins", and not any abbreviation of "insurance":
    You could just put spaces around these words to eliminate false matches, and also around the original values not to exclude when they are the first or last words:
    ('[space]' + c.Description + '[space]' LIKE '%[space]insurance[space]%' OR '[space]' + c.Description + '[space]' LIKE '%[space]ins[space]%')
    A more complicated but also potentially more performant way of expressing the equivalent is:
    (c.Description LIKE 'insurance[space]%' OR c.Description LIKE 'ins[space]%' OR '[space]' + c.Description + '[space]' LIKE '%[space]insurance[space]%' OR '[space]' + c.Description + '[space]' LIKE '%[space]ins[space]%')
  2. If you want any abbreviation of "insurance":
    First, regarding c.Description LIKE '%insurance%' OR c.Description LIKE '%ins%' - as it is OR , the first is redundant, as insurance contains ins . Therefore ins remains, but no abbreviation can contain any character before this word, so there should be a space:
    '[space]' + c.Description LIKE '%[space]ins%'
    a potentially more performant equivalent is
    c.Description LIKE 'ins%' OR '[space]' + c.Description LIKE '%[space]ins%' .
    This doesn't eliminate all NOT LIKE s, but many like "domains" or "Gainsville" will go away and only "insane" and the likes remain - however this approach should be avoided, as there are a ton of other words that you will have to exclude in the future. It is more reliable to include ALL possible abbreviations in point 1, as there are less variants.

相关问题