SQL Server More performant SQL Delete than using NOT EXISTS?

gfttwv5a  于 2023-06-21  发布在  其他
关注(0)|答案(2)|浏览(125)

I have a query that is deleting a lot of data from a table as follows. It uses a while loop based on trying to not destroy the transaction log but the Customers table has about 200 million records in it and it is deleting approx. 2 million. I was wondering if replacing the NOT EXISTS would help at all.

WHILE (1=1)
BEGIN
  DELETE TOP(10000) FROM Customers
  WHERE NOT EXISTS (SELECT * FROM CustomerInvoices WHERE CustomerInvoices.CustomerId = 
  Customers.CustomerId)
  IF (@@ROWCOUNT = 0)
  BREAK
END
wrrgggsh

wrrgggsh1#

Your problem is that the number of rows in Customers it needs to check before finding 10000 matching the NOT EXISTS grows every batch.

The ratio of matching rows will steadily drop until by the final batch you are scanning the whole 198 million remaining rows to find the last 10,000.

You are doing 200 batches. On average each batch reads 100 million in Customers rows (the earliest batches much less and the later ones more more) - this totals to 20 billion rows read over all just from that table and a similar amount in CustomerInvoices .

If the execution plan is a serial scan then likely every batch this will go over all the ones already processed in every previous batch and found to be not eligible before finally getting to the ones of interest.

You can create a temp table with a sequential integer column...

DECLARE @LastRow INT

CREATE TABLE #DeleteCandidates(Id int PRIMARY KEY, CustomerId INT);

INSERT #DeleteCandidates
SELECT ROW_NUMBER()
         OVER (
           ORDER BY (SELECT 0)) AS Id,
       Customers.CustomerId
FROM   Customers
WHERE  NOT EXISTS (SELECT *
                   FROM   CustomerInvoices
                   WHERE  CustomerInvoices.CustomerId = Customers.CustomerId)

SET @LastRow = @@ROWCOUNT

Then write some code to process that temp table in <batch_size> chunks of Id ranges.

e.g. as below

DECLARE @BatchSize INT = 10000
DECLARE @MinId INT = 1

WHILE @MinId <= @LastRow
  BEGIN

      DELETE FROM Customers
      WHERE  Customers.CustomerId IN (SELECT dc.CustomerId
                                      FROM   #DeleteCandidates dc
                                      WHERE  dc.Id >= @MinId
                                             AND dc.Id < @MinId + @BatchSize)
             AND NOT EXISTS (SELECT *
                             FROM   CustomerInvoices/*WITH (HOLDLOCK )*/
                             WHERE  CustomerInvoices.CustomerId = Customers.CustomerId)

      SET @MinId = @MinId + @BatchSize
  END

You still need a NOT EXISTS on the actual DELETE in case there were inserts since the identification that means a delete candidate is no longer eligible.

You might also consider the HOLDLOCK hint to deal with the possibility of truly concurrent inserts whilst the DELETE query itself is running.

nom7f22z

nom7f22z2#

The query you provided is using a while loop to delete data from the "Customers" table in batches of 10,000 rows at a time. The deletion is based on the condition that there is no matching record in the "CustomerInvoices" table for each customer.

Considering the large number of records in the "Customers" table (200 million) and the significant number of rows being deleted (2 million), optimizing the query is crucial. One possible improvement is to replace the NOT EXISTS subquery with a LEFT JOIN approach, which can sometimes offer better performance. Here's the modified query:

WHILE (1=1)
BEGIN
  DELETE TOP(10000) C
  FROM Customers C
  LEFT JOIN CustomerInvoices CI ON CI.CustomerId = C.CustomerId
  WHERE CI.CustomerId IS NULL
  IF (@@ROWCOUNT = 0)
    BREAK
END

In my version, the LEFT JOIN retrieves all matching records between the "Customers" and "CustomerInvoices" tables. Then, the WHERE clause filters out the rows where the join didn't find a match, effectively identifying the customers with no invoices. Deleting these customers in batches should improve the efficiency of the deletion process.

It's important to note that optimizing query performance depends on various factors, such as the database structure, indexes, and available resources. Therefore, it's recommended to test and analyze the execution plans to ensure the desired performance gains.

相关问题