I have a query that deletes a lot of data from a table, as follows. It uses a WHILE loop to avoid blowing up the transaction log, but the Customers table has about 200 million records and the query is deleting approx. 2 million. I was wondering if replacing the NOT EXISTS would help at all.
WHILE (1 = 1)
BEGIN
    DELETE TOP (10000) FROM Customers
    WHERE NOT EXISTS (SELECT *
                      FROM CustomerInvoices
                      WHERE CustomerInvoices.CustomerId = Customers.CustomerId);

    IF (@@ROWCOUNT = 0)
        BREAK;
END
2 Answers

wrrgggsh 1#
Your problem is that the number of rows in Customers it needs to check before finding 10,000 matching the NOT EXISTS grows every batch. The ratio of matching rows steadily drops until, by the final batch, you are scanning the whole 198 million remaining rows to find the last 10,000.

You are doing 200 batches. On average each batch reads 100 million Customers rows (the earliest batches far fewer, the later ones more), which totals roughly 20 billion rows read from that table alone, and a similar amount in CustomerInvoices. If the execution plan is a serial scan, then each batch will likely go over all the rows already processed in every previous batch and found to be ineligible before finally getting to the ones of interest.
You can create a temp table with a sequential integer column holding the delete candidates, then write some code to process that temp table in <batch_size> chunks of Id ranges, e.g. as below.
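A sketch of that approach, using the table and column names from the question; `#ToDelete`, `@BatchSize`, and the other variable names are illustrative, not part of the original answer:

```sql
-- Snapshot the delete candidates once, keyed by a dense sequential Id.
CREATE TABLE #ToDelete
(
    Id         int IDENTITY(1, 1) PRIMARY KEY,
    CustomerId int NOT NULL
);

INSERT INTO #ToDelete (CustomerId)
SELECT C.CustomerId
FROM Customers AS C
WHERE NOT EXISTS (SELECT *
                  FROM CustomerInvoices AS CI
                  WHERE CI.CustomerId = C.CustomerId);

DECLARE @BatchSize int = 10000,
        @StartId   int = 1,
        @MaxId     int = (SELECT MAX(Id) FROM #ToDelete);

WHILE (@StartId <= @MaxId)
BEGIN
    DELETE C
    FROM Customers AS C
    JOIN #ToDelete AS TD
        ON TD.CustomerId = C.CustomerId
    WHERE TD.Id >= @StartId
      AND TD.Id < @StartId + @BatchSize
      -- Re-check eligibility in case invoices were inserted since the snapshot.
      AND NOT EXISTS (SELECT *
                      FROM CustomerInvoices AS CI
                      WHERE CI.CustomerId = C.CustomerId);

    SET @StartId += @BatchSize;
END

DROP TABLE #ToDelete;
```

Each batch now seeks into a fixed 10,000-row Id range of the temp table instead of rescanning the whole Customers table to find eligible rows.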
You still need a NOT EXISTS on the actual DELETE, in case inserts since the identification step mean a delete candidate is no longer eligible. You might also consider the HOLDLOCK hint to deal with the possibility of truly concurrent inserts while the DELETE query itself is running.

nom7f22z 2#
The query you provided uses a WHILE loop to delete data from the Customers table in batches of 10,000 rows at a time. The deletion is based on the condition that there is no matching record in the CustomerInvoices table for each customer.

Given the large number of records in the Customers table (200 million) and the significant number of rows being deleted (2 million), optimizing the query is worthwhile. One possible improvement is to replace the NOT EXISTS subquery with a LEFT JOIN, which can sometimes perform better. Here's the modified query:
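A sketch of the LEFT JOIN rewrite this answer describes, with table and column names taken from the question:

```sql
WHILE (1 = 1)
BEGIN
    -- Delete customers that have no matching invoice row.
    DELETE TOP (10000) C
    FROM Customers AS C
    LEFT JOIN CustomerInvoices AS CI
        ON CI.CustomerId = C.CustomerId
    WHERE CI.CustomerId IS NULL;

    IF (@@ROWCOUNT = 0)
        BREAK;
END
```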
In my version, the LEFT JOIN retrieves all matching records between the Customers and CustomerInvoices tables. The WHERE clause then keeps only the rows where the join found no match, effectively identifying the customers with no invoices. Deleting these customers in batches should improve the efficiency of the deletion process.
It's important to note that optimizing query performance depends on various factors, such as the database structure, indexes, and available resources. Therefore, it's recommended to test and analyze the execution plans to ensure the desired performance gains.