I am hoping to delete all the duplicates in my large DB. I wrote a query which first filters about 200k records out of a million records and then SELECTs the records.
Now, this whole process takes about 10 MINS 30 SECS under a certain system load. Now, if I replaced the SELECT
command with DELETE
, will it take the same amount of time?
I cannot test it myself because I do not have DELETE privileges right at this moment.
My Queries are as follows:
- For Select
WITH CTE AS(
SELECT
ID,
FIRSTNAME,
LASTNAME,
AGE,
ROW_NUMBER() OVER(
PARTITION BY
ID,
FIRSTNAME,
LASTNAME,
AGE
ORDER BY
ID,
FIRSTNAME,
LASTNAME,
AGE
) AS row_num
FROM
NEWTABLE
)
SELECT * FROM CTE WHERE row_num > 1;
- For Delete
WITH CTE AS(
SELECT
ID,
FIRSTNAME,
LASTNAME,
AGE,
ROW_NUMBER() OVER(
PARTITION BY
ID,
FIRSTNAME,
LASTNAME,
AGE
ORDER BY
ID,
FIRSTNAME,
LASTNAME,
AGE
) AS row_num
FROM
NEWTABLE
)
DELETE FROM CTE WHERE row_num > 1;
1条答案
按热度按时间mrfwxfqh1#
Now, if I replaced the SELECT command with DELETE, will it take the same amount of time?
No, almost certainly not.
delete
is a data modification and fully logged operation, neither of whichselect
is.The
delete
will need to acquireIX
thenX
locks in order to delete rows, which may be blocked by other concurrent processes, depending on database usage.The
delete
will need to update the base table and also any relevant indexes, plus write the modifications to the transaction log.Depending on the performance of your hardware and IO it may be better to delete rows in batches of no more than approximately 5000, which is the threshold for the number of locks; this should prevent SQL Server escalating row locks to a table lock which would aid concurrency. You can add a
top
clause in your CTE to accomplish this.Also, your CTE doesn't need to select all columns, the
Id
should suffice.It may be worth experimenting with using a
view
in line with this advice .