sql server improve find duplicate (getting all columns)

cld4siwp  于 2023-04-28  发布在  SQL Server
关注(0)|答案(2)|浏览(179)

I have a doubt about SQL server, I have a table with more than 60 columns and more than 1 Millon rows. The things is I try to make a find duplicate based on 5 or 6 columns, if I hade the row then I need all the columns (so it is a select * ), I have the index but for the select * SqlServer is not using it like seek and take too much time, so I make and select "id" first and then another select * with the id I found, but now I have two queries.

PD: The queries are made by hibernate

Which will be the best practice to have all the rows?

2ledvvac

2ledvvac1#

SQL Server constructs an index automatically for a table that has a primary key, so make sure that you've appropriately constructed a primary key on the table.

Secondly, you want to make sure you've constructed an index that either spans the 5 or 6 columns that you're using to filter duplicates or an index for each column individually. Without knowing more information about the density of the values in question, its hard to tell you which type of index scheme would be more efficient.

It should be super easy then for the database to use those indexes to determine the data page offsets for the rows needed for the result-set and simply fetch those accordingly.

Additionally, yes you may find it more efficient to execute a query like this:

SELECT * FROM Table t
 WHERE t.id IN (
    /* your filter criteria that simply returns ID after joining the table with
       itself using the 5 or 6 columns to filter the data
     */
 )

The reason is because you effectively avoid hitting the data pages during the subquery as it will simply manipulate the indexes and the temporary table to determine the final result-set would be quite narrow in width. Doing a SELECT * would obviously hit your data pages in the subquery most likely and lead to some performance issues.

sg24os4d

sg24os4d2#

This requirements is best expressed using the COUNT window aggregate function. That allows the solution to be expressed as a single access of the underlying table especially if you have the index over the columns where you are looking for duplicates. See example below and try it to see if this leads to a better query plan and performance. c1, c2 are the columns where we are looking for duplicates.

CREATE TABLE Test
(id int PRIMARY KEY, c1 varchar(10), c2 varchar(10)
    , c3 varchar(10))
;

INSERT INTO Test
(id, c1, c2, c3)
VALUES
(1, 'a', 'b', 'p'),
(2, 'a', 'b', 'q'),
(3, 'c', 'd', 'r'),
(4, 'c', 'd', 's'),
(5, 'c', 'e', 's')
;

SELECT * FROM (
    SELECT
    *,
    COUNT(1) OVER(PARTITION BY c1, c2) AS DuplicateCount
    FROM
    Test
) AS t
WHERE t.DuplicateCount > 1

http://sqlfiddle.com/#!18/da3cf/3/0

相关问题