SQL Server Best query to get multiple columns from a GroupBy of two JOIN-ed tables?

vuktfyat  于 2023-06-21  发布在  其他
关注(0)|答案(4)|浏览(159)

I'm trying to write a SQL query to get multiple columns from a joined table, and I've seen there are two approaches, either to have multiple columns in the GROUPBY, or use an aggregator on the results, but I'm not sure which of these is the best approach, or if there is a performance benefit to either?

As an example, I have user and purchases tables:

dbo.[User]
| Id (PK) | Name |
| ------------ | ------------ |
| 1 | Name1 |
| 2 | Name2 |

dbo.Purchases

Id (PK)UserId (FK)Quantity
112
213
325

I would like to get the UserId, Name and total quantity for all the users, which I think can be done in two ways:

Approach A:

SELECT u.Id, u.[Name], SUM(p.Quantity) as Quantity
FROM dbo.[User] u
LEFT JOIN dbo.Purchases p ON p.UserId = u.Id
GROUP BY u.Id, u.[Name]

Approach B:

SELECT u.Id, MAX(u.[Name]), SUM(p.Quantity) as Quantity
FROM dbo.[User] u
LEFT JOIN dbo.Purchases p ON p.UserId = u.Id
GROUP BY u.Id

Both of these seem strange to me though, because it should be clear from User.Id being a primary key that if I am grouping by User.Id then any other fields from [User] must have a unique value, so I'm confused why something like the following doesn't work:

--- INCORRECT ---
SELECT u.Id, u.[Name], SUM(p.Quantity) as Quantity
FROM dbo.[User] u
LEFT JOIN dbo.Purchases p ON p.UserId = u.Id
GROUP BY u.Id

Which approach (A) or (B) - or a different approach - am I best to use?

dwbf0jvd

dwbf0jvd1#

Another way which might be more natural is the following:

SELECT u.Id, u.[Name], totalQuantity
FROM dbo.[User] u
LEFT JOIN (
   select  SUM(p.Quantity) as TotalQuantity
   , UserId
   FROM   dbo.Purchases p 
   GROUP BY UserId
  ) p
ON p.UserId = u.Id

This groups purchases separately and then joins them back. There's no performance hit usually since sql server can figure out and LEFT JOIN "through" the User table. Ie, it doesn't materialize whole purchase table unnecessarily.

Similar to this is the OUTER APPLY (cross apply should give same results in this case too i think) approach:

SELECT u.Id, u.[Name], totalQuantity
FROM dbo.[User] u
OUTER APPLY (
   select  SUM(p.Quantity) as TotalQuantity
   FROM   dbo.Purchases p 
   WHERE p.UserId = u.Id
  ) p

These should generate pretty identical plans in most cases, but of course the best is to test it

oiopk7p5

oiopk7p52#

Try with the below code to get the desired result set.

Create table [user]
(
ID int,
[Name] varchar(20)
)
insert into [user]
Select 1, 'Name1'
union all 
Select 2, 'Name2'

Create table [Purchases]
(
ID int,
userID int,
Quantity int
)
insert into [Purchases]
Select 1,1, 2
union all 
Select 2,1,3
Union all 
Select 3,2,5

--Select Query
Select p.userID, u.[Name], SUM(p.Quantity) AS TotalQuantity
from purchases p
JOIN [user] u
on p.userid = u.id
GROUP BY p.userid,u.Name

OUTPUT:
userID  Name    TotalQuantity
-------------------------
1       Name1       5
2       Name2       5
ddarikpa

ddarikpa3#

The only difference between approach A and approach B is that there is no need to include the name column under the GROUP BY section as you are commanding the MAX value for name under approach B.

I have just attempted to replicate your results (using slightly different notation) and either approach can work.

Approach A

select t1.id, t1.name, sum(t2.quantity) from table1 as t1 inner join table2 as t2 on t1.id=t2.id group by t1.id, t1.name order by t1.id;

Approach B

select t1.id, max(t1.name) as name, sum(t2.quantity) from table1 as t1 inner join table2 as t2 on t1.id=t2.id group by t1.id order by t1.id;

For Approach B, the only change I would make is to specify max(t1.name) as name , otherwise the column that displays the name will be called max instead of name.

4si2a6ki

4si2a6ki4#

Combined answer based on the suggestions to run the queries to see how they perform.

Note: I ran this for my specific example, so the timings and query plan weightings apply to my use-case, but were helpful to me to understand the differences. (Timings are just approximate - they vary by 100-200ms between executions.)

Approach A

SELECT uAct.Id, uAct.PrimaryEmail, uAct.LatestActivity FROM
    (SELECT u.Id, u.PrimaryEmail, Max(ta.ActivityTime) as LatestActivity
        FROM dbo.[User] u
        LEFT JOIN dbo.TimedActivity ta ON ta.UserId = u.Id
        GROUP BY u.Id, u.PrimaryEmail) uAct
    WHERE [someFilter])

Time: 00:00:00.290

Query Plan:

Approach B

SELECT uAct.Id, PrimaryEmail, uAct.LatestActivity FROM
    (SELECT u.Id, MAX(u.PrimaryEmail) as PrimaryEmail, Max(ta.ActivityTime) as LatestActivity
        FROM dbo.[User] u
        LEFT JOIN dbo.TimedActivity ta ON ta.UserId = u.Id
        GROUP BY u.Id) uAct
    WHERE [someFilter])

Time: 00:00:00.247

Query Plan:

Approach C (suggested by @siggemannen)

SELECT u.Id, u.PrimaryEmail, uAct.LatestActivity FROM
    dbo.[User] u
    LEFT JOIN(
        SELECT MAX(ta.ActivityTime) as LatestActivity
        , UserId
        FROM dbo.TimedActivity ta
        GROUP BY UserId
    ) uAct
    ON uAct.UserId = u.Id
    WHERE [someFilter])

Time: 00:00:00.135

Query Plan:

Approach D (suggested by @siggemannen)

SELECT u.Id, u.PrimaryEmail, uAct.LatestActivity FROM
    dbo.[User] u
    OUTER APPLY (
        SELECT MAX(ta.ActivityTime) as LatestActivity
        FROM dbo.TimedActivity ta
        WHERE ta.UserId = u.Id
    ) uAct
    WHERE [someFilter])

Time: 00:00:01.707

Query Plan:

Conclusion Approaches A, B and C seem to have very similar performance and execution. Approach D is much slower for my specific case. I'm supposing is due to creating a whole table inside the query? The key difference is the "Index Spool" taking up 82% of the execution time there.

I don't know much about query plans to identify any key differences in performance between A, B and C. I suppose this is likely to come down to relative sizes of tables - in my case not much difference at all.

相关问题