SQL Server Join several tables together without creating duplicates messing with count and sum functions?

6rvt4ljy  于 2023-11-16  发布在  其他
关注(0)|答案(3)|浏览(105)

The database:

Sales.Customers(CustomerID ,CustomerName) 
Sales.Orders(OrderID, CustomerID) 
Sales.OrderLines(OrderID, Quantity, UnitPrice) 
Sales.Invoices(InvoiceID, OrderID, CustomerID) 
Sales.InvoiceLines(InvoiceID, Quantity, UnitPrice)

The goal is to generate the following output:

column1: list CustomerID (only orders converted into invoices) 
Column2: list of CustomerName 
Column3: number of orders by customer ID (only orders converted into invoices) 
Column4: number of invoices by customer ID 
Column5: sum (Quantity*UnitPrice) of orders by CustomerID (only orders converted into invoices) 
Column6: sum  (Quantity*UnitPrice) of invoices by Customer ID 
Column7: the difference between column 4 and column 5

I managed to join the first 4 tables and get the expected results

Select
c.CustomerID
, c.CustomerName
, COUNT(DISTINCT(o.OrderID)) AS [number orders]
, COUNT(DISTINCT(i.InvoiceID)) AS [number invoices]
, SUM(il.Quantity * il.UnitPrice) AS [TOTAL invoices]

FROM Sales.Invoices AS i
INNER JOIN Sales.InvoiceLines AS il
ON il.InvoiceID = i.InvoiceID
INNER JOIN Sales.Orders AS o
ON o.orderID = i.OrderID    
INNER JOIN Sales.Customers AS c
ON c.CustomerID = i.CustomerID

GROUP BY c.CustomerID, c.CustomerName
ORDER BY c.CustomerID

But once I joined the last one, COUNT columns were still correct but both SUM columns were not returning expected numbers (1.6x from the expected SUM).

The query returning incorrect SUM:

Select 
        c.CustomerID
        , c.CustomerName
        , COUNT(DISTINCT(o.OrderID)) AS [number orders]
        , COUNT(DISTINCT(i.InvoiceID)) AS [number invoices]
        , SUM(DISTINCT(il.Quantity * il.UnitPrice)) AS [TOTAL invoices]
        , SUM(DISTINCT(ol.Quantity * ol.UnitPrice)) AS [TOTAL orders] 
            
FROM Sales.Invoices AS i
        LEFT JOIN Sales.InvoiceLines AS il
        ON il.InvoiceID = i.InvoiceID 
        LEFT JOIN Sales.Orders AS o
        ON o.orderID = i.OrderID    
        LEFT JOIN Sales.OrderLines AS ol
        ON ol.OrderID = o.OrderID
        LEFT JOIN Sales. Customers AS c
        ON c.CustomerID = i.CustomerID
    
GROUP BY c.CustomerID, c.CustomerName
ORDER BY c.CustomerID

I tried different types of joins, different orders, subqueries, and division queries... But always ended up with some errors or incorrect results. Any help would be highly appreciated!

nukf8bse

nukf8bse1#

To troubleshoot, remove all the group by and sums, pick one order (using a where clause) and actually look at the detail data.

This is all part of the basic problem solving process. You can't solve a problem looking at a summary you need to look at the detail.

Firstly, you have "double counting"

An order might have one invoice and three invoice lines. When you join those you get three records. That order might also have two order lines. When you join those three records to two order lines you get six (3x2) records. So now when you sum up lines or invoices you have too many records and the figure is overstated.

Secondly, this is definitely incorrect

SUM(DISTINCT(ol.Quantity * ol.UnitPrice))

If seven completely unrelated order lines have the same value. it'll throw out six of them. Which is definitely incorrect.

You can't answer the questions you need by just joining up all the tables.

You need to do a seperate subtable (or CTE) query that summarises order value by customerid then you can join on that.

-- this dataset is unique on customer id 
-- so won't double count when joined to any other dataset 
-- that is unique on customerid
with ordersummary as (
select o.customerid, SUM(ol.Quantity * ol.UnitPrice) as ordervalue
from Sales.Orders AS o
INNER JOIN Sales.OrderLines AS ol
ON ol.OrderID = o.OrderID
group by customerid
),
-- this dataset is also unique on customer id 
-- so won't double count when joined to any other dataset 
-- that is unique on customerid
invoicesummary as 

(
Select
c.CustomerID
, c.CustomerName
, COUNT(DISTINCT(o.OrderID)) AS [number orders]
, COUNT(DISTINCT(i.InvoiceID)) AS [number invoices]
, SUM(il.Quantity * il.UnitPrice) AS [TOTAL invoices]

FROM Sales.Invoices AS i
INNER JOIN Sales.InvoiceLines AS il
ON il.InvoiceID = i.InvoiceID
INNER JOIN Sales.Orders AS o
ON o.orderID = i.OrderID    
INNER JOIN Sales.Customers AS c
ON c.CustomerID = i.CustomerID
GROUP BY c.CustomerID, c.CustomerName
)
-- now we use our CTE's
-- and join on thoe unique customerid value. No double counting.
SELECT
    invoicesummary.CustomerID
    , invoicesummary.CustomerName
    , invoicesummary.[number orders]
    , invoicesummary.[number invoices]
    , invoicesummary.[TOTAL invoices]
    , ordersummary.ordervalue 
FROM invoicesummary
    INNER JOIN ordersummary 
    ON ordersummary.customerid=invoicesummary.Customerid
ORDER BY invoicesummary.CustomerID
ar5n3qh5

ar5n3qh52#

You need to pre-aggregate orders and invoices separately. You can't aggregate them at the same time or you will get a giant cross-join.

Because you want only invoices which also have orders, you need to aggregate the orders within each invoice, then aggregate all invoices.

Throwing DISTINCT at something is not a solution. You need to think about why duplicates are appearing in the first place.

SELECT
        c.CustomerID
        , c.CustomerName
        , i.NumberOrders
        , i.NumberInvoices
        , i.TotalInvoices
        , i.TotalOrders
FROM Sales.Customers AS c
LEFT JOIN (
    SELECT
      i.CustomerID,
      COUNT(*) AS NumberInvoices,
      SUM(il.Quantity * il.UnitPrice) AS TotalInvoices,
      SUM(o.NumberOfOrders) AS NumberOfOrders,
      SUM(o.TotalOrders) AS TotalOrders
    FROM Sales.Invoices AS i
    JOIN Sales.InvoiceLines AS il ON il.InvoiceID = i.InvoiceID
    JOIN (
        SELECT
          o.InvoiceID,
          COUNT(*) AS NumberOrders,
          SUM(ol.Quantity * ol.UnitPrice) AS TotalOrders
        FROM Sales.Orders AS o
        JOIN Sales.OrderLines AS ol ON ol.OrderID = o.OrderID
        GROUP BY o.InvoiceID
    ) AS o ON o.InvoiceID = i.InvoiceID
    GROUP BY i.CustomerID
) AS i ON c.CustomerID = i.CustomerID
ORDER BY
  c.CustomerID;

You can also do this kind of pre-aggregation using CROSS APPLY or OUTER APPLY . The ON moves inside the subquery, and the grouping ID can be removed from the SELECT

SELECT
        c.CustomerID
        , c.CustomerName
        , i.NumberOrders
        , i.NumberInvoices
        , i.TotalInvoices
        , i.TotalOrders
FROM Sales.Customers AS c
OUTER APPLY (
    SELECT
      COUNT(*) AS NumberInvoices,
      SUM(il.Quantity * il.UnitPrice) AS TotalInvoices,
      SUM(o.NumberOfOrders) AS NumberOfOrders,
      SUM(o.TotalOrders) AS TotalOrders
    FROM Sales.Invoices AS i
    JOIN Sales.InvoiceLines AS il ON il.InvoiceID = i.InvoiceID
    CROSS APPLY (
        SELECT
          COUNT(*) AS NumberOrders,
          SUM(ol.Quantity * ol.UnitPrice) AS TotalOrders
        FROM Sales.Orders AS o
        JOIN Sales.OrderLines AS ol ON ol.OrderID = o.OrderID
        WHERE o.InvoiceID = i.InvoiceID
        GROUP BY o.InvoiceID
    ) AS o
    WHERE c.CustomerID = i.CustomerID
    GROUP BY c.CustomerID
) AS i 
ORDER BY
  c.CustomerID;
kmbjn2e3

kmbjn2e33#

In your last query, you did not relate InvoiceLines to OrderLines . So effectively you are doing a cross join between these two tables. You can change your query to SELECT * , remove all aggregates and group by clause, inspect the result and you will see it.

For your requirement, it is easier to do it using derived table or cte. You perform the necessary calculation for the Orders or Invoices in the derived table and then on the outer query, join it to the Customers table

Select 
        c.CustomerID
        , c.CustomerName
        , i.[number invoices]
        , i.[TOTAL invoices]
        , o.[number orders]           
        , o.[TOTAL orders]                 
FROM    Sales.Customers AS c
        LEFT JOIN
        (
            SELECT i.CustomerID
                   , COUNT(DISTINCT(i.InvoiceID)) AS [number invoices]
                   , SUM(il.Quantity * il.UnitPrice) AS [TOTAL invoices]
            FROM   Sales.Invoices AS i
                   INNER JOIN Sales.InvoiceLines AS il
                           ON il.InvoiceID = i.InvoiceID 
            GROUP BY i.CustomerID
        ) i                ON i.CustomerID = c.CustomerID 
        LEFT JOIN
        (
            SELECT  o.CustomerID
                  , COUNT(DISTINCT(o.OrderID)) AS [number orders]           
                  , SUM(ol.Quantity * ol.UnitPrice) AS [TOTAL orders] 
            FROM    Sales.Orders AS o
                    INNER JOIN Sales.OrderLines AS ol
                           ON ol.OrderID = o.OrderID
            GROUP BY o.CustomerID
        ) o                ON o.CustomerID = c.CustomerID

相关问题