PostgreSQL: an efficient way to return a count and an average without joining multiple CTEs together

Posted by wz1wpwve on 2023-10-18 in PostgreSQL

In my PostgreSQL 14.8 database, I have a table named orders, defined as follows:

    CREATE TABLE orders (
      user_id    int
    , order_id   int
    , order_date date
    , quantity   int
    , revenue    float
    , product    text
    );

    INSERT INTO orders VALUES
      (1, 1, '2021-03-05', 1, 15, 'books'),
      (1, 2, '2022-03-07', 1, 3, 'music'),
      (1, 3, '2022-06-15', 1, 900, 'travel'),
      (1, 4, '2021-11-17', 2, 25, 'books'),
      (2, 5, '2022-08-03', 2, 32, 'books'),
      (2, 6, '2021-04-12', 2, 4, 'music'),
      (2, 7, '2021-06-29', 3, 9, 'books'),
      (2, 8, '2022-11-03', 1, 8, 'music'),
      (3, 9, '2022-11-07', 1, 575, 'food'),
      (3, 10, '2022-11-20', 2, 95, 'food'),
      (3, 11, '2022-11-20', 1, 95, 'food'),
      (4, 12, '2022-11-20', 2, 95, 'books'),
      (4, 13, '2022-11-21', 1, 95, 'food'),
      (4, 14, '2022-11-23', 4, 17, 'books'),
      (5, 15, '2022-11-20', 1, 95, 'food'),
      (5, 16, '2022-11-25', 2, 95, 'books'),
      (5, 17, '2022-11-29', 1, 95, 'food');

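See the SQL fiddle: http://sqlfiddle.com/#!17/3dc69/1
I want to get the following:
Among the customers whose first purchase (first by order_date) had books as the product (in this example, customer IDs 1 and 4), I want to get:
A) the average quantity of books that this cohort purchased (in this case 2.25, which is the average of 1, 2, 2, and 4);
B) the total revenue of each of these purchases (in this case 152).
I tried to do this with several CTEs. Here is my attempt: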

    WITH all_orders AS (
        SELECT
            *
          , ROW_NUMBER() OVER (PARTITION BY o.user_id ORDER BY o.order_date) AS row_num
        FROM
            orders o
    ),
    first_product_is_books AS (
        SELECT
            DISTINCT(user_id) AS usr_id
        FROM
            all_orders
        WHERE product = 'books' AND row_num = 1
    ),
    temp_results AS (
        SELECT
            *
        FROM
            all_orders ao
        JOIN
            first_product_is_books AS fp
            ON ao.user_id = fp.usr_id
    )
    SELECT
        avg(quantity)
      , sum(revenue)
    FROM
        temp_results tr
    WHERE
        tr.product = 'books'

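See the fiddle: http://sqlfiddle.com/#!17/3dc69/1
This works on the toy dataset but not on the production dataset, where the table has about 400,000 records. I think this query is hacky and not really optimized. Is there a more efficient way?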

qv7cva1a #1

This is simpler with DISTINCT ON:

    SELECT avg(o.quantity) AS avg_quantity
         , sum(o.revenue)  AS total_revenue
    FROM  (
       SELECT DISTINCT ON (user_id)
              user_id, product
       FROM   orders
       ORDER  BY user_id, order_date
       ) init
    JOIN   orders o USING (user_id, product)
    WHERE  init.product = 'books';

fiddle
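Which query style is most efficient depends mainly on cardinalities: how many rows in total, how many users, how many book orders, how many initial orders for books, the average row size, and the average size of product.
If there is also a table users that holds one row per relevant user_id, the query can be simpler and faster.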
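
As a rough sketch of what that could look like, assume a hypothetical users table with one row per user_id. It is not part of the schema in the question; the CREATE TABLE and INSERT below exist only to make the example self-contained. The first product per user is then fetched with a LATERAL subquery instead of DISTINCT ON over the whole orders table:

```sql
-- Hypothetical lookup table: one row per user_id (an assumption, not in the question).
CREATE TABLE users (user_id int PRIMARY KEY);
INSERT INTO users SELECT DISTINCT user_id FROM orders;

SELECT avg(o.quantity) AS avg_quantity
     , sum(o.revenue)  AS total_revenue
FROM   users u
CROSS  JOIN LATERAL (
   -- product of this user's earliest order
   SELECT o1.product
   FROM   orders o1
   WHERE  o1.user_id = u.user_id
   ORDER  BY o1.order_date
   LIMIT  1
   ) init
JOIN   orders o ON  o.user_id = u.user_id
                AND o.product = init.product
WHERE  init.product = 'books';
```

On the sample data this returns an avg_quantity of 2.25 and a total_revenue of 152, the same as the DISTINCT ON query. Whether it is actually faster depends on the cardinalities mentioned above.
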
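You need an index to support the query, ideally on (user_id, order_date, product). Again, the best index depends on the undisclosed details mentioned above.
Also, (user_id, order_date) must be unique, or the first order per user is ambiguous.
See:

  • Select first row in each GROUP BY group?
  • Optimize GROUP BY query to retrieve latest row per user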
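
A minimal sketch of the index and the uniqueness check described above. The index name is arbitrary, and the right column list for production depends on the details the question does not disclose:

```sql
-- Index matching the (user_id, order_date) sort used to find each user's first order;
-- product is appended so the first product can be read from the index.
CREATE INDEX orders_user_date_product_idx
    ON orders (user_id, order_date, product);

-- List (user_id, order_date) pairs that occur more than once.
-- Any row returned here means the first order of that user may be ambiguous.
SELECT user_id, order_date, count(*) AS n
FROM   orders
GROUP  BY user_id, order_date
HAVING count(*) > 1;
```

On the sample data the check returns one row: user 3 has two orders on 2022-11-20. That tie does not affect the result here because user 3's earliest order is on 2022-11-07, but on real data a tie on the earliest date would make the chosen first product arbitrary unless a tie-breaker (for example order_id, as an assumption about what is appropriate) is added to the ORDER BY.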

s5a0g9ez #2

How about this?

    WITH order_cte AS (
        SELECT
            *,
            ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date) AS row_num
        FROM orders
    )
    SELECT
        AVG(quantity) AS avg_quantity
      , SUM(revenue)  AS total_rev
    FROM orders
    WHERE product = 'books'
      AND user_id IN (
          SELECT
              user_id
          FROM order_cte
          WHERE product = 'books'
            AND row_num = 1
      )

uttx8gqw #3

This version of the query is easier to read.

    WITH first_orders AS (
        SELECT
            DISTINCT
            user_id,
            FIRST_VALUE(product) OVER (PARTITION BY user_id ORDER BY order_date) AS first_product
        FROM orders
        GROUP BY user_id, order_date, product
    )
    SELECT
        AVG(o.quantity) AS avg_quantity,
        SUM(o.revenue) AS total_revenue
    FROM orders o
    JOIN first_orders fo
        ON o.user_id = fo.user_id
    WHERE fo.first_product = 'books' AND o.product = 'books';

piztneat #4

The logic of your question is flawed (all of the code below is available on the fiddle here)!

    > A) the average quantity of books that this cohort purchased (in this
    > case, 2.25, which is the average of 1, 2, 2, and 4), and;
    >
    > B) the total revenue of each of these purchases (in this case, 152).

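You want the average quantity across the book purchases of the two users (1 and 4), but the revenue from these four purchases is not 152, it is 323!
This is calculated as follows (see the table population in the fiddle):

  • User 1 = (1 x 15) + (2 x 25) = 15 + 50 = 65
  • User 4 = (2 x 95) + (4 x 17) = 190 + 68 = 258
  • 65 + 258 = 323
    323 is the correct total revenue for the two users.

152 = 15 + 25 + 95 + 17, i.e. the quantity was not taken into account in the calculation!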
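
The two figures can be checked side by side with a small query. This is only an illustration of the arithmetic above; the user IDs 1 and 4 are hard-coded from the question, and multiplying by quantity assumes that revenue is a per-unit amount, which is this answer's reading of the data:

```sql
SELECT sum(revenue)            AS sum_revenue                 -- 15 + 25 + 95 + 17
     , sum(revenue * quantity) AS sum_revenue_times_quantity  -- 15 + 50 + 190 + 68
FROM   orders
WHERE  product = 'books'
AND    user_id IN (1, 4);
-- sum_revenue = 152, sum_revenue_times_quantity = 323
```
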
The table was created and populated from your data.
Then I ran the following SQL:

    WITH cte AS
    (
        SELECT
            user_id,    -- some of these fields are not necessary
            product,    -- they just help in getting the full picture.
            order_date,
            SUM(revenue * quantity) OVER (PARTITION BY user_id, product) AS rev,
            AVG(quantity) OVER (PARTITION BY user_id, product) AS avg_qty,
            ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY user_id, order_date) AS rn
        FROM
            orders
        ORDER BY user_id, order_date
    )
    SELECT
        SUM(rev) AS tot_rev,
        ROUND(AVG(avg_qty), 2) AS avg_q
    FROM
        cte
    WHERE
        product = 'books' AND rn = 1;

Result:

    tot_rev  avg_q
    323      2.25

Which is the correct answer!
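
One caveat, offered as a sketch rather than as part of the original answer: AVG(avg_qty) is an average of per-user averages. It matches the overall average of 2.25 here because users 1 and 4 each have exactly two book orders; with uneven order counts the two figures can differ. A variant that aggregates over the individual order rows, keeping the same revenue * quantity interpretation, might look like this (the CTE name firsts is arbitrary):

```sql
WITH firsts AS (
    SELECT
        user_id,
        product,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date) AS rn
    FROM orders
)
SELECT
    SUM(o.revenue * o.quantity) AS tot_rev,
    ROUND(AVG(o.quantity), 2)   AS avg_q   -- average over order rows, not over users
FROM orders o
JOIN firsts f
    ON  f.user_id = o.user_id
    AND f.rn = 1
    AND f.product = 'books'
WHERE o.product = 'books';
```

On the sample data this also yields 323 and 2.25.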
