databricks和spark中的公共表表达式(cte)

cuxqih21  于 2021-05-29  发布在  Spark
关注(0)|答案(1)|浏览(762)

我在databricks中有一个sparkDataframe。我正在尝试用公共表表达式(cte)运行一些sql查询。以下是前10行数据

  1. +----------+----------+------+---+---+---------+-----------------+
  2. | data_date| user_id|region|sex|age|age_group|sum(duration_min)|
  3. +----------+----------+------+---+---+---------+-----------------+
  4. |2020-01-01|22600560aa| 1| 1| 28| 2| 0.0|
  5. |2020-01-01|17148900ab| 6| 2| 60| 5| 1138.0|
  6. |2020-01-01|21900230aa| 5| 1| 43| 4| 0.0|
  7. |2020-01-01|35900050ac| 8| 1| 16| 1| 224.0|
  8. |2020-01-01|22300280ad| 6| 2| 44| 4| 8.0|
  9. |2020-01-02|19702160ac| 2| 2| 55| 5| 0.0|
  10. |2020-02-02|17900020aa| 5| 2| 64| 5| 264.0|
  11. |2020-02-02|16900120aa| 3| 1| 69| 6| 0.0|
  12. |2020-02-02|11160900aa| 6| 2| 52| 5| 0.0|
  13. |2020-03-02|16900290aa| 5| 1| 37| 3| 0.0|
  14. +----------+----------+------+---+---+---------+-----------------+

在这里,我将每个用户的注册日期存储在regs cte中,然后计算每个月的注册数。这个带有cte的块在databricks中没有任何问题

  1. %sql
  2. WITH regs AS (
  3. SELECT
  4. user_id,
  5. MIN(data_date) AS reg_date
  6. FROM df2
  7. GROUP BY user_id)
  8. SELECT
  9. month(reg_date) AS reg_month,
  10. COUNT(DISTINCT user_id) AS users
  11. FROM regs
  12. GROUP BY reg_month
  13. ORDER BY reg_month ASC;

但是,当我将另一个cte添加到以前的sql查询中时,它返回一个错误(我在sqlserver中测试了这个块,它工作正常)。我不明白为什么他不在spark databricks工作。

  1. %sql
  2. WITH regs AS (
  3. SELECT
  4. user_id,
  5. MIN(data_date) AS reg_date
  6. FROM df2
  7. GROUP BY user_id
  8. ),
  9. regs_per_month AS (
  10. SELECT
  11. month(reg_date) AS reg_month,
  12. COUNT(DISTINCT user_id) AS users
  13. FROM regs
  14. GROUP BY reg_month
  15. )
  16. SELECT
  17. reg_month,
  18. users,
  19. LAG(users, 1) OVER (ORDER BY regs_per_month ASC) AS previous_users
  20. FROM regs_per_month
  21. ORDER BY reg_month ASC;

这是错误信息

  1. Error in SQL statement: AnalysisException: cannot resolve '`regs_per_month`' given input columns: [regs_per_month.reg_month, regs_per_month.users]; line 20 pos 31;
  2. 'Sort ['reg_month ASC NULLS FIRST], true
tez616oj

tez616oj1#

只需使用逗号,就可以在sparksql中嵌套公共表表达式(cte)

  1. %sql
  2. ;WITH regs AS (
  3. SELECT
  4. user_id,
  5. MIN(data_date) AS reg_date
  6. FROM df2
  7. GROUP BY user_id
  8. ),
  9. regs_per_month AS (
  10. SELECT
  11. month(reg_date) AS reg_month,
  12. COUNT(DISTINCT user_id) AS users
  13. FROM regs
  14. GROUP BY reg_month
  15. )
  16. SELECT
  17. reg_month,
  18. users,
  19. LAG(users, 1) OVER (ORDER BY reg_month ASC) AS previous_users
  20. FROM regs_per_month
  21. ORDER BY reg_month ASC;

我的结果:

如前所述 LAG 语句应引用 reg_month 列而不是 regs_per_month cte公司。
作为嵌套CTE的另一种方法,您可以使用多个, WITH 陈述,例如

  1. %sql
  2. ;WITH regs_per_month AS (
  3. WITH regs AS (
  4. SELECT
  5. user_id,
  6. MIN(data_date) AS reg_date
  7. FROM df2
  8. GROUP BY user_id
  9. )
  10. SELECT
  11. month(reg_date) AS reg_month,
  12. COUNT(DISTINCT user_id) AS users
  13. FROM regs
  14. GROUP BY reg_month
  15. )
  16. SELECT
  17. reg_month,
  18. users,
  19. LAG( users, 1 ) OVER ( ORDER BY reg_month ASC ) AS previous_users
  20. FROM regs_per_month
  21. ORDER BY reg_month ASC;
展开查看全部

相关问题