spark-sql

bqf10yzr  于 2021-07-09  发布在  Spark
关注(0)|答案(1)|浏览(275)

给定下面的两个表,对于每个数据点,我想计算我们有值的不同年份的数量。但是,sparksql不允许将count distinct和filter组合在一起。

CREATE TABLE datapoints (name STRING);

INSERT INTO
  datapoints
VALUES
  ('Name'),
  ('Height'),
  ('Color');

CREATE TABLE entities (datapoint STRING, year INT, value STRING);

INSERT INTO
  entities
VALUES
  ('Name', 2015, 'John'),
  ('Name', 2015, 'Suzan'),
  ('Name', 2017, 'Jim'),
  ('Color', 2015, 'Blue')

SELECT
  dp.name,
  COUNT(DISTINCT year) FILTER (
    WHERE
      value IS NOT NULL
  ) as DPCount  
FROM
  datapoints as dp
  LEFT JOIN entities on datapoint = dp.name
GROUP BY
  dp.name

结果:
sql语句错误:analysisexception:不能在聚合函数中同时使用distinct和filter;3号线位置2
在功能上等效的有效sparksql语句是什么?预期的输出是(注意'name'的重复年份):
名称计数颜色1高度0名称2

esyap4oy

esyap4oy1#

在下列情况下,请尝试对案例执行count distinct:

SELECT
  dp.name,
  COUNT(DISTINCT case when value is not null then year end) as DPCount  
FROM
  datapoints as dp
  LEFT JOIN entities on datapoint = dp.name
GROUP BY
  dp.name

相关问题