Hive: how to output the total row count as a variable

Posted by dgiusagp on 2021-05-27 in Hadoop

I have a dataset that I am deduplicating with the following code:

select session_id, sol_id, id, session_context_code, date
    from (
        select *, ROW_NUMBER() OVER (PARTITION BY session_id, sol_id, date) as rn,
        substr(case_id,2,9) as id

        from df.t1_data
         )undup
        where undup.rn =1 
        order by session_id, sol_id, date

I want to add a variable that stores the total number of rows after deduplication. I tried count(*):

select session_id, sol_id, id, session_context_code, date,count(*) as total
    from (
        select *, ROW_NUMBER() OVER (PARTITION BY session_id, sol_id,date) as rn,
        substr(case_id,2,9) as id

        from df.t1_data
         )undup
        where undup.rn =1 
        order by session_id, sol_id, date

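The error I get:
ERROR: Execute error: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException [Error 10025]: Line 1:44 Expression not in GROUP BY key 'session_id'
I just want to output a single count as a variable: after deduplicating by row number, count all distinct records by session_id and sol_id. How can I incorporate that into the code?
I followed Gomz's suggestion, but got this error:
ERROR: Execute error: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ParseException line 1:614 missing EOF at 'group' near 'nifi_date'
Code: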

select session_id, solicit_id, nifi_date,id, session_context_code,count(*) as total
    from (
        select *, ROW_NUMBER() OVER (PARTITION BY session_id, sol_id) as rn,
        substr(case_id,2,9) as id
        from df.t1_data
         )undup
        where undup.rn =1 and 
        session_context_code in ("4","3") and
        order by session_id, sol_id, nifi_date
        group by session_id, sol_id, nifi_date,id, session_context_code
Answer #1 (2izufjch)

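A Hive query that has COUNT(*) alongside other columns in the SELECT clause must list those columns in a GROUP BY clause at the end.
Some examples:
SELECT COUNT(*) FROM employees;
SELECT id, name, COUNT(*) FROM employees GROUP BY id, name;
In your scenario, the query should look like this: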

select session_id, sol_id, id, session_context_code, count(*) as total
    from (
        select *, ROW_NUMBER() OVER (PARTITION BY session_id, sol_id,date) as rn,
        substr(case_id,2,9) as id

        from df.t1_data
         )undup
        where undup.rn =1 
GROUP BY session_id, sol_id, id, session_context_code
        order by session_id, sol_id
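
The ParseException in the question's follow-up attempt looks like a clause-order problem rather than a GROUP BY problem: the WHERE clause ends with a dangling "and", and ORDER BY is written before GROUP BY. Hive expects WHERE, then GROUP BY, then ORDER BY. A sketch of that query with the clauses reordered, assuming the column is really named sol_id (the question uses both solicit_id and sol_id) and that nifi_date exists in df.t1_data:

```sql
-- Sketch only: sol_id and nifi_date are taken from the question and may not match the real schema.
select session_id, sol_id, nifi_date, id, session_context_code, count(*) as total
    from (
        select *, ROW_NUMBER() OVER (PARTITION BY session_id, sol_id) as rn,
        substr(case_id,2,9) as id
        from df.t1_data
         ) undup
    where undup.rn = 1
      and session_context_code in ('4','3')   -- no trailing "and" after the last condition
    group by session_id, sol_id, nifi_date, id, session_context_code   -- GROUP BY before ORDER BY
    order by session_id, sol_id, nifi_date
```

Note that total here is a count per group, not a count over the whole result.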

Update: if you only want to count all distinct records by session_id and sol_id, the query can look like this:

select session_id, sol_id, count(*) as total
    from (
        select *, ROW_NUMBER() OVER (PARTITION BY session_id, sol_id,date) as rn,
        substr(case_id,2,9) as id

        from df.t1_data
         )undup
        where undup.rn =1 
GROUP BY session_id, sol_id
        order by session_id, sol_id;
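
If what you are after is a single overall total attached to every deduplicated row, rather than a count per group, one option that is not part of the original suggestion is a window count with an empty OVER clause. This is a sketch under the assumption that your Hive version supports windowing functions with an empty window specification; the column names are the ones from the question:

```sql
-- Sketch only: count(*) over () is evaluated after the WHERE filter,
-- so total is the number of rows left once rn = 1 has been applied.
select session_id, sol_id, id, session_context_code, date,
       count(*) over () as total
    from (
        select *, ROW_NUMBER() OVER (PARTITION BY session_id, sol_id, date) as rn,
        substr(case_id,2,9) as id
        from df.t1_data
         ) undup
    where undup.rn = 1
    order by session_id, sol_id, date
```

Every row carries the same total value, and no GROUP BY is needed, so all the original columns can stay in the SELECT list.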

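As mentioned above, the SELECT list may contain only the columns you are counting by, and those same columns must appear in the GROUP BY.
If you need more columns in the result than the ones you are counting by, you can create a temporary table that holds only the counted columns and join it back to the original table. For example, if you need columns c, d, e, f from the table while counting by columns a, b, you can do the following: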

CREATE TABLE tmp AS 
SELECT a, b, count(*) AS total
FROM table1
GROUP BY a,b;

Then join tmp and table1 on columns a and b:

SELECT y.a, y.b, y.total, x.c, x.d, x.e, x.f
FROM tmp y
JOIN table1 x
  ON y.a = x.a
 AND y.b = x.b;
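
If you would rather not materialize tmp, the same idea can probably be written as a join against an inline subquery. A minimal sketch, reusing table1 and columns a to f from the example above; the alias total is just an illustrative name:

```sql
-- Sketch only: the subquery y plays the role of the tmp table.
SELECT x.a, x.b, x.c, x.d, x.e, x.f, y.total
FROM table1 x
JOIN (
    SELECT a, b, count(*) AS total   -- one row per (a, b) with its count
    FROM table1
    GROUP BY a, b
) y
  ON x.a = y.a
 AND x.b = y.b;
```

Each row of table1 comes back with the count of its (a, b) group in total.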

Hope this helps!