Hive: finding a running total that excludes duplicates

Posted by to94eoyn on 2021-06-26 in Hive

Hi, I have a rather odd problem that I can't find a solution to. I have a table, UserViews, with the following columns:

Progdate(String)
UserName(String)

Dummy data in the table:

Progdate    UserName
20161119    A
20161119    B
20161119    C
20161119    B
20161120    D
20161120    E
20161120    A
20161121    B
20161121    A
20161121    B
20161121    F
20161121    G

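A row is written to the table every time a user views the program. For example, on 19 November user A watched the program once, so there is only one row. User B watched it twice, so there are two rows for that user on 19 November, and so on.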

Select Progdate, count(distinct UserName) UniqueUsersByDate 
from UserViews 
group by Progdate;

The query above gives me, for each date, the number of distinct users who watched the program:

Progdate    UniqueUsersByDate
20161119    3
20161120    3
20161121    4

The following query:

Select Progdate, UniqueUsersByDate, Sum(UniqueUsersByDate) over(Order By Progdate) RunningTotalNewUsers
from
(
Select Progdate, count(distinct UserName) UniqueUsersByDate
from 
UserViews 
group by Progdate SORT BY Progdate
) UV;

gives this result:

Progdate    UniqueUsersByDate   RunningTotalNewUsers
20161119    3                   3
20161120    3                   6
20161121    4                   10

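But what I want is the running total of users who watched the program for the first time. That is, if user A watched the program on 20161119 and then again on 20161120, that user should not be counted again in the running total for 20161120. So the result I want from the table above is: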

Progdate    UniqueUsersByDate   RunningTotalNewUsers
20161119        3               3
20161120        3               5
20161121        4               7

I am looking for a solution in Hive HQL only. Any input on this problem is much appreciated.
Thanks.

Hive cumulative-sum

Source: https://stackoverflow.com/questions/46494809/hive-finding-running-total-excluding-duplicates

1 answer

bfrts1fy1#

select      Progdate
           ,UniqueUsersByDate
           ,sum(Users1stOcc) over
            (
                order by    Progdate
            )                           as RunningTotalNewUsers

from       (select      Progdate
                       ,count (distinct UserName)           as UniqueUsersByDate
                       ,count (case when rn = 1 then 1 end) as Users1stOcc

            from       (select  Progdate
                               ,UserName
                               ,row_number() over
                                (
                                    partition by    UserName
                                    order by        Progdate
                                )   as rn

                        from    UserViews
                        ) uv

            group by    Progdate
            ) uv
;

+-------------+--------------------+-----------------------+
|  progdate   | uniqueusersbydate  | runningtotalnewusers  |
+-------------+--------------------+-----------------------+
| 2016-11-19  | 3                  | 3                     |
| 2016-11-20  | 3                  | 5                     |
| 2016-11-21  | 4                  | 7                     |
+-------------+--------------------+-----------------------+
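
The query works in three steps: row_number() partitioned by UserName and ordered by Progdate gives exactly one row per user rn = 1 (that user's earliest view); the group by then counts those rn = 1 rows per date as Users1stOcc alongside the distinct-user count; finally the windowed sum accumulates Users1stOcc in date order. As a sanity check of that logic outside Hive, here is a minimal Python sketch that mirrors the same steps on the question's dummy data. It is not HiveQL, and the names USER_VIEWS and running_total_new_users are made up for this illustration.

```python
from collections import defaultdict
from itertools import accumulate

# Dummy rows from the question: (Progdate, UserName)
USER_VIEWS = [
    ("20161119", "A"), ("20161119", "B"), ("20161119", "C"), ("20161119", "B"),
    ("20161120", "D"), ("20161120", "E"), ("20161120", "A"),
    ("20161121", "B"), ("20161121", "A"), ("20161121", "B"),
    ("20161121", "F"), ("20161121", "G"),
]


def running_total_new_users(rows):
    """Return (progdate, unique_users_by_date, running_total_new_users) tuples."""
    seen = set()                          # users that already had their "rn = 1" row
    users_by_date = defaultdict(set)      # plays the role of count(distinct UserName)
    first_occ_by_date = defaultdict(int)  # plays the role of count(case when rn = 1 ...)

    # Step 1: walk the rows in date order; the first row met for a user
    # corresponds to the row that row_number() would label rn = 1.
    for progdate, user in sorted(rows):
        users_by_date[progdate].add(user)
        if user not in seen:
            seen.add(user)
            first_occ_by_date[progdate] += 1

    # Steps 2 and 3: one output row per date, then a running sum in date order.
    dates = sorted(users_by_date)
    running = accumulate(first_occ_by_date[d] for d in dates)
    return [(d, len(users_by_date[d]), total) for d, total in zip(dates, running)]


if __name__ == "__main__":
    for row in running_total_new_users(USER_VIEWS):
        print(row)
```

On the dummy data this yields 3, 5 and 7 as the running totals for the three dates, matching the result the question asks for. A date on which no user appears for the first time contributes 0, so the running total simply stays flat for that date.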

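P.S.
In theory, combining the aggregation with the sum analytic function should not require an additional subquery, but the parser seems to have a problem with it (bug/feature).
Note that an additional subquery does not necessarily mean an additional execution stage. For example, select * from (select * from (select * from (select * from (select * from t)t)t)t)t; and select * from t will have the same execution plan.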
