hive-compute基于日期窗口的分区统计信息

alen0pnh  于 2021-06-24  发布在  Hive
关注(0)|答案(2)|浏览(578)

我见过解决类似问题的方法,但没有一种对我很有效。我也有信心,应该有办法让它发挥作用。
给我一张table
iddatetarget12020-01-01112020-01-02112020-01-03012020-01-04112020-01-04012020-06-01112020-06-02112020-06-03012020-06-04112020-06-0402220-01-011
id是bigint,target是int,date是date
我想计算每个id/日期在该日期之前的3个月和12个月内(含)相同id的行数和总和。输出示例:
iddatesum计数12计数1212020-01-011111020-01-0222212020-01-03232312020-01-04353512020-06-01114612020-06-02225712020-06-03236812020-06-043571022020-01-011111
我怎样才能在Hive里得到这一次的结果呢?我不确定是否应该使用分析函数(以及如何使用)、分组方式等。。。?

ibps3vxo

ibps3vxo1#

如果你能估计间隔(1个月=30天):(gmb答案的改进)

with t as (
    select ID, Date,
        sum(target) target,
        count(target) c_target
    from table
    group by ID, Date
)
select ID, Date,
    sum(target) over(
        partition by ID
        order by unix_timestamp(Date, 'yyyy-MM-dd')
        range 60 * 60 * 24 * 90 preceding
    ) sum_3,
    sum(c_target) over(
        partition by ID
        order by unix_timestamp(Date, 'yyyy-MM-dd')
        range 60 * 60 * 24 * 90 preceding
    ) count_3,
    sum(target) over(
        partition by ID
        order by unix_timestamp(Date, 'yyyy-MM-dd')
        range 60 * 60 * 24 * 360 preceding
    ) sum_12,
    sum(c_target) over(
        partition by ID
        order by unix_timestamp(Date, 'yyyy-MM-dd')
        range 60 * 60 * 24 * 360 preceding
    ) count_12
from t

或者,如果需要精确的间隔,可以进行自连接(但代价很高):

with t as (
    select ID, Date,
        sum(target) target,
        count(target) c_target
    from table
    group by ID, Date
)
select
    t_3month.ID, 
    t_3month.Date, 
    t_3month.sum_3, 
    t_3month.count_3, 
    sum(t3.target) sum_12, 
    sum(t3.c_target) count_12
from (
    select 
        t1.ID, 
        t1.Date,
        sum(t2.target) sum_3,
        sum(t2.c_target) count_3
    from t t1
    left join t t2
    on t2.Date > t1.Date - interval 3 month and
       t2.Date <= t1.Date and
       t1.ID = t2.ID
    group by t1.ID, t1.Date
) t_3month
left join t t3
on t3.Date > t_3month.Date - interval 12 month and
   t3.Date <= t_3month.Date and
   t_3month.ID = t3.ID
group by t_3month.ID, t_3month.Date, t_3month.sum_3, t_3month.count_3
order by ID, Date;
zujrkrfu

zujrkrfu2#

如果您可以将月数近似为天数,则可以在配置单元中使用窗口函数:

select id, date, 
    count(*) over(
        partition by id 
        order by unix_timestamp(date)
        range 60 * 60 * 24 * 90 preceding -- 90 days
    ) as count_3,
    sum(target) over(
        partition by id 
        order by unix_timestamp(date)
        range 60 * 60 * 24 * 90 preceding
    ) as sum_3,
    count(*) over(
        partition by id 
        order by unix_timestamp(date)
        range 60 * 60 * 24 * 360 preceding -- 360 days
    ) as count_12,
    sum(target) over(
        partition by id 
        order by unix_timestamp(date)
        range 60 * 60 * 24 * 360 preceding
    ) as sum_12
from mytable

可以在同一查询中聚合:

select id, date, 
    sum(count(*)) over(
        partition by id 
        order by unix_timestamp(date)
        range 60 * 60 * 24 * 90 preceding -- 90 days
    ) as count_3,
    sum(sum(target)) over(
        partition by id 
        order by unix_timestamp(date)
        range 60 * 60 * 24 * 90 preceding
    ) as sum_3,
    sum(count(*)) over(
        partition by id 
        order by unix_timestamp(date)
        range 60 * 60 * 24 * 360 preceding -- 360 days
    ) as count_12,
    sum(sum(target)) over(
        partition by id 
        order by unix_timestamp(date)
        range 60 * 60 * 24 * 360 preceding
    ) as sum_12
from mytable
group by id, date, unix_timestamp(date)

相关问题