Hive: group by with a where clause on one of the records

yxyvkwin, posted on 2021-06-26 in Hive

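I'm trying to filter groups down to only those where at least one participant had a session time longer than 5 minutes.
My current query: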

select 
U.session_id,
U.session_date,
U.participant_duration,
U.email
from data.usage U
left outer join
  (select 
  distinct M.session_id
  from data.usage M
  where email like '%gmail.com%'
  and data_date >= '20180101'
  and name in
    ( 
    select 
    lower(name)
    from data.users
    where role like 'Person%' 
    and isactive = TRUE
    and data_date = '20180412'
    ))M
on U.session_id = M.session_id

Once the data comes out, it looks like this:

session_id   session_date   participant_duration   email
143          20180401       0.4                    huy@gmail.com
143          20180401       1.5                    t@gmail.com
143          20180401       1.6                    att@gmail.com
143          20180401       2.3                    m@gmail.com
124          20180401       5.6                    p@gmail.com
124          20180401       3.2                    alex@gmail.com
165          20180401       4.1                    jeff@gmail.com
165          20180401       3.1                    nader@gmail.com

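I would like to filter this with a where clause that brings back only the groups that have at least one record with participant_duration >= 5.
Is what I want something like group by session_id having participant_duration >= 5, or is that far off?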

Answer 1 (bq3bfh9z)

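If you use group by on the session_id field, you need aggregate functions (such as sum, min, max) on the other fields in the select query.
I assume session_id and session_date are the same across a session's records, so I used both fields in the group by. If you don't want session_date in the group by, you need to apply an aggregate function to it instead, such as max(session_date).
I used the sum aggregate function on participant_duration, and the having clause then keeps only the groups that contain at least one record with participant_duration >= 5.
The only field left in the select statement is email, which is not in the group by clause, so I used the max aggregate function to get a single email value per group.
With session_date in the group by clause: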

select 
U.session_id,
U.session_date,
sum(U.participant_duration) participant_duration,
max(U.email) email
from data.usage U
left outer join
  (select 
  distinct M.session_id
  from data.usage M
  where email like '%gmail.com%'
  and data_date >= '20180101'
  and name in
    ( 
    select 
    lower(name)
    from data.users
    where role like 'Person%' 
    and isactive = TRUE
    and data_date = '20180412'
    ))M
on U.session_id = M.session_id
group by U.session_id,U.session_date
having sum(cast(U.participant_duration >= 5 as int)) >= 1;

(or)
Without session_date in the group by clause:

select 
U.session_id,
max(U.session_date) session_date,
sum(U.participant_duration) participant_duration,
max(U.email) email
from data.usage U
left outer join
  (select 
  distinct M.session_id
  from data.usage M
  where email like '%gmail.com%'
  and data_date >= '20180101'
  and name in
    ( 
    select 
    lower(name)
    from data.users
    where role like 'Person%' 
    and isactive = TRUE
    and data_date = '20180412'
    ))M
on U.session_id = M.session_id
group by U.session_id
having sum(cast(U.participant_duration >= 5 as int)) >= 1;
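
To try the aggregate form without a Hive cluster, here is a minimal sketch that runs the same group by / having shape against the sample rows from the question. It uses Python's built-in sqlite3 as a stand-in for Hive, which is an assumption: the join against data.users is left out, the table is simply called usage, and the summed column is aliased total_duration (a name chosen here) so it cannot be confused with the raw participant_duration column.

```python
import sqlite3

# Sample rows from the question: (session_id, session_date, participant_duration, email)
rows = [
    (143, "20180401", 0.4, "huy@gmail.com"),
    (143, "20180401", 1.5, "t@gmail.com"),
    (143, "20180401", 1.6, "att@gmail.com"),
    (143, "20180401", 2.3, "m@gmail.com"),
    (124, "20180401", 5.6, "p@gmail.com"),
    (124, "20180401", 3.2, "alex@gmail.com"),
    (165, "20180401", 4.1, "jeff@gmail.com"),
    (165, "20180401", 3.1, "nader@gmail.com"),
]

conn = sqlite3.connect(":memory:")  # SQLite stands in for Hive here
conn.execute(
    "create table usage (session_id integer, session_date text, "
    "participant_duration real, email text)"
)
conn.executemany("insert into usage values (?, ?, ?, ?)", rows)

# One output row per session; keep a session only if at least one of its
# rows has participant_duration >= 5.
query = """
select session_id,
       max(session_date)         as session_date,
       sum(participant_duration) as total_duration,
       max(email)                as email
from usage
group by session_id
having sum(cast(participant_duration >= 5 as int)) >= 1
"""
result = conn.execute(query).fetchall()
print(result)
```

Only session 124 comes back, collapsed to a single row. Note that max(email) picks one arbitrary-looking address per group, so this form suits a per-session summary rather than a listing of participants.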

Answer 2 (fykwrbwg)

Yes, you have the right idea with group by and having.

group by session_id
having sum(cast(participant_duration >= 5 as int)) >= 1
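
The having condition works because the comparison is cast to 0 or 1 per row and then summed per group, which gives the number of rows in each session that reach 5 minutes. A small sketch of that mechanism, again assuming SQLite (via Python's sqlite3) as a stand-in for Hive and a cut-down usage table with only the two relevant columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for Hive here
conn.execute("create table usage (session_id integer, participant_duration real)")
conn.executemany(
    "insert into usage values (?, ?)",
    [(143, 0.4), (143, 1.5), (143, 1.6), (143, 2.3),
     (124, 5.6), (124, 3.2), (165, 4.1), (165, 3.1)],
)

# Step 1: count, per session, the rows with participant_duration >= 5.
counts = conn.execute(
    """
    select session_id,
           sum(cast(participant_duration >= 5 as int)) as long_rows
    from usage
    group by session_id
    order by session_id
    """
).fetchall()
print(counts)  # [(124, 1), (143, 0), (165, 0)]

# Step 2: the same expression in having keeps sessions with a count of at least 1.
kept = conn.execute(
    """
    select session_id
    from usage
    group by session_id
    having sum(cast(participant_duration >= 5 as int)) >= 1
    order by session_id
    """
).fetchall()
print(kept)  # [(124,)]
```

A bare having participant_duration >= 5 would not express this, since having is evaluated per group and needs an aggregate over the group's rows.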

Your query can also be simplified to:

select *
from (select U.session_id,U.session_date,U.participant_duration,U.email,
      SUM(cast(U.participant_duration >= 5 as int)) OVER(PARTITION BY U.session_id) as dur_gt_5
      from data.usage U
      join data.users M on U.session_id = M.session_id and U.name=lower(M.name)
      where M.role like 'Person%' and M.isactive = TRUE and M.data_date = '20180412'
      and U.email like '%gmail.com%' and U.data_date >= '20180101'
     ) t
where dur_gt_5>=1
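
Unlike group by, the window-function form keeps every participant row of a qualifying session. The sketch below illustrates that on the question's sample rows; it assumes SQLite 3.25 or newer (for window functions) through Python's sqlite3 as a stand-in for Hive, and it omits the join against data.users.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for Hive here
conn.execute(
    "create table usage (session_id integer, session_date text, "
    "participant_duration real, email text)"
)
conn.executemany(
    "insert into usage values (?, ?, ?, ?)",
    [
        (143, "20180401", 0.4, "huy@gmail.com"),
        (143, "20180401", 1.5, "t@gmail.com"),
        (143, "20180401", 1.6, "att@gmail.com"),
        (143, "20180401", 2.3, "m@gmail.com"),
        (124, "20180401", 5.6, "p@gmail.com"),
        (124, "20180401", 3.2, "alex@gmail.com"),
        (165, "20180401", 4.1, "jeff@gmail.com"),
        (165, "20180401", 3.1, "nader@gmail.com"),
    ],
)

# Attach the per-session count of rows >= 5 to every row, then filter on it.
rows = conn.execute(
    """
    select session_id, participant_duration, email
    from (select session_id, participant_duration, email,
                 sum(cast(participant_duration >= 5 as int))
                     over (partition by session_id) as dur_gt_5
          from usage) t
    where dur_gt_5 >= 1
    order by participant_duration desc
    """
).fetchall()
print(rows)  # [(124, 5.6, 'p@gmail.com'), (124, 3.2, 'alex@gmail.com')]
```

Both rows of session 124 are returned, including the 3.2-minute participant, because the filter applies to the session as a whole rather than to individual rows.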
