我在找一个聪明的方法来统计事件。
举个例子:
UserID CityID CountryID TagID
100000 1 30 5
100001 1 30 6
100000 2 20 7
100000 2 40 8
100001 1 40 6
100002 1 40 5
100002 1 20 6
我想做的是:
我要按列和为每个用户计算值的出现次数。最后,我想要一个表格,其中显示了多少用户有超过不同的特点。
结果应该是这样的-或多或少
Different_CityID Different_CountryIDs Different_TagIDs
1 3 2
说明:
不同的cityid:just userid 100000有不同的cityid
不同的国家ID:所有用户的国家都有不同的ID
不同的标签号:userid100000和100002都有不同的标签号。用户100001只有“6”作为tagid。
我挣扎着为列和分组计数,但最终没有成功。有聪明的解决办法吗?
谢谢
3条答案
按热度按时间omjgkv6w1#
这是另一个避免
count(distinct ...)
```select count (case when pos=0 and not is_distinct_ID then 1 end) as different_cityid
,count (case when pos=1 and not is_distinct_ID then 1 end) as different_countryid
,count (case when pos=2 and not is_distinct_ID then 1 end) as different_tagid
from (select pe.pos
,min(pe.ID)<=>max(pe.ID) as is_distinct_ID
from mytable t
lateral view posexplode (array(CityID,CountryID,TagID)) pe as pos,ID
;
select count (case when not is_distinct_CityID then 1 end) as different_cityid
,count (case when not is_distinct_CountryID then 1 end) as different_countryid
,count (case when not is_distinct_TagID then 1 end) as different_tagid
from (select min (CityID) <=> max (CityID) as is_distinct_CityID
,min (CountryID) <=> max (CountryID) as is_distinct_CountryID
,min (TagID) <=> max (TagID) as is_distinct_TagID
;
h4cxqtbf2#
select uid,cid,count(c),count(g) from(select cid,uid,count(coid) over(partition by cid,uid) as c,count(tagid) over(partition by cid,tagid) as g from citydata)e group by cid,uid;
这里uid=userid,cid=cityid,coid=countryid,tagidTotal MapReduce CPU Time Spent: 0 msec OK uid cid coid tagid 100000 1 1 1 100001 1 2 2 100002 1 2 2 100000 2 2 2 Time taken: 3.865 seconds, Fetched: 4 row(s)
基于userid
我希望这会有帮助zvokhttg3#
使用下面的代码我想它可以帮助你,
结果是这样的,
你好,维努