Given a table with different hex_ids on given dates, I would like to aggregate the data so that the total number of distinct users for hex_id A equals the sum of the distinct users across the hex_ids in its hex_map, [A, B, C]:
+----------+-------+------+---------+
| date_id|user_id|hex_id| hex_map|
+----------+-------+------+---------+
|2016-11-01| 100| A|[A, B, C]|
|2016-11-01| 300| B| [B]|
|2016-11-01| 400| B| [B]|
|2016-11-01| 100| C| [B, C]|
|2016-11-01| 200| C| [B, C]|
|2016-11-01| 300| C| [B, C]|
+----------+-------+------+---------+
Aggregating the table on hex_id gives these distinct-user counts:
+------+---------+---+
|hex_id| hex_map|cnt|
+------+---------+---+
| A|[A, B, C]| 1|
| B| [B]| 2|
| C| [B, C]| 3|
+------+---------+---+
and I would like each hex_map value to be replaced by the sum of the cnt values of the hex_ids it contains:
+------+---------+---+
|hex_id| hex_map|cnt|
+------+---------+---+
| A| 6 | 1|
| B| 2 | 2|
| C| 5 | 3|
+------+---------+---+
This runs on Spark SQL 2.4.0, and I am stumped as to how to achieve it. The value 6 for A comes from cnt(A) + cnt(B) + cnt(C) = 1 + 2 + 3.
My best attempt so far is:
query = """
with cte as (select hex_id, hex_map, count(distinct user_id) as cnt from tab group by hex_id, hex_map),
subq as (select hex_id as hex, cnt as cnts, explode(hex_map) as xxt from cte),
sss as (select * from subq a left join cte b on a.xxt = b.hex_id)
select hex, sum(cnt) as cnt from sss group by hex
"""
spark.sql(query).show()
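The explode-and-join strategy in this query is sound once the CTE named sss is given its missing `as`; a plain-Python trace of the same steps (count distinct users per hex_id, explode the hex_map, look up each member's count, sum per hex_id) reproduces the desired output. This is an illustration of the logic only, not Spark code:

```python
from collections import defaultdict

# Sample rows from the question: (date_id, user_id, hex_id, hex_map).
rows = [
    ("2016-11-01", 100, "A", ["A", "B", "C"]),
    ("2016-11-01", 300, "B", ["B"]),
    ("2016-11-01", 400, "B", ["B"]),
    ("2016-11-01", 100, "C", ["B", "C"]),
    ("2016-11-01", 200, "C", ["B", "C"]),
    ("2016-11-01", 300, "C", ["B", "C"]),
]

# cte: distinct-user count and hex_map per hex_id.
users, hex_maps = defaultdict(set), {}
for _, user_id, hex_id, hex_map in rows:
    users[hex_id].add(user_id)
    hex_maps[hex_id] = hex_map
cnt = {h: len(u) for h, u in users.items()}

# subq + join + group by: explode each hex_map and sum the member counts.
result = {h: sum(cnt[member] for member in m) for h, m in hex_maps.items()}
print(cnt)     # {'A': 1, 'B': 2, 'C': 3}
print(result)  # {'A': 6, 'B': 2, 'C': 5}
```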
1 Answer
Since you did not specify how the aggregation should behave, I decided to use `first`, but you can adjust that to your needs. The idea is to convert the characters to their ASCII representation, which you can do with the code below:
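A plausible reconstruction of the ASCII idea, assuming the trick is that ascii('A') - 64 = 1 and that, in the sample data, each hex_id's distinct-user count happens to equal its letter's alphabetical position (A→1, B→2, C→3). The Spark SQL expression in the comment is an assumption, not the answer's original code:

```python
# Hypothetical reconstruction of the ASCII trick in plain Python.
# In Spark 2.4 SQL the same idea could use the higher-order function
#   aggregate(hex_map, 0, (acc, x) -> acc + ascii(x) - 64)
# (an assumption -- the answer's exact code is not reproduced here).

def hex_map_value(hex_map):
    # ord('A') - 64 == 1, ord('B') - 64 == 2, ord('C') - 64 == 3
    return sum(ord(x) - 64 for x in hex_map)

values = {h: hex_map_value(m)
          for h, m in {"A": ["A", "B", "C"], "B": ["B"], "C": ["B", "C"]}.items()}
print(values)  # {'A': 6, 'B': 2, 'C': 5}
```

Note that this matches the expected output only because each cnt coincides with its letter's alphabetical index in the sample; for arbitrary counts, summing the actual cnt of each hex_map member (as the question's explode-and-join attempt does) is the general approach.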
Alternatively, use the Dataset API. Good luck!