hive 数组值的集合

bxpogfeg 于 2023-02-12 发布在 Hive

关注(0)|答案(1)|浏览(177)

给定一个在特定日期具有不同hex_id的表，我希望聚合数据，使hex_id A的不同用户总数等于hex_id [A，B，C]中不同用户的总和

+----------+-------+------+---------+
|   date_id|user_id|hex_id|  hex_map|
+----------+-------+------+---------+
|2016-11-01|    100|     A|[A, B, C]|
|2016-11-01|    300|     B|      [B]|
|2016-11-01|    400|     B|      [B]|
|2016-11-01|    100|     C|   [B, C]|
|2016-11-01|    200|     C|   [B, C]|
|2016-11-01|    300|     C|   [B, C]|
+----------+-------+------+---------+

I would like to aggregate the table on hex_id such that the value

+------+---------+---+
|hex_id|  hex_map|cnt|
+------+---------+---+
|     A|[A, B, C]|  1|
|     B|      [B]|  2|
|     C|   [B, C]|  3|
+------+---------+---+

becomes being replaced by the alphabets
+------+---------+---+
|hex_id|  hex_map|cnt|
+------+---------+---+
|     A|       6 |  1|
|     B|       2 |  2|
|     C|       5 |  3|
+------+---------+---+

这是在spark sql 2. 4. 0上运行的，我被如何实现这一点难住了。6的值来自[A+B+C]
我最大的努力就是

query="""
with cte as (select hex_id, hex_map, count(distinct user_id) cnt from tab group by hex_id, hex_map),
     subq as (select hex_id as hex, cnt as cnts, explode(hex_map) xxt from cte),
     sss (select * from subq a left join cte b  on a.xxt = b.hex_id)
     select hex, sum(cnt) from sss group by hex
"""
spark.sql(query).show()

Hive

来源：https://stackoverflow.com/questions/75421761/aggregate-of-array-values

1条答案

按热度按时间

ubof19bj1#

因为您没有指定聚合的行为，所以我决定使用first，但是您可以根据自己的需要进行调整。
我们的想法是将字符转换为ascii表示形式，您可以通过下面的代码来完成此操作：

val df1 = spark.sql("select hex_id, first(hex_map) as first_hex_map from test group by hex_id")
df1.createOrReplaceTempView("df1")

val df2 = spark.sql("select hex_id, transform(first_hex_map, a -> ascii(a) - 64) as aggr from df1")
df2.createOrReplaceTempView("df2")

val df3 = spark.sql("select hex_id, aggr, aggregate(aggr, 0, (acc, x) -> acc + x) as final from df2")

最终结果：

+------+---------+-----+
|hex_id|aggr     |final|
+------+---------+-----+
|A     |[1, 2, 3]|6    |
|B     |[2]      |2    |
|C     |[2, 3]   |5    |
+------+---------+-----+

或者使用Dataset API：

df.groupBy("hex_id").agg(first("hex_map").as("first_hex_map"))
  .withColumn("transformed", transform(col("first_hex_map"), a => ascii(a).minus(64)))
  .withColumn("hex_map", aggregate(col("transformed"), lit(0), (acc, x) => acc.plus(x)))

祝你好运!

赞(0）回复(0）举报 2023-02-12

我来回答

hive 数组值的集合

1条答案

相关问题

热门标签

最新问答