hive/spark sql中用于比较聚合的相关子查询问题

x6h2sr28  于 2021-06-26  发布在  Hive
关注(0)|答案(0)|浏览(256)

我有一个大约10亿个游戏交互的表格,每个记录描述了一个玩家在某一天的属性(玩家id,玩游戏的日期,属性1,[属性2,…])

+---------+
| TABLE_1 |
+-------+-----------+-------+-------+-------+
|gmr_id | played_dt | attr1 | attr2 | attr3 |
+-------+-----------+-------+-------+-------+
|1      | 2017-01-01| 1     | 2     | txt   |
|1      | 2017-01-02| 3     | 2     | txt   |
|2      | 2017-01-02| 1     | 2     | txt   |
+-------+-----------+-------+-------+-------+

我还有一张有数百万条记录的表格,玩家的移动记录在每一个游戏中:

+---------+
| TABLE_2 |
+-------+-----------+---------+---------+---------+
|gmr_id | played_dt | finish  | attacks | deaths  |
+-------+-----------+---------+---------+---------+
|1      | 2017-01-01| 10      | 1       | 9       |
|1      | 2017-01-03| 12      | 10      | 2       |
|2      | 2017-01-02| 1       | 0       | 0       |
|4      | 2017-01-03| 1       | 0       | 1       |
|1      | 2017-01-04| 3       | 1       | 2       |
+-------+-----------+---------+---------+---------+

对于表1中的每条记录——特别是对于每个gmr\u id和played\u dt,我尝试比较played\u dt的后两天和前五天的移动总和(如果为真,则为1,否则为0),并基于gmr\u id和played\u dt加入表1:即。
从比赛开始到比赛结束后两天gmr id的完成次数、攻击次数和死亡次数之和,即完成次数、攻击次数等。 BETWEEN played_dt AND DATE_ADD(played_dt, 2) 从比赛开始前五天到比赛开始前一天,gmr id的完成次数、攻击次数和死亡次数的总和,即总和(完成次数)、总和(攻击次数)等。 BETWEEN DATE_SUB(played_dt, 5) AND DATE_SUB(played_dt, 1) 比较并设置标志,即获取一行,如下所示:gmr\u id,played\u dt,future\u finish\u gt\u pass\u finish(如果played\u dt之后的完成时间大于else 0之前的天数,则为1),future\u attacks\u gt\u pass\u attacks(如果played\u dt之后的攻击时间大于else 0之前的天数,则为1),等等。
加入行: gmr_id, played_dt, finish_f, attack_f 表1行: gmr_id, played_dt, attr1, attr2, attr3gmr_id, played_dt 我尝试过编写相关子查询,但没有效果:

SELECT
  t1.gmr_id,
  t1.played_dt,    
  (SELECT
    t2.gmr_id,
    SUM(t2.finish) `future_finish`,
    SUM(t2.attacks) `future_attacks`
  FROM TABLE_2 t2 WHERE t2.played_dt BETWEEN played_dt AND DATE_ADD(played_dt, 2)
  GROUP BY t2.gmr_id), 
  (SELECT
    t2.gmr_id,
    SUM(t2.finish) `past_finish`,
    SUM(t2.attacks) `past_attacks`
  FROM TABLE_2 t2 WHERE t2.played_dt BETWEEN DATE_SUB(played_dt, 5) AND DATE_SUB(played_dt, 1)
  GROUP BY t2.gmr_id),

CASE WHEN future_finish > past_finish THEN 1 ELSE 0 END `finish_f`,

CASE WHEN future_attacks > past_attacks THEN 1 ELSE 0 END `attack_f`

FROM
  TABLE_1 t1;

预期输出如下所示:

+---------+
| TABLE_1 |
+-------+-----------+-------+-------+-------+-----------+-----------+
|gmr_id | played_dt | attr1 | attr2 | attr3 | finish_f  | attack_f  |
+-------+-----------+-------+-------+-------+-----------+-----------+
|1      | 2017-01-01| 1     | 2     | txt   |     1     |     0     |
|1      | 2017-01-02| 3     | 2     | txt   |     1     |     1     |
|2      | 2017-01-02| 1     | 2     | txt   |     0     |     1     |
+-------+-----------+-------+-------+-------+-----------+-----------+

我正在使用hive1.2(或者可以使用spark1.5)来实现这一点,但到目前为止我还无法做到这一点。实现这一目标的最佳方法是什么?我将非常感谢你的帮助。

暂无答案!

目前还没有任何答案,快来回答吧!

相关问题