我有一个大约10亿个游戏交互的表格,每个记录描述了一个玩家在某一天的属性(玩家id,玩游戏的日期,属性1,[属性2,…])
+---------+
| TABLE_1 |
+-------+-----------+-------+-------+-------+
|gmr_id | played_dt | attr1 | attr2 | attr3 |
+-------+-----------+-------+-------+-------+
|1 | 2017-01-01| 1 | 2 | txt |
|1 | 2017-01-02| 3 | 2 | txt |
|2 | 2017-01-02| 1 | 2 | txt |
+-------+-----------+-------+-------+-------+
我还有一张有数百万条记录的表格,玩家的移动记录在每一个游戏中:
+---------+
| TABLE_2 |
+-------+-----------+---------+---------+---------+
|gmr_id | played_dt | finish | attacks | deaths |
+-------+-----------+---------+---------+---------+
|1 | 2017-01-01| 10 | 1 | 9 |
|1 | 2017-01-03| 12 | 10 | 2 |
|2 | 2017-01-02| 1 | 0 | 0 |
|4 | 2017-01-03| 1 | 0 | 1 |
|1 | 2017-01-04| 3 | 1 | 2 |
+-------+-----------+---------+---------+---------+
对于表1中的每条记录——特别是对于每个gmr\u id和played\u dt,我尝试比较played\u dt的后两天和前五天的移动总和(如果为真,则为1,否则为0),并基于gmr\u id和played\u dt加入表1:即。
从比赛开始到比赛结束后两天gmr id的完成次数、攻击次数和死亡次数之和,即完成次数、攻击次数等。 BETWEEN played_dt AND DATE_ADD(played_dt, 2)
从比赛开始前五天到比赛开始前一天,gmr id的完成次数、攻击次数和死亡次数的总和,即总和(完成次数)、总和(攻击次数)等。 BETWEEN DATE_SUB(played_dt, 5) AND DATE_SUB(played_dt, 1)
比较并设置标志,即获取一行,如下所示:gmr\u id,played\u dt,future\u finish\u gt\u pass\u finish(如果played\u dt之后的完成时间大于else 0之前的天数,则为1),future\u attacks\u gt\u pass\u attacks(如果played\u dt之后的攻击时间大于else 0之前的天数,则为1),等等。
加入行: gmr_id, played_dt, finish_f, attack_f
表1行: gmr_id, played_dt, attr1, attr2, attr3
在 gmr_id, played_dt
我尝试过编写相关子查询,但没有效果:
SELECT
t1.gmr_id,
t1.played_dt,
(SELECT
t2.gmr_id,
SUM(t2.finish) `future_finish`,
SUM(t2.attacks) `future_attacks`
FROM TABLE_2 t2 WHERE t2.played_dt BETWEEN played_dt AND DATE_ADD(played_dt, 2)
GROUP BY t2.gmr_id),
(SELECT
t2.gmr_id,
SUM(t2.finish) `past_finish`,
SUM(t2.attacks) `past_attacks`
FROM TABLE_2 t2 WHERE t2.played_dt BETWEEN DATE_SUB(played_dt, 5) AND DATE_SUB(played_dt, 1)
GROUP BY t2.gmr_id),
CASE WHEN future_finish > past_finish THEN 1 ELSE 0 END `finish_f`,
CASE WHEN future_attacks > past_attacks THEN 1 ELSE 0 END `attack_f`
FROM
TABLE_1 t1;
预期输出如下所示:
+---------+
| TABLE_1 |
+-------+-----------+-------+-------+-------+-----------+-----------+
|gmr_id | played_dt | attr1 | attr2 | attr3 | finish_f | attack_f |
+-------+-----------+-------+-------+-------+-----------+-----------+
|1 | 2017-01-01| 1 | 2 | txt | 1 | 0 |
|1 | 2017-01-02| 3 | 2 | txt | 1 | 1 |
|2 | 2017-01-02| 1 | 2 | txt | 0 | 1 |
+-------+-----------+-------+-------+-------+-----------+-----------+
我正在使用hive1.2(或者可以使用spark1.5)来实现这一点,但到目前为止我还无法做到这一点。实现这一目标的最佳方法是什么?我将非常感谢你的帮助。
暂无答案!
目前还没有任何答案,快来回答吧!