配置单元完全外部联接为同一联接键返回多行

oxosxuxt  于 2021-06-27  发布在  Hive
关注(0)|答案(2)|浏览(379)

我在同一列的4个表上执行完全外部联接。我只想为join列中的每个不同值生成一行。
输入为:

employee1
+---------------------+-----------------+--+
| employee1.personid  | employee1.name  |
+---------------------+-----------------+--+
| 111                 | aaa             |
| 222                 | bbb             |   
| 333                 | ccc             | 
+---------------------+-----------------+--+
employee2
+---------------------+----------------+--+
| employee2.personid  | employee2.sal  |
+---------------------+----------------+--+
| 111                 | 2              |
| 200                 | 3              |
+---------------------+----------------+--+
employee3
+---------------------+------------------+--+
| employee3.personid  | employee3.place  |
+---------------------+------------------+--+
| 111                 | bbsr             |
| 300                 | atl              |
| 200                 | ny               |
+---------------------+------------------+--+
employee4
+---------------------+---------------+--+
| employee4.personid  | employee4.dt  |
+---------------------+---------------+--+
| 111                 | 2019-02-21    |
| 300                 | 2019-03-18    |
| 400                 | 2019-03-18    |
+---------------------+---------------+--+

预期结果每个人名有一条记录,所以总共应该有6条记录(111222333200300400),比如:

+-----------+---------+--------+----------+-------------+--+
| personid  | f.name  | u.sal  | v.place  |   v_in.dt   |
+-----------+---------+--------+----------+-------------+--+
| 111       | aaa     | 2      | bbsr     | 2019-02-21  |
| 200       | NULL    | 3      | ny       | NULL        |
| 222       | bbb     | NULL   | NULL     | NULL        |
| 300       | NULL    | NULL   | atl      | 2019-03-18  |
| 333       | ccc     | NULL   | NULL     | NULL        |
| 400       | NULL    | NULL   | NULL     | 2019-03-18  |
+-----------+---------+--------+----------+-------------+--+

我得到的结果是:

+-----------+---------+--------+----------+-------------+--+
| personid  | f.name  | u.sal  | v.place  |   v_in.dt   |
+-----------+---------+--------+----------+-------------+--+
| 111       | aaa     | 2      | bbsr     | 2019-02-21  |
| 200       | NULL    | 3      | NULL     | NULL        |
| 200       | NULL    | NULL   | ny       | NULL        |
| 222       | bbb     | NULL   | NULL     | NULL        |
| 300       | NULL    | NULL   | atl      | NULL        |
| 300       | NULL    | NULL   | NULL     | 2019-03-18  |
| 333       | ccc     | NULL   | NULL     | NULL        |
| 400       | NULL    | NULL   | NULL     | 2019-03-18  |
+-----------+---------+--------+----------+-------------+--+

使用的查询:

select coalesce(f.personid, u.personid, v.personid, v_in.personid) as personid,f.name,u.sal,v.place,v_in.dt
from employee1 f FULL OUTER JOIN employee2 u on f.personid=u.personid
FULL OUTER JOIN employee3 v on f.personid=v.personid
FULL OUTER JOIN employee4 v_in on f.personid=v_in.personid;

请建议如何产生预期的结果。

shstlldc

shstlldc1#

full outer join 是很棘手的,因为你必须采取前 NULL 这是考虑到的。但你可以做到:

select coalesce(f.personid, u.personid, v.personid, v_in.personid) as personid,f.name,u.sal,v.place,v_in.dt
from employee1 f FULL OUTER JOIN
     employee2 u
     on f.personid = u.personid FULL OUTER JOIN
     employee3 v
     on v.personid in (f.person_id, u.person_id) FULL OUTER JOIN
     employee4 v_in
     on v_in.personid in (f.person_id, u.person_id, v.person_id);

在支持 using 为了 join s(代替 on )这更简单。我不认为Hive支持 using 不过。

7y4bm7vi

7y4bm7vi2#

完全联接返回所有联接的行+左表中所有未联接的行+右表中所有未联接的行。既然你要加入 employee2 , employee3 , employee4 相同的 employee1 不包含 personid=200 ,从所有四个表返回所有未联接的行。
我建议合并所有四个表,为缺少的字段提供空值+聚合group by personid :

select personid, max(name) name, max(sal) sal, max(place) place, max(dt) dt 
from 
(
select  personid, name, NULL sal, NULL place, NULL dt from employee1  e1
UNION ALL
select  personid, NULL name, sal, NULL place, NULL dt from employee2  e2
UNION ALL
select  personid, NULL name, NULL sal, place, NULL dt from employee3  e3
UNION ALL
select  personid, NULL name, NULL sal, NULL place, dt from employee4  e4
)s
group by personid;

这将比连接性能更好。

相关问题