How to transpose/pivot data in Hive?

eh57zj3b · published 2021-06-03 in Hadoop
Follow (0) | Answers (9) | Views (513)

I know there is no direct way to transpose data in Hive. I asked a question, "Is there a way to transpose data in Hive?", but since there is no final answer there, I could not get all the way.
This is my table:

| ID | Code | Proc1 | Proc2 |
|----|------|-------|-------|
| 1  | A    | p     | e     |
| 2  | B    | q     | f     |
| 3  | B    | p     | f     |
| 3  | B    | q     | h     |
| 3  | B    | r     | j     |
| 3  | C    | t     | k     |

Here Proc1 can have any number of values. ID, Code, and Proc1 together form the unique key for this table. I want to pivot/transpose this table so that each unique value in Proc1 becomes a new column, and the corresponding value from Proc2 becomes the value in that column for the corresponding row. In essence, I'm trying to get something like this:

| ID | Code | p | q | r | t |
|----|------|---|---|---|---|
| 1  | A    | e |   |   |   |
| 2  | B    |   | f |   |   |
| 3  | B    | f | h | j |   |
| 3  | C    |   |   |   | k |

In the newly transposed table, ID and Code form the unique primary key. From the ticket I mentioned above, I managed to get this far using the to_map UDAF (disclaimer - this may not be a step in the right direction, but do mention it here if it is):

| ID | Code | Map_Aggregation  |
|----|------|------------------|
| 1  | A    | {p:e}            |
| 2  | B    | {q:f}            |
| 3  | B    | {p:f, q:h, r:j}  |
| 3  | C    | {t:k}            |

But I don't know how to get from this step to the pivot/transposed table I want. Any help on how to proceed would be great! Thanks.
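For reference, the gap between the map-aggregation step and the final table can also be closed with plain conditional aggregation: emit Proc2 only when Proc1 matches the target column, then take MAX per (ID, Code) group. A runnable sketch of that pattern, using Python's sqlite3 in place of Hive since the SQL involved here is standard (table and column names mirror the example above; the same MAX(CASE WHEN ...) shape ports to HiveQL):

```python
import sqlite3

# Build the sample table in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INT, code TEXT, proc1 TEXT, proc2 TEXT)")
rows = [(1, 'A', 'p', 'e'), (2, 'B', 'q', 'f'), (3, 'B', 'p', 'f'),
        (3, 'B', 'q', 'h'), (3, 'B', 'r', 'j'), (3, 'C', 't', 'k')]
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", rows)

# Conditional aggregation: each CASE picks out one proc1 value,
# MAX collapses the group to the single matching proc2 (or NULL).
pivot = conn.execute("""
    SELECT id, code,
           MAX(CASE WHEN proc1 = 'p' THEN proc2 END) AS p,
           MAX(CASE WHEN proc1 = 'q' THEN proc2 END) AS q,
           MAX(CASE WHEN proc1 = 'r' THEN proc2 END) AS r,
           MAX(CASE WHEN proc1 = 't' THEN proc2 END) AS t
    FROM t
    GROUP BY id, code
    ORDER BY id, code
""").fetchall()

for row in pivot:
    print(row)
```

This yields one row per (ID, Code) with the empty cells as NULL, matching the desired layout.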

t1rydlwq1#

For unpivot, we can simply use logic like the one below.

SELECT Cost.Code, Cost.Product, Cost.Size
, Cost.State_code, Cost.Promo_date, Cost.Cost, Sales.Price
FROM
(Select Code, Product, Size, State_code, Promo_date, Price as Cost
FROM Product
Where Description = 'Cost') Cost
JOIN
(Select Code, Product, Size, State_code, Promo_date, Price as Price
FROM Product
Where Description = 'Sales') Sales
on (Cost.Code = Sales.Code
and Cost.Promo_date = Sales.Promo_date);

2exbekwf2#

Here is how I solved this problem using Hive's built-in `map` function:

select
    b.id,
    b.code,
    concat_ws('',b.p) as p,
    concat_ws('',b.q) as q,
    concat_ws('',b.r) as r,
    concat_ws('',b.t) as t
from 
    (
        select id, code,
        collect_list(a.group_map['p']) as p,
        collect_list(a.group_map['q']) as q,
        collect_list(a.group_map['r']) as r,
        collect_list(a.group_map['t']) as t
        from (
            select
              id,
              code,
              map(proc1,proc2) as group_map 
            from 
              test_sample
        ) a
        group by
            a.id,
            a.code
    ) b;

`concat_ws` and `map` are Hive UDFs, and `collect_list` is a Hive UDAF.

toiithl63#

You can achieve this using CASE statements with some help from `collect_set`. You can check the details at http://www.analyticshut.com/big-data/hive/pivot-rows-to-columns-in-hive/
Here is the query for reference:

SELECT resource_id,
CASE WHEN COLLECT_SET(quarter_1)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_1)[0] END AS quarter_1_spends,
CASE WHEN COLLECT_SET(quarter_2)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_2)[0] END AS quarter_2_spends,
CASE WHEN COLLECT_SET(quarter_3)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_3)[0] END AS quarter_3_spends,
CASE WHEN COLLECT_SET(quarter_4)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_4)[0] END AS quarter_4_spends
FROM (
SELECT resource_id,
CASE WHEN quarter='Q1' THEN amount END AS quarter_1,
CASE WHEN quarter='Q2' THEN amount END AS quarter_2,
CASE WHEN quarter='Q3' THEN amount END AS quarter_3,
CASE WHEN quarter='Q4' THEN amount END AS quarter_4
FROM billing_info)tbl1
GROUP BY resource_id;
ldxq2e6h4#

Another solution.
Pivot using Hivemall's `to_map` function.

SELECT
  uid,
  kv['c1'] AS c1,
  kv['c2'] AS c2,
  kv['c3'] AS c3
FROM (
  SELECT uid, to_map(key, value) kv
  FROM vtable
  GROUP BY uid
) t
Result:

uid  c1  c2  c3
101  11  12  13
102  21  22  23

Unpivot:

SELECT t1.uid, t2.key, t2.value
FROM htable t1
LATERAL VIEW explode (map(
'c1', c1,
'c2', c2,
'c3', c3
)) t2 as key, value
Result:

uid  key  value
101  c1   11
101  c2   12
101  c3   13
102  c1   21
102  c2   22
102  c3   23

nkoocmlb5#

Below is also a pivot:

SELECT TM1_Code, Product, Size, State_code, Description
  , Promo_date
  , Price
FROM (
SELECT TM1_Code, Product, Size, State_code, Description
   , MAP('FY2018Jan', FY2018Jan, 'FY2018Feb', FY2018Feb, 'FY2018Mar', FY2018Mar, 'FY2018Apr', FY2018Apr
        ,'FY2018May', FY2018May, 'FY2018Jun', FY2018Jun, 'FY2018Jul', FY2018Jul, 'FY2018Aug', FY2018Aug
        ,'FY2018Sep', FY2018Sep, 'FY2018Oct', FY2018Oct, 'FY2018Nov', FY2018Nov, 'FY2018Dec', FY2018Dec) AS tmp_column
FROM CS_ME_Spirits_30012018) TmpTbl
LATERAL VIEW EXPLODE(tmp_column) exptbl AS Promo_date, Price;

qacovj5a6#

I haven't written this code, but I think you can use some of the UDFs provided by Klout's Brickhouse: https://github.com/klout/brickhouse
Specifically, you could use `collect` as mentioned here: http://brickhouseconfessions.wordpress.com/2013/03/05/use-collect-to-avoid-the-self-join/
Then explode the arrays (they will have different lengths) using the method detailed in this post: http://brickhouseconfessions.wordpress.com/2013/03/07/exploding-multiple-arrays-at-the-same-time-with-numeric_ra

pgky5nke7#

If the values are numeric, you can use the Hive query below.
Sample data:

ID  cust_freq   Var1    Var2    frequency
220444  1   16443   87128   72.10140547
312554  6   984 7339    0.342452643
220444  3   6201    87128   9.258396518
220444  6   47779   87128   2.831972441
312554  1   6055    7339    82.15209213
312554  3   12868   7339    4.478333954
220444  2   6705    87128   15.80822558
312554  2   37432   7339    13.02712127

select id, sum(a.group_map[1]) as One, sum(a.group_map[2]) as Two, sum(a.group_map[3]) as Three, sum(a.group_map[6]) as Six from
( select id, 
 map(cust_freq,frequency) as group_map 
 from table
 ) a group by a.id having id in 
( '220444',
'312554');

ID  one two three   six
220444  72.10140547 15.80822558 9.258396518 2.831972441
312554  82.15209213 13.02712127 4.478333954 0.342452643

In the above example I haven't used any custom UDF; it only uses built-in Hive functions.
Note: for a string value in the key, write the value as sum(a.group_map['1']) as One.
332nm8kg8#

I created a dummy table named hive using the query below: create table hive (id Int, Code String, Proc1 String, Proc2 String); and loaded all the data into the table:

insert into hive values('1','A','p','e');
insert into hive values('2','B','q','f'); 
insert into hive values('3','B','p','f');
insert into hive values('3','B','q','h');
insert into hive values('3','B','r','j');
insert into hive values('3','C','t','k');

Now use the query below to achieve the output:

select id,code,
     case when collect_list(p)[0] is null then '' else collect_list(p)[0] end as p,
     case when collect_list(q)[0] is null then '' else collect_list(q)[0] end as q,
     case when collect_list(r)[0] is null then '' else collect_list(r)[0] end as r,
     case when collect_list(t)[0] is null then '' else collect_list(t)[0] end as t
     from(
            select id, code,
            case when proc1 ='p' then proc2 end as p,
            case when proc1 ='q' then proc2 end as q,
            case when proc1 ='r' then proc2 end as r,
            case when proc1 ='t' then proc2 end as t
            from hive
        ) dummy group by id,code;
ni65a41a9#

Here is the solution I ended up using:

add jar brickhouse-0.7.0-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION collect AS 'brickhouse.udf.collect.CollectUDAF';

select 
    id, 
    code,
    group_map['p'] as p,
    group_map['q'] as q,
    group_map['r'] as r,
    group_map['t'] as t
    from ( select
        id, code,
        collect(proc1,proc2) as group_map 
        from test_sample 
        group by id, code
    ) gm;

The `collect` UDAF used here comes from the Brickhouse repo: https://github.com/klout/brickhouse
