SQL: Merging multiple arrays into a single map

eanckbw9 posted on 2021-06-24 in Hive

I have some data (a sample from the full table) that looks like this:

| prov_id | hotel_id | m_id | apis_xml | company_id | yyyy_mm_dd |
|---------|----------|------|----------|------------|------------|
| 945     | 78888    | 3910 | [5]      | 998        | 2020-05-20 |
| 1475    | 78888    | 6676 | [1,2,4]  | 37         | 2020-05-20 |
| 1475    | 78888    | 6670 | [1,2,4]  | 37         | 2020-05-20 |
| 945     | 78888    | 2617 | [5]      | 998        | 2020-05-20 |

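I want to find the lowest apis_xml value per hotel and set the associated prov_id as the "primary_prov". I can do this by exploding and ordering the arrays and then taking the first row. I have a query for this: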

SELECT
    yyyy_mm_dd,
    hotel_id,
    prov_id as primary_prov
FROM(
    SELECT
        yyyy_mm_dd,
        hotel_id,
        prov_id,
        apis_xml,
        ROW_NUMBER() OVER(PARTITION BY yyyy_mm_dd, hotel_id ORDER BY api) rn
    FROM(
        SELECT
            t.yyyy_mm_dd,
            t.hotel_id,
            t.prov_id,
            t.apis_xml,
            CAST(e.apis_xml AS INT) AS api
        FROM
            my_table t
        LATERAL VIEW EXPLODE(apis_xml) e AS apis_xml
    )s
)s
WHERE rn=1

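The query above works for getting the primary prov, that is, the prov_id that maps to the lowest apis_xml value. However, I would like an extra column that contains the other apis_xml values for reference. A map may not be the best fit here, but it is what came to mind. The key would be the prov_id and the value would be that provider's apis_xml array. The output would look like this, since I need to keep one row per hotel_id: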

| hotel_id | primary_prov | detailed_prov            | yyyy_mm_dd |
|----------|--------------|--------------------------|------------|
| 78888    | 1475         | {1475: [1,2,4], 945: [5]} | 2020-05-20 |

iyfjxgzm1#

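You can build the map using the Brickhouse CollectUDAF. Read the Brickhouse instructions on how to build the JAR, then add the JAR, create the function, and use it: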

add jar /path/to/jar/brickhouse-0.7.1.jar;
create temporary function collect as 'brickhouse.udf.collect.CollectUDAF';

SELECT yyyy_mm_dd,
       hotel_id,
       primary_prov,
       collect(prov_id, prov_id_api ) as detailed_prov
  FROM
(
SELECT
    yyyy_mm_dd,
    hotel_id,
    max(case when rn=1 then prov_id else null end) over(partition by yyyy_mm_dd, hotel_id) as primary_prov, 
    collect_set(api) over (partition by yyyy_mm_dd, hotel_id, prov_id) prov_id_api, --re-assemble array to include all elements from multiple initial arrays if there are different arrays per prov_id
    prov_id
FROM(
    SELECT
        yyyy_mm_dd,
        hotel_id,
        prov_id,
        apis_xml,
        api,
        ROW_NUMBER() OVER(PARTITION BY yyyy_mm_dd, hotel_id ORDER BY api) rn
    FROM(
        SELECT
            t.yyyy_mm_dd,
            t.hotel_id,
            t.prov_id,
            t.apis_xml,
            CAST(e.apis_xml_element AS INT) AS api
        FROM
            my_table t
        LATERAL VIEW EXPLODE(apis_xml) e AS apis_xml_element
    )s
)s
)s
group by yyyy_mm_dd,
         hotel_id,
         primary_prov
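
If adding the Brickhouse JAR is not an option, something along these lines might work with built-in Hive functions only. This is an untested sketch, not a drop-in replacement: detailed_prov comes out as a map<string,string> (for example key "1475" with value "1,2,4") rather than a map of arrays. The CTE only stands in for the real my_table using the sample rows from the question, and the names per_prov, ranked, min_api and apis are made up for illustration.

```sql
-- Stand-in for the real table, built from the sample rows in the question.
-- Remove this CTE to run the query against the actual my_table.
WITH my_table AS (
    SELECT 945 AS prov_id, 78888 AS hotel_id, 3910 AS m_id,
           array(5) AS apis_xml, 998 AS company_id, '2020-05-20' AS yyyy_mm_dd
    UNION ALL
    SELECT 1475, 78888, 6676, array(1, 2, 4), 37, '2020-05-20'
    UNION ALL
    SELECT 1475, 78888, 6670, array(1, 2, 4), 37, '2020-05-20'
    UNION ALL
    SELECT 945, 78888, 2617, array(5), 998, '2020-05-20'
)
SELECT yyyy_mm_dd,
       hotel_id,
       -- prov_id whose lowest api value is the lowest for the hotel
       MAX(CASE WHEN rn = 1 THEN prov_id END) AS primary_prov,
       -- build "prov:apis;prov:apis" and split it back into a map<string,string>
       STR_TO_MAP(
           CONCAT_WS(';', COLLECT_LIST(CONCAT(CAST(prov_id AS STRING), ':', apis))),
           ';', ':'
       ) AS detailed_prov
FROM (
    SELECT yyyy_mm_dd,
           hotel_id,
           prov_id,
           apis,
           ROW_NUMBER() OVER (PARTITION BY yyyy_mm_dd, hotel_id ORDER BY min_api) AS rn
    FROM (
        -- one row per provider: its lowest api value and its distinct values as text
        SELECT t.yyyy_mm_dd,
               t.hotel_id,
               t.prov_id,
               MIN(CAST(e.api AS INT)) AS min_api,
               CONCAT_WS(',', SORT_ARRAY(COLLECT_SET(CAST(e.api AS STRING)))) AS apis
        FROM my_table t
        LATERAL VIEW EXPLODE(t.apis_xml) e AS api
        GROUP BY t.yyyy_mm_dd, t.hotel_id, t.prov_id
    ) per_prov
) ranked
GROUP BY yyyy_mm_dd, hotel_id;
```

For the sample rows this should give primary_prov = 1475 for hotel 78888, with map keys "1475" and "945". Note that SORT_ARRAY sorts the values as strings here, so multi-digit values would be ordered lexicographically ("10" before "2"). If you need real arrays as map values, the Brickhouse collect approach above is the better fit.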
