hive自联接

lnlaulya  于 2021-06-26  发布在  Hive
关注(0)|答案(1)|浏览(583)

我在hive中有一个表,其中的数据来自sap系统。此表包含以下列和数据:

  1. +======================================================================+
  2. |document_number | year | cost_centre | vendor_account_number | amount |
  3. +----------------------------------------------------------------------+
  4. | 1 | 2016 | XZ10 | | 123.5 |
  5. +----------------------------------------------------------------------+
  6. | 1 | 2016 | XZ10 | 1234567890 | 25.96 |
  7. +----------------------------------------------------------------------+
  8. | 1 | 2016 | XZ10 | | 586 |
  9. +----------------------------------------------------------------------+

如上所示,值 vendor_account_number 列只存在于一行中,我想将它带到所有其他行中。
预期产出如下:

  1. +======================================================================+
  2. |document_number | year | cost_centre | vendor_account_number | amount |
  3. +----------------------------------------------------------------------+
  4. | 1 | 2016 | XZ10 | 1234567890 | 123.5 |
  5. +----------------------------------------------------------------------+
  6. | 1 | 2016 | XZ10 | 1234567890 | 25.96 |
  7. +----------------------------------------------------------------------+
  8. | 1 | 2016 | XZ10 | 1234567890 | 586 |
  9. +----------------------------------------------------------------------+

为了实现这一点,我在hive中编写了以下cte

  1. with non_blank_account_no as(
  2. select document_number, vendor_account_number
  3. from my_table
  4. where vendor_account_number != ''
  5. )

然后进行自左外连接,如下所示:

  1. select
  2. a.document_number, a.year,
  3. a.cost_centre, a.amount,
  4. b.vendor_account_number
  5. from my_table a
  6. left outer join non_blank_account_no b on a.document_number = b.document_number
  7. where a.document_number = ' '

但是我得到了如下所示的重复输出

  1. +======================================================================+
  2. |document_number | year | cost_centre | vendor_account_number | amount |
  3. +----------------------------------------------------------------------+
  4. | 1 | 2016 | XZ10 | 1234567890 | 123.5 |
  5. +----------------------------------------------------------------------+
  6. | 1 | 2016 | XZ10 | 1234567890 | 25.96 |
  7. +----------------------------------------------------------------------+
  8. | 1 | 2016 | XZ10 | 1234567890 | 586 |
  9. +----------------------------------------------------------------------+
  10. | 1 | 2016 | XZ10 | 1234567890 | 123.5 |
  11. +----------------------------------------------------------------------+
  12. | 1 | 2016 | XZ10 | 1234567890 | 25.96 |
  13. +----------------------------------------------------------------------+
  14. | 1 | 2016 | XZ10 | 1234567890 | 586 |
  15. +----------------------------------------------------------------------+

有人能帮我理解我的Hive查询有什么问题吗?

eoxn13cs

eoxn13cs1#

在许多用例中,自连接可以被windows函数代替

  1. select document_number
  2. ,year
  3. ,cost_center
  4. ,max (case when vendor_account_number <> '' then vendor_account_number end) over
  5. (
  6. partition by document_number
  7. ) as vendor_account_number
  8. ,amount
  9. from my_table

相关问题