比较两个配置单元表以查找没有任何唯一列/时间戳的更新/插入/删除的记录，并将其附加到hadoop中的基表

5lwkijsr 于 2021-05-29 发布在 Hadoop

关注(0)|答案(2)|浏览(320)

基本表（第1天从源加载）


**Id    Name    City      Country**

 7682   Stuart  Frankfurt   Germany
 8723   Micke   Paris       France
 2355   Niki    New york    USA
 2097   Deny    Italy       Rome

新表（第2天从源加载）


**Id    Name    City      Country**

 7682   Stuart  *Darmstadt*  Germany
 8723   Micke   Paris       France
 2355   Niki    New york    USA
 *9057  Bony    Prague      Prague*

比较上述两个表时，可以看到以下3个变化。
记录id 7682的城市名称在第02天更改为达姆施塔特
记录id 2097在第02天加载时被删除，并在第01天加载时出现
在第02天加载时插入id为9057的新记录
以上3个更改都需要捕获并附加到基表中
以下3条记录应通过比较获得

7682   Stuart  Darmstadt   Germany
2097   Deny    Italy       Rome
9057   Bony    Prague      Prague

追加02天更改后的基表输出


**Id    Name    City      Country**

 7682   Stuart  Frankfurt   Germany
 8723   Micke   Paris       France
 2355   Niki    New york    USA
 2097   Deny    Italy       Rome

* 7682   Stuart  Darmstadt   Germany*
* 2097   Deny    Italy       Rome*
* 9057   Bony    Prague      Prague*

我可以使用sql联接获取插入和删除的记录，但无法获取更新的记录。为了获取更新的记录，我使用了将文件本地复制到linux并进行比较的方法，但这种方法不适用于大量的数据。有人能分享一下你处理这种情况的经验吗？

hadoop Hive pyspark apache-spark-sql

来源：https://stackoverflow.com/questions/43802507/comparing-2-hive-tables-for-finding-updated-inserted-deleted-records-without-any

2条答案

按热度按时间

gorkyyrv1#

select      inline
            (
                array
                (
                    case 
                        when n.id is null then struct(b.*)
                        else struct (n.*)
                    end
                )
            )

from                    base_table  as b
            full join   new_table   as n
            on          n.id = b.id

where       b.id is null 
        or  n.id is null
        or  struct(b.*) not in (struct(n.*))

+------+--------+-----------+---------+
| col1 |  col2  |   col3    |  col4   |
+------+--------+-----------+---------+
| 2097 | Deny   | Italy     | Rome    |
| 7682 | Stuart | Darmstadt | Germany |
| 9057 | Bony   | Prague    | Prague  |
+------+--------+-----------+---------+

赞(0）回复(0）举报 2021-05-29

wwtsj6pe2#

select      inline(array(rec))

from       (select      max(struct(tab,rec)).col2 as rec

            from        (           select 1 as tab,id,struct(*) as rec from base_table
                        union all   select 2 as tab,id,struct(*) as rec from new_table
                        ) t

            group by    id

            having      count(*) = 1
                    or  min(rec) not in (max(rec))
            ) t

+------+--------+-----------+---------+
| col1 |  col2  |   col3    |  col4   |
+------+--------+-----------+---------+
| 2097 | Deny   | Italy     | Rome    |
| 7682 | Stuart | Darmstadt | Germany |
| 9057 | Bony   | Prague    | Prague  |
+------+--------+-----------+---------+

赞(0）回复(0）举报 2021-05-29

我来回答

比较两个配置单元表以查找没有任何唯一列/时间戳的更新/插入/删除的记录，并将其附加到hadoop中的基表

2条答案

相关问题

热门标签

最新问答