hive 不带重复项的插入未按预期工作

knsnq2tg  于 2023-10-18  发布在  Hive
关注(0)|答案(1)|浏览(181)

我试图在一个特定字段中没有冲突的地方向表中添加新行,但是每次我尝试添加具有重复数据的行时,这些行都会被添加,表就会增长。不应该是这样。我使用一个名为的字段:event_id,作为我想确保不会再次出现的键,但每次,我都会添加行。
insert语句如下所示:

  1. insert into existing_table
  2. select distinct event_id
  3. , event
  4. , cs.pid as cs_pid
  5. , context_page_url
  6. , try_cast(concat(cs.year, "-", cs.month, "-", cs.day) as string) as event_date
  7. , split_part(cs.context_page_url, '|', 0) as stem
  8. from target_table cs where year=2023

所以如果event_id出现在existing_table中,那么忽略它。
table大小没有增长。

kjthegm6

kjthegm61#

您必须在existing_table中添加not in ()

  1. insert into existing_table
  2. select distinct event_id
  3. , event
  4. , cs.pid as cs_pid
  5. , context_page_url
  6. , try_cast(concat(cs.year, "-", cs.month, "-", cs.day) as string) as event_date
  7. , split_part(cs.context_page_url, '|', 0) as stem
  8. from target_table cs where year=2023
  9. and event_id not in (select event_id from existing_table) -- ensures event ID doesnt exist.

相关问题