hive multiple insert在distinct select语句中出错

66bbxpm5  于 2021-06-03  发布在  Hadoop
关注(0)|答案(1)|浏览(415)

我从“hadoop权威指南”中阅读了以下代码:

  1. SELECT a.ad_id, a.campaign_id, a.account_id, b.user_id
  2. FROM dim_ads a JOIN impression_logs b ON (b.ad_id = a.ad_id)
  3. WHERE b.dateid = '2008-12-01') x
  4. INSERT OVERWRITE DIRECTORY 'results_gby_adid'
  5. SELECT x.ad_id, count(1), count(DISTINCT x.user_id) GROUP BY x.ad_id
  6. INSERT OVERWRITE DIRECTORY 'results_gby_campaignid'
  7. SELECT x.campaign_id, count(1), count(DISTINCT x.user_id) GROUP BY x.campaign_id
  8. INSERT OVERWRITE DIRECTORY 'results_gby_accountid'
  9. SELECT x.account_id, count(1), count(DISTINCT x.user_id) GROUP BY x.account_id;

但是作为我的测试,使用几个不同的方法不能得到正确的结果。
我的hiveql如下:

  1. CREATE TABLE IF NOT EXISTS a (logindate int, id int);

然后将本地文件加载到此表。。。

  1. CREATE TABLE IF NOT EXISTS user (id INT) PARTITIONED BY (logindate INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

如果单独插入表格:

  1. INSERT OVERWRITE TABLE user PARTITION(logindate=20130120) SELECT DISTINCT(id) FROM a WHERE logindate=20130120;
  2. INSERT OVERWRITE TABLE user PARTITION(logindate=20130121) SELECT DISTINCT(id) FROM a WHERE logindate=20130121;

结果正确;
但如果选择下一个多重插入hql:

  1. FROM a
  2. INSERT OVERWRITE TABLE user PARTITION(logindate=20130120) SELECT DISTINCT(id) WHERE logindate=20130120
  3. INSERT OVERWRITE TABLE user PARTITION(logindate=20130121) SELECT DISTINCT(id) WHERE logindate=20130121;
  4. the results are not correct, both partitions have the same number of records, seems like select from DISTINCT(id) WHERE logindate=20130120 OR logindate=20130121

那么,这是一个错误还是我写了一些错误的语法?

o0lyfsai

o0lyfsai1#

distinct在代码中有一点奇怪的历史记录,作为分组依据的别名。如果有一个bug,那么您所使用的hive的版本将是非常重要的,因为bug在每个版本中都会得到解决。
这可能有用:

  1. FROM a
  2. INSERT OVERWRITE TABLE user PARTITION(logindate=20130120) SELECT id WHERE logindate=20130120 GROUP BY id
  3. INSERT OVERWRITE TABLE user PARTITION(logindate=20130121) SELECT id WHERE logindate=20130121 GROUP BY id;

如果这不起作用,这肯定会起作用…即使这不是你试图使用的方法。。。

  1. FROM (select distinct id, logindate from a where logindate in ('20130120','20130121')) subq_a
  2. INSERT OVERWRITE TABLE user PARTITION(logindate=20130120) SELECT id WHERE logindate=20130120
  3. INSERT OVERWRITE TABLE user PARTITION(logindate=20130120) SELECT id WHERE logindate=20130121;

相关问题