hive在create表中返回太多的行作为select

1l5u6lss  于 2021-06-03  发布在  Hadoop
关注(0)|答案(0)|浏览(214)

在简单查询中筛选不需要的重复项
我们的数据:
表xx包含唯一的uuid1
结果包含(在某些情况下)重复的uuid1
我们有256个uuid1碎片
将表xx创建为(uuid1shard char(2)、uuid1 char(36)timestamp timestamp、uuid2shard char(2)、uuid2(char(36)、col001。。col190 int/bigint/char,有效负载varchar(100))
我们的发现:
工作正常:

select
row_number() over (partition by uuid2shard, uuid2 order by (timestamp, uuid2shard, uuid2))
,*
from (select * from xx where uuidshard2 < 192)

以某种方式工作(count(*)>count(distinct uuid2):

select
row_number() over (partition by uuid2shard, uuid2 order by (timestamp, uuid2shard, uuid2))
,*
from (select * from xx where uuidshard2 < 192)

通过create table保存时,结果(~800m)包含8000个重复行(完全相同)
一切正常,只要检查一下-

select count(*), count(distinct uuid2) from 
(
select
row_number() over (partition by uuid2shard, uuid2 order by (timestamp, uuid2shard, uuid2))
,*
from xx
)

再次有效:忽略剩下的190列

select
row_number() over (partition by uuid2shard, uuid2 order by (timestamp, uuid2shard, uuid2))
, partition by uuid1shard, uuid1, timestamp, uuid2shard, uuid2, smallpayload
from xx

根本原因是它(所有的DUP来自哪里?):

create table yy as --the CREATE, yes, the problem is probably in the create statement
select
row_number() over (partition by uuid2shard, uuid2 order by (timestamp, uuid2shard, uuid2))
,*
from xx

看来这是'大小'的问题,或必须有一些设置,使它的工作
我们的环境:
小的;)16个节点,384个vCore等。。。

暂无答案!

目前还没有任何答案,快来回答吧!

相关问题