在简单查询中筛选不需要的重复项
我们的数据:
表xx包含唯一的uuid1
结果包含(在某些情况下)重复的uuid1
我们有256个uuid1碎片
将表xx创建为(uuid1shard char(2)、uuid1 char(36)timestamp timestamp、uuid2shard char(2)、uuid2(char(36)、col001。。col190 int/bigint/char,有效负载varchar(100))
我们的发现:
工作正常:
select
row_number() over (partition by uuid2shard, uuid2 order by (timestamp, uuid2shard, uuid2))
,*
from (select * from xx where uuidshard2 < 192)
以某种方式工作(count(*)>count(distinct uuid2):
select
row_number() over (partition by uuid2shard, uuid2 order by (timestamp, uuid2shard, uuid2))
,*
from (select * from xx where uuidshard2 < 192)
通过create table保存时,结果(~800m)包含8000个重复行(完全相同)
一切正常,只要检查一下-
select count(*), count(distinct uuid2) from
(
select
row_number() over (partition by uuid2shard, uuid2 order by (timestamp, uuid2shard, uuid2))
,*
from xx
)
再次有效:忽略剩下的190列
select
row_number() over (partition by uuid2shard, uuid2 order by (timestamp, uuid2shard, uuid2))
, partition by uuid1shard, uuid1, timestamp, uuid2shard, uuid2, smallpayload
from xx
根本原因是它(所有的DUP来自哪里?):
create table yy as --the CREATE, yes, the problem is probably in the create statement
select
row_number() over (partition by uuid2shard, uuid2 order by (timestamp, uuid2shard, uuid2))
,*
from xx
看来这是'大小'的问题,或必须有一些设置,使它的工作
我们的环境:
小的;)16个节点,384个vCore等。。。
暂无答案!
目前还没有任何答案,快来回答吧!