Is it possible to do a "normalized" dense_rank() in Hive?

Posted by mnemlml8 on 2021-05-29 in Hadoop

I have a consumer table like this.

consumer | product | quantity
-------- | ------- | --------
a        | x       | 3
a        | y       | 4
a        | z       | 1
b        | x       | 3
b        | y       | 5
c        | x       | 4

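What I want is a "normalized" rank assigned to each consumer, so that I can easily split the table into training and test sets. I used dense_rank() in Hive and got the table below.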

rank | consumer | product | quantity
---- | -------- | ------- | --------
1    | a        | x       | 3
1    | a        | y       | 4
1    | a        | z       | 1
2    | b        | x       | 3
2    | b        | y       | 5
3    | c        | x       | 4

This is fine, but I want it to scale to any number of consumers, so ideally the rank would fall in a range between 0 and 1, like this.

rank | consumer | product | quantity
---- | -------- | ------- | --------
0.33 | a        | x       | 3
0.33 | a        | y       | 4
0.33 | a        | z       | 1
0.67 | b        | x       | 3
0.67 | b        | y       | 5
1    | c        | x       | 4

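That way I always know the range of the rank and can split the data in a standard way (rank <= 0.7 for training, rank > 0.7 for testing).
Is there a way to achieve this in Hive?
Or is there a different, better approach to my original problem of splitting the data?
I tried a select * where rank < 0.7*max(rank), but Hive says the MAX UDAF is not yet available in the WHERE clause.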

hadoop Hive machine-learning training-data

Source: https://stackoverflow.com/questions/43129622/is-it-possible-to-do-a-normalized-dense-rank-in-hive

1 Answer

brqmpdu11#

Use percent_rank()

select  percent_rank() over (order by consumer) as pr
       ,* 
from    mytable
;

+-----+----------+---------+----------+
| pr  | consumer | product | quantity |
+-----+----------+---------+----------+
| 0.0 | a        | z       |        1 |
| 0.0 | a        | y       |        4 |
| 0.0 | a        | x       |        3 |
| 0.6 | b        | y       |        5 |
| 0.6 | b        | x       |        3 |
| 1.0 | c        | x       |        4 |
+-----+----------+---------+----------+
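
For reference, the values above follow from how percent_rank() is defined: (rank - 1) / (number of rows - 1). Here the rows for consumer b have rank 4 out of 6 rows, which gives 3/5 = 0.6. Because it counts rows rather than distinct consumers, it does not give the 0.33 / 0.67 / 1 values asked for in the question. The query below is only a sketch of the same calculation written out by hand; the alias pr_manual is made up for illustration, and mytable is the table name from the answer.

```sql
-- Hand-written equivalent of percent_rank() over (order by consumer).
-- pr_manual is a hypothetical alias used only in this sketch.
select  (rank() over (order by consumer) - 1)
        / (count(*) over () - 1)            as pr_manual
       ,consumer
       ,product
       ,quantity
from    mytable
;
```

This assumes the table has more than one row; with a single row the denominator is zero.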

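For filtering you need a subquery/CTE, because the result of a window function cannot be referenced in the WHERE clause of the same query level.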

select  *
from   (select  percent_rank() over (order by consumer) as pr
               ,* 
        from    mytable
        ) t
where   pr <= ...
;
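
If the exact values from the question (0.33, 0.67, 1) are wanted, one possible approach is to divide dense_rank() by its maximum, using nested subqueries because window functions cannot be nested directly. This is a sketch rather than part of the original answer: the aliases dr, norm_rank and subset are made up for illustration, and the 0.7 threshold is the one mentioned in the question.

```sql
-- Normalized dense rank: dense_rank divided by the number of distinct consumers.
-- dr, norm_rank and subset are hypothetical names used only in this sketch.
select  t.*
       ,case when norm_rank <= 0.7 then 'train' else 'test' end as subset
from   (select  cast(dr as double) / max(dr) over () as norm_rank
               ,consumer
               ,product
               ,quantity
        from   (select  dense_rank() over (order by consumer) as dr
                       ,consumer
                       ,product
                       ,quantity
                from    mytable
                ) d
        ) t
;
```

On the sample data this assigns about 0.33 to consumer a, 0.67 to b and 1.0 to c, so a and b fall in the training set and c in the test set. The cast keeps the division fractional in engines where dividing two integers would truncate.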
