按配置单元中的列进行采样

but5z9lq  于 2021-06-02  发布在  Hadoop
关注(0)|答案(1)|浏览(263)

给定一个配置单元表,如下所示:

  1. > desc T;
  2. dim1 string
  3. dim2 string
  4. dim3 string
  5. value1 int
  6. value2 int

我试着按组随机抽取1000行 (dim1, dim2, dim3) .
一种方法是:

  1. # bash
  2. for dim1 in dim1_1, dim1_2; do
  3. for dim2 in dim2_1, dim2_2; do
  4. for dim3 in dim3_1, dim3_2; do
  5. hive -e "select * from T where dim1=$dim1 and dim2=$dim2 and dim3=$dim3 limit 1000;"
  6. done done done

然后将连续执行2^3=8个查询。有没有更有效的方法?

hmtdttj4

hmtdttj41#

  1. with dim as
  2. (
  3. select struct(d1.v,d2.v,d3.v) as vals
  4. from (select 1) x
  5. lateral view explode (array(1,2)) d1 as v -- dim1_1 = 1 dim1_2 = 2
  6. lateral view explode (array(3,4)) d2 as v -- dim2_1 = 3 dim2_2 = 4
  7. lateral view explode (array(5,6)) d3 as v -- dim3_1 = 5 dim3_2 = 6
  8. )
  9. select *
  10. from (select row_number () over
  11. (
  12. partition by dim1,dim2,dim3
  13. order by rand()
  14. ) as rn
  15. ,*
  16. from t
  17. where struct(t.dim1,t.dim2,t.dim3) in (select vals from dim)
  18. ) t
  19. where rn <= 1000
  20. ;
展开查看全部

相关问题