过滤器值大于pig/hive中的平均值

2ekbmq32  于 2021-06-25  发布在  Pig
关注(0)|答案(5)|浏览(322)

这是我的样本数据:

+---------------------------+------+-----------+--------------+------------+--------+--------------+-------+------------------+
|            Car            | MPG  | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model |      Origin      |
+---------------------------+------+-----------+--------------+------------+--------+--------------+-------+------------------+
| Chevrolet Chevelle Malibu | 18.0 |         8 | 307.0        | 130.0      |   3504 | 12.0         |    70 | US Buick         |
| Skylark 320               | 15.0 |         8 | 350.0        | 165.0      |   3693 | 11.5         |    70 | US Plymouth      |
| Satellite                 | 18.0 |         8 | 318.0        | 150.0      |   3436 | 11.0         |    70 | US AMC Rebel     |
| SST                       | 16.0 |         8 | 304.0        | 150.0      |   3433 | 12.0         |    70 | US Ford          |
| Torino                    | 17.0 |         8 | 302.0        | 140.0      |   3449 | 10.5         |    70 | US Ford Galaxie  |
| 500                       | 15.0 |         8 | 429.0        | 198.0      |   4341 | 10.0         |    70 | US Chevrolet     |
| Impala                    | 14.0 |         8 | 454.0        | 220.0      |   4354 | 9.0          |    70 | US Plymouth Fury |
| iii                       | 14.0 |         8 | 440.0        | 215.0      |   4312 | 8.5          |    70 | US               |
+---------------------------+------+-----------+--------------+------------+--------+--------------+-------+------------------+

我想找出那些mpg和马力的基础上,每辆车的价值是大于他们的平均值。例如mpg>avg(mpg)和horsepower>avg(horsepower)。
我所做的:

r = load '/user/CarData/cars.csv' using PigStorage(',') as (car:chararray,mpg:float,cyl:INT,disp:DOUBLE,hp:DOUBLE,weight:INT,acc:DOUBLE,model:INT,org:chararray);
r1 = group r by car; 
r2 = foreach r1 generate group,AVG(r.mpg) as avg_mpg,AVG(r.hp) as avg_hp,r.mpg,r.hp;

它将给我肉身,平均和袋{mpg},现在我面临的问题,从r2过滤。我在尝试这样的事情: FILTER r2 by r.mpg > AVG(mpg) and r.hp > AVG(hp) 请帮帮我。谢谢

hgqdbh6s

hgqdbh6s1#

如上所述,您不需要联接表。我觉得这将是一个优化的版本。
输入数据:

Chevrolet Chevelle Malibu,18.0,8,307.0,130.0,3504,12.0,70,US Buick
Skylark 320,15.0,8,350.0,165.0,3693,11.5,70,US Plymouth
Satellite,18.0,8,318.0,150.0,3436,11.0,70,US AMC Rebel
SST,16.0,8,304.0,150.0,3433,12.0,70,US Ford
Torino,17.0,8,302.0,140.0,3449,10.5,70,US Ford Galaxie
500,15.0,8,429.0,198.0,4341,10.0,70,US Chevrolet
Impala1,14.0,8,454.0,220.0,4354,9.0,70,US Plymouth Fury
Impala2,25.0,8,454.0,270.0,4354,9.0,70,US Plymouth Fury
Impala3,30.0,8,454.0,290.0,4354,9.0,70,US Plymouth Fury
iii,14.0,8,440.0,215.0,4312,8.5,70,US

Pig脚本:

input_data = LOAD '/pigsamples/carinfo' USING PigStorage (',') 
             AS (car:CHARARRAY, mpg:FLOAT, cyl:INT, disp:DOUBLE, hp:DOUBLE, weight:INT, acc:DOUBLE, model:INT, org:CHARARRAY);

group_data = GROUP input_data ALL;
average_values = FOREACH group_data GENERATE AVG(input_data.mpg) AS avg_mpg, AVG(input_data.hp) AS avg_hp;
filter_data = FILTER input_data BY mpg > average_values.avg_mpg AND hp > average_values.hp;

输出:

(Impala2,25.0,8,454.0,270.0,4354,9.0,70,US Plymouth Fury)
(Impala3,30.0,8,454.0,290.0,4354,9.0,70,US Plymouth Fury)
acruukt9

acruukt92#

输入:

Chevrolet Chevelle Malibu,18.0,8,307.0,130.0,3504,12.0,70,US Buick
Skylark 320,15.0,8,350.0,165.0,3693,11.5,70,US Plymouth
Satellite,18.0,8,318.0,150.0,3436,11.0,70,US AMC Rebel
SST,16.0,8,304.0,150.0,3433,12.0,70,US Ford
Torino,17.0,8,302.0,140.0,3449,10.5,70,US Ford Galaxie
500,15.0,8,429.0,2.0,4341,10.0,70,US Chevrolet
500,45.0,8,429.0,198.0,4341,10.0,70,US Chevrolet
500,10.0,8,429.0,40.0,4341,10.0,70,US Chevrolet
Impala,14.0,8,454.0,220.0,4354,9.0,70,US Plymouth Fury
iii,14.0,8,440.0,215.0,4312,8.5,70,US

代码:

r = load 'test.data' using PigStorage(',') as car:chararray,mpg:float,cyl:INT,disp:DOUBLE,hp:DOUBLE,weight:INT,acc:DOUBLE,mod    el:INT,org:chararray);
r1 = group r by car;
r2  = foreach r1 generate FLATTEN(group) as (car_grp:chararray),(float)AVG(r.mpg) as (avg_mpg:float),
                             (DOUBLE)AVG(r.hp) as (avg_hp:DOUBLE);
j = JOIN r2 BY car_grp, r BY car;
r3 = foreach j generate r2::car_grp  as (car:chararray),r::mpg as (mpg:float),r::cyl as (cyl:INT),r::disp as (disp:DOUBLE),r::hp as (hp:DOUBLE),r::weight as (weight:INT),r::acc as (acc:DOUBLE),r::model as (model:INT),r::org as (org:chararray),r2::avg_mpg as (avg_mpg:float),r2::avg_hp as (avg_hp:DOUBLE);
r4 = FILTER r3 BY mpg > avg_mpg AND hp > avg_hp;

输出:

(500,45.0,8,429.0,198.0,4341,10.0,70,US Chevrolet,23.333334,80.0)
f5emj3cl

f5emj3cl3#

我变了 mpg 为了 Impala 以及 iii19.0 所以这个查询会返回一些东西。你要避免在这里自我连接;his可以有效地完成Hive窗口功能。
Hive:

select car, mpg, avg_mpg, horsepower, avg_hrspwr
from (
  select car, mpg, horsepower
    , avg( mpg ) over () as avg_mpg
    , avg( horsepower ) over () as avg_hrspwr
  from db.table ) x
where horsepower > avg_hrspwr and mpg > avg_mpg

输出:

Impala    19.0    17.125    220.0    171.0
iii       19.0    17.125    215.0    171.0

至于 Pig 我认为@sai kiran neelakantam的解决方案非常可靠。

2ul0zpep

2ul0zpep4#

r = load 'A' using PigStorage(',') as (car:chararray,mpg:float,cyl:INT,disp:DOUBLE,hp:DOUBLE,weight:INT,acc:DOUBLE,model:INT,org:chararray);
r1 = group r by car; 
r2 = foreach r1 generate group,FLATTEN(AVG(r.mpg)) as avg_mpg,AVG(r.hp) as avg_hp,FLATTEN(r.mpg) as mpg ,FLATTEN(r.hp) as hp;
r3 = FILTER r2 by mpg > avg_mpg  and hp > avg_hp;
r3 = distinct r3;
dump r3;
tp5buhyn

tp5buhyn5#

在Hive里,它会像

Select Car, MPG, Cylinders, Displacement, Horsepower, Weight, Acceleration, Model, Origin
FROM cars_table 
JOIN (Select AVG(mpg) as a_m, AVG(hp) as a_h) averages ON (1 = 1)
WHERE Horsepower > a_h AND MPG > a_m;

相关问题