Here is my sample data:
+---------------------------+------+-----------+--------------+------------+--------+--------------+-------+--------+
| Car                       | MPG  | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model | Origin |
+---------------------------+------+-----------+--------------+------------+--------+--------------+-------+--------+
| Chevrolet Chevelle Malibu | 18.0 | 8         | 307.0        | 130.0      | 3504   | 12.0         | 70    | US     |
| Buick Skylark 320         | 15.0 | 8         | 350.0        | 165.0      | 3693   | 11.5         | 70    | US     |
| Plymouth Satellite        | 18.0 | 8         | 318.0        | 150.0      | 3436   | 11.0         | 70    | US     |
| AMC Rebel SST             | 16.0 | 8         | 304.0        | 150.0      | 3433   | 12.0         | 70    | US     |
| Ford Torino               | 17.0 | 8         | 302.0        | 140.0      | 3449   | 10.5         | 70    | US     |
| Ford Galaxie 500          | 15.0 | 8         | 429.0        | 198.0      | 4341   | 10.0         | 70    | US     |
| Chevrolet Impala          | 14.0 | 8         | 454.0        | 220.0      | 4354   | 9.0          | 70    | US     |
| Plymouth Fury iii         | 14.0 | 8         | 440.0        | 215.0      | 4312   | 8.5          | 70    | US     |
+---------------------------+------+-----------+--------------+------------+--------+--------------+-------+--------+
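I want to find the cars whose mpg and horsepower are both greater than the respective averages across all cars, i.e. mpg > avg(mpg) and horsepower > avg(horsepower).
What I have done: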
r = load '/user/CarData/cars.csv' using PigStorage(',') as (car:chararray,mpg:float,cyl:INT,disp:DOUBLE,hp:DOUBLE,weight:INT,acc:DOUBLE,model:INT,org:chararray);
r1 = group r by car;
r2 = foreach r1 generate group,AVG(r.mpg) as avg_mpg,AVG(r.hp) as avg_hp,r.mpg,r.hp;
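This gives me the group, the averages, and the {mpg} bag. The problem I'm facing now is filtering r2. I'm trying something like this: FILTER r2 by r.mpg > AVG(mpg) and r.hp > AVG(hp)
Please help me. Thanks.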
5 Answers
hgqdbh6s1#
As mentioned above, you don't need to join the tables. I think this would be an optimized version.
Input data:
Pig script:
Output:
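A join-free script of this kind might look like the sketch below. It assumes the path and schema from the question; the relation names (`cars`, `all_cars`, `with_avg`, `above_avg`) are illustrative and not taken from the answer. The idea is to use GROUP ... ALL so the averages cover the whole dataset rather than one car at a time, then FLATTEN the bag so every row sits next to both averages.

```pig
-- Load the CSV using the schema from the question.
cars = LOAD '/user/CarData/cars.csv' USING PigStorage(',')
       AS (car:chararray, mpg:float, cyl:int, disp:double, hp:double,
           weight:int, acc:double, model:int, org:chararray);

-- Put every row into one group so AVG runs over the full dataset.
all_cars = GROUP cars ALL;

-- Emit each original row together with the two global averages.
with_avg = FOREACH all_cars GENERATE
               FLATTEN(cars),
               AVG(cars.mpg) AS avg_mpg,
               AVG(cars.hp)  AS avg_hp;

-- Keep only the cars that beat both averages.
above_avg = FILTER with_avg BY cars::mpg > avg_mpg AND cars::hp > avg_hp;

DUMP above_avg;
```

If the file holds only the eight sample rows from the question, the averages come to 15.875 mpg and 171 horsepower, and no car exceeds both, so the result is empty. Keep in mind that GROUP ... ALL sends every row to a single reducer, which is fine for small inputs but may become a bottleneck on large ones.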
acruukt92#
Input:
Code:
Output:
f5emj3cl3#
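I changed mpg for Impala and iii to 19.0 so that this query returns something. You want to avoid a self-join here; this can be done efficiently with Hive window functions. Hive:
Output: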
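A window-function version could be sketched as follows, assuming a Hive table named `cars` with columns `car`, `mpg`, and `horsepower` (the table and column names are assumptions, not given in the thread). `AVG(...) OVER ()` computes the average over all rows and attaches it to each row, so no self-join is needed.

```sql
-- Assumed table: cars(car STRING, mpg FLOAT, horsepower DOUBLE, ...)
SELECT car, mpg, horsepower
FROM (
  SELECT car,
         mpg,
         horsepower,
         AVG(mpg)        OVER () AS avg_mpg,  -- average over the whole table
         AVG(horsepower) OVER () AS avg_hp
  FROM cars
) t
WHERE mpg > avg_mpg
  AND horsepower > avg_hp;
```

With mpg set to 19.0 for the Impala and iii rows, the averages are 17.125 mpg and 171 horsepower, so those two rows are the only ones that pass both conditions.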
As for Pig, I think @sai kiran neelakantam's solution is very solid.
2ul0zpep4#
tp5buhyn5#
In Hive, it would look like:
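One plain-join form, as a sketch: compute both averages once in a subquery and cross join that single row back to the table. The table name `cars` and the column names `car`, `mpg`, and `horsepower` are assumed for illustration.

```sql
-- Assumed table: cars(car STRING, mpg FLOAT, horsepower DOUBLE, ...)
SELECT c.car, c.mpg, c.horsepower
FROM cars c
CROSS JOIN (
  -- One-row subquery holding the global averages.
  SELECT AVG(mpg) AS avg_mpg, AVG(horsepower) AS avg_hp
  FROM cars
) a
WHERE c.mpg > a.avg_mpg
  AND c.horsepower > a.avg_hp;
```

Because the subquery returns exactly one row, the cross join does not multiply the data; it only makes the two averages available to the WHERE clause.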