Unable to perform a SUM operation in Pig

pgky5nke  posted on 2021-05-29  in  Hadoop

I am trying to perform a SUM on data in Pig, but it does not accept the explicit type cast. I also tried replacing (int) with (double) when calling SUM.
Code

drivers = LOAD '/sachin/drivers.csv' USING PigStorage(',');
time = LOAD '/sachin/timesheet.csv' USING PigStorage(',');
drivdata = FILTER drivers BY $0>1;
timedata = filter time by $0>0;
drivgrp = group timedata by $0;
drivinfo = foreach drivgrp generate group as id , SUM(timedata.$2) as totalhr , SUM(timedata.$3) as totmillogged;
drivfinal = foreach drivdata generate $0 as id , $1 as name;
result = join drivfinal by id , drivinfo by id;
finalres = foreach result generate $0 as id, $1 as name, $3 as hrslogged, $4 as mileslogged;
summile = foreach finalres generate (int)SUM(mileslogged);
DUMP summile;

Error message

grunt> exec /home/sachin/sec.pig
2017-12-13 21:57:58,812 [main] WARN  org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_INT 1 time(s).
2017-12-13 21:57:58,854 [main] WARN  org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_INT 2 time(s).
2017-12-13 21:57:58,996 [main] WARN  org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_INT 2 time(s).
2017-12-13 21:57:59,036 [main] WARN  org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_INT 2 time(s).
2017-12-13 21:57:59,080 [main] WARN  org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_INT 2 time(s).
2017-12-13 21:57:59,121 [main] WARN  org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_INT 2 time(s).
2017-12-13 21:57:59,192 [main] WARN  org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_INT 2 time(s).
2017-12-13 21:57:59,246 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045: <line 10, column 41> Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast.
Details at logfile: /home/sachin/pig_1513175202309.log
grunt>

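What I am actually trying to do is, for each driver in the top 5 list, find the miles logged and the percentage of the total miles logged that this driver accounts for, and store the result in HDFS.
Dataset link: https://raw.githubusercontent.com/hortonworks/data-tutorials/master/tutorials/hdp/how-to-process-data-with-apache-pig/assets/driver_data.zip
Can anyone help me solve this, or help me understand what is going wrong here?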


a8jjtwal1#

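From http://pig.apache.org/docs/r0.17.0/func.html#sum
SUM is defined as:
Computes the sum of the numeric values in a single-column bag. SUM requires a preceding GROUP ALL statement for global sums and a GROUP BY statement for group sums.
Your code passes a double to SUM, whereas SUM expects a bag of doubles. No type cast is needed, but you do need to group before calling SUM.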

allres = group finalres ALL;
summile = foreach allres generate SUM(finalres.mileslogged);
DUMP summile;
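
The same GROUP ALL total can be carried further toward the stated goal (miles per driver, share of the total, top 5, stored in HDFS). The following is only a sketch under assumptions: it reuses the relation and field names from the question's script, while `total`, `withpct`, `pct_of_total`, `ordered`, `top5` and the output path are hypothetical names introduced for illustration. It relies on Pig allowing a single-row relation to be referenced as a scalar.

```pig
-- Sketch only: finalres and mileslogged come from the question's script;
-- every other alias and the output path are hypothetical.
allres  = GROUP finalres ALL;
total   = FOREACH allres GENERATE SUM(finalres.mileslogged) AS totalmiles;

-- 'total' holds exactly one row, so total.totalmiles can be used as a scalar
withpct = FOREACH finalres GENERATE id, name, mileslogged,
              100.0 * mileslogged / (double)total.totalmiles AS pct_of_total;

-- keep the five drivers with the most miles logged
ordered = ORDER withpct BY mileslogged DESC;
top5    = LIMIT ordered 5;

STORE top5 INTO '/sachin/top5_drivers' USING PigStorage(',');
```

The scalar reference only works because GROUP ALL yields a single row; if `total` ever contained more than one row, the job would be expected to fail at runtime.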

bxgwgixi2#

You have to cast mileslogged and then call the SUM function

finalres = foreach result generate $0 as id, $1 as name, $3 as hrslogged, (int)$4 as mileslogged; 
summile = foreach finalres generate SUM(mileslogged);
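
Because SUM is documented as operating on a bag, the cast would most likely need to be combined with a grouping step before the call. A minimal sketch, assuming the relations from the question's script; `allres` and `totalmiles` are illustrative names:

```pig
-- cast the field once, when it is projected
finalres = FOREACH result GENERATE $0 AS id, $1 AS name, $3 AS hrslogged, (int)$4 AS mileslogged;

-- GROUP ALL collects every row into one bag, which is what SUM consumes
allres   = GROUP finalres ALL;
summile  = FOREACH allres GENERATE SUM(finalres.mileslogged) AS totalmiles;
DUMP summile;
```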

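I also noticed that you do not specify data types in the LOAD statements. The default data type is bytearray, and I doubt you will get correct results unless you explicitly cast the fields in the later steps.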
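
As an illustration of declaring types at load time, here is a hedged sketch for the timesheet file. The column names (`driverId`, `week`, `hours`, `miles`) are assumptions, as is the file having exactly these four columns; the question only shows that positions $0, $2 and $3 are used as id, hours and miles.

```pig
-- Assumed schema: column names and the column count are not confirmed by the question
time     = LOAD '/sachin/timesheet.csv' USING PigStorage(',')
           AS (driverId:int, week:int, hours:int, miles:int);

-- a non-numeric header row cannot be cast to int and becomes null, so it is dropped here
timedata = FILTER time BY driverId IS NOT NULL;

drivgrp  = GROUP timedata BY driverId;
drivinfo = FOREACH drivgrp GENERATE group AS id,
               SUM(timedata.hours) AS totalhr,
               SUM(timedata.miles) AS totmillogged;

-- print the inferred schema to confirm the field types before going further
DESCRIBE drivinfo;
```

With typed fields, the later steps can refer to columns by name instead of position, and the IMPLICIT_CAST_TO_INT warnings should no longer be triggered by the filters.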
