Elastic MapReduce: how to enforce the correct data types in Apache Pig?

v64noz0r · posted 2021-06-21 in Pig

I cannot SUM a bag of values because of a data type error.
I am loading a CSV file whose rows look like this:

    6 574 false 10.1.72.23 2010-05-16 13:56:19 +0930 fbcdn.net static.ak.fbcdn.net 304 text/css 1 /rsrc.php/zPTJC/hash/50l7x7eg.css http pwong

using the following statement:

    logs_base = FOREACH raw_logs GENERATE
        FLATTEN(
            EXTRACT(line, '^(\\d+),"(\\d+)","(\\w+)","(\\S+)","(.+?)","(\\S+)","(\\S+)","(\\d+)","(\\S+)","(\\d+)","(\\S+)","(\\S+)","(\\S+)"')
        )
        as (
            account_id: int,
            bytes: long,
            cached: chararray,
            ip: chararray,
            time: chararray,
            domain: chararray,
            host: chararray,
            status: chararray,
            mime_type: chararray,
            page_view: chararray,
            path: chararray,
            protocol: chararray,
            username: chararray
        );
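(For completeness: EXTRACT is the piggybank regex UDF, registered earlier in the session along the lines of the sketch below; the jar path shown is the usual EMR location, but it may differ on your cluster.)

    REGISTER file:/home/hadoop/lib/pig/piggybank.jar;
    DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT;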

All the fields appear to load fine with the correct types, as the "describe" command shows:

    grunt> describe logs_base
    logs_base: {account_id: int,bytes: long,cached: chararray,ip: chararray,time: chararray,domain: chararray,host: chararray,status: chararray,mime_type: chararray,page_view: chararray,path: chararray,protocol: chararray,username: chararray}

But whenever I compute the sum with:

    bytesCount = FOREACH (GROUP logs_base ALL) GENERATE SUM(logs_base.bytes);

and then STORE or DUMP the result, the MapReduce job fails with the following error:

    org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing sum in Initial
        at org.apache.pig.builtin.LongSum$Initial.exec(LongSum.java:87)
        at org.apache.pig.builtin.LongSum$Initial.exec(LongSum.java:65)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:216)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:253)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:334)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:267)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
    Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long
        at org.apache.pig.builtin.LongSum$Initial.exec(LongSum.java:79)
        ... 15 more

What caught my attention is:

    Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long

This leads me to believe that the EXTRACT function is not casting the bytes field to the desired data type (long).
Is there a way to force EXTRACT to cast to the correct data types? And how can I run the aggregation without first doing a FOREACH over every record? (The same problem occurs when converting time to a Unix timestamp and then trying to find the MIN; I would definitely prefer a solution that avoids unnecessary projections.)
Any hints would be greatly appreciated. Thanks a lot for your help.
Cheers, George C.
P.S. I am running this in interactive mode on Amazon Elastic MapReduce.


llycmphe 1#

Have you tried casting the data returned by the UDF? Applying a schema in the AS clause does not perform any cast.
For example:

    logs_base =
        FOREACH raw_logs
        GENERATE
            FLATTEN(
                (tuple(LONG,LONG,CHARARRAY,....)) EXTRACT(line, '^...')
            )
            AS (account_id: INT, ...);
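Spelled out against the schema from the question, a fuller sketch (untested; the regex and field list are copied verbatim from the original statement, and the tuple cast mirrors the AS schema) would be:

    logs_base = FOREACH raw_logs GENERATE
        FLATTEN(
            -- cast the UDF's output tuple so each field carries a real
            -- Pig type instead of the Java Strings the UDF emits
            (tuple(int, long, chararray, chararray, chararray, chararray,
                   chararray, chararray, chararray, chararray, chararray,
                   chararray, chararray))
            EXTRACT(line, '^(\\d+),"(\\d+)","(\\w+)","(\\S+)","(.+?)","(\\S+)","(\\S+)","(\\d+)","(\\S+)","(\\d+)","(\\S+)","(\\S+)","(\\S+)"')
        )
        AS (
            account_id: int, bytes: long, cached: chararray, ip: chararray,
            time: chararray, domain: chararray, host: chararray,
            status: chararray, mime_type: chararray, page_view: chararray,
            path: chararray, protocol: chararray, username: chararray
        );

With the tuple cast in place, bytes should reach the aggregate as an actual long, so SUM(logs_base.bytes) should no longer throw the ClassCastException.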
