Unable to open iterator for alias <alias_name>

Posted by j9per5c4 on 2021-06-24 in Pig

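I know this is one of the most frequently asked questions. I have searched almost everywhere, and none of the resources I found solves the problem I am facing. Below is a simplified version of my problem statement. The real data is more complex, which is why I have to use a UDF.
My input file (input.txt):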

  NotNeeded1,NotNeeded11;Needed1
  NotNeeded2,NotNeeded22;Needed2

I want the output to be:

  Needed1
  Needed2

So I wrote the following UDF (Java code):

  package com.company.pig;

  import java.io.IOException;
  import org.apache.pig.EvalFunc;
  import org.apache.pig.data.Tuple;

  public class myudf extends EvalFunc<String> {
      public String exec(Tuple input) throws IOException {
          if (input == null || input.size() == 0)
              return null;
          String s = (String) input.get(0);
          String str = s.split("\\,")[1];
          String str1 = str.split("\\;")[1];
          return str1;
      }
  }

I packaged it as:

  rollupreg_extract-jar-with-dependencies.jar

Here is my Pig shell (grunt) session:

  grunt> REGISTER /pig/rollupreg_extract-jar-with-dependencies.jar;
  grunt> DEFINE myudf com.company.pig.myudf;
  grunt> data = LOAD 'hdfs://sandbox.hortonworks.com:8020/pig_hdfs/input.txt' USING PigStorage(',');
  grunt> extract = FOREACH data GENERATE myudf($1);
  grunt> DUMP extract;

I get the following error:

  2017-05-15 15:58:15,493 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
  2017-05-15 15:58:15,577 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
  2017-05-15 15:58:15,659 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
  2017-05-15 15:58:15,774 [main] INFO org.apache.pig.impl.util.SpillableMemoryManager - Selected heap (PS Old Gen) of size 699400192 to monitor. collectionUsageThreshold = 489580128, usageThreshold = 489580128
  2017-05-15 15:58:15,865 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
  2017-05-15 15:58:15,923 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
  2017-05-15 15:58:15,923 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
  2017-05-15 15:58:16,184 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
  2017-05-15 15:58:16,196 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/172.17.0.2:8050
  2017-05-15 15:58:16,396 [main] INFO org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at sandbox.hortonworks.com/172.17.0.2:10200
  2017-05-15 15:58:16,576 [main] INFO org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
  2017-05-15 15:58:16,580 [main] WARN org.apache.pig.tools.pigstats.ScriptState - unable to read pigs manifest file
  2017-05-15 15:58:16,584 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
  2017-05-15 15:58:16,588 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process
  2017-05-15 15:58:17,258 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/pig/rollupreg_extract-jar-with-dependencies.jar to DistributedCache through /tmp/temp-1119775568/tmp-858482998/rollupreg_extract-jar-with-dependencies.jar
  2017-05-15 15:58:17,276 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
  2017-05-15 15:58:17,294 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
  2017-05-15 15:58:17,295 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche
  2017-05-15 15:58:17,295 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []
  2017-05-15 15:58:17,354 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
  2017-05-15 15:58:17,510 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
  2017-05-15 15:58:17,511 [JobControl] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/172.17.0.2:8050
  2017-05-15 15:58:17,511 [JobControl] INFO org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at sandbox.hortonworks.com/172.17.0.2:10200
  2017-05-15 15:58:17,753 [JobControl] WARN org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set. User classes may not be found. See Job or Job#setJar(String).
  2017-05-15 15:58:17,820 [JobControl] INFO org.apache.pig.builtin.PigStorage - Using PigTextInputFormat
  2017-05-15 15:58:17,830 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
  2017-05-15 15:58:17,830 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
  2017-05-15 15:58:17,884 [JobControl] INFO com.hadoop.compression.lzo.GPLNativeCodeLoader - Loaded native gpl library
  2017-05-15 15:58:17,889 [JobControl] INFO com.hadoop.compression.lzo.LzoCodec - Successfully loaded & initialized native-lzo library [hadoop-lzo rev 7a4b57bedce694048432dd5bf5b90a6c8ccdba80]
  2017-05-15 15:58:17,922 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
  2017-05-15 15:58:18,525 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
  2017-05-15 15:58:18,692 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1494853652295_0023
  2017-05-15 15:58:18,879 [JobControl] INFO org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
  2017-05-15 15:58:18,973 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1494853652295_0023
  2017-05-15 15:58:19,029 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1494853652295_0023/
  2017-05-15 15:58:19,030 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1494853652295_0023
  2017-05-15 15:58:19,030 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases data,extract
  2017-05-15 15:58:19,030 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: data[2,7],extract[3,10] C: R:
  2017-05-15 15:58:19,044 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
  2017-05-15 15:58:19,044 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1494853652295_0023]
  2017-05-15 15:58:29,156 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
  2017-05-15 15:58:29,156 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_1494853652295_0023 has failed! Stop running all dependent jobs
  2017-05-15 15:58:29,157 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
  2017-05-15 15:58:29,790 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
  2017-05-15 15:58:29,791 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/172.17.0.2:8050
  2017-05-15 15:58:29,793 [main] INFO org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at sandbox.hortonworks.com/172.17.0.2:10200
  2017-05-15 15:58:30,311 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
  2017-05-15 15:58:30,312 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/172.17.0.2:8050
  2017-05-15 15:58:30,313 [main] INFO org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at sandbox.hortonworks.com/172.17.0.2:10200
  2017-05-15 15:58:30,465 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed!
  2017-05-15 15:58:30,467 [main] WARN org.apache.pig.tools.pigstats.ScriptState - unable to read pigs manifest file
  2017-05-15 15:58:30,472 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:
  HadoopVersion PigVersion UserId StartedAt FinishedAt Features
  2.7.3.2.5.0.0-1245 root 2017-05-15 15:58:16 2017-05-15 15:58:30 UNKNOWN
  Failed!
  Failed Jobs:
  JobId Alias Feature Message Outputs
  job_1494853652295_0023 data,extract MAP_ONLY Message: Job failed! hdfs://sandbox.hortonworks.com:8020/tmp/temp-1119775568/tmp-1619300225,
  Input(s):
  Failed to read data from "/pig_hdfs/input.txt"
  Output(s):
  Failed to produce result in "hdfs://sandbox.hortonworks.com:8020/tmp/temp-1119775568/tmp-1619300225"
  Counters:
  Total records written : 0
  Total bytes written : 0
  Spillable Memory Manager spill count : 0
  Total bags proactively spilled: 0
  Total records proactively spilled: 0
  Job DAG:
  job_1494853652295_0023
  2017-05-15 15:58:30,472 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
  2017-05-15 15:58:30,499 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias extract
  Details at logfile: /pig/pig_1494863836458.log

I know it complains about:

  Failed to read data from "/pig_hdfs/input.txt"

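But I believe that is not the real problem. If I DUMP the data directly without using the UDF, I get output, so reading the input file is not the issue.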

Answer #1 (ioekq8ef)

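First, you do not need a UDF to get the desired output. You can use the semicolon as the delimiter in the LOAD statement and select the column you need.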

  data = LOAD 'hdfs://sandbox.hortonworks.com:8020/pig_hdfs/input.txt' USING PigStorage(';');
  extract = FOREACH data GENERATE $1;
  DUMP extract;

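If you insist on using a UDF, you have to load each record into a single field and then apply the UDF to it. In addition, the UDF itself is incorrect: it should split the string s, which is passed in from the Pig script, using ';' as the delimiter.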

  String s = (String) input.get(0);
  String str1 = s.split("\\;")[1];
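To see why the script in the question fails, it can help to replay the string handling outside Pig. The sketch below is plain Java with no Pig dependency; the class and method names are hypothetical, and it assumes that `PigStorage(',')` behaves like a simple split on ',' for this input, so that `$1` holds `NotNeeded11;Needed1`.

```java
// Hypothetical demo class, not part of the original UDF; plain Java, no Pig dependency.
class SplitDemo {

    // Mirrors the UDF logic from the question: split on ',' first, then on ';'.
    static String originalLogic(String s) {
        String str = s.split("\\,")[1];
        return str.split("\\;")[1];
    }

    // Mirrors the corrected logic from the answer: split on ';' only.
    static String correctedLogic(String s) {
        return s.split("\\;")[1];
    }

    public static void main(String[] args) {
        String line = "NotNeeded1,NotNeeded11;Needed1";

        // Assumption: this is what the UDF receives as $1 after loading with PigStorage(',').
        String field1 = line.split(",")[1];
        System.out.println("field passed to the UDF: " + field1);

        try {
            originalLogic(field1);
        } catch (ArrayIndexOutOfBoundsException e) {
            // The field no longer contains a comma, so index 1 does not exist.
            System.out.println("original logic fails: " + e);
        }

        // With the whole record in one field, splitting on ';' is enough.
        System.out.println("corrected logic: " + correctedLogic(line));
    }
}
```

An exception thrown inside `exec` fails the map task, which would be consistent with the job failure and the ERROR 1066 message in the log above.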

In the Pig script, load the whole record into a single field and apply the UDF to field $0.

  REGISTER /pig/rollupreg_extract-jar-with-dependencies.jar;
  DEFINE myudf com.company.pig.myudf;
  data = LOAD 'hdfs://sandbox.hortonworks.com:8020/pig_hdfs/input.txt' AS (f1:chararray);
  extract = FOREACH data GENERATE myudf($0);
  DUMP extract;
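Since the question mentions that the real data is more complex, it may be worth guarding the array index so that one malformed record does not fail the whole job. The following is only a sketch of such a guard as a plain Java helper that `exec` could delegate to; the class name, the method name, and the choice to return null for bad records are assumptions, not something from the original post.

```java
// Hypothetical helper; the name and the null-on-bad-input policy are assumptions.
class NeededFieldExtractor {

    // Returns the second ';'-separated field, or null when the record is null
    // or contains no ';'. The -1 limit keeps trailing empty fields, so "a;" yields "".
    static String extract(String record) {
        if (record == null) {
            return null;
        }
        String[] parts = record.split(";", -1);
        return parts.length > 1 ? parts[1] : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("NotNeeded1,NotNeeded11;Needed1"));
        System.out.println(extract("NotNeeded2,NotNeeded22;Needed2"));
        System.out.println(extract("record-without-semicolon"));
    }
}
```

Returning null makes Pig emit an empty value for that row instead of aborting; whether that is preferable to failing fast depends on how the bad records should be handled downstream.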