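I'm trying to parse a fairly simple JSON file with Pig and Twitter's elephant-bird library, but it has turned into a very painful debugging process.
The JSON has the following structure: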
oid_id: (oid:chararray),
bookmarks: {(
oid_id:(oid:chararray),
id:chararray,
creator: chararray,
position:chararray,
creationdate:(date:chararray)
)},
lastaction:(date:chararray),
settings:(preferredlanguage:chararray),
userid:chararray
Example line:
{"oid_id":{"oid":"573239f905474a686e2333f0"},"bookmarks":[{"id":"legoninx106w0079264","creator":"player","position":96,"creationdate":{"date":"2016-12-26t09:37:36.916z"},"oid_id":{"oid":"5860e4e0ca6baf9032edc0d0"},"id":"onepercentmw0128677","creator":"player","position":0.08,"creationdate":{"date":"2018-12-18t15:42:33.956z"},"oid_id":{"oid":"5c191569faf8474953758930"}],"lastaction":{"date":"2018-12-18t15:42:28.107z"},"settings":{"preferredlanguage":"vf","preferredvideoquality":"hd"},"userid":"ocs_32a6ad6dd242d5e3842f9211fd236723_1461773211"}
Here is my code (inspired by this tutorial: https://acadgild.com/blog/determining-popular-hashtags-in-twitter-using-pig)
register /path/to/json-simple-1.1.1.jar
register /path/to/elephant-bird-core-4.17.jar
register /path/to/elephant-bird-pig-4.17.jar
register /path/to/elephant-bird-hadoop-compat-4.17.jar
define JsonLoaderEB com.twitter.elephantbird.pig.load.JsonLoader;
A = LOAD 'file.json' USING JsonLoaderEB('-nestedLoad=true') as myMap;
describe A;
A: {myMap: bytearray}
B = foreach A generate flatten(myMap#'bookmarks') as (bookmark:map[]);
describe B;
B: {bookmark: map[]}
When we dump the relation above, we can see that all the data has been loaded successfully.
([{"oid_id":{"oid":"5860e4e0ca6baf9032edc0d0"},"creator":"player","creationdate":{"date":"2016-12-26t09:37:36.916z"},"id":"legoninx106w0079264","position":96},"oid_id":{"5c191569faf8475538930"},"creator":"player","creationdate":{"date":"2018-12-18t15:42:33.956z"},"id":"onepercentmw0128677","position":0.08}])
Now we extract creationdate, creator, id, and position from each bookmark.
C = foreach B generate bookmark#'creationdate' as date_fact, bookmark#'creator' as creator, bookmark#'id' as id, bookmark#'position' as position;
C: {date_fact: bytearray,creator: bytearray,id: bytearray,position: bytearray}
Dumping the relation produces the following error:
Pig Stack Trace
ERROR 1066: Unable to open iterator for alias C. Backend error : Vertex failed, vertexName=scope-41, vertexId=vertex_1542613138136_672188_2_00, diagnostics=[Task failed, taskId=task_1542613138136_672188_2_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : attempt_1542613138136_672188_2_00_000000_0:org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: C: Store(hdfs://sandbox/tmp/temp-1543074195/tmp277240455:org.apache.pig.impl.io.InterStorage) - scope-40 Operator Key: scope-40): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing [POMapLookUp (Name: POMapLookUp[bytearray] - scope-28 Operator Key: scope-28) children: null at [null[4,31]]]: java.lang.ClassCastException: java.lang.String cannot be cast to java.util.Map
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:315)
    at org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POStoreTez.getNextTuple(POStoreTez.java:123)
    at org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.runPipeline(PigProcessor.java:376)
    at org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.run(PigProcessor.java:241)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
    at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
    at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
    at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing [POMapLookUp (Name: POMapLookUp[bytearray] - scope-28 Operator Key: scope-28) children: null at [null[4,31]]]: java.lang.ClassCastException: java.lang.String cannot be cast to java.util.Map
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:364)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:406)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:323)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:305)
1 Answer
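Even though this gives a good result for me for the table_extraction relation, the problem may come from the raw data. Please remove or correct the following object, which looks invalid: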
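If the raw data is the suspect, one way to narrow it down is to check each input line for valid JSON before handing the file to Pig. The following is only a minimal sketch in Python, separate from the Pig script, and it assumes the file holds one JSON document per line, as the JsonLoader usage in the question suggests. The function name find_invalid_json_lines and the two sample lines are hypothetical and only illustrate the idea.

```python
import json


def find_invalid_json_lines(lines):
    """Return (line_number, reason) pairs for lines that are not valid JSON objects."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # blank lines are skipped, not reported
        try:
            value = json.loads(text)
        except json.JSONDecodeError as err:
            # err.msg and err.colno say what went wrong and where in the line
            problems.append((number, f"{err.msg} at column {err.colno}"))
            continue
        if not isinstance(value, dict):
            problems.append((number, "top-level value is not a JSON object"))
    return problems


if __name__ == "__main__":
    # Hypothetical sample lines: the second is missing the braces between
    # two bookmark objects, so it does not parse.
    sample = [
        '{"userid": "u1", "bookmarks": [{"id": "a", "position": 96}]}',
        '{"userid": "u2", "bookmarks": [{"id": "a"}, "id": "b"}]}',
    ]
    for number, reason in find_invalid_json_lines(sample):
        print(f"line {number}: {reason}")
```

To run it against the real input, pass an open file handle instead of the sample list, for example find_invalid_json_lines(open('file.json', encoding='utf-8')). Any line it reports can then be corrected or removed before loading the file in Pig.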