I am trying to run a Hadoop single-node cluster on my personal computer (Linux Mint 17, Linux kernel 3.13). I want to run some Pig scripts for an online course I'm taking, but mostly because I'm unfamiliar with Hadoop itself and with Pig (even though I write Hive queries every day), I'm stuck.
I installed Hadoop 2.5.0 and Pig 0.13.0 following these two guides:
Installing Hadoop 2.2.0 on Ubuntu Linux 13.04 (single-node cluster)
How to install Pig & Hive on a Linux Mint VM
As far as I understand, Pig has two execution modes: local mode and MapReduce mode.
Local mode
Local mode is typically used to verify and debug Pig queries and/or scripts on smaller datasets that a single machine can handle. It runs in a single JVM and accesses the local filesystem. To run in local mode, pass the `local` option to the `-x` or `-exectype` parameter when starting Pig. This launches the interactive shell called Grunt.
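For example, local mode can be started like this (a minimal sketch; no Hadoop daemons need to be running):

```shell
# Start the Grunt shell in local mode; Pig runs in a single JVM
# and reads/writes the local filesystem instead of HDFS.
pig -x local

# Or run a script directly; example.pig and its input paths
# are then resolved against the local disk.
pig -x local example.pig
```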
MapReduce mode
In this mode, Pig translates queries into MapReduce jobs and runs them on a Hadoop cluster. The cluster can be pseudo-distributed or fully distributed.
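Analogously, MapReduce mode (the default when no `-x` flag is given) can be selected explicitly, a sketch:

```shell
# Start Grunt in MapReduce mode; Pig submits jobs to the cluster
# described by the Hadoop configuration files on PIG_CLASSPATH.
pig -x mapreduce

# Run a script as a batch of MapReduce jobs on the cluster:
pig -x mapreduce example.pig
```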
My course assignment asks the following questions:
How many MapReduce jobs does example.pig generate?
How many reduce tasks are in the first MapReduce job? How many reduce tasks are in the later MapReduce jobs?
How long does each job take? How long does the whole script take?
What is the schema of the tuples after each of the following commands in example.pig?
Given these questions, I assume I have to use MapReduce mode with the single-node Hadoop cluster I just created.
In this guide I read that I also have to run the following command, which I believe somehow connects Pig to Hadoop:
$ export PIG_CLASSPATH=$HADOOP_HOME/conf/
Running the `pig` command from the terminal gives me a lot of warnings:
$ pig
14/08/19 15:21:18 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
14/08/19 15:21:18 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
14/08/19 15:21:18 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2014-08-19 15:21:18,409 [main] INFO org.apache.pig.Main - Apache Pig version 0.13.0 (r1606446) compiled Jun 29 2014, 02:27:58
2014-08-19 15:21:18,409 [main] INFO org.apache.pig.Main - Logging error messages to: /home/gianluca/Dropbox/Data Analysis/online courses/intro to data science (coursera)/datasci_course_materials/assignment4/pigtest/pig_1408458078408.log
2014-08-19 15:21:18,429 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/gianluca/.pigbootup not found
2014-08-19 15:21:18,837 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-08-19 15:21:18,837 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-08-19 15:21:18,837 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
2014-08-19 15:21:19,670 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
Now, my doubt is: do I have to start the Hadoop cluster before launching Pig in MapReduce mode?
If I don't start the cluster and just run the script in MapReduce mode, I get the following error message:
$ pig -x mapreduce example.pig
14/08/19 15:56:46 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
14/08/19 15:56:46 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
14/08/19 15:56:46 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2014-08-19 15:56:46,818 [main] INFO org.apache.pig.Main - Apache Pig version 0.13.0 (r1606446) compiled Jun 29 2014, 02:27:58
2014-08-19 15:56:46,818 [main] INFO org.apache.pig.Main - Logging error messages to: /home/gianluca/Dropbox/Data Analysis/online courses/intro to data science (coursera)/datasci_course_materials/assignment4/pigtest/pig_1408460206817.log
2014-08-19 15:56:47,418 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/gianluca/.pigbootup not found
2014-08-19 15:56:47,630 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-08-19 15:56:47,630 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-08-19 15:56:47,630 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
2014-08-19 15:56:48,524 [main] ERROR org.apache.pig.PigServer - exception during parsing: Error during parsing. Unable to check name hdfs://localhost:9000/user/gianluca
Failed to parse: Pig script failed to parse:
<file example.pig, line 7, column 6> pig script failed to validate: org.apache.pig.backend.datastorage.DataStorageException: ERROR 6007: Unable to check name hdfs://localhost:9000/user/gianluca
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:196)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1712)
at org.apache.pig.PigServer$Graph.access$000(PigServer.java:1420)
at org.apache.pig.PigServer.parseAndBuild(PigServer.java:364)
at org.apache.pig.PigServer.executeBatch(PigServer.java:389)
at org.apache.pig.PigServer.executeBatch(PigServer.java:375)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:170)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:232)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:203)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
at org.apache.pig.Main.run(Main.java:608)
at org.apache.pig.Main.main(Main.java:156)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by:
<file example.pig, line 7, column 6> pig script failed to validate: org.apache.pig.backend.datastorage.DataStorageException: ERROR 6007: Unable to check name hdfs://localhost:9000/user/gianluca
at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:881)
at org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3568)
at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1625)
at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:1102)
at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:560)
at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:421)
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:188)
... 16 more
Caused by: org.apache.pig.backend.datastorage.DataStorageException: ERROR 6007: Unable to check name hdfs://localhost:9000/user/gianluca
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.isContainer(HDataStorage.java:207)
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.asElement(HDataStorage.java:128)
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.asElement(HDataStorage.java:138)
at org.apache.pig.parser.QueryParserUtils.getCurrentDir(QueryParserUtils.java:90)
at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:873)
... 22 more
Caused by: java.net.ConnectException: Call From gianluca-Aspire-S3-391/127.0.1.1 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
at org.apache.hadoop.ipc.Client.call(Client.java:1415)
at org.apache.hadoop.ipc.Client.call(Client.java:1364)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:707)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1785)
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1068)
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1064)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1064)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398)
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.isContainer(HDataStorage.java:200)
... 26 more
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:606)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:700)
at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1463)
at org.apache.hadoop.ipc.Client.call(Client.java:1382)
... 44 more
2014-08-19 15:56:48,530 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 6007: Unable to check name hdfs://localhost:9000/user/gianluca
Details at logfile: /home/gianluca/Dropbox/Data Analysis/online courses/intro to data science (coursera)/datasci_course_materials/assignment4/pigtest/pig_1408460206817.log
1 Answer
Your problem is that your Pig script tries to connect to the cluster and fails:
Call From gianluca-Aspire-S3-391/127.0.1.1 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused
as seen in the log file.
Now, the cause of the problem is that your cluster runs on YARN, while the Pig script expects the (old) MR1 JobTracker to be running on port 9000.
You have two options:
a) Set up an MR1 JobTracker. See http://itellity.wordpress.com/2014/08/20/installing-hadoop-chd4-mr1-on-mac-os-x/ for an example (it works on Ubuntu as well).
b) Configure Pig to use YARN at runtime:
export HADOOP_CONF_DIR=/yourhadoopsite/conf
This should let Pig read the Hadoop configuration and "discover" the correct settings.
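Independently of MR1 vs. YARN, the `Connection refused` on `localhost:9000` also appears whenever the HDFS daemons are simply not running, so yes, the cluster has to be started before launching Pig in MapReduce mode. For a Hadoop 2.5.0 single-node setup that looks roughly like this (a sketch, assuming Hadoop's `sbin` and `bin` directories are on the `PATH`):

```shell
# Start the HDFS daemons (NameNode, DataNode, SecondaryNameNode)
# and the YARN daemons (ResourceManager, NodeManager).
start-dfs.sh
start-yarn.sh

# Verify the daemons are up; jps should list NameNode, DataNode,
# SecondaryNameNode, ResourceManager and NodeManager.
jps

# Create the HDFS home directory that the error
# "Unable to check name hdfs://localhost:9000/user/gianluca" refers to:
hdfs dfs -mkdir -p /user/gianluca
```

After that, rerunning `pig -x mapreduce example.pig` should be able to reach `hdfs://localhost:9000`.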