I am trying to run a Hadoop single-node cluster on my personal computer (Linux Mint 17, Linux kernel 3.13). I want to run some Pig scripts for an online course I'm taking, but mostly because I'm unfamiliar with Hadoop itself and with Pig (even though I write Hive queries every day), I'm stuck.
I installed Hadoop 2.5.0 and Pig 0.13.0 following these two guides:
Installing Hadoop 2.2.0 on Ubuntu Linux 13.04 (single-node cluster)
How to install Pig & Hive on a Linux Mint VM
As far as I understand, Pig has two execution modes: local mode and MapReduce mode.
Local mode
Local mode is typically used to verify and debug Pig queries and/or scripts on smaller datasets that a single machine can handle. It runs in a single JVM and accesses the local filesystem. To run in local mode, pass the `local` option to the `-x` or `-exectype` parameter when starting Pig. This launches the interactive shell called Grunt.
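For example, local mode can be started like this (a minimal sketch; no Hadoop daemons need to be running):

```shell
# Start the Grunt shell in local mode; Pig runs in a single JVM
# and reads/writes the local filesystem instead of HDFS.
pig -x local

# Or run a script directly; example.pig and its input paths
# are then resolved against the local disk.
pig -x local example.pig
```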
MapReduce mode
In this mode, Pig translates queries into MapReduce jobs and runs them on a Hadoop cluster. The cluster can be pseudo-distributed or fully distributed.
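Analogously, MapReduce mode (the default when no `-x` flag is given) can be selected explicitly, a sketch:

```shell
# Start Grunt in MapReduce mode; Pig submits jobs to the cluster
# described by the Hadoop configuration files on PIG_CLASSPATH.
pig -x mapreduce

# Run a script as a batch of MapReduce jobs on the cluster:
pig -x mapreduce example.pig
```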
My course assignment asks the following questions:
How many MapReduce jobs does example.pig generate?
How many reduce tasks are in the first MapReduce job? How many reduce tasks are in the later MapReduce jobs?
How long does each job take? How long does the whole script take?
What is the schema of the tuples after each of the following commands in example.pig?
Given these questions, I assume I have to use MapReduce mode with the single-node Hadoop cluster I just created.
In this guide I read that I also have to run the following command, which I believe somehow connects Pig to Hadoop:
$ export PIG_CLASSPATH=$HADOOP_HOME/conf/
Running the `pig` command from the terminal gives me a lot of warnings:
$ pig
14/08/19 15:21:18 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
14/08/19 15:21:18 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
14/08/19 15:21:18 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2014-08-19 15:21:18,409 [main] INFO org.apache.pig.Main - Apache Pig version 0.13.0 (r1606446) compiled Jun 29 2014, 02:27:58
2014-08-19 15:21:18,409 [main] INFO org.apache.pig.Main - Logging error messages to: /home/gianluca/Dropbox/Data Analysis/online courses/intro to data science (coursera)/datasci_course_materials/assignment4/pigtest/pig_1408458078408.log
2014-08-19 15:21:18,429 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/gianluca/.pigbootup not found
2014-08-19 15:21:18,837 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-08-19 15:21:18,837 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-08-19 15:21:18,837 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
2014-08-19 15:21:19,670 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
Now, my doubt is: do I have to start the Hadoop cluster before launching Pig in MapReduce mode?
If I don't start the cluster and just run the script in MapReduce mode, I get the following error message:
$ pig -x mapreduce example.pig
14/08/19 15:56:46 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
14/08/19 15:56:46 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
14/08/19 15:56:46 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2014-08-19 15:56:46,818 [main] INFO org.apache.pig.Main - Apache Pig version 0.13.0 (r1606446) compiled Jun 29 2014, 02:27:58
2014-08-19 15:56:46,818 [main] INFO org.apache.pig.Main - Logging error messages to: /home/gianluca/Dropbox/Data Analysis/online courses/intro to data science (coursera)/datasci_course_materials/assignment4/pigtest/pig_1408460206817.log
2014-08-19 15:56:47,418 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/gianluca/.pigbootup not found
2014-08-19 15:56:47,630 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-08-19 15:56:47,630 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-08-19 15:56:47,630 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
2014-08-19 15:56:48,524 [main] ERROR org.apache.pig.PigServer - exception during parsing: Error during parsing. Unable to check name hdfs://localhost:9000/user/gianluca
Failed to parse: Pig script failed to parse:
<file example.pig, line 7, column 6> pig script failed to validate: org.apache.pig.backend.datastorage.DataStorageException: ERROR 6007: Unable to check name hdfs://localhost:9000/user/gianluca
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:196)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1712)
at org.apache.pig.PigServer$Graph.access$000(PigServer.java:1420)
at org.apache.pig.PigServer.parseAndBuild(PigServer.java:364)
at org.apache.pig.PigServer.executeBatch(PigServer.java:389)
at org.apache.pig.PigServer.executeBatch(PigServer.java:375)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:170)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:232)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:203)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
at org.apache.pig.Main.run(Main.java:608)
at org.apache.pig.Main.main(Main.java:156)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by:
<file example.pig, line 7, column 6> pig script failed to validate: org.apache.pig.backend.datastorage.DataStorageException: ERROR 6007: Unable to check name hdfs://localhost:9000/user/gianluca
at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:881)
at org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3568)
at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1625)
at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:1102)
at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:560)
at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:421)
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:188)
... 16 more
Caused by: org.apache.pig.backend.datastorage.DataStorageException: ERROR 6007: Unable to check name hdfs://localhost:9000/user/gianluca
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.isContainer(HDataStorage.java:207)
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.asElement(HDataStorage.java:128)
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.asElement(HDataStorage.java:138)
at org.apache.pig.parser.QueryParserUtils.getCurrentDir(QueryParserUtils.java:90)
at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:873)
... 22 more
Caused by: java.net.ConnectException: Call From gianluca-Aspire-S3-391/127.0.1.1 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
at org.apache.hadoop.ipc.Client.call(Client.java:1415)
at org.apache.hadoop.ipc.Client.call(Client.java:1364)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:707)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1785)
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1068)
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1064)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1064)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398)
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.isContainer(HDataStorage.java:200)
... 26 more
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:606)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:700)
at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1463)
at org.apache.hadoop.ipc.Client.call(Client.java:1382)
... 44 more
2014-08-19 15:56:48,530 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 6007: Unable to check name hdfs://localhost:9000/user/gianluca
Details at logfile: /home/gianluca/Dropbox/Data Analysis/online courses/intro to data science (coursera)/datasci_course_materials/assignment4/pigtest/pig_1408460206817.log
1 Answer
Your problem is that your Pig script tries to connect to the cluster and fails:
Call From gianluca-Aspire-S3-391/127.0.1.1 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused
as seen in the log file.
Now, the cause of the problem is that your cluster runs on YARN, while the Pig script expects the (old) MR1 JobTracker to be running on port 9000.
You have two options:
a) Set up an MR1 JobTracker. See http://itellity.wordpress.com/2014/08/20/installing-hadoop-chd4-mr1-on-mac-os-x/ for an example (it works on Ubuntu as well).
b) Configure Pig to use YARN at runtime:
export HADOOP_CONF_DIR=/yourhadoopsite/conf
This should let Pig read the Hadoop configuration and "discover" the correct settings.
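Independently of MR1 vs. YARN, the `Connection refused` on `localhost:9000` also appears whenever the HDFS daemons are simply not running, so yes, the cluster has to be started before launching Pig in MapReduce mode. For a Hadoop 2.5.0 single-node setup that looks roughly like this (a sketch, assuming Hadoop's `sbin` and `bin` directories are on the `PATH`):

```shell
# Start the HDFS daemons (NameNode, DataNode, SecondaryNameNode)
# and the YARN daemons (ResourceManager, NodeManager).
start-dfs.sh
start-yarn.sh

# Verify the daemons are up; jps should list NameNode, DataNode,
# SecondaryNameNode, ResourceManager and NodeManager.
jps

# Create the HDFS home directory that the error
# "Unable to check name hdfs://localhost:9000/user/gianluca" refers to:
hdfs dfs -mkdir -p /user/gianluca
```

After that, rerunning `pig -x mapreduce example.pig` should be able to reach `hdfs://localhost:9000`.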