使用以下脚本在yarn(hadoop 2.6.0.2.2.0.0-2041)上运行spark 1.3.0 pi example时:
# Run on a YARN cluster
export HADOOP_CONF_DIR=/etc/hadoop/conf
/var/home2/test/spark/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--executor-memory 3G \
--num-executors 50 \
/var/home2/test/spark/lib/spark-examples-1.3.0-hadoop2.4.0.jar \
1000
它失败,并显示“由于am容器,应用程序失败2次”消息(请参见下文)。据我所知,在yarn模式下运行spark应用程序的所有必要信息都在这个启动脚本中提供。在Yarn上还应该配置什么。少了什么?Yarn上市失败的其他原因?
[test@etl-hdp-mgmt pi]$ ./run-pi.sh
Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/04/01 12:59:57 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/04/01 12:59:58 INFO client.RMProxy: Connecting to ResourceManager at etl-hdp-yarn.foo.bar.com/192.168.0.16:8050
15/04/01 12:59:58 INFO yarn.Client: Requesting a new application from cluster with 4 NodeManagers
15/04/01 12:59:58 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (4096 MB per container)
15/04/01 12:59:58 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/04/01 12:59:58 INFO yarn.Client: Setting up container launch context for our AM
15/04/01 12:59:58 INFO yarn.Client: Preparing resources for our AM container
15/04/01 12:59:59 WARN hdfs.BlockReaderLocal: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
15/04/01 12:59:59 INFO yarn.Client: Uploading resource file:/var/home2/test/spark-1.3.0-bin-hadoop2.4/lib/spark-assembly-1.3.0-hadoop2.4.0.jar -> hdfs://foo.bar.com:8020/user/test/.sparkStaging/application_1427875242006_0010/spark-assembly-1.3.0-hadoop2.4.0.jar
15/04/01 13:00:01 INFO yarn.Client: Uploading resource file:/var/home2/test/spark/lib/spark-examples-1.3.0-hadoop2.4.0.jar -> hdfs://foo.bar.com:8020/user/test/.sparkStaging/application_1427875242006_0010/spark-examples-1.3.0-hadoop2.4.0.jar
15/04/01 13:00:02 INFO yarn.Client: Setting up the launch environment for our AM container
15/04/01 13:00:03 INFO spark.SecurityManager: Changing view acls to: test
15/04/01 13:00:03 INFO spark.SecurityManager: Changing modify acls to: test
15/04/01 13:00:03 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(test); users with modify permissions: Set(test)
15/04/01 13:00:03 INFO yarn.Client: Submitting application 10 to ResourceManager
15/04/01 13:00:03 INFO impl.YarnClientImpl: Submitted application application_1427875242006_0010
15/04/01 13:00:04 INFO yarn.Client: Application report for application_1427875242006_0010 (state: ACCEPTED)
15/04/01 13:00:04 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1427893202566
final status: UNDEFINED
tracking URL: http://etl-hdp-yarn.foo.bar.com:8088/proxy/application_1427875242006_0010/
user: test
15/04/01 13:00:05 INFO yarn.Client: Application report for application_1427875242006_0010 (state: ACCEPTED)
15/04/01 13:00:06 INFO yarn.Client: Application report for application_1427875242006_0010 (state: ACCEPTED)
15/04/01 13:00:07 INFO yarn.Client: Application report for application_1427875242006_0010 (state: ACCEPTED)
15/04/01 13:00:08 INFO yarn.Client: Application report for application_1427875242006_0010 (state: ACCEPTED)
15/04/01 13:00:09 INFO yarn.Client: Application report for application_1427875242006_0010 (state: FAILED)
15/04/01 13:00:09 INFO yarn.Client:
client token: N/A
diagnostics: Application application_1427875242006_0010 failed 2 times due to AM Container for appattempt_1427875242006_0010_000002 exited with exitCode: 1
For more detailed output, check application tracking page:http://etl-hdp-yarn.foo.bar.com:8088/proxy/application_1427875242006_0010/Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1427875242006_0010_02_000001
Exit code: 1
Exception message: /mnt/hdfs01/hadoop/yarn/local/usercache/test/appcache/application_1427875242006_0010/container_1427875242006_0010_02_000001/launch_container.sh: line 27: $PWD:$PWD/__spark__.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure: bad substitution
Stack trace: ExitCodeException exitCode=1: /mnt/hdfs01/hadoop/yarn/local/usercache/test/appcache/application_1427875242006_0010/container_1427875242006_0010_02_000001/launch_container.sh: line 27: $PWD:$PWD/__spark__.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure: bad substitution
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1427893202566
final status: FAILED
tracking URL: http://etl-hdp-yarn.foo.bar.com:8088/cluster/app/application_1427875242006_0010
user: test
Exception in thread "main" org.apache.spark.SparkException: Application finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:622)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:647)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
4条答案
按热度按时间7d7tgy0s1#
这就是spark与应用程序主机通信的问题。
rm和nm通过rpc相互通信,因此问题可能是launch\u container.cmd没有正确运行。在提交作业时检查nm是否与rm通信
尝试将此添加到yarn-site.xml:
这将确保看到的nm错误中的launch_container.cmd不会被删除(将保留20分钟-如果需要,增加1200到更高的数字)。现在,您可以尝试从container dir手动运行launch\u container.cmd脚本,看看它从何处退出。
希望这对你有帮助。
oxcyiej72#
我也面临着类似的问题。实际上,当您在集群中运行自包含的应用程序时,您不需要提到--master warn cluster。
这个问题已经在cloudera论坛上解决了,请参阅https://community.cloudera.com/t5/advanced-analytics-apache-spark/issue-running-spark-application-in-yarn-cluster-mode/td-p/44570
syqv5f0l3#
我完全同意@seanowen。遵循spark构建文档。
您需要使用hadoop集群的正确配置(版本、配置单元支持等)编译spark for yarn。
那问题就不会持续下去了!
fdx2calv4#
跑
那里的日志应该指出它失败的原因。
发生“失败2次”的原因是,当您以群集模式运行时,驱动程序在am中运行,默认情况下重试次数为2。
所以你的驱动程序被重试了两次。