执行器与驱动程序之间没有连接?

wnrlj8wa  于 2021-06-26  发布在  Mesos
关注(0)|答案(3)|浏览(592)

更新:问题已解决。docker图片如下:docker spark submit
我在docker容器中用一个胖jar运行spark submit。我的独立spark群集运行在3个虚拟机上—一个主服务器和两个工作服务器。从工作机上的executor日志中,我看到executor具有以下驱动程序url:
--驱动程序urlspark://coarsegrainedscheduler@172.17.0.2:5001"
172.17.0.2实际上是包含驱动程序的容器的地址,而不是运行容器的主机。无法从工作机访问此ip,因此工作机无法与驱动程序通信。从StandalonesSchedulerBackend的源代码中可以看到,它使用spark.driver.host设置构建driverurl:

val driverUrl = RpcEndpointAddress(
  sc.conf.get("spark.driver.host"),
  sc.conf.get("spark.driver.port").toInt,
  CoarseGrainedSchedulerBackend.ENDPOINT_NAME).toString

它没有考虑spark\u public\u dns环境变量-这是正确的吗?在容器中,我不能将spark.driver.host设置为除容器“internal”ip地址(本例中为172.17.0.2)之外的任何其他地址。尝试将spark.driver.host设置为主机的ip地址时,会出现如下错误:
warn utils:服务“sparkdriver”无法绑定到端口5001。正在尝试端口5002。
我试图将spark.driver.bindaddress设置为主机的ip地址,但出现了相同的错误。那么,如何配置spark使用主机ip地址而不是docker容器地址与驱动程序通信呢?
upd:来自执行器的堆栈跟踪:

ERROR RpcOutboxMessage: Ask timeout before connecting successfully
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
    at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216)
    at scala.util.Try$.apply(Try.scala:192)
    at scala.util.Failure.recover(Try.scala:216)
    at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
    at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
    at org.spark_project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
    at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136)
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
    at scala.concurrent.Promise$class.complete(Promise.scala:55)
    at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
    at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
    at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
    at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63)
    at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78)
    at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
    at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
    at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
    at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54)
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
    at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
    at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
    at org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.scala:205)
    at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:239)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply in 120 seconds
    ... 8 more
wgx48brx

wgx48brx1#

我的设置,使用docker和macos:
在同一docker容器中运行spark 1.6.3 master+worker
从macos运行java应用程序(通过ide)
docker compose打开端口:

ports:
- 7077:7077
- 20002:20002
- 6060:6060

java配置(用于开发目的):

esSparkConf.setMaster("spark://127.0.0.1:7077");
        esSparkConf.setAppName("datahub_dev");

        esSparkConf.setIfMissing("spark.driver.port", "20002");
        esSparkConf.setIfMissing("spark.driver.host", "MAC_OS_LAN_IP");
        esSparkConf.setIfMissing("spark.driver.bindAddress", "0.0.0.0");
        esSparkConf.setIfMissing("spark.blockManager.port", "6060");
o75abkj4

o75abkj42#

我注意到其他答案是使用spark standalone(在vms上,如op或 127.0.0.1 作为另一个答案)。
我想展示一下运行一个 jupyter/pyspark-notebook 针对远程aws mesos集群,并在mac上的docker中本地运行容器。
但是,在这种情况下,这些说明适用, --net=host 只在linux主机上工作。
这里有一个重要的步骤-如链接中所述,在mesos slaves的操作系统上创建笔记本用户。
这个图对调试网络很有帮助,但是它没有提到spark.driver.blockmanager.port,它实际上是使这个工作正常的最后一个参数,我在spark文档中遗漏了这个参数。否则,mesos从机上的执行器也会尝试绑定该块管理器端口,而mesos拒绝分配该端口。

公开这些端口,以便您可以在本地访问jupyter和spark ui
jupyter用户界面( 8888 )
spark用户界面( 4040 )
这些端口使mesos可以回到驱动程序:重要:双向通信必须允许mesos的主人,奴隶和zookepeeper以及。。。
“libprocess”地址+端口似乎通过 LIBPROCESS_PORT 变量(random:37899). 参考:mesos文件
Spark驱动器端口(random:33139)+16用于 spark.port.maxRetries Spark块管理器端口(random:45029)+16用于 spark.port.maxRetries 不太相关,但我使用的是jupyter实验室接口

export EXT_IP=<your external IP>

docker run \
  -p 8888:8888 -p 4040:4040 \
  -p 37899:37899 \
  -p 33139-33155:33139-33155 \
  -p 45029-45045:45029-45045 \
  -e JUPYTER_ENABLE_LAB=y \
  -e EXT_IP \
  -e LIBPROCESS_ADVERTISE_IP=${EXT_IP} \
  -e LIBPROCESS_PORT=37899 \
  jupyter/pyspark-notebook

一旦开始,我就去 localhost:8888 地址为jupyter,只需打开一个终端就可以了 spark-shell 行动。我还可以为实际打包的代码添加卷装载,但这是下一步。
我没有编辑 spark-env.sh 或者 spark-default.conf ,所以我把所有相关的会议都传给 spark-shell 现在。提醒:这个在容器里面

spark-shell --master mesos://zk://quorum.in.aws:2181/mesos \
  --conf spark.executor.uri=https://path.to.http.server/spark-2.4.2-bin-hadoop2.7.tgz \
  --conf spark.cores.max=1 \
  --conf spark.executor.memory=1024m \
  --conf spark.driver.host=$LIBPROCESS_ADVERTISE_IP \
  --conf spark.driver.bindAddress=0.0.0.0 \
  --conf spark.driver.port=33139 \
  --conf spark.driver.blockManager.port=45029

这将加载spark repl,在一些关于查找mesos主机和注册框架的输出之后,我使用namenode ip从hdfs读取一些文件(尽管我怀疑任何其他可访问的文件系统或数据库都可以工作)
我得到了预期的结果

Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.2
      /_/

Using Scala version 2.12.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_202)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.read.text("hdfs://some.hdfs.namenode:9000/tmp/README.md").show(10)
+--------------------+
|               value|
+--------------------+
|      # Apache Spark|
|                    |
|Spark is a fast a...|
|high-level APIs i...|
|supports general ...|
|rich set of highe...|
|MLlib for machine...|
|and Spark Streami...|
|                    |
|<http://spark.apa...|
+--------------------+
only showing top 10 rows
6yt4nkrj

6yt4nkrj3#

因此,工作配置为:
将spark.driver.host设置为主机的ip地址
将spark.driver.bindaddress设置为容器的ip地址
正在工作的docker图像如下:docker spark submit。

相关问题