pyspark: Spark executors keep exiting, initial job has not accepted any resources

zbq4xfa0 posted on 2024-01-06 in Spark

I have a remote standalone Spark cluster running in two Docker containers, spark-master and spark-worker. I am trying to run a simple Python program to test the connection to Spark, but I always get the following error:

    WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

Here is the code:

    from pyspark.sql import SparkSession

    if __name__ == '__main__':
        spark = SparkSession.builder.appName('test') \
            .master('spark://192.168.1.169:7077') \
            .config("spark.executor.memory", "512m") \
            .config('spark.cores.max', '1') \
            .config("spark.executor.cores", "1") \
            .config("spark.executor.instances", "1") \
            .getOrCreate()
        data = [("A", 1), ("B", 2), ("C", 3)]
        columns = ["Letter", "Number"]
        df = spark.createDataFrame(data, columns)
        df.show()
        spark.stop()


When I run this program, I can see a running application on the Spark web UI.
There are clearly more than enough resources for such a simple program. I have also tried tweaking the configuration parameters, but that did not help.
Looking at the executor summary and the master log in the Docker container, executors seem to be created and then exit repeatedly, roughly every 3 seconds:

    23/08/07 11:39:16 INFO Master: Registering app test
    23/08/07 11:39:16 INFO Master: Registered app test with ID app-20230807113916-0087
    23/08/07 11:39:16 INFO Master: Start scheduling for app app-20230807113916-0087 with rpId: 0
    23/08/07 11:39:16 INFO Master: Application app-20230807113916-0087 requested executors: Map(Profile: id = 0, executor resources: cores -> name: cores, amount: 1, script: , vendor: ,memory -> name: memory, amount: 512, script: , vendor: ,offHeap -> name: offHeap, amount: 0, script: , vendor: , task resources: cpus -> name: cpus, amount: 1.0 -> 1).
    23/08/07 11:39:16 INFO Master: Start scheduling for app app-20230807113916-0087 with rpId: 0
    23/08/07 11:39:16 INFO Master: Launching executor app-20230807113916-0087/0 on worker worker-20230705044801-172.26.0.3-35329
    23/08/07 11:39:16 INFO Master: Start scheduling for app app-20230807113916-0087 with rpId: 0
    23/08/07 11:39:19 INFO Master: Removing executor app-20230807113916-0087/0 because it is EXITED
    23/08/07 11:39:19 INFO Master: Start scheduling for app app-20230807113916-0087 with rpId: 0
    23/08/07 11:39:19 INFO Master: Launching executor app-20230807113916-0087/1 on worker worker-20230705044801-172.26.0.3-35329
    23/08/07 11:39:19 INFO Master: Start scheduling for app app-20230807113916-0087 with rpId: 0
    23/08/07 11:39:21 INFO Master: Removing executor app-20230807113916-0087/1 because it is EXITED
    23/08/07 11:39:21 INFO Master: Start scheduling for app app-20230807113916-0087 with rpId: 0
    23/08/07 11:39:21 INFO Master: Launching executor app-20230807113916-0087/2 on worker worker-20230705044801-172.26.0.3-35329
    23/08/07 11:39:21 INFO Master: Start scheduling for app app-20230807113916-0087 with rpId: 0
    23/08/07 11:39:24 INFO Master: Removing executor app-20230807113916-0087/2 because it is EXITED
    23/08/07 11:39:24 INFO Master: Start scheduling for app app-20230807113916-0087 with rpId: 0
    23/08/07 11:39:24 INFO Master: Launching executor app-20230807113916-0087/3 on worker worker-20230705044801-172.26.0.3-35329
    23/08/07 11:39:24 INFO Master: Start scheduling for app app-20230807113916-0087 with rpId: 0
    23/08/07 11:39:26 INFO Master: Removing executor app-20230807113916-0087/3 because it is EXITED


Worker log:

    23/08/07 11:39:16 INFO Worker: Asked to launch executor app-20230807113916-0087/0 for test
    23/08/07 11:39:16 INFO SecurityManager: Changing view acls to: spark
    23/08/07 11:39:16 INFO SecurityManager: Changing modify acls to: spark
    23/08/07 11:39:16 INFO SecurityManager: Changing view acls groups to:
    23/08/07 11:39:16 INFO SecurityManager: Changing modify acls groups to:
    23/08/07 11:39:16 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: spark; groups with view permissions: EMPTY; users with modify permissions: spark; groups with modify permissions: EMPTY
    23/08/07 11:39:16 INFO ExecutorRunner: Launch command: "/opt/bitnami/java/bin/java" "-cp" "/opt/bitnami/spark/conf/:/opt/bitnami/spark/jars/*" "-Xmx512M" "-Dspark.driver.port=57417" "-Djava.net.preferIPv6Addresses=false" "-XX:+IgnoreUnrecognizedVMOptions" "--add-opens=java.base/java.lang=ALL-UNNAMED" "--add-opens=java.base/java.lang.invoke=ALL-UNNAMED" "--add-opens=java.base/java.lang.reflect=ALL-UNNAMED" "--add-opens=java.base/java.io=ALL-UNNAMED" "--add-opens=java.base/java.net=ALL-UNNAMED" "--add-opens=java.base/java.nio=ALL-UNNAMED" "--add-opens=java.base/java.util=ALL-UNNAMED" "--add-opens=java.base/java.util.concurrent=ALL-UNNAMED" "--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED" "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED" "--add-opens=java.base/sun.nio.cs=ALL-UNNAMED" "--add-opens=java.base/sun.security.action=ALL-UNNAMED" "--add-opens=java.base/sun.util.calendar=ALL-UNNAMED" "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" "-Djdk.reflect.useDirectMethodHandle=false" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@172.17.0.2:57417" "--executor-id" "0" "--hostname" "172.26.0.3" "--cores" "1" "--app-id" "app-20230807113916-0087" "--worker-url" "spark://Worker@172.26.0.3:35329" "--resourceProfileId" "0"
    23/08/07 11:39:19 INFO Worker: Executor app-20230807113916-0087/0 finished with state EXITED message Command exited with code 1 exitStatus 1
    23/08/07 11:39:19 INFO ExternalShuffleBlockResolver: Clean up non-shuffle and non-RDD files associated with the finished executor 0
    23/08/07 11:39:19 INFO ExternalShuffleBlockResolver: Executor is not registered (appId=app-20230807113916-0087, execId=0)
    23/08/07 11:39:19 INFO Worker: Asked to launch executor app-20230807113916-0087/1 for test
    23/08/07 11:39:19 INFO SecurityManager: Changing view acls to: spark
    23/08/07 11:39:19 INFO SecurityManager: Changing modify acls to: spark
    23/08/07 11:39:19 INFO SecurityManager: Changing view acls groups to:
    23/08/07 11:39:19 INFO SecurityManager: Changing modify acls groups to:
    23/08/07 11:39:19 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: spark; groups with view permissions: EMPTY; users with modify permissions: spark; groups with modify permissions: EMPTY
    23/08/07 11:39:19 INFO ExecutorRunner: Launch command: "/opt/bitnami/java/bin/java" "-cp" "/opt/bitnami/spark/conf/:/opt/bitnami/spark/jars/*" "-Xmx512M" "-Dspark.driver.port=57417" "-Djava.net.preferIPv6Addresses=false" "-XX:+IgnoreUnrecognizedVMOptions" "--add-opens=java.base/java.lang=ALL-UNNAMED" "--add-opens=java.base/java.lang.invoke=ALL-UNNAMED" "--add-opens=java.base/java.lang.reflect=ALL-UNNAMED" "--add-opens=java.base/java.io=ALL-UNNAMED" "--add-opens=java.base/java.net=ALL-UNNAMED" "--add-opens=java.base/java.nio=ALL-UNNAMED" "--add-opens=java.base/java.util=ALL-UNNAMED" "--add-opens=java.base/java.util.concurrent=ALL-UNNAMED" "--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED" "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED" "--add-opens=java.base/sun.nio.cs=ALL-UNNAMED" "--add-opens=java.base/sun.security.action=ALL-UNNAMED" "--add-opens=java.base/sun.util.calendar=ALL-UNNAMED" "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" "-Djdk.reflect.useDirectMethodHandle=false" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@172.17.0.2:57417" "--executor-id" "1" "--hostname" "172.26.0.3" "--cores" "1" "--app-id" "app-20230807113916-0087" "--worker-url" "spark://Worker@172.26.0.3:35329" "--resourceProfileId" "0"
    23/08/07 11:39:21 INFO Worker: Executor app-20230807113916-0087/1 finished with state EXITED message Command exited with code 1 exitStatus 1
    23/08/07 11:39:21 INFO ExternalShuffleBlockResolver: Clean up non-shuffle and non-RDD files associated with the finished executor 1
    23/08/07 11:39:21 INFO ExternalShuffleBlockResolver: Executor is not registered (appId=app-20230807113916-0087, execId=1)
    23/08/07 11:39:21 INFO Worker: Asked to launch executor app-20230807113916-0087/2 for test
    23/08/07 11:39:21 INFO SecurityManager: Changing view acls to: spark
    23/08/07 11:39:21 INFO SecurityManager: Changing modify acls to: spark
    23/08/07 11:39:21 INFO SecurityManager: Changing view acls groups to:
    23/08/07 11:39:21 INFO SecurityManager: Changing modify acls groups to:
    23/08/07 11:39:21 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: spark; groups with view permissions: EMPTY; users with modify permissions: spark; groups with modify permissions: EMPTY
    23/08/07 11:39:21 INFO ExecutorRunner: Launch command: "/opt/bitnami/java/bin/java" "-cp" "/opt/bitnami/spark/conf/:/opt/bitnami/spark/jars/*" "-Xmx512M" "-Dspark.driver.port=57417" "-Djava.net.preferIPv6Addresses=false" "-XX:+IgnoreUnrecognizedVMOptions" "--add-opens=java.base/java.lang=ALL-UNNAMED" "--add-opens=java.base/java.lang.invoke=ALL-UNNAMED" "--add-opens=java.base/java.lang.reflect=ALL-UNNAMED" "--add-opens=java.base/java.io=ALL-UNNAMED" "--add-opens=java.base/java.net=ALL-UNNAMED" "--add-opens=java.base/java.nio=ALL-UNNAMED" "--add-opens=java.base/java.util=ALL-UNNAMED" "--add-opens=java.base/java.util.concurrent=ALL-UNNAMED" "--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED" "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED" "--add-opens=java.base/sun.nio.cs=ALL-UNNAMED" "--add-opens=java.base/sun.security.action=ALL-UNNAMED" "--add-opens=java.base/sun.util.calendar=ALL-UNNAMED" "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" "-Djdk.reflect.useDirectMethodHandle=false" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@172.17.0.2:57417" "--executor-id" "2" "--hostname" "172.26.0.3" "--cores" "1" "--app-id" "app-20230807113916-0087" "--worker-url" "spark://Worker@172.26.0.3:35329" "--resourceProfileId" "0"


However, I was able to run a sample jar in the Docker containers successfully.
Please advise how to fix this error. I am using Spark 3.4.1 on all nodes and in the Python program.
UPD: here is the log from the worker/app:

    Spark Executor Command: "/opt/bitnami/java/bin/java" "-cp"
    "/opt/bitnami/spark/conf/:/opt/bitnami/spark/jars/*" "-Xmx1024M" "-Dspark.port.maxRetries=65000" "-Dspark.driver.port=51772" "-Djava.net.preferIPv6Addresses=false" "-XX:+IgnoreUnrecognizedVMOptions" "--add-opens=java.base/java.lang=ALL-UNNAMED" "--add-opens=java.base/java.lang.invoke=ALL-UNNAMED" "--add-opens=java.base/java.lang.reflect=ALL-UNNAMED" "--add-opens=java.base/java.io=ALL-UNNAMED" "--add-opens=java.base/java.net=ALL-UNNAMED" "--add-opens=java.base/java.nio=ALL-UNNAMED" "--add-opens=java.base/java.util=ALL-UNNAMED" "--add-opens=java.base/java.util.concurrent=ALL-UNNAMED" "--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED" "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED" "--add-opens=java.base/sun.nio.cs=ALL-UNNAMED" "--add-opens=java.base/sun.security.action=ALL-UNNAMED" "--add-opens=java.base/sun.util.calendar=ALL-UNNAMED" "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" "-Djdk.reflect.useDirectMethodHandle=false" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@172.17.0.2:51772" "--executor-id" "0" "--hostname" "172.26.0.3" "--cores" "8" "--app-id" "app-20230828045524-0147" "--worker-url" "spark://Worker@172.26.0.3:40457" "--resourceProfileId" "0"
    ========================================
    Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
    23/08/28 04:55:25 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 27005@6181476d5774
    23/08/28 04:55:25 INFO SignalUtils: Registering signal handler for TERM
    23/08/28 04:55:25 INFO SignalUtils: Registering signal handler for HUP
    23/08/28 04:55:25 INFO SignalUtils: Registering signal handler for INT
    23/08/28 04:55:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    23/08/28 04:55:26 INFO SecurityManager: Changing view acls to: spark,root
    23/08/28 04:55:26 INFO SecurityManager: Changing modify acls to: spark,root
    23/08/28 04:55:26 INFO SecurityManager: Changing view acls groups to:
    23/08/28 04:55:26 INFO SecurityManager: Changing modify acls groups to:
    23/08/28 04:55:26 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: spark, root; groups with view permissions: EMPTY; users with modify permissions: spark, root; groups with modify permissions: EMPTY
    Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1894)
        at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:62)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:428)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:417)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
    Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
        at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:322)
        at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
        at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:102)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$9(CoarseGrainedExecutorBackend.scala:448)
        at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
        at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:985)
        at scala.collection.immutable.Range.foreach(Range.scala:158)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:984)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:446)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:63)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
        at java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
        at java.base/javax.security.auth.Subject.doAs(Subject.java:439)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
        ... 4 more
    Caused by: java.io.IOException: Failed to connect to /172.17.0.2:51772
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:284)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:214)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:226)
        at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:204)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:202)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:198)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:833)
    Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: /172.17.0.2:51772
    Caused by: java.net.ConnectException: Connection timed out
        at java.base/sun.nio.ch.Net.pollConnect(Native Method)
        at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
        at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:946)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:833)


ctrmrzij1#

I just solved the same problem.
Your Spark executors cannot reach the Spark driver: in your executor log, the executor times out connecting to 172.17.0.2:51772, which appears to be the Docker bridge address the driver advertised.

Solution 1

Consider adding this line to your configuration:

    .config("spark.driver.host", "<spark driver IP or hostname>") \

If you are running it in a container (e.g. a Kubernetes pod), add POD_IP to the pod environment and then:

    # requires `import os` at the top of the file
    .config("spark.driver.host", os.environ.get("POD_IP")) \
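
Putting Solution 1 together, here is a minimal sketch of the original test program with the driver host pinned. The DRIVER_HOST variable and its fallback value are assumptions: use whatever address your worker containers can actually reach (typically the host machine's LAN IP, not the Docker bridge IP such as 172.17.0.2 that the driver may advertise by default):

    import os
    from pyspark.sql import SparkSession

    if __name__ == '__main__':
        # Assumption: DRIVER_HOST holds an address reachable from the worker
        # containers; by default the driver may advertise the Docker bridge
        # IP (e.g. 172.17.0.2), which the executors cannot connect to.
        driver_host = os.environ.get("DRIVER_HOST", "192.168.1.169")
        spark = SparkSession.builder.appName('test') \
            .master('spark://192.168.1.169:7077') \
            .config("spark.driver.host", driver_host) \
            .config("spark.executor.memory", "512m") \
            .config('spark.cores.max', '1') \
            .getOrCreate()
        df = spark.createDataFrame([("A", 1), ("B", 2), ("C", 3)], ["Letter", "Number"])
        df.show()
        spark.stop()

If the job still hangs, verify from inside a worker container that the chosen driver host and spark.driver.port are actually reachable.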

Solution 2

Use Spark cluster deploy mode: the driver then runs inside the cluster itself, so the driver host is set automatically and is always reachable from the executors (see the sketch below).
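
For reference, a cluster-mode submission would look roughly like the sketch below. Note that the standalone master does not support cluster deploy mode for Python applications, so this route applies to jar applications (like the sample jar that ran successfully above); the class name and jar path are placeholders:

    # Placeholders: org.example.Main and /path/to/app.jar are illustrative only.
    spark-submit \
      --master spark://192.168.1.169:7077 \
      --deploy-mode cluster \
      --class org.example.Main \
      /path/to/app.jar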

