Error: H2O cluster should be of size 3 but is 2

Posted by jjhzyzn0 on 2021-05-17 in Spark

I am running H2O Sparkling Water (SW) on Kubernetes, following the steps in the documentation.
I launched a SW test application:

    $ bin/spark-submit \
        --master k8s://$KUBERNETES_ENDPOINT \
        --deploy-mode cluster \
        --class ai.h2o.sparkling.InitTest \
        --conf spark.scheduler.minRegisteredResourcesRatio=1 \
        --conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:3.32.0.2-1-2.4 \
        --conf spark.executor.instances=3 \
        local:///opt/sparkling-water/tests/initTest.jar

The H2O Flow UI seems to be working, since I can reach it after running:

    $ kubectl port-forward ai-h2o-sparkling-inittest-1606331533023-driver 54322:54322

When I look at the logs of the Sparkling Water driver pod that was created, I see the following:

    $ kubectl logs ai-h2o-sparkling-inittest-1606331533023-driver
    20/11/25 19:14:14 INFO SignalUtils: Registered signal handler for INT
    20/11/25 19:14:22 INFO Server: jetty-9.4.z-SNAPSHOT; built: 2018-06-05T18:24:03.829Z; git: d5fc0523cfa96bfebfbda19606cad384d772f04c; jvm 1.8.0_275-b01
    20/11/25 19:14:23 INFO ContextHandler: Started a.h.o.e.j.s.ServletContextHandler@5af7a7{/,null,AVAILABLE}
    20/11/25 19:14:23 INFO AbstractConnector: Started ServerConnector@63f4e498{HTTP/1.1,[http/1.1]}{0.0.0.0:54321}
    20/11/25 19:14:23 INFO Server: Started @90939ms
    20/11/25 19:14:23 INFO RestApiUtils: H2O node http://10.244.1.4:54321/3/Cloud successfully responded for the GET.
    20/11/25 19:14:23 INFO H2OContext: Sparkling Water 3.32.0.2-1-2.4 started, status of context:
    Sparkling Water Context:
     * Sparkling Water Version: 3.32.0.2-1-2.4
     * H2O name: root
     * cluster size: 2
     * list of used nodes:
      (executorId, host, port)
      ------------------------
      (0,10.244.1.4,54321)
      (1,10.244.0.10,54321)
      ------------------------
    Open H2O Flow in browser: http://ai-h2o-sparkling-inittest-1606331533023-driver-svc.default.svc:54321 (CMD + click in Mac OSX)
    Exception in thread "main" java.lang.RuntimeException: H2O cluster should be of size 3 but is 2
        at ai.h2o.sparkling.InitTest$.main(InitTest.scala:34)
        at ai.h2o.sparkling.InitTest.main(InitTest.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

When I list the pods created by SW, one executor is stuck in the Pending state and never reaches Running:

    $ kubectl get pods
    NAME                                             READY   STATUS    RESTARTS   AGE
    ai-h2o-sparkling-inittest-1606331533023-driver   1/1     Running   0          13m
    app-name-1606331575519-exec-1                    1/1     Running   0          12m
    app-name-1606331575797-exec-2                    1/1     Running   0          12m
    app-name-1606331575816-exec-3                    0/1     Pending   0          12m

Is there a way to fix this?

4jb9z9bj  #1

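This appears to be caused by the k8s cluster not having enough CPU (it is a small cluster), so the third executor pod could not be scheduled.
Reducing the number of executors from 3 to 2 when launching SW resolved the problem: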

    bin/spark-submit \
        --master k8s://$KUBERNETES_ENDPOINT \
        --deploy-mode cluster \
        --class ai.h2o.sparkling.InitTest \
        --conf spark.scheduler.minRegisteredResourcesRatio=1 \
        --conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:3.32.0.2-1-2.4 \
        --conf spark.executor.instances=2 \
        local:///opt/sparkling-water/tests/initTest.jar
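
To check whether CPU really is the bottleneck before lowering the executor count, something like the sketch below may help. It is not part of the original answer. The function names `diagnose_pending_pod` and `max_executors` are hypothetical helpers, and the numbers in the example are made up. A pod that cannot be scheduled for lack of CPU typically shows a FailedScheduling event mentioning insufficient CPU in the `kubectl describe pod` output.

```shell
#!/bin/sh
# Hypothetical helper: show why a pod is Pending and how much CPU the nodes
# have already handed out. Requires kubectl and access to the cluster.
diagnose_pending_pod() {
  pod_name=$1
  # The Events section at the end usually states the scheduling failure reason.
  kubectl describe pod "$pod_name"
  # The "Allocated resources" section of each node shows CPU requests vs. capacity.
  kubectl describe nodes
}

# Hypothetical helper: rough upper bound on how many executors fit.
# All arguments are in millicores. This ignores system pods and the fact
# that CPU is split across nodes, so the real number can be lower.
max_executors() {
  allocatable_mcpu=$1   # total allocatable CPU across all nodes
  driver_mcpu=$2        # CPU requested by the driver pod
  executor_mcpu=$3      # CPU requested by each executor pod
  echo $(( (allocatable_mcpu - driver_mcpu) / executor_mcpu ))
}

# Usage (made-up numbers, not taken from the cluster in the question):
#   diagnose_pending_pod app-name-1606331575816-exec-3
#   max_executors 3500 1000 1000
```

If the estimate comes out below the value passed to `spark.executor.instances`, either lower the executor count as above or add CPU to the cluster. Lowering the per-executor CPU request through Spark's `spark.kubernetes.executor.request.cores` setting might also work, but that was not tried in this thread.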
