Submitting multiple PySpark applications concurrently from a shell script via Oozie

mxg2im7a  posted on 2021-07-13  in Spark

Use case - I am building a data sync application that reads data from Apache Kudu and writes it to MemSQL. The application is built entirely in Python and uses multiprocessing queue logic: each dataset is put on a queue as a task, and workers later pull these tasks concurrently and trigger a PySpark application to sync the data. The PySpark application takes the source and target details, connects to the source, reads the data, and pushes it to the target.
Problem - everything works like a charm as long as the number of workers stays at 2, but as soon as I increase the number of workers beyond 2, I keep getting random errors like the ones below while spawning the PySpark applications.
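
For reference, a minimal sketch of the pattern described above (this is not the actual application code; run_pipeline and the task fields are placeholders):

# Sketch of the described design: dataset tasks are placed on a multiprocessing
# queue and worker processes pull them and trigger one PySpark application per task.
from multiprocessing import Process, Queue

def run_pipeline(task):
    # placeholder for the real spark-submit call; it would pass the
    # source/destination details carried by the task
    print("syncing", task)

def worker(queue):
    while True:
        task = queue.get()
        if task is None:              # sentinel: no more tasks
            break
        run_pipeline(task)

if __name__ == "__main__":
    num_workers = 2                   # the failures start once this goes above 2
    queue = Queue()
    for task in [{"source": "abc", "destination": "def"}]:
        queue.put(task)
    workers = [Process(target=worker, args=(queue,)) for _ in range(num_workers)]
    for p in workers:
        p.start()
    for _ in workers:
        queue.put(None)               # one sentinel per worker
    for p in workers:
        p.join()
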
Error 1 -

21/02/24 14:00:37 INFO yarn.YarnAllocator: Received 2 containers from YARN, launching executors on 2 of them.
21/02/24 14:00:49 INFO yarn.YarnAllocator: Driver requested a total number of 1 executor(s).
21/02/24 14:00:50 INFO yarn.YarnAllocator: Driver requested a total number of 0 executor(s).
21/02/24 14:00:50 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Error 2 -

executor.CoarseGrainedExecutorBackend: Driver commanded a shutdown
21/02/24 14:00:39 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(2, XXXXXXXXXXXXXXXX, 41020, None)
21/02/24 14:00:50 INFO executor.CoarseGrainedExecutorBackend: Driver commanded a shutdown
21/02/24 14:00:50 INFO memory.MemoryStore: MemoryStore cleared
21/02/24 14:00:50 INFO storage.BlockManager: BlockManager stopped
21/02/24 14:00:50 INFO util.ShutdownHookManager: Shutdown hook called

Error 3 -

org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)
at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:493)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:277)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:805)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:804)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:804)
at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:836)
at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
Caused by: java.io.IOException: Failed to connect to peanut-worker-33.cdp.cargill.com/10.12.36.153:35039
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:250)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:192)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Below is the spark-submit command I use to submit the PySpark application:

export PYSPARK_PYTHON=/usr/share/cdp/python-3.6/bin/python3.6;
export PYSPARK_DRIVER_PYTHON=/usr/share/cdp/python-3.6/bin/python3.6;
spark-submit --master yarn --deploy-mode client \
  --conf spark.sql.autoBroadcastJoinThreshold=-1 \
  --conf spark.yarn.am.memory=10g \
  --conf spark.yarn.am.memoryOverhead=6g \
  --executor-memory 4g --num-executors 2 \
  --jars /efs/home/XXXXX/common/jdbc-2.5.41.jar,/efs/home/XXXXX/common/TCLIServiceClient.jar \
  --packages 'com.memsql:memsql-spark-connector_2.11:3.0.5-spark-2.4.4' \
  --files /efs/home/XXXXX/XXXXX.keytab,/efs/home/XXXXX/jaas.conf \
  --py-files /efs/home/XXXXX/prod/DataSync/pipeline/pipeline.py \
  /efs/home/XXXXX/prod/DataSync/pipeline/pipeline.py 'XXXX' prd {source:abc,destination:def}
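
Each worker runs this command from Python; a rough sketch of how that invocation could be made with subprocess (spark-submit is assumed to be on the PATH; submit_pipeline and the per-task log path are hypothetical, and most flags are elided for brevity):

import os
import subprocess

def submit_pipeline(task_name, env_name, config, log_path):
    # hypothetical helper: run the spark-submit shown above for one task,
    # exporting the Python interpreters via the environment and keeping a
    # per-task log so individual failures can be inspected afterwards
    env = dict(os.environ,
               PYSPARK_PYTHON="/usr/share/cdp/python-3.6/bin/python3.6",
               PYSPARK_DRIVER_PYTHON="/usr/share/cdp/python-3.6/bin/python3.6")
    cmd = [
        "spark-submit", "--master", "yarn", "--deploy-mode", "client",
        "--conf", "spark.sql.autoBroadcastJoinThreshold=-1",
        "--executor-memory", "4g", "--num-executors", "2",
        # ... remaining --conf/--jars/--packages/--files flags as above ...
        "/efs/home/XXXXX/prod/DataSync/pipeline/pipeline.py",
        task_name, env_name, config,
    ]
    with open(log_path, "w") as log:
        result = subprocess.run(cmd, env=env, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode

# example call mirroring the command above:
# submit_pipeline("XXXX", "prd", "{source:abc,destination:def}", "/tmp/XXXX.log")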

This runs perfectly fine with 4 workers in the lower environments, but in the production environment it does not work at all with more than 2 workers.
