I'm trying to run a PySpark job on a new Dataproc cluster created with:
gcloud beta dataproc clusters create ${CLUSTER_NAME} \
--region=${REGION} \
--image-version=1.4 \
--master-machine-type=n1-standard-4 \
--worker-machine-type=n1-standard-4 \
--bucket=${BUCKET_NAME} \
--optional-components=ANACONDA,JUPYTER \
--enable-component-gateway
Then I tried running different jobs, and anything that creates a Spark context just loops endlessly on:
WARN org.apache.spark.scheduler.cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
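In case the warning means the job is requesting more memory or cores than the n1-standard-4 workers can offer, this is the kind of variant I can use to pin the executor request to something tiny (the 1g / 1 core values are placeholders for the experiment, not settings from my real job):
import pyspark
# Same minimal job, but with an explicit, deliberately small executor request
# to rule out over-requesting resources; the values below are assumptions.
conf = (pyspark.SparkConf()
        .setAppName('resource-check')
        .set('spark.executor.memory', '1g')
        .set('spark.executor.cores', '1'))
sc = pyspark.SparkContext(conf=conf)
print(sc.parallelize(range(10)).sum())
sc.stop()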
Example script:
import pyspark
sc = pyspark.SparkContext()
rdd = sc.parallelize(['Hello,', 'world!', 'dog', 'elephant', 'panther'])
words = sorted(rdd.collect())
print(words)
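For completeness, here is the same sort written against the SparkSession entry point (a sketch assuming the Spark 2.x that ships with image version 1.4):
from pyspark.sql import SparkSession
# Same sort job via SparkSession; assumes the Spark 2.x bundled with image 1.4.
spark = SparkSession.builder.appName('pyspark_sort').getOrCreate()
rdd = spark.sparkContext.parallelize(['Hello,', 'world!', 'dog', 'elephant', 'panther'])
print(sorted(rdd.collect()))
spark.stop()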
Submitted to the cluster with:
gcloud dataproc jobs submit pyspark pyspark_sort.py \
--cluster=${BUCKET_NAME} \
--region=us-central1
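Since the warning tells me to check that workers are registered, this is roughly the check I would run from the master node over SSH (a sketch assuming the YARN ResourceManager is on its default port 8088 and that requests is available from the Anaconda component):
import requests
# Ask the YARN ResourceManager REST API for cluster metrics; assumes this runs
# on the master node and the RM listens on the default port 8088.
metrics = requests.get('http://localhost:8088/ws/v1/cluster/metrics').json()['clusterMetrics']
print('active NodeManagers :', metrics['activeNodes'])
print('available memory MB :', metrics['availableMB'])
print('available vcores    :', metrics['availableVirtualCores'])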
However, when I run the following:
import getpass
import sys
import imp
print('This job is running as "{}".'.format(getpass.getuser()))
print(sys.executable, sys.version_info)
for package in sys.argv[1:]:
    print(imp.find_module(package))
I get a successful output:
Job [f10a70ab95c54c8a9fa471429b91153b] submitted.
Waiting for job output...
This job is running as "root".
/opt/conda/default/bin/python sys.version_info(major=3, minor=6, micro=10, releaselevel='final', serial=0)
(None, '/opt/conda/default/lib/python3.6/site-packages/pandas', ('', '', 5))
(None, '/opt/conda/default/lib/python3.6/site-packages/scipy', ('', '', 5))
Job [f10a70ab95c54c8a9fa471429b91153b] finished successfully.
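As an aside, imp is deprecated in Python 3, so the same package check can be written with importlib (a sketch; find_spec returns a ModuleSpec instead of imp's tuple, which should be equivalent for this purpose):
import getpass
import importlib.util
import sys

print('This job is running as "{}".'.format(getpass.getuser()))
print(sys.executable, sys.version_info)
for package in sys.argv[1:]:
    # find_spec returns a ModuleSpec, or None if the package is not installed.
    print(importlib.util.find_spec(package))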
Thanks in advance!