Can't run PySpark jobs on Dataproc

wz1wpwve · posted 2021-05-27 in Spark

I'm trying to run a PySpark job on a new Dataproc cluster created with:

gcloud beta dataproc clusters create ${CLUSTER_NAME} \
 --region=${REGION} \
 --image-version=1.4 \
 --master-machine-type=n1-standard-4 \
 --worker-machine-type=n1-standard-4 \
 --bucket=${BUCKET_NAME} \
 --optional-components=ANACONDA,JUPYTER \
 --enable-component-gateway
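
For reference, the cluster status and the master/worker instance configuration can be checked with the standard describe command (nothing here is specific to my setup):

# Shows cluster state plus the configured master and worker instances
gcloud dataproc clusters describe ${CLUSTER_NAME} \
    --region=${REGION}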

I then tried running several different jobs; anything that creates a Spark context gets stuck in an endless loop of:

WARN org.apache.spark.scheduler.cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
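
As far as I understand, this warning means YARN never grants the executors the job is asking for. A minimal sketch of the same test with deliberately small, explicit executor settings (the values are placeholders, not a confirmed fix):

import pyspark

# Sketch: request small executors that should easily fit on n1-standard-4
# workers; the instance/core/memory values below are arbitrary placeholders.
conf = (pyspark.SparkConf()
        .set('spark.executor.instances', '2')
        .set('spark.executor.cores', '1')
        .set('spark.executor.memory', '1g'))

sc = pyspark.SparkContext(conf=conf)
print(sc.parallelize(range(10)).sum())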

Example script:

import pyspark

sc = pyspark.SparkContext()
rdd = sc.parallelize(['Hello,', 'world!', 'dog', 'elephant', 'panther'])
words = sorted(rdd.collect())
print(words)

Submitted to the cluster with:

gcloud dataproc jobs submit pyspark pyspark_sort.py \
    --cluster=${CLUSTER_NAME} \
    --region=us-central1
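
If it is a resource-sizing issue, I believe the same executor properties can also be set at submit time via --properties (again, placeholder values):

# Same idea, overriding the executor sizing when submitting the job
gcloud dataproc jobs submit pyspark pyspark_sort.py \
    --cluster=${CLUSTER_NAME} \
    --region=us-central1 \
    --properties=spark.executor.instances=2,spark.executor.cores=1,spark.executor.memory=1g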

Whereas when I run this:

import getpass
import sys
import imp

print('This job is running as "{}".'.format(getpass.getuser()))
print(sys.executable, sys.version_info)
for package in sys.argv[1:]:
    print(imp.find_module(package))

I get a successful output:

Job [f10a70ab95c54c8a9fa471429b91153b] submitted.
Waiting for job output...
This job is running as "root".
/opt/conda/default/bin/python sys.version_info(major=3, minor=6, micro=10, releaselevel='final', serial=0)
(None, '/opt/conda/default/lib/python3.6/site-packages/pandas', ('', '', 5))
(None, '/opt/conda/default/lib/python3.6/site-packages/scipy', ('', '', 5))
Job [f10a70ab95c54c8a9fa471429b91153b] finished successfully.

Thanks in advance!
