Unable to run a PySpark job on Dataproc

Asked by wz1wpwve on 2021-05-27 · Spark

I am trying to run a PySpark job on a new Dataproc cluster created with:

    gcloud beta dataproc clusters create ${CLUSTER_NAME} \
        --region=${REGION} \
        --image-version=1.4 \
        --master-machine-type=n1-standard-4 \
        --worker-machine-type=n1-standard-4 \
        --bucket=${BUCKET_NAME} \
        --optional-components=ANACONDA,JUPYTER \
        --enable-component-gateway
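
Since no --num-workers is passed, the cluster should come up with Dataproc's default of two workers. As a sanity check (not part of the original post; ${CLUSTER_NAME} and ${REGION} are the same variables as above), the cluster state and the provisioned VMs could be inspected with something like:

    # Show cluster status and worker configuration
    gcloud dataproc clusters describe ${CLUSTER_NAME} --region=${REGION}
    # List the VMs backing the cluster (master is ${CLUSTER_NAME}-m, workers ${CLUSTER_NAME}-w-*)
    gcloud compute instances list --filter="name ~ ^${CLUSTER_NAME}-"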

I then tried running several different jobs; anything that creates a Spark context ends up looping forever on:

    WARN org.apache.spark.scheduler.cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
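
The warning itself suggests checking whether any workers are registered with YARN. One way to do that (a sketch, not from the original post; ${ZONE} is an assumed variable for the cluster's zone) is to SSH to the master node, which Dataproc names ${CLUSTER_NAME}-m, and query YARN directly; the ResourceManager UI is also reachable through the Component Gateway enabled above:

    # SSH to the Dataproc master node
    gcloud compute ssh ${CLUSTER_NAME}-m --zone=${ZONE}
    # On the master: list every NodeManager YARN knows about, in any state
    yarn node -list -all
    # On the master: list applications, e.g. jobs stuck in the ACCEPTED state
    yarn application -list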

Example script:

    import pyspark
    sc = pyspark.SparkContext()
    rdd = sc.parallelize(['Hello,', 'world!', 'dog', 'elephant', 'panther'])
    words = sorted(rdd.collect())
    print(words)
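
One way to separate a problem in the script itself from a YARN resource problem (again only a sketch, not something from the original post) is to run the same file with a local master on the cluster's master node, bypassing YARN entirely; if this works, the issue is on the YARN side:

    # Run on ${CLUSTER_NAME}-m after SSH-ing in; local[*] uses the master's own cores
    spark-submit --master "local[*]" pyspark_sort.py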

The script (saved as pyspark_sort.py) is submitted to the cluster with:

    gcloud dataproc jobs submit pyspark pyspark_sort.py \
        --cluster=${BUCKET_NAME} \
        --region=us-central1
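
For reference, a couple of follow-up commands that might help narrow things down (a sketch with arbitrary values, not part of the original post): --cluster expects the Dataproc cluster name, and executor resource requests can be lowered at submission time via standard Spark properties in case the default executor size does not fit into the YARN containers:

    # Resubmit with deliberately small executor requests (values are arbitrary)
    gcloud dataproc jobs submit pyspark pyspark_sort.py \
        --cluster=${CLUSTER_NAME} \
        --region=us-central1 \
        --properties=spark.executor.memory=1g,spark.executor.cores=1
    # Inspect the state of an already-submitted job; <job-id> is the ID printed at submission
    gcloud dataproc jobs describe <job-id> --region=us-central1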

However, when I run this instead:

    import getpass
    import sys
    import imp
    print('This job is running as "{}".'.format(getpass.getuser()))
    print(sys.executable, sys.version_info)
    for package in sys.argv[1:]:
        print(imp.find_module(package))

I get successful output:

    Job [f10a70ab95c54c8a9fa471429b91153b] submitted.
    Waiting for job output...
    This job is running as "root".
    /opt/conda/default/bin/python sys.version_info(major=3, minor=6, micro=10, releaselevel='final', serial=0)
    (None, '/opt/conda/default/lib/python3.6/site-packages/pandas', ('', '', 5))
    (None, '/opt/conda/default/lib/python3.6/site-packages/scipy', ('', '', 5))
    Job [f10a70ab95c54c8a9fa471429b91153b] finished successfully.
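
As an aside (not part of the original post), imp has been deprecated since Python 3.4; the same environment check can be written against importlib, using the interpreter path the job reported:

    # Check whether a package resolves in the cluster's conda environment
    /opt/conda/default/bin/python -c "import importlib.util; print(importlib.util.find_spec('pandas'))"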

Thanks in advance!

No answers yet.
