Passing requirements.txt to a Google Cloud PySpark batch job

Posted by 5ssjco0h on 2022-10-07 in Go


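I am trying to run a PySpark script as a Google Dataproc batch job.

My script needs to connect to Firestore to collect some data, so it needs access to the firebase-admin library. When I run the script on Google Cloud with the following command: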

gcloud dataproc batches submit \
        --project {PROJECT} \
        --region europe-west1 \
        --subnet {SUBNET} \
        pyspark spark_image_matching/main.py \
        --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
        --deps-bucket={DEPS_BUCKET}

I get the following error:

Traceback (most recent call last):
  File "/tmp/srvls-batch-0127aaf6-a438-4439-af56-beb1a66f45ed/main.py", line 4, in <module>
    import firebase_admin
ModuleNotFoundError: No module named 'firebase_admin'
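
To narrow down which packages the batch runtime is actually missing, a small check like the one below could be placed at the top of main.py before the real imports. This is only a diagnostic sketch: the helper name missing_modules and the list of modules passed to it are my own choices for illustration, not part of any Dataproc API.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be located on sys.path."""
    missing = []
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # find_spec can raise for broken or half-initialised modules;
            # treat those as missing for the purpose of this check.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Hypothetical list: the modules this job expects to import.
    print("missing on this interpreter:", missing_modules(["firebase_admin", "pyspark"]))
```

Running this inside the batch job would show in the driver log whether firebase_admin is the only missing dependency or whether others are absent as well.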

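I have already tried creating a setup.py file that declares the dependencies, building a .egg file from it, and passing that file with the --py-files flag. The idea was heavily inspired by this repo: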

https://github.com/GoogleCloudPlatform/dataproc-templates/blob/main/python/setup.py
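
For reference, the dependency part of such a setup.py can be sketched roughly as follows. This is not the code from the linked repo, just a minimal illustration of the approach: the helper names parse_requirements and build_setup_kwargs and the version string are hypothetical, and the package name is taken from the spark_image_matching directory in the command above.

```python
def parse_requirements(text):
    """Turn the contents of a requirements.txt into a list of requirement strings."""
    requirements = []
    for raw in text.splitlines():
        # Drop trailing comments, then surrounding whitespace.
        line = raw.split(" #", 1)[0].strip()
        # Skip blanks, full-line comments and pip options such as "-r other.txt".
        if not line or line.startswith("#") or line.startswith("-"):
            continue
        requirements.append(line)
    return requirements


def build_setup_kwargs(requirements_text):
    """Build the keyword arguments that would be handed to setuptools.setup()."""
    return {
        "name": "spark_image_matching",
        "version": "0.1.0",  # hypothetical version, for illustration only
        "packages": ["spark_image_matching"],
        "install_requires": parse_requirements(requirements_text),
    }
```

In setup.py this would be used as setuptools.setup(**build_setup_kwargs(open("requirements.txt").read())). As far as I understand, install_requires only records the dependencies in the package metadata, and --py-files ships just the archive itself, so firebase-admin would still have to reach the runtime some other way.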

pyspark

Source: https://stackoverflow.com/questions/73664892/passing-requirements-txt-to-google-cloud-pyspark-batch-job