EMR - PySpark: No module named 'boto3'

Asked by vqlkdk9b on 2023-05-16 in Spark

I'm launching an EMR cluster with the following create statement:

$ aws emr create-cluster \
 --name "my_cluster" \
 --log-uri "s3n://somebucket/" \
 --release-label "emr-6.8.0" \
 --service-role "arn:aws:iam::XXXXXXXXXX:role/EMR_DefaultRole" \
 --ec2-attributes '{"InstanceProfile":"EMR_EC2_DefaultRole","EmrManagedMasterSecurityGroup":"sg-xxxxxxxx","EmrManagedSlaveSecurityGroup":"sg-xxxxxxxx","KeyName":"some_key","AdditionalMasterSecurityGroups":[],"AdditionalSlaveSecurityGroups":[],"ServiceAccessSecurityGroup":"sg-xxxxxxxx","SubnetId":"subnet-xxxxxxxx"}' \
 --applications Name=Spark Name=Zeppelin \
 --configurations '[{"Classification":"spark-env","Configurations":[{"Classification":"export","Properties":{"PYSPARK_PYTHON":"/usr/bin/python3"}}],"Properties":{}}]' \
 --instance-groups '[{"InstanceCount":2,"InstanceGroupType":"CORE","Name":"Core","InstanceType":"r6g.xlarge","EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"VolumeType":"gp2","SizeInGB":32},"VolumesPerInstance":2}]},"Configurations":[{"Classification":"spark-env","Configurations":[{"Classification":"export","Properties":{"PYSPARK_PYTHON":"/usr/bin/python3"}}],"Properties":{}}]},{"InstanceCount":1,"InstanceGroupType":"MASTER","Name":"Primary","InstanceType":"r6g.xlarge","EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"VolumeType":"gp2","SizeInGB":32},"VolumesPerInstance":2}]},"Configurations":[{"Classification":"spark-env","Configurations":[{"Classification":"export","Properties":{"PYSPARK_PYTHON":"/usr/bin/python3"}}],"Properties":{}}]}]' \
 --bootstrap-actions '[{"Args":[],"Name":"install python package","Path":"s3://something/bootstrap/bootstrap-script.sh"}]' \
 --scale-down-behavior "TERMINATE_AT_TASK_COMPLETION" \
 --auto-termination-policy '{"IdleTimeout":3600}' \
 --step-concurrency-level "3" \
 --os-release-label "2.0.20230418.0" \
 --region "us-east-1"

My bootstrap script (bootstrap-script.sh):

#!/bin/bash

echo -e 'Installing Boto3... \n'
which pip3
which python3
pip3 install -U boto3 botocore --user
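
A common variant of this bootstrap, shown here only as a sketch, avoids two pitfalls at once: --user installs into the invoking user's ~/.local (invisible to other accounts), and a bare pip3 may belong to a different interpreter than the python3 PySpark ends up using. Pinning pip to an explicit interpreter and installing system-wide (EMR bootstrap actions run with passwordless sudo) sidesteps both:

#!/bin/bash

# Install into the system site-packages of the interpreter PySpark is
# configured to use, instead of the bootstrap user's ~/.local.
sudo /usr/bin/python3 -m pip install -U boto3 botocore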

Once the cluster is up, I add this step:

$ spark-submit --deploy-mode cluster s3://something/py-spark/simple.py
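
For completeness, the same step expressed as an add-steps call (a sketch; the cluster id is a placeholder):

$ aws emr add-steps \
    --cluster-id j-XXXXXXXXXXXXX \
    --steps 'Type=Spark,Name=simple,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,s3://something/py-spark/simple.py]'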

And simple.py looks like this:

import boto3  # the import itself is the test: the step fails here with ModuleNotFoundError if boto3 is missing
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('Simple test') \
    .getOrCreate()

spark.stop()

My step fails with:

ModuleNotFoundError: No module named 'boto3'

I SSH into the master node as hadoop and run:

$ pip3 freeze
aws-cfn-bootstrap==2.0
beautifulsoup4==4.9.3
boto==2.49.0
click==8.1.3
docutils==0.14
jmespath==1.0.1
joblib==1.1.0
lockfile==0.11.0
lxml==4.9.1
mysqlclient==1.4.2
nltk==3.7
nose==1.3.4
numpy==1.20.0
py-dateutil==2.2
pystache==0.5.4
python-daemon==2.2.3
python37-sagemaker-pyspark==1.4.2
pytz==2022.2.1
PyYAML==5.4.1
regex==2021.11.10
simplejson==3.2.0
six==1.13.0
tqdm==4.64.0
windmill==1.6

However, in my bootstrap logs I see:

Installing Boto3... 

/usr/bin/pip3
/usr/bin/python3
Collecting boto3
  Downloading boto3-1.26.133-py3-none-any.whl (135 kB)
Collecting botocore
  Downloading botocore-1.29.133-py3-none-any.whl (10.7 MB)
Collecting s3transfer<0.7.0,>=0.6.0
  Downloading s3transfer-0.6.1-py3-none-any.whl (79 kB)
Requirement already satisfied, skipping upgrade: jmespath<2.0.0,>=0.7.1 in /usr/local/lib/python3.7/site-packages (from boto3) (1.0.1)
Collecting urllib3<1.27,>=1.25.4
  Downloading urllib3-1.26.15-py2.py3-none-any.whl (140 kB)
Collecting python-dateutil<3.0.0,>=2.1
  Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Requirement already satisfied, skipping upgrade: six>=1.5 in /usr/local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore) (1.13.0)
Installing collected packages: urllib3, python-dateutil, botocore, s3transfer, boto3
Successfully installed boto3-1.26.133 botocore-1.29.133 python-dateutil-2.8.2 s3transfer-0.6.1 urllib3-1.26.15

The logs look the same on every node. Notice that the bootstrap's pip resolves already-installed packages under /usr/local/lib/python3.7/site-packages (the "Requirement already satisfied" lines above), which already hints at which interpreter it belongs to.
So on the master, as hadoop, I run:

$ which python3
/bin/python3

Then, to verify that my bootstrap actually did something:

$ /usr/bin/pip3 freeze
aws-cfn-bootstrap==2.0
beautifulsoup4==4.9.3
boto==2.49.0
boto3==1.26.133
botocore==1.29.133
click==8.1.3
docutils==0.14

So the python3 my bootstrap updated (/usr/bin/python3) is not the same python3 that hadoop uses by default.
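
A quick way to confirm that, using the paths from the outputs above, is to ask each interpreter who it is and where, if anywhere, it would import boto3 from (the one that raises ModuleNotFoundError is the one that can't see the package):

$ /usr/bin/python3 -c 'import sys; print(sys.executable, sys.version)'
$ /bin/python3 -c 'import sys; print(sys.executable, sys.version)'
$ /usr/bin/python3 -c 'import boto3; print(boto3.__file__)'
$ /bin/python3 -c 'import boto3; print(boto3.__file__)'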
Even so, I did try to point PySpark at the right python in my EMR configuration:

{"Classification":"spark-env","Configurations":[{"Classification":"export","Properties":{"PYSPARK_PYTHON":"/usr/bin/python3"}}],"Properties":{}}]}

But when I SSH in, PYSPARK_PYTHON doesn't seem to be set on any node, and I don't understand why.
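
One likely explanation (an assumption worth checking, not something the logs above prove): on EMR the spark-env classification is written into /etc/spark/conf/spark-env.sh, which is sourced by Spark's launch scripts rather than by login shells, so an SSH session will never show the variable. It can be checked in the file itself:

$ grep PYSPARK_PYTHON /etc/spark/conf/spark-env.sh
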
I'm looking for the correct steps to make import boto3 work from my PySpark script (I don't want to make changes to simple.py).
Update: it does seem to work in client mode:

$ spark-submit --deploy-mode client s3://something/py-spark/simple.py

Of course, I want to run this in production, in cluster mode...
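
That client/cluster difference is telling: in client mode the driver runs in the submitting shell's environment, while in cluster mode it runs inside a YARN container with its own environment. A per-job workaround consistent with that (a sketch, not verified on this cluster) is to set the interpreter explicitly for both the YARN application master and the executors:

$ spark-submit --deploy-mode cluster \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python3 \
    --conf spark.executorEnv.PYSPARK_PYTHON=/usr/bin/python3 \
    s3://something/py-spark/simple.py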

Answer from bzzcjhmw:

While this may not directly answer your question, I've found the EMR CLI to be a much easier way to package dependencies (imagine needing more than just boto3) and to submit steps to EMR (Serverless or on EC2).
Following the Python build system reference example, after emr init you should have the following folder structure:

project_name
├── Dockerfile
├── simple.py
└── pyproject.toml

Next, edit pyproject.toml to include the dependency:

dependencies = [
    'boto3==1.26.133'
]

Then, package the zip file:

emr package --entry-point simple.py

Then, deploy it to S3:

emr deploy \
    --entry-point simple.py \
    --s3-code-uri s3://xxxxx

Finally, submit the step:

emr run \
    --entry-point simple.py \
    --cluster-id xxx \
    --s3-code-uri s3://xxxxx
