I am launching an EMR cluster with the following create statement:
$ aws emr create-cluster \
--name "my_cluster" \
--log-uri "s3n://somebucket/" \
--release-label "emr-6.8.0" \
--service-role "arn:aws:iam::XXXXXXXXXX:role/EMR_DefaultRole" \
--ec2-attributes '{"InstanceProfile":"EMR_EC2_DefaultRole","EmrManagedMasterSecurityGroup":"sg-xxxxxxxx","EmrManagedSlaveSecurityGroup":"sg-xxxxxxxx","KeyName":"some_key","AdditionalMasterSecurityGroups":[],"AdditionalSlaveSecurityGroups":[],"ServiceAccessSecurityGroup":"sg-xxxxxxxx","SubnetId":"subnet-xxxxxxxx"}' \
--applications Name=Spark Name=Zeppelin \
--configurations '[{"Classification":"spark-env","Configurations":[{"Classification":"export","Properties":{"PYSPARK_PYTHON":"/usr/bin/python3"}}],"Properties":{}}]' \
--instance-groups '[{"InstanceCount":2,"InstanceGroupType":"CORE","Name":"Core","InstanceType":"r6g.xlarge","EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"VolumeType":"gp2","SizeInGB":32},"VolumesPerInstance":2}]},"Configurations":[{"Classification":"spark-env","Configurations":[{"Classification":"export","Properties":{"PYSPARK_PYTHON":"/usr/bin/python3"}}],"Properties":{}}]},{"InstanceCount":1,"InstanceGroupType":"MASTER","Name":"Primary","InstanceType":"r6g.xlarge","EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"VolumeType":"gp2","SizeInGB":32},"VolumesPerInstance":2}]},"Configurations":[{"Classification":"spark-env","Configurations":[{"Classification":"export","Properties":{"PYSPARK_PYTHON":"/usr/bin/python3"}}],"Properties":{}}]}]' \
--bootstrap-actions '[{"Args":[],"Name":"install python package","Path":"s3://something/bootstrap/bootstrap-script.sh"}]' \
--scale-down-behavior "TERMINATE_AT_TASK_COMPLETION" \
--auto-termination-policy '{"IdleTimeout":3600}' \
--step-concurrency-level "3" \
--os-release-label "2.0.20230418.0" \
--region "us-east-1"
My bootstrap script (bootstrap-script.sh):
#!/bin/bash
echo -e 'Installing Boto3... \n'
which pip3
which python3
pip3 install -U boto3 botocore --user
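For context on what follows: pip3 install --user puts packages under the invoking user's home directory. A quick way to see exactly where is shown below (the output is what is typical for these images, not captured from this cluster):
$ python3 -m site --user-site
/home/hadoop/.local/lib/python3.7/site-packages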
Once the EMR cluster is up, I add this step:
$ spark-submit --deploy-mode cluster s3://something/py-spark/simple.py
simple.py looks like this:
import boto3
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName('Simple test') \
    .getOrCreate()
spark.stop()
My step fails with:
ModuleNotFoundError: No module named 'boto3'
I SSH into the master node as hadoop and run:
$ pip3 freeze
aws-cfn-bootstrap==2.0
beautifulsoup4==4.9.3
boto==2.49.0
click==8.1.3
docutils==0.14
jmespath==1.0.1
joblib==1.1.0
lockfile==0.11.0
lxml==4.9.1
mysqlclient==1.4.2
nltk==3.7
nose==1.3.4
numpy==1.20.0
py-dateutil==2.2
pystache==0.5.4
python-daemon==2.2.3
python37-sagemaker-pyspark==1.4.2
pytz==2022.2.1
PyYAML==5.4.1
regex==2021.11.10
simplejson==3.2.0
six==1.13.0
tqdm==4.64.0
windmill==1.6
However, in my bootstrap logs:
Installing Boto3...
/usr/bin/pip3
/usr/bin/python3
Collecting boto3
Downloading boto3-1.26.133-py3-none-any.whl (135 kB)
Collecting botocore
Downloading botocore-1.29.133-py3-none-any.whl (10.7 MB)
Collecting s3transfer<0.7.0,>=0.6.0
Downloading s3transfer-0.6.1-py3-none-any.whl (79 kB)
Requirement already satisfied, skipping upgrade: jmespath<2.0.0,>=0.7.1 in /usr/local/lib/python3.7/site-packages (from boto3) (1.0.1)
Collecting urllib3<1.27,>=1.25.4
Downloading urllib3-1.26.15-py2.py3-none-any.whl (140 kB)
Collecting python-dateutil<3.0.0,>=2.1
Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Requirement already satisfied, skipping upgrade: six>=1.5 in /usr/local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore) (1.13.0)
Installing collected packages: urllib3, python-dateutil, botocore, s3transfer, boto3
Successfully installed boto3-1.26.133 botocore-1.29.133 python-dateutil-2.8.2 s3transfer-0.6.1 urllib3-1.26.15
The logs look the same on all nodes.
So on the master node, as hadoop, I run:
$ which python3
/bin/python3
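To compare the two interpreters directly, something like the following can help (one caveat, as an assumption: on many Amazon Linux images /bin is a symlink to /usr/bin, so the two binaries can be identical even when each user's effective site-packages differ):
$ ls -l /bin/python3 /usr/bin/python3
$ /bin/python3 -c 'import sys; print(sys.executable)'
$ /usr/bin/python3 -c 'import sys; print(sys.executable)'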
Then, to verify that my bootstrap actually did something:
$ /usr/bin/pip3 freeze
aws-cfn-bootstrap==2.0
beautifulsoup4==4.9.3
boto==2.49.0
boto3==1.26.133
botocore==1.29.133
click==8.1.3
docutils==0.14
So the python3 my bootstrap updated (/usr/bin/python3) is not the python3 that hadoop uses by default.
However, I did try to make sure PySpark uses the correct python via my EMR configuration:
{"Classification":"spark-env","Configurations":[{"Classification":"export","Properties":{"PYSPARK_PYTHON":"/usr/bin/python3"}}],"Properties":{}}
But when I log in, PYSPARK_PYTHON does not appear to be set on any node, and I don't understand why.
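My working assumption (not verified on this cluster): the spark-env export classification is written to /etc/spark/conf/spark-env.sh, so the variable is only exported when Spark itself launches a process and will never show up in an interactive SSH session. A quick check would be:
$ grep PYSPARK_PYTHON /etc/spark/conf/spark-env.sh
export PYSPARK_PYTHON=/usr/bin/python3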
I am looking for the correct steps to make import boto3 work from my PySpark script (I do not want to make changes to simple.py).
UPDATE: it seems to work in client mode:
$ spark-submit --deploy-mode client s3://something/py-spark/simple.py
Of course, I want to run this in production, in cluster mode...
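A workaround I have seen suggested for cluster mode (untested here): pass the interpreter explicitly at submit time, since spark.yarn.appMasterEnv.* sets environment variables for the driver when it runs inside a YARN container:
$ spark-submit --deploy-mode cluster \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python3 \
    --conf spark.executorEnv.PYSPARK_PYTHON=/usr/bin/python3 \
    s3://something/py-spark/simple.py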
1 Answer
While this may not directly answer your question, I have found the EMR CLI to be a simpler way to package dependencies (imagine you need more than just boto3) and submit steps to EMR (Serverless or on EC2).
Following the reference example (the Python build system), after running emr init you should have the following folder structure:
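A sketch of the layout the EMR CLI's Python build system generates (file and directory names follow its docs and may vary by version):
my-project
├── Dockerfile
├── entrypoint.py
├── jobs
│   └── my_job.py
└── pyproject.toml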
Next, edit pyproject.toml to include the dependency:
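A minimal sketch, assuming the Poetry-style template the EMR CLI generates; the name and version pins are illustrative:
[tool.poetry]
name = "my-project"
version = "0.0.1"

[tool.poetry.dependencies]
python = ">=3.7.1,<4.0"
boto3 = "^1.26"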
Then, package the zip file and deploy it to S3:
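Something like the following, with the flags taken from the EMR CLI README and the bucket path a placeholder:
$ emr package --entry-point entrypoint.py
$ emr deploy --entry-point entrypoint.py --s3-code-uri s3://something/code/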
Finally, submit the step:
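Again with placeholder IDs and paths; --wait tails the step until it completes:
$ emr run --entry-point entrypoint.py \
    --cluster-id j-XXXXXXXXXXXX \
    --s3-code-uri s3://something/code/ \
    --wait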