I'm trying to upgrade from Spark 3.0.1 to 3.1.1.
I'm running PySpark 3.1.1 in client mode from a Jupyter notebook.
The following worked on 3.0.1, but fails after upgrading Spark:
from pyspark import SparkConf
from pyspark.sql import SparkSession

def get_spark_session(app_name: str, conf: SparkConf):
    conf.setMaster("k8s://https://kubernetes.default.svc.cluster.local")
    conf \
        .set("spark.kubernetes.namespace", "spark-ml") \
        .set("spark.kubernetes.container.image", "itayb/spark:3.1.1-hadoop-3.2.0-python-3.8.6-aws") \
        .set("spark.kubernetes.authenticate.driver.serviceAccountName", "spark-executor") \
        .set("spark.executor.instances", "2") \
        .set("spark.executor.memory", "2g") \
        .set("spark.executor.cores", "2") \
        .set("spark.driver.memory", "1G") \
        .set("spark.driver.port", "2222") \
        .set("spark.driver.blockManager.port", "7777") \
        .set("spark.driver.host", "jupyter.recs.svc.cluster.local") \
        .set("spark.driver.bindAddress", "0.0.0.0") \
        .set("spark.ui.port", "4040") \
        .set("spark.network.timeout", "240") \
        .set("spark.hadoop.fs.s3a.endpoint", "localstack.kube-system.svc.cluster.local:4566") \
        .set("spark.hadoop.fs.s3a.connection.ssl.enabled", "false") \
        .set("spark.hadoop.fs.s3a.path.style.access", "true") \
        .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .set("spark.hadoop.com.amazonaws.services.s3.enableV4", "true") \
        .set("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
    return SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()

spark = get_spark_session("aws_localstack", SparkConf())

try:
    df = spark.read.csv('s3a://my-bucket/stocks.csv', header=True)
    df.printSchema()
    print(df.count())
except Exception as exp:
    print(exp)

spark.stop()
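For reference, the driver-side version can be confirmed from the notebook environment first; a minimal check, assuming the notebook kernel runs in the same Python environment where pyspark is installed:

python -c 'import pyspark; print(pyspark.__version__)'   # should print 3.1.1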
Executor logs:
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z ']'
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z x ']'
+ SPARK_CLASSPATH='/opt/spark/conf::/opt/spark/jars/*'
+ case "$1" in
+ shift 1
+ CMD=(${JAVA_HOME}/bin/java "${SPARK_EXECUTOR_JAVA_OPTS[@]}" -Xms$SPARK_EXECUTOR_MEMORY -Xmx$SPARK_EXECUTOR_MEMORY -cp "$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH" org.apac
+ exec /usr/bin/tini -s -- /usr/local/openjdk-11/bin/java -Dspark.network.timeout=240 -Dspark.driver.port=2222 -Dspark.ui.port=4040 -Dspark.driver.blockManager.port=7777
Unrecognized options: --resourceProfileId
Usage: org.apache.spark.executor.CoarseGrainedExecutorBackend [options]
Options are:
  --driver-url <driverUrl>
  --executor-id <executorId>
  --bind-address <bindAddress>
  --hostname <hostname>
  --cores <cores>
  --resourcesFile <fileWithJSONResourceInformation>
  --app-id <appid>
  --worker-url <workerUrl>
  --user-class-path <url>
  --resourceProfileId <id>
stream closed
It looks like --resourceProfileId is missing its value, but I have no idea why.
2 Answers

ejk8hzay1#
My bad :(
I used different Spark versions between the driver (3.1.1) and the executors (3.0.1).
The error happened because I had built the executor image from the official repo by just checking out the v3.1.1 tag and using it as-is; I guess something had to be compiled first, because the resulting build was still the older version. I verified this by exec-ing into a worker container and checking its version. The quick fix was to download the correct pre-compiled release (rather than building from source), extract it, and build the image from that.
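Roughly, the verification and the quick fix look like this. This is a sketch, not the exact commands: the pod name and image repo are placeholders, and the download URL assumes the Hadoop 3.2 build of 3.1.1:

# Check which Spark version the running executor actually ships
kubectl exec -n spark-ml -it <executor-pod> -- /opt/spark/bin/spark-submit --version

# Quick fix: start from the pre-built release instead of the source tree
wget https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
tar -xzf spark-3.1.1-bin-hadoop3.2.tgz
cd spark-3.1.1-bin-hadoop3.2

# docker-image-tool.sh ships with the release; -p selects the PySpark Dockerfile
./bin/docker-image-tool.sh -r <your-repo> -t 3.1.1 \
  -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build

After that, spark-submit --version inside the new image should report 3.1.1.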
vfwfrxfs2#
See the discussion on the Spark PR page: https://issues.apache.org/jira/browse/SPARK-33288?focusedCommentId=17296060&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17296060 (SPARK-33288 covers stage-level scheduling support for Kubernetes; the --resourceProfileId executor flag comes from the stage-level scheduling work in Spark 3.1, which is why executors older than 3.1 don't recognize it).