I have this docker-compose.yaml:
version: '3'
services:
  spark-master:
    build:
      context: ./spark-master
    ports:
      - "7077:7077"
      - "8080:8080"
    command: /bin/bash ./spark-master-entrypoint.sh
  spark-worker:
    build:
      context: ./spark-worker
    command: /bin/bash ./spark-worker-entrypoint.sh
    ports:
      - "8081:8081"
      - "7078:7078"
    environment:
      - SPARK_MASTER=spark-master:7077
      - SPARK_WORKER_PORT=7078
  notebook:
    image: jupyter/all-spark-notebook
    depends_on:
      - spark-master
    environment:
      - SPARK_MASTER=spark-master
    volumes:
      - ./jupyter-notebook/notebooks:/home/jovyan
      - ./jupyter-notebook/spark/conf:/opt/spark/conf/
    ports:
      - "8888:8888"
    command: start-notebook.sh --NotebookApp.token='' --NotebookApp.password=''
There are 3 Docker containers:
1. Spark master
2. Spark worker
3. Notebook
I want to interact with Spark from the notebook (Jupyter Notebook).
My notebook:
import os
from pyspark import SparkConf, SparkContext
spark_master = "spark://spark-master:7077"
conf = SparkConf().setAppName("from notebook").setMaster(spark_master)
sc = SparkContext(conf=conf).getOrCreate()
rdd = sc.parallelize([1, 2, 3, 4, 5])
transformed_rdd = rdd.map(lambda x: 3 * x)
result = transformed_rdd.values().collect()
print("Result from Spark:", result)
sc.stop()
But the notebook hangs at
result = transformed_rdd.values().collect()
and it stays stuck no matter what I change.
FYI
MacBook Pro (15-inch, 2017)
Processor 3.1 GHz Quad-Core Intel Core i7
Memory 16 GB 2133 MHz LPDDR3
Docker versions:
❯ docker compose version
Docker Compose version v2.20.2-desktop.1
❯ docker version
Client:
 Cloud integration: v1.0.35-desktop+001
 Version:           24.0.5
 API version:       1.43
 Go version:        go1.20.6
 Git commit:        ced0996
 Built:             Fri Jul 21 20:32:30 2023
 OS/Arch:           darwin/amd64
 Context:           desktop-linux

Server: Docker Desktop 4.22.1 (118664)
 Engine:
  Version:          24.0.5
  API version:      1.43 (minimum version 1.12)
  Go version:       go1.20.6
  Git commit:       a61e2b4
  Built:            Fri Jul 21 20:35:45 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.21
  GitCommit:        3dce8eb055cbb6872793272b4f20ed16117344f8
 runc:
  Version:          1.1.7
  GitCommit:        v1.1.7-0-g860f061
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
The Spark configuration as seen from the notebook:
for key, value in sc.getConf().getAll():
    print(f'{key}: {value}')
spark.app.id: app-20230922092957-0002
spark.app.startTime: 1695374996363
spark.driver.host: 70f40e7eb82e
spark.executor.id: driver
spark.driver.port: 42001
spark.driver.extraJavaOptions: -Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false
spark.app.name: from notebook
spark.rdd.compress: True
spark.master: spark://spark-master:7077
spark.serializer.objectStreamReset: 100
spark.submit.pyFiles:
spark.submit.deployMode: client
spark.app.submitTime: 1695374996162
spark.ui.showConsoleProgress: true
spark.executor.extraJavaOptions: -Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false
What did you try?
1. Defined a fixed port for the worker, SPARK_WORKER_PORT=7078, instead of a random one
2. Changed the service names
3. Exposed the worker port
4. Limited SPARK_WORKER_MEMORY
5. Limited SPARK_WORKER_CORES (see the sketch after this list for an application-side alternative to items 4 and 5)
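For comparison, a minimal sketch of requesting the same caps from the notebook side instead of through worker environment variables; spark.cores.max and spark.executor.memory are the standard Spark standalone settings, but the values here are placeholders, not the ones actually tried above:

from pyspark import SparkConf, SparkContext

# Cap what this one application may take from the standalone cluster.
# The values below are illustrative placeholders.
conf = (SparkConf()
        .setAppName("from notebook")
        .setMaster("spark://spark-master:7077")
        .set("spark.cores.max", "2")
        .set("spark.executor.memory", "1g"))
sc = SparkContext.getOrCreate(conf=conf)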
What were you expecting?
I expect my PySpark code in the Jupyter notebook to run normally.
Update
Thanks to Bernhard Stadler.
When I switched to a 32 GB machine, the run no longer hung, but I then got an error that the worker and driver Python versions did not match.
I added the following to my Dockerfile:
RUN apt-get update && \
apt-get install -y software-properties-common && \
add-apt-repository ppa:deadsnakes/ppa && \
apt-get update && \
apt-get install -y python3.11
RUN mkdir -p /opt/conda/bin
RUN ln -s /usr/bin/python3.11 /opt/conda/bin/python
and the following to my notebook file:
import sys
# Point PySpark at the notebook's interpreter so driver and executors match.
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
Now I get the result I wanted.
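As a quick sanity check that the driver and executors now run matching Python versions, a minimal sketch (assuming sc is the SparkContext created in the notebook):

import sys

# Compare the notebook (driver) interpreter with whatever the worker executes.
driver_ver = sys.version_info[:2]
executor_ver = sc.parallelize([0], 1).map(lambda _: sys.version_info[:2]).first()
print("driver:", driver_ver, "executor:", executor_ver)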
1 Answer
The screenshot looks like your notebook is actually executing, but there is a problem in your code snippet: since transformed_rdd does not contain tuples (key-value pairs), you cannot call values() on it. Remove the values() call:
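A minimal sketch of the corrected cell (the answer's own snippet is not shown above, so this simply applies the suggested change to the code from the question):

rdd = sc.parallelize([1, 2, 3, 4, 5])
transformed_rdd = rdd.map(lambda x: 3 * x)

# map() returns a plain RDD of integers, not key-value pairs,
# so collect it directly instead of calling values() first.
result = transformed_rdd.collect()
print("Result from Spark:", result)  # [3, 6, 9, 12, 15]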