PySpark and Jupyter Notebook interaction in a Docker environment: execution hangs or keeps running

5uzkadbs  posted on 2023-10-15  in Spark

I have the following docker-compose.yaml:

version: '3'
services:
  spark-master:
    build:
      context: ./spark-master
    ports:
      - "7077:7077"
      - "8080:8080"
    command: /bin/bash ./spark-master-entrypoint.sh

  spark-worker:
    build:
      context: ./spark-worker
    command: /bin/bash ./spark-worker-entrypoint.sh
    ports:
        - "8081:8081"
        - "7078:7078"
    environment:
      - SPARK_MASTER=spark-master:7077
      - SPARK_WORKER_PORT=7078

  notebook:
    image: jupyter/all-spark-notebook
    depends_on:
      - spark-master
    environment:
      - SPARK_MASTER=spark-master
    volumes:
      - ./jupyter-notebook/notebooks:/home/jovyan
      - ./jupyter-notebook/spark/conf:/opt/spark/conf/
    ports:
        - "8888:8888"
    command: start-notebook.sh --NotebookApp.token='' --NotebookApp.password=''

There are 3 Docker containers:
1. Spark master
2. Spark worker
3. Notebook

I want to interact with Spark from the notebook (Jupyter Notebook).
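
Before looking at Spark itself, it can help to confirm that the notebook container can reach the other services over the Compose network. A minimal sketch to run in a notebook cell, using the service names and ports from the compose file above:

import socket

# Service names resolve on the default Compose network; ports match docker-compose.yaml
for host, port in [("spark-master", 7077), ("spark-master", 8080), ("spark-worker", 8081)]:
    try:
        with socket.create_connection((host, port), timeout=3):
            print(f"{host}:{port} is reachable")
    except OSError as err:
        print(f"{host}:{port} is NOT reachable: {err}")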
My notebook:

import os
from pyspark import SparkConf, SparkContext

spark_master = "spark://spark-master:7077"
conf = SparkConf().setAppName("from notebook").setMaster(spark_master)
sc = SparkContext(conf=conf).getOrCreate()

rdd = sc.parallelize([1, 2, 3, 4, 5])
transformed_rdd = rdd.map(lambda x: 3 * x)
result = transformed_rdd.values().collect()

print("Result from Spark:", result)
sc.stop()

But the notebook just keeps running and gets stuck at

result = transformed_rdd.values().collect()

[screenshot: the cell is shown as still executing, with no result]

and it stays stuck there.

FYI

MacBook Pro (15-inch, 2017)
Processor 3.1 GHz Quad-Core Intel Core i7
Memory 16 GB 2133 MHz LPDDR3

Docker versions:

❯ docker compose version
Docker Compose version v2.20.2-desktop.1

❯ docker version
Client:
 Cloud integration: v1.0.35-desktop+001
 Version:           24.0.5
 API version:       1.43
 Go version:        go1.20.6
 Git commit:        ced0996
 Built:             Fri Jul 21 20:32:30 2023
 OS/Arch:           darwin/amd64
 Context:           desktop-linux

Server: Docker Desktop 4.22.1 (118664)
 Engine:
  Version:          24.0.5
  API version:      1.43 (minimum version 1.12)
  Go version:       go1.20.6
  Git commit:       a61e2b4
  Built:            Fri Jul 21 20:35:45 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.21
  GitCommit:        3dce8eb055cbb6872793272b4f20ed16117344f8
 runc:
  Version:          1.1.7
  GitCommit:        v1.1.7-0-g860f061
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

For reference, the configuration reported by the running SparkContext:

for key, value in sc.getConf().getAll():
    print(f'{key}: {value}')

spark.app.id: app-20230922092957-0002
spark.app.startTime: 1695374996363
spark.driver.host: 70f40e7eb82e
spark.executor.id: driver
spark.driver.port: 42001
spark.driver.extraJavaOptions: -Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false
spark.app.name: from notebook
spark.rdd.compress: True
spark.master: spark://spark-master:7077
spark.serializer.objectStreamReset: 100
spark.submit.pyFiles: 
spark.submit.deployMode: client
spark.app.submitTime: 1695374996162
spark.ui.showConsoleProgress: true
spark.executor.extraJavaOptions: -Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false

What have you tried?
1. Defined a fixed port for the worker (SPARK_WORKER_PORT=7078) instead of letting it pick a random one
2. Changed the service names
3. Exposed the worker port
4. Limited SPARK_WORKER_MEMORY
5. Limited SPARK_WORKER_CORES

What were you expecting?
I expected PySpark and the Jupyter notebook to run normally.

Update from the asker:

Thanks to Bernhard Stadler. When I switched to a machine with 32 GB of RAM, the run no longer got stuck. I then got an error that the Python versions (driver vs. worker) did not match, so I added the following to my Dockerfile:

# Install Python 3.11 on the Spark worker image so that its Python version
# matches the one used by the notebook (driver)
RUN apt-get update && \
    apt-get install -y software-properties-common && \
    add-apt-repository ppa:deadsnakes/ppa && \
    apt-get update && \
    apt-get install -y python3.11

# Expose it at the path the jupyter/all-spark-notebook image uses for its interpreter,
# so the value passed via PYSPARK_PYTHON resolves on the worker as well
RUN mkdir -p /opt/conda/bin
RUN ln -s /usr/bin/python3.11 /opt/conda/bin/python

and added the following to my notebook file:

import os
import sys
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

Now I get the result I expected.
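
As a quick way to check that the driver and the executors now run the same interpreter, something like the following can be run in the notebook (assuming sc is the SparkContext created as in the code above):

import sys

def executor_python_version(_):
    # Executed on an executor, so it reports the worker's Python, not the driver's
    import sys
    return sys.version

print("driver  :", sys.version)
print("executor:", sc.parallelize([0], 1).map(executor_python_version).first())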

Answer

7uzetpgm1#

The screenshot suggests your notebook is indeed executing, but there is a problem in your code snippet:

result = transformed_rdd.values().collect()

Since transformed_rdd does not contain key-value tuples, you cannot call values() on it. Remove the values() call:

result = transformed_rdd.collect()
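
For context, values() only makes sense on an RDD of key-value pairs; a small illustration (the variable names here are just examples):

pairs = sc.parallelize([("a", 1), ("b", 2), ("c", 3)])
print(pairs.values().collect())   # [1, 2, 3] -- each element is a (key, value) tuple

numbers = sc.parallelize([1, 2, 3])
# numbers.values().collect() would fail once the action runs, because the elements are plain ints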
