all-spark-notebook Docker image: notebooks not picking up the custom Python version

sauutmhj, posted 2021-07-14 in Spark

Summary

I am trying to execute a simple Python code snippet in an all-spark-notebook container, which is supposed to run against a local Spark cluster that I set up in a docker-compose file. However, I get the error ModuleNotFoundError: No module named 'pyspark', which makes no sense to me, because in this Dockerfile (which I took from the documentation of the Docker repo) I explicitly install pyspark with pip.
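The notebook cell itself is not inlined above, but judging from the shell session shown further down under "What I have tried", it presumably looks roughly like this (a reconstructed sketch, not copied from the actual notebook):

    import pyspark
    from pyspark.sql import SparkSession

    # Connect to the Spark master service defined in the docker-compose file
    spark = SparkSession.builder.master('spark://spark-master:7077').getOrCreate()
    sc = spark.sparkContext

    # Small sanity check: the sum of 0..100 should be 5050
    rdd = sc.parallelize(range(100 + 1))
    print(rdd.sum())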

Steps to reproduce the error

    # Clone the repository and checkout a specific commit
    kevinsuedmersen@LAPTOP-XXX:~/dev/hadoop-sandbox$ git clone https://github.com/kevinsuedmersen/hadoop-sandbox.git
    kevinsuedmersen@LAPTOP-XXX:~/dev/hadoop-sandbox$ git checkout e0a061dd3a60842aa0e93893892c7e0844c2278a
    # Install and start all services
    kevinsuedmersen@LAPTOP-XXX:~/dev/hadoop-sandbox$ docker-compose up -d
    # Entering the container running the notebooks
    kevinsuedmersen@LAPTOP-XXX:~/dev/hadoop-sandbox$ docker exec -it jupyter-spark bash
    # Activating the custom python environment installed in the above referenced Dockerfile
    (base) jovyan@XXX:~$ conda activate python37
    # Start a jupyter notebook server
    (python37) jovyan@XXX:~$ jupyter notebook
    # After some logging, the following output shows
    To access the notebook, open this file in a browser:
        file:///home/jovyan/.local/share/jupyter/runtime/nbserver-27913-open.html
    Or copy and paste one of these URLs:
        http://b8ef36545270:8889/?token=some_token
     or http://127.0.0.1:8889/?token=some_token

Then I click on the URL http://127.0.0.1:8889/?token=some_token to open the Jupyter GUI in my browser, execute the simple Python code snippet, and get the error described above.

What I have tried

To check whether pyspark is actually installed, I basically just tried to execute the same simple Python code snippet in a shell inside the jupyter-spark container, and surprisingly, it worked. Concretely, I executed the following commands in a new shell:

    # Entering into the jupyter-spark container and activating the custom python environment
    kevinsuedmersen@LAPTOP-XXX:~/dev/hadoop-sandbox$ docker exec -it jupyter-spark bash
    (base) jovyan@XXX:~$ conda activate python37
    # Opening a python shell
    (python37) jovyan@XXX:~$ python
    # Copy pasting the same commands from the notebook into the shell
    >>> import pyspark
    >>> from pyspark.sql import SparkSession
    >>> spark = SparkSession.builder.master('spark://spark-master:7077').getOrCreate()
    >>> sc = spark.sparkContext
    >>> rdd = sc.parallelize(range(100 + 1))
    >>> rdd.sum()
    5050
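So the python37 environment itself clearly has pyspark installed; the question is which interpreter the notebook kernel actually runs. Activating a conda environment before starting jupyter notebook does not by itself register that environment as a notebook kernel. A quick way to see which kernels the server exposes (a diagnostic sketch, not part of the original question):

    # List registered kernels; if the python37 environment is missing here,
    # notebooks fall back to the base environment's Python
    (python37) jovyan@XXX:~$ jupyter kernelspec list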

Furthermore, I noticed that executing the following in the notebook

    ! python --version

prints Python 3.8.8. So, my question is: how can I make the notebook use the custom Python environment?
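(As an aside, `!` commands in a notebook run in a subshell, so `! python --version` only reports whichever python is first on the kernel's PATH. A more direct check of the kernel's own interpreter, run from a notebook cell, would be something like the following sketch, which is not from the original question:)

    # Shows the interpreter that the notebook kernel itself is using
    import sys
    print(sys.executable)
    print(sys.version)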

Answer 1, by kzmpq1sx

So, apparently the following workaround works:
Change the Dockerfile of the jupyter-spark service to something as simple as:

    FROM jupyter/all-spark-notebook:584f43f06586
    ARG SPARK_VERSION
    ARG HADOOP_VERSION
    ARG SPARK_CHECKSUM
    ARG OPENJDK_VERSION
    ARG PYTHON_VERSION
    # Install a different version of python inside the base environment
    RUN conda install -y python=$PYTHON_VERSION
    # Install required pip packages, e.g. pyspark
    COPY requirements.txt /docker_build/requirements.txt
    RUN pip install -r /docker_build/requirements.txt
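The requirements.txt itself is not shown in the answer. Presumably it pins pyspark to the same version as the cluster; a hypothetical example, assuming Spark 3.1.1 as in the compose file below:

    # requirements.txt (hypothetical contents, not shown in the original answer)
    # pyspark should match the Spark version running on the master and workers
    pyspark==3.1.1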

The service definition in the docker-compose.yml file becomes:

    # Spark notebooks
    jupyter-spark:
      # To see all running servers in this container, execute
      # `docker exec jupyter-spark jupyter notebook list`
      container_name: jupyter-spark
      build:
        context: jupyter-spark
        args:
          - SPARK_VERSION=3.1.1
          - HADOOP_VERSION=3.2
          - SPARK_CHECKSUM=E90B31E58F6D95A42900BA4D288261D71F6C19FA39C1CB71862B792D1B5564941A320227F6AB0E09D946F16B8C1969ED2DEA2A369EC8F9D2D7099189234DE1BE
          - OPENJDK_VERSION=11
          # Make sure the python version in the driver (the notebooks) is the same as in spark-master,
          # spark-worker-1, and spark-worker-2
          - PYTHON_VERSION=3.7.10
      ports:
        - 8888:8888
        - 8889:8889
        - 4040:4040
        - 4041:4041
      volumes:
        - ./jupyter-spark/work:/home/jovyan/work
      pid: host
      environment:
        - TINI_SUBREAPER=true
      env_file:
        - ./hadoop.env
      networks:
        - hadoop
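After changing the Dockerfile and the compose file, the jupyter-spark image needs to be rebuilt for the change to take effect. Standard docker-compose commands along these lines should work (not quoted from the original answer):

    # Rebuild the jupyter-spark image and recreate the service
    docker-compose build jupyter-spark
    docker-compose up -d jupyter-spark
    # Quick check that the container now runs the pinned Python version
    docker exec jupyter-spark python --version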

The current working state of the repository with the above changes can be seen here.
