pyspark - Error when running Airflow and Spark on Docker

Posted by 4xrmg8kj on 2024-01-06 in Spark

I built my Spark image from bitnami/spark and my Airflow image from apache/airflow. When I run a DAG that submits a Spark job with the SparkSubmitOperator, I get this error:

"/usr/lib/jvm/java-11-openjdk-amd64/bin/java: No such file or directory"

and it ends with:

"airflow.exceptions.AirflowException: Cannot execute: spark-submit --master spark://spark:7077 --executor-memory 1g --driver-memory 1g --name arrow-spark /home/***/scripts/local_to_postgres_pyspark.py. Error code is: 1."
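
For reference, a minimal sketch of the kind of DAG task that produces this spark-submit call. This is a reconstruction, not the actual DAG: the dag_id, schedule, and the spark_default connection (pointing at spark://spark:7077) are assumptions; only the application path, app name and memory settings come from the error message and the volume mounts shown further down.

    # Minimal sketch (reconstruction) of the failing DAG task. The dag_id,
    # schedule and the "spark_default" connection are assumptions; the
    # application path, app name and memory settings match the error message
    # and the ./scripts volume mount from the compose file below.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

    with DAG(
        dag_id="local_to_postgres",        # assumed dag_id
        start_date=datetime(2024, 1, 1),
        schedule=None,
        catchup=False,
    ) as dag:
        submit_job = SparkSubmitOperator(
            task_id="spark_submit_task",
            conn_id="spark_default",       # must resolve to spark://spark:7077
            application="/home/airflow/scripts/local_to_postgres_pyspark.py",
            name="arrow-spark",
            executor_memory="1g",
            driver_memory="1g",
            verbose=True,
        )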
My Dockerfile.spark:

    FROM bitnami/spark:latest

    # Install dependencies
    USER root
    RUN apt-get update && \
        apt-get install -y gcc curl && \
        apt-get clean && \
        rm -rf /var/lib/apt/lists/*

    # Install JDBC driver
    RUN curl -o /opt/bitnami/spark/jars/postgresql-42.6.0.jar https://jdbc.postgresql.org/download/postgresql-42.6.0.jar

    COPY ./requirements_for_docker.txt /
    RUN pip install -r /requirements_for_docker.txt

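The PostgreSQL JAR installed above is the JDBC driver the PySpark job needs in order to write to Postgres. For illustration, a minimal sketch of what a script like local_to_postgres_pyspark.py might look like; the input file, JDBC URL, table name and credentials are placeholders, not details from the question.

    # Minimal sketch of a PySpark job that uses the PostgreSQL JDBC driver
    # installed above. Input file, JDBC URL, table name and credentials are
    # placeholders for illustration only.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("local_to_postgres").getOrCreate()

    # Hypothetical input file under the mounted data_sample directory.
    df = spark.read.csv(
        "/opt/bitnami/spark/data_sample/sample.csv",
        header=True,
        inferSchema=True,
    )

    (
        df.write.format("jdbc")
        .option("url", "jdbc:postgresql://postgres:5432/postgres")  # placeholder URL
        .option("dbtable", "public.sample_table")                   # placeholder table
        .option("user", "airflow")                                  # placeholder credentials
        .option("password", "airflow")
        .option("driver", "org.postgresql.Driver")
        .mode("overwrite")
        .save()
    )

    spark.stop()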
My Dockerfile.airflow:

    FROM apache/airflow:2.7.0

    USER root
    RUN apt-get update && \
        apt-get install -y procps openjdk-11-jre-headless && \
        apt-get clean && \
        rm -rf /var/lib/apt/lists/*
    ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

    # Install dependencies
    USER airflow
    COPY requirements_for_docker.txt /tmp/requirements_for_docker.txt
    RUN pip install --user --upgrade pip
    RUN pip install --no-cache-dir --user -r /tmp/requirements_for_docker.txt
    RUN pip install apache-airflow-providers-apache-spark==2.1.3


My docker-compose.airflow.yaml file:

    ---
    version: '3.8'
    x-airflow-common:
      &airflow-common
      image: airflow-spark:latest
      # build: .
      environment:
        &airflow-common-env
        AIRFLOW__CORE__EXECUTOR: LocalExecutor
        AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
        AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
        AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
        AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
        AIRFLOW__API__AUTH_BACKENDS: 'airflow.api.auth.backend.basic_auth,airflow.api.auth.backend.session'
        LD_LIBRARY_PATH: /usr/lib
        AIRFLOW__SCHEDULER__ENABLE_HEALTH_CHECK: 'true'
        _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-}
        JAVA_HOME: /usr/lib/jvm/java-11-openjdk-amd64
      volumes:
        - ./dags:/opt/airflow/dags
        - ./logs:/opt/airflow/logs
        - ./config:/opt/airflow/config
        - ./plugins:/opt/airflow/plugins
        - ./data_sample:/home/airflow/data_sample
        - ./scripts:/home/airflow/scripts
      user: "${AIRFLOW_UID:-50000}:0"
      depends_on:
        &airflow-common-depends-on
        postgres:
          condition: service_healthy

    services:
      postgres:
        image: postgres:13
        environment:
          POSTGRES_USER: airflow
          POSTGRES_PASSWORD: airflow
          POSTGRES_MULTIPLE_DATABASES: "airflow,ownerOfairflow:postgres,ownerOfpostgres"
          MAX_CONNECTIONS: 200
        volumes:
          - postgres-db-volume:/var/lib/postgresql/data
        healthcheck:
          test: ["CMD", "pg_isready", "-U", "airflow"]
          interval: 10s
          retries: 5
          start_period: 5s
        ports:
          - 5432:5432
        restart: always

      pgadmin:
        image: dpage/pgadmin4
        links:
          - postgres
        depends_on:
          - postgres
        restart: always
        ports:
          - "8081:80"
        environment:
          - [email protected]
          - PGADMIN_DEFAULT_PASSWORD=admin
        volumes:
          - ./pgadmin-data:/var/lib/pgadmin

      airflow-webserver:
        <<: *airflow-common
        command: webserver
        ports:
          - "8080:8080"
        healthcheck:
          test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
          interval: 30s
          timeout: 10s
          retries: 5
          start_period: 30s
        restart: always
        depends_on:
          <<: *airflow-common-depends-on
          airflow-init:
            condition: service_completed_successfully

      airflow-scheduler:
        <<: *airflow-common
        command: scheduler
        healthcheck:
          test: ["CMD", "curl", "--fail", "http://localhost:8974/health"]
          interval: 30s
          timeout: 10s
          retries: 5
          start_period: 30s
        restart: always
        depends_on:
          <<: *airflow-common-depends-on
          airflow-init:
            condition: service_completed_successfully

      airflow-init:
        <<: *airflow-common
        entrypoint: /bin/bash
        command:
          - -c
          - |
            function ver() {
              printf "%04d%04d%04d%04d" $${1//./ }
            }
            airflow_version=$$(AIRFLOW__LOGGING__LOGGING_LEVEL=INFO && gosu airflow airflow version)
            airflow_version_comparable=$$(ver $${airflow_version})
            min_airflow_version=2.2.0
            min_airflow_version_comparable=$$(ver $${min_airflow_version})
            if (( airflow_version_comparable < min_airflow_version_comparable )); then
              echo
              echo -e "\033[1;31mERROR!!!: Too old Airflow version $${airflow_version}!\e[0m"
              echo "The minimum Airflow version supported: $${min_airflow_version}. Only use this or higher!"
              echo
              exit 1
            fi
            if [[ -z "${AIRFLOW_UID}" ]]; then
              echo
              echo -e "\033[1;33mWARNING!!!: AIRFLOW_UID not set!\e[0m"
              echo "If you are on Linux, you SHOULD follow the instructions below to set "
              echo "AIRFLOW_UID environment variable, otherwise files will be owned by root."
              echo "For other operating systems you can get rid of the warning with manually created .env file:"
              echo " See: https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html#setting-the-right-airflow-user"
              echo
            fi
            one_meg=1048576
            mem_available=$$(($$(getconf _PHYS_PAGES) * $$(getconf PAGE_SIZE) / one_meg))
            cpus_available=$$(grep -cE 'cpu[0-9]+' /proc/stat)
            disk_available=$$(df / | tail -1 | awk '{print $$4}')
            warning_resources="false"
            if (( mem_available < 4000 )) ; then
              echo
              echo -e "\033[1;33mWARNING!!!: Not enough memory available for Docker.\e[0m"
              echo "At least 4GB of memory required. You have $$(numfmt --to iec $$((mem_available * one_meg)))"
              echo
              warning_resources="true"
            fi
            if (( cpus_available < 2 )); then
              echo
              echo -e "\033[1;33mWARNING!!!: Not enough CPUS available for Docker.\e[0m"
              echo "At least 2 CPUs recommended. You have $${cpus_available}"
              echo
              warning_resources="true"
            fi
            if (( disk_available < one_meg * 10 )); then
              echo
              echo -e "\033[1;33mWARNING!!!: Not enough Disk space available for Docker.\e[0m"
              echo "At least 10 GBs recommended. You have $$(numfmt --to iec $$((disk_available * 1024 )))"
              echo
              warning_resources="true"
            fi
            if [[ $${warning_resources} == "true" ]]; then
              echo
              echo -e "\033[1;33mWARNING!!!: You have not enough resources to run Airflow (see above)!\e[0m"
              echo "Please follow the instructions to increase amount of resources available:"
              echo " https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html#before-you-begin"
              echo
            fi
            mkdir -p /sources/logs /sources/dags /sources/plugins
            chown -R "${AIRFLOW_UID}:0" /sources/{logs,dags,plugins}
            exec /entrypoint airflow version
        environment:
          <<: *airflow-common-env
          _AIRFLOW_DB_UPGRADE: 'true'
          _AIRFLOW_WWW_USER_CREATE: 'true'
          _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}
          _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}
          _PIP_ADDITIONAL_REQUIREMENTS: ''
        user: "0:0"
        volumes:
          - ${AIRFLOW_PROJ_DIR:-.}:/sources

    volumes:
      postgres-db-volume:


My docker-compose.spark.yaml file:

    version: '2'
    services:
      spark:
        image: spark-cluster:latest
        environment:
          - SPARK_MODE=master
          - SPARK_RPC_AUTHENTICATION_ENABLED=no
          - SPARK_RPC_ENCRYPTION_ENABLED=no
          - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
          - SPARK_SSL_ENABLED=no
        volumes:
          - ./scripts:/opt/bitnami/spark/scripts
          - ./dags:/opt/bitnami/spark/dags
          - ./data_sample:/opt/bitnami/spark/data_sample
        ports:
          - "8090:8080"
          - "7077:7077"
      spark-worker-1:
        image: spark-cluster:latest
        environment:
          - SPARK_MODE=worker
          - SPARK_MASTER_URL=spark://spark:7077
          - SPARK_WORKER_MEMORY=1G
          - SPARK_WORKER_CORES=1
          - SPARK_RPC_AUTHENTICATION_ENABLED=no
          - SPARK_RPC_ENCRYPTION_ENABLED=no
          - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
          - SPARK_SSL_ENABLED=no
        volumes:
          - ./scripts:/opt/bitnami/spark/scripts
          - ./dags:/opt/bitnami/spark/dags
          - ./data_sample:/opt/bitnami/spark/data_sample
      spark-worker-2:
        image: spark-cluster:latest
        environment:
          - SPARK_MODE=worker
          - SPARK_MASTER_URL=spark://spark:7077
          - SPARK_WORKER_MEMORY=1G
          - SPARK_WORKER_CORES=1
          - SPARK_RPC_AUTHENTICATION_ENABLED=no
          - SPARK_RPC_ENCRYPTION_ENABLED=no
          - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
          - SPARK_SSL_ENABLED=no
        volumes:
          - ./scripts:/opt/bitnami/spark/scripts
          - ./dags:/opt/bitnami/spark/dags
          - ./data_sample:/opt/bitnami/spark/data_sample
      spark-worker-3:
        image: spark-cluster:latest
        environment:
          - SPARK_MODE=worker
          - SPARK_MASTER_URL=spark://spark:7077
          - SPARK_WORKER_MEMORY=1G
          - SPARK_WORKER_CORES=1
          - SPARK_RPC_AUTHENTICATION_ENABLED=no
          - SPARK_RPC_ENCRYPTION_ENABLED=no
          - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
          - SPARK_SSL_ENABLED=no
        volumes:
          - ./scripts:/opt/bitnami/spark/scripts
          - ./dags:/opt/bitnami/spark/dags
          - ./data_sample:/opt/bitnami/spark/data_sample
      spark-worker-4:
        image: spark-cluster:latest
        environment:
          - SPARK_MODE=worker
          - SPARK_MASTER_URL=spark://spark:7077
          - SPARK_WORKER_MEMORY=1G
          - SPARK_WORKER_CORES=1
          - SPARK_RPC_AUTHENTICATION_ENABLED=no
          - SPARK_RPC_ENCRYPTION_ENABLED=no
          - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
          - SPARK_SSL_ENABLED=no
        volumes:
          - ./scripts:/opt/bitnami/spark/scripts
          - ./dags:/opt/bitnami/spark/dags
          - ./data_sample:/opt/bitnami/spark/data_sample
      spark-worker-5:
        image: spark-cluster:latest
        environment:
          - SPARK_MODE=worker
          - SPARK_MASTER_URL=spark://spark:7077
          - SPARK_WORKER_MEMORY=1G
          - SPARK_WORKER_CORES=1
          - SPARK_RPC_AUTHENTICATION_ENABLED=no
          - SPARK_RPC_ENCRYPTION_ENABLED=no
          - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
          - SPARK_SSL_ENABLED=no
        volumes:
          - ./scripts:/opt/bitnami/spark/scripts
          - ./dags:/opt/bitnami/spark/dags
          - ./data_sample:/opt/bitnami/spark/data_sample
      spark-worker-6:
        image: spark-cluster:latest
        environment:
          - SPARK_MODE=worker
          - SPARK_MASTER_URL=spark://spark:7077
          - SPARK_WORKER_MEMORY=1G
          - SPARK_WORKER_CORES=1
          - SPARK_RPC_AUTHENTICATION_ENABLED=no
          - SPARK_RPC_ENCRYPTION_ENABLED=no
          - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
          - SPARK_SSL_ENABLED=no
        volumes:
          - ./scripts:/opt/bitnami/spark/scripts
          - ./dags:/opt/bitnami/spark/dags
          - ./data_sample:/opt/bitnami/spark/data_sample


I tried adding

    ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

to my Dockerfile.airflow, as you can see above, but the error still occurs.

Answer 1 (xwbd5t1u):

Try adding "RUN export JAVA_HOME" after your "ENV JAVA_HOME" line (this is the image I use, and it works).
Also, if the problem persists, check which Python version your Spark image uses and which one your Airflow image uses. In the example below I had to switch to Python 3.11 for exactly that reason:

    FROM apache/airflow:2.7.3-python3.11

    USER root
    RUN apt-get update \
        && apt-get install -y --no-install-recommends \
           openjdk-11-jre-headless \
        && apt-get autoremove -yqq --purge \
        && apt-get clean \
        && rm -rf /var/lib/apt/lists/*
    RUN apt update && apt install -y procps

    USER airflow
    ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-arm64
    RUN export JAVA_HOME
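
Note that the JVM directory name depends on the image architecture: the Dockerfile above points JAVA_HOME at java-11-openjdk-arm64, while the error in the question expects java-11-openjdk-amd64. A quick way to see which directory the Airflow container really has is a small check like the sketch below (an illustrative addition, not part of the original answer), run inside the scheduler container with docker exec:

    # Illustrative check: run inside the Airflow scheduler container, e.g.
    #   docker exec -it <scheduler-container> python3 /tmp/check_java.py
    # It shows whether JAVA_HOME points at an existing java binary and which
    # JVM directories the image actually contains (amd64 vs arm64 suffix).
    import os
    import shutil

    java_home = os.environ.get("JAVA_HOME", "")
    print("JAVA_HOME:", java_home or "<not set>")
    print("JAVA_HOME/bin/java exists:", os.path.exists(os.path.join(java_home, "bin", "java")))
    print("java on PATH:", shutil.which("java"))
    print("/usr/lib/jvm contents:", os.listdir("/usr/lib/jvm") if os.path.isdir("/usr/lib/jvm") else "missing")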


