pyspark: Spark connection type not showing up in Airflow

dzhpxtsq, asked 2022-11-01 in Spark (1 answer)

I have a docker-compose file that defines Airflow, Spark, PostgreSQL and Redis services. When I run docker-compose and open the Airflow UI, I try to add a connection of type Spark so that I can run a Spark job from Airflow on Docker. However, Spark does not show up as an option in the Connection Type dropdown.

This is my docker-compose file:

# Feel free to modify this file to suit your needs.
---
version: '3'
x-airflow-common:
  &airflow-common
  # In order to add custom dependencies or upgrade provider packages you can use your extended image.
  # Comment the image line, place your Dockerfile in the directory where you placed the docker-compose.yaml
  # and uncomment the "build" line below, Then run `docker-compose build` to build the images.
  image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.2.3}
  # build: .
  environment:
    &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
    AIRFLOW__API__AUTH_BACKEND: 'airflow.api.auth.backend.basic_auth'
    ADDITIONAL_AIRFLOW_EXTRAS: apache.spark
    _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-}
  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins
  user: "${AIRFLOW_UID:-50000}:0"
  depends_on:
    &airflow-common-depends-on
    redis:
      condition: service_healthy
    postgres:
      condition: service_healthy

services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - postgres-db-volume:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 5s
      retries: 5
    restart: always

  redis:
    image: redis:latest
    expose:
      - 6379
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 30s
      retries: 50
    restart: always

  airflow-webserver:
    <<: *airflow-common
    command: webserver
    ports:
      - '8282:8080'
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-scheduler:
    <<: *airflow-common
    command: scheduler
    healthcheck:
      test: ["CMD-SHELL", 'airflow jobs check --job-type SchedulerJob --hostname "$${HOSTNAME}"']
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-worker:
    <<: *airflow-common
    command: celery worker
    healthcheck:
      test:
        - "CMD-SHELL"
        - 'celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}"'
      interval: 10s
      timeout: 10s
      retries: 5
    environment:
      <<: *airflow-common-env
      # Required to handle warm shutdown of the celery workers properly
      # See https://airflow.apache.org/docs/docker-stack/entrypoint.html#signal-propagation
      DUMB_INIT_SETSID: "0"
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-triggerer:
    <<: *airflow-common
    command: triggerer
    healthcheck:
      test: ["CMD-SHELL", 'airflow jobs check --job-type TriggererJob --hostname "$${HOSTNAME}"']
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-init:
    <<: *airflow-common
    entrypoint: /bin/bash
    # yamllint disable rule:line-length
    command:
      - -c
      - |
        function ver() {
          printf "%04d%04d%04d%04d" $${1//./ }
        }
        airflow_version=$$(gosu airflow airflow version)
        airflow_version_comparable=$$(ver $${airflow_version})
        min_airflow_version=2.2.0
        min_airflow_version_comparable=$$(ver $${min_airflow_version})
        if (( airflow_version_comparable < min_airflow_version_comparable )); then
          echo
          echo -e "\033[1;31mERROR!!!: Too old Airflow version $${airflow_version}!\e[0m"
          echo "The minimum Airflow version supported: $${min_airflow_version}. Only use this or higher!"
          echo
          exit 1
        fi
        if [[ -z "${AIRFLOW_UID}" ]]; then
          echo
          echo -e "\033[1;33mWARNING!!!: AIRFLOW_UID not set!\e[0m"
          echo "If you are on Linux, you SHOULD follow the instructions below to set "
          echo "AIRFLOW_UID environment variable, otherwise files will be owned by root."
          echo "For other operating systems you can get rid of the warning with manually created .env file:"
          echo "    See: https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html#setting-the-right-airflow-user"
          echo
        fi
        one_meg=1048576
        mem_available=$$(($$(getconf _PHYS_PAGES) * $$(getconf PAGE_SIZE) / one_meg))
        cpus_available=$$(grep -cE 'cpu[0-9]+' /proc/stat)
        disk_available=$$(df / | tail -1 | awk '{print $$4}')
        warning_resources="false"
        if (( mem_available < 4000 )) ; then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough memory available for Docker.\e[0m"
          echo "At least 4GB of memory required. You have $$(numfmt --to iec $$((mem_available * one_meg)))"
          echo
          warning_resources="true"
        fi
        if (( cpus_available < 2 )); then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough CPUS available for Docker.\e[0m"
          echo "At least 2 CPUs recommended. You have $${cpus_available}"
          echo
          warning_resources="true"
        fi
        if (( disk_available < one_meg * 10 )); then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough Disk space available for Docker.\e[0m"
          echo "At least 10 GBs recommended. You have $$(numfmt --to iec $$((disk_available * 1024 )))"
          echo
          warning_resources="true"
        fi
        if [[ $${warning_resources} == "true" ]]; then
          echo
          echo -e "\033[1;33mWARNING!!!: You have not enough resources to run Airflow (see above)!\e[0m"
          echo "Please follow the instructions to increase amount of resources available:"
          echo "   https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html#before-you-begin"
          echo
        fi
        mkdir -p /sources/logs /sources/dags /sources/plugins
        chown -R "${AIRFLOW_UID}:0" /sources/{logs,dags,plugins}
        exec /entrypoint airflow version
    # yamllint enable rule:line-length
    environment:
      <<: *airflow-common-env
      _AIRFLOW_DB_UPGRADE: 'true'
      _AIRFLOW_WWW_USER_CREATE: 'true'
      _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}
      _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}
    user: "0:0"
    volumes:
      - .:/sources

  airflow-cli:
    <<: *airflow-common
    profiles:
      - debug
    environment:
      <<: *airflow-common-env
      CONNECTION_CHECK_MAX_COUNT: "0"
    # Workaround for entrypoint issue. See: https://github.com/apache/airflow/issues/16252
    command:
      - bash
      - -c
      - airflow

  flower:
    <<: *airflow-common
    command: celery flower
    ports:
      - 5555:5555
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:5555/"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  spark:
    image: docker.io/bitnami/spark:3
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '8080:8080'
    volumes:
      - ./spark-apps:/opt/spark-apps

  spark-worker:
    image: docker.io/bitnami/spark:3
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '8081:8081'

volumes:
  postgres-db-volume:

rsl1atfo (answer 1):

I had a similar experience trying to set up a Databricks connection and solved it by adding a Dockerfile; the approach for adding the Spark provider is exactly the same.
I spent a fair amount of time digging around online, but following the breadcrumbs in the comments under &airflow-common in the docker-compose file is what led me to this documentation.
For reference, I was using Docker Desktop 4.12.0 and VS Code 1.62.3.
Steps:
1. In the same directory where you saved the docker-compose.yaml template, add another file named Dockerfile (no file extension).
2. Open the Dockerfile and add the following lines before saving (change the Airflow and provider versions as needed):

FROM apache/airflow:2.3.0
RUN pip install --no-cache-dir apache-airflow-providers-apache-spark==2.1.3

3. Open your docker-compose.yaml file, comment out the image line, and uncomment the build line below it:

# image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.2.3}
build: .

4. Once the docker-compose file is saved, first build the image with the following bash command:

docker-compose build

5. If the command above runs without errors, spin up the Airflow instance (now including the newly added provider) with:

docker-compose up -d
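(Optional) Once the containers are up, you can confirm the provider actually made it into the image by running `docker-compose exec airflow-webserver airflow providers list` and checking that apache-airflow-providers-apache-spark appears in the output.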

Once Docker has finished initializing the Airflow instance and the webserver is up, you should see Spark listed on the Providers page and available in the Connection Type dropdown in the UI:
[Screenshot: Airflow connection types after adding the Dockerfile]
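As a quick sanity check once the connection exists, a minimal DAG along the lines below will exercise the provider. It assumes a Spark connection with conn id spark_default was created in the UI with host spark://spark and port 7077 (the standalone master defined in the compose file), and that a small PySpark script is kept next to the DAGs so the worker can read it; the conn id and script path are illustrative, so adjust them to your setup.

# dags/spark_submit_example.py: a minimal sketch, assuming the Spark provider is installed
# and a connection "spark_default" (spark://spark:7077) exists in the Airflow UI.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_submit_example",
    start_date=datetime(2022, 11, 1),
    schedule_interval=None,  # trigger manually from the UI
    catchup=False,
) as dag:
    SparkSubmitOperator(
        task_id="submit_example_job",
        conn_id="spark_default",  # the connection created in the UI
        # Illustrative path: a PySpark script stored in the mounted ./dags folder
        application="/opt/airflow/dags/spark_jobs/pi_example.py",
        verbose=True,
    )

Keep in mind that SparkSubmitOperator shells out to spark-submit on the Airflow worker, so the extended image may additionally need a Java runtime for the submit step to succeed.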
Voilà! Hopefully this solves it for you and for anyone else running into Airflow provider issues when running on Docker.

