Hadoop 3 GCS connector not working with the latest Spark 3 in standalone mode

goqiplq2 posted on 2022-12-11 in Hadoop

I wrote a simple Scala application that reads a parquet file from a GCS bucket.

  • JDK 17
  • Scala 2.12.17
  • Spark SQL 3.3.1
  • GCS connector hadoop3-2.2.7

The connector is taken from Maven and imported via sbt (the Scala build tool). I am not using the latest 2.2.9 release because of this issue.
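In sbt terms the import is a single line; a minimal sketch, using the connector's published Maven coordinates:

  // build.sbt (sketch)
  libraryDependencies += "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop3-2.2.7"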
The application works perfectly in local mode, so I tried switching to standalone mode.
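For context, such an app boils down to something like this minimal sketch (the bucket path is a placeholder, not the asker's actual code):

  import org.apache.spark.sql.SparkSession

  object Example {
    def main(args: Array[String]): Unit = {
      // session settings come from the config file shown further down
      val spark = SparkSession.builder().getOrCreate()
      spark.read.parquet("gs://YOUR_GCP_BUCKET/path/to/file.parquet").show()
      spark.stop()
    }
  }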
Here is what I did:
1. Downloaded Spark 3.3.1 from here
2. Started the cluster manually, as shown here (see the sketch below)
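"Starting the cluster manually" means the standard standalone scripts; a sketch, assuming a single master and one worker (the host name is a placeholder):

  # on the master node
  $SPARK_HOME/sbin/start-master.sh
  # on each worker node
  $SPARK_HOME/sbin/start-worker.sh spark://YOUR_SPARK_MASTER_HOST:7077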
Then I tried running the application again, but I ran into the following error:

  [error] Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
  [error] at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2688)
  [error] at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
  [error] at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
  [error] at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
  [error] at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
  [error] at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
  [error] at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
  [error] at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
  [error] at org.apache.parquet.hadoop.util.HadoopInputFile.fromStatus(HadoopInputFile.java:44)
  [error] at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:44)
  [error] at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:484)
  [error] ... 14 more
  [error] Caused by: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
  [error] at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2592)
  [error] at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2686)
  [error] ... 24 more

Somehow it fails to find the connector's file system class: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
My Spark configuration is very basic:

  spark.app.name = "Example app"
  spark.master = "spark://YOUR_SPARK_MASTER_HOST:7077"
  spark.hadoop.fs.defaultFS = "gs://YOUR_GCP_BUCKET"
  spark.hadoop.fs.gs.impl = "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
  spark.hadoop.fs.AbstractFileSystem.gs.impl = "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"
  spark.hadoop.google.cloud.auth.service.account.enable = true
  spark.hadoop.google.cloud.auth.service.account.json.keyfile = "src/main/resources/gcp_key.json"
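The same settings can also be applied when building the session; a minimal sketch of the programmatic equivalent (not the asker's actual code):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("Example app")
    .master("spark://YOUR_SPARK_MASTER_HOST:7077")
    .config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .config("spark.hadoop.fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "src/main/resources/gcp_key.json")
    .getOrCreate()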

v9tzhpje1#

I found out that the Maven build of the GCS Hadoop connector is missing some of its dependencies internally.
I fixed it with one of two approaches: either using the shaded connector jar, which bundles all of its dependencies (see the other answer), or downloading the missing dependencies manually.

To implement the second option, I unzipped the gcs-connector jar file, looked up its pom.xml, copied the dependencies into a new standalone xml file, and downloaded them with the command mvn dependency:copy-dependencies -DoutputDirectory=/path/to/pyspark/jars/
Below is an example of the pom.xml I created; note that I am using version 2.2.9 of the connector:

  <project xmlns="http://maven.apache.org/POM/4.0.0"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <name>TMP_PACKAGE_NAME</name>
    <description>
      jar dependencies of gcs hadoop connector
    </description>
    <!--'com.google.oauth-client:google-oauth-client:jar:1.34.1'
    -->
    <groupId>TMP_PACKAGE_GROUP</groupId>
    <artifactId>TMP_PACKAGE_NAME</artifactId>
    <version>0.0.1</version>
    <dependencies>
      <dependency>
        <groupId>com.google.cloud.bigdataoss</groupId>
        <artifactId>gcs-connector</artifactId>
        <version>hadoop3-2.2.9</version>
      </dependency>
      <dependency>
        <groupId>com.google.api-client</groupId>
        <artifactId>google-api-client-jackson2</artifactId>
        <version>2.1.0</version>
      </dependency>
      <dependency>
        <groupId>com.google.guava</groupId>
        <artifactId>guava</artifactId>
        <version>31.1-jre</version>
      </dependency>
      <dependency>
        <groupId>com.google.oauth-client</groupId>
        <artifactId>google-oauth-client</artifactId>
        <version>1.34.1</version>
      </dependency>
      <dependency>
        <groupId>com.google.cloud.bigdataoss</groupId>
        <artifactId>util</artifactId>
        <version>2.2.9</version>
      </dependency>
      <dependency>
        <groupId>com.google.cloud.bigdataoss</groupId>
        <artifactId>util-hadoop</artifactId>
        <version>hadoop3-2.2.9</version>
      </dependency>
      <dependency>
        <groupId>com.google.cloud.bigdataoss</groupId>
        <artifactId>gcsio</artifactId>
        <version>2.2.9</version>
      </dependency>
      <dependency>
        <groupId>com.google.auto.value</groupId>
        <artifactId>auto-value-annotations</artifactId>
        <version>1.10.1</version>
        <scope>runtime</scope>
      </dependency>
      <dependency>
        <groupId>com.google.flogger</groupId>
        <artifactId>flogger</artifactId>
        <version>0.7.4</version>
      </dependency>
      <dependency>
        <groupId>com.google.flogger</groupId>
        <artifactId>google-extensions</artifactId>
        <version>0.7.4</version>
      </dependency>
      <dependency>
        <groupId>com.google.flogger</groupId>
        <artifactId>flogger-system-backend</artifactId>
        <version>0.7.4</version>
      </dependency>
      <dependency>
        <groupId>com.google.code.gson</groupId>
        <artifactId>gson</artifactId>
        <version>2.10</version>
      </dependency>
    </dependencies>
  </project>
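With a pom like this in place, running the copy-dependencies goal from the same directory pulls all the jars into Spark's jars folder; a sketch, with the output path as a placeholder:

  mvn -f pom.xml dependency:copy-dependencies -DoutputDirectory=/path/to/spark/jars/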

I hope this helps.


xpszyzbs2#

This happens because Spark ships an old version of the Guava library and you are using a non-shaded GCS connector jar. To make it work, you just need to use the shaded GCS connector jar from Maven, for example: https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/hadoop3-2.2.9/gcs-connector-hadoop3-2.2.9-shaded.jar
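For example, after downloading the shaded jar, it can be shipped through the same config style the question already uses (a sketch; the local path is a placeholder):

  # ship the shaded connector to the driver and executors
  spark.jars = "/path/to/gcs-connector-hadoop3-2.2.9-shaded.jar"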
