PySpark reading from BigQuery: java.lang.ClassNotFoundException: org.apache.spark.internal.Logging$class

Asked by eyh26e7m on 2021-05-17 in Spark

I created a Dataproc cluster and am trying to submit a local job to it for testing.

    gcloud beta dataproc clusters create test-cluster \
        --region us-central1 \
        --zone us-central1-c \
        --master-machine-type n1-standard-4 \
        --master-boot-disk-size 500 \
        --num-workers 2 \
        --worker-machine-type n1-standard-4 \
        --worker-boot-disk-size 500 \
        --image-version preview-ubuntu18 \
        --project my-project-id \
        --service-account my-service-account@project-id.iam.gserviceaccount.com \
        --scopes https://www.googleapis.com/auth/cloud-platform \
        --tags dataproc,iap-remote-admin \
        --subnet my-vpc \
        --properties spark:spark.jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar

Then I tried to submit a very simple script:

    import argparse
    from datetime import datetime, timedelta

    from pyspark.sql import SparkSession, DataFrame


    def load_data(spark: SparkSession):
        customers = spark.read.format('bigquery') \
            .option('table', 'MY_DATASET.MY_TABLE') \
            .load()
        customers.printSchema()
        customers.show()


    if __name__ == '__main__':
        spark = SparkSession \
            .builder \
            .master('yarn') \
            .appName('my-test-app') \
            .getOrCreate()
        spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
        load_data(spark)

Each of the following attempts to submit the job failed with almost the same error:

    # 1
    gcloud dataproc jobs submit pyspark myscript.py --cluster=test-cluster --region=us-central1 --jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar

    # 2
    gcloud dataproc jobs submit pyspark myscript.py --cluster=test-cluster --region=us-central1

    # 3
    gcloud dataproc jobs submit pyspark myscript.py --cluster=test-cluster --region=us-central1 --properties spark:spark.jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar

This is the error message:

    Job [d8adf906970f43d2b348eb89728b2b7f] submitted.
    Waiting for job output...
    20/11/12 00:33:44 INFO org.spark_project.jetty.util.log: Logging initialized @3339ms
    20/11/12 00:33:44 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
    20/11/12 00:33:44 INFO org.spark_project.jetty.server.Server: Started @3431ms
    20/11/12 00:33:44 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@3f6c2ca9{HTTP/1.1,[http/1.1]}{0.0.0.0:35517}
    20/11/12 00:33:45 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at test-cluster-m/10.154.64.65:8032
    20/11/12 00:33:45 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at test-cluster-m/10.154.64.65:10200
    20/11/12 00:33:45 INFO org.apache.hadoop.conf.Configuration: resource-types.xml not found
    20/11/12 00:33:45 INFO org.apache.hadoop.yarn.util.resource.ResourceUtils: Unable to find 'resource-types.xml'.
    20/11/12 00:33:45 INFO org.apache.hadoop.yarn.util.resource.ResourceUtils: Adding resource type - name = memory-mb, units = Mi, type = COUNTABLE
    20/11/12 00:33:45 INFO org.apache.hadoop.yarn.util.resource.ResourceUtils: Adding resource type - name = vcores, units = , type = COUNTABLE
    20/11/12 00:33:47 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1605139742119_0002
    Traceback (most recent call last):
      File "/tmp/d8adf906970f43d2b348eb89728b2b7f/vessel_master.py", line 44, in <module>
        load_data(spark)
      File "/tmp/d8adf906970f43d2b348eb89728b2b7f/vessel_master.py", line 14, in load_data
        .option('table', 'MY_DATASET.MY_TABLE')\
      File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 172, in load
      File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
      File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
      File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling o61.load.
    : java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider com.google.cloud.spark.bigquery.BigQueryRelationProvider could not be instantiated
        at java.util.ServiceLoader.fail(ServiceLoader.java:232)
        at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
        at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
        at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
        at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
        at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44)
        at scala.collection.Iterator.foreach(Iterator.scala:941)
        at scala.collection.Iterator.foreach$(Iterator.scala:941)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
        at scala.collection.IterableLike.foreach(IterableLike.scala:74)
        at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
        at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:255)
        at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:249)
        at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
        at scala.collection.TraversableLike.filter(TraversableLike.scala:347)
        at scala.collection.TraversableLike.filter$(TraversableLike.scala:347)
        at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:648)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:213)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:186)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)
    Caused by: java.lang.NoClassDefFoundError: org/apache/spark/internal/Logging$class
        at com.google.cloud.spark.bigquery.BigQueryUtilScala$.<init>(BigQueryUtil.scala:34)
        at com.google.cloud.spark.bigquery.BigQueryUtilScala$.<clinit>(BigQueryUtil.scala)
        at com.google.cloud.spark.bigquery.BigQueryRelationProvider.<init>(BigQueryRelationProvider.scala:41)
        at com.google.cloud.spark.bigquery.BigQueryRelationProvider.<init>(BigQueryRelationProvider.scala:48)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at java.lang.Class.newInstance(Class.java:442)
        at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
        ... 29 more
    Caused by: java.lang.ClassNotFoundException: org.apache.spark.internal.Logging$class
        at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
        ... 39 more

I heard this might be a compatibility issue, so I tried downgrading the cluster to image version 1.5-debian10, but I got the same error.
Any help would be greatly appreciated.


Answer 1 (v2g6jxz6)

The Dataproc preview image ships Spark 3, which is built with Scala 2.12, while the connector jar you referenced is built against Scala 2.11. Change the URL to gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar. (Image version 1.5 also uses Scala 2.12, which is why downgrading to 1.5-debian10 produced the same error.)
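For example, re-running the first submit command from the question with only the jar URL swapped to the Scala 2.12 build (a sketch, assuming the same cluster, region, and script names as above):

    gcloud dataproc jobs submit pyspark myscript.py \
        --cluster=test-cluster \
        --region=us-central1 \
        --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar

    # If the connector is instead pinned at cluster creation time, change that
    # property there as well (all other create flags as in the question):
    #   --properties spark:spark.jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar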
