Py4JJavaError when reading from S3 with PySpark

Asked by pcrecxhr on 2022-11-01 in Spark

Whenever I try to read a file from my S3 bucket I get the same error. I am working in jupyter-lab, but I get the same result on my personal laptop without jupyter-lab. Here is my code:


from pyspark import SparkConf
from pyspark.sql import SparkSession

# My Spark configuration

conf = SparkConf()
conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.0')

# conf.set('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider')

conf.set('spark.hadoop.fs.s3a.access.key', key)
conf.set('spark.hadoop.fs.s3a.secret.key', secret)
spark = SparkSession.builder.config(conf=conf).getOrCreate()

# Path to my test file (which I can read locally with the same code)

path = "s3a://bucket-name/folder/test.csv"

csv = spark.read.format("csv").load(path)

The error is as follows:

Py4JJavaError: An error occurred while calling o37.load.
: java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;Ljava/lang/Object;)V
    at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:893)
    at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:869)
    at org.apache.hadoop.fs.s3a.S3AUtils.getEncryptionAlgorithm(S3AUtils.java:1580)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:341)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
    at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
    at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:53)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.lang.Thread.run(Thread.java:750)

Sometimes "calling o37.load" shows a different number, e.g. o43.
I also get some warnings during SparkSession.builder:

:: loading settings :: url = jar:file:/home/ubuntu/.local/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/ubuntu/.ivy2/cache
The jars for the packages stored in: /home/ubuntu/.ivy2/jars
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-840765ed-4adc-4453-b354-a3a8093d3776;1.0
    confs: [default]
    found org.apache.hadoop#hadoop-aws;3.3.0 in central
    found com.amazonaws#aws-java-sdk-bundle;1.11.563 in central
    found org.wildfly.openssl#wildfly-openssl;1.0.7.Final in central
downloading https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.563/aws-java-sdk-bundle-1.11.563.jar ...
    [SUCCESSFUL ] com.amazonaws#aws-java-sdk-bundle;1.11.563!aws-java-sdk-bundle.jar (4888ms)
downloading https://repo1.maven.org/maven2/org/wildfly/openssl/wildfly-openssl/1.0.7.Final/wildfly-openssl-1.0.7.Final.jar ...
    [SUCCESSFUL ] org.wildfly.openssl#wildfly-openssl;1.0.7.Final!wildfly-openssl.jar (22ms)
:: resolution report :: resolve 697ms :: artifacts dl 4998ms
    :: modules in use:
    com.amazonaws#aws-java-sdk-bundle;1.11.563 from central in [default]
    org.apache.hadoop#hadoop-aws;3.3.0 from central in [default]
    org.wildfly.openssl#wildfly-openssl;1.0.7.Final from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   3   |   0   |   0   |   0   ||   3   |   2   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-840765ed-4adc-4453-b354-a3a8093d3776
    confs: [default]
    3 artifacts copied, 0 already retrieved (128050kB/319ms)
22/10/08 13:55:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

I followed this Medium tutorial to make sure the hadoop / java-sdk versions are compatible.
I also tried reading with the "binaryFiles" and "images" formats, with the same result.
I can read the file with boto3 in plain Python, but I need to use PySpark.
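
For reference, a minimal boto3 sketch of the read that does work (bucket-name, folder/test.csv, key, and secret are the same placeholders as in the Spark code above):

import boto3

# Build an S3 client with the same credentials the Spark conf uses
s3 = boto3.client(
    "s3",
    aws_access_key_id=key,
    aws_secret_access_key=secret,
)

# Fetch the same object that the s3a:// path points to
obj = s3.get_object(Bucket="bucket-name", Key="folder/test.csv")
print(obj["Body"].read().decode("utf-8")[:200])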

Answer 1 (zzwlnbp8):

The problem was the hadoop version used in my SparkSession configuration. I ran the following command to find my hadoop version:

> print(f"Hadoop version = {spark._jvm.org.apache.hadoop.util.VersionInfo.getVersion()}")
Hadoop version = 3.3.2

where spark is the result of spark = SparkSession.builder.config(conf=conf).getOrCreate().
So I had to change the first line of my configuration to:

conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.2')
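
Putting it together, a minimal sketch of the working setup (assuming, as above, that the bundled Hadoop is 3.3.2 and that key and secret hold valid AWS credentials):

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
# The hadoop-aws version must match the Hadoop version bundled with PySpark (3.3.2 here)
conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.2')
conf.set('spark.hadoop.fs.s3a.access.key', key)
conf.set('spark.hadoop.fs.s3a.secret.key', secret)

spark = SparkSession.builder.config(conf=conf).getOrCreate()

# Sanity check: should print the same version as the hadoop-aws package above
print(f"Hadoop version = {spark._jvm.org.apache.hadoop.util.VersionInfo.getVersion()}")

csv = spark.read.format("csv").load("s3a://bucket-name/folder/test.csv")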

But! The error still occurs when I use read.format("images"), so I am still looking into that, even though it now works with binary files.
