How to connect S3 to PySpark locally (org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3")

fwzugrvs asked on 2022-11-01 in Hadoop

Spark version = 3.2.1
Hadoop version = 3.3.1
I have followed all the posts on StackOverflow but cannot get this to run. I am new to Spark and am trying to read a JSON file.
On my local Mac I installed Homebrew and then PySpark. I just downloaded the jars; do I need to save them somewhere?

The jars downloaded are:

    hadoop-aws-3.3.1
    aws-java-sdk-bundle-1.12.172

I saved them under /usr/local/Cellar/apache-spark/3.2.1/libexec/jars.
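
For reference, the jars can also be handed to Spark explicitly rather than relying on them being picked up from the jars directory. Below is a minimal sketch, assuming the Homebrew path above and that the downloaded files carry the usual .jar extension (both assumptions, not confirmed by the question):

    from pyspark.sql import SparkSession

    # Minimal sketch (not the asker's code): pass the downloaded jars to
    # Spark explicitly via spark.jars, using the paths mentioned above.
    jars_dir = "/usr/local/Cellar/apache-spark/3.2.1/libexec/jars"
    jars = ",".join([
        f"{jars_dir}/hadoop-aws-3.3.1.jar",
        f"{jars_dir}/aws-java-sdk-bundle-1.12.172.jar",
    ])

    spark = (
        SparkSession.builder
        .appName("s3-test")
        .config("spark.jars", jars)  # comma-separated list of local jar paths
        .getOrCreate()
    )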

    # /opt/python/latest/bin/pip3 list
    import os
    import pyspark
    from pyspark.sql import SparkSession
    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SparkSession
    access_id = "A*"
    access_key = "C*"
    from pyspark import SparkConf, SparkContext, SQLContext
    from pyspark.sql import *
    ## set Spark properties
    conf = SparkConf()
    conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.1')
    sc = SparkContext(conf=conf)
    spark = SparkSession(sc)
    spark._jsc.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", access_id)
    spark._jsc.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", access_key)
    spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    df = spark.read.json("s3://pt/raw/Deal_20220114.json")
    df.show()

Error:

    org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"

How I am running it locally:

    spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.1 test.py

Error:

    org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3443)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
        at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
        at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:747)
        at scala.collection.immutable.List.map(List.scala:293)
        at org.apache.spark.sql.execution.datasources.DataSource$.checkAndGlobPathIfNecessary(DataSource.scala:745)
        at org.apache.spark.sql.execution.datasources.DataSource.checkAndGlobPathIfNecessary(DataSource.scala:577)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
        at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
        at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:405)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
        at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
        at java.lang.Thread.run(Thread.java:748)
    22/03/06 19:31:36 INFO SparkContext: Invoking stop() from shutdown hook
    22/03/06 19:31:36 INFO SparkUI: Stopped Spark web UI at http://adityakxc5zmd6m.attlocal.net:4041
    22/03/06 19:31:36 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
    22/03/06 19:31:36 INFO MemoryStore: MemoryStore cleared
    22/03/06 19:31:36 INFO BlockManager: BlockManager stopped
    22/03/06 19:31:36 INFO BlockManagerMaster: BlockManagerMaster stopped
    22/03/06 19:31:37 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
    22/03/06 19:31:37 INFO SparkContext: Successfully stopped SparkContext
    22/03/06 19:31:37 INFO ShutdownHookManager: Shutdown hook called
    22/03/06 19:31:37 INFO ShutdownHookManager: Deleting directory /private/var/folders/7q/r0xvmq6n4p55r6d8nx9gmd7c0000gr/T/spark-1e346a99-5d6f-498d-9bc4-ce8a3f951718
    22/03/06 19:31:37 INFO ShutdownHookManager: Deleting directory /private/var/folders/7q/r0xvmq6n4p55r6d8nx9gmd7c0000gr/T/spark-f2727124-4f4b-4ee3-a8c0-607e207a3a98/pyspark-b96bf92a-84e8-409e-b9df-48f303c57b70
    22/03/06 19:31:37 INFO ShutdownHookManager: Deleting directory /private/var/folders/7q/r0xvmq6n4p55r6d8nx9gmd7c0000gr/T/spark-f2727124-4f4b-4ee3-a8c0-607e207a3a98

Answer 1 (tuwxkamq):

No FileSystem for scheme "s3"

You have not configured fs.s3.impl, so Spark does not know what to do with file paths that begin with s3://.
Using fs.s3a.impl is recommended instead, in which case you access the files with s3a:// paths.
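
To make the fix concrete, here is a minimal sketch of the corrected setup, assuming the same bucket path as in the question but read through the s3a:// scheme. Note that the S3A connector reads credentials from fs.s3a.access.key and fs.s3a.secret.key; the fs.s3a.awsAccessKeyId / fs.s3a.awsSecretAccessKey names used in the question are not recognized by S3A:

    from pyspark.sql import SparkSession

    access_id = "A*"   # placeholder credentials, as in the question
    access_key = "C*"

    spark = (
        SparkSession.builder
        .appName("s3a-read")
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.1")
        # Property names the S3A connector actually reads:
        .config("spark.hadoop.fs.s3a.access.key", access_id)
        .config("spark.hadoop.fs.s3a.secret.key", access_key)
        .getOrCreate()
    )

    # s3a:// matches org.apache.hadoop.fs.s3a.S3AFileSystem; s3:// does not.
    df = spark.read.json("s3a://pt/raw/Deal_20220114.json")
    df.show()

Alternatively, if the s3:// scheme must be kept, setting fs.s3.impl to org.apache.hadoop.fs.s3a.S3AFileSystem routes s3:// paths to the same connector.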
