I recently installed Spark 3.0.1 (with Hadoop 2.7) on my Windows PC. I am trying to read an Avro-format file from AWS S3 with the following code:
spark_df = spark.read.format("avro").load("s3/path")
When I run the command, I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Spark\spark-3.0.1-bin-hadoop2.7\python\pyspark\sql\readwriter.py", line 178, in load
return self._df(self._jreader.load(path))
File "C:\Spark\spark-3.0.1-bin-hadoop2.7\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1304, in __call__
File "C:\Spark\spark-3.0.1-bin-hadoop2.7\python\pyspark\sql\utils.py", line 134, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException: Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide"
I have already set up spark-avro following Apache's instructions:
./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.0.1 ...
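Since I am calling Spark from Python rather than from the Scala shell, I assume the equivalent is to pass the same package coordinates to the PySpark launcher (copied from the command above):

./bin/pyspark --packages org.apache.spark:spark-avro_2.12:3.0.1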
As additional information: I am using Spark locally on my PC, not on a cluster. I think I have read every related question on Stack Overflow, but none of them helped me solve this. I am wondering whether I have missed something, or whether anything else needs to be done to read Avro files with Spark.
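For reference, here is a minimal sketch of how I understand the package can also be requested from a plain Python script, assuming the spark.jars.packages config is honored when the session is first created (the app name and the "s3/path" placeholder are mine):

from pyspark.sql import SparkSession

# Sketch: ask Spark to resolve the external spark-avro module at session
# creation, instead of passing --packages to a launcher script.
spark = (
    SparkSession.builder
    .appName("read-avro-example")  # hypothetical app name
    .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.0.1")
    .getOrCreate()
)

# Same read as in the question; "s3/path" stands in for the real S3 URI.
spark_df = spark.read.format("avro").load("s3/path")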