Reading data from another drive with PySpark (Windows 10) [duplicate]

x33g5p2x · posted 2023-02-09 in Windows
• This question already has answers here:

How to access local files in Spark on Windows? (5 answers)
Closed 2 hours ago.
I am new to Spark and PySpark. I downloaded the data from here (1.75 GB zipped, multiple .csv files) and stored it on my D: drive, separate from the Spark installation and my PySpark scripts on the C: drive.
When I try to read them, I get the following error:

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
Cell In[12], line 3
      1 df = spark.read.option("header", True) \
      2                 .option("inferSchema", True) \
----> 3                 .csv("\airport_delay")

File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pyspark\sql\readwriter.py:535, in DataFrameReader.csv(self, path, schema, sep, encoding, quote, escape, comment, header, inferSchema, ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace, nullValue, nanValue, positiveInf, negativeInf, dateFormat, timestampFormat, maxColumns, maxCharsPerColumn, maxMalformedLogPerPartition, mode, columnNameOfCorruptRecord, multiLine, charToEscapeQuoteEscaping, samplingRatio, enforceSchema, emptyValue, locale, lineSep, pathGlobFilter, recursiveFileLookup, modifiedBefore, modifiedAfter, unescapedQuoteHandling)
    533 if type(path) == list:
    534     assert self._spark._sc._jvm is not None
--> 535     return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
    536 elif isinstance(path, RDD):
    538     def func(iterator):

File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\py4j\java_gateway.py:1321, in JavaMember.__call__(self, *args)
   1315 command = proto.CALL_COMMAND_NAME +\
   1316     self.command_header +\
   1317     args_command +\
   1318     proto.END_COMMAND_PART
   1320 answer = self.gateway_client.send_command(command)
-> 1321 return_value = get_return_value(
   1322     answer, self.gateway_client, self.target_id, self.name)
   1324 for temp_arg in temp_args:
   1325     temp_arg._detach()

File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pyspark\sql\utils.py:196, in capture_sql_exception.<locals>.deco(*a, **kw)
    192 converted = convert_exception(e.java_exception)
    193 if not isinstance(converted, UnknownException):
    194     # Hide where the exception came from that shows a non-Pythonic
    195     # JVM exception message.
--> 196     raise converted from None
    197 else:
    198     raise

AnalysisException: Path does not exist: file:/C:/Users/Travail/Documents/PySpark/irport_delay
Here is my script:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
                    .master("local[1]") \
                    .appName("Test1") \
                    .getOrCreate()

df = spark.read.option("header", True) \
                .option("inferSchema", True) \
                .csv("file:\\\D:\Dataset\airport_delay")

How can I read data from another drive with PySpark? Or does this make no sense at all?
What I have tried:

- adding/removing the "file:" prefix
- reading through the Spark configuration docs for something like spark.sql.warehouse.dir
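
One hint I noticed: the exception complains about irport_delay, not airport_delay. In a regular Python string literal, "\a" is the escape sequence for the ASCII bell character, so the path Spark actually receives has lost its leading "a". A quick REPL check (illustrative only) shows it:

>>> print(repr("\airport_delay"))
'\x07irport_delay'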

u2nhd7ah1#

I tried changing all of the "\" to "/", and it worked:

df = spark.read.option("header", True) \
            .option("inferSchema", True) \
            .csv("file:///D:/Dataset/airport_delay")
