检测到有冲突的分区列名pyspark databricks

a0zr77ik  于 2021-05-16  发布在  Spark
关注(0)|答案(1)|浏览(547)

我想在databricks中读取一个带有pyspark的csv文件。marketingstartdate是这种格式 yyyyMMdd 以及 lastweek = marketingStartDate -7days ```
readFactToDataFrame('Facts',
'Fact.csv',startDate=str(dateBefore),
endDate=str(marketingStartDate),
inferSchema=False)

我收到这个错误信息。你知道问题出在哪里吗?
![](https://i.stack.imgur.com/PX8xZ.png)

df = spark.read.format(fileFormat).options(header=True, inferSchema=
inferSchema, delimiter = columnDelimiter).load(URL).filter("Year *
10000 + Month * 100 + Day = "+str(startDate) + " AND Year * 10000 +
Month * 100 + Day <=" + str(endDate)) 37 elif fileFormat == "json": 38
df =
spark.read.format(fileFormat).options(multiline=True).load(URL).filter("Year

  • 10000 + Month * 100 + Day = "+str(startDate) + " AND Year * 10000 + Month * 100 + Day <=" + str(endDate))

/databricks/spark/python/pyspark/sql/readwriter.py in load(self, path,
format, schema,**options) 164 self.options(**options) 165 if
isinstance(path, basestring): -- 166 return
self._df(self._jreader.load(path)) 167 elif path is not None: 168 if
type(path) != list:
/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py
in call(self, *args) 1255 answer =
self.gateway_client.send_command(command) 1256 return_value =
get_return_value( - 1257 answer, self.gateway_client, self.target_id,
self.name) 1258 1259 for temp_arg in temp_args:
/databricks/spark/python/pyspark/sql/utils.py in deco(*a,**kw) 61 def
deco(*a,**kw): 62 try: --- 63 return f(*a,**kw) 64 except
py4j.protocol.Py4JJavaError as e: 65 s = e.java_exception.toString()
/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in
get_return_value(answer, gateway_client, target_id, name) 326 raise
Py4JJavaError( 327 "An error occurred while calling {0}{1}{2}.\n". --
328 format(target_id, ".", name), value) 329 else: 330 raise
Py4JError( Py4JJavaError: An error occurred while calling o2301.load.
: java.lang.AssertionError: assertion failed: Conflicting partition
column names detected: Partition column name list #0: Year, Month, Day
Partition column name list #1: Year, Month For partitioned table
directories, data files should only live in leaf directories. And
directories at the same level should have the same partition column
name. Please check the following directories for unexpected files or
inconsistent partition column names:
dbfs:/mnt/DL/Facts/Year=2020/Month=08
dbfs:/mnt/DL/Facts/Year=2020/Month=03/Day=16
dbfs:/mnt/DL/Facts/Year=2019/Month=09/Day=27
dbfs:/mnt/DL/Facts/Year=2019/Month=09/Day=02
dbfs:/mnt/DL/Facts/Year=2020/Month=08/Day=01
dbfs:/mnt/DL/Facts/Year=2020/Month=03/Day=09
dbfs:/mnt/DL/Facts/Year=2020/Month=02/Day=26
dbfs:/mnt/DL/Facts/Year=2020/Month=08/Day=10
dbfs:/mnt/DL/Facts/Year=2019/Month=09/Day=12
dbfs:/mnt/DL/Facts/Year=2019/Month=10/Day=12
dbfs:/mnt/DL/Facts/Year=2020/Month=03/Day=24
dbfs:/mnt/DL/Facts/Year=2019/Month=09/Day=05
dbfs:/mnt/DL/Facts/Year=2020/Month=03/Day=13
dbfs:/mnt/DL/Facts/Year=2019/Month=10/Day=27
dbfs:/mnt/DL/Facts/Year=2019/Month=09/Day=16
dbfs:/mnt/DL/Facts/Year=2020/Month=02/Day=20
dbfs:/mnt/DL/Facts/Year=2020/Month=03/Day=31 at
scala.Predef$.assert(Predef.scala:170) at
org.apache.spark.sql.execution.datasources.PartitioningUtils$.resolvePartitions(PartitioningUtils.scala:396)
at
org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:197)
at
org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:132)

9jyewag0

9jyewag01#

由于源位置dbfs:/mnt/dl/facts有两个不同的分区结构,因此引发了以下错误。

java.lang.AssertionError: assertion failed: Conflicting partition column names detected: 
Partition column name list #0: Year, Month, Day 
Partition column name list #1: Year, Month

错误详细信息指向有问题的目录:

dbfs:/mnt/DL/Facts/Year=2020/Month=08

检查上面的databricks目录,看看这个位置上是否有任何文件。您可以删除它们或移动到其他目录。如果上面的目录中没有文件,您可以删除目录本身。
我希望这有助于解决这个问题。

相关问题