Why doesn't writing timestamps before 1900 in Spark 3 throw a SparkUpgradeException?

vatpfxk5 · posted 2021-07-12 in Spark

On the page https://www.waitingforcode.com/apache-spark-sql/whats-new-apache-spark-3-proleptic-calendar-date-time-management/read
we can read:
Reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files may have been written by Spark 2.x or legacy versions of Hive, which use a legacy hybrid calendar that is different from the Proleptic Gregorian calendar of Spark 3.0+.
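
A quick way to see which rebase modes a session is using is to query the SQL configs directly; a minimal sketch, assuming the read-side counterpart is named spark.sql.legacy.parquet.datetimeRebaseModeInRead (that name is not quoted in the excerpt above):

scala> spark.conf.get("spark.sql.legacy.parquet.datetimeRebaseModeInWrite")  // rebase mode applied when writing Parquet
scala> spark.conf.get("spark.sql.legacy.parquet.datetimeRebaseModeInRead")   // rebase mode applied when reading Parquet (assumed config name)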
Now consider the following scenario, which does not throw an exception:

scala> spark.conf.get("spark.sql.legacy.parquet.datetimeRebaseModeInWrite")
res27: String = EXCEPTION
scala> Seq(java.sql.Timestamp.valueOf("1899-01-01 00:00:00")).toDF("col").write.parquet("/tmp/someDate")
scala> // why didn't this throw an exception?
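
One detail that may be relevant here (an assumption on my part, not something stated in the question) is the physical Parquet type used for timestamps, which is controlled by spark.sql.parquet.outputTimestampType and defaults to INT96; dates are not affected by this setting:

scala> spark.conf.get("spark.sql.parquet.outputTimestampType")  // INT96 by default; TIMESTAMP_MICROS and TIMESTAMP_MILLIS are the alternatives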

Whereas writing a date before 1582 does throw the exception:

scala> Seq(java.sql.Date.valueOf("1581-01-01")).toDF("col").write.parquet("/tmp/someOtherDate")
21/03/10 19:07:19 ERROR Utils: Aborting task
org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: writing dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z into Parquet files can be dangerous, as the files may be read by Spark 2.x or legacy versions of Hive later, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during writing, to get maximum interoperability. Or set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'CORRECTED' to write the datetime values as it is, if you are 100% sure that the written files will only be read by Spark 3.0+ or other systems that use Proleptic Gregorian calendar.
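
For completeness, the error message itself points to two possible settings; a minimal sketch of applying one of them in the same spark-shell session (which one is appropriate depends on who will read the files, as the message explains):

scala> // write datetime values as-is; safe only if readers use the Proleptic Gregorian calendar (Spark 3.0+)
scala> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
scala> // or rebase to the legacy hybrid calendar for interoperability with Spark 2.x / old Hive:
scala> // spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")
scala> Seq(java.sql.Date.valueOf("1581-01-01")).toDF("col").write.mode("overwrite").parquet("/tmp/someOtherDate")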

Can someone explain this difference?

No answers yet.