Reading a Parquet dataset in pyspark when the pandas DataFrame has datetime64 columns

yhived7q · published 2023-11-15 in Spark

When reading a Parquet dataset created from a pandas DataFrame with a datetime64[ns] column, how can I avoid org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,false))? A minimal example:

import pandas as pd
from pyspark.sql import SparkSession

# pandas DataFrame with datetime64[ns] column
pdf = pd.DataFrame(data={'time': pd.date_range('10/1/23', '10/7/23', freq='D')})
pdf.to_parquet('<path>/data.parquet')

# read parquet dataset - creates Illegal Parquet type
spark = SparkSession.builder.getOrCreate()
sdf = spark.read.parquet('<path>/data.parquet')

# recover original dataframe
df = sdf.toPandas()

The goal is to read the Parquet dataset and receive the time column as a pyspark TimestampType.
Converting the datetime64[ns] column to object dtype is a workaround, but not an ideal one. One such workaround - pdf['time'] = pd.Series(pdf['time'].dt.to_pydatetime(), dtype=object) - raises FutureWarning: Passing unit-less datetime64 dtype to .astype is deprecated and will raise in a future version. Pass 'datetime64[ns]' instead when converting the Spark DataFrame back to pandas.
Running pandas-1.5.3 and pyspark-3.4.1 on a Linux instance.


wko9yo5t1#

By default, pandas stores a DatetimeIndex as datetime64[ns] (nanoseconds). Store the datetimes as datetime64[ms] (milliseconds) instead, so that PySpark can load the Parquet file correctly:

pdf.astype({'time': 'datetime64[ms]'}).to_parquet('<path>/data.parquet')

# or create the data in milliseconds directly (unit= requires pandas >= 2.0):
# pd.date_range('10/1/23', '10/7/23', freq='D', unit='ms')

Output:

>>> sdf.show()
+-------------------+
|               time|
+-------------------+
|2023-10-01 00:00:00|
|2023-10-02 00:00:00|
|2023-10-03 00:00:00|
|2023-10-04 00:00:00|
|2023-10-05 00:00:00|
|2023-10-06 00:00:00|
|2023-10-07 00:00:00|
+-------------------+

>>> sdf.dtypes
[('time', 'timestamp_ntz')]
