Converting a PySpark DataFrame to a pandas DataFrame fails on a timestamp column

pjngdqdw · posted 2023-04-28 in Spark

I create my PySpark DataFrame:

from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, ArrayType, StringType, TimestampType

spark = SparkSession.builder.getOrCreate()

input_schema = StructType([
    StructField("key", StringType()),
    StructField("headers", ArrayType(
        StructType([
            StructField("key", StringType()),
            StructField("value", StringType())
        ])
    )),
    StructField("timestamp", TimestampType())
])

input_data = [
    ("key1", [{"key": "header1", "value": "value1"}], datetime(2023, 1, 1, 0, 0, 0)),
    ("key2", [{"key": "header2", "value": "value2"}], datetime(2023, 1, 1, 0, 0, 0)),
    ("key3", [{"key": "header3", "value": "value3"}], datetime(2023, 1, 1, 0, 0, 0))
]

df = spark.createDataFrame(input_data, input_schema)

I want to use pandas' assert_frame_equal(), so I need to convert my DataFrame to a pandas DataFrame.
df.toPandas() throws TypeError: Casting to unit-less dtype 'datetime64' is not supported. Pass e.g. 'datetime64[ns]' instead.
How can I convert the "timestamp" column without losing the detail of the datetime values? I need them to remain 2023-01-01 00:00:00 rather than 2023-01-01.
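For context, the quoted error comes from pandas 2.x, which no longer accepts a unit-less "datetime64" dtype when casting. A minimal pandas-only sketch of that restriction (the variable names here are illustrative, not from the original post):

import pandas as pd
from datetime import datetime

s = pd.Series([datetime(2023, 1, 1, 0, 0, 0)])

# On pandas 2.x, casting to a dtype with no time unit raises the same
# TypeError quoted above:
#   s.astype("datetime64")

# Passing an explicit unit is accepted:
s = s.astype("datetime64[ns]")
print(s)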


4ktjp1zp1#

I found the solution:

from pyspark.sql.functions import date_format

df = df.withColumn("timestamp", date_format("timestamp", "yyyy-MM-dd HH:mm:ss")).toPandas()

Now I can run

assert_frame_equal(df, test_df)

successfully, with no loss of precision.
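A usage note on this fix: date_format() returns a string column, so after toPandas() the "timestamp" values are plain strings such as "2023-01-01 00:00:00", which is why the full time-of-day detail survives. A minimal sketch of the complete check, assuming the expected data also starts as a Spark DataFrame; to_comparable_pandas and expected_sdf are illustrative names, not from the original post:

from pyspark.sql.functions import date_format
from pandas.testing import assert_frame_equal

def to_comparable_pandas(sdf):
    # Render the timestamp as a formatted string before leaving Spark,
    # so pandas never has to cast to a unit-less datetime64 dtype.
    return sdf.withColumn(
        "timestamp", date_format("timestamp", "yyyy-MM-dd HH:mm:ss")
    ).toPandas()

assert_frame_equal(to_comparable_pandas(df), to_comparable_pandas(expected_sdf))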
