pyspark: FloatType values look wrong when writing data to a parquet file

Posted by lf5gs5x2 on 2021-05-27 in Spark

I have the following schema:

root
 |-- A: string (nullable = true)
 |-- B: float (nullable = true)

When I apply the schema to the data, the DataFrame's float column is populated with values that differ from the originals.

Original Data :-
("floadVal1", 0.404413386),
("floadVal2", 0.28563),
("floadVal3", 0.591290286),
("floadVal4", 0.404413386),
("floadVal5", 15.37610198),
("floadVal6", 15.261798303),
("floadVal7", 19.887814583),
("floadVal8", 0.0)

Please help me understand what Spark is actually doing to produce the output below.

After Applying Schema Dataframe:- 

+---------+----------+
|        A|         B|
+---------+----------+
|floadVal1|0.40441337|
|floadVal2|   0.28563|
|floadVal3| 0.5912903|
|floadVal4|0.40441337|
|floadVal5| 15.376102|
|floadVal6| 15.261798|
|floadVal7| 19.887815|
|floadVal8|       0.0|
+---------+----------+

After writing to parquet :- 

           A          B
0  floadVal1   0.404413
1  floadVal2   0.285630
2  floadVal3   0.591290
3  floadVal4   0.404413
4  floadVal5  15.376102
5  floadVal6  15.261798
6  floadVal7  19.887815
7  floadVal8   0.000000

According to the Spark 2.4.5 documentation, FloatType represents 4-byte single-precision floating-point numbers.
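A 4-byte float keeps only about 7 significant decimal digits, which would explain why 0.404413386 comes back as 0.40441337. The sketch below reproduces that rounding in plain Python, without Spark, by packing each value into 4 bytes with the standard `struct` module. The helper name `to_float32` is made up for this illustration. The 6-decimal column is an assumption about how the second table was printed (it looks like a default pandas display), not something confirmed by the post.

```python
import struct


def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float, then widen it back."""
    return struct.unpack("f", struct.pack("f", x))[0]


# A few of the values from the question.
values = [0.404413386, 0.28563, 0.591290286, 15.37610198, 0.0]

for v in values:
    f32 = to_float32(v)
    # The relative rounding error of a float32 stays below 2**-23,
    # so digits beyond roughly the 7th significant one are not preserved.
    # The last column formats the rounded value with 6 decimals
    # (assumed display precision of the second table).
    print(f"{v!r:>14} -> {f32!r:<22} 6 decimals: {f32:.6f}")
```

If the extra digits matter, declaring the column as DoubleType (8 bytes) instead of FloatType is the usual way to keep them, at the cost of a wider column.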

Sample Code 

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, FloatType

# Local session; writeLegacyFormat is kept as in the original setup.
spark = (
    SparkSession.builder
    .master("local")
    .config("spark.sql.parquet.writeLegacyFormat", "true")
    .getOrCreate()
)

# Column B is declared as a 4-byte FloatType.
schema = StructType([
    StructField("A", StringType(), True),
    StructField("B", FloatType(), True),
])

df = spark.createDataFrame([
    ("floadVal1", 0.404413386),
    ("floadVal2", 0.28563),
    ("floadVal3", 0.591290286),
    ("floadVal4", 0.404413386),
    ("floadVal5", 15.37610198),
    ("floadVal6", 15.261798303),
    ("floadVal7", 19.887814583),
    ("floadVal8", 0.0),
], schema)

df.printSchema()
df.show()
df.write.format("parquet").save("floatTestParFile")
