I have the following schema:
root
|-- A: string (nullable = true)
|-- B: float (nullable = true)
When I apply this schema to the data, the values in the float column of the resulting dataframe are populated incorrectly.
Original Data :-
("floadVal1", 0.404413386),
("floadVal2", 0.28563),
("floadVal3", 0.591290286),
("floadVal4", 0.404413386),
("floadVal5", 15.37610198),
("floadVal6", 15.261798303),
("floadVal7", 19.887814583),
("floadVal8", 0.0)
Please help me understand what Spark is actually doing here and why it produces the output below.
After Applying Schema Dataframe:-
+---------+----------+
| A| B|
+---------+----------+
|floadVal1|0.40441337|
|floadVal2| 0.28563|
|floadVal3| 0.5912903|
|floadVal4|0.40441337|
|floadVal5| 15.376102|
|floadVal6| 15.261798|
|floadVal7| 19.887815|
|floadVal8| 0.0|
+---------+----------+
After writing to parquet :-
A B
0 floadVal1 0.404413
1 floadVal2 0.285630
2 floadVal3 0.591290
3 floadVal4 0.404413
4 floadVal5 15.376102
5 floadVal6 15.261798
6 floadVal7 19.887815
7 floadVal8 0.000000
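(For reference, the table above looks like the Parquet output read back with pandas, whose default display rounds floats to six decimal places. A guess at how it might have been produced, assuming pyarrow is installed; this is not part of the original sample code:)

import pandas as pd

# Hypothetical reproduction of the table above: read the saved Parquet
# directory back with pandas; its default float formatting shows six decimals.
print(pd.read_parquet('floatTestParFile'))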
According to the Spark 2.4.5 docs, FloatType: Represents 4-byte single-precision floating point numbers.
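That 4-byte storage explains the rounding: the original literals are 64-bit Python floats, and casting them to 32-bit loses precision. A minimal sketch (assuming NumPy is available; it is not used in the question's code) that reproduces the values df.show() displays:

import numpy as np

# Cast the original 64-bit literals down to 32-bit, which is what
# FloatType stores; the result matches the rounded values Spark shows.
for v in [0.404413386, 0.28563, 0.591290286, 15.37610198,
          15.261798303, 19.887814583, 0.0]:
    print(v, "->", np.float32(v))
# e.g. 0.404413386 -> 0.40441337, 15.37610198 -> 15.376102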
Sample Code

from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.types import StructType, StructField, StringType, FloatType

spark = SparkSession.builder.master('local').config(
    "spark.sql.parquet.writeLegacyFormat", 'true').getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)

# Schema: A as string, B as 4-byte single-precision float
schema = StructType([
    StructField("A", StringType(), True),
    StructField("B", FloatType(), True)])

df = spark.createDataFrame([
    ("floadVal1", 0.404413386),
    ("floadVal2", 0.28563),
    ("floadVal3", 0.591290286),
    ("floadVal4", 0.404413386),
    ("floadVal5", 15.37610198),
    ("floadVal6", 15.261798303),
    ("floadVal7", 19.887814583),
    ("floadVal8", 0.0)
], schema)

df.printSchema()
df.show()
df.write.format("parquet").save('floatTestParFile')