Data type mismatch when converting to JSON in PySpark

mxg2im7a · posted 2021-05-27 in Spark

I have a dataframe like the one below. Blank cells indicate missing values.

+---+---+----+---+---+
| a1| b1|  c1| d1| e1|
+---+---+----+---+---+
|  1|  a|foo1|  4|  5|
|   |  b| bar|  4|  6|
|   |  c| mnc|   |  7|
+---+---+----+---+---+
//Schema

root
 |-- a1: long (nullable = true)
 |-- b1: string (nullable = true)
 |-- c1: string (nullable = true)
 |-- d1: long (nullable = true)
 |-- e1: long (nullable = true)
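
For reference, a minimal sketch to reproduce this dataframe (the SparkSession setup is an assumption, not part of the original post):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# None marks the values shown as blanks in the table above;
# Python ints are inferred as LongType, matching the schema.
df = spark.createDataFrame(
    [(1, "a", "foo1", 4, 5),
     (None, "b", "bar", 4, 6),
     (None, "c", "mnc", None, 7)],
    ["a1", "b1", "c1", "d1", "e1"],
)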

I want to convert it to JSON, row by row:

result = df.withColumn( "JSON",to_json(struct([when (df[x].isNotNull(),df[x]).otherwise(F.lit(None)).alias(x)for x in df.columns])))

The result looks like this:

{"a1":1,"b1":"a","c1":"foo1","d1":4,"e1":5}
{"b1":"b","c1":"bar","d1":4,"e1":6} 
{"b1":"c","c1":"mnc","e1":6}

So columns with null values are not added to the JSON structure at all.
One way around this is to replace F.lit(None) with F.lit(""), as below:

result = data.withColumn( "JSON",to_json(struct([when (data[x].isNotNull(),data[x]).otherwise(F.lit("")).alias(x)for x in data.columns])))

But adding F.lit("") converts every value to a string, so the result I get is:

{"a1":"1","b1":"a","c1":"foo1","d1":"4","e1":"5"}
{"a1":"",b1":"b","c1":"bar","d1":"4","e1":"6"} 
{"a1":"","b1":"c","c1":"mnc","d1":"","e1":"6"}

Can you suggest a way to overcome this? For example, is there a way to cast each column back to its original type?
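
One possible approach, offered here only as a sketch and assuming Spark 3.0 or later: to_json accepts an ignoreNullFields option, which keeps null-valued keys in the output without touching the column types.

from pyspark.sql.functions import to_json, struct

# ignoreNullFields is true by default; setting it to "false" keeps
# null fields in the generated JSON instead of dropping them.
# Requires Spark >= 3.0.
result = df.withColumn(
    "JSON",
    to_json(struct(*df.columns), {"ignoreNullFields": "false"}),
)

With this, missing values should appear as explicit JSON nulls (e.g. {"a1":null,"b1":"b","c1":"bar","d1":4,"e1":6}), and numeric columns stay numeric, so no casting back is needed.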
