Creating JSON from a DataFrame

9njqaruj  posted on 2021-05-29 in Spark

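I'm new to Spark and have run into a problem with something I'm working on; any pointers would help. I'm trying to create JSON from a DataFrame, but the toJSON function hasn't helped. My current output DataFrame looks like this: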

+----------+------------+-------------+
|booking_id|status      |count(status)|
+----------+------------+-------------+
|132       |rent count  |6            |
|132       |rent booked |24           |
|132       |rent delayed|6            |
|134       |rent booked |34           |
|134       |rent delayed|21           |
+----------+------------+-------------+

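The output I'm looking for is a DataFrame with one row per booking id, where the statuses and their counts are combined into a single JSON object: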

+----------+-------------------------------------------------------+
|booking_id|status_json                                            |
+----------+-------------------------------------------------------+
|132       |{"rent count": 6, "rent booked": 24, "rent delayed": 6}|
|134       |{"rent booked": 34, "rent delayed": 21}                |
+----------+-------------------------------------------------------+

Thanks in advance.

pwuypxnk  #1

import org.apache.spark.sql.functions.{col, collect_list, map, to_json}
import spark.implicits._ // needed for toDF; assumes a SparkSession named `spark`, as in spark-shell

val sourceDF = Seq(
  (132, "rent count", 6),
  (132, "rent booked", 24),
  (132, "rent delayed", 6),
  (134, "rent booked", 34),
  (134, "rent delayed", 21)
).toDF("booking_id", "status", "count(status)")

// Build a one-entry map per row, collect the maps per booking_id,
// then serialize the collected list to a JSON string.
val resDF = sourceDF
  .groupBy("booking_id")
  .agg(to_json(collect_list(map(col("status"), col("count(status)")))).alias("status_json"))

resDF.show(false)

//  +----------+--------------------------------------------------------+
//  |booking_id|status_json                                             |
//  +----------+--------------------------------------------------------+
//  |132       |[{"rent count":6},{"rent booked":24},{"rent delayed":6}]|
//  |134       |[{"rent booked":34},{"rent delayed":21}]                |
//  +----------+--------------------------------------------------------+
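
Note that the result above is a JSON array of one-entry objects, not the single object per booking_id that the question asks for. The difference can be illustrated with a minimal plain-Python sketch (no Spark involved; the helper name merge_entries is made up for this example): merging the one-entry maps gives the requested shape.

```python
import json


def merge_entries(entries):
    """Merge a list of one-entry dicts into a single dict (later keys win)."""
    merged = {}
    for entry in entries:
        merged.update(entry)
    return merged


# Shape produced by to_json(collect_list(map(...))) for booking_id 132
as_array = [{"rent count": 6}, {"rent booked": 24}, {"rent delayed": 6}]

# Shape requested in the question: one JSON object per booking_id
as_object = merge_entries(as_array)

print(json.dumps(as_array, separators=(",", ":")))
print(json.dumps(as_object, separators=(",", ":")))
```

The first line printed matches the array form shown in the output above; the second is the single-object form from the question.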
oo7oh9g9  #2

For Spark 2.4 or later, use map_from_arrays.

from pyspark.sql import functions as F

# df is the DataFrame from the question (booking_id, status, count(status))
(df.groupBy("booking_id")
   .agg(F.to_json(F.map_from_arrays(F.collect_list("status"),
                                    F.collect_list("count(status)")))
         .alias("status_json"))
   .show(truncate=False))

# +----------+--------------------------------------------------+
# |booking_id|status_json                                       |
# +----------+--------------------------------------------------+
# |132       |{"rent count":6,"rent booked":24,"rent delayed":6}|
# |134       |{"rent booked":34,"rent delayed":21}              |
# +----------+--------------------------------------------------+
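
To make the aggregation easier to follow, here is a rough plain-Python sketch of what it computes, without Spark. The function name status_json_by_booking and the rows list are illustrative only. The two per-group lists stand in for the two collect_list calls, zip stands in for map_from_arrays (which pairs keys and values by position), and json.dumps stands in for to_json.

```python
import json
from collections import defaultdict

# Rows from the question: (booking_id, status, count(status))
rows = [
    (132, "rent count", 6),
    (132, "rent booked", 24),
    (132, "rent delayed", 6),
    (134, "rent booked", 34),
    (134, "rent delayed", 21),
]


def status_json_by_booking(rows):
    """Group rows by booking_id and build one JSON object string per group."""
    statuses = defaultdict(list)  # plays the role of collect_list("status")
    counts = defaultdict(list)    # plays the role of collect_list("count(status)")
    for booking_id, status, count in rows:
        statuses[booking_id].append(status)
        counts[booking_id].append(count)

    result = {}
    for booking_id in statuses:
        # Pair keys and values by position, as map_from_arrays does
        mapping = dict(zip(statuses[booking_id], counts[booking_id]))
        # Compact separators mirror the to_json output shown above
        result[booking_id] = json.dumps(mapping, separators=(",", ":"))
    return result


for booking_id, status_json in status_json_by_booking(rows).items():
    print(booking_id, status_json)
```

One difference to keep in mind: a Python dict silently keeps the last value when a status repeats within a booking_id, whereas how Spark handles duplicate map keys depends on the Spark version and configuration.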
