How to convert a PySpark DataFrame into a nested JSON array with multiple occurrences

ryoqjall posted 9 months ago in Spark

How can I convert a PySpark DataFrame like the one below

OrderID   field              fieldValue   itemSeqNo 

123       Date               01-01-23      1
123       Amount             10.00         1 
123       description        Pencil        1
123       Date               01-02-23      2
123       Amount             11.00         2
123       description        Pen           2

into the JSON array structure below?

{
    "orderDetails": {
        "orderID": "123"
    },
    "itemizationDetails": [
        {
            "Date": "01-01-23",
            "Amount": "10.00",
            "description": "Pencil"
        },
        {
            "Date": "01-02-23 ",
            "Amount": "11.00",
            "description": "Pen"
        }
    ]
}


This is my current code; the output is not what I expected.

import json

import pandas as pd

# Flat input: one row per (OrderID, itemSeqNo, field) combination
test_dataframe = pd.DataFrame(
    {
        "OrderID": ['123', '123', '123', '123', '123', '123'],
        "field": ["Date", "Amount", 'description', 'Date', 'Amount', 'description'],
        "fieldValue": ['01-01-23', '10.00', 'Pencil', '01-02-23 ', '11.00', 'Pen '],
        "itemSeqNo": ['1', '1', '1', '2', '2', '2'],
    }
)

# Serializes row by row, so the result is a flat list of records, not a nested structure
res = json.loads(test_dataframe.to_json(orient='records'))
print(res)

[{'OrderID': '123', 'field': 'Date', 'fieldValue': '01-01-23', 'itemSeqNo': '1'}, {'OrderID': '123', 'field': 'Amount', 'fieldValue': '10.00', 'itemSeqNo': '1'}, {'OrderID': '123', 'field': 'description', 'fieldValue': 'Pencil', 'itemSeqNo': '1'}, {'OrderID': '123', 'field': 'Date', 'fieldValue': '01-02-23 ', 'itemSeqNo': '2'}, {'OrderID': '123', 'field': 'Amount', 'fieldValue': '11.00', 'itemSeqNo': '2'}, {'OrderID': '123', 'field': 'description', 'fieldValue': 'Pen ', 'itemSeqNo': '2'}]
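
The flat records printed above already hold all the needed values, so one possible plain-Python sketch (not part of the original post) is to group them by OrderID and itemSeqNo before serializing. The helper name `nest_records` and the choice to sort items by numeric itemSeqNo are assumptions made for illustration; the records are copied from the printed output above, including their trailing spaces.

```python
import json
from collections import defaultdict

# Records copied from the printed output of to_json(orient='records') above.
records = [
    {"OrderID": "123", "field": "Date", "fieldValue": "01-01-23", "itemSeqNo": "1"},
    {"OrderID": "123", "field": "Amount", "fieldValue": "10.00", "itemSeqNo": "1"},
    {"OrderID": "123", "field": "description", "fieldValue": "Pencil", "itemSeqNo": "1"},
    {"OrderID": "123", "field": "Date", "fieldValue": "01-02-23 ", "itemSeqNo": "2"},
    {"OrderID": "123", "field": "Amount", "fieldValue": "11.00", "itemSeqNo": "2"},
    {"OrderID": "123", "field": "description", "fieldValue": "Pen ", "itemSeqNo": "2"},
]


def nest_records(rows):
    """Group flat field/value rows into one nested document per OrderID.

    Hypothetical helper, written only to illustrate the reshaping.
    """
    # orders[order_id][item_seq_no] -> {field: fieldValue}
    orders = defaultdict(lambda: defaultdict(dict))
    for row in rows:
        orders[row["OrderID"]][row["itemSeqNo"]][row["field"]] = row["fieldValue"]

    documents = []
    for order_id, items in orders.items():
        documents.append(
            {
                "orderDetails": {"orderID": order_id},
                # Assumption: itemSeqNo holds integers stored as strings,
                # so sort numerically to keep the array order stable.
                "itemizationDetails": [items[seq] for seq in sorted(items, key=int)],
            }
        )
    return documents


nested = nest_records(records)
print(json.dumps(nested[0], indent=4))
```

This works on data that fits in memory on one machine; for a distributed DataFrame the grouping has to happen in Spark itself.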

cuxqih21


PySpark solution

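Pivot to reshape the DataFrame

from pyspark.sql import functions as F

# df is assumed to be the Spark DataFrame holding the question's data
df1 = df.groupby('OrderID', 'itemSeqNo').pivot('field').agg(F.first('fieldValue'))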

# +-------+---------+------+---------+-----------+
# |OrderID|itemSeqNo|Amount|     Date|description|
# +-------+---------+------+---------+-----------+
# |    123|        1| 10.00| 01-01-23|     Pencil|
# |    123|        2| 11.00|01-02-23 |       Pen |
# +-------+---------+------+---------+-----------+

Pack the required columns into a struct type

df1 = df1.withColumn('itemizationDetails', F.struct('Amount', 'Date', 'description'))

# +-------+---------+------+---------+-----------+-------------------------+
# |OrderID|itemSeqNo|Amount|Date     |description|itemizationDetails       |
# +-------+---------+------+---------+-----------+-------------------------+
# |123    |1        |10.00 |01-01-23 |Pencil     |{10.00, 01-01-23, Pencil}|
# |123    |2        |11.00 |01-02-23 |Pen        |{11.00, 01-02-23 , Pen } |
# +-------+---------+------+---------+-----------+-------------------------+


Group the DataFrame by OrderID and collect the list of structs

df1 = df1.groupby('OrderID').agg(F.collect_list('itemizationDetails').alias('itemizationDetails'))

# +-------+-----------------------------------------------------+
# |OrderID|itemizationDetails                                   |
# +-------+-----------------------------------------------------+
# |123    |[{10.00, 01-01-23, Pencil}, {11.00, 01-02-23 , Pen }]|
# +-------+-----------------------------------------------------+


Pack OrderID into a struct field

df1 = df1.withColumn('OrderDetails', F.struct('OrderID'))

# +-------+--------------------+------------+
# |OrderID|  itemizationDetails|OrderDetails|
# +-------+--------------------+------------+
# |    123|[{10.00, 01-01-23...|       {123}|
# +-------+--------------------+------------+


Export the result as JSON strings

result = df1.select('OrderDetails', 'itemizationDetails').toJSON().collect()

['{"OrderDetails":{"OrderID":"123"},"itemizationDetails":[{"Amount":"10.00","Date":"01-01-23","description":"Pencil"},{"Amount":"11.00","Date":"01-02-23 ","description":"Pen "}]}']
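
Each element of the collected list is one JSON string per order, with keys named after the Spark columns (`OrderDetails`, `OrderID`) rather than the casing requested in the question (`orderDetails`, `orderID`). As a minimal sketch, assuming the result is small enough to post-process on the driver, the strings can be parsed and re-keyed with the standard `json` module. The helper name `to_requested_shape` is hypothetical, and stripping whitespace from the values assumes the trailing spaces in the source data are unintended.

```python
import json

# The list returned by toJSON().collect(), copied from the output above.
result = [
    '{"OrderDetails":{"OrderID":"123"},"itemizationDetails":['
    '{"Amount":"10.00","Date":"01-01-23","description":"Pencil"},'
    '{"Amount":"11.00","Date":"01-02-23 ","description":"Pen "}]}'
]


def to_requested_shape(json_row):
    """Parse one JSON row and rename its keys to the question's casing.

    Hypothetical helper, written only for illustration.
    """
    doc = json.loads(json_row)
    return {
        "orderDetails": {"orderID": doc["OrderDetails"]["OrderID"]},
        "itemizationDetails": [
            # Assumption: trailing spaces in the values are unintended.
            {key: value.strip() for key, value in item.items()}
            for item in doc["itemizationDetails"]
        ],
    }


documents = [to_requested_shape(row) for row in result]
print(json.dumps(documents[0], indent=4))
```

The renaming could probably also be done inside Spark by aliasing the columns before calling `toJSON()`, which avoids a second pass over the data.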
