pandas: partially flatten nested JSON and pivot longer

Posted by eulz3vhy on 2023-06-04 in Other

I have many JSON files with the following structure:

{
  "requestId": "test",
  "executionDate": "2023-05-10",
  "executionTime": "12:02:22",
  "request": {
    "fields": [{
      "geometry": {
        "type": "Point",
        "coordinates": [-90, 41]
      },
      "colour": "blue",
      "bean": "blaCk",
      "birthday": "2021-01-01",
      "arst": "111",
      "arstg": "rst",
      "fct": {
        "start": "2011-01-10",
        "end": "2012-01-10"
      }
    }]
  },
  "response": {
    "results": [{
        "geom": {
          "type": "geo",
          "coord": [-90, 41]
        },
        "md": {
          "type": "arstat",
          "mdl": "trstr",
          "vs": "v0",
          "cal": {
            "num": 4,
            "comment": "message"
          },
          "bean": ["blue", "green"],
          "result_time": 12342
        },
        "predictions": [{
            "date": "2004-05-19",
            "day": 0,
            "count": 0,
            "eating_stage": "trt"
          }, {
            "date": "2002-01-20",
            "day": 1,
            "count": 0,
            "eating_stage": "arstg"
          }, {
            "date": "2004-05-21",
            "day": 2,
            "count": 0,
            "eating_stage": "strg"
          }, {
            "date": "2004-05-22",
            "day": 3,
            "count": 0,
            "eating_stage": "rst"
          }]
    }]
  }
}

The predictions section can be very deep. I want to convert this JSON to a CSV with the following structure:

| requestId | executionDate | executionTime | colour | prediction_date | prediction_day | prediction_count | prediction_eating_stage |
| --- | --- | --- | --- | --- | --- | --- | --- |
| test | 2023-05-10 | 12:02:22 | blue | 2004-05-19 | 0 | 0 | trt |
| test | 2023-05-10 | 12:02:22 | blue | 2002-01-20 | 1 | 0 | arstg |
| test | 2023-05-10 | 12:02:22 | blue | 2004-05-21 | 2 | 0 | strg |
| test | 2023-05-10 | 12:02:22 | blue | 2004-05-22 | 3 | 0 | rst |

I tried the following code:

flat_json = pd.DataFrame(
    flatten(json_data), index=[0]
)

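This code turns every data point into its own column, and I'm not sure how to pivot longer at the "predictions" key using JSON functions in Python. I realize that at this stage I could pivot longer using the column names, but I feel there is a cleaner way to do this.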

nwsw7zdq #1

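I suggest extracting only what you need. The target layout seems specific enough that it is best handled with targeted parsing. So I would first create two DataFrames: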

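# One row per entry in the predictions array
df_prediction = pd.DataFrame(example['response']['results'][0]['predictions'])
# Top-level string fields (requestId, executionDate, executionTime) as a single row
df_data = pd.DataFrame({k: v for k, v in example.items() if isinstance(v, str)}, index=[0])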

Rename the columns of the predictions frame:

df_prediction.columns = ['prediction_'+x for x in df_prediction]

Join the two frames and add the last piece of data (colour):

output = df_data.assign(colour = example['request']['fields'][0]['colour']).join(df_prediction,how='right').ffill()

Output:

requestId executionDate  ... prediction_count prediction_eating_stage
0      test    2023-05-10  ...                0                     trt
1      test    2023-05-10  ...                0                   arstg
2      test    2023-05-10  ...                0                    strg
3      test    2023-05-10  ...                0                     rst
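
The join step above relies on index alignment followed by a forward fill, which is easy to miss. The following is a minimal sketch of that mechanism, assuming a trimmed-down stand-in for the question's JSON (only the keys used here, with two predictions); the variable `joined` is introduced only for illustration.

```python
import pandas as pd

# Trimmed stand-in for the question's JSON (assumption: only the keys used below).
example = {
    "requestId": "test",
    "executionDate": "2023-05-10",
    "executionTime": "12:02:22",
    "request": {"fields": [{"colour": "blue"}]},
    "response": {"results": [{"predictions": [
        {"date": "2004-05-19", "day": 0, "count": 0, "eating_stage": "trt"},
        {"date": "2002-01-20", "day": 1, "count": 0, "eating_stage": "arstg"},
    ]}]},
}

# One row per prediction, with prefixed column names.
df_prediction = pd.DataFrame(example["response"]["results"][0]["predictions"])
df_prediction.columns = ["prediction_" + c for c in df_prediction]

# A single metadata row at index 0, plus the colour taken from the request.
df_data = pd.DataFrame(
    {k: v for k, v in example.items() if isinstance(v, str)}, index=[0]
).assign(colour=example["request"]["fields"][0]["colour"])

# Right join on the index: every prediction row is kept, but the metadata
# row only matches index 0, so the remaining rows are NaN in those columns.
joined = df_data.join(df_prediction, how="right")

# Forward fill copies the metadata from row 0 down into the gaps.
output = joined.ffill()
print(output)
```

Note that `ffill()` fills every column, so a genuinely missing value in a prediction column would also be overwritten by the value above it. If that matters for your data, restricting the fill to the metadata columns may be safer.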

xdnvmnnf #2

You can also use json_normalize to extract the array of records that you want to normalize into CSV rows.

>>> df_predictions = pd.json_normalize(json_data,record_path=['response', 'results','predictions'], record_prefix='predictions.', meta=['requestId', 'executionDate', 'executionTime']).assign(colour = json_data['request']['fields'][0]['colour'])
>>> df_predictions
  predictions.date  predictions.day  predictions.count  ... executionDate executionTime colour
0       2004-05-19                0                  0  ...    2023-05-10      12:02:22   blue
1       2002-01-20                1                  0  ...    2023-05-10      12:02:22   blue
2       2004-05-21                2                  0  ...    2023-05-10      12:02:22   blue
3       2004-05-22                3                  0  ...    2023-05-10      12:02:22   blue

[4 rows x 8 columns]

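Unfortunately, the meta argument has a limitation: it raises an exception for paths that pass through an array/list, which is why the "colour" column is added separately. If the column order matters, you can rearrange the columns as needed.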
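
As a rough sketch of that last step, the json_normalize call can be wrapped in a helper that also reorders the columns to match the layout requested in the question. The function name `predictions_frame`, the `prediction_` prefix, and the trimmed `sample` dict are assumptions made for illustration; they are not part of the original answer.

```python
import pandas as pd


def predictions_frame(json_data: dict) -> pd.DataFrame:
    """Return one row per prediction with the request-level fields attached."""
    df = pd.json_normalize(
        json_data,
        record_path=["response", "results", "predictions"],
        record_prefix="prediction_",
        meta=["requestId", "executionDate", "executionTime"],
    )
    # 'colour' sits below a list (request.fields), so it is attached by hand.
    df["colour"] = json_data["request"]["fields"][0]["colour"]
    # Request-level columns first, then the prediction columns in their original order.
    leading = ["requestId", "executionDate", "executionTime", "colour"]
    return df[leading + [c for c in df.columns if c not in leading]]


# Trimmed stand-in for the question's JSON (assumption: only the keys used above).
sample = {
    "requestId": "test",
    "executionDate": "2023-05-10",
    "executionTime": "12:02:22",
    "request": {"fields": [{"colour": "blue"}]},
    "response": {"results": [{"predictions": [
        {"date": "2004-05-19", "day": 0, "count": 0, "eating_stage": "trt"},
        {"date": "2002-01-20", "day": 1, "count": 0, "eating_stage": "arstg"},
    ]}]},
}

frame = predictions_frame(sample)
csv_text = frame.to_csv(index=False)  # pass a file path instead to write to disk
print(csv_text)
```

For a folder of files, one option is to load each file with `json.load`, call the helper once per file, and combine the resulting frames with `pd.concat` before writing a single CSV.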
