PySpark SQL: conditionally select from a nested array into a new column

cdmah0mi posted on 2021-07-13 in Java

I have the following schema:

root 
|-- event_params: array (nullable = true) 
| |-- element: struct (containsNull = true) 
| | |-- key: string (nullable = true) 
| | |-- value: struct (nullable = true) 
| | | |-- string_value: string (nullable = true) 
| | | |-- int_value: long (nullable = true) 
| | | |-- float_value: double (nullable = true)

My event_params column is an array of structs. Sample data:

{
  "event_params": [
    {
      "element": {
        "value": {
          "string_value": "LoginVC",
          "float_value": null,
          "double_value": null,
          "int_value": null
        },
        "key": "firebase_screen_class"
      }
    },
    {
      "element": {
        "value": {
          "string_value": null,
          "float_value": null,
          "double_value": null,
          "int_value": 3600000
        },
        "key": "engagement_time_msec"
      }
    },
    {
      "element": {
        "value": {
          "string_value": "app_entered_background",
          "float_value": null,
          "double_value": null,
          "int_value": null
        },
        "key": "item_name"
      }
    }
  ]
}

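How can I create a new column, at the same row level, from value.string_value of the element where "key": "item_name"? I don't want to filter rows, because I want to repeat this process for two other keys.
So I want a new schema like this: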

root 
|-- item_name_string_value: string (nullable = true)
|-- firebase_screen_class_string_value: string (nullable = true)
|-- event_params: array (nullable = true) 
| |-- element: struct (containsNull = true) 
| | |-- key: string (nullable = true) 
| | |-- value: struct (nullable = true) 
| | | |-- string_value: string (nullable = true) 
| | | |-- int_value: long (nullable = true) 
| | | |-- float_value: double (nullable = true)

I want to achieve this with PySpark.

olmpazwi #1

Following "Pyspark: filter and extract struct through ArrayType column", this seems to work for me:

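from pyspark.sql.functions import col, expr

# Read only the array column from the temp view
curDFfil = spark.sql("select event_params from temp_data_l1")

# Keep the structs whose key is 'item_name' and take the first match
df = curDFfil.select(expr("filter(event_params, s -> s.key == 'item_name')").getItem(0).alias('item_name'))
# Pull the nested string_value out into its own column
newDf = df.select(col("item_name.value.string_value").alias('item_name_string_value'))
newDf.show(10, False)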
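
The snippet above returns only the extracted column. Since the question asks to keep event_params and repeat the extraction for several keys, the same filter expression could be applied with withColumn instead of select. The following is a minimal sketch, not a tested solution for this dataset: the helper names string_value_expr and with_string_value_columns are hypothetical, the array column is assumed to be called event_params, and the SQL filter higher-order function is assumed to be available (Spark 2.4 or later).

```python
def string_value_expr(key: str) -> str:
    """Build a Spark SQL expression that picks value.string_value for one key.

    Assumes the array column is named event_params, as in the schema above.
    """
    # Keep the sketch simple: refuse characters that would need SQL escaping.
    if "'" in key or "\\" in key:
        raise ValueError(f"unsupported character in key: {key!r}")
    return f"filter(event_params, s -> s.key = '{key}')[0].value.string_value"


def with_string_value_columns(df, keys):
    """Return df with one <key>_string_value column per key; no rows are filtered."""
    # Imported here so the string helper above can be used without Spark installed.
    from pyspark.sql import functions as F

    for key in keys:
        df = df.withColumn(f"{key}_string_value", F.expr(string_value_expr(key)))
    return df
```

It might be called as with_string_value_columns(spark.table("temp_data_l1"), ["item_name", "firebase_screen_class"]). The new columns are appended after event_params, so a final select can reorder them if the exact column order of the target schema matters. With Spark's ANSI mode disabled, a row that lacks the key should get null in the new column; with ANSI mode enabled, the [0] index on an empty array may raise an error instead.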
