PySpark SQL: nested array conditional select into a new column

Posted by cdmah0mi on 2021-07-13 in Java

I have the following schema:

root 
|-- event_params: array (nullable = true) 
| |-- element: struct (containsNull = true) 
| | |-- key: string (nullable = true) 
| | |-- value: struct (nullable = true) 
| | | |-- string_value: string (nullable = true) 
| | | |-- int_value: long (nullable = true) 
| | | |-- float_value: double (nullable = true)

My event_params column is an array of structs. Sample data:

{
  "event_params": [
    {
      "element": {
        "value": {
          "string_value": "LoginVC",
          "float_value": null,
          "double_value": null,
          "int_value": null
        },
        "key": "firebase_screen_class"
      }
    },
    {
      "element": {
        "value": {
          "string_value": null,
          "float_value": null,
          "double_value": null,
          "int_value": 3600000
        },
        "key": "engagement_time_msec"
      }
    },
    {
      "element": {
        "value": {
          "string_value": "app_entered_background",
          "float_value": null,
          "double_value": null,
          "int_value": null
        },
        "key": "item_name"
      }
    }
  ]
}

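How can I take the value of value.string_value where "key": "item_name" and put it into a new column at the same row level? I don't want to filter rows, because I want to repeat this process for two other keys.
So I want a new schema like this: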

root 
|-- item_name_string_value: string (nullable = true)
|-- firebase_screen_class_string_value: string (nullable = true)
|-- event_params: array (nullable = true) 
| |-- element: struct (containsNull = true) 
| | |-- key: string (nullable = true) 
| | |-- value: struct (nullable = true) 
| | | |-- string_value: string (nullable = true) 
| | | |-- int_value: long (nullable = true) 
| | | |-- float_value: double (nullable = true)

I want to do this with PySpark.

python apache-spark pyspark apache-spark-sql aws-glue-spark

Source: https://stackoverflow.com/questions/67289396/pyspark-sql-nested-array-conditional-select-into-a-new-column

1 answer

olmpazwi1#

Based on "PySpark: filter and extract struct through ArrayType column", this seems to work for me:

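from pyspark.sql.functions import col, expr

# Read the array column from the registered temp view.
curDFfil = spark.sql("select event_params from temp_data_l1")

# filter() (Spark 2.4+) keeps only the structs whose key is 'item_name'; take the first match.
df = curDFfil.select(expr("filter(event_params, s -> s.key == 'item_name')").getItem(0).alias('item_name'))
# Pull the nested string_value out of the matched struct.
newDf = df.select(col("item_name.value.string_value").alias('item_name_string_value'))
newDf.show(10, False)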
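
The snippet above returns only the extracted column, while the question asks to keep event_params and to repeat the extraction for other keys. Below is a minimal sketch of one way to do that with the same filter() expression. The helper names param_expr and with_param_columns are hypothetical (they are not part of PySpark), and the keys are assumed to be trusted literals without single quotes. When no element matches, indexing with [0] yields NULL under non-ANSI settings; with ANSI mode enabled, an out-of-range index may raise an error instead.

```python
def param_expr(key, field="string_value", column="event_params"):
    """Build a Spark SQL expression that reads value.<field> from the first
    array element whose key equals `key`.

    Hypothetical helper: it only formats a string, so `key` must be a
    trusted literal that contains no single quotes.
    """
    return f"filter({column}, s -> s.key = '{key}')[0].value.{field}"


def with_param_columns(df, keys, field="string_value", column="event_params"):
    """Return `df` with one new column per key, followed by all of the
    original columns (no rows are filtered out)."""
    new_cols = [f"{param_expr(k, field, column)} AS {k}_{field}" for k in keys]
    # selectExpr accepts SQL expression strings; "*" keeps every existing column.
    return df.selectExpr(*new_cols, "*")
```

Assuming curDFfil from the answer above, calling with_param_columns(curDFfil, ["item_name", "firebase_screen_class"]).printSchema() should list item_name_string_value and firebase_screen_class_string_value ahead of event_params. For engagement_time_msec, pass field="int_value" instead.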

2021-07-13
