pyspark — Is there a way to specify the column names generated by the inline_outer function in Spark SQL?

ghg1uchk  posted 2023-02-15 in Spark
Follow (0) | Answers (2) | Views (163)

I have a table named order, as follows:
| id | campaigns |
| --- | --- |
| 2 | [{"id":"1","title":"test","type":"one"},{"id":"2","title":"test2","type":"two"}] |
| 5 | [{"id":"3","title":"test3","type":"three"}] |
My expected output:
| id | campaignId | title | type |
| --- | --- | --- | --- |
| 2 | 1 | test | one |
| 2 | 2 | test2 | two |
| 5 | 3 | test3 | three |
My code:

SELECT orderId AS id, id AS campaignid, title, type
FROM (
    SELECT id AS orderId, inline_outer(from_json(campaigns, 'ARRAY<STRUCT<id: STRING, title: STRING, type: STRING>>'))
    FROM order
);

I have to rename the id field to orderId in the subquery because the campaigns field also contains an id key.

    • Q: Is there a way to specify the column names generated by the inline_outer function in Spark SQL?

I tried two other ways of writing this, but neither of them is valid Spark SQL syntax.
Thanks in advance.


kb5ga3dv1#

You need to cast the from_json output to change the column names:

SELECT id,
       inline_outer(cast(from_json(campaigns, 'ARRAY<STRUCT<id: STRING, title: STRING, type: STRING>>')
                         as ARRAY<STRUCT<campaignId: STRING, title: STRING, type: STRING>>))
FROM order
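For reference, Spark SQL can also alias a generator's output columns directly via LATERAL VIEW, which avoids both the cast and the subquery rename. A sketch against the same order table; LATERAL VIEW OUTER inline matches inline_outer's behavior of keeping rows whose array is null or empty:

```sql
SELECT o.id, c.campaignId, c.title, c.type
FROM order o
LATERAL VIEW OUTER inline(
    from_json(o.campaigns, 'ARRAY<STRUCT<id: STRING, title: STRING, type: STRING>>')
) c AS campaignId, title, type;
```

The AS clause after the view alias names the generated columns, so campaignId can be chosen freely even though the struct field is called id.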

uyto3xhc2#

Here is a solution entirely in pyspark:

from pyspark.sql import functions as F, types as T

# Define schema of the JSON
schema = T.ArrayType(
    T.StructType(
        [
            T.StructField("id", T.StringType()),
            T.StructField("title", T.StringType()),
            T.StructField("type", T.StringType()),
        ]
    )
)
# Alternatively, a map schema also works for this example
schema = T.ArrayType(T.MapType(T.StringType(), T.StringType()))

# Parse the JSON string into an array-of-structs column
df = df.withColumn(
    "campaigns",
    F.from_json("campaigns", schema),
)

# Explode the array into one row per element
# (note: explode drops rows with null/empty arrays;
#  use F.explode_outer to keep them, matching inline_outer)
df = df.withColumn("campaign", F.explode("campaigns"))

# Flatten the struct and rename its id field to campaignId
df = df.select(
    "id",
    F.col("campaign.id").alias("campaignId"),
    F.col("campaign.title"),
    F.col("campaign.type"),
)
+---+----------+-----+-----+
| id|campaignId|title| type|
+---+----------+-----+-----+
|  2|         1| test|  one|
|  2|         2|test2|  two|
|  5|         3|test3|three|
+---+----------+-----+-----+
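Conceptually, from_json plus inline_outer (or explode) just parses the JSON string and turns each struct in the array into its own row. A plain-Python sketch of that transformation on the question's sample data, with no Spark required:

```python
import json

# The sample "order" table: (id, campaigns-as-JSON-string)
rows = [
    (2, '[{"id":"1","title":"test","type":"one"},'
        '{"id":"2","title":"test2","type":"two"}]'),
    (5, '[{"id":"3","title":"test3","type":"three"}]'),
]

# Parse each JSON array, then emit one output row per inner object,
# carrying the outer id alongside the renamed campaign fields
flattened = [
    (order_id, c["id"], c["title"], c["type"])
    for order_id, campaigns in rows
    for c in json.loads(campaigns)
]
# → [(2, "1", "test", "one"), (2, "2", "test2", "two"), (5, "3", "test3", "three")]
```

The column naming happens at the point where each dict is unpacked, which is exactly what the cast and LATERAL VIEW aliasing tricks express in Spark SQL.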
