Aggregate columns and change to an ArrayType schema in PySpark

b1payxdu asked on 2021-07-12 in Spark

I want to aggregate a few columns and reshape them into a specific schema format:
+---+-------+-------------+------------+
|id |name_en|name_sp      |name_fr     |
+---+-------+-------------+------------+
|1  |hello  |hello_spanish|hello_french|
+---+-------+-------------+------------+
I need the names in struct format, so that when I call json.dumps (or otherwise convert the rows) it becomes a proper JSON object.
I am using this ArrayType schema:

from pyspark.sql.types import ArrayType, StructType, StructField, StringType
schema = ArrayType(StructType([StructField("locale", StringType(), True), StructField("value", StringType(), True)]))
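
For context, a minimal DataFrame matching the table above could be built like this (the column names id, name_en, name_sp, name_fr are assumed from the table; this is only a sketch to reproduce the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# one row matching the example table above
df = spark.createDataFrame(
    [(1, "hello", "hello_spanish", "hello_french")],
    ["id", "name_en", "name_sp", "name_fr"],
)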

Ideally it would be converted to an array of structs, so that calling json.dumps over the whole table produces the following JSON object:

{
 "id" : 1,
  "names" : [{"locale" : "en", "value" : "hello"}, {"locale" : "sp", "value" : "hello_spanish"}, {"locale" : "fr", "value" : "hello_french"}]
}

and not a JSON object where names ends up as a string with escape characters:

{
 "id" : 1,
  "names" : "[{\\"locale\\" : \\"en\\", \\"value\\" : \\"hello\\"}, {\\"locale\\" : \\"sp\\", \\"value\\" : \\"hello_spanish\\"}, {\\"locale\\" : \\"fr\\", \\"value\\" : \\"hello_french\\"}]"
}
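
For illustration only (an assumption about how the escaped form arises): if names is already a plain string, json.dumps re-escapes the quotes inside it, which is exactly the second output above.

import json

# If names is already a JSON string, dumping the row escapes its inner quotes
names_str = '[{"locale": "en", "value": "hello"}]'
print(json.dumps({"id": 1, "names": names_str}))
# {"id": 1, "names": "[{\"locale\": \"en\", \"value\": \"hello\"}]"}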

Answer #1, by 6qfn3psc:

You can construct the array of structs like this:

df2 = df.selectExpr(
    'id', 
    """
    array(
        struct('en' as locale, name_en as value),
        struct('sp' as locale, name_sp as value),
        struct('fr' as locale, name_fr as value)
    ) as names
    """
)

df2.show(truncate=False)
+---+------------------------------------------------------+
|id |names                                                 |
+---+------------------------------------------------------+
|1  |[[en, hello], [sp, hello_spanish], [fr, hello_french]]|
+---+------------------------------------------------------+
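
For reference, the same names column could also be built with the column functions API instead of a SQL expression (a sketch equivalent to the selectExpr version above, not part of the original answer):

from pyspark.sql import functions as F

df2 = df.select(
    "id",
    F.array(
        F.struct(F.lit("en").alias("locale"), F.col("name_en").alias("value")),
        F.struct(F.lit("sp").alias("locale"), F.col("name_sp").alias("value")),
        F.struct(F.lit("fr").alias("locale"), F.col("name_fr").alias("value")),
    ).alias("names"),
)

Either version should produce the same array-of-structs column shown above.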

To get the JSON, you can try:

df2.toJSON().collect()

# this gives:
# ['{"id":1,"names":[{"locale":"en","value":"hello"},
#   {"locale":"sp","value":"hello_spanish"},
#   {"locale":"fr","value":"hello_french"}]}']
