Pyspark从JSON数组中排除某些键

f45qwnt8  于 2022-11-21  发布在  Spark
关注(0)|答案(1)|浏览(120)

我想从JSON数组列中过滤某些键并将其添加到一个新列中。我首先尝试通过定义模式将JSON字符串转换为数组。
样本数据,即col(“data1”):

[{"type":"fruit","value":"apple"},{"type":"car","value":"audi"},{"type":"color","value":"red"}, {"type":"vegetable","value":"onion"}]

我要排除的机码:

cols_to_exclude = ["car", "color"]

架构定义并添加为新列:

dataMapSchema = ArrayType(StructType([StructField('type', StringType()),StructField('value', StringType())]))

df = datasource0.toDF().withColumn("temp", (from_json(F.lower(col("data1")), dataMapSchema))) \
        .withColumn("tempMap", F.map_from_entries(col("temp"))) \
        .withColumn("filteredColumn", F.to_json(F.map_filter(col("tempMap"), lambda k, v: ~k.isin(cols_to_exclude))))

这将创建以下输出:

{"fruit":"apple", "vegetable":"onion" }

预期输出:
col(“过滤的列”):

[{"fruit":"apple"}, {"vegetable":"onion"}]

如何将代码转换为返回一个Map列表?我正在寻找与map_from_arrays相反的内容

y0u0uwnf

y0u0uwnf1#

我在pyspark functions中找不到函数inline或inline_outer,所以我在pysprk中执行SQL str,结果似乎可以接受。
密码:

str = '[{"type":"fruit","value":"apple"},{"type":"car","value":"audi"},{"type":"color","value":"red"}, {"type":"vegetable","value":"onion"}]'

sql = f"""
SELECT string(collect_set(to_json(map(type, value)))) AS data
FROM (
SELECT inline_outer(from_json('{str}', 'ARRAY<STRUCT<type: string, value: string>>'))
)
WHERE type not in ('car', 'color');
"""

spark.sql(sql).show()

实验结果:

[{"fruit":"apple"}, {"vegetable":"onion"}]

相关问题