PySpark: merge two array-of-struct columns with different fields

Posted by xa9qqrwz on 2024-01-06 in Spark


I have a DataFrame with the following schema:

StructType([
    StructField('product_id', IntegerType(), True),
    StructField('tenant_id', IntegerType(), True),
    StructField("materials", ArrayType(StructType([
        StructField('id', IntegerType(), True),
        StructField('percentage', FloatType(), True)
    ]))),
    StructField("elastic", ArrayType(StructType([
        StructField('id', IntegerType(), True),
        StructField('name', MapType(StringType(), StringType()), True)
    ])))
])

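I want to merge these two array-of-struct columns into a single one with three fields (id, percentage and name), matching entries where materials.id = elastic.id, like this: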

StructType([
    StructField('product_id', IntegerType(), True),
    StructField('tenant_id', IntegerType(), True),
    StructField("materials", ArrayType(StructType([
        StructField('id', IntegerType(), True),
        StructField('percentage', FloatType(), True),
        StructField('name', MapType(StringType(), StringType()), True)
    ])))
])

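Basically, I want to get from the "before" state to the expected result.
I tried a UDF and it produces the right result, but it is not the best approach in terms of performance.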

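from pyspark.sql.functions import udf
from pyspark.sql.types import (ArrayType, FloatType, IntegerType, MapType,
                               StringType, StructField, StructType)

@udf(returnType=ArrayType(StructType([
    StructField("id", IntegerType(), False),
    StructField('percentage', FloatType(), True),
    StructField('name', MapType(StringType(), StringType()), True)
])))
def expand_list(materials, elastic):
    final = []
    # Nested loop: pair each material with every elastic entry sharing its id
    for k in materials:
        for i in elastic:
            if k.id == i.id:
                final.append({'id': k.id, 'percentage': k.percentage, 'name': i.name})
    return final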
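
If the UDF has to stay, one possible refinement is to replace the nested loop with a dictionary lookup, so each array is scanned only once. The sketch below is plain Python and models only the body of the UDF. `Material`, `Elastic` and `merge_by_id` are hypothetical names introduced for illustration, and it assumes ids are unique within `elastic` (with duplicate ids the last entry wins, whereas the nested loop would emit one result per matching pair). Unlike the original, it also returns an empty list when either input is missing.

```python
from collections import namedtuple

# Hypothetical stand-ins for the Row objects Spark passes to a UDF.
Material = namedtuple("Material", ["id", "percentage"])
Elastic = namedtuple("Elastic", ["id", "name"])


def merge_by_id(materials, elastic):
    """Join materials to elastic on id, keeping only matched entries."""
    if not materials or not elastic:
        return []
    # Build the lookup once instead of rescanning elastic for every material.
    names_by_id = {e.id: e.name for e in elastic}
    return [
        {"id": m.id, "percentage": m.percentage, "name": names_by_id[m.id]}
        for m in materials
        if m.id in names_by_id
    ]


merged = merge_by_id(
    [Material(1, 0.1), Material(3, 0.3), Material(2, 0.2)],
    [Elastic(1, {"en": "one"}), Elastic(2, {"en": "two"})],
)
print(merged)
```

The material with id 3 has no counterpart in `elastic`, so it is left out, which matches what the nested-loop version does. This only reduces the work done inside the UDF. The serialization cost of calling Python from Spark remains.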


pyspark

Source: https://stackoverflow.com/questions/77731788/merge-two-columns-of-array-struct-with-different-fields

1 Answer

nle07wnf1#

Use transform to iterate over the first array, then use filter to look up the matching entry in the second array:

from pyspark.sql import functions as F
# some testdata
testdata="""
  {"product_id": 1, "tenant_id": 1, "materials": [{"id": 1, "percentage": 0.1}, {"id": 3, "percentage": 0.3}, {"id": 2, "percentage": 0.2}], "elastic": [{"id": 1, "name": "one"},{"id":2, "name": "two"}] }
"""
df = spark.read.json(spark.sparkContext.parallelize([testdata]))
# create a new column with the merged array
df.withColumn("merged_materials", F.expr("""
  transform(materials, m -> named_struct(
      'id', m.id, 
      'percentage', m.percentage, 
      'name', filter(elastic, e -> e.id == m.id)[0].name)
    )
    """
  )).show(vertical=True, truncate=False)

Output:

-RECORD 0----------------------------------------------------------
 elastic          | [{1, one}, {2, two}]                           
 materials        | [{1, 0.1}, {3, 0.3}, {2, 0.2}]                 
 product_id       | 1                                              
 tenant_id        | 1                                              
 merged_materials | [{1, 0.1, one}, {3, 0.3, null}, {2, 0.2, two}]
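
Note that the output above keeps the material with id 3 and gives it a null name, whereas the UDF in the question drops entries that have no match. If the UDF's behavior is what is wanted, a possible variation is to filter `materials` with `exists` before applying `transform`. This is a sketch, not a tested drop-in: `MERGE_MATCHED_ONLY` and `with_merged_materials` are hypothetical names, and it assumes ids are unique within `elastic` (only the first match is used).

```python
# Spark SQL expression: keep only materials whose id also appears in elastic,
# then attach the name of the first matching elastic entry.
MERGE_MATCHED_ONLY = """
  transform(
    filter(materials, m -> exists(elastic, e -> e.id = m.id)),
    m -> named_struct(
      'id', m.id,
      'percentage', m.percentage,
      'name', filter(elastic, e -> e.id = m.id)[0].name
    )
  )
"""


def with_merged_materials(df, column="merged_materials"):
    """Return df with an extra column holding the merged array (hypothetical helper)."""
    # Imported here so the expression above can be inspected without Spark installed.
    from pyspark.sql import functions as F

    return df.withColumn(column, F.expr(MERGE_MATCHED_ONLY))


# Usage with the test DataFrame built in the answer:
# with_merged_materials(df).show(vertical=True, truncate=False)
```

With the test data from the answer, the merged array should then contain only the entries for ids 1 and 2. Because unmatched materials are removed first, the `[0]` index is only ever applied to a non-empty array.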


Posted 2024-01-06
