How do I add a nested column to a DataFrame in PySpark?

ngynwnxp · posted 2021-05-27 in Spark

I have a DataFrame with the following schema:

root
 |-- field_a: string (nullable = true)
 |-- field_b: integer (nullable = true)

I want to add a nested column to my DataFrame, so that the schema becomes:

root
 |-- field_a: string (nullable = true)
 |-- field_b: integer (nullable = true)
 |-- field_c: struct (nullable = true)
 |    |-- subfield_a: integer (nullable = true)
 |    |-- subfield_b: integer (nullable = true)

How can I achieve this in PySpark?

cyej8jka (answer #1):

There are really two options: either declare a new schema that nests pyspark.sql.types.StructField (a sketch of this appears further down), or use pyspark.sql.functions.struct, as follows:

import pyspark.sql.functions as f

# spark.sparkContext is the public handle (spark._sc is private API)
df = spark.sparkContext.parallelize([
    [0, 1.0, 0.71, 0.143],
    [1, 0.0, 0.97, 0.943],
    [0, 0.123, 0.27, 0.443],
    [1, 0.67, 0.3457, 0.243],
    [1, 0.39, 0.7777, 0.143]
]).toDF(['col1', 'col2', 'col3', 'col4'])

df_new = df.withColumn(
    'tada',
    f.struct(f.col('col2').alias('subcol_1'), f.col('col3').alias('subcol_2'))
)
df_new.show()
+----+-----+------+-----+--------------+
|col1| col2|  col3| col4|          tada|
+----+-----+------+-----+--------------+
|   0|  1.0|  0.71|0.143|   [1.0, 0.71]|
|   1|  0.0|  0.97|0.943|   [0.0, 0.97]|
|   0|0.123|  0.27|0.443| [0.123, 0.27]|
|   1| 0.67|0.3457|0.243|[0.67, 0.3457]|
|   1| 0.39|0.7777|0.143|[0.39, 0.7777]|
+----+-----+------+-----+--------------+

Now, given that tada is a StructType, you can access its subfields with the [...] notation, as follows:

df_new.select(f.col('tada')['subcol_1']).show()
+-------------+
|tada.subcol_1|
+-------------+
|          1.0|
|          0.0|
|        0.123|
|         0.67|
|         0.39|
+-------------+
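Equivalently, dot notation inside f.col, or Column.getField, resolves the same nested field:

df_new.select(f.col('tada.subcol_1')).show()
df_new.select(f.col('tada').getField('subcol_1')).show()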

Printing the schema summarizes this as well:

df_new.printSchema()

root
 |-- col1: long (nullable = true)
 |-- col2: double (nullable = true)
 |-- col3: double (nullable = true)
 |-- col4: double (nullable = true)
 |-- tada: struct (nullable = false)
 |    |-- subcol_1: double (nullable = true)
 |    |-- subcol_2: double (nullable = true)

nb1: instead of f.col(...), which picks up an existing column, you can use anything else that returns a pyspark.sql.functions.Column, e.g. f.lit() for a constant; a sketch applying this to the question's schema follows below.
nb2: when using f.col(...), you can see that the existing column types are inherited.
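A minimal sketch of nb1 against the question's schema, assuming a DataFrame with the asker's field_a and field_b columns (df_q and its sample row are made up):

# Hypothetical DataFrame matching the question's original two-column schema.
df_q = spark.createDataFrame([('x', 1)], ['field_a', 'field_b'])

# Build field_c by mixing an existing column with a literal.
df_q = df_q.withColumn(
    'field_c',
    f.struct(
        f.col('field_b').alias('subfield_a'),  # type inherited from field_b
        f.lit(0).alias('subfield_b')           # constant column via f.lit
    )
)
df_q.printSchema()

Hope this helps!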
