Suppose we have the following dataset:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
case class Rec3(i: Long, j: Boolean)
case class Rec2(a: Int, b: Rec3, c: String, d: Int)
case class Rec1(x: Int, y: Option[Seq[Rec2]], z: Boolean, zz: String)
val df = Seq(
  Rec1(5, Some(Seq(
    Rec2(4, Rec3(3L, true), "2022-09-22 13:00:00", 3),
    Rec2(44, Rec3(33L, true), "2022-11-11 22:11:00", 3))), false, "2022-09-23 14:30:00"),
  Rec1(55, Some(Seq(
    Rec2(44, Rec3(33L, false), "2023-01-11 21:00:00", 33))), true, "2023-01-22 11:33:00")
).toDF
df.show(false)
+---+----------------------------------------------------------------------------------+-----+-------------------+
|x |y |z |zz |
+---+----------------------------------------------------------------------------------+-----+-------------------+
|5 |[{4, {3, true}, 2022-09-22 13:00:00, 3}, {44, {33, true}, 2022-11-11 22:11:00, 3}]|false|2022-09-23 14:30:00|
|55 |[{44, {33, false}, 2023-01-11 21:00:00, 33}] |true |2023-01-22 11:33:00|
+---+----------------------------------------------------------------------------------+-----+-------------------+
df.printSchema
root
|-- x: integer (nullable = false)
|-- y: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: integer (nullable = false)
| | |-- b: struct (nullable = true)
| | | |-- i: long (nullable = false)
| | | |-- j: boolean (nullable = false)
| | |-- c: string (nullable = true)
| | |-- d: integer (nullable = false)
|-- z: boolean (nullable = false)
|-- zz: string (nullable = true)
I only want to convert the field y.c to Timestamp. This is the schema I would like to end up with:
root
|-- x: integer (nullable = false)
|-- y: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: integer (nullable = false)
| | |-- b: struct (nullable = true)
| | | |-- i: long (nullable = false)
| | | |-- j: boolean (nullable = false)
| | |-- c: timestamp (nullable = true)
| | |-- d: integer (nullable = false)
|-- z: boolean (nullable = false)
|-- zz: string (nullable = true)
This is what I have tried so far:
df.withColumn(
  "y",
  transform(
    col("y"),
    elem => elem.withField(
      "c",
      unix_timestamp(
        col("y.c"),
        "yyyy-MM-dd' 'HH:mm:ss"
      )
    )
  )
)
org.apache.spark.sql.AnalysisException: cannot resolve 'unix_timestamp(y.c, 'yyyy-MM-dd' 'HH:mm:ss')' due to data type mismatch: argument 1 requires (string or date or timestamp or timestamp_ntz) type, however, 'y.c' is of array<string> type.;
'Project [x#4, transform(y#5, lambdafunction(update_fields(lambda x_0#35, WithField(c, unix_timestamp(y#5.c, yyyy-MM-dd' 'HH:mm:ss, Some(Europe/Paris), false))), lambda x_0#35, false)) AS y#34, z#6, zz#7]
+- LocalRelation [x#4, y#5, z#6, zz#7]
Now I'm well aware that referencing any field inside the array returns an array of that field, as shown here:
df.select("y.c")
res6: org.apache.spark.sql.DataFrame = [c: array<string>]
But I can't find any elegant way to reference and cast a field like this one. Any ideas?
1 Answer
This is possible by referencing each element of the array through the lambda variable defined inside transform, like so:
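Here is a minimal sketch of that approach, assuming Spark 3.1+ (for Column.withField) and swapping in to_timestamp so that c actually ends up as a timestamp rather than the long that unix_timestamp returns; the val name converted is only illustrative:

import org.apache.spark.sql.functions._

val converted = df.withColumn(
  "y",
  transform(
    col("y"),
    // `elem` is the struct of the current array element, so
    // elem.getField("c") is a single string, not the whole array of c values
    elem => elem.withField(
      "c",
      to_timestamp(elem.getField("c"), "yyyy-MM-dd HH:mm:ss")
    )
  )
)

converted.printSchema
// y.element.c should now be reported as timestamp (nullable = true)

Referencing elem instead of col("y.c") is what avoids the array<string> type mismatch raised in the exception above.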