Suppose we have the following dataset:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
case class Rec3(i: Long, j: Boolean)
case class Rec2(a: Int, b: Rec3, c: String, d: Int)
case class Rec1(x: Int, y: Option[Seq[Rec2]], z: Boolean, zz: String)
val df = Seq(
  Rec1(5, Some(Seq(
    Rec2(4, Rec3(3L, true), "2022-09-22 13:00:00", 3),
    Rec2(44, Rec3(33L, true), "2022-11-11 22:11:00", 3))), false, "2022-09-23 14:30:00"),
  Rec1(55, Some(Seq(
    Rec2(44, Rec3(33L, false), "2023-01-11 21:00:00", 33))), true, "2023-01-22 11:33:00")
).toDF
df.show(false)
+---+----------------------------------------------------------------------------------+-----+-------------------+
|x |y |z |zz |
+---+----------------------------------------------------------------------------------+-----+-------------------+
|5 |[{4, {3, true}, 2022-09-22 13:00:00, 3}, {44, {33, true}, 2022-11-11 22:11:00, 3}]|false|2022-09-23 14:30:00|
|55 |[{44, {33, false}, 2023-01-11 21:00:00, 33}] |true |2023-01-22 11:33:00|
+---+----------------------------------------------------------------------------------+-----+-------------------+
df.printSchema
root
|-- x: integer (nullable = false)
|-- y: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: integer (nullable = false)
| | |-- b: struct (nullable = true)
| | | |-- i: long (nullable = false)
| | | |-- j: boolean (nullable = false)
| | |-- c: string (nullable = true)
| | |-- d: integer (nullable = false)
|-- z: boolean (nullable = false)
|-- zz: string (nullable = true)
I only want to convert the field y.c to Timestamp. This is the schema I would like to end up with:
root
|-- x: integer (nullable = false)
|-- y: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: integer (nullable = false)
| | |-- b: struct (nullable = true)
| | | |-- i: long (nullable = false)
| | | |-- j: boolean (nullable = false)
| | |-- c: timestamp (nullable = true)
| | |-- d: integer (nullable = false)
|-- z: boolean (nullable = false)
|-- zz: string (nullable = true)
This is what I have tried so far:
df.withColumn(
  "y",
  transform(
    col("y"),
    elem => elem.withField(
      "c",
      unix_timestamp(
        col("y.c"),
        "yyyy-MM-dd' 'HH:mm:ss"
      )
    )
  )
)
org.apache.spark.sql.AnalysisException: cannot resolve 'unix_timestamp(y.c, 'yyyy-MM-dd' 'HH:mm:ss')' due to data type mismatch: argument 1 requires (string or date or timestamp or timestamp_ntz) type, however, 'y.c' is of array<string> type.;
'Project [x#4, transform(y#5, lambdafunction(update_fields(lambda x_0#35, WithField(c, unix_timestamp(y#5.c, yyyy-MM-dd' 'HH:mm:ss, Some(Europe/Paris), false))), lambda x_0#35, false)) AS y#34, z#6, zz#7]
+- LocalRelation [x#4, y#5, z#6, zz#7]
Now I'm well aware that referencing any field inside the array returns an array of that field, as shown here:
df.select("y.c")
res6: org.apache.spark.sql.DataFrame = [c: array<string>]
But I can't find any elegant way to reference and cast a field like this one. Any ideas?
1 Answer
This is possible by referencing each element of the array through the lambda variable defined inside transform, like so:
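Here is a minimal sketch of that approach, assuming Spark 3.1+ (for Column.withField) and swapping in to_timestamp so that c actually ends up as a timestamp rather than the long that unix_timestamp returns; the val name converted is only illustrative:

import org.apache.spark.sql.functions._

val converted = df.withColumn(
  "y",
  transform(
    col("y"),
    // `elem` is the struct of the current array element, so
    // elem.getField("c") is a single string, not the whole array of c values
    elem => elem.withField(
      "c",
      to_timestamp(elem.getField("c"), "yyyy-MM-dd HH:mm:ss")
    )
  )
)

converted.printSchema
// y.element.c should now be reported as timestamp (nullable = true)

Referencing elem instead of col("y.c") is what avoids the array<string> type mismatch raised in the exception above.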