我有一个dataframe模式(用于以parquet格式存储的数据),如下所示
root
|-- id: string (nullable = true)
|-- mid: integer (nullable = true)
|-- relationship: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- child_id: string (nullable = true)
| | |-- details: struct (nullable = true)
| | | |-- package_contains: string (nullable = true)
| | | |-- package_level: string (nullable = true)
如果我有这样的问题
df.select($"id",$"mid").filter(array_contains(col("relationship.child_id"), "id1") && $"mid" === 1).show(false)
有人知道spark是否反序列化了吗 details
struct数组中的struct( relationship
)仅查询时 child_id
列?如果有办法验证的话?
1条答案
按热度按时间jaxagkaj1#
斯帕克不会看电视的
details
除非有必要。我用相同的结构创建了一个新的人工数据集。这个
details
结构由自定义项计算。稍后,udf将被更改,以便在运行时引发异常。如果查询仍然有效,我知道尚未调用udf,并且details
尚未计算结构。尝试一:“好”的自定义项
root
|-- id: long (nullable = false)
|-- mid: integer (nullable = false)
|-- relationship: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- child_id: string (nullable = false)
| | |-- details: struct (nullable = true)
| | | |-- _1: string (nullable = true)
| | | |-- _2: string (nullable = true)
+---+---+---------------+
| id|mid| relationship|
+---+---+---------------+
| 0| 1|id0, [a, b]|
| 1| 1|id1, [a, b]|
| 2| 1|id2, [a, b]|
| 3| 1|id3, [a, b]|
| 4| 1|id4, [a, b]|
+---+---+---------------+
def details(x:Int): Tuple2[String, String] = throw new NullPointerException
val detailsUdf = udf(details _)
df.select($"id",$"mid").filter(array_contains(col("relationship.child_id"), "id1") && $"mid" === 1).show(false)
+---+---+
|id |mid|
+---+---+
|1 |1 |
+---+---+
df.select($"id",$"mid").filter(array_contains(col("relationship.details._1"), "id1") && $"mid" === 1).show(false)