Scala/Spark: checking for null elements in an array column, but IntelliJ suggests not using null?

Posted by dnph8jn4 on 2021-07-09 in Spark


I have a column named responseTimes of array type:

ArrayType(IntegerType,true)

I am trying to add another column that counts the null or unset values in this array:

val contains_null = udf((xs: Seq[Integer]) => xs.contains(null))
df.withColumn("totalNulls", when(contains_null(col("responseTimes")),
    lit(1)).otherwise(0))

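Although this gives me the correct output, IntelliJ keeps telling me to avoid using null, which makes me think this is bad practice. Is there another way to do it? Also, is it possible without a UDF?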

scala apache-spark apache-spark-sql

Source: https://stackoverflow.com/questions/66824937/scala-spark-checking-for-null-elements-in-an-array-column-but-intellij-suggests

3 Answers

iyzzxitl1#

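The reason is simple: it comes from the conventions for Spark UDFs. Spark handles null differently because it runs in a distributed way. I don't know whether you are aware of array_contains, a built-in function in Spark SQL.
If you do need a UDF, follow these rules:
Scala code should handle null values gracefully and should not fail when it encounters them.
For unknown, missing, or irrelevant values, Scala code should return None (or null). DataFrames should also use null for unknown, missing, or irrelevant values.
Use Option in Scala code, and fall back to null only if Option becomes a performance bottleneck.
For more information, see this link: https://mungingdata.com/apache-spark/dealing-with-null/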
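
As a rough illustration of the rules above, here is a minimal sketch in plain Scala (no Spark needed). The helper meanResponseTime is hypothetical and not part of the original question. It only shows the pattern: wrap possibly-null input in Option, and return None instead of failing when there is nothing to compute.

```scala
// Hypothetical helper: mean of the non-null response times,
// or None when the array is null, empty, or contains only nulls.
def meanResponseTime(xs: Seq[Integer]): Option[Double] = {
  // Option(xs) turns a null array into None; Option(x) does the same per element.
  val present: Seq[Int] =
    Option(xs).getOrElse(Seq.empty[Integer]).flatMap(x => Option(x)).map(_.intValue)
  if (present.isEmpty) None else Some(present.sum.toDouble / present.size)
}

meanResponseTime(Seq[Integer](1, null, 3)) // Some(2.0)
meanResponseTime(null)                     // None
```

A function like this could be passed to udf(...) in the same way as contains_null in the question. To my understanding Spark maps a None result to a null cell, but treat that as an assumption to verify on your Spark version.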


irlmq6kh2#

You can rewrite your UDF to use Option. In Scala, Option(null) gives None, so you can do:

val contains_null = udf((xs: Seq[Integer]) => xs.exists(e => Option(e).isEmpty))

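However, if you are using Spark 2.4+, it is better to use Spark's built-in functions for this. To check whether an array column contains null elements, use exists, as suggested in @mck's answer.
If you want the number of nulls in the array, you can combine the filter and size functions: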

df.withColumn("totalNulls", size(expr("filter(responseTimes, x -> x is null)")))
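
If you are on Spark 3.0 or later, the same count can be written without a SQL string, using the Column-based filter from org.apache.spark.sql.functions. This is only a sketch: the local SparkSession setup and the name counted are assumptions for illustration, and the sample data is borrowed from @mck's answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, filter, size}

// Local session for the sketch; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder().master("local[1]").appName("count-nulls").getOrCreate()

// Same sample data as in @mck's answer
val df = spark.sql("select array(1,null,2) responseTimes union all select array(3,4)")

// Keep only the null elements, then count what is left
val counted = df.withColumn("totalNulls", size(filter(col("responseTimes"), x => x.isNull)))

counted.show()
```

For the sample data this gives totalNulls = 1 for the first row and 0 for the second.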


whhtz7ly3#

A better approach may be to use the higher-order function exists to check isnull for each array element:

// sample dataframe
val df = spark.sql("select array(1,null,2) responseTimes union all select array(3,4)")

df.show
+-------------+
|responseTimes|
+-------------+
|      [1,, 2]|
|       [3, 4]|
+-------------+

// check whether there exists null elements in the array
val df2 = df.withColumn("totalNulls", expr("int(exists(responseTimes, x -> isnull(x)))"))

df2.show
+-------------+----------+
|responseTimes|totalNulls|
+-------------+----------+
|      [1,, 2]|         1|
|       [3, 4]|         0|
+-------------+----------+
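
On Spark 3.0+, exists is also available as a Scala function in org.apache.spark.sql.functions, so the check above can be sketched without expr. The session setup and the name flagged are assumptions for illustration only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, exists}

// Local session for the sketch; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder().master("local[1]").appName("exists-null").getOrCreate()

val df = spark.sql("select array(1,null,2) responseTimes union all select array(3,4)")

// exists returns a boolean column; the cast turns true/false into 1/0
val flagged = df.withColumn("totalNulls", exists(col("responseTimes"), _.isNull).cast("int"))

flagged.show()
```

This avoids both the UDF and the null literal that IntelliJ warns about, since the null test is expressed through Column.isNull.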

You can also use array_max together with transform:

val df2 = df.withColumn("totalNulls", expr("int(array_max(transform(responseTimes, x -> isnull(x))))"))

df2.show
+-------------+----------+
|responseTimes|totalNulls|
+-------------+----------+
|      [1,, 2]|         1|
|       [3, 4]|         0|
+-------------+----------+

