Filling a nested array in a dataframe column

nue99wik · asked 2021-05-29 · Spark

I have a DataFrame with this structure:

 |-- col0: double (nullable = true)
 |-- arr: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: double (containsNull = false)

The array column must always hold two elements (inner arrays), without losing the ones it already has. For example, I have:

|0.0 |[[0.0, 182.0], [1.0, 14.0]]|
|0.0 |[[1.0, 60.0]]              |
|1.0 |[[0.0, 3.0], [1.0, 48.0]]  |
|2.0 |[[1.0, 6.0], [0.0, 111.0]] |
|0.0 |[[1.0, 4.0], [0.0, 120.0]] |
|2.0 |[[0.0, 21.0]]              |
|0.0 |[[0.0, 3.0], [1.0, 13.0]]  |

The desired result is:

|0.0 |[[0.0, 182.0], [1.0, 14.0]]|
|0.0 |[[0.0, 0.0], [1.0, 60.0]]  |
|1.0 |[[0.0, 3.0], [1.0, 48.0]]  |
|2.0 |[[0.0, 111.0], [1.0, 6.0]] |
|0.0 |[[0.0, 120.0], [1.0, 4.0]] |
|2.0 |[[0.0, 21.0], [1.0, 0.0]]  |
|0.0 |[[0.0, 3.0], [1.0, 13.0]]  |

So when the array already has 2 elements, nothing needs to be done. But if it has only one element, I need to create the second element with the missing key: if the existing element starts with 0.0, I need to add [1.0, 0.0]; if it starts with 1.0, I need to add [0.0, 0.0].
I tried the following, but it didn't work:

val headValue = udf((arr: Array[Array[Double]], maxValue: Double, minValue: Double) => {
  val flatArr = arr.flatMap(_.headOption)
  val nArr = arr
  if (flatArr.length == 1){
    if (flatArr.head == maxValue){
      nArr :+  Array (minValue, 0.0)
    } else {
      nArr :+  Array (maxValue, 0.0)
    }
  } else {
    nArr
  }
})

df.withColumn("Test", headValue(df("arrOfarr"), lit(maxValue), lit(minValue) ))

The error is:

org.apache.spark.SparkException: Failed to execute user defined function(anonfun$20: (array<array<double>>, double, double) => array<array<double>>)
...
Caused by: java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [[D
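The cast failure can be reproduced in plain Scala, without Spark: Spark hands ArrayType column values to a UDF as a Seq (a WrappedArray under the hood), and a Seq can never be cast to a JVM array such as [[D. A minimal stand-in (the variable names here are illustrative):

```scala
object CastRepro extends App {
  // Stand-in for what Spark actually passes to the UDF: a Seq, not an Array.
  val fromSpark: Seq[Double] = Seq(1.0, 60.0)

  // The cast Spark performs for an Array[Double] UDF parameter fails at
  // runtime with a ClassCastException, captured here in a Try.
  val cast = scala.util.Try(fromSpark.asInstanceOf[Array[Double]])

  println(cast.isFailure) // true: a Seq cannot be cast to [D
}
```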

Answer by lrl1mhuk:

Instead of defining the UDF's input as Array, define it as Seq and you should be fine:

val headValue = udf((arr: Seq[Seq[Double]], maxValue: Double, minValue: Double) => {
  // keys already present in the array (the first value of each inner element)
  val flatArr = arr.flatMap(_.headOption)
  if (flatArr.length == 1) {
    // only one element: append the missing key with a 0.0 value
    if (flatArr.head == maxValue) {
      arr :+ Seq(minValue, 0.0)
    } else {
      arr :+ Seq(maxValue, 0.0)
    }
  } else {
    arr
  }
})
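For a quick sanity check, the same logic can be exercised as a plain Scala function, without a Spark session. This sketch mirrors the UDF body (the name fillMissing is illustrative); note that it appends the missing element, so a trailing sortBy(_.head) would be needed if you want the sorted ordering shown in the expected output:

```scala
// Plain-Scala version of the UDF body, for testing without Spark.
// Assumes each inner element is [key, value] with keys minValue/maxValue.
def fillMissing(arr: Seq[Seq[Double]],
                maxValue: Double,
                minValue: Double): Seq[Seq[Double]] = {
  val keys = arr.flatMap(_.headOption) // keys already present
  if (keys.length == 1) {
    // one element only: append the missing key with value 0.0
    if (keys.head == maxValue) arr :+ Seq(minValue, 0.0)
    else arr :+ Seq(maxValue, 0.0)
  } else arr
}

// e.g. fillMissing(Seq(Seq(1.0, 60.0)), 1.0, 0.0)
//      == Seq(Seq(1.0, 60.0), Seq(0.0, 0.0))
```

With the UDF version, maxValue and minValue would be passed as lit(1.0) and lit(0.0), as in the question's withColumn call.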
