Adding a new column to a Dataframe by using the values of multiple other columns - Spark/Scala

Posted by xdnvmnnf on 2022-11-09 in Scala

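I'm a beginner and not familiar with SQL and Dataframes. I have a Dataframe to which I need to add a new column based on the values of other columns. I have a nested IF formula from Excel that I need to implement (to fill in the new column). Converted into programming terms, it looks like this: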

if(k =='yes')
{
  if(!(i==''))
  {
    if(diff(max_date, target_date) < 0)
    {
      if(j == '')
      {
        "pending" //the value of the column
      }
      else {
        "approved" //the value of the column
      }
    }
    else{
      "expired" //the value of the column
    }
  }
  else{
    "" //the value should be empty
  }
}
else{
  "" //the value should be empty
}

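i, j, k are three other columns in the Dataframe. I know we can use withColumn and when to add new columns based on other columns, but I'm not sure how to implement the logic above with that approach.
What would be a simple/efficient way to implement the logic above for adding the new column? Any help would be appreciated.
Thank you.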

scala

Source: https://stackoverflow.com/questions/47445328/adding-a-new-column-to-a-dataframe-by-using-the-values-of-multiple-other-columns

1 Answer

svgewumm1#

First, let's simplify the IF statement:

if(k == "yes" && i.nonEmpty)
  if(maxDate - targetDate < 0)
    if (j.isEmpty) "pending" 
    else "approved"
  else "expired"
else ""
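
As a quick sanity check, the nested form from the question and the flattened form above can be compared in plain Scala, without Spark. This is only a sketch: the object name `SimplifyCheck`, the function names `nested` and `flat`, and the `diff` parameter (standing in for `maxDate - targetDate`) are illustrative and not part of the original post.

```scala
object SimplifyCheck {
  // The nested form, transcribed from the question's pseudocode.
  // `diff` stands in for diff(max_date, target_date).
  def nested(i: String, j: String, k: String, diff: Long): String =
    if (k == "yes") {
      if (!(i == "")) {
        if (diff < 0) {
          if (j == "") "pending" else "approved"
        } else "expired"
      } else ""
    } else ""

  // The flattened form: the two outer conditions are merged with &&.
  def flat(i: String, j: String, k: String, diff: Long): String =
    if (k == "yes" && i.nonEmpty)
      if (diff < 0)
        if (j.isEmpty) "pending"
        else "approved"
      else "expired"
    else ""

  def main(args: Array[String]): Unit = {
    // Compare both forms over every combination of a few representative inputs.
    for {
      i <- Seq("", "x")
      j <- Seq("", "y")
      k <- Seq("yes", "no")
      diff <- Seq(-1L, 0L, 1L)
    } assert(nested(i, j, k, diff) == flat(i, j, k, diff))
    println("nested and flat agree on all sampled inputs")
  }
}
```

Both outer `else` branches of the original return the empty string, which is why they can be merged into a single condition.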

Now there are two main ways to implement this:
1. Using a custom UDF
2. Using Spark built-in functions: coalesce, when, otherwise

Custom UDF

Because your conditions are fairly complex, the second approach will be rather tricky. A custom UDF should fit your needs.

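import org.apache.spark.sql.functions.{lit, udf}
// The $"..." column syntax also needs `import spark.implicits._` (spark being your SparkSession)

// Plain Scala version of the decision logic; the two dates are passed as numeric values
def getState(i: String, j: String, k: String, maxDate: Long, targetDate: Long): String =  
  if(k == "yes" && i.nonEmpty)
    if(maxDate - targetDate < 0)
      if (j.isEmpty) "pending" 
      else "approved"
    else "expired"
  else ""

// Wrap the function as a UDF and apply it; lit(0) is a placeholder for each date argument
val stateUdf = udf(getState _)
df.withColumn("state", stateUdf($"i",$"j",$"k",lit(0),lit(0)))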

Just replace the two lit(0) placeholders with your date code, and this should work for you.
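
One possible way to fill in those placeholders is sketched below. It assumes, which the question does not confirm, that max_date and target_date are columns of the Dataframe holding timestamps as "yyyy-MM-dd HH:mm:ss" strings (or date/timestamp values). Under that assumption, Spark's `unix_timestamp` converts each to epoch seconds as a Long, which matches the UDF's parameter types. The object name `StateUdfSketch`, the helper `withState`, and the sample rows are illustrative only.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf, unix_timestamp}

object StateUdfSketch {
  // Same decision logic as in the answer.
  def getState(i: String, j: String, k: String, maxDate: Long, targetDate: Long): String =
    if (k == "yes" && i.nonEmpty)
      if (maxDate - targetDate < 0)
        if (j.isEmpty) "pending"
        else "approved"
      else "expired"
    else ""

  val stateUdf = udf(getState _)

  // Assumption: max_date and target_date are columns of df, not Scala values.
  def withState(df: DataFrame): DataFrame =
    df.withColumn(
      "state",
      stateUdf(
        col("i"), col("j"), col("k"),
        unix_timestamp(col("max_date")),    // epoch seconds as Long
        unix_timestamp(col("target_date"))  // epoch seconds as Long
      )
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("state-udf-sketch").getOrCreate()
    import spark.implicits._
    // Made-up sample rows, for illustration only.
    val df = Seq(
      ("x", "", "yes", "2022-01-01 00:00:00", "2022-02-01 00:00:00"),
      ("x", "y", "yes", "2022-03-01 00:00:00", "2022-02-01 00:00:00")
    ).toDF("i", "j", "k", "max_date", "target_date")
    withState(df).show()
    spark.stop()
  }
}
```

If the dates are instead plain Scala values known up front, wrapping them with `lit(...)` as in the answer is enough. Note that `getState` would throw on a null `i`, `j`, or `k`, so null strings would need to be handled before calling it.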

Using Spark built-in functions

If you notice performance problems, you can switch to coalesce, otherwise, and when, as follows:

import org.apache.spark.sql.functions.{coalesce, lit, when}

// Each step fills "state" only where the earlier steps left it null; the last step defaults to ""
val isApproved = df.withColumn("state", when($"k" === "yes" && $"i" =!= "" && (lit(max_date) - lit(target_date) < 0) && $"j" =!= "", "approved").otherwise(null))
val isPending = isApproved.withColumn("state", coalesce($"state", when($"k" === "yes" && $"i" =!= "" && (lit(max_date) - lit(target_date) < 0) && $"j" === "", "pending").otherwise(null)))
val isExpired = isPending.withColumn("state", coalesce($"state", when($"k" === "yes" && $"i" =!= "" && (lit(max_date) - lit(target_date) >= 0), "expired").otherwise(null)))
val finalDf = isExpired.withColumn("state", coalesce($"state", lit("")))

I have used custom UDFs on large input sources in the past without problems, and a custom UDF can produce more readable code, especially in this case.
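
The same built-in logic can also be written as a single chained `when(...).when(...).otherwise(...)` expression, which avoids the intermediate Dataframes. This is a sketch under assumptions, not the answer's own code: it assumes max_date and target_date are date columns of the Dataframe and uses Spark's `datediff` (end date minus start date, in days) for the comparison. The object name `StateColumnSketch` and the helpers `withState` and `sample` are illustrative.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, datediff, when}

object StateColumnSketch {
  // Assumption: i, j, k are string columns; max_date and target_date are date columns.
  def withState(df: DataFrame): DataFrame = {
    val eligible = col("k") === "yes" && col("i") =!= ""
    // datediff(end, start) is end minus start in days, i.e. max_date - target_date here.
    val beforeTarget = datediff(col("max_date"), col("target_date")) < 0
    df.withColumn(
      "state",
      when(eligible && beforeTarget && col("j") === "", "pending")
        .when(eligible && beforeTarget && col("j") =!= "", "approved")
        .when(eligible && !beforeTarget, "expired")
        .otherwise("")
    )
  }

  // Made-up sample rows, for illustration only.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("x", "", "yes", "2022-01-01", "2022-02-01"),
      ("x", "y", "yes", "2022-01-01", "2022-02-01"),
      ("x", "y", "yes", "2022-03-01", "2022-02-01"),
      ("", "y", "yes", "2022-01-01", "2022-02-01"),
      ("x", "y", "no", "2022-01-01", "2022-02-01")
    ).toDF("i", "j", "k", "max_date", "target_date")
      .withColumn("max_date", col("max_date").cast("date"))
      .withColumn("target_date", col("target_date").cast("date"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("state-sketch").getOrCreate()
    withState(sample(spark)).show()
    spark.stop()
  }
}
```

The conditions are spelled out in full in each branch so that a row with a null in any of the compared columns matches no branch and falls through to the empty string, as it does in the coalesce version.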
