Defining a Spark Scala UDF with an optional input parameter [closed]

jjhzyzn0 | asked 2022-11-09 | Scala

Closed. This question needs details or clarity. It is not currently accepting answers. Closed last month.
I wrote the UDF below with the aim of handling the case where one of its arguments is not defined. Here is the code:

val addTimeFromCols: UserDefinedFunction = udf((year: String, month: String, day: String, hour: String) => {
  // Option(hour) maps a null hour to None, so the default can be substituted
  Option(hour) match {
    case None    => List(year, month, day).mkString(DASH_SEP).concat(SPACE).concat(defaultHour)
    case Some(x) => List(year, month, day).mkString(DASH_SEP).concat(SPACE).concat(x)
  }
})

def addTimestampFromFileCols(): DataFrame = df
  .withColumn(COLUMN_TS,
    addTimeFromCols(col(COLUMN_YEAR), col(COLUMN_MONTH), col(COLUMN_DAY), col(COLUMN_HOUR))
      .cast(TimestampType))
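For what it's worth, the null handling above can be factored into a plain Scala function and then wrapped in the UDF, which makes it unit-testable without a SparkSession. This is only a sketch; the values of DASH_SEP, SPACE and defaultHour are assumptions, since the question does not show their definitions:

```scala
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

val DASH_SEP = "-"     // assumed value
val SPACE = " "        // assumed value
val defaultHour = "00" // assumed value

// Plain function: Option(hour) turns a null hour into None,
// and getOrElse substitutes the default hour.
def formatTimestamp(year: String, month: String, day: String, hour: String): String =
  List(year, month, day).mkString(DASH_SEP) + SPACE + Option(hour).getOrElse(defaultHour)

// The UDF is now just a thin wrapper around the tested function.
val addTimeFromCols: UserDefinedFunction = udf(formatTimestamp _)
```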

My goal is for this function to work for every use case (DataFrames that have an hour column and DataFrames that do not, in which case I fall back to a default value). Unfortunately, when I test with a DataFrame that lacks the column, it fails with the following error:

cannot resolve '`HOUR`' given input columns

Is there any way to solve this?

yjghlzjz #1

If the column does not exist, you must supply a default value through the lit() function; otherwise an error is thrown. The following works for me:

scala> defaultHour
res77: String = 00

scala> :paste
// Entering paste mode (ctrl-D to finish)

def addTimestampFromFileCols(df:DataFrame) =
{
val hr = if( df.columns.contains("hour") ) col(COLUMN_HOUR) else lit(defaultHour)
df.withColumn(COLUMN_TS, addTimeFromCols(col(COLUMN_YEAR), col(COLUMN_MONTH), col(COLUMN_DAY), hr).cast(TimestampType))
}

// Exiting paste mode, now interpreting.

addTimestampFromFileCols: (df: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame

scala>

Positive case (hour column present):

scala> val df = Seq(("2019","01","10","09")).toDF("year","month","day","hour")
df: org.apache.spark.sql.DataFrame = [year: string, month: string ... 2 more fields]

scala> addTimestampFromFileCols(df).show(false)
+----+-----+---+----+-------------------+
|year|month|day|hour|tstamp             |
+----+-----+---+----+-------------------+
|2019|01   |10 |09  |2019-01-10 09:00:00|
+----+-----+---+----+-------------------+

Negative case (hour column absent):

scala> val df = Seq(("2019","01","10")).toDF("year","month","day")
df: org.apache.spark.sql.DataFrame = [year: string, month: string ... 1 more field]

scala> addTimestampFromFileCols(df).show(false)
+----+-----+---+-------------------+
|year|month|day|tstamp             |
+----+-----+---+-------------------+
|2019|01   |10 |2019-01-10 00:00:00|
+----+-----+---+-------------------+
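As a footnote, the same result can be obtained without a UDF at all, using the built-in concat_ws and coalesce functions, which Catalyst can optimize. This is a sketch under assumptions: hard-coded column names and a "00" default hour, since the original constants are not shown in the question:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col, concat_ws, lit}
import org.apache.spark.sql.types.TimestampType

def addTimestampBuiltIn(df: DataFrame, defaultHour: String = "00"): DataFrame = {
  // Use the hour column if present (coalesce also covers null values in it),
  // otherwise fall back to a literal default column.
  val hr =
    if (df.columns.contains("hour")) coalesce(col("hour"), lit(defaultHour))
    else lit(defaultHour)
  df.withColumn("tstamp",
    concat_ws(" ", concat_ws("-", col("year"), col("month"), col("day")), hr)
      .cast(TimestampType))
}
```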
