Spark: apply sliding() to each row without a UDF

Posted by e3bfsja2 on 2021-05-18 in Spark

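I have a DataFrame with several columns. One of the columns contains strings, and I want to apply the string sliding(n) function to each string in that column. Is there a way to do this without a user-defined function?
Example: my DataFrame is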

var df = Seq((0, "hello"), (1, "hola")).toDF("id", "text")

I want to apply sliding(3) to each element of the "text" column, to get a result corresponding to

Seq(
    (0, Seq("hel", "ell", "llo")),
    (1, Seq("hol", "ola"))
)

How can I do this?

scala apache-spark apache-spark-sql

Source: https://stackoverflow.com/questions/64686364/spark-apply-sliding-to-each-row-without-udf

1 Answer

anhgbhbe1#

For Spark versions >= 2.4.0, this can be done with the built-in functions array_repeat, transform, and substring.

import org.apache.spark.sql.functions.{array_repeat, expr, length}

// Repeat each string once per window: length - 3 + 1 copies
val repeated_df = df.withColumn("tmp", array_repeat($"text", length($"text") - 3 + 1))
// Slice each copy with the transform higher-order function;
// i is the 0-based element index, while substring positions are 1-based
val res = repeated_df.withColumn("str_slices",
                                 expr("transform(tmp, (x, i) -> substring(x, i + 1, 3))")
                                )
// res.show(false)
+---+-----+---------------------+---------------+
|id |text |tmp                  |str_slices     |
+---+-----+---------------------+---------------+
|0  |hello|[hello, hello, hello]|[hel, ell, llo]|
|1  |hola |[hola, hola]         |[hol, ola]     |
+---+-----+---------------------+---------------+
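
The window arithmetic behind this answer can be checked in plain Scala, without Spark. The following is only a minimal sketch: `slicesByIndex` is a hypothetical helper (not part of Spark or the original answer) that mirrors the expression above by producing `length - n + 1` substrings, the i-th starting at 0-based offset i. It is compared against Scala's own `sliding` on the two example strings.

```scala
// Hypothetical helper for illustration only; mirrors the array_repeat + transform logic.
def slicesByIndex(s: String, n: Int): List[String] = {
  // Number of windows; zero when the string is shorter than n
  val count = math.max(s.length - n + 1, 0)
  // Scala's substring takes a 0-based start and an exclusive end,
  // whereas Spark SQL's substring takes a 1-based position and a length
  List.tabulate(count)(i => s.substring(i, i + n))
}

val words = List("hello", "hola")
val viaHelper = words.map(slicesByIndex(_, 3))   // index-based slicing
val viaSliding = words.map(_.sliding(3).toList)  // Scala's built-in sliding
```

For the two example strings both approaches give the same slices. They differ for inputs shorter than the window: Scala's `sliding` returns one partial window (`"hi".sliding(3)` yields `"hi"`), while the index-based helper returns none.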

Posted 2021-05-19
