结构的scala spark udf过滤器阵列

xn1cxnb4 于 2021-07-12 发布在 Spark

关注(0)|答案(3)|浏览(435)

我有一个带有模式的Dataframe

root
 |-- x: Long (nullable = false)
 |-- y: Long (nullable = false)
 |-- features: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- score: double (nullable = true)

例如，我有数据

+--------------------+--------------------+------------------------------------------+
|                x   |              y     |       features                           |
+--------------------+--------------------+------------------------------------------+
|10                  |          9         |[["f1", 5.9], ["ft2", 6.0], ["ft3", 10.9]]|
|11                  |          0         |[["f4", 0.9], ["ft1", 4.0], ["ft2", 0.9] ]|
|20                  |          9         |[["f5", 5.9], ["ft2", 6.4], ["ft3", 1.9] ]|
|18                  |          8         |[["f1", 5.9], ["ft4", 8.1], ["ft2", 18.9]]|
+--------------------+--------------------+------------------------------------------+

我想用一个特定的前缀过滤特征，比如说“ft”，所以最终我想要的结果是：

+--------------------+--------------------+-----------------------------+
|                x   |              y     |       features              |
+--------------------+--------------------+-----------------------------+
|10                  |          9         |[["ft2", 6.0], ["ft3", 10.9]]|
|11                  |          0         |[["ft1", 4.0], ["ft2", 0.9] ]|
|20                  |          9         |[["ft2", 6.4], ["ft3", 1.9] ]|
|18                  |          8         |[["ft4", 8.1], ["ft2", 18.9]]|
+--------------------+--------------------+-----------------------------+

我没有使用spark2.4+，所以我不能使用这里提供的解决方案：spark（scala）filter structs array without explode
我尝试使用自定义项，但仍然不起作用。这是我的尝试。我定义自定义项：

def filterFeature: UserDefinedFunction = 
udf((features: Seq[Row]) =>
    features.filter{
        x.getString(0).startsWith("ft")
    }
)

但如果我应用这个自定义项

df.withColumn("filtered", filterFeature($"features"))

我得到了错误 Schema for type org.apache.spark.sql.Row is not supported . 我发现我回不来了 Row 来自udf。然后我试着

def filterFeature: UserDefinedFunction = 
udf((features: Seq[Row]) =>
    features.filter{
        x.getString(0).startsWith("ft")
    }, (StringType, DoubleType)
)

然后我得到一个错误：

error: type mismatch;
 found   : (org.apache.spark.sql.types.StringType.type, org.apache.spark.sql.types.DoubleType.type)
 required: org.apache.spark.sql.types.DataType
              }, (StringType, DoubleType)
                 ^

我还尝试了一些答案所建议的案例课程：

case class FilteredFeature(featureName: String, featureScore: Double)
def filterFeature: UserDefinedFunction = 
udf((features: Seq[Row]) =>
    features.filter{
        x.getString(0).startsWith("ft")
    }, FilteredFeature
)

但我得到了：

error: type mismatch;
 found   : FilteredFeature.type
 required: org.apache.spark.sql.types.DataType
              }, FilteredFeature
                 ^

我试过：

case class FilteredFeature(featureName: String, featureScore: Double)
def filterFeature: UserDefinedFunction = 
udf((features: Seq[Row]) =>
    features.filter{
        x.getString(0).startsWith("ft")
    }, Seq[FilteredFeature]
)

我得到了：

<console>:192: error: missing argument list for method apply in class GenericCompanion
Unapplied methods are only converted to functions when a function type is expected.
You can make this conversion explicit by writing `apply _` or `apply(_)` instead of `apply`.
              }, Seq[FilteredFeature]
                    ^

我试过：

case class FilteredFeature(featureName: String, featureScore: Double)
def filterFeature: UserDefinedFunction = 
udf((features: Seq[Row]) =>
    features.filter{
        x.getString(0).startsWith("ft")
    }, Seq[FilteredFeature](_)
)

我得到了：

<console>:201: error: type mismatch;
 found   : Seq[FilteredFeature]
 required: FilteredFeature
              }, Seq[FilteredFeature](_)
                          ^

在这种情况下我该怎么办？

scala apache-spark apache-spark-sql

来源：https://stackoverflow.com/questions/66594216/how-to-write-user-defined-function-whose-input-is-array-of-structtype

3条答案

按热度按时间

au9on6nz1#

您有两种选择：
a）为udf提供一个模式，让我们返回 Seq[Row] b）转换 Seq[Row] 到 Seq 的 Tuple2 或者一个case类，则不需要提供模式（但是如果使用元组，结构字段名将丢失！）
我更喜欢选项a）对于您的情况（适用于具有许多字段的结构）：

val schema = df.schema("features").dataType

val filterFeature = udf((features:Seq[Row]) => features.filter(_.getAs[String]("name").startsWith("ft")),schema)

赞(0）回复(0）举报 2021-07-12

htzpubme2#

试试这个：

def filterFeature: UserDefinedFunction =
    udf((features: Row) => {
      features.getAs[Array[Array[Any]]]("features").filter(in => in(0).asInstanceOf[String].startsWith("ft"))
 })

赞(0）回复(0）举报 2021-07-12

a2mppw5e3#

如果您没有使用spark 2.4，那么这应该适用于您的情况

case class FilteredFeature(featureName: String, featureScore: Double)

import org.apache.spark.sql.functions._  
def filterFeature: UserDefinedFunction = udf((feature: Seq[Row]) => {
  feature.filter(x => {
    x.getString(0).startsWith("ft")
  }).map(r => FilteredFeature(r.getString(0), r.getDouble(1)))
})

df.select($"x", $"y", filterFeature($"feature") as "filter").show(false)

输出：

+---+---+-----------------------+
|x  |y  |filter                 |
+---+---+-----------------------+
|10 |9  |[[ft2,6.0], [ft3,10.9]]|
|11 |0  |[[ft1,4.0], [ft2,0.9]] |
|20 |9  |[[ft2,6.4], [ft3,1.9]] |
|18 |8  |[[ft4,8.1], [ft2,18.9]]|
+---+---+-----------------------+

赞(0）回复(0）举报 2021-07-12

我来回答

结构的scala spark udf过滤器阵列

3条答案

相关问题

热门标签

最新问答