Scala: using custom classes in Spark with Datasets and DataFrames

Posted by dm7nw8vv on 2021-07-14 in Spark

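I have put together some custom classes that I would like to use as native data types in Spark SQL. I see that UDTs were just opened up for public use, but I am having trouble figuring them out. Is there a way for me to do this?
Example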

case class IPv4(ipAddress: String){
  // IPv4 converted to a number
  val addrL: Long = IPv4ToLong(ipAddress)
}

// Will read in a bunch of random IPs in the form {"ipAddress": "60.80.39.27"}
val IPv4DF: DataFrame = spark.read.json(path)
IPv4DF.createOrReplaceTempView("IPv4")

spark.sql(
    """SELECT *
     FROM IPv4
     WHERE ipAddress.addrL > 100000"""
    )
Answer 1 (xkftehaa)

You can build a typed Dataset from the case class and filter on its addrL field:

case class IPv4(ipAddress: String){
  // IPv4 converted to a number
  val addrL: Long = IPv4ToLong(ipAddress)
}

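// toDF and as[IPv4] need the session's implicits in scope
// (spark-shell imports them automatically)
import spark.implicits._

val ds = Seq("60.80.39.27").toDF("ipAddress").as[IPv4]

// The lambda runs on deserialized IPv4 objects, so addrL is available here
ds.filter(_.addrL > 100000).show()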
+-----------+
|  ipAddress|
+-----------+
|60.80.39.27|
+-----------+
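
Neither snippet above shows IPv4ToLong, so the case class will not compile as posted. Below is a minimal sketch of what such a helper might look like, assuming dotted-quad input and the usual big-endian packing of the four octets. The object name IPv4Util and the validation rules are assumptions for illustration, not something taken from the original post.

```scala
// Hypothetical helper: the thread calls IPv4ToLong but never defines it.
object IPv4Util {
  // Packs "a.b.c.d" into a Long as a*2^24 + b*2^16 + c*2^8 + d.
  def IPv4ToLong(ip: String): Long = {
    val parts = ip.trim.split('.')
    require(parts.length == 4, s"expected four octets: $ip")
    parts.foldLeft(0L) { (acc, part) =>
      val octet = part.toInt // throws NumberFormatException on non-numeric input
      require(octet >= 0 && octet <= 255, s"octet out of range in: $ip")
      (acc << 8) | octet
    }
  }
}
```

With `import IPv4Util.IPv4ToLong` in scope, the case class compiles as written. "60.80.39.27" maps to 1011885851, which is why the row passes the `> 100000` filter. Note that addrL exists only on the Scala object, not as a column, so the SQL query from the question cannot see it. One possible workaround is to register the same function as a UDF, for example `spark.udf.register("ipv4ToLong", IPv4Util.IPv4ToLong _)`, and then write `WHERE ipv4ToLong(ipAddress) > 100000`.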
