Scala: using custom classes with Spark Datasets and DataFrames

dm7nw8vv · posted 2021-07-14 in Spark

I have put together some custom classes that I would like to use as native data types, i.e. in Spark SQL. I see that UDTs have just been opened up to the public, but I am having a hard time figuring them out. Is there a way for me to do this?
Example:

  case class IPv4(ipAddress: String) {
    // IPv4 converted to a number
    val addrL: Long = IPv4ToLong(ipAddress)
  }

  // Will read in a bunch of random IPs in the form {"ipAddress": "60.80.39.27"}
  val IPv4DF: DataFrame = spark.read.json(path)
  IPv4DF.createOrReplaceTempView("IPv4")
  spark.sql(
    """SELECT *
       FROM IPv4
       WHERE ipAddress.addrL > 100000"""
  )
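
The IPv4ToLong helper called above is never defined in the post. Purely as an assumption about what was intended, a minimal sketch of a dotted-quad-to-Long conversion could look like this:

  // Hypothetical helper, not shown in the original post: converts a dotted-quad
  // IPv4 string such as "60.80.39.27" to its unsigned 32-bit value as a Long.
  def IPv4ToLong(ipAddress: String): Long =
    ipAddress.split('.').foldLeft(0L)((acc, octet) => acc * 256L + octet.toLong)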

xkftehaa1#

You can construct a Dataset and filter on the case class's addrL attribute:

  case class IPv4(ipAddress: String) {
    // IPv4 converted to a number
    val addrL: Long = IPv4ToLong(ipAddress)
  }

  // provides toDF and the encoder used by .as[IPv4]
  import spark.implicits._

  val ds = Seq("60.80.39.27").toDF("ipAddress").as[IPv4]
  ds.filter(_.addrL > 100000).show
  +-----------+
  |  ipAddress|
  +-----------+
  |60.80.39.27|
  +-----------+
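
Note that the lambda passed to filter runs on the deserialized IPv4 instances, so addrL is recomputed by the case class body for each row rather than read from a column; that is why the typed Dataset approach works while the SQL query in the question finds no addrL field on the ipAddress string column. The toDF and .as[IPv4] calls also need import spark.implicits._ from the active SparkSession in scope, as included in the snippet above.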
