使用case类将列添加到Dataframe

8mmmxcuj  于 2021-05-27  发布在  Spark
关注(0)|答案(2)|浏览(656)

我有一个dataframe(invoice),它有两列firstname和lastname,我想用case类创建一个新的列fullname。下面的代码不起作用,因为fullname列不在dataframe中。

  1. **INPUT**
  2. | firstname | lastname |
  3. |:-----------|------------:|
  4. | tom | jerry |
  5. | hank | polo |
  6. **OUTPUT**
  7. | firstname | lastname | fullname |
  8. |:-----------|------------:|:------------:|
  9. | tom | jerry | tomjerry |
  10. | hank | polo | hankpolo |
  11. val names = invoice.as[invoiceColumns].map(updateFields)
  12. case class invoiceColumns (firstname :String,lastname:String,fullname:String)
  13. def updateFields(c: invoiceColumns): invoiceColumns= {
  14. val fullname = c.first+c.last+c.fullname
  15. c.copy(fullname = fullname)
  16. }
ljo96ir5

ljo96ir51#

也许这是有用的-

备选方案-1

  1. case class invoiceColumns (firstname :String,lastname:String,fullname:String)
  2. val df3 = Seq(("tom", "jerry"), ("hank", "polo")).toDF("firstname", "lastname")
  3. df3.show(false)
  4. df3.printSchema()
  5. /**
  6. * +---------+--------+
  7. * |firstname|lastname|
  8. * +---------+--------+
  9. * |tom |jerry |
  10. * |hank |polo |
  11. * +---------+--------+
  12. *
  13. * root
  14. * |-- firstname: string (nullable = true)
  15. * |-- lastname: string (nullable = true)
  16. */
  17. val p = df3.withColumn("fullname", concat(col("firstname"), col("lastname")))
  18. .as[invoiceColumns]
  19. p.show(false)
  20. p.printSchema()
  21. /**
  22. * +---------+--------+--------+
  23. * |firstname|lastname|fullname|
  24. * +---------+--------+--------+
  25. * |tom |jerry |tomjerry|
  26. * |hank |polo |hankpolo|
  27. * +---------+--------+--------+
  28. *
  29. * root
  30. * |-- firstname: string (nullable = true)
  31. * |-- lastname: string (nullable = true)
  32. * |-- fullname: string (nullable = true)
  33. */

备选方案-2

  1. case class invoiceColumns2 (firstname :String,lastname:String,fullname:String) {
  2. def this(firstname :String,lastname:String) = {
  3. this(firstname, lastname, firstname + lastname)
  4. }
  5. }
  6. val p1 = df3.map{case Row(firstname: String, lastname: String) => new invoiceColumns2(firstname, lastname)}
  7. p1.show(false)
  8. p1.printSchema()
  9. /**
  10. * +---------+--------+--------+
  11. * |firstname|lastname|fullname|
  12. * +---------+--------+--------+
  13. * |tom |jerry |tomjerry|
  14. * |hank |polo |hankpolo|
  15. * +---------+--------+--------+
  16. *
  17. * root
  18. * |-- firstname: string (nullable = true)
  19. * |-- lastname: string (nullable = true)
  20. * |-- fullname: string (nullable = true)
  21. */
展开查看全部
yiytaume

yiytaume2#

有几种不同的方法。

对输入和输出都使用case类

如果可以为输入和输出定义case类,则可以使用dataset api安全地完成此操作:

  1. case class Input(firstname: String, lastname: String)
  2. case class Output(firstname: String, lastname: String, fullname: String)
  3. object Output {
  4. def apply(in: Input): Output =
  5. Output(in.firstname, in.lastname, in.firstname + in.lastname)
  6. }
  7. Seq(Input("tom", "jerry"), Input("hank", "polo"))
  8. .toDS()
  9. .map(Output.apply)
  10. .show()
  1. +---------+--------+--------+
  2. |firstname|lastname|fullname|
  3. +---------+--------+--------+
  4. | tom| jerry|tomjerry|
  5. | hank| polo|hankpolo|
  6. +---------+--------+--------+

仅对输出使用case类

由于在运行时检查列名,因此安全性较低:

  1. case class Output(firstname: String, lastname: String, fullname: String)
  2. object Output {
  3. def apply(firstname: String, lastname: String): Output =
  4. Output(firstname, lastname, firstname + lastname)
  5. }
  6. Seq(("tom", "jerry"), ("hank", "polo"))
  7. .toDF("firstname", "lastname")
  8. .map(row =>
  9. Output(row.getAs[String]("firstname"), row.getAs[String]("lastname")))
  10. .show()

产生相同的输出。

展开查看全部

相关问题