Using Spark Scala to find the most frequent value (mode) in a streaming DataFrame column that holds arrays

hgqdbh6s, posted on 2021-07-09 in Spark

I have the following streaming DataFrame:

    +-------+--------------------------------------+
    | name  | orderOfHobbies                       |
    +-------+--------------------------------------+
    | Liza  | [singing, painting]                  |
    | Inter | [singing, singing]                   |
    | Ovin  | [singing, playing, reading, singing] |
    +-------+--------------------------------------+

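I want to find each student's favorite hobby. If every hobby in a student's list occurs the same number of times, I want to drop that record, so Liza's record would be dropped. Since singing occurs most often for both Inter and Ovin, singing would be their favorite hobby.

Expected output: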

    +-------+---------------+
    | name  | favoriteHobby |
    +-------+---------------+
    | Inter | singing       |
    | Ovin  | singing       |
    +-------+---------------+

bvpmtnay1#

You can use a UDF:

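    import org.apache.spark.sql.functions.{col, udf}

    // Return the most frequent hobby, or "invalid" when every hobby is distinct.
    // A single-element list is treated as valid.
    // Note: this only detects the all-distinct case; a tie between repeated
    // hobbies (e.g. [a, a, b, b]) is not detected.
    val favoriteUDF = udf(
      (hobby: Seq[String]) =>
        if ((hobby.distinct.size != hobby.size) || (hobby.size == 1))
          hobby.groupBy(identity).maxBy(_._2.size)._1
        else "invalid"
    )

    // Compute the favorite hobby per row and drop the rows marked "invalid".
    val df2 = df.select(
      col("name"),
      favoriteUDF(col("orderOfHobbies")).as("favoriteHobby")
    ).filter("favoriteHobby != 'invalid'")

    // show() works on a batch DataFrame; a streaming DataFrame needs a sink,
    // e.g. df2.writeStream.format("console").start()
    df2.show()

    +-----+-------------+
    | name|favoriteHobby|
    +-----+-------------+
    |Inter|      singing|
    | Ovin|      singing|
    +-----+-------------+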
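
If ties between repeated hobbies should also be dropped (for example `[a, a, b, b]`), the selection rule could be written as a plain Scala function and then wrapped in a UDF. The following is only a sketch under that assumption: `favoriteHobby` is a hypothetical helper name that does not appear in the answer above, and treating a null or empty list as "no favorite" is a choice made here, not something the question specifies.

```scala
// Hypothetical helper (not part of the original answer): returns the single
// most frequent hobby, or None when the highest count is shared.
def favoriteHobby(hobbies: Seq[String]): Option[String] = {
  if (hobbies == null || hobbies.isEmpty) None // assumption: no data, no favorite
  else {
    // Count how many times each hobby occurs.
    val counts = hobbies.groupBy(identity).map { case (h, xs) => (h, xs.size) }
    val top = counts.values.max
    // Collect every hobby that reaches the highest count.
    val winners = counts.collect { case (h, n) if n == top => h }.toList
    // Exactly one winner means a clear favorite; this keeps lists such as
    // [singing, singing], matching the expected output in the question.
    if (winners.size == 1) Some(winners.head) else None
  }
}

// With Spark on the classpath, this could be wrapped as a UDF, for example:
//   val favoriteUDF = udf((h: Seq[String]) => favoriteHobby(h).orNull)
// and rows without a favorite dropped with:
//   .filter(col("favoriteHobby").isNotNull)
```

Returning `Option` instead of the `"invalid"` marker string avoids clashing with a hobby that happens to be named "invalid", at the cost of filtering on null rather than on a string comparison.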