Finding the most frequent value in a streaming DataFrame column of arrays using Spark Scala

hgqdbh6s, posted on 2021-07-09 in Spark

I have the following streaming DataFrame:

+-------+--------------------------------------+
| name  | orderOfHobbies                       |
+-------+--------------------------------------+
| Liza  | [singing, painting]                  |
| Inter | [singing, singing]                   |
| Ovin  | [singing, playing, reading, singing] |
+-------+--------------------------------------+

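I want to find each student's favorite hobby, i.e. the hobby that occurs most often in their array. If all of a student's hobbies occur equally often, I want to drop that record, so Liza's record will be dropped. Since singing occurs most often for both Inter and Ovin, singing is their favorite hobby.

Expected output: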

+-------+---------------+
| name  | favoriteHobby |
+-------+---------------+
| Inter | singing       |
| Ovin  | singing       |
+-------+---------------+
Answer #1 (bvpmtnay)

You can use a UDF:

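import org.apache.spark.sql.functions.{col, udf}

// Returns the most frequent hobby when the array contains a repeated value
// (or has a single element); otherwise returns the marker "invalid".
val favoriteUDF = udf(
    (hobby: Seq[String]) =>
        if ((hobby.distinct.size != hobby.size) || (hobby.size == 1))
            hobby.groupBy(identity).maxBy(_._2.size)._1
        else "invalid"
)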

val df2 = df.select(
    col("name"), 
    favoriteUDF(col("orderOfHobbies")).as("favoriteHobby")
).filter("favoriteHobby != 'invalid'")

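// show works on a batch DataFrame; on a streaming DataFrame, write to a sink
// instead, e.g. df2.writeStream.format("console").start()
df2.show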
+-----+-------------+
| name|favoriteHobby|
+-----+-------------+
|Inter|      singing|
| Ovin|      singing|
+-----+-------------+
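
Note that the UDF above only checks whether the array contains a duplicate, so an array such as [singing, singing, reading, reading] still yields a favorite even though both hobbies are tied. If ties for the top count should also be dropped, the selection logic could look roughly like the sketch below. This is plain Scala with no Spark dependency; the name `favoriteHobby` and the `Option` return type are assumptions made for illustration, not part of the original answer.

```scala
// Hypothetical helper (not from the original answer): returns the single most
// frequent hobby, or None when the highest count is shared by several hobbies.
def favoriteHobby(hobbies: Seq[String]): Option[String] = {
  if (hobbies == null || hobbies.isEmpty) None
  else {
    // Count how many times each hobby occurs.
    val counts: Map[String, Int] =
      hobbies.groupBy(identity).map { case (hobby, occurrences) => hobby -> occurrences.size }
    val maxCount = counts.values.max
    // Keep only the hobbies that reach the highest count.
    val winners = counts.collect { case (hobby, n) if n == maxCount => hobby }
    // A unique winner is the favorite; a tie means the record is dropped.
    if (winners.size == 1) Some(winners.head) else None
  }
}
```

To use it in the query, it should be possible to wrap it as `udf((h: Seq[String]) => favoriteHobby(h).orNull)` and replace the string filter with `col("favoriteHobby").isNotNull`, which also avoids the "invalid" marker value.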
