arraytype pyspark列中唯一元素行的平均值

qpgpyjmq  于 2021-05-29  发布在  Spark
关注(0)|答案(2)|浏览(333)

我有一个大的pysparkDataframe(23m行),格式如下:

names, sentiment
["Lily","Kerry","Mona"], 10
["Kerry", "Mona"], 2
["Mona"], 0

我想计算“名称”列中每个唯一名称的平均情绪,结果如下:

name, sentiment
"Lily", 10
"Kerry", 6
"Mona", 4
mctunoxg

mctunoxg1#

val avgDF = Seq((Seq("Lily","Kerry","Mona"), 10),
      (Seq("Kerry", "Mona"), 2),
      (Seq("Mona"), 0)
  ).toDF("names", "sentiment")

  val avgDF1 = avgDF.withColumn("name", explode('names))
  val avgResultDF = avgDF1.groupBy("name").agg(avg(col("sentiment")))

  avgResultDF.show(false)
  //      +-----+--------------+
  //      |name |avg(sentiment)|
  //      +-----+--------------+
  //      |Lily |10.0          |
  //      |Kerry|6.0           |
  //      |Mona |4.0           |
  //      +-----+--------------+
bjp0bcyl

bjp0bcyl2#

只需分解数组,然后分组
Pypark当量

import pyspark.sql.functions as f
df1 = df.select(f.explode('names').alias('name'),'sentiment')

df1.groupBy('name').agg(f.avg('sentiment').alias('sentiment')).show()

相关问题