PySpark: count the number of distinct dates on which an item appears in a DataFrame

g6ll5ycj · asked 2021-05-29 · in Spark

Suppose I have a DataFrame like this:

date         offer        member
2020-01-01    o1           m1
2020-01-01    o2           m1
2020-01-01    o1           m2
2020-01-01    o2           m2
2020-01-02    o1           m3
2020-01-02    o2           m3
2020-01-03    o1           m4

I need to count, for each offer, how many distinct dates it appears on, and attach that count to every row:

date         offer        member    count
2020-01-01    o1           m1       3
2020-01-01    o2           m1       2
2020-01-01    o1           m2       3
2020-01-01    o2           m2       2
2020-01-02    o1           m3       3
2020-01-02    o2           m3       2
2020-01-03    o1           m4       3

Could someone show me how to do this in PySpark? I'm new to it.

x759pob2 · answer 1

Here is one way to do it (this answer is written in Scala):

import spark.implicits._  // required for toDF and the $"col" syntax

val source1DF = Seq(
    ("2020-01-01", "o1", "m1"),
    ("2020-01-01", "o2", "m1"),
    ("2020-01-01", "o1", "m2"),
    ("2020-01-01", "o2", "m2"),
    ("2020-01-02", "o1", "m3"),
    ("2020-01-02", "o2", "m3"),
    ("2020-01-03", "o1", "m4")
  ).toDF("date", "offer", "member")

  // Keep one row per distinct (date, offer) pair.
  val tmp1DF = source1DF.select($"date", $"offer").dropDuplicates()

  // Count distinct dates per offer. groupBy(...).count() already names
  // the resulting column "count", so no alias is needed.
  val tmp2DF = tmp1DF.groupBy("offer").count()

  // Join the per-offer count back onto every original row.
  val resultDF = source1DF
    .join(tmp2DF, source1DF.col("offer") === tmp2DF.col("offer"))
    .select(
      source1DF.col("date"),
      source1DF.col("offer"),
      source1DF.col("member"),
      tmp2DF.col("count")
    )

  resultDF.show(false)
  //  +----------+-----+------+-----+
  //  |date      |offer|member|count|
  //  +----------+-----+------+-----+
  //  |2020-01-01|o1   |m1    |3    |
  //  |2020-01-01|o2   |m1    |2    |
  //  |2020-01-01|o1   |m2    |3    |
  //  |2020-01-01|o2   |m2    |2    |
  //  |2020-01-02|o1   |m3    |3    |
  //  |2020-01-02|o2   |m3    |2    |
  //  |2020-01-03|o1   |m4    |3    |
  //  +----------+-----+------+-----+
