spark scala窗口计数最大值

vnzz0bqm  于 2021-07-12  发布在  Spark
关注(0)|答案(2)|浏览(275)

我有以下资料:-
结果状态包括名称WINXYZCLUB1WINXYZCLUB2WINXYZCLUB1WINPQRCLUB3
我需要州明智的最大赢家俱乐部名称

val byState =Window.partitionBy("state").orderBy('state)

我试图创建一个窗口,但没有帮助。。
预期结果:-
有些在sql中是这样的

select temp.res
(select count(result) as res
from table
group by clubName) temp
group by state

例如
winsclubnamexyz2club1的状态最大\u计数\u

baubqpgj

baubqpgj1#

您可以获得每个俱乐部的赢数,然后为按赢数排序的每个俱乐部分配排名,并筛选排名为1的行。

import org.apache.spark.sql.expressions.Window

val df2 = df.withColumn(
    "wins", 
    count(when(col("result") === "win", 1))
    .over(Window.partitionBy("state","clubName"))
).withColumn(
    "rn", 
    row_number().over(Window.partitionBy("state").orderBy(desc("wins")))
).filter("rn = 1").selectExpr("state", "wins as max_count_of_wins", "clubName")

df2.show
+-----+-----------------+--------+
|state|max_count_of_wins|clubName|
+-----+-----------------+--------+
|  PQR|                1|   club3|
|  XYZ|                2|   club1|
+-----+-----------------+--------+
hsgswve4

hsgswve42#

您还可以将sql方言与sparksql结合使用(在此处查找文档):

df.sql("""
SELECT tt.name, tt.state, MAX(tt.nWins) as max_count_of_wins
FROM (
  SELECT t1.clubName as name, t1.state as state, COUNT(1) as nWins
  FROM Table1 t1
  WHERE t1.result = 'win'
  GROUP BY state, name
  ) as tt
GROUP BY tt.state;
""")

其中dataframe中的表将被命名为 Table1 以及你的Dataframe df .
p、 如果你想自己尝试,使用初始化

CREATE TABLE Table1
    (`result` varchar(3), `state` varchar(3), `clubName` varchar(5))
;

INSERT INTO Table1
    (`result`, `state`, `clubName`)
VALUES
    ('win', 'XYZ', 'club1'),
    ('win', 'XYZ', 'club2'),
    ('win', 'XYZ', 'club1'),
    ('win', 'PQR', 'club3')
;

在http://sqlfiddle.com.

相关问题