select子句中的Spark计数(列名)

z8dt9xmd  于 2021-05-27  发布在  Spark
关注(0)|答案(1)|浏览(338)

如何使用spark编写以下sql查询 DataFrame API ?

val dt = spark.sql("select Subject, (count(Student)*100 / (select count(DISTINCT Student) from StudentTable)) as PercentPassed from StudentTable where Marks > 35 GROUP BY Subject").show(false)

在这里我想让学生通过的百分比(100%),在每个科目。
下面是注册为的Dataframe StudentTable :

val data = Seq(
    ("S1","English",45),
    ("S1","Hindi",78),
    ("S1","Maths",67),
    ("S1","Science",23),
    ("S2","English",34),
    ("S2","Hindi",45),
    ("S2","Maths",88),
    ("S2","Science",65),
    ("S3","English",21),
    ("S3","Hindi",66),
    ("S3","Maths",63),
    ("S3","Science",44)
  )
val df = spark.sparkContext.parallelize(data).toDF("Student","Subject","Marks")
df.createTempView("StudentTable")
hkmswyz6

hkmswyz61#

你可以使用 groupBy 然后 aggregate 有条件的 count ```
df.groupBy($"Subject")
.agg(
count(when($"Marks" > 35, 1)) /
count($"Marks") * 100 as "PercentPassed")

说明:
第一组按主题分组。
然后在聚合中生成两个计数:
第一个按标记值过滤(已通过)
第二个每行计数(总计)
然后将第一个除以第二个,再乘以100得到百分比
结果

+-------+-----------------+
|Subject| PercentPassed|
+-------+-----------------+
|Science|66.66666666666666|
|English|33.33333333333333|
| Hindi| 100.0|
| Maths| 100.0|
+-------+-----------------+

相关问题