Filter data based on the average of a column in a PySpark DataFrame?

rsaldnfx  posted on 2021-05-27 in Spark

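How can I filter a DataFrame to keep only the rows whose salary is higher than the average salary?
Something like this:
df.select('name').filter((df['salary'])>(avg['salary']))
What is the exact command to use? TIA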


waxmsbnn #1

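Try storing the average in a variable and then using it in a `filter` clause. Example:

```
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import avg, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["salary", "name"])
df.show()
# +------+----+
# |salary|name|
# +------+----+
# |     1|   a|
# |     2|   b|
# |     3|   c|
# +------+----+

# Collect the average to the driver as a plain Python number;
# a separate name avoids shadowing the avg() function
avg_salary = df.select(avg("salary")).collect()[0][0]
df.filter(df["salary"] > avg_salary).show()
# +------+----+
# |salary|name|
# +------+----+
# |     3|   c|
# +------+----+

# Filter first, then project, so the salary column is still available
df.filter(df["salary"] > avg_salary).select("name").show()
# +----+
# |name|
# +----+
# |   c|
# +----+
```

Using a window average function:

```
# An unbounded frame with no partitioning covers every row, so avg() is the
# global average. Note that Spark moves all rows to a single partition for this.
w = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
(df.withColumn("avg_salary", avg(col("salary")).over(w))
   .filter(col("salary") > col("avg_salary"))
   .select("name")
   .show())
# +----+
# |name|
# +----+
# |   c|
# +----+
```

Using a Spark SQL subquery:

```
df.createOrReplaceTempView("tmp")
spark.sql("select * from tmp where salary > (select avg(salary) from tmp)").show()
# +------+----+
# |salary|name|
# +------+----+
# |     3|   c|
# +------+----+
```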
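
For checking the expected result on a small sample without a Spark session, the same two-step logic (compute the mean, then keep the rows strictly above it) can be sketched in plain Python. This is only an illustration under assumptions: the helper name `names_above_average` and the fractional sample rows are hypothetical and not part of the answer. The second call uses fractional salaries to show why the comparison should use the exact mean, since truncating it to an integer first would let below-average rows through.

```python
from statistics import mean


def names_above_average(rows):
    """Return the names whose salary is strictly greater than the mean salary.

    `rows` is a list of (salary, name) tuples. Hypothetical helper, for illustration only.
    """
    if not rows:
        return []
    avg_salary = mean(salary for salary, _ in rows)
    return [name for salary, name in rows if salary > avg_salary]


# Same data as the DataFrame in the question: the mean is 2, so only "c" remains
print(names_above_average([(1, "a"), (2, "b"), (3, "c")]))

# Fractional salaries (assumed sample data): the mean is 2.75, so only "z" remains
fractional = [(2.25, "x"), (2.5, "y"), (3.5, "z")]
print(names_above_average(fractional))

# Comparing against the truncated mean (2) would wrongly keep every row
truncated = int(mean(salary for salary, _ in fractional))
print([name for salary, name in fractional if salary > truncated])
```

If the real data has non-integer salaries, the filter results from Spark should match the exact-mean output here, not the truncated one.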
