How do I filter a DataFrame down to the rows whose salary is above the average salary? Something like df.select('name').filter((df['salary'])>(avg['salary'])). What exact command should I use? TIA
1 Answer
waxmsbnn1#
Try storing the average value in a variable and using it in a filter clause. Example:
```python
from pyspark.sql import Window
from pyspark.sql.functions import avg, col, monotonically_increasing_id

df.show()
# +------+----+
# |salary|name|
# +------+----+
# |     1|   a|
# |     2|   b|
# |     3|   c|
# +------+----+

# Collect the average into a Python variable. Do not call it `avg`,
# or it will shadow the avg() function needed later on.
avg_salary = df.select(avg('salary').cast("int")).collect()[0][0]

df.filter(df['salary'] > avg_salary).show()
# +------+----+
# |salary|name|
# +------+----+
# |     3|   c|
# +------+----+

df.select("name").filter(df['salary'] > avg_salary).show()
# +----+
# |name|
# +----+
# |   c|
# +----+
```

Using a window average function:

```python
# One window frame spanning every row, so each row sees the overall average
w = Window.orderBy(monotonically_increasing_id()).rowsBetween(
    Window.unboundedPreceding, Window.unboundedFollowing)

df.withColumn("avg_salary", avg(col("salary")).over(w).cast("int")) \
    .filter(col("salary") > col("avg_salary")) \
    .select("name") \
    .show()
# +----+
# |name|
# +----+
# |   c|
# +----+
```

Using a Spark SQL subquery:

```python
# `spark` is the active SparkSession (predefined in the pyspark shell)
df.createOrReplaceTempView("tmp")
spark.sql("select * from tmp where salary > (select avg(salary) from tmp)").show()
# +------+----+
# |salary|name|
# +------+----+
# |     3|   c|
# +------+----+
```
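
One caveat about the `cast("int")` applied to the average in the first two approaches: the cast truncates the average. That is harmless for integer salaries like the sample data, but with fractional salaries it could let through rows that sit between the truncated average and the true average. The sketch below only illustrates that arithmetic in plain Python, without Spark; the function names and the fractional sample values are hypothetical and not part of the original answer.

```python
# Plain-Python illustration (no Spark needed); names and data are hypothetical.

def above_mean_exact(salaries):
    """Keep values strictly greater than the true mean."""
    mean = sum(salaries) / len(salaries)
    return [s for s in salaries if s > mean]


def above_mean_truncated(salaries):
    """Keep values strictly greater than the mean truncated to an int,
    mimicking what an integer cast does to the average."""
    mean = int(sum(salaries) / len(salaries))  # int() truncates toward zero
    return [s for s in salaries if s > mean]


if __name__ == "__main__":
    integer_salaries = [1, 2, 3]            # same values as the sample DataFrame
    fractional_salaries = [1.0, 2.2, 4.0]   # true mean is about 2.4

    print(above_mean_exact(integer_salaries))          # [3]
    print(above_mean_truncated(integer_salaries))      # [3]
    print(above_mean_exact(fractional_salaries))       # [4.0]
    print(above_mean_truncated(fractional_salaries))   # [2.2, 4.0]
```

If the salary column can hold fractional values, dropping the cast so that the comparison uses the exact average would likely be the safer choice.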