How do I select rows for IDs that satisfy a condition on one specific variable?

yshpjwxd  posted on 2021-05-27  in Spark

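Sorry for the confusing title; I couldn't think of a clearer way to phrase it.
This is what I want to achieve with PySpark SQL:
When the value of variable "Z" is between 2000 and 3000, return the rows for that particular ID, but only the data for variables Y and Z.
I don't know how to get beyond the query below. How do I tell SQL to pick the appropriate IDs (2 and 3) in the attached df?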

SELECT ID, Variable, Date, Value 
 FROM TABLE 
 WHERE (Variable == 'Y' OR Variable == 'Z') AND "if Value of Z between 2000 and 3000 then select only these IDs"

yyyllmsg  #1

In Spark, I would suggest a window function:

select t.*
from (select t.*,
             -- per id, count the Z rows whose value lies in [2000, 3000]
             sum(case when variable = 'Z' and value between 2000 and 3000 then 1 else 0 end) over (partition by id) as z_value_cnt
      from t
     ) t
where variable in ('Y', 'Z') and z_value_cnt >= 1;

I would expect this to have a better execution plan.

erhoui1w  #2

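from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 'X', '01/01/20', 5),
                            (1, 'Y', '01/01/20', 10),
                            (1, 'Z', '01/01/20', 1000),
                            (2, 'X', '01/02/20', 20),
                            (2, 'Y', '01/02/20', 30),
                            (2, 'Z', '01/02/20', 2500)], ['ID', 'VAR', 'DATE', 'VAL'])
# IDs whose Z value lies between 2000 and 3000 (distinct, so the join adds no duplicate rows)
ids = df.filter("VAR = 'Z' AND VAL BETWEEN 2000 AND 3000").select('ID').distinct()
# Join back to the full data and keep only the Y and Z rows of those IDs
ids.join(df, ['ID']).filter("VAR IN ('Y', 'Z')").show()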

fruv7luv  #3

Try a simple subquery:

SELECT ID, Variable, Date, Value 
 FROM TABLE 
 WHERE (Variable == 'Y' OR Variable == 'Z') AND ID in (select id from TABLE where variable = 'Z' and value between 2000 and 3000)
