Scala: Can the result of a SELECT expression be reused in the "GROUP BY" clause?

Posted by new9mtju on 2022-11-09 in Scala

In MySQL, I can write a query like this:

select  
    cast(from_unixtime(t.time, '%Y-%m-%d %H:00') as datetime) as timeHour
    , ... 
from
    some_table t 
group by
    timeHour, ...
order by
    timeHour, ...

Here, timeHour in the GROUP BY clause is the result of a SELECT expression.
But when I tried a similar query in Spark SQL, I got this error:

Error: org.apache.spark.sql.AnalysisException: 
cannot resolve '`timeHour`' given input columns: ...

My Spark SQL query is as follows:

select  
      cast(t.unixTime as timestamp) as timeHour
    , ...
from
    another_table as t
group by
    timeHour, ...
order by
    timeHour, ...

Is this construct possible in Spark SQL?

Source: https://stackoverflow.com/questions/46395333/reuse-the-result-of-a-select-expression-in-the-group-by-clause

2 Answers

#1 x9ybnkn6

Is this construct possible in Spark SQL?

Yes, it is. There are two ways to make a newly derived column usable in the GROUP BY and ORDER BY clauses in Spark SQL.
Approach 1, using a subquery:

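-- The inner query only derives timeHour; the aggregate runs in the outer
-- query, where timeHour is an ordinary column that GROUP BY can resolve.
SELECT timeHour, sum(...) AS someThing
FROM (
    SELECT
          from_unixtime(starttime/1000) AS timeHour
        , starttime
        , ...   -- whatever columns sum(...) needs
    FROM
        some_table
) AS sub
WHERE
    starttime >= 1000*unix_timestamp('2017-09-16 00:00:00')
      AND starttime <= 1000*unix_timestamp('2017-09-16 04:00:00')
GROUP BY
    timeHour
ORDER BY
    timeHour
LIMIT 10;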

Approach 2, using WITH (the more elegant way):

-- Define a named subquery that only derives timeHour
WITH table_alias AS (
    SELECT
          from_unixtime(starttime/1000) AS timeHour
        , starttime
        , ...   -- whatever columns sum(...) needs
    FROM
        some_table
)

-- Query the named subquery like a table and aggregate here
SELECT timeHour, sum(...) AS someThing
FROM table_alias
WHERE
    starttime >= 1000*unix_timestamp('2017-09-16 00:00:00')
      AND starttime <= 1000*unix_timestamp('2017-09-16 04:00:00')
GROUP BY
    timeHour
ORDER BY
    timeHour
LIMIT 10;

Alternative in Scala, using the Spark DataFrame API (without SQL):

// Assumes a SparkSession named `spark` is in scope (as in spark-shell).
import org.apache.spark.sql.functions._
import spark.implicits._ // enables the $"column" syntax

val df = .... // load the actual table as df

// Add the derived column first, then group and order by it.
df.withColumn("timeHour", from_unixtime($"starttime"/1000))
  .groupBy($"timeHour")
  .agg(sum("...").as("someThing"))
  .orderBy($"timeHour")
  .show()

// Another way, as per eliasah's comment: alias the expression inside groupBy.
df.groupBy(from_unixtime($"starttime"/1000).as("timeHour"))
  .agg(sum("...").as("someThing"))
  .orderBy($"timeHour")
  .show()

#2 xdnvmnnf

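I'll try to answer my own question here.
As far as I can tell, we have to rewrite the query and repeat the SELECT expression in the GROUP BY clause. For example: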

select  
      from_unixtime((t.starttime/1000)) as timeHour
    , sum(...)                          as someThing
from
    some_table as t
where
    t.starttime>=1000*unix_timestamp('2017-09-16 00:00:00')
      and t.starttime<=1000*unix_timestamp('2017-09-16 04:00:00')
group by
    from_unixtime((t.starttime/1000))
order by
    from_unixtime((t.starttime/1000))
limit 10;
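
To avoid typing the expression twice, one option is to build the SQL string in Scala and splice the same expression into both clauses. The following is only a minimal sketch under stated assumptions: the sample rows, the `amount` column (standing in for whatever the elided sum(...) aggregates), the object name, and the 'yyyy-MM-dd HH:00:00' format string (added so that rows bucket by hour, as in the MySQL query) are all hypothetical and not part of the original question. It also assumes the spark-sql library is on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object GroupByExpressionSketch {
  // The hour-bucket expression, written once and reused in SELECT and GROUP BY.
  // The format string is an assumption; without it from_unixtime keeps seconds.
  private val hourExpr =
    "from_unixtime(CAST(starttime / 1000 AS BIGINT), 'yyyy-MM-dd HH:00:00')"

  def hourlySums(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Hypothetical sample data: starttime in epoch milliseconds, amount to sum.
    // The first two rows are one minute apart; the third is one hour later.
    Seq(
      (1505520000000L, 1L),
      (1505520060000L, 2L),
      (1505523600000L, 4L)
    ).toDF("starttime", "amount").createOrReplaceTempView("some_table")

    // GROUP BY repeats the full expression instead of the alias.
    spark.sql(
      s"""SELECT $hourExpr AS timeHour, sum(amount) AS someThing
         |FROM some_table
         |GROUP BY $hourExpr
         |ORDER BY timeHour""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("group-by-expression-sketch")
      .getOrCreate()
    hourlySums(spark).show(false)
    spark.stop()
  }
}
```

Later Spark releases also document a `spark.sql.groupByAliases` setting that lets GROUP BY refer to select-list aliases directly; whether it is available and enabled depends on the Spark version in use, so it is worth checking before relying on it.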
