假设我们有一张候选人表现表
CREATE TABLE IF NOT EXISTS candidates AS
WITH RECURSIVE candidates(team, score) AS (
SELECT RANDOM() % 1000, RANDOM() % 1000000
UNION
SELECT RANDOM() % 1000, RANDOM() % 1000000
FROM candidates
LIMIT 1000000
)
SELECT team, score
FROM candidates;
我们的目标是输出1000个团队的列表以及该团队中候选人的总分。不过,如果一支球队的总比分不在上半场,那么它将被一个零分所取代。我想出了两种方法:
与 EXISTS
,花了 Run Time: real 30.653 user 30.635649 sys 0.008798
```
WITH top_teams_verbose(top_team, total_score) AS (
SELECT team, SUM(score)
FROM candidates
GROUP BY team
ORDER BY 2 DESC
LIMIT 500
)
SELECT team, SUM(score) * EXISTS(SELECT 1 FROM top_teams_verbose WHERE team = top_team)
FROM candidates
GROUP BY team;
查询计划
QUERY PLAN
|--SCAN TABLE candidates
|--USE TEMP B-TREE FOR GROUP BY--CORRELATED SCALAR SUBQUERY 2 |--CO-ROUTINE 1 | |--SCAN TABLE candidates | |--USE TEMP B-TREE FOR GROUP BY |
--USE TEMP B-TREE FOR ORDER BY
`--SCAN SUBQUERY 1
与 `IN` ,花了 `Run Time: real 0.045 user 0.041872 sys 0.002999` ```
WITH top_teams_verbose(top_team, total_score) AS (
SELECT team, SUM(score)
FROM candidates
GROUP BY team
ORDER BY 2 DESC
LIMIT 500
),
top_teams AS (
SELECT top_team
FROM top_teams_verbose
)
SELECT team, SUM(score) * (team IN top_teams)
FROM candidates
GROUP BY team;
查询计划
QUERY PLAN
|--SCAN TABLE candidates
|--USE TEMP B-TREE FOR GROUP BY
`--LIST SUBQUERY 3
|--CO-ROUTINE 1
| |--SCAN TABLE candidates
| |--USE TEMP B-TREE FOR GROUP BY
| `--USE TEMP B-TREE FOR ORDER BY
`--SCAN SUBQUERY 1
为什么会这样?也许 吧 EXISTS
为每行执行,而 IN
是否用作聚合函数?我看了一下查询计划,唯一的区别( CORRELATED SCALAR SUBQUERY
与 LIST SUBQUERY
)太抽象了,无法提供信息。
我使用的是sqlite3版本 3.31.1 2020-01-27 19:55:54 3bfa9cc97da10598521b342961df8f5f68c7388fa117345eeb516eaa837bb4d6
在rhel 7上。
1条答案
按热度按时间x6h2sr281#
我没有时间对查询计划进行深入分析,但事实证明
EXISTS
使公共表表达式为每行重新运行。根据维基百科,这被称为相关查询:在sql数据库查询中,相关子查询(也称为同步子查询)是使用外部查询中的值的子查询(嵌套在另一个查询中的查询)。因为对于外部查询处理的每一行,子查询可能只计算一次,所以速度可能会很慢。
具体来说,考虑下面的例子,从总共10个团队中随机选择5个“幸运团队”
有人可能希望总是有5支球队得分为正,其余球队得分为零,但有时你会得到4到7分的正分数。这是因为每个团队都有50%的独立机会出现在比赛中
lucky_teams
表,以防止他们的分数被归零。