PostgreSQL: correct planner estimate for a 1:1 join (without an FK or a left join)

Posted by ltskdhd1 on 2022-11-23 in PostgreSQL

In the following example, I join two identical tables on two columns:

  create table a (id int,txt text);
  create table b (id int,txt text);
  insert into a select *,* from generate_series(1,40000);
  insert into b select *,* from generate_series(1,40000);
  analyze a;
  analyze b;
  explain analyze
  select * from a inner join b on a.id = b.id and a.txt = b.txt;

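In the explain plan you can see that it underestimates the number of rows coming out of the join by about 40,000: it expects 1 row instead of 40,000. In my real example (on which this theoretical example is based) this is a problem, because the grossly wrong row estimate leads to a bad execution plan for the larger query that contains this join: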

  Hash Join (... rows=1 ...) (actual ... rows=40000 ...)
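
The rows=1 figure is consistent with the planner treating the two join clauses as independent. The following is only a rough sketch of that arithmetic, not the planner's actual code path. It assumes that both columns are unique and have no most-common-values list, so that each equality clause gets a selectivity of roughly 1/n_distinct; the exact internals may differ between versions.

```sql
-- Inspect the per-column statistics gathered by ANALYZE.
-- A negative n_distinct is a fraction of the row count;
-- -1 means "every row is distinct".
select tablename, attname, n_distinct
from pg_stats
where tablename in ('a', 'b') and attname in ('id', 'txt')
order by tablename, attname;

-- Assumed arithmetic: each equality clause gets a selectivity of 1/40000,
-- and the two selectivities are multiplied as if they were independent.
select 40000::numeric * 40000              -- size of the cross product
       * (1.0 / 40000)                     -- selectivity of a.id = b.id
       * (1.0 / 40000) as estimated_rows;  -- selectivity of a.txt = b.txt
```

Under these assumptions the second query evaluates to 1, matching the estimate in the plan, while the true result is 40,000 rows because the two clauses are perfectly correlated.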

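So clearly the planner does not know that for each row in table a it will find exactly one row in table b. Fair enough, but what can be done about it? There are two workarounds:
(A) Left join
Using a left join, we can correct the estimate: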

  explain analyze
  select * from a LEFT join b on a.id = b.id and a.txt = b.txt;

We can now see that the estimate is correct:

  Hash Left Join (... rows=40000 ...) (actual ... rows=40000 ...)

(B) Foreign key
Using a foreign key, we can also correct the estimate:

  CREATE UNIQUE INDEX unq_b ON b USING btree (id,txt);
  alter table a add constraint fk_a foreign key (id,txt) references b (id,txt);
  explain analyze
  select * from a inner join b on a.id = b.id and a.txt = b.txt;

We can now see that the estimate is correct:

  Hash Join (... rows=40000 ...) (actual ... rows=40000 ...)

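I do not want to turn the join into a left join, because I cannot guarantee that the query results will be 100% identical to before in all edge cases. I also do not want to introduce the FK, because the program inserts into the tables in varying order and I would have to change the application.
Can you think of other ways to tell the planner about the special relationship between these two tables? Maybe a particular way of writing the query? Or some kind of statistics object? Any ideas?
TYVM!
This was tested on two versions: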

  PostgreSQL 12.9 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-12), 64-bit
  PostgreSQL 14.6, compiled by Visual C++ build 1914, 64-bit

UPDATE - example of why this misestimate becomes a problem:

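In my real example, it is a problem that Postgres thinks the join outputs only 1 row when it actually outputs 40,000. It then decides to run a nested loop over that 1 row (really 40,000 rows) with a full table scan (FTS) on a large table, which means 40,000 full table scans of a large table: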

  create table c (id int,txt text);
  insert into c select *,* from generate_series(1,40000000);
  analyze c;
  SET max_parallel_workers_per_gather = 0;
  set join_collapse_limit = 1;
  explain
  with a_b as (
    select a.id a_id,b.id b_id,a.txt a_txt,b.txt b_txt
    from a inner join b
    on a.id = b.id and a.txt = b.txt
  )
  select * from a_b inner join c
  on a_b.a_id = c.id and a_b.b_txt = c.txt and a_b.b_id = c.id and a_b.a_txt = c.id::text;

That is, 40,000 full table scans of table c:

  QUERY PLAN
  -----------------------------------------------------------------------
  Nested Loop  (cost=1216.00..921352.51 rows=1 width=30)
    Join Filter: ((a.id = c.id) AND (a.txt = c.txt))
    ->  Hash Join  (cost=1216.00..2132.01 rows=1 width=18)
          Hash Cond: ((a.id = b.id) AND (a.txt = b.txt))
          ->  Seq Scan on a  (cost=0.00..616.00 rows=40000 width=9)
          ->  Hash  (cost=616.00..616.00 rows=40000 width=9)
                ->  Seq Scan on b  (cost=0.00..616.00 rows=40000 width=9)
    ->  Seq Scan on c  (cost=0.00..916220.48 rows=200001 width=12)
          Filter: (txt = (id)::text)

Interestingly, the left join trick does not even work here; only the FK corrects the estimate and therefore the plan:

  /* left join trick not working */
  explain
  with a_b as (
    select a.id a_id,b.id b_id,a.txt a_txt,b.txt b_txt
    from a LEFT join b
    on a.id = b.id and a.txt = b.txt
  )
  select * from a_b inner join c
  on a_b.a_id = c.id and a_b.b_txt = c.txt and a_b.b_id = c.id and a_b.a_txt = c.id::text;
  /* QUERY PLAN
  -----------------------------------------------------------------------
  Nested Loop  (cost=1216.00..921352.51 rows=1 width=30)
    Join Filter: ((a.id = c.id) AND (a.txt = c.txt))
    ->  Hash Join  (cost=1216.00..2132.01 rows=1 width=18)
          Hash Cond: ((a.id = b.id) AND (a.txt = b.txt))
          ->  Seq Scan on a  (cost=0.00..616.00 rows=40000 width=9)
          ->  Hash  (cost=616.00..616.00 rows=40000 width=9)
                ->  Seq Scan on b  (cost=0.00..616.00 rows=40000 width=9)
    ->  Seq Scan on c  (cost=0.00..916220.48 rows=200001 width=12)
          Filter: (txt = (id)::text) */
  /* with the FK the plan is correct */
  CREATE UNIQUE INDEX unq_b ON b USING btree (id,txt);
  alter table a add constraint fk_a foreign key (id,txt) references b (id,txt);
  explain
  with a_b as (
    select a.id a_id,b.id b_id,a.txt a_txt,b.txt b_txt
    from a join b
    on a.id = b.id and a.txt = b.txt
  )
  select * from a_b inner join c
  on a_b.a_id = c.id and a_b.b_txt = c.txt and a_b.b_id = c.id and a_b.a_txt = c.id::text;
  /* QUERY PLAN
  -----------------------------------------------------------------------------
  Hash Join  (cost=2642.00..920362.50 rows=1 width=30)
    Hash Cond: ((c.id = a.id) AND (c.txt = a.txt))
    ->  Seq Scan on c  (cost=0.00..916220.48 rows=200001 width=12)
          Filter: (txt = (id)::text)
    ->  Hash  (cost=2042.00..2042.00 rows=40000 width=18)
          ->  Hash Join  (cost=1216.00..2042.00 rows=40000 width=18)
                Hash Cond: ((a.id = b.id) AND (a.txt = b.txt))
                ->  Seq Scan on a  (cost=0.00..616.00 rows=40000 width=9)
                ->  Hash  (cost=616.00..616.00 rows=40000 width=9)
                      ->  Seq Scan on b  (cost=0.00..616.00 rows=40000 width=9) */

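Screenshot of the execution plan of the real example on which this one is based (the green arrow marks the problem). Note that the real example hits the 1:1 problem twice in a row (two FKs would solve it):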

Answer #1 (ddrv8njm)

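There are no cross-table statistics in PostgreSQL, so you cannot fix the bad estimate. If this is part of a larger query and the bad estimate causes problems, you can split the query into two parts: first compute the subquery that gets the bad estimate and use it to populate a temporary table, then ANALYZE that temporary table so that its estimates are correct, and then use the temporary table in the rest of the query.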
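
A minimal sketch of that approach, using the tables from the question. The temporary table name a_b_tmp is made up for illustration, and the resulting plan will depend on your data and settings:

```sql
-- Step 1: materialize the misestimated join into a temporary table.
-- (a_b_tmp is a hypothetical name chosen for this sketch.)
create temporary table a_b_tmp as
select a.id as a_id, b.id as b_id, a.txt as a_txt, b.txt as b_txt
from a
inner join b on a.id = b.id and a.txt = b.txt;

-- Step 2: collect statistics so the planner sees the real row count
-- of the intermediate result instead of guessing it.
analyze a_b_tmp;

-- Step 3: run the rest of the query against the temporary table.
explain
select *
from a_b_tmp
inner join c
  on a_b_tmp.a_id = c.id and a_b_tmp.b_txt = c.txt
 and a_b_tmp.b_id = c.id and a_b_tmp.a_txt = c.id::text;
```

The explicit ANALYZE matters because autovacuum does not process temporary tables, so without it the planner would have no statistics for a_b_tmp.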
