如何优化这个like join查询?

bxpogfeg  于 2021-07-26  发布在  Java
关注(0)|答案(4)|浏览(485)

此查询查找域的后缀:

SELECT
        DISTINCT ON ("companyDomain".id)
            "companyDomain".domain,
            "publicSuffix".suffix
        FROM
            "companyDomain"
        INNER JOIN
            "publicSuffix"
        ON
            REVERSE("companyDomain".domain) LIKE REVERSE("publicSuffix".suffix) || '%'
        ORDER BY
            "companyDomain".id, LENGTH("publicSuffix".suffix) DESC

编辑:注意这也适用于子域。
你可以在这里摆弄这个例子,并用pev可视化这个计划。我尝试过向表中添加覆盖索引,但查询规划器最终没有使用它们。也许还有另一个查询更有效?

n1bvdmb6

n1bvdmb61#

你考虑过使用 gin 索引?
我对您的示例dml进行了以下修改:

CREATE EXTENSION IF NOT EXISTS pg_trgm;
...
CREATE INDEX companyDomain_domain_reverse ON "companyDomain" USING gin (REVERSE(domain) gin_trgm_ops);
...
CREATE INDEX publicSuffix_suffix_reverse ON "publicSuffix" USING gin (REVERSE(suffix) gin_trgm_ops);

下面是查询计划:

+--------------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                                              |
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
|Unique  (cost=40802.07..41004.44 rows=908 width=31) (actual time=98.229..98.356 rows=908 loops=1)                                                       |
|  ->  Sort  (cost=40802.07..40903.26 rows=40474 width=31) (actual time=98.228..98.264 rows=1006 loops=1)                                                |
|        Sort Key: "companyDomain".id, (length(("publicSuffix".suffix)::text)) DESC                                                                      |
|        Sort Method: quicksort  Memory: 103kB                                                                                                           |
|        ->  Nested Loop  (cost=0.05..37704.86 rows=40474 width=31) (actual time=1.655..97.976 rows=1006 loops=1)                                        |
|              ->  Seq Scan on "publicSuffix"  (cost=0.00..151.15 rows=8915 width=12) (actual time=0.011..0.728 rows=8915 loops=1)                       |
|              ->  Bitmap Heap Scan on "companyDomain"  (cost=0.05..4.15 rows=5 width=15) (actual time=0.010..0.010 rows=0 loops=8915)                   |
|                    Recheck Cond: (reverse((domain)::text) ~~ (reverse(("publicSuffix".suffix)::text) || '%'::text))                                    |
|                    Rows Removed by Index Recheck: 0                                                                                                    |
|                    Heap Blocks: exact=301                                                                                                              |
|                    ->  Bitmap Index Scan on companydomain_domain_reverse  (cost=0.00..0.05 rows=5 width=0) (actual time=0.010..0.010 rows=0 loops=8915)|
|                          Index Cond: (reverse((domain)::text) ~~ (reverse(("publicSuffix".suffix)::text) || '%'::text))                                |
|Planning Time: 0.150 ms                                                                                                                                 |
|Execution Time: 98.439 ms                                                                                                                               |
+--------------------------------------------------------------------------------------------------------------------------------------------------------+

作为奖励-你甚至不需要 REVERSE() 索引和查询中的文本:

create index companydomain_domain
    on "companyDomain" using gin(domain gin_trgm_ops);

SELECT DISTINCT ON ("companyDomain".id) "companyDomain".domain, "publicSuffix".suffix
FROM "companyDomain"
         INNER JOIN "publicSuffix" ON "companyDomain".domain LIKE '%' || "publicSuffix".suffix
ORDER BY "companyDomain".id, LENGTH("publicSuffix".suffix) DESC

查询所用时间相同,但仍使用gin索引:

+------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                                      |
+------------------------------------------------------------------------------------------------------------------------------------------------+
|Unique  (cost=40556.91..40759.28 rows=908 width=31) (actual time=96.170..96.315 rows=908 loops=1)                                               |
|  ->  Sort  (cost=40556.91..40658.10 rows=40474 width=31) (actual time=96.169..96.209 rows=1006 loops=1)                                        |
|        Sort Key: "companyDomain".id, (length(("publicSuffix".suffix)::text)) DESC                                                              |
|        Sort Method: quicksort  Memory: 103kB                                                                                                   |
|        ->  Nested Loop  (cost=0.05..37459.70 rows=40474 width=31) (actual time=1.764..95.919 rows=1006 loops=1)                                |
|              ->  Seq Scan on "publicSuffix"  (cost=0.00..151.15 rows=8915 width=12) (actual time=0.009..0.711 rows=8915 loops=1)               |
|              ->  Bitmap Heap Scan on "companyDomain"  (cost=0.05..4.12 rows=5 width=15) (actual time=0.010..0.010 rows=0 loops=8915)           |
|                    Recheck Cond: ((domain)::text ~~ ('%'::text || ("publicSuffix".suffix)::text))                                              |
|                    Rows Removed by Index Recheck: 0                                                                                            |
|                    Heap Blocks: exact=301                                                                                                      |
|                    ->  Bitmap Index Scan on companydomain_domain  (cost=0.00..0.05 rows=5 width=0) (actual time=0.010..0.010 rows=0 loops=8915)|
|                          Index Cond: ((domain)::text ~~ ('%'::text || ("publicSuffix".suffix)::text))                                          |
|Planning Time: 0.132 ms                                                                                                                         |
|Execution Time: 96.393 ms                                                                                                                       |
+------------------------------------------------------------------------------------------------------------------------------------------------+

ps:我猜你只需要一个索引-在这种情况下: companyDomain_domain_reverse

tpxzln5u

tpxzln5u2#

索引对数据结构/查询没有任何好处。试想一下索引在这里是如何使用的。我运气不好。
我的建议是将域/后缀转换为数组,如

alter table "companyDomain" add column adomain text[];
update "companyDomain" set adomain = string_to_array(domain, '.');
create index idx_adom on "companyDomain" using gin (adomain array_ops);

alter table "publicSuffix" add column asuffix text[];
update "publicSuffix" set asuffix = string_to_array(ltrim(suffix, '.'), '.');
create index idx_asuffix on "publicSuffix" using gin (asuffix array_ops);

让我们比较这些查询:

ostgres=# explain (analyze, verbose, buffers)
SELECT  DISTINCT ON ("companyDomain".id)
    "companyDomain".domain,
    "publicSuffix".suffix
FROM
    "companyDomain"
        INNER JOIN "publicSuffix" ON REVERSE("companyDomain".domain) LIKE REVERSE("publicSuffix".suffix) || '%'
ORDER BY "companyDomain".id, LENGTH("publicSuffix".suffix) DESC;
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                   QUERY PLAN                                                                   │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Unique  (cost=185738.35..185940.72 rows=908 width=31) (actual time=2364.720..2364.890 rows=908 loops=1)                                        │
│   Output: "companyDomain".domain, "publicSuffix".suffix, "companyDomain".id, (length(("publicSuffix".suffix)::text))                           │
│   Buffers: shared hit=306                                                                                                                      │
│   ->  Sort  (cost=185738.35..185839.53 rows=40474 width=31) (actual time=2364.719..2364.764 rows=1006 loops=1)                                 │
│         Output: "companyDomain".domain, "publicSuffix".suffix, "companyDomain".id, (length(("publicSuffix".suffix)::text))                     │
│         Sort Key: "companyDomain".id, (length(("publicSuffix".suffix)::text)) DESC                                                             │
│         Sort Method: quicksort  Memory: 103kB                                                                                                  │
│         Buffers: shared hit=306                                                                                                                │
│         ->  Nested Loop  (cost=0.00..182641.13 rows=40474 width=31) (actual time=22.735..2364.484 rows=1006 loops=1)                           │
│               Output: "companyDomain".domain, "publicSuffix".suffix, "companyDomain".id, length(("publicSuffix".suffix)::text)                 │
│               Join Filter: (reverse(("companyDomain".domain)::text) ~~ (reverse(("publicSuffix".suffix)::text) || '%'::text))                  │
│               Rows Removed by Join Filter: 8093814                                                                                             │
│               Buffers: shared hit=306                                                                                                          │
│               ->  Seq Scan on public."publicSuffix"  (cost=0.00..377.15 rows=8915 width=12) (actual time=0.081..0.794 rows=8915 loops=1)       │
│                     Output: "publicSuffix".id, "publicSuffix".suffix, "publicSuffix".created_at, "publicSuffix".asuffix                        │
│                     Buffers: shared hit=288                                                                                                    │
│               ->  Materialize  (cost=0.00..31.62 rows=908 width=15) (actual time=0.001..0.036 rows=908 loops=8915)                             │
│                     Output: "companyDomain".domain, "companyDomain".id                                                                         │
│                     Buffers: shared hit=18                                                                                                     │
│                     ->  Seq Scan on public."companyDomain"  (cost=0.00..27.08 rows=908 width=15) (actual time=11.576..11.799 rows=908 loops=1) │
│                           Output: "companyDomain".domain, "companyDomain".id                                                                   │
│                           Buffers: shared hit=18                                                                                               │
│ Planning Time: 0.167 ms                                                                                                                        │
│ JIT:                                                                                                                                           │
│   Functions: 9                                                                                                                                 │
│   Options: Inlining false, Optimization false, Expressions true, Deforming true                                                                │
│   Timing: Generation 1.956 ms, Inlining 0.000 ms, Optimization 0.507 ms, Emission 10.878 ms, Total 13.341 ms                                   │
│ Execution Time: 2366.971 ms                                                                                                                    │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

据我所知,这里的瓶颈, Rows Removed by Join Filter: 8093814 似乎postgresql构建了表的笛卡尔连接,然后使用 ON 条件:

select count(*) from "companyDomain", "publicSuffix";
---
8094820

要解决此问题,请尝试使用数组运算符:

postgres=# explain (analyze, verbose, buffers)
SELECT  DISTINCT ON ("companyDomain".id)
    "companyDomain".domain,
    "publicSuffix".suffix
FROM
    "companyDomain"
        INNER JOIN "publicSuffix" ON adomain @> asuffix
ORDER BY "companyDomain".id, LENGTH("publicSuffix".suffix) DESC;
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                 QUERY PLAN                                                                  │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Unique  (cost=8310.60..8512.97 rows=908 width=31) (actual time=180.149..180.335 rows=908 loops=1)                                           │
│   Output: "companyDomain".domain, "publicSuffix".suffix, "companyDomain".id, (length(("publicSuffix".suffix)::text))                        │
│   Buffers: shared hit=48986                                                                                                                 │
│   ->  Sort  (cost=8310.60..8411.78 rows=40474 width=31) (actual time=180.148..180.200 rows=1239 loops=1)                                    │
│         Output: "companyDomain".domain, "publicSuffix".suffix, "companyDomain".id, (length(("publicSuffix".suffix)::text))                  │
│         Sort Key: "companyDomain".id, (length(("publicSuffix".suffix)::text)) DESC                                                          │
│         Sort Method: quicksort  Memory: 145kB                                                                                               │
│         Buffers: shared hit=48986                                                                                                           │
│         ->  Nested Loop  (cost=0.59..5213.39 rows=40474 width=31) (actual time=0.190..179.693 rows=1239 loops=1)                            │
│               Output: "companyDomain".domain, "publicSuffix".suffix, "companyDomain".id, length(("publicSuffix".suffix)::text)              │
│               Buffers: shared hit=48986                                                                                                     │
│               ->  Seq Scan on public."companyDomain"  (cost=0.00..27.08 rows=908 width=57) (actual time=0.015..0.098 rows=908 loops=1)      │
│                     Output: "companyDomain".id, "companyDomain".domain, "companyDomain".created_at, "companyDomain".adomain                 │
│                     Buffers: shared hit=18                                                                                                  │
│               ->  Bitmap Heap Scan on public."publicSuffix"  (cost=0.59..5.15 rows=45 width=54) (actual time=0.052..0.197 rows=1 loops=908) │
│                     Output: "publicSuffix".id, "publicSuffix".suffix, "publicSuffix".created_at, "publicSuffix".asuffix                     │
│                     Recheck Cond: ("companyDomain".adomain @> "publicSuffix".asuffix)                                                       │
│                     Rows Removed by Index Recheck: 572                                                                                      │
│                     Heap Blocks: exact=41510                                                                                                │
│                     Buffers: shared hit=48968                                                                                               │
│                     ->  Bitmap Index Scan on idx_asuffix  (cost=0.00..0.58 rows=45 width=0) (actual time=0.039..0.039 rows=573 loops=908)   │
│                           Index Cond: ("publicSuffix".asuffix <@ "companyDomain".adomain)                                                   │
│                           Buffers: shared hit=7458                                                                                          │
│ Planning Time: 0.189 ms                                                                                                                     │
│ Execution Time: 180.434 ms                                                                                                                  │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

它可能不太准确(例如 aaa.bbb 等于 bbb.aaa 这里)但是你可以把它修好 WHERE 条款。无论如何它会更快。
就目前而言 domain 以及 suffix 列是多余的,因为您可以从中恢复它们 adomain/asuffix 使用 array_to_string(anyarray, text [, text]) 功能。
另一种方法是,为了避免表结构中的更改,可以在上创建函数索引 string_to_array() 然后在过滤器/连接中使用它。

dxxyhpgq

dxxyhpgq3#

你想要这样的比赛吗

'something.google.com' like '%google.com'

但是您知道postgresql不会为此使用索引,因为模式字符串以通配符开头。所以你把这两条线颠倒过来:

'moc.elgoog.gnihtemos' like 'moc.elgoog%'

并在上创建函数索引 REVERSE("companyDomain".domain) .
这是一个非常好的主意,但是postgresql不使用您的索引。这是因为dbms不知道字符串中的内容(因为这是表数据,dbms不会先读取整个表来获得计划)。在最坏的情况下,所有颠倒的后缀都以 '%' . 在这种情况下,如果dbms决定遍历索引,这可能会非常慢。你知道后缀不是以 '%' ,但dbms不这样做,而是决定一个安全的计划(全表扫描)。
这里有记录:https://www.postgresql.org/docs/9.2/indexes-types.html
优化器还可以使用b-树索引来查询涉及模式匹配运算符的查询,如和~如果模式是常量。。。
我认为没有办法说服postgresql使用索引是安全的。 AND REVERSE("publicSuffix".suffix) || '%' NOT LIKE '/%%' ESCCAPE '/' 举个例子,没用。
在我看来,你最好的办法是使用索引 RIGHT(domain, 3) 以及 RIGHT(suffix, 3) ,因为我们知道包含点的后缀至少有三个字符长。这可以缩小匹配范围,使其足够有用。

CREATE INDEX idx_publicSuffix_suffix3 ON "publicSuffix"(RIGHT(suffix, 3) varchar_pattern_ops, suffix);

CREATE INDEX idx_companyDomain_domain3 ON "companyDomain"(RIGHT(domain, 3) varchar_pattern_ops, id, domain);

SELECT DISTINCT ON (cd.id)
  cd.domain,
  ps.suffix
FROM "companyDomain" cd
JOIN "publicSuffix" ps ON cd.domain LIKE '%' || ps.suffix
                       AND RIGHT(cd.domain, 3) = RIGHT(ps.suffix, 3)
ORDER BY cd.id, LENGTH(ps.suffix) DESC;

演示:https://www.db-fiddle.com/f/dppvfwjpvjhyfnvut4k7ws/1

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
¦                                                                   QUERY PLAN                                                                                                     ¦
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
¦ Unique  (cost=1684.72..1685.71 rows=198 width=72) (actual time=165.676..165.882 rows=908 loops=1)                                                                                ¦
¦     Buffers: shared hit=4079                                                                                                                                                     ¦
¦     ->  Sort  (cost=1684.72..1685.22 rows=198 width=72) (actual time=165.675..165.723 rows=1006 loops=1)                                                                         ¦
¦           Sort Key: cd.id, (length((ps.suffix)::text)) DESC                                                                                                                      ¦
¦           Sort Method: quicksort Memory: 103kB                                                                                                                                   ¦
¦           Buffers: shared hit=4079                                                                                                                                               ¦
¦           ->  Merge Join  (cost=0.56..1677.17 rows=198 width=72) (actual time=0.090..165.222 rows=1006 loops=1)                                                                  ¦
¦                 Buffers: shared hit=4076                                                                                                                                         ¦
¦                 ->  Index Only Scan using idx_companydomain_domain3 on companyDomain cd  (cost=0.28..93.23 rows=1130 width=36) (actual time=0.018..0.429 rows=908 loops=1)       ¦
¦                       Heap Fetches: 908                                                                                                                                          ¦
¦                       Buffers: shared hit=109                                                                                                                                    ¦
¦                 ->  Materialize  (cost=0.28..602.89 rows=7006 width=32) (actual time=0.019..47.510 rows=390620 loops=1)                                                          ¦
¦                       Buffers: shared hit=3967                                                                                                                                   ¦
¦                       ->  Index Only Scan using idx_publicsuffix_suffix3 on publicSuffix ps  (cost=0.28..585.37 rows=7006 width=32) (actual time=0.015..2.798 rows=8354 loops=1) ¦
¦                             Heap Fetches: 8354                                                                                                                                   ¦
¦                             Buffers: shared hit=3967                                                                                                                             ¦
¦ Planning time: 0.471 ms                                                                                                                                                          ¦
¦ Execution time: 166.054 ms                                                                                                                                                       ¦
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
ou6hu8tu

ou6hu8tu4#

怎么样:

SELECT 
  DISTINCT ON ("companyDomain".id) "companyDomain".domain, 
  "publicSuffix".suffix 
FROM 
  "companyDomain" 
  INNER JOIN "publicSuffix" ON RIGHT(
    domain, 
    - POSITION('.' IN domain) + 1
  ) = "publicSuffix".suffix 
ORDER BY 
  "companyDomain".id, 
  LENGTH("publicSuffix".suffix) DESC;

我们得到第一个的位置 . 在域中,然后使用该值的负值(+1)包含第一个 . )从…中提取后缀 RIGHT 向左。
看起来它跑得更快,从2500毫秒到120毫秒。
带电试验

相关问题