我有两个表-invoices
和invoiceitems
。关系是1-多。我的应用程序允许使用查询中的发票项字段查询发票。只退回发票,不退回项目。
例如,我想获取所有包含项目的发票,其名称包含ac
,不区分大小写。输出是分页的,因此我执行一个查询以获得满足条件的发票的计数,然后执行另一个查询以获得发票的相应页面。
表格大小为:
- 发票-65,000条记录
- 发票项-3,281,518条记录
- 条款- 5项
- 代表- 5项目
- shipVia - 5产品
每张发票最多链接到100个发票项目。
我的问题是我无法为我的查询找出最佳索引:
架构:
CREATE TABLE invoiceitems
(
id serial NOT NULL,
invoice_id integer NOT NULL,
name text NOT NULL,
...
CONSTRAINT invoiceitems_pkey PRIMARY KEY (id),
CONSTRAINT invoiceitems_invoice_id_fkey FOREIGN KEY (invoice_id)
REFERENCES invoices (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
);
CREATE INDEX idx_lower_name
ON invoiceitems
USING btree
(lower(name) COLLATE pg_catalog."default" text_pattern_ops);
CREATE TABLE invoices
(
id serial NOT NULL,
term_id integer,
rep_id integer NOT NULL,
ship_via_id integer,
...
CONSTRAINT invoices_pkey PRIMARY KEY (id),
CONSTRAINT invoices_rep_id_fkey FOREIGN KEY (rep_id)
REFERENCES reps (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT invoices_ship_via_id_fkey FOREIGN KEY (ship_via_id)
REFERENCES shipvia (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT invoices_term_id_fkey FOREIGN KEY (term_id)
REFERENCES terms (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
);
字符串
计数查询:
SELECT COUNT(DISTINCT(o.id))
FROM invoices o
JOIN invoiceitems items ON items.invoice_id = o.id
LEFT JOIN terms t ON t.id = o.term_id
LEFT JOIN reps r ON r.id = o.rep_id
LEFT JOIN shipVia s ON s.id = o.ship_via_id WHERE LOWER(items.name) LIKE '%ac%';
型
结果:
6518
查询计划
"Aggregate (cost=107651.35..107651.36 rows=1 width=4)"
" -> Hash Join (cost=3989.50..106010.59 rows=656304 width=4)"
" Hash Cond: (items.invoice_id = o.id)"
" -> Seq Scan on invoiceitems items (cost=0.00..85089.77 rows=656304 width=4)"
" Filter: (lower(name) ~~ '%ac%'::text)"
" -> Hash (cost=2859.00..2859.00 rows=65000 width=16)"
" -> Seq Scan on invoices o (cost=0.00..2859.00 rows=65000 width=16)"
型
似乎我在invoiceitems.name
字段上的函数索引根本不起作用。我想这是因为我在寻找名字的一部分,这不是名字的严格前缀。我不确定,但似乎我的发票主键索引在这里也不起作用。
我的问题是,我可以优化计数查询和/或我的模式,以提高性能吗?
我必须允许搜索的部分名称,这不是严格的前缀,我也必须支持不区分大小写的搜索。
我返回匹配记录的查询同样糟糕:
SELECT DISTINCT(o.id), t.terms, r.rep, s.ship_via, ...
FROM invoices o
JOIN invoiceitems items ON items.invoice_id = o.id
LEFT JOIN terms t ON t.id = o.term_id
LEFT JOIN reps r ON r.id = o.rep_id
LEFT JOIN shipVia s ON s.id = o.ship_via_id WHERE LOWER(items.name) LIKE '%ac%' LIMIT 100;
型
其计划:
"Limit (cost=901846.63..901854.13 rows=100 width=627)"
" -> Unique (cost=901846.63..951069.43 rows=656304 width=627)"
" -> Sort (cost=901846.63..903487.39 rows=656304 width=627)"
" Sort Key: o.id, t.terms, r.rep, s.ship_via, ..."
" -> Hash Join (cost=11509.54..286596.53 rows=656304 width=627)"
" Hash Cond: (items.invoice_id = o.id)"
" -> Seq Scan on invoiceitems items (cost=0.00..85089.77 rows=656304 width=4)"
" Filter: (lower(name) ~~ '%ac%'::text)"
" -> Hash (cost=5491.03..5491.03 rows=65000 width=627)"
" -> Hash Left Join (cost=113.02..5491.03 rows=65000 width=627)"
" Hash Cond: (o.ship_via_id = s.id)"
" -> Hash Left Join (cost=75.35..4559.61 rows=65000 width=599)"
" Hash Cond: (o.rep_id = r.id)"
" -> Hash Left Join (cost=37.67..3628.19 rows=65000 width=571)"
" Hash Cond: (o.term_id = t.id)"
" -> Seq Scan on invoices o (cost=0.00..2859.00 rows=65000 width=543)"
" -> Hash (cost=22.30..22.30 rows=1230 width=36)"
" -> Seq Scan on terms t (cost=0.00..22.30 rows=1230 width=36)"
" -> Hash (cost=22.30..22.30 rows=1230 width=36)"
" -> Seq Scan on reps r (cost=0.00..22.30 rows=1230 width=36)"
" -> Hash (cost=22.30..22.30 rows=1230 width=36)"
" -> Seq Scan on shipvia s (cost=0.00..22.30 rows=1230 width=36)"
型
我只限于PostgreSQL。切换到SQL Server不是一个选项。
**EDIT ==
我已经遵循了欧文非常翔实的指示,这是我所拥有的。
索引:
CREATE INDEX invoiceitems_name_gin_trgm_idx ON invoiceitems USING gin (name gin_trgm_ops);
型
带JOIN的count查询,不带额外表:
EXPLAIN ANALYZE SELECT COUNT(DISTINCT(o.id))
FROM invoices o
JOIN invoiceitems items ON items.invoice_id = o.id
WHERE items.name ILIKE '%ac%';
"Aggregate (cost=78961.52..78961.53 rows=1 width=4) (actual time=5205.448..5205.450 rows=1 loops=1)"
" -> Nested Loop (cost=0.00..78960.73 rows=316 width=4) (actual time=0.396..5176.761 rows=6518 loops=1)"
" -> Seq Scan on invoiceitems items (cost=0.00..76885.98 rows=316 width=4) (actual time=0.021..4502.043 rows=6518 loops=1)"
" Filter: (name ~~* '%ac%'::text)"
" Rows Removed by Filter: 3275000"
" -> Index Only Scan using invoices_pkey on invoices o (cost=0.00..6.56 rows=1 width=4) (actual time=0.012..0.015 rows=1 loops=6518)"
" Index Cond: (id = items.invoice_id)"
" Heap Fetches: 6518"
"Total runtime: 5205.509 ms"
型
半连接的count查询:
EXPLAIN ANALYZE SELECT COUNT(1)
FROM invoices o
WHERE EXISTS (
SELECT 1
FROM invoiceitems i
WHERE i.invoice_id = o.id
AND i.name ILIKE '%ac%'
);
"Aggregate (cost=76920.43..76920.44 rows=1 width=0) (actual time=5713.597..5713.598 rows=1 loops=1)"
" -> Nested Loop (cost=76886.76..76919.64 rows=316 width=0) (actual time=5583.706..5703.801 rows=6518 loops=1)"
" -> HashAggregate (cost=76886.76..76886.82 rows=5 width=4) (actual time=5583.568..5594.977 rows=6518 loops=1)"
" -> Seq Scan on invoiceitems i (cost=0.00..76885.98 rows=316 width=4) (actual time=0.295..5148.801 rows=6518 loops=1)"
" Filter: (name ~~* '%ac%'::text)"
" Rows Removed by Filter: 3275000"
" -> Index Only Scan using invoices_pkey on invoices o (cost=0.00..6.56 rows=1 width=4) (actual time=0.006..0.008 rows=1 loops=6518)"
" Index Cond: (id = i.invoice_id)"
" Heap Fetches: 6518"
"Total runtime: 5713.804 ms"
型
半连接似乎没有任何效果。为什么?为什么?
(我认为这并不重要,但我删除了lower(invoiceitems.name)
上的原始函数索引)。
**EDIT 2=
我想重点介绍fetchrows查询并提供更多的上下文。
首先,用户可以要求根据发票中的任意字段(而不是发票项)对列进行排序。
此外,用户可以提供涉及发票和发票项目字段的过滤器语句的列表。这些过滤器语句捕获通过字符串或数值进行过滤的语义,例如,过滤器可以是“发票项目名称包含‘ac’,发票折扣高于5%”
我很清楚,我不太可能将每个字段都编入索引,我可能只需要索引最常见的字段,如发票项目名称和其他一些字段。
无论如何,以下是我迄今为止对invoices和invoiceitems表的索引:
发票
- id作为主键
发票项 - id作为主键
CREATE INDEX invoiceitems_invoice_id_idx ON invoiceitems USING btree (invoice_id);
个CREATE INDEX invoiceitems_name_gin_trgm_idx ON invoiceitems USING gin (name COLLATE pg_catalog."default" gin_trgm_ops);
个
下面是使用JOIN对发票项进行的获取行查询的分析:
explain analyze
SELECT DISTINCT(o.id), t.terms, r.rep, s.ship_via, ...
FROM invoices o
JOIN invoiceitems items ON items.invoice_id = o.id
LEFT JOIN terms t ON t.id = o.term_id
LEFT JOIN reps r ON r.id = o.rep_id
LEFT JOIN shipVia s ON s.id = o.ship_via_id
WHERE (items.name ILIKE '%df%' AND items.name IS NOT NULL) LIMIT 100;
"Limit (cost=79100.70..79106.95 rows=100 width=312) (actual time=4637.195..4637.195 rows=0 loops=1)"
" -> Unique (cost=79100.70..79120.45 rows=316 width=312) (actual time=4637.190..4637.190 rows=0 loops=1)"
" -> Sort (cost=79100.70..79101.49 rows=316 width=312) (actual time=4637.186..4637.186 rows=0 loops=1)"
" Sort Key: o.id, o.customer, o.business_no, o.bill_to_name, o.bill_to_address1, o.bill_to_address2, o.bill_to_postal_code, o.ship_to_name, o.ship_to_address1, o.ship_to_address2, o.ship_to_postal_code, o.purchase_order_no, t.terms, r.rep, ((o.ship_date)::text), s.ship_via, o.delivery, o.hst_percents, o.sub_total, o.total_before_hst, o.total, o.total_discount, o.hst, o.item_count"
" Sort Method: quicksort Memory: 25kB"
" -> Hash Left Join (cost=113.02..79087.58 rows=316 width=312) (actual time=4637.179..4637.179 rows=0 loops=1)"
" Hash Cond: (o.ship_via_id = s.id)"
" -> Hash Left Join (cost=75.35..79043.98 rows=316 width=284) (actual time=4637.123..4637.123 rows=0 loops=1)"
" Hash Cond: (o.rep_id = r.id)"
" -> Hash Left Join (cost=37.67..79001.96 rows=316 width=256) (actual time=4637.119..4637.119 rows=0 loops=1)"
" Hash Cond: (o.term_id = t.id)"
" -> Nested Loop (cost=0.00..78960.73 rows=316 width=228) (actual time=4637.115..4637.115 rows=0 loops=1)"
" -> Seq Scan on invoiceitems items (cost=0.00..76885.98 rows=316 width=4) (actual time=4637.108..4637.108 rows=0 loops=1)"
" Filter: ((name IS NOT NULL) AND (name ~~* '%df%'::text))"
" Rows Removed by Filter: 3281518"
" -> Index Scan using invoices_pkey on invoices o (cost=0.00..6.56 rows=1 width=228) (never executed)"
" Index Cond: (id = items.invoice_id)"
" -> Hash (cost=22.30..22.30 rows=1230 width=36) (never executed)"
" -> Seq Scan on terms t (cost=0.00..22.30 rows=1230 width=36) (never executed)"
" -> Hash (cost=22.30..22.30 rows=1230 width=36) (never executed)"
" -> Seq Scan on reps r (cost=0.00..22.30 rows=1230 width=36) (never executed)"
" -> Hash (cost=22.30..22.30 rows=1230 width=36) (never executed)"
" -> Seq Scan on shipvia s (cost=0.00..22.30 rows=1230 width=36) (never executed)"
"Total runtime: 4637.731 ms"
型
下面是使用WHERE EXISTS而不是JOIN对发票项进行的获取行查询的分析:
explain analyze
SELECT o.id, t.terms, r.rep, s.ship_via, ...
FROM invoices o
LEFT JOIN terms t ON t.id = o.term_id
LEFT JOIN reps r ON r.id = o.rep_id
LEFT JOIN shipVia s ON s.id = o.ship_via_id
WHERE EXISTS (
SELECT 1
FROM invoiceitems i
WHERE i.invoice_id = o.id
AND i.name ILIKE '%df%'
AND i.name IS NOT NULL
) LIMIT 100;
"Limit (cost=0.19..43302.88 rows=100 width=610) (actual time=5771.852..5771.852 rows=0 loops=1)"
" -> Nested Loop Left Join (cost=0.19..136836.68 rows=316 width=610) (actual time=5771.848..5771.848 rows=0 loops=1)"
" -> Nested Loop Left Join (cost=0.19..135404.33 rows=316 width=582) (actual time=5771.844..5771.844 rows=0 loops=1)"
" -> Nested Loop Left Join (cost=0.19..134052.55 rows=316 width=554) (actual time=5771.841..5771.841 rows=0 loops=1)"
" -> Merge Semi Join (cost=0.19..132700.78 rows=316 width=526) (actual time=5771.837..5771.837 rows=0 loops=1)"
" Merge Cond: (o.id = i.invoice_id)"
" -> Index Scan using invoices_pkey on invoices o (cost=0.00..3907.27 rows=65000 width=526) (actual time=0.017..0.017 rows=1 loops=1)"
" -> Index Scan using invoiceitems_invoice_id_idx on invoiceitems i (cost=0.00..129298.19 rows=316 width=4) (actual time=5771.812..5771.812 rows=0 loops=1)"
" Filter: ((name IS NOT NULL) AND (name ~~* '%df%'::text))"
" Rows Removed by Filter: 3281518"
" -> Index Scan using terms_pkey on terms t (cost=0.00..4.27 rows=1 width=36) (never executed)"
" Index Cond: (id = o.term_id)"
" -> Index Scan using reps_pkey on reps r (cost=0.00..4.27 rows=1 width=36) (never executed)"
" Index Cond: (id = o.rep_id)"
" -> Index Scan using shipvia_pkey on shipvia s (cost=0.00..4.27 rows=1 width=36) (never executed)"
" Index Cond: (id = o.ship_via_id)"
"Total runtime: 5771.948 ms"
型
我没有尝试第三种选择,即按不同的invoice_id对invoiceitems行进行排序,因为这种方法似乎只有在没有给出排序的情况下才可行,而通常情况恰恰相反--存在排序。
1条答案
按热度按时间gzszwxb41#
索引
三元索引
使用
pg_trgm
模块提供的三元组索引,该模块为GIN或GiST索引提供运算符类,以支持所有LIKE
和ILIKE
模式,而不仅仅是左锚定模式。请参阅:更多关于如何使用trigram index的信息:
示例如下:
字符串
GIN
索引可能更快,但也更大。The manual:根据经验,
GIN
索引的搜索速度比GiST
索引快,但构建或更新速度较慢;因此GIN
更适合静态数据,而GiST
更适合经常更新的数据。这一切都取决于确切的要求。
额外的btree索引
当然,在
invoiceitems.invoice_id
上还需要一个普通的btree索引(默认):型
多列索引,仅索引扫描
您可能会从making this index "covering" for an index-only scan中获得一些额外的好处。GIN索引通常对于
integer
列(如invoice_id
)没有意义。但是为了保存额外的堆查找,可能需要将其包含在多列GIN(或GiST)索引中。你得先测试一下。为此,您需要额外的模块
btree_gin
(或btree_gist
)。GIN示例:型
这将消除对以上B树索引的需要。但无论如何都要创建它,以加速FK检查,以及用于许多其他目的。
查询
计数
为了一个...
查询以获取发票的数量
......省略只会造成伤害的附加表(如果有的话):
型
由于外键约束保证了引用完整性,因此您甚至可以在此查询中省略表
invoices
。你闪亮的新指数应该踢!返回行
对于退货:
EXISTS
在这里仍然很好:型
或者你可以测试这个变量,它作为子选择连接到上面的查询。可能更快:
型
本例通过
invoice_id
获取前100个(因为您没有提供排序顺序)。这一切都取决于细节。。