SQLite3查询优化连接与子选择

wmvff8tz  于 12个月前  发布在  SQLite
关注(0)|答案(6)|浏览(111)

我试图找出最好的方法,(在这种情况下可能不重要)找到一个表的行,基于一个标志的存在,以及另一个表中的一行中的关系ID。
以下是这些模式:

CREATE TABLE files (
id INTEGER PRIMARY KEY,
dirty INTEGER NOT NULL);

    CREATE TABLE resume_points (
id INTEGER PRIMARY KEY  AUTOINCREMENT  NOT NULL ,
scan_file_id INTEGER NOT NULL );

字符串
我用的是SQLite3
这里的文件表将非常大,通常为10 K-5 M行。2 resume_points将小于10 K,只有1-2个不同的scan_file_id
所以我的第一个想法是

select distinct files.* from resume_points inner join files
on resume_points.scan_file_id=files.id where files.dirty = 1;


一位同事建议将连接转过来:

select distinct files.* from files inner join resume_points
on files.id=resume_points.scan_file_id where files.dirty = 1;


然后我想,既然我们知道不同scan_file_id的数量将是如此之小,也许一个子选择将是最佳的(在这种罕见的情况下):

select * from files where id in (select distinct scan_file_id from resume_points);


explain输出分别具有以下行:42、42和48。

myss37ts

myss37ts1#

TL;DR:最好的查询和索引是:

create index uniqueFiles on resume_points (scan_file_id);
select * from (select distinct scan_file_id from resume_points) d join files on d.scan_file_id = files.id and files.dirty = 1;

字符串
由于我通常使用SQL Server,起初我认为查询优化器肯定会为这样一个简单的查询找到最佳的执行计划,而不管你用哪种方式编写这些等价的SQL语句。所以我下载了SQLite,并开始玩。令我惊讶的是,性能上有很大的差异。
下面是安装代码:

CREATE TABLE files (
id INTEGER PRIMARY KEY autoincrement,
dirty INTEGER NOT NULL);

CREATE TABLE resume_points (
id INTEGER PRIMARY KEY  AUTOINCREMENT  NOT NULL ,
scan_file_id INTEGER NOT NULL );

insert into files (dirty) values (0);
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;

insert into resume_points (scan_file_id) select (select abs(random() % 8000000)) from files limit 5000;

insert into resume_points (scan_file_id) select (select abs(random() % 8000000)) from files limit 5000;


我考虑了两个指数:

create index dirtyFiles on files (dirty, id);
create index uniqueFiles on resume_points (scan_file_id);
create index fileLookup on files (id);


下面是我在i5笔记本电脑上尝试的查询和执行时间。数据库文件大小只有大约200 MB,因为它没有任何其他数据。

select distinct files.* from resume_points inner join files on resume_points.scan_file_id=files.id where files.dirty = 1;
4.3 - 4.5ms with and without index

select distinct files.* from files inner join resume_points on files.id=resume_points.scan_file_id where files.dirty = 1;
4.4 - 4.7ms with and without index

select * from (select distinct scan_file_id from resume_points) d join files on d.scan_file_id = files.id and files.dirty = 1;
2.0 - 2.5ms with uniqueFiles
2.6-2.9ms without uniqueFiles

select * from files where id in (select distinct scan_file_id from resume_points) and dirty = 1;
2.1 - 2.5ms with uniqueFiles
2.6-3ms without uniqueFiles

SELECT f.* FROM resume_points rp INNER JOIN files f on rp.scan_file_id = f.id
WHERE f.dirty = 1 GROUP BY f.id
4500 - 6190 ms with uniqueFiles
8.8-9.5 ms without uniqueFiles
    14000 ms with uniqueFiles and fileLookup

select * from files where exists (
select * from resume_points where files.id = resume_points.scan_file_id) and dirty = 1;
8400 ms with uniqueFiles
7400 ms without uniqueFiles


看起来SQLite的查询优化器一点也不高级,最好的查询首先将resume_points减少到少量的行(测试用例中的两个。OP说它将是1-2。),然后查找文件,看看它是否脏。dirtyFiles索引对任何文件都没有太大的影响。我想这可能是因为数据的排列方式在测试表中。它可能会在生产表中产生差异。然而,差异并不是太大,因为查找次数将少于少数。uniqueFiles确实有所不同,因为它可以将10000行的resume_points减少到2行,而无需扫描其中的大多数。fileLookup确实使一些查询稍微快一些,但不足以显著改变结果。值得注意的是,它使分组非常缓慢。总之,尽早减少结果集以产生最大的差异。

pgccezyw

pgccezyw2#

由于files.id是主键,因此尝试GROUP ing BY此字段,而不是检查DISTINCT files.*

SELECT f.*
FROM resume_points rp
INNER JOIN files f on rp.scan_file_id = f.id
WHERE f.dirty = 1
GROUP BY f.id

字符串
另一个要考虑的性能选项是向resume_points.scan_file_id添加索引。

CREATE INDEX index_resume_points_scan_file_id ON resume_points (scan_file_id)

z4iuyo4d

z4iuyo4d3#

您可以尝试exists,它不会产生任何重复的files

select * from files
where exists (
    select * from resume_points 
    where files.id = resume_points.scan_file_id
)
and dirty = 1;

字符串
当然,它 * 可能 * 有助于有适当的索引:

files.dirty
resume_points.scan_file_id


索引是否有用将取决于您的数据。

cx6n0qe3

cx6n0qe34#

我认为曾智文给出了解决方案。

select * from (select distinct scan_file_id from resume_points) d
join files on d.scan_file_id = files.id and files.dirty = 1

字符串
基本上,它是相同的,你已经张贴作为您的最后一个选项:

select * from files where id in (select distinct scan_file_id from resume_points) and dirty = 1;


这是因为你必须避免一个完整的表扫描/连接。
所以首先你需要1-2个不同的ID:

select distinct scan_file_id from resume_points


在此之后,只有1-2行必须连接到另一个表上,而不是所有10 K,这提供了性能优化。
如果你多次需要这个语句,我会把它放到一个视图中。视图不会改变性能,但它看起来更干净/更容易阅读。
也可以查看查询优化文档:http://www.sqlite.org/optoverview.html

dgenwo3n

dgenwo3n5#

在sqlite-studio中测试了100万行。
链接和已处理的表,都有100万行。
子查询:

SELECT l.id AS id, l.col1, l.col2 FROM items l 
 WHERE l.id NOT IN (SELECT c_id FROM processed WHERE c_type = 4) 
 ORDER BY l.col4 ASC LIMIT 1;

[22:06:05] Query finished in 0.108 second(s).

字符串
加入查询:

SELECT l.id AS id, l.col1, l.col2 FROM items l 
 LEFT JOIN processed p ON p.c_id = l.id AND p.c_type = 4 
 WHERE p.c_id IS NULL 
 ORDER BY l.col4 ASC LIMIT 1;

[22:04:23] Query finished in 0.000 second(s).


与子查询连接:

SELECT l.id AS id, l.col1, l.col2 FROM items l 
 LEFT JOIN (SELECT c_id FROM processed WHERE c_type = 4) p 
  ON p.c_id = l.id
 WHERE p.c_id IS NULL 
 ORDER BY l.col4 ASC LIMIT 1;

[22:45:46] Query finished in 0.001 second(s).


在子查询方法中,子查询加载主查询的SQL语句中的所有结果,这就是为什么它很慢。
在Join with subquery方法的情况下,如果子查询返回很多行,它会变慢,因为首先子查询加载所有行,然后连接这些行。如果子查询返回不超过几十行,那么Join with subquery方法仍然很快。
连接查询方法保持快速,即使连接表有很多行。

c86crjj0

c86crjj06#

如果“resume_points”表只有一个或两个不同的文件id号,它似乎只需要一个或两个行,并且似乎需要scan_file_id作为主键。该表只有两列,id号没有意义。
如果是这样的话,你不需要任何一个身份证号码。

pragma foreign_keys = on;
CREATE TABLE resume_points (
  scan_file_id integer primary key
);

CREATE TABLE files (
  scan_file_id integer not null references resume_points (scan_file_id),
  dirty INTEGER NOT NULL,
  primary key (scan_file_id, dirty)
);

字符串
现在也不需要连接了,只需要查询“files”表。

相关问题