MySQL: poor performance for a LIMIT query ordered by timestamp

Posted by velaa5lx on 2023-10-15 in MySQL

I have a simple MySQL table consisting of a primary key ID field, a timestamp (integer) field, and a foreign key ID field (device_id). Each of these columns is indexed:

mysql> show indexes from device_heartbeats;
+-------------------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table             | Non_unique | Key_name   | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------------------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| device_heartbeats |          0 | PRIMARY    |            1 | id          | A         |     2603552 |     NULL | NULL   |      | BTREE      |         |               |
| device_heartbeats |          1 | IDX...bb8c |            1 | time        | A         |     1573451 |     NULL | NULL   |      | BTREE      |         |               |
| device_heartbeats |          1 | FKb...xi10 |            1 | device_id   | A         |          16 |     NULL | NULL   |      | BTREE      |         |               |
+-------------------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+

The table currently holds about 2.6 million records. Here is the count per device:

mysql> select device_id, count(device_id) from device_heartbeats group by device_id;
+-----------+------------------+
| device_id | count(device_id) |
+-----------+------------------+
|         1 |           315833 |
|         2 |              589 |
|         3 |           851461 |
|         4 |             2115 |
|         5 |          1104668 |
|         6 |                6 |
|         7 |              409 |
|         8 |              783 |
|         9 |              778 |
|        10 |              772 |
|        11 |              211 |
|        12 |              333 |
|        13 |            57370 |
|        14 |            57121 |
|        15 |           217468 |
|        16 |               58 |
|        17 |               66 |
+-----------+------------------+
17 rows in set (0.26 sec)

I have a query that uses the timestamp field to find the most recent record for a specific device_id (about 850k matching records):

mysql> select * from device_heartbeats where device_id = 3 order by time desc limit 1;
+---------+------------+-----------+
| id      | time       | device_id |
+---------+------------+-----------+
| 2610040 | 1697068792 |         3 |
+---------+------------+-----------+
1 row in set (0.00 sec)

This query performs well, but if I use a different device_id (one with only about 2,000 records), performance is poor:

mysql> select * from device_heartbeats where device_id = 4 order by time desc limit 1;
+-------+------------+-----------+
| id    | time       | device_id |
+-------+------------+-----------+
| 48451 | 1684888379 |         4 |
+-------+------------+-----------+
1 row in set (1.59 sec)

Performance is good for every device_id except 4 and 5.
What is going on here, and how can I fix it so that performance is consistently good?

Answer #1 (l0oc07j2)

Short answer

If you are only interested in the latest time for a given device_id, the lightest and fastest query is:

SELECT MAX(time) FROM device_heartbeats WHERE device_id = 3;

with a composite index added:

ALTER TABLE `device_heartbeats`
    DROP INDEX `idx_device_id`,
    ADD INDEX `idx_device_id_time` (`device_id`, `time`);

Long answer

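You did not include the EXPLAIN plans for the two queries, but we can guess what is happening with a fairly high degree of confidence. (The names idx_device_id and idx_time used below stand in for the abbreviated index names shown in the question.)
If you run your initial GROUP BY query with EXPLAIN, you will see something like this: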

+----+-------------+-------------------+------------+-------+---------------+---------------+---------+-----+---------+----------+-------------+
| id | select_type | table             | partitions | type  | possible_keys | key           | key_len | ref | rows    | filtered | Extra       |
+----+-------------+-------------------+------------+-------+---------------+---------------+---------+-----+---------+----------+-------------+
| 1  | SIMPLE      | device_heartbeats |            | index | idx_device_id | idx_device_id | 1       |     | 2645572 | 100.00   | Using index |
+----+-------------+-------------------+------------+-------+---------------+---------------+---------+-----+---------+----------+-------------+

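This is fairly fast because the whole query is answered from the index.
With the current single-column indexes, the optimizer has to pick one of them. For order by time desc limit 1, it will most likely pick the index on time. That is a good choice if the given device_id has a recent row, but a poor one if it has to scan a large part of the index and fetch a lot of rows.
If you run EXPLAIN on the query for device_id = 3, you will probably see something like this: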

+----+-------------+-------------------+------------+-------+---------------+----------+---------+-----+------+----------+----------------------------------+
| id | select_type | table             | partitions | type  | possible_keys | key      | key_len | ref | rows | filtered | Extra                            |
+----+-------------+-------------------+------------+-------+---------------+----------+---------+-----+------+----------+----------------------------------+
| 1  | SIMPLE      | device_heartbeats |            | index | idx_device_id | idx_time | 8       |     | 8    | 11.70    | Using where; Backward index scan |
+----+-------------+-------------------+------------+-------+---------------+----------+---------+-----+------+----------+----------------------------------+

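You will probably see a very similar EXPLAIN for device_id = 4, possibly with a higher but still badly underestimated row count. The Backward index scan walks the index on time in reverse, reading the corresponding row from the clustered index (the primary key) for each entry, until it finds a row with device_id = ?. If all rows for the given device_id are from "a long time ago", it has to fetch a lot of rows. If there is a recent row, it does not have to walk far into the index before it finds the first row for the given device_id.
You can force the use of the index on device_id: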
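To get a feel for how far that backward scan has to travel for a particular device, one rough check (a sketch, not part of the original answer) is to count the rows that are newer than the device's latest heartbeat. Those are approximately the index entries the scan has to step over, each with a row lookup, before it reaches its first match:

```sql
-- Rows newer than the latest heartbeat of device 4: roughly what a
-- backward scan of the time index has to pass before its first match.
SELECT COUNT(*) AS rows_newer_than_latest
FROM device_heartbeats
WHERE time > (
    SELECT MAX(time)
    FROM device_heartbeats
    WHERE device_id = 4
);
```

A small count suggests the time index is a cheap choice for that device; a large one is consistent with the slow case described in the question.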

EXPLAIN
SELECT *
FROM device_heartbeats FORCE INDEX (idx_device_id)
WHERE device_id = 3
ORDER BY time DESC
LIMIT 1;
/* Output for device_id = 3 */

+----+-------------+-------------------+------------+------+---------------+---------------+---------+-------+--------+----------+----------------+
| id | select_type | table             | partitions | type | possible_keys | key           | key_len | ref   | rows   | filtered | Extra          |
+----+-------------+-------------------+------------+------+---------------+---------------+---------+-------+--------+----------+----------------+
| 1  | SIMPLE      | device_heartbeats |            | ref  | idx_device_id | idx_device_id | 1       | const | 851461 | 100.00   | Using filesort |
+----+-------------+-------------------+------------+------+---------------+---------------+---------+-------+--------+----------+----------------+

/* Output for device_id = 4 */

+----+-------------+-------------------+------------+------+---------------+---------------+---------+-------+------+----------+----------------+
| id | select_type | table             | partitions | type | possible_keys | key           | key_len | ref   | rows | filtered | Extra          |
+----+-------------+-------------------+------------+------+---------------+---------------+---------+-------+------+----------+----------------+
| 1  | SIMPLE      | device_heartbeats |            | ref  | idx_device_id | idx_device_id | 1       | const | 2115 | 100.00   | Using filesort |
+----+-------------+-------------------+------------+------+---------------+---------------+---------+-------+------+----------+----------------+

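This will be much faster for device_id = 4, because it only has to filesort 2,115 rows, but relatively slow for device_id = 3, because it has to filesort 851,461 rows.
If you add a composite index (as suggested by Senthil P Nathan in the comments):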

ALTER TABLE `device_heartbeats`
    DROP INDEX `idx_device_id`,
    ADD INDEX `idx_device_id_time` (`device_id`, `time`);
/* Output for device_id = 3 */

+----+-------------+-------------------+------------+------+----------------------------------+--------------------+---------+-------+--------+----------+----------------------------------+
| id | select_type | table             | partitions | type | possible_keys                    | key                | key_len | ref   | rows   | filtered | Extra                            |
+----+-------------+-------------------+------------+------+----------------------------------+--------------------+---------+-------+--------+----------+----------------------------------+
| 1  | SIMPLE      | device_heartbeats |            | ref  | idx_device_id,idx_device_id_time | idx_device_id_time | 1       | const | 529656 | 100.00   | Backward index scan; Using index |
+----+-------------+-------------------+------------+------+----------------------------------+--------------------+---------+-------+--------+----------+----------------------------------+

/* Output for device_id = 4 */

+----+-------------+-------------------+------------+------+----------------------------------+--------------------+---------+-------+------+----------+----------------------------------+
| id | select_type | table             | partitions | type | possible_keys                    | key                | key_len | ref   | rows | filtered | Extra                            |
+----+-------------+-------------------+------------+------+----------------------------------+--------------------+---------+-------+------+----------+----------------------------------+
| 1  | SIMPLE      | device_heartbeats |            | ref  | idx_device_id,idx_device_id_time | idx_device_id_time | 1       | const | 5232 | 100.00   | Backward index scan; Using index |
+----+-------------+-------------------+------------+------+----------------------------------+--------------------+---------+-------+------+----------+----------------------------------+

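Note the change from "Using where; Backward index scan" to "Backward index scan; Using index".
Because these are now simple index lookups, they should return in under 1 ms.
A slightly better option, assuming your table has only the three columns included in your question (id, time, device_id), is to drop the surrogate PK: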
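The same composite index should also help if you want the latest heartbeat for every device at once. A minimal sketch, assuming the idx_device_id_time index from above exists (the alias last_time is only illustrative):

```sql
-- Latest heartbeat per device. With an index on (device_id, time),
-- MySQL can typically answer this by jumping to the last entry of each
-- device_id group (EXPLAIN shows "Using index for group-by").
SELECT device_id, MAX(time) AS last_time
FROM device_heartbeats
GROUP BY device_id;
```

It is worth running EXPLAIN on it to confirm the plan on your MySQL version.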

ALTER TABLE `device_heartbeats`
    DROP PRIMARY KEY,
    DROP INDEX `idx_device_id_time`,
    DROP COLUMN id,
    ADD PRIMARY KEY (`device_id`, `time`);
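That ALTER only succeeds if no device has two rows with the same time value. Before trying it, a quick check along these lines (a sketch; the alias n is just illustrative) would show whether any duplicates exist:

```sql
-- Any row returned here would violate PRIMARY KEY (device_id, time).
SELECT device_id, time, COUNT(*) AS n
FROM device_heartbeats
GROUP BY device_id, time
HAVING COUNT(*) > 1
LIMIT 10;
```

If this returns rows, the surrogate id (or some other tie-breaker) has to stay in the key.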

Answer #2 (h9a6wy2h)

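Sensor data is best served by a PRIMARY KEY that starts with device_id. Start with the comprehensive discussions by @user1191247 and @Schwern. Now let me tie them together with this addition.
I agree with dropping id entirely. But if you cannot trust the time to be unique per device, then keep the auto_inc id and do the following (ts here stands for the question's time column):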

PRIMARY KEY(device_id, ts, id),
INDEX(id)

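The PK gives you "clustering", which benefits many of the likely queries on this kind of data. And it is not detrimental to the INSERTs, even though the inserts arrive in very nearly chronological order.
INDEX(id) is all that is needed to keep AUTO_INCREMENT happy.
The recommended SELECT MAX(ts) FROM device_heartbeats WHERE device_id = 3; will touch only one row and take just a few milliseconds, even if the necessary blocks are not cached. The index is "covering" (EXPLAIN's "Using index"), and no filesort is needed.
Even this will fetch the other columns almost instantly: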
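Put together as a full table definition, the layout might look like the sketch below. The table name device_heartbeats_v2 and the column types are assumptions, since the question does not show the CREATE TABLE:

```sql
-- Sketch only: table name and column types are assumed.
CREATE TABLE device_heartbeats_v2 (
    id        BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    ts        INT UNSIGNED    NOT NULL,  -- integer timestamp
    device_id INT UNSIGNED    NOT NULL,
    -- Clusters each device's rows together, ordered by time;
    -- id only breaks ties between identical (device_id, ts) pairs.
    PRIMARY KEY (device_id, ts, id),
    -- InnoDB requires the AUTO_INCREMENT column to lead some index.
    INDEX (id)
) ENGINE=InnoDB;
```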

SELECT *
    FROM device_heartbeats
    WHERE device_id = 3
    ORDER BY ts DESC
    LIMIT 1;

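The uneven distribution of device_id values will not affect these particular queries.
(Side note: if you plan to eventually delete "old" data, I strongly recommend PARTITION BY RANGE. See Partition.)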
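As a rough illustration of that side note, a range-partitioned variant could look like the sketch below. The table name, column types, and monthly boundaries are all assumptions. Note also that partitioned InnoDB tables do not support foreign keys, so the FK on device_id from the question would have to be dropped for this to work:

```sql
-- Sketch only: names, types, and partition boundaries are assumed.
CREATE TABLE device_heartbeats_part (
    id        BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    ts        INT UNSIGNED    NOT NULL,
    device_id INT UNSIGNED    NOT NULL,
    -- The partitioning column (ts) must appear in every unique key.
    PRIMARY KEY (device_id, ts, id),
    INDEX (id)
) ENGINE=InnoDB
PARTITION BY RANGE (ts) (
    PARTITION p2023_09 VALUES LESS THAN (1696118400),  -- 2023-10-01 00:00:00 UTC
    PARTITION p2023_10 VALUES LESS THAN (1698796800),  -- 2023-11-01 00:00:00 UTC
    PARTITION p_future VALUES LESS THAN MAXVALUE
);

-- Purging a month of old data becomes a quick metadata operation
-- instead of a large DELETE.
ALTER TABLE device_heartbeats_part DROP PARTITION p2023_09;
```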
