MySQL: poor performance of sort by timestamp with LIMIT query

Posted by velaa5lx on 2023-10-15 in MySQL


I have a simple MySQL table consisting of a primary key ID field, a timestamp (integer) field, and a foreign key ID field (device_id). Each of these columns is indexed:

mysql> show indexes from device_heartbeats;
+-------------------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table             | Non_unique | Key_name   | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------------------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| device_heartbeats |          0 | PRIMARY    |            1 | id          | A         |     2603552 |     NULL | NULL   |      | BTREE      |         |               |
| device_heartbeats |          1 | IDX...bb8c |            1 | time        | A         |     1573451 |     NULL | NULL   |      | BTREE      |         |               |
| device_heartbeats |          1 | FKb...xi10 |            1 | device_id   | A         |          16 |     NULL | NULL   |      | BTREE      |         |               |
+-------------------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+

The table currently contains about 2.6 million records. Here is the row count per device:

mysql> select device_id, count(device_id) from device_heartbeats group by device_id;
+-----------+------------------+
| device_id | count(device_id) |
+-----------+------------------+
|         1 |           315833 |
|         2 |              589 |
|         3 |           851461 |
|         4 |             2115 |
|         5 |          1104668 |
|         6 |                6 |
|         7 |              409 |
|         8 |              783 |
|         9 |              778 |
|        10 |              772 |
|        11 |              211 |
|        12 |              333 |
|        13 |            57370 |
|        14 |            57121 |
|        15 |           217468 |
|        16 |               58 |
|        17 |               66 |
+-----------+------------------+
17 rows in set (0.26 sec)

I have a query that finds the most recent record, by the timestamp field, for a particular device_id (about 850k matching records):

mysql> select * from device_heartbeats where device_id = 3 order by time desc limit 1;
+---------+------------+-----------+
| id      | time       | device_id |
+---------+------------+-----------+
| 2610040 | 1697068792 |         3 |
+---------+------------+-----------+
1 row in set (0.00 sec)

This query performs well, but if I use another device_id (one with only about 2,000 records), performance is poor:

mysql> select * from device_heartbeats where device_id = 4 order by time desc limit 1;
+-------+------------+-----------+
| id    | time       | device_id |
+-------+------------+-----------+
| 48451 | 1684888379 |         4 |
+-------+------------+-----------+
1 row in set (1.59 sec)

Performance is good for every device_id except 4 and 5.
What is going on here, and how can I fix it so that performance is consistently good?


Source: https://stackoverflow.com/questions/77284793/poor-performance-of-sort-by-timestamp-with-limit-query

2 Answers


l0oc07j21#

Short answer

If you are only interested in the maximum time for a given device_id, the lightest and fastest query is:

SELECT MAX(time) FROM device_heartbeats WHERE device_id = 3;

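with a composite index added (idx_device_id here stands for the existing single-column index on device_id; substitute the actual index name from your table):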

ALTER TABLE `device_heartbeats`
    DROP INDEX `idx_device_id`,
    ADD INDEX `idx_device_id_time` (`device_id`, `time`);

Long answer

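You did not include the EXPLAIN plans for the two queries, but we can guess what is happening with a fairly high degree of certainty.
If you run your initial GROUP BY query with EXPLAIN, you will see something like this: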

+----+-------------+-------------------+------------+-------+---------------+---------------+---------+-----+---------+----------+-------------+
| id | select_type | table             | partitions | type  | possible_keys | key           | key_len | ref | rows    | filtered | Extra       |
+----+-------------+-------------------+------------+-------+---------------+---------------+---------+-----+---------+----------+-------------+
| 1  | SIMPLE      | device_heartbeats |            | index | idx_device_id | idx_device_id | 1       |     | 2645572 | 100.00   | Using index |
+----+-------------+-------------------+------------+-------+---------------+---------------+---------+-----+---------+----------+-------------+

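This is fairly fast because the whole query is executed against the index.
With the current single-column indexes, the optimizer has to pick one of them. For ORDER BY time DESC LIMIT 1, it will most likely choose the index on time. That is a good choice if the given device_id has a recent row, but not so good if it has to scan a large part of the index and fetch a lot of rows.
If you run EXPLAIN for the query with device_id = 3, you will probably see something like this: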

+----+-------------+-------------------+------------+-------+---------------+----------+---------+-----+------+----------+----------------------------------+
| id | select_type | table             | partitions | type  | possible_keys | key      | key_len | ref | rows | filtered | Extra                            |
+----+-------------+-------------------+------------+-------+---------------+----------+---------+-----+------+----------+----------------------------------+
| 1  | SIMPLE      | device_heartbeats |            | index | idx_device_id | idx_time | 8       |     | 8    | 11.70    | Using where; Backward index scan |
+----+-------------+-------------------+------------+-------+---------------+----------+---------+-----+------+----------+----------------------------------+

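You will probably see a very similar EXPLAIN for device_id = 4, possibly with a higher but still severely underestimated row count. The Backward index scan walks the index on time in reverse, reading the corresponding rows from the clustered index (the primary key) until it finds a row with device_id = ?. If all the rows for the given device_id are from "a long time ago", it has to fetch a lot of rows. If there is a recent row, it does not have to go far into the index before it finds the first row for the given device_id.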
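One way to check this explanation against your own data is to count how many index entries sit ahead of the device's newest row in a backward scan of the time index. The following is a rough diagnostic sketch rather than part of the original answer. It uses the table and column names from the question, and the EXPLAIN ANALYZE statement assumes MySQL 8.0.18 or later.

```sql
-- Rows with a newer timestamp than device 4's latest heartbeat.
-- A backward scan of the time index has to step over roughly this many
-- entries (each with a clustered-index lookup) before its first match.
SELECT COUNT(*) AS rows_scanned_before_match
FROM device_heartbeats
WHERE time > (SELECT MAX(time)
              FROM device_heartbeats
              WHERE device_id = 4);

-- Assumption: MySQL 8.0.18+. Executes the query and reports actual row
-- counts and timings per plan step instead of the optimizer's estimates.
EXPLAIN ANALYZE
SELECT *
FROM device_heartbeats
WHERE device_id = 4
ORDER BY time DESC
LIMIT 1;
```

A large count for a slow device and a count near zero for a fast one would be consistent with the explanation above.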
You can force the use of the index on device_id:

EXPLAIN
SELECT *
FROM device_heartbeats FORCE INDEX (idx_device_id)
WHERE device_id = 3
ORDER BY time DESC
LIMIT 1;

/* Output for device_id = 3 */

+----+-------------+-------------------+------------+------+---------------+---------------+---------+-------+--------+----------+----------------+
| id | select_type | table             | partitions | type | possible_keys | key           | key_len | ref   | rows   | filtered | Extra          |
+----+-------------+-------------------+------------+------+---------------+---------------+---------+-------+--------+----------+----------------+
| 1  | SIMPLE      | device_heartbeats |            | ref  | idx_device_id | idx_device_id | 1       | const | 851461 | 100.00   | Using filesort |
+----+-------------+-------------------+------------+------+---------------+---------------+---------+-------+--------+----------+----------------+

/* Output for device_id = 4 */

+----+-------------+-------------------+------------+------+---------------+---------------+---------+-------+------+----------+----------------+
| id | select_type | table             | partitions | type | possible_keys | key           | key_len | ref   | rows | filtered | Extra          |
+----+-------------+-------------------+------------+------+---------------+---------------+---------+-------+------+----------+----------------+
| 1  | SIMPLE      | device_heartbeats |            | ref  | idx_device_id | idx_device_id | 1       | const | 2115 | 100.00   | Using filesort |
+----+-------------+-------------------+------------+------+---------------+---------------+---------+-------+------+----------+----------------+

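This will be much faster for device_id = 4, since it only needs to filesort 2115 rows, but relatively slow for device_id = 3, since it needs to filesort 851461 rows.
If you add a composite index (as suggested by Senthil P Nathan in the comments):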

ALTER TABLE `device_heartbeats`
    DROP INDEX `idx_device_id`,
    ADD INDEX `idx_device_id_time` (`device_id`, `time`);

/* Output for device_id = 3 */

+----+-------------+-------------------+------------+------+----------------------------------+--------------------+---------+-------+--------+----------+----------------------------------+
| id | select_type | table             | partitions | type | possible_keys                    | key                | key_len | ref   | rows   | filtered | Extra                            |
+----+-------------+-------------------+------------+------+----------------------------------+--------------------+---------+-------+--------+----------+----------------------------------+
| 1  | SIMPLE      | device_heartbeats |            | ref  | idx_device_id,idx_device_id_time | idx_device_id_time | 1       | const | 529656 | 100.00   | Backward index scan; Using index |
+----+-------------+-------------------+------------+------+----------------------------------+--------------------+---------+-------+--------+----------+----------------------------------+

/* Output for device_id = 4 */

+----+-------------+-------------------+------------+------+----------------------------------+--------------------+---------+-------+------+----------+----------------------------------+
| id | select_type | table             | partitions | type | possible_keys                    | key                | key_len | ref   | rows | filtered | Extra                            |
+----+-------------+-------------------+------------+------+----------------------------------+--------------------+---------+-------+------+----------+----------------------------------+
| 1  | SIMPLE      | device_heartbeats |            | ref  | idx_device_id,idx_device_id_time | idx_device_id_time | 1       | const | 5232 | 100.00   | Backward index scan; Using index |
+----+-------------+-------------------+------------+------+----------------------------------+--------------------+---------+-------+------+----------+----------------------------------+

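Note the change from Using where; Backward index scan to Backward index scan; Using index.
Because these are now simple index lookups, they should return in < 1 ms.
A slightly better option, assuming your table has only the three columns included in your question (id, time, device_id), is to drop the surrogate PK: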

ALTER TABLE `device_heartbeats`
    DROP PRIMARY KEY,
    DROP INDEX `idx_device_id_time`,
    DROP COLUMN id,
    ADD PRIMARY KEY (`device_id`, `time`);
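
Before running this ALTER, it may be worth confirming that (device_id, time) really is unique, since ADD PRIMARY KEY fails with a duplicate-entry error otherwise. A minimal check, assuming the column names from the question:

```sql
-- Lists (device_id, time) pairs that occur more than once.
-- An empty result means the pair can serve as the primary key.
SELECT device_id, time, COUNT(*) AS occurrences
FROM device_heartbeats
GROUP BY device_id, time
HAVING COUNT(*) > 1
LIMIT 10;
```

If this returns any rows, either deduplicate first or keep id in the key as a tiebreaker.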

2023-10-15

h9a6wy2h2#

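Sensor data is best served by a PRIMARY KEY that starts with device_id. Start with the thorough discussions by @user1191247 and @Schwern; let me tie them together with this addition.
I agree with dropping id entirely. But if you cannot trust the time to be unique per device, keep the auto_inc id and do the following (ts is the timestamp column, called time in the question):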

PRIMARY KEY(device_id, ts, id),
INDEX(id)

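The PK gives you "clustering", which benefits many of the queries you are likely to run on this kind of data. And it has no adverse effect on INSERTs, even though the inserts arrive in something very close to "chronological order".
INDEX(id) is all that is needed to keep AUTO_INCREMENT happy.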
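
Put together, the table definition might look like the sketch below. The column types are assumptions (the question does not show the DDL), the table name device_heartbeats_v2 is hypothetical, and the foreign key on device_id is left out because the parent table is not shown.

```sql
-- Sketch only: types and table name are assumed, not taken from the question.
CREATE TABLE device_heartbeats_v2 (
    id        INT UNSIGNED     NOT NULL AUTO_INCREMENT,
    device_id TINYINT UNSIGNED NOT NULL,
    ts        INT UNSIGNED     NOT NULL,   -- Unix timestamp in seconds
    -- Clusters each device's rows together, ordered by time;
    -- id only breaks ties between identical (device_id, ts) pairs.
    PRIMARY KEY (device_id, ts, id),
    -- InnoDB requires the AUTO_INCREMENT column to lead some index.
    INDEX (id)
) ENGINE=InnoDB;
```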
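The recommended SELECT MAX(ts) FROM device_heartbeats WHERE device_id = 3; touches only one row and takes just a few milliseconds, even if the necessary blocks are not cached. The index is "covering" (EXPLAIN's "Using index") and no filesort is needed.
Even this fetches the other columns almost instantly: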

SELECT *
    FROM device_heartbeats
    WHERE device_id = 3
    ORDER BY ts DESC
    LIMIT 1;

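The uneven distribution of device_id values does not affect these particular queries.
(Side note: if you plan to eventually delete "old" data, I strongly recommend PARTITION BY RANGE. See Partition.)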
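
As a rough illustration of that side note, range partitioning on the timestamp could look like the sketch below. It assumes the primary key includes ts (MySQL requires every unique key to contain the partitioning column), that ts holds Unix timestamps, and that the table has no foreign keys, since partitioned InnoDB tables do not support them. The partition names and monthly boundaries are hypothetical.

```sql
-- Assumptions: PRIMARY KEY (device_id, ts, id), ts is a Unix timestamp,
-- and the table has no foreign keys.
ALTER TABLE device_heartbeats
PARTITION BY RANGE (ts) (
    PARTITION p2023_09 VALUES LESS THAN (1696118400), -- everything before 2023-10-01 00:00:00 UTC
    PARTITION p2023_10 VALUES LESS THAN (1698796800), -- before 2023-11-01 00:00:00 UTC
    PARTITION p_future VALUES LESS THAN MAXVALUE
);

-- Removes every row in that partition (here: all data before 2023-10-01),
-- which is much cheaper than a large DELETE.
ALTER TABLE device_heartbeats DROP PARTITION p2023_09;
```

Because the question's table does have a foreign key on device_id, that constraint would have to be dropped before partitioning is an option.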

2023-10-15
