Is a SELECT with MAX() and GROUP BY efficient, or will it read all rows?

Posted by ryevplcw on 2021-06-10 in Cassandra


I created a Cassandra table as follows:

create table messages
    (user_id int, peer_id int, send_on timestamp, message text, 
    PRIMARY KEY (user_id, peer_id, send_on))
    WITH CLUSTERING ORDER BY (peer_id ASC, send_on DESC);

and filled it with data.
I want to query the latest message for each peer_id of a given user, and came up with this:

select peer_id, max(send_on), message 
  from messages 
  where user_id = 1 group by peer_id;

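I would like to know whether this reads all of the messages and then extracts the latest one, or whether it is smart enough to pick up only the latest message.
I ask because I populated the table with the following values: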

1, 1, now(), hello 1
1, 1, now(), hello 2
1, 1, now(), hello 3
1, 2, now(), hello 4
1, 2, now(), hello 5
1, 2, now(), hello 6
...
1, 3, now(), hello 9

When I run the query, I see the expected result:

select peer_id, max(send_on), message from messages where user_id = 1 group by peer_id;

 peer_id | system.max(send_on)             | message
---------+---------------------------------+---------
       1 | 2019-04-13 19:20:48.567000+0000 | hello 3
       2 | 2019-04-13 19:21:07.929000+0000 | hello 6
       3 | 2019-04-13 19:21:22.081000+0000 | hello 9

(3 rows)

But with tracing turned on, I see:

activity                                                                                                                      | timestamp                  | source    | source_elapsed | client
-------------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------+-----------
                                                                                                            Execute CQL3 query | 2019-04-13 19:24:54.948000 | 127.0.0.1 |              0 | 127.0.0.1
 Parsing select peer_id, max(send_on), message from messages where user_id = 1 group by peer_id; [Native-Transport-Requests-1] | 2019-04-13 19:24:54.956000 | 127.0.0.1 |           8812 | 127.0.0.1
                                                                             Preparing statement [Native-Transport-Requests-1] | 2019-04-13 19:24:54.957000 | 127.0.0.1 |          10234 | 127.0.0.1
                                                                    Executing single-partition query on messages [ReadStage-2] | 2019-04-13 19:24:54.962000 | 127.0.0.1 |          14757 | 127.0.0.1
                                                                                    Acquiring sstable references [ReadStage-2] | 2019-04-13 19:24:54.962000 | 127.0.0.1 |          14961 | 127.0.0.1
                                       Skipped 0/0 non-slice-intersecting sstables, included 0 due to tombstones [ReadStage-2] | 2019-04-13 19:24:54.962000 | 127.0.0.1 |          15211 | 127.0.0.1
                                                                       Merged data from memtables and 0 sstables [ReadStage-2] | 2019-04-13 19:24:54.963000 | 127.0.0.1 |          15665 | 127.0.0.1
                                                                          Read 9 live rows and 0 tombstone cells [ReadStage-2] | 2019-04-13 19:24:54.963000 | 127.0.0.1 |          15817 | 127.0.0.1
                                                                                                              Request complete | 2019-04-13 19:24:54.964448 | 127.0.0.1 |          16448 | 127.0.0.1

So it appears to read all 9 rows. Is there a way to optimize this? Perhaps by changing my schema?

cassandra cql cassandra-3.0

Source: https://stackoverflow.com/questions/55669145/is-a-select-with-max-and-group-by-efficient-or-will-it-read-all-rows

2 Answers


wmomyfyw1#

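Here is one idea: change the partition key to the combination of user_id and peer_id, and then you can use the PER PARTITION LIMIT construct. That reads back only one row per partition, and you no longer need MAX, because with CLUSTERING ORDER BY (send_on DESC) the first row in each partition is the most recent one: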

> CREATE TABLE messages
    (user_id int, peer_id int, send_on timestamp, message text,
    PRIMARY KEY ((user_id, peer_id), send_on))
    WITH CLUSTERING ORDER BY (send_on DESC);

> SELECT peer_id, send_on, message
          FROM messages
          WHERE user_id = 1 AND peer_id=1
          PER PARTITION LIMIT 1;

 peer_id | send_on                         | message
---------+---------------------------------+---------
       1 | 2019-04-15 15:21:40.350000+0000 | hello 3

(1 rows)

> SELECT peer_id, send_on, message
          FROM messages PER PARTITION LIMIT 1;

 peer_id | send_on                         | message
---------+---------------------------------+---------
       3 | 2019-04-15 15:21:40.387000+0000 | hello 9
       2 | 2019-04-15 15:21:40.365000+0000 | hello 6
       1 | 2019-04-15 15:21:40.350000+0000 | hello 3

(3 rows)

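Note: the last query is a multi-key query (it scans every partition) and is shown for demonstration purposes only; it obviously should not be run on a large production cluster.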
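If the application already knows which peers it cares about, one possible middle ground between the single-peer query and the full scan is to restrict the partition key with IN. This is only a sketch: it assumes the re-keyed table above and a Cassandra version that supports PER PARTITION LIMIT (3.6 or later), and the peer_id list (1, 2, 3) is made up for illustration.

```cql
-- Assumes PRIMARY KEY ((user_id, peer_id), send_on) with send_on DESC, as above.
-- The peer_id values are illustrative; the application has to supply them.
SELECT peer_id, send_on, message
  FROM messages
 WHERE user_id = 1 AND peer_id IN (1, 2, 3)
 PER PARTITION LIMIT 1;
```

Each listed peer is a separate partition that the coordinator has to fetch, so a long IN list is probably better split into several single-partition queries issued in parallel.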


2o7dmzc52#

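I can think of two options. The first is to create another table that acts as an index to the max record for each user_id and peer_id. Those two fields would make up the partition key, and the row would hold the rest of the data needed to look up that max record in the messages table. Because writes to this table are upserts, each new message simply overwrites the previous row, so the table always holds the latest message, which is always the max.

The second option is to store the last message there in its entirety, so you never have to go back to the messages table for the actual data. Use the same partition key as above and write the message itself into that row as well.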
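A minimal sketch of the second option follows. The table name latest_messages is hypothetical, and the logged batch is just one way to keep the two writes together; the answer itself does not prescribe either.

```cql
-- Hypothetical lookup table: exactly one row per (user_id, peer_id).
CREATE TABLE latest_messages (
    user_id int,
    peer_id int,
    send_on timestamp,
    message text,
    PRIMARY KEY ((user_id, peer_id))
);

-- On every send, write the full history row and overwrite the "latest" row.
BEGIN BATCH
    INSERT INTO messages (user_id, peer_id, send_on, message)
        VALUES (1, 2, '2019-04-13 19:21:07.929+0000', 'hello 6');
    INSERT INTO latest_messages (user_id, peer_id, send_on, message)
        VALUES (1, 2, '2019-04-13 19:21:07.929+0000', 'hello 6');
APPLY BATCH;

-- Reading the latest message for one peer touches a single row.
SELECT peer_id, send_on, message
  FROM latest_messages
 WHERE user_id = 1 AND peer_id = 2;
```

Note that the upsert keeps whichever write arrived last, not the one with the largest send_on, so this assumes messages are written in send order. To fetch the latest message for all of a user's peers in one read, a variant worth considering (again an assumption, not part of the answer) is PRIMARY KEY (user_id, peer_id), which keeps one row per peer inside a single user partition.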

